When William Shakespeare wrote “To be or not to be” for Prince Hamlet to speak as he contemplated embracing the universal truth, little did he know that he would be quoted in so many contexts for so many different effects. A cowardly soldier: “to flee or not to flee”. A conniving trader evaluating an unsuspecting customer: “to fleece or not to fleece”. The colonial masters strategizing their exit: “to free or not to free”. And, as you have guessed, a data scientist stumbling upon a couple of interesting variables: “to correlate or not to correlate”.
The originator of this phrase never faced this dilemma himself (except on stage), but he created a phrase consisting of 6 words (4 unique) or 13 letters (6 unique; we know where this is going), the unique letters being 2 vowels and 4 consonants. We can start counting (a fundamental element of data science and math) prepositions and conjunctions, and start evaluating tense, subject and object (which in this case are unsurprisingly abstract).
In summary, though “The Great Will” never faced this situation, a data-science-driven grammatical dissection of this simple phrase would help those previously uninitiated in either grammar or data science really start to understand the meaning of the aforementioned piece of priceless literature, in very much the same way that the reader of the current content is contemplating: to read or not to read.
But Will did also mention, “Though this be madness, yet there is method in it.”
Are All Variables to Be Considered?
So, what happens when a data scientist stumbles upon a few variables which seem to be correlated? Are all of them to be included in the model, or do we include just one? What if they are derived from each other, or derived indirectly via a hierarchy of variables? Would we end up giving more than due importance to the impact of a single variable? Hence the question, “To correlate or not to correlate”.
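Before deciding, it helps to know which pairs are actually in question. The following is a minimal sketch of one way to screen for them, using plain Python and the Pearson coefficient. The column names, the made-up numbers and the 0.8 threshold are all assumptions for illustration, not values from any real model.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged

# Made-up illustrative data, not real policy records.
columns = {
    "annual_premium": [1200, 2400, 3600, 4800, 6000],
    "sum_assured": [100000, 210000, 290000, 405000, 500000],
    "policy_term_years": [20, 10, 25, 15, 10],
}
flagged = correlated_pairs(columns)
for a, b, r in flagged:
    print(f"{a} vs {b}: r = {r:.3f}")
```

A flagged pair is only a prompt to investigate; whether to keep one variable, both, or a combination of the two is still a judgment call.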
I have the privilege of working with talented and experienced data scientists and I posed this question to them.
Pooja has been working on persistency models for insurance carriers for the last 12 quarters, and in her experience correlation between variables is always a symptom which needs to be investigated. She speaks like a true relationship consultant, doesn’t she? She quotes an example from an often-used variable domain: the financial attributes of a policy.
Annual premium (AP) and modal premium (MP) correlate. However, AP can be derived from MP by multiplying by the payment frequency, and vice versa. Hence, in practice, only AP is used: it is usable across a broad base of customers and does not build in a bias on frequency.
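The AP and MP relationship can be sketched in a few lines. The mode names and the payments-per-year mapping below are hypothetical, and the sketch ignores any loading an insurer might apply to non-annual modes.

```python
# Hypothetical mapping: number of premium payments per year for each mode.
PAYMENTS_PER_YEAR = {"annual": 1, "half_yearly": 2, "quarterly": 4, "monthly": 12}

def annual_premium(modal_premium, mode):
    """Annualise a modal premium: AP = MP x payments per year."""
    return modal_premium * PAYMENTS_PER_YEAR[mode]

# Made-up policies: the modal premiums differ, the annual commitment does not.
policies = [
    {"modal_premium": 1000, "mode": "monthly"},
    {"modal_premium": 3000, "mode": "quarterly"},
    {"modal_premium": 12000, "mode": "annual"},
]
annual_premiums = [annual_premium(p["modal_premium"], p["mode"]) for p in policies]
print(annual_premiums)
```

The three customers look very different on MP but identical on AP, which is the sense in which AP avoids a bias on frequency.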
Neeraja has been dealing with data for the past 5 years and has been building models for various practical business cases. She says, “I never guess. It is a shocking habit – destructive to the logical faculty.” But Sherlock said that, not Shakespeare (actually it was Sir Arthur Conan Doyle; in this piece of literature, for me, the boundary between fiction and reality is blurred to the point of being non-existent). So how did we end up with Sherlock? Oh yes, Neeraja is a Sherlock fan. One of her many key observations can be summarized as follows:
The sum assured of a policy is associated with the premium. One increases with the other: a classic case of correlation. But from an actuarial perspective, the premium is derived from the sum assured. Therefore, if both are used, the sum assured may end up with more than the desired impact. Neither can simply be dropped in favor of the other, though, since personal attributes of the policyholder influence how the premium varies with the sum assured within a product base. In most of the models, therefore, a feature derived from both is engineered instead of using the two separately.
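What that derived feature looks like depends on the model. As one hypothetical possibility (an assumption for illustration, not necessarily the feature Neeraja's models use), the two variables can be folded into a single premium rate per 1,000 of sum assured:

```python
def premium_per_thousand(annual_premium, sum_assured):
    """Hypothetical derived feature: annual premium per 1,000 of sum assured."""
    if sum_assured <= 0:
        raise ValueError("sum_assured must be positive")
    return annual_premium * 1000 / sum_assured

# Made-up policies on the same product.
policies = [
    {"annual_premium": 5000, "sum_assured": 500000},
    {"annual_premium": 9000, "sum_assured": 500000},
    {"annual_premium": 10000, "sum_assured": 1000000},
]
rates = [premium_per_thousand(p["annual_premium"], p["sum_assured"]) for p in policies]
print(rates)
```

The first and third policies differ in size but share a rate, while the second pays more for the same cover. That difference is the part driven by the policyholder's attributes, and a ratio keeps it without counting the sum assured twice.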
If there ever was a data science version of William Shakespeare, his words would be somewhat along the following lines:
- All the models are a canvas, and all the variables (derived or direct) are merely players; they have their filters and rejections; and one variable in its time plays many parts, its acts being seven stages (acquisition, cleaning, feature engineering, training, testing, production and retraining)
- There is nothing either correlating or not correlating but thinking makes it so
- Some are born correlated, some achieve correlation, and some have correlation thrust on them
- We know what the variables are, but know not what the variables may be
Some correlations seem spurious and illogical in the beginning, but further analysis can reveal useful patterns and insights, as with the famous diaper and beer sales correlation. It may seem like an opportunity to quote “The Great Shake” here by saying “All that glisters is not gold” (yup, glisters and not glitters). But that would put him in an awkward position, because on further investigation it was found that new dads buying beer when they came shopping for diapers were the reason behind this true case of correlation. Investigating seemingly unrelated but repeatedly co-occurring events might end up revealing behavioral patterns which can be the basis for valid assumptions in the model.
We found another interesting correlation: customers who had been in lapse status in the recent past were less likely to have a death claim on their new policy in the near future. While this may seem like an illogical correlation, some thought does provide a justification. If a customer has reason to believe there will be a claim soon (perhaps due to illness or even an intent to defraud), he will keep his existing policy in force and active, even while he applies for additional policies. Therefore, a correct way of getting William to participate here would be “Our doubts are traitors and make us lose the good we oft might win by fearing to attempt.”
Our motto is, “When you see a correlation, investigate, and if you have a doubt about a correlation, investigate more.”
One thing is clear, as opined by a few of my colleagues: correlation is an important measure. Also, when we say “To correlate or not to correlate”, it is not an existential crisis but a preferential one. Having attempted to answer this query of Data Science Willy, we shall next answer, “What is in an algorithm?”