
To Correlate or Not to Correlate: The Question When Finding Multiple Variables

When William Shakespeare wrote “To be or not to be” for Prince Hamlet to speak and express his contemplation of the universal truth, little did he know that he would be quoted in many different contexts for many different effects. A cowardly soldier: “to flee or not to flee”; a conniving trader sizing up an unsuspecting customer: “to fleece or not to fleece”; colonial masters strategizing their exit: “to free or not to free”. And, as you have guessed, a data scientist stumbling upon a couple of interesting variables: “to correlate or not to correlate”.

The originator of this phrase himself never faced this dilemma in his life (except on stage), but he created a phrase consisting of 6 words (4 unique), or 13 letters (6 unique; we know where this is going), or 2 vowels and 4 consonants. We can start counting (a fundamental element of data science and math) prepositions and conjunctions, and start evaluating tense, subject and object (which in this case are unsurprisingly abstract).

The summary of the previous three sentences: though “The Great Will” never faced this situation, a data-science-driven grammatical dissection of this simple phrase would help those previously uninitiated in either grammar or data science start to understand the meaning of the aforementioned piece of priceless literature; in much the same way that the reader of the current content is contemplating: to read or not to read.

But Will also did mention, “Though this be madness, yet there is method in it.”

Are all Variables to be Considered?

So, what happens when a data scientist stumbles upon a few variables which seem to be correlated? Are all the variables to be included in the model, or do we include just one of them? What if they are derived from each other, or were derived indirectly via a hierarchy of variables? Would we end up giving more than due importance to the impact of a single variable? Hence the question, “To correlate or not to correlate”.
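As a starting point for that investigation, a minimal sketch of the idea (the column names and data here are entirely made up for illustration) is to scan the pairwise correlation matrix and flag pairs that exceed a threshold, so each flagged pair can be examined for a derivation relationship:

```python
# Hypothetical illustration: flag highly correlated variable pairs so they
# can be investigated before modeling. Column names and data are made up.
import numpy as np
import pandas as pd

def flag_correlated_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """Return variable pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(corr.iloc[i, j], 3)))
    return pairs

rng = np.random.default_rng(0)
premium = rng.uniform(100, 1000, size=500)
df = pd.DataFrame({
    "annual_premium": premium,
    "sum_assured": premium * 20 + rng.normal(0, 50, size=500),  # strongly related
    "age": rng.integers(20, 70, size=500),                      # unrelated
})
print(flag_correlated_pairs(df))
```

The flag is only a symptom detector; whether to drop, keep, or combine a flagged pair is the judgment call the rest of this post is about.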

I have the privilege of working with talented and experienced data scientists, and I posed this question to them.

Pooja has been working on persistency models for insurance carriers for the last 12 quarters, and in her experience correlation between variables is always a symptom which needs to be investigated. She speaks like a true relationship consultant, doesn’t she? She quotes an example from an often-used variable domain: the financial attributes of a policy.

Annual premium (AP) and modal premium (MP) correlate. However, AP can be derived from MP by multiplying by the payment frequency, and vice versa. Hence, in practice, AP alone is used: it is more usable across a broad base of customers and avoids building a bias on frequency.

Neeraja has been dealing with data for the past 5 years and has been building models for various practical business cases. She says, “I never guess. It is a shocking habit – destructive to the logical faculty.” But Sherlock said that (actually it was Sir Arthur Conan Doyle; in this piece of literature, for me, the boundary between fiction and reality is blurred to the point of being non-existent), not Shakespeare. So how did we end up with Sherlock? Oh yes, Neeraja is a Sherlock fan. One of her many key observations can be summarized as follows:

Sum assured of a policy is associated with the premium. One increases with the other; a classic case of correlation. But from an actuarial perspective, premium is derived from sum assured. Therefore, if both are used, the model may give sum assured a larger-than-desired impact. Yet neither can simply be dropped in favor of the other, since personal attributes of the policyholder influence the variation between sum assured and premium within a product base. Therefore, in most of the models, a feature derived from both is engineered instead of using the two separately.
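One common way to realize that idea (a minimal sketch; the column names and the specific premium-to-sum-assured rate feature here are illustrative assumptions, not the exact feature Neeraja's models use) is to replace the mutually derived pair with a single ratio that captures the policyholder-driven variation between them:

```python
# Minimal sketch with made-up column names: replace the correlated pair
# (annual_premium, sum_assured) with one derived rate feature.
import pandas as pd

def add_premium_rate(policies: pd.DataFrame) -> pd.DataFrame:
    """Add a derived premium-to-sum-assured rate and drop the raw pair."""
    out = policies.copy()
    out["premium_rate"] = out["annual_premium"] / out["sum_assured"]
    return out.drop(columns=["annual_premium", "sum_assured"])

policies = pd.DataFrame({
    "annual_premium": [1200.0, 800.0, 1500.0],
    "sum_assured": [100000.0, 50000.0, 250000.0],
    "age": [34, 52, 29],
})
features = add_premium_rate(policies)
print(features.columns.tolist())
```

The engineered rate keeps the signal that distinguishes policyholders with the same sum assured but different premiums, without letting sum assured enter the model twice.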

If there ever were a data science version of William Shakespeare, his words would run somewhat along the following lines:

  • All the models are a canvas, and all the variables (derived or direct) are merely players; they have their filters and rejections; and one variable in its time plays many parts, its acts being seven stages (acquisition, cleaning, feature engineering, training, testing, production and retrain)
  • There is nothing either correlating or not correlating but thinking makes it so
  • Some are born correlated, some achieve correlation, and some have correlation thrust on them
  • We know what the variables are, but know not what the variables may be

Some correlations seem spurious and illogical at first, but further analysis can reveal useful patterns and insights, as in the famous diaper-and-beer sales correlation. It may seem like an opportunity to quote “The Great Shake” here: “All that glisters is not gold” (yup, glisters and not glitters). But one would be putting him in an awkward position, because on further investigation it was found that new dads buying beer when they came shopping for diapers was the reason behind this true case of correlation. Investigating seemingly unrelated but repeatedly co-occurring events might reveal behavioral patterns which can be the basis for valid assumptions in the model.

We found another interesting correlation: customers who had been in lapse status in the recent past were less likely to have a death claim on their new policy in the near future. While this may seem like an illogical correlation, some thought does provide a justification. If a customer has reason to believe there will be a claim soon (perhaps due to illness, or even an intent to defraud), he will keep his existing policy in-force and active even while he applies for additional policies. Therefore, a fitting way of getting William to participate here would be: “Our doubts are traitors and make us lose the good we oft might win by fearing to attempt.”

Our motto here is, “When you see a correlation, investigate; and if you have a doubt about a correlation, investigate more.”


One thing is clear: correlation is an important measure, as opined by a few of my colleagues. Also, it is not an existential crisis but a preferential one when we ask, “To correlate or not to correlate.” Having attempted to answer this query of Data Science Willy, we shall next answer, “What is in an algorithm?”


Nitin Purohit
Nitin is CTO and co-founder at Aureus. With over 15 years of experience in leveraging technology to drive and achieve top-line and bottom-line numbers, Nitin has helped global organizations optimize value from their significant IT investments. Over the years, Nitin has been responsible for the creation of many product IPs. Prior to this role at Aureus, Nitin was the Global Practice Head for Application Services at Omnitech Infosolutions Ltd and was responsible for sales and profitability of offerings from application services across geographies.
