To Correlate or Not to Correlate: The Question When Finding Multiple Variables

When William Shakespeare wrote “To be or not to be” for Prince Hamlet to voice his contemplation of embracing the universal truth, little did he know that he would be quoted in many different contexts and for many different effects. A cowardly soldier: “to flee or not to flee”. A conniving trader sizing up an unsuspecting customer: “to fleece or not to fleece”. Colonial masters strategizing their exit: “to free or not to free”. And, as you have guessed, a data scientist stumbling upon a couple of interesting variables: “to correlate or not to correlate”.

The originator of this phrase never faced this dilemma himself (except on stage), but he created a phrase of 6 words (4 unique) or 13 letters (6 unique; we know where this is going), made up of 2 distinct vowels and 4 distinct consonants. We could go on counting (a fundamental element of data science and math) prepositions and conjunctions, and start evaluating tense, subject and object (which in this case are unsurprisingly abstract).

In short, “The Great Will” never faced this situation, but a data-science-driven grammatical dissection of this simple phrase would help those previously uninitiated in either grammar or data science really start to understand this priceless piece of literature, in much the same way that the reader of this post is now contemplating: to read or not to read.

But Will did also mention, “Though this be madness, yet there is method in it.”

Are all Variables to be Considered?

So, what happens when a data scientist stumbles upon a few variables which seem to be correlated? Are all the variables to be included in the model, or do we include just one of them? What if they are derived from each other, or derived indirectly via a hierarchy of variables? Would we end up giving more than due importance to the impact of a single variable? Hence the question, “To correlate or not to correlate”.
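Before deciding what to keep, it helps to know which variables move together in the first place. The following is a minimal sketch of such a screening step, not a description of any production pipeline: the column names, the sample values and the 0.9 threshold are all assumptions made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every pair of columns with |r| >= threshold."""
    flagged = []
    for (name_a, xs), (name_b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Made-up illustrative values, not real policy data.
columns = {
    "annual_premium": [1200.0, 2400.0, 600.0, 3600.0, 1800.0],
    "modal_premium": [100.0, 200.0, 50.0, 300.0, 150.0],
    "policy_age_years": [3.0, 1.0, 4.0, 2.0, 5.0],
}

for name_a, name_b, r in correlated_pairs(columns):
    print(f"{name_a} ~ {name_b}: r = {r:.2f}")
```

With this sample data only the annual and modal premium pair is flagged. A flag is a prompt to investigate how the two variables are related, not an instruction to drop one of them.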

I have the privilege of working with talented and experienced data scientists and I posed this question to them.

Pooja has been working on persistency models for insurance carriers for the last 12 quarters, and in her experience correlation between variables is always a symptom which needs to be investigated. She speaks like a true relationship consultant, doesn’t she? She quotes an example from an often-used variable domain: the financial attributes of a policy.

Annual premium (AP) and modal premium (MP) correlate. That is no surprise: AP can be derived from MP by multiplying it by the payment frequency, and vice versa. Hence, in practice, only AP is used, because it is usable across a broad base of customers and does not build a bias on frequency into the model.
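The relationship between the two premiums can be sketched in a few lines. This is a simplification under stated assumptions: the mode names and the `PAYMENTS_PER_YEAR` mapping are hypothetical, and the sketch ignores any extra charge a carrier may apply to non-annual payment modes.

```python
# Hypothetical mapping from payment mode to number of payments per year.
PAYMENTS_PER_YEAR = {"annual": 1, "half_yearly": 2, "quarterly": 4, "monthly": 12}


def annual_premium(modal_premium_amount, mode):
    """Derive the annual premium (AP) from the modal premium (MP)."""
    return modal_premium_amount * PAYMENTS_PER_YEAR[mode]


def modal_premium(annual_premium_amount, mode):
    """Derive the modal premium (MP) from the annual premium (AP)."""
    return annual_premium_amount / PAYMENTS_PER_YEAR[mode]


# Two made-up policies that look different by MP but are identical by AP.
print(annual_premium(100.0, "monthly"))    # 1200.0
print(annual_premium(300.0, "quarterly"))  # 1200.0
```

The two policies pay the same amount per year, which MP alone would hide. That is the sense in which AP avoids a bias on frequency.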

Neeraja has been dealing with data for the past 5 years and has been building models for various practical business cases. She says, “I never guess. It is a shocking habit – destructive to the logical faculty.” But Sherlock said that (actually it was Sir Arthur Conan Doyle; in this piece of literature, for me, the boundary between fiction and reality is blurred to the point of being non-existent), not Shakespeare. So how did we end up with Sherlock? Oh yes, Neeraja is a Sherlock fan. One of her many key observations can be summarized as follows:

The sum assured of a policy is associated with the premium. One increases with the other: a classic case of correlation. But from an actuarial perspective, the premium is derived from the sum assured. Therefore, if both are used, the sum assured may have a larger impact on the model than desired. Yet neither can be dropped in favor of the other, since the personal attributes of the policyholder influence how the premium varies for a given sum assured within a product base. In most of the models, therefore, a feature derived from the two is engineered instead of using both separately.
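One way to picture such a derived feature is a simple rate: premium per 1,000 of sum assured. The choice of this particular ratio is an assumption made for illustration, as are the function name and the sample policies; the feature actually engineered in a given model may well be different.

```python
def premium_per_thousand(annual_premium, sum_assured):
    """Annual premium charged per 1,000 units of sum assured (illustrative feature)."""
    if sum_assured <= 0:
        raise ValueError("sum_assured must be positive")
    return annual_premium * 1000.0 / sum_assured


# Made-up policies with the same sum assured but different premiums,
# for example because the policyholders differ in age or health.
policies = [
    {"policy_id": "P1", "annual_premium": 1200.0, "sum_assured": 100000.0},
    {"policy_id": "P2", "annual_premium": 1800.0, "sum_assured": 100000.0},
]

for policy in policies:
    policy["premium_per_thousand"] = premium_per_thousand(
        policy["annual_premium"], policy["sum_assured"]
    )
    print(policy["policy_id"], policy["premium_per_thousand"])
```

A single feature of this kind keeps the information about how the premium varies for a given sum assured, without letting the sum assured enter the model twice.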

If there ever was a data science version of William Shakespeare, his words would run somewhat along the following lines:

  • All the models are a canvas, and all the variables (derived or direct) are merely players; they have their filters and rejections; and one variable in its time plays many parts, its acts being seven stages (acquisition, cleaning, feature engineering, training, testing, production and retraining)
  • There is nothing either correlating or not correlating but thinking makes it so
  • Some are born correlated, some achieve correlation, and some have correlation thrust on them
  • We know what the variables are, but know not what the variables may be

Some correlations seem spurious and illogical in the beginning, but further analysis can reveal useful patterns and insights. Take the famous correlation between diaper and beer sales. It may seem like an opportunity to quote “The Great Shake” here by saying “All that glisters is not gold” (yup, glisters and not glitters). But that would put him in an awkward position because, as the oft-told story goes, further investigation found that new dads buying beer when they came shopping for diapers were the reason behind this correlation. Investigating seemingly unrelated events that repeatedly occur together might end up revealing behavioral patterns which can be the basis for valid assumptions in the model.

We found another interesting correlation: customers who had been in lapse status in the recent past were less likely to have a death claim on their new policy in the near future. While this may seem like an illogical correlation, some thought does provide a justification. If a customer has reason to believe there will be a claim soon (perhaps due to illness or even an intent to defraud), he will keep his existing policy in force and active, even while he applies for additional policies. Therefore, a correct way of getting William to participate here would be, “Our doubts are traitors and make us lose the good we oft might win by fearing to attempt.”

Our motto is, “When you see a correlation, investigate; and if you have a doubt about a correlation, investigate more.”


One thing is clear: correlation is an important measure, as a few of my colleagues have opined. Also, it is not an existential crisis but a preferential one when we say, “To correlate or not to correlate.” Having attempted to answer this query of Data Science Willy, we shall next answer, “What is in an algorithm?”

Nitin Purohit
Nitin is CTO and co-founder at Aureus. With over 15 years of experience in leveraging technology to drive and achieve top-line and bottom-line numbers, Nitin has helped global organizations optimize value from their significant IT investments. Over the years, Nitin has been responsible for the creation of many product IPs. Prior to this role at Aureus, Nitin was the Global Practice Head for Application Services at Omnitech Infosolutions Ltd and was responsible for sales and profitability of offerings from application services across geographies.
