To Correlate or Not to Correlate: The Question When Finding Multiple Variables

When William Shakespeare wrote “To be or not to be” for Prince Hamlet to voice his contemplation of embracing the universal truth, little did he know that he would be quoted in many different contexts for many different effects. A cowardly soldier might say, “to flee or not to flee”; a conniving trader sizing up an unsuspecting customer, “to fleece or not to fleece”; colonial masters strategizing their exit, “to free or not to free”. And, as you have guessed, a data scientist stumbling upon a couple of interesting variables: “to correlate or not to correlate”.

The originator of this phrase never faced this dilemma himself (except on stage), but he created a phrase consisting of 6 words (4 unique) or 13 letters (6 unique, namely 2 vowels and 4 consonants; we know where this is going). We could go on counting (a fundamental element of data science and math) prepositions and conjunctions, and start evaluating tense, subject and object (which in this case are unsurprisingly abstract).

The upshot is that although “The Great Will” never faced this situation, a data-science-driven grammatical dissection of this simple phrase would help those uninitiated in either grammar or data science start to understand the meaning of the aforementioned piece of priceless literature, in much the same way that the reader of the current piece is contemplating: to read or not to read.

But Will did also write, “Though this be madness, yet there is method in't.”

Are All Variables to Be Considered?

So, what happens when a data scientist stumbles upon a few variables that seem to be correlated? Are all of them to be included in the model, or do we include just one? What if they are derived from each other, or derived indirectly via a hierarchy of variables? Would we end up giving more than due importance to the impact of a single variable? Hence the question, “To correlate or not to correlate”.

I have the privilege of working with talented and experienced data scientists and I posed this question to them.

Pooja has been working on persistency models for insurance carriers for the last 12 quarters, and in her experience a correlation between variables is always a symptom that needs to be investigated. She speaks like a true relationship consultant, doesn’t she? She quotes an example from an often-used variable domain: the financial attributes of a policy.

Annual premium (AP) and modal premium (MP) correlate. However, AP can be derived from MP by multiplying by the payment frequency, and MP from AP by dividing by it. Hence, in practice AP is the one used, since it is comparable across a broad base of customers and does not build a bias on frequency into the model.
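
Pooja's point can be sketched in a few lines of Python. This is only an illustration under assumptions: the policies are made up, "frequency" is taken to mean premium payments per year, and the `pearson` helper is written out by hand rather than taken from any particular library.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up policies: (modal premium, premium payments per year).
policies = [(100, 12), (150, 12), (90, 12), (200, 12), (500, 4), (2400, 1)]

# AP is fully determined by MP and frequency, so it adds no new information.
annual_premium = [mp * freq for mp, freq in policies]

# Among policies sharing one frequency, AP is a constant multiple of MP,
# so the two move in lockstep.
monthly_mp = [mp for mp, freq in policies if freq == 12]
monthly_ap = [mp * freq for mp, freq in policies if freq == 12]
r_monthly = pearson(monthly_mp, monthly_ap)

print(annual_premium)
print(round(r_monthly, 6))
```

Feeding both columns to a model would therefore present the same signal twice, with payment frequency mixed in as a hidden third ingredient.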

Neeraja has been dealing with data for the past 5 years and has been building models for various practical business cases. She says, “I never guess. It is a shocking habit – destructive to the logical faculty.” But Sherlock said that (actually it was Sir Arthur Conan Doyle; in this piece of literature, for me, the boundary between fiction and reality is blurred to the point of being non-existent), not Shakespeare. So how did we end up with Sherlock? Oh yes, Neeraja is a Sherlock fan. One of her many key observations can be summarized as follows:

The sum assured of a policy is associated with its premium. One increases with the other: a classic case of correlation. But from an actuarial perspective, the premium is derived from the sum assured. Therefore, using both may give the sum assured more impact than desired. Yet neither can simply be dropped in favor of the other, since personal attributes of the policyholder influence how the premium varies with the sum assured within a product. So in most models a feature derived from both is engineered instead of using the two separately.
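
The post does not say which derived feature is engineered. As one hypothetical choice, the sketch below uses the premium paid per 1,000 of sum assured; the field names and figures are made up for illustration.

```python
# Made-up policies; the field names are illustrative, not from a real schema.
policies = [
    {"policy_id": "P1", "sum_assured": 500_000, "annual_premium": 6_000},
    {"policy_id": "P2", "sum_assured": 1_000_000, "annual_premium": 15_000},
    {"policy_id": "P3", "sum_assured": 300_000, "annual_premium": 2_700},
]

def premium_per_thousand(policy):
    """Annual premium per 1,000 of sum assured (an assumed derived feature)."""
    return policy["annual_premium"] * 1000 / policy["sum_assured"]

# One engineered feature stands in for the two correlated raw inputs.
features = {p["policy_id"]: premium_per_thousand(p) for p in policies}
print(features)
```

A ratio of this kind keeps the part of the premium that the sum assured does not already explain, which is where the policyholder's personal attributes show up.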

If there ever were a data science version of William Shakespeare, his words would run somewhat along the following lines:

  • All the models are a canvas, and all the variables (derived or direct) are merely players; they have their filters and rejections; and one variable in its time plays many parts, its acts being seven stages (acquisition, cleaning, feature engineering, training, testing, production and retraining)
  • There is nothing either correlating or not correlating but thinking makes it so
  • Some are born correlated, some achieve correlation, and some have correlation thrust on them
  • We know what the variables are, but know not what the variables may be

Some correlations seem spurious and illogical at first, but further analysis can reveal useful patterns and insights. Take the famous correlation between diaper and beer sales. It may seem like an opportunity to quote “The Great Shake” here by saying “All that glisters is not gold” (yup, glisters and not glitters). But that would put him in an awkward position because, as the story goes, further investigation found that new dads buying beer when they came shopping for diapers were the reason behind this correlation. Investigating seemingly unrelated events that repeatedly co-occur might end up revealing behavioral patterns that can be the basis for valid assumptions in the model.

We found another interesting correlation: customers who had been in lapse status in the recent past were less likely to have a death claim on their new policy in the near future. While this may seem illogical, some thought does provide a justification. If a customer has reason to believe there will be a claim soon (perhaps due to illness or even an intent to defraud), he will keep his existing policy in force and active, even while he applies for additional policies. A customer who let a policy lapse, by contrast, probably had no such expectation. Therefore, a more fitting way of getting William to participate here would be “Our doubts are traitors and make us lose the good we oft might win by fearing to attempt.”

Our motto on this is: “When you see a correlation, investigate; and if you have a doubt about a correlation, investigate more.”
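
As a rough first pass at this motto, one could list every pair of variables whose correlation crosses a threshold and hand that list to a human for investigation. The sketch below assumes made-up columns and an arbitrary cutoff of 0.8; neither comes from our models, and a flagged pair is a prompt to look closer, not a verdict.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def flag_correlated_pairs(columns, threshold=0.8):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, r))
    return flagged

# Made-up data for five policies.
columns = {
    "sum_assured": [100, 200, 300, 400, 500],     # in thousands
    "annual_premium": [2.1, 3.9, 6.2, 8.0, 9.8],  # in thousands
    "policy_term": [10, 25, 15, 30, 20],          # in years
}

flagged = flag_correlated_pairs(columns)
for a, b, r in flagged:
    print(f"investigate {a} vs {b}: r = {r:.3f}")
```

Whether a flagged pair turns out to be a derived variable, a genuine behavioral pattern or a coincidence is exactly the investigation the motto asks for.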


One thing is clear: correlation is an important measure, as a few of my colleagues have opined. Also, “To correlate or not to correlate” is not an existential crisis but a preferential one. Having attempted to answer this query of Data Science Willy, we shall next answer, “What is in an algorithm?”

Nitin Purohit
Nitin is CTO and co-founder at Aureus. With over 15 years of experience in leveraging technology to drive and achieve top-line and bottom-line numbers, Nitin has helped global organizations optimize value from their significant IT investments. Over the years, Nitin has been responsible for the creation of many product IPs. Prior to this role at Aureus, Nitin was the Global Practice Head for Application Services at Omnitech Infosolutions Ltd and was responsible for sales and profitability of offerings from application services across geographies.
