4 Stages of Predictive Modeling and How Business Aspects Influence Them

The first step in predictive modeling is defining the problem. Once that is done and the relevant historical data has been identified, the analytics team can begin the actual work of model development.

In this blog, we touch on the business factors that influence model development. If you find this interesting and want a deeper dive, you’ll have the opportunity to download our whitepaper that goes into more detail on this topic.

Business Factors That Influence the 4 Stages of Model Development

#1 Choice of learning period

Predicting the future based on the past: A predictive model uses historical data to make predictions about future outcomes. It learns the patterns associated with past outcomes and then applies those patterns to new data. The method varies, but the choice of learning set, that is, the historical time period from which the data is drawn, is key. Most of the data collected will be used as the learning (training) set, but some should always be set aside as the testing set.
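
One way to set aside a testing set is to hold out the most recent records, so the model is tested on data that comes after everything it learned from. The sketch below assumes that approach. The field name `issue_date`, the function name, and the 20% default are illustrative assumptions, not something this article prescribes.

```python
from datetime import date


def split_learning_set(records, test_fraction=0.2):
    """Hold out the most recent share of records as the testing set."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    # Sort oldest to newest so the holdout is the latest slice of history.
    ordered = sorted(records, key=lambda r: r["issue_date"])
    n_test = max(1, round(len(ordered) * test_fraction))
    return ordered[:-n_test], ordered[-n_test:]


# Hypothetical policies, one issued in each of the first ten months of 2021.
policies = [{"policy_id": i, "issue_date": date(2021, i, 1)} for i in range(1, 11)]
train, test = split_learning_set(policies)
```

A random split is an equally valid choice when the data has no strong time trend.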

More data is often better: In general, the more historical data you have, the better. Some data is time-sensitive, though. For example, if you want to predict whether a policy will generate a claim within its first year, the learning set should include only policies that have completed at least one year since issuance, so that the outcome is fully observed.
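
The first-year claim example can be sketched as a filter followed by a labeling step. This is a minimal illustration under assumed names: the fields `issue_date` and `first_claim_date`, the 365-day window, and both function names are hypothetical.

```python
from datetime import date, timedelta


def eligible_policies(policies, as_of, window_days=365):
    """Keep only policies whose first-year window has fully elapsed."""
    cutoff = as_of - timedelta(days=window_days)
    return [p for p in policies if p["issue_date"] <= cutoff]


def first_year_claim_label(policy, window_days=365):
    """Return 1 if the first claim fell within the window after issuance, else 0."""
    claim_date = policy.get("first_claim_date")
    if claim_date is None:
        return 0
    return int(0 <= (claim_date - policy["issue_date"]).days <= window_days)


# Hypothetical policies; C was issued too recently to have a complete first year.
policies = [
    {"policy_id": "A", "issue_date": date(2021, 3, 1), "first_claim_date": date(2021, 9, 15)},
    {"policy_id": "B", "issue_date": date(2021, 5, 1), "first_claim_date": None},
    {"policy_id": "C", "issue_date": date(2022, 11, 1), "first_claim_date": date(2023, 1, 10)},
]
learning_set = eligible_policies(policies, as_of=date(2023, 6, 1))
labels = {p["policy_id"]: first_year_claim_label(p) for p in learning_set}
```

Policy C is dropped even though it already has a claim. Keeping only the young policies that have claimed, while excluding the young ones that have not claimed yet, would overstate the claim rate.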

Sometimes less data is more: You do not always want all the data. If the data is not stable because business processes have changed, you may want only recent data. This keeps the learning set relevant to current conditions, which tends to produce more accurate predictions.

Capturing the full picture: Some businesses have periods of fluctuation. For example, a higher volume of claims can be expected during peak season. Try to include more than one year in the learning period so that the model sees the whole picture.

#2 Choice of independent variables

Input determines output: This is where feature engineering becomes an important step in the predictive model building process. The data is cleaned to reduce noise, and variables are identified/created that have a relationship with the event being predicted.

Domain knowledge: Understanding the business is important for the analytics team to know what data to use while building the model. For example, in a fraudulent claims model, the risk control unit can provide insights about commonly observed patterns in fraudulent cases. Variables can then be created based on these insights and used in the model.

Data availability at the point of prediction: A model can only use data that exists at the moment the prediction is made. If an auto policy is up for renewal, the underwriter will use the predictive model at the time of underwriting the renewal. If a claim is submitted after the renewal is processed, the data related to that claim would not have been available to the underwriter and therefore cannot be used in the model.
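
The renewal example amounts to building each variable only from records dated on or before the prediction date. A minimal sketch, assuming hypothetical field names (`policy_id`, `reported_date`) and a hypothetical prior-claim-count variable:

```python
from datetime import date


def prior_claim_count(claims, policy_id, prediction_date):
    """Count claims on a policy that were already reported by prediction_date."""
    return sum(
        1
        for c in claims
        if c["policy_id"] == policy_id and c["reported_date"] <= prediction_date
    )


# Hypothetical claims; the second P1 claim arrives after the renewal decision.
claims = [
    {"policy_id": "P1", "reported_date": date(2022, 4, 10)},
    {"policy_id": "P1", "reported_date": date(2023, 2, 20)},
    {"policy_id": "P2", "reported_date": date(2022, 8, 5)},
]
renewal_date = date(2023, 1, 15)
feature = prior_claim_count(claims, "P1", renewal_date)
```

Applying the same cutoff when building the learning set keeps the training variables consistent with what the underwriter will actually see.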

Intended use of the model: How does the business intend to use the model? This matters and will drive how the model is built. If it’s meant for the business to review and interpret, then complex variables should be avoided. If it’s meant for prediction (like the potential of a fraudulent claim), then using complex variables will not be an issue.

Variable significance: The list of potential predictors is based on data availability, creativity and domain knowledge. However, the predictors that are retained in the finalized model will be those that have a significant relationship with the event to be predicted.

#3 Choice of algorithm

Nature of the event being predicted: At a high level, there are two kinds of models:

  • Regression models: used for continuous target variables, such as time to claim or claim amount
  • Classification models: used for two-class (binary) predictions, such as whether a claim will occur, or multi-class predictions, such as whether a premium payment will be made before the due date, in the grace period, or after lapse

Some statistical algorithms are specific to one type of prediction. The choice of algorithm is determined by the business use case and by trying alternative models and comparing their results on the holdout sample.

Intended use of the model: Understanding the reasons behind a prediction is as important for model adoption as having an accurate prediction. Understanding the workings of the model establishes trust that the model has learned the correct patterns, and that there are no legal or ethical violations.

When considering the intended business use of the model, it’s important to understand the difference between white-box and black-box models. White-box models are more transparent, and it is easier to understand how they work. They typically use algorithms such as linear or logistic regression and decision trees. Black-box models are much more complex and harder to explain. They use algorithms such as deep learning, boosting and random forests.

Model performance: After accounting for the business use case, the litmus test for choosing between alternative algorithms is a comparison of predictive power on unseen data, such as the testing set held out from the historical data, as mentioned earlier.
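
As a rough illustration of this comparison, the sketch below fits two candidate regression models on a training set and keeps the one with the lower error on the holdout. The two candidates (a mean baseline and a one-variable least-squares line), the choice of RMSE, the toy data, and all function names are assumptions made for the example.

```python
import math
from statistics import mean


def fit_mean(xs, ys):
    """Baseline: always predict the average outcome seen in training."""
    avg = mean(ys)
    return lambda x: avg


def fit_line(xs, ys):
    """Ordinary least squares with a single predictor (xs must not all be equal)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x


def holdout_rmse(model, xs, ys):
    """Root mean squared error of a fitted model on unseen data."""
    return math.sqrt(mean([(y - model(x)) ** 2 for x, y in zip(xs, ys)]))


def pick_best(candidates, train, holdout):
    """Fit every candidate on train and return the name with the lowest holdout RMSE."""
    scores = {name: holdout_rmse(fit(*train), *holdout) for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores


# Toy data: (predictor values, outcomes) for training and for the holdout.
train = ([1, 2, 3, 4], [2, 4, 6, 8])
holdout = ([5, 6], [10, 12])
best, scores = pick_best({"mean_baseline": fit_mean, "line": fit_line}, train, holdout)
```

The key point is that every candidate is scored on the same records, none of which were used for fitting.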

#4 Model consumption

Model output: For a regression problem, the model output is a direct numeric prediction of the outcome. For a classification problem, the model output is a numeric score for each class that indicates the probability that the outcome belongs to that class. For any record, the predicted class is the class with the highest model score.
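
For a classification model, turning scores into a predicted class can be sketched as follows. The class labels and score values are made up for illustration, and the function name is hypothetical.

```python
def predicted_class(scores):
    """Return the class label with the highest model score."""
    if not scores:
        raise ValueError("scores must not be empty")
    return max(scores, key=scores.get)


# Hypothetical scores for one premium payment record.
scores = {"before_due_date": 0.62, "grace_period": 0.27, "after_lapse": 0.11}
label = predicted_class(scores)
```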

Model evaluation: Regression models are evaluated by quantifying how far the predicted numeric values deviate from the actual values. Commonly used metrics are mean absolute error, mean squared error and root mean squared error. Classification models are evaluated based on the extent to which records are correctly classified, as well as on the ability to rank-order records so that those with a higher probability of belonging to a particular class receive higher scores.
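
These metrics are simple enough to write out directly. The sketch below uses only the standard library. The function names are assumptions, and the rank-ordering measure shown (the share of positive/negative pairs in which the positive record gets the higher score, with ties counted as half) is one common choice rather than the only one.

```python
import math


def regression_metrics(actual, predicted):
    """Mean absolute error, mean squared error and root mean squared error."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse)}


def accuracy(actual, predicted):
    """Share of records whose predicted class matches the actual class."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)


def rank_order_auc(actual, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    if not pos or not neg:
        raise ValueError("need at least one record of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical claim amounts and binary claim outcomes.
reg = regression_metrics([100, 200, 300], [110, 190, 330])
acc = accuracy([1, 0, 1, 0], [1, 0, 0, 0])
auc = rank_order_auc([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.6])
```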

Model implementation: After a predictive model has been constructed, evaluated and finalized, the next step is implementing it for use by the business. The mode of implementation depends on how frequently the model needs to score new records: in real time or in batch mode.

Feedback on model performance: Once the model is deployed, continuous monitoring is needed to evaluate whether model performance meets expectations. The work isn’t complete here, though, as re-training will be necessary in the future to keep in line with business needs.
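
A monitoring rule can be as simple as comparing a recent error metric with the value measured when the model was finalized. The sketch below is one hypothetical rule. The 10% tolerance and the function name are assumptions, and a real threshold would be agreed with the business.

```python
def needs_retraining(baseline_error, recent_error, tolerance=0.10):
    """Flag the model when recent error exceeds the baseline by more than tolerance."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    relative_increase = (recent_error - baseline_error) / baseline_error
    return relative_increase > tolerance


# Hypothetical RMSE at finalization versus RMSE on the latest scored records.
flag = needs_retraining(baseline_error=20.0, recent_error=25.0)
```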


The feedback loop between business stakeholders and the analytics team developing the predictive model is necessary at every stage of model development. An optimal model will combine business knowledge and implementation needs with technical data science expertise.


Pooja Thayyil
Pooja is part of the data sciences team and is involved in predictive modeling and research on machine learning algorithms and metrics. She graduated with distinction in Economics & Statistics from St. Xavier’s College and holds a Master’s degree in Economics from the University of Mumbai.
