The first step in predictive modeling is defining the problem. Once that is done and the relevant historical data has been identified, the analytics team can begin the actual work of model development.
In this blog, we touch on the business factors that influence model development. If you find this interesting and want a deeper dive, you’ll have the opportunity to download our whitepaper that goes into more detail on this topic.
Business Factors That Influence the 4 Stages of Model Development
#1 Choice of learning period
Predicting the future based on the past: A predictive model uses historical data to make predictions about future outcomes. It does this by learning the patterns associated with past outcomes and then applying those patterns to new data. The method used to accomplish this varies, but the learning set, that is, the historical time period from which the data is drawn, is key. Most of the data collected will be used as the learning/training set, but some should always be set aside as the testing set.
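To make the learning/testing split concrete, here is a minimal sketch in Python. The function name, the 20% test fraction, the fixed seed and the policy IDs are all illustrative assumptions rather than recommendations; a time-based split may be more appropriate for some business problems.

```python
import random

def split_learning_and_testing(records, test_fraction=0.2, seed=42):
    """Shuffle a copy of the records and hold out a fraction for testing.

    test_fraction and seed are illustrative defaults, not recommendations.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)                    # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)       # seeded so the split is repeatable
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (learning set, testing set)

# Hypothetical policy IDs standing in for historical records.
policies = [f"POL-{i:03d}" for i in range(1, 11)]
learning_set, testing_set = split_learning_and_testing(policies)
```

The key property is that no record appears in both sets, so the testing set remains genuinely unseen by the model.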
More data is often better: In general, the more historical data you have, the better. Some data is time-sensitive, though. For example, if you want to predict whether a policy will have an insurance claim within its first year, then the learning set should only include policies that have completed one full year since issuance, because the outcome for newer policies is not yet known.
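As a rough illustration of the first-year rule, the sketch below keeps only policies whose first anniversary has already passed. The field names, sample dates and the `as_of` cut-off date are hypothetical.

```python
from datetime import date

def first_anniversary(issue_date):
    """Return the date one year after issuance."""
    try:
        return issue_date.replace(year=issue_date.year + 1)
    except ValueError:
        # Issued on 29 February: fall back to 28 February of the next year.
        return issue_date.replace(year=issue_date.year + 1, month=2, day=28)

# Hypothetical policies; only the issue date matters for this filter.
policies = [
    {"policy_id": "A", "issue_date": date(2022, 3, 1)},
    {"policy_id": "B", "issue_date": date(2023, 6, 15)},
    {"policy_id": "C", "issue_date": date(2023, 1, 10)},
]
as_of = date(2024, 1, 31)  # assumed date on which the learning set is built

# Keep only policies with a fully observed first year.
eligible = [p for p in policies if first_anniversary(p["issue_date"]) <= as_of]
eligible_ids = [p["policy_id"] for p in eligible]
```

Policy B is excluded here because its first year is still in progress on the cut-off date, so whether it will have a claim is not yet known.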
Sometimes less data is more: You don’t always want all the data. If the data is not stable because business processes have changed, you may want to use only recent data. This keeps the learning set relevant to current conditions and should result in more accurate predictions.
Capturing the full picture: Some businesses have periods of fluctuation; for example, a higher volume of claims can be expected during peak season. Try to include more than one full year in the learning period so that the model sees the whole picture rather than a single peak or trough.
#2 Choice of independent variables
Input determines output: This is where feature engineering becomes an important step in the predictive model building process. The data is cleaned to reduce noise, and variables are identified/created that have a relationship with the event being predicted.
Domain knowledge: Understanding the business is important for the analytics team to know what data to use while building the model. For example, in a fraudulent claims model, the risk control unit can provide insights about commonly observed patterns in fraudulent cases. Variables can then be created based on these insights and used in the model.
Data availability at the point of prediction: Some data is time-sensitive. If an auto policy is up for renewal, the underwriter will use the predictive model at the time of underwriting the renewal. If a claim is submitted after the renewal is processed, any data captured related to the claim would not be available to the underwriter and cannot be used in the model.
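One simple way to enforce this rule is to record when each piece of data was captured and drop anything captured after the prediction date. The sketch below assumes hypothetical variable names and dates and is only meant to illustrate the idea.

```python
from datetime import date

def available_at_prediction(observations, prediction_date):
    """Keep only observations captured on or before the prediction date."""
    return [o for o in observations if o["captured_on"] <= prediction_date]

renewal_date = date(2024, 5, 1)  # hypothetical date the renewal is underwritten

# Hypothetical observations for one auto policy.
observations = [
    {"name": "prior_year_claims", "value": 1, "captured_on": date(2024, 2, 10)},
    {"name": "vehicle_age", "value": 6, "captured_on": date(2024, 4, 20)},
    {"name": "post_renewal_claim_amount", "value": 3200.0, "captured_on": date(2024, 6, 3)},
]

usable = available_at_prediction(observations, renewal_date)
usable_names = [o["name"] for o in usable]
```

Applying the same filter when building the learning set helps ensure the model is trained only on information the underwriter would actually have had at the time.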
Intended use of the model: How does the business intend to use the model? This matters and will drive how the model is built. If the business needs to review and interpret the model, then complex variables should be avoided. If the model is meant purely for prediction (like the potential of a fraudulent claim), then using complex variables will not be an issue.
Variable significance: The list of potential predictors is based on data availability, creativity and domain knowledge. However, the predictors that are retained in the finalized model will be those that have a significant relationship with the event to be predicted.
#3 Choice of algorithm
Nature of the event being predicted: At a high level, there are two kinds of models:
- Regression models: used for continuous target variables, such as time to claim or claim amount
- Classification models: used for two-class (binary) or multi-class predictions, such as whether a claim will occur or a premium payment will be made before the due date, in the grace period, or after lapse
Some statistical algorithms are specific to one type of prediction. The choice of algorithm is determined by the business use case, and by trying alternative models and comparing their results on the holdout sample.
Intended use of the model: Understanding the reasons behind a prediction is as important for model adoption as having an accurate prediction. Understanding the workings of the model establishes trust that the model has learned the correct patterns, and that there are no legal or ethical violations.
When considering the intended business use of the model, it’s important to understand the difference between white-box models and black-box models. White-box models are more transparent, and it is easier to understand how they work. They typically use linear/logistic regression or decision tree algorithms. Black-box models are much more complex and harder to explain. They use algorithms such as deep learning, boosting and random forest.
Model performance: After accounting for the business use case, the litmus test for choosing between alternative algorithms is a comparison of predictive power on unseen data, such as the testing set kept aside from the learning data, as mentioned earlier.
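The comparison on unseen data can be sketched as follows. The two candidate "models" are deliberately trivial, hypothetical rules standing in for real trained algorithms, and accuracy is used only as an example metric; the holdout figures are made up for illustration.

```python
def accuracy(actual, predicted):
    """Share of records whose predicted class matches the actual class."""
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual)

def compare_on_holdout(candidates, holdout_inputs, holdout_labels):
    """Score every candidate on the same holdout sample and pick the best."""
    scores = {
        name: accuracy(holdout_labels, [predict(x) for x in holdout_inputs])
        for name, predict in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical candidates: each maps a claim amount to 1 (fraud) or 0 (not fraud).
candidates = {
    "always_not_fraud": lambda amount: 0,
    "amount_threshold": lambda amount: 1 if amount > 5000 else 0,
}

# Illustrative holdout sample that neither candidate was built on.
holdout_inputs = [1200, 8000, 300, 9500, 4000]
holdout_labels = [0, 1, 0, 1, 0]

best_name, holdout_scores = compare_on_holdout(candidates, holdout_inputs, holdout_labels)
```

The important design point is that every candidate is scored on the same unseen records, so the comparison reflects predictive power rather than how well each one memorized the learning set.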
#4 Model consumption
Model output: For a regression problem, the model output is a direct numeric prediction of the outcome. For a classification problem, the model output is a numeric score for each class that indicates the probability that the outcome belongs to that class. For any record, the predicted class is the class with the highest model score.
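For a classification model, turning scores into a predicted class is a one-line operation. In this sketch the class names echo the premium payment example above, and the scores are made-up values.

```python
def predicted_class(class_scores):
    """Return the class whose model score is highest."""
    return max(class_scores, key=class_scores.get)

# Hypothetical model scores for one premium payment record.
scores = {"before_due_date": 0.62, "grace_period": 0.27, "after_lapse": 0.11}
label = predicted_class(scores)
```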
Model evaluation: Regression models are evaluated by quantifying how far the predicted numeric values deviate from the actual numeric values. Commonly used metrics are mean absolute error, mean squared error and root mean squared error. Classification models are evaluated based on the extent to which records are correctly classified, as well as the ability to rank-order records so that those with a higher probability of belonging to a particular class come first.
Model implementation: After a predictive model has been constructed, evaluated and finalized, the next step is the implementation of the model for use by the business. The mode of implementation depends on how frequently the predictive model needs to be scored (in real time or in batch mode).
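The metrics above are straightforward to compute by hand. The sketch below shows the three regression metrics plus one common way to measure rank ordering for a two-class model: the share of positive/negative pairs in which the positive record gets the higher score (often called AUC). Using AUC here is an assumption made for illustration, and all of the data values are invented.

```python
import math

def regression_metrics(actual, predicted):
    """Return mean absolute error, mean squared error and root mean squared error."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    return mae, mse, math.sqrt(mse)

def pairwise_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one record of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Illustrative claim amounts: actual versus predicted.
mae, mse, rmse = regression_metrics([100, 200, 300, 400], [110, 190, 320, 400])

# Illustrative fraud labels (1 = fraud) and model scores.
auc = pairwise_auc([1, 0, 1, 0], [0.9, 0.4, 0.35, 0.2])
```

A rank-ordering value of 1.0 would mean every fraudulent record scores above every non-fraudulent one, while 0.5 is what random scoring would be expected to achieve.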
Feedback on model performance: Once the model is deployed, continuous monitoring is needed to evaluate whether model performance meets expectations. The work doesn’t end at deployment, though: re-training will be necessary over time to keep the model in line with business needs.
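A monitoring check can be as simple as comparing recent performance against what was expected at deployment. In this sketch the three-month window, the 0.05 tolerance and the monthly accuracy figures are all hypothetical; in practice the business and the analytics team would agree on the metric and thresholds.

```python
def needs_retraining(monthly_accuracy, expected_accuracy, tolerance=0.05, window=3):
    """Flag the model when the recent average falls below expectation minus tolerance.

    tolerance and window are illustrative defaults, not recommendations.
    """
    recent = monthly_accuracy[-window:]
    recent_average = sum(recent) / len(recent)
    return recent_average < expected_accuracy - tolerance

# Hypothetical accuracy measured each month after deployment.
monthly_accuracy = [0.91, 0.90, 0.89, 0.84, 0.82, 0.80]
retrain = needs_retraining(monthly_accuracy, expected_accuracy=0.90)
```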
The feedback loop between business stakeholders and the analytics team developing the predictive model is necessary at every stage of model development. An optimal model will combine business knowledge and implementation needs with technical data science expertise.