The insurance industry in India is expected to reach US $280 billion by 2020-21, and the life insurance segment is expected to grow by 14-15% annually over the next three to five years. With such rapid growth in the life insurance market, the number of fraud cases is also expected to increase. Recent reports suggest that fraud consumes more than 8.5% of the revenue the industry generates.
Fraud comes in all shapes and sizes. The common forms in life insurance are application fraud, through forged documents or non-disclosure of critical information, and claims fraud, such as faking a death or taking out a policy in the name of a terminally ill person. Insurers use different indicators and checkpoints, especially during the underwriting process, to identify potential fraud.
However, over time fraudsters have become smarter and more creative at beating the systems and checks that insurers have in place. Data analytics, especially machine learning (ML) techniques, can be used to build models that pick up patterns and hidden links in data to identify fraud. The insurance industry uses ML models to provide predictions at different stages, ranging from predicting fraudulent applications during policy issuance to predicting fraudulent claims at the time of claim evaluation.
Below is an example of a fraud case uncovered by ML:
Rocketz Insurance received an application for an endowment product from a customer residing in Mumbai. An early claims risk prediction model flagged the application as a potential high-risk case for early claims. The field investigation of the policy identified irregularities with the documents submitted and the details provided about the customer's health in the policy application.
Rocketz Insurance decided to reject the application. The early claims risk prediction model then detected a link between the rejected application and an existing policy, issued a month earlier to a customer residing at the same residential address but sourced by a different agent. With this new information, Rocketz Insurance decided to investigate. How was the early claims model developed to identify the link between the customers?
Building a predictive model
The first phase of building any predictive model involves a detailed analysis of available data fields. The focus should be on collating different data sources, identifying unique keys, and creating structured data tables. Here the modeler should also consider the possibility of using external data. In an insurance fraud modeling project, some of the commonly available data points are policy-level information, historical claims and identified frauds, customer demographics, details about the agent, and other entities related to policy sourcing and servicing.
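As a minimal sketch of this collation step, the snippet below joins hypothetical policy, customer, and agent tables on assumed unique keys using pandas. The file names and column names are illustrative assumptions, not any specific insurer's schema.

```python
import pandas as pd

# Hypothetical source tables; file names and columns are illustrative assumptions.
policies = pd.read_csv("policies.csv")    # policy_id, customer_id, agent_id, product_type
customers = pd.read_csv("customers.csv")  # customer_id, age, income, occupation, pincode
agents = pd.read_csv("agents.csv")        # agent_id, branch, tenure_years

# Join the sources on their unique keys into one structured modeling table.
model_table = (
    policies
    .merge(customers, on="customer_id", how="left")
    .merge(agents, on="agent_id", how="left")
)
```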
Compared to other industries, the insurance sector has access to better quality in-house data about the customer. However, the primary data source remains the details collected during the application process. Enriching data using external sources provides more relevant and current information about the customer. In this age of data explosion, external data sources are plentiful, and they capture multifaceted data points on a consumer, ranging from social data and lifestyle information to spending patterns and even credit history.
Feature Space – What is required to make predictions?
Predicting future fraud or early claims (claims during the early years of a policy, e.g. within 1-3 years), especially at the time of policy issuance, is challenging. The key to developing a robust predictive model lies in the quality of the input features or predictor variables.
Common Predictors
Direct features like customer age, income, gender, occupation, and policy type are useful in predicting claims. However, these features will not capture all the continuously changing, intricate trends related to fraud. The focus should be on creating features or variables that capture changes over time. The modeler should experiment with all entities related to a policy.
Some examples of such features are the claim ratio of an agent in the last year, the proportion of specific product types sold by an agent in the last t years, and the proportion of customers with a graduate degree or above in a pincode, calculated from policies sold in that pincode in the last t years. During model development, the modeler should consider different time ranges for creating these historical aggregates. The Indian insurance industry gives special attention to policies issued from specific regions and pincodes to prevent fraudulent claims.
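To make the idea concrete, here is a hedged sketch of one such time-window aggregate, an agent's claim ratio over the t years before a reference date, computed with pandas. The column names (agent_id, issue_date, had_claim) and the toy data are assumptions for illustration.

```python
import pandas as pd

def agent_claim_ratio(policies: pd.DataFrame, as_of: pd.Timestamp, years: int = 1) -> pd.Series:
    """Claim ratio per agent over the `years` years before `as_of`.

    Assumes illustrative columns: agent_id, issue_date (datetime), had_claim (0/1).
    """
    window = policies[
        (policies["issue_date"] >= as_of - pd.DateOffset(years=years))
        & (policies["issue_date"] < as_of)
    ]
    # Share of the agent's policies in the window that resulted in a claim.
    return window.groupby("agent_id")["had_claim"].mean()

# Toy example with made-up data.
policies = pd.DataFrame({
    "agent_id": ["A1", "A1", "A2", "A2"],
    "issue_date": pd.to_datetime(["2019-03-01", "2019-09-15", "2019-06-10", "2018-01-05"]),
    "had_claim": [1, 0, 0, 1],
})
print(agent_claim_ratio(policies, pd.Timestamp("2020-01-01"), years=1))
```

Varying the `years` parameter gives the different time ranges mentioned above, so the same function can generate a family of historical aggregates.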
Predictors related to location
How do you create a multi-faceted profile for a customer's location? Pincode-level features are good building blocks for such a profile. The modeler should explore features that capture historical performance indicators, such as the claim ratio during the last t years and the percentage of policies sold to customers from different age segments, education levels, or occupation types. These pincode estimates should be updated at regular intervals to capture changing and evolving business trends; a good practice is to update them once every month or quarter.
Create features that capture business and market feedback. For example, if there are identified high-risk pincodes based on field data or underwriting experience, create tags representing such pincode clusters. It is recommended to maintain large cluster groups with sufficient pincodes in each cluster to avoid overfitting. These clusters can be maintained for longer periods and updated only a few times during the financial year. Together, these features create a profile for the pincode.
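One possible way to derive such cluster tags is sketched below, grouping pincodes by their historical profiles with scikit-learn's KMeans. The profile columns, pincodes, and cluster count are all assumptions for illustration; in practice the tags could equally come from field data or underwriting experience rather than clustering.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical pincode profile: one row per pincode with historical aggregates.
pincode_profile = pd.DataFrame({
    "pincode": ["400001", "400002", "400003", "400004", "400005", "400006"],
    "claim_ratio_3y": [0.02, 0.15, 0.03, 0.18, 0.04, 0.16],
    "pct_graduate_or_above": [0.60, 0.20, 0.55, 0.25, 0.65, 0.22],
})

# Standardize so no single feature dominates the distance metric.
X = StandardScaler().fit_transform(
    pincode_profile[["claim_ratio_3y", "pct_graduate_or_above"]]
)

# Keep the cluster count small so each cluster holds enough pincodes
# to avoid overfitting to any single location.
pincode_profile["risk_cluster"] = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
print(pincode_profile)
```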
Machine Learning Algorithms
Models for predicting fraud are typically formulated as binary classification problems. The algorithm is trained to learn patterns that distinguish one group from the other. Our experience shows that ensemble tree-based boosting algorithms like gradient boosting, extreme gradient boosting (XGBoost), or AdaBoost effectively capture the trends required to distinguish fraud from genuine cases.
Other common algorithms, such as logistic regression, decision trees, naïve Bayes, and support vector machines, show lower predictive power compared to ensemble models. However, these models are transparent and less complex. It is easier to explain the risk trends learned by these simpler models than those learned by black boxes like the ensemble boosting trees.
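A minimal sketch of this trade-off is shown below, comparing a gradient boosting classifier against logistic regression with scikit-learn. The data is simulated and imbalanced to stand in for fraud labels; it is not real insurer data, and actual results will depend entirely on the features available.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Simulated, imbalanced binary data standing in for fraud (1) vs genuine (0).
X, y = make_classification(n_samples=10_000, n_features=20, weights=[0.97], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Fit both models and compare discrimination via AUC on held-out data.
for model in (GradientBoostingClassifier(random_state=42),
              LogisticRegression(max_iter=1000)):
    scores = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    print(type(model).__name__, "AUC:", round(roc_auc_score(y_test, scores), 3))
```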
Making use of model prediction
Predictions from ML models typically take the form of a score indicating the likelihood of the event of interest. A model built to predict fraudulent policy applications in order to address early claims will generate a score indicating the likelihood of future fraud or an early claim. Based on the score, policies are categorized into different risk buckets (or segments), ranging from high to low, based on each segment's expected fraud rate.
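One common way to form such buckets is by score quantiles, sketched below with pandas. The scores here are randomly generated, and the number of buckets and their labels are assumptions; real cut-offs would be tuned to the portfolio's observed fraud rates.

```python
import numpy as np
import pandas as pd

# Hypothetical model output: one fraud/early-claim probability per policy.
rng = np.random.default_rng(42)
scores = pd.Series(rng.beta(2, 8, size=1_000), name="fraud_score")

# Split scores into five equal-frequency buckets from lowest to highest risk.
risk_bucket = pd.qcut(scores, q=5, labels=["very_low", "low", "medium", "high", "very_high"])
print(risk_bucket.value_counts())
```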
Based on the availability of resources, insurers apply different strategies to handle each risk bucket. These range from field-level investigation of policies in the highest-risk buckets to in-house verification or call-center engagement for moderate-risk buckets.
Leveraging Network Analysis
A common practice is to evaluate predictions at the policy level and make decisions for each policy primarily based on the risk bucket it belongs to. This captures a large proportion of potential high-risk cases within the top buckets; however, some fraudulent cases will fall into the lower buckets and escape the risk model. The chances of detecting such cases can be improved by leveraging relationship networks that connect policies and customers.
Customers in an insurer's portfolio can be represented as a network by defining different connections. Examples include the insured and the policyholder on the same policy, customers residing at the same residential address, or policies sourced by a common agent. These connections can be used to create a network with multiple connected clusters. Each cluster is a group of customers related to each other who are likely to show common traits related to risk factors like fraud and claims.
Leveraging customer clusters along with ML predictions significantly improves the evaluation of high-risk cases. Considering the risk profile of a cluster when evaluating a high-risk policy can help filter out false positives (policies incorrectly identified as high-risk by the ML model), improving the success rate and effectiveness of the strategies used to handle high-risk cases. Further, reviewing policies predicted as low risk by the model but belonging to a high-risk policy cluster enhances the capture rate by detecting additional fraudulent policies.
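A hedged sketch of this idea with networkx is shown below: policies sharing an address or agent are connected, connected components become clusters, and low-scored policies sitting in high-risk clusters are flagged for review. All identifiers, scores, and thresholds here are illustrative.

```python
import networkx as nx
import pandas as pd

# Hypothetical policies with model scores and linking attributes.
policies = pd.DataFrame({
    "policy_id": ["P1", "P2", "P3", "P4"],
    "address":   ["A1", "A1", "A2", "A2"],
    "agent_id":  ["G1", "G2", "G2", "G3"],
    "fraud_score": [0.92, 0.10, 0.88, 0.15],
})

# Build a graph: policies sharing an address or an agent are connected.
G = nx.Graph()
G.add_nodes_from(policies["policy_id"])
for key in ("address", "agent_id"):
    for _, group in policies.groupby(key):
        ids = list(group["policy_id"])
        G.add_edges_from((ids[0], other) for other in ids[1:])

# Each connected component is a cluster of related policies.
cluster_of = {pid: i for i, comp in enumerate(nx.connected_components(G)) for pid in comp}
policies["cluster"] = policies["policy_id"].map(cluster_of)

# Cluster risk = mean model score; flag low-scored policies in risky clusters.
policies["cluster_risk"] = policies.groupby("cluster")["fraud_score"].transform("mean")
flags = policies[(policies["fraud_score"] < 0.5) & (policies["cluster_risk"] > 0.5)]
print(flags[["policy_id", "fraud_score", "cluster_risk"]])
```

In this toy example, the shared address and shared agent links chain all four policies into one cluster, so the two low-scored policies (P2 and P4) are surfaced because the policies they are connected to carry high model scores.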
Conclusion
The fields of machine learning and data analytics have seen significant advancements and discoveries in the last two decades. We now have fast, scalable, ready-to-use algorithms that can capture complex trends from data, a task that looked difficult only a few years ago. This has considerably improved the usability and predictive power of ML algorithms.
Further, advancements in computational power and graph databases have made it easy to capture relationships and discover associations present in the data. In today's scenario, with large amounts of data available from different sources, connections and relationships are essential to make sense of data and proactively engage with users.
Data science is now equipped with the tools required to address challenging problems like the fraud faced by the insurance industry. Leveraging connections between customers along with the predictions from ML algorithms can significantly improve early identification of fraud and minimize portfolio risk.
Interested in learning how Aureus can help you with fraud detection? Download the "Life Claims Fraud Case Study" to learn how Aureus helped a major insurer identify fraudulent claims.