One of the challenges faced by data scientists is dealing with unstructured data using traditional machine learning models. These models are trained on structured data that have input features with corresponding output labels. When using unstructured data, the data cannot be directly used as an input feature. One approach is to use Artificial Neural Networks (ANN) to unlock business insights from unstructured data.
The Growth of Unstructured Data
Traditionally, most of the data available to companies was structured data. Structured data is defined as data that can be easily organized in traditional relational databases in the form of rows and columns. For insurance companies, the structured data they possess includes:
1. Data about their customers
A. Contact details
B. Demographic information like their age, gender, etc.
2. Policy information like issue dates, expiry dates, premium, sum assured, renewal details, etc.
3. Claims history like claimed amount, date of claim, claim status, etc.
4. Product information like characteristic features of the various insurance plans
5. Prospect Data
6. Data on the company's agents
However, in recent years, as technology has become smarter, faster, and more widely accessible, the amount and type of data available to companies have changed drastically. Trends such as a surge in social media usage and the easy availability of cameras in cell phones have resulted in a large amount of data reaching companies in the form of images, videos, audio, and free text. For companies, the cost of dealing with this kind of data has also gone down significantly with the availability of cloud storage and cloud computing.
This type of data is commonly referred to as unstructured data. It is difficult and impractical to store and process this data in the form of traditional tables with rows and columns, i.e., it is difficult to define a standard structure for this type of data.
Insurance companies naturally possess and deal with a good deal of unstructured data, a few examples being:
- Social media comments
- Customer emails
- Audio files of customer calls
- Customer feedback from surveys
- Images/videos of car accidents/car damage (auto insurance)
- X-ray images (health insurance/life insurance)
- Scanned document images submitted at the time of policy issuance
The challenge of dealing with unstructured data using traditional machine learning models
Traditional or classical machine learning models are trained on data with certain input features and corresponding output labels. The machine learning model learns from this data and progressively improves its ability to predict the output label for a given set of input features.
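As a rough illustration of "learning from input features and output labels", here is a minimal sketch of a logistic regression model trained with plain gradient descent. The toy data (a scaled age and a scaled count of past claims, labeled by whether a claim was filed) and the function names are assumptions made up for this example, not data or code from a real insurer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, epochs=500, lr=0.5):
    """Fit one weight per input feature plus a bias using stochastic gradient descent."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            pred = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = pred - y  # how far the prediction is from the label
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def predict(weights, bias, x):
    score = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if score >= 0.5 else 0

# Hypothetical structured data: [scaled age, scaled number of past claims]
rows = [[0.2, 0.0], [0.3, 0.1], [0.5, 0.0], [0.6, 0.8], [0.7, 1.0], [0.4, 0.9]]
labels = [0, 0, 0, 1, 1, 1]  # 1 = claim filed, 0 = no claim

weights, bias = train_logistic(rows, labels)
predictions = [predict(weights, bias, x) for x in rows]
```

Each pass over the data nudges the weights so that predictions move closer to the labels, which is the "progressive improvement" described above. Note that every row is already a fixed-length list of numbers; that is exactly what unstructured data does not give us for free.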
One complication with unstructured data is that the data we have cannot be directly used as input features. While some processing and transformation of data is required even for structured data, converting unstructured data to features requires a good deal of technical and domain expertise. Feature engineering is a crucial step in using classical machine learning techniques with unstructured data.
Let’s take the example of an auto insurance company determining the damage to a car using images.
An image by itself cannot be used as an input feature. Deliberate feature engineering is needed to define edges, corners, contours, changes in shade, etc. through different calculations on pixel values.
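To make the idea of "calculations using pixel values" concrete, here is a minimal sketch of one hand-crafted feature. It estimates edge strength from differences between neighboring pixels and then summarizes the image as a single number. The function names, the threshold of 50, and the tiny 5x5 patches are assumptions chosen for illustration; a real pipeline would use many such features and much larger images.

```python
def gradient_magnitude(image):
    """Approximate edge strength at each interior pixel using central differences."""
    height, width = len(image), len(image[0])
    edges = [[0.0] * width for _ in range(height)]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            dx = image[r][c + 1] - image[r][c - 1]  # horizontal change in shade
            dy = image[r + 1][c] - image[r - 1][c]  # vertical change in shade
            edges[r][c] = (dx * dx + dy * dy) ** 0.5
    return edges

def edge_density(image, threshold=50):
    """One hand-crafted feature: the fraction of interior pixels that look like an edge."""
    edges = gradient_magnitude(image)
    interior = [v for row in edges[1:-1] for v in row[1:-1]]
    return sum(v > threshold for v in interior) / len(interior)

# Hypothetical grayscale patches (0 = black, 255 = white)
patch_with_edge = [[10, 10, 10, 200, 200] for _ in range(5)]  # dark left, bright right
flat_patch = [[10] * 5 for _ in range(5)]                      # uniform shade

density_edge = edge_density(patch_with_edge)
density_flat = edge_density(flat_patch)
```

The patch with a sharp change in shade yields a high edge density, while the uniform patch yields zero. Someone has to decide that this feature, this threshold, and this neighborhood size are the right ones, which is where the domain and technical expertise comes in.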
Deep Learning using Artificial Neural Networks
Artificial Neural Networks (ANN) are a set of algorithms inspired by the anatomy of the human nervous system. ANN models are composed of multiple layers, and each layer is composed of several neurons. Each neuron receives information as input from the previous layer and passes on its calculations as output to the next layer. These layers and neurons together form a network, which is called an artificial neural network.
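The layer-by-layer flow described above can be sketched in a few lines. This is a minimal forward pass only (no training), and the weights, biases, and the choice of a sigmoid activation are hand-picked assumptions for illustration.

```python
import math

def dense_layer(inputs, weights, biases):
    """Each neuron takes a weighted sum of all inputs, adds a bias, and applies a sigmoid."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(1.0 / (1.0 + math.exp(-z)))
    return outputs

def forward(inputs, layers):
    """Pass the inputs through each layer in turn; one layer's output feeds the next."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations

# Hypothetical network: 3 inputs -> 2 hidden neurons -> 1 output neuron
layers = [
    ([[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, 0.1]),  # hidden layer
    ([[1.0, -1.0]], [0.0]),                               # output layer
]
output = forward([1.0, 2.0, 3.0], layers)
```

In a trained network the weights are not hand-picked; they are adjusted from data so that the final output matches the labels as closely as possible.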
Why do artificial neural networks perform better than classical models for unstructured data?
The principal benefit of using Artificial Neural Networks for unstructured data is the ability of these models to detect input features on their own. There are quite a few types of ANNs that deal with unstructured data. Here, we will take the example of two types of neural networks: the Convolutional Neural Network and the Recurrent Neural Network.
Convolutional Neural Network
Taking the example of an auto insurance company assessing the damage to cars using car images:
This is a classic example of object detection. The goal is to detect dents and scratches.
A Convolutional Neural Network (CNN) is a special kind of ANN that works best for image analytics. Image analytics can get quite complicated, as each image is composed of a large number of pixel values. If the image is in color, that adds to the complexity. A CNN modifies the ANN architecture to reduce the complexity of this input while retaining the algorithm's ability to discern the various features in the images.
Like every other ANN, a CNN may be composed of many layers. To detect the defects in the car image, the earlier layers of a CNN may simply detect vertical and horizontal edges. Each successive layer detects increasingly complex features until the final layers can detect the dents and scratches in the car image.
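A minimal sketch of what an early CNN layer computes is shown below: a small kernel slides across the image, a ReLU keeps only positive responses, and max pooling shrinks the result. The kernels here are fixed by hand purely for illustration; in a trained CNN the kernel values are learned from data. The 6x6 image and all names are assumptions for this example.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image (no padding, stride 1)."""
    k = len(kernel)
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def relu(feature_map):
    """Keep positive responses and set negative ones to zero."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Keep the strongest response in each size x size block, shrinking the map."""
    return [
        [
            max(feature_map[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for r in range(0, len(feature_map) - size + 1, size)
    ]

# Hand-set kernels for illustration; a real CNN learns these values.
VERTICAL_EDGE = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]
HORIZONTAL_EDGE = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]

# Hypothetical 6x6 grayscale patch with a vertical boundary in the middle
image = [[0, 0, 0, 9, 9, 9] for _ in range(6)]

vertical_map = max_pool(relu(convolve2d(image, VERTICAL_EDGE)))
horizontal_map = max_pool(relu(convolve2d(image, HORIZONTAL_EDGE)))
```

The vertical-edge kernel responds strongly to this patch while the horizontal-edge kernel stays at zero, and pooling reduces the 36 input pixels to a 2x2 summary per kernel. Stacking more such layers lets later layers combine these simple responses into more complex patterns.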
A CNN often works better than classical methods here because no domain or technical expertise is needed to define which features are critical for identifying dents. During training, the CNN tunes its own feature detection so that the accuracy in detecting dents is maximized.
CNNs have a lot of potential in the health and life insurance market, where companies deal with a lot of images like X-rays, MRI scans, etc. The companies can use CNNs at the underwriting stage. The models can be used to identify abnormalities in these scans faster and with higher accuracy. This information can be used to assess the risk and price the policy accordingly.
The models can also be used to examine the scans at the time of claims and possibly even identify fraudulent cases. While such use cases certainly need the expertise of medical personnel, artificial neural network models can help speed up the process and surface more accurate evidence.
Recurrent Neural Network
A Recurrent Neural Network (RNN) is another type of ANN, one primarily tailored to take sequences as input. The most common application of RNNs is text, which is represented as a sequence of words.
For an insurance company, one of the main parameters that it looks at is customer feedback. While customer feedback given in the form of ratings can be a good source of structured data, the feedback is most often more valuable when expressed as free text. Sentiment analysis on free-form customer feedback has been successfully attempted many times using classical machine learning methods such as SVM.
The benefit of using an RNN in such cases is that it has an internal memory that considers both the words in the text and the order in which they appear. This improves the accuracy with which it recognizes the sentiment of the text.
For example: "Your branch service is very good."
The word 'good' has a positive sentiment attached to it. The word 'very' by itself has no sentiment attached to it. An RNN, as an algorithm with memory, recognizes when looking at the word 'good' that it is preceded by the word 'very', and strengthens the positive sentiment inherent in the word 'good.'
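The role of memory can be sketched with a tiny recurrent cell. The example below is a forward pass only, with hand-picked weights and made-up two-number word vectors (all assumptions for illustration, not a trained sentiment model). It shows the one property discussed above: the final hidden state depends on word order, whereas a simple word-count representation does not.

```python
import math
from collections import Counter

# Hypothetical word vectors and weights, hand-picked for illustration only
EMBEDDINGS = {"very": [1.0, 0.0], "good": [0.0, 1.0]}
W_INPUT = [[0.5, -0.3], [0.2, 0.8]]    # how the current word affects the state
W_HIDDEN = [[0.1, 0.4], [-0.6, 0.3]]   # how the previous state (memory) carries over

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def rnn_final_state(words):
    """Read the words one at a time, updating the hidden state at each step."""
    hidden = [0.0, 0.0]
    for word in words:
        from_input = matvec(W_INPUT, EMBEDDINGS[word])
        from_memory = matvec(W_HIDDEN, hidden)
        hidden = [math.tanh(a + b) for a, b in zip(from_input, from_memory)]
    return hidden

def bag_of_words(words):
    """Order-insensitive representation: just count each word."""
    return Counter(words)

state_very_good = rnn_final_state(["very", "good"])
state_good_very = rnn_final_state(["good", "very"])
```

Here `bag_of_words(["very", "good"])` and `bag_of_words(["good", "very"])` are identical, while the two RNN states differ. When 'good' arrives, the hidden state already carries information about 'very', which is what allows a trained RNN to treat 'very good' differently from 'good' alone.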
Conclusion
CNNs and RNNs are just two examples of the most commonly used ANNs for unstructured data. Artificial neural networks and deep learning have been around for a long time. They were not widely used until recently because ANNs were computationally expensive in terms of time and cost. With computation becoming cheaper and faster, and data becoming more accessible and more varied, applications of artificial neural networks to unstructured data continue to grow.