All Posts

5 Natural Language Processing Techniques for Extracting Information

The field of artificial intelligence has always envisioned machines being able to mimic the functioning and abilities of the human mind. Language is considered as one of the most significant achievements of humans that has accelerated the progress of humanity. So, it is not a surprise that there is plenty of work being done to integrate language into the field of artificial intelligence in the form of Natural Language Processing (NLP). Today we see the work being manifested in likes of Alexa and Siri.

NLP primarily comprises of natural language understanding (human to machine) and natural language generation (machine to human). This article will mainly deal with natural language understanding (NLU). In recent years there has been a surge in unstructured data in the form of text, videos, audio and photos. NLU aids in extracting valuable information from text such as social media data, customer surveys, and complaints.

Consider the text snippet below from a customer review of a fictional insurance company called Rocketz Auto Insurance Company:

The customer service of Rocketz is terrible. I must call the call center multiple times before I get a decent reply. The call center guys are extremely rude and totally ignorant. Last month I called with a request to update my correspondence address from Brooklyn to Manhattan. I spoke with about a dozen representatives – Lucas Hayes, Ethan Gray, Nora Diaz, Sofia Parker to name a few. Even after writing multiple emails and filling out numerous forms, the address has still not been updated. Even my agent John is useless. The policy details he gave me were wrong. The only good thing about the company is the pricing. The premium is reasonable compared to the other insurance companies in the United States. There has not been any significant increase in my premium since 2015.

Let’s explore 5 common techniques used for extracting information from the above text.

1.    Named Entity Recognition

The most basic and useful technique in NLP is extracting the entities in the text. It highlights the fundamental concepts and references in the text. Named entity recognition (NER) identifies entities such as people, locations, organizations, dates, etc. from the text.

NER output for the sample text will typically be:

Person: Lucas Hayes, Ethan Gray, Nora Diaz, Sofia Parker, John

Location: Brooklyn, Manhattan, United States

Date: Last month, 2015

Organization: Rocketz

NER is generally based on grammar rules and supervised models. However, there are NER platforms such as open NLP that have pre-trained and built-in NER models.

2.    Sentiment Analysis

The most widely used technique in NLP is sentiment analysis. Sentiment analysis is most useful in cases such as customer surveys, reviews and social media comments where people express their opinions and feedback. The simplest output of sentiment analysis is a 3-point scale: positive/negative/neutral. In more complex cases the output can be a numeric score that can be bucketed into as many categories as required.

In the case of our text snippet, the customer clearly expresses different sentiments in various parts of the text. Because of this, the output is not very useful. Instead, we can find the sentiment of each sentence and separate out the negative and positive parts of the review. Sentiment score can also help us pick out the most negative and positive parts of the review:

Most negative comment: The call center guys are extremely rude and totally ignorant.

Sentiment Score: -1.233288

Most positive comment: The premium is reasonable compared to the other insurance companies in the United States.

Sentiment Score: 0.2672612

Sentiment Analysis can be done using supervised as well as unsupervised techniques. The most popular supervised model used for sentiment analysis is naïve Bayes. It requires a training corpus with sentiment labels, upon which a model is trained which is then used to identify the sentiment. Naive Bayes is not the only tool out there - different machine learning techniques like random forest or gradient boosting can also be used.

The unsupervised techniques also known as the lexicon-based methods require a corpus of words with their associated sentiment and polarity. The sentiment score of the sentence is calculated using the polarities of the words in the sentence.

3.    Text Summarization

As the name suggests, there are techniques in NLP that help summarize large chunks of text. Text summarization is mainly used in cases such as news articles and research articles.

Two broad approaches to text summarization are extraction and abstraction. Extraction methods create a summary by extracting parts from the text. Abstraction methods create summary by generating fresh text that conveys the crux of the original text. There are various algorithms that can be used for text summarization like LexRank, TextRank, and Latent Semantic Analysis. To take the example of LexRank, this algorithm ranks the sentences using similarity between them. A sentence is ranked higher when it is similar to more sentences, and these sentences are in turn similar to other sentences.

Using LexRank, the sample text is summarized as: I have to call the call center multiple times before I get a decent reply. The premium is reasonable compared to the other insurance companies in the United States.

4.    Aspect Mining

Aspect mining identifies the different aspects in the text. When used in conjunction with sentiment analysis, it extracts complete information from the text. One of the easiest methods of aspect mining is using part-of-speech tagging.

When aspect mining along with sentiment analysis is used on the sample text, the output conveys the complete intent of the text:

Aspects & Sentiments:

  • Customer service – negative
  • Call center – negative
  • Agent – negative
  • Pricing/Premium – positive

5.    Topic Modeling

Topic modeling is one of the more complicated methods to identify natural topics in the text. A prime advantage of topic modeling is that it is an unsupervised technique. Model training and a labeled training dataset are not required.

There are quite a few algorithms for topic modeling:

One of the most popular methods is latent Dirichlet allocation. The premise of LDA is that each text document comprises of several topics and each topic comprises of several words. The input required by LDA is merely the text documents and the expected number of topics.

Using the sample text and assuming two inherent topics, the topic modeling output will identify the common words across both topics. For our example, the main theme for the first topic 1 includes words like call, center, and service. The main theme in topic 2 are words like premium, reasonable and price. This implies that topic 1 corresponds to customer service and topic two corresponds to pricing. The diagram below shows the results in detail.

Topic Modeling


These are just a few techniques of natural language processing. Once the information is extracted from unstructured text using these methods, it can be directly consumed or used in clustering exercises and machine learning models to enhance their accuracy and performance.


Neeraja Vaidya
Neeraja Vaidya
Neeraja has a Post Graduate degree in Economics from Mumbai University and is a part of the data sciences team, responsible for developing statistical models for products.

Related Posts

Data Drifted, Now What? Identifying and Responding to Data Drift

“Change is not only likely, it’s inevitable.” – Barbara Sher What is Data Shift or Data Drift? Given human nature, it is very natural that the data we collect will change over time. Changes in data such as behavior and preferences are fast and drastic. It is even more relevant today with the impact of the Covid-19 pandemic making unprecedented changes to businesses. For a data science practitioner, the stability of data and its source are salient to develop and maintain robust ML (machine learning) solutions. Changes or drift in data will degrade the performance of predictive models.

Ways and Means to Build Trust in the Digital Era

This is part 2 of a 2-part series "Trust: The Key Ingredient for a Successful Insurance Customer Journey." In Part 1 of our blog series, we discussed how important it is for an insurer to display and build trust in any customer experience. The trust factor has continued to rise and is becoming much more significant. The buyer's journey is to garner trust. Buyers prior to 2010 usually rated insurance companies' trustworthiness and confidence level based on their responses to their inquiries. He validated information he collected from friends and relatives that were already existing customers. During that time, there was very little information about the insurer's finances in the public domain, so he had to rely on hearsay. In that process, the buyer zeroed in on a particular insurer or seller.

Unveiling a Whole New World of CX Insights

As 2021 came to a close, we couldn't help but notice how what was inevitable in the future was becoming necessary now. While a normal practice would be to predict what lies ahead of us in 2022, we decided to take a different approach this year. In our second edition of the Aureus yearbook, we will take a closer look at what has delivered superior CX (Customer Experience) in the insurance industry this past year. While there is nothing new about focusing on your customers, we shall examine the role data has played in assisting companies up their CX game.