Each team was given 80GB of real transaction data from which they had to glean insights using any tools they were comfortable with.

The organizers of the hackathon kept the problem statement open and gave teams the freedom to explore various aspects of the data. Team Aureus focused on building a top-notch claims rejection prediction model with strong predictive power. Along with the given structured health claims data, the team created intelligent predictor variables by applying text mining techniques to the unstructured claims data.

Combining the direct and derived intelligent variables, the team developed high-performing models using advanced machine learning algorithms such as Random Forest and Generalized Additive Models. Further, detailed insights were generated using the prediction models and text mining techniques. High-impact predictor variables and their complex risk patterns were also identified.

By identifying inter-variable correlations, the team could dive deep into the data records to identify the diseases for which most claims were filed, the corresponding rejection rates, and so on. The team also demonstrated that the model could be specialized into disease-specific models.

Together, the team provided insights with high business value along with top-class prediction models for optimizing operations and resource allocation. Such a model would go a long way in helping optimize operations by improving claims review time and planning claims cash flow.

http://datameet.org/2015/05/13/mumbai-meet-6-data-science-hackathon/

What kinds of problems need the use of an unsupervised learning technique? Let us look at an example. Imagine a situation where a large number of text documents need to be classified into certain categories in an automated fashion. If the categories into which the documents need to be classified are known upfront and if a good-sized training sample is available (i.e., if a subset of documents with the corresponding category labels is available), then we can use a supervised classification algorithm to classify these documents. If, however, the categories are not known upfront, then an unsupervised algorithm will be required. In this latter case, the problem can be decomposed into two steps:

- Group similar documents together; and
- Identify descriptions for the document groups identified in step 1 in order to identify distinct categories.

The first step needs an unsupervised learning technique to analyze the population of documents and group similar documents together.
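To make the two steps concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption – the four-document toy corpus, the bag-of-words representation, the 0.3 similarity threshold, and the greedy single-pass grouping (a stand-in for a real clustering algorithm):

```python
import math
from collections import Counter

def bow(doc):
    """Represent a document as a bag of words (word -> count)."""
    return Counter(doc.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

docs = [
    "stock market shares trading",
    "market shares rally stock",
    "football match goal score",
    "goal score football league",
]

# Step 1: group similar documents (greedy single-pass grouping here;
# a real system would use K-means, hierarchical clustering, etc.).
threshold = 0.3
clusters = []          # each cluster is a list of document indices
for i, doc in enumerate(docs):
    vec = bow(doc)
    for cluster in clusters:
        if cosine(vec, bow(docs[cluster[0]])) >= threshold:
            cluster.append(i)
            break
    else:
        clusters.append([i])

# Step 2: describe each group by its most frequent words,
# which suggests a category label for the group.
for cluster in clusters:
    counts = Counter()
    for i in cluster:
        counts.update(bow(docs[i]))
    label = [w for w, _ in counts.most_common(2)]
    print(cluster, label)
```

On this toy corpus the finance documents and the football documents end up in separate groups, and the frequent-word labels hint at the two categories.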

The one point that I want to emphasize here is that the adjective “unsupervised” does not mean that these algorithms run by themselves without human supervision. It simply indicates the absence of a desired or ideal output corresponding to each input. An analyst (or a data scientist) who is training an unsupervised learning model has to exercise the same kind of modeling discipline, and can exercise a similar amount of control over the resulting output by configuring model parameters, as one who is training a supervised model. While supervised algorithms derive a mapping function from x to y so as to accurately estimate the y’s corresponding to new x’s, unsupervised algorithms employ predefined distance/similarity functions to map the distribution of input x’s. The accuracy of the output, therefore, depends heavily upon how effectively the analyst is able to represent the inputs, as well as on their choice of similarity measure. Let me elaborate on this last point using the document clustering example referred to in the previous paragraph. What are the key choices that an analyst needs to make when modeling the document clustering problem?

**(a) How should the documents be represented?** (This point is equally relevant for a supervised document classification scenario.) There are innumerable ways of doing this. For example, a document could be represented as a long vector containing all the distinct words in the document; or it could be represented as a vector containing a subset of the words that turn out to be significant based on some chosen measure; or it could even be represented as a numerical vector containing numbers based on some measures of the document, e.g., the total number of words, the number of sections in the document, etc. The choice of representation will depend on the nature of the documents (e.g., documents containing news articles will need different treatment than those containing stories) and, to some extent, on the clustering algorithm used. In practice, the documents will likely need to be pre-processed with a feature extraction algorithm, the output of which is a feature vector that represents each document as one data point.

**(b) Which distance/similarity measure should be used?** If the feature vectors are numerical, one popular distance measure is the Euclidean distance, which is close to the human perception of physical distance in three dimensions. But this is by no means the only possible choice. There are several other measures, e.g., the cosine distance, which measures the angular distance between two vectors, or the correlation distance, which is based on the correlation coefficient between two vectors. If the feature vectors are non-numerical, a distance measure that yields a numerical distance needs to be devised.
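All three measures are straightforward to compute from their definitions; a stdlib-only sketch (the two sample vectors are illustrative):

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numerical vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cos(angle): 0 when the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1 - dot / (na * nb)

def correlation_distance(a, b):
    """1 - Pearson correlation: 0 when perfectly correlated."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return 1 - cov / (sa * sb)

a, b = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(euclidean(a, b))             # ~3.74: the points are far apart
print(cosine_distance(a, b))       # ~0: same direction
print(correlation_distance(a, b))  # ~0: perfectly correlated
```

Note how the choice matters: b is far from a in the Euclidean sense, yet identical to it under the cosine and correlation measures, so the same pair of documents could land in the same cluster or different clusters depending on measure.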

**(c) Which clustering algorithm should be used?** Some of the popular techniques include K-means and variants of hierarchical clustering algorithms, but a document clustering problem can also use a custom text clustering algorithm. In general, the choice of algorithm is likely to be correlated with the choices made in (a) and (b) above.
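For a flavour of the hierarchical family, here is a naive single-linkage agglomerative clustering sketch in plain Python on four made-up 2-D points (a real application would use an optimized library implementation; the points and the target of two clusters are assumptions):

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerative(points, k):
    """Repeatedly merge the two closest clusters (single linkage)
    until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest cluster pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]   # merge j into i
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6)]
print(agglomerative(points, 2))  # [[0, 1], [2, 3]]
```

K-means, by contrast, fixes k upfront and iteratively reassigns points to the nearest centroid; hierarchical methods like this one instead build a full merge tree that can be cut at any level.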

As you can imagine, depending on the choices made for (a), (b) and (c) above, the clustering output is likely to be quite different – which corroborates my point about the need for analyst supervision in the modeling of an unsupervised learning algorithm.

I hope this was useful. We will continue looking at different aspects of predictive modeling in future articles.

image source: www.salford.ac.uk

A walk with our Chief Data Scientist Dr. Nilesh Karnik is always a delight. Add to that a pleasant monsoon morning in the beautiful surroundings of Powai, and you have a winner. We were both early to office for a client meeting. When we were 3% of the distance (we are an analytics company; nothing comes with approximation!) through to the client’s office, they called and requested us to postpone the meeting. In a world where customers are Gods, we were left with the not-so-attractive option of turning back on a water-logged, traffic-jammed road towards the safe haven of our office. With a bit of unplanned time on hand, we decided to take a break at Le Pain Quotidien (LPQ) on Central Avenue, Powai.

We sat in and ordered the delightful coffee at LPQ. Our conversation jumped from novelty in the method of serving the coffee to novelties in chess openings to rules of Bridge to ultimately my favourite casino game – Black Jack!

You should be able to connect the dots!

This is where I got a chance to pry into the ‘HOW of things’ with respect to Data Science. I have been searching for and recruiting data scientists for close to 2 years now, with some help from LinkedIn, blogs and some excellent content from O’Reilly seminars; so I am pretty much there with respect to the WHATs.

Since we were discussing Black Jack, Nilesh pensively pointed out Ed Thorp and his book ‘Beat the Dealer’. If you take a closer look at the story, there are two rival sets of Data Science experts. Edward Thorp is one set. The second set is the Casino “Analytics” team which marked him out, and changed the rules to counter his strategy. I am ordering a copy of “Beat the Dealer”; not to win Black Jack but to understand the science.

While Ed Thorp figured out a way to beat the dealer using data, the casino guys figured out a way to create rules that would neutralize Ed Thorp's strategy. In round 1, I am sure Ed used data from experiments done outside the casino to devise the method and perfected it using trial and error. The casino guys tried banning him and then used instances of his games to change the rules. The rules were changed without making them unattractive for the other players. In either case, the acting side used data and analytics. Both sides used augmentation and ran trial-and-error experiments before coming to actionable conclusions.

On the walk back, Nilesh asked me if I was in the positive or the negative with respect to my overall Black Jack experience across the globe. I told him Dutch casinos have been unkind, as they take your money on a tie; but overall I am in the positive because of a single large win at an Australian casino. And there I got another bit of Data Science history – ‘The Black Swan Theory’ by Nassim Nicholas Taleb. The context being that the black swan event averaged out my winnings at the Black Jack table to green; however, it subtly highlighted that this was an unpredictable event, because the history of my losses till then would never have predicted a big win using mathematical predictive techniques. If you look at it paradoxically, from the perspective of the casinos, Ed finding a method to beat them continuously was a Black Swan event. They analyzed instances to prevent it from happening in the future.

Through the above examples, Nilesh highlighted to me what goes on in his mind every time I bring a problem to him: why he is not willing to accept certain assumptions, and why his questions on certain less frequent events are so rigorous. The following rules can give non-practitioners a general understanding when interacting with practitioners of Data Science.

- **Rule 1:** Know thy data
- **Rule 2:** Be open to what you can augment & be selective with respect to what you augment into your data and models
- **Rule 3:** Data speaks; give it a canvas of methods to communicate to you
- **Rule 4:** Do not be prejudiced
- **Rule 5:** Build models to exploit positive black swans and restrict the impact of negative ones
- **Rule 6:** When cleansing data for use, do not introduce your influence into the data. Else the data will tell you the story that you want to hear.

Well, I am sure there are many more. These also define the tough job Nilesh and team do every day – “*Define products which use data science & big data to solve business problems.*”

And for that cuppa!! I can safely say LPQ will be seeing more of me and Nilesh at their outlet.

Predictive models use information from the past, i.e., historical data, to make an inference about the future. An implicit assumption is that historical patterns are going to repeat in the future. If this assumption is invalid for any reason, the prediction made by the model in question is unlikely to be reliable.

**Does making a predictive inference always require building a sophisticated model?**

Not necessarily. Even a simple correlation check can help make a predictive inference. Consider two time series, **x** and **y**. If **x**(t) is highly correlated with **y**(t+1), then having information about **x** at time t implies being able to predict **y** at time (t+1) with reasonable accuracy.
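A minimal illustration in plain Python (the two series are made up, with y deliberately constructed so that y(t+1) is an exact linear function of x(t)):

```python
# Toy series: y leads x by one step -- here y[t+1] == 2 * x[t] exactly.
x = [1.0, 2.0, 3.0, 2.0, 5.0, 4.0]
y = [0.0, 2.0, 4.0, 6.0, 4.0, 10.0]

def corr(a, b):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    sa = sum((p - ma) ** 2 for p in a) ** 0.5
    sb = sum((q - mb) ** 2 for q in b) ** 0.5
    return cov / (sa * sb)

# Correlate x(t) with y(t+1): align x[:-1] against y[1:].
lagged = corr(x[:-1], y[1:])
print(lagged)  # 1.0 here, since y(t+1) is an exact linear function of x(t)
```

A lagged correlation close to 1 (or -1) on real data would similarly suggest that observing x today helps predict y tomorrow, without any elaborate model.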

The key to building a good predictive model is not in using any fancy math but in ensuring that the dependent and independent variables are defined carefully and the fallacy of using future information to make an inference about that same future is avoided. Let me elaborate on this last point with an example. Assume that the available historical data includes customer behavior data – credit card payment history and responses to a quarterly loan offer from Q1-2012 to Q1-2014 – and the objective is to build a model to predict the customer’s response to the loan offer based on past payment history. One possible way of building this model is to use the response to the loan offer in Q1-2014 as the dependent variable and use the payment history from Q1-2012 to Q4-2013 to form independent variables. It would be incorrect to use payment history from Q1-2012 to Q1-2014 to form independent variables, though, because in that case we would be using the payment behavior in Q1-2014 to “predict” the response to the loan offer in the same timeframe!
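A sketch of this discipline with made-up numbers (the quarterly history, field names, and the simple aggregate features are all hypothetical): derive independent variables only from quarters strictly before the quarter of the offer.

```python
# Hypothetical data: per-quarter count of late payments for one customer.
history = {
    "2012Q1": 0, "2012Q2": 1, "2012Q3": 0, "2012Q4": 2,
    "2013Q1": 0, "2013Q2": 0, "2013Q3": 1, "2013Q4": 0,
    "2014Q1": 3,  # same quarter as the target -- must NOT be used
}
target_quarter = "2014Q1"  # quarter of the loan offer (dependent variable)

# Independent variables may only use quarters strictly BEFORE the target;
# otherwise we would be "predicting" with contemporaneous information.
# (String comparison works here because of the YYYY"Q"N key format.)
usable = {q: v for q, v in history.items() if q < target_quarter}

features = {
    "total_late_payments": sum(usable.values()),
    "quarters_observed": len(usable),
}
print(features)  # {'total_late_payments': 4, 'quarters_observed': 8}
```

The filter excludes the 2014Q1 observation, so the model sees only information that would actually have been available at prediction time.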

**Which modeling technique yields the best predictive model?**

There is no clear-cut answer to this question. Although certain business problems are more amenable to being modeled using certain kinds of statistical techniques, typically the efficacy of the model is determined by the data used to build it. A model that has access to a richer data source will generally be more effective. With a given data source, better results may be obtained by being creative about deriving new variables from the available data. As an example, consider the task of modeling the parabola y = x^2 using OLS regression on (x, y) values. Since the technique used is linear in its parameters, if x is taken as the independent variable, the results won’t be great. But if you define x^2 as a derived variable and use it as the independent variable for regression, the model will be a perfect fit!
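This is easy to verify with a few lines of plain Python using the closed-form simple OLS fit (the five sample points on the parabola are illustrative):

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [x ** 2 for x in xs]          # points on the parabola y = x^2

# Regressing y on x directly: the fitted slope is 0 by symmetry,
# so the "model" is just the flat line y = 2 and fits poorly.
a1, b1 = fit_line(xs, ys)
print(a1, b1)                      # 2.0 0.0

# Regressing y on the derived variable x^2: a perfect fit, y = 0 + 1 * x^2.
a2, b2 = fit_line([x ** 2 for x in xs], ys)
print(a2, b2)                      # 0.0 1.0
```

The technique is identical in both runs; only the derived independent variable changes, which is exactly the point about creativity with variables mattering more than the choice of algorithm.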

A predictive model is typically built using supervised learning (regression models, decision trees, etc.), but it is also possible to use unsupervised learning to make a predictive inference. Clustering algorithms, for example, use unsupervised techniques. Imagine a clustering solution obtained by clustering customer behavior data up to time t. If you overlay the customers’ response to a certain offer at time (t+1) on the clusters obtained previously and find that there is good variation in response values across clusters, then the clustering solution can be used to make a predictive inference about the response to that particular offer.
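A toy illustration of this overlay (the customer IDs, cluster assignments, and 0/1 responses are all made up): if the response rates differ sharply across clusters, the unsupervised clustering carries predictive information about the offer.

```python
# Customers already assigned to behavior clusters at time t...
cluster_of = {"c1": 0, "c2": 0, "c3": 1, "c4": 1, "c5": 1, "c6": 2}
# ...with their observed response (1 = yes, 0 = no) to an offer at t+1.
response = {"c1": 1, "c2": 1, "c3": 0, "c4": 0, "c5": 1, "c6": 0}

# Overlay: response rate per cluster.
by_cluster = {}
for cust, k in cluster_of.items():
    by_cluster.setdefault(k, []).append(response[cust])
rates = {k: sum(v) / len(v) for k, v in by_cluster.items()}
print(rates)  # {0: 1.0, 1: 0.333..., 2: 0.0}
```

With this much variation (100% vs. 33% vs. 0%), knowing only a new customer's cluster at time t already says a lot about how they are likely to respond at time t+1.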

I hope this was helpful. We will be covering more aspects of predictive models in later articles.
