Consider the example of John Doe. John works with a technology firm as a senior Big Data developer. The application he works on processes Petabytes of data regularly, and John has scant respect for data sizes less than a Gigabyte. In fact, he had got so used to the Big Data world that, till recently, any power of 10 where the index was less than 9 looked like a small number to him. This perception was altered one day when John’s younger cousin, who had recently graduated from High School and had a liking for numbers, pointed out some interesting facts to John:

- At 30, John is already older than most of his teammates. He was surprised, however, to find that the total time he had spent since his birth was only about 9.5 x 10^{8} seconds. At approx. 70 beats per minute, his heart beats about 3.7 x 10^{7} times a year; he takes about 8.4 x 10^{6} breaths a year; and the total number of meals he has eaten so far is somewhere in the range of 30,000 to 40,000!
- When John learnt the idea of a countably infinite set for the first time, he associated it with the number of hairs on his head, having tried in vain to count them earlier. He was surprised to find that the number of strands of hair on an average human head is only of the order of 10^{5}!
- John drives to work and is irked by the increasing traffic in his city. The ever-increasing number of cars on the streets appears to him like another example of countably infinite objects. He was amused to find that when he bought his car in 2013, it was one of approximately 65 x 10^{6} cars produced in the entire world in that year.
- The distance from the Earth to the Moon is about 3.84 x 10^{5} kilometers (km). When John compared this to his monthly driving average of 1000 km, he realized that it would take him over 30 years to cover a total driving distance equivalent to the distance to the Moon. Alternatively, if John were to drive 500 km per day and continue driving day after day non-stop, it would still take him more than 2 years to cover this distance! It would take John several lifetimes of driving if he attempted to cover a distance equivalent to that from the Earth to the Sun (approximately 1.5 x 10^{8} km).
- The length of the Earth’s circumference is about 4 x 10^{4} km. If John attempted to cover this distance by driving about 500 km daily, it would take him about 80 days to do so, which reminded him of Jules Verne’s classic novel *Around the World in Eighty Days*.
- John is fond of reading books and has read close to a thousand books in his lifetime. He knows that this is but a tiny fraction of the total number of books available out there, but was amused to know that the total number of books ever published in this world was estimated at about 1.3 x 10^{8} in 2010!
- Finally, John realized that if he were to start counting natural numbers at the rate of 1 per second, and continued counting day and night without taking a break, it would still take him over 30 years to count up to 1 billion (10^{9})!

Today John Doe continues to excel in Big Data technology, but he now harbors a healthy respect for numbers which don’t look large in the context of data sizes.
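Most of these figures can be checked with a few lines of arithmetic. The sketch below is only a sanity check; it assumes a 365.25-day year and uses the distances and rates quoted above.

```python
# Rough arithmetic behind the figures above (assumes a 365.25-day year).
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60      # about 3.16 x 10^7

age_seconds = 30 * SECONDS_PER_YEAR           # seconds in 30 years
heartbeats_per_year = 70 * 60 * 24 * 365.25   # 70 beats per minute for one year

moon_km = 3.84e5                              # Earth-Moon distance in km
years_to_moon_monthly = moon_km / 1000 / 12   # driving 1000 km per month
years_to_moon_daily = moon_km / 500 / 365.25  # driving 500 km per day

days_around_earth = 4e4 / 500                 # Earth's circumference at 500 km per day
years_to_count_billion = 1e9 / SECONDS_PER_YEAR  # counting one number per second

print(f"Seconds in 30 years:       {age_seconds:.2e}")
print(f"Heartbeats per year:       {heartbeats_per_year:.2e}")
print(f"Years to Moon (1000/mo):   {years_to_moon_monthly:.1f}")
print(f"Years to Moon (500/day):   {years_to_moon_daily:.1f}")
print(f"Days around the Earth:     {days_around_earth:.0f}")
print(f"Years to count to 10^9:    {years_to_count_billion:.1f}")
```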

**Sources of stats**

http://mathworld.wolfram.com

http://www.oica.net/category/production-statistics/2013-statistics

http://www.space.com

http://mashable.com/2010/08/05/number-of-books-in-the-world/

http://bionumbers.hms.harvard.edu

What kinds of problems need the use of an unsupervised learning technique? Let us look at an example. Imagine a situation where a large number of text documents need to be classified into certain categories in an automated fashion. If the categories into which the documents need to be classified are known upfront and if a good-sized training sample is available (i.e., if a subset of documents with the corresponding category labels is available), then we can use a supervised classification algorithm to classify these documents. If, however, the categories are not known upfront, then an unsupervised algorithm will be required. In this latter case, the problem can be decomposed into two steps:

- Group similar documents together; and
- Describe the document groups found in the first step in order to identify distinct categories.

The first step needs an unsupervised learning technique to analyze the population of documents and group similar documents together.

The one point that I want to emphasize here is that the adjective “unsupervised” does not mean that these algorithms run by themselves without human supervision. It simply indicates the absence of a desired or ideal output corresponding to each input. An analyst (or a data scientist) who is training an unsupervised learning model has to exercise the same kind of modeling discipline as one who is training a supervised model, and can exercise a similar amount of control over the resulting output by configuring model parameters. While supervised algorithms derive a mapping function from x to y so as to accurately estimate the y’s corresponding to new x’s, unsupervised algorithms employ predefined distance/similarity functions to map the distribution of the input x’s. The quality of the output, therefore, depends heavily on how effectively the analyst represents the inputs and on the choice of similarity measure. Let me elaborate on this last point using the document clustering example above. What are the key choices that an analyst needs to make when modeling the document clustering problem?

(a) How to represent each document as a data point. (This point is equally relevant for a supervised document classification scenario.) There are innumerable ways of doing this. For example, a document could be represented as a long vector containing all the distinct words in the document; or it could be represented as a vector containing a subset of the words, namely those that turn out to be significant based on some chosen measure; or it could even be represented as a numerical vector based on some measures of the document, e.g. the total number of words, the number of sections in the document, etc. The choice of representation will depend on the nature of the documents (e.g. documents containing news articles will need different treatment than those containing stories) and to some extent also on the clustering algorithm used. In practice, the documents will likely need to be pre-processed with a feature extraction algorithm, the output of which will be a feature vector that can be used to represent the document as one data point.
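As an illustration of the first kind of representation, the sketch below turns each document into a vector of word counts over a shared vocabulary. The sample documents, the naive tokenizer and the function names are assumptions made for this example; a real feature extraction step would usually be more elaborate.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text, split on whitespace and strip basic punctuation."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w]

def build_vocabulary(documents):
    """Return the sorted list of distinct words across all documents."""
    return sorted({word for doc in documents for word in tokenize(doc)})

def to_count_vector(text, vocabulary):
    """Represent one document as word counts, ordered by the vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

# Hypothetical documents, used only to show the shape of the output.
docs = [
    "The market rallied as stocks rose.",
    "Stocks fell and the market dropped.",
    "The team won the match.",
]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(d, vocab) for d in docs]

for doc, vec in zip(docs, vectors):
    print(vec, "<-", doc)
```

Every document ends up as one point in a space with as many dimensions as there are distinct words, which is the form a clustering algorithm expects.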

(b) How to measure the distance between two documents. If the feature vectors are numerical, one popular distance measure is the Euclidean distance, which is close to the human perception of physical distance in 3 dimensions. But this is definitely not the only possible choice for a distance measure. There are several other measures: for example, the cosine distance is based on the angle between two vectors, and the correlation distance is based on the correlation coefficient between them. If the feature vectors are non-numerical, a distance measure that yields a numerical distance needs to be devised.
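The three measures mentioned here can be sketched in a few lines. The definitions below follow one common convention (cosine distance as one minus the cosine similarity, correlation distance as one minus the Pearson correlation coefficient) and assume non-zero, non-constant vectors of equal length.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two numerical vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_distance(u, v):
    """One minus the cosine of the angle between u and v (non-zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def correlation_distance(u, v):
    """One minus the Pearson correlation of u and v (non-constant vectors)."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    centered_u = [a - mean_u for a in u]
    centered_v = [b - mean_v for b in v]
    return cosine_distance(centered_u, centered_v)

u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]  # same direction as u, twice the length

print(euclidean_distance(u, v))    # clearly non-zero
print(cosine_distance(u, v))       # approximately zero
print(correlation_distance(u, v))  # approximately zero
```

The two vectors are far apart in the Euclidean sense yet essentially identical under the cosine and correlation measures, which is why the choice of measure can change which documents are considered similar.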

(c) Which clustering algorithm to use. Popular techniques include K-means and variants of hierarchical clustering, but a document clustering problem can also use a custom text clustering algorithm. In general, the choice of an algorithm is likely to be correlated with the choices made in (a) and (b) above.
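For reference, a bare-bones version of K-means (Lloyd's algorithm) with Euclidean distance is sketched below. It is a simplified illustration, not a production implementation: it assumes the first k points are acceptable initial centroids and runs a fixed number of iterations, whereas practical implementations use better initialization and a convergence check. The sample points are made up.

```python
import math

def kmeans(points, k, iterations=20):
    """Basic Lloyd's algorithm using the first k points as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical 2-D feature vectors forming two well-separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Swapping `math.dist` for a different distance function, or feeding in a different document representation, changes the grouping, which is how the choices in (a) and (b) interact with the choice of algorithm.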

As you can imagine, depending on the choices made for (a), (b) and (c) above, the clustering output is likely to be quite different – which corroborates my point about the need for analyst supervision on the modeling of an unsupervised learning algorithm.

I hope this was useful. We will continue looking at different aspects of predictive modeling in future articles.

image source: www.salford.ac.uk


Predictive models use information from the past, i.e., historical data, to make an inference about the future. An implicit assumption is that historical patterns are going to repeat in the future. If this assumption is invalid for any reason, the prediction made by the model in question is unlikely to be reliable.

Building a predictive model does not necessarily require sophisticated mathematics. Even a simple correlation check can help make a predictive inference. Consider two time series, **x** and **y**. If **x**(t) is highly correlated with **y**(t+1), then having information about **x** at time t makes it possible to predict **y** at time (t+1) with reasonable accuracy.
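A minimal sketch of such a lagged correlation check follows. The two series are made up for illustration, with **y** constructed to follow **x** by one time step, and the helper names are assumptions of this example.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

def lagged_correlation(x, y, lag=1):
    """Correlation between x(t) and y(t + lag); lag must be at least 1."""
    return pearson(x[:-lag], y[lag:])

x = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
# Hypothetical y that trails x by one step: y(t+1) = 2 * x(t) + 1.
y = [0.0] + [2 * value + 1 for value in x[:-1]]

print("same-time correlation:", round(pearson(x, y), 3))
print("lag-1 correlation:    ", round(lagged_correlation(x, y, lag=1), 3))
```

Here the lag-1 correlation is much stronger than the same-time correlation, so the value of **x** at time t carries usable information about **y** at time (t+1).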

The key to building a good predictive model is not in using any fancy math but in ensuring that the dependent and independent variables are defined carefully and that the fallacy of using future information to make an inference about the same future is avoided. Let me elaborate on this last point with an example. Assume that the available historical data includes customer behavior data, including credit card payment history and response to a quarterly loan offer, from Q1-2012 to Q1-2014, and the objective is to build a model to predict the customer’s response to the loan offer based on past payment history. One possible way of building this model is to use the response to the loan offer in Q1-2014 as the dependent variable and use the payment history from Q1-2012 to Q4-2013 to form independent variables. It would be incorrect to use payment history from Q1-2012 to Q1-2014 to form independent variables though, because in that case we would be using the payment behavior in Q1-2014 to “predict” the response to the loan offer in the same timeframe!
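The time split in this example can be enforced in code so that the target quarter never leaks into the predictors. The record layout, the field names and the function below are hypothetical, chosen only to illustrate the idea.

```python
# Quarters in chronological order, Q1-2012 through Q1-2014.
QUARTERS = [f"Q{q}-{year}" for year in (2012, 2013) for q in (1, 2, 3, 4)]
QUARTERS.append("Q1-2014")

def split_features_and_target(history, target_quarter):
    """Use only quarters strictly before target_quarter as predictors."""
    cutoff = QUARTERS.index(target_quarter)
    feature_quarters = QUARTERS[:cutoff]
    features = {q: history["payments"][q] for q in feature_quarters}
    target = history["loan_response"][target_quarter]
    return features, target

# Hypothetical record for one customer (1 = paid on time / accepted the offer).
customer = {
    "payments": {q: 1 for q in QUARTERS},
    "loan_response": {q: 0 for q in QUARTERS},
}
customer["loan_response"]["Q1-2014"] = 1

features, target = split_features_and_target(customer, "Q1-2014")
print(sorted(features))  # Q1-2012 .. Q4-2013 only
print(target)
```

Because the cutoff is derived from the target quarter, payment behavior from Q1-2014 can never end up among the independent variables.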

There is no clear-cut answer to the question of which modeling technique works best. Although certain business problems are more amenable to being modeled using certain kinds of statistical techniques, typically the efficacy of the model is determined by the data used to build it. A model that has access to a richer data source will generally be more effective. With a given data source, better results may be obtained by being creative about deriving new variables from the available data. As an example, consider the task of modeling the parabola y=x^{2} using OLS regression on (x,y) values. Since the technique used is linear in its parameters, if x is taken as the independent variable, the results won’t be great. But if you define x^{2} as a derived variable and use it as the independent variable for regression, the model will be a perfect fit!
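The parabola example is easy to reproduce. The sketch below fits a one-variable OLS line twice, once on x and once on the derived variable x^{2}, and compares the R-squared values. The sample points are an assumption: they are chosen symmetric around zero, which is the case where the straight-line fit on x does worst.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def r_squared(xs, ys, intercept, slope):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [value ** 2 for value in x]

# Regress y on x directly.
r2_raw = r_squared(x, y, *fit_line(x, y))

# Regress y on the derived variable x^2 instead.
x_squared = [value ** 2 for value in x]
r2_derived = r_squared(x_squared, y, *fit_line(x_squared, y))

print("R^2 using x:  ", round(r2_raw, 3))
print("R^2 using x^2:", round(r2_derived, 3))
```

On this data the line fitted on x explains none of the variation, while the line fitted on x^{2} explains all of it. On a one-sided range of x the first fit would look better, but it would still not be exact.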

A predictive model is typically built using supervised learning (regression models, decision trees, etc.), but it is possible to use unsupervised learning to make a predictive inference. Clustering algorithms use unsupervised techniques. Imagine a clustering solution obtained by clustering customer behavior data up to time ‘t’. If you overlay the customers’ response to a certain offer at time (t+1) on the clusters obtained previously and find that there is a good variation in response values across clusters, then the clustering solution can be used to make a predictive inference about response to that particular offer.
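The overlay described here amounts to computing the response rate within each cluster. The cluster labels and responses below are invented for illustration; in practice the labels would come from clustering behavior data up to time t, and the responses would be observed at time (t+1).

```python
from collections import defaultdict

def response_rate_by_cluster(cluster_labels, responses):
    """Fraction of positive responses (1 = responded) within each cluster."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for cluster, response in zip(cluster_labels, responses):
        totals[cluster] += 1
        positives[cluster] += response
    return {cluster: positives[cluster] / totals[cluster] for cluster in totals}

# Hypothetical data: one cluster label and one observed response per customer.
cluster_labels = [0, 0, 0, 0, 1, 1, 1, 1]
responses = [1, 1, 1, 0, 0, 0, 1, 0]

rates = response_rate_by_cluster(cluster_labels, responses)
print(rates)
```

A large gap in response rates between clusters suggests that cluster membership alone says something about the likely response of a new customer assigned to that cluster.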

I hope this was helpful. We will be covering more aspects of predictive models in later articles.
