Unsupervised learning algorithms are machine learning algorithms that work without a desired output label. A supervised machine learning algorithm typically learns a function that maps an input x into an output y, while an unsupervised learning algorithm simply analyzes the x’s without requiring the y’s. Essentially, the algorithm attempts to estimate the underlying structure of the population of x’s (in other words, the algorithm looks for natural groupings or clusters within the x’s). This is similar to the problem of density estimation in statistics. Typical unsupervised learning algorithms include clustering algorithms like K-means or hierarchical clustering methods. Some types of neural networks – like Self Organizing Maps and Adaptive Resonance Theory models – also follow the unsupervised learning paradigm.
What kinds of problems need the use of an unsupervised learning technique? Let us look at an example. Imagine a situation where a large number of text documents need to be classified into certain categories in an automated fashion. If the categories into which the documents need to be classified are known upfront and if a good sized training sample is available (i.e., if a subset of documents with the corresponding category labels are available), then we can use a supervised classification algorithm to classify these documents. If, however, the categories are not known upfront, then an unsupervised algorithm will be required. In this latter case, the problem can be decomposed into 2 steps:
- Group similar documents together; and
- Identify descriptions for the document groups identified in step 1 in order to identify distinct categories.
The first step needs an unsupervised learning technique to analyze the population of documents and group similar documents together.
The one point that I want to emphasize here is that the adjective “unsupervised” does not mean that these algorithms run by themselves without human supervision. It simply indicates the absence of a desired or ideal output corresponding to each input. An analyst (or a data scientist) who is training an unsupervised learning model has to exercise a similar kind of modeling discipline as does the one who is training a supervised model. Alternatively, an analyst who is training an unsupervised learning model can exercise a similar amount of control on the resulting output by configuring model parameters as does the one who is training a supervised model. While supervised algorithms derive a mapping function from x to y so as to accurately estimate the y’s corresponding to new x’s, unsupervised algorithms employ predefined distance/similarity functions to map the distribution of input x’s. The accuracy of the output, therefore, depends heavily upon how effectively the analyst is able to represent the inputs as well as their choice of similarity measure. Let me elaborate on this last point using the document clustering example referred to in the previous paragraph. What are the key choices that an analyst needs to make when modeling the document clustering problem?
Defining a way to represent a document as one data point for the purpose of analysis.
(This point is equally relevant for a supervised document classification scenario) There are innumerable ways of doing this. For example, a document could be represented as a long vector containing all the distinct words in the document; or it could be represented as a vector containing a subset of the words – which turn out to be significant based on some chosen measure; or it could even be represented in terms of a numerical vector which contains numbers based on some measures on the document, e.g. the total number of words, the number of sections in the document, etc. The choice of representation will depend on the nature of the documents (e.g. documents containing news articles will need different treatment than those containing stories) and to some extent also on the clustering algorithm used. In practice, the documents will likely need to be pre-processed with a feature extraction algorithm – the output of which will be a feature vector that can be used to represent the document as one data point.
Defining a measure to calculate the similarity – or equivalently, the distance – between two documents.
If the feature vectors are numerical, one popular distance measure is the Euclidean distance which is close to the human perception of physical distance in 3 dimensions. But this is definitely not the only possible choice for a distance measure. There are several other measures, e. g. the cosine distance measures the angular distance between two vectors or the correlation distance measures the correlation coefficient between two vectors. If the feature vectors are non-numerical, a distance measure that yields a numerical distance needs to be devised.
Choosing a clustering algorithm.
Some of the popular techniques include K-means or variants of hierarchical clustering algorithms, but a document clustering problem can also use a custom text clustering algorithm. In general, the choice of an algorithm is likely to be correlated with choices made in (a) and (b) above.
As you can imagine, depending on the choices made for (a), (b) and (c) above, the clustering output is likely to be quite different – which corroborates my point about the need for analyst supervision on the modeling of an unsupervised learning algorithm.
I hope this was useful. We will continue looking at different aspects of predictive modeling in future articles.
image source: www.salford.ac.uk