Until now we have learnt primarily about Descriptive Analytics. Next in the study of Analytics, in terms of value addition, is Predictive Analytics. What we are going to learn now is better described as Supervised Learning than as purely "Predicting" an outcome. In all the following topics, we will have a Response, Outcome, Target or Y variable, and a group of Input variables. The objective could be exploratory or inferential as well as predictive.
The Outcome could be a continuous number (numeric) or a category (qualitative). Based on this we can have two scenarios. In one, we could be interested in predicting the monthly credit card spending of a person. In the other, we could be predicting which segment the person belongs to, when we have created three segments: high spenders, average spenders and low spenders. This is a very important distinction, as the techniques, the accuracy measures of the model and the best practices differ significantly between these two cases.
Let us get the gist of Predictive Analytics with a simple case involving a numeric response (Y) and a numeric input (X) variable. Please see the calculations in the Example sheet from the link provided above. The X-Y plot of Annual Car Sales vs GDP for 136 countries looks something like the plot below. The correlation between them is 0.78, indicating a strong positive relationship. The red Trend line captures the essence of the relationship.
It attaches a number to the strength of the relationship.
It generalizes the relationship, as it can be used for values other than those captured here.
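The two roles of the trend line described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `gdp` and `car_sales` figures are made up (they are not the 136-country data from the Example sheet), and the helper names `pearson_r` and `linear_trend` are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def linear_trend(xs, ys):
    """Least-squares straight line; returns (slope, intercept)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical figures, NOT the data from the Example sheet.
gdp = [5, 12, 20, 28, 35, 41, 50]
car_sales = [3, 9, 11, 19, 22, 30, 31]

r = pearson_r(gdp, car_sales)
slope, intercept = linear_trend(gdp, car_sales)
print(f"correlation = {r:.2f}")
print(f"trend line: sales = {slope:.3f} * gdp + {intercept:.3f}")

# Generalizing: estimate sales for a GDP value that is not in the data.
print(f"estimate at gdp = 45: {slope * 45 + intercept:.1f}")
```

The correlation and the slope put a number on the relationship, while the last line uses the fitted line for an X value that was never observed.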
We have shown two cases: a Linear Trend Line and a Polynomial Trend Line (order 2). The Linear Trend Line is shown in the top pair of plots, whereas the Polynomial Trend Line is shown in the bottom pair. Each of them has been generated for two cases: all countries included, and four countries (Belize, Liberia, China and US) excluded. These four countries are extreme values at both ends, and by removing them we are going to test an important fundamental property of Predictive Analytics techniques.
With the Linear trend line, the error with all data was 2.4 times Average Car Sales, whereas with the Polynomial trend line it was 2.04 times the same. When we excluded the 4 countries, the Linear trend line did not change drastically (the coefficient is 817 vs. 822); however, there was a 53% reduction in error. In this case, the predictions for the extreme values were quite off, and that resulted in a big error. Removing these extreme values brought down the error without changing the nature of the line drastically.
When we used the Polynomial Trend line, removal of the 4 countries changed the Polynomial drastically (the positive coefficient is 1541 vs. 877). The reduction in error is merely 24%, as opposed to 53% with the Linear trend line.
The Linear Trend line is a relatively simple technique, which tries to capture the broad trend and does not capture the extreme values (the 4 countries). By doing that, it sacrifices accuracy but shows stability (it is not affected much when the data changes).
The Polynomial Trend line is a relatively sophisticated technique, which has the capability to follow the wiggles in the data. By doing that, it becomes more accurate but turns out to be less stable (it changes a lot when the data changes).
A simpler, linear technique suffers from high bias (it cannot follow the wiggles) but shows less variance error. On the other hand, a more complicated technique shows less bias (it can capture the wiggles) but suffers from variance error (the fit is not reproducible on another data set). Although the errors and changes have been exaggerated in this simple example, this phenomenon is true for almost all cases.
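The comparison above can be repeated in miniature. The sketch below fits a straight line and an order 2 polynomial, once on all points and once with the extreme points removed, and prints the coefficients and the squared error so that the two can be compared. Everything in it is an assumption made for illustration: the `core` and `extremes` points are invented, and `fit_polynomial` is a small hand-written least-squares helper, not the routine used in the Example sheet.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients [c0, c1, ..., c_degree]."""
    n = degree + 1
    # Normal equations (A^T A) c = A^T y, where A has columns 1, x, x^2, ...
    m = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(n)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    # Back substitution.
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (m[i][n] - tail) / m[i][i]
    return coeffs

def predict(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))

def sse(coeffs, xs, ys):
    """Sum of squared deviations between actual and fitted values."""
    return sum((y - predict(coeffs, x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical (x, y) points: a broad upward trend plus four extreme values.
core = [(2, 1.9), (4, 3.1), (6, 5.2), (8, 6.1), (10, 8.4),
        (12, 9.0), (14, 11.5), (16, 12.2), (18, 14.8), (20, 15.1)]
extremes = [(0.5, 4.0), (1, 5.0), (60, 20.0), (70, 95.0)]

def summarize(points, degree):
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    coeffs = fit_polynomial(xs, ys, degree)
    return coeffs, sse(coeffs, xs, ys)

for degree, name in ((1, "linear"), (2, "polynomial (order 2)")):
    all_coeffs, all_sse = summarize(core + extremes, degree)
    cut_coeffs, cut_sse = summarize(core, degree)
    print(name)
    print("  all points:      ", [round(c, 3) for c in all_coeffs], "SSE", round(all_sse, 1))
    print("  extremes removed:", [round(c, 3) for c in cut_coeffs], "SSE", round(cut_sse, 1))
```

On the same data the order 2 fit can never have a larger squared error than the straight line, because the straight line is a special case of it. How much each set of coefficients moves when the extreme points are dropped is the stability question discussed above.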
This emanates from an important trade-off in the area of Predictive Model building. We need to strike a balance between "Simplification" and "Complication" in Machine Learning. Either extreme may lead to large errors in our results. On one extreme, we may over-simplify the case by using a very simple model. As an example, we may use a Simple Linear Regression for a scenario which is definitely non-linear, like the Law of Diminishing Returns. In this case the inability of the model to capture non-linearity (flattening of the curve etc.) may result in large "bias". The other extreme could be using a highly sophisticated model, like Trees or Neural Nets (we will learn about them later), which are capable of modelling very complex relationships, so much so that they start capturing the noise too (overfitting). In this case we may experience large "variance".
The optimum lies somewhere in between, where we are able to model the phenomenon correctly by capturing the sustainable trend, without going overboard and capturing the random parts too.
How do we manage the trade-off between bias and variance error? It is theoretically possible to progressively reduce bias by using more and more sophisticated techniques. As discussed earlier, Neural Networks, Decision Trees, Support Vector Machines etc. have the ability to model the local variations (noise) too and can give very low error on the data used to build them. But the noise is random in nature, and as soon as we use a new data set, where this noise is not present, we face high variance error. A remedy is to partition the available data into two parts: one for developing the predictive model (training data) and one for checking the variance error (validation data).
The error on training data (bias or training error) and on validation data (variance or validation error) diminishes initially, but at some point the validation error starts to rise again. The best choice is the point where the validation error is at its minimum, just at the turning point. The picture on the right side illustrates this phenomenon.
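The partitioning idea can be sketched as follows. The example assumes a made-up diminishing-returns relationship (y = sqrt(x) plus random noise), a 28/13 split between training and validation data, and polynomial order as the measure of sophistication; none of these choices come from the Example sheet, and `fit_polynomial` is a hypothetical hand-written helper.

```python
import random
from math import sqrt

def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations."""
    n = degree + 1
    m = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(n)]
    for col in range(n):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    coeffs = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (m[i][n] - tail) / m[i][i]
    return coeffs

def mse(coeffs, points):
    """Mean squared error of the fitted polynomial on a list of (x, y) points."""
    errors = [(y - sum(c * x ** i for i, c in enumerate(coeffs))) ** 2
              for x, y in points]
    return sum(errors) / len(errors)

rng = random.Random(0)  # fixed seed so the run is repeatable
# Hypothetical data: a flattening curve plus random noise.
points = [(k / 40, sqrt(k / 40) + rng.gauss(0, 0.05)) for k in range(41)]
rng.shuffle(points)
train, valid = points[:28], points[28:]  # training data and validation data

degrees = range(1, 6)
train_err, valid_err = {}, {}
for d in degrees:
    coeffs = fit_polynomial([x for x, _ in train], [y for _, y in train], d)
    train_err[d] = mse(coeffs, train)
    valid_err[d] = mse(coeffs, valid)
    print(f"order {d}: training MSE {train_err[d]:.5f}, validation MSE {valid_err[d]:.5f}")

# Pick the order with the lowest validation error.
best_degree = min(valid_err, key=valid_err.get)
print("chosen order:", best_degree)
```

The training error can only go down as the order increases, so it cannot be used to choose the model. The validation error is the one that reveals when extra sophistication stops paying off.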
Although we discussed this for a numeric Outcome, this idea holds good for qualitative Outcomes too. One should always use one's common sense and business acumen (the two ingredients apart from Software and Machines) to judge the best method, as there is never an absolute best model in Analytics. It all depends on the context and our objective.
Our analysis was based on relative accuracies of the two approaches. Let us understand some accuracy measures for Predictive Analytics techniques.
In the case of a numerical outcome, the error is simply the deviation of the fitted value (predicted value) from the actual value. Since the deviations can take positive as well as negative signs, we square them to cancel the sign and add all the squared deviations. This is called the Residual Sum of Squares (RSS, also written SSR for Sum of Squared Residuals) or Error Sum of Squares (SSE). Depending on the Tool used, we may see either of them. One can follow their calculations in the Example sheet. SSR or SSE can be averaged to get the Mean Squared Error (MSE), and the square root of MSE can be taken to get the Root Mean Square Error (RMSE).
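As a small worked example of these measures, the sketch below computes SSE, MSE and RMSE for three made-up actual and predicted values. It assumes the average is taken over the number of observations; some tools divide by the degrees of freedom instead. The function name `error_measures` is hypothetical.

```python
from math import sqrt

def error_measures(actual, predicted):
    """Return (SSE, MSE, RMSE) for paired actual and predicted values."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    sse = sum(e ** 2 for e in residuals)  # squaring cancels the sign
    mse = sse / len(residuals)            # assumption: average over n observations
    return sse, mse, sqrt(mse)

# Made-up values: the residuals are 1, 0 and -2, so SSE = 1 + 0 + 4 = 5.
actual = [3, 5, 7]
predicted = [2, 5, 9]
sse, mse, rmse = error_measures(actual, predicted)
print(f"SSE = {sse}, MSE = {mse:.3f}, RMSE = {rmse:.3f}")
```

RMSE is in the same units as the outcome, which usually makes it easier to interpret than SSE or MSE.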
In the case of qualitative outcomes, it is a bit more involved and several measures are used. Please look at the illustration below.
Let us consider a case where the objective is to predict two classes of an Outcome. For example, one class could be a "fraudulent" insurance claim and the other class could be a "genuine" insurance claim. Let us call the "fraudulent" case positive and the "genuine" case negative. Positive is usually used for critical cases, which are rarer, and negative is used for normal cases. It has nothing to do with the intention, ethics etc. attached to a class.
Let us look at the right box above showing "a", amongst the four boxes. This is the count of cases where the actual class was "positive" and the prediction also turned out to be "positive". This is called "True Positive". In a similar fashion, d is "True Negative". Along similar lines, b and c are called "False Positive" and "False Negative" respectively.
The Accuracy of the model is simply the True cases divided by all the cases, which here is (a+d)/(a+b+c+d). There are two other measures which are used jointly. Precision denotes the effectiveness of the model, as it shows how many of the cases predicted as positive actually belong to the critical (positive) class.
Recall, on the other hand, denotes the completeness of the model. It shows how many of the actual critical cases were picked up by the model. Recall is also known as "Sensitivity". Specificity is the ability of the model to rule out the unimportant (negative) class, that is, how many of the actual negative cases were predicted as negative.
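Using the four counts as defined above (a = True Positive, b = False Positive, c = False Negative, d = True Negative), the measures work out to Precision = a/(a+b), Recall = a/(a+c) and Specificity = d/(b+d). The sketch below evaluates them for made-up counts of insurance claims; the counts and the function name `classification_measures` are assumptions for illustration only.

```python
def classification_measures(a, b, c, d):
    """a = True Positive, b = False Positive, c = False Negative, d = True Negative."""
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "precision": a / (a + b),    # of the predicted positives, how many are positive
        "recall": a / (a + c),       # of the actual positives, how many were found
        "specificity": d / (b + d),  # of the actual negatives, how many were ruled out
    }

# Hypothetical counts: 60 fraudulent claims among 1000, 40 of them caught.
measures = classification_measures(a=40, b=10, c=20, d=930)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

In this made-up case Accuracy is 0.97 even though one in three fraudulent claims is missed, which is why Accuracy alone can be misleading when the positive class is rare.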
Precision and Recall are especially important, and often there is a trade-off between them. The use of one over the other depends on the objective. Let us consider a simple example. The Traffic police of our city launch a drive to prevent driving under the "influence". They can use two approaches. They can watch each driver very closely, assess the driving from several aspects, and stop and check the driver only when absolutely confident. In this case, their Precision would be very high; however, their Recall may suffer, as they may miss several drunk drivers.
On the other hand, they can stop drivers who show even one of the characteristics of driving under the influence. In this case, they may stop several drivers but find that a lot of them were not drunk. Here the Precision suffers; however, Recall would be high, as there is only a remote chance of a drunk person evading the test.
Usually one optimizes one of these measures while maintaining a minimum criterion for the others. For example, Precision can be maximized while maintaining at least 50% Recall.
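One way to apply such a rule is to vary the cut-off on a model's score and keep the cut-off with the highest Precision among those that still reach the minimum Recall. The sketch below assumes the model produces a score between 0 and 1 for each case and that a case is predicted positive when its score is at or above the cut-off; the scores, labels and function names are all made up for illustration.

```python
def precision_recall(scored, threshold):
    """scored is a list of (score, label) pairs, with label 1 for the positive class."""
    tp = sum(1 for s, y in scored if s >= threshold and y == 1)
    fp = sum(1 for s, y in scored if s >= threshold and y == 0)
    fn = sum(1 for s, y in scored if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def best_threshold(scored, min_recall=0.5):
    """Highest-precision cut-off whose recall is at least min_recall, or None."""
    candidates = []
    for t in sorted({s for s, _ in scored}):
        p, r = precision_recall(scored, t)
        if r >= min_recall:
            candidates.append((p, r, t))
    return max(candidates) if candidates else None

# Hypothetical model scores with the true class of each case.
scored = [(0.95, 1), (0.90, 1), (0.85, 0), (0.80, 1), (0.70, 0),
          (0.60, 1), (0.50, 0), (0.40, 1), (0.30, 0), (0.20, 0)]

precision, recall, threshold = best_threshold(scored, min_recall=0.5)
print(f"cut-off {threshold}: precision {precision:.2f}, recall {recall:.2f}")
# prints "cut-off 0.8: precision 0.75, recall 0.60"
```

A stricter cut-off of 0.90 would give perfect Precision on this data, but it would find only 2 of the 5 positive cases (40% Recall), so it fails the minimum criterion.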
The Y Variable could be numeric or qualitative, and in the same way the inputs can also be numeric, qualitative or a combination of them. Depending on the nature of these variables, we have several scenarios, each of which requires the use of a specific type of Predictive Analytics technique. This is shown in the following picture.
With all these introductory discussions, we are ready to look at the specific Analytics techniques in detail. We will cover the following Model Driven Techniques, based on Classical Statistical methods.