Linear Regression

Linear Regression Model is the most commonly used Predictive Analytics technique, and one of the classical techniques. In simple terms, it is a technique to model an Outcome as a linear function of one or multiple inputs. We would learn this technique, as a step-by-step process through an example in Excel. Please download the Example Sheet given at the above link. In this simple example, we have only one x (independent variable). But the concepts are same even for multiple x variables also. The univariate Regression Model is called **Simple Linear Regression**, whereas the one involving multiple X variables is **Multiple Linear Regression**.

The example in the downloadable Example Sheet lists two variables for 29 students. One variable is the Class X marks and other is University GPA (out of 4). Please use the Example sheet and the following discussions in parallel for better understanding of the model.

We would be interested in knowing whether Class X marks impact University Grades, and if yes, then what is the impact (quantify it). The X-Y plot, and Trend Line looks something like below. Visually it appears that there is a strong relationship, and evidenced by high correlation too (0.87). The Line fitted among the cloud of points captures the relationship, and it helps in two ways.

- The sign and magnitude of the coefficients show the strength of relationship.
- The equation generalizes the data as it should be valid for other data points too.

The above need not be true always, and whether the Regression Line has any real use, is something which we test with lot of diagnostics. We will discuss them now.

Please see the results in the Example Sheet. Let us examine each one of them one by one. First of all, the Regression Model assumes that

- Y (Grades) = a + b*X (Class X Marks) +
**error**

The line is kind of a "**Best fit**" line for the data points. The "best fit" is obtained by minimizing the sum of squares of the vertical distances from the data point to the line (the Red double arrow in the top right picture). This technique of finding the best fit line is called "__ Least Square Error Method__". Due to this a Simple Linear Regression is also known as

The Least Square Error method yields simple calculations for a and b in univariate case. b for univariate case is calculated by summing the products of deviation of Y and X from their Mean, taking the average and dividing that by variance of X. Check the match between manual calculation and results from Excel output. Once we have b (coefficient for X variable), finding the intercept is easy. It is simply

- a = Mean of Y – b * Mean of X

So we have established the Regression equation. Now how good is the Regression Equation? In case of no knowledge about Regression Equation, the best we can do is to take Mean (Average) of Y and use it as our prediction. This is the straight line parallel to X axis shown on the plot. This is our **No Model (Naïve) prediction**. __ Regression Equation is valuable, only if it provides some value over the Naïve Prediction.__ Otherwise there is no point in taking so much pain in calculating the coefficients and performing other analyses. One way Regression Equation provides more value than the Naïve Prediction, is that it can explain variation in the Y variable as much as possible, which the Naïve Model is not at all doing (it gives a constant value regardless of what X we chose). This is explained by the relative magnitudes of SS values (Regression, Residual and Total). Please see how they are calculated in columns H and I. The graphical equivalent is the blown up chart on the right side above.

The **R Square **value measures this. Out of Total Variation (SS Total), 76% is explained by the Regression Equation (SS Regression). Hence it seems to be a valuable exercise. Although, higher the better, we should not be too ambitious about R Square value. A value above 70% looks a good enough value, and shows that still 30% variation is unexplained by X (this is more likely to happen in practical cases). It is a clue to use our Business acumen and common sense and include more explanatory variables if needed.

Another analyses which explains the same thing, albeit without attaching any number to it, is **ANOVA** as shown in the excel output. This is essentially a Hypothesis test involving F statistics (please see explanation on Statistics page). The Null Hypothesis in this case is that all the coefficients in the Regression Equation are zero. In effect it means that a Constant value (average of all Y) is all we are left with. How do we test the Hypothesis? We do this by F statistics which is MSR/MSE. MSR and MSE are calculated by dividing the SS values by df (degree of freedom) for Regression and Residual respectively.

DOF for Regression = k

DOF for Total = n -1

DOF for Residual = n – k – 1

Where n = number of data points (29) and k = number of variables (1 in this simple case).

If F = MSR/MSE is very large compared to 1, then it is safe to reject this hypothesis, i.e., there is at least one X (in this case there is only one), which has non zero coefficient.

How large F should be compared to 1. It depends on n, size of data, for larger data sets it should be moderately above (~ 10) and in relatively smaller data sets it should be quite high (84 in our smaller data set).

What we are establishing from ANOVA is that there is definitely a merit in developing a Regression Equation, as at least one of the X variables certainly has non-zero coefficient and can explain the variation in Y (much better than the Naïve prediction).

Can we see the equivalence in R Square and ANOVA? ANOVA in Regression context is called the “**Global Test of Significance**” and it is first analysis one should look at. Only when we reject the Hypothesis here, there is any merit in judging other measures.

After the Global Test of Significance, we should turn focus on individual variables. We perform the "Test of Significance" for individual variables. The Null Hypothesis is that the variable coefficient is zero.

We have already calculated the coefficient for X variable. However this is based on a sample (29 points), but our objective is to generalize it for entire population. Essentially we want to capture the dependency of Univ. Grades on Class X marks for students in general. Hence the calculated coefficient is merely an **estimate** (refer **Inferential Statistics **page) of the actual coefficient (population). Estimates always have an uncertainty attached to them, or an error. The error is used to establish an interval. In the case of estimating a Mean, we learnt that the error (called **Standard Error**) is given as follows:

For the x variable and Intercept, the Standard Error (uncertainty in estimating) their coefficients is given as below.

Please note that two formula are listed above for intercept term (a). First look at the formulae for Standard Error for b. The numerator is the variance of Residuals, and is calculated as RSS/(n-2), where we learnt that RSS is Residual Sum of Squares (calculated in cell D31). How do we reduce the error - either by reducing the residuals or spreading out the x values (the denominator). The spread of x values is called **leverage**, and a spread out data points enhances the estimation as compared to a narrow cluster of data points.

A large value of the Standard Error is an indication that we may commit large error in rejecting the Hypothesis (b = 0), hence we do not reject it. That amounts to saying that b = 0, hence there is no statistically significant relationship between Y and the individual X.

The first formula for Intercept terms is primarily given to illustrate an important fact. If average of X is zero, then it is very similar to estimating a mean, which in this case is Average of Y values (Naive prediction).

The second formula is used for calculations, and note how it matches with the Excel output. The calculations with these two formula may vary, but the conclusion on Hypothesis Test remains same. In the Example sheet, the Standard error for intercept term is very high. A model without Intercept term is developed and the results are shown in the sheet "**No_intercept**". We can see huge improvement in R Square value for this model.

One underlying assumption in developing a Regression Model is that the errors are random

in nature and do not show any pattern. They should also on an average converge to zero.

This is an important diagnostic and should be checked qualitatively. Please see the plot of

error with X below. The error terms swing between a minimum of -0.75 and a maximum

of 0.37, in a completely random manner. The mean for the residuals is zero. This is a very

important diagnostic, which is often overlooked, but may point to some inherent flaws in the

model. In fact, residuals should not show any correlation with any of X variables and Y variable,

a property called "**Homoscedasticity**". Lack of it may arise due to some biases. Please see

discussion on this in the "**Interesting Stuff**" page.

Till now we discussed Linear Regression with only one X variable. **Multiple Regression** in the same vein would be developed like following.

Y = a + b1*X1 + b2*X2 + b3*X3 + …….. + error, where

- X1 = Class X marks
- X2 = Marks in an Aptitude Test
- X3 = Average no. of hours spent on study

Though the calculations would be not as simple as in the univariate example, the concepts are same about Tests of Significance, R Square values, distribution of errors etc.

Although the outcome (Y) has to be a numeric value, X variables could be non-numeric too, which we call Categorical variable. Hence in the above case, one of the variables could be whether a student attended an Online Training Program or not. This variable, say X3, is a binary variable, with Yes/No kind of values. In this case, this variable is internally coded as 1 and 0 (called **Dummy variables**) for Yes and No values respectively. If the value is Yes (attended training), then Y will have contribution of b3 otherwise 0. If the categorical variable is multiclass, rather than a binary, then we will create N-1 dummy variables, where no. of classes is N. If instead of simply Yes/No kind of values, we have Training A, Training B and No Training as three classes, then we would model the same problem as following.

Y = a + b1*X1 + b2*X2 + b3*Training A + b4*Training B + …….. + error

In order to summarize all the above discussion, we should follow a step-by-step procedure.

- Look at ANOVA results (F Statistics) and judge whether there is really a merit in developing a Regression model based on the chosen input variables.
- Look at the R Square value, and judge whether the model explains enough variation in Y. As a rule of thumb, a value greater that 70% should be good, but it depends on context.
- Look at the coefficients table, and assess Standard Errors, t statistics etc. for all the variables. If required, we can drop any of the variables, and redevelop the model.
- Check the distribution of error, and presence of Homoscedasticity.

Above all, one should be watchful for results which are unexpected and belies common sense. As an example, a particular coefficient may be unexpectedly small, or show a counter-intuitive sign (negative rather than positive). This approach may avoid some of the common pitfalls in Regression models like **Multi-collinearity**, **Endogeneity Bias** etc.