**Interesting Stuff**

__Moments of Probability Distribution__

We are all familiar with the concept of moments in Physics; the term "Moment of Inertia" instantly comes to mind. A high Moment of Inertia (MI) tells us that more mass is distributed far from the Centre of Mass (or C.G.). MI is the 2nd Moment, the Zeroth Moment is the Total Mass itself, and the 1st Moment gives the C.G. (after dividing by the Total Mass).

Analogous measures are used for data too. We express data as Probability Distribution Functions (PDFs) like the Normal Distribution, Poisson Distribution etc. The Zeroth Moment in this case is 1 (the sum of total probability). The **1st Moment is the Mean**, analogous to the C.G., which is the Expected Value of the random variable.

__Skewness is based on the 3rd Moment__ and is expressed as the average of (x - Mean)^3, divided by the cube of the Standard Deviation.

__Kurtosis is based on the 4th Moment__ and is expressed as the average of (x - Mean)^4, divided by the fourth power of the Standard Deviation.
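The four moments above can be sketched in a few lines of plain Python (in practice, library routines such as `scipy.stats.skew` and `scipy.stats.kurtosis` do this for you; the data values here are illustrative):

```python
# Sketch: the first four moments of a data set, standardized as described above.
from math import sqrt

def moments(data):
    n = len(data)
    mean = sum(data) / n                                     # 1st moment
    var = sum((x - mean) ** 2 for x in data) / n             # 2nd central moment
    sd = sqrt(var)
    skew = sum((x - mean) ** 3 for x in data) / n / sd ** 3  # standardized 3rd
    kurt = sum((x - mean) ** 4 for x in data) / n / sd ** 4  # standardized 4th
    return mean, var, skew, kurt

mean, var, skew, kurt = moments([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```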

**Endogeneity Bias**

A reliable Regression model, based on an Ordinary Least Squares (OLS) fit, should be unbiased. This means that on average the errors (residuals) should be zero; hence the error terms should follow a near-Normal Distribution with a mean of zero. One source of a biased model is called Endogeneity Bias, a term coined by Econometricians which refers to two closely related problems. We quite often use OLS Regression Models for Causal Analysis; however, the model would not be valid in these two cases.

1. We are not sure whether X causes Y or Y causes X (**Simultaneity Bias**).

2. If there is another variable Z which causes both X and Y to move, we may be tempted to regress Y on X (based on their correlation). However, the model will suffer from what is called **Omitted Variable Bias.**

These two biases are jointly called Endogeneity Bias. As an example, we may not know whether a high CEO pay package causes good corporate performance or good performance causes a high CEO pay package.

As another example, higher demand for electricity may be a cause of high per capita income (more industrialization, jobs etc.), or higher per capita income may be a cause of high demand for electricity (more affluence, appliances, travel etc.).

There is a phenomenon called **Supplier Induced Demand (SID)** which is a classic example of this, seen in the demand for medical professionals: is demand for services producing more Medicos, or is their availability inducing more services?

Though it can be hard to detect upfront in complex scenarios, fortunately a Regression Diagnostic can help detect and avoid this pitfall. In the presence of these biases, the error (residual) term will show a correlation (a systematic trend) with the X variables, which is a sign of a biased model. A related warning sign is **Heteroscedasticity**, where the variance of the residuals changes with the X variables instead of staying constant.

What is the remedy? In Case 2, the fix is to include the underlying cause (Root Cause) Z in place of X, if we can identify it. Otherwise we can develop a Two-Stage Least Squares (TSLS) model, based on an instrumental variable which is highly correlated with X but uncorrelated with the error term (and not already included in the model as x2, x3, x4 etc.). Once again, the final check is the presence of Homoscedasticity, i.e. no trend whatsoever of the residuals with any of the Xs.
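A minimal sketch of the two-stage idea, assuming a single endogenous regressor x and one instrument z (the variable names and simulated numbers are illustrative; in practice a library such as `statsmodels` or `linearmodels` would be used):

```python
# Two-Stage Least Squares (TSLS) sketch for one regressor and one instrument.
# Simulated data: x is endogenous (it shares the error u with y), while z is
# a valid instrument (it drives x but is unrelated to u).
import random

def ols_slope(x, y):
    """Simple OLS slope: cov(x, y) / var(x)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = sum((a - mx) ** 2 for a in x)
    return cov / var

random.seed(42)
n = 20000
z = [random.gauss(0, 1) for _ in range(n)]   # instrument
u = [random.gauss(0, 1) for _ in range(n)]   # structural error
x = [zi + ui for zi, ui in zip(z, u)]        # x is correlated with u
y = [2 * xi + ui for xi, ui in zip(x, u)]    # true causal effect of x is 2

beta_ols = ols_slope(x, y)                   # biased: it picks up u as well
# Stage 1: regress x on z and form the fitted values x_hat.
b1 = ols_slope(z, x)
mz, mx = sum(z) / n, sum(x) / n
x_hat = [mx + b1 * (zi - mz) for zi in z]
# Stage 2: regress y on x_hat, recovering the causal slope.
beta_tsls = ols_slope(x_hat, y)
```

Here the plain OLS slope overstates the true effect (it absorbs the shared error), while the two-stage estimate lands near the true value of 2.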

**Oversampling for Rare Classes**

One of the most widely solved problems in Analytics is the Classification Problem, where we seek to develop the predictive ability to distinguish between two or more classes. As examples, we may be interested in predicting "Fraudulent Transactions", "Respondents to a marketing campaign" or "High-risk Drivers". **In these cases, the negative class (the 0 class, e.g. genuine transactions or non-respondents) is usually overwhelmingly bigger than the positive class.** Often less than 1% of people respond to an email or indulge in fraudulent transactions. If we use a Simple Random Sample to analyse such cases, we will end up with too few cases of the positive class in our sample: in a sample of 10000, fewer than 100 may be Class 1 and the rest Class 0.

We would need to inflate the number of Class 1 cases in order to have a healthy mix of both classes. Usually we would keep a 50-50% split for two classes, or a 33% split each for three classes. This sampling plan, which seeks a healthy mix of the classes, is called Oversampling of Rare Classes.

There is a very valid reason to do this, which comes from the __differential cost attached to misclassifying the two classes__. The cost (penalty) of predicting a "fraudulent transaction" as "genuine" is much higher than that of predicting a "genuine" one as "fraudulent"; the cost in the latter case is merely more scrutiny or a more detailed review. When such asymmetric costs are attached, we aim for more accuracy in catching the "more valuable class" and therefore want more information (data) about it in our sample.
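The simplest form of oversampling just duplicates minority-class rows until the classes are balanced. A sketch with made-up data (libraries like `imbalanced-learn` offer more refined schemes such as SMOTE):

```python
# Oversampling sketch: resample minority-class rows (with replacement)
# until every class matches the size of the majority class.
import random

def oversample(rows, label_index=0):
    """Return rows with each minority class resampled up to the majority size."""
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_index], []).append(row)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for cls, members in by_class.items():
        balanced.extend(members)
        # Top up minority classes by sampling with replacement.
        balanced.extend(random.choices(members, k=target - len(members)))
    return balanced

random.seed(0)
data = [(1, "fraud")] * 30 + [(0, "genuine")] * 970   # ~3% positive class
balanced = oversample(data)
counts = {0: 0, 1: 0}
for row in balanced:
    counts[row[0]] += 1
```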

**Altman Z Score**

Altman Z Score is one of the most widely used Financial Distress Indicators in the finance world. It was developed by Prof. Edward Altman in 1968. It uses Multiple Discriminant Analysis to classify organizations into three classes - Safe, At Risk and Imminent Bankruptcy.

**Discriminant Analysis** is not as popular as Regression Analysis, owing to its restriction that all independent variables need to be numeric. This requirement is not met in a lot of practical analytics problems. In essence, however, it is very similar to Regression Analysis and applies to scenarios where Y is a Discrete Variable (Classes) and all X variables are numerical.

The Z Score uses Financial Ratios from the Income Statement and Balance Sheet of an organization. It uses the following 5 financial ratios.

1. X1 = Working Capital/Total Assets

2. X2 = Retained Earnings/Total Assets

3. X3 = Earnings before Interest and Taxes/Total Assets

4. X4 = Market Value of Equity/Book Value of Liabilities

5. X5 = Sales/Total Assets

The Z score equals 0.012X1 + 0.014X2 + 0.033X3 + 0.006X4 + 0.999X5 (with these coefficients, X1 to X4 are expressed as percentages and X5 as a plain ratio). A score above 3 classifies an organization as Safe, 1.8 to 3 as At Risk, and below 1.8 as facing Imminent Bankruptcy. For details refer to the paper by the author: http://people.stern.nyu.edu/ealtman/Zscores.pdf
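The score and the three zones above translate directly into code. The input values below are made up purely for illustration:

```python
# Sketch of the Altman Z score exactly as stated above. With these
# coefficients, X1-X4 are entered as percentages and X5 as a plain ratio.
def altman_z(x1, x2, x3, x4, x5):
    """Return (score, zone) using the three cut-offs given in the text."""
    z = 0.012 * x1 + 0.014 * x2 + 0.033 * x3 + 0.006 * x4 + 0.999 * x5
    if z > 3:
        zone = "Safe"
    elif z >= 1.8:
        zone = "At Risk"
    else:
        zone = "Imminent Bankruptcy"
    return z, zone

# Illustrative inputs: e.g. Working Capital is 25% of Total Assets, etc.
score, zone = altman_z(x1=25, x2=30, x3=20, x4=150, x5=1.8)
```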

**Simpson’s Paradox**

This paradox is one of the most studied and discussed in the fields of Statistics and Data Mining. It reinforces the view that understanding the data and applying human intelligence should always go hand-in-hand with automatic Data Mining algorithms. These automated algorithms are just aids in our hands; we should retain the logic and ability to sense misleading results thrown up by them.

This paradox shows how __results can be entirely contradictory and counter-intuitive when we combine data sets, leaving out important contextual variables.__

Let us consider an example. A retailer wants to see the efficacy of two marketing programs - emails (generic, with no personalization) and direct phone calls (personalized, based on profile). Samples of 5000 customers were taken for each campaign. The results showed an acceptance rate of 45% for emails versus 20% for the phone calls. This seems startling, as directed and personalized phone calls (which are expensive) are expected to lead to more conversions. Now the retailer segregates the customers into two categories - Platinum/Gold Loyalty Card holders and General Loyalty Card holders - and the results are entirely reversed. For Platinum/Gold members the acceptance rates are 95% and 50% for phone calls and emails respectively; for General members they are 10% and 5%. Hence adding one contextual variable (type of Loyalty Member) changes the result and establishes that conversion rates for phone calls are much higher than for emails, which was not evident when the two subgroups were merged together.
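The reversal is easy to reproduce with counts. The subgroup counts below are illustrative, chosen so the aggregate rates roughly match the example (emails ~45%, phone ~20%) while phone calls win within every subgroup:

```python
# Simpson's paradox demo: phone beats email in every segment,
# yet email beats phone once the segments are merged.
campaigns = {
    # (contacted, accepted) per loyalty segment
    "phone": {"platinum": (500, 475),    # 95% acceptance
              "general":  (4500, 450)},  # 10% acceptance
    "email": {"platinum": (4500, 2250),  # 50% acceptance
              "general":  (500, 25)},    # 5% acceptance
}

aggregate = {}
for channel, segments in campaigns.items():
    contacted = sum(c for c, a in segments.values())
    accepted = sum(a for c, a in segments.values())
    aggregate[channel] = accepted / contacted
# Aggregated: email wins. Within each segment: phone wins.
```

The trick is in the mix: the phone campaign mostly reached General members (low acceptors), while the email campaign mostly reached Platinum members (high acceptors).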

Simpson’s Paradox was one of the main motivations to develop the set of algorithms called __Automatic Interaction Detection (AID)__. These algorithms form the backbone of the powerful Machine Learning tool called Decision Trees (CHAID, in particular, grew directly out of AID).

**Data Snooping**

Data Snooping, also called **Data Fishing** or **Data Dredging**, occurs when an analyst uses the same data set repeatedly and ends up testing a Hypothesis that was not formulated before looking at the data. As an example, an analyst may try to find something interesting in employee retention data. He stumbles upon a positive correlation between a particular Training Program and Employee Attrition (as shown by the algorithm through Significance Tests). In this case, he had no a priori Alternate Hypothesis (that Attrition grows or reduces with Training). The data set throws up a possibly spurious correlation, which may be just a chance occurrence.

Sometimes this practice is intentional and sometimes unintentional. **Decision Trees are the techniques most prone to this phenomenon.** These automated techniques may find that people in particular geographies are more likely to buy an iPhone than the rest, which may be just noise in the data and may not hold true for a new data set in future. Owing to this, __CHAID had a controversial history__ and several researchers doubted its usability.

__Using a good sample, preferably a large one, for Training and Validation makes sense for these algorithms__, which can inadvertently lead to Data Snooping. It also makes sense to retrain the model as and when the data set grows, business scenarios change or new data arrives. Ensemble models like __Random Forest__ are another way to enhance Decision Trees: results from several models are averaged, which reduces unintentional Data Snooping.

**No Model Prediction**

Even when we do not have access to any predictive analytics tools, we can still make a prediction by using the Average Value of all historical data. This can be called a **Naïve Model or No-Model prediction**. The value provided by any advanced model can be assessed by its improvement over the Naïve Model. The **Coefficient of Determination, R-squared (or Adjusted R-squared)**, is one measure which gives such an assessment. It is expressed as a percentage: a value of 70 percent means the model is able to explain 70 percent of the variation in the Outcome (Y variable). By contrast, a Naïve Model by its nature explains no variation (it assumes the same average value throughout).
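This comparison against the Naïve Model is exactly what R-squared computes: R² = 1 − SS_res/SS_tot, where SS_tot is the error of the mean-only model. A sketch with made-up data and model predictions:

```python
# R-squared relative to the Naive (mean-only) model.
def r_squared(actual, predicted):
    mean_y = sum(actual) / len(actual)
    ss_tot = sum((y - mean_y) ** 2 for y in actual)   # Naive model's error
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    return 1 - ss_res / ss_tot

y = [3.0, 5.0, 7.0, 9.0, 11.0]
model_pred = [2.8, 5.1, 7.2, 8.9, 11.0]      # some fitted model (illustrative)
naive_pred = [sum(y) / len(y)] * len(y)      # always predicts the mean
```

By construction the Naïve Model scores exactly 0: it explains none of the variation, which is why it is the baseline.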

It is a very important diagnostic in Regression models. However, **analysts tend to exaggerate its importance**. A Regression Model with 90% R-squared may not tell us anything new, whereas a model with just 20% R-squared can give a vital clue. This is especially true for the **Exploratory Objectives of a Regression Model**, where we are interested in the major drivers behind an Outcome and even a low R-squared model may capture a vital factor. In complicated cases like modelling stock markets or the profitability of an organization, we should downplay R-squared and rely more on the Significance Tests. R-squared in these cases is useful for comparing two models, but we should not attach too much importance to it; Econometricians are known to largely ignore it. R-squared should be checked in combination with Significance Tests, and even when it is low, we should check whether the model has given us some causal insight which is hard to find otherwise.

**Random Walk and Financial Markets**

Random Walk is a popular theory in the art of investment in Financial Markets. It says that markets are efficient and cannot be predicted: they always reflect all the information available. Hence believers in this theory say that markets cannot be timed from an investment perspective. These views contradict the theory and belief of Financial Analysts who use sophisticated statistical algorithms to predict stock markets from past patterns of market movements.

**Unravelling the myriad of Hypothesis Tests**

Hypothesis Tests are also known as "Inferential Statistics", as we seek to infer or test knowledge about a Population based on Samples drawn from it. There are numerous techniques: Z Test, t Test, Chi-square Test, F Test, to name a few. It is not uncommon to lose track of their applications and their suitability for a particular scenario. The following points should come in handy for a quick recall.

__Important point to note:__

1. Z Tests and t Tests can be used interchangeably; however, t Tests are more practical, as we usually would not know the Population Std. Deviation if we don't even know the Population Mean. t Tests are used when we use the Sample Std. Deviation in place of the Population Std. Deviation, and also when the sample size is small. t Test results approach Z Test results as the sample size increases.

2. Chi-square Tests are used for Nominal (Categorical) variables. They are Non-parametric tests, as opposed to Z Tests, which are parametric in nature (they presume some characteristics of the data). Chi-square Tests are thus widely used for their non-restrictive nature; "Goodness-of-fit" Tests and the "Contingency Table" approach are two widely used Chi-square tests.

3. t Tests for comparing populations can be done in three ways - Independent-Sample Tests, Paired-Sample Tests and 1-Sample Tests. Independent-Sample Tests refer to cases where the two samples are entirely independent, as when testing the sameness (or difference) between men and women. Paired-Sample Tests are performed on the same population (pre and post treatment), for assessing the impact of a training program, a medicine etc. A 1-Sample t Test tests a Population Mean against knowledge from past experience, research findings etc.: a single population mean is compared against a reference value (a fictitious population).
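The 1-Sample t statistic in point 3 is simple enough to compute by hand; the data values below are illustrative (in practice `scipy.stats.ttest_1samp` does this and returns the p-value too):

```python
# One-sample t Test sketch: t = (sample mean - mu0) / (s / sqrt(n)),
# using the Sample Std. Deviation s, as described in point 1 above.
from math import sqrt
from statistics import mean, stdev

def one_sample_t(data, mu0):
    n = len(data)
    return (mean(data) - mu0) / (stdev(data) / sqrt(n))

scores = [10, 12, 9, 11, 13, 10, 12, 11]   # e.g. post-training test scores
t_stat = one_sample_t(scores, mu0=10)      # reference value from past data
```

The resulting statistic is then compared against the t distribution with n − 1 degrees of freedom to decide significance.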

**Fixed Effect and Random Effect modelling**

If we have Sales Data for a product (from several Sales Zones) at several price points, we may want to regress Sales vs. Price to get the Price Elasticity of that product. But we may encounter an entirely surprising result if we don't use __Fixed Effects modelling__. It is quite likely that several variables like per capita income, average age etc. in each sales zone have an impact on Demand Elasticity. If we do not include them in our model, we may induce Omitted Variable Bias.

If we have strong reasons to believe that the samples drawn from all these sales zones randomize age, affluence, profession etc., then we need not do anything; in that case we are relying on the __Random Effects Model__. This concept is explained in various ways; watch some good YouTube videos and read one good paper on it.

**The "Bulging Rule" of Transformation**

The first step in using a parametric model like Least Squares Regression is to look at the x-y scatter plot and check the correlation. Very often we may not find a neat, linear x-y trend, and any attempt to fit a Linear Regression (or Multiple Regression in multivariate problems) would lead to a poor fit and a non-usable model. Transforming X or Y can, in some cases, significantly improve the quality of the fit. Though several functions could be used, a simple X^p or Y^q transformation is enough in most cases. The Mosteller-Tukey Bulging Rule is a very smart way to implement this transformation: look at the x-y plot and, based on its shape, choose p and q. First transform X and Y to positive values by adding an appropriate constant (based on their minimum values). Then, based on which quadrant of a circle centered on (0,0) the x-y plot resembles, judge p and q. Please see the picture below for an example. Usually we should transform only X (the independent variable), as X^p; however, where Y (the response) shows a lot of skewness, a log or other transformation of Y may be required, and in Multiple Regression we should transform the x variables only. Read this excellent material (there are several others) for details.

http://www.statpower.net/Content/MLRM/Lecture%20Slides/DiagnosticsAndTransformations2_handout.pdf
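As a tiny illustration of how such a power transformation straightens a curved trend, the sketch below compares the linear correlation of y with x before and after an x → x^0.5 transformation (the data are made up to follow a square-root shape, so p = 0.5 is the right choice here):

```python
# Bulging-Rule sketch: a power transformation of X improves linearity.
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

x = list(range(1, 41))
y = [sqrt(v) for v in x]                           # curved, "bulging" trend
r_raw = pearson(x, y)                              # high, but below 1
r_transformed = pearson([v ** 0.5 for v in x], y)  # exactly linear now
```

In real data the right p is rarely known in advance; the Bulging Rule narrows the search by reading the direction of the bulge off the scatter plot.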