Detecting Anomalous Data

Outliers are anomalous or extreme values. Some are genuine but rare occurrences; others are spurious data caused by errors in reporting. **Outliers should be removed**, as they diminish the accuracy of Parametric and Model-driven Techniques such as Regression and Discriminant Analysis. Cluster Analysis is also adversely affected by Outliers. Hence Outliers should be detected and removed before any analysis is done.

Please use the Sample Data and Self-Service Tool to detect and remove the Anomalous data. For a single measurement, we can use a **Histogram to detect Outliers**, and remove them by sorting the data or filtering out values beyond a threshold. For two measurements, we can use an **X-Y Scatter plot** to look at both dimensions in tandem and detect the Outliers. The following pictures show the detection of Outliers in our Sample Data. First we used a Histogram to detect Outliers with respect to GDP only. Next, we used a Scatter Plot to detect Outliers on Area and GDP simultaneously.
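The single-measurement case can be sketched in a few lines of Python. This is only an illustration of the idea, not how the Self-Service Tool works internally: the GDP figures, the bin width, and the threshold are all made-up assumptions, and `text_histogram` is a hypothetical helper.

```python
# Hypothetical GDP figures (illustrative values only, not the Sample Data).
gdp = {"A": 120.0, "B": 95.0, "C": 140.0, "D": 110.0, "E": 2500.0, "F": 130.0}

def text_histogram(values, bin_width):
    """Count how many values fall into each bin; returns {bin_start: count}."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))

# A coarse histogram makes the isolated bar far from the bulk easy to spot.
hist = text_histogram(gdp.values(), bin_width=500.0)
for start, count in hist.items():
    print(f"{start:>7.0f}+ | {'#' * count}")

# Threshold chosen by eye from the histogram (an assumption for this sketch).
threshold = 500.0
outliers = {name: v for name, v in gdp.items() if v > threshold}
cleaned = {name: v for name, v in gdp.items() if v <= threshold}
```

Here the threshold is picked by looking at the histogram, which mirrors the manual sort-or-filter step described above.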

But beyond two measurements, we cannot use these visual aids to detect Outliers. **This is very similar to Cluster Analysis**, where we hit a wall with multi-dimensional grouping and had to use advanced techniques. In a similar way, if we were to detect Outlier Countries based on Area, Population, GDP, Roadways, Annual Car Sales and so on, we would need a more sophisticated technique. This is known as **Multivariate Outlier Analysis**. One of the techniques is to use the Chi-Square Distribution to detect data points violating the norm. In our discussion on Inferential Statistics, we discussed the **Chi-Square Distribution** and its applications. If we randomly pick N independent values from a standard Normal Distribution, sum their squares, and repeat the process a very large number of times, the resulting sums follow a Chi-Square Distribution with N Degrees of Freedom. In Multivariate Outlier detection, the quantity examined is the squared distance of each data point from the Mean (**Mahalanobis Distance** to be precise), and N is the number of dimensions or measurements. If the data are approximately multivariate Normal, these squared distances follow a Chi-Square Distribution with N Degrees of Freedom, so points whose distance falls far in the upper tail are flagged as Outliers. We use the Chi-Square plots to detect Outliers in one go, without having to look at each measurement (GDP, Area and so on) separately.
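A minimal sketch of this idea, assuming NumPy and SciPy are available, is shown below. The function name `mahalanobis_outliers`, the 0.975 quantile cutoff, and the synthetic data are assumptions made for illustration; the Self-Service Tool may use different settings.

```python
import numpy as np
from scipy.stats import chi2

def mahalanobis_outliers(X, quantile=0.975):
    """Flag rows whose squared Mahalanobis distance exceeds a chi-square cutoff.

    X is an (n_points, n_measurements) array. The 0.975 quantile is an
    assumed, commonly used cutoff, not necessarily the tool's setting.
    """
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)            # deviation of each point from the mean
    cov = np.cov(X, rowvar=False)        # sample covariance of the measurements
    inv_cov = np.linalg.inv(cov)
    # Squared Mahalanobis distance per row: diff_i^T @ inv_cov @ diff_i
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    # Degrees of freedom = number of measurements (columns)
    cutoff = chi2.ppf(quantile, df=X.shape[1])
    return d2, d2 > cutoff

# Synthetic demo data (not the Sample Data): 200 ordinary points plus one extreme point.
rng = np.random.default_rng(0)
points = rng.normal(size=(200, 3))
points = np.vstack([points, [8.0, -8.0, 8.0]])

d2, is_outlier = mahalanobis_outliers(points)
print(f"{is_outlier.mean():.1%} of points flagged; extreme point flagged: {is_outlier[-1]}")
```

Note that the mean and covariance here are estimated from data that still contain the outliers, so very extreme points can distort the distances; robust estimates are sometimes substituted for that reason.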

Please use the Self-service Tool and set the "Detect Outlier" option to "Yes". One Variable must be selected before doing this (to generate the Histogram). A Pie Chart is then generated, showing the percentage of Outliers (TRUE means Outlier). Please note that, although we picked only one variable for Histogram generation, the Outlier Detection (Pie Chart) is based on all measurements. Essentially we are doing Multivariate Outlier detection. Please see the picture below, where 34.6% of data points are detected as Outliers.

One can use the "Remove Outlier" option to get rid of these Anomalous values. For our Sample Data, this reduces the number of Countries from 136 to 89; the 47 Countries removed are the 34.6% flagged as Outliers. We can download the cleaned-up data from the "Download" link in the left User Panel. By comparing it with the original data set, we can find the data points that were removed. This clean data can be loaded again for further analysis.
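The comparison between the original and the cleaned data can be done with a short script. The sketch below uses in-memory CSV text as a stand-in for the two files; the column name `Country`, the helper `read_rows`, and the rows themselves are hypothetical and would need to match the actual download.

```python
import csv
import io

# Hypothetical stand-ins for the original and the downloaded (cleaned) CSV files.
original_csv = "Country,GDP\nA,120\nB,95\nE,2500\n"
cleaned_csv = "Country,GDP\nA,120\nB,95\n"

def read_rows(text, key="Country"):
    """Index CSV rows by a key column (the column name is an assumption)."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(text))}

original = read_rows(original_csv)
cleaned = read_rows(cleaned_csv)

# Keys present in the original but missing from the cleaned data were removed as outliers.
removed = sorted(set(original) - set(cleaned))
print(removed)
```

With real files, the same logic applies after replacing the `io.StringIO` stand-ins with opened file handles.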