Statistics can be classified into Descriptive and Inferential Statistics. Descriptive Statistics presents summarized view of the as-it-is data, which is easier to use and interpret. Descriptive can be further classified into Graphical and Numerical Statistics. Inferential Statistics mainly refers to techniques, by which we deduce a measure about data (population), based on a part of it (sample). Before we proceed further, let us arm ourselves with basic understanding of some concepts.
Population: Population consists of all the items of interest, which can be possibly used for Analytics. Example could be all adults eligible for voting in a country, all cars plying on the road in a city or all gears manufactured at a plant and so on.
Sample: We could possibly use all items in a population, however we seldom do that due to huge cost and time involved with it. What we normally do is to take a part of the Population. But what we refrain from is, picking people from the same city or gears manufactured in a single week. There are certain methods to draw a Sample from a population, called Sampling Techniques. We would pick every 100th person (Systematic Random Sample) from a city, or from different age groups (Stratified Sample) in a city and do the same for multiple cities (Cluster Sample).
Variable is a characteristic of the Population or the Sample. As an example Vote preferences or Weights of people could be a variable. As the name suggests, it can take several values.
Data is the measured or observed value of a variable. When the data can take uncountable number of values, it is called Continuous Data (Interval, Quantitative Data). When the data can take only finite, countable number of values, it is called Categorical (Discrete, Nominal or Qualitative Data). When we measure weight of a person in absolute value, say lbs. it is Continuous. But when we put him in any of the three categories (Obese, Normal, Under-weight), it is Categorical.
Please download the Example Sheet (statistics.xlsx) and get familiarized with all the concepts. The sheets Descriptive and Graphical shows the concepts described below.
Following numerical measures are used for Continuous data.
Mean: This is the arithmetic average of all data. Please note that, presence of very high or very low values can impact the mean. In the Example sheet, depending on whether we include or exclude US, China, Belize or Burkina Faso, the mean of Annual Car Sales would change drastically. Hence Mean is complemented with few other measures, which are immune to this shortcoming.
The most widely used mean is Arithmetic Mean. When the objective is to find average growth over several years, like in Population or return on a stock, we use Geometric Mean.
Median: This is the middle value, when we arrange the data in decreasing or increasing order. Please note that Median would not be impacted by extreme values. For odd nos. of data points, it is simple, but for even nos., we will have a tie. In that case, median is average of the middle two values. In our example, if we sort the Annual Car Sales from lowest to Highest, 35500 is the Median (corresponding to Syria and Lebanon).
Mode: Mode is the most frequent data value. But it is not used as often as Mean and Median
Range: It is simply the difference between Maximum and Minimum values.
Variance: It is average of the square of all deviations from the mean value. Variance and Std. Deviation are measure of unpredictability and often used to represent risk or lack of quality in a process. We would prefer a Stock with low variance on returns (low risk) or a manufacturer strives for consistency in his output. Too much of variance (say length of a Tube) in output shows that one doesn't have control on the process, and several times the products will be rejected by end user.
Standard Deviation: It is square root of the Variance, and the most widely used Statistical Measure along with Mean.
Mean vs. Variance (or Std. Deviation) is a very useful analysis. While mean reflects the magnitude, variance is an indicator of consistency. Hence a Stock which has high mean, as well as low variance (risk) would be a preferred investment tool, as compared to a Stock which has low mean but high variance. Similarly a route for an Airline, with low mean and high variance for revenue would be very difficult to plan, and needs immediate attention of Management.
Skewness: Please observe the Histogram generated in the Example sheet. The Bars are not symmetric, with lot of small bars spreading on the right side. This is called a Skewed distribution, which in this case is the result of some extreme values (US, China, Japan etc.). This concept would be much clearer when we study Normal Distribution later. Had it been a perfectly symmetric distribution, it would have shown zero skewness. Hence skewness is the measure of the deviation from the perfectly symmetric Histogram. The example has a very high positive, non-zero skewness, indicating a right tailed distribution. Please follow its calculation in the example sheet.
Kurtosis: It is measure of the peakedness of a Histogram. A higher Kurtosis means a narrow Histogram, with most of the values close to mean values. Hence the variance, range, Std. Deviation all would be low for high Kurtosis and vice versa. Again in the example sheet, Kurtosis is very high indicating a narrow peak (several countries in a narrow range).
Covariance or Correlation: They measure the strength of relationship between two variables. The relationship need not be a Cause and Effect. It merely shows whether two variable move in tandem, as a result of causal relations or as a result of same cause. Correlation is often more useful measure as it is non-dimensional, and ranges between -1 to +1.
Query: We often want to look at data for certain conditions (subset of data). Two of the most common examples are Filters in Spreadsheets or Structured Query Language (SQL). We can filter Region, Size of Economy or Countries with more than a certain GDP or Car Sales in the Example sheet. When data is stored in a Database, we would use the SQL queries to do the same.
Please observe the calculation of all these measures in the Example Sheet. It is worthwhile to compare the Excel outputs and calculated values. They are exactly same, except for Skewness and Kurtosis, which differ due to the rounding off for the extremely large values.
Except Mean, all other calculations (Variance, Std. Deviation, Skewness, Correlation etc.) involve dividing with a number, which is one less than total sample size (135, where Total No. of countries is 136). Why is that? In all the numerical measures, barring Mean, we need Mean to complete the calculation. Hence for a given Mean, we have freedom in choosing only N-1 data points, as the Nth data point is fixed by Mean and N-1 values. This is called Degree of Freedom (DOF), which for all these Numerical measures in N-1, where N is sample size.
Primarily Line, Table, Column, Bar, Pie, Scatter Chart, Histogram etc. are used.
Categorical Data: The only measure which could be used on them is counts for each Categories. Accordingly, we can use Table, Bar (Column) or Pie Chart for them.
Continuous Data: The most widely used techniques are Histogram (also called Frequency Distribution) and Line Charts (mostly used with data stamped with Time).
Combination: We often encounter a mix of Categorical and Continuous data, which in Analytics parlance are also called Dimension and Measure respectively.
When we combine two Dimensions we get what is known as Contingency Table. We can combine two continuous variables by a Scatter plot. We can combine Categorical and Continuous data by aggregating Continuous values for Categories by means of Bar or Column Charts. Scatter Plots are Graphical equivalence of Correlation, which we learnt earlier.
Line, Bar etc. can be used to see distribution of a Measure for a particular Dimension. The measure can be either a sum or mean for a Category. Scatter Plot is used to find relationship (correlation) between two Continuous(numeric) variables. It can be an important insight for detailed Analysis like Variable Screening or Predictive Modelling. Bubble Plots are advanced Scatter Plots which use an additional Size variable. It can accomodate an ID variable too, which has to be a Categorical(Dimension) Variable.
With the advent of powerful Business Intelligence Tools like Tableau, Qlik, Power BI (Advanced Excel) etc. several other Graphical tools are available. Often Graphical Tools help in Data discovery or Preliminary Analysis, which are helpful in Data Mining.