Out of the several definitions that can be used for Analytics, the following should suffice for most cases: "ANALYTICS is a set of techniques which gives STRUCTURE to large amounts of INFORMATION to produce actionable INSIGHTS." Structure comes from Charts, Graphs, Aggregations, Groupings, Inferences, Combinations, Trends etc. Insights come from showing what happened, why it happened, what can happen and what should be done about it.
Ingredients of Analytics
We will focus on STATISTICS and SMART SOFTWARE (Machine Learning), and mention other ingredients wherever appropriate.
Analytics relies heavily on Classical Statistics and Smart Software, sometimes used independently and sometimes in a combined fashion.
STATISTICS can be classified into Descriptive and Inferential Statistics. Descriptive Statistics presents a summarized view of the as-is data, which is easier to use and interpret; it can be further classified into Graphical and Numerical Statistics. Inferential Statistics mainly refers to techniques by which we deduce a measure about the data (population) based on a part of it (sample) drawn from it. Before we proceed, let us arm ourselves with a basic understanding of some concepts.
Population: The Population consists of all the items of interest which could possibly be used for Analytics. Examples could be all adults eligible for voting in a country, all cars plying on the roads in a city, or all gears manufactured at a plant.
Sample: We could possibly use all items in a Population, but we seldom do so because of the huge cost and time involved. What we normally do is take a part of the Population. But we refrain from picking people only from the same city or gears manufactured in a single week. We might pick people at random from a city (Simple Random Sample), pick every nth person (Systematic Sample), pick from different age groups (Stratified Sample), or do the same across multiple cities (Cluster Sample).
Variable: A Variable is a characteristic of the Population or the Sample. For example, vote preference or the weight of a person could be a variable. As the name suggests, it can take several values.
Data: Data is the measured or observed value of a variable. When the data can take an uncountable number of values, it is called Continuous Data (also Interval or Quantitative Data). When the data can take only a finite, countable number of values, it is called Categorical Data (also Discrete, Nominal or Qualitative Data). When we measure the weight of a person as an absolute value, say in lbs., it is Continuous. But when we put the person into one of three categories (Obese, Normal, Under-weight), it is Categorical.
Observe the alignment of Analytics Objectives and Techniques below. We are going to learn Descriptive and Inferential Statistics as a foundation for more advanced concepts. Please proceed to the Descriptive Statistics tab to learn some Graphical and Numerical techniques.
Primarily Line, Table, Column, Bar, Pie and Scatter Charts, Histograms etc. are used.
Categorical Data: The only measure which can be used on them is the count for each category. Accordingly, we can use a Table, Bar (Column) or Pie Chart for them.
Continuous Data: The most widely used techniques are the Histogram (also called a Frequency Distribution) and Line Charts (mostly used with time-stamped data).
Please load Sample data set Country_Car_Sales on Training Page and use Descriptive Statistics option to learn different techniques.
Example: Read variables AnnualCarSales, GDP, SizeofEconomy etc. to generate different charts. Depending on whether the selected variable is Continuous or Categorical, appropriate charts are generated.
Combination: We often encounter a mix of Categorical and Continuous data, which in Analytics parlance are also called Dimensions and Measures respectively.
When we combine two Dimensions we get what is known as a Contingency Table. We can combine two Continuous variables with a Scatter Plot. We can combine Categorical and Continuous data by aggregating Continuous values over Categories by means of Bar or Column Charts.
Line, Bar etc. can be used to see the distribution of a Measure over a particular Dimension. The measure can be either the sum or the mean for a Category.
Scatter/Bubble Plot: A Scatter Plot is used to find a relationship (correlation) between two Continuous (numeric) variables. It can be an important insight for detailed analysis like Variable Screening or Predictive Modelling. Bubble Plots are advanced Scatter Plots which use an additional Size variable. They can accommodate an ID variable too, which has to be a Categorical (Dimension) variable.
The following numerical measures are used for Continuous data.
Mean (Average): This is the arithmetic average of all the data. Please note that the presence of very high or very low values can impact the mean. If we remove the US from the Annual Car Sales field, the mean would be far less than with the US included. Hence the Mean is complemented with a few other measures which are immune to this shortcoming.
Median: This is the middle value when we arrange the data in decreasing or increasing order. Please note that the Median is not impacted by extreme values. For an odd number of data points it is simple, but for an even number we have a tie; in that case the median is the average of the middle two values.
Mode: The Mode is the most frequent data value, but it is not used as often as the Mean and Median.
Range: It is simply the difference between Maximum and Minimum values.
Variance: It is the average of the squares of all deviations from the mean value.
Std. Deviation: It is the square root of the Variance, and along with the Mean the most widely used statistical measure.
Skewness: Skewness measures the deviation from a perfectly symmetric Histogram. A perfect bell-shaped curve has zero Skewness, and skewness increases as the curve becomes asymmetric. A positive skewness means there are outliers towards the right tail of the curve, and a negative skewness means the presence of extreme values towards the left tail.
Kurtosis: It is a measure of the peakedness of a Histogram. A higher Kurtosis means a narrower Histogram, with most of the values close to the mean. Hence the Variance, Range and Std. Deviation would all be low for high Kurtosis, and vice versa. Please refer to "Moments of Probability Distribution" on the "Interesting Topics" page to know more.
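The numerical measures above can be sketched with Python's standard library alone; the sample values below are made up for illustration, with one deliberate outlier to show how the Mean and Skewness react to it.

```python
# Minimal sketch of the numerical measures discussed above (hypothetical data).
import statistics as st

data = [4, 7, 7, 9, 12, 15, 41]  # note the extreme value 41

mean = st.mean(data)          # sensitive to the outlier
median = st.median(data)      # robust to the outlier
mode = st.mode(data)          # most frequent value
rng = max(data) - min(data)   # Range = Maximum - Minimum
var = st.pvariance(data)      # Variance: average squared deviation from the mean
sd = st.pstdev(data)          # Std. Deviation: square root of the Variance

# Skewness and excess Kurtosis from the 3rd and 4th moments about the mean
n = len(data)
m3 = sum((x - mean) ** 3 for x in data) / n
m4 = sum((x - mean) ** 4 for x in data) / n
skewness = m3 / sd ** 3       # positive here: outlier in the right tail
kurtosis = m4 / sd ** 4 - 3   # 0 for a perfect bell-shaped curve

print(mean, median, mode, rng)
```

Notice that the Mean is pulled above the Median by the single large value, while the Median and Mode are untouched, which is exactly the shortcoming of the Mean described above.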
Query: We often want to look at data under certain conditions (a subset of the data). Two of the most common examples are Filters in spreadsheets and Structured Query Language (SQL).
Queries on the Data set can be made either through the drop-down menu or via SQL statements for advanced data retrieval. SELECT Fields are the columns to be retrieved (multiple selection allowed) and WHERE Fields are used for the condition to be imposed. VALUE is manually entered. Any advanced query can be made through standard SQL commands by typing in the SQL Statement Text Box (use Data as the Table or View name).
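The SELECT/WHERE idea can be sketched with SQLite from the Python standard library. The table name Data follows the note above; the columns and rows are hypothetical stand-ins for the Country_Car_Sales data set.

```python
# Hedged sketch of a SELECT ... WHERE query (hypothetical table contents).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Data (Country TEXT, GDP REAL, AnnualCarSales REAL)")
con.executemany(
    "INSERT INTO Data VALUES (?, ?, ?)",
    [("US", 21.0, 17.0), ("India", 2.9, 3.8), ("Germany", 3.8, 3.3)],
)

# SELECT fields = columns to retrieve; WHERE field and VALUE = condition imposed
rows = con.execute(
    "SELECT Country, AnnualCarSales FROM Data WHERE GDP > ?", (3.0,)
).fetchall()
print(rows)  # only the countries satisfying the WHERE condition
```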
Probability & Probability Distribution:
The backbone of Inferential Statistics is the concept of Probability. We will encounter it directly in some Analytics techniques, see glimpses of it in a few, and find it deeply entrenched within others (Smart Software, remember). But a real appreciation of Analytics can come only after a good understanding of Probability and some related concepts. This chapter is all about them. Please download the Excel file Probability.xlsx to practice some of the concepts mentioned below.
We are using concept of Probability in all cases below.
1 - Likelihood of a delayed take-off at Airport XYZ is 10%.
2 - Less than 1% of the Customers default on the Loan repayment, 10% make delayed repayment and rest make timely repayments.
3 - The chances of completing the new bridge in less than 365 days is 5%.
In all these cases, we have an event and outcomes associated with the event. A take-off at an Airport is an event and the outcomes are on-time and delayed. In the next case, there are three outcomes: timely repayment, delayed repayment and default on the loan. However, in the last case the outcome is the time taken to complete the bridge, which could be any positive number and is not restricted to some limited number of possibilities. This is an important point: we are talking about Discrete values in the first two cases and a Continuous value in the last case. This is analogous to the Categorical and Continuous values we learnt previously.
Regardless of the type of outcome, we calculate Probability by the relative frequency (number of counts) of an outcome. Hence if 100 out of 1000 flights were delayed, we calculate the probability as 10%. Though we have information on only 1000 flights (the sample), we extend the probability to all possible flights (which could be infinite, and is termed the population). Obviously, we will have more confidence in this probability as the sample size increases and we have more and more data points.
Learn three types of probabilities from the Example data.
Joint Probability: Probability when event A and B happen together
Marginal Probability: Probability of event A alone, summed across all outcomes of the other event
Conditional Probability: Probability of event A given that event B has already happened
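These three probabilities can be illustrated with a small contingency table. The counts below are hypothetical (100 MBA grads, chosen so that the probabilities line up with the worked example in this chapter).

```python
# Joint, marginal and conditional probability from a hypothetical 2x2 table.
counts = {
    ("Classroom", ">100K"): 6,
    ("Classroom", "<100K"): 18,
    ("Online",    ">100K"): 4,
    ("Online",    "<100K"): 72,
}
total = sum(counts.values())  # 100 grads in all

# Joint: P(Classroom AND >100K) - both outcomes together
p_joint = counts[("Classroom", ">100K")] / total

# Marginal: P(>100K) - summed across both programs
p_100k = sum(v for (prog, sal), v in counts.items() if sal == ">100K") / total

# Conditional: P(>100K | Classroom) = joint / marginal of the condition
p_classroom = sum(v for (prog, sal), v in counts.items() if prog == "Classroom") / total
p_cond = p_joint / p_classroom

print(p_joint, p_100k, p_cond)  # 0.06, 0.1, 0.25
```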
Posterior Probability: This needs special attention. In our example, assume that a candidate has to decide between online and classroom programs. Since classroom programs are costlier, he would enroll in one only if it doubled his chances of grabbing a >100K salary. To start with, he is aware of the general opinion that the probability of getting a >100K offer is 10%, established from relative frequency over several years. He uses this as his own chance of making more than 100K; hence his probability of getting a >100K salary is 10%. He wants to know the probability of getting >100K given that he enrolls in a classroom program. In other words, he knows P(>100K) and wants to know P(>100K|Classroom). P(>100K) in this case is called the Prior probability. He then comes across a new study which surveys 100 MBA grads on their program and the salary they got; in our case, this is the data provided in the Excel file. The candidate can revise his own probability by a very smart formula postulated by Bayes, known as the Bayes formula, which can be written in the following form: P(A1|B) = P(B|A1)P(A1) / (P(B|A1)P(A1) + P(B|A2)P(A2)).
In the case mentioned above, A1 is >100K, A2 is <100K, and B is the Classroom program. Please note that P(A1) and P(A2) are still the candidate's own estimates, i.e. 10% and 90% (100 - 10).
Practice this in the Examples sheet; the new probability should be 0.25. Hence his chances of getting more than a 100K salary more than double if he enrolls for the Classroom program (25% vs. 10%). This new estimated probability is called the Posterior Probability, which is the revised value of the Prior Probability in light of new information. Do the calculation yourself in the Excel sheet.
Let us consider a simpler example. It has been estimated that the chance of cancer occurrence in an adult is 1%. This is again a Prior probability, established from the relative frequency of a large number of cases. But one would be interested in knowing one's chances of having cancer after testing positive in a diagnosis; in other words, one wants to estimate P(Cancer|+ve). The most likely further information one may get is the accuracy of the test, which the diagnostic center may state in two ways: the chance of testing +ve given that one has cancer is 95%, and the chance of testing +ve given that one does not have cancer is 10%. Once again the posterior probability can be calculated in light of the new information: (0.01*0.95)/(0.01*0.95 + 0.99*0.10), which is about 9%. Note how far this is from the 95% test accuracy: because the prior is so small, even a positive test leaves the chance of cancer below 10%.
Why are we taking so much trouble to grasp this? The reason is that it is the crux of one of the Smart Techniques used in Analytics, which we will discuss later. Another reason is that the right-hand side of the Bayes formula allows simplification in the case of very large data sets involving several conditions. We have encountered only two conditions in our examples, but we may encounter 20, 100 or many more. Yet another reason to use this formula is that information (data) is easily available for the components on the right-hand side, as we saw for the cancer diagnosis test.
A simple way to remember posterior probability is to think of it as the probability of a preceding event given that a subsequent event has happened. The preceding event may be enrolling in an Online program and the subsequent event success in an MBA program; we then seek the probability that one used the Online program (preceding event) given that one is an MBA grad (subsequent event).
Recall the concept of the Histogram, which is also known as a Frequency Distribution. In other words, it gives the distribution of the count of each value (in fact, each range of values). We just discussed how probability can be calculated by relative frequency. Hence a Histogram can be converted into a Probability Distribution by dividing each bar by the total count (the Sample or Population size). So instead of counts, we have a probability associated with each value in a Probability Distribution.
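The conversion from counts to probabilities is a one-liner; the sample values below are made up for illustration.

```python
# Sketch: turning a frequency distribution (histogram counts) into a
# probability distribution by dividing each bar by the total count.
from collections import Counter

speeds = [42, 45, 45, 47, 50, 50, 50, 52, 55, 55]  # hypothetical sample data

counts = Counter(speeds)                            # the histogram view
total = sum(counts.values())
prob_dist = {value: c / total for value, c in counts.items()}

print(prob_dist[50])  # relative frequency of 50 = 3/10 = 0.3
```

Because each bar is divided by the same total, the probabilities necessarily sum to 1, which is what makes the result a valid probability distribution.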
To be specific, there are three distributions which are of special interest to us. The Normal Distribution is a type of Continuous Distribution; the Binomial and Poisson are two Discrete variable distributions.
Binomial Distribution is applicable in cases where each event (experiment) has binary outcomes. Hence it can be applied where we encounter pass/fail, good/bad, yes/no kinds of situations, for example acceptance/rejection in a product quality test, or selection/rejection of candidates in an exam. If the probability of success in each trial of an experiment is p (each time a product is tested, or a candidate appears for an exam), then the probability of x successes in n trials is given by the following formula. Please note that the Binomial Distribution needs two parameters: n and p. The formula rests on the premise that each trial is independent, i.e. the outcome of one trial does not impact the outcome of another trial. Also, there are only two possible outcomes for each trial.
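The Binomial formula referred to above, P(X = x) = C(n, x) · p^x · (1-p)^(n-x), can be computed directly:

```python
# The Binomial probability of exactly x successes in n independent trials.
from math import comb

def binomial_pmf(x, n, p):
    """C(n, x) * p**x * (1-p)**(n-x): x successes in n binary trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# e.g. probability that exactly 50 of 100 students pass, if each passes with p = 0.5
print(round(binomial_pmf(50, 100, 0.5), 4))
```

Summing the formula over all x from 0 to n gives 1, since some number of successes must occur.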
Poisson Distribution is used in cases where the number of occurrences of a certain event of interest in an interval of time or area (space) is involved, for example the number of customers arriving at a McDonald's drive-thru in a 1-hour window, the number of vehicles passing a traffic intersection each minute, or the number of lions sighted in a 1 sq. km area of a forest. It rests on the premise that the probability of an occurrence is the same in all equal-size intervals (of time or space) and proportional to the interval size. The Poisson Distribution needs one parameter, often denoted by the Greek letter Lambda. It is the mean number of occurrences in the interval of interest, which could be the average number of vehicles crossing an intersection in 1 minute, the average number of accidents on a 1-mile stretch of highway, and so on. The probability of x occurrences in an interval is given by the following formula.
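The Poisson formula, P(X = x) = λ^x · e^(-λ) / x!, is equally short to compute:

```python
# The Poisson probability of x occurrences in an interval with mean rate lam.
from math import exp, factorial

def poisson_pmf(x, lam):
    """lam**x * e**(-lam) / x!: x occurrences when the average is lam."""
    return lam**x * exp(-lam) / factorial(x)

# e.g. probability of exactly 3 accidents in an interval whose average is 2
print(round(poisson_pmf(3, 2), 4))
```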
Normal Distribution is a type of Continuous Distribution which needs two parameters: the Mean and the Std. Deviation, which we have already learnt on the Introductory pages. It is the most widely used distribution and the cornerstone of Inferential Statistics. The probability density of a continuous random variable x is given by the following formula.
Please see examples of all the Probability Distributions in the Probability.xlsx file. The probability charts represent the probability of x for all values in the range of 1 to 100. Hence for the Binomial it can represent the probability of 1, 2, 3 ... 100 students passing an exam, given that there are 100 students and the chance of a student passing the exam is 0.5. Observe how the plot shifts from left to right as we change the probability from a low value (say 0.2) to a high value (0.8). In the same way, the Poisson Distribution plot shows the probability of, say, 1, 2, 3 ... 100 accidents at a traffic intersection if the average number of accidents is 2. Once again observe how the curve shifts from left to right as we change Lambda (the average) from a low value of 2 to a high value of 60.

The broken lines in the Binomial and Poisson charts show Discrete variables, as x can take only whole numbers like 1, 2, 5 etc. and nothing in between. However, the Normal Distribution is a solid line, as x can take any value on the horizontal axis. If we have to find the probability of at most 5 students passing an exam, we can find the probabilities for 0, 1, 2, 3, 4 and 5 students passing the exam and sum them. In a similar fashion, if we need the probability of at most 5 accidents or 5 lions seen, we add the probabilities for x = 0, 1, 2, 3, 4 and 5.

The same is valid for the Normal Distribution too, but with a slight difference. Observe the RHS of the Normal Distribution formula. If we fix (x - mu)/sigma, often called Z, then p(x) is the same. Hence no matter what the mean and std. deviation are, for the same Z the probability is the same. The probability of a single value of x is meaningless for the Normal Distribution, as there are an infinite number of possible values for x and the probability would be 0. Hence we always talk in terms of the probability for a range of values. For all Normal Distributions, the probability of values less than (or greater than) Mean + Z*Sigma is the same. That is the reason why we observe the following.
About 68%, 95% and 99.7% of values always lie within 1 Sigma, 2 Sigma and 3 Sigma on both sides of the mean, respectively.
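The 1/2/3-Sigma rule can be checked with the standard normal CDF. Because the probability depends only on Z, checking the standard normal (mean 0, sigma 1) settles it for every normal distribution.

```python
# Verifying the 68/95/99.7 rule via the standard normal distribution.
from statistics import NormalDist

std = NormalDist(mu=0, sigma=1)

for k in (1, 2, 3):
    p = std.cdf(k) - std.cdf(-k)   # probability within k Sigma of the mean
    print(k, round(p, 4))          # ~0.6827, ~0.9545, ~0.9973
```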
Inferential Statistics is all about drawing a conclusion about a population (parameter) based on a sample (statistic) drawn from it.
We are always interested in knowing something about a Population, but what we use is a Sample, due to time and cost constraints. A sample measurement is never expected to be exactly equal to the same measurement for the population. But with what degree of accuracy and confidence can we conclude something about a population based on a sample? Hence we encounter concepts like Interval, Confidence Level, Error etc. in Inferential Statistics. We will discuss them in detail shortly. There are two Inferential Statistical techniques: Estimation and Hypothesis Tests.
Estimation: Let us consider an example. We want to find the average speed at which cars are driven in our city. We collect samples by measuring car speeds in different zones (say Zip codes) of the city. The sample mean is 45 KMPH. What would give us more confidence?
1 - The average speed in the entire city is 45 KMPH
2 - The average speed in the entire city is within 40 to 50 KMPH.
Though the first option looks attractive, it has drawbacks. As we saw earlier, the probability of a single value on the normal curve is zero; hence the probability of the population mean being exactly 45 is zero. On the other hand, as the sample size increases (the number of cars observed), the sample mean is expected to become more and more representative of the Population. Hence a single value is not used. The other option looks convincing, as we are giving a range of values, provided we can find the range somehow. The Estimation technique is all about finding this range.
Please see an example in the Excel file. A beverage brand designs its bottling plant to fill 300 ml into each bottle. The plant manager wants to check the mean amount of beverage filled in each bottle to ensure that there is no over-filling (loss of revenue) or under-filling (danger of being sued by consumers). As earlier, he can only estimate the amount of beverage filled from a sample. He picks a sample of 30 bottles and measures the amount in each of them. The data is provided in column A of the Estimation tab. How does he find the range? What would give him more confidence?
1- There is 90% probability that the population mean (entire plant) would be in a range of values centered around the sample mean (just calculated).
2- There is 95% probability that the population mean (entire plant) would be in a range of values as above.
3- There is 99% probability that the population mean (entire plant) would be in a range of values as above.
There is no such thing as 100% probability, as the ends of the Normal curve extend to infinity on both sides. Theoretically no range can accommodate all the values, but for practical purposes 99% should be more than enough. Obviously, he feels more confident as the probability (known as the Confidence Level) grows, but there is a trade-off: as the Confidence Level grows, the range also grows, and he starts feeling uncomfortable. Consider an example: there is a very high probability of people having an income in the range of 10,000 to 100,000 USD, but this large range is too big for any practical use. For most practical purposes a 95% Confidence Level is used, although the choice is available to the end user. Now move on to the Excel file to test different confidence levels for the estimation range.
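The estimation step above can be sketched as a z-interval for the mean (a reasonable simplification for n = 30). The 30 sample values below are randomly generated stand-ins for column A of the Estimation tab, which is not reproduced here.

```python
# Hedged sketch: a 95% confidence interval for the mean fill volume.
import random
from statistics import mean, stdev, NormalDist

random.seed(1)
sample = [random.gauss(300, 5) for _ in range(30)]  # hypothetical volumes (ml)

n = len(sample)
xbar, s = mean(sample), stdev(sample)
z = NormalDist().inv_cdf(0.975)       # ~1.96 for a 95% Confidence Level
margin = z * s / n ** 0.5             # half-width of the interval

print(f"{xbar - margin:.1f} ml to {xbar + margin:.1f} ml")
```

Raising the Confidence Level widens the interval: a 99% level would use `inv_cdf(0.995)` (about 2.576) instead, illustrating the trade-off described above.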
Hypothesis Testing is the other Inferential Statistics technique. It refers to a set of methods which can be used to test the validity of some knowledge about a Population (Parameter) based on Sample Statistics. In Estimation, we want to know a population parameter (via an interval). In Hypothesis Testing, however, we already know the population parameter (from our experience, previous data etc.) and want to test whether it still holds (given a different time, location, other factors etc.).
One often-used analogy is a criminal trial. It starts with the hypothesis that the subject (defendant) is innocent. The trial then aims at rejecting this hypothesis based on the evidence produced.
The hypothesis we want to reject is called the Null Hypothesis (the subject is innocent) and is denoted H0. In the event of rejection of H0, the alternative hypothesis (the subject is guilty), denoted H1, is accepted.
Now let us look at the same example of the beverage plant. The plant systems and processes were designed in such a way that each bottle was filled with an average of 300 ml. Rigorous measurements and quality tests were conducted, and it was established that the mean amount of beverage was 300 ml. Now, one year down the line, the manager wants to know whether it still holds true. Essentially, he wants to test the hypothesis that the mean volume in the bottles is 300 ml.
H0: Mean is equal to 300 ml
H1: Mean is not equal to 300 ml
Failing to reject H0 does not prove that the mean is 300 ml; it merely states that there is not enough statistical evidence (provided by the sample) to conclude otherwise. In doing so, where can we commit an error? Either we can reject the null hypothesis when it is actually true (called a Type I Error), or we can accept the null hypothesis when it is false (called a Type II Error). The probability of a Type I Error is called the Significance Level. Do not confuse it with the Confidence Level used earlier in Estimation: the Confidence Level is expressed as a percentage like 90%, 95% etc., while the Significance Level is expressed as a small fraction like 0.05, 0.1 etc. Similar to the Confidence Level, what value of Significance Level should be good enough?
A 5%(0.05) probability of committing Type1 Error
A 1%(0.01) probability of committing Type1 Error
Although 1% looks more accurate, the probability of committing a Type II Error then starts increasing. Hence, like the 95% Confidence Level, a 5% Significance Level is a widely accepted value.
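The bottling-plant test above can be sketched as a one-sample test of H0: mean = 300 ml, using a z-approximation (reasonable for n = 30). The measurements are randomly generated stand-ins for real plant data.

```python
# Hedged sketch of a two-sided test of H0: mean = 300 ml at alpha = 0.05.
import random
from statistics import mean, stdev, NormalDist

random.seed(2)
sample = [random.gauss(302, 5) for _ in range(30)]  # hypothetical volumes (ml)

n = len(sample)
z = (mean(sample) - 300) / (stdev(sample) / n ** 0.5)  # test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))           # two-sided p-value

alpha = 0.05  # Significance Level: accepted probability of a Type I Error
if p_value < alpha:
    print("Reject H0: evidence that the mean has drifted from 300 ml")
else:
    print("Fail to reject H0: not enough evidence against a 300 ml mean")
```

Lowering alpha to 0.01 makes rejection harder, which is exactly the Type I vs. Type II trade-off described above.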
1. Is the Annual Car Sales in a Country dependent on GDP, Total Area, Population, Roadways etc.?
2. What is the magnitude of the impact of GDP, Population etc. on Annual Car Sales?
1. Can I find countries which are similar on GDP, Area, Population, Annual Car Sales etc.?
2. Can I find groups of similar Countries, like High Car Penetration, Affluent Countries etc.?
What we are doing in the 1st case is working with a predefined variable of interest, Annual Car Sales in this case. We want to explore this variable by finding the most important factors (GDP etc.) behind it. The variable of interest is often called the Response, Outcome, Target, Dependent or simply Y variable. Variables like GDP are called Predictors, Independent or simply X variables. Once we meet Objective 2 in the first case, we have a Predictive capability, as we can predict the Car Sales of a country as soon as we have all the X variables. Objective 1 gives an Exploratory Model and Objective 2 a Predictive Model.
This area in Analytics is known by the following names, with the first two more prevalent in academia and the last in the Business World -
1. Supervised Learning
2. Directed Data Mining
3. Predictive Analytics
We have a Y variable (identified by the Business User) and we want to analyze it by means of one or several X variables.
In the 2nd case, we do not have any variable of interest. We are simply exploring the natural groupings of Countries, which we are not aware of at the beginning. We expect the algorithm to provide some interesting groupings which we can make sense of. Hence there are no Y or X variables in this case. This area is broadly called Pattern Finding, Unsupervised Learning or Undirected Data Mining.
The most important Pattern Finding technique is called Clustering, Grouping or Segmentation (the business name). There are other Pattern Finding techniques which have specific applications, like Association Rules, Sequence Analysis etc.
There is another area, called Prescriptive Analytics these days, which consists of several techniques. They are primarily Optimization techniques based on Linear and Integer Programming, Heuristics etc. The reason they are called Prescriptive Analytics is that they have the ability to provide an answer, within constraints, for our stated objective.
In addition, Time-series Analytics is an important area, primarily used for data ordered in time. Although it could also be called Predictive Analytics, it is quite different and has specific techniques. Text Analytics is another emerging area which borrows a lot of concepts from Predictive Analytics and Pattern Finding, but has some unique elements of its own.
We have already learnt that variables can be of two types: Continuous (Numeric) or Categorical (Qualitative). Based on the nature of the Y and X variables, we can pick the appropriate technique as shown in the table below. There are a few techniques which are applicable only in a specific case, like Discriminant Analysis, Naive Bayes etc. On the other hand, there are techniques which are very versatile and can be used in multiple cases. We often have more than one choice, and it is quite common to use two or more techniques and then pick the best on the basis of accuracy and usability.