Of the several definitions of Analytics in use, the following should suffice for most cases. Analytics is a set of techniques that give Structure to large amounts of Information in order to produce actionable Insights.
Structure comes from charts, aggregations, groupings, inferences, combinations, trends, generalizations etc., and Insights come from showing what happened, why it happened, what can happen and what should be done about it.
Ingredients of Analytics
As the picture on the left depicts, analytics means several things put together. Its foundation rests on Statistics, and several of its techniques use Classical Statistical Methods, as we will see later. Several algorithms and techniques that originated decades ago have become extremely important today. That is partly due to the availability of massive data and partly due to the availability of powerful computing infrastructure. We refer to these algorithms and infrastructures as "Smart Software" and "Smart Machines" in the picture.
However, analytics needs a Context: it is always meant to serve a certain objective, and its utility depends to a great extent on the clarity of that objective. In addition, deriving meaningful and valuable information from data needs awareness of the risks and pitfalls in analyzing data, the use of clean data, the ability to simplify a problem and so on. Hence business acumen and common sense should complement the available choice of Data analytics techniques.
The picture at the bottom left depicts the objectives of each of the analytics techniques. The following terms are frequently used to describe the different techniques.
Descriptive Analytics is the as-it-is representation of the data. These techniques give structure to the data, through either visual plots or summarization, and condense the information into a few charts, plots or numerical Measures. They are valuable for finding out what happened in the past and, to a great extent, why it happened. Descriptive Analytics is also called Business Intelligence or Visualization; however, the boundary between it and Predictive Analytics or Clustering is blurring as more sophisticated tools are developed.
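As a minimal sketch of the summarization described above, the following groups a handful of sales records by region and condenses each group into a few numerical measures. The records, the region names and the `summarize` helper are all made up for illustration; a real Business Intelligence tool would do far more.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (region, monthly sales) records, invented for this example.
records = [
    ("North", 120), ("North", 135), ("North", 150),
    ("South", 80), ("South", 95), ("South", 70),
]

def summarize(rows):
    """Group sales by region and condense each group into a few measures."""
    groups = defaultdict(list)
    for region, sales in rows:
        groups[region].append(sales)
    return {
        region: {
            "count": len(values),
            "total": sum(values),
            "mean": mean(values),
            "median": median(values),
        }
        for region, values in groups.items()
    }

summary = summarize(records)
for region, measures in summary.items():
    print(region, measures)
```

The output only describes what happened in each region; it makes no claim about what will happen next.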
Predictive Analytics techniques model the relationship between an outcome and one or more Inputs. These models can be used for two objectives: Inference, Prediction or both. As an example, a model can tell which factors are important in driving the sales of a product, and can also be used to predict future sales with reasonable accuracy, given knowledge of certain inputs. Hence the "What will happen" objective illustrated in the picture.
Prescriptive Analytics refers to a set of techniques that can suggest a decision to be taken under given constraints. They mainly comprise Optimization Techniques, Simulations, Heuristics etc. Heuristics give good but not necessarily optimal solutions; they are developed for specific applications using simpler techniques than their Optimization equivalents. As these techniques can suggest a decision that is very hard to arrive at manually, they get the name "Prescriptive Analytics".
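To make "a decision under given constraints" concrete, here is a small sketch of a product-mix decision. All numbers (profits, labour hours, material, limits) are hypothetical, and the exhaustive search below merely stands in for a proper Optimization technique; it is workable only because the problem is tiny.

```python
# All figures are hypothetical, chosen only to illustrate the idea.
PROFIT = {"A": 30, "B": 50}        # profit per unit sold
LABOUR_HOURS = {"A": 2, "B": 4}    # labour hours needed per unit
MATERIAL_KG = {"A": 3, "B": 2}     # raw material needed per unit
LABOUR_LIMIT = 40                  # labour hours available
MATERIAL_LIMIT = 30                # raw material available

def best_product_mix(max_units=20):
    """Try every whole-number mix and keep the most profitable feasible one."""
    best_mix, best_profit = (0, 0), 0
    for units_a in range(max_units + 1):
        for units_b in range(max_units + 1):
            labour = LABOUR_HOURS["A"] * units_a + LABOUR_HOURS["B"] * units_b
            material = MATERIAL_KG["A"] * units_a + MATERIAL_KG["B"] * units_b
            if labour > LABOUR_LIMIT or material > MATERIAL_LIMIT:
                continue  # this mix violates a constraint
            profit = PROFIT["A"] * units_a + PROFIT["B"] * units_b
            if profit > best_profit:
                best_mix, best_profit = (units_a, units_b), profit
    return best_mix, best_profit

mix, profit = best_product_mix()
print("Units of A and B:", mix, "Profit:", profit)
```

With these made-up numbers the search recommends 4 units of A and 8 units of B for a profit of 520. Real problems have far too many combinations to enumerate, which is where Optimization techniques and Heuristics come in.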
Clustering is one of the most widely used Pattern Finding techniques. It refers to techniques that find natural similarities or groupings in data. Hence it can group a large number of people, customers, locations etc. based on several dimensions, such as socio-economic attributes, purchase behaviour etc. It differs from Predictive Analytics and Prescriptive Analytics in that there is no Business Outcome to model. It can be considered a more sophisticated version of Descriptive Analytics, as the grouping can be based on several dimensions at the same time. Association Rules, Sequence Analysis, Markov Chains etc. are some other Pattern Finding methods.
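The grouping idea can be sketched with k-means, one common clustering algorithm (the text above does not single out any particular algorithm, so this choice is an assumption). The customer data, the starting centroids and the `kmeans` helper are hypothetical, and the features are left unscaled to keep the example short.

```python
def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of the points assigned to it."""
    def nearest(point):
        distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
        return distances.index(min(distances))

    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest(point)].append(point)
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    labels = [nearest(point) for point in points]
    return labels, centroids

# Hypothetical customers: (annual income in thousands, purchases per year).
customers = [(20, 2), (22, 3), (25, 2), (80, 12), (85, 15), (90, 14)]

# Start from two of the customers as initial centroids (an arbitrary choice).
labels, centroids = kmeans(customers, centroids=[customers[0], customers[-1]])
print(labels)
print(centroids)
```

No outcome variable is supplied: the algorithm only sees the two dimensions and returns a group label per customer, which the analyst then has to interpret.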
One often encounters two more ways of categorizing Analytics techniques: Supervised vs. Unsupervised Learning and Parametric vs. Non-parametric models.
Supervised vs. Unsupervised Learning
This is an oft-used way of categorizing Analytics techniques. It can be explained by looking at the following two objectives.
Objective A
1. Do Annual Car Sales in a Country depend on GDP, Total Area, Population, Roadways etc.?
2. What is the magnitude of the impact of GDP, Population etc. on Annual Car Sales?
Objective B
1. Can I find countries which are similar on GDP, Area, Population, Annual Car Sales etc.?
2. Can I find groups of similar Countries, like High Car Penetration, Affluent Countries etc.?
In Objective A, we have a predefined variable of Interest, Annual Car Sales in this case. We want to explore this variable by finding the most important factors (GDP etc.) behind it. The Variable of Interest is often called the Response, Outcome, Target, Dependent or simply Y Variable. Variables like GDP are called Predictors, Independent or simply X Variables. Once we answer question 2 of Objective A, we have a Predictive Capability, as we can predict the Car Sales in a country as soon as we know all its X Variables. Question 1 is served by Exploratory Models and question 2 by Predictive Models. This area of Analytics is known by the following names, the first two being more prevalent in academia and the last in the Business World -
Directed Data Mining
We have a Y Variable (identified by the end user) and we want to analyze it by means of one or several X Variables.
In Objective B, we do not have any variable of interest. We are simply exploring the natural groupings of Countries, which we are not aware of at the beginning. We expect the algorithm to provide some interesting groupings that we can make sense of. Hence there is no Y Variable in this case. This area is broadly known as Pattern Finding, Unsupervised Learning or Undirected Data Mining.
The most important Pattern Finding technique is called Clustering, Grouping or Segmentation. There are other Pattern Finding techniques with specific applications, such as Association Rules, Sequence Analysis etc.
In addition, Time-series Analysis is an important area, used primarily for data (usually a single variable) ordered in time. Although it can also be called Predictive Analytics, it is quite different and has its own specific techniques. Text Analytics is another emerging area, which borrows a lot of concepts from Predictive Analytics and Pattern Finding but has some unique elements of its own.
Parametric vs Non-parametric Models
Another way to describe them is Model Driven vs. Data Driven techniques. In parametric models, we model the relationships between variables by estimating the parameters of the lines, curves, splines, planes etc. that we choose to model the relationship with. As an example, we can describe the relationship between Sales and No. of Sales staff as a Linear function, and estimate the parameters of the line (a and b in Sales = a + b*Staff). Here we start with an assumption (a Linear relationship) and estimate the parameters of the linear function. Hence these models are somewhat rigid, as we impose a relationship on the data. However, the relationship is explicit and easy to understand, and we know its exact form. In addition, these models do not need a large amount of data; the quality of the sample is more important than its size. If properly developed, these models tend to be very robust and replicable on other data sets too. Hence they are also called Model Driven, as the emphasis is on robust models. They are more suitable for Inference or Exploratory analysis, as discussed earlier.
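The Sales = a + b*Staff example can be sketched with ordinary least squares, the usual way of estimating the two parameters of a straight line. The staff and sales figures are hypothetical, and `fit_line` is an illustrative helper rather than a library function.

```python
def fit_line(x, y):
    """Estimate a and b in y = a + b*x by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx              # slope: change in Sales per extra staff member
    a = mean_y - b * mean_x    # intercept: Sales when Staff is zero
    return a, b

# Hypothetical data: number of sales staff and the sales they produced.
staff = [2, 4, 6, 8, 10]
sales = [52, 88, 131, 169, 210]

a, b = fit_line(staff, sales)
print(f"Sales = {a:.2f} + {b:.2f} * Staff")
print("Predicted sales for 12 staff:", round(a + b * 12, 2))
```

The whole model is just the two numbers a and b, which is why the relationship is explicit: b says directly how much Sales change for each additional member of staff, under the assumed linear form.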
In contrast, Non-parametric models do not start with a rigid assumption about the relationship. The relationship is developed gradually, as the algorithm passes through more and more data. This is more like "Programming by example", where we don't instruct the algorithm to perform certain steps but let it learn from the examples or results themselves. As a result, these models are more flexible, but the flexibility comes at a cost. They are something of a Black Box to the end user, who does not know the exact relationship. In the simple example above, No. of Sales staff may turn out to be an important factor in such a model, but we can't quantify its effect (we may not know the parameter b). They are better suited to Prediction than to interpretation (Inference). They rely on the size of the data, and more data usually improves their results. Hence the name Data Driven.
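As an illustrative contrast, here is k-nearest-neighbours regression, one simple Non-parametric technique (chosen here as an example; the text does not name a specific method). The staff and sales figures and the `knn_predict` helper are hypothetical.

```python
def knn_predict(staff_query, staff, sales, k=2):
    """Predict sales as the average of the k observations whose
    staff count is closest to the queried staff count."""
    neighbours = sorted(
        zip(staff, sales),
        key=lambda pair: abs(pair[0] - staff_query),
    )[:k]
    return sum(s for _, s in neighbours) / k

# Hypothetical data: number of sales staff and the sales they produced.
staff = [2, 4, 6, 8, 10]
sales = [52, 88, 131, 169, 210]

prediction = knn_predict(7, staff, sales)
print("Predicted sales for 7 staff:", prediction)
```

The prediction comes straight from the stored examples: no line is assumed and no parameter b is ever estimated, so the model can predict but cannot state how much one extra member of staff is worth. Its predictions also depend heavily on having enough observations near the query point.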
Best Practices in Data Mining (Analytics)
We emphasized the Quality and Size of Data in our discussion above. They are prime considerations in achieving reliable and accurate results from Analytics exercises. Apart from them, identifying the most relevant variables and making the best use of them is another factor. We can put Analytics best practices into the following three broad categories.
Relevancy: Data sets in Practical cases have a large number of variables. However, not all of these variables are useful for a given objective. We should use some simple analyses to judge whether a variable has any relation to the outcome we are interested in. As an example, the Credit Score of a Person may or may not have a relationship with that person's propensity to "Accept" a certain offer. Superfluous variables needlessly increase computation time and lead to over-fitting of the data. There should be a genuine Cause-effect relationship between the two variables, and the predictor must be known before the outcome is. As an example, the number of bids cannot be a predictor of the Closing price in an Auction, as it is not known until the auction closes.
Redundancy: In certain cases, two or more variables point to the same information, and one should use only a subset of them. As an example, instead of using several indicators of a person's purchasing power (Income, Credit Card Spending, Frequent Flyer Miles etc.), one should use only one. Using as few variables as possible without sacrificing the accuracy of the model is widely known as "Dimension Reduction" and is one of the foundations of an accurate and reliable Data Mining result.
Sufficiency: Several Analytics techniques, called Machine Learning algorithms, thrive on the size of the data. In our discussion of these techniques, we will see how they work very effectively when exposed to large data sets. Even Model Driven techniques require sufficient data, and the amount required grows in proportion to the number of variables (dimensions) used for analysis.
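One simple way to screen for the Redundancy described above is to compute the correlation between every pair of candidate variables and flag pairs that move almost in lockstep. This is only a sketch: the data are invented, the 0.9 threshold is an arbitrary assumption, and `pearson` and `redundant_pairs` are illustrative helpers.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def redundant_pairs(variables, threshold=0.9):
    """Return pairs of variable names whose absolute correlation exceeds the threshold."""
    return [
        (name_a, name_b)
        for name_a, name_b in combinations(variables, 2)
        if abs(pearson(variables[name_a], variables[name_b])) > threshold
    ]

# Hypothetical customer data; card spending is a fixed share of income.
variables = {
    "income": [30, 45, 60, 75, 90],
    "card_spend": [3.0, 4.5, 6.0, 7.5, 9.0],
    "age": [25, 52, 31, 47, 38],
}

redundant = redundant_pairs(variables)
print(redundant)
```

Here income and card spending are flagged as carrying the same information, so only one of them needs to enter the model. The same `pearson` helper can serve as a first check of Relevancy by correlating a candidate variable with the outcome, although correlation alone does not establish a Cause-effect relationship.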