Association Rule Mining

**Association Rules or Affinity Analysis **is a study on association of two or more than two items which occur in a single event. Event could be a purchase made at a store, clicks on web pages, calls made by a telephone subscriber and so on. It is more popularly known as “** Market Basket Analysis**”, as one of most widely used applications is analyzing Customer transactions and finding dependencies between different items.

This technique can be applied in cases where dataset comprises of several records, with each record listing all items (grocery, calls, page clicks and so on) in a single event. Applications of Association Rule mining can range from cross-selling in Retail Industry and improving the design of Web pages to detecting anomalous incidences in Aircraft operations. As an example, association between several items could help a retail manager to place high association items together in the store, designing a catalog, promotional campaign or cross-selling of items. On the other hand, finding association of take-off from a particular airport, at certain time, with pilot having certain exeperience and so on to a potential hazardous take-off could boost safety.

Theoretically we would like to assess all possible combinations of items, in pairs, triplets and so on. However for practical datasets, it would lead to huge amount of combinations and computational time. A practical solution is to work in terms of “__ Frequent Item Sets__” where item set is collection of items in a single event.

Frequent item sets are identified by a measure called “__ Support__”, which is proportion of all records which include a certain item set. Hence if [bread, butter] has been purchased in 40 out of 100 transactions, then support of [bread, butter] is 0.4 or 40%. All item sets which have support above a threshold (user defined) constitute the Frequent item set.

The idea of Frequent Item sets can be used for "__ Anomaly Detection__", along with Neural Network, Naive Bayes, Clustering, Nearest Neighbors etc. Items below a certain threshold Support can be construed as Anomalous items, as they are infrequent as compared to other items.

Identifying all frequent item sets, uses a very efficient algorithm called “** Apriori Algorithm**”, which performs this task very fast even on huge datasets, which is frequently encountered in practical scenarios.

First a Frequent Item set with just one item is collected, which satisfy the minimum support criterion. This needs simple count which is, for each item, how many transactions include the item. Frequent Item set for two items would be a subset of one item set, and so on. This makes this algorithm very efficient.

Prime objective of Association Rule mining is to find “__ strong rules__” which signify a strong dependence between antecedent and consequent item sets. The measure used to assess the strength of a rule is called “

Out of 100000 transactions, bread & butter have been bought 1200 times. Out of these 1200 purchases, milk was bought 600 times. The association rule (if bread & butter then milk) has a support of 600/100000 = 0.6% and a confidence of 600/1200 = 50%. Confidence is essentially conditional probability in terms of the item sets.

However confidence can be misleading in certain cases. If two items have very high support, then even if they are independent (no association), we can have a very high confidence for them. As an example, if milk and bananas are bought in almost all the purchases, then confidence of the association (if milk then banana) would be high, even when there is no association between them. One way to circumvent this is to use a measure called “__ Lift Ratio__”. A benchmark confidence assuming independence of antecedent and consequent items is calculated, which is nothing but the support of consequent item. Lift Ratio is calculated by dividing the actual confidence with benchmark confidence. In the above example, if milk was bought 5000 times out of total 100000 transactions, regardless of whether they are bought along with bread & butter or not. In this case benchmark confidence for Milk is 5000/100000 = 5%. Hence the lift in this case for the rule (if bread & butter then milk) is 50/5 = 10. A value above 1 shows that the rule is valuable and higher the value, stronger the rule.

Please download the Example sheet and go through a very simple example. The example lists a sample of just 10 transactions involving some grocery items.

As discussed earlier, the first step is to find the Frequent Item Sets. We first find the single frequent items. In this simple illustration, the support for all the items are shown on the second picture on the right side. The items have been sorted in descending order of their Support. Since "white bread" occurs 7 times in 10 transactions, the support is 0.7 and so on. If we use a cut-off of Support equal to 0.5 (in practical large data sets, it would be a relatively smaller number. Using this criterion, rice and citrus fruit are ruled out from the Frequent Item Sets. In addition to that, any pair or triplet of items involving them would also not be part of the Frequent Item Sets (Apriori Algorithm).

Using the top four items from frequent item sets, we find out the frequent pair of items. Only white bread & tropical fruits emerge as frequent item, satisfying the cut-off of 0.5.

Hence we have 5 (4 single and 1 pair) frequent item sets. The Association Rules would be restricted to these items only. The bottom most picture on the right side shows the calculations of the strong rules. Association rules are "if - then" kind of sequence of items, where item corresponding to "if" are called "antecedent" and those corresponding to "then" are called "consequent".

Let us explore strength of the rule - if "white bread" then "tropical fruit". As discussed earlier, its confidence is number of times both of them occur together in the sequence divided by number of times "white bread" occurs (irrespective to whether "tropical fruit" was bought or not). The confidence for this rule is equal to 5/7 = 0.71. Please note how the confidence changes as the sequence is changed for the same set of items (if "tropical fruit" then "white bread").

However the more effective measure of strength of rule is "Lift" as discussed earlier. The Benchmark Confidence for "Tropical Fruit" is 6/10 = 0.6. Hence the Lift for first rule is 0.71/0.6 = 1.19.

None of the other rules have lift above 1 in this simple case.

Although this is an extremely simple example, the concepts are same even for practical data sets, where it is common to have millions of transactions, involving hundreds of items.