K Nearest Neighbors

KNN is an acronym for K Nearest Neighbors, one of the prominent **data-driven and non-parametric** Predictive Analytics techniques. It belongs to a general family of algorithms known as **Memory-based Reasoning** methods. KNN is very simple, perhaps the simplest of all predictive analytics techniques.

Let us understand this through a simple example. We are organizing a conference on Data Mining, with participants from across the world. Each participant displays a badge showing his name and nationality. One participant is not wearing his badge, and a security officer tries to make a best guess at his nationality. He looks at the participant's height and weight: the participant is more than 6 feet tall and weighs roughly more than 200 lbs. The majority of the people at that conference who are taller than 6 feet and heavier than 200 lbs. are German, although there are people of other nationalities too. The security officer therefore takes a **majority vote** based on these two physical characteristics and adjudges the person to be German.

This appears to be a very crude approach; however, an analogous technique, albeit involving several characteristics and some sophisticated calculations, is a popular Data Mining method called KNN. K is the user-defined number of nearest records used for the majority vote.

Although the outcome in this example is categorical, **KNN can be used for continuous outcomes** too. In that case, we use the average of Y over the K nearest neighbors, instead of taking a majority vote.
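The two modes differ only in how the neighbors' outcomes are combined. A minimal sketch, with made-up neighbor labels and values (the neighbor search itself is omitted here):

```python
from collections import Counter

# Hypothetical outcomes for the K = 3 nearest neighbors of some query record.
neighbor_classes = ["Own", "Own", "Rent"]   # categorical outcome
neighbor_values = [1200.0, 1350.0, 900.0]   # continuous outcome

# Classification: majority vote among the K neighbors.
predicted_class = Counter(neighbor_classes).most_common(1)[0][0]

# Regression: average the neighbors' Y values instead.
predicted_value = sum(neighbor_values) / len(neighbor_values)

print(predicted_class)  # Own
print(predicted_value)  # 1150.0
```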

As with the **Naïve Bayes Classifier**, it is very effective with large data sets and, despite its simplicity, gives good results. One of its most famous uses has been in the personalized **Recommendation Systems** of e-Commerce companies like Amazon. What makes it reliable there is the massive amount of data these organizations have, involving millions of customers.

Let us understand this with a simple example, using data for just 20 households. Each household has two measurements, Income (in 1000s) and number of members (household size), and an outcome on home ownership (Own or Rent).

We need to predict whether a new household with an income of 1400 and 4 members will own or rent a home. An organization could use this to identify a prospect for a Home Loan or Insurance. We find the households most **similar** to this new record and take a majority vote based on our choice of K.

Only the top 10 records are shown in the picture below.

**Similarity is found through the concept of distance**, which is defined in several ways in Data Mining. The most frequently used distance measure for numerical values is the **Euclidean Distance**, given as follows.

d = √[(X1 − U1)² + (X2 − U2)² + …… + (Xn − Un)²]

where d is the Euclidean distance between two records (X1, X2, ……, Xn) and (U1, U2, ……, Un) respectively.
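The Euclidean distance is straightforward to compute directly; a minimal sketch in plain Python, applied to two illustrative household records:

```python
import math

def euclidean_distance(x, u):
    """Euclidean distance between two records of equal length."""
    return math.sqrt(sum((xi - ui) ** 2 for xi, ui in zip(x, u)))

# Two hypothetical records: (income in 1000s, household size).
d = euclidean_distance((1400, 4), (1500, 3))
print(d)
```

Note how the income term (100² = 10000) dwarfs the household-size term (1² = 1), which is exactly why the next paragraph introduces normalization.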

X1, X2, etc. are the measured characteristics, such as income and number of household members. Since the values of income are much larger than the number of members, the differences in income would dominate the calculation. Hence the **measurements are normalized**. One frequently used normalization technique is to **subtract from each X its mean and divide by the standard deviation (the Z value)**. Please see how the values are normalized in the Example Sheet.

With this calculation, we find the distance of each household from the new household. These distances are sorted in ascending order, and the K smallest are selected. For various values of K, we then use a majority vote to predict the ownership of the new household. Conveniently, in this case the prediction comes out as "Own" for K = 3, 4 and 5.
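The whole procedure, normalize, compute distances, sort, take a majority vote, can be sketched in plain Python. The households below are hypothetical stand-ins in the spirit of the example, not the 20 records from the Example Sheet:

```python
import math
from collections import Counter

# Hypothetical training data: (income in 1000s, household size) -> outcome.
train = [
    ((1600, 5), "Own"), ((1500, 4), "Own"), ((1450, 5), "Own"),
    ((1300, 4), "Own"), ((900, 3), "Rent"), ((850, 2), "Rent"),
    ((700, 2), "Rent"), ((1100, 3), "Rent"),
]
new = (1400, 4)

# Per-predictor mean and standard deviation from the training data.
def column_stats(j):
    vals = [x[j] for x, _ in train]
    mean = sum(vals) / len(vals)
    std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
    return mean, std

stats = [column_stats(j) for j in range(len(new))]

# Z-value normalization: subtract the mean, divide by the standard deviation.
def normalize(x):
    return tuple((x[j] - m) / s for j, (m, s) in enumerate(stats))

# Distance of every household from the new one, sorted ascending.
dists = sorted((math.dist(normalize(x), normalize(new)), y) for x, y in train)

# Majority vote among the K nearest neighbors.
def predict(k):
    return Counter(y for _, y in dists[:k]).most_common(1)[0][0]

print(predict(3))  # Own, for this hypothetical data
```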

However, the choice of K normally needs some iteration and awareness of a trade-off. K is usually taken as an odd number to prevent ties. If K is very small, we may be picking up the noise in the data (random occurrences). If K is very large, we would be over-smoothing and may fail to capture the local structure. K is usually found by calculating accuracy measures (Precision, Recall, Misclassification rate, etc.) on **validation data** for several values of K; the K corresponding to the minimum validation error is taken as the best K. The **extreme case is K = n**, where n is the total number of records. In that case the distance calculation is immaterial, and we could simply use the majority class as the prediction (the Naive Rule).
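The search for the best K is just a loop over candidate values, scoring each on held-out data. A sketch with a tiny made-up 1-D data set (names and values are illustrative, not from the Example Sheet):

```python
from collections import Counter

# Hypothetical 1-D training and validation records: (x, class).
train = [(1.0, "A"), (1.2, "A"), (1.4, "A"), (3.0, "B"), (3.2, "B"), (3.4, "B")]
valid = [(1.1, "A"), (1.3, "A"), (3.1, "B"), (3.3, "B")]

def predict(x, k):
    # Majority vote among the k training points closest to x.
    neighbors = sorted(train, key=lambda t: abs(t[0] - x))[:k]
    return Counter(y for _, y in neighbors).most_common(1)[0][0]

def validation_error(k):
    # Misclassification rate on the validation set.
    wrong = sum(predict(x, k) != y for x, y in valid)
    return wrong / len(valid)

# Try odd values of K (to prevent ties) and keep the lowest-error one.
best_k = min([1, 3, 5], key=validation_error)
print(best_k)
```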

Although KNN is very simple in its approach, the distance calculations and the search for the best K can be computationally expensive for large training data. The following **improvements** are used to overcome this practical problem.

**Dimension reduction**: In general, reducing the number of predictors is a best practice, as we discussed earlier (the Curse of Dimensionality). We can complement KNN with **Principal Component Analysis, Decision Trees, ANOVA, the Chi-square test**, etc. to reduce the dimensions.

**Removal of redundant records**: There may be redundant data points which are surrounded by records of the same class and do not affect classification accuracy if removed from the calculations. For example, if several income values between 1000 and 1100 all belong to the same class (Rent), it may be worthwhile to remove the points between the two boundaries.
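One simple pruning heuristic in this spirit (a sketch on made-up 1-D data, not a prescription from the text): drop a record whenever all of its own K nearest neighbors share its class, since such an interior point does not change any local majority vote, while points near a class boundary survive.

```python
# Hypothetical 1-D records: (income in 1000s, class). Interior points of each
# class cluster are redundant; the two boundary-facing points are not.
K = 3
train = [(1.0, "Rent"), (1.1, "Rent"), (1.2, "Rent"), (1.5, "Rent"),
         (1.8, "Own"), (2.1, "Own"), (2.2, "Own"), (2.3, "Own")]

def neighbors(i):
    # K nearest training records to record i, excluding record i itself.
    others = [t for j, t in enumerate(train) if j != i]
    return sorted(others, key=lambda t: abs(t[0] - train[i][0]))[:K]

# Keep only records whose neighborhood is NOT purely their own class.
pruned = [
    t for i, t in enumerate(train)
    if not all(y == t[1] for _, y in neighbors(i))
]
print(pruned)  # only the two boundary points remain
```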