K Nearest Neighbors (KNN) Algorithms — A Quick Overview

Oghenevwede Arnold Agboro-Jimoh
5 min read · Oct 30, 2020
A quick overview of the K Nearest Neighbors Algorithms

Data science has found roots in almost every aspect of our daily lives. From combining datasets for philanthropic work all the way to KMEANS and DBSCAN clustering for building data solutions, you’ll find a wide range of algorithms in use. One prime algorithm to note among them is the K-Nearest Neighbors algorithm (or KNN).

This is perhaps one of the most widely used (though not always the preferred) learning algorithms for ML applications because of the simplicity it offers. That simplicity, however, comes with several trade-offs.

Here, I’ll give you a brief overview of what the K Nearest Neighbors algorithm is, what it does, and more, to give your ventures into machine learning a jump start.

K Nearest Neighbors Algorithm — What Is It?

You could say that KNN is one of the “laziest” machine learning algorithms out there. It is a non-parametric algorithm, meaning it makes no assumptions about the form of the underlying mapping function; hence the reliability. To make a prediction about a new sample point, the algorithm compares it against the labeled examples it has stored and classifies it according to the classes of its nearest neighbors.

Because of its ‘laziness’ (the algorithm builds no model up front; it simply stores the training points and defers the work to prediction time, rather than just ‘throwing answers out there’), you can expect very little to no training phase, which makes getting a working model relatively fast. Compared to a linear regression model, KNN makes no assumptions about the data and instead examines the data directly, without prior knowledge, when coming up with an answer.
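For a sense of how little code this takes in practice, here is a minimal sketch using scikit-learn’s KNeighborsClassifier; the tiny dataset and the choice of k = 3 are made up purely for illustration:

```python
# A minimal sketch using scikit-learn's KNeighborsClassifier.
# The tiny dataset below is invented purely for illustration.
from sklearn.neighbors import KNeighborsClassifier

# Two features per sample, two classes (0 and 1)
X_train = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y_train = [0, 0, 1, 1]

# "Training" is just storing the data -- this is the lazy part
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# All the real work happens at prediction time
print(knn.predict([[1.2, 1.9]]))  # -> [0]
```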

A rough example of KNN usage is trying to estimate the results of a census before it has actually taken place, but keep in mind that this is a very loose analogy.

How KNN Algorithms Work

Example 1

Let’s say you take two images from Google Maps; one of Europe (Exhibit A) and one of the Sahara Desert (Exhibit B).

Exhibit A
Exhibit B

Now, if you apply the algorithm to these two images and then show it a picture of anything green, it will identify it with Exhibit A. If, on the other hand, you insert an image of a Golden Retriever into the mix and ask the algorithm to identify it, it will suggest that it is a part of Exhibit B.

Another Example

Now, let’s say you have an image of a cat (Exhibit C) and a dog (Exhibit D) in somewhat similar colors.

Exhibit C
Exhibit D

Next, you submit another photo, Exhibit E, to the algorithm and ask it to identify both animals.

Exhibit E

The algorithm will be able to identify the dog and the cat very quickly. This isn’t because it has learned which animal is which, but because in Exhibit D, which we labeled as the dog, the dog had its tongue out. The tongue is going to be the determining factor here.

Exhibit F

If you add Exhibit F into the mix without adding Exhibit E, the algorithm will identify the horse as a dog (Exhibit D) because of the similarities in the two photos (the white stroke and tongue-like color between the nostrils).

Here is a diagram to show how the algorithm works for better clarity:

Source: HelloACM

Here, k is the parameter that sets the number of nearest neighbors the algorithm looks at. Given a new data point, the algorithm finds the k closest points in the stored data and uses them to decide which class the new point belongs to, which is the answer you’re looking for.

When it comes to classifying a subject into a category, the meat and potatoes of K-nearest is basically a vote (or several). The more of the k neighbors that vote for a given category, the likelier the sample is to be classified into that group.

In the cat/dog example, Exhibit F was classified as Exhibit D because it had at least two votes in favor of D: one for the tongue-like color, and another for, say, its background. The color could also count somewhat toward Exhibit C, but the other votes formed the majority for D.

The similarity between two points is measured by the distance between them, and the image is classified according to its k closest neighbors. The most popular distance measure used with the KNN algorithm these days is the Euclidean distance.
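To make the distance-and-vote mechanics concrete, here is a from-scratch sketch with Euclidean distance and a simple majority vote; the points and labels are hypothetical, chosen just to illustrate the idea:

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by a majority vote among its k nearest neighbors."""
    # Compute the distance from the query to every stored point
    distances = [(euclidean(p, query), label)
                 for p, label in zip(train_points, train_labels)]
    # Keep the k closest points and let their labels vote
    nearest = sorted(distances)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-class data, purely for illustration
points = [[1, 1], [2, 1], [8, 9], [9, 8]]
labels = ["cat", "cat", "dog", "dog"]
print(knn_predict(points, labels, [1.5, 1.2], k=3))  # -> "cat"
```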

There are other measures as well, such as the Manhattan, Minkowski, and Hamming distances.
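For reference, rough sketches of those alternative measures might look like this (in practice a library such as SciPy already provides them):

```python
import math

def manhattan(a, b):
    # Sum of absolute coordinate differences ("city block" distance)
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p=3):
    # Generalizes the others: p=1 gives Manhattan, p=2 gives Euclidean
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def hamming(a, b):
    # Number of positions where the values differ (for categorical features)
    return sum(x != y for x, y in zip(a, b))
```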

Pros & Cons of the KNN Algorithm

Pros

1. Simple and straightforward

2. Flexible

3. Handles multi-class cases with ease

4. If enough data is present in the different classes, the results can be very reliable.

Cons

1. The number of nearest neighbors (k) has to be chosen manually.

2. Every candidate k value must be evaluated independently, which increases computation cost.

3. Data storage costs can get very high.

4. Depending on the distance function used, the results can vary.

5. Setting it up relies on trial and error, since there is no training phase that learns parameters for you.

Depending on the k value you choose, the results can vary. Smaller k values make the results more sensitive to noise, so the variation between results is higher and the output can be rather unreliable.

This might suggest that a higher k value is always better. Yes, the results become more reliable, with less variance, but if you choose a higher k value, the computational expense also increases.
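One common way to handle that trade-off is to try a range of k values and compare them with cross-validation. Here is a sketch assuming scikit-learn and its bundled iris dataset:

```python
# Comparing several k values by 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

for k in (1, 3, 5, 11, 21):
    knn = KNeighborsClassifier(n_neighbors=k)
    score = cross_val_score(knn, X, y, cv=5).mean()
    print(f"k={k:>2}  mean accuracy={score:.3f}")
```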
