K Nearest Neighbor Classifier Explained: A Visual Guide with Code Examples for Beginners

The friendly neighbor approach to machine learning

All illustrations in this article were created by the author and contain licensed design elements from Canva Pro.

Imagine a method that makes predictions by looking at the most similar examples it has seen before. This is the essence of the Nearest Neighbor Classifier — a simple yet intuitive algorithm that adds a touch of real-world logic to machine learning.

While the dummy classifier sets the minimum performance standard, the Nearest Neighbor approach mimics how we often make decisions in everyday life: by remembering similar past experiences. It’s like asking your neighbors how they dressed for today’s weather to determine what you should wear. In the domain of data science, this classifier examines the nearest data points to make its predictions.

Definition

A K Nearest Neighbor classifier is a machine learning model that makes predictions based on the majority class of the K nearest data points in the feature space. The KNN algorithm assumes that similar things exist nearby, making it intuitive and easy to understand.

Nearest neighbor methods are one of the simplest algorithms in machine learning.

📊 Dataset used

In this article, we use this simple artificial golf dataset (inspired by (1)) as an example. The dataset predicts whether someone will play golf based on weather conditions. It contains the features outlook, temperature, humidity, and wind, and the target variable is whether or not someone will play golf.

Columns: ‘Outlook’, ‘Temperature’, ‘Humidity’, ‘Wind’ and ‘Play’ (target feature)

# Import libraries
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

# Make the dataset
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy', 'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast', 'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
original_df = pd.DataFrame(dataset_dict)

print(original_df)

The KNN algorithm requires the data to be preprocessed first: convert the categorical columns to 0 & 1 and scale the numerical features so that no single feature dominates the distance metric.

The categorical column (Outlook) is one-hot encoded and the boolean column (Wind) is converted to 0 & 1, while the numerical columns are scaled using standard scaling (z-normalization). The scaler is fit on the training set and then applied to the test set.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Preprocess data
df = pd.get_dummies(original_df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)
df = df[['sunny','rainy','overcast','Temperature','Humidity','Wind','Play']]

# Split data and standardize features
X, y = df.drop(columns="Play"), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

scaler = StandardScaler()
float_cols = X_train.select_dtypes(include=['float64']).columns
X_train[float_cols] = scaler.fit_transform(X_train[float_cols])
X_test[float_cols] = scaler.transform(X_test[float_cols])

# Print results
print(pd.concat([X_train, y_train], axis=1).round(2), '\n')
print(pd.concat([X_test, y_test], axis=1).round(2), '\n')

Main mechanism

The KNN classifier works by finding the K nearest neighbors of a new data point and then voting for the most common class among these neighbors. Here’s how it works:

  1. Calculate the distance between the new data point and all points in the training set.
  2. Select the K nearest neighbors based on these distances.
  3. Take a majority vote from the classes of these K-neighbors.
  4. Assign the majority class to the new data point.

For our golf dataset, a KNN classifier could look at the 5 most similar weather conditions in the past to predict whether someone will play golf today.
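Before turning to scikit-learn, these four steps can be sketched by hand. The snippet below is a minimal, illustrative implementation using NumPy; the names knn_predict and query_point are introduced here only for this sketch and are not part of the article’s pipeline.

import numpy as np
from collections import Counter

def knn_predict(X_train_arr, y_train_arr, query_point, k=5):
    # 1. Distance from the new point to every training point (Euclidean)
    distances = np.linalg.norm(X_train_arr - query_point, axis=1)
    # 2. Indices of the K nearest neighbors
    nearest_idx = np.argsort(distances)[:k]
    # 3. Majority vote among the labels of these K neighbors
    votes = Counter(y_train_arr[nearest_idx])
    # 4. Assign the majority class to the new data point
    return votes.most_common(1)[0][0]

# Hypothetical usage with the preprocessed data from the previous section:
# knn_predict(X_train.to_numpy(), y_train.to_numpy(), X_test.to_numpy()[0], k=5)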

Training steps

Unlike many other algorithms, KNN does not have a separate training phase. Instead, it remembers the entire training dataset. Here is the process:

  1. Choose a value for K (the number of neighbors to consider).

In a 2D setting, it’s like looking for the closest colors around a new point.

from sklearn.neighbors import KNeighborsClassifier

# Select the Number of Neighbors ('k')
k = 5

2. Select a distance measure (e.g. Euclidean distance, Manhattan distance).

The most common distance measure is Euclidean distance. This is just like finding the straight line distance between two points in the real world.

import numpy as np

# Choose a Distance Metric
distance_metric="euclidean"

# Calculate the distance between ID 0 and ID 1
print(np.linalg.norm(X_train.loc[0].values - X_train.loc[1].values))

3. Save/remember all training data points and their associated labels.

# Initialize the k-NN Classifier
knn_clf = KNeighborsClassifier(n_neighbors=k, metric=distance_metric)

# "Train" the kNN (although no real training happens)
knn_clf.fit(X_train, y_train)

Classification steps

Once the Nearest Neighbor Classifier has been ‘trained’ (i.e., the training data has been saved), it makes predictions for new instances in the following manner:

  1. Distance calculation: For the new instance, calculate the distance to all saved training instances using the chosen distance metric.

For ID 14, we calculate the distance to each member of the training set (ID 0 to ID 13).

from scipy.spatial import distance

# Compute the distances from the first row of X_test to all rows in X_train
distances = distance.cdist(X_test.iloc[0:1], X_train, metric='euclidean')

# Create a DataFrame to display the distances
distance_df = pd.DataFrame({
    'Train_ID': X_train.index,
    'Distance': distances[0].round(2),
    'Label': y_train
}).set_index('Train_ID')

print(distance_df.sort_values(by='Distance'))

2. Neighbor selection and prediction: Identify the K nearest neighbors based on the calculated distances, and then assign the most common class among these neighbors as the predicted class for the new instance.

After calculating the distance to all stored data points and sorting them from lowest to highest, we identify the 5 nearest neighbors (top 5). If the majority (3 or more) of these neighbors are labeled as “NO”, we predict “NO” for ID 14.
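To make this selection and vote explicit, here is a small sketch that reuses the distance_df computed above and simply takes the most common label among the 5 closest training rows (k = 5, as chosen earlier).

# Pick the 5 nearest neighbors from the sorted distance table
nearest_5 = distance_df.sort_values(by='Distance').head(5)
print(nearest_5)

# Majority vote: the most common label among these 5 neighbors
# (0 = 'No', 1 = 'Yes' after the preprocessing above)
manual_prediction = nearest_5['Label'].mode()[0]
print("Manual prediction for ID 14:", manual_prediction)

The built-in KNeighborsClassifier carries out the same neighbor selection and vote internally: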

# Use the k-NN Classifier to make predictions
y_pred = knn_clf.predict(X_test)
print("Label :",list(y_test))
print("Prediction:",list(y_pred))

Evaluation step

With this simple model, we already achieve decent accuracy, much better than if we were to guess randomly!

from sklearn.metrics import accuracy_score

# Evaluation Phase
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {round(accuracy, 4)*100}%')

Main parameters

Although KNN is conceptually simple, it does have a few important parameters:

  1. K: The number of neighbors to consider. A smaller K may lead to noise-sensitive results, while a larger K may smooth out the decision boundary.

The higher the value of k, the more likely it is that the majority class will be selected (“YES”).

labels, predictions, accuracies = list(y_test), [], []

k_list = [3, 5, 7]
for k in k_list:
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    y_pred = knn_clf.predict(X_test)
    predictions.append(list(y_pred))
    accuracies.append(round(accuracy_score(y_test, y_pred), 4)*100)

df_predictions = pd.DataFrame({'Label': labels})
for k, pred in zip(k_list, predictions):
    df_predictions[f'k = {k}'] = pred

df_accuracies = pd.DataFrame({'Accuracy ': accuracies}, index=[f'k = {k}' for k in k_list]).T

print(df_predictions)
print(df_accuracies)

2. Distance metric: This determines how the similarity between points is calculated (a short comparison sketch follows the list of options below). Common options include:

  • Euclidean distance (straight-line distance)
  • Manhattan distance (sum of absolute differences)
  • Minkowski distance (a generalization of Euclidean and Manhattan distances)
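As a rough illustration of how these metrics differ, the snippet below computes each of them between the first two standardized training rows; the Minkowski exponent p=3 is an arbitrary choice made here only for illustration.

from scipy.spatial import distance

# Two standardized training points to compare
a, b = X_train.iloc[0].values, X_train.iloc[1].values

print("Euclidean:", round(distance.euclidean(a, b), 2))             # straight-line distance
print("Manhattan:", round(distance.cityblock(a, b), 2))             # sum of absolute differences
print("Minkowski (p=3):", round(distance.minkowski(a, b, p=3), 2))  # generalizes both (p=2 is Euclidean, p=1 is Manhattan)

The chosen metric can be passed to the classifier via its metric parameter, e.g. KNeighborsClassifier(n_neighbors=5, metric='manhattan').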

3. Weight function: This determines how each neighbor’s contribution is weighted (see the short sketch after this list). Options include:

  • ‘uniform’: All neighbors are given equal weight.
  • ‘distance’: Closer neighbors have a greater influence than those that are farther away.
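As a minimal sketch of the difference, the snippet below fits the classifier once with each weighting scheme on our small golf dataset and prints both accuracies; no claim is made here about which scheme performs better in general.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Compare uniform vs. distance weighting with k = 5
for weights in ['uniform', 'distance']:
    clf = KNeighborsClassifier(n_neighbors=5, weights=weights)
    clf.fit(X_train, y_train)
    print(f"weights='{weights}': accuracy = {accuracy_score(y_test, clf.predict(X_test)):.2f}")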

Pros and cons

Like any algorithm in machine learning, KNN has its strengths and weaknesses.

Advantages:

  1. Simplicity: Easy to understand and implement.
  2. No assumptions: Does not make any assumptions about data distribution.
  3. Versatility: Can be used for both classification and regression tasks.
  4. No training phase: Can quickly process new data without retraining.

Disadvantages:

  1. Computationally expensive: Distances to all training samples must be calculated for each prediction.
  2. Memory intensive: Requires that all training data be saved.
  3. Sensitive to irrelevant features: May be confounded by features not important for classification.
  4. Curse of Dimensionality: Performance degrades in high-dimensional spaces.

Final remarks

The K-Nearest Neighbors (KNN) classifier stands out as a foundational algorithm in machine learning, providing an intuitive and effective approach to classification tasks. Its simplicity makes it an ideal starting point for beginners, while its versatility ensures its value to experienced data scientists. The power of KNN lies in its ability to make predictions based on the proximity of data points, without the need for complex training processes.

However, it is crucial to remember that KNN is just one tool in a vast machine learning toolkit. As you progress in your data science journey, you will use KNN as a stepping stone to understand more complex algorithms, always taking your specific data characteristics and problem requirements into account when choosing a model. By mastering KNN, you will gain valuable insights into classification techniques, laying a solid foundation for tackling more advanced machine learning challenges.

🌟 k Nearest Neighbor Classification Code Summarized

# Import libraries
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Load data
dataset_dict = {
    'Outlook': ['sunny', 'sunny', 'overcast', 'rainy', 'rainy', 'rainy', 'overcast', 'sunny', 'sunny', 'rainy', 'sunny', 'overcast', 'overcast', 'rainy', 'sunny', 'overcast', 'rainy', 'sunny', 'sunny', 'rainy', 'overcast', 'rainy', 'sunny', 'overcast', 'sunny', 'overcast', 'rainy', 'overcast'],
    'Temperature': [85.0, 80.0, 83.0, 70.0, 68.0, 65.0, 64.0, 72.0, 69.0, 75.0, 75.0, 72.0, 81.0, 71.0, 81.0, 74.0, 76.0, 78.0, 82.0, 67.0, 85.0, 73.0, 88.0, 77.0, 79.0, 80.0, 66.0, 84.0],
    'Humidity': [85.0, 90.0, 78.0, 96.0, 80.0, 70.0, 65.0, 95.0, 70.0, 80.0, 70.0, 90.0, 75.0, 80.0, 88.0, 92.0, 85.0, 75.0, 92.0, 90.0, 85.0, 88.0, 65.0, 70.0, 60.0, 95.0, 70.0, 78.0],
    'Wind': [False, True, False, False, False, True, True, False, False, False, True, True, False, True, True, False, False, True, False, True, True, False, True, False, False, True, False, False],
    'Play': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No', 'Yes']
}
df = pd.DataFrame(dataset_dict)

# Preprocess data
df = pd.get_dummies(df, columns=['Outlook'], prefix='', prefix_sep='', dtype=int)
df['Wind'] = df['Wind'].astype(int)
df['Play'] = (df['Play'] == 'Yes').astype(int)

# Split data
X, y = df.drop(columns="Play"), df['Play']
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.5, shuffle=False)

# Standardize features
scaler = StandardScaler()
float_cols = X_train.select_dtypes(include=['float64']).columns
X_train[float_cols] = scaler.fit_transform(X_train[float_cols])
X_test[float_cols] = scaler.transform(X_test[float_cols])

# Train model
knn_clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
knn_clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = knn_clf.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")

Read more

For a detailed explanation of the KNeighborsClassifier and its implementation in scikit-learn, readers can refer to the official documentation (2), which provides extensive information on its usage and parameters.

Technical environment

This article uses Python 3.7 and scikit-learn 1.5. While the concepts discussed are generally applicable, specific code implementations may vary slightly between versions.

About the illustrations

Unless otherwise stated, all images are created by the author and contain licensed design elements from Canva Pro.

Previous articles by author

Dummy Classifier Explained: A Visual Guide with Code Examples for Beginners

References

(1) T. M. Mitchell, Machine Learning (1997), McGraw-Hill Science/Engineering/Mathematics, p. 59

(2) F. Pedregosa et al., Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011. [Online]. Available: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

