The k-nearest neighbors (KNN) algorithm is a simple, intuitive, non-parametric supervised learning method used for both classification and regression tasks.
In the context of classification:
- Training: The algorithm memorizes the feature vectors and corresponding class labels of the training data.
- Prediction: To predict the class of a new data point, the algorithm finds the k nearest neighbors of that point in feature space according to a distance metric (e.g., Euclidean distance); a direct calculation is sketched below.
- Majority voting: The algorithm assigns the class label that is most common among the k nearest neighbors. Ties may be broken randomly or by other rules.
- Parameter tuning: The choice of k, the number of neighbors to consider, is crucial in KNN. Larger values of k produce smoother decision boundaries but can miss fine-grained patterns, while smaller values of k follow the data more closely and are therefore more sensitive to noise. A common approach is to compare several candidate values of k with cross-validation, as sketched after the code example below.
KNN is a non-parametric method, meaning it does not assume any underlying probability distributions of the data. It is also instance-based, as it does not explicitly learn a model during training but rather stores the entire training dataset for prediction.
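Because all of the computation happens at prediction time, a single prediction can be worked out directly. The following is a minimal sketch, not how scikit-learn implements it; the toy points, the labels, and the helper name knn_predict are made up for illustration. It computes Euclidean distances by hand, selects the k nearest training points, and takes a majority vote:
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every training point
    distances = [np.sqrt(np.sum((np.array(x) - np.array(x_new)) ** 2)) for x in X_train]
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of the k nearest neighbors
    # (Counter breaks ties by the order in which labels are first seen)
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration
X_train = [[1, 2], [3, 4], [5, 6], [7, 8]]
y_train = ['a', 'b', 'a', 'b']
# Distances from [2, 3]: ~1.41, ~1.41, ~4.24, ~7.07
# The 3 nearest neighbors have labels 'a', 'b', 'a', so the vote gives 'a'
print(knn_predict(X_train, y_train, [2, 3], k=3))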
Here’s a simple example of using KNN for classification in Python using scikit-learn:
from sklearn.neighbors import KNeighborsClassifier
# Create a KNN classifier with k=3
knn = KNeighborsClassifier(n_neighbors=3)
# Training data (features and labels)
X_train = [[1, 2], [3, 4], [5, 6]]
y_train = ['a', 'b', 'a']
# Fit the classifier to the training data
knn.fit(X_train, y_train)
# Test data
X_test = [[2, 3], [4, 5]]
# Predict the class labels for the test data
y_pred = knn.predict(X_test)
print("Predicted labels:", y_pred)
TODO: add examples and a direct calculation.