
Implementing K-Means Clustering Algorithm from Scratch in Python

Introduction

K-means clustering is a widely used unsupervised learning algorithm that partitions a dataset into K clusters, placing each point in the cluster whose mean (centroid) is nearest. It's simple, yet effective at revealing group structure in unlabeled data. In this tutorial, we'll implement K-means from scratch in Python using only NumPy, without relying on machine learning libraries like scikit-learn. This will help us understand the mathematics and logic behind the algorithm.

The K-means clustering algorithm works by initializing K centroids randomly, then assigning each data point to the closest centroid. Each centroid is then moved to the mean of the data points assigned to it. These two steps are repeated until the centroids stop moving or another stopping criterion, such as a maximum number of iterations, is met.
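
The two alternating steps can be written compactly. The notation below is an assumption introduced here for illustration, not taken from the text: x_i is a data point, \mu_k is the centroid of cluster k, C_k is the set of points currently assigned to cluster k, and c_i is the cluster index of x_i. Both steps never increase the within-cluster sum of squared distances J.

```latex
\begin{align}
J &= \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
  && \text{(quantity being reduced)} \\
c_i &= \arg\min_{k} \lVert x_i - \mu_k \rVert^2
  && \text{(assignment step)} \\
\mu_k &= \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
  && \text{(update step)}
\end{align}
```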

Understanding the K-Means Clustering Algorithm

Key Components of the Algorithm

The K-means clustering algorithm consists of the following key components:

  • Centroids: The representative point of each cluster, computed as the mean of the points assigned to it.
  • Data points: The individual observations to be clustered, each represented as a numeric vector.
  • Distance metric: The function used to measure how far each data point is from each centroid.
  • Stopping criterion: The rule that decides when the algorithm stops iterating.

In this implementation, we'll use the Euclidean distance metric and a stopping criterion based on the convergence of the centroids.
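
As a quick illustration of the Euclidean distance, here is a minimal sketch using two made-up 2-D points (the values and variable names are assumptions chosen so the result is easy to verify by hand). It computes the distance manually and with `np.linalg.norm`, which should agree.

```python
import numpy as np

# Two hypothetical 2-D points chosen so the answer is easy to check by hand.
point = np.array([3.0, 4.0])
centroid = np.array([0.0, 0.0])

# Square the coordinate differences, sum them, and take the square root.
manual = np.sqrt(np.sum((point - centroid) ** 2))

# np.linalg.norm computes the same Euclidean (L2) norm of the difference.
builtin = np.linalg.norm(point - centroid)

print(manual, builtin)  # 5.0 5.0
```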

Implementing the K-Means Clustering Algorithm in Python

Initializing the Centroids and Data Points

We'll start by setting up the centroids and data points. We'll use the NumPy library to generate both at random, with coordinates drawn uniformly from [0, 1).


import numpy as np

# Set the number of clusters (K) and the number of data points
K = 3
n_data_points = 100

# Generate random centroids (for illustration only; run_kmeans below
# draws its own initial centroids)
centroids = np.random.rand(K, 2)

# Generate random data points
data_points = np.random.rand(n_data_points, 2)

Next, we'll define a function to calculate the Euclidean distance between each data point and the centroids.


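def calculate_distance(data_points, centroids):
    # distances[i, j] holds the Euclidean distance from point i to centroid j
    distances = np.zeros((len(data_points), len(centroids)))
    for i in range(len(data_points)):
        for j in range(len(centroids)):
            # Square root of the sum of squared coordinate differences
            distances[i, j] = np.sqrt(np.sum((data_points[i] - centroids[j]) ** 2))
    return distances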
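
The nested loops above are easy to follow but slow on large datasets. A possible alternative, sketched here under the hypothetical name `calculate_distance_vectorized`, uses NumPy broadcasting to compute the same distance matrix without Python loops. The toy arrays are made up for the example.

```python
import numpy as np

def calculate_distance_vectorized(data_points, centroids):
    # data_points: (n, d) -> (n, 1, d); centroids: (k, d) -> (1, k, d).
    # Broadcasting the subtraction yields an (n, k, d) array of differences.
    diff = data_points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    # Sum squared differences over the coordinate axis to get shape (n, k).
    return np.sqrt(np.sum(diff ** 2, axis=2))

# Hypothetical toy inputs: two points and two centroids in 2-D.
points = np.array([[0.0, 0.0], [3.0, 4.0]])
centers = np.array([[0.0, 0.0], [6.0, 8.0]])

distances = calculate_distance_vectorized(points, centers)
print(distances)
```

The trade-off is memory: the intermediate array has n × k × d entries, so for very large inputs the looped version or a chunked variant may be preferable.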

Assigning Data Points to Clusters and Updating Centroids

Assigning Data Points to Clusters

We'll assign each data point to the cluster with the closest centroid.


def assign_clusters(data_points, centroids):
    distances = calculate_distance(data_points, centroids)
    # For each point (row), take the index of the nearest centroid (column)
    cluster_assignments = np.argmin(distances, axis=1)
    return cluster_assignments

Next, we'll update the centroids to be the mean of all data points assigned to each cluster.


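def update_centroids(data_points, cluster_assignments, K):
    # Match the dimensionality of the data instead of hard-coding 2
    centroids = np.zeros((K, data_points.shape[1]))
    for i in range(K):
        cluster_data_points = data_points[cluster_assignments == i]
        if len(cluster_data_points) > 0:
            centroids[i] = np.mean(cluster_data_points, axis=0)
        else:
            # Empty cluster: re-seed at a random data point rather than
            # leaving the centroid at the origin
            centroids[i] = data_points[np.random.randint(len(data_points))]
    return centroids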
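
To see the assignment and update steps working together, here is a small worked sketch of a single iteration on made-up data (the points, the starting centroids, and the variable names are all assumptions for illustration). It assumes every cluster receives at least one point.

```python
import numpy as np

# Hypothetical toy data: two points near the origin, two near (10, 10).
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
centers = np.array([[1.0, 1.0], [9.0, 9.0]])

# Assignment step: build the (4, 2) distance matrix, then pick the
# index of the nearest centroid for each point.
distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
labels = np.argmin(distances, axis=1)

# Update step: each centroid becomes the mean of its assigned points.
new_centers = np.array(
    [points[labels == k].mean(axis=0) for k in range(len(centers))]
)

print(labels)       # [0 0 1 1]
print(new_centers)  # rows: [0, 1] and [10, 11]
```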

Running the K-Means Clustering Algorithm

We'll repeat the assignment and update steps until the centroids stop changing or a maximum number of iterations is reached.


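def run_kmeans(data_points, K, max_iterations=100):
    # Random initial centroids in [0, 1), matching the range of the data above
    centroids = np.random.rand(K, data_points.shape[1])
    for _ in range(max_iterations):
        cluster_assignments = assign_clusters(data_points, centroids)
        new_centroids = update_centroids(data_points, cluster_assignments, K)
        # Stop once the centroids no longer move (within floating-point tolerance)
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    # Recompute assignments so they always match the returned centroids
    return centroids, assign_clusters(data_points, centroids)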

Finally, we'll run the K-means clustering algorithm on our data points and print the resulting cluster assignments.


centroids, cluster_assignments = run_kmeans(data_points, K)
print(cluster_assignments)
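
Printing the raw assignments says little about how good the clustering is. One common sanity check is the within-cluster sum of squared distances, which the algorithm tries to make small. The helper below, with the hypothetical name `within_cluster_sum_of_squares`, is a minimal sketch shown on made-up inputs rather than on the output of `run_kmeans`.

```python
import numpy as np

def within_cluster_sum_of_squares(data_points, centroids, cluster_assignments):
    # Pick, for every point, the centroid of the cluster it was assigned to.
    assigned_centroids = centroids[cluster_assignments]
    # Sum of squared Euclidean distances from each point to its centroid.
    return float(np.sum((data_points - assigned_centroids) ** 2))

# Hypothetical toy clustering: every point lies 1 unit from its centroid.
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
centers = np.array([[0.0, 1.0], [10.0, 11.0]])
labels = np.array([0, 0, 1, 1])

wcss = within_cluster_sum_of_squares(points, centers, labels)
print(wcss)  # 4.0
```

The same function should accept the `centroids` and `cluster_assignments` returned by `run_kmeans`, since it only relies on their shapes.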

Conclusion

In this tutorial, we implemented the K-means clustering algorithm from scratch in Python. We started by initializing the centroids and data points, then defined functions to calculate the Euclidean distance between each data point and the centroids, assign data points to clusters, and update the centroids. We repeated the assignment and update steps until the centroids stopped changing or the iteration limit was reached, and printed the resulting cluster assignments.

This implementation provides a basic understanding of the K-means clustering algorithm and can be used as a starting point for more complex clustering tasks. However, it has some limitations: the centroids are initialized purely at random, so results can vary between runs, the stopping criterion is simple, and K must be chosen in advance. In practice, k-means++ initialization usually gives better starting centroids, and the elbow method can help choose a suitable number of clusters.
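
For readers who want to try k-means++ initialization, the following is a minimal sketch, not part of the implementation above. The function name `kmeans_plus_plus_init` and its `rng` parameter are assumptions made for this example, and it assumes the data contains at least K distinct points. Each new centroid is sampled from the data with probability proportional to its squared distance from the nearest centroid already chosen.

```python
import numpy as np

def kmeans_plus_plus_init(data_points, K, rng=None):
    # Use the supplied random generator, or create a fresh one.
    rng = np.random.default_rng() if rng is None else rng
    n = len(data_points)
    # First centroid: a data point chosen uniformly at random.
    centroids = [data_points[rng.integers(n)]]
    for _ in range(1, K):
        # Squared distance from every point to its nearest chosen centroid.
        diffs = data_points[:, None, :] - np.array(centroids)[None, :, :]
        d2 = np.min(np.sum(diffs ** 2, axis=2), axis=1)
        # Sample the next centroid with probability proportional to d2,
        # so far-away points are more likely to be picked.
        probs = d2 / d2.sum()
        centroids.append(data_points[rng.choice(n, p=probs)])
    return np.array(centroids)

# Usage on hypothetical random data, seeded for repeatability.
rng = np.random.default_rng(0)
data = rng.random((100, 2))
init = kmeans_plus_plus_init(data, 3, rng)
print(init.shape)  # (3, 2)
```

To use it with the tutorial's code, the random initialization line in `run_kmeans` would need to be replaced by a call to this function.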
