K-Means Clustering

 

  • K-Means is an unsupervised clustering algorithm used to group similar data points.
  • It is called a centroid-based algorithm because each cluster is represented by a centroid (mean value).
  • The main aim of K-Means is to divide a dataset into K clusters, where K is a predefined number.

How K-Means Clustering Works

Given a predefined number of clusters K, the algorithm works through a series of iterative steps, grouping data points according to their distance from the cluster centroids.

1. Initialization

The process begins by selecting K data points randomly from the dataset. These selected points serve as the initial centroids, which represent the centers of the clusters.
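
As a rough illustration of this step, the sketch below picks K distinct data points at random. The helper name `init_centroids`, the `seed` parameter, and the last three sample points are assumptions made for the example, not part of the original description.

```python
import random

def init_centroids(points, k, seed=None):
    """Pick k distinct data points at random to serve as initial centroids."""
    rng = random.Random(seed)  # seed is optional; it only makes runs repeatable
    return rng.sample(points, k)

# Hypothetical 2-D dataset: the first five points reuse the worked example
# in this document, the last three are made up for illustration.
points = [(2, 15), (3, 18), (4, 12), (4, 16), (3, 14), (9, 3), (10, 4), (8, 2)]

centroids = init_centroids(points, k=2, seed=0)
print(centroids)
```

Sampling without replacement guarantees that no data point is chosen as a centroid twice, which is one simple way to read "selecting K data points randomly".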

2. Assignment

For each data point in the dataset, the distance to every centroid is calculated using a distance measure such as Euclidean distance. Each data point is then assigned to the cluster whose centroid is the closest. This step results in the formation of K initial clusters.
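
The assignment step could be sketched as follows, using Euclidean distance via `math.dist` (Python 3.8+). The function name `assign_clusters` and the sample centroids are assumptions chosen for illustration.

```python
import math

def assign_clusters(points, centroids):
    """Return, for each point, the index of its nearest centroid (Euclidean)."""
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))  # ties go to the lower index
    return labels

# Hypothetical data and centroids, for illustration only.
points = [(2, 15), (3, 18), (4, 12), (9, 3), (10, 4)]
centroids = [(3, 15), (9, 3)]

labels = assign_clusters(points, centroids)
print(labels)  # [0, 0, 0, 1, 1]
```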

3. Centroid Update

After all data points are assigned, the centroids are updated by computing the mean of all data points within each cluster. The newly calculated mean becomes the updated centroid for that cluster.
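
A minimal sketch of the update step is shown below. The name `update_centroids` is an assumption, and so is the choice to raise an error for an empty cluster, since the text does not say how that case should be handled.

```python
def update_centroids(points, labels, k):
    """Recompute each centroid as the coordinate-wise mean of its members."""
    centroids = []
    for j in range(k):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # Assumption: treat an empty cluster as an error in this sketch.
            raise ValueError(f"cluster {j} has no points")
        # zip(*members) groups the x values together, the y values together, etc.
        centroids.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
    return centroids

# Hypothetical data and assignments, for illustration only.
points = [(2, 15), (3, 18), (4, 12), (9, 3), (10, 4)]
labels = [0, 0, 0, 1, 1]

print(update_centroids(points, labels, k=2))  # [(3.0, 15.0), (9.5, 3.5)]
```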

4. Iteration and Convergence

The assignment and centroid update steps are repeated. An iteration never increases the total squared distance between the data points and their assigned centroids, so the centroids settle toward the centers of their respective clusters. The algorithm stops at convergence, which occurs when the centroids stop changing significantly, or when a predefined maximum number of iterations is reached.
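
Putting the steps together, the loop might look like the sketch below. It uses only the standard library; the function name `kmeans`, the parameters `max_iter`, `tol`, and `seed`, and the decision to leave an empty cluster's centroid where it was are all assumptions made for this example.

```python
import math
import random

def kmeans(points, k, max_iter=100, tol=1e-6, seed=None):
    """Plain K-Means: random initialization, then assign/update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # step 1: initialization
    labels = []
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        # Step 4: stop once no centroid moved by more than tol.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tol:
            break
    return centroids, labels

# Hypothetical dataset: the first five points reuse the worked example in this
# document, the last three are made up for illustration.
points = [(2, 15), (3, 18), (4, 12), (4, 16), (3, 14), (9, 3), (10, 4), (8, 2)]

centroids, labels = kmeans(points, k=2, seed=0)
print(centroids)
print(labels)
```

The loop ends either when the largest centroid movement falls to `tol` or below, or after `max_iter` passes, which mirrors the two stopping conditions described above.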

5. Final Result

Once the algorithm stops, it outputs the final cluster centroids along with the cluster assignment for each data point. The dataset is now partitioned into K groups of similar points. Because the initial centroids are chosen at random, different runs can produce different partitions.

📌 What is a Centroid?

A centroid is the center point of a cluster in clustering algorithms like K-Means.

It is the mean (average) of all data points belonging to that cluster and is used to represent the cluster.

Mathematical Definition

If a cluster has n points:

$$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$$

Then the centroid is:

$$C_x = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

$$C_y = \frac{y_1 + y_2 + \cdots + y_n}{n}$$

Example 

Cluster C1 contains points:

  • P1 (2,15)
  • P2 (3,18)
  • P3 (4,12)
  • P7 (4,16)
  • P8 (3,14)

Centroid calculation:

$$C_x = \frac{2 + 3 + 4 + 4 + 3}{5} = \frac{16}{5} = 3.2$$

$$C_y = \frac{15 + 18 + 12 + 16 + 14}{5} = \frac{75}{5} = 15$$

Centroid = (3.2, 15)
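
The same calculation can be checked with a few lines of Python. The variable names here are chosen only for this illustration.

```python
# Points of cluster C1 from the example above.
cluster_c1 = {"P1": (2, 15), "P2": (3, 18), "P3": (4, 12), "P7": (4, 16), "P8": (3, 14)}

n = len(cluster_c1)
cx = sum(x for x, _ in cluster_c1.values()) / n  # mean of the x coordinates
cy = sum(y for _, y in cluster_c1.values()) / n  # mean of the y coordinates

print((cx, cy))  # (3.2, 15.0)
```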

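🔹 Why is the Centroid Important in K-Means?

  1. It represents the center of its cluster.
  2. It is used to assign each data point to its nearest cluster.
  3. It is updated repeatedly until it stops changing.
  4. It helps minimize the within-cluster distance (the sum of squared distances from the points to their centroid).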
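
To illustrate the last point, the sketch below compares the sum of squared distances from the points of cluster C1 to its centroid with the same sum measured from another candidate center. The helper name `within_cluster_ss` and the choice of P8 as the alternative center are assumptions made for this example.

```python
def within_cluster_ss(points, center):
    """Sum of squared Euclidean distances from each point to the given center."""
    return sum((x - center[0]) ** 2 + (y - center[1]) ** 2 for x, y in points)

# Cluster C1 and its centroid, taken from the worked example in this document.
cluster_c1 = [(2, 15), (3, 18), (4, 12), (4, 16), (3, 14)]
centroid = (3.2, 15.0)
other_center = (3.0, 14.0)  # P8, used here as an arbitrary alternative center

ss_centroid = within_cluster_ss(cluster_c1, centroid)
ss_other = within_cluster_ss(cluster_c1, other_center)

print(ss_centroid)  # approximately 22.8
print(ss_other)     # 28.0
```

For squared Euclidean distance, the mean is the point with the smallest such sum, which is why the update step uses it as the new centroid.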