- K-Means is an unsupervised clustering algorithm used to group similar data points.
- It is called a centroid-based algorithm because each cluster is represented by a centroid (mean value).
- The main aim of K-Means is to divide a dataset into K clusters, where K is a predefined number.
How K-Means Clustering Works
K-Means partitions a dataset into K clusters by alternating between two operations: assigning each data point to its nearest centroid, and recomputing each centroid from the points assigned to it. These operations repeat until the clusters stabilize. The steps are as follows.
1. Initialization
The process begins by selecting K data points randomly from the dataset. These selected points serve as the initial centroids, which represent the centers of the clusters.
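The initialization step can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `init_centroids`, the `seed` parameter, and the sample points are assumptions made for the example, not part of the original description.

```python
import random

def init_centroids(points, k, seed=None):
    """Pick k distinct data points at random to serve as initial centroids."""
    rng = random.Random(seed)      # seed is only here to make runs repeatable
    return rng.sample(points, k)   # sampling without replacement

# Hypothetical toy dataset used for illustration
points = [(2, 15), (3, 18), (4, 12), (9, 3), (10, 5), (11, 4)]
centroids = init_centroids(points, k=2, seed=0)
print(centroids)
```

Sampling without replacement guarantees that the K starting centroids come from K different rows of the dataset.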
2. Assignment
For each data point in the dataset, the distance to every centroid is calculated using a distance measure such as Euclidean distance. Each data point is then assigned to the cluster whose centroid is the closest. This step results in the formation of K initial clusters.
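A minimal sketch of the assignment step, assuming Euclidean distance and 2-D points. The helper name `assign_clusters`, the sample points, and the two centroids are hypothetical values chosen for illustration.

```python
import math

def assign_clusters(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        # Euclidean distance from this point to every centroid
        distances = [math.dist(p, c) for c in centroids]
        labels.append(distances.index(min(distances)))
    return labels

# Hypothetical data and centroids, not taken from the text
points = [(2, 15), (3, 18), (4, 12), (9, 3), (10, 5), (11, 4)]
centroids = [(3, 15), (10, 4)]
labels = assign_clusters(points, centroids)
print(labels)  # [0, 0, 0, 1, 1, 1]
```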
3. Centroid Update
After all data points are assigned, the centroids are updated by computing the mean of all data points within each cluster. The newly calculated mean becomes the updated centroid for that cluster.
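The centroid update could look like the sketch below. The function name `update_centroids` and the sample data are assumptions; raising an error for an empty cluster is also just one possible choice, since the text does not say how that case should be handled.

```python
def update_centroids(points, labels, k):
    """Recompute each centroid as the coordinate-wise mean of its members."""
    centroids = []
    for j in range(k):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # Assumption: treat an empty cluster as an error in this sketch
            raise ValueError(f"cluster {j} has no points")
        # zip(*members) groups all x values together, all y values together, ...
        centroids.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
    return centroids

# Hypothetical data with assignments already computed
points = [(2, 15), (3, 18), (4, 12), (9, 3), (10, 5), (11, 4)]
labels = [0, 0, 0, 1, 1, 1]
new_centroids = update_centroids(points, labels, k=2)
print(new_centroids)  # [(3.0, 15.0), (10.0, 4.0)]
```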
4. Iteration and Convergence
The assignment and centroid update steps are repeated. With each iteration, every centroid moves to the mean of the points currently assigned to it, so the centroids settle toward the centers of their clusters. The algorithm stops at convergence, that is, when the centroids stop changing significantly, or when a predefined maximum number of iterations is reached.
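Putting the steps together, the whole loop might be sketched as follows. This is an illustrative implementation under stated assumptions, not a reference one: the names `kmeans`, `max_iters`, `tol`, and `seed` are hypothetical, distance is Euclidean, and an empty cluster simply keeps its previous centroid.

```python
import math
import random

def kmeans(points, k, max_iters=100, tol=1e-6, seed=None):
    """Plain K-Means: random initialization, then assign/update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # 1. initialization
    labels = []
    for _ in range(max_iters):
        # 2. assignment: index of the nearest centroid for every point
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # 3. update: mean of the members of each cluster
        new_centroids = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # assumption: keep old centroid
        # 4. convergence check: largest distance any centroid moved
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels

# Hypothetical toy dataset
points = [(2, 15), (3, 18), (4, 12), (9, 3), (10, 5), (11, 4)]
centroids, labels = kmeans(points, k=2, seed=0)
print(centroids, labels)
```

The `tol` threshold plays the role of "stop changing significantly", while `max_iters` is the fallback that bounds the number of iterations.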
5. Final Result
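Once convergence is achieved, the algorithm outputs the final cluster centroids along with the cluster assignment for each data point. The dataset is now partitioned into K groups in which each point belongs to the cluster with the nearest centroid. Because the initial centroids are chosen randomly, different runs can produce different partitions.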
📌 What is a Centroid?
A centroid is the center point of a cluster in clustering algorithms like K-Means.
It represents the average position of all data points belonging to that cluster.
Centroid
The centroid is the mean (average) of all data points in a cluster and is used to represent that cluster.
Mathematical Definition
If a cluster has n points:
$(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$
Then the centroid is:
$C_x = \frac{x_1 + x_2 + \dots + x_n}{n}$
$C_y = \frac{y_1 + y_2 + \dots + y_n}{n}$
Example
Cluster C1 contains points:
- P1 (2,15)
- P2 (3,18)
- P3 (4,12)
- P7 (4,16)
- P8 (3,14)
Centroid calculation:
$C_x = \frac{2 + 3 + 4 + 4 + 3}{5} = \frac{16}{5} = 3.2$
$C_y = \frac{15 + 18 + 12 + 16 + 14}{5} = \frac{75}{5} = 15$
Centroid = (3.2, 15)
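The same arithmetic can be checked with a short Python snippet. The variable names (`cluster_c1`, `cx`, `cy`) are chosen here for illustration; the points are the five members of cluster C1 listed above.

```python
# Points P1, P2, P3, P7, P8 of cluster C1
cluster_c1 = [(2, 15), (3, 18), (4, 12), (4, 16), (3, 14)]

n = len(cluster_c1)
cx = sum(x for x, _ in cluster_c1) / n   # mean of the x-coordinates
cy = sum(y for _, y in cluster_c1) / n   # mean of the y-coordinates

print((cx, cy))  # (3.2, 15.0)
```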
🔹 Why Is the Centroid Important in K-Means?
- It represents the cluster center
- Used to assign each data point to its nearest cluster
- Updated repeatedly until it stops changing
- Helps minimize the within-cluster sum of squared distances
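To illustrate the last point, the sketch below compares the within-cluster sum of squared distances (often abbreviated WCSS) for cluster C1 when measured from its centroid and from another candidate center. The function name `wcss` and the alternative center (4, 16) are assumptions picked for the example.

```python
def wcss(points, center):
    """Sum of squared Euclidean distances from each point to the given center."""
    return sum((x - center[0]) ** 2 + (y - center[1]) ** 2 for x, y in points)

cluster_c1 = [(2, 15), (3, 18), (4, 12), (4, 16), (3, 14)]

at_centroid = wcss(cluster_c1, (3.2, 15))   # about 22.8
at_other = wcss(cluster_c1, (4, 16))        # 31, a hypothetical alternative center

print(at_centroid, at_other)
print(at_centroid < at_other)  # True
```

For a fixed set of points, the mean is the center that makes this sum smallest, which is why the update step uses it.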
