k-means is one of the simplest unsupervised learning algorithms for solving the well-known clustering problem. The procedure classifies a given data set into a certain number of clusters (say, k clusters) fixed a priori: each cluster is represented by a center, and each data point is assigned to the cluster whose center is closest. Because the outcome depends on where these centers start, initialization techniques typically try to place them as far away from each other as possible.
scikit-learn exposes two variants of the algorithm:
1. Class (KMeans): fits the estimator to learn the clusters on training data.
2. Function (k_means): for given training data, returns an array of integer labels, one cluster index per sample.
For the class, the labels over the training data can be found in the labels_ attribute.
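To see the two variants side by side, here is a small sketch (the toy array X below is illustrative, not from this section):
import numpy as np
from sklearn.cluster import KMeans, k_means

X = np.array([[1, 2], [1, 4], [10, 2], [10, 4]])
# 1. Class: fit learns the clusters; labels_ holds one integer per sample
km = KMeans(n_clusters=2, random_state=0).fit(X)
print(km.labels_)
# 2. Function: returns the centers, the labels and the inertia directly
centers, labels, inertia = k_means(X, n_clusters=2, random_state=0)
print(labels)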
K-means is often referred to as Lloyd’s algorithm.
In basic terms, the algorithm proceeds in three steps.
The first step chooses the initial centroids, with the most basic method being to choose k samples from the dataset X.
The second step creates new centroids by taking the mean value of all of the samples assigned to each previous centroid.
Third, the difference between the old and the new centroids is computed, and the algorithm repeats the last two steps until this difference falls below a threshold. In other words, it repeats until the centroids do not move significantly.
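To make these three steps concrete, here is a minimal NumPy sketch of Lloyd's algorithm (an illustrative implementation of the loop above, not scikit-learn's optimized one; empty clusters are not handled):
import numpy as np

def lloyd_kmeans(X, k, tol=1e-4, max_iter=300, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: choose k samples from X as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assign every point to its closest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: new centroids are the means of the assigned samples
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        # Step 3: stop once the centroids no longer move significantly
        centroids, old = new_centroids, centroids
        if np.linalg.norm(centroids - old) < tol:
            break
    return centroids, labels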
Example
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
from sklearn.cluster import KMeans
X = np.array([[5,3], [10,15], [15,12], [24,10], [30,45], [85,70], [71,80], [60,78], [55,52], [80,91],])
#data visualization
plt.scatter(X[:,0],X[:,1], label='True Position')
#Creating Clusters
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)
print(kmeans.cluster_centers_)
Output (the two cluster centers, as 2D coordinates):
[[ 16.8 17. ]
[ 70.2 74.2]]
# Show the cluster labels assigned to each data point
print(kmeans.labels_)
[0 0 0 0 0 1 1 1 1 1]
Plot the data points colored by their assigned cluster:
plt.scatter(X[:,0], X[:,1], c=kmeans.labels_, cmap='rainbow')
Now let's plot the points along with the centroid coordinates:
plt.scatter(X[:,0], X[:,1], c=kmeans.labels_, cmap='rainbow')
plt.scatter(kmeans.cluster_centers_[:,0] ,kmeans.cluster_centers_[:,1], color='black')
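Once fitted, the same KMeans object can assign new, unseen points to the nearest learned centroid with predict. A short usage sketch (the two sample points here are made up):
new_points = np.array([[12, 10], [75, 80]])
print(kmeans.predict(new_points))  # one cluster index per point, e.g. [0 1]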
MiniBatchKMeans
The MiniBatchKMeans is a variant of the KMeans algorithm which uses mini-batches to reduce the computation time, while still attempting to optimise the same objective function. Mini-batches are subsets of the input data, randomly sampled in each training iteration.
In contrast to other algorithms that reduce the convergence time of k-means, mini-batch k-means produces results that are generally only slightly worse than the standard algorithm.
MiniBatchKMeans converges faster than KMeans, but the quality of the results is reduced. In practice this difference in quality can be quite small, as the example below illustrates.
Example:
# Load libraries
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import MiniBatchKMeans
# Load data
iris = datasets.load_iris()
X = iris.data
# Standardize features
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
MiniBatchKMeans works similarly to KMeans, with one significant difference: the batch_size parameter, which controls how many samples each mini-batch contains.
# Create k-mean object
clustering = MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=100)
# Train model
model = clustering.fit(X_std)
model.cluster_centers_
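To see the speed/quality trade-off described above, one can fit plain KMeans on the same standardized data and compare the inertia_ of both models (the within-cluster sum of squares; lower is better). A rough sketch:
from sklearn.cluster import KMeans
full = KMeans(n_clusters=3, random_state=0).fit(X_std)
mini = MiniBatchKMeans(n_clusters=3, random_state=0, batch_size=100).fit(X_std)
# The mini-batch result is typically only slightly worse
print(full.inertia_, mini.inertia_)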
Affinity Propagation
Affinity Propagation creates clusters by sending messages between pairs of samples until convergence. A dataset is then described using a small number of exemplars, which are identified as the samples most representative of the others. The messages sent between pairs represent the suitability of one sample to be the exemplar of the other, and are updated in response to the values from other pairs.
Example:
Below is the Python implementation of the Affinity Propagation clustering using scikit-learn library:
#import all the libraries
from sklearn.cluster import AffinityPropagation
from sklearn import metrics
from sklearn.datasets import make_blobs  # samples_generator was removed in newer scikit-learn
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples = 400, centers = centers,
cluster_std = 0.5, random_state = 0)
# Compute Affinity Propagation
af = AffinityPropagation(preference=-50, random_state=0).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_ = len(cluster_centers_indices)
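Since make_blobs also returns the ground-truth labels, the result can be scored with the metrics module imported above (a short evaluation sketch in the style of scikit-learn's own Affinity Propagation example):
print('Estimated number of clusters: %d' % n_clusters_)
print('Homogeneity: %0.3f' % metrics.homogeneity_score(labels_true, labels))
print('Completeness: %0.3f' % metrics.completeness_score(labels_true, labels))
print('Adjusted Rand Index: %0.3f' % metrics.adjusted_rand_score(labels_true, labels))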