K-means
K-means clustering is a type of unsupervised learning, used when you have unlabeled data. The goal of the algorithm is to find K groups in the data. It works iteratively to assign each data point to one of the K groups based on the features provided, so that data points are clustered by feature similarity. The results of the K-means clustering algorithm are:
The centroids of the K clusters, which can be used to label new data
Labels for the training data (each data point is assigned to a single cluster)
K-means works by defining spherical clusters that are separable, so that the mean value converges towards the cluster center. Because of this assumption, K-means may underperform when clusters are elongated, strongly overlapping, or very different in size.
Use Cases:
Document Classification
Delivery Store Optimization
Customer Segmentation
Insurance Fraud Detection, etc.
Algorithm:
The K-means clustering algorithm takes two inputs: the number of clusters K and the data set. The algorithm starts with initial estimates for the K centroids, which can either be randomly generated or randomly selected from the data set. It then iterates between two steps:
1. Data assignment step:
Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, based on the squared Euclidean distance. If $c_i$ is a centroid in the collection of centroids $C$, then each data point $x$ is assigned to a cluster based on

$$\underset{c_i \in C}{\arg\min} \; \mathrm{dist}(c_i, x)^2$$

where $\mathrm{dist}(\cdot)$ is the standard (L2) Euclidean distance.
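For example, if $x = (1, 1)$ and the centroids are $(0, 0)$ and $(3, 3)$, the squared distances are $2$ and $8$ respectively, so $x$ is assigned to the cluster of $(0, 0)$.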
2. Centroid update step:
Centroids are recomputed by taking the mean of all data points assigned to that centroid's cluster.
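In symbols, if $S_i$ denotes the set of data points assigned to centroid $c_i$, the update is

$$c_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j$$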
The algorithm iterates between steps one and two until a stopping criterion is met (no data points change clusters, the sum of the distances is minimized, or some maximum number of iterations is reached). In this assignment, we will use the first criterion, i.e. that no cluster centers change.
This algorithm may converge on a local optimum. Assessing more than one run of the algorithm with randomized starting centroids may give a better outcome.
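For instance, scikit-learn's KMeans (not used in this assignment, shown only for comparison) automates such restarts via its n_init parameter, keeping the run with the lowest inertia (within-cluster sum of squares):

from sklearn.cluster import KMeans

# Run K-means 10 times from different random centroid initializations
# and keep the run with the lowest inertia.
# X here is any (N, 2) array of points, e.g. one of the datasets generated below.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)
print(km.inertia_)  # objective value of the best run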
Implementation
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from scipy.stats import multivariate_normal
Generating the data
We will generate 3 datasets to use in this assignment. They are created in the following code and stored in X, X1 and X2 respectively. Each is a set of 500 2D points, stored as a matrix of size 500×2.
Please use the variables X, X1 and X2 to run your K-means and EM in this assignment.
"""
This cell has been completed for you. You don't need to change it.
"""
X,Y = make_blobs(cluster_std=1.5,random_state=20,n_samples=500,centers=3)
X = np.dot(X,np.random.RandomState(0).randn(2,2))
print('X shape: ', X.shape)
plt.scatter([x[0] for x in X], [x[1] for x in X])
plt.title('X')
plt.show()
centers = [[4, 7], [9, 9], [9, 2]]
X1,Y1 = make_blobs(cluster_std=1.5,random_state=20,n_samples=500,centers=centers)
X1 = np.dot(X1,np.random.RandomState(0).randn(2,2))
print('X1 shape: ', X1.shape)
plt.scatter([x[0] for x in X1], [x[1] for x in X1])
plt.title('X1')
plt.show()
centers = [[5, 5]]
X21,Y21 = make_blobs(cluster_std=1.5,random_state=20,n_samples=200,centers=centers)
X21 = np.dot(X21, np.array([[1.0, 0], [0, 5.0]]))
centers = [[5, 5]]
X22,Y22 = make_blobs(cluster_std=1.5,random_state=20,n_samples=200,centers=centers)
X22 = np.dot(X22, np.array([[5.0, 0], [0, 1.0]]))
centers = [[7, 7]]
X23, Y23 = make_blobs(cluster_std=1.5,random_state=20,n_samples=100,centers=centers)
X23 = np.dot(X23, np.random.RandomState(0).randn(2,2))
X2 = np.vstack((X21, X22, X23))
print('X2 shape: ', X2.shape)
plt.scatter([x[0] for x in X2], [x[1] for x in X2])
plt.title('X2')
plt.show()
Output: scatter plots of the three datasets X, X1 and X2 (one figure per dataset).
Creating Class
class KmeansModel:
def __init__(self, X, k, max_iters):
self.X = X
self.k = k
self.max_iters = max_iters
self.dim = X.shape[1]
self.N = X.shape[0]
centroid_indices = np.random.RandomState(2).permutation(X.shape[0])[:k]
self.centroids = X[centroid_indices]
initial_labels = self.get_labels(self.X, self.centroids)
self.plot_data(initial_labels, 'Data with initial random clusters')
def get_labels(self, X, centroids):
labels=[]
for i in range(X.shape[0]):
point=X[i]
# initialise a list to store the distance between the data-point and the centroids
Euclidean_distance=[]
for index,C in enumerate(centroids):
# calculating euclidean distance between the data-point and the k-centroid points and appending into a list
Euclidean_distance.append(np.power((point[0]-C[0]),2)+np.power((point[1]-C[1]),2))
# selecting the index with the least distance
label=Euclidean_distance.index(min(Euclidean_distance))
labels.append([label])
return (np.array(labels))
KmeansModel.get_labels = get_labels # Here we are just adding the function to the KmeansModel class.
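As an aside, the same assignment step can be written without explicit Python loops using NumPy broadcasting. A minimal sketch (not part of the assignment; it assumes X is an (N, 2) array and centroids a (k, 2) array, and returns the same (N, 1) label array as get_labels above):

def get_labels_vectorized(X, centroids):
    # pairwise squared Euclidean distances, shape (N, k), via broadcasting:
    # (N, 1, 2) - (1, k, 2) -> (N, k, 2), then sum the squares over the last axis
    sq_dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    # index of the nearest centroid for each point, reshaped to (N, 1)
    return sq_dists.argmin(axis=1).reshape(-1, 1)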
def plot_data(self, labels, title):
fig = plt.figure(figsize=(12,6))
ax0 = fig.add_subplot(111)
ax0.scatter(self.X[:,0], self.X[:,1], c=labels)
ax0.set_title(title)
KmeansModel.plot_data = plot_data
Creating Main Function
def run(self):
    iters = 0
    while True:
        labels = self.get_labels(self.X, self.centroids)
        # assign points to clusters based on their labels
        clusters = {}
        for i, j in zip(labels, self.X):
            if i[0] not in clusters:
                clusters[i[0]] = []
            clusters[i[0]].append(j)
        # calculate the new centroid of each cluster as the mean of its points;
        # iterating over sorted keys keeps centroid order aligned with label indices
        new_centroid = []
        for i in sorted(clusters):
            new_centroid.append(list(np.average(clusters[i], axis=0)))
        # convergence condition: stop if no centroid changed since the last iteration
        if np.array_equal(self.centroids, np.array(new_centroid)):
            break
        # update the centroids to the new values
        self.centroids = np.array(new_centroid)
        if iters == self.max_iters:
            break
        iters += 1
    final_labels = self.get_labels(self.X, self.centroids)
    self.plot_data(final_labels, 'Final clusters')
KmeansModel.run = run
Finally, we create an instance of KmeansModel and run our algorithm on some data.
km1 = KmeansModel(X, 3, 100)
km1.run()
km2 = KmeansModel(X1, 3, 100)
km2.run()
km3 = KmeansModel(X2, 3, 100)
km3.run()
The EM Algorithm
The EM (Expectation-Maximization) algorithm finds clusters by treating each cluster as a 2D Gaussian distribution, and fits the parameters of each Gaussian (its mean, covariance, and mixing weight) so that the resulting mixture models the data accurately.
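Concretely, the model is a mixture of $K$ Gaussians with mixing weights $\pi_k$ (which sum to one), means $\mu_k$ and covariances $\Sigma_k$:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

EM alternates between an expectation (E) step, which computes how responsible each Gaussian is for each point under the current parameters, and a maximization (M) step, which re-estimates the parameters from those responsibilities.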
Creating Class
class EMModel:
def __init__(self, X, k, max_iters):
"""
This function initializes our parameters (mu, pi and sigma) and plots our data points.
"""
self.X = X
self.k = k
self.max_iters = max_iters
self.dim = self.X.shape[1] # Equals 2, as we are considering 2D points
self.N = self.X.shape[0] # Equals the number of points in the dataset
"""
Here we initialize mu, pi and sigma. Mu is k X 2, Pi is k X 1, Sigma is k X 2 X 2
"""
self.mu = np.random.randint(max(min(self.X[:, 0]), min(self.X[:, 1])), min(max(self.X[:, 0]), max(self.X[:, 1])), size=(self.k, self.dim))
self.pi = np.ones(self.k) / self.k
self.sigma = np.zeros((self.k, self.dim, self.dim))
for i in range(self.k):
np.fill_diagonal(self.sigma[i], 5.0)
"""
This part is required to plot clusters
"""
x,y = np.meshgrid(np.sort(self.X[:,0]),np.sort(self.X[:,1]))
self.XY = np.array([x.flatten(),y.flatten()]).T
"""
This variable is required in case the covariance matrix turns out to not be positive semidefinite
"""
self.sigma_correction = 0.0*np.identity(self.dim)
"""
Finally, let's visualize our data points
"""
self.plot_data('Initial State')
def plot_data(self, title, colors=None):
"""
This function creates a scatter plot of all the data points. It also creates a contour plot of the probability
distributions of each of the clusters (specified by mu, pi and sigma)
"""
fig = plt.figure(figsize=(12,6))
ax0 = fig.add_subplot(111)
ax0.scatter(self.X[:,0], self.X[:,1], c=colors)
ax0.set_title(title)
for m,c in zip(self.mu,self.sigma):
c += self.sigma_correction
multi_normal = multivariate_normal(mean=m,cov=c)
ax0.contour(np.sort(self.X[:,0]),np.sort(self.X[:,1]),multi_normal.pdf(self.XY).reshape(len(self.X),len(self.X)),colors='black',alpha=0.3)
ax0.scatter(m[0],m[1],c='grey',zorder=10,s=100)
EMModel.plot_data = plot_data
def expectation(self):
    # E-step: compute the responsibility r[n, i] of each Gaussian i for each
    # point n, proportional to pi_i * N(x_n | mu_i, sigma_i)
    r = np.zeros((self.N, self.k))
    for i in range(self.k):
        for n in range(self.N):
            r[n, i] = self.pi[i] * multivariate_normal.pdf(self.X[n],
                                                           self.mu[i], self.sigma[i])
    # log-likelihood of the data under the current parameters
    # (useful for monitoring convergence, though not used by run() below)
    cost = np.log(r.sum(axis=1)).sum()
    # normalize each row so that each point's responsibilities sum to one
    r = r / r.sum(axis=1, keepdims=True)
    return r
EMModel.expectation = expectation
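For reference, the normalized responsibilities computed above correspond to the standard E-step formula:

$$r_{ic} = \frac{\pi_c \, \mathcal{N}(x_i \mid \mu_c, \Sigma_c)}{\sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}$$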
def maximization(self, r):
    # M-step: re-estimate the mixing weights, means and covariances
    # from the responsibilities computed in the E-step
    self.mu = []
    self.sigma = []
    self.pi = []
    for i in range(self.k):
        Nk = r[:, i].sum()               # effective number of points in cluster i
        self.pi.append(Nk / self.N)      # mixing weight
        Mu_k = r[:, i].dot(self.X) / Nk  # responsibility-weighted mean
        self.mu.append(Mu_k)
        delta = self.X - Mu_k
        Rdelta = np.expand_dims(r[:, i], -1) * delta
        # responsibility-weighted covariance, plus the (optional) correction term
        self.sigma.append(Rdelta.T.dot(delta) / Nk + np.eye(self.dim) * self.sigma_correction)
EMModel.maximization = maximization
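For reference, these are the standard M-step updates, where $N_c = \sum_i r_{ic}$ is the effective number of points assigned to cluster $c$:

$$\pi_c = \frac{N_c}{N}, \qquad \mu_c = \frac{1}{N_c} \sum_{i=1}^{N} r_{ic}\, x_i, \qquad \Sigma_c = \frac{1}{N_c} \sum_{i=1}^{N} r_{ic}\, (x_i - \mu_c)(x_i - \mu_c)^\top$$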
Creating Main Function
def run(self):
r = None
for it in range(self.max_iters):
r = self.expectation()
self.maximization(r)
color_labels = np.argmax(r, axis=1)
self.plot_data('Final State', color_labels)
EMModel.run = run
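The run() above simply performs a fixed number of iterations. If you wanted to monitor convergence instead, one option (a sketch, not part of the assignment; log_likelihood is a hypothetical helper, not an EMModel method) is to track the log-likelihood of the data after each EM iteration and stop once it stops improving:

def log_likelihood(model):
    # total log-likelihood of the data under the model's current mixture parameters
    densities = np.zeros((model.N, model.k))
    for i in range(model.k):
        densities[:, i] = model.pi[i] * multivariate_normal.pdf(
            model.X, model.mu[i], model.sigma[i])
    return np.log(densities.sum(axis=1)).sum()

Breaking out of the loop once the increase falls below a small tolerance (say 1e-4) would give an early-stopping variant of run().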
Finally, we instantiate the EMModel class and run the algorithm on some data.
EM = EMModel(X, 3, 50)
EM.run()
EM1 = EMModel(X1, 3, 100)
EM1.run()
EM2 = EMModel(X2, 3, 100)
EM2.run()