Section 1: Explore the dataset and set up variables
1. Distance between points
Load the dataset clusterdat1.csv as an array called data1.
Split your data into 2 separate sets, storing the first half of the dataset as Set1 and the second half as Set2. Calculate the distance between each data point in Set1 and the corresponding data point in Set2 using the dist() function defined in the notebook (200 values total). You may optionally use a loop.
Report both the sum and the mean of your distances.
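A minimal sketch of this step is shown below. It assumes dist() is a Euclidean distance helper (the exact definition lives in the notebook) and that the CSV is comma-separated with no header row; both are assumptions, not the notebook's actual code.

import numpy as np

# Assumed Euclidean distance helper, standing in for the dist() defined in the notebook
def dist(a, b, axis=None):
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2, axis=axis))

# Load the dataset (assuming a plain comma-separated file with no header)
data1 = np.loadtxt("clusterdat1.csv", delimiter=",")

# Split into two halves
half = len(data1) // 2
Set1, Set2 = data1[:half], data1[half:]

# Distance between each point in Set1 and the corresponding point in Set2
d12 = np.array([dist(Set1[i], Set2[i]) for i in range(len(Set1))])

print("Sum: ", d12.sum())
print("Mean:", d12.mean())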
2. Distance between 1 point and a dataset
For a cluster center C0 = [-10,-5], calculate the distance between this point (C0) and every point in your dataset, data1. You should use the dist() function and you may optionally use a loop. Store your result in an array called dist0.
Report both the length and the mean of your dist0 array.
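A possible sketch, reusing the assumed dist() helper and data1 from above:

C0 = np.array([-10, -5])

# Distance from C0 to every point in data1
dist0 = np.array([dist(data1[i], C0) for i in range(len(data1))])

print("Length:", len(dist0))
print("Mean:  ", dist0.mean())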
3. Distances between cluster centers and a dataset
Now we will use 2 cluster centers stored in a variable called C.
C = [[-2, -2],
     [10, 10]]
Create a new array called distances that contains all zeros and has a row for each sample (data point) in data1 and a column for each point stored in C. You may use np.zeros() to create your array.
Hint: Try to set up distances so that your code will work with any size of dataset and any number of values stored in C.
In the first column of distances, store the distance from each point in data1 to the first cluster center in C. In the second column of distances, store the distance from each point in data1 to the second cluster center in C.
Hint: You can optionally use the dist() function with axis=1 to compute the distance between a data point and each point in your C array, like this: dist(data1[i], C, axis=1)
Report both the shape of your distances array and the mean of its columns (a 1x2 value).
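One way this could look, assuming dist() supports the axis argument as described in the hint:

C = np.array([[-2, -2],
              [10, 10]])

# One row per sample in data1, one column per cluster center in C
distances = np.zeros((data1.shape[0], C.shape[0]))
for i in range(data1.shape[0]):
    distances[i] = dist(data1[i], C, axis=1)  # distance to every center at once

print("Shape:", distances.shape)
print("Column means:", distances.mean(axis=0))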
4. Assigning points to clusters
Using your distances array from the step above, create a new array called cluster_ind that stores the index of the cluster that each data point should be assigned to. If the data point belongs to the first cluster ([-2,-2]), the index should be set to 0. If it belongs to the second cluster ([10,10]), it should be set to 1.
Hint: You can use np.argmin() to accomplish this.
Report the first 5 values of cluster_ind.
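A minimal sketch using np.argmin along the columns of distances:

# Index (0 or 1) of the nearest cluster center for each data point
cluster_ind = np.argmin(distances, axis=1)
print(cluster_ind[:5])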
5. Updating cluster centers
Create a new C array with the first cluster center equal to the mean of all data points in data1 with cluster_ind = 0 and the second cluster center equal to the mean of all data points in data1 with cluster_ind = 1.
Report your value of C.
Note: Depending on how you code your answer, you may need to store C as np.array(C) for your output to look nice.
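A compact way to update the centers, assuming the variables from the previous steps; wrapping the result in np.array() keeps the output tidy, as the note suggests:

k_clusters = C.shape[0]
# New center = mean of the points currently assigned to that cluster
C = np.array([data1[cluster_ind == k].mean(axis=0) for k in range(k_clusters)])
print(C)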
You should complete Section 1 before moving on to the rest of the assignment. The rest of the assignment should make use of the same variable names and much of the same code created in Section 1.
Section 2: Performing K-means for a small number of iterations
6. Perform K-means step by step, k=2
Load the dataset clusterdat2.csv as an array called data2. You will use data2 for the remainder of the questions in this assignment.
a. Initialize the cluster centers (the means) at (-2,10) and (-2,6).
b. Calculate the distance between each data point and each center (you should have an array containing a row for every data point and a distance value for each cluster mean).
c. Assign each data point the label of its nearest cluster (cluster_ind).
d. Plot the clustered data points using different colors for each cluster. You may optionally plot the cluster centers in the same plot.
e. Update the cluster centers (means).
f. Continue for 4 iterations, creating one plot for each iteration inside a 2x2 subplot (a sketch of this loop follows the list).
g. Repeat for a new initialization of cluster centers: (3, -8) and (5, -10).
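A sketch of one way to write this loop, assuming the dist() helper from Section 1, matplotlib for plotting, and a comma-separated clusterdat2.csv with no header:

import matplotlib.pyplot as plt

data2 = np.loadtxt("clusterdat2.csv", delimiter=",")
C = np.array([[-2.0, 10.0], [-2.0, 6.0]])  # initial cluster centers

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for it, ax in enumerate(axes.ravel()):
    # Assign each point to its nearest center
    distances = np.array([dist(p, C, axis=1) for p in data2])
    cluster_ind = np.argmin(distances, axis=1)

    # Plot this iteration's assignment (one color per cluster) plus the centers
    for k in range(C.shape[0]):
        ax.scatter(*data2[cluster_ind == k].T, s=10)
    ax.scatter(*C.T, c="k", marker="x")
    ax.set_title("Iteration " + str(it + 1))

    # Update the centers for the next iteration
    C = np.array([data2[cluster_ind == k].mean(axis=0) for k in range(C.shape[0])])

plt.tight_layout()
plt.show()

Rerunning the same loop with C initialized to (3, -8) and (5, -10) covers the second initialization.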
7. K-means step by step, k=3
Follow the same instructions as above, using two different cluster initializations:
Centers = (2,-2), (-2,-2), (2,2) and
Centers = (0,-3), (0,0), (0,3)
8. K-means step by step, k=4
Follow the same instructions as above, using only one cluster initialization:
Centers = (0,-3), (0,-1), (0,1), (0,2)
9. Evaluating k
Which value of k do you think will produce the best result? How can you tell?
10. Perform K-means for k = [4, 8, 12, 20]
Using the same K-means algorithm as above, cluster the data, but only plot the final results. Use as the convergence criterion that the total change in the cluster centers between two consecutive iterations is less than 0.001 (or simply run the algorithm for a large number of iterations, like 200). Initialize your cluster centers by randomly selecting k points from your dataset as your initial values. Create a figure with 2x2 subplots. In each subplot, plot the final result of k-means clustering with a different k value. Subplot 1: k=4; Subplot 2: k=8; Subplot 3: k=12; Subplot 4: k=20.
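One possible way to package the algorithm, reusing the dist() helper and data2 from above; the kmeans() function name, its random initialization, and the tolerance handling are illustrative choices, not a prescribed implementation:

def kmeans(data, k, tol=1e-3, max_iter=200, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # Initialize centers by picking k random points from the data
    C = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(max_iter):
        distances = np.array([dist(p, C, axis=1) for p in data])
        cluster_ind = np.argmin(distances, axis=1)
        # Keep an empty cluster's old center so the mean stays defined
        new_C = np.array([data[cluster_ind == j].mean(axis=0)
                          if np.any(cluster_ind == j) else C[j]
                          for j in range(k)])
        moved = dist(new_C, C, axis=1).sum()  # total movement of the centers
        C = new_C
        if moved < tol:
            break
    return C, cluster_ind

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, k in zip(axes.ravel(), [4, 8, 12, 20]):
    C, cluster_ind = kmeans(data2, k)
    for j in range(k):
        ax.scatter(*data2[cluster_ind == j].T, s=10)
    ax.scatter(*C.T, c="k", marker="x")
    ax.set_title("k = " + str(k))
plt.tight_layout()
plt.show()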
11. Perform K-means for k = 1:25
Using the same K-means algorithm, run k-means for k = 1 through 25. You will need to use a for loop. For each value of k, store the total distance of all data points from their cluster centers. Note: This may take some time to run.
a. Plot the total distance for each value of k. Which k do you think explains the data best?
b. You may notice that your distance graph does not follow a smooth-line trend. If it doesn't, can you explain why there are "jumps" in the distance values?
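A sketch of the loop over k, assuming the kmeans() helper from the previous step:

# Total distance of every point from its assigned center, for each k
total_dist = []
for k in range(1, 26):
    C, cluster_ind = kmeans(data2, k)
    total_dist.append(sum(dist(data2[i], C[cluster_ind[i]]) for i in range(len(data2))))

plt.plot(range(1, 26), total_dist, marker="o")
plt.xlabel("k")
plt.ylabel("Total distance to assigned center")
plt.show()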
12. Extra credit
Run your code from the step above for generating the total distances, but this time use a loop to repeat the process a few times (~5 or more). Store the minimum total distance for each value of k and use it to create a new version of your plot from part (a). Explain the difference between the two plots in terms of cluster initialization.
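A sketch of the repeated runs, again assuming the kmeans() helper; keeping the minimum over repeats smooths out unlucky random initializations:

n_repeats = 5
ks = list(range(1, 26))
min_total = np.full(len(ks), np.inf)
for r in range(n_repeats):
    rng = np.random.default_rng(r)  # a different random initialization each repeat
    for i, k in enumerate(ks):
        C, cluster_ind = kmeans(data2, k, rng=rng)
        d = sum(dist(data2[j], C[cluster_ind[j]]) for j in range(len(data2)))
        min_total[i] = min(min_total[i], d)

plt.plot(ks, min_total, marker="o")
plt.xlabel("k")
plt.ylabel("Minimum total distance over repeats")
plt.show()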