In this blog, you will learn some important machine learning algorithms using the MNIST dataset:
Table of Contents
k-Nearest Neighbors.
Linear Regression.
Support Vector Machines.
Naïve Bayes.
Model Evaluation.
Exercises.
In this exercise, you'll be working with the MNIST digits recognition dataset, which has 10 classes: the digits 0 through 9. A reduced version of the MNIST dataset is one of scikit-learn's included datasets, and that is the one we will use here.
Each sample in this scikit-learn dataset is an 8x8 image representing a handwritten digit. Each pixel is represented by an integer in the range 0 to 16, indicating varying levels of black.
To load the dataset, use the following code:
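The following is a minimal version of the loading step, using scikit-learn's load_digits (the print statements are only a quick sanity check):

```python
# Load the reduced 8x8 digits dataset bundled with scikit-learn
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.data.shape)    # (1797, 64): flattened 8x8 pixel features
print(digits.target.shape)  # (1797,): labels, the digits 0 through 9
```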
Display a random digit image to verify the dataset:
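A sketch of one way to do this, assuming the digits object loaded above is still in scope:

```python
# Show a randomly chosen digit image together with its label
import numpy as np
import matplotlib.pyplot as plt

idx = np.random.randint(0, len(digits.images))
plt.imshow(digits.images[idx], cmap=plt.cm.gray_r)
plt.title(f"Label: {digits.target[idx]}")
plt.show()
```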
Output: an image of one randomly selected handwritten digit with its label.
Before applying any classifier, we need to split the dataset into training and testing sets.
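One way to do this with scikit-learn's train_test_split (the 80/20 split ratio and random seed are assumptions, not values from the original post):

```python
# Hold out 20% of the samples as a stratified test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=42, stratify=digits.target)
```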
1. k-Nearest Neighbors
Build a kNN classifier for the above dataset.
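A minimal sketch, assuming the train/test split above and an arbitrary choice of k = 5:

```python
# Fit a kNN classifier and report accuracy on both splits
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("Training accuracy:", knn.score(X_train, y_train))
print("Testing accuracy:", knn.score(X_test, y_test))
```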
1.1 Varying the Number of Neighbors
In this exercise, you need to compute and plot the training and testing accuracy scores for different values of k (e.g., 1 to 8).
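A sketch of the loop and the plot, again assuming the split above:

```python
# Train a kNN classifier for each k and record both accuracy scores
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

neighbors = np.arange(1, 9)
train_acc = np.empty(len(neighbors))
test_acc = np.empty(len(neighbors))

for i, k in enumerate(neighbors):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    train_acc[i] = knn.score(X_train, y_train)
    test_acc[i] = knn.score(X_test, y_test)

plt.plot(neighbors, train_acc, label="Training accuracy")
plt.plot(neighbors, test_acc, label="Testing accuracy")
plt.xlabel("Number of neighbors (k)")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```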
Output: a plot of training and testing accuracy as a function of k.
1.2 Overfitting vs. Underfitting
Which values of k make the discrepancy between training accuracy and testing accuracy larger, and which make it smaller? Which case corresponds to underfitting and which to overfitting? Explain why.
2. Linear Regression
Build a Linear Regression model for the same dataset. Note that linear regression predicts a continuous value, so its output must be mapped back to a digit class (for example, by rounding) before a classification accuracy can be computed.
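One possible scheme (a sketch; the original post may have handled this differently) is to regress the digit label on the pixel values and round the predictions to the nearest class:

```python
# Fit ordinary least squares on the digit labels, then round predictions to classes
import numpy as np
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
linreg.fit(X_train, y_train)

y_pred = np.clip(np.rint(linreg.predict(X_test)), 0, 9).astype(int)
print("Classification accuracy (rounded predictions):", np.mean(y_pred == y_test))
print("R^2 score:", linreg.score(X_test, y_test))
```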
3. Support Vector Machines
In this section, you need to compute the accuracy scores on the same dataset using an SVM classifier.
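A minimal sketch using scikit-learn's SVC with its default RBF kernel (the kernel choice is an assumption; a linear kernel is also worth trying):

```python
# Fit a support vector classifier and report test accuracy
from sklearn.svm import SVC

svm = SVC(kernel="rbf", gamma="scale")
svm.fit(X_train, y_train)
print("SVM testing accuracy:", svm.score(X_test, y_test))
```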
4. Naïve Bayes
Classify the above dataset using a Naïve Bayes classifier.
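A minimal sketch using Gaussian Naïve Bayes (the choice of the Gaussian variant for these pixel-intensity features is an assumption):

```python
# Fit a Gaussian Naive Bayes classifier and report test accuracy
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)
print("Naive Bayes testing accuracy:", nb.score(X_test, y_test))
```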
5. Model Evaluation
Compare the accuracy of the different classifiers in a plot.
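A sketch of one way to put the comparison together, refitting each classifier on the same split (the bar-chart format and model parameters are assumptions):

```python
# Fit each classifier on the training split and compare test accuracy
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

models = {"kNN": KNeighborsClassifier(n_neighbors=5),
          "Naive Bayes": GaussianNB(),
          "SVM": SVC()}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

plt.bar(list(scores.keys()), list(scores.values()))
plt.ylabel("Testing accuracy")
plt.title("Classifier comparison on the digits dataset")
plt.show()
```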
Output: a plot comparing the testing accuracy of the classifiers.
Practice Exercises
In this part, you will be working with the Iris dataset.
Load this dataset from scikit-learn.
Classify it using the following techniques: kNN, Naïve Bayes, and SVM.
Compare the accuracy of the different classifiers in a plot (a sketch of the whole workflow is shown after this list).
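A minimal sketch of the practice exercise, following the same workflow as above (split ratio, random seed, and model parameters are assumptions):

```python
# Load Iris, fit kNN, Naive Bayes and SVM, and compare test accuracy in a bar chart
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target)

models = {"kNN": KNeighborsClassifier(n_neighbors=5),
          "Naive Bayes": GaussianNB(),
          "SVM": SVC()}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)

plt.bar(list(scores.keys()), list(scores.values()))
plt.ylabel("Testing accuracy")
plt.title("Classifier comparison on the Iris dataset")
plt.show()
```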