Introduction
In this coursework, you will work with two data files, which you can obtain from the module site on Canvas. The training set is in Algerian forest fires train.csv and the test set is in Algerian forest fires test.csv.
These two datasets are extracted from the Algerian Forest Fires Dataset, whose original archive page is https://archive.ics.uci.edu/ml/datasets/Algerian+Forest+Fires+Dataset++. The dataset includes 13 feature columns (or 11 features if we consider day, month, and year as a single feature: date) plus class label information. The last column in both files is the class label. Classes 0 and 1 denote not fire and fire, respectively.
Note that when analysing the data, you need to work on the 10 features (Temperature, RH, Ws, Rain, FFMC, DMC, DC, ISI, BUI, and FWI), excluding day, month, and year in both files, and you need not consider the region information given in the original Algerian Forest Fires Dataset.
The training set you have been given consists of 204 instances, 88 labeled as 0, and 116 labeled as 1. For this coursework, you may treat this set as a balanced dataset. The test set contains 40 instances. You can assume that the data is of satisfactory quality and requires no preprocessing / data cleansing other than normalisation.
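The column selection described above can be sketched as follows. This is a minimal illustration, not a reference solution: the helper name split_features_labels, the label header "Classes", and the two data rows are made up for the example, and the real files may use slightly different headers (for instance with padded spaces). The only facts taken from the brief are the ten feature names and that the class label is the last column.

```python
import io

import pandas as pd

# The 10 features named in the brief (day, month, year are excluded).
FEATURES = ["Temperature", "RH", "Ws", "Rain", "FFMC",
            "DMC", "DC", "ISI", "BUI", "FWI"]


def split_features_labels(df):
    """Hypothetical helper: return (X, y) from a loaded data frame."""
    # Strip padded header names, in case the CSV has them (an assumption).
    df = df.rename(columns=lambda c: c.strip())
    X = df[FEATURES]
    # The brief states that the last column is the class label.
    y = df.iloc[:, -1].astype(int)
    return X, y


# Tiny made-up stand-in for one of the CSV files; values are illustrative only.
# With the real data, use pd.read_csv on the file downloaded from Canvas.
demo_csv = io.StringIO(
    "day,month,year,Temperature,RH,Ws,Rain,FFMC,DMC,DC,ISI,BUI,FWI,Classes\n"
    "1,6,2012,29,57,18,0.0,65.7,3.4,7.6,1.3,3.4,0.5,0\n"
    "2,6,2012,31,64,15,0.0,86.7,14.2,63.8,5.7,18.3,8.4,1\n"
)
demo = pd.read_csv(demo_csv)
X_demo, y_demo = split_features_labels(demo)
print(X_demo.shape, y_demo.tolist())  # (2, 10) [0, 1]
```

Selecting the label by position rather than by name keeps the sketch independent of whatever the label column is actually called in the files.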
To classify the data you will be using Support Vector Machines (SVMs). The type of SVM you need to use is the C-SVC (Cost-Support Vector Classifier) and the kernel function you should use is the Gaussian radial basis function (RBF).
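For reference, the two ingredients named above are usually written as follows in textbooks. The symbols w, b, \xi_i, \phi and n are notation introduced here rather than taken from this brief, and the class labels are assumed to be recoded from {0, 1} to {-1, +1}. C is the cost parameter and \gamma is the kernel parameter tuned in Task 3.

```latex
% Gaussian radial basis function (RBF) kernel
K(x, x') = \exp\!\left(-\gamma \,\lVert x - x' \rVert^{2}\right), \qquad \gamma > 0

% C-SVC soft-margin primal problem, with y_i \in \{-1, +1\}
\min_{w,\, b,\, \xi} \;\; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{\top} \phi(x_i) + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \dots, n
```

A larger C penalises margin violations more heavily, and a larger \gamma makes the kernel more local.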
Software Required
For this coursework you will need to write your code in Python (version 3 or above) in a Jupyter Notebook. You can use functions from the following packages: NumPy, pandas, Matplotlib, seaborn and scikit-learn (sklearn). Your practical session notes, all available on Canvas, should be very useful.
1. Task 1 - Data Exploration
In this task, you need to use Principal Component Analysis (PCA) to understand the characteristics of the datasets.
(a) Use Pandas to load both the training set and the test set. (Let’s denote this original training set as training set (I).)
(b) Plot two subplots in one figure:
one is a scatter plot of Temperature against BUI of the training set;
the other is a scatter plot of Temperature against FWI of the test set.
You need to separately set the label for the x-axis and y-axis and use different colours to distinguish the two classes. Make it clear which subplot is for which dataset. (Hint: examples on how to use pyplot.subplot in matplotlib can be found here: https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplot.html.)
(c) Normalise the training set and the test set using StandardScaler() (Hint: the parameters should come from the training set only).
(d) Perform a PCA analysis on the scaled training set and plot the scree plot to report variances captured by each principal component.
(e) Plot two subplots in one figure:
one for projecting the training set in the projection space constructed by the training set in Task 1 (d) using the first principal component (PC1) and the second principal component (PC2).
the other one for projecting the test set in the same projection space produced by the training set in Task 1 (d) using the first two principal components.
You need to label the data using different colours in the picture according to its class and set the label for the x-axis and y-axis, separately.
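Steps (c) to (e) can be sketched as below. This is only an illustration under stated assumptions: the arrays are random synthetic stand-ins with the shapes given in the brief (204 x 10 and 40 x 10), the 20/20 test labels are made up, and the colours and figure sizes are arbitrary choices. With the real data, replace the synthetic arrays with the ten feature columns loaded in step (a).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the real dataset); shapes follow the brief.
rng = np.random.default_rng(0)
y_train = np.array([0] * 88 + [1] * 116)
y_test = np.array([0] * 20 + [1] * 20)  # made-up class split
X_train = rng.normal(size=(204, 10)) + 1.5 * y_train[:, None]
X_test = rng.normal(size=(40, 10)) + 1.5 * y_test[:, None]

# (c) Fit the scaler on the training set only, then apply it to both sets.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# (d) PCA on the scaled training set; these ratios are what a scree plot shows.
pca = PCA().fit(X_train_s)
ratios = pca.explained_variance_ratio_

fig_scree, ax_scree = plt.subplots(figsize=(5, 4))
ax_scree.plot(np.arange(1, len(ratios) + 1), ratios, marker="o")
ax_scree.set_xlabel("Principal component")
ax_scree.set_ylabel("Explained variance ratio")
ax_scree.set_title("Scree plot (training set)")

# (e) Project both sets into the space learned from the training set.
Z_train = pca.transform(X_train_s)[:, :2]
Z_test = pca.transform(X_test_s)[:, :2]

fig_proj, axes = plt.subplots(1, 2, figsize=(10, 4))
panels = [(axes[0], Z_train, y_train, "Training set"),
          (axes[1], Z_test, y_test, "Test set")]
for ax, Z, y, title in panels:
    for cls, colour, name in [(0, "tab:blue", "not fire"), (1, "tab:red", "fire")]:
        mask = y == cls
        ax.scatter(Z[mask, 0], Z[mask, 1], color=colour, label=name, s=15)
    ax.set_xlabel("PC1")
    ax.set_ylabel("PC2")
    ax.set_title(title)
    ax.legend()
fig_proj.tight_layout()
```

The key design point is that both the scaler and the PCA are fitted once, on the training set, and then reused unchanged for the test set, so the test data never influences the projection space.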
2. Task 2
(a) Divide the training dataset into a smaller training set (II) and a validation set using the train_test_split function and report the number of points in each set. Usually, we use 20%-30% of the total data points in the whole training set as the validation data. It is your choice on how to set the exact ratio.
(b) Normalise both the training set (II) and the validation set (Hint: the parameters should come from the training set (II) only).
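A minimal sketch of both steps, using synthetic stand-in data of the same size as training set (I). The 25% validation ratio, the stratify argument and the random_state value are illustrative choices made here, not requirements of the brief.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for training set (I): 204 instances, 10 features.
rng = np.random.default_rng(0)
y = np.array([0] * 88 + [1] * 116)
X = rng.normal(size=(204, 10)) + 1.5 * y[:, None]

# (a) Hold out 25% as the validation set (any ratio in 20%-30% is allowed).
X_tr2, X_val, y_tr2, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)
print(len(X_tr2), len(X_val))  # 153 51

# (b) Scaling parameters come from training set (II) only.
scaler2 = StandardScaler().fit(X_tr2)
X_tr2_s = scaler2.transform(X_tr2)
X_val_s = scaler2.transform(X_val)
```

Fixing random_state makes the split reproducible between notebook runs, which helps when reporting the number of points in each set.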
3. Task 3 - Non-linear Classification
(a) Basic task
i. Choosing the most suitable parameters
When using the C-SVC SVM with the Gaussian radial basis kernel there are two tunable parameters, C (cost) and γ (gamma). You have been given the following combinations: [C=1, γ=0.1], [C=5, γ=0.1], [C=10, γ=0.1], and [C=10, γ=0.01]. You should train an SVM model for each of the four given combinations and then test it on the normalised validation set. The accuracy rate for each combination on the validation set should be reported. Finally, you need to select the best combination of parameters and report your result.
ii. Non-linear classification
You should now be in a position to further test your model with the selected parameters by classifying the test data. With the normalised whole training set (I) as the input, you will need to train an SVM model with the suitable parameter values discovered for C and γ in Task 3 (a)i. Once the classification model is built, you will need to use it to classify the normalised test set and report a) the accuracy rate and b) the confusion matrix.
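The basic task can be sketched as below, again on synthetic stand-in data rather than the real files. The split settings (25%, stratified, fixed seed) and the made-up 20/20 test labels are assumptions of this example. Ties between parameter combinations are resolved here by taking the first one in the list, which is just one possible rule.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the real dataset).
rng = np.random.default_rng(0)
y_train = np.array([0] * 88 + [1] * 116)
y_test = np.array([0] * 20 + [1] * 20)  # made-up class split
X_train = rng.normal(size=(204, 10)) + 1.5 * y_train[:, None]
X_test = rng.normal(size=(40, 10)) + 1.5 * y_test[:, None]

# Task 2: training set (II) and validation set, scaled with (II) statistics.
X_tr2, X_val, y_tr2, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=42)
sc2 = StandardScaler().fit(X_tr2)
X_tr2_s, X_val_s = sc2.transform(X_tr2), sc2.transform(X_val)

# Task 3 (a)i: validation accuracy for each given (C, gamma) combination.
grid = [(1, 0.1), (5, 0.1), (10, 0.1), (10, 0.01)]
val_acc = {}
for C, gamma in grid:
    model = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr2_s, y_tr2)
    val_acc[(C, gamma)] = accuracy_score(y_val, model.predict(X_val_s))
    print(f"C={C}, gamma={gamma}: validation accuracy {val_acc[(C, gamma)]:.3f}")

# max() keeps the first combination in case of a tie.
best_C, best_gamma = max(val_acc, key=val_acc.get)

# Task 3 (a)ii: retrain on the whole normalised training set (I), then test.
sc1 = StandardScaler().fit(X_train)
final = SVC(kernel="rbf", C=best_C, gamma=best_gamma)
final.fit(sc1.transform(X_train), y_train)
y_pred = final.predict(sc1.transform(X_test))

test_acc = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])  # rows: true, cols: predicted
print("Test accuracy:", test_acc)
print(cm)
```

Note that two scalers are used: one fitted on training set (II) for model selection, and one fitted on the whole training set (I) for the final model, matching the hints in Tasks 1 (c) and 2 (b).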
(b) Advanced task - non-linear classification with reduced features
i. Reduce features for both the normalised training set (I) and the normalised test set using the first seven features, excluding ISI, BUI, and FWI.
ii. Do the classification using the Gaussian radial basis kernel SVM with the parameter values selected in Task 3 (a)i:
Train an SVM model on the training set with reduced features.
Test the model on the corresponding test set, that is, the one with reduced features, and report the classification result on the test set.
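One way to sketch the advanced task is shown below, on synthetic stand-in data. The values C=10 and gamma=0.1 are placeholders for whatever Task 3 (a)i selects, the 20/20 test labels are made up, and the column order is assumed to follow the feature list given in the introduction.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

FEATURES = ["Temperature", "RH", "Ws", "Rain", "FFMC",
            "DMC", "DC", "ISI", "BUI", "FWI"]
REDUCED = FEATURES[:7]  # first seven features: drops ISI, BUI and FWI

# Synthetic stand-in data (NOT the real dataset); columns follow FEATURES.
rng = np.random.default_rng(0)
y_train = np.array([0] * 88 + [1] * 116)
y_test = np.array([0] * 20 + [1] * 20)  # made-up class split
X_train = rng.normal(size=(204, 10)) + 1.5 * y_train[:, None]
X_test = rng.normal(size=(40, 10)) + 1.5 * y_test[:, None]

# Normalise with training-set statistics, as in Task 1 (c).
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

# i. Keep only the reduced feature columns of the normalised sets.
keep = [FEATURES.index(name) for name in REDUCED]
X_train_r, X_test_r = X_train_s[:, keep], X_test_s[:, keep]

# ii. Train and test with the parameters chosen earlier (placeholders here).
C, gamma = 10, 0.1
datasets = {"all 10 features": (X_train_s, X_test_s),
            "first 7 features": (X_train_r, X_test_r)}
results = {}
for name, (X_tr, X_te) in datasets.items():
    clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_tr, y_train)
    pred = clf.predict(X_te)
    results[name] = (accuracy_score(y_test, pred),
                     confusion_matrix(y_test, pred, labels=[0, 1]))
    print(name, "accuracy:", results[name][0])
```

Because StandardScaler works column by column, slicing the already normalised sets gives the same values as normalising the reduced columns directly. Running both models side by side produces the evidence needed for the comparison in part (c).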
(c) Summarise your findings and write your conclusions, demonstrating critical thinking. For example, which model gives a better classification result: the one trained on the original features or the one trained on the reduced features? Or do they produce the same performance? Is this what you expected? Why? You need to provide evidence to support the reasons you give.