Context
Education is fast becoming a highly competitive sector, with hundreds of institutions to choose from. It is a life-transforming experience for any student, and choosing an institution has to be a thoughtful decision. Ranking agencies survey colleges to give students more insight. The agency RankForYou wants to use this year's survey to publish an editorial article in leading newspapers on the state of engineering education in the country. The Head of PR (Public Relations) comes to you, the data scientist working at RankForYou, and asks you to come up with evidence-based insights for that article.
Objective
To identify different types of engineering colleges in the country to better understand the state of affairs.
Key Questions
How many different types (clusters/segments) of colleges can be found from the data?
How do these different groups of colleges differ from each other?
Do you get slightly different solutions from two different techniques? How would you explain the difference?
Data Description
The data contains survey results for 26 engineering colleges. The initial survey data has been summarized into a rating scale of 1-5 for different factors.
Factor rating index
1 - Very low
2 - Low
3 - Medium
4 - High
5 - Very high
Data Dictionary
SR_NO: Serial Number
Engg_College: 26 Engineering colleges with pseudonyms A to Z
Teaching: Quality of teaching at the engineering college
Fees: Fees at the engineering college
Placements: Job placements after a student graduates from the engineering college
Internship: Student Internships at the engineering college
Infrastructure: Infrastructure of the engineering college
Let's start coding!
Importing necessary libraries
# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import cdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
Read Dataset
# loading the dataset
data = pd.read_excel("Engineering Colleges Case Study.xlsx")
Shape of Dataset
data.shape
Output:
(26, 7)
The dataset has 26 rows and 7 columns
View Dataset
# viewing a random sample of the dataset
data.sample(n=10, random_state=1)
Output:
# copying the data to another variable to avoid any changes to original data
df = data.copy()
# dropping the serial no. column as it does not provide any information
df.drop("SR_NO", axis=1, inplace=True)
df.info()
Output:
Observations
Engg_College is a categorical variable with 26 levels that indicate each college's name.
The 5 rating variables are of type int (integer).
df.describe()
Output:
Observations
The median value of fees is 4, indicating that at least half of the engineering colleges have high or very high fees.
The mean and median of other ratings lie between 2 and 3, except the mean infrastructure rating.
# checking for missing values
df.isna().sum()
Output:
There are no missing values in our data
EDA (Exploratory Data Analysis)
Univariate Analysis
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    # histogram; pass bins only when the caller supplied a value
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
# selecting numerical columns
num_col = df.select_dtypes(include=np.number).columns.tolist()
for item in num_col:
    histogram_boxplot(df, item, kde=True, figsize=(8, 4))
Output:
...
...
...
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the bar with its count or percentage
    plt.show()  # show the plot
for item in num_col:
    labeled_barplot(df, item, perc=True)
Output:
Observations
More than 75% of the colleges have a rating less than 4 for placements.
More than 80% of the colleges have a rating of 3 or more for infrastructure.
Bivariate Analysis
Let's check for correlations.
plt.figure(figsize=(15, 7))
sns.heatmap(df[num_col].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Output:
Observation
Rating for teaching is strongly positively correlated with the ratings for placements and internships.
This is plausible: if teaching quality is high, students are more likely to get placements and internships, although correlation alone does not establish that teaching causes them.
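Because the ratings are ordinal (1 to 5), a rank-based correlation can serve as a cross-check on the Pearson coefficients in the heatmap. The following is a minimal sketch on a small made-up frame; the `toy` values are illustrative assumptions, not the survey data.

```python
import pandas as pd

# Hypothetical ratings for six colleges (illustrative values only, not the survey data)
toy = pd.DataFrame(
    {
        "Teaching": [1, 2, 3, 4, 5, 5],
        "Placements": [1, 1, 2, 4, 4, 5],
        "Fees": [5, 4, 4, 2, 3, 1],
    }
)

# Pearson (the pandas default) measures linear association
pearson = toy.corr(method="pearson")

# Spearman correlates the ranks instead, which suits ordinal scales
spearman = toy.corr(method="spearman")

print(pearson.round(2))
print(spearman.round(2))
```

If the two matrices tell the same story, the heatmap conclusions are unlikely to depend on treating the ratings as interval data.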
sns.pairplot(data=df[num_col], diag_kind="kde")
plt.show()
Output:
Observations
Teaching is almost normally distributed.
Distribution of fees seems to be bimodal.
Distribution of Internships seems to be bimodal.
# Scaling the data set before clustering
scaler = StandardScaler()
subset = df[num_col].copy()
subset_scaled = scaler.fit_transform(subset)
# Creating a dataframe from the scaled data
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
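To make the scaling step concrete, here is a small sketch of the z-score computation that StandardScaler performs, written with NumPy only. The `ratings` array is a made-up example, not the survey data.

```python
import numpy as np

# Hypothetical ratings for four colleges on two factors (illustrative values only)
ratings = np.array(
    [
        [1.0, 5.0],
        [2.0, 4.0],
        [4.0, 4.0],
        [5.0, 3.0],
    ]
)

# Column-wise mean and population standard deviation (ddof=0),
# which is the convention StandardScaler uses
mean = ratings.mean(axis=0)
std = ratings.std(axis=0, ddof=0)

# z-score: how many standard deviations each rating lies from its column mean
z_scores = (ratings - mean) / std

print(z_scores.round(3))
print(z_scores.mean(axis=0).round(3), z_scores.std(axis=0).round(3))
```

After scaling, every column has mean 0 and standard deviation 1, so no single rating dominates the Euclidean distances used by k-means.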
clusters = range(1, 9)
meanDistortions = []
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(subset_scaled_df)
    # average distortion: mean Euclidean distance from each point to its nearest centroid
    distortion = (
        sum(
            np.min(cdist(subset_scaled_df, model.cluster_centers_, "euclidean"), axis=1)
        )
        / subset_scaled_df.shape[0]
    )
    meanDistortions.append(distortion)
    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()
Output:
Number of Clusters: 1 Average Distortion: 2.087990295998642
Number of Clusters: 2 Average Distortion: 1.6030760049686552
Number of Clusters: 3 Average Distortion: 1.3542868697697457
Number of Clusters: 4 Average Distortion: 1.16633874818233
Number of Clusters: 5 Average Distortion: 1.0468661954812357
Number of Clusters: 6 Average Distortion: 0.94008107013181
Number of Clusters: 7 Average Distortion: 0.8210830918162462
Number of Clusters: 8 Average Distortion: 0.7161563518185236
Output:
The appropriate value of k from the elbow curve seems to be 4 or 5.
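The average distortion printed above is the mean Euclidean distance from each point to its nearest centroid. The sketch below reproduces that quantity with NumPy on toy points and fixed, hand-picked centers; the helper name `average_distortion` and all values are assumptions for illustration.

```python
import numpy as np


def average_distortion(points, centers):
    """Mean Euclidean distance from each point to its nearest center."""
    # Pairwise differences via broadcasting: shape (n_points, n_centers, n_features)
    diffs = points[:, np.newaxis, :] - centers[np.newaxis, :, :]
    distances = np.linalg.norm(diffs, axis=2)
    # Keep the distance to the closest center for each point, then average
    return distances.min(axis=1).mean()


# Hypothetical 2-D points forming two tight groups (illustrative values only)
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

one_center = np.array([[5.0, 1.0]])
two_centers = np.array([[0.0, 1.0], [10.0, 1.0]])

print(average_distortion(points, one_center))  # every point is sqrt(26) away
print(average_distortion(points, two_centers))  # every point is 1.0 away
```

The sharp drop from one center to two is what shows up as an elbow; once k exceeds the number of natural groups, further drops become small.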
Let's check the silhouette scores.
sil_score = []
cluster_list = list(range(2, 10))
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(subset_scaled_df)
    # mean silhouette coefficient over all samples
    score = silhouette_score(subset_scaled_df, preds)
    sil_score.append(score)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
plt.plot(cluster_list, sil_score)
Output:
For n_clusters = 2, silhouette score is 0.3347415593639785
For n_clusters = 3, silhouette score is 0.28965899397924016
For n_clusters = 4, silhouette score is 0.3490226771698325
For n_clusters = 5, silhouette score is 0.3578484211066675
For n_clusters = 6, silhouette score is 0.3689672936661078
For n_clusters = 7, silhouette score is 0.39463090921434624
For n_clusters = 8, silhouette score is 0.3975512934421172
For n_clusters = 9, silhouette score is 0.40224546126764127
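From the silhouette scores, 7 seems to be a good value of k: the score rises from about 0.369 at k = 6 to about 0.395 at k = 7 and improves only marginally beyond that.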
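For intuition about what these scores measure, here is a hand-rolled sketch of the mean silhouette coefficient, s = (b - a) / max(a, b), where a is a point's mean distance to the other members of its own cluster and b is its mean distance to the nearest other cluster. The function name `silhouette_by_hand` and the toy points are assumptions for illustration; in practice `silhouette_score` does this work.

```python
import numpy as np


def silhouette_by_hand(points, labels):
    """Mean silhouette coefficient using Euclidean distances."""
    # Full pairwise distance matrix, shape (n_points, n_points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = []
    for i in range(len(points)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            # Convention (also used by scikit-learn): singleton clusters score 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = dist[i, same].sum() / (n_same - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            dist[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))


# Hypothetical points: two well-separated pairs (illustrative values only)
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good_labels = np.array([0, 0, 1, 1])  # groups nearby points together
bad_labels = np.array([0, 1, 0, 1])  # splits each natural pair

print(silhouette_by_hand(points, good_labels))
print(silhouette_by_hand(points, bad_labels))
```

A score near 1 means points sit much closer to their own cluster than to any other, while a negative score suggests points have been assigned to the wrong cluster.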
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(7, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
Output:
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 26 Samples in 7 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(6, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
Output:
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 26 Samples in 6 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(5, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
Output:
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 26 Samples in 5 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()
Output:
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 26 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
Let's take 5 as the appropriate number of clusters, as the silhouette score is high enough and there is a kink at 5 in the elbow curve.
kmeans = KMeans(n_clusters=5, random_state=0)
kmeans.fit(subset_scaled_df)
Output:
KMeans(n_clusters=5, random_state=0)
# adding kmeans cluster labels to the original dataframe
df["K_means_segments"] = kmeans.labels_
# numeric_only=True skips the text column Engg_College, which cannot be averaged
cluster_profile = df.groupby("K_means_segments").mean(numeric_only=True)
cluster_profile["count_in_each_segment"] = (
    df.groupby("K_means_segments")["Fees"].count().values
)
# let's display cluster profiles
cluster_profile.style.highlight_max(color="lightgreen", axis=0)
Output:
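The profile table is a per-cluster mean of each rating plus the cluster size. A minimal sketch of the same pattern on a made-up frame follows; the `toy` data and the `segment` column name are illustrative assumptions, not the survey data.

```python
import pandas as pd

# Hypothetical labelled data (illustrative values only, not the survey data)
toy = pd.DataFrame(
    {
        "Engg_College": ["A", "B", "C", "D", "E"],
        "Teaching": [5, 4, 2, 1, 2],
        "Fees": [2, 3, 5, 4, 5],
        "segment": [0, 0, 1, 1, 1],
    }
)

# Mean rating per segment; numeric_only=True skips the text column
profile = toy.groupby("segment").mean(numeric_only=True)

# Segment sizes, aligned on the segment index rather than by position
profile["count_in_each_segment"] = toy.groupby("segment").size()

print(profile)
```

Assigning the result of `size()` directly lets pandas align the counts by segment label, which avoids relying on the row order of `.values`.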
fig, axes = plt.subplots(1, 5, figsize=(16, 6))
fig.suptitle("Boxplot of numerical variables for each cluster")
# one boxplot per rating variable, grouped by cluster label
for ii, col in enumerate(num_col):
    sns.boxplot(ax=axes[ii], y=df[col], x=df["K_means_segments"])
fig.tight_layout(pad=2.0)
Output:
df.groupby("K_means_segments").mean(numeric_only=True).plot.bar(figsize=(15, 6))
Output:
Ref: Great Learning