realcode4you

Case Study Assignment Help | Engineering Colleges Case Study

Context

Education is fast becoming a very competitive sector, with hundreds of institutions to choose from. Choosing a college is a life-transforming experience for any student, so it has to be a thoughtful decision. Ranking agencies survey colleges to give students more insight. The agency RankForYou wants to leverage this year's survey to roll out an editorial article in leading newspapers on the state of engineering education in the country. The Head of PR (Public Relations) comes to you, the data scientist working at RankForYou, and asks you to come up with evidence-based insights for that article.

Objective

To identify different types of engineering colleges in the country to better understand the state of affairs.

Key Questions

  • How many different types (clusters/segments) of colleges can be found from the data?

  • How do these different groups of colleges differ from each other?

  • Do you get slightly different solutions from two different techniques? How would you explain the difference?
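The third question asks for a comparison between two techniques, but the second technique is not named here. As a rough sketch only, the block below assumes it is agglomerative (hierarchical) clustering with Ward linkage and compares it with K-means on synthetic stand-in data; the data, the choice of 3 clusters, and the variable names are all assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in data: 26 rows, 5 features (NOT the survey data).
X, _ = make_blobs(n_samples=26, centers=3, n_features=5, random_state=1)

# Technique 1: K-means. Technique 2 (assumed): Ward-linkage hierarchical clustering.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# Cross-tabulation: rows are K-means clusters, columns are hierarchical clusters.
# Cluster numbers are arbitrary, so look at how rows map onto columns,
# not at whether the label values match.
table = pd.crosstab(
    km_labels, hc_labels, rownames=["kmeans"], colnames=["hierarchical"]
)

# Adjusted Rand index: 1.0 means both techniques group the rows identically.
ari = adjusted_rand_score(km_labels, hc_labels)

print(table)
print("Adjusted Rand index:", round(ari, 3))
```

A cross-tabulation with one non-zero cell per row means the two solutions agree up to relabeling; rows that spread across several columns show where the techniques disagree.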

Data Description

The data contains survey results for 26 engineering colleges. The initial survey data has been summarized into a rating scale of 1-5 for different factors.

Factor rating index

  • 1 - Very low

  • 2 - Low

  • 3 - Medium

  • 4 - High

  • 5 - Very high

Data Dictionary

  • SR_NO: Serial Number

  • Engg_College: 26 Engineering colleges with pseudonyms A to Z

  • Teaching: Quality of teaching at the engineering college

  • Fees: Fees at the engineering college

  • Placements: Job placements after a student graduates from the engineering college

  • Internship: Student Internships at the engineering college

  • Infrastructure: Infrastructure of the engineering college
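Before analysis, it can help to confirm that a loaded table matches this data dictionary. The sketch below is one possible check, not part of the original notebook: the three-row `sample` frame is made up, and `check_ratings` and `RATING_COLS` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical three-row sample that follows the data dictionary above;
# the values are made up for illustration and are NOT the survey data.
sample = pd.DataFrame(
    {
        "SR_NO": [1, 2, 3],
        "Engg_College": ["A", "B", "C"],
        "Teaching": [5, 4, 2],
        "Fees": [2, 4, 5],
        "Placements": [5, 3, 1],
        "Internship": [5, 4, 2],
        "Infrastructure": [3, 5, 3],
    }
)

RATING_COLS = ["Teaching", "Fees", "Placements", "Internship", "Infrastructure"]


def check_ratings(frame, rating_cols=RATING_COLS, low=1, high=5):
    """Return True if every rating column exists and holds integers in [low, high]."""
    missing = [c for c in rating_cols if c not in frame.columns]
    if missing:
        return False
    ratings = frame[rating_cols]
    is_int = all(pd.api.types.is_integer_dtype(t) for t in ratings.dtypes)
    in_range = bool(ratings.ge(low).all().all() and ratings.le(high).all().all())
    return is_int and in_range


print(check_ratings(sample))
```

The same function could be applied to the real dataframe once it has been read from the Excel file.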


Let's start coding!

Importing necessary libraries

# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black

# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# to scale the data using z-score
from sklearn.preprocessing import StandardScaler

# to compute distances
from scipy.spatial.distance import cdist

# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer

Read Dataset

# loading the dataset
data = pd.read_excel("Engineering Colleges Case Study.xlsx")

Shape of Dataset

data.shape

output:

(26, 7)
  • The dataset has 26 rows and 7 columns


View Dataset

# viewing a random sample of the dataset
data.sample(n=10, random_state=1)

Output:













# copying the data to another variable to avoid any changes to original data
df = data.copy()
# dropping the serial no. column as it does not provide any information
df.drop("SR_NO", axis=1, inplace=True)
df.info()

Output:













Observations

  • Engg_College is a categorical variable with 26 levels that indicate each college's name.

  • The 5 rating variables are of type int (integer).


df.describe()

Output:












Observations

  • The median value of fees is 4, so at least half of the engineering colleges have high or very high fees.

  • The mean and median of other ratings lie between 2 and 3, except the mean infrastructure rating.


# checking for missing values
df.isna().sum()

Output:











  • There are no missing values in our data


EDA (Exploratory Data Analysis)

Univariate Analysis

# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle marker will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram


# selecting numerical columns
num_col = df.select_dtypes(include=np.number).columns.tolist()
for item in num_col:
    histogram_boxplot(df, item, kde=True, figsize=(8, 4))

Output:













...

...

...



# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot


for item in num_col:
    labeled_barplot(df, item, perc=True)

Output:


















Observations

  • More than 75% of the colleges have a rating less than 4 for placements.

  • More than 80% of the colleges have a rating of 3 or more for infrastructure.
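Percentages like these can also be read off numerically rather than from the bar labels. The following is a minimal sketch using made-up ratings, not the survey data; `placements`, `share`, and `below_4` are names chosen here for illustration.

```python
import pandas as pd

# Hypothetical placement ratings for ten colleges (NOT the survey data).
placements = pd.Series([1, 2, 2, 3, 3, 3, 4, 2, 5, 3], name="Placements")

# Share of colleges at each rating level, ordered by rating.
share = placements.value_counts(normalize=True).sort_index()

# Share of colleges rated below 4: the mean of a boolean mask is a proportion.
below_4 = (placements < 4).mean()

print(share.mul(100).round(1))
print(f"Rated below 4: {below_4:.0%}")
```

On the real data, the same two lines applied to `df["Placements"]` and `df["Infrastructure"]` would give the figures behind the observations above.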



Bivariate Analysis

Let's check for correlations.

plt.figure(figsize=(15, 7))
sns.heatmap(df[num_col].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()

Output:











Observation

  • Rating for teaching is strongly positively correlated with the ratings for placements and internships.

  • This is plausible: if teaching quality is high, students are more likely to get placements and internships. The correlation alone does not establish that one causes the other.
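The heatmap above uses pandas' default Pearson correlation. Because the ratings are ordinal (1-5), a rank-based measure such as Spearman's is a common cross-check. The sketch below shows both calls on made-up ratings; the `ratings` frame is hypothetical and is not the survey data.

```python
import pandas as pd

# Hypothetical ratings (NOT the survey data), used only to show the calls.
ratings = pd.DataFrame(
    {
        "Teaching": [5, 4, 4, 3, 2, 1],
        "Placements": [5, 5, 3, 3, 2, 1],
        "Fees": [2, 4, 5, 3, 4, 5],
    }
)

pearson = ratings.corr(method="pearson")  # linear association
spearman = ratings.corr(method="spearman")  # rank-based, suits ordinal scales

print(pearson.round(2))
print(spearman.round(2))
```

If the two matrices tell the same story, the conclusions drawn from the heatmap are less likely to be an artifact of treating ratings as interval data.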


sns.pairplot(data=df[num_col], diag_kind="kde")
plt.show()

Output:



















Observations

  • Teaching is almost normally distributed.

  • Distribution of fees seems to be bimodal.

  • Distribution of Internships seems to be bimodal.


# Scaling the data set before clustering
scaler = StandardScaler()
subset = df[num_col].copy()
subset_scaled = scaler.fit_transform(subset)
# Creating a dataframe from the scaled data
subset_scaled_df = pd.DataFrame(subset_scaled, columns=subset.columns)
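As a quick illustration of what the scaling step does, the sketch below reproduces z-score scaling by hand on a small made-up array and compares it with `StandardScaler`. The array `X` is hypothetical; the point is only that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical ratings for four colleges and two factors (NOT the survey data).
X = np.array([[1, 4], [2, 4], [4, 5], [5, 3]], dtype=float)

# z-score by hand: subtract the column mean, then divide by the population
# standard deviation (ddof=0), which is what StandardScaler uses.
manual = (X - X.mean(axis=0)) / X.std(axis=0, ddof=0)

scaled = StandardScaler().fit_transform(X)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Scaling matters for K-means because it relies on Euclidean distance, so a factor with a wider spread would otherwise dominate the clustering.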
# computing the average distance to the nearest centroid for k = 1 to 8
clusters = range(1, 9)
meanDistortions = []

for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(subset_scaled_df)
    distortion = (
        sum(
            np.min(cdist(subset_scaled_df, model.cluster_centers_, "euclidean"), axis=1)
        )
        / subset_scaled_df.shape[0]
    )

    meanDistortions.append(distortion)

    print("Number of Clusters:", k, "\tAverage Distortion:", distortion)

plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k with the Elbow Method", fontsize=20)
plt.show()

Output:

Number of Clusters: 1 	Average Distortion: 2.087990295998642
Number of Clusters: 2 	Average Distortion: 1.6030760049686552
Number of Clusters: 3 	Average Distortion: 1.3542868697697457
Number of Clusters: 4 	Average Distortion: 1.16633874818233
Number of Clusters: 5 	Average Distortion: 1.0468661954812357
Number of Clusters: 6 	Average Distortion: 0.94008107013181
Number of Clusters: 7 	Average Distortion: 0.8210830918162462
Number of Clusters: 8 	Average Distortion: 0.7161563518185236

Output:















The appropriate value of k from the elbow curve seems to be 4 or 5.
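The "average distortion" used for the elbow curve is the mean Euclidean distance from each point to its nearest centroid. This differs from the `inertia_` attribute that scikit-learn's `KMeans` reports, which is the sum of squared distances. A minimal sketch of the relationship, on synthetic stand-in data rather than the survey data:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data: three separated groups in five dimensions.
X, _ = make_blobs(n_samples=60, centers=3, n_features=5, random_state=1)

model = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)

# Distance from every point to its nearest centroid.
nearest = np.min(cdist(X, model.cluster_centers_, "euclidean"), axis=1)

avg_distortion = nearest.mean()  # the kind of quantity plotted in the elbow curve
inertia_by_hand = np.sum(nearest**2)  # sum of squared distances, as in inertia_

print("Average distortion:", avg_distortion)
print("Matches inertia_:", np.isclose(inertia_by_hand, model.inertia_))
```

Either quantity can be plotted against k. Both decrease as k grows, so the useful signal is where the curve bends, not its lowest value.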

Let's check the silhouette scores.

sil_score = []
cluster_list = list(range(2, 10))
for n_clusters in cluster_list:
    clusterer = KMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(subset_scaled_df)
    score = silhouette_score(subset_scaled_df, preds)
    sil_score.append(score)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))

plt.plot(cluster_list, sil_score)
plt.xlabel("k")
plt.ylabel("Silhouette score")
plt.show()

Output:

For n_clusters = 2, silhouette score is 0.3347415593639785
For n_clusters = 3, silhouette score is 0.28965899397924016
For n_clusters = 4, silhouette score is 0.3490226771698325
For n_clusters = 5, silhouette score is 0.3578484211066675
For n_clusters = 6, silhouette score is 0.3689672936661078
For n_clusters = 7, silhouette score is 0.39463090921434624
For n_clusters = 8, silhouette score is 0.3975512934421172
For n_clusters = 9, silhouette score is 0.40224546126764127











The silhouette score rises steadily from k = 3 and peaks at k = 9 (0.402), but the gains are small after k = 7 (0.395), so 7 seems to be a good value of k.
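For readers unfamiliar with the metric, the silhouette coefficient of a point is s = (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the points of the nearest other cluster. The sketch below computes it by hand on a few made-up points and checks the result against scikit-learn; `silhouette_by_hand` is a hypothetical helper, not part of the original notebook.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_samples, silhouette_score

# Hypothetical scaled points in two obvious groups (NOT the survey data).
X = np.array(
    [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [3.0, 3.0], [3.2, 2.9], [2.9, 3.3]]
)
labels = np.array([0, 0, 0, 1, 1, 1])


def silhouette_by_hand(X, labels):
    """Silhouette coefficient s(i) = (b - a) / max(a, b) for every point."""
    dist = cdist(X, X)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            continue  # singleton cluster: the score is defined as 0
        a = dist[i, same].mean()  # mean distance within own cluster
        b = min(
            dist[i, labels == other].mean()
            for other in np.unique(labels)
            if other != labels[i]
        )  # mean distance to the nearest other cluster
        scores[i] = (b - a) / max(a, b)
    return scores


by_hand = silhouette_by_hand(X, labels)
print(np.allclose(by_hand, silhouette_samples(X, labels)))
print(np.isclose(by_hand.mean(), silhouette_score(X, labels)))
```

The overall silhouette score printed in the loop above is the mean of these per-point values, which is why a few poorly placed points can pull it down noticeably with only 26 samples.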

# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(7, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()

Output:

[Figure: Silhouette Plot of KMeans Clustering for 26 Samples in 7 Centers; x-axis: silhouette coefficient values; y-axis: cluster label]



# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(6, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()

Output:

[Figure: Silhouette Plot of KMeans Clustering for 26 Samples in 6 Centers; x-axis: silhouette coefficient values; y-axis: cluster label]


# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(5, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()

Output:

[Figure: Silhouette Plot of KMeans Clustering for 26 Samples in 5 Centers; x-axis: silhouette coefficient values; y-axis: cluster label]


# finding optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(subset_scaled_df)
visualizer.show()

Output:

[Figure: Silhouette Plot of KMeans Clustering for 26 Samples in 4 Centers; x-axis: silhouette coefficient values; y-axis: cluster label]



Let's take 5 as the appropriate number of clusters: the silhouette score is high enough (about 0.36), and there is a kink at 5 in the elbow curve.
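With only 26 rows, K-means results can depend on the random initialization, so it may be worth checking how stable the chosen solution is across seeds. The following is a rough sketch on synthetic stand-in data (not the survey data); the use of the adjusted Rand index and the seed range are choices made here for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the 26 scaled rows: five groups, five features.
X, _ = make_blobs(
    n_samples=26, centers=5, n_features=5, cluster_std=0.5, random_state=0
)

reference = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Adjusted Rand index of each re-run against the reference labeling;
# 1.0 means the two runs put the same rows together.
agreement = [
    adjusted_rand_score(
        reference,
        KMeans(n_clusters=5, n_init=10, random_state=seed).fit_predict(X),
    )
    for seed in range(1, 6)
]

print(np.round(agreement, 2))
```

Values close to 1.0 for every seed suggest the segmentation is not an accident of one initialization; lower values would be a reason to treat the cluster profiles with more caution.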


kmeans = KMeans(n_clusters=5, random_state=0)
kmeans.fit(subset_scaled_df)

KMeans(n_clusters=5, random_state=0)


# adding kmeans cluster labels to the original dataframe
df["K_means_segments"] = kmeans.labels_
# numeric_only=True skips the text column Engg_College, which cannot be averaged
cluster_profile = df.groupby("K_means_segments").mean(numeric_only=True)
cluster_profile["count_in_each_segment"] = (
    df.groupby("K_means_segments")["Fees"].count().values
)
# let's display cluster profiles
cluster_profile.style.highlight_max(color="lightgreen", axis=0)

Output:









fig, axes = plt.subplots(1, 5, figsize=(16, 6))
fig.suptitle("Boxplot of numerical variables for each cluster")
# one boxplot per rating column, split by K-means segment
for ax, col in zip(axes, num_col):
    sns.boxplot(ax=ax, y=df[col], x=df["K_means_segments"])

fig.tight_layout(pad=2.0)

Output:











df.groupby("K_means_segments").mean(numeric_only=True).plot.bar(figsize=(15, 6))

Output:











Ref: Great Learning




