Context
When you think of sneakers for a trip, the importance of good footwear cannot be overlooked, and the obvious brands that come to mind are Adidas and Nike. Adidas vs Nike is a constant debate as the two giants of the apparel market, each with a large market cap and market share, battle it out to come out on top. As a newly hired Data Scientist in a market research company, you have been given the task of extracting insights from data on men's and women's shoes and grouping products together to identify similarities and differences between the product ranges of these renowned brands.
Objective
To perform an exploratory data analysis and cluster the products based on various factors
Key Questions
Which variables are most important for clustering?
How is each cluster different from the others?
What are the business recommendations?
Data Description
The dataset consists of 3268 products from Nike and Adidas, with features including their ratings, discount, sale price, listing price, product name, and number of reviews.
Product Name: Name of the product
Product ID: ID of the product
Listing Price: Listed price of the product
Sale Price: Sale price of the product
Discount: Percentage of discount on the product
Brand: Brand of the product
Rating: Rating of the product
Reviews: Number of reviews for the product
Let's start coding!
Importing necessary libraries
# nb_black automatically formats the code in each cell with Black (optional; requires the nb_black package)
%load_ext nb_black
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import pdist
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# loading the dataset
data = pd.read_csv("product.csv")
Check shape of dataset
data.shape
# viewing a random sample of the dataset
data.sample(n=10, random_state=1)
Output:
# copying the data to another variable to avoid any changes to original data
df = data.copy()
# fixing column names
df.columns = [c.replace(" ", "_") for c in df.columns]
# let's look at the structure of the data
df.info()
Output:
We won't need Product_ID for analysis, so let's drop this column.
df.drop("Product_ID", axis=1, inplace=True)
# let's check for duplicate observations
df.duplicated().sum()
There are 117 duplicate observations. We will remove them from the data.
df = df[(~df.duplicated())].copy()
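Because Product_ID was dropped before this check, rows that differ only in their ID now count as duplicates. As a minimal sketch of how `duplicated()` behaves, here is a toy example; the `toy` dataframe and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy data; the real dataset has more columns and rows.
toy = pd.DataFrame(
    {
        "Product_Name": ["Shoe A", "Shoe A", "Shoe B", "Shoe B"],
        "Sale_Price": [4999, 4999, 7999, 6999],
        "Brand": ["Adidas", "Adidas", "Nike", "Nike"],
    }
)

# duplicated() flags a row only if every column matches an earlier row
n_duplicates = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence, like toy[~toy.duplicated()]
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates)
print(deduped)
```

The two "Shoe B" rows are both kept because their sale prices differ, so only exact row-level repeats are removed.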
Let's take a look at the summary of the data
df.describe()
Output:
Observations
A listing price of 0 indicates a missing value.
The average listing price is 7046.
The average sale price is 5983.
The average discount is 28%.
The average rating is 3.3.
The average number of reviews is 42.
# let's check how many products have listing price 0
(df.Listing_Price == 0).sum()
# let's check the products which have listing price 0
df[(df.Listing_Price == 0)]
Output:
df[(df.Listing_Price == 0)].describe()
Output:
There are 336 observations with a missing (zero) value in the listing price column.
The discount for the products with a listing price of 0 is also 0.
So, we will replace the listing price with the corresponding sale price for those observations.
df.loc[(df.Listing_Price == 0), ["Listing_Price"]] = df.loc[
    (df.Listing_Price == 0), ["Sale_Price"]
].values
df.Listing_Price.describe()
Output:
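The replacement rests on the premise that a product with no discount sells at its listing price. A minimal sketch of the same idea on made-up data follows; the `toy` dataframe is hypothetical, and `Series.mask` is used here as an alternative to the `.loc` assignment above, not as the original method.

```python
import pandas as pd

# Hypothetical toy data with two "missing" (zero) listing prices
toy = pd.DataFrame(
    {
        "Listing_Price": [0, 9999, 0, 5999],
        "Sale_Price": [7495, 4999, 3995, 5999],
        "Discount": [0, 50, 0, 0],
    }
)

zero_listing = toy["Listing_Price"] == 0

# Confirm the premise first: no zero-priced row carries a discount
assert (toy.loc[zero_listing, "Discount"] == 0).all()

# Where the listing price is 0, take the sale price instead
toy["Listing_Price"] = toy["Listing_Price"].mask(zero_listing, toy["Sale_Price"])

print(toy)
```

Checking the discount before overwriting guards against silently setting a wrong listing price if a zero-priced product ever did carry a discount.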
# checking missing values
df.isna().sum()
Output:
Exploratory Data Analysis
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and the mean value of the column will be marked on it
    # histogram; fall back to seaborn's default binning ("auto") when bins is not given
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins if bins else "auto"
    )
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
# selecting numerical columns
num_col = df.select_dtypes(include=np.number).columns.tolist()
for item in num_col:
    histogram_boxplot(df, item)
Output:
Observations
Listing price and sale price have right-skewed distributions with upper outliers, which indicates the presence of very expensive products.
The maximum discount given is 60%.
Rating is left-skewed and most of the ratings are between 2.5 and 4.5.
The number of reviews is between 1 and 100, with an outlier value above 200.
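The skewness and outlier remarks above are read off the plots, but they can also be checked numerically. The sketch below uses made-up values (the `toy` dataframe and the helper `iqr_outlier_count` are hypothetical) and the 1.5 x IQR rule, which is the default whisker length in a seaborn boxplot.

```python
import pandas as pd

# Hypothetical toy data: one very expensive product, one unrated product
toy = pd.DataFrame(
    {
        "Sale_Price": [1999, 2499, 2999, 3499, 3999, 4499, 4999, 29999],
        "Rating": [0.0, 3.0, 3.5, 4.0, 4.0, 4.5, 4.5, 5.0],
    }
)

# Positive skewness = right-skewed, negative skewness = left-skewed
skewness = toy.skew()


def iqr_outlier_count(series):
    """Return (lower, upper) counts of points beyond 1.5 * IQR from the quartiles."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower = (series < q1 - 1.5 * iqr).sum()
    upper = (series > q3 + 1.5 * iqr).sum()
    return int(lower), int(upper)


print(skewness)
print(iqr_outlier_count(toy["Sale_Price"]))
print(iqr_outlier_count(toy["Rating"]))
```

On the real data, the same two calls could be run on `df[num_col]` to back the visual observations with numbers.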
fig, axes = plt.subplots(3, 2, figsize=(20, 15))
fig.suptitle("CDF plot of numerical variables", fontsize=20)
counter = 0
for ii in range(3):
    sns.ecdfplot(ax=axes[ii][0], x=df[num_col[counter]])
    counter = counter + 1
    if counter != 5:
        sns.ecdfplot(ax=axes[ii][1], x=df[num_col[counter]])
        counter = counter + 1
    else:
        pass
fig.tight_layout(pad=2.0)
Output:
Observations
90% of the products have a listing price below 15000.
95% of the products have a sale price below 15000.
80% of the products have a discount of 50% or less.
50% of the products have a rating of 3.5 or less.
Almost all products have 100 or fewer reviews.
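Each observation above is a point on an empirical CDF: the share of products at or below some value. As a rough sketch, those shares can be computed directly instead of being read from the plot. The data and the helper `ecdf_at` below are hypothetical and only illustrate the calculation.

```python
import pandas as pd

# Hypothetical toy data: discount percentages for ten products
toy = pd.DataFrame({"Discount": [0, 0, 0, 20, 30, 40, 40, 50, 50, 60]})


def ecdf_at(series, threshold):
    """Fraction of observations less than or equal to threshold."""
    return (series <= threshold).mean()


# Share of products with a discount of 50% or less
share_at_most_50 = ecdf_at(toy["Discount"], 50)

# The inverse question: which value has half of the products at or below it?
median_discount = toy["Discount"].quantile(0.5)

print(share_at_most_50, median_discount)
```

On the real data, `ecdf_at(df["Listing_Price"], 15000)` would give the exact share behind the first observation.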
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the count or percentage

    plt.show()  # show the plot
# let's explore discounts further
labeled_barplot(df, "Discount", perc=True)
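The percentages printed on top of the bars can also be obtained as a table, which is easier to quote in a report. This is a minimal sketch on made-up discount values; the `toy` dataframe is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: discount percentages for ten products
toy = pd.DataFrame({"Discount": [0, 0, 0, 0, 40, 40, 50, 50, 50, 60]})

# Share of products at each discount level, in percent, ordered by discount
discount_share = (
    toy["Discount"]
    .value_counts(normalize=True)
    .sort_index()
    .mul(100)
    .round(1)
)

print(discount_share)
```

Running the same chain on `df["Discount"]` should reproduce the labels shown in the bar plot.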