Context
Buying and selling used phones and tablets used to be something that happened on a handful of online marketplace sites. But the used and refurbished device market has grown considerably over the past decade, and a new IDC (International Data Corporation) forecast predicts that the used phone market will be worth $52.7bn by 2023, with a compound annual growth rate (CAGR) of 13.6% from 2018 to 2023. This growth can be attributed to an uptick in demand for used phones and tablets that offer considerable savings compared with new models.
Refurbished and used devices continue to provide cost-effective alternatives to both consumers and businesses looking to save money on a purchase. There are plenty of other benefits associated with the used device market. Used and refurbished devices can be sold with warranties and can also be insured with proof of purchase. Third-party vendors/platforms, such as Verizon, Amazon, etc., provide attractive offers to customers for refurbished devices. Maximizing the longevity of devices through second-hand trade also reduces their environmental impact and helps in recycling and reducing waste. The impact of the COVID-19 outbreak may further boost this segment as consumers cut back on discretionary spending and buy phones and tablets only for immediate needs.
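As a quick sanity check on the forecast (a back-of-the-envelope calculation, not a figure from the IDC report), the stated CAGR implies a 2018 market size of roughly $27.9bn:
# implied 2018 market size from the 2023 forecast and the stated CAGR
value_2023 = 52.7  # in $bn
cagr = 0.136
value_2018 = value_2023 / (1 + cagr) ** 5  # 5 years from 2018 to 2023
print(f"Implied 2018 market size: ${value_2018:.1f}bn")  # ~ $27.9bn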
Objective
The rising potential of this comparatively under-the-radar market fuels the need for an ML-based solution to develop a dynamic pricing strategy for used and refurbished devices. ReCell, a startup aiming to tap the potential in this market, has hired you as a data scientist. They want you to analyze the data provided and build a linear regression model to predict the price of a used phone/tablet and identify factors that significantly influence it.
Data Description
The data contains different attributes of used/refurbished phones and tablets. The detailed data dictionary is given below.
Let us start by importing the necessary libraries and loading the data.
# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# split the data into train and test
from sklearn.model_selection import train_test_split
# to build linear regression model
from sklearn.linear_model import LinearRegression
# to check model performance
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
# to build linear regression model using statsmodels
import statsmodels.api as sm
# to compute VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
# loading data
data = pd.read_csv("4017861-used_device_data.csv")
# checking shape of the data
print(f"There are {data.shape[0]} rows and {data.shape[1]} columns.")
output:
There are 3454 rows and 15 columns.
# let's view a sample of the data
data.sample(n=10, random_state=1)
output:
# let's create a copy of the data to avoid any changes to original data
df = data.copy()
# checking the column names and datatypes
df.info()
output:
# checking for duplicate values
df.duplicated().value_counts()
output:
False    3454
dtype: int64
# checking for missing values in the data
df.isnull().sum()
output:
Exploratory Data Analysis
Let's check the statistical summary of the data.
df.describe()
output:
Univariate Analysis
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a star will indicate the mean value of the column
    if bins:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)  # histogram with the specified bins
    else:
        sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)  # histogram with automatic bins
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
used_price
histogram_boxplot(df, "used_price")
output:
df["used_price_log"] = np.log(df["used_price"])
histogram_boxplot(df,"used_price_log")
output:
new_price
histogram_boxplot(df,"new_price")
output:
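The model built later in this notebook uses a log-transformed version of new_price (it appears among the selected features), so, mirroring the treatment of used_price, we create it here:
# new_price is also right-skewed; create a log-transformed version for modeling
df["new_price_log"] = np.log(df["new_price"])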
screen_size
histogram_boxplot(df,"screen_size")
output:
main_camera_mp
histogram_boxplot(df,"main_camera_mp")
output:
days_used
histogram_boxplot(df,"days_used")
output:
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # x-coordinate for the annotation (center of the bar)
        y = p.get_height()  # y-coordinate for the annotation (top of the bar)
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
brand_name
labeled_barplot(df, "brand_name", perc=True, n=10)
output:
os
labeled_barplot(df,"os")
output:
4g
labeled_barplot(df,"4g")
output:
5g
labeled_barplot(df,"5g")
output:
Bivariate Analysis
cols_list = df.select_dtypes(include=np.number).columns.tolist()
# dropping release_year as it is a temporal variable
cols_list.remove("release_year")
plt.figure(figsize=(15, 7))
sns.heatmap(
df[cols_list].corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral"
)
plt.show()
output:
The amount of RAM is important for the smooth functioning of a device. Let's see how the amount of RAM varies across brands.
plt.figure(figsize=(15, 5))
sns.barplot(data=df, x="brand_name", y="ram")
plt.xticks(rotation=90)
plt.show()
output:
People who travel frequently require devices with large batteries to last through the day. But a large battery often increases a device's weight, making it uncomfortable to hold. Let's create a new dataframe of only those devices which offer a large battery and analyze it.
df_large_battery = df[df.battery > 4500]
df_large_battery.shape
df_large_battery.groupby("brand_name")["weight"].mean().sort_values(ascending=True)
plt.figure(figsize=(15, 5))
sns.barplot(data=df_large_battery, x="brand_name", y="weight")
plt.xticks(rotation=60)
plt.show()
output:
People who buy devices primarily for entertainment purposes prefer a large screen as it offers a better viewing experience. Let's create a new dataframe of only those devices which are suitable for such people and analyze it.
df_large_screen = df[df.screen_size > 6 * 2.54]
df_large_screen.shape
df_large_screen.brand_name.value_counts()
labeled_barplot(df_large_screen, "brand_name")
output:
Data Preprocessing
Feature Engineering
Let's create a new column device_category from the new_price column to tag devices as budget (new price under 200), mid-ranger (200-350), or premium (above 350).
df["device_category"] = pd.cut(
x=df.new_price,
bins=[-np.infty, 200, 350, np.infty],
labels=["Budget", "Mid-ranger", "Premium"],
)
df["device_category"].value_counts()
output:
Budget        1844
Mid-ranger    1025
Premium        585
Name: device_category, dtype: int64
Let's check the distribution of 4G and 5G devices with respect to price segments.
plt.figure(figsize=(15, 4))
plt.subplot(121)
sns.heatmap(
pd.crosstab(df["4g"], df["device_category"], normalize="columns"),
annot=True,
fmt=".4f",
cmap="Spectral",
)
plt.subplot(122)
sns.heatmap(
    pd.crosstab(df["5g"], df["device_category"], normalize="columns"),
    annot=True,
    fmt=".4f",
    cmap="Spectral",
)
plt.show()
output:
Missing Value Imputation
We will impute the missing values in the data by the column medians grouped by release_year and brand_name.
# let's create a copy of the data
df1 = df.copy()
# checking for missing values
df1.isnull().sum()
output:
We impute the missing values in cols_impute with the median after grouping the data on release_year and brand_name:
cols_impute = [
"screen_size",
"main_camera_mp",
"selfie_camera_mp",
"int_memory",
"ram",
"battery",
"weight",
]
for col in cols_impute:
df1[col] = df1.groupby(["release_year","brand_name"])[col].transform(
lambda x: x.fillna(x.median())
)
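One caveat (a defensive check, not part of the original flow): a grouped median is itself missing when every device of a given brand and release year lacks the value, so a few missing entries can survive the loop. We can verify and fall back to the overall column median if needed:
# check for any values still missing after the grouped imputation
print(df1[cols_impute].isnull().sum())
# fall back to the overall column median for anything that remains
for col in cols_impute:
    df1[col] = df1[col].fillna(df1[col].median())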
Outlier Check
Let's check for outliers in the data.
# outlier detection using boxplot
numeric_columns = df1.select_dtypes(include=np.number).columns.tolist()
# dropping release_year as it is a temporal variable
numeric_columns.remove("release_year")
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 3, i + 1)  # grid with room for all numeric columns
    sns.boxplot(data=df1, x=variable)
plt.tight_layout()
plt.show()
output:
Data Preparation for modeling
We want to predict the used device price, so we will use the log-transformed version used_price_log as the target for modeling.
Before we proceed to build a model, we'll have to encode categorical features.
We'll split the data into train and test to be able to evaluate the model that we build on the train data.
# defining the dependent and independent variables
X = df1.drop(["new_price", "used_price", "used_price_log", "device_category"], axis=1)
y = df1["used_price_log"]
print(X.head())
print()
print(y.head())
output:
# creating dummy variables
X = pd.get_dummies(
X,
columns=X.select_dtypes(include=["object", "category"]).columns.tolist(),
drop_first=True,
)
X.head()
output:
# splitting the data in 70:30 ratio for train to test data
x_train, x_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=1)
print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in test data =", x_test.shape[0])
output:
Number of rows in train data = 2417
Number of rows in test data = 1037
Linear Regression using statsmodels
Let's build a linear regression model using statsmodels.
# adding constant to the train data
x_train1 = sm.add_constant(x_train)
# adding constant to the test data
x_test1 = sm.add_constant(x_test)
# fitting the OLS model on the train data
olsmodel1 = sm.OLS(y_train, x_train1).fit()
print(olsmodel1.summary())
output:
...
...
Let's check the model performance.
We will check the model performance on the actual prices and not the log values.
We will create a function that will convert the log prices to actual prices and then check the performance.
We will be using metric functions defined in sklearn for RMSE and MAE.
We will define a function to calculate MAPE.
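For reference, MAPE is defined as $\text{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$, which is exactly what the function below computes.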
# function to compute MAPE
def mape_score(targets, predictions):
return np.mean(np.abs(targets - predictions) / targets) * 100
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
"""
Function to compute different metrics to check regression model performance
model: regressor
predictors: independent variables
target: dependent variable
"""
# predicting using the independent variables
pred = model.predict(predictors)
# computing the actual prices by using the exponential function
target = np.exp(target)
pred = np.exp(pred)
rmse = np.sqrt(mean_squared_error(target, pred)) # to compute RMSE
mae = mean_absolute_error(target, pred) # to compute MAE
mape = mape_score(target, pred) # to compute MAPE
# creating a dataframe of metrics
df_perf = pd.DataFrame({"RMSE": rmse, "MAE": mae, "MAPE": mape,}, index=[0],)
return df_perf
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
olsmodel1_train_perf = model_performance_regression(olsmodel1, x_train1, y_train)
olsmodel1_train_perf
output:
# checking model performance on test set (unseen 30% data)
print("Test Performance\n")
olsmodel1_test_perf = model_performance_regression(olsmodel1, x_test1, y_test)
olsmodel1_test_perf
output:
Checking Linear Regression Assumptions
We will be checking the following linear regression assumptions:
No Multicollinearity
Linearity of variables
Independence of error terms
Normality of error terms
No Heteroscedasticity
TEST FOR MULTICOLLINEARITY
We will test for multicollinearity using VIF.
# we will define a function to check VIF
def checking_vif(predictors):
vif = pd.DataFrame()
vif["feature"] = predictors.columns
# calculating VIF for each feature
vif["VIF"] = [
variance_inflation_factor(predictors.values, i)
for i in range(len(predictors.columns))
]
return vif
checking_vif(x_train1)
output:
...
...
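As a rule of thumb (a common convention, not something stated in this notebook), a VIF above 5 signals noticeable multicollinearity. We can filter the table to see which features cross that threshold:
# features with VIF above the conventional threshold of 5
# (the constant's VIF is not meaningful and can be ignored)
vif_table = checking_vif(x_train1)
print(vif_table[vif_table["VIF"] > 5])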
Dropping high p-value variables
We will drop the predictor variables having a p-value greater than 0.05 as they do not significantly impact the target variable.
But sometimes p-values change after dropping a variable. So, we'll not drop all variables at once.
Instead, we will do the following:
Build a model, check the p-values of the variables, and drop the column with the highest p-value.
Create a new model without the dropped feature, check the p-values of the variables, and drop the column with the highest p-value.
Repeat the above two steps until there are no columns with p-value > 0.05.
The above process can also be done manually by picking one variable at a time that has a high p-value, dropping it, and building a model again. But that might be a little tedious and using a loop will be more efficient.
# initial list of columns
cols = x_train1.columns.tolist()
# setting an initial max p-value
max_p_value = 1
while len(cols) > 0:
# defining the train set
    x_train_aux = x_train1[cols]
# fitting the model
model = sm.OLS(y_train, x_train_aux).fit()
# getting the p-values and the maximum p-value
p_values = model.pvalues
max_p_value = max(p_values)
# name of the variable with maximum p-value
feature_with_p_max = p_values.idxmax()
if max_p_value > 0.05:
cols.remove(feature_with_p_max)
else:
break
selected_features = cols
print(selected_features)
output:
['const', 'screen_size', 'main_camera_mp', 'selfie_camera_mp', 'ram', 'battery', 'weight', 'release_year', 'new_price_log', 'brand_name_Lenovo', 'brand_name_Nokia', 'brand_name_Xiaomi', 'os_Others', '4g_yes']
x_train2 = x_train1[selected_features]
x_test2 = x_test1[selected_features]
# fitting OLS model on the selected features
olsmodel2 = sm.OLS(y_train, x_train2).fit()
print(olsmodel2.summary())
output:
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
olsmodel2_train_perf = model_performance_regression(olsmodel2, x_train2, y_train)
olsmodel2_train_perf
# checking model performance on test set (unseen 30% data)
print("Test Performance\n")
olsmodel2_test_perf = model_performance_regression(olsmodel2, x_test2, y_test)
olsmodel2_test_perf
output:
Now we'll check the rest of the assumptions on olsmodel2.
Linearity of variables
Independence of error terms
Normality of error terms
No Heteroscedasticity
TEST FOR LINEARITY AND INDEPENDENCE
We will test for linearity and independence by making a plot of fitted values vs residuals and checking for patterns.
# let us create a dataframe with actual, fitted and residual values
df_pred = pd.DataFrame()
df_pred["Actual Values"] = y_train ## Complete the code to store the actual values
df_pred["Fitted Values"] = olsmodel2.fittedvalues # predicted values
df_pred["Residuals"] = olsmodel2.resid # residuals
df_pred.head()
output:
# let's plot the fitted values vs residuals
sns.residplot(
data=df_pred, x="Fitted Values", y="Residuals", color="purple", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()
output:
TEST FOR NORMALITY
We will test for normality by checking the distribution of residuals, by checking the Q-Q plot of residuals, and by using the Shapiro-Wilk test.
sns.histplot(data=df_pred, x="Residuals", bins="auto")  # distribution of residuals
plt.title("Normality of residuals")
plt.show()
output:
# Q-Q plot of residuals
import pylab
import scipy.stats as stats

stats.probplot(df_pred["Residuals"], dist="norm", plot=pylab)
plt.show()
# Shapiro-Wilk test: the null hypothesis is that the residuals are normally
# distributed; a p-value below 0.05 would reject normality
stats.shapiro(df_pred["Residuals"])
TEST FOR HOMOSCEDASTICITY
We will test for homoscedasticity by using the Goldfeld-Quandt test.
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(df_pred["Residuals"], olsmodel2.model.exog)
lzip(name, test)
output:
[('F statistic', 1.0438035947010265), ('p-value', 0.22944475832466343)]
Since the p-value (0.2294) is greater than 0.05, we fail to reject the null hypothesis of homoscedasticity, so the residuals can be considered homoscedastic.
Final Model Summary
# fitting the final model on the selected features
olsmodel_final = sm.OLS(y_train, x_train2).fit()
print(olsmodel_final.summary())
output:
# checking model performance on train set (seen 70% data)
print("Training Performance\n")
olsmodel_final_train_perf = model_performance_regression(olsmodel_final, x_train2, y_train)
olsmodel_final_train_perf
output:
# checking model performance on test set (unseen 30% data)
print("Test Performance\n")
olsmodel_final_test_perf = model_performance_regression(olsmodel_final, x_test2, y_test)
olsmodel_final_test_perf
output:
Actionable Insights and Recommendations
We can see from the regression model that all the retained predictors are statistically significant, although most coefficients are close to 0, suggesting only a weak individual relationship with the target. The overall model achieves an R² of 0.82, with a train MAE of 16.733 and a test MAE of 16.47; the similar train and test errors indicate that the model is not overfitting.
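As an illustrative sketch (not part of the original analysis) of how ReCell could use the final model to price a single device, we can predict on one encoded row from the test set and back-transform from the log scale:
# predict the used price for one device
sample_device = x_test2.iloc[[0]]  # one encoded device, constant included
predicted_price = np.exp(olsmodel_final.predict(sample_device))  # undo the log transform
print(f"Predicted used price: {predicted_price.iloc[0]:.2f}")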