Description
We want to evaluate the performance of millions of products and show the most relevant ones to our users. Please assign each product in [product_scoring.csv](product_scoring.csv) a `score` that you would suggest we sort by. The product with the highest score is the most relevant product for our users, so it would end up at the top of the page.
Here is an explanation of the fields:
- `product_id`: The id of the product
- `category_id`: The category that the product belongs to
- `clicks`: The number of times this product was clicked
- `impressions`: The number of times this product was shown to users
- `cpc`: The amount of money we get per click (cost per click)
In addition to providing us with a score per product, please also attach the code that you used and explain the algorithm that you used to calculate this score (this is the most important part of this task).
Also, explain current shortcomings and ideas for possible future improvements!
Please note that the score calculation in this task does not require you to think about any sales conversion rates.
Implementation
# Read data
import pandas as pd
df = pd.read_csv('/content/drive/My Drive/ProductScoring/product_scoring.csv')
df.head()
# Fill NaN values with 0 in every column
for col in df.columns:
    df[col] = df[col].fillna(0)
#Check null values
df.isna().any()
#Check Features Dtype
df.info()
#Adding Score column
score = df.cpc + df.clicks + df.impressions
df['score'] = score
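To make the scoring step concrete, here is a small sketch on made-up rows. The `products` list and its numbers are illustrative assumptions, not values from product_scoring.csv. It applies the same additive score as above and then the divide-by-maximum normalization used in the implementation.

```python
# Toy rows: (product_id, clicks, impressions, cpc). All values are made up.
products = [
    ("a", 10, 1000, 0.50),
    ("b", 50, 400, 0.20),
    ("c", 0, 5000, 1.00),
]

# Additive score as in the implementation: cpc + clicks + impressions.
raw = {pid: cpc + clicks + impressions for pid, clicks, impressions, cpc in products}

# Divide by the largest raw score so that all scores fall in [0, 1].
top = max(raw.values())
scores = {pid: value / top for pid, value in raw.items()}

# Sort descending: the highest score goes to the top of the page.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # ['c', 'a', 'b']
```

In this toy data, product `c` ranks first even though it has no clicks, because impressions dominate the sum. That is worth keeping in mind when discussing the shortcomings of an additive score.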
#install supporting libraries 'scikit-surprise'
%pip install scikit-surprise
from surprise import Reader,Dataset
from surprise.model_selection import cross_validate
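# Normalize the score to the range [0, 1]
df['score'] = df['score'] / df['score'].max()
data = df[['product_id', 'category_id', 'score']]
# The normalized score lies in [0, 1], so the rating scale is set to match
reader = Reader(rating_scale=(0, 1))
data = Dataset.load_from_df(data[['product_id', 'category_id', 'score']], reader)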
from surprise.model_selection import train_test_split
trainset, testset = train_test_split(data, test_size=0.2)
trainsetfull = data.build_full_trainset()
print('Number of products: ', trainset.n_users, '\n')
print('Number of categories: ', trainset.n_items, '\n')
output:
Number of products: 8000
Number of categories: 168
trainset_iids = list(trainset.all_items())
iid_converter = lambda x: trainset.to_raw_iid(x)
trainset_raw_iids = list(map(iid_converter, trainset_iids))
my_sim_option = {'name':'MSD', 'user_based':False, 'min_support' : 1}
The actual number of neighbors aggregated to compute an estimate is necessarily less than or equal to k. First, there might not be enough neighbors; second, the sets N_i^k(u) and N_u^k(i) only include neighbors for which the similarity measure is positive. In our case, u stands for a product and i for a category.
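The effect of k, min_k, and the positive-similarity rule can be sketched in plain Python. This is a simplified illustration of a mean-centered k-NN estimate, not Surprise's actual code; the function name `knn_with_means_estimate` and the example triples are hypothetical.

```python
import heapq

def knn_with_means_estimate(target_mean, neighbors, k=15, min_k=5):
    """Estimate a rating from (similarity, rating, neighbor_mean) triples.

    Simplified sketch: take the k most similar neighbors, keep only those
    with positive similarity, and fall back to the target's own mean when
    fewer than min_k usable neighbors remain.
    Returns (estimate, number_of_neighbors_used).
    """
    nearest = heapq.nlargest(k, neighbors, key=lambda t: t[0])
    used = [(sim, rating, mean) for sim, rating, mean in nearest if sim > 0]
    if not used or len(used) < min_k:
        return target_mean, len(used)
    sim_total = sum(sim for sim, _, _ in used)
    weighted = sum(sim * (rating - mean) for sim, rating, mean in used)
    return target_mean + weighted / sim_total, len(used)

# Four candidate neighbors, two of which have non-positive similarity.
example = [(0.9, 4.0, 3.0), (0.5, 2.0, 3.0), (0.0, 5.0, 3.0), (-0.2, 1.0, 3.0)]

est, actual_k = knn_with_means_estimate(3.5, example, k=3, min_k=1)
fallback, _ = knn_with_means_estimate(3.5, example, k=3, min_k=5)
```

With k = 3, only two neighbors are actually used because the third has zero similarity. With min_k = 5, two neighbors are too few, so the estimate falls back to the target's mean.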
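# KNNWithMeans model
from surprise import KNNWithMeans
my_k = 15
my_min_k = 5
my_sim_option = {
    'name': 'pearson', 'user_based': False,
}
# Note: the keyword is `sim_options` (plural). The original call passed
# `sim_option`, which is not a recognized argument and is ignored, so the
# recorded run used the default MSD similarity (see the output below).
algo = KNNWithMeans(
    k=my_k, min_k=my_min_k, sim_options=my_sim_option
)
algo.fit(trainset)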
output:
Computing the msd similarity matrix...
Done computing similarity matrix.
#similarity matrix
algo.sim
Output:
array([[1., 0., 0., ..., 0., 0., 0.],
[0., 1., 0., ..., 0., 0., 0.],
[0., 0., 1., ..., 1., 1., 0.],
...,
[0., 0., 1., ..., 1., 1., 0.],
[0., 0., 1., ..., 1., 1., 0.],
[0., 0., 0., ..., 0., 0., 1.]])
#model evaluation
from surprise import accuracy
predictions = algo.test(testset)
accuracy.rmse(predictions)
output:
RMSE: 0.0334
0.033383282733501995
#predictions on test
predictions
output:
[Prediction(uid=3.4920901088886313e+18, iid=7234842626579207420, r_ui=3.2977212583575376e-05, est=0.0033303863201705744, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
Prediction(uid=5.097274081276356e+18, iid=4604948445965754860, r_ui=0.017131034811257308, est=0.0033303863201705744, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
Prediction(uid=7.417206446595283e+18, iid=6909670041687169853, r_ui=4.7959784639558994e-05, est=0.0033303863201705744, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
Prediction(uid=4.901531798377119e+18, iid=5165268297756540484, r_ui=0.007354337161598853, est=0.0033303863201705744, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
Prediction(uid=7.410764198364191e+18, iid=7234842626579207420, r_ui=3.296180013758429e-05, est=0.0033303863201705744, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
Prediction(uid=7.138567804384278e+18, iid=7234842626579207420, r_ui=0.0018066604999361158, est=0.0033303863201705744, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
Prediction(uid=7.374568628140124e+18, iid=5165268297756540484, r_ui=4.917518362889132e-05, est=0.0033303863201705744, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
Prediction(uid=7.331135881170135e+18, iid=7016191804580997334, r_ui=0.007452269933233676, est=0.0033303863201705744, details={'was_impossible': True, 'reason': 'User and/or item is unknown.'}),
...
...
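Every prediction listed above is flagged `was_impossible` and carries the same `est` value. One plausible explanation, assuming each `product_id` occurs in only one row of the data, is that a product placed in the test set never appears in the training set, so the model cannot use neighbors and falls back to a default estimate (in Surprise, the global mean of the training set). The sketch below reproduces that situation with made-up rows and plain Python; it does not use the real data.

```python
import random

# Toy data: one row per product, as (product_id, category_id, score).
# All values are made up for illustration.
rows = [(pid, pid % 5, (pid % 7) / 10) for pid in range(100)]

# Random 80/20 split, analogous to train_test_split(data, test_size=0.2).
random.seed(0)
random.shuffle(rows)
split = int(len(rows) * 0.8)
train, test = rows[:split], rows[split:]

# Share of test products that were never seen during training.
train_products = {pid for pid, _, _ in train}
unseen = [pid for pid, _, _ in test if pid not in train_products]
share_unseen = len(unseen) / len(test)

# Default estimate for unknown products: the mean score of the training rows.
global_mean = sum(score for _, _, score in train) / len(train)

print(share_unseen)  # 1.0: every test product is unknown to the model
```

Under this assumption, the reported RMSE measures how well a single constant fits the test scores rather than the quality of the neighbor model, which is a shortcoming worth stating explicitly.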