The multilayer perceptron (MLP) has a wide range of classification and regression applications in many fields, such as pattern recognition and voice and text classification, but the choice of architecture has a great impact on the convergence of these networks. In the present paper we introduce an approach applied to Amazon review data: to solve the obtained model we use a genetic algorithm, and we train it on the Amazon reviews.
Introduction
Here we will analyze positive and negative reviews from an Amazon dataset and evaluate the accuracy on the training and test data.
Data preparation
Data source: http://jmcauley.ucsd.edu/data/amazon/index_2014.html
#importing Libraries
import gzip
import itertools
import numpy as np
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
%matplotlib inline
Reviews into a Pandas DataFrame
Here we will first parse the data set, which is provided in gzip format, using the parse_gz() method, and then convert it into a DataFrame using the convert_to_DataFrame() method.
def parse_gz(file_path):
    g = gzip.open(file_path, 'rb')
    for l in g:
        yield eval(l)

def convert_to_DataFrame(file_path):
    i = 0
    df = {}
    for d in parse_gz(file_path):
        df[i] = d
        i += 1
    return pd.DataFrame.from_dict(df, orient='index')
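Since eval() will execute any Python code found in the file, a slightly safer variant could use ast.literal_eval, which only parses Python literals. This is a sketch of that alternative, not the code used in the rest of this walkthrough:
import ast

def parse_gz_safe(file_path):
    # Same generator as parse_gz above, but ast.literal_eval only accepts
    # Python literals, so it cannot execute arbitrary expressions.
    with gzip.open(file_path, 'rb') as g:
        for line in g:
            yield ast.literal_eval(line.decode('utf-8'))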
Loading Data
We are going to classify Amazon product reviews to determine whether a review is positive or negative. Amazon has different ratings (1 star, 2 stars, etc.), which are given in the overall column; we will use that column to compare against our predictions.
#passing file path or name
sports_data = convert_to_DataFrame('reviews_Sports_and_Outdoors_5.json.gz')
#checking size of dataset in rows
print('Dataset review size: {:,} rows'.format(len(sports_data)))
#Selecting the first three records of the dataset
sports_data[:3]
Output:
sports_data.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 296337 entries, 0 to 296336
Data columns (total 9 columns):
reviewerID 296337 non-null object
asin 296337 non-null object
reviewerName 294935 non-null object
helpful 296337 non-null object
reviewText 296337 non-null object
overall 296337 non-null float64
summary 296337 non-null object
unixReviewTime 296337 non-null int64
reviewTime 296337 non-null object
dtypes: float64(1), int64(1), object(7)
memory usage: 14.7+ MB
sports_data.describe()
Output:
#displaying shape of data
sports_data.shape
Output:
(296337, 9)
Reformat the reviewTime column to datetime from its raw form.
sports_data["reviewTime"] = pd.to_datetime(sports_data["reviewTime"])
Choosing the selected fields
sports_data = sports_data[['asin', 'summary', 'reviewText', 'overall', 'reviewerID', 'reviewerName',
                           'helpful', 'reviewTime', 'unixReviewTime']]
View the top three records
sports_data.head(3)
Output:
View the bottom three records of the DataFrame
sports_data.tail(3)
Output:
Number of Reviews by Unique Products
products = sports_data['overall'].groupby(sports_data['asin']).count()
print("Number of Unique Products in the Sports Category = {}".format(products.count()))
Output:
Number of Unique Products in the Sports Category = 18357
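Before modeling, it can also help to see how the star ratings are distributed; one quick check using the overall column (a small addition, not part of the original listing):
# Count how many reviews fall under each star rating (1.0 - 5.0)
sports_data['overall'].value_counts().sort_index()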
Modeling
sports_data[:3]
Output:
Insert a review_in_float column for sentiment modeling
Here we label reviews with an overall rating of 1-3 as negative and reviews with an overall rating of 4-5 as positive (see the sketch after this list):
Negative reviews: 1-3 Stars = 0
Positive reviews: 4-5 Stars = 1
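The code for this mapping is not shown in the original listing; one way the review_in_float column described above could be created (a minimal sketch, assuming the 1-3 / 4-5 thresholds; note that the train/test split further down still targets the raw overall column):
# review_in_float: 1.0 for positive reviews (4-5 stars), 0.0 for negative reviews (1-3 stars)
sports_data['review_in_float'] = (sports_data['overall'] >= 4).astype(float)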
review_text = sports_data["reviewText"]
Train/Test split over the overall column, which is used as the target for review analysis
Build a sentiment classifier to identify whether a review has positive or negative sentiment. The MLP Classifier model
will use the words in the reviewText column and the ratings in the overall column from the training data to develop a model
that predicts the target (overall).
x_train, x_test, y_train, y_test = train_test_split(sports_data.reviewText, sports_data.overall, random_state=0)
print("x_train shape: {}".format(x_train.shape), end='\n')
print("y_train shape: {}".format(y_train.shape), end='\n\n')
print("x_test shape: {}".format(x_test.shape), end='\n')
print("y_test shape: {}".format(y_test.shape), end='\n\n')
Output:
x_train shape: (222252,)
y_train shape: (222252,)
x_test shape: (74085,)
y_test shape: (74085,)
The text data has to be converted to numeric values, because the model accepts only numeric input.
Here we select only the first 500 records because of memory constraints; if your system has enough memory, you can fit the model on the full train and test data.
data = x_train[:500]
data1 = x_test[:500]
test_y = y_test[:500]
train_y = y_train[:500]
Here we use CountVectorizer() because it converts the data from strings into an integer array format so that it can be fed into the model.
cv = CountVectorizer()
X_traincv = cv.fit_transform(data)
X_testcv = cv.transform(data1)
feature_names1 = cv.get_feature_names()
print("Number of features: {}".format(len(feature_names1)))
Number of features: 5479
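TfidfVectorizer is imported at the top but not used; it could serve as a drop-in alternative to CountVectorizer that down-weights very frequent words. A sketch of that variant (hypothetical, not the vectorizer used in the results below):
# TF-IDF weighting instead of raw counts
tfidf = TfidfVectorizer()
X_train_tfidf = tfidf.fit_transform(data)
X_test_tfidf = tfidf.transform(data1)
print("Number of TF-IDF features: {}".format(len(tfidf.get_feature_names())))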
ML Deep Neural Networks
Now we will fit the training data to the MLP Classifier and use it to predict the score.
from sklearn.preprocessing import StandardScaler
# Training the model
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report,confusion_matrix
mlp = MLPClassifier()
X_traincv
mlp.fit(X_traincv,train_y)
Output:
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
beta_2=0.999, early_stopping=False, epsilon=1e-08,
hidden_layer_sizes=(100,), learning_rate='constant',
learning_rate_init=0.001, max_iter=200, momentum=0.9,
n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
random_state=None, shuffle=True, solver='adam', tol=0.0001,
validation_fraction=0.1, verbose=False, warm_start=False)
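StandardScaler is imported above but never applied. If feature scaling were desired for the sparse count matrix, it would have to be done before mlp.fit and with with_mean=False so the matrix stays sparse; a sketch of that optional variant (not used in this run):
# Scale the sparse count features without centering (centering would densify the matrix)
scaler = StandardScaler(with_mean=False)
X_traincv_scaled = scaler.fit_transform(X_traincv)
X_testcv_scaled = scaler.transform(X_testcv)
# mlp.fit(X_traincv_scaled, train_y) would then replace the fit call above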
# predict the target on the train dataset
pred_train = mlp.predict(X_traincv)
pred_train
Output:
# Accuracy score on train dataset
accur_train = accuracy_score(train_y,pred_train)
print('accuracy_score on train dataset : ', accur_train)
# Predictions and Evaluation
predictions = mlp.predict(X_testcv)
predictions
#confusion matrix to compare predicted values with the true labels
cnf = confusion_matrix(test_y,predictions)
cnf
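matplotlib and itertools are imported at the top but not used; one optional way to visualize this confusion matrix as a heatmap (a sketch, not part of the original notebook; rows are true star ratings and columns are predicted ratings, since overall is the target here):
# Plot the confusion matrix as a heatmap with per-cell counts
labels = np.unique(np.concatenate([np.asarray(test_y), np.asarray(predictions)]))
plt.imshow(cnf, interpolation='nearest', cmap=plt.cm.Blues)
plt.title('Confusion matrix')
plt.colorbar()
plt.xticks(np.arange(len(labels)), labels)
plt.yticks(np.arange(len(labels)), labels)
for i, j in itertools.product(range(cnf.shape[0]), range(cnf.shape[1])):
    plt.text(j, i, format(cnf[i, j], 'd'), ha='center')
plt.ylabel('True rating')
plt.xlabel('Predicted rating')
plt.show()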
Result with score and accuracy
#result with score and accuracy
print(classification_report(test_y,predictions))
Output:
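For comparison with the training accuracy reported earlier, the overall test-set accuracy can also be computed with the already-imported accuracy_score (a small addition, not in the original listing):
# Accuracy on the held-out test sample
accur_test = accuracy_score(test_y, predictions)
print('accuracy_score on test dataset : ', accur_test)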
Comments