Introduction
For this analysis, we used the Yelp dataset.
Yelp is a crowd-sourced local business review and social networking site. It hosts pages for individual businesses, such as restaurants or schools, where Yelp users can review their products or services on a 1-to-5-star scale. These reviews and ratings help other Yelp users evaluate a business or service and make a choice.
While reviews add detail, it is not feasible to read every single review to find out whether people liked a service. Our aim is to extract meaningful information from the data in a time-saving and cost-effective way: we will determine whether each review is positive or negative in order to understand how businesses are perceived.
Why focus on the reviews when we have the ratings? It is true that users generally go by the rating other users give a service. But ratings do not explain what makes one business better than another; it is the reviews that answer questions like 'How is the ambience of the restaurant?', 'What is the best dish this restaurant serves?', or 'How is the customer care of this service?'. To sum up, ratings tell us little beyond the popularity of a business.
Another thing to consider is that ratings can be faked, since it hardly takes any time to rate a service. Reviews, on the other hand, can be assumed to be a more genuine source of information, because writing one takes some effort from the end user, which lends it more credibility.
Since reviews are the more trustworthy source of information, they play a vital role in determining the popularity of a business; when searching for a service, customers tend to look for reviews. A business should therefore be aware of what kind of reviews (positive or negative) are being written about it. To extract this sentiment from the reviews, we built a BERT model to perform sentiment analysis on Yelp reviews. We chose BERT because it is bidirectional and reads the entire sequence of words at once, which lets the model learn the context of a word from all of its surroundings (left and right of the word). The model assesses a review and tells us whether it is positive or negative. The two predicted classes can then be easily visualised, helping us understand users' sentiment at a glance.
BERT model:
BERT, which stands for Bidirectional Encoder Representations from Transformers, was proposed by researchers at Google AI Language in 2018. Although its original aim was to improve the understanding of queries in Google Search, BERT has become one of the most important architectures for natural language tasks, producing state-of-the-art results on sentence-pair classification, question answering, and more.
How BERT works
BERT makes use of the Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, the Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT's goal is to generate a language model, only the encoder mechanism is necessary. The detailed workings of the Transformer are described in Google's paper 'Attention Is All You Need'.
As opposed to directional models, which read the text input sequentially (left-to-right or right-to-left), the Transformer encoder reads the entire sequence of words at once. Therefore, it is considered bidirectional, though it would be more accurate to say that it’s non-directional. This characteristic allows the model to learn the context of a word based on all of its surroundings (left and right of the word).
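To make this concrete, here is a minimal sketch (using the Hugging Face transformers library that we also use later; the sentences are only illustrative) showing that BERT produces a different vector for the same word depending on its context:

from transformers import BertTokenizer, TFBertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = TFBertModel.from_pretrained('bert-base-uncased')
# the word "cold" appears in two different contexts
enc = tokenizer(["The soup was cold and bland.", "The staff were cold and rude."],
                padding=True, return_tensors='tf')
out = model(enc)
# last_hidden_state has shape (2, sequence_length, 768); the vector at the position of
# "cold" differs between the two sentences because BERT encodes the full surrounding context
print(out.last_hidden_state.shape)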
BERT is proposed in two versions:
· BERT (BASE): 12 layers of encoder stack with 12 bidirectional self-attention heads and 768 hidden units.
· BERT (LARGE): 24 layers of encoder stack with 16 bidirectional self-attention heads and 1024 hidden units.
For the TensorFlow implementation, Google has released both BERT BASE and BERT LARGE in two variants: Uncased and Cased. In the uncased version, the text is lowercased before WordPiece tokenization.
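As a quick illustration of the difference, the uncased tokenizer lowercases text before splitting it into WordPiece tokens, while the cased tokenizer keeps the original casing (a small sketch, assuming the same transformers library used below):

from transformers import BertTokenizer
uncased = BertTokenizer.from_pretrained('bert-base-uncased')
cased = BertTokenizer.from_pretrained('bert-base-cased')
# the uncased tokenizer lowercases (and strips accents) first, so the two
# vocabularies and token sequences differ for the same input text
print(uncased.tokenize("The Food Was Great"))
print(cased.tokenize("The Food Was Great"))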
The steps to build the model follow the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology.
Data Collection:
Dataset used: a lite version of the Yelp reviews dataset.
The dataset was collected from Kaggle. It is already labelled: reviews are divided into two classes, 1 (negative) and 2 (positive). The original data has 560,000 instances. Since training on the full dataset could take anywhere from a few hours to a few days, we use only 2,000 instances with the same class distribution as the full dataset in order to reduce training time.
Class distribution: 1 (negative): 1,000 instances; 2 (positive): 1,000 instances.
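One way such a balanced subset could be drawn from the full file is stratified sampling with pandas. This is only a sketch (the file name, per-class sample size, and random_state are illustrative, and groupby().sample() needs a reasonably recent pandas), not the exact code used below:

import pandas as pd
full = pd.read_csv('raw_train.csv', header=None, names=['label', 'text'])  # full 560,000-row file
# draw the same number of reviews per class so the subset keeps the 50/50 balance
subset = full.groupby('label').sample(n=1000, random_state=42)
subset = subset.sample(frac=1, random_state=42).reset_index(drop=True)     # shuffle the rows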
Implementation
# Install the Hugging Face transformers library
!pip install transformers
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import transformers
from transformers import AutoModel, BertTokenizerFast
# specify GPU
device = torch.device("cuda")
Mount Drive to read data
from google.colab import drive
drive.mount('/content/drive')
Read Data
loc = '/content/drive/My Drive/yelp/raw_train.csv'
import pandas as pd
d = pd.read_csv(loc, header=None)   # the raw CSV has no header row
d.head()
Output: the first five rows of the raw dataframe (two unnamed columns holding the star label and the review text).
Assign Label To Dataset
d.columns = ['label', 'text']          # name the two columns of the raw CSV
d = d.iloc[:1000, :]                   # keep the first 1,000 reviews to speed up training
d.to_csv('/content/drive/My Drive/yelp/dummy.csv')
df = pd.read_csv('/content/drive/My Drive/yelp/dummy.csv')
df.drop(['Unnamed: 0'], axis=1, inplace=True)   # drop the index column written by to_csv
df.dropna(inplace=True)
df.head()
Output: the first five rows of df with the columns label and text.
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 label 1000 non-null int64
1 text 1000 non-null object
dtypes: int64(1), object(1)
memory usage: 23.4+ KB
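Before going further, it can be worth confirming the class balance of the subset we just created (a small optional check, not part of the original notebook output):

print(df['label'].value_counts())   # counts of class 1 (negative) and class 2 (positive)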
import tensorflow as tf
import tensorflow_hub as hub
import pandas as pd
from sklearn.model_selection import train_test_split
import numpy as np
import re
import unicodedata
import nltk
from nltk.corpus import stopwords
import keras
from tqdm import tqdm
import pickle
from keras.models import Model
import keras.backend as K
from sklearn.metrics import confusion_matrix,f1_score,classification_report
import matplotlib.pyplot as plt
from keras.callbacks import ModelCheckpoint
import itertools
from keras.models import load_model
from sklearn.utils import shuffle
from transformers import *
from transformers import BertTokenizer, TFBertModel, BertConfig
def unicode_to_ascii(s):
    # strip accents by dropping combining (Mn) characters after NFD normalisation
    return ''.join(c for c in unicodedata.normalize('NFD', s) if unicodedata.category(c) != 'Mn')

def clean_stopwords_shortwords(w):
    # remove English stopwords and words shorter than three characters
    stopwords_list = stopwords.words('english')
    words = w.split()
    clean_words = [word for word in words if (word not in stopwords_list) and len(word) > 2]
    return " ".join(clean_words)

def preprocess_sentence(w):
    w = unicode_to_ascii(w.lower().strip())
    w = re.sub(r"([?.!,¿])", r" ", w)          # replace punctuation with spaces
    w = re.sub(r'[" "]+', " ", w)              # collapse repeated spaces
    w = re.sub(r"[^a-zA-Z?.!,¿]+", " ", w)     # keep letters only
    w = clean_stopwords_shortwords(w)
    w = re.sub(r'@\w+', '', w)                 # drop @mentions
    return w
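These helpers are defined for cleaning the raw reviews. A hedged example of how they could be applied to the text column (note that stopwords.words('english') requires the NLTK stopwords corpus to be downloaded first):

nltk.download('stopwords')                                     # needed once for stopwords.words('english')
df['text'] = df['text'].astype(str).apply(preprocess_sentence)
df.head()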
from transformers import *
from transformers import BertTokenizer, TFBertModel, BertConfig
data = df
num_classes = len(data.label.unique())   # two classes: positive and negative
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_classes)
Output:
All model checkpoint layers were used when initializing TFBertForSequenceClassification.
Some layers of TFBertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
data['gt'] = data['label'].map({2: 0, 1: 1})   # remap labels: 2 (positive) -> 0, 1 (negative) -> 1
data.head()
Output: the first five rows of data with the new gt column (2 mapped to 0, 1 mapped to 1).
sentences=data['text']
labels=data['gt']
len(sentences),len(labels)
Output:
(1000, 1000)
input_ids=[]
attention_masks=[]
for sent in sentences:
    # encode each review: add [CLS]/[SEP], pad to 64 tokens, and build the attention mask
    bert_inp = bert_tokenizer.encode_plus(sent, add_special_tokens=True, max_length=64,
                                          pad_to_max_length=True, return_attention_mask=True)
    input_ids.append(bert_inp['input_ids'])
    attention_masks.append(bert_inp['attention_mask'])
input_ids = np.asarray(input_ids)
attention_masks = np.array(attention_masks)
labels = np.array(labels)
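As a side note, pad_to_max_length is deprecated in recent versions of transformers; an equivalent batch version of the same encoding step (a sketch, assuming a 4.x tokenizer) would be:

# encode all reviews in one call; pad/truncate every sequence to 64 tokens
enc = bert_tokenizer(list(sentences), add_special_tokens=True, max_length=64,
                     padding='max_length', truncation=True, return_attention_mask=True)
input_ids = np.asarray(enc['input_ids'])
attention_masks = np.asarray(enc['attention_mask'])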
print('Preparing the pickle file.....')
pickle_inp_path='/content/drive/My Drive/yelp/bert_inp.pkl'
pickle_mask_path='/content/drive/My Drive/yelp/bert_mask.pkl'
pickle_label_path='/content/drive/My Drive/yelp/bert_label.pkl'
pickle.dump((input_ids),open(pickle_inp_path,'wb'))
pickle.dump((attention_masks),open(pickle_mask_path,'wb'))
pickle.dump((labels),open(pickle_label_path,'wb'))
print('Pickle files saved as ',pickle_inp_path,pickle_mask_path,pickle_label_path)
Output:
Preparing the pickle file.....
Pickle files saved as /content/drive/My Drive/yelp/bert_inp.pkl /content/drive/My Drive/yelp/bert_mask.pkl /content/drive/My Drive/yelp/bert_label.pkl
print('Loading the saved pickle files..')
input_ids=pickle.load(open(pickle_inp_path, 'rb'))
attention_masks=pickle.load(open(pickle_mask_path, 'rb'))
labels=pickle.load(open(pickle_label_path, 'rb'))
print('Input shape {} Attention mask shape {} Input label shape {}'.format(input_ids.shape,attention_masks.shape,labels.shape))
Output:
Loading the saved pickle files..
Input shape (1000, 64) Attention mask shape (1000, 64) Input label shape (1000,)
Split Dataset
train_inp,val_inp,train_label,val_label,train_mask,val_mask=train_test_split(input_ids,labels,attention_masks,test_size=0.2)
print('Train inp shape {} Val input shape {}\nTrain label shape {} Val label shape {}\nTrain attention mask shape {} Val attention mask shape {}'.format(train_inp.shape,val_inp.shape,train_label.shape,val_label.shape,train_mask.shape,val_mask.shape))
Output:
Train inp shape (800, 64) Val input shape (200, 64)
Train label shape (800,) Val label shape (200,)
Train attention mask shape (800, 64) Val attention mask shape (200, 64)
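Because the subset is small, it may also be worth passing stratify=labels so that the 80/20 split keeps the same positive/negative ratio in both halves (a hedged variant of the call above, with an illustrative random_state):

train_inp, val_inp, train_label, val_label, train_mask, val_mask = train_test_split(
    input_ids, labels, attention_masks, test_size=0.2, stratify=labels, random_state=42)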
log_dir='/content/drive/My Drive/yelp/'
model_save_path='/content/drive/My Drive/yelp/bert_model.h5'
callbacks = [
    tf.keras.callbacks.ModelCheckpoint(filepath=model_save_path, save_weights_only=True,
                                       monitor='val_loss', mode='min', save_best_only=True),
    keras.callbacks.TensorBoard(log_dir=log_dir)
]
print('\nBert Model',bert_model.summary())
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5,epsilon=1e-08)
bert_model.compile(loss=loss,optimizer=optimizer,metrics=[metric])
Output:
Model: "tf_bert_for_sequence_classification"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
bert (TFBertMainLayer) multiple 109482240
_________________________________________________________________
dropout_37 (Dropout) multiple 0
_________________________________________________________________
classifier (Dense) multiple 1538
=================================================================
Total params: 109,483,778
Trainable params: 109,483,778
Non-trainable params: 0
_________________________________________________________________
Bert Model None
Train the Model
history=bert_model.fit([train_inp,train_mask],train_label,batch_size=32,epochs=4,validation_data=([val_inp,val_mask],val_label),callbacks=callbacks)
Output:
Epoch 1/4
WARNING:tensorflow:The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
WARNING:tensorflow:AutoGraph could not transform <bound method Socket.send of <zmq.sugar.socket.Socket object at 0x7fd940733ec0>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:AutoGraph could not transform <function wrap at 0x7fd95bfe4c20> and will run it as-is.
Cause: while/else statement not yet supported
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert
WARNING:tensorflow:The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.
WARNING:tensorflow:The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
WARNING:tensorflow:The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.
25/25 [==============================] - ETA: 0s - loss: 0.6849 - accuracy: 0.5539
WARNING:tensorflow:The parameters `output_attentions`, `output_hidden_states` and `use_cache` cannot be updated when calling a model.They have to be set to True/False in the config object (i.e.: `config=XConfig.from_pretrained('name', output_attentions=True)`).
WARNING:tensorflow:The parameter `return_dict` cannot be set in graph mode and will always be set to `True`.
25/25 [==============================] - 69s 928ms/step - loss: 0.6836 - accuracy: 0.5566 - val_loss: 0.5149 - val_accuracy: 0.8600
Epoch 2/4
25/25 [==============================] - 20s 815ms/step - loss: 0.4722 - accuracy: 0.8197 - val_loss: 0.3463 - val_accuracy: 0.8500
Epoch 3/4
25/25 [==============================] - 21s 819ms/step - loss: 0.3136 - accuracy: 0.8731 - val_loss: 0.3807 - val_accuracy: 0.8550
Epoch 4/4
25/25 [==============================] - 20s 820ms/step - loss: 0.1678 - accuracy: 0.9419 - val_loss: 0.3309 - val_accuracy: 0.8550
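Once training finishes, the best weights saved by the ModelCheckpoint callback can be loaded back and used to score the validation split with the classification_report imported earlier. This is a hedged sketch rather than part of the original notebook; TFBertForSequenceClassification returns logits, so np.argmax gives the predicted class:

# reload the best checkpoint into a fresh model and predict on the validation split
trained_model = TFBertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=num_classes)
trained_model.load_weights(model_save_path)
out = trained_model(val_inp, attention_mask=val_mask, training=False)
pred_labels = np.argmax(out.logits, axis=1)
# the gt mapping was {2: 0, 1: 1}, so class 0 is positive and class 1 is negative
print(classification_report(val_label, pred_labels, target_names=['positive', 'negative']))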
Contact us to get help with any other assignment related to Sentiment Analysis.
Send your requirement details to realcode4you@gmail.com and get instant help at an affordable price.