
NLP Project Help, Assignment Help and Homework Help | Hire NLP Expert | Practice Exercises

Updated: Nov 11, 2022

NLP (Natural Language Processing)

In this blog we will check our understanding of the concepts learned in NLP.


To summarize, NLP or Natural Language Processing is:

  • Computer manipulation of natural languages.

  • A set of methods/algorithms for making natural language accessible to computers.

The image below summarizes the basic steps involved in any NLP task:



There are 5 exercises in total and an optional exercise. To answer some of the exercises (3, 4 and 5) you will be required to write code from scratch in the code cells containing:


# write your code here


Before starting, make sure you are using a GPU.

!nvidia-smi
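If you prefer to confirm GPU availability from Python rather than from the shell, the short check below is one option (a minimal sketch; it only assumes TensorFlow is installed, which the rest of this notebook needs anyway):


import tensorflow as tf

# List the GPUs visible to TensorFlow; an empty list means no GPU is available
gpus = tf.config.list_physical_devices('GPU')
print("GPUs visible to TensorFlow:", gpus)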

Tokenization

We will use the TensorFlow Keras Tokenizer to tokenize our text. As per the TensorFlow documentation: “This class allows to vectorize a text corpus, by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or into a vector where the coefficient for each token could be binary, based on word count, based on tf-idf.” There are many functions that we can use, but below we will be using these two functions to train the tokenizer on our text data and convert given text to tokens:

  • fit_on_texts: Updates the internal vocabulary based on a list of texts. This method creates the vocabulary index based on word frequency. For example, given "The cat sat on the mat.", it creates a dictionary word_index in which every word gets a unique integer value; 0 is reserved for padding, and a lower integer means a more frequent word.

  • texts_to_sequences: Transforms each text in texts to a sequence of integers. It takes each word in the text and replaces it with its corresponding integer value from the word_index dictionary.


from tensorflow.keras.preprocessing.text import Tokenizer
t = Tokenizer()
fit_text = ["In science, you can say things that seem crazy, but in the long run, they can turn out to be right"]
t.fit_on_texts(fit_text)  # build the vocabulary (word_index) from this text

test_text1 = "I would like to take a right turn"
test_text2 = "That man is crazy"
sequences = t.texts_to_sequences([test_text1, test_text2])  # convert the test sentences to integer sequences

print('sequences : ',sequences,'\n')

print('word_index : ',t.word_index)

Exercise 1: In the code above we tokenize two sentences:

  • "I would like to take a right turn"

  • "That man is crazy"

a. What is the tokenized version of these sentences?

b. The first sentence has 8 words and the second sentence has 4 words; however, the tokenized versions have only 3 and 2 integers respectively. Why is that so?


Answer 1: Write Your Answer Here



Embeddings


Exercise 2: In the class we learned about embeddings; let us explore them a little more. Kindly go to the Embeddings Projector site. Play around a bit and answer the following questions:

  1. For the word 'fantastic', list the five nearest neighbours when using the Word2Vec 10K embedding.

  2. Repeat the exercise by changing the embeddings to Word2Vec All.

Reflect on the results. How do you think the word 'fantastic' is related to its five nearest neighbours?


Answer 2: Write Your answer Here



Word Similarities

Let us now train a Word2Vec model on the text8 dataset.


!mkdir data

import gensim.downloader as api
from gensim.models import Word2Vec

info = api.info("text8")
assert(len(info) > 0)

dataset = api.load("text8")  # download and load text 8  dataset
model = Word2Vec(dataset) # we create an embedding using Word2vec model for this data
model.save("data/text8-word2vec.bin")
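The call above uses Gensim's default hyperparameters. Purely as an illustration (the rest of this notebook keeps using the model saved above), the same training with a few common knobs spelled out might look like the sketch below; the parameter names assume Gensim 4.x, where the old size argument was renamed vector_size:


# Illustrative only: explicit hyperparameters for Word2Vec training (Gensim 4.x names)
# vector_size: embedding dimension, window: context size, min_count: ignore rarer words
model_custom = Word2Vec(dataset, vector_size=100, window=5, min_count=5, workers=4)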

Load the saved model as KeyedVectors to save memory.


from gensim.models import KeyedVectors
model = KeyedVectors.load("data/text8-word2vec.bin")  # Helps save memory by shedding the internal data structures necessary for training
word_vectors = model.wv   # gives the word vectors

Helper function to print the most similar words:


def print_most_similar(word_conf_pairs, k):
    for i, (word, conf) in enumerate(word_conf_pairs):
        print("{:.3f} {:s}".format(conf, word))
        if i >= k-1:
            break
    if k < len(word_conf_pairs):
        print("...")
print_most_similar(word_vectors.most_similar('king'),5)
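Besides ranking neighbours with most_similar, KeyedVectors can also score a single pair of words directly. A minimal sketch, assuming the word_vectors object loaded above (the word pairs here are just illustrative):


# Cosine similarity between two words, in [-1, 1]; related words score higher
print(word_vectors.similarity('king', 'queen'))
print(word_vectors.similarity('king', 'banana'))  # expected to be noticeably lower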


Exercise 3: In the class, we learned how to use Word2Vec embeddings in Gensim. Using the model trained on the 'text8' dataset above, give the five most similar words to the word 'tree'.


## Write your code here


Answer 3: Write your Answer Here



Word Arithmetics

Exercise 4: With the Word2Vec model trained on the text8 dataset, calculate the following (a sketch of the relevant Gensim call pattern follows the list):

  • woman + king - man = ?

  • chair + table - work = ?

  • Queens - queen + person = ?
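For reference, the Gensim call pattern for this kind of word arithmetic is sketched below with an unrelated example (positive words are added, negative words are subtracted); plug in the words from each bullet above to answer the exercise. It assumes the word_vectors object and the print_most_similar helper defined earlier:


# Illustrative word arithmetic: paris - france + italy ~ ?
result = word_vectors.most_similar(positive=['paris', 'italy'], negative=['france'], topn=5)
print_most_similar(result, 5)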


## Write your code here


Answer 4: Write Your answer Here



Spam Classifier

Some helper code:

  • Importing required modules

  • Defining helper functions

  • Building the model

The code cells below are hidden; by default you cannot see the code in them, but remember to run them. You can view the code by double-clicking the cells.


#@title
# The modules needed to run the code
import argparse  # To read commandline argument and parse it
import gensim.downloader as api
import numpy as np
import os  # For file and directory handling
import shutil  # For file and directory handling
import tensorflow as tf

from sklearn.metrics import accuracy_score, confusion_matrix  #For measuring performance

# Some parameters
DATA_DIR = "data"   # Data directory to save embedding
EMBEDDING_NUMPY_FILE = os.path.join(DATA_DIR, "E.npy")  # Numpy file containing word embeddings
DATASET_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"  # Dataset URL from where data is downloaded
EMBEDDING_MODEL = "glove-wiki-gigaword-300"  # The gensim embedding model we will use
EMBEDDING_DIM = 300  # The embedding dimensions
NUM_CLASSES = 2  # The number of classes in output-- Spam or Ham
BATCH_SIZE = 128  # The batch size
NUM_EPOCHS = 3  # number of epochs for which model is to be trained


# data distribution is 4827 ham and 747 spam (total 5574), which
# works out to approx 87% ham and 13% spam; taking the reciprocal of that
# ratio makes each spam (1) item approximately 8 times as important
# as each ham (0) message.
CLASS_WEIGHTS = { 0: 1, 1: 8 }  # To take care of the class imbalance

tf.random.set_seed(42)  # Set the seed for random number generation to be able to reproduce results. 


# Data downloading and data Processing



def download_and_read(url):
    """
    The function downloads the data from the given url and splits it into texts and labels.
    It uses tf.keras.utils.get_file(), which downloads the data from the given url,
    extracts it from the zip file and places it in the folder "datasets"
    with the name specified in the first argument.
    tf.keras.utils.get_file(
    fname, origin, untar=False, md5_hash=None, file_hash=None,
    cache_subdir='datasets', hash_algorithm='auto', extract=False,
    archive_format='auto', cache_dir=None)

    Arguments:
    url: The url link of the dataset in zip format

    Returns:
    Two lists containing texts and respective labels

    """
    local_file = url.split('/')[-1]  # split the file name (last string after '/') from url
    p = tf.keras.utils.get_file(local_file, url, 
        extract=True, cache_dir=".")  #function to download the data from url to folder datasets with name given in local_file
    labels, texts = [], []
    local_file = os.path.join("datasets", "SMSSpamCollection")  # define the path of the file from which to read data: datasets/SMSSpamCollection
    with open(local_file, "r") as fin:
        for line in fin:
            label, text = line.strip().split('\t')  # The labels and text are in one line separated by tab space.
            labels.append(1 if label == "spam" else 0)
            texts.append(text)
    return texts, labels

def build_embedding_matrix(sequences, word2idx, embedding_dim, 
        embedding_file):
    """
    The function reads the dict word2idx (word --> number) and writes the corresponding
    word vector for each word as defined by the embedding model.

    Arguments:
    sequences: not needed and not used; kept only for backward compatibility with the TF1 version of this code
    word2idx: Dictionary containing words in the text and their respective idx as given by the tokenizer.
    embedding_dim: The number of units for the embedding layer
    embedding_file: The data file in which embeddings will be stored for future use.

    """
    if os.path.exists(embedding_file):  # Check if the embedding file already exists; if so, just load it into memory
        E = np.load(embedding_file)
    else:  # Otherwise create the embedding file using the model specified in EMBEDDING_MODEL
        vocab_size = len(word2idx)  # The vocabulary size is the number of unique words in the text
        E = np.zeros((vocab_size, embedding_dim))  # Create an array to hold the embeddings
        word_vectors = api.load(EMBEDDING_MODEL)  # Get the embeddings from Gensim
        for word, idx in word2idx.items():
            try:
                E[idx] = word_vectors.word_vec(word)  # For each word, look up its word vector and store it in the embedding matrix
            except KeyError:   # word not in embedding
                pass
            # except IndexError: # UNKs are mapped to seq over VOCAB_SIZE as well as 1
            #     pass
        np.save(embedding_file, E)  # The embeddings are saved in a file for future reference
    return E
#@title
class SpamClassifierModel(tf.keras.Model):  # The model is built using the subclassing (Model) API of Keras, with tf.keras.Model as the parent class.
# The class inherits the train and predict methods of the parent class.
    def __init__(self, vocab_sz, embed_sz, input_length,
            num_filters, kernel_sz, output_sz, 
            run_mode, embedding_weights, 
            **kwargs):
        super(SpamClassifierModel, self).__init__(**kwargs)
        if run_mode == "scratch":  # Choose the embedding layer scratch means the weights wil be traned from scratch
            self.embedding = tf.keras.layers.Embedding(vocab_sz, 
                embed_sz,
                input_length=input_length,
                trainable=True)
        elif run_mode == "vectorizer":  # Vectorizer means we use the pre-trained weights--> Transfer Learning
            self.embedding = tf.keras.layers.Embedding(vocab_sz, 
                embed_sz,
                input_length=input_length,
                weights=[embedding_weights],
                trainable=False)
        else:  # This is the fine tuning mode- we use pre-trained weights for the embedding layer and fine tune them. 
            self.embedding = tf.keras.layers.Embedding(vocab_sz, 
                embed_sz,
                input_length=input_length,
                weights=[embedding_weights],
                trainable=True)
        self.dropout = tf.keras.layers.SpatialDropout1D(0.2)  # Add a dropout layer to avoid overfitting.
        self.conv = tf.keras.layers.Conv1D(filters=num_filters,  # Define the 1D convolutional layer 
            kernel_size=kernel_sz,
            activation="relu")
        self.pool = tf.keras.layers.GlobalMaxPooling1D()  # The pooling layer
        self.dense = tf.keras.layers.Dense(output_sz, 
            activation="softmax")  # And the last classifying layer consists of a fully connected Dense layer

    def call(self, x):  # This function performs forward pass in the model. 
        x = self.embedding(x)
        x = self.dropout(x)
        x = self.conv(x)
        x = self.pool(x)
        x = self.dense(x)
        return x
#@title
# The code below needs the data folder; -p avoids an error if it was already created above
!mkdir -p data

## Now we will use the functions and model defined above --> ideally they should be done in a separate file-- main.py

# read data
texts, labels = download_and_read(DATASET_URL)

# tokenize and pad text so that each text is of same size
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(texts)
text_sequences = tokenizer.texts_to_sequences(texts)
text_sequences = tf.keras.preprocessing.sequence.pad_sequences(text_sequences)
num_records = len(text_sequences)
max_seqlen = len(text_sequences[0])
#print("{:d} sentences, max length: {:d}".format(num_records, max_seqlen))

# labels --> convert labels to categorical labels (one hot encoded)
cat_labels = tf.keras.utils.to_categorical(labels, num_classes=NUM_CLASSES)

# vocabulary --> Create word mapping and its inverse
word2idx = tokenizer.word_index
idx2word = {v:k for k, v in word2idx.items()}
word2idx["PAD"] = 0
idx2word[0] = "PAD"
vocab_size = len(word2idx)
#print("vocab size: {:d}".format(vocab_size))

# load the dataset as tensors, split it into test, train and validation set
dataset = tf.data.Dataset.from_tensor_slices((text_sequences, cat_labels))
dataset = dataset.shuffle(10000)
test_size = num_records // 4
val_size = (num_records - test_size) // 10
test_dataset = dataset.take(test_size)
val_dataset = dataset.skip(test_size).take(val_size)
train_dataset = dataset.skip(test_size + val_size)

test_dataset = test_dataset.batch(BATCH_SIZE, drop_remainder=True)
val_dataset = val_dataset.batch(BATCH_SIZE, drop_remainder=True)
train_dataset = train_dataset.batch(BATCH_SIZE, drop_remainder=True)

# Build the embedding
E = build_embedding_matrix(text_sequences, word2idx, EMBEDDING_DIM,
    EMBEDDING_NUMPY_FILE)
#print("Embedding matrix:", E.shape)

#Since we are not passing the mode by command line in this file we need to give a value to run_mode
run_mode = 'scratch'
# Now we use the SpamClassifierModel class to create a model
conv_num_filters = 256
conv_kernel_size = 3
model = SpamClassifierModel(
    vocab_size, EMBEDDING_DIM, max_seqlen, 
    conv_num_filters, conv_kernel_size, NUM_CLASSES,
    run_mode, E)
model.build(input_shape=(None, max_seqlen))
model.summary()

Compile and train the model


# Compile the model
model.compile(optimizer="adam", loss="categorical_crossentropy",
    metrics=["accuracy"])

# Now we train the model
model.fit(train_dataset, epochs=NUM_EPOCHS, 
    validation_data=val_dataset,
    class_weight=CLASS_WEIGHTS)

And now we evaluate the model on the test dataset.


# Lastly we evaluate the trained model against the test set
labels, predictions = [], []
for Xtest, Ytest in test_dataset:
    Ytest_ = model.predict_on_batch(Xtest)  # predict labels for each test batch
    ytest = np.argmax(Ytest, axis=1)   # label with the highest probability in the actual test output
    ytest_ = np.argmax(Ytest_, axis=1) # label with the highest probability in the predicted test output
    labels.extend(ytest.tolist())      # add the true labels to the list
    predictions.extend(ytest_.tolist())  # add the predicted labels to the list (ytest_, not ytest)

print("test accuracy: {:.3f}".format(accuracy_score(labels, predictions)))  # Calculate accuracy score

Exercise 5: In the spam classifier, what are the false positive and false negative counts on the test dataset? What do they tell you about the trained model?

# Write your code here


Answer 5: Write Your answer Here




To get the complete solution of the above exercises you can contact us, or if you have any other NLP-related project assignments, you can share them with us as well.


The Realcode4you.com experts team provides complete support for your project. Here you get plagiarism-free work and quality code at a reasonable price.


Send your NLP project requirement details to:


realcode4you@gmail.com

