Fit method:
With this function, we will find all the unique words in the data and assign a dimension-number to each unique word.
We will create a Python dictionary to save all the unique words, such that the key of the dictionary represents a unique word and the corresponding value represents its dimension-number.
For example, if you have a review, __'very bad pizza'__, then you can represent each unique word with a dimension-number as, dict = { 'very' : 1, 'bad' : 2, 'pizza' : 3}
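As a quick, minimal sketch of this idea (the names review and word_to_dim are just illustrative), you can build such a dictionary with a set and enumerate(); note that the fit() implementation below sorts the unique words and numbers the dimensions starting from 0:
review = 'very bad pizza'
unique_words = sorted(set(review.split(' ')))  # collect the unique words and sort them
word_to_dim = {word: dim for dim, word in enumerate(unique_words)}  # assign a dimension-number to each word
print(word_to_dim)  # {'bad': 0, 'pizza': 1, 'very': 2}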
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
from tqdm import tqdm
import os
Creating Fit Method
# tqdm is a library that helps us visualize the runtime of a for loop.
# Refer to this to know more about tqdm: https://tqdm.github.io/
from tqdm import tqdm

# it accepts only a list of sentences
def fit(dataset):
    unique_words = set()  # at first we will initialize an empty set
    # check if it is of list type or not
    if isinstance(dataset, (list,)):
        for row in dataset:  # for each review in the dataset
            for word in row.split(" "):  # for each word in the review; split() converts a string into a list of words
                if len(word) < 2:
                    continue
                unique_words.add(word)
        unique_words = sorted(list(unique_words))
        vocab = {j: i for i, j in enumerate(unique_words)}
        return vocab
    else:
        print("you need to pass a list of sentences")
vocab = fit(["abc def aaa prq", "lmn pqr aaaaaaa aaa abbb baaa"])
print(vocab)
Output:
{'aaa': 0, 'aaaaaaa': 1, 'abbb': 2, 'abc': 3, 'baaa': 4, 'def': 5, 'lmn': 6, 'pqr': 7, 'prq': 8}
What is a Sparse Matrix?
Before going further into the details of the Transform method, let us understand what a sparse matrix is.
A sparse matrix stores only the non-zero elements, so it occupies less RAM compared to a dense matrix.
For example, assume you have a matrix,
[[1, 0, 0, 0, 0],
[0, 0, 0, 1, 0],
[0, 0, 4, 0, 0]]
from sys import getsizeof
import numpy as np
# we store every element here
a = np.array([[1, 0, 0, 0, 0], [0, 0, 0, 1, 0], [0, 0, 4, 0, 0]])
print(getsizeof(a))
# here we are storing only the non-zero elements, as (row, col, value) tuples
a = [ (0, 0, 1), (1, 3, 1), (2, 2, 4)]
# with this way of storing we save almost 50% of the memory for this example
print(getsizeof(a))
Output:
172
88
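This (row, col, value) idea is essentially what scipy's CSR format does internally: it keeps three small arrays (data, indices and indptr) instead of the full grid. A small sketch for the same matrix (the exact memory numbers will vary with your platform and scipy version):
from scipy.sparse import csr_matrix
import numpy as np

dense = np.array([[1, 0, 0, 0, 0], [0, 0, 0, 1, 0], [0, 0, 4, 0, 0]])
sparse = csr_matrix(dense)  # convert the dense matrix into CSR format
print(sparse.data)     # the non-zero values: [1 1 4]
print(sparse.indices)  # their column indices: [0 3 2]
print(sparse.indptr)   # where each row starts in data/indices: [0 1 2 3]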
How to write a Sparse Matrix?
You can use the csr_matrix() constructor of scipy.sparse to create a sparse matrix.
You need to pass the indices of the non-zero elements into csr_matrix() for creating the sparse matrix.
You also need to pass the element value for each pair of indices.
You can use lists to save the indices of the non-zero elements and their corresponding element values.
For example, assume you have a matrix,
[[1, 0, 0],
[0, 0, 1],
[4, 0, 6]]
Then you can save the indices using a list as, list_of_indices = [(0,0), (1,2), (2,0), (2,2)]
And you can save the corresponding element values as, element_values = [1, 1, 4, 6]
Further, you can refer to the csr_matrix documentation: https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.sparse.csr_matrix.html
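Putting these pieces together, here is a minimal sketch of building that 3x3 matrix with csr_matrix(); the row and column indices are passed as two separate lists along with the element values (the variable names are only illustrative):
from scipy.sparse import csr_matrix

# indices of the non-zero elements, split into row and column lists
row_indices    = [0, 1, 2, 2]
column_indices = [0, 2, 0, 2]
element_values = [1, 1, 4, 6]

sparse_matrix = csr_matrix((element_values, (row_indices, column_indices)), shape=(3, 3))
print(sparse_matrix.toarray())
Output:
[[1 0 0]
 [0 0 1]
 [4 0 6]]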
Transform method:
With this function, we will build the feature matrix as a sparse matrix.
from collections import Counter
from scipy.sparse import csr_matrix
test = 'abc def abc def zzz zzz pqr'
a = dict(Counter(test.split()))
for i,j in a.items():
print(i, j)
Output:
abc 2
def 2
zzz 2
pqr 1
# https://stackoverflow.com/questions/9919604/efficiently-calculate-word-frequency-in-a-string
# https://docs.scipy.org/doc/scipy-0.19.0/reference/generated/scipy.sparse.csr_matrix.html
# note that we need to send the preprocessed text here; we have not included the preprocessing
def transform(dataset, vocab):
    rows = []
    columns = []
    values = []
    if isinstance(dataset, (list,)):
        for idx, row in enumerate(tqdm(dataset)):  # for each document in the dataset
            # Counter returns a dict-type object where the key is the word and the value is its frequency, {word: frequency}
            word_freq = dict(Counter(row.split()))
            # for every unique word in the document
            for word, freq in word_freq.items():  # for each unique word in the review
                if len(word) < 2:
                    continue
                # we will check if it is there in the vocabulary that we built in the fit() function
                # dict.get() returns the value for the key; if the key doesn't exist it returns the default we pass (-1 here)
                col_index = vocab.get(word, -1)  # retrieving the dimension-number of a word
                # if the word exists in the vocabulary
                if col_index != -1:
                    # we are storing the index of the document
                    rows.append(idx)
                    # we are storing the dimension of the word
                    columns.append(col_index)
                    # we are storing the frequency of the word
                    values.append(freq)
        return csr_matrix((values, (rows, columns)), shape=(len(dataset), len(vocab)))
    else:
        print("you need to pass a list of strings")
strings = ["the method of lagrange multipliers is the economists workhorse for solving optimization problems",
"the technique is a centerpiece of economic theory but unfortunately its usually taught poorly"]
vocab = fit(strings)
print(list(vocab.keys()))
print(transform(strings, vocab).toarray())
Output:
['but', 'centerpiece', 'economic', 'economists', 'for', 'is', 'its', 'lagrange', 'method', 'multipliers', 'of', 'optimization', 'poorly', 'problems', 'solving', 'taught', 'technique', 'the', 'theory', 'unfortunately', 'usually', 'workhorse']
100%|████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<?, ?it/s]
[[0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 0 0 2 0 0 0 1]
[1 1 1 0 0 1 1 0 0 0 1 0 1 0 0 1 1 1 1 1 1 0]]
Comparing results with CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(analyzer='word')
vec.fit(strings)
feature_matrix_2 = vec.transform(strings)
print(feature_matrix_2.toarray())
Output:
[[0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 0 0 2 0 0 0 1]
 [1 1 1 0 0 1 1 0 0 0 1 0 1 0 0 1 1 1 1 1 1 0]]
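As a quick sanity check (assuming you store the result of our transform() in a variable, here called feature_matrix_1), you can verify that both the feature matrix and the vocabulary match CountVectorizer's output; both comparisons should print True for this example:
import numpy as np

feature_matrix_1 = transform(strings, vocab)  # feature matrix from our implementation
# compare the dense forms of both feature matrices element-wise
print(np.array_equal(feature_matrix_1.toarray(), feature_matrix_2.toarray()))
# compare our vocabulary with CountVectorizer's learned vocabulary
print(list(vocab.keys()) == sorted(vec.vocabulary_.keys()))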