What Are TF (Term Frequency) and IDF (Inverse Document Frequency) in Machine Learning?

TF (Term Frequency)

- Assigns a weight to each term in a document.

- The weight of a term t in a document d is a function of the number of times t appears in d.

tf (t, d) = count (t, d)
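The raw-count definition above can be sketched in a few lines of Python. This is only an illustration: the function name `term_frequency` and the sample sentence are made up for the example, and the document is assumed to be already lowercased and split on whitespace.

```python
from collections import Counter

def term_frequency(tokens):
    """Return tf(t, d) = count(t, d) for every term t in one tokenized document."""
    return Counter(tokens)

# Hypothetical document, already lowercased and split into tokens.
doc = "it is going to rain today it is".split()
tf = term_frequency(doc)

print(tf["it"])    # 2
print(tf["rain"])  # 1
print(tf["snow"])  # 0, because Counter returns 0 for unseen terms
```

A `Counter` is convenient here because a term that never occurs in the document simply has a count of 0.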

IDF (Inverse Document Frequency)

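idf(t) = log [N/df(t)]

where N is the number of documents in the corpus and df(t) is the number of documents that contain t.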

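- Measures how unique a term is across the corpus

- Indicates the importance of the term: a term found in only a few documents gets a high weight, while a term found in every document gets a weight of 0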
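As a rough sketch, the IDF formula could be written like this in Python. The function name, the three-document corpus, and the choice of the natural logarithm are assumptions made for illustration; a different log base only rescales the values.

```python
import math

def inverse_document_frequency(term, corpus):
    """idf(t) = log(N / df(t)), using the natural log.

    corpus is a list of documents, each document a list of tokens.
    """
    n_docs = len(corpus)
    # df(t): number of documents that contain the term at least once.
    df = sum(1 for doc in corpus if term in doc)
    if df == 0:
        raise ValueError(f"term {term!r} does not appear in the corpus")
    return math.log(n_docs / df)

# Hypothetical corpus of three tokenized documents.
corpus = [
    ["it", "is", "going", "to", "rain", "today"],
    ["today", "i", "am", "not", "going", "outside"],
    ["i", "am", "going", "to", "watch", "the", "season", "premiere"],
]

print(inverse_document_frequency("going", corpus))            # 0.0
print(round(inverse_document_frequency("rain", corpus), 3))   # 1.099
```

In this toy corpus "going" appears in every document, so it carries no weight, while "rain" appears in only one document and gets the highest weight.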

TF-IDF and Modified Retrieval Algorithm

- term t in document d:

tfidf(t, d) = tf (t, d) * idf(t)

query: brick, phone

- A document that contains "brick" a few times is more relevant than a document that contains "phone" many times, because "brick" is the rarer term in the corpus

- Measure of Relevance with tf-idf

- Retrieve all the documents that contain any of the terms from the query, and for each document sum up the tf-idf of each query term:

score(q, d) = sum of tfidf(t, d) over all terms t in the query q
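The retrieval step can be sketched as follows. This is a minimal illustration, not a production search routine: the function name `tfidf_scores`, the five tiny documents, and the natural log are all assumptions chosen so that "phone" is common and "brick" is rare.

```python
import math
from collections import Counter

def tfidf_scores(query_terms, corpus):
    """Score each document that contains a query term by summing tf * idf."""
    n_docs = len(corpus)
    counts = [Counter(doc) for doc in corpus]
    scores = {}
    for term in query_terms:
        df = sum(1 for c in counts if term in c)
        if df == 0:
            continue  # the term is not in the corpus, so it adds nothing
        idf = math.log(n_docs / df)
        for doc_id, c in enumerate(counts):
            if term in c:
                scores[doc_id] = scores.get(doc_id, 0.0) + c[term] * idf
    return scores

# Hypothetical corpus: "phone" is common, "brick" is rare.
corpus = [
    "phone phone phone phone phone case".split(),
    "brick brick wall".split(),
    "phone charger".split(),
    "phone screen repair".split(),
    "laptop stand".split(),
]

scores = tfidf_scores(["brick", "phone"], corpus)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking[0])  # 1, the document that mentions "brick" twice
```

Document 1 outranks document 0 even though document 0 repeats "phone" five times, because "phone" appears in three of the five documents and so has a low IDF. Document 4 contains no query term and is never scored.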

TF-IDF and Modified Retrieval Algorithm, example

The process of finding the meaning of documents using TF-IDF is very similar to Bag of Words:

- Clean data / preprocessing: clean the data (standardise it), normalize it (all lower case), and lemmatize it (reduce all words to their root forms)

- Tokenize words with frequency

- Find TF for words

- Find IDF for words

- Vectorize vocab
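The first two steps (cleaning and tokenizing with frequency) might look like the sketch below. It uses only the standard library, so lemmatization is deliberately left out; in practice that step would need an NLP library. The function name `preprocess` and the sample sentence are hypothetical.

```python
import re
from collections import Counter

def preprocess(text):
    """Lowercase the text and keep only runs of letters as tokens.

    Lemmatization is not performed here; this sketch only normalizes case
    and strips punctuation and digits.
    """
    return re.findall(r"[a-z]+", text.lower())

# Hypothetical raw sentence.
tokens = preprocess("It is going to RAIN today!")
print(tokens)  # ['it', 'is', 'going', 'to', 'rain', 'today']

# Tokenize words with frequency.
frequencies = Counter(tokens)
print(frequencies["rain"])  # 1
```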

Example

To find TF-IDF we need to perform the steps we laid out above. Let’s get to it.

- Step 1: Clean data and Tokenize

- Step 2: Find TF for all docs

- Step 3: Find IDF

IDF = log[(Number of documents) / (Number of documents containing the word)]

In Excel, use the LN function. For example, a word that appears in all 3 documents gets LN(3/3) = 0.

- Step 4: Build model i.e. stack all words next to each other

IDF Value and TF value of 3 documents

- Step 5: Compare results and use table to ask questions

Remember, the final equation is TF-IDF = TF * IDF
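Putting the steps together, a TF-IDF table for a small corpus could be built as in the sketch below. The three sentences are hypothetical stand-ins, TF is the raw count, and IDF uses the natural log (the same function as Excel's LN); all variable names are chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical corpus of three short, already cleaned documents.
docs = [
    "it is going to rain today",
    "today i am not going outside",
    "i am going to watch the season premiere",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Vocabulary: every distinct word, in a fixed (sorted) order.
vocab = sorted({t for doc in tokenized for t in doc})

# IDF per word: ln(N / number of documents containing the word).
idf = {
    t: math.log(n_docs / sum(1 for doc in tokenized if t in doc))
    for t in vocab
}

# TF per document (raw counts), then TF * IDF for each word of the vocabulary.
tf = [Counter(doc) for doc in tokenized]
table = [[tf_d[t] * idf[t] for t in vocab] for tf_d in tf]

# Print one row per word, one column per document.
for j, term in enumerate(vocab):
    row = "  ".join(f"{table[i][j]:.3f}" for i in range(n_docs))
    print(f"{term:10s} {row}")
```

Each inner list of `table` is the TF-IDF vector of one document over the shared vocabulary, which is what makes the documents directly comparable.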

Analysis and outcomes

You can easily see from this table that words like ‘it’, ‘is’ and ‘rain’ are important for document 1 but not for documents 2 and 3, which means that document 1 differs from documents 2 and 3 with respect to talking about rain.
You can also say that documents 1 and 2 talk about something ‘today’, and that documents 2 and 3 discuss something about the writer because of the word ‘I’.
This table helps you find similarities and differences between documents, words and more, much better than Bag of Words does.

RealCode4You