Term Frequency (TF) and Inverse Document Frequency (IDF) | What are TF and IDF?

What is Term Frequency (TF)?

- Assign a weight to each term in a document.

- The weight of a term t in a document d is a function of the number of times t appears in d.

tf(t, d) = count(t, d)
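As a minimal sketch of the raw-count definition above, the snippet below counts how often a term occurs in a tokenized document. The function name `tf` and the sample sentence are illustrative assumptions, not part of the original material.

```python
from collections import Counter

def tf(term, doc_tokens):
    """Raw term frequency: how many times `term` occurs in the token list."""
    return Counter(doc_tokens)[term]

# Hypothetical document, already lower-cased and split into tokens.
doc = "the brick phone is a phone shaped like a brick".split()

print(tf("phone", doc))   # 2
print(tf("tablet", doc))  # 0 (a Counter returns 0 for terms it has not seen)
```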

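What is Inverse Document Frequency (IDF)?

idf(t) = log[N / df(t)]

where N is the number of documents in the corpus and df(t) is the number of documents that contain term t.

- Measures how rare a term is across the corpus

- Indicates the importance of the term: a term found in few documents gets a high idf, while a term found in every document gets idf = log(1) = 0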
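The idf formula can be sketched in a few lines. The helper name `idf`, the tiny corpus, and the choice to return 0.0 for a term that appears in no document (where the formula is undefined) are all assumptions made for illustration.

```python
import math

def idf(term, corpus):
    """idf(t) = log(N / df(t)), where N is the number of documents and
    df(t) is the number of documents that contain the term."""
    n_docs = len(corpus)
    df = sum(1 for doc_tokens in corpus if term in doc_tokens)
    if df == 0:
        # Assumption: the formula is undefined for unseen terms, so return 0.0.
        return 0.0
    return math.log(n_docs / df)

# Hypothetical corpus: three short, already tokenized documents.
corpus = [
    "the brick phone".split(),
    "the phone rings".split(),
    "the phone is new".split(),
]

print(idf("the", corpus))    # appears in every document, so idf is 0.0
print(idf("brick", corpus))  # appears in one document of three, so idf is log(3)
```

Terms that occur everywhere carry no weight, while terms confined to a few documents are weighted up.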

TF-IDF and Modified Retrieval Algorithm

- Weight of term t in document d:

tfidf(t, d) = tf(t, d) * idf(t)

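Query: brick, phone

A document that contains "brick" a few times is more relevant than a document that contains "phone" many times, because "brick" is the rarer term and therefore has the higher idf.

Measure of Relevance with tf-idf

Retrieve all the documents that contain any of the terms from the query, and for each document sum the tf-idf of each query term:

score(q, d) = sum over all terms t in q of tfidf(t, d)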
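The retrieval step can be sketched as follows, using the "brick, phone" query. The function name `rank`, the four-document corpus, and the use of the natural log are assumptions chosen for illustration; the corpus is built so that "phone" is common and "brick" is rare.

```python
import math
from collections import Counter

def rank(query_terms, corpus):
    """Score every document that contains at least one query term by the
    sum of tf-idf over the query terms. Returns (doc_index, score) pairs,
    best match first."""
    n_docs = len(corpus)
    counts = [Counter(doc_tokens) for doc_tokens in corpus]
    scores = {}
    for term in query_terms:
        df = sum(1 for c in counts if term in c)
        if df == 0:
            continue  # the term appears nowhere, so it contributes nothing
        term_idf = math.log(n_docs / df)
        for i, c in enumerate(counts):
            if term in c:
                scores[i] = scores.get(i, 0.0) + c[term] * term_idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical corpus of tokenized documents.
corpus = [
    "brick brick wall phone".split(),                   # "brick" a few times
    "phone phone phone phone phone charger".split(),    # "phone" many times
    "phone case".split(),
    "garden wall".split(),                              # no query term at all
]

for doc_index, score in rank(["brick", "phone"], corpus):
    print(doc_index, round(score, 3))
```

In this sketch the document with two occurrences of "brick" ranks above the one with five occurrences of "phone", and the document without any query term is never scored.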

TF-IDF and Modified Retrieval Algorithm, example

The process of finding the meaning of documents using TF-IDF is very similar to Bag of Words:

- Clean data / preprocessing: clean the data (standardize it), normalize it (all lower case), and lemmatize it (reduce all words to their root words)

- Tokenize words with frequency

- Find TF for words

- Find IDF for words

- Vectorize vocab
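The first two steps and the vocabulary step might look like the sketch below. It uses only the standard library, so lemmatization is left out (a library such as NLTK or spaCy would normally handle it). The helper names `preprocess` and `build_vocab` and the two sample sentences are hypothetical.

```python
import re
from collections import Counter

def preprocess(text):
    """Lower-case the text and split it into alphabetic tokens.
    Lemmatization is deliberately omitted in this sketch."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocab(token_lists):
    """Sorted list of every distinct token across all documents."""
    return sorted(set(tok for tokens in token_lists for tok in tokens))

# Hypothetical raw documents.
docs = ["The phone is new.", "A brick, a phone!"]

tokens = [preprocess(d) for d in docs]   # clean + tokenize
freqs = [Counter(t) for t in tokens]     # token frequencies per document
vocab = build_vocab(tokens)              # shared vocabulary for vectorizing

print(tokens)
print(vocab)
```

TF and IDF are then computed per vocabulary word, which gives each document a vector with one entry per word.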

Example

To find TF-IDF, we need to perform the steps laid out above. Let's get to it.

Example, continued

- Step 2: Find TF for all docs

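TF = (Number of times the word occurs in a document) / (Total number of words in that document)

Note that this example normalizes the raw count by the document length, so that long documents are not favored just for being long.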
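A minimal sketch of this length-normalized TF, assuming the document has already been tokenized; the function name `tf_normalized` and the sample document are illustrative.

```python
from collections import Counter

def tf_normalized(doc_tokens):
    """Map each word to (occurrences of the word) / (total words in the document)."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {term: count / total for term, count in counts.items()}

# Hypothetical four-word document.
doc = "a brick a phone".split()
print(tf_normalized(doc))  # {'a': 0.5, 'brick': 0.25, 'phone': 0.25}
```

The TF values of one document always add up to 1 under this definition.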

- Step 3: Find IDF

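IDF = Log[(Number of documents) / (Number of documents containing the word)]

In Excel, use the natural log function LN. For example, a word that appears in all 3 of the 3 documents gets LN(3/3) = 0.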

- Step 4: Build the model, i.e. place all words side by side in one table

The table holds the IDF value of each word and its TF value in each of the 3 documents.

- Step 5: Compare results and use the table to answer questions

Remember the final equation: TF-IDF = TF * IDF
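Steps 2 to 5 can be put together in one short sketch. The three sentences below are hypothetical stand-ins, not necessarily the documents used in this example; they were chosen only so that they contain the words discussed in the analysis ('it', 'is', 'rain', 'today', 'I'). The function name `tfidf_table` is also an assumption, and the code expects non-empty documents.

```python
import math
from collections import Counter

def tfidf_table(token_lists):
    """Return (vocab, rows) where rows[i][term] = TF * IDF for document i,
    with TF = count / document length and IDF = ln(N / df)."""
    n_docs = len(token_lists)
    vocab = sorted(set(tok for tokens in token_lists for tok in tokens))
    df = {t: sum(1 for tokens in token_lists if t in tokens) for t in vocab}
    idf = {t: math.log(n_docs / df[t]) for t in vocab}
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        rows.append({t: counts[t] / len(tokens) * idf[t] for t in vocab})
    return vocab, rows

# Hypothetical documents, already lower-cased and tokenized.
docs = [
    "it is going to rain today".split(),
    "today i am not going outside".split(),
    "i am going to watch the season premiere".split(),
]

vocab, rows = tfidf_table(docs)
for term in vocab:
    print(f"{term:10s}", "  ".join(f"{row[term]:.3f}" for row in rows))
```

With these stand-in documents, "going" scores 0 everywhere because it occurs in all three documents, while "rain" is non-zero only for the first document.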

Example, continued: analysis and outcomes

You can easily see from this table that words like 'it', 'is', and 'rain' are important for document 1 but not for documents 2 and 3, which means that document 1 differs from documents 2 and 3 with respect to talking about rain.
You can also say that documents 1 and 2 both talk about something 'today', and that documents 2 and 3 both say something about the writer, because of the word 'I'.
This table helps you find similarities and differences between documents, words, and more, much better than Bag of Words does.

RealCode4You