realcode4you

What Is TF (Term Frequency) and IDF (Inverse Document Frequency) in Machine Learning?

TF (Term Frequency)

- Assigns a weight to each term in a document.

- The weight of a term t in a document d is a function of the number of times t appears in d.

  • The weight can be simply set to the number of occurrences of t in d :

tf (t, d) = count (t, d)


  • The term frequency may optionally be normalized, for example by dividing by the total number of words in d
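The two variants above can be sketched in a few lines of Python. This is only an illustration: the function names `tf_raw` and `tf_normalized` and the sample sentence are made up for this example, and length normalization is just one of several possible normalizations.

```python
from collections import Counter

def tf_raw(term, tokens):
    """Raw term frequency: number of times `term` occurs in the token list."""
    return Counter(tokens)[term]

def tf_normalized(term, tokens):
    """Length-normalized term frequency: raw count / number of tokens."""
    return tf_raw(term, tokens) / len(tokens) if tokens else 0.0

# Hypothetical sample document (8 tokens, "phone" occurs twice)
doc = "the phone rang and the phone kept ringing".split()

print(tf_raw("phone", doc))         # 2
print(tf_normalized("phone", doc))  # 0.25
```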


Inverse Document Frequency (IDF)


idf(t) = log [N/df(t)]


  • N: Number of documents in the corpus

  • df(t): Number of documents in the corpus that contain a term t

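- Measures how unique (rare) a term is in the corpus

  • "phone" vs. "brick": a term that appears in most documents gets a low idf, while a term that appears in only a few gets a high idf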

- Indicates the importance of the term

  • Search (relevance)

  • Classification (discriminatory power)
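As a rough sketch of the idf formula, the snippet below computes idf(t) = log[N/df(t)] with the natural log over a tiny corpus. The corpus is invented for illustration, with "phone" placed in every document and "brick" in only one; raising an error for a term with df(t) = 0 is an assumption of this sketch, not something the formula prescribes.

```python
import math

def idf(term, corpus):
    """idf(t) = log(N / df(t)) with the natural log.

    Raises ValueError when no document contains the term (df = 0),
    since the formula is undefined in that case.
    """
    n_docs = len(corpus)
    df = sum(1 for tokens in corpus if term in tokens)
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(n_docs / df)

# Hypothetical corpus: "phone" is in all 4 documents, "brick" in 1
corpus = [
    ["my", "phone", "is", "new"],
    ["call", "my", "phone"],
    ["the", "phone", "hit", "a", "brick"],
    ["charge", "the", "phone"],
]

print(idf("phone", corpus))  # 0.0
print(idf("brick", corpus))  # log(4), about 1.386
```

The common term ends up with an idf of zero, so it contributes nothing to relevance, while the rare term gets the largest weight.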


TF-IDF and Modified Retrieval Algorithm

- term t in document d:

tfidf(t, d) = tf (t, d) * idf(t)

query: brick, phone


- A document that contains "brick" a few times is more relevant than a document that contains "phone" many times, because "brick" is the rarer term and so has the higher idf

- Measure of Relevance with tf-idf

- Retrieve all the documents that contain any of the terms from the query, and sum up the tf-idf of each query term:

relevance(q, d) = sum of tfidf(t, d) over all terms t in the query q
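The retrieval rule can be sketched as follows, using the raw count for tf and the natural log for idf. The helper names `tfidf` and `rank` and the four-document corpus are hypothetical, chosen so that one document repeats "phone" many times and another contains "brick" only twice.

```python
import math
from collections import Counter

def tfidf(term, tokens, corpus):
    """tfidf(t, d) = count(t, d) * log(N / df(t)); 0.0 if the term is unseen."""
    tf = Counter(tokens)[term]
    df = sum(1 for doc in corpus if term in doc)
    return tf * math.log(len(corpus) / df) if df else 0.0

def rank(query, corpus):
    """Score every document containing at least one query term, best first."""
    scores = []
    for i, tokens in enumerate(corpus):
        if any(t in tokens for t in query):
            score = sum(tfidf(t, tokens, corpus) for t in query)
            scores.append((i, score))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical corpus for the query "brick, phone"
corpus = [
    ["phone", "phone", "phone", "phone", "case"],  # "phone" many times
    ["brick", "brick", "wall", "phone"],           # "brick" a few times
    ["phone", "charger"],
    ["garden", "path"],                            # no query term: skipped
]

ranking = rank(["brick", "phone"], corpus)
print([i for i, _ in ranking])  # [1, 0, 2]
```

Document 1 ranks first even though document 0 mentions "phone" four times, because the rare term "brick" carries a much higher idf.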



TF-IDF and Modified Retrieval Algorithm, example

  • The process of finding the meaning of documents using TF-IDF is very similar to Bag of Words:

  • Clean data / preprocessing: clean the data (standardise it), normalize it (all lower case), and lemmatize it (reduce all words to their root words).

  • Tokenize the text and count word frequencies

  • Find TF for words

  • Find IDF for words

  • Vectorize vocab


Example

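  • Let's work through an example with 3 documents:

  • Document 1: It is going to rain today. (6 words, so each word has TF = 1/6)

  • Document 2: Today I am not going outside. (6 words, so each word has TF = 1/6)

  • Document 3: I am going to watch the season premiere. (8 words, so each word has TF = 1/8)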


To find TF-IDF we need to perform the steps we laid out above. Let's get to it.


- Step 1: Clean data and Tokenize
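A minimal sketch of Step 1 for the three example documents is shown below. It only lowercases the text and strips punctuation; lemmatization is deliberately left out to keep the example dependency-free, and the `tokenize` helper is a name made up for this sketch.

```python
import re

docs = [
    "It is going to rain today.",
    "Today I am not going outside.",
    "I am going to watch the season premiere.",
]

def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

tokens = [tokenize(d) for d in docs]
for i, t in enumerate(tokens, start=1):
    print(f"Document {i}: {len(t)} words -> {t}")
```

The word counts (6, 6 and 8) are the denominators used for TF in the next step.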


- Step 2: Find TF for all docs

  • TF = (Number of times the word occurs in a document) / (Total number of words in that document)
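Step 2 might look like this in Python, starting from the already tokenized documents (lowercased, punctuation removed, no lemmatization). The `term_frequencies` helper is a hypothetical name for this sketch.

```python
from collections import Counter

tokens = [
    ["it", "is", "going", "to", "rain", "today"],
    ["today", "i", "am", "not", "going", "outside"],
    ["i", "am", "going", "to", "watch", "the", "season", "premiere"],
]

def term_frequencies(doc_tokens):
    """Map each word to (occurrences in the document) / (document length)."""
    counts = Counter(doc_tokens)
    return {word: count / len(doc_tokens) for word, count in counts.items()}

tf = [term_frequencies(t) for t in tokens]

print(tf[0]["rain"])   # 1/6, about 0.1667
print(tf[2]["watch"])  # 0.125
```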



- Step 3: Find IDF

IDF = Log[(Number of documents) / (Number of documents containing the word)]

In Excel, use the natural log function LN. For example, a word that appears in all 3 documents has IDF = LN(3/3) = 0.
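The same IDF calculation can be sketched in Python with `math.log`, which is the natural log and so matches Excel's LN. The tokenized documents are repeated here so the snippet runs on its own; variable names are illustrative.

```python
import math

tokens = [
    ["it", "is", "going", "to", "rain", "today"],
    ["today", "i", "am", "not", "going", "outside"],
    ["i", "am", "going", "to", "watch", "the", "season", "premiere"],
]

n_docs = len(tokens)
vocab = sorted({word for doc in tokens for word in doc})

# df(word) = number of documents containing the word; idf = ln(N / df)
idf = {
    word: math.log(n_docs / sum(word in doc for doc in tokens))
    for word in vocab
}

print(idf["going"])  # 0.0, appears in all 3 documents
print(idf["today"])  # ln(3/2), about 0.405
print(idf["rain"])   # ln(3/1), about 1.099
```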



- Step 4: Build the model, i.e. stack all words next to each other in one table

IDF Value and TF value of 3 documents



- Step 5: Compare results and use table to ask questions


Remember, the final equation is TF-IDF = TF * IDF
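Putting the steps together, the sketch below builds a word-by-document TF-IDF table for the three example documents. It assumes length-normalized TF and natural-log IDF as in the steps above, and skips lemmatization; the `table` layout is one possible way to stack the values, not necessarily the one used in the original spreadsheet.

```python
import math
from collections import Counter

tokens = [
    ["it", "is", "going", "to", "rain", "today"],
    ["today", "i", "am", "not", "going", "outside"],
    ["i", "am", "going", "to", "watch", "the", "season", "premiere"],
]

n_docs = len(tokens)
vocab = sorted({word for doc in tokens for word in doc})
idf = {w: math.log(n_docs / sum(w in doc for doc in tokens)) for w in vocab}

# One row per word, one TF-IDF value per document: TF * IDF
table = {
    w: [Counter(doc)[w] / len(doc) * idf[w] for doc in tokens]
    for w in vocab
}

print(f"{'word':<10}{'doc 1':>8}{'doc 2':>8}{'doc 3':>8}")
for word, row in table.items():
    print(f"{word:<10}" + "".join(f"{value:>8.3f}" for value in row))
```

In this table "going" scores 0 everywhere because it occurs in every document, while "rain" is non-zero only for document 1.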



Analysis and outcomes

  • Using this table, you can easily see that words like 'it', 'is' and 'rain' are important for document 1 but not for documents 2 and 3, which means document 1 differs from documents 2 and 3 in that it talks about rain.

  • You can also say that documents 1 and 2 both talk about something happening 'today', and that documents 2 and 3 both say something about the writer, because of the word 'I'.

  • This table helps you find similarities and differences between documents and words, and it does so much better than Bag of Words.
