Text Mining | term frequency-inverse document frequency (TF-IDF) and CountVectorization
- June 15, 2018
- By Pawan Prasad
A typical text classification problem can be solved using various machine learning algorithms, but before that we need to prepare our data: we can't feed raw text directly into a machine learning algorithm. The text has to be converted into numbers in a form the algorithm can use. This process of transforming text into numerical features is called feature extraction, also known as vectorization.
Feature extraction, or vectorization, of text involves tokenizing the string, assigning a unique id to each token, counting the frequency of each token, and normalizing. This technique is also called the "Bag-of-Words" model. Two commonly used vectorizers are CountVectorizer and TF-IDF.
TF-IDF stands for term frequency-inverse document frequency. It is used in text analytics to extract high-quality information from text. In this article, I will explain TF-IDF and how it differs from CountVectorizer.
To understand TF-IDF, we first need to understand CountVectorizer.
CountVectorizer, in simple words, counts word frequencies. In the sklearn implementation, it converts a collection of text documents into a matrix of token counts. Feeding a corpus of natural language text into the CountVectorizer feature extraction model returns a matrix: the columns are the unique words found in the corpus, there is one row for each document in the corpus, and each cell holds the count of occurrences of that word (column) in that document. The weight of a word is therefore determined solely by its count in the document.
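To make this concrete, here is a minimal sketch using scikit-learn's CountVectorizer on a two-document toy corpus (the sentences are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus invented for illustration
corpus = [
    "the Lamborghini is a sports car",
    "the jota is a folk dance",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)  # sparse document-term matrix

# Learned vocabulary = columns of the matrix; note the default tokenizer
# drops single-character tokens like "a".
# (On older scikit-learn versions, use get_feature_names() instead.)
print(vectorizer.get_feature_names_out())
print(counts.toarray())  # one row per document, values = word counts
```

Each row of the printed array is one document, and each column holds the count of one vocabulary word in that document.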
However, the most frequent words are often not very useful for extracting high-quality information. Words like 'the', 'a', and 'that' occur very frequently across all documents but carry no information about any particular document.
Consider a document about sports cars in which the word "the" has the highest frequency. The word "the" gives no extra information about the document, whereas the word "Lamborghini" sets the context: from it we can tell the document is related to sports cars. You may be thinking: why don't we simply reverse the weighting and give more weight to the words that occur least often in the document? There is a catch in this solution. For example, suppose the word "jota", which means a folk dance from northern Spain, occurs once in the same document. Do you think the word "jota" tells us anything about the context?
Nothing!
If we feed this vector to our model, it will give higher weight to meaningless features and miss the important ones. So the features produced by CountVectorizer need to be normalized.
This problem is solved by TF-IDF.
TF-IDF is the product of term frequency (TF) and inverse document frequency (IDF).
First, let us understand term frequency (TF): the TF of a word is the number of times that word appears in a single document.
The IDF of a word is the logarithm of the ratio of the total number of documents in the corpus to the number of documents in which that word occurs:
IDF = log(total no. of documents / no. of documents in which the word is found)
Consider the words "the", "Lamborghini", and "jota", with the following term frequencies (tf) in our sports-car document:
tf("the")=100
tf("Lamborghini")=10
tf("jota")=1
and suppose the corpus contains 100 documents in total, with "the" appearing in all 100 of them, "Lamborghini" in 10, and "jota" in just 1. The inverse document frequency (taking logarithms to base 10) of each word is then:
idf("the") = log(100/100) = log(1) = 0
idf("Lamborghini") = log(100/10) = log(10) = 1
idf("jota") = log(100/1) = log(100) = 2
As we know, TF-IDF is the product of TF and IDF, so let's calculate the TF-IDF values:
tf-idf("the") = 100*0 = 0
tf-idf("Lamborghini") = 10*1 = 10
tf-idf("jota") = 1*2 = 2
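As a quick check, the hand calculation above can be reproduced in a few lines of Python; the counts below are just the toy numbers from this example, not real data:

```python
import math

n_docs = 100                                       # total documents in the corpus
tf = {"the": 100, "lamborghini": 10, "jota": 1}    # term frequency in our document
df = {"the": 100, "lamborghini": 10, "jota": 1}    # documents containing the word

for word in tf:
    idf = math.log10(n_docs / df[word])            # base-10 log, as in the example
    print(word, "idf =", idf, "tf-idf =", tf[word] * idf)
# the         idf = 0.0  tf-idf = 0.0
# lamborghini idf = 1.0  tf-idf = 10.0
# jota        idf = 2.0  tf-idf = 2.0
```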
From the above calculation it is evident how TF-IDF adjusts the weights of words: "the" now has zero weight, and "Lamborghini" carries more weight than "jota".
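In practice you would rarely compute these weights by hand. Below is a minimal sketch with scikit-learn's TfidfVectorizer on the same invented toy corpus; note that by default sklearn uses a smoothed, natural-log IDF and L2-normalizes each row, so its values will differ from the textbook formula used above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Same toy corpus as before, invented for illustration
corpus = [
    "the Lamborghini is a sports car",
    "the jota is a folk dance",
]

tfidf = TfidfVectorizer()  # defaults: smooth_idf=True, norm='l2'
weights = tfidf.fit_transform(corpus)

print(tfidf.get_feature_names_out())
print(weights.toarray().round(3))  # one row per document, TF-IDF weights
```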
However, there are some disadvantages as well. This feature extraction is based on the bag-of-words model, which takes no account of the position of words in the text or their semantics, and it does not capture co-occurrence with other words or how combinations of words change the meaning of a sentence.
I hope you liked this post and found it helpful. Please share your feedback and suggestions in the comments section.