In this blog post, let's look in detail at what TF-IDF is.
Introduction
We previously saw the Bag of Words representation, which was quite simple and produced a very sparse matrix. In short, it was not very efficient. Next, let's look at TF/IDF.
TF: Term Frequency
IDF: Inverse Document Frequency
What exactly does this mean?
“TF” means the frequency of a word in a document.
“IDF” means the inverse of the frequency of a word across documents.
Also, a “document” here can be anything: a sentence, a paragraph, and so on. It depends mainly on what we send to the vectorizer, as we will see later on.
So in TF we find the frequency of each word within a single document, and in IDF we multiply by the “inverse” of the word's frequency across documents.
To understand the purpose of this, suppose we have a lot of documents. There will be many words that are repeated across documents, especially stop words and other very common terms. By multiplying with the inverse of the document frequency, IDF reduces the relevance of such words. This way we are able to increase the weights of important words and reduce the weights of unimportant ones.
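To make the idea concrete, here is a minimal sketch of the IDF part, using the smoothed formula that scikit-learn's TfidfVectorizer applies by default (idf = ln((1 + n) / (1 + df)) + 1). The tiny toy_docs list and the smoothed_idf helper are just for illustration, not part of the pipeline:
import math

toy_docs = [
    ["nlp", "is", "awesome"],   # document 1, already tokenized
    ["i", "love", "nlp"],       # document 2
]
n_docs = len(toy_docs)

def smoothed_idf(term):
    # number of documents that contain the term
    df = sum(1 for doc in toy_docs if term in doc)
    # scikit-learn's default smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + df)) + 1

print(smoothed_idf("nlp"))      # 1.0    -> appears in every document, lower weight
print(smoothed_idf("awesome"))  # ~1.405 -> appears in only one document, higher weight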
Implementation
import spacy
nlp = spacy.load("en_core_web_sm")
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
import nltk
nltk.download('punkt')
docs= [
"NLP is awesome",
"I love NLP"
]
tfidf_vectorizer = TfidfVectorizer(
    analyzer="word", tokenizer=nltk.word_tokenize,
    preprocessor=None, stop_words='english', max_features=None)
tfidf = tfidf_vectorizer.fit_transform(docs)
print(tfidf.todense())
#output
[[0.81480247 0. 0.57973867]
[0. 0.81480247 0.57973867]]
So instead of the simple 0/1 vectors we got from Bag of Words, these vectors carry more information in the form of TF/IDF weights.
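If you are curious where those numbers come from, here is a rough re-derivation by hand. It assumes the vectorizer's defaults (smoothed IDF and L2 normalization of each row) and is only a sanity check, not part of the pipeline:
import numpy as np

# after removing English stop words ("is", "i"), the vocabulary is:
# awesome (only in doc 1), love (only in doc 2), nlp (in both)
n = 2                                         # number of documents
idf_rare   = np.log((1 + n) / (1 + 1)) + 1    # term in 1 doc  -> ~1.405
idf_common = np.log((1 + n) / (1 + 2)) + 1    # term in 2 docs -> 1.0

# raw tf-idf row for "NLP is awesome": [awesome, love, nlp]
row = np.array([idf_rare, 0.0, idf_common])
row /= np.linalg.norm(row)                    # L2-normalize, as sklearn does
print(row)                                    # ~[0.8148, 0.0, 0.5797]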
Let’s analyse TF/IDF further
Let's run TF/IDF on the newsgroups dataset from scikit-learn.
import nltk
nltk.download('punkt')
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset="train")
print(news.keys())
df = pd.DataFrame(news['data'])
print(df.head())
tfidf_vectorizer = TfidfVectorizer(
analyzer="word", tokenizer=nltk.word_tokenize,
preprocessor=None, stop_words='english', max_features=100, max_df=.9)
tfidf_vectorizer.fit(news["data"])
print(tfidf_vectorizer.get_feature_names_out())  # get_feature_names() in older scikit-learn versions
print(tfidf_vectorizer.vocabulary_)
Xtr = tfidf_vectorizer.transform(news["data"])
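Before going further, a quick sanity check on the matrix we just built can help. The row count below assumes the standard 20 newsgroups train split; the column count comes from the max_features=100 cap above:
print(Xtr.shape)  # expected roughly (11314, 100): one row per post, one column per feature
print(Xtr.nnz)    # number of non-zero entries in the sparse matrix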
Looking at the printed feature names and vocabulary, the output doesn't make much sense yet: it is full of punctuation and other junk tokens. We need to clean the data and also add some utility functions that will help us analyze it further.
Data Preprocessing
As we can see in the TF/IDF output, there is a lot of data that is not relevant, such as punctuation, so we need to clean it. We saw in previous posts how to clean data with spaCy, so this is the function we are going to use:
def normalize(comment, lowercase=True, remove_stopwords=True):
    if lowercase:
        comment = comment.lower()
    # strip stray whitespace, quotes and non-breaking spaces line by line
    lines = comment.splitlines()
    lines = [x.strip(' ') for x in lines]
    lines = [x.replace('"', '') for x in lines]
    lines = [x.replace('\\"', '') for x in lines]
    lines = [x.replace(u'\xa0', u'') for x in lines]
    comment = " ".join(lines)
    doc = nlp(comment)
    # drop punctuation, and stop words if requested
    words = [token for token in doc
             if not token.is_punct
             and not (remove_stopwords and token.is_stop)]
    # keep the lemma of each remaining token
    lemmatized = []
    for word in words:
        lemma = word.lemma_.strip()
        if lemma:
            lemmatized.append(lemma)
    return " ".join(lemmatized)
Now that we have a function to clean our text, let's use TF/IDF again and analyze the data.
Analyzing the TF/IDF Data
It is quite difficult to analyze TF/IDF data as it is a matrix of vectors. There are a few utility functions I found very helpful, borrowed from https://buhrmann.github.io/tfidf-analysis.html
def top_tfidf_feats(row, features, top_n=25):
    ''' Get top n tfidf values in row and return them with their corresponding feature names. '''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

def top_feats_in_doc(Xtr, features, row_id, top_n=25):
    ''' Top tfidf features in a specific document (matrix row). '''
    row = np.squeeze(Xtr[row_id].toarray())
    return top_tfidf_feats(row, features, top_n)

def top_mean_feats(Xtr, features, grp_ids=None, min_tfidf=0.1, top_n=25):
    ''' Return the top n features that on average are most important amongst documents in rows
        identified by indices in grp_ids. '''
    if grp_ids is not None:
        D = Xtr[grp_ids].toarray()
    else:
        D = Xtr.toarray()
    D[D < min_tfidf] = 0
    tfidf_means = np.mean(D, axis=0)
    return top_tfidf_feats(tfidf_means, features, top_n)
Finally, let's put it all together:
data_to_process = news["data"][:100]
clean_data = []
print("cleaning data")
for row in data_to_process:
    clean_data.append(normalize(row))
print("data cleaned")
tfidf_vectorizer = TfidfVectorizer()
tfidf_vectorizer.fit(clean_data)
Xtr = tfidf_vectorizer.transform(clean_data)
features = tfidf_vectorizer.get_feature_names_out()  # get_feature_names() in older scikit-learn versions
df2 = top_feats_in_doc(Xtr, features, 0, 10)
print("data without cleaning")
print(data_to_process[0])
print("cleaned data")
print(clean_data[0])
print("top features of the first document")
print(df2)
print("")
print("top means features across all documents")
df3 = top_mean_feats(Xtr, features)
print(df3)
As we can see, after cleaning the data the results are much better, and the output actually makes sense now.