TF-IDF – Word Embedding

In this blog post, let’s look in detail at what TF-IDF is.

Introduction

We saw previously the Bag of Words representation, which was quite simple but produced a very sparse matrix. In short, it was not very efficient. Next, let’s look at TF-IDF.

TF: Term Frequency

IDF: Inverse Document Frequency

What exactly does this mean?

“TF” means the frequency of a word in a document.

“IDF” means the inverse of a word’s frequency across documents.

Here a document can mean anything: a sentence, a paragraph, etc. It depends mainly on what we pass to the vectorizer, as we will see later on.

So in TF we find the frequency of a word within a single document, and in IDF we multiply by the inverse of that word’s frequency across documents.

To understand the purpose of this, suppose we have a lot of documents. Many words will be repeated across documents, especially stop words and other very common terms. By multiplying by the inverse of the document frequency, the relevance of such words gets reduced. This way we are able to increase the weights of important words and reduce the weights of unimportant ones.
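As a rough, hand-rolled sketch of the idea (the toy corpus and helper functions below are made up purely for illustration; scikit-learn’s TfidfVectorizer uses a smoothed, length-normalized variant, so its exact numbers will differ):

import math

# toy corpus of three "documents", made up for illustration
toy_docs = ["nlp is awesome", "i love nlp", "i love pizza"]

def tf(term, doc):
    # term frequency: share of this document's words that are the term
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    # inverse document frequency: terms found in fewer documents score higher
    docs_with_term = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / docs_with_term)

for term in ["nlp", "awesome", "pizza"]:
    print(term, tf(term, toy_docs[0]) * idf(term, toy_docs))

For the first document, “awesome” ends up with a higher weight than “nlp” because “nlp” also occurs in another document, and “pizza” gets zero since it does not appear in that document at all. This is exactly the down-weighting of common words described above.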

Implementation

import spacy 
nlp = spacy.load("en_core_web_sm")

from sklearn.feature_extraction.text import TfidfVectorizer

import matplotlib.pyplot as plt

import nltk
nltk.download('punkt')
docs= [
  "NLP is awesome",
  "I love NLP"
]


# build the TF-IDF matrix for our two small documents
tfidf_vectorizer = TfidfVectorizer(
    analyzer="word", tokenizer=nltk.word_tokenize,
    preprocessor=None, stop_words='english', max_features=None)

tfidf = tfidf_vectorizer.fit_transform(docs)

print(tfidf.todense())

#output
[[0.81480247 0.         0.57973867]
 [0.         0.81480247 0.57973867]]

So instead of a simple 0/1 vector, these vectors carry more information in the form of TF-IDF weights.
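To see which column corresponds to which word, we can print the learned vocabulary of the vectorizer we fitted above (a small sanity check):

print(tfidf_vectorizer.vocabulary_)
# maps each term to its column index, e.g. {'awesome': 0, 'love': 1, 'nlp': 2}
# ("is" and "I" were removed as English stop words)

So in the first row, 0.81 is the weight of “awesome” and 0.58 the weight of “nlp”; “nlp” gets a lower weight because it appears in both documents.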

Let’s analyse TF-IDF further

Let’s run TF-IDF on the 20 newsgroups data from scikit-learn.

import nltk
nltk.download('punkt')

import pandas as pd
import numpy as np

from sklearn.datasets import fetch_20newsgroups

news = fetch_20newsgroups(subset="train")

print(news.keys())

df = pd.DataFrame(news['data'])
print(df.head())


# keep only the 100 most frequent terms (max_features=100) and drop terms
# that appear in more than 90% of the documents (max_df=.9)
tfidf_vectorizer = TfidfVectorizer(
    analyzer="word", tokenizer=nltk.word_tokenize,
    preprocessor=None, stop_words='english', max_features=100, max_df=.9)

tfidf_vectorizer.fit(news["data"])

print(tfidf_vectorizer.get_feature_names())
print(tfidf_vectorizer.vocabulary_)

Xtr = tfidf_vectorizer.transform(news["data"])
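Before looking at the printed features, it’s worth checking the shape of the resulting matrix. A quick sanity check (the row count assumes the standard train split, which has roughly 11,314 posts):

print(Xtr.shape)
# approximately (11314, 100): one row per post, one column per selected feature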

The output is the list of selected feature names along with the vocabulary mapping. But as you can see, it doesn’t make much sense yet: the features still include punctuation and other noise, so we need to clean the data and also add some utility functions that can help us analyze it further.

Data Pre-Processing

As we saw in the TF-IDF output, there is a lot of data that is not relevant, like punctuation, so we need to clean it. We saw in previous posts how to clean data with spaCy, and this is the function we are going to use:

def normalize(comment, lowercase=True, remove_stopwords=True):
    if lowercase:
        comment = comment.lower()
    # strip stray whitespace, quotes and non-breaking spaces line by line
    lines = comment.splitlines()
    lines = [x.strip(' ') for x in lines]
    lines = [x.replace('"', '') for x in lines]
    lines = [x.replace('\\"', '') for x in lines]
    lines = [x.replace(u'\xa0', u'') for x in lines]
    comment = " ".join(lines)
    doc = nlp(comment)

    # drop punctuation and (optionally) stop words
    words = [token for token in doc
             if not token.is_punct
             and not (remove_stopwords and token.is_stop)]

    # lemmatize whatever is left
    lemmatized = []
    for word in words:
        lemma = word.lemma_.strip()
        if lemma:
            lemmatized.append(lemma)
    return " ".join(lemmatized)

Now that we have a function to clean our text, let’s use TF-IDF again and analyze our data.

Analyzing the TF-IDF Data

It’s quite difficult to analyze TF-IDF data directly since it’s a matrix of vectors. There are a few utility functions which I found very helpful, borrowed from https://buhrmann.github.io/tfidf-analysis.html:

def top_tfidf_feats(row, features, top_n=25):
    ''' Get top n tfidf values in row and return them with their corresponding feature names.'''
    topn_ids = np.argsort(row)[::-1][:top_n]
    top_feats = [(features[i], row[i]) for i in topn_ids]
    df = pd.DataFrame(top_feats)
    df.columns = ['feature', 'tfidf']
    return df

def top_feats_in_doc(Xtr, features, row_id, top_n=25):
    ''' Top tfidf features in specific document (matrix row) '''
    row = np.squeeze(Xtr[row_id].toarray())
    return top_tfidf_feats(row, features, top_n)


def top_mean_feats(Xtr, features, grp_ids=None, min_tfidf=0.1, top_n=25):
    ''' Return the top n features that on average are most important amongst documents in rows
        identified by indices in grp_ids. '''
    if grp_ids:
        D = Xtr[grp_ids].toarray()
    else:
        D = Xtr.toarray()

    D[D < min_tfidf] = 0
    tfidf_means = np.mean(D, axis=0)
    return top_tfidf_feats(tfidf_means, features, top_n)
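To see what top_tfidf_feats returns, here is a quick check on a hand-made row with three hypothetical feature names:

toy_row = np.array([0.1, 0.7, 0.3])
toy_features = ["apple", "banana", "cherry"]
print(top_tfidf_feats(toy_row, toy_features, top_n=2))
#   feature  tfidf
# 0  banana    0.7
# 1  cherry    0.3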

Finally, let’s put all of this together:

# keep things fast by only processing the first 100 posts
data_to_process = news["data"][:100]

clean_data  = []


print("cleaning data")
for row in data_to_process:
  clean_data.append(normalize(row))

print("data cleaned")

tfidf_vectorizer = TfidfVectorizer()  

tfidf_vectorizer.fit(clean_data)

Xtr = tfidf_vectorizer.transform(clean_data)

features = tfidf_vectorizer.get_feature_names()

df2 = top_feats_in_doc(Xtr, features, 0, 10)

print("data without cleaning")
print(data_to_process[0])

print("cleaned data")
print(clean_data[0])

print("top features of the first document")
print(df2)

print("")

print("top means features across all documents")
df3 = top_mean_feats(Xtr, features)

print(df3)

As we can see, after cleaning the data the output is much better and actually makes sense now.
