In this blog post we will understand the bag of words model and see its implementation in detail.
Introduction (Bag of Words)
This is one of the most basic and simple methods to convert a list of words to vectors. The idea is simple: suppose we have a list of, say, n words in our corpus. We create a vector of size n, put the value 1 at the position of the word we are encoding, and set all other values to 0. This is also called one-hot encoding.
To explain this further, let's suppose our corpus has the words "NLP", "is" and "awesome". Converted to this model it would look something like:
"NLP" => [1,0,0]
"is" => [0,1,0]
"awesome" => [0,0,1]
So we convert the words to vectors using simple one-hot encoding.
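A minimal sketch of this idea in plain Python (the corpus and the one_hot helper below are just for illustration, not from any library):

corpus = ["NLP", "is", "awesome"]

def one_hot(word, vocab):
    # a vector of zeros with a single 1 at the word's position in the vocabulary
    vector = [0] * len(vocab)
    vector[vocab.index(word)] = 1
    return vector

print(one_hot("NLP", corpus))      # [1, 0, 0]
print(one_hot("awesome", corpus))  # [0, 0, 1]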
Of course, this is a very simple model and it has a lot of problems:
- If our list of words is very large, this creates very large word vectors that are sparse (all values are 0 except a single 1), which is not very efficient.
- We lose any semantic information about the words, their relevance to each other, etc., in this model.
Let's see how this looks in practice.
Basic Implementation
Let’s first do the basic imports
import spacy
nlp = spacy.load("en_core_web_sm")
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
scikit-learn has a class called CountVectorizer which does exactly this: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
sentence = "NLP is awesome"
count_vectorizer = CountVectorizer()
# learn the vocabulary from the individual tokens
count_vectorizer.fit(sentence.split())
# encode each token as a row vector over that vocabulary
matrix = count_vectorizer.transform(sentence.split())
print(matrix.todense())
#output
[[0 0 1]
[0 1 0]
[1 0 0]]
So this is the vector representation we have for our words, and it is very easy to understand.
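One thing to note: CountVectorizer lowercases the tokens and orders its vocabulary alphabetically, so the columns above correspond to "awesome", "is", "nlp" rather than the order in which we typed the words. That is why "NLP" comes out as [0,0,1] here instead of the [1,0,0] from our hand-made example. The column mapping can be checked with the vocabulary_ attribute:

# maps each term to its column index in the matrix
print(count_vectorizer.vocabulary_)
# e.g. {'awesome': 0, 'is': 1, 'nlp': 2}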
Let’s see another example
# bag of words very simple example
sentences = ["NLP is awesome","I want to learn NLP"]
count_vectorizer = CountVectorizer()
# the vocabulary is now built from both sentences
count_vectorizer.fit(sentences)
new_sentence = "How to learn NLP?"
# transform an unseen sentence word by word
matrix = count_vectorizer.transform(new_sentence.split())
print(matrix.todense())
#output
[[0 0 0 0 0 0]
[0 0 0 0 1 0]
[0 0 1 0 0 0]
[0 0 0 1 0 0]]
As we can see, the first word "How" is not present in our bag of words, hence it is represented as a row of all zeros.
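Here we transformed the sentence word by word to keep it close to the previous example; more commonly you would pass whole sentences to transform, so that each row becomes the count vector of one document. A quick sketch using the same fitted vectorizer:

# treat the whole sentence as a single document instead of one word per row
matrix = count_vectorizer.transform([new_sentence])
print(matrix.todense())
# should print [[0 0 1 1 1 0]]: counts for "learn", "nlp" and "to"; "How" is ignored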
More advanced usage
In this example we will use the 20 newsgroups dataset from scikit-learn.
import nltk
nltk.download('punkt')
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_20newsgroups
news = fetch_20newsgroups(subset="train")
print(news.keys())
df = pd.DataFrame(news['data'])
print(df.head())
count_vectorizer = CountVectorizer(
    analyzer="word", tokenizer=nltk.word_tokenize,
    preprocessor=None, stop_words='english', max_features=1000, max_df=.9)
count_vectorizer.fit(news["data"])
print(count_vectorizer.get_feature_names_out())  # get_feature_names() in older scikit-learn versions
print(count_vectorizer.vocabulary_)
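Once the vectorizer is fitted, we can turn the whole training set into a sparse document-term matrix. A short sketch (the exact number of rows depends on the dataset, but with max_features=1000 there are at most 1000 columns):

# one row per newsgroup post, one column per vocabulary term
doc_term_matrix = count_vectorizer.transform(news["data"])
print(doc_term_matrix.shape)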
There are a few important parameters to know when using CountVectorizer (see the small example after this list):
max_df : float in range [0.0, 1.0] or int, default=1.0
When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents; if int, absolute counts. This parameter is ignored if vocabulary is not None.
min_df : float in range [0.0, 1.0] or int, default=1
When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents; if int, absolute counts. This parameter is ignored if vocabulary is not None.
max_features : int or None, default=None
If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus.
This parameter is ignored if vocabulary is not None.
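To see max_df and min_df in action, here is a rough illustration on a made-up toy corpus:

toy_corpus = [
    "nlp is fun",
    "nlp is hard",
    "nlp is everywhere",
]
# "nlp" and "is" appear in every document (document frequency = 1.0),
# so max_df=0.9 throws them away; min_df=2 would instead drop the words
# that appear in fewer than 2 documents ("fun", "hard", "everywhere")
vectorizer = CountVectorizer(max_df=0.9)
vectorizer.fit(toy_corpus)
print(vectorizer.vocabulary_)  # only 'everywhere', 'fun' and 'hard' are left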
The full source code can be seen here: https://colab.research.google.com/drive/1oiJ-kXc_Vdt46xOwLSkLqPBSqGTtGPAl