NLP – Basics – Getting Started

NLP

NLP is a branch of machine learning that mainly deals with text and enables a system to understand human language.

We will look into libraries like spaCy, NLTK and scikit-learn and implement some of the basic tasks used in NLP.

spaCy, NLTK, Scikit-learn

Let’s see what these libraries are about.

Scikit-learn is a general-purpose library used for machine learning and data science. It has a lot of functions used for various tasks in ML programs.
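
We won’t use scikit-learn further in this post, but just as a rough sketch of how it typically shows up in NLP work (the corpus below is a made-up example, not from this post), its CountVectorizer turns raw sentences into a bag-of-words matrix that ML models can consume:

# hypothetical example: turning text into features with scikit-learn
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["I love NLP", "NLP is fun", "I love writing code"]  # made-up sample corpus
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# on older scikit-learn versions this method is get_feature_names()
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # word counts per sentence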

NLTK is a toolkit specific to NLP, written in Python. It provides a lot of different modules and functions for the various NLP tasks that need to be done. It is a very old and broad toolkit supporting many different functions and algorithms: https://github.com/nltk/nltk

spaCy is also a toolkit specific to NLP. It is a very opinionated toolkit and has a lot of prebuilt functions for specific NLP tasks. spaCy is much newer, very fast, and ships with pretrained models to work with. It isn’t a general-purpose library: the goal of spaCy is to provide specific functions for NLP and get them done fast, based on predefined models and algorithms. spaCy doesn’t give you the option to play with different models or algorithms for an NLP task; instead it has already chosen good models and algorithms and provides very simple functions for doing the NLP tasks. https://spacy.io/

I would personally recommend spaCy if you are new to NLP.

Let’s see the basics of NLP with actual code, using both NLTK and spaCy.

Let’s do the basic imports first

import nltk
# download the NLTK data used later (stop words and the punkt tokenizer)
nltk.download('stopwords')
nltk.download('punkt')

import spacy
# load spaCy's small pretrained English model
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

Tokenization 

Tokenization is simply the process of splitting sentences (or documents) into smaller individual words, sometimes also handling punctuation separately; it is mainly about breaking larger sentences into smaller tokens.

Let’s see this in practice.

# tokenize using nltk and spacy
sentence = "Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data"

words = nltk.word_tokenize(sentence)
print(words)


doc = nlp(sentence)

tokens = []

for token in doc:
    tokens.append(token.text)

print(tokens)   

##outputs
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'information', 'engineering', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data']
['Natural', 'language', 'processing', '(', 'NLP', ')', 'is', 'a', 'subfield', 'of', 'linguistics', ',', 'computer', 'science', ',', 'information', 'engineering', ',', 'and', 'artificial', 'intelligence', 'concerned', 'with', 'the', 'interactions', 'between', 'computers', 'and', 'human', '(', 'natural', ')', 'languages', ',', 'in', 'particular', 'how', 'to', 'program', 'computers', 'to', 'process', 'and', 'analyze', 'large', 'amounts', 'of', 'natural', 'language', 'data']

Lemmatization and Stemming

This process means reducing words to their base form based on the language. Most of the time documents contain the same word in different forms, e.g. dogs, dog, dog’s, dogs’ are all the same word, or even “am”, “are”, “is” => “be”.

So lemmatization and stemming are both about reducing words to their base form.

Lemmatization is more advanced and converts words to their base form based on the language’s vocabulary.

Stemming converts a word to its base form simply by stripping affixes (prefixes and suffixes).

Let’s see this in practice

# stemming nltk
# import these modules 
from nltk.stem import PorterStemmer 
from nltk.tokenize import word_tokenize 
   
ps = PorterStemmer() 
  
# choose some words to be stemmed 
words = ["program", "programs", "programer", "programing", "programers"] 
  
for w in words: 
    print(w, " : ", ps.stem(w)) 

## output
program  :  program
programs  :  program
programer  :  program
programing  :  program
programers  :  program

As we can see above, stemming mainly reduces a word to its base form by stripping its affixes.

# nltk lemmatization

from nltk.stem import WordNetLemmatizer

# the WordNet lemmatizer needs the wordnet corpus
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
  
print("programers : ", lemmatizer.lemmatize("programers"))
print("am :", lemmatizer.lemmatize("am")) 
print("feet :", lemmatizer.lemmatize("feet")) 

# output
programers :  programers
am : am
feet : foot

Lemmatization does more than just stemming: it looked up “feet” in the vocabulary and returned “foot” instead of just chopping off an ending.
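
One detail worth noting (not covered in the code above) is that NLTK’s WordNet lemmatizer treats every word as a noun unless told otherwise, which is why “am” came back unchanged. Passing the part of speech fixes that:

# WordNetLemmatizer assumes pos='n' (noun) by default; pass pos='v' for verbs
print(lemmatizer.lemmatize("am", pos="v"))   # -> be
print(lemmatizer.lemmatize("are", pos="v"))  # -> be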

# spacy lemma
sentence = "Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data"

doc = nlp(sentence)

for token in doc:
    print(token.text , ":" , token.lemma_)

# spaCy doesn't support stemming; lemmatization is generally better, so stemming isn't needed.

POS Tagging

POS or part-of-speech tagging means assigning a part of speech to each word, such as noun, adjective, punctuation, etc. To implement this:

# pos tagging using nltk and spacy

# nltk's pos_tag needs the averaged perceptron tagger data
nltk.download('averaged_perceptron_tagger')

sentence = "Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data"

print(nltk.pos_tag(nltk.word_tokenize(sentence)))

doc = nlp(sentence)

for token in doc:
    print(token.text , ":" , token.lemma_, token.pos_, token.tag_)


##output
[('Natural', 'JJ'), ('language', 'NN'), ('processing', 'NN'), ('(', '('), ('NLP', 'NNP'), (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('subfield', 'NN'), ('of', 'IN'), ('linguistics', 'NNS'), (',', ','), ('computer', 'NN'), ('science', 'NN'), (',', ','), ('information', 'NN'), ('engineering', 'NN'), (',', ','), ('and', 'CC'), ('artificial', 'JJ'), ('intelligence', 'NN'), ('concerned', 'VBN'), ('with', 'IN'), ('the', 'DT'), ('interactions', 'NNS'), ('between', 'IN'), ('computers', 'NNS'), ('and', 'CC'), ('human', 'JJ'), ('(', '('), ('natural', 'JJ'), (')', ')'), ('languages', 'NNS'), (',', ','), ('in', 'IN'), ('particular', 'JJ'), ('how', 'WRB'), ('to', 'TO'), ('program', 'NN'), ('computers', 'NNS'), ('to', 'TO'), ('process', 'VB'), ('and', 'CC'), ('analyze', 'VB'), ('large', 'JJ'), ('amounts', 'NNS'), ('of', 'IN'), ('natural', 'JJ'), ('language', 'NN'), ('data', 'NNS')]
Natural : natural ADJ JJ
language : language NOUN NN
processing : processing NOUN NN
( : ( PUNCT -LRB-
NLP : NLP PROPN NNP
) : ) PUNCT -RRB-
is : be VERB VBZ
a : a DET DT
subfield : subfield NOUN NN
of : of ADP IN
linguistics : linguistic NOUN NNS
, : , PUNCT ,
computer : computer NOUN NN
science : science NOUN NN
, : , PUNCT ,
information : information NOUN NN
engineering : engineering NOUN NN
, : , PUNCT ,
and : and CCONJ CC
artificial : artificial ADJ JJ
intelligence : intelligence NOUN NN
concerned : concern VERB VBN
with : with ADP IN
the : the DET DT
interactions : interaction NOUN NNS
between : between ADP IN
computers : computer NOUN NNS
and : and CCONJ CC
human : human NOUN NN
( : ( PUNCT -LRB-
natural : natural ADJ JJ
) : ) PUNCT -RRB-
languages : language NOUN NNS
, : , PUNCT ,
in : in ADP IN
particular : particular ADJ JJ
how : how ADV WRB
to : to PART TO
program : program NOUN NN
computers : computer NOUN NNS
to : to PART TO
process : process VERB VB
and : and CCONJ CC
analyze : analyze VERB VB
large : large ADJ JJ
amounts : amount NOUN NNS
of : of ADP IN
natural : natural ADJ JJ
language : language NOUN NN
data : datum NOUN NNS
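
If some of the tags above look cryptic, spaCy ships a small glossary helper, spacy.explain, that returns a short description for any tag (this is just a handy extra, not part of the original examples):

# look up what the tag abbreviations mean
print(spacy.explain("JJ"))     # description of the fine-grained tag (token.tag_)
print(spacy.explain("PROPN"))  # description of the coarse-grained tag (token.pos_)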

Text Preprocessing

When working with NLP, preprocessing the data is one of the essential things to do. This includes removing extra spaces, unnecessary characters, HTML tags, etc.

One of the important aspects of preprocessing is removing “stop words”: common English words like “the”, “and”, “I”, etc. that don’t add much context to the data. There are many more steps involved in preprocessing, and we will look into those later on.

Let’s see this in practice using both NLTK and spaCy.

# simple stop words
import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords
print(stopwords.words("english"))

## output
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

So this returns a simple list of stop words for the English language.
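
spaCy also ships its own English stop word list, and every token carries an is_stop flag; here is a small sketch of the same idea on the spaCy side:

# spaCy's built-in English stop words
from spacy.lang.en.stop_words import STOP_WORDS
print(len(STOP_WORDS))

# every token carries an is_stop flag we can filter on
doc = nlp("this is a sentence about the dog")
print([token.text for token in doc if not token.is_stop])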

As we saw above, spaCy is a very simple and effective toolkit for doing most basic NLP tasks.

Let’s write a small generic function to clean our text, which we can reuse later as well.

def normalize(comment, lowercase=True, remove_stopwords=True):
    # optionally lowercase, then strip quotes and non-breaking spaces line by line
    if lowercase:
        comment = comment.lower()
    lines = comment.splitlines()
    lines = [x.strip(' ') for x in lines]
    lines = [x.replace('"', '') for x in lines]
    lines = [x.replace('\\"', '') for x in lines]
    lines = [x.replace(u'\xa0', u'') for x in lines]
    comment = " ".join(lines)
    doc = nlp(comment)

    # drop punctuation, and drop stop words only if remove_stopwords is set
    words = [token.text for token in doc
             if not token.is_punct and not (remove_stopwords and token.is_stop)]
    return " ".join(words)

    # if you want lemmas instead of the raw tokens, use this version instead:
    # lemmatized = []
    # for token in doc:
    #     lemma = token.lemma_.strip()
    #     if lemma and not (remove_stopwords and token.is_stop):
    #         lemmatized.append(lemma)
    # return " ".join(lemmatized)
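
As a quick usage sketch, calling it on the sentence from the earlier examples returns lowercased text with stop words and punctuation stripped:

# example: clean the sentence used in the earlier examples
print(normalize(sentence))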

This completes a list of basic NLP tasks and how to implement them. spaCy is a very powerful library and includes a lot more. It has very good documentation as well, and it’s worth going through https://spacy.io/usage/linguistic-features, as the library is much more powerful than what we saw above.

Also, here is the full source code for this blog post: https://colab.research.google.com/drive/10c6m61uK-Hat4ZmYbtPOM6C6Mp1TQBgu
