pyVision
A Machine Learning and Signal Processing toolbox



Natural Language Processing Pre-Processing: Stemming, Lemmatization, Stop Words

Introduction

The fundamental purpose of Natural Language Processing is to do some form of analysis, or processing, where the machine can understand, at least to some level, what the text means, says, or implies.

One of the basic steps in a Natural Language Processing pipeline is pre-processing the text.

This article describes some pre-processing steps that are commonly used in Information Retrieval (IR), Natural Language Processing (NLP) and text analytics applications.

  1. Stop word removal
  2. Stemming
  3. Lemmatization

Stop Word Removal

Simply put, stop words are words which we want to remove before performing any processing on the text. Stop words are high-frequency words that do not give any additional information about the text, but can bias an NLP task precisely because they occur so frequently.

For example, suppose we want to find the most important words in a piece of text. A simple technique to infer what the text is about is to count the occurrences of each word, and this frequency counting actually forms the basis of most advanced text analytics. However, if you analyze the text you will find that words like the, it, is, not and and sit at the top of the list, and this would be the case with any piece of text. Such an analysis reports these high-frequency but irrelevant words as the most important words, and it does not let you differentiate one document from another, since all documents have these words near the top of the list.

Stop word removal is a common-sense approach: remove the words that are irrelevant to the natural language task being performed before starting with the actual NLP processing.

There is no universal stop word list that applies to all applications. Since NLP is often domain dependent, the stop word list needs to be customized for your specific application and domain.

The stop words defined in the NLTK stop word list can be printed as follows:

from nltk.corpus import stopwords

# load the built-in English stop word list
stopWords = set(stopwords.words('english'))
print("number of stop words in NLTK " + str(len(stopWords)))
print(":".join(stopWords))

The output of the above program is as follows
number of stop words in NLTK 153
all:just:being:over:both:through:yourselves:its:before:o:hadn:herself:ll:had:should:to:only:won:under:ours:has:do:them:his:very:they:not:during:now:him:nor:d:did:didn:this:she:each:further:where:few:because:doing:some:hasn:are:our:ourselves:out:what:for:while:re:does:above:between:mustn:t:be:we:who:were:here:shouldn:hers:by:on:about:couldn:of:against:s:isn:or:own:into:yourself:down:mightn:wasn:your:from:her:their:aren:there:been:whom:too:wouldn:themselves:weren:was:until:more:himself:that:but:don:with:than:those:he:me:myself:ma:these:up:will:below:ain:can:theirs:my:and:ve:then:is:am:it:doesn:an:as:itself:at:have:in:any:if:again:no:when:same:how:other:which:you:shan:needn:haven:after:most:such:why:a:off:i:m:yours:so:y:the:having:once
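As a quick illustration of the removal step itself, the following minimal sketch filters the stop words out of a tokenized sentence (the sample sentence is illustrative):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stopWords = set(stopwords.words('english'))
sentence = "this is a sample sentence showing the removal of stop words"

# keep only the tokens that are not in the stop word list
filtered = [w for w in word_tokenize(sentence) if w.lower() not in stopWords]
print(filtered)

This prints ['sample', 'sentence', 'showing', 'removal', 'stop', 'words'].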

Maybe for the application at hand you would want to drop a word like won, or you would want to keep words like no, nor, not, of, off, on, once, only, or, other, ought, our, ours.

The following links contain more exhaustive lists of stop words:

  • https://www.link-assistant.com/seo-stop-words.html
  • https://meta.wikimedia.org/wiki/Stop_word_list/consolidated_stop_word_list

Such lists may be applicable for search engines, but for tasks like document/text classification or intent detection, words like what, why and when make important keywords. For example, consider identifying and answering questions like who is rahul gandhi and what is a palindrome: if we remove the stop words who and what, we may not be able to differentiate between the two statements.

There are ways to keep these words out of the stop word list and still achieve the task at hand. The main point of the above example is that you may need to customize the predefined stop word lists of the various NLP toolkits according to the NLP task and domain, as sketched below.
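For instance, a minimal sketch of such a customization, here simply subtracting the question words from NLTK's default set before filtering (the sentence and the retained words are chosen for illustration):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# start from the default NLTK list but retain the question words
customStopWords = set(stopwords.words('english')) - {'who', 'what', 'why', 'when'}

sentence = "what is a palindrome"
filtered = [w for w in word_tokenize(sentence) if w.lower() not in customStopWords]
print(filtered)

This prints ['what', 'palindrome'], so the question word that distinguishes the query survives the filtering.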

Stemming and Lemmatization

Stemming is the process of reducing a word to its stem, i.e. its root form. The root form is not necessarily a word by itself, but it can be used to generate words by concatenating the right suffix.

For example,

  • the words fish, fishes and fishing all stem into fish
  • banks and banking become bank
  • investing and invested become invest

Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

The most common algorithm for stemming English, and one that has repeatedly been shown to be empirically very effective, is Porter's algorithm.

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
sentence = "programming books are more better than others"
words = word_tokenize(sentence)

# print each token alongside its Porter stem
for word in words:
    print(word + ":" + ps.stem(word))

The output of the above program is as follows

programming:program
books:book
are:are
more:more
better:better
than:than
others:other

Lemmatization does full morphological analysis to accurately identify the lemma for each word. However, doing full morphological analysis produces at most very modest benefits for NLP. Neither form of normalization tends to improve NLP processing by a huge degree; it may help in some tasks and make little difference in others.

The NLTK lemmatization method is based on WordNet's built-in morphy function.

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

wordnet_lemmatizer = WordNetLemmatizer()

sentence = "programming books are more better than others"
words = word_tokenize(sentence)

# print each token alongside its WordNet lemma
for word in words:
    print(word + ":" + wordnet_lemmatizer.lemmatize(word))

The output of the above program is

programming:programming
books:book
are:are
more:more
better:better
than:than
others:others
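Note that the NLTK lemmatizer assumes every word is a noun unless told otherwise, which is why are and better pass through unchanged above. A small sketch of passing the part of speech explicitly (the word/POS pairs are chosen for illustration):

from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()

# the pos argument tells morphy which part of speech to assume:
# 'v' = verb, 'a' = adjective, 'n' = noun (the default)
print(wordnet_lemmatizer.lemmatize("are", pos="v"))     # be
print(wordnet_lemmatizer.lemmatize("better", pos="a"))  # good
print(wordnet_lemmatizer.lemmatize("books"))            # book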

Suppose again that we were to analyze a text by computing the relative frequency counts of words. Without stemming, fish, fishes and fishing would be treated as different words, as would banks and banking, even though grouping them together would give a better analysis.

By stemming them, we collapse the frequencies of the different inflections into a single term, in this case fish or bank.
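A minimal sketch of this effect, assuming a toy sentence and NLTK's FreqDist for the counting:

from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

ps = PorterStemmer()
text = "fishing for fish in a river full of fishes"

# stem each token so that the inflected forms count as one term
stems = [ps.stem(w) for w in word_tokenize(text)]
print(FreqDist(stems).most_common(3))

Here fish, fishes and fishing all contribute to the count of the single stem fish, which now tops the list.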
