pyVision
A Machine Learning and Signal Processing toolbox



NLP Dictionary, Bag of Words and TF-IDF using Gensim

In natural language processing, one of the most common questions is how to convert a sentence into some kind of numeric representation for machine learning algorithms.

One of the elementary ways of doing this is to represent a sentence by the relative frequency counts of the words occurring in it.

Given a dictionary, we can associate an index with every word occurring in the text document.

A dictionary can be constructed from the training corpus, or we can use a predefined dictionary containing a word list.

We will be using the Gensim NLP software for topic modelling in this article.
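For instance, here is a minimal sketch of how a Gensim dictionary associates an index with every word (the toy corpus below is made up for illustration):

    from gensim import corpora

    # hypothetical toy corpus: each document is a list of lowercase tokens
    documents = [['the', 'cat', 'sat'], ['the', 'dog', 'ran']]

    # build a dictionary mapping every unique word to an integer index
    dictionary = corpora.Dictionary(documents)
    print dictionary.token2id
    # e.g. {'cat': 0, 'sat': 1, 'the': 2, 'dog': 3, 'ran': 4}
    # (the exact index assignment may vary with the Gensim version)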

We will be using Webster's dictionary to build the dictionary for NLP purposes. Download the dictionary.txt file from the https://github.com/pi19404/dictionary GitHub repository.

The code to create the dictionary file is as follows:

    def saveDictionary(self, source, dest):
        # set up the input and output file paths
        dict_file = self.cwd + '/' + source
        dest_file = self.cwd + '/' + dest

        # read the input text file
        f = open(dict_file, 'r')
        lines = f.readlines()
        f.close()

        # tokenize the data
        tokenize_data = [[word for word in line.lower().split()] for line in lines]

        # create and save the dictionary file
        dictionary = corpora.Dictionary(tokenize_data)
        dictionary.save(dest_file)

    def loadDictionary(self, filename):
        # load a dictionary previously saved by saveDictionary
        dict_file = self.cwd + '/' + filename
        dictionary = corpora.Dictionary.load(dict_file)
        return dictionary

The saveDictionary function accepts a text file source as input and creates a dictionary file dest as output.

The loadDictionary function loads a dictionary from the file saved by saveDictionary.
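A minimal usage sketch, assuming the methods above belong to a wrapper class (here called NLPUtils, a hypothetical name) that sets self.cwd to the working directory:

    # NLPUtils is an assumed wrapper class defining self.cwd
    # and the saveDictionary/loadDictionary methods above
    utils = NLPUtils()
    utils.saveDictionary('dictionary.txt', 'websters.dict')
    dictionary = utils.loadDictionary('websters.dict')
    print dictionary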

Next, to represent a sentence we use the bag-of-words model, where every word in a sentence is associated with a dictionary index and the frequency count of the word's occurrence in the sentence.

    def bowfeature(self, sentence, dictionary):
        # convert a tokenized sentence to its bag-of-words representation
        corpus = dictionary.doc2bow(sentence)
        return corpus
				
tokens="do you know to go to market"
tokenize_data = [word for word in tokens.lower().split()]

feature1=bowfeature(tokenize_data,dict)				
print feature1

The bowfeature function accepts a tokenized sentence and a dictionary object as input and returns the bag-of-words representation of the sentence.

For example, the text "do you know to go to market" has the bag-of-words representation [(29, 1), (116, 2), (928, 1), (1688, 1), (3685, 1), (23187, 1)]:

do -> (1688, 1)
you -> (29, 1)
know -> (3685, 1)
to -> (116, 2)
go -> (928, 1)
market -> (23187, 1)

Note that "to" occurs twice in the sentence, so index 116 has a count of 2.
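To verify this mapping, we can look up each index in the Dictionary object, since Gensim supports index-to-word lookup:

    # map each (index, count) pair of the bow vector back to its word
    for word_id, count in feature1:
        print dictionary[word_id], '->', (word_id, count)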

Each sentence may contain a variable number of words, and its mathematical representation consists of a sequence of word indices and the associated frequency counts of the words occurring in the sentence, with respect to the entire corpus being analyzed.

Tf-idf

Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.

Typically, the tf-idf weight is composed of two terms: the first computes the normalized Term Frequency (TF), i.e. the number of times a word appears in a document divided by the total number of words in that document; the second is the Inverse Document Frequency (IDF), computed as the logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.

The TF term computes the relative frequency count of words occurring in the document/sentence being analyzed, measuring how important a word is within that document. The IDF term computes the importance of a word with respect to the entire corpus: words which occur very frequently in the corpus are assigned lower values than words which occur rarely.
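As a worked sketch of these definitions in plain Python (independent of Gensim; it uses the log base 2 convention that matches the worked example below):

    import math

    def tfidf_manual(docs):
        # docs: a list of tokenized documents
        n_docs = len(docs)

        # document frequency: number of documents containing each word
        df = {}
        for doc in docs:
            for word in set(doc):
                df[word] = df.get(word, 0) + 1

        results = []
        for doc in docs:
            # term frequency: word count divided by document length
            tf = dict((w, float(doc.count(w)) / len(doc)) for w in set(doc))
            # weight = tf * idf, with idf = log2(n_docs / df)
            weights = dict((w, tf[w] * math.log(float(n_docs) / df[w], 2)) for w in tf)
            results.append(weights)
        return results

    docs = [['do', 'you', 'know', 'to', 'cook'],
            ['do', 'know', 'to', 'go', 'to', 'market']]
    print tfidf_manual(docs)
    # words occurring in both documents get weight 0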

    def computetfidf(self, features):
        # features: a corpus of bag-of-words vectors
        tfidf = models.TfidfModel(features)
        return tfidf
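A usage sketch (bow1 and bow2 are assumed names for the bag-of-words vectors of two sentences, computed with bowfeature):

    # build the tf-idf model from a corpus of bag-of-words vectors
    corpus = [bow1, bow2]
    tfidf = models.TfidfModel(corpus)

    # apply the model to each bow vector to get its tf-idf weights
    for bow in corpus:
        print bow, tfidf[bow]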

Let us consider the following two statements as a toy example. Each line shows the tokenized sentence and its bag-of-words vector:

([u'do', u'you', u'know', u'to', u'cook'], [(29, 1), (116, 1), (1688, 1), (3685, 1), (83291, 1)])
([u'do', u'know', u'to', u'go', u'to', u'market'], [(116, 2), (928, 1), (1688, 1), (3685, 1), (23187, 1)])

The tf-idf model maps each bag-of-words vector to its tf-idf vector:

[(116, 2), (928, 1), (1688, 1), (3685, 1), (23187, 1)] -> [(928, 0.7071067811865475), (23187, 0.7071067811865475)]
[(29, 1), (116, 1), (1688, 1), (3685, 1), (83291, 1)] -> [(29, 0.7071067811865475), (83291, 0.7071067811865475)]

The TF terms for the two sentences are

[(116, 2/6), (928, 1/6), (1688, 1/6), (3685, 1/6), (23187, 1/6)]
[(29, 1/5), (116, 1/5), (1688, 1/5), (3685, 1/5), (83291, 1/5)]

The IDF terms are log2(2/2) = 0 for words appearing in both documents and log2(2/1) = 1 for words appearing in only one:

[(116, 0), (928, 1), (1688, 0), (3685, 0), (23187, 1)]
[(29, 1), (116, 0), (1688, 0), (3685, 0), (83291, 1)]

Now, multiplying the TF and IDF terms we get

[(116, 0), (928, 1/6), (1688, 0), (3685, 0), (23187, 1/6)]
[(29, 1/5), (116, 0), (1688, 0), (3685, 0), (83291, 1/5)]

We can normalize the tf-idf vector of each document to unit length:

[(116, 0), (928, 1/sqrt(2)), (1688, 0), (3685, 0), (23187, 1/sqrt(2))]
[(29, 1/sqrt(2)), (116, 0), (1688, 0), (3685, 0), (83291, 1/sqrt(2))]
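Note that 1/sqrt(2) ≈ 0.7071067811865475, which matches the Gensim output in the toy example above. A quick arithmetic check in plain Python:

    import math

    # unnormalized tf-idf weights of the two non-zero terms in a document
    weights = [1.0 / 6, 1.0 / 6]
    norm = math.sqrt(sum(w * w for w in weights))
    print [w / norm for w in weights]
    # [0.7071067811865475, 0.7071067811865475]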

Thus we can see that tf-idf assigns zero values to terms which are repeated throughout the corpus, while assigning non-zero values to terms that are less frequent in the corpus.

The tf-idf process automatically assigns low values to stop words, thereby eliminating the need for a stop-word removal preprocessing stage in the NLP pipeline.

Thus, given a set of sentences, we have obtained a mathematical representation for them.

References

  • https://radimrehurek.com/gensim/index.html