pyVision
A Machine Learning and Signal Processing toolbox



NLP Latent semantic indexing using Gensim

In the article http://34.194.184.172/emotix/index.php/2017/05/15/nlp-dictionarybag-of-words-and-tfidf-using-gensim/ we saw how to compute the tf-idf vectors of documents, enabling us to represent a sentence/document using a mathematical representation.

However, each sentence/document is represented by a feature vector of a different length.

To perform tasks like finding which two documents are similar, or finding documents similar to a given document, a typical approach is to compute the term-document matrix, where the rows consist of the terms and the columns consist of the documents. By analyzing which terms occur in which documents, we can compute the similarity between document pairs.
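As a minimal sketch of this idea (the toy corpus and raw term counts here are illustrative, not from the article), a term-document matrix and pairwise cosine similarities can be computed with plain numpy:

import numpy as np

# Hypothetical toy corpus; each document is a list of tokens
docs = [["weather", "in", "goa"],
        ["distance", "from", "goa"],
        ["weather", "in", "mumbai"]]

# Build the vocabulary (the rows of the term-document matrix)
vocab = sorted({w for d in docs for w in d})
index = {w: i for i, w in enumerate(vocab)}

# Rows are terms, columns are documents; entries are raw counts
td = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d:
        td[index[w], j] += 1

# Cosine similarity between document columns reveals related pairs
norms = np.linalg.norm(td, axis=0)
sim = (td.T @ td) / np.outer(norms, norms)
print(sim)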

For performing tasks like finding the topic of a document, or recommendation, we need a discriminative classifier.

Whenever we want to compute similarity, we need to compare the input document/sentence against all the sentences/documents in the term-document matrix. We can then take a weighted count of the documents the input document is similar to, and assign the input document to the class that has registered the maximum value of the similarity measure. This simple approach may be satisfactory for a host of applications, especially if we want to run it only once and offline.
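A minimal sketch of this brute-force approach using gensim's MatrixSimilarity (the training sentences here are hypothetical):

from gensim import corpora, models, similarities

# Hypothetical training documents, already tokenized
texts = [["what", "is", "the", "weather", "in", "goa"],
         ["how", "far", "is", "goa", "from", "mumbai"],
         ["convert", "2", "pounds", "into", "ounces"]]

dictionary = corpora.Dictionary(texts)
bow = [dictionary.doc2bow(t) for t in texts]
tfidf = models.TfidfModel(bow)

# Index the whole tf-idf corpus; every query is compared against all documents
index = similarities.MatrixSimilarity(tfidf[bow], num_features=len(dictionary))

query = dictionary.doc2bow(["weather", "in", "mumbai"])
print(index[tfidf[query]])   # similarity of the query to each training document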

Computing similarity over the entire term-document matrix is not a feasible method of discriminative classification, especially if we want to run the algorithm in real time.

Latent Semantic Analysis (LSA) arose from the problem of how to find relevant documents from search words. LSA assumes that words that are close in meaning will occur in similar documents and hence belong to the same topic.

This enables us to:

  • generate latent topics/models, each represented by a fixed-dimensional vector
  • represent documents as linear combinations of the topic vectors
  • convert the variable-length feature vector representing a document into a fixed-length feature vector
  • convert a long feature vector into a short one, thereby reducing the computational load

In simple terms, LSA uses SVD (singular value decomposition) to perform dimensionality reduction on the tf-idf vectors. It groups documents into latent topics/classes and mathematically represents each document as a linear combination of vectors representing those topics/classes.
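The reduction itself is ordinary truncated SVD. A minimal numpy sketch, assuming a small made-up tf-idf matrix:

import numpy as np

# Hypothetical dense term-document tf-idf matrix: 6 terms x 4 documents
X = np.random.rand(6, 4)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of latent topics to keep
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Columns of Vt_k (scaled by the singular values) are the documents
# expressed in k-dimensional topic space
doc_topics = (np.diag(s_k) @ Vt_k).T
print(doc_topics)   # each row: one document as k topic weights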

The term-document matrix is converted to a topic-document matrix. Some words/terms, i.e. dimensions of the matrix, are combined so that they depend on more than one term. The rank lowering is expected to merge the dimensions associated with terms that have similar meanings.

Words belonging to similar documents are expected to be merged into a single dimension representing the topic/concept expressed by those words.

To construct the topic models or vectors, the LSI algorithm accepts as input the tf-idf vectors and the dictionary used to construct them.


from gensim import models

# 'bow' is the bag-of-words corpus and 'dictionary' the gensim Dictionary
# built in the earlier tf-idf article
tfidf = models.TfidfModel(bow, normalize=False)
corpus_tfidf = tfidf[bow]

NO_OF_TOPICS = 10
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=NO_OF_TOPICS)
corpus_lsi = lsi[corpus_tfidf]

# Each topic printed as a weighted combination of its most significant words
for topic in lsi.print_topics():
    print(topic)

# Each document printed as a vector of up to 10 topic weights
for doc in corpus_lsi:
    print(doc)
		 

For example, a sentence from the corpus has the following tf-idf vector:

[(116, 0.08954685018891612), (292, 0.35664692448848984), (23575, 0.468033500575449), (76631, 0.6858442297536098), (396448, 0.4187441554540362)]

The topic vectors are expressed as linear combinations of words:

(0, u'-0.491*"2" + -0.377*"a" + -0.366*"into" + -0.241*"to" + -0.235*"ounces" + -0.227*"pounds" + -0.199*"1" + -0.186*"pound" + -0.174*"gram" + -0.172*"years"')
(1, u'-0.408*"the" + -0.383*"weather" + -0.366*"what" + -0.318*"in" + -0.298*"is" + -0.261*"\'s" + -0.211*"distance" + -0.164*"from" + 0.145*"into" + -0.144*"how"')
(2, u'0.440*"1" + -0.298*"2" + -0.293*"into" + 0.285*"pound" + 0.281*"to" + 0.229*"us" + 0.220*"liquid" + 0.208*"a" + 0.208*"gallon" + -0.201*"ounces"')
(3, u'-0.403*"from" + -0.311*"far" + -0.301*"distance" + 0.291*"weather" + 0.260*"\'s" + -0.250*"is" + -0.249*"how" + 0.229*"in" + -0.218*"road" + -0.218*"by"')
(4, u'0.440*"us" + 0.416*"liquid" + 0.386*"gallon" + -0.336*"a" + -0.297*"pound" + -0.205*"gram" + 0.184*"into" + 0.183*"imperial" + -0.129*"tonne" + -0.127*"milligram"')
(5, u'-0.418*"indian" + -0.363*"rupees" + -0.340*"dollar" + -0.285*"convert" + 0.228*"distance" + -0.205*"rupee" + -0.204*"in" + 0.170*"a" + -0.166*"you" + -0.166*"do"')
(6, u'0.632*"ounce" + 0.505*"an" + -0.349*"pounds" + 0.247*"into" + -0.180*"a" + -0.141*"2" + -0.125*"gram" + -0.125*"to" + 0.098*"years" + 0.085*"1"')
(7, u'-0.403*"distance" + 0.375*"far" + -0.340*"between" + 0.241*"road" + 0.241*"by" + -0.227*"and" + 0.221*"from" + 0.214*"how" + 0.182*"weather" + 0.154*"goa"')
(8, u'-0.724*"years" + 0.433*"pounds" + 0.298*"ounces" + 0.173*"grams" + -0.159*"a" + 0.154*"ounce" + -0.148*"minutes" + 0.111*"an" + -0.107*"weeks" + -0.079*"seconds"')
(9, u'0.673*"pound" + -0.543*"gram" + 0.222*"ounces" + 0.209*"into" + -0.190*"ounce" + -0.169*"an" + -0.128*"grams" + -0.128*"2" + -0.106*"a" + -0.093*"in"')

Each line shows the 10 highest-weighted words contributing to that topic vector.

After performing LSI over the corpus, the same sentence is represented in topic space as:

[(0, -0.051051416577990574), (1, -0.036492033868053303), (2, 0.074519246143707196), (3, -0.018580909401619147), (4, 0.016247308095280522), (5, -0.24649553328692864), (6, -0.01305561872828372), (7, -0.1020150188870306), (8, 0.00079849557800111158), (9, -0.035308315739288908)]

Thus we converted the set of documents to fixed-length feature vectors of dimension 10.

Now, since each sentence/document is expressed as a fixed-length vector, a host of machine learning algorithms open up that enable us to perform discriminative classification.
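For instance, reusing corpus_lsi and NO_OF_TOPICS from the snippet above, the LSI vectors can be densified and fed to a standard classifier. The labels and the scikit-learn model here are illustrative assumptions, not part of the original pipeline:

from gensim import matutils
from sklearn.linear_model import LogisticRegression

# Densify the LSI corpus: rows are documents, columns the 10 topic weights
X = matutils.corpus2dense(corpus_lsi, num_terms=NO_OF_TOPICS).T

# Hypothetical class labels, one per training document
# (e.g. 0 = unit conversion, 1 = weather query)
y = [0, 1, 0, 1, 0]

clf = LogisticRegression()
clf.fit(X, y)
print(clf.predict(X[:2]))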

As seen above, LSI gives us a transformation to convert any document into a fixed-length feature representation. Once the LSI model is created, we can use it to transform another set of documents unrelated to the training dataset used to generate the LSI model.
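Transforming an unseen document follows the usual gensim pattern, reusing the dictionary, tfidf and lsi objects created above (whitespace tokenization for brevity):

# A new sentence never seen during training
new_doc = "convert 5 pounds into grams".split()

# Fold the unseen document into the existing LSI space:
# tokens -> bag of words -> tf-idf -> 10-dimensional topic vector
vec_bow = dictionary.doc2bow(new_doc)
vec_lsi = lsi[tfidf[vec_bow]]
print(vec_lsi)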

We can train the LSI model on a very large corpus collected over several years, consisting of several million documents. Such a model will not be updated frequently, due to computation and time constraints.

In contrast, the training database for a classification or recommendation task may be smaller, and might be updated regularly or whenever new content becomes available.

Since LSI captures the hidden context in the articles, any new documents belonging to the same domain will exhibit similar characteristics. Incremental addition of a few thousand documents will not significantly affect an LSI model computed over a few million documents.
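If the model does need refreshing, gensim's LsiModel supports incremental updates via add_documents. A sketch, assuming the new batch is preprocessed with the same dictionary and tf-idf model as the original corpus:

# Hypothetical batch of new documents, tokenized the same way as the
# training corpus
new_texts = [["convert", "3", "tonnes", "into", "pounds"]]
new_bow = [dictionary.doc2bow(t) for t in new_texts]

# Fold the new tf-idf documents into the existing model in place,
# without retraining from scratch
lsi.add_documents(tfidf[new_bow])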

However, we cannot train the LSI model on a medical corpus and expect it to perform well on a news corpus.

Transformation of a document into a fixed-length feature vector using LSI can be considered a pre-processing step for other machine learning algorithms. The LSI process can be abstracted as a method to convert any text into a fixed-length vector representation.
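One way to express this abstraction is a small helper; the function name and whitespace tokenization are illustrative assumptions:

from gensim import matutils

def text_to_vector(text, dictionary, tfidf, lsi, num_topics):
    # Whitespace tokenization for brevity; a real pipeline would reuse
    # the tokenizer that built the dictionary
    bow = dictionary.doc2bow(text.lower().split())
    # Dense numpy vector of length num_topics, ready for any ML algorithm
    return matutils.sparse2full(lsi[tfidf[bow]], num_topics)

vec = text_to_vector("what is the weather in goa", dictionary, tfidf, lsi, NO_OF_TOPICS)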

References

  • https://en.wikipedia.org/wiki/Latent_semantic_analysis#Rank_lowering
  • https://radimrehurek.com/gensim/tut1.html