pyVision
A Machine Learning and Signal Processing toolbox



NLP: spaCy Word and Document Vectors

Introduction

spaCy is a library for advanced natural language processing in Python and Cython.

An elementary step in NLP applications is to convert text into mathematical representations that can be processed by various NLP algorithms.

For automatic natural language processing, it is often more effective to use dictionaries that define concepts in terms of their usage statistics.

The word2vec family of models is the most popular way of creating these dictionaries. Given a large sample of text, word2vec gives you a dictionary where each definition is just a row of, say, 300 floating-point numbers. To find out whether two entries in the dictionary are similar, you ask how similar their definitions are, which is a well-defined mathematical operation.
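
For example, the snippet below (a minimal sketch; the words "cat", "dog" and "banana" are arbitrary choices) compares two pairs of words with spaCy's similarity method, which returns the cosine similarity of the word vectors, and then recomputes the same value directly from the raw vectors:

import spacy
import numpy as np

nlp = spacy.load('en')

cat, dog, banana = nlp(u"cat dog banana")

# Token.similarity returns the cosine similarity of the word vectors
print(cat.similarity(dog))     # related words -> higher score
print(cat.similarity(banana))  # unrelated words -> lower score

# the same value computed explicitly from the raw vectors
cosine = np.dot(cat.vector, dog.vector) / (
    np.linalg.norm(cat.vector) * np.linalg.norm(dog.vector))
print(cosine)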

Modern NLP techniques represent words by fixed-dimensional feature vectors. The GloVe Common Crawl vectors have become a de facto standard for practical NLP. The default English model used by spaCy provides vectors for one million vocabulary entries: 300-dimensional vectors trained on the Common Crawl corpus with the GloVe algorithm.
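
As a quick check of this claim (assuming the default English model with GloVe vectors is installed), the vector attached to any in-vocabulary token has 300 dimensions:

import spacy

nlp = spacy.load('en')

token = nlp(u"programming")[0]

# each vocabulary entry maps to a 300-dimensional GloVe vector
print(len(token.vector))  # 300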

Word Vectors

The program below prints the word vector for each word/token in a sentence:

import spacy

nlp = spacy.load('en')

sentence = "programming books are more better than others and"
doc = nlp(sentence)
for word in doc:
    # text, lemma, part-of-speech tags and the 300-dimensional word vector
    print(word.text, word.lemma, word.lemma_, word.tag, word.tag_,
          word.pos, word.pos_, word.has_vector, word.vector)

By default, Token.vector returns the vector for the underlying token.
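
A token that is missing from the model's vocabulary has no vector of its own; has_vector flags this, as in the sketch below (the string "afskfsd" is just an arbitrary out-of-vocabulary token):

import spacy

nlp = spacy.load('en')
doc = nlp(u"programming afskfsd")

for token in doc:
    # out-of-vocabulary tokens report has_vector == False
    # and an all-zero vector (vector_norm == 0.0)
    print(token.text, token.has_vector, token.vector_norm)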

Document Vectors

spaCy also provides a document vector, computed as the average of all the word/token vectors that occur in a sentence.

import spacy
import numpy as np

nlp1 = spacy.load('en')

sentence = u"programming books are more better than others and"

doc = nlp1.tokenizer(sentence)

print("length of sentence vector is ", len(doc.vector))
print("sentence vector is ", doc.vector[0], doc.vector[1], doc.vector[2])
print("number of tokens is ", len(doc))

# collect the vectors of all the tokens that have one
vec = []
for token in doc:
    if token.has_vector:
        vec.append(token.vector)

# average of the token vectors
v1 = np.sum(vec, axis=0) / len(vec)

print(v1[0], v1[1], v1[2])

We can see that doc.vector[0], doc.vector[1] and doc.vector[2] are equal to v1[0], v1[1] and v1[2] respectively.
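
To verify that the match holds over all 300 dimensions rather than just the first three, we can rebuild the average and compare the two vectors directly; this is a small self-contained check using numpy.allclose:

import spacy
import numpy as np

nlp = spacy.load('en')
doc = nlp.tokenizer(u"programming books are more better than others and")

# average of the individual token vectors
v1 = np.mean([token.vector for token in doc if token.has_vector], axis=0)

# doc.vector should match the average across all 300 dimensions
print(np.allclose(doc.vector, v1))  # expect True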

Although this is not the best way to compute a document vector, it gives good results in a lot of applications.

By representing a document as an average of word vectors, we lose the local information carried by the individual word vectors and their order. The document vector captures global information at the cost of local information.
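
For instance, two sentences that contain the same words in a different order, and hence mean different things, end up with identical averaged document vectors. The small sketch below illustrates this:

import spacy
import numpy as np

nlp = spacy.load('en')

# same words, different order, different meaning
doc1 = nlp.tokenizer(u"man bites dog")
doc2 = nlp.tokenizer(u"dog bites man")

# averaging discards word order, so the document vectors coincide
print(np.allclose(doc1.vector, doc2.vector))  # expect True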

We will look at better ways to compute document vectors in future articles.

References

  • https://spacy.io/