pyVision
A Machine Learning and Signal Processing toolbox



Sequence classification with LSTM

In this article, we will look at how to use LSTM recurrent neural network models for sequence classification problems using the Keras deep learning library.

A standard dataset used to demonstrate sequence classification is sentiment classification on the IMDB movie review dataset.

Each movie review is a sentence, and the sentiment associated with each review must be determined. The training dataset consists of movie reviews and their associated sentiments. A sentence is modelled as a sequence of words, so we are looking at a sequence classification problem.

The dataset consists of 25,000 movie reviews from IMDB, labeled by sentiment (positive/negative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer “3” encodes the 3rd most frequent word in the data. This allows for quick filtering operations such as: “only consider the top 10,000 most common words, but eliminate the top 20 most common words”.
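
This frequency-based filtering maps directly onto arguments of the Keras IMDB loader. A minimal sketch (the variable names and cut-off values below are purely illustrative; the example later in this article loads the full, unfiltered dataset):

from keras.datasets import imdb

# keep only the 10,000 most frequent words and drop the 20 most frequent ones;
# words that are filtered out are replaced by an out-of-vocabulary marker index
(X_filt, y_filt), (X_filt_test, y_filt_test) = imdb.load_data(num_words=10000, skip_top=20)
print(len(X_filt), "training reviews")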

A sample review from the IMDB dataset is represented as follows:

 [1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]

For this example we will only take the first 50 words of each review. We will pad shorter reviews and truncate longer ones to this maximum review length in words.

After truncation and padding, the same review is represented as:

[    1    14    22    16    43   530   973  1622  1385    65   458  4468
    66  3941     4   173    36   256     5    25   100    43   838   112
    50   670 22665     9    35   480   284     5   150     4   172   112
   167 21631   336   385    39     4   172  4536  1111    17   546    38
    13   447]
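
This truncation and padding is done with pad_sequences. A small illustration on toy sequences (maxlen=5 is chosen purely for readability):

from keras.preprocessing import sequence

toy = [[1, 14, 22], [1, 14, 22, 16, 43, 530, 973]]
# padding="post" appends zeros, truncating="post" drops words beyond maxlen
print(sequence.pad_sequences(toy, maxlen=5, padding="post", truncating="post"))
# [[  1  14  22   0   0]
#  [  1  14  22  16  43]]

The complete code for loading and padding the dataset is:
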
import numpy as np

from keras.datasets import imdb
from keras.preprocessing import sequence

# load the dataset
(X_train, y_train), (X_test, y_test) = imdb.load_data()
print("Training data: ")
print(X_train.shape)
print("Testing data: ")
print(X_test.shape)

num_classes = np.max(y_train) + 1
print(num_classes, 'classes')

max_words = 50
X_train = sequence.pad_sequences(X_train, maxlen=max_words,padding="post",truncating="post")
X_test = sequence.pad_sequences(X_test, maxlen=max_words,padding="post",truncating="post")

A sentence can be modelled as a sequence of word indexes; however, there is no contextual relation between index 1 and index 2.

We will use an LSTM to model the sequences, where the input to the LSTM is the sequence of indexes representing the words and the output is the sentiment associated with the sentence.

The input to an LSTM is a 3D tensor with shape (batch_size, timesteps, input_dim). In our present case the batch size will be the size of the training data, the number of timesteps will be the sequence length of 50, and the input dimension will be 1, since each element of the sequence is a single integer index. The padded 2D arrays therefore need to be reshaped to (samples, 50, 1) before being fed to the LSTM, as done in the code below.

import logging

from keras.models import Sequential, model_from_json
from keras.layers import Dense
from keras.layers import LSTM
from keras.utils import to_categorical

def save_model(model, filename):
    logging.info("saving the model %s" % (filename))
    model_json = model.to_json()
    with open(filename + '.model', "w") as json_file:
        json_file.write(model_json)
    model.save_weights(filename + ".weights")

def load_model(filename):
    logging.info("loading the model %s" % (filename))
    with open(filename + '.model', 'r') as json_file:
        loaded_model_json = json_file.read()
    loaded_model = model_from_json(loaded_model_json)
    loaded_model.load_weights(filename + ".weights")

    return loaded_model
		
# reshape the padded 2D arrays to the (samples, timesteps, input_dim) shape
# expected by the LSTM, with a single feature (the word index) per timestep
X_train = np.reshape(X_train, (X_train.shape[0], max_words, 1))
X_test = np.reshape(X_test, (X_test.shape[0], max_words, 1))

# one-hot encode the labels for use with categorical crossentropy
y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)

model = Sequential()

model.add(LSTM(50, return_sequences=False, input_shape=(X_train.shape[1], X_train.shape[2])))
model.add(Dense(num_classes, activation='softmax'))
print(model.summary())
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=1000, batch_size=100, verbose=1)

save_model(model, "/tmp/model1")

model.evaluate(X_test, y_test)

We obtain a training accuracy of 58% after 75 iterations.

In this approach, words are simply represented by their word indexes, and the sequences of indexes are used to train the LSTM. The word indexes themselves do not carry any information and are, in general, orthogonal to one another. Word indexes could be represented with one-hot vector encoding, but for large dictionary sizes such a representation becomes too large and occupies a lot of memory and computational resources.
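
To make the size argument concrete, here is a small illustration (the vocabulary size of 90,000 is an assumed, purely illustrative figure):

import numpy as np

vocab_size = 90000            # assumed vocabulary size, for illustration only
word_index = 4468             # index of a single word from the review above
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0     # a single non-zero entry in a 90,000-dimensional vector
print(one_hot.shape, int(one_hot.sum()))   # (90000,) 1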

Word Embeddings

Word embedding refers to a class of approaches for representing words and documents using dense vectors. The position of a word in this N-dimensional space is referred to as its embedding. Word embeddings are learned from text and are based on the words that surround a word when it is used, i.e. the context in which words appear.

If we represent words by their index in the vocabulary, techniques like one-hot encoding produce large, sparse vectors. Word vectors can thus also be viewed as a method to cast a high-dimensional sparse representation into a low-dimensional dense one.

Keras offers an Embedding layer that can be used for neural networks on text data.

For each input sequence of words (input document), the output of the Embedding layer is a 2D tensor with one embedding vector per word; for a batch of documents the output is therefore a 3D tensor of shape (batch_size, sequence_length, output_dim).

The signature of the Keras Embedding layer is

keras.layers.Embedding(input_dim, output_dim, embeddings_initializer='uniform', embeddings_regularizer=None, activity_regularizer=None, embeddings_constraint=None, mask_zero=False, input_length=None)

where

  • input_dim is the vocabulary size
  • output_dim is the dimension of the output embedding
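
To illustrate the input and output shapes, here is a minimal standalone sketch of the Embedding layer in isolation (the vocabulary size, embedding dimension and random input below are arbitrary, purely illustrative choices):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

demo = Sequential()
demo.add(Embedding(input_dim=1000, output_dim=8, input_length=50))
demo.compile('rmsprop', 'mse')

batch = np.random.randint(1000, size=(2, 50))   # 2 sequences of 50 word indexes
print(demo.predict(batch).shape)                # (2, 50, 8)

Returning to the IMDB model, the padded sequences are flattened back to 2D and fed through an Embedding layer followed by an LSTM: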
		
# flatten the 3D tensors back to (samples, max_words); the Embedding layer
# expects 2D integer input of shape (batch_size, input_length)
X_train = np.reshape(X_train, (X_train.shape[0], max_words))
X_test = np.reshape(X_test, (X_test.shape[0], max_words))

print("input tensor")
print(X_train.shape, y_train.shape)

from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding

# the Embedding layer needs the vocabulary size: one more than the largest word index
vocab_size = max(np.max(X_train), np.max(X_test)) + 1

model = Sequential()

model.add(Embedding(vocab_size, 300, input_length=max_words))
model.add(LSTM(128, return_sequences=False))
model.add(Dense(num_classes, activation='softmax'))
print(model.summary())
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train,y_train,epochs=1000, batch_size=100, verbose=1)

save_model(model,"/tmp/model1")

model.evaluate(X_test,y_test)
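
The saved model can later be reloaded with the load_model helper defined earlier and used for inference; a minimal sketch, assuming the same padded 2D X_test as above:

loaded = load_model("/tmp/model1")
loaded.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
probs = loaded.predict(X_test[:1])       # class probabilities for one review
print(probs.argmax(axis=-1))             # IMDB labels: 0 = negative, 1 = positive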

We can see that by adding an embedding layer we obtain an accuracy of 73% after just 1 iteration. Thus, simply adding an embedding layer, which encapsulates the context of words to arrive at dense word vector embeddings, provides significantly higher accuracy.

References

  • https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/
  • http://pi19404.github.io/pyVision/2017/05/13/spacy1/
  • https://keras.io/preprocessing/sequence/
  • https://sandipanweb.wordpress.com/2017/03/30/sentiment-analysis-using-linear-model-classifier-with-hinge-loss-and-l1-penalty-with-language-model-features-and-stochastic-gradient-descent-in-python/