Intuitive guide to word embeddings and RNNs (SimpleRNN, LSTM), with a step-by-step implementation in Keras for spam detection

Hemant Ranvir
9 min read · Jun 20, 2019

This tutorial will guide you through the implementation and give you an intuitive grasp of what is actually happening underneath RNN networks.

There has been extensive writing on this subject, but I could not find a single source with a complete walk-through of word embeddings (what, why and how) and of SimpleRNN & LSTM (with detailed structure diagrams, calculation of the number of parameters, etc.) in one place. Hence this is an effort to do the same. I will cite the appropriate sources from which the material has been referred.

We will be using a CSV file containing messages and their labels (whether each one is spam or ham).

Word Embedding

First of all, our neural network cannot understand words (that's what we are trying to make it learn). So we need an efficient way of representing words as mathematical objects.

Off the top of one's head, one might ask: why not assign a unique integer to each unique word? Sure enough, that is one way of doing things, as shown below:

Words to unique integers

One can also go further and use one-hot encoding, dedicating one bit to each word, as shown below:

Words to one-hot encoding

These are also called sparse representations, as you can see there are a lot of zeros in each word's vector representation.
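To make the sparsity concrete, here is a minimal NumPy sketch (with a made-up five-word vocabulary) showing both the integer encoding and the one-hot encoding of a short sentence:

import numpy as np

# Toy vocabulary: word -> unique integer
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}

# Integer encoding of a sentence
sentence = ["the", "cat", "sat", "on", "the", "mat"]
int_encoded = [vocab[w] for w in sentence]   # [0, 1, 2, 3, 0, 4]

# One-hot encoding: one row per word, mostly zeros
one_hot = np.zeros((len(sentence), len(vocab)))
one_hot[np.arange(len(sentence)), int_encoded] = 1
print(int_encoded)
print(one_hot)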

These ways of representation have a couple of drawbacks. First of all, no relation between words is captured. We want the mathematical structure of a word representation to hold meaning, rather than being a simple integer standing in for a word. It would be good if there was some way to place similar words close together. Secondly, the sparse representation needs bigger and bigger vectors as our vocabulary grows, so it is not efficient.

That's where word embeddings come in. They have a mathematical structure that represents words more efficiently; specifically, we use a dense representation to capture the relations between words. In such a representation, words with similar meanings are closer to each other in vector space.

Word vectors in 2D (similar words are closer to each other) (source)

As seen above, this is a much better representation, as we can capture word similarity by the closeness between two vectors. Also, here we are using just 2 dimensions for so many words (a dense representation), while the earlier methods would require many more dimensions.

There are three ways to calculate these word vectors for the vocabulary you are using.

  • Train the word vectors on the fly, as part of the neural network
  • Calculate the word vectors beforehand and use them in the network
  • Use a pre-trained embedding matrix

We are going to focus on the first method; for the second and third methods, I would recommend visiting this article for a detailed explanation.

For the first method, what we really want is a word vector for each word. Hence we can imagine a matrix with one row per unique word in our vocabulary, each row holding that word's vector. We will call this matrix the word embedding matrix.

Word Embedding Matrix (source)

This matrix has as many rows as there are unique words in the vocabulary, and the number of columns is a user-specified hyperparameter (the dimensionality of the vector space; in our example it is 32). We can keep such a layer at the beginning of the network and train the embedding matrix together with the rest of the network on our custom data set.
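Conceptually the embedding layer is just a lookup table: row i of the matrix is the word vector of the word assigned integer i. Here is a minimal NumPy sketch of that idea (with a hypothetical 10-word vocabulary and randomly initialised weights; in the real network these weights are learned during training):

import numpy as np

vocab_size = 10        # number of unique words (rows)
embedding_dim = 32     # user-specified vector size (columns)

# Randomly initialised embedding matrix; in practice these weights are learned
embedding_matrix = np.random.uniform(-1, 1, size=(vocab_size, embedding_dim))

word_index = 3                              # integer assigned to some word
word_vector = embedding_matrix[word_index]  # its 32-dimensional word vector
print(word_vector.shape)                    # (32,)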

Keras provides us with such an Embedding layer to train; we will use it as our network's first layer.

Spam Message Detection

We are going to train our network to detect spam messages (spam or ham). The data set looks like the following:

      Category  Message
0     ham       Go until jurong point, crazy.. Available only ...
1     ham       Ok lar... Joking wif u oni...
2     spam      Free entry in 2 a wkly comp to win FA Cup fina...
3     ham       U dun say so early hor... U c already then say...
4     ham       Nah I don't think he goes to usf, he lives aro...
...
5567  spam      This is the 2nd time we have tried 2 contact u...
5568  ham       Will ü b going to esplanade fr home?
5569  ham       Pity, * was in mood for that. So...any other s...
5570  ham       The guy did some bitching but I acted like i'd...
5571  ham       Rofl. Its true to its name

There are 5572 messages in total, each labelled as spam or ham.

The job of the Embedding layer would be to keep the words having similar meanings/contexts together in vector space.

Pre-processing the data

import pandas as pd
import numpy as np

data = pd.read_csv("./spam_text_message_data.csv")

messages = []
labels = []
for index, row in data.iterrows():
    messages.append(row['Message'])
    if row['Category'] == 'ham':
        labels.append(0)
    else:
        labels.append(1)

messages = np.asarray(messages)
labels = np.asarray(labels)

We read the data, save the messages as a list of strings and the corresponding labels as a list of integers (0 for ham, i.e. not spam; 1 for spam), and convert both to NumPy arrays.

Next we will use Keras' Tokenizer class to convert the array of message strings into a list of integer sequences.

from keras.preprocessing.text import Tokenizer

max_vocab = 10000
max_len = 500

tokenizer = Tokenizer(num_words=max_vocab)
tokenizer.fit_on_texts(messages)
sequences = tokenizer.texts_to_sequences(messages)

The Tokenizer class lets us specify the maximum number of vocabulary words to consider via the num_words argument, i.e. keep the 10,000 most frequent words and ignore the rest.

The fit_on_texts method calculates the frequency of each word in our corpus/messages.

The texts_to_sequences method then converts our array of strings into a list of integer sequences (the most frequent word is assigned 1, and so on).
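As a quick sanity check, here is what these two methods do on a tiny made-up corpus (the exact integers depend on word frequencies, with the most frequent word getting index 1):

from keras.preprocessing.text import Tokenizer

toy_corpus = ["free entry to win a prize",
              "are you free tomorrow",
              "win a free prize now"]
toy_tok = Tokenizer(num_words=100)
toy_tok.fit_on_texts(toy_corpus)

print(toy_tok.word_index)                      # e.g. {'free': 1, 'win': 2, 'a': 3, ...}
print(toy_tok.texts_to_sequences(toy_corpus))  # e.g. [[1, 5, 6, 2, 3, 4], ...]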

Since our network expects an array rather than a list as input, we convert the list into a 2D array using the pad_sequences method. maxlen specifies the maximum sequence length (longer sequences are truncated, shorter ones are padded).

from keras.preprocessing.sequence import pad_sequences

word_index = tokenizer.word_index
data = pad_sequences(sequences, maxlen=max_len)
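For intuition, with Keras' defaults the padding zeros are added at the front of short sequences, and truncation keeps the last maxlen integers:

from keras.preprocessing.sequence import pad_sequences

short = [[5, 12, 7]]
print(pad_sequences(short, maxlen=6))   # [[ 0  0  0  5 12  7]]
print(pad_sequences(short, maxlen=2))   # [[12  7]]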

Now our data is ready to be fed into the network!

data has a shape of (5572, 500), i.e. we have 5572 messages in our CSV file and we restricted each message to 500 words/integers.

The word_index attribute of the Tokenizer class is a dictionary mapping each word to its index/integer representation, as calculated by fit_on_texts. It will come in handy later on when we feed a custom message to test the network.

Preparing the data set

Split the data set for training/validation and testing: 80% for training and 20% for testing. We will later split the 80% portion into training and validation sets.

train_samples = int(len(messages) * 0.8)

messages_train = data[:train_samples]
labels_train = labels[:train_samples]
messages_test = data[train_samples:len(messages)-2]
labels_test = labels[train_samples:len(messages)-2]

Network

Now let's talk about the networks to be used. We will train/test the data set with two RNN networks.

  • SimpleRNN or Elman Network
  • LSTM

SimpleRNN

SimpleRNN/Elman Network

X_{1,t} … X_{m,t} is the input vector of size m at time instant t

H_{1,t} … H_{n,t} is the output vector of size n at time instant t

H_{1,t-1} … H_{n,t-1} is the output vector of size n at time instant t-1

Here we are feeding the output of the hidden layer back to itself after a one-timestep delay.

The input vector we are referring to here is basically just the word vector. Hence we will feed the network one word vector at a time. Since each of our messages has at most 500 words, we will feed 500 word vectors to the above model.

Each element of the input vector (32 elements in one word vector) is connected to each node in the output layer (the output dimension of the SimpleRNN). If we consider a word vector of size m and an output dimension of size n, then there are m×n weights connecting the input vector to the output nodes.

The output of the previous timestep is also connected to the output nodes. Each element of the previous timestep's output vector (n elements in total) is connected to every element of the current timestep's output vector. Thus there are n×n weights. (I have shown all connections for H_{1,t-1} only, for simplicity.)

Finally, we have n biases (one simply added to each output node).

Hence, in total, we have the following number of trainable parameters:

Trainable parameters in SimpleRNN
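Plugging in the sizes we will use below (word vectors of size m = 32 and a SimpleRNN output dimension of n = 32), a quick check of the formula n×n + n×m + n:

m, n = 32, 32                          # word-vector size, SimpleRNN output dimension
simple_rnn_params = n * n + n * m + n  # recurrent weights + input weights + biases
print(simple_rnn_params)               # 2080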

Now, for ease of explanation and continuity with the LSTM section, it is better to represent the above SimpleRNN as below:

SimpleRNN/Elman Network

It accomplishes the same task as before, but now we can see that it is similar to a fully connected network whose input is the current input vector concatenated with the previous timestep's output vector.

Note that Keras' SimpleRNN feeds the final output (after the bias and tanh) back as part of the concatenated input.

It can be represented in a compact manner as below:

SimpleRNN compact (source)

Let's build the network in Keras!

from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN, Dense

embedding_mat_columns = 32

model = Sequential()
model.add(Embedding(input_dim=max_vocab,
                    output_dim=embedding_mat_columns,
                    input_length=max_len))
model.add(SimpleRNN(units=embedding_mat_columns))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy',
              metrics=['acc'])
model.summary()

The Embedding class lets us create a word embedding layer for the network. As discussed before, it is simply a weight matrix in which every row is the word vector of one unique word in our vocabulary/corpus.

input_dim argument is to specify the number of rows of the Embedding matrix.

output_dim is to specify the number of columns of the Embedding matrix.

input_length is to specify the maximum length of input sequence.

The SimpleRNN class constructs the SimpleRNN/Elman network discussed above. units specifies the output dimension of the SimpleRNN. The default activation for SimpleRNN is tanh.
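As a sanity check, the layer sizes reported by model.summary() should line up with the formula above; a quick back-of-the-envelope calculation for our three layers:

# Parameter counts we expect model.summary() to report
embedding_params = 10000 * 32          # max_vocab rows x 32 columns = 320000
simplernn_params = 32*32 + 32*32 + 32  # recurrent weights + input weights + biases = 2080
dense_params = 32 + 1                  # one weight per SimpleRNN unit + bias = 33
print(embedding_params + simplernn_params + dense_params)   # 322113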

Train the model:

model.fit(messages_train, labels_train, epochs=10, batch_size=60, validation_split=0.2)

After training, let's evaluate:

acc = model.evaluate(messages_test, labels_test)
print("Test loss is {0:.2f} accuracy is {1:.2f} ".format(acc[0],acc[1]))

Output:

Test loss is 0.11 accuracy is 0.96

Let's give it a custom message and check the prediction:

def message_to_array(msg):
    msg = msg.lower().split(' ')
    test_seq = np.array([word_index[word] for word in msg])
    test_seq = np.pad(test_seq, (500-len(test_seq), 0),
                      'constant', constant_values=(0))
    test_seq = test_seq.reshape(1, 500)
    return test_seq

custom_msg = 'Congratulations ur awarded 500 of CD vouchers or 125gift guaranteed Free entry for movies'
test_seq = message_to_array(custom_msg)
pred = model.predict_classes(test_seq)
print(pred)

Output:

[[1]]

where 1 means it is spam. Hence the model seems to be working!
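One caveat with this helper (not addressed in the original snippet): word_index only contains words seen during fitting, so an unseen word would raise a KeyError. A slightly more defensive variant could map unknown words to 0, the same index pad_sequences uses for padding. This is just a sketch reusing np and word_index from earlier; message_to_array_safe is a hypothetical name:

def message_to_array_safe(msg):
    # Same idea as message_to_array above, but unseen words map to 0
    # (the index that pad_sequences also uses for padding)
    words = msg.lower().split(' ')
    seq = np.array([word_index.get(word, 0) for word in words])
    seq = np.pad(seq, (500 - len(seq), 0), 'constant', constant_values=(0))
    return seq.reshape(1, 500)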

LSTM

In SimpleRNN we had an input vector of size m and an output dimension of size n, and we saw how they are connected to each other. In total there were (n×n + n×m + n) trainable parameters.

Now in LSTM, instead of one such fully connected recurrent network, we have four. Specifically (in the figure "SimpleRNN compact"), the tanh block denotes two weight matrices (of size n×n and n×m) and a bias vector of size n. In LSTM we will have four such FFNNs, as shown below (the boxes in yellow):

LSTM compact (source)

Each of these four FFNNs has its own two weight matrices (of size n×m and n×n) and a bias vector of size n. Hence the total number of trainable parameters for the LSTM is:

LSTM trainable parameters (with bias)
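Again plugging in m = 32 and n = 32, a quick check of the LSTM formula 4 × (n×m + n×n + n):

m, n = 32, 32                          # word-vector size and LSTM output dimension
lstm_params = 4 * (n * m + n * n + n)  # four gate networks, each sized like a SimpleRNN
print(lstm_params)                     # 8320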

The connection diagram for the LSTM with all nodes would be too complex to draw, but having grasped the SimpleRNN, one can imagine four SimpleRNN-like structures (three with sigmoid activation and one with tanh) at the input of the LSTM.

Another good source for visualization is:

LSTM visualization (source)

Let's build the network in Keras!

model.add(LSTM(units=embedding_mat_columns))

We just need to replace the SimpleRNN class with the LSTM class in the previous code snippets; everything else remains the same.
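For completeness, here is a sketch of the full LSTM model, assuming the same pre-processing and variables (max_vocab, max_len, embedding_mat_columns, messages_train, labels_train) as before:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(input_dim=max_vocab,
                    output_dim=embedding_mat_columns,
                    input_length=max_len))
model.add(LSTM(units=embedding_mat_columns))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy',
              metrics=['acc'])
model.fit(messages_train, labels_train, epochs=10, batch_size=60,
          validation_split=0.2)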

The output for test accuracy is:

Test loss is 0.06 accuracy is 0.99

As seen above, the accuracy of the LSTM is better than that of the SimpleRNN.

The complete source code can be found on GitHub.
