Recurrent neural network

Densely connected and convolutional networks keep no memory of state between inputs. In contrast, recurrent neural networks (RNNs) keep a memory of previous inputs. The idea is inspired by biological intelligence: information is processed incrementally, building a model from past information and continually updating that model as new information arrives.

An RNN iterates through the elements of a sequence while maintaining a state that summarizes the information seen so far. The network has an internal loop:

{{< mermaid >}} graph LR A[Input] --> B[RNN] B --> C[Output] B -- Recurrent connection --> B {{< /mermaid >}}

The state of the RNN is reset between two independent sequences (for example, two different IMDB reviews), so one sequence (one IMDB review) is still a single data point. However, the sequence is no longer processed in a single step; instead, the network iterates over its elements. At each iteration, the RNN considers two inputs: the current sequence element and the output from the previous iteration (the state). It performs a calculation on these two and returns an output that is used as the state in the next iteration. At the first iteration there is no previous output, so the state is initialized to a zero vector.

The calculation that the RNN performs at each iteration is a transformation of the input and the state by two matrices and a bias vector:

import numpy as np

timesteps = 100
input_features = 32
output_features = 64

inputs = np.random.random((timesteps, input_features))
state_t = np.zeros((output_features,)) # initial state

W = np.random.random((output_features, input_features))
U = np.random.random((output_features, output_features))
b = np.random.random((output_features,))

outputs = []
for input_t in inputs:
    # calculate the current output from the input and the previous state
    # (@ is matrix multiplication, equivalent to numpy.dot here)
    output_t = np.tanh(W @ input_t + U @ state_t + b)
    outputs.append(output_t)
    # save the current output as the state for the next iteration
    state_t = output_t

outputs = np.asarray(outputs)
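
For reference, we can check the result of the loop: outputs stacks the per-timestep outputs into a single array whose shape matches the sizes chosen above, and the final state is simply the last output.

print(outputs.shape)                      # (100, 64): one 64-dimensional output per timestep
print(np.allclose(outputs[-1], state_t))  # True: the final state is the last output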

Keras provides the SimpleRNN layer, which performs the same computation as the NumPy example above but processes batches of sequences instead of a single sequence. Its input therefore has shape (batch_size, timesteps, input_features) rather than (timesteps, input_features), and its output has shape (batch_size, output_features) or (batch_size, timesteps, output_features), depending on whether it returns only the last output or the full sequence of outputs.

In many cases we need only the output at the last timestep, because it contains information about the entire sequence. In Keras this is controlled by the return_sequences argument passed to the SimpleRNN constructor. If it is set to True, the layer returns the full sequence of outputs, one per timestep; the default value is False, which returns only the last output.
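
For example, with the same Embedding and SimpleRNN sizes as in the model below, we can check the two output shapes directly (a quick sketch to illustrate the shapes; the model itself is throwaway):

from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN

model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32))                         # default: return_sequences=False
print(model.output_shape)                        # (None, 32): only the last output

model = Sequential()
model.add(Embedding(10000, 32))
model.add(SimpleRNN(32, return_sequences=True))  # return the full sequence
print(model.output_shape)                        # (None, None, 32): one output per timestep

To stack several recurrent layers, all intermediate layers must return their full sequence of outputs: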

from keras.models import Sequential
from keras.layers import Embedding, SimpleRNN

model = Sequential()
model.add(Embedding(10000, 32))
# intermediate layers return the full sequence of outputs
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
model.add(SimpleRNN(32, return_sequences=True))
# the last layer returns only the last output
model.add(SimpleRNN(32))
model.summary()

In order to train the model on the IMDB dataset, we first need to preprocess the data:

from keras.datasets import imdb
from keras.preprocessing import sequence

max_features = 10000  # number of words to consider as features
maxlen = 500  # cut texts after this number of words (among top max_features most common words)
batch_size = 32

print('Loading data...')
(input_train, y_train), (input_test, y_test) = imdb.load_data(num_words=max_features)
print(len(input_train), 'train sequences')
print(len(input_test), 'test sequences')

print('Pad sequences (samples x time)')
input_train = sequence.pad_sequences(input_train, maxlen=maxlen)
input_test = sequence.pad_sequences(input_test, maxlen=maxlen)
print('input_train shape:', input_train.shape)
print('input_test shape:', input_test.shape)

Now we can train the network using the Embedding and SimpleRNN layers:

from keras.layers import Dense

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(SimpleRNN(32))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
history = model.fit(input_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)

Unfortunately, the network does not perform well on this task, reaching a maximum validation accuracy below 0.87. We will look at more advanced layers such as LSTM and GRU that perform better.
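
To see how the accuracy evolves over the epochs, we can plot the curves stored in the history object (a minimal sketch, assuming matplotlib is installed):

import matplotlib.pyplot as plt

acc = history.history['acc']
val_acc = history.history['val_acc']
epochs = range(1, len(acc) + 1)

plt.plot(epochs, acc, 'bo', label='Training accuracy')
plt.plot(epochs, val_acc, 'b', label='Validation accuracy')
plt.title('Training and validation accuracy')
plt.xlabel('Epochs')
plt.legend()
plt.show()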

Although in theory a simple RNN layer should be able to retain information about inputs seen many timesteps earlier, in practice such long-term dependencies are impossible to learn because of the vanishing gradient problem.

The long short-term memory (LSTM) algorithm was developed by Hochreiter and Schmidhuber in 1997. The main advantage of LSTM is that it saves information across timesteps in an additional "carry" state that later timesteps can access, effectively fighting the vanishing gradient problem. This carry is combined with the current input and the recurrent state (the output computed at the previous timestep), in a calculation similar to that of the simple RNN layer:

# output: like the simple RNN, plus a term that reads from the carry state
output_t = activation(W @ input_t + U @ state_t + V @ carry_t + b)
i_t = activation(Wi @ input_t + Ui @ state_t + bi)  # input gate
f_t = activation(Wf @ input_t + Uf @ state_t + bf)  # forget gate
k_t = activation(Wk @ input_t + Uk @ state_t + bk)  # candidate values
# new carry state: old carry modulated by the forget gate,
# plus new candidate values modulated by the input gate
carry_t_next = i_t * k_t + carry_t * f_t
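
To make the pseudocode concrete, here is a minimal NumPy sketch of the same step, run over a toy sequence. The choices of sigmoid for the gates, tanh elsewhere, and random weights are illustrative assumptions; the actual Keras implementation differs in details such as initialization and the exact output formulation.

import numpy as np

input_features, output_features = 32, 64
rng = np.random.RandomState(0)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def rand_params():
    # one (W, U, b) triple: input weights, recurrent weights, bias
    return (rng.standard_normal((output_features, input_features)),
            rng.standard_normal((output_features, output_features)),
            np.zeros(output_features))

(W, U, b), (Wi, Ui, bi), (Wf, Uf, bf), (Wk, Uk, bk) = [rand_params() for _ in range(4)]
V = rng.standard_normal((output_features, output_features))  # weights applied to the carry

def lstm_step(input_t, state_t, carry_t):
    i_t = sigmoid(Wi @ input_t + Ui @ state_t + bi)   # input gate
    f_t = sigmoid(Wf @ input_t + Uf @ state_t + bf)   # forget gate
    k_t = np.tanh(Wk @ input_t + Uk @ state_t + bk)   # candidate values
    output_t = np.tanh(W @ input_t + U @ state_t + V @ carry_t + b)
    carry_t_next = i_t * k_t + carry_t * f_t          # updated carry for the next step
    return output_t, carry_t_next

state_t = np.zeros(output_features)   # initial state
carry_t = np.zeros(output_features)   # initial carry
for input_t in rng.standard_normal((10, input_features)):  # a toy sequence of 10 timesteps
    state_t, carry_t = lstm_step(input_t, state_t, carry_t)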

This is how we implement LSTM in Keras:

from keras.layers import LSTM

model = Sequential()
model.add(Embedding(max_features, 32))
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model.fit(input_train, y_train,
                    epochs=10,
                    batch_size=128,
                    validation_split=0.2)

We get a validation accuracy of about 0.88, which is better than the simple RNN we trained before.