Preprocessing text data

Introduction

Recurrent neural networks and 1D convolutional networks are the two main deep learning approaches for sequence processing.

Applications of sequence processing include:

- Document and time series classification (identifying the topic or author of an article);
- Time series comparisons (estimating how close two articles or stock tickers are);
- Sequence-to-sequence learning (English to Chinese translation);
- Sentiment analysis (classifying the sentiment of tweets or movie reviews);
- Time series forecasting (predicting the weather in the future).

We can represent text as a sequence of data in the form of words or characters. It is most common to work with words. The models we are dealing with do not understand text the way humans do. Instead, they learn the statistical structure of the text and use it to perform pattern recognition on words, sentences, and paragraphs.

We need to vectorize text with one of the following techniques:

- Divide text into words and transform each word into a vector;
- Divide text into characters and transform each character into a vector;
- Identify n-grams, which are overlapping groups of words or characters, and transform each n-gram into a vector.

These individual units into which we break text are called tokens, and the process of breaking text into tokens is called tokenization.
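
As a quick illustration, here is a minimal sketch of word-level and character-level tokenization in plain Python (real tokenizers typically also normalize case and strip punctuation):

sentence = "The dog had a kink in its tail."

# word-level tokens
word_tokens = sentence.split()
# ['The', 'dog', 'had', 'a', 'kink', 'in', 'its', 'tail.']

# character-level tokens
char_tokens = list(sentence)
# ['T', 'h', 'e', ' ', 'd', 'o', 'g', ' ', 'h', 'a', 'd', ...]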

N-grams

One method of tokenization is called bag-of-words. The term bag refers to a set of tokens rather than an ordered sequence. With this method we form a bag of 2-grams, a bag of 3-grams, and so on: groups of consecutive words extracted from a sentence. Each group contains no more than n consecutive words and can contain fewer, down to a single word. For example, the sentence "The dog had a kink in its tail." could be transformed into the following bag of 2-grams:

{"The", "The dog", "dog", "dog had", "had", "had a", "a", "a kink",
"kink", "kink in", "in", "in its", "its", "its tail", "tail"}

N-grams are a powerful feature-engineering tool that is used in shallow text-processing models such as logistic regression and random forests. Because a bag of n-grams does not preserve the order of the sequence, in deep learning we rely on hierarchical feature learning instead.

One-hot encoding

One-hot encoding is the most common method for turning tokens into vectors. We assign a unique integer index to every word and encode each word as a binary vector of all zeros except for the entry that corresponds to the word's index in the dictionary, which is set to 1. Stacking these vectors for a sentence gives a matrix with one row per word position.

Word-level one-hot encoding:

import numpy as np

text = ["The dog had a kink in its tail.", "It looked happy."]

# first build an index of all tokens
token_idx = {}
for sentence in text:
    for word in sentence.split():
        # strip punctuation and special characters in a real application
        if word not in token_idx:
            # Keras does not create a token index at zero
            token_idx[word] = len(token_idx) + 1
            # token_idx = {... 'It': 9, 'looked': 10, 'happy.': 11}

# next vectorize the text
# we need to limit the sentence to `max_length` words
max_length = 10
results = np.zeros(shape=(len(text), max_length, max(token_idx.values()) + 1))
for i, sentence in enumerate(text):
    for j, word in list(enumerate(sentence.split()))[:max_length]:
        # for j, word in [(0, 'It'), (1, 'looked'), (2, 'happy.')]:
        idx = token_idx.get(word)
        results[i, j, idx] = 1.
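
For the two sentences above, the resulting tensor has one row per sentence, one slot per word position, and one dimension per index (11 distinct tokens plus the unused index 0):

print(results.shape)  # (2, 10, 12)
print(results[1, 0])  # one-hot vector for "It", with a 1 at index 9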

Character-level one-hot encoding:

import string
import numpy as np

text = ["The dog had a kink in its tail.", "It looked happy."]

characters = string.printable # all printable ASCII characters
# Keras does not create token index at zero
token_idx = dict(zip(characters, range(1, len(characters) + 1)))
# we need to limit the sentence to max_length characters
max_length = 50
results = np.zeros(shape=(len(text), max_length, max(token_idx.values()) + 1))
for i, sentence in enumerate(text):
    for j, character in list(enumerate(sentence))[:max_length]:
        idx = token_idx.get(character)
        results[i, j, idx] = 1.

We can use Keras for word-level and character-level one-hot encoding. It automatically strips special characters from the strings and restricts the data to the N most common words in the dataset to avoid creating very large vectors.

from keras.preprocessing.text import Tokenizer

text = ["The dog had a kink in its tail.", "It looked happy."]

# create a tokenizer with 1,000 most common words
tokenizer = Tokenizer(num_words=1000)
# build the word index
tokenizer.fit_on_texts(text)
# turn strings into lists of integer indices
sequences = tokenizer.texts_to_sequences(text)

one_hot_results = tokenizer.texts_to_matrix(text, mode='binary')
word_index = tokenizer.word_index # word-to-index dictionary
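
For the two example sentences, the outputs look roughly like this (the Tokenizer lowercases text and strips punctuation, so the two sentences yield 11 distinct tokens):

print(sequences)             # lists of integer word indices, one list per sentence
print(one_hot_results.shape) # (2, 1000): one row per text, one column per word index
print(len(word_index))       # 11 distinct tokens across the two sentences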

For a large dictionary of words, we can hash words into indices instead of assigning explicit integer indices; this removes the need to build and store a word index. However, it may lead to hash collisions, where two different words end up with the same hash value. Hash collisions can be reduced by making the hashing space much larger than the total number of unique tokens.

text = ["The dog had a kink in its tail.", "It looked happy."]
# beware of hash collisions for texts with 1,000+ words
# because our word dictionary is 1,000 words long
word_dictionary_size = 1000
max_length = 10 # sentence limit in words
results = np.zeros((len(text), max_length, word_dictionary_size))
for i, sentence in enumerate(text):
    for j, word in list(enumerate(sentence.split()))[:max_length]:
        # hash the word into a "random" integer index
        # that is between 0 and word dictionary size
        idx = abs(hash(word)) % word_dictionary_size
        results[i, j, idx] = 1.

Word embeddings

A word-embedding vector is a low-dimensional floating-point vector. It is dense, in contrast to one-hot vectors, which are sparse (most of their values are zeros). Unlike one-hot vectors, word embeddings are learned from data.

Common dimensions for word-embedding vectors are 256, 512, or 1024. In contrast, one-hot encoding generates vectors of 20,000 dimensions or greater (for a vocabulary of 20,000 tokens), so word embeddings pack more information into far fewer dimensions.

To construct word embeddings, we can either:

1. Learn word embeddings jointly with the main task during training;
2. Load word embeddings that were pretrained with a different model than the main task.

Word embeddings are meant to map human language into a geometric space: words that are close in semantic meaning should lie close together in the embedding space. Moreover, we also want specific directions in the embedding space to carry specific meanings.

For example, we would like a single vector transformation to take cat to tiger and dog to wolf: a "from pet to wild animal" direction. Another vector transformation should take dog to cat and wolf to tiger: a "from canine to feline" direction. In real embedding spaces, common examples are plural and gender transformations, such as going from king to kings and from king to queen.

^
|    x wolf(0.2, 0.8)
|                            x tiger(0.8, 0.7)
|
|
|    x dog(0.2, 0.2)
|                            x cat(0.8, 0.1)
----------------------------------------------->
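
Using the toy 2D coordinates from the plot above, we can check these directions with simple vector arithmetic (a minimal sketch; real embedding spaces have hundreds of dimensions and the relationships only hold approximately):

import numpy as np

# toy 2D word vectors taken from the plot above
embeddings = {
    "dog":   np.array([0.2, 0.2]),
    "wolf":  np.array([0.2, 0.8]),
    "cat":   np.array([0.8, 0.1]),
    "tiger": np.array([0.8, 0.7]),
}

# the "from pet to wild animal" direction is the same from dog or from cat
print(embeddings["wolf"] - embeddings["dog"])    # approximately [0.  0.6]
print(embeddings["tiger"] - embeddings["cat"])   # approximately [0.  0.6]

# the "from canine to feline" direction
print(embeddings["cat"] - embeddings["dog"])     # approximately [ 0.6 -0.1]
print(embeddings["tiger"] - embeddings["wolf"])  # approximately [ 0.6 -0.1]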

There is no perfect map of human language that could be used for every natural language processing task, and languages differ based on culture and context. Because the semantic relationships that matter differ from case to case, we adapt word embeddings to the specific task. That is why we learn a new embedding space with each new task.

Learning word embeddings at training time

In Keras we can add word embeddings as a layer and learn the weights during backpropagation.

from keras.layers import Embedding
# first argument is the number of possible tokens (1 + maximum word index)
# second argument is the dimensionality of the embeddings
embedding_layer = Embedding(1000, 64)

The embedding layer is like a dictionary that maps integer indices (which stand for specific words) to dense vectors. It takes integers as input, looks up these integers in an internal dictionary, and returns the associated vectors.

The embedding layer takes as input a 2D tensor of shape (samples, sequence_length). It can embed sequences of variable length, but all sequences within one batch must have the same length. For example, one batch may have shape (32, 10) and another shape (64, 20). Sequences that are shorter than the chosen length are padded with zeros, and sequences that are longer are truncated.

The layer outputs a 3D tensor of shape (samples, sequence_length, embedding_dimensionality), which can then be processed by an RNN layer or a 1D convolution layer. Initialized at random, the embedding layer gradually adjusts its weights through backpropagation, turning the random vectors into a structured space that downstream layers can exploit.
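
To make the shapes concrete, here is a minimal NumPy sketch of what the lookup does, with a random matrix standing in for the layer's learned weights (the names and sizes are illustrative):

import numpy as np

num_tokens, embedding_dim = 1000, 64
weights = np.random.rand(num_tokens, embedding_dim)  # stands in for the learned weights

# a batch of word indices with shape (samples, sequence_length)
batch = np.array([[4, 20, 7],
                  [1, 3, 11]])

embedded = weights[batch]  # look up one row per index
print(embedded.shape)      # (2, 3, 64) = (samples, sequence_length, embedding_dim)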

If we apply this idea to the IMDB sentiment-prediction task, we get the following. First, we restrict the movie reviews to the top 10,000 most common words and cut off each review after 20 words. We encode each of the 10,000 words as an 8-dimensional embedding and train a single dense layer for classification.

from keras.datasets import imdb
from keras import preprocessing
max_features = 10000 # number of words in a dictionary
maxlen = 20 # maximum number of words in one sample text
# load the data as lists of integers
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
# turn data into a 2D integer tensor of shape `(samples, maxlen)`
x_train = preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = preprocessing.sequence.pad_sequences(x_test, maxlen=maxlen)
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

model = Sequential()
# specify the maximum input length to flatten the inputs later
model.add(Embedding(10000, 8, input_length=maxlen))
# after the embedding layer, 
# the activations have shape `(samples, maxlen, 8)`

# we flatten the 3D tensor of embeddings
# into a 2D tensor of shape `(samples, maxlen * 8)`
model.add(Flatten())

# then we add the classifier on top
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])
model.summary()

history = model.fit(x_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

We get a validation accuracy of about 75%. Note that the model treats each word separately, without considering relationships between words or sentence structure. We would like the model to account for such relationships by using recurrent layers or 1D convolutional layers.
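
As a hedged sketch of that direction (hyperparameters are illustrative, not tuned), we could replace the Flatten and Dense stack with a recurrent layer so the model can make use of word order:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
# maxlen as defined for the IMDB example above
model.add(Embedding(10000, 8, input_length=maxlen))
# the LSTM reads the embedded sequence in order instead of flattening it
model.add(LSTM(32))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['acc'])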

Using pretrained word embeddings

Sometimes we don't have enough data to learn a task-specific embedding of our vocabulary. Instead of learning word embeddings together with the problem we want to solve, we can load embedding vectors that have already been trained on a different problem.

Yoshua Bengio was among the first to explore the idea of a dense, low-dimensional embedding space for words in his 2003 paper A Neural Probabilistic Language Model. The Word2vec algorithm, developed by Tomas Mikolov in 2013, is one of the best-known word-embedding schemes. Another popular word-embedding technique is called Global Vectors for Word Representation (GloVe); it was developed by Stanford researchers in 2014. Both of these pretrained word embeddings can be downloaded and used in a Keras Embedding layer.
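
As a hedged sketch of the pretrained route (assuming the GloVe file glove.6B.100d.txt has been downloaded from the Stanford GloVe project, and that word_index is the word-to-index dictionary of a Tokenizer fitted on our texts), we can copy the pretrained vectors into an Embedding layer and freeze it:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, Flatten, Dense

max_words = 10000     # size of our vocabulary
embedding_dim = 100   # dimensionality of the GloVe vectors
maxlen = 20           # sentence limit in words

# parse the GloVe file into a {word: vector} dictionary
embeddings_index = {}
with open('glove.6B.100d.txt') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# build a (max_words, embedding_dim) matrix aligned with our word index;
# word_index is assumed to come from a fitted Tokenizer, as shown earlier;
# words without a pretrained vector keep all-zero rows
embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words and word in embeddings_index:
        embedding_matrix[i] = embeddings_index[word]

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))

# load the pretrained vectors and freeze them so training does not destroy them
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False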