Deep learning with Keras

Learning best practices

The main problem in machine learning is to find a balance between optimization and generalization. If we fit the model to the training data too well, the result is overfitting and poorer generalization. Overfitting is likely when the data is noisy, when it involves uncertainty, and when it contains rare features that lead the model to pick up spurious correlations.

One way to reduce noise is feature selection: for example, restricting text data to the 10,000 most common words, or deciding which features are most informative for the problem and keeping only those. Overfitting can also be reduced by training the model on more data, or on better data that represents the phenomenon well. When more or better data is unavailable, we can add regularization to the model, which forces the optimization process to focus on the most prominent patterns, which have a better chance of generalizing well.
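
As a minimal sketch of the word-frequency example above, the snippet below keeps only the most frequent words of a tokenized corpus and maps every other word to a placeholder. The function name `restrict_vocabulary`, the `<unk>` token, and the toy corpus are illustrative assumptions, not part of any library.

```python
from collections import Counter

def restrict_vocabulary(documents, max_words=10_000, unknown="<unk>"):
    """Keep the `max_words` most common words; replace the rest with `unknown`."""
    counts = Counter(word for doc in documents for word in doc)
    vocabulary = {word for word, _ in counts.most_common(max_words)}
    return [[word if word in vocabulary else unknown for word in doc]
            for doc in documents]

# Toy corpus (hypothetical): each document is a list of tokens.
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "cat", "ran"]]
restricted = restrict_vocabulary(docs, max_words=3)
```

With `max_words=3`, the rare words "dog" and "ran" are replaced by the placeholder, so the model never sees features that occur only once.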

Model validation

When we evaluate a model, we need three sets of data: training data, validation data, and test data. The training data is used to learn the model's parameters, and the validation data is used to tune hyperparameters. We must not expose the test data while learning parameters or tuning hyperparameters because of information leakage: every time we expose data to the learning process, some information about it leaks into the model. If we expose the test data too often during learning, we may end up overfitting to it.

Simple hold-out validation

Split the data into training and validation sets. Then train the model on the training set, evaluate it on the validation set, and tune the hyperparameters. At the end, train the final model from scratch on all available non-test data.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def validation_split(train_data, train_labels, ratio=0.3):
    n = len(train_data)
    shuffled_data, shuffled_labels = permute_data(train_data, train_labels)
    num_valid_samples = int(ratio * n)
    valid_data = shuffled_data[:num_valid_samples]
    valid_labels = shuffled_labels[:num_valid_samples]
    subtrain_data = shuffled_data[num_valid_samples:]
    subtrain_labels = shuffled_labels[num_valid_samples:]
    return (valid_data, valid_labels), (subtrain_data, subtrain_labels)

def permute_data(train_data, train_labels):
    n = len(train_data)
    indices_permutation = np.random.permutation(n)
    shuffled_data = train_data[indices_permutation]
    shuffled_labels = train_labels[indices_permutation]
    return shuffled_data, shuffled_labels

def build_model():
    model = keras.Sequential([
        layers.Dense(64, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(1)
    ])
    model.compile(optimizer="rmsprop", loss="mse", metrics=["mae"])
    return model

# validation_split returns two (data, labels) pairs
(valid_data, valid_labels), (subtrain_data, subtrain_labels) = validation_split(train_data, train_labels)
model = build_model()
model.fit(subtrain_data, subtrain_labels)
valid_score = model.evaluate(valid_data, valid_labels)
#=========================
# tune the hyperparameters
#=========================
model = build_model()
model.fit(train_data, train_labels)
test_score = model.evaluate(test_data, test_labels)

K-fold cross-validation

With the k-fold cross-validation method, we split the data into \(k\) partitions of equal size. For each partition, we train a model on the other \(k-1\) partitions and evaluate it on that partition. The final score is the average of the \(k\) scores obtained. This method helps when the validation score varies significantly depending on how the data is split, which is typical for small datasets.

def k_fold_cross_validation(train_data, train_labels, build_model,
        k=4, epochs=100):
    m = len(train_data) // k # number of validation samples per fold
    valid_scores = []
    shuffled_data, shuffled_labels = permute_data(train_data, train_labels)
    for fold in range(k):
        print(f"Processing fold #{fold}")
        valid_data = shuffled_data[m*fold:m*(fold+1)]
        valid_labels = shuffled_labels[m*fold:m*(fold+1)]
        subtrain_data = np.concatenate(
            (shuffled_data[:m*fold],
             shuffled_data[m*(fold+1):]),
            axis=0)
        subtrain_labels = np.concatenate(
            (shuffled_labels[:m*fold],
             shuffled_labels[m*(fold+1):]),
            axis=0)
        model = build_model()
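        # use the same batch size as the final training run
        model.fit(subtrain_data, subtrain_labels, epochs=epochs, batch_size=16)
        # evaluate() returns the loss followed by the compiled metrics
        valid_score = model.evaluate(valid_data, valid_labels)
        valid_scores.append(valid_score)
    # average each score (loss, metrics) separately across the folds
    return np.mean(valid_scores, axis=0)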

valid_score = k_fold_cross_validation(train_data, train_labels, build_model)
#============================
# choose best hyperparameters
#============================
model = build_model()
model.fit(train_data, train_labels, epochs=best_epochs, batch_size=16)
test_score = model.evaluate(test_data, test_labels)

K-fold cross-validation multiple times with shuffling

With this approach, we run k-fold cross-validation multiple times, shuffling the data before each run, and then average the scores obtained across the runs. Note that running it \(p\) times trains \(p \times k\) models, which can be expensive.

valid_scores = []
for p in range(5):
    valid_score = k_fold_cross_validation(train_data, train_labels, build_model)
    valid_scores.append(valid_score)
valid_score = np.mean(valid_scores, axis=0)  # average across the runs
#============================
# choose best hyperparameters
#============================
model = build_model()
model.fit(train_data, train_labels, epochs=best_epochs, batch_size=16)
test_score = model.evaluate(test_data, test_labels)

Choose a baseline

Beat a common-sense baseline: pick a trivial baseline, such as a random classifier or a simple non-machine-learning technique, and check that the machine-learning model outperforms it.
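
One simple baseline for classification is to always predict the most frequent class seen in training. The sketch below computes the accuracy of that strategy; the function name `majority_class_accuracy` and the label lists are hypothetical.

```python
from collections import Counter

def majority_class_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)

# Hypothetical labels for a binary problem.
baseline = majority_class_accuracy([0, 0, 0, 1], [0, 1, 0, 0, 1])
```

A trained model is only worth keeping if its test accuracy is clearly above this number.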

Caveats

Make sure that the training set and the test set are both representative of the data. It is good practice to shuffle the data randomly before splitting it.

Make sure that the test set is posterior to the training set if you are trying to predict the future. Don't shuffle the data in that case: split it by time instead.
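
A time-ordered split can be sketched as follows, assuming each sample comes with a timestamp. The function name `temporal_split` and the toy data are illustrative only.

```python
def temporal_split(samples, timestamps, test_fraction=0.2):
    """Sort by time and hold out the most recent samples as the test set."""
    order = sorted(range(len(samples)), key=lambda i: timestamps[i])
    num_test = int(len(samples) * test_fraction)
    cutoff = len(samples) - num_test
    train = [samples[i] for i in order[:cutoff]]
    test = [samples[i] for i in order[cutoff:]]
    return train, test

# Hypothetical samples with their timestamps, given in arbitrary order.
train, test = temporal_split(["c", "a", "e", "b", "d"], [3, 1, 5, 2, 4])
```

Every test sample is then more recent than every training sample, which mirrors how the model will be used.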

Make sure that the training set, validation set, and test set are disjoint. If the data contains duplicates, the training set and the test set may share samples, which means we would partly be evaluating on training data.
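
One way to enforce this is to drop from the training set any sample that also appears in the test set. The sketch below assumes the samples are hashable (for example strings or tuples); `remove_overlap` is a hypothetical helper name.

```python
def remove_overlap(train_samples, test_samples):
    """Drop training samples that also appear in the test set."""
    test_set = set(test_samples)
    return [sample for sample in train_samples if sample not in test_set]

# Hypothetical data: "b" appears in both sets.
clean_train = remove_overlap(["a", "b", "c", "b"], ["b", "d"])
```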

Improving model fit

First we need a model that is able to overfit, so that we know where the boundary between underfitting and overfitting lies. Once we have such a model, we can focus on refining generalization. Common training problems:

Training loss doesn't go down over time. The problem is usually the configuration of the gradient descent process: the choice of optimizer, the initial weights, the learning rate, or the batch size. We can try adjusting the learning rate: a learning rate that is too low slows learning down so much that it appears to stall, while one that is too high makes the optimizer overshoot a proper fit. We can also increase the batch size: a batch with more samples leads to gradients that have lower variance.
Training loss does go down but the model doesn't generalize. It means that something is fundamentally wrong with the approach we are taking. One possible reason is that the training data does not contain enough information to predict the targets. Another is that the kind of model is wrong for the problem we are dealing with: there are different architectures for different data modalities such as text, images, and timeseries.
Training loss goes down and we are able to beat the trivial baseline, but the model doesn't overfit, which indicates that we are still underfitting. In this case the validation loss keeps going down almost all of the time. The problem is that the model isn't big enough to represent the information. We can increase its representational power by adding more layers, adding more neurons per layer, or using kinds of layers that are more appropriate for the problem.
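
The effect of the learning rate described in the first problem can be illustrated on a toy function. The sketch below minimizes \(f(w) = w^2\) with plain gradient descent; it is not how Keras optimizers are implemented, and the function name and the three learning rates are chosen only for illustration.

```python
def gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the final loss."""
    w = start
    for _ in range(steps):
        gradient = 2 * w              # derivative of w**2
        w -= learning_rate * gradient
    return w ** 2

too_low = gradient_descent(0.001)     # barely moves: the loss is still above 0.8
reasonable = gradient_descent(0.1)    # converges toward 0
too_high = gradient_descent(1.1)      # overshoots on every step and diverges
```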

Improving generalization

Use a larger dataset
Minimize labeling errors
Clean the data and deal with missing values
Do feature selection

Feature engineering is key to improving model generalization. It is the process of using our own knowledge about the data and the neural network to feed the algorithm with data that makes its job easier. Consider reading the time from a picture of a clock: we could feed the model raw pixels, the Cartesian coordinates of the tips of the clock hands, or the polar coordinates (angles) of the hands, and each representation makes the problem easier than the previous one. The benefits of feature engineering are that it saves computational resources and requires less data to achieve the same results.
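
To make the clock example concrete, the sketch below converts the Cartesian coordinates of the hand tips into angles, from which the time follows by simple arithmetic. It assumes the clock center is the origin and the y axis points up; all function names are hypothetical.

```python
import math

def hand_angle(x, y):
    """Clockwise angle in degrees from 12 o'clock for a hand tip at (x, y).

    Assumes the clock center is the origin and the y axis points up.
    """
    return math.degrees(math.atan2(x, y)) % 360

def read_clock(hour_tip, minute_tip):
    """Recover (hour, minute) from the Cartesian tips of the two hands."""
    minute = round(hand_angle(*minute_tip) / 6) % 60   # 6 degrees per minute
    # The hour hand advances 30 degrees per hour plus 0.5 degrees per minute.
    hour = round((hand_angle(*hour_tip) - 0.5 * minute) / 30) % 12
    return hour, minute

def tip(angle_degrees):
    """Hypothetical helper: unit-length hand tip for a clockwise angle."""
    a = math.radians(angle_degrees)
    return math.sin(a), math.cos(a)

time_read = read_clock(tip(105), tip(180))   # hands as they stand at 3:30
```

Once the features are angles, no learning is needed at all, which is the point of the example: a better representation can turn a hard problem into a trivial one. In this sketch, 12 o'clock is reported as hour 0.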

Early stopping

Our task is to find the point of compromise between underfitting and overfitting. Earlier we started by training the models for more epochs than needed in order to find the epoch with the best validation metrics, and then re-trained a new model for that number of epochs. Early stopping avoids this extra work by interrupting training as soon as the validation metrics have stopped improving.

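import tensorflow as tf

# monitor='loss' watches the training loss; to stop on validation metrics,
# pass validation data to fit() and use monitor='val_loss' instead
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)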
history = model.fit(np.arange(100).reshape(5, 20), np.zeros(5),
    epochs=10, batch_size=1, callbacks=[callback],
    verbose=0)
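
The stopping rule itself can be sketched in a few lines of plain Python. This is only an illustration of the patience logic under simplified assumptions, not the Keras implementation; the function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch at which training would stop.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: all epochs were run

# Hypothetical validation losses: the best value so far is reached at epoch 2.
losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5]
stop_epoch = early_stopping_epoch(losses, patience=3)
```

Here training stops at epoch 5, before the later improvement at epoch 6 is ever seen, which shows the trade-off in choosing `patience`: a small value saves time but may stop too early.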

Regularization

Regularization makes the model simpler, less specific to the training data, so that it can better generalize.

The simplest way to mitigate overfitting is to reduce the size of the model, which is determined by the number of layers and the number of neurons per layer. If the model has limited memorization resources, it cannot simply memorize the training data. At the same time, the model needs enough parameters to represent the data. We seek a compromise between too much capacity and not enough capacity. The general workflow is to start with few layers and parameters, and to increase the size of the model while monitoring its performance on the validation set.
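
To get a feel for how quickly capacity grows, the sketch below counts the trainable parameters of a stack of fully connected layers (weights plus biases). The function name, the input dimension, and the layer sizes are illustrative assumptions.

```python
def dense_parameter_count(input_dim, layer_sizes):
    """Number of trainable parameters in a stack of fully connected layers."""
    total = 0
    previous = input_dim
    for units in layer_sizes:
        total += previous * units + units  # weights plus biases
        previous = units
    return total

small = dense_parameter_count(10_000, [4, 4, 1])      # 40,029 parameters
large = dense_parameter_count(10_000, [512, 512, 1])  # 5,383,681 parameters
```

The larger model has over a hundred times more parameters, so it can memorize far more of the training data.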

Occam's razor principle: given two explanations for something, the one most likely to be correct is the simpler one, the one that makes fewer assumptions. The same principle applies to neural networks: we can put constraints on the complexity of the model by forcing its weights to take only small values. This is done by adding to the loss function a cost associated with having large weights, and the technique is called weight regularization. We add it in Keras as follows:

from tensorflow.keras import regularizers
model = keras.Sequential([
    layers.Dense(16,
                 kernel_regularizer=regularizers.l2(0.002),
                 activation="relu"),
    layers.Dense(16,
                 kernel_regularizer=regularizers.l2(0.002),
                 activation="relu"),
    layers.Dense(1, activation="sigmoid")
])

We can use other regularizers as well:

from keras import regularizers
regularizers.l1(0.001)
regularizers.l1_l2(l1=0.001, l2=0.001)
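
The penalties these regularizers add to the loss can be sketched with NumPy: L1 is proportional to the sum of the absolute weight values, L2 to the sum of the squared weight values. The helper names and the weight matrix below are hypothetical.

```python
import numpy as np

def l1_penalty(weights, factor):
    """Cost proportional to the absolute values of the weights."""
    return factor * np.sum(np.abs(weights))

def l2_penalty(weights, factor):
    """Cost proportional to the squared values of the weights."""
    return factor * np.sum(np.square(weights))

# Hypothetical weight matrix of a small layer.
weights = np.array([[0.5, -1.0], [2.0, 0.0]])
l1 = l1_penalty(weights, 0.001)   # 0.001 * 3.5
l2 = l2_penalty(weights, 0.002)   # 0.002 * 5.25
```

Because the penalty is added only during training, the training loss of a regularized model is typically higher than its loss at test time.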

Dropout

When the model is very large, weight regularization has little effect because the model is so over-parameterized that constraining weight values barely reduces its capacity. Instead, we use another method called dropout: during training, we randomly set to zero a number of the layer's output features. The dropout rate is the fraction of the features that are zeroed out; it is usually set between 0.2 and 0.5. At test time nothing is dropped; instead, the layer's output values are scaled down by a factor of \(1 - r\), where \(r\) is the dropout rate, to compensate for the fact that more units are active than during training.

For example, if we have an output from a layer layer_output of shape (batch_size, features), we zero out about half of the values (a dropout rate of 0.5) at training time as follows:

layer_output *= np.random.randint(0, high=2, size=layer_output.shape)

At test time we scale the output of the layer down by \(1 - r\), which for a dropout rate of \(r = 0.5\) is also 0.5:

layer_output *= 0.5

We can do both operations at training time and leave the output unchanged at test time by scaling up the outputs that are not set to zero by \(\frac{1}{1-r}\), where \(r\) is the dropout rate (Keras uses this technique to implement dropout):

layer_output *= np.random.randint(0, high=2, size=layer_output.shape)
layer_output *= 1/(1-0.5)
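
The same idea can be written for an arbitrary dropout rate. The sketch below is a simplified NumPy illustration, not the Keras implementation; the function signature and the seeded random generator are assumptions made for the example.

```python
import numpy as np

def dropout(layer_output, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of values and rescale the rest."""
    if not training:
        return layer_output            # test time: the output is left unchanged
    keep_mask = rng.random(layer_output.shape) >= rate
    return layer_output * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)         # seeded only to make the sketch repeatable
activations = np.ones((1000, 100))
dropped = dropout(activations, rate=0.3, training=True, rng=rng)
```

About 30% of the values in `dropped` are zero and the rest are scaled up by 1/0.7, so the average activation stays close to its original value of 1.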

In Keras we add a Dropout layer, which is applied to the output of the layer just before it:

model = keras.Sequential([
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid")
])

Machine learning workflow

Define the task
- Frame the problem
- Collect a dataset
- Understand the data
- Choose a measure of success
Develop a model
- Prepare the data
- Choose an evaluation protocol
- Beat a baseline
- Develop a model that overfits
- Regularize and tune the model
Deploy the model
- Explain the work to stakeholders and set expectations
- Ship an inference model
- Monitor the model in the wild

Define the task by asking the following questions:

Why is there a need to solve this problem?
What value can be obtained by solving this problem?
What data is available?
What kind of machine learning technique may be applied to the problem?

Caveats

non-representative data: production data differs too much from the training data
concept drift: properties of production data change over time
sampling bias: data collection process is biased and therefore does not represent the population
target leakage: presence of features in the data that provide information about the targets and that may not be available in production

Choose metrics. For balanced classification problems, where every class is equally likely, accuracy and the area under the receiver operating characteristic curve (ROC AUC) are common metrics.
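
Both metrics can be computed by hand for a small example. The sketch below uses the pairwise definition of ROC AUC (the probability that a randomly chosen positive is scored above a randomly chosen negative), which is quadratic in the number of samples and meant only for illustration; the labels and scores are hypothetical.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)

def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5           # ties count as half a win
    return wins / (len(positives) * len(negatives))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]                       # hypothetical model outputs
predictions = [1 if s >= 0.5 else 0 for s in scores]  # threshold at 0.5
acc = accuracy(labels, predictions)
auc = roc_auc(labels, scores)
```

Accuracy depends on the chosen threshold, while ROC AUC depends only on how the scores rank positives against negatives.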

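Data preparation includes:
- data vectorization: turning data into numeric vectors
- data normalization: features should take small values (typically in the range [0,1]) and be homogeneous (all features in roughly the same range). A common, stricter approach is to normalize each feature independently so that it has a mean of 0 and a standard deviation of 1: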

x -= x.mean(axis=0)
x /= x.std(axis=0)
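
When there is a separate test set, the mean and standard deviation should be computed on the training data only and then reused for the test data, so that no test information leaks into the workflow. A minimal sketch, with hypothetical function names and toy arrays:

```python
import numpy as np

def fit_normalizer(train_x):
    """Compute per-feature statistics on the training data only."""
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0)
    std[std == 0] = 1.0   # avoid dividing by zero for constant features
    return mean, std

def normalize(x, mean, std):
    """Apply previously computed statistics to any data set."""
    return (x - mean) / std

train_x = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test_x = np.array([[2.0, 400.0]])
mean, std = fit_normalizer(train_x)
train_norm = normalize(train_x, mean, std)
test_norm = normalize(test_x, mean, std)   # reuses the training statistics
```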

Handling missing values. If the feature is categorical, create a new category that means "the value is missing". If the feature is numerical, replace the missing value with the average or median value instead of zero. If missing values are expected in the test data, be sure to train with missing values as well.
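
Both strategies can be sketched as follows, assuming missing numerical values are encoded as NaN and missing categorical values as None. The function names and the "missing" label are illustrative choices.

```python
import numpy as np

def impute_numerical(train_x, x):
    """Replace NaNs in `x` with per-feature medians computed on `train_x`."""
    medians = np.nanmedian(train_x, axis=0)
    return np.where(np.isnan(x), medians, x)

def impute_categorical(values, missing_label="missing"):
    """Map missing categorical values to a dedicated category."""
    return [missing_label if value is None else value for value in values]

# Hypothetical data with one missing value in each feature type.
train_x = np.array([[1.0, np.nan], [3.0, 10.0], [5.0, 20.0]])
filled = impute_numerical(train_x, train_x)
colors = impute_categorical(["red", None, "blue"])
```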

TensorFlow Serving allows deploying a model on a server behind a REST API. TensorFlow Lite allows deploying models on mobile or embedded devices. TensorFlow.js is a JavaScript library for running models in a web browser.