Implementation

Network class

We will be using MNIST data which consists of 60,000 training images and 10,000 test images. We will further separate 60,000 training set into two parts: a set of 50,000 images which we will use to train our network, and a separate 10,000 images validation set. We will use validation set later to figure out how to set certain hyper-parameters of the neural network, such as the learning rate which are not directly selected by our learning algorithm. We will also need Numpy library in Python which will allow us to do fast linear algebra operations.

The center of the neural network code is the Network class, which we use to represent a neural network.

import numpy as np

class Network(object):
    def __init__(self, sizes, seed=None):
        """The list ``sizes`` contains the number of neurons in the
        respective layers of the network. For example, if the list
        was [3, 4, 2] then it would be a three-layer network, with the
        first layer containing 3 neurons, the second layer 4 neurons,
        and the third layer 2 neurons. The biases and weights for the
        network are initialized randomly, using a Gaussian
        distribution with mean 0 and variance 1. Note that the first
        layer is assumed to be an input layer, and by convention we
        won't set any biases for those neurons, since biases are only
        ever used in computing the outputs from later layers. Integer
        ``seed`` may be provided to achieve reproducible results."""
        self.rng = np.random.default_rng(seed)
        self.num_layers = len(sizes)
        self.biases = [self.rng.standard_normal(size=(j, 1), dtype=np.float64)
            for j in sizes[1:]]
        self.weights = [self.rng.standard_normal(size=(j, k), dtype=np.float64)
            for j, k in zip(sizes[1:], sizes[:-1])]

In the code, sizes is a list that contains the number of neurons in respective layers. For example, if we want to create a Network() object with 3 neurons in the first layer, 4 neurons in the second layer, and 2 neurons in the third layer, we will initialize the class object as following:

net = Network([3, 4, 2])

Bias and activation notation

The biases and weights are initialized randomly by using the Numpy rng.standard_normal() function. Later we will find better ways to initialize biases and weights. We assume that the first layer in the network is the input layer, and we omit setting any biases or weights for those neurons, since biases and weights are only ever used in computing the outputs from later layers.

We store biases and weights in Numpy arrays. For example, net.weights[1] may store the following weights connecting the second and the third layers of neurons ($w^3$):

     k=1     k=2      k=3     k=4
[[ 0.3490, 0.2552,  1.7718, -0.6382 ],  j=1
 [ 1.5340, 0.2998, -1.0761, -1.4656 ]]  j=2

net.weights[1] is a matrix such that $w_{jk}$ is the weight for the connection between the $k^{\text{th}}$ neuron in the second layer and the $j^{\text{th}}$ neuron in the third layer. The advantage of this ordering is that we can multiply two matrices, weights $w^l$ and activations $a^{l-1}$ directly (recall Equation (7)):

$$ \begin{align} a^{l} = \sigma(w^l a^{l-1}+b^l). \end{align} $$

To illustrate how this works, consider an example of a vector of activations in the second layer ($a^2$):

     k=1
[[0.2871],  j=1
 [0.7834],  j=2
 [0.6244],  j=3
 [0.6702]]  j=4

A vector of biases in the third layer ($b^3$) may look like this:

    k=1
[[-1.3518],  j=1
 [-2.1354]]  j=2

From these vectors and matrices it is easy to see how we can compute a vector of activations for the third layer by using simple matrix multiplication from Equation (7). Note that the sigmoid function $\sigma$ is applied elementwise to the resulting weighted output vector $z^l = w^l a^{l-1}+b^l$. The vector of activations for the third layer ($a^3$) will be the following:

    k=1
[[0.4078],  j=1
 [0.0425]]  j=2

After calculating the weighted input $z$, it is easy to write the code computing the output from a Network() instance. We begin by defining the sigmoid function:

@staticmethod
def sigmoid(z):
    """Returns the sigmoid function."""
    return 1.0/(1.0+np.exp(-z)) # Equation (3)

Note that when the input z is a vector or Numpy array, Numpy automatically applies the function sigmoid() elementwise in vectorized form.

After that we define feedforward() method for the Network() class, which returns an output vector from a given input for each layer:

def feedforward(self, x):
    """Returns a list of ``activations`` in each layer in the network
    for one example."""
    activation = x
    # list to store all the activations, layer by layer
    activations = [x]
    for b, w in zip(self.biases, self.weights):
        z = np.matmul(w, activation) + b # Equation (8)
        activation = self.sigmoid(z) # Equation (7)
        activations.append(activation)
    return activations

Stochastic gradient descent

The main thing that we want our Network() to do is to learn. We will give it an SGD() method which implements stochastic gradient descent:

def SGD(self, train_data, epochs, mini_batch_size, eta,
        test_data=None):
    """Trains the neural network using mini-batch stochastic
    gradient descent. The ``train_data`` is a list of tuples
    ``(x, y)`` representing the training inputs and the
    corresponding desired outputs."""
    n = len(train_data)
    for j in range(epochs):
        # divide train data into mini-batches randomly
        self.rng.shuffle(train_data)
        mini_batches = [train_data[m:m+mini_batch_size]
            for m in range(0, n, mini_batch_size)]
        for mini_batch in mini_batches:
            self.update_mini_batch(mini_batch, mini_batch_size, eta)
        # evaluate metrics
        if test_data:
            correct = self.evaluate(test_data)
            print(f"Epoch {j} : {correct} / {len(test_data)}")
        else:
            print(f"Epoch {j} complete")

The train_data is a list of tuples (x, y), representing the training inputs and corresponding desired outputs. The variables epochs and mini_batch_size are self-explanatory. eta is the learning rate $\eta$. If the optional argument test_data is supplied, then the program will evaluate the network after each epoch of training, and print out partial progress.

The code works as follows. We start each epoch by randomly shuffling the training data, and then we partition data into mini-batches of the appropriate size. After that for each mini-batch we apply a single step of gradient descent. This is done by update_mini_batch(), which updates the network weights and biases according to a single iteration of gradient descent, using just the training data in the mini_batch. Here is the code for the update_mini_batch() method:

def update_mini_batch(self, mini_batch, mini_batch_size, eta):
    """Updates the network's weights and biases by applying
    gradient descent using backpropagation to a single mini-batch.
    The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
    is the learning rate."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    self.biases = [b-(eta/mini_batch_size)*nb
                    for b, nb in zip(self.biases, nabla_b)] # Equation (25)
    self.weights = [w-(eta/mini_batch_size)*nw
                    for w, nw in zip(self.weights, nabla_w)] # Equation (24)

Most of the work is done by the backpropagation algorithm to figure out partial derivatives $\frac{\partial C_x}{\partial b^l}$ and $\frac{\partial C_x}{\partial w^l}$. So update_mini_batch() works simply by obtaining these gradients from backprop() method for every training example in the mini_batch, summing them up, and then updating self.weights and self.biases appropriately.

Backpropagation

Backpropagation code starts with initializing zero matrices with the shape of the bias and weight matrices. Then we do the feedforward computation to calculate the weighted sums and activations for each layer. After that we compute partial derivatives for the output layer. And finally we go backward and compute the partial derivatives for all the previous layers in the network and return the results.

def backprop(self, x, y):
    """Returns a tuple ``(nabla_b, nabla_w)`` representing the
    gradient for the cost function C_x for one example. ``nabla_b``
    and ``nabla_w`` are layer-by-layer lists of numpy arrays,
    similar to ``self.biases`` and ``self.weights``."""
    nabla_b = [np.zeros(b.shape) for b in self.biases]
    nabla_w = [np.zeros(w.shape) for w in self.weights]
    activations = self.feedforward(x)
    # compute dC/dz, dC/db, and dC/dw for the output layer
    dCda = self.cost_derivative(activations[-1], y) # 1st term of BP1
    sp = self.sigmoid_prime(activations[-1]) # 2nd term of BP1
    delta = np.multiply(dCda, sp) # BP1
    nabla_b[-1] = delta # BP3
    nabla_w[-1] = np.matmul(delta, activations[-2].transpose()) # BP4
    # compute dC/dz, dC/db, and dC/dw for all the previous layers
    for l in range(2, self.num_layers):
        d = np.matmul(self.weights[-l+1].transpose(), delta) # 1st term of BP2
        sp = self.sigmoid_prime(activations[-l]) # 2nd term of BP2
        delta =  np.multiply(d, sp) # BP2
        nabla_b[-l] = delta # BP3
        nabla_w[-l] = np.matmul(delta, activations[-l-1].transpose()) # BP4
    return (nabla_b, nabla_w)

We will also need two derivatives: cost and sigmoid functions for backpropagation and feedforward procedures.

@staticmethod
def cost_derivative(output_activations, y):
    """Returns the vector of partial derivatives dC_x/da
    for the output activations."""
    return (output_activations - y) # Equation (39)

def sigmoid_prime(self, activation):
    """Returns the derivative of the sigmoid function."""
    return activation*(1-activation) # Equation (40)

The evaluate procedure returns the number of correctly identified examples for the testing dataset.

def evaluate(self, test_data):
    """Returns the number of correctly identified examples."""
    test_results = []
    for x, y in test_data:
        activations = self.feedforward(x)
        test_results.append((np.argmax(activations[-1]), y))
    correct = sum(int(x == y) for (x, y) in test_results)
    return correct

Loading MNIST data

We are going to use a helper program to load the MNIST data:

import gzip
import numpy as np

def load_images(file_name):
    """Returns a Numpy array of the images for MNIST data set."""
    with gzip.open(file_name, 'r') as f:
        # first 4 bytes is a magic number
        # MSB is at the beginning of the byte array
        magic_number = int.from_bytes(f.read(4), byteorder='big', signed=False)
        # second 4 bytes is the number of images
        image_count = int.from_bytes(f.read(4), byteorder='big', signed=False)
        # third 4 bytes is the row count
        row_count = int.from_bytes(f.read(4), byteorder='big', signed=False)
        # fourth 4 bytes is the column count
        column_count = int.from_bytes(f.read(4), byteorder='big', signed=False)
        # rest is the image pixel data, each pixel is stored as an unsigned byte
        # pixel values are 0 to 255
        image_data = f.read()
        images = np.frombuffer(image_data, dtype=np.uint8)\
            .reshape((image_count, row_count, column_count))
        return images

def load_labels(file_name):
    """Returns a Numpy array of the labels for MNIST data set."""
    with gzip.open(file_name, 'r') as f:
        # first 4 bytes is a magic number
        # MSB is at the beginning of the byte array
        magic_number = int.from_bytes(f.read(4), byteorder='big', signed=False)
        # second 4 bytes is the number of labels
        label_count = int.from_bytes(f.read(4), byteorder='big', signed=False)
        # rest is the label data, each label is stored as unsigned byte
        # label values are 0 to 9
        label_data = f.read()
        labels = np.frombuffer(label_data, dtype=np.uint8)
        return labels

def load_train_data():
    """Returns the training dataset with specified filenames."""
    train_images = load_images('data/train-images-idx3-ubyte.gz')
    train_labels = load_labels('data/train-labels-idx1-ubyte.gz')
    return (train_images, train_labels)

def load_test_data():
    """Returns the test dataset with specified filenames."""
    test_images = load_images('data/t10k-images-idx3-ubyte.gz')
    test_labels = load_labels('data/t10k-labels-idx1-ubyte.gz')
    return (test_images, test_labels)

def vectorize(labels):
    """Return a 2-dimensional tensor with the second vector
    consisting of 10-dimensional unit vector with a 1.0 in the jth
    position and zeroes elsewhere. This is used to convert a digit
    (0...9) into a corresponding desired output from the neural
    network."""
    result = np.zeros((len(labels), 10, 1), dtype=np.uint8)
    for i in range(len(labels)):
        j = labels[i]
        result[i, j] = 1
    return result

def load_data():
    """Returns preprocessed MNIST dataset for the neural network. ``train_data``
    is a list of tuples (x_train, y_train) representing the training inputs and
    the corresponding labels. len(train_data)=60000. ``test_data`` is a list of
    tuples (x_test, y_test) representing the testing inputs and the
    corresponding labels. len(test_data)=10000. Testing labels are not
    vectorized. x_train.shape=(784, 1), y_train.shape=(10, 1),
    x_test.shape=(784, 1), y_test is an integer."""

    (train_images, train_labels) = load_train_data()
    (test_images, test_labels) = load_test_data()
    # preprocess train data
    train_images = train_images.reshape((60000, 28*28, 1))
    train_images = train_images.astype('float64') / 255
    train_labels = vectorize(train_labels)
    train_data = zip(train_images, train_labels)
    train_data = list(train_data)
    # preprocess test data (labels are not vectorized)
    test_images = test_images.reshape((10000, 28*28, 1))
    test_images = test_images.astype('float64') / 255
    test_data = zip(test_images, test_labels)
    test_data = list(test_data)
    return (train_data, test_data)

The code is available at my GitHub repository.

Running the code

First we will load the MNIST dataset by executing the following command:

import mnist
(train_data, valid_data, test_data) = mnist.load_data()

After loading the MNIST data, we will set up a Network() with a single hidden layer consisting of 15 hidden neurons. For the testing purposes, we can pass a seed value to make sure that the random numbers are initialized from the same pool for different Network() models:

import network
net = Network([784, 15, 10], seed=0)

Finally, we will use stochastic gradient descent to learn from MNIST train_data over 30 epochs, with a mini-batch size of 10, and a learning rate of $\eta = 3.0$:

net.SGD(train_data, 30, 10, 3.0, test_data)

The transcript of learning progress shows us that we started with 9,007 correctly recognized images out of 10,000 at the first epoch, reaching to 9,348 out of 10,000 at the end of the last epoch.

Epoch 0 : 8881 / 10000
Epoch 1 : 9080 / 10000
Epoch 2 : 9144 / 10000
...
Epoch 27 : 9369 / 10000
Epoch 28 : 9334 / 10000
Epoch 29 : 9361 / 10000

We can change the number of hidden neurons to 100 and see what happens.

net = Network([784, 100, 10], seed=0)
net.SGD(train_data, 30, 10, 3.0, test_data)

Unfortunately, the performance has dropped to 87.26 percent:

Epoch 0 : 7488 / 10000
Epoch 1 : 7613 / 10000
Epoch 2 : 7655 / 10000
...
Epoch 27 : 8747 / 10000
Epoch 28 : 8751 / 10000
Epoch 29 : 8747 / 10000

Adjusting hyper-parameters

To obtain these accuracies we made specific choices for the number of epochs of training, the mini-batch size, and the learning rate $\eta$. These are called hyper-parameters for our neural network, which are distinct from the parameters (weights and biases) learned by our learning algorithm. If we choose our hyper-parameters poorly, we can get bad results. For example, if we choose the learning rate to be $\eta = 0.001$,


net = Network([784, 15, 10], seed=0)
net.SGD(train_data, 30, 10, 0.001, test_data)

the results will be much less encouraging:

Epoch 0 : 816 / 10000
Epoch 1 : 1040 / 10000
Epoch 2 : 1217 / 10000
...
Epoch 27 : 2438 / 10000
Epoch 28 : 2487 / 10000
Epoch 29 : 2532 / 10000

In general, debugging a neural network can be challenging when initial choice of hyper-parameters produces results no better than random noise. Suppose we try to set the learning rate to $\eta = 100$:

net = Network([784, 15, 10], seed=0)
net.SGD(train_data, 30, 10, 100, test_data)

The results show that we have gone too far, and our learning rate is too high:

Epoch 0 : 958 / 10000
Epoch 1 : 958 / 10000
Epoch 2 : 890 / 10000
...
Epoch 27 : 890 / 10000
Epoch 28 : 890 / 10000
Epoch 29 : 890 / 10000

Now if we were coming to this problem for the first time, the output would not tell us much about how we can improve the results. We might worry not only about the learning rate, but about every other aspect of our neural network:

Did we initialized the weights and biases in the way that makes it hard for the network to learn?
Maybe we don't have enough training data to get meaningful learning?
Perhaps we haven't run for enough epochs?
Maybe it's impossible for a neural network with this architecture to learn to recognize handwritten digits?
Learning rate is too high?
Learning rate is too low?

Adapted from Michael A. Nielsen, "Neural Networks and Deep Learning"