Results for Training Neural Networks with Bayesian Linear Regression

In this post I will demonstrate results from training neural nets using approximate Bayesian updating. The algorithm is described here, and the code for everything can be found here.

I will first demonstrate that this method is capable of actually learning something. For this I will teach a neural network to fit a simple sine wave.

Then I will look at a slightly more complicated function, fitting a radial cosine. For this example the input data is 2-dimensional, consisting of the $x$ and $y$ coordinates.

Finally, I will show this algorithm performing surprisingly well on MNIST. I did not expect this to perform anywhere near as well as it did, and assumed that it would fail entirely on any remotely complicated classification task.

Fitting a Sine Wave

As a proof of concept, I fit a sine wave (with noise) using several different options for the hidden layers. Each data point was only used once.

The blue dots are the training data, the orange line is the output of the network, calculated for each point along the x-axis.
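
To make the setup concrete, here is a minimal sketch of how noisy sine-wave training data can be generated and fed to the network one point at a time. The `BayesianNet` class and its `update` method are hypothetical stand-ins for the trainer in the linked repo, and the sample count and noise level are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy sine-wave training data (sample count and noise scale are assumptions)
n_samples = 2000
x = rng.uniform(-np.pi, np.pi, size=(n_samples, 1))    # 1D inputs
y = np.sin(x) + 0.1 * rng.normal(size=(n_samples, 1))  # noisy targets

# net = BayesianNet(hidden_sizes=[64, 64])  # hypothetical interface
# for xi, yi in zip(x, y):                  # each data point is used exactly once
#     net.update(xi, yi)
```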

Fitting a Radial Cosine

For more complicated data I fit a radial cosine (with noise). The input data was the $(x, y)$ coordinates. This was fit both using data where the coordinates came from a regular grid which was then shuffled, and using data where the points were sampled randomly. The hidden layers had sizes 128, 256, 256, 128.

On the left are the radial cosine functions used for training, with the red dots representing the points used for training. On the right are the learned functions.
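
As a rough illustration, here is how the two kinds of radial-cosine training data can be generated: a shuffled regular grid and uniformly random points. The grid resolution, domain, and noise level are assumptions, not values from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def radial_cosine(xy, noise=0.05):
    """Noisy radial cosine: cosine of the distance from the origin."""
    r = np.linalg.norm(xy, axis=1, keepdims=True)
    return np.cos(r) + noise * rng.normal(size=(len(xy), 1))

# Option 1: regular grid of (x, y) coordinates, shuffled before training
g = np.linspace(-2 * np.pi, 2 * np.pi, 50)
grid_xy = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
rng.shuffle(grid_xy)

# Option 2: points sampled uniformly at random over the same square
rand_xy = rng.uniform(-2 * np.pi, 2 * np.pi, size=(2500, 2))

grid_targets = radial_cosine(grid_xy)
rand_targets = radial_cosine(rand_xy)
```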

MNIST Classification

To test whether this method works for training a classifier at all, I tried it on MNIST. There are a couple of important things to consider here: I am not using a CNN, just a standard network of dense linear layers with ReLU activation functions; and the output layer is not a sigmoid function, it is an identity function. Because the output layer is not a sigmoid, the numbers it outputs will not be probabilities, which will probably hurt the performance of the classifier.
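
To be explicit about the architecture, here is a minimal sketch of the forward pass: dense layers with ReLU activations and a plain identity output. Biases are omitted for simplicity, which is an assumption about the repo's setup.

```python
import numpy as np

def forward(x, weight_mats):
    """x: flattened image of length 784; weight_mats: list of weight matrices."""
    h = x
    for W in weight_mats[:-1]:
        h = np.maximum(0.0, h @ W)  # ReLU on the hidden layers
    return h @ weight_mats[-1]      # identity output: raw scores, not probabilities
```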

The setup used here is that 50,000 images from the MNIST dataset are flattened into 1D input vectors of length 784, and the corresponding correct outputs are one-hot vectors of length 10. These networks are effectively mapping from a 784-dimensional space to a 10-dimensional space. It seems like adding some convolutional layers would dramatically improve performance; I may try this later.
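
A minimal sketch of this data preparation, assuming NumPy and that the raw MNIST images and labels are already loaded; the pixel scaling is an assumption.

```python
import numpy as np

def prepare_mnist(images, labels):
    """images: (N, 28, 28) uint8 array; labels: (N,) integer array."""
    X = images.reshape(len(images), 784).astype(np.float64) / 255.0  # flatten to length-784 vectors
    Y = np.eye(10)[labels]                                           # one-hot targets of length 10
    return X, Y
```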

I tried various options for the hidden layers, as shown below. There was no hyperparameter tuning, and each sample was only used once. During training I saved a copy of the network after every 1000 samples, and then tested each of these copies on a test dataset of 10,000 MNIST images.
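
Here is a rough sketch of that evaluation cadence. The `net` object and its `update`/`predict` methods are hypothetical stand-ins for the repo's interface; only the checkpoint-every-1000-samples logic and the accuracy calculation come from the text above.

```python
import copy
import numpy as np

def test_accuracy(net, X_test, labels_test):
    """Fraction of test images whose highest-scoring output matches the label."""
    scores = net.predict(X_test)                     # (10000, 10) raw outputs
    return np.mean(np.argmax(scores, axis=1) == labels_test)

def train_with_checkpoints(net, X_train, Y_train, X_test, labels_test):
    history = []
    for i, (xi, yi) in enumerate(zip(X_train, Y_train), start=1):
        net.update(xi, yi)                           # each sample used exactly once
        if i % 1000 == 0:                            # save a copy every 1000 samples
            snapshot = copy.deepcopy(net)
            history.append((i, test_accuracy(snapshot, X_test, labels_test)))
    return history
```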

Additionally, I thought it would be interesting to look at the $\Lambda$ matrices for each of the weight matrices. $\Lambda$ represents the precision of each weight matrix, and so its (pseudo) determinant1 should give us an idea of the precision or certainty we have in our weight matrices. The inverse of the determinant will give us some idea of the variance. We expect that $\text{det}(\Lambda)$ will increase over time as we become more ‘sure’ about the values of the weights.
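
For reference, the log of the pseudo-determinant can be computed from the eigenvalues of $\Lambda$: since the pseudo-determinant is the product of the non-zero eigenvalues, its log is the sum of their logs. This is a sketch assuming NumPy and a symmetric $\Lambda$; the tolerance for treating an eigenvalue as zero is an arbitrary choice.

```python
import numpy as np

def log_pseudo_det(Lam, tol=1e-12):
    """Log of the product of the non-zero eigenvalues of a symmetric matrix."""
    eigvals = np.linalg.eigvalsh(Lam)    # eigenvalues of symmetric Lambda
    nonzero = eigvals[eigvals > tol]     # drop (numerically) zero eigenvalues
    return np.sum(np.log(nonzero))
```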

Results from training different networks on MNIST. On the left is the performance as a function of the number of samples. On the right is the logarithm of the (pseudo) determinant of $\Lambda$, which is strictly increasing in every case. We use the logarithm because otherwise the numbers are too big to represent as floating point numbers. As an observation, the $\text{det}(\Lambda)$ curves for the first layer of each network all look very similar.

The results were surprisingly good; the best model (with one hidden layer of 256 neurons) achieved an accuracy of over 91% on the test dataset. This model was already achieving above 82% after 1000 samples. Larger, more complicated models didn’t achieve better accuracy. One of the largest models (bottom row of the figure above) seemed to have quite unstable training. As expected, the determinant of $\Lambda$ for each of the layers increased monotonically, meaning there was higher precision in the weights.

I also investigated the effect of adding additional layers. I trained multiple neural networks with the number of hidden layers ranging from 1 to 6, each hidden layer having 256 neurons. In general, more layers led to worse performance.

Performance for multiple different networks with different numbers of hidden layers (each with 256 neurons). The smaller networks performed better.

Seeing these results, I also tried models with a single hidden layer of varying sizes. A larger hidden layer led to better final test accuracy; the model with a hidden layer of 1024 neurons achieved a test accuracy of 95%.

Performance for various networks with a single hidden layer of differing sizes. A larger single hidden layer led to better final performance.

Hyperparameter Discussion

All the examples above were trained using a single pass of the data, taking $c=1$ (as defined in the previous post). This means that we are ‘updating all the way’ for each piece of data. This could be thought of as training for one epoch. From here the only parameters to choose are the initialization for the weights and the precision $\Lambda$ for all the weight matrices.

The Bayesian Multivariate Linear Regression makes the assumption that the error is normally distributed and that the weights are drawn from a matrix normal distribution. Therefore it seems like we should make our initialization for the weights normally distributed. The initialized weights are drawn from a normal distribution centered on 0, with a standard deviation of $\sigma=\sqrt{6/(n_{rows}+n_{columns})}$. This is ‘inspired’ by Xavier Normal initialization, pulled from here, because I misread it a bit. But this seems to work better than standard Xavier Normal initialization.
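
As a sketch, the initialization described above looks something like this (assuming NumPy):

```python
import numpy as np

def init_weights(n_rows, n_columns, rng=np.random.default_rng()):
    """Normal initialization with sigma = sqrt(6 / (n_rows + n_columns))."""
    sigma = np.sqrt(6.0 / (n_rows + n_columns))
    return rng.normal(loc=0.0, scale=sigma, size=(n_rows, n_columns))
```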

The precision/uncertainty $\Lambda$ was a bit trickier. After some fiddling around, and squinting at what role $\Lambda^{-1}$ plays in the matrix normal distribution ($\Lambda^{-1}$ is the covariance between the rows of the weight matrix), I settled on the identity divided by the number of rows of $\Lambda$:
$$\Lambda_{init}=\mathbb{1}/n_{rows}$$
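
In code, this initialization is simply a scaled identity matrix (a sketch, assuming NumPy):

```python
import numpy as np

def init_precision(n_rows):
    """Initial precision: the identity divided by the number of rows of Lambda."""
    return np.eye(n_rows) / n_rows
```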

A bigger $\Lambda$ corresponds to more regularization: the prior on the weights is more precise. So the tradeoff is that if $\Lambda$ is too big the weights don’t update enough to learn the function quickly and stay close to their initialized values, while if $\Lambda$ is too small then the early pieces of training data may have too large an impact and ‘anchor’ the training. This point about $\Lambda$ being too small is mostly just a guess, but it seems to be correct in practice. There are some constraints on $\Lambda$, mainly that $\Lambda^{-1}$ must be positive-definite, so all its eigenvalues must be positive. Because of this, something proportional to the identity matrix seemed like the easiest choice.


Footnotes

  1. The determinant of a matrix is equal to the product of the eigenvalues, but $\Lambda$ may have eigenvalues equal to 0, and so we use the pseudo determinant, which is equal to the product of the non-zero eigenvalues.
