Saturday 18 September 2021

Implementing word2vec on the GPU

Some time ago I finished my master's thesis on word2vec. For the thesis I used the gensim implementation of word2vec, which is well known and robust. The thesis was not really about implementing word2vec; I was working on my own implementation on the side, but did not have time to complete it, as it was not a priority.

After taking a small break from everything word2vec-related, I recently finished my own implementation. The code is here: https://github.com/tsaastam/myw2v

I wanted to specifically implement word2vec on the GPU, unlike the original reference code and gensim, which both run on the CPU only. One reason I wanted to do this was that I couldn't find many GPU implementations of word2vec. There's a paper by Gupta & Khare, BlazingText: Scaling and Accelerating Word2Vec using Multiple GPUs, where a multi-GPU implementation is discussed, but it's a closed-source proprietary system owned by Amazon. There are some other papers as well, but I couldn't find an established open-source GPU implementation of word2vec. There are several partial GPU implementations to be found, but none of them seems to be complete or particularly performant.

A straightforward way to write a GPU implementation would be to use one of the modern machine learning libraries, such as TensorFlow or PyTorch. However, these libraries (at least by default) use a "synchronous" implementation of backpropagation, whereas word2vec has always been implemented in a "non-thread-safe" manner instead; see the gensim explanation for example. The difference is that the latter implementation style, i.e. so-called asynchronous stochastic gradient descent, is massively faster, at least in the case of word2vec. I did not in fact check how much the "accuracy" suffers from this thread-unsafety, but my GPU implementation seems to be reasonably competitive, scoring some 10% lower on the word analogy task than gensim on the same material while being noticeably faster to train.

In more detail, the common "synchronous" method of training is basically: take a small batch of input-output pairs (or similar), compute the loss gradient for all of these in parallel, take the average and adjust the weights by that, then repeat. But it seems that the backpropagation part of word2vec is simple enough that this way of computing it is not very efficient, whether on GPU or CPU. Basically the gradient involves two vectors of length D (embedding dimensionality, usually 100-300 or so) and their dot product, and that's about it. Now one can calculate, say, one thousand such backpropagations in parallel (synchronously), but - based on my experiments at least - it seems that this involves too much overhead compared to the actual duration of the backpropagation calculation. One needs to generate a bunch of input-output pairs, put these in some kind of array, transfer the array to the GPU (if using a GPU), run the parallel calculation, then perhaps synchronise the GPU "threads" and copy the resulting data back, and then calculate the average and adjust. The problem is that the calculation takes, probably, less time than all the generating and copying and averaging. (I didn't measure the individual durations of these things very accurately, but basically, an implementation like this in practice is slow as molasses.)
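
To make the "two vectors and a dot product" part concrete, here is a minimal numpy sketch of the kind of per-pair update involved (skip-gram with negative sampling); the array names and the learning rate are illustrative, not taken from my actual CUDA code, but the arithmetic per pair is this small:

import numpy as np

def sgns_update(W_in, W_out, center, context, label, lr=0.025):
    """One skip-gram negative-sampling update for a single (center, context) pair;
    label is 1 for a true context word and 0 for a negative sample."""
    v = W_in[center]                                  # input-side vector, length D
    u = W_out[context]                                # output-side vector, length D
    g = lr * (label - 1.0 / (1.0 + np.exp(-v @ u)))   # scalar: lr * (label - sigmoid(v.u))
    dv, du = g * u, g * v                             # compute both updates first...
    W_in[center] += dv                                # ...then apply them immediately,
    W_out[context] += du                              # with no batching or averaging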

One could also generate the input-output pairs on the GPU itself so that there's less copying and transferring, and then use them in a batched-and-averaged way. I was tempted to try this, and made some preliminary attempts, but the amount of bookkeeping required makes this annoying. It is much more straightforward to instead do what I ended up doing: as soon as each input-output pair is generated from the input data (on the GPU), it is immediately used in the backpropagation step, and that's it. This way one does not need to wait for a certain number of input-output pairs to be generated first - how would you know (in your CUDA code) that all the items in your "n input-output pairs" batch array are now generated successfully? - and then run the next step of the calculation on them, then average, and go back to the generation phase. Of course, the immediate, completely unsynchronised backpropagation step means that some updates are lost - see "asynchronous" above.

It is interesting that this method works fine, despite its utterly non-thread-safe nature. But this is probably not all that surprising. If one were to calculate, precisely and synchronously, a batch of say 128 gradients, and then average them and use that as the adjustment, would not some of these 128 gradients (namely the more extreme ones) be "lost" as well in the averaging? Obviously with the asynchronous algorithm the lost values are more or less random, rather than being the most extreme values; and it is somewhat tricky to ascertain even what fraction of the updates were lost; but it turns out that this does not matter all that much. The dot products still get nicely trained towards where we want them, on average.

Here's to word2vec, a truly optimised algorithm.

P.S. I did not experiment with different ways of dividing up the input material across CUDA thread blocks, as Gupta and Khare do in their paper. That could be one avenue of optimising and improving my code. And of course, proper "online" batching of data into GPU memory, rather than loading it in all at once and then calculating all at once on the GPU, would be a significant improvement one would want in a real production system.

Monday 30 August 2021

Simple Wikipedia parsing

I'm still not sure how to completely parse Wikipedia data, but I think it involves installing a local copy of the MediaWiki software and a proper database, so that you can deal with all the special syntax, links to other pages, categories, templates etc.

However, just extracting the text of Wikipedia pages seems to be pretty easy. One way is to use gensim's Wikipedia utility:

from gensim.scripts import segment_wiki

segment_wiki.segment_and_write_all_articles(full_path, full_out_path)

Here "full_path" is the path to a Wikipedia XML dump file, which is bz2-compressed XML, such as one of the .bz2 files from here. This will turn the XML dump into a .json.gz dump which is a little bit easier to work with. You can use gensim's "utils" package to open the .json.gz file and read article by article:

import json
from gensim import utils

# json_gz_file: the .json.gz file produced by segment_and_write_all_articles above
with utils.open(json_gz_file, 'rb') as jf:
    for line in jf:
        article = json.loads(line)
        do_something_with(article["title"])
        for section_title, section_text in zip(article['section_titles'],
                                               article['section_texts']):
            do_something_with(section_title)
            do_something_with(section_text)
See here for full example code (albeit with over-simplistic parsing of the resulting text). Note that the code does some other things as well.

Sunday 8 August 2021

Word2vec and overfitting

After getting my bachelor's in statistics in 2016, I continued studying and finished my master's in late 2020. My master's thesis was titled Word2vec and its application to examining the changes in word contexts over time (direct download link here).

An excerpt from the thesis follows.

[----]

-- Nevertheless, an evaluation of the "closeness" or "packed-togetherness" of the word2vec clustering can be attempted. It turns out that models trained with different dimensionality of embeddings, all other parameters and the data set being the same, end up with embeddings where the average similarity between each word and its nearest neighbour, as measured by cosine similarity, differs. More precisely, given a model with embedding vectors $v_i$, denote by $\hat{v}_i$ the embedding vector that is closest to $v_i$ as measured by cosine similarity, i.e. $\hat{v}_i := \arg \max_{v_j, j \neq i} \mathrm{cos}(v_i, v_j)$. We examine how the distribution of these nearest-neighbour similarities $\mathrm{cos}(v_i, \hat{v}_i)$ varies for models trained with different dimensionality of embeddings, the source material and all other hyperparameters being the same.

For an arbitrarily selected year, namely Yle corpus 2014, this distribution of closest-neighbour similarity by embedding dimension is depicted in figure 3.8. The average and the central 95-percentile range (i.e. from 2.5% to 97.5%) of the similarities $\mathrm{cos}(v_i, \hat{v}_i)$ are shown. Note that in the word2vec implementation, the closeness function uses normalised vectors, which means that the maximum cosine similarity is 1.


Figure 3.8: Effect of embedding dimension on the distribution of nearest-neighbour similarity. The models were trained on year 2014 of the Yle corpus.

It is interesting that while the 97.5th percentile of the nearest-neighbour similarity remains fairly high regardless of embedding dimension, it still decreases as the dimension grows, and the 2.5th percentile decreases very significantly, from 0.814 in the 25-dimensional case all the way to 0.409 for the 300-dimensional model. This means that the model is unable to pack the embeddings as tightly as dimensionality grows, which could indicate overfitting. However, it must be noted that while interesting, this measure of spread does not directly inform us of how well each of the models depicted will perform on the actual task they were trained for.
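
(As an aside, outside the thesis text: the nearest-neighbour similarities described above can be computed from a trained gensim model roughly as follows, assuming gensim 4.x and a saved KeyedVectors file; the file name is just an example, and looping over the whole vocabulary this way is slow but fine as an illustration.)

import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load("yle2014-dim100.kv")   # hypothetical file name

# cosine similarity between each word and its nearest neighbour
# (most_similar excludes the query word itself)
sims = np.array([wv.most_similar(word, topn=1)[0][1] for word in wv.index_to_key])

print("mean:", sims.mean())
print("2.5% / 97.5% percentiles:", np.percentile(sims, [2.5, 97.5]))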

[----]

It appears from figure 3.10 that the model begins to overfit as dimensionality grows beyond 100--150. Overfitting, of course, is traditionally defined as related to the generalisation error of a model, i.e. a model's ability to produce consistently decent predictions when faced with new data [43, 15]. As such, the concept of overfitting does not completely apply to word2vec, since the word2vec model is not used for predictions. However, overfitting refers more generally to a situation where a model of too high complexity is used, such that the model can perform well on the training data simply by memorising it. This can clearly occur with the word2vec model, which has a very large number of parameters, regardless of whether said model is then used for predictions or not. The non-predictive nature of the model, then, merely makes it more tricky to ascertain whether overfitting has occurred.

Figure 3.10: Accuracy on analogy task by embedding dimension: five arbitrarily selected data sets.


[15] Dietterich, T. (1995). Overfitting and undercomputing in machine learning. ACM computing surveys (CSUR), 27(3), 326–327.

[43] Kearns, M., Mansour, Y., Ng, A. Y., & Ron, D. (1997). An experimental and theoretical comparison of model selection methods. Machine Learning, 27(1), 7–50.

Sunday 28 October 2018

Thesis on PLSA (Probabilistic Latent Semantic Analysis)

A few years ago I finally finished my bachelor's degree in statistics that I had been working on for a number of years. (It took me a while as I've been working full time and also because the University of Helsinki is in Helsinki while I've been in London since 2011.) My thesis was on PLSA, which is a latent topic model for clustering text documents.

Download the thesis here as a PDF. I decided to publish it online in case it proves useful to people interested in statistical models. The thesis was originally written in Finnish, but I've translated it into English for publication.

Hope y'all find it interesting.

Sunday 10 June 2018

Layers & gradient shapes in neural networks

TLDR: backpropagation is reasonably straightforward, but the details are subtle.

This post became much bigger and took much longer than I expected. Hopefully it is useful for others who might be interested in the mathematical foundations of backpropagation. Basically, at each layer we calculate $z=wa$, and the goal is to optimise various things by taking the derivative of this. Well, the derivative of $z$ with respect to $w$ can only be $a$ (or some reasonable variation thereof, such as a matrix transpose $a^T$), and the derivative with respect to $a$ can only be $w$ (or some reasonable variation thereof). One of these will be used to adjust the weights $w$ at this layer, while the other will be passed backwards, in an application of the chain rule, to adjust the weights of previous layers. That's it, in a nutshell. For the details, which are more complicated, read on.

Preface


I recently completed all of Andrew Ng's Deep Learning courses on Coursera. These courses are really good and I highly recommend them; Professor Ng has the skill to explain complicated concepts in a very understandable way.

To make sure I learn this stuff properly, I wanted to deepen my understanding of the mathematics involved in gradient descent (see also the previous two posts). Now Prof Ng doesn't actually get very deep into the mathematics on his courses, which is fine, but I wanted a deeper dive.

Simple network


Let's start at the beginning. I will use a column-oriented layout ("$w x$") as in the previous posts and start to construct a neural network model by building on the simple linear model. Ng's notation is somewhat confusing when it comes to matrix orientation: the input data $X$ is column-oriented, as are the activations $a^{[1]}, a^{[2]}$ etc; but the weights are row-oriented, i.e. all the weight matrices $w^{[1]}, w^{[2]}$ etc are row matrices. This is probably because ${w^{[i]}}^T$ is a bit cumbersome to write everywhere. I will follow this same convention: weights are rows, other stuff is columns.

One thing that will be different from the simple model is the bias variables. In traditional statistics, one can add an extra row (or column) of ones to the data matrix to use as bias variables (see previous posts), which simplifies the math, since all the variables, weights and biases, will then be contained in one matrix. However in a multi-layered network, this becomes impractical, since one would have to modify the output matrix at each layer to add this bias row. It is possible but not desirable to keep doing these modifications at every layer. Another issue is that in larger neural networks the initialisation of the weights is very important (see so-called Xavier initialisation), whereas biases can just be initialised to zero. I'm not sure how much it matters how the biases are initialised, but this is something that could have an effect.

So, building on the traditional linear model from the previous posts, the simplest model that could reasonably be called a neural network would be a logistic regression model (or another generalised linear model, but here I'm using logistic regression as an example). In a logistic regression model we have as input some data points $X$ and the correct answers $y$, which in this case are binary (0 and 1), and the model uses a weight vector $w$ plus a (separate) bias $b$ to produce predictions $\hat{y}$: \[ \begin{eqnarray} z & = & w^T X + b \\ a & = & \sigma(z) = \frac{1}{1+e^{-z}} \\ \hat{y} & = & a \end{eqnarray} \] The loss function used here would usually be the log loss: \[ L(y,\hat{y}) = -y \log \hat{y} - (1-y) \log (1-\hat{y}) \] So the goal of course is to adjust the parameters $w$ and $b$ so as to minimise the loss; and traditional gradient descent of course does this by using the gradient. How do we compute the gradient? That might get complicated here since we have a whole matrix full of data, a bunch of vectors for the weights and outcomes, and more than one function chained together. It is this multi-dimensional shape of things that gave me some difficulty. Basically there are two separate but related things: how to do the math correctly, taking all the vectors and matrices into account; and, since we actually want the average gradient rather than $n$ separate ones, how to compute that efficiently using suitable matrix operations. Let's see how it all develops.
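
As a small numpy sketch of this model and loss (the shapes and data are made up purely for illustration):

import numpy as np

np.random.seed(0)
d, n = 3, 5                           # dimensions, number of data points
X = np.random.randn(d, n)             # data points as columns
y = np.random.randint(0, 2, (1, n))   # binary labels, shape (1, n)
w = np.random.randn(d, 1) * 0.01      # weight (column) vector
b = 0.0                               # scalar bias

z = w.T @ X + b                       # shape (1, n)
a = 1.0 / (1.0 + np.exp(-z))          # sigmoid activation, i.e. the predictions
L = -y * np.log(a) - (1 - y) * np.log(1 - a)   # per-example log loss, shape (1, n)
print(L.mean())                       # average loss over the n data points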

First consider the logistic regression model above in the simplest case, where we only have one data point, a $d$-dimensional column vector $x$. The weight vector is a $d$-dimensional vector as well, and computing the gradients of the loss with respect to the parameters is straightforward enough, just using the standard formulas for vector derivatives: \[ \begin{eqnarray} \frac{\mathrm{d} L}{\mathrm{d} w} & = & \frac{\mathrm{d} L}{\mathrm{d} a} \frac{\mathrm{d} a}{\mathrm{d} z} \frac{\mathrm{d} z}{\mathrm{d} w} \\ & = & (-\frac{y}{a} + \frac{1-y}{1-a}) \, a(1-a) \, x^T \\ & = & (a-y)x^T; \\ \frac{\mathrm{d} L}{\mathrm{d} b} & = & \frac{\mathrm{d} L}{\mathrm{d} a} \frac{\mathrm{d} a}{\mathrm{d} z} \frac{\mathrm{d} z}{\mathrm{d} b} \\ & = & (-\frac{y}{a} + \frac{1-y}{1-a}) \, a(1-a) \, 1 \\ & = & a-y. \end{eqnarray} \] Here we got away without too much worry about the fact that some things are vectors. The next thing to do is to generalise these results to the more realistic case of $n$ data points, for which we want the average gradient. Consider that for any given data point $x$, the gradients will of course be the same as above, namely $(a-y)x^T$ and $a-y$. Given all the data points in a matrix, \[ X = \begin{bmatrix} \vert & & \vert \\ x_1 & \ldots & x_n \\ \vert & & \vert \end{bmatrix}, \] it is easy enough to basically guess how to efficiently compute the sum of all the gradients with respect to $w$: \[ g = (a-y)X^T = \begin{bmatrix} a_1-y_1 & a_2-y_2 & \ldots & a_n-y_n \end{bmatrix} \begin{bmatrix} \vert & & \vert \\ x_1 & \ldots & x_n \\ \vert & & \vert \end{bmatrix} \] And all we need to do is divide the result by $n$ to get the average gradient. The case for the average gradient with respect to $b$ is similar but simpler.
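
Continuing that sketch, the average gradients are exactly the formulas above, and can be checked against a per-example loop:

# continuing from the sketch above (X, y, a, n as defined there)
dw = (a - y) @ X.T / n        # average gradient w.r.t. w, shape (1, d)
db = (a - y).mean()           # average gradient w.r.t. b, a scalar

# sanity check: the same average gradient, one data point at a time
dw_loop = sum((a[0, i] - y[0, i]) * X[:, i] for i in range(n)) / n
assert np.allclose(dw, dw_loop)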

More complex network: single data point


Let's move on to more complicated models. In a proper neural network, you will obviously have multiple layers. Each layer is a collection of neurons, each of which has its own weight vector and bias, so in total you will have a matrix of weight vectors and a vector of biases, per layer. For example, with two layers (using Ng's notation): \[ \begin{eqnarray} z^{[1]} & = & w^{[1]} X + b^{[1]} \\ a^{[1]} & = & \sigma(z^{[1]}) \\ z^{[2]} & = & w^{[2]} a^{[1]} + b^{[2]} \\ a^{[2]} & = & \sigma(z^{[2]}) \\ \hat{y} & = & a^{[2]} \end{eqnarray} \] In this network we're using sigmoid activation for both layers and the last layer is meant to produce a one-dimensional output, which then gets squeezed to $[0,1]$ by the sigmoid as usual. To make the dimensions more concrete, consider a simple case of 3-dimensional data points, as column vectors in $X$, being reduced first to 2-dimensional activations by the first layer and then to 1-dimensional predictions by the second and final layer. As you can check, in this case $w^{[1]}$ would be a $(2,3)$ matrix (with the weight vectors as rows) and $w^{[2]}$ would be a $(1,2)$ row vector. The outputs $z^{[1]}$ would then be $(2,n)$ and the outputs $z^{[2]}$ would be $(1,n)$. Of course the activation functions do not affect these shapes, being component-wise.
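
A quick numpy sketch of this 3-2-1 example, mainly to make the shapes concrete (random weights, arbitrary n):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n = 5
X  = np.random.randn(3, n)       # 3-dimensional data points as columns
w1 = np.random.randn(2, 3)       # first-layer weights: 2 neurons, each a row of 3
b1 = np.zeros((2, 1))
w2 = np.random.randn(1, 2)       # second-layer weights: 1 neuron, a row of 2
b2 = np.zeros((1, 1))

a1 = sigmoid(w1 @ X + b1)        # shape (2, n)
a2 = sigmoid(w2 @ a1 + b2)       # shape (1, n) -- the predictions y_hat
print(a1.shape, a2.shape)        # (2, 5) (1, 5)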

How about the gradient calculation in this case? Well, it gets a bit complicated. Basically the chain rule, which I used somewhat recklessly before, still applies in the case of vectors, but one needs to use Jacobians rather than ordinary derivatives, Jacobians being the proper generalisation. However in the multiple-layer case we are dealing not only with vectors but also with matrices.

Consider the two-layer example above. The derivatives of the loss with respect to the last layer's parameters are the same as they were in the one-layer case. However, we also want to adjust the parameters of the previous layer. Ignoring the activations for the moment (since they don't affect the shapes of the gradients), we want to compute $\mathrm{J} a^{[2]}(w^{[1]})$, that is, the Jacobian of the last layer's activations with respect to the first layer's weights. But there are $n$ outcomes from the last layer and the first layer's weights are a matrix, so the result is a higher-dimensional tensor. With a vector input and a vector output we would get a nice, understandable Jacobian matrix, but with a vector and a matrix we get something else.

There are a few ways of making sense of this. One way would be to learn how tensors work, but this didn't seem like the best avenue here for various reasons. Another way would be to vectorise the inputs, that is, treat the last-layer activations as a function of a vector which is all the first-layer weights concatenated. I ended up choosing a third approach, which is to consider each vector in the weight matrix separately. Let's see how that works in practice.

Imagine a two-layer network with a very simple low-dimensional structure as before: three-dimensional input data, two neurons in the first layer, and one neuron in the final layer. Imagine further, just for a moment, that there are no activation functions or biases (since they don't affect the shapes of the gradients w.r.t. weights). Concretely, \[ \begin{eqnarray} a^{[1]} & = & w^{[1]} X = \begin{bmatrix} \text{---} & w^{[1]}_a & \text{---} \\ \text{---} & w^{[1]}_b & \text{---} \end{bmatrix} \begin{bmatrix} \vert & & \vert \\ x_1 & \ldots & x_n \\ \vert & & \vert \end{bmatrix}; \\ a^{[2]} & = & w^{[2]} a^{[1]} = \begin{bmatrix} \text{---} & w^{[2]} & \text{---} \\ \end{bmatrix} \begin{bmatrix} \vert & & \vert \\ a_1 & \ldots & a_n \\ \vert & & \vert \end{bmatrix}. \end{eqnarray} \] The difficult bit now is to calculate the Jacobian of $a^{[2]}$ with respect to the first-layer weight matrix $w^{[1]}$. So, let's do that part by part, by first computing the Jacobian with respect to the first weight vector in the first layer, i.e. $w^{[1]}_a$. Considered this way, the Jacobian is an ordinary matrix, as it is computed for a function that takes a vector and returns a vector. We can then do the same computation for the second weight vector $w^{[1]}_b$ and, if there are more than two neurons in the first layer, we will also be able to see how it generalises beyond that.

Let's again start with the case of just one input vector $x$. By the Jacobian chain rule, \[ \begin{eqnarray} \textrm{J} a^{[2]}(w^{[1]}_a) & = & \textrm{J} a^{[2]}(a^{[1]}) \textrm{J} a^{[1]}(w^{[1]}_a). \\ \end{eqnarray} \] Now if there is only one data point, the final layer produces only one outcome, so the Jacobian of $a^{[2]}$ will have one row; it will have 3 columns, since that is the number of dimensions in $w^{[1]}_a$. Of the "intermediate" components, $\textrm{J} a^{[2]}(a^{[1]})$ is $(1,2)$, i.e. one outcome in $a^{[2]}$ from two components of $a^{[1]}$, and $\textrm{J} a^{[1]}(w^{[1]}_a)$ is $(2,3)$, i.e. two outcomes in $a^{[1]}$ from three components of $w^{[1]}_a$. Of course $(1,2) \times (2,3) = (1,3)$ as expected. These two "component" Jacobians are now straightforward to calculate: \[ \begin{eqnarray} \textrm{J} a^{[2]}(a^{[1]}) & = & \begin{bmatrix} \text{---} & \partial a^{[2]}/\partial a^{[1]} & \text{---} \end{bmatrix} \\ & = & \begin{bmatrix} \text{---} & \partial w^{[2]} a^{[1]}/\partial a^{[1]} & \text{---} \end{bmatrix} \\ & = & w^{[2]}; \\ \textrm{J} a^{[1]}(w^{[1]}_a) & = & \begin{bmatrix} \text{---} & \partial a^{[1]}_1/\partial w^{[1]}_a & \text{---} \\ \text{---} & \partial a^{[1]}_2/\partial w^{[1]}_a & \text{---} \\ \end{bmatrix}\\ & = & \begin{bmatrix} \text{---} & \partial w^{[1]}_a x /\partial w^{[1]}_a & \text{---} \\ \text{---} & \partial w^{[1]}_b x /\partial w^{[1]}_a & \text{---} \\ \end{bmatrix}\\ & = & \begin{bmatrix} \text{---} & x^T & \text{---} \\ \text{---} & 0 & \text{---} \\ \end{bmatrix},\\ \end{eqnarray} \] and therefore the whole Jacobian is \[ \begin{eqnarray} \textrm{J} a^{[2]}(w^{[1]}_a) & = & w^{[2]} \begin{bmatrix} \text{---} & x^T & \text{---} \\ \text{---} & 0 & \text{---} \\ \end{bmatrix}\\ & = & w^{[2]}_1 x^T. \end{eqnarray} \] So this is the gradient we can use to adjust the first weight vector of the first layer. Doing the same calculation for the second weight vector, we get a very similar result: \[ \begin{eqnarray} \textrm{J} a^{[2]}(w^{[1]}_b) & = & w^{[2]}_2 x^T. \end{eqnarray} \] And we can now put these two together to get the complete Jacobian, i.e. the complete gradient: \[ \begin{eqnarray} \textrm{J} a^{[2]}(w^{[1]}) & = & \begin{bmatrix} \text{---} & w^{[2]}_1 x^T & \text{---} \\ \text{---} & w^{[2]}_2 x^T & \text{---} \end{bmatrix} \\ & = & \begin{bmatrix} w^{[2]}_1 \\ w^{[2]}_2 \end{bmatrix} \begin{bmatrix} \text{---} & x^T & \text{---} \end{bmatrix} \\ & = & {w^{[2]}}^T x^T. \\ \end{eqnarray} \] This gradient matrix is just the two gradient vectors in one. Of course it has the same shape as $w^{[1]}$, so we can use it directly.
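
This result is easy to sanity-check numerically; below is a small sketch (arbitrary random values) comparing ${w^{[2]}}^T x^T$ with finite differences, for the toy network with no activations or biases:

import numpy as np

np.random.seed(1)
x  = np.random.randn(3, 1)           # one 3-dimensional data point
w1 = np.random.randn(2, 3)           # first-layer weights
w2 = np.random.randn(1, 2)           # second-layer weights

def a2(w1):                          # scalar output of the two-layer linear network
    return (w2 @ (w1 @ x)).item()

analytic = w2.T @ x.T                # the claimed Jacobian, shape (2, 3)

numeric = np.zeros_like(w1)
eps = 1e-6
for i in range(w1.shape[0]):         # perturb each weight entry in turn
    for j in range(w1.shape[1]):
        w1p = w1.copy(); w1p[i, j] += eps
        numeric[i, j] = (a2(w1p) - a2(w1)) / eps

assert np.allclose(analytic, numeric, atol=1e-4)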

More complex network: multiple data points


So what about more than one data point? In that case, the Jacobian of $a^{[2]}$ will have $n$ rows, corresponding to the $n$ outcomes. There are now also $n$ components in $a^{[1]}$, which makes the calculation a little more complicated. Basically, since we're going from $a^{[2]}$ to $w^{[1]}$ through multiple routes - all the various $a^{[1]}_i$s - we can calculate the Jacobian as the total derivative of all these routes. For the first weight vector of the first layer, \[ \begin{eqnarray} \textrm{J} a^{[2]}(w^{[1]}_a) & = & \textrm{J} a^{[2]}(a^{[1]}_1) \textrm{J} a^{[1]}_1(w^{[1]}_a) + \textrm{J} a^{[2]}(a^{[1]}_2) \textrm{J} a^{[1]}_2(w^{[1]}_a) + \ldots \end{eqnarray} \] To make this more concrete, let's do an example calculation of the first element of this sum, that is, through the first element, $a^{[1]}_1$; this is the first output element of the first layer. This partial Jacobian will have the same shape as the full one, but most of its contents will be zero, similar to what we saw before: \[ \begin{eqnarray} \textrm{J} a^{[2]}(a^{[1]}_1) & = & \begin{bmatrix} \text{---} & \partial a^{[2]}_1 / \partial a^{[1]}_1 & \text{---} \\ \text{---} & \partial a^{[2]}_2 / \partial a^{[1]}_1 & \text{---} \\ & \ldots & \\ \text{---} & \partial a^{[2]}_n / \partial a^{[1]}_1 & \text{---} \\ \end{bmatrix}\\ & = & \begin{bmatrix} \text{---} & \partial w^{[2]} a^{[1]}_1 / \partial a^{[1]}_1 & \text{---} \\ \text{---} & \partial w^{[2]} a^{[1]}_2 / \partial a^{[1]}_1 & \text{---} \\ & \ldots & \\ \text{---} & \partial w^{[2]} a^{[1]}_n / \partial a^{[1]}_1 & \text{---} \\ \end{bmatrix}\\ & = & \begin{bmatrix} \text{---} & w^{[2]} & \text{---} \\ \text{---} & 0 & \text{---} \\ & \ldots & \\ \text{---} & 0 & \text{---} \\ \end{bmatrix};\\ \textrm{J} a^{[1]}_1(w^{[1]}_a) & = & \begin{bmatrix} \text{---} & \partial a^{[1]}_{11} / \partial w^{[1]}_a & \text{---} \\ \text{---} & \partial a^{[1]}_{12} / \partial w^{[1]}_a & \text{---} \\ \end{bmatrix}\\ & = & \begin{bmatrix} \text{---} & \partial w^{[1]}_a x_1 / \partial w^{[1]}_a & \text{---} \\ \text{---} & \partial w^{[1]}_b x_1 / \partial w^{[1]}_a & \text{---} \\ \end{bmatrix}\\ & = & \begin{bmatrix} \text{---} & x_1^T & \text{---} \\ \text{---} & 0 & \text{---} \\ \end{bmatrix}.\\ \end{eqnarray} \] The first thing here is an $(n, 2)$ matrix, and the second thing is a $(2, 3)$ matrix, so together they produce an $(n, 3)$ matrix. Notice that we now have multiple data points $x_i$, but as the model is very "regular", only the first data point $x_1$ contributes to the first activation $a^{[1]}_1$, i.e. $a^{[1]}_1 = \begin{bmatrix} w^{[1]}_a x_1 & w^{[1]}_b x_1 \end{bmatrix}^T$.

So what do we do with this? By itself, not much; but if we do the same calculation for the second element of the sum - i.e. through the second activation vector, to which only the second data point contributes - and then for the third element and so on, the result for the whole sum is easily seen to be \[ \begin{eqnarray} \textrm{J} a^{[2]}(w^{[1]}_a) & = & w^{[2]}_1 \begin{bmatrix} \text{---} & x_1^T & \text{---} \\ \text{---} & x_2^T & \text{---} \\ & \ldots & \\ \text{---} & x_n^T & \text{---} \\ \end{bmatrix}\\ & = & w^{[2]}_1 X^T .\\ \end{eqnarray} \] Here of course $w^{[2]}_1$ is simply a scalar, the first element of the two-element weight vector.

So starting from $n$ data points, we have calculated $n$ different gradients for the first weight vector - one gradient per point. This seems logical enough. Similarly, if we do the math for the other weight vector $w^{[1]}_b$, we end up with a slightly different arrangement of matrices, such that the second element of $w^{[2]}$ is now used, and the result is \[ \begin{eqnarray} \textrm{J} a^{[2]}(w^{[1]}_b) & = & w^{[2]}_2 X^T ,\\ \end{eqnarray} \] i.e. the second element of the second-layer weight vector $w^{[2]}$, times the matrix whose rows are the per-point gradients.

So this is pretty understandable. For the first weight vector, the Jacobian involves all of the gradients of each input to that layer - here just the original data points, since this is the first layer - times the element in the second-layer weight vector that corresponds to the first weight vector. For the second weight vector, we have the same gradients, times the second element of the weight vector from the next layer. It is obvious that if there are more neurons and hence more weight vectors in the first layer, say $q$ of them, there also need to be $q$ columns in the next layer's weight vector or matrix, and this same process happens $q$ times.

How to best calculate this? Keeping in mind that we are actually interested in the average gradient, the most straightforward thing to do from an implementation point of view would be to calculate the average of the original data points first, then do a simple for loop that runs $q$ times to calculate the rest. However such for loops are just distasteful when doing matrix calculations. Fortunately there is a better way.

It turns out to be helpful to consider also the role of the (immediate) gradients of the loss function and the activation functions, as they help with doing all the required computations more efficiently.

Loss and activation


There are a few different ways of considering the gradients of the loss function and the activation functions. The more straightforward way is to reason that since these functions are taken elementwise, their derivatives are also elementwise, and that's that. In other words, there's no need for any matrix math, since the functions and their derivatives are simply $\mathbb{R} \to \mathbb{R}$. Done this way, when one uses the resulting matrices in an application of the chain rule, the multiplication must be done elementwise as well.

More mathematically, the loss function can be considered to be a function that takes in $n$ predictions and outputs the $n$ corresponding losses (the correct answers $y$ are of course constant). Hence, the Jacobian of the loss function is a matrix with $n$ rows, one for each loss, and $n$ columns, one for each input. However, due to the "regular" structure of the model, the Jacobian will be a diagonal matrix, since clearly the first input, say, has no effect on losses other than the first one. So we have \[ \begin{eqnarray} \textrm{J} L (a^{[2]}) & = & \textrm{diag}(\partial L(a^{[2]}_i) / \partial a^{[2]}_i) . \end{eqnarray} \] Similarly, the activation function of the last layer takes in $n$ inputs and produces $n$ outputs, and since each input only affects the output at the same index, the immediate Jacobian of an activation function $\sigma$ is \[ \begin{eqnarray} \textrm{J} \sigma (z^{[2]}) & = & \textrm{diag}(\partial \sigma(z^{[2]}_i) / \partial z^{[2]}_i) . \end{eqnarray} \] For example, in the case of traditional linear regression there is no activation function (i.e. the activation function is the identity function and its Jacobian is the identity matrix) and hence the product of these two Jacobians is simply $\textrm{diag}(a_i-y_i)$, which is the direct Jacobian of the loss. If we are using a sigmoid in the last layer and the logistic loss function (e.g. logistic regression), then the product of the Jacobians turns out to be the same, as is known.

In practice of course it doesn't make sense to construct a matrix with $n^2$ elements, only $n$ of which are non-zero. But from the diagonal matrix form we can see that the mathematically correct way is indeed equivalent to the simpler elementwise multiplication mentioned before. In inner layers, where the Jacobians would again be more complex, I didn't do the math for the Jacobians, since the simple elementwise computation is sufficient (as done also by Ng and others).

How do these loss & activation derivatives fit in with the rest of the calculations? Well, previously we saw that a primitive way to calculate the gradients for the first-layer weights would be to calculate $w^{[2]}_q X^T$ for each (scalar) element $q$. This of course is an $(n,d)$ matrix, so it's no problem to multiply it from the left with an $(n,n)$ diagonal matrix or, equivalently, elementwise from the left with an $(n,1)$ column vector. (The latter operation is slightly tricky, since the dimensions don't quite match; we need to be careful there to note that the elementwise product here means first fattening, or "broadcasting", the $(n,1)$ vector into an $(n,d)$ matrix, by duplicating the column. But it turns out the operation we actually want to do will not require this.)

So we now have all the pieces. We know how to compute the gradients, mathematically speaking, and we have all of the various matrices containing these gradients. The next step is to compute the sum, and then the average, of the gradients in an efficient way.

Optimised calculations


So we want to calculate, for each weight vector, the average gradient (across all data points) for that weight vector. The average is of course just the sum of the gradients divided by a scalar. Consider now all the components that go into the full gradient of a given weight vector. We have the derivatives of the loss, of the activation function, and of the linear function, followed by more activations and linears in case of multiple layers. We want the sum of this for each data point; each data point has its corresponding loss derivative and activation derivative.

Starting from the loss, first do the element-wise multiplication of the loss derivatives with the activation derivatives for the final layer; these of course have the same shape. For example, in our case of sigmoid activation with logistic loss, the result will be a row vector containing $a^{[2]}_i-y_i$ for each element $i$.

To adjust the weights of the last layer, as we saw before with the simple network, the sum of the gradients is simply $(a^{[2]}-y){a^{[1]}}^T$ in our example, or more generally, the product of the previous gradients and the transpose of the input to the last layer. As you can check, the dimensions match, the loss and activation gradients being $(1,n)$ and the input transpose being $(n,2)$. In general, multiplying an $(a,b)$ matrix by a $(b,c)$ matrix produces a sum of each $a$-times-$c$ element; here, the $a$s are the loss-and-activation-derivatives and the $c$s are the inputs to the current layer. Mathematically, \[ \begin{eqnarray} & & \begin{bmatrix} a^{[2]}_1-y_1 & a^{[2]}_2-y_2 & \ldots & a^{[2]}_n-y_n \\ \end{bmatrix} \begin{bmatrix} \text{---} & a^{[1]}_1 & \text{---} \\ & \ldots & \\ \text{---} & a^{[1]}_n & \text{---} \\ \end{bmatrix}\\ & = & \begin{bmatrix} \sum_i (a^{[2]}_i-y_i) a^{[1]}_{i1} & \ldots & \sum_i (a^{[2]}_i-y_i) a^{[1]}_{iq} \\ \end{bmatrix}\\ \end{eqnarray} \] where $q$ is the dimensionality of each previous-layer activation, in our example 2. The math looks complicated, but I encourage you to ponder the loss-etc elements and the inputs until the sums computed here become clear. Now all we need to do is to divide this by $n$ and we have the average.

One thing to note with this calculation is that the $a^{[2]}-y$ represents the elementwise product of the loss gradient and the final-layer activation gradient. In the following we need to treat these two separately.

Now let's see how to proceed to the previous layer. In short, the last layer's weight adjustments are loss-etc times layer input; the layer before that is loss-etc times next-layer weights times layer input. How do we compute this? The loss-etc is $(1,n)$ as before; the weights of the final layer are $(1,q)$, in our example $q=2$; and the layer input for the second-to-last layer - here the original data $X$ - is $(d,n)$, in our example $(3,n)$.

So for a non-final layer, we want to compute basically the same thing, loss-etc times layer input - except we want $q$ copies of it, where each copy is multiplied by a different element of the next-layer weight. This is the "for loop" we saw before. We need to also consider the activation derivatives for the current layer, but since these are elementwise, there are no complicated matrix shapes or sums to consider. I'll do the derivation first without the activation derivatives and then add them in after, since the calculations are easier to understand that way.

We first compute an "outer product" of the loss-etc and the next-layer weights, to get \[ \begin{eqnarray} {w^{[2]}}^T (a^{[2]}-y) & = & \begin{bmatrix} \text{---} & w^{[2]}_1 (a^{[2]}-y) & \text{---} \\ & \ldots & \\ \text{---} & w^{[2]}_q (a^{[2]}-y) & \text{---} \\ \end{bmatrix} \end{eqnarray}. \] This is just the same $a-y$ multiplied by each of the $q$ elements of the final weights, in a $(q,n)$ matrix. We can then tie this together, along the $n$ axis, with the layer input of the first layer. The layer input has $n$ elements, and we want the elementwise sum as before, so the correct calculation is \[ \begin{eqnarray} ({w^{[2]}}^T (a^{[2]}-y)) X^T & = & \begin{bmatrix} \text{---} & w^{[2]}_1 (a^{[2]}-y) & \text{---} \\ & \ldots & \\ \text{---} & w^{[2]}_q (a^{[2]}-y) & \text{---} \\ \end{bmatrix} \begin{bmatrix} \text{---} & x_1^T & \text{---} \\ & \ldots & \\ \text{---} & x_n^T & \text{---} \\ \end{bmatrix} \end{eqnarray}. \] As you can check, the dimensions match: $((2,1) \times (1,n)) \times (n,3) = (2,3)$, which is the dimensionality of the first-layer weights, as desired. Careful checking reveals that each row of the resulting matrix corresponds to one element of $w^{[2]}$ times $\sum_i (a_i-y_i) x_i^T$, which is what we wanted.

As was said before, we still need to consider the derivatives of the first-layer activation. These can be taken elementwise, and the result is another $(q,n)$ matrix. So to incorporate the activation derivatives, we take the elementwise product of ${w^{[2]}}^T (a^{[2]}-y)$ and those derivatives, call the result $M$: \[ \begin{eqnarray} M & = & ({w^{[2]}}^T (a^{[2]}-y)) \circ (\partial \sigma^{[1]}(z^{[1]}) / \partial z^{[1]}) \end{eqnarray}. \] We then multiply $M$ by $X^T$ to get the final answer for the weights, this time with the activation derivatives included.
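
In numpy terms, and reusing the 3-2-1 sketch from earlier (with y holding the $(1,n)$ binary labels), the whole backward step for both layers is roughly:

# continuing the 3-2-1 sketch above
y   = np.random.randint(0, 2, (1, n))   # binary labels, shape (1, n)

dZ2 = a2 - y                            # (1, n): loss gradient times last-layer sigma'
dW2 = dZ2 @ a1.T / n                    # (1, 2): average gradient for the last-layer weights

M   = (w2.T @ dZ2) * (a1 * (1 - a1))    # (2, n): "loss-etc" pushed back through w2, times sigma'
dW1 = M @ X.T / n                       # (2, 3): average gradient for the first-layer weights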

What about layers deeper back than two? I won't go into the details there, but you should be able to convince yourself, with a bit of thinking and calculating, that applying this same method of loss-etc times next-layer weights, and that multiplied by current-layer input, will continue to work for as many layers as you need. In our example, think of a "layer zero" with, say, 4-dimensional inputs. The "loss-etc" is now ${w^{[2]}}^T (a^{[2]}-y)$, and the next-layer weights are $w^{[1]}$; the calculation is \[ M' = {w^{[1]}}^T ({w^{[2]}}^T (a^{[2]}-y)), \] and this can then be both multiplied by the layer-zero inputs to compute the layer-zero weight gradient and also passed further backward if need be. As you can check, in each case the sizes match and the calculation produces the desired result. (I've omitted the activation derivatives in the last equation for clarity, but these are straightforward to add, as above.)

Putting it all together


Let's summarise these findings into a final algorithm. This end result is of course familiar from Prof Ng's lecture notes and many other sources. The algorithm here is valid for a network with one-dimensional output, e.g. a binary classifier.
  1. Given the input and the network, compute the predictions $\hat{y}$ (forward propagation). This will be a $(1,n)$ row vector.
  2. Compute the loss $L(\hat{y}, y)$, another $(1,n)$ row vector.
  3. Compute the gradient of the loss, i.e. the elementwise derivatives, and store it in another $(1,n)$ vector; call this $dL$.
  4. Let $M := dL$.
  5. For each layer $i$, starting from the last:
    • Compute the elementwise derivatives of the activations $a^{[i]}$ produced by the current layer; call these $dA$. This will be the same shape as $M$ (e.g. $(1,n)$ for the last layer).
    • Compute the elementwise product of $M$ with $dA$, i.e. $C := M \circ dA$. This will of course be the same shape again.
    • Compute $dW := \frac{1}{n} C {a^{[i-1]}}^T$, where $a^{[i-1]}$ is the input to the current layer. The resulting $dW$ is the same shape as the current layer's weights, as you can check.
    • Adjust the weights according to $dW$, i.e. take a gradient descent step.
    • Update $M$ with $M := {w^{[i]}}^T C = {w^{[i]}}^T (M \circ dA)$, using the value of $w^{[i]}$ from the forward pass (i.e. from before the adjustment in the previous step).
    • Set $i := i-1$ and proceed to the previous layer.
  6. Repeat until convergence.
You will notice I've left the biases out of the algorithm. I recommend adding them yourself, as this is a good way to verify your understanding of both the mathematics and the practical computation.
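
Here is a minimal numpy sketch of the algorithm above for a sigmoid-activated network with one-dimensional output and log loss; as in the summary, the biases are left out, and the layer sizes, seed and learning rate are just example values:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, layer_sizes, lr=0.1, iterations=1000):
    """X: (d, n) data as columns; y: (1, n) binary labels;
    layer_sizes: e.g. [3, 2, 1] for the 3-2-1 example network (no biases)."""
    rng = np.random.default_rng(0)
    ws = [rng.normal(0, 0.5, (layer_sizes[i + 1], layer_sizes[i]))
          for i in range(len(layer_sizes) - 1)]
    n = X.shape[1]

    for _ in range(iterations):
        # forward propagation, remembering each layer's input
        activations = [X]
        for w in ws:
            activations.append(sigmoid(w @ activations[-1]))

        # backward propagation; for sigmoid plus log loss, dL elementwise-times dA is a - y
        M = activations[-1] - y                     # (1, n)
        for i in reversed(range(len(ws))):
            dW = M @ activations[i].T / n           # same shape as ws[i]
            if i > 0:
                dA = activations[i] * (1 - activations[i])   # sigma' of the previous layer's output
                M = (ws[i].T @ M) * dA              # pass backwards, using the pre-update ws[i]
            ws[i] -= lr * dW                        # gradient descent step
    return ws

With the earlier 3-2-1 shapes this would be called as ws = train(X, y, [3, 2, 1]).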

If you made it this far, congratulations and I hope this post has been useful.

Friday 29 December 2017

Various gradient descent methods

Edit: fixed a bug in the code (was using total gradient instead of average). As a result, the first two graphs are slightly different, and the scale of the gradient comparison graphs is different.

After the previous post I wanted to solve the same problem with gradient descent. Turns out this particular problem illuminates rather well the somewhat finicky nature of gradient descent, while also being simple enough to understand.

The three points $(10,37)$, $(243,328)$ and $(341,451)$ are of course collinear, or very nearly so, and thus fitting a straight line through them is not the hardest problem I've ever solved. The correct answer, as determined in the previous post, is $y = 24.416 + 1.250 x$. But finding this out with ordinary gradient descent is harder than I thought.

By "ordinary gradient descent", I mean one where the weights are initialised randomly to "small" values and then the gradient of the loss function is followed, with a constant learning rate for each weight component, until convergence occurs. The basic problem with this is that the optimal values 24.416 and 1.250 are far enough apart in terms of magnitude that, since we're using the same learning rate for both weight components, either the gradient for the latter ($w_1$) will oscillate wildly and possibly diverge (learning rate too big), or the gradient for the former ($w_0$) will converge to the correct value very slowly (learning rate too small). To illustrate, here's a picture of completely standard gradient descent with a small learning rate:
Recall that the optimal value for $w_0$ is 24.416; yet after ten thousand iterations, standard gradient descent has only managed to increase it to about 3. Increasing the learning rate from $10^{-5}$ to, say, $10^{-4}$ doesn't help, since that makes the algorithm diverge - the other weight is so small that it cannot abide a larger learning rate.
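
For reference, here is a small numpy sketch of this plain gradient descent, in the same layout as the Scala code further down (weights as a row vector, data as columns with a bias row; squared-error loss, average gradient):

import numpy as np

X = np.array([[1.0, 1.0, 1.0],
              [10.0, 243.0, 341.0]])              # bias row plus the x values
y = np.array([[37.0, 328.0, 451.0]])

w = np.random.default_rng(0).normal(size=(1, 2))  # "small" random initial weights
lr = 1e-5

for i in range(10000):
    dw = (w @ X - y) @ X.T / X.shape[1]           # average gradient of the squared loss
    w = w - lr * dw

print(w)   # w[0, 0], the intercept, is still nowhere near the optimal 24.416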

There are various ways to try to remedy this. One thing I'm not sure would help much is better initialisation (I went for (unscaled) standard normal for what it's worth - being careful to keep the same seed across all my experiments). The optimal values for the weights are only about an order of magnitude apart, and it's not clear whether there's any general method that could guess more suitable initial values "in one go": if you get it wrong by about an order of magnitude, that doesn't help at all, and if you could somehow do better than that, then gradient descent wouldn't even be needed: just use your magical guessing method instead.

A better idea is to normalise the data, so that the gradients are of similar magnitude in each direction. I'll come back to this point later.

One thing that is often done is adaptive learning rates of various kinds. These seem to be the go-to method in practice, based both on Andrew Ng's Coursera courses and what's commonly available in APIs such as Tensorflow; of course these more advanced methods become more valuable the more difficult the problem is. So I implemented (from scratch) some of them.

Momentum


The momentum method is a straightforward optimisation that maintains a moving average of previous gradients, and uses this in place of the current gradient at each step. The idea is that averaging the gradients will cancel out any oscillations and thus result in a smaller effective learning rate for any gradient component that would otherwise oscillate, while non-oscillating components are not affected by the averaging and remain as large as they were. This enables the use of a larger learning rate, since the likelihood of divergence will be lower. (In practice the averaging is done with a simple-to-implement exponentially decaying average; it would be possible to do an "exact" average that remembers the previous $k$ gradients, but exponential decay works just fine.) And indeed, with momentum (all other things being equal), a higher learning rate is achieved and the learning goes much faster, though still not amazingly fast:
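
Concretely, the only change to the plain loop sketched earlier is to keep a running average of the gradients and step along that instead; beta = 0.9 is a typical value, and the base learning rate can then be set higher than before:

w = np.random.default_rng(0).normal(size=(1, 2))   # re-initialise as before
v = np.zeros_like(w)                               # running average of the gradients
beta = 0.9
for i in range(10000):
    dw = (w @ X - y) @ X.T / X.shape[1]
    v = beta * v + (1 - beta) * dw                 # exponentially decaying average
    w = w - lr * v                                 # step along the averaged gradient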


RMSProp


RMSProp is another way of averaging gradients; it's based only on the magnitudes, and not the signs. This can work better in some cases I guess, but in my case it behaved rather strangely. It allowed me to use a larger learning rate than momentum, and converged more quickly, but after that it started oscillating a bit. This could be due to how RMSProp uses (square roots of) squared gradients, which cannot really prevent back-and-forth oscillations, since the information about the sign of the gradient is lost.
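
The corresponding change for RMSProp, roughly: keep a running average of the squared gradients and scale each gradient component by its root (epsilon just avoids division by zero). Note that only the magnitudes are remembered; the sign comes from the raw gradient at each step.

w = np.random.default_rng(0).normal(size=(1, 2))   # re-initialise as before
s = np.zeros_like(w)                               # running average of squared gradients
beta2, eps = 0.999, 1e-8
for i in range(10000):
    dw = (w @ X - y) @ X.T / X.shape[1]
    s = beta2 * s + (1 - beta2) * dw ** 2          # magnitudes only; the sign is lost here
    w = w - lr * dw / (np.sqrt(s) + eps)           # per-component scaling of the raw gradient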


Adam


Finally, Adam is a combination of momentum and RMSProp. Basically, both of these are done at the same time, and the combined method should therefore prevent oscillations and also converge fairly quickly, even with a fairly high base learning rate. And this indeed seems to be the case:


Gradient comparisons


As a further comparison of the various methods, I also had a look at the magnitudes of the gradients. Now as you recall, we want each component of the gradient to reach zero (and stay there), at which point the algorithm has converged. Plotting the gradient components directly basically corroborates the story above: plain gradient descent is stable but extremely slow; momentum is better; RMSProp is faster but less stable than momentum; and Adam is the best overall, achieving both speed and stability.


Code used


Here's the quick and dirty code I used for Adam. It's written in Scala, using the ND4J library for the matrix computations. (The code for the other methods is similar but less interesting.)
case object SquaredError extends Error {

  override def averageError(z: INDArray, y: INDArray): INDArray = {
    val diff = z.sub(y)
    diff.mmul(diff.transpose()).div(2.0*diff.shape()(1))
  }

  override def errorGradient(z: INDArray, y: INDArray, X: INDArray): INDArray = {
    val diff = z.sub(y)
    diff.mmul(X.transpose()).mul(1.0/diff.shape()(1))
  }
}

case class Adam(
                 betaMomentum: Double = 0.9,
                 betaSquares: Double = 0.999,
                 epsilon: Double = 1E-8
               ) extends Optimiser {
  override def runForNRounds(n: Int, wInitial: INDArray, data: Data, learningRate: Double, error: Error,
                             printTooMuchCrap: Boolean): (INDArray, Seq[Double]) = {
    var w = wInitial
    val errors = ArrayBuffer[Double]()

    var momentum = Nd4j.zeros(w.shape()(1))
    var squares = Nd4j.zeros(w.shape()(1))

    for (i <- 0 until n) {
      val z = w.mmul(data.X)
      val avgErr = error.averageError(z, data.y)
      val dw = error.errorGradient(z, data.y, data.X)

      // exponentially decaying average of the gradients (the momentum part)
      momentum = momentum.mul(betaMomentum).add(dw.mul(1.0 - betaMomentum))
      // exponentially decaying average of the squared gradients (the RMSProp part)
      val dw2 = Transforms.pow(dw, 2.0)
      squares = squares.mul(betaSquares).add(dw2.mul(1.0 - betaSquares))

      // Adam step: momentum-averaged gradient, scaled per component by the RMS of recent gradients
      w = w.sub(momentum.div(Transforms.sqrt(squares).add(epsilon)).mul(learningRate))

      val avgErrDbl = avgErr.getDouble(0)
      GradientDescent.maybePrintTooMuchCrap(printTooMuchCrap, w, dw, avgErrDbl)
      errors.append(avgErrDbl)
    }
    (w, errors)
  }
}


Normalising the data


Finally, a few words about normalising the data. I'm not quite sure what to think of this. Normalisation seems to be both quite elegant and straightforward, solving the uneven gradient problem effectively, and also somewhat unwieldy to apply to any problem where you might not have all of the data to hand to begin with, such as any kind of online learning or reinforcement learning, or a real-world project where after six months you might get a batch of brand new data. In actual applications it seems one can usually get away with only centering the data (i.e. subtracting the mean) or even forgo any normalisation at all. I suspect normalising the data is one of those techniques where I will eventually come across a problem that genuinely requires it, but this simplest of examples here doesn't.

In any case, here's what happens to any collinear or nearly collinear data set when you normalise each input and output dimension by subtracting the mean and dividing by the standard deviation:
Pictured: the three points, normalised, and the line $y = x$. Needless to say, this line is not difficult to find even with the most primitive gradient descent method!
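
For concreteness, the normalisation itself is just a couple of lines; here is a quick numpy check with the three points (not the exact code used for the plot):

import numpy as np

x = np.array([10.0, 243.0, 341.0])
y = np.array([37.0, 328.0, 451.0])

x_norm = (x - x.mean()) / x.std()   # subtract mean, divide by standard deviation
y_norm = (y - y.mean()) / y.std()

print(x_norm)   # roughly [-1.35, 0.32, 1.03]
print(y_norm)   # almost identical values - the normalised points lie essentially on y = x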

Tuesday 26 December 2017

Normal equations for the simple linear model

Edit: corrected an error in the second derivation (which did not affect the end result).

I've been recently going through some Deep Learning courses on Coursera. Just to refresh my memory I started writing some basic code to implement gradient descent in a few different ways. I started with the simplest possible thing, which is a linear model. Since this is solvable analytically I did that first, to get the "right" answer, which I could then compare to the gradient descent result.

It turns out that since there are different ways to arrange the data and the weights in their respective matrices, there are also different forms for the normal equations of the linear model. And as it happens, Andrew Ng's Coursera courses use one data layout, while traditional literature on linear regression uses the other data layout. I could not quickly find a reference for the normal equations for the former, so I did some investigating myself.

In the traditional layout, the model is something like (to take a simple example), \[ y = X w = \begin{bmatrix} 37\\ 328\\ 451 \end{bmatrix} = \begin{bmatrix} 1 & 10\\ 1 & 243\\ 1 & 341 \end{bmatrix} \begin{bmatrix} w_0\\ w_1 \end{bmatrix}, \] where the data points are $10$, $243$ and $341$ and the corresponding dependent variables are $37$, $328$ and $451$. The column of ones is added to the data matrix in the usual way so that $y_i = w_0 + w_1 x_i$ can be expressed directly without needing a separate bias variable for the $w_0$.

Now for a system like this, given some weights $w$, we can compute the prediction $\hat{y} = X w$ and then figure out how close to the real $y$ we got by simply calculating the squared difference, i.e. $L(\hat{y},y) = (\hat{y}-y)^2 = (Xw-y)^T (Xw-y)$. Gradient descent is of course based on the derivative of this loss function, but because the model here is so simple, the derivative can also directly give us the answer that minimises the loss: \[ \begin{align} L(\hat{y},y) & = (\hat{y}-y)^2 = (Xw-y)^T (Xw-y) \\ & = (w^T X^T - y^T)(Xw-y) \\ & = w^T X^T X w - w^T X^T y - y^T X w + y^T y \end{align} \] and taking the derivative of this w.r.t. $w$ and setting it to zero we get (since $w^T X^T y = y^T X w$ as they are both scalars) \[ \begin{align} 2 X^T X w - 2 X^T y & = 0 \\ X^T X w & = X^T y \\ \hat{w} & = (X^T X)^{-1} X^T y, \end{align} \] i.e. the familiar normal equations. (The matrix derivative formulas can be found on Wikipedia or in the Matrix Cookbook; note that $X^T X$ is symmetric.)

Now as I said, Andrew Ng favours the other orientation for the data matrices on his Deep Learning courses. On these courses the model would instead be \[ y = w X = \begin{bmatrix} 37 & 328 & 451 \end{bmatrix} = \begin{bmatrix} w_0 & w_1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \\ 10 & 243 & 341 \end{bmatrix}, \] which is the same data just in a column-vector format, i.e. $X$ is now $D \times n$ where $D$ is the dimension (plus one) and $n$ is the number of data points. So how will the normal equations look now? I couldn't quickly find a reference for this, but it's easy to derive them directly. The loss function is, similar to what we did before, \[ \begin{align} L(\hat{y},y) & = (\hat{y}-y)^2 = (wX-y)(wX-y)^T \\ & = (wX-y)(X^T w^T - y^T) \\ & = w X X^T w^T - w X y^T - y X^T w^T + y y^T \\ & = w X X^T w^T - 2 w X y^T + y y^T, \end{align} \] where we have again combined the middle two terms, which are just real numbers. Taking the derivative of this with respect to $w$ and setting it to zero we get (again consulting the sources for the appropriate matrix derivatives) \[ \begin{align} 2 w X X^T - 2 y X^T & = 0 \\ w X X^T & = y X^T \\ \hat{w} & = y X^T (X X^T)^{-1}. \end{align} \] This looks reasonable enough. Checking the matrix dimensions, all of them also match. Finally, I wrote some R code to check the results. First, the former, row-matrix layout:

> x <- matrix(c(1,1,1,10,243,341), byrow=FALSE, nrow=3)
> y <- matrix(c(37, 328, 451), byrow=FALSE, nrow=3)
> w_hat <- solve(t(x) %*% x) %*% t(x) %*% y
> w_hat
          [,1]
[1,] 24.416099
[2,]  1.250424

And then the column-matrix layout:
> x2 <- matrix(c(1,1,1,10,243,341), byrow=TRUE, ncol=3)
> y2 <- matrix(c(37, 328, 451), byrow=TRUE, ncol=3)
> w_hat_2 <- y2 %*% t(x2) %*% solve(x2 %*% t(x2))
> w_hat_2
        [,1]     [,2]
[1,] 24.4161 1.250424

And they agree. Nice.