Friday, 29 December 2017

Various gradient descent methods

Edit: fixed a bug in the code (was using total gradient instead of average). As a result, the first two graphs are slightly different, and the scale of the gradient comparison graphs is different.

After the previous post I wanted to solve the same problem with gradient descent. Turns out this particular problem illuminates rather well the somewhat finicky nature of gradient descent, while also being simple enough to understand.

The three points $(10,37)$, $(243,328)$ and $(341,451)$ are of course collinear, or very nearly so, and thus fitting a straight line through them is not the hardest problem I've ever solved. The correct answer, as determined in the previous post, is $y = 24.416 + 1.250 x$. But finding this out with ordinary gradient descent is harder than I thought.

By "ordinary gradient descent", I mean one where the weights are initialised randomly to "small" values and then the gradient of the loss function is followed, with a constant learning rate for each weight component, until convergence occurs. The basic problem with this is that the optimal values 24.416 and 1.250 are far enough apart in terms of magnitude that, since we're using the same learning rate for both weight components, either the gradient for the latter ($w_1$) will oscillate wildly and possibly diverge (learning rate too big), or the gradient for the former ($w_0$) will converge to the correct value very slowly (learning rate too small). To illustrate, here's a picture of completely standard gradient descent with a small learning rate:
Recall that the optimal value for $w_0$ is 24.416; yet after ten thousand iterations, standard gradient descent has only managed to increase it to about 3. Increasing the learning rate from $10^{-5}$ to, say, $10^{-4}$ doesn't help, since that makes the algorithm diverge: the gradient for the other weight, $w_1$, is scaled by the large $x$ values and cannot abide a larger learning rate.
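For concreteness, the whole algorithm is just the following loop. This is a minimal sketch, borrowing the Data and Error helpers from the full Adam listing at the end of the post:

// Plain gradient descent: step along the average gradient at a fixed learning rate.
var w = wInitial
for (i <- 0 until n) {
  val z = w.mmul(data.X)                          // predictions, a 1 x n row vector
  val dw = error.errorGradient(z, data.y, data.X) // average gradient, 1 x D
  w = w.sub(dw.mul(learningRate))                 // w := w - learningRate * dw
}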

There are various ways to try to remedy this. One thing I'm not sure would help much is better initialisation (I went for (unscaled) standard normal for what it's worth - being careful to keep the same seed across all my experiments). The optimal values for the weights are only about an order of magnitude apart, and it's not clear whether there's any general method that could guess more suitable initial values "in one go": if you get it wrong by about an order of magnitude, that doesn't help at all, and if you could somehow do better than that, then gradient descent wouldn't even be needed: just use your magical guessing method instead.

A better idea is to normalise the data, so that the gradients are of similar magnitude in each direction. I'll come back to this point later.

A common remedy is an adaptive learning rate of some kind. These seem to be the go-to methods in practice, based both on Andrew Ng's Coursera courses and on what's commonly available in APIs such as TensorFlow; of course, these more advanced methods become more valuable the more difficult the problem is. So I implemented (from scratch) some of them.

Momentum


The momentum method is a straightforward optimisation that maintains a moving average of previous gradients, and uses this in place of the current gradient at each step. The idea is that averaging the gradients will cancel out any oscillations, resulting in a smaller effective learning rate for any gradient component that would otherwise oscillate, while non-oscillating components are unaffected by the averaging and remain as large as they were. This enables the use of a larger learning rate, since the likelihood of divergence is lower. (The averaging is done with a simple-to-implement exponentially decaying average; an "exact" average that remembers the previous $k$ gradients would also be possible, but exponential decay works just fine in practice.) And indeed, with momentum (all other things being equal), a higher learning rate becomes feasible and the learning goes much faster, though still not amazingly fast:
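In code, the only change to the plain loop above is the decaying average. A sketch; betaMomentum would be around 0.9, as in the Adam code at the end of the post:

// Momentum: replace the raw gradient with an exponentially decaying average of gradients.
var w = wInitial
var momentum = Nd4j.zeros(w.shape()(1))
for (i <- 0 until n) {
  val z = w.mmul(data.X)
  val dw = error.errorGradient(z, data.y, data.X)
  // Oscillating components cancel out in the average; consistent ones persist.
  momentum = momentum.mul(betaMomentum).add(dw.mul(1.0 - betaMomentum))
  w = w.sub(momentum.mul(learningRate))
}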


RMSProp


RMSProp takes a different approach: instead of averaging the gradients themselves, it maintains a decaying average of their squares, and divides each gradient component by the square root of that average. The update thus depends only on the magnitudes of recent gradients, not their signs. This can work better in some cases, I guess, but in my case it behaved rather strangely: it allowed me to use a larger learning rate than momentum did, and converged more quickly, but after that it started oscillating a bit. This could be due to precisely the fact that RMSProp uses (square roots of) squared gradients: since the information about the sign of the gradient is lost, back-and-forth oscillations cannot really be prevented.
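A sketch of the update, with betaSquares and epsilon as in the Adam code at the end of the post:

// RMSProp: scale each gradient component by a decaying average of its recent magnitude.
var w = wInitial
var squares = Nd4j.zeros(w.shape()(1))
for (i <- 0 until n) {
  val z = w.mmul(data.X)
  val dw = error.errorGradient(z, data.y, data.X)
  // Decaying average of the squared gradients; the signs are lost here.
  squares = squares.mul(betaSquares).add(Transforms.pow(dw, 2.0).mul(1.0 - betaSquares))
  // Components with large recent gradients get an effectively smaller learning rate.
  w = w.sub(dw.div(Transforms.sqrt(squares).add(epsilon)).mul(learningRate))
}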


Adam


Finally, Adam is a combination of momentum and RMSProp: both decaying averages are maintained at the same time, and the combined method should therefore prevent oscillations and also converge fairly quickly, even with a fairly high base learning rate. And this indeed seems to be the case:
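Concretely, the update implemented in the code at the end of the post is \[ \begin{align} m & \leftarrow \beta_1 m + (1-\beta_1) \nabla_w L \\ s & \leftarrow \beta_2 s + (1-\beta_2) (\nabla_w L)^2 \\ w & \leftarrow w - \alpha \frac{m}{\sqrt{s}+\epsilon}, \end{align} \] with the squaring and the square root taken element-wise, and $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$. (The published Adam algorithm also applies a bias correction to $m$ and $s$; my quick-and-dirty version skips that.)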


Gradient comparisons


As a further comparison of the various methods, I also had a look at the magnitudes of the gradients. Now as you recall, we want each component of the gradient to reach zero (and stay there), at which point the algorithm has converged. Plotting the gradient components directly basically corroborates the story above: plain gradient descent is stable but extremely slow; momentum is better; RMSProp is faster but less stable than momentum; and Adam is the best overall, achieving both speed and stability.


Code used


Here's the quick-and-dirty code I used for Adam; it's written in Scala, using the ND4J library for the matrix computations. (The code for the other methods is similar but less interesting.)
case object SquaredError extends Error {

  // Average squared error, halved: (1/2n) * sum_i (z_i - y_i)^2, as a 1 x 1 matrix.
  override def averageError(z: INDArray, y: INDArray): INDArray = {
    val diff = z.sub(y)
    diff.mmul(diff.transpose()).div(2.0*diff.shape()(1))
  }

  // Average gradient of the error w.r.t. the weights: (1/n) * (z - y) X^T, a 1 x D row vector.
  override def errorGradient(z: INDArray, y: INDArray, X: INDArray): INDArray = {
    val diff = z.sub(y)
    diff.mmul(X.transpose()).mul(1.0/diff.shape()(1))
  }
}

case class Adam(
                 betaMomentum: Double = 0.9,   // decay rate for the average of gradients
                 betaSquares: Double = 0.999,  // decay rate for the average of squared gradients
                 epsilon: Double = 1E-8        // guards against division by zero
               ) extends Optimiser {
  override def runForNRounds(n: Int, wInitial: INDArray, data: Data, learningRate: Double, error: Error,
                             printTooMuchCrap: Boolean): (INDArray, Seq[Double]) = {
    var w = wInitial
    val errors = ArrayBuffer[Double]()

    // Decaying averages of the gradients (momentum) and of their squares (RMSProp).
    var momentum = Nd4j.zeros(w.shape()(1))
    var squares = Nd4j.zeros(w.shape()(1))

    for (i <- 0 until n) {
      val z = w.mmul(data.X)                           // predictions
      val avgErr = error.averageError(z, data.y)
      val dw = error.errorGradient(z, data.y, data.X)  // average gradient

      // Note: the published Adam algorithm also bias-corrects these averages;
      // this quick-and-dirty version skips that.
      momentum = momentum.mul(betaMomentum).add(dw.mul(1.0 - betaMomentum))
      val dw2 = Transforms.pow(dw, 2.0)
      squares = squares.mul(betaSquares).add(dw2.mul(1.0 - betaSquares))

      // The update: w := w - learningRate * momentum / (sqrt(squares) + epsilon).
      w = w.sub(momentum.div(Transforms.sqrt(squares).add(epsilon)).mul(learningRate))

      val avgErrDbl = avgErr.getDouble(0)
      GradientDescent.maybePrintTooMuchCrap(printTooMuchCrap, w, dw, avgErrDbl)
      errors.append(avgErrDbl)
    }
    (w, errors)
  }
}
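For completeness, here's roughly how this gets invoked. This is only a hypothetical usage sketch, since it assumes Data is a simple case class wrapping the (bias-augmented) inputs and the targets:

// Data(X, y) is assumed to be a case class holding the 2 x 3 input matrix
// (including the row of ones) and the 1 x 3 targets.
val X = Nd4j.create(Array(Array(1.0, 1.0, 1.0), Array(10.0, 243.0, 341.0)))
val y = Nd4j.create(Array(Array(37.0, 328.0, 451.0)))
val wInitial = Nd4j.randn(1, 2) // standard normal initialisation, as in the experiments

val (w, errors) = Adam().runForNRounds(
  10000, wInitial, Data(X, y), 0.1, SquaredError, printTooMuchCrap = false)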


Normalising the data


Finally, a few words about normalising the data. I'm not quite sure what to think of this. On the one hand, normalisation is quite elegant and straightforward, and solves the uneven gradient problem effectively. On the other hand, it's somewhat unwieldy to apply to any problem where you don't have all of the data to hand to begin with, such as any kind of online learning or reinforcement learning, or a real-world project where after six months you get a batch of brand new data. In actual applications it seems one can usually get away with only centering the data (i.e. subtracting the mean), or even forgo normalisation altogether. I suspect normalising the data is one of those techniques where I will eventually come across a problem that genuinely requires it, but this simplest of examples here doesn't.

In any case, here's what happens to any collinear or nearly collinear data set when you normalise each input and output dimension by subtracting the mean and dividing by the standard deviation:
Pictured: the three points, normalised, and the line $y = x$. Needless to say, this line is not difficult to find even with the most primitive gradient descent method!
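The normalisation itself is simple enough; here's a sketch in plain Scala, applied to one dimension at a time:

// Standardise one dimension: subtract the mean, divide by the (population) standard deviation.
def standardise(xs: Seq[Double]): Seq[Double] = {
  val mean = xs.sum / xs.size
  val std = math.sqrt(xs.map(x => math.pow(x - mean, 2)).sum / xs.size)
  xs.map(x => (x - mean) / std)
}

standardise(Seq(10.0, 243.0, 341.0)) // roughly (-1.354, 0.324, 1.030)
standardise(Seq(37.0, 328.0, 451.0)) // roughly (-1.354, 0.323, 1.031)

Both dimensions map to (almost exactly) the same values, which is just another way of saying that the normalised points lie on the line $y = x$.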

Tuesday, 26 December 2017

Normal equations for the simple linear model

Edit: corrected an error in the second derivation (which did not affect the end result).

I've recently been going through some Deep Learning courses on Coursera. To refresh my memory, I started writing some basic code to implement gradient descent in a few different ways, beginning with the simplest possible thing: a linear model. Since this is solvable analytically, I did that first, to get the "right" answer to compare the gradient descent results against.

It turns out that since there are different ways to arrange the data and the weights in their respective matrices, there are also different forms for the normal equations of the linear model. And as it happens, Andrew Ng's Coursera courses use one data layout, while traditional literature on linear regression uses the other data layout. I could not quickly find a reference for the normal equations for the former, so I did some investigating myself.

In the traditional layout, the model is something like (to take a simple example), \[ y = X w = \begin{bmatrix} 37\\ 328\\ 451 \end{bmatrix} = \begin{bmatrix} 1 & 10\\ 1 & 243\\ 1 & 341 \end{bmatrix} \begin{bmatrix} w_0\\ w_1 \end{bmatrix}, \] where the data points are $10$, $243$ and $341$ and the corresponding dependent variables are $37$, $328$ and $451$. The column of ones is added to the data matrix in the usual way, so that $y_i = w_0 + w_1 x_i$ can be expressed directly, without needing to treat the intercept $w_0$ as a separate bias variable.

Now for a system like this, given some weights $w$, we can compute the prediction $\hat{y} = X w$ and then figure out how close to the real $y$ we got by simply calculating the squared difference, i.e. $L(\hat{y},y) = (\hat{y}-y)^2 = (Xw-y)^T (Xw-y)$. Gradient descent is of course based on the derivative of this loss function, but because the model here is so simple, the derivative can also directly give us the answer that minimises the loss: \[ \begin{align} L(\hat{y},y) & = (\hat{y}-y)^2 = (Xw-y)^T (Xw-y) \\ & = (w^T X^T - y^T)(Xw-y) \\ & = w^T X^T X w - w^T X^T y - y^T X w + y^T y \end{align} \] and taking the derivative of this w.r.t. $w$ and setting it to zero we get (since $w^T X^T y = y^T X w$ as they are both scalars) \[ \begin{align} 2 X^T X w - 2 X^T y & = 0 \\ X^T X w & = X^T y \\ \hat{w} & = (X^T X)^{-1} X^T y, \end{align} \] i.e. the familiar normal equations. (The matrix derivative formulas can be found on Wikipedia or in the Matrix Cookbook; note that $X^T X$ is symmetric.)

Now as I said, Andrew Ng favours the other orientation for the data matrices on his Deep Learning courses. On these courses the model would instead be \[ y = w X = \begin{bmatrix} 37 & 328 & 451 \end{bmatrix} = \begin{bmatrix} w_0 & w_1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \\ 10 & 243 & 341 \end{bmatrix}, \] which is the same data just in a column-vector format, i.e. $X$ is now $D \times n$ where $D$ is the dimension (plus one) and $n$ is the number of data points. So how will the normal equations look now? I couldn't quickly find a reference for this, but it's easy to derive them directly. The loss function is, similar to what we did before, \[ \begin{align} L(\hat{y},y) & = (\hat{y}-y)^2 = (wX-y)(wX-y)^T \\ & = (wX-y)(X^T w^T - y^T) \\ & = w X X^T w^T - w X y^T - y X^T w^T + y y^T \\ & = w X X^T w^T - 2 w X y^T + y y^T, \end{align} \] where we have again combined the middle two terms, which are just real numbers. Taking the derivative of this with respect to $w$ and setting it to zero we get (again consulting the sources for the appropriate matrix derivatives) \[ \begin{align} 2 w X X^T - 2 y X^T & = 0 \\ w X X^T & = y X^T \\ \hat{w} & = y X^T (X X^T)^{-1}. \end{align} \] This looks reasonable enough. Checking the matrix dimensions, all of them also match. Finally, I wrote some R code to check the results. First, the former, row-matrix layout:

> x <- matrix(c(1,1,1,10,243,341), byrow=FALSE, nrow=3)
> y <- matrix(c(37, 328, 451), byrow=FALSE, nrow=3)
> w_hat <- solve(t(x) %*% x) %*% t(x) %*% y
> w_hat
          [,1]
[1,] 24.416099
[2,]  1.250424

And then the column-matrix layout:
> x2 <- matrix(c(1,1,1,10,243,341), byrow=TRUE, ncol=3)
> y2 <- matrix(c(37, 328, 451), byrow=TRUE, ncol=3)
> w_hat_2 <- y2 %*% t(x2) %*% solve(x2 %*% t(x2))
> w_hat_2
        [,1]     [,2]
[1,] 24.4161 1.250424

And they agree. Nice.