I've recently been going through some Deep Learning courses on Coursera. Just to refresh my memory, I started writing some basic code to implement gradient descent in a few different ways. I started with the simplest possible thing, a linear model. Since a linear model is solvable analytically, I did that first to get the "right" answer, which I could then compare against the gradient descent result.
It turns out that since there are different ways to arrange the data and the weights in their respective matrices, there are also different forms for the normal equations of the linear model. And as it happens, Andrew Ng's Coursera courses use one data layout, while traditional literature on linear regression uses the other data layout. I could not quickly find a reference for the normal equations for the former, so I did some investigating myself.
In the traditional layout, the model is something like (to take a simple example) \[ y = X w = \begin{bmatrix} 37\\ 328\\ 451 \end{bmatrix} = \begin{bmatrix} 1 & 10\\ 1 & 243\\ 1 & 341 \end{bmatrix} \begin{bmatrix} w_0\\ w_1 \end{bmatrix}, \] where the data points are $10$, $243$ and $341$ and the corresponding dependent variables are $37$, $328$ and $451$. The column of ones is added to the data matrix in the usual way, so that $y_i = w_0 + w_1 x_i$ can be expressed directly without treating the bias $w_0$ as a separate variable.
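To make the layout concrete, here is the same design matrix built in NumPy (a Python sketch rather than R, purely for illustration; the trial weights below are arbitrary placeholders, not the fitted solution):

```python
import numpy as np

# Data points and targets from the example above.
x = np.array([10.0, 243.0, 341.0])
y = np.array([37.0, 328.0, 451.0])

# Prepend the column of ones so that y_i = w0 + w1 * x_i is just X @ w.
X = np.column_stack([np.ones_like(x), x])

w = np.array([20.0, 1.2])   # arbitrary trial weights
y_hat = X @ w               # predictions: w0 + w1 * x_i for each row
print(y_hat)                # predictions roughly [32, 311.6, 429.2]
```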
Now for a system like this, given some weights $w$, we can compute the prediction $\hat{y} = X w$ and then figure out how close to the real $y$ we got by simply calculating the squared difference, i.e. $L(\hat{y},y) = (\hat{y}-y)^2 = (Xw-y)^T (Xw-y)$. Gradient descent is of course based on the derivative of this loss function, but because the model here is so simple, the derivative can also directly give us the answer that minimises the loss: \[ \begin{align} L(\hat{y},y) & = (\hat{y}-y)^2 = (Xw-y)^T (Xw-y) \\ & = (w^T X^T - y^T)(Xw-y) \\ & = w^T X^T X w - w^T X^T y - y^T X w + y^T y \end{align} \] and taking the derivative of this w.r.t. $w$ and setting it to zero we get (since $w^T X^T y = y^T X w$ as they are both scalars) \[ \begin{align} 2 X^T X w - 2 X^T y & = 0 \\ X^T X w & = X^T y \\ \hat{w} & = (X^T X)^{-1} X^T y, \end{align} \] i.e. the familiar normal equations. (The matrix derivative formulas can be found on Wikipedia or in the Matrix Cookbook; note that $X^T X$ is symmetric.)
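As a quick sanity check on the derivation, the closed-form solution takes only a few lines of NumPy (again a Python sketch for illustration; note it solves the linear system $X^T X w = X^T y$ directly rather than forming the inverse, which is the numerically safer habit):

```python
import numpy as np

X = np.array([[1.0,  10.0],
              [1.0, 243.0],
              [1.0, 341.0]])
y = np.array([37.0, 328.0, 451.0])

# Normal equations: solve (X^T X) w = X^T y instead of computing
# (X^T X)^{-1} explicitly.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # roughly [24.416, 1.250]
```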
Now as I said, Andrew Ng favours the other orientation for the data matrices on his Deep Learning courses. On these courses the model would instead be \[ y = w X = \begin{bmatrix} 37 & 328 & 451 \end{bmatrix} = \begin{bmatrix} w_0 & w_1 \end{bmatrix} \begin{bmatrix} 1 & 1 & 1 \\ 10 & 243 & 341 \end{bmatrix}, \] which is the same data, but with each example as a column, i.e. $X$ is now $D \times n$ where $D$ is the dimension (plus one) and $n$ is the number of data points. So how will the normal equations look now? I couldn't quickly find a reference for this, but it's easy to derive them directly. The loss function is, similarly to before, \[ \begin{align} L(\hat{y},y) & = (\hat{y}-y)^2 = (wX-y)(wX-y)^T \\ & = (wX-y)(X^T w^T - y^T) \\ & = w X X^T w^T - w X y^T - y X^T w^T + y y^T \\ & = w X X^T w^T - 2 w X y^T + y y^T, \end{align} \] where we have again combined the middle two terms, which are just real numbers. Taking the derivative of this with respect to $w$ and setting it to zero we get (again consulting the sources for the appropriate matrix derivatives) \[ \begin{align} 2 w X X^T - 2 y X^T & = 0 \\ w X X^T & = y X^T \\ \hat{w} & = y X^T (X X^T)^{-1}. \end{align} \] This looks reasonable enough, and the matrix dimensions all match. Finally, I wrote some R code to check the results. First, the former, row-matrix layout:
> x <- matrix(c(1,1,1,10,243,341), byrow=FALSE, nrow=3)
> y <- matrix(c(37, 328, 451), byrow=FALSE, nrow=3)
> w_hat <- solve(t(x) %*% x) %*% t(x) %*% y
> w_hat
          [,1]
[1,] 24.416099
[2,]  1.250424
And then the column-matrix layout:
> x2 <- matrix(c(1,1,1,10,243,341), byrow=TRUE, ncol=3)
> y2 <- matrix(c(37, 328, 451), byrow=TRUE, ncol=3)
> w_hat_2 <- y2 %*% t(x2) %*% solve(x2 %*% t(x2))
> w_hat_2
        [,1]     [,2]
[1,] 24.4161 1.250424
And they agree. Nice.
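Since gradient descent was the original motivation, here is a minimal sketch of it in NumPy for the column-matrix layout, using the gradient $2(wX-y)X^T$ derived above. The learning rate, iteration count and the feature standardization are my own choices: with the raw $x$ values (10 to 341) the loss surface is so ill-conditioned that plain gradient descent crawls, so the sketch standardizes the feature first and then maps the weights back to the original scale.

```python
import numpy as np

x = np.array([10.0, 243.0, 341.0])
y = np.array([[37.0, 328.0, 451.0]])   # 1 x n row vector

# Standardize the feature to fix the conditioning of the loss surface.
mu, sigma = x.mean(), x.std()
z = (x - mu) / sigma
X = np.vstack([np.ones_like(z), z])    # D x n, one example per column

w = np.zeros((1, 2))                   # 1 x D weight row vector
lr = 0.1
for _ in range(200):
    grad = 2 * (w @ X - y) @ X.T       # dL/dw for L = (wX - y)(wX - y)^T
    w = w - lr * grad

# Undo the standardization to recover weights on the original scale.
w1 = w[0, 1] / sigma
w0 = w[0, 0] - w1 * mu
print(w0, w1)  # roughly 24.416, 1.250
```

On the standardized data $X X^T$ is essentially diagonal, so 200 steps are plenty to land on the same solution as the normal equations.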