Sunday, 10 June 2018

Layers & gradient shapes in neural networks

TLDR: backpropagation is reasonably straightforward, but the details are subtle.

This post became much bigger and took much longer than I expected. Hopefully it is useful for others who might be interested in the mathematical foundations of backpropagation. Basically, at each layer we calculate $z=wa$, and the goal is to optimise various things by taking the derivative of this. Well, the derivative of $z$ with respect to $w$ can only be $a$ (or some reasonable variation thereof, such as a matrix transpose $a^T$), and the derivative with respect to $a$ can only be $w$ (or some reasonable variation thereof). One of these will be used to adjust the weights $w$ at this layer, while the other will be passed backwards, in an application of the chain rule, to adjust the weights of previous layers. That's it, in a nutshell. For the details, which are more complicated, read on.

Preface


I recently completed all of Andrew Ng's Deep Learning courses on Coursera. These courses are really good and I highly recommend them; Professor Ng has the skill to explain complicated concepts in a very understandable way.

To make sure I learn this stuff properly, I wanted to deepen my understanding of the mathematics involved in gradient descent (see also the previous two posts). Now Prof Ng doesn't actually get very deep into the mathematics on his courses, which is fine, but I wanted a deeper dive.

Simple network


Let's start at the beginning. I will use a column-oriented layout ("$w x$") as in the previous posts and start to construct a neural network model by building on the simple linear model. Ng's notation is somewhat confusing when it comes to matrix orientation: the input data $X$ is column-oriented, as are the activations $a^{[1]}, a^{[2]}$ etc; but the weights are row-oriented, i.e. all the weight matrices $w^{[1]}, w^{[2]}$ etc are row matrices. This is probably because ${w^{[i]}}^T$ is a bit cumbersome to write everywhere. I will follow this same convention: weights are rows, other stuff is columns.

One thing that will be different from the simple model is the bias variables. In traditional statistics, one can add an extra row (or column) of ones to the data matrix to use as bias variables (see previous posts), which simplifies the math, since all the variables, weights and biases, will then be contained in one matrix. However in a multi-layered network, this becomes impractical, since one would have to modify the output matrix at each layer to add this bias row. It is possible but not desirable to keep doing these modifications at every layer. Another issue is that in larger neural networks the initialisation of the weights is very important (see so-called Xavier initialisation), whereas biases can just be initialised to zero. I'm not sure how much it matters how the biases are initialised, but this is something that could have an effect.

So, building on the traditional linear model from the previous posts, the simplest model that could reasonably be called a neural network would be a logistic regression model (or another generalised linear model, but here I'm using logistic regression as an example). In a logistic regression model we have as input some data points $X$ and the correct answers $y$, which in this case are binary (0 and 1), and the model uses a weight vector $w$ plus a (separate) bias $b$ to produce predictions $\hat{y}$: \[ \begin{eqnarray} z & = & w^T X + b \\ a & = & \sigma(z) = \frac{1}{1+e^{-z}} \\ \hat{y} & = & a \end{eqnarray} \] The loss function used here would usually be the log loss: \[ L(y,\hat{y}) = -y \log \hat{y} - (1-y) \log (1-\hat{y}) \] So the goal of course is to adjust the parameters $w$ and $b$ so as to minimise the loss; and traditional gradient descent of course does this by using the gradient. How do we compute the gradient? That might get complicated here since we have a whole matrix full of data, a bunch of vectors for the weights and outcomes, and more than one function chained together. It is this multi-dimensional shape of things that gave me some difficulty. Basically there are two separate but related things: how to do the math correctly, taking all the vectors and matrices into account; and, since we actually want the average gradient rather than $n$ separate ones, how to compute that efficiently using suitable matrix operations. Let's see how it all develops.

First consider the logistic regression model above in the simplest case, where we only have one data point, a $d$-dimensional column vector $x$. The weight vector is a $d$-dimensional vector as well, and computing the gradients of the loss with respect to the parameters is straightforward enough, just using the standard formulas for vector derivatives: \[ \begin{eqnarray} \frac{\mathrm{d} L}{\mathrm{d} w} & = & \frac{\mathrm{d} L}{\mathrm{d} a} \frac{\mathrm{d} a}{\mathrm{d} z} \frac{\mathrm{d} z}{\mathrm{d} w} \\ & = & (-\frac{y}{a} + \frac{1-y}{1-a}) \, a(1-a) \, x^T \\ & = & (a-y)x^T; \\ \frac{\mathrm{d} L}{\mathrm{d} b} & = & \frac{\mathrm{d} L}{\mathrm{d} a} \frac{\mathrm{d} a}{\mathrm{d} z} \frac{\mathrm{d} z}{\mathrm{d} b} \\ & = & (-\frac{y}{a} + \frac{1-y}{1-a}) \, a(1-a) \, 1 \\ & = & a-y. \end{eqnarray} \] Here we got away without too much worry about the fact that some things are vectors. The next thing to do is to generalise these results to the more realistic case of $n$ data points, for which we want the average gradient. Consider that for any given data point $x$, the gradients will of course be the same as above, namely $(a-y)x^T$ and $a-y$. Given all the data points in a matrix, \[ X = \begin{bmatrix} \vert & & \vert \\ x_1 & \ldots & x_n \\ \vert & & \vert \end{bmatrix}, \] it is easy enough to basically guess how to efficiently compute the sum of all the gradients with respect to $w$: \[ g = (a-y)X^T = \begin{bmatrix} a_1-y_1 & a_2-y_2 & \ldots & a_n-y_n \end{bmatrix} \begin{bmatrix} \vert & & \vert \\ x_1 & \ldots & x_n \\ \vert & & \vert \end{bmatrix} \] And all we need to do is divide the result by $n$ to get the average gradient. The case for the average gradient with respect to $b$ is similar but simpler.

More complex network: single data point


Let's move on to more complicated models. In a proper neural network, you will obviously have multiple layers. Each layer is a collection of neurons, each of which has its own weight vector and bias, so in total you will have a matrix of weight vectors and a vector of biases, per layer. For example, with two layers (using Ng's notation): \[ \begin{eqnarray} z^{[1]} & = & w^{[1]} X + b^{[1]} \\ a^{[1]} & = & \sigma(z^{[1]}) \\ z^{[2]} & = & w^{[2]} a^{[1]} + b^{[2]} \\ a^{[2]} & = & \sigma(z^{[2]}) \\ \hat{y} & = & a^{[2]} \end{eqnarray} \] In this network we're using sigmoid activation for both layers and the last layer is meant to produce a one-dimensional output, which then gets squeezed to $[0,1]$ by the sigmoid as usual. To make the dimensions more concrete, consider a simple case of 3-dimensional data points, as column vectors in $X$, being reduced first to 2-dimensional activations by the first layer and then to 1-dimensional predictions by the second and final layer. As you can check, in this case $w^{[1]}$ would be a $(2,3)$ matrix (with the weight vectors as rows) and $w^{[2]}$ would be a $(1,2)$ row vector. The outputs $z^{[1]}$ would then be $(2,n)$ and the outputs $z^{[2]}$ would be $(1,n)$. Of course the activation functions do not affect these shapes, being component-wise.

How about the gradient calculation in this case? Well, it gets a bit complicated. Basically the chain rule, which I used somewhat recklessly before, still applies in the case of vectors, but one needs to use Jacobians rather than ordinary derivatives, Jacobians being the proper generalisation. However in the multiple-layer case we are dealing not only with vectors but also with matrices.

Consider the two-layer example above. The derivatives of the loss with respect to the last layer's parameters are the same as they were in the one-layer case. However, we also want to adjust the parameters of the previous layer. Ignoring the activations for the moment (since they don't affect the shapes of the gradients), we want to compute $\mathrm{J} a^{[2]}(w^{[1]})$, that is, the Jacobian of the last layer's activations with respect to the first layer's weights. But there are $n$ outcomes from the last layer and the first layer's weights are a matrix, so the result is a higher-dimensional tensor. With a vector input and a vector output we would get a nice, understandable Jacobian matrix, but with a vector and a matrix we get something else.

There are a few ways of making sense of this. One way would be to learn how tensors work, but this didn't seem like the best avenue here for various reasons. Another way would be to vectorise the inputs, that is, treat the last-layer activations as a function of a vector which is all the first-layer weights concatenated. I ended up choosing a third approach, which is to consider each vector in the weight matrix separately. Let's see how that works in practice.

Imagine a two-layer network with a very simple low-dimensional structure as before: three-dimensional input data, two neurons in the first layer, and one neuron in the final layer. Imagine further, just for a moment, that there are no activation functions or biases (since they don't affect the shapes of the gradients w.r.t. weights). Concretely, \[ \begin{eqnarray} a^{[1]} & = & w^{[1]} X = \begin{bmatrix} \text{---} & w^{[1]}_a & \text{---} \\ \text{---} & w^{[1]}_b & \text{---} \end{bmatrix} \begin{bmatrix} \vert & & \vert \\ x_1 & \ldots & x_n \\ \vert & & \vert \end{bmatrix}; \\ a^{[2]} & = & w^{[2]} a^{[1]} = \begin{bmatrix} \text{---} & w^{[2]} & \text{---} \\ \end{bmatrix} \begin{bmatrix} \vert & & \vert \\ a_1 & \ldots & a_n \\ \vert & & \vert \end{bmatrix}. \end{eqnarray} \] The difficult bit now is to calculate the Jacobian of $a^{[2]}$ with respect to the first-layer weight matrix $w^{[1]}$. So, let's do that part by part, by first computing the Jacobian with respect to the first weight vector in the first layer, i.e. $w^{[1]}_a$. Considered this way, the Jacobian is an ordinary matrix, as it is computed for a function that takes a vector and returns a vector. We can then do the same computation for the second weight vector $w^{[1]}_b$ and, if there are more than two neurons in the first layer, we will also be able to see how it generalises beyond that.

Let's again start with the case of just one input vector $x$. By the Jacobian chain rule, \[ \begin{eqnarray} \textrm{J} a^{[2]}(w^{[1]}_a) & = & \textrm{J} a^{[2]}(a^{[1]}) \textrm{J} a^{[1]}(w^{[1]}_a). \\ \end{eqnarray} \] Now if there is only one data point, the final layer produces only one outcome, so the Jacobian of $a^{[2]}$ will have one row; it will have 3 columns, since that is the number of dimensions in $w^{[1]}_a$. Of the "intermediate" components, $\textrm{J} a^{[2]}(a^{[1]})$ is $(1,2)$, i.e. one outcome in $a^{[2]}$ from two components of $a^{[1]}$, and $\textrm{J} a^{[1]}(w^{[1]}_a)$ is $(2,3)$, i.e. two outcomes in $a^{[1]}$ from three components of $w^{[1]}_a$. Of course $(1,2) \times (2,3) = (1,3)$ as expected. These two "component" Jacobians are now straightforward to calculate: \[ \begin{eqnarray} \textrm{J} a^{[2]}(a^{[1]}) & = & \begin{bmatrix} \text{---} & \partial a^{[2]}/\partial a^{[1]} & \text{---} \end{bmatrix} \\ & = & \begin{bmatrix} \text{---} & \partial w^{[2]} a^{[1]}/\partial a^{[1]} & \text{---} \end{bmatrix} \\ & = & w^{[2]}; \\ \textrm{J} a^{[1]}(w^{[1]}_a) & = & \begin{bmatrix} \text{---} & \partial a^{[1]}_1/\partial w^{[1]}_a & \text{---} \\ \text{---} & \partial a^{[1]}_2/\partial w^{[1]}_a & \text{---} \\ \end{bmatrix}\\ & = & \begin{bmatrix} \text{---} & \partial w^{[1]}_a x /\partial w^{[1]}_a & \text{---} \\ \text{---} & \partial w^{[1]}_b x /\partial w^{[1]}_a & \text{---} \\ \end{bmatrix}\\ & = & \begin{bmatrix} \text{---} & x^T & \text{---} \\ \text{---} & 0 & \text{---} \\ \end{bmatrix},\\ \end{eqnarray} \] and therefore the whole Jacobian is \[ \begin{eqnarray} \textrm{J} a^{[2]}(w^{[1]}_a) & = & w^{[2]} \begin{bmatrix} \text{---} & x^T & \text{---} \\ \text{---} & 0 & \text{---} \\ \end{bmatrix}\\ & = & w^{[2]}_1 x^T. \end{eqnarray} \] So this is the gradient we can use to adjust the first weight vector of the first layer. Doing the same calculation for the second weight vector, we get a very similar result: \[ \begin{eqnarray} \textrm{J} a^{[2]}(w^{[1]}_b) & = & w^{[2]}_2 x^T. \end{eqnarray} \] And we can now put these two together to get the complete Jacobian, i.e. the complete gradient: \[ \begin{eqnarray} \textrm{J} a^{[2]}(w^{[1]}) & = & \begin{bmatrix} \text{---} & w^{[2]}_1 x^T & \text{---} \\ \text{---} & w^{[2]}_2 x^T & \text{---} \end{bmatrix} \\ & = & \begin{bmatrix} w^{[2]}_1 \\ w^{[2]}_2 \end{bmatrix} \begin{bmatrix} \text{---} & x^T & \text{---} \end{bmatrix} \\ & = & {w^{[2]}}^T x^T. \\ \end{eqnarray} \] This gradient matrix is just the two gradient vectors in one. Of course it has the same shape as $w^{[1]}$, so we can use it directly.

More complex network: multiple data points


So what about more than one data point? In that case, the Jacobian of $a^{[2]}$ will have $n$ rows, corresponding to the $n$ outcomes. There are now also $n$ components in $a^{[1]}$, which makes the calculation a little more complicated. Basically, since we're going from $a^{[2]}$ to $w^{[1]}$ through multiple routes - all the various $a^{[1]}_i$s - we can calculate the Jacobian as the total derivative of all these routes. For the first weight vector of the first layer, \[ \begin{eqnarray} \textrm{J} a^{[2]}(w^{[1]}_a) & = & \textrm{J} a^{[2]}(a^{[1]}_1) \textrm{J} a^{[1]}_1(w^{[1]}_a) + \textrm{J} a^{[2]}(a^{[1]}_2) \textrm{J} a^{[1]}_2(w^{[1]}_a) + \ldots \end{eqnarray} \] To make this more concrete, let's do an example calculation of the first element of this sum, that is, through the first element, $a^{[1]}_1$; this is the first output element of the first layer. This partial Jacobian will have the same shape as the full one, but most of its contents will be zero, similar to what we saw before: \[ \begin{eqnarray} \textrm{J} a^{[2]}(a^{[1]}_1) & = & \begin{bmatrix} \text{---} & \partial a^{[2]}_1 / \partial a^{[1]}_1 & \text{---} \\ \text{---} & \partial a^{[2]}_2 / \partial a^{[1]}_1 & \text{---} \\ & \ldots & \\ \text{---} & \partial a^{[2]}_n / \partial a^{[1]}_1 & \text{---} \\ \end{bmatrix}\\ & = & \begin{bmatrix} \text{---} & \partial w^{[2]} a^{[1]}_1 / \partial a^{[1]}_1 & \text{---} \\ \text{---} & \partial w^{[2]} a^{[1]}_2 / \partial a^{[1]}_1 & \text{---} \\ & \ldots & \\ \text{---} & \partial w^{[2]} a^{[1]}_n / \partial a^{[1]}_1 & \text{---} \\ \end{bmatrix}\\ & = & \begin{bmatrix} \text{---} & w^{[2]} & \text{---} \\ \text{---} & 0 & \text{---} \\ & \ldots & \\ \text{---} & 0 & \text{---} \\ \end{bmatrix};\\ \textrm{J} a^{[1]}_1(w^{[1]}_a) & = & \begin{bmatrix} \text{---} & \partial a^{[1]}_{11} / \partial w^{[1]}_a & \text{---} \\ \text{---} & \partial a^{[1]}_{12} / \partial w^{[1]}_a & \text{---} \\ \end{bmatrix}\\ & = & \begin{bmatrix} \text{---} & \partial w^{[1]}_a x_1 / \partial w^{[1]}_a & \text{---} \\ \text{---} & \partial w^{[1]}_b x_1 / \partial w^{[1]}_a & \text{---} \\ \end{bmatrix}\\ & = & \begin{bmatrix} \text{---} & x_1^T & \text{---} \\ \text{---} & 0 & \text{---} \\ \end{bmatrix}.\\ \end{eqnarray} \] The first thing here is an $(n, 2)$ matrix, and the second thing is a $(2, 3)$ matrix, so together they produce an $(n, 3)$ matrix. Notice that we now have multiple data points $x_i$, but as the model is very "regular", only the first data point $x_1$ contributes to the first activation $a^{[1]}_1$, i.e. $a^{[1]}_1 = \begin{bmatrix} w^{[1]}_a x_1 & w^{[1]}_b x_1 \end{bmatrix}^T$.

So what do we do with this? By itself, not much; but if we do the same calculation for the second element of the sum - i.e. through the second activation vector, to which only the second data point contributes - and then for the third element and so on, the result for the whole sum is easily seen to be \[ \begin{eqnarray} \textrm{J} a^{[2]}(w^{[1]}_a) & = & w^{[2]}_1 \begin{bmatrix} \text{---} & x_1^T & \text{---} \\ \text{---} & x_2^T & \text{---} \\ & \ldots & \\ \text{---} & x_n^T & \text{---} \\ \end{bmatrix}\\ & = & w^{[2]}_1 X^T .\\ \end{eqnarray} \] Here of course $w^{[2]}_1$ is simply a scalar, the first element of the two-element weight vector.

So starting from $n$ data points, we have calculated $n$ different gradients for the first weight vector - one gradient per point. This seems logical enough. Similarly, if we do the math for the other weight vector $w^{[1]}_b$, we end up with a slightly different arrangement of matrices, such that the second element of $w^{[2]}$ is now used, and the result is \[ \begin{eqnarray} \textrm{J} a^{[2]}(w^{[1]}_b) & = & w^{[2]}_2 X^T ,\\ \end{eqnarray} \] i.e. the second element of the second weight vector, times all the gradients of each point.

So this is pretty understandable. For the first weight vector, the Jacobian involves all of the gradients of each input to that layer - here just the original data points, since this is the first layer - times the element in the second-layer weight vector that corresponds to the first weight vector. For the second weight vector, we have the same gradients, times the second element of the weight vector from the next layer. It is obvious that if there are more neurons and hence more weight vectors in the first layers, say $q$ of them, there also needs to be $q$ columns in the next layer's weight vector or matrix, and this same process happens $q$ times.

How to best calculate this? Keeping in mind that we are actually interested in the average gradient, the most straightforward thing to do from an implementation point of view would be to calculate the average of the original data points first, then do a simple for loop that runs $q$ times to calculate the rest. However such for loops are just distasteful when doing matrix calculations. Fortunately there is a better way.

It turns out to be helpful to consider also the role of the (immediate) gradients of the loss function and the activation functions, as they help with doing all the required computations more efficiently.

Loss and activation


There are a few different ways of considering the gradients of the loss function and the activation functions. The more straightforward way is to reason that since these functions are taken elementwise, their derivatives are also elementwise, and that's that. In other words, there's no need for any matrix math, since the functions and their derivatives are simply $\mathbb{R} \to \mathbb{R}$. Done this way, when one uses the resulting matrices in an application of the chain rule, the multiplication must be done elementwise as well.

More mathematically, the loss function can be considered to be a function that takes in $n$ predictions and outputs the $n$ corresponding losses (the correct answers $y$ are of course constant). Hence, the Jacobian of the loss function is a matrix with $n$ rows, one for each loss, and $n$ columns, one for each input. However, due to the "regular" structure of the model, the Jacobian will be a diagonal matrix, since clearly the first input, say, has no effect on losses other than the first one. So we have \[ \begin{eqnarray} \textrm{J} L (a^{[2]}) & = & \textrm{diag}(\partial L(a^{[2]}_i) / \partial a^{[2]}_i) . \end{eqnarray} \] Similarly, the activation function of the last layer takes in $n$ inputs and produces $n$ outputs, and since each input only affects the output at the same index, the immediate Jacobian of an activation function $\sigma$ is \[ \begin{eqnarray} \textrm{J} \sigma (z^{[2]}) & = & \textrm{diag}(\partial \sigma(z^{[2]}_i) / \partial z^{[2]}_i) . \end{eqnarray} \] For example, in the case of traditional linear regression there is no activation function (i.e. the activation function is the identity function and its Jacobian is the identity matrix) and hence the product of these two Jacobians is simply $\textrm{diag}(a_i-y_i)$, which is the direct Jacobian of the loss. If we are using a sigmoid in the last layer and the logistic loss function (e.g. logistic regression), then the product of the Jacobians turns out to be the same, as is known.

In practice of course it doesn't make sense to construct a matrix with $n^2$ elements, only $n$ of which are non-zero. But from the diagonal matrix form we can see that the mathematically correct way is indeed equivalent to the simpler elementwise multiplication mentioned before. In inner layers, where the Jacobians would again be more complex, I didn't do the math for the Jacobians, since the simple elementwise computation is sufficient (as done also by Ng and others).

How do these loss & activation derivatives fit in with the rest of the calculations? Well, previously we saw that a primitive way to calculate the gradients for the first-layer weights would be to calculate $w^{[2]}_q X^T$ for each (scalar) element $q$. This of course is an $(n,d)$ matrix, so it's no problem to multiply it from the left with an $(n,n)$ diagonal matrix or, equivalently, elementwise from the left with an $(n,1)$ column vector. (The latter operation is slightly tricky, since the dimensions don't quite match; we need to be careful there to note that the elementwise product here means first fattening, or "broadcasting", the $(n,1)$ vector into an $(n,d)$ matrix, by duplicating the column. But it turns out the operation we actually want to do will not require this.)

So we now have all the pieces. We know how to compute the gradients, mathematically speaking, and we have all of the various matrices containing these gradients. The next step is to compute the sum, and then the average, of the gradients in an efficient way.

Optimised calculations


So we want to calculate, for each weight vector, the average gradient (across all data points) for that weight vector. The average is of course just the sum of the gradients divided by a scalar. Consider now all the components that go into the full gradient of a given weight vector. We have the derivatives of the loss, of the activation function, and of the linear function, followed by more activations and linears in case of multiple layers. We want the sum of this for each data point; each data point has its corresponding loss derivative and activation derivative.

Starting from the loss, first do the element-wise multiplication of the loss derivatives with the activation derivatives for the final layer; these of course have the same shape. For example, in our case of sigmoid activation with logistic loss, the result will be a row vector containing $a^{[2]}_i-y_i$ for each element $i$.

To adjust the weights of the last layer, as we saw before with the simple network, the sum of the gradients is simply $(a^{[2]}-y){a^{[1]}}^T$ in our example, or more generally, the product of the previous gradients and the transpose of the input to the last layer. As you can check, the dimensions match, the loss and activation gradients being $(1,n)$ and the input transpose being $(n,2)$. In general, multiplying an $(a,b)$ matrix by a $(b,c)$ matrix produces a sum of each $a$-times-$c$ element; here, the $a$s are the loss-and-activation-derivatives and the $c$s are the inputs to the current layer. Mathematically, \[ \begin{eqnarray} & & \begin{bmatrix} a^{[2]}_1-y_1 & a^{[2]}_2-y_2 & \ldots & a^{[2]}_n-y_n \\ \end{bmatrix} \begin{bmatrix} \text{---} & a^{[1]}_1 & \text{---} \\ & \ldots & \\ \text{---} & a^{[1]}_n & \text{---} \\ \end{bmatrix}\\ & = & \begin{bmatrix} \sum_i (a^{[2]}_i-y_i) a^{[1]}_{i1} & \ldots & \sum_i (a^{[2]}_i-y_i) a^{[1]}_{iq} \\ \end{bmatrix}\\ \end{eqnarray} \] where $q$ is the dimensionality of each previous-layer activation, in our example 2. The math looks complicated, but I encourage you to ponder the loss-etc elements and the inputs until the sums computed here become clear. Now all we need to do is to divide this by $n$ and we have the average.

One thing to note with this calculation is that the $a^{[2]}-y$ represents the elementwise product of the loss gradient and the final-layer activation gradient. In the following we need to treat these two separately.

Now let's see how to proceed to the previous layer. In short, the last layer's weight adjustments are loss-etc times layer input; the layer before that is loss-etc times next-layer weights times layer input. How do we compute this? The loss-etc is $(1,n)$ as before; the weights of the final layer are $(1,q)$; and the layer input for the second-to-last layer is $(q,n)$ (in our example $q=2$).

So for a non-final layer, we want to compute basically the same thing, loss-etc times layer input - except we want $q$ copies of it, where each copy is multiplied by a different element of the next-layer weight. This is the "for loop" we saw before. We need to also consider the activation derivatives for the current layer, but since these are elementwise, there are no complicated matrix shapes or sums to consider. I'll do the derivation first without the activation derivatives and then add them in after, since the calculations are easier to understand that way.

We first compute an "outer product" of the loss-etc and the next-layer weights, to get \[ \begin{eqnarray} {w^{[2]}}^T (a^{[2]}-y) & = & \begin{bmatrix} \text{---} & w^{[2]}_1 (a^{[2]}-y) & \text{---} \\ & \ldots & \\ \text{---} & w^{[2]}_q (a^{[2]}-y) & \text{---} \\ \end{bmatrix} \end{eqnarray}. \] This is just the same $a-y$ multiplied by each of the $q$ elements of the final weights, in a $(q,n)$ matrix. We can then tie this together, along the $n$ axis, with the layer input of the first layer. The layer input has $n$ elements, and we want the elementwise sum as before, so the correct calculation is \[ \begin{eqnarray} ({w^{[2]}}^T (a^{[2]}-y)) X^T & = & \begin{bmatrix} \text{---} & w^{[2]}_1 (a^{[2]}-y) & \text{---} \\ & \ldots & \\ \text{---} & w^{[2]}_n (a^{[2]}-y) & \text{---} \\ \end{bmatrix} \begin{bmatrix} \text{---} & x^1 & \text{---} \\ & \ldots & \\ \text{---} & x^n & \text{---} \\ \end{bmatrix} \end{eqnarray}. \] As you can check, the dimensions match: $((2,1) \times (1,n)) \times (n,3) = (2,3)$, which is the dimensionality of the first-layer weights, as desired. Careful checking reveals that each row of the resulting matrix corresponds to one element of $w^{[2]}$ times $\sum_i (a_i-y_i) x_i^T$, which is what we wanted.

As was said before, we still need to consider the derivatives of the first-layer activation. These can be taken elementwise, and the result is another $(q,n)$ matrix. So to incorporate the activation derivatives, we take the elementwise product of ${w^{[2]}}^T (a^{[2]}-y)$ and those derivatives, call the result $M$: \[ \begin{eqnarray} M & = & ({w^{[2]}}^T (a^{[2]}-y)) \circ (\partial \sigma^{[1]}(z^{[1]}) / \partial z^{[1]}) \end{eqnarray}. \] We then multiply $M$ by $X^T$ to get the final answer for the weights, this time with the activation derivatives included.

What about layers deeper back than two? I won't go into the details there, but you should be able to convince yourself, with a bit of thinking and calculating, that applying this same method of loss-etc times next-layer weights, and that multiplied by current-layer input, will continue to work for as many layers as you need. In our example, think of a "layer zero" with, say, 4-dimensional inputs. The "loss-etc" is now ${w^{[2]}}^T (a^{[2]}-y)$, and the next-layer weights are $w^{[1]}$; the calculation is \[ M' = {w^{[1]}}^T ({w^{[2]}}^T (a^{[2]}-y)), \] and this can then be both multiplied by the layer-zero inputs to compute the layer-zero weight gradient and also passed further backward if need be. As you can check, in each case the sizes match and the calculation produces the desired result. (I've omitted the activation derivatives in the last equation for clarity, but these are straightforward to add, as above.)

Putting it all together


Let's summarise these findings into a final algorithm. This end result is of course familiar from Prof Ng's lecture notes and many other sources. The algorithm here is valid for a network with one-dimensional output, e.g. a binary classifier.
  1. Given the input and the network, compute the predictions $\hat{y}$ (forward propagation). This will be a $(1,n)$ row vector.
  2. Compute the loss $L(\hat{y}, y)$, another $(1,n)$ row vector.
  3. Compute the gradient of the loss, i.e. the elementwise derivatives, and store it in another $(1,n)$ vector; call this $dL$.
  4. Let $M := dL$.
  5. For each layer $i$, starting from the last:
    • Compute the elementwise derivatives of the activations $a^{[i]}$ produced by the current layer; call these $dA$. This will be the same shape as $M$ (e.g. $(1,n)$ for the last layer).
    • Compute the elementwise product of $M$ with $dA$, i.e. $C := M \circ dA$. This will of course be the same shape again.
    • Compute $dW := \frac{1}{n} C {a^{[i-1]}}^T$, where $a^{[i-1]}$ is the input to the current layer. The resulting $dW$ is the same shape as the current layer's weights, as you can check.
    • Adjust the weights according to $dW$, i.e. take a gradient descent step.
    • Update $M$ with $M := {w^{[i]}}^T C = {w^{[i]}}^T (M \circ dA)$.
    • Set $i := i-1$ and proceed to the previous layer.
  6. Repeat until convergence.
You will notice I've left the biases out of the algorithm. I recommend adding them yourself, as this is a good way to verify your understanding of both the mathematics and the practical computation.

If you made it this far, congratulations and I hope this post has been useful.