# Backpropagation
Backpropagation is the procedure at the center of deep learning: the method by which gradients of the network's complex, non-linear function are calculated for all parts of the network. It is often glossed over in explanations of deep learning, especially in the last few years. This is unfortunate, because some very important issues in deep learning, such as the vanishing gradient problem, require a full understanding of backpropagation, while many other techniques and issues require at least a partial understanding, including Xavier initialization, training memory consumption, and loss scaling for mixed precision training.
I made four diagrams which explain both the high-level and the low-level aspects of backpropagation using simple, consistent notation and coloring schemes. By the end, describing the vanishing gradient problem precisely, with actual mathematics rather than the usual handwaving, will be an easy exercise for anyone familiar with the notation, rather than a serious chore only to be taken on by specialists.
First, forward propagation for a single node. A node simply computes a weighted sum of its scalar inputs. That linear value is passed through a scalar activation function, and the activated value is then used as input to the nodes of the next layer, again through weighted sums.
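To make this concrete, here is a minimal sketch in NumPy. The names `sigma` and `node_forward` are mine, and the logistic sigmoid is just one possible choice of activation:

```python
import numpy as np

def sigma(c):
    """A scalar activation function; the logistic sigmoid is one possible choice."""
    return 1.0 / (1.0 + np.exp(-c))

def node_forward(inputs, weights):
    """Forward pass for a single node: a weighted sum of the inputs,
    followed by a scalar activation."""
    c = np.dot(weights, inputs)  # the linear part of the node
    a = sigma(c)                 # the activated (non-linear) value
    return c, a                  # both are reused during backpropagation
```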
Next, backpropagation for a single node. Backpropagation is a recursive method in which you start at the output and compute the error backwards. The key to backpropagation is that you assume you have already computed the error of the next layer, in particular of the linear part of the next layer. You then weight those errors to find out how much changing the value of this node's activated (non-linear) output would have changed the cost. Finally, you multiply by the derivative of the activation function \(\sigma^\prime\) in order to get the error of the linear part of the neuron, \(\frac{\partial L}{\partial c}\).
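In code, the single-node backward step might look like the sketch below. It reuses the `sigma` convention from the previous snippet and assumes `next_errors` already holds \(\frac{\partial L}{\partial c}\) for every node in the next layer:

```python
def sigma_prime(c):
    """Derivative of the sigmoid, evaluated at the linear value c."""
    s = 1.0 / (1.0 + np.exp(-c))
    return s * (1.0 - s)

def node_backward(c, outgoing_weights, next_errors):
    """Backward pass for a single node.

    next_errors are the already-computed errors dL/dc of the linear parts
    of the nodes in the next layer; outgoing_weights connect this node's
    activation to those nodes."""
    dL_da = np.dot(outgoing_weights, next_errors)  # error of the activated value
    dL_dc = sigma_prime(c) * dL_da                 # error of the linear part
    return dL_dc
```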
So, now that the individual nodes are clear, let's look at the whole system. To start the recurrence relation of backpropagation, you need a base case. This means you need an output node whose value is the cost. In a supervised setting, the cost is computed by directly comparing the output of the network to some gold standard output which is considered correct. One common comparison is the squared error, \(L = \sum_i (O_i - E_i)^2\), where \(O_i\) is the network's output and \(E_i\) is the expected output.
After computing the cost, you can use the single node backpropagation method to recursively compute the error of the previous layers.
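Putting the base case and the recursion together for a small fully connected network might look like the following sketch. It reuses `sigma_prime` from above, assumes the squared error cost, and assumes `weights[k]` has shape `(nodes in layer k-1, nodes in layer k)`:

```python
def backward_pass(output, expected, linears, weights):
    """Compute dL/dc for the linear part of every layer.

    output is the network's final activation O, expected is the gold
    standard E, and linears[k] holds the linear values of layer k saved
    during the forward pass."""
    # Base case: differentiate the squared error cost, dL/dO_i = 2 (O_i - E_i).
    dL_dA = 2.0 * (output - expected)
    errors = [sigma_prime(linears[-1]) * dL_dA]
    # Recursion: apply the single-node backward step one layer at a time.
    for k in range(len(weights) - 1, 0, -1):
        dL_dA = weights[k] @ errors[0]  # weight the next layer's errors
        errors.insert(0, sigma_prime(linears[k - 1]) * dL_dA)
    return errors
```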
The real goal of backpropagation is the gradient of the cost with respect to the weights, not the layers, since it is the weights that will be updated. But once you have computed the error of each layer, computing the gradients of the weights is simple: for the weight \(W^k_{ij}\) connecting the activation \(A^{k-1}_i\) of node \(i\) in layer \(k-1\) to node \(j\) in layer \(k\),
\[\frac{\partial L}{\partial W^k_{ij}} = A^{k-1}_i \, \frac{\partial L}{\partial c^k_j},\]
where \(\frac{\partial L}{\partial c^k_j}\) is the error of the linear part of node \(j\) in layer \(k\), computed by the single-node backpropagation step above.
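In vectorized form, the gradient for a whole weight matrix is the outer product of the previous layer's activation with the current layer's error. A sketch, assuming you also saved every layer's activation during the forward pass and that `errors` comes from the `backward_pass` sketch above:

```python
def weight_gradients(inputs, activations, errors):
    """dL/dW for every weight matrix.

    activations[k] is the activated output of layer k; errors[k] is
    dL/dc for layer k.  The gradient for W^k is the outer product of
    A^{k-1} with layer k's error."""
    prev = [inputs] + list(activations[:-1])  # A^{k-1} for each layer k
    return [np.outer(prev[k], errors[k]) for k in range(len(errors))]

def gradient_step(weights, grads, learning_rate=0.01):
    """Update each weight matrix a small step against its gradient."""
    return [W - learning_rate * G for W, G in zip(weights, grads)]
```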