Meet Musio

## Backpropagation through time

Goal
Today’s summary will a give insight into the machinery behind optimization, namely the backpropagation algorithm, in any kind of neural network, whether it is a standard feed forward, convolutional or recurrent one.

Motivation
In order to adjust the weights of layers in neural networks in a way that the model shows learning behavior, we have to determine how the individual weights influence the final output.

Ingredients

Steps
We start be providing the objects that we have to handle for the above defined task of adjusting the weights in a meaningful way.
In general, we are interested in the behavior of some function depending on the output of the last layer.
Instead of just looking directly at the output, one usually defines a loss function that specifies the error that the model is currently making for some task.
Then the objective would be to minimize the error and correspondingly the loss function by changing the network accordingly.
Now in order to find the direction in which we should push the weights of each layer, we start calculating gradients with respect to the loss function.
The first step is simply to provide the gradient of the loss function with respect to the last output.
We are talking of gradients, since the input to our function, will be a vector in most cases.
This is nothing more than generalizing the one-dimensional partial derivative to multiple dimensions.
Since our neural networks possess a certain depth in layers, we also have to calculate the gradients of the loss function with respect to the weights of first layer.
The mathematical tool to accomplish this in a convenient way is called the chain rule.

The intuition for the backpropagation algorithm can be provided by considering a computation which involves several steps, say first a addition of two numbers and then a multiplication by a third one.
At every step of the computation we are able to compute locally the outputs and the gradients of the outputs with respect to the input.
When we reach the final layer, we can hence start to back propagate the gradients and obtain a each step and understanding of how to change the input in order to increase or decrease the final outcome of our computation.

For a neural network the computational steps involve matrix multiplication and activation functions which introduce non-linearity.
Since gradient calculations can be rather tricky one usually relies on staged computation by explicitly computing the gradient for every step.
In particular some activation functions, like the sigmoid, are easy to handle since their gradients are not difficult to calculate.
Even more simple is the ReLu whose gradient is either one or zero.
In recent years more and more libraries provide the gradient calculations for us.
Theano is one framework that allows to build the gradients symbolically along the computational graph.

Let us now go a little bit more into detail of the actual algorithm.
The first task is to compute the activations for every layer by providing some input.
Next we compute the output error, meaning the gradient of the loss function with respect to the output of the last layer.
Then we are ready to back propagate this error the previous layer by multiplication with the computed activations of the feed forward step.
Finally, we are interested in the gradients of the loss function with respect to the weights, since we are going to adjust those and not the outputs.
But this is just another simple multiplication by the input to the layer we are considering.
Despite that chain rule is known quite easy to understand and is known for a long time, the backpropagation algorithm is relatively new.
An alternative way to calculate the gradients directly is to vary the input of every node of a layer by a tiny amount and calculate the effect on the loss function.
However, in practice this method requires a an enormous amount of computational steps since we should do this for every node in our network.

For the final part, we take a look at backpropagation in convolutional and recurrent networks.
Convolutional networks are not that different from standard ones and so one might guess that there is only little work to be done to adjust the algorithm.
Indeed, the only change it needs is taking convolutions of errors and previous outputs instead of matrix multiplication to back propagate the error.
Recurrent neural networks involve a time component, since the input is fed as a sequence.
Hence, in the hidden layers we have to additionally propagate the errors through time.
That’s were the naming for the algorithm in recurrent networks comes from.
The reason for this is that the weights are shared within the layer and the best intuition for that can be gathered by rolling out the network in time.
Depending on the length of the input sequence, the steps to compute the gradients of the loss function with respect to the first outputs can become large.
This gives rise to the vanishing gradient problem, since in every step the value of gradient of the activation function is usually between one and zero.