{"id":1274,"date":"2016-04-10T23:51:43","date_gmt":"2016-04-10T23:51:43","guid":{"rendered":"http:\/\/blog.themusio.com\/?p=1274"},"modified":"2024-05-01T11:06:33","modified_gmt":"2024-05-01T02:06:33","slug":"backpropagation-through-time","status":"publish","type":"post","link":"https:\/\/blog.themusio.com\/?p=1274","title":{"rendered":"Backpropagation through time"},"content":{"rendered":"<p><strong>Goal<\/strong><br \/>\nToday&#8217;s summary will a give insight into the machinery behind optimization, namely the backpropagation algorithm, in any kind of neural network, whether it is a standard feed forward, convolutional or recurrent one.<\/p>\n<p><strong>Motivation<\/strong><br \/>\nIn order to adjust the weights of layers in neural networks in a way that the model shows learning behavior, we have to determine how the individual weights influence the final output.<\/p>\n<p><strong>Ingredients<\/strong><br \/>\nchain rule, differentiation, gradient<\/p>\n<p><strong>Steps<\/strong><br \/>\nWe start be providing the objects that we have to handle for the above defined task of adjusting the weights in a meaningful way.<br \/>\nIn general, we are interested in the behavior of some function depending on the output of the last layer.<br \/>\nInstead of just looking directly at the output, one usually defines a loss function that specifies the error that the model is currently making for some task.<br \/>\nThen the objective would be to minimize the error and correspondingly the loss function by changing the network accordingly.<br \/>\nNow in order to find the direction in which we should push the weights of each layer, we start calculating gradients with respect to the loss function.<br \/>\nThe first step is simply to provide the gradient of the loss function with respect to the last output.<br \/>\nWe are talking of gradients, since the input to our function, will be a vector in most cases.<br \/>\nThis is nothing more than generalizing the one-dimensional partial derivative to multiple dimensions.<br \/>\nSince our neural networks possess a certain depth in layers, we also have to calculate the gradients of the loss function with respect to the weights of first layer.<br \/>\nThe mathematical tool to accomplish this in a convenient way is called the chain rule.<\/p>\n<p>The intuition for the backpropagation algorithm can be provided by considering a computation which involves several steps, say first a addition of two numbers and then a multiplication by a third one.<br \/>\nAt every step of the computation we are able to compute locally the outputs and the gradients of the outputs with respect to the input.<br \/>\nWhen we reach the final layer, we can hence start to back propagate the gradients and obtain a each step and understanding of how to change the input in order to increase or decrease the final outcome of our computation.<\/p>\n<p>For a neural network the computational steps involve matrix multiplication and activation functions which introduce non-linearity.<br \/>\nSince gradient calculations can be rather tricky one usually relies on staged computation by explicitly computing the gradient for every step.<br \/>\nIn particular some activation functions, like the sigmoid, are easy to handle since their gradients are not difficult to calculate.<br \/>\nEven more simple is the ReLu whose gradient is either one or zero.<br \/>\nIn recent years more and more libraries provide the gradient calculations for us.<br \/>\nTheano is one framework that allows to build the gradients 
symbolically along the computational graph.

Let us now go a little more into the details of the actual algorithm. The first task is to compute the activations of every layer for some given input. Next we compute the output error, meaning the gradient of the loss function with respect to the output of the last layer. Then we are ready to backpropagate this error to the previous layers: at each layer it is multiplied by the transposed weight matrix and by the derivative of the activation function, evaluated at the values computed during the feed-forward step. Finally, we are interested in the gradients of the loss function with respect to the weights, since those are what we adjust, not the outputs; this is just another simple multiplication, namely by the input to the layer we are considering. Although the chain rule is quite easy to understand and has been known for a long time, the backpropagation algorithm is relatively new. An alternative way to calculate the gradients directly is to vary the input of every node of a layer by a tiny amount and measure the effect on the loss function. In practice, however, this method requires an enormous number of computational steps, since it would have to be repeated for every node in the network.

For the final part, we take a look at backpropagation in convolutional and recurrent networks. Convolutional networks are not that different from standard ones, so one might guess that only little work is needed to adjust the algorithm. Indeed, the only change needed is to backpropagate the error using convolutions of the errors and the previous outputs instead of matrix multiplications. Recurrent neural networks involve a time component, since the input is fed in as a sequence. Hence, in the hidden layers we additionally have to propagate the errors through time; that is where the name of the algorithm for recurrent networks comes from. The reason is that the weights are shared within the layer, and the best intuition for this can be gained by unrolling the network in time. Depending on the length of the input sequence, the number of steps needed to compute the gradients of the loss function with respect to the first outputs can become large. This gives rise to the vanishing gradient problem: at every step the value of the gradient of the activation function is usually between zero and one, so calculating gradients through several activations quickly leads to a vanishing gradient. In practice, one truncates the calculation to a few steps. Other remedies involve a proper initialization of the weights, some kind of regularization, or introducing LSTM or GRU units. Sometimes one also takes care of exploding gradients by clipping the values if they exceed a defined threshold.
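To make the unrolling concrete, here is a minimal sketch of backpropagation through time for a vanilla recurrent layer with a tanh activation and a squared-error loss on the final output; the sizes, weight names, and choice of loss are illustrative assumptions, not taken from the post.

```python
import numpy as np

np.random.seed(0)
T, n_in, n_hid = 5, 3, 4                      # sequence length and layer sizes (assumed)
Wxh = 0.1 * np.random.randn(n_hid, n_in)      # input-to-hidden weights
Whh = 0.1 * np.random.randn(n_hid, n_hid)     # hidden-to-hidden weights, shared across time
Why = 0.1 * np.random.randn(1, n_hid)         # hidden-to-output weights

xs = [np.random.randn(n_in, 1) for _ in range(T)]  # toy input sequence
target = np.array([[1.0]])                          # toy target for the last output

# Forward pass: roll the network out in time, reusing the same weights.
hs = {-1: np.zeros((n_hid, 1))}
for t in range(T):
    hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1])
y = Why @ hs[T - 1]
loss = 0.5 * ((y - target) ** 2).item()

# Backward pass: propagate the output error back through every time step.
dWxh, dWhh = np.zeros_like(Wxh), np.zeros_like(Whh)
dy = y - target                     # gradient of the loss w.r.t. the output
dWhy = dy @ hs[T - 1].T             # gradient w.r.t. the output weights
dh = Why.T @ dy                     # error arriving at the last hidden state
for t in reversed(range(T)):
    dz = (1.0 - hs[t] ** 2) * dh    # through the tanh non-linearity
    dWxh += dz @ xs[t].T            # shared weights: gradients accumulate over time steps
    dWhh += dz @ hs[t - 1].T
    dh = Whh.T @ dz                 # pass the error one step further back in time

# A truncated version would simply stop this loop after a fixed number of
# steps instead of going all the way back to t = 0.
print(loss, np.linalg.norm(dWhh))
```

Because the tanh factor (1 - h_t**2) is at most one, the error dh tends to shrink as it is passed back through the steps, which is exactly the vanishing gradient effect described above.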
Resources
"CS231n Convolutional Neural Networks for Visual Recognition" (web), http://cs231n.github.io/optimization-2/. Accessed 11 April 2016.
"Calculus on Computational Graphs: Backpropagation" (web), http://colah.github.io/posts/2015-08-Backprop/. 31 August 2015. Accessed 11 April 2016.
"How the backpropagation algorithm works" (web), http://neuralnetworksanddeeplearning.com/chap2.html. 22 January 2016. Accessed 11 April 2016.
"Recurrent Neural Networks Tutorial, Part 3 – Backpropagation Through Time and Vanishing Gradients" (web), http://www.wildml.com/2015/10/recurrent-neural-networks-tutorial-part-3-backpropagation-through-time-and-vanishing-gradients. 8 October 2015. Accessed 11 April 2016.