{"id":1675,"date":"2016-08-26T18:53:17","date_gmt":"2016-08-26T09:53:17","guid":{"rendered":"http:\/\/blog.themusio.com\/?p=1675"},"modified":"2024-05-01T10:56:34","modified_gmt":"2024-05-01T01:56:34","slug":"alternatives-to-the-softmax-layer","status":"publish","type":"post","link":"https:\/\/blog.themusio.com\/?p=1675","title":{"rendered":"Alternatives to the softmax layer"},"content":{"rendered":"<div id=\"table-of-contents\">\n<h2>Table of Contents<\/h2>\n<div id=\"text-table-of-contents\">\n<ul>\n<li><a href=\"#org9eeba93\">1. Alternatives to the softmax layer&#xa0;&#xa0;&#xa0;<span class=\"tag\"><span class=\"softmax\">softmax<\/span><\/span><\/a>\n<ul>\n<li><a href=\"#org96f3239\">1.1. goal<\/a><\/li>\n<li><a href=\"#org8f6d0d3\">1.2. motivation<\/a><\/li>\n<li><a href=\"#org13f0828\">1.3. ingredients<\/a><\/li>\n<li><a href=\"#orge9bb5a2\">1.4. steps<\/a><\/li>\n<li><a href=\"#orgb798bd9\">1.5. outlook<\/a><\/li>\n<li><a href=\"#orgbd4c972\">1.6. resources<\/a><\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<\/div>\n<\/div>\n<h1>Alternatives to the softmax layer     <a id=\"org9eeba93\"><\/a><\/h1>\n<h2>goal<a id=\"org96f3239\"><\/a><\/h2>\n<p>This weeks posts deals with some possible alternatives to the softmax layer when calculating probabilities for words over large vocabularies.<\/p>\n<h2>motivation<a id=\"org8f6d0d3\"><\/a><\/h2>\n<p>Natural language tasks as neural machine translation or dialogue generation rely on word embeddings at the input and output layer.<br \/>\nFurther for decent performances a very large vocabulary is needed to reduce the number of out of vocabulary words that cannot be properly embedded and therefore not processed.<br \/>\nThe natural language models used for these task usually come with a final softmax layer to compute the probabilities over the words in the vocabulary.<br \/>\nNow huge vocabularies make this final computation extremely expensive and therefore slows down the whole training process.<br \/>\nThis makes it necessary to look for more efficient alternatives to the softmax layer.<\/p>\n<h2>ingredients<a id=\"org13f0828\"><\/a><\/h2>\n<p>softmax, hierarchical softmax, differentiated softmax, importance sampling, noise contrastive estimation, negative sampling<\/p>\n<h2>steps<a id=\"orge9bb5a2\"><\/a><\/h2>\n<p>Before we start, let us repeat the general procedure for generating a translation of a sentence or a response to an utterance in a dialogue.<br \/>\nDepending on some context, which are most often the previous words in a sequence the next word is selected according to some normalized probability distribution over all words in the vocabulary.<br \/>\nThe actual computation is performed by a softmax layer which computes an activation for each word in the vocabulary by defining a matrix of the size of the vocabulary times dimension of the output of the previous layer.<br \/>\nSince vocabulary sizes from 10K to 1B are becoming standard the final softmax layer is way bigger than every other layer in our network architecture.<br \/>\nFurthermore the normalization of the activation forces us to sum over all words in the vocabulary.<\/p>\n<p>Apart from the obvious expensive summation over a large vocabulary there are further problems that arise indirectly and affect the overall training time.<br \/>\nSince we deal with large vocabularies we also need large training data sets in order to have sufficient training examples for infrequent words.<br \/>\nOn top of that larger training sets usually allow and make it necessary to train larger models.<br 
Alternatives to the softmax layer can in principle be divided into two classes.
First we have a look at modifications of the softmax layer itself, before we explain sampling approaches, which allow us to get rid of the softmax layer completely.

One alternative, known as hierarchical softmax, modifies the network architecture by organizing the vocabulary into a tree.
Every word is now reachable by a unique path, and the probability of a word is the product of the probabilities along that path.
Typically one uses balanced binary trees, which have a depth equal to the logarithm of the vocabulary size and hence reduce the computation time.
Another method, closely related to information encoding, makes use of frequency clustering: more frequent words get shorter paths, which yields a further speed-up compared to balanced binary trees.
At test time, where we do not have the target words, we have to rely on the standard softmax layer.
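Below is a minimal, self-contained sketch of a balanced-binary-tree hierarchical softmax; the heap-style tree layout, the padding of the vocabulary to a power of two, and all sizes and names are illustrative assumptions rather than the construction used in the referenced papers. The point it illustrates is that the probability of a word becomes a product of roughly log2(V) binary decisions instead of one normalization over V words.

```python
# Hedged sketch of hierarchical softmax over a balanced binary tree.
import numpy as np

V, d = 100_000, 512
depth = int(np.ceil(np.log2(V)))            # path length ~ log2(V), here 17
n_inner = 2 ** depth - 1                    # inner nodes of a perfect binary tree
rng = np.random.default_rng(0)
node_vecs = rng.normal(scale=0.01, size=(n_inner, d))   # one parameter vector per inner node

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def word_log_prob(word_id, h):
    """log p(word | h) as a product of binary decisions along the word's path."""
    log_p, node = 0.0, 0                    # start at the root (inner node 0)
    for bit in format(word_id, f"0{depth}b"):          # the word's path as its binary code
        p_right = sigmoid(node_vecs[node] @ h)
        log_p += np.log(p_right if bit == "1" else 1.0 - p_right)
        node = 2 * node + 1 + int(bit)      # descend to the chosen child
    return log_p
    # Note: padding the leaves to 2**depth wastes a little probability mass;
    # a real implementation would prune or rebalance the tree.

h = rng.normal(size=d)
print(word_log_prob(42, h))                 # only ~17 dot products instead of 100,000
```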
The idea behind differentiated softmax is to group the words in the vocabulary and assign a fixed number of parameters to each group.
Formulated differently, we split the final linear transformation into block matrices and hence only take a fixed number of parameters into consideration when computing the activations.
Usually one allows more parameters for frequent words, which on the other hand makes it even more difficult to learn proper embeddings for infrequent words.
This method can obviously also be applied at test time.

We now come to the sampling approaches, which are based on approximating the loss function, and in particular the sum over the whole vocabulary.
Taking a closer look at the gradient of the loss function clarifies how sampling can be applied.
The gradient consists of a positive reinforcement term for the target word and a negative reinforcement term over all other words in the vocabulary, weighted by their probabilities.
The negative term is the expectation of the gradient of the activations under the model distribution.

A first approach to this term, called importance sampling, applies Monte Carlo methods.
We approximate the expectation value by sampling from the network's probability distribution.
Since this is exactly what we wanted to avoid, we introduce a proposal distribution, which should be close to the network distribution and easier to sample from.
In practice one chooses the unigram distribution of the training set.
This allows for easy sampling; however, we would still have to rely on the normalized network distribution when calculating the expectation.
Fortunately, we can work with a biased estimator by approximating the sum over the vocabulary in the denominator with the same samples drawn from the proposal distribution.
The quality and performance of importance sampling depend heavily on the number of samples drawn from the proposal distribution.

Two other methods rely on introducing a binary classification task based on an auxiliary loss function which indirectly maximizes the probability of the correct words.
First we present noise contrastive estimation, before we go one step further and introduce an approximation of it.
By generating a certain number of noise samples from a distribution, e.g. the unigram distribution of the training set, we try to guess the labels of the sampled words.
For this we introduce a mixture distribution consisting of the network distribution and the noise distribution.
Words from the noise distribution are labeled false, and the actual target words are labeled true.
The sum in the denominator over the vocabulary was found to be roughly one, with low variance, when it was treated as a free parameter that the network should learn.
Hence it is reasonable to simply set it to one.

Negative sampling is based on noise contrastive estimation and is in particular an approximation of it.
By setting the number of noise samples equal to the size of the vocabulary and the noise distribution equal to the uniform distribution, we can reduce the computational steps further.
Since this approximation no longer optimizes for the correct words, it is not used for language modeling.
For learning word embeddings, however, it was found to be useful.
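To make the sampling ideas concrete, here are two small NumPy sketches. Both use illustrative sizes, variable names, and a random stand-in for the unigram distribution; they are assumptions for illustration, not the implementations from the referenced papers. The first shows the biased importance-sampling estimator: the model's distribution over the vocabulary is approximated from a handful of words drawn from the proposal distribution, normalizing by the sample sum instead of the full partition function.

```python
# Hedged sketch of the biased (self-normalized) importance-sampling estimator.
import numpy as np

V, d, m = 100_000, 512, 100                 # vocab size, hidden dim, number of samples
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(V, d))     # output embeddings
unigram = rng.random(V); unigram /= unigram.sum()    # stand-in for corpus unigram frequencies q

def sampled_weights(h):
    """Approximate the model's probabilities using m samples from q instead of all V words."""
    idx = rng.choice(V, size=m, p=unigram)
    w = np.exp(W[idx] @ h) / unigram[idx]   # importance weights exp(score) / q(word)
    return idx, w / w.sum()                 # normalizing by the sample sum gives the biased estimator

h = rng.normal(size=d)
idx, p_hat = sampled_weights(h)
print(p_hat.sum())                          # 1.0, using only m of the V activations
```

The second sketches the noise contrastive estimation loss for one target word, with the partition function fixed to one as discussed above; a comment notes the simplification that turns it into negative sampling.

```python
# Hedged sketch of a noise contrastive estimation (NCE) loss for one target word.
import numpy as np

V, d, k = 100_000, 512, 25                  # vocab size, hidden dim, noise samples per target
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(V, d))
unigram = rng.random(V); unigram /= unigram.sum()    # stand-in noise distribution q

def nce_loss(target, h):
    """Binary-classification loss: the target vs. k words drawn from the noise distribution."""
    noise = rng.choice(V, size=k, p=unigram)
    def p_true(word):                       # P(label = true | word); partition function assumed to be 1
        p_model = np.exp(W[word] @ h)       # unnormalized score treated as a probability
        return p_model / (p_model + k * unigram[word])
    loss = -np.log(p_true(target))                   # target should be labeled true ...
    loss -= np.log(1.0 - p_true(noise)).sum()        # ... noise samples false
    return loss

# Negative sampling: with k = V noise samples and a uniform noise distribution,
# k * q(word) = 1, so p_true reduces to sigmoid(W[word] @ h) -- which no longer
# optimizes a language-model likelihood but works well for word embeddings.
h = rng.normal(size=d)
print(nce_loss(target=17, h=h))
```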
All of these methods allow for speed-ups during training.
Compared to the standard softmax, computation time is reduced by factors between 2 and 100.
However, most of the methods only apply to training; only differentiated softmax, with a speed-up of about a factor of 2, is also applicable to testing and inference.
The performance of each method also varies strongly with the size of the vocabulary, and certain methods perform worse for infrequent words.

outlook

Since beam search is nowadays standard during inference and might also be important for performance reasons during training, adapting the methods mentioned above to work with it seems very promising.

resources

http://aclweb.org/anthology/P/P16/P16-1186.pdf
http://sebastianruder.com/word-embeddings-softmax/
http://arxiv.org/pdf/1410.8251v1.pdf