Alternatives to the softmax layer
Table of Contents
Alternatives to the softmax layer
This weeks posts deals with some possible alternatives to the softmax layer when calculating probabilities for words over large vocabularies.
Natural language tasks as neural machine translation or dialogue generation rely on word embeddings at the input and output layer.
Further for decent performances a very large vocabulary is needed to reduce the number of out of vocabulary words that cannot be properly embedded and therefore not processed.
The natural language models used for these task usually come with a final softmax layer to compute the probabilities over the words in the vocabulary.
Now huge vocabularies make this final computation extremely expensive and therefore slows down the whole training process.
This makes it necessary to look for more efficient alternatives to the softmax layer.
softmax, hierarchical softmax, differentiated softmax, importance sampling, noise contrastive estimation, negative sampling
Before we start, let us repeat the general procedure for generating a translation of a sentence or a response to an utterance in a dialogue.
Depending on some context, which are most often the previous words in a sequence the next word is selected according to some normalized probability distribution over all words in the vocabulary.
The actual computation is performed by a softmax layer which computes an activation for each word in the vocabulary by defining a matrix of the size of the vocabulary times dimension of the output of the previous layer.
Since vocabulary sizes from 10K to 1B are becoming standard the final softmax layer is way bigger than every other layer in our network architecture.
Furthermore the normalization of the activation forces us to sum over all words in the vocabulary.
Apart from the obvious expensive summation over a large vocabulary there are further problems that arise indirectly and affect the overall training time.
Since we deal with large vocabularies we also need large training data sets in order to have sufficient training examples for infrequent words.
On top of that larger training sets usually allow and make it necessary to train larger models.
Hence the training time per epoch usually increases and the learning performance is reduced due two deeper network architectures.
Alternatives for the softmax layer can in principle distinguished into two classes.
First we have a look at modifications of the softmax layer itself before we explain sampling approaches which allow to completely get rid of the softmax layer.
One alternative, known as hierarchical softmax, modifies the network architecture by organizing the vocabulary into a tree.
Every word is now reachable by a unique path and the probability of a word is the product of probabilities along the path.
Typically one uses balanced binary trees which have a depth equal to the logarithm of the vocabulary and hence reduce the computation time.
Another method closely related to information encoding makes use of frequency clustering.
More frequent words have shorter paths and one can achieve a further speed up compared to binary trees.
During test time where we do not have the target words we have to rely on the standard softmax layer.
The idea behind differentiated softmax is to group words in the vocabulary and assign a fixed number of parameters to each.
Formulated differently, we split the final linear transformation into block matrices and hence only take a fixed number of parameters into consideration when computing the activations.
Usually one allows more parameters for more frequent words which on the other hand makes it even more difficult to learn proper word embeddings for infrequent words.
This method can obviously also applied during test time.
We now come to sampling approaches which are based on approximating the loss function and in particular the sum over the whole vocabulary.
Taking a closer look at the gradient of the loss function clarifies how sampling can be applied.
The gradient consists of a positive reinforcement term for the target word and a negative reinforcement term over all other words in the vocabulary weighted by their probabilities.
It is the expectation of the gradient of the activations of all words.
A first approach, called importance sampling, to this term is heavily applying Monte Carlo methods.
We approximate the expectation value by sampling from the networks probability distribution.
Since this is exactly what we wanted to avoid, we introduce a proposal distribution which should be close to the network distribution and easier to sample from.
In practice one chooses the unigram distribution of the training set.
This allows for easy sampling, however we still have to rely on the normalized network distribution when calculating the expectations.
Fortunately, we can work with a biased estimator by approximating the sum over the vocabulary in the denominator with the same samples drawn from the proposal distribution.
The quality and performance of importance sampling heavily relies on the number of samples drawn from the proposal distribution.
Two other methods rely on introducing a binary classification task based on an auxiliary loss function which indirectly maximizes the probability over correct words.
First we list noise contrastive estimation, before we go one step further and introduce and approximation of it.
By generating a certain number of noise samples from a distribution, e.g. the unigram distribution of the training set, we try to guess the label of a number of sampled words.
For this we introduce a mixture distribution consisting of the network distribution and the noise distribution.
Words from the noise distribution are labeled false and the actual target words are labeled true.
The sum in the denominator over the vocabulary was found to be roughly one with a low variance when it was chosen as a free parameter that the network should learn.
Hence it is reasonable to set it to one.
Negative sampling is based on noise contrastive estimation and is in particular an approximation of it.
By setting the number of noise samples equal to the size of the vocabulary and the noise distribution equal to the uniform distribution, we can reduce the computational steps further.
Since this approximation no longer optimizes for the correct words this method is not used for language modeling.
However for calculating word embeddings it was found to be useful.
All methods allow for speed ups during training time.
Compared to the standard softmax computational time is reduced by a factors between 2 and 100.
However most of these methods only apply to training and only differentiated softmax with a speed up of a factor of 2 is also applicable to testing and inference.
The performance of each method also heavily varies with the size of the vocabulary and certain methods perform worse for infrequent words.
Since beam search, which is nowadays standard during inference and might also be important for performance reason during training time adjustments by the above mentioned methods seems very promising.