In this short summary we take a look at the beam search algorithm, which is used in NLP to improve the generation of sequences of words or characters.
Most recurrent neural networks are trained to predict the most probable next output given the history of the input sequence.
Picking that output greedily at every step, however, does not in general yield the most probable sequence overall.
Keywords: RNN, decoder, greedy search, conditioning
Decoder architectures in the form of recurrent neural networks, LSTMs, or GRUs, as used for generating sequences of words or characters, are trained to predict the next token. During training, the network sees a certain input sequence and should learn to predict the next word or character based on the previously seen inputs. To simplify learning, at every training step the network is fed the true token of the sequence, so its prediction is conditioned on the ground-truth history; this form of conditioning is known as teacher forcing.
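The training-time conditioning described above can be sketched with a toy bigram table standing in for the network's next-token distribution. All tokens and probabilities here are made up for illustration; a real setup would backpropagate through the model instead of just computing a loss.

```python
import math

# Toy "model": a fixed bigram table standing in for an RNN's next-token
# distribution. Purely illustrative, not a real training setup.
BIGRAM = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the":   {"cat": 0.5, "dog": 0.5},
    "a":     {"cat": 0.5, "dog": 0.5},
    "cat":   {"<eos>": 1.0},
    "dog":   {"<eos>": 1.0},
}

def teacher_forced_loss(target_tokens):
    """Average cross-entropy when the TRUE previous token is fed at each step."""
    loss, prev = 0.0, "<bos>"
    for true_token in target_tokens:
        probs = BIGRAM[prev]                 # model's next-token distribution
        loss -= math.log(probs[true_token])  # penalize low prob of the true token
        prev = true_token                    # teacher forcing: feed the gold token
    return loss / len(target_tokens)

print(round(teacher_forced_loss(["the", "cat", "<eos>"]), 4))  # 0.4013
```

Note that the model's own (possibly wrong) predictions never enter the loop: the input at each step is always the ground-truth token.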
The alternative is to rely on the actual output of the network and feed it back as the input for the next time step. Ultimately, this is how we want to generate arbitrary responses to some utterance, so at test time we provide the previous output as the next input. This decoding scheme is called greedy, since at every step only the maximum-likelihood output is kept. In general it does not produce the most probable sequence, since that sequence need not start with the most probable output. The network also tends to accumulate the errors it made on previous outputs and can quickly lose confidence.
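Greedy decoding can be sketched as follows, again with a made-up next-token table in place of a trained network. The table is deliberately constructed so that the greedy choice is suboptimal:

```python
# Greedy decoding sketch: at each step keep only the argmax token and feed
# it back as the next input. Toy probabilities, purely illustrative.
NEXT = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the":   {"dog": 0.55, "cat": 0.45},
    "a":     {"cat": 1.0},
    "dog":   {"<eos>": 1.0},
    "cat":   {"<eos>": 1.0},
}

def greedy_decode(max_len=10):
    seq, token = [], "<bos>"
    for _ in range(max_len):
        probs = NEXT[token]
        token = max(probs, key=probs.get)  # keep only the most likely token
        if token == "<eos>":
            break
        seq.append(token)                  # feed the prediction back in
    return seq

print(greedy_decode())  # ['the', 'dog']
```

Here greedy decoding returns "the dog" with probability 0.6 * 0.55 = 0.33, even though "a cat" has the higher overall probability 0.4 * 1.0 = 0.4, because the better sequence does not start with the most probable first token.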
There is another algorithm, called beam search, that departs from the greedy approach. It is a form of breadth-first search, and for architectural reasons most libraries struggle to provide a simple implementation. Instead of relying only on the maximum-likelihood output at every step, we keep a certain number of candidate outputs around, also called the beam. We then advance the RNN for each of these candidates and score the resulting sequences by their overall probability. In this way we always maintain a beam of the most probable sequences.
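A minimal beam search over the same kind of toy next-token table might look like this (the table and its probabilities are invented for illustration; a real decoder would query the RNN's softmax at each step):

```python
import math

# Beam search sketch over a toy next-token table standing in for an
# RNN's per-step output distribution.
NEXT = {
    "<bos>": {"the": 0.6, "a": 0.4},
    "the":   {"dog": 0.55, "cat": 0.45},
    "a":     {"cat": 1.0},
    "dog":   {"<eos>": 1.0},
    "cat":   {"<eos>": 1.0},
}

def beam_search(beam_width=2, max_len=10):
    beams = [(0.0, ["<bos>"])]  # hypotheses as (log probability, tokens)
    completed = []
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            for token, p in NEXT[seq[-1]].items():
                hyp = (logp + math.log(p), seq + [token])
                (completed if token == "<eos>" else candidates).append(hyp)
        if not candidates:
            break
        # Keep only the beam_width best-scoring partial sequences.
        beams = sorted(candidates, reverse=True)[:beam_width]
    best = max(completed) if completed else max(beams)
    return [t for t in best[1] if t not in ("<bos>", "<eos>")]

print(beam_search(beam_width=1))  # ['the', 'dog']
print(beam_search(beam_width=2))  # ['a', 'cat']
```

With a beam width of 1 this reduces to greedy decoding; widening the beam to 2 is already enough here to recover the overall more probable sequence "a cat".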
However, this still does not guarantee that we find the most probable sequence in general. We can increase our chances by keeping a larger beam, which on the other hand also means longer computation time. During the search we may encounter sequences that are completed, in the sense that they end with an end-of-sequence token. A critical disadvantage is that shorter sequences naturally receive higher probabilities, and we may not be satisfied with such sequences, as they may not contain enough information or context to continue a conversation.
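A common heuristic against this length bias, not discussed above, is to normalize a hypothesis's log-probability by its length before comparing sequences. A small sketch with made-up token probabilities:

```python
import math

# Length-normalized scoring sketch. Raw log-probabilities favour short
# sequences, since every extra token can only lower the product of
# probabilities; dividing by length makes lengths comparable.
def score(log_probs, normalize=True):
    total = sum(log_probs)
    return total / len(log_probs) if normalize else total

short = [math.log(0.5)]      # one moderately likely token
long_ = [math.log(0.8)] * 4  # four individually more likely tokens

# Without normalization the short sequence wins, despite each token of
# the long sequence being more probable.
print(score(short, normalize=False) > score(long_, normalize=False))  # True
print(score(short) > score(long_))                                    # False
```

More elaborate length penalties exist, but this per-token average already removes the built-in preference for short completions.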
Besides breadth-first algorithms, there are depth-first algorithms, which expand a single output into a complete sequence and then start over with the next most probable output.