This week we are going to have a look at the latest developments of generative adversarial networks (GANs) in the field of dialogue generation by summarizing the paper “Adversarial Learning for Neural Dialogue Generation”.
Standard encoder-decoder models for response generation often fail to produce meaningful utterances and instead come up with short, generic, repetitive and non-informative sequences of words.
The idea here is to apply adversarial methods, which so far have been successful mainly in computer vision, to NLP problems, in particular dialogue generation.
Adversarial training with respect to modeling conversations can be considered as implementing the Turing test and assigning the role of the evaluator to another neural network, the discriminator.
generator, discriminator, REINFORCE algorithm, beam search, adversarial training, sequence-to-sequence
Adversarial networks have been applied to image generation tasks with a lot of success.
A generator produces an image and tries to fool the discriminator into believing that the image comes from the target distribution.
The discriminator tries to distinguish between generated and target images and gives feedback to the generator.
In NLP, however, one soon discovers that text generation is discrete, as opposed to image generation: the generator has to pick individual words, and this sampling step is non-differentiable, so the discriminator's feedback cannot simply be backpropagated into the generator's weights.
Therefore, methods from reinforcement learning have to be applied to make learning possible in adversarial networks with discrete outputs.
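To make the non-differentiability concrete, here is a minimal sketch in plain Python. The function names and the three-word toy vocabulary are made up for illustration. It nudges each logit by a tiny amount and checks whether the chosen token changes; since it does not, the numerical gradient of the discrete choice is zero everywhere we look.

```python
def argmax_token(logits):
    """Return the index of the highest-scoring token (a discrete choice)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def finite_difference(logits, index, eps=1e-4):
    """Numerically estimate d(chosen token id) / d(logits[index])."""
    bumped = list(logits)
    bumped[index] += eps
    return (argmax_token(bumped) - argmax_token(logits)) / eps


logits = [1.0, 2.5, 0.3]  # toy scores over a three-word vocabulary
grads = [finite_difference(logits, i) for i in range(len(logits))]
```

A small change in the scores leaves the selected word untouched, so no useful gradient signal flows back through the selection step.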
The response generation policy is given by a standard sequence-to-sequence model, which consists of an encoder that maps an utterance to a distributed vector representation and a decoder that takes this representation as input and generates a sequence of words, the response.
This model can additionally be equipped with an attention mechanism.
The discriminator is simply a binary classifier with the labels machine-generated and human-generated, which takes the response of the generator as input.
It can be considered as an analogue to the human evaluator in the Turing test.
One possible network architecture for the discriminator is the hierarchical encoder model.
The output of the discriminator then serves as a reward and gives direct feedback about the generator's quality.
Adversarial learning makes use of policy gradient training known from reinforcement learning.
Basically, we try to maximize the expected reward for the generated responses.
The gradients needed to improve the generator are provided by the adversarial REINFORCE algorithm and are calculated using the likelihood ratio trick.
There are two known problems with this calculation: the expectation of the reward is estimated from just one sampled response, and on top of that every token generation step receives the same reward, so a good word in a bad response is penalized just like the bad ones.
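For readers who like to see the likelihood ratio trick written out, here is a sketch of the objective and its gradient. The notation is chosen here for illustration: $x$ is the input utterance, $y$ the sampled response, $\pi_\theta$ the generator policy, $Q_+(\{x, y\})$ the discriminator's probability that the dialogue is human-generated, and $b$ a baseline. The baseline is a common variance-reduction device and is not mentioned in the text above.

```latex
J(\theta) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \left[ Q_+(\{x, y\}) \right]

\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
      \left[ Q_+(\{x, y\}) \, \nabla_\theta \log \pi_\theta(y \mid x) \right]
  \approx \left[ Q_+(\{x, y\}) - b \right]
      \nabla_\theta \sum_{t} \log \pi_\theta(y_t \mid x, y_{1:t-1})
```

The last line replaces the expectation with a single sampled response and applies one scalar weight to the log-probability of every token.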
In other words, a standard discriminator only scores complete responses, so it cannot provide rewards for partially finished sequences.
A way to provide rewards for partially finished sequences is to complete these via Monte Carlo search and average the rewards for the completed sequences.
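A minimal sketch of the Monte Carlo idea, in plain Python. The completion function, the discriminator and the number of rollouts are toy stand-ins invented for this example, not the models from the paper. Each prefix is completed several times by sampling, every completion is scored, and the scores are averaged.

```python
import random


def rollout_reward(prefix, complete, discriminator, num_rollouts=5, rng=None):
    """Average the discriminator reward over sampled completions of a prefix."""
    if rng is None:
        rng = random.Random(0)
    total = 0.0
    for _ in range(num_rollouts):
        full_response = complete(prefix, rng)   # sample the remaining tokens
        total += discriminator(full_response)   # score the finished response
    return total / num_rollouts


# Toy stand-ins (assumptions for illustration only).
VOCAB = ["i", "do", "not", "know", "love", "pizza", "<eos>"]


def toy_complete(prefix, rng, max_len=6):
    """Extend the prefix with random words until <eos> or max_len is reached."""
    tokens = list(prefix)
    while len(tokens) < max_len and tokens[-1] != "<eos>":
        tokens.append(rng.choice(VOCAB))
    return tokens


def toy_discriminator(response):
    """Pretend that more distinct words look more human; value in (0, 1]."""
    return len(set(response)) / len(response)


response = ["i", "love", "pizza"]
prefix_rewards = [
    rollout_reward(response[:t], toy_complete, toy_discriminator,
                   rng=random.Random(t))
    for t in range(1, len(response) + 1)
]
```

Each token position now gets its own reward estimate, at the price of `num_rollouts` extra generations and discriminator calls per position.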
This method is very time-consuming; the alternative is to train a discriminator that can score partial sequences directly.
To avoid overfitting on subsequences, which overlap heavily, only one positive and one negative example are sampled from all subsequences when training the discriminator.
In this way, the generator can receive a different reward, and therefore a different gradient, for every token.
The training procedure for the discriminator follows the standard methods of gradient descent.
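The sampling step for the partial sequence discriminator could look roughly like this. It is a sketch with a hypothetical helper name; responses are plain token lists, and label 1 stands for human-generated, label 0 for machine-generated.

```python
import random


def sample_prefix_pair(human_response, machine_response, rng):
    """Pick one random human prefix (label 1) and one machine prefix (label 0)."""
    pos_len = rng.randint(1, len(human_response))    # inclusive bounds
    neg_len = rng.randint(1, len(machine_response))
    return [
        (human_response[:pos_len], 1),
        (machine_response[:neg_len], 0),
    ]


human = ["i", "am", "heading", "to", "the", "gym"]
machine = ["i", "do", "not", "know"]
examples = sample_prefix_pair(human, machine, random.Random(0))
```

Using only one prefix per response keeps the many overlapping prefixes of the same sentence from dominating a training batch.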
One proposed alternative that avoids the REINFORCE algorithm altogether is to feed intermediate hidden states to the discriminator.
This way the network architecture becomes differentiable.
During training, one notices that even for pretrained sequence-to-sequence models the reward on generated responses alone is not enough to enable learning.
The generator is exposed to the actual target response only indirectly, through the reward, and finding a good sequence in the huge space of possible responses is too difficult, so the generator is not able to learn properly.
Therefore, human-generated responses are also fed to the generator, either with a reward of one or with the reward assigned by the discriminator.
This regularizes the generator so that it does not deviate too much from the training data set.
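To illustrate how the two kinds of updates interact, here is a deliberately tiny sketch: a softmax policy that emits a single token from a three-word vocabulary. The real generator is a sequence-to-sequence model; the learning rate, rewards and token indices below are made-up values. The same update rule is applied once to a sampled token with a low discriminator reward and once to the human token with reward one.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, token, reward, lr=0.5):
    """One REINFORCE-style update for a single-token softmax policy.

    For a softmax policy, d log p(token) / d logits = one_hot(token) - probs.
    """
    probs = softmax(logits)
    return [
        logit + lr * reward * ((1.0 if i == token else 0.0) - p)
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


logits = [0.0, 0.0, 0.0]           # toy policy over a three-word vocabulary
sampled_token, human_token = 0, 2  # hypothetical generated and reference tokens

# Adversarial update: the discriminator gave the sampled token a low reward.
logits = policy_gradient_step(logits, sampled_token, reward=0.1)
# Teacher-forcing update: the human token is treated as if it earned reward 1.
logits = policy_gradient_step(logits, human_token, reward=1.0)

probs = softmax(logits)
```

After both updates the human token has the highest probability, which is the pull towards the training data that the teacher-forcing step is meant to provide.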
Finally, we have a look at the training details and some tricks to enhance the generation process.
Both the generator, in the form of a sequence-to-sequence model with attention, and the discriminator are pretrained.
For the discriminator, the negative samples are either generated using beam search or randomly picked from human-generated responses to other utterances.
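The second kind of negative sample, a human response that belongs to a different utterance, is easy to sketch. The helper name and the three example dialogues are invented for illustration, and the beam search variant is not shown.

```python
import random


def mismatched_negatives(pairs, rng):
    """Pair each utterance with a human response written for another utterance."""
    negatives = []
    for i, (utterance, _) in enumerate(pairs):
        other = rng.choice([j for j in range(len(pairs)) if j != i])
        negatives.append((utterance, pairs[other][1], 0))  # label 0 = negative
    return negatives


dialogue_pairs = [
    ("how are you ?", "fine , thanks ."),
    ("where do you live ?", "in a small town near the coast ."),
    ("what is your job ?", "i teach math at a high school ."),
]
negatives = mismatched_negatives(dialogue_pairs, random.Random(0))
```

Such samples are fluent but off-topic, so they push the discriminator to look at whether a response fits the utterance and not only at whether it sounds human.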
Several tricks enhance the overall generation process: removing short responses from the data set, adjusting the learning rate depending on the average tf-idf score of the words in the response, penalizing the repetition of words that are not in a stop word list, and encouraging diversity during beam search.
Although it has not proven beneficial for machine translation, adversarial training improves dialogue generation.
A possible reason is the size of the gap between the distribution of generated responses and the target distribution.
The hypothesis is that adversarial learning helps most in situations where the target distribution has high entropy, that is, where many different responses are acceptable for the same input.
In the future, adversarial network architectures may well find their way into NLP and bring progress for certain text generation tasks, as already seen for dialogue generation.
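As an illustration of the repetition penalty only, here is a rough sketch. The stop word list, the penalty weight and the linear form of the penalty are assumptions made for this example; the paper's exact formulation may differ, and the tf-idf based learning rate is not shown.

```python
from collections import Counter

# Illustrative stop word list; a real system would use a much larger one.
STOP_WORDS = {"i", "you", "the", "a", "to", "do", "not", "it", "is", "."}


def repetition_penalty(tokens, weight=1.0):
    """Penalty proportional to the number of repeated non-stop-word tokens."""
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    repeats = sum(c - 1 for c in counts.values() if c > 1)
    return weight * repeats


def rescore(candidates, weight=1.0):
    """Sort (tokens, log_prob) candidates by log-probability minus the penalty."""
    return sorted(
        candidates,
        key=lambda c: c[1] - repetition_penalty(c[0], weight),
        reverse=True,
    )


candidates = [
    (["pizza", "pizza", "pizza", "."], -1.0),
    (["i", "love", "pizza", "."], -1.5),
]
best_tokens, best_log_prob = rescore(candidates)[0]
```

In this toy example the repetitive candidate has the higher log-probability but loses after rescoring, while repeated stop words are not penalized at all.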