In this week’s post we take a closer look at a paper that models style, topic and high-level syntactic structure in language models by introducing global distributed latent representations.
In particular, the variational autoencoder seems to be a promising candidate for pushing generative language models forward and incorporating global features.
Recurrent neural network language models are known to be capable of modeling complex distributions over sequences.
However, their architecture limits them to modeling local statistics over sequences and therefore global features have to be captured otherwise.
Among models without global latent representations is the standard recurrent neural network language model, which predicts each word conditioned on the previously seen words and never learns a global vector representation of the sequence.
An alternative is the standard sequence autoencoder model, which learns a sequence representation, but does not introduce a prior distribution over it.
Hence, it cannot generate new sequences, and the learned sequence representations do not form a smooth manifold of interpretable features: decoding from an arbitrary point in the representation space tends to produce grammatically incorrect sentences.
By contrast, the variational autoencoder language model, the main model of this paper, can generate unseen sequences and learns global latent variables that capture features such as style, topic and syntactic structure.
The model itself replaces the deterministic encoding function of the standard autoencoder with a probabilistic recognition model that approximates the posterior over the latent variables.
This allows the model to learn ellipsoidal regions instead of isolated points in the latent space for given sequences during training.
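In code, the recognition model and its reparameterized sampling step might be sketched as follows. This is a minimal stand-in, not the paper's architecture: the toy linear mapping inside `encode` takes the place of the learned encoder layers, and all function names are mine.

```python
import math
import random

def encode(sequence_repr):
    """Hypothetical recognition model: maps an encoder's final hidden
    state to the parameters of a diagonal Gaussian posterior q(z|x).
    A fixed toy mapping stands in for the learned layers here."""
    mu = [0.5 * h for h in sequence_repr]       # "linear layer" -> mean
    log_var = [-1.0 for _ in sequence_repr]     # "linear layer" -> log-variance
    return mu, log_var

def reparameterize(mu, log_var, rng=random):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), so a sampled code
    lands in an ellipsoidal region around mu rather than on a single point."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

mu, log_var = encode([1.0, -2.0, 0.3])
z = reparameterize(mu, log_var)
```

Because the variance is nonzero, repeated encodings of the same sequence scatter around the mean, which is exactly what forces nearby points in latent space to decode to similar sentences.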
Another crucial distinction from the standard autoencoder is the modification of the objective function.
In addition to the reconstruction loss, we regularize the posterior probability over the latent variables by introducing a prior distribution, which is usually a standard Gaussian.
Without the regularization term the model is not able to smooth out its global sequence representation and would simply set the posterior variance to zero.
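For a diagonal Gaussian posterior and a standard Gaussian prior, this regularization term has a well-known closed form, KL = ½ Σ(μ² + σ² − log σ² − 1). A minimal sketch (function name is mine):

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dims:
    0.5 * sum(mu^2 + sigma^2 - log(sigma^2) - 1)."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))

# A posterior that exactly matches the prior pays no penalty:
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
```

The total training loss is then the reconstruction term plus this KL term; note that driving the variance toward zero (log_var → −∞) makes the KL term blow up, which is how the regularizer prevents the collapse to isolated points described above.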
The training procedure brings along one major challenge.
Any useful model should hold some information in its latent variables.
However, most settings for the hyperparameters lead to a zero KL divergence term, which means that there is no information contained in the latent variables.
This happens because the decoder has difficulty learning from the varying samples drawn from the posterior over the latent variables, and instead falls back on the local statistics of the sequences to explain the data.
A first solution is to slowly turn on the KL divergence term which pulls the posterior distribution towards the prior.
This allows the model to encode as much information in the latent variables as possible early in training, before smoothing out the encodings once it pays the full cost of the KL term.
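A simple way to realize this is a weight on the KL term that ramps from 0 to 1 over training. The linear schedule and the `warmup_steps` value below are illustrative choices, not settings from the paper:

```python
def kl_weight(step, warmup_steps=10_000):
    """KL annealing: the KL term's weight ramps linearly from 0 to 1
    over the first warmup_steps training steps, then stays at 1."""
    return min(1.0, step / warmup_steps)

# loss = reconstruction_loss + kl_weight(step) * kl_term
```

Early on the model trains almost like a plain autoencoder; the regularization only bites once the weight approaches 1.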
Another trick is to weaken the decoder by introducing word dropout.
Instead of always feeding the previous target word at each time step, we introduce a so-called drop rate and, with that probability, feed the decoder the unknown token instead.
In this way, the decoder has to rely more on the global sequence representation for predicting the next word.
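Word dropout itself is a one-line transformation of the decoder's input sequence; a sketch, assuming `<unk>` as the unknown-token string:

```python
import random

UNK = "<unk>"  # assumed name for the unknown token

def word_dropout(tokens, drop_rate, rng=random):
    """Replace each token fed to the decoder with the unknown token
    with probability drop_rate; the prediction targets stay unchanged."""
    return [UNK if rng.random() < drop_rate else t for t in tokens]
```

At a drop rate of 1.0 the decoder becomes inputless and must reconstruct the sentence from the latent code alone, which is the extreme setting discussed in the results below.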
The results of training the language model on a book corpus show that the reconstruction term always dominates the KL divergence, but with increasing word dropout rate the KL divergence grows.
This suggests that the model always prefers to use simple local statistics over encoding information in the latent variables.
Taking a closer look at the sentences generated from the prior distribution shows that increasing word dropout also brings more variety.
However, for the inputless decoder the number of grammatical mistakes made during generation increases noticeably.
It is further remarkable that even latent variables with a low probability under the prior distribution lead to consistent and coherent sentences.
The authors point out that an interesting direction of research is the factorization of the latent variable space and the analysis of the generated sentences conditioned on these extrinsic features.