
## Variational Autoencoder

Goal
Variational methods for inferring latent variables have become popular in recent years.
Here we look at variational autoencoders from a slightly more mathematical point of view.

Motivation
Some deep generative models for vision tasks, such as style transfer or simply generating images similar to some given images, rely heavily on variational autoencoder frameworks.
Only very recently have latent variables been introduced into the hierarchical recurrent encoder-decoder framework to enhance the expressive power of the model when generating responses to utterances.
Furthermore, variational autoencoders enable unsupervised learning and are thus of general interest for work toward artificial intelligence.

Ingredients
variational inference, posterior distribution, latent variable, Bayes model, Kullback-Leibler divergence, objective function, lower bound

Steps
Imagine we have a distribution over some data points and we are not interested in putting labels on them; instead, we want to generate more data points similar to the ones we have.
For example, we can think of a set of images of birds or 3D models of spaceships, and we would like to come up with more of the same kind.
So far, generative models have either required strong assumptions about the structure of the data, relied on severe approximations that lead to sub-optimal models, or simply been computationally expensive.
One approach to tackle these problems is to introduce latent variables into the models.
In principle these are capable of capturing some of the underlying structure of the data we want to model and allow us to maximize the probability of the seen data points such that we can generate probable new ones.

The variational autoencoder is one model that incorporates latent variables; it shares its name with the sparse and denoising autoencoders only because of its encoder-decoder structure.
The most important question we have to answer is how we manage to maximize the probability of the data points for the variational autoencoder.
First, we have a look at how to choose the latent variables.
In general, these can have a very complicated dependency structure. In handwritten character generation, for example, the angle of a character might be influenced by the writing style and velocity, all of which the latent variables have to capture.
Luckily, we have neural networks at hand: we draw the latent variables from a normal distribution and hope that the network will turn them into a useful distribution over the latent variables in some layer.
Next, we have to approximate the data distribution, which can be done by sampling.
We sample a number of latent variables and average the conditional data distribution over them to estimate its expectation.
In high-dimensional spaces this runs into problems: for most sampled latent variables the observed data points have very low conditional probability, because the data they generate generally look very different from the data points we already have.
It would be a better idea to sample latent variables that are more likely to have produced the data points we observed and then compute the conditional distribution using these values for the latent variables.
The mathematical tool for this is the Kullback-Leibler divergence, which measures how much one probability distribution differs from another.
This leads us to an objective function: a lower bound on the logarithm of the data probability, which can be optimized in its place.
In particular, the lower bound includes a term that drives an arbitrarily chosen distribution over the latent variables toward one that gives high probability to those latent variable values that are likely to have generated the observed data points.
Another advantage is that this makes inference of the posterior distribution over the latent variables tractable.
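
In symbols, the argument above can be sketched as follows. The notation is an assumption added for illustration and does not appear in the text: $x$ is an observed data point, $z$ the latent variable, $p(z)$ the normal distribution the latent variables are drawn from, $p(x \mid z)$ the conditional data distribution, and $q(z \mid x)$ the chosen distribution over the latent variables. The naive sampling estimate of the data probability is

$$
p(x) = \int p(x \mid z)\, p(z)\, dz \approx \frac{1}{N} \sum_{i=1}^{N} p(x \mid z_i), \qquad z_i \sim p(z),
$$

and the lower bound follows from the identity

$$
\log p(x) - \mathrm{KL}\big(q(z \mid x) \,\|\, p(z \mid x)\big) = \mathbb{E}_{z \sim q(z \mid x)}\big[\log p(x \mid z)\big] - \mathrm{KL}\big(q(z \mid x) \,\|\, p(z)\big).
$$

Because the Kullback-Leibler divergence on the left is non-negative, the right-hand side is a lower bound on $\log p(x)$, and maximizing it also pulls $q(z \mid x)$ toward the posterior $p(z \mid x)$.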

The optimization of this lower bound can be done by the conventional methods of stochastic gradient descent.
We choose an inference model that outputs the parameters (mean and variance) of a normal distribution, which we try to match to a standard normal distribution over the latent variables.
Sampling also becomes a minor problem, since during gradient descent we already iterate over the whole data set and it suffices to draw a single sample of the latent variables per data point.
Since the sampling step is not differentiable, a reparametrization trick is needed to propagate the gradients back to the inference model.
Instead of sampling directly from the normal distribution, we sample a noise value from a standard normal distribution and compute the latent variables deterministically from it and the parameters output by the inference model.
In this way none of the expectations in our objective is taken over a distribution that depends on the parameters of our models, and we can perform stochastic gradient descent.
At test time we generate a new data point by sampling a latent variable from the standard normal distribution and passing it through the decoder; for a given data point, the objective provides a lower bound on the logarithm of its probability.
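
The reparametrization step and the term that matches the inference model to the standard normal distribution can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation of any particular library: the function names `reparameterize` and `kl_to_standard_normal` are hypothetical, the inference model is assumed to output a mean and a log-variance per latent dimension, and the numbers in `mu` and `log_var` are made up.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Return z = mu + sigma * eps, with eps drawn from a standard normal.

    All randomness lives in eps, so z is a deterministic (and differentiable)
    function of mu and log_var: dz/dmu = 1 and dz/dsigma = eps.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)) in nats, in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))


# Hypothetical inference-model output for one data point (two latent dimensions).
mu = [0.5, -1.0]
log_var = [0.0, -2.0]

rng = random.Random(0)                    # fixed seed for repeatability
z = reparameterize(mu, log_var, rng)      # a single sample per data point
kl = kl_to_standard_normal(mu, log_var)   # term matching q to the standard normal
print("z =", z)
print("KL to standard normal =", kl)
```

In a real model an automatic differentiation framework would compute the gradients; the point of the sketch is only that `z` depends on `mu` and `log_var` through ordinary arithmetic, so gradients can flow through it.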

There is also a nice interpretation of the lower bound in terms of information theory.
The negative logarithm of the data probability tells us the total number of bits needed to reconstruct the data point under an ideal encoding.
The lower bound splits this into two parts: the extra information we obtain when the latent variable is sampled from the inference model instead of the standard normal distribution, plus the information needed to reconstruct the data point given this latent variable.
The total exceeds the ideal code length by a small penalty for sub-optimal encoding, which arises because the inference model is used instead of the true posterior distribution over the latent variables given the observed data point.
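
This decomposition can be checked numerically on a toy model where every quantity is available in closed form. The model is an assumption chosen purely for illustration and is not taken from the text: a scalar latent variable z drawn from N(0, 1) and a scalar observation x drawn from N(z, 1), so that the marginal of x is N(0, 2) and the true posterior of z given x is N(x/2, 1/2). The function names are hypothetical.

```python
import math

LN2 = math.log(2.0)  # divide nats by this to obtain bits


def ideal_code_length(x):
    """-log p(x) in nats, where p(x) = N(0, 2) is the exact marginal."""
    return 0.5 * math.log(2.0 * math.pi * 2.0) + x * x / 4.0


def bound_terms(x, m, s):
    """Two terms of the negative lower bound for q(z|x) = N(m, s^2), in nats."""
    # Extra information carried by z when it comes from q instead of N(0, 1).
    kl_prior = 0.5 * (m * m + s * s - 1.0 - math.log(s * s))
    # Expected information needed to reconstruct x from z: E_q[-log p(x|z)].
    reconstruction = 0.5 * math.log(2.0 * math.pi) + 0.5 * ((x - m) ** 2 + s * s)
    return kl_prior, reconstruction


def penalty(x, m, s):
    """KL(q(z|x) || p(z|x)) in nats, with the true posterior N(x/2, 1/2)."""
    return s * s + (m - x / 2.0) ** 2 - 0.5 - 0.5 * math.log(2.0 * s * s)


x = 1.0
for label, m, s in [("mismatched q", 0.0, 1.0),
                    ("true posterior", x / 2.0, math.sqrt(0.5))]:
    kl_prior, reconstruction = bound_terms(x, m, s)
    total = kl_prior + reconstruction
    print(label,
          "| bound:", total / LN2, "bits",
          "| ideal:", ideal_code_length(x) / LN2, "bits",
          "| penalty:", penalty(x, m, s) / LN2, "bits")
```

For the mismatched choice the bound exceeds the ideal code length by exactly the penalty; when the inference distribution equals the true posterior, the penalty vanishes and the bound is tight.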

From a more practical point of view, and with applications to natural language processing in mind, the tendency of recurrent language models to produce averaged responses to given utterances might be overcome by allowing the models to generate several probable responses.
We will pursue this direction further and follow recent developments with interest.

Resources

"Tutorial on Variational Autoencoders" (PDF). Published June 21, 2016. Accessed June 29, 2016.
"Variational auto-encoders do not train complex generative models" (WEB). Published June 22, 2016. Accessed June 29, 2016.
"Chainer-Variational-Recurrent-Autoencoder" (GIT). Published August 23, 2015. Accessed June 29, 2016.