AKA Story

## Expectation Maximization Algorithm

Goal
In today’s summary we have a look at the expectation maximization algorithm that allows to optimize latent variable models when analytic inference of the posterior probability of latent variables is intractable.

Motivation
Latent variable models are itself interesting, because they are related to variational autoencoders and encoder-decoder frameworks that are popular in unsupervised and semi-supervised learning.
They allow to sample from the data distribution and are believed to enhance the expressiveness of the hierarchical recurrent encoder decoder models.
We can think of them as memorizing higher abstract information, such as emotional states that allow to generate sentimental utterances in the encoder.

Ingredients
variational autoencoder, observable variables, latent variables, maximum likelihood, posterior probability, complete data log likelihood

Steps
In general we are concerned with finding good models, which means determining parameters of this model that can explain the data.
More mathematically spoken, we try to find the maximum likelihood of the probability function of the model given the observed data.
But now and then we encounter data sets, where not all data points have labels and we are left with a certain amount of unobserved data.
It is also sometimes helpful to assume that there simply are hidden variables that lead to the observation of the data.
These hidden variables are also called latent variables and might better explain the structure of the data.

Then in order to determine the likelihood of the observed data we have to marginalize over these latent variables and have to take into account the values the latent variables can take.
The sum over these values of the latent variables becomes particular tricky since it sits inside the logarithm of the likelihood function and an analytic solution is hard to find.

Therefore it makes sense to look out for alternatives that may allow us to eventually get a handle on the marginalized log likelihood.
We could make use of the complete data log likelihood function, if we would be able to observe the latent variables.
For this we need the posterior probability of the latent variables given the data, which itself is a quantity that is not easily accessible.
But there are a certain number of tools around that allow to approximate it, like Monte Carlo methods and variational inference.
Now we are at the point where we can calculate the expectation of the complete data log likelihood function and that is also where the naming for the expectation maximization algorithm comes from.
This first step is also called the E-step and usually followed by the maximization of the expectation called the M-step.
There is now some math left to be done to show that this algorithm of picking initial parameters of the network, estimating the latent variables and determining the optimized parameters will increase the marginalized log likelihood function.
We will skip the proof here and only state that the expectation maximization algorithm converges, but it is not ensured that we can reach a global maximum.
However in practice the algorithm works well for optimizing latent variable models.

Outlook
Alternative methods for inferring the posterior probability of latent variables have recently been under consideration in situations where the sampling process becomes more difficult.
In particular, variational inference seems to scale well enough to be used in deep neural networks, such as autoencoder or encoder decoder frameworks.

Resources
Machine Learning” (PDF). Machine Learning. Published Fall 2012. Accessed 10 June 2016.
The Expectation Maximization Algorithm” (PDF). The Expectation Maximization Algorithm. Published July 18 2004. Accessed 10 June 2016.
Expectation–maximization algorithm” (WEB). Expectation–maximization algorithm. Accessed 10 June 2016.