Meet Musio

Covering rare words

Covering rare words


This week’s blogpost treats a new network architecture, named pointer models, for taking care of rare words.
We will dive into some details of the implementation and give a short analysis of the benefits of these kind of models.


Motivation for the introduction of new architectures comes directly from short-comings of RNN language models, as well as encoder decoder frameworks.
Rare words, especially named entities do not experience good word embeddings and hence do not lead to appropriate sentence embeddings which might be used to initialize a decoder component for predicting an output sequence.
Furthermore, the softmax layer over the fixed vocabulary in the decoder does not handle well infrequent words.
The RNN is not able to encode rare words in its hidden state over a long period of time steps and hence is not able to predict these.


attention, RNN, gating, softmax, vocabulary, rare words


Specific words in an immediate context are often found to be repeated.
Such is the case for text summaries, question answering over a given paragraph and also in conversations.
The problem with rare words, that we encounter in general is that we have to store information about these words in a hidden state of fixed capacity over many time steps.
Most RNN language models struggle with this kind of task and also models with attention over the recent input are not able to predict rare words.

Pointer networks have been recently introduced to overcome problems related to rare and out-of-vocabulary words.
By directly selecting the next predicted word via an attention mechanism over the context, which for example could be all previous input words, the model is able to put probability on less frequent and even out-of-vocabulary words.
Unfortunately, we still can’t predict rare words that do not appear in the input context.

Pointer models alone are limited to the words in the actual context.
Therefore it makes sense to combine them in some way with the standard RNN language models.
Only recently a new kind of mixture model was proposed that nicely integrates an RNN into the pointer model instead of going the opposite way and letting the RNN decided when to use a pointer component.

In the following we will have a more detailed look at the components of this model and the interplay between them.
On the one hand we have a standard RNN language model for predicting the next word by looking at the previously seen words in a sequence.
It uses a softmax layer over a fixed vocabulary to assign probabilities.
The pointer component on the other hand distributes attention scores over the input words in the context.
This happens in an unsupervised way, such that the model is flexible enough to deal with multiple occurrences of words.
Further, it might make sense for very long input sequences to restrict the attention to smaller window sizes to reduce the computation time.
The most interesting part is the actual combination of the above components which is handled by a special gating function.
For this particular model the gating function depends on a latent variable indicating the base distribution used which is either the softmax or the pointer distribution.
The actual value of the gate is then determined by an additional attention score for an end-of-attention vector besides the hidden states in the input window.
The pointer distribution given by the attention scores has then to be normalized respectively.
In this way the gating function directly depends on the last hidden state of the RNN and more importantly the size of the pointer window and the included hiddent states.
The RNN hidden state alone is in general not suited to keep track of the position of words within a pointer window of reasonable size and the time step when specific words drop out of it.

An analysis of this pointer mixture models on different data sets for natural language processing shows strong improvements in perplexity for rare words but also minor improvements on more frequent words.
The actual usage of the pointer component can be traced back to focusing on rare names, named entities and frequent words like units of measurements and numbers.
As could be expected the pointer is helpful for predicting words at the beginning of the pointer window.
Surprisingly sometimes also common frequent words are attended by the pointer component.


Besides tasks like text summary, question answering and machine translation which for sure benefit from pointer components, also dialogue modeling could be improved.
Rare words, named entities and especially names are often repeated in a conversation by the speakers for grounding reasons.
It might be also interesting to provide context in the form of user models and point to words in a database, which might contain information on family relations for example.
In the end pointer models do not tackle the problem of predicting unseen words, but are a first step in the direction of solving the problem of rare words.


Leave a Reply