AKA Story

Word Embedding

Thoughts about character-based word embedding and vocabularies in NLP
In this summary we compare the two standard methods of single character embedding and full word embedding.

To teach a computer to understand words and perform natural language tasks, we have to map characters or words into a vector space on which the computer can naturally operate.

Keywords: vocabulary, convolutional layers, highway layers, vector space, out-of-vocabulary words, semantics, syntax

The mapping of characters, words or even complete sentences into a vector space is usually called embedding.
Given some text, there are two distinct methods to compute word embeddings that a computer can work with.
Children learning to read start by recognizing individual characters before they are able to understand the meaning of words.
In the same way we can teach a computer to operate on a character-level.
An obvious advantage of this method is that learning a fixed set of about 70 characters for a language allows us to write all known words.
Further, many words contain substructure in the form of affixes, such as suffixes that transform verbs into adjectives or prefixes that signal negation.
A character-level approach might easily recognize the plural of a word as related to its singular, whereas a word-level model treats them as two completely unrelated words.
From a neural network perspective, we are able to model this substructure and identify words that share a stem more easily.
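As a minimal sketch of the character-level view, the snippet below encodes words as sequences of character ids. The character inventory (lowercase ASCII letters, digits, punctuation and space) and the helper names `encode_chars` and `shared_prefix_length` are assumptions made for illustration; a real system would choose its own character set.

```python
import string

# Hypothetical character inventory; the exact set is an assumption of this sketch.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " "
# Id 0 is reserved for characters outside the inventory.
CHAR_TO_ID = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def encode_chars(word):
    """Map a word to a list of character ids (0 for unknown characters)."""
    return [CHAR_TO_ID.get(ch, 0) for ch in word.lower()]


def shared_prefix_length(a, b):
    """Number of leading character ids that two encoded words have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


singular = encode_chars("embedding")
plural = encode_chars("embeddings")
print(len(ALPHABET))                           # size of the character inventory
print(shared_prefix_length(singular, plural))  # the plural shares the whole stem
```

Any word, including one never seen before, can be encoded this way, and the singular and plural forms share almost their entire id sequence, which is the substructure a character-level model can exploit.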

Furthermore, the small set of characters reduces the number of parameters needed for embedding words into a vector space.
On the other hand, these models are not without drawbacks.
Very similar words, which often have an edit distance of only 1, meaning that they differ in a single character, can convey completely different meanings.
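To make the notion of edit distance concrete, here is a small sketch of the standard Levenshtein distance computed with dynamic programming. The word pairs are examples chosen for illustration and do not come from any particular dataset.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("house", "mouse"))  # 1, but unrelated meanings
print(edit_distance("word", "words"))   # 1, and closely related meanings
```

Both pairs are one edit apart, yet only the second pair is semantically related, so closeness at the character level alone does not tell a model whether two words should receive similar embeddings.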

It is naturally hard for these models to learn distinct embeddings for these kinds of words.
If we are interested in generating text, a character-level approach often performs worse than generating words, since it involves generating very long sequences of characters, which current cross-entropy-based probabilistic objectives do not favor.
The architecture of these models typically involves several convolutional layers whose outputs are pooled over the characters of a word. Recent models stack highway layers on top and achieve even better results.
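The convolution, pooling and highway steps can be sketched in plain Python as follows. This is an untrained toy with random weights and no bias terms; all sizes (`CHAR_DIM`, `WIDTH`, `NUM_FILTERS`), the lowercase-only alphabet and the function names are assumptions for illustration, not the configuration of any published model.

```python
import math
import random

random.seed(0)

CHAR_DIM = 4      # size of each character embedding (assumed)
WIDTH = 3         # filter width in characters (assumed)
NUM_FILTERS = 5   # number of filters, i.e. size of the word vector (assumed)
ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # only lowercase letters are supported here


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


char_emb = {ch: [random.uniform(-0.5, 0.5) for _ in range(CHAR_DIM)] for ch in ALPHABET}
filters = rand_matrix(NUM_FILTERS, WIDTH * CHAR_DIM)  # one flattened filter per row
w_transform = rand_matrix(NUM_FILTERS, NUM_FILTERS)   # highway transform weights
w_gate = rand_matrix(NUM_FILTERS, NUM_FILTERS)        # highway gate weights


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def char_cnn(word):
    """Convolve over character windows, then max-pool over window positions."""
    chars = [char_emb[ch] for ch in word]
    while len(chars) < WIDTH:  # zero-pad short words so one window exists
        chars.append([0.0] * CHAR_DIM)
    # Each window concatenates WIDTH consecutive character vectors.
    windows = [sum(chars[i:i + WIDTH], []) for i in range(len(chars) - WIDTH + 1)]
    feature_maps = [[math.tanh(a) for a in matvec(filters, win)] for win in windows]
    return [max(column) for column in zip(*feature_maps)]


def highway(x):
    """y = t * relu(W_h x) + (1 - t) * x, with gate t = sigmoid(W_t x)."""
    h = [max(0.0, a) for a in matvec(w_transform, x)]
    t = [1.0 / (1.0 + math.exp(-a)) for a in matvec(w_gate, x)]
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


def embed_word(word):
    return highway(char_cnn(word))


print(len(embed_word("cat")), len(embed_word("embedding")))
```

Because of the max-pooling over positions, words of any length are mapped to a vector of the same size, and the highway gate lets part of the pooled features pass through unchanged.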

Let us take a step back and consider the direct word embedding approach for a given text.
In order to define a mapping from words into a vector space, we first have to specify the set of words that should be mapped.
Naturally, one uses a limited list of the most frequent words, called the vocabulary, which, depending on the size of the text or corpus, typically contains tens of thousands up to a few million word types.

Since we do not consider all words, this approach reserves a special symbol or embedding for all words not included in the vocabulary.
Certainly, this is a major drawback when handling new words or generalizing to previously unseen texts.
The most common way to proceed is to embed the words in the vocabulary as one-hot vectors.
All entries are set to zero except the one corresponding to the word's position in the vocabulary, which is set to one.
In recent years, pre-trained word embeddings such as word2vec or GloVe have become available (and can be loaded with libraries such as gensim), making it possible to initialize words with already learned embeddings.
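The vocabulary, the reserved symbol and the one-hot encoding described above can be sketched as follows. The token `<unk>`, the toy sentence and the function names are hypothetical choices made for this example.

```python
from collections import Counter

UNK = "<unk>"  # hypothetical name for the reserved out-of-vocabulary symbol


def build_vocab(tokens, max_size):
    """Keep the max_size most frequent word types; index 0 is reserved for UNK."""
    most_common = [w for w, _ in Counter(tokens).most_common(max_size)]
    return {w: i for i, w in enumerate([UNK] + most_common)}


def one_hot(word, vocab):
    """Vector of zeros with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab.get(word, vocab[UNK])] = 1
    return vec


tokens = "the cat sat on the mat and the dog sat on the cat".split()
vocab = build_vocab(tokens, max_size=4)
print(vocab)
print(one_hot("cat", vocab))
print(one_hot("zebra", vocab))  # unseen word falls back to the UNK position
```

Note that "mat" and "zebra" end up with the same vector: every word outside the vocabulary collapses onto the single reserved symbol, which is exactly the loss of information discussed here.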

This is a clear advantage if the aim is, for example, to build sentence representations on top of word embeddings in order to perform natural language understanding tasks.

As stated before, the output sequence is shorter when operating on the word level, and hence generating sentences is easier.
There are, however, major disadvantages that triggered the rise of character-level embeddings.
This approach is blind to the substructure of words, which is of great importance if one wants to model morphologically rich languages like Turkish or German.

Further, rare words are poorly embedded, and out-of-vocabulary words cannot be embedded in a meaningful way at all.
Since most of the parameters in natural language models arise from the word embeddings, a big vocabulary is a major disadvantage, yet it is needed to deal with dumps of Wikipedia, which contain about 20 million word types.
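A back-of-the-envelope calculation illustrates the difference in embedding parameters. The embedding dimensions (300 and 15) and the 100,000-word vocabulary are assumptions chosen for illustration; only the 70 characters and the 20 million word types are taken from the text. The count covers the lookup table alone and ignores the convolutional filters a character model needs on top.

```python
def embedding_parameters(num_symbols, dim):
    """Size of an embedding lookup table: one dim-sized vector per symbol."""
    return num_symbols * dim


# Assumed sizes, except for the 70 characters and 20 million word types.
word_level = embedding_parameters(100_000, 300)
wikipedia = embedding_parameters(20_000_000, 300)
char_level = embedding_parameters(70, 15)

print(f"{word_level:,}")  # 30,000,000
print(f"{wikipedia:,}")   # 6,000,000,000
print(f"{char_level:,}")  # 1,050
```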

The evaluation of these two approaches certainly depends on the task in which they are implemented.
At the level of the pure embedding one might come up with a list of semantic and syntactic tasks.
For example, an embedding should group capital cities in such a way that one can easily retrieve the capital of a given country.
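One common way to test such a grouping is the vector-offset analogy: take the offset from a known country to its capital, add it to another country, and look for the nearest word by cosine similarity. The sketch below uses hand-made three-dimensional toy vectors that are purely illustrative and not learned embeddings; the function names are likewise assumptions.

```python
import math

# Hand-made toy vectors, constructed so that capital = country + [0, 1, 0].
emb = {
    "france":  [1.0, 0.0, 0.0],
    "paris":   [1.0, 1.0, 0.0],
    "germany": [0.0, 0.0, 1.0],
    "berlin":  [0.0, 1.0, 1.0],
    "italy":   [1.0, 0.0, 1.0],
    "rome":    [1.0, 1.0, 1.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Return the word d such that a : b is like c : d, using the offset b - a + c."""
    target = [y - x + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))


print(analogy("france", "paris", "germany"))  # berlin
print(analogy("italy", "rome", "france"))     # paris
```

With learned embeddings the offsets are only approximately equal, so the nearest neighbour is not guaranteed to be the correct answer; the fraction of correctly answered analogies then serves as the evaluation score.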

In the same sense, words that differ only by a distinction between man and woman (for example king and queen) should be grouped.
With respect to syntax, the identification of the adjective corresponding to a verb or the plural corresponding to a noun might be of interest.
In general, however, we evaluate the two approaches by how well they serve a more sophisticated task like machine translation, word tagging or semantic analysis.

For machine translation, recent results show that the character-level approach allows for better BLEU scores since it significantly reduces the number of unseen words. In any case, the training procedure calls for an encoder-decoder framework in which the embeddings are fed into a recurrent neural network such as an LSTM to produce meaningful, context-dependent representations.

Future directions for the character-level approach might be to refine the convolutional layers used and to address the problem of generating sentences character by character.
In particular, modeling noisy data from social network services will make it necessary to abandon the fixed-vocabulary approach.

“Character-based Neural Machine Translation” (PDF). March 2016. Retrieved 11 March 2016.
“Character-Aware Neural Language Models” (PDF). December 2015. Retrieved 11 March 2016.
“Efficient Estimation of Word Representations in Vector Space” (PDF). September 2013. Retrieved 11 March 2016.
“Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation” (PDF). Retrieved 11 March 2016.
“Character-level Convolutional Networks for Text Classification” (PDF). September 2015. Retrieved 11 March 2016.
