Table of Contents
Today’s summary deals with the question of modeling additional information present in natural language and how models could take advantage of this hierarchical syntactic structure.
Besides semantics, being about meaning and logic in language, syntax determines the structure of sentences.
Basic linguistic information in the form of parse trees, characterizing relationships among words in a sentence, might in principle lead to better word and sentence representations which should enhance natural language processing.
parse tree, constituency, dependency, syntax, semantics, recursive neural network, recurrent neural network, reduce shift algorithm, shift left/right algorithm
We start by describing in which way syntactic information is represented by state-of-the-art parsers like spacy, SyntaxNet or the Stanford Parser before we look at model frameworks that can incorporate this kind of information.
In the so-called constituency-based parse tree children nodes are reduced to a parent node.
Depending on the number, one distinguishes binary and n-ary trees.
In this way, one can for example identify the subject of a sentence which might consists itself of an article, an adjective and a noun.
We summarize the adjective and the noun in one parent node before we reduce it together with the article to form the subject of our sentence.
Clearly, we have to distinguish between terminal nodes, which contain the actual words, and non-terminal information, which keep the syntactic information between several words.
It should also be clear that the syntactic information cannot be processed sequentially, which means that we cannot easily use recurrent neural networks in their basic form for processing parse trees.
Another way of representing the syntactic information is to put emphasis on the dependency between words of a sentence.
Usually one defines the root word to be the verb of the sentence and specifies the relation of it to several other words, like the nouns of the subject and object.
These are themselves linked to articles or adjectives.
The dependencies are labeled by standard syntactic functions describing the relation between two words.
It might not be obvious that these two ways of representing syntactic structure contain the same information, but it is possible to recover either representation from the other.
Now we start looking at models that can deal with the information contained in parse trees.
Recursive neural networks are perfectly fit to model the non-sequential structure of parse trees.
In a bottom-up computation the network visits recursively every triple of a parent node and its two children to compute the representations up to the final nodes which summarizes the whole sentence.
In principle we can deploy a recurrent neural network to process two children nodes, but since the parse tree for every sentence has a different structure, we cannot process several distinct trees together.
Hence it is not easy to use batches.
Therefore recursive neural networks lack speed during the training process in comparison to sequential models which are solely based on recurrent neural networks.
Next we observe the computational structure of parse trees.
There are two main algorithms that specify how word embedding vectors should be processed to reach a sentence vector enhanced by syntactic information.
Both algorithms work with a buffer, which contains the unprocessed words of the sentence, and a stack which holds the partially processed tree.
In addition we have to determine a certain number of transitions or actions that can be performed on the buffer and stack.
In case of the shift reduce algorithm, we are allowed to shift a word from the buffer to the top of the stack or to reduce the top two entries of the stack.
Here the reduce transitions summarizes two nodes in the parse tree for us.
This algorithm is applicable for constituency parse trees of the binary form.
By reducing more than two nodes it is also possible to compute n-ary parse trees.
Dependency parse trees need a different scheme of transitions.
In addition to the shift action a left and a right reduction have to be introduced.
In this way we accept that words in the buffer are related to previously shifted words or words that are still on the buffer.
Recent network architectures try to track the computational paths of these algorithms with recurrent neural networks.
This allows to some extent to train on batches, which means a noticeable speed-up compared to recursive neural networks.
But since the transitions for the data points in the batch per step are not the same one still does not reach the performance of recurrent neural networks.
Another nice feature of having a recurrent neural network tracking the computations is the availability of context that can be used during a reduction.
Previously one had to rely on the pure word embeddings in one recursive step.
After going through a lot of linguistic theory on parse trees and algorithms, we should in principle answer the question of what we gain by including syntactic information.
Following the progress on the natural language tasks of identifying the relation among a sentence pair, which can be either deductive, neutral or contradictory, the battle between semantic and syntactic models is not decided.
Currently the best model on the Standford Natural Language Inference task is of the syntactic form, but purely sequential models might soon catch up.
In the future it would also be interesting to see if the decoder of a Sequence-to-Sequence model could benefit from syntactic information.
Sequential decoders sometimes lack the ability to produce grammatically correct sentences and in general lack to generate informative sentences.
By generating correct parse trees both kind of problems could be addressed.