In this week's blog post we shed light on coreference resolution done in an end-to-end way, as recently presented in the paper “End-to-end Neural Coreference Resolution”.
Solving tasks in an end-to-end fashion is becoming more and more popular, for several reasons.
Trying to optimize several components or modules of a system simultaneously can be very difficult due to their complex intertwining.
Besides that, multi-component systems usually need a lot of hand-tuned parameters, such as thresholds for deciding which data to pass on to the next stage.
This is also the case for previous work on coreference resolution, which relies on parse trees and hand-engineered algorithms for aligning mentions on top of them.
Systems designed in this way clearly do not easily generalize to other domains.
The task of coreference resolution can be summarized as detecting all expressions in a text that refer to the same entity.
Coreference resolution is a fundamental task needed for tackling context understanding, question answering, information extraction and hence dialogue generation.
In more detail, coreference resolution assigns an antecedent to every mention in an utterance, sentence or document.
Relying on spans of several words for detecting these mentions allows one to move away from generating a parse tree.
An antecedent can be any span occurring before the actual mention.
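To make this concrete, here is a minimal sketch of how candidate spans and their potential antecedents can be enumerated. The function names and the span ordering are our own illustrative choices, not taken from the paper:

```python
def candidate_spans(tokens, max_width):
    """Enumerate all spans (start, end) up to max_width tokens, inclusive indices."""
    spans = []
    for start in range(len(tokens)):
        for end in range(start, min(start + max_width, len(tokens))):
            spans.append((start, end))
    return spans

def antecedents(spans, i):
    """A potential antecedent of span i is any span occurring before it;
    here we simply take all spans earlier in the enumeration order."""
    return spans[:i]

tokens = "Mary said she was tired".split()
spans = candidate_spans(tokens, max_width=2)
```

Even for this five-word sentence with a width limit of two, we already get nine candidate spans, which hints at the combinatorial blow-up discussed further below.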
The model architecture in the above-mentioned paper consists of a unary and a pairwise scoring of spans.
The unary scorer takes a span representation as input and tells us how likely that span is to be a mention.
The pairwise scorer consumes two span representations plus additional metadata such as the speaker(s), the genre and the distance between the spans.
Both scores are produced by simple feed-forward networks whose parameters are learned.
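The two scorers can be sketched as follows. Dimensions, layer sizes and the metadata features (a same-speaker flag and a distance value) are toy placeholders of our own; the paper uses much larger representations and richer feature embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

def ffnn(x, w1, b1, w2, b2):
    """Two-layer feed-forward net with ReLU, producing a single score."""
    h = np.maximum(0.0, x @ w1 + b1)
    return float(h @ w2 + b2)

d = 8  # toy span-representation size
w1m, b1m = rng.normal(size=(d, 16)), np.zeros(16)
w2m, b2m = rng.normal(size=16), 0.0

def mention_score(g):
    """Unary score: how likely is span representation g to be a mention?"""
    return ffnn(g, w1m, b1m, w2m, b2m)

# The pairwise scorer sees both spans, their elementwise product, and metadata.
w1a, b1a = rng.normal(size=(3 * d + 2, 16)), np.zeros(16)
w2a, b2a = rng.normal(size=16), 0.0

def antecedent_score(g_i, g_j, meta):
    """Pairwise score over span i, candidate antecedent j, and metadata."""
    x = np.concatenate([g_i, g_j, g_i * g_j, meta])
    return ffnn(x, w1a, b1a, w2a, b2a)

g1, g2 = rng.normal(size=d), rng.normal(size=d)
s = mention_score(g1) + mention_score(g2) + antecedent_score(g1, g2, np.array([1.0, 0.2]))
```

The final coreference score for a pair combines both unary scores with the pairwise score, as in the last line above, so that unlikely mentions drag down every link they participate in.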
At this point we have to explain the structure of the span representations.
For properly detecting mentions, not only the internal structure of a span matters but also its context.
Hence the authors rely on a bidirectional LSTM with a standard attention mechanism, which is by now the standard for processing sequential data.
Pretrained word embeddings in the form of 300-dimensional GloVe vectors, as well as 50-dimensional vectors insensitive to word ordering, are fed to the LSTM.
To resolve the internal structure, learned 8-dimensional character embeddings are passed through one-dimensional convolutional filters of window sizes 3, 4 and 5, yielding additional character-level word representations.
A span representation is then the concatenation of the hidden states at the start and end positions of the span, the attention-weighted sum over the words inside the span, and an embedding of the span's size.
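The concatenation can be sketched like this. The attention scorer here is a stand-in (a dot product with a fixed vector) rather than the learned one from the paper, and all dimensions are toy values:

```python
import numpy as np

def span_representation(hidden, x, start, end, width_embed):
    """Concatenate: BiLSTM hidden states at span start and end, an
    attention-weighted sum of the word vectors inside the span, and an
    embedding of the span width. `hidden` holds per-token BiLSTM outputs,
    `x` the word vectors."""
    span_h = hidden[start:end + 1]
    scores = span_h @ np.ones(span_h.shape[1])   # stand-in attention scorer
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                       # softmax over span positions
    head = alphas @ x[start:end + 1]             # soft "head word" vector
    return np.concatenate([hidden[start], hidden[end], head, width_embed])
```

The attention-weighted sum acts as a soft head-word finder: instead of asking a parser which word heads the phrase, the model learns to weight the words itself.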
Obviously, splitting utterances into all possible spans and scoring all of them pairwise is computationally highly expensive.
Therefore a certain amount of pruning is needed.
First, the number of words per span is limited, and it makes sense to pairwise score only those mentions with the highest mention scores.
Furthermore, one can discard overlapping spans and only take a certain number of antecedents into account.
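A sketch of two of these pruning steps, the span-width limit, keeping only the top-scoring mentions, and capping the number of antecedents per span (overlap removal is omitted here for brevity):

```python
def prune_spans(spans, scores, max_width, keep, max_antecedents):
    """Keep only short spans with the highest mention scores, then pair
    each surviving span with at most `max_antecedents` of its nearest
    surviving predecessors."""
    short = [i for i, (s, e) in enumerate(spans) if e - s + 1 <= max_width]
    kept = sorted(sorted(short, key=lambda i: -scores[i])[:keep])
    pairs = []
    for rank, i in enumerate(kept):
        for j in kept[max(0, rank - max_antecedents):rank]:
            pairs.append((i, j))
    return kept, pairs
```

This turns the quadratic all-pairs scoring into something linear in the number of kept spans times the antecedent cap.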
This seems like a lot of hyperparameters to tune and might eventually lead to throwing away gold training examples.
Fortunately this is not the case.
For spans which do not appear in the training data, or whose candidate antecedents do not contain the gold one, the model introduces a dummy antecedent, which receives a fixed score of zero.
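With the dummy antecedent in place, training can marginalize over all gold antecedents of a span, falling back to the dummy when no candidate is gold. A toy version of that per-span objective, with our own function names:

```python
import numpy as np

def marginal_log_likelihood(pair_scores, gold_mask):
    """Log-likelihood for one span. A dummy antecedent with fixed score 0
    is prepended at index 0; if no candidate antecedent is gold, the
    dummy becomes the gold target."""
    scores = np.concatenate([[0.0], pair_scores])
    if gold_mask.any():
        gold = np.concatenate([[False], gold_mask])
    else:
        gold = np.array([True] + [False] * len(pair_scores))
    log_z = np.log(np.exp(scores).sum())
    return float(np.log(np.exp(scores[gold]).sum()) - log_z)
```

Because the objective only asks that probability mass lands on *some* gold antecedent (or the dummy), pruned or unseen gold spans do not produce contradictory training signal.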
Finally, all pairwise scored mentions are clustered, and each cluster then represents one entity.
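The clustering step amounts to taking the connected components of the predicted mention–antecedent links, which can be done with a small union-find; this sketch is our own, the paper simply follows the predicted links:

```python
def cluster_from_links(n, links):
    """Union-find over predicted (mention, antecedent) index pairs; every
    connected set of mentions becomes one entity cluster."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in links:
        parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```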
The authors report gains in F1 score on the OntoNotes benchmark and hence reach state-of-the-art performance.
Additionally, the model allows one to interpret the absence of a coreference link and trace it back to either a low mention score or a low antecedent-ranking score.
The attention mechanism also nicely shows the contribution of the individual words in the span to the coreference decision.
In the future, additional information in the form of the output of a named entity recognizer, or world knowledge, could improve the model's performance even further.
Another paper which we want to mention here is “Event Coreference Resolution by Iteratively Unfolding Inter-dependencies among Events”.
In addition to within-document resolution of coreferences, the authors tackle cross-document resolution and try to cluster mentions of real-world events.