This paper uses semantic information and word order to determine sentence similarity. To do so, the authors propose an algorithm that takes into account a common, shared lexical corpus and the frequency of words that are similar across sentences.
In the past, text similarity analysis has mostly been applied to long texts. For the most part it was successful, because long texts contain enough words that similar words recur often. Shorter texts were harder to analyze in this way because there are only a limited number of words to work with, which means a smaller chance of finding overlap. Here, the authors attempt to develop an algorithm that analyzes short texts differently from traditional methods in two ways. First, it analyzes text only at the sentence-to-sentence level rather than as a whole, as is done with long-form documents. Second, it integrates word-order data to better detect sentence similarity. In long texts, it is easier to identify how words and their overall order give the text meaning and carry information. In short texts, the task is more difficult. Essentially, “The task is to establish a computational method that is able to measure the similarity between very short texts (sentences).”
Traditionally, “information retrieval methods used a set of pre-determined index terms (words or collocations) that are used to represent a document in the form of a document-term vector.” This is problematic for short texts, however, because the vector will have very few nonzero entries. It can also cause certain keywords or phrases, or even the overall relevance of a sentence, to be missed, because the limits of the term set are drawn too broadly.
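The sparsity problem is easy to see in a small sketch. The index terms below are hypothetical and far fewer than a real system would use, and the function name `document_term_vector` is introduced here only for illustration.

```python
def document_term_vector(text, index_terms):
    """Count how often each pre-determined index term occurs in the text."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in index_terms]


# Hypothetical index terms; a real system would use thousands.
INDEX_TERMS = ["algorithm", "corpus", "dog", "fox", "sentence",
               "similarity", "syntax", "vector", "word", "wordnet"]

short_text = "The quick brown fox jumps over the lazy dog"
vector = document_term_vector(short_text, INDEX_TERMS)
nonzero = sum(1 for count in vector if count > 0)
print(vector)
print(f"{nonzero} of {len(vector)} entries are nonzero")
```

With a realistic term set of thousands of entries, a single sentence would leave nearly every position at zero, which gives a vector comparison very little to work with.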
First, it is important to examine word similarity. Using Miller’s WordNet, word similarity is measured by the shortest path between two words in a hierarchy of synonyms and semantic relations. The authors propose a formula that captures this process: it combines the shortest relational distance between two words with the depth in the hierarchy at which the words begin to share meaning.
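The description above suggests a score that falls as the path between two words gets longer and rises as their shared ancestor sits deeper in the hierarchy. A minimal sketch of one such function follows. The exponential and tanh forms, and the constants `alpha` and `beta`, are assumptions chosen for illustration and are not quoted from the paper; the path lengths and depths are supplied by hand, not looked up in WordNet.

```python
import math


def word_similarity(path_length, depth, alpha=0.2, beta=0.45):
    """Score two words from their positions in a semantic hierarchy.

    path_length: number of edges on the shortest path between the words.
    depth: depth of the deepest node that subsumes both words.
    alpha, beta: illustrative scaling constants (assumed values).
    """
    length_term = math.exp(-alpha * path_length)  # shrinks as words get farther apart
    depth_term = math.tanh(beta * depth)          # grows as the shared ancestor gets deeper
    return length_term * depth_term


# Hypothetical inputs, not taken from WordNet.
print(word_similarity(path_length=1, depth=8))  # close words, deep shared ancestor
print(word_similarity(path_length=6, depth=2))  # distant words, shallow shared ancestor
```

The first call should score higher than the second, matching the intuition that closely related, specific words are more similar than words that only meet near the top of the hierarchy.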
Next, humans find it easy to judge sentence similarity. “The quick brown fox jumps over the lazy dog” and “The quick brown dog jumps over the lazy fox” contain the same words, but the subject and object trade places. We make that distinction easily in natural conversation, but translating it into a computational form for natural language processing is no small task. In cases like this, where two sentences share the same words, word order similarity is what distinguishes them, and this is what the authors intend to measure. In the example above, the sentences are identical except that the 4th and 9th words are switched. The authors’ solution is to assign each word in a sentence a unique numerical value; on comparison, the more of these values that occur in the same positions across the two sentences, the higher the measured similarity.
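The fox and dog example can be worked through in a short sketch. It assumes each word's numerical value is its position in the sentence, that a repeated word is represented by its first occurrence, and that the two position vectors are compared with a normalized difference. These choices are illustrative assumptions, not details confirmed by the summary above.

```python
import math


def word_order_similarity(sentence_a, sentence_b):
    """Compare where the words of a joint word set occur in two sentences."""
    tokens_a = sentence_a.lower().split()
    tokens_b = sentence_b.lower().split()

    # Joint word set: every distinct word, in order of first appearance.
    joint = list(dict.fromkeys(tokens_a + tokens_b))

    def order_vector(tokens):
        # 1-based position of each joint word's first occurrence; 0 if absent.
        return [tokens.index(w) + 1 if w in tokens else 0 for w in joint]

    r_a, r_b = order_vector(tokens_a), order_vector(tokens_b)
    diff = math.sqrt(sum((x - y) ** 2 for x, y in zip(r_a, r_b)))
    total = math.sqrt(sum((x + y) ** 2 for x, y in zip(r_a, r_b)))
    return 1.0 - diff / total if total else 1.0


s1 = "The quick brown fox jumps over the lazy dog"
s2 = "The quick brown dog jumps over the lazy fox"
print(word_order_similarity(s1, s1))  # identical order
print(word_order_similarity(s1, s2))  # fox and dog swapped
```

Identical sentences score 1.0, while the swapped pair scores lower even though both sentences contain exactly the same words, which is the distinction a bag-of-words comparison would miss.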
The authors state that “Semantic similarity represents the lexical similarity. On the other hand, word order similarity provides information about the relationship between words: which words appear in the sentence, and which words come before or after which other words.” So, both semantic and syntactic similarity bear on sentence similarity. When words share similarities in meaning, we can begin to map sentence similarity. Word order and syntax bear on sentence similarity as well, which further contributes to accurately determining whether two sentences are similar.
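One simple way to let both signals contribute is a weighted sum. The sketch below assumes that form; the weight `delta` and its default value are hypothetical and would need tuning, and the two input scores are assumed to lie between 0 and 1.

```python
def sentence_similarity(semantic_score, order_score, delta=0.85):
    """Blend semantic and word-order similarity into one score.

    delta: assumed weight on the semantic score; (1 - delta) goes to word order.
    """
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must be between 0 and 1")
    return delta * semantic_score + (1.0 - delta) * order_score


# Hypothetical scores: same words (high semantic score) in a different order.
print(sentence_similarity(semantic_score=1.0, order_score=0.76))
```

Weighting the semantic score more heavily reflects the idea that meaning carries most of the similarity, with word order acting as a correction for sentences that share words but arrange them differently.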
Now that the authors have a way to assess sentence similarity, they need to implement it so that conversational agents can make use of it. Because the research is meant to help machines process input and communicate in a more human manner, the implementation must adhere to pattern-based rules (conversational agents rely on stimulus patterns and response patterns to carry on dialogue). To build a dataset that is useful for natural language processing, every possible stimulus and response must be logged and archived if there is to be any semblance of natural dialogue. Since language is so dynamic, these datasets tend to be long. Using the authors’ algorithm to determine sentence similarity, however, one could conceivably restrict their length. For instance, a set of possible stimulus patterns may contain several words pertaining to child (kid, boy, girl, etc.), with a response built around each of these words individually, which adds near-duplicate responses to the dataset. Using word similarity vectors, the dataset could be shortened by combining responses whose words fall within a chosen limit of semantic similarity (i.e., there don’t need to be four separate entries for child, kid, boy, and girl). This eliminates sentences that differ only slightly in wording and not in overall meaning, allowing the dataset of possible responses to be shorter and the conversational agent to respond faster.
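The child, kid, boy, girl example can be sketched as a small rule-merging step. To keep the example self-contained, a hand-written lookup table stands in for a real word-similarity measure with a threshold; the table, the rules, and the function names are all hypothetical.

```python
# Hypothetical word groups standing in for a real word-similarity measure.
SIMILAR_WORDS = {"kid": "child", "boy": "child", "girl": "child"}


def canonical(pattern):
    """Replace each word with its group representative, if it has one."""
    return " ".join(SIMILAR_WORDS.get(w, w) for w in pattern.lower().split())


def merge_patterns(rules):
    """Keep one response per group of patterns that differ only in similar words."""
    merged = {}
    for pattern, response in rules:
        merged.setdefault(canonical(pattern), response)
    return merged


rules = [
    ("my child is sick", "I am sorry to hear that."),
    ("my kid is sick", "I am sorry to hear that."),
    ("my boy is sick", "I am sorry to hear that."),
    ("my girl is sick", "I am sorry to hear that."),
    ("my car is broken", "Have you called a mechanic?"),
]
merged = merge_patterns(rules)
print(len(rules), "rules reduced to", len(merged))
```

At lookup time the incoming stimulus would be passed through the same `canonical` step before matching, so the agent searches the shorter rule set without losing coverage of the merged variants.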
So, pattern matching may be nothing new in natural language processing, but comparing sentence similarity opens the door to new and potentially significant advances in the use of dialogue agents. The rules and datasets become much shorter, which makes them more readable and easier to maintain and update as developments are made. Essentially, this research was done to simplify a conversational agent’s representation of knowledge (potential stimuli and responses) and how it processes that knowledge. This shortens the entire process and makes it more efficient, which matters because natural conversation relies on how quickly a dialogue agent can assess input, measure the appropriateness of a response, and issue that response.
Li, Yuhua, Zuhair Bandar, David McLean, and James O’Shea. “A Method for Measuring Sentence Similarity and Its Application to Conversational Agents.” Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference (2004). AAAI. Web.