Meet Musio



This paper focuses on a single question: given a large amount of unlabelled audio input, is it possible to categorize it automatically, without supervision, by applying frame-semantics theory? In other words, once you are left with a large collection of parsed audio samples, how do you group the parsed parts of speech? You assign values and clustering criteria to the data in order to fill preassigned semantic slots. To do this, the authors combine a state-of-the-art frame-semantic parser with a spectral-clustering-based slot-ranking model that adapts the generic output of the parser to the target semantic space.

Traditionally, semantic categorization of parsed audio samples has been annotated and compiled by hand according to frameworks created by developers or domain experts, which is time-consuming and expensive.  With the method proposed here, predefined slots are no longer needed, and development and data-collection time would be greatly reduced.

To cope with the variability of audio inputs, the authors use spoken language understanding (SLU) to map those inputs to the semantic representations that best capture a speaker's intentions.

Having developers and other professionals manually define the process for domain-specific tasks is expensive and labor intensive.  And there is another downside to this method: categorizing lexical data this way can be limiting in real-world applications due to the dynamic nature of language.  Instead, the authors take audio inputs that have been parsed probabilistically and cluster the resulting parses into candidate slots based on the conversational data itself.  They want to show that doing it this way, that is, without having someone go in and individually define the semantic slots into which the parses should fall, is more efficient and just as accurate as doing it by hand.  In other words, they want to demonstrate that their automatically induced groupings are as accurate as groupings categorized and clustered according to hand-built rubrics and rulesets.  To do this, they compare their automated, frequency-based slot ranking against slot sets compiled by hand.

The authors state that their main goal is “to use a FrameNet-trained statistical probabilistic semantic parser [14] to generate initial frame-semantic parses from automatic speech recognition (ASR) decodings of the raw audio conversation files.”  But how does one arrange the results of the FrameNet parsing so that they are useful when applied to SDS (spoken dialogue systems)?  Assuming that word meanings can more or less be captured in a semantic framework, FrameNet breaks them down into three categories: frames, frame elements, and lexical units.
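To make those three categories concrete, here is a minimal sketch of what a single frame-semantic parse might carry. The class name, field names, and example utterance are my own illustration, not structures from the paper or from SEMAFOR's actual output format:

```python
from dataclasses import dataclass, field

@dataclass
class FrameParse:
    """One frame-semantic parse of an utterance (illustrative structure)."""
    frame: str                 # the evoked frame, e.g. "expensiveness"
    lexical_unit: str          # the word that evoked it, e.g. "cheap"
    frame_elements: dict = field(default_factory=dict)  # role -> span text

# A hypothetical parse for the utterance "can i have a cheap restaurant"
parse = FrameParse(
    frame="expensiveness",
    lexical_unit="cheap",
    frame_elements={"item": "restaurant"},
)
print(parse.frame, parse.lexical_unit)  # expensiveness cheap
```

The point is simply that each parse ties a surface word (the lexical unit) to an abstract frame, and that frame's elements are the natural candidates for semantic slots.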

SEMAFOR, the frame-semantic parser used here, is fairly accurate at predicting semantic categories when measured against data annotated and compiled by hand.  The hand-defined slots therefore serve as a sort of “control”: the standard against which the authors judge how well their automatically induced slots line up.

The general idea was to separate the parses into “generic” and “domain-specific” ones.  Word frequency and “coherence of values” are the two measures used to rank candidate domain-specific slots: a slot whose filler values are semantically coherent is seen as more prominent, and the same goes for slots evoked by frequent words.  In the end, spectral clustering was used in the authors’ slot-ranking model, for three reasons.  First, “spectral clustering is very easy to implement, and can be solved efficiently by standard linear algebra techniques.”  It is also “invariant to the shapes and densities of each cluster.”  And finally, “spectral clustering projects the manifolds within data into solvable space, and often outperform other clustering approaches.”
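As a rough illustration of the frequency half of that ranking (the function, the toy parser output, and the frame names below are invented for this sketch; the paper's actual model also weighs in the coherence of each slot's filler values, which is omitted here):

```python
from collections import Counter

def rank_slots_by_frequency(parses):
    """Rank candidate slots (frames) by how often they are evoked.

    `parses` is a list of (frame, filler) pairs taken from a
    frame-semantic parser's output; the intuition is that frequently
    evoked frames are more likely to be domain-specific slots.
    """
    counts = Counter(frame for frame, _ in parses)
    return [frame for frame, _ in counts.most_common()]

# Toy parser output for a restaurant-query domain
parses = [
    ("food", "thai"), ("food", "chinese"), ("food", "indian"),
    ("expensiveness", "cheap"), ("food", "italian"),
    ("locale_by_use", "restaurant"), ("expensiveness", "expensive"),
]
print(rank_slots_by_frequency(parses))
# ['food', 'expensiveness', 'locale_by_use']
```

Frequency alone would also promote generic frames that happen to appear often, which is exactly why the coherence measure is needed alongside it.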

The authors then verified their results in two ways.  The first, discussed earlier, is comparing their induced slots against frameworks manually created by domain experts.  The second is comparing them against hand-parsed samples annotated by actual human beings.

To determine how closely the induced slots correspond to the reference slots in the first test, the quality of the semantic relationship between the two sets of slot labels is assessed.  Essentially, the authors match the output of their spectral clustering against the parses sorted into the predetermined semantic slots.  Of course, the labels will rarely be identical, so the task boils down to measuring the shortest semantic distance between two related words.
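The shortest-distance idea can be sketched as a graph search. The word graph below is entirely hand-made for illustration; in practice a lexical resource would supply the links between words, and the paper's actual distance measure may differ:

```python
from collections import deque

# Tiny invented semantic graph: an edge means "closely related".
edges = {
    "food":    ["cuisine", "meal"],
    "cuisine": ["food", "dish"],
    "meal":    ["food", "dish"],
    "dish":    ["cuisine", "meal"],
    "price":   ["cost"],
    "cost":    ["price"],
}

def semantic_distance(a, b):
    """Length of the shortest path between two words (BFS); None if unconnected."""
    if a == b:
        return 0
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        word, dist = queue.popleft()
        for nxt in edges.get(word, []):
            if nxt == b:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

print(semantic_distance("food", "dish"))   # 2
print(semantic_distance("food", "price"))  # None (unrelated in this toy graph)
```

A short distance between an induced slot label and a reference slot label counts as evidence that the two slots mean roughly the same thing, even when the labels themselves differ.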

Testing with the second method is much easier.  The authors simply extract the words from the slots generated by spectral clustering and compare them to the hand-parsed word lists.  The results are scored in two ways: a strict mode where only exact word matches are accepted, and a relaxed mode where at least one word in a slot with multiple fillers must match.  Overall, matches were frequent and more or less consistently accurate.
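The two scoring modes can be sketched like this. The function, the simplified set-based scoring, and the example slots are my own; the paper's exact evaluation protocol may differ:

```python
def match_accuracy(induced, reference, mode="exact"):
    """Fraction of reference slots matched by some induced slot.

    Each slot is a set of filler words.  In "exact" mode the word sets
    must be identical; in "any" mode one shared word is enough.
    """
    hits = 0
    for ref in reference:
        for ind in induced:
            if (mode == "exact" and ind == ref) or \
               (mode == "any" and ind & ref):
                hits += 1
                break
    return hits / len(reference)

# Toy hand-annotated slots vs. automatically induced slots
reference = [{"cheap", "expensive"}, {"thai", "chinese", "indian"}]
induced   = [{"cheap", "expensive"}, {"thai", "italian"}]

print(match_accuracy(induced, reference, "exact"))  # 0.5
print(match_accuracy(induced, reference, "any"))    # 1.0
```

As expected, the relaxed "any" mode always scores at least as high as the strict mode, so reporting both brackets the system's true performance.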

In conclusion, the results of the authors’ proposed method show that it is indeed possible to accurately induce a framework and slots for parsed samples without the time-consuming task of manually defining those slots and frameworks by hand.  Using spectral clustering, the authors generated slot sets that matched well with those created by developers, and they were also able to compare the parses filled into these automatically induced slots against the results of hand-parsed inputs.  Although word-relatedness and match frequency dipped for some samples, the authors are clearly making progress toward further automating the development of spoken dialogue systems and spoken language understanding.
