
Open Domain Dialogue Dataset Comparison Report

Bach vs. Others

This document presents a comparison between curated open-domain dialogue datasets available in the public domain and the data produced by AKA's Bach data platform. The report focuses on quantitative measurements that can be made in a transparent manner and that represent objective differences found in the data.

The analysis was performed using the following criteria:

Total Number of Tokens
Number of tokens is a measure of the overall size of the dataset, which is critical for training modern Deep Learning-based models. The Bach dataset is clearly the largest by this measure. Higher is better.
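As a minimal sketch of how such a count can be computed, the snippet below tallies tokens in an invented two-dialogue sample using plain whitespace tokenization; production pipelines often use subword tokenizers, which change the absolute numbers. The later sketches reuse this `dialogues` sample.

```python
# Toy sample, invented for illustration; later sketches reuse it.
dialogues = [
    ["Hi there!", "Hello, how can I help?"],
    ["What time is it?", "It is almost noon."],
]

# Total token count under simple whitespace tokenization.
total_tokens = sum(len(turn.split()) for dlg in dialogues for turn in dlg)
print("total tokens:", total_tokens)  # 15
```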
Vocabulary Size
Vocabulary size is the number of unique tokens appearing in the dataset. It represents the variety of speech in the dialogues. Our dataset is on par with the best-scoring competitor dataset and close to the target user's actual vocabulary. Higher is better.
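A matching sketch for this metric, reusing the `dialogues` sample from the previous snippet; lowercasing and the handling of punctuation are simplifying assumptions that affect the exact figure.

```python
# Vocabulary size: unique lowercased tokens across all turns.
# Punctuation is kept, so "it?" and "it" count as distinct types.
vocab = {tok.lower() for dlg in dialogues for turn in dlg for tok in turn.split()}
print("vocabulary size:", len(vocab))  # 14
```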
Lexical Redundancy
This plot shows how many times each word is used on average throughout the dataset (see https://en.wikipedia.org/wiki/Lexical_diversity). Higher lexical redundancy means that the user is exposed to more ways in which a given word is used. This is important for educational purposes as well as for training ML models.
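In terms of the two quantities above, lexical redundancy reduces to total tokens divided by vocabulary size (the inverse of the type-token ratio); a sketch over the same toy sample:

```python
# Lexical redundancy: average uses per unique word,
# i.e. total tokens divided by vocabulary size.
tokens = [tok.lower() for dlg in dialogues for turn in dlg for tok in turn.split()]
redundancy = len(tokens) / len(set(tokens))
print("average uses per word:", round(redundancy, 2))  # 15 / 14 ≈ 1.07
```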
Number of Utterances
Utterances are the individual dialogue turns, so their count also reflects the overall size of the dataset. A high number of utterances allows ML models to learn finer-grained semantic information about tokens. Higher is better.
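Over the same toy sample, the utterance count is simply the total number of turns:

```python
# Number of utterances: total dialogue turns in the dataset.
n_utterances = sum(len(dlg) for dlg in dialogues)
print("utterances:", n_utterances)  # 4
```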
Length of Dialogues
Here we plot the average length of dialogues. Shorter dialogues support more focused extraction of contextual information. When longer dialogues are needed, Bach provides the possibility to create them by linking one tree to another (MetaTree).
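The average length, again over the toy sample, is turns per dialogue; how a linked MetaTree would enter this average is specific to Bach and not modeled here.

```python
# Average dialogue length, measured in turns per dialogue.
avg_turns = sum(len(dlg) for dlg in dialogues) / len(dialogues)
print("average turns per dialogue:", avg_turns)  # 2.0
```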
Number of Child Responses
Here we plot the average number of children per utterance (excluding leaves); a minimal sketch of this computation follows the list below. Such a structure captures information about alternative dialogue options, which is beneficial for better topic understanding: in the same way that exposure to the same words in different utterances helps gain more information about words, the occurrence of the same context in different dialogues aids in understanding the utterances.
Points:
  • The dialogues in our dataset are tree-structured
  • All other available datasets have linear dialogue structure
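The sketch referenced above models a tree-structured dialogue with a hypothetical `Node` class (the layout is our invention, not Bach's actual schema) and computes the average number of children over non-leaf utterances:

```python
from dataclasses import dataclass, field

# Hypothetical node layout for a tree-structured dialogue;
# Bach's actual schema may differ.
@dataclass
class Node:
    text: str
    children: list = field(default_factory=list)

root = Node("Hi there!", [
    Node("Hello! How can I help?", [
        Node("What time is it?"),
        Node("Tell me a joke."),
    ]),
    Node("Hey, what's up?"),
])

def non_leaves(node):
    # Yield every node that has at least one child.
    if node.children:
        yield node
        for child in node.children:
            yield from non_leaves(child)

inner = list(non_leaves(root))
avg_children = sum(len(n.children) for n in inner) / len(inner)
print("average children per non-leaf utterance:", avg_children)  # 2.0
```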
Number of Dialogues
This plot shows the number of unique dialogues in each dataset. Each dialogue corresponds to a single training example for the model. The sheer number of dialogues allows us to use state-of-the-art Deep Learning techniques; see the sketch after the list below for how unique dialogues can be read out of a tree.
Points:
  • Our dataset has a much more diverse set of dialogues compared to others.
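If each root-to-leaf path is read out as one linear dialogue (our interpretation of how a tree yields multiple dialogues; the dialogue count then equals the leaf count), the enumeration looks like the sketch below, reusing the `Node` tree from the previous example:

```python
def paths(node, prefix=()):
    # Enumerate root-to-leaf paths; each path is one linear dialogue.
    prefix = prefix + (node.text,)
    if not node.children:
        yield prefix
    else:
        for child in node.children:
            yield from paths(child, prefix)

for dialogue in paths(root):
    print(" -> ".join(dialogue))
# Hi there! -> Hello! How can I help? -> What time is it?
# Hi there! -> Hello! How can I help? -> Tell me a joke.
# Hi there! -> Hey, what's up?
```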