
Abusive Language Detection for Muse Engine

Overview

Bach, the multi-linked dialogue data platform for the Muse engine, draws on multiple resources – artificial intelligence, human reviewers, an automated rating system, etc. – in an effort to generate the best human-machine conversations, and noisy data is an inevitable by-product of that development process. Noisy data is data that carries no useful meaning, and the notion can be extended to include abusive language, which posed one of the challenges we encountered while developing the Muse engine. In this blog post, we describe the development of the Muse engine's abusive language detection system and demonstrate its efficacy by comparing it against different models for detecting abusive language. In brief, AKA's abusive language detection system performs well by extracting additional features such as sentiment words.


Data

Abusive language refers to any type of insult, vulgarity, or profanity that debases its target; it can also be anything that causes aggravation. In other words, abusive language is a bit of a catch-all term. Moreover, detecting abusive language is often more difficult than one might expect, for a variety of reasons. For instance, a message can be regarded as harmless on its own, but when previous threads are taken into account it may be seen as abusive, and vice versa. Other difficulties remain on the table as well, such as sarcasm, the challenge of tracking every insult against minorities, and the intentional obfuscation of words. To make the definition of abusive language suitable for our task, we redefined the concept to encompass hate speech, profanity, and derogatory language, with reference to the following datasets. To classify these subcategories of abusive language, we chose to experiment with three datasets: the Wikipedia Abusive Language (Personal Attack) Data Set, the Aggressive Language Identification Dataset: Trolling, Aggression and Cyberbullying (TRAC1), and the Crowdflower Twitter Hate Speech Dataset.

Figure 1: Wikipedia Abusive Language Data Set
Figure 2: Aggressive Language Identification
Figure 3: Crowdflower Twitter Hate Speech


Features

In the data pre-processing stage, we combine the three datasets used for this work. This involves removing unnecessary columns from the datasets and enumerating the classes. Classes such as toxic, severe toxic, aggressive, sexism, racism, and all associated labels are treated as abusive language according to the definition above. In the text pre-processing steps, we convert the texts to lowercase and remove unnecessary content. The input is then tokenized into words and converted into 300-dimensional word embedding vectors using one million word vectors trained on Wikipedia with FastText and GloVe. We also use the Porter stemmer to reduce the inflectional forms of words. After combining the datasets in the proper format, we randomly shuffle and split the data into two parts: a training set containing 70% of the samples and a test set containing 30%. The main steps are listed below, followed by a short code sketch.

  • Remove irrelevant characters, space patterns, URLs, stopwords, and punctuation marks
  • Tokenize into words
  • Porter stemming: remove common morphological and inflectional endings from words
  • Padding: standardize the input length
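A minimal sketch of this pipeline using NLTK is shown below. The exact parameters (e.g. the padding length and the stopword list) are not specified in the post, so the values here are assumptions.

```python
# Sketch of the pre-processing steps: lowercase, strip URLs/punctuation,
# tokenize, remove stopwords, Porter-stem, and pad to a fixed length.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()
MAX_LEN = 100  # assumed padding length

def preprocess(text: str) -> list[str]:
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # remove punctuation/digits
    text = re.sub(r"\s+", " ", text).strip()    # collapse space patterns
    tokens = [STEMMER.stem(t) for t in text.split() if t not in STOP_WORDS]
    # pad (or truncate) to a standard input length
    return tokens[:MAX_LEN] + ["<pad>"] * max(0, MAX_LEN - len(tokens))
```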

We extract n-gram features from the texts and weight them according to their TF-IDF values. The goal of using TF-IDF is to reduce the effect of less informative tokens that appear very frequently in the datasets. Experiments are performed for values of n ranging from one to three, so we consider unigram, bigram, and trigram features. The TF-IDF of a term t in a document d is computed as tf-idf(t, d) = tf(t, d) × log(D / df(t)), where tf(t, d) is the frequency of t in d, df(t) is the number of documents containing t, and D is the total number of documents in the corpus. We feed these features to both the machine learning models and the neural-network-based models.
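This kind of weighted n-gram extraction can be sketched with scikit-learn's TfidfVectorizer; the example texts below are placeholders, not samples from our datasets.

```python
# Unigram-to-trigram TF-IDF features, as described above.
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["you are an idiot", "have a nice day"]  # placeholder examples
test_texts = ["what a nice idiot"]

vectorizer = TfidfVectorizer(ngram_range=(1, 3), lowercase=True)
X_train = vectorizer.fit_transform(train_texts)  # fit IDF on the training set
X_test = vectorizer.transform(test_texts)        # reuse the same vocabulary
```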


In an attempt to find the best way to solve the abusive language problem, we not only experimented with different pre-processing techniques but also worked to obtain additional training data and to enrich the given texts with additional meta-information. A typical example of such extra-linguistic information is sentiment features. The sentiment features are determined using VADER (Valence Aware Dictionary and sEntiment Reasoner), which can assign each text a positive and a negative score.
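As a minimal sketch, the positive and negative scores can be obtained from the vaderSentiment package and appended to the feature vector; how exactly the scores are combined with the other features is our design choice, not something VADER prescribes.

```python
# VADER assigns each text a dict of 'neg', 'neu', 'pos', and 'compound' scores.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("You are a wonderful person")
# keep the positive and negative scores as extra-linguistic features
sentiment_features = [scores["pos"], scores["neg"]]
```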


Model

These days, machine learning models are actively deployed in the field to detect abusive language in online environments. We therefore consider two machine learning algorithms commonly used for text classification: Logistic Regression and Naive Bayes. We also implemented various neural-network-based models such as CNNs, RNNs, and their variants. As mentioned above, pre-trained GloVe and FastText representations are used for the word-level features. Each model is trained on the training set by performing a grid search over all combinations of feature parameters with 10-fold cross-validation, and its performance is analyzed based on the average cross-validation score for each combination of feature parameters.
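The post does not specify the exact RNN architecture, so the sketch below, a bidirectional LSTM in Keras with a frozen pre-trained embedding layer, is only one plausible variant; the layer sizes and hyperparameters are assumptions.

```python
# One possible RNN variant for binary ABUSE vs. OTHER classification.
import numpy as np
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Bidirectional, Dense, Dropout, Embedding, LSTM
from tensorflow.keras.models import Sequential

VOCAB_SIZE, EMBED_DIM = 50_000, 300
# In practice this matrix is filled row-by-row from the pre-trained
# GloVe/FastText vectors; zeros here just keep the sketch self-contained.
embedding_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM))

model = Sequential([
    Embedding(VOCAB_SIZE, EMBED_DIM,
              embeddings_initializer=Constant(embedding_matrix),
              trainable=False),                # keep pre-trained vectors frozen
    Bidirectional(LSTM(64)),                   # assumed recurrent layer size
    Dropout(0.5),
    Dense(1, activation="sigmoid"),            # ABUSE vs. OTHER
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```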


Results

We compute the average accuracy for the binary classification (ABUSE vs. OTHER) and report precision, recall, and F1 score for each class. As the table below shows, most models traded precision against recall in similar ways, so we set the F1 score as the main criterion. As a result, we chose the RNN model, which showed the highest F1 score among them, as the best model. There were small differences across the three datasets, but we regarded them as within the margin of error. To solve AKA's abusive language detection task, an RNN-based variant model was selected, which we have since tested and applied to other datasets such as Bach.

Table 1: Experimental results on test sets of three data sources using learning models
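For reference, per-class precision, recall, and F1 scores like those in the table can be produced with scikit-learn's classification_report; the labels below are placeholders, not our actual predictions.

```python
# Per-class precision/recall/F1 for the binary ABUSE vs. OTHER task.
from sklearn.metrics import classification_report

y_test = [1, 0, 1, 1, 0]  # 1 = ABUSE, 0 = OTHER (placeholder labels)
y_pred = [1, 0, 0, 1, 0]
print(classification_report(y_test, y_pred, target_names=["OTHER", "ABUSE"]))
```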


Applications & Future Work

We have validated our best model (i.e., the RNN-based models) on other datasets. Below are sample results of multi-label classification for abusive language detection. We combined all abuse-related labels into "ABUSIVE" in the pre-processing stage and left the other labels as they were. The sample results below show that the predictions are quite accurate. Moreover, by applying our learning model to three different datasets, we show that it can also be applied to our own dataset, Bach. Of course, most of the Bach dataset consists of refined data that is not abusive language, so the results do not mean much in terms of abusive language detection alone. Nevertheless, applying the model yields meaningful results that help improve the user experience in our service.


Table 2: Abusive language detection examples classified by RNN-based models