Testing Hateful Speeches against Policies

Authors: Girish Budhrani, Xueqing Liu, Ravishka Rathnasuriya, Wei Yang, Jiangrui Zheng
Publication date: 23 July 2023

Abstract

In recent years, many software systems have adopted AI techniques, especially deep learning. Because of their black-box nature, AI-based systems pose challenges to traceability: their behavior is determined by models and data, whereas requirements and policies are rules expressed in natural or programming language. To the best of our knowledge, few studies have examined how AI and deep neural network-based systems behave against rule-based requirements and policies. This experience paper examines deep neural network behavior against rule-based requirements described in natural language policies. In particular, we focus on a case study that checks AI-based content moderation software against content moderation policies. First, using crowdsourcing, we collect natural language test cases that match each moderation policy; we name this dataset HateModerate. Second, using the test cases in HateModerate, we measure the failure rates of state-of-the-art hate speech detection software and find that these models have high failure rates for certain policies. Finally, since manual labeling is costly, we propose an automated approach to augment HateModerate by fine-tuning OpenAI's large language models to automatically match new examples to policies. The dataset and code of this work can be found on our anonymous website: https://sites.google.com/view/content-moderation-project
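
The per-policy testing step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the record layout `(policy, text, expected)`, the function name `failure_rates`, and the toy classifier are all assumptions made for illustration, and the sample texts are neutral placeholders rather than entries from HateModerate.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical record layout: (policy id, test-case text, expected "should be moderated" label).
TestCase = Tuple[str, str, bool]


def failure_rates(
    cases: Iterable[TestCase],
    classify: Callable[[str], bool],
) -> Dict[str, float]:
    """Return, for each policy, the fraction of its test cases the classifier gets wrong."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for policy, text, expected in cases:
        totals[policy] += 1
        if classify(text) != expected:
            failures[policy] += 1
    return {policy: failures[policy] / totals[policy] for policy in totals}


def toy_classifier(text: str) -> bool:
    """Stand-in for a real detection model (assumption): flags text containing 'flagged'."""
    return "flagged" in text


# Placeholder test cases; every case is expected to be moderated.
cases = [
    ("policy_1", "flagged sample one", True),
    ("policy_1", "missed sample two", True),
    ("policy_2", "flagged sample three", True),
]

rates = failure_rates(cases, toy_classifier)
for policy, rate in sorted(rates.items()):
    print(f"{policy}: {rate:.2f}")
```

Grouping failures by policy, rather than reporting one aggregate number, is what makes it possible to see that a model fails disproportionately on certain policies.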

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2307.12418

Last updated on 28/07/2023