Revisiting Contextual Toxicity Detection in Conversations
Understanding toxicity in user conversations is undoubtedly an important
problem. Addressing "covert" or implicit cases of toxicity is particularly hard
and requires context. Very few previous studies have analysed the influence of
conversational context on human perception or on automated detection models. We
dive deeper into both directions. We start by analysing existing contextual
datasets and conclude that human toxicity labelling is generally influenced by
the conversational structure, polarity, and topic of the context. We then
propose to bring these findings into
computational detection models by introducing and evaluating (a) neural
architectures for contextual toxicity detection that are aware of the
conversational structure, and (b) data augmentation strategies that help models
capture contextual toxicity. Our results show the encouraging potential of
neural architectures that are aware of the conversational structure. We also
demonstrate that such models can benefit from synthetic data, especially in the
social media domain.
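To illustrate the kind of structure-aware architecture this abstract points to, below is a minimal sketch of a classifier that encodes the parent comment (the conversational context) alongside the target comment before classifying. This is not the authors' implementation: the class name, the GRU encoder, the toy vocabulary, and all hyperparameters are illustrative assumptions.

# Hedged sketch: a context-aware toxicity classifier. All names and
# hyperparameters are illustrative assumptions, not the paper's model.
import torch
import torch.nn as nn

class ContextAwareClassifier(nn.Module):
    """Encodes the parent comment (context) and the target comment with a
    shared GRU encoder, then classifies their joint representation."""

    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # The classifier sees [context ; target], a simple form of
        # structure-aware fusion of the two utterances.
        self.classifier = nn.Linear(2 * hidden_dim, 2)  # toxic vs. non-toxic

    def encode(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Use the final hidden state as a fixed-size utterance representation.
        _, last_hidden = self.encoder(self.embedding(token_ids))
        return last_hidden.squeeze(0)  # (batch, hidden_dim)

    def forward(self, context_ids: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.encode(context_ids), self.encode(target_ids)], dim=-1)
        return self.classifier(fused)

# Toy usage with random token ids (batch of 2, sequences of length 10).
model = ContextAwareClassifier(vocab_size=1000)
context = torch.randint(1, 1000, (2, 10))
target = torch.randint(1, 1000, (2, 10))
logits = model(context, target)
print(logits.shape)  # torch.Size([2, 2])

In practice the shared GRU would likely be replaced by a pretrained transformer encoder, but the fusion of context and target representations is the structural idea the abstract describes.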
SoK: Content Moderation in Social Media, from Guidelines to Enforcement, and Research to Practice
To counter online abuse and misinformation, social media platforms have been
establishing content moderation guidelines and employing various moderation
policies. The goal of this paper is to study these community guidelines and
moderation practices, as well as the relevant research publications, to identify
research gaps, differences in moderation techniques, and challenges that should
be tackled by social media platforms and the research community at large. To
this end, we study, analyze, and consolidate the content moderation guidelines
and practices of the fourteen most popular social media platforms in the US
jurisdiction. We then introduce three taxonomies drawn from this analysis and
from a review of over one hundred interdisciplinary research papers on
moderation strategies. We identify the differences between the content
moderation employed by mainstream social media platforms and by fringe
platforms. We also highlight the implications of Section 230, the tension
between transparency and opacity in content moderation, and why platforms
should shift from a one-size-fits-all model to a more inclusive one. Lastly, we
argue that there is a need for a collaborative human-AI system.
Pathways to Online Hate: Behavioural, Technical, Economic, Legal, Political & Ethical Analysis
The Alfred Landecker Foundation seeks to create a safer digital space for all. The work of the Foundation helps to develop research, convene stakeholders to share
valuable insights, and support entities that combat online harms, specifically online hate, extremism, and disinformation. Overall, the Foundation seeks to reduce hate and harm in the digital space tangibly and measurably by using its resources in the most impactful way. It also aims to help build an ecosystem that can prevent, minimise, and mitigate online harms while preserving open societies and healthy democracies. A non-exhaustive literature review was undertaken to explore the main facets of harm and hate speech in the evolving online landscape and to analyse behavioural, technical, economic, legal, political, and ethical drivers; key findings are detailed in this report.
SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution
Advanced text-to-image models such as DALL-E 2 and Midjourney possess the
capacity to generate highly realistic images, raising significant concerns
regarding the potential proliferation of unsafe content. This includes adult,
violent, or deceptive imagery of political figures. Despite claims of rigorous
safety mechanisms implemented in these models to restrict the generation of
not-safe-for-work (NSFW) content, we successfully devise and exhibit the first
prompt attacks on Midjourney, resulting in the production of abundant
photorealistic NSFW images. We reveal the fundamental principles of such prompt
attacks and suggest strategically substituting high-risk sections within a
suspect prompt to evade closed-source safety measures. Our novel framework,
SurrogatePrompt, systematically generates attack prompts, utilizing large
language models, image-to-text, and image-to-image modules to automate attack
prompt creation at scale. Evaluation results show an 88% success rate in
bypassing Midjourney's proprietary safety filter with our attack prompts,
leading to the generation of counterfeit images depicting political figures in
violent scenarios. Both subjective and objective assessments validate that the
images generated from our attack prompts present considerable safety hazards.