583 research outputs found

    Revisiting Contextual Toxicity Detection in Conversations

    Full text link
    Understanding toxicity in user conversations is undoubtedly an important problem. Addressing "covert" or implicit cases of toxicity is particularly hard and requires context. Very few previous studies have analysed the influence of conversational context on human perception or on automated detection models. We dive deeper into both of these directions. We start by analysing existing contextual datasets and conclude that toxicity labelling by humans is, in general, influenced by the conversational structure, polarity and topic of the context. We then propose to bring these findings into computational detection models by introducing and evaluating (a) neural architectures for contextual toxicity detection that are aware of the conversational structure, and (b) data augmentation strategies that can help model contextual toxicity detection. Our results show the encouraging potential of neural architectures that are aware of the conversational structure. We have also demonstrated that such models can benefit from synthetic data, especially in the social media domain.
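    To make the idea of conversation-structure-aware detection concrete, the sketch below scores a comment together with its parent post as a sentence pair, so the encoder can attend across the conversational boundary. This is only an illustration under assumptions: the encoder checkpoint, the two-label setup, and the example inputs are placeholders, not the architecture proposed in the paper, and the classification head here is untrained.

```python
# Minimal illustrative sketch (not the paper's model): encode the parent
# comment (conversational context) and the target comment as a sentence pair
# and read off a toxicity probability from a sequence classifier.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# num_labels=2 stands for toxic / non-toxic; the head is randomly initialised
# here and would need fine-tuning on a contextual toxicity dataset.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

def score_with_context(context: str, comment: str) -> float:
    # Sentence-pair encoding: [CLS] context [SEP] comment [SEP]
    inputs = tokenizer(context, comment, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()  # P(toxic)

print(score_with_context("I disagree with your point.", "People like you never learn."))
```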

    SoK: Content Moderation in Social Media, from Guidelines to Enforcement, and Research to Practice

    Full text link
    To counter online abuse and misinformation, social media platforms have been establishing content moderation guidelines and employing various moderation policies. The goal of this paper is to study these community guidelines and moderation practices, as well as the relevant research publications, to identify the research gaps, differences in moderation techniques, and challenges that should be tackled by the social media platforms and the research community at large. In this regard, we study and analyze, within the US jurisdiction, the content moderation guidelines and practices of the fourteen most popular social media platforms, and consolidate them. We then introduce three taxonomies drawn from this analysis and from over one hundred interdisciplinary research papers on moderation strategies. We identify the differences between the content moderation employed by mainstream social media platforms and that employed by fringe platforms. We also highlight the implications of Section 230, the need for transparency and opacity in content moderation, why platforms should shift from a one-size-fits-all model to a more inclusive model, and, lastly, why there is a need for a collaborative human-AI system.

    Pathways to Online Hate: Behavioural, Technical, Economic, Legal, Political & Ethical Analysis.

    Get PDF
    The Alfred Landecker Foundation seeks to create a safer digital space for all. The work of the Foundation helps to develop research, convene stakeholders to share valuable insights, and support entities that combat online harms, specifically online hate, extremism, and disinformation. Overall, the Foundation seeks to reduce hate and harm tangibly and measurably in the digital space by using its resources in the most impactful way. It also aims to assist in building an ecosystem that can prevent, minimise, and mitigate online harms while at the same time preserving open societies and healthy democracies. A non-exhaustive literature review was undertaken to explore the main facets of harm and hate speech in the evolving online landscape and to analyse behavioural, technical, economic, legal, political and ethical drivers; key findings are detailed in this report.

    SurrogatePrompt: Bypassing the Safety Filter of Text-To-Image Models via Substitution

    Full text link
    Advanced text-to-image models such as DALL-E 2 and Midjourney possess the capacity to generate highly realistic images, raising significant concerns regarding the potential proliferation of unsafe content. This includes adult, violent, or deceptive imagery of political figures. Despite claims of rigorous safety mechanisms implemented in these models to restrict the generation of not-safe-for-work (NSFW) content, we successfully devise and exhibit the first prompt attacks on Midjourney, resulting in the production of abundant photorealistic NSFW images. We reveal the fundamental principles of such prompt attacks and suggest strategically substituting high-risk sections within a suspect prompt to evade closed-source safety measures. Our novel framework, SurrogatePrompt, systematically generates attack prompts, utilizing large language models, image-to-text, and image-to-image modules to automate attack prompt creation at scale. Evaluation results disclose an 88% success rate in bypassing Midjourney's proprietary safety filter with our attack prompts, leading to the generation of counterfeit images depicting political figures in violent scenarios. Both subjective and objective assessments validate that the images generated from our attack prompts present considerable safety hazards.

    Extreme Digital Speech: Contexts, Responses, and Solutions

    Get PDF

    • …