TA-COS 2018 : 2nd Workshop on Text Analytics for Cybersecurity and Online Safety : Proceedings

Author: De Pauw Guy
Desmet Bart
Lefever Els
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2018

Large scale crowdsourcing and characterization of Twitter abusive behavior

Author: Blackburn Jeremy
Chatzakou Despoina
Djouvas Constantinos
Founta Antigoni-Maria
Kourtellis Nicolas
Leontiadis Ilias
Sirivianos Michael
Stringhini Gianluca
Vakali Athena
Publication venue: AAAI Press
Publication date: 01/01/2018

In recent years online social networks have suffered an increase in sexism, racism, and other types of aggressive and cyberbullying behavior, often manifesting itself through offensive, abusive, or hateful language. Past scientific work focused on studying these forms of abusive activity in popular online social networks, such as Facebook and Twitter. Building on such work, we present an eight-month study of the various forms of abusive behavior on Twitter, in a holistic fashion. Departing from past work, we examine a wide variety of labeling schemes, which cover different forms of abusive behavior. We propose an incremental and iterative methodology that leverages the power of crowdsourcing to annotate a large collection of tweets with a set of abuse-related labels. By applying our methodology and performing statistical analysis for label merging or elimination, we identify a reduced but robust set of labels to characterize abuse-related tweets. Finally, we offer a characterization of our annotated dataset of 80 thousand tweets, which we make publicly available for further scientific exploration.

Accepted manuscript

Boston University Institutional Repository (OpenBU)

Holaaa!! Writin like u talk is kewl but kinda hard 4 NLP

Author: Domingo Judit
Marquina Montse
Melero Maite
Quixal Martí
Ruiz Costa-Jussà Marta
Publication date: 01/01/2012

We present work in progress aiming to build tools for the normalization of User-Generated Content (UGC). As we will see, the task requires the revisiting of the initial steps of NLP processing, since UGC (micro-blog, blog, and, generally, Web 2.0 user texts) presents a number of non-standard communicative and linguistic characteristics, and is in fact much closer to oral and colloquial language than to edited text. We present and characterize a corpus of UGC text in Spanish from three different sources: Twitter, consumer reviews and blogs. We motivate the need for UGC text normalization by analyzing the problems found when processing this type of text through a conventional language processing pipeline, particularly in the tasks of lemmatization and morphosyntactic tagging, and finally we propose a strategy for automatically normalizing UGC using a selector of correct forms on top of a pre-existing spell-checker.

Postprint (published version)

UPCommons. Portal del coneixement obert de la UPC