When silver glitters more than gold: Bootstrapping an Italian part-of-speech tagger for Twitter

Author: Nissim Malvina
Plank Barbara
Publication date: 01/01/2016

We bootstrap a state-of-the-art part-of-speech tagger to tag Italian Twitter data, in the context of the Evalita 2016 PoSTWITA shared task. We show that training the tagger on native Twitter data enriched with small amounts of specifically selected gold data and additional silver-labelled data scraped from Facebook yields better results than using large amounts of manually annotated data from a mix of genres. Comment: Proceedings of the 5th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2016)

arXiv.org e-Print Archive

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

OpenEdition

Dissertations of the University of Groningen

EVALITA Evaluation of NLP and Speech Tools for Italian Proceedings of the Final Workshop

Author: Basile Pierpaolo
Cutugno Franco
Nissim Malvina
Patti Viviana
Sprugnoli Rachele
Publication venue: Torino
Publication date: 01/01/2016

Editor of the proceedings of EVALITA 2016

Archivio istituzionale della Ricerca - Università degli Studi di Parma

PubliCatt

Annotating Italian Social Media Texts in Universal Dependencies

Author: Bosco Cristina
Lavelli Alberto
Mazzei Alessandro
Sanguinetti Manuela
Tamburini Fabio
Publication venue: Linköping University Electronic Press
Publication date: 01/01/2017

Institutional Research Information System University of Turin

Detecting and Monitoring Hate Speech in Twitter

Author: Camacho-Collados Miguel
Liberatore Federico
Pereira-Kohatsu Juan Carlos
Quijano-Sánchez Lara
Publication venue: MDPI AG
Publication date: 01/01/2019

Social Media are sensors in the real world that can be used to measure the pulse of societies. However, the massive and unfiltered feed of messages posted in social media is a phenomenon that nowadays raises social alarms, especially when these messages contain hate speech targeted at a specific individual or group. In this context, governments and non-governmental organizations (NGOs) are concerned about the possible negative impact that these messages can have on individuals or on society. In this paper, we present HaterNet, an intelligent system currently being used by the Spanish National Office Against Hate Crimes of the Spanish State Secretariat for Security that identifies and monitors the evolution of hate speech in Twitter. The contributions of this research are manifold: (1) It introduces the first intelligent system that monitors and visualizes, using social network analysis techniques, hate speech in Social Media. (2) It introduces a novel public dataset on hate speech in Spanish consisting of 6000 expert-labeled tweets. (3) It compares several classification approaches based on different document representation strategies and text classification models. (4) The best approach consists of a combined LSTM+MLP neural network that takes as input the tweet’s word, emoji, and expression tokens’ embeddings enriched by the tf-idf, and obtains an area under the curve (AUC) of 0.828 on our dataset, outperforming previous methods presented in the literature. The work by Quijano-Sanchez was supported by the Spanish Ministry of Science and Innovation grant FJCI-2016-28855. The research of Liberatore was supported by the Government of Spain, grant MTM2015-65803-R, and by the European Union’s Horizon 2020 Research and Innovation Programme, under the Marie Sklodowska-Curie grant agreement No. 691161 (GEOSAFE). All the financial support is gratefully acknowledged.

Multidisciplinary Digital Publishing Institute

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Online Research @ Cardiff

Universidad Carlos III de Madrid e-Archivo

Biblos-e Archivo

MultiLexNorm: A Shared Task on Multilingual Lexical Normalization

Author: Baldwin T
Caselli T
Ljubešić N
Mahendra R
Muller B
Plank B
Ramponi A
Roncal ISV
Sidorenko W
van der Goot R
Zubiaga A
Çetinoğlu Ö
Çolakoğlu T
Publication venue: Workshop on Noisy User-Generated Text
Publication date: 01/01/2021

Lexical normalization is the task of transforming an utterance into its standardized form. This task is beneficial for downstream analysis, as it provides a way to harmonize (often spontaneous) linguistic variation. Such variation is typical for social media on which information is shared in a multitude of ways, including diverse languages and code-switching. Since the seminal work of Han and Baldwin (2011) a decade ago, lexical normalization has attracted attention in English and multiple other languages. However, there exists a lack of a common benchmark for comparison of systems across languages with a homogeneous data and evaluation setup. The MULTILEXNORM shared task sets out to fill this gap. We provide the largest publicly available multilingual lexical normalization benchmark including 12 language variants. We propose a homogenized evaluation setup with both intrinsic and extrinsic evaluation. As extrinsic evaluation, we use dependency parsing and part-of-speech tagging with adapted evaluation metrics (a-LAS, a-UAS, and a-POS) to account for alignment discrepancies. The shared task hosted at W-NUT 2021 attracted 9 participants and 18 submissions. The results show that neural normalization systems outperform the previous state-of-the-art system by a large margin. Downstream parsing and part-of-speech tagging performance is positively affected but to varying degrees, with improvements of up to 1.72 a-LAS, 0.85 a-UAS, and 1.54 a-POS for the winning system.

Queen Mary Research Online

Italian Event Detection Goes Deep Learning

Author: Caselli Tommaso
Publication date: 01/01/2018

This paper reports on a set of experiments with different word embeddings to initialize a state-of-the-art Bi-LSTM-CRF network for event detection and classification in Italian, following the EVENTI evaluation exercise. The network obtains a new state-of-the-art result by improving the F1 score by 1.3 points for detection and by 6.5 points for classification, using a single-step approach. The results also provide further evidence that embeddings have a major impact on the performance of such architectures. Comment: to appear at CLiC-it 201

arXiv.org e-Print Archive

Crossref

Proceedings - University of Groningen

University of Groningen

ARTS repository - University of Groningen

OpenEdition

Dissertations of the University of Groningen

Bi-directional LSTM-CNNs-CRF for Italian Sequence Labeling and Multi-Task Learning

Author: Basile Pierpaolo
Cassotti Pierluigi
Semeraro Giovanni
Siciliani Lucia
Publication venue: OpenEdition
Publication date: 15/12/2020

In this paper, we propose a Deep Learning architecture for several Italian Natural Language Processing tasks based on a state-of-the-art model that exploits both word- and character-level representations through the combination of bidirectional LSTM, CNN and CRF. This architecture provided state-of-the-art performance in several sequence labeling tasks for the English language. We exploit the same approach for the Italian language and extend it to perform multi-task learning involving PoS-tagging and sentiment analysis. Results show that the system is able to achieve state-of-the-art performance in all the tasks and in some cases outperforms the best systems previously developed for the Italian language.

OpenEdition

MultiLexNorm: A Shared Task on Multilingual Lexical Normalization

Author: Baldwin Timothy
Caselli Tommaso
Ljubešić Nikola
Mahendra Rahmad
Muller Benjamin
Plank Barbara
Ramponi Alan
San Vicente Roncal Iñaki
Sidorenko Wladimir
van der Goot Rob
Zubiaga Arkaitz
Çetinoğlu Özlem
Çolakoğlu Talha
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2021

Lexical normalization is the task of transforming an utterance into its standardized form. This task is beneficial for downstream analysis, as it provides a way to harmonize (often spontaneous) linguistic variation. Such variation is typical for social media on which information is shared in a multitude of ways, including diverse languages and code-switching. Since the seminal work of Han and Baldwin (2011) a decade ago, lexical normalization has attracted attention in English and multiple other languages. However, there exists a lack of a common benchmark for comparison of systems across languages with a homogeneous data and evaluation setup. The MultiLexNorm shared task sets out to fill this gap. We provide the largest publicly available multilingual lexical normalization benchmark including 13 language variants. We propose a homogenized evaluation setup with both intrinsic and extrinsic evaluation. As extrinsic evaluation, we use dependency parsing and part-of-speech tagging with adapted evaluation metrics (a-LAS, a-UAS, and a-POS) to account for alignment discrepancies. The shared task hosted at W-NUT 2021 attracted 9 participants and 18 submissions. The results show that neural normalization systems outperform the previous state-of-the-art system by a large margin. Downstream parsing and part-of-speech tagging performance is positively affected but to varying degrees, with improvements of up to 1.72 a-LAS, 0.85 a-UAS, and 1.54 a-POS for the winning system.

Proceedings - University of Groningen

University of Groningen

Archivio della ricerca - Fondazione Bruno Kessler

ARTS repository - University of Groningen

The IT University of Copenhagen's Repository

Dissertations of the University of Groningen