One model, two languages: training bilingual parsers with harmonized treebanks
We introduce an approach to train lexicalized parsers using bilingual corpora
obtained by merging harmonized treebanks of different languages, producing
parsers that can analyze sentences in either of the learned languages, or even
sentences that mix both. We test the approach on the Universal Dependency
Treebanks, training with MaltParser and MaltOptimizer. The results show that
these bilingual parsers are more than competitive, as most combinations not
only preserve accuracy, but some even achieve significant improvements over the
corresponding monolingual parsers. Preliminary experiments also show the
approach to be promising on texts with code-switching and when more languages
are added.
Comment: 7 pages, 4 tables, 1 figure
Treebanking user-generated content: A proposal for a unified representation in universal dependencies
The paper presents a discussion of the main linguistic phenomena of user-generated texts found in web and social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework. Given, on the one hand, the increasing number of treebanks featuring user-generated content and, on the other, its somewhat inconsistent treatment across these resources, the aim of this paper is twofold: (1) to provide a short, though comprehensive, overview of such treebanks - based on available literature - along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines, to promote consistent treatment of the particular phenomena found in these types of texts. The main goal of this paper is to provide a common framework for those teams interested in developing similar resources in UD, thus enabling cross-linguistic consistency, which is a principle that has always been in the spirit of UD.
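To illustrate the kind of decisions such guidelines must settle, a short user-generated sentence could be annotated in CoNLL-U as follows. The sentence and the specific choices here are hypothetical, not taken from the paper; e.g., tagging the @-mention as PROPN with a vocative relation and the hashtag as SYM with a discourse relation is only one of the options a unified representation could standardize (the `CorrectForm` MISC attribute for non-standard spellings does exist in UD practice):

```
# text = @anna this is sooo cool #nlp
1	@anna	@anna	PROPN	_	_	5	vocative	_	_
2	this	this	PRON	_	_	5	nsubj	_	_
3	is	be	AUX	_	_	5	cop	_	_
4	sooo	so	ADV	_	_	5	advmod	_	CorrectForm=so
5	cool	cool	ADJ	_	_	0	root	_	_
6	#nlp	#nlp	SYM	_	_	5	discourse	_	_
```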
Can Multilingual Language Models Transfer to an Unseen Dialect? A Case Study on North African Arabizi
Building natural language processing systems for non-standardized, low-resource
languages is a difficult challenge. The recent success of large-scale
multilingual pretrained language models provides new modeling tools to tackle
this. In this work, we study the ability of multilingual language models to
process an unseen dialect. We take user-generated North-African Arabic as our
case study, a resource-poor dialectal variety of Arabic with frequent
code-mixing with French and written in Arabizi, a non-standardized
transliteration of Arabic to Latin script. Focusing on two tasks,
part-of-speech tagging and dependency parsing, we show in zero-shot and
unsupervised adaptation scenarios that multilingual language models are able to
transfer to such an unseen dialect, specifically in two extreme cases: (i)
across scripts, using Modern Standard Arabic as a source language, and (ii)
from a distantly related language, unseen during pretraining, namely Maltese.
Our results constitute the first successful transfer experiments on this
dialect, thus paving the way for the development of an NLP ecosystem for
resource-scarce, non-standardized, and highly variable vernacular languages.