One model, two languages: training bilingual parsers with harmonized treebanks
We introduce an approach to train lexicalized parsers using bilingual corpora
obtained by merging harmonized treebanks of different languages, producing
parsers that can analyze sentences in either of the learned languages, or even
sentences that mix both. We test the approach on the Universal Dependency
Treebanks, training with MaltParser and MaltOptimizer. The results show that
these bilingual parsers are more than competitive, as most combinations not
only preserve accuracy, but some even achieve significant improvements over the
corresponding monolingual parsers. Preliminary experiments also show the
approach to be promising on texts with code-switching and when more languages
are added.
Comment: 7 pages, 4 tables, 1 figure
Treebanking user-generated content: A proposal for a unified representation in universal dependencies
The paper presents a discussion of the main linguistic phenomena of user-generated texts found in web and social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework. Given, on the one hand, the increasing number of treebanks featuring user-generated content and, on the other, its somewhat inconsistent treatment across these resources, the aim of this paper is twofold: (1) to provide a short, though comprehensive, overview of such treebanks - based on available literature - along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines, to promote consistent treatment of the particular phenomena found in these types of texts. The main goal of this paper is to provide a common framework for those teams interested in developing similar resources in UD, thus enabling cross-linguistic consistency, which is a principle that has always been in the spirit of UD.
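To illustrate the kind of decisions such guidelines must settle, a short user-generated sentence could be annotated in CoNLL-U as follows. The sentence and the specific choices here are hypothetical, not taken from the paper; e.g., tagging the @-mention as PROPN with a vocative relation and the hashtag as SYM with a discourse relation is only one of the options a unified representation could standardize (the `CorrectForm` MISC attribute for non-standard spellings does exist in UD practice):

```
# text = @anna this is sooo cool #nlp
1	@anna	@anna	PROPN	_	_	5	vocative	_	_
2	this	this	PRON	_	_	5	nsubj	_	_
3	is	be	AUX	_	_	5	cop	_	_
4	sooo	so	ADV	_	_	5	advmod	_	CorrectForm=so
5	cool	cool	ADJ	_	_	0	root	_	_
6	#nlp	#nlp	SYM	_	_	5	discourse	_	_
```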
Can Multilingual Language Models Transfer to an Unseen Dialect? A Case Study on North African Arabizi
Building natural language processing systems for non-standardized, low-resource
languages is a difficult challenge. The recent success of large-scale
multilingual pretrained language models provides new modeling tools to tackle
this. In this work, we study the ability of multilingual language models to
process an unseen dialect. We take user-generated North-African Arabic as our
case study, a resource-poor dialectal variety of Arabic with frequent
code-mixing with French and written in Arabizi, a non-standardized
transliteration of Arabic to Latin script. Focusing on two tasks,
part-of-speech tagging and dependency parsing, we show in zero-shot and
unsupervised adaptation scenarios that multilingual language models are able to
transfer to such an unseen dialect, specifically in two extreme cases: (i)
across scripts, using Modern Standard Arabic as a source language, and (ii)
from a distantly related language, unseen during pretraining, namely Maltese.
Our results constitute the first successful transfer experiments on this
dialect, thus paving the way for the development of an NLP ecosystem for
resource-scarce, non-standardized, and highly variable vernacular languages.