Universal Dependencies Parsing for Colloquial Singaporean English
Singlish can be interesting to the ACL community both linguistically, as a
major creole based on English, and computationally, for information extraction
and sentiment analysis of regional social media. We investigate dependency
parsing of Singlish by constructing a dependency treebank under the Universal
Dependencies scheme, and then training a neural network model by integrating
English syntactic knowledge into a state-of-the-art parser trained on the
Singlish treebank. Results show that English knowledge yields a 25% relative
error reduction, producing a parser with 84.47% accuracy. To the best of our
knowledge, we are the first to use neural stacking to improve cross-lingual
dependency parsing on low-resource languages. We make both our annotation and
parser available for further research.
Comment: Accepted by ACL 2017
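Neural stacking, as invoked in this abstract, feeds a source-language parser's hidden representations into the target-language parser's input. The sketch below is purely illustrative (hypothetical dimensions, random stand-in encoders, not the paper's actual architecture): an "English" encoder produces per-token features that are concatenated with the Singlish parser's own embeddings.

```python
# Minimal sketch of neural stacking for cross-lingual parsing.
# All names, shapes, and the random "encoders" are illustrative
# assumptions, not the paper's actual model.
import numpy as np

rng = np.random.default_rng(0)

def source_encoder(token_ids, dim=8):
    """Stand-in for a trained English parser's encoder: one vector per token."""
    emb = rng.standard_normal((100, dim))
    return emb[token_ids]                      # (n_tokens, dim)

def stacked_input(token_ids, target_emb_dim=6):
    """Target-parser input = its own embeddings ++ the source features."""
    own = rng.standard_normal((len(token_ids), target_emb_dim))
    src = source_encoder(token_ids)            # transferred "English knowledge"
    return np.concatenate([own, src], axis=1)  # (n_tokens, 6 + 8)

x = stacked_input([3, 17, 42])
print(x.shape)  # one row per token, 14 features each
```

The target parser then trains on top of these stacked features, so the low-resource model can exploit representations learned from the high-resource language.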
Treebanking user-generated content: A proposal for a unified representation in universal dependencies
The paper discusses the main linguistic phenomena of user-generated texts found on the web and in social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework. Given, on the one hand, the increasing number of treebanks featuring user-generated content and, on the other, its somewhat inconsistent treatment across these resources, the aim of this paper is twofold: (1) to provide a short but comprehensive overview of such treebanks, based on the available literature, along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines to promote consistent treatment of the particular phenomena found in these types of texts. The main goal is to provide a common framework for teams interested in developing similar resources in UD, thus enabling cross-linguistic consistency, a principle that has always been in the spirit of UD.
Parsing with Multilingual BERT, a Small Corpus, and a Small Treebank
Pretrained multilingual contextual representations have shown great success,
but due to the limits of their pretraining data, their benefits do not apply
equally to all language varieties. This presents a challenge for language
varieties unfamiliar to these models, whose labeled and unlabeled data
is too limited to train a monolingual model effectively. We propose the use of
additional language-specific pretraining and vocabulary augmentation to adapt
multilingual models to low-resource settings. Using dependency parsing of four
diverse low-resource language varieties as a case study, we show that these
methods significantly improve performance over baselines, especially in the
lowest-resource cases, and demonstrate the importance of the relationship
between such models' pretraining data and target language varieties.
Comment: In Findings of EMNLP 2020
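The vocabulary-augmentation idea can be sketched without any pretrained model: frequent target-language tokens absent from the multilingual vocabulary get fresh embedding rows before continued language-specific pretraining. Everything below (the threshold, the mean-embedding initialization, the toy tokens) is an illustrative assumption, not the paper's exact procedure.

```python
# Hypothetical sketch of vocabulary augmentation for a multilingual model:
# add frequent unseen target-language tokens to the vocab and grow the
# embedding matrix accordingly. Names and threshold are illustrative.
import numpy as np
from collections import Counter

def augment_vocab(vocab, emb, corpus_tokens, min_count=2):
    counts = Counter(t for t in corpus_tokens if t not in vocab)
    new_tokens = [t for t, c in counts.items() if c >= min_count]
    mean = emb.mean(axis=0)                    # init new rows near the mean
    new_rows = np.tile(mean, (len(new_tokens), 1))
    for t in new_tokens:
        vocab[t] = len(vocab)                  # append new ids
    return vocab, np.vstack([emb, new_rows])

vocab = {"the": 0, "##ing": 1}
emb = np.zeros((2, 4))
vocab, emb = augment_vocab(vocab, emb, ["lah", "lah", "sia", "the"])
print(len(vocab), emb.shape)  # "lah" is frequent enough; "sia" is not
```

After augmentation, continued pretraining on target-language text lets the new rows move away from their generic initialization toward useful representations.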
Improving neural language models on low-resource creole languages
When using neural models for NLP tasks such as language modelling, it is difficult to make use of a language with little data, also known as a low-resource language. Creole languages are frequently low-resource, so it is difficult to train neural language models for them well. Creoles are a special type of language, widely thought of as having multiple parent languages and thus receiving a mix of evolutionary traits from all of them. One of a creole's parents, known as the lexifier, gives the creole its lexicon; the other parents, known as substrates, are thought to give the creole its morphology and syntax. Creole languages are most lexically similar to their lexifier and most syntactically similar to otherwise unrelated creole languages. High lexical similarity to the lexifier is unsurprising, since by definition the lexifier provides a creole's lexicon, but the high syntactic similarity to other, unrelated creole languages is not obvious and is explored in detail. We use this information about creoles' unique genesis and typology to decrease the perplexity of neural language models on low-resource creole languages. We find that syntactically similar languages (especially other creole languages) can successfully transfer features learned during pretraining from a high-resource language to a low-resource creole language through a method called neural stacking. A method that normalizes a creole language's vocabulary to its lexifier also lowers the perplexity of creole-language neural models.
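Normalizing a creole's vocabulary to its lexifier amounts to mapping creole word forms onto their lexifier counterparts before the text reaches the model, so that lexifier-language statistics can be reused. A toy sketch (the mapping table and word forms are illustrative, not taken from the thesis):

```python
# Toy sketch of lexifier normalization: map creole word forms to their
# English (lexifier) counterparts before language modelling.
# The mapping table below is an illustrative assumption.
NORMALIZE = {"dem": "them", "di": "the", "wan": "one"}

def normalize_to_lexifier(tokens):
    # Unknown words pass through unchanged.
    return [NORMALIZE.get(t, t) for t in tokens]

out = normalize_to_lexifier(["di", "pikin", "dem"])
print(out)  # ['the', 'pikin', 'them']
```

In practice such a table would be induced from cognate pairs or spelling correspondences rather than written by hand, but the effect is the same: the normalized text overlaps far more with the lexifier's vocabulary.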