184 research outputs found

Treebanking user-generated content: a UD based overview of guidelines, corpora and unified recommendations

Author: Bosco Cristina
Cignarella Alessandra Teresa
Sanguinetti Manuela
Publication date: 01/01/2022

Institutional Research Information System University of Turin

Treebanking User-Generated Content: A Proposal for a Unified Representation in Universal Dependencies

Author: Amir Zeldes
Bosco Cristina
Cignarella Alessandra Teresa
Djamé Seddah
Ines Rehbein
Josef Ruppenhofer
Lauren Cassidy
Ozlem Cetinoglu
Sanguinetti Manuela
Teresa Lynn
Publication venue: ELRA – European Language Resources Association
Publication date: 01/01/2020

Institutional Research Information System University of Turin

Treebanking user-generated content: A proposal for a unified representation in universal dependencies

Author: Bosco Cristina
Cassidy Lauren
Cignarella Alessandra Teresa
Lynn Teresa
Rehbein Ines
Ruppenhofer Josef
Sanguinetti Manuela
Seddah Djamé
Zeldes Amir
Çetinoğlu Özlem
Publication venue: ELRA; IDS, Bibliothek
Publication date: 01/01/2020

The paper discusses the main linguistic phenomena of user-generated texts found on the web and in social media, and proposes a set of annotation guidelines for their treatment within the Universal Dependencies (UD) framework. Given the increasing number of treebanks featuring user-generated content on the one hand, and its somewhat inconsistent treatment in these resources on the other, the aim of this paper is twofold: (1) to provide a short, though comprehensive, overview of such treebanks, based on the available literature, along with their main features and a comparative analysis of their annotation criteria, and (2) to propose a set of tentative UD-based annotation guidelines to promote consistent treatment of the particular phenomena found in these types of texts. The main goal of this paper is to provide a common framework for teams interested in developing similar resources in UD, thus enabling cross-linguistic consistency, a principle that has always been in the spirit of UD.

MAnnheim DOCument Server

Natural Language Processing Resources for Finnish. Corpus Development in the General and Clinical Domains

Author: Haverinen Katri
Publication venue: Turku Centre for Computer Science
Publication date: 04/09/2014

Transferred from Doria

UTUPub

Statistical Parsing by Machine Learning from a Classical Arabic Treebank

Author: Dukes Kais
Publication venue: University of Leeds
Publication date: 01/09/2013

Research into statistical parsing for English has enjoyed over a decade of successful results. However, adapting these models to other languages has met with difficulties. Previous comparative work has shown that Modern Arabic is one of the most difficult languages to parse due to rich morphology and free word order. Classical Arabic is the ancient form of Arabic, and is understudied in computational linguistics, relative to its worldwide reach as the language of the Quran. The thesis is based on seven publications that make significant contributions to knowledge relating to annotating and parsing Classical Arabic. Classical Arabic has been studied in depth by grammarians for over a thousand years using a traditional grammar known as i’rāb (إعراب). Using this grammar to develop a representation for parsing is challenging, as it describes syntax using a hybrid of phrase-structure and dependency relations. This work aims to advance the state of the art in hybrid parsing by introducing a formal representation for annotation and a resource for machine learning. The main contributions are the first treebank for Classical Arabic and the first statistical dependency-based parser in any language for ellipsis, dropped pronouns and hybrid representations. A central argument of this thesis is that using a hybrid representation closely aligned to traditional grammar leads to improved parsing for Arabic. To test this hypothesis, two approaches are compared. As a reference, a pure dependency parser is adapted using graph transformations, resulting in an 87.47% F1-score. This is compared to an integrated parsing model with an F1-score of 89.03%, demonstrating that joint dependency-constituency parsing is better suited to Classical Arabic. The Quran was chosen for annotation as a large body of work exists providing detailed syntactic analysis. Volunteer crowdsourcing is used for annotation in combination with expert supervision.
A practical result of the annotation effort is the corpus website, http://corpus.quran.com, an educational resource with over two million users per year.

White Rose E-theses Online

New functions and updates of the resource DiACL - Diachronic Atlas of Comparative Linguistics

Author: Carling Gerd
Larsson Filip
Lundgren Olof
Nilsson Linus
Verhoeven Rob
Publication venue: Pavia University Press
Publication date: 01/01/2021

Lund University Publications

Towards unification of discourse annotation frameworks

Author: Fu Yingxue
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 16/04/2022

Funding: The author is funded by a University of St Andrews-China Scholarship Council joint scholarship (No. 202008300012). Discourse information is difficult to represent and annotate. Among the major frameworks for annotating discourse information, RST, PDTB and SDRT are widely discussed and used, each having its own theoretical foundation and focus. Corpora annotated under different frameworks vary considerably. To make better use of the existing discourse corpora and achieve the possible synergy of different frameworks, it is worthwhile to investigate the systematic relations between different frameworks and devise methods of unifying the frameworks. Although the issue of framework unification has been a topic of discussion for a long time, there is currently no comprehensive approach that considers unifying both discourse structure and discourse relations and evaluates the unified framework intrinsically and extrinsically. We plan to use automatic means for the unification task and evaluate the result with structural complexity and downstream tasks. We will also explore the application of the unified framework in multi-task learning and graphical models.

arXiv.org e-Print Archive

University of St. Andrews - Pure

St Andrews Research Repository