
299 research outputs found

Modeling the interface between morphology and syntax in data-driven dependency parsing

Author: Seeker Wolfgang
Publication date: 01/01/2016

When people formulate sentences in a language, they follow a set of rules specific to that language that defines how words must be put together in order to express the intended meaning. These rules are called the grammar of the language. Languages have essentially two ways of encoding grammatical information: word order or word form. English uses primarily word order to encode different meanings, but many other languages change the form of the words themselves to express their grammatical function in the sentence. These languages are commonly subsumed under the term morphologically rich languages. Parsing is the automatic process for predicting the grammatical structure of a sentence. Since grammatical structure guides the way we understand sentences, parsing is a key component in computer programs that try to automatically understand what people say and write. This dissertation is about parsing and specifically about parsing languages with a rich morphology, which encode grammatical information in the form of words. Today’s models for automatic parsing were developed for English and achieve good results on this language. However, when applied to other languages, a significant drop in performance is usually observed. The standard model for parsing is a pipeline model that separates the parsing process into different steps; in particular, it separates the morphological analysis, i.e. the analysis of word forms, from the actual parsing step. This dissertation argues that this separation is one of the reasons for the performance drop of standard parsers when applied to languages other than English. An analysis is presented that exposes the connection between the morphological system of a language and the errors of a standard parsing model. In a second series of experiments, we show that knowledge about the syntactic structure of a sentence can support the prediction of morphological information.
We then argue for an alternative approach that models morphological analysis and syntactic analysis jointly instead of separating them. We support this argumentation with empirical evidence by implementing two parsers that model the relationship between morphology and syntax in two different but complementary ways.

Proceedings of the LREC workshop on partial parsing: between chunk parsing and deep parsing

Author: Kübler Sandra
Piskorski Jakub
Przepiorkowski Adam
Publication date: 03/11/2008

Hochschulschriftenserver - Universität Frankfurt am Main

Linguistically-Informed Neural Architectures for Lexical, Syntactic and Semantic Tasks in Sanskrit

Author: Sandhan Jivnesh
Publication date: 17/08/2023

The primary focus of this thesis is to make Sanskrit manuscripts more accessible to the end-users through natural language technologies. The morphological richness, compounding, free word orderliness, and low-resource nature of Sanskrit pose significant challenges for developing deep learning solutions. We identify four fundamental tasks, which are crucial for developing a robust NLP technology for Sanskrit: word segmentation, dependency parsing, compound type identification, and poetry analysis. The first task, Sanskrit Word Segmentation (SWS), is a fundamental text processing task for any other downstream applications. However, it is challenging due to the sandhi phenomenon that modifies characters at word boundaries. Similarly, the existing dependency parsing approaches struggle with morphologically rich and low-resource languages like Sanskrit. Compound type identification is also challenging for Sanskrit due to the context-sensitive semantic relation between components. All these challenges result in sub-optimal performance in NLP applications like question answering and machine translation. Finally, Sanskrit poetry has not been extensively studied in computational linguistics. While addressing these challenges, this thesis makes various contributions: (1) The thesis proposes linguistically-informed neural architectures for these tasks. (2) We showcase the interpretability and multilingual extension of the proposed systems. (3) Our proposed systems report state-of-the-art performance. (4) Finally, we present a neural toolkit named SanskritShala, a web-based application that provides real-time analysis of input for various NLP tasks. Overall, this thesis contributes to making Sanskrit manuscripts more accessible by developing robust NLP technology and releasing various resources, datasets, and a web-based toolkit.

arXiv.org e-Print Archive

CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

Author: Attia Mohammed
Badmaeva Elena
Banerjee Esha
Burchardt Aljoscha
Cinková Silvie
de Marneffe Marie-Catherine
dePaiva Valeria
Droganova Kira
Elkahky Ali
Fernández Alcalde Héctor
Ginter Filip
Gökırmak Memduh
Habash Nizar
Hajič Jan
Hajič jr., Jan
Harris Kim
Hlaváčová Jaroslava
Kanayama Hiroshi
Kanerva Jenna
Kayadelen Tolga
Kettnerová Václava
Kirchner Jesse
Kwak Sookyoung
Lando Tatiana
Lertpradit Saran
Leung Herman
Li Josie
Luotolahti Juhani
Macketanz Vivien
Mandl Michael
Manning Christopher D.
Manurung Ruli
Marheinecke Katrin
Martínez Alonso Héctor
Mendonça Gustavo
Missilä Anna
Nedoluzhko Anna
Nitisaroj Rattima
Nivre Joakim
Ojala Stina
Petrov Slav
Pitler Emily
Popel Martin
Potthast Martin
Pyysalo Sampo
Reddy Siva
Rehm Georg
Sanguinetti Manuela
Schuster Sebastian
Shimada Atsuko
Simi Maria
Stella Antonio
Straka Milan
Strnadová Jana
Sulubacak Umut
Taji Dima
Tyers Francis
Urešová Zdeňka
Uszkoreit Hans
Yu Zhuoran
Zeman Daniel
Çöltekin Çağrı
Publication venue: The Association for Computational Linguistics
Publication date: 01/01/2017

The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, one of two tasks was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe data preparation, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.

Crossref

Archivio della Ricerca - Università di Pisa

Biblio at Institute of Formal and Applied Linguistics

Helsingin yliopiston digitaalinen arkisto

Institutional Research Information System University of Turin

Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

Author: Attia M
Badmaeva E
Banerjee E
Burchardt A
Cinková S
Droganova K
Elkahky A
Fernandez Alcalde H
Ginter F
Gökırmak M
Habash N
Hajič J
Hajič jr. J
Harris K
Hlaváčová J
Kanayama H
Kanerva J
Kayadelen T
Kettnerová V
Kirchner J
Kwak S
Lando T
Lertpradit S
Leung H
Li J
Luotolahti J
Macketanz V
Mandl M
Manning C
Manurung R
Marheinecke K
Marneffe M
Martínez Alonso H
Mendonça G
Missilä A
Nedoluzhko A
Nitisaroj R
Nivre J
Ojala S
Paiva V
Petrov S
Pitler E
Popel M
Potthast M
Pyysalo S
Reddy S
Rehm G
Sanguinetti M
Schuster S
Shimada A
Simi M
Stella A
Straka M
Strnadová J
Taji D
Tyers F
Urešová Z
Uszkoreit H
Yu Z
Zeman D
Publication venue: Vancouver, Canada
Publication date: 28/10/2022

UTUPub

Statistical Parsing by Machine Learning from a Classical Arabic Treebank

Author: Dukes Kais
Publication venue: University of Leeds
Publication date: 01/09/2013

Research into statistical parsing for English has enjoyed over a decade of successful results. However, adapting these models to other languages has met with difficulties. Previous comparative work has shown that Modern Arabic is one of the most difficult languages to parse due to rich morphology and free word order. Classical Arabic is the ancient form of Arabic, and is understudied in computational linguistics, relative to its worldwide reach as the language of the Quran. The thesis is based on seven publications that make significant contributions to knowledge relating to annotating and parsing Classical Arabic. Classical Arabic has been studied in depth by grammarians for over a thousand years using a traditional grammar known as i’rāb (إعراب). Using this grammar to develop a representation for parsing is challenging, as it describes syntax using a hybrid of phrase-structure and dependency relations. This work aims to advance the state of the art for hybrid parsing by introducing a formal representation for annotation and a resource for machine learning. The main contributions are the first treebank for Classical Arabic and the first statistical dependency-based parser in any language for ellipsis, dropped pronouns and hybrid representations. A central argument of this thesis is that using a hybrid representation closely aligned to traditional grammar leads to improved parsing for Arabic. To test this hypothesis, two approaches are compared. As a reference, a pure dependency parser is adapted using graph transformations, resulting in an 87.47% F1-score. This is compared to an integrated parsing model with an F1-score of 89.03%, demonstrating that joint dependency-constituency parsing is better suited to Classical Arabic. The Quran was chosen for annotation as a large body of work exists providing detailed syntactic analysis. Volunteer crowdsourcing is used for annotation in combination with expert supervision.
A practical result of the annotation effort is the corpus website: http://corpus.quran.com, an educational resource with over two million users per year.

White Rose E-theses Online

SHR++: An Interface for Morpho-syntactic annotation of Sanskrit Corpora

Author: Chawla Dilpreet
Goyal Pawan
Krishna Amrith
Sambhavi Sruti
Vidhyut Shiv
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/02/2020

The IT University of Copenhagen's Repository