Statistical parsing of morphologically rich languages (SPMRL): what, how and whither
The term Morphologically Rich Languages (MRLs) refers to languages in which significant information concerning syntactic units and relations is expressed at word-level. There is ample evidence that the application of readily available statistical parsing models to such languages is susceptible to serious performance degradation. The first workshop on statistical parsing of MRLs hosts a variety of contributions which show that despite language-specific idiosyncrasies, the problems associated with parsing MRLs cut across languages and parsing frameworks. In this paper we review the current state of affairs with respect to parsing MRLs and point out central challenges. We synthesize the contributions of researchers working on parsing Arabic, Basque, French, German, Hebrew, Hindi and Korean to point out shared solutions across languages. The overarching analysis suggests itself as a source of directions for future investigations.
Overview of the SPMRL 2013 shared task: cross-framework evaluation of parsing morphologically rich languages
This paper reports on the first shared task on statistical parsing of morphologically rich languages (MRLs). The task features data sets from nine languages, each available both in constituency and dependency annotation. We report on the preparation of the data sets, on the proposed parsing scenarios, and on the evaluation metrics for parsing MRLs given different representation types. We present and analyze parsing results obtained by the task participants, and then provide an analysis and comparison of the parsers across languages and frameworks, reported for gold input as well as more realistic parsing scenarios.
Creating a frequency dictionary of spoken Hebrew: a reproducible use of technology to overcome scarcity of data
Studies using word frequency dictionaries—on topics such as vocabulary acquisition, vocabulary load, extensive reading, and vocabulary testing—have historically centered around corpora and morphological issues specific to European languages, especially English. One of the reasons for this is the lack of resources that often plagues departments of less commonly taught languages. Corpora of spoken language are particularly difficult to obtain—the funding and time necessary often make such a project impossible. This thesis is an effort to provide some of the methodology and tools necessary for educators interested in creating frequency dictionaries for research purposes, for their own classrooms, or even for wider dissemination. In doing so, it will provide an overview of some of the key decisions that must be taken into account for such a project. Throughout this thesis, the creation process behind the Frequency Dictionary of Spoken Hebrew (FDOSH)—a list of the most common words in conversational Modern Hebrew—will be explained. The tools used to create the FDOSH, including corpus resources and customized scripts, are provided as part of a repository of supplementary materials. The goal is to make the entire dictionary-creation process as reproducible as possible while allowing for flexibility and transparency in the tools used. It does this by using well-documented open-source scripts written in an easily readable programming language, Python. Beyond providing these tools, the present project explores the theory and many of the considerations that play an important role in the creation of a frequency dictionary. These include issues such as corpus size, corpus text type, whether the list is intended for general or specialized use, word family levels, and objective criteria. Issues regarding Hebrew’s synthetic morphology and ambiguous non-vocalized writing system are also addressed. 
The project aims to serve as a catalyst for future research that may build upon the ideas discussed here. The development and open dissemination of tools such as these can only lead to greater cooperation among educators and researchers, to the benefit of all involved.
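The core of any frequency-dictionary pipeline like the one described above is tokenizing a corpus and ranking tokens by count. As a minimal illustration in Python (the language the thesis uses), here is a generic sketch of that step; the example corpus strings and the function name are invented for illustration and are not taken from the FDOSH tools, which handle Hebrew-specific issues such as non-vocalized orthography that this sketch ignores:

```python
from collections import Counter

def frequency_list(utterances):
    """Rank tokens by frequency across a list of utterance strings."""
    counts = Counter()
    for utterance in utterances:
        # Naive whitespace tokenization; a real pipeline for Hebrew
        # would normalize orthography and handle clitics/prefixes.
        counts.update(utterance.split())
    return counts.most_common()

# Toy transliterated example, not FDOSH data:
corpus = ["shalom ma nishma", "ma shlomcha", "shalom shalom"]
print(frequency_list(corpus))
```

Decisions the thesis discusses, such as word-family grouping and corpus text type, would change what counts as one "token" here; the sketch only shows the raw counting skeleton.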
Statistical Parsing by Machine Learning from a Classical Arabic Treebank
Research into statistical parsing for English has enjoyed over a decade of successful results. However, adapting these models to other languages has met with difficulties. Previous comparative work has shown that Modern Arabic is one of the most difficult languages to parse due to rich morphology and free word order. Classical Arabic is the ancient form of Arabic, and is understudied in computational linguistics, relative to its worldwide reach as the language of the Quran. The thesis is based on seven publications that make significant contributions to knowledge relating to annotating and parsing Classical Arabic.
Classical Arabic has been studied in depth by grammarians for over a thousand years using a traditional grammar known as i’rāb (إعراب). Using this grammar to develop a representation for parsing is challenging, as it describes syntax using a hybrid of phrase-structure and dependency relations. This work aims to advance the state-of-the-art for hybrid parsing by introducing a formal representation for annotation and a resource for machine learning. The main contributions are the first treebank for Classical Arabic and the first statistical dependency-based parser in any language for ellipsis, dropped pronouns and hybrid representations.
A central argument of this thesis is that using a hybrid representation closely aligned to traditional grammar leads to improved parsing for Arabic. To test this hypothesis, two approaches are compared. As a reference, a pure dependency parser is adapted using graph transformations, resulting in an 87.47% F1-score. This is compared to an integrated parsing model with an F1-score of 89.03%, demonstrating that joint dependency-constituency parsing is better suited to Classical Arabic.
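The F1-scores compared above are the standard harmonic mean of precision and recall used in parser evaluation. As a quick reminder of the metric (a generic sketch, not the thesis's evaluation code; the precision/recall values below are made up):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Hypothetical attachment scores, for illustration only:
print(round(f1(0.90, 0.88), 4))  # ≈ 0.8899
```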
The Quran was chosen for annotation as a large body of work exists providing detailed syntactic analysis. Volunteer crowdsourcing is used for annotation in combination with expert supervision. A practical result of the annotation effort is the corpus website: http://corpus.quran.com, an educational resource with over two million users per year.