5,308 research outputs found

    Why is German dependency parsing more reliable than constituent parsing?

    Get PDF
    In recent years, research in parsing has extended in several new directions. One of these directions is concerned with parsing languages other than English. Treebanks have become available for many European languages, but also for Arabic, Chinese, or Japanese. However, it was shown that parsing results on these treebanks depend on the types of treebank annotations used. Another direction in parsing research is the development of dependency parsers. Dependency parsing profits from the non-hierarchical nature of dependency relations, thus lexical information can be included in the parsing process in a much more natural way. Especially machine learning based approaches are very successful (cf. e.g.). The results achieved by these dependency parsers are very competitive although comparisons are difficult because of the differences in annotation. For English, the Penn Treebank has been converted to dependencies. For this version, Nivre et al. report an accuracy rate of 86.3%, as compared to an F-score of 92.1 for Charniaks parser. The Penn Chinese Treebank is also available in a constituent and a dependency representations. The best results reported for parsing experiments with this treebank give an F-score of 81.8 for the constituent version and 79.8% accuracy for the dependency version. The general trend in comparisons between constituent and dependency parsers is that the dependency parser performs slightly worse than the constituent parser. The only exception occurs for German, where F-scores for constituent plus grammatical function parses range between 51.4 and 75.3, depending on the treebank, NEGRA or TüBa-D/Z. The dependency parser based on a converted version of Tüba-D/Z, in contrast, reached an accuracy of 83.4%, i.e. 12 percent points better than the best constituent analysis including grammatical functions

    Universal Dependencies Parsing for Colloquial Singaporean English

    Full text link
    Singlish can be interesting to the ACL community both linguistically as a major creole based on English, and computationally for information extraction and sentiment analysis of regional social media. We investigate dependency parsing of Singlish by constructing a dependency treebank under the Universal Dependencies scheme, and then training a neural network model by integrating English syntactic knowledge into a state-of-the-art parser trained on the Singlish treebank. Results show that English knowledge can lead to 25% relative error reduction, resulting in a parser of 84.47% accuracies. To the best of our knowledge, we are the first to use neural stacking to improve cross-lingual dependency parsing on low-resource languages. We make both our annotation and parser available for further research.Comment: Accepted by ACL 201

    Universal Dependencies for Learner English

    Get PDF
    We introduce the Treebank of Learner English (TLE), the first publicly available syntactic treebank for English as a Second Language (ESL). The TLE provides manually annotated POS tags and Universal Dependency (UD) trees for 5,124 sentences from the Cambridge First Certificate in English (FCE) corpus. The UD annotations are tied to a pre-existing error annotation of the FCE, whereby full syntactic analyses are provided for both the original and error corrected versions of each sentence. Further on, we delineate ESL annotation guidelines that allow for consistent syntactic treatment of ungrammatical English. Finally, we benchmark POS tagging and dependency parsing performance on the TLE dataset and measure the effect of grammatical errors on parsing accuracy. We envision the treebank to support a wide range of linguistic and computational research o n second language acquisition as well as automatic processing of ungrammatical language.This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF – 1231216

    A Reranking Approach for Dependency Parsing with Variable-sized Subtree Features

    Get PDF
    Employing higher-order subtree structures in graph-based dependency parsing has shown substantial improvement over the accuracy, however suffers from the inefficiency increasing with the order of subtrees. We present a new reranking approach for dependency parsing that can utilize complex subtree representation by applying efficient subtree selection heuristics. We demonstrate the effective-ness of the approach in experiments conducted on the Penn Treebank and the Chinese Treebank. Our system improves the baseline accuracy from 91.88 % to 93.37 % for English, and in the case of Chinese from 87.39 % to 89.16%. 1

    Treebank-based grammar acquisition for German

    Get PDF
    Manual development of deep linguistic resources is time-consuming and costly and therefore often described as a bottleneck for traditional rule-based NLP. In my PhD thesis I present a treebank-based method for the automatic acquisition of LFG resources for German. The method automatically creates deep and rich linguistic presentations from labelled data (treebanks) and can be applied to large data sets. My research is based on and substantially extends previous work on automatically acquiring wide-coverage, deep, constraint-based grammatical resources from the English Penn-II treebank (Cahill et al.,2002; Burke et al., 2004; Cahill, 2004). Best results for English show a dependency f-score of 82.73% (Cahill et al., 2008) against the PARC 700 dependency bank, outperforming the best hand-crafted grammar of Kaplan et al. (2004). Preliminary work has been carried out to test the approach on languages other than English, providing proof of concept for the applicability of the method (Cahill et al., 2003; Cahill, 2004; Cahill et al., 2005). While first results have been promising, a number of important research questions have been raised. The original approach presented first in Cahill et al. (2002) is strongly tailored to English and the datastructures provided by the Penn-II treebank (Marcus et al., 1993). English is configurational and rather poor in inflectional forms. German, by contrast, features semi-free word order and a much richer morphology. Furthermore, treebanks for German differ considerably from the Penn-II treebank as regards data structures and encoding schemes underlying the grammar acquisition task. In my thesis I examine the impact of language-specific properties of German as well as linguistically motivated treebank design decisions on PCFG parsing and LFG grammar acquisition. I present experiments investigating the influence of treebank design on PCFG parsing and show which type of representations are useful for the PCFG and LFG grammar acquisition tasks. Furthermore, I present a novel approach to cross-treebank comparison, measuring the effect of controlled error insertion on treebank trees and parser output from different treebanks. I complement the cross-treebank comparison by providing a human evaluation using TePaCoC, a new testsuite for testing parser performance on complex grammatical constructions. Manual evaluation on TePaCoC data provides new insights on the impact of flat vs. hierarchical annotation schemes on data-driven parsing. I present treebank-based LFG acquisition methodologies for two German treebanks. An extensive evaluation along different dimensions complements the investigation and provides valuable insights for the future development of treebanks

    DepAnn - An Annotation Tool for Dependency Treebanks

    Full text link
    DepAnn is an interactive annotation tool for dependency treebanks, providing both graphical and text-based annotation interfaces. The tool is aimed for semi-automatic creation of treebanks. It aids the manual inspection and correction of automatically created parses, making the annotation process faster and less error-prone. A novel feature of the tool is that it enables the user to view outputs from several parsers as the basis for creating the final tree to be saved to the treebank. DepAnn uses TIGER-XML, an XML-based general encoding format for both, representing the parser outputs and saving the annotated treebank. The tool includes an automatic consistency checker for sentence structures. In addition, the tool enables users to build structures manually, add comments on the annotations, modify the tagsets, and mark sentences for further revision
    corecore