19 research outputs found

    Planting Trees in the Desert: Delexicalized Tagging and Parsing Combined

    Various unsupervised and semi-supervised methods have been proposed to tag and parse an unseen language. We explore delexicalized parsing, proposed by Zeman and Resnik (2008), and delexicalized tagging, proposed by Yu et al. (2016). For both approaches we provide a detailed evaluation on Universal Dependencies data (Nivre et al., 2016), a de facto standard for multilingual morphosyntactic processing, whereas previous work used other datasets. Our results confirm that, in isolation, each of the two delexicalized techniques has some limited potential when no annotation of the target language is available. However, when they are used in combination, their errors multiply beyond acceptable limits. We demonstrate that even the tiniest bit of expert annotation in the target language may contain significant potential and should be used if available.

    New Treebank or Repurposed? On the Feasibility of Cross-Lingual Parsing of Romance Languages with Universal Dependencies

    This is the final peer-reviewed manuscript that was accepted for publication in Natural Language Engineering. Changes resulting from the publishing process, such as editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document.
    [Abstract] This paper addresses the feasibility of cross-lingual parsing with Universal Dependencies (UD) between Romance languages, analyzing its performance when compared to the use of manually annotated resources of the target languages. Several experiments take into account factors such as the lexical distance between the source and target varieties, the impact of delexicalization, the combination of different source treebanks, and the adaptation of resources to the target language, among others. The results of these evaluations show that the direct application of a parser from one Romance language to another reaches labeled attachment score (LAS) values similar to those obtained with a manual annotation of about 3,000 tokens in the target language, and unlabeled attachment score (UAS) results equivalent to the use of around 7,000 tokens, depending on the case. These numbers can noticeably increase by performing a focused selection of the source treebanks. Furthermore, the removal of the words in the training corpus (delexicalization) is not useful in most cases of cross-lingual parsing of Romance languages. The lessons learned from these experiments were used to build a new UD treebank for Galician, with 1,000 sentences manually corrected after an automatic cross-lingual annotation. Several evaluations on this new resource show that a cross-lingual parser built with the best combination and adaptation of the source treebanks performs better (77 percent LAS and 82 percent UAS) than using more than 16,000 (for LAS) and more than 20,000 (for UAS) manually labeled tokens of Galician.
    Ministerio de Economía y Competitividad; FJCI-2014-22853
    Ministerio de Economía y Competitividad; FFI2014-51978-C2-1-R
    Ministerio de Economía y Competitividad; FFI2014-51978-C2-2-

    Multilingual projection for parsing truly low resource languages

    We propose a novel approach to cross-lingual part-of-speech tagging and dependency parsing for truly low-resource languages. Our annotation projection-based approach yields tagging and parsing models for over 100 languages. All that is needed are freely available parallel texts, and taggers and parsers for resource-rich languages. The empirical evaluation across 30 test languages shows that our method consistently provides top-level accuracies, close to established upper bounds, and outperforms several competitive baselines.