    A multilingual collection of CoNLL-U-compatible morphological lexicons

    We introduce UDLexicons, a multilingual collection of morphological lexicons that follow the guidelines and format of the Universal Dependencies initiative. We describe the three approaches we use to create 53 morphological lexicons covering 38 languages, based on existing resources. These lexicons, which are freely available, have already proven useful for improving part-of-speech tagging accuracy in state-of-the-art architectures.

    Enhancing Sequence-to-Sequence Neural Lemmatization with External Resources

    We propose a novel hybrid approach to lemmatization that enhances the seq2seq neural model with additional lemmas extracted from an external lexicon or a rule-based system. During training, the enhanced lemmatizer learns both to generate lemmas via a sequential decoder and to copy lemma characters from the external candidates supplied at run time. Our lemmatizer, enhanced with candidates extracted from the Apertium morphological analyzer, achieves statistically significant improvements over baseline models that do not use additional lemma information. It reaches an average accuracy of 97.25% on a set of 23 UD languages, which is 0.55% higher than that obtained with the Stanford Stanza model on the same set of languages. We also compare with other methods of integrating external data into lemmatization and show that our enhanced system performs considerably better than a simple lexicon extension method based on the Stanza system, and that it achieves complementary improvements with respect to the data augmentation method.

    CoNLL-UL: Universal Morphological Lattices for Universal Dependency Parsing

    Following the development of the universal dependencies (UD) framework and the CoNLL 2017 Shared Task on end-to-end UD parsing, we address the need for a universal representation of morphological analysis which on the one hand can capture a range of different alternative morphological analyses of surface tokens, and on the other hand is compatible with the segmentation and morphological annotation guidelines prescribed for UD treebanks. We propose the CoNLL universal lattices (CoNLL-UL) format, a new annotation format for word lattices that represent morphological analyses, and provide resources that obey this format for a range of typologically different languages. The resources we provide are harmonized with the two-level representation and morphological annotation in their respective UD v2 treebanks, thus enabling research on universal models for morphological and syntactic parsing, in both pipeline and joint settings, and presenting new opportunities in the development of UD resources for low-resource languages.

    An Experiment in Verb Valency Frame Extraction from Croatian Dependency Treebank

    The paper presents an approach to semi-automatic verb valency frame extraction from the Croatian Dependency Treebank. Our algorithm extracted 1923 verb valency frames for 594 different verbs. We discuss the applicability of our method to semi-automatic verb valency lexicon creation and refinement, along with the possibilities of using it in the task of parsing Croatian texts.

    ELMoLex: Connecting ELMo and Lexicon features for Dependency Parsing

    In this paper, we present the details of the neural dependency parser and the neural tagger submitted by our team 'ParisNLP' to the CoNLL 2018 Shared Task on parsing from raw text to Universal Dependencies. We augment the deep Biaffine (BiAF) parser (Dozat and Manning, 2016) with novel features to perform competitively: we utilize an in-domain version of ELMo features (Peters et al., 2018), which provide context-dependent word representations, and we utilize disambiguated, embedded, morphosyntactic features from lexicons (Sagot, 2018), which complement the existing feature set. Henceforth, we call our system 'ELMoLex'. In addition to incorporating character embeddings, ELMoLex leverages pre-trained word vectors, ELMo and morphosyntactic features (whenever available) to correctly handle rare or unknown words, which are prevalent in languages with complex morphology. ELMoLex ranked 11th by the Labeled Attachment Score metric (70.64%) and the Morphology-aware LAS metric (55.74%), and ranked 9th by the Bilexical dependency metric (60.70%). In an extrinsic evaluation setup, ELMoLex ranked 7th for the Event Extraction and Negation Resolution tasks and 11th for the Opinion Analysis task by F1 score.

    When linguistics meets web technologies. Recent advances in modelling linguistic linked data

    This article provides an up-to-date and comprehensive survey of models (including vocabularies, taxonomies and ontologies) used for representing linguistic linked data (LLD). It focuses on the latest developments in the area and both builds upon and complements previous works covering similar territory. The article begins with an overview of recent trends which have had an impact on linked data models and vocabularies, such as the growing influence of the FAIR guidelines, the funding of several major projects in which LLD is a key component, and the increasing importance of the relationship of the digital humanities with LLD. Next, we give an overview of some of the best-known vocabularies and models in LLD. After this, we look at some of the latest developments in community standards and initiatives such as OntoLex-Lemon, as well as recent work which has been carried out in corpora and annotation and LLD, including a discussion of the LLD metadata vocabularies META-SHARE and lime and language identifiers. In the following part of the paper, we look at work which has been realised in a number of recent projects and which has a significant impact on LLD vocabularies and models.

    Representation and parsing of multiword expressions

    This book consists of contributions related to the definition, representation and parsing of MWEs. These reflect current trends in the representation and processing of MWEs. They cover various categories of MWEs such as verbal, adverbial and nominal MWEs, various linguistic frameworks (e.g. tree-based and unification-based grammars), various languages (including English, French, Modern Greek, Hebrew and Norwegian), and various applications (namely MWE detection, parsing and automatic translation), using both symbolic and statistical approaches.