5 research outputs found

    LIME: Towards a Metadata Module for Ontolex

    The OntoLex W3C Community Group has been working for more than a year on realizing a proposal for a standard ontology lexicon model. As the core specification of the model is almost complete, the group started development of additional modules for specific tasks and use cases. We think that in many usage scenarios (e.g. linguistic enrichment, localization and alignment of ontologies) the discovery and exploitation of linguistically grounded datasets may benefit from summarizing information about their linguistic expressivity. While the VoID vocabulary covers the need for general metadata about linked datasets, this more specific information demands a dedicated extension. In this paper, we fill this gap by introducing LIME (Linguistic Metadata), a new vocabulary aiming at completing the OntoLex standard with specifications for linguistic metadata.

    Langues « par défaut » ? Analyse contrastive et diachronique des langues non citées dans les articles de TALN et d'ACL

    We study the application of the #BenderRule in natural language processing articles, taking into account both a contrastive and a diachronic dimension, by examining the proceedings of two NLP conferences, TALN and ACL, over time. A sample of articles was annotated manually and two classifiers were developed to automatically annotate the remaining articles. This allows us to quantify the extent to which the #BenderRule is applied and to show a slight advantage in favor of TALN.

    Toward an effective Igbo part-of-speech tagger

    Part-of-speech (POS) tagging is a well-established technology for most Western European languages and a few other world languages, but it has not been evaluated on Igbo, an agglutinative African language. This article presents POS tagging experiments conducted using an Igbo corpus as a test bed for identifying the POS taggers and the Machine Learning (ML) methods that can achieve good performance with the small dataset available for the language. Experiments have been conducted using different well-known POS taggers developed for English or European languages, and different training data styles and sizes. Igbo has a number of language-specific characteristics that present a challenge for effective POS tagging. One interesting case is the wide use of verbs (and nominalizations thereof) that have an inherent noun complement, which form “linked pairs” in the POS tagging scheme, but which may appear discontinuously. Another issue is Igbo's highly productive agglutinative morphology, which can produce many variant word forms from a given root. This productivity is a key cause of the out-of-vocabulary (OOV) words observed during Igbo tagging. We report results of experiments on a promising direction for improving tagging performance on such morphologically-inflected OOV words.

    NLP4NLP+5: The Deep (R)evolution in Speech and Language Processing

    This paper analyzes the changes in the fields of speech and natural language processing over the past 5 years (2016–2020). It continues a series of two papers that we published in 2019 on the analysis of the NLP4NLP corpus, which contained articles published in 34 major conferences and journals in the field of speech and natural language processing over a period of 50 years (1965–2015), and which was analyzed with methods developed in the field of NLP, hence its name. The extended NLP4NLP+5 corpus now covers 55 years, comprising close to 90,000 documents [+30% compared with NLP4NLP: as many articles were published in the single year 2020 as over the first 25 years (1965–1989)], 67,000 authors (+40%), 590,000 references (+80%), and approximately 380 million words (+40%). These analyses are conducted globally or comparatively among sources and also with the general scientific literature, with a focus on the past 5 years. The paper concludes by identifying profound changes in research topics, the emergence of a new generation of authors, and the appearance of new publications around artificial intelligence, neural networks, machine learning, and word embedding.

    The LRE map. Harmonising community descriptions of resources

    Accurate and reliable documentation of Language Resources is an indisputable need: documentation is the gateway to discovery of Language Resources, a necessary step towards promoting the data economy. Language resources that are not documented virtually do not exist: for this reason every initiative able to collect and harmonise metadata about resources represents a valuable opportunity for the NLP community. In this paper we describe the LRE Map, reporting statistics on resources associated with LREC2012 papers and providing comparisons with LREC2010 data. The LRE Map, jointly launched by FLaReNet and ELRA in conjunction with the LREC 2010 conference, is an instrument for enhancing availability of information about resources, either new or already existing ones, reinforcing and facilitating the use of standards in the community. The LRE Map web interface provides the possibility of searching according to a fixed set of metadata and of viewing the details of extracted resources. The LRE Map is continuing to collect bottom-up input about resources from authors of other conferences through the standard submission process. This will help broaden the notion of “language resources” and attract to the field neighboring disciplines that so far have been only marginally involved by the standard notion of language resources.