444 research outputs found

    LR-Sum: Summarization for Less-Resourced Languages

    This preprint describes work in progress on LR-Sum, a new permissively-licensed dataset created with the goal of enabling further research in automatic summarization for less-resourced languages. LR-Sum contains human-written summaries for 40 languages, many of which are less-resourced. We describe our process for extracting and filtering the dataset from the Multilingual Open Text corpus (Palen-Michel et al., 2022). The source data is public domain newswire collected from Voice of America websites, and LR-Sum is released under a Creative Commons license (CC BY 4.0), making it one of the most openly-licensed multilingual summarization datasets. We describe how we plan to use the data for modeling experiments and discuss limitations of the dataset.

    Researching Less-Resourced Languages : the DigiSami Corpus

    Peer reviewed

    Evaluation of contextual embeddings on less-resourced languages

    The current dominance of deep neural networks in natural language processing is based on contextual embeddings such as ELMo, BERT, and BERT derivatives. Most existing work focuses on English; in contrast, we present here the first multilingual empirical comparison of two ELMo models and several monolingual and multilingual BERT models using 14 tasks in nine languages. In monolingual settings, our analysis shows that monolingual BERT models generally dominate, with a few exceptions such as the dependency parsing task, where they are not competitive with ELMo models trained on large corpora. In cross-lingual settings, BERT models trained on only a few languages mostly do best, closely followed by massively multilingual BERT models.

    Sentiment Lexicon Construction Using SentiWordNet 3.0

    Opinion mining and sentiment analysis have become popular for languages that are rich in linguistic resources. Opinions for such analysis are drawn from many freely available online and electronic sources, such as websites, blogs, news reports, and product reviews. Less-resourced languages, however, have received significantly less attention, because the success of any opinion mining algorithm depends on the availability of resources such as specialized lexicons and WordNet-type tools. In this research, we implemented a simple but effective approach that can be used to classify comments in less-resourced languages. We tested the approach on Sinhala, for which no opinion mining or sentiment analysis had previously been carried out. Our algorithm gives promising results for analyzing sentiment in Sinhala for the first time.

    Strategies to develop Language Technologies for Less-Resourced Languages based on the case of Basque

    Over 23 years, the IXA group has developed a basic set of resources, tools, and applications for Basque, following an initial strategy that has been adapted in response to technological changes. We think that our strategy and experience can serve as a reference for other less-resourced languages. According to a six-level classification of the world's languages, we estimate that this strategy may be useful for several hundred languages: those that have developed a written standard but are still beginners in Human Language Technology.

    A Lightweight Regression Method to Infer Psycholinguistic Properties for Brazilian Portuguese

    Psycholinguistic properties of words have been used in various approaches to Natural Language Processing tasks, such as text simplification and readability assessment. Most of these properties are subjective and must be gathered through costly and time-consuming surveys. Recent approaches use the limited datasets of psycholinguistic properties to extend them automatically to large lexicons. However, some of the resources used by such approaches are not available for most languages. This study presents a method to infer psycholinguistic properties for Brazilian Portuguese (BP) using regressors built with a light set of features usually available for less-resourced languages: word length, frequency lists, lexical databases composed of school dictionaries, and word embedding models. The correlations between the properties inferred are close to those obtained by related works. The resulting resource contains 26,874 words in BP annotated with concreteness, age of acquisition, imageability, and subjective frequency. Comment: Paper accepted for TSD201

    Quinductor: a multilingual data-driven method for generating reading-comprehension questions using Universal Dependencies

    We propose a multilingual data-driven method for generating reading comprehension questions using dependency trees. Our method provides a strong, mostly deterministic, and inexpensive-to-train baseline for less-resourced languages. While a language-specific corpus is still required, its size is nowhere near that required by modern neural question generation (QG) architectures. Our method surpasses QG baselines previously reported in the literature and performs well in human evaluation.

    Quizzes on tap: exporting a test generation system from one less resourced language to another

    It is difficult to develop and deploy Language Technology and applications for minority languages for many reasons. These include the lack of Natural Language Processing (NLP) resources for the language, a scarcity of NLP researchers who speak the language, and the communication gap between teachers in the classroom and researchers working in universities and other centres of research. One approach to overcoming these obstacles is for researchers interested in Less-Resourced Languages (LRLs) to work together in reusing and adapting existing resources where possible. This article outlines how a multiple-choice quiz generator for Basque was adapted for Irish. The Quizzes on Tap (QOT) system uses Latent Semantic Analysis (LSA) to automatically generate multiple-choice test items. Adapting the Basque application to work for Irish involved sourcing suitable Irish corpora and a morphological engine for Irish, as well as compiling a development set. Various integration issues arising from differences between Basque and Irish needed to be dealt with. The QOT system provides a useful resource that enables Irish teachers to produce both domain-specific and general-knowledge quizzes in a timely manner, for children with varying levels of exposure to the language. Keywords: LRL, less-resourced languages, Irish, morphological analysis, multiple choice tes