LR-Sum: Summarization for Less-Resourced Languages

Author: Lignos Constantine
Palen-Michel Chester
Publication date: 26/10/2023

This preprint describes work in progress on LR-Sum, a new permissively-licensed dataset created with the goal of enabling further research in automatic summarization for less-resourced languages. LR-Sum contains human-written summaries for 40 languages, many of which are less-resourced. We describe our process for extracting and filtering the dataset from the Multilingual Open Text corpus (Palen-Michel et al., 2022). The source data is public domain newswire collected from Voice of America websites, and LR-Sum is released under a Creative Commons license (CC BY 4.0), making it one of the most openly-licensed multilingual summarization datasets. We describe how we plan to use the data for modeling experiments and discuss limitations of the dataset.

arXiv.org e-Print Archive

Researching Less-Resourced Languages: the DigiSami Corpus

Author: Jokinen Päivi Kristiina
Publication venue: European Languages Resources Association (ELRA)
Publication date: 01/01/2019

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Evaluation of contextual embeddings on less-resourced languages

Author: Armendariz CS
Pollak S
Purver M
Repar A
Robnik-Šikonja M
Ulčar M
Žagar A
Publication date: 01/01/2021

The current dominance of deep neural networks in natural language processing is based on contextual embeddings such as ELMo, BERT, and BERT derivatives. Most existing work focuses on English; in contrast, we present here the first multilingual empirical comparison of two ELMo and several monolingual and multilingual BERT models using 14 tasks in nine languages. In monolingual settings, our analysis shows that monolingual BERT models generally dominate, with a few exceptions such as the dependency parsing task, where they are not competitive with ELMo models trained on large corpora. In cross-lingual settings, BERT models trained on only a few languages mostly do best, closely followed by massively multilingual BERT models.

Queen Mary Research Online

Sentiment Lexicon Construction Using SentiWordNet 3.0

Author: Medagoda N
Whalley JL
Publication venue: IEEE
Publication date: 15/08/2015

Opinion mining and sentiment analysis have become popular for languages that are rich in linguistic resources. Opinions for such analysis are drawn from many forms of freely available online and electronic sources, such as websites, blogs, news reports and product reviews. Less-resourced languages, however, have received significantly less attention. This is because the success of any opinion mining algorithm depends on the availability of resources, such as special lexicons and WordNet-type tools. In this research, we implemented a less complicated but effective approach that could be used to classify comments in less-resourced languages. We tested the approach on the Sinhala language, for which no such opinion mining or sentiment analysis has been carried out to date. Our algorithm gives promising results for analyzing sentiment in Sinhala for the first time.

AUT Scholarly Commons

Strategies to develop Language Technologies for Less-Resourced Languages based on the case of Basque

Author: Alegria Iñaki
Artola Xabier
Díaz De Ilarraza Arantza
Sarasola Kepa
Publication venue: HAL CCSD
Publication date: 25/11/2011

Over 23 years, the IXA group has developed a basic set of resources, tools and applications for Basque, following an initial strategy that has been adapted to technological changes. We think that our strategy and experience can serve as a reference for other less-resourced languages. Based on a six-level classification of world languages, we estimate that this strategy may be useful for several hundred languages: those that have developed a written standard but are still beginners in Human Language Technology.

ArtXiker - @HAL

A Lightweight Regression Method to Infer Psycholinguistic Properties for Brazilian Portuguese

Author: Aluisio Sandra M.
Candido Jr. Arnaldo
Duran Magali S.
Hartmann Nathan S.
Paetzold Gustavo H.
Santos Leandro B. dos
Publication date: 19/05/2017

Psycholinguistic properties of words have been used in various approaches to Natural Language Processing tasks, such as text simplification and readability assessment. Most of these properties are subjective and require costly, time-consuming surveys to gather. Recent approaches use the limited datasets of psycholinguistic properties to extend them automatically to large lexicons. However, some of the resources used by such approaches are not available for most languages. This study presents a method to infer psycholinguistic properties for Brazilian Portuguese (BP) using regressors built with a light set of features usually available for less-resourced languages: word length, frequency lists, lexical databases composed of school dictionaries, and word embedding models. The correlations between the inferred properties are close to those obtained in related work. The resulting resource contains 26,874 words in BP annotated with concreteness, age of acquisition, imageability and subjective frequency. Comment: Paper accepted for TSD201

arXiv.org e-Print Archive

Crossref

Quinductor: a multilingual data-driven method for generating reading-comprehension questions using Universal Dependencies

Author: Boye Johan
Kalpakchi Dmytro
Publication date: 18/03/2021

We propose a multilingual data-driven method for generating reading comprehension questions using dependency trees. Our method provides a strong, mostly deterministic, and inexpensive-to-train baseline for less-resourced languages. While a language-specific corpus is still required, its size is nowhere near those required by modern neural question generation (QG) architectures. Our method surpasses QG baselines previously reported in the literature and shows good performance in human evaluation.

arXiv.org e-Print Archive

Quizzes on tap: exporting a test generation system from one less resourced language to another

Author: Foster Jennifer
Maritxalar Montse
Uí Dhonnchadha Elaine
Ward Monica
Publication date: 27/11/2011

It is difficult to develop and deploy Language Technology and applications for minority languages for many reasons. These include the lack of Natural Language Processing (NLP) resources for the language, a scarcity of NLP researchers who speak the language, and the communication gap between teachers in the classroom and researchers working in universities and other centres of research. One approach to overcoming these obstacles is for researchers interested in Less-Resourced Languages (LRLs) to work together in reusing and adapting existing resources where possible. This article outlines how a multiple-choice quiz generator for Basque was adapted for Irish. The Quizzes on Tap (QOT) system uses Latent Semantic Analysis (LSA) to automatically generate multiple-choice test items. Adapting the Basque application to work for Irish involved sourcing suitable Irish corpora and a morphological engine for Irish, as well as compiling a development set. Various integration issues arising from differences between Basque and Irish needed to be dealt with. The QOT system provides a useful resource that enables Irish teachers to produce both domain-specific and general-knowledge quizzes in a timely manner, for children with varying levels of exposure to the language. Keywords: LRL, less-resourced languages, Irish, morphological analysis, multiple choice tes

CiteSeerX

Irish Universities

DCU Online Research Access Service