10 research outputs found
Building an endangered language resource in the classroom: Universal dependencies for Kakataibo
In this paper, we launch a new Universal Dependencies treebank for an endangered language from Amazonia: Kakataibo, a Panoan language spoken in Peru. We first discuss the collaborative methodology implemented, which proved effective for creating a treebank in the context of a Computational Linguistics course for undergraduates. Then, we describe the general details of the treebank and the language-specific considerations adopted for the proposed annotation. We finally conduct some experiments on part-of-speech tagging and syntactic dependency parsing. We focus on monolingual and transfer-learning settings, where we study the impact of a Shipibo-Konibo treebank, another Panoan language resource.
Meeting the Needs of Low-Resource Languages: The Value of Automatic Alignments via Pretrained Models
Large multilingual models have inspired a new class of word alignment methods, which work well for the models' pretraining languages. However, the languages most in need of automatic alignment are low-resource and, thus, not typically included in the pretraining data. In this work, we ask: How do modern aligners perform on unseen languages, and are they better than traditional methods? We contribute gold-standard alignments for Bribri–Spanish, Guarani–Spanish, Quechua–Spanish, and Shipibo-Konibo–Spanish. With these, we evaluate state-of-the-art aligners with and without model adaptation to the target language. Finally, we also evaluate the resulting alignments extrinsically through two downstream tasks: named entity recognition and part-of-speech tagging. We find that although transformer-based methods generally outperform traditional models, the two classes of approach remain competitive with each other.
Comment: EACL 202
Overview de ReCoRES en IberLEF 2022: Comprensión de Lectura y Explicación de Razonamiento en Español
This paper presents the ReCoRES task, organized at IberLEF 2022 within the framework of the 38th edition of the International Conference of the Spanish Society for Natural Language Processing. The main goal of this shared task is to promote Reading Comprehension and Verbal Reasoning. The task is divided into two sub-tasks: (1) identifying the correct alternative in reading-comprehension questions and (2) generating the reasoning used to select an alternative. In total, 3 teams participated in this event, mainly proposing transformer-based neural models in conjunction with additional strategies. The results of the event, along with insights and some remaining challenges, are presented, opening a range of possibilities for future work.
Open machine translation for low resource South American languages (AmericasNLP 2021 shared task contribution)
This paper describes team “Tamalli”'s submission to the AmericasNLP 2021 shared task on Open Machine Translation for low-resource South American languages. Our goal was to evaluate different Machine Translation (MT) techniques, statistical and neural, under several configuration settings. We obtained the second-best results for the language pairs Spanish–Bribri, Spanish–Asháninka, and Spanish–Rarámuri in the category “Development set not used for training”. Our experiments will serve as a point of reference for researchers working on MT with low-resource languages.