Search CORE

11 research outputs found

Becoming a High-Resource Language in Speech: The Catalan Case in the Common Voice Corpus

Author: Armentano Oller Carme
Marimon Montserrat
Villegas Marta
Publication venue: ELRA Language Resources Association and the International Committee on Computational Linguistics
Publication date: 01/01/2024

Collecting voice resources for speech recognition systems is a multifaceted challenge, involving legal, technical, and diversity considerations. However, it is crucial to ensure fair access to voice-driven technology across diverse linguistic backgrounds. We describe an ongoing effort to create an extensive, high-quality, publicly available voice dataset for future development of speech technologies in Catalan through the Mozilla Common Voice crowd-sourcing platform. We detail the specific approaches used to address the challenges faced in recruiting contributors and managing the collection, validation, and recording of sentences. This detailed overview can serve as a source of guidance for similar initiatives across other projects and linguistic contexts. The success of this project is evident in the latest corpus release, version 16.1, where Catalan ranks as the most prominent language in the corpus in both recorded and validated hours. This establishes Catalan as a language with significant speech resources for language technology development and significantly raises its international visibility.

An immense thanks to the entire AINA team, particularly Paul Andrei Petrea and Baybars Külebi, for their consistent participation and support, and Hannah Rose Galbraith for refactoring the sentence filter. We express deep gratitude to the entire MCV community, with special recognition to Francis Tyers, for their unwavering dedication in overcoming challenges during the campaign. Our sincere appreciation extends to the collective SoftCatalà, especially Joan Montané, and other supporting organizations like Plataforma per la Llengua, Òmnium Cultural, and many more. Special acknowledgment is given to the writers and editors who generously contributed, including Grup Enciclopèdia Catalana, VilaWeb, Racó Català, El Cèrvol, Secretaria de Política Lingüística, Màrius Serra, Carles Cortés, and Joan Pujolar. Lastly, immense thanks to the over 35,000 individuals who lent their voices to this project; your contributions have been invaluable to our success. This work has been promoted and financed by the Generalitat de Catalunya through the AINA project.

Peer Reviewed. Postprint (published version)

UPCommons. Portal del coneixement obert de la UPC

NoNiRes: A Catalan corpus annotated with negation

Author: Armentano Oller Carme
Calvo Figueras Blanca
Nofre Montserrat
Tañá Velasco Laura
Publication venue: Sociedad Española para el Procesamiento del Lenguaje Natural
Publication date: 01/09/2023

In this article we present the criteria applied to annotate negation and the focus of negation in NoNiRes, a corpus of Catalan. The corpus comprises 20,600 sentences drawn from existing datasets (5,000 sentences), an Internet forum (10,000 sentences), and a digital newspaper (5,600 sentences). Complex aspects such as the focus and the gradation of negation are addressed. Comprehensive statistical data on the annotated structures are provided.

This work was funded by CLiC, Centre de Llenguatge i Computació, a consolidated research group recognized by the Generalitat de Catalunya (2021 SGR 00313), and by the Departament de la Vicepresidència i de Polítiques Digitals i Territori of the Generalitat de Catalunya, within the framework of the Projecte AINA.

Repositorio Institucional de la Universidad de Alicante

Re-use of linguistic data to create a machine translation system for a new language pair

Author: Armentano Oller Carme
Forcada Mikel L.
Publication venue: Sociedad Española para el Procesamiento del Lenguaje Natural
Publication date: 01/01/2008

This work examines various ways to re-use pre-existing linguistic data to quickly generate a machine translation system for a new language pair. In particular, a machine translation system between Portuguese and Catalan based on the Apertium platform (www.apertium.org) has been built from data already available in this platform for translating between Portuguese and Spanish and between Spanish and Catalan. The results obtained indicate that a simple composition of two complete translators is a good first option, but other very interesting results are also shown, obtained in a short time using the tools provided by the Apertium platform.
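The "simple composition of two complete translators" mentioned in this abstract can be pictured with a toy sketch. Everything below is an assumption made for illustration: the two tiny word lists, the function names `make_translator` and `compose`, and the word-by-word lookup stand in for full Portuguese-Spanish and Spanish-Catalan systems and are not Apertium code or data.

```python
# Toy illustration of pivot translation by composing two translators.
# The lexicons and helper names are hypothetical, not Apertium data or APIs.

PT_ES = {"obrigado": "gracias", "a": "la", "casa": "casa"}
ES_CA = {"gracias": "gràcies", "la": "la", "casa": "casa"}


def make_translator(lexicon):
    """Build a word-by-word translator; unknown words pass through unchanged."""
    def translate(text):
        return " ".join(lexicon.get(token, token) for token in text.split())
    return translate


def compose(first, second):
    """Chain two translators: the output of `first` is the input of `second`."""
    return lambda text: second(first(text))


pt_to_es = make_translator(PT_ES)
es_to_ca = make_translator(ES_CA)
pt_to_ca = compose(pt_to_es, es_to_ca)

print(pt_to_ca("obrigado"))  # gràcies
```

In a composition like this, any error made by the first system is passed on to the second, which may be one reason the abstract describes it as a first option and goes on to mention other approaches.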

Repositorio Institucional de la Universidad de Alicante

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Training part-of-speech taggers to build machine translation systems for less-resourced language pairs

Author: Armentano Oller Carme
Forcada Mikel L.
Pérez-Ortiz Juan Antonio
Sánchez-Martínez Felipe
Publication venue: Sociedad Española para el Procesamiento del Lenguaje Natural
Publication date: 01/01/2007

In this paper we review an unsupervised method that can be used to train the hidden-Markov-model-based part-of-speech taggers used within the open-source shallow-transfer machine translation (MT) engine Apertium. This method uses the remaining modules of the MT engine and a target language model to obtain part-of-speech taggers that are then used within the Apertium MT engine to produce translations. The experimental results on the Occitan–Catalan language pair (a case study of a less-resourced language pair) show that the amount of corpora needed by this training method is small compared with the usual corpus sizes needed by the standard (unsupervised) Baum-Welch algorithm. This makes the method appropriate for training part-of-speech taggers to be used in MT for less-resourced language pairs. Moreover, the translation performance of the MT system embedding the resulting part-of-speech tagger is comparatively better.

Work funded by the Spanish Ministry of Education and Science through project TIN2006-15071-C03-01, by the Spanish Ministry of Education and Science and the European Social Fund through research grant BES-2004-4711, and by the Spanish Ministry of Industry, Tourism and Commerce through project FIT-350401-2006-5. The development of the Occitan–Catalan linguistic data was supported by the Generalitat de Catalunya.
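One way to picture how a target language model could supervise a source-language tagger is sketched below. This is an illustrative reading of the abstract, not the authors' implementation: the function name `fractional_tag_counts`, the toy translator, and the toy language-model scores are all assumptions. Each possible tag sequence of an ambiguous segment is translated, the translation is scored by the target language model, and the normalized scores are used as fractional counts for tag transitions.

```python
# Sketch: target-language-model scores turned into fractional tag-transition counts.
# All names and numbers here are hypothetical and chosen only for illustration.
from collections import defaultdict
from itertools import product


def fractional_tag_counts(segment, translate, lm_score):
    """segment is a list of (word, [candidate tags]) pairs."""
    words = [word for word, _ in segment]
    paths = list(product(*[tags for _, tags in segment]))
    scores = [lm_score(translate(words, path)) for path in paths]
    total = sum(scores)
    counts = defaultdict(float)
    for path, score in zip(paths, scores):
        # Fall back to a uniform weight if the language model scores every path as zero.
        weight = score / total if total > 0 else 1.0 / len(paths)
        for previous, current in zip(("<s>",) + path, path):
            counts[(previous, current)] += weight
    return dict(counts)


def toy_translate(words, tags):
    """Toy translator: the chosen tag decides how each word is rendered."""
    table = {("la", "det"): "the", ("la", "pron"): "it", ("casa", "noun"): "house"}
    return " ".join(table[(w, t)] for w, t in zip(words, tags))


def toy_lm_score(sentence):
    """Toy target language model with made-up likelihoods."""
    return {"the house": 0.9, "it house": 0.1}.get(sentence, 0.0)


SEGMENT = [("la", ["det", "pron"]), ("casa", ["noun"])]
counts = fractional_tag_counts(SEGMENT, toy_translate, toy_lm_score)
```

In this toy case the determiner reading of "la" receives most of the count mass because its translation is the one the made-up language model prefers; counts of this kind could then be normalized into transition probabilities for a hidden Markov model.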

Repositorio Institucional de la Universidad de Alicante

CiteSeerX

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Open-source machine translation between small languages: Catalan and Aranese Occitan

Author: Carme Armentano i Oller
Mikel L. Forcada

We describe the use of an open-source shallow-transfer machine translation engine, Apertium, and existing open-source linguistic data to build a bidirectional machine translation system for a new pair of 'small' languages, Catalan (6 million speakers) and the Aranese variety (5000 speakers) of Occitan (about 1 million speakers), and discuss its possible uses and their effects on the linguistic normalization of the smaller language.

CiteSeerX

An open-source shallow-transfer machine translation toolbox: consequences of its release and availability

Author: Armentano Oller Carme
Bonev Boyan
Corbí Bellot Antonio Miguel
Forcada Mikel L.
Ginestí Rosell Mireia
Ortiz Rojas Sergio
Pérez-Ortiz Juan Antonio
Ramírez Sánchez Gema
Sánchez-Martínez Felipe
Publication venue: OSMaTran
Publication date: 01/09/2005

By the time Machine Translation Summit X is held in September 2005, our group will have released an open-source machine translation toolbox as part of a large government-funded project involving four universities and three linguistic technology companies from Spain. The machine translation toolbox, which will most likely be released under a GPL-like license, includes (a) the open-source engine itself, a modular shallow-transfer machine translation engine suitable for related languages and largely based upon that of systems we have already developed, such as interNOSTRUM for Spanish–Catalan and Traductor Universia for Spanish–Portuguese, (b) extensive documentation (including document type declarations) specifying the XML format of all linguistic (dictionaries, rules) and document format management files, (c) compilers converting these data into the high-speed (tens of thousands of words a second) format used by the engine, and (d) pilot linguistic data for Spanish–Catalan and Spanish–Galician and format management specifications for the HTML, RTF and plain text formats. After briefly describing this toolbox, this paper aims to explore possible consequences of the availability of this architecture, including the community-driven development of machine translation systems for languages lacking this kind of linguistic technology.

The development of the toolbox is funded by project FIT-340101-2004-3 (Spanish Ministry of Industry, Commerce and Tourism).

Repositorio Institucional de la Universidad de Alicante

Apertium, una plataforma de código abierto para el desarrollo de sistemas de traducción automática

Author: Armentano Oller Carme
Corbí Bellot Antonio Miguel
Forcada Mikel L.
Ginestí Rosell Mireia
Montava Belda Marco A.
Ortiz Rojas Sergio
Pérez-Ortiz Juan Antonio
Ramírez Sánchez Gema
Sánchez-Martínez Felipe
Publication venue: Universidad de Cádiz. Servicio de Publicaciones
Publication date: 01/01/2007

One of the main challenges facing computing in the coming decades is the development of systems capable of processing natural (human) language effectively. Within this field, machine translation systems, which translate a text written in one language into an equivalent version in another, receive special attention given, for example, the multilingual nature of societies such as Europe's. Automating this process is particularly complex because programs must cope with features of natural language, such as ambiguity, whose algorithmic treatment is not feasible, so that even an approximation or a partial automation of the process is already considered a success. Machine translation programs have traditionally been closed systems, but in recent times the trend set by free software has reached this field too. In this article we describe Apertium (apertium.org), an advanced open-source platform, licensed under the GNU GPL, which, thanks to the decoupling it offers between data and programs, makes it easy to develop new machine translators. The Apertium platform has been developed by the Transducens research group at the Universitat d’Alacant within several collaborative projects with universities and companies in Spain in which, in addition to the programs that make up the translation engine, open linguistic data have been built for Catalan–Spanish, Galician–Spanish, Portuguese–Spanish, French–Catalan, English–Catalan, and Occitan–Catalan machine translation. Both the platform into which the translation engine is integrated and the data for these language pairs are available for download at sf.net/projects/apertium/ and for online evaluation at xixona.dlsi.ua.es/prototype/.

This work has been partially funded by the Ministerio de Industria, Comercio y Turismo through projects FIT-340101-2004-3, FIT-340001-2005-2, and FIT-350401-2006-5, by the Ministerio de Educación y Ciencia through projects TIC2003-08681-C02-01 and TIN2006-15071-C03-01, and by the Generalitat de Catalunya through project DURSI1-05I. Felipe Sánchez-Martínez holds research training grant BES-2004-4711, funded by the European Social Fund and the Ministerio de Educación y Ciencia.

Repositorio Institucional de la Universidad de Alicante

MarIA: Modelos del Lenguaje en Español

Author: Armengol-Estapé Jordi
Armentano Oller Carme
Carrino Casimiro Pio
Gonzalez-Agirre Aitor
Gutiérrez-Fandiño Asier
Llop-Palao Joan
Pàmies Marc
Rodríguez Penagos Carlos
Silveira-Ocampo Joaquín
Villegas Montserrat Marta
Publication venue: Sociedad Española para el Procesamiento del Lenguaje Natural
Publication date: 01/03/2022

This work presents MarIA, a family of Spanish language models and associated resources made available to the industry and the research community. Currently, MarIA includes RoBERTa-base, RoBERTa-large, GPT2 and GPT2-large Spanish language models, which can arguably be presented as the largest and most proficient language models in Spanish. The models were pretrained using a massive corpus of 570GB of clean and deduplicated texts with 135 billion words extracted from the Spanish Web Archive crawled by the National Library of Spain between 2009 and 2019. We assessed the performance of the models with nine existing evaluation datasets and with a novel extractive Question Answering dataset created ex novo. Overall, MarIA models outperform the existing Spanish models across a variety of NLU tasks and training settings.

This work was funded by the Spanish State Secretariat for Digitalization and Artificial Intelligence (SEDIA) within the framework of the Plan-TL.

Repositorio Institucional de la Universidad de Alicante

arXiv.org e-Print Archive

UPCommons. Portal del coneixement obert de la UPC

Open-source shallow-transfer machine translation toolbox: consequences of its release

Author: Antonio M. Corbí-Bellot
Boyan Bonev
Carme Armentano-Oller
Gema Ramírez
Juan Antonio Pérez-Ortiz
Mikel L. Forcada
Mireia Ginestí
Sergio Ortiz-Rojas
Felipe Sánchez-Martínez

CiteSeerX

Open-source Portuguese-Spanish machine translation

Author: Antonio M. Corbí-Bellot
Carme Armentano-Oller
Felipe Sánchez-Martínez
Gema Ramírez-Sánchez
Juan Antonio Pérez-Ortiz
Mikel L. Forcada
Mireia Ginestí-Rosell
Miriam A. Scalco
Rafael C. Carrasco
Sergio Ortiz-Rojas
Publication venue: Springer-Verlag
Publication date: 01/01/2006

This paper describes the current status of development of an open-source shallow-transfer machine translation (MT) system for the [European] Portuguese ↔ Spanish language pair, developed using the OpenTrad Apertium MT toolbox (www.apertium.org). Apertium uses finite-state transducers for lexical processing, hidden Markov models for part-of-speech tagging, and finite-state-based chunking for structural transfer, and is based on a simple rationale: to produce fast, reasonably intelligible and easily correctable translations between related languages, it suffices to use an MT strategy which uses shallow parsing techniques to refine word-for-word MT. This paper briefly describes the MT engine, the formats it uses for linguistic data, and the compilers that convert these data into an efficient format used by the engine, and then goes on to describe in more detail the pilot Portuguese ↔ Spanish linguistic data.
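The rationale stated in this abstract, word-for-word translation refined by shallow pattern rules, can be illustrated with a toy sketch. The lexicon, the tag names, and the single rule below are assumptions invented for this example; they are not Apertium data, formats, or code, and the sketch uses a plain dictionary lookup where the real engine uses finite-state transducers and a hidden-Markov-model tagger.

```python
# Toy shallow transfer for Portuguese -> Spanish: lexical lookup plus one pattern rule.
# Lexicon, tags, and rule are hypothetical and chosen only for illustration.

LEXICON = {
    "a": ("la", "det"),
    "minha": ("mi", "poss"),
    "casa": ("casa", "noun"),
    "é": ("es", "verb"),
    "grande": ("grande", "adj"),
}


def word_for_word(text):
    """Look up each token; unknown words are kept and tagged 'unk'."""
    return [LEXICON.get(token, (token, "unk")) for token in text.lower().split()]


def drop_article_before_possessive(units):
    """Pattern rule: a determiner directly before a possessive is removed."""
    result = []
    for index, (word, tag) in enumerate(units):
        next_tag = units[index + 1][1] if index + 1 < len(units) else None
        if tag == "det" and next_tag == "poss":
            continue
        result.append((word, tag))
    return result


def translate(text):
    """Run the lexical step, then the structural rule, and join the words."""
    units = drop_article_before_possessive(word_for_word(text))
    return " ".join(word for word, _ in units)


print(translate("a minha casa é grande"))  # mi casa es grande
```

Without the rule, the word-for-word output would be "la mi casa es grande"; the rule works only on a short, local pattern of tags, which is the sense in which the transfer is shallow.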

Repositorio Institucional de la Universidad de Alicante

CiteSeerX

Crossref