20 research outputs found
Bridging linguistic typology and multilingual machine translation with multi-view language representations
Sparse language vectors from linguistic typology databases and learned
embeddings from tasks like multilingual machine translation have been
investigated in isolation, without analysing how they could benefit from each
other's language characterisation. We propose to fuse both views using singular
vector canonical correlation analysis and study what kind of information is
induced from each source. By inferring typological features and language
phylogenies, we observe that our representations embed typology and strengthen
correlations with language relationships. We then take advantage of our
multi-view language vector space for multilingual machine translation, where we
achieve competitive overall translation accuracy in tasks that require
information about language similarities, such as language clustering and
ranking candidates for multilingual transfer. With our method, we can easily
project and assess new languages without expensive retraining of massive
multilingual or ranking models, which are major disadvantages of related
approaches.
Comment: 15 pages, 6 figures
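The fusion step described above can be sketched with plain numpy. This is a minimal, assumed implementation of singular vector canonical correlation analysis (SVCCA), not the authors' code: the toy `typology` and `learned` matrices stand in for sparse typological feature vectors and NMT-learned language embeddings for the same set of languages.

```python
# Minimal SVCCA sketch (illustrative, not the paper's implementation):
# fuse a typological view and a learned-embedding view of the same languages.
import numpy as np

def svcca(X, Y, k=2):
    """Project two language-vector views into a shared correlated space.

    X: (n_languages, d1) typology features; Y: (n_languages, d2) learned
    embeddings. Rows of X and Y must correspond to the same languages.
    """
    # Center each view.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # SV step: keep the top-k singular directions of each view.
    Ux, Sx, _ = np.linalg.svd(X, full_matrices=False)
    Uy, Sy, _ = np.linalg.svd(Y, full_matrices=False)
    Xr, Yr = Ux[:, :k] * Sx[:k], Uy[:, :k] * Sy[:k]
    # CCA step: orthonormalize, then SVD of the cross-correlation gives the
    # canonical directions; singular values are the canonical correlations.
    qx, _ = np.linalg.qr(Xr)
    qy, _ = np.linalg.qr(Yr)
    U, corrs, Vt = np.linalg.svd(qx.T @ qy)
    # Average the two projected views to obtain the multi-view vectors.
    return (qx @ U + qy @ Vt.T) / 2, corrs

rng = np.random.default_rng(0)
typology = rng.random((10, 50))          # dense stand-in for typology features
learned = rng.standard_normal((10, 64))  # stand-in for NMT language embeddings
fused, corrs = svcca(typology, learned, k=2)
print(fused.shape, corrs.shape)  # (10, 2) (2,)
```

The fused vectors can then be compared with cosine distance for tasks like language clustering or transfer-candidate ranking; a new language only needs its two views projected, with no retraining.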
Exploring Enhanced Code-Switched Noising for Pretraining in Neural Machine Translation
Multilingual pretraining approaches in Neural Machine Translation (NMT) have shown that training models to denoise synthetic code-switched data can yield impressive performance gains, owing to better multilingual semantic representations and transfer learning. However, these approaches generate the synthetic code-switched data using non-contextual, one-to-one word translations obtained from lexicons, which can lead to significant noise in a variety of cases, including poor handling of polysemes and multi-word expressions, violation of linguistic agreement, and inability to scale to agglutinative languages. To overcome these limitations, we propose an approach called Contextual Code-Switching (CCS), in which contextual, many-to-many word translations are generated using a `base' NMT model. We conduct experiments on three language families (Romance, Uralic, and Indo-Aryan) and show significant improvements (up to 5.5 spBLEU points) over previous lexicon-based state-of-the-art approaches. We also observe that small CCS models can perform comparably to or better than massive models like mBART50 and mRASP2, depending on the amount of data provided. We empirically analyse several key factors responsible for these gains, including context, many-to-many substitutions, and the number of code-switched languages, and show that each contributes to enhanced pretraining of multilingual NMT models.
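The lexicon-based noising baseline that the abstract criticises can be sketched in a few lines. This is a toy illustration with an invented mini-lexicon, not the paper's data or code; CCS would replace the dictionary lookup with translations produced in context by a base NMT model.

```python
# Sketch of one-to-one, non-contextual code-switched noising (the baseline):
# each source token with a lexicon entry is swapped for its single translation.
import random

LEXICON = {"dog": "perro", "house": "casa", "runs": "corre"}  # toy lexicon

def codeswitch_noise(tokens, lexicon, ratio=0.5, seed=0):
    """Replace each token that has a lexicon entry with probability `ratio`."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok in lexicon and rng.random() < ratio:
            out.append(lexicon[tok])  # non-contextual: ignores polysemy,
        else:                         # agreement, and multi-word expressions
            out.append(tok)
    return out

noised = codeswitch_noise("the dog runs home".split(), LEXICON, ratio=1.0)
print(noised)  # ['the', 'perro', 'corre', 'home']
```

The pretraining objective then asks the model to reconstruct the original sentence from the noised one; the weaknesses named in the abstract (polysemes, agreement, agglutination) all stem from the dictionary lookup in the middle of this loop.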
“Estudio de la viabilidad de un proyecto inmobiliario de vivienda multifamiliar ubicada en el distrito de Lince, límite con San Isidro”
In a situation where consumers are in a state of caution and distrust
because of the current economic recession, they stop spending, avoid
long-term debt, and do not invest. Consequently, fewer families will apply
for mortgage loans, housing demand slows down, sales fall, and the housing
stock grows.
Unmet housing demand continues to rise in the city of Lima. Lince is one
of the districts where the housing deficit sits slightly below the housing
supply; over recent years its supply of housing has grown and, despite the
market slowdown, the district remains attractive for homebuyers. According
to the PER index, which divides a home's sale price by its annual rental
price, it is an attractive district not only for families but also for
investors who buy homes to place on the rental market.
We located three adjoining lots, each 8 metres wide by 40 metres deep,
for a total of 960 m2. The site, on Calle Francisco de Zela, is well
located: close to the district of San Isidro, to schools, to supermarkets,
to the Mariscal Castilla park, and to other urban amenities.
Our product is presented to the client by selling the idea of belonging
to the district of San Isidro while actually being in Lince. It targets
families who cannot obtain a mortgage loan to buy a home in San Isidro but
aspire to a lifestyle similar to that district's. Accordingly, the project
name "Las Palmeras" refers to the name that the street on which the project
sits takes once it enters the district of San Isidro, half a block away.
The project proposes a five-storey building over a semi-basement, six
residential floors in total, with two basement parking levels and 35
apartments. The units are flats with two or three bedrooms and average
areas ranging from 80 m2 to 113 m2.
Designing the product around the client's needs, creating value for the
target customers, and deploying a marketing strategy that communicates the
product correctly will keep our sales velocity from falling and hold it at
2 units sold per month. Merely having a sales velocity close to one unit
per month would give the project a negative NPV at the projected discount
rate.
Tesis
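The PER index mentioned above is a simple ratio: the sale price of a home divided by the rent it would earn in a year. The figures below are hypothetical, chosen only to show the arithmetic, not the thesis's actual market data.

```python
# Illustrative PER computation (hypothetical figures, not the thesis's data):
# PER = sale price / annual rent; a lower PER means a rental investor
# recovers the purchase price in fewer years of rent.
price = 450_000       # hypothetical apartment price, in soles
monthly_rent = 2_000  # hypothetical monthly rent, in soles
per = price / (monthly_rent * 12)
print(round(per, 1))  # 18.8 years of rent to recover the price
```

Districts with a lower PER, other things being equal, are the more attractive ones for buy-to-rent investors, which is the sense in which the thesis calls Lince attractive.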
Meeting the Needs of Low-Resource Languages: The Value of Automatic Alignments via Pretrained Models
Large multilingual models have inspired a new class of word alignment
methods, which work well for the model's pretraining languages. However, the
languages most in need of automatic alignment are low-resource and, thus, not
typically included in the pretraining data. In this work, we ask: How do modern
aligners perform on unseen languages, and are they better than traditional
methods? We contribute gold-standard alignments for Bribri--Spanish,
Guarani--Spanish, Quechua--Spanish, and Shipibo-Konibo--Spanish. With these, we
evaluate state-of-the-art aligners with and without model adaptation to the
target language. Finally, we also evaluate the resulting alignments
extrinsically through two downstream tasks: named entity recognition and
part-of-speech tagging. We find that although transformer-based methods
generally outperform traditional models, the two classes of approach remain
competitive with each other.
Comment: EACL 202
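The standard intrinsic metric for evaluating aligners against gold annotations such as the Bribri–Spanish or Guarani–Spanish data above is Alignment Error Rate (AER), computed from sure (S) and possible (P) gold links. The sketch below uses invented toy alignments, not the released data.

```python
# Alignment Error Rate (AER) sketch with toy data.
# AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|), where A is the predicted link set,
# S the sure gold links, and P the possible gold links (P includes S).
def aer(predicted, sure, possible):
    a, s = set(predicted), set(sure)
    p = set(possible) | s  # ensure sure links count as possible too
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

# Links are (source_index, target_index) pairs.
gold_sure = {(0, 0), (1, 2)}
gold_possible = {(0, 0), (1, 2), (2, 1)}
pred = [(0, 0), (1, 2), (2, 3)]  # one correct-sure pair missed nothing,
                                 # but (2, 3) is a spurious link
print(round(aer(pred, gold_sure, gold_possible), 3))  # 0.2
```

Lower is better; a prediction containing exactly the sure links and only possible links scores 0.0.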
AmericasNLI: Machine translation and natural language inference systems for Indigenous languages of the Americas
Little attention has been paid to the development of human language technology for truly low-resource languages—i.e., languages with limited amounts of digitally available text data, such as Indigenous languages. However, it has been shown that pretrained multilingual models are able to perform crosslingual transfer in a zero-shot setting even for low-resource languages which are unseen during pretraining. Yet, prior work evaluating performance on unseen languages has largely been limited to shallow token-level tasks. It remains unclear if zero-shot learning of deeper semantic tasks is possible for unseen languages. To explore this question, we present AmericasNLI, a natural language inference dataset covering 10 Indigenous languages of the Americas. We conduct experiments with pretrained models, exploring zero-shot learning in combination with model adaptation. Furthermore, as AmericasNLI is a multiway parallel dataset, we use it to benchmark the performance of different machine translation models for those languages. Finally, using a standard transformer model, we explore translation-based approaches for natural language inference. We find that the zero-shot performance of pretrained models without adaptation is poor for all languages in AmericasNLI, but model adaptation via continued pretraining results in improvements. All machine translation models are rather weak, but, surprisingly, translation-based approaches to natural language inference outperform all other models on that task.
SIGMORPHON 2021 Shared Task on Morphological Reinflection: Generalization Across Languages
This year's iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems' predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems' performance on previously unseen lemmas.
Peer reviewed
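Figures such as ">90% accuracy" in reinflection shared tasks are exact-match accuracy: a predicted inflected form counts only if it is character-for-character identical to the reference. The toy forms below are illustrative, not shared-task data.

```python
# Exact-match accuracy for morphological reinflection (toy example).
def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to their reference form."""
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

refs = ["corrió", "corrieron", "corría"]
preds = ["corrió", "corrieron", "corria"]  # last form misspelled (no accent)
print(exact_match_accuracy(preds, refs))  # two of three correct
```

This strictness is why the misspelled forms and repetitive loops mentioned above translate directly into lower reported accuracy.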