20 research outputs found

    Bridging linguistic typology and multilingual machine translation with multi-view language representations

    Get PDF
    Sparse language vectors from linguistic typology databases and learned embeddings from tasks like multilingual machine translation have been investigated in isolation, without analysing how they could benefit from each other's language characterisation. We propose to fuse both views using singular vector canonical correlation analysis and study what kind of information is induced from each source. By inferring typological features and language phylogenies, we observe that our representations embed typology and strengthen correlations with language relationships. We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy in tasks that require information about language similarities, such as language clustering and ranking candidates for multilingual transfer. With our method, we can easily project and assess new languages without expensive retraining of massive multilingual or ranking models, which are major disadvantages of related approaches. Comment: 15 pages, 6 figures.
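    As a rough illustration of the fusion step, the sketch below reduces each view with SVD and then applies canonical correlation analysis, in the spirit of singular vector canonical correlation analysis (SVCCA). The array names, dimensions, and the use of scikit-learn's CCA are our assumptions, not the authors' implementation.

```python
# A minimal SVCCA sketch, assuming two numpy arrays with rows aligned by
# language: `typology` (languages x sparse typological features, e.g. from
# a database such as URIEL) and `learned` (languages x NMT embedding dims).
import numpy as np
from sklearn.cross_decomposition import CCA

def svcca_fuse(typology: np.ndarray, learned: np.ndarray,
               svd_dims: int = 50, cca_dims: int = 20) -> np.ndarray:
    """Fuse two views of the same languages into one multi-view space."""
    def svd_reduce(X, k):
        X = X - X.mean(axis=0)                  # center each feature
        U, S, _ = np.linalg.svd(X, full_matrices=False)
        return U[:, :k] * S[:k]                 # keep top-k singular directions

    A = svd_reduce(typology, min(svd_dims, min(typology.shape)))
    B = svd_reduce(learned, min(svd_dims, min(learned.shape)))

    cca = CCA(n_components=cca_dims)
    A_c, B_c = cca.fit_transform(A, B)          # maximally correlated projections
    return np.concatenate([A_c, B_c], axis=1)   # multi-view language vectors

# Toy usage: 30 languages, 100 typological features, 512-dim embeddings.
rng = np.random.default_rng(0)
fused = svcca_fuse(rng.random((30, 100)), rng.random((30, 512)),
                   svd_dims=20, cca_dims=10)
print(fused.shape)  # (30, 20)
```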

    Exploring Enhanced Code-Switched Noising for Pretraining in Neural Machine Translation

    Get PDF
    Multilingual pretraining approaches in Neural Machine Translation (NMT) have shown that training models to denoise synthetic code-switched data can yield impressive performance gains, owing to better multilingual semantic representations and transfer learning. However, these approaches generate the synthetic code-switched data using non-contextual, one-to-one word translations obtained from lexicons, which can introduce significant noise in a variety of cases, including poor handling of polysemes and multi-word expressions, violations of linguistic agreement, and an inability to scale to agglutinative languages. To overcome these limitations, we propose an approach called Contextual Code-Switching (CCS), where contextual, many-to-many word translations are generated using a `base' NMT model. We conduct experiments on three language families (Romance, Uralic, and Indo-Aryan) and show significant improvements (by up to 5.5 spBLEU points) over the previous lexicon-based SOTA approaches. We also observe that small CCS models can perform comparably to, or better than, massive models like mBART50 and mRASP2, depending on the size of the data provided. We empirically analyse several key factors behind these gains, including context, many-to-many substitutions, and the number of code-switched languages, and show that they all contribute to enhanced pretraining of multilingual NMT models.
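    A minimal sketch of the noising step is below. The `base_nmt_translate` helper is a hypothetical stand-in for the real base NMT model, and the contiguous-span heuristic is our illustrative choice, not the paper's code.

```python
# Hedged sketch of contextual code-switched noising in the spirit of CCS:
# a source span is replaced by a translation from a base model, and the
# (noised, original) pair becomes a denoising pretraining example.
import random

def base_nmt_translate(span: str, context: str, tgt_lang: str) -> str:
    # Hypothetical stand-in: a real implementation would translate the full
    # sentence with the base NMT model and extract the target words aligned
    # to `span`, yielding contextual, many-to-many substitutions.
    return f"<{tgt_lang}:{span}>"  # placeholder so the sketch runs

def ccs_noise(sentence: str, tgt_langs: list[str],
              ratio: float = 0.3, seed: int = 0) -> tuple[str, str]:
    """Return a (code-switched input, clean target) denoising pair."""
    rng = random.Random(seed)
    tokens = sentence.split()
    n = max(1, int(len(tokens) * ratio))
    start = rng.randrange(len(tokens) - n + 1)   # contiguous span, so that
    span = " ".join(tokens[start:start + n])     # multi-word expressions
    lang = rng.choice(tgt_langs)                 # stay together
    switched = base_nmt_translate(span, sentence, lang).split()
    noised = tokens[:start] + switched + tokens[start + n:]
    return " ".join(noised), sentence

print(ccs_noise("the quick brown fox jumps over the lazy dog", ["fr", "hi"]))
```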

    “Feasibility study of a multifamily housing real-estate project located in the district of Lince, on the border with San Isidro”

    Get PDF
    Given the current economic recession, consumers are cautious and distrustful: they stop spending, avoid long-term debt, and do not invest. As a result, fewer families apply for mortgage loans, housing demand slows, sales fall, and the stock of unsold homes grows. Even so, unmet housing demand continues to grow in the city of Lima. Lince is one of the districts where the housing deficit is slightly below the housing supply; its supply has grown in recent years and, despite the market slowdown, it remains an attractive district in which to buy a home. According to the PER index (the home price divided by its annual rent), the district is attractive not only to families but also to investors who buy homes to rent out. We identified three adjoining lots, each with 8 meters of frontage and 40 meters of depth, for a total of 960 m2. The site, on Francisco de Zela street, is well located: close to the district of San Isidro, to schools, supermarkets, and Mariscal Castilla park, among other urban amenities. Our product is presented to the client by selling the idea of belonging to San Isidro while actually being in Lince; it targets families who cannot obtain a mortgage to buy in San Isidro but aspire to a similar lifestyle. Accordingly, the project name “Las Palmeras” refers to the name the project's street takes once it crosses into San Isidro, half a block from the site. The project proposes a building of 5 floors plus a semi-basement (6 residential levels in total), 2 basement parking levels, and 35 apartments. The flat-type apartments have two or three bedrooms and average areas ranging from 80 m2 to 113 m2. Designing the product around customer needs, creating value for the target customers, and a marketing strategy that communicates the product correctly will keep our sales velocity from falling, holding it at 2 units sold per month. A sales velocity of only about one unit per month is enough to make the project's NPV negative at the projected discount rate. Thesis
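    To make the two metrics concrete, here is a small numeric sketch of the PER index and of how sales velocity moves the NPV. Every figure below is a made-up placeholder, not data from the study.

```python
# PER = sale price / annual rent; NPV discounts a cashflow stream at a
# periodic rate. All numbers are illustrative assumptions.

def per(price: float, monthly_rent: float) -> float:
    """Years of rent needed to repay the purchase price."""
    return price / (monthly_rent * 12)

def npv(rate: float, cashflows: list[float]) -> float:
    """Net present value of monthly cashflows at a monthly discount rate."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows))

# Hypothetical: a unit priced at 150,000 renting for 600/month.
print(round(per(150_000, 600), 1))  # 20.8 years

# Hypothetical sales scenarios with the same total revenue: faster sales
# pull cash forward, so the slow scenario discounts to a lower NPV.
outlay = [-200_000]                 # month 0: land + construction draw
fast = outlay + [12_000] * 18       # 2 units/month for 18 months
slow = outlay + [6_000] * 36        # 1 unit/month for 36 months
rate = 0.01                         # 1% monthly discount rate (assumed)
print(round(npv(rate, fast)), round(npv(rate, slow)))
```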

    Meeting the Needs of Low-Resource Languages: The Value of Automatic Alignments via Pretrained Models

    Full text link
    Large multilingual models have inspired a new class of word alignment methods, which work well for the model's pretraining languages. However, the languages most in need of automatic alignment are low-resource and, thus, not typically included in the pretraining data. In this work, we ask: How do modern aligners perform on unseen languages, and are they better than traditional methods? We contribute gold-standard alignments for Bribri–Spanish, Guarani–Spanish, Quechua–Spanish, and Shipibo-Konibo–Spanish. With these, we evaluate state-of-the-art aligners with and without model adaptation to the target language. Finally, we also evaluate the resulting alignments extrinsically through two downstream tasks: named entity recognition and part-of-speech tagging. We find that although transformer-based methods generally outperform traditional models, the two classes of approach remain competitive with each other. Comment: EACL 2023.
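    The core idea behind the transformer-based aligners can be sketched in a few lines: score every source-target token pair by cosine similarity of contextual embeddings and keep mutual argmaxes. The mutual-argmax (intersection) heuristic and the random toy embeddings below are illustrative assumptions in the spirit of tools like SimAlign, not any particular tool's API.

```python
# Embedding-based word alignment sketch: token embeddings are assumed given
# (e.g. from a multilingual encoder applied to each sentence).
import numpy as np

def align(src_emb: np.ndarray, tgt_emb: np.ndarray) -> list[tuple[int, int]]:
    """src_emb: (m, d), tgt_emb: (n, d) token embeddings -> aligned index pairs."""
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T                 # (m, n) cosine similarities
    fwd = sim.argmax(axis=1)          # best target for each source token
    bwd = sim.argmax(axis=0)          # best source for each target token
    # Keeping only mutual argmaxes trades recall for precision.
    return [(i, j) for i, j in enumerate(fwd) if bwd[j] == i]

rng = np.random.default_rng(1)
print(align(rng.normal(size=(5, 16)), rng.normal(size=(7, 16))))
```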

    AmericasNLI: Machine translation and natural language inference systems for Indigenous languages of the Americas

    Full text link
    Little attention has been paid to the development of human language technology for truly low-resource languages, i.e., languages with limited amounts of digitally available text data, such as Indigenous languages. However, it has been shown that pretrained multilingual models are able to perform crosslingual transfer in a zero-shot setting even for low-resource languages which are unseen during pretraining. Yet, prior work evaluating performance on unseen languages has largely been limited to shallow token-level tasks. It remains unclear if zero-shot learning of deeper semantic tasks is possible for unseen languages. To explore this question, we present AmericasNLI, a natural language inference dataset covering 10 Indigenous languages of the Americas. We conduct experiments with pretrained models, exploring zero-shot learning in combination with model adaptation. Furthermore, as AmericasNLI is a multiway parallel dataset, we use it to benchmark the performance of different machine translation models for those languages. Finally, using a standard transformer model, we explore translation-based approaches for natural language inference. We find that the zero-shot performance of pretrained models without adaptation is poor for all languages in AmericasNLI, but model adaptation via continued pretraining results in improvements. All machine translation models are rather weak, but, surprisingly, translation-based approaches to natural language inference outperform all other models on that task.
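    A hedged sketch of the translation-based (translate-test) idea: translate premise and hypothesis into a high-resource pivot language, then run an off-the-shelf NLI classifier. The `mt_translate` stub stands in for whichever translation model is used, and `roberta-large-mnli` is just one publicly available English NLI model, not the paper's setup.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

def mt_translate(text: str, src_lang: str, tgt_lang: str = "en") -> str:
    # Hypothetical stand-in: a real system would call the trained MT model
    # to move the text into the pivot language before classification.
    return text

tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

def translate_test_nli(premise: str, hypothesis: str, src_lang: str) -> str:
    p = mt_translate(premise, src_lang)
    h = mt_translate(hypothesis, src_lang)
    inputs = tok(p, h, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = nli(**inputs).logits
    # Label set of this model: CONTRADICTION / NEUTRAL / ENTAILMENT.
    return nli.config.id2label[int(logits.argmax())]

print(translate_test_nli("A man is playing a guitar.", "Someone makes music.", "quy"))
```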

    SIGMORPHON 2021 Shared Task on Morphological Reinflection: Generalization Across Languages

    Get PDF
    This year's iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems' predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with limited amounts of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. Some systems, RNN-based models in particular, also produce misspelled forms or fall into repetitive loops. Finally, we report a large drop in systems' performance on previously unseen lemmas. Peer reviewed.
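    For readers unfamiliar with the task format, reinflection is typically framed as character-level sequence-to-sequence generation from a lemma plus a UniMorph tag bundle. The encoding below is a common convention sketched with an illustrative English example, not shared-task data.

```python
# Source = lemma characters + tag symbols; target = inflected-form characters.
def encode_example(lemma: str, tags: str, form: str | None = None):
    src = list(lemma) + [f"<{t}>" for t in tags.split(";")]
    tgt = list(form) if form is not None else None
    return src, tgt

src, tgt = encode_example("sing", "V;PST", "sang")
print(src)  # ['s', 'i', 'n', 'g', '<V>', '<PST>']
print(tgt)  # ['s', 'a', 'n', 'g']

# A multiword lemma keeps its internal space as just another source symbol,
# one plausible reason systems find such lemmas hard to inflect.
print(encode_example("give up", "V;PST", "gave up")[0])
```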

    UniMorph 4.0: Universal Morphology

    Get PDF