20 research outputs found
Bridging linguistic typology and multilingual machine translation with multi-view language representations
Sparse language vectors from linguistic typology databases and learned
embeddings from tasks like multilingual machine translation have been
investigated in isolation, without analysing how they could benefit from each
other's language characterisation. We propose to fuse both views using singular
vector canonical correlation analysis and study what kind of information is
induced from each source. By inferring typological features and language
phylogenies, we observe that our representations embed typology and strengthen
correlations with language relationships. We then take advantage of our
multi-view language vector space for multilingual machine translation, where we
achieve competitive overall translation accuracy in tasks that require
information about language similarities, such as language clustering and
ranking candidates for multilingual transfer. With our method, we can easily
project and assess new languages without expensive retraining of massive
multilingual or ranking models, which are major disadvantages of related
approaches.
Comment: 15 pages, 6 figures
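The fusion step described above can be sketched with plain numpy. This is a minimal, assumed implementation of singular vector canonical correlation analysis (SVCCA), not the authors' code: the toy `typology` and `learned` matrices stand in for sparse typological feature vectors and NMT-learned language embeddings for the same set of languages.

```python
# Minimal SVCCA sketch (illustrative, not the paper's implementation):
# fuse a typological view and a learned-embedding view of the same languages.
import numpy as np

def svcca(X, Y, k=2):
    """Project two language-vector views into a shared correlated space.

    X: (n_languages, d1) typology features; Y: (n_languages, d2) learned
    embeddings. Rows of X and Y must correspond to the same languages.
    """
    # Center each view.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # SV step: keep the top-k singular directions of each view.
    Ux, Sx, _ = np.linalg.svd(X, full_matrices=False)
    Uy, Sy, _ = np.linalg.svd(Y, full_matrices=False)
    Xr, Yr = Ux[:, :k] * Sx[:k], Uy[:, :k] * Sy[:k]
    # CCA step: orthonormalize, then SVD of the cross-correlation gives the
    # canonical directions; singular values are the canonical correlations.
    qx, _ = np.linalg.qr(Xr)
    qy, _ = np.linalg.qr(Yr)
    U, corrs, Vt = np.linalg.svd(qx.T @ qy)
    # Average the two projected views to obtain the multi-view vectors.
    return (qx @ U + qy @ Vt.T) / 2, corrs

rng = np.random.default_rng(0)
typology = rng.random((10, 50))          # dense stand-in for typology features
learned = rng.standard_normal((10, 64))  # stand-in for NMT language embeddings
fused, corrs = svcca(typology, learned, k=2)
print(fused.shape, corrs.shape)  # (10, 2) (2,)
```

The fused vectors can then be compared with cosine distance for tasks like language clustering or transfer-candidate ranking; a new language only needs its two views projected, with no retraining.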
Exploring Enhanced Code-Switched Noising for Pretraining in Neural Machine Translation
Multilingual pretraining approaches in Neural Machine Translation (NMT) have shown that training models to denoise synthetic code-switched data can yield impressive performance gains, owing to better multilingual semantic representations and transfer learning. However, these approaches generate the synthetic code-switched data using non-contextual, one-to-one word translations obtained from lexicons, which can lead to significant noise in a variety of cases, including poor handling of polysemes and multi-word expressions, violation of linguistic agreement, and inability to scale to agglutinative languages. To overcome these limitations, we propose an approach called Contextual Code-Switching (CCS), in which contextual, many-to-many word translations are generated using a `base' NMT model. We conduct experiments on three language families (Romance, Uralic, and Indo-Aryan) and show significant improvements (up to 5.5 spBLEU points) over previous lexicon-based state-of-the-art approaches. We also observe that small CCS models can perform comparably to or better than massive models like mBART50 and mRASP2, depending on the amount of data provided. We empirically analyse several key factors responsible for these gains, including context, many-to-many substitutions, and the number of code-switched languages, and show that each contributes to enhanced pretraining of multilingual NMT models.
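The lexicon-based noising baseline that the abstract criticises can be sketched in a few lines. This is a toy illustration with an invented mini-lexicon, not the paper's data or code; CCS would replace the dictionary lookup with translations produced in context by a base NMT model.

```python
# Sketch of one-to-one, non-contextual code-switched noising (the baseline):
# each source token with a lexicon entry is swapped for its single translation.
import random

LEXICON = {"dog": "perro", "house": "casa", "runs": "corre"}  # toy lexicon

def codeswitch_noise(tokens, lexicon, ratio=0.5, seed=0):
    """Replace each token that has a lexicon entry with probability `ratio`."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok in lexicon and rng.random() < ratio:
            out.append(lexicon[tok])  # non-contextual: ignores polysemy,
        else:                         # agreement, and multi-word expressions
            out.append(tok)
    return out

noised = codeswitch_noise("the dog runs home".split(), LEXICON, ratio=1.0)
print(noised)  # ['the', 'perro', 'corre', 'home']
```

The pretraining objective then asks the model to reconstruct the original sentence from the noised one; the weaknesses named in the abstract (polysemes, agreement, agglutination) all stem from the dictionary lookup in the middle of this loop.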
“Estudio de la viabilidad de un proyecto inmobiliario de vivienda multifamiliar ubicada en el distrito de Lince, límite con San Isidro”
In a situation where consumers are in a state of caution and distrust
because of the current economic recession, they stop spending, avoid
long-term debt, and do not invest. Consequently, fewer families will apply
for mortgage loans, housing demand slows down, sales fall, and the housing
stock grows.
Unmet housing demand continues to rise in the city of Lima. Lince is one
of the districts where the housing deficit sits slightly below the housing
supply; over recent years its supply of housing has grown and, despite the
market slowdown, the district remains attractive for homebuyers. According
to the PER index, which divides a home's sale price by its annual rental
price, it is an attractive district not only for families but also for
investors who buy homes to place on the rental market.
We located three adjoining lots, each 8 metres wide by 40 metres deep,
for a total of 960 m2. The site, on Calle Francisco de Zela, is well
located: close to the district of San Isidro, to schools, to supermarkets,
to the Mariscal Castilla park, and to other urban amenities.
Our product is presented to the client by selling the idea of belonging
to the district of San Isidro while actually being in Lince. It targets
families who cannot obtain a mortgage loan to buy a home in San Isidro but
aspire to a lifestyle similar to that district's. Accordingly, the project
name "Las Palmeras" refers to the name that the street on which the project
sits takes once it enters the district of San Isidro, half a block away.
The project proposes a five-storey building over a semi-basement, six
residential floors in total, with two basement parking levels and 35
apartments. The units are flats with two or three bedrooms and average
areas ranging from 80 m2 to 113 m2.
Designing the product around the client's needs, creating value for the
target customers, and deploying a marketing strategy that communicates the
product correctly will keep our sales velocity from falling and hold it at
2 units sold per month. Merely having a sales velocity close to one unit
per month would give the project a negative NPV at the projected discount
rate.
Tesis
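The PER index mentioned above is a simple ratio: the sale price of a home divided by the rent it would earn in a year. The figures below are hypothetical, chosen only to show the arithmetic, not the thesis's actual market data.

```python
# Illustrative PER computation (hypothetical figures, not the thesis's data):
# PER = sale price / annual rent; a lower PER means a rental investor
# recovers the purchase price in fewer years of rent.
price = 450_000       # hypothetical apartment price, in soles
monthly_rent = 2_000  # hypothetical monthly rent, in soles
per = price / (monthly_rent * 12)
print(round(per, 1))  # 18.8 years of rent to recover the price
```

Districts with a lower PER, other things being equal, are the more attractive ones for buy-to-rent investors, which is the sense in which the thesis calls Lince attractive.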
Meeting the Needs of Low-Resource Languages: The Value of Automatic Alignments via Pretrained Models
Large multilingual models have inspired a new class of word alignment
methods, which work well for the model's pretraining languages. However, the
languages most in need of automatic alignment are low-resource and, thus, not
typically included in the pretraining data. In this work, we ask: How do modern
aligners perform on unseen languages, and are they better than traditional
methods? We contribute gold-standard alignments for Bribri--Spanish,
Guarani--Spanish, Quechua--Spanish, and Shipibo-Konibo--Spanish. With these, we
evaluate state-of-the-art aligners with and without model adaptation to the
target language. Finally, we also evaluate the resulting alignments
extrinsically through two downstream tasks: named entity recognition and
part-of-speech tagging. We find that although transformer-based methods
generally outperform traditional models, the two classes of approach remain
competitive with each other.
Comment: EACL 202
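The standard intrinsic metric for evaluating aligners against gold annotations such as the Bribri–Spanish or Guarani–Spanish data above is Alignment Error Rate (AER), computed from sure (S) and possible (P) gold links. The sketch below uses invented toy alignments, not the released data.

```python
# Alignment Error Rate (AER) sketch with toy data.
# AER = 1 - (|A∩S| + |A∩P|) / (|A| + |S|), where A is the predicted link set,
# S the sure gold links, and P the possible gold links (P includes S).
def aer(predicted, sure, possible):
    a, s = set(predicted), set(sure)
    p = set(possible) | s  # ensure sure links count as possible too
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

# Links are (source_index, target_index) pairs.
gold_sure = {(0, 0), (1, 2)}
gold_possible = {(0, 0), (1, 2), (2, 1)}
pred = [(0, 0), (1, 2), (2, 3)]  # one correct-sure pair missed nothing,
                                 # but (2, 3) is a spurious link
print(round(aer(pred, gold_sure, gold_possible), 3))  # 0.2
```

Lower is better; a prediction containing exactly the sure links and only possible links scores 0.0.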
AmericasNLI: Machine translation and natural language inference systems for Indigenous languages of the Americas
Little attention has been paid to the development of human language technology for truly low-resource languages—i.e., languages with limited amounts of digitally available text data, such as Indigenous languages. However, it has been shown that pretrained multilingual models are able to perform crosslingual transfer in a zero-shot setting even for low-resource languages which are unseen during pretraining. Yet, prior work evaluating performance on unseen languages has largely been limited to shallow token-level tasks. It remains unclear if zero-shot learning of deeper semantic tasks is possible for unseen languages. To explore this question, we present AmericasNLI, a natural language inference dataset covering 10 Indigenous languages of the Americas. We conduct experiments with pretrained models, exploring zero-shot learning in combination with model adaptation. Furthermore, as AmericasNLI is a multiway parallel dataset, we use it to benchmark the performance of different machine translation models for those languages. Finally, using a standard transformer model, we explore translation-based approaches for natural language inference. We find that the zero-shot performance of pretrained models without adaptation is poor for all languages in AmericasNLI, but model adaptation via continued pretraining results in improvements. All machine translation models are rather weak, but, surprisingly, translation-based approaches to natural language inference outperform all other models on that task.
SIGMORPHON 2021 Shared Task on Morphological Reinflection: Generalization Across Languages
This year's iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems' predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems' performance on previously unseen lemmas.
Peer reviewed
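Figures such as ">90% accuracy" in reinflection shared tasks are exact-match accuracy: a predicted inflected form counts only if it is character-for-character identical to the reference. The toy forms below are illustrative, not shared-task data.

```python
# Exact-match accuracy for morphological reinflection (toy example).
def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to their reference form."""
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

refs = ["corrió", "corrieron", "corría"]
preds = ["corrió", "corrieron", "corria"]  # last form misspelled (no accent)
print(exact_match_accuracy(preds, refs))  # two of three correct
```

This strictness is why the misspelled forms and repetitive loops mentioned above translate directly into lower reported accuracy.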