315 research outputs found

    Do we really need fully unsupervised cross-lingual embeddings?

    Recent efforts in cross-lingual word embedding (CLWE) learning have predominantly focused on fully unsupervised approaches that project monolingual embeddings into a shared cross-lingual space without any cross-lingual signal. The lack of any supervision makes such approaches conceptually attractive. Yet, their only core difference from (weakly) supervised projection-based CLWE methods is in the way they obtain a seed dictionary used to initialize an iterative self-learning procedure. The fully unsupervised methods have arguably become more robust, and their primary use case is CLWE induction for pairs of resource-poor and distant languages. In this paper, we question the ability of even the most robust unsupervised CLWE approaches to induce meaningful CLWEs in these more challenging settings. A series of bilingual lexicon induction (BLI) experiments with 15 diverse languages (210 language pairs) show that fully unsupervised CLWE methods still fail for a large number of language pairs (e.g., they yield zero BLI performance for 87/210 pairs). Even when they succeed, they never surpass the performance of weakly supervised methods (seeded with 500-1,000 translation pairs) using the same self-learning procedure in any BLI setup, and the gaps are often substantial. These findings call for revisiting the main motivations behind fully unsupervised CLWE methods.
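The BLI scores mentioned in this abstract are typically computed by nearest-neighbour retrieval in the shared space. The sketch below is an illustration only, not the paper's evaluation code: the function names (`cosine`, `precision_at_1`), the toy vectors, and the choice of plain cosine retrieval are all assumptions made for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def precision_at_1(src_vecs, tgt_vecs, gold):
    """Fraction of source words whose nearest target word is a gold translation.

    src_vecs, tgt_vecs: dict mapping word -> vector, both already in the shared space.
    gold: dict mapping source word -> set of acceptable target translations.
    """
    hits = 0
    for src_word, translations in gold.items():
        query = src_vecs[src_word]
        # Retrieve the target word with the highest cosine similarity.
        best = max(tgt_vecs, key=lambda w: cosine(query, tgt_vecs[w]))
        hits += best in translations
    return hits / len(gold)

# Hypothetical toy embeddings, assumed to be mapped into one space already.
src = {"dog": [1.0, 0.1], "cat": [0.1, 1.0], "house": [0.7, 0.7]}
tgt = {"hund": [0.9, 0.2], "katze": [0.2, 0.9], "haus": [0.6, 0.8]}
gold = {"dog": {"hund"}, "cat": {"katze"}, "house": {"haus"}}

score = precision_at_1(src, tgt, gold)
```

Under this kind of metric, "zero BLI performance" for a language pair would mean that essentially no test word retrieves a correct translation.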

    Refinement of Unsupervised Cross-Lingual Word Embeddings

    Cross-lingual word embeddings aim to bridge the gap between high-resource and low-resource languages by making it possible to learn multilingual word representations without any direct bilingual signal. The lion's share of the methods are projection-based approaches that map pre-trained embeddings into a shared latent space. These methods are mostly based on an orthogonal transformation, which assumes that the vector spaces of the languages are isomorphic. However, this assumption does not necessarily hold, especially for morphologically rich languages. In this paper, we propose a self-supervised method to refine the alignment of unsupervised bilingual word embeddings. The proposed model moves the vectors of words and their corresponding translations closer to each other and enforces length- and center-invariance, thus allowing better alignment of cross-lingual embeddings. The experimental results demonstrate the effectiveness of our approach, as in most cases it outperforms state-of-the-art methods in a bilingual lexicon induction task.
    Comment: Accepted at the 24th European Conference on Artificial Intelligence (ECAI 2020)
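One common way to approximate length- and center-invariance is to alternate unit-length scaling and mean centering over an embedding matrix. The sketch below illustrates that general idea under stated assumptions; it is not the refinement model proposed in the paper, and the helper names (`unit_length`, `mean_center`, `iterative_normalize`) and the default of five rounds are hypothetical choices for the example.

```python
import math

def unit_length(vectors):
    """Scale every vector to Euclidean length 1 (zero vectors are left unchanged)."""
    out = []
    for v in vectors:
        n = math.sqrt(sum(x * x for x in v))
        out.append([x / n for x in v] if n else list(v))
    return out

def mean_center(vectors):
    """Subtract the per-dimension mean so the set of vectors has zero mean."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    return [[v[i] - mean[i] for i in range(dim)] for v in vectors]

def iterative_normalize(vectors, rounds=5):
    """Alternate length normalization and centering, ending on unit length.

    Assumption: a handful of rounds is enough for the illustration; the
    mean is only approximately zero after the final rescaling.
    """
    for _ in range(rounds):
        vectors = mean_center(unit_length(vectors))
    return unit_length(vectors)

# Hypothetical toy embeddings for three words.
normed = iterative_normalize([[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]])
```

Preprocessing both monolingual spaces this way before fitting an orthogonal map is one option for reducing the mismatch the abstract describes; whether it helps for a given language pair is an empirical question.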

    Character-level and syntax-level models for low-resource and multilingual natural language processing

    There are more than 7000 languages in the world, but only a small portion of them benefit from Natural Language Processing resources and models. Although languages generally present different characteristics, “cross-lingual bridges” can be exploited, such as transliteration signals and word alignment links. Such information, together with the availability of multiparallel corpora and the urge to overcome language barriers, motivates us to build models that represent more of the world’s languages. This thesis investigates cross-lingual links for improving the processing of low-resource languages with language-agnostic models at the character and syntax level. Specifically, we propose to (i) use orthographic similarities and transliteration between Named Entities and rare words in different languages to improve the construction of Bilingual Word Embeddings (BWEs) and named entity resources, and (ii) exploit multiparallel corpora for projecting labels from high- to low-resource languages, thereby gaining access to weakly supervised processing methods for the latter. In the first publication, we describe our approach for improving the translation of rare words and named entities for the Bilingual Dictionary Induction (BDI) task, using orthography and transliteration information. In our second work, we tackle BDI by enriching BWEs with orthography embeddings and a number of other features, using our classification-based system to overcome script differences among languages. The third publication describes cheap cross-lingual signals that should be considered when building mapping approaches for BWEs since they are simple to extract, effective for bootstrapping the mapping of BWEs, and overcome the failure of unsupervised methods. The fourth paper shows our approach for extracting a named entity resource for 1340 languages, including very low-resource languages from all major areas of linguistic diversity. We exploit parallel corpus statistics and transliteration models and obtain improved performance over prior work. Lastly, the fifth work models annotation projection as a graph-based label propagation problem for the part of speech tagging task. Part of speech models trained on our labeled sets outperform prior work for low-resource languages like Bambara (an African language spoken in Mali), Erzya (a Uralic language spoken in Russia’s Republic of Mordovia), Manx (the Celtic language of the Isle of Man), and Yoruba (a Niger-Congo language spoken in Nigeria and surrounding countries).

    Machine learning with limited label availability: algorithms and applications

    The abstract is in the attachment.

    Limitations and challenges of unsupervised cross-lingual pre-training

    Cross-lingual alignment methods for monolingual language representations have received notable research attention in the past few years due to their capacity to induce bilingual alignments with little or no supervision signal. However, their use in machine translation pre-training, a function that monolingual models excel at, and which should benefit from cross-lingual information, remains limited. This work tries to shed light on the effects of some of the factors that play a role in cross-lingual representations and pre-training strategies, with the hope that it can help guide future endeavors in the field. To this end, it studies the two main components that constitute cross-lingual pre-training: cross-lingual mappings and their integration into pre-training. The former are explored through some widely known fully unsupervised cross-lingual methods, which rely on distributional similarities between languages. Consequently, they are a useful testing ground for examining the effects of language similarity on both cross-lingual mapping techniques and the representation spaces over which they operate. In pre-training integration, cross-lingual representation spaces are used to pre-train neural machine translation models, which are compared against schemes that employ independent monolingual spaces. The results show that weakly supervised cross-lingual methods are remarkably effective at inducing alignments even for distant languages, and that they benefit noticeably from subword information. However, the effect of cross-lingual alignment in pre-training is diminished due to difficulties in maintaining the structure of the projection during training, and the limited influence that pre-training itself has on the supervised model.
    Quesada Zaragoza, M. (2021). Limitations and challenges of unsupervised cross-lingual pre-training. Universitat Politècnica de València. http://hdl.handle.net/10251/174111 TFG