Multilingual Text Representation
Modern NLP breakthroughs include large multilingual models capable of
performing tasks across more than 100 languages. State-of-the-art language
models have come a long way, starting from simple one-hot representations of
words and progressing to models capable of natural language understanding,
common-sense reasoning, and question answering, thus capturing both the syntax
and semantics of text. At the same time, language models are expanding beyond
our known language boundaries, performing competitively even on very
low-resource dialects of endangered languages. However, problems remain to be
solved to ensure an equitable representation of texts through a unified
modeling space across languages and speakers. In this survey, we shed light on
this iterative progression of multilingual text representation and discuss the
driving factors that ultimately led to the current state-of-the-art.
Subsequently, we discuss how the full potential of language democratization
could be reached beyond the known limits, and what the scope of improvement in
that space is. Comment: PhD Comprehensive exam report
Inducing Language-Agnostic Multilingual Representations
Cross-lingual representations have the potential to make NLP techniques
available to the vast majority of languages in the world. However, they
currently require large pretraining corpora or access to typologically similar
languages. In this work, we address these obstacles by removing language
identity signals from multilingual embeddings. We examine three approaches for
this: (i) re-aligning the vector spaces of target languages (all together) to a
pivot source language; (ii) removing language-specific means and variances,
which yields better discriminativeness of embeddings as a by-product; and (iii)
increasing input similarity across languages by removing morphological
contractions and sentence reordering. We evaluate on XNLI and reference-free MT
across 19 typologically diverse languages. Our findings expose the limitations
of these approaches -- unlike vector normalization, vector space re-alignment
and text normalization do not achieve consistent gains across encoders and
languages. Due to the approaches' additive effects, their combination decreases
the cross-lingual transfer gap by 8.9 points (m-BERT) and 18.2 points (XLM-R)
on average across all tasks and languages, however. Our code and models are
publicly available.Comment: *SEM2021 Camera Read
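The mean/variance removal in approach (ii) can be sketched as follows; this is a minimal illustration assuming embeddings are stored as NumPy arrays keyed by language code (the function and variable names are illustrative, not from the paper's released code):

```python
import numpy as np

def normalize_per_language(embeddings_by_lang):
    """Strip language identity signals carried by first and second
    moments: for each language, center its embeddings and scale each
    dimension to unit standard deviation."""
    normalized = {}
    for lang, X in embeddings_by_lang.items():
        X = np.asarray(X, dtype=float)
        mu = X.mean(axis=0, keepdims=True)        # language-specific mean
        sigma = X.std(axis=0, keepdims=True)      # language-specific std
        normalized[lang] = (X - mu) / (sigma + 1e-8)
    return normalized

# Toy example: two "languages" whose embeddings differ only by an
# offset and a scale, mimicking a language identity signal.
rng = np.random.default_rng(0)
emb = {
    "en": rng.normal(loc=0.0, scale=1.0, size=(100, 8)),
    "de": rng.normal(loc=5.0, scale=3.0, size=(100, 8)),
}
norm = normalize_per_language(emb)
# After normalization, each language's embeddings are centered with
# unit per-dimension variance, so the mean/variance no longer reveal
# which language an embedding came from.
```

As the abstract notes, this normalization also improves the discriminativeness of the embeddings as a by-product, since language-level offsets no longer dominate distances between sentence vectors.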
Conception: Multilingually-Enhanced, Human-Readable Concept Vector Representations
To date, the most successful word, word sense, and concept modelling techniques have used large corpora and knowledge resources to produce dense vector representations that capture semantic similarities in a relatively low-dimensional space. Most current approaches, however, suffer from a monolingual bias, with their strength depending on the amount of data available across languages. In this paper, we address this issue and propose Conception, a novel technique for building language-independent vector representations of concepts which places multilinguality at its core while retaining explicit relationships between concepts. Our approach results in high-coverage representations that outperform the state of the art in multilingual and cross-lingual Semantic Word Similarity and Word Sense Disambiguation, proving particularly robust on low-resource languages. Conception – its software and the complete set of representations – is available at https://github.com/SapienzaNLP/conception.