A Resource-Free Evaluation Metric for Cross-Lingual Word Embeddings Based on Graph Modularity
Cross-lingual word embeddings encode the meaning of words from different
languages into a shared low-dimensional space. An important requirement for
many downstream tasks is that word similarity should be independent of language
- i.e., word vectors within one language should not be more similar to each
other than to words in another language. We measure this characteristic using
modularity, a network measure that quantifies the strength of clusters in a
graph. Modularity has a moderate to strong correlation with three downstream
tasks, even though modularity is based only on the structure of embeddings and
does not require any external resources. We show through experiments that
modularity can serve as an intrinsic validation metric to improve unsupervised
cross-lingual word embeddings, particularly on distant language pairs in
low-resource settings.
Comment: Accepted to ACL 2019, camera-ready
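As a rough illustration of the metric, the following Python sketch builds a k-nearest-neighbor graph over a shared embedding space and computes the modularity of its partition by language. The function name, the choice of k, and the use of networkx's community routines are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np
import networkx as nx

def language_modularity(emb, langs, k=10):
    """emb: (n, d) array of unit-normalized cross-lingual word vectors;
    langs: length-n list of language labels. Returns the modularity of
    the language partition over the k-nearest-neighbor graph."""
    sims = emb @ emb.T                    # cosine similarity (unit vectors)
    np.fill_diagonal(sims, -np.inf)       # exclude self-edges
    G = nx.Graph()
    G.add_nodes_from(range(len(langs)))
    for i, row in enumerate(sims):
        for j in np.argsort(row)[-k:]:    # k nearest neighbors of word i
            G.add_edge(i, int(j))
    # One community per language: high modularity means vectors cluster
    # by language, i.e. the space is poorly mixed across languages.
    communities = [{i for i, l in enumerate(langs) if l == lang}
                   for lang in set(langs)]
    return nx.algorithms.community.modularity(G, communities)
```

Under this reading, lower modularity indicates better cross-lingual mixing, which is the property the metric is used to select for.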
Refinement of Unsupervised Cross-Lingual Word Embeddings
Cross-lingual word embeddings aim to bridge the gap between high-resource and
low-resource languages by making it possible to learn multilingual word
representations even without any direct bilingual signal. The lion's share of
these methods are projection-based approaches that map pre-trained embeddings
into a shared latent space. They mostly rely on an orthogonal transformation,
which assumes the language vector spaces to be isomorphic. However, this
assumption does not necessarily hold, especially for morphologically rich
languages. In this paper, we propose a self-supervised method to refine the
alignment of unsupervised bilingual word embeddings. The proposed model moves
the vectors of words and their corresponding translations closer to each other
and enforces length- and center-invariance, thereby aligning the cross-lingual
embeddings more closely. The experimental results demonstrate the effectiveness
of our approach: in most cases it outperforms state-of-the-art methods in the
bilingual lexicon induction task.
Comment: Accepted at the 24th European Conference on Artificial Intelligence
(ECAI 2020)
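The two invariance constraints mentioned above can be sketched alongside a standard orthogonal Procrustes step on hypothesized translation pairs. This is a minimal sketch under stated assumptions; the names and the single-step update are illustrative, and the paper's self-supervised refinement is more involved.

```python
import numpy as np

def normalize(X):
    """Enforce length- and center-invariance (assumed preprocessing)."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit length
    X = X - X.mean(axis=0, keepdims=True)              # zero mean
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def refine_step(X_src, X_tgt, pairs):
    """pairs: (i, j) index pairs of words and their hypothesized
    translations. Rotates the source space toward the target."""
    X, Y = normalize(X_src), normalize(X_tgt)
    src, tgt = map(list, zip(*pairs))
    # Orthogonal Procrustes: W minimizing ||X[src] @ W - Y[tgt]||_F.
    U, _, Vt = np.linalg.svd(X[src].T @ Y[tgt])
    return X @ (U @ Vt), Y
```

Iterating such a step with a translation dictionary re-induced from nearest neighbors gives the usual self-learning loop that refinement methods of this kind build on.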
Exploiting Cross-Lingual Subword Similarities in Low-Resource Document Classification
Text classification must sometimes be applied in a low-resource language with
no labeled training data. However, training data may be available in a related
language. We investigate whether character-level knowledge transfer from a
related language helps text classification. We present a cross-lingual document
classification framework (CACO) that exploits cross-lingual subword similarity
by jointly training a character-based embedder and a word-based classifier. The
embedder derives vector representations for input words from their written
forms, and the classifier makes predictions based on the word vectors. We use a
joint character representation for both the source language and the target
language, which allows the embedder to generalize knowledge about source
language words to target language words with similar forms. We propose a
multi-task objective that can further improve the model if additional
cross-lingual or monolingual resources are available. Experiments confirm that
character-level knowledge transfer is more data-efficient than word-level
transfer between related languages.
Comment: AAAI 2020
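The embedder-plus-classifier split described above can be sketched in PyTorch as follows. Module names, dimensions, and the mean-pooling classifier are illustrative assumptions, not the CACO reference implementation.

```python
import torch
import torch.nn as nn

class CharEmbedder(nn.Module):
    """Derives a word vector from the word's written form (characters)."""
    def __init__(self, n_chars, char_dim=32, word_dim=128):
        super().__init__()
        self.chars = nn.Embedding(n_chars, char_dim)  # shared across languages
        self.lstm = nn.LSTM(char_dim, word_dim // 2,
                            batch_first=True, bidirectional=True)

    def forward(self, char_ids):                 # (n_words, max_word_len)
        _, (h, _) = self.lstm(self.chars(char_ids))
        return torch.cat([h[0], h[1]], dim=-1)   # (n_words, word_dim)

class DocClassifier(nn.Module):
    """Mean-pools the embedded words and predicts a document label."""
    def __init__(self, embedder, word_dim=128, n_classes=4):
        super().__init__()
        self.embedder = embedder
        self.out = nn.Linear(word_dim, n_classes)

    def forward(self, doc_char_ids):             # (n_words, max_word_len)
        word_vecs = self.embedder(doc_char_ids)
        return self.out(word_vecs.mean(dim=0))   # (n_classes,)
```

Because the character vocabulary, and hence the embedding table, is shared between the source and target languages, words with similar written forms receive similar vectors, which is what allows knowledge to transfer.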
Representation Learning for Natural Language Processing
This open access book provides an overview of recent advances in representation learning theory, algorithms, and applications for natural language processing (NLP). It is divided into three parts. Part I presents representation learning techniques for multiple language entries, including words, phrases, sentences, and documents. Part II introduces representation techniques for objects closely related to NLP, including entity-based world knowledge, sememe-based linguistic knowledge, networks, and cross-modal entries. Lastly, Part III provides open resource tools for representation learning techniques and discusses the remaining challenges and future research directions. The theories and algorithms of representation learning presented here can also benefit other related domains such as machine learning, social network analysis, the Semantic Web, information retrieval, data mining, and computational biology. This book is intended for advanced undergraduate and graduate students, post-doctoral fellows, researchers, lecturers, and industrial engineers, as well as anyone interested in representation learning and natural language processing.
Acquiring and Harnessing Verb Knowledge for Multilingual Natural Language Processing
Advances in representation learning have enabled natural language processing models to derive non-negligible linguistic information directly from text corpora in an unsupervised fashion. However, this signal is underused in downstream tasks, where models tend to fall back on superficial cues and heuristics to solve the problem at hand. Further progress relies on identifying and filling the gaps in the linguistic knowledge captured in their parameters. The objective of this thesis is to address these challenges, focusing on the issues of resource scarcity, interpretability, and lexical knowledge injection, with an emphasis on the category of verbs.
To this end, I propose a novel paradigm for the efficient acquisition of lexical knowledge, leveraging native speakers’ intuitions about verb meaning to support the development and downstream performance of NLP models across languages. First, I investigate the potential of acquiring semantic verb classes from non-experts through manual clustering. This subsequently informs the development of a two-phase semantic dataset creation methodology, which combines semantic clustering with fine-grained semantic similarity judgments collected through spatial arrangements of lexical stimuli. The method is tested on English and then applied to a typologically diverse sample of languages to produce the first large-scale multilingual verb dataset of this kind. I demonstrate its utility as a diagnostic tool by carrying out a comprehensive evaluation of state-of-the-art NLP models, probing representation quality across languages and domains of verb meaning, and shedding light on their deficiencies. Subsequently, I directly address these shortcomings by injecting lexical knowledge into large pretrained language models. I demonstrate that external, manually curated information about verbs’ lexical properties can support data-driven models in tasks where accurate verb processing is key. Moreover, I examine the potential of extending these benefits from resource-rich to resource-poor languages through translation-based transfer. The results emphasise the usefulness of human-generated lexical knowledge in supporting NLP models and suggest that the time-efficient construction of lexicons similar to those developed in this work, especially for under-resourced languages, can play an important role in boosting their linguistic capacity.
ESRC Doctoral Fellowship [ES/J500033/1], ERC Consolidator Grant LEXICAL [648909]
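As a guess at the general shape of the knowledge-injection step (not the thesis's actual procedure), one could fine-tune a pretrained encoder with a contrastive objective that pulls verbs from the same curated class together. The checkpoint, pooling strategy, and loss below are all illustrative assumptions.

```python
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
enc = AutoModel.from_pretrained("bert-base-multilingual-cased")

def embed(words):
    """[CLS] vectors as word representations (a simplifying assumption)."""
    batch = tok(words, return_tensors="pt", padding=True)
    return enc(**batch).last_hidden_state[:, 0]

def class_loss(anchor, positive, negatives, margin=0.5):
    """Pull a same-class verb pair together; push other verbs away."""
    a, p, n = embed([anchor]), embed([positive]), embed(negatives)
    pull = (1 - F.cosine_similarity(a, p)).mean()
    push = F.relu(F.cosine_similarity(a, n) - margin).mean()
    return pull + push

# Hypothetical verb class: "give" and "donate" together, others apart.
loss = class_loss("give", "donate", ["sleep", "wander"])
loss.backward()   # gradients flow into the encoder's parameters
```

A curated verb lexicon would supply the same-class pairs, and translation-based transfer would replace the English verbs with their target-language counterparts.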