Trans-gram, Fast Cross-lingual Word-embeddings

Authors: Benhalloum Amine; Coulmance Jocelyn; Marty Jean-Marc; Wenzek Guillaume
Publication date: 01/01/2015

We introduce Trans-gram, a simple and computationally efficient method to simultaneously learn and align word embeddings for a variety of languages, using only monolingual data and a smaller set of sentence-aligned data. We use our new method to compute aligned word embeddings for twenty-one languages using English as a pivot language. We show that some linguistic features are aligned across languages for which we do not have aligned data, even though those features do not exist in the pivot language. We also achieve state-of-the-art results on standard cross-lingual text classification and word translation tasks.

Comment: EMNLP 201

arXiv.org e-Print Archive

Crossref

Multi-lingual Common Semantic Space Construction via Cluster-consistent Word Embedding

Authors: Cho Kyunghyun; Huang Lifu; Ji Heng; Knight Kevin; Zhang Boliang
Publication date: 01/01/2018

We construct a multilingual common semantic space based on distributional semantics, where words from multiple languages are projected into a shared space to enable knowledge and resource transfer across languages. Beyond word alignment, we introduce multiple cluster-level alignments and enforce the word clusters to be consistently distributed across multiple languages. We exploit three signals for clustering: (1) neighbor words in the monolingual word embedding space; (2) character-level information; and (3) linguistic properties (e.g., apposition, locative suffix) derived from linguistic structure knowledge bases available for thousands of languages. We introduce a new cluster-consistent correlational neural network to construct the common semantic space by aligning words as well as clusters. Intrinsic evaluation on monolingual and multilingual QVEC tasks shows that our approach achieves significantly higher correlation with linguistic features than state-of-the-art multilingual embedding learning methods. Using low-resource language name tagging as a case study for extrinsic evaluation, our approach achieves up to a 24.5% absolute F-score gain over the state of the art.

Comment: 10 page

arXiv.org e-Print Archive

Crossref

Learning Bilingual Word Representations by Marginalizing Alignments

Authors: Blunsom Phil; Hermann Karl Moritz; Kočiský Tomáš
Publication date: 01/01/2014

We present a probabilistic model that simultaneously learns alignments and distributed representations for bilingual data. By marginalizing over word alignments, the model captures a larger semantic context than prior work relying on hard alignments. The advantage of this approach is demonstrated in a cross-lingual classification task, where we outperform the prior published state of the art.

Comment: Proceedings of ACL 2014 (Short Papers

arXiv.org e-Print Archive

Crossref

Oxford University Research Archive