Rotated canonical correlation analysis for multilingual corpora
This paper proposes the joint use of Canonical Correlation Analysis and Procrustes Rotation (RCA) when dealing with a text and its translation into another language. The basic idea is to represent the words of the two natural languages in a common, language-independent reference space. The Procrustes Rotation transforms the lexical table derived from the translation by minimizing its distance from the lexical table of the original corpus, while the subsequent Canonical Correlation Analysis treats the two word sets symmetrically. The most interesting feature of RCA is that it builds a single reference space representing the correlation structure in the data, inducing the two systems of canonical factors to lie in the same space. The resulting graphical representations let us read the distances between corresponding points as different ways of translating the same word, relative to the general context defined by the canonical variates. Interpreting the distances between matched points can thus be a useful tool for enriching lexical resources in a translation procedure. We compare the most frequent content-bearing words in the two languages, analyzing one year (2003) of Le Monde Diplomatique and its Italian edition.
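To make the pipeline concrete, here is a minimal sketch of the two-step procedure the abstract describes, using random matrices as stand-ins for the two lexical tables; the shapes, toy data, and variable names are illustrative assumptions, not the paper's implementation.

```python
# Minimal RCA sketch: align the translated lexical table to the original one
# with an orthogonal Procrustes rotation, then run CCA so both word sets are
# represented in a single shared reference space.
# Rows = matched content words, columns = contextual features (toy data).
import numpy as np
from scipy.linalg import orthogonal_procrustes
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))   # lexical table, original corpus
Y = (X @ rng.standard_normal((10, 10))) * 0.9 \
    + rng.standard_normal((50, 10)) * 0.1   # lexical table, translation

# Step 1 - Procrustes: rotation R minimizing ||Y @ R - X||_F
R, _ = orthogonal_procrustes(Y, X)
Y_rot = Y @ R

# Step 2 - CCA treats the two (now aligned) word sets symmetrically
cca = CCA(n_components=2)
X_c, Y_c = cca.fit_transform(X, Y_rot)

# Distances between matched points hint at divergent ways of translating
dist = np.linalg.norm(X_c - Y_c, axis=1)
print("words with the largest cross-language displacement:", np.argsort(dist)[-5:])
```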
Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies
With the ongoing growth in the number of digital articles written in an ever wider set of languages, we need annotation methods that enable browsing multilingual corpora. Multilingual probabilistic topic models have recently emerged as a family of semi-supervised machine-learning models that can be used for thematic exploration of text collections in multiple languages. However, these approaches require theme-aligned training data to create a language-independent space. This constraint limits the scenarios in which the technique can be applied and makes it difficult to scale to settings where a huge collection of multilingual documents is required during training. This paper presents an unsupervised document similarity algorithm that requires no parallel or comparable corpora, nor any other type of translation resource. The algorithm annotates topics, automatically created from documents in a single language, with cross-lingual labels, and describes documents by hierarchies of multilingual concepts from independently trained models. Experiments on the English, Spanish, and French editions of the JRC-Acquis corpus show promising results in classifying and sorting documents by similar content.
Accepted at the 10th International Conference on Knowledge Capture (K-CAP 2019).
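A minimal sketch of the core idea follows: each language gets its own independently trained topic model, the monolingual topics are tagged with shared cross-lingual labels, and documents are compared in that label space. The toy corpora and the hand-made topic-to-label map are assumptions (the paper derives the labels automatically).

```python
# Cross-lingual document comparison without parallel corpora: per-language
# topic models plus a shared cross-lingual label space (toy illustration).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def topic_distribution(docs, n_topics=2, seed=0):
    bow = CountVectorizer().fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    return lda.fit_transform(bow)   # one topic distribution per document

en = ["the court ruled on the trade agreement", "farmers harvest wheat and corn"]
es = ["el tribunal decidió sobre el acuerdo comercial",
      "los agricultores cosechan trigo y maíz"]
theta_en, theta_es = topic_distribution(en), topic_distribution(es)

# Hypothetical alignment of each model's topics to shared labels
# (law, agriculture); a real system learns this mapping automatically.
label_map_en = label_map_es = np.eye(2)
d_en, d_es = theta_en @ label_map_en, theta_es @ label_map_es

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Similarity of each English document to each Spanish document
print([[round(cosine(x, y), 2) for y in d_es] for x in d_en])
```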
Cross-lingual Contextualized Topic Models with Zero-shot Learning
Many datasets (e.g., reviews, forums, news) exist in parallel in multiple languages. They all cover the same content, but the linguistic differences make it impossible to use traditional bag-of-words topic models. Such models must either be single-language or suffer from a huge but extremely sparse vocabulary. Both issues can be addressed by transfer learning. In this paper, we introduce a zero-shot cross-lingual topic model. Our model learns topics in one language (here, English) and predicts them for unseen documents in different languages (here, Italian, French, German, and Portuguese). We evaluate the quality of the topic predictions for the same document across languages. Our results show that the transferred topics are coherent and stable across languages, which suggests exciting future research directions.
Updated version. Published as a conference paper at EACL 2021.
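The transfer mechanism can be illustrated with a much simpler stand-in: because a multilingual sentence encoder places all languages in one embedding space, a model fitted only on English embeddings can label unseen Italian documents. The sketch below uses k-means in place of the paper's neural topic model; the encoder name and the toy data are assumptions.

```python
# Zero-shot cross-lingual transfer via a multilingual encoder: fit on English
# embeddings only, then assign topics to unseen Italian documents unchanged.
# KMeans stands in for the paper's neural topic model.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

english_train = [
    "The central bank raised interest rates again.",
    "The striker scored twice in the final match.",
    "Parliament passed the new budget law.",
    "The team won the championship after extra time.",
]
italian_test = [
    "La banca centrale ha alzato di nuovo i tassi di interesse.",
    "L'attaccante ha segnato due gol nella finale.",
]

topics = KMeans(n_clusters=2, n_init=10, random_state=0)
topics.fit(encoder.encode(english_train))            # train on English only

print(topics.predict(encoder.encode(italian_test)))  # zero-shot prediction
```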
Recommended from our members
Crosslingual Topic Transfer
Probabilistic topic modeling has been used as an efficient tool for extracting high-level abstractions from large corpora, and is also commonly used as a feature extraction technique for many natural language processing tasks. As a natural extension, multilingual topic models extract language-consistent features from corpora in multiple languages, enabling knowledge transfer for crosslingual tasks. While many models have been proposed, they mostly require very specific crosslingual supervision data, which limits their generalization to languages without rich linguistic resources. In this thesis, we start by designing an efficient multilingual topic model evaluation as the foundation for the subsequent work. We then formulate model training as a knowledge transfer process by defining a transfer operation. Based on this formulation, we identify the factors that actually affect the performance of crosslingual learning in topic models, and introduce a new model that achieves competitive performance while using significantly fewer linguistic resources.
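As a flavor of what a multilingual topic model evaluation can look like, here is a deliberately crude sketch that scores language consistency by checking, via a bilingual dictionary, whether the top words of aligned topics translate into each other. The metric, topics, and dictionary are illustrative assumptions, not the thesis's evaluation.

```python
# Crude language-consistency score for aligned topic pairs: the fraction of
# English top words whose dictionary translation appears in the paired
# Spanish topic (toy topics and dictionary).
en_es = {"court": "tribunal", "law": "ley", "judge": "juez",
         "wheat": "trigo", "farm": "granja", "harvest": "cosecha"}

topics_en = [["court", "law", "judge"], ["wheat", "farm", "harvest"]]
topics_es = [["tribunal", "ley", "juez"], ["trigo", "granja", "cosecha"]]

def match_rate(top_en, top_es, dictionary):
    hits = sum(1 for w in top_en if dictionary.get(w) in top_es)
    return hits / len(top_en)

for k, (te, ts) in enumerate(zip(topics_en, topics_es)):
    print(f"topic {k}: match rate = {match_rate(te, ts, en_es):.2f}")
```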
Multilingual and Multimodal Topic Modelling with Pretrained Embeddings
This paper presents M3L-Contrast, a novel multimodal multilingual (M3L) neural topic model for comparable data that maps texts from multiple languages and images into a shared topic space. Our model is trained jointly on texts and images and takes advantage of pretrained document and image embeddings to abstract away the complexities between different languages and modalities. As a multilingual topic model it produces aligned language-specific topics, and as a multimodal model it infers textual representations of semantic concepts in images. We demonstrate that our model is competitive with a zero-shot topic model in predicting topic distributions for comparable multilingual data, and that it significantly outperforms a zero-shot model in predicting topic distributions for comparable texts and images. We also show that our model performs almost as well on unaligned embeddings as it does on aligned embeddings.
Peer reviewed.
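The abstract does not spell out the training objective, but a common way to realize such joint text-image training over pretrained embeddings is a contrastive (InfoNCE-style) loss. The sketch below shows that ingredient under assumed dimensions and random data; it should not be read as M3L-Contrast's exact loss.

```python
# InfoNCE-style contrastive alignment of paired text and image embeddings in
# a shared low-dimensional space (one common design for joint training;
# dimensions, projections, and data here are illustrative only).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, d_text, d_img, d_topic = 8, 384, 512, 32

text_emb = torch.randn(batch, d_text)   # pretrained document embeddings
img_emb = torch.randn(batch, d_img)     # pretrained image embeddings

proj_text = torch.nn.Linear(d_text, d_topic)
proj_img = torch.nn.Linear(d_img, d_topic)

z_t = F.normalize(proj_text(text_emb), dim=-1)
z_i = F.normalize(proj_img(img_emb), dim=-1)

# Row k scores text k against every image in the batch; the true pair sits
# on the diagonal, so the targets are 0..batch-1 in both directions.
logits = z_t @ z_i.T / 0.07   # temperature = 0.07
labels = torch.arange(batch)
loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2
print(loss.item())   # one forward pass; a real model would minimize this
```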