Multilingual and Multimodal Topic Modelling with Pretrained Embeddings
This paper presents M3L-Contrast, a novel multimodal multilingual (M3L) neural topic model for comparable data that maps texts from multiple languages and images into a shared topic space. Our model is trained jointly on texts and images and takes advantage of pretrained document and image embeddings to abstract away the differences between languages and modalities. As a multilingual topic model, it produces aligned language-specific topics; as a multimodal model, it infers textual representations of the semantic concepts in images. We demonstrate that our model is competitive with a zero-shot topic model in predicting topic distributions for comparable multilingual data and significantly outperforms a zero-shot model in predicting topic distributions for comparable texts and images. We also show that our model performs almost as well on unaligned embeddings as it does on aligned embeddings.
Peer reviewed
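The inference side described in this abstract can be illustrated with a toy sketch: an embedding (of a document or an image) is compared against per-topic embeddings in the shared space, and a softmax over the similarities yields a topic distribution. Everything below, including the orthonormal topic embeddings, the softmax scoring, and the noise scale, is an illustrative assumption, not the actual M3L-Contrast architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
n_topics, dim = 4, 8

# Toy topic embeddings in a shared text/image space; orthonormal rows
# are used here only so the example is easy to verify by eye.
topic_emb = np.eye(n_topics, dim)

def topic_distribution(doc_emb, topic_emb):
    """Softmax over similarities to each topic embedding -> topic mixture."""
    sims = topic_emb @ doc_emb
    e = np.exp(sims - sims.max())
    return e / e.sum()

# An "image" whose embedding lies near topic 2 concentrates its mass there.
img_emb = topic_emb[2] + rng.normal(scale=0.05, size=dim)
theta = topic_distribution(img_emb, topic_emb)
print(theta.argmax())  # topic 2
```

Because texts and images live in the same space, the identical scoring step serves both modalities, which is the property the abstract highlights.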
Cross-lingual Contextualized Topic Models with Zero-shot Learning
Many data sets (e.g., reviews, forums, news) exist in parallel in multiple languages. They all cover the same content, but the linguistic differences make it impossible to use traditional, bag-of-words-based topic models. Such models must either be restricted to a single language or suffer from a huge but extremely sparse vocabulary. Both issues can be addressed by transfer learning. In this paper, we introduce a zero-shot cross-lingual topic model. Our model learns topics in one language (here, English) and predicts them for unseen documents in different languages (here, Italian, French, German, and Portuguese). We evaluate the quality of the topic predictions for the same document in different languages. Our results show that the transferred topics are coherent and stable across languages, which suggests exciting future research directions.
Comment: Updated version. Published as a conference paper at EACL202
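The zero-shot transfer described above relies on a multilingual encoder placing parallel documents near each other in embedding space, so a topic inference step fitted only on English generalizes to other languages. The sketch below illustrates that idea with synthetic embeddings and k-means standing in for the neural inference network; all names and numbers are assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_topics, dim, docs_per_topic = 3, 16, 20

# Simulated multilingual encoder: parallel documents in different
# languages land near the same point in embedding space.
centers = rng.normal(size=(n_topics, dim)) * 5
base = np.repeat(centers, docs_per_topic, axis=0)
en_emb = base + rng.normal(scale=0.1, size=base.shape)  # "English" docs
it_emb = base + rng.normal(scale=0.1, size=base.shape)  # parallel "Italian" docs

# Fit the topic inference step on English only (zero-shot for Italian).
km = KMeans(n_clusters=n_topics, n_init=10, random_state=0).fit(en_emb)
en_topics = km.predict(en_emb)
it_topics = km.predict(it_emb)  # transfer via the shared embedding space

agreement = (en_topics == it_topics).mean()
print(agreement)  # parallel docs receive the same topic
```

The key point the example isolates is that nothing language-specific is learned at inference time: the Italian documents are assigned topics by a model that never saw Italian, exactly the zero-shot setting the abstract evaluates.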
Multi-National Topics Maps for Parliamentary Debate Analysis
In recent years, automated political text processing has become an indispensable requirement for providing automatic access to political debate. During the worldwide Covid-19 pandemic, this need became visible not only in the social sciences but also in public opinion. We provide a path to operationalizing this need in a multilingual, topic-oriented manner. Using a publicly available data set of parliamentary speeches, we create a novel process pipeline to identify a good reference model and to link national topics to cross-national topics. We use design science research to create this process pipeline as an artifact.
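One hedged way to sketch the "link national topics to cross-national topics" step, assuming topics are represented as embedding vectors, is maximum-similarity matching with the Hungarian algorithm; the abstract does not specify the authors' actual linking method, so every name and value below is illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(2)
n_topics, dim = 5, 10

# Hypothetical cross-national reference topics, and one country's topics
# modelled as a shuffled, slightly noisy copy of the reference set.
reference = rng.normal(size=(n_topics, dim))
perm = np.array([3, 0, 4, 1, 2])
national = reference[perm] + rng.normal(scale=0.05, size=(n_topics, dim))

# Link national topics to reference topics by maximising total cosine similarity.
ref_n = reference / np.linalg.norm(reference, axis=1, keepdims=True)
nat_n = national / np.linalg.norm(national, axis=1, keepdims=True)
sim = nat_n @ ref_n.T
rows, cols = linear_sum_assignment(-sim)  # negate to maximise similarity

# cols[i] is the cross-national reference topic linked to national topic i.
print(cols)
```

Using a one-to-one assignment rather than greedy nearest-neighbour matching guarantees that no two national topics are linked to the same cross-national topic.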