
    Visual Topic Modelling for NewsImage Task at MediaEval 2021

    We present the Visual Topic Model (VTM), a model that generates a topic distribution for an image without using any text during inference. The model is applied to an image-text matching task at MediaEval 2021. Although the results for this specific task are negative (the model performs worse than a baseline), we demonstrate that VTM produces meaningful results and can be used in other applications.
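    The abstract does not specify the architecture, but the core idea (mapping an image embedding directly to a distribution over topics, with no text at inference) can be sketched as follows. The names, dimensions, and single linear projection are illustrative assumptions, not the paper's actual model.

```python
# Minimal sketch (PyTorch) of predicting a topic distribution from an image
# embedding alone. Architecture and sizes are assumptions for illustration.
import torch
import torch.nn as nn

class VisualTopicHead(nn.Module):
    def __init__(self, img_dim=512, n_topics=50):
        super().__init__()
        self.proj = nn.Linear(img_dim, n_topics)

    def forward(self, img_emb):
        # Softmax turns the projected logits into a distribution over topics.
        return torch.softmax(self.proj(img_emb), dim=-1)

head = VisualTopicHead()
image_emb = torch.randn(1, 512)   # stand-in for a pretrained image encoder's output
topic_dist = head(image_emb)      # shape (1, 50); each row sums to 1
```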

    Multilingual and Multimodal Topic Modelling with Pretrained Embeddings

    This paper presents M3L-Contrast, a novel multimodal multilingual (M3L) neural topic model for comparable data that maps texts from multiple languages and images into a shared topic space. Our model is trained jointly on texts and images and takes advantage of pretrained document and image embeddings to abstract away the complexities between the different languages and modalities. As a multilingual topic model, it produces aligned language-specific topics, and as a multimodal model, it infers textual representations of semantic concepts in images. We demonstrate that our model is competitive with a zero-shot topic model in predicting topic distributions for comparable multilingual data and significantly outperforms a zero-shot model in predicting topic distributions for comparable texts and images. We also show that our model performs almost as well on unaligned embeddings as it does on aligned embeddings.
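    The paper's exact objective is not given here, but the "Contrast" in the name suggests a contrastive alignment of paired items in the shared topic space. A minimal sketch of such an InfoNCE-style loss, with the loss form, temperature, and sizes all assumed rather than taken from the paper:

```python
# Sketch: pull topic representations of paired text/image items together and
# push unpaired ones apart. Loss form and dimensions are assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(text_topics, image_topics, temperature=0.1):
    t = F.normalize(text_topics, dim=-1)
    v = F.normalize(image_topics, dim=-1)
    logits = t @ v.T / temperature         # similarity of every text/image pair
    targets = torch.arange(len(t))         # the i-th text matches the i-th image
    return F.cross_entropy(logits, targets)

loss = contrastive_loss(torch.randn(8, 50), torch.randn(8, 50))
```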

    Protein Function Prediction using Biomedical Literature

    Protein function prediction aims to identify the function of a given protein using, for example, sequence data, protein-protein interactions or evolutionary relationships. The use of biomedical literature to predict protein function, however, is a relatively under-studied topic given the vast amount of readily available data. This thesis explores the use of abstracts from biomedical literature to predict protein functions using the terms specified in the Gene Ontology (GO), a standardised method of cataloguing protein functions in which the functions are organised in a directed acyclic graph (DAG). The GO is composed of three separate ontologies: cellular component (CC), molecular function (MF) and biological process (BP). Hierarchical classification is a classification method that assigns an instance to one or more classes that are hierarchically related to each other, as in the GO. We build a hierarchical classifier that assigns GO terms to abstracts by training an individual binary Naïve Bayes classifier to recognise each GO term. We present three different methods of mining abstracts from PubMed and use these methods to assemble four datasets for training our classifiers. Each classifier is tested in three different ways: (a) in the paper-centric approach, we assign GO terms to a single abstract; (b) in the protein-centric approach, we assign GO terms to a concatenation of abstracts relating to a single protein; and (c) in the term-centric approach, a complement of the protein-centric approach, the goal is to assign proteins to a GO term. We evaluate the performance of our method using two evaluation metrics: maximum F-measure (F-max) and minimum semantic distance (S-min). Our results show that the best dataset for training our classifiers depends on the evaluation metric, the ontology and the proteins being annotated. We also find a negative correlation between the F-max score of a GO term and its information content (IC), and a positive correlation between the F-max and the term's centrality in the DAG. Lastly, we compare our method with GOstruct, the state-of-the-art literature-based protein annotation program. Our method outperforms GOstruct on human proteins, showing a significant improvement for the MF ontology.
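    The per-term setup described above (one binary classifier per GO term) is straightforward to sketch with scikit-learn. The abstracts, GO terms, and labels below are toy placeholders, and the thesis' actual feature extraction and hierarchical post-processing over the DAG are omitted.

```python
# Sketch: one binary Naive Bayes classifier per GO term over bag-of-words
# abstracts. Data, GO terms, and labels are toy placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

abstracts = ["kinase activity observed in the cytoplasm",
             "dna binding transcription factor identified"]
labels = {"GO:0016301": [1, 0],   # kinase activity
          "GO:0003677": [0, 1]}   # DNA binding

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(abstracts)
classifiers = {term: MultinomialNB().fit(X, y) for term, y in labels.items()}

# Paper-centric use: score a new abstract against every GO term.
x_new = vectorizer.transform(["novel kinase activity in human cells"])
scores = {term: clf.predict_proba(x_new)[0, 1] for term, clf in classifiers.items()}
```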

    Grounded and Well-rounded: A Methodological Approach to the Study of Cross-modal and Cross-lingual Grounding

    Grounding has been argued to be a crucial component towards the development of more complete and truly semantically competent artificial intelligence systems. The literature is divided into two camps: while some argue that grounding allows for qualitatively different generalizations, others believe it can be compensated for by mono-modal data quantity. Limited empirical evidence has emerged for or against either position, which we argue is due to the methodological challenges of studying grounding and its effects on NLP systems. In this paper, we establish a methodological framework for studying the effects, if any, of providing models with richer input sources than text alone. Its crux lies in constructing comparable samples of populations of models trained on different input modalities, so that we can tease apart the qualitative effects of different input sources from quantifiable model performance. Experiments using this framework reveal qualitative differences in model behavior between cross-modally grounded, cross-lingually grounded, and ungrounded models, which we measure both at a global dataset level and for specific word representations, depending on how concrete their semantics is.
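    A small illustration of the population idea: rather than comparing single runs, one trains several seeds per input condition and compares the resulting score distributions. The scores and statistical test below are placeholders, not the paper's tasks, metrics, or analysis.

```python
# Sketch: compare populations of models (one score per training seed) instead
# of single runs. Scores are made-up placeholders.
from scipy.stats import mannwhitneyu

grounded_scores  = [0.71, 0.69, 0.73, 0.70, 0.72]
text_only_scores = [0.68, 0.70, 0.69, 0.67, 0.71]
stat, p = mannwhitneyu(grounded_scores, text_only_scores)
print(f"U={stat}, p={p:.3f}")   # a small p suggests the populations differ
```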

    A Comparison of Unsupervised Methods for Ad hoc Cross-Lingual Document Retrieval

    We address the problem of linking related documents across languages in a multilingual collection. We evaluate three diverse unsupervised methods for representing and comparing documents: (1) a multilingual topic model; (2) cross-lingual document embeddings; and (3) the Wasserstein distance. We test the performance of these methods by retrieving news articles in Swedish that are known to be related to a given Finnish article. The results show that ensembles of the methods outperform the stand-alone methods, suggesting that they capture complementary characteristics of the documents.
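    The abstract does not say how the ensembles combine the individual methods; one common option is rank fusion, sketched below with made-up similarity scores for two of the three methods.

```python
# Sketch: ensemble two retrieval methods by averaging their ranks.
# Scores are placeholders; the paper's exact fusion scheme is not given above.
import numpy as np

def rank_fusion(scores_a, scores_b):
    # argsort of argsort converts scores into ranks (0 = best match).
    rank_a = np.argsort(np.argsort(-scores_a))
    rank_b = np.argsort(np.argsort(-scores_b))
    return (rank_a + rank_b) / 2

sims_topic = np.array([0.2, 0.9, 0.5])   # e.g., topic-model similarities
sims_embed = np.array([0.3, 0.7, 0.6])   # e.g., embedding cosine similarities
best = int(np.argmin(rank_fusion(sims_topic, sims_embed)))  # top-ranked article
```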

    EMBEDDIA at SemEval-2022 Task 8: Investigating Sentence, Image, and Knowledge Graph Representations for Multilingual News Article Similarity

    In this paper, we present the participation of the EMBEDDIA team in SemEval-2022 Task 8 (Multilingual News Article Similarity). We cover several techniques and propose different methods for estimating multilingual news article similarity by exploring the dataset in its entirety. We take advantage of the textual content of the articles, the provided metadata (e.g., titles, keywords, topics), the translated articles, the images (where available), and knowledge-graph-based representations of the entities and relations present in the articles. We then compute the semantic similarity between the different features and predict the similarity scores through regression. Our findings show that, while our proposed methods obtained promising results, none of them beat the semantic textual similarity obtained with sentence representations. In the official SemEval-2022 Task 8 evaluation, we ranked fifth in the overall cross-lingual team ranking and second in the English-only results.
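    The winning signal, semantic textual similarity from multilingual sentence representations, can be sketched as below; the specific sentence-transformers checkpoint is our assumption, not necessarily the one the team used.

```python
# Sketch: cosine similarity of multilingual sentence embeddings as the
# similarity feature. Model choice is an assumption.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
emb = model.encode(["First news article text ...",
                    "Texto del segundo artículo ..."])
similarity = util.cos_sim(emb[0], emb[1]).item()   # feed this to the regressor
```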

    Multilingual Topic Labelling of News Topics using Ontological Mapping

    The large volume of news produced daily makes topic modelling useful for analysing topical trends. A topic is usually represented by a ranked list of words, but such lists can be difficult and time-consuming for humans to interpret. Various methods have therefore been proposed to generate labels that capture the semantic content of a topic. However, there has been no work so far on generating multilingual labels, which would be useful for exploring multilingual news collections. We propose an ontological mapping method that maps topics to concepts in a language-agnostic news ontology. We test our method on Finnish and English topics and show that it performs on par with state-of-the-art label generation methods, is able to produce multilingual labels, and can be applied without any modifications to topics from languages not seen during training.
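    One way to realize such a mapping (not necessarily the paper's) is to embed a topic's top words and the ontology's concept labels in a shared multilingual space and pick the nearest concept. The concepts, topic words, and checkpoint below are illustrative assumptions.

```python
# Sketch: label a topic with the nearest concept in a shared multilingual
# embedding space. Concepts, topic words, and model are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
topic_words = "vaalit puolue äänestys hallitus"   # Finnish election topic
concepts = ["politics", "economy", "sports"]      # ontology concept labels

scores = util.cos_sim(model.encode(topic_words), model.encode(concepts))
label = concepts[int(scores.argmax())]            # ideally "politics"
```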

    Effectiveness of Data Augmentation and Pretraining for Improving Neural Headline Generation in Low-Resource Settings

    We tackle the problem of neural headline generation in a low-resource setting, where only a limited amount of data is available to train a model. We compare the ideal high-resource scenario on English with results obtained on a smaller subset of the same data, and also run experiments on two small news corpora covering the low-resource languages Croatian and Estonian. Two options for headline generation in a multilingual low-resource scenario are investigated: a pretrained multilingual encoder-decoder model, and a combination of two pretrained language models, one used as the encoder and the other as the decoder, connected by a cross-attention layer that has to be trained from scratch. The results show that the first approach outperforms the second by a large margin. We explore several data augmentation and pretraining strategies to improve the performance of both models, and show that while these strategies drastically improve the second approach, they have little to no effect on the performance of the pretrained encoder-decoder model. Finally, we propose two new measures for evaluating model performance besides the classic ROUGE scores.
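    The second option, two pretrained language models joined by untrained cross-attention, corresponds closely to Hugging Face's EncoderDecoderModel, shown below as a sketch; the checkpoint names are examples, not the models used in the paper.

```python
# Sketch: wire two pretrained LMs into an encoder-decoder with randomly
# initialized cross-attention (the second setup described above).
from transformers import AutoTokenizer, EncoderDecoderModel

model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-multilingual-cased",   # encoder LM
    "bert-base-multilingual-cased",   # decoder LM; cross-attention starts untrained
)
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
# Fine-tune on (article, headline) pairs as a standard seq2seq task.
```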

    The expansion of isms, 1820-1917: Data-driven analysis of political language in digitized newspaper collections

    Words with the suffix -ism are reductionist terms that help us navigate complex social issues by giving them a simple one-word label. On the one hand, they are often associated with political ideologies; on the other, they are present in many other domains of language, especially culture, science, and religion. This has not always been the case. This paper studies isms in a historical record of digitized newspapers published in Finland from 1820 to 1917 to find out how the language of isms developed historically. We use diachronic word embeddings and affinity propagation clustering to trace how new isms entered the lexicon and how they relate to one another over time. We are able to show how they became more common and entered more and more domains. Still, the uses of isms as traditions for political action and thinking stand out in our analysis.
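    The pipeline named above (diachronic word embeddings plus affinity propagation) can be sketched roughly as follows; the tiny corpus is a toy placeholder, and the alignment of embedding spaces across time slices is omitted.

```python
# Sketch: train embeddings per time slice, then cluster -ism vectors with
# affinity propagation. Corpus is a toy placeholder.
from gensim.models import Word2Vec
from sklearn.cluster import AffinityPropagation

slice_1880s = [["socialism", "spread", "rapidly"],
               ["darwinism", "was", "debated"],
               ["liberalism", "and", "socialism", "in", "politics"]]
model = Word2Vec(slice_1880s, vector_size=50, min_count=1, epochs=50)

isms = [w for w in model.wv.index_to_key if w.endswith("ism")]
vectors = [model.wv[w] for w in isms]
clusters = AffinityPropagation(random_state=0).fit_predict(vectors)
```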

    Multilingual Dynamic Topic Model

    Dynamic topic models (DTMs) capture the evolution of topics and trends in time-series data. Current DTMs are applicable only to monolingual datasets. In this paper we present the multilingual dynamic topic model (ML-DTM), a novel topic model that combines a DTM with an existing multilingual topic modeling method to capture cross-lingual topics that evolve over time. We present results of this model on a parallel German-English corpus of news articles and a comparable corpus of Finnish and Swedish news articles. We demonstrate the capability of ML-DTM to track significant events related to a topic and show that it finds distinct topics and performs as well as existing multilingual topic models in aligning cross-lingual topics.
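    ML-DTM itself is not part of a standard library, but the monolingual DTM it builds on is available as gensim's LdaSeqModel; a toy-scale sketch of that component, with a placeholder corpus:

```python
# Sketch: the monolingual dynamic topic model underlying ML-DTM, via gensim.
# Corpus and time-slice sizes are toy placeholders.
from gensim.corpora import Dictionary
from gensim.models import LdaSeqModel

docs = [["election", "vote"], ["election", "party"],   # time slice 1
        ["budget", "tax"], ["budget", "vote"]]         # time slice 2
dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

dtm = LdaSeqModel(corpus=corpus, id2word=dictionary,
                  time_slice=[2, 2], num_topics=2)
print(dtm.print_topics(time=0))   # topic-word distributions in the first slice
```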