A Multiplicative Model for Learning Distributed Text-Based Attribute Representations
In this paper we propose a general framework for learning distributed
representations of attributes: characteristics of text whose representations
can be jointly learned with word embeddings. Attributes can correspond to
document indicators (to learn sentence vectors), language indicators (to learn
distributed language representations), meta-data and side information (such as
the age, gender and industry of a blogger) or representations of authors. We
describe a third-order model where word context and attribute vectors interact
multiplicatively to predict the next word in a sequence. This leads to the
notion of conditional word similarity: how meanings of words change when
conditioned on different attributes. We perform several experimental tasks
including sentiment classification, cross-lingual document classification, and
blog authorship attribution. We also qualitatively evaluate conditional word
neighbours and attribute-conditioned text generation.
Comment: 11 pages. An earlier version was accepted to the ICML-2014 Workshop on Knowledge-Powered Deep Learning for Text Mining
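The third-order interaction described above can be sketched as a factored tensor model in which the projected context and the projected attribute combine element-wise before predicting the next word. The sketch below uses random parameters and invented dimensions purely for illustration; it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

V, D, K, F = 50, 16, 8, 32   # vocab size, word dim, attribute dim, factor dim

# Illustrative random parameters: word embeddings, an attribute vector,
# and the three factor matrices of the factored third-order tensor.
E = rng.normal(size=(V, D))       # word embeddings
attr = rng.normal(size=(K,))      # one attribute vector (e.g. a document id)
Wc = rng.normal(size=(F, D))      # projects the context word
Wa = rng.normal(size=(F, K))      # projects the attribute
Wo = rng.normal(size=(V, F))      # maps gated factors to vocabulary logits

def next_word_logits(context_word_id, attribute_vec):
    """Context and attribute factors interact multiplicatively
    (element-wise) before the next-word prediction."""
    f = (Wc @ E[context_word_id]) * (Wa @ attribute_vec)
    return Wo @ f

logits = next_word_logits(3, attr)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs.shape)  # → (50,)
```

Because the attribute gates every factor multiplicatively, changing the attribute vector reshapes the whole next-word distribution, which is what gives rise to conditional word similarity.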
From Word to Sense Embeddings: A Survey on Vector Representations of Meaning
Over the past years, distributed semantic representations have proved to be
effective and flexible keepers of prior knowledge to be integrated into
downstream applications. This survey focuses on the representation of meaning.
We start from the theoretical background behind word vector space models and
highlight one of their major limitations: the meaning conflation deficiency,
which arises from representing a word with all its possible meanings as a
single vector. Then, we explain how this deficiency can be addressed through a
transition from the word level to the more fine-grained level of word senses
(in its broader acceptation) as a method for modelling unambiguous lexical
meaning. We present a comprehensive overview of the wide range of techniques in
the two main branches of sense representation, i.e., unsupervised and
knowledge-based. Finally, this survey covers the main evaluation procedures and
applications for this type of representation, and provides an analysis of four
of its important aspects: interpretability, sense granularity, adaptability to
different domains and compositionality.
Comment: 46 pages, 8 figures. Published in Journal of Artificial Intelligence Research
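The meaning conflation deficiency can be illustrated with toy vectors: collapsing two senses of a word into one vector yields a representation that sits between its senses and is only a mediocre neighbour of either. The vectors below are hand-made for illustration only.

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy, hand-made sense vectors: "bank" as a financial
# institution vs. "bank" as a river side.
bank_finance = np.array([0.9, 0.1, 0.0])
bank_river   = np.array([0.0, 0.1, 0.9])
money        = np.array([1.0, 0.0, 0.0])
shore        = np.array([0.0, 0.0, 1.0])

# A single word-level vector conflates both senses ...
bank_word = (bank_finance + bank_river) / 2

# ... so it is only moderately close to either related word,
print(round(cos(bank_word, money), 2), round(cos(bank_word, shore), 2))   # → 0.7 0.7
# while each dedicated sense vector stays close to its own topic.
print(round(cos(bank_finance, money), 2), round(cos(bank_river, shore), 2))  # → 0.99 0.99
```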
Automatic information search for countering covid-19 misinformation through semantic similarity
Master's thesis in Bioinformatics and Computational Biology. Information quality in social media is an increasingly important issue, and the misinformation problem has become even more critical during the current COVID-19 pandemic, leaving people exposed
to false and potentially harmful claims and rumours. Civil society organizations, such as the
World Health Organization, have issued a global call for action to promote access to health
information and mitigate harm from health misinformation. Consequently, this project pursues
countering the spread of the COVID-19 infodemic and its potential health hazards.
In this work, we give an overview of the models and methods that have been employed in the
NLP field, from its foundations to the latest state-of-the-art approaches. Focusing on deep learning methods, we propose applying multilingual Transformer models based on Siamese networks
(also called bi-encoders), combined with ensembling and PCA dimensionality-reduction techniques.
The goal is to counter COVID-19 misinformation by analyzing the semantic similarity between
a claim and tweets from a collection gathered from official fact-checkers verified by the International Fact-Checking Network of the Poynter Institute.
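The ensemble-plus-PCA step can be sketched in plain NumPy: embeddings from two bi-encoders are each reduced with PCA, concatenated, and ranked by cosine similarity against a claim. The embeddings below are random stand-ins (real ones would come from the multilingual Transformer encoders), and all dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Random stand-ins for sentence embeddings from two bi-encoder models;
# sizes are illustrative, not those of any particular model.
n_tweets, d1, d2 = 100, 384, 512
emb_model_a = rng.normal(size=(n_tweets, d1))
emb_model_b = rng.normal(size=(n_tweets, d2))

def pca_reduce(X, k):
    """Project rows of X onto their top-k principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

# Ensemble by concatenating the two models' reduced embeddings.
ensemble = np.hstack([pca_reduce(emb_model_a, 64), pca_reduce(emb_model_b, 64)])

def cosine_sim(q, M):
    """Cosine similarity of query vector q against every row of M."""
    q = q / np.linalg.norm(q)
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    return M @ q

claim = ensemble[0]                      # pretend row 0 holds the claim
scores = cosine_sim(claim, ensemble)     # similarity of claim to every tweet
top5 = np.argsort(-scores)[:5]
print(top5[0])  # → 0 (the claim itself ranks first)
```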
The number of Internet users grows every year, and the language a person speaks determines
their access to information online. For this reason, we make a special effort to apply
multilingual models to tackle misinformation across the globe. Regarding semantic
similarity, we first evaluate these multilingual ensemble models and improve on the
STS-Benchmark results of monolingual and single models. Second, we enhance the interpretability of the models' performance through the SentEval toolkit. Lastly, we compare the
models' performance against biomedical models on TREC-COVID task round 1, using the BM25
Okapi ranking method as the baseline. Moreover, we are interested in understanding the ins
and outs of misinformation. For that purpose, we extend interpretability using machine learning
and deep learning approaches for sentiment analysis and topic modelling. Finally, we develop
a dashboard to ease visualization of the results.
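The BM25 Okapi baseline mentioned above can be sketched in a few lines of standard-library Python; the documents and query below are invented toy examples, and k1/b take their common default values.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 scores of whitespace-tokenised docs for a query."""
    N = len(docs)
    toks = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in toks) / N
    df = Counter()                       # document frequency per term
    for t in toks:
        df.update(set(t))
    scores = []
    for t in toks:
        tf = Counter(t)                  # term frequency in this doc
        s = 0.0
        for w in query.lower().split():
            if w not in tf:
                continue
            idf = math.log((N - df[w] + 0.5) / (df[w] + 0.5) + 1)
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

docs = [
    "masks reduce transmission of the virus",
    "the vaccine alters human dna",          # shares terms with the query
    "vitamin c cures covid overnight",
]
scores = bm25_scores("vaccine dna", docs)
best = scores.index(max(scores))
print(best)  # → 1
```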
In our view, the results obtained in this project constitute an excellent first step toward
incorporating multilingualism and will assist researchers and the public in countering COVID-19
misinformation.
Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval
Although more and more language pairs are covered by machine translation
services, there are still many pairs that lack translation resources.
Cross-language information retrieval (CLIR) is an application which needs
translation functionality of a relatively low level of sophistication, since
current models for information retrieval (IR) are still based on a
bag-of-words representation. The Web provides a vast resource for the automatic construction
of parallel corpora which can be used to train statistical translation models
automatically. The resulting translation models can be embedded in several ways
in a retrieval model. In this paper, we will investigate the problem of
automatically mining parallel texts from the Web and different ways of
integrating the translation models within the retrieval process. Our
experiments on standard test collections for CLIR show that the Web-based
translation models can surpass commercial MT systems in CLIR tasks. These
results open the perspective of constructing a fully automatic query
translation device for CLIR at a very low cost.
Comment: 37 pages
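One simple way to embed a statistical translation model in a bag-of-words retrieval model is to expand each source-language query term into weighted target-language terms via a lexical table p(t|s), then score documents on the weighted matches. The tiny probability table and documents below are invented for illustration; a real table would be estimated from Web-mined parallel text.

```python
# Toy lexical translation table p(target | source); values are invented.
p_translate = {
    "maison": {"house": 0.7, "home": 0.3},
    "verte":  {"green": 0.9, "unripe": 0.1},
}

def translate_query(query_terms):
    """Expand each source term into weighted target-language terms."""
    weighted = {}
    for s in query_terms:
        for t, p in p_translate.get(s, {}).items():
            weighted[t] = weighted.get(t, 0.0) + p
    return weighted

def score(doc_tokens, weighted_query):
    """Bag-of-words score: sum of translation weights of matched terms."""
    return sum(w for t, w in weighted_query.items() if t in doc_tokens)

wq = translate_query(["maison", "verte"])
docs = {
    "d1": {"the", "green", "house"},
    "d2": {"a", "home", "garden"},
}
ranked = sorted(docs, key=lambda d: score(docs[d], wq), reverse=True)
print(ranked)  # → ['d1', 'd2']
```

More probable translations contribute more weight, so noisy entries in a Web-mined table degrade ranking gracefully rather than catastrophically.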