Meta-Embedding as Auxiliary Task Regularization.
Word embeddings have been shown to benefit from ensembling several word
embedding sources, often carried out using straightforward mathematical
operations over the set of word vectors. More recently, self-supervised
learning has been used to find a lower-dimensional representation, similar in
size to the individual word embeddings within the ensemble. However, these
methods do not use the available manually labeled datasets that are often used
solely for the purpose of evaluation. We propose to reconstruct an ensemble of
word embeddings as an auxiliary task that regularises a main task while both
tasks share the learned meta-embedding layer. We carry out intrinsic evaluation
(6 word similarity datasets and 3 analogy datasets) and extrinsic evaluation (4
downstream tasks). For intrinsic task evaluation, supervision comes from
various labeled word similarity datasets. Our experimental results show that
the performance is improved for all word similarity datasets when compared to
self-supervised learning methods with a mean increase of in Spearman
correlation. Specifically, the proposed method shows the best performance in 4
out of 6 of word similarity datasets when using a cosine reconstruction loss
and Brier's word similarity loss. Moreover, improvements are also made when
performing word meta-embedding reconstruction in sequence tagging and sentence
meta-embedding for sentence classification.
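The joint objective described above can be sketched as follows. This is a hedged illustration, not the paper's implementation: the dimensions, the weighting term `lam`, and the placeholder main-task loss are all assumptions, and the reconstruction head is a single random linear map rather than a trained layer.

```python
# Hypothetical sketch of the auxiliary-task idea: a shared meta-embedding
# serves a main task while also reconstructing the concatenated ensemble
# of source embeddings, here with a cosine reconstruction loss.
import numpy as np

rng = np.random.default_rng(0)

def cosine_loss(pred, target):
    """1 - cosine similarity, averaged over the batch."""
    num = np.sum(pred * target, axis=1)
    den = np.linalg.norm(pred, axis=1) * np.linalg.norm(target, axis=1) + 1e-8
    return float(np.mean(1.0 - num / den))

# Two made-up source embedding spaces (e.g. 50-d and 100-d) for a batch of 4.
src_a = rng.normal(size=(4, 50))
src_b = rng.normal(size=(4, 100))
ensemble = np.concatenate([src_a, src_b], axis=1)  # 150-d reconstruction target

meta = rng.normal(size=(4, 64))             # shared meta-embedding layer output
W_recon = rng.normal(size=(64, 150)) * 0.1  # reconstruction head (assumed linear)

l_recon = cosine_loss(meta @ W_recon, ensemble)

l_main = 0.42  # placeholder main-task loss (e.g. a word-similarity objective)
lam = 0.5      # auxiliary-task weight; an illustrative assumption
total = l_main + lam * l_recon
```

Because both terms backpropagate through the same meta-embedding layer in the actual method, the reconstruction term acts as a regularizer on the representation learned for the main task.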
A Novel Contrastive Learning Method for Clickbait Detection on RoCliCo: A Romanian Clickbait Corpus of News Articles
To increase revenue, news websites often resort to using deceptive news
titles, luring users into clicking on the title and reading the full news.
Clickbait detection is the task that aims to automatically detect this form of
false advertisement and avoid wasting the precious time of online users.
Despite the importance of the task, to the best of our knowledge, there is no
publicly available clickbait corpus for the Romanian language. To this end, we
introduce a novel Romanian Clickbait Corpus (RoCliCo) comprising 8,313 news
samples which are manually annotated with clickbait and non-clickbait labels.
Furthermore, we conduct experiments with four machine learning methods, ranging
from handcrafted models to recurrent and transformer-based neural networks, to
establish a line-up of competitive baselines. We also carry out experiments
with a weighted voting ensemble. Among the considered baselines, we propose a
novel BERT-based contrastive learning model that learns to encode news titles
and contents into a deep metric space such that titles and contents of
non-clickbait news have high cosine similarity, while titles and contents of
clickbait news have low cosine similarity. Our data set and code to reproduce
the baselines are publicly available for download at
https://github.com/dariabroscoteanu/RoCliCo.
Comment: Accepted at EMNLP 202
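The contrastive objective described above can be sketched in a few lines. This is an illustrative assumption of one standard margin-based formulation, not the paper's exact loss: the margin value, the toy vectors, and the function names are invented, and real title/content vectors would come from a BERT encoder.

```python
# Hedged sketch: titles and contents live in a shared metric space; the loss
# pulls matching (title, content) pairs of genuine news toward cosine
# similarity 1 and pushes clickbait pairs below a margin.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def contrastive_loss(sim, is_clickbait, margin=0.2):
    if is_clickbait:
        return max(0.0, sim - margin)  # penalise clickbait pairs that look alike
    return 1.0 - sim                   # pull genuine pairs toward similarity 1

# Toy vectors standing in for encoded title and content of one article.
title = np.array([0.2, 0.8, 0.1])
content = np.array([0.2, 0.8, 0.1])
sim = cosine(title, content)

loss_genuine = contrastive_loss(sim, is_clickbait=False)
loss_clickbait = contrastive_loss(sim, is_clickbait=True)
```

At inference time, a low title-content similarity under such a model is itself a clickbait signal, since the title promises content the article does not deliver.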
Comparative Analysis of Word Embeddings for Capturing Word Similarities
Distributed representations have become the most widely used technique
for representing language in various natural language processing tasks. Most
of the natural language processing models that are based on deep learning
techniques use already pre-trained distributed word representations, commonly
called word embeddings. Determining the highest-quality word embeddings is of
crucial importance for such models. However, selecting the appropriate word
embeddings is a perplexing task since the projected embedding space is not
intuitive to humans. In this paper, we explore different approaches for
creating distributed word representations. We perform an intrinsic evaluation
of several state-of-the-art word embedding methods. Their performance on
capturing word similarities is analysed with existing benchmark datasets for
word-pair similarities. The research in this paper conducts a correlation
analysis between ground truth word similarities and similarities obtained by
different word embedding methods.
Comment: Part of the 6th International Conference on Natural Language Processing (NATP 2020)
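The correlation analysis in this setting typically works as follows; the sketch below is a minimal illustration with an invented three-dimensional embedding table and hypothetical human judgements, not data from the paper. Spearman's rank correlation is computed as the Pearson correlation of the ranks (ties ignored for simplicity).

```python
# Hedged sketch: compare ground-truth human similarity scores for word pairs
# against cosine similarities from an embedding model via Spearman's rho.
import numpy as np

def spearman(x, y):
    """Spearman correlation as Pearson correlation of ranks (no tie handling)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / (np.linalg.norm(rx) * np.linalg.norm(ry)))

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

emb = {  # toy 3-d "embeddings", invented for illustration
    "cat": np.array([1.0, 0.9, 0.1]),
    "dog": np.array([0.9, 1.0, 0.2]),
    "car": np.array([0.1, 0.2, 1.0]),
    "bus": np.array([0.2, 0.1, 0.9]),
}

pairs = [("cat", "dog"), ("cat", "car"), ("car", "bus"), ("dog", "bus")]
human = [9.0, 2.0, 8.5, 1.5]  # hypothetical gold similarity judgements

model = [cos(emb[a], emb[b]) for a, b in pairs]
rho = spearman(np.array(human), np.array(model))
```

A higher rho means the embedding space orders word pairs more like human annotators do, which is exactly what the benchmark datasets for word-pair similarity measure.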
Real or not? Identifying untrustworthy news websites using third-party partnerships
Untrustworthy content such as fake news and clickbait has become a pervasive problem on the Internet, causing significant socio-political problems around the world. Identifying untrustworthy content is a crucial step in countering it. The current best practices for identification involve content analysis and arduous fact-checking of the content. To complement content analysis, we propose examining websites' third-parties to identify their trustworthiness. Websites utilize third-parties, also known as their digital supply chains, to create and present content and help the website function. Third-parties are an important indication of a website's business model. Similar websites exhibit similarities in the third-parties they use. Using this perspective, we use machine learning and heuristic methods to discern similarities and dissimilarities in third-party usage, which we use to predict the trustworthiness of websites. We demonstrate the effectiveness and robustness of our approach in predicting the trustworthiness of websites from a database of News, Fake News, and Clickbait websites. Our approach can be easily and cost-effectively implemented to reinforce current identification methods.
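One simple way to quantify "similarity in third-party usage" is a set-overlap measure over the third-party domains each website loads. The sketch below is an assumption for illustration only, using the Jaccard index and invented domain names; it is not the paper's actual feature set or model.

```python
# Hedged sketch: similarity of two websites' digital supply chains as the
# Jaccard index over their sets of third-party domains (all names invented).
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

news_site = ["cdn.example", "analytics.example", "ads.example"]
fake_site = ["ads.example", "tracker.example", "popunder.example"]

sim = jaccard(news_site, fake_site)  # 1 shared domain out of 5 total -> 0.2
```

Pairwise similarities like this could feed a nearest-neighbour or clustering step, so a site whose supply chain overlaps mostly with known untrustworthy sites inherits a low trust prediction.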