2,285 research outputs found
Comparative Analysis of Word Embeddings for Capturing Word Similarities
Distributed language representation has become the most widely used technique
for language representation in various natural language processing tasks. Most
of the natural language processing models that are based on deep learning
techniques use already pre-trained distributed word representations, commonly
called word embeddings. Determining the most qualitative word embeddings is of
crucial importance for such models. However, selecting the appropriate word
embeddings is a perplexing task since the projected embedding space is not
intuitive to humans. In this paper, we explore different approaches for
creating distributed word representations. We perform an intrinsic evaluation
of several state-of-the-art word embedding methods. Their performance on
capturing word similarities is analysed with existing benchmark datasets for
word pairs similarities. The research in this paper conducts a correlation
analysis between ground truth word similarities and similarities obtained by
different word embedding methods.Comment: Part of the 6th International Conference on Natural Language
Processing (NATP 2020
Detecting and Monitoring Hate Speech in Twitter
Social Media are sensors in the real world that can be used to measure the pulse of societies.
However, the massive and unfiltered feed of messages posted in social media is a phenomenon that
nowadays raises social alarms, especially when these messages contain hate speech targeted to a
specific individual or group. In this context, governments and non-governmental organizations
(NGOs) are concerned about the possible negative impact that these messages can have on individuals
or on the society. In this paper, we present HaterNet, an intelligent system currently being used by
the Spanish National Office Against Hate Crimes of the Spanish State Secretariat for Security that
identifies and monitors the evolution of hate speech in Twitter. The contributions of this research
are many-fold: (1) It introduces the first intelligent system that monitors and visualizes, using social
network analysis techniques, hate speech in Social Media. (2) It introduces a novel public dataset on
hate speech in Spanish consisting of 6000 expert-labeled tweets. (3) It compares several classification
approaches based on different document representation strategies and text classification models. (4)
The best approach consists of a combination of a LTSM+MLP neural network that takes as input the
tweet’s word, emoji, and expression tokens’ embeddings enriched by the tf-idf, and obtains an area
under the curve (AUC) of 0.828 on our dataset, outperforming previous methods presented in the
literatureThe work by Quijano-Sanchez was supported by the Spanish Ministry of Science and Innovation
grant FJCI-2016-28855. The research of Liberatore was supported by the Government of Spain, grant MTM2015-65803-R, and by the European Union’s Horizon 2020 Research and Innovation Programme, under the Marie Sklodowska-Curie grant agreement No. 691161 (GEOSAFE). All the financial support is gratefully acknowledge
A Large-Scale CNN Ensemble for Medication Safety Analysis
Revealing Adverse Drug Reactions (ADR) is an essential part of post-marketing
drug surveillance, and data from health-related forums and medical communities
can be of a great significance for estimating such effects. In this paper, we
propose an end-to-end CNN-based method for predicting drug safety on user
comments from healthcare discussion forums. We present an architecture that is
based on a vast ensemble of CNNs with varied structural parameters, where the
prediction is determined by the majority vote. To evaluate the performance of
the proposed solution, we present a large-scale dataset collected from a
medical website that consists of over 50 thousand reviews for more than 4000
drugs. The results demonstrate that our model significantly outperforms
conventional approaches and predicts medicine safety with an accuracy of 87.17%
for binary and 62.88% for multi-classification tasks
Basic tasks of sentiment analysis
Subjectivity detection is the task of identifying objective and subjective
sentences. Objective sentences are those which do not exhibit any sentiment.
So, it is desired for a sentiment analysis engine to find and separate the
objective sentences for further analysis, e.g., polarity detection. In
subjective sentences, opinions can often be expressed on one or multiple
topics. Aspect extraction is a subtask of sentiment analysis that consists in
identifying opinion targets in opinionated text, i.e., in detecting the
specific aspects of a product or service the opinion holder is either praising
or complaining about
Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm
NLP tasks are often limited by scarcity of manually annotated data. In social
media sentiment analysis and related tasks, researchers have therefore used
binarized emoticons and specific hashtags as forms of distant supervision. Our
paper shows that by extending the distant supervision to a more diverse set of
noisy labels, the models can learn richer representations. Through emoji
prediction on a dataset of 1246 million tweets containing one of 64 common
emojis we obtain state-of-the-art performance on 8 benchmark datasets within
sentiment, emotion and sarcasm detection using a single pretrained model. Our
analyses confirm that the diversity of our emotional labels yield a performance
improvement over previous distant supervision approaches.Comment: Accepted at EMNLP 2017. Please include EMNLP in any citations. Minor
changes from the EMNLP camera-ready version. 9 pages + references and
supplementary materia
- …