2,320 research outputs found
Short Messages Spam Filtering Using Sentiment Analysis
In the same way that short instant messages are more and more used, spam and non-legitimate campaigns through this type of communication systems are growing up. Those campaigns, besides being an illegal online activity, are a direct threat to the privacy of the users. Previous short messages spam filtering techniques focus on automatic text classification and do not take message polarity into account. Focusing on phone SMS messages, this work demonstrates that it is possible to improve spam filtering in short message services using sentiment analysis techniques. Using a publicly available labelled (spam/legitimate) SMS dataset, we calculate the polarity of each message and aggregate the polarity score to the original dataset, creating new datasets. We compare the results of the best classifiers and filters over the different datasets (with and without polarity) in order to demonstrate the influence of the polarity. Experiments show that polarity score improves the SMS spam classification, on the one hand, reaching to a 98.91% of accuracy. And on the other hand, obtaining a result of 0 false positives with 98.67% of accuracy
Data Sets: Word Embeddings Learned from Tweets and General Data
A word embedding is a low-dimensional, dense and real- valued vector
representation of a word. Word embeddings have been used in many NLP tasks.
They are usually gener- ated from a large text corpus. The embedding of a word
cap- tures both its syntactic and semantic aspects. Tweets are short, noisy and
have unique lexical and semantic features that are different from other types
of text. Therefore, it is necessary to have word embeddings learned
specifically from tweets. In this paper, we present ten word embedding data
sets. In addition to the data sets learned from just tweet data, we also built
embedding sets from the general data and the combination of tweets with the
general data. The general data consist of news articles, Wikipedia data and
other web data. These ten embedding models were learned from about 400 million
tweets and 7 billion words from the general text. In this paper, we also
present two experiments demonstrating how to use the data sets in some NLP
tasks, such as tweet sentiment analysis and tweet topic classification tasks
On Identifying Disaster-Related Tweets: Matching-based or Learning-based?
Social media such as tweets are emerging as platforms contributing to
situational awareness during disasters. Information shared on Twitter by both
affected population (e.g., requesting assistance, warning) and those outside
the impact zone (e.g., providing assistance) would help first responders,
decision makers, and the public to understand the situation first-hand.
Effective use of such information requires timely selection and analysis of
tweets that are relevant to a particular disaster. Even though abundant tweets
are promising as a data source, it is challenging to automatically identify
relevant messages since tweet are short and unstructured, resulting to
unsatisfactory classification performance of conventional learning-based
approaches. Thus, we propose a simple yet effective algorithm to identify
relevant messages based on matching keywords and hashtags, and provide a
comparison between matching-based and learning-based approaches. To evaluate
the two approaches, we put them into a framework specifically proposed for
analyzing disaster-related tweets. Analysis results on eleven datasets with
various disaster types show that our technique provides relevant tweets of
higher quality and more interpretable results of sentiment analysis tasks when
compared to learning approach
Short Messages Spam Filtering Combining Personality Recognition and Sentiment Analysis
Currently, short communication channels are growing up due to the huge increase in the number of smartphones and online social networks users. This growth attracts malicious campaigns, such as spam campaigns, that are a direct threat to the security and privacy of the users. While most researches are focused on automatic text classification, in this work we demonstrate the possibility of improving current short messages spam detection systems using a novel method. We combine personality recognition and sentiment analysis techniques to analyze Short Message Services (SMS) texts. We enrich a publicly available dataset adding these features, first separately and after in combination, of each message to the dataset, creating new datasets. We apply several combinations of the best SMS spam classifiers and filters to each dataset in order to compare the results of each one. Taking into account the experimental results we analyze the real inuence of each feature and the combination of both. At the end, the best results are improved in terms of accuracy, reaching to a 99.01% and the number of false positive is reduced
The Early Bird Catches The Term: Combining Twitter and News Data For Event Detection and Situational Awareness
Twitter updates now represent an enormous stream of information originating
from a wide variety of formal and informal sources, much of which is relevant
to real-world events. In this paper we adapt existing bio-surveillance
algorithms to detect localised spikes in Twitter activity corresponding to real
events with a high level of confidence. We then develop a methodology to
automatically summarise these events, both by providing the tweets which fully
describe the event and by linking to highly relevant news articles. We apply
our methods to outbreaks of illness and events strongly affecting sentiment. In
both case studies we are able to detect events verifiable by third party
sources and produce high quality summaries
Uso de Técnicas de Reconocimiento de la Personalidad para Mejorar el Filtrado Bayesiano de Spam
Millions of users per day are affected by unsolicited email campaigns. During the last years several techniques to detect spam have been developed, achieving specially good results using machine learning algorithms. In this work we provide a baseline for a new spam filtering method. Carrying out this research we validate our hypothesis that personality recognition techniques can help in Bayesian spam filtering. We add the personality feature to each email using personality recognition techniques, and then we compare Bayesian spam filters with and without personality in terms of accuracy. In a second experiment we combine personality and polarity features of each message and we compare all the results. At the end, the top ten Bayesian filtering classifiers have been improved, reaching to a 99.24% of accuracy, reducing also the false positive number.Millones de usuarios se ven afectados por las campanas de envÃo de correos electrónicos no deseados al dÃa. Durante los últimos años diferentes técnicas de detección de spam han sido desarrollados por investigadores, obteniendo especialmente buenos resultados con algoritmos de aprendizaje automático. En este trabajo presentamos una base para un nuevo método de filtrado de spam. Durante el estudio hemos validado la hipótesis de que las técnicas de reconocimiento de personalidad pueden ayudar a mejorar el filtrado Bayesiano de spam. Usando estas técnicas de filtrado, añadimos la caracterÃstica de personalidad a cada correo, y después comparamos los resultados del filtrado Bayesiano de spam con y sin personalidad, analizando los resultados en términos de exactitud. En un segundo experimento, combinamos las caracterÃsticas de personalidad y polaridad de cada mensaje, y comparamos los resultados. Al final, conseguimos mejorar los resultados del filtrado Bayesiano de spam, alcanzando el 99,24% de exactitud, y reduciendo el número de falsos positivos.This work has been partially funded by the Basque Department of Education, Language policy and Culture under the project SocialSPAM (PI_2014_1_102)
- …