2,230 research outputs found

    Short Messages Spam Filtering Using Sentiment Analysis

    As short instant messages become more widely used, spam and other non-legitimate campaigns over these communication systems are growing. These campaigns, besides being an illegal online activity, are a direct threat to users' privacy. Previous short-message spam filtering techniques focus on automatic text classification and do not take message polarity into account. Focusing on phone SMS messages, this work demonstrates that sentiment analysis techniques can improve spam filtering in short message services. Using a publicly available labelled (spam/legitimate) SMS dataset, we calculate the polarity of each message and add the polarity score to the original dataset, creating new datasets. We compare the results of the best classifiers and filters over the different datasets (with and without polarity) to demonstrate the influence of polarity. Experiments show that the polarity score improves SMS spam classification, on the one hand reaching 98.91% accuracy and, on the other, obtaining 0 false positives at 98.67% accuracy.
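
    The enrichment step described in this abstract (score each message's polarity, then append the score to the labelled dataset) might look roughly like the sketch below. The word lists, the scoring formula and the names `polarity` and `add_polarity` are illustrative assumptions, not the authors' tooling; the paper's actual sentiment analyser is not specified here.

```python
# Tiny illustrative lexicon (an assumption); a real system would use a
# full sentiment resource rather than hand-picked words.
POSITIVE = {"free", "win", "winner", "great", "love", "thanks"}
NEGATIVE = {"sorry", "late", "bad", "miss", "problem"}


def polarity(message):
    """Return a score in [-1, 1]: (positive - negative) / matched words."""
    words = [w.strip(".,!?").lower() for w in message.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    matched = pos + neg
    return 0.0 if matched == 0 else (pos - neg) / matched


def add_polarity(dataset):
    """Append the polarity score to each (label, text) row."""
    return [(label, text, polarity(text)) for label, text in dataset]


# Hypothetical messages, not taken from the dataset used in the paper.
sms = [
    ("spam", "WIN a FREE prize now!"),
    ("ham", "Sorry, running late."),
    ("ham", "See you at noon"),
]
enriched = add_polarity(sms)
```

    Each enriched row can then be fed to any classifier alongside the usual text features, which is what allows a comparison with and without polarity.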

    Data Sets: Word Embeddings Learned from Tweets and General Data

    A word embedding is a low-dimensional, dense and real-valued vector representation of a word. Word embeddings have been used in many NLP tasks. They are usually generated from a large text corpus. The embedding of a word captures both its syntactic and semantic aspects. Tweets are short, noisy and have unique lexical and semantic features that are different from other types of text. Therefore, it is necessary to have word embeddings learned specifically from tweets. In this paper, we present ten word embedding data sets. In addition to the data sets learned from just tweet data, we also built embedding sets from the general data and the combination of tweets with the general data. The general data consist of news articles, Wikipedia data and other web data. These ten embedding models were learned from about 400 million tweets and 7 billion words from the general text. In this paper, we also present two experiments demonstrating how to use the data sets in some NLP tasks, such as tweet sentiment analysis and tweet topic classification.
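
    To make "dense, real-valued vector representation" concrete, the sketch below compares words by the cosine similarity of their vectors. The three-dimensional vectors are made up for illustration (real embeddings have far more dimensions), and `cosine` and `most_similar` are hypothetical helper names, not part of the published data sets.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Made-up toy vectors; these are assumptions, not values from the paper.
embeddings = {
    "happy": [0.9, 0.1, 0.0],
    "glad": [0.8, 0.2, 0.1],
    "storm": [0.0, 0.3, 0.9],
}


def most_similar(word):
    """Return the other vocabulary word whose vector is closest to `word`."""
    return max(
        (w for w in embeddings if w != word),
        key=lambda w: cosine(embeddings[word], embeddings[w]),
    )
```

    With these toy vectors, `most_similar("happy")` picks "glad" over "storm", which is the kind of semantic closeness an embedding is meant to capture.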

    On Identifying Disaster-Related Tweets: Matching-based or Learning-based?

    Social media posts such as tweets are emerging as a source of situational awareness during disasters. Information shared on Twitter by both the affected population (e.g., requesting assistance, warning) and those outside the impact zone (e.g., providing assistance) helps first responders, decision makers, and the public to understand the situation first-hand. Effective use of such information requires timely selection and analysis of tweets that are relevant to a particular disaster. Even though abundant tweets are promising as a data source, it is challenging to automatically identify relevant messages, since tweets are short and unstructured, resulting in unsatisfactory classification performance from conventional learning-based approaches. Thus, we propose a simple yet effective algorithm to identify relevant messages based on matching keywords and hashtags, and provide a comparison between matching-based and learning-based approaches. To evaluate the two approaches, we put them into a framework specifically proposed for analyzing disaster-related tweets. Analysis results on eleven datasets with various disaster types show that our technique provides relevant tweets of higher quality and more interpretable results in sentiment analysis tasks than the learning-based approach.
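
    A matching-based filter of the kind this abstract describes can be sketched in a few lines. This is a minimal illustration assuming exact, case-insensitive matches against hand-picked keyword and hashtag sets; the function name `is_relevant` and the example terms are hypothetical, and the paper's actual matching rules may differ.

```python
import re


def is_relevant(tweet, keywords, hashtags):
    """True if the tweet contains any listed keyword or hashtag (case-insensitive)."""
    text = tweet.lower()
    tags = set(re.findall(r"#(\w+)", text))   # hashtag bodies without the '#'
    words = set(re.findall(r"\w+", text))     # all word tokens
    return bool(tags & hashtags) or bool(words & keywords)


# Hypothetical term lists for one disaster; both are assumptions.
keywords = {"flood", "evacuate"}
hashtags = {"harvey"}

tweets = [
    "Roads closed, FLOOD waters rising",
    "Praying for everyone #Harvey",
    "Great coffee this morning",
]
relevant = [t for t in tweets if is_relevant(t, keywords, hashtags)]
```

    Because every match traces back to a specific term, the reason a tweet was selected is easy to inspect, which is one way a matching approach can be more interpretable than a learned classifier.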

    Short Messages Spam Filtering Combining Personality Recognition and Sentiment Analysis

    Short communication channels are growing due to the huge increase in the number of smartphone and online social network users. This growth attracts malicious campaigns, such as spam campaigns, that are a direct threat to the security and privacy of users. While most research focuses on automatic text classification, in this work we demonstrate that current short-message spam detection systems can be improved using a novel method. We combine personality recognition and sentiment analysis techniques to analyze Short Message Service (SMS) texts. We enrich a publicly available dataset by adding these features to each message, first separately and then in combination, creating new datasets. We apply several combinations of the best SMS spam classifiers and filters to each dataset in order to compare their results. Based on the experimental results, we analyze the real influence of each feature and of the combination of both. In the end, the best results are improved in terms of accuracy, reaching 99.01%, and the number of false positives is reduced.

    The Early Bird Catches The Term: Combining Twitter and News Data For Event Detection and Situational Awareness

    Twitter updates now represent an enormous stream of information originating from a wide variety of formal and informal sources, much of which is relevant to real-world events. In this paper we adapt existing bio-surveillance algorithms to detect localised spikes in Twitter activity corresponding to real events with a high level of confidence. We then develop a methodology to automatically summarise these events, both by providing the tweets which fully describe the event and by linking to highly relevant news articles. We apply our methods to outbreaks of illness and events strongly affecting sentiment. In both case studies we are able to detect events verifiable by third party sources and produce high quality summaries
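
    As a rough illustration of spike detection in an activity stream, the sketch below flags time steps whose tweet count lies far above a trailing baseline. It is a generic z-score detector used as a stand-in, not the bio-surveillance algorithm the paper adapts; the window size, threshold and the name `find_spikes` are assumptions.

```python
from statistics import mean, stdev


def find_spikes(counts, window=7, threshold=3.0):
    """Return indices whose count exceeds the trailing-window mean by more
    than `threshold` standard deviations."""
    spikes = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline) or 1.0  # fall back to 1.0 for a flat baseline
        if (counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes


# Hypothetical per-interval tweet counts for one location; index 7 is a burst.
counts = [10, 12, 11, 9, 10, 11, 10, 50, 11, 10]
spikes = find_spikes(counts)
```

    Running the detector per location rather than over the global stream is what would make the detected spikes "localised" in the sense of the abstract.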

    New approaches for content-based analysis towards online social network spam detection

    Unsolicited email campaigns remain one of the biggest threats, affecting millions of users per day. Although spam filtering techniques are capable of detecting a significant percentage of spam messages, the problem is far from solved, especially given the total amount of spam traffic that flows over the Internet and the new potential attack vectors used by malicious users. The deeply entrenched use of Online Social Networks (OSNs), where millions of users unwittingly share all kinds of personal data, offers a very attractive channel to attackers. Those sites provide two main areas of interest for malicious activities: exploitation of the huge amount of information stored in user profiles, and the possibility of targeting user addresses and user spaces through their personal profiles, groups, pages... Consequently, new types of targeted attacks are being detected on these communication channels. Since the main purpose of spam messages is selling products, creating social alarm, creating public awareness campaigns, generating traffic with viral content, fooling users with suspicious attachments, etc., these communications have a specific writing style that spam filtering can take advantage of. The main objectives of this thesis are: (i) to demonstrate that it is possible to develop new targeted attacks exploiting personalized spam campaigns using OSN information, and (ii) to design and validate novel spam detection methods that help detect the intentionality of messages, using natural language processing techniques, in order to classify them as spam or legitimate. Additionally, those methods must also be effective against the spam that is appearing in OSNs. To achieve the first objective, a system to design and send personalized spam campaigns is proposed. We automatically extract users' public information from a well-known social site. We analyze it and design different templates taking into account the preferences of the users. 
After that, different experiments are carried out sending typical and personalized spam. The results show that the click-through rate is considerably improved with this new strategy. In the second part of the thesis we propose three novel spam filtering methods. Those methods aim to detect non-evident illegitimate intent in order to add valid information that is used by spam classifiers. To detect the intentionality of the texts, we hypothesize that sentiment analysis and personality recognition techniques could provide new means to differentiate spam text from legitimate text. On this assumption, we present three different methods. The first uses sentiment analysis to extract the polarity feature of each analyzed text, so that we can compare the optimistic or pessimistic attitude of spam messages with that of legitimate texts. The second uses personality recognition techniques to add personality dimensions (Extroversion/Introversion, Thinking/Feeling, Judging/Perceiving and Sensing/iNtuition) to the spam filtering process. The last is a combination of the two. Once the methods are described, we experimentally validate the proposed approaches on three different types of spam: email spam, SMS spam and spam from a popular OSN.
Abstract (translated from Basque): Messages sent without the recipient's consent (spam) are a threat that affects millions of users per day. Although spam detection tools achieve increasingly better results, the problem is still far from solved, mainly because of the volume of spam and attackers' new strategies. On top of that, as a result of the growth in users that social networks have seen in recent years, with millions of users making their private data public, these sites have become very attractive places for attackers. These websites offer two main areas of interest: exploitation of all the information accumulated in profiles, and the ease of contacting users directly (through profiles, groups, pages...). Consequently, more and more illegal activity is being detected on these sites. Since the main purposes of spam messages are to sell something, create social alarm, launch awareness campaigns, etc., the writing style these messages usually have can be used to detect them. The main objectives of this work are: on the one hand, to show that it is possible to develop personalized spam that evades current detection systems by using public information from social networks; and on the other, to develop new methodologies that use natural language processing techniques to detect the intentionality of texts and thereby detect spam. In addition, these systems must also be able to handle spam messages from social networks. To achieve the first objective, this work presents a system for designing and sending personalized spam. We automatically extract users' public information from a well-known social network, then analyze that information and develop different templates taking users' opinions into account. Once this is done, we carry out several experiments sending typical and personalized spam, to compare the difference in results between the two. In the second part of the thesis we present three new spam detection methodologies. Their aim is to detect non-trivial commercial intent and use it to classify spam messages. To capture that intentionality, we use sentiment analysis and personality detection techniques. Three different systems are thus presented here: the first uses sentiment analysis only, the second examines what personality detection techniques offer for this task, and the last is a combination of the two.
Using these tools, an experimental validation is carried out to assess whether the proposed systems are effective, working with three different types of spam: email spam, SMS spam and spam from a popular social network.

    Uso de Técnicas de Reconocimiento de la Personalidad para Mejorar el Filtrado Bayesiano de Spam

    Millions of users per day are affected by unsolicited email campaigns. In recent years several techniques to detect spam have been developed, with machine learning algorithms achieving especially good results. In this work we provide a baseline for a new spam filtering method. In carrying out this research we validate our hypothesis that personality recognition techniques can help in Bayesian spam filtering. We add the personality feature to each email using personality recognition techniques, and then we compare Bayesian spam filters with and without personality in terms of accuracy. In a second experiment we combine the personality and polarity features of each message and compare all the results. In the end, the top ten Bayesian filtering classifiers are improved, reaching 99.24% accuracy and also reducing the number of false positives.
    Abstract (translated from Spanish): Millions of users are affected every day by unsolicited email campaigns. In recent years researchers have developed various spam detection techniques, with machine learning algorithms giving especially good results. In this work we present a basis for a new spam filtering method. During the study we validated the hypothesis that personality recognition techniques can help improve Bayesian spam filtering. Using these techniques, we add the personality feature to each email and then compare the results of Bayesian spam filtering with and without personality, analyzing the results in terms of accuracy. In a second experiment, we combine the personality and polarity features of each message and compare the results. In the end, we improve the results of Bayesian spam filtering, reaching 99.24% accuracy and reducing the number of false positives.
    This work has been partially funded by the Basque Department of Education, Language policy and Culture under the project SocialSPAM (PI_2014_1_102).