BlogForever: D2.5 Weblog Spam Filtering Report and Associated Methodology
This report is a first attempt to define the BlogForever spam detection strategy. It comprises a survey of weblog spam technology and approaches to its detection. While the report was written to help identify possible approaches to spam detection as a component within the BlogForever software, the discussion has been extended to include observations on the historical, social and practical value of spam, and proposals for other ways of dealing with spam within the repository without necessarily removing it. It contains a general overview of spam types, ready-made anti-spam APIs available for weblogs, possible methods that have been suggested for preventing the introduction of spam into a blog, and research related to spam, focusing on spam that appears in the weblog context. It concludes with a proposal for a spam detection workflow that might form the basis for the spam detection component of the BlogForever software.
Let Your CyberAlter Ego Share Information and Manage Spam
Almost all of us have multiple cyberspace identities, and these
cyberalter egos are networked together to form a vast cyberspace social
network. This network is distinct from the world-wide-web (WWW), which is being
queried and mined to the tune of billions of dollars every day, and until
recently, has gone largely unexplored. Empirically, the cyberspace social
networks have been found to possess many of the same complex features that
characterize their real counterparts, including scale-free degree distributions,
low diameter, and extensive connectivity. We show that these topological
features make the latent networks particularly suitable for explorations and
management via local-only messaging protocols. Cyberalter egos can
communicate via their direct links (i.e., using only their own address books)
and set up a highly decentralized and scalable message passing network that can
allow large-scale sharing of information and data. As one particular example of
such collaborative systems, we provide a design of a spam filtering system, and
our large-scale simulations show that the system achieves a spam detection rate
close to 100%, while the false positive rate is kept around zero. This system
has several advantages over other recent proposals: (i) It uses an already
existing network, created by the same social dynamics that govern our daily
lives, and no dedicated peer-to-peer (P2P) systems or centralized server-based
systems need be constructed; (ii) It utilizes a percolation search algorithm
that makes the query-generated traffic scalable; (iii) The network has a built
in trust system (just as in social networks) that can be used to thwart
malicious attacks; (iv) It can be implemented right now as a plugin to popular
email programs, such as MS Outlook, Eudora, and Sendmail.
Comment: 13 pages, 10 figures
CEAI: CCM based Email Authorship Identification Model
In this paper we present a model for email authorship identification (EAI) by
employing a Cluster-based Classification (CCM) technique. Traditionally,
stylometric features have been successfully employed in various authorship
analysis tasks; we extend the traditional feature-set to include some more
interesting and effective features for email authorship identification (e.g.
the last punctuation mark used in an email, the tendency of an author to use
capitalization at the start of an email, or the punctuation after a greeting or
farewell). We also include content features chosen by information-gain feature selection.
It is observed that the use of such features in the authorship identification
process has a positive impact on the accuracy of the authorship identification
task. We performed experiments to justify our arguments and compared the
results with other baseline models. Experimental results reveal that the
proposed CCM-based email authorship identification model, along with the
proposed feature set, outperforms the state-of-the-art support vector machine
(SVM)-based models, as well as the models proposed by Iqbal et al. [1, 2]. The
proposed model attains an accuracy of 94% for 10 authors, 89% for 25
authors, and 81% for 50 authors on the Enron dataset, while 89.5%
accuracy has been achieved on a real email dataset constructed by the authors. The
results on the Enron dataset have been achieved on quite a large number of authors
compared with the models proposed by Iqbal et al. [1, 2].
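The three example features named in this abstract (last punctuation mark, capitalization at the start of an email, punctuation after a greeting) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `stylometric_features`, the `GREETINGS` word list, and the rule that a greeting is detected from the first word of the first line are all assumptions made for illustration.

```python
import string

# Hypothetical greeting vocabulary; the paper does not specify one.
GREETINGS = {"hi", "hello", "hey", "dear"}


def stylometric_features(email_text: str) -> dict:
    """Extract three illustrative stylometric features from an email body."""
    text = email_text.strip()

    # Last punctuation mark appearing anywhere in the email (None if absent).
    last_punct = next(
        (ch for ch in reversed(text) if ch in string.punctuation), None
    )

    # True if the first alphabetic character of the email is upper case.
    first_alpha = next((ch for ch in text if ch.isalpha()), None)
    starts_capitalized = first_alpha is not None and first_alpha.isupper()

    # Punctuation closing the greeting line, assuming the greeting is the
    # first line and starts with a word from GREETINGS.
    greeting_punct = None
    lines = text.splitlines()
    first_line = lines[0].strip() if lines else ""
    words = first_line.split()
    if words and words[0].strip(string.punctuation).lower() in GREETINGS:
        if first_line[-1] in string.punctuation:
            greeting_punct = first_line[-1]

    return {
        "last_punctuation": last_punct,
        "starts_capitalized": starts_capitalized,
        "greeting_punctuation": greeting_punct,
    }


sample = stylometric_features("Hi John,\nplease send the report. Thanks!")
plain = stylometric_features("see you tomorrow")
```

In a setup like the one the abstract describes, values such as these would be encoded numerically and combined with the other stylometric and content features before clustering and classification.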
SMS Spam Filtering: Methods and Data
Mobile or SMS spam is a real and growing problem, primarily due to the availability of very cheap bulk pre-pay SMS packages and the fact that SMS engenders higher response rates as it is a trusted and personal service. SMS spam filtering is a relatively new task which inherits many issues and solutions from email spam filtering. However, it poses its own specific challenges. This paper motivates work on filtering SMS spam and reviews recent developments in SMS spam filtering. The paper also discusses the issues with data collection and availability for furthering research in this area, analyses a large corpus of SMS spam, and provides some initial benchmark results.
Extracting semantic entities and events from sports tweets
Large volumes of user-generated content on practically every major issue and event are being created on the microblogging site Twitter. This content can be combined and processed to detect events, entities and popular moods to feed various knowledge-intensive practical applications. On the downside, these content items are very noisy and highly informal, making it difficult to extract sense out of the stream. In this paper, we exploit various approaches to detect the named entities and significant micro-events from users' tweets during a live sports event. Here we describe how combining linguistic features with background knowledge and the use of Twitter-specific features can achieve high, precise detection results (f-measure = 87%) in different datasets. A study was conducted on tweets from cricket matches in the ICC World Cup in order to augment the event-related non-textual media with collective intelligence.