PageRank optimization applied to spam detection
We give a new link spam detection and PageRank demotion algorithm called
MaxRank. Like TrustRank and AntiTrustRank, it starts with a seed of hand-picked
trusted and spam pages. We define the MaxRank of a page as the frequency of
visits to this page by a random surfer minimizing an average cost per time unit.
On a given page, the random surfer selects a set of hyperlinks and clicks with
uniform probability on any of these hyperlinks. The cost function penalizes
spam pages and hyperlink removals. The goal is to determine a hyperlink
deletion policy that minimizes this score. The MaxRank is interpreted as a
modified PageRank vector, used to sort web pages instead of the usual PageRank
vector. The bias vector of this ergodic control problem, which is unique up to
an additive constant, is a measure of the "spamicity" of each page, used to
detect spam pages. We give a scalable algorithm for MaxRank computation that
allowed us to run experiments on the WEBSPAM-UK2007 dataset. We
show that our algorithm outperforms both TrustRank and AntiTrustRank for spam
and nonspam page detection.
Comment: 8 pages, 6 figures
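To make the average-cost control formulation above concrete, here is a minimal Python sketch of relative value iteration on a toy graph. The graph, the costs, and the restricted two-action choice (keep all out-links, or drop blacklisted ones) are illustrative assumptions; the paper optimizes over sets of hyperlinks with a scalable algorithm rather than this dense iteration.

```python
# Toy sketch of the MaxRank idea: a controller picks which out-links to keep,
# a random surfer clicks uniformly among kept links, and we minimize the
# average cost per step via relative value iteration. All numbers are made up.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 3], 3: [2]}   # toy web graph
spam_cost = {0: 0.0, 1: 0.0, 2: 1.0, 3: 1.0}        # penalty for visiting spam
removal_cost = 0.2                                   # penalty per deleted link
blacklist = {3}                                      # hand-picked spam seed

def actions(page):
    """Illustrative restriction: keep all links, or drop blacklisted targets."""
    keep_all = graph[page]
    pruned = [q for q in graph[page] if q not in blacklist] or keep_all
    return [keep_all, pruned]

h = {p: 0.0 for p in graph}           # bias function, unique up to a constant
for _ in range(200):                  # relative value iteration
    new_h = {}
    for p in graph:
        best = float("inf")
        for kept in actions(p):
            removed = len(graph[p]) - len(kept)
            cost = spam_cost[p] + removal_cost * removed
            best = min(best, cost + sum(h[q] for q in kept) / len(kept))
        new_h[p] = best
    ref = new_h[0]                    # normalize at a reference page
    h = {p: v - ref for p, v in new_h.items()}

print(h)   # under these toy costs, higher bias ~ higher "spamicity"
```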
Incremental Information Gain Analysis of Input Attribute Impact on RBF-Kernel SVM Spam Detection
The massive increase of spam is posing a very serious threat to email and SMS, which have become important means of communication. Not only does spam annoy users, but it has also become a security threat. Machine learning techniques have been widely used for spam detection. Email spam can be detected through senders’ behaviour, the contents of an email, its subject and source address, etc., while SMS spam detection is usually based on the tokens or features of messages, due to their short content. However, a comprehensive analysis of email/SMS content may provide cues that help users become aware of spam; we cannot completely depend on automatic tools to identify all spam. In this paper, we propose an analysis approach based on information entropy and incremental learning to see how various features affect the performance of an RBF-kernel SVM spam detector, so as to increase our awareness of spam by sensing its features. The experiments were carried out on the spambase and SMS Spam Collection datasets from the UCI Machine Learning Repository. The results show that some features have significant impacts on spam detection, of which users should be aware, and that there exists a feature space that achieves Pareto efficiency in True Positive Rate and True Negative Rate.
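As an illustration of the incremental analysis described above, the sketch below ranks features by mutual information (an entropy-based stand-in for information gain), then grows the feature set one attribute at a time and reports the RBF-kernel SVM's True Positive Rate and True Negative Rate. Synthetic data stands in for spambase and the SMS Spam Collection, and all parameter choices are assumptions.

```python
# Incremental information-gain analysis sketch: add features in order of
# their entropy-based score and watch TPR/TNR of an RBF-kernel SVM.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the spambase / SMS Spam Collection data.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Rank attributes by mutual information with the spam/ham label.
order = np.argsort(mutual_info_classif(X_tr, y_tr, random_state=0))[::-1]

for k in range(1, len(order) + 1):
    cols = order[:k]                               # top-k attributes so far
    clf = SVC(kernel="rbf").fit(X_tr[:, cols], y_tr)
    pred = clf.predict(X_te[:, cols])
    tpr = np.mean(pred[y_te == 1] == 1)            # true positive rate
    tnr = np.mean(pred[y_te == 0] == 0)            # true negative rate
    print(f"top {k:2d} features: TPR={tpr:.3f} TNR={tnr:.3f}")
```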
Link-based similarity search to fight web spam
We investigate the usability of similarity search in fighting Web spam, based on the assumption that an unknown spam page is more similar to certain known spam pages than to honest pages. In order to be successful, search engine spam never appears in isolation: we observe link farms and alliances created for the sole purpose of search engine ranking manipulation. This artificial nature and strong internal connectedness, however, have given rise to successful algorithms for identifying search engine spam. One example is trust and distrust propagation, an idea originating in recommender systems and P2P networks, which yields spam classifiers by spreading information along hyperlinks from white and black lists. While most previous results use PageRank variants for propagation, we form classifiers by investigating the similarity top lists of an unknown page under various measures such as co-citation, companion, nearest neighbours in low-dimensional projections, and SimRank. We test our method on two data sets previously used to measure spam filtering algorithms.
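As a concrete illustration of one similarity measure mentioned above, this sketch computes co-citation similarity (the number of common in-linking pages) on a toy graph and classifies an unknown page by a vote over its most similar labeled pages. The graph, the labels, and the simple top-k vote are illustrative; the paper's classifiers also draw on companion, low-dimensional projections, and SimRank, which are not reproduced here.

```python
# Co-citation similarity sketch: an unknown page is judged by how often it is
# co-cited with pages from hand-labeled white/black lists.
from collections import defaultdict

links = {"hub1": ["p", "q"], "hub2": ["p", "q"], "hub3": ["r"]}  # toy graph
labels = {"p": "spam", "r": "honest"}                # seed white/black lists

inlinks = defaultdict(set)                           # invert the link graph
for src, outs in links.items():
    for dst in outs:
        inlinks[dst].add(src)

def cocitation(u, v):
    """Co-citation similarity: number of pages that link to both u and v."""
    return len(inlinks[u] & inlinks[v])

def classify(page, k=2):
    # vote over the top-k labeled pages most co-cited with `page`
    ranked = sorted(labels, key=lambda p: cocitation(page, p), reverse=True)[:k]
    votes = [labels[p] for p in ranked if cocitation(page, p) > 0]
    return max(set(votes), key=votes.count) if votes else "unknown"

print(classify("q"))   # strongly co-cited with known spam page p -> 'spam'
```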
A new semantic attribute deep learning with a linguistic attribute hierarchy for spam detection
The massive increase of spam is posing a very serious threat to email and
SMS, which have become an important means of communication. Not only does
spam annoy users, but it has also become a security threat. Machine learning
techniques have been widely used for spam detection. In this paper, we
propose another form of deep learning, a linguistic attribute hierarchy
embedded with linguistic decision trees, for spam detection, and examine the
effect of the semantic attributes represented by the linguistic attribute
hierarchy on spam detection. A case study on the SMS message database from
the UCI machine learning repository has shown that a linguistic attribute
hierarchy embedded with linguistic decision trees provides a transparent
approach to in-depth analysis of attribute impact on spam detection. This
approach can not only efficiently tackle the ‘curse of dimensionality’ in
spam detection with massive numbers of attributes, but also improve the
performance of spam detection when the semantic attributes are organized
into a proper hierarchy.
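A rough sketch of the attribute-hierarchy idea, with loudly stated substitutions: ordinary scikit-learn decision trees stand in for the paper's linguistic decision trees (label semantics), and the two-level grouping of attributes is invented for illustration. Each sub-tree condenses one attribute group into a single score, and a parent tree combines those scores, which is one way a hierarchy can tame the 'curse of dimensionality'.

```python
# Attribute-hierarchy sketch: group attributes, learn one small tree per
# group, then feed the groups' outputs to a parent tree.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=12, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

groups = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]  # hypothetical grouping
subtrees = [DecisionTreeClassifier(max_depth=3, random_state=1)
            .fit(X_tr[:, g], y_tr) for g in groups]

def level1(X):
    # each sub-tree condenses its attribute group into one spam-score input
    return np.column_stack([t.predict_proba(X[:, g])[:, 1]
                            for t, g in zip(subtrees, groups)])

top = DecisionTreeClassifier(max_depth=3, random_state=1).fit(level1(X_tr), y_tr)
print("hierarchy accuracy:", top.score(level1(X_te), y_te))
```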
Spam detection with a content-based random-walk algorithm
In this work we tackle the problem of spam detection on the Web. Spam web
pages have become a problem for Web search engines, due to the negative
effects that this phenomenon can cause in their retrieval results. Our
approach is based on a random-walk algorithm that obtains a ranking of pages
according to their relevance and their spam likelihood. We introduce the
novelty of taking into account the content of the web pages to characterize
the web graph and to obtain an a-priori estimation of the spam likelihood of
the web pages. Our graph-based algorithm computes two scores for each node
in the graph. Intuitively, these values represent how bad or good (spam-like
or not) a web page is, according to its textual content and its relations in
the graph. Our experiments show that our proposed technique outperforms
other link-based techniques for spam detection.
Ministerio de Educación y Ciencia HUM2007-66607-C04-0
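To illustrate the two-score idea, the sketch below runs a personalized PageRank twice over the same toy graph: once teleporting toward pages whose content looks spam-like and once toward pages that look legitimate. The graph, the content-based priors, and the use of plain personalized PageRank are assumptions standing in for the paper's specific random-walk algorithm.

```python
# Two-score random-walk sketch: content-based priors bias the teleportation,
# yielding a "bad" and a "good" score for every page.
graph = {0: [1], 1: [0, 2], 2: [3], 3: [2]}   # toy web graph
spam_prior = [0.05, 0.05, 0.45, 0.45]          # content-based a-priori spam estimate
good_prior = [0.45, 0.45, 0.05, 0.05]          # complementary non-spam estimate

def personalized_pagerank(graph, teleport, d=0.85, iters=100):
    """Power iteration for PageRank with a non-uniform teleport vector."""
    n = len(graph)
    r = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1 - d) * teleport[i] for i in range(n)]
        for i, outs in graph.items():
            for j in outs:
                nxt[j] += d * r[i] / len(outs)
        r = nxt
    return r

bad = personalized_pagerank(graph, spam_prior)
good = personalized_pagerank(graph, good_prior)
for page in graph:
    print(page, "spam-like" if bad[page] > good[page] else "honest-like")
```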
WikiLinkGraphs: A Complete, Longitudinal and Multi-Language Dataset of the Wikipedia Link Networks
Wikipedia articles contain multiple links connecting a subject to other pages
of the encyclopedia. In Wikipedia parlance, these links are called internal
links or wikilinks. We present a complete dataset of the network of internal
Wikipedia links for the largest language editions. The dataset contains
yearly snapshots of the network and spans the years from the creation of
Wikipedia in 2001 to March 1st, 2018. While previous work has mostly focused on
the complete hyperlink graph, which also includes links automatically generated
by templates, we parsed each revision of each article to track links appearing
in the main text. In this way we obtained a cleaner network, discarding more
than half of the links and representing all and only the links intentionally
added by editors. We describe in detail how the Wikipedia dumps have been
processed and the challenges we have encountered, including the need to handle
special pages such as redirects, i.e., alternative article titles. We present
descriptive statistics of several snapshots of this network. Finally, we
propose several research opportunities that can be explored using this new
dataset.
Comment: 10 pages, 3 figures, 7 tables, LaTeX. Final camera-ready version
accepted at the 13th International AAAI Conference on Web and Social Media
(ICWSM 2019), Munich, Germany, 11-14 June 2019
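In the spirit of the revision-parsing step described above, here is a minimal sketch that extracts wikilink targets from wikitext and resolves redirect titles. The regex, the redirect map, and the sample text are illustrative; the real pipeline described in the paper handles many more cases (namespaces, templates, section anchors, and the full revision history).

```python
# Wikilink extraction sketch: pull [[Target]] and [[Target|label]] links out
# of a revision's wikitext and map redirect titles to their targets.
import re

# Matches [[Title]], [[Title|label]], and [[Title#Section|label]] forms,
# capturing only the article title.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")
redirects = {"UK": "United Kingdom"}           # hypothetical redirect map

def extract_wikilinks(wikitext):
    titles = []
    for target in WIKILINK.findall(wikitext):
        title = target.strip()
        titles.append(redirects.get(title, title))   # resolve redirects
    return titles

sample = "London is the capital of the [[UK]]; see the [[River Thames|Thames]]."
print(extract_wikilinks(sample))   # ['United Kingdom', 'River Thames']
```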