Combining Textual Content and Hyperlinks in Web Spam Detection
In this work, we tackle the problem of spam detection on
the Web. Spam web pages have become a problem for Web search engines
due to the negative effects that this phenomenon can have on
their retrieval results. Our approach is based on a random-walk algorithm
that obtains a ranking of pages according to their relevance and
their spam likelihood. We introduce the novelty of taking into account
the content of the web pages to characterize the web graph and to obtain
an a priori estimation of the spam likelihood of the web pages. Our
graph-based algorithm computes two scores for each node in the graph.
Intuitively, these values represent how bad or good (spam-like or not)
a web page is, according to its textual content and the relations in the
graph. Our experiments show that our proposed technique outperforms
other link-based techniques for spam detection.
Ministerio de Educación y Ciencia HUM2007-66607-C04-0
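The a priori content-based estimate described above can be illustrated with a minimal sketch: score a page by the fraction of its terms drawn from a spam-associated vocabulary. The function name and word lists are illustrative assumptions, not the paper's actual feature set.

```python
def apriori_spam_likelihood(text, spam_terms):
    """Fraction of a page's terms that belong to a spam-associated
    vocabulary; a crude stand-in for a content-based spam prior."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in spam_terms for w in words) / len(words)
```

In a scheme like the one described, such a prior would seed each node of the web graph before any link information is propagated.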
PolaritySpam: Propagating Content-based Information Through a Web-Graph to Detect Web Spam
Spam web pages have become a problem for Information Retrieval systems
due to the negative effects that this phenomenon can have on their results. In this work
we tackle the problem of detecting these pages with a propagation algorithm that, taking
as input a web graph, chooses a set of spam and not-spam web pages in order to spread
their spam likelihood over the rest of the network. Thus we take advantage of the links
between pages to obtain a ranking of pages according to their relevance and their spam
likelihood. Our intuition is to give a high reputation to pages related to
relevant ones, and a high spam likelihood to pages linked to spam web pages.
We introduce the novelty of including the content of the web pages in the computation of
an a priori estimation of the spam likelihood of the pages, and propagate this information.
Our graph-based algorithm computes two scores for each node in the graph. Intuitively,
these values represent how bad or good (spam-like or not) a web page is, according to its
textual content and its relations in the graph. The experimental results show that our
method outperforms other techniques for spam detection.
Ministerio de Educación y Ciencia HUM2007-66607-C04-0
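The two-score propagation described above can be sketched as a PageRank-style iteration seeded from known good and spam pages, with reputation flowing along out-links from reputable pages and spam likelihood flowing from spam pages. The function name, damping factor, and iteration count are assumptions for illustration, not the paper's exact formulation.

```python
def propagate_scores(links, good_seeds, spam_seeds, damping=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to.
    Returns two dicts: a reputation score and a spam-likelihood score
    per page, propagated from the seed sets along hyperlinks."""
    pages = set(links) | {t for ts in links.values() for t in ts}
    good = {p: 1.0 if p in good_seeds else 0.0 for p in pages}
    bad = {p: 1.0 if p in spam_seeds else 0.0 for p in pages}
    for _ in range(iters):
        # restart mass keeps the seed labels anchored at their sources
        new_good = {p: (1 - damping) * (p in good_seeds) for p in pages}
        new_bad = {p: (1 - damping) * (p in spam_seeds) for p in pages}
        for src, targets in links.items():
            if not targets:
                continue
            share = damping / len(targets)
            for t in targets:
                # reputation flows to pages linked from reputable ones;
                # spam likelihood flows to pages linked from spam
                new_good[t] += share * good[src]
                new_bad[t] += share * bad[src]
        good, bad = new_good, new_bad
    return good, bad
```

A page linked only from a spam seed ends up with a high spam-likelihood score and no reputation, matching the intuition stated in the abstract.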
Voting-based Classification for E-mail Spam Detection
The problem of spam e-mail has gained a tremendous amount of attention. Although recipients rely on spam-filter applications to screen incoming mail, marketing companies still send unsolicited e-mails in bulk, and users still receive a considerable amount of spam despite those filters. This work proposes a new method for classifying e-mails into spam and non-spam. First, several e-mail content features are extracted, and those features are then used to classify each e-mail individually. The classification results of three different classifiers (i.e. Decision Trees, Random Forests and k-Nearest Neighbor) are combined in various voting schemes (i.e. majority vote, average probability, product of probabilities, minimum probability and maximum probability) to make the final decision. To validate our method, two different spam e-mail collections were used.
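The five voting schemes named above can be sketched as follows, combining per-classifier spam probabilities into one decision. The function name and the 0.5 decision threshold are assumptions for the example, not details from the paper.

```python
import math

def combine(probs, scheme="majority"):
    """probs: spam probabilities, one per base classifier.
    Returns True if the ensemble labels the e-mail as spam."""
    if scheme == "majority":
        # each classifier casts a hard vote at the 0.5 threshold
        votes = sum(p >= 0.5 for p in probs)
        return votes > len(probs) / 2
    if scheme == "average":
        return sum(probs) / len(probs) >= 0.5
    if scheme == "product":
        # compare the product of spam probabilities against
        # the product of the complementary (non-spam) probabilities
        return math.prod(probs) >= math.prod(1 - p for p in probs)
    if scheme == "minimum":
        return min(probs) >= 0.5
    if scheme == "maximum":
        return max(probs) >= 0.5
    raise ValueError(f"unknown scheme: {scheme}")
```

With probabilities [0.9, 0.8, 0.2], majority vote labels the e-mail spam (two of three classifiers vote spam) while the minimum-probability rule does not, which is the kind of disagreement the paper's comparison of schemes examines.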
Link-based similarity search to fight web spam
www.ilab.sztaki.hu/websearch
We investigate the usability of similarity search in fighting Web spam, based on the assumption that an unknown spam page is more similar to certain known spam pages than to honest pages. To be successful, search engine spam never appears in isolation: we observe link farms and alliances built for the sole purpose of search engine ranking manipulation. Their artificial nature and strong internal connectedness, however, have given rise to successful algorithms for identifying search engine spam. One example is trust and distrust propagation, an idea originating in recommender systems and P2P networks, which yields spam classifiers by spreading information along hyperlinks from whitelists and blacklists. While most previous results use PageRank variants for propagation, we form classifiers by investigating the similarity top lists of an unknown page under various measures such as co-citation, companion, nearest neighbors in low-dimensional projections, and SimRank. We test our method over two data sets previously used to measure spam filtering algorithms.
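Co-citation, the simplest of the similarity measures listed above, can be sketched as the overlap of two pages' in-link sets; a page whose most similar neighbours are mostly known spam is then suspicious. The function names, the Jaccard normalization, and the top-k scoring rule are illustrative assumptions, not the paper's exact classifier.

```python
def cocitation(inlinks, p, q):
    """inlinks: dict mapping each page to the set of pages linking to it.
    Jaccard overlap of the in-link sets of p and q: pages cited by
    many of the same sources are considered similar."""
    a, b = inlinks.get(p, set()), inlinks.get(q, set())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def spam_score(inlinks, page, known_spam, k=3):
    """Fraction of the k most co-cited neighbours that are known spam."""
    others = [q for q in inlinks if q != page]
    top = sorted(others, key=lambda q: cocitation(inlinks, page, q),
                 reverse=True)[:k]
    return sum(q in known_spam for q in top) / max(len(top), 1)
```

A page co-cited with members of a link farm shares their in-link sources, so its similarity top list fills with known spam, which is the regularity the abstract exploits.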
A systematic survey of online data mining technology intended for law enforcement
As an increasing amount of crime takes on a digital aspect, law enforcement bodies must tackle an online environment that generates huge volumes of data. With manual inspection becoming increasingly infeasible, law enforcement bodies are optimising online investigations through data-mining technologies. Such technologies must be well designed and rigorously grounded, yet no survey of the online data-mining literature exists that examines their techniques, applications and rigour. This article remedies that gap through a systematic mapping study describing online data-mining literature that visibly targets law enforcement applications, using evidence-based practices in survey making to produce a replicable analysis which can be methodologically examined for deficiencies.
Phishing detection and traceback mechanism
Isredza Rahmi A Hamid's thesis, entitled Phishing Detection and Traceback Mechanism, investigates the detection of phishing attacks through e-mail, a novel method to profile the attacker, and the tracing of an attack back to its origin.
- …