Combining Textual Content and Hyperlinks in Web Spam Detection
In this work, we tackle the problem of spam detection on
the Web. Spam web pages have become a problem for Web search engines
due to the negative effects that this phenomenon can have on
their retrieval results. Our approach is based on a random-walk algorithm
that obtains a ranking of pages according to their relevance and
their spam likelihood. We introduce the novelty of taking into account
the content of the web pages to characterize the web graph and to obtain
an a priori estimation of the spam likelihood of the web pages. Our
graph-based algorithm computes two scores for each node in the graph.
Intuitively, these values represent how bad or good (spam-like or not)
a web page is, according to its textual content and the relations in the
graph. Our experiments show that our proposed technique outperforms
other link-based techniques for spam detection.
Ministerio de Educación y Ciencia HUM2007-66607-C04-0
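The a priori content-based estimate described above can be illustrated with a minimal sketch: score a page by the fraction of its terms drawn from a spam-associated vocabulary. The function name and word lists are illustrative assumptions, not the paper's actual feature set.

```python
def apriori_spam_likelihood(text, spam_terms):
    """Fraction of a page's terms that belong to a spam-associated
    vocabulary; a crude stand-in for a content-based spam prior."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in spam_terms for w in words) / len(words)
```

In a scheme like the one described, such a prior would seed each node of the web graph before any link information is propagated.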
PolaritySpam: Propagating Content-based Information Through a Web-Graph to Detect Web Spam
Spam web pages have become a problem for Information Retrieval systems
due to the negative effects that this phenomenon can have on their results. In this work
we tackle the problem of detecting these pages with a propagation algorithm that, taking
as input a web graph, chooses a set of spam and not-spam web pages in order to spread
their spam likelihood over the rest of the network. Thus we take advantage of the links
between pages to obtain a ranking of pages according to their relevance and their spam
likelihood. Our intuition is to give a high reputation to pages related to
relevant ones, and a high spam likelihood to pages linked to spam web pages.
We introduce the novelty of including the content of the web pages in the computation of
an a priori estimation of the spam likelihood of the pages, and propagate this information.
Our graph-based algorithm computes two scores for each node in the graph. Intuitively,
these values represent how bad or good (spam-like or not) a web page is, according to its
textual content and its relations in the graph. The experimental results show that our
method outperforms other techniques for spam detection.
Ministerio de Educación y Ciencia HUM2007-66607-C04-0
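The two-score propagation described above can be sketched as a PageRank-style iteration seeded from known good and spam pages, with reputation flowing along out-links from reputable pages and spam likelihood flowing from spam pages. The function name, damping factor, and iteration count are assumptions for illustration, not the paper's exact formulation.

```python
def propagate_scores(links, good_seeds, spam_seeds, damping=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to.
    Returns two dicts: a reputation score and a spam-likelihood score
    per page, propagated from the seed sets along hyperlinks."""
    pages = set(links) | {t for ts in links.values() for t in ts}
    good = {p: 1.0 if p in good_seeds else 0.0 for p in pages}
    bad = {p: 1.0 if p in spam_seeds else 0.0 for p in pages}
    for _ in range(iters):
        # restart mass keeps the seed labels anchored at their sources
        new_good = {p: (1 - damping) * (p in good_seeds) for p in pages}
        new_bad = {p: (1 - damping) * (p in spam_seeds) for p in pages}
        for src, targets in links.items():
            if not targets:
                continue
            share = damping / len(targets)
            for t in targets:
                # reputation flows to pages linked from reputable ones;
                # spam likelihood flows to pages linked from spam
                new_good[t] += share * good[src]
                new_bad[t] += share * bad[src]
        good, bad = new_good, new_bad
    return good, bad
```

A page linked only from a spam seed ends up with a high spam-likelihood score and no reputation, matching the intuition stated in the abstract.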
Voting-based Classification for E-mail Spam Detection
The problem of spam e-mail has gained a tremendous amount of attention. Although recipients rely on spam-filter applications to screen incoming mail, marketing companies still send unsolicited e-mails in bulk, and users still receive a considerable amount of spam despite those filters. This work proposes a new method for classifying e-mails into spam and non-spam. First, several e-mail content features are extracted, and those features are then used to classify each e-mail individually. The classification results of three different classifiers (i.e. Decision Trees, Random Forests and k-Nearest Neighbor) are combined in various voting schemes (i.e. majority vote, average probability, product of probabilities, minimum probability and maximum probability) to make the final decision. To validate our method, two different spam e-mail collections were used.
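The five voting schemes named above can be sketched as follows, combining per-classifier spam probabilities into one decision. The function name and the 0.5 decision threshold are assumptions for the example, not details from the paper.

```python
import math

def combine(probs, scheme="majority"):
    """probs: spam probabilities, one per base classifier.
    Returns True if the ensemble labels the e-mail as spam."""
    if scheme == "majority":
        # each classifier casts a hard vote at the 0.5 threshold
        votes = sum(p >= 0.5 for p in probs)
        return votes > len(probs) / 2
    if scheme == "average":
        return sum(probs) / len(probs) >= 0.5
    if scheme == "product":
        # compare the product of spam probabilities against
        # the product of the complementary (non-spam) probabilities
        return math.prod(probs) >= math.prod(1 - p for p in probs)
    if scheme == "minimum":
        return min(probs) >= 0.5
    if scheme == "maximum":
        return max(probs) >= 0.5
    raise ValueError(f"unknown scheme: {scheme}")
```

With probabilities [0.9, 0.8, 0.2], majority vote labels the e-mail spam (two of three classifiers vote spam) while the minimum-probability rule does not, which is the kind of disagreement the paper's comparison of schemes examines.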
Link-based similarity search to fight web spam
www.ilab.sztaki.hu/websearch
We investigate the usability of similarity search in fighting Web spam, based on the assumption that an unknown spam page is more similar to certain known spam pages than to honest pages. To be successful, search engine spam never appears in isolation: we observe link farms and alliances built for the sole purpose of search engine ranking manipulation. Their artificial nature and strong internal connectedness, however, have given rise to successful algorithms for identifying search engine spam. One example is trust and distrust propagation, an idea originating in recommender systems and P2P networks, which yields spam classifiers by spreading information along hyperlinks from whitelists and blacklists. While most previous results use PageRank variants for propagation, we form classifiers by investigating the similarity top lists of an unknown page under various measures such as co-citation, companion, nearest neighbors in low-dimensional projections, and SimRank. We test our method over two data sets previously used to measure spam filtering algorithms.
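Co-citation, the simplest of the similarity measures listed above, can be sketched as the overlap of two pages' in-link sets; a page whose most similar neighbours are mostly known spam is then suspicious. The function names, the Jaccard normalization, and the top-k scoring rule are illustrative assumptions, not the paper's exact classifier.

```python
def cocitation(inlinks, p, q):
    """inlinks: dict mapping each page to the set of pages linking to it.
    Jaccard overlap of the in-link sets of p and q: pages cited by
    many of the same sources are considered similar."""
    a, b = inlinks.get(p, set()), inlinks.get(q, set())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def spam_score(inlinks, page, known_spam, k=3):
    """Fraction of the k most co-cited neighbours that are known spam."""
    others = [q for q in inlinks if q != page]
    top = sorted(others, key=lambda q: cocitation(inlinks, page, q),
                 reverse=True)[:k]
    return sum(q in known_spam for q in top) / max(len(top), 1)
```

A page co-cited with members of a link farm shares their in-link sources, so its similarity top list fills with known spam, which is the regularity the abstract exploits.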
A systematic survey of online data mining technology intended for law enforcement
As an increasing amount of crime takes on a digital aspect, law enforcement bodies must tackle an online environment that generates huge volumes of data. With manual inspection becoming increasingly infeasible, law enforcement bodies are optimising online investigations through data-mining technologies. Such technologies must be well designed and rigorously grounded, yet no survey of the online data-mining literature exists that examines their techniques, applications and rigour. This article remedies that gap through a systematic mapping study describing online data-mining literature that visibly targets law enforcement applications, using evidence-based practices in survey making to produce a replicable analysis which can be methodologically examined for deficiencies.
Phishing detection and traceback mechanism
Isredza Rahmi A Hamid's thesis, entitled Phishing Detection and Traceback Mechanism, investigates the detection of phishing attacks through e-mail, a novel method to profile the attacker, and the tracing of an attack back to its origin.
- …