R-SpamRank: A Spam Detection Algorithm Based on Link Analysis

Abstract

Spam web pages use various techniques to achieve higher-than-deserved rankings. Although human experts can easily identify spam pages, manually evaluating a large number of pages is time-consuming and costly. To assist manual evaluation, we propose an algorithm that assigns spam values to web pages and semi-automatically selects potential spam pages. We first manually select a small set of spam pages as seeds. Then, based on the link structure of the web, the initial R-SpamRank values assigned to the seed pages propagate through links and are distributed across the whole set of web pages. The pages are then sorted by their R-SpamRank values, and those with high values are selected. Our experiments and analyses show that the algorithm is highly successful in identifying spam pages, achieving a precision of 99.1% among the 10,000 web pages with the highest R-SpamRank values.
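
The seed-and-propagate procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not state the update rule, so the one used here is an assumption: a damped, PageRank-style iteration in which a page receives score from the pages it links to, each target's score being divided by that target's in-degree. The function names, the damping value of 0.85, and the iteration count are hypothetical choices made for illustration only.

```python
from collections import defaultdict


def propagate_spam_scores(out_links, seeds, damping=0.85, iterations=20):
    """Spread seed scores backward along links (assumed update rule)."""
    # Collect every page that appears as a source or as a link target.
    pages = set(out_links) | {t for ts in out_links.values() for t in ts}

    # Count how many links point at each page.
    in_degree = defaultdict(int)
    for targets in out_links.values():
        for target in targets:
            in_degree[target] += 1

    # Seed pages start with score 1, all other pages with 0.
    seed_score = {p: (1.0 if p in seeds else 0.0) for p in pages}
    score = dict(seed_score)

    for _ in range(iterations):
        updated = {}
        for page in pages:
            # A page inherits a share of the score of each page it links to.
            received = sum(
                score[t] / in_degree[t] for t in out_links.get(page, ())
            )
            updated[page] = (1 - damping) * seed_score[page] + damping * received
        score = updated
    return score


def top_candidates(score, k):
    """Return the k pages with the highest scores, highest first."""
    return sorted(score, key=score.get, reverse=True)[:k]


# Toy graph: "a" links to a known spam page, "b" links to "a".
graph = {"a": {"spam"}, "b": {"a"}, "c": {"good"}}
scores = propagate_spam_scores(graph, seeds={"spam"})
candidates = top_candidates(scores, 3)
```

In this toy graph, pages that link directly or indirectly to the seed receive a nonzero score, while pages with no path to the seed stay at zero. The ranked list is intended as input to manual review, in line with the semi-automatic selection the abstract describes.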
