
Agnostic Topology-Based Spam Avoidance in Large-Scale Web Crawls

By Clint Sparkman, Hsin-tsang Lee and Dmitri Loguinov

Abstract

With the proliferation of web spam and questionable content with virtually infinite auto-generated structure, large-scale web crawlers now require low-complexity ranking methods to budget their limited resources effectively and allocate the majority of bandwidth to reputable sites. To shed light on Internet-wide spam avoidance, we study the domain-level graph from a 6.3B-page web crawl and compare several agnostic topology-based ranking algorithms on this dataset. We first propose a new methodology for comparing the various rankings and then show that in-degree BFS-based techniques decisively outperform classic PageRank-style methods. However, since BFS requires several orders of magnitude higher overhead and is generally infeasible for real-time use, we propose a fast, accurate, and scalable estimation method that can achieve much better crawl prioritization in practice, especially in applications with limited hardware resources.
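
As a rough illustration of what a topology-based ranking on a domain-level graph can look like, the sketch below collapses page-level links into domain-level edges and orders domains by plain in-degree (the number of distinct other domains linking in). This is a generic baseline and not the BFS-based technique or the estimation method of the paper, whose details the abstract does not give. The function names `domain_of` and `rank_domains_by_in_degree` are hypothetical, and using the URL host as the "domain" is a simplifying assumption.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def domain_of(url):
    """Return the lowercase host of a URL (assumption: host stands in for domain)."""
    return (urlsplit(url).hostname or "").lower()


def rank_domains_by_in_degree(page_links):
    """Rank domains by how many distinct other domains link to them.

    page_links is an iterable of (source_url, destination_url) pairs.
    Returns a list of (domain, in_degree) sorted by descending in-degree.
    """
    in_neighbors = defaultdict(set)
    for src_url, dst_url in page_links:
        src, dst = domain_of(src_url), domain_of(dst_url)
        # Skip malformed URLs and links that stay inside one domain.
        if src and dst and src != dst:
            in_neighbors[dst].add(src)
    # Sort by in-degree (descending), then by name for a stable order.
    return sorted(
        ((domain, len(sources)) for domain, sources in in_neighbors.items()),
        key=lambda item: (-item[1], item[0]),
    )
```

A crawler could use such an ordering to decide which domains receive bandwidth first, for example by calling `rank_domains_by_in_degree(links)` on the links seen so far and visiting the top entries. Counting distinct linking domains rather than raw links keeps a single site with many auto-generated pages from inflating a target's score.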

Year: 2011
OAI identifier: oai:CiteSeerX.psu:10.1.1.185.4630
Provided by: CiteSeerX
Full text may be found at the following location(s):
  • http://citeseerx.ist.psu.edu/v... (external link)
  • http://irl.cs.tamu.edu/people/... (external link)

