Identifying Search Engine Spam Using DNS
Web crawlers encounter both finite and infinite elements during a crawl. Pages and hosts can be generated in unbounded numbers using automated scripts and DNS wildcard entries. Ranking such resources is a challenge, since an entire web of pages and hosts can be created to manipulate the rank of a target resource. Differentiating genuine content from spam in real time is therefore crucial for allocating crawl budgets. In this study, ranking algorithms for hosts are designed that use the finite sets of Pay-Level Domains (PLDs) and IPv4 addresses. Heterogeneous graphs derived from the IRLbot webgraph are used to achieve this. The first algorithm studied is PLD Supporters (PSUPP), the number of level-2 PLD supporters of each host on the host-host-PLD graph. This is further improved by True PLD Supporters (TSUPP), which uses true egalitarian level-2 PLD supporters on the host-IP-PLD graph together with DNS blacklists. We found that computing TSUPP eliminates support from content farms and stolen links. When TSUPP was applied to the IRLbot host graph, the top 100,000 hosts contained less than 1% spam.
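To illustrate the supporter-counting idea (a minimal in-memory sketch only; the graph representation and function name are hypothetical, and the dissertation computes this on disk-resident heterogeneous graphs), level-2 PLD supporter counting might look like this:

```python
def pld_supporters(in_links, host_pld):
    """Count distinct level-2 PLD supporters of each host.

    in_links: host -> set of hosts that link to it
    host_pld: host -> its pay-level domain (PLD)
    """
    scores = {}
    for host in host_pld:
        supporters = set()
        for h1 in in_links.get(host, ()):        # level-1 in-neighbors
            supporters.add(host_pld[h1])
            for h2 in in_links.get(h1, ()):      # level-2 in-neighbors
                supporters.add(host_pld[h2])
        supporters.discard(host_pld[host])       # a host's own PLD never counts
        scores[host] = len(supporters)
    return scores
```

The key property is that the score is bounded by the finite number of PLDs, which makes it harder to inflate with auto-generated hosts.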
Efficient External-Memory Algorithms for Graph Mining
The explosion of big data in areas like the web and social networks has posed major challenges to research activities, including data mining, information retrieval, and security. This dissertation focuses on a particular area, graph mining, and specifically proposes several novel algorithms to solve the problems of triangle listing and computation of the neighborhood function in large-scale graphs.
We first study the classic problem of triangle listing. We generalize the existing in-memory algorithms into a single framework of 18 triangle-search techniques. We then develop a novel external-memory approach, which we call Pruned Companion Files (PCF), that supports disk operation of all 18 algorithms. Compared to the state-of-the-art available implementations MGT and PDTL, PCF runs 5-10 times faster and performs orders of magnitude less I/O.
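PCF itself is an external-memory design, but the in-memory core that triangle-search techniques build on can be illustrated with a standard neighbor-intersection lister (a sketch of the classic baseline, not the PCF implementation):

```python
from collections import defaultdict

def list_triangles(edges):
    """List each triangle of an undirected graph exactly once by orienting
    every edge from lower to higher (degree, id) rank and intersecting
    forward-neighbor sets."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    rank = {v: (len(adj[v]), v) for v in adj}
    fwd = {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}
    triangles = []
    for u in fwd:
        for v in fwd[u]:
            for w in fwd[u] & fwd[v]:    # common forward neighbors close a triangle
                triangles.append((u, v, w))
    return triangles
```

Orienting edges by degree is the standard trick that keeps forward-neighbor sets small on skewed graphs; the external-memory challenge is doing the same when `fwd` does not fit in RAM.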
We next focus on the I/O complexity of triangle listing. Recent work by Pagh et al. provides an appealing theoretical I/O complexity for triangle listing via graph partitioning by random coloring of nodes. Since no implementation of this algorithm is available and little is known about how it compares to PCF, we carefully implement it, investigate the properties of both algorithms, model their I/O cost, understand their shortcomings, and shed light on the conditions under which each method outperforms the other. This insight leads us to develop a novel framework we call Trigon that surpasses the I/O performance of both techniques on all graphs and under all RAM conditions.
We finally turn our attention to the neighborhood function. Exact computation of the neighborhood function is expensive in terms of CPU and I/O cost, so previous work has mostly focused on approximations. We show that the novel techniques we developed for triangle listing can also be applied to this problem. We next study an application of the neighborhood function to the ranking of Internet hosts. Our method computes the neighborhood function of each host as an indication of its reputation. The evaluation shows that our method is robust to ranking manipulation and admits less spam into its top ranking list than PageRank and TrustRank.
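For intuition, the exact neighborhood function of a single node, N(h) = number of nodes within h hops, can be computed with a plain truncated BFS (a sketch of the expensive baseline, not the dissertation's optimized method):

```python
from collections import deque

def neighborhood_function(adj, src, max_h):
    """Exact neighborhood function of one node: N(h) = #nodes within h hops,
    computed by breadth-first search truncated at max_h."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        if dist[u] == max_h:
            continue                     # do not expand beyond max_h hops
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    counts = [0] * (max_h + 1)
    for d in dist.values():
        counts[d] += 1
    result, total = [], 0                # cumulative counts: N(0)..N(max_h)
    for c in counts:
        total += c
        result.append(total)
    return result
```

Running this from every node costs a full BFS per node, which is why exact computation is considered expensive and approximations dominate prior work.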
Analysis, Modeling, and Algorithms for Scalable Web Crawling
This dissertation presents a modeling framework for the intermediate data generated
by external-memory sorting algorithms (e.g., merge sort, bucket sort, hash sort,
replacement selection), which are well known yet lack accurate models of produced
data volume. The motivation comes from the IRLbot crawl experience of June 2007,
where a collection of scalable, high-performance external sorting methods was
used to handle such problems as URL uniqueness checking, real-time frontier ranking,
budget allocation, and spam avoidance, all of them monumental tasks, especially when
limited to the resources of a single machine. We discuss this crawl experience in
detail, use novel algorithms to collect data from the crawl image, and then advance
to a broader problem: sorting arbitrarily large-scale data using limited resources
while accurately capturing the required cost (e.g., time and disk usage).
To solve these problems, we present an accurate model of uniqueness probability,
i.e., the probability of encountering previously unseen data, and use it to analyze
the amount of intermediate data generated by the above-mentioned sorting methods. We
also demonstrate how the intermediate data volume and runtime vary based on the
input properties (e.g., frequency distribution), hardware configuration (e.g., main
memory size, CPU and disk speed) and the choice of sorting method, and that our
proposed models accurately capture such variation.
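For intuition, under an i.i.d. model of the input (an assumption made for illustration; the dissertation's formulation may differ), the probability that the (n+1)-st key drawn from a discrete distribution is previously unseen is sum over keys of p_i(1 - p_i)^n:

```python
def uniqueness_probability(probs, n):
    """P[(n+1)-st draw is previously unseen] for i.i.d. draws from a
    discrete distribution `probs`: sum_i p_i * (1 - p_i)**n."""
    return sum(p * (1.0 - p) ** n for p in probs)
```

The quantity decays with n, which is why duplicate-heavy streams (e.g., URLs) produce far less intermediate data than mostly-unique ones.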
Furthermore, we propose a novel hash-based method for replacement selection
sort, along with its model in the case of duplicate data, where the existing
literature is limited to random or mostly-unique data. Note that classic
replacement selection increases the length of sorted runs and reduces their
number, both of which directly benefit the merge step of external sorting.
However, because its priority-queue-assisted sort operation is inherently slow,
the application of replacement selection has been limited. Our hash-based design
solves this problem by making the sort phase significantly faster than in
existing methods, making replacement selection a preferred choice.
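For reference, classic heap-based replacement selection can be sketched as follows (a simplified illustration of the classic algorithm, not the proposed hash-based design; assumes the stream holds at least `mem` keys):

```python
import heapq

def replacement_selection(stream, mem):
    """Classic replacement selection: emit sorted runs, typically longer
    than the memory budget `mem`."""
    it = iter(stream)
    heap = [next(it) for _ in range(mem)]
    heapq.heapify(heap)
    runs, run, deferred = [], [], []
    for x in it:
        smallest = heapq.heappop(heap)
        run.append(smallest)
        if x >= smallest:
            heapq.heappush(heap, x)   # still fits in the current run
        else:
            deferred.append(x)        # too small; wait for the next run
        if not heap:                  # current run complete; start the next
            runs.append(run)
            run, heap, deferred = [], deferred, []
            heapq.heapify(heap)
    run.extend(sorted(heap))          # flush the remaining keys
    runs.append(run)
    if deferred:
        runs.append(sorted(deferred))
    return runs
```

On random input the expected run length is about 2x the memory budget, which halves the number of runs the merge step must handle.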
The presented models also enable exact analysis of the hit rates of
Least-Recently-Used (LRU) and Random Replacement caches that are used as part
of the algorithms presented here. These cache models are more accurate than
those in the existing literature, which mostly assume an infinite stream of
data, whereas our models also work accurately on finite streams (e.g., sampled
web graphs, click streams). In addition, we present accurate models for various
crawl characteristics of random graphs, which can forecast a number of aspects
of the crawl experience based on graph properties (e.g., degree distribution).
All these models are presented under a unified umbrella to analyze a set of
large-scale information-processing algorithms that are streamlined for high
performance and scalability.
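As a concrete illustration of the quantity these cache models predict, the hit rate of an LRU cache on a finite reference stream can be measured by direct simulation (a sketch for intuition, not the analytical model itself):

```python
from collections import OrderedDict

def lru_hit_rate(stream, capacity):
    """Simulate an LRU cache over a finite reference stream; return hit rate."""
    cache = OrderedDict()
    hits = total = 0
    for key in stream:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)          # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict the least recently used
    return hits / total
```

An analytical model replaces such a simulation with a closed-form prediction from the key-frequency distribution, which matters when the stream is too large to replay.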
Agnostic Topology-Based Spam Avoidance in Large-Scale Web Crawls
With the proliferation of web spam and questionable content with virtually infinite auto-generated structure, large-scale web crawlers now require low-complexity ranking methods to effectively budget their limited resources and allocate the majority of bandwidth to reputable sites. To shed light on Internet-wide spam avoidance, we study the domain-level graph from a 6.3B-page web crawl and compare several agnostic topology-based ranking algorithms on this dataset. We first propose a new methodology for comparing the various rankings and then show that in-degree BFS-based techniques decisively outperform classic PageRank-style methods. However, since BFS requires several orders of magnitude higher overhead and is generally infeasible for real-time use, we propose a fast, accurate, and scalable estimation method that can achieve much better crawl prioritization in practice, especially in applications with limited hardware resources.
Localizing the media, locating ourselves: a critical comparative analysis of socio-spatial sorting in locative media platforms (Google AND Flickr 2009-2011)
In this thesis I explore media geocoding (i.e., geotagging or georeferencing),
the process of inscribing media with geographic information, a process that
enables distinct forms of producing, storing, and distributing information
based on location. Historically, geographic information technologies have
served a biopolitical function producing knowledge of populations. In their
current guise as locative media platforms, these systems build rich
databases of places facilitated by user-generated geocoded media. These
geoindexes render places, and users of these services, this thesis argues,
subject to novel forms of computational modelling and economic capture.
Thus, the possibility of tying information, people, and objects to location sets
the conditions for the emergence of new communicative practices as well as
new forms of governmentality (management of populations). This project is
an attempt to develop an understanding of the socio-economic forces and
media regimes structuring contemporary forms of location-aware
communication, by carrying out a comparative analysis of two of the main
current location-enabled platforms: Google and Flickr. Drawing from the
medium-specific approach to media analysis characteristic of the subfield of
Software Studies, together with the methodological apparatus of Cultural
Analytics (data mining and visualization methods), the thesis focuses on
examining how social space is coded and computed in these systems. In
particular, it looks at the databases’ underlying ontologies supporting the
platforms' geocoding capabilities and their respective algorithmic logics. In
the final analysis the thesis argues that the way social space is translated in
the form of POIs (Points of Interest) and business-biased categorizations, as
well as the geodemographical ordering underpinning the way it is computed,
are pivotal to understanding what kind of socio-spatial relations are
actualized in these systems and what modalities of governing urban mobility
are enabled.
Towards a crowdsourced solution for the authoring bottleneck in interactive narratives
Interactive Storytelling research has produced a wealth of technologies that can be
employed to create personalised narrative experiences, in which the audience takes
a participating rather than observing role. But so far this technology has not led
to the production of large scale playable interactive story experiences that realise
the ambitions of the field. One main reason for this state of affairs is the difficulty
of authoring interactive stories, a task that requires describing a huge amount of
story building blocks in a machine-friendly fashion. This is not only technically
and conceptually more challenging than traditional narrative authoring but also a
scalability problem.
This thesis examines the authoring bottleneck through a case study and a literature
survey and advocates a solution based on crowdsourcing. Prior work has already
shown that combining a large number of example stories collected from crowd workers
with a system that merges these contributions into a single interactive story can be
an effective way to reduce the authorial burden. As a refinement of such an approach,
this thesis introduces the novel concept of Crowd Task Adaptation. It argues that in
order to maximise the usefulness of the collected stories, a system should dynamically
and intelligently analyse the corpus of collected stories and based on this analysis
modify the tasks handed out to crowd workers.
Two authoring systems, ENIGMA and CROSCAT, which show two radically different
approaches of using the Crowd Task Adaptation paradigm have been implemented and
are described in this thesis. While ENIGMA adapts tasks through a real-time dialog
between crowd workers and the system that is based on what has been learned from
previously collected stories, CROSCAT modifies the backstory given to crowd workers
in order to optimise the distribution of branching points in the tree structure that
combines all collected stories. Two experimental studies of crowdsourced authoring
are also presented. They lead to guidelines on how to employ crowdsourced authoring
effectively, but more importantly the results of one of the studies demonstrate the
effectiveness of the Crowd Task Adaptation approach.
Bioinspired metaheuristic algorithms for global optimization
This paper presents a concise comparative study of newly developed bioinspired algorithms for global optimization problems. Three metaheuristic techniques, namely Accelerated Particle Swarm Optimization (APSO), the Firefly Algorithm (FA), and the Grey Wolf Optimizer (GWO), are investigated and implemented in the Matlab environment. These methods are compared on four unimodal and multimodal nonlinear functions in order to find global optimum values. Computational results indicate that GWO outperforms the other intelligent techniques, and that all of the aforementioned algorithms can be successfully used for the optimization of continuous functions.
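For illustration, a stripped-down version of the APSO update, global-best attraction plus decaying Gaussian randomization, can be written as follows (the parameter values are hypothetical, and the paper's Matlab implementations may differ):

```python
import random

def apso(f, dim, bounds, n=20, iters=200, alpha=0.2, beta=0.5, seed=1):
    """Simplified Accelerated PSO: each particle moves toward the global
    best g with a decaying Gaussian randomization term (no velocity)."""
    rng = random.Random(seed)
    lo, hi = bounds
    X = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n)]
    g = list(min(X, key=f))                # global best so far (copied)
    for t in range(iters):
        a = alpha * 0.97 ** t              # decaying randomization amplitude
        for x in X:
            for d in range(dim):
                x[d] = (1 - beta) * x[d] + beta * g[d] + a * rng.gauss(0, 1)
                x[d] = min(max(x[d], lo), hi)   # clamp to the search bounds
        best = min(X, key=f)
        if f(best) < f(g):
            g = list(best)
    return g, f(g)
```

On a simple unimodal test function such as the 2-D sphere function sum(x^2), the swarm contracts toward the best solution found as the randomization decays.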
Experimental Evaluation of Growing and Pruning Hyper Basis Function Neural Networks Trained with Extended Information Filter
In this paper we test the Extended Information Filter (EIF) for sequential training of Hyper Basis Function neural networks with growing and pruning ability (HBF-GP). The HBF neuron allows different scaling of input dimensions to provide a better generalization property when dealing with complex nonlinear problems in engineering practice. The main intuition behind HBF is a generalization of the Gaussian type of neuron that applies a Mahalanobis-like distance as the distance metric between an input training sample and the prototype vector. We exploit the concept of a neuron's significance and allow growing and pruning of HBF neurons during the sequential learning process. From an engineer's perspective, EIF is attractive for training neural networks because it allows a designer to have scarce initial knowledge of the system/problem. An extensive experimental study shows that an HBF neural network trained with EIF achieves the same prediction error and compactness of network topology as EKF, but without the need to know the initial state uncertainty, which is its main advantage over EKF.
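The Mahalanobis-like activation underlying the HBF neuron can be illustrated for the diagonal-metric case (a sketch; a full HBF may use a general positive-definite metric, and the function name here is hypothetical):

```python
import math

def hbf_activation(x, center, scales):
    """HBF neuron with a diagonal Mahalanobis-like metric:
    phi(x) = exp(-sum_d ((x_d - c_d) * s_d)**2),
    where s_d rescales input dimension d independently."""
    d2 = sum(((xd - cd) * sd) ** 2 for xd, cd, sd in zip(x, center, scales))
    return math.exp(-d2)
```

With all scales equal this reduces to the ordinary Gaussian radial basis function; per-dimension scales are what give HBF its extra flexibility on anisotropic data.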