    Identifying Search Engine Spam Using DNS

    Web crawlers encounter both finite and infinite elements during a crawl. Pages and hosts can be generated in unlimited numbers using automated scripts and DNS wildcard entries. Ranking such resources is a challenge, since an entire web of pages and hosts can be created to manipulate the rank of a target resource, so it is crucial to differentiate genuine content from spam in real time when allocating crawl budgets. In this study, ranking algorithms for hosts are designed that use the finite sets of Pay-Level Domains (PLDs) and IPv4 addresses. Heterogeneous graphs derived from the IRLbot web graph are used to achieve this. The first algorithm studied is PLD Supporters (PSUPP), which counts the number of level-2 PLD supporters of each host on the host-host-PLD graph. It is then improved by True PLD Supporters (TSUPP), which uses true egalitarian level-2 PLD supporters on the host-IP-PLD graph together with DNS blacklists. We found that computing TSUPP eliminates support from content farms and stolen links. When TSUPP was applied to the IRLbot host graph, there was less than 1% spam among the top 100,000 hosts.
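
    As a rough illustration of the supporter idea behind PSUPP, the sketch below counts, for each host, the distinct foreign PLDs found within two in-link hops. The dictionary inputs (in_links, host_pld) are illustrative names, and the sketch omits the host-host-PLD graph construction, egalitarian weighting, and blacklist filtering that TSUPP adds.

```python
def psupp(in_links, host_pld):
    """Count, per host, the distinct foreign PLDs whose hosts
    reach it within two in-link hops (a PSUPP-style score)."""
    scores = {}
    for target, level1 in in_links.items():
        # level-2 frontier: direct in-neighbors plus their in-neighbors
        frontier = set(level1)
        for h in level1:
            frontier |= in_links.get(h, set())
        frontier.discard(target)
        supporters = {host_pld[h] for h in frontier
                      if h in host_pld and host_pld[h] != host_pld.get(target)}
        scores[target] = len(supporters)
    return scores

# toy example: a.x.com is supported by two distinct PLDs
in_links = {"a.x.com": {"b.y.org", "c.z.net"}, "b.y.org": {"c.z.net"}, "c.z.net": set()}
host_pld = {"a.x.com": "x.com", "b.y.org": "y.org", "c.z.net": "z.net"}
print(psupp(in_links, host_pld))  # {'a.x.com': 2, 'b.y.org': 1, 'c.z.net': 0}
```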

    Efficient External-Memory Algorithms for Graph Mining

    The explosion of big data in areas such as the web and social networks has posed big challenges to research activities, including data mining, information retrieval, and security. This dissertation focuses on a particular area, graph mining, and specifically proposes several novel algorithms to solve the problems of triangle listing and computation of the neighborhood function in large-scale graphs. We first study the classic problem of triangle listing. We generalize the existing in-memory algorithms into a single framework of 18 triangle-search techniques. We then develop a novel external-memory approach, which we call Pruned Companion Files (PCF), that supports disk operation of all 18 algorithms. Compared to the state-of-the-art implementations MGT and PDTL, PCF runs 5-10 times faster and exhibits orders of magnitude less I/O. We next focus on the I/O complexity of triangle listing. Recent work by Pagh et al. provides an appealing theoretical I/O complexity for triangle listing via graph partitioning by random coloring of nodes. Since no implementation of Pagh is available and little is known about how it compares to PCF, we carefully implement Pagh, investigate the properties of both algorithms, model their I/O cost, understand their shortcomings, and shed light on the conditions under which each method outperforms the other. This insight leads us to develop a novel framework we call Trigon that surpasses the I/O performance of both techniques in all graphs and under all RAM conditions. We finally turn our attention to the neighborhood function. Exact computation of the neighborhood function is expensive in terms of CPU and I/O cost, and previous work mostly focuses on approximations. We show that our novel techniques developed for triangle listing can also be applied to this problem. We then study an application of the neighborhood function to ranking of Internet hosts. Our method computes the neighborhood function of each host as an indication of its reputation. The evaluation shows that our method is robust to ranking manipulation and brings less spam to its top ranking list than PageRank and TrustRank.
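
    For context on what triangle listing involves, here is a minimal in-memory sketch of the classic neighbor-intersection (forward-style) approach that the 18-algorithm framework generalizes. It is a textbook baseline, not PCF or Trigon, and the adjacency-dictionary representation is an assumption for illustration.

```python
def list_triangles(adj):
    """Yield each triangle exactly once: order vertices by degree
    (ties broken by id) and intersect forward adjacency sets per edge."""
    order = {v: (len(adj[v]), v) for v in adj}
    # forward adjacency: keep only higher-ordered neighbors
    fwd = {v: {u for u in adj[v] if order[u] > order[v]} for v in adj}
    for v in adj:
        for u in fwd[v]:
            for w in fwd[v] & fwd[u]:   # common higher-ordered neighbors
                yield (v, u, w)

adj = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
print(sorted(list_triangles(adj)))      # triangles {1,2,3} and {1,3,4}
```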

    Analysis, Modeling, and Algorithms for Scalable Web Crawling

    This dissertation presents a modeling framework for the intermediate data generated by external-memory sorting algorithms (e.g., merge sort, bucket sort, hash sort, replacement selection) that are well known yet lack accurate models of the volume of data they produce. The motivation comes from the IRLbot crawl experience in June 2007, where a collection of scalable, high-performance external sorting methods was used to handle such problems as URL uniqueness checking, real-time frontier ranking, budget allocation, and spam avoidance, all monumental tasks, especially when limited to the resources of a single machine. We discuss this crawl experience in detail, use novel algorithms to collect data from the crawl image, and then advance to a broader problem: sorting arbitrarily large-scale data using limited resources while accurately capturing the required cost (e.g., time and disk usage). To solve these problems, we present an accurate model of uniqueness probability, i.e., the probability of encountering previously unseen data, and use it to analyze the amount of intermediate data generated by the above-mentioned sorting methods. We also demonstrate how the intermediate data volume and runtime vary with the input properties (e.g., frequency distribution), the hardware configuration (e.g., main memory size, CPU and disk speed), and the choice of sorting method, and show that our proposed models accurately capture such variation. Furthermore, we propose a novel hash-based method for replacement selection sort, together with its model for duplicate data, where the existing literature is limited to random or mostly unique data. Note that classic replacement selection can increase the length of sorted runs and reduce their number, both of which directly benefit the merge step of external sorting; however, because its priority-queue-assisted sort operation is inherently slow, its application has been limited. Our hash-based design solves this problem by making the sort phase significantly faster than existing methods, making replacement selection a preferred choice. The presented models also enable exact analysis of the hit rates of Least-Recently-Used (LRU) and Random Replacement caches that are used as part of the algorithms presented here. These cache models are more accurate than those in the existing literature, which mostly assume an infinite stream of data, while our models also work accurately on finite streams (e.g., sampled web graphs, click streams). In addition, we present accurate models for various crawl characteristics of random graphs, which can forecast a number of aspects of the crawl experience based on graph properties (e.g., degree distribution). All these models are presented under a unified umbrella to analyze a set of large-scale information processing algorithms that are streamlined for high performance and scalability.
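
    To make the replacement-selection discussion concrete, the following sketch shows the classic priority-queue variant whose slow sort phase motivates the hash-based redesign. The record stream and the memory-capacity parameter are illustrative assumptions, not the dissertation's implementation.

```python
import heapq

def replacement_selection(records, capacity):
    """Split an input stream into sorted runs; for random input each
    run averages about twice the in-memory capacity."""
    it = iter(records)
    heap = [(0, x) for _, x in zip(range(capacity), it)]  # (run_id, key)
    heapq.heapify(heap)
    runs, current, run = [], 0, []
    while heap:
        run_id, key = heapq.heappop(heap)
        if run_id != current:            # heap has moved on to the next run
            runs.append(run)
            run, current = [], run_id
        run.append(key)
        nxt = next(it, None)
        if nxt is not None:
            # a key smaller than the last output cannot extend the
            # current run without breaking sortedness, so defer it
            heapq.heappush(heap, (current if nxt >= key else current + 1, nxt))
    runs.append(run)
    return runs

print(replacement_selection([5, 1, 9, 2, 8, 3, 7, 4, 6], capacity=3))
# [[1, 2, 5, 8, 9], [3, 4, 6, 7]] -- first run exceeds memory capacity
```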

    Agnostic Topology-Based Spam Avoidance in Large-Scale Web Crawls

    With the proliferation of web spam and questionable content with virtually infinite auto-generated structure, large-scale web crawlers now require low-complexity ranking methods to effectively budget their limited resources and allocate the majority of bandwidth to reputable sites. To shed light on Internet-wide spam avoidance, we study the domain-level graph from a 6.3B-page web crawl and compare several agnostic topology-based ranking algorithms on this dataset. We first propose a new methodology for comparing the various rankings and then show that in-degree BFS-based techniques decisively outperform classic PageRank-style methods. However, since BFS requires several orders of magnitude higher overhead and is generally infeasible for real-time use, we propose a fast, accurate, and scalable estimation method that can achieve much better crawl prioritization in practice, especially in applications with limited hardware resources.
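
    As a hedged illustration of the in-degree BFS idea, the sketch below scores a domain by the number of distinct domains that reach it within k in-link hops. The paper's actual ranking and its fast estimator are more involved; the names in_links and k are assumptions for illustration.

```python
from collections import deque

def bfs_supporters(in_links, node, k):
    """Count distinct nodes within k hops along reversed (in-link) edges."""
    seen, q = {node}, deque([(node, 0)])
    while q:
        v, d = q.popleft()
        if d == k:                      # do not expand beyond depth k
            continue
        for u in in_links.get(v, ()):
            if u not in seen:
                seen.add(u)
                q.append((u, d + 1))
    return len(seen) - 1                # exclude the node itself

in_links = {"a": {"b", "c"}, "b": {"d"}, "c": set(), "d": set()}
print(bfs_supporters(in_links, "a", k=2))  # 3: b and c directly, d via b
```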

    Localizing the media, locating ourselves: a critical comparative analysis of socio-spatial sorting in locative media platforms (Google and Flickr, 2009-2011)

    In this thesis I explore media geocoding (i.e., geotagging or georeferencing), the process of inscribing media with geographic information, a process that enables distinct forms of producing, storing, and distributing information based on location. Historically, geographic information technologies have served a biopolitical function, producing knowledge of populations. In their current guise as locative media platforms, these systems build rich databases of places, facilitated by user-generated geocoded media. These geo-indexes, this thesis argues, render places and the users of these services subject to novel forms of computational modelling and economic capture. Thus, the possibility of tying information, people, and objects to location sets the conditions for the emergence of new communicative practices as well as new forms of governmentality (the management of populations). This project attempts to develop an understanding of the socio-economic forces and media regimes structuring contemporary forms of location-aware communication by carrying out a comparative analysis of two of the main location-enabled platforms: Google and Flickr. Drawing on the medium-specific approach to media analysis characteristic of the subfield of Software Studies, together with the methodological apparatus of Cultural Analytics (data mining and visualization methods), the thesis examines how social space is coded and computed in these systems. In particular, it looks at the databases’ underlying ontologies supporting the platforms' geocoding capabilities and their respective algorithmic logics. In the final analysis, the thesis argues that the way social space is translated into POIs (Points of Interest) and business-biased categorizations, as well as the geodemographic ordering underpinning the way it is computed, is pivotal to understanding what kinds of socio-spatial relations are actualized in these systems and what modalities of governing urban mobility are enabled.

    Towards a crowdsourced solution for the authoring bottleneck in interactive narratives

    Interactive Storytelling research has produced a wealth of technologies that can be employed to create personalised narrative experiences, in which the audience takes a participating rather than an observing role. But so far this technology has not led to the production of large-scale playable interactive story experiences that realise the ambitions of the field. One main reason for this state of affairs is the difficulty of authoring interactive stories, a task that requires describing a huge number of story building blocks in a machine-friendly fashion. This is not only technically and conceptually more challenging than traditional narrative authoring but also a scalability problem. This thesis examines the authoring bottleneck through a case study and a literature survey and advocates a solution based on crowdsourcing. Prior work has already shown that combining a large number of example stories collected from crowd workers with a system that merges these contributions into a single interactive story can be an effective way to reduce the authorial burden. As a refinement of such an approach, this thesis introduces the novel concept of Crowd Task Adaptation. It argues that, in order to maximise the usefulness of the collected stories, a system should dynamically and intelligently analyse the corpus of collected stories and, based on this analysis, modify the tasks handed out to crowd workers. Two authoring systems, ENIGMA and CROSCAT, which embody two radically different approaches to the Crowd Task Adaptation paradigm, have been implemented and are described in this thesis. While ENIGMA adapts tasks through a real-time dialogue between crowd workers and the system, based on what has been learned from previously collected stories, CROSCAT modifies the backstory given to crowd workers in order to optimise the distribution of branching points in the tree structure that combines all collected stories. Two experimental studies of crowdsourced authoring are also presented. They lead to guidelines on how to employ crowdsourced authoring effectively; more importantly, the results of one of the studies demonstrate the effectiveness of the Crowd Task Adaptation approach.

    Bioinspired metaheuristic algorithms for global optimization

    This paper presents a concise comparative study of newly developed bioinspired algorithms for global optimization problems. Three different metaheuristic techniques, namely Accelerated Particle Swarm Optimization (APSO), the Firefly Algorithm (FA), and the Grey Wolf Optimizer (GWO), are investigated and implemented in the Matlab environment. These methods are compared on four unimodal and multimodal nonlinear functions in order to find global optimum values. Computational results indicate that GWO outperforms the other intelligent techniques and that all of the aforementioned algorithms can be successfully used for the optimization of continuous functions.
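
    For readers unfamiliar with GWO, here is a minimal Python sketch of the standard alpha/beta/delta position update on the sphere test function. The paper's experiments are in Matlab, and the pack size, iteration count, and bounds below are illustrative choices, not the paper's setup.

```python
import numpy as np

def gwo(f, dim, bounds, wolves=20, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    X = rng.uniform(lo, hi, size=(wolves, dim))
    for t in range(iters):
        fit = np.apply_along_axis(f, 1, X)
        # alpha, beta, delta = the three best wolves this iteration
        leaders = X[np.argsort(fit)[:3]]
        a = 2 - 2 * t / iters                  # a decreases linearly 2 -> 0
        X_new = np.zeros_like(X)
        for L in leaders:
            A = a * (2 * rng.random(X.shape) - 1)
            C = 2 * rng.random(X.shape)
            D = np.abs(C * L - X)              # distance to the leader
            X_new += L - A * D                 # pull toward the leader
        X = np.clip(X_new / 3, lo, hi)         # average of the three pulls
    fit = np.apply_along_axis(f, 1, X)
    best = np.argmin(fit)
    return X[best], fit[best]

sphere = lambda x: float(np.sum(x * x))        # unimodal test function
x, fx = gwo(sphere, dim=5, bounds=(-10, 10))
print(fx)                                      # should be near 0
```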

    Experimental Evaluation of Growing and Pruning Hyper Basis Function Neural Networks Trained with Extended Information Filter

    In this paper we test the Extended Information Filter (EIF) for sequential training of Hyper Basis Function neural networks with growing and pruning ability (HBF-GP). The HBF neuron allows different scaling of input dimensions, providing better generalization when dealing with complex nonlinear problems in engineering practice. The main intuition behind HBF is a generalization of the Gaussian type of neuron that applies a Mahalanobis-like distance as the metric between an input training sample and the prototype vector. We exploit the concept of a neuron’s significance and allow growing and pruning of HBF neurons during the sequential learning process. From an engineer’s perspective, EIF is attractive for training neural networks because it allows a designer to have scarce initial knowledge of the system/problem. An extensive experimental study shows that an HBF neural network trained with EIF achieves the same prediction error and compactness of network topology as one trained with EKF, but without the need to know the initial state uncertainty, which is its main advantage over EKF.
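
    As a small illustration of the Mahalanobis-like distance at the heart of the HBF neuron, the sketch below evaluates a Gaussian activation with per-dimension diagonal scaling. The EIF training loop and the growing/pruning logic are beyond this sketch, and all names and values are illustrative.

```python
import numpy as np

def hbf_activation(x, center, inv_scales):
    """phi(x) = exp(-(x - c)^T S (x - c)) with diagonal S = diag(inv_scales);
    a larger inv_scales[i] makes the neuron narrower along dimension i."""
    d = x - center
    return np.exp(-np.sum(inv_scales * d * d))

x = np.array([1.0, 2.0])
center = np.array([0.5, 2.5])
inv_scales = np.array([4.0, 0.25])   # dimension 0 weighted 16x more heavily
print(hbf_activation(x, center, inv_scales))  # ~0.345
```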