28 research outputs found

    Zipf's Law for web surfers

    One of the main activities of Web users, known as 'surfing', is to follow links. Lengthy navigation often leads to disorientation when users lose track of the context in which they are navigating and are unsure how to proceed in terms of the goal of their original query. Studying the navigation patterns of Web users is thus important, since it can lead us to a better understanding of the problems users face when they are surfing. We derive Zipf's rank frequency law (i.e., an inverse power law) from an absorbing Markov chain model of surfers' behavior, assuming that less probable navigation trails are, on average, longer than more probable ones. In our model the probability of a trail is interpreted as the relevance (or 'value') of the trail. We apply our model to two scenarios: in the first, the probability of a user terminating the navigation session is independent of the number of links followed so far, and in the second, this probability increases by a constant each time the user follows a link. We analyze these scenarios using two experimental data sets, showing that, although the first scenario is only a rough approximation of surfers' behavior, the data is consistent with the second scenario and can thus provide an explanation of surfers' behavior.
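
    To make the two termination scenarios concrete, here is a minimal simulation sketch (not taken from the paper; the starting stop-probability p_stop_start and the increment delta are illustrative values). Setting delta to zero gives the first scenario and a positive delta gives the second; the resulting rank-frequency table of trail lengths can then be inspected for power-law behavior.

    import random
    from collections import Counter

    def simulate_trails(n_surfers=100_000, p_stop_start=0.3, delta=0.05, seed=0):
        """Simulate navigation trail lengths for the two termination scenarios.

        delta = 0 : the stop probability is independent of the links followed
                    so far (first scenario).
        delta > 0 : the stop probability grows by a constant after every
                    followed link (second scenario).
        All parameter values are illustrative, not estimates from the paper.
        """
        rng = random.Random(seed)
        lengths = []
        for _ in range(n_surfers):
            clicks, p_stop = 0, p_stop_start
            while rng.random() > p_stop:          # the surfer follows one more link
                clicks += 1
                p_stop = min(1.0, p_stop + delta)
            lengths.append(clicks)
        return Counter(lengths)

    if __name__ == "__main__":
        freq = simulate_trails()
        ranked = sorted(freq.items(), key=lambda kv: -kv[1])
        for rank, (length, count) in enumerate(ranked, start=1):
            print(f"rank {rank:2d}  trail length {length:2d}  frequency {count}")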

    Recoverable prevalence in growing scale-free networks and the effective immunization

    We study the persistent recoverable prevalence and the extinction of computer viruses spreading via e-mail on a growing scale-free network with new users, whose structure is estimated from real data. The typical phenomenon is simulated in a realistic model with probabilistic execution and detection of viruses. Moreover, the conditions for extinction under random and hub-targeted immunization are derived through bifurcation analysis of simpler models, using a mean-field approximation without connectivity correlations. We can qualitatively understand the mechanisms of the spread in linearly growing scale-free networks.
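
    As a rough illustration of the immunization comparison (a sketch only: the contact graph, the SIS-style update rule and every parameter below are assumptions, not the growing e-mail network or the bifurcation analysis of the paper), one can compare the prevalence after many rounds when hubs are immunized versus when the same number of random nodes are immunized.

    import random

    def preferential_graph(n=2000, m=2, seed=3):
        """Small preferential-attachment graph standing in for a scale-free
        contact network (illustrative; the paper estimates the real growing
        structure from data)."""
        rng = random.Random(seed)
        adj = {0: {1}, 1: {0}}
        stubs = [0, 1]                       # node ids repeated per degree
        for node in range(2, n):
            adj[node] = set()
            for target in rng.sample(stubs, m):
                adj[node].add(target)
                adj[target].add(node)
            stubs.extend(adj[node])
            stubs.extend([node] * len(adj[node]))
        return adj

    def sis_prevalence(adj, immunized, p_infect=0.05, p_recover=0.2,
                       steps=300, seed=4):
        """Toy SIS epidemic; returns the infected fraction after `steps` rounds."""
        rng = random.Random(seed)
        infected = {n for n in adj if n not in immunized and rng.random() < 0.05}
        for _ in range(steps):
            nxt = set()
            for node in infected:
                if rng.random() > p_recover:          # node stays infected
                    nxt.add(node)
                for nbr in adj[node]:                 # try to infect neighbours
                    if nbr not in immunized and rng.random() < p_infect:
                        nxt.add(nbr)
            infected = nxt
        return len(infected) / len(adj)

    if __name__ == "__main__":
        adj = preferential_graph()
        hubs = set(sorted(adj, key=lambda n: len(adj[n]), reverse=True)[:100])
        rand = set(random.Random(5).sample(sorted(adj), 100))
        print("hub-targeted immunization:", sis_prevalence(adj, hubs))
        print("random immunization      :", sis_prevalence(adj, rand))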

    A stochastic model for the evolution of the Web

    Recently several authors have proposed stochastic models of the growth of the Web graph that give rise to power-law distributions. These models are based on the notion of preferential attachment leading to the "rich get richer" phenomenon. However, these models fail to explain several distributions arising from empirical results, because the predicted exponent is not consistent with the data. To address this problem, we extend the evolutionary model of the Web graph by including a non-preferential component, and we view the stochastic process in terms of an urn transfer model. With this extension we can explain a wider variety of empirically discovered power-law distributions, provided the exponent is greater than two. These include: the distribution of incoming links, the distribution of outgoing links, the distribution of pages in a Web site, and the distribution of visitors to a Web site. A by-product of our results is a formal proof of the convergence of the standard stochastic model (first proposed by Simon).
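
    A minimal sketch of a mixed growth process of this kind (the mixing parameter p_preferential and the one-link-per-new-page rule are simplifying assumptions, not the urn transfer model itself): each new page links either preferentially, in proportion to in-degree, or uniformly at random, and the resulting in-degree distribution can be inspected for its power-law tail.

    import random
    from collections import Counter

    def grow_graph(n_pages=50_000, p_preferential=0.7, seed=1):
        """Grow a directed graph one page and one link at a time.

        With probability p_preferential the link target is chosen in proportion
        to its current in-degree ("rich get richer"); otherwise it is chosen
        uniformly at random among existing pages.  Values are illustrative only.
        """
        rng = random.Random(seed)
        in_degree = [0, 1]          # pages 0 and 1, with one link 0 -> 1
        targets = [1]               # multiset of past targets, for preferential draws
        for new_page in range(2, n_pages):
            if rng.random() < p_preferential:
                target = rng.choice(targets)        # proportional to in-degree
            else:
                target = rng.randrange(new_page)    # uniform over existing pages
            in_degree.append(0)
            in_degree[target] += 1
            targets.append(target)
        return Counter(in_degree)

    if __name__ == "__main__":
        degree_freq = grow_graph()
        for degree in sorted(degree_freq)[:10]:
            print(f"in-degree {degree:3d}: {degree_freq[degree]} pages")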

    Extraction and classification of dense communities in the Web

    The World Wide Web (WWW) is rapidly becoming important for society as a medium for sharing data, information and services, and there is a growing interest in tools for understanding collective behaviors and emerging phenomena in the WWW. In this paper we focus on the problem of searching for and classifying communities in the Web. Loosely speaking, a community is a group of pages related to a common interest. More formally, communities have been associated in the computer science literature with the existence of a locally dense subgraph of the web-graph (where web pages are nodes and hyperlinks are arcs of the web-graph). The core of our contribution is a new scalable algorithm for finding relatively dense subgraphs in massive graphs. We apply our algorithm to web-graphs built on three publicly available large crawls of the Web (with raw sizes up to 120M nodes and 1G arcs). The effectiveness of our algorithm in finding dense subgraphs is demonstrated experimentally by embedding artificial communities in the web-graph and counting how many of these are blindly found. Effectiveness increases with the size and density of the communities: it is close to 100% for dense communities of a hundred nodes or more, and it is still about 80% even for small communities of twenty nodes with 50% of the arcs present. We complete our Community Watch system by clustering the communities found in the web-graph into homogeneous groups by topic and labelling each group with representative keywords.
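
    For intuition about density-based community extraction, the sketch below uses the standard greedy "peeling" heuristic for dense-subgraph discovery on an undirected toy graph; it is not the scalable algorithm of the paper, only an illustration of the edges-per-node density objective that the notion of a locally dense subgraph refers to.

    from collections import defaultdict

    def greedy_densest_subgraph(edges):
        """Greedy peeling heuristic: repeatedly remove the minimum-degree node
        and remember the intermediate subgraph with the highest density |E|/|V|.
        A textbook approximation, shown only to illustrate density-based
        community extraction; not the algorithm behind Community Watch."""
        adj = defaultdict(set)
        for u, v in edges:
            adj[u].add(v)
            adj[v].add(u)
        nodes = set(adj)
        n_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
        best_density, best_nodes = 0.0, set(nodes)
        while nodes:
            density = n_edges / len(nodes)
            if density > best_density:
                best_density, best_nodes = density, set(nodes)
            u = min(nodes, key=lambda x: len(adj[x]))   # peel the min-degree node
            for v in adj[u]:
                adj[v].discard(u)
                n_edges -= 1
            nodes.remove(u)
            del adj[u]
        return best_nodes, best_density

    if __name__ == "__main__":
        # A 4-clique with a small tail attached; the clique should be recovered.
        toy_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
        print(greedy_densest_subgraph(toy_edges))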

    CT-FC: more Comprehensive Traversal Focused Crawler

    In today's world, people depend more and more on WWW information, including professionals who have to analyze data from their domain to maintain and improve their business. Such data analysis requires information that is comprehensive and relevant to the domain. A focused crawler, as a topic-based Web indexing agent, is used to meet this information need. In order to increase precision, focused crawlers face the problem of low recall. Studies of WWW hyperlink structure characteristics indicate that many Web documents are not strongly connected directly but only through co-citation and co-reference. A conventional focused crawler that uses a forward crawling strategy cannot visit documents with these characteristics. This study proposes a more comprehensive traversal framework. As a proof of concept, CT-FC (a focused crawler implementing the new traversal framework) was run on DMOZ data that is representative of WWW characteristics. The results show that this strategy can increase recall significantly.
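
    The traversal idea can be sketched as follows (a sketch under stated assumptions: out_links, in_links and is_relevant are hypothetical helpers standing in for page fetching, a backlink index and the topical classifier, and this is not the CT-FC implementation itself): the frontier is expanded through both out-links and in-links, so pages connected only by co-citation or co-reference can still be reached.

    from collections import deque

    def comprehensive_crawl(seeds, out_links, in_links, is_relevant, max_pages=1000):
        """Expand the frontier through both forward links and backlinks.

        out_links(url) and in_links(url) are assumed helpers: the first would
        come from fetching and parsing a page, the second from a link index or
        backlink service.  is_relevant(url) stands in for the focused crawler's
        topical classifier.  None of these names come from the CT-FC paper."""
        frontier = deque(seeds)
        visited = set(seeds)
        collected = []
        while frontier and len(collected) < max_pages:
            url = frontier.popleft()
            if is_relevant(url):
                collected.append(url)
            for neighbour in list(out_links(url)) + list(in_links(url)):
                if neighbour not in visited:
                    visited.add(neighbour)
                    frontier.append(neighbour)
        return collected

    if __name__ == "__main__":
        # Toy link graph: "a" and "c" both link to "b" (co-citation), so "c" is
        # reachable from seed "a" only by following a backlink of "b".
        fwd = {"a": ["b"], "b": [], "c": ["b"], "d": []}
        back = {"a": [], "b": ["a", "c"], "c": [], "d": []}
        print(comprehensive_crawl(["a"],
                                  lambda u: fwd.get(u, []),
                                  lambda u: back.get(u, []),
                                  lambda u: True))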

    Power-Efficient and Highly Scalable Parallel Graph Sampling using FPGAs

    Energy efficiency is a crucial problem in data centers, where big data is often represented by directed or undirected graphs. Analysis of such big data graphs is challenging due to the volume and velocity of the data as well as irregular memory access patterns. Graph sampling is one of the most effective ways to reduce the size of a graph while maintaining its crucial characteristics. In this paper we present the design and implementation of an FPGA-based graph sampling method which is both time- and energy-efficient. This is in contrast to existing parallel approaches, which include memory-distributed clusters, multicores and GPUs. Our strategy utilizes a novel graph data structure, which we call COPRA, that allows a time- and memory-efficient representation of graphs suitable for reconfigurable hardware such as FPGAs. Our experiments show that our proposed techniques are 2x faster and 3x more energy efficient than a serial CPU version of the algorithm. We further show that our proposed techniques achieve speedups comparable to GPU and multi-threaded CPU architectures while consuming 10x less energy than the GPU and 2x less than the CPU.
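
    For readers unfamiliar with graph sampling as such, the sketch below shows the simplest node-based variant in plain Python; it says nothing about the COPRA data structure or the FPGA design, and the keep fraction is an arbitrary illustrative choice.

    import random

    def induced_node_sample(edges, keep_fraction=0.1, seed=6):
        """Keep a random fraction of the nodes and every edge whose endpoints
        both survive (plain induced-subgraph sampling, for illustration only)."""
        rng = random.Random(seed)
        nodes = {n for edge in edges for n in edge}
        kept = {n for n in nodes if rng.random() < keep_fraction}
        return [(u, v) for u, v in edges if u in kept and v in kept]

    if __name__ == "__main__":
        toy_edges = [(i, (i * 7 + 3) % 100) for i in range(100)]   # toy directed graph
        sample = induced_node_sample(toy_edges, keep_fraction=0.3)
        print(f"kept {len(sample)} of {len(toy_edges)} edges")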

    Random Web Crawls

    This paper proposes a random Web crawl model. A Web crawl is a (biased and partial) image of the Web. This paper deals with the hyperlink structure, i.e. a Web crawl is a graph whose vertices are the pages and whose edges are the hypertext links. Of course a Web crawl has a very special structure; we recall some known results about it. We then propose a model generating similar structures. Our model simply simulates a crawl, i.e. it builds and crawls the graph at the same time. The generated graphs have many of the known properties of Web crawls. Our model is simpler than most random Web graph models, but captures the same properties. Notice that it models the crawling process instead of the page-writing process of Web graph models.
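
    One possible reading of "build and crawl the graph at the same time", as a sketch only (the probability p_new of discovering a brand-new page, the fixed out-degree and the uniform choice among already-discovered pages are all assumptions, not the rules of the paper's model):

    import random

    def random_crawl(n_steps=10_000, p_new=0.4, out_degree=5, seed=2):
        """Build and crawl a graph simultaneously.

        Each crawled page emits out_degree links; with probability p_new a link
        points to a never-seen page (which joins the frontier), otherwise to a
        uniformly chosen already-discovered page."""
        rng = random.Random(seed)
        edges = []
        discovered = [0]       # page ids seen so far
        frontier = [0]         # discovered but not yet crawled
        next_id = 1
        for _ in range(n_steps):
            if not frontier:
                break
            page = frontier.pop(rng.randrange(len(frontier)))
            for _ in range(out_degree):
                if rng.random() < p_new:
                    target, next_id = next_id, next_id + 1
                    discovered.append(target)
                    frontier.append(target)
                else:
                    target = rng.choice(discovered)
                edges.append((page, target))
        return edges

    if __name__ == "__main__":
        edges = random_crawl()
        n_pages = max(max(e) for e in edges) + 1
        print(f"{len(edges)} links over {n_pages} discovered pages")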