43 research outputs found

    Local Ranking Problem on the BrowseGraph

    Full text link
    The "Local Ranking Problem" (LRP) is related to the computation of a centrality-like rank on a local graph, where the scores of the nodes could significantly differ from the ones computed on the global graph. Previous work has studied LRP on the hyperlink graph but never on the BrowseGraph, namely a graph where nodes are webpages and edges are browsing transitions. Recently, this graph has received more and more attention in many different tasks such as ranking, prediction and recommendation. However, a web-server has only the browsing traffic performed on its pages (local BrowseGraph) and, as a consequence, the local computation can lead to estimation errors, which hinders the increasing number of applications in the state of the art. Also, although the divergence between the local and global ranks has been measured, the possibility of estimating such divergence using only local knowledge has been mainly overlooked. These aspects are of great interest for online service providers who want to: (i) gauge their ability to correctly assess the importance of their resources only based on their local knowledge, and (ii) take into account real user browsing fluxes that better capture the actual user interest than the static hyperlink network. We study the LRP problem on a BrowseGraph from a large news provider, considering as subgraphs the aggregations of browsing traces of users coming from different domains. We show that the distance between rankings can be accurately predicted based only on structural information of the local graph, being able to achieve an average rank correlation as high as 0.8

    Lumbricus webis: a parallel and distributed crawling architecture for the Italian web

    Get PDF
    Web crawlers have become popular tools for gathering large portions of the web that can be used for many tasks, from statistics to structural analysis of the web. Due to the amount of data and the heterogeneity of tasks to manage, it is essential for crawlers to have a modular and distributed architecture. In this paper we describe Lumbricus webis (L.webis for short), a modular crawling infrastructure built to mine data from the ccTLD .it web domain and the portions of the web reachable from this domain. Its purpose is to support the gathering of advanced statistics and advanced analytic tools on the content of the Italian Web. This paper describes the architectural features of L.webis and its performance. L.webis can currently download a mid-sized ccTLD such as ".it" in about one week.
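    The paper describes the architecture only at a high level; as a rough, single-process illustration of the crawl policy it mentions (staying within the .it ccTLD and the pages reachable from it), here is a hedged sketch using only the Python standard library. The seed URL, page limit and link extraction are assumptions for illustration, not L.webis internals.

```python
# Minimal sketch of a ccTLD-restricted crawl frontier (single process;
# L.webis itself is modular and distributed).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect href targets of <a> tags from an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def in_it_cctld(url):
    host = urlparse(url).hostname or ""
    return host.endswith(".it")

def crawl(seeds, max_pages=50):
    """Breadth-first crawl restricted to .it hosts; returns discovered URLs."""
    frontier, seen, fetched = deque(seeds), set(seeds), 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue  # unreachable or unreadable page: skip and move on
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if in_it_cctld(absolute) and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return seen

# Example (hypothetical seed): crawl(["https://www.governo.it/"])
```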

    Optimizing a Web Search Engine

    Get PDF
    Search engine queries often have duplicate words in the search string; for example, a user searching for “pizza pizza”, a popular brand name for a Canadian pizzeria chain. An efficient search engine must return the most relevant results for such queries. Search queries also contain pairs of words which always occur together in the same sequence, for example “honda accord”, “hopton wafers”, “hp newwave”, etc. We will hereafter refer to such pairs of words as bigrams. A bigram can be treated as a single word to increase the speed and relevance of results returned by a search engine that is based on an inverted index. Terms in a user query have different degrees of importance based on whether they occur inside the title, description or anchor text of the document. Therefore, an optimal weighting scheme for these components is required for search engines to prioritize relevant documents near the top for user searches. The goal of my project is to improve Yioop, an open source search engine created by Dr Chris Pollett, to support search for duplicate terms and bigrams in a search query. I will also optimize the Yioop search engine by improving its document grouping and BM25F weighting scheme. This would allow Yioop to return more relevant results quickly and efficiently for users of the search engine.
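    Since the project centres on BM25F weighting of title, description and anchor text, a hedged sketch of field-weighted BM25F scoring may help make the idea concrete. The field weights and parameters below are illustrative assumptions, not Yioop's actual configuration, and the helper names are hypothetical.

```python
# Sketch of BM25F-style scoring: each field contributes to a single
# field-weighted pseudo term frequency before BM25 saturation is applied.
import math

FIELD_WEIGHTS = {"title": 3.0, "anchor": 2.0, "description": 1.0}   # assumed
B = {"title": 0.75, "anchor": 0.75, "description": 0.75}            # assumed
K1 = 1.2                                                             # assumed

def bm25f_score(query_terms, doc, corpus_stats):
    """doc: {field: list of terms}; corpus_stats: doc frequencies and average field lengths."""
    score = 0.0
    for term in query_terms:
        # Field-weighted, length-normalised pseudo term frequency.
        pseudo_tf = 0.0
        for field, terms in doc.items():
            tf = terms.count(term)
            if tf == 0:
                continue
            avg_len = corpus_stats["avg_len"].get(field, len(terms))
            b = B.get(field, 0.75)
            norm = 1.0 - b + b * (len(terms) / avg_len)
            pseudo_tf += FIELD_WEIGHTS.get(field, 1.0) * tf / norm
        if pseudo_tf == 0.0:
            continue
        df = corpus_stats["doc_freq"].get(term, 1)
        idf = math.log(1.0 + (corpus_stats["n_docs"] - df + 0.5) / (df + 0.5))
        score += idf * pseudo_tf / (K1 + pseudo_tf)
    return score

# Example (toy numbers):
# stats = {"n_docs": 1000, "doc_freq": {"honda": 40, "accord": 55},
#          "avg_len": {"title": 6, "anchor": 4, "description": 40}}
# doc = {"title": ["honda", "accord", "review"], "anchor": ["honda"],
#        "description": ["full", "review", "of", "the", "honda", "accord"]}
# bm25f_score(["honda", "accord"], doc, stats)
```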

    Evolución y tendencias actuales de los Web crawlers

    Get PDF
    The information available in data networks such as the Web and social networks is growing continuously and has particularly dynamic characteristics. The mechanisms responsible for tracking changes in this information, Web crawlers, must therefore be studied regularly, and their algorithms reviewed and improved in search of greater efficiency. This document presents the current state of Web crawling algorithms, their trends and developments, and the new approaches emerging to handle challenges such as social networks.

    A Framework to Evaluate Information Quality in Public Administration Website

    Get PDF
    This paper presents a framework aimed at assessing the capacity of Public Administration (PA) bodies to offer good-quality information and services on their web portals. Our framework is based on the extraction of “.it” domain names registered by Italian public institutions and the subsequent analysis of their websites. The analysis involves automatically gathering the web pages of PA portals by means of web crawling and assessing the quality of their online information services. This assessment is carried out by verifying their compliance with current legislation on the basis of the criteria established in government guidelines [1]. This approach provides an ongoing monitoring process for PA websites that can contribute to the improvement of their overall quality. Moreover, our approach may also be of benefit to local governments in other countries. Available at: https://aisel.aisnet.org/pajais/vol5/iss3/3
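    The guideline criteria themselves are not listed in the abstract, so the following is only an illustrative sketch of how crawled PA pages could be checked automatically against a couple of assumed criteria (a transparency section and certified e-mail contact details); it is not the framework's actual rule set.

```python
# Sketch: page-level checks aggregated into a site-level compliance report.
# The criteria and patterns below are assumed examples, not the paper's.
import re

REQUIRED_PATTERNS = {
    "transparency_section": re.compile(r"amministrazione\s+trasparente", re.I),
    "certified_email": re.compile(r"\bPEC\b|posta elettronica certificata", re.I),
}

def check_page(html):
    """Return a {criterion: passed} report for one crawled PA page."""
    return {name: bool(p.search(html)) for name, p in REQUIRED_PATTERNS.items()}

def site_report(pages):
    """A criterion passes for the site if any crawled page satisfies it."""
    report = {name: False for name in REQUIRED_PATTERNS}
    for html in pages:
        for name, passed in check_page(html).items():
            report[name] = report[name] or passed
    return report
```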

    Mind Economy: Dynamic Graph Analysis of Communications

    Get PDF
    Social networks are growing in reach and impact, but little is known about their structure, dynamics, or users’ behaviors. New techniques and approaches are needed to study and understand why these networks attract users’ persistent attention, and how the networks evolve. This thesis investigates questions that arise when modeling human behavior in social networks, and its main contributions are:
    • an infrastructure and methodology for understanding communication on graphs;
    • identification and exploration of sub-communities;
    • metrics for identifying effective communicators in dynamic graphs;
    • a new definition of dynamic, reciprocal social capital and its iterative computation;
    • a methodology to study influence in social networks in detail, using:
      • a class hierarchy established by social capital,
      • simulations mixed with reality across time and capital classes,
      • various attachment strategies, e.g. via friends-of-friends or full utility optimization,
      • a framework for answering questions such as “are these influentials accidental?”;
    • discovery of the “middle class” of social networks, which, as shown with our new metrics and simulations, is the real influential in many processes.
    Our methods have already led to the discovery of “mind economies” within Twitter, where interactions are designed to increase ratings as well as to promote topics of interest and whole subgroups. Reciprocal social capital metrics identify the “middle class” of Twitter, which does most of the “long-term” talking and carries the bulk of the system-sustaining conversations. We show that this middle class wields most of the actual influence we should care about; these are not “accidental influentials.” Our approach is of interest to computer scientists, social scientists, economists, marketers, recruiters, and social media builders who want to find and present new ways of exploring, browsing, analyzing, and sustaining online social networks.
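    The abstract does not give the definition of reciprocal social capital, so the following is only a hedged illustration of the general idea of an iterative, PageRank-style score over a directed interaction graph in which reciprocated interactions count for more; the weighting, damping factor and function names are assumptions, not the thesis's actual metric.

```python
# Illustrative sketch: iterative capital score where reciprocated edges
# contribute more than one-way interactions. Not the thesis's definition.
def reciprocal_capital(edges, damping=0.85, iterations=50, reciprocity_bonus=2.0):
    """edges: dict {(u, v): count of interactions from u to v}."""
    nodes = {u for u, _ in edges} | {v for _, v in edges}
    # Weight each edge higher if the reverse edge also exists (reciprocity).
    weight = {
        (u, v): c * (reciprocity_bonus if (v, u) in edges else 1.0)
        for (u, v), c in edges.items()
    }
    out_sum = {n: 0.0 for n in nodes}
    for (u, _), w in weight.items():
        out_sum[u] += w
    capital = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        incoming = {n: 0.0 for n in nodes}
        for (u, v), w in weight.items():
            if out_sum[u] > 0:
                incoming[v] += capital[u] * w / out_sum[u]
        capital = {n: (1 - damping) / len(nodes) + damping * incoming[n]
                   for n in nodes}
    return capital

# Example: a reciprocated pair accrues more capital than one-way mentions.
# reciprocal_capital({("a", "b"): 3, ("b", "a"): 2, ("c", "a"): 5})
```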

    Effective web crawlers

    Get PDF
    Web crawlers are the component of a search engine that must traverse the Web, gathering documents into a local repository for indexing so that they can be ranked by their relevance to user queries. Whenever data is replicated in an autonomously updated environment, there are issues with maintaining up-to-date copies of documents. When documents are retrieved by a crawler and have subsequently been altered on the Web, the effect is an inconsistency in user search results. While the impact depends on the type and volume of change, many existing algorithms do not take the degree of change into consideration, instead using simple measures that consider any change as significant. Furthermore, many crawler evaluation metrics do not consider index freshness or the amount of impact that crawling algorithms have on user results. Most of the existing work makes assumptions about the change rate of documents on the Web, or relies on the availability of a long history of change. Our work investigates approaches to improving index consistency: detecting meaningful change, measuring the impact of a crawl on collection freshness from a user perspective, developing a framework for evaluating crawler performance, determining the effectiveness of stateless crawl ordering schemes, and proposing and evaluating the effectiveness of a dynamic crawl approach. Our work is concerned specifically with cases where there are few or no past change statistics with which predictions can be made. Our work analyses different measures of change and introduces a novel approach to measuring the impact of recrawl schemes on search engine users. Our schemes detect important changes that affect user results; other well-known and widely used schemes have to retrieve around twice the data to achieve the same effectiveness as our schemes. Furthermore, while many studies have assumed that the Web changes according to a model, our experimental results are based on real web documents. We analyse various stateless crawl ordering schemes that have no past change statistics with which to predict which documents will change, none of which, to our knowledge, has been tested to determine its effectiveness in crawling changed documents. We empirically show that the effectiveness of these schemes depends on the topology and dynamics of the domain crawled and that no one static crawl ordering scheme can effectively maintain freshness, motivating our work on dynamic approaches. We present our novel approach to maintaining freshness, which uses the anchor text linking documents to determine the likelihood of a document changing, based on statistics gathered during the current crawl. We show that this scheme is highly effective when combined with existing stateless schemes. When we combine our scheme with PageRank, our approach allows the crawler to improve both the freshness and the quality of a collection. Our scheme improves freshness regardless of which stateless scheme it is used in conjunction with, since it uses both positive and negative reinforcement to determine which document to retrieve. Finally, we present the design and implementation of Lara, our own distributed crawler, which we used to develop our testbed.
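    As a rough illustration of the idea described above, and not the thesis's actual algorithm, the sketch below orders candidate downloads by combining a quality signal such as PageRank with a change-likelihood estimate derived from anchor text seen during the current crawl; the scoring weights and the anchor-text heuristic are assumptions.

```python
# Sketch: priority-based crawl ordering mixing quality and change likelihood.
import heapq

def change_likelihood(anchor_terms, volatile_terms=("news", "latest", "today")):
    """Crude stand-in: anchors with time-sensitive words suggest frequent change."""
    hits = sum(1 for t in anchor_terms if t.lower() in volatile_terms)
    return min(1.0, hits / max(1, len(anchor_terms)) * 3)

def order_crawl(candidates, w_quality=0.5, w_change=0.5):
    """candidates: list of (url, pagerank, anchor_terms); returns URLs by priority."""
    heap = []
    for url, pagerank, anchors in candidates:
        score = w_quality * pagerank + w_change * change_likelihood(anchors)
        heapq.heappush(heap, (-score, url))   # max-priority via negated score
    return [heapq.heappop(heap)[1] for _ in range(len(heap))]

# Example:
# order_crawl([("a.html", 0.4, ["latest", "news"]), ("b.html", 0.9, ["contact"])])
```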