31 research outputs found

    Look back, look around: A systematic analysis of effective predictors for new outlinks in focused Web crawling

    Small and medium enterprises rely on detailed Web analytics to be informed about their market and competition. Focused crawlers meet this demand by crawling and indexing specific parts of the Web. Critically, a focused crawler must quickly find new pages that have not yet been indexed. Since a new page can be discovered only by following a new outlink, predicting new outlinks is very relevant in practice. In the literature, many feature designs have been proposed for predicting changes in the Web. In this work we provide a structured analysis of this problem, using new outlinks as our running prediction target. Specifically, we unify earlier feature designs in a taxonomic arrangement of features along two dimensions: static versus dynamic features, and features of a page versus features of the network around it. Within this taxonomy, complemented by our new (mainly dynamic network) features, we identify the best predictors for new outlinks. Our main conclusion is that the most informative features are the recent history of new outlinks on a page itself, and of its content-related pages. Hence, we propose a new 'look back, look around' (LBLA) model that uses only these features. With the obtained predictions, we design a number of scoring functions to guide a focused crawler to pages with the most new outlinks, and compare their performance. The LBLA approach proved extremely effective, outperforming other models, including those that use the most complete set of features. One of the learners we use is the recent NGBoost method, which assumes a Poisson distribution for the number of new outlinks on a page and learns its parameters. This connects two so far unrelated avenues in the literature: predictions based on features of a page, and those based on probabilistic modelling.
All experiments were carried out on an original dataset, made available by a commercial focused crawler.
Comment: 23 pages, 15 figures, 4 tables, uses arxiv.sty, added new title, heuristic features and their results added, figures 7, 14, and 15 updated, accepted version
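
The Poisson assumption mentioned in this abstract can be made concrete with a small sketch. The abstract does not specify the scoring functions themselves, so the two below (expected count and probability of at least one new outlink) are illustrative assumptions, as are the function names and the example URLs and rates.

```python
import math

def expected_new_outlinks(rate: float) -> float:
    """Expected number of new outlinks under Poisson(rate); this is the rate itself."""
    return rate

def prob_new_outlink(rate: float) -> float:
    """P(at least one new outlink) under Poisson(rate): 1 - P(zero new outlinks)."""
    return 1.0 - math.exp(-rate)

def rank_pages(predicted_rates, score=prob_new_outlink):
    """Order page URLs by a scoring function of their predicted Poisson rate, best first."""
    return sorted(predicted_rates, key=lambda url: score(predicted_rates[url]), reverse=True)

# Hypothetical per-page rates, standing in for the output of a learner.
rates = {"a.example/news": 3.2, "b.example/about": 0.05, "c.example/blog": 0.9}
order = rank_pages(rates)
```

Both scores increase monotonically with the rate, so on their own they give the same ordering. They would differ only when combined with other quantities, such as a per-page crawl cost.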

    Computationally determining the salience of decision points for real-time wayfinding support

    This study introduces the concept of computational salience to explain the discriminatory efficacy of decision points, which in turn may have applications in providing real-time assistance to users of navigational aids. This research compared algorithms for calculating the computational salience of decision points and validated the results via three methods: high-salience decision points were used to classify wayfinders; salience scores were used to weight a conditional probabilistic scoring function for real-time wayfinder performance classification; and salience scores were correlated with wayfinding-performance metrics. As an exploratory step towards linking computational and cognitive salience, a photograph-recognition experiment was conducted. Results reveal a distinction between algorithms useful for determining computational and cognitive salience. For computational salience, information about the structural integration of decision points is effective, while information about the probability of decision-point traversal shows promise for determining cognitive salience. Limitations of using only structural information, and motivations for future work that includes non-structural information, are discussed.

    Evaluation of information retrieval systems using structural equation modeling

    The interpretation of the experimental data collected by testing systems across input datasets and model parameters is of strategic importance for system design and implementation. In particular, finding relationships between variables and detecting the latent variables affecting retrieval performance can provide designers, engineers and experimenters with useful, if not necessary, information about how a system is performing. This paper discusses the use of Structural Equation Modeling (SEM) in providing an in-depth explanation of evaluation results and of the failures and successes of a system; in particular, we focus on the evaluation of Information Retrieval systems.

    Effective web crawlers

    Web crawlers are the component of a search engine that must traverse the Web, gathering documents in a local repository for indexing so that they can be ranked by their relevance to user queries. Whenever data is replicated in an autonomously updated environment, there are issues with maintaining up-to-date copies of documents. When documents are retrieved by a crawler and have subsequently been altered on the Web, the effect is an inconsistency in user search results. While the impact depends on the type and volume of change, many existing algorithms do not take the degree of change into consideration, instead using simple measures that consider any change as significant. Furthermore, many crawler evaluation metrics do not consider index freshness or the amount of impact that crawling algorithms have on user results. Most of the existing work makes assumptions about the change rate of documents on the Web, or relies on the availability of a long history of change. Our work investigates approaches to improving index consistency: detecting meaningful change, measuring the impact of a crawl on collection freshness from a user perspective, developing a framework for evaluating crawler performance, determining the effectiveness of stateless crawl ordering schemes, and proposing and evaluating the effectiveness of a dynamic crawl approach. Our work is concerned specifically with cases where there are few or no past change statistics with which predictions can be made. Our work analyses different measures of change and introduces a novel approach to measuring the impact of recrawl schemes on search engine users. Our schemes detect important changes that affect user results. Other well-known and widely used schemes have to retrieve around twice the data to achieve the same effectiveness as our schemes. Furthermore, while many studies have assumed that the Web changes according to a model, our experimental results are based on real web documents.
We analyse various stateless crawl ordering schemes that have no past change statistics with which to predict which documents will change, none of which, to our knowledge, has been tested to determine effectiveness in crawling changed documents. We empirically show that the effectiveness of these schemes depends on the topology and dynamics of the domain crawled and that no one static crawl ordering scheme can effectively maintain freshness, motivating our work on dynamic approaches. We present our novel approach to maintaining freshness, which uses the anchor text linking documents to determine the likelihood of a document changing, based on statistics gathered during the current crawl. We show that this scheme is highly effective when combined with existing stateless schemes. When we combine our scheme with PageRank, our approach allows the crawler to improve both freshness and quality of a collection. Our scheme improves freshness regardless of which stateless scheme it is used in conjunction with, since it uses both positive and negative reinforcement to determine which document to retrieve. Finally, we present the design and implementation of Lara, our own distributed crawler, which we used to develop our testbed.

    Web Archive Services Framework for Tighter Integration Between the Past and Present Web

    Web archives have contained the cultural history of the web for many years, but they still have a limited capability for access. Most of the web archiving research has focused on crawling and preservation activities, with little focus on the delivery methods. The current access methods are tightly coupled with web archive infrastructure, hard to replicate or integrate with other web archives, and do not cover all the users' needs. In this dissertation, we focus on the access methods for archived web data to enable users, third-party developers, researchers, and others to gain knowledge from the web archives. We build ArcSys, a new service framework that extracts, preserves, and exposes APIs for the web archive corpus. The dissertation introduces a novel categorization technique to divide the archived corpus into four levels. For each level, we propose suitable services and APIs that enable both users and third-party developers to build new interfaces. The first level is the content level, which extracts the content from the archived web data. We develop ArcContent to expose the web archive content processed through various filters. The second level is the metadata level; we extract the metadata from the archived web data and make it available to users. We implement two services: ArcLink for the temporal web graph and ArcThumb for optimizing thumbnail creation in the web archives. The third level is the URI level, which focuses on using the URI HTTP redirection status to enhance the user query. Finally, the highest level in the web archiving service framework pyramid is the archive level. At this level, we define the web archive by the characteristics of its corpus and build Web Archive Profiles. The profiles are used by the Memento Aggregator for query optimization.

    Learning to select for information retrieval

    The effective ranking of documents in search engines is based on various document features, such as the frequency of the query terms in each document, the length, or the authoritativeness of each document. In order to obtain a better retrieval performance, instead of using a single or a few features, there is a growing trend to create a ranking function by applying a learning to rank technique on a large set of features. Learning to rank techniques aim to generate an effective document ranking function by combining a large number of document features. Different ranking functions can be generated by using different learning to rank techniques or on different document feature sets. While the generated ranking function may be uniformly applied to all queries, several studies have shown that different ranking functions favour different queries, and that the retrieval performance can be significantly enhanced if an appropriate ranking function is selected for each individual query. This thesis proposes Learning to Select (LTS), a novel framework that selectively applies an appropriate ranking function on a per-query basis, regardless of the given query's type and the number of candidate ranking functions. In the learning to select framework, the effectiveness of a ranking function for an unseen query is estimated from the available neighbouring training queries. The proposed framework employs a classification technique (e.g. k-nearest neighbour) to identify neighbouring training queries for an unseen query by using a query feature. In particular, a divergence measure (e.g. Jensen-Shannon), which determines the extent to which a document ranking function alters the scores of an initial ranking of documents for a given query, is proposed for use as a query feature. The ranking function which performs the best on the identified training query set is then chosen for the unseen query. 
The proposed framework is thoroughly evaluated on two different TREC retrieval tasks (namely, Web search and adhoc search tasks) and on two large standard LETOR feature sets, which contain as many as 64 document features, deriving conclusions concerning the key components of LTS, namely the query feature and the identification of neighbouring queries components. Two different types of experiments are conducted. The first one is to select an appropriate ranking function from a number of candidate ranking functions. The second one is to select multiple appropriate document features from a number of candidate document features, for building a ranking function. Experimental results show that our proposed LTS framework is effective in both selecting an appropriate ranking function and selecting multiple appropriate document features, on a per-query basis. In addition, the retrieval performance is further enhanced when increasing the number of candidates, suggesting the robustness of the learning to select framework. This thesis also demonstrates how the LTS framework can be deployed to other search applications. These applications include the selective integration of a query independent feature into a document weighting scheme (e.g. BM25), the selective estimation of the relative importance of different query aspects in a search diversification task (the goal of the task is to retrieve a ranked list of documents that provides a maximum coverage for a given query, while avoiding excessive redundancy), and the selective application of an appropriate resource for expanding and enriching a given query for document search within an enterprise. The effectiveness of the LTS framework is observed across these search applications, and on different collections, including a large scale Web collection that contains over 50 million documents. This suggests the generality of the proposed learning to select framework. 
The main contributions of this thesis are the introduction of the LTS framework and the proposed use of divergence measures as query features for identifying similar queries. In addition, this thesis draws insights from a large set of experiments, involving four different standard collections, four different search tasks and large document feature sets. This illustrates the effectiveness, robustness and generality of the LTS framework in tackling various retrieval applications.
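
The selection step described in this abstract can be sketched roughly as follows. This is a simplified illustration, not the thesis's implementation: it assumes one scalar query feature (the Jensen-Shannon divergence between the normalized scores of an initial ranking and those of a candidate ranking function) and a plain k-nearest-neighbour lookup. All names and numbers are hypothetical.

```python
import math

def normalize(scores):
    """Turn non-negative document scores into a probability distribution."""
    total = sum(scores)
    return [s / total for s in scores]

def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def select_ranker(query_feature, training, k=3):
    """Pick the ranking function with the best mean effectiveness on the
    k training queries whose feature value is closest to query_feature.

    training: list of (feature_value, {ranker_name: effectiveness}) pairs.
    """
    neighbours = sorted(training, key=lambda t: abs(t[0] - query_feature))[:k]
    rankers = neighbours[0][1].keys()
    return max(rankers, key=lambda r: sum(perf[r] for _, perf in neighbours))

# Hypothetical training queries: feature value and per-ranker effectiveness.
training = [
    (0.02, {"A": 0.40, "B": 0.30}),
    (0.03, {"A": 0.45, "B": 0.35}),
    (0.50, {"A": 0.20, "B": 0.60}),
    (0.55, {"A": 0.25, "B": 0.55}),
]
# Feature for an unseen query: how much a candidate ranker alters the initial scores.
feature = js_divergence(normalize([3, 2, 1]), normalize([1, 2, 3]))
chosen = select_ranker(feature, training, k=2)
```

Because every neighbour set has the same size, comparing summed effectiveness is equivalent to comparing means.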

    A Review of Relational Machine Learning for Knowledge Graphs

    Relational machine learning studies methods for the statistical analysis of relational, or graph-structured, data. In this paper, we provide a review of how such statistical models can be “trained” on large knowledge graphs, and then used to predict new facts about the world (which is equivalent to predicting new edges in the graph). In particular, we discuss two different kinds of statistical relational models, both of which can scale to massive datasets. The first is based on tensor factorization methods and related latent variable models. The second is based on mining observable patterns in the graph. We also show how to combine these latent and observable models to get improved modeling power at decreased computational cost. Finally, we discuss how such statistical models of graphs can be combined with text-based information extraction methods for automatically constructing knowledge graphs from the Web. In particular, we discuss Google’s Knowledge Vault project. This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF - 1231216.
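
To illustrate what predicting new edges with a latent variable model can look like, here is a minimal sketch of a bilinear triple score of the kind used in tensor-factorization models such as RESCAL. The embeddings and relation matrix below are made-up toy values rather than trained parameters, and the entity and relation names are hypothetical.

```python
def score_triple(e_subj, w_rel, e_obj):
    """Bilinear score e_subj^T W_rel e_obj; a higher score means a more plausible edge."""
    return sum(
        e_subj[i] * w_rel[i][j] * e_obj[j]
        for i in range(len(e_subj))
        for j in range(len(e_obj))
    )

# Toy 2-dimensional entity embeddings (hypothetical, not learned).
entities = {
    "alice": [1.0, 0.0],
    "springfield": [0.0, 1.0],
    "shelbyville": [1.0, -0.5],
}
# Toy relation matrix for a hypothetical "born_in" relation.
born_in = [[0.0, 1.0],
           [0.0, 0.0]]

# Rank candidate objects for the query (alice, born_in, ?).
candidates = ["springfield", "shelbyville"]
ranked = sorted(
    candidates,
    key=lambda o: score_triple(entities["alice"], born_in, entities[o]),
    reverse=True,
)
```

In a trained model the embeddings and relation matrices would be fitted to the observed edges of the knowledge graph. Here they only show how a score is computed and used to rank candidate edges.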