
    Brass: A Queueing Manager for Warrick

    When an individual loses their website and a backup cannot be found, they can download and run Warrick, a web-repository crawler which will recover their lost website by crawling the holdings of the Internet Archive and several search engine caches. Running Warrick locally requires some technical know-how, so we have created an online queueing system called Brass which simplifies the task of recovering lost websites. We discuss the technical aspects of reconstructing websites and the implementation of Brass. Our newly developed system allows anyone to recover a lost website with a few mouse clicks and allows us to track which websites the public is most interested in saving.
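    The queueing idea behind Brass can be pictured with a short sketch. The snippet below is not the Brass implementation; it assumes an in-memory FIFO, a single worker thread, and a hypothetical recover_site() stand-in for launching a Warrick crawl.

```python
# Minimal sketch of a recovery job queue in the spirit of Brass (not the
# actual implementation): users enqueue a URL, a single worker drains the
# queue and hands each job to a recovery routine.
import queue
import threading

jobs: "queue.Queue[str]" = queue.Queue()

def recover_site(url: str) -> None:
    # Hypothetical placeholder: a real worker would run the Warrick crawler
    # here and store the reconstructed site for the user to download.
    print(f"recovering {url} from web archives and search engine caches...")

def worker() -> None:
    while True:
        url = jobs.get()          # block until a request arrives
        try:
            recover_site(url)
        finally:
            jobs.task_done()      # mark the job finished for join()

threading.Thread(target=worker, daemon=True).start()

# A user "clicks" submit: the site is queued for recovery.
jobs.put("http://example.com/")
jobs.join()                       # wait until all queued jobs complete
```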

    Identifying the Effects of Unexpected Change in a Distributed Collection of Web Documents

    It is not unusual for digital collections to degrade and suffer from problems associated with unexpected change. In previous analyses, I have found that categorizing the degree of change affecting a digital collection over time is a difficult task. More specifically, I found that categorizing this degree of change is not a binary problem where documents are either unchanged or they have changed so dramatically that they do not fit within the scope of the collection. It is, in part, a characterization of the intent of the change. In this dissertation, I present a study that compares change detection methods based on machine learning algorithms against the assessment made by human subjects in a user study. Consequently, this dissertation focuses on two research questions. First, how can we categorize the various degrees of change that documents can endure? This point becomes increasingly interesting if we take into account that the resources found in a digital library are often curated and maintained by experts with affiliations to professionally managed institutions. And second, how do the automatic detection methods fare against the human assessment of change in the ACM conference list? The results of this dissertation are threefold. First, I provide a categorization framework that highlights the different instances of change that I found in an analysis of the Association for Computing Machinery conference list. Second, I focus on a set of procedures to classify the documents according to the characteristics of change that they exhibit. Finally, I evaluate the classification procedures against the assessment of human subjects. Taking into account the results of the user evaluation and the inability of the test subjects to recognize some instances of change, the main conclusion that I derive from my dissertation is that managing the effects of unexpected change is a more serious problem than had previously been anticipated, thus requiring the immediate attention of collection managers and curators.
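    The classification step described here can be sketched in a few lines. The features (TF-IDF cosine similarity and relative length change) and the change-category labels below are illustrative assumptions, not the dissertation's actual feature set or categories; the sketch only shows the shape of a learned change classifier that could be compared against human judgments.

```python
# Hedged sketch of a machine-learned change classifier: each training example
# is an (old_text, new_text) pair with a human-assigned change category.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

def change_features(old_text: str, new_text: str) -> list[float]:
    # Two simple, assumed features: textual similarity and length drift.
    tfidf = TfidfVectorizer().fit([old_text, new_text])
    vecs = tfidf.transform([old_text, new_text])
    sim = float(cosine_similarity(vecs[0], vecs[1])[0, 0])
    length_ratio = len(new_text) / max(len(old_text), 1)
    return [sim, length_ratio]

# Toy training data with hypothetical change categories.
pairs = [
    ("CFP for SIGIR 2008, deadline Jan 21", "CFP for SIGIR 2008, deadline Jan 28"),
    ("CFP for SIGIR 2008, deadline Jan 21", "This domain is for sale"),
]
labels = ["minor_edit", "off_topic_replacement"]

X = [change_features(old, new) for old, new in pairs]
clf = LogisticRegression().fit(X, labels)
print(clf.predict([change_features("Workshop home page", "This domain is for sale")]))
```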

    Using the Web Infrastructure for Real Time Recovery of Missing Web Pages

    Given the dynamic nature of the World Wide Web, missing web pages, or "404 Page Not Found" responses, are part of our web browsing experience. It is our intuition that information on the web is rarely completely lost; it is just missing. In whole or in part, content often moves from one URI to another and hence just needs to be (re-)discovered. We evaluate several methods for a just-in-time approach to web page preservation. We investigate the suitability of lexical signatures and web page titles to rediscover missing content. It is understood that web pages change over time, which implies that the performance of these two methods depends on the age of the content. We therefore conduct a temporal study of the decay of lexical signatures and titles and estimate their half-life. We further propose the use of tags that users have created to annotate pages, as well as the most salient terms derived from a page's link neighborhood. We utilize the Memento framework to discover previous versions of web pages and to execute the above methods. We provide a workflow, including a set of parameters, that is most promising for the (re-)discovery of missing web pages. We introduce Synchronicity, a web browser add-on that implements this workflow. It works while the user is browsing and detects the occurrence of 404 errors automatically. When activated by the user, Synchronicity offers a total of six methods to either rediscover the missing page at its new URI or discover an alternative page that satisfies the user's information need. Synchronicity depends on user interaction, which enables it to provide results in real time.
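    The rediscovery workflow lends itself to a small sketch: fetch an archived copy of the missing page through a Memento TimeGate, derive a short lexical signature from its text, and hand that signature to a search engine as a query. The aggregator endpoint and the simple frequency-based signature below are assumptions for illustration; the actual methods weigh terms more carefully and also draw on titles, tags, and link-neighborhood text.

```python
# Hedged sketch of "lexical signature from an archived copy" rediscovery.
import re
from collections import Counter

import requests

STOPWORDS = {"the", "and", "for", "that", "with", "this", "from", "are", "was"}

def archived_copy(url: str) -> str:
    # TimeGate of the public Memento aggregator (assumed endpoint); the
    # Accept-Datetime header asks for a copy near the given date.
    gate = "http://timetravel.mementoweb.org/timegate/" + url
    resp = requests.get(gate, headers={"Accept-Datetime": "Thu, 01 Jan 2009 00:00:00 GMT"})
    resp.raise_for_status()
    return resp.text

def lexical_signature(html: str, k: int = 7) -> list[str]:
    text = re.sub(r"<[^>]+>", " ", html).lower()          # crude tag stripping
    terms = [t for t in re.findall(r"[a-z]{3,}", text) if t not in STOPWORDS]
    return [term for term, _ in Counter(terms).most_common(k)]

if __name__ == "__main__":
    sig = lexical_signature(archived_copy("http://example.com/"))
    print("search query:", " ".join(sig))   # feed this to a search engine
```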

    Factors affecting website reconstruction from the web infrastructure

    When a website is suddenly lost without a backup, it may be reconstituted by probing web archives and search engine caches for missing content. In this paper we describe an experiment where we crawled and reconstructed 300 randomly selected websites on a weekly basis for 14 weeks. The reconstructions were performed using our web-repository crawler named Warrick which recovers missing resources from the Web Infrastructure (WI), the collective preservation effort of web archives and search engine caches. We examine several characteristics of the websites over time including birth rate, decay and age of resources. We evaluate the reconstructions when compared to the crawled sites and develop a statistical model for predicting reconstruction success from the WI. On average, we were able to recover 61% of each website’s resources. We found that Google’s PageRank, number of hops and resource age were the three most significant factors in determining if a resource would be recovered from the WI.
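    The kind of statistical model described can be pictured as a logistic regression over the three significant factors named above (PageRank, hop count from the root page, and resource age). The numbers below are invented for illustration and are not data from the experiment.

```python
# Hedged sketch of a recovery-prediction model: logistic regression on
# [pagerank, hops_from_root, age_in_days], with made-up training rows.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([
    [6, 0, 400],
    [5, 1, 200],
    [2, 3,  10],
    [1, 4,   5],
    [4, 2, 150],
    [0, 5,   2],
])
y = np.array([1, 1, 0, 0, 1, 0])   # 1 = recovered from the WI, 0 = not recovered

model = LogisticRegression().fit(X, y)
new_resource = np.array([[3, 2, 30]])
print("P(recovered) =", model.predict_proba(new_resource)[0, 1])
```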
