9 research outputs found

    Uncovering the unarchived web

    Many national and international heritage institutes realize the importance of archiving the web for future cultural heritage. Web archiving is currently performed either by harvesting a national domain or by crawling a pre-defined list of websites selected by the archiving institution. In either method, crawling harvests more information than just the websites intended for preservation; this surplus could be used to reconstruct impressions of pages that existed on the live web at the crawl date but would otherwise have been lost forever. We present a method to create representations of what we will refer to as a web collection's aura: the web documents that were not included in the archived collection, but are known to have existed due to their mentions on pages that were included in the archived web collection. To create representations of these unarchived pages, we exploit the information about the unarchived URLs that can be derived from the crawls by combining crawl date distribution, anchor text, and link structure. We illustrate empirically that the size of the aura can be substantial: in 2012, the Dutch Web archive contained 12.3M unique pages, while we uncover references to 11.9M additional (unarchived) pages.
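    The abstract above describes deriving representations of unarchived pages from evidence in the crawl itself: inlinks, anchor text, and crawl dates. A minimal sketch of that idea in Python, using toy link records with hypothetical URLs (the paper's actual extraction pipeline is not shown here):

    ```python
    from collections import defaultdict

    # Toy link records extracted from an archived crawl:
    # (source page, link target, anchor text, crawl date). All URLs are invented.
    links = [
        ("http://archived.nl/a", "http://gone.nl/x", "city council minutes", "2012-03-01"),
        ("http://archived.nl/b", "http://gone.nl/x", "council minutes 2012", "2012-04-15"),
        ("http://archived.nl/c", "http://gone.nl/y", "photo gallery", "2012-04-20"),
    ]
    archived = {"http://archived.nl/a", "http://archived.nl/b", "http://archived.nl/c"}

    # Aggregate anchor text, inlink counts, and crawl dates per unarchived target.
    representations = defaultdict(lambda: {"anchors": [], "inlinks": 0, "dates": []})
    for src, dst, anchor, date in links:
        if dst not in archived:  # target is in the "aura": linked to, but not archived
            rep = representations[dst]
            rep["anchors"].append(anchor)
            rep["inlinks"] += 1
            rep["dates"].append(date)

    for url, rep in sorted(representations.items()):
        print(url, rep["inlinks"], sorted(set(" ".join(rep["anchors"]).split())))
    ```

    Each unarchived URL ends up with a small surrogate document built purely from how archived pages referred to it, which is the kind of representation the paper's aura reconstruction relies on.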

    Lost but not forgotten: finding pages on the unarchived web

    Web archives attempt to preserve the fast-changing web, yet they will always be incomplete. Due to restrictions in crawling depth, crawling frequency, and restrictive selection policies, large parts of the web are unarchived and, therefore, lost to posterity. In this paper, we propose an approach to uncover unarchived web pages and websites and to reconstruct different types of descriptions for these pages and sites, based on links and anchor text in the set of crawled pages. We experiment with this approach on the Dutch Web Archive and evaluate the usefulness of page-level and host-level representations of unarchived content. Our main findings are the following. First, the crawled web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of a web archive. Second, link and anchor text have a highly skewed distribution: popular pages such as home pages have more links pointing to them and more terms in the anchor text, but the richness tapers off quickly; aggregating web page evidence to the host level leads to significantly richer representations, but the distribution remains skewed. Third, the succinct representations are generally rich enough to uniquely identify pages on the unarchived web: in a known-item search setting we can retrieve unarchived web pages within the first ranks on average, with host-level representations leading to further improvement of retrieval effectiveness for websites.
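    The known-item search setting described above can be sketched as ranking unarchived URLs by term overlap between a query and their aggregated anchor-text representations. The data and the overlap scorer below are toy stand-ins (the abstract does not specify the paper's retrieval model):

    ```python
    # Toy aggregated representations: unarchived URL -> concatenated anchor text.
    # URLs and texts are invented for illustration.
    reps = {
        "http://gone.nl/x": "city council minutes 2012",
        "http://gone.nl/y": "photo gallery amsterdam",
        "http://gone.nl/z": "council agenda archive",
    }

    def score(query: str, text: str) -> int:
        # Count shared terms between query and representation; a stand-in
        # for a real retrieval model such as BM25.
        return len(set(query.lower().split()) & set(text.lower().split()))

    def known_item_search(query: str) -> list[str]:
        # Rank all unarchived URLs by their overlap with the query.
        return sorted(reps, key=lambda url: score(query, reps[url]), reverse=True)

    print(known_item_search("council minutes")[0])
    ```

    Even this crude overlap score illustrates the finding: a few anchor-text terms are often enough to place the sought page at the top rank.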

    The Web as a Historical Corpus: Collecting, Analysing and Selecting Sources on the Recent Past of Academic Institutions

    The goal of this thesis is to understand the impact that the transition from analogue to born-digital sources will have on the way historians collect, analyse and select primary evidence. This thesis aims in particular at addressing the simultaneous scarcity and abundance of digital materials, and at dealing with these issues by combining the historical method with methodologies from the fields of internet studies and natural language processing. The case study of this work is focused on recollecting sources on the recent past of Italian academic institutions, with specific attention to the University of Bologna. The dissertation is organised in three main parts. Part I offers an extensive overview of the academic background in which this thesis is situated. Next, the so-called scarcity issue is addressed by considering university websites as primary sources for the study of the recent past of academic institutions. Combining traditional sources and methods with solutions from the field of internet studies, Part II presents how the digital past of the University of Bologna has been reconstructed. The collected resources allowed us to address the second issue, namely the large abundance of born-digital sources. Part III focuses on collecting, analysing and selecting materials from large collections of academic publications. In particular, it stresses the importance of adopting methods from the field of natural language processing in a highly critical way, presenting a case study focused on identifying interdisciplinary collaborations through the analysis of a corpus of Ph.D. dissertations. Based on the case studies presented, the final part of the dissertation describes how this work intends to be a contribution both to research in digital humanities and in historiography.

    Legibility Machines: Archival Appraisal and the Genealogies of Use

    The web is a site of constant breakdown in the form of broken links, failed business models, unsustainable infrastructure, obsolescence and general neglect. Some estimate that about a quarter of all links break every 7 years, and even within highly curated regions of the web, such as scholarly publishing, rates of link rot can be as high as 50%. Over the past twenty years web archiving projects at cultural heritage organizations have worked to stem this tide of loss. Yet we still understand little about the diversity of actors involved in web archiving, and how content is selected for web archives. This is due in large part to the ontological politics of web archives, and to how the practice of archiving the web takes place out of sight at the boundaries between human and technical activity. This dissertation explores appraisal practices in web archives in order to answer two motivating research questions: 1) How is appraisal currently being enacted in web archives? 2) How do definitions of what constitutes a web archive shape the practice of appraisal? To answer these questions, data was collected from interviews with practicing professionals in web archives, and from a year-long ethnographic field study with a large federally funded archive. Method triangulation using thematic analysis, critical discourse analysis and grounded theory generated a thick and layered description of archival practice. The results of this analysis highlight three fundamental characteristics of appraisal in web archives: time, ontology and use. The research findings suggest that, as expressions of value, appraisal decisions do not simply occur at discrete moments in the life cycle of records. They are instead part of a diverse set of archival processes that repeat and evolve over time. Appraisal in web archives is not bound by a predefined assemblage of actors, technologies and practices.
    Indeed, artificially limiting our definition of what constitutes a web archive truncates our understanding of how appraisal functions in web archives. Finally, the valuation of web records is inextricably tied to their use in legibility projects, where use is not singular, but part of a genealogy of use, disuse and misuse. Appraising appraisal along these three axes of time, ontology and use provides insight into the web-memory practices that condition our understanding of the past, and that also work to create our collective present and futures. Explicitly linking appraisal to the many forms of use informs archival studies pedagogy by establishing the value of records in terms of the processes they participate in, rather than as a static attribute of the records or their immediate context. As machines increasingly become users of web archives, the stakes for understanding the values present in web archival platforms could not be higher.