    Uncovering the unarchived web

    Many national and international heritage institutes realize the importance of archiving the web for future cultural heritage. Web archiving is currently performed either by harvesting a national domain, or by crawling a pre-defined list of websites selected by the archiving institution. With either method, crawling harvests more information than just the websites intended for preservation; this additional information could be used to reconstruct impressions of pages that existed on the live web at the crawl date, but would otherwise have been lost forever. We present a method to create representations of what we will refer to as a web collection's aura: the web documents that were not included in the archived collection, but are known to have existed because they are mentioned on pages that were included in the archived web collection. To create representations of these unarchived pages, we exploit the information about the unarchived URLs that can be derived from the crawls by combining crawl date distribution, anchor text and link structure. We illustrate empirically that the size of the aura can be substantial: in 2012, the Dutch Web archive contained 12.3M unique pages, while we uncover references to 11.9M additional (unarchived) pages.
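
    As a rough illustration of the kind of aggregation this involves, the sketch below collects anchor text, inlink counts, linking hosts and crawl dates for link targets that are absent from the archive. It is a minimal sketch assuming a simple iterable of (source_url, target_url, anchor_text, crawl_date) link records extracted from the crawled pages; the record format, field names and function name are hypothetical rather than the paper's actual implementation.

```python
from collections import defaultdict
from urllib.parse import urlsplit

def build_aura(link_records, archived_urls):
    """Aggregate link evidence for URLs that were linked to but never crawled.

    link_records:  iterable of (source_url, target_url, anchor_text, crawl_date)
                   tuples extracted from the archived pages (hypothetical format).
    archived_urls: set of URLs that are actually present in the archive.
    Returns a dict mapping each unarchived target URL to its aggregated
    representation: anchor texts, inlink count, linking hosts and crawl dates.
    """
    aura = defaultdict(lambda: {"anchors": [], "inlinks": 0,
                                "source_hosts": set(), "crawl_dates": []})
    for source, target, anchor, date in link_records:
        if target in archived_urls:
            continue  # keep only evidence about pages missing from the archive
        rep = aura[target]
        rep["anchors"].append(anchor)
        rep["inlinks"] += 1
        rep["source_hosts"].add(urlsplit(source).netloc)
        rep["crawl_dates"].append(date)
    return dict(aura)
```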

    Lost but not forgotten: finding pages on the unarchived web

    Web archives attempt to preserve the fast-changing web, yet they will always be incomplete. Due to restrictions in crawling depth, crawling frequency, and restrictive selection policies, large parts of the Web are unarchived and therefore lost to posterity. In this paper, we propose an approach to uncover unarchived web pages and websites and to reconstruct different types of descriptions for these pages and sites, based on links and anchor text in the set of crawled pages. We experiment with this approach on the Dutch Web Archive and evaluate the usefulness of page-level and host-level representations of unarchived content. Our main findings are the following. First, the crawled web contains evidence of a remarkable number of unarchived pages and websites, potentially increasing the coverage of a Web archive dramatically. Second, link and anchor text have a highly skewed distribution: popular pages such as home pages have more links pointing to them and more terms in their anchor text, but the richness tapers off quickly. Aggregating web page evidence to the host level leads to significantly richer representations, but the distribution remains skewed. Third, the succinct representations are generally rich enough to uniquely identify pages on the unarchived web: in a known-item search setting we can retrieve unarchived web pages within the first ranks on average, with host-level representations leading to further improvement of retrieval effectiveness for websites.
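
    To make the host-level aggregation and known-item search concrete, the sketch below folds page-level anchor-text representations (such as those built in the previous sketch) into one bag of terms per host, and ranks hosts by plain term overlap with a query. The function names are hypothetical, and the overlap score is a crude stand-in for the retrieval model actually evaluated in the paper.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit

def host_level_representations(page_reps):
    """Fold page-level representations (URL -> {"anchors": [...], ...})
    into one bag of anchor terms per host."""
    hosts = defaultdict(Counter)
    for url, rep in page_reps.items():
        host = urlsplit(url).netloc
        for anchor in rep["anchors"]:
            hosts[host].update(anchor.lower().split())
    return hosts

def known_item_search(query, host_reps, k=10):
    """Rank hosts by term overlap with the query (illustrative scoring only)."""
    terms = query.lower().split()
    scored = sorted(((sum(rep[t] for t in terms), host)
                     for host, rep in host_reps.items()), reverse=True)
    return [host for score, host in scored[:k] if score > 0]
```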

    Supporting the complex dynamics of the information seeking process

    In the context of complex tasks, information seeking has been described as a journey. Both the correct route and the final destination of this journey are often unknown in advance. Searchers may discover new paths, but also encounter ample challenges and dead ends. In the quest for knowledge, obfuscation and illumination may go hand in hand, but ultimately lead to new insights. The complex interplay of feelings, thoughts and actions during tasks involving learning and knowledge construction has been formally documented in various information seeking models. The feelings, thoughts and actions of searchers evolve throughout the stages of these models, and may include moments of optimism, uncertainty, confusion and satisfaction. However, despite the evidence captured in information seeking models, the functionality of search engines, nowadays the prime intermediaries between information and user, has converged to a streamlined set of features. Even though recent years have brought rapid advances in contextualization and personalization, the Web's complex information environment is still reduced to a set of ten 'relevant' blue links. This may not be beneficial for supporting sustained information-intensive tasks and knowledge construction. This thesis aims to shed new light on the apparent contradiction between models describing drastic changes in searchers' feelings, thoughts and actions, and the limited task support offered by current search systems. It focuses on research-based tasks conducted via web archives and online search engines. Through literature reviews, user studies and information retrieval experiments, this thesis aims to rethink the currently dominant search approach, and ultimately arrive at more dynamic support approaches for complex search tasks.