
    Analyzing web archives through topic and event focused sub-collections

    Web archives capture the history of the Web and are therefore an important source for studying how societal developments have been reflected on the Web. However, the large size of Web archives and their temporal nature pose many challenges to researchers interested in working with these collections. In this work, we describe the challenges of working with Web archives and propose the research methodology of extracting and studying sub-collections of the archive focused on specific topics and events. We discuss the opportunities and challenges of this approach and suggest a framework for creating sub-collections.

    Towards Better Understanding Researcher Strategies in Cross-Lingual Event Analytics

    With an increasing amount of information on globally important events, there is a growing demand for efficient analytics of multilingual event-centric information. Such analytics is particularly challenging due to the large amount of content, the event dynamics and the language barrier. Although memory institutions increasingly collect event-centric Web content in different languages, very little is known about the strategies of researchers who conduct analytics of such content. In this paper we present researchers' strategies for the content, method and feature selection in the context of cross-lingual event-centric analytics observed in two case studies on multilingual Wikipedia. We discuss the influence factors for these strategies, the findings enabled by the adopted methods along with the current limitations and provide recommendations for services supporting researchers in cross-lingual event-centric analytics. Comment: In Proceedings of the International Conference on Theory and Practice of Digital Libraries 201

    SLIS Student Research Journal, Vol. 1, Iss. 1


    Creating Structure in Web Archives With Collections: Different Concepts From Web Archivists

    As web archives' holdings grow, archivists subdivide them into collections so they are easier to understand and manage. In this work, we review the collection structures of eight web archive platforms: Archive-It, Conifer, the Croatian Web Archive (HAW), the Internet Archive's user account web archives, Library of Congress (LC), PANDORA, Trove, and the UK Web Archive (UKWA). We note a plethora of different approaches to web archive collection structures. Some web archive collections support sub-collections and some permit embargoes. Curatorial decisions may be attributed to a single organization or many. Archived web pages are known by many names: mementos, copies, captures, or snapshots. Some platforms restrict a memento to a single collection and others allow mementos to cross collections. Knowledge of collection structures has implications for many different applications and users. Visitors will need to understand how to navigate collections. Future archivists will need to understand what options are available for designing collections. Platform designers need to know what possibilities exist. The developers of tools that consume collections need to understand collection structures so they can meet the needs of their users. Comment: 5 figures, 16 pages, accepted for publication at TPDL 202

    Bots, Seeds and People: Web Archives as Infrastructure

    The field of web archiving provides a unique mix of human and automated agents collaborating to achieve the preservation of the web. Centuries-old theories of archival appraisal are being transplanted into the sociotechnical environment of the World Wide Web with varying degrees of success. The work of the archivist and bots in contact with the material of the web presents a distinctive and understudied CSCW-shaped problem. To investigate this space we conducted semi-structured interviews with archivists and technologists who were directly involved in the selection of content from the web for archives. These semi-structured interviews identified thematic areas that inform the appraisal process in web archives, some of which are encoded in heuristics and algorithms. Making the infrastructure of web archives legible to the archivist, the automated agents and the future researcher is presented as a challenge to the CSCW and archival community.

    DARIAH and the Benelux


    Symbiosis between the TRECVid benchmark and video libraries at the Netherlands Institute for Sound and Vision

    Audiovisual archives are investing in large-scale digitisation efforts of their analogue holdings and, in parallel, ingesting an ever-increasing amount of born-digital files in their digital storage facilities. Digitisation opens up new access paradigms and boosts re-use of audiovisual content. Query-log analyses show the shortcomings of manual annotation; archives are therefore complementing these annotations by developing novel search engines that automatically extract information from both the audio and the visual tracks. Over the past few years, the TRECVid benchmark has developed a novel relationship with the Netherlands Institute for Sound and Vision (NISV) which goes beyond the NISV just providing data and use cases to TRECVid. Prototype and demonstrator systems developed as part of TRECVid are set to become a key driver in improving the quality of search engines at the NISV and will ultimately help other audiovisual archives to offer more efficient and more fine-grained access to their collections. This paper reports the experiences of NISV in leveraging the activities of the TRECVid benchmark.

    Web archives: the future

    This report is structured first, to engage in some speculative thought about the possible futures of the web as an exercise in prompting us to think about what we need to do now in order to make sure that we can reliably and fruitfully use archives of the web in the future. Next, we turn to considering the methods and tools being used to research the live web, as a pointer to the types of things that can be developed to help understand the archived web. Then, we turn to a series of topics and questions that researchers want or may want to address using the archived web. In this final section, we identify some of the challenges individuals, organizations, and international bodies can target to increase our ability to explore these topics and answer these questions. We end the report with some conclusions based on what we have learned from this exercise.