
    On the Change in Archivability of Websites Over Time

    As web technologies evolve, web archivists work to keep up so that our digital history is preserved. Recent advances in web technologies have introduced client-side executed scripts that load data without a referential identifier or that require user interaction (e.g., content that loads only when the page has scrolled). These advances have made it more difficult to automate the capture of web pages. Because of the evolving schemes for publishing web pages, along with the progressive capability of web preservation tools, the archivability of pages on the web has varied over time. In this paper we show that the archivability of a web page can be deduced from the type of page being archived, which aligns with that page's accessibility with respect to dynamic content. We show concrete examples of when these technologies were introduced by referencing mementos of pages that have persisted through a long evolution of available technologies. Identifying why these pages could not be archived in the past, with respect to accessibility, serves as a guide for ensuring that content intended to have longevity is published using good-practice methods that make it available for preservation.
    Comment: 12 pages, 8 figures, Theory and Practice of Digital Libraries (TPDL) 2013, Valletta, Malta.
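
    The kind of client-side behaviour described above can be illustrated with a minimal sketch (the /api/feed endpoint and element id are hypothetical, not taken from the paper): content is requested only after the user scrolls near the bottom of the page, so a crawler that never scrolls or never executes the script archives an empty container and never discovers the data URL.

```typescript
// Minimal sketch of scroll-triggered loading (hypothetical /api/feed endpoint).
// A crawler that only fetches the initial HTML, or never fires scroll events,
// archives an empty <div id="feed"> and has no referential identifier pointing
// at the data it is missing.
window.addEventListener("scroll", async () => {
  const nearBottom =
    window.innerHeight + window.scrollY >= document.body.offsetHeight - 200;
  if (!nearBottom) return;

  // The JSON URL never appears in the page source, so link-extracting
  // crawlers cannot discover it.
  const response = await fetch("/api/feed?page=2");
  const items: { title: string }[] = await response.json();

  const feed = document.getElementById("feed");
  for (const item of items) {
    const paragraph = document.createElement("p");
    paragraph.textContent = item.title;
    feed?.appendChild(paragraph);
  }
});
```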

    CLEAR: a credible method to evaluate website archivability

    Web archiving is crucial to ensure that cultural, scientific and social heritage on the web remains accessible and usable over time. A key aspect of the web archiving process is optimal data extraction from target websites. This procedure is difficult for reasons such as website complexity, the plethora of underlying technologies and, ultimately, the open-ended nature of the web. The purpose of this work is to establish the notion of Website Archivability (WA) and to introduce the Credible Live Evaluation of Archive Readiness (CLEAR) method to measure WA for any website. Website Archivability captures the core aspects of a website that are crucial in diagnosing whether it has the potential to be archived with completeness and accuracy. An appreciation of the archivability of a website should provide archivists with a valuable tool when assessing the possibilities of archiving material, and influence web design professionals to consider the implications of their design decisions on the likelihood that their websites can be archived. A prototype application, archiveready.com, has been established to demonstrate the viability of the proposed method for assessing Website Archivability.
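
    To make the idea of evaluating archive readiness concrete, here is a minimal sketch of the kind of checks such an evaluation might run. The checks and weights below are assumptions made for illustration; they are not the CLEAR specification or the archiveready.com implementation.

```typescript
// Illustrative archivability probe: test a few signals a crawler depends on
// (robots.txt, a sitemap, declared content type, script-driven loading) and
// combine them into a naive weighted score. Weights are arbitrary examples.
async function naiveArchivabilityScore(siteUrl: string): Promise<number> {
  const origin = new URL(siteUrl).origin;

  const reachable = async (path: string): Promise<boolean> => {
    try {
      const res = await fetch(new URL(path, origin).toString());
      return res.ok;
    } catch {
      return false;
    }
  };

  const page = await fetch(siteUrl);
  const html = await page.text();

  const checks = [
    { name: "robots.txt reachable", weight: 0.2, passed: await reachable("/robots.txt") },
    { name: "sitemap.xml present", weight: 0.3, passed: await reachable("/sitemap.xml") },
    { name: "Content-Type declared", weight: 0.2, passed: page.headers.get("content-type") !== null },
    // Script-driven loading is a rough hint that content may be deferred.
    { name: "no obvious Ajax loading", weight: 0.3, passed: !/XMLHttpRequest|fetch\(/.test(html) },
  ];

  // 0 = poorly archivable, 1 = readily archivable under these toy checks.
  return checks.reduce((score, check) => score + (check.passed ? check.weight : 0), 0);
}
```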

    Scripts in a Frame: A Framework for Archiving Deferred Representations

    Web archives provide a view of the Web as seen by Web crawlers. Because of rapid advancements and adoption of client-side technologies like JavaScript and Ajax, coupled with the inability of crawlers to execute these technologies effectively, Web resources become harder to archive as they become more interactive. At Web scale, we cannot capture client-side representations using the current state-of-the-art toolsets because of the migration from Web pages to Web applications. Web applications increasingly rely on JavaScript and other client-side programming languages to load embedded resources and change client-side state. We demonstrate that Web crawlers and other automatic archival tools are unable to archive the resulting JavaScript-dependent representations (what we term deferred representations), resulting in missing or incorrect content in the archives and the general inability to replay the archived resource as it existed at the time of capture. Building on prior studies on Web archiving, client-side monitoring of events and embedded resources, and studies of the Web, we establish an understanding of the trends contributing to the increasing unarchivability of deferred representations. We show that JavaScript leads to lower-quality mementos (archived Web resources) due to the archival difficulties it introduces. We measure the historical impact of JavaScript on mementos, demonstrating that the increased adoption of JavaScript and Ajax correlates with the increase in missing embedded resources. To measure memento and archive quality, we propose and evaluate a metric to assess memento quality closer to Web users’ perception. We propose a two-tiered crawling approach that enables crawlers to capture embedded resources dependent upon JavaScript. Measuring the performance benefits between crawl approaches, we propose a classification method that mitigates the performance impacts of the two-tiered crawling approach, and we measure the frontier size improvements observed with the two-tiered approach. Using the two-tiered crawling approach, we measure the number of client-side states associated with each URI-R and propose a mechanism for storing the mementos of deferred representations. In short, this dissertation details a body of work that explores the following: why JavaScript and deferred representations are difficult to archive (establishing the term deferred representation to describe JavaScript-dependent representations); the extent to which JavaScript impacts archivability along with its impact on current archival tools; a metric for measuring the quality of mementos, which we use to describe the impact of JavaScript on archival quality; the performance trade-offs between traditional archival tools and technologies that better archive JavaScript; and a two-tiered crawling approach for discovering and archiving currently unarchivable descendants (representations generated by client-side user events) of deferred representations to mitigate the impact of JavaScript on our archives. In summary, what we archive is increasingly different from what we as interactive users experience. Using the approaches detailed in this dissertation, archives can create mementos closer to what users experience rather than archiving the crawlers’ experiences on the Web.
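
    The two-tiered idea described above can be sketched as follows (an illustration using Puppeteer as the headless browser, which is an assumption of this example rather than the dissertation's toolchain): tier one fetches the raw HTML as a conventional crawler would, while tier two loads the same URI in a headless browser, lets the JavaScript run, and records every embedded resource the deferred representation actually requests, so the missed URIs can be added to the crawl frontier.

```typescript
import puppeteer from "puppeteer";

// Sketch of a second-tier crawl pass (assumes Puppeteer; the dissertation's
// own tooling may differ). The page's JavaScript runs in a headless browser
// and every resource URI the deferred representation loads is recorded, so a
// first-tier crawler can be told which embedded resources it missed.
async function secondTierResources(uri: string): Promise<string[]> {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();

  const requested: string[] = [];
  page.on("response", (response) => {
    // Record every embedded resource fetched at runtime, including those
    // whose URLs are constructed by JavaScript/Ajax after the initial load.
    requested.push(response.url());
  });

  // Wait until the network goes quiet so late Ajax requests are included.
  await page.goto(uri, { waitUntil: "networkidle0", timeout: 30_000 });
  await browser.close();

  // URIs beyond the root resource are candidates for the crawl frontier.
  return requested.filter((resourceUri) => resourceUri !== uri);
}
```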

    Swimming in a Sea of JavaScript or: How I learned to Stop Worrying and Love High-Fidelity Replay

    [First paragraph] Preserving and replaying modern web pages in high fidelity has become an increasingly difficult task due to the increased usage of JavaScript. Reliance on server-side rewriting alone results in live leakage and/or the inability to replay a page because the preserved JavaScript performs an action that is not permissible from the archive. The current state-of-the-art high-fidelity archival preservation and replay solutions rely on handcrafted client-side URL rewriting libraries specifically tailored for the archive, namely Webrecorder's and Pywb's wombat.js [12]. Web archives not utilizing client-side rewriting rely on server-side rewriting, which misses URLs that are used in a manner not accounted for by the archive or that involve client-side execution of JavaScript by the browser.
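
    The gist of client-side rewriting can be sketched like this (a simplified illustration, not wombat.js itself; the replay prefix below is a made-up example): the replay page overrides the browser APIs that the preserved JavaScript uses to issue requests, so URLs constructed at runtime are routed back into the archive instead of leaking to the live web.

```typescript
// Simplified sketch of client-side URL rewriting on a replay page.
// This is not wombat.js; the replay prefix is a made-up example value.
const REPLAY_PREFIX = "https://archive.example.org/web/20230101000000/";

function rewrite(url: string): string {
  // Leave relative and already-rewritten URLs alone; route absolute
  // live-web URLs back through the archive's replay endpoint.
  if (url.startsWith(REPLAY_PREFIX) || !/^https?:\/\//.test(url)) return url;
  return REPLAY_PREFIX + url;
}

// Override fetch so Ajax calls made by preserved JavaScript are served
// from the archive rather than leaking to the live web.
// (For brevity this sketch drops Request options such as method and body.)
const originalFetch = window.fetch.bind(window);
window.fetch = (input: RequestInfo | URL, init?: RequestInit) => {
  const url =
    typeof input === "string" ? input : input instanceof URL ? input.href : input.url;
  return originalFetch(rewrite(url), init);
};

// XMLHttpRequest gets the same treatment.
const originalOpen = XMLHttpRequest.prototype.open;
XMLHttpRequest.prototype.open = function (
  this: XMLHttpRequest,
  method: string,
  url: string | URL,
  isAsync: boolean = true,
  username?: string | null,
  password?: string | null
) {
  return originalOpen.call(this, method, rewrite(String(url)), isAsync, username, password);
};
```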

    Recognizing Co-Creators in Four Configurations: Critical Questions for Web Archiving

    Four categories of co-creator shape web archivists' practice and influence the development of web archives: social forces, users and uses, subjects of web archives, and technical agents. This paper illustrates how these categories of co-creator overlap and interact in four specific web archiving contexts. It recommends that web archivists acknowledge this complex array of contributors as a way to imagine web archives differently. A critical approach to web archiving recognizes relationships and blended roles among stakeholders; seeks opportunities for non-extractive archival activity; and acknowledges the value of creative reuse as an important aspect of preservation.

    Die Hard: The Impossible, Absolutely Essential Task of Saving the Web for Scholars

    The web is fragile and littered with broken links. This poses a problem for the scholarly record and one’s own academic history. In this presentation given at the Association of College & Research Libraries – Eastern New York chapter conference, I review the statistics on link rot and reference rot, and I give a brief overview of web archiving and its challenges. I review some web archiving tools: the Internet Archive, Perma.cc, WebRecorder, and GitHub. I advise creators of web projects to design their websites to be accessible and archivable, and to think about the preservation (afterlife) of their projects from the start of the planning stages. I advise librarians and archivists to become familiar with web archiving tools and archivability/accessibility practices so that they, in turn, can advise web project creators.

    Leveraging Heritrix and the Wayback Machine on a Corporate Intranet: A Case Study on Improving Corporate Archives

    In this work, we present a case study in which we investigate using open-source, web-scale web archiving tools (i.e., Heritrix and the Wayback Machine installed on the MITRE Intranet) to automatically archive a corporate Intranet. We use this case study to outline the challenges of Intranet web archiving, identify situations in which the open-source tools are not well suited to the needs of corporate archivists, and make recommendations for future corporate archivists wishing to use such tools. We performed a crawl of 143,268 URIs (125 GB and 25 hours) to demonstrate that the crawlers are easy to set up, efficiently crawl the Intranet, and improve archive management. However, challenges exist when the Intranet contains sensitive information, areas with potential archival value require user credentials, or archival targets make extensive use of internally developed and customized web services. We elaborate on and recommend approaches for overcoming these challenges.

    COVID-19 Dashboard Functionality and Design: Assessing Dashboard Design Service Providers for Health Disaster Response

    When disaster strikes, data visualizations are used as quick ways to concisely distill timely information for civilians. Amidst the COVID-19 pandemic, data-driven dashboards played a disproportionately large role in quickly collecting, processing, and conveying preliminary data to citizens. After the Johns Hopkins COVID-19 dashboard went viral, individual public health departments across the world realized the importance of distilling and delivering real-time data to citizens and decision makers. The wide-scale proliferation of dashboards across emergency response groups has only recently been made possible thanks to a business model in the software industry known as Platform as a Service (PaaS), whose providers supply the data hosting, application development tools, and graphical interfaces that allow non-technical experts to deploy dashboards without an extensive background in web development. What PaaS providers offer in ease of use, however, is traded against limitations in functionality and accessibility. In this thesis, I used content analysis to perform a systematic review of 24 international COVID-19 data dashboards to understand international variation in COVID-19 dashboard design and to offer feature recommendations for software companies to incorporate into their PaaS platforms. (71 pages)