
    Bringing Web Time Travel to MediaWiki: An Assessment of the Memento MediaWiki Extension

    We have implemented the Memento MediaWiki Extension Version 2.0, which brings the Memento Protocol to MediaWiki, the platform used by Wikipedia and the Wikimedia Foundation. Test results show that the extension has a negligible impact on performance. Two 302-style datetime negotiation patterns defined by Memento were examined for the extension: Pattern 1.1, which requires 2 requests, and Pattern 2.1, which requires 3 requests. Our test results and mathematical review find that, contrary to intuition, Pattern 2.1 performs better than Pattern 1.1 due to idiosyncrasies in MediaWiki. In addition to implementing Memento, Version 2.0 allows administrators to choose the optional 200-style datetime negotiation Pattern 1.2 instead of Pattern 2.1. It also allows administrators to have the Memento MediaWiki Extension return full HTTP 400 and 500 status codes rather than standard MediaWiki error pages. Finally, Version 2.0 permits administrators to turn off recommended Memento headers if desired. Because much of our work focuses on producing the correct revision of a wiki page in response to a user's datetime input, we also examine the problem of finding the correct revisions of the embedded resources, including images, stylesheets, and JavaScript, identifying the issues and discussing whether MediaWiki must be changed to support this functionality. Comment: 23 pages, 18 figures, 9 tables, 17 listings
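
    As a rough illustration of the datetime negotiation the extension implements, the sketch below issues a request carrying an Accept-Datetime header to a Memento TimeGate and inspects the negotiation-related response headers. This is a minimal TypeScript (Node 18+) sketch: the Wayback Machine TimeGate URL is only an illustrative stand-in for a Memento-enabled MediaWiki endpoint, and the code is not the extension's own.

```typescript
// Minimal sketch of Memento (RFC 7089) datetime negotiation, assuming a
// redirect-style TimeGate. The Wayback Machine TimeGate below is only an
// illustrative stand-in for a Memento-enabled MediaWiki TimeGate.
const TIMEGATE_PREFIX = "https://web.archive.org/web/";
const ORIGINAL_URI = "https://en.wikipedia.org/wiki/Main_Page";

async function negotiate(acceptDatetime: string): Promise<void> {
  // In the 302-style patterns (1.1 and 2.1), the TimeGate redirects the client
  // to the memento whose archival datetime best matches Accept-Datetime.
  const res = await fetch(TIMEGATE_PREFIX + ORIGINAL_URI, {
    headers: { "Accept-Datetime": acceptDatetime },
  });

  // After redirects are followed, res.url is the URI of the selected memento
  // and Memento-Datetime gives the archival datetime of that revision.
  console.log("memento URI:      ", res.url);
  console.log("Memento-Datetime: ", res.headers.get("memento-datetime"));
  console.log("Link relations:   ", res.headers.get("link"));
}

negotiate("Thu, 01 Jan 2015 00:00:00 GMT").catch(console.error);
```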

    To Relive the Web: A Framework for the Transformation and Archival Replay of Web Pages

    When replaying an archived web page (known as a memento), the fundamental expectation is that the page should be viewable and function exactly as it did at archival time. However, this expectation requires web archives to modify the page and its embedded resources so that they no longer reference (link to) the original server(s) they were archived from but instead point back to the archive. Although these modifications necessarily change the state of the representation, it is understood that without them the replay of mementos from the archive would not be possible. Unfortunately, because both the replay of mementos and the modifications web archives make to them to facilitate replay vary between archives, there is no established terminology for describing replay or these modifications. In this thesis, we propose terminology for describing the existing styles of replay and the modifications web archives make to mementos in order to facilitate replay. In the process of defining terminology for the modifications made by client-side rewriting libraries to the JavaScript execution environment of the browser during replay, this thesis also proposes a general framework for the auto-generation of client-side rewriting libraries. Finally, we evaluate the effectiveness of using a generated client-side rewriting library to augment the existing replay systems of web archives by crawling mementos replayed from the Internet Archive’s Wayback Machine with and without the generated client-side rewriter. By using the generated client-side rewriter, we were able to decrease the cumulative number of requests blocked by the content security policy of the Wayback Machine for 577 mementos by 87.5% and increase the cumulative number of requests made by 32.8%. Also by using the generated client-side rewriter, we were able to replay mementos that were previously not replayable from the Internet Archive.
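
    To make the idea of client-side rewriting concrete, the sketch below shows the general technique in TypeScript: the browser's fetch and XMLHttpRequest APIs are overridden so that live-web URLs requested by a replayed page's JavaScript are routed back into the archive. The archive prefix and the rewriting rule are assumptions for illustration only; this is not the thesis's generated library.

```typescript
// Sketch of a client-side rewriter, assuming a Wayback-style URI scheme in
// which mementos live under an archive prefix containing a 14-digit datetime.
const ARCHIVE_PREFIX = "https://web.archive.org/web/20190101000000/"; // assumed

function rewriteUrl(url: string): string {
  // Leave archive-relative URLs alone; send absolute live-web URLs back
  // through the archive so replay stays inside the archived context.
  if (url.startsWith(ARCHIVE_PREFIX) || url.startsWith("/web/")) return url;
  if (/^https?:\/\//i.test(url)) return ARCHIVE_PREFIX + url;
  return url;
}

// Override fetch so resources requested dynamically by the page's JavaScript
// resolve inside the archive rather than against the live web.
const originalFetch = window.fetch.bind(window);
window.fetch = (input: RequestInfo | URL, init?: RequestInit) => {
  const url =
    typeof input === "string" ? input : input instanceof URL ? input.href : input.url;
  return originalFetch(rewriteUrl(url), init);
};

// Override XMLHttpRequest.open the same way for older request code paths.
const originalOpen = XMLHttpRequest.prototype.open;
XMLHttpRequest.prototype.open = function (method: string, url: string | URL, ...rest: any[]) {
  return (originalOpen as any).call(this, method, rewriteUrl(String(url)), ...rest);
};
```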

    Right HTML, Wrong JSON: Challenges in Replaying Archived Webpages Built with Client-Side Rendering

    Many web sites are transitioning how they construct their pages. The conventional model is one where the content is embedded server-side in the HTML and returned to the client in an HTTP response. Increasingly, sites are moving to a model where the initial HTTP response contains only an HTML skeleton plus JavaScript that makes API calls to a variety of servers for the content (typically in JSON format) and then builds out the DOM client-side, more easily allowing for periodically refreshing the content in a page and for dynamic modification of the content. This client-side rendering, now predominant in social media platforms such as Twitter and Instagram, is also being adopted by news outlets, such as CNN.com. When conventional web archiving techniques, such as crawling with Heritrix, are applied to pages that render their content client-side, the JSON responses can become out of sync with the HTML page in which they are to be embedded, resulting in temporal violations on replay. Because the violative JSON is not directly observable in the page (i.e., in the same manner a violative embedded image is), the temporal violations can be difficult to detect. We describe how the top-level CNN.com page has used client-side rendering since April 2015 and the impact this has had on web archives. Between April 24, 2015 and July 21, 2016, we found almost 15,000 mementos with a temporal violation of more than 2 days between the base CNN.com HTML and the JSON responses used to deliver the content under the main story. One way to mitigate this problem is to use browser-based crawling instead of conventional crawlers like Heritrix, but browser-based crawling is currently much slower than non-browser-based tools such as Heritrix. Comment: 20 pages, preprint version of paper accepted at the 2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL)
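
    A temporal violation of this kind can be surfaced by comparing the Memento-Datetime of the base HTML memento with that of the JSON memento its JavaScript loads. The TypeScript (Node 18+) sketch below performs that check; both memento URIs are placeholders for illustration, not the actual CNN.com captures studied in the paper.

```typescript
// Sketch of detecting a temporal violation between a base HTML memento and an
// embedded JSON memento by comparing their Memento-Datetime headers.
// Both URIs are placeholders, not the captures analyzed in the paper.
const BASE_HTML_MEMENTO =
  "https://web.archive.org/web/20150501000000/https://www.cnn.com/";
const EMBEDDED_JSON_MEMENTO =
  "https://web.archive.org/web/20150510000000/https://www.cnn.com/example/stories.json";

async function mementoDatetime(uri: string): Promise<Date> {
  const res = await fetch(uri);
  const header = res.headers.get("memento-datetime");
  if (!header) throw new Error(`No Memento-Datetime header for ${uri}`);
  return new Date(header);
}

async function checkTemporalViolation(): Promise<void> {
  const htmlDt = await mementoDatetime(BASE_HTML_MEMENTO);
  const jsonDt = await mementoDatetime(EMBEDDED_JSON_MEMENTO);
  const days = Math.abs(jsonDt.getTime() - htmlDt.getTime()) / 86_400_000;
  // Flag the pair when the JSON capture drifts more than two days from the
  // HTML, the threshold used in the paper's count of violative mementos.
  console.log(days > 2 ? `temporal violation: ${days.toFixed(1)} days apart` : "within 2 days");
}

checkTemporalViolation().catch(console.error);
```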

    Assessing the Quality of Web Archives

    PDF of a PowerPoint presentation from the 2014 Digital Preservation Meeting, Washington, D.C., July 22-23, 2014. Also available from Slideshare. https://digitalcommons.odu.edu/computerscience_presentations/1010/thumbnail.jp

    Hashes Are Not Suitable to Verify Fixity of the Public Archived Web

    Web archives, such as the Internet Archive, preserve the web and allow access to prior states of web pages. We implicitly trust their versions of archived pages, but as their role moves from preserving curios of the past to facilitating present-day adjudication, we are concerned with verifying the fixity of archived web pages, or mementos, to ensure they have always remained unaltered. A widely used technique in digital preservation to verify the fixity of an archived resource is to periodically compute a cryptographic hash value on the resource and then compare it with a previous hash value. If the hash values generated on the same resource are identical, then the fixity of the resource is verified. We tested this process by conducting a study on 16,627 mementos from 17 public web archives. We replayed and downloaded the mementos 39 times using a headless browser over a period of 442 days and generated a hash for each memento after each download, resulting in 39 hashes per memento. The hash is calculated by including not only the content of the base HTML of a memento but also all embedded resources, such as images and style sheets. We expected to always observe the same hash for a memento regardless of the number of downloads. However, our results indicate that 88.45% of mementos produce more than one unique hash value, and about 16% (or one in six) of those mementos always produce different hash values. We identify and quantify the types of changes that cause the same memento to produce different hashes. These results point to the need for defining an archive-aware hashing function, as conventional hashing functions are not suitable for replayed archived web pages.
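
    To make the fixity-checking procedure concrete, the sketch below computes an aggregate hash over the base HTML of a replayed memento plus the bodies of its embedded resources, which is the style of hash the study repeated for each download. It is a minimal TypeScript (Node 18+) sketch under assumed details: resource discovery here is a naive regular expression, whereas the study drove a headless browser, and the memento URI is a placeholder.

```typescript
// Sketch of hashing a replayed memento: the base HTML and the bodies of its
// embedded resources are folded into one SHA-256 digest, so any change the
// archive makes on replay changes the resulting hash. Resource discovery is a
// naive regex here; the study itself used a headless browser.
import { createHash } from "node:crypto";

async function mementoHash(mementoUri: string): Promise<string> {
  const hash = createHash("sha256");

  const base = await fetch(mementoUri);
  const html = await base.text();
  hash.update(html);

  // Very rough extraction of embedded resource URLs (images, style sheets, scripts).
  const urls = [...html.matchAll(/(?:src|href)=["']([^"']+\.(?:png|jpe?g|gif|css|js))["']/gi)]
    .map((m) => new URL(m[1], mementoUri).href);

  for (const url of urls) {
    try {
      const res = await fetch(url);
      hash.update(Buffer.from(await res.arrayBuffer()));
    } catch {
      // Resources that fail to resolve simply contribute nothing to the digest.
    }
  }
  return hash.digest("hex");
}

// Placeholder memento URI; repeating this over time and comparing digests is
// the fixity check the study evaluates.
mementoHash("https://web.archive.org/web/20200101000000/https://example.com/")
  .then((digest) => console.log(digest))
  .catch(console.error);
```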