Bringing Web Time Travel to MediaWiki: An Assessment of the Memento MediaWiki Extension
We have implemented the Memento MediaWiki Extension Version 2.0, which brings
the Memento Protocol to MediaWiki, the software used by Wikipedia and the Wikimedia
Foundation. Test results show that the extension has a negligible impact on
performance. Two 302 status code datetime negotiation patterns, as defined by
Memento, have been examined for the extension: Pattern 1.1, which requires 2
requests, versus Pattern 2.1, which requires 3 requests. Our test results and
mathematical review find that, contrary to intuition, Pattern 2.1 performs
better than Pattern 1.1 due to idiosyncrasies in MediaWiki. In addition to
implementing Memento, Version 2.0 allows administrators to choose the optional
200-style datetime negotiation Pattern 1.2 instead of Pattern 2.1. It also
allows administrators to have the Memento MediaWiki Extension return full
HTTP 400 and 500 status codes rather than standard MediaWiki error pages.
Finally, Version 2.0 permits administrators to turn off the recommended
Memento headers if desired. Because much of our work focuses on
producing the correct revision of a wiki page in response to a user's datetime
input, we also examine the problem of finding the correct revisions of the
embedded resources, including images, stylesheets, and JavaScript, identifying
the issues and discussing whether MediaWiki must be changed to support this
functionality.
Comment: 23 pages, 18 figures, 9 tables, 17 listings
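As a concrete illustration of the 302-style datetime negotiation discussed above, the following Python sketch (using the requests library) performs the TimeGate step shared by Patterns 1.1 and 2.1: it sends an Accept-Datetime header and follows the redirect to the selected revision. The TimeGate URL shown is a hypothetical example, not necessarily the endpoint exposed by the extension.

```python
import requests

# Hypothetical TimeGate URL for a wiki page; the actual endpoint exposed by
# the Memento MediaWiki Extension may differ.
timegate = "https://wiki.example.org/wiki/Special:TimeGate/Main_Page"

# RFC 1123 datetime requested by the client, as defined by the Memento protocol.
headers = {"Accept-Datetime": "Thu, 01 Jan 2015 00:00:00 GMT"}

# In the 302-style patterns, the TimeGate answers with a redirect to the
# memento (URI-M), so following redirects yields the revision closest to the
# requested datetime.
response = requests.get(timegate, headers=headers, allow_redirects=True)

print(response.url)                              # URI of the selected revision
print(response.headers.get("Memento-Datetime"))  # datetime of that revision
```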
To Relive the Web: A Framework for the Transformation and Archival Replay of Web Pages
When replaying an archived web page (known as a memento), the fundamental expectation is that the page should be viewable and function exactly as it did at archival time. However, this expectation requires web archives to modify the page and its embedded resources so that they no longer reference (link to) the original server(s) they were archived from but instead reference the archive. Although these modifications necessarily change the state of the representation, it is understood that without them the replay of mementos from the archive would not be possible. Unfortunately, because the replay of mementos and the modifications made to them by web archives in order to facilitate replay vary between archives, there is no established terminology for describing replay or the modifications made to mementos to facilitate it. In this thesis, we propose terminology for describing the existing styles of replay and the modifications made by web archives to mementos in order to facilitate replay. In the process of defining terminology for the modifications that client-side rewriting libraries make to the browser's JavaScript execution environment during replay, this thesis also proposes a general framework for the auto-generation of client-side rewriting libraries. Finally, we evaluate the effectiveness of using a generated client-side rewriting library to augment the existing replay systems of web archives by crawling mementos replayed from the Internet Archive’s Wayback Machine with and without the generated client-side rewriter. Using the generated client-side rewriter, we were able to decrease the cumulative number of requests blocked by the content security policy of the Wayback Machine for 577 mementos by 87.5% and increase the cumulative number of requests made by 32.8%. The generated client-side rewriter also allowed us to replay mementos that were previously not replayable from the Internet Archive.
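To make the kind of modification described above concrete, here is a minimal Python sketch of archive-side rewriting: embedded-resource references are resolved and prefixed so that replay requests go back to the archive rather than to the original servers. The archive prefix, the use of BeautifulSoup, and the set of rewritten attributes are assumptions for illustration, not the framework proposed in the thesis.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup  # third-party; assumed available for this sketch

# Hypothetical replay prefix of a web archive; real archives encode the
# archival datetime (e.g., /web/20160721120000/) in this path.
ARCHIVE_PREFIX = "https://archive.example.org/web/20160721120000/"

def rewrite_memento(html: str, original_url: str) -> str:
    """Rewrite links and embedded resources to reference the archive."""
    soup = BeautifulSoup(html, "html.parser")
    for tag, attr in (("a", "href"), ("img", "src"),
                      ("link", "href"), ("script", "src")):
        for element in soup.find_all(tag):
            target = element.get(attr)
            if not target or target.startswith(("data:", "#", "javascript:")):
                continue
            # Resolve relative references against the original URL, then
            # prefix them so replay requests go back to the archive.
            absolute = urljoin(original_url, target)
            element[attr] = ARCHIVE_PREFIX + absolute
    return str(soup)
```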
Right HTML, Wrong JSON: Challenges in Replaying Archived Webpages Built with Client-Side Rendering
Many web sites are transitioning how they construct their pages. In the
conventional model, the content is embedded server-side in the HTML and
returned to the client in an HTTP response. Increasingly, sites are moving to a
model where the initial HTTP response contains only an HTML skeleton plus
JavaScript that makes API calls to a variety of servers for the content
(typically in JSON format), and then builds out the DOM client-side, which more
easily allows the content of a page to be periodically refreshed and
dynamically modified. This client-side rendering, now
predominant in social media platforms such as Twitter and Instagram, is also
being adopted by news outlets, such as CNN.com. When conventional web archiving
techniques, such as crawling with Heritrix, are applied to pages that render
their content client-side, the JSON responses can become out of sync with the
HTML page in which they are to be embedded, resulting in temporal violations on
replay. Because the violative JSON is not directly observable in the page
(i.e., in the same manner a violative embedded image is), the temporal
violations can be difficult to detect. We describe how the top-level CNN.com
page has used client-side rendering since April 2015 and the impact this has
had on web archives. Between April 24, 2015 and July 21, 2016, we found almost
15,000 mementos with a temporal violation of more than 2 days between the base
CNN.com HTML and the JSON responses used to deliver the content under the main
story. One way to mitigate this problem is to use browser-based crawling
instead of conventional crawlers like Heritrix, but browser-based crawling is
currently much slower than non-browser-based tools such as Heritrix.
Comment: 20 pages, preprint version of paper accepted at the 2023 ACM/IEEE
Joint Conference on Digital Libraries (JCDL)
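A small Python sketch (an illustrative assumption, not the paper's measurement code) of how such temporal violations can be detected: compare the Memento-Datetime of the base HTML memento with the datetimes of the JSON responses used to fill it, and flag spreads beyond a threshold such as the 2 days used above.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime

# Hypothetical Memento-Datetime header values captured at replay time; in
# practice these come from the responses for the base HTML and its JSON calls.
base_html_datetime = parsedate_to_datetime("Fri, 24 Apr 2015 12:00:00 GMT")
json_datetimes = [
    parsedate_to_datetime("Tue, 28 Apr 2015 03:15:00 GMT"),
    parsedate_to_datetime("Fri, 24 Apr 2015 12:05:00 GMT"),
]

THRESHOLD = timedelta(days=2)  # the 2-day spread used in the study above

for dt in json_datetimes:
    delta = abs(dt - base_html_datetime)
    if delta > THRESHOLD:
        print(f"temporal violation: JSON memento differs by {delta}")
```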
Assessing the Quality of Web Archives
PDF of a PowerPoint presentation from the 2014 Digital Preservation Meeting, Washington, D.C., July 22-23, 2014. Also available from Slideshare.
https://digitalcommons.odu.edu/computerscience_presentations/1010/thumbnail.jp
Hashes Are Not Suitable to Verify Fixity of the Public Archived Web
Web archives, such as the Internet Archive, preserve the web and allow access to prior states of web pages. We implicitly trust their versions of archived pages, but as their role moves from preserving curios of the past to facilitating present-day adjudication, we are concerned with verifying the fixity of archived web pages, or mementos, to ensure they have always remained unaltered. A widely used technique in digital preservation to verify the fixity of an archived resource is to periodically compute a cryptographic hash value on a resource and then compare it with a previous hash value. If the hash values generated on the same resource are identical, then the fixity of the resource is verified. We tested this process by conducting a study on 16,627 mementos from 17 public web archives. We replayed and downloaded the mementos 39 times using a headless browser over a period of 442 days and generated a hash for each memento after each download, resulting in 39 hashes per memento. The hash is calculated by including not only the content of the base HTML of a memento but also all embedded resources, such as images and style sheets. We expected to always observe the same hash for a memento regardless of the number of downloads. However, our results indicate that 88.45% of mementos produce more than one unique hash value, and about 16% (or one in six) of those mementos always produce different hash values. We identify and quantify the types of changes that cause the same memento to produce different hashes. These results point to the need for defining an archive-aware hashing function, as conventional hashing functions are not suitable for replayed archived web pages.
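The fixity-verification process the study exercises can be sketched in Python as follows; the exact way the base HTML and embedded resources are combined before hashing is an assumption for illustration, not the paper's procedure.

```python
import hashlib
import requests

def memento_hash(memento_url: str, resource_urls: list[str]) -> str:
    """Aggregate SHA-256 over the base HTML and its embedded resources.

    A simplified sketch: download the memento and each embedded resource,
    feed the bytes into one hash, and return the hex digest. Replay-time
    variation in any response changes the resulting value.
    """
    digest = hashlib.sha256()
    digest.update(requests.get(memento_url).content)   # base HTML
    for url in sorted(resource_urls):                   # stable ordering
        digest.update(requests.get(url).content)        # images, CSS, etc.
    return digest.hexdigest()
```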