Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool
Conventional Web archives are created by periodically crawling a web site and
archiving the responses from the Web server. Although easy to implement and
commonly deployed, this form of archiving typically misses updates and may not be
suitable for all preservation scenarios, for example a site that is required
(perhaps for records compliance) to keep a copy of all pages it has served. In
contrast, transactional archives work in conjunction with a Web server to
record all pages that have been served. Los Alamos National Laboratory has
developed SiteStory, an open-source transactional archiving solution written in
Java that runs on Apache Web servers, provides a Memento-compatible access
interface, and offers WARC file export features. We used the ApacheBench utility
on a pre-release version of SiteStory to measure response time and content delivery time in
different environments and on different machines. The performance tests were
designed to determine the feasibility of SiteStory as a production-level
solution for high fidelity automatic Web archiving. We found that SiteStory
does not significantly affect content server performance when it is performing
transactional archiving. Content server performance slows from 0.076 seconds to
0.086 seconds per Web page access when the content server is under load, and
from 0.15 seconds to 0.21 seconds when the resource has many embedded and
changing resources.

Comment: 13 pages, Technical Report
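The abstract's numbers imply a modest archiving overhead, which is easy to quantify. The sketch below shows a typical ApacheBench invocation of the kind the study describes (the host is a placeholder, and the exact request counts used in the paper are not stated), followed by the relative slowdown computed from the reported timings:

```shell
# Hypothetical load test against a server running SiteStory's Apache module:
# 1000 requests at concurrency 10 (host and path are placeholders).
# ab -n 1000 -c 10 http://example.com/index.html

# Relative overhead implied by the reported timings:
awk 'BEGIN { printf "%.0f%% overhead under load\n", (0.086-0.076)/0.076*100 }'
awk 'BEGIN { printf "%.0f%% overhead for resource-heavy pages\n", (0.21-0.15)/0.15*100 }'
```

That is roughly a 13% and 40% slowdown respectively, consistent with the paper's conclusion that archiving overhead stays within acceptable bounds for production use.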
Scripts in a Frame: A Framework for Archiving Deferred Representations
Web archives provide a view of the Web as seen by Web crawlers. Because of rapid advancements and adoption of client-side technologies like JavaScript and Ajax, coupled with the inability of crawlers to execute these technologies effectively, Web resources become harder to archive as they become more interactive. At Web scale, we cannot capture client-side representations using the current state-of-the art toolsets because of the migration from Web pages to Web applications. Web applications increasingly rely on JavaScript and other client-side programming languages to load embedded resources and change client-side state. We demonstrate that Web crawlers and other automatic archival tools are unable to archive the resulting JavaScript-dependent representations (what we term deferred representations), resulting in missing or incorrect content in the archives and the general inability to replay the archived resource as it existed at the time of capture.
Building on prior studies on Web archiving, client-side monitoring of events and embedded resources, and studies of the Web, we establish an understanding of the trends contributing to the increasing unarchivability of deferred representations. We show that JavaScript leads to lower-quality mementos (archived Web resources) due to the archival difficulties it introduces. We measure the historical impact of JavaScript on mementos, demonstrating that the increased adoption of JavaScript and Ajax correlates with the increase in missing embedded resources. To measure memento and archive quality, we propose and evaluate a metric to assess memento quality closer to Web users’ perception.
We propose a two-tiered crawling approach that enables crawlers to capture embedded resources dependent upon JavaScript. Measuring the performance benefits between crawl approaches, we propose a classification method that mitigates the performance impacts of the two-tiered crawling approach, and we measure the frontier size improvements observed with the two-tiered approach. Using the two-tiered crawling approach, we measure the number of client-side states associated with each URI-R and propose a mechanism for storing the mementos of deferred representations.
In short, this dissertation details a body of work that explores the following: why JavaScript and deferred representations are difficult to archive (establishing the term deferred representation to describe JavaScript dependent representations); the extent to which JavaScript impacts archivability along with its impact on current archival tools; a metric for measuring the quality of mementos, which we use to describe the impact of JavaScript on archival quality; the performance trade-offs between traditional archival tools and technologies that better archive JavaScript; and a two-tiered crawling approach for discovering and archiving currently unarchivable descendants (representations generated by client-side user events) of deferred representations to mitigate the impact of JavaScript on our archives.
In summary, what we archive is increasingly different from what we as interactive users experience. Using the approaches detailed in this dissertation, archives can create mementos closer to what users experience rather than archiving the crawlers' experiences on the Web.
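The dissertation's full quality metric is not specified in this abstract, but its core intuition (JavaScript-dependent pages leave embedded resources missing from the archive) can be sketched as a simple missing-resource ratio. The function name and weighting here are illustrative, not the dissertation's actual metric, which additionally weights resources by their importance to the user's perception of the page:

```python
def missing_resource_ratio(live_uris, archived_uris):
    """Fraction of a page's embedded resources absent from the archive.

    A simplistic proxy for memento quality: 0.0 means every embedded
    resource was captured, 1.0 means none were.
    """
    expected = set(live_uris)
    if not expected:
        return 0.0
    missing = expected - set(archived_uris)
    return len(missing) / len(expected)

# A JavaScript-heavy page whose Ajax-loaded images never reached the crawler:
live = ["style.css", "app.js", "img1.png", "img2.png"]
archived = ["style.css", "app.js"]
print(missing_resource_ratio(live, archived))  # 0.5
```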
Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript
PDF of a PowerPoint presentation from the International Internet Preservation Consortium (IIPC) 2016 Conference in Reykjavik, Iceland, April 11, 2016. Also available on Slideshare.
A Method for Identifying Personalized Representations in Web Archives
Web resources are becoming increasingly personalized: two different users clicking on the same link at the same time can see content customized for each individual user. These changes result in multiple representations of a resource that cannot be canonicalized in Web archives. We identify characteristics of this problem and present a potential solution to generalize personalized representations in archives. We also present our proof-of-concept prototype that analyzes WARC (Web ARChive) format files, inserts metadata establishing relationships, and provides archive users the ability to navigate on the additional dimension of environment variables in a modified Wayback Machine.
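The navigation idea above amounts to indexing captures not only by URI and datetime but by the request environment that produced them. A minimal sketch, assuming illustrative field names rather than the prototype's actual WARC metadata schema:

```python
from collections import defaultdict

def index_representations(records):
    """Group archived captures by (URI, environment) rather than URI alone.

    Each record is a dict; the keys used here ("uri", "user-agent",
    "accept-language", "payload") are hypothetical placeholders for the
    metadata the prototype inserts into WARC files.
    """
    index = defaultdict(list)
    for rec in records:
        key = (rec["uri"],
               rec.get("user-agent", "*"),
               rec.get("accept-language", "*"))
        index[key].append(rec["payload"])
    return index

records = [
    {"uri": "http://example.com/", "user-agent": "mobile", "payload": "<mobile html>"},
    {"uri": "http://example.com/", "user-agent": "desktop", "payload": "<desktop html>"},
]
idx = index_representations(records)
print(len(idx))  # 2 -- both personalized representations survive, uncollapsed
```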
Child relationships in the middle grades
Thesis (Ed.M.)--Boston University
Scraping SERPs for Archival Seeds: It Matters When You Start
Event-based collections are often started with a web search, but the search
results you find on Day 1 may not be the same as those you find on Day 7. In
this paper, we consider collections that originate from extracting URIs
(Uniform Resource Identifiers) from Search Engine Result Pages (SERPs).
Specifically, we seek to provide insight about the retrievability of URIs of
news stories found on Google, and to answer two main questions: first, can one
"refind" the same URI of a news story (for the same query) from Google after a
given time? Second, what is the probability of finding a story on Google over a
given period of time? To answer these questions, we issued seven queries to
Google every day for over seven months (2017-05-25 to 2018-01-12) and collected
links from the first five SERPs to generate seven collections for each query.
The queries represent public interest stories: "healthcare bill," "manchester
bombing," "london terrorism," "trump russia," "travel ban," "hurricane harvey,"
and "hurricane irma." We tracked each URI in all collections over time to
estimate the discoverability of URIs from the first five SERPs. Our results
showed that the average rate at which stories were replaced on the
default Google SERP ranged from 0.21 to 0.54 per day, and from 0.39 to 0.79 per
week, suggesting the fast replacement of older stories by newer stories. The
probability of finding the same URI of a news story one day after its
initial appearance on the SERP ranged from 0.34 to 0.44. After a week, the
probability of finding the same news stories diminishes rapidly to 0.01 to 0.11.
Our findings suggest that due to the difficulty in retrieving the URIs of news
stories from Google, collection building that originates from search engines
should begin as soon as possible in order to capture the first stages of
events, and should persist in order to capture the evolution of the events.

Comment: This is an extended version of the ACM/IEEE Joint Conference on
Digital Libraries (JCDL 2018) full paper:
https://doi.org/10.1145/3197026.3197056. Some of the figure numbers have
changed.
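The refinding measurement above reduces to set arithmetic over the URIs observed on each day's SERPs. A minimal sketch with synthetic data (the paper's actual collections span seven queries over seven months):

```python
def refind_probability(daily_serps, k):
    """Probability that a URI from the Day-0 SERP set is still present k days later.

    daily_serps is a list of sets of URIs, one set per day of scraping.
    """
    day0 = daily_serps[0]
    if not day0 or k >= len(daily_serps):
        return 0.0
    return len(day0 & daily_serps[k]) / len(day0)

serps = [
    {"u1", "u2", "u3", "u4"},  # day 0: initial collection seeds
    {"u1", "u2", "u5", "u6"},  # day 1: half the stories already replaced
    {"u1", "u7", "u8", "u9"},  # day 2: only one original story remains
]
print(refind_probability(serps, 1))  # 0.5
print(refind_probability(serps, 2))  # 0.25
```

Declining values of this quantity over k are exactly why the paper recommends starting collection building as early as possible.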