
    Evaluating the SiteStory Transactional Web Archive With the ApacheBench Tool

    Conventional Web archives are created by periodically crawling a web site and archiving the responses from the Web server. Although easy to implement and commonly deployed, this form of archiving typically misses updates and may not be suitable for all preservation scenarios, for example a site that is required (perhaps for records compliance) to keep a copy of all pages it has served. In contrast, transactional archives work in conjunction with a Web server to record all pages that have been served. Los Alamos National Laboratory has developed SiteStory, an open-source transactional archive written in Java that runs on Apache Web servers, provides a Memento-compatible access interface, and offers WARC file export features. We used the ApacheBench utility on a pre-release version of SiteStory to measure response time and content delivery time in different environments and on different machines. The performance tests were designed to determine the feasibility of SiteStory as a production-level solution for high-fidelity automatic Web archiving. We found that SiteStory does not significantly affect content server performance when it is performing transactional archiving. Content server performance slows from 0.076 seconds to 0.086 seconds per Web page access when the content server is under load, and from 0.15 seconds to 0.21 seconds when the resource has many embedded and changing resources. Comment: 13 pages, Technical Report
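
    A rough sense of how such measurements are made can be given with a short script. The sketch below is not the paper's actual harness: it assumes the ApacheBench tool (ab) is installed, the hostnames are placeholders, and the parsing relies on ab's standard "Time per request" report line.

        # Measure mean response time with ApacheBench and compare two servers.
        # Hostnames are placeholders; -n and -c set request count and concurrency.
        import re
        import subprocess

        def mean_time_per_request(url, requests=1000, concurrency=10):
            """Run ApacheBench against url; return mean time per request (ms)."""
            out = subprocess.run(
                ["ab", "-n", str(requests), "-c", str(concurrency), url],
                capture_output=True, text=True, check=True,
            ).stdout
            # ab reports e.g. "Time per request:  86.1 [ms] (mean)"
            return float(re.search(
                r"Time per request:\s+([\d.]+)\s+\[ms\]\s+\(mean\)", out).group(1))

        # Compare a server with and without transactional archiving enabled.
        baseline = mean_time_per_request("http://plain.example.org/page.html")
        archiving = mean_time_per_request("http://sitestory.example.org/page.html")
        print(f"archiving overhead: {archiving - baseline:.1f} ms per request")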

    Scripts in a Frame: A Framework for Archiving Deferred Representations

    Web archives provide a view of the Web as seen by Web crawlers. Because of rapid advancements and adoption of client-side technologies like JavaScript and Ajax, coupled with the inability of crawlers to execute these technologies effectively, Web resources become harder to archive as they become more interactive. At Web scale, we cannot capture client-side representations using the current state-of-the-art toolsets because of the migration from Web pages to Web applications. Web applications increasingly rely on JavaScript and other client-side programming languages to load embedded resources and change client-side state. We demonstrate that Web crawlers and other automatic archival tools are unable to archive the resulting JavaScript-dependent representations (what we term deferred representations), resulting in missing or incorrect content in the archives and the general inability to replay the archived resource as it existed at the time of capture.

    Building on prior studies of Web archiving, client-side monitoring of events and embedded resources, and studies of the Web, we establish an understanding of the trends contributing to the increasing unarchivability of deferred representations. We show that JavaScript leads to lower-quality mementos (archived Web resources) due to the archival difficulties it introduces. We measure the historical impact of JavaScript on mementos, demonstrating that the increased adoption of JavaScript and Ajax correlates with the increase in missing embedded resources. To measure memento and archive quality, we propose and evaluate a metric that assesses memento quality closer to Web users’ perception. We propose a two-tiered crawling approach that enables crawlers to capture embedded resources dependent upon JavaScript. Measuring the performance differences between crawl approaches, we propose a classification method that mitigates the performance impacts of the two-tiered crawling approach, and we measure the frontier size improvements observed with the two-tiered approach. Using the two-tiered crawling approach, we measure the number of client-side states associated with each URI-R and propose a mechanism for storing the mementos of deferred representations.

    In short, this dissertation explores the following: why JavaScript and deferred representations are difficult to archive (establishing the term deferred representations to describe JavaScript-dependent representations); the extent to which JavaScript impacts archivability, along with its impact on current archival tools; a metric for measuring the quality of mementos, which we use to describe the impact of JavaScript on archival quality; the performance trade-offs between traditional archival tools and technologies that better archive JavaScript; and a two-tiered crawling approach for discovering and archiving currently unarchivable descendants (representations generated by client-side user events) of deferred representations to mitigate the impact of JavaScript on our archives. In summary, what we archive is increasingly different from what we as interactive users experience. Using the approaches detailed in this dissertation, archives can create mementos closer to what users experience rather than archiving the crawlers’ experiences on the Web.
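
    To make the two-tiered idea concrete, here is a minimal sketch of how a crawler might decide whether a page is a deferred representation worth routing to the slower, JavaScript-executing tier. It assumes the requests, beautifulsoup4, and selenium (headless Chrome) packages; the 1.5x threshold and the function names are illustrative assumptions, not the dissertation's classifier.

        # Classify a URI by comparing resources visible to a static crawler
        # against resources actually loaded once client-side JavaScript runs.
        import requests
        from bs4 import BeautifulSoup
        from selenium import webdriver
        from selenium.webdriver.chrome.options import Options

        def static_resource_count(url):
            """Embedded resources a non-executing crawler can see in the HTML."""
            soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
            return len(soup.find_all(["img", "script", "link", "iframe"]))

        def rendered_resource_count(url):
            """Resources fetched by a real browser after JavaScript executes."""
            opts = Options()
            opts.add_argument("--headless=new")
            driver = webdriver.Chrome(options=opts)
            try:
                driver.get(url)
                return driver.execute_script(
                    "return performance.getEntriesByType('resource').length")
            finally:
                driver.quit()

        def is_deferred(url, ratio=1.5):
            """Route to the rendering tier only when JavaScript loads
            substantially more resources than static parsing reveals."""
            return rendered_resource_count(url) > ratio * static_resource_count(url)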

    Mainland Southeast Asia

    The languages of Mainland Southeast Asia belong to five language phyla, yet they are often claimed to constitute a linguistic area. This chapter’s primary goal is to illustrate the areal features found in their prosodic systems while emphasizing their understated diversity. The first part of the chapter addresses the typology of word-level prosody. It describes common word shapes and stress patterns in the region, discusses tone inventories, and argues that beyond pitch, properties such as phonation and duration frequently play a role in patterns of tonal contrasts. The chapter next shows that complex tone alternations, although not typical, are attested in the area. The following section reviews evidence about prosodic phrasing in the area, discusses the substantial body of knowledge about intonation, and reconsiders the question of intonation in languages with complex tone paradigms and pervasive final particles. The chapter concludes with strategies for marking information structure and focus.

    Combining Heritrix and PhantomJS for Better Crawling of Pages with Javascript

    PDF of a PowerPoint presentation from the International Internet Preservation Consortium (IIPC) 2016 Conference in Reykjavik, Iceland, April 11, 2016. Also available on SlideShare.

    Long-Term Landscape Changes in a Subalpine Spruce-Fir forest in Central Utah, USA

    Background: In Western North America, increasing wildfire and outbreaks of native bark beetles have been mediated by warming climate conditions. Bioclimatic models forecast the loss of key high-elevation species throughout the region. This study uses retrospective vegetation and fire history data to reconstruct the drivers of past disturbance and environmental change. Understanding the relationship among climate, antecedent disturbances, and the legacy effects of settlement-era logging can help identify the patterns and processes that create landscapes susceptible to bark beetle epidemics.

    Methods: Our analysis uses data from lake sediment cores, stand inventories, and historical records. Sediment cores were dated with radiometric techniques (14C and 210Pb/137Cs) and subsampled for pollen and charcoal to maximize the temporal resolution during the historical period (1800 CE to present) and to provide environmental baseline data (last 10,500 years). Pollen data for spruce were calibrated to carbon biomass (t C/ha) using standard allometric equations and a transfer function. Charcoal samples were analyzed with statistical models to facilitate peak detection and determine fire recurrence intervals.

    Results: The Wasatch Plateau has been dominated by Engelmann spruce forests for the last ~10,500 years, with subalpine fir becoming more prominent since 6000 years ago. This landscape has experienced a dynamic fire regime, where burning events are more frequent and of higher magnitude during the last 3000 years. Two important disturbances have impacted Engelmann spruce in the historical period: (1) high-grade logging during the late 19th century; and (2) a high-severity spruce beetle outbreak in the late 20th century that killed >90% of mature spruce (>10 cm dbh).

    Conclusions: Our study shows that spruce-dominated forests in this region are resilient to a range of climate and disturbance regimes. Several lines of evidence suggest that 19th century logging promoted a legacy of simplified stand structure and composition such that, when climate became favorable for accelerated beetle population growth, the result was a landscape-scale spruce beetle outbreak. The lasting impacts of settlement-era landscape history from the Wasatch Plateau, UT may be relevant for other areas of western North America and Europe where sufficient host carrying capacity is important in managing for resistance and resilience to outbreaks.
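
    The charcoal peak-detection step can be illustrated with a generic background-and-threshold approach of the kind commonly used in fire-history work. The sketch below assumes numpy; the smoothing window and threshold percentile are illustrative assumptions, not the study's fitted values.

        # Flag charcoal samples that rise above a slowly varying background,
        # a common way to separate fire "peaks" from background deposition.
        import numpy as np

        def detect_fire_peaks(charcoal, window=9, pctl=95.0):
            """Return indices where charcoal exceeds the local background
            (running median) by more than the pctl percentile of residuals."""
            pad = window // 2
            padded = np.pad(charcoal, pad, mode="edge")
            background = np.array(
                [np.median(padded[i:i + window]) for i in range(len(charcoal))])
            residual = charcoal - background
            return np.flatnonzero(residual > np.percentile(residual, pctl))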

    A Method for Identifying Personalized Representations in Web Archives

    Web resources are becoming increasingly personalized: two different users clicking on the same link at the same time can see content customized for each individual user. These changes result in multiple representations of a resource that cannot be canonicalized in Web archives. We characterize this problem and present a potential solution for generalizing personalized representations in archives. We also present our proof-of-concept prototype that analyzes WARC (Web ARChive) format files, inserts metadata establishing relationships among them, and gives archive users the ability to navigate along the additional dimension of environment variables in a modified Wayback Machine.
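
    A minimal sketch of the WARC-rewriting step described above, assuming the warcio library; the "variant-of" metadata vocabulary is an illustrative assumption, not the paper's actual schema.

        # Copy a WARC file, appending a metadata record after each response
        # that links the captured representation to a canonical URI.
        from io import BytesIO
        from warcio.archiveiterator import ArchiveIterator
        from warcio.warcwriter import WARCWriter

        def annotate_variants(in_path, out_path, canonical_uri):
            """Rewrite a WARC, inserting metadata records that relate each
            response to the canonical resource it is a personalized variant of."""
            with open(in_path, "rb") as src, open(out_path, "wb") as dst:
                writer = WARCWriter(dst, gzip=True)
                for record in ArchiveIterator(src):
                    writer.write_record(record)
                    if record.rec_type == "response":
                        uri = record.rec_headers.get_header("WARC-Target-URI")
                        payload = BytesIO(
                            f"variant-of: {canonical_uri}\n".encode("utf-8"))
                        meta = writer.create_warc_record(uri, "metadata",
                                                         payload=payload)
                        writer.write_record(meta)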