
    Impact of HTTP Cookie Violations in Web Archives

    HTTP cookies on some sites can be a source of content bias in archival crawls. Accommodating cookies at crawl time but not utilizing them at replay time can cause cookie violations, resulting in defaced composite mementos that never existed on the live web. To address these issues, we propose that crawlers store cookies with short expiration times and that archival replay systems account for values in the Vary header along with URIs.
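
    The proposal amounts to content negotiation during replay: instead of keying captures only by URI, the replay system also compares the request header values named in each capture's Vary header. The following is a minimal sketch of that selection logic using hypothetical data structures (Capture, stored request headers); it is not code from the paper.

        from dataclasses import dataclass, field

        @dataclass
        class Capture:
            """A hypothetical archived capture: response body plus headers seen at crawl time."""
            uri: str
            vary: list[str]                      # header names from the response's Vary header
            request_headers: dict = field(default_factory=dict)
            body: bytes = b""

        def select_capture(captures, uri, incoming_headers):
            """Pick the capture whose Vary-named request headers match the incoming request."""
            incoming = {k.lower(): v for k, v in incoming_headers.items()}
            candidates = [c for c in captures if c.uri == uri]
            for capture in candidates:
                stored = {k.lower(): v for k, v in capture.request_headers.items()}
                # Match only if every header listed in Vary has the same value it had
                # when the crawler made the original request (e.g. Cookie, Accept-Language).
                if all(incoming.get(h.lower()) == stored.get(h.lower()) for h in capture.vary):
                    return capture
            # Fall back to any capture for the URI rather than returning nothing.
            return candidates[0] if candidates else None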

    Polish Web resources described in the "Polish World" directory (1997). Characteristics of domains and their conservation state

    For the purposes of this study, the print version of the Polish World directory by Martin Miszczak (Helion, 1997) was used to create an index of historical URLs and verify their current availability and presence in Web archives. A quantitative analysis of the index was prepared to obtain rank data on top-level domains (TLDs) and subdomains, while the language of pages published in domains other than .PL was also examined. This study uncovered a low current availability (21.77 per cent) of Polish World URIs, with a 79.6 per cent presence in Web archives (60.35 per cent for addresses unreachable today). Forty-six per cent of the addresses from the directory were available on domains other than .PL, of which only 15.36 per cent had content in Polish. It would seem that in 1997, Polish Internet users were able to use Polish-centric resources, mostly already available through the Polish country domain. The 180 domain names with the .PL suffix uncovered during the study constitute around 20 per cent of the .PL domain names active on the Web until at least the end of 1996.

    A Framework for Aggregating Private and Public Web Archives

    Personal and private Web archives are proliferating due to the increase in tools to create them and the realization that the Internet Archive and other public Web archives are unable to capture personalized (e.g., Facebook) and private (e.g., banking) Web pages. We introduce a framework to mitigate issues of aggregation across private, personal, and public Web archives without compromising potentially sensitive information contained in private captures. We amend Memento syntax and semantics to allow TimeMap enrichment, so that additional attributes can be expressed, including those required for dereferencing private Web archive captures. We provide a method to involve the user further in the negotiation of archival captures in dimensions beyond time. We introduce a model for archival querying precedence and short-circuiting, as needed when aggregating private and personal Web archive captures with those from public Web archives through Memento. Negotiation of this sort is novel to Web archiving and allows for the more seamless aggregation of various types of Web archives to convey a more accurate picture of the past Web. Comment: Preprint version of the ACM/IEEE Joint Conference on Digital Libraries (JCDL 2018) full paper, accessible at the DOI.
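
    The querying-precedence idea can be pictured as an aggregator that consults private and personal archives before public ones and short-circuits as soon as a capture is found, so that queries do not reach a public archive unnecessarily. The sketch below is a minimal illustration under assumed endpoint URLs and a simplified response check; it does not reproduce the paper's amended Memento syntax.

        import requests

        # Hypothetical archive endpoints, ordered by precedence:
        # private first, then personal, then public.
        ARCHIVES = [
            ("private",  "https://private.archive.example/timemap/link/"),
            ("personal", "https://personal.archive.example/timemap/link/"),
            ("public",   "https://web.archive.org/web/timemap/link/"),
        ]

        def aggregate(uri_r, short_circuit=True):
            """Query archives in precedence order; optionally stop at the first hit."""
            results = []
            for tier, endpoint in ARCHIVES:
                try:
                    resp = requests.get(endpoint + uri_r, timeout=10)
                except requests.RequestException:
                    continue                      # unreachable archive: move on
                if resp.ok and resp.text.strip():
                    results.append((tier, resp.text))
                    if short_circuit:
                        # Stop before touching lower-precedence (public) archives.
                        break
            return results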

    Lost in the Infinite Archive: The Promise and Pitfalls of Web Archives

    The version of record can be found at http://www.euppublishing.com/doi/10.3366/ijhac.2016.0161. Contemporary and future historians need to grapple with and confront the challenges posed by web archives. These large collections of material, accessed either through the Internet Archive's Wayback Machine or through other computational methods, represent both a challenge and an opportunity for historians. Through these collections, we have the potential to access the voices of millions of non-elite individuals (recognizing, of course, the cleavages in both Web access and method of access). To put this in perspective, the Old Bailey Online currently describes its monumental holdings of 197,745 trials between 1674 and 1913 as the "largest body of texts detailing the lives of non-elite people ever published." GeoCities.com, a platform for everyday web publishing in the mid-to-late 1990s and early 2000s, amounted to over thirty-eight million individual webpages. Historians will have access, in some form, to millions of pages written by everyday people of various classes, genders, ethnicities, and ages. While the Web was not a perfect democracy by any means – it was and is unevenly accessed across each of those categories – this still represents a massive collection of non-elite speech. Yet a figure like thirty-eight million webpages is both a blessing and a curse. We cannot read every website, and must instead rely upon discovery tools to find the information that we need. Yet these tools largely do not exist for web archives, or are in a very early state of development: what will they look like? What information do historians want to access? We cannot simply map over web tools optimized for discovering current information through online searches or metadata analysis. We need to find information that mattered at the time, to diverse and very large communities. Furthermore, web pages cannot be viewed in isolation, outside of the networks that they inhabited. In theory, amongst corpora of millions of pages, researchers can find whatever they want to confirm. The trick is situating it in a larger social and cultural context: is it representative? Unique? In this paper, "Lost in the Infinite Archive," I explore what the future of digital methods for historians will be when they need to explore web archives. Historical research of periods beginning in the mid-1990s will need to use web archives, and right now we are not ready. This article draws on first-hand research with the Internet Archive and Archive-It web archiving teams. It draws upon three exhaustive datasets: the large Web ARChive (WARC) files that make up wide scrapes of the Web; the metadata-intensive WAT files that provide networked contextual information; and the lifted-straight-from-the-web guerilla archives generated by groups like Archive Team. Through these case studies, we can see – hands-on – what richness and potential lie in these new cultural records, and what approaches we may need to adopt. It helps underscore the need to have humanists involved at this early, crucial stage. Social Sciences and Humanities Research Council || 430-2013-0616; Ontario Early Researcher Award.
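
    For readers unfamiliar with the datasets mentioned above, a WARC file is simply a sequence of records (requests, responses, metadata) that can be iterated programmatically. A minimal sketch with the open-source warcio library is shown below; the file name is a placeholder, and this is not the tooling used in the article itself.

        from warcio.archiveiterator import ArchiveIterator

        # Iterate every archived HTTP response in a (hypothetical) WARC file
        # and print when and from where it was captured.
        with open('crawl.warc.gz', 'rb') as stream:
            for record in ArchiveIterator(stream):
                if record.rec_type == 'response':
                    uri = record.rec_headers.get_header('WARC-Target-URI')
                    date = record.rec_headers.get_header('WARC-Date')
                    print(date, uri)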

    Rewriting History: Changing the Archived Web from the Present

    The Internet Archive’s Wayback Machine is the largest modern web archive, preserving web content since 1996. We discover and analyze several vulnerabilities in how the Wayback Machine archives data, and then leverage these vulnerabilities to create what are, to our knowledge, the first attacks against a user’s view of the archived web. Our vulnerabilities are enabled by the unique interaction between the Wayback Machine’s archives, other websites, and a user’s browser, and attackers do not need to compromise the archives in order to compromise users’ views of a stored page. We demonstrate the effectiveness of our attacks through proof-of-concept implementations. Then, we conduct a measurement study to quantify the prevalence of vulnerabilities in the archive. Finally, we explore defenses that might be deployed by archives, website publishers, and the users of archives, and present the prototype of a defense for clients of the Wayback Machine, ArchiveWatcher.

    Right HTML, Wrong JSON: Challenges in Replaying Archived Webpages Built with Client-Side Rendering

    Many web sites are transitioning how they construct their pages. In the conventional model, the content is embedded server-side in the HTML and returned to the client in an HTTP response. Increasingly, sites are moving to a model where the initial HTTP response contains only an HTML skeleton plus JavaScript that makes API calls to a variety of servers for the content (typically in JSON format) and then builds out the DOM client-side, more easily allowing the content in a page to be periodically refreshed and dynamically modified. This client-side rendering, now predominant in social media platforms such as Twitter and Instagram, is also being adopted by news outlets, such as CNN.com. When conventional web archiving techniques, such as crawling with Heritrix, are applied to pages that render their content client-side, the JSON responses can become out of sync with the HTML page in which they are to be embedded, resulting in temporal violations on replay. Because the violative JSON is not directly observable in the page (i.e., in the same manner a violative embedded image is), these temporal violations can be difficult to detect. We describe how the top-level CNN.com page has used client-side rendering since April 2015 and the impact this has had on web archives. Between April 24, 2015 and July 21, 2016, we found almost 15,000 mementos with a temporal violation of more than 2 days between the base CNN.com HTML and the JSON responses used to deliver the content under the main story. One way to mitigate this problem is to use browser-based crawling instead of conventional crawlers like Heritrix, but browser-based crawling is currently much slower than non-browser-based tools such as Heritrix. Comment: 20 pages, preprint version of paper accepted at the 2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL).
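
    The drift described here can be measured from replay metadata: Memento-compliant archives report a Memento-Datetime header for each capture, so the capture time of the base HTML can be compared with that of a JSON resource the page loads. The sketch below only illustrates that header comparison under assumed, illustrative archived URLs; identifying which JSON resources a page actually loads (the hard part addressed in the paper) is out of scope here.

        from datetime import datetime, timezone
        import requests

        FMT = "%a, %d %b %Y %H:%M:%S GMT"   # RFC 1123 format used by Memento-Datetime

        def memento_datetime(archived_url):
            """Return the Memento-Datetime of an archived capture, if the archive reports one."""
            resp = requests.head(archived_url, allow_redirects=True, timeout=10)
            value = resp.headers.get("Memento-Datetime")
            return datetime.strptime(value, FMT).replace(tzinfo=timezone.utc) if value else None

        def temporal_drift_days(html_memento, json_memento):
            """Days between the base HTML capture and a JSON resource it embeds on replay."""
            t_html, t_json = memento_datetime(html_memento), memento_datetime(json_memento)
            if t_html is None or t_json is None:
                return None
            return abs((t_json - t_html).total_seconds()) / 86400

        # Hypothetical usage, flagging drift beyond the two-day threshold discussed above:
        # drift = temporal_drift_days(
        #     "https://web.archive.org/web/20150424000000/https://www.cnn.com/",
        #     "https://web.archive.org/web/20150610000000/https://data.cnn.com/story.json")
        # if drift and drift > 2:
        #     print(f"temporal violation: {drift:.1f} days")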