Profiling Web Archive Coverage for Top-Level Domain and Content Language
The Memento aggregator currently polls every known public web archive when serving a request for an archived web page, even though some web archives focus only on specific domains and ignore the others. Similar to query routing in distributed search, we investigate the impact on aggregated Memento TimeMaps (lists of when and where a web page was archived) of sending queries only to the archives likely to hold the archived page. We profile twelve public web archives using data from a variety of sources (the web, archives' access logs, and full-text queries to archives) and find that sending queries to only the top three web archives (i.e., a 75% reduction in the number of queries) for any request produces the full TimeMap in 84% of cases.
Comment: Appeared in TPDL 201
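A minimal sketch (in Python, with made-up archive names and profile scores rather than the paper's measured profiles) of the kind of query routing described above: rank archives by a precomputed profile score for the request's bucket and query only the top three instead of polling every archive.

```python
# Illustrative profile-based query routing for a Memento aggregator.
# Archive names and scores below are hypothetical, not the paper's data.
from typing import Dict, List

# Hypothetical profile: likelihood that an archive holds mementos for a URI's
# top-level domain / language bucket (built offline from logs, crawls, etc.).
ARCHIVE_PROFILES: Dict[str, Dict[str, float]] = {
    "archive.org":       {"uk": 0.90, "pt": 0.60, "jp": 0.70},
    "webarchive.org.uk": {"uk": 0.85, "pt": 0.05, "jp": 0.02},
    "arquivo.pt":        {"uk": 0.10, "pt": 0.95, "jp": 0.01},
    "archive.today":     {"uk": 0.40, "pt": 0.30, "jp": 0.35},
}

def route_query(bucket: str, k: int = 3) -> List[str]:
    """Return the k archives most likely to hold captures for this bucket."""
    ranked = sorted(ARCHIVE_PROFILES,
                    key=lambda a: ARCHIVE_PROFILES[a].get(bucket, 0.0),
                    reverse=True)
    return ranked[:k]

if __name__ == "__main__":
    # Instead of polling every known archive, query only the top three.
    print(route_query("pt"))
```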
Using semantic indexing to improve searching performance in web archives
The sheer volume of electronic documents being published on the Web can be overwhelming for users if the searching aspect is not properly addressed. This problem is particularly acute inside archives and repositories containing large collections of web resources or, more precisely, web pages and other web objects. Using the existing search capabilities in web archives, results can be compromised because of the size of data, content heterogeneity and changes in scientific terminologies and meanings. During the course of this research, we will explore whether semantic web technologies, particularly ontology-based annotation and retrieval, could improve precision in search results in multi-disciplinary web archives
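A toy sketch of the ontology-based annotation and retrieval idea described above, assuming a hypothetical miniature "ontology" that maps historical and current terminology to a shared concept; it is illustrative only, not the project's actual indexing pipeline.

```python
# Documents are indexed by ontology concepts rather than raw terms, so a query
# still matches pages that use older terminology for the same concept.
ONTOLOGY = {
    "consumption": "tuberculosis",   # historical term -> current concept
    "tuberculosis": "tuberculosis",
    "tb": "tuberculosis",
}

def annotate(text: str) -> set:
    """Map a document's terms onto ontology concepts."""
    return {ONTOLOGY[t] for t in text.lower().split() if t in ONTOLOGY}

docs = {
    "page1": "early reports on consumption in urban areas",
    "page2": "modern tb screening programmes",
}
index = {doc_id: annotate(text) for doc_id, text in docs.items()}

def search(query: str):
    concepts = annotate(query)
    return [doc_id for doc_id, tags in index.items() if concepts & tags]

print(search("tuberculosis"))  # both pages match despite differing terminology
```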
Web archives: the future
This report is structured, first, to engage in some speculative thought about the possible futures of the web, as an exercise in prompting us to think about what we need to do now in order to make sure that we can reliably and fruitfully use archives of the web in the future. Next, we turn to considering the methods and tools being used to research the live web, as a pointer to the types of things that can be developed to help understand the archived web. Then, we turn to a series of topics and questions that researchers want or may want to address using the archived web. In this final section, we identify some of the challenges individuals, organizations, and international bodies can target to increase our ability to explore these topics and answer these questions. We end the report with some conclusions based on what we have learned from this exercise.
A Framework for Aggregating Private and Public Web Archives
Personal and private Web archives are proliferating due to the increase in
the tools to create them and the realization that Internet Archive and other
public Web archives are unable to capture personalized (e.g., Facebook) and
private (e.g., banking) Web pages. We introduce a framework to mitigate issues
of aggregation in private, personal, and public Web archives without
compromising potential sensitive information contained in private captures. We
amend Memento syntax and semantics to allow TimeMap enrichment to account for
additional attributes to be expressed inclusive of the requirements for
dereferencing private Web archive captures. We provide a method to involve the
user further in the negotiation of archival captures in dimensions beyond time.
We introduce a model for archival querying precedence and short-circuiting, as
needed when aggregating private and personal Web archive captures with those
from public Web archives through Memento. Negotiation of this sort is novel to
Web archiving and allows for the more seamless aggregation of various types of
Web archives to convey a more accurate picture of the past Web.Comment: Preprint version of the ACM/IEEE Joint Conference on Digital
Libraries (JCDL 2018) full paper, accessible at the DO
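A hedged sketch of what querying precedence with short-circuiting could look like: private and personal archives are consulted before public ones, and aggregation stops at the first tier that returns captures. The tier layout and the in-memory fetch function are hypothetical stand-ins, not the framework's actual interface.

```python
# Precedence-ordered aggregation with short-circuiting over archive "tiers".
from typing import Callable, Dict, List

def aggregate_timemap(uri: str,
                      tiers: List[List[str]],
                      fetch: Callable[[str, str], List[dict]]) -> List[dict]:
    """Query archive tiers in precedence order; stop at the first non-empty tier."""
    for tier in tiers:
        mementos = []
        for archive in tier:
            mementos.extend(fetch(archive, uri))
        if mementos:            # short-circuit: later (public) tiers are skipped
            return sorted(mementos, key=lambda m: m["datetime"])
    return []

# Example usage with a fake in-memory "fetch".
CAPTURES: Dict[str, List[dict]] = {
    "my-private-archive": [{"uri": "https://bank.example/", "datetime": "20210101"}],
    "archive.org": [],
}
tiers = [["my-private-archive"], ["archive.org"]]   # private before public
print(aggregate_timemap("https://bank.example/", tiers,
                        lambda a, u: [m for m in CAPTURES.get(a, []) if m["uri"] == u]))
```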
How Much of the Web Is Archived?
Although the Internet Archive's Wayback Machine is the largest and most
well-known web archive, there have been a number of public web archives that
have emerged in the last several years. With varying resources, audiences and
collection development policies, these archives have varying levels of overlap
with each other. While individual archives can be measured in terms of number
of URIs, number of copies per URI, and intersection with other archives, to
date there has been no answer to the question "How much of the Web is
archived?" We study the question by approximating the Web using sample URIs
from DMOZ, Delicious, Bitly, and search engine indexes; and, counting the
number of copies of the sample URIs exist in various public web archives. Each
sample set provides its own bias. The results from our sample sets indicate
that range from 35%-90% of the Web has at least one archived copy, 17%-49% has
between 2-5 copies, 1%-8% has 6-10 copies, and 8%-63% has more than 10 copies
in public web archives. The number of URI copies varies as a function of time,
but no more than 31.3% of URIs are archived more than once per month.Comment: This is the long version of the short paper by the same title
published at JCDL'11. 10 pages, 5 figures, 7 tables. Version 2 includes minor
typographical correction
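A small sketch of the copy-count bucketing behind figures like these, assuming per-URI copy counts (in practice, e.g., the length of an aggregated TimeMap) are already available; the sample data below is invented.

```python
# Bucket sampled URIs by how many archived copies they have, then report the
# share of the sample falling into each bucket.
from collections import Counter

def bucket(copies: int) -> str:
    if copies == 0:
        return "unarchived"
    if copies == 1:
        return "1 copy"
    if copies <= 5:
        return "2-5 copies"
    if copies <= 10:
        return "6-10 copies"
    return ">10 copies"

sample_copy_counts = {"uri-a": 0, "uri-b": 1, "uri-c": 4, "uri-d": 27}
tally = Counter(bucket(c) for c in sample_copy_counts.values())
total = len(sample_copy_counts)
for name, count in tally.items():
    print(f"{name}: {100 * count / total:.0f}% of sample")
```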
Linking Mathematical Software in Web Archives
The Web is our primary source of all kinds of information today. This
includes information about software as well as associated materials, like
source code, documentation, related publications and change logs. Such data is
of particular importance in research in order to conduct, comprehend and
reconstruct scientific experiments that involve software. swMATH, a
mathematical software directory, attempts to identify software mentions in
scientific articles and provides additional information as well as links to the
Web. However, just like software itself, the Web is dynamic and most likely the
information on the Web has changed since it was referenced in a scientific
publication. Therefore, it is crucial to preserve the resources of a software
on the Web to capture its states over time.
We found that around 40% of the websites in swMATH are already included in an
existing Web archive. Out of these, 60% of contain some kind of documentation
and around 45% even provide downloads of software artifacts. Hence, already
today links can be established based on the publication dates of corresponding
articles. The contained data enable enriching existing information with a
temporal dimension. In the future, specialized infrastructure will improve the
coverage of software resources and allow explicit references in scientific
publications.Comment: ICMS 2016, Berlin, German
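A minimal sketch of how archive coverage could be checked for a single software URL, using the Internet Archive's Wayback Machine availability endpoint and a publication-date timestamp; the URL and date are placeholders, and this is not the swMATH pipeline itself.

```python
# Look up the archived snapshot closest to a given date for a software homepage.
import requests

def closest_capture(url: str, timestamp: str):
    """Return the snapshot URL closest to `timestamp` (YYYYMMDD), or None."""
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": url, "timestamp": timestamp},
                        timeout=30)
    resp.raise_for_status()
    closest = resp.json().get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

if __name__ == "__main__":
    # e.g. link a 2014 article's software reference to a 2014-era capture
    print(closest_capture("https://www.gnu.org/software/octave/", "20140101"))
```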
The Archives Unleashed Project: Technology, Process, and Community to Improve Scholarly Access to Web Archives
The Archives Unleashed project aims to improve scholarly access to web archives through a multi-pronged strategy involving tool creation, process modeling, and community building -- all proceeding concurrently in mutually reinforcing efforts. As we near the end of our initially conceived three-year project, we report on our progress and share lessons learned along the way. The main contribution articulated in this paper is a process model that decomposes scholarly inquiries into four main activities: filter, extract, aggregate, and visualize. Based on the insight that these activities can be disaggregated across time, space, and tools, it is possible to generate "derivative products", using our Archives Unleashed Toolkit, that serve as useful starting points for scholarly inquiry. Scholars can download these products from the Archives Unleashed Cloud and manipulate them just like any other dataset, thus providing access to web archives without requiring any specialized knowledge. Over the past few years, our platform has processed over a thousand different collections from over two hundred users, totaling around 300 terabytes of web archives.
This research was supported by the Andrew W. Mellon Foundation, the Social Sciences and Humanities Research Council of Canada, as well as Start Smart Labs, Compute Canada, the University of Waterloo, and York University. We'd like to thank Jeremy Wiebe, Ryan Deschamps, and Gursimran Singh for their contributions.
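A generic sketch of the filter, extract, aggregate, and visualize process model on simplified WARC-like records, in plain Python rather than the Archives Unleashed Toolkit itself; the record fields and values are invented for illustration.

```python
# Filter -> extract -> aggregate -> visualize over stand-in archived captures.
from collections import Counter
from urllib.parse import urlparse

records = [
    {"url": "https://example.org/a", "mime": "text/html", "text": "climate policy"},
    {"url": "https://example.org/b", "mime": "image/png", "text": ""},
    {"url": "https://example.com/c", "mime": "text/html", "text": "climate science"},
]

# Filter: keep only HTML pages.
pages = [r for r in records if r["mime"] == "text/html"]

# Extract: pull out the domain of each page.
domains = [urlparse(r["url"]).netloc for r in pages]

# Aggregate: count pages per domain -- a typical "derivative product".
counts = Counter(domains)

# Visualize: a plain-text rendering; the derivative could equally be loaded
# into a notebook or plotting tool downstream.
for domain, n in counts.most_common():
    print(f"{domain}\t{n}")
```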
Legal Deposit Web Archives and the Digital Humanities: a Universe of Lost Opportunity?
Legal deposit libraries have archived the web for over a decade. Several nations, supported by legal deposit regulations, have introduced comprehensive national domain web crawling, an essential part of the national library remit to collect, preserve, and make accessible a nation's intellectual and cultural heritage (Brazier, 2016). Scholars have traditionally been the chief beneficiaries of legal deposit collections: in the case of web archives, the potential for research extends to contemporary materials, and to Digital Humanities text and data mining approaches. To date, however, little work has evaluated whether legal deposit regulations support computational approaches to research using national web archive data (Brügger, 2012; Hockx-Yu, 2014; Black, 2016). This paper examines the impact of electronic legal deposit (ELD) in the United Kingdom, particularly how the 2013 regulations influence innovative scholarship using the Legal Deposit UK Web Archive. As the first major case study to analyse the implementation of ELD, it will address the following key research questions:
• Is legal deposit, a concept defined and refined for print materials, the most suitable vehicle for supporting DH research using web archives?
• How does the current framing of ELD affect digital innovation in the UK library sector?
• How does the current information ecology, including not-for-profit archives, influence the relationship between DH researchers and legal deposit libraries?
