Web Archive Services Framework for Tighter Integration Between the Past and Present Web
Web archives have preserved the cultural history of the web for many years, but access to them remains limited. Most web archiving research has focused on crawling and preservation activities, with little attention to delivery methods. The current access methods are tightly coupled with web archive infrastructure, hard to replicate or integrate with other web archives, and do not cover all users' needs. In this dissertation, we focus on access methods for archived web data that enable users, third-party developers, researchers, and others to gain knowledge from web archives. We build ArcSys, a new service framework that extracts, preserves, and exposes APIs for the web archive corpus. The dissertation introduces a novel categorization technique that divides the archived corpus into four levels. For each level, we propose suitable services and APIs that enable both users and third-party developers to build new interfaces. The first level is the content level, which extracts the content from the archived web data. We develop ArcContent to expose web archive content processed through various filters. The second level is the metadata level; we extract metadata from the archived web data and make it available to users. We implement two services: ArcLink for the temporal web graph and ArcThumb for optimizing thumbnail creation in web archives. The third level is the URI level, which uses the URI's HTTP redirection status to enhance user queries. Finally, the highest level in the web archiving service framework pyramid is the archive level, at which we define a web archive by the characteristics of its corpus and build Web Archive Profiles. The profiles are used by the Memento Aggregator for query optimization.
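The abstract names the four levels of the service pyramid and the services exposed at each. A minimal sketch of that mapping, using only the service names given in the abstract (the dispatch logic itself is invented for illustration):

```python
# Hypothetical sketch of the four-level ArcSys service pyramid described
# in the abstract. Service names (ArcContent, ArcLink, ArcThumb) come from
# the abstract; the lookup structure is an illustrative assumption.

LEVELS = {
    "content":  ["ArcContent"],            # filtered content extraction
    "metadata": ["ArcLink", "ArcThumb"],   # temporal web graph, thumbnails
    "uri":      ["redirection-status"],    # HTTP-redirect-aware query help
    "archive":  ["archive-profile"],       # corpus-level Web Archive Profiles
}

def services_for(level):
    """Return the services exposed at a given pyramid level."""
    try:
        return LEVELS[level]
    except KeyError:
        raise ValueError(f"unknown level: {level!r}")

print(services_for("metadata"))  # ['ArcLink', 'ArcThumb']
```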
ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation
Web archives are a valuable resource for researchers of various disciplines.
However, to use them as a scholarly source, researchers require a tool that
provides efficient access to Web archive data for extraction and derivation of
smaller datasets. Besides efficient access, we identify five other objectives
based on practical researcher needs, such as ease of use, extensibility, and
reusability.
Towards these objectives, we propose ArchiveSpark, a framework for efficient,
distributed Web archive processing that builds a research corpus by working on
existing and standardized data formats commonly held by Web archiving
institutions. Performance optimizations in ArchiveSpark, facilitated by the use
of a widely available metadata index, result in significant speed-ups of data
processing. Our benchmarks show that ArchiveSpark is faster than alternative
approaches without depending on any additional data stores while improving
usability by seamlessly integrating queries and derivations with external
tools.
Comment: JCDL 2016, Newark, NJ, US
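The speed-ups described above come from consulting a lightweight metadata index before touching the far larger archive payloads. A minimal sketch of that idea, assuming the common space-separated CDX index line format (the sample records are invented for illustration and are not from any real archive):

```python
# Hypothetical sketch: filter records via a CDX metadata index before
# reading any WARC payloads, in the spirit of ArchiveSpark's index-first
# processing. Sample data is invented for illustration.

SAMPLE_CDX = """\
com,example)/ 20160301120000 http://example.com/ text/html 200 AAAA 1024
com,example)/style.css 20160301120005 http://example.com/style.css text/css 200 BBBB 512
com,example)/missing 20160301120010 http://example.com/missing text/html 404 CCCC 256
"""

def select_records(cdx_text, mime="text/html", status="200"):
    """Return (timestamp, original_url) for index entries matching the
    requested MIME type and HTTP status, without reading any payload."""
    selected = []
    for line in cdx_text.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue
        _surt, ts, url, rec_mime, rec_status = fields[:5]
        if rec_mime == mime and rec_status == status:
            selected.append((ts, url))
    return selected

print(select_records(SAMPLE_CDX))
# Only the successful HTML capture survives the index-level filter.
```

Because the filter runs entirely on the small index, the expensive payload reads are deferred to only the records that survive it.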
Profiling Web Archive Coverage for Top-Level Domain and Content Language
The Memento aggregator currently polls every known public web archive when
serving a request for an archived web page, even though some web archives focus
on only specific domains and ignore the others. Similar to query routing in
distributed search, we investigate the impact on aggregated Memento TimeMaps
(lists of when and where a web page was archived) by only sending queries to
archives likely to hold the archived page. We profile twelve public web
archives using data from a variety of sources (the web, archives' access logs,
and full-text queries to archives) and discover that only sending queries to
the top three web archives (i.e., a 75% reduction in the number of queries) for
any request produces the full TimeMaps in 84% of cases.
Comment: Appeared in TPDL 201
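The routing idea above, sending a query only to archives whose profile suggests they may hold the page, can be sketched minimally. The archive names and profiles below are invented for illustration; real profiles in the paper are built from web data, access logs, and full-text queries:

```python
# Hypothetical sketch of profile-based query routing: each (invented)
# archive profile records the top-level domains the archive is known to
# hold, and a query is routed only to plausible archives.
from urllib.parse import urlparse

PROFILES = {
    "archive-a": {"com", "org", "net"},   # broad-coverage archive
    "archive-b": {"uk"},                  # national, UK-focused
    "archive-c": {"pt"},                  # national, Portugal-focused
}

def route(uri, profiles=PROFILES):
    """Return only the archives whose profile suggests they may hold the
    URI, instead of broadcasting to every known archive."""
    host = urlparse(uri).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return sorted(name for name, tlds in profiles.items() if tld in tlds)

print(route("http://www.bbc.co.uk/news"))  # ['archive-b']
print(route("http://example.com/"))        # ['archive-a']
```

The trade-off measured in the paper is exactly the one this sketch exposes: fewer queries sent per request versus the risk of missing a capture held by an archive the profile excluded.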
How Much of the Web Is Archived?
Although the Internet Archive's Wayback Machine is the largest and most
well-known web archive, there have been a number of public web archives that
have emerged in the last several years. With varying resources, audiences and
collection development policies, these archives have varying levels of overlap
with each other. While individual archives can be measured in terms of number
of URIs, number of copies per URI, and intersection with other archives, to
date there has been no answer to the question "How much of the Web is
archived?" We study the question by approximating the Web using sample URIs
from DMOZ, Delicious, Bitly, and search engine indexes, and counting how many
copies of the sample URIs exist in various public web archives. Each sample
set introduces its own bias. The results from our sample sets indicate that
35%-90% of the Web has at least one archived copy, 17%-49% has 2-5 copies,
1%-8% has 6-10 copies, and 8%-63% has more than 10 copies
in public web archives. The number of URI copies varies as a function of time,
but no more than 31.3% of URIs are archived more than once per month.
Comment: This is the long version of the short paper by the same title
published at JCDL'11. 10 pages, 5 figures, 7 tables. Version 2 includes minor
typographical correction
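The abstract reports coverage in bands of copies per URI (at least one, 2-5, 6-10, more than 10). A minimal sketch of that bucketing, with invented sample counts:

```python
# Hypothetical sketch of the copy-count bucketing used when reporting
# coverage: group sample URIs by how many archived copies were found.
# The band boundaries come from the abstract; the counts are invented.

def coverage_bands(copy_counts):
    """Map each URI's archived-copy count into the abstract's bands."""
    bands = {"0": 0, "1": 0, "2-5": 0, "6-10": 0, ">10": 0}
    for n in copy_counts:
        if n == 0:
            bands["0"] += 1
        elif n == 1:
            bands["1"] += 1
        elif n <= 5:
            bands["2-5"] += 1
        elif n <= 10:
            bands["6-10"] += 1
        else:
            bands[">10"] += 1
    return bands

print(coverage_bands([0, 1, 3, 7, 12, 12]))
# {'0': 1, '1': 1, '2-5': 1, '6-10': 1, '>10': 2}
```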
Profiling Web Archives
PDF of a PowerPoint presentation from the 2014 International Internet Preservation Consortium (IIPC) General Assembly, Paris, France, May 21, 2014. Also available on Slideshare.
https://digitalcommons.odu.edu/computerscience_presentations/1011/thumbnail.jp
Tools for Managing the Past Web
PDF of a PowerPoint presentation from the Archive-It Partners Meeting in Montgomery, Alabama, November 18, 2014. Also available on Slideshare.
https://digitalcommons.odu.edu/computerscience_presentations/1032/thumbnail.jp
Who Will Archive the Archives? Thoughts About the Future of Web Archiving
PDF of a PowerPoint presentation from the Wolfram Data Summit 2013 in Washington D.C., September 5-6, 2013. Also available on Slideshare.
https://digitalcommons.odu.edu/computerscience_presentations/1017/thumbnail.jp
A next-generation liquid xenon observatory for dark matter and neutrino physics
The nature of dark matter and the properties of neutrinos are among the most pressing issues in contemporary particle physics. The dual-phase xenon time-projection chamber is the leading technology to cover the available parameter space for weakly interacting massive particles, while featuring extensive sensitivity to many alternative dark matter candidates. These detectors can also study neutrinos through neutrinoless double-beta decay and through a variety of astrophysical sources. A next-generation xenon-based detector will therefore be a true multi-purpose observatory to significantly advance particle physics, nuclear physics, astrophysics, solar physics, and cosmology. This review article presents the science cases for such a detector.