ArchiveSpark: Efficient Web Archive Access, Extraction and Derivation
Web archives are a valuable resource for researchers of various disciplines.
However, to use them as a scholarly source, researchers require a tool that
provides efficient access to Web archive data for extraction and derivation of
smaller datasets. Besides efficient access we identify five other objectives
based on practical researcher needs such as ease of use, extensibility and
reusability.
Towards these objectives we propose ArchiveSpark, a framework for efficient,
distributed Web archive processing that builds a research corpus by working on
existing and standardized data formats commonly held by Web archiving
institutions. Performance optimizations in ArchiveSpark, facilitated by the use
of a widely available metadata index, result in significant speed-ups of data
processing. Our benchmarks show that ArchiveSpark is faster than alternative
approaches without depending on any additional data stores while improving
usability by seamlessly integrating queries and derivations with external
tools.
Comment: JCDL 2016, Newark, NJ, US
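The index-first idea in this abstract can be illustrated with a minimal sketch: filter on lightweight metadata, then read full records only for the entries that survive. This is not ArchiveSpark's API. The `IndexEntry` fields, `select_then_load`, and the toy loader are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class IndexEntry:
    """One lightweight metadata row (hypothetical fields, not a real index schema)."""
    url: str
    mime: str
    status: int
    offset: int  # where the full record would live in the archive file


def select_then_load(
    index: Iterable[IndexEntry],
    keep: Callable[[IndexEntry], bool],
    load_payload: Callable[[IndexEntry], bytes],
) -> Iterator[Tuple[IndexEntry, bytes]]:
    """Filter on cheap metadata first; read payloads only for surviving entries."""
    for entry in index:
        if keep(entry):
            yield entry, load_payload(entry)


# Toy data standing in for a metadata index and an archive.
index = [
    IndexEntry("http://example.org/", "text/html", 200, 0),
    IndexEntry("http://example.org/logo.png", "image/png", 200, 1),
    IndexEntry("http://example.org/missing", "text/html", 404, 2),
]
loads: List[int] = []  # records which payloads were actually read


def fake_load(entry: IndexEntry) -> bytes:
    """Stand-in for an expensive archive read (assumption for this sketch)."""
    loads.append(entry.offset)
    return f"payload@{entry.offset}".encode()


corpus = list(select_then_load(
    index,
    keep=lambda e: e.mime == "text/html" and e.status == 200,
    load_payload=fake_load,
))
```

In this toy run only one of the three payloads is read, which is the kind of saving a metadata-based filter is meant to provide; the actual speed-ups reported in the paper depend on its own implementation and benchmarks.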
JISC Preservation of Web Resources (PoWR) Handbook
Handbook of Web Preservation produced by the JISC-PoWR project which ran from April to November 2008.
The handbook specifically addresses digital preservation issues that are relevant to the UK HE/FE web management community.
The project was undertaken jointly by UKOLN at the University of Bath and ULCC Digital Archives department
VXA: A Virtual Architecture for Durable Compressed Archives
Data compression algorithms change frequently, and obsolete decoders do not
always run on new hardware and operating systems, threatening the long-term
usability of content archived using those algorithms. Re-encoding content into
new formats is cumbersome, and highly undesirable when lossy compression is
involved. Processor architectures, in contrast, have remained comparatively
stable over recent decades. VXA, an archival storage system designed around
this observation, archives executable decoders along with the encoded content
it stores. VXA decoders run in a specialized virtual machine that implements an
OS-independent execution environment based on the standard x86 architecture.
The VXA virtual machine strictly limits access to host system services, making
decoders safe to run even if an archive contains malicious code. VXA's adoption
of a "native" processor architecture instead of type-safe language technology
allows reuse of existing "hand-optimized" decoders in C and assembly language,
and permits decoders access to performance-enhancing architecture features such
as vector processing instructions. The performance cost of VXA's virtualization
is typically less than 15% compared with the same decoders running natively.
The storage cost of archived decoders, typically 30-130KB each, can be
amortized across many archived files sharing the same compression method.
Comment: 14 pages, 7 figures, 2 table
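The amortization claim can be made concrete with a small arithmetic sketch. It assumes a decoder is stored once and shared by every file that uses the same compression method; the count of 1,000 files is a hypothetical figure, and only the 30-130KB decoder sizes come from the abstract.

```python
def per_file_overhead_kb(decoder_kb: float, files_sharing_decoder: int) -> float:
    """Storage overhead per archived file when one decoder is shared by many files."""
    if files_sharing_decoder < 1:
        raise ValueError("at least one file must use the decoder")
    return decoder_kb / files_sharing_decoder


# Decoder sizes from the abstract; the file count is an illustrative assumption.
low = per_file_overhead_kb(30, 1000)
high = per_file_overhead_kb(130, 1000)
```

Under these assumptions the per-file overhead falls between roughly 0.03KB and 0.13KB.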
Repository Replication Using NNTP and SMTP
We present the results of a feasibility study using shared, existing,
network-accessible infrastructure for repository replication. We investigate
how dissemination of repository contents can be ``piggybacked'' on top of
existing email and Usenet traffic. Long-term persistence of the replicated
repository may be achieved thanks to current policies and procedures which
ensure that mail messages and news posts are retrievable for evidentiary and
other legal purposes for many years after the creation date. While the
preservation issues of migration and emulation are not addressed with this
approach, it does provide a simple method of refreshing content with unknown
partners.
Comment: This revised version has 24 figures and a more detailed discussion of
the experiments conducted by u
From Social Data Mining to Forecasting Socio-Economic Crisis
Socio-economic data mining has a great potential in terms of gaining a better
understanding of problems that our economy and society are facing, such as
financial instability, shortages of resources, or conflicts. Without
large-scale data mining, progress in these areas seems hard or impossible.
Therefore, a suitable, distributed data mining infrastructure and research
centers should be built in Europe. It also appears appropriate to build a
network of Crisis Observatories. They can be imagined as laboratories devoted
to the gathering and processing of enormous volumes of data on both natural
systems such as the Earth and its ecosystem, as well as on human
techno-socio-economic systems, so as to gain early warnings of impending
events. Reality mining provides the chance to adapt more quickly and more
accurately to changing situations. Further opportunities arise from individually
customized services, which, however, should be provided in a privacy-respecting
way. This requires the development of novel ICT (such as a self-organizing
Web), but most likely new legal regulations and suitable institutions as well.
As long as such regulations are lacking on a world-wide scale, it is in the
public interest that scientists explore what can be done with the huge data
available. Big data do have the potential to change or even threaten democratic
societies. The same applies to sudden and large-scale failures of ICT systems.
Therefore, dealing with data must be done with a large degree of responsibility
and care. Self-interests of individuals, companies or institutions have limits,
where the public interest is affected, and public interest is not a sufficient
justification to violate human rights of individuals. Privacy is a high good,
as confidentiality is, and damaging it would have serious side effects for
society.
Comment: 65 pages, 1 figure, Visioneer White Paper, see
http://www.visioneer.ethz.c
The Dark Energy Survey Data Management System
The Dark Energy Survey collaboration will study cosmic acceleration with a
5000 deg² griZY survey in the southern sky over 525 nights from 2011-2016. The
DES data management (DESDM) system will be used to process and archive these
data and the resulting science ready data products. The DESDM system consists
of an integrated archive, a processing framework, an ensemble of astronomy
codes and a data access framework. We are developing the DESDM system for
operation in the high performance computing (HPC) environments at NCSA and
Fermilab. Operating the DESDM system in an HPC environment offers both speed
and flexibility. We will employ it for our regular nightly processing needs,
and for more compute-intensive tasks such as large scale image coaddition
campaigns, extraction of weak lensing shear from the full survey dataset, and
massive seasonal reprocessing of the DES data. Data products will be available
to the Collaboration and later to the public through a virtual-observatory
compatible web portal. Our approach leverages investments in publicly available
HPC systems, greatly reducing hardware and maintenance costs to the project,
which must deploy and maintain only the storage, database platforms and
orchestration and web portal nodes that are specific to DESDM. In Fall 2007, we
tested the current DESDM system on both simulated and real survey data. We used
Teragrid to process 10 simulated DES nights (3TB of raw data), ingesting and
calibrating approximately 250 million objects into the DES Archive database. We
also used DESDM to process and calibrate over 50 nights of survey data acquired
with the Mosaic2 camera. Comparison to truth tables in the case of the
simulated data and internal crosschecks in the case of the real data indicate
that astrometric and photometric data quality is excellent.
Comment: To be published in the proceedings of the SPIE conference on
Astronomical Instrumentation (held in Marseille in June 2008). This preprint
is made available with the permission of SPIE. Further information together
with preprint containing full quality images is available at
http://desweb.cosmology.uiuc.edu/wik
The Wiltshire Wills Feasibility Study
The Wiltshire and Swindon Record Office has nearly ninety thousand wills in its care. These records are neither adequately catalogued nor secured against loss by facsimile microfilm copies. With support from the Heritage Lottery Fund, the Record Office has begun to produce suitable finding aids for the material. Beginning with this feasibility study, the Record Office is developing a strategy to ensure that facsimiles are created to protect the collection against risk of loss or damage and to improve public access.
This feasibility study explores the different methodologies that can be used to assist the preservation and conservation of the collection and improve public access to it. The study aims to produce a strategy that will enable the Record Office to create digital facsimiles of the Wills in its care for access purposes and also to create preservation-quality microfilms. The strategy seeks the most cost-effective and time-efficient approach to the problem and identifies ways to optimise the processes by drawing on the experience of other similar projects. This report provides a set of guidelines and recommendations to ensure the best use of the resources available, to provide the most robust preservation strategy, and to ensure that future access to the Wills as an information resource can be flexible, both local and remote, and sustainable.