ArrayBridge: Interweaving declarative array processing with high-performance computing
Scientists are increasingly turning to datacenter-scale computers to produce
and analyze massive arrays. Despite decades of database research that extols
the virtues of declarative query processing, scientists still write, debug and
parallelize imperative HPC kernels even for the most mundane queries. This
impedance mismatch has been partly attributed to the cumbersome data loading
process; in response, the database community has proposed in situ mechanisms to
access data in scientific file formats. Scientists, however, desire more than a
passive access method that reads arrays from files.
This paper describes ArrayBridge, a bi-directional array view mechanism for scientific file formats that aims to make declarative array manipulations interoperable with imperative, file-centric analyses. Our prototype
implementation of ArrayBridge uses HDF5 as the underlying array storage library
and seamlessly integrates into the SciDB open-source array database system. In
addition to fast querying over external array objects, ArrayBridge produces
arrays in the HDF5 file format just as easily as it can read from it.
ArrayBridge also supports time travel queries from imperative kernels through
the unmodified HDF5 API, and automatically deduplicates between array versions
for space efficiency. Our extensive performance evaluation at NERSC, a large-scale scientific computing facility, shows that ArrayBridge exhibits performance and I/O scalability statistically indistinguishable from the native SciDB storage engine.
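As a hedged illustration only (not code from the paper), the minimal sketch below shows the kind of imperative, file-centric HDF5 access, here via the h5py Python bindings, that ArrayBridge aims to make interoperable with declarative array queries. The file name "simulation.h5" and dataset name "temperature" are hypothetical.

    # Illustrative sketch only, not code from the paper: imperative,
    # file-centric HDF5 access of the kind ArrayBridge interoperates with.
    # "simulation.h5" and "temperature" are hypothetical names.
    import h5py

    with h5py.File("simulation.h5", "r") as f:
        temperature = f["temperature"]       # handle to an on-disk array
        tile = temperature[0:1024, 0:1024]   # read one tile into memory
        print("tile mean:", tile.mean())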
Theory and Practice of Data Citation
Citations are the cornerstone of knowledge propagation and the primary means
of assessing the quality of research, as well as directing investments in
science. Science is increasingly becoming "data-intensive", where large volumes
of data are collected and analyzed to discover complex patterns through
simulations and experiments, and most scientific reference works have been
replaced by online curated datasets. Yet, given a dataset, there is no
quantitative, consistent and established way of knowing how it has been used
over time, who contributed to its curation, what results have been yielded or
what value it has.
The development of a theory and practice of data citation is fundamental for
considering data as first-class research objects with the same relevance and centrality as traditional scientific products. Many works in recent years have
discussed data citation from different viewpoints: illustrating why data
citation is needed, defining the principles and outlining recommendations for
data citation systems, and providing computational methods for addressing
specific issues of data citation.
The current panorama is many-faceted and an overall view that brings together
diverse aspects of this topic is still missing. Therefore, this paper aims to
describe the lay of the land for data citation, both from the theoretical (the
why and what) and the practical (the how) angles.
VisIVOWeb: A WWW Environment for Large-Scale Astrophysical Visualization
This article presents a newly developed Web portal called VisIVOWeb that aims
to provide the astrophysical community with powerful visualization tools for
large-scale data sets in the context of Web 2.0. VisIVOWeb can effectively
handle modern numerical simulations and real-world observations. Our
open-source software is based on established visualization toolkits offering
high-quality rendering algorithms. We discuss the underlying data management along with the supported visualization interfaces and movie-making functionality. We
introduce VisIVOWeb Network, a robust network of customized Web portals for
visual discovery, and VisIVOWeb Connect, a lightweight and efficient solution
for seamlessly connecting to existing astrophysical archives. A significant
effort has been devoted to ensuring interoperability with existing tools by
adhering to IVOA standards. We conclude with a summary of our work and a
discussion of future developments.
A Graph-structured Dataset for Wikipedia Research
Wikipedia is a rich and invaluable source of information. Its central place
on the Web makes it a particularly interesting object of study for scientists.
Researchers from different domains have used various complex datasets related to Wikipedia to study language, social behavior, knowledge organization, and network theory. While the dataset is a scientific treasure, its sheer size hinders pre-processing and can be a serious obstacle to new studies. This issue is particularly acute in scientific domains whose researchers are not data-processing specialists. On the one hand, Wikipedia dumps are large, which makes parsing and extracting relevant information cumbersome. On the other hand, the API is straightforward to use but limited to a relatively small number of requests. The middle ground is the mesoscopic scale, where researchers need a subset of Wikipedia ranging from thousands to hundreds of thousands of pages, but no efficient solution exists at this scale.
In this work, we propose an efficient data structure to make requests and
access subnetworks of Wikipedia pages and categories. We provide convenient
tools for accessing and filtering viewership statistics or "pagecounts" of
Wikipedia web pages. The dataset organization leverages principles of graph databases, allowing rapid and intuitive access to subgraphs of Wikipedia articles and categories. The dataset and deployment guidelines are available on the LTS2 website: https://lts2.epfl.ch/Datasets/Wikipedia/
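To make the scale argument concrete, here is a minimal sketch (not from the paper) of per-page access through the public MediaWiki API using the Python requests library. The endpoint and parameters are the real MediaWiki query API; the page titles are arbitrary examples. One request per page works for a handful of pages but not for the tens of thousands a mesoscopic study needs, which is the gap the graph-structured dataset fills.

    # Sketch: per-page requests against the MediaWiki API. Fine for a few
    # pages, impractical at the mesoscopic scale discussed above.
    import requests

    API = "https://en.wikipedia.org/w/api.php"
    for title in ["Graph theory", "Network science"]:  # example pages
        params = {"action": "query", "prop": "categories",
                  "titles": title, "format": "json"}
        resp = requests.get(API, params=params, timeout=10).json()
        for page in resp["query"]["pages"].values():
            cats = [c["title"] for c in page.get("categories", [])]
            print(title, "->", cats)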
Deductive Optimization of Relational Data Storage
Optimizing the physical storage of data and optimizing its retrieval are two key database management problems. In this paper, we propose a language that can
express a wide range of physical database layouts, going well beyond the row-
and column-based methods that are widely used in database management systems.
We use deductive synthesis to turn a high-level relational representation of a
database query into a highly optimized low-level implementation which operates
on a specialized layout of the dataset. We build a compiler for this language
and conduct experiments using a popular database benchmark; the results show that the performance of these specialized queries is competitive with a state-of-the-art in-memory compiled database system.
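The paper's synthesis is automatic and deductive; as a hand-written sketch of the underlying idea only (the data, names, and layouts below are invented for illustration, not the paper's language), compare one query over a generic row layout with the same query over a layout specialized for it:

    # Illustrative sketch, not the paper's language: the same query over a
    # generic row layout and over a layout specialized for that query.
    rows = [("east", 10), ("west", 7), ("east", 5)]  # generic row layout

    def east_total_rows():
        # Generic plan: scan every row and filter.
        return sum(v for region, v in rows if region == "east")

    # Specialized layout derived for this one query: values pre-grouped by
    # region, so the query becomes a lookup plus a sum over a dense list.
    by_region = {"east": [10, 5], "west": [7]}

    def east_total_specialized():
        return sum(by_region["east"])

    assert east_total_rows() == east_total_specialized() == 15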
Linked Data - the story so far
The term “Linked Data” refers to a set of best practices for publishing and connecting structured data on the Web. These best practices have been adopted by an increasing number of data providers over the last three years, leading to the creation of a global data space containing billions of assertions: the Web of Data. In this article, the authors present the concept and technical principles of Linked Data, and situate these within the broader context of related technological developments. They describe progress to date in publishing Linked Data on the Web, review applications that have been developed to exploit the Web of Data, and map out a research agenda for the Linked Data community as it moves forward.
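As a minimal sketch of the practice the article describes (using the rdflib Python library and DBpedia, a well-known Linked Data source; neither is prescribed by the article), dereferencing an HTTP URI yields RDF assertions about the identified resource:

    # Sketch: dereference a Linked Data URI and parse the RDF returned
    # via content negotiation. rdflib and DBpedia are example choices.
    from rdflib import Graph

    g = Graph()
    g.parse("http://dbpedia.org/resource/Berlin")  # fetches an RDF description
    for s, p, o in list(g)[:5]:  # print a few of the assertions (triples)
        print(s, p, o)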