The MultiDark Database: Release of the Bolshoi and MultiDark Cosmological Simulations
We present the online MultiDark Database -- a Virtual Observatory-oriented,
relational database for hosting various cosmological simulations. The data is
accessible via an SQL (Structured Query Language) query interface, which also
allows users to directly pose scientific questions, as shown in a number of
examples in this paper. Further examples for the usage of the database are
given in its extensive online documentation (www.multidark.org). The database
is based on the same technology as the Millennium Database, a fact that will
greatly facilitate the usage of both suites of cosmological simulations. The
first release of the MultiDark Database hosts two 8.6 billion particle
cosmological N-body simulations: the Bolshoi (250/h Mpc simulation box, 1/h kpc
resolution) and MultiDark Run1 simulation (MDR1, or BigBolshoi, 1000/h Mpc
simulation box, 7/h kpc resolution). This paper explains the methods used to
extract halos/subhalos from the raw simulation data and how these data are
structured in the database. With the first data release, users get full access
to halo/subhalo catalogs, various profiles of the halos at redshifts z=0-15,
and raw dark matter data for one time-step of the Bolshoi and four time-steps
of the MultiDark simulation. Later releases will also include galaxy mock
catalogs and additional merging trees for both simulations as well as new large
volume simulations with high resolution. This project further demonstrates
that complex data can be stored and presented using relational database
technology. We encourage other simulators to publish their results in a similar
manner.
Comment: 28 pages, 9 figures, submitted to New Astronomy
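To illustrate the kind of SQL access to halo catalogs that the abstract describes, here is a minimal Python sketch that runs a query against a small in-memory SQLite table. The table name `halos`, its columns (`halo_id`, `snapnum`, `mass`), the sample rows, and the helper functions are all hypothetical; they do not reflect the actual MultiDark Database schema or its web interface, which are described in the online documentation.

```python
import sqlite3


def build_toy_catalog():
    """Create an in-memory table standing in for a halo catalog.

    The schema and values are invented for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE halos ("
        "halo_id INTEGER PRIMARY KEY, snapnum INTEGER, mass REAL)"
    )
    rows = [
        (1, 85, 2.0e14),
        (2, 85, 5.0e12),
        (3, 85, 1.5e14),
        (4, 40, 3.0e14),
    ]
    conn.executemany("INSERT INTO halos VALUES (?, ?, ?)", rows)
    return conn


def massive_halos(conn, snapnum, min_mass):
    """Return (halo_id, mass) for halos in one snapshot above a mass cut,
    most massive first."""
    cur = conn.execute(
        "SELECT halo_id, mass FROM halos "
        "WHERE snapnum = ? AND mass >= ? "
        "ORDER BY mass DESC",
        (snapnum, min_mass),
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = build_toy_catalog()
    for halo_id, mass in massive_halos(conn, snapnum=85, min_mass=1.0e14):
        print(halo_id, mass)
```

The point of the sketch is that a scientific question ("which halos in this snapshot exceed a given mass?") maps directly onto a single SQL statement, without downloading the full catalog first.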
Ranked Similarity Search of Scientific Datasets: An Information Retrieval Approach
In the past decade, the amount of scientific data collected and generated by scientists has grown dramatically. This growth has intensified an existing problem: in large archives consisting of datasets stored in many files, formats and locations, how can scientists find data relevant to their research interests? We approach this problem in a new way: by adapting Information Retrieval techniques, developed for searching text documents, to the world of (primarily numeric) scientific data. We propose an approach that uses a blend of automated and curated methods to extract metadata from large repositories of scientific data. We then perform searches over this metadata, returning results ranked by similarity to the search criteria. We present a model of this approach, and describe a specific implementation thereof performed at an ocean-observatory data archive and now running in production. Our prototype implements scanners that extract metadata from datasets that contain different kinds of environmental observations, and a search engine with a candidate similarity measure for comparing a set of search terms to the extracted metadata. We evaluate the utility of the prototype by performing two user studies; these studies show that the approach resonates with users, and that our proposed similarity measure performs well when analyzed using standard Information Retrieval evaluation methods. We ran performance tests to explore how continued archive growth will affect our goal of interactive response, developed and applied techniques that mitigate the effects of that growth, and show that the techniques are effective. Lastly, we describe some of the research needed to extend this initial work into a true "Google for data".
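The idea of searching extracted metadata and returning results ranked by similarity can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `DatasetSummary` structure, the range-overlap scoring, and all names and values below are hypothetical and are not the similarity measure or metadata model proposed in the work summarized above.

```python
from dataclasses import dataclass


@dataclass
class DatasetSummary:
    """Hypothetical extracted metadata: per-variable (min, max) ranges."""
    name: str
    ranges: dict


def range_overlap(wanted, observed):
    """Fraction of the wanted range that the observed range covers."""
    lo = max(wanted[0], observed[0])
    hi = min(wanted[1], observed[1])
    width = wanted[1] - wanted[0]
    if width <= 0:
        # Degenerate query (a single value): full score if it is covered.
        return 1.0 if observed[0] <= wanted[0] <= observed[1] else 0.0
    return max(0.0, hi - lo) / width


def score(query, summary):
    """Mean overlap across query terms; a missing variable scores 0."""
    if not query:
        return 0.0
    total = 0.0
    for variable, wanted in query.items():
        if variable in summary.ranges:
            total += range_overlap(wanted, summary.ranges[variable])
    return total / len(query)


def ranked_search(query, catalog):
    """Return (score, name) pairs, best match first, ties broken by name."""
    scored = [(score(query, s), s.name) for s in catalog]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


if __name__ == "__main__":
    catalog = [
        DatasetSummary("cruise_a", {"temperature": (5.0, 15.0),
                                    "salinity": (30.0, 35.0)}),
        DatasetSummary("station_b", {"temperature": (10.0, 20.0)}),
        DatasetSummary("glider_c", {"salinity": (0.0, 10.0)}),
    ]
    query = {"temperature": (10.0, 14.0), "salinity": (31.0, 33.0)}
    for s, name in ranked_search(query, catalog):
        print(f"{s:.2f}  {name}")
```

Unlike a Boolean filter, this kind of scoring still returns partially matching datasets (here, one that has the requested temperature range but no salinity), ordered by how closely they fit the search terms.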