7,951 research outputs found
QUASII: QUery-Aware Spatial Incremental Index.
With large-scale simulations of increasingly detailed models and improvement of data acquisition technologies, massive amounts of data are easily and quickly created and collected. Traditional systems require indexes to be built before analytic queries can be executed efficiently. Such an indexing step requires substantial computing resources and introduces a considerable and growing data-to-insight gap where scientists need to wait before they can perform any analysis. Moreover, scientists often only use a small fraction of the data - the parts containing interesting phenomena - and indexing it fully does not always pay off. In this paper we develop a novel incremental index for the exploration of spatial data. Our approach, QUASII, builds a data-oriented index as a side-effect of query execution. QUASII distributes the cost of indexing across all queries, while building the index structure only for the subset of data queried. It reduces data-to-insight time and curbs the cost of incremental indexing by gradually and partially sorting the data, while producing a data-oriented hierarchical structure at the same time. As our experiments show, QUASII reduces the data-to-insight time by up to a factor of 11.4x, while its performance converges to that of the state-of-the-art static indexes
Entropy-scaling search of massive biological data
Many datasets exhibit a well-defined structure that can be exploited to
design faster search tools, but it is not always clear when such acceleration
is possible. Here, we introduce a framework for similarity search based on
characterizing a dataset's entropy and fractal dimension. We prove that
searching scales in time with metric entropy (number of covering hyperspheres),
if the fractal dimension of the dataset is low, and scales in space with the
sum of metric entropy and information-theoretic entropy (randomness of the
data). Using these ideas, we present accelerated versions of standard tools,
with no loss in specificity and little loss in sensitivity, for use in three
domains---high-throughput drug screening (Ammolite, 150x speedup), metagenomics
(MICA, 3.5x speedup of DIAMOND [3,700x BLASTX]), and protein structure search
(esFragBag, 10x speedup of FragBag). Our framework can be used to achieve
"compressive omics," and the general theory can be readily applied to data
science problems outside of biology.Comment: Including supplement: 41 pages, 6 figures, 4 tables, 1 bo
On-Disk Data Processing: Issues and Future Directions
In this paper, we present a survey of "on-disk" data processing (ODDP). ODDP,
which is a form of near-data processing, refers to the computing arrangement
where the secondary storage drives have the data processing capability.
Proposed ODDP schemes vary widely in terms of the data processing capability,
target applications, architecture and the kind of storage drive employed. Some
ODDP schemes provide only a specific but heavily used operation like sort
whereas some provide a full range of operations. Recently, with the advent of
Solid State Drives, powerful and extensive ODDP solutions have been proposed.
In this paper, we present a thorough review of architectures developed for
different on-disk processing approaches along with current and future
challenges and also identify the future directions which ODDP can take.Comment: 24 pages, 17 Figures, 3 Table
Data Mining the SDSS SkyServer Database
An earlier paper (Szalay et. al. "Designing and Mining MultiTerabyte
Astronomy Archives: The Sloan Digital Sky Survey," ACM SIGMOD 2000) described
the Sloan Digital Sky Survey's (SDSS) data management needs by defining twenty
database queries and twelve data visualization tasks that a good data
management system should support. We built a database and interfaces to support
both the query load and also a website for ad-hoc access. This paper reports on
the database design, describes the data loading pipeline, and reports on the
query implementation and performance. The queries typically translated to a
single SQL statement. Most queries run in less than 20 seconds, allowing
scientists to interactively explore the database. This paper is an in-depth
tour of those queries. Readers should first have studied the companion overview
paper Szalay et. al. "The SDSS SkyServer, Public Access to the Sloan Digital
Sky Server Data" ACM SIGMOND 2002.Comment: 40 pages, Original source is at
http://research.microsoft.com/~gray/Papers/MSR_TR_O2_01_20_queries.do
When Things Matter: A Data-Centric View of the Internet of Things
With the recent advances in radio-frequency identification (RFID), low-cost
wireless sensor devices, and Web technologies, the Internet of Things (IoT)
approach has gained momentum in connecting everyday objects to the Internet and
facilitating machine-to-human and machine-to-machine communication with the
physical world. While IoT offers the capability to connect and integrate both
digital and physical entities, enabling a whole new class of applications and
services, several significant challenges need to be addressed before these
applications and services can be fully realized. A fundamental challenge
centers around managing IoT data, typically produced in dynamic and volatile
environments, which is not only extremely large in scale and volume, but also
noisy, and continuous. This article surveys the main techniques and
state-of-the-art research efforts in IoT from data-centric perspectives,
including data stream processing, data storage models, complex event
processing, and searching in IoT. Open research issues for IoT data management
are also discussed
- …