Efficient Iterative Processing in the SciDB Parallel Array Engine
Many scientific data-intensive applications perform iterative computations on
array data. There exist multiple engines specialized for array processing.
These engines efficiently support various types of operations, but none
includes native support for iterative processing. In this paper, we develop a
model for iterative array computations and a series of optimizations. We
evaluate the benefits of optimized native support for iterative array
processing in the SciDB engine on real workloads from the astronomy domain.
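As a generic illustration of the kind of iterative array computation the abstract refers to (not SciDB's execution model or API), a fixed-point stencil loop in Python with NumPy might look like:

```python
import numpy as np

def iterate_until_convergence(a, tol=1e-4, max_iters=1000):
    """Repeatedly average each interior cell with its four neighbors
    (a Jacobi-style iteration) until the array stops changing."""
    a = a.astype(float).copy()
    for i in range(max_iters):
        nxt = a.copy()
        nxt[1:-1, 1:-1] = 0.25 * (a[:-2, 1:-1] + a[2:, 1:-1] +
                                  a[1:-1, :-2] + a[1:-1, 2:])
        if np.max(np.abs(nxt - a)) < tol:
            return nxt, i + 1
        a = nxt
    return a, max_iters

grid = np.zeros((8, 8))
grid[0, :] = 100.0          # hold the top boundary row hot
result, iters = iterate_until_convergence(grid)
```

The boundary row stays fixed while interior cells relax toward equilibrium; each pass reads and rewrites the whole array, which is exactly the pattern an engine with native iteration support can optimize.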
Partial-sum queries in OLAP data cubes using covering codes
A partial-sum query obtains the summation over a set of specified cells of a data cube. We establish a connection between the covering problem in the theory of error-correcting codes and the partial-sum problem, and use this connection to devise algorithms for the partial-sum problem with efficient space-time trade-offs. For example, using our algorithms, with 44 percent additional storage, the query response time can be improved by about 12 percent; by roughly doubling the storage requirement, the query response time can be improved by about 34 percent.
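The paper's construction is based on covering codes; as a much simpler stand-in that still exhibits the space-time trade-off, one can precompute sums over fixed blocks of cells and spend extra storage to answer a partial-sum query with fewer lookups:

```python
def build_block_sums(cube, block=4):
    """Extra storage: one precomputed sum per block of `block` cells."""
    return [sum(cube[i:i + block]) for i in range(0, len(cube), block)]

def partial_sum(cube, block_sums, cells, block=4):
    """Sum cube[c] over the specified cells, using a block sum whenever
    an entire block is covered (fewer lookups, more storage)."""
    cells = set(cells)
    total, lookups = 0, 0
    for b, bsum in enumerate(block_sums):
        block_cells = set(range(b * block, min((b + 1) * block, len(cube))))
        if block_cells <= cells:        # whole block covered: one lookup
            total += bsum
            lookups += 1
            cells -= block_cells
    for c in cells:                     # leftover cells, one lookup each
        total += cube[c]
        lookups += 1
    return total, lookups

cube = list(range(16))
block_sums = build_block_sums(cube)
total, lookups = partial_sum(cube, block_sums, range(9))
```

Here the query covers cells 0 through 8: two whole blocks plus one stray cell cost 3 lookups instead of 9. The covering-code algorithms in the paper achieve far better trade-offs than this naive blocking.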
Progressive Analytics: A Computation Paradigm for Exploratory Data Analysis
Exploring data requires a fast feedback loop from the analyst to the system,
with a latency below about 10 seconds because of human cognitive limitations.
When data becomes large or analysis becomes complex, sequential computations
can no longer be completed in a few seconds and data exploration is severely
hampered. This article describes a novel computation paradigm called
Progressive Computation for Data Analysis or, more concisely, Progressive
Analytics, which provides a low-latency guarantee at the programming-language
level by performing computations in a progressive fashion. Moving progressive
computation to the language level relieves the programmer of exploratory data
analysis systems from implementing the whole analytics pipeline progressively
from scratch, streamlining the implementation of scalable exploratory data
analysis systems. This article describes the new paradigm through a prototype
implementation called ProgressiVis, and illustrates the requirements it
implies through examples.
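A minimal sketch of the progressive idea (an illustration, not the ProgressiVis API): a computation emits early, refining estimates chunk by chunk rather than a single final answer, so the analyst gets feedback within the latency budget.

```python
def progressive_mean(stream, chunk_size):
    """Yield a running estimate of the mean after each chunk, so a
    caller can render feedback long before the data is exhausted."""
    total, count = 0.0, 0
    chunk = []
    for x in stream:
        chunk.append(x)
        if len(chunk) == chunk_size:
            total += sum(chunk)
            count += len(chunk)
            chunk = []
            yield total / count      # early, refining estimate
    if chunk:                        # flush the final partial chunk
        total += sum(chunk)
        count += len(chunk)
        yield total / count          # final exact value

estimates = list(progressive_mean(range(10), 4))
```

Each yielded value is a usable approximation; the last one equals the exact mean, so progressiveness costs nothing in final accuracy.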
Diamond Dicing
In OLAP, analysts often select an interesting sample of the data. For
example, an analyst might focus on products bringing revenues of at least
100,000 dollars, or on shops having sales greater than 400,000 dollars.
However, current systems do not allow applying both of these thresholds
simultaneously, that is, selecting products and shops that satisfy both. For
such purposes, we introduce the diamond cube operator, filling a gap among
existing data warehouse operations.
Because of the interaction between dimensions, the computation of diamond
cubes is challenging. We compare and test various algorithms on large data
sets of more than 100 million facts. We find that while it is possible to
implement diamonds in SQL, doing so is inefficient: our custom implementation
can be a hundred times faster than popular database engines (including a
row-store and a column-store).
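The interaction between dimensions gives diamond dicing a fixpoint flavor, which can be sketched as iterative pruning over a two-dimensional cube (a simplified illustration, not the paper's algorithms): removing a cell for one threshold can invalidate a previously qualifying value on the other dimension, so pruning must repeat until stable.

```python
def diamond(facts, row_min, col_min):
    """facts: dict mapping (row, col) -> measure. Repeatedly drop cells
    whose row total is below row_min or column total below col_min,
    until every surviving row and column meets both thresholds."""
    facts = dict(facts)
    changed = True
    while changed:
        row_tot, col_tot = {}, {}
        for (r, c), v in facts.items():
            row_tot[r] = row_tot.get(r, 0) + v
            col_tot[c] = col_tot.get(c, 0) + v
        keep = {(r, c): v for (r, c), v in facts.items()
                if row_tot[r] >= row_min and col_tot[c] >= col_min}
        changed = len(keep) != len(facts)
        facts = keep
    return facts

# hypothetical sales: (product, shop) -> revenue
sales = {("p1", "s1"): 150, ("p1", "s2"): 60,
         ("p2", "s1"): 300, ("p3", "s2"): 30}
core = diamond(sales, row_min=100, col_min=200)
```

In this toy cube, shop s2 misses its threshold, which drops p1's s2 cell and eliminates p3 entirely; the surviving "diamond" is the two cells in s1.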
Medians and Beyond: New Aggregation Techniques for Sensor Networks
Wireless sensor networks offer the potential to span and monitor large
geographical areas inexpensively. Sensors, however, have significant power
constraints (battery life), making communication very expensive. Another
important issue in the context of sensor-based information systems is that
individual sensor readings are inherently unreliable. In order to address these
two aspects, sensor database systems like TinyDB and Cougar enable in-network
data aggregation to reduce the communication cost and improve reliability. The
existing data aggregation techniques, however, are limited to relatively simple
types of queries such as SUM, COUNT, AVG, and MIN/MAX. In this paper we propose
a data aggregation scheme that significantly extends the class of queries that
can be answered using sensor networks. These queries include (approximate)
quantiles, such as the median, the most frequent data values, such as the
consensus value, a histogram of the data distribution, as well as range
queries. In our scheme, each sensor aggregates the data it has received from
other sensors into a message of fixed, user-specified size. We provide strict
theoretical guarantees on the approximation quality of the queries in terms of
the message size. We evaluate the performance of our aggregation scheme by
simulation and demonstrate its accuracy, scalability, and low resource
utilization for highly variable input data sets.
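A crude stand-in for the paper's fixed-size summaries (the actual structure is more refined and carries provable error bounds): each node keeps a sorted sample thinned back to k elements after every merge, so the message size stays constant as data flows up the aggregation tree.

```python
def merge_summaries(a, b, k):
    """Merge two sorted summaries, then thin back to k evenly spaced
    elements so the message stays fixed-size (error grows per merge)."""
    merged = sorted(a + b)
    if len(merged) <= k:
        return merged
    step = len(merged) / k
    return [merged[int(i * step)] for i in range(k)]

def approx_median(summary):
    return summary[len(summary) // 2]

K = 8  # fixed message size
readings = [list(range(s, s + 25)) for s in (0, 25, 50, 75)]  # 4 sensors
summaries = [merge_summaries(r, [], K) for r in readings]
# aggregate up a two-level tree, as sensors would relay toward the root
left = merge_summaries(summaries[0], summaries[1], K)
right = merge_summaries(summaries[2], summaries[3], K)
root = merge_summaries(left, right, K)
```

The true median of the 100 readings is 49.5; `approx_median(root)` lands close to it even though every message carried only 8 values.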