A Survey on Array Storage, Query Languages, and Systems
Since scientific investigation is one of the most important providers of
massive amounts of ordered data, there is a renewed interest in array data
processing in the context of Big Data. To the best of our knowledge, a unified
resource that summarizes and analyzes array processing research over its long
existence is currently missing. In this survey, we provide a guide for past,
present, and future research in array processing. The survey is organized along
three main topics. Array storage discusses all the aspects related to array
partitioning into chunks. The identification of a reduced set of array
operators to form the foundation for an array query language is analyzed across
multiple such proposals. Lastly, we survey real systems for array processing.
The result is a thorough survey on array data storage and processing that
should be consulted by anyone interested in this research topic, independent of
experience level. The survey is not complete though. We greatly appreciate
pointers towards any work we might have forgotten to mention.
Comment: 44 page
HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces
Nearest neighbor searching of large databases in high-dimensional spaces is
inherently difficult due to the curse of dimensionality. A flavor of
approximation is, therefore, necessary to practically solve the problem of
nearest neighbor search. In this paper, we propose a novel yet simple indexing
scheme, HD-Index, to solve the problem of approximate k-nearest neighbor
queries in massive high-dimensional databases. HD-Index consists of a set of
novel hierarchical structures called RDB-trees built on Hilbert keys of
database objects. The leaves of the RDB-trees store distances of database
objects to reference objects, thereby allowing efficient pruning using distance
filters. In addition to the triangle inequality, we also use the Ptolemaic inequality
to produce better lower bounds. Experiments on massive (up to billion scale)
high-dimensional (up to 1000+) datasets show that HD-Index is effective,
efficient, and scalable.
Comment: PVLDB 11(8):906-919, 201
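The distance filters mentioned in this abstract can be illustrated with a minimal sketch. It assumes Euclidean points and exactly two reference objects; the function names (`triangle_lower_bound`, `ptolemaic_lower_bound`, `survives_filter`) are hypothetical and are not taken from the HD-Index paper. It shows only how stored object-to-reference distances give lower bounds on the query-to-object distance, not the RDB-tree or Hilbert-key machinery.

```python
import math

def triangle_lower_bound(dq_r, do_r):
    # Triangle inequality: |d(q, r) - d(o, r)| <= d(q, o)
    return abs(dq_r - do_r)

def ptolemaic_lower_bound(dq_r1, dq_r2, do_r1, do_r2, d_r1_r2):
    # Ptolemy's inequality (valid in Euclidean space):
    # |d(q,r1)*d(o,r2) - d(q,r2)*d(o,r1)| / d(r1,r2) <= d(q, o)
    return abs(dq_r1 * do_r2 - dq_r2 * do_r1) / d_r1_r2

def survives_filter(query, obj, r1, r2, radius):
    """Return True if neither lower bound rules out d(query, obj) <= radius."""
    dq_r1, dq_r2 = math.dist(query, r1), math.dist(query, r2)
    # In an index, these object-to-reference distances would be precomputed.
    do_r1, do_r2 = math.dist(obj, r1), math.dist(obj, r2)
    lb = max(
        triangle_lower_bound(dq_r1, do_r1),
        triangle_lower_bound(dq_r2, do_r2),
        ptolemaic_lower_bound(dq_r1, dq_r2, do_r1, do_r2, math.dist(r1, r2)),
    )
    return lb <= radius
```

An object is discarded without computing its true distance whenever the largest lower bound already exceeds the search radius; only survivors need an exact distance computation.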
Doctor of Philosophy
dissertation
The increase in computational power of supercomputers is enabling complex scientific phenomena to be simulated at ever-increasing resolution and fidelity. With these simulations routinely producing large volumes of data, performing efficient I/O at this scale has become a very difficult task. Large-scale parallel writes are challenging due to the complex interdependencies between I/O middleware and hardware. Analytics-appropriate reads are traditionally hindered by bottlenecks in I/O access. Moreover, the two components of I/O, data generation from simulations (writes) and data exploration for analysis and visualization (reads), have substantially different data access requirements. Parallel writes, performed on supercomputers, often deploy aggregation strategies to permit large-sized contiguous access. Analysis and visualization tasks, usually performed on computationally modest resources, require fast access to localized subsets or multiresolution representations of the data. This dissertation tackles the problem of parallel I/O while bridging the gap between large-scale writes and analytics-appropriate reads. The focus of this work is to develop an end-to-end adaptive-resolution data movement framework that provides efficient I/O, while supporting the full spectrum of modern HPC hardware. This is achieved by developing technology for highly scalable and tunable parallel I/O, applicable to both traditional parallel data formats and multiresolution data formats, which are directly appropriate for analysis and visualization. To demonstrate the efficacy of the approach, a novel library (PIDX) is developed that is highly tunable and capable of adaptive-resolution parallel I/O to a multiresolution data format. Adaptive resolution storage and I/O, which allows subsets of a simulation to be accessed at varying spatial resolutions, can yield significant improvements to both the storage performance and I/O time.
The library provides a set of parameters that controls the storage format and the nature of data aggregation across the network; further, a machine learning-based model is constructed that tunes these parameters for maximum throughput. This work is empirically demonstrated by showing parallel I/O scaling up to 768K cores within a framework flexible enough to handle adaptive resolution I/O
The INCF Digital Atlasing Program: Report on Digital Atlasing Standards in the Rodent Brain
The goal of the INCF Digital Atlasing Program is to provide the vision and direction necessary to make the rapidly growing collection of multidimensional data of the rodent brain (images, gene expression, etc.) widely accessible and usable to the international research community. This Digital Brain Atlasing Standards Task Force was formed in May 2008 to investigate the state of rodent brain digital atlasing, and formulate standards, guidelines, and policy recommendations.

Our first objective has been the preparation of a detailed document that includes the vision and specific description of an infrastructure, systems and methods capable of serving the scientific goals of the community, as well as practical issues for achieving the goals. This report builds on the 1st INCF Workshop on Mouse and Rat Brain Digital Atlasing Systems (Boline et al., 2007, _Nature Precedings_, doi:10.1038/npre.2007.1046.1) and includes a more detailed analysis of both the current state and desired state of digital atlasing along with specific recommendations for achieving these goals
DISTRIBUTED MULTIDIMENSIONAL INDEXING FOR SCIENTIFIC DATA ANALYSIS APPLICATIONS
Scientific data analysis applications require large scale computing power to
effectively service client queries and also require large storage repositories
for datasets that are generated continually from sensors and simulations.
These scientific datasets are growing in size every day, and are becoming truly
enormous. The goal of this dissertation is to provide efficient multidimensional
indexing techniques that aid in navigating distributed scientific datasets.
In this dissertation, we show significant improvements in accessing
distributed large scientific datasets.
The first approach we took to improve access to subsets of large
multidimensional scientific datasets was data chunking. The contents of
scientific data files typically are a collection of multidimensional arrays,
along with the corresponding metadata. Data chunking groups data elements into
small chunks of a fixed, but data-specific, size to take advantage of
spatio-temporal locality since it is not efficient to index individual data
elements of large scientific datasets.
The second approach was the design of an efficient multidimensional index for
scientific datasets. This work investigates how existing multidimensional
indexing structures perform on chunked scientific datasets, and compares their
performance with that of our own indexing structure, SH-trees. Since the R-tree
was proposed, many other multidimensional indexing structures have followed.
However, relatively few studies have focused on improving
the performance of indexing geographically distributed datasets, especially
across heterogeneous machines. As a third approach, in an attempt to
accelerate indexing performance for distributed datasets, we proposed several
distributed multidimensional indexing schemes: replicated centralized indexing,
hierarchical two level indexing, and decentralized two level indexing.
Our experimental results show that distributing the multidimensional index
yields large performance improvements. However, the design
choices for distributed indexing, such as replication, partitioning, and
decentralization, must be carefully considered since they may decrease the overall
performance in certain situations. Therefore, this work provides performance
guidelines to aid in selecting the best distributed multidimensional indexing
scheme for various systems and applications. Finally, we describe how a
distributed multidimensional indexing scheme can be used by a distributed
multiple query optimization middleware as a case-study application to
generate better query plans by leveraging information about the contents of
remote caches
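The data chunking idea described in this abstract can be sketched as follows. This is a minimal illustration that assumes fixed-size, axis-aligned chunks over integer cell coordinates; the function names `chunk_of` and `chunks_overlapping` are hypothetical and do not come from the dissertation. The point is that an index then needs one entry per chunk rather than one per data element.

```python
from itertools import product

def chunk_of(cell, chunk_shape):
    """Map an array cell coordinate to the coordinate of the chunk holding it."""
    return tuple(c // s for c, s in zip(cell, chunk_shape))

def chunks_overlapping(lo, hi, chunk_shape):
    """List chunk coordinates intersecting the inclusive cell box [lo, hi]."""
    ranges = [range(l // s, h // s + 1) for l, h, s in zip(lo, hi, chunk_shape)]
    return list(product(*ranges))
```

For example, with a chunk shape of `(4, 8)`, a range query over cells `(3, 7)` to `(4, 8)` touches four chunks, so only four index entries need to be examined regardless of how many cells each chunk holds.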
Exploiting Graphics Processing Units for Massively Parallel Multi-Dimensional Indexing
Department of Computer Engineering
Scientific applications process truly large amounts of multi-dimensional datasets. To efficiently navigate such datasets, various multi-dimensional indexing structures, such as the R-tree, have been extensively studied for the past couple of decades.
Since the GPU has emerged as a new cost-effective performance accelerator, it is now common to leverage the massive parallelism of the GPU in various applications such as medical image processing, computational chemistry, and particle physics.
However, hierarchical multi-dimensional indexing structures are inherently not well suited for parallel processing because their irregular memory access patterns make it difficult to exploit massive parallelism. Moreover, recursive tree traversal often fails due to the small run-time stack and cache memory in the GPU.
First, we propose the Massively Parallel Three-phase Scanning (MPTS) R-tree traversal algorithm, which avoids irregular memory access patterns and recursive tree traversal so that the GPU can access tree nodes in a sequential manner. The experimental study shows that the MPTS R-tree traversal algorithm consistently outperforms the traditional recursive R-tree search algorithm for multi-dimensional range query processing.
Next, we focus on reducing the query response time by extending n-ary multi-dimensional indexing structures such as the R-tree so that a large number of GPU threads cooperate to process a single query in parallel. This is motivated by the fact that the number of concurrently submitted queries in scientific data analysis applications is relatively small compared with enterprise database systems and ray tracing in computer graphics. Hence, we propose a novel variant of R-trees, the Massively Parallel Hilbert R-Tree (MPHR-Tree), which is designed for a novel parallel tree traversal algorithm, Massively Parallel Restart Scanning (MPRS). The MPRS algorithm traverses the MPHR-Tree in mostly contiguous memory access patterns without recursion, which offers more chances to optimize the parallel SIMD algorithm. Our extensive experimental results show that the MPRS algorithm outperforms the other stackless tree traversal algorithms, which are designed for efficient ray tracing in the computer graphics community.
Furthermore, we develop a query co-processing scheme that makes use of both the CPU and GPU. In this approach, we store the internal and leaf nodes of upper tree in CPU host memory and GPU device memory, respectively. We let the CPU traverse internal nodes because the conditional branches in hierarchical tree structures often cause a serious warp divergence problem in the GPU. For leaf nodes, the GPU scans a large number of leaf nodes in parallel based on the selection ratio of a given range query. It is well known that the GPU is superior to the CPU for parallel scanning. The experimental results show that our proposed multi-dimensional range query co-processing scheme improves the query response time by up to 12x and query throughput by up to 4x compared to the state-of-the-art GPU tree traversal algorithm.
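The division of labor in the co-processing scheme can be sketched in a few lines. This is only an illustration under stated assumptions, not the thesis's MPTS or MPRS algorithms: it uses a hypothetical two-level layout in which each internal node carries a bounding box and a contiguous range of leaf slots, and a sequential Python loop stands in for the GPU's parallel leaf scan. The names `overlaps` and `co_process` are invented for this example.

```python
def overlaps(a, b):
    """Test whether two boxes, each given as (lo, hi) coordinate tuples, intersect."""
    (alo, ahi), (blo, bhi) = a, b
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(alo, ahi, blo, bhi))

def co_process(internal_nodes, leaf_mbrs, query):
    """Return indices of leaf boxes that intersect the query box.

    internal_nodes: list of (mbr, (start, end)) with end exclusive.
    leaf_mbrs: flat list of leaf boxes stored contiguously.
    """
    hits = []
    for node_mbr, (start, end) in internal_nodes:
        # "CPU" step: branchy pruning over the small set of internal nodes.
        if overlaps(node_mbr, query):
            # "GPU" step stand-in: every leaf test in the contiguous range is
            # independent, so the tests could run in parallel.
            hits.extend(i for i in range(start, end)
                        if overlaps(leaf_mbrs[i], query))
    return hits
```

Keeping the leaves in one flat, contiguous array is what makes the second step a regular scan with no recursion or per-thread stack.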
Indexing Cached Multidimensional Objects in Large Main Memory Systems
Semantic caches allow queries into large datasets to leverage cached
results either directly or through transformations, using semantic
information about the data objects in the cache. As the price of main
memory continues to drop and its size increases, the
size of semantic caches grows proportionately, and it is becoming
expensive to compare the semantic information for each data object in the
cache against a query predicate. Instead, we propose to create an index
for cached objects. Unlike straightforward linear scanning, indexing
cached objects creates additional overhead for cache replacement. Since
the contents of a semantic cache may change dynamically at a high rate,
the cache index must support fast inserts and deletes as well as fast
search. In this paper, we show that multidimensional indexing helps
navigate efficiently through a large
semantic cache in spite of the additional overhead and overall is
considerably less expensive than linear scanning. Little emphasis has been
placed on the performance of multidimensional index inserts and deletes,
as opposed to search performance. We compare the performance of a few
widely used multidimensional indexing structures with our SH-tree, looking
at insert, delete, and search operations, and show that SH-trees overall
perform better for large semantic caches than the widely used indexing
techniques