DISTRIBUTED MULTIDIMENSIONAL INDEXING FOR SCIENTIFIC DATA ANALYSIS APPLICATIONS
Scientific data analysis applications require large scale computing power to
effectively service client queries and also require large storage repositories
for datasets that are generated continually from sensors and simulations.
These scientific datasets are growing in size every day, and are becoming truly
enormous. The goal of this dissertation is to provide efficient multidimensional
indexing techniques that aid in navigating distributed scientific datasets.
In this dissertation, we show significant improvements in accessing
distributed large scientific datasets.
The first approach we took to improve access to subsets of large
multidimensional scientific datasets was data chunking. The contents of
scientific data files typically are a collection of multidimensional arrays,
along with the corresponding metadata. Data chunking groups data elements into
small chunks of a fixed, but data-specific, size to take advantage of
spatio-temporal locality since it is not efficient to index individual data
elements of large scientific datasets.
The second approach was the design of an efficient multidimensional index for
scientific datasets. This work investigates how existing multidimensional
indexing structures perform on chunked scientific datasets, and compares their
performance with that of our own indexing structure, SH-trees. Since R-trees
were proposed, various multidimensional indexing structures have been proposed.
However, there are a relatively small number of studies focused on improving
the performance of indexing geographically distributed datasets, especially
across heterogeneous machines. As a third approach, in an attempt to
accelerate indexing performance for distributed datasets, we proposed several
distributed multidimensional indexing schemes: replicated centralized indexing,
hierarchical two level indexing, and decentralized two level indexing.
Our experimental results show that substantial performance improvements
are gained from distributing a multidimensional index. However, the design
choices for distributed indexing, such as replication, partitioning, and
decentralization, must be carefully considered since they may decrease the overall
performance in certain situations. Therefore, this work provides performance
guidelines to aid in selecting the best distributed multidimensional indexing
scheme for various systems and applications. Finally, we describe how a
distributed multidimensional indexing scheme can be used by a distributed
multiple query optimization middleware as a case-study application to
generate better query plans by leveraging information about the contents of
remote caches.
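As an illustration of the hierarchical two level scheme named above, the following is a minimal sketch and not the dissertation's implementation. It assumes a global level that stores one bounding box per server and a local level that lists the chunks on that server; the class names `LocalIndex` and `GlobalIndex` are hypothetical, and a linear scan stands in for a real tree index such as an SH-tree.

```python
# Boxes are (lower, upper) tuples of coordinates, upper bound exclusive.

def overlaps(a, b):
    """True if two axis-aligned boxes intersect in every dimension."""
    return all(alo < bhi and blo < ahi
               for alo, ahi, blo, bhi in zip(a[0], a[1], b[0], b[1]))


class LocalIndex:
    """Per-server index over chunk bounding boxes (linear scan stand-in)."""

    def __init__(self):
        self.chunks = []  # list of (box, chunk_id)

    def insert(self, box, chunk_id):
        self.chunks.append((box, chunk_id))

    def search(self, query):
        return [cid for box, cid in self.chunks if overlaps(box, query)]

    def bounding_box(self):
        """Minimum bounding box of all chunks held by this server."""
        lowers = [box[0] for box, _ in self.chunks]
        uppers = [box[1] for box, _ in self.chunks]
        return (tuple(min(d) for d in zip(*lowers)),
                tuple(max(d) for d in zip(*uppers)))


class GlobalIndex:
    """Top level: one bounding box per server, used to prune servers."""

    def __init__(self, servers):
        self.servers = servers
        self.entries = {name: idx.bounding_box() for name, idx in servers.items()}

    def search(self, query):
        results = {}
        for name, mbr in self.entries.items():
            if overlaps(mbr, query):          # only contact servers that may match
                hits = self.servers[name].search(query)
                if hits:
                    results[name] = hits
        return results


a, b = LocalIndex(), LocalIndex()
a.insert(((0, 0), (10, 10)), "a0")
a.insert(((10, 0), (20, 10)), "a1")
b.insert(((50, 50), (60, 60)), "b0")
index = GlobalIndex({"a": a, "b": b})
result = index.search(((5, 5), (12, 8)))
```

In this sketch server "b" is never searched, because its bounding box does not intersect the query.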
Multiple Range Query Optimization with Distributed Cache Indexing
MQO is a distributed multiple query processing middleware that can
optimize query processing for data analysis applications on the Grid. It
has one or more proxies that act as front-end to a collection of backend
servers. The basic idea behind this architecture is semantic caching,
whereby queries can leverage available cached results in the proxy either
directly or through transformations. While this approach has been shown
to speed up query evaluation under multi-client workloads, the caching
infrastructure in the backend servers is not well used for query planning.
In this paper, we describe a distributed multidimensional indexing scheme
that enables the proxy to directly consider the cache contents available
at the backend servers for planning and scheduling. This approach is shown
to produce better query plans and faster query response times. We
experimentally demonstrate that system throughput can be improved by up to
66%, compared to either load-based or round-robin scheduling.
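The idea of letting the proxy consult backend cache contents when scheduling can be sketched as follows. This is a hedged illustration, not the MQO scheduler: the function names, the overlap-volume score, and the fallback to the least-loaded server are all assumptions made for the example.

```python
# Boxes are (lower, upper) tuples of coordinates, upper bound exclusive.

def overlap_volume(a, b):
    """Volume of the intersection of two axis-aligned boxes (0.0 if disjoint)."""
    vol = 1.0
    for alo, ahi, blo, bhi in zip(a[0], a[1], b[0], b[1]):
        side = min(ahi, bhi) - max(alo, blo)
        if side <= 0:
            return 0.0
        vol *= side
    return vol


def choose_server(query, cache_index, load):
    """Pick a backend server for a range query.

    cache_index: server name -> list of cached result boxes (assumed layout)
    load:        server name -> number of queued queries (assumed metric)
    """
    def reuse(server):
        # Best single cached box for this query on the given server.
        return max((overlap_volume(query, box) for box in cache_index[server]),
                   default=0.0)

    best = max(cache_index, key=reuse)
    if reuse(best) > 0:
        return best                      # some cached result can be reused
    return min(load, key=load.get)       # otherwise fall back to load-based choice


cache_index = {"s1": [((0, 0), (10, 10))], "s2": [((20, 20), (40, 40))]}
load = {"s1": 3, "s2": 1}
cached_choice = choose_server(((5, 5), (15, 15)), cache_index, load)
uncached_choice = choose_server(((100, 100), (110, 110)), cache_index, load)
```

The first query overlaps a result cached on "s1" and is sent there despite the higher load; the second has no cached overlap and goes to the least-loaded server.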
Indexing Cached Multidimensional Objects in Large Main Memory Systems
Semantic caches allow queries into large datasets to leverage cached
results either directly or through transformations, using semantic
information about the data objects in the cache. As the price of main
memory continues to drop and its size increases, the
size of semantic caches grows proportionately, and it is becoming
expensive to compare the semantic information for each data object in the
cache against a query predicate. Instead, we propose to create an index
for cached objects. Unlike straightforward linear scanning, indexing
cached objects creates additional overhead for cache replacement. Since
the contents of a semantic cache may change dynamically at a high rate,
the cache index must support fast inserts and deletes as well as fast
search. In this paper, we show that multidimensional indexing helps
navigate efficiently through a large
semantic cache in spite of the additional overhead and overall is
considerably less expensive than linear scanning. Little emphasis has been
laid upon the performance of multidimensional index inserts and deletes,
as opposed to search performance. We compare the performance of a few
widely used multidimensional indexing structures with our SH-tree, looking
at insert, delete, and search operations, and show that SH-trees overall
perform better for large semantic caches than the widely used indexing
techniques.
Longitudinal evolution of cortical thickness signature reflecting Lewy body dementia in isolated REM sleep behavior disorder: a prospective cohort study
Background
The isolated rapid-eye-movement sleep behavior disorder (iRBD) is a prodromal condition of Lewy body disease including Parkinson's disease and dementia with Lewy bodies (DLB). We aim to investigate the longitudinal evolution of DLB-related cortical thickness signature in a prospective iRBD cohort and evaluate the possible predictive value of the cortical signature index in predicting dementia-first phenoconversion in individuals with iRBD.
Methods
We enrolled 22 DLB patients, 44 healthy controls, and 50 video polysomnography-proven iRBD patients. Participants underwent 3-T magnetic resonance imaging (MRI) and clinical/neuropsychological evaluations. We characterized the DLB-related whole-brain cortical thickness spatial covariance pattern (DLB-pattern) that best differentiated DLB patients from age-matched controls, using the scaled subprofile model of principal components analysis. We analyzed clinical and neuropsychological correlates of the DLB-pattern expression scores and the mean values of the whole-brain cortical thickness in DLB and iRBD patients. With repeated MRI data during the follow-up in our prospective iRBD cohort, we investigated the longitudinal evolution of the cortical thickness signature toward Lewy body dementia. Finally, we analyzed the potential predictive value of the cortical thickness signature as a biomarker of phenoconversion in the iRBD cohort.
Results
The DLB-pattern was characterized by thinning of the temporal, orbitofrontal, and insular cortices and relative preservation of the precentral and inferior parietal cortices. The DLB-pattern expression scores correlated with attentional and frontal executive dysfunction (Trail Making Test-A and B: R = −0.55, P = 0.024 and R = −0.56, P = 0.036, respectively) as well as visuospatial impairment (Rey-figure copy test: R = −0.54, P = 0.0047). The longitudinal trajectory of DLB-pattern revealed an increasing pattern above the cut-off in the dementia-first phenoconverters (Pearson's correlation, R = 0.74, P = 6.8 × 10⁻⁴) but no significant change in parkinsonism-first phenoconverters (R = 0.0063, P = 0.98). The mean value of the whole-brain cortical thickness predicted phenoconversion in iRBD patients with hazard ratio of 9.33 [1.16–74.12]. The increase in DLB-pattern expression score discriminated dementia-first from parkinsonism-first phenoconversions with 88.2% accuracy.
Conclusion
The cortical thickness signature can effectively reflect the longitudinal evolution of Lewy body dementia in the iRBD population. Replication studies would further validate the utility of this imaging marker in iRBD.
A Comparative Study of Spatial Indexing Techniques for Multidimensional Scientific Datasets
Scientific applications that query into very large multidimensional datasets are becoming more common. These datasets are growing in size every day, and are becoming truly enormous, making it infeasible to index individual data elements. We have instead been experimenting with chunking the datasets to index them, grouping data elements into small chunks of a fixed, but dataset-specific, size to take advantage of spatial locality. While spatial indexing structures based on R-trees perform reasonably well for the rectangular bounding boxes of such chunked datasets, other indexing structures based on KDB-trees, such as Hybrid trees, have been shown to perform very well for point data. In this paper, we investigate how all these indexing structures perform for multidimensional scientific datasets, and compare their features and performance with that of SH-trees, an extension of Hybrid trees, for indexing multidimensional rectangles. Our experimental results show that the algorithms for building and searching SH-trees outperform those for R-trees, R*-trees, and X-trees for both real application and synthetic datasets and queries. We show that the SH-tree algorithms perform well for both low and high dimensional data, and that they scale well to high dimensions both for building and searching the trees.
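The chunking step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, chunk boxes use an exclusive upper bound, and a linear scan over the boxes stands in for the R-tree or SH-tree that would index them in practice.

```python
from itertools import product


def chunk_bounding_boxes(shape, chunk_shape):
    """Split an array of `shape` into fixed-size chunks.

    Returns a list of (lower, upper) boxes with exclusive upper bounds;
    chunks on the array boundary are clipped to the array extent.
    """
    starts_per_dim = [range(0, n, c) for n, c in zip(shape, chunk_shape)]
    boxes = []
    for lower in product(*starts_per_dim):
        upper = tuple(min(lo + c, n)
                      for lo, c, n in zip(lower, chunk_shape, shape))
        boxes.append((lower, upper))
    return boxes


def overlaps(box, query):
    """True if two axis-aligned boxes intersect in every dimension."""
    return all(lo < qhi and qlo < hi
               for lo, hi, qlo, qhi in zip(box[0], box[1], query[0], query[1]))


def range_query(boxes, query):
    """Chunks that a range query must read (linear scan stand-in for an index)."""
    return [box for box in boxes if overlaps(box, query)]


boxes = chunk_bounding_boxes((100, 100), (25, 50))   # 4 x 2 grid of chunks
hits = range_query(boxes, ((20, 40), (30, 60)))      # query straddles 4 chunks
```

The index then stores one rectangle per chunk rather than one entry per data element, which is what makes indexing very large arrays feasible.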
Parallel Tree Traversal for Nearest Neighbor Query on the GPU
The similarity search problem arises in many application domains, including computer graphics, information retrieval, statistics, computational biology, and scientific data processing. Recently, several studies have accelerated k-nearest neighbor (kNN) queries using GPUs, but most of them develop brute-force exhaustive scanning algorithms that leverage a large number of GPU cores, and none of them employs GPUs for an n-ary tree-structured index. It is known that multi-dimensional hierarchical indexing trees such as R-trees are inherently not well suited for GPUs because of their irregular tree traversal and memory access patterns. Traversing hierarchical tree structures in an irregular manner makes it difficult to exploit parallelism since GPUs are tailored for deterministic memory accesses. In this work, we develop a data parallel tree traversal algorithm, Parallel Scan and Backtrack (PSB), for kNN query processing on the GPU. This algorithm traverses a multi-dimensional tree-structured index while avoiding warp divergence problems. In order to take advantage of accessing contiguous memory blocks, the proposed PSB algorithm performs linear scanning of sibling leaf nodes, which increases the chance to optimize the parallel SIMD algorithm. We evaluate the performance of the PSB algorithm against the classic branch-and-bound kNN query processing algorithm. Our experiments with real datasets show that the PSB algorithm is faster by a large margin than the branch-and-bound algorithm.
Improving Access to Multi-dimensional Self-describing Scientific Dataset
Applications that query into very large multidimensional datasets are becoming more common. Many self-describing scientific data file formats have also emerged, which have structural metadata to help navigate the multi-dimensional arrays that are stored in the files. The files may also contain application-specific semantic metadata. In this paper, we discuss efficient methods for performing searches for subsets of multi-dimensional data objects, using semantic information to build multidimensional indexes, and group data items into properly sized chunks to maximize disk I/O bandwidth. This work is the first step in the design and implementation of a generic indexing library that will work with various high-dimension scientific data file formats containing semantic information about the stored data. To validate the approach, we have implemented indexing structures for NASA remote sensing data stored in the HDF format with a specific schema (HDF-EOS), and show the performance improvements that are gained from indexing the datasets, compared to using the existing HDF library for accessing the data.
Co-processing heterogeneous parallel index for multi-dimensional datasets
We present a novel multi-dimensional range query co-processing scheme for the CPU and GPU. It has been reported that traversing hierarchical tree structures in parallel is inherently not efficient because of large branching factors. Besides, it is known that the recursive tree traversal algorithm required for multi-dimensional range queries is not well suited for the GPU architecture owing to its small shared memory.
In this paper, we propose co-processing range queries using both the CPU and GPU to make the most of each architecture. In the Hybrid tree that we present in this paper, the CPU navigates the internal nodes of the hierarchical tree structure and the GPU scans leaf nodes linearly using a massive number of processing units. With the co-processing scheme, we can asynchronously leverage the strengths of each architecture. We also propose a novel dynamic GPU block scheduling algorithm for multiple range queries. In our scheduling algorithm, we consider the selection ratio of each query to determine the number of GPU blocks to launch. By assigning the right number of GPU blocks, we can significantly improve the query processing throughput for multiple concurrent queries. Our extensive experimental study shows that the proposed co-processing scheme achieves up to 12× faster query response time than the state-of-the-art GPU tree traversal algorithm. We also show that our dynamic GPU block assignment algorithm improves the query processing throughput by up to 4×.
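The abstract says only that the scheduler considers each query's selection ratio when deciding how many GPU blocks to launch. One simple way to realize that idea is a proportional split, sketched below. The proportional rule, the one-block minimum, and the function name `assign_blocks` are assumptions for illustration, not the paper's algorithm.

```python
def assign_blocks(selection_ratios, total_blocks):
    """Split `total_blocks` GPU blocks across concurrent range queries.

    Each query gets at least one block; the remaining blocks are divided
    in proportion to the queries' selection ratios (an assumed policy).
    """
    n = len(selection_ratios)
    if n == 0:
        return []
    if total_blocks < n:
        raise ValueError("need at least one block per query")

    total = sum(selection_ratios)
    spare = total_blocks - n
    if total == 0:
        shares = [1 + spare // n] * n            # no information: split evenly
    else:
        shares = [1 + int(spare * r / total) for r in selection_ratios]

    # Rounding down may leave a few blocks unassigned; give them to the
    # queries with the largest selection ratios first.
    leftover = total_blocks - sum(shares)
    order = sorted(range(n), key=lambda i: selection_ratios[i], reverse=True)
    for i in order[:leftover]:
        shares[i] += 1
    return shares


blocks = assign_blocks([0.5, 0.3, 0.2], 16)
```

Here the query that selects half of the data receives half of the 16 blocks, so a broad query does not hold up the batch while narrow queries leave blocks idle.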
Analyzing design choices for distributed multidimensional indexing
Scientific datasets are often stored on distributed archival storage systems, because geographically distributed sensor devices store the datasets on their local machines and also because the size of scientific datasets demands a large amount of disk space. Multidimensional indexing techniques have been shown to greatly improve range query performance into large scientific datasets. In this paper, we discuss several ways of distributing a multidimensional index in order to speed up access to large distributed scientific datasets. This paper compares the designs, challenges, and problems for distributed multidimensional indexing schemes, and provides a comprehensive performance study of distributed indexing to provide guidelines for choosing a distributed multidimensional index for a specific data analysis application.