59 research outputs found

    DISTRIBUTED MULTIDIMENSIONAL INDEXING FOR SCIENTIFIC DATA ANALYSIS APPLICATIONS

    Scientific data analysis applications require large-scale computing power to service client queries effectively, and large storage repositories for datasets that are generated continually by sensors and simulations. These scientific datasets are growing every day and are becoming truly enormous. The goal of this dissertation is to provide efficient multidimensional indexing techniques that aid in navigating distributed scientific datasets. In this dissertation, we show significant improvements in accessing large distributed scientific datasets. The first approach we took to improve access to subsets of large multidimensional scientific datasets was data chunking. The contents of scientific data files are typically a collection of multidimensional arrays, along with the corresponding metadata. Because it is not efficient to index individual data elements of large scientific datasets, data chunking groups data elements into small chunks of a fixed, but data-specific, size to take advantage of spatio-temporal locality. The second approach was the design of an efficient multidimensional index for scientific datasets. This work investigates how existing multidimensional indexing structures perform on chunked scientific datasets, and compares their performance with that of our own indexing structure, SH-trees. Many multidimensional indexing structures have been proposed since R-trees were introduced. However, relatively few studies have focused on improving the performance of indexing geographically distributed datasets, especially across heterogeneous machines. As a third approach, in an attempt to accelerate indexing performance for distributed datasets, we proposed several distributed multidimensional indexing schemes: replicated centralized indexing, hierarchical two-level indexing, and decentralized two-level indexing.
Our experimental results show that distributing the multidimensional index yields large performance improvements. However, the design choices for distributed indexing, such as replication, partitioning, and decentralization, must be carefully considered since they may decrease the overall performance in certain situations. Therefore, this work provides performance guidelines to aid in selecting the best distributed multidimensional indexing scheme for various systems and applications. Finally, as a case-study application, we describe how a distributed multiple query optimization middleware can use a distributed multidimensional indexing scheme to generate better query plans by leveraging information about the contents of remote caches.

    Multiple Range Query Optimization with Distributed Cache Indexing

    MQO is a distributed multiple query processing middleware that can optimize query processing for data analysis applications on the Grid. It has one or more proxies that act as front ends to a collection of backend servers. The basic idea behind this architecture is semantic caching, whereby queries can leverage available cached results in the proxy either directly or through transformations. While this approach has been shown to speed up query evaluation under multi-client workloads, the caching infrastructure in the backend servers is not well used for query planning. In this paper, we describe a distributed multidimensional indexing scheme that enables the proxy to directly consider the cache contents available at the backend servers for planning and scheduling. This approach is shown to produce better query plans and faster query response times. We experimentally demonstrate that system throughput can be improved by up to 66%, compared to either load-based or round-robin scheduling.
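The idea of letting the proxy consider backend cache contents when scheduling can be sketched roughly as follows. This is only an illustration under stated assumptions, not the MQO implementation: the names `overlap_volume` and `pick_server`, the dictionary used as the proxy-side index, and the scoring rule (sum of overlap volumes, ties broken by load) are all hypothetical.

```python
from typing import Dict, List, Tuple

# A box is (lower corner, upper corner) of an axis-aligned range query or cached result.
Box = Tuple[Tuple[float, ...], Tuple[float, ...]]


def overlap_volume(a: Box, b: Box) -> float:
    """Volume of the intersection of two axis-aligned boxes (0.0 if they are disjoint)."""
    volume = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a[0], a[1], b[0], b[1]):
        side = min(hi_a, hi_b) - max(lo_a, lo_b)
        if side <= 0:
            return 0.0
        volume *= side
    return volume


def pick_server(query: Box,
                cache_index: Dict[str, List[Box]],
                load: Dict[str, int]) -> str:
    """Choose the backend whose cached boxes overlap the query most.

    Assumption: overlaps are simply summed, so cached boxes that overlap
    each other are double counted. Ties fall back to the least-loaded server.
    """
    def score(server: str):
        reuse = sum(overlap_volume(query, box) for box in cache_index[server])
        return (-reuse, load[server], server)

    return min(cache_index, key=score)


# Hypothetical proxy-side view of two backend caches.
cache_index = {
    "s1": [((0.0, 0.0), (4.0, 4.0))],
    "s2": [((10.0, 10.0), (20.0, 20.0))],
}
load = {"s1": 5, "s2": 0}
```

With this toy index, a query that falls inside the box cached at `s1` is routed there even though `s1` is busier, while a query that matches no cached box degrades to plain load-based scheduling.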

    Indexing Cached Multidimensional Objects in Large Main Memory Systems

    Semantic caches allow queries into large datasets to leverage cached results either directly or through transformations, using semantic information about the data objects in the cache. As the price of main memory continues to drop and its size increases, the size of semantic caches grows proportionately, and it is becoming expensive to compare the semantic information for each data object in the cache against a query predicate. Instead, we propose to create an index for cached objects. Unlike straightforward linear scanning, indexing cached objects creates additional overhead for cache replacement. Since the contents of a semantic cache may change dynamically at a high rate, the cache index must support fast inserts and deletes as well as fast search. In this paper, we show that multidimensional indexing helps navigate efficiently through a large semantic cache in spite of the additional overhead, and overall is considerably less expensive than linear scanning. Little emphasis has been placed on the performance of multidimensional index inserts and deletes, as opposed to search performance. We compare the performance of a few widely used multidimensional indexing structures with our SH-tree, looking at insert, delete, and search operations, and show that SH-trees overall perform better for large semantic caches than the widely used indexing techniques.

    Longitudinal evolution of cortical thickness signature reflecting Lewy body dementia in isolated REM sleep behavior disorder: a prospective cohort study

    Background: Isolated rapid-eye-movement sleep behavior disorder (iRBD) is a prodromal condition of Lewy body disease, including Parkinson's disease and dementia with Lewy bodies (DLB). We aim to investigate the longitudinal evolution of the DLB-related cortical thickness signature in a prospective iRBD cohort and to evaluate the value of the cortical signature index in predicting dementia-first phenoconversion in individuals with iRBD. Methods: We enrolled 22 DLB patients, 44 healthy controls, and 50 video polysomnography-proven iRBD patients. Participants underwent 3-T magnetic resonance imaging (MRI) and clinical/neuropsychological evaluations. We characterized the DLB-related whole-brain cortical thickness spatial covariance pattern (DLB-pattern) that best differentiated DLB patients from age-matched controls, using the scaled subprofile model of principal components analysis. We analyzed clinical and neuropsychological correlates of the DLB-pattern expression scores and the mean values of the whole-brain cortical thickness in DLB and iRBD patients. With repeated MRI data during the follow-up in our prospective iRBD cohort, we investigated the longitudinal evolution of the cortical thickness signature toward Lewy body dementia. Finally, we analyzed the potential predictive value of the cortical thickness signature as a biomarker of phenoconversion in the iRBD cohort. Results: The DLB-pattern was characterized by thinning of the temporal, orbitofrontal, and insular cortices and relative preservation of the precentral and inferior parietal cortices. The DLB-pattern expression scores correlated with attentional and frontal executive dysfunction (Trail Making Test-A and B: R = −0.55, P = 0.024 and R = −0.56, P = 0.036, respectively) as well as visuospatial impairment (Rey-figure copy test: R = −0.54, P = 0.0047).
The longitudinal trajectory of the DLB-pattern revealed an increasing pattern above the cut-off in the dementia-first phenoconverters (Pearson's correlation, R = 0.74, P = 6.8 × 10⁻⁴) but no significant change in parkinsonism-first phenoconverters (R = 0.0063, P = 0.98). The mean value of the whole-brain cortical thickness predicted phenoconversion in iRBD patients with a hazard ratio of 9.33 [1.16–74.12]. The increase in DLB-pattern expression score discriminated dementia-first from parkinsonism-first phenoconversions with 88.2% accuracy. Conclusion: The cortical thickness signature can effectively reflect the longitudinal evolution of Lewy body dementia in the iRBD population. Replication studies would further validate the utility of this imaging marker in iRBD.

    Multi-dimensional Range Query Processing on the GPU


    A Comparative Study of Spatial Indexing Techniques for Multidimensional Scientific Datasets

    Scientific applications that query into very large multidimensional datasets are becoming more common. These datasets are growing in size every day, and are becoming truly enormous, making it infeasible to index individual data elements. We have instead been experimenting with chunking the datasets to index them, grouping data elements into small chunks of a fixed, but dataset-specific, size to take advantage of spatial locality. While spatial indexing structures based on R-trees perform reasonably well for the rectangular bounding boxes of such chunked datasets, other indexing structures based on KDB-trees, such as Hybrid trees, have been shown to perform very well for point data. In this paper, we investigate how all these indexing structures perform for multidimensional scientific datasets, and compare their features and performance with those of SH-trees, an extension of Hybrid trees, for indexing multidimensional rectangles. Our experimental results show that the algorithms for building and searching SH-trees outperform those for R-trees, R*-trees, and X-trees for both real application and synthetic datasets and queries. We show that the SH-tree algorithms perform well for both low- and high-dimensional data, and that they scale well to high dimensions both for building and searching the trees.
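As a rough illustration of the chunking idea in this abstract, the sketch below splits an array's index space into fixed-size chunks, records each chunk's rectangular bounding box, and answers a range query by testing boxes for overlap. It is a minimal sketch under assumptions: the function names are hypothetical, and the linear scan in `range_query` is only a stand-in for a tree index such as an R-tree or SH-tree, which the source does not describe in enough detail to reproduce.

```python
from itertools import product
from typing import Iterator, List, Sequence, Tuple

# A box is (inclusive lower corner, exclusive upper corner) in array index space.
Box = Tuple[Tuple[int, ...], Tuple[int, ...]]


def chunk_boxes(shape: Sequence[int], chunk: Sequence[int]) -> Iterator[Box]:
    """Yield the bounding box of every fixed-size chunk covering an array of `shape`."""
    starts = [range(0, n, c) for n, c in zip(shape, chunk)]
    for origin in product(*starts):
        # Chunks on the upper edge are clipped to the array bounds.
        upper = tuple(min(o + c, n) for o, c, n in zip(origin, chunk, shape))
        yield origin, upper


def overlaps(a: Box, b: Box) -> bool:
    """True if two half-open boxes intersect in every dimension."""
    return all(lo_a < hi_b and lo_b < hi_a
               for lo_a, hi_a, lo_b, hi_b in zip(a[0], a[1], b[0], b[1]))


def range_query(boxes: List[Box], query: Box) -> List[Box]:
    """Return the chunks a range query must read (linear scan stand-in for a tree index)."""
    return [box for box in boxes if overlaps(box, query)]


# A 10 x 10 array split into 4 x 4 chunks gives a 3 x 3 grid of chunk boxes.
boxes = list(chunk_boxes((10, 10), (4, 4)))
hits = range_query(boxes, ((3, 3), (5, 5)))
```

Indexing the nine chunk boxes instead of the hundred individual elements is the point of the technique: the index stays small, and a query touches only the chunks whose boxes it intersects.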

    Parallel Tree Traversal for Nearest Neighbor Query on the GPU

    The similarity search problem is found in many application domains, including computer graphics, information retrieval, statistics, computational biology, and scientific data processing. Recently, several studies have accelerated k-nearest neighbor (kNN) queries using GPUs, but most of them develop brute-force exhaustive scanning algorithms that leverage a large number of GPU cores, and none of the prior works employ GPUs for an n-ary tree structured index. It is known that multi-dimensional hierarchical indexing trees such as R-trees are inherently not well suited for GPUs because of their irregular tree traversal and memory access patterns. Traversing hierarchical tree structures in an irregular manner makes it difficult to exploit parallelism since GPUs are tailored for deterministic memory accesses. In this work, we develop a data parallel tree traversal algorithm, Parallel Scan and Backtrack (PSB), for kNN query processing on the GPU. This algorithm traverses a multi-dimensional tree structured index while avoiding warp divergence problems. To take advantage of accessing contiguous memory blocks, the proposed PSB algorithm performs linear scanning of sibling leaf nodes, which increases the chance to optimize the parallel SIMD algorithm. We evaluate the performance of the PSB algorithm against the classic branch-and-bound kNN query processing algorithm. Our experiments with real datasets show that the PSB algorithm is faster by a large margin than the branch-and-bound algorithm.

    Improving Access to Multi-dimensional Self-describing Scientific Dataset

    Applications that query into very large multidimensional datasets are becoming more common. Many self-describing scientific data file formats have also emerged, which have structural metadata to help navigate the multi-dimensional arrays that are stored in the files. The files may also contain application-specific semantic metadata. In this paper, we discuss efficient methods for performing searches for subsets of multi-dimensional data objects, using semantic information to build multidimensional indexes, and grouping data items into properly sized chunks to maximize disk I/O bandwidth. This work is the first step in the design and implementation of a generic indexing library that will work with various high-dimensional scientific data file formats containing semantic information about the stored data. To validate the approach, we have implemented indexing structures for NASA remote sensing data stored in the HDF format with a specific schema (HDF-EOS), and show the performance improvements that are gained from indexing the datasets, compared to using the existing HDF library for accessing the data.

    Co-processing heterogeneous parallel index for multi-dimensional datasets

    We present a novel multi-dimensional range query co-processing scheme for the CPU and GPU. It has been reported that traversing hierarchical tree structures in parallel is inherently not efficient because of large branching factors. In addition, it is known that the recursive tree traversal algorithm required for multi-dimensional range queries is not well suited for the GPU architecture owing to its small shared memory. In this paper, we propose co-processing range queries using both the CPU and GPU to make the most of each architecture. In the Hybrid tree that we present in this paper, the CPU navigates the internal nodes of the hierarchical tree structure and the GPU scans leaf nodes linearly using a very large number of processing units. With the co-processing scheme, we can asynchronously leverage the strengths of each architecture. We also propose a novel dynamic GPU block scheduling algorithm for multiple range queries. In our scheduling algorithm, we consider the selection ratio of each query to determine the number of GPU blocks to launch. By assigning the right number of GPU blocks, we can significantly improve the query processing throughput for multiple concurrent queries. Our extensive experimental study shows that the proposed co-processing scheme achieves up to 12× faster query response time than the state-of-the-art GPU tree traversal algorithm. We also show that our dynamic GPU block assignment algorithm improves the query processing throughput by up to 4×.
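The abstract says only that the selection ratio of each query determines how many GPU blocks to launch; it does not give the rule. One plausible policy, shown here purely as an assumption, is to give every query one block and split the remaining blocks in proportion to selection ratio. The function `assign_blocks` and its largest-remainder rounding are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def assign_blocks(selection_ratios: Sequence[float], total_blocks: int) -> List[int]:
    """Split `total_blocks` GPU blocks across concurrent range queries.

    Assumed policy (not from the source): every query gets at least one
    block, and the spare blocks are divided in proportion to each query's
    selection ratio, using largest-remainder rounding.
    """
    n = len(selection_ratios)
    if n == 0:
        return []
    if total_blocks < n:
        raise ValueError("need at least one block per query")

    spare = total_blocks - n
    total_ratio = sum(selection_ratios)
    if total_ratio == 0:
        shares = [spare / n] * n
    else:
        shares = [spare * r / total_ratio for r in selection_ratios]

    blocks = [1 + int(s) for s in shares]
    # Hand out blocks lost to truncation, largest fractional part first.
    leftover = total_blocks - sum(blocks)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        blocks[i] += 1
    return blocks


# Three concurrent queries selecting 25%, 25%, and 50% of the data, 11 blocks available.
plan = assign_blocks([0.25, 0.25, 0.5], 11)
```

Under this policy a highly selective query that touches few leaf nodes does not hold blocks that a broad query could use, which is the intuition the abstract gives for the throughput gain.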

    Analyzing design choices for distributed multidimensional indexing

    Scientific datasets are often stored on distributed archival storage systems, because geographically distributed sensor devices store the datasets on their local machines and because the size of scientific datasets demands a large amount of disk space. Multidimensional indexing techniques have been shown to greatly improve range query performance into large scientific datasets. In this paper, we discuss several ways of distributing a multidimensional index in order to speed up access to large distributed scientific datasets. This paper compares the designs, challenges, and problems of distributed multidimensional indexing schemes, and provides a comprehensive performance study of distributed indexing, with guidelines for choosing a distributed multidimensional index for a specific data analysis application.