Scientific data analysis applications require large-scale computing power to
service client queries effectively, and large storage repositories for the
datasets that sensors and simulations generate continually. These scientific
datasets grow every day and are becoming enormous. The goal of this
dissertation is to provide efficient multidimensional indexing techniques that
aid in navigating distributed scientific datasets, and we show that these
techniques significantly improve access to large distributed scientific
datasets.
The first approach we took to improve access to subsets of large
multidimensional scientific datasets was data chunking. Scientific data files
typically contain a collection of multidimensional arrays along with the
corresponding metadata. Because indexing the individual data elements of large
scientific datasets is not efficient, data chunking groups data elements into
small chunks of a fixed, but data-specific, size to take advantage of
spatio-temporal locality.
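
The chunking idea can be illustrated with a minimal sketch. It assumes regular,
axis-aligned chunks of a fixed shape over an integer-indexed array. The function
names (chunk_of, chunk_bounds, chunks_for_query) and the half-open box
convention are hypothetical choices made for this illustration, not the
implementation used in this dissertation.

```python
from itertools import product

# Illustrative sketch only: regular, axis-aligned chunking is an assumption,
# and all names below are hypothetical.

def chunk_of(coord, chunk_shape):
    """Return the chunk coordinate that contains an element coordinate."""
    return tuple(c // s for c, s in zip(coord, chunk_shape))

def chunk_bounds(chunk_coord, chunk_shape, array_shape):
    """Return the half-open box (lo, hi) covered by a chunk.

    Chunks on the upper edge of the array are clipped to the array shape.
    """
    lo = tuple(k * s for k, s in zip(chunk_coord, chunk_shape))
    hi = tuple(min(l + s, n) for l, s, n in zip(lo, chunk_shape, array_shape))
    return lo, hi

def chunks_for_query(q_lo, q_hi, chunk_shape):
    """List the chunks that overlap the half-open query box [q_lo, q_hi)."""
    first = chunk_of(q_lo, chunk_shape)
    last = chunk_of(tuple(h - 1 for h in q_hi), chunk_shape)
    return list(product(*(range(a, b + 1) for a, b in zip(first, last))))
```

In this sketch an index would store one bounding box per chunk rather than one
entry per element, so a range query first finds the overlapping chunks and then
reads only those chunks.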
The second approach was the design of an efficient multidimensional index for
scientific datasets. This work investigates how existing multidimensional
indexing structures perform on chunked scientific datasets, and compares their
performance with that of our own indexing structure, SH-trees. Many
multidimensional indexing structures have been proposed since the introduction
of R-trees. However, relatively few studies have focused on improving
the performance of indexing geographically distributed datasets, especially
across heterogeneous machines. As a third approach, to accelerate indexing
for distributed datasets, we proposed several distributed
multidimensional indexing schemes: replicated centralized indexing,
hierarchical two-level indexing, and decentralized two-level indexing.
Our experimental results show that distributing a multidimensional index
yields substantial performance improvements. However, the design
choices for distributed indexing, such as replication, partitioning, and
decentralization, must be considered carefully, since they may decrease overall
performance in certain situations. This work therefore provides performance
guidelines to aid in selecting the best distributed multidimensional indexing
scheme for various systems and applications. Finally, as a case study, we
describe how a distributed multiple query optimization middleware can use a
distributed multidimensional indexing scheme to generate better query plans by
leveraging information about the contents of remote caches.