Scalability analysis of declustering methods for multidimensional range queries
Efficient storage and retrieval of multiattribute data sets has become one of the essential requirements for many data-intensive applications. The Cartesian product file has been known as an effective multiattribute file structure for partial-match and best-match queries. Several heuristic methods have been developed to decluster Cartesian product files across multiple disks to obtain high performance for disk accesses. Although the scalability of the declustering methods becomes increasingly important for systems equipped with a large number of disks, no analytic studies have been done so far. In this paper, we derive formulas describing the scalability of two popular declustering methods, Disk Modulo and Fieldwise Xor, for range queries, which are the most common type of queries. These formulas disclose the limited scalability of the declustering methods, and this is corroborated by extensive simulation experiments. From the practical point of view, the formulas given in this paper provide a simple measure that can be used to predict the response time of a given range query and to guide the selection of a declustering method under various conditions.
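The two methods named in the abstract can be illustrated with a minimal sketch. It assumes the definitions commonly given in the declustering literature (Disk Modulo sums the bucket coordinates modulo the number of disks; Fieldwise Xor takes their bitwise XOR modulo the number of disks) and assumes that response time is measured as the largest number of buckets any single disk must serve. The function names are hypothetical and do not come from the paper.

```python
from collections import Counter
from functools import reduce
from itertools import product
from math import ceil
from operator import xor


def disk_modulo(bucket, num_disks):
    """Disk Modulo (assumed definition): sum of coordinates mod disk count."""
    return sum(bucket) % num_disks


def fieldwise_xor(bucket, num_disks):
    """Fieldwise Xor (assumed definition): XOR of coordinates mod disk count."""
    return reduce(xor, bucket) % num_disks


def response_time(method, lows, highs, num_disks):
    """Largest bucket count on any one disk for the range query that covers
    lows[k]..highs[k] (inclusive) in each dimension k."""
    ranges = [range(lo, hi + 1) for lo, hi in zip(lows, highs)]
    load = Counter(method(b, num_disks) for b in product(*ranges))
    return max(load.values())


def optimal_response_time(lows, highs, num_disks):
    """Lower bound: the query's buckets spread perfectly evenly over the disks."""
    size = 1
    for lo, hi in zip(lows, highs):
        size *= hi - lo + 1
    return ceil(size / num_disks)


if __name__ == "__main__":
    # A 4 x 4 range query evaluated on 4, 8, and 16 disks.
    for m in (4, 8, 16):
        print(m,
              response_time(disk_modulo, (0, 0), (3, 3), m),
              response_time(fieldwise_xor, (0, 0), (3, 3), m),
              optimal_response_time((0, 0), (3, 3), m))
```

Under these assumptions, the 4 x 4 query costs 4 bucket accesses on the busiest disk for both methods whether 4, 8, or 16 disks are used, while the even-spread bound falls from 4 to 2 to 1. This is one small instance of the limited scalability the abstract describes.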
Scalability Analysis of Declustering Methods for Cartesian Product Files
Efficient storage and retrieval of multi-attribute datasets
has become one of the essential requirements for many data-intensive
applications. The Cartesian product file has been known as an effective
multi-attribute file structure for partial-match and best-match queries.
Several heuristic methods have been developed to decluster Cartesian
product files over multiple disks to obtain high performance for disk
accesses. Though the scalability of the declustering methods becomes
increasingly important for systems equipped with a large number of disks,
no analytic studies have been done so far.
In this paper we derive formulas describing the scalability
of two popular declustering methods, Disk Modulo and Fieldwise Xor,
for range queries, which are the most common type of queries.
These formulas disclose the limited scalability of the declustering methods
and are corroborated by extensive simulation experiments.
From the practical point of view,
the formulas given in this paper provide a simple measure
which can be used to predict the response time of a given range query
and to guide the selection of a declustering method
under various conditions.
(Also cross-referenced as UMIACS-TR-96-5)
A Virtual Data Grid for LIGO
GriPhyN (Grid Physics Network) is a large US collaboration to
build grid services for large physics experiments, one of which is LIGO, a
gravitational-wave observatory. This paper explains the physics and computing
challenges of LIGO, and the tools that GriPhyN will build to address
them. A key component needed to implement the data pipeline is a virtual
data service, a system to dynamically create data products requested during
the various stages. The data could possibly be already processed in a certain
way, it may be in a file on a storage system, it may be cached, or it may need
to be created through computation. The full elaboration of this system will allow
complex data pipelines to be set up as virtual data objects, with existing
data being transformed in diverse ways.
Systems Technology Laboratory (STL) compendium of utilities
Multipurpose programs, routines, and operating systems are described. Data conversion and character string comparison subroutines are included. Graphics packages and file maintenance programs are also included.
Improved Bounds and Schemes for the Declustering Problem
The declustering problem is to allocate given data on parallel working
storage devices in such a manner that typical requests find their data evenly
distributed on the devices. Using deep results from discrepancy theory, we
improve previous work of several authors concerning range queries to
higher-dimensional data. We give a declustering scheme with an additive error
of $O_d(\log^{d-1} M)$ independent of the data size, where $d$ is the
dimension, $M$ the number of storage devices and $d-1$ does not exceed the
smallest prime power in the canonical decomposition of $M$ into prime powers.
In particular, our schemes work for arbitrary $M$ in dimensions two and three.
For general $d$, they work for all $M \geq d-1$ that are powers of two.
Concerning lower bounds, we show that a recent proof of a
$\Omega_d(\log^{\frac{d-1}{2}} M)$ bound contains an error. We close the gap in
the proof and thus establish the bound.
Comment: 19 pages, 1 figure
Data partitioning and load balancing in parallel disk systems
Parallel disk systems provide opportunities for exploiting I/O parallelism in two possible ways, namely via inter-request and intra-request parallelism. In this paper we discuss the main issues in performance tuning of such systems, namely striping and load balancing, and show their relationship to response time and throughput. We outline the main components of an intelligent file system that optimizes striping by taking into account the requirements of the applications, and performs load balancing by judicious file allocation and dynamic redistribution of the data when access patterns change. Our system uses simple but effective heuristics that incur little overhead. We present performance experiments based on synthetic workloads and real-life traces.
Storage and Querying of Large Persistent Arrays
Scientific and analytical applications today are becoming increasingly data
intensive. Many such applications deal with data that is multidimensional in nature.
Traditionally, relational database systems have been used by many data-intensive
applications, and the relational paradigm has proved to be both natural and efficient.
However, for multidimensional data, when the number of dimensions becomes large,
relational databases are inefficient both in terms of storage and query response time.
In this thesis, we explore linearised storage, and indexed and skiplist-based retrieval
on persistent arrays. The application programs are provided with a logical view of
a multidimensional array. The techniques have been implemented in a home-grown
database management system called MuBase.
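The idea of linearised storage behind a logical multidimensional view can be sketched as follows. The abstract does not say which linearisation MuBase uses, so row-major order is an assumption here, and the function names are hypothetical.

```python
def linearise(index, shape):
    """Map a multidimensional index to a row-major offset (assumed layout)."""
    if len(index) != len(shape):
        raise ValueError("index and shape must have the same number of dimensions")
    offset = 0
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError("index out of bounds for the given shape")
        # Each step scales what was accumulated so far by the next extent.
        offset = offset * n + i
    return offset


def delinearise(offset, shape):
    """Inverse of linearise: recover the multidimensional index from an offset."""
    index = []
    for n in reversed(shape):
        offset, i = divmod(offset, n)
        index.append(i)
    return tuple(reversed(index))


if __name__ == "__main__":
    shape = (2, 3, 4)
    position = linearise((1, 2, 3), shape)
    print(position, delinearise(position, shape))
```

An application would address cells by their multidimensional index, while the storage layer reads and writes the single offset; the last cell of a 2 x 3 x 4 array, for instance, lands at offset 23.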
Benchmarking BigSQL Systems
We live in the era of BigData. We now have BigData systems which are able to manage data in volumes of hundreds of terabytes and petabytes. These BigData systems handle data sizes which are too large for traditional database systems to handle. Some of these BigData systems now provide SQL syntax for interacting with their store. These systems, referred to as BigSQL systems, possess certain features which make them unique in how they store and manage data. A study of the performance and characteristics of these BigSQL systems is necessary in order to understand them better. This thesis provides such a study. We run standardized benchmark experiments against selected BigSQL systems and then analyze the performance of these systems based on the results of the experiments. The aim of this thesis is to better explain the features and behavior of the selected BigSQL systems.