
    Improved Bounds and Schemes for the Declustering Problem

    The declustering problem is to allocate given data on parallel working storage devices in such a manner that typical requests find their data evenly distributed on the devices. Using deep results from discrepancy theory, we improve previous work of several authors concerning range queries to higher-dimensional data. We give a declustering scheme with an additive error of $O_d(\log^{d-1} M)$ independent of the data size, where $d$ is the dimension, $M$ the number of storage devices, and $d-1$ does not exceed the smallest prime power in the canonical decomposition of $M$ into prime powers. In particular, our schemes work for arbitrary $M$ in dimensions two and three. For general $d$, they work for all $M \geq d-1$ that are powers of two. Concerning lower bounds, we show that a recent proof of an $\Omega_d(\log^{\frac{d-1}{2}} M)$ bound contains an error. We close the gap in the proof and thus establish the bound. Comment: 19 pages, 1 figure
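
    As a rough illustration of the setting (not the paper's scheme), the sketch below assigns grid cells to M disks with the classical Disk Modulo rule and measures the additive error of a single range query, i.e. the gap between the busiest disk's load and the ideal ceil(|Q|/M). All names and parameter values are illustrative.

```python
import math
from itertools import product

def disk_modulo(cell, M):
    """Classical Disk Modulo assignment: sum of the grid coordinates mod M."""
    return sum(cell) % M

def additive_error(query_ranges, M, scheme):
    """Additive error of one range query: busiest disk's load minus the ideal load."""
    cells = list(product(*[range(lo, hi) for lo, hi in query_ranges]))
    load = [0] * M
    for cell in cells:
        load[scheme(cell, M)] += 1
    return max(load) - math.ceil(len(cells) / M)

# A 7 x 5 range query on a 2-dimensional grid, declustered over M = 4 disks.
print(additive_error([(0, 7), (0, 5)], 4, disk_modulo))
```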

    Asymptotically optimal declustering schemes for 2-dim range queries

    Declustering techniques have been widely adopted in parallel storage systems (e.g. disk arrays) to speed up bulk retrieval of multidimensional data. A declustering scheme distributes data items among multiple disks, thus enabling parallel data access and reducing query response time. We measure the performance of any declustering scheme as its worst-case additive deviation from the ideal scheme. The goal thus is to design declustering schemes with as small an additive error as possible. We describe a number of declustering schemes with additive error O(log M) for 2-dimensional range queries, where M is the number of disks. These are the first results giving an O(log M) upper bound for all values of M. Our second result is a lower bound on the additive error. It is known that, except for a few stringent cases, the additive error of any 2-dimensional declustering scheme is at least one. We strengthen this lower bound to Ω((log M)^((d−1)/2)) for d-dimensional schemes and to Ω(log M) for 2-dimensional schemes, thus proving that the 2-dimensional schemes described in this paper are (asymptotically) optimal. These results are obtained by establishing a connection to geometric discrepancy. We also present simulation results to evaluate the performance of these schemes in practice.
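
    A brute-force check of the performance measure described above, offered as a toy only: it enumerates every axis-aligned range query on a small N x N grid and reports the worst additive deviation of a given assignment. The two lattice schemes compared are illustrative, not the schemes proposed in the paper.

```python
import math
from itertools import product

def worst_case_deviation(N, M, scheme):
    """Worst additive deviation over all axis-aligned range queries on an N x N grid."""
    disk = [[scheme(i, j, M) for j in range(N)] for i in range(N)]
    worst = 0
    for i0, i1 in product(range(N), repeat=2):
        for j0, j1 in product(range(N), repeat=2):
            if i1 < i0 or j1 < j0:
                continue
            load = [0] * M
            for i in range(i0, i1 + 1):
                for j in range(j0, j1 + 1):
                    load[disk[i][j]] += 1
            size = (i1 - i0 + 1) * (j1 - j0 + 1)
            worst = max(worst, max(load) - math.ceil(size / M))
    return worst

# Compare two simple lattice assignments on a small grid (illustrative only).
dm = lambda i, j, M: (i + j) % M        # Disk Modulo
skew = lambda i, j, M: (i + 2 * j) % M  # a skewed lattice assignment
print(worst_case_deviation(8, 5, dm), worst_case_deviation(8, 5, skew))
```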

    Scalability analysis of declustering methods for multidimensional range queries

    Efficient storage and retrieval of multiattribute data sets has become one of the essential requirements for many data-intensive applications. The Cartesian product file has been known as an effective multiattribute file structure for partial-match and best-match queries. Several heuristic methods have been developed to decluster Cartesian product files across multiple disks to obtain high performance for disk accesses. Although the scalability of the declustering methods becomes increasingly important for systems equipped with a large number of disks, no analytic studies have been done so far. In this paper, we derive formulas describing the scalability of two popular declustering methods, Disk Modulo and Fieldwise Xor, for range queries, which are the most common type of queries. These formulas disclose the limited scalability of the declustering methods, and this is corroborated by extensive simulation experiments. From the practical point of view, the formulas given in this paper provide a simple measure that can be used to predict the response time of a given range query and to guide the selection of a declustering method under various conditions.
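
    The two methods analysed, Disk Modulo and Fieldwise Xor, have simple bucket-to-disk rules. The sketch below (illustrative grid and query sizes, FX shown in its usual power-of-two-disks setting) computes the response time of a range query as the maximum number of buckets any one disk must serve.

```python
from functools import reduce
from itertools import product

def disk_modulo(coords, M):
    """Disk Modulo (DM): disk = (sum of the attribute indices) mod M."""
    return sum(coords) % M

def fieldwise_xor(coords, M):
    """Fieldwise Xor (FX): bitwise XOR of the attribute indices, taken mod M.
    FX is usually described for M a power of two."""
    return reduce(lambda a, b: a ^ b, coords) % M

def response_time(query_ranges, M, scheme):
    """Response time of a range query = maximum number of buckets on one disk."""
    load = [0] * M
    for cell in product(*[range(lo, hi) for lo, hi in query_ranges]):
        load[scheme(cell, M)] += 1
    return max(load)

# A 2-attribute range query covering 6 x 6 buckets on M = 8 disks.
q = [(0, 6), (0, 6)]
print(response_time(q, 8, disk_modulo), response_time(q, 8, fieldwise_xor))
```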

    Semi-parametric estimation of joint large movements of risky assets

    The classical approach to modelling the occurrence of joint large movements of asset returns is to assume multivariate normality for the distribution of asset returns. This implies independence between large returns. However, it is now recognised by both academics and practitioners that large movements of asset returns do not occur independently. This fact encourages modelling joint large movements of asset returns as non-normal, a non-trivial task mainly due to the natural scarcity of such extreme events. This paper shows how to estimate the probability of joint large movements of asset prices using a semi-parametric approach borrowed from extreme value theory (EVT). It helps to understand the contribution of individual assets to large portfolio losses in terms of joint large movements. The advantages of this approach are that it does not require the assumption of a specific parametric form for the dependence structure of the joint large movements, avoiding model misspecification; it addresses specifically the scarcity of data, which is a problem for the reliable fitting of fully parametric models; and it is applicable to portfolios of many assets: there is no dimension explosion. The paper includes an empirical analysis of international equity data showing how to implement semi-parametric EVT modelling and how to exploit its strengths to help understand the probability of joint large movements. We estimate the probability of joint large losses in a portfolio composed of the FTSE 100, Nikkei 250 and S&P 500 indices. Each of the index returns is found to be heavy tailed. The S&P 500 index has a much stronger effect on large portfolio losses than the FTSE 100, despite the two having similar univariate tail heaviness.
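
    A minimal sketch of two standard ingredients of this kind of analysis, run on simulated data: the Hill estimator for marginal tail heaviness and an empirical estimate of a joint exceedance probability. It is not the paper's semi-parametric estimator, and the generated series are placeholders for index returns.

```python
import numpy as np

def hill_tail_index(losses, k):
    """Hill estimator of the tail index alpha from the k largest losses."""
    x = np.sort(losses)[::-1][: k + 1]        # the k+1 largest order statistics
    return k / np.sum(np.log(x[:k] / x[k]))

def joint_exceedance_prob(x, y, qx, qy):
    """Empirical probability that both series exceed their respective quantiles."""
    tx, ty = np.quantile(x, qx), np.quantile(y, qy)
    return float(np.mean((x > tx) & (y > ty)))

# Simulated heavy-tailed, positively dependent losses (illustrative only).
rng = np.random.default_rng(0)
common = rng.pareto(3.0, 5000)
a = common + 0.5 * rng.pareto(3.0, 5000)
b = common + 0.5 * rng.pareto(3.0, 5000)
print(hill_tail_index(a, k=200))                 # marginal tail heaviness
print(joint_exceedance_prob(a, b, 0.99, 0.99))   # chance of a joint large loss
```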

    A Survey on Array Storage, Query Languages, and Systems

    Since scientific investigation is one of the most important providers of massive amounts of ordered data, there is a renewed interest in array data processing in the context of Big Data. To the best of our knowledge, a unified resource that summarizes and analyzes array processing research over its long existence is currently missing. In this survey, we provide a guide for past, present, and future research in array processing. The survey is organized along three main topics. Array storage discusses all the aspects related to array partitioning into chunks. The identification of a reduced set of array operators to form the foundation for an array query language is analyzed across multiple such proposals. Lastly, we survey real systems for array processing. The result is a thorough survey on array data storage and processing that should be consulted by anyone interested in this research topic, independent of experience level. The survey is not complete, though. We greatly appreciate pointers towards any work we might have forgotten to mention. Comment: 44 pages
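
    As a small, hedged illustration of the chunking theme mentioned under array storage (regular, aligned chunks only; all shapes are made up), the sketch below maps cells to chunks and lists the chunks an axis-aligned range query has to read.

```python
import math
from itertools import product

def chunk_of(cell, chunk_shape):
    """Map an array cell to the regular chunk that contains it."""
    return tuple(c // s for c, s in zip(cell, chunk_shape))

def chunks_touched(query_ranges, chunk_shape):
    """Chunks an axis-aligned range query must read under regular chunking."""
    per_dim = [range(lo // s, math.ceil(hi / s))
               for (lo, hi), s in zip(query_ranges, chunk_shape)]
    return list(product(*per_dim))

# A 1000 x 1000 array stored in 100 x 100 chunks; query reads rows 150-449, cols 90-209.
print(chunk_of((150, 90), (100, 100)))
print(len(chunks_touched([(150, 450), (90, 210)], (100, 100))))  # 4 x 3 = 12 chunks
```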

    Discrepancy of arithmetic structures

    In discrepancy theory, we investigate how well a desired aim can be achieved. So typically we do not compare our solution with an optimal solution, but rather with an (idealized) aim. For example, in the declustering problem, we try to distribute data on parallel disks in such a way that all of a prespecified set of requests find their data evenly distributed on the disks. Hence our (idealized) aim is that each request asks for the same amount of data from each disk. Structural results tell us to which extent this is possible. They determine the discrepancy, the deviation of an optimal solution from our aim. Algorithmic results provide good declustering schemes. We show that for grid-structured data and rectangle queries, a discrepancy of order (log M)^((d-1)/2) cannot be avoided. Moreover, we present a declustering scheme with a discrepancy of order (log M)^(d-1). Furthermore, we present discrepancy results for hypergraphs related to the hypergraph of arithmetic progressions, for the hypergraph of linear hyperplanes in finite vector spaces and for products of hypergraphs.
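
    To make the arithmetic-progression part concrete, here is a brute-force toy (not the paper's constructions): it computes the discrepancy of a +/-1 colouring over all arithmetic progressions in {0, ..., n-1}, i.e. the largest colour imbalance any progression sees.

```python
from itertools import product

def ap_discrepancy(coloring):
    """Discrepancy of a +/-1 colouring over all arithmetic progressions in {0,...,n-1}."""
    n = len(coloring)
    worst = 0
    for start, step in product(range(n), range(1, n)):
        total = 0
        for x in range(start, n, step):
            total += coloring[x]
            worst = max(worst, abs(total))  # every prefix of a progression is itself an AP
    return worst

# The alternating colouring looks balanced, but progressions with even step
# see a single colour and drive the discrepancy up.
chi = [1 if i % 2 == 0 else -1 for i in range(16)]
print(ap_discrepancy(chi))
```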

    Alpha Entanglement Codes: Practical Erasure Codes to Archive Data in Unreliable Environments

    Data centres that use consumer-grade disk drives and distributed peer-to-peer systems are unreliable environments to archive data without enough redundancy. Most redundancy schemes are not completely effective for providing high availability, durability and integrity in the long term. We propose alpha entanglement codes, a mechanism that creates a virtual layer of highly interconnected storage devices to propagate redundant information across a large-scale storage system. Our motivation is to design flexible and practical erasure codes with high fault-tolerance to improve data durability and availability even in catastrophic scenarios. By flexible and practical, we mean code settings that can be adapted to future requirements and practical implementations with reasonable trade-offs between security, resource usage and performance. The codes have three parameters. Alpha increases storage overhead linearly but increases the possible paths to recover data exponentially. Two other parameters increase fault-tolerance even further without the need for additional storage. As a result, an entangled storage system can provide high availability, durability and offer additional integrity: it is more difficult to modify data undetectably. We evaluate how several redundancy schemes perform in unreliable environments and show that alpha entanglement codes are flexible and practical codes. Remarkably, they excel at code locality; hence, they reduce repair costs and become less dependent on storage locations with poor availability. Our solution outperforms Reed-Solomon codes in many disaster recovery scenarios. Comment: The publication has 12 pages and 13 figures. This work was partially supported by Swiss National Science Foundation SNSF Doc.Mobility 162014. 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)
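
    A deliberately tiny sketch of the entanglement idea in its simplest form, a single chain in which each parity mixes the new block with the previous parity. The real alpha entanglement codes use several interleaved chains controlled by the three parameters, which this toy does not model.

```python
def entangle_chain(data_blocks):
    """Toy single entanglement chain: p_i = d_i XOR p_{i-1}, with p_0 = all zeros,
    so redundant information propagates along the chain of parities."""
    parities, prev = [], bytes(len(data_blocks[0]))
    for d in data_blocks:
        p = bytes(a ^ b for a, b in zip(d, prev))
        parities.append(p)
        prev = p
    return parities

def repair(i, parities):
    """Recover a lost data block from neighbouring parities: d_i = p_i XOR p_{i-1}."""
    left = parities[i - 1] if i > 0 else bytes(len(parities[i]))
    return bytes(a ^ b for a, b in zip(parities[i], left))

blocks = [b"blockA-16-bytes!", b"blockB-16-bytes!", b"blockC-16-bytes!"]
ps = entangle_chain(blocks)
assert repair(1, ps) == blocks[1]  # block 1 lost, rebuilt from p_0 and p_1
```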

    Data partitioning and load balancing in parallel disk systems

    Parallel disk systems provide opportunities for exploiting I/O parallelism in two possible ways, namely via inter-request and intra-request parallelism. In this paper we discuss the main issues in performance tuning of such systems, namely striping and load balancing, and show their relationship to response time and throughput. We outline the main components of an intelligent file system that optimizes striping by taking into account the requirements of the applications, and performs load balancing by judicious file allocation and dynamic redistributions of the data when access patterns change. Our system uses simple but effective heuristics that incur very little overhead. We present performance experiments based on synthetic workloads and real-life traces.
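
    An illustrative sketch of the two tuning knobs discussed, striping and heat-based file allocation; the round-robin striping and the greedy "place on the coolest disk" heuristic below are generic textbook choices, not the specific heuristics of this system.

```python
def stripe_units(file_size, unit_size, num_disks, start_disk=0):
    """Round-robin striping: map each stripe unit of a file to a disk."""
    units = -(-file_size // unit_size)  # ceiling division
    return [(u, (start_disk + u) % num_disks) for u in range(units)]

def allocate_least_loaded(file_heats, num_disks):
    """Greedy allocation: place each file, hottest first, on the currently coolest disk."""
    load = [0.0] * num_disks
    placement = {}
    for name, heat in sorted(file_heats.items(), key=lambda kv: -kv[1]):
        disk = min(range(num_disks), key=lambda d: load[d])
        placement[name] = disk
        load[disk] += heat
    return placement, load

print(stripe_units(file_size=10_000, unit_size=4_096, num_disks=4))
print(allocate_least_loaded({"a": 5.0, "b": 3.0, "c": 2.0, "d": 2.0}, num_disks=2))
```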

    Scalability Analysis of Declustering Methods for Cartesian Product Files

    Efficient storage and retrieval of multi-attribute datasets has become one of the essential requirements for many data-intensive applications. The Cartesian product file has been known as an effective multi-attribute file structure for partial-match and best-match queries. Several heuristic methods have been developed to decluster Cartesian product files over multiple disks to obtain high performance for disk accesses. Though the scalability of the declustering methods becomes increasingly important for systems equipped with a large number of disks, no analytic studies have been done so far. In this paper we derive formulas describing the scalability of two popular declustering methods, Disk Modulo and Fieldwise Xor, for range queries, which are the most common type of queries. These formulas disclose the limited scalability of the declustering methods and are corroborated by extensive simulation experiments. From the practical point of view, the formulas given in this paper provide a simple measure that can be used to predict the response time of a given range query and to guide the selection of a declustering method under various conditions. (Also cross-referenced as UMIACS-TR-96-5)
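
    In the spirit of the scalability question (illustrative numbers, not the paper's formulas), the sketch below holds one range query fixed and lets the number of disks grow: under Disk Modulo the response time soon stops improving even though the ideal time keeps falling.

```python
import math
from itertools import product

def dm_response_time(query_ranges, M):
    """Response time (max buckets on one disk) of a range query under Disk Modulo."""
    load = [0] * M
    for cell in product(*[range(lo, hi) for lo, hi in query_ranges]):
        load[sum(cell) % M] += 1
    return max(load)

# A fixed 2-attribute query of 12 x 12 buckets; the ideal time is ceil(144 / M).
query = [(0, 12), (0, 12)]
for M in (2, 4, 8, 16, 32, 64):
    print(f"M={M:3d}  actual={dm_response_time(query, M):3d}  ideal={math.ceil(144 / M):3d}")
```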