File Allocation and Join Site Selection Problem in Distributed Database Systems.
There are two important problems associated with the design of distributed database systems. One is the file allocation problem, and the other is the query optimization problem. This research develops a methodology that considers both aspects and simultaneously determines the optimal location of files and join sites for given queries. Using this methodology, three different mixed integer programming models that describe three cases of the file allocation and join site selection problem are developed. Dual-based procedures are developed for each of the three mixed integer programming models. Extensive computational testing shows that the dual-based algorithms generate solutions that are very close to optimal. These near-optimal solutions are also found very quickly, even for large-scale problems.
Centralized coded caching for heterogeneous lossy requests
Centralized coded caching of popular contents is studied for users with heterogeneous distortion requirements, corresponding to diverse processing and display capabilities of mobile devices. Users' distortion requirements are assumed to be fixed and known, while their particular demands are revealed only after the placement phase. Modeling each file in the database as an independent and identically distributed Gaussian vector, the minimum delivery rate that can satisfy any demand combination within the corresponding distortion target is studied. The optimal delivery rate is characterized for the special case of two users and two files for any pair of distortion requirements. For the general setting with multiple users and files, a layered caching and delivery scheme, which exploits the successive refinability of Gaussian sources, is proposed. This scheme caches each content in multiple layers, and it is optimized by solving two subproblems: lossless caching of each layer with heterogeneous cache capacities, and allocation of available caches among layers. The delivery rate minimization problem for each layer is solved numerically, while two schemes, called the proportional cache allocation (PCA) and ordered cache allocation (OCA), are proposed for cache allocation. These schemes are compared with each other and the cut-set bound through numerical simulations.
A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing
Data Grids have been adopted as the platform for scientific communities that
need to share, access, transport, process and manage large data collections
distributed worldwide. They combine high-end computing technologies with
high-performance networking and wide-area storage management techniques. In
this paper, we discuss the key concepts behind Data Grids and compare them with
other data sharing and distribution paradigms such as content delivery
networks, peer-to-peer networks and distributed databases. We then provide
comprehensive taxonomies that cover various aspects of architecture, data
transportation, data replication and resource allocation and scheduling.
Finally, we map the proposed taxonomy to various Data Grid systems not only to
validate the taxonomy but also to identify areas for future exploration.
Through this taxonomy, we aim to categorise existing systems to better
understand their goals and their methodology. This would help evaluate their
applicability for solving similar problems. This taxonomy also provides a "gap
analysis" of this area through which researchers can potentially identify new
issues for investigation. Finally, we hope that the proposed taxonomy and
mapping also help to provide an easy way for new practitioners to understand
this complex area of research. Comment: 46 pages, 16 figures, Technical Report
Many-Task Computing and Blue Waters
This report discusses many-task computing (MTC) generically and in the
context of the proposed Blue Waters system, which is planned to be the largest
NSF-funded supercomputer when it begins production use in 2012. The aim of this
report is to inform the BW project about MTC, including understanding aspects
of MTC applications that can be used to characterize the domain and
understanding the implications of these aspects to middleware and policies.
Many MTC applications do not neatly fit the stereotypes of high-performance
computing (HPC) or high-throughput computing (HTC) applications. Like HTC
applications, by definition MTC applications are structured as graphs of
discrete tasks, with explicit input and output dependencies forming the graph
edges. However, MTC applications have significant features that distinguish
them from typical HTC applications. In particular, different engineering
constraints for hardware and software must be met in order to support these
applications. HTC applications have traditionally run on platforms such as
grids and clusters, through either workflow systems or parallel programming
systems. MTC applications, in contrast, will often demand a short time to
solution, may be communication intensive or data intensive, and may comprise
very short tasks. Therefore, hardware and software for MTC must be engineered
to support the additional communication and I/O and must minimize task dispatch
overheads. The hardware of large-scale HPC systems, with its high degree of
parallelism and support for intensive communication, is well suited for MTC
applications. However, HPC systems often lack a dynamic resource-provisioning
feature, are not ideal for task communication via the file system, and have an
I/O system that is not optimized for MTC-style applications. Hence, additional
software support is likely to be required to gain full benefit from the HPC
hardware.
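As a rough illustration of the "graph of discrete tasks with explicit input and output dependencies" described in this abstract, here is a minimal Python sketch. It is not taken from the report: the function `run_task_graph` and the task names are hypothetical, and the tasks run sequentially in one process rather than being dispatched across a machine.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_task_graph(tasks, deps):
    """Run each task once all of its prerequisites have produced output.

    tasks: maps a task name to a callable taking a dict of prerequisite outputs.
    deps:  maps a task name to the list of task names it depends on (graph edges).
    Both structures are assumptions made for this sketch.
    """
    sorter = TopologicalSorter({name: deps.get(name, []) for name in tasks})
    results = {}
    order = []
    for name in sorter.static_order():  # prerequisites always come first
        inputs = {d: results[d] for d in deps.get(name, [])}
        results[name] = tasks[name](inputs)
        order.append(name)
    return order, results


# Hypothetical three-task pipeline: load -> square -> total
tasks = {
    "load": lambda inp: [1, 2, 3],
    "square": lambda inp: [x * x for x in inp["load"]],
    "total": lambda inp: sum(inp["square"]),
}
deps = {"square": ["load"], "total": ["square"]}

order, results = run_task_graph(tasks, deps)
print(order, results["total"])
```

With very short tasks like these, the per-task bookkeeping in the loop can be comparable to the work itself, which loosely mirrors the abstract's point about minimizing task dispatch overheads.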
Predicting Intermediate Storage Performance for Workflow Applications
Configuring a storage system to better serve an application is a challenging
task complicated by a multidimensional, discrete configuration space and the
high cost of space exploration (e.g., by running the application with different
storage configurations). To enable selecting the best configuration in a
reasonable time, we design an end-to-end performance prediction mechanism that
estimates the turn-around time of an application using a storage system under a
given configuration. This approach focuses on a generic object-based storage
system design, supports exploring the impact of optimizations targeting
workflow applications (e.g., various data placement schemes) in addition to
other, more traditional, configuration knobs (e.g., stripe size or replication
level), and models the system operation at data-chunk and control message
level.
This paper presents our experience to date with designing and using this
prediction mechanism. We evaluate this mechanism using micro- as well as
synthetic benchmarks mimicking real workflow applications, and a real
application. A preliminary evaluation shows that we are on track to
meet our objectives: it can scale to model a workflow application run on an
entire cluster while offering an over 200x speedup factor (normalized by
resource) compared to running the actual application, and can achieve, in the
limited number of scenarios we study, a prediction accuracy that enables
identifying the best storage system configuration.
PlinyCompute: A Platform for High-Performance, Distributed, Data-Intensive Tool Development
This paper describes PlinyCompute, a system for development of
high-performance, data-intensive, distributed computing tools and libraries. In
the large, PlinyCompute presents the programmer with a very high-level,
declarative interface, relying on automatic, relational-database style
optimization to figure out how to stage distributed computations. However, in
the small, PlinyCompute presents the capable systems programmer with a
persistent object data model and API (the "PC object model") and associated
memory management system that has been designed from the ground up for
high-performance, distributed, data-intensive computing. This contrasts with most
other Big Data systems, which are constructed on top of the Java Virtual
Machine (JVM), and hence must at least partially cede performance-critical
concerns such as memory management (including layout and de/allocation) and
virtual method/function dispatch to the JVM. This hybrid approach---declarative
in the large, trusting the programmer's ability to utilize PC object model
efficiently in the small---results in a system that is ideal for the
development of reusable, data-intensive tools and libraries. Through extensive
benchmarking, we show that implementing complex object manipulation and
non-trivial, library-style computations on top of PlinyCompute can result in a
speedup of 2x to more than 50x compared to equivalent implementations
on Spark. Comment: 48 pages, including references and Appendix
How to Optimally Allocate Resources for Coded Distributed Computing?
Today's data centers have an abundance of computing resources, hosting server
clusters consisting of as many as tens or hundreds of thousands of machines. To
execute a complex computing task over a data center, it is natural to
distribute computations across many nodes to take advantage of parallel
processing. However, as we allocate more and more computing resources to a
computation task and further distribute the computations, large amounts of
(partially) computed data must be moved between consecutive stages of
computation tasks among the nodes, hence the communication load can become the
bottleneck. In this paper, we study the optimal allocation of computing
resources in distributed computing, in order to minimize the total execution
time, accounting for the durations of both the computation
and communication phases. In particular, we consider a general MapReduce-type
distributed computing framework, in which the computation is decomposed into
three stages: Map, Shuffle, and Reduce. We focus on a
recently proposed Coded Distributed Computing approach for MapReduce and
study the optimal allocation of computing resources in this framework. For all
values of problem parameters, we characterize the optimal number of servers
that should be used for distributed processing, provide the optimal placements
of the Map and Reduce tasks, and propose an optimal coded data shuffling
scheme, in order to minimize the total execution time. To prove the optimality
of the proposed scheme, we first derive a matching information-theoretic
converse on the execution time, then we prove that among all possible resource
allocation schemes that achieve the minimum execution time, our proposed scheme
uses exactly the minimum possible number of servers.
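To make the three stages named in this abstract concrete, here is a minimal Python sketch of a plain, uncoded MapReduce word count. It does not implement the Coded Distributed Computing scheme or the resource allocation studied in the paper. The function names and the bucket assignment rule are assumptions chosen for illustration, and everything runs in one process.

```python
from collections import defaultdict


def map_stage(chunks):
    """Map: each input chunk independently emits (word, 1) pairs."""
    return [[(word, 1) for word in chunk.split()] for chunk in chunks]


def shuffle_stage(mapped, num_reducers):
    """Shuffle: route every intermediate pair to the reducer that owns its key.

    In a real cluster this is the step that moves data between nodes.
    The byte-sum routing rule is an arbitrary deterministic choice for this sketch.
    """
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped:
        for key, value in pairs:
            reducer = sum(key.encode()) % num_reducers
            buckets[reducer][key].append(value)
    return buckets


def reduce_stage(buckets):
    """Reduce: each reducer combines the values collected for its keys."""
    counts = {}
    for bucket in buckets:
        for key, values in bucket.items():
            counts[key] = sum(values)
    return counts


chunks = ["a b a", "b c"]  # two hypothetical input splits
counts = reduce_stage(shuffle_stage(map_stage(chunks), num_reducers=2))
print(counts)
```

The Shuffle step is where intermediate data crosses between workers, which is the communication load the abstract identifies as a potential bottleneck.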