A Lower Bound Technique for Communication in BSP
Communication is a major factor determining the performance of algorithms on
current computing systems; it is therefore valuable to provide tight lower
bounds on the communication complexity of computations. This paper presents a
lower bound technique for the communication complexity in the bulk-synchronous
parallel (BSP) model of a given class of DAG computations. The derived bound is
expressed in terms of the switching potential of a DAG, that is, the number of
permutations that the DAG can realize when viewed as a switching network. The
proposed technique yields tight lower bounds for the fast Fourier transform
(FFT), and for any sorting and permutation network. A stronger bound is also
derived for the periodic balanced sorting network, by applying this technique
to suitable subnetworks. Finally, we demonstrate that the switching potential
captures communication requirements even in computational models different from
BSP, such as the I/O model and the LPRAM.
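To make the notion of switching potential concrete, the following is a minimal sketch, not the paper's construction. It assumes a simplified network built only from 2x2 switches that either pass their two wires straight through or swap them; the function name `realizable_permutations` and the stage encoding are hypothetical. It counts the distinct permutations reachable over all switch settings by brute force.

```python
from itertools import product


def realizable_permutations(n_wires, stages):
    """Return the set of permutations realized over all switch settings.

    `stages` is a list of stages; each stage is a list of disjoint wire
    pairs (i, j), one pair per 2x2 switch (an illustrative encoding).
    """
    # Flatten stage by stage so switches are applied in network order.
    switches = [pair for stage in stages for pair in stage]
    perms = set()
    for setting in product((False, True), repeat=len(switches)):
        wires = list(range(n_wires))
        for (i, j), swap in zip(switches, setting):
            if swap:
                wires[i], wires[j] = wires[j], wires[i]
        perms.add(tuple(wires))
    return perms


# 4-input butterfly (the FFT wiring pattern): two stages of two switches.
butterfly_4 = [[(0, 2), (1, 3)], [(0, 1), (2, 3)]]
count = len(realizable_permutations(4, butterfly_4))
print(count)
```

For this 4-input butterfly the count is 16 out of the 24 possible permutations: each of the four switches contributes one independent binary choice, and distinct settings give distinct permutations. Brute-force enumeration is exponential in the number of switches, so the sketch is only useful for very small networks.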
On Characterizing the Data Movement Complexity of Computational DAGs for Parallel Execution
Technology trends are making the cost of data movement increasingly dominant,
both in terms of energy and time, over the cost of performing arithmetic
operations in computer systems. The fundamental ratio of aggregate data
movement bandwidth to the total computational power (also referred to as the
machine balance parameter) in parallel computer systems is decreasing. It is
therefore of considerable importance to characterize the inherent data
movement requirements of parallel algorithms, so that the minimal architectural
balance parameters required to support them on future systems can be well
understood. In this paper, we extend the well-known red-blue
pebble game to derive lower bounds on the data movement complexity for the
parallel execution of computational directed acyclic graphs (CDAGs) on parallel
systems. We model multi-node multi-core parallel systems, with the total
physical memory distributed across the nodes (that are connected through some
interconnection network) and in a multi-level shared cache hierarchy for
processors within a node. We also develop new techniques for lower bound
characterization of non-homogeneous CDAGs. We demonstrate the use of the
methodology by analyzing the CDAGs of several numerical algorithms, to develop
lower bounds on data movement for their parallel execution.
On the impact of communication complexity in the design of parallel numerical algorithms
This paper describes two models of the cost of data movement in parallel numerical algorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In the second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm-independent upper bounds on system performance are derived for several problems that are important to scientific computation.
Communication Steps for Parallel Query Processing
We consider the problem of computing a relational query on a large input
database of size , using a large number of servers. The computation is
performed in rounds, and each server can receive only
bits of data, where is a parameter that controls
replication. We examine how many global communication steps are needed to
compute . We establish both lower and upper bounds, in two settings. For a
single round of communication, we give lower bounds in the strongest possible
model, where arbitrary bits may be exchanged; we show that any algorithm
requires , where is the fractional vertex
cover of the hypergraph of . We also give an algorithm that matches the
lower bound for a specific class of databases. For multiple rounds of
communication, we present lower bounds in a model where routing decisions for a
tuple are tuple-based. We show that for the class of tree-like queries there
exists a tradeoff between the number of rounds and the space exponent
. The lower bounds for multiple rounds are the first of their
kind. Our results also imply that transitive closure cannot be computed in O(1)
rounds of communication.
Self-Stabilizing Repeated Balls-into-Bins
We study the following synchronous process that we call "repeated
balls-into-bins". The process is started by assigning balls to bins in
an arbitrary way. In every subsequent round, from each non-empty bin one ball
is chosen according to some fixed strategy (random, FIFO, etc.), and re-assigned
to one of the bins uniformly at random.
We call a configuration "legitimate" if its maximum load is
. We prove that, starting from any configuration, the
process will converge to a legitimate configuration in linear time and then it
will only take on legitimate configurations over a period of length bounded by
any polynomial in , with high probability (w.h.p.). This implies that the
process is self-stabilizing and that every ball traverses all bins in
rounds, w.h.p.
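The process described above can be simulated in a few lines. This is a minimal sketch under stated assumptions: the function name `repeated_balls_into_bins` is hypothetical, and the instance size (64 balls in 64 bins, all starting in one bin) is chosen only for illustration, since the listing does not preserve the abstract's parameters. Only the bin loads are tracked, so the rule for choosing which ball leaves a bin (random, FIFO, and so on) plays no role here.

```python
import random


def repeated_balls_into_bins(loads, rounds, rng):
    """Simulate the process; loads[i] is the number of balls in bin i.

    Returns the final loads and the maximum load after each round.
    """
    loads = list(loads)
    n_bins = len(loads)
    max_loads = []
    for _ in range(rounds):
        # Synchronous round: every non-empty bin releases exactly one ball.
        non_empty = [b for b in range(n_bins) if loads[b] > 0]
        for b in non_empty:
            loads[b] -= 1
        # Each released ball is re-assigned to a uniformly random bin.
        for _ in non_empty:
            loads[rng.randrange(n_bins)] += 1
        max_loads.append(max(loads))
    return loads, max_loads


n = 64  # illustrative size, an assumption
rng = random.Random(0)
start = [n] + [0] * (n - 1)  # an arbitrary (worst-looking) start
final_loads, max_loads = repeated_balls_into_bins(start, rounds=4 * n, rng=rng)
print(max_loads[0], max_loads[-1])
```

Plotting `max_loads` against the round index for several seeds gives an empirical view of how quickly the maximum load drops from the initial pile; it illustrates the convergence claim but does not, of course, verify the stated bounds.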
Tight Bounds for On-line Tree Embedding
Many tree-structured computations are inherently parallel.
As leaf processes are recursively spawned they can
be assigned to independent processors in a multicomputer
network. To maintain load balance, an on-line
mapping algorithm must distribute processes equitably
among processors. Additionally, the algorithm itself
must be distributed in nature, and process allocation
must be completed via message-passing with minimal
communication overhead.
This paper investigates bounds on the performance
of deterministic and randomized algorithms for on-line
tree embedding. In particular, we study tradeoffs between
performance (load-balance) and communication
overhead (message congestion). We give a simple technique
to derive lower bounds on the congestion that
any on-line allocation algorithm must incur in order to
guarantee load balance. This technique works for both
randomized and deterministic algorithms, although we
find the performance of randomized on-line algorithms
to be somewhat better than that of deterministic
algorithms. Optimal bounds are achieved for several
networks including multi-dimensional grids and butterflies
- …