Lower bounds are developed for the processor-time tradeoffs of machines such as linear arrays and two-dimensional meshes, which are compatible with the physical limitation on the propagation speed of messages. It is shown that, under this limitation, parallelism and locality combined may yield speedups superlinear in the number of processors. The results are obtained by means of a novel technique, called the "closed-dichotomy-size technique", designed to obtain lower bounds to the computation time for networks of processors, each of which is equipped with a local hierarchical memory.
Introduction
As digital technology continues to improve, there is an emerging consensus that physical limitations to signal propagation speed and device size are becoming increasingly significant. It is therefore justified to analyze a hypothetical environment where, provocatively, the limits of physics have been attained. Such analysis [BP92] forces a breakdown of several standard models based on instantaneous signal propagation. We call this hypothetical environment the "limiting technology".
Typically, the classical processor-time tradeoff embodied by Brent's Principle [B74] (whereby a computation running for T steps on n processors can be emulated in at most $\lceil n/p \rceil T$ steps on p < n processors of the same type) no longer holds. Informally, when communication delays are proportional to physical distances, the deployment of p processors has the potential to lead to speedups both through p-fold parallelism and through reduction of processor-to-memory distance (exploitation of data locality).
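As a minimal sketch of the classical tradeoff only, the bound of Brent's Principle can be evaluated as below. The function name `brent_emulation_steps` is ours, not the paper's, and the bound assumes instantaneous communication, which is exactly the assumption that the limiting technology removes.

```python
def brent_emulation_steps(T: int, n: int, p: int) -> int:
    """Upper bound ceil(n/p) * T on the steps p processors need to emulate
    T steps of n processors (classical model, no propagation delays)."""
    if not (0 < p <= n):
        raise ValueError("expected 0 < p <= n")
    rounds_per_step = -(-n // p)  # integer ceiling of n/p
    return rounds_per_step * T


# Each of the T guest steps is replayed in ceil(n/p) host rounds.
steps = brent_emulation_steps(T=100, n=10, p=3)
print(steps)
```

Under bounded-speed propagation this quantity only accounts for the loss of parallelism; the additional locality factor is what the rest of the paper quantifies.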
In [BP95a], we have developed simulations of machines with n processors by machines with p < n processors, in the framework of the limiting technology. The slowdown of such simulations takes the form O((n/p)A(n, p, m)), where the parameter m denotes the number of memory cells in unit volume and plays an important role. It is natural to interpret the factor n/p as due to the loss of parallelism (Brent's slowdown) and the factor A(n, p, m) as due to the loss of locality (locality slowdown). In this paper we present lower bounds to the slowdown of any simulation, which match the upper bounds of [BP95a], thereby showing that the locality slowdown is an inherent feature of the limiting technology and providing a quantitative characterization for it.
More specifically, let $M_d(n, p, m)$, $d = 1, 2$, denote a d-dimensional mesh of p nodes, each of which is a CPU equipped with a hierarchical memory module of mn/p locations (see Section 2 for a formal definition). Clearly $M_1$ and $M_2$ have the topology of the conventional linear array and two-dimensional mesh, respectively. We will establish:
Theorem 1. For $d = 1, 2$, in the worst case, the slowdown of any simulation of $M_d(n, n, m)$ by $M_d(n, p, m)$ satisfies $T_p/T_n = \Omega((n/p) A(n, p, m))$, where $A(n, p, m) = \min((n/p)^{1/d}, m + m\,\ell\mathrm{og}(n/(p m^{2d})))$.
We note that the term A, corresponding to the locality slowdown, increases with m up to a certain value, and then only depends on n and p. To observe this phenomenon more clearly, let us consider the special case arising when p = 1, for d = 2. We have $T_1/T_n = \Omega(n \min(n^{1/2}, m\,\ell\mathrm{og}(n/m^2)))$. When m = 1 (as is the case for cellular-automaton and many systolic computations), the worst-case locality slowdown is $\Omega(\log n)$. When $m > n^{1/2}$, the locality slowdown is $\Omega(n^{1/2})$.
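The two regimes of this special case can be checked numerically. The sketch below evaluates the expression min(n^(1/2), m ℓog(n/m^2)) exactly as it is stated for p = 1 and d = 2, taking ℓog(x) = log2(x + 2) from Section 2. The function names are ours, and the constants hidden by the Ω notation are ignored, so the values indicate growth rates only.

```python
import math


def llog(x: float) -> float:
    """Modified logarithm log2(x + 2); it is at least 1 for every x >= 0."""
    return math.log2(x + 2)


def locality_slowdown_p1_d2(n: float, m: float) -> float:
    """min(n^(1/2), m * llog(n / m^2)), with the Omega constants dropped."""
    return min(math.sqrt(n), m * llog(n / m**2))


n = 2**20
for m in (1, 32, 1024, 4096):
    # Small m gives a logarithmic term; for m >= n^(1/2) the value stays at n^(1/2).
    print(m, locality_slowdown_p1_d2(n, m))
```

For m = 1 the logarithmic term is the smaller one, while for m at or above n^(1/2) the minimum is attained by n^(1/2) and no longer depends on m.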
As mentioned above, the lower bounds embodied by Theorem 1 match the upper bounds of [BP95a] for most values of the parameters n, p, and m.
The present paper also contributes a new method, of independent interest, called the closed-dichotomy-size technique, to obtain time lower bounds for machines with hierarchical memory. Whereas alternative techniques were already known ([HK81], [S95]) for uniprocessors, our method also applies to multiprocessor networks where each node is equipped with a hierarchical memory, by capturing the tradeoff between inter-node and intra-node access costs.
In the remainder of this paper we develop the models (Section 2) and the techniques (Section 3) needed to derive the results for linear arrays and meshes (Section 4). We conclude (Section 5) by comparing upper and lower bounds. Due to space constraints, proofs will be either omitted or sketched, and the reader is referred to [BP95b] for details.
Model
We shall consider parallel machines built as interconnections of (processing-element, memory-module) pairs. Such a pair is modeled as a Hierarchical Random Access Machine, or H-RAM [CR73], [AACS87], [S95].
Definition 2. An f(x)-H-RAM is a random access machine where an access to address x takes time f(x).
Hereafter, we let $\ell\mathrm{og}(x)$ denote $\log_2(x + 2)$. Note that $\ell\mathrm{og}(x) \geq 1$ for any nonnegative x.
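To make the definition concrete, the following toy model charges f(x) time units for every access to address x. It is only an illustrative sketch: the class name `HRAM`, its methods, and the particular access function used in the example (integer square root plus one) are our assumptions and are not taken from the paper.

```python
import math
from typing import Callable, Dict


class HRAM:
    """Toy f(x)-H-RAM: each access to address x is charged f(x) time units."""

    def __init__(self, f: Callable[[int], float]) -> None:
        self.f = f                        # access-time function f(x)
        self.cells: Dict[int, int] = {}   # sparse memory, default content 0
        self.elapsed = 0.0                # total access time charged so far

    def read(self, x: int) -> int:
        self.elapsed += self.f(x)
        return self.cells.get(x, 0)

    def write(self, x: int, value: int) -> None:
        self.elapsed += self.f(x)
        self.cells[x] = value


# Hypothetical access function, chosen only for illustration.
ram = HRAM(f=lambda x: math.isqrt(x) + 1)
ram.write(0, 7)        # charged f(0) = 1
ram.write(100, 9)      # charged f(100) = 11
value = ram.read(100)  # charged f(100) = 11
print(value, ram.elapsed)
```

Accesses to low addresses are cheap and accesses to high addresses are expensive, which is the sense in which the memory is hierarchical.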
