Optimal Design of Multilevel Storage Hierarchies by Robert M. Geist & Kishor S
IEEE TRANSACTIONS ON COMPUTERS, VOL. C-31, NO. 3, MARCH 1982
Optimal Design of Multilevel Storage Hierarchies
ROBERT M. GEIST AND KISHOR S. TRIVEDI
Abstract-An optimization model is developed for assigning a fixed set of files across an assemblage of storage devices so as to maximize system throughput. Multiple levels of executable memories and distinct record sizes for separate files are allowed. Through the use of this model, a general class of file assignment problems is reduced to the optimization of a convex function over a convex feasible region. A high-speed search procedure specifically tailored to solve this optimization problem is then presented, along with numerical examples from real systems which demonstrate orders of magnitude improvement in execution time over existing routines for solving the file-assignment problem. The optimal device capacity selection problem is then solved by simply calling the file assignment routine for each candidate set of device capacities.
Index Terms-Capacity selection, file-assignment problem, optimization, performance evaluation, performance-oriented design, queueing networks.
I. INTRODUCTION
THE problem of assigning a set of program and data modules across a set of storage levels has been studied by many authors [1], [9]-[11], [18], [25], [29]. This paper develops a comprehensive file assignment model, consolidating and extending the previous efforts. An efficient procedure for solving this important problem is also developed.
In [1] Arora and Gallo develop a two-stage model for this problem in which they iterate, over a range of device capacities, between a file-assignment model and a central-server queueing model. Their file-assignment model can be shown to provide an optimal assignment when restricted to a uniprogramming environment, but it is known to produce suboptimal assignments in a multiprogramming environment [9].
It is the purpose of this paper to extend their results by removing this restriction. Specifically, in Section II we reformulate the file-assignment problem as the optimization of a convex function over a convex feasible space. This formulation of the file-assignment (FA) problem extends the FA problem in [25] by allowing multiple levels of executable memories and a distinct record size for each file. Given this formulation, and given that the size of the problem (in terms of number of files and number of memory levels) is not too large, any of a number of nonlinear optimization routines, such as IMSL's ZXMIN [14], can then be applied to obtain an optimal loading for the given device capacities and degree of multiprogramming.
Nevertheless, real-world problems such as the airline reservation system in [1], with 42 files to be allocated across five memory levels, quickly push the computation time required by these classical optimization routines beyond acceptable limits. Therefore, in Section III we develop the linear loading routine (LLR), a high-speed search procedure which is specifically tailored to the general FA problem. Using this routine to solve the file-assignment problem, we can then iterate over the range of capacity choices which lie within a fixed budget to solve the general design problem.
Manuscript received January 26, 1981; revised July 7, 1981. This work was supported in part by the National Science Foundation under Grant 78-22327 and the National Library of Medicine Project under Grant LM-03373.
The authors are with the Department of Computer Science, Duke University, Durham, NC 27706.
This approach to determine optimal device capacities and file-assignments should be contrasted with our earlier approach [25], where we considered the overall design problem as a continuous nonlinear optimization problem (much like the development in Section II of this paper). We then derived a variable reduction technique resulting in an efficient solution procedure. Recognizing that capacity choices are discrete in practice, we proposed several methods to discretize the continuous optimum and studied the errors incurred as a result [30]. Due to our current assumption of a distinct record size per file, the variable reduction technique of [25] does not apply. Nevertheless, since we have found an efficient file-assignment procedure (which is applicable to our earlier model of [25]) we can now directly deal with discrete capacity choices.
Several useful extensions of the basic model of Section II
are developed in Section IV.
In Section V we discuss applications of our model to several real-world problems. Experimental results indicate that, in general, dramatic improvements in system throughput (over that which results from using the so-called "Vogel loading" of Arora and Gallo) can be obtained by using the LLR for file allocation in a multiprogramming environment, but only slight improvement in system throughput is possible in the special case that component capacities are chosen optimally. Thus, as a system hardware selection tool the algorithm [1] seems very close to the mark, but as a system tuning tool our approach yields significantly better throughput. In addition, in comparing LLR with other existing approaches to the problem we have consistently observed a reduction in required execution time by a factor exceeding 100.
Kleinrock [16] and Chandy et al. [3] have developed optimization models for open queueing networks suitable for the design of computer-communication networks. Other authors
have used decision models of closed queueing networks for
computer hardware configuration selection [5], [8], [15],
[26][29].
The software configuration design problem of optimally allocating a set of files over a set of storage devices has been discussed by several authors. Ramamoorthy and Chandy [20] studied this problem, but they assumed no queueing delays and hence their approach is not applicable to a multiprogramming
environment. The model by Arora and Gallo [1] includes optimal device capacity selection in addition to optimal file assignment. Recognizing the complexity of the problem, they solve it by a two-stage process. The file assignment is performed in the first stage by a simple loading rule based on Vogel's method. This stage ignores queueing delays. Next, a cyclic queueing model is used to select optimal device capacities. We have improved upon the Vogel loading by including queueing delays in the file assignment stage.
Foster and Browne incorporated queueing delays in their consideration of the file assignment problem [9]. However, they used a hybrid model consisting of simulation and analytic submodels, while our approach to the problem is purely analytical. Other authors [10], [18] have decomposed the file assignment problem by first removing device-capacity constraints and then solving for the optimal branching probabilities. A separate model is then used in an attempt to assign files to devices so as to match the optimal branching probabilities subject to capacity constraints. Since the parameters of the queueing model vary with the file assignment, the optimal branching probabilities also vary with the file assignment. Therefore, an iteration is required between the two models in order to achieve convergence [10]. Our version of the file assignment problem is more general since it determines the optimal file assignment subject only to device-capacity constraints, and requires no such iteration to achieve convergence.
One contribution of this paper is the formulation of a comprehensive file assignment problem as a convex programming problem. The main contribution, however, is the development of a highly efficient search technique for solving this important problem. The general design problem can then be solved by simply calling the file assignment routine for each candidate set of device capacities.
The need for optimization of the I/O subsystem stems from the large speed differential between the CPU and the I/O devices [19], [21]. Besides an improved file assignment, techniques such as blocking (or use of larger page size in a paged system), prefetching, and caching of I/O streams can be used to improve performance [21], [23]. Several extensions to the file assignment technique developed here are desired in order to make its use practical. Both sequentiality and randomness in the file access pattern need to be modeled [29], and extensions to allow multiple job types, as well as extensive validations of such techniques, are needed.
II. THE BASIC MODEL
We assume a memory system with two types of devices. Executable memories are relatively fast so that the CPU will wait while accessing information resident in them. (While our model assumes multiple levels of executable memories, the more common case of a single executable memory is certainly a special case.) On the other hand, while accessing information from relatively slow nonexecutable memories, the CPU is switched to another ready process. Thus, in order to access a record resident in a nonexecutable memory, it must first be fetched into an executable level. We assume that each auxiliary memory is permanently matched with a specified executable memory.
The input parameters to the model are the same as those used by several authors (see [1], [25], [9], and [11]) and deemed sufficiently important to warrant the cost of measurement. Specifically, the parameters for the various memory levels are as follows:
$b_i$ = average instruction execution time for the $i$th executable memory level,
$a_i$ = average access time (latency plus seek time) for the $i$th auxiliary memory level,
$c_i$ = average transfer time per word from the $i$th auxiliary memory to its matched executable memory.
The workload will consist of program and data segments for which we have the following fixed parameters:
$S_j$ = size of the $j$th segment in words,
$I_j$ = average number of instruction-words executed per reference to the $j$th segment,
$r_j$ = average record size of the $j$th segment in words,
$f_j$ = average number of references (requests directed) to the $j$th segment per job.
At the most basic level then we wish to determine the values for each $X_{ji}$, the number of words of segment $j$ loaded into memory level $i$, which will maximize system throughput subject to either capacity constraints (the file-assignment problem) or a global cost constraint (the device capacity selection problem). We assume a static loading of file segments, while a model allowing file migration is treated elsewhere [10], [27], [28]. The first task, however, is to derive an expression for system throughput as a function of the system parameters, the workload parameters, and our decision variables, the $X_{ji}$'s.
Consider the central server network [7] of Fig. 1. The "CPU" time in this network is assumed to model the time between two requests to auxiliary memories, and hence it consists of time to access the executable levels between two auxiliary memory requests. Upon completion of a CPU burst, a job will request service from the $i$th auxiliary device with probability $p_i$ and terminate with probability $p_0 = 1 - \sum_{i=1}^{m} p_i$, whereupon a new job will enter the system via the new program path. We will assume that the queueing network belongs to the product form class [2], [24] with the further restriction of a single class of jobs and single server at each node.
If we here let $t_i$ denote the average number of trips (requests) per job to the $i$th auxiliary device, then clearly
$$t_i = \sum_j f_j (X_{ji}/S_j), \quad i = 1, 2, \ldots, m$$
and hence the average number of trips to the CPU is
$$t_0 = \sum_{i=1}^{m} t_i + 1.$$
Note that these terms are related to the branching probabilities by $t_0 = 1/p_0$ and $t_i/t_0 = p_i$, $i = 1, \ldots, m$. Now we can easily express the average record size demanded for segments residing on auxiliary device $i$ by
$$\Bigl(\sum_j f_j (X_{ji}/S_j)\, r_j\Bigr) \Big/ t_i$$
Fig. 1. Central server network.
and thus the average service time of a request to the $i$th auxiliary device, denoted $1/\mu_i$, is given by
$$1/\mu_i = a_i + c_i \Bigl(\sum_j f_j (X_{ji}/S_j)\, r_j\Bigr) \Big/ t_i \tag{1}$$
whence the average total I/O time per job for the $i$th auxiliary device is
$$t_i/\mu_i = \sum_j (a_i + c_i r_j)\, f_j (X_{ji}/S_j). \tag{2}$$
Similarly, the total CPU time per job can be easily expressed; corresponding to each level $i$ (executable or auxiliary) the average number of instructions executed per job is
$$\sum_j f_j (X_{ji}/S_j)\, I_j.$$
Thus, the total CPU time per job can be expressed as
$$\sum_{\text{executable } i} b_i \sum_j f_j (X_{ji}/S_j)\, I_j \;+\; \sum_{\text{auxiliary } i} b_{m_i} \sum_j f_j (X_{ji}/S_j)\, I_j \tag{3}$$
where $m_i$ = executable level matched to auxiliary level $i$. Since no confusion results, we can simplify notation by writing $b_i$ for $b_{m_i}$; now total CPU time per job is given by
$$\sum_i \sum_j b_i f_j (X_{ji}/S_j)\, I_j \tag{4}$$
and hence the average service time per trip to the CPU is
$$1/\mu_0 = \Bigl(\sum_i \sum_j b_i f_j (X_{ji}/S_j)\, I_j\Bigr) \Big/ t_0. \tag{5}$$
Observe that if we regard $t_i$ as the relative throughput of device $i$ ($i = 0, 1, \ldots, m$), then $y_i = t_i/\mu_i$ is its relative utilization. Note that for any constant $c$, $c\,t_i/\mu_i$ can also be defined as the relative utilization; however, the choice of constant $c = 1$ is crucial in obtaining $y_i$ as a linear function of the decision variables. Our choice of $c$ here avoids some of the difficulties encountered by other authors [9], [10]. Now the system throughput as a function of $y = (y_0, y_1, \ldots, y_m)$ is given by [7], [13]
$$T(y) = \mu_0 p_0 y_0 \frac{G_{n-1}(y)}{G_n(y)} = \frac{G_{n-1}(y)}{G_n(y)} \tag{6}$$
where
$$G_n(y) = \sum_{\substack{(k_0, k_1, \ldots, k_m) \\ \sum k_i = n}} \; \prod_{i=0}^{m} y_i^{k_i}$$
and $n$ is the degree of multiprogramming.
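The sum defining $G_n(y)$ has combinatorially many terms, so in practice it is usually evaluated by a node-by-node recursion. The following is a minimal sketch, assuming the standard convolution recursion for single-server, load-independent nodes (a technique not spelled out in the text above); the function names `normalization_constants` and `throughput` are illustrative only.

```python
def normalization_constants(y, n):
    """Return [G_0(y), ..., G_n(y)] for relative utilizations y = (y_0, ..., y_m).

    Assumption: single-server, load-independent nodes, so that G_k is the sum
    of all monomials of total degree k in the y_i.
    """
    g = [1.0] + [0.0] * n              # with no nodes folded in, only G_0 = 1
    for yi in y:                       # fold in one node at a time
        for k in range(1, n + 1):
            # g[k] still holds the value without this node;
            # g[k - 1] has already been updated to include it.
            g[k] += yi * g[k - 1]
    return g


def throughput(y, n):
    """System throughput T(y) = G_{n-1}(y) / G_n(y), as in (6)."""
    g = normalization_constants(y, n)
    return g[n - 1] / g[n]


def reciprocal_throughput(y, n):
    """Objective z_n(y) = G_n(y) / G_{n-1}(y)."""
    return 1.0 / throughput(y, n)
```

For two nodes with $y = (1, 2)$ and $n = 2$, the monomials of degree 2 are $1$, $2$, and $4$, so $G_2 = 7$ and $G_1 = 3$, which the recursion reproduces.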
Since reciprocal throughput $G_n(y)/G_{n-1}(y)$ is convex in $y$ [18] and each $y_i$ is linear in the $X_{ji}$'s, we can now formulate our problems succinctly as follows.
File Assignment (FA) Problem:
Minimize $G_n(y)/G_{n-1}(y)$ subject to the constraints
$$X_{ji} \ge 0 \quad \text{all } j, i \tag{7a}$$
$$\sum_i X_{ji} = S_j \quad \text{all } j \tag{7b}$$
$$\sum_j X_{ji} \le K_i, \ \text{capacity of level } i, \quad \text{all } i \tag{7c}$$
$$y_i = \sum_j (a_i + c_i r_j)\, f_j (X_{ji}/S_j), \quad i = 1, 2, \ldots, m \tag{7d}$$
$$y_0 = \sum_i \sum_j b_i f_j (X_{ji}/S_j)\, I_j. \tag{7e}$$
Note that the operating system overhead time can be added to (4) above by increasing $I_j$ by an appropriate factor.
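To make (7d) and (7e) concrete, the sketch below computes the relative utilizations from a given loading. It is an illustration only: the function name and the list-of-lists layout of `X` are assumptions, every level (executable or auxiliary) gets an index, and executable levels are given $a_i = c_i = 0$ so that they contribute nothing to device utilization, while `b[i]` for an auxiliary level holds the instruction time of its matched executable level.

```python
def relative_utilizations(X, S, f, r, I, a, b, c):
    """Return (y_0, y_dev) for a loading X, following (7d) and (7e).

    X[j][i] : words of segment j placed on level i
    S, f, r, I : per-segment size, references per job, record size,
                 instruction-words per reference
    a, b, c : per-level access time, instruction time, per-word transfer time
              (assumed convention: a[i] = c[i] = 0 for executable levels)
    y_dev[i] is the relative utilization of level i as an I/O device;
    it is 0 for executable levels under the convention above.
    """
    segments = range(len(S))
    levels = range(len(a))
    # (7e): CPU demand summed over every level and every segment
    y0 = sum(b[i] * f[j] * (X[j][i] / S[j]) * I[j]
             for j in segments for i in levels)
    # (7d): I/O demand on each level
    y_dev = [sum((a[i] + c[i] * r[j]) * f[j] * (X[j][i] / S[j])
                 for j in segments)
             for i in levels]
    return y0, y_dev
```

Because both expressions are linear in the $X_{ji}$, moving a word between levels changes each utilization by a fixed amount per word, which is the property the later exchange analysis relies on.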
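Device Capacity Selection and File Assignment (DCFA) Problem:
Minimize $G_n(y)/G_{n-1}(y)$ subject to the constraints
$$X_{ji} \ge 0 \quad \text{all } j, i \tag{8a}$$
$$\sum_i X_{ji} = S_j \quad \text{all } j \tag{8b}$$
$$\sum_i g_i(K_i) \le \text{BUDGET} \tag{8c}$$
$$y_i = \sum_j (a_i + c_i r_j)\, f_j (X_{ji}/S_j), \quad i = 1, 2, \ldots, m \tag{8d}$$
$$y_0 = \sum_i \sum_j b_i f_j (X_{ji}/S_j)\, I_j \tag{8e}$$
where $K_i = \sum_j X_{ji}$, $g_i(K_i)$ is the cost of memory level $i$ as a function of its capacity $K_i$, and $g_i$ is assumed to be convex in $K_i$.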
In either case we wish to minimize a convex function over a convex feasible region, and hence we have the following result.
Theorem 1: Both the FA problem and the DCFA problem
have the property that any local minimum is also the global
minimum.
Thus, any of a number of nonlinear optimization routines should have little trouble in arriving at a solution, if the number of segments and the number of memory levels (and hence constraints (7b) + (7c) above) are not too numerous. Experiments with 10 segments to be allocated over 5 levels showed this was indeed the case. Nevertheless, as mentioned earlier, increasing the number of segments or levels rapidly pushes computation time beyond acceptable limits. As an example, for the airline reservation system of [1] (detailed in Section V) in which there are 42 segments to allocate over 5 levels, we are faced with 210 decision variables and 257 constraints. Of course, we can eliminate positivity constraints by using decision variables $z_{ji}$, where $z_{ji}^2 = X_{ji}$, and we can use the effective nature of the constraint in (7b) to reduce the number of decision variables to 168; yet, even with these modifications a solution to the FA problem for degree of multiprogramming (DMP) = 1 using IMSL subroutine ZXMIN required one-half hour of CPU time on an Amdahl 470 V/8.
Now one might argue that second-order gradient methods such as ZXMIN, in which projected movement is approximated by $-(\text{Hessian})^{-1} \cdot (\text{gradient})$, cannot be expected to be effective for reciprocal throughput functions, which are notoriously flat. Nevertheless, the classical pattern search methods fare no better, albeit for another reason: too many constraints. Specifically, it seems that unilateral adjustment of variables in a local exploration cannot properly account for an excessive number of diagonally placed constraints, and bilateral adjustment again requires excessive computation time.
These considerations led us to the new search procedure to
be discussed next.
III. THE LINEAR LOADING ROUTINE
We now develop an efficient search routine specifically tuned to the FA problem (device capacities are fixed until further notice). This routine can be categorized as a directed pattern search over a certain affine subspace (vector subspace, offset from origin by a fixed translation vector) of the space of assignment vectors $(\ldots, X_{ji}, \ldots)$, $X_{ji} \in \mathbb{R}$, of file segments to memory levels. The direction of search at each stage will be determined by making use of special characteristics of the reciprocal throughput function $z_n(y) = G_n(y)/G_{n-1}(y)$, which is, after all, our objective function.
The translation vector which brings us into the affine space over which we shall search, our starting point, is the Vogel loading given by Arora and Gallo in [1] which we now describe.
For all segments $j$ and levels $i$ define
$$d_{ji} = \bigl(I_j f_j b_i + (a_i + r_j c_i)\, f_j\bigr) / S_j$$
where $a_i = c_i = 0$ if $i$ is executable, and $b_i$ = instruction execution time of the matched executable level, if $i$ is auxiliary.
Observe that for DMP = 1, the FA problem is simply to minimize $\sum d_{ji} X_{ji}$ subject to capacity and segment size constraints, and hence $d_{ji}$ can be regarded as the "cost" of loading a word of segment $j$ into level $i$. Two very reasonable assumptions are made in [1] in order to rank the levels according to these "loading costs" $d_{ji}$: for any levels $i_1$ and $i_2$:
1) if $a_{i_1} < a_{i_2}$, then $c_{i_1} < c_{i_2}$, i.e., faster access implies faster transfer,
2) if $a_{i_1} < a_{i_2}$, then $b_{i_1} < b_{i_2}$, where here again $b_i$ = reciprocal speed of matched executable level, if $i$ is auxiliary.
We can thus rank the levels, executable in order of execution speeds followed by auxiliary in order of access times, so that $d_{ji}$ is monotone nondecreasing in $i$ for all $j$.
Finally, calculate the row differences $\Delta_{ji} = d_{j,i+1} - d_{ji}$, and load segments in descending order of $\Delta_{ji}$'s, starting with $i = 1$ and moving to the next level when this level is filled.
Then we have [1, page 310] the following.
Theorem 2: For DMP = 1, the Vogel loading rule gives the optimal solution to the FA problem.
(Note: Should assumptions 1) and 2) above not hold, Vogel loading still provides a good initial point for our search routine.)
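The loading rule just described can be sketched as follows. This is one reading of the rule, not the code of [1]: it assumes the levels are already ranked as above (0-based here), that segments may be split across levels, and that total capacity is at least the total segment size; the function names are illustrative.

```python
def loading_costs(S, f, r, I, a, b, c):
    """d[j][i]: cost of loading one word of segment j into level i.

    Assumed convention: a[i] = c[i] = 0 for executable levels, and b[i] for an
    auxiliary level is the instruction time of its matched executable level.
    """
    return [[(I[j] * f[j] * b[i] + (a[i] + r[j] * c[i]) * f[j]) / S[j]
             for i in range(len(a))]
            for j in range(len(S))]


def vogel_loading(d, S, K):
    """Fill levels in rank order; within a level, take segments in descending
    order of the row difference d[j][i+1] - d[j][i]."""
    n_seg, n_lev = len(S), len(K)
    remaining = list(S)                      # words of each segment not yet placed
    X = [[0.0] * n_lev for _ in range(n_seg)]
    for i in range(n_lev):
        free = K[i]
        if i == n_lev - 1:
            order = list(range(n_seg))       # last level takes whatever is left
        else:
            order = sorted(range(n_seg),
                           key=lambda j: d[j][i + 1] - d[j][i],
                           reverse=True)
        for j in order:
            put = min(remaining[j], free)
            X[j][i] += put
            remaining[j] -= put
            free -= put
            if free <= 0:                    # level is full, move to the next one
                break
    return X
```

The row difference measures how much is lost per word if a segment is pushed down one level, so the segments that would suffer most are placed first.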
Next, we must consider those special properties of the objective function $z_n(y) = G_n(y)/G_{n-1}(y)$, which will direct our search. If we begin with a Vogel loading, then in terms of the relative utilizations $y_i$ we are (by virtue of Theorem 2) at that point in the feasible region where $\sum y_i = z_1(y)$ reaches a minimum value; denote the value of $\sum y_i$ here by $W_0$. At that point $y$ in the feasible region where $z_n(y)$ reaches a minimum (our goal), $\sum y_i$ will, of course, take on some value $W_1 \ge W_0$.
Let us define a function $h$, for fixed $n$, by
$$h(W) = \min\{z_n(y) \mid y \in \text{feasible region};\ \textstyle\sum y_i = W\}.$$
Lemma: The function $h$ is monotone nonincreasing on $[W_0, W_1]$.
A proof of the lemma is given in the Appendix.
Next, we note that the minimum value of $h$, $h(W_1)$, is the minimum value of $z_n(y)$, our goal; thus we merely need to find the minimum of the single-variable function $h$ and record the loadings giving rise to this minimum.
At this stage the careful reader might well remark: this is merely a reparameterization of the original model; what has really been gained? The answer is the key point of the linear loading routine (LLR): each evaluation of $h$ requires at most 1 evaluation of $z_n$!
Thus, although we have yet to specify how we will evaluate
h and record loadings, our top-level algorithm for the FA
problem can be described as follows.
choose initial increment Δ_I;
choose final increment Δ_F;    /* 0 < Δ_F < Δ_I */
W = W_0;                       /* Vogel loading */
Δ = Δ_I;
min = h(W);
while (Δ > Δ_F) {
    while (h(W + Δ) < min) {
        W = W + Δ;
        min = h(W);
    }
    Δ = Δ/2;
    min = MIN(h(W - Δ), h(W), h(W + Δ));
    if (min == h(W - Δ)) W = W - Δ;
    else if (min == h(W + Δ)) W = W + Δ;
}
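The top-level routine translates almost line for line into a runnable form. The sketch below assumes that `h` is supplied by the caller as an ordinary function of $W$ (in LLR it would carry out the chosen exchange and evaluate $z_n$), and that it returns `float("inf")` for values of $W$ outside the feasible range; the name `llr_top_level` is illustrative.

```python
def llr_top_level(h, w0, delta_init, delta_final):
    """Step-halving search for the minimizer of a single-variable function h.

    h           : callable W -> value (assumed to return inf where undefined)
    w0          : starting plane W_0 (the Vogel loading in the paper)
    delta_init  : initial increment, delta_final < delta_init
    delta_final : final increment, > 0
    """
    w, delta = w0, delta_init
    best = h(w)
    while delta > delta_final:
        # Keep stepping upward while each step yields a new low.
        while h(w + delta) < best:
            w += delta
            best = h(w)
        # Halve the step and look one half-step to either side.
        delta /= 2
        lo, mid, hi = h(w - delta), h(w), h(w + delta)
        best = min(lo, mid, hi)
        if best == lo:
            w -= delta
        elif best == hi:
            w += delta
    return w, best
```

As a usage example, `llr_top_level(lambda w: (w - 3.0) ** 2, 0.0, 1.0, 0.01)` walks from 0 to 3 in unit steps and then stays there while the increment is halved. A practical version would cache values of `h`, since each evaluation in LLR involves a loading change.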
Now evaluation of $h$, by its very definition, requires optimization of $z_n(y)$ restricted to a plane of the form $\sum y_i = K$. Thus, it should not be surprising that we will make use of the following result.
Theorem 3: For $n \ge 2$, $z_n(y)$ restricted to the plane $\sum y_i = K$ is a convex function with minimum at $y_0 = y_1 = \cdots = y_m = K/(m + 1)$, and is symmetric about this minimum.
A proof is supplied in the Appendix. It should be noted that the point of equal relative utilizations need not lie in the feasible region of the FA problem.
Having characterized the important properties of $z_n$ and presented the top-level search-directing algorithm, we can now describe the details of LLR.
Start with a Vogel loading, so that $W = \sum y_i$ is at the minimum $W_0$; we assume that any excess device capacity is filled with the "null" segment, for which $f_j = I_j = r_j = 0$. Any move from this point is now restricted to the affine space spanned by the set of all exchange vectors
$$\{(0, \ldots, -1, 0, \ldots, 0, 1, 0, \ldots, 1, 0, \ldots, 0, -1, 0, \ldots)\}$$
with entry $-1$ in position $X_{JL}$, $+1$ in position $X_{JK}$, $+1$ in position $X_{ML}$, $-1$ in position $X_{MK}$, and 0 elsewhere; that is, we permit only exchanges of words between segment $J$ level $L$ and segment $M$ level $K$, where, of course, $J, M \in \{\text{segments}\}$ and $L, K \in \{\text{levels}\}$. Since all devices are always fully loaded (perhaps with "null" words) and all segments completely assigned, the capacity and segment size constraints are built into the search space and we need only worry about the positivity constraints on the $X_{ji}$'s.
Fig. 2. The LLR search function h.
Now in the move from plane $W$ to plane $W + \Delta$ required by the top-level algorithm, we would (by virtue of Theorem 3) like to move as close as possible to an assignment point giving relative utilizations $y_0 = y_1 = \cdots = y_m = (W + \Delta)/(m + 1)$. We detail below a highly efficient technique for selecting the bilateral exchange $(J, K, L, M)$ which would move us closest to this point. Our evaluation of $h(W + \Delta)$ then amounts to carrying out this exchange and computing $z_n$. The resulting value is, admittedly, an approximation to $h$. Although no other bilateral exchange will bring us closer to the point of equal relative utilizations, it is conceivable a priori that a multilateral exchange would have a more beneficial effect in reducing $z_n$ on this plane. Nonetheless, we should observe that this evaluation of $h$ amounts to the local exploration phase of the classical pattern search. The fact that we have found the minimum value of $z_n$ over a finite collection $\{(\ldots, X_{ji}, \ldots)\}$ of spanning vectors determined (as seen below) by the top-level routine, rather than the minimum over a local continuum in the $y_i$'s, does not detract from the fact that we have found a "new low" value of $z_n$ which can serve as a basis for the next local exploration. Thus, we should regard the top-level routine as a guidance system which serves to speed up convergence of a pattern search over the $X_{ji}$'s by directing us toward a succession of restricted minima, $h(W_0), h(W_0 + \Delta), \ldots, h(W_1)$; of course, we need not actually pass through each point in the succession in order to arrive at the goal.
Yet remaining are the details of the procedure by which we select that exchange which most nearly equalizes relative utilizations. Obviously, we would like to have an exchange of words between a relatively active segment on an over-utilized ($y_i >$ aver.) level and a relatively inactive segment on an under-utilized ($y_i <$ aver.) one. To effect this, we first decompose each $d_{ji}$ into two summands
$$d^1_{ji} = I_j f_j b_i / S_j$$
$$d^2_{ji} = (a_i + r_j c_i)\, f_j / S_j$$
then for each pair of levels $K$ and $L$ and each pair of segments $J$ and $M$ (including the null segment), an exchange of 1 word between segment $J$ level $L$ and segment $M$ level $K$ would have the following effects:
change in relative utilization of device $L$,
    changeUdevL $= d^2_{ML} - d^2_{JL}$,
change in relative utilization of device $K$,
    changeUdevK $= d^2_{JK} - d^2_{MK}$,
change in relative utilization of the CPU,
    changeUCPU $= (d^1_{ML} - d^1_{JL}) + (d^1_{JK} - d^1_{MK})$,
effect upon the sum of the relative utilizations,
    EOW = changeUdevL + changeUdevK + changeUCPU.
Note: Device $i$ = device 0 = CPU, if $i$ is an executable level.
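We can measure the relative effect of this exchange upon $z_n$ as follows: as long as EOW $\ne 0$, let
$\Delta_0$ = changeUCPU $\cdot\ \Delta \cdot \operatorname{sign}(\text{EOW})/\text{EOW}$,
$\Delta_L$ = changeUdevL $\cdot\ \Delta \cdot \operatorname{sign}(\text{EOW})/\text{EOW}$,
$\Delta_K$ = changeUdevK $\cdot\ \Delta \cdot \operatorname{sign}(\text{EOW})/\text{EOW}$,
$\Delta_i = 0$, all other auxiliary devices $i$
and observe that an exchange of $\Delta \cdot \operatorname{sign}(\text{EOW})/\text{EOW}$ words would take us from point $(y_0, y_1, \ldots, y_m)$ in plane $W$ to point $(y_0 + \Delta_0, y_1 + \Delta_1, \ldots, y_m + \Delta_m)$ in plane $W + \operatorname{sign}(\text{EOW}) \cdot \Delta$.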
Now if the top-level routine calls for movement to a higher plane $W \to W + \Delta$, we consider all those exchange quadruples $(J, K, L, M)$ for which $y_L < \bar{y}$, $\Delta_L > 0$, $y_K > \bar{y}$, $\Delta_K < 0$, and EOW $> 0$, where here $\bar{y} = (W + \Delta)/(m + 1)$; for each, let
$$x(J, K, L, M) = \sum_{i=0}^{m} \Delta_i (\bar{y} - y_i) - \frac{1}{2} \sum_{i=0}^{m} \Delta_i^2$$
and note that the $\Delta_i/\Delta$'s can be precomputed from initial parameters. It is then a simple exercise in elementary algebra to show that $x(J, K, L, M) > x(J', K', L', M')$ implies the exchange $(J, K, L, M)$ (of $\Delta$/EOW words) would bring us closer to $(\bar{y}, \bar{y}, \ldots, \bar{y})$ than would $(J', K', L', M')$. Similar considerations hold for moves to lower planes $W \to W - \Delta$. Note that cutting $\Delta$ directly cuts the magnitude of the exchange vectors used in the local exploration phase.
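The "simple exercise in elementary algebra" is the identity $\sum_i (y_i + \Delta_i - \bar{y})^2 = \sum_i (y_i - \bar{y})^2 - 2x$, so the exchange with the largest $x$ lands nearest the point of equal utilizations. The sketch below scores a single exchange; the function names and argument layout are assumptions made for illustration, with device index 0 standing for the CPU.

```python
def exchange_deltas(num_devices, change_cpu, dev_l, change_l,
                    dev_k, change_k, delta):
    """Per-device utilization changes for an exchange scaled to move the
    sum of utilizations by delta * sign(EOW).

    change_cpu, change_l, change_k are the per-word changes (changeUCPU,
    changeUdevL, changeUdevK); dev_l and dev_k are device indices.
    """
    eow = change_cpu + change_l + change_k
    if eow == 0:
        raise ValueError("exchange leaves the sum of utilizations unchanged")
    sign = 1.0 if eow > 0 else -1.0
    scale = delta * sign / eow          # number of words exchanged
    deltas = [0.0] * num_devices
    deltas[0] += change_cpu * scale     # device 0 is the CPU
    deltas[dev_l] += change_l * scale
    deltas[dev_k] += change_k * scale
    return deltas, sign


def exchange_score(y, deltas, y_bar):
    """x = sum_i deltas_i * (y_bar - y_i) - (1/2) * sum_i deltas_i ** 2."""
    linear = sum(d * (y_bar - yi) for d, yi in zip(deltas, y))
    quadratic = 0.5 * sum(d * d for d in deltas)
    return linear - quadratic
```

Since only the per-word changes depend on the workload, the ratios `deltas[i] / delta` can be tabulated once per quadruple and reused at every step of the search.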
We should note here that the positivity constraints could, as in the classical pattern search, cause our routine to die prematurely on a constraint boundary $X_{ji} = 0$. Although our experimental results seem to indicate that this does not occur in practice, it is perhaps advisable to extend LLR by relaxing positivity constraints and then phasing them in through a succession of increasingly restrictive penalty functions.
Another possible source of error is a "narrow valley" which could be missed because $\Delta_F$ is too large.
In either event there is an easily computed bound on any error: for DMP = $n$, $z_n(y)$, where $y_0 = y_1 = \cdots = y_m = W_0/(m + 1)$, is clearly a lower bound (although likely unobtainable, $n > 1$) on reciprocal throughput against which we can compare our results in order to determine the desirability of introducing penalty functions or cutting $\Delta_F$ (indiscriminate reduction of $\Delta_F$ can wreak havoc on computation time while returning insignificant improvement). In the special case where we do not allow any loading of executable levels, and $r_j$ and $I_j$ are constant (independent of $j$), a tighter bound can be obtained: for auxiliary $i$
$$1/\mu_i = a_i + c_i r$$
and
$$t_0/\mu_0 = \sum_{\text{all } i} \sum_j f_j (X_{ji}/S_j)\, I b_i = \sum_{\text{aux } i} t_i I b_i.$$
Thus, we can write $z_n(y) = z_n(t)$, and since
$$\Bigl(\sum_{i \ge 0} t_i\Bigr) - 1 = 2 \sum_{\text{aux } i} t_i = 2 \sum_{\text{aux } i} \sum_j f_j (X_{ji}/S_j) = 2 \sum_{\text{all } i} \sum_j f_j (X_{ji}/S_j) = 2 \sum_j f_j = \text{constant},$$
we can obtain the desired bound by solving the relatively simple optimization problem
minimize: $z_n(t)$
subject to: $t_i \ge 0$, $i = 0, 1, \ldots, m$
$\sum_i t_i$ = constant.
As the final point of this section, we remark that the solution to the FA problem outlined above is sufficiently fast to render the DCFA problem trivial. We specify a minimum increment in capacity size for each device and then simply iterate over all capacity selections within the specified budget, solving the FA problem at each iteration.
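The outer iteration can be sketched as a plain enumeration. Everything named here is hypothetical: `candidates[i]` is the list of capacities allowed for level `i` (multiples of its minimum increment), `cost(i, k)` plays the role of $g_i(K_i)$, and `solve_fa` stands in for an FA solver such as LLR that returns the reciprocal throughput for a capacity vector, or `None` if the segments do not fit.

```python
from itertools import product


def select_capacities(candidates, cost, budget, solve_fa):
    """Try every capacity vector within budget and keep the best FA result.

    Returns (z, K) with the smallest reciprocal throughput z found,
    or None if no candidate vector is both affordable and feasible.
    """
    best = None
    for K in product(*candidates):
        total = sum(cost(i, k) for i, k in enumerate(K))
        if total > budget:
            continue                      # violates the budget constraint (8c)
        z = solve_fa(K)                   # one FA problem per capacity choice
        if z is not None and (best is None or z < best[0]):
            best = (z, K)
    return best
```

The number of FA problems solved is the product of the candidate-list lengths, which is why the speed of the inner FA routine determines whether this brute-force approach is practical.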
IV. EXTENSIONS OF THE BASIC MODEL
The basic design model presented in Section II can be augmented in many different ways, and the results of Section III can be extended for each case. Instead of further complicating the algebra of the previous sections, we present several variations on the basic design model here.
A. Locality of File References
In the basic model we have assumed that each reference to a record of a file located on an auxiliary memory requires the transfer of the record into the matching executable memory buffer. However, if we allocate a buffer of size $B_j$ to a file $j$ in the executable memory, then due to locality of reference only a fraction $1 - H_j(B_j)$ of all references will trigger an actual I/O transfer. Here $H_j(B_j)$ is the hit ratio of finding a record of file $j$ in a buffer of size $B_j$, where function $H_j$ and buffer size $B_j$ are input parameters to the design problem. The only change required is to substitute $f_j \cdot (1 - H_j(B_j))$ for $f_j$ in each of the equations (1)-(6), (7d), (7e), (8d), and (8e) of Section II. In this manner we can also study the question of allocation of a limited buffer space among the given set of files.
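The substitution amounts to rescaling the reference counts before they enter the model. A minimal sketch, assuming the hit-ratio functions are supplied as Python callables (the function name and the example hit-ratio curves in the usage note are made up for illustration):

```python
def effective_references(f, hit_ratio, B):
    """Replace each f_j by f_j * (1 - H_j(B_j)).

    f         : references per job for each file
    hit_ratio : one callable H_j per file, mapping buffer size to a value in [0, 1]
    B         : buffer size allotted to each file
    """
    return [fj * (1.0 - Hj(Bj)) for fj, Hj, Bj in zip(f, hit_ratio, B)]
```

For instance, with `f = [10.0, 4.0]`, a constant hit ratio of 0.5 for the first file and a hypothetical linear curve `min(1.0, b / 100.0)` for the second with `B = [0, 25]`, the effective counts are 5.0 and 3.0.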
In addition, we have modeled the access pattern of records using an independent reference model [6]. In practice, there is usually sequential correlation in the reference pattern, and blocking of several file records into a single device block is used to take advantage of such sequentiality. Mapping such file behavior onto a queueing network implies that the branching probabilities will not be fixed quantities anymore. However, if we assume a $k$th order of Markov dependence in the file access pattern, then we are able to utilize the result of Kobayashi and Reiser [17] which shows that in spite of such dependence the network has a product form solution dependent only upon the total work demand of each service center. Thus, we are able to model the sequentiality of file access pattern; details of such a model will appear in [29]. This model allows the physical block size to be device-dependent and the logical record size to be file-dependent.
B. Open Queueing Networks
We observe that all the results of Sections II and III assume that the objective function $z_n(y) = G_n(y)/G_{n-1}(y)$ is a convex and monotone increasing function of $y$. In these sections we chose to let $z_n(y)$ be the reciprocal throughput in a closed queueing network. We could, instead, cut open the NEW PROGRAM PATH and turn the model of Fig. 1 into an open network (Fig. 3), which is being fed from a Poisson job stream with the average arrival rate $\lambda$. The arrival rates to the nodes $\lambda_i$, and subsequently the node utilizations $\rho_i$, can be easily computed by using standard techniques [16] as follows:
$$\lambda_0 = \lambda t_0, \quad \lambda_i = \lambda t_i, \quad \rho_0 = \lambda_0/\mu_0 = \lambda y_0, \quad \text{and} \quad \rho_i = \lambda y_i.$$
The average response time $R(y)$ is now given by [16] (assuming the stability condition $\rho_i < 1$ is satisfied for all $i$):
$$R(y) = \frac{1}{\lambda} \sum_{i=0}^{m} \frac{\rho_i}{1 - \rho_i}.$$
In [3] it is shown that $R(y)$ is a convex monotone increasing function of $y$, and again it is easy to establish that $R(y)$ restricted to the plane $\sum y_i = K$ takes on a minimum at the point of equal relative utilizations. Thus, if we let the objective function be $R(y)$ in the design models of Section II, then all of the results of Sections II and III will follow.
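The open-network objective is simple enough to evaluate directly. The sketch below computes $R(y)$ from the relative utilizations and the arrival rate; the function name is illustrative, and the stability check is an added safeguard rather than part of the formula.

```python
def response_time(y, lam):
    """Average response time R(y) = (1/lam) * sum_i rho_i / (1 - rho_i),
    with rho_i = lam * y_i, for an open network of single-server nodes."""
    rho = [lam * yi for yi in y]
    if any(p >= 1.0 for p in rho):
        raise ValueError("unstable: some node has utilization >= 1")
    return sum(p / (1.0 - p) for p in rho) / lam
```

With $\lambda = 1$, the balanced point $y = (0.5, 0.5)$ gives $R = 2$, while $y = (0.75, 0.25)$ on the same plane gives $R = 3 + 1/3$, consistent with the minimum lying at equal relative utilizations.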
C. Load-Dependent Servers
Our model formulation calls for a queueing network of the BCMP class [2], [4] with the further restrictions of single job type and load-independent service at each node. These restrictions are necessary, since the convexity of the reciprocal throughput function has been shown only for this restricted class of closed queueing networks.
Our extensive experience with more general networks of the BCMP type has not yet shown any departure from unimodality. Thus, we believe that any nonconvex behavior of the reciprocal throughput function is likely to be benign. Based on this belief, we have been successfully using extended formulations of the model of Section II and the search routine of Section III for product form networks with load-dependent servers. Similarly, our procedure can be used for networks with multiple job classes. Nevertheless, the proof of convexity (or unimodality) for the general product form networks remains an open problem.
V. EXPERIMENTAL RESULTS
We have applied the results of the previous sections to systems studied by several other authors [1], [25], [9], [11]. Let us first restrict ourselves to the file-assignment problem.
1) In [25] Trivedi, Wagner, and Sigmon consider the allocation of 10 files across three auxiliary devices, a drum (IBM 2305-1) and two disks (ITEL 7330-1 and CDC 23142). No files were to be permanently stored in the single executable level, and DMP was fixed at 6.
Using a modified ZXMIN routine applied to a special case of the basic model of Section II, they obtained an optimal throughput of 51.82 jobs/s; this required 20 s of CPU time on an IBM 370/165.
Using LLR we obtained an identical throughput figure of 51.82 jobs/s; nevertheless, although considerable reassignment from the initial Vogel loading (which yielded only 41.82 jobs/s) was carried out, the total computation time required by LLR showed improvement by a factor of 125 over the 20 s mentioned above.
2) Consider the airline reservation system studied in [1],
where there are 42 segments to be allocated across 5 devices;
the workload specifications are reproduced in Table I, and the
device parameters in Table II. Device costs and total budget
were estimated by a least squares linear fit to the 10
configurations given in [1, Table IX].
Although we shall consider this system later in the context
of capacity-selection problems, we can here extract an
interesting collection of file-assignment problems by restricting
ourselves to those capacities found to be optimal in [1]:
device 1: 128 K words (executable),
device 2: 1 M words (executable),
device 3: 1 M words (auxiliary),
device 4: 7 M words (auxiliary),
device 5: 3 M words (auxiliary).
Since the assumptions in [1] call for multiple servers at the
auxiliary devices, whereas we allow, for the moment, only
single-server devices, these capacities should here be termed
reasonable but not necessarily optimal. (When we turn to the
capacity selection problem, we shall see that these capacities
are, in fact, not optimal for any DMP under either set of
assumptions.)
Now for each DMP we have a file-assignment problem to
which we can apply LLR. In Fig. 4 we plot the results:
throughput as a function of DMP for Vogel loading and for
LLR loading.
Fig. 3. Open central server network.
TABLE I
WORKLOAD PARAMETERS FOR THE AIRLINE RESERVATION SYSTEM
f[i]      [illegible]   r[i]      S[i]
3.1       6000.0        2112.0    2112.0
1.2       3000.0        6000.0    6000.0
0.26      3000.0        3456.0    3456.0
0.26      3100.0        1920.0    1920.0
0.76      2900.0        254.0     254.0
0.32      2900.0        3200.0    3200.0
0.32      2400.0        2752.0    2752.0
0.14      800.0         254.0     254.0
0.176     1800.0        1088.0    1088.0
0.14      1800.0        1792.0    1792.0
0.15      1000.0        832.0     832.0
0.15      600.0         640.0     640.0
0.15      1000.0        384.0     384.0
0.296     10.0          64.0      21504.0
0.228     1700.0        1700.0    167040.0
0.268     10.0          150.0     24000.0
0.820     1.0           2.0       264000.0
1.420     304.0         304.0     516800.0
0.492     3.0           6.0       90024.0
0.176     1.0           2.0       330.0
0.492     10.0          2000.0    113154.0
0.001     20.0          462.0     462.0
0.296     27.0          27.0      216000.0
0.101     30.0          35.0      1033352.0
1.130     40.0          46.0      5630400.0
0.843     47.0          47.0      135360.0
0.180     30.0          33.0      68360.0
0.001     20.0          28.0      26032.0
0.912     30.0          33.0      600000.0
0.090     1.0           1.0       1320.0
0.560     50.0          202.0     126000.0
0.140     4.0           2.0       2112.0
0.068     1.0           1.0       4400.0
0.068     7.0           7.0       93060.0
0.0001    50.0          245.0     21102.0
0.001     100.0         209.0     30030.0
0.750     7.0           7.0       54012.0
0.0001    100.0         1000.0    60060.0
0.001     50.0          66.0      20064.0
0.001     3.0           3.0       240042.0
5.0       75.0          1056.0    3168.0
0.001     1000.0        500.0     482608.0
TABLE II
DEVICE PARAMETERS FOR THE AIRLINE RESERVATION SYSTEM
a[i]    b[i]     c[i]     cost[i]    increment[i]
0.0     .0005    0.0      13.9989    131000.0
0.0     .002     0.0      6.11732    500000.0
4.3     .0005    .0042    4.56938    500000.0
17.0    .002     .0042    2.88771    500000.0
47.5    .002     .007     .818032    3000000.0
Fig. 4. Vogel versus LLR. (Plot of throughput in jobs/s against degree of multiprogramming, with one curve for Vogel loading and one for LLR loading.)
As is easily seen from these graphs, LLR becomes extremely
important as DMP increases, with a difference of more than
9 jobs/s at DMP = 20, an improvement of 27 percent. Thus,
a system analysis based solely on Vogel loading, as in [1],
might vastly underestimate throughput potential and hence
call for new hardware, when in fact a reallocation of files would
suffice.
3) In [9] Foster and Browne use a hybrid model consisting
of analytic and simulation submodels to optimally allocate the
42 files of Table I across 4 memory levels (one of which is
executable) in a DMP = 7 environment. Their final throughput
figure (measured at the CPU) of 96.9 does not differ
significantly from that we obtained using LLR, 97.2. Nonetheless,
here, as in example 1), LLR showed a huge improvement in
computation time, a factor of at least 110 over the 100 s (CDC
6600) required by the hybrid model.
4) In [11] Foster and Browne study the UT2D Peripheral
Processor Library of the CDC 6600 system at the University
of Texas. In this system there are 39 files to be allocated across
three I/O devices: a central memory (CM), an extended core
storage (ECS), and a CDC 808 system disk capable of holding
all files. The single executable level (PPU) allows no
permanent storage and consists of 4 parallel servers, so that for the
DMP under consideration, 4, there is never any queueing for
a processor.
Foster and Browne validate their hybrid model by setting
CM = 2000, ECS = 0 (actual system values), and observing
that throughput as computed by their model is reasonably close
to the actual system throughput of 52. Our model passes this
same validation test, as LLR returns a throughput of 54.60 for
the given set of parameter values. It should be noted that since
assignment of library routines is considered here, the single
job class assumption seems justified.
In addition, we find a surprising and potentially important
result: our model suggests that this particular system is so
totally processor-bound that file-assignment is unimportant! As
evidence of this, we point to the fact that increasing CM and
ECS capacities beyond the validation levels given above yields
no significant improvement in throughput. Further, if we select
a representative case, CM = 6000, ECS = 6000, with LLR
throughput 54.66, we find that doubling the PPU speed causes
LLR throughput to jump to 109.27, almost precisely
double.
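The processor-bound claim can be checked directly from the figures quoted above. Doubling the PPU speed scales throughput by almost exactly two, whereas enlarging CM from 2000 to 6000 and ECS from 0 to 6000 changes it by roughly a tenth of a percent:

```latex
\frac{109.27}{54.66} \approx 1.999,
\qquad
\frac{54.66 - 54.60}{54.60} \approx 0.0011 .
```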
We turn now to experimental results on capacity selection.
As mentioned earlier, the strategy here is as follows. For each
DMP we iterate over all sets of capacity choices which lie
within the fixed budget, solving the FA problem for each such
set; we then simply record that set of capacity choices and that
associated file assignment which jointly give rise to maximal
system throughput.
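The strategy just described amounts to an exhaustive discrete search wrapped around the file-assignment routine. A minimal sketch follows, assuming a linear cost per unit of capacity and a caller-supplied `solve_fa` function that stands in for LLR (which is not reproduced here); all names, the `max_steps` bound, and the toy numbers are hypothetical.

```python
from itertools import product


def capacity_search(costs, increments, max_steps, budget, solve_fa):
    """Exhaustive search over capacity choices within a fixed budget.

    costs[i]      : cost per unit of capacity of device i (assumed linear)
    increments[i] : capacity increment of device i
    max_steps[i]  : largest number of increments tried for device i
    solve_fa      : callable(capacities) -> throughput; a stand-in for
                    the file-assignment routine, NOT the paper's LLR code
    """
    best = (None, float("-inf"))
    for steps in product(*(range(1, k + 1) for k in max_steps)):
        caps = [s * inc for s, inc in zip(steps, increments)]
        cost = sum(c * cap for c, cap in zip(costs, caps))
        if cost > budget:
            continue                    # outside the fixed budget
        throughput = solve_fa(caps)
        if throughput > best[1]:
            best = (caps, throughput)
    return best


# Toy two-device example with a made-up throughput function.
toy_fa = lambda caps: min(2 * caps[0], caps[1])
result = capacity_search(costs=[3.0, 1.0], increments=[10, 20],
                         max_steps=[3, 3], budget=100, solve_fa=toy_fa)
print(result)
```

Because every feasible capacity set triggers one call to the file-assignment routine, the speed of that routine determines whether such a search is practical.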
A priori we might expect two results.
1) There should be no globally (DMP-independent) optimal
capacity selections.
2) Capacity selections based solely on the Vogel loading
rule will agree with LLR selections only for DMP = 1.
Neither seems to be entirely the case.
We considered two versions of the airline reservation system
of [1] (parameters in Tables I and II), the first with single
servers at all devices, the second with the multiple-server
structure of [1]:
Memory Level       Servers
fast executable    1
slow executable    1
fast drum          4
slow drum          4
disk               2
For each version we chose capacity increments sufficiently
small to allow all 10 configurations listed in [1] to be
considered:
Capacity Increments:
device 1: 128K
device 2: 0.5 M
device 3: 0.5 M
device 4: 0.5 M
device 5: 3 M
In addition, since DMP is, realistically, limited by main
memory capacity, we assumed a minimum requirement of
(128K/15) words/DMP of fast executable memory, that is,
device 1; note that this is consistent with [1, page 319].
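Since device 1 is sized in 128K increments, the requirement of (128K/15) words per unit of DMP means that each increment supports a DMP of up to 15. A small sketch of the resulting lower bound (the function name is hypothetical):

```python
def min_fast_increments(dmp, jobs_per_increment=15):
    """Smallest number of 128K increments of device 1 that satisfies the
    (128K/15)-words-per-DMP requirement (integer ceiling division)."""
    return -(-dmp // jobs_per_increment)


for dmp in (1, 15, 16, 20):
    print(dmp, min_fast_increments(dmp))
```

Under this rule a single increment suffices through DMP = 15, and a second increment becomes mandatory at DMP = 16.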
For the single-server case, the optimal capacity selections
found are summarized in Table III and the associated LLR
throughputs are plotted in Fig. 5. These values were found in
a single run of 24 min on an IBM 370/165.
Thus, expectation 1) held for DMP 1-4, but thereafter a
stabilization took place and configuration 3 remained optimal
until DMP = 16, at which point the externally imposed
requirement of an additional increment of fast executable
memory forced a change to configuration 4, which was stable
thereafter.
We should also remark that the dramatic effects of LLR
seen earlier in the solution to the FA problem fade rapidly as
we approach optimal capacity selection; as an example, the
optimal configuration 3 at DMP = 15 has an LLR-loaded
throughput of 63.16 jobs/s, but this same configuration has
a Vogel-loaded throughput of 61.41 jobs/s, almost as good.
Note that the LLR throughput here is within 14 percent of the
(likely unobtainable) upper bound of 73.54 jobs/s.
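The two comparisons in this paragraph work out as follows: the gain of LLR over Vogel loading for configuration 3 at DMP = 15, and the gap between the LLR throughput and the upper bound.

```latex
\frac{63.16 - 61.41}{61.41} \approx 0.028 \quad (\text{about } 2.8\%),
\qquad
\frac{73.54 - 63.16}{73.54} \approx 0.141 \quad (\text{about } 14\%).
```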
This brings us to expectation 2) and a surprise. If we solve
the DCFA problem without LLR, that is, by restricting
ourselves to Vogel loading alone, the capacity choices made are
precisely the same as those made with LLR for DMP 1-17!
After this point (beginning DMP = 18) Vogel loading called
for too much fast executable memory:
device 1: 384K
device 2: 1M
device 3: 2M
device 4: 4M
device 5: 3M
The optimal capacity selections for the multiple-server case
are summarized in Table IV. Here both of our a priori
expectations come closer to being fulfilled. Optimal capacity
selection appears to be more sensitive to DMP, and with the
exception of DMP = 1-4 (where no queueing takes place at
either drum), LLR-based capacity selections differ from those
that would be made solely on the basis of Vogel loading.
Nevertheless, system tuning still provides minimal
improvement for an optimally chosen set of capacities, and thus
system designers using the technique of [1] would not lag far
behind in terms of the final throughput figure. For example,
in the DMP = 10 case, the technique of [1] would call for
choices
DEVICE 1: 256K
DEVICE 2: 0.5M
DEVICE 3: 2.5M
DEVICE 4: 5.0M
DEVICE 5: 3.0M
on the basis of a Vogel-loaded throughput of 83.90 jobs/s,
whereas the choice for DMP = 10 in Table IV was made on
the basis of an LLR-loaded throughput of 84.09 jobs/s, not
much better.
System tuning of suboptimal capacities, on the other hand,
continues to be a major advantage of our technique over that
of [1]. If we restrict ourselves to, say, the capacities given in
configuration 2 of [1]
DEVICE 1: 128K
DEVICE 2:
DEVICE 3: 0.5M
DEVICE 4: 9.0M
DEVICE 5: 3.0M
we find, at DMP = 15, a Vogel-loaded throughput of 41.58
jobs/s and an LLR-loaded throughput of 47.93 jobs/s, a 15
percent improvement. Even for "good" capacity selections,
LLR can provide some worthwhile improvement: for
configuration 1 of Table IV (optimal for DMP = 1-4!), we find at
DMP = 20, a Vogel-loaded throughput of 77.96 jobs/s and
an LLR-loaded throughput of 82.89 jobs/s, a 6 percent
improvement.
TABLE III
OPTIMAL CAPACITY SELECTIONS USING LLR
CONFIGURATION  DEVICE 1  DEVICE 2  DEVICE 3  DEVICE 4  DEVICE 5  OPTIMAL FOR
1              128K      2.5M      0.5M      4.5M      3M        DMP = 1
2              128K      2M        1.5M      4M        3M        DMP = 2-3
3              128K      1.5M      2.5M      3.5M      3M        DMP = 4-15
4              256K      1.5M      2M        3.5M      3M        DMP = 16-20
Fig. 5. LLR for optimal capacities. (Plot of throughput in jobs/s against degree of multiprogramming, with curves labeled by configuration.)
TABLE IV
OPTIMAL CAPACITY SELECTIONS USING LLR (MULTIPLE-SERVER CASE)
CONFIGURATION  DEVICE 1  DEVICE 2  DEVICE 3  DEVICE 4  DEVICE 5  OPTIMAL FOR
1              128K      2.5M      0.5M      4.5M      3.0M      DMP = 1-4
2              128K      1.5M      1.5M      5.0M      3.0M      DMP = 5-7
3              128K      1.0M      2.5M      4.5M      3.0M      DMP = 8
4              256K      1.5M      1.5M      4.5M      3.0M      DMP = 9-10
5              512K      0.5M      2.0M      4.5M      3.0M      DMP = 11-12
6              512K      0.5M      1.5M      5.0M      3.0M      DMP = 13-20
VI. CONCLUSION
We have developed an efficient algorithm for optimal
file-assignment in a storage hierarchy. Our experiments indicate
that the algorithm developed in this paper executes orders of
magnitude faster than existing approaches to the problem,
while improving throughput substantially over the Vogel
loading rule suggested in [1]. Numerical examples show the
usefulness of our algorithm in system tuning. Using the
efficient file loading rule as a core, optimal storage device
capacities are determined by a discrete search. The result is a
convenient and efficient design tool that allows the designer
to experiment with many different choices without a
proportionately large expense.
Extensions to include multiple job types and sequentiality
of file access patterns, as well as extensive validation studies,
are desired in order to make practical use of such models.
APPENDIX
Proof of Lemma 1: The level curves of $z_1(y) = \sum_{i=0}^{m} y_i$ are
parallel m-planes in $R^{m+1}$. If we wish to evaluate h(W_0 + A),
we can (theoretically) first proceed from the plane W_0 to the
parallel plane W_0 + A along the straight line connecting that
point at which z_n assumes its minimum (in plane W_0) to that
point at which z_n assumes its minimum (in plane W_1). Since
we are moving along a straight line towards the minimum of
the convex function z_n, the value of z_n obtained, upon reaching
plane W_0 + A, is no larger than the value where we started,
h(W_0) (see Fig. 2). (Remark: To be absolutely precise here we
should say only that our starting point will be in plane W_0, not
necessarily at h(W_0). There is an indeterminacy in Vogel
loading caused by arbitrary resolution of ties in the factors A*_i;
thus, multiple Vogel loadings are possible over which we would
have to minimize z_n to actually find h(W_0). Nevertheless, we
continue to use the term h(W_0) to denote our starting point
since, for realistic problems, multiple Vogel loadings are highly
improbable: such would require a 16-digit tie between A*_i's,
and then, since tied segments are loaded consecutively anyway,
an unfortunate split of an A*_i-tied file pair across a device
boundary.) Finding h(W_0 + A) then amounts to a further
reduction (at least no increase) obtained by moving to the
minimum of z_n(y) within the plane W_0 + A, so certainly
$h(W_0 + A) \le h(W_0)$. An entirely analogous argument shows
$h(W_0 + 2A) \le h(W_0 + A)$.
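Proof of Theorem 3: The symmetry follows trivially from
the definition of $z_n$, and since $z_n$ is convex it remains only to
show that
$$\partial\Big(z_n \,\Big|\, \sum_{i=0}^{m} y_i = K\Big) \Big/ \partial y_j = 0$$
at $y_0 = y_1 = \cdots = y_m = K/(m + 1)$.
We proceed by induction. Observe that
$$z_2 = \frac{1}{2}\left[\sum_{i=0}^{m} y_i + \sum_{i=0}^{m} y_i^2 \Big/ \sum_{i=0}^{m} y_i\right]$$
and thus
$$\Big(z_2 \,\Big|\, \sum_{i=0}^{m} y_i = K\Big) = \frac{1}{2}\left[K + \frac{\sum_{i=0}^{m-1} y_i^2 + \big(K - \sum_{i=0}^{m-1} y_i\big)^2}{K}\right]$$
so that
$$\frac{\partial\big(z_2 \,\big|\, \sum_{i=0}^{m} y_i = K\big)}{\partial y_j} = \frac{1}{2} \cdot \frac{2y_j - 2\big(K - \sum_{i=0}^{m-1} y_i\big)}{K},$$
$j = 0, 1, 2, \ldots, m-1$. Setting each to 0 and solving, we find the
unique solution $y_0 = y_1 = \cdots = y_m = K/(m + 1)$.
Now take $n \ge 3$ and assume the result for $z_2, \ldots, z_{n-1}$. Using
the recursive formula for $G_n$,
$$G_n = \frac{1}{n} \sum_{j=1}^{n} \Big(\sum_{i=0}^{m} y_i^j\Big) G_{n-j}$$
(see [12] for a derivation), we can write
$$z_n = \frac{1}{n}\left[\sum_{i=0}^{m} y_i + \frac{\sum_{i=0}^{m} y_i^2}{z_{n-1}} + \frac{\sum_{i=0}^{m} y_i^3}{z_{n-1} z_{n-2}} + \cdots + \frac{\sum_{i=0}^{m} y_i^n}{z_{n-1} z_{n-2} \cdots z_1}\right].$$
Then each summand of $(z_n \mid \sum_i y_i = K)$ (after the first) is of the
form
$$S_t = \frac{N}{D} = \frac{\sum_{i=0}^{m-1} y_i^t + \big(K - \sum_{i=0}^{m-1} y_i\big)^t}{\prod_{p=1}^{t-1} \big(z_{n-p} \,\big|\, \sum_{i=0}^{m} y_i = K\big)}$$
so that
$$\frac{\partial S_t}{\partial y_j} = \frac{D \, \dfrac{\partial N}{\partial y_j} - N \, \dfrac{\partial D}{\partial y_j}}{D^2}.$$
Now using our inductive assumption and the product rule
for differentiation, we can easily see that $\partial D/\partial y_j = 0$ at
$y_0 = y_1 = \cdots = y_m = K/(m + 1)$ (this formally requires another
induction); further, it is entirely straightforward to check that
$\partial N/\partial y_j = 0$ at $y_0 = y_1 = \cdots = y_m = K/(m + 1)$. Thus, each
summand has each partial derivative 0 at the point of equal
relative utilizations, and the proof is complete.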
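The conclusion of Theorem 3 can be spot-checked numerically. The sketch below computes $G_n$ with the recursive formula used in the proof and compares $z_n$ at the point of equal relative utilizations with $z_n$ at random points on the same plane $\sum_i y_i = K$. It is only an illustration under arbitrary sample values, not a substitute for the proof; the function names are hypothetical.

```python
import random


def g_recursive(y, n_max):
    """G_0..G_n via G_n = (1/n) * sum_{j=1..n} (sum_i y_i**j) * G_{n-j}."""
    power_sums = [sum(yi ** j for yi in y) for j in range(n_max + 1)]
    g = [1.0]
    for n in range(1, n_max + 1):
        g.append(sum(power_sums[j] * g[n - j] for j in range(1, n + 1)) / n)
    return g


def z(y, n):
    """Reciprocal throughput z_n = G_n / G_{n-1}."""
    g = g_recursive(y, n)
    return g[n] / g[n - 1]


K, devices, n = 0.06, 4, 5              # arbitrary sample values
equal = [K / devices] * devices
z_equal = z(equal, n)

random.seed(0)
worst_gap = float("inf")
for _ in range(1000):
    w = [random.random() for _ in range(devices)]
    y = [K * wi / sum(w) for wi in w]   # random point with sum(y) = K
    worst_gap = min(worst_gap, z(y, n) - z_equal)

print(z_equal, worst_gap)
```

For equal utilizations $c$ on $M$ devices, $z_n = c\,(n + M - 1)/n$, which gives 0.024 for the sample values above; none of the random points on the plane should fall below that value.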
REFERENCES
[1] S. R. Arora and A. Gallo, "Optimization of static loading of multilevel
memory systems," J. Ass. Comput. Mach., vol. 20, pp. 307-319, Apr.
1973.
[2] F. Baskett, K. M. Chandy, R. R. Muntz, and F. G. Palacios, "Open,
closed, and mixed networks of queues with different classes of customers,"
J. Ass. Comput. Mach., vol. 22, no. 2, pp. 248-260, 1975.
[3] K. M. Chandy, J. Hogarth, and C. H. Sauer, "Selecting capacities in
computer communication systems," IEEE Trans. Software Eng., vol.
SE-3, pp. 290-295, July 1977.
[4] K. M. Chandy, J. H. Howard, and D. F. Towsley, "Product form and
local balance in queueing networks," J. Ass. Comput. Mach., vol. 24,
no. 2, pp. 250-263, 1977.
[5] W-W. Y. Chiu, "Analysis and applications of probabilistic models of
multiprogrammed computer systems," Ph.D. dissertation, Dep. Elec.
Eng., Univ. of California, Santa Barbara, Dec. 1973.
[6] E. G. Coffman, Jr. and P. J. Denning, Operating System Theory.
Englewood Cliffs, NJ: Prentice-Hall, 1973.
[7] P. J. Denning and J. P. Buzen, "The operational analysis of queueing
network models," Comput. Surveys, vol. 10, pp. 225-261, Sept.
1978.
[8] D. Ferrari, Computer Systems Performance Evaluation. Englewood
Cliffs, NJ: Prentice-Hall, 1978.
[9] D. V. Foster and J. C. Browne, "File assignment in memory hierarchies,"
in Modeling and Performance Evaluation of Computer Systems,
Beilner and Gelenbe, Eds. Amsterdam, The Netherlands:
North-Holland, 1976.
[10] D. V. Foster, L. W. Dowdy, and J. E. Ames, "File assignment in a STAR
network," Dep. Syst. Inform. Sci., Vanderbilt Univ., Nashville, TN,
Tech. Rep. 77-3, 1977.
[11] D. V. Foster and J. C. Browne, "Channel balancing in a memory
hierarchy-A case study," Dep. Comput. Sci., Duke University, Durham,
NC, Rep., 1976.
[12] R. Geist and K. Trivedi, "Queueing network models in computer system
design," Math. Mag., to be published.
[13] T. P. Giammo, "Extensions to exponential queueing network theory for
use in a planning environment," in Proc. IEEE Comput., Sept. 1976.
[14] IMSL Reference Manual. Houston, TX: IMSL, Inc., 1979.
[15] S. K. Kachhal and S. R. Arora, "Seeking configurational optimization
in computer systems," in Proc. ACM Annu. Conf., 1975, pp. 96-101.
[16] L. Kleinrock, Queueing Systems, Vol. II. New York: Wiley, 1976.
[17] H. Kobayashi and M. Reiser, "On generalization of job routing behavior
in a queueing network model," IBM, Yorktown Heights, NY, Res. Rep.
RC-5252, 1975.
[18] T. G. Price, "Probability models of multiprogrammed computer
systems," Ph.D. dissertation, Dep. Elec. Eng., Stanford Univ., Palo Alto,
CA, 1974.
[19] E. W. Pugh, "Storage hierarchies: Gaps, cliffs, and trends," IEEE Trans.
Magn., vol. MAG-7, pp. 810-814, Dec. 1971.
[20] C. V. Ramamoorthy and K. M. Chandy, "Optimization of memory
hierarchies in multiprogrammed systems," J. Ass. Comput. Mach., vol.
17, pp. 426-445, July 1970.
[21] A. J. Smith, "Algorithms and architectures for enhanced file system
use," in Experimental Computer Performance and Evaluation, D.
Ferrari and M. Spadon, Eds., SOGESTA, 1981.
[22] G. Strang, Linear Algebra and Its Applications. New York:
Academic, 1976.
[23] K. S. Trivedi, "Prepaging and applications to the STAR-100 computer,"
in High-Speed Computer and Algorithm Organization, D. Kuck, A.
Sameh, and D. Lawrie, Eds. New York: Academic, 1977.
[24] K. S. Trivedi and R. A. Wagner, "A decision model of closed queueing
networks," IEEE Trans. Software Eng., vol. SE-5, July 1979.
[25] K. S. Trivedi, R. A. Wagner, and T. M. Sigmon, "Optimal selections
of CPU speed, device capacities, and file assignments," J. Ass. Comput.
Mach., July 1980.
[26] K. S. Trivedi and R. E. Kinicki, "A model for computer configuration
design," Computer, pp. 47-54, Apr. 1980.
[27] K. S. Trivedi and T. M. Sigmon, "A performance comparison of
optimally designed computer systems with and without virtual memory,"
in Proc. 6th Annu. Int. Symp. Comput. Arch., Philadelphia, PA,
1979.
[28] K. S. Trivedi and T. M. Sigmon, "Optimal design of linear storage
hierarchies," J. Ass. Comput. Mach., Apr. 1981.
[29] K. S. Trivedi and R. A. Wagner, "Optimal selection of CPU speed,
device capacities, and file assignments-An extension," to be
published.
[30] R. A. Wagner and K. S. Trivedi, "Hardware configuration selection
through discretizing a continuous variable solution," in Proc. 7th IFIP
Int. Symp. Comput. Performance Modeling, Measurement, and
Evaluation, Toronto, Ont., Canada, May 1980, pp. 127-142.
Robert M. Geist received the M.A. degree in computer
science from Duke University, Durham, NC
and the Ph.D. degree in mathematics from the
University of Notre Dame, Notre Dame, IN.
Currently, he is an Assistant Professor of Computer
Science at Duke University. He was formerly
an Associate Professor of Mathematics at Pembroke
State University, Pembroke, NC. He has
published in the areas of both computer science
(performance evaluation) and mathematics (algebraic
topology). His current interests are in
performance-based analysis and design of computer systems and fault-tolerant
computing.
Dr. Geist is a member of the Association for Computing Machinery, the
American Mathematical Society, and the Mathematical Association of
America.
Kishor S. Trivedi received the B.Tech. degree from
the Indian Institute of Technology and the M.S.
and Ph.D. degrees in computer science from the
University of Illinois, Urbana-Champaign.
Presently, he is an Associate Professor of Computer
Science and Electrical Engineering at Duke
University, Durham, NC. He has served as a Principal
Investigator on various NSF and NASA
funded projects and as a Consultant to industry
and research laboratories. He has published in the
areas of computer arithmetic, computer architecture,
memory management, and performance evaluation. His current interests
are in performance evaluation and fault-tolerant computing.
Dr. Trivedi is a member of the Association for Computing Machinery and
the IEEE Computer Society.
A Regular Layout for Parallel Adders
RICHARD P. BRENT, MEMBER, IEEE, AND H. T. KUNG, MEMBER, IEEE
Abstract-With VLSI architecture, the chip area and design
regularity represent a better measure of cost than the conventional gate
count. We show that addition of n-bit binary numbers can be
performed on a chip with a regular layout in time proportional to log n
and with area proportional to n.
Index Terms-Addition, area-time complexity, carry lookahead,
circuit design, combinational logic, models of computation, parallel
addition, parallel polynomial evaluation, prefix computation, VLSI.
Manuscript received May 12, 1980; revised February 3, 1981 and October
1, 1981. This work was supported in part by the National Science Foundation
under Grant MCS78-236-76 and the Office of Naval Research under
Contracts N000014-76-C-0370, NR 044-422 and N00014-80-C-0236, NR
048-659.
R. P. Brent is with the Department of Computer Science, Australian
National University, Canberra, Australia.
H. T. Kung is with the Department of Computer Science, Carnegie-Mellon
University, Pittsburgh, PA 15213.
I. INTRODUCTION
WE are interested in the design of parallel "carry lookahead"
adders suitable for implementation in VLSI
architecture. The addition problem has been considered by
many other authors. See, for example, [1], [4], [6], [7], [11],
[13], and [14]. Much attention has been paid to the tradeoff
between time and the number of gates, but little attention has
been paid to the problem of connecting the gates in an
economical and regular way to minimize chip area and design
costs. In this paper we show that a simple and regular design
for a parallel adder is possible.
In Section II we briefly describe our computational model.
Section III contains a description of the addition problem and
shows how it reduces to a carry computation problem. The
basis of our method, the reduction of carry computation to a