A Hardware Independent Parallel Programming Model  by Burrows, Eva & Haveraaen, Magne
The Journal of Logic and Algebraic Programming 78 (2009) 519–538
Contents lists available at ScienceDirect
The Journal of Logic and Algebraic Programming
j ourna l homepage: www.e lsev ie r .com/ loca te / j lap
A hardware independent parallel programming model
Eva Burrows*, Magne Haveraaen
Department of Informatics, University of Bergen, Postboks 7803, 5020 Bergen, Norway
A R T I C L E I N F O A B S T R A C T
Article history:
Available online 24 June 2009
Keywords:
Data dependency algebra
Embedding
Parallel computation
Parallel hardware architectures and chips
Parallel programming faces twomajor challenges: how to efﬁciently map computations to
different parallel hardware architectures, and how to do it in a modular way, i.e., without
rewriting the problem solving code.We propose to treat dependencies as ﬁrst class entities
in programs. Programming a highly parallel machine or chip can then be formulated as
ﬁnding an efﬁcient embedding of the computation’s data dependency into the underlying
hardware’s communication layout. With the data dependency pattern of a computation
extracted as an explicit entity in a program, one has a powerful tool to dealwith parallelism.
© 2009 Elsevier Inc. All rights reserved.
1. Introduction
The potential of parallel computing was evident more than four decades ago. Powerful parallel systems began to appear,
and the insight that programming such machines was far from obvious became clear. The expanding universe of computing
architectures was classiﬁed in many different ways, e.g., Flynn’s famous taxonomy [16]. These classiﬁcations, however, were
derived from the execution models of the systems, and said little about how to program them. As Luc Bougé pointed out [8],
massively parallel machines seemed to have developed so quickly, that little time was left to design suitable languages.
Therefore the tendencywas to reﬂect the architecture in an architecture speciﬁc programming language, typically by starting
with a wide-spread programming language and extending it with a construct for each hardware feature. This resulted in
programmingmodels that weremere subsets of the executionmodels, and could hardly survive or adapt when new systems
appeared. This led researchers to the understanding: if parallel computing is to be wide-spread, then it needs powerful
high-level parallel programming models which abstract from low-level hardware details, so that a programmer does not
need to bother about these.
Over the last decade, the Message Passing Interface (MPI) [18] and OpenMP [10] have become the two most dominant
parallel programming models. They have been taken on board by the high-performance computing communities and com-
piler vendors.MPI is a distributedmemorymodel, which comes as a library standard and is available on almost every parallel
platform. It can be used on networks of workstations as well as on parallel supercomputers. The user writes program code
for every participating (parallel) process, and manages the communication between processes by message passing. This is
time-consuming and error-prone (e.g. deadlocks, livelocks, etc.). MPI implementations are claimed portable, yet they may
need adaptations from platform to platform to be more appropriate for the target system. OpenMP is a shared memory
model of parallel hardware, and may achieve better performance than MPI in such an environment. OpenMP is based on a
set of compiler directives applied on top of a standard language (e.g. C, Fortran, OCaml). Parallel regions are identiﬁed and
annotated by the user. In general OpenMP follows a fork-join execution model. While MPI can easily be adapted to shared
*
Corresponding author.
URL: http://www.ii.uib.no/∼eva/ (E. Burrows), http://www.ii.uib.no/∼magne/ (M. Haveraaen).
1567-8326/$ - see front matter © 2009 Elsevier Inc. All rights reserved.
doi:10.1016/j.jlap.2009.06.002
520 E. Burrows, M. Haveraaen / Journal of Logic and Algebraic Programming 78 (2009) 519–538
memory parallel systems, OpenMP does not scale efﬁciently to distributed memory parallel systems. However, strategies
have been proposed for implementing OpenMP on clusters [11].
Today, computational devices are rapidly evolving into massively parallel systems. Multi-core processors are standard,
and high performance processors such as the Cell processor [12] and graphics processing units (GPUs) featuring hundreds
of on-chip processors are developed to accumulate processing power. They make parallelism commonplace, not only the
privilege of expensive high-end platforms. However, current parallel programming paradigms cannot readily exploit these
highlyparallel systems. In addition, eachhardware architectureyet again comesalongwith anewprogrammingmodel and/or
application programming interface (API). This makes the writing of portable, efﬁcient parallel code even more difﬁcult.
This paper depicts an experimental programming model, based on the theory of data dependency algebras (DDAs) and
embeddings [23]. It addresses two main issues of parallel programming: how to map computations to different parallel
hardware architectures to achieve efﬁcient code, and how to achieve this at a low development cost, i.e., without rewriting
the problem solving code. DDAs allow us to treat dependencies as explicit modules in programs. This in turn providesmeans
to argue about the underlying global communication structure and the efﬁciency of embeddings. Today’s most widespread
parallel programming paradigms encourage programmers to disregard the underlying hardware architecture in the name
of “user friendliness” – a line supported by most hardware vendors. Our proposed approach, on the other hand, gives
direct access to the hardware communication structure. Direct access to aspects of the hardware model is needed by some
architectures, e.g., graphics processors.However, ourmodel is fully portable andnot tied to any speciﬁcprocessor or hardware
architecture, due to our modularisation of the data dependencies.
1.1. Previous work
Miranker andWinkler introduced data dependency graphs and space–time embeddings in [32]. In [20,23,21] Haveraaen
took the basic idea further with formalising data dependency algebras (DDAs) and the constructive recursive approach (CR).
With CR the computation is separated from its dependencies, allowing both to be programmed and manipulated indepen-
dently of each other. This also entails a separation between local dependencies of a function, and the global communication
pattern of the computation. This idea hasmany similaritieswith the structural blanks approach of Cˇyras, and both techniques
are presented and compared in [15].
Based on the CR framework, a prototype compiler was implemented to provide a simple way to generate parallel code
from high-level DDA descriptions. The experiments proved this a viable way to produce parallel code for high performance
computing architectures [40].
1.2. Contribution and overview of the paper
This paper contributes to the deeper understanding of the role that DDAs can play in parallel computing. It depicts a DDA-
based hardware independent parallel programming model. It presents a DDA-based repeat construct as a new language
feature with appealing computational mechanisms. It shows that spatial placements of computations can be controlled at a
high-level using DDAs, and it gives an example of how modern parallel hardware architectures, e.g. GPU, can be deﬁned in
terms of space–time hardware DDAs.
In the following section we present the programming model. Then we give a formal introduction to DDAs, and through
some examples point out their ﬂexibility. In Section 4 we motivate, through the example of a dependency-driven repeat
construct, why DDA-based programming is a viable approach. The second half of the paper illustrates the idea of our
programming model through a case-study of a non-trivial parallel computation, the bitonic sort: in Section 5 the DDA of the
bitonic sorting network and the associated computations are presented; in Section 6 its embeddings into the hardware DDAs
of two different parallel architectures (the hypercube and a GPU) are shown. Finally, Section 7 gives an overview of related
works, before the paper concludes with a summary and a discussion of future work.
2. Parallel programming with DDAs
Here we sketch our programming model where we utilize the inherent ﬂexibility of the CR approach and use DDA
abstractions when dealing with computations to be mapped to parallel architectures. DDAs provide a high-level formalism
to deﬁne computations, hardware layouts as well as embeddings. Fig. 1 is central in this conceptual overview.
Computations exhibit data dependency patterns which can be captured by DDAs (e.g. DDA1 in Fig. 1). This allows us to
formulate the computation as expressions on the points of the DDA, such that dependencies between computational steps
(DDA points) become explicit entities in the expression. A hardware architecture’s space–time communication layout, its
API, can also be captured by special space–time DDAs, deﬁned from hardware DDAs (e.g.HDDA1). A hardware DDA describes
the static spatial connectivity of the hardware architecture. Mapping a computation to an available hardware resource then
becomes a task of ﬁnding an embedding of the computation’s DDA into the space–time of the hardware DDA. This can be
deﬁned at a high-level using DDA-embeddings.
A DDA-based compiler needs: the computation in terms of a DDA1 and the expressions over DDA1, the space–time of the
hardware HDDA1, and the embedding E1 from DDA1 to this space–time. Then the compiler can generate the code for the
E. Burrows, M. Haveraaen / Journal of Logic and Algebraic Programming 78 (2009) 519–538 521
Fig. 1. A hardware independent parallel programming model.
chosen hardware architecture. This could for instance be CUDA code for GPUs, vectorized C/C++ for the CELL/BE, and so on.
When the computation needs to be ported to more architectures, then these architectures need to be deﬁned as hardware
DDAs (e.g. HDDA2,… , HDDAn). For each new hardware architecture, an embedding into its space–time DDA is needed.
For example En from DDA1 to the space–time of HDDAn. The expressions on DDA1 remain unchanged irrespective of the
available hardware resource.
Since there is no need to rewrite the program solving code, the computation is hardware independent. Also, there is no
need for the compiler to do advanced parallelizing program analysis, as the embedding gives all the information needed
for efﬁcient parallel code generation. Alternative embeddings can easily be tested in search of optimal solutions, since each
embedding is deﬁned explicitly, yet at a high, easy to manipulate, level.
The DDA-based programming model allows other kinds of software re-usability as well. Different computations may
exhibit the same dependency pattern (DDA1 in Computations 1 and 2), in which case all the embeddings deﬁned for DDA1
onto the different architectures can be reused. If a new computation exhibits a new dependency pattern (e.g. DDA2 in
Computation3), thepreviouslydeﬁnedspace–timehardwareDDAscanbe reusedwhendeﬁningnewembeddings, illustrated
by dashed lines in Fig. 1.
A discussion of embeddings and ways to combine DDAs in the construction of DDAs can be found in [2,22].
3. Data dependency algebras
Theconceptofdatadependencyalgebra isequivalent to thatof adirectedmultigraph.MirankerandWinkler [32] suggested
that program data dependency graphs can abstract how parts of a computation depend on data supplied by other parts. This
is a basis for parallelizing compilers, see e.g. [43]. In the formalism of DDAs, we talk about points P, branch indices B, requests
and supplies. At each point (nodes of the graph), request branches (arcs) lead to points fromwhich data is required, whereas
supply branches (the opposite arcs) lead to points that are supplied with data from the point. So we get a set of request
arcs rg ⊆ P × B and supply arcs sg ⊆ P × B and isomorphism r˜ : rg → sg with inverse s˜ : sg → rg . We will use the shorthand
rg(p, b) for (p, b) ∈ rg and so forth.
A branch between two points of the graph stands for a request-supply connection, whereas branch indices from B are
local at each end of the branch: one identiﬁes the request and one the supply direction. Obviously wewant to keep B as small
as possible, but different points may have a varying number of request and supply arcs, motivating the need for predicates
rg and sg . Input points typically do not have any request branches, since the computation is supposed to start up from these.
Likewise, output points typically do not have associated supply branches, since the computation is supposed to cease on
these.
When doing computations on DDAs we need access to component-wise information about the connections. Therefore
we split the request isomorphism into a pair of functions, r˜(p, b) = 〈rp(p, b), rb(p, b)〉 where rp : rg → P and rb : rg → B, and
likewise the supply isomorphism s˜(q, d) = 〈sp(q, d), sb(q, d)〉 where sp : sg → P and sb : sg → B.
For a guarded pair of point and branch index, the ﬁrst component of requests, rp, identiﬁes the point the branch leads
to (i.e. where data is requested from). Likewise, for a guarded pair of point and branch, the ﬁrst component of supplies, sp,
identiﬁes the point where the branch leads to (i.e. where data is supplied to).
The second component of both requests and supplies, the branch-back functions (rb and sb), identify the branch index of
the opposite direction along the same branch. The request branch-back rb gives the branch index at the supplying end of the
arc. Similarly, the supply branch-back sb gives the branch index at the requesting end of the arc.
522 E. Burrows, M. Haveraaen / Journal of Logic and Algebraic Programming 78 (2009) 519–538
Deﬁnition 3.1 (DDA). A data dependency algebra is given by a 4-tuple
D = 〈P,B, req, sup〉, where:
(1) P is a set of points,
(2) B is a set of branch indices,
(3) req is the data request consisting of rg ⊆ P × B, rp : rg → P, rb : rg → B,
(4) sup is the data supply consisting of sg ⊆ P × B, sp : sg → P, sb : sg → B,
such that for all p : P and for all b : B the following axioms hold:
sg(p, b) ⇒ rg(sp(p, b), sb(p, b)), rg(p, b) ⇒ sg(rp(p, b), rb(p, b)),
sg(p, b) ⇒ rp(sp(p, b), sb(p, b)) = p, rg(p, b) ⇒ sp(rp(p, b), rb(p, b)) = p,
sg(p, b) ⇒ rb(sp(p, b), sb(p, b)) = b, rg(p, b) ⇒ sb(rp(p, b), rb(p, b)) = b.
The axioms express the duality of requests and supplies, which follow from the isomorphisms between rg and sg .
This duality and the symmetric requirements of the deﬁnition also ensure the interchangeability of req and sup. Thus we
can reverse the data ﬂow by swapping req and sup. This technique is shown in Example 3.4 where we reverse a butterﬂy
DDA.
Deﬁnition 3.2 (Countable DDA). A countable data dependency algebra is a DDA D = 〈P,B, req, sup〉 where P is countable and B
is ﬁnite.
A DDA has cycles if repeated application of requests from some starting point p : P takes us back to p. If a DDA has no
cycles it is acyclic.
Hardware DDAs typically contain cycles, whereas the space–time unfoldings and problem-related DDAs do not.
In the examples of this paper, we show only acyclic, countable DDAs. We also happen to use the same local index at both
ends, thus rb(p, b) = b and sb(p, b) = b, for all relevant points p and branch indices b. We will therefore skip the deﬁnition of
these two functions for the example DDAs in the rest of this presentation.
The examples will be presented in program code style, hence the mathematical notations of the deﬁnition have been
adapted, e.g.,rg stands for rg , etc. Theprogramcodes and language constructs used inour examples are inspiredby functional
programming notations. Our data type declarations typically declare tuple types: a type name, its components, explicit
projection functions and a constructor for the data type. An example declaration T = p1 Nat * p2 Natwould mean
that we introduce a type T with a data structure consisting of two natural numbers.1 The two projection functions are
p1:T→Nat and p2:T→Nat. The constructor is named T with proﬁle T:Nat,Nat→T. Giving a constructor the same
name as the type it constructs is common in many programming languages. The constructor and the projections are related
by the following axioms, where n1 and n2 are of type Nat and t is of type T.
T(p1(t),p2(t)) = t
p1(T(n1,n2)) = n1
p2(T(n1,n2)) = n2
This pattern applies for any set of names and anynumber of components in the data structure.Wewill call any set of functions
that satisfy such requirements with respect to a type T its projection functions and its constructor.
Wheneverwe declare a partial functionwewill associate a guardwith it. For instance,rg guardsrp in theDDA. A guard is
a predicate that identiﬁes for which arguments the corresponding function returns awell-deﬁned value [24]. A judicious use
of guards – managed by the compiler – allows, for every expression, the automatic establishment of a (syntactic) condition
that ensures the well-deﬁnedness of the entire expression. This avoids syntactic clutter when writing code, as these implicit
conditions can be assumed checked by the compiler for every expression, as long as the guards have been deﬁned.
Often we will include a data invariant DI predicate with a type deﬁnition. The data invariant will restrict the range of
values that are meaningful for the type. For instance, we may limit the elements of T above to natural numbers less than
1000 by deﬁningDI(t) = (p1(t)<1000) && (p2(t)<1000). Thiswill implicitly induce a corresponding guard
on the injection functions.
Let us illustrate the DDA concept and the use of this coding style by deﬁning the butterﬂy network. The butterﬂy network,
see Fig. 2, appears in many divide-and-conquer algorithms, such as the Fast Fourier Transform.
In the ﬁgures illustrating DDAs, the arrows go in the direction of the requests. Data will ﬂow in the opposite direction of
the arrows, i.e., along the supply directions.
Example 3.3. A butterﬂy DDA of height h ∈ N, DBFh, is deﬁned by a DDA sort BFh = row Nat * col Nat with
DI(p) = (row(p)<= h) && (col(p)<2h). Let the branch indices B={0,1}. Deﬁne the request and supply
functions by:
rg(p,b) = 0<row(p)<=h
1 Natural numbers in this presentation is the set N = {0, 1, 2, . . .}.
E. Burrows, M. Haveraaen / Journal of Logic and Algebraic Programming 78 (2009) 519–538 523
rp(p,b) = if (b=0) then BFh(row(p)-1,col(p))
else BFh(row(p)-1,flip(row(p)-1,col(p)))
sg(p,b) = 0<=row(p)<h
sp(p,b) = if (b=0) then BFh(row(p)+1,col(p))
else BFh(row(p)+1,flip(row(p),col(p)))
where flip(i,n) ﬂips the ith bit (where bit 0 is the rightmost, least signiﬁcant bit of n) in the binary representation of
the integer n.
In Fig. 2 the layout of the butterﬂy graph is given by the row and col projections to place the points in a grid. Inputs are
assumed to reside on the points of the bottom row, and the computation proceeds upward. The ﬂow of the data throughout
thewhole computation is deﬁned explicitly by the supply functions. Consecutive points have theirrowprojections increased
by 1 at every step. When data is passed upright, along branch index 0, the col projections of the consecutive points are
preserved. And when data is passed across, along branch index 1, the col projections of consecutive points differ in their
binary representation: the bit pointed to by the row projection ﬂips. The deﬁnition of the supply function is based on this
observation. The deﬁnition of the request function is dual.
Example 3.4. When we have to deal with a butterﬂy pattern where the data ﬂow is reversed, as shown in Fig. 3, a reversed
butterﬂy DDA of height h ∈ N, DRBFh, can be deﬁned from the previous one. We preserve the DDA sort BFh, and deﬁne
request (rg, rp, rb) and the supply (sg, sp, sb) functions by simply swapping the two sets of operations:
rg(p,b) =sg(p,b)
rp(p,b) = sp(p,b)
sg(p,b) = rg(p,b)
sp(p,b) = rp(p,b)
This technique of providing alternate requests and supplies for a DDA allows us to reuse code when building new DDAs. We
use similar techniques to change only the request and supply guards, with an induced limitation on the data invariant, for a
DDA when we want to use a sub-DDA of an existing DDA for a computation.
3.1. DDA projections
We have used the projection functions row:BFh→Nat and col:BFh→Nat to deﬁne how DBFh points are mapped
into a two dimensional spatial grid for layout purposes, see Figs. 2 and 3. In general, for any given DDA, the points can be
placed in many different ways in a grid. We can simply provide new projections from the point sort to give a new layout.
Let us illustrate this by changing the layout of the reversed butterﬂy DDA of height 4, see Fig. 4. First consider projections:
altRow, altCol:BFh→Natwith:
altRow(p)=row(p)
Fig. 2. Butterﬂy DDA of height 4. The col projections of BF4 points are presented as binary numbers.
524 E. Burrows, M. Haveraaen / Journal of Logic and Algebraic Programming 78 (2009) 519–538
Fig. 3. Reversed butterﬂy DDA of height 4.
altCol(p)=ShRh(col(p),1)
The shift right function ShRh:Nat,Nat→Nat is deﬁned such that ShRh(n,i) returns the value of a cyclic shift to the
right on the h-bit binary representation of n by i positions.
We can also deﬁne placements that will result in a repetitive network topology, see Fig. 5. Consider
shuffleRow,shuffleCol: BFh→Nat such that shuffleCol depends on both col and row:
shuffleRow(p)=row(p)
shuffleCol(p)=ShRh(col(p),row(p))
This placement is a variation of the shufﬂe network [41], an important interconnection pattern for parallel processing. Given
an array of elements, the perfect shufﬂe of the array elements is achieved by interlacing the ﬁrst half of the array with the
elements of the second half. This can be viewed as shufﬂing a deck of cards: ﬁrst dividing them in half, and then shufﬂing the
two halves perfectly. A shufﬂe network is a repetitive network topology based on the perfect shufﬂe pattern. Our spatial grid
representation differs somewhat from a standard shufﬂe network. In our presentation, each point has one result element
that is copied to two separate branches when the shufﬂe takes place. In the classical shufﬂe network, each node has two
output values, corresponding to a pair of our points, and our two branches (with the copies of the same element) aremerged
into one channel.
We may use such projections not only when laying out a DDA on paper, but also as a means of placing computations
(at each point) on processors in a network, or choosing placements in silicon for Field Programmable Gate Arrays (FPGAs).
Fig. 4. Reversed butterﬂy DDA of height 4 with spatial grid representation deﬁned by altRow and altCol. The way butterﬂy branches cross each other
is different from the one in Fig. 3, yet the dependency is the same.
E. Burrows, M. Haveraaen / Journal of Logic and Algebraic Programming 78 (2009) 519–538 525
Fig. 5. Reversed butterﬂy DDA of height 4 with spatial grid representation deﬁned by shuffleRow and shuffleCol.
Projections make DDAs very ﬂexible and easy to map to different network topologies. As shown, this can be handled on a
high-level by deﬁning new projections from the DDA sort.
3.2. Space–time DDAs
FollowingMiranker andWinkler wemay control the parallel execution of a computation by embedding the computation
into the space–time of a parallel machine. For instance, if we interpret the row projection as time and col projection as
space, we obtain a space–time layout of a parallel computation on a butterﬂy. Then computations across space (different
columns) for the same time-step (same row) can be done in parallel. However, there are certain consistency requirements
that space–time projections must comply with.
We ﬁrst introduce some sorts. Let Space be a sort. Space may represent the processors of a parallel machine. Let Time
be a countable sort, e.g., the natural numbers, with a total order <, a starting point zero, and an increment operator
next:Time→Time, such that t<next(t) for all t:Time.
Deﬁnition 3.5 (Space–time DDA). A space–time DDA (STA) is a DDA D=<P,B,req,sup> with projections
space:P→Space andtime:P→Time and constructor functionP:Space,Time→P, such that on top of the general
consistency requirements imposed on projections and constructors, the following also holds:
time(rp(p,b))<time(p)
for all relevant p:P and b:B.
Relevant means as allowed by the guards.
Deﬁnition 3.6 (Finite span STA). A ﬁnite span space–time DDA is a space–time DDAwhere the type Space is ﬁnite, and there
is a natural number tspan such that at most tspan applications of next takes the value time(rp(p,b)) to the
value time(p), for all relevant points p and branch indices b.
Space–time DDAs can abstract hardware communication layouts projected over time. They can be deﬁned fromhardware
DDAs. A hardware DDA is a DDA that describes the communication structure of a machine architecture, layout for an FPGA,
etc. We will exemplify space–time algebras in Section 6, where the hypercube and the API of a graphic processor are
presented.
4. Computing over DDAs: the repeat statement
Wemotivate this sectionwith an example. LetA be an array typewith index-sortP, element-sortE and a partial indexing
operation[_]:A,P→Eguardedbyig. Further, letBbea typewithconstants0:B,1:Bandapartial functionrp:P,B→P
guarded by rg. Assume that we have an array V:A such that the data values in V are related to each other by
V[p] = if cond(p) then min(V[rp(p,0)],V[rp(p,1)])
else max(V[rp(p,0)],V[rp(p,1)])
526 E. Burrows, M. Haveraaen / Journal of Logic and Algebraic Programming 78 (2009) 519–538
where min:E,E→E and max:E,E→E are minimum andmaximum functions on elements, andcond is some condition
on indicesp:P. Recall that the guards (hereig andrg) ensure that expressions are only appliedwhen they arewell deﬁned,
thus the equation is not considered if ig(V,p) does not hold for some p. But what if the right-hand side expression was
well-deﬁned in this case – it could be used to deﬁne the value of V at this point. We propose a repeat statement that will
allow us to do this.
We focus the repeat statement on the array data type. Array types can be considered functions mapping from an index
type to an element type. Due to our focus on how to compute the contents of an array step by step, we prefer to use the
data oriented array concept rather than the corresponding map. For computability reasons, we will limit the index to be a
countable data type, i.e., either ﬁnite or isomorphic to the set N.
Arrays can represent any structured data type such as lists and trees. A structured data type can be thought of as implicitly
indexed by the traversal pattern used to access a given element. For a list, e.g., we can encode the traversal as the number of
tail operations before the head operation that accesses a given element, and use this number as an index.
Deﬁnition 4.1. Let D=<P,B,req,sup> be a DDA and S a selection of points from P. Then a point p:P is 1-reachable
from S along D, if all relevant rp(p,b) belong to S.
Deﬁnition 4.2. Let D=<P,B,req,sup> be a countable DDA, V be an array indexed by P with element type E, e an
expression of type E such that each occurrence of the array V in e has the form V[rp(p,b)] for b an expression of
type B.
LetS be the set of pointswhereig(V,p) holds. Then thee-derived 1-extension of V along D is the arrayextD,e(1,V),
s.t., ig(extD,e(1,V),p) holds exactly when p is in S or p is 1-reachable from S along D, and
extD,e(1,V)[p] = if (p in S) then V[p] else e
The e-derived k-extension of V along D is the array extD,e(k,V), for the remaining k ∈ N, deﬁned recursively by:
extD,e(0,V) = V
extD,e(k,V) = extD,e(1,extD,e(k-1,V))
The e-derived extension of V along D is the array extD,e(V) deﬁned as the limit of extD,e(k,V) as k grows.
The array extD,e(V) is well deﬁned for any countable P. Consider any value p:P. For a suitably large k, either
ig(extD,e(k,V),p)holds, inwhichcaseig(extD,e(V),p)holds, andextD,e(V)[p] = extD,e(k,V)[p]; or
ig(extD,e(k,V),p) does not hold, and there is no larger k that will make it hold, thus ig(extD,e(V),p) does not
hold.
If there are any circularities in theDDA, suchpointswill not be givenavaluewith this extension (unless they arepredeﬁned
inV). The circularitywill prevent all requiredV[rp(p,d)] to be deﬁned, since some of these (indirectly) depend onV[p]
which has not been computed. Also, points which do not require any inputs (the rg does not hold) will not be given a value,
unless V already deﬁnes them.
The following lemma states that the order in which the 1-reachable points are chosen has no effect on the ﬁnal
result.
Lemma 4.3. Let D=<P,B,req,sup> be a countable DDA, V be an array indexed by P with element type E, and e an
expression of type E such that all occurrences of the array V in e has the form V[rp(p,b)] for b an expression of
type B.
Assumep:P is a 1-reachable point from the set containing all pointswhereV is deﬁned, and letV′ be the arrayV extended
with the value of e at p. Then the e-derived extension of V along D equals the e-derived extension of V′ along D.
Proof First note that the e-derived (k + 1)-extension of V along D always contains the e-derived k-extension of V′ along
D. Then note that the e-derived k-extension of V′ along D always contains the e-derived k-extension of V along D. Thus at
the limit when k grows, both extensions will be equal. 
Deﬁnition 4.4 (Syntax of the repeat statement). Let V be an array with index type P and element type E. And let
D=<P,B,req,sup> be a countable DDA. Then the repeat statement has the syntax
repeat p:P along D from V in e
where repeat, along, from, and in, are keywords, imposing the following syntactic and type requirements:
• emust be an expression of type E.
• p is a freshly declared variable of type P, with scope e.
• All occurrences of the array V in emust have the form V[rp(p,b)], where b is an expression of type B, and rp is
the request from the DDA D.
E. Burrows, M. Haveraaen / Journal of Logic and Algebraic Programming 78 (2009) 519–538 527
Note that the use of the rp-function in the repeat statement indicates that there is a data dependency between the
computational steps of the application. Exactly what this data dependency is, is given by the DDA D. This gives a clean
interface between the application e and its possible dependency patterns.
Deﬁnition 4.5 (Semantics of the repeat statement). The meaning of a syntactically and type correct repeat statement with
countable data dependency D=<P,B,req,sup>,
repeat p:P along D from V in e,
is the e-derived extension of V along D.
Example 4.6. The instantiation of the repeat statement for our original example, using the reversed butterﬂy DDA of
height h, is:
repeat p:P along DRBFh from V in
if cond(p)
then min(V[rp(p,0)],V[rp(p,1)])
else max(V[rp(p,0)],V[rp(p,1)])
A repeat statement becomes an imperative statement by placing the computed values into V itself rather than returning
a new array. The statement itself is a closure statement, computing the missing pieces of V – as much as possible – using e.
The deﬁnition itself does not imply any speciﬁc computational algorithm.
Onepossible algorithmwould be to take the standard approach to recursion, asweﬁnd it inmost programming languages.
The implementation would start with a point, and trace down the requests until we end up at known values. Then we may
compute the value one step above, and backtrack the computation one step. This traces the dependency by a tree-directed
traversal of the points, giving rise to exponential (in the number of branches) computing time per starting point. The use of
memoisation techniques may reduce such computational work signiﬁcantly.
Another simple idea is to repeatedly try all points p:P, computing a value every time the conditions for a one-step
computation are met. This process terminates when there are no 1-reachable points from the set of (currently) computed
values. Again this gives an exponential running time.
Instead, wewill provide several computing strategies that explicitly utilize the underlying data dependency and therefore
has controlled resource usage.
4.1. Dependency-driven computation
Given a repeat statement we may exploit the supply component of the DDA to achieve an efﬁcient computation. We
traverse the opposite graph of the requests, using the supply direction to get a dependency-driven computation.
Deﬁnition 4.7 (Dependency-driven computation). Given a repeat statement repeat p:P along D from V in e
with countable data dependency D=<P,B,req,sup> where the element type of V is E. The dependency-driven com-
putation of the repeat statement will gradually ﬁll in a fresh array V′ with index type P and element type E. V′ contains all
the values that have been computed up to a certain step during the computation. Initially V′ is a copy of V together with its
guard.
The algorithm uses a set F and a support array M which contain intermediate information about the computation, and
both will be updated at every step of the computation:
• The set F⊆P is the set of indices where V′ contains a value, but where this value has not been propagated in the DDA.
Initially F is the set of all indices where V is deﬁned.
• The array M has index type P and as elements arrays with index type B and element type E. In M[p][b] the value
that will be requested by point p via branch-index b is stored, until the value at p has been computed. Initially M is
an empty array, i.e., its guard never holds.
Weneedamodiﬁedversione′ of e,whereall occurrencesof thepatternV[rp(p,b)]havebeenreplacedbyM[p][b].
The computation repeats the following steps until F becomes empty:
• Choose a point q∈F and remove it from F.
• For all branches d:B such that sg(q,d) holds do:
· Let p=sp(q,d)
· If ig(V′,p) does not hold, do:
– let M[p][sb(q,d)] = V′[q], creating the sub-array if needed,
– if p has received values from all branches b where rg(p,b) holds, then add p to F, compute e′ and insert
the value in V′ at point p, and empty the sub-array M[p].
When the intermediate F becomes empty, M can be freed completely, and the array V′ is the result.
528 E. Burrows, M. Haveraaen / Journal of Logic and Algebraic Programming 78 (2009) 519–538
Proposition 4.8. The dependency-driven computation of repeat p:P along D from V in e with countable
DDA D=<P,B,req,sup> computes the e-derived extension of V along D.
Proof In its evolving computation of V′, the algorithm works by taking a point p 1-reachable from the deﬁned points of V′,
computing a value for it, and adding this value to V′. Since this value is in the e-derived 1-extension of V′ along D, we know
by Lemma 4.3 that the e-derived extension of the extended array V′ along D will give the required semantics to the repeat
statement.
To see that the right value is computed for a point p note that M[p][b] = V′[rp(p,b)], for all b:B where
rg(p,b) holds, since, by the DDA axioms in Deﬁnition 3.1, sp(q,d) = p and sb(q,d) = b when q=rp(p,b)
and d=rb(p,b). Thus e′ will compute the required value.
The algorithmwill stop when F becomes empty, but this can only happen when there is no point p 1-reachable from the
deﬁned points of V′, hence we are at the limit where V′ equals the e-derived extension of V′ along D. In other words, the
computation stops exactly when it has computed the e-derived extension of V along D. 
Note that in practice the computation will not terminate when an inﬁnitely large part of P can be computed. But at the
limit of this computation, the e-derived extension of V′ along Dwill have been computed.
In this way the algorithm traverses P from the deﬁned indices of V, computing V′[p] when all of its dependents have
been computed, until we have computed as much of V′ as can be deﬁned given the information we have.
Often we may not be interested in the entire array V, but only in a few of its computed elements. We then retain values
of V′ for those indices only. This can be handled in a more elaborate repeat statement which allows identiﬁcation of output
points.
The run-time cost of the entire dependency-driven computation is of the order of the size of the computed elements of
V′, since every point passes throughF atmost once, andwe need to compute once for every such point. Thus the data-driven
computation gives us control of the computation time of the repeat construct.
The size of the intermediate data structures is not under control of this algorithm. The order in which the points of F are
chosen will inﬂuence the number of points to be kept in F and the number of intermediate arrays M must hold at any one
time. Thus the maximal space used by the computation is not being controlled, implying that the memory requirements are
not known.
4.2. Space-time controlled repetition
The dependency-driven computation of the repeat statement is a sequential algorithm, dealing with one point at a time.
In many circumstances the computation for different points are independent of each other, therefore it should be possible
to introduce some parallelism into the computation mechanism. We will do this by exploiting the space–time projections
of the DDA points.
When the underlying DDA can be seen as a space–time DDA, with the corresponding space and time projections (see
Deﬁnition 3.5), we have control over the space–time execution of the repeat statement.
Example 4.9. To illustrate this, consider the butterﬂy DDA DBFh of height h (Example 3.3) together with some DDA-based
repeat statement which deﬁnes for each relevant p:BFh how V[p] is to be computed. We consider the row and col
projections as time and space. One can easily verify that the butterﬂy dependency with these projections satisﬁes the
consistency requirements of a space–time DDA.
Assume that initial values are deﬁned for the points on the bottom row. Then all new points in the derived 1-extension
of V can be computed in parallel, independently of each other. In fact, all values at BFh points within the same row are
independent of each other, and can be computed in parallel.
We can step through the computation of the repeat statement from time-step zero and upwards, using the in-
crement operator next to move from one time-step to the next. At each time-step t:Time, we choose all the points
BFh(t,s)∈F for all relevant s:Space. This allows the parallel computation of all elements at the given time-step t,
since the requirements assert that the computation for each spatial index is independent of each other.
The space–time algorithm below is formulated for shared memory, i.e, where every processor may write into another
processor’s memory.
Deﬁnition 4.10 (Space-time controlled repetition). Consider the repeat statementrepeat p:P along D from V in
e with ﬁnite span space–time DDA D=<P,B,req,sup> where the element type of V is E. Let Time = Nat be used
as the time type, and tspan:Nat be the maximum number of time-steps spanned by a request, i.e., tspan=max_p,
b(time(p)-time(rp(p,b)). Deﬁne the type T = {0,1,…,tspan-1}.
The space–time algorithm to compute the repeat statement will gradually ﬁll in a fresh array V′, using a support array M
which contains intermediate information about the computation. Both will be updated at every step of the computation.
E. Burrows, M. Haveraaen / Journal of Logic and Algebraic Programming 78 (2009) 519–538 529
• The array V′ has index type P and element type E. It contains all values that have been computed so far. The array is
distributed, s.t., V′[p] is at processor space(p) for a point p:P. Initially V′ is a copy of V together with its guard.
• The array M has index type Space*T and as elements arrays with index type B and element type E. M is used to
store, as they become available, all the element values a point P(s,t), for s:Space and t:Time, requests from
branches b:B, until the point has been computed. Let % be the modulus operator. Then the array is distributed, s.t.,
M[s,t%tspan] is at processors. InitiallyM is an array of empty sub-arrays, i.e., the guardig(M[s,t%tspan],
b) does not hold for any s:Space, t:Time and b:B.
We also need a modiﬁed version e′ of e, where all occurrences of the pattern V[rp(p,b)] have been replaced by
M[space(p),time(p)%tspan][b]. For every time-stept, starting atzero, the computation repeats the following
steps in parallel for all s:Space:
• Let p=P(s,t).
• If ig(V′,p) does not hold, M[s,t%tspan] is not empty, and for all b:B, ig(M[s,t%tspan],b) holds
whenever rg(p,b), then compute e′, and insert its value into V′[p].
• Empty M[s,t%tspan].
• Synchronise with the other processes.
• If ig(V′,p) holds,
then for all relevant branches d:B do:
− Let q=sp(p,d).
− Let M[space(q),time(q)%tspan][sb(p,d)] = V′[p].
• Synchronise with the other processes.
These steps are to be repeated until for all p:Pwhere ig(V,p) holds it is the case that time(p)<t (all initial values
have been visited), and all sub-arrays in M are empty (nothing more can be computed).
After this the intermediate M is cleared, and the array V′ is the result of the computation.
With the given assumptions, this algorithm will, time-step by time-step, in parallel compute exactly the e-derived
extension of V along D.
Proposition 4.11. Given the appropriate conditions, the space–time controlled computation of the repeat statement
repeat p:P along D from V in e with DDA D=<P,B,req,sup> computes the e-derived extension of V
along D.
Proof The space–time controlled computation covers all points p in the region P(s,t) for all relevant space–time
coordinates (s,t).
An expression e′ at a point p is only computed when V′[p] should be assigned such a value. By the axioms of the DDA,
it is clear that the new value e′ computed using M[s,t%tspan] is the value of that point in the e-extension.
Thus when the computation ends, V′ contains the e-derived extension of V along D. 
This computational style can be used also on a sequential processor. In that case, a loop for each spatial index must be
run between every synchronisation step.
It is also possible to provide a message-passing space–time controlled computation of the repeat statement. This was
used in our prototype compiler [40]. The shared memory version given here is new, and is more appropriate for the current
many-core architectures. Note that we can control locality by selecting space-projections that let requests and supplies
communicate only with neighbouring processors (whenever possible).
The space–time controlled repetition gives full control of the space–time resources. Let S be the size of the ﬁnite sort
Space, then atmostS*tspannodeswill be active at any timeduring the computation. Thus by controlling the space–time
projection we can control both time (number of time-steps needed for the computation) and memory usage. This entails
full control over resource usage.
5. Bitonic sort DDA
We will present bitonic sorting as a fully worked example of DDA-based programming. The bitonic sort is a well studied
parallel sorting algorithm [17], ﬁrst presented by Batcher [5]. It sorts n elements in parallel in(log2n) time. The basic opera-
tion consists of repeatedly merging balanced bitonic sequences of increasing size into ascending or descending sequences. A
sequence is called bitonic when it is the juxtaposition of an ascending and a descending sequence. It is balanced if both parts
have the same size. In the beginning we have bitonic sequences of length two, which are combined to bitonic sequences of
length four, and so forth. In the penultimate step we have a bitonic sequence of the size of the input sequence, and in the
ﬁnal step this last bitonic sequence is merged into a sorted sequence.
The data dependency of a bitonic sorter can be seen as a combination of several reversed butterﬂy DDAs of different
height, see Fig. 6. At the top is one reversed butterﬂy of size h, underneath two reversed butterﬂies of size h − 1, and so forth.
530 E. Burrows, M. Haveraaen / Journal of Logic and Algebraic Programming 78 (2009) 519–538
At the bottom are 2h−1 reversed butterﬂies of height 1. Each sub-butterﬂy corresponds to a bitonic merge. Bitonic sorting is
deﬁned then as min/max functions on the points of this DDA.
Example 5.1. We deﬁne the bitonic sort DDA for 2h inputs, h ∈ N, DBSh, with the DDA sort BSh = sbf Nat * row
Nat *col Nat, with data invariant:
DI(p) = (0<sbf(p)<=h) &&
((row(p)<sbf(p)) || (row(p)<=sbf(p) && sbf(p)=1)) &&
(col(p)<2h)
The ﬁrst projection identiﬁes the reversed sub-butterﬂy’s height, therow projection is the local row number in the reversed
sub-butterﬂy, while col gives the global column number. For the points which belong to two reversed sub-butterﬂies,
we use the projections corresponding to the lower reversed butterﬂy. The request and supply functions for branch indices
B = {0,1} become:
rg(p,b) = !(row(p)=1 && sbf(p)=0)
rp(p,b) =
if (((sbf(p)>1) and (0<=row(p)<=sbf(p)-2)) or
((sbf(p)=1) and (row(p)=0))) then
if (b=0) then BSh(sbf(p),row(p)+1,col(p))
else BSh(sbf(p),row(p)+1,flip(row(p),col(p)))
else if ((sbf(p)>1) and (row(p)=sbf(p)-1)) then
if (b=0) then BSh(sbf(p)-1,0,col(p))
else BSh(sbf(p)-1,0,flip(row(p),col(p)))
sg(p,b) = !(row(p)=0 && sbf(b)=h)
sp(p,b) =
if (!(sbf(p)=h) and (row(p)=0)) then
if (b=0) then BSh(sbf(p)+1,sbf(p),col(p))
else BSh(sbf(p)+1,sbf(p),flip(sbf(p),col(p)))
else if (((sbf(p)=1) and (row(p)=1)) or
(sbf(p)>1) and (1<=row(p)<=sbf(p)-1)) then
if (b=0) then BSh(sbf(p),row(p)-1,col(p))
else BSh(sbf(p),row(p)-1,flip(row(p)-1,col(p)))
The deﬁnition glues together reversed butterﬂies of increasing height on top of each other. Alternatively, we might have
pasted them together with an extra communication step between two levels of reversed butterﬂies. This would make the
expressions for mapping the bitonic sort DDA into a shufﬂe network simpler.
The computations to be performed at each node are deﬁned next. The array V represents the result of the entire compu-
tation, i.e., data for all the steps of the bitonic sort. Let vi, i∈{0,1,…,2h − 1} be the input elements to be sorted. They reside
on the bottom points of the DDA graph: V[BSh(1,1,i)]=vi. We deﬁne a condition whether we should do a minimum
or a maximum:
cond(p) = (bit(sbf(p),col(p)) == bit(row(p),col(p))
Here bit:Nat,Nat→Nat is a function, s.t., bit(i,n) extracts the ith bit from the binary representation of the
number n.
We can now deﬁne the expression V[p] to be computed for all p:BSh by:
repeat p:BSh along DBSh from V in
if cond(p)
then min(V[rp(p,0)],V[rp(p,1)])
else max(V[rp(p,0)],V[rp(p,1)])
To understand the computation deﬁned by the repeat statement recall that the expressions are guarded, so they will stay
within the limits of theDDADBSh. The explanation below follows the imperative version of the algorithm fromDeﬁnition 4.7
(dependency-driven computation).
Accordingly, each point p is associated with a memory cell M[p][b] for each b ∈ B, and the sub-expressions
V[rp(p,b)] are replaced with M[p][b]. The computation is started by fetching the data vcol(p) and using this as the
value of V[p] for each point p at the bottom row. The value V[p] is then supplied to the memory cells M[sp(p,b)]
[sb(p,b)] for every b ∈ B and point p on the bottom row.
E. Burrows, M. Haveraaen / Journal of Logic and Algebraic Programming 78 (2009) 519–538 531
Fig. 6. Bitonic sort DDA for 23 inputs. Thecol projections of BS3 points are presented in binary at the bottom. Thesbf androw projection are presented
as pairs on the left. At each node, either a minimum (“m”) or a maximum (“M”) is computed.
Fig. 7. Example of sorting 23 inputs. On the bottom nodes reside the actual initial data values to be sorted. The rest of the nodes show the values V[p]
computed at each time-step. On the top row the initial data values have become sorted.
532 E. Burrows, M. Haveraaen / Journal of Logic and Algebraic Programming 78 (2009) 519–538
The computation proceeds, for each point p where the required memory cells have been given a value, to compute the
expression V[p]. This value is then supplied to the memory cells M[sp(p,b)][sb(p,b)] for every b ∈ B.
The result of the computation are the values of V[p] for the points p on the top row. Depending on the target machine
they may remain in memory for being used in future computations, or ofﬂoaded in some way or another.
Fig. 7 shows the result of the computations of the values V[p] on points p:BS3 for a concrete initial data set.
How the computation actually proceeds through the DDA (row parallel, a point at a time, or skewed in some strangeway),
can be controlled by the embedding of the bitonic sort DDA into the space–time of a chosen hardware architecture. This is
discussed in the next section.
6. DDA-embeddings into hardware
We will treat computing devices as DDAs. In a hardware DDA, the points identify the processors and the branches the
communication channels. The granularity of a processor can range froma single logical gate to an arbitrarily complex function
or a general CPU. In our example the computational entities are the min/max functions of the bitonic sorting algorithm.
The space–time of a hardware DDA describes its dynamic behaviour. Computation on a processor takes place at the points
of the space–timeDDA,while communication betweenprocessors along the communication channels take a time increment,
basically the pairing of the hardware DDA with time-stepping.
An embedding of a computation consists of three functions, EPwhich deﬁnes how DDA points map into hardware space
and time coordinates, ERwhich deﬁnes how a request branch at a DDA point is translated into an incoming communication
channel, and ES which deﬁnes how a supply branch at a DDA point is translated into an outgoing communication channel
[23,22]. The embedding controls the utilisation of the available hardware resource.We sketch this for two different hardware
architectures: the hypercube and the CUDA programmingmodel – an access route to Nvidia’s GeForce 8800 GPU series [34].
6.1. Hypercube
Deﬁnition 6.1. Ahypercube space–timeDDAofdimensiond ∈ N,DHSTd, isdeﬁnedbyaDDAsortHSTd = node Nat * time
Nat, with DI(p) = 0<=node(p)<2d. Then for branch indices BHSTd={0, 1, . . ., d}, as shown in Fig. 8, the
request and supply functions can be deﬁned by:
rg(p,b) = 0<time(p)
rp(p,b) = if (b = 0) then HSTd(node(p),time(p)-1)
else HSTd(flip(b-1,node(p)),time(p)-1)
sg(p,b) = 0<=time(p)
sp(p,b) = if (b = 0) then HSTd(node(p),time(p)+1)
else HSTd(flip(b-1,node(p)),time(p)+1)
The hypercube space–time is an example of an inﬁnite countable DDA. Here the computations have a starting point (time
0), but may continue forever.
Example 6.2. An embedding of a bitonic sort DDA for 2h inputs into a hypercube space–time DDA of dimension h is given by
the functions (see Fig. 9):
Fig. 8. Hypercube space–time DDA of dimension 4 illustrated in 4 time-steps. Channel 0 is the communication within a node, the other channels between
nodes. The node projections of HST4 points are presented in binary.
E. Burrows, M. Haveraaen / Journal of Logic and Algebraic Programming 78 (2009) 519–538 533
Fig. 9. Bitonic sort DDA for 23 inputs embedded into the hypercube space–time DDA of dimension 3 for 7 time-steps. Request and supply branches of the
bitonic sort DDA (thick lines) are mapped to hypercube communication channels.
EP(p) = HSTh(col(p),grow(sbf(p),row(p)))
ER(p,b) = if (b = 0) then 0 else row(p)+1
ES(p,b) = if (b = 0) then 0 else
if (row(p) = 0) then sbf(p)+1 else row(p)
where the function grow:Nat,Nat→Nat computes the global row number of a point in the bitonic sort DDA based on
the sub-butterﬂy and local row numbers, deﬁned by grow(b,r) = b(b-1)/2 + b-r.
We can now execute the bitonic sorter V on the hypercube. The steps for the DBSh points p become iterations at each pro-
cessor node(EP(p))=col(p) for time-steps time(EP(p)): receive data from channels ER(p,b) for all relevant
b:B, compute V, and then send the result on channels ES(p,b) for all relevant b:B. In the ﬁrst step the initial values
v_col(p) will reside in the memory of the corresponding node, and in the ﬁnal step the computed values (sorted data)
will also reside in the memory of the nodes.
If there aremore data than processors,wemay introducemultiplememory cells on each node.Wewill show this principle
when discussing the GPU version below.
6.2. CUDA
Graphics Processing Units (GPUs) are very powerful and yet cost-efﬁcient computational devices [35]. They feature
hundreds of processors on more than a dozen chips. Now that graphics vendors provide general, non-graphics oriented
programming models, e.g. Nvidia’s CUDA [34] and ATI’s CTM [1], this computational power is becoming available for
general-purpose compute-intensive data-parallel applications.
In Nvidia’s CUDA [34] the GPU device operates as a Co-processor to the main CPU. The GPU is capable of handling a
huge number of threads. The threads are downloaded by the host onto the device in the form of a kernel. Threads within
a kernel are organised as a grid of blocks. A block contains a limited number of threads (512 for the GeForce 8800 Series
cards), but a grid of blocks may contain any (reasonable) number of blocks. Threads within one block can synchronize and
share data through fast on-chip shared memory. Threads in different blocks within the same kernel can only communicate
asynchronously via the main GPUmemory. There is no guarantee as to which blocks run in parallel, or in which order blocks
are sequenced, when the grid has more blocks than can be executed in parallel. So threads in different blocks are in practice
534 E. Burrows, M. Haveraaen / Journal of Logic and Algebraic Programming 78 (2009) 519–538
Fig. 10. Bitonic sort DDA for 24 inputs, embedded into NVIDIA’s CUDA programming model, executed by 4 kernel invocations, each consisting of 4 blocks
of threads.
unable to exchange information within the same kernel. The GPU memory is persistent across kernels, and is also a means
for initialising computations and outputting results.
Deﬁnition 6.3. The CUDA space–time DDA DCUST, with  the number of threads per block, is deﬁned by the sort
CUST = space CUB * time Nat, where space is comprised by CUB = block Nat * thread Nat,
with data invariants:
DI(p) = DI(space(p)) for CUST
DI(s) = thread(s)< for CUB
Branch-indices are of typeCUB aswell, representing communication channels between threads between andwithin blocks.
Then requests and supplies in DCUST are deﬁned along CUB branches, describing communication between threads in
successive time-steps:
rg(p,b) = 0<time(p)
rp(p,b) = CUST(b,time(p)-1)
sg(p,b) = 0⇐time(p)
sp(p,b) = CUST(b,time(p)+1)
The number  is the number of threads allowed per block as deﬁned by the hardware itself. Threads are interpreted as the
space component, such that CUB identiﬁes the local thread-index and the block-index the thread belongs to. Here threads
have a local index 0 through  − 1.
E. Burrows, M. Haveraaen / Journal of Logic and Algebraic Programming 78 (2009) 519–538 535
Wheneverblock(b)=block(space(p))wehave fast, intra-blockcommunication.Otherwisewehave inter-block
communication through the GPU memory. The latter also implies that we have to end the kernel, and start a new one in
order to continue our computation.
Since threads access memory locations by using their thread index, the abstractions we use for them and their commu-
nication implicitly provide information about data locality. This can be extracted by referring to the projections of CUB and
CUST.
Example 6.4. The embedding of DBSh into DCUST becomes:
EP(p) = CUST(CUB(col(p)/,col(p)%),grow(sbf(p),row(p)))
ER(p,b) = CUB(col(rp(p,b))/,col(rp(p,b))%)
ES(p,b) = CUB(col(sp(p,b))/,col(sp(p,b))%)
where rp and sp are the request and supply functions of DBSh.
We see that the bitonic sort will be split across several kernels in order to communicate between blocks for the wider
branches. The process in each thread will receive data by reading from the memory location corresponding to the channel
given by CUB. If this channel is across block numbers, then the read is from the global GPU memory. Otherwise it is local
block memory. Then the appropriate min/max values are computed, and data is stored in the appropriate CUB memory
location. Inter-block storage means storing in the global GPUmemory. Storage to global GPUmemory also takes place at the
end of the kernel. With this strategy, the initial data sets are stored in the relevant locations on the global GPUmemory, and
the result of the bitonic sort algorithm likewise ends in the global GPU memory.
Fig. 10 illustrates the embedding for  = 4. Data shared within one block is through fast shared memory. When the edges
of the bitonic sort DDA graph intersects block borders, threads of different blocks need to share data via the global GPU
memory. The darker grey frames across all threads highlight this: threads write to the GPU memory, the kernel terminates,
and control is handed over to the host, which invokes a new kernel with threads ﬁrst fetching data from the GPU memory.
Note that the address splitting technique used here between blocks and threads, also can be used to split large networks
between nodes and local memory on, e.g., a hypercube architecture.
7. Related work
The number of processors per chip is expected to double every year over the next few years, entering parallel processing
into themassmarket. As a consequence, software needs to be parallelized and ported in an efﬁcientway tomassively parallel,
possibly heterogeneous, architectures. The programming community is in great need of high-level parallel programming
models to adapt to the new era of commonly available parallel computing devices. The research on parallel computing
blooms like never before. While in the early years of computing, parallelism seemed to be desirable for high throughput,
today taking parallelism into account has become inevitable.
In their excellent discourse [3] on the state of the art of parallel computing research, Berkeley researchers argue that
current programming methodologies may be sufﬁcient for systems with up to 8 processors, but they are not likely to scale
beyond that. Inspired by the success of parallelismat the extremes of the computing spectrum, i.e., embedded computing and
high-performance computing, the report suggests several design targets that aprogrammingmodel shouldmeet. The target is
set to 1000 cores per chip, but programmingmodels shouldnot dependon thenumber of processors.Models should allow the
user to indicate locality, and should support awide rangeof data types. They should supportwell-known formsof parallelism:
data-parallelism, task parallelism, and instruction-level parallelism. Finally, instead of traditional benchmarks, they suggest
new methods, borrowed from scientiﬁc computing, to design and evaluate programming models and architectures.
Research directions of earlier times are often revisited. In hindsight it is often easier to understand why a very promising
approach did not succeed. High Performance Fortran (HPF) is one of these [29]. HPF is a high-level data-parallel programming
systembasedonFortran,whichwasacceptedwithgreatenthusiasmin theearly90s.However, the initial excitementhas faded
out. This is attributed to reasons that are worth learning from: immature compiler technology leading to poor performance,
lack of ﬂexible distributions, inconsistent implementations, missing tools, and lack of patience by the high-performance
computing community. Although HPF itself failed to achieve success, it had a great impact on the development of high-level
parallel languages (e.g. Fortran 95, OpenMP, Chapel, Fortress, X10).
In recent years, several new programming models have been proposed, and in some cases implemented. However, the
underlying basic ideas point back to approaches from decades ago. In general, they come in either a task parallel or data-
parallel fashion, though the latter seems to outweigh the former as to popularity. This should not come as a surprise, since
data-parallelism, by its nature, scales better with growing number of processors.
Some of the current programming models targeting multi-core and/or GPU programming in imperative style are:
• CellSs [36] has been proposed as an alternative task parallel programming model for multi-core processors, and its
current implementation targets IBM’s CELL/BE processor. It is based on the automatic exploitation of the functional
536 E. Burrows, M. Haveraaen / Journal of Logic and Algebraic Programming 78 (2009) 519–538
parallelism of a sequential program through the different processing elements of the Cell/BE architecture. It is based
on annotations to a sequential code, similar to that of OpenMP.
• Also on the task parallel front, with Intel’s TBB library [38] one can express parallelism in a C++ program. It is based on
a higher-level, task-based parallelism that abstracts platform details and threading mechanism for performance and
scalability.
• RapidMind [33] is a ready to use development and runtime platform to a variety of architectures, including GPUs,
the Cell/BE, and multi-core CPUs. It is a data-parallel programming model which takes a high-level abstraction of
parallelism and maps it to what is available.
• Microsoft Research’s Dryad [28] is a general-purpose distributed execution engine for coarse-grain data-parallel ap-
plications. It combines computational vertices with communication channels to form a data-ﬂow graph. The user can
specify a directed graph to describe the application’s communication patterns between the computation vertices.
• A heterogeneous data-parallel computational model for the Cell/BE has been recently presented in [30]. It proposes to
aggregate the computing power of the twodifferent processing elements of the Cell/BE, i.e., of the synergistic processor
elements, and of the heavy duty processor element.
Declarative approaches also tackle multi-core and GPU programming. Since declarative languages offer a pure view of
computation, such settings are suitable to express data-parallelism in high-level descriptions [31].
• In [39]Singharguesabout the importanceof theability to targetdifferent computingdevices fromthesamedescription.
He shows howGPUs and FPGAs can be targeted froma single data-parallel description, based on higher order functions
and polymorphism.
• Data Parallel Haskell [9] aims to implement the programming model of nested data-parallelism into the Glasgow
Haskell Compiler, by extending NESL [7] in terms of expressiveness and efﬁciency. NESL is a portable nested data-
parallel language of the early 90s which allowed high-level, concise descriptions of nested data-parallel programs.
• Obsidian [42] is a data-parallel language embedded in Haskell which targets GPU programming. The current version
is implemented for NVIDIA’s CUDA. The programming style is inspired by Lava combinators [6], high-level structural
hardware design elements.
• Skeletal parallel programming evolved from the idea of higher-order function applications [13]. Skeletons (“higher-
order functions”) are high-level parallel constructs, which describe computational structures without overspecifying
the details. They come as library packages with efﬁcient implementations for the different parallel systems. The
programmer identiﬁes the patterns that ﬁt his need, and customizes them by ﬁlling in the “arguments”. These can be
other functions/procedures to which the skeleton is applied to produce the problem speciﬁc ﬁnal program [37,14].
Few, if any, of these programming models represent fundamentally new ideas. Task parallelism goes back to the 60s idea
of concurrent processes [19] with shared memory. Message passing style parallelism goes back to the mid 70s [27] and its
foundation has been investigated in several process algebras from the 80s onwards. Data-parallelism evolved from systolic
arrays, and was a ﬁrmly established approach by the mid-80s [26] when it was the main programming model for some of
that time’s high-end parallel machines. A wide variation of early parallel programming models are surveyed in [4], and a
good literature list can be found [25].
Our approach is data-parallel by considering the entire computation at once. We emphasize modular separation of the
computational expressions from the data dependency, and separation of the dependency from its embedding onto hardware.
The latter allows us to fully control resource usage, including data locality. This separates us frommost functional approaches,
though our expression-oriented notation would be considered functional. The focus on data-ﬂow graphs is in commonwith
theDryadapproach,where thegraphs, in response tosomeevents thatoccurduring thecomputation, canchangedynamically.
We target ﬁne-grain data-parallelism where the dependency graphs are ﬁxed throughout the computation. This makes our
approach somewhat more static and less ﬂexible then that of Dryad’s.
8. Conclusion and future work
In this paper we propose to use high-level DDA abstractions for parallel computing. We believe that DDAs can play an
important role when dealing with parallelism. We showed that DDA-based repetition is a viable programming approach
based on an explicitly dependency-driven computation mechanism. The DDA then provides full control over computation
time. Spatial placements of computations – an important factor when parallelizing – can also be controlled at a high-level
from DDAs. This gives full control over space usage, whether sequential or parallel execution is desired. Moreover, in the
parallel case, the DDAs give full control over processor andmemory allocation and communication channel usage, while still
at the abstraction level of the source program. A consequence of this is that we have modularised parallel distribution and
data placement from the rest of the program.
Our approach gives a hardware independent parallel programmingmodel, inwhich computations and hardware commu-
nication layouts are deﬁned by DDAs, and the computations are mapped onto the hardware in terms of DDA-embeddings.
These are deﬁned at a high-abstraction level, yet a DDA-based compiler produces parallel code speciﬁc for each hardware
architecture, fully controlled by the programmer, without the need of rewriting the problem solving code.
E. Burrows, M. Haveraaen / Journal of Logic and Algebraic Programming 78 (2009) 519–538 537
We illustrated these ideas through a case-study of bitonic sorting. We extracted the DDA of the bitonic sorter, expressed
the computations on this DDA, and sketched its embeddings into the space–time layout of twodifferent parallel architectures
(hypercube and a GPU), and described how the embedded computation could be compiled onto both hardwares.
In the case of the GPU, the embedding was deﬁned into an intermediate access route of the hardware (CUDA), since the
concrete hardware communication layout of the GPU is somewhat hidden.
We are building a compiler to generate parallel code for architectures such as NVIDIA’s GPUs with CUDA and the CELL/BE
processor. Previously we built a prototype compiler, Sapphire, producing parallel code from DDAs using the MPI message
passing library. It also seems promising to take this technology a step further towards low-level hardware and investigate
possible ways of combining DDAs with circuit design, e.g., FPGA hardware accelerators.
Acknowledments
We would like to thank those who have contributed with their suggestions and edifying comments to this paper: Mary
Sheeran, Anya Helene Bagge, and not least the anonymous reviewers.
References
[1] AMD, ATI CTM Guide. Technical Reference Manual (2006). http://ati.amd.com.
[2] A. Anderlik, M. Haveraaen, On the category of data dependency algebras and embeddings, Proc. Estonian Acad. Sci. Phys. Math., 52 (4) (2003) 337–355.
[3] K. Asanovic, R. Bodik, B.C. Catanzaro, J.J. Gebis, P. Husbands, K. Keutzer, D.A. Patterson, W.L. Plishker, J. Shalf, S.W. Williams, K.A. Yelick, The landscape
of parallel computing research: a view from Berkeley, Tech. Rep. UCB/EECS-2006-183, EECS Department, University of California, Berkeley, December
2006.
[4] J.L. Baer, A survey of some theoretical aspects of multiprocessing, ACM Comput. Surv. 5 (1) (1973) 31–80.
[5] K.E. Batcher, Sorting networks and their applications, in: AFIPS Spring Joint Computing Conference, 1968.
[6] P. Bjesse, K. Claessen, M. Sheeran, S. Singh, Lava: hardware design in Haskell, SIGPLAN Notices 34 (1) (1999) 174–
184. <http://doi.acm.org/10.1145/291251.289440>
[7] G.E. Blelloch, J.C. Hardwick, J. Sipelstein,M. Zagha, S. Chatterjee, Implementation of a portable nested data-parallel language, J. Parallel Distrib. Comput.
21 (1) (1994) 4–14.
[8] L. Bougé, The data parallel programming model: a semantic perspective, in: The Data Parallel Programming Model: Foundations, HPF Realization, and
Scientiﬁc Applications, Lecture Notes in Computer Science, vol. 1132, Springer, 1996.
[9] M.M.T. Chakravarty, R. Leshchinskiy, S.P. Jones, G. Keller, S. Marlow, Data parallel Haskell: a status report, in: DAMP ’07: Proceedings of the 2007
Workshop on Declarative Aspects of Multicore Programming, ACM, New York, NY, USA, 2007.
[10] R. Chandra, R. Menon, L. Dagum, D. Kohr, D. Maydan, J. McDonald, Parallel programming in OpenMP, Morgan Kauffman, San Franciso,
2000 <http://www.OpenMP.org>
[11] B. Chapman, The challenge of providing a high-level programming model for high-performance computing, High-Performance Computing: Paradigm
and Infrastructure, Wiley, 2005<http://dx.doi.org/10.1002/0471732710.ch2>
[12] T. Chen, R. Raghavan, J.N. Dale, E. Iwata, Cell broadband engine architecture and its ﬁrst implementation – a performance view, IBM J. Res. Develop.
51 (5) (2007) 559–572.
[13] M. Cole, Algorithmic Skeletons: Structured Management of Parallel Computation, MIT press, Mass, 1989.
[14] M. Cole, Bringing skeletons out of the closet: a pragmatic manifesto for skeletal parallel programming, Parallel Comput. 30 (3) (2004) 389–406.
[15] V. Cˇyras, M. Haveraaen, Modular programming of recurrencies: a comparison of two approaches, Informatica 6 (4) (1995) 397–444.
[16] M.J. Flynn, Some computer organizations and their effectiveness, IEEE Trans. Comput. C-21 (9) (1972) 948–960.
[17] A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, Second ed., Addison-Wesley Longman Publishing Co., Inc., Boston, MA,
USA, 2003.
[18] W. Gropp, E. Lusk, A. Skjellum, Using MPI: Portable Parallel Programming with the Message-passing Interface, second ed., MIT Press, 2000.
[19] P.B. Hansen, Concurrent programming concepts, ACM Comput. Surv. 5 (4) (1973) 223–245.
[20] M. Haveraaen, Distributing programs on different parallel architectures, in: D.A. Padua (Ed.), Proceedings of the 1990 International Conference on
Parallel Processing (ICPP), Software, vol. II, The Pennsylvania State University Press, University Park and London, 1990.
[21] M. Haveraaen, Efﬁcient parallelisation of recursive problems using constructive recursion (research note), Euro-Par ’00: Proceedings from the 6th
International Euro-Par Conference on Parallel Processing, Springer-Verlag, London, UK, 2000, pp. 1–5.
[22] M. Haveraaen, An algebra of data dependencies and embeddings for parallel programming, Formal Aspects of Computing, accepted for publication.
[23] M. Haveraaen, Data dependencies and space time algebras in parallel programming, Tech. Rep. 45, Department of Informatics, University of Bergen,
Norway, June 1990.
[24] M. Haveraaen, E.G. Wagner, Guarded algebras: disguising partiality so you won’t knowwhether it’s there, in: Recent Trends In Algebraic Development
Techniques, Lecture Notes in Computer Science, vol. 1827, Springer Verlag, 2000, pp. 3–11.
[25] P. Hibbard, Multiprocessor software design, in: ACM ’80: Proceedings of the ACM 1980 Annual Conference, ACM, New York, NY, USA, 1980.
[26] W.D. Hillis, J.GuyL. Steele, Data parallel algorithms, Commun. ACM 29 (12) (1986) 1170–1183.
[27] C.A.R. Hoare, Communicating sequential processes, Commun. ACM 21 (8) (1978) 666–677.
[28] M. Isard, M. Budiu, Y. Yu, A. Birrell, D. Fetterly, Dryad: distributed data-parallel programs from sequential building blocks, SIGOPS Oper. Syst. Rev. 41
(3) (2007) 59–72.<http://doi.acm.org/10.1145/1272998.1273005>
[29] K. Kennedy, C. Koelbel, H. Zima, The rise and fall of high performance Fortran: an historical object lesson, in: HOPL III: Proceedings of the Third ACM
SIGPLAN Conference on History of Programming Languages, ACM, New York, NY, USA, 2007. <http://doi.acm.org/10.1145/1238844.1238851>.
[30] B. Li, H. Jin, R. Zheng, Q. Zhang, A heterogeneous data parallel computational model for Cell Broadband Engine, ChinaGrid, Annual Conference, 2008,
pp. 325–330. <http://doi.ieeecomputersociety.org/10.1109/ChinaGrid.20%08.56>
[31] B. Lisper, Data parallelism and functional programming, in: The Data Parallel Programming Model: Foundations, HPF Realization, and Scientiﬁc
Applications, Lecture Notes in Computer Science, vol. 1132, Springer, 1996.
[32] W.L. Miranker, A. Winkler, Spacetime representations of computational structures, Computing 32 (2) (1984) 93–114.
[33] M. Monteyne, Rapidmind platform, Tech. rep., RapidMind, 2008. <http://www.rapidmind.net/pdfs/WP-RapidMindPlatform.pdf>
[34] NVIDIA, CUDA Programming Guide, 2008. <http://www.nvidia.com>
[35] J.D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Krüger, A.E. Lefohn, T.J. Purcell, A survey of general-purpose computation on graphics hardware,
Comput. Graph. Forum 26 (1) (2007) 80–113.<http://dx.doi.org/10.1111/j.1467-8659.2007.01012.x>
538 E. Burrows, M. Haveraaen / Journal of Logic and Algebraic Programming 78 (2009) 519–538
[36] J. Perez, P. Bellens, R.M. Badia, J. Labarta, CellSs: Making it easier to program the Cell Broadband Engine processor, IBM J. Res. Develop. 51 (5) (2007)
593–604.<http://www.research.ibm.com/journal/rd/515/perez.pdf>
[37] F.A. Rabhi, S. Gorlatch (Eds.), Patterns and Skeletons for Parallel and Distributed Computing, Springer-Verlag, London, UK, 2003.
[38] J. Reinders, Intel Threading Building Blocks: Outﬁtting C++ for Multi-core Processor Parallelism, O’Reilly, 2007.
[39] S. Singh, Declarative programming techniques for many-core architectures, 2008. <http://research.microsoft.com/∼satnams/dec-manycore.pdf>.
[40] S. Søreide, Compiling Sapphire into Sequential and Parallel Code Using Assertions. Master Thesis, Universitetet i Bergen, P.O. Box 7800, N-5020 Bergen,
Norway, Spring 1998.
[41] H. Stone, Parallel processing with the perfect shufﬂe, IEEE Trans. Comput. C-20 (2) (1971) 153–161. Feb.
<http://ieeexplore.ieee.org/xpls/abs-all.jsp?arnumber=16%71798>
[42] J. Svensson, An embedded language for data-parallel programming, Master’s thesis, Chalmers University, Göteborg, 2008.
[43] M. Wolfe (Ed.), High Performance Compilers for Parallel Computing, Addison Wesley, Reading, Mass, 1996.
