Scalable data abstractions for distributed parallel computations by Hanlon, James et al.
Hanlon, J., Hollis, S. J., & May, D. (2012). Scalable data abstractions for
distributed parallel computations. arXiv:1210.1157.
arXiv:1210.1157v1 [cs.PL] 3 Oct 2012
Scalable data abstractions for distributed parallel computations
James Hanlon, Simon J Hollis and David May
Department of Computer Science
University of Bristol
Bristol, UK
{hanlon, simon, dave}@cs.bris.ac.uk
Abstract—The ability to express a program as a hierarchical
composition of parts is an essential tool in managing the
complexity of software and a key abstraction this provides is
to separate the representation of data from the computation.
Many current parallel programming models use a shared
memory model to provide data abstraction but this does not
scale well with large numbers of cores due to non-determinism
and access latency. This paper proposes a simple programming
model that allows scalable parallel programs to be expressed
with distributed representations of data and it provides the
programmer with the flexibility to employ shared or distributed
styles of data-parallelism where applicable. It is capable of an
efficient implementation, and with the provision of a small set
of primitive capabilities in the hardware, it can be compiled to
operate directly on the hardware, in the same way stack-based
allocation operates for subroutines in sequential machines.
Keywords-Parallel programming, composability, parallel sub-
routines, data-parallelism, distributed memory, compilation
techniques.
I. INTRODUCTION
When developing a program of any complexity, the ability
to express it in terms of a simpler set of components is
essential. A component presents a simple interface that
allows its implementation to be considered independently,
and when combined with other components, the internal
details can be ignored and its functionality treated in an
abstract way. This allows a program to be constructed using
modules, ranging from small functions to libraries, and for
any component to be substituted with another that adheres
to the same interface. The importance of abstraction as a
tool in computer programming was recognised by Turing
in the 1940s [1] and was formalised in the 1970s by the
structured programming methodology [2]. This aimed to
improve the quality of programs and productivity of pro-
grammers through judicious use of hierarchical structuring
and subroutines. These principles have been foundational for
modern sequential programming languages.
A key issue with composability is separating the rep-
resentation of data from the structure of a computation.
Mainstream CPU and general-purpose GPU (GPGPU) par-
allel programming models are based on a shared memory
model, where data are globally accessible. This is the form
of parallel random access machine (PRAM) [3] and the
related bulk-synchronous parallel (BSP) [4] model. Shared
memory parallelism allows sequential approaches to data
abstraction and conventional data structures to be employed,
but it does not scale well with large numbers of cores.
Access latency can vary significantly and unpredictably
due to the physical distribution of data across a machine.
This makes it difficult to exploit locality, which is essential
for scaling a computation, and poses problems for barriers
which are delayed by the slowest participant. Additionally,
when accesses are made to shared data they can incur latency
from collisions, and when they are updating it, behaviour can
become non-deterministic.
There are a number of issues related to the implementation
of a shared memory system that pose further problems for
this type of data abstraction. Mainstream parallel processors
take the form of symmetric multi-processors (SMPs) and
these have brought about a number of parallel programming
approaches such as the Cilk [6] language, OpenMP [7] and
Intel’s Threading Building Blocks (TBB) [8]. These employ
a multi-threaded execution model where a number of threads
are managed by a scheduler, but problems can arise with pro-
grams that combine parallel components. Performance can
be affected significantly by threads competing for execution,
causing unnecessary context-switches, and idling within a
component due to a load imbalance, causing under-utilisation.
The effects of this are dependent on combinations of
program components and result in unpredictable execution
time, exacerbating non-deterministic behaviour. OpenCL [9]
is emerging into the mainstream and is designed to support
the programming of heterogeneous systems, which typically
comprise CPUs and GPGPUs. OpenCL uses a shared
memory model but exposes distinct address spaces and in
order to compose components operating in different ones,
variables must be explicitly transferred between them [11].
Parallelism is now the primary means of sustaining growth
in computational performance [12] and the shared memory
model will continue to be useful. However, it looks certain
that future systems will involve large numbers of processors
and that shared memory alone will not be effective in delivering
performance on them. Therefore, it is necessary that parallel programming
models, as well as supporting shared memory approaches,
also support composable representations of distributed data.
This paper proposes a simple distributed programming
model that builds on the approach of the occam program-
ming language [13] with notations to control the distribution
of parallelism and a server construct that is active only in
response to requests. Arrays of servers can be combined to
construct distributed data structures, independently from the
computational aspects of a program, providing access for
shared or distributed styles of data-parallelism. This gives
the programmer flexibility to employ the most appropriate
data representation for the purposes of the program and
scalability. Server-based data structures can be composed
with similar scoping rules to conventional variable decla-
rations to simplify the task of building scalable programs
by allowing them to be composed in a modular way. With
the provision of a small set of primitive capabilities in the
hardware, the model can be compiled with a fixed allocation
of processors, so that it operates efficiently and directly
on the hardware, without the use of dynamic allocation
mechanisms. The idea is similar to stack-based allocation
for subroutines in sequential machines.
The following specific contributions are made:
1) A server construct that can be used to express compos-
able representations of distributed data structures with
arrays of server processes, for both shared memory
and message passing distributed memory style parallel
computations.
2) An efficient implementation of distributed parallelism
based on a compile-time allocation of processors.
3) An implementation of server processes that allows
many-to-one client connections to be established ef-
ficiently and without deadlock.
4) Demonstration of the proposed notations with three
example programs that are characteristic of general-
purpose applications and employ different styles of
parallelism.
The rest of this paper is organised as follows. Section II
overviews related work; Section III presents the proposed
programming model and notations in terms of a conceptual
machine model; Section IV describes the requirements of
a target architecture and the compilation scheme for it;
Section V discusses how several example programs that
require distinct styles of parallelism can be expressed in the
model; Section VI concludes.
II. RELATED WORK
Distributed memory architectures are most common in
high performance computing (HPC) systems and the Mes-
sage Passing Interface (MPI) [14] is the standard program-
ming approach. MPI provides features for the construction of
modular components such as libraries [15] with features to
name groups of processes and provide scoping for operations
within them, but it does not allow a separation of data
because of its SPMD (single program multiple data) model.
The success of MPI can be attributed to its simple com-
pilation and execution model, which provides predictable
execution that allows programmers to make efficient use of
a machine. Other more dynamic languages push resource
allocation and management into runtime components that
require significant overheads in execution time and storage,
and result in less predictable execution of program compo-
nents.
Dynamic process creation was introduced in MPI-2, and
in particular, a server construct, similar to the proposal in
this paper, was introduced to address the need to support
groups of reactive processes that accept connections from
other groups [14, Sect. 10.4]. The problem with this is that
the location of processes is not known at compile-time. To
quote the specification directly: “Almost all of the complexity
in MPI client/server routines addresses the question ‘how
does the client find out how to contact the server?’” This
issue also lies at the heart of this work, but the solution is
simplified by the choice of notations, the restrictions placed
on them and the support required in the architecture.
Partitioned global address space (PGAS) languages such
as UPC [16], Chapel [17] and X10 [18] are based on globally
accessible variables that are divided into logical segments
to provide a clean composition of distributed data and
computation. These segments have affinity with particular
processes to provide a notion of locality for fast memory
accesses, and global accesses are compiled into message
passing communications. These languages include a range of
distributed data types with high level notations for operating
on them. Static distributions can be compiled into message
passing programs, although it is not yet clear how efficient
they are compared to manually crafted MPI equivalents, and
as yet PGAS languages have not had widespread adoption.
Charm [19] is another HPC-orientated language but takes
a different approach. Parallelism is expressed with arrays
of objects and communication is performed with remote
method calls. A runtime system is responsible for dy-
namically mapping objects onto processors and scheduling
communication. As is the case with dynamic processes
in MPI, this requires all communications to be directed
through proxy processes aware of object locations. Although
Charm encourages modular development, it does not directly
support composable representations of distributed data.
Occam [13] and its descendant XC [20] are message
passing languages for distributed memory architectures.
Predictable execution is a key principle of both and
is achieved primarily with a compile-time allocation of
memory, by prohibiting recursion and dynamically sized
arrays. Implementations require the allocation of processors
to be specified statically in a mapping file and program
components cannot employ distributed parallelism internally.
Developments as part of the occam 3 specification intro-
duced the concept of a server component [21, Chapt. 13].
The proposed notation builds on this with a distributed
execution model and relaxed communication constraints.
III. PROPOSAL
A. Architectural model
The proposed programming model is based on a simple
conceptual architecture where, to a first order approximation,
there is an infinite array of processors. Each processor
has a relatively small private memory, but the ability to
communicate with any other processor via a network in
a constant amount of time, independent of the processor
locations. This is an idealised view held by the programmer
to simplify programming.
A realistic parallel machine can provide a good approx-
imation to this with a fixed number of processors and
a logarithmic-diameter, high-capacity network such as a
Clos/fat tree [22] or hypercube [23]. Networks such as
meshes do not provide these properties and programs must
be carefully mapped to preserve locality to obtain good
performance.
This model is analogous to the random access machine
(RAM) model of computation [24] which models the essen-
tial aspects of a conventional sequential computer. It consists
of a program that operates on an infinite capacity memory
where accesses take a constant amount of time, independent
of the address. In practical sequential computers, memory
size is limited and access incurs a latency related to capacity,
also by a logarithmic scaling.
B. Notations
The following is an informal description of the proposed
language notations. An imperative block-structured syntax
is used and the basic features of this are based on the oc-
cam programming language [13]. It includes sequential and
parallel composition, replication and channel-based commu-
nication and provides a platform for the main contributions
of this paper: notations to express local and distributed
parallelism and a server construct. Local parallelism relates
to concurrent threads that access a shared memory and dis-
tributed relates to distinct memories. Diagrams are included
throughout to provide an intuition for the programming
model and behaviour of the notations in isolation and in
composition.
1) Composition: A program is built as a hierarchical
collection of processes that can be composed in sequence
or in parallel. Sequential composition is denoted by the ‘;’
separator and causes a set of processes to be executed one
after another. If P , Q and R are processes, then the process
P ; Q ; R
is executed by running P , Q and then R. Sequential com-
position can be replicated to produce a number of similar
processes executed in sequence. If P (i) is a process, then
the process
seq i=1 for n do P(i)
is equivalent when n = 4 to
P(1) ; P(2) ; P(3) ; P(4).
Parallel composition causes the component processes to
start simultaneously and the execution can be directed to
occur locally or distributed over an array of processors.
Local parallel execution is denoted by the ‘|’ separator. The
process
P | Q | R.
causes the component processes P , Q and R to start
simultaneously on a processor pk, where k is the identifier
(ID) of a processor, and it terminates when all component
processes have terminated.
[Diagram: processes P, Q and R sharing processor pk.]
Distributed parallel composition is denoted by the ‘&’ sep-
arator. The process
P & Q & R
is equivalent to the above local composition, except that
P , Q and R start simultaneously on different processors
pk, pk+1, pk+2.
[Diagram: P on pk, Q on pk+1 and R on pk+2.]
Distributed composition can be replicated to produce a
number of similar parallel processes and can be thought of
as declaring a process array. The process
par i=1 for n do P(i)
is equivalent when n = 4 to
P(1) & P(2) & P(3) & P(4).
Processes in distributed composition are allocated on
consecutively numbered processors as this simplifies the task
of establishing communication channels with them because
they can be addressed with a base and offset. This property of
the notation also allows correspondences to be established
between different arrays. For example, replication, combined
with local composition can be used to layer arrays of parallel
processes on the same array of processors. The process
par i=1 for m do P(i) |
par i=1 for n do Q(i)
for m = n causes each processor pk, pk+1, · · · , pk+n−1 to
execute P (x) and Q(x) for some x.
Figure 1. A server process, serving a set of clients. [Diagram: clients client1, · · · , clientm connected to a server that holds state and offers calls call1, · · · , calln.]
[Diagram: P(1) and Q(1) on pk, P(2) and Q(2) on pk+1, · · · , P(n) and Q(n) on pk+n−1.]
The result of this is a direct correspondence between P and
Q with the same index and any communication between
them will be performed locally. For m ≠ n, one array will
be larger and allocated over more processors. In contrast,
the distributed composition of the same replicators
par i=1 for m do P(i) &
par i=1 for n do Q(i)
allocates both process arrays on disjoint sets of processors
[Diagram: P(1), · · · , P(m) on pk, · · · , pk+m−1 and Q(1), · · · , Q(n) on pℓ, · · · , pℓ+n−1.]
for ℓ ≥ k + m.
2) Servers: The server notation provides a simple way of
separating a representation of data from the computations
which act on it and can be used in conjunction with
replicators to implement distributed structures that can be
accessed concurrently. Furthermore, it allows both shared
and distributed memory style parallelism to be expressed in
a similar way. This is a significant capability as it allows a
programmer to move easily between them.
A server is a special kind of process that is only active in
response to clients. The interface to a server is a set of calls,
which behave in the same way as conventional procedure
calls, except the parameters and results are transferred to
and from the server so that execution of the call occurs
at the server. Fig. 1 illustrates a single server with a set of
clients. This mechanism is known generally as a remote pro-
cedure call (RPC) [25] and is attractive because it provides
clean semantics, hiding the underlying communication, and
provides the ability to move easily between the local and
remote forms of a call.
A server definition specifies a set of potential calls and
provides responses to them. Its only action while running
is to repeatedly serve calls and it terminates when its scope
terminates. Local state can be initialised by a special initial-
isation process and a corresponding termination process can
be used to finalise the server upon termination. In object-
orientated programming, this relates directly to the concept
of an object with a constructor that takes an initial value and
methods that operate on the private attributes.
As an example, Process 1 defines a server to provide
access to an array. When it initialises, each element of the
array is set to an initial value, specified as a parameter
(init), and when the server is running, calls can be made
to read or write to specific locations.
Process 1
server Store(val init)
  interface(
    call read(val i, var v),
    call write(val i, val v)) to
{ var data[N];
  initial
  { var i;
    seq i=0 for N do
      data[i] := init
  }
  accept
  { read ? (val i, var v)
      v := data[i]
    write ? (val i, val v)
      data[i] := v
  }
  final {}
}
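The behaviour of Process 1 can be approximated in a conventional language. The following is a minimal Python sketch, not the paper's implementation: it assumes a thread stands in for the server process and that queues stand in for channels. The names `StoreServer`, `_call`, `close` and `demo` are hypothetical, and the explicit "stop" call is an assumption used to model the server terminating when its scope ends.

```python
import queue
import threading


class StoreServer:
    """Hypothetical stand-in for Store: private state plus a loop that serves calls."""

    def __init__(self, size, init):
        self._requests = queue.Queue()      # the server's single request channel
        self._size = size
        self._init = init
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def _run(self):
        data = [self._init] * self._size    # 'initial' section: set every element
        while True:                         # 'accept' section: serve calls one at a time
            call, args, reply = self._requests.get()
            if call == "read":
                reply.put(data[args[0]])
            elif call == "write":
                data[args[0]] = args[1]
                reply.put(None)
            elif call == "stop":            # assumed signal that the scope has ended
                reply.put(None)             # 'final' section would run here
                return

    def _call(self, call, *args):
        reply = queue.Queue(maxsize=1)      # the client's private reply channel
        self._requests.put((call, args, reply))
        return reply.get()                  # block until the server responds

    def read(self, i):
        return self._call("read", i)

    def write(self, i, v):
        self._call("write", i, v)

    def close(self):
        self._call("stop")
        self._thread.join()


def demo(n=4):
    s = StoreServer(n, 0)                   # server s is Store(0)
    for i in range(n):                      # seq i=0 for n do s.write(i, i)
        s.write(i, i)
    values = [s.read(i) for i in range(n)]
    s.close()
    return values
```

Because the state `data` is only ever touched by the server thread, clients cannot observe a partially completed call, which mirrors the separation of data from computation described above.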
The following specifies an instance of the Store server
with the name s, for use with an anonymous client process
that executes in parallel and makes calls to write to each
store location.
server s is Store(0) &
seq i=0 for n do s.write(i, i)
Servers can be replicated with a similar notation to a
conventional array declaration. For example the server array
server s is Store(0)[n]
· · ·
creates n instances of the store server, with each initialised
by the same parameters, in this case 0. A call to a particular
server is made by specifying a server with an array subscript
such as s[0].
[Diagram: s[0] on pk, s[1] on pk+1, · · · , s[n-1] on pk+n−1.]
C. Expressing data-parallelism
With the proposed notations for controlling distribution
and creating arrays of servers that can be accessed by
collections of clients, it is possible to express both shared
and distributed memory forms of data-parallel computations.
Figure 2. Illustration of the layout and structure of the shared memory
implementation in Process 2. The storage is distributed over a disjoint array
of processors, hence ℓ ≥ k + n. [Diagram: storage servers s[0], · · · , s[n-1] on pk, · · · , pk+n−1; access servers a[0], · · · , a[m-1] layered with the clients on pℓ, · · · , pℓ+m−1.]
1) Shared memory: A shared memory, distributed over
an array of processors, can be expressed with two server
arrays, one to act as a store and the other to provide an access
abstraction. For example, Process 2 provides an access server
(Access, which has the same interface as Store) to each
of the m client processes. The access and client processes
are layered over the same processors so interaction between
these is local. Each access server holds a reference to the
array of n storage servers and takes read and write requests
from the client and performs them over this array. Fig. 2
illustrates the layout and structure of this.
Process 2
server s is Store(0)[n] &
{ server a is Access(s)[m] |
  par i=0 for m do
  { · · ·; a[i].write(·, ·); · · · }
}
To avoid uneven distribution of accesses and load on
particular servers, which would result in increased access
latency, the access servers could select storage servers by
some appropriate hash function. This is the form of a
PRAM and the memory system of a BSP machine. For
the most general concurrent-read concurrent-write (CRCW)
form of memory, read combining could also be used to avoid
excessive access collisions [26].
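As an illustration of how an access server might select storage servers, the following Python sketch scrambles each global address with a multiplicative hash before splitting it into a server index and a local offset. This is an assumption for illustration only: the paper does not specify a hash function, and the names `make_locator` and `Access` (as a Python class), as well as the multiplier value, are hypothetical. The multiplier must be coprime with the total number of words so that the mapping is a bijection and every server receives exactly the same number of addresses.

```python
from math import gcd


def make_locator(n_servers, words_per_server, multiplier):
    """Return a function mapping a global address to (server index, local offset)."""
    total = n_servers * words_per_server
    if gcd(multiplier, total) != 1:
        raise ValueError("multiplier must be coprime with the total number of words")

    def locate(address):
        scrambled = (multiplier * address) % total   # bijection on 0..total-1
        return scrambled // words_per_server, scrambled % words_per_server

    return locate


class Access:
    """Hypothetical access server: same read/write interface, hashed placement."""

    def __init__(self, stores, multiplier):
        self.stores = stores                          # list of per-server arrays
        self.locate = make_locator(len(stores), len(stores[0]), multiplier)

    def read(self, i):
        server, offset = self.locate(i)
        return self.stores[server][offset]

    def write(self, i, v):
        server, offset = self.locate(i)
        self.stores[server][offset] = v
```

With 4 servers of 8 words and a multiplier of 13, addresses with a stride of 4 are spread over all four servers, whereas simple interleaving (address modulo the number of servers) would direct all of them to a single server.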
2) Distributed memory: A distributed representation of
data can be expressed in a similar way, without an access
abstraction and with the server and client processes
distributed over the same set of processors. Process 3 is
similar to Process 2, except clients are co-located with a
storage server and access it directly. Since there is a local
correspondence between servers and clients, this call will
not incur any overhead due to the underlying interconnection
network.
Process 3
server s is Store(0)[n] |
par i=0 for n do
{ · · ·; s[i].write(·, ·); · · · }
Figure 3. The layout and structure of the distributed memory Process 3.
Servers are situated with clients for fast access. [Diagram: servers s[0], · · · , s[n-1] layered with their clients on pk, · · · , pk+n−1.]
Data stored with other server or client processes could
be accessed with server calls, but race conditions can arise
from concurrent access to shared data and synchronisation
is required to avoid this. Instead, synchronised message
passing communication avoids these issues and is widely
used for scalable algorithms, typically in large systems such
as supercomputers. In general, simple scalable structures
such as pipelines, grids and trees are used [27] which are
easily expressed in occam and hence are composable with
server-based representations of data. This is demonstrated
with the matrix multiplication example in Section V-A.
Shared and distributed memory forms of data-parallelism
lend themselves to different applications and the ability of
the proposed programming model to cleanly support both is
significant. It provides the programmer with the flexibility
to employ a notation that best suits a given application.
IV. COMPILATION
The choice of notations and their restrictions allow for an
efficient implementation. This does however depend on the
provision of certain functionality to support the execution of
a collection of communicating parallel processes and, in par-
ticular, many-to-one patterns of communication. These are
described first, as an architectural target for the compilation
scheme.
A. Architectural target
The following defines the basic requirements of the pro-
posed language notations, independent of a specific hard-
ware or software implementation.
1) Processor addressing. Each processor in a system of
p processors has a unique integer ID in the range 0 to
p − 1 identifying it.
2) Multi-threading. A processor has the ability to sup-
port multiple concurrent threads of execution and any
thread has the ability to create additional threads.
3) Point-to-point communication. Any two threads can
communicate by passing messages over bidirectional
point-to-point channels. A channel is composed of two
channel ends that are each local to a thread. A channel
end has an ID that combines a local unique ID with
the processor’s ID so that it can be uniquely identified
in a system. Before a process p can send a message
to another process q, it must set the destination of a
local channel end to be the ID of the channel end that
q is using to receive messages. It is not necessary for
q to specify p as the source unless it sends a message
in return to p. All messages are delivered in-order.
4) Many-to-one connections. A channel end may be
specified as a destination by multiple senders. In this
case, a sender must be able to establish a connection to
ensure other messages from different senders cannot
be delivered and interrupt a communication sequence.
These requirements are based on the INMOS trans-
puter [28] and related XMOS XS1 [29] architectures, which
provide low-level or hardware support for them. Other
larger-scale message passing architectures such as Blue-
Gene/L [30] and BlueGene/Q [31] realise similar concepts
in their software point-to-point messaging layer.
B. Scheme
1) Compile-time process allocation: As the size of all
process arrays (both replicated processes and servers) can
be determined at compile-time, it is possible to determine a
complete static schedule for the allocation of processes to
processors. This maps process arrays to contiguous blocks
of processors and logically adjacent processes to the same
processor. For example, the runtime use (and reuse) of
processors by Process 4 is illustrated by Fig. 4. This dynamic
behaviour is analogous to the allocation of stack frames in
memory for procedure calls.
Process 4
server a is A(· · ·)[n] &
{ P; Q; R;
{ server b is B(· · ·)[m] |
server c is C(· · ·)[m] |
{ X; Y }
}
}
Allocation is performed by initialising a base processor
b to be ID 0. A process is then assigned to processor b and
for each distributed parallel composition that it contains,
n component processes of it are assigned to processors
b, b + 1, · · · , b + n − 1. The allocation is then applied
recursively to each component process with b set to b + n.
Parallel composition with local distribution is compiled
into thread-based execution with instruction sequences to
perform initialisation, start execution and synchronise before
termination.
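The effect of this allocation can be sketched in Python. This is a simplified model under stated assumptions, not the paper's exact procedure: processes are represented as a tree, sequential and local composition are assumed to share the same base processor (so a sequence reuses processors, as stack frames are reused), and distributed composition is assumed to place its components on consecutive blocks. The class names `Leaf`, `Seq`, `Local` and `Dist` and the helpers `array`, `width` and `allocate` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Leaf:             # a process with no parallel structure of its own
    name: str


@dataclass
class Seq:              # P ; Q
    parts: List["Proc"]


@dataclass
class Local:            # P | Q
    parts: List["Proc"]


@dataclass
class Dist:             # P & Q
    parts: List["Proc"]


Proc = Union[Leaf, Seq, Local, Dist]


def array(name, n):
    """A replicated process or server array: n components on consecutive processors."""
    return Dist([Leaf(f"{name}[{i}]") for i in range(n)])


def width(p):
    """Number of consecutive processors a process occupies (assumed rule)."""
    if isinstance(p, Leaf):
        return 1
    sizes = [width(q) for q in p.parts]
    return sum(sizes) if isinstance(p, Dist) else max(sizes)


def allocate(p, base=0, out=None):
    """Map each leaf name to a processor ID, starting from the given base."""
    out = {} if out is None else out
    if isinstance(p, Leaf):
        out[p.name] = base
    elif isinstance(p, Dist):
        for q in p.parts:               # consecutive, disjoint blocks
            allocate(q, base, out)
            base += width(q)
    else:
        for q in p.parts:               # Seq and Local share the same base
            allocate(q, base, out)
    return out
```

Under these assumptions, a structure shaped like Process 4 with n = 2 and m = 3 places a[0] and a[1] on processors 0 and 1, runs P, Q, R, X and Y on processor 2, and layers b and c over processors 2 to 4.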
2) Server communication: A single server is addressed
by its processor ID and local channel end ID. This can
be packed into a single word and passed as a reference.
An array of servers is addressed by a base processor ID,
common local channel end ID and an offset. This allows
a normal server call s.c(· · · ) or subscripted call s[i].c(· · · ),
where s is the server reference, to exactly specify a particular
server.
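The addressing scheme can be illustrated with a short Python sketch. The 16-bit field width and the names `pack_ref`, `unpack_ref` and `element_ref` are assumptions for illustration; the paper states only that the processor ID and local channel end ID fit in a single word.

```python
CHAN_BITS = 16                      # assumed width of the local channel end ID field
CHAN_MASK = (1 << CHAN_BITS) - 1


def pack_ref(processor_id, chanend_id):
    """Pack a processor ID and a local channel end ID into one word."""
    if not 0 <= chanend_id <= CHAN_MASK:
        raise ValueError("channel end ID does not fit in the assumed field")
    return (processor_id << CHAN_BITS) | chanend_id


def unpack_ref(ref):
    """Recover (processor ID, local channel end ID) from a packed reference."""
    return ref >> CHAN_BITS, ref & CHAN_MASK


def element_ref(base_ref, offset):
    """Reference to element s[offset] of a server array with the given base reference."""
    processor_id, chanend_id = unpack_ref(base_ref)
    return pack_ref(processor_id + offset, chanend_id)
```

Because every element of a server array uses the same local channel end ID on consecutive processors, a subscripted call needs only the base reference and the index to name its target.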
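Figure 4. Illustration of the runtime use of processors according to the
compile-time allocation for Process 4; ‘step’ relates to the sequence of
execution. [Diagram, with axes Processors, Step and Thread: steps 0 to 2 show P, Q and R in turn on thread 0; steps 3 and 4 show b, c and X, then b, c and Y, on threads 0 to 2; the server array a is present at every step.]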
The set of calls for a server is implemented with this
single channel and each call is assigned an ID unique to the
server. Let a0, a1, · · · , an−1 be a set of actual parameters
and P be a process making a call c to a server s of the
form
s.c(a0, a1, · · · , an−1).
Then, for a channel end c local to P , it is compiled as the
sequence:
1) set the destination of c to be the channel end of s;
2) connect to s;
3) send the channel end ID for c;
4) send the call ID for s;
5) send each actual parameter ai;
6) receive each referenced actual a′i and set ai ← a′i;
7) disconnect from s.
Once the client has connected to the server, it sends the
identity of its channel end so that the server can make the
necessary corresponding responses to the above sequence.
By establishing a connection with the server, calls made by
other clients will block until the server becomes free. In this
sense, server calls are atomic.
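The call sequence above can be sketched in Python to show why holding the connection makes a call atomic. This is an illustrative model under assumptions, not the compiled form: a lock stands in for the connect and disconnect steps, queues stand in for channel ends, and the `INCREMENT` call, the `Server` class and the helpers `call_increment` and `demo` are hypothetical. Step 1 (setting the destination) is implicit in the client holding a reference to the server's inbox.

```python
import queue
import threading

INCREMENT = 0   # hypothetical call ID, unique within this server


class Server:
    def __init__(self):
        self.inbox = queue.Queue()           # the server's channel end
        self.connection = threading.Lock()   # held by the currently connected client
        self.total = 0

    def run(self, chanends, n_calls):
        for _ in range(n_calls):
            reply_to = chanends[self.inbox.get()]   # step 3: client's channel end ID
            call_id = self.inbox.get()              # step 4: call ID
            if call_id == INCREMENT:
                amount = self.inbox.get()           # step 5: actual parameter
                self.total += amount
                reply_to.put(self.total)            # step 6: result sent back


def call_increment(server, my_id, chanends, amount):
    with server.connection:                  # steps 2 and 7: connect, then disconnect
        server.inbox.put(my_id)
        server.inbox.put(INCREMENT)
        server.inbox.put(amount)
        return chanends[my_id].get()


def demo(n_clients=4, calls_each=50):
    chanends = {i: queue.Queue() for i in range(n_clients)}
    server = Server()
    server_thread = threading.Thread(
        target=server.run, args=(chanends, n_clients * calls_each))
    server_thread.start()
    seen = []

    def client(i):
        for _ in range(calls_each):
            seen.append(call_increment(server, i, chanends, 1))

    clients = [threading.Thread(target=client, args=(i,)) for i in range(n_clients)]
    for t in clients:
        t.start()
    for t in clients:
        t.join()
    server_thread.join()
    return server.total, sorted(seen)
```

Without the connection, the three messages of one client could interleave with those of another in the server's inbox; with it, each client observes a distinct running total.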
A key issue in the implementation of servers is to guaran-
tee that calls always complete. In a simple implementation,
there is potential for deadlock to occur. This is caused by a
situation where multiple clients are waiting for a busy server.
If to service a call the server must perform communication,
it might not be possible to establish a route in the network
due to waiting requests holding network resources. To avoid
this, a server must always be able to consume requests so
that a call is always guaranteed to complete. In practice,
the number of clients accessing any one server is likely
to be small and a small queue, with a size logarithmically
related to the number of clients, will probably suffice for
most programs. To avoid deadlock when the queue becomes
full, clients can reattempt to connect, at a rate according
to an exponential backoff, similar to the Ethernet protocol.
Alternatively, two separate physical or logically partitioned
networks could be used, one for server calls and the other
for general communication. This way, queued calls would
never interfere with any external communication a server
makes.
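The retry scheme can be sketched as a small discrete-time simulation in Python. This is an illustration under assumptions rather than a description of an implementation: time advances in slots, the server handles one request at a time, a rejected client waits a random number of slots drawn from a window that doubles with each failure (truncated binary exponential backoff, as in Ethernet), and the names `backoff_slots` and `simulate` and all parameter values are hypothetical.

```python
import random


def backoff_slots(failures, rng, max_exponent=10):
    """Random wait in slots after the given number of failed attempts (1-based)."""
    k = min(failures, max_exponent)
    return rng.randrange(2 ** k)             # uniform in 0 .. 2^k - 1


def simulate(n_clients, capacity, service_time, seed=0):
    """Return (slots elapsed, total rejected attempts) until all clients are served."""
    rng = random.Random(seed)
    next_try = {c: 0 for c in range(n_clients)}   # clients not yet accepted
    failures = {c: 0 for c in range(n_clients)}
    queued = 0          # requests held in the server's bounded queue
    remaining = 0       # slots left on the request currently being served
    served = 0
    t = 0
    while served < n_clients:
        due = sorted(c for c, when in next_try.items() if when <= t)
        for c in due:
            if queued < capacity:
                queued += 1                  # connection accepted
                del next_try[c]
            else:
                failures[c] += 1             # queue full: back off and retry later
                next_try[c] = t + 1 + backoff_slots(failures[c], rng)
        if remaining == 0 and queued > 0:
            queued -= 1                      # server takes the next request
            remaining = service_time
        if remaining > 0:
            remaining -= 1
            if remaining == 0:
                served += 1
        t += 1
    return t, sum(failures.values())
```

The queue never exceeds its capacity, so the server can always consume what it has accepted, and every client is eventually served because the queue drains while rejected clients wait.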
3) Process distribution: The processor allocation is
known for each process at compile-time. At run time, the
instruction sequence constituting a process must be available
at a processor that is scheduled to execute it. There are two
approaches that can be taken to this. With static distribution,
compilation would produce a set of p binary images for
a p processor system, with each binary containing all the
processes that will be executed by the given processor. This
requires each processor to have a large enough memory to
store every process that it will execute over the course of
a program, in addition to the memory requirements of each
process. For large p, the size of the binary package could
also be significant. With dynamic distribution, processes are
loaded onto processors at runtime, before they are executed.
Compilation produces two binaries, a master image that
contains the whole program and a slave image that waits to
receive processes to execute. The benefit of this is a smaller
per-processor memory requirement and binary package in-
dependent of the size of a system. Dynamic distribution can
be made efficient by employing recursion [32].
In addition to a component parallel process being avail-
able at a processor, execution on a remote processor also
requires the complete lexical environment, i.e. all of the
variables it uses that are external to its scope. This can be
determined at compile-time and message passing sequences
generated both to supply these variables and to receive any
updates to them when the process terminates.
V. EXAMPLES
This section presents three example programs to demon-
strate the proposed notations: matrix multiplication, a ray
tracer and a compiler. The choice of these is based on
general-purpose applications that require different styles of
parallelism.
A. Matrix multiplication
Matrix multiplication is widely used in scientific pro-
grams. It is inherently data-parallel and the most scalable
parallel formulations employ message passing structures.
Cannon’s algorithm [33] is a simple distributed algorithm
that is structured as a 2D grid.
For an n × n grid of processes, this can be expressed
as Process 5. It takes three arrays of sub-matrix servers
(a, b and c) as parameters that represent the input and
result matrices. The subroutine proceeds by creating a 2D
array of nodes with each node connected by channels in
four directions and assigned a single sub-matrix server.
The node process performs computations on the local
sub-matrices and sends and receives sub-matrices in each direction
according to the algorithm. This subroutine encapsulates the
algorithm, separating the message passing implementation
from the distributed representation of the matrices. The
layout of this is illustrated in Fig. 5b.
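For reference, the data movement of Cannon's algorithm can be sketched sequentially in Python. This is a simulation under simplifying assumptions, not the node process of Process 5: each grid position holds a single element rather than a sub-matrix, the channel communication is replaced by rebuilding the grids after each step, and the function name `cannon` is hypothetical.

```python
def cannon(A, B):
    """Multiply two n x n matrices using the shift pattern of Cannon's algorithm."""
    n = len(A)
    # Initial alignment: row i of A is rotated left by i, column j of B up by j.
    a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Every node multiplies the values it currently holds and accumulates.
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]
        # Every node passes its A value left and its B value up, with wrap-around.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c
```

After the alignment, node (i, j) holds A[i][k] and B[k][j] for k = (i + j) mod n, and each shift advances k by one, so over n steps every node accumulates the full inner product using only nearest-neighbour communication.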
A subroutine like this will most likely be employed as a
component of a more complex program, but even included
in a program that does nothing else, it requires additional
components for the initialisation of the input matrices and
a way to output the result. A simple way to do this is
to directly read or write values to the distributed matrices
in a global initialisation phase. Process 6, for example,
iterates over each sub-matrix and performs initialisation
directly. A similar process could be conducted to output
the result. A complete minimal program to perform matrix
multiplication could then be composed as Process 7 where
the three matrices are declared as server arrays with a
layered distribution. The client process sequentially loads
the input matrices, performs the multiplication and outputs
the result. Fig. 5 illustrates the distribution of processes and
communication patterns for the load and multiply phases of
the algorithm.
Process 5
proc multiply(
    server Matrix[n][n] a,
    server Matrix[n][n] b,
    server Matrix[n][n] c, val n) is
{ chan[n][n+1] h;
  chan[n][n+1] v;
  var x, y;
  par y = 0 for n do
    par x = 0 for n do
      node(a[x][y], b[x][y], c[x][y],
           v[x][y], v[x][(y+1) rem n],
           h[y][x], h[y][(x+1) rem n])
}
Process 6
proc loadMatrix(
    server Matrix[n][n] m, val n) is
{ var i, j;
  seq i=0 for n do
    seq j=0 for n do
      loadSubMatrix(m[i][j])
}
Process 7
server a is Matrix(M, M)[n][n] |
server b is Matrix(M, M)[n][n] |
server c is Matrix(M, M)[n][n] |
{ loadMatrix(a, n);
  loadMatrix(b, n);
  multiply(a, b, c, n);
  output(c, n)
}
Figure 5. Process distribution and communication structures for successive phases of the matrix multiply program: (a) load phase; (b) multiply phase. In both panels a 3 × 3 grid of node processes runs on processors pk+0 to pk+8, each holding one sub-matrix of a, b and c. Each phase employs different communication
structures; loading performs a sequence of calls to the server array and the multiplication algorithm performs only local server accesses, but with grid-based
message passing communication.
B. Ray tracing
Ray tracing is a technique for generating realistic 2D
images from 3D scenes. It is highly parallelisable as the
calculation of each pixel, based on intersecting a ray with
a world model, can be performed independently. When
the world model is small enough to fit into the memory
of a single processor, a parallel scheme requires only the
communication of work and results. When it is larger than
a single memory, it has to be distributed and accessible by
all processes calculating ray intersections.
A distributed world model has a simple form with the
same structure as the shared memory in Process 2. Work is
distributed in a task farm structure, by a master process
to a collection of worker processes. This is outlined in
Process 8 and illustrated in Fig. 6. Process 8 includes
separate initialisation and output phases, similar to the ones
described for the matrix multiply program (Process 6).
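The task farm structure can be sketched with threads and queues. This is an illustrative Python version under stated assumptions, not the paper's server-based formulation: farm and shade are hypothetical names, and shade is a stand-in that does not consult any world model.

```python
import queue
import threading


def farm(tasks, work, workers):
    """Task farm: a master hands out tasks; workers return (index, result)."""
    todo, done = queue.Queue(), queue.Queue()

    def worker():
        while True:
            item = todo.get()
            if item is None:  # sentinel: no more work
                return
            index, task = item
            done.put((index, work(task)))

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    # Master: distribute the work, then one sentinel per worker.
    for index, task in enumerate(tasks):
        todo.put((index, task))
    for _ in threads:
        todo.put(None)
    # Master: collect results, which may arrive in any order.
    results = [None] * len(tasks)
    for _ in tasks:
        index, value = done.get()
        results[index] = value
    for t in threads:
        t.join()
    return results


def shade(pixel):
    # Stand-in for tracing one ray; a real tracer would intersect the world model.
    x, y = pixel
    return (x + y) % 256


pixels = [(x, y) for y in range(4) for x in range(4)]
image = farm(pixels, shade, workers=3)
```

Tagging each task with its index lets the master reassemble the image regardless of the order in which workers finish.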
Process 8
server master is Master() &
server objs is ObjectStore()[n] &
{ server access is WorldAccess(objs)[m] |
  { var i;
    loadWorldModel(access);
    par i=0 for m do
      worker(master, access);
    output(master)
  }
}
Figure 6. Structure of the parallel ray tracer where a world model is
provided by an array of servers and accessed concurrently by a collection
of workers. These are delegated work by a single master process. The world
model is disjointedly distributed (ℓ ≥ k + n) but it could also be layered
with the workers (ℓ = k).
Each of the m workers can access the world model
(distributed over n servers) via a specific server and will
do so frequently during the computation. In addition to
optimising the implementation of shared memory, it is
necessary to reduce the number and latency of accesses to
obtain a scalable ray tracing algorithm [34]. To do this, each
access server can maintain a summary structure, usually a
bounding volume hierarchy (BVH), to minimise ray-object
intersection tests; it can also cache objects. With existing
parallel programming models, this functionality would be
implemented as part of the worker, but in Process 8 it is
encapsulated in the representation of the data, allowing a
simple world model interface to be presented to the workers.
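The idea of placing caching behind the data interface can be sketched as below. This Python version borrows the names ObjectStore and WorldAccess from Process 8, but the methods, the modulo placement rule and the read counter are assumptions made for illustration. It omits the BVH and assumes the world model is read-only once loaded, so cached copies never go stale.

```python
class ObjectStore:
    """One of n servers, each holding a disjoint part of the world model."""

    def __init__(self):
        self.objects = {}
        self.reads = 0  # counts remote reads, for illustration only

    def get(self, key):
        self.reads += 1
        return self.objects[key]


class WorldAccess:
    """Access server: routes requests to the owning store and caches replies."""

    def __init__(self, stores):
        self.stores = stores
        self.cache = {}

    def owner(self, key):
        # Assumed placement rule: objects are spread round-robin by key.
        return self.stores[key % len(self.stores)]

    def put(self, key, obj):
        # Used only during the load phase, before any worker reads.
        self.owner(key).objects[key] = obj

    def get(self, key):
        if key not in self.cache:
            self.cache[key] = self.owner(key).get(key)
        return self.cache[key]


stores = [ObjectStore() for _ in range(3)]
access = WorldAccess(stores)
for key in range(6):
    access.put(key, f"sphere-{key}")
first = access.get(4)
second = access.get(4)  # served from the cache, no second remote read
```

A worker sees only get, so the caching policy can change without touching worker code.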
C. Compiler
Compilers are complex programs that employ many
different algorithmic techniques and data structures. This
makes them a canonical example of a general-purpose piece
of software and a non-trivial test case for mapping realistic
sequential applications to a parallel architecture. Because of
this complexity, there has been little work on parallel compilation,
although there are opportunities for it, especially during the
optimisation and code generation phases [35]. In particular,
many optimisations can be applied locally at an expression,
statement, block or procedure level, and hence may be
performed independently and in parallel over different parts
of a parse tree or intermediate representation.
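As a small illustration of a local optimisation applied independently to different parts of a program, the Python sketch below constant-folds each procedure body in its own task. The tuple-based tree format and the names fold and optimise are assumptions for this example; Python threads are used only to show the independence of the tasks, not to claim a speedup.

```python
import operator
from concurrent.futures import ThreadPoolExecutor

OPS = {"+": operator.add, "*": operator.mul}


def fold(expr):
    """Constant-fold a tree of ints, variable names and (op, left, right)."""
    if not isinstance(expr, tuple):
        return expr
    op, left, right = expr
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)
    return (op, left, right)


def optimise(procedures, workers=4):
    """Fold each procedure body independently; no task reads another's tree."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        bodies = list(pool.map(fold, procedures.values()))
    return dict(zip(procedures.keys(), bodies))


program = {
    "area": ("*", ("+", 2, 3), "width"),
    "limit": ("+", ("*", 4, 5), 1),
}
optimised = optimise(program)
```

Because folding one procedure never reads another procedure's tree, the tasks need no coordination beyond collecting their results.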
The structure of a simple compiler is given in Process 9.
Process 9
server store is TreeStore()[n] &
server tree is TreeAccess(store)[m] &
server symbols is Table() &
{ parse(tree[0], symbols);
  semantics(tree[0], symbols, m);
  optimise(tree, symbols);
  { server store is BufStore()[l] |
    server buffer is BufAccess(store) |
    generateInsts(tree[0], buffer);
  }
}
Two server arrays store and tree provide a concurrently
accessible parse tree, using the same principle as Process 2.
Initially, parsing and semantic analysis phases operate se-
quentially on the parse tree, using a single access server.
Local optimisations, as part of the optimise subroutine,
can be performed in parallel on the parse tree and this will
also require concurrent access to the symbol table. Finally,
instructions are output sequentially to a distributed buffer.
This buffer is declared in a separate scope to demonstrate
that it could be included as part of the generateInsts
subroutine.
VI. CONCLUSION
This paper proposes a simple programming model for
expressing scalable parallel programs. A server construct
can be used in combination with notations for expressing
local and distributed parallelism to build abstractions for
distributed data structures with both shared and distributed
access structures. This gives the programmer the flexibility
to move between shared and distributed forms of data
parallelism, depending on the structure of the program and
scalability requirements. Server-based data structures can
be composed with other program components in a similar
way to conventional variable declarations and have similar
scoping rules. This allows them to be operated on by
sequences of potentially parallel subroutines, simplifying the
task of developing a complex parallel program.
The distribution model allows a compile-time allocation
of processing resources, to produce a static schedule. This
provides efficient runtime performance and predictable tim-
ing, which are essential for building programs that scale
to large numbers of cores. The compilation scheme requires
support from the architecture, in particular bounded low-latency
communication, both for the distribution model and general
patterns of communication between program components and servers,
and for message passing structures such as pipelines, grids and trees.
The example programs demonstrate how the proposed
notations can be used to compose computational components
that require varied forms of parallelism with distributed data
structures, in a clear and concise way.
ACKNOWLEDGEMENT
This work was funded by EPSRC grant SB1933.
REFERENCES
[1] A. M. Turing, “Proposals for development in the mathematics
division of an automatic computing engine (ACE). Report
E882, Executive Committee, NPL,” February 1946, reprinted
April 1972 as NPL Report Com. Sci 57.
[2] O. J. Dahl, E. W. Dijkstra, and C. A. R. Hoare, Eds.,
Structured programming. London, UK: Academic Press
Ltd., 1972.
[3] S. Fortune and J. Wyllie, “Parallelism in random access ma-
chines,” in Proceedings of the tenth annual ACM symposium
on Theory of computing, ser. STOC ’78. New York, NY,
USA: ACM, 1978, pp. 114–118.
[4] L. G. Valiant, “A bridging model for parallel computation,”
Communications of the ACM, vol. 33, no. 8, pp. 103–111,
1990.
[5] P. B. Hansen, “The origin of concurrent programming,” P. B.
Hansen, Ed. New York, NY, USA: Springer-Verlag New
York, Inc., 2002, ch. Design principles, pp. 382–393.
[6] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson,
K. H. Randall, and Y. Zhou, “Cilk: an efficient multithreaded
runtime system,” SIGPLAN Not., vol. 30, no. 8, pp. 207–216,
1995.
[7] OpenMP Architecture Review Board, “OpenMP
application program interface version 3.1,” 2012,
http://www.openmp.org/mp-documents/OpenMP3.1.pdf.
[8] Intel, “Intel threading building blocks reference manual,”
http://threadingbuildingblocks.org/documentation.php, 2012.
[9] The Khronos group, “OpenCL - The open standard
for parallel programming of heterogeneous systems,”
http://www.khronos.org/opencl/.
[10] NVIDIA, “CUDA Zone,” http://www.nvidia.com/object/cuda_home_new.html.
[11] B. R. Gaster and L. Howes, “Can GPGPU programming be
liberated from the data-parallel bottleneck?” Computer, IEEE,
vol. 45, no. 8, pp. 42–52, August 2012.
[12] S. H. Fuller and L. I. Miller, Eds., The Future of Computing
Performance: Game Over or Next Level? National
Academies Press, 2011.
[13] D. May, “Occam,” SIGPLAN Not., vol. 18, no. 4, pp. 69–79,
1983.
[14] MPI: A message-passing interface standard, Message Passing
Interface Forum, September 2009.
[15] A. Skjellum, N. Doss, and P. Bangalore, “Writing libraries
in MPI,” in Proceedings of the Scalable Parallel Libraries
Conference, October 1993, pp. 166–173.
[16] W. W. Carlson, J. M. Draper, D. E. Culler, K. Yelick,
E. Brooks, and K. Warren, “Introduction to UPC and Lan-
guage Specification,” The George Washington University,
Tech. Rep. CCS-TR-99-157, May 1999.
[17] B. Chamberlain, D. Callahan, and H. Zima, “Parallel pro-
grammability and the Chapel language,” Int. J. High Perform.
Comput. Appl., vol. 21, no. 3, pp. 291–312, 2007.
[18] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra,
K. Ebcioglu, C. von Praun, and V. Sarkar, “X10: an object-
oriented approach to non-uniform cluster computing,” in
OOPSLA ’05: Proceedings of the 20th annual ACM SIG-
PLAN conference on Object-oriented programming, systems,
languages, and applications. New York, NY, USA: ACM,
2005, pp. 519–538.
[19] L. Kale, B. Ramkumar, A. B. Sinha, and A. Gürsoy, “The
Charm parallel programming language and system: Part I -
Description of language features,” University of Illinois, Tech.
Rep., 1994.
[20] D. Watt, Programming XC on XMOS Devices, September
2009, http://www.xmos.com/support/documentation.
[21] G. Barrett, occam 3 reference manual, INMOS Ltd., March
1992.
[22] C. E. Leiserson, “Fat-trees: universal networks for hardware-
efficient supercomputing,” IEEE Trans. Comput., vol. 34,
no. 10, pp. 892–901, October 1985.
[23] L. G. Valiant, “General purpose parallel architectures,” in
Handbook of theoretical computer science (vol. A): algo-
rithms and complexity. Cambridge, MA, USA: MIT Press,
1990, pp. 943–973.
[24] S. A. Cook and R. A. Reckhow, “Time-bounded random
access machines,” in Proceedings of the fourth annual ACM
symposium on Theory of computing, ser. STOC ’72. New
York, NY, USA: ACM, 1972, pp. 73–80.
[25] A. D. Birrell and B. J. Nelson, “Implementing remote pro-
cedure calls,” ACM Trans. Comput. Syst., vol. 2, no. 1, pp.
39–59, 1984.
[26] A. G. Ranade, “How to emulate shared memory,” in Foun-
dations of Computer Science, 1987., 28th Annual Symposium
on, October 1987, pp. 185–194.
[27] P. B. Hansen, Model programs for computational science:
parallel programming paradigms. John Wiley and Sons,
Ltd, 1993.
[28] INMOS Ltd., Transputer Databook, INMOS Ltd., 1988, first
Edition.
[29] D. May, The XMOS XS1 Architecture, October 2009.
[30] G. Almasi, C. Archer, J. G. Castanos, J. A. Gunnels, C. C.
Erway, P. Heidelberger, X. Martorell, J. E. Moreira, K. Pin-
now, J. Ratterman, B. D. Steinmacher-Burow, W. Gropp, and
B. Toonen, “Design and implementation of message-passing
services for the Blue Gene/L supercomputer,” IBM Journal
of Research and Development, vol. 49, no. 2.3, pp. 393–406,
March 2005.
[31] D. Chen, N. Eisley, P. Heidelberger, R. Senger, Y. Sugawara,
S. Kumar, V. Salapura, D. Satterfield, B. Steinmacher-Burow,
and J. Parker, “The IBM Blue Gene/Q interconnection net-
work and message unit,” in High Performance Computing,
Networking, Storage and Analysis (SC), 2011 International
Conference for, November 2011.
[32] J. Hanlon and S. Hollis, “Fast distributed process creation
with the XMOS XS1 architecture,” in Communicating process
architectures 2011, ser. WoTUG, vol. 33. IOS Press, June
2011, pp. 195–207.
[33] L. E. Cannon, “A cellular computer to implement the Kalman
Filter algorithm,” Ph.D. dissertation, Bozeman, MT, USA,
1969.
[34] E. Reinhard and F. W. Jansen, “Rendering large scenes using
parallel ray tracing,” Parallel Comput., vol. 23, no. 7, pp.
873–885, July 1997.
[35] T. Gross, A. Sobel, and M. Zolg, “Parallel compilation for
a parallel machine,” in Proceedings of the ACM SIGPLAN
1989 Conference on Programming language design and im-
plementation, ser. PLDI ’89. New York, NY, USA: ACM,
1989, pp. 91–100.
