Fast Distributed Process Creation with the XMOS XS1 Architecture by Hanlon, James & Hollis, Simon J.
Communicating Process Architectures 2011
P.H. Welch et al. (Eds.)
IOS Press, 2011
© 2011 The authors and IOS Press. All rights reserved.
Fast Distributed Process Creation with the
XMOS XS1 Architecture
James HANLON and Simon J. HOLLIS
Department of Computer Science, University of Bristol, UK.
{hanlon, hollis}@cs.bris.ac.uk
Abstract. The provision of mechanisms for processor allocation in current distributed
parallel programming models is very limited. This makes difficult, or even prohibits,
the expression of a large class of programs which require a run-time assessment of
their required resources. This includes programs whose structure is irregular, compos-
ite or unbounded. Efficient allocation of processors requires a process creation mech-
anism able to initiate and terminate remote computations quickly. This paper presents
the design, demonstration and analysis of an explicit mechanism to do this, imple-
mented on the XMOS XS1 architecture, as a foundation for a more dynamic scheme.
It shows that process creation can be made efficient so that it incurs only a fractional
overhead of the total runtime and that it can be combined naturally with recursion to
enable rapid distribution of computations over a system.
Keywords. distributed process creation, distributed runtime, dynamic task placement,
parallel recursion
Introduction
An essential issue in the design of scalable, distributed parallel computers is the rate at which
computations can be initiated, and results collected as they terminate [1]. This requires an
efficient method of process creation capable of dispatching a program and data on which to
operate to a remote processor. This paper presents the design, implementation, demonstration
and evaluation of a process creation mechanism for the XMOS XS1 architecture [2].
Parallelism is being employed on an increasingly large scale to improve performance
of computer systems, particularly in high performance systems, but increasingly in other ar-
eas such as embedded computing [3]. As current programming models such as MPI (Mes-
sage Passing Interface) provide limited support for automated management of processing re-
sources, the burden of doing this mainly falls on the programmer. These issues are not rel-
evant to the expression of a program as, in general, a programmer is concerned only with
introducing parallelism (execution on multiple processors) to improve performance, and not
how the computation is scheduled on the underlying system. When we consider that future
high performance systems will run on the order of $10^9$ threads [4], it is clear that the
programming model must provide some means of dynamic processor allocation to remove this
burden. This is the situation we have with memory in sequential systems, where allocation
and deallocation are performed with varying degrees of automation.
This observation is not new [5,6], but it is only as existing programming models and
software struggle to meet the increasing scale of parallelism that the problem is again coming
to light. For instance, capabilities for process creation and management were introduced in
the MPI-2.0 specification, stating that: “Reasons for including process management in MPI
are both technical and practical. Important classes of message-passing applications require
this control. These include task farms, serial applications with parallel modules and prob-
lems that require a run-time assessment of the number and type of processes that should be
started” [7]. Several MPI implementations support process creation and management func-
tionality, but it is pitched as an ‘advanced’ feature that is difficult to use and problematic
with many current job-scheduling systems. More encouragingly, language-level abstractions
for dynamic process creation and placement have appeared recently in Chapel [8] and
X10 [9], which are being developed by Cray and IBM respectively as part of DARPA’s High
Productivity Computing Systems program. Both support these concepts as key ingredients in
the design of parallel programs, but they are built on software communication libraries and
statically-mapped program binaries. Consequently, they are subject to the same communica-
tion inefficiencies and inflexibility of single-program approaches.
A run-time assessment of required processing resources concerns a large class of programs
whose structure is irregular, such as unstructured-grid algorithms like the Spectral Element
Method [10], unbounded such as recursively-structured algorithms like Branch-and-Bound
search [11] and Adaptive Mesh Refinement [12], or composite, where a program may be
composed of different parallel subroutines that are themselves executed in parallel, possibly
each with its own structure. These all require a means of dynamic processor allocation that
is able to distribute computations over a set of processors, depending on requirements de-
termined at runtime. The combination of parallelism and recursion is a powerful mechanism
for growth which can be used to implement distribution efficiently. This must be supported
with a mechanism for process creation with the ability to dispatch, initiate and terminate
computations efficiently on remote processors.
This paper presents the design and implementation of an explicit scheme for dynamic
process creation in a distributed memory parallel computer. This work is intended to be a key
building block for a more automatic scheme. The implementation is on the XMOS XS1
architecture, which has low-level provisions for concurrency, allowing a convincing proof-
of-concept implementation. Based on this, the process creation mechanism is evaluated by
combining it with controlled recursion in two simple algorithms to demonstrate the rate and
granularity at which it is possible to create remote computations. Performance models are
developed in each case to interpret the measured results and to make predictions for larger
systems and workloads. This analysis highlights the efficiency, scalability and effectiveness
of the concept and approach taken.
The rest of this paper is structured as follows. Section 1 describes the XS1 architec-
ture, the experimental platform and the notations and conventions used. Section 2 gives a
brief overview of the design and implementation details. Section 3 presents the performance
models and experimental and predicted results. Finally, Section 4 concludes and Section 5
discusses possible future extensions to the work.
1. Background
1.1. Platform
The XMOS XS1 processor architecture [2] is general-purpose, multi-threaded, scalable and
has been designed from the ground up to support concurrency. It allows systems to be con-
structed from multiple XCore processors which communicate with each other through fast
communication links. The key novel aspect of this architecture with respect to the work in
this paper is the instruction set support for processes and communication. Low-level thread-
ing and communication are key features, exposed with operations, for example, to provide
synchronous and asynchronous fork-join thread-level parallelism and channel-based message
passing communication. Provision of these features in hardware allows them to be performed
in the same order of magnitude of time as memory references, branches and arithmetic. This
allows efficient high-level notations for concurrency to be effectively built.
The system used to demonstrate and evaluate the proposed process creation mechanism
is an experimental board called the XK-XMP-64 [13]. It connects together 64 XCore pro-
cessors in 16 XS1-G4 devices which run at 400MHz. The G4 devices are interconnected
in a 4-dimensional hypercube which equivalently can be viewed as a 2-dimensional torus.
Mathematically, this is defined in the following way [14]:
Definition 1. A d-dimensional hypercube is a graph G = (N,E) where N is the set of $2^d$
nodes and E is the set of edges. Each node is labeled with a d-bit identifier. For any m,n ∈ N,
an edge exists between m and n if and only if
$$m \oplus n = 2^k$$
for $0 \le k < d$, where ⊕ is the bitwise exclusive-or operator. Hence, each node has $d = \log |N|$
edges and $|E| = d\,2^{d-1}$.
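As an illustration of Definition 1, the following minimal Python sketch enumerates the neighbours of a node by flipping each of the d bits of its identifier, and counts the edges by enumeration. The function names are hypothetical and chosen only for this example.

```python
def hypercube_neighbours(node: int, d: int) -> list[int]:
    """Return the d neighbours of `node` in a d-dimensional hypercube."""
    if not 0 <= node < (1 << d):
        raise ValueError("node identifier must fit in d bits")
    # Flipping bit k gives the unique neighbour m with node XOR m = 2**k.
    return [node ^ (1 << k) for k in range(d)]


def edge_count(d: int) -> int:
    """Count undirected edges by enumeration; expected to equal d * 2**(d-1)."""
    edges = {frozenset((n, m))
             for n in range(1 << d)
             for m in hypercube_neighbours(n, d)}
    return len(edges)


if __name__ == "__main__":
    print(hypercube_neighbours(0b0101, 4))   # neighbours of node 5 in a 4-cube
    print(edge_count(4), 4 * 2 ** 3)         # enumerated vs. closed form
```

For the 6-dimensional view of the 64-core system, the same enumeration gives 192 edges, in agreement with the closed form.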
Each core in the G4 package has a private 64kB memory and is interconnected via inter-
nal links to an integrated switch. It is convenient to view the whole system as a 6-dimensional
hypercube. As each core can run 8 hardware threads, the system is capable of 512-way con-
currency with an aggregate 25.6 GIPS performance.
1.2. Notation
For presentation of the algorithms in this paper, a simple imperative, block-structured no-
tation is used. The following points describe the non-standard elements that appear in the
examples.
1.2.1. Sequential and Parallel Composition
A set of instructions that are to be executed in sequence are composed with the ‘;’ separator.
A sequence of instructions comprises a process. For example, the block
{ I1 ; I2 ; I3 }
defines a simple process to perform three instructions, I1, I2 and I3 in sequence. Processes
may be executed in parallel by composition within a block with the ‘|’ separator. Execution
of a parallel block initiates the execution of the constituent processes simultaneously. The
parallel block successfully terminates only when all processes have successfully terminated.
This is referred to as synchronous fork-join parallelism. For example, the block declaration
{ P1 | P2 | P3 }
denotes the parallel execution of three processes P1, P2 and P3.
1.2.2. Aliasing
The aliases statement is used to create new references to sub-sections of an array. For exam-
ple, the statement
A aliases B[i . . . j]
sets A to refer to the sub-section of B in the index range i to j.
1.2.3. Process Creation
The on statement reveals explicitly to the programmer the process creation mechanism. The
statement
on p do P
is semantically equivalent to executing a call to P, except that process P is transmitted to pro-
cessor p, which then executes P and communicates back any results using channels, leaving
the original processor free to perform other tasks. By composing on in parallel, we can exploit
multi-threaded parallelism to offload work while executing another process. For example, the
statement
{ P1 | on p do P2 }
causes P1 to be executed while P2 is offloaded and executed on processor p.
1.3. Measurements
All timing measurements presented were made with hardware timers, which are accessible
through the ISA and have 10ns resolution. The values of constants were obtained by fitting
the performance models to the measurements taken.
1.4. Conventions
All logarithms are to the base 2. p is defined as the number of processors and is taken to be a
positive power of two. A word is taken to be 4 bytes and is a unit of input in the performance
models.
2. Implementation
The on statement causes the closure of a process P located at a guest processor to be sent to
a remote host processor, the host to execute P and to send back any updated free variables
of P stored at the guest. The execution of on is synchronous in this respect. The closure of
a process P is a complete description of P allowing it to be executed independently and is
defined in the following way:
Definition 2. The closure C of a process P consists of three elements: a set of arguments A,
which represents the complete variable context of P as we do not consider global variables, a
set of procedure indices I and a set of procedures Q:
C(P) = (A, I, Q)
where |A| ≥ 0 and |I| = |Q| ≥ 1. Each argument a ∈ A is an ordered sequence of one or more
integer values. Each process P ∈ Q is an ordered sequence of one or more instructions. $I_P$ is
an integer value denoting the index of procedure P.
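To make Definition 2 concrete, the closure can be pictured as a small record type. The sketch below is an assumption made for illustration only: the class name, field names and the use of Python byte strings for instructions are hypothetical and do not describe the authors' memory layout.

```python
from dataclasses import dataclass


@dataclass
class Closure:
    """Hypothetical container for C(P) = (A, I, Q); not the authors' layout."""
    args: list[list[int]]   # A: each argument is a sequence of integer words
    indices: list[int]      # I: jump-table index of each procedure
    procs: list[bytes]      # Q: instruction bytes of each procedure

    def __post_init__(self) -> None:
        # Definition 2 requires |I| = |Q| >= 1.
        if len(self.indices) != len(self.procs) or not self.procs:
            raise ValueError("need |I| = |Q| >= 1")
        # Arguments and procedures are sequences of one or more elements.
        if any(len(a) == 0 for a in self.args):
            raise ValueError("each argument needs at least one value")
        if any(len(q) == 0 for q in self.procs):
            raise ValueError("each procedure needs at least one instruction")

    def size_in_words(self, word_bytes: int = 4) -> tuple[int, int]:
        """Return (argument words, procedure words), rounding procedures up."""
        n = sum(len(a) for a in self.args)
        m = sum(-(-len(q) // word_bytes) for q in self.procs)
        return n, m
```

The two sizes returned by `size_in_words` correspond to the quantities that a cost model for transmitting the closure would need.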
Each core maintains a fixed-size jump table denoted ‘jump’, which records the location
of each procedure in memory. Although the address of a procedure may differ between cores,
its index is guaranteed to be consistent. This allows relative branches to be expressed in terms of
an index which is locally referenced at execution. Each node in the system is initialised with
a minimal binary containing the process creation kernel. The complete program is loaded on
node 0, from where parts of it can be copied onto other nodes to be executed.
2.1. Protocol
The process creation mechanism is implemented as a point-to-point protocol between a guest
core and a host core. Any running thread is able to spawn the execution of a process on any
other core. It consists of the following four phases.
2.1.1. Connection Initialisation
A guest initiates a connection by sending a single byte control token and a word identifying it-
self. It waits for an acknowledgment from the host indicating a host thread has been allocated
and the connection is properly established. A core may host multiple guest computations,
each on a different thread.
2.1.2. Transmission of Closure
C(P) is transmitted in three parts. Firstly, a header is sent containing |A| and |Q|. Secondly,
each a ∈ A is sent with a single word header denoting the type of the argument. For referenced
arrays, this is followed by length(a) and the values contained. The host writes these directly
into heap-allocated space and the argument value is set to this address. Single-value variables
are treated similarly and constant values can be copied directly into the argument value.
Lastly, each P ∈ Q is sent with a two word header denoting $I_P$ and length(P) in bytes. The
host allocates space on the heap and receives the instructions of P from the guest, read from
memory in word-chunks from jump[$I_P$] to jump[$I_P$] + length(P). On completion, the host sets
jump[$I_P$] to the address of P on the heap.
2.1.3. Execution/Wait for Completion
Once C has been successfully transmitted, the host initialises the thread’s registers and stack
with the arguments of P and initiates execution. The connection is left open and the guest
thread waits for the host to indicate P has halted.
2.1.4. Transmission of Results and Teardown
Once P has halted, all referenced array and variable arguments contained in C (now the
results) are transmitted back to the guest. The guest writes them back directly to their original
locations. Once this has been completed, the connection is terminated. The guest continues
execution and the host thread frees the memory allocated to the closure and yields.
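The four phases above can be sketched as a toy, single-process model. This is only an illustration under strong assumptions: procedures are Python callables rather than instruction bytes, the connection set-up is reduced to a method call, and the names `Host`, `serve` and `on` are hypothetical.

```python
from typing import Callable

Proc = Callable[..., None]


class Host:
    """Toy stand-in for a host core; names here are illustrative only."""

    def __init__(self) -> None:
        self.jump: dict[int, Proc] = {}   # jump table: index -> local procedure

    def serve(self, args: list[list[int]], procs: dict[int, Proc],
              entry: int) -> list[list[int]]:
        # Phase 2: receive arguments into host-local storage (copies), then
        # install each procedure under its index in the local jump table.
        local_args = [list(a) for a in args]
        self.jump.update(procs)
        # Phase 3: execute the entry procedure on the local copies.
        self.jump[entry](self.jump, *local_args)
        # Phase 4: return the (possibly updated) arguments as results.
        return local_args


def on(host: Host, args: list[list[int]], procs: dict[int, Proc],
       entry: int) -> None:
    """Guest side: synchronous remote call with write-back of results."""
    # Phase 1 (connection set-up) is reduced to the method call itself.
    results = host.serve(args, procs, entry)
    for original, updated in zip(args, results):
        original[:] = updated             # write back to the original location


def double_all(jump: dict[int, Proc], xs: list[int]) -> None:
    """Example procedure: doubles every element of its array argument."""
    for i in range(len(xs)):
        xs[i] *= 2


if __name__ == "__main__":
    data = [1, 2, 3]
    on(Host(), [data], {0: double_all}, entry=0)
    print(data)
```

The guest's array is only modified by the final write-back, which mirrors the way results are copied to their original locations once the remote process has halted.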
2.2. Performance Model
The runtime cost of this mechanism is captured in the following way:
Definition 3. The runtime of process creation $T_c$ is a function of the total size of the argument
values n, procedure descriptions m and the results o and is given by
$$T_c(n,m,o) = (C_i + C_w n + C_w m + C_w o) \cdot C_l$$
where $C_i$ and $C_w$ are constants relating to initialisation and termination, and overhead per
(word) value transmitted respectively. The value n is inclusive of the size of referenced arrays
and hence o ≤ n. As all communication is synchronised, $C_l$ is a constant factor overhead
relating to the latency of the path between the guest and host processors.
Normalising $C_l = 1$ to a single hop off-chip, the per-word overhead $C_w$ was measured as
150ns. The initialisation overhead $C_i$ is dependent on the size of the closure.
proc distribute (t, n) is
  if n = 1 then node (t)
  else
  { distribute (t, n/2)
  | on t+n/2 do distribute (t+n/2, n/2) }
Figure 1. A recursive process distribute to rapidly distribute another process node over a set of processors.
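The cost model of Definition 3 is straightforward to evaluate. The sketch below assumes the per-word overhead of 150ns quoted above; the function name is hypothetical and the initialisation overhead must be supplied by the caller, since it depends on the closure.

```python
def process_creation_time(n: int, m: int, o: int, *, c_i: float,
                          c_w: float = 150e-9, c_l: float = 1.0) -> float:
    """T_c(n, m, o) = (C_i + C_w*n + C_w*m + C_w*o) * C_l, in seconds.

    n, m and o are sizes in words; c_i is the closure-dependent
    initialisation overhead and c_l the normalised path latency factor.
    """
    if o > n:
        raise ValueError("results are part of the arguments, so o <= n")
    return (c_i + c_w * (n + m + o)) * c_l


if __name__ == "__main__":
    # Hypothetical closure: 16 argument words, 100 procedure words, 16 results.
    print(process_creation_time(16, 100, 16, c_i=10e-6))
```

The value of `c_i` used in the example is made up for illustration and is not a measurement from this work.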
3. Demonstration and Evaluation
The aim of this section is to demonstrate the use of process creation combined with paral-
lel recursion to evaluate the performance of the design and its implementation in realising
efficient growth. To do this, we develop performance models to combine with experimental
results, allowing us to extrapolate to larger systems and inputs. We start with a simple algo-
rithm to demonstrate the fast distribution of parallel computations and then show how this
can be applied to a practical problem.
3.1. Rapid Process Distribution
The algorithm distribute given in Figure 1 is inspired by [1] and works by spawning a new
copy of itself on a remote processor each time it recurses. Each process then itself recurses,
continuing this behaviour and hence, each level of the recursion subdivides the set of pro-
cessors in half, resulting in a doubling of the capacity to initiate computations. This growth
follows the structure of a binary tree. When each instance of distribute executes with n = 1,
the node process is executed and the recursion halted. The parameter t indicates the node
identifier and the algorithm is executed from node 0 with t = 0 and n = p.
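The behaviour of distribute can be mimicked on a single machine. In the minimal sketch below, a Python thread stands in for the on statement and appending to a shared list stands in for running node; both substitutions, and the helper name `populate`, are assumptions made only for illustration.

```python
import threading


def distribute(t: int, n: int, visited: list[int], lock: threading.Lock) -> None:
    """Simulate the recursion; a thread stands in for `on t+n/2 do ...`."""
    if n == 1:
        with lock:
            visited.append(t)            # stands in for running node(t)
        return
    half = n // 2
    remote = threading.Thread(target=distribute,
                              args=(t + half, half, visited, lock))
    remote.start()                       # the "remote" branch
    distribute(t, half, visited, lock)   # the local branch runs concurrently
    remote.join()                        # synchronous fork-join


def populate(p: int) -> list[int]:
    """Run distribute from node 0 over p nodes and return the nodes reached."""
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a positive power of two")
    visited: list[int] = []
    distribute(0, p, visited, threading.Lock())
    return sorted(visited)


if __name__ == "__main__":
    print(populate(8))
```

Every node identifier from 0 to p − 1 is reached exactly once, after log p levels of recursion.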
3.1.1. Runtime
The hypercube interconnection topology of the XK-XMP-64 provides an optimal transport in
terms of hop distance between remote creations; this is established by the following theorem.
Theorem 1. Every copy of distribute is always created on a neighbouring node when executed
on a hypercube.
Proof. Let H = (N,E) be a d-dimensional hypercube. When distribute is executed with t = 0
and n = |N|, starting at node 0 on H, the recursion follows the structure of a binary tree of
depth $d = \log |N|$, where identifiers at level i are multiples of $|N|/2^i$. A node p at depth i with
identifier $k|N|/2^i$ creates a new remote child node c with identifier $k|N|/2^i + |N|/2^{i+1}$. As
$|N| = 2^d$, $p = k2^{d-i}$ and $c = k2^{d-i} + 2^{d-i-1}$ and hence, $p \oplus c = 2^{d-i-1}$.
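Theorem 1 can also be checked exhaustively for small hypercubes. The following sketch, with hypothetical function names, lists every parent and child pair produced by the recursion and tests that their identifiers differ in exactly one bit.

```python
from typing import Iterator


def creations(t: int, n: int, depth: int = 0) -> Iterator[tuple[int, int, int]]:
    """Yield (parent, child, depth) for every remote creation in distribute(t, n)."""
    if n > 1:
        half = n // 2
        yield t, t + half, depth
        yield from creations(t, half, depth + 1)
        yield from creations(t + half, half, depth + 1)


def is_power_of_two(x: int) -> bool:
    return x > 0 and x & (x - 1) == 0


def check_theorem(d: int) -> bool:
    """True if every parent/child pair differs in exactly one bit."""
    return all(is_power_of_two(p ^ c) for p, c, _ in creations(0, 1 << d))


if __name__ == "__main__":
    print(check_theorem(6))
```

For d = 6 the recursion performs 63 remote creations, one for each node other than node 0.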
Figure 2. Measured execution time of distribute over varying numbers of processors. (a) Measured (Td) vs.
predicted (?Td) execution time in µs against p. (b) Execution time in µs for each level of recursion of
distribute. Panel (b) clearly shows the inter- vs. intra-chip latencies.
Given that m and n are fixed, that o = 0 (there are no results) and from Theorem 1 we
can normalise $C_l$ to 1, the runtime $T_c(n,m,o)$ of the on statement in distribute is Θ(1), which
we define as the initialisation overhead $C_j$. Using this, we can express the parallel runtime
of distribute $T_d$ on p processors. In each step, the number of active processes doubles, but we
count the runtime at each level of recursion, which terminates when $n/2^i = 1$, that is, when
$i = \log n$. Hence,
$$T_d(p) = \sum_{i=1}^{\log p} (T_c + C_o) = (C_j + C_o) \log p \qquad (1)$$
where $C_o$ is the sequential overhead at each level. $C_j$ was measured as 18.4µs and $C_o$ was
measured as 60ns.
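Equation 1 can be evaluated directly from the two constants reported above. The sketch below is a plain transcription of the formula; the function and constant names are hypothetical.

```python
import math

C_J = 18.4e-6   # initialisation overhead reported in the text (seconds)
C_O = 60e-9     # sequential overhead per level reported in the text (seconds)


def distribute_time(p: int) -> float:
    """T_d(p) = (C_j + C_o) * log2(p) for p a positive power of two."""
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a positive power of two")
    return (C_J + C_O) * math.log2(p)


if __name__ == "__main__":
    for p in (64, 1024):
        print(p, distribute_time(p) * 1e6, "microseconds")
```

With these two constants the formula evaluates to roughly 110.8µs for p = 64 and 184.6µs for p = 1024.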
3.1.2. Results
Figure 2a gives the predicted and measured execution time of distribute as a function of the
number of processors. The measurements almost exactly match the prediction given by Equation 1.
Figure 2b shows the discrepancy between the measured and predicted results more
clearly, by giving the measured execution time for each level in the recursion, that is, the
difference between consecutive points in Figure 2a. It shows that the assumption made based
on Theorem 1 does not hold and that the first two levels take fractionally less time than the
last four levels (3.85µs). This is due to the reduced on-chip communication costs. Overall
though, each level of recursion completes on average in 18.9µs and it takes only 114.60µs
to populate all 64 processors. Moreover, using the performance model given by Td , we can
extrapolate to larger p than is possible to measure with the current platform. For example,
when p = 1024, Td(1024) = 190µs.
3.1.3. Remarks
By using the performance model to make predictions, we have assumed a hypercube topology
and efficient support for concurrency. Although other architectures and larger systems cannot
make such provisions, the model and results provide a reasonable lower bound on execution
time with respect to the approach described.
The hypercube has rich communication properties and supports exponential growth, but
it does not scale well due to the number of connections at each node and length of wires in
realistic packagings. Although distribute has optimal single-hop behaviour and we obtain peak
performance, it is well known that efficient embeddings of binary trees into lower-degree
networks such as meshes and tori exist [14], allowing reasonable dispersion. In this case,
the granularity of process creation would have to be chosen to match the capabilities of the
architecture.
Provision of efficient ISA-level operations for processes and communications allows
fine-grained performance, particularly in terms of short messages. Many current architectures
do not support these operations at such a low level and cannot exploit the full potential of
this approach, although again it generalises at a coarser granularity of message size to match
the relative performance of these operations.
3.2. Mergesort
Mergesort is a well known sorting algorithm [15] that works by recursively halving a list
of unsorted numbers until unit sub-lists are obtained. These are then successively merged
together such that each merging step produces a sorted sub-list, which can be performed
in time Θ(n) for sub-lists of size n/2. Figure 3a gives the sequential mergesort algorithm
seq-msort.
Mergesort’s branching recursive structure matches that of distribute, allowing us to combine
them to obtain a parallel version. Instead of sequentially evaluating the recursive calls,
conditional on some threshold value Cth, a local recursive call is made in parallel with the
second call which is migrated to a remote core. This threshold is used to control the extent to
which the computation is distributed. In each of the experiments for an input of size $2^k$ and
available processors $p = 2^d$, the threshold is set as $2^k/p$. The approach taken in distribute is
used to control the placements of each of the sub-computations. Initially, the problem is split
in half; this will have the greatest benefit to the execution time. Depending on the problem
size, further remote branchings of the problem may not be economical, and the remaining
steps should be evaluated locally, in sequence. In this case, the algorithm simply reduces to
seq-msort.
This parallel formulation of mergesort is essentially just distribute with additional work
and communication overhead, but it will allow us to more concretely quantify the relative
costs of process creation. The parallel implementation of mergesort par-msort is given in Fig-
ure 3b. It uses the same sequential merge procedure and the parameters t and n control the
placement of processes in the same way as they were used with distribute.
We can now analyse the performance and behaviour of par-msort and the process creation
mechanism by looking at the parallel runtime.
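The structure of par-msort described above can be sketched on a single machine. In the following Python version a thread stands in for the remotely created process, the placement parameters t and n are omitted, and results are returned rather than written in place; these simplifications and the function names are assumptions made for illustration.

```python
import threading


def merge(a: list[int], b: list[int]) -> list[int]:
    """Standard two-way merge of two sorted lists."""
    out: list[int] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out + a[i:] + b[j:]


def par_msort(xs: list[int], threshold: int) -> list[int]:
    """Sort xs; sub-lists longer than `threshold` are split across threads."""
    if len(xs) <= 1:
        return list(xs)
    mid = len(xs) // 2
    if len(xs) > threshold:
        result: dict[str, list[int]] = {}
        # The thread stands in for `on t+n/2 do par-msort (...)`.
        worker = threading.Thread(
            target=lambda: result.update(b=par_msort(xs[mid:], threshold)))
        worker.start()
        a = par_msort(xs[:mid], threshold)   # local half, in parallel
        worker.join()
        b = result["b"]
    else:
        # Below the threshold both halves are evaluated in sequence.
        a = par_msort(xs[:mid], threshold)
        b = par_msort(xs[mid:], threshold)
    return merge(a, b)


if __name__ == "__main__":
    data = [9, 4, 7, 1, 8, 2, 6, 3]
    print(par_msort(data, threshold=len(data) // 4))
```

Setting the threshold to the input length divided by p reproduces the choice described above, so that parallel branching stops once each sub-list has that size.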
proc seq-msort (A) is
  if |A| > 1 then
  { a aliases A[0 . . . |A|/2−1]
  ; b aliases A[|A|/2 . . . |A|−1]
  ; seq-msort (a)
  ; seq-msort (b)
  ; merge (A, a, b)
  }
(a)
proc par-msort (t, n, A) is
  if |A| > 1 then
  { a aliases A[0 . . . |A|/2−1]
  ; b aliases A[|A|/2 . . . |A|−1]
  ; if |A| > Cth then
    { par-msort (t, n/2, a)
    | on t+n/2 do
        par-msort (t+n/2, n/2, b) }
    else
    { par-msort (t, n/2, a)
    ; par-msort (t+n/2, n/2, b) }
  ; merge (A, a, b)
  }
(b)
Figure 3. Sequential and parallel mergesort processes.
3.2.1. Runtime
We first define the runtime of the sequential components of par-msort. This includes the
sequential merging and sorting procedures. The runtime $T_m$ of merge is linear and is defined
as
$$T_m(n) = C_a n + C_b$$
for constants $C_a, C_b > 0$, relating to the per-word and per-merge overheads respectively. These
were measured as $C_a$ = 90ns and $C_b$ = 830ns. The runtime $T_s(n,1)$ of seq-msort is expressed
as a recurrence:
$$T_s(n,1) = 2T_s\left(\frac{n}{2},1\right) + T_m(n) \qquad (2)$$
which has the solution
$$T_s(n,1) = n(C_c \log n + C_d) \qquad (3)$$
for constants Cc,Cd > 0. These were measured as Cc = 200ns and Cd = 1200ns. Based on
this we can express the runtime of par-msort as the combination of the costs of creating new
processes, moving data, merging and sorting sequentially. The key component of this is the
cost Tc, relating to the on statement in the parallel formulation, which is defined as
$$T_c(n) = C_i + 2C_w n.$$
This is because we can normalise Cl to 1 (due to Theorem 1), the size of the procedures sent
is constant and the number of arguments and results are both n. The initialisation overhead Ci
was measured as 28µs, larger than that for distribute as the closure contains the descriptions of
merge and par-msort . For the parallel runtime, the base sequential case is given by Equation 2.
With two processors, the work and execution time can be split in half at the cost of migrating
the procedures and data:
$$T_s(n,2) = T_c\left(\frac{n}{2}\right) + T_s\left(\frac{n}{2},1\right) + T_m(n).$$
With four processors, the work is split in half at a cost of Tc(n/2) and then in quarters at
a cost of Tc(n/4). After the data has been sequentially sorted in time Ts(n/4,1) it must be
merged at the two children of the master node in time Tm(n/2), and then again at the master
in time Tm(n):
$$T_s(n,4) = T_c\left(\frac{n}{2}\right) + T_c\left(\frac{n}{4}\right) + T_m\left(\frac{n}{2}\right) + T_m(n) + T_s\left(\frac{n}{4},1\right)$$
Hence in general, we have:
$$T_s(n,p) = \sum_{i=1}^{\log p}\left(T_c\left(\frac{n}{2^i}\right) + T_m\left(\frac{n}{2^{i-1}}\right)\right) + T_s\left(\frac{n}{p},1\right)$$
for n ≥ p as each leaf sub-process of the sorting computation must operate on at least one
data item. We can then express this precisely by substituting our definitions for Ts, Tc and Tm
and simplifying:
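$$T_s(n,p) = C_w\frac{2n}{p}(p-1) + C_i\log p + C_a\frac{2n}{p}(p-1) + C_b\log p + \frac{n}{p}\left(C_c\log\frac{n}{p} + C_d\right)$$
$$= \frac{2n}{p}(p-1)(C_w + C_a) + (C_i + C_b)\log p + \frac{n}{p}\left(C_c\log\frac{n}{p} + C_d\right) \qquad (4)$$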
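Equation 4 can be transcribed into a short function for experimentation. The sketch below uses the constants quoted in the text; the function names are hypothetical and no claim is made that it reproduces the plotted values exactly.

```python
import math

# Constants as reported in the text, in seconds.
C_W, C_I = 150e-9, 28e-6       # per-word transfer, closure initialisation
C_A, C_B = 90e-9, 830e-9       # merge: per-word and per-merge overheads
C_C, C_D = 200e-9, 1200e-9     # fitted constants of the sequential sort


def seq_time(n: float) -> float:
    """Equation 3: T_s(n, 1) = n (C_c log n + C_d)."""
    return n * (C_C * math.log2(n) + C_D)


def par_time(n: int, p: int) -> float:
    """Equation 4, valid for n >= p with p a positive power of two."""
    if p < 1 or p & (p - 1) or n < p:
        raise ValueError("need p a positive power of two and n >= p")
    spread = 2 * n * (p - 1) / p           # the (2n/p)(p-1) factor
    return (spread * (C_W + C_A)
            + (C_I + C_B) * math.log2(p)
            + seq_time(n / p))


if __name__ == "__main__":
    n = 8192                                # 32kB of 4-byte words
    for p in (1, 2, 4, 8, 16, 32, 64):
        print(p, par_time(n, p))
```

For p = 1 both the transfer term and the logarithmic term vanish, so `par_time(n, 1)` coincides with `seq_time(n)`.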
For p = 1, this reduces to Equation 3. This definition allows us to express a lower bound
and a minimum for the runtime.
3.2.2. Lower Bound
We can give a lower bound $T_s^m$ on the parallel runtime $T_s(n,p)$ such that, for all n and p,
$$T_s(n,p) \ge T_s^m.$$
This is obtained by considering the parallel overhead, that is the cost of distributing the
problem over the system. In this case it relates to the cost of process creation, including
moving processes and their data, the Tc component of Ts:
$$T_s^m(n,p) = \sum_{k=1}^{\log p} T_c\left(\frac{n}{2^k}\right) = \sum_{k=1}^{\log p}\left(C_i + 2C_w\frac{n}{2^k}\right) = C_i\log p + C_w\frac{2n}{p}(p-1). \qquad (5)$$
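The closed form in Equation 5 follows from a finite geometric series. As a worked step, using only the symbols already defined:

```latex
\sum_{k=1}^{\log p} \frac{n}{2^k}
  = n\left(1 - \frac{1}{2^{\log p}}\right)
  = n\left(1 - \frac{1}{p}\right)
  = \frac{n}{p}(p-1),
\qquad\text{so}\qquad
\sum_{k=1}^{\log p} 2C_w\frac{n}{2^k} = C_w\frac{2n}{p}(p-1).
```

The constant term contributes $C_i$ once per level, giving $C_i \log p$.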
Equation 5 is then the sum of the costs of process creation and movement of input data.
When n = 0, Tms relates to Equation 1; this is the cost of transmitting and initiating just the
computations over the system. For n > 0, this includes the cost of moving the data.
3.2.3. Minimum
Given an input of length m ≤ n for some sub-computation of par-msort, creation of a remote
branch is beneficial only when the cost of this is less than the local sequential case:
$$T_c\left(\frac{m}{2}\right) + T_s\left(\frac{m}{2},1\right) + T_m(m) < T_s(m,1)$$
$$T_c\left(\frac{m}{2}\right) + T_s\left(\frac{m}{2},1\right) + T_m(m) < 2T_s\left(\frac{m}{2},1\right) + T_m(m)$$
$$T_c\left(\frac{m}{2}\right) < T_s\left(\frac{m}{2},1\right)$$
Hence, initiation of a remote sorting process for an array of length n is beneficial only when
$$T_c(n) < T_s(n,1).$$
That is, the cost of remotely initiating a process to perform half the work and receiving the
results is less than the cost of sequentially sorting m/2 elements. Therefore at the inflection
point we have
$$T_c(n) = T_s(n,1). \qquad (6)$$
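The break-even point of Equation 6 can be located numerically. The sketch below assumes the constants quoted in the text for par-msort and searches for the smallest array length, in words, at which remote creation becomes cheaper than sorting locally; the function names are hypothetical.

```python
import math

# Constants as reported in the text for par-msort, in seconds.
C_I, C_W = 28e-6, 150e-9       # closure initialisation, per-word transfer
C_C, C_D = 200e-9, 1200e-9     # fitted constants of the sequential sort


def creation_time(n: int) -> float:
    """T_c(n) = C_i + 2 C_w n."""
    return C_I + 2 * C_W * n


def seq_time(n: int) -> float:
    """T_s(n, 1) = n (C_c log n + C_d)."""
    return n * (C_C * math.log2(n) + C_D)


def break_even(limit: int = 1 << 20) -> int:
    """Smallest n (in words) for which T_c(n) < T_s(n, 1)."""
    for n in range(1, limit + 1):
        if creation_time(n) < seq_time(n):
            return n
    raise ValueError("no crossover found below the limit")


if __name__ == "__main__":
    print(break_even())
```

Evaluated with these constants, the crossover falls at 17 words, which is close to the 16 word (64 byte) sections discussed in the results.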
3.2.4. Results
Figure 4 shows the measured execution time of par-msort as a function of the number of
processors used for varying input sizes. Figure 4a shows just three small inputs. The smallest
possible input is 256 bytes as the minimum size for any sub-computation is 1 word. The
minimum execution time for this size is at p = 4 processors, when the array is subdivided
twice into 64 byte sections. This is the point given by Equation 6 and indicates directly the
total cost incurred in offloading a computation. For p < 4, the cost of sorting sequentially
dominates the runtime, and for p > 4, the cost of creating new processes and transferring
the array sections dominates the runtime. With the next input of size 512 bytes, the minimum
moves to p = 8, where the array is again divided into 64 byte sections. This holds for each
input size and in general gives us the minimum size for which creating a new process will
further reduce the runtime.
The runtime lower bound Tms (0, p) given by Equation 5 is also plotted on Figure 4a. This
shows the small and sub-linear cost with respect to p of the overheads incurred with the dis-
tribution and management of processes around the system. Relative to Ts(64, p) this consti-
tutes most of the overall work performed, which is expected as the array is fully decomposed
into unit sections. For larger sized inputs, as presented in Figure 4b, this cost becomes just a
fraction of the total work performed.
Figure 5 shows predicted execution times for par-msort for larger p and n. Each plot
contains the execution time Ts as defined by Equation 4, and Tms with and without the transfer
of data. Figure 5a gives results for the smallest input size possible to sort on 1024 cores (4kB)
and includes the measurements for Tms (0, p) and Ts. It reiterates what was shown in Figure 4a
and shows that beyond 64 cores, very little penalty is incurred to create up to 1024 sorting
instances, with Tms accounting for around 23% of the total runtime for larger systems. This is
due to the exponential growth of the distribution mechanism. Figure 5b gives results for the
largest measured input of 32kB, showing the same trends, where Tms this time is around just
3% of the runtime between 64 and 1024 cores.
Figure 5c and Figure 5d present predictions made by the performance model for more
realistic workloads of 10MB and 1GB respectively. Figure 5c shows that 10MB could be
sorted sequentially in around 7s and in parallel in at least 0.6s. Figure 5d shows that 1GB
could be sorted in just under 15 minutes sequentially or at least 1 minute in parallel. What these results
make clear is that the distribution of the input data dominates and bounds the runtime and
that the distribution of data constituting the process descriptions is a negligible proportion
of the overall runtime for reasonable workloads. The relatively small sequential workload
O(n/p log(n/p)) of mergesort, which decays quickly as p increases, emphasises the cost of
data distribution. For heavier workloads, such as $O((n/p)^2)$, we would expect to see a much
more dramatic reduction in execution time, with the cost of data distribution still eventually
bounding the runtime, but by a relatively small fraction.
12 J. Hanlon and S. J. Hollis / Fast Distributed Process Creation with the XMOS XS1 Architecture
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
1 2 4 8 16 32 64
Ti
m
e
(m
s)
p
Tms (0, p)
Ts(256B, p)
Ts(512B, p)
Ts(1kB, p)
(a) Log-linear plot for varying small inputs.
0.1
1
10
100
1 2 4 8 16 32 64
Ti
m
e
(m
s)
p
Ts(256B, p)
Ts(512B, p)
Ts(1kB, p)
Ts(2kB, p)
Ts(4kB, p)
Ts(8kB, p)
Ts(16kB, p)
Ts(32kB, p)
(b) Log-log plot for larger inputs.
Figure 4. Measured execution time of par-msort as a function of the number of processors. (a) highlights the
minimum execution time and the Tms lower bound.
[Figure 5: four log-log plots of time (ms) against p, for p = 1 to 1024.
(a) n = 64 (256B) with measured results up to 64 cores. Series: Tms(0, p), ?Tms(0, p), ?Tms(n, p), Ts(n, p), ?Ts(n, p).
(b) n = 8192 (32kB) with measured results up to 64 cores. Series as in (a).
(c) n = 2621440 (10MB). Series: ?Tms(0, p), ?Tms(n, p), ?Ts(n, p).
(d) n = 268435456 (1GB). Series as in (c).]
Figure 5. Predicted (?) performance of par-msort for larger n and p ≤ 1024. All plots are log-log.
4. Conclusions
This paper presents the design, implementation, demonstration and evaluation of an efficient
mechanism for dynamically creating computations in a distributed memory parallel com-
puter. It has shown that a computation can be dispatched to a remote processor in just tens
of microseconds, and when this mechanism is combined with recursion, it can be used to
efficiently implement parallel growth.
The distribute algorithm demonstrates how an empty array of processors can be populated
with a computation exponentially quickly. For 64 cores, it takes just 114.60µs, and for 1024
cores it is predicted to be of the order of 190µs. The par-msort algorithm extends this by
performing additional computational work and communication of data, which allowed us to obtain a
clearer picture of the cost of process creation with respect to varying problem sizes. As the
cost of transferring and invoking remote computations is related primarily to the size of the
closure, this cost grows slowly with system size and is independent of the data. With a 10MB
input, it represents only around 0.001% of the runtime.
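The 1024-core figure is consistent with a simple extrapolation from the 64-core measurement. The sketch below assumes that the number of populated processors doubles at each step and that every doubling step costs the same, which ignores any variation in communication latency across the network; the helper names are illustrative and are not taken from the implementation.

```python
def doubling_steps(p):
    """Number of doubling steps needed to populate p processors (p a power of two)."""
    assert p > 0 and p & (p - 1) == 0, "p must be a power of two"
    return p.bit_length() - 1

MEASURED_CORES = 64
MEASURED_TIME_US = 114.60  # measured distribute time for 64 cores

# Assumption: each doubling step contributes an equal share of the measured time.
per_step_us = MEASURED_TIME_US / doubling_steps(MEASURED_CORES)

def estimate_distribute_us(p):
    """Extrapolated time to populate p processors under the uniform-step assumption."""
    return per_step_us * doubling_steps(p)

estimate_1024_us = estimate_distribute_us(1024)
print(f"per step: {per_step_us:.2f} us, 1024 cores: {estimate_1024_us:.1f} us")
```

With six steps for 64 cores and ten for 1024, this gives roughly 19µs per step and about 191µs in total, in line with the figure quoted above.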
The sorting results also highlight two important issues: the granularity at which it is
possible to create new processes and the cost of data movement. They show that the computation
can be subdivided to operate on chunks of just 64 bytes while still improving performance.
The cost of data movement is significant, relative to the small amount of work performed
at each node; for more intensive tasks, these costs would diminish. However, these results
assume a worst case, where all data originates from a single core. In other systems, this
cost may be reduced by concurrent access through a parallel file system or from prior data
distribution.
The XS1 architecture provides efficient support for concurrency and communications
and the XK-XMP-64 provides an optimal transport for the described algorithms, so we expect
our lightweight scheme to be fast, relative to the performance of other distributed systems.
Hence, the results provide a convincing proof-of-concept implementation, demonstrating the
kind of performance that is possible and, with respect to the topology, establish a reasonable
lower bound on the performance of the approach presented. The results generalise to more
dynamic schemes, where placements are not perfect, and to other, larger architectures such as
supercomputers, where interconnection topologies are less well connected and communication
is less efficient. In these cases, the approach applies at a coarser granularity, with larger
problem sizes to match the relative performance.
5. Future Work
Having successfully designed and implemented a language and runtime allowing explicit
process creation with the on statement, we will continue with our focus on the concept of
growth in parallel programs and plan to extend the work in the following ways. Firstly, by
looking at how placement of process closures can be determined automatically by the run-
time, relieving the programmer of having to specify this. Secondly, by implementing the lan-
guage and runtime with C and MPI to target a larger platform, which will provide a more
scalable demonstration of the concepts and their generality. And lastly, by looking at generic
optimisations that can be made to the process creation mechanism to improve overall performance
and scalability. More details about the current implementation are available online at
http://www.cs.bris.ac.uk/~hanlon/sire, where news of future developments will also be published.
Acknowledgments
The authors would like to thank XMOS for their support, in particular from David May, Henk
Muller and Richard Osborne.
References
[1] David May. The Transputer revisited. In Millennial Perspectives in Computer Science: Proceedings of
the 1999 Oxford-Microsoft Symposium in Honour of Sir Tony Hoare, pages 215–246. Palgrave Macmillan,
1999.
[2] David May. The XMOS XS1 Architecture. XMOS Ltd., October 2009.
http://www.xmos.com/support/documentation.
[3] Asanovic, Bodik et al. The Landscape of Parallel Computing Research: A View from Berkeley. Technical
Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html.
[4] Dongarra, J., Beckman, P. et al. International Exascale Software Project Roadmap. Technical Report
UT-CS-10-654, University of Tennessee EECS Technical Report, May 2010. http://www.exascale.org/.
[5] D. May. The Influence of VLSI Technology on Computer Architecture [and Discussion]. Philosophical
Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences,
326(1591):377–393, 1988.
[6] Per Brinch Hansen. The nature of parallel programming. Natural and Artificial Parallel Computation,
pages 31–46, 1990.
[7] MPI 2.0. Technical report, Message Passing Interface Forum, November 2003.
http://www.mpi-forum.org/docs/.
[8] B.L. Chamberlain, D. Callahan, and H.P. Zima. Parallel programmability and the Chapel language.
International Journal of High Performance Computing Applications, 21(3):291–312, 2007.
[9] Philippe Charles, Christian Grothoff, Vijay Saraswat, Christopher Donawa, Allan Kielstra, Kemal
Ebcioglu, Christoph von Praun, and Vivek Sarkar. X10: an object-oriented approach to non-uniform
cluster computing. In OOPSLA ’05: Proceedings of the 20th annual ACM SIGPLAN conference on
Object-oriented programming, systems, languages, and applications, pages 519–538, New York, NY, USA,
2005. ACM.
[10] A. Patera. A spectral element method for fluid dynamics: Laminar flow in a channel expansion. Journal
of Computational Physics, 54(3):468–488, June 1984.
[11] Bernard Gendron and Teodor Gabriel Crainic. Parallel branch-and-bound algorithms: Survey and
synthesis. Operations Research, 42(6):1042–1066, 1994.
[12] Marsha J Berger and Joseph Oliger. Adaptive mesh refinement for hyperbolic partial differential equations.
Journal of Computational Physics, 53(3):484–512, 1984.
[13] XMOS. XK-XMP-64 Hardware Manual. XMOS Ltd., February 2010.
http://www.xmos.com/support/documentation.
[14] F. Thomson Leighton. Introduction to parallel algorithms and architectures: arrays, trees, hypercubes.
Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992.
[15] D. E. Knuth. The Art of Computer Programming, volume 3, Sorting and Searching, chapter 5.2.4, Sorting
by Merging, pages 158–168. Reading, MA: Addison-Wesley, 2nd edition, 1998.
