Demonstration of Run-time Spatial Mapping of Streaming
Applications to a Heterogeneous Multi-Processor
System-on-Chip (MPSoC)
Philip K.F. Hölzenspies, Jan Kuper, Gerard J.M. Smit, Johann Hurink
University of Twente
Department of Electrical Engineering, Mathematics and Computer Science
P.O. Box 217, 7500 AE Enschede, The Netherlands
p.k.f.holzenspies@utwente.nl
Abstract
In this paper, the problem of spatial mapping is defined. Reasons are presented to show
why performing spatial mappings at run-time is both necessary and desirable, and criteria for
the qualitative comparison of spatial mappings are introduced. An algorithm is described that
implements a preliminary spatial mapper. The methods used in the algorithm are demonstrated with
an illustrative example.
1 Introduction
Academia and industry alike recognize the trend towards parallelism in computation.
Although many techniques exist for the analysis of (data and temporal) dependencies
between parallel processes, programming models for multi-processor architectures are still a
subject of current research. This paper deals with models for streaming applications on
heterogeneous multi-processor systems, with a special focus on MPSoC cases, where energy
efficiency is a key requirement.
The remainder of this section introduces the concepts relevant to this paper. Section 2
describes the contributions made by this paper. Section 3 describes a formal model of spatial
mapping, including quality criteria, followed by a description of an algorithm implementing
this formal model in Section 4. To provide some intuition, a full case example is given in
Section 5, before conclusions are drawn in Section 6.
1.1 Streaming DSP applications
Streaming DSP algorithms are implemented and used in portable and otherwise energy
constrained embedded systems and require an energy-efficient processing architecture.
Typical examples are found in signal processing for wireless baseband processing (for
wireless LAN, digital radio, UMTS[11]), multi-media processing, medical image processing
and sensor processing. Streaming DSP applications can be modelled as task graphs with
streams of data items (the edges) flowing between computation kernels (the nodes)[4].
Analyzing the common characteristics of these applications, we can observe that they:
• require relatively simple processing on huge amounts of data.
• display a high degree of regularity in the communication between tasks.
• have data flowing through the tasks in a pipelined fashion, thus allowing tasks to be
executed in parallel, either on different processors or in a time-multiplexed fashion.
• often require real-time throughput and latency guarantees for both communication
and computation.
• have a semi-static life-time, i.e. typically in the order of minutes, rather than millisec-
onds.
Dagstuhl Seminar Proceedings 07101
Quantitative Aspects of Embedded Systems
http://drops.dagstuhl.de/opus/volltexte/2007/1138
• display a high degree of periodicity (possibly dependent on data arrival times)
From these observations, we can conclude that these applications have a predictable be-
haviour, both temporally and spatially.
On a functional level, a streaming application can be described as a Kahn Process
Network (KPN), because only the functional decomposition of an application and the data
dependencies between the components are specified. Concrete implementations of these
components can be specified with much more detailed information, i.e. Worst-Case
Execution Time (WCET) and granularity of consumption and production of data (e.g. an
implementation of a function on video frames may read the entire frame before executing,
but it may also work only on the first few lines). These added details can be described
in terms of Synchronous Data Flow (SDF) graphs [9]. Here the ends of the edges are
labelled with consumption and production rates in terms of tokens. Also, annotations are
added to the edges to signify the number of tokens that “currently reside on the edge” (i.e.
are produced and not yet consumed). Finally, nodes are labelled with the WCET of a single
execution.
When the difference between the time spent reading and writing large tokens and the
time spent executing becomes large, a more fine-grained specification by means of
Cyclo-Static Data Flow (CSDF) graphs [2] will allow for more exact analysis[15]. In CSDF, the labels
from the SDF graph are split into the different phases of an execution of the implementation.
1.2 Tiled architectures
Although multi-processor systems are not a new concept, the MPSoC concept is on the
rise. Recently, considerable numbers of MPSoC designs have been proposed and built (e.g.
[7, 14, 3]) and design templates have been developed (e.g.[1, 6, 12]).
What is referred to as a tiled architecture in this paper is a chip made up of multiple
autonomous processing elements. Autonomicity means that tasks can be started and stopped
on a processor without directly affecting (independent tasks on) other processors. In other
words, the guaranteed resource bounds of other tasks are not threatened. For these
processors to form one architecture, they must be interconnected. The combination of a processor
and its interface to the architecture’s interconnect is referred to as a tile. Autonomicity of
the interconnect requires that guarantees can be provided with respect to throughput and
latency[8]. The remainder of this paper assumes that the tiles on the chip are interconnected
by a Network-on-Chip (NoC)[5].
1.3 Run-time spatial mapping
Generally, spatial mapping is the allocation of spatial resources for applications. In the
context of tiled architectures, spatial resources are tiles and, in the case of a NoC, routers.
Thus, spatial mapping is the assignment of tasks from the KPN describing the streaming
application to tiles and of channels to paths through the NoC. A feasible spatial mapping
satisfies the mapped application’s Quality of Service (QoS) requirements. A spatial
mapping’s quality depends on the extent to which it minimizes cost (in our case: energy
consumption) under the resource constraints. The objective of run-time spatial mapping is
to find a feasible spatial mapping with the best quality (in our case: the lowest energy cost).
To be able to utilize heterogeneous multi-processor systems, tasks that are used often
should be implemented for different processor types. For example, for a frequently used
DSP kernel such as an FFT there may be implementations for an embedded ARM processor
and for a reconfigurable core. This introduces flexibility: a task can still be
executed, even if no processor of the preferred type is available.
1.3.1 Necessity and advantages
Performing the spatial mapping at run-time is arguably both necessary and desirable. Pre-
liminary experiments[13] are promising with respect to the feasibility of run-time spatial
mapping. Performing the spatial mapping at run-time is necessary, whenever the applica-
tion set is not known completely at design-time, e.g. when the platform allows the user to
use software from any vendor, developed for that platform.
Availability of resources depends on the set of applications running simultaneously. Also,
variations in QoS requirements due to changes in the environment affect an application’s
resource demands. Because of these dependencies, all possible combinations of applications
need to be known at design-time, in order to do design-time spatial mapping with energy-
efficiency as an optimization objective, which is impossible, even in small systems.
Performing the spatial mapping at run-time offers very desirable flexibility. Unforesee-
able changes in applications can be taken into account. Moreover, defective tiles can be
avoided, which both increases yields and makes a system more robust against aging.
1.3.2 Goals and requirements
In our context, the objective of the spatial mapping is to minimize the energy consumption
of the entire application: processing (including memory requirements thereof) as well as
interprocess communication. In principle, the spatial mapping is performed only when a
new streaming application is started.
To be able to perform the mapping of an application to tiles, a spatial mapping algorithm
needs models of the application to be mapped and the (MPSoC) platform to be mapped
to. Furthermore, the constraints of the application (e.g. throughput requirements and
latency bounds) need to be known, as well as the resource requirements of the available
implementations (e.g. time, memory, etc.).
When performing the spatial mapping at run-time, some figures can only be determined
at run-time. Inter-process communication parameters (e.g. estimated latency, energy con-
sumption), for example, need to be determined at run-time as these are dependent on the
specific mapping. Likewise, it is only known at run-time on which tile a process will be
executed and which processes are already running on this tile, so the actual response time of
a process is only known at run-time. However, the choice made at run-time is from a finite
set of implementations, all of which have properties that are determined at design-time.
The constraints of the application can only be checked after it has been mapped. Thus,
when a spatial mapping has been determined and latencies and throughputs of processes
running on processors are known, the constraints can be checked. We use a data flow
analysis for this check; it is beyond the scope of this paper and is described in [15].
Finally, a spatial mapping is considered feasible if it is adherent and all the application’s
constraints are met.
2 Contribution
Run-time spatial mapping is a very young research topic. As such, opinions vary on what
it does and does not comprise. Current practice is to perform both the spatial and temporal
mapping of applications to multi-processor architectures simultaneously at design-time.
Even at design-time, exhaustive search for optimal mappings is not always possible. Thus,
heuristics are often used to perform this design-time mapping.
Run-time mapping poses much tighter time constraints on the search process. Therefore,
better-tailored heuristics are required. The separation of spatial and temporal mapping is
one such heuristic. This separation is made clear in this paper by means of a formal
definition of spatial mapping (see Section 3), a description of the algorithm used (see
Section 4) and an illustrative example (see Section 5).
3 A formal definition of spatial mapping
3.1 Hardware platform
In this section we will formally describe tiled multi-processor systems. In such a system,
a tile may for example consist of a processor, some memory, and a router. Tiles can be
connected to each other through links. A link between two tiles enters those tiles through
the routers, such that communication with a processor is only possible through the router
on the same tile. On the other hand, in order to let data, sent from one tile to another,
pass through intermediate tiles, only the routers on these tiles need to be involved and the
processors need not.
Furthermore, a tiled multi-processor system has a certain capacity to run software appli-
cations.
Thus, a tiled multi-processor system is a graph T = 〈T ,E〉 where T is a set of tiles (nodes)
and E ⊆ T × T is a set of links (edges) between tiles.
Edges in a graph T are unlabelled, i.e., an edge is completely determined by its source
and destination. We will assume that a graph T is directed. Furthermore, T is connected,
i.e., there is a path, possibly consisting of several edges, between each pair of tiles.
A tiled multi-processor system typically contains processors of various types, e.g.
ARM processors, FPGAs and DSPs.
The main issue in this paper is the mapping of the processes of a software application on
a system of processors. Since these processors need not be directly linked to each other,
communication channels between processes are mapped onto paths between tiles. Therefore
we consider a “higher order graph,” in which the edges are the paths in T.
Thus, let E∗ be the set of all cycle-free paths over a tiled system T = 〈T ,E〉, then the graph
T∗ = 〈T ,E∗〉 is called the pathed tiled system over T. A (possibly empty) path from $t_0$ to $t_n$
will be denoted by 〈$t_0, t_1, \ldots, t_n$〉 (with 〈$t_i, t_{i+1}$〉 ∈ E), and may be considered as a label of an
edge in T∗ from $t_0$ to $t_n$. The length of a path is the number of steps in it, i.e., the length of
the path mentioned is n. Clearly, a pathed graph is a directed multi-graph.
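As an illustration of the pathed tiled system, the following minimal Python sketch enumerates the cycle-free paths between two tiles by depth-first search. The adjacency-list representation, the function names and the 2x2 mesh are assumptions made for this example only; they do not come from the paper, and exhaustive enumeration grows exponentially with the size of the system.

```python
from typing import Dict, List, Tuple

Tile = str
Path = Tuple[Tile, ...]


def cycle_free_paths(links: Dict[Tile, List[Tile]], src: Tile, dst: Tile) -> List[Path]:
    """Enumerate all cycle-free paths from src to dst by depth-first search."""
    found: List[Path] = []

    def extend(path: List[Tile]) -> None:
        if path[-1] == dst:
            found.append(tuple(path))
            return
        for nxt in links.get(path[-1], []):
            if nxt not in path:  # never revisit a tile, so no cycles
                path.append(nxt)
                extend(path)
                path.pop()

    extend([src])
    return found


def path_length(path: Path) -> int:
    """Length of a path = number of steps (links) in it."""
    return len(path) - 1


# Hypothetical 2x2 mesh with links in both directions.
links = {
    "t0": ["t1", "t2"],
    "t1": ["t0", "t3"],
    "t2": ["t0", "t3"],
    "t3": ["t1", "t2"],
}
paths = cycle_free_paths(links, "t0", "t3")
```

Here `paths` holds the two edges of the pathed graph from `t0` to `t3`, each of length 2; asking for the paths from a tile to itself yields the single empty path of length 0.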
3.1.1 Capacities
Each tile in T has capacities. One can think of computational and memory capacities, but
also of the maximum number of processes that can be assigned to it; e.g. ASICs cannot
switch between processes, so they have a maximum of one process assigned to them, while an
ARM may be able to serve as many processes as there are slots in its TDMA scheduler.
Capacities concerning communication between processors such as bandwidth, are also
supposed to be expressed as capacities of the tiles. For example, the bandwidth of a link
between tiles can be expressed as the capacity of the outgoing port(s) on a tile connected to
that link.
Thus, we consider all relevant (local) capacities of a tiled system as being expressed as
capacities of tiles. That is to say, all capacities of a tile t are expressed simultaneously by
its capacity vector C(t). The ‘shape’ of every C(t) is the same for all t, i.e. the capacity of the
processor on every tile has the same dimension. These dimensions are orthogonal, i.e. the
corresponding capacities are independent.
Given the above, it is possible to derive capacities of a path in T∗. For example, the
bandwidth of a path is the minimum bandwidth of the hops in the path.
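The derivation of a path capacity can be sketched as follows. This is a simplified illustration, assuming a single scalar outgoing bandwidth per tile (the text allows one capacity per outgoing port); the function name and the numbers are hypothetical.

```python
from typing import Dict, Sequence


def path_bandwidth(path: Sequence[str], out_bandwidth: Dict[str, float]) -> float:
    """Bandwidth of a path: the minimum bandwidth over its hops.

    Each hop (t_i, t_{i+1}) is charged to the outgoing port of t_i, so the
    last tile of the path contributes no hop.
    """
    hops = path[:-1]
    if not hops:
        return float("inf")  # empty path: source and destination coincide
    return min(out_bandwidth[t] for t in hops)


# Hypothetical outgoing bandwidths (arbitrary units).
out_bandwidth = {"t0": 100.0, "t1": 40.0, "t3": 80.0}
bw = path_bandwidth(("t0", "t1", "t3"), out_bandwidth)
```

In this example the hop leaving `t1` is the bottleneck, so `bw` equals the outgoing bandwidth of `t1`.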
3.2 Software applications
An application (task) is a directed graph P = 〈P,F 〉 where P is a set of processes and
F ⊆ P × P a set of channels between processes along which processes communicate with
each other.
For all processes in an application, implementations have to exist such that a process can
be executed on a processor. For one process several implementations may exist, though not
necessarily for all available types of processors. The set of implementations for process p is
I(p). The subset $I_\tau(p) \subseteq I(p)$ denotes the set of implementations of p that can be executed
on processors of type τ.
3.2.1 Requirements
Any implementation poses requirements on the processor it is executed on. Examples of
such requirements are the computational and memory loads. For a given implementation i,
its requirements are expressed simultaneously by its requirement vector $R_\pi(i)$. The subscript
$\pi$ indicates that in this case the requirements are posed by (an implementation of) a process.
Below, requirements of channels will be dealt with.
The dimensions of a requirement vector are the same, and in the same order, as with
capacity vectors. Hence, implementation i of process p can be executed on tile t of type τ if
$i \in I_\tau(p)$ and $C(t) - R_\pi(i) \geq 0$, where the inequality holds in every dimension.
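The executability condition can be written down directly as a componentwise comparison. The sketch below is illustrative only: the three vector dimensions, the type names `A` and `B`, the implementation names and all numbers are assumptions, not values from the paper.

```python
from typing import Dict, Set, Tuple

Vector = Tuple[float, ...]


def fits(capacity: Vector, requirement: Vector) -> bool:
    """True if C(t) - R(i) >= 0 holds in every dimension."""
    if len(capacity) != len(requirement):
        raise ValueError("capacity and requirement vectors must have the same shape")
    return all(c - r >= 0 for c, r in zip(capacity, requirement))


def can_execute(impl: str, tile_type: str, capacity: Vector,
                impls_by_type: Dict[str, Set[str]],
                requirement: Dict[str, Vector]) -> bool:
    """impl can run on a tile of tile_type if it is implemented for that type and fits."""
    return impl in impls_by_type.get(tile_type, set()) and fits(capacity, requirement[impl])


# Hypothetical dimensions: (cycles per period, memory in bytes, process slots).
impls_by_type = {"A": {"fft_a"}, "B": {"fft_b"}}
requirement = {"fft_a": (4370.0, 2048.0, 1.0), "fft_b": (286.0, 512.0, 1.0)}
tile_capacity = (1000.0, 4096.0, 1.0)

ok_b = can_execute("fft_b", "B", tile_capacity, impls_by_type, requirement)
ok_a = can_execute("fft_a", "A", tile_capacity, impls_by_type, requirement)
wrong_type = can_execute("fft_b", "A", tile_capacity, impls_by_type, requirement)
```

Only `ok_b` holds: `fft_a` exceeds the cycle budget of the tile, and `fft_b` is not an implementation for type `A`.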
Likewise, the communication between processes along channels poses requirements on
routers and links in a tiled system. Here too, the requirements of communication channels
will be expressed as vectors of the same form as the capacity vectors. Note that such
requirements vectors will contain zeroes on positions where they are not relevant, such as
with memory requirements on the tiles.
Note that the communication through channels does not depend on the selected imple-
mentations for individual processes, but on the specification of the application as a whole.
Requirements following from communication along a channel c will be expressed as $R_\gamma(c)$.
In order to determine whether there is sufficient capacity on a tile t for information that
has to flow through this tile, it is important in which direction this information will flow
through the tile. We will come back to that point below.
3.3 Spatial mapping
Software applications have to run on a tiled system, i.e. tiles have to be associated to
processes, and paths in the tiled system have to be associated to channels. Clearly, this
has to be done in such a way that the necessary implementations exist, and the capacity of
the tiled system is not exceeded.
A task assignment function α is a function which maps a software task P to a pathed tiled
system T∗. More precisely, for every process p ∈ P we have that α(p) is a tile in T , and for
each channel 〈p, q〉 ∈ F we have that α〈p, q〉 is an edge from α(p) to α(q) in E∗. Thus, α〈p, q〉
is a path from α(p) to α(q) in T. If needed, we will distinguish between $\alpha_\pi$ and $\alpha_\gamma$, where $\alpha_\pi$
denotes that part of α that deals with processes, and $\alpha_\gamma$ that part that deals with channels.
Let I be a function which selects an implementation for a process p. Then a spatial mapping
m is a pair 〈α, I〉 of a task assignment function α and an implementation selector I.
Suppose that for a given process p we have that α(p) = t, where t is of type τ. Then an
implementation of p for a processor of type τ should exist. A spatial mapping is considered
adequate if every process is mapped to a tile type for which an implementation is available.
Formally, we call m = 〈α, I〉 adequate if for every process p such that the type of α(p) is τ, we
have that $I_\tau(p)$ is non-empty, and $I(p) \in I_\tau(p)$.
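The adequacy condition can be checked mechanically once a mapping is represented as plain dictionaries. The representation below (dictionaries keyed by process and tile names, `impls` keyed by process and type) and all names in the example are assumptions chosen for illustration.

```python
from typing import Dict, Set, Tuple


def is_adequate(alpha_pi: Dict[str, str], selected: Dict[str, str],
                tile_type: Dict[str, str],
                impls: Dict[Tuple[str, str], Set[str]]) -> bool:
    """Check adequacy of a mapping.

    alpha_pi:  process -> tile (the process part of alpha)
    selected:  process -> chosen implementation (the selector I)
    tile_type: tile -> processor type tau
    impls:     (process, tau) -> implementations of that process for type tau
    """
    for process, tile in alpha_pi.items():
        candidates = impls.get((process, tile_type[tile]), set())
        # I_tau(p) must be non-empty and must contain the selected implementation.
        if not candidates or selected[process] not in candidates:
            return False
    return True


# Hypothetical platform with two tile types and one process.
tile_type = {"tile0": "A", "tile1": "B"}
impls = {("fft", "A"): {"fft_a"}, ("fft", "B"): {"fft_b"}}

good = is_adequate({"fft": "tile1"}, {"fft": "fft_b"}, tile_type, impls)
bad = is_adequate({"fft": "tile0"}, {"fft": "fft_b"}, tile_type, impls)
```

The second mapping is inadequate because the selected implementation `fft_b` does not belong to the type of `tile0`.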
3.3.1 Computational load
Next, we discuss the computational load of a spatial mapping m = 〈α, I〉 on a tile t. First, we
discuss the load resulting from mapping processes on tiles; later we come to the load
resulting from communication.
Define the inverse of the task assignment function α concerning tiles as follows:
\[ \alpha_\pi^{-1}(t) = \{\, p \in P \mid \alpha(p) = t \,\} \]
Thus, $\alpha_\pi^{-1}(t)$ is the set of processes that is assigned by α to tile t.
The computational load $L_\pi^m(t)$ of a spatial mapping m = 〈α, I〉 on the processor of a tile t is
given by
\[ L_\pi^m(t) = \sum_{p \in \alpha_\pi^{-1}(t)} R_\pi(I(p)) \]
where $R_\pi(I(p))$ is the requirement vector of the implementation I(p) of process p. Thus, the
load of a tile on which several processes may run, is a vector of the same structure as the
capacity vector of a tile, and also as the requirement vectors of the implementations of the
individual processes.
Next, we turn to the load caused by communication along channels. First we define the
corresponding inverse of a task assignment function α:
\[ \alpha_\gamma^{-1}(t) = \{\, c \in F \mid t \in \alpha(c) \,\} \]
Thus, $\alpha_\gamma^{-1}(t)$ yields the set of those channels in the application that are mapped on paths in
T that pass through the router of tile t.
In order to calculate the load on a tile t caused by the communication through the router
of that tile t, we need to know from which channel this communication comes and in what
direction it goes. To define the direction of the communication through tile t, caused by
channel c, suppose
\[ \alpha(c) = \langle t_0, \ldots, t, \ldots, t_n \rangle, \]
i.e., t is one of the tiles in the path associated to c by α.
We denote the immediate predecessor of t in α(c) by $t_\alpha^-$, and the immediate successor of
t by $t_\alpha^+$. In case $t = t_0$ or $t = t_n$, we choose $t_\alpha^- = t_0$ and $t_\alpha^+ = t_n$, respectively.
Let $\varphi_{(t_\alpha^-, t, t_\alpha^+)}$ be a function that determines which components of the load vector of tile t
are changed when information flows through tile t in the direction coming from tile $t_\alpha^-$ and
going to tile $t_\alpha^+$. Thus, if the requirement vector of a channel c in an application is $R_\gamma(c)$, then
the load on a tile t resulting from the communication through t according to c, now can be
expressed as
\[ \varphi_{(t_\alpha^-, t, t_\alpha^+)}\bigl(R_\gamma(c)\bigr). \]
The total load on tile t caused by all communication in all channels that are mapped on a
path through t, now is
\[ L_\gamma^m(t) = \sum_{c \in \alpha_\gamma^{-1}(t)} \varphi_{(t_\alpha^-, t, t_\alpha^+)}\bigl(R_\gamma(c)\bigr) \]
[Figure 1: Hierarchical search with iterative refinement. Steps: assign processes to tile types; assign processes to tiles; assign channels to paths; check application constraints. Transition labels: chose I and $\alpha_\pi$; refined $\alpha_\pi$; chose $\alpha_\gamma$; feasible; inadherent; infeasible; time for improvement.]
The total load on tile t, covering both the load caused by processes and the load caused by
channels, now is
\[ L^m(t) = L_\pi^m(t) + L_\gamma^m(t). \]
When a spatial mapping is adequate and no tiles are overloaded (the resources required
by all the implementations mapped to a tile do not exceed the resources offered by the tile),
it is considered adherent. In other words, we call m adherent if m is adequate, and if for all t
we have that
\[ C(t) - L^m(t) \geq 0. \]
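The load definitions and the overload test can be combined into a short sketch. It assumes that adequacy has been checked separately, represents the direction function ϕ as a caller-supplied Python callable, and uses an identity ϕ (every tile on a path is charged the full channel requirement) purely as a placeholder; the paper does not specify ϕ. All names and numbers are hypothetical.

```python
from typing import Callable, Dict, Tuple

Vector = Tuple[float, ...]
Phi = Callable[[str, str, str, Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    return tuple(x + y for x, y in zip(a, b))


def total_load(tile: str, alpha_pi: Dict[str, str], r_pi: Dict[str, Vector],
               alpha_gamma: Dict[Tuple[str, str], Tuple[str, ...]],
               r_gamma: Dict[Tuple[str, str], Vector], phi: Phi, dims: int) -> Vector:
    """Process load plus communication load on one tile."""
    load: Vector = (0.0,) * dims
    for process, t in alpha_pi.items():
        if t == tile:
            load = add(load, r_pi[process])  # r_pi[p] stands for R_pi(I(p))
    for channel, path in alpha_gamma.items():
        if tile in path:
            i = path.index(tile)  # paths are cycle-free, so the index is unique
            pred = path[max(i - 1, 0)]              # t itself if t = t_0
            succ = path[min(i + 1, len(path) - 1)]  # t itself if t = t_n
            load = add(load, phi(pred, tile, succ, r_gamma[channel]))
    return load


def no_tile_overloaded(capacity: Dict[str, Vector], alpha_pi, r_pi,
                       alpha_gamma, r_gamma, phi: Phi) -> bool:
    """True if C(t) - L(t) >= 0 in every dimension for every tile."""
    for tile, cap in capacity.items():
        load = total_load(tile, alpha_pi, r_pi, alpha_gamma, r_gamma, phi, len(cap))
        if any(c - l < 0 for c, l in zip(cap, load)):
            return False
    return True


def identity_phi(pred: str, tile: str, succ: str, req: Vector) -> Vector:
    return req  # placeholder: ignores the direction of the flow


# Hypothetical two-tile example with dimensions (compute, bandwidth).
capacity = {"a": (100.0, 10.0), "b": (50.0, 10.0)}
alpha_pi = {"p": "a", "q": "b"}
r_pi = {"p": (60.0, 0.0), "q": (40.0, 0.0)}
alpha_gamma = {("p", "q"): ("a", "b")}
r_gamma = {("p", "q"): (0.0, 4.0)}

load_a = total_load("a", alpha_pi, r_pi, alpha_gamma, r_gamma, identity_phi, 2)
adherent = no_tile_overloaded(capacity, alpha_pi, r_pi, alpha_gamma, r_gamma, identity_phi)
```

A real ϕ would select the port-specific components of the vector depending on the predecessor and successor tiles; the callable signature leaves room for that.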
4 Algorithm
Even when only considering the assignment of processes to a heterogeneous
multi-processor platform, we find a Generalized Assignment Problem (GAP) [10], which is
NP-complete. Considering the prohibitive complexity of exhaustive search, we propose
an application domain aware heuristic: hierarchical search with iterative refinement. We divide
the search process into steps, starting with a very coarse-grained perspective in the first step
and gradually adding more detail. At each step decisions are made that shrink the search
space in the next. Decisions made in previous steps are considered fixed in later steps.
As is to be expected of heuristics, this abstraction carries with it the danger that decisions
made in early steps, using very high level abstract information, lead to search spaces in
later steps that contain no feasible solutions. Since this infeasibility only comes to light in
later steps, we propose a strategy for iterative refinement. Figure 1 shows the hierarchical
decomposition into steps used in our run-time spatial mapping tool for heterogeneous
MPSoCs. We will now describe each of these steps in more detail.
1. Assign implementations to processes. The goal of the first step is to choose an
implementation (and thereby tile type) for every process, i.e. to choose I in m = 〈α, I〉.
By choosing I prior to $\alpha_\pi$, this step implies a contract for $\alpha_\pi$, i.e. inadequacy can be
prevented later on by limiting the choice of $\alpha_\pi(p)$ to tiles of type τ, where $I(p) \in I_\tau(p)$.
To prevent running into inadherence directly after this step, we only consider those
implementations for which an adhering mapping exists, i.e. that fit on at least one tile
in the system. Thus, we only consider I(p) = i when there is at least one tile t of type
τ, where $i \in I_\tau(p)$ and $C(t) - R_\pi(i) \geq 0$.
We go about this choice iteratively. The choice of the next process to pick an
implementation for is based on its desirability. The desirability of a process is the difference
between the cheapest assignment and the second cheapest assignment of the process
to a tile. In other words, the more expensive the alternative, the higher the desirability
of mapping the process now.
If a process has been chosen to be assigned next, not only do we choose this process’
implementation, we also map it to the first tile we come across with sufficient resources
(i.e. a first-fit packing). This guarantees that after this step (if this step manages to
map all processes), at least one $\alpha_\pi$ exists that does not break the adherence of m,
although m might still be inadherent when no $\alpha_\gamma$ exists.
2. Assign processes to tiles. From step one, we now have a chosen implementation for
every process. The (greedy) assignment of processes to specific tiles $\alpha_\pi$ obtained in the
previous step is now improved upon by taking more detail into account. Again, we
iteratively choose what to do based on desirability. In particular, for every iteration we
try, for every implementation, to remove it from the tile it is mapped onto and, by local
search, to map it onto the best available tile of the required type. The difference in cost
between the original mapping and the best tile found in the local search is, again, the
measure of desirability for choosing this reassignment. Only the best reassignment is
actually performed every iteration. Because a process can only be reassigned to a tile
with the same type as the tile it is already assigned to, this step maintains adequacy.
The previous step simply iterated until all processes were assigned to a tile (assigning
one process each iteration). Deciding when to stop at this step can be based upon a
minimum gain from iteration (once an iteration improves the total solution by a lesser
amount than a chosen threshold, we decide to stop) and/or by a maximum number
of iterations. Besides cost factors based solely on the mapping of a process to a tile,
an assignment should be awarded a bonus for proximity to the process’ neighbours
in the application graph. This stimulates locality, causing the communication routes,
assigned in the next step, to likely be short. Moreover, we again prevent immediate
inadherence in the next step, by only considering tiles for a process that have sufficient
communication resources to facilitate the process’ communication requirements, at
least locally.
3. Assign channels to paths. For the concrete realization of step three, the chan-
nels are sorted by non-increasing throughput. Next, iteratively for each channel, a
corresponding path is determined, taking into account the loads resulting from the
previously mapped channels.
The sorting is done to increase the probability that a heavily demanding channel gets
assigned a better path. In each iteration for a given channel, a shortest path between
the source and destination tile of the channel has to be determined, where only such
tiles are taken into account which still have enough capacity for the throughput
requirement of the current channel. Thus, an $\alpha_\gamma$ is constructed iteratively, never
overpacking communication capacities of a tile.
Adding $\alpha_\gamma$ to the $\alpha_\pi$ and I from the first two steps, the result of this step is an adherent
spatial mapping m = 〈α, I〉 where $\alpha = \langle \alpha_\pi, \alpha_\gamma \rangle$.
4. Check application constraints. The last step checks the global application
constraints. When any such constraint is violated, then m is infeasible and feedback should
be given to higher steps to try and improve upon those characteristics of the mapping
that violate the constraint(s). Should no constraint be violated, m is feasible. When
we decide we have time to look for improvements of this solution, possible points of
improvement should also be identified here and fed back into the first step (keeping
the current mapping in mind, should the feedback only result in infeasible results and
feasible mappings that are further removed from the optimum).
[Figure 2: Decomposition of a HiperLAN/2 receiver. Processes: prefix removal, frequency offset correction, inverse OFDM, equalization, phase offset correction and demapping (the last three grouped as “Remainder”). Edge labels along the data stream: 80, 64, 64, 52, 52, 48, b.]
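The desirability-driven choice of step one resembles a regret heuristic and can be sketched as below. This is a simplified illustration, not the authors' tool: capacities and requirements are reduced to a single scalar, the cost of an assignment depends only on the tile type, a process with a single remaining option is given infinite desirability, and all names and numbers are hypothetical.

```python
import math
from typing import Dict, List, Optional, Tuple


def step_one(costs: Dict[str, Dict[str, float]],
             demand: Dict[str, Dict[str, float]],
             tiles: List[Tuple[str, str]],
             capacity: Dict[str, float]) -> Optional[Dict[str, Tuple[str, str]]]:
    """Choose a tile type per process, most desirable process first.

    costs[p][tau]:  cost of the implementation of p for tile type tau
    demand[p][tau]: scalar resource requirement of that implementation
    tiles:          (tile, type) pairs in first-fit order
    capacity[tile]: free capacity of the tile
    Returns process -> (type, tile), or None if some process cannot be placed.
    """
    free = dict(capacity)
    todo = set(costs)
    result: Dict[str, Tuple[str, str]] = {}

    def first_fit(p: str, tau: str) -> Optional[str]:
        for tile, ttype in tiles:
            if ttype == tau and free[tile] - demand[p][tau] >= 0:
                return tile
        return None

    while todo:
        best = None  # (desirability, process, type, tile)
        for p in sorted(todo):
            # Only implementations that still fit on at least one tile are considered.
            options = sorted((cost, tau) for tau, cost in costs[p].items()
                             if first_fit(p, tau) is not None)
            if not options:
                return None
            # Desirability: second cheapest minus cheapest option.
            regret = options[1][0] - options[0][0] if len(options) > 1 else math.inf
            if best is None or regret > best[0]:
                tau = options[0][1]
                best = (regret, p, tau, first_fit(p, tau))
        _, p, tau, tile = best
        free[tile] -= demand[p][tau]  # first-fit packing keeps the load within capacity
        result[p] = (tau, tile)
        todo.remove(p)
    return result


# Hypothetical platform: one tile of type A with two slots, one tile of type B with one slot.
tiles = [("a0", "A"), ("b0", "B")]
capacity = {"a0": 2.0, "b0": 1.0}
costs = {"fft": {"A": 275.0, "B": 143.0}, "prefix": {"A": 60.0, "B": 32.0}}
demand = {"fft": {"A": 1.0, "B": 1.0}, "prefix": {"A": 1.0, "B": 1.0}}

mapping = step_one(costs, demand, tiles, capacity)
```

In this example `fft` is placed first because giving up its cheapest option would cost more than for `prefix`; `prefix` then falls back to type `A` because the only tile of type `B` is full.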
In general, the production of feedback immediately triggers a new iteration, to prevent
that multiple changes influence the mapping process. In other words, if any step fails to
find a satisfactory result, it will immediately generate feedback so that ‘higher’ steps may
generate a more suitable result.
It is important to realize that this proposed iterative hierarchical approach differs sig-
nificantly from simple local search methods and global-local search methods that are often
used in heuristics. The feedback from a lower level may result in a completely different
mapping on a higher level in a next iteration.
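Step three can be illustrated with a breadth-first search over tiles that still have enough free capacity. The sketch makes several assumptions that are not in the paper: communication capacity is one scalar per tile, every tile on a path (including source and destination) is charged the full throughput of the channel, at most one channel exists per pair of tiles, and path length is measured in hops.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple


def shortest_path(links: Dict[str, List[str]], free: Dict[str, float],
                  src: str, dst: str, need: float) -> Optional[Tuple[str, ...]]:
    """Shortest path (in hops) using only tiles with at least `need` free capacity."""
    if free[src] < need or free[dst] < need:
        return None
    prev: Dict[str, Optional[str]] = {src: None}
    queue = deque([src])
    while queue:
        tile = queue.popleft()
        if tile == dst:
            path = []
            cur: Optional[str] = tile
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return tuple(reversed(path))
        for nxt in links.get(tile, []):
            if nxt not in prev and free[nxt] >= need:
                prev[nxt] = tile
                queue.append(nxt)
    return None


def route_channels(channels: List[Tuple[str, str, float]],
                   links: Dict[str, List[str]],
                   capacity: Dict[str, float]) -> Optional[Dict[Tuple[str, str], Tuple[str, ...]]]:
    """Route channels (src tile, dst tile, throughput) by non-increasing throughput."""
    free = dict(capacity)
    routes: Dict[Tuple[str, str], Tuple[str, ...]] = {}
    for src, dst, need in sorted(channels, key=lambda c: c[2], reverse=True):
        path = shortest_path(links, free, src, dst, need)
        if path is None:
            return None  # no path with enough capacity: report back to an earlier step
        for tile in path:
            free[tile] -= need  # later channels see the reduced capacity
        routes[(src, dst)] = path
    return routes


# Hypothetical 2x2 mesh; tile t1 has less communication capacity than the others.
links = {"t0": ["t1", "t2"], "t1": ["t0", "t3"], "t2": ["t0", "t3"], "t3": ["t1", "t2"]}
capacity = {"t0": 10.0, "t1": 5.0, "t2": 10.0, "t3": 10.0}
routes = route_channels([("t1", "t2", 3.0), ("t0", "t3", 6.0)], links, capacity)
```

The heavier channel is routed first and avoids `t1`, which cannot carry it; the lighter channel then takes a path through tiles that still have capacity left.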
5 Case: HLAN/2
In order to illustrate the above with an example, an implementation and mapping of a
HLAN/2 receiver is described in this section.
5.1 Application Level Specification
The receiver’s decomposition into communicating processes is shown in the KPN in
Figure 2. The control part of the receiver application is included for completeness, but
it is not part of the data stream. Orthogonal Frequency Division Multiplexing (OFDM)
applications are based on frames of OFDM symbols, which, in turn, consist of samples
(complex numbers). In the HiperLAN/2 case, frames consist of 500 symbols and every
symbol consists of 80 32-bit complex numbers. The control part only comes into play
briefly at the beginning of each frame, while all other processes in the KPN operate on every
OFDM symbol.
The last three processes have been grouped to form one process. Not only do they fit
well together in a single implementation, but treating them separately needlessly lengthens
this example. The numbers shown on the edges of this KPN indicate the number of 32-bit
complex numbers per symbol coming in at each process. The size of the output of the
HiperLAN/2 receiver (b) depends on what ‘mode’ the receiver is in. The standard defines
seven modes, which only differ with regard to the demapping (hence the input from the
control process, which selects the demapping type). Depending on the chosen demapping
type, the output can be 2 bits (Binary Phase-Shift Keying, BPSK), 4 bits (Quadriphase-Shift
Keying, QPSK), 16 bits (Quadrature-Amplitude Modulation-16, QAM-16) or 64 bits (QAM-64)
per sample. Thus the minimum output is 12 bytes and the maximum is 384 bytes (per OFDM
symbol). One OFDM symbol is fed into the application once every 4µs.
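The output bounds follow from the number of samples that reach the demapping. As a worked check, assuming that the edge label 48 in Figure 2 is the number of samples per symbol entering the demapping (this reading of the figure is an assumption), and writing $b_{\min}$ and $b_{\max}$ for the two bounds on b:

```latex
\[
  b_{\min} = \frac{48 \times 2\,\text{bit}}{8\,\text{bit/byte}} = 12\ \text{bytes},
  \qquad
  b_{\max} = \frac{48 \times 64\,\text{bit}}{8\,\text{bit/byte}} = 384\ \text{bytes}
\]
\[
  \text{input rate} = \frac{80 \times 32\,\text{bit}}{4\,\mu\text{s}} = 640\ \text{Mbit/s}
\]
```

The second line is the raw input data rate implied by 80 samples of 32 bits arriving every 4µs.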
This is the Application Level Specification (ALS) of the HiperLAN/2 receiver. It serves as a
contract between the application and the implementations of processes. A lot of freedom
is left for the implementations, because a lot of behaviour has not been specified in the ALS.
Table 1: Available implementations
(Input, Output and Execution time are given per phase; see note a.)

Process                PE type  Input [token]        Output [token]       Execution time [clock cycles]  Avg. energy [nJ/symbol]
Prefix removal         $        〈8^2, 〈8, 0〉^8〉      〈0^2, 〈0, 8〉^8〉      〈18^18〉                        60
                       M        〈1^80, 0〉            〈0^17, 1^64〉         〈1^81〉                         32
Freq. off. correction  $        〈8, 0, 0〉            〈0, 0, 8〉            〈18, 32, 18〉                   62
                       M        〈1^64, 0^2〉          〈0^2, 1^64〉          〈1^66〉                         33
Inverse OFDM           $        〈64, 0, 0〉           〈0, 0, 64〉           〈66, 4250, 54〉                 275
                       M        〈1^64, 0^53〉         〈0^65, 1^52〉         〈1^64, 170, 1^52〉              143
Remainder              $        〈52, 0, b〉           〈0, 0, b〉            〈54, 2250, b + 2〉              140
                       M        〈1^52, 0, 0〉         〈0, 0, 1^b〉          〈1^52, 73 − b, 1^b〉            76

a We will use the notation 〈x^n, y^m〉 to denote n + m phases, where the value for the first n phases is x and for the last m
phases is y.
For example, the fact that an OFDM symbol is fed into the application every 4µs does not
imply there is one burst of size b at the output every 4µs. The output may be a continuous
stream, or it may still be bursty, but its periodicity may be less strict (i.e. output may show
jitter). Similarly, processes are semantically defined on OFDM symbols (in various states of
abridgement), but concrete implementations may very well work on a per sample basis, or,
conversely, on a group of symbols.
5.2 Implementations
Given a set of implementations of the processes in Figure 2, the spatial mapping
algorithm can now choose implementations, map them to concrete processors, route the
communication channels through the interconnect and construct a CSDF graph of the entire
receiver. The description of any implementation should include a CSDF graph, describing
its behaviour correctly and in as much detail as is relevant and possible. As stated above,
many processes can be described as a single CSDF actor. Table 1 lists implementations of the
processes in Figure 2. The phases described in the table are the phases of the CSDF actors
corresponding to the implementations. In this table the notation for the inputs 〈0, a, b^10〉
means: in phase 1, 0 tokens are read, in phase 2, a tokens are read, and in phase 3 through
12, b tokens are read at every phase (the superscript is a shorthand for the number of phases
with equal parameters). For example: the inverse OFDM implementation for PE type $ in
Table 1 has 3 phases; in phase 1, 64 tokens are read, 0 tokens are written and the WCET is
66 clock cycles; in phase 2, 0 tokens are read, 0 tokens are written and the WCET is 4250 cc;
in phase 3, 0 tokens are read, 64 tokens are written and the WCET is 54 cc.
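The superscript shorthand of Table 1 can be expanded mechanically into one value per phase. The following sketch uses a hypothetical encoding of the notation as (value, repeat count) pairs, where a value may itself be a tuple of phases that repeats as a group; the encoding and the function name are assumptions for illustration.

```python
from typing import List, Sequence, Tuple, Union

Run = Tuple[Union[int, Sequence[int]], int]


def expand(runs: Sequence[Run]) -> List[int]:
    """Expand (value, count) runs into a flat list with one entry per phase."""
    phases: List[int] = []
    for value, count in runs:
        group = list(value) if isinstance(value, (list, tuple)) else [value]
        phases.extend(group * count)
    return phases


# Prefix removal, first row of Table 1: two phases reading 8 tokens,
# then the group (8, 0) repeated 8 times.
prefix_in = expand([(8, 2), ((8, 0), 8)])
prefix_out = expand([(0, 2), ((0, 8), 8)])
prefix_wcet = expand([(18, 18)])

# Inverse OFDM, three-phase row of Table 1.
ofdm_in = expand([(64, 1), (0, 1), (0, 1)])
ofdm_wcet = expand([(66, 1), (4250, 1), (54, 1)])

tokens_read = sum(prefix_in)        # tokens consumed per full cycle of phases
tokens_written = sum(prefix_out)    # tokens produced per full cycle of phases
ofdm_cycles = sum(ofdm_wcet)        # worst-case clock cycles per full cycle of phases
```

Summing over the phases recovers the per-symbol totals: the prefix removal reads 80 and writes 64 tokens over 18 phases, and the three-phase inverse OFDM takes at most 4370 clock cycles.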
Control-flow has been omitted from this table, but will be taken into account in the
verification process. Also, only s using cache have been described here. It will be
shown later that s with Communication Assists (CAs) will have behaviour that requires
multiple actors.
Note that the input and output token counts are in terms of symbols (i.e. one token
corresponds to one complex number). In an actual description, the token size should be no
bigger than the smallest word-size of the hardware, to avoid having to translate between
a process emitting a ‘complex number’ and a piece of hardware taking in a ‘byte’. Also,
the execution times given in the table are in terms of clock cycles. Therefore, execution
times have to be normalized by taking into account the clock frequency of the processor
assigned to the implementation and, where applicable, scheduler settings need to be taken
into account to translate execution times to response times.
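The normalization described above amounts to simple arithmetic. The sketch below is an illustration under stated assumptions: the 100 MHz clock, the 32-bit symbol size and the 16-bit word size are hypothetical values that do not come from this paper, and the effect of scheduler settings on response times is not modelled.

```python
import math
from typing import List


def cycles_to_seconds(cycles: int, clock_hz: float) -> float:
    """Convert an execution time in clock cycles to seconds."""
    if clock_hz <= 0:
        raise ValueError("clock frequency must be positive")
    return cycles / clock_hz


def symbols_to_word_tokens(symbols: int, symbol_bits: int, word_bits: int) -> int:
    """Number of word-sized tokens needed to carry the given number of symbols."""
    return symbols * math.ceil(symbol_bits / word_bits)


# Per-phase WCETs of the inverse OFDM example, in clock cycles.
wcet_cycles: List[int] = [66, 4250, 54]

CLOCK_HZ = 100e6  # hypothetical clock frequency
wcet_seconds = [cycles_to_seconds(c, CLOCK_HZ) for c in wcet_cycles]

# Hypothetical sizes: one complex symbol of 32 bits carried over 16-bit words.
word_tokens = symbols_to_word_tokens(64, symbol_bits=32, word_bits=16)

print(wcet_seconds, word_tokens)
```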
5.3 Hardware
A few notes are required on the hardware of the test environment. The HiperLAN/2
receiver is mapped to an MPSoC consisting of s with cache, s with CAs and their
own Scratch Pad Memory (SPM), and Ms.
Caches allow for burst reads, based on the locality principle. In the case of streaming
applications, there is generally very little locality. However, reading ahead in the input
stream and caching that data does increase the speed of read operations. Since caches
are non-deterministic, it is very hard to guarantee any behaviour in the general case, but
when a process is assigned exclusive use of a cached processor, the behaviour may very
well become fully deterministic (locking of caches). In this text, caches are assumed
to cause occasional lucky speed-ups, but execution times remain subject to worst-case
assumptions.
Figure 3: MPSoC architecture
Figure 4: MPSoC layout
A  is essentially a local memory manager that can autonomously stream data into or
out of the  (the latter being a dual-port memory), i.e. independent of the  on the
same tile. Predominantly, s decouple communication from computation, at least from
the processor’s perspective. The required number of tokens for the next firing is known
to the , so it can gather that number of tokens before signalling the processor that the
input data is ready. With regards to the final model of an implementation, a  allows
communication to occur in parallel with computation, thus the  should be modelled with
a separate actor.
The interconnect consists of a NC with routers that provide guarantees with regards to
provided throughput andmaximum latency. All tiles have their ownNetwork Interface ()
to connect to theNC, but theNCsideof that interface is the same for every tile. The routers
in the NC have buffered inputs and round-robin arbitration on the output, which imposes
a maximum latency of 4 clock cycles. For brevity, a homogeneous NC is considered here,
implying that every step in a communication path has the same behaviour and, thus, the
same description in the  graph.
The architecture of the MPSC is summarized in Figure 3. It shows one instance of
each type of tile and the interconnecting NC. The hypothetical MPSC used for this
example has two Ms and two s. Of the latter, one has cache and the other is
communication assisted.
Figure 4 shows a possible MPSC layout with these specifications. The tiles without
labels in this figure are tiles of types not relevant to this example. The tile labeled ‘A/D’ is
the source of all the incoming data. The tile labeled ‘Sink’ is the tile that has to receive the
stream flowing out of the HLAN/2 receiver.
Step  CA        $         M1    M2    Cost  Remark
0     Pfx.rem.  Frq.off.  Inv.  Rem.  11    Initial (greedy) assignment
1     Frq.off.  Pfx.rem.  Inv.  Rem.  11    No improvement, revert
2     Pfx.rem.  Frq.off.  Rem.  Inv.  9     Improvement, keep
3     Frq.off.  Pfx.rem.  Rem.  Inv.  7     Improvement, keep
No further choices
Table 2: Processor assignment iterations
5.4 Mapping
The first step of the mapping process is to choose which implementation to use for each
process. This choice is iterative, i.e. an implementation is chosen for one process before
choosing an implementation for the next. The order in which implementations are chosen
is determined by desirability, defined as the difference between the cost of the best (i.e.
cheapest) implementation of a process and that of the next-best.
In this example, the ‘Inverse OFDM’ process is the most desirable. Thus, it is assigned
to its preferred processor type, being a M. Likewise, the ‘Remainder’ process is
assigned a M. At this point, both Ms are occupied, so the remaining
implementations for the M architecture can no longer be assigned a processor and
are ignored from here on. As such, both remaining
processes only have  implementations, which are thus chosen by default.
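The desirability-driven choice can be sketched as follows. This is a minimal illustration, not the algorithm of Section 4: the cost figures are invented, "P" is a placeholder name for the second processor type, and treating a process with a single remaining option as infinitely desirable is an assumption made here for simplicity.

```python
from typing import Dict, List, Tuple


def desirability(costs: Dict[str, float]) -> float:
    """Cost gap between the cheapest and the next-cheapest implementation."""
    ordered = sorted(costs.values())
    if len(ordered) < 2:
        return float("inf")  # assumption: a single option is taken by default
    return ordered[1] - ordered[0]


def choose_implementations(
    impl_costs: Dict[str, Dict[str, float]], capacity: Dict[str, int]
) -> List[Tuple[str, str]]:
    """Pick one processor type per process, most desirable process first."""
    free = dict(capacity)
    remaining = {p: dict(c) for p, c in impl_costs.items()}
    order: List[Tuple[str, str]] = []
    while remaining:
        # Ignore implementations for processor types that are fully occupied.
        for process in list(remaining):
            usable = {t: c for t, c in remaining[process].items() if free.get(t, 0) > 0}
            if not usable:
                raise RuntimeError(f"no processor left for {process}")
            remaining[process] = usable
        process = max(remaining, key=lambda p: desirability(remaining[p]))
        ptype = min(remaining[process], key=remaining[process].get)
        order.append((process, ptype))
        free[ptype] -= 1
        del remaining[process]
    return order


# Invented costs per processor type; lower is cheaper.
impl_costs = {
    "Pfx.rem.": {"M": 4, "P": 5},
    "Frq.off.": {"M": 3, "P": 5},
    "Inv.OFDM": {"M": 2, "P": 10},
    "Rem.": {"M": 1, "P": 6},
}
order = choose_implementations(impl_costs, capacity={"M": 2, "P": 2})
print(order)
```

With these invented costs the inverse OFDM and remainder processes claim both "M" processors first, after which the other two processes are left with "P" implementations only, mirroring the narrative above.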
In the second step, the implementations chosen in the first step need to be assigned to
specific processors. This step uses heuristics to look ahead towards communication, but does
not have exact knowledge of the status of the complete NoC. Since the first step already
constructed a greedy assignment to processors, pairs of assigned implementations can now
be swapped to find improvements. Improvements arise from having to communicate less
(probably, since exact routing is not known here) and from being able to turn off tiles.
Since multi-tasking processors are not considered in this example, minimizing the
number of processors in use (so that unused processors can be switched off) cannot improve
a mapping here. As a look-ahead heuristic, the Manhattan distance is used to estimate how much
a channel’s communication will cost. The total cost of assigning an implementation to a
processor is the sum of the Manhattan distances of all the implementation’s incoming and
outgoing channels.
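The Manhattan look-ahead can be written down directly. In the sketch below the tile coordinates are hypothetical (they are not read from Figure 4) and the function names are introduced here for illustration. The channel list follows the processing chain from A/D to Sink.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]
Channel = Tuple[str, str]  # (producer, consumer)


def manhattan(a: Coord, b: Coord) -> int:
    """Manhattan distance between two tile coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def implementation_cost(actor: str, channels: List[Channel], position: Dict[str, Coord]) -> int:
    """Sum of the distances of all incoming and outgoing channels of one actor."""
    return sum(
        manhattan(position[src], position[dst])
        for src, dst in channels
        if actor in (src, dst)
    )


def application_cost(channels: List[Channel], position: Dict[str, Coord]) -> int:
    """Sum of the distances of all channels of the application."""
    return sum(manhattan(position[src], position[dst]) for src, dst in channels)


channels = [
    ("A/D", "Pfx.rem."),
    ("Pfx.rem.", "Frq.off."),
    ("Frq.off.", "Inv.OFDM"),
    ("Inv.OFDM", "Rem."),
    ("Rem.", "Sink"),
]

# Hypothetical coordinates on a 3x3 grid of tiles.
position = {
    "A/D": (0, 2),
    "Pfx.rem.": (2, 2),
    "Frq.off.": (0, 0),
    "Inv.OFDM": (2, 1),
    "Rem.": (2, 0),
    "Sink": (0, 1),
}

print(implementation_cost("Frq.off.", channels, position))
print(application_cost(channels, position))
```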
Table 2 shows the iterations that lead to the final assignment of implementations to
processors. Swaps can, of course, only occur between processors of the same type. The sum
of all Manhattan distances of the application (the cost column) can increase or remain the
same in an iteration. When this happens, that choice is rejected and another mutation
of the previous assignment is evaluated. Currently, the algorithm commits to the first
improvement it finds, i.e. when an iteration decreases the total cost, it is never revoked.
This can result in a local optimum that is not globally optimal.
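A first-improvement swap search of this kind might look as follows. The sketch is an assumption-based illustration rather than the mapper's actual code: tile names, tile coordinates and the placeholder type "P" are invented, so the costs differ from those in Table 2.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Coord = Tuple[int, int]


def total_cost(channels: List[Tuple[str, str]], placement: Dict[str, str],
               tiles: Dict[str, Coord]) -> int:
    """Sum of Manhattan distances over all channels for a given placement."""
    cost = 0
    for src, dst in channels:
        (x1, y1), (x2, y2) = tiles[placement[src]], tiles[placement[dst]]
        cost += abs(x1 - x2) + abs(y1 - y2)
    return cost


def improve_by_swaps(placement: Dict[str, str], movable: List[str],
                     tile_type: Dict[str, str], tiles: Dict[str, Coord],
                     channels: List[Tuple[str, str]]) -> Tuple[Dict[str, str], int]:
    """Swap pairs on same-type tiles; commit to the first improvement found."""
    current = dict(placement)
    best = total_cost(channels, current, tiles)
    improved = True
    while improved:
        improved = False
        for a, b in combinations(movable, 2):
            if tile_type[current[a]] != tile_type[current[b]]:
                continue  # swaps only between processors of the same type
            current[a], current[b] = current[b], current[a]
            cost = total_cost(channels, current, tiles)
            if cost < best:
                best, improved = cost, True
                break  # keep the swap and restart the scan
            current[a], current[b] = current[b], current[a]  # revert
    return current, best


# Hypothetical layout and greedy starting point.
tiles = {"A/D": (0, 0), "Sink": (0, 2), "P1": (2, 0), "P2": (0, 1),
         "M1": (2, 2), "M2": (1, 1)}
tile_type = {"P1": "P", "P2": "P", "M1": "M", "M2": "M"}
channels = [("A/D", "Pfx.rem."), ("Pfx.rem.", "Frq.off."), ("Frq.off.", "Inv.OFDM"),
            ("Inv.OFDM", "Rem."), ("Rem.", "Sink")]
initial = {"A/D": "A/D", "Sink": "Sink", "Pfx.rem.": "P1", "Frq.off.": "P2",
           "Inv.OFDM": "M1", "Rem.": "M2"}
movable = ["Pfx.rem.", "Frq.off.", "Inv.OFDM", "Rem."]

initial_cost = total_cost(channels, initial, tiles)
final, final_cost = improve_by_swaps(initial, movable, tile_type, tiles, channels)
print(initial_cost, final_cost, final)
```

Because every committed swap strictly lowers the cost, the loop terminates, but nothing prevents it from stopping in a local optimum, as noted above.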
As a last step in the mapping process, step three performs incremental routing. This
means the channels from the  are routed incrementally with a point-to-point shortest
path algorithm. Only those lanes in the NoC that can guarantee sufficient throughput are
considered. Figure 5 shows the resulting CSDF graph. This graph can be checked (with [15])
to see whether the throughput of the mapping suffices to meet the constraints laid down
in the  (step 4 of the spatial mapper). The buffer sizes Bi are calculated by the algorithm
used in step 4. When they are smaller than the buffers reserved by the implementations,
no further action is required. When they are larger, an attempt should be made to allocate
the additional required buffer size on the tile the consuming actor is mapped onto. If this
additional buffer capacity cannot be allocated, the mapping is infeasible and the spatial
mapper should iterate. The buffer capacity of the Sink actor (x) is fixed by the specification
of Sink.
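Incremental routing over lanes with sufficient throughput can be sketched with a breadth-first search. Everything concrete in this fragment is an assumption: the 2x2 mesh, the lane capacity of 10 units and the channel demands are invented, and breadth-first search stands in for whichever shortest path algorithm the mapper actually uses.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Node = Tuple[int, int]
Link = Tuple[Node, Node]


def mesh_links(width: int, height: int, capacity: int) -> Dict[Link, int]:
    """Directed links of a width x height mesh, each with the same free capacity."""
    free: Dict[Link, int] = {}
    for x in range(width):
        for y in range(height):
            for nx, ny in ((x + 1, y), (x, y + 1)):
                if nx < width and ny < height:
                    free[((x, y), (nx, ny))] = capacity
                    free[((nx, ny), (x, y))] = capacity
    return free


def route(src: Node, dst: Node, free: Dict[Link, int], demand: int) -> Optional[List[Node]]:
    """Shortest path from src to dst using only links with enough free capacity."""
    parent: Dict[Node, Optional[Node]] = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for (a, b), cap in free.items():
            if a == node and b not in parent and cap >= demand:
                parent[b] = node
                queue.append(b)
    if dst not in parent:
        return None
    path: List[Node] = []
    step: Optional[Node] = dst
    while step is not None:
        path.append(step)
        step = parent[step]
    return path[::-1]


def route_incrementally(channels: Dict[str, Tuple[Node, Node, int]],
                        free: Dict[Link, int]) -> Dict[str, List[Node]]:
    """Route channels one by one, reserving capacity on every link used."""
    routes: Dict[str, List[Node]] = {}
    for name, (src, dst, demand) in channels.items():
        path = route(src, dst, free, demand)
        if path is None:
            raise RuntimeError(f"channel {name} cannot be routed")
        for a, b in zip(path, path[1:]):
            free[(a, b)] -= demand
        routes[name] = path
    return routes


free = mesh_links(2, 2, capacity=10)
channels = {"c1": ((0, 0), (1, 0), 8), "c2": ((0, 0), (1, 0), 5)}
routes = route_incrementally(channels, free)
print(routes)
```

In this invented example the second channel is forced onto a detour because the first channel has already reserved most of the direct lane, which is the behaviour incremental routing is meant to capture.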
6 Conclusions and future work
This paper presented a formal model of spatial mapping. The formal definitions of
adequacy and adherence give testable criteria for spatial mappings. The notion of feasibility
can only be defined formally if the constraints of the application are defined formally as
well. An application-independent formalization of application constraints and feasibility is
considered future work.
Figure 5: Final CSDF graph (production and consumption rates omitted to prevent clutter)
The algorithm we presented earlier has now been shown to implement the formalism
presented here. Other algorithms designed for the purpose of spatial mapping can
now be related to the algorithm in this paper by relating them to the formalism.
Optimization objectives have not been treated formally in this paper. Future work
should include a formalization thereof, so that different algorithms for spatial mapping can
be analysed and compared qualitatively. Moreover, the formalism assumes point-to-point
connections (NoC), but should be extended to deal with any kind of interconnect.
References
[1] Arthur Abnous. Low-Power Domain-Specific Processors for Digital Signal Processing. PhD
thesis, University of California, Berkeley, 2001.
[2] G. Bilsen, M. Engels, R. Lauwereins, and J. Peperstraete. Cycle-static dataflow. IEEE
Transactions on Signal Processing, 44(2):397–408, 1996.
[3] Tilera Corporation. Tile64™ processor product brief. Corporate product brief.
[4] William James Dally, Ujval J. Kapasi, Brucek Khailany, Jung Ho Ahn, and Abhishek
Das. Stream processors: Programmability and efficiency. Queue, 2(1):52–62, 2004.
[5] Giovanni De Micheli and Luca Benini. Networks on chip: A new paradigm for systems
on chip design. In DATE ’02: Proceedings of the conference on Design, automation and test
in Europe, page 418, Washington, DC, USA, 2002. IEEE Computer Society.
[6] Paul M. Heysters. Coarse-Grained Reconfigurable Processors – Flexibility meets Efficiency.
PhD thesis, University of Twente, Enschede, The Netherlands, sep 2004.
[7] James A. Kahle, Michael N. Day, H. Peter Hofstee, Charles R. Johns, Theodore R.
Maeurer, and David Shippy. Introduction to the Cell multiprocessor. IBM Journal of
Research and Development, 49(4/5):589–604, July/September 2005.
[8] Nikolay Kavaldjiev. A run-time reconfigurable Network-on-Chip for streaming DSP
applications. PhD thesis, University of Twente, 2006.
[9] E.A. Lee and D.G. Messerschmitt. Synchronous dataflow. Proceedings of the IEEE,
75(9):1235–1245, September 1987.
[10] Silvano Martello and Paolo Toth. Knapsack problems: algorithms and computer
implementations. John Wiley & Sons, Inc., 1990.
[11] T. Ojanpera and R. Prasad. An overview of air interface multiple access for
IMT-2000/UMTS. IEEE Commun. Mag., 36(9):82–95, September 1998.
[12] G.J.M. Smit, Andre B.J. Kokkeler, Pascal T. Wolkotte, Philip K.F. Hölzenspies, Marcel D.
van de Burgwal, and Paul M. Heysters. The Chameleon architecture for streaming DSP
applications. EURASIP Journal on Embedded Systems, 2007:78082, 2007.
[13] L.T. Smit, J.L. Hurink, and G.J.M. Smit. Run-time mapping of applications to a
heterogeneous SoC. In Proceedings of the 2005 International Symposium on System-on-Chip,
pages 78–81, November 2005.
[14] Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat, Ben
Greenwald, Henry Hoffman, Paul Johnson, Jae-Wook Lee, Walter Lee, Albert Ma,
Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman
Amarasinghe, and Anant Agarwal. The raw microprocessor: A computational fabric
for software circuits and general-purpose programs. IEEE Micro, 22(2):25–35, 2002.
[15] Maarten Wiggers, Marco Bekooij, and G.J.M. Smit. Efficient computation of buffer
capacities for cyclo-static dataflow graphs. In DAC ’07: Proceedings of the 44th annual
conference on Design automation, pages 658–663, New York, NY, USA, 2007. ACM Press.