Embedded systems are often designed as complex architectures with numerous processing elements. Effectively programming such systems requires parallel programming models, e.g., task-based or dataflow-based models. With these types of models, the mapping of the abstract application model to the existing hardware architecture plays a decisive role and is usually optimized to achieve an ideal resource footprint or a near-minimal execution time. However, when mapping several independent programs to the same platform, resource conflicts can arise. This can be circumvented by remapping some of the tasks of an application, which in turn affects its timing behavior, possibly leading to constraint violations. In this work we present a novel method to compute mappings that are robust against local task remapping. The underlying method is based on the bio-inspired design centering algorithm $L_p$-Adaptation. We evaluate this with several benchmarks on different platforms and show that mappings obtained with our algorithm are indeed robust. In all experiments, our robust mappings tolerated significantly more run-time perturbations without violating constraints than mappings devised with optimization heuristics.
INTRODUCTION
Multi- and manycore architectures have permeated the majority of today's embedded systems. Examples include ARM big.LITTLE systems [17], the Qualcomm Snapdragon family of processors [31], Texas Instruments Keystone [3] or the He-P2012 platform [8]. The quick spread of parallel architectures took tool providers by surprise and software development became a major concern. Several programming models have been proposed in academia to solve the parallel programming problem. Popular examples in the embedded domain are dataflow-based or task-based programming models, particularly well-suited for streaming and multimedia applications. These models have in common that applications are represented as a graph of interacting entities (i.e., graph nodes) that communicate over statically defined channels (i.e., graph edges). In dataflow graphs or process networks, the entities cannot communicate over shared memory, which makes these models appropriate for distributed memory architectures. Many languages and models exist to represent such graphs, e.g., Cal [13], YAPI [11], C for process networks [38] or the Distributed Object Layer (DOL) [45].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. SCOPES'17
With applications represented as graphs, programmers face the mapping problem, i.e., which hardware resources should be used to run computation and to implement communication (see Figure 1). The mapping problem has been thoroughly studied in the embedded domain, e.g., for performance and soft real-time [4, 6, 14, 21, 24, 29], for energy efficiency [25] or for reliability [10] (see also overviews in [26, 41]). Most of the approaches seek to find a near-optimal static mapping for a single application at compile time, mostly for ensuring time predictability. Authors have also looked into the problem of dealing with multiple applications competing for resources. Scenarios have been used to characterize the way applications may interact [34]. Spatial isolation, in which applications are given shapes of the hardware, has also been proposed to provide time predictability in the presence of multiple applications [48].

SCOPES'17, 2017, G. Hempel et al. (Gerald Hempel, Andrés Goens, Jeronimo Castrillon, Josefine Asmus, and Ivo F. Sbalzarini)
Most previous work has focused on computing a fixed mapping, which is enforced by a runtime manager or by strict spatial isolation. The underlying assumption is that nothing unpredictable may modify the mapping decisions at runtime. This can be achieved in bare-metal implementations, but is less probable in higher-end embedded systems that run a full-fledged mainstream operating system and unpredictable workloads. In this paper, we look at how to compute mappings that not only meet real-time constraints but are also robust to slight variations at runtime, e.g., re-mapping decisions made by the operating system. To find a robust mapping, one must not only find a near-optimal solution using an optimization process, but also characterize the "volume" of feasible solutions around that one solution. Intuitively speaking, the larger the volume, the more robust the solution is. To determine robust mappings we apply design centering, a technique known for the design of feasible circuits given parameter variations in individual components [16]. In particular, we use a recent bio-inspired algorithm that is well-suited for non-convex feasibility spaces. Additionally, it returns a quality measure based on the estimated volume along with the design center [2, 27].
As main contribution, this paper analyzes for the first time design centering algorithms for the mapping problem. In particular, we study the applicability of design centering approaches to compute mappings that are robust to local re-mapping decisions. We introduce algorithmic modifications to make design centering applicable to the characteristics of the design space of mapping problems. We analyze benchmark applications from the streaming and multimedia domains on two different multicores with different characteristics, using a state-of-the-art mapping flow for applications represented as Kahn Process Networks (KPNs). We report promising results that demonstrate the higher robustness of the mapping computed with design centering. More specifically, we show that robust mappings for most applications tolerate 70% to 98% of the variations, as opposed to an average of ≈ 50% for mappings computed with standard heuristics. We believe these results open more possibilities to adapt the problem to compute mappings that are robust to other types of variations, most notably, input-dependent execution characteristics. Even though we used KPNs as programming model, the approach is applicable to other similar programming models as well.
The rest of this paper is organized as follows. Section 2 provides background on design centering algorithms and introduces the programming flow used in this paper, whereas Section 3 provides the specifics of the $L_p$-Adaptation algorithm we used. Then, Section 4 describes how we applied design centering to the mapping problem. The method is evaluated in Section 5 and related approaches are treated in Section 6. Finally, we draw conclusions and discuss future work in Section 7.
BACKGROUND
As mentioned above, in this paper we apply design centering to the mapping problem (cf. Figure 1). Before explaining the proposed approach, this section provides details about the mapping problem, as well as the concrete tool flow used and a brief introduction to design centering approaches.
KPN mapping flow
Out of many mapping frameworks, we use the parallel flow of the MAPS framework [5], now commercially available in the SLX Tool Suite from Silexica [39]. This state-of-the-art flow, to which we will refer as the mapper in the following, includes a fast internal simulator for a variety of multicore platforms. Having a fast simulator is key for applying design centering, as will be discussed in Section 4. Additionally, the mapper supports an expressive parallel programming model, based on KPNs, that can represent more applications than static dataflow models such as Synchronous Data Flow. An overview of the programming flow is shown in Figure 2. The major components are discussed in the following.
The mapper receives a KPN application written in the language "C for Process Networks" [38]. In a KPN graph, nodes are computational processes that exchange data only through FIFO channels (the edges of the graph) using atomic data items called tokens. Processes may have an arbitrary control flow, holding internal state, and accessing input and output channels in a data-dependent fashion (i.e., not statically analyzable at compile time). Extra inputs to the mapper are an abstract model of the target platform and user-defined constraints. The latter include real-time and resource constraints, among others. The model of the target platform contains a specification of the processing elements, the interconnect and the memory architecture (including latencies and communication bandwidths).
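The KPN semantics described above (processes that interact only through FIFO channels with blocking reads) can be illustrated with a minimal Python sketch. It is not the "C for Process Networks" language used by the mapper; the process names, the `None` end-of-stream marker, and the use of threads are assumptions made purely for illustration.

```python
import queue
import threading


def source(out_ch, items):
    """Producer process: writes one token per item, then an end marker."""
    for item in items:
        out_ch.put(item)
    out_ch.put(None)  # end-of-stream marker (an assumption of this sketch)


def doubler(in_ch, out_ch):
    """Worker process: blocking read, compute, write."""
    while True:
        token = in_ch.get()  # blocks until a token is available
        if token is None:
            out_ch.put(None)
            return
        out_ch.put(2 * token)


def sink(in_ch, results):
    """Consumer process: collects tokens until the end marker arrives."""
    while True:
        token = in_ch.get()
        if token is None:
            return
        results.append(token)


def run_network(items):
    """Wire source -> doubler -> sink with unbounded FIFO channels and run it."""
    ch_a, ch_b = queue.Queue(), queue.Queue()
    results = []
    procs = [
        threading.Thread(target=source, args=(ch_a, items)),
        threading.Thread(target=doubler, args=(ch_a, ch_b)),
        threading.Thread(target=sink, args=(ch_b, results)),
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return results
```

Because every process only blocks on reads and the channels preserve order, the result does not depend on how the threads are scheduled, which is the property that makes such networks attractive for static mapping.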
The mapper uses execution traces from profiling runs to capture the runtime behavior of the processes. Based on the traces, several heuristics are available to compute a mapping [6]. The mapping includes an assignment of processes to processors, a scheduling policy for processes running on the same processor, a mapping of logical communication channels to physical resources, and sizes for all buffers. Broadly speaking, the heuristics iterate to improve the mapping and to try to meet the user constraints. To evaluate the quality of a mapping and whether the mapping is feasible, the mapper uses an internal discrete-event simulator. The simulator fetches cost models from the architecture description and replays the traces according to the suggested mapping, taking runtime overheads into account. It returns a Gantt chart of the execution, buffer utilization statistics, and an estimated energy consumption, among others.

Robust Mapping of Process Networks to Many-Core Systems using Bio-Inspired Design Centering, SCOPES '17, 2017

Once a feasible mapping has been found, the mapper exports it in a so-called mapping configuration. This descriptor is then passed to the back-end, which generates code accordingly.
Design Centering
Design centering is a long-standing and central problem in systems engineering. It is concerned with determining design parameters of a system that guarantee operation within given specifications and are robust against parameter variations. While design optimization aims to determine the design that best fulfills (one aspect of) the specifications, design centering aims to find the design that meets the specifications most robustly. Traditionally, this problem has been considered in electronic circuit engineering [16], where a typical task is to determine the nominal values of electronic components (e.g., resistances, capacitances, etc.) such that the circuit fulfills some specifications and is robust against manufacturing tolerances in the components. An illustration for a two-dimensional design space, i.e., with two parameters, is shown in Figure 3. The figure also contains the contour of a fictive cost function and two regions with sets of feasible solutions (marked with dashed lines). While optimization is concerned with finding the optimum (red cross in the figure), design centering would opt for the solution that allows more variation in the parameters without leaving the feasible set (the black cross in the figure).
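The difference between the optimum and the design center can be made tangible with a toy example on a small integer grid. This is only an illustrative sketch: the feasible set, the cost function, and the neighbour-counting robustness measure are invented here and are not the statistical method used in the paper.

```python
def manhattan_neighbours(point, radius=1):
    """All grid points within Manhattan distance `radius`, excluding the point itself."""
    x, y = point
    return [(x + dx, y + dy)
            for dx in range(-radius, radius + 1)
            for dy in range(-radius, radius + 1)
            if 0 < abs(dx) + abs(dy) <= radius]


def robustness(point, feasible, radius=1):
    """Fraction of neighbouring designs that are still feasible."""
    neighbours = manhattan_neighbours(point, radius)
    return sum(n in feasible for n in neighbours) / len(neighbours)


# Toy feasible set (an assumption): a 3x3 block plus one isolated feasible point.
FEASIBLE = {(x, y) for x in range(3) for y in range(3)} | {(5, 5)}


def cost(point):
    """Toy cost function (an assumption) whose minimum lies at the isolated point."""
    return abs(point[0] - 5) + abs(point[1] - 5)


optimum = min(FEASIBLE, key=cost)
center = max(FEASIBLE, key=lambda p: robustness(p, FEASIBLE))
```

In this toy setting the optimizer picks the isolated point, whose every neighbour is infeasible, whereas the centering criterion picks the middle of the block, where any single-step variation stays feasible.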
There are usually many designs that fulfill the specifications. Each design is described by a number $n$ of design parameters and can hence be interpreted as a vector in $\mathbb{R}^n$. The region (subspace) of the parameter space that contains the designs for which the system meets the specifications is called the feasible region $A \subset \mathbb{R}^n$ (see the disjoint regions in Figure 3). Depending on available side-information about design specifications, different operational definitions of the design center $m \in A$ exist, including the nominal design center, the worst-case design center, and the process design center [37]. Here, we follow the statistical definition of the design center [23] and seek among all parameter vectors $x \in A$ the design center $m \in A$ that represents the mean of a probability distribution $p(x)$ of maximal volume covering the feasible region $A$ with a given target hitting probability $P$. For convex feasible regions, using the uniform probability distribution over $A$ and $P = 1$, the design center coincides with the geometric center of the feasible region (corresponding to the crosses in Figure 3).
One distinguishes between geometrical and statistical approaches to design centering. Geometrical approaches use simple bodies to approximate the feasible region, which is usually assumed to be convex [12, 35, 36]. Statistical approaches explore the feasible region by Monte-Carlo sampling. Since exhaustive sampling is not feasible in high dimensions, the key ingredient of statistical methods is a smart sampling proposal to find, and concentrate on, informative regions [42] [43] [44]. Most of the existing methods assume a convex feasible region, e.g., [36, 43], or differentiability of the specifications, e.g., [46]. Others require an explicit probabilistic model of the variations in the design parameters [35]. For the mapping problem dealt with in this paper, we cannot assume differentiability or convexity.
Therefore, we use a recently proposed statistical method called $L_p$-Adaptation. This method supports non-convex disconnected regions (like in Figure 3), and can handle arbitrary specification constraints as long as they can be decided for a given candidate design. In addition, the method also retrieves the radius and an estimated volume, providing an idea of how robust the design center is. The concrete algorithm is described in the next section.
$L_p$-ADAPTATION ALGORITHM
$L_p$-Adaptation is an algorithm inspired by how robustness has evolved in biological networks, such as cell signaling networks, blood vasculature networks, and food chains [22]. It samples candidate designs from $L_p$-balls as proposal distributions, which are dynamically adapted based on the sampling history. Intuitively speaking, an $L_p$-ball is an $n$-dimensional ellipsoid according to a particular norm, the $L_p$ norm. The dynamic affine adaptation of the $L_p$-balls is based on the concept of Gaussian Adaptation (GaA) [23], which continuously adapts the mean and the covariance matrix (describing correlations and scaling between the parameters) of a Gaussian proposal based on previous sampling success.
Combining the adaptation concept of GaA with the use of $L_p$-balls [37] as non-Gaussian proposals, $L_p$-Adaptation provides both efficient design centering and robust volume approximation. $L_p$-Adaptation draws samples uniformly from an $L_p$-ball and iteratively adapts the mean and covariance matrix of an affine mapping applied to the balls. Importantly, the target hitting probability $P$, i.e., the probability of hitting the feasible region $A$ with a sample, is controlled as described below. The design center is then approximated by the mean of the final $L_p$-ball $B$, and the volume estimate is of the form $\mathrm{vol}(A) \approx P \cdot \mathrm{vol}(B)$, where $\mathrm{vol}(B)$ is the volume of the current $n$-dimensional $L_p$-ball $B$. For improved sampling and adaptation efficiency, $L_p$-Adaptation uses an adaptive multi-sample strategy [18] that is considered state-of-the-art in bio-inspired optimization [19].
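The volume estimate $\mathrm{vol}(A) \approx P \cdot \mathrm{vol}(B)$ can be tried out for a fixed, non-adapted $L_1$-ball with a short Python sketch. The sampler below uses the standard construction for uniform sampling from the unit $L_1$-ball (Laplace-distributed coordinates normalized by their absolute sum plus an extra exponential variable); the function names, the fixed center and radius, and the omission of the covariance adaptation are all assumptions of this sketch, not the authors' implementation.

```python
import math
import random


def sample_unit_l1_ball(n, rng):
    """Draw one point uniformly from the n-dimensional unit L1-ball."""
    g = [rng.choice((-1.0, 1.0)) * rng.expovariate(1.0) for _ in range(n)]
    z = rng.expovariate(1.0)
    scale = sum(abs(v) for v in g) + z
    return [v / scale for v in g]


def l1_ball_volume(n, radius=1.0):
    """Volume of an n-dimensional L1-ball: (2r)^n / n!."""
    return (2.0 * radius) ** n / math.factorial(n)


def estimate_volume(oracle, center, radius, samples, rng):
    """Return (empirical hitting probability, hitting probability * vol(B))."""
    n = len(center)
    hits = 0
    for _ in range(samples):
        u = sample_unit_l1_ball(n, rng)
        x = [c + radius * v for c, v in zip(center, u)]
        hits += bool(oracle(x))
    p_hat = hits / samples
    return p_hat, p_hat * l1_ball_volume(n, radius)
```

For example, with the half-plane oracle `lambda x: x[0] >= 0` and a unit ball centered at the origin in two dimensions, the empirical hitting probability should be close to 0.5 and the estimated volume close to 1. Strictly speaking this estimates the volume of the part of the feasible region covered by the ball, which is why the choice of the hitting probability matters.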
$L_p$-Adaptation can be interpreted as a synthetic evolutionary process that tries to maximize the robustness, rather than the fitness, of the underlying system. Robustness is measured in terms of the volume $\mathrm{vol}(B)$ of an $n$-dimensional $L_p$-ball $B$, of which a certain fraction $P$ (i.e., the target hitting probability) overlaps with the feasible region $A$. This ball is called an $L_p$-ball since it is part of the Banach space $L_p^n = (\mathbb{R}^n, \|\cdot\|_p)$. This basically means that we endow the vector space $\mathbb{R}^n$ with a norm different from the Euclidean one, namely
$$\|x\|_p = \Big(\sum_{i=1}^{n} |x_i|^p\Big)^{1/p}.$$
Here, $m \in \mathbb{R}^n$ denotes the center of the ball, and $C \in S_+^{n \times n}$ is a symmetric positive-definite (covariance) matrix defining the linear map for scaling and rotation of the $L_p$-ball $B$.
$L_p$-Adaptation then seeks to maximize the following criterion (Problem 1): the volume $\mathrm{vol}(B)$ over $m$ and $C$, subject to the fraction $P$ of $B$ overlapping with the feasible region $A$.
The hypervolume of an $L_p$-ball is completely determined by the volume of the unit $L_p$-ball (with zero mean and $n$-dimensional identity matrix $C = I$) and the determinant of the matrix $C$. Thus, the robustness criterion can be rewritten as a non-convex log-det maximization problem (Problem 2): maximize $\log\det C$ over $m$ and $C$ under the same overlap constraint.
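As a sketch of how the quantities in this section fit together, one possible formalization is given below. It rests on assumptions that are not taken from the source: the ball is written as the affine image of the unit $L_p$-ball under $C^{1/2}$, the overlap requirement is written as an equality, and the notation $B(m, C)$ is introduced only for this sketch.

$$
\begin{aligned}
B(m, C) &= \{\, x \in \mathbb{R}^n : \|C^{-1/2}(x - m)\|_p \le 1 \,\}, \\
\operatorname{vol}(B(m, C)) &= \det(C)^{1/2} \cdot \operatorname{vol}(B(0, I)), \\
\text{(1)}\quad \max_{m,\,C}\ \operatorname{vol}(B(m, C)) \quad &\text{s.t.} \quad \operatorname{vol}(A \cap B(m, C)) = P \cdot \operatorname{vol}(B(m, C)), \\
\text{(2)}\quad \max_{m,\,C}\ \log\det C \quad &\text{s.t.} \quad \operatorname{vol}(A \cap B(m, C)) = P \cdot \operatorname{vol}(B(m, C)).
\end{aligned}
$$

Under this parameterization, $\log \operatorname{vol}(B(m, C)) = \tfrac{1}{2}\log\det C + \text{const}$, so both problems have the same maximizers, which is consistent with the rewriting as a log-det problem.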
This objective function provides a natural non-convex extension of the maximum inscribed ellipsoid method [37]. For instance, if $A$ is a convex polyhedron with known parameterization and $P = 1$, then Problem 2 is a convex problem that can be efficiently solved using interior point methods. However, in the general case, no efficient algorithms are known to solve Problem 2. $L_p$-Adaptation approximately solves this problem by using a synthetic evolutionary process consisting of the four steps Initialization, Sampling, Evaluation, and Adaptation, which are repeated in iterations until a stopping criterion is fulfilled.
It is, however, important to properly tune the target hitting probability $P$ of $L_p$-Adaptation to the task at hand. The hitting probability must be neither too low nor too high. Low hitting probabilities lead to low sampling efficiencies. High hitting probabilities lead to slower adaptation to the feasible region, which may prevent exploring remote parts of the feasible region. For a Gaussian proposal and a convex feasible region, a hitting probability of $\exp(-1)$ is information-theoretically optimal [23]. When sampling uniformly from $L_p$-balls over non-convex regions, no such result is available. We hence manually adapt the hitting probability depending on the task at hand, starting from $\exp(-1)$ as an initial value.
DESIGN CENTERING AND MAPPING
In this section we explain how we adapted the design centering (DC) algorithm to create robust mappings of KPN applications to manycore systems.
Integration into KPN Tool-Flow
In order to use the DC algorithm to create and evaluate mappings, we have to integrate the DC algorithm into the KPN tool flow as shown in Figure 4. For the integration, it is required to import mappings generated by the DC algorithm into the mapping flow. Similarly, the results of the discrete-event simulator of the mapping flow (recall Figure 2) must be interpreted by the oracle, that is, the function in the DC algorithm that decides, for a given point $x$ in the design space, whether $x$ is feasible, i.e., whether $x \in A$.
As a first step (I), an initial mapping for a CPN application onto the target architecture is generated by the mapper. This is necessary since the starting point of the DC algorithm has to be a member of the feasible set $A$.
The initial mapping is then simulated (II) using a representative dataset as input. From the simulation results, it is possible to check for a variety of constraints. In this paper we restrict ourselves to checking the total execution time, but the current constraint could easily be replaced by a more complex evaluation, e.g., the maximum execution time for different input data sets. Afterwards, the resulting execution time is fed to the oracle function (III), which triggers the DC algorithm to generate a new parameter set. In order to simulate the determined parameters, they must be interpreted as a mapping and adapted to the given architecture (IV). This adaptation is carried out by the oracle function. The extracted parameter description includes only the assignment of the cores to the respective processes. This transfer of mapping graphs to a parameter vector abstracts from the actual graphical structure and communication relations between cores. Thus, the DC algorithm has no knowledge about the underlying hardware architecture or communication structures.
The newly generated mapping is simulated again to produce the next results (III) for the oracle function. The DC algorithm requires approximately 10,000-30,000 samples to obtain a valid design center. This does not mean that every sample requires a simulation, since many samples can refer to the same mapping. We implemented a cache that holds the results of already simulated mappings for later use. This is important since the simulation time dominates the overall execution time of the algorithm.
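The caching idea described above can be sketched as a thin wrapper around the simulator. The class name `CachedOracle`, the `simulate` callable, and its signature (a tuple of core indices in, an execution time out) are hypothetical stand-ins; the actual interface of the mapper's simulator is not described in the text.

```python
class CachedOracle:
    """Feasibility oracle that simulates each distinct mapping at most once."""

    def __init__(self, simulate, time_bound):
        self.simulate = simulate      # hypothetical: mapping tuple -> execution time
        self.time_bound = time_bound  # the single time constraint t_0
        self.cache = {}               # mapping tuple -> simulated execution time
        self.simulations = 0          # number of real simulator invocations

    def __call__(self, mapping):
        key = tuple(mapping)
        if key not in self.cache:
            self.simulations += 1
            self.cache[key] = self.simulate(key)
        return self.cache[key] <= self.time_bound
```

Keying the cache on the discrete mapping rather than on the continuous sample is what makes it effective, since many different samples collapse onto the same mapping.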
Architectures
To apply the DC algorithm to the mapping problem, we need to leverage information about the target architectures themselves. In this paper, we use two architectures that provide sufficient processing elements for our evaluation. For our approach we used architectures with uniform processing elements, which simplifies the

(1) A generic ARM SoC with a heterogeneous memory structure (cf. Figure 5). The chosen architecture contains 16 Cortex-A9 cores with a clock frequency of 1 GHz and a hierarchical bus system. Each core has a direct connection to a local L1 cache. In addition, 4 cores are grouped into one cluster sharing one L2 cache. If data has to be exchanged across cluster boundaries, a global main memory must be used. The latency of the required communication structures as well as of the respective memory structures increases with decreasing locality. Thus, processes that are scheduled on the same core have the lowest communication latencies. The highest communication latency, on the other hand, comes from cores using the shared memory. However, mapping two processes to one core requires additional context switches each time a process is activated, which introduces additional latencies.

(2) A homogeneous network-on-chip (NoC) architecture, shown in Figure 6, consisting of 16 Cortex-A9 cores. In contrast to the previous architecture, there are no global memories available. The individual cores have local memories and are connected to a network on chip similar to the Epiphany III architecture from Adapteva [1]. The simulation used provides a system-level view of the NoC behavior without detailed simulation of routing contentions.
Mappings for Design Centering
In order to apply the $L_p$-Adaptation method, we need to encode mappings as vectors in the normed space $L_p^n$ for some suitable $n, p$. To do this, we first have to understand precisely what a mapping is from a mathematical standpoint. Once we have done this, and have a (mathematical) set $M$ of mappings, we can find a function $\mathrm{encode}: M \to L_p^n$ for some suitable $n, p$. For this function to be meaningful, it should also capture an intuitive notion of distance between two mappings as the $p$-norm in $L_p^n$. In other words, we have to find a mathematical description of mappings not only as a set $M$, but rather as a metric space $(M, d)$ with a distance function $d$. We would then require some sort of compatibility: if $d(m_1, m_2)$ is small, we want $\|\mathrm{encode}(m_1) - \mathrm{encode}(m_2)\|_p$ to be small, in some sense. Ideally, we would like encode to be an isometry (or an isometric embedding), i.e., $d(m_1, m_2) = \|\mathrm{encode}(m_1) - \mathrm{encode}(m_2)\|_p$, and bijective (injective). However, the discrete structure of mappings is ill-suited to find such a representation, and we should not expect to find a function encode that is an isometry for any intuitive description of the mapping set $M$. On the other hand, we would ideally like encode to be a bijective function (parametrization) of the set $M$, since we can then find exactly one point in the design space $L_p^n$ for every mapping. Since $M$ is finite and $L_p^n$ uncountable, a bijective function will not exist. An injective function is imperative though, since otherwise we cannot uniquely determine a mapping from a point in the design space $L_p^n$.

Encoding Mappings. Consider the example depicted in Figure 7A. In order to help the visualization, we coarsened the KPN of a two-channel audio filter to consist of only two processes with two channels: the source (src) and the sink (snk). The source process reads the filter file, splits both channels and performs an inverse Fast Fourier Transform (FFT) on each. Then, the sink process filters each channel in the frequency domain and applies an FFT to convert back to the time domain. The figure shows three different mappings, each coded with a different color, onto Architecture 1 (Figure 5). Consider concretely the orange mapping, which maps the source to ARM 14 and the sink to ARM 15. Both FIFO channels are mapped to the shared L2 of Cluster 3, even though this is not explicitly shown in the figure. Such a KPN mapping is thus an assignment of KPN processes (and channels) to system resources. In this paper, we concentrate on process-to-core mappings. Mathematically, such an assignment is best described as a function. If we thus let $K$ be the set of KPN processes, and $R$ the set of system resources, then a KPN mapping is just a function $m: K \to R$. The orange mapping from Figure 7A is, thus, the function
$$m: K \to R, \quad \mathrm{src} \mapsto \mathrm{ARM}_{14}, \quad \mathrm{snk} \mapsto \mathrm{ARM}_{15}. \tag{3}$$
Since the sets $K$ and $R$ are finite, we can describe the set $R^K$ of such functions as the set of $|K|$-tuples over $R$, $R^{|K|} = R^n$, if we set $n := |K|$. For this, we fix an enumeration of $K$, $K = \{k_1, \ldots, k_{|K|}\}$. We can thus encode the mapping $m$ as:
$$m \mathrel{\widehat{=}} \big(m(k_1), \ldots, m(k_{|K|})\big).$$
For the orange mapping in Figure 7A, the mapping of Equation 3, this would mean encoding it as the tuple $m \mathrel{\widehat{=}} (\mathrm{ARM}_{14}, \mathrm{ARM}_{15})$.
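The encoding of a mapping as a tuple can be sketched in a few lines of Python. The names `encode` and `decode`, the dictionary representation of a mapping, and the core labels `ARM0` to `ARM15` with zero-based numbering are assumptions for illustration; the paper only fixes that processes and resources are enumerated.

```python
def encode(mapping, processes, resources):
    """Encode a process-to-resource mapping as a tuple of resource indices.

    `processes` and `resources` are fixed enumerations (lists) of K and R.
    """
    index = {res: i for i, res in enumerate(resources)}
    return tuple(index[mapping[proc]] for proc in processes)


def decode(vector, processes, resources):
    """Inverse of encode on valid index tuples."""
    return {proc: resources[i] for proc, i in zip(processes, vector)}


# Hypothetical enumerations matching the two-process example.
PROCESSES = ["src", "snk"]
RESOURCES = [f"ARM{i}" for i in range(16)]
orange = {"src": "ARM14", "snk": "ARM15"}
```

With these enumerations, `encode(orange, PROCESSES, RESOURCES)` yields `(14, 15)`, and `decode` recovers the original dictionary, which reflects the injectivity required of the encoding.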
By fixing an enumeration of $R$ in the same manner, this yields an embedding into $\mathbb{R}^{|K|}$ as sets, i.e., encode is an injective function. The example above reduces to $m \mathrel{\widehat{=}} (14, 15)$.

For the metric, we first define a metric on the architecture, $d_{\mathrm{arch}}$. This metric depends on the architecture itself, but is chosen such that $d_{\mathrm{arch}}(PE_i, PE_i) = 0$ for all $i$, i.e., for all PEs. Furthermore, we choose the distance according to the latencies, such that $d_{\mathrm{arch}}(PE_i, PE_j) < d_{\mathrm{arch}}(PE_i, PE_k)$ if the latency between $PE_i$ and $PE_j$ is always smaller than that between $PE_i$ and $PE_k$. For example, in the architecture depicted in Figure 7, we have
Having defined this metric $d_{\mathrm{arch}}$, we define the metric on mappings as:
$$d(m_1, m_2) = \sum_{k \in K} d_{\mathrm{arch}}\big(m_1(k), m_2(k)\big). \tag{4}$$
For example, the distance between the orange mapping $m_1 \mathrel{\widehat{=}} (14, 15)$ and the dark-blue mapping $m_2 \mathrel{\widehat{=}} (7, 7)$ in Figure 7 is:
On the other hand, the distance between the dark-blue and the light-blue mapping is $2 + 1 = 3$.

4.3.2 Norm. We selected the $p = 1$ norm for applying DC to mappings. For a finite-dimensional real space, the $L_1^n$ norm is simply:
$$\|x\|_1 = \sum_{i=1}^{n} |x_i|.$$
On a discrete space, this norm is commonly referred to as the "Manhattan" norm, and it corresponds to the metric on $M$ from Equation 4. We believe that, among the $L_p$ norms, this one best captures the distance between processing elements.
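A metric of this kind, a per-core architecture distance summed over all processes, can be sketched as follows. The concrete distance values (0 for the same core, 1 within a cluster, 2 across clusters) and the assumption that cores are numbered consecutively in clusters of four are hypothetical; they respect the latency ordering described for the clustered architecture but are not the values used in the paper.

```python
CORES_PER_CLUSTER = 4  # clusters of 4 cores, as described for the ARM SoC


def d_arch(pe_i, pe_j):
    """Hypothetical architecture metric ordered by communication latency."""
    if pe_i == pe_j:
        return 0  # same core
    if pe_i // CORES_PER_CLUSTER == pe_j // CORES_PER_CLUSTER:
        return 1  # same cluster, shared L2
    return 2      # different clusters, global memory


def mapping_distance(m1, m2):
    """Sum of per-process architecture distances between two encoded mappings."""
    if len(m1) != len(m2):
        raise ValueError("mappings must encode the same set of processes")
    return sum(d_arch(a, b) for a, b in zip(m1, m2))


def l1_norm(x):
    """Manhattan norm of a real vector."""
    return sum(abs(v) for v in x)
```

Note that the sum over processes has the same additive structure as the Manhattan norm, which is the correspondence the text refers to; the two coincide exactly only when the architecture distance equals the absolute difference of core indices.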
The oracle
As explained above, the DC algorithm works using an oracle. It has a very simple specification: given a mapping $m \in M$, decide whether the encoded mapping $\mathrm{encode}(m)$ is in the feasible set $A$ ($\mathrm{encode}(m) \in A \subseteq \{1, \ldots, |R|\}^{|K|} \subseteq \mathbb{R}^n$). Whether a mapping is feasible depends on the user constraints (see Figure 2), e.g., number of resources, maximum energy consumption, or overall performance. As mentioned before, we only consider a single time bound $t_0$:
$$m \text{ is feasible} \iff t \le t_0,$$
where $t$ is the time obtained from the simulation. Thus, defining an oracle is simple: given a mapping $m$, it runs the simulator (cf. II in Figure 4) and returns feasible if the simulated time is $\le t_0$. Note that the DC algorithm works on $\mathbb{R}^n$, whereas our encoding maps into $\{1, 2, \ldots, |R|\}^n$, a finite and discrete set. To deal with this, we round all coordinates of an $x \in \mathbb{R}^n$ to the nearest integer and let the oracle return infeasible if the result falls outside the defined set. With this method, however, we neglect some information about $A$ that we already possess, namely, that it is a subset of $\{1, 2, \ldots, |R|\}^n$. In future work we plan to address this by formulating a discrete (finite) version of the DC algorithm.
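The rounding step that bridges the continuous sampler and the discrete mapping space can be sketched as below. The factory name `make_oracle`, the `simulate` callable, and the round-half-up tie rule are assumptions of this sketch; the text only states that coordinates are rounded to the nearest integer and that points outside $\{1, \ldots, |R|\}^n$ are infeasible.

```python
import math


def make_oracle(simulate, time_bound, num_resources):
    """Build an oracle on real vectors from a simulator on integer mappings.

    `simulate` is a hypothetical stand-in: tuple of core indices -> execution time.
    """
    def oracle(x):
        # Round half up; Python's built-in round() would round ties to even.
        mapping = tuple(math.floor(v + 0.5) for v in x)
        # Points outside {1, ..., |R|}^n are infeasible by definition.
        if any(core < 1 or core > num_resources for core in mapping):
            return False
        return simulate(mapping) <= time_bound
    return oracle
```

Returning infeasible for out-of-range points, instead of clipping them to the nearest valid core, keeps the feasible region inside the valid index range, at the cost of wasting samples near the boundary.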
Consider again the example from Figure 7. For this very simple application, since the design space is two-dimensional, we can actually visualize it. If we let the application execute on every one of the $16^2 = 256$ mappings in this example, we get the heatmap depicted in Figure 7B, where the color corresponds to the execution time. The dotted lines indicate where the three mappings from Figure 7A lie in the heatmap. If we set the oracle function to use the time $t_0 = 80$ ms, then the feasible space $A$ will be the orange squares, whereas for $t_0 = 160$ ms, it will encompass everything except the dark-blue diagonal.
EVALUATION
In this section we evaluate our approach, using three embedded applications on the architectures described in Section 4.2. The goal of the evaluation is twofold. First, we want to generate a mapping with the DC approach from an initial mapping. The initial mapping was optimized with the conventional mapping tool flow of the SLX Tool Suite. Second, we want to verify the robustness of the mapping obtained with the DC algorithm in comparison to the initial mapping and a set of randomly generated mappings. To this end, we introduce perturbations to the mapping and check to what extent application constraints are still met. The perturbations correspond to local remapping decisions, emulating what an operating system would do if it cannot deploy the statically computed mapping.
Applications
For testing, we use three applications from the signal processing and multimedia domains. The first one is an audio filter in the frequency domain. This filter processes a stereo audio signal from an input file of 16-bit samples at 48 kHz. It is functionally the same as the one presented in Figure 7, but split into 8 processes to expose more parallelism. The second application is a multiple-input, multiple-output orthogonal frequency division multiplexing (MIMO-OFDM) algorithm, similar to those used for 4G wireless communication. The code of the benchmark operates on randomly generated packets. The last application is a Sobel filter from the image processing domain.
Search Strategy
During the design-centering exploration, as part of the algorithm, we vary the radius of the $L_p$-ball. We achieve this by changing the target hitting probability $P$ during the iterations of the algorithm. Constrained to a small hitting probability, the DC algorithm tends to enlarge the sample region, and vice versa. For the experiments we specify the following list of target hitting probabilities at different iterations ($S$) of the DC algorithm:

0.05 for 0 < S ≤ 15000
0.5 for 15000 < S ≤ 18750
0.75 for 18750 < S ≤ 22500
0.8 for 22500 < S ≤ 26250
0.95 for 26250 < S ≤ 30000

The adaptation of the radius and the sampled hitting probability is illustrated in Figure 8, which shows a run of the DC algorithm for the MIMO-OFDM application. For the first half of the samples the hitting probability is fixed to $P = 0.05$, which forces the algorithm to increase the search radius in order to reach the target hitting probability. After the first 15,000 samples, the target is raised in steps according to the list above, so that the sample region shrinks again. This strategy ensures that the calculated design center is located within a large feasibility region and does not get stuck on a local maximum.
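The schedule of target hitting probabilities can be expressed as a small lookup function. This is a sketch under the assumption that the listed iteration intervals are meant to be contiguous; the names `SCHEDULE` and `target_hitting_probability` are introduced here for illustration only.

```python
# (inclusive upper bound on the sample count S, target hitting probability)
SCHEDULE = [
    (15000, 0.05),
    (18750, 0.5),
    (22500, 0.75),
    (26250, 0.8),
    (30000, 0.95),
]


def target_hitting_probability(s):
    """Return the target hitting probability for sample number s (1-based)."""
    if s < 1:
        raise ValueError("sample number must be positive")
    for upper, probability in SCHEDULE:
        if s <= upper:
            return probability
    raise ValueError("sample number exceeds the scheduled budget")
```

The long first phase at a low target lets the ball grow to cover a large part of the feasible region, while the later, stricter targets tighten the ball around the region found.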
Besides the dynamic adaptation of the search radius, the shape of the $L_p$-ball is also decisive for an accurate hypervolume approximation. As mentioned before, we use the Manhattan metric as distance metric for the hypervolume. This is a good fit since we work on a 2-dimensional grid of PEs. The minimum radius ($r_{\min}$) of the $L_1$-ball was set to one and the maximum to half of the parameter space, $r_{\max} = |K| \cdot |\{PE_i\}| / 2$, with $|K|$ being the dimension of the mapping vector (number of tasks) and $|\{PE_i\}|$ the number of PEs.
Results
In order to assess the robustness of the resulting mappings, we designed a perturbation method for testing it. The idea is to test whether a mapping still works within the given constraints after the static mapping is slightly changed. A single mapping m is modified by randomly perturbing the parameter vector encode(m): three random cores are chosen from the mapping and each is replaced by a different core of the given architecture. Our perturbation analysis consists of generating 100 modified versions of the original mapping encode(m) and testing how many of those still meet the constraints. Note that the modifications of the vector are carried out without further consideration of the architecture. This means that a different core is selected without considering cluster boundaries or other communication infrastructure. We are aware that some small changes of the mapping may have a large impact on the run-time while others have very little effect. However, we believe those extrema are leveled out by the large number of perturbations. Thus, the robustness of different mappings is still comparable with this method.
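The perturbation analysis can be sketched as follows. This is an illustrative approximation under stated assumptions: a mapping is encoded as a list of PE indices (one per task), "three random cores" is read as three distinct entries of that list, and `is_feasible` is a hypothetical stand-in for the simulation-based constraint check.

```python
import random


def perturb(mapping, num_pes, rng, num_changes=3):
    """Return a copy of `mapping` in which `num_changes` distinct entries
    are each reassigned to a different, randomly chosen PE."""
    perturbed = list(mapping)
    for pos in rng.sample(range(len(mapping)), num_changes):
        # Architecture-agnostic choice: any PE except the current one.
        alternatives = [pe for pe in range(num_pes) if pe != perturbed[pos]]
        perturbed[pos] = rng.choice(alternatives)
    return perturbed


def robustness(mapping, num_pes, is_feasible, trials=100, seed=0):
    """Percentage of perturbed mappings that still satisfy the constraints."""
    rng = random.Random(seed)
    passed = sum(
        bool(is_feasible(perturb(mapping, num_pes, rng))) for _ in range(trials)
    )
    return 100.0 * passed / trials
```

Using a fixed seed makes the set of perturbations reproducible, so that different mappings can be compared under the same random choices.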
To evaluate how robust the mapping computed with our approach really is, we also perform the perturbation analysis on 199 other feasible mappings obtained with the mapping flow. Figures 9-11 show the results of the perturbation analysis for the three applications on both target architectures. In the figures, the mappings (on the x-axis) are sorted according to robustness, i.e., what percentage of the 100 variations still meet the constraints. The mapping obtained with the DC approach and the first mapping computed (see Figure 4) are marked as "DC" and "Init", respectively. We see that in all instances, the mapping selected by the DC algorithm is in the first two deciles, and is, indeed, a very robust mapping. In particular, the results of the perturbation analysis for all three applications (Figures 9-11) show that the design centers found provide a clear improvement of robustness in comparison to the optimized initial mapping. Nonetheless, the evaluated design centers are of different quality: while the center of the MIMO-OFDM benchmark seems to be very robust against hardware perturbations (98%), the centers of the remaining benchmarks range from 61% to 87%. Thus, it is quite possible that the algorithm will determine a local center, which could be surpassed in its quality by some random mapping. An investigation of this effect revealed that the deviations are triggered by different starting values, which is usual behavior for a heuristic-based algorithm.
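To make the decile statement concrete, the position of a mapping within the sorted robustness results can be computed as sketched below. The function name and the input format (a dictionary from mapping name to pass rate) are hypothetical; decile 1 is assumed to contain the most robust mappings, and ties keep their insertion order.

```python
def robustness_decile(scores, name):
    """Return the decile (1 = most robust, 10 = least robust) of mapping `name`.

    `scores` maps a mapping name to its pass rate in percent.
    """
    # Sort names by descending pass rate; Python's sort is stable for ties.
    ranked = sorted(scores, key=scores.get, reverse=True)
    position = ranked.index(name)  # 0-based rank
    return position * 10 // len(ranked) + 1
```

With 200 mappings per experiment, ranks 1 to 20 fall into the first decile and ranks 21 to 40 into the second.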
The perturbation analysis described above is quite time consuming. We therefore also analyze whether it is possible to determine the quality of the design center without carrying out the detailed analysis. To this end, we investigate the relation between the estimated hypervolume ("radius") of the feasible region and the quality (robustness). By construction, we expect said radius of the $L_1$-balls of different design centers to correlate with the robustness of the mappings. We compared different design centers from the MIMO-OFDM benchmark on the ARM SoC with the described perturbation analysis (using identical random seeds for each center). The results, alongside a linear regression, are shown in Figure 12. Due to the small number of evaluated design centers (10), it is not clear how representative these values really are. However, the obtained results provide promising indications that there is a correlation between the size of the ascertained hypervolume and the quality of the results. A regression analysis also suggests a very high correlation (correlation coefficient of 0.984).
Figure 12: Correlation between radius of the $L_1$-ball and robustness of the DC (x-axis: radius of $L_1$-ball; y-axis: mappings passed in %).
Limitations
[...] results in multiple duplicate measurements, as slightly different samples can be merged into identical mappings. In order to avoid repeated simulations of the same mapping, an internal cache was implemented in the oracle function. The underlying problem can only be solved by using an internal discretization of the DC algorithm. This adaptation for discrete problems is part of the ongoing development of the DC algorithm.
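The internal cache of the oracle function could look roughly like the following sketch. It is not the actual implementation: the class name `CachedOracle`, the `simulate` callback and, in particular, the rounding-and-clamping discretization are assumptions made for illustration.

```python
class CachedOracle:
    """Feasibility oracle that simulates each discrete mapping at most once."""

    def __init__(self, simulate, num_pes):
        self._simulate = simulate  # hypothetical: mapping tuple -> bool
        self._num_pes = num_pes
        self._cache = {}
        self.simulations = 0       # number of simulations actually run

    def discretize(self, sample):
        # Assumption: each continuous coordinate is rounded to the nearest
        # PE index and clamped to the valid range [0, num_pes - 1].
        return tuple(
            min(self._num_pes - 1, max(0, round(x))) for x in sample
        )

    def __call__(self, sample):
        mapping = self.discretize(sample)
        if mapping not in self._cache:
            self.simulations += 1
            self._cache[mapping] = self._simulate(mapping)
        return self._cache[mapping]
```

Keying the cache on the discretized mapping rather than on the continuous sample is what lets slightly different samples share one simulation result.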

RELATED WORK
This section introduces previous works that are related to the problem of adaptive dynamic mapping for changed hardware or software constraints. Typically, dynamic mapping of process networks or task graphs is necessary when multiple tasks run on multicores under an embedded operating system. This problem is sometimes solved by holding multiple static schedules calculated at compile time of the application, e.g. [20, 28, 49]. As this requires comprehensive knowledge of all possible system states, these approaches suffer from poor scalability. Thus, several attempts were made to provide light-weight run-time mappings for process networks. Most of these approaches use a twofold strategy: computing parts of the mapping at compile time and making the final adaptations to the mapping at runtime. Examples of such hybrid approaches can be found in [9, 33, 47].
In order to provide these compile-time optimized mappings for different usage scenarios, a comprehensive design space exploration is required. This issue has been extensively studied in recent years: for example, [32] tries to find Pareto fronts within the design space of a dataflow application, and [40] describes an exploration methodology for multi-objective constraints. In general, it is difficult to compare these mapping approaches [15].
However, most of the hybrid approaches for dynamic mapping either run into scalability problems, as they provide only static parts, or require considerable computing resources at runtime. Our idea was to decrease the computational effort at runtime to nearly zero by providing a mapping that can be modified within certain boundaries without affecting the given constraints of the application.
Therefore, we used a design centering approach as it is commonly used in integrated-circuit design [7] and material sciences [30]. Our approach also requires a design space exploration at compile time but uses the gathered information to provide a robust design. To the best of our knowledge, we are the first to use design centering for the development of robust mappings in the context of dataflow applications.
CONCLUSIONS AND PERSPECTIVE
We described an application of a bio-inspired design centering algorithm to compute robust mappings for multicores. In contrast to conventional optimization methods, the applied algorithm does not try to find one distinct optimal point, but rather a whole region that fulfills a certain condition. For this purpose, the design centering algorithm explores the design space in order to find a point which is in the center of a determined hypervolume of points that meet the given constraints. In this work, the algorithm was used to find mappings that provide a certain degree of robustness against slight remapping changes during runtime. We believe design centering can be used to generate mappings that are robust to other kinds of perturbations, and that this is an interesting area of research.
To evaluate our approach, we used a state-of-the-art tool flow for mapping KPN applications to multicore architectures. We performed a perturbation analysis for three applications on two fundamentally different architectures. Compared to conventionally optimized mappings, the generated mappings turned out to be ≈29% more robust against changes of hardware resources.
In future work, we will specialize the design centering algorithm to be directly applied to discrete problems. Furthermore, we are planning to apply the algorithm to other problems from the field of process network optimizations, e.g., mapping robustness with respect to changes in the internal control flow of the processes.
ACKNOWLEDGMENTS
This work was supported by the German Research Foundation (DFG) as part of the Cluster of Excellence "Center for Advancing Electronics Dresden" (cfaed).
The authors thank Silexica (www.silexica.com) for making their embedded multicore software development tool available to us.
