Abstract-Numerous researchers have studied the contention that arises among tasks running in parallel on a multicore processor. Most of those studies seek to derive a tight and sound upper bound for the worst-case delay with which a processor resource may serve an incoming request, when access to it is arbitrated using time-predictable policies such as round-robin or FIFO. We call this value the upper-bound delay (ubd). Deriving a trustworthy ubd statically is possible when sufficient public information exists on the latency incurred on access to the resource of interest. Unfortunately, that is rarely the case for commercial off-the-shelf (COTS) processors. Users therefore resort to measurements on the target processor and compute a "measured" ubd, denoted $ubd_m$. However, using $ubd_m$ to compute worst-case execution time values for programs running on COTS multicore processors requires qualifying the soundness of the result. In this paper, we present a measurement-based methodology to derive a $ubd_m$ under round-robin (RoRo) and first-in-first-out (FIFO) arbitration, which accurately approximates ubd from above, without needing latency information from the hardware provider. Experimental results, obtained on multiple processor configurations, demonstrate the robustness of the proposed methodology.
INTRODUCTION
THE real-time systems industry has started to consider multicore processors (multicores in the following) as its baseline computing platform, in response to the increasing performance requirements of new applications. This situation extends across a variety of application domains, including automotive [1], avionics [2], and space [3].
In spite of the potential benefit to available performance, embracing multicores for the real-time systems industry is a difficult challenge. Chip providers are driven by the mainstream market and industrial developers of real-time systems must stay in the mainstream and use COTS solutions in order to contain procurement costs. However, mainstream COTS multicores are designed to improve average performance rather than time predictability, which is an essential ingredient to compute tight and sound worst-case execution time (WCET) bounds for real-time software programs. Sadly, at the present state of the art, analysis solutions capable of delivering tight and sound WCET bounds for COTS multicores do not yet exist, and execution-time bounds (ETB) are derived instead, which may or may not be true upper-bounds.
One of the challenges of timing analysis for COTS multicores stems from the difficulty of determining the worst-case impact of contention on access to hardware shared resources. In this paper, the term ubd, for upper-bound delay, denotes that impact. Studies exist that investigate the ubd arising on access to the on-chip bus [4] and the memory controller [5], [6]. Those works, however, yield a tight and sound ubd estimate only when enough information about the timing behaviour of the target processor is available.
Both Static Timing Analysis (STA) and Measurement-Based Timing Analysis (MBTA) methods [1] need a trustworthy ubd to compute sound ETBs. STA uses the ubd to cost every request to a shared hardware resource issued by a software program. MBTA, the most widely used practice in industry at present, needs the ubd to gauge the contention delay that application programs may suffer.
Unfortunately, as the complexity of multicores continues to rise and information on their internals is increasingly restricted by intellectual property, the static derivation of ubd becomes inordinately harder. As evidence of that, the contention behaviour of the P4080 processor has been analyzed by an avionics end-user and an STA tool provider [7] using measurements, thereby obtaining a measured approximation of the ubd [8], here denoted $ubd_m$.
The net consequence of that difficulty is that the confidence that can be placed on the ETB rests on the confidence that can be attached to the $ubd_m$; in particular, on how well it approximates the actual ubd.
To the best of our knowledge, the state-of-the-art techniques used to compute $ubd_m$ most frequently employ specialized programs executing in the application space, often called resource stressing kernels (rsk) [8], [9], [10], also referred to as micro-benchmarks. The rsk approach computes the $ubd_m$ by running the software component under analysis (scua) against a battery of rsk. In particular, the $ubd_m$ is derived by dividing the execution-time increment suffered by the scua, $\Delta_{ET}$, owing to the contention generated by the rsk, by the number $n_r$ of access requests made by the scua: $ubd_m = \Delta_{ET} / n_r$. Interestingly, whereas rsk are expressly designed to produce high contention on a given shared hardware resource (e.g., the bus) so that the designated victim suffers a high slowdown, insufficient attention has been devoted to determining whether the ubd is best approximated using the scua or an rsk as victim.
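The division described above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function name, its parameters and the cycle counts in the usage line are hypothetical.

```python
def measured_ubd(et_contended: int, et_isolated: int, n_requests: int) -> float:
    """Return ubd_m, the execution-time increment divided by the request count.

    et_contended: cycles taken by the scua when run against the rsk
    et_isolated:  cycles taken by the scua when run alone
    n_requests:   number of requests the scua issues to the shared resource
    """
    if n_requests <= 0:
        raise ValueError("n_requests must be positive")
    delta_et = et_contended - et_isolated  # slowdown caused by contention
    return delta_et / n_requests


# Hypothetical measurements, in cycles.
print(measured_ubd(et_contended=1_270_000, et_isolated=1_000_000, n_requests=10_000))
```

The result is an average per-request delay, which is why it only approximates ubd when every request of the victim suffers the same, maximal, contention.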
We show in this paper that the state-of-the-art rsk methodology may fail to produce sound $ubd_m$ values. In particular, we analyse the impact that round-robin (RoRo) and first-in-first-out (FIFO) arbitration policies, widely used in real-time systems due to their time-predictable traits [4], [11], have on the computation of the $ubd_m$.
In this context, this paper makes the following contributions:
1) We show that a naive use of rsk neither guarantees that the scua's requests suffer the highest contention (ubd) on accessing the target shared hardware resource, nor helps derive accurate $ubd_m$ approximations to it. The key reason behind this defect is that, under heavy contention scenarios, RoRo and FIFO produce a "synchrony effect" that causes each request issued by the scua to suffer a contention delay that can be systematically lower than the ubd. We show that the contention delay is determined by the time elapsed since the preceding request was served until the current request is issued. We term this duration injection time. This phenomenon manifests itself differently for RoRo and FIFO.
2) We propose a methodology to derive a trustworthy $ubd_m$ that does not need to know the specific latencies of the target shared hardware resource, and thus works for a wide range of COTS processors. Our approach consists in inferring the $ubd_m$ value by varying the injection time between requests to the shared hardware resource. This is realized by inserting a given number of nop instructions in between access requests. As the results obtained vary noticeably depending on whether the arbitration policy is RoRo or FIFO, we propose different methods to obtain the proper $ubd_m$ for each policy.
3) We demonstrate our methodology to derive a trustworthy $ubd_m$ for the bus and the memory controller on a multicore setup that matches that of the Cobham Gaisler NGMP processor [12], a four-core multicore considered by the European Space Agency for future missions, which embeds per-core data and instruction caches connected to the L2 with an AMBA AHB bus. For the sake of completeness, we also test our methodology on a variant of this reference multicore design.
Where measurements are the most practical and perhaps the only available means to bound the impact of contention on increasingly complex multicores, our approach provides essential aid to computing a trustworthy $ubd_m$ for the bus and the memory controller, and thus increases the trustworthiness of the resulting ETB.
The remainder of this paper is organized as follows. Section 2 analyses the contention incurred on access to hardware shared resources under RoRo and FIFO. Section 3 describes our reference processor architecture and the constituents of our measurement-based approach to derive the $ubd_m$, namely, the resource stressing kernels. Section 4 illustrates the synchrony effect that occurs with RoRo and FIFO under heavy load conditions. Sections 5 and 6 show how our proposed solution derives $ubd_m$ for the bus and the memory controller, respectively. Section 7 empirically validates our approach. Section 8 presents related work. Section 9 draws our conclusions from this study.
CONTENTION ANALYSIS FOR RORO AND FIFO

Studying the Bus and the Memory Controller
The interconnection network and the memory controller are two of the hardware resources whose sharing in multicores causes most bottlenecks for contending tasks that run in parallel. The determination of the ubd for those resources has already received the attention of researchers, under the hypothesis that public documentation on the internal functioning of the processor exists.
Bus-based interconnection networks are known to require little energy as well as to ease protocol design and verification, while incurring an acceptable slowdown [13], [14]. The Advanced Microcontroller Bus Architecture (AMBA) is a bus exemplar widely used in microcontroller devices as well as in a number of ASIC and SoC parts with real-time capabilities. The AMBA bus is the focus of our work here. The memory controller, which polices access to memory and thus is necessarily shared across cores, causes considerable contention and exacts a high toll on the ETB. Several memory controller designs have been proposed to contain this contention overhead [5], [15], [16], [17], which we consider in this work. We study how to derive ubd for those two hardware shared resources, assuming RoRo and FIFO arbitration. While other arbitration policies exist that aim at better average performance, they usually lead to a more pessimistic, or simply not computable, ubd. This is the case for some types of priority arbitration [11] and policies like first-ready first-come first-served (FR-FCFS) [18].
Let us now look at each policy of interest in isolation.
RoRo. Consider a RoRo-arbitrated resource, contended by $N_c$ cores, with an access time of $l^{max}_{res}$ cycles, where $l^{max}_{res}$ is the maximum delay that it takes for a request to be serviced by the resource. In Sections 5 and 6 we discuss this delay for the bus and the memory, respectively called $l^{max}_{bus}$ and $l^{max}_{mem}$. When core $c_i$, with $i \in \{1, \ldots, N_c\}$, has the highest priority in a given round of RoRo arbitration, the priority ordering for the subsequent round becomes $\{c_{i+1}, c_{i+2}, \ldots, c_{N_c}, c_1, c_2, \ldots, c_i\}$, where $c_{i+1}$ becomes the core with the highest priority and $c_i$ gets the lowest. As RoRo is work conserving, a lower-priority requester can be granted access to the resource when no higher-priority contender requires it.
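The priority rotation and the work-conserving grant described above can be sketched as follows. The function names are hypothetical and the sketch models only the arbitration rule, not any particular bus implementation.

```python
def next_priority_order(n_cores: int, top: int) -> list[int]:
    """Priority order (highest first) for the round after core `top` had the
    highest priority. Cores are numbered 1..n_cores, as in the text."""
    if not 1 <= top <= n_cores:
        raise ValueError("top must be in 1..n_cores")
    return [(top + k - 1) % n_cores + 1 for k in range(1, n_cores + 1)]


def grant(order: list[int], pending: set[int]):
    """Work-conserving grant: the first core in priority order that has a
    pending request gets the resource; None if nobody is requesting."""
    for core in order:
        if core in pending:
            return core
    return None


order = next_priority_order(n_cores=4, top=2)
print(order)                       # core 3 is now highest, core 2 lowest
print(grant(order, pending={1, 2}))  # cores 3 and 4 are idle, so core 1 is served
```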
When all cores continuously issue access requests, the theoretical worst case is that any request $r_i$ issued from the scua always has the lowest priority. We therefore have

$$ubd_{RoRo} = (N_c - 1) \times l^{max}_{res} \tag{1}$$
Under a contention scheme of this type, both STA and MBTA can be applied to the scua in isolation (hence with no parallel contention) and then the worst-case contention overhead can be added compositionally by factoring the above ubd into each access to the shared resource.
Clearly, however, the particular time alignment between the scua's access and the circulation of the RoRo priority token across cores determines the contention delay actually suffered, so that the $ubd_m$ may be significantly lower than the ubd. This is further discussed in Section 5.
FIFO. Consider now the same resource, this time with FIFO arbitration, accessed by $N_c$ cores, where each core can have at most one pending request in flight. FIFO assigns access priority in order of arrival, so that requests arriving earlier at the arbiter get higher priority.
The theoretical worst case for the scua occurs when all cores have a pending request and a request $r_i$ from the scua is preceded by $N_c - 1$ older requests from the other cores. This produces the same ubd as for RoRo:

$$ubd_{FIFO} = (N_c - 1) \times l^{max}_{res} \tag{2}$$

However, by the time request $r_i$ is issued, the oldest request at the head of the FIFO queue may have progressed to near completion, which again causes $ubd_m$ to be lower than ubd. We can thus observe that, under both RoRo and FIFO, the worst case is contingent on a particular alignment between the scua's request(s) and those of all other contending cores, and that this alignment is distinct for each arbitration policy.
Difficulties in Determining the UBD
When the internal workings of the processor cannot be known, the ubd cannot be determined analytically, but only approximated via $ubd_m$, as was the case in [7].
We noted earlier that designing observation experiments to maximize the impact that the interfered scua's requests suffer from other cores (which is required to "conjure" the ubd) is impaired by the need to control the alignment in time between the scua's requests and those of the contending cores.
Consider $N_c$ arbitrary software components, $SC = \{sc_1, sc_2, \ldots, sc_{N_c}\}$, one of which is our scua, with each $sc_i$ pinned to a distinct core, all contending for access to a RoRo-arbitrated resource. If we simply run all those programs together, with no other precaution, it is highly unlikely that each and every request from the scua incurs worst-case contention. This is so because, when a request $r_i$ from the scua is issued in the program run, its RoRo priority is not necessarily the lowest and therefore its wait time may be less than $ubd_{RoRo}$. The case of FIFO arbitration is analogous, because it is equally unlikely that every single scua request is issued when all other cores have pending requests enqueued and none of them is already being served.
In principle, given a specific scua, one might possibly design matching contenders capable of issuing their access requests with the frequency needed to cause the scua's requests to always be last in the queue and incur ubd contention. However, this effort would be utterly disproportionate, owing to its extreme sensitivity to the particular behaviour of (the particular version of) the scua and, even worse, to its critical dependence on detailed knowledge of the inner workings of the resources of interest so that the desired timing of request generation can be well understood and fully controlled.
We can therefore maintain that soundly approximating the ubd with measurements whose design and implementation costs are affordable is an open problem. Solving that problem would be of great value to industrial users, as they would be provided with scua-independent test sets capable of causing $ubd_m$ to be a sound approximation of ubd, which could thus be used as an additive factor to the ETB determined for the scua in isolation, with state-of-the-art single-core analysis techniques. This is the challenge we tackle here.
ELEMENTS OF THE PROPOSED SOLUTION
We now present the two building blocks of our solution: the COTS processor that we studied, and the resource stressing kernels (rsk) [8], [9], [10], small application-level programs designed to stress specific hardware resources, which represent the state of the art on which our solution builds.
Processor Architecture
The processor we consider in this work is Cobham Gaisler's Next-Generation Multi-Purpose Processor [12] , which is one of the multicores currently considered by the European Space Agency for use aboard future satellite missions.
The NGMP is a quad-core processor with private per-core instruction and data caches, referred to as IL1 and DL1 respectively, each with 16 KB capacity, four ways, 32-byte lines, and one-cycle hit latency. An AMBA AHB bus serves as the bridge between the per-core IL1 and DL1 and the second-level 256 KB four-way cache (L2), which can be partitioned across cores, one way per core. DL1 is write-through and all caches use LRU replacement. In the NGMP, whose general architecture is depicted in Fig. 1, contention only occurs on access to the bus and to the memory controller, since the L2 is partitioned.
Resource Stressing Kernels
We first discuss the specialization of rsk for the processor resources of interest, and then we show that they fail to safely approximate the respective ubd. Subsequently we present a new methodology to do that.
Bus. We call the rsk dedicated to the bus, bus stressing kernel (bsk). The bsk is designed to cause every instruction to miss in DL1 and hit in L2. This structure ensures a short turn-around time for memory requests, which keeps the bus as busy as possible.
Given that DL1 uses LRU replacement, the bsk comprises a loop with $W + 1$ load instructions, where W is the number of DL1 cache ways (see Fig. 2a). Those loads are separated by a predefined stride so that they access the same DL1 set, thus exceeding its capacity and systematically missing in DL1. Furthermore, the memory addresses referenced by the bsk are designed to fit in L2. In this way, all accesses miss in DL1 and hit in L2.
To hit in L2 we use load operations, which produce the highest bus contention. In the NGMP in fact, L2 hits hold the bus until the L2 serves the request, while L2 load misses are split transactions, which release the bus until memory sends the missed data, and store requests are immediately served, thus keeping the bus busy for a shorter duration. Fig. 2a presents the bsk for the NGMP: as the DL1 has four ways, the loop body of the bsk includes five instructions that all map to the same set.
Had the DL1 replacement policy been unknown, we would have designed the loop body to perform $N \gg W + 1$ distinct accesses to the same set, for an N that does not exceed the L2 capacity in the corresponding L2 set, to make it highly unlikely for memory operations to hit in DL1.
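The claim that $W + 1$ loads to one set always miss under LRU can be checked with a small cache model. The sketch below uses the DL1 parameters given for the NGMP (16 KB, four ways, 32-byte lines); the base address and the function names are hypothetical, and the model covers the DL1 only.

```python
from collections import OrderedDict

LINE_SIZE = 32                      # bytes per DL1 line (from the text)
WAYS = 4                            # DL1 associativity (from the text)
CACHE_SIZE = 16 * 1024              # DL1 capacity in bytes (from the text)
WAY_SIZE = CACHE_SIZE // WAYS       # 4 KB: a stride that keeps the set index fixed
NUM_SETS = WAY_SIZE // LINE_SIZE    # 128 sets


def bsk_addresses(base: int = 0x4000_0000) -> list[int]:
    """W + 1 load addresses, one way size apart (base is a hypothetical address)."""
    return [base + k * WAY_SIZE for k in range(WAYS + 1)]


def count_dl1_hits(addresses: list[int], iterations: int) -> int:
    """Replay the loop body on an LRU-managed DL1 model and count the hits."""
    sets = [OrderedDict() for _ in range(NUM_SETS)]
    hits = 0
    for _ in range(iterations):
        for addr in addresses:
            line = addr // LINE_SIZE
            lru = sets[line % NUM_SETS]
            if line in lru:
                hits += 1
                lru.move_to_end(line)        # mark as most recently used
            else:
                if len(lru) == WAYS:
                    lru.popitem(last=False)  # evict the least recently used line
                lru[line] = True
    return hits


addrs = bsk_addresses()
print({(a // LINE_SIZE) % NUM_SETS for a in addrs})  # a single set index
print(count_dl1_hits(addrs, iterations=100))         # number of DL1 hits
```

Each load evicts the line that the loop will need soonest, which is why cycling over five lines in a four-way LRU set never hits.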
Memory Controller. Analogously, we call msk the rsk dedicated to stressing the memory controller. The msk design follows the same principles as for the bsk, except that the memory accesses in the msk have to yield L2 misses. The factors of influence to this end are the size of the way for DL1 and L2 (to cause L2 misses and therefore access the memory controller), and the size of the cache line (that determines the unit of transfer).
For the NGMP, we use a load stride of 64 KB, which is an integer multiple of the DL1 way size (4 KB). Hence, all memory accesses map to the same DL1 set. This is also the way size of the L2, hence memory accesses also map onto one and the same set. As L2 uses LRU replacement, every memory access made by the msk results in a miss. Fig. 2b presents the pseudo-code of the msk.
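The stride argument above can be verified by computing set indices directly. The cache sizes and associativities below come from the text; the L2 line size (32 bytes), the number of loads (five) and the base address are assumptions made only for this illustration.

```python
KB = 1024


def set_index(addr: int, cache_size: int, ways: int, line_size: int) -> int:
    """Set index of `addr` in a set-associative cache."""
    way_size = cache_size // ways
    return (addr % way_size) // line_size


def msk_addresses(n_loads: int, stride: int = 64 * KB, base: int = 0x4000_0000) -> list[int]:
    """Load addresses of a hypothetical msk loop body, `stride` bytes apart."""
    return [base + k * stride for k in range(n_loads)]


addrs = msk_addresses(n_loads=5)
dl1_sets = {set_index(a, 16 * KB, 4, 32) for a in addrs}   # DL1: 16 KB, 4 ways
l2_sets = {set_index(a, 256 * KB, 4, 32) for a in addrs}   # L2: 256 KB, 4 ways
print(dl1_sets, l2_sets)  # one DL1 set and one L2 set
```

Because 64 KB is a multiple of both way sizes, the result does not depend on the assumed line size, as long as the line size divides the way size.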
THE SYNCHRONY EFFECT
Intuitively, one would expect that assigning specialized rsk to all cores contending with the scua should capture the worst-case contention scenario, and thus allow obtaining a trustworthy approximation of the relevant ubd.
As we show next however, this intuition is wrong in practice, because, when exposed to heavy load conditions, both FIFO and RoRo experience a particular phenomenon that we term the synchrony effect. The essence of this phenomenon is that, when all cores issue requests at a given constant rate to the resource of interest, their requests interleave in a particular way systematically, so that their interleaving becomes synchronous. In that situation, the resulting contention delay becomes constant and, more important, unlikely to match the ubd.
We now discuss the synchrony effect for the bus, which we obtain by using $N_c - 1$ bsk as contenders to the scua, under both FIFO and RoRo. Table 1 lists the key symbols we use in the discussion.
Synchrony Effect Under FIFO
The synchrony effect causes the shared resource to behave as if it was multiplexed across all cores, with each core being assigned a time slot of duration equal to the service time of an individual request. Interestingly, this applies to both FIFO and RoRo. Let us now study that effect for FIFO.
The contention delay suffered by the scua for its request $r_{i+1}$ depends on the time elapsed since its preceding request $r_i$ and on the position that $r_{i+1}$ takes in the request queue.
Let us assume that the scua may issue multiple requests to the bus, which we denote $R_{scua} = \{r_0, r_1, \ldots, r_m\}$. Assume that those requests may be issued at arbitrary times, so that some time elapses between any two subsequent requests from the scua. Let us call injection time, denoted $d_i$, the time span between the issue of requests $r_{i-1}$ and $r_i$ for any R. Accordingly, for $R_{scua}$, we have $\{d^{scua}_1, \ldots, d^{scua}_m\}$. In our reference architecture, $d_i$ corresponds to the time elapsed since the data loaded by $r_{i-1}$ is sent back to DL1, until $r_i$ is ready to access the bus. A minimum injection time $d_{min}$ separates any two subsequent requests from R. $d_{min}$ is equal to the time it takes for DL1 and the core to process $r_{i-1}$, once it is served, and execute the instruction corresponding to $r_i$, until $r_i$ gets ready to access the bus. Analogously, we denote by $g_i$ the contention delay suffered by request $r_i$; for $R_{scua}$, we have $\{g^{scua}_1, \ldots, g^{scua}_m\}$.
Since the bsk are designed to access the bus with high frequency, their requests have low injection time. In concept, the maximum contention scenario should occur for $d_{min} = 0$.
We now illustrate the synchrony effect under FIFO with an example where the contenders are bsk and the scua can be either another bsk or any other software component. We explore two scenarios, with $d_{min} = 0$ and $d_{min} > 0$ respectively. The former, while infeasible in reality, serves for illustration.
Scenario $d_{min} = 0$. Let us assume that request $r_i$ of the scua has just been serviced and all other cores have pending requests enqueued. Fig. 3 (rows $d_{min} = 0$) illustrates how $g_{i+1}$ varies as a function of $d_{i+1}$ (shown in the first row). For instance, when the ongoing request from $c_0$ still needs two cycles to complete, $g_{i+1} = 8$, since $r_{i+1}$ cannot be granted access to the bus until that request is completed (two more cycles) and the requests from cores $c_1$ and $c_2$ are also serviced (another $3 + 3 = 6$ cycles), as they are both already in the queue.
Assuming that each core can only have one pending request, the worst contention (ubd) occurs when $r_{i+1}$ is delayed by the full service of $N_c - 1$ requests coming from the other $N_c - 1$ cores. In the example in Fig. 3, this means $g_{i+1} = 9$. When $d_{min} = 0$ and $l_{bus}$ denotes the bus service time for an individual request, the synchrony effect manifests in the fact that $g_{i+1}$ has a periodic behavior that ranges from $(N_c - 2) \times l_{bus} + 1$ (when one contending request is near completion) to $(N_c - 1) \times l_{bus}$ (when all other contending requests are pending and none is being serviced). Thus, the particular value of $d_{i+1}$ determines the value of $g_{i+1}$.
Scenario $d_{min} > 0$. The bottom rows in Fig. 3 show the impact on $g_{i+1}$ when $d_{min} = 2$. Right after $r_i$ is serviced, $g_{i+1}$ would be equal to ubd. However, for two cycles $r_{i+1}$ cannot reach the bus and thus $d_{i+1} \ge 2$. In particular, if $d_{i+1} = 2$, then $c_0$'s request is already being processed at the time $r_{i+1}$ is issued, hence $g_{i+1} < ubd$. If $d_{i+1} = 3$, then $c_0$'s request has been processed and its subsequent request will take at least two cycles to be issued and reach the queue. For larger values of $d_{i+1}$, $r_{i+1}$ eventually finds the same scenario as for $d_{i+1} = 2$, with the only difference that the particular requests in the queue have different core owners, but with the same contention effect on $r_{i+1}$. Hence, when the scua executes against bsk, it cannot experience ubd contention, regardless of whether the scua is a bsk itself or not.
In general, if the contending cores execute bsk, $g^{scua}_i$ for request $r_i \in R_{scua}$ can be described with Equation (3), where $d \ge d_{min}$ holds.
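One closed form that agrees with the numerical examples given in the text is $g(d) = ubd - d_{min} - ((d - d_{min}) \bmod l_{bus})$, with $ubd = (N_c - 1) \times l_{bus}$. This expression is an assumption made here for illustration; it is not claimed to be the authors' exact equation. The sketch below evaluates it for the FIFO example with $N_c = 4$ and $l_{bus} = 3$; the function name is hypothetical.

```python
def fifo_contention(d: int, n_cores: int, l_bus: int, d_min: int) -> int:
    """Contention delay of a request injected d cycles after the previous one,
    when all contenders are bsk under FIFO (assumed saw-tooth model)."""
    if d < d_min:
        raise ValueError("d cannot be smaller than d_min")
    ubd = (n_cores - 1) * l_bus
    # Contenders re-enqueue d_min cycles after being served, so the peak is
    # ubd - d_min; the delay then drops by one per cycle within each bus slot.
    return ubd - d_min - (d - d_min) % l_bus


# Scenario d_min = 0: the delay cycles through 9, 8, 7.
print([fifo_contention(d, n_cores=4, l_bus=3, d_min=0) for d in range(0, 6)])
# Scenario d_min = 2: the delay never reaches ubd = 9.
print([fifo_contention(d, n_cores=4, l_bus=3, d_min=2) for d in range(2, 8)])
```

Under this model the observable maximum is $ubd - d_{min}$, which matches the observation that a scua running against bsk cannot experience ubd when $d_{min} > 0$.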
Note, however, that this does not mean that ubd cannot be experienced systematically. For instance, assume that the scua is a bsk and the contending cores execute programs that behave as shown in Fig. 4. In this scenario, after $r_i$ is serviced, the queue is empty for two cycles, and when $d = d_{min} = 2$, $r_{i+1}$ is issued and contends with requests from all other cores, which arrive simultaneously and are enqueued before it. All requests are processed in order and $r_{i+1}$ experiences $g = ubd$. Then, the queue is empty again for $d_{min}$ cycles until the same scenario for $d = 2$ repeats for $d = 16$. However, while this scenario could hypothetically be produced, it is very difficult, if at all possible, for a user to create programs with given d values that align in time properly, while ensuring that requests arriving at the bus simultaneously are systematically enqueued in the desired order.
Synchrony Effect Under RoRo
Under RoRo, the incoming requests are not necessarily served in order of arrival, but in the order determined by the round-robin assignment of access slots.
Again, we assume that bsk are run as contenders. If $d_{min} = 0$, all contenders always have a request pending in the queue. Hence, the only parameter that determines who is granted access to the bus is the current priority order. This is better illustrated in Fig. 5 (see the $d_{min} = 0$ rows). As shown, $c_0$, $c_1$ and $c_2$ always have requests in the queue, either in service or still pending. Notably, the point at which $r_{i+1}$ from $c_3$ becomes the highest-priority request is dictated by that priority order alone.
We also observe that $g_{i+1} = ubd$ only when $d_{i+1} = 0$. Otherwise, $g_{i+1}$ traverses all values from $ubd - 1$ down to 0 consecutively in a round-robin fashion as $d_{i+1}$ increases.
Hence, if $d_{min} = 0$, running a bsk as scua would suffice to observe the highest contention consistently for all of its requests. However, as we noted before, the general case is $d_{min} > 0$, owing, for example, to the DL1 cache latency. In general, assuming $0 < d_{min} \le ubd$ (as is often the case in reality) so that 100 percent bus utilization can be reached, g stays exactly the same as if $d_{min} = 0$. This is so because $d_{min}$ only affects the contents of the request queue. Hence, $r_{i+1}$ can only incur $g_{i+1} < ubd$. Moreover, if d is constant for all of the scua's requests, then g is also constant. This observation is of crucial importance in our methodology, as we discuss in the next section.
In the scenario where all contenders are bsk, g can be described with Equation (4).
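The behavior described in this subsection (g equals ubd only for $d = 0$, then cycles from $ubd - 1$ down to 0 with period ubd) can be captured by the expression $g(d) = (ubd - d) \bmod ubd$ for $d > 0$. This form is an assumption derived from that description, not necessarily the authors' exact equation. A minimal sketch, with a hypothetical function name:

```python
def roro_contention(d: int, ubd: int) -> int:
    """Contention delay under RoRo for a request injected d cycles after the
    previous one, when all contenders are bsk (assumed model)."""
    if d < 0:
        raise ValueError("d must be non-negative")
    if d == 0:
        return ubd          # only reachable when d_min = 0
    return (ubd - d) % ubd  # cycles through ubd - 1, ..., 1, 0


# ubd = 6 corresponds, for example, to N_c = 4 and l_bus = 2.
print([roro_contention(d, ubd=6) for d in range(0, 14)])
```

With a bsk as scua, $d = d_{min}$ for every request, so the model yields the constant value $ubd - d_{min}$ stated in the text.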
In general, d depends on $d_{min}$ and the particular scua. An arbitrary scua may observe different values of d, and so little can be concluded about actual contention. Alternatively, running a bsk as scua, we observe exactly $g = ubd - d_{min}$ for all requests. However, it is hard to determine the actual value of $d_{min}$ even when cache latencies are known, since some pipeline stages may delay the access of DL1 misses to the bus. Hence, nothing can be concluded for certain about whether the highest contention has been observed or how far the observation is from it.
Taking stock of the synchrony effect, we now present a measurement-based method that computes a $ubd_m$ guaranteed to be a safe approximation of the ubd for COTS multicore hardware shared resources, specifically the bus and the memory controller, arbitrated with round-robin or FIFO policies.
DERIVING THE UBD FOR THE BUS
In this section, we first describe the strategy we follow. Then we show how it can be implemented and applied in practice for the bus in our reference architecture, considering both FIFO and RoRo arbitration. Finally, we summarise some architectural issues of relevance.
Nop-Based Methodology
As captured with Equations (3) and (4), when using bsk as contenders, the synchrony effect causes the amount of contention suffered by any request to be a function of d.
We use that notion to construct a new bsk, illustrated in Fig. 6a, which we call bsk-nop. In the bsk-nop we intersperse low-latency (nop) operations between the (load) instructions that access the bus. The effect of those nops is to delay the injection time of each request to the bus, which modifies the d value accordingly. Hence, whereas in the bsk, which consists of consecutive contending requests, we have $d = d_{min}$, if we add just one nop (for the sake of example) in between loads, we have $d = d_{min} + d_{nop}$, where $d_{nop}$ is the delay added by one nop.
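The construction of a bsk-nop loop body can be sketched with a small generator. The emitted lines are pseudo-assembly chosen for readability, not the syntax of any real instruction set, and the function names and default parameters (four ways, 4 KB stride) are assumptions based on the DL1 geometry given earlier.

```python
def bsk_nop_body(k: int, ways: int = 4, stride: int = 4096) -> list[str]:
    """Loop body of a hypothetical bsk-nop(k): W + 1 loads, each followed by k nops."""
    if k < 0:
        raise ValueError("k must be non-negative")
    body = []
    for i in range(ways + 1):                           # W + 1 loads to the same DL1 set
        body.append(f"load r1, [base + {i * stride}]")  # pseudo-assembly, not a real ISA
        body.extend(["nop"] * k)                        # k nops delay the next bus request
    return body


def injection_time(k: int, d_min: int, d_nop: int = 1) -> int:
    """Injection time produced by k nops, generalizing d = d_min + d_nop."""
    return d_min + k * d_nop


print("\n".join(bsk_nop_body(k=2)))
print(injection_time(k=2, d_min=1))
```

With $k = 0$ the generator returns the plain bsk, so a single code path covers both kernels.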
By varying the number k of nop instructions inserted between load operations, each resulting bus request experiences a different $d_k$. Fig. 7 shows this effect for FIFO with $d_{min} = 1$, which manifests as a saw-tooth profile. An analogous phenomenon occurs for RoRo, see Fig. 9.
Bsk-Nop for FIFO
Requests issued after an increasing number of cycles find a decreasing contention load in the queue, until a contending request issued by one bsk running in parallel on another core is queued again. The maximum contention delay experienced is $g = ubd - d_{min}$, hence systematically lower than ubd, since once a contending request is serviced, it takes $d_{min}$ cycles for a new request to be enqueued. At that time, contention is highest when the contenders are bsk, and amounts to the theoretical worst case (ubd) minus the progress made during $d_{min}$ cycles.
Observing the saw-tooth shape in Fig. 7, we see that its period is equal to $l_{bus}$. The maximum of the corresponding function is $(N_c - 1) \times l_{bus} - d_{min}$. In this case, ubd corresponds to $N_c - 1$ periods of the function. For instance, if we consider the example in Section 4.1 where $N_c = 4$, $d_{min} = 2$ and $l_{bus} = 3$, the saw-tooth ranges between 7 and 5 cycles, and repeats every $l_{bus} = 3$ cycles. Thus, $ubd = l_{bus} \times (N_c - 1) = 9$. As shown, although we cannot observe the actual ubd, we can accurately infer it from measurements with our methodology.
Fig. 8 illustrates this phenomenon for $l_{bus} = 2$ and an increasing number of inserted nops, with $d_{nop} = 1$. In scenario a), the request issued from core $c_3$, where the scua runs, suffers a contention of $g(d_{rsk}) = 5$ cycles. In scenarios b) and c), we show the effect of increasing the number of nop instructions inserted between load operations. In scenario b), we see that $g(d_{rsk} + d_{nop}) = 4$, whereas in scenario c), core $c_3$ loses its turn for access to the bus, which increases its g to five cycles again and shows the periodicity of g as a function of $l_{bus}$, where $l_{bus} = 2 \times d_{nop}$ in this case. For higher nop counts, scenarios a), b) and c) repeat.
Bsk-Nop for RoRo
Fig. 9 shows the variation in the contention delay incurred with RoRo, as captured with Equation (4). The contention value reaches $ubd - 1$ at most, which, for $d_{min} > 0$, occurs periodically every ubd cycles.
This phenomenon is better illustrated in Fig. 10. In scenario g), when $k = 6$, the situation becomes the same as in scenario a).
The following observations are made: (i) for $ubd \ge d_{min} > 0$, we have $g \le ubd - 1$, as per Equation (4); (ii) the variation of g is periodic, with period ubd, independent of $d_{min}$; and, more importantly, (iii) the exact value of ubd can be inferred from the period of $g(d)$, which varies with k: this holds true for any $d_{min}$ as long as $d_{min} \le ubd$.
Applying the Rsk-Nop Method
Fig. 10. Timeline of the RoRo scenario for different numbers k of nop instructions.
Our method to determine the ubd requires carrying out several experiments using rsk-nop as scua and normal rsks as contenders. rsk-nop(k) is parametrized by varying, incrementally, the number k of nop instructions inserted between the operations that access the bus.
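The inference step can be sketched as a period search over the per-request slowdowns measured for $k = 0, 1, 2, \ldots$. The sketch relies on two statements from the text: under RoRo the period of g equals ubd, and under FIFO the period equals $l_{bus}$, so that ubd spans $N_c - 1$ periods. The function names are hypothetical, and exact equality between samples is assumed; real measurements may need filtering first.

```python
def sawtooth_period(samples: list[int]) -> int:
    """Smallest p such that samples[i] == samples[i + p] for every valid i."""
    n = len(samples)
    for p in range(1, n // 2 + 1):
        if all(samples[i] == samples[i + p] for i in range(n - p)):
            return p
    raise ValueError("no period found: sweep more values of k")


def ubd_from_period(period_in_nops: int, d_nop: int, policy: str, n_cores: int) -> int:
    """Convert the observed period (in nops) into an ubd estimate (in cycles)."""
    cycles = period_in_nops * d_nop
    if policy == "roro":
        return cycles                  # the period of g equals ubd
    if policy == "fifo":
        return (n_cores - 1) * cycles  # the period equals l_bus
    raise ValueError("policy must be 'roro' or 'fifo'")


# Synthetic sweeps standing in for measurements (N_c = 4, d_nop = 1).
roro_samples = [5, 4, 3, 2, 1, 0] * 3   # e.g. l_bus = 2, d_min = 1
fifo_samples = [7, 6, 5] * 4            # e.g. l_bus = 3, d_min = 2
print(ubd_from_period(sawtooth_period(roro_samples), 1, "roro", 4))
print(ubd_from_period(sawtooth_period(fifo_samples), 1, "fifo", 4))
```

Note that neither synthetic sweep ever contains the ubd value itself; only the period reveals it.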
The magnitude of the ubd depends on two factors: the number of rounds that the request has to wait to gain access to the shared resource of interest (denoted $N_c - 1$); and the longest possible service time of that resource (denoted $l^{max}_{bus}$). In the measurement-based approach presented in this work, specialized rsks have to be designed to incur an $l^{max}_{bus}$ response time. In our reference architecture, that duration is determined by whether the accesses to the bus are reads or writes, and hits or misses in the L2. In [19], we empirically determine, for the processor of interest, that read hits to the L2 use the bus for nine cycles, read misses for seven cycles, and writes for one cycle, regardless of whether or not they miss in L2. In our methodology we secure an $l^{max}_{bus}$ response time by causing all memory operations in the rsk to be read hits to the L2.
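As a worked example, the bus occupancy figures quoted above can be combined with the worst-case expression $(N_c - 1) \times l^{max}_{bus}$ used throughout the paper. The dictionary keys and variable names are hypothetical labels; the cycle counts and the core count are those given in the text.

```python
# Bus occupancy per access type, in cycles, as reported in the text.
BUS_CYCLES = {"l2_read_hit": 9, "l2_read_miss": 7, "write": 1}
N_CORES = 4  # the NGMP is a quad-core

l_max_bus = max(BUS_CYCLES.values())                # longest bus occupancy
worst_access = max(BUS_CYCLES, key=BUS_CYCLES.get)  # access type that causes it
ubd_bus = (N_CORES - 1) * l_max_bus                 # (N_c - 1) x l_max_bus

print(worst_access, l_max_bus, ubd_bus)
```

This is why the rsk are built from loads that hit in L2: any other access type would occupy the bus for fewer cycles per request.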
Multicycle Nop Operation
So far we have assumed that $\delta_{nop} = 1$. This is indeed the case in most architectures, since nop instructions do not have input/output dependencies and use the fast integer pipeline, if present. In the unlikely case that $\delta_{nop} > 1$, varying the number of nop instructions in the scua will be equivalent to sampling the saw-tooth behavior shown in Figs. 7 and 9. If the value of $\delta_{nop}$ can be determined, then we can obtain the saw-tooth period easily. Otherwise, we infer $\delta_{nop}$ as follows: we use a rsk whose loop body solely includes $k$ nop instructions, as many as possible without causing misses in the instruction cache; at that point, by dividing the observed execution time of that rsk by $k$, we obtain a very accurate measure of the nop latency $\delta_{nop}$.
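The division described above can be written down as follows. This is a sketch under stated assumptions: the function name and the sample figures are hypothetical, and the iteration count is an added parameter that covers the case in which the measured time spans several iterations of the nop-only loop.

```python
def estimate_nop_latency(total_cycles: int, iterations: int, nops_per_iteration: int) -> float:
    """Estimate the per-nop latency, in cycles, from a nop-only kernel.

    total_cycles:        observed execution time of the whole run
    iterations:          number of times the loop body was executed
    nops_per_iteration:  k, the number of nop instructions in the loop body
    """
    if iterations <= 0 or nops_per_iteration <= 0:
        raise ValueError("iterations and nops_per_iteration must be positive")
    return total_cycles / (iterations * nops_per_iteration)


# Hypothetical run: 1,000 iterations of a loop body holding 2,000 nops.
print(estimate_nop_latency(total_cycles=2_000_000, iterations=1_000, nops_per_iteration=2_000))
```

The estimate includes the loop-control overhead, which is why k should be as large as the instruction cache allows: the larger k is, the smaller the share of that overhead in the result.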
Summary
The method we have illustrated in this section empirically derives ubd m , requiring little in the way of knowledge about the underlying architecture, which is in fact very rarely available as public documentation.
Let us summarize the essence of our contribution at this point. First, we tested our approach for the bus, under FIFO and RoRo, and we have shown it to work. Second, our approach requires knowing the type of instructions that may generate requests to the bus, which is typically documented in the processor's manuals. Third, we can claim confidence in the $ubd_m$ obtained with our method for two reasons. On the one hand, $N_c - 1$ cores running a rsk should suffice to raise the bus utilization up to 100 percent, also considering the handshaking overhead. This can be ascertained using the performance monitoring counter (PMC) support provided by most COTS processor architectures (the Cobham Gaisler NGMP, for instance, provides registers 0x17 and 0x18 to measure per-core and cumulative bus utilization [20]). On the other hand, we have shown how the user can gauge $\delta_{nop}$, which is needed to determine the saw-tooth period.
The derived bound, $ubd_m$, can be used by STA as ubd by adding it compositionally to the access time to the bus considered without contention [7]. With MBTA, instead, the user must determine an upper bound $n_r$ to the number of requests that the scua issues to the bus. The ETB of the scua is then padded with the quantity $n_r \times ubd_m$.
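The MBTA padding step amounts to a single multiplication and addition. The sketch below uses a hypothetical function name and hypothetical figures for the execution time and the request count; only the 27-cycle bound is taken from the bus configuration evaluated later in the paper.

```python
def padded_etb(etb_cycles: int, n_r: int, ubd_m: int) -> int:
    """Pad an execution-time bound with the worst-case contention.

    etb_cycles: execution-time bound of the scua, in cycles
    n_r:        upper bound on the number of bus requests issued by the scua
    ubd_m:      measured upper-bound delay per request, in cycles
    """
    if min(etb_cycles, n_r, ubd_m) < 0:
        raise ValueError("all arguments must be non-negative")
    return etb_cycles + n_r * ubd_m


# Hypothetical task: 1,000,000 cycles, at most 5,000 bus requests.
print(padded_etb(etb_cycles=1_000_000, n_r=5_000, ubd_m=27))  # prints 1135000
```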
UBD FOR THE MEMORY CONTROLLER
In this section we show how to empirically derive the ubd for the memory controller. In our reference architecture, the L2 forwards its misses to a request queue located in front of the memory controller. Each core has one entry in that request queue, which therefore has four positions. On an L2 miss, a split command is sent to the bus to stall the core that caused the miss, until the corresponding memory request has been served. In the meanwhile, the other cores can continue working. To determine which pending request accesses memory, the memory controller implements two arbitration policies, FIFO and RoRo, which we discuss below in isolation.
Before we do that, though, we must clarify an inner detail of consequence. Assume that, at a given point in time, the request queue is full, so that it contains $N_c$ requests. Once one of those requests, $r_i$, has been served, two actions occur. First, a new request $r_j$ from another core is granted access to memory. Second, the core that issued $r_i$ (and has now resumed working) may miss again in L2 and therefore cause a request $r'_i$ to be stored in the request queue of the memory controller. As an L2 access is faster than a memory access, it is fair to assume that, in general, $r'_i$ gets stored in the request queue before $r_j$ is served.
As a building block to our measurement-based analysis, we use the msk concept outlined in Fig. 2b. This kernel causes a continuous stream of misses in DL1 and in L2, with each such request going to memory. Following the same methodology as for the bus, we generate a variant of this kernel, called msk-nop, which inserts a variable number of nop instructions in between cache accesses (cf. Fig. 6b).
Msk-Nop for FIFO
Using msk as scua, under FIFO arbitration, we must consider that the time to serve a memory request is longer than the time it takes for the scua to reach memory with another request $r_{i+1}$ after its previous request has been served. When $r_{i+1}$ reaches memory, it is preceded by exactly $N_c - 1$ pending requests, one for every other core, which all run msk. $N_c - 2$ of those requests are still awaiting service, whereas one of them has begun to be serviced for a duration that corresponds to the $\delta_{min}$ factor for memory. We can therefore see that this scenario is analogous to the one we have seen for the bus under FIFO, shown in Fig. 3 for $\delta_{min} > 0$. The extent of contention captured in that case is high, but not enough to observe a ubd contention effect.
Using msk-nop as scua allows us to explore a range of $\gamma$ whose period extends to $l_{mem}$. During that duration, the number of pending requests that precede $r_{i+1}$ is exactly $N_c - 1$ for $l_{mem} - \delta_{min}$ cycles, and $N_c - 2$ for $\delta_{min}$ cycles. Plotting the observed $\gamma$ as a function of the nop instructions inserted in the msk-nop used as scua, we would see the exact same shape as shown in Fig. 7, except with a different scale.
Msk-Nop for RoRo
Analogously to the case of FIFO arbitration, if we use an msk as scua, whenever a request $r_{i+1}$ reaches memory after its previous request has been served, it is preceded by exactly $N_c - 1$ requests. One of those pending requests has begun to be serviced for $\delta_{min}$ cycles: this means $\gamma = ubd - \delta_{min}$. We can therefore see that this scenario is analogous to what we saw for the bus with RoRo, as shown in Fig. 5 ($\delta_{min} > 0$). Once again, ubd contention is not observed.
Using msk-nop as scua, we obtain the "sawtooth plot" depicted in Fig. 9, in which $\gamma$ ranges between $ubd - 1$ and 0, which allows us to derive the ubd for the memory controller analogously to what we do for the bus under RoRo.
Deriving $l^{max}_{mem}$
As for the bus, the response time of requests to main memory, which is required to determine $l^{max}_{mem}$, may vary. As noted in [16], [21], the duration of a DRAM request in general depends on: i) the memory mapping scheme, which defines the mapping of physical addresses from the processors to the actual memory blocks in the memory devices; ii) the row-buffer policy; iii) the type of the request; and iv) the type of the predecessor request.
The response time to a memory request depends on the type of request, the target page (bank and rank), and the same set of parameters for the immediately preceding request. For instance, serving a request of a given type is typically faster when the preceding request is of the same type (i.e., Read-After-Read or Write-After-Write) than otherwise, and is obviously influenced by whether the accesses go to the same bank and rank or not. Along the same lines, access to open pages (i.e., hitting in the row-buffer) is faster than access to closed pages, which have to be loaded back into the row-buffer.
Those effects have been thoroughly studied in the literature [16], [21] and are typically well documented in DRAM specifications [22], [23], [24].
Based on this information, msk can be designed to cause the service time to equal $l^{max}_{mem}$. All it takes is to alternate the types of operations, and to set the address of the accesses to target the desired bank and rank in accord with the memory row-buffer management policy and the memory mapping scheme in place.
Memory Refresh
An intuitive solution to deal with memory refreshes consists in factoring the refresh delay $t_{RFC}$ in the ubd. However, this solution is exceedingly pessimistic, as it considers that every individual request is affected by a refresh operation.
With measurement-based approaches instead, the execution time observations taken on the real platform already naturally account for the impact of refreshes. Depending on how measurements align with refresh periods, the number of refreshes that can affect the execution time may be just one more than those actually observed. Hence, it is enough to pad the observed execution time with $t_{RFC}$.
Another solution is possible when a $\Delta_{cont}$ factor is used to compositionally increase the task's WCET, determined in isolation, with the contention overhead on access to the bus and the memory computed without considering refreshes. In that case, the number $N_{REF}$ of refresh operations occurring during $\Delta_{cont}$ can be easily computed with a recurrence relation
in which $t_{REFI}$ is the interval at which refresh commands [22] are sent to all banks, and $t_{RFC}$ is the number of cycles that a refresh command takes to complete. In fact, the impact of refresh operations can just be added to the computed WCET, without having to be captured in the computation of the ubd.
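The recurrence relation itself is not reproduced in this text. Purely as an illustration, the sketch below assumes a common fixed-point form, in which the window of interest grows by one refresh duration for every refresh that falls within it: $N^{k+1}_{REF} = \lceil (\Delta_{cont} + N^{k}_{REF} \times t_{RFC}) / t_{REFI} \rceil$, starting from $N^{0}_{REF} = 0$. This form is an assumption and may differ from the relation intended by the authors; the function name and the sample values are hypothetical as well.

```python
def refresh_count(delta_cont: int, t_refi: int, t_rfc: int) -> int:
    """Iterate an ASSUMED refresh recurrence until it reaches a fixed point.

    delta_cont: length of the window under analysis, in cycles
    t_refi:     interval between refresh commands, in cycles
    t_rfc:      duration of one refresh command, in cycles
    """
    if delta_cont < 0 or t_rfc < 0 or t_refi <= t_rfc:
        raise ValueError("need delta_cont >= 0 and 0 <= t_rfc < t_refi")
    n_ref = 0
    while True:
        # Each refresh counted so far lengthens the window by t_rfc cycles.
        window = delta_cont + n_ref * t_rfc
        next_n_ref = -(-window // t_refi)  # integer ceiling division
        if next_n_ref == n_ref:
            return n_ref
        n_ref = next_n_ref


# Hypothetical values, in cycles; not taken from any datasheet.
n_ref = refresh_count(delta_cont=10_000, t_refi=1_560, t_rfc=26)
print(n_ref, n_ref * 26)  # refresh count and the resulting padding in cycles
```

The requirement that the refresh duration be shorter than the refresh interval guarantees that the iteration terminates, because the sequence is non-decreasing and bounded.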
Side Effects of Bus Contention
When deriving the ubd for memory, we must consider that the access requests to it may also compete for the bus, thus incurring some further delay effects. In general however, bus contention is much lower than memory contention, hence the former cannot mask the latter during observation runs. Moreover, owing to the synchrony effect discussed earlier (which originates from the fact that the msks issue requests at a constant rate), the bus access requests corresponding to memory accesses are served in the bus with the same frequency as the service rate of memory, but with lower occupancy. Assume for example that $l_{mem} = 10$ cycles and $l_{bus} = 2$ cycles. In that case, we could have memory requests served at cycles $[2..11]$ for core $c_0$, at cycles $[12..21]$ for core $c_1$, at cycles $[22..31]$ for core $c_2$, and so forth, and bus requests at cycles $[0..1]$ for core $c_0$, at cycles $[10..11]$ for core $c_1$, at cycles $[20..21]$ for core $c_2$, and so forth. When using msk-nop as scua, the issue of requests from $c_0$ is increasingly delayed until they collide in the bus with requests from $c_1$. Under RoRo arbitration, this collision is not an issue, since which request is granted access to the bus first has no impact on memory contention as long as all contending requests reach memory before the corresponding core becomes the highest priority contender in memory. Under FIFO arbitration instead, if both requests are issued to the bus at exactly the same cycle, whether one or the other gets granted first may invert the order of access to memory for one round. However, as long as the hardware behaviour is deterministic, the shape of the plots will remain as in Fig. 7, and our approach to derive ubd will continue to work correctly.
EVALUATION
We first present our experimental set-up. Subsequently, in Sections 7.2 and 7.3, we show how rsk-nop allows deriving the ubd in the face of the synchrony effect. In that narration, we first assume knowing the bus and memory controller latency as well as the actual value of the corresponding ubd. This information is instead assumed unknown in Section 7.4, which demonstrates the applicability of our methodology to a real COTS multicore.
Experimental Setup
We model a four-core NGMP simulator [12] running at 200 MHz, comprising a bus that connects cores to the L2 cache and an on-chip memory controller, see Fig. 1. Each core has its own private instruction (IL1) and data (DL1) caches. IL1 and DL1 are 16 KB, four-way with 32-byte lines. The shared second level (L2) cache is split among cores, with each core receiving one way of the 256 KB four-way L2. Hence, contention only happens on the bus and the memory controller. DL1 is write-through and all caches use LRU replacement policy. Our simulator model includes a closed-page 2-GB one-rank DDR2-667 [23] memory, with four banks, burst of four transfers, and a 64-bit bus that provides 32 bytes per access, which fits a cache line. In our configuration, the longest service latency for requests of any type is 23 cycles.
In a study that we carried out for the European Space Agency, we assessed the performance fidelity of our simulator against a real NGMP implementation, the N2X [20] evaluation board. To that end, we used a low-overhead real-time kernel that allowed cycle-accurate observations and ran benchmark applications on it. The results we obtained for the EEMBC benchmarks [25], a suite of real-world automotive software functions, showed an average deviation of less than 3 percent. For the HAWAII benchmark [26], an algorithm used to process raw frames coming from the state-of-the-art near-infrared (NIR) HAWAII-2RG detector, the deviation reduced to less than 1 percent.
Synchrony Effect on the Bus
In order to show the robustness of the proposed methodology, we evaluate it in the reference architecture as presented above, as well as in a variant architecture (labeled as ref and var respectively in the following figures). In the latter, we change DL1 and IL1 access latency to four cycles (instead of one cycle). This variation increases the minimum injection time ($\delta_{min}$) of all bus-access instructions by three cycles.
For the purpose of showing how rsk-nop enables sound valuations of the ubd to be inferred from $ubd_m$, we use the following timing information for both architectures. A given request suffers maximum contention latency of $l_{bus} = 9$ cycles per contender: six cycles corresponding to the L2 hit latency, and three cycles for bus transfer and arbitration handover. Following Equation (1), this yields $ubd = 27$ cycles for the bus.
In a first experiment, we run eight four-task workloads randomly generated with the EEMBC benchmarks, on the ref architecture. The workloads are itemized in Table 2 . Fig. 11a presents the histogram of the number of contenders ready to send a request when the EEMBC benchmark in core c 0 requests the bus to start a transaction under FIFO (results for RoRo are analogous). The results obtained for different workloads are quite similar so we omit them here.
Most of the time, the requests issued by the EEMBC benchmark in $c_0$ find the bus empty or with just one contender. Only occasionally does the EEMBC benchmark in $c_0$ cross paths with two or three contenders. This provides empirical evidence that real application workloads are unlikely to generate scenarios in which the number of contending requests is as high as the theoretical worst case. As workloads and time alignments may vary, in fact, no a-priori guarantees can be provided that requests align in the worst possible way.
Incidentally, while for FIFO all contenders will be served first at some point, in the case of RoRo the particular state of the priority assignment determines whether those contenders will be served before or after c 0 .
In a second experiment we run four bsks that constantly access the bus. In this case (see the pink and light grey bars in Fig. 11a), we observe that, for almost every arbitration round, the number of contenders is $N_c - 1 = 3$. Hence, the bsks reach their goal of causing maximum contention load on the bus. Yet, owing to the synchrony effect, this ability does not suffice to ensure that every individual request from the scua incurs a ubd. As we have seen earlier, in fact, when $\delta_{min} > 0$ for both FIFO and RoRo, the actual contention is always inferior to ubd.
This experiment, in which we run four bsks, allows observing this phenomenon in more detail, measuring the actual contention delay $\gamma_i$ that each individual request issued by $c_0$ suffers. Fig. 11b shows the histogram of $\gamma$ under the reference and the variant architecture. Results for FIFO (shown in the figure) and RoRo (not shown in the figure) are practically identical. We observe that the synchrony effect causes almost all requests in each case to incur the same latency, since the injection time among requests is the same. Moreover, we observe that the distance between the (observed) $ubd_m$ and the actual ubd … We may therefore conclude that, in the general case, when the details about the latency of the bus are unknown, the use of bsks does not allow estimating the ubd with sufficient accuracy and confidence.
Synchrony Effect on the Memory
The same conclusions presented in the previous section for the bus also hold for the memory controller, where a request may suffer a maximum contention of 23 cycles per contender, whereby $ubd = (N_c - 1) \times 23 = 69$ cycles.
Our results, omitted for space constraints, confirm that: i) using three msks, one per core contending with the scua, suffices to ensure that, more than 98 percent of the time, any request issued by the scua finds three pending contending requests enqueued at the memory controller; ii) in both the reference architecture and the variant one, the $ubd_m$ is 69 cycles.
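The two reference values used in this evaluation follow from the same product of the number of contenders and the longest per-request latency. The short check below, with a hypothetical function name, reproduces them for the four-core configuration.

```python
def upper_bound_delay(n_cores: int, max_latency: int) -> int:
    """Worst-case contention: one longest-latency request per contending core."""
    if n_cores < 1 or max_latency < 0:
        raise ValueError("need n_cores >= 1 and max_latency >= 0")
    return (n_cores - 1) * max_latency


print(upper_bound_delay(n_cores=4, max_latency=9))   # bus: prints 27
print(upper_bound_delay(n_cores=4, max_latency=23))  # memory: prints 69
```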
Evaluation of the Bsk-Nop Methodology for the Bus
For the evaluation of the bsk-nop methodology, for FIFO and RoRo, we assume that no latency information is known.

FIFO. As shown in Section 5.1, to infer ubd, the injection time can be varied by inserting nop instructions between consecutive accesses of the rsk used as scua.
The Y-axis in Fig. 12 shows the slowdown suffered by bsk-nop with respect to its execution in isolation and the horizontal axis represents the variation of $\gamma$ as a function of the number of nop instructions inserted.
The experimental results match those in Fig. 7: the period of each sawtooth is nine cycles, which corresponds to $l_{bus}$. As discussed in Section 5.1, however, we have to take into account $N_c - 1$ periods. For instance, from the first peak (cycle 10) until the fourth one (cycle 37), the difference is exactly $ubd = 37 - 10 = 27$ cycles. Notably, the results for the ref and var architectures are exactly the same, but the absolute contention value decreases as $\delta_{min}$ increases.
RoRo. Fig. 13 shows the result of the same experiment when the bus uses RoRo. As predicted in Fig. 9, the slowdown is sawtooth-shaped, with period $ubd = 51 - 24 = 27$ cycles for var, and $ubd = 54 - 27 = 27$ cycles for ref. Hence, the period of the sawtooth is the same for both architectures, which proves the robustness of our method in inferring the ubd under different processor arrangements.
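The FIFO reading of the plot can be expressed as a small calculation. In the sketch below, the function name is hypothetical; the peak positions 10 and 37 are those reported for the bus, while the two intermediate peaks (19 and 28) are filled in from the stated nine-cycle period rather than read from the figure.

```python
from typing import Sequence


def ubd_from_fifo_peaks(peak_positions: Sequence[int], n_cores: int) -> int:
    """Under FIFO, one tooth spans one request latency, so the bound
    corresponds to (n_cores - 1) consecutive teeth."""
    gaps = {b - a for a, b in zip(peak_positions, peak_positions[1:])}
    if len(gaps) != 1:
        raise ValueError("peaks are missing or not evenly spaced")
    tooth_period = gaps.pop()
    return (n_cores - 1) * tooth_period


print(ubd_from_fifo_peaks([10, 19, 28, 37], n_cores=4))  # prints 27
```

The only knowledge required beyond the measurements is the number of cores, which is consistent with the premise that no latency information is available.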
Evaluation of the Msk-Nop Methodology for the Memory
We now repeat the same experiment as for the bus, by injecting nop instructions in the msk-nop used as scua. Fig. 14 shows the slowdown in cycles, compared with execution in isolation. The horizontal axis shows the number of nop operations inserted between memory accesses as shown in Fig. 6b for the msk. We can observe the same sawtooth shape as in Fig. 12, but with larger scale. The shape reaches its peak with a period of 23 cycles, when 2, 25, 48, 71, ... nop instructions are inserted, which means $l_{mem} = 25 - 2 = 23$, as expected.
Beyond 71 nops, the results stop following the sawtooth shape. We studied why that happens and concluded that at that point, the number of nop instructions in the loop is large enough to exceed the IL1 capacity, so that IL1 misses occur at each iteration. In order to confirm this observation, we repeated the experiment with a processor set-up in our simulator that comprises a perfect IL1, i.e., an IL1 in which all accesses are hits. This is shown as "L1 perfect" in Fig. 14: we observe that execution times follow the sawtooth shape, confirming our hypothesis about the increase in the number of conflicts in IL1. In order to solve this problem we propose the following approach.
Instruction Cache-Aware Msk-Nop Methodology. The msk-nop methodology first adds a given number of memory accessing operations (loads) in the main loop. This number is usually high to reduce the overhead (in relative terms) of the loop control applications, see Fig. 2. In the msk used for the experiments in the previous section, 50 load operations were included in the loop body, whose memory size therefore is around 200 bytes. When we add one nop instruction between successive loads, the loop body doubles in size. When the number of nop instructions between loads reaches 80, the size grows to $(50 \times 80) \times 4 = 16{,}000$ bytes, which roughly equals the IL1 size (16 KB). As shown in Fig. 12, the results start degrading just past that number of nop instructions.
To test the impact of having more than 80 nop instructions between load operations, we simply reduce the number of load operations in the loop body such that its size, taking into account the size of load operations and the nop instructions between them, does not exceed the instruction cache size (16 KB in this case). For instance, we place 50 loads in the loop for all experiments below 80 nop instructions. Then, we reduce the load count down to 40 for all experiments until 100 nop instructions, and so forth, always ensuring not to exceed the IL1 capacity.
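The sizing rule above can be sketched as a small helper. This is an illustration under stated assumptions rather than the script used for the experiments: the function name is hypothetical, every instruction is assumed to occupy four bytes (consistent with 50 loads taking about 200 bytes), and the few loop-control instructions are ignored.

```python
def max_loads_in_loop(nops_between_loads: int,
                      il1_bytes: int = 16 * 1024,
                      insn_bytes: int = 4,
                      load_cap: int = 50) -> int:
    """Largest load count whose loop body (loads plus nops) fits in IL1.

    Each load is followed by nops_between_loads nop instructions, so one
    load "group" occupies insn_bytes * (1 + nops_between_loads) bytes.
    The result never exceeds load_cap, the count used for small kernels.
    """
    if nops_between_loads < 0:
        raise ValueError("nops_between_loads must be non-negative")
    bytes_per_group = insn_bytes * (1 + nops_between_loads)
    return min(load_cap, il1_bytes // bytes_per_group)


for k in (10, 80, 100, 200):
    print(k, max_loads_in_loop(k))
```

With the default parameters, the helper returns 50 loads for 80 nop instructions and 40 loads for 100, in line with the values quoted above.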
With the new experiment in Fig. 15 we can corroborate that $ubd_m = (49 - 26) \times 3 = 69$ cycles, so $ubd_m = ubd$.
RoRo. Fig. 16 shows the results for RoRo with the original msk-nop that can exceed the IL1 cache size. As for FIFO, the shape degrades beyond 80 nop instructions, with the difference that, in this case, deriving the ubd may not be possible if we do not fix our msk-nop methodology. Again, when making the IL1 perfect, the sawtooth shape obtained is as expected, so we apply exactly the same solution as for FIFO: we keep the loop size below the IL1 cache size at all times. We do so with the experiment reported in Fig. 17: there we observe that the distance between two teeth of the plot is exactly $ubd_m = 136 - 67 = 69$ cycles, so $ubd_m = ubd$.
Summary
As shown, our methodology based on injecting nop operations in the corresponding rsk allows producing the sawtooth shapes needed to derive the ubd for both FIFO and RoRo arbitration policies. Differences in the shape across resources (bus and memory) only affect the scale of the plots, but not their interpretation. Also, we have observed that it is critically important to keep the size of the rsk small enough to fit in IL1 to prevent IL1 misses from corrupting the observation results and consequently breaking our methodology.
RELATED WORK
Timing analysis techniques can be broadly categorized into Static and Measurement-Based Timing Analysis (STA and MBTA respectively) [1].
STA relies on an accurate timing model of the hardware under test. STA further creates a mathematical representation of the application, which is combined with the timing model to derive bounds to the application's timing behavior on that hardware. STA places strong emphasis on soundness and safeness guarantees, which allows it in principle to conform with the requirements of safety and qualification standards. However, the validity of the bounds depends on the correctness of the hardware timing models, which are difficult to develop and test. This is compounded by the lack of timing information of processor implementations [27]. Even when hardware manufacturers provide timing information, experience shows that it can be inaccurate or outdated with respect to the actual chip implementation. For example, the FreeScale e500mc core documentation alone comprises three revisions already, with considerable changes among them [28]. In the case of multicores, this lack of information affects the impact of contention that tasks suffer in the access to shared hardware resources. All these difficulties have caused real-time industry and even STA tool providers to use measurement-based techniques [8] to derive contention bounds, as done for the P4080 [7].
MBTA executes the program on the real platform under stressful conditions and collects measurement observations from it. Those measurements are later reconditioned to approximate an upper bound to the program's WCET. For instance, the longest observed execution time, or high watermark, is recorded and inflated with a safety margin (e.g., 20 percent) pre-determined based on expert knowledge. For multicores, the reliability of results obtained with MBTA depends, among other factors, on ensuring that the measurement runs cause the application to incur maximum contention (ubd) on all of its accesses to all hardware shared resources.
Resource-stressing kernels (rsk) [10] are used to gauge the contention occurring on access to certain shared resources in parallel processor architectures. They are also used in [9] to characterize the NGMP [12] and in [8] to study the Freescale P4080.
The authors of [29] analyze the impact of resource sharing in multicores and critique the confidence that one can obtain with rsk. We acknowledge the need to increase the confidence in the results provided with rsk, which is in fact the focus of this paper and the purpose of the proposed rsk-nop-based methodology.
The authors of [30] highlight a counter-intuitive behavior with a RoRo-based multicore: the execution time of a task running against a given number of cores can be smaller than when running against fewer cores. Our work nails down the prime reason behind this particular behavior, namely the synchrony effect, and takes advantage of it to derive the ubd.
WCET estimates for various arbitration policies have been derived in the past for RoRo [11], TDMA [31], a RoRo-inspired group-based policy called MBBA [32], and even comparatively [4]. The authors of [33] propose a method based on Performance Monitoring Counters to compute WCET estimates with measurement-based timing analysis, when the ubd for a RoRo bus is known.
All these works assume knowledge about the bus timing, whether the slot sizes or the maximum transfer times. Our work requires no knowledge about that.
In the conference version of this paper [34], we concentrated on RoRo arbitrated buses. In this work, we extend our methodology in two directions: we cover another common arbitration policy (FIFO) and provide solutions for another shared resource (memory). Moreover, we also analyze the timing interactions of different hardware shared resources such as the bus and memory access. While the methodology proposed in this paper is assessed against the NGMP processor, we expect it to apply to other processors that embed fully non-blocking caches and out-of-order execution like the ARM Cortex A9 and A15.
Whereas in our reference architecture each core can have a single outstanding request to the L2, thereby exploiting memory-level parallelism among tasks, other architectures allow multiple outstanding requests per core to the L2 to be stored until service. In the latter case, the rsk should be designed to ensure that the L2 request buffer saturates so that each request actually takes $l^{max}_{res}$ to be served. Out-of-order execution, which is a challenge per se for timing analysis, can be accounted for in the design of the rsk so that it does not affect the intended behaviour. The fact that rsks use only nop instructions and memory operations should ease that fix.
It is worth noting that, at present, the real-time systems industry predominantly uses multicores to consolidate multiple independent applications on the same chip. Those applications either share no data at all or, if they do, the sharing happens off-chip (e.g., in memory). This trend reflects the fact that current timing analysis techniques expect the last-level cache to be partitioned among cores, precisely to prevent data sharing. Hence, while it is clear that, in the long run, parallel (as opposed to partitioned) programming will become mainstream for real-time systems, this is still a recent (and very active) area of research, not yet ready for industrial use. For this reason, in this work we can safely assume that the application programs are partitioned across cores and do not share data, so that the coherence mechanism does not perturb the execution-time measurements.
CONCLUSION
The lack of information about the internal working of modern COTS processors makes the use of measurement observations the sole viable means to infer the timing parameters required to dimension the worst-case execution time of application programs.
For the bus and the memory, which are highly shared resources in multicore hardware, the parameter of interest is the maximum contention delay that a request can suffer on access, which we call upper-bound delay, ubd.
The level of trust that can be placed in the execution time bounds derived for application programs running on COTS multicore processors depends on the soundness of the analysis technique in use and the accuracy of the timing parameters that it employs, including the ubd for buses and memory.
In this paper we have presented a measurement-based methodology that requires no knowledge on the timing parameters for access to the bus and memory resources, and yet is able to derive their ubd soundly and tightly. This ability increases the confidence in the execution-time bounds computed for application programs running on COTS multicore processors that use FIFO or RoRo arbitration.

Tullio Vardanega received the MSc degree in computer science from the University of Pisa, Italy, and the PhD degree in computer science from the Technical University of Delft, The Netherlands. He is with the University of Padua, Italy. He worked with European Space Agency (ESA) from July 1991 to December 2001. At the University of Padua, he teaches and leads research in the areas of high-integrity distributed real-time systems and advanced software engineering methods. He has a vast network of national and international research collaborations. He has co-authored more than 150 refereed papers and held organizational roles in several international events and bodies, for ESA, the European Commission, ISO, IEEE, and Ada-Europe.
Francisco J. Cazorla is the leader of the CAOS group, Barcelona Supercomputing Center. He has led projects funded by industry (IBM and Sun Microsystems), by the European Space Agency (ESA) and public-funded projects (FP7 PROARTIS project and FP7 PROXIMA project). He has participated in FP6 (SARC) and FP7 Projects (MERASA, VeTeSS, parMERASA). His research areas include both high-performance and real-time systems, on which he is co-advising several PhD theses. He has co-authored three patents and more than 100 papers in international conferences and journals.
