Abstract-The performance and power efficiency of multi-core processors are attractive features for safety-critical applications, as in avionics. But increased integration and average-case performance optimisations pose challenges when deploying them for such domains. In this paper we propose a novel approach to compute an interference-sensitive Worst-Case Execution Time (isWCET) considering variable access delays due to the concurrent use of shared resources in multi-core processors, particularly focusing on shared interconnects and main memory. Thereby we tackle the problem of temporal partitioning as required by safety-critical applications. In particular, we introduce additional phases to state-of-the-art timing analysis techniques to analyse an application's resource usage and compute an interference delay. We further complement the offline analysis with a runtime monitoring concept to enforce resource usage guarantees. The concepts are evaluated on Freescale's P4080 multi-core processor in combination with SYSGO's commercial real-time operating system PikeOS and AbsInt's timing analysis framework aiT. We abstract real applications' behaviour using a representative task set of the EEMBC Autobench benchmark suite. Our results show a reduction of up to 53% of the multi-core Worst-Case Execution Time (WCET), while implementing full transparency to the temporal and functional behaviour of applications, enabling the seamless integration of legacy applications.
I. INTRODUCTION
In recent years, the decreasing relative cost of electronics and the pace of electronics development have led to the adoption of modern Commercial Off-The-Shelf (COTS) computing architectures in avionics. Increasing demand for energy efficiency and performance will foster the use of multi-core processors, and especially Multi-Processor Systems on Chip (MPSoCs), in safety-critical domains such as avionics. Besides certification issues with COTS hardware [1], multi-core processors introduce additional problems related to the isolation of functionally disjoint applications.
In avionic systems, the Integrated Modular Avionics (IMA) concept is a standard system architecture integrating applications of different criticality on the same hardware platform. The so-called partitioning concept is introduced for the management and analysis of safety aspects and to enable incremental Development and Certification (iD&C) paradigms [2], [3]. Partitioning ensures spatial and temporal separation of unrelated functions, comprising the isolation of address spaces and bounded temporal interferences [4], [5]. Spatial separation is considered to be solved, as it is required not only in safety-critical but also in general-purpose systems. For instance, Memory Management Units (MMUs) and Input/Output MMUs (IOMMUs) are common features of today's computing platforms. However, multi-core processors introduce challenges for temporal partitioning [6], complicating the exact determination of the timing behaviour of accesses to shared resources, such as a Network-on-Chip (NoC) or shared caches.
In this paper we tackle the problem of temporal partitioning for multi-core processors, which is not yet sufficiently solved although it is strongly required for safety-critical systems. Our main contribution is the interference-sensitive Worst-Case Execution Time (isWCET) analysis concept, an extension to classical Worst-Case Execution Time (WCET) analysis for single-core processors that accounts for the inter-process interferences due to the use of shared resources in multi-core processors. Intuitively, we split the timing analysis into core-local timing and resource usage analyses and complement them with the computation of shared-resource interference delays. We supplement the offline analysis with runtime resource usage enforcement to bound the maximum inter-core interference. Using this approach, we address the question of how to efficiently compute a multi-core WCET and guarantee temporal and resource usage behaviour for an arbitrary number of hard real-time applications. In contrast to related approaches, our approach is able to guarantee deadlines for arbitrary hard real-time applications, while reducing the analysis complexity and avoiding both resource privatisation and mutual analysis of in-parallel scheduled applications. This enables independent analysis of applications, supporting iD&C. We evaluate our approach on a modern COTS MPSoC, Freescale's P4080, using extensions to SYSGO's Real-Time Operating System (RTOS), PikeOS, and AbsInt's timing analysis framework, aiT. To address known predictability issues with modern COTS architectures, we apply a more predictable configuration for the system, cf. Section V-C. For the evaluation, we focus on the shared NoC and main memory, while our approach is also valid for other stateless resources. Stateful resources, such as shared caches, are explicitly excluded since techniques like cache partitioning can be used to avoid access conflicts.
Hardware support for cache partitioning is available in some recent Systems-on-Chip (SoCs) like the P4080 [7] , but can also be implemented in software [8] .
The paper is structured as follows: The basic terminology and the resource capacity enforcement are defined in Section III. Details of the WCET analysis extensions and the computation of a multi-core timing bound are covered in Section IV. In Sections V and VI we present our implementation and evaluate the approach. We discuss the results and related research in Sections VII and II. We conclude the paper with a short summary and future work topics in Section VIII.
II. RELATED WORK
How to use multi-core processors in real-time systems is an active field of research. Hence, several very different approaches have been presented over the last years.
[9], [10], [11] propose deterministic execution models to control the access to shared resources. The basic concept is to divide program execution into multiple phases and restrict their capabilities. In [9] different resource access schemes are compared based on splitting the execution into acquisition, execution, and replication phases. Each phase is assigned an execution time and a maximum number of accesses to shared resources. The schemes differ in the phases during which communication with shared resources is permitted, and are evaluated accordingly. In [10] the PRedictable Execution Model (PREM) architecture for single-core COTS processors is proposed, introducing co-scheduling for shared resources. The authors split a program into sequences of so-called predictable and compatible intervals. Predictable intervals are used to preload all data and instructions into local caches, while system calls and interrupt preemptions are prohibited. Traffic from peripheral devices is only permitted during the execution phase of a predictable interval, resulting in an architecture with very little contention for accesses to shared resources. A comparable approach is presented in [11], which also splits execution into communication and local execution phases. In contrast to Pellizzoni et al., the authors target a multi-core processor and study several tools to realise such an architecture. All of these approaches share the same idea: they serialise accesses to the shared resource, i.e. they apply resource privatisation through a Time Division Multiple Access (TDMA) scheme. Such approaches are known to utilise the respective resource poorly [12], hence it is questionable to what extent they provide a cost-efficient solution for multi-core-based real-time systems. For that reason we avoid resource privatisation.
Another popular approach is joint analysis. To address sharing of resources those approaches analyse the program flows on all cores using the considered shared resource. Therefore detailed knowledge on the state of execution is required. In [13] WCET analysis is applied to multi-core processors with shared L2 caches by analysing inter-thread dependencies. The analysis is based on the program control flow and accounts for all possible conflicts on the shared cache. This work is extended by identifying possibly overlapping threads [14] and reducing the number of possible conflicts between overlapping threads [15] . In [16] , [17] , [18] shared cache and bus analysis are combined, applying a TDMA bus arbitration. In [18] , the authors combine cache and bus analysis with other architectural features such as pipelines and branch prediction. Overall joint analysis approaches have to explore huge state spaces due to the various possible interactions between different tasks, resulting in huge computational complexity. Hence their scalability to a rising number of cores is questionable. Additionally, the mutual analysis of different applications contradicts the iD&C requirements of safety-critical systems. [19] , [20] are examples of approaches which propose changes to the hardware to address the resource sharing problem and its consequences for WCET analysis. Since the focus of this paper is on COTS processors, further details on solutions based on custom hardware will not be discussed.
Further, different approaches towards the response time analysis and the augmentation of additional delays due to resource conflicts have been proposed. In [21] a method to compute the inter-task interference due to shared resources based on a minimum distance between memory accesses is proposed. The authors focus on the computation of the maximum number of cache misses per core within a certain time interval, assuming preemptive scheduling. These bounds are further used to compute the inter-core interference and the response time of a task. [22] , [23] discuss a similar approach but assuming a non-preemptive task model. This allows some optimisations which finally lead to tighter bounds on the number of cache misses per core. [21] , [22] , [23] commonly consider online scheduling schemes focusing on the computation of the task response time. In contrast, we target static scheduling, motivated by avionics standards. Instead of modeling the temporal occurrence of shared resource requests we abstract processes by their pure number of requests. Firstly, this allows for a more efficient analysis, greatly reducing complexity. Secondly, since it is much easier to pre-define resource limits than arrival curves at design time, it also enables iD&C. We further consider the requirements of safety-critical and partitioned systems by providing additional monitoring mechanisms to control the inter-core interference.
With respect to monitoring, [24], [25] introduce the idea of leveraging built-in processor counters to acquire additional task runtime information. [26], [27] propose similar methods to isolate the behaviour of different cores. They propose server-based approaches which assign a certain limit on the number of cache misses. Yun et al. [26] focus on the isolation of one critical core from multiple non-critical cores by using one resource server per non-critical core. They parametrise the server limits based on the WCET and resource usage of the critical core. Their main focus is on minimising the performance impact on the non-critical cores, which is very different from our goal of guaranteed deadlines for arbitrary hard real-time tasks. Similar to [22], [23], the authors use arrival curves to model the cache miss behaviour of tasks. [27] follows a similar approach, but applies a hierarchy of servers to all cores instead of only the non-critical cores. The main difference between our approach and [26] is our focus on isolation and guarantees for hard real-time systems with an arbitrary number of hard real-time tasks per core. The assumed static schedule also differentiates this paper from Yun et al. and Behnam et al., introducing different analysis assumptions. The approaches in [21], [22], [23], [26], [27] share the assumption of uniform memory access times, i.e. they rely on a single memory latency for their analysis. They inherently assume a linear increase due to interference, which contradicts measurements on real hardware, e.g. [28]. To address this, we assume different access latencies, depending on the number of requesters. Moreover, none of these approaches accounts for the runtime overhead, i.e. additional execution time and resource requests, and its impact on the system. Only [26] mentions some temporal overhead. Finally, only [27] considers the implications of iD&C.
Addressing these issues, we evaluate the overhead of our approach and discuss the applicability of iD&C.
Considering the individual drawbacks of the discussed approaches, further research towards the integration and analysis of hard real-time systems on multi-core processors is required.
III. RESOURCE CAPACITY ENFORCEMENT
Temporal isolation is not a new topic in safety-critical systems. It becomes much more complex when such systems deploy multi-core processors. In single-core systems, temporal isolation with respect to shared resources has to be considered only if Direct Memory Access (DMA)-capable peripheral devices or interrupts are enabled [29] . By means of abstraction layers, device drivers and disabling of unpredictable interrupts it is possible to avoid any unintended parallel accesses of processing cores and peripheral devices to memory. In case of multi-core processors however, parallel resource accesses are inherent and hardware arbitrated. The exact arbitration in COTS processors is not disclosed in most cases as vendors try to maintain their competitive advantages. Moreover, preventing or controlling interfering accesses of low criticality applications may not be possible, due to their loose certification requirements, cf. [5] . Hence the temporal behaviour of applications is hard or even impossible to predict and timing analysis becomes increasingly complex.
We target an integrated approach of multi-core worst-case timing analysis and runtime monitoring mechanisms. The runtime mechanisms are required to enforce defined resource capacities per scheduling entity. The implications of the capacity guarantees are further used to compute the additional delays due to concurrent resource accesses, the interference delays. We calculate an interference-sensitive WCET bound, based on the single-core/core-local WCET and the interference delays.
The concepts are designed such that they (1) are functionally and temporally transparent to applications, (2) avoid mutual analyses of in-parallel scheduled tasks and (3) allow parallel resources to be utilised. Functional transparency denies applications any ability to control the runtime mechanisms, which is essential to fulfil safety requirements. Temporal transparency prevents influences on the timing of an application. Avoiding mutual analysis greatly reduces complexity, since each application can be analysed separately. Furthermore, it is essential to enable iD&C of applications, cf. [2], [3]. As already mentioned, multi-core processors are attractive, among other reasons, for their performance. Consequently, leveraging the built-in parallelism is indispensable for efficient utilisation of the platform.
In the following, we use the term process to refer to the smallest schedulable unit. We are aware of different notations for such a unit, depending on application domain and abstraction layer. General-purpose operating systems, for instance, use the terms thread and process, where processes are separated from each other and threads share the address space of their parent process. The ARINC 653 standard [4], which is commonly used in the avionics domain, describes partitions and processes. Partitions can consist of multiple processes and shall be separated from each other. The separation requirements in the partition/process model are stricter than in the process/thread model, since each partition is guaranteed when and for how long it is executed.
To partition a resource, it is necessary to quantify its capabilities. We abstract a resource $\phi_k \in \Phi$ using its capacity $\kappa^{\phi_k}$. This number is representative of a specific resource parameter, e.g. bandwidth or number of accesses. The actual parameter depends on the target resource. Each requesting process $\pi_i$ is assigned a certain share $\kappa^{\phi_k}_{\pi_i}$ of $\kappa^{\phi_k}$, called process limit or capacity. In order to provide a safe mechanism, applicable to real-time systems, each share needs to be guaranteed. To describe the approach we introduce the concepts of limitation, monitoring and suspension.
A. Limitation
Limitation is an offline mechanism used to assign the capacity $\kappa^{\phi_k}_{\pi_i}$ per process. To provide safe partitioning, the limits are required to bound the resource usage of processes. Furthermore, the resource shall not be over-utilised, to allow safe scheduling. Hence the sum over the capacities of all in-parallel scheduled processes $\pi_i \in \Pi_{\parallel}$ must not exceed the overall resource capacity $\kappa^{\phi_k}$, cf. Equation 1:
$$\sum_{\pi_i \in \Pi_{\parallel}} \kappa^{\phi_k}_{\pi_i} \le \kappa^{\phi_k} \qquad (1)$$
The computation of the limits is part of the extended worst-case analysis and discussed in Section IV.
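The limitation constraint lends itself to a mechanical check at integration time. The following is a minimal sketch, assuming capacities are expressed as access counts; the function name `limits_are_feasible` and the dictionary layout are hypothetical and not part of the described tool chain.

```python
def limits_are_feasible(process_limits, resource_capacity):
    """Check the limitation constraint for one process frame.

    process_limits: hypothetical mapping from process name to its assigned
    limit (e.g. number of accesses) for one shared resource.
    resource_capacity: overall capacity of that resource in the same unit.
    """
    if any(limit < 0 for limit in process_limits.values()):
        raise ValueError("process limits must be non-negative")
    # The limits of all in-parallel scheduled processes must fit together.
    return sum(process_limits.values()) <= resource_capacity


# Illustrative numbers only; they are not taken from the evaluation.
frame_limits = {"pi_0": 4000, "pi_1": 2500}
print(limits_are_feasible(frame_limits, resource_capacity=7000))
```

Running such a check once per process frame mirrors the requirement that the constraint holds for every set of in-parallel scheduled processes.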
B. Monitoring
Runtime monitoring is used to observe the resource usage of a process. Once a process reaches a limit, monitoring is responsible for triggering the suspension mechanism. The monitoring provides two functional benefits. First, it covers unsafe resource boundaries. Although we concentrate on static analysis, measurement-based and hybrid approaches can also be applied, depending on the criticality level of an application. Since the latter two techniques are not proven to provide safe bounds, they might underestimate the resource usage. As a countermeasure, the monitoring provides a safety net [30] to prevent partitioning violations. Secondly, the monitoring prevents partitioning violations caused by external events, such as Single Event Upsets (SEUs), which might alter the control flow of an application at runtime.
Any delays between limit violation and process suspension have to be taken into account, e.g. via a sufficient safety margin in the limits. Monitoring as such has to be transparent in both the functional and the temporal dimension. Functional transparency prevents processes from exerting any control over the monitoring mechanism. Temporal transparency is required to prevent influences on the execution time of processes, which would otherwise further complicate timing analysis.
C. Suspension
A suspension action is triggered once a violation of the limit has been detected. The action is responsible for preventing further accesses to the shared resource, avoiding interference with other processes until the end of the current process window. From the system perspective, a process suspension is equivalent to a deadline miss. Known techniques for fault-tolerant systems have to be applied to properly handle the impact on system safety. This can be a redundancy concept to compensate for the missing results, or a system reset if the system cannot be operated properly without the results. The particular technique depends on the system and the criticality of the affected processes.
The requirements for suspension are functional transparency and deterministic reaction timings. The latter are required to account for the resource requests that a process can still issue until it is finally suspended. Hence these requests have to be included in the limits.
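The interplay of limit, monitoring and suspension can be illustrated with a small worst-case model. This is a sketch under stated assumptions, not the implementation: the class `CapacityMonitor` and the parameter `reaction_accesses` (the number of requests a process may still issue between limit detection and actual suspension) are hypothetical names, and the model pessimistically assumes that the reaction window is always fully used.

```python
class CapacityMonitor:
    """Worst-case model of monitoring and suspension for one process."""

    def __init__(self, limit, reaction_accesses=0):
        if not 0 <= reaction_accesses < limit:
            raise ValueError("reaction_accesses must be in [0, limit)")
        self.limit = limit
        # Trigger early so that accesses issued during the reaction
        # window are still covered by the guaranteed limit.
        self.trigger_at = limit - reaction_accesses
        self.start_frame()

    def start_frame(self):
        """Replenish the budget at the start of a process frame."""
        self.count = 0
        self.triggered = False
        self.stopped = False

    def on_access(self):
        """Record one access attempt; return False once the process is stopped."""
        if self.stopped:
            return False
        self.count += 1
        if self.count >= self.trigger_at:
            self.triggered = True   # suspension action has been requested
        if self.count >= self.limit:
            self.stopped = True     # suspension is effective at the latest here
        return True


monitor = CapacityMonitor(limit=5, reaction_accesses=2)
granted = sum(monitor.on_access() for _ in range(10))
print(granted)  # number of accesses that reached the shared resource
```

In this model a process never issues more accesses than its limit, which is the property the interference computation relies on.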
D. Scheduling Model
According to the avionics standard ARINC 653 [4] , we consider a time-triggered, static scheduling scheme. That is, a process has a fixed activation time and deadline, further called process frame, within a time frame. During each process frame the execution time and resource usage boundaries are guaranteed, based on the described mechanisms. If multiple processes share the same core they are assigned to different process frames. It is important to note that the process frames are synchronised over all cores. The schedule is not work conserving, i.e. if a process finishes before its deadline, the processor core is idle for the rest of the process frame. Instances of the same process can be executed in different process frames, depending on the application requirements. Once all process frames of the time frame have been executed, the whole time frame is started over again.
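Because process frames are synchronised over all cores, the set of in-parallel scheduled processes can be read directly from a static schedule table. A minimal sketch follows, assuming a hypothetical table layout with one list of process names per core and `None` marking an idle frame.

```python
def in_parallel_sets(schedule):
    """Return, per process frame, the processes scheduled in parallel.

    schedule: hypothetical mapping from core name to a list with one entry
    per process frame (a process name, or None if the core is idle).
    """
    frame_counts = {len(frames) for frames in schedule.values()}
    if len(frame_counts) > 1:
        raise ValueError("all cores must use the same number of process frames")
    return [
        [process for process in frame if process is not None]
        for frame in zip(*schedule.values())
    ]


# Illustrative time frame with three process frames on two cores.
schedule = {
    "rho_0": ["pi_0", "pi_2", None],
    "rho_1": ["pi_1", "pi_3", "pi_1"],
}
for index, processes in enumerate(in_parallel_sets(schedule)):
    print(index, processes)
```

Each returned list corresponds to one set of in-parallel scheduled processes for which the capacity constraint and the interference computation are evaluated.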
E. Example: Concurrent NoC Accesses
To illustrate the operation of the mechanism, we describe an example. Figure 1 depicts a system schedule as described above. It shows the processes $\pi_0$ to $\pi_{|\Pi_{\parallel}|-1}$ assigned to different process frames on cores $\rho_0$ and $\rho_1$. The processes $\pi_0$ and $\pi_1$ are described in more detail, but the mechanism applies to all process frames. Both processes are assigned to the same process frame and therefore compete for NoC accesses. Each diagram plots the accumulated number of memory accesses ($\kappa$) over time. Continuous lines represent normal execution, dashed lines depict abnormal conditions and dotted lines partitioned conditions. Horizontal dot-dashed lines represent the limits $\kappa^{\phi_k}_{\pi_i}$. The completion of a process is marked by an x.
Under normal conditions (continuous lines), both processes finish within their process frame. Under abnormal conditions (dashed lines), $\pi_0$ issues significantly more accesses than in the normal case. This can, for instance, be caused by a software error (unbounded loop) or by SEUs. Consequently $\pi_0$ executes until it is stopped at the end of the process frame. Its additional accesses interfere with those of $\pi_1$ such that $\pi_1$ suffers higher delays, causing a deadline violation. To avoid such unbounded temporal impacts between processes we introduce the described partitioning approach (dotted lines). Once $\pi_0$ exhausts its limit ($\kappa^{\phi_k}_{\pi_0}$), it is suspended. Consequently, $\pi_1$ experiences some increase in execution time, but since this increase is bounded by $\kappa^{\phi_k}_{\pi_0}$, $\pi_1$ is able to finish within its deadline. In summary, a guaranteed resource usage limit per process bounds the interference with other processes. Hence, faults do not propagate over process boundaries, providing temporal isolation and fault containment.
IV. WCET ANALYSIS
This section describes our approach to compute an isWCET bound. This bound includes additional delays due to interferences with in-parallel scheduled processes without requiring mutual analyses of all processes. We will first give a brief overview of state-of-the-art static timing analysis techniques and the challenges that arise with modern processor architectures. Based on that, we derive the additional analysis blocks necessary to compute a multi-core bound. For this purpose a timing-compositional system with respect to shared resource usage is required.
A. Timing Analysis
Over the last several years, a more or less standard architecture for timing analysis tools has emerged [31], composed of three major building blocks: (1) control-flow reconstruction with control- and data-flow analyses, (2) micro-architectural analysis, computing upper bounds on execution times of basic blocks [32] and (3) path analysis, computing the longest execution paths through the program [33]. The data-flow analysis also detects infeasible paths, i.e. program points that are unreachable during real execution. This reduces the complexity of the subsequent micro-architectural analysis. Basic block timings are determined using an abstract processor model to analyse the instruction flow through the pipeline, accounting for cache hit/miss information. This model defines a cycle-level abstract semantics for each instruction's execution, yielding a certain set of final system states. After the analysis of one instruction has finished, these states are used as start states in the analysis of the successor instruction(s). The pipeline analysis examines all possible execution paths whenever the timing model introduces non-determinism, e.g. due to unknown cache information. This architecture is implemented by AbsInt's timing analysis tool aiT.
B. Multi-core Analysis
The main contribution of this paper is the computation of an upper bound for the isWCET ($\tau_{is}$). The computation consists of two phases: (1) core-local timing and resource analyses per process and (2) combination of the core-local results of in-parallel scheduled processes to derive the multi-core bounds. The core-local timing analyses include pipeline and local cache analyses. In addition, a resource analysis has to be performed for each shared resource used by the respective process. Consequently, the core-local analyses of process $\pi_x$ return an upper timing bound, denoted $\tau_s(\pi_x)$, and an upper bound $\kappa^{\phi_k}_{\pi_x}$ for the usage of every utilised shared resource $\phi_k$, further denoted the Worst-Case number of shared Resource Accesses (WCRA). Both timing and resource analyses are implemented using the described standard architecture for timing analysis. The computation of the multi-core timing bound $\tau_{is}(\pi_x)$ is based on the core-local analyses and the interference delays, which are guaranteed by the runtime resource capacity enforcement. For the sake of simplicity, we further focus on only one shared resource, the main memory, including the required interconnect.
The general problem of shared resources is unpredictable access delays that depend on the resource usage of connected devices, e.g. processor cores and peripherals. The interference stems from arbitration delays if multiple requests are issued to a resource with limited bandwidth. For instance, if a resource can handle one request at a time and two requests are issued in parallel, one of them is processed while the other one has to wait. At this point we have to assume a fair arbitration in order to be able to determine upper bounds on the imposed delays. Assuming round-robin arbitration, the worst-case delay for the second request includes the execution time of both the first and the second request, as well as any additional arbitration overhead. Consequently, any resource that has to arbitrate between parallel requests influences the latency. Examples are NoCs, shared caches and main memory controllers. The arbitration delay may increase with the number of parallel requests. This is expressed in Equation 2, where $\delta_i$ denotes the resource access delay when $i$ requests are issued in parallel. In practice, Equation 2 can be interpreted such that no two delays $\delta_i$, $\delta_{i+1}$ exist for which the relative delay $\delta_i$ is greater than $\delta_{i+1}$ when both are normalised to the number of requesters.
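Whether a measured latency profile satisfies this property can be tested numerically. The sketch below encodes one reading of the prose interpretation, namely that the delay divided by the number of parallel requests must not decrease as the number of requests grows; this reading, the function name and the sample values are assumptions and do not reproduce Equation 2 itself.

```python
def normalised_delay_is_non_decreasing(delays):
    """Check delays[i-1] / i <= delays[i] / (i + 1) for all adjacent pairs.

    delays[i-1] is the access delay observed with i parallel requests
    (i = 1, 2, ...). Integer cross-multiplication avoids rounding issues.
    """
    for i in range(1, len(delays)):
        if delays[i - 1] * (i + 1) > delays[i] * i:
            return False
    return True


# Hypothetical latencies in cycles for 1 to 4 parallel requesters.
print(normalised_delay_is_non_decreasing([40, 90, 150, 220]))
print(normalised_delay_is_non_decreasing([40, 70]))
```

A profile that fails this check falls under the case discussed in the text where the worst-case overlap has to be derived differently.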
To compute the additional interference delays, it is necessary to determine the worst-case overlap scenario for concurrent requests by different cores in the same process frame. According to Equation 2, the worst-case overlap for $\Pi_{\parallel}$ cores appears if all cores issue their requests in parallel. For architectures where Equation 2 is not valid, the worst-case overlap has to be derived differently, either from the architecture parameters or by analysing permutations of sequential and parallel requests. The general case to compute a multi-core bound for process $\pi_x$ and a single shared resource is expressed in Equation 3. We assume that the processes are sorted according to their capacities, i.e. $\kappa^{\phi_k}_{\pi_i} \le \kappa^{\phi_k}_{\pi_{i+1}}$. The accesses of all processes with higher WCRAs than $\pi_x$ in the corresponding process frame have to be considered as overlaps. Equation 3 can be extended to cover multiple shared resources if needed. To provide WCETs for all processes in a system, Equation 3 has to be computed for every process considering its respective process frame. If a process is assigned to multiple process frames, its isWCET needs to be computed for each process frame, since the interference by in-parallel scheduled processes can be very different.
The proposed isWCET computation is safe since the runtime mechanisms enforce the process capacities.
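To make the combination step concrete, the following sketch computes a bound by layering the capacities of the in-parallel processes: the first accesses of $\pi_x$ are assumed to overlap with all processes, later ones only with processes whose capacity is not yet exhausted. This is an illustrative scheme and not a reproduction of Equation 3. It assumes that fully parallel overlap is the worst case, and that `extra_delay[n - 1]` is the additional delay per access with `n` concurrent requesters beyond what the core-local bound already contains; both the parameter convention and all names are hypothetical.

```python
def interference_sensitive_wcet(tau_s, wcra, x, extra_delay):
    """Layered-overlap bound for process x in one process frame.

    tau_s: core-local WCET of process x in cycles.
    wcra: mapping from process name to its enforced access limit, for all
          in-parallel scheduled processes (including x).
    extra_delay: extra_delay[n - 1] is the assumed additional delay per
          access when n requesters access the resource concurrently.
    """
    capacities = sorted(wcra.values())
    n = len(capacities)
    if len(extra_delay) < n:
        raise ValueError("need one delay value per possible requester count")
    c_x = wcra[x]
    bound = tau_s
    covered = 0  # accesses of x already accounted for
    for index, capacity in enumerate(capacities):
        level = min(capacity, c_x)
        if level > covered:
            # Processes index..n-1 can still issue accesses in this layer.
            requesters = n - index
            bound += (level - covered) * extra_delay[requesters - 1]
            covered = level
        if covered >= c_x:
            break
    return bound


# Illustrative values only.
limits = {"pi_0": 1000, "pi_1": 400}
print(interference_sensitive_wcet(100_000, limits, "pi_0", extra_delay=[0, 50]))
```

With these numbers, only the first 400 accesses of `pi_0` can overlap with `pi_1`, because the runtime enforcement guarantees that `pi_1` issues no more than its limit.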
C. Predictability Challenges
Predictability and timing analysis are challenging for modern COTS processors due to their optimisation towards average-case performance. Execution-history-related features, such as caches and branch predictors, increase the search space for analysis since instructions cannot be analysed in isolation. Instead, mutual interactions need to be considered in order to obtain tight results. Additionally, so-called timing anomalies [34], [35] can drastically increase the analysis state space. Intuitively, a timing anomaly is a situation where a local worst-case situation does not contribute to the global worst case. For instance, a cache miss, the local worst case, may result in a globally shorter execution time than a cache hit because of hardware scheduling effects. Hence, it is not safe to rely on local decisions. Instead both paths need to be considered, which drastically increases analysis complexity. [36] categorises the timing composability of computing architectures according to the presence of timing anomalies. Fully compositional architectures, such as the ARM7, contain no timing anomalies. Compositional architectures only contain bounded timing effects, while non-compositional architectures contain unbounded anomalies, so-called domino effects. Missing composability does not prevent timing analysis, but greatly increases analysis complexity. To avoid high complexity, composability is commonly assumed in related approaches, motivated by the problem domain, to enable efficient analysis [21], [23].
As explained, it is hard to formally prove composability for modern COTS processors. But the configuration can increase the predictability by avoiding known sources of anomalies [37] , e.g. through the cache write policy and branch predictor settings. Some techniques to handle timing anomalies are described in [34] , for instance by using special pipeline synchronisation instructions in the PowerPC architecture. Additionally, in the context of mixed-criticality systems the particular criticality level of an application has to be taken into account. For example, even if compositionality is not formally proven, it can still be valid to rely on local decisions during analysis, as long as the obtained results are suitable for the criticality of the target application.
For our implementation, composability is required for the final computation of the isWCET. As explained, composability is a common assumption for single-core systems. Hence, it does not pose different requirements than related approaches. On the other hand, the isWCET computation additionally requires composability in order to ensure that the interference delay is additive to the core-local WCET, without influencing the core-local pipeline analysis. We argue that variations in shared resource access latencies do not cause processor pipeline timing anomalies, since even the best-case latency is orders of magnitude higher than typical pipeline latencies. For example, according to the e500mc manual [38], typical pipeline latencies are between 1 and 3 cycles, while memory latencies are roughly between 40 and 1000 cycles depending on the number of requesters, cf. Table I. Because of the huge difference between pipeline and memory latencies, the processor's pipeline will drain in any case while the access is processed. Hence, the interference delays do not affect pipeline analysis in the sense of timing anomalies.
V. IMPLEMENTATION
This section covers our implementation of the described concepts. The WCET analysis extensions are implemented using AbsInt's aiT, the leading commercial framework for static timing analysis. It has been successfully applied to real avionics applications [39]. The runtime mechanisms have been implemented in SYSGO's PikeOS, a commercial RTOS which has been used in multiple certified projects, like the Airbus A400M [40] and the Airbus A350 [41]. As target computing platform we select Freescale's P4080. Although the P4080 is classified as a non-compositional architecture according to [36], it is the reference platform for the QorIQ series from Freescale, which is commonly used for research evaluation of future platforms for the safety-critical application domain [42]. As such, the P4080 is under investigation in multiple research projects such as ARAMiS [43], MUSE [44] and RECOMP [45]. While the P4080 poses predictability issues complicating its deployment in future avionics and automotive systems, it provides a first step towards the use of multi-core processors in safety-critical real-time systems, hence the platform is sufficient for the evaluation of our work. The eight e500mc PowerPC cores and the CoreNet NoC platform interconnect are of special interest for our work. The NoC of the P4080 is considered a black box, since design-internal NoC information is generally not disclosed by silicon manufacturers.
A. Limitation
Process capacities can be assigned using static analysis, measurement-based approaches or manually. We implemented the described multi-core timing analysis in AbsInt's aiT analysis framework. The existing single-core/core-local analysis has been extended to calculate the WCRA for memory accesses. The calculation is based on the results of cache analysis, which classifies accesses as cache hit, cache miss or unknown. Cache hits count as local accesses and cache misses as memory/shared resource accesses. In order to provide safe bounds, accesses marked as unknown are treated as shared resource accesses, too. The WCRA is determined for each basic block. We extended path analysis to solve an Integer Linear Program (ILP) optimising for WCRA. Besides this extension, an appropriate architecture model for the e500mc cores is required. In the current version a prototype based on an early e600 model is used. Architectural differences have been derived from the processor manuals. The current model can handle at most one cache level. Thus it represents either the L1 caches or the mini (L0) caches, i.e. the Data Line Fill Buffer (DLFB) and the Instruction Line Fill Buffer (ILFB). This results in large overestimations of the resulting WCETs and WCRAs. This is, however, a limitation of the prototype model and not of the proposed approach. To apply the tool for the verification of safety-critical systems, model refinement and additional tool validation are required. The described computation of the isWCET is performed separately for each process frame; individual process frames are therefore independent of each other, as intended for partitioned systems. The core-local analyses of processes are also independent, since no details on other processes are required, therefore enabling iD&C. Since the isWCET computation requires the WCETs and WCRAs of all processes, it has to be performed during system integration.
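The per-block counting rule described above can be illustrated with a minimal sketch. It is not the aiT implementation: the names `Access`, `block_wcra` and `worst_path_wcra` are hypothetical, the classification data are made up, and a simple longest-path search over an acyclic control-flow graph stands in for the ILP-based path analysis (loops and loop bounds are not modelled).

```python
from enum import Enum
from functools import lru_cache


class Access(Enum):
    """Cache analysis result for one memory access (hypothetical type)."""
    HIT = "hit"
    MISS = "miss"
    UNKNOWN = "unknown"


def block_wcra(accesses):
    """Number of accesses in one basic block that may reach the shared resource."""
    # Hits stay core-local; misses and unknown accesses are both counted,
    # so the result is a safe upper bound.
    return sum(1 for a in accesses if a is not Access.HIT)


def worst_path_wcra(cfg, weights, entry):
    """Largest accumulated WCRA over all paths of an acyclic CFG.

    Stand-in for the ILP-based path analysis: no loops, no loop bounds.
    """
    @lru_cache(maxsize=None)
    def best(node):
        tail = max((best(s) for s in cfg[node]), default=0)
        return weights[node] + tail

    return best(entry)


# Made-up classification results for four basic blocks.
classified = {
    "entry": [Access.HIT, Access.MISS],
    "then": [Access.UNKNOWN, Access.MISS, Access.HIT],
    "else": [Access.HIT, Access.HIT],
    "exit": [Access.MISS],
}
cfg = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}

weights = {name: block_wcra(accs) for name, accs in classified.items()}
wcra = worst_path_wcra(cfg, weights, "entry")
print(weights, wcra)
```

In this toy example the path through `then` dominates, because the unknown access is conservatively counted as a shared resource access.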
B. Monitoring and Suspension
Monitoring and suspension are implemented in SYSGO's real-time Operating System (OS), PikeOS. A similar implementation using a bare-metal OS layer has been used for evaluation purposes. Both implementations use the built-in processor core Performance Monitor Counters (PMCs). The cores of the P4080 provide the Bus Interface Unit Access events, which count all accesses to the shared system NoC. This includes explicit instruction fetches and data load/store operations as well as accesses caused by pre-fetching and cache-related write-backs. Hence, using this event we are able to monitor all transactions that a core issues to the shared memory hierarchy. To properly handle the PMCs we implemented two routines. The first routine is called per process at the start of its process frame to initialise the PMC with the WCRA of the process. The second routine is a callback function which is triggered by the overflow exception of the PMC when the configured limit has been exhausted. Accordingly, this routine implements the suspension action, which terminates the process execution for the current process frame. Instances of that process in other process frames are not affected. This means that the process is executed normally in subsequent process frames, with a replenished resource limit. The current version of the suspension action ignores outstanding write-backs. In later versions, these must either be handled during analysis or avoided by invalidating the caches without updating the system memory.
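The interplay of the two routines can be sketched as a small software model. This is only an illustration, not PikeOS or e500mc code: the class `ProcessMonitor` and its methods are hypothetical, the PMC is modelled as a plain countdown, and we assume that the suspension is triggered by the first access attempted beyond the configured WCRA.

```python
class ProcessMonitor:
    """Software model of per-process resource capacity enforcement."""

    def __init__(self, wcra):
        self.wcra = wcra          # configured limit for one process frame
        self.remaining = 0        # models the remaining PMC budget
        self.suspended = False

    def start_frame(self):
        """First routine: replenish the budget at the start of a process frame."""
        self.remaining = self.wcra
        self.suspended = False

    def on_overflow(self):
        """Second routine: suspend the process for the current frame only."""
        self.suspended = True

    def access(self):
        """Model one shared resource access; return True if it was issued."""
        if self.suspended:
            return False
        if self.remaining == 0:
            # Assumption: the overflow fires on the first access beyond the limit.
            self.on_overflow()
            return False
        self.remaining -= 1
        return True


monitor = ProcessMonitor(wcra=3)
monitor.start_frame()
first_frame = [monitor.access() for _ in range(5)]
suspended_in_first_frame = monitor.suspended

monitor.start_frame()             # next process frame: limit is replenished
second_frame_ok = monitor.access()
print(first_frame, suspended_in_first_frame, second_frame_ok)
```

The model deliberately ignores interrupt latency and outstanding write-backs, both of which are discussed in the text as effects that a real implementation has to account for.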
C. Processor Core Configuration
The caches, Translation Lookaside Buffers (TLBs), Branch Target Buffers (BTBs) and Branch History Tables (BHTs) of the e500mc processor cores use Pseudo Least Recently Used (PLRU) and First In First Out (FIFO) replacement schemes. This makes the e500mc a non-timing-compositional architecture according to [36]. In order to still be able to assume timing compositionality, we use the cores in a very deterministic configuration, avoiding any known domino effects. Hence, branch prediction is switched off and all TLBs are preloaded to avoid any miss. To increase the predictability of core-local caches, data caches can be used in write-through mode, while 2nd-level caches are exclusively used as scratchpad memories. Further, partial cache locking can be used to obtain a Least Recently Used (LRU) replacement policy. However, as explained in Section V-A, the current aiT architecture model cannot analyse more than one cache level. In order to have a comparable setup between static analysis and actual measurements, and since the DLFB and ILFB cannot be turned off, we disabled the L1, L2 and L3 caches. Note that this is not a restriction of the approach, but rather a direct consequence of the prototypical architecture model. As a natural effect, the absolute timing and resource bounds are significantly increased.
Avoiding the mentioned sources of domino effects is still no formal proof of timing compositionality, but it should be noted that it is not easily possible to formally prove the composability of a processor. Although a non-compositional evaluation platform influences the safety of the bounds, it limits the applicability of the architecture rather than the validity of the approach. While a fully timing-compositional architecture, such as the ARM7, could have been chosen, we target the P4080 since it is the most widely considered evaluation platform in industry. Disregarding this fact would lower the impact of the evaluation.
VI. EVALUATION
In this section we evaluate the described approach with respect to the reduction of the isWCET. This also requires the evaluation of the core-local analysis phases for timing and resource usage. We conclude the section by demonstrating the runtime effect of the partitioning, by integrating all of the selected benchmarks. We use a set of benchmarks from the EEMBC Autobench benchmark suite [46] . After careful analysis of real avionics applications the selected benchmarks have been identified to adequately represent different applications' behaviour.
A. Worst-case Analysis
As in [27], we use the maximum contention approach as the baseline for comparison, since no data are available for other multi-core WCET approaches. With the maximum contention approach, each memory access is charged the maximum delay (δ_8 = 1007 for Π_|| = 8 cores). This is a valid approach as long as no assumptions about applications running in parallel can be made. The timing bound is computed according to Equation 4.
Table I shows the memory access latencies for read and write operations with an increasing number of interfering cores. They have been acquired using the approach described in [28], while mapping all eight cores of the P4080 to the same memory controller. Since the path from the processor cores to main memory is very complex, it is not possible to obtain real worst-case latencies without in-depth knowledge of the chip design. Hence, the latencies in Table I shall only be understood as indicators. For the evaluation, we always used the higher latency, as marked in Table I, since the exact distribution of read and write accesses is not known.

We compare the analysed timing and resource usage bounds (τ_s(π_x), κ_{π_i}^{φ_k}) with the respective observed maximum values in Table II. The bound deviation is relative to the observed values. The overall overestimation is reasonable for a prototype model of such a complex core. The only exception is aifftr. There, most of the time is spent in a library routine computing the remainder of 2x π, which contains many data-dependent loops. The deviations for execution time and resource bounds are comparable.

Table III compares the computed multi-core bounds of the maximum contention approach (τ_max) and our interference-sensitive approach (τ_is). The isWCETs are computed based on the analysed bounds in Table II, scheduling all benchmarks in parallel. The reduction is relative to the maximum contention approach. The isWCET bounds are reduced by up to 53%, which underlines the need for our approach, even if the absolute timings are relatively large. As a consequence of the isWCET analysis approach, the process with the lowest resource bound will always be assumed to suffer maximum contention. As can be seen in Table II, a2time has the lowest WCRA. In consequence, τ_max and τ_is are equal for this process.
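The maximum contention baseline and the relative reduction reported in Table III can be sketched as follows. This is an illustration under assumptions: we assume that the baseline adds the maximum delay δ_8 once per counted shared resource access to the core-local bound, the function names are hypothetical, and the input values are made up rather than taken from Table II or Table III.

```python
DELTA_8 = 1007  # maximum access delay for eight cores, as stated in the text


def max_contention_bound(core_local_wcet, wcra, max_delay=DELTA_8):
    """Assumed baseline: every shared resource access suffers the maximum delay."""
    return core_local_wcet + wcra * max_delay


def relative_reduction(tau_max, tau_is):
    """Reduction of the interference-sensitive bound relative to the baseline."""
    return (tau_max - tau_is) / tau_max


# Made-up inputs, in the same time unit as the access delay.
tau_max = max_contention_bound(core_local_wcet=10_000, wcra=50)
print(tau_max)

# A process whose two bounds coincide (such as the one with the lowest WCRA)
# shows no reduction.
print(relative_reduction(tau_max, tau_max))
```

The second call mirrors the observation above that the process with the lowest resource bound gains nothing over the maximum contention approach.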
B. Functional Behaviour
To demonstrate the functional behaviour of monitoring and suspension we use three scenarios:
(1) isolated: core 0 executing the reference benchmark, and cores 1 to 7 being idle,
(2) interfered: as (1) with interfering benchmarks on cores 1 to 7 without limiting their memory accesses,
(3) partitioned: as (2) and limitation of cores 1 to 7, according to the analysed WCRAs¹.
The reference benchmark, bitmnp, is executed on core 0. The benchmarks on cores 1 to 7 are used to introduce interference for scenarios (2) and (3), while our partitioning approach is only enabled for scenario (3). All benchmarks are parametrised with the analysed WCRAs from Table II. To intensify the effect of interference, the benchmarks on cores 1 to 7 are executed repeatedly until the reference benchmark has finished. This is done to increase the probability that the benchmarks reach their resource limit, which otherwise is very unlikely or even impossible because the limits are upper bounds derived by static analysis. The results are shown in Figures 3, 4 and 5, respectively. Each figure shows a single diagram per core, plotting the resource usage (κ) over time. Figure 3 only shows core 0, since the other cores are idle. Triangles indicate a suspension due to a limit violation. For the evaluation we use eight benchmarks. Since the P4080 has eight cores, we only consider a single process frame. However, since our analysis considers each process frame separately, this can easily be extended to multiple process frames, cf. A comparison of the results illustrates the impact of interference and partitioning. While bitmnp in isolation finishes after 1071 ms, its execution time is increased to 2876 ms by unlimited interfering cores. Enabling the partitioning causes suspensions on cores 1 to 4, reducing the execution time of bitmnp to 2230 ms. The measured resource usage for cores 1, 2, 3 and 4 further shows a deviation of 8 to 12 accesses compared to the configured limits. This stems from the overhead of the suspension routine as described in Section III. The static analysis of the corresponding routine returns a timing bound of 25.0 μs and a WCRA of 567 accesses.

¹ The validation of the functional behaviour of the monitoring and suspension mechanism was done before the refined analysis results were ready. The limits used are thus more conservative than those in Table II.
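The measured execution times of bitmnp can be put into relation with a few lines of arithmetic. The sketch below only uses the three values stated above; the variable names are our own.

```python
# Measured execution times of bitmnp in milliseconds, as stated in the text.
isolated_ms = 1071
interfered_ms = 2876
partitioned_ms = 2230

# Slowdown relative to the isolated run.
slowdown_interfered = interfered_ms / isolated_ms
slowdown_partitioned = partitioned_ms / isolated_ms

# Share of the interfered execution time that is recovered by partitioning.
recovered = (interfered_ms - partitioned_ms) / interfered_ms

print(f"interfered:  x{slowdown_interfered:.2f}")
print(f"partitioned: x{slowdown_partitioned:.2f}")
print(f"recovered:   {recovered:.1%}")
```

Unlimited interference thus slows the reference benchmark down by a factor of roughly 2.7, while the partitioned run stays at roughly 2.1 times the isolated execution time.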
VII. DISCUSSION
The evaluation has shown the validity and applicability of the presented approach. The bounds obtained from the core-local timing and resource analysis are high, but reasonable given the prototype status of the architecture model. The overestimation for both timing and resource analysis is in the same order of magnitude. Comparing the different benchmarks, the results for aifftr are significantly higher than for the others. This can be explained by the code structure, which contains triangular loops that are highly dependent on input data, causing the analysis to consider many paths that are infeasible in practice.
The evaluation of the resulting isWCETs shows a significant reduction of up to 53% over the maximum contention approach. Comparing the reduction for different benchmarks, it can be seen that the resulting effect clearly depends on the individual benchmark characteristics. Precisely, the larger the difference of WCRAs, the higher the WCET improvement. This can be used to optimise system scheduling, for instance, by scheduling applications such that one process with a lower and one with a higher WCRA are running in parallel. Applying this effect to legacy applications, one can expect applications with many shared resource accesses to suffer a much higher WCET increase, compared to applications with a relatively small number of shared resource accesses.
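The scheduling idea mentioned above, running a process with a low WCRA in parallel with one with a high WCRA, can be sketched with a simple pairing heuristic. This is a hypothetical illustration for two cores, not a scheduling solution from this paper: the function `pair_by_wcra`, the process names and the WCRA values are all made up.

```python
def pair_by_wcra(wcra_by_process):
    """Pair the process with the lowest WCRA with the one with the highest, and so on.

    Hypothetical heuristic for two cores; with an odd number of processes
    the median process is left unpaired.
    """
    ordered = sorted(wcra_by_process, key=wcra_by_process.get)
    pairs = []
    while len(ordered) >= 2:
        low = ordered.pop(0)    # smallest remaining WCRA
        high = ordered.pop()    # largest remaining WCRA
        pairs.append((low, high))
    return pairs, ordered       # ordered holds at most one unpaired process


# Made-up WCRA values per process.
example = {"p_a": 10, "p_b": 500, "p_c": 40, "p_d": 300}
pairs, unpaired = pair_by_wcra(example)
print(pairs, unpaired)
```

Whether such a pairing is actually beneficial depends on the isWCET computation for the resulting process frames, which would have to be re-evaluated for every candidate schedule.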
The evaluation of the functional behaviour shows the runtime effect as described in Section III, the correctness of the implementation and the predicted overhead for suspension. We have derived a bound for the suspension overhead by analysing the respective routine. In addition, the interrupt latency and outstanding instructions have to be accounted for. The interrupt latency for the e500mc cores is limited to ≤ 10 cycles, unless a guarded load or a cache-inhibited stwcx. instruction is in the last completion queue entry [38]. For the latter cases the latency is determined by the memory locations targeted by the operation. In general, the outstanding operations depend on the application, but in terms of memory accesses the worst case occurs if every pending instruction is a load or store. The use of performance counters does not impose any runtime overhead, according to Freescale. Hence an active process does not suffer temporal delays when monitoring is enabled.
Besides the multi-core WCET improvements, the described partitioning further provides a safety net [30], which isolates applications in cases of misbehaviour and faults. The temporal fault containment guarantee ensured in this way is a new property for multi-core processors.
In summary, the architecture model is sufficient to show the validity of the presented approach. When judging the results, it should be kept in mind that the development of an accurate architecture model is extremely time-consuming. That is why we used a prototype that does not exactly represent the real hardware. However, this does not limit the presented approach and the general statement of the paper, since it only influences the absolute results, but not the order of magnitude of the WCET reduction. Moreover, it is possible to replace the static analysis with a different technique, for instance a measurement-based approach. The applicability only depends on the target application and its certification requirements. In particular, the resulting assurance for an application has to be sufficient for its criticality.
VIII. SUMMARY AND FUTURE WORK
In this paper we addressed the problem of computing WCET bounds for multi-core processors to enforce temporal partitioning. Intuitively, we split the timing analysis into core-local analyses and the computation of the worst-case interference delay caused by the use of shared resources. We further extend core-local analysis with additional phases to account for the usage of shared resources. The proposed approach uses runtime resource capacity enforcement to bound the interference between processes executed in parallel. Using this approach we are able to analyse multiple applications independently, enabling the seamless integration of legacy applications and the use of iD&C. We further significantly reduced the complexity by abstracting processes by the sheer number of resource requests instead of their precise occurrence in time.
The validity of the approach has been shown by integrating application blocks representing different behaviours of real avionics applications. The WCET analysis extensions have been implemented in aiT, AbsInt's framework for static timing analysis. The runtime mechanisms were added to PikeOS, SYSGO's RTOS. Both are commercial products that have been successfully applied in certified projects. We evaluated the approach on Freescale's P4080 MPSoC, which is commonly used for evaluations in the safety-critical application domain. The results show a reduction of the multi-core WCET bound of up to 53%. The main benefits compared to other approaches are the truly parallel usage of shared resources and the avoidance of mutual analysis of applications. This enables the utilisation of multi-core benefits while still reducing analysis complexity. Mixed-criticality workloads can even be used to adjust the WCET bounds, optimising overall system execution time. In summary, we have shown that it is possible to avoid the assumption of maximum contention and still gain significant timing bound reductions, even if no complex analysis or hardware modifications are applied. As such, the proposed approach is one contribution towards the analysis of stateless shared resources in multi-core processors. The problem of stateful shared resources, e.g. shared caches, has not been addressed, since it is considered to be sufficiently solved by partitioning techniques, with hardware support available in current MPSoCs.
Even though the absolute multi-core bounds are high, this is no limitation of the approach, but rather a consequence of the prototypical architecture model. For use in real systems, the architecture model needs to be refined to estimate the WCET more precisely. It is also possible to replace static analysis by alternative approaches, e.g. hybrid measurements. Doing so will not require any changes to the presented approach. Assumptions and configurations made to increase the analysability of the architecture are also no limitation of the approach, but are necessary to cope with the complexity of the processor. In particular, the disabled core-local caches have a serious impact on the overall performance. This configuration has been applied in order to have a comparable basis between static analysis and measurements. There is no further need to disable the caches once the timing model of the e500mc core handles all cache levels. Hence this particular configuration is not required by the approach but is a consequence of the current implementation status. With respect to composability, we argued the applicability of both the core-local analyses and the augmentation by the interference delay, showing that our assumptions are motivated by the problem domain and thereby not more restrictive than those of related approaches. We also discussed that missing composability limits the applicability of an architecture rather than the validity of the approach. In the target domain, the criticality of the application also needs to be considered, as it influences the composability requirements.
Future work will address the problem of overestimated timing bounds due to variations in resource access delays, targeting increased average system utilisation. Furthermore, we will investigate how to control the effects of DMA-capable I/O devices, extending the presented approach. Based on the approach, scheduling solutions will be developed to select applications that should be scheduled in parallel to optimise system resource utilisation.
IX. ACKNOWLEDGEMENT
The ARAMiS project funds this work (German Federal Ministry for Education and Research, funding ID 01IS11035).
