The pressing market demand for competitive performance/cost ratios compels the Critical Real-Time Embedded Systems industry to employ feature-rich hardware. The ensuing rise in hardware complexity, however, makes worst-case execution time (WCET) analysis of software programs, which is often required, especially for programs at the highest levels of integrity, an even harder challenge. State-of-the-art WCET analysis techniques are hampered by the soaring cost and complexity of obtaining accurate knowledge of the internal operation of advanced processors, and by the difficulty of relating data obtained from measurement observations to reliable worst-case behaviour. This conundrum calls for novel solutions with low intrusiveness on development practice.
Introduction
The market for Critical Real-Time Embedded Systems (CRTES), which includes the automotive and avionics sectors, is experiencing unprecedented growth [1]. While crucial to keeping competitive advantage, the inclusion of increasingly sophisticated value-added functions, such as Advanced Driver Assistance Systems, causes CRTES makers to continually seek higher guaranteed computational performance while striving to contain cost and power budget. This goal can only realistically be achieved by adding complex and powerful hardware acceleration features such as caches or multicore designs.
techniques to handle complex hardware, but hybrid approaches are subject to similar limitations.
The availability of more powerful hardware and the quest for more functional value per unit of product also prompt the CRTES industry to consider adopting mixed-criticality design solutions for their systems. From the timing perspective, which is the focus of this paper, the challenge with mixed-criticality systems lies in the need for solutions that ensure strict temporal isolation between programs assigned to different criticality levels, so that their behaviour can be deemed composable in the time dimension (footnote 3). In the absence of effective means to abate the pessimism of WCET analysis, however, mixed-criticality solutions that achieve time isolation by fencing budget allowances risk incurring massive over-provisioning, which defeats the purpose of combining systems together.
Probabilistic techniques may greatly help on all of those fronts. In particular, with Measurement-Based Probabilistic Timing Analysis (MBPTA) methods [3, 4, 5, 6], the execution time of the application can be accurately modelled, at some level of execution granularity, by a probability distribution.
MBPTA seeks to determine WCET estimates for arbitrarily low probabilities of exceedance, termed probabilistic WCET or pWCET. As a consequence, there is some residual risk, in the form of an exceedance probability, that a pWCET bound is exceeded. However, this residual risk is upper bounded by a probability that can be set low enough to suit the needs of system design in the application domain. For example, the residual risk can stay in the region of $10^{-9}$ per hour of operation, well below the acceptable probability of failure in certified systems.
Under MBPTA, at a given granularity of execution, the response time of every individual execution component at that level (e.g., an instruction) is assigned a distinct probability of occurrence. This trait, which should not be confused with the probability of that component being executed in a run of the program, is described by a probabilistic Execution Time Profile (ETP), expressed by the pair <timing vector; probability vector>. The timing vector in the ETP enumerates all its possible response times. For each response time in the timing vector, the probability vector lists the probability of occurrence of that response time in an instance of execution. Hence, for execution component
Footnote 3: Time composability holds when the timing behaviour of an individual software component does not change under composition when the system is integrated, so that the timing analysis performed in isolation remains valid at system integration.
The processor architecture is instrumental in ensuring that individual instructions have an associated ETP. As this guarantee in turn is a crucial enabler to a sound and effective application of MBPTA, the processor architecture is the level of execution granularity on which we focus in this work.
Contribution. Within the context of the FP7 PROXIMA project [7], we describe the architectural features that a processor should possess to be amenable by construction to the use of MBPTA. We term this quality MBPTA compliance. In presenting our case, we offer insight into the costs that may be incurred in the actual implementation of an MBPTA-compliant processor. To that end, we categorise processor resources according to their timing behaviour and detail how they should be designed for use in an MBPTA-compliant processor. Without loss of generality, we consider the inner operation of the processor to employ a number of passive resources (e.g., caches, buffers, buses). We assume each processor instruction to use some of those resources in a given order, whether in sequence or in parallel. We design processor resources so that each of them can be assigned a given ETP. To achieve this for all resources, we use time randomisation in very few of them. Resources that are not time randomised must be assigned a local upper bound on their response time that can be safely composed. We assume a baseline architecture free of timing anomalies.
The remainder of this paper is organised as follows. Section 2 introduces PROXIMA and contextualises this work. Section 3 presents the requirements that MBPTA places on processor hardware. Section 4 classifies hardware resources in a taxonomy specifically related to MBPTA. Section 5 presents software-only solutions that could be applied to make commercial-off-the-shelf processor hardware fit for MBPTA. Section 6 presents a demonstrative implementation of a processor architecture purposely designed for compliance with MBPTA. Section 7 surveys related work. Section 8 draws some conclusions and outlines the future of this line of work.
Context within PROXIMA
This work has been performed within the scope of PROXIMA [7], an Integrated Project (IP) of the Seventh Framework Programme for research and technological development (FP7). PROXIMA objectives include providing a complete toolchain enabling low-cost timing verification for systems based on multicore and manycore processors implementing critical real-time functionalities. In particular, the PROXIMA toolchain includes the following main elements:
• Hardware and software platforms amenable to MBPTA. One of the key elements of the toolchain is a hardware platform providing the timing properties required by MBPTA to facilitate obtaining reliable and tight pWCET estimates. This hardware platform has been implemented in an FPGA prototype used in the Space domain. Alternative software-only solutions have been developed to enable MBPTA on top of commercial off-the-shelf (COTS) processors, including a non-MBPTA-compliant version of the Space prototype, an Infineon AURIX T277 processor and a Freescale P4080 processor. MBPTA compliance in future manycore processors has also been investigated by means of architectural simulators.
• MBPTA-compliant real-time operating systems (RTOS). The RTOS needs to be enhanced with features so that its contribution to the execution time of the tasks analysed is made constant, and hence time-composable, and so that its impact on the hardware and software state is neutral with respect to the properties needed to attain MBPTA compliance, thus being transparent to the timing analysis process. RTOS features have been implemented as part of PikeOS, RTEMS-SMP, ERIKA and some research-oriented RTOS.
• Timing analysis tools. Appropriate methods for the estimation of pWCET are required to account for the timing behaviour of the underlying hardware/software platform, being compatible with the tracing methods in
This paper reviews MBPTA-compliant hardware behaviour to deliver the timing properties needed to estimate reliable and tight pWCET. We also show how some of these goals can be achieved in the absence of MBPTA-compliant hardware.
Taxonomy of Timing Analysis Techniques
We differentiate three main timing analysis types, each of which has a deterministic and probabilistic variant.
• Measurement-based deterministic timing analysis (MBDTA) techniques have been used in industry for many years. They are usually coupled with detailed analysis of the software structure that provides confidence in exercising those worst-case paths or scenarios at the application level that can arise during system operation. To make safety allowances for the unknown (which the cognizant associates with the difficulty of determining the hardware worst case), an engineering margin is often added to the computed bound. The intent of the margin is to strike a sound balance between pessimistic overkill and risk of underestimation. Determining a reliable and tight engineering margin is extremely difficult, if at all possible, especially when the system may exhibit discontinuous changes in timing due to unanticipated timing behaviour. The confidence placed in the WCET estimate determined with MBDTA is, therefore, fully dependent on the ability of the end user to identify what behaviour needs to be triggered in the hardware and software to observe the WCET that can occur during operation (or execution times close to it), and to produce program inputs that trigger that behaviour. The increasing complexity of the hardware (i.e., the use of cache hierarchies and multicores) is also a threat to the scalability of this approach [8].
• Static deterministic timing analysis (SDTA) techniques rely on the construction of a cycle-accurate model of the processor and an abstract representation of the application code. SDTA searches the resulting state space for the worst case, with constraint-based integer linear programming. Such an analysis cannot carry forward all the possible states of execution. Hence, conservative choices are made during the process, thus trading a reduction in the state space for increased pessimism [9, 10, 11]. SDTA requires extensive information about the timing specification of the processor hardware and flow facts for the application. As the prediction must necessarily err on the side of pessimism, any lack of information about the timing behaviour of the object of analysis (e.g., the address of a memory access needed to determine if execution hits or misses in cache) or about processor timing behaviour degrades the tightness of the WCET estimate. Further, the result of the analysis is only as reliable as the input provided to it [8]. The rise in complexity of next-generation CRTES greatly exacerbates this problem: the volume of detailed knowledge needed to construct a sufficiently accurate execution model, as well as the time, effort, cost and complexity entailed in acquiring that information, challenge the adoption of SDTA for CRTES applications.
• as cache interactions among programs and inter-task interference in the use of hardware shared resources in multicores [8] .
At the present state of the art, probabilistic timing analysis (PTA) can be applied in either a static (SPTA) [5] or measurement-based (MBPTA) [4] fashion; we refer the interested reader to those works for details on PTA fundamentals.
In this work we focus on MBPTA only since it is more mature for industrial use than SPTA [8] .
MBPTA generates a probability distribution that describes the maximum probability with which an instance of the program can exceed its assigned bud- 
Requirements
MBPTA considers events resulting from the observation of end-to-end measurement runs of the program, thus at coarser granularity than processor instructions. MBPTA builds upon Extreme Value Theory (EVT) [13, 4] to estimate pWCET. Yet MBPTA and EVT are not the same thing. We clarify this by differentiating the requirements that MBPTA imposes due to its use of EVT from those it imposes to ensure representativeness.
• Extreme Value Theory: The use of EVT requires its input, i.e. the observed execution times in our case, to be describable by independent and identically distributed (i.i.d.) random variables. Two random variables are said to be independent if they describe two events such that the occurrence of one event has no impact on the occurrence of the other. Two random variables are said to be identically distributed if they have the same probability distribution. Specific statistical tests can be used to check these properties on a set of execution times, see Section 6.
It is worth noting that some authors have shown that independence across observations is not strictly needed as long as maxima are independent or the dependence across maxima is weak [14, 15]. However, in the rest of this paper we build upon independent data since it is a by-product of the MBPTA-compliant platforms presented in this work.
• program paths, and synthetically construct the worst-case path from them. The details of the latter method are presented in [19] .
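The i.i.d. requirement stated above for EVT can be screened for on a set of measured execution times. The sketch below is only illustrative: the text does not name the tests it uses, so the two checks chosen here (a lag-1 autocorrelation coefficient as an independence indicator, and a two-sample Kolmogorov-Smirnov statistic computed between two halves of the sample as an identical-distribution indicator) are assumptions, as are the function names. Turning either statistic into a pass/fail decision requires a significance threshold that is not shown.

```python
from bisect import bisect_right


def lag1_autocorrelation(samples):
    """Lag-1 sample autocorrelation; values near 0 are consistent with independence."""
    n = len(samples)
    mean = sum(samples) / n
    variance_sum = sum((x - mean) ** 2 for x in samples)
    if variance_sum == 0:
        return 0.0  # constant series: no measurable serial correlation
    covariance_sum = sum(
        (samples[i] - mean) * (samples[i + 1] - mean) for i in range(n - 1)
    )
    return covariance_sum / variance_sum


def ks_two_sample_statistic(a, b):
    """Largest gap between the empirical CDFs of a and b (0 means identical CDFs)."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    largest_gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def iid_indicators(execution_times):
    """Return (autocorrelation, KS statistic between first and second half)."""
    half = len(execution_times) // 2
    return (
        lag1_autocorrelation(execution_times),
        ks_two_sample_statistic(execution_times[:half], execution_times[half:]),
    )
```

A series with a strong trend, such as steadily growing execution times, yields an autocorrelation close to 1 and a large KS statistic, which would flag the sample as unsuitable input for EVT.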
Probabilistically Modelling the Timing Behaviour of Processor Resources
When the latencies with which each resource responds have an attached probability of occurrence, the execution time of the instructions using those resources can also be captured probabilistically. In this respect, the probabilistic execution time of an instruction is a function of the ETPs of the resources it uses and of how they are arranged, in series or in parallel. Ultimately, this enables capturing the execution of the whole program, which is composed of instructions, in a probabilistic manner.
For a processor to be MBPTA compliant, the pWCET estimates obtained for the programs that run on it must remain valid for the whole operational life of the system. Hence, they must remain valid for every run of the programs of interest under all execution conditions (or a desired subset of those that can arise during operation). To understand how the timing behaviour of processor resources needs to be modelled for those guarantees to be obtained, we first show how the MBPTA process works.
Probabilistic Timing analysis process
Systems amenable to MBPTA have two distinct modes of use: one for analysis, and another for operation.
• The analysis mode is used to obtain pWCET estimates that remain valid during system operation. To this end, the timing behaviour of the system in that mode must upper bound that of the system after deployment, as used in real scenarios. This guarantees that circumstances that can occur during the lifetime of the system cannot alter its timing behaviour in a way that has not already been upper bounded at analysis time.
• The operation mode is used during actual operation. In this mode, timing conditions are unrestricted (or restricted to a specific subset) and can thus lead to lower execution times than those experienced in the analysis mode.
By intent, the analysis mode requires that the timing behaviour of the system as a whole and of its individual components in isolation (seen at the granularity of execution of interest) either upper bounds or matches that which will occur in operation mode. For MBPTA-compliant processor architectures, this condition can be achieved in either a deterministic or a probabilistic manner. Accordingly,
represents execution time, and the y-axis the probability for any particular latency to occur (this is obviously 1 in the case of deterministic resources). In
Taxonomy of hardware resources for canonical MBPTA compliance
We term jitterless resources the processor resources that have a fixed latency, independent of the input request and of the past history of service. Several hardware resources in current processor architectures are jitterless such as, for in-
Conversely, the ETP of a time-randomised jittery resource $r_j$ is:
frequency for each possible latency of that resource, but not necessarily a true 50% probability.
For the purposes of MBPTA, the timing behaviour of jitterless and jittery (either upper-bounded or time-randomised) resources can all be described probabilistically by ETPs.
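As an illustration of how the three kinds of resource fit one representation, the following sketch models an ETP as the pair <timing vector; probability vector> and builds one for each kind. The class name, helper names and latency values are hypothetical; only the pair structure and the fact that a fixed or upper-bounded latency carries probability 1 come from the text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ETP:
    """Execution Time Profile: <timing vector; probability vector>."""
    latencies: Tuple[int, ...]
    probabilities: Tuple[float, ...]

    def __post_init__(self):
        if len(self.latencies) != len(self.probabilities):
            raise ValueError("vectors must have the same length")
        if any(p < 0 for p in self.probabilities):
            raise ValueError("probabilities must be non-negative")
        if abs(sum(self.probabilities) - 1.0) > 1e-9:
            raise ValueError("probabilities must sum to 1")

    def exceedance(self, t):
        """Probability that the latency is strictly greater than t."""
        return sum(p for lat, p in zip(self.latencies, self.probabilities) if lat > t)


def jitterless(latency):
    # Fixed latency: a single outcome with probability 1.
    return ETP((latency,), (1.0,))


def upper_bounded(worst_case_latency):
    # Jittery resource forced to its worst case: also a single outcome.
    return ETP((worst_case_latency,), (1.0,))


def time_randomised(latencies, probabilities):
    # Jittery resource whose latency is a genuinely random event.
    return ETP(tuple(latencies), tuple(probabilities))
```

For example, `time_randomised((1, 10), (0.9, 0.1))` could stand for a randomised cache with a hypothetical 1-cycle hit and 10-cycle miss, whereas `upper_bounded(10)` would charge the miss latency on every access.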
ETP of several execution components
A composite ETP can easily be determined for every individual program component ($ETP_{pc}$), e.g. a dynamic instruction, that uses processor resources, each of which has an associated ETP describing its latency. That is, $ETP_{pc} = f(ETP_1, ETP_2, \ldots, ETP_n)$, where $ETP_i$ is the probabilistic execution time of resource $r_i$.
• Sequential composition: the ETP, $f_s(ETP_1, ETP_2, \ldots, ETP_n)$, resulting from sequential composition is one where latencies and probabilities are
Let us assume two ETPs, $ETP_1 = \langle (1, 2), (0.5, 0.5) \rangle$ and $ETP_2 = \langle (5, 10), (0.5, 0.5) \rangle$. Further assume that whenever $ETP_1$ takes latency 1, then $ETP_2 = \langle (5, 10), (0.8, 0.2) \rangle$, and whenever $ETP_1$ takes latency 2, then the second ETP is $ETP_2 = \langle (5, 10), (0.2, 0.8) \rangle$. In this case, $ETP_{1+2} = f_s(ETP_1, ETP_2) = \langle (6, 7, 11, 12), (0.4, 0.1, 0.1, 0.4) \rangle$. Still, $ETP_2$ takes, for instance, latency 5 with probability 0.5 because $P(ETP_1 = 1) \times P(ETP_2 = 5 \mid ETP_1 = 1) + P(ETP_1 = 2) \times P(ETP_2 = 5 \mid ETP_1 = 2) = 0.5 \times 0.8 + 0.5 \times 0.2 = 0.5$.
The key trait here is that the dependence that $ETP_2$ has on $ETP_1$ can be modelled probabilistically. As a result, the executions carried out during analysis capture the behaviour of this dependence and hence cause it to be covered by the pWCET estimate derived to bound the execution time during operation.
This is the typical case for the ETP of cache accesses, since the ETP of a given cache access depends on what the previous accesses did. For instance, if a first access hits, it does not evict any data and the second access may have a given hit probability. However, if the first access misses, it will evict some data, likely decreasing the hit probability of the second access. Still, the second access has an ETP, since the dependence between the first and the second access is probabilistic given that the first access will hit or miss with a true probability when using time-randomised caches.
• Parallel composition: processor resources may also be arranged in paral-
We call causal dependence any dependence between two instructions in a given precedence order such that the execution of the earlier one affects the timing behaviour of the later one. Obviously, the execution time of the earlier one determines when the later one can start executing, but our notion of causal dependence means that the latency of a given instruction affects not only the time at which the later one starts but also its duration.
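The sequential composition with dependence described earlier, where the ETP of the second resource depends on the latency taken by the first, can be reproduced with a few lines of code. This is a minimal sketch: the dictionary encoding (latency mapped to probability) and the function names are assumptions made for illustration, while the numeric values are those of the $ETP_1$ and $ETP_2$ example.

```python
from collections import defaultdict


def compose_sequential(first, second_given_first):
    """Compose two ETPs in sequence.

    first: dict mapping latency -> probability.
    second_given_first: dict mapping each latency of `first` to the
    conditional ETP (latency -> probability) of the second resource.
    """
    composed = defaultdict(float)
    for lat1, p1 in first.items():
        for lat2, p2 in second_given_first[lat1].items():
            composed[lat1 + lat2] += p1 * p2  # latencies add, probabilities multiply
    return dict(sorted(composed.items()))


def marginal_of_second(first, second_given_first):
    """Unconditional ETP of the second resource (law of total probability)."""
    marginal = defaultdict(float)
    for lat1, p1 in first.items():
        for lat2, p2 in second_given_first[lat1].items():
            marginal[lat2] += p1 * p2
    return dict(sorted(marginal.items()))


etp_1 = {1: 0.5, 2: 0.5}
etp_2_given_1 = {
    1: {5: 0.8, 10: 0.2},  # ETP_2 when ETP_1 took latency 1
    2: {5: 0.2, 10: 0.8},  # ETP_2 when ETP_1 took latency 2
}

etp_1_plus_2 = compose_sequential(etp_1, etp_2_given_1)
etp_2_marginal = marginal_of_second(etp_1, etp_2_given_1)
```

Here `etp_1_plus_2` holds latencies 6, 7, 11 and 12 with probabilities 0.4, 0.1, 0.1 and 0.4, and `etp_2_marginal` assigns probability 0.5 to each of latencies 5 and 10, matching the worked example.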
We differentiate two types of causal dependences among a source (preced-
• Probabilistic dependence: The execution of the source instruction has a probabilistic effect on the ETP of the target instruction. This is the case of memory accesses to a time-randomised cache. A probabilistic causal dependence causes that dynamic instruction to suffer a transformation in its ETP. However, given that the causal effect on the target instruction is probabilistic, this is equivalent to applying a transfer function $transf()$ that takes an ETP as input and provides another ETP as outcome: $transf(ETP^{isol}_{target}) = ETP^{bb}_{target}$. Again, the key trait is that the target (dynamic) instruction is always subject to the same $ETP^{bb}_{target}$, thus enabling MBPTA to capture its timing effects at analysis time just as they will occur during operation.
Overall, on a PTA-compliant platform, any hardware and software state with bearing on the execution time of any dynamic instruction of the program is reached with a given probability. Therefore, one can build the ETP of every single program path that can be traversed by an observable execution by collecting the execution time of each final state of the system and its corresponding probability of occurrence. Consequently, the execution time of the program as a whole (seen as the traversal of a given path) has an ETP and is, hence, a random variable with i.i.d. properties.
More complex single-core processor architectures
We have shown that jittery deterministic resources need to be redesigned to make their timing behaviour amenable to MBPTA by construction. This can be done by either randomising their timing behaviour or forcing them to operate at their worst-case latency. Resources with probabilistic latency perfectly fit the MBPTA principles. However, jittery processor resources exist that do not easily fit in the taxonomy we used in Section 4.2. This is the case of resource buffers, also known as first-in first-out (FIFO) queues or simply buffers.
A buffer resource may stall if it gets full, which increases the latency of the requests that use it. Stalls across pipeline stages may for example occur owing to contention for buffer space; those stalls would be real enough to fear, but difficult to predict causally.
The main characteristic of buffer resources, however, is that they are not sources of jitter but rather jitter propagators [21]. The intuition here is that if all jitter that occurs in a processor is probabilistic, that is, it is solely due to time-randomised resources, any combination of random events has a given probability of occurrence. Now, as every single combination of events causes the program to incur a distinct execution time, each execution time has a distinct probability of occurrence. For each combination of random events, resource buffers may get full and consequently increase the execution time of the program. However,
In general, all hardware resources can be made MBPTA-compliant as long as they either do not introduce jitter on their own (hence they are fixed-latency or else just jitter propagators), their jitter can be upper-bounded, or else it can be randomised.
Multicore processor architectures
In single-core architectures, the execution time of a software program is influenced by (1) the initial processor state when the program starts executing, which in turn is affected by previous execution, (2) the RTOS interference that it may suffer during execution, (3) the input data that influence control flow or data-dependent jitter in jittery processor resources, and (4) the randomisation occurring in processor resources.
The effect of initial conditions, (1) above, can be taken into account by flushing the state of all stateful resources (e.g., caches) prior to the execution of the program. For the RTOS, state-of-the-art solutions exist to make its interference amenable to probabilistic analysis [11] .
The effect of input data on the control flow of the program is controlled by state-of-the-art techniques that work in unison with MBPTA [19]. For instance, the authors of [19] show how to pad execution time measurements at basic block granularity to discount the benefit obtained by executing specific paths when that benefit would not be obtained through other paths. The effect of input data on the latency of processor instructions using resources with data-dependent jitter, as well as the jitter introduced by the randomised hardware resources, are controlled with standard PTA techniques [5].
In multicore architectures, in addition to all the sources of execution time variability that appear in a single-core architecture, a further one arises: inter-task interference.
In general, in single-core architectures, given two instructions $i_x$ and $i_y$ of the same program, where the subscripts denote the order in which the instructions are executed in the processor, $i_y$ may affect the execution time of $i_x$ only if $y < x$, meaning that $i_y$ executes prior to $i_x$.
In a multicore, when several programs run in parallel, the execution time of
Interestingly, the MBPTA-compliant design principles already outlined for single-core processors extend quite well to the design of multicore architectures.
The resources for which this approach is most advantageous are those shared outside the core, further up the processor hardware hierarchy, where they may cause massive inter-task interference. Next we review them in detail.
Shared bus. The authors of [23] show that the arbitration latency of a shared bus can either be upper bounded at analysis time or randomised, so that the timing behaviour observed at analysis matches or upper-bounds that which may emerge during operation. In fact, upper bounding the bus arbitration latency has been shown to be viable also for time-deterministic systems [24]. This approach ensures that the latencies and probabilities of the ETP derived for this resource already account for worst-case interaction in this shared resource. For instance, if latency is upper-bounded, the ETP accounting for arbitration delay will have the form $ETP_{bus} = \langle (latbus_{max}), (1.0) \rangle$, where $latbus_{max}$ stands for the maximum bus arbitration latency. Alternatively, if random (lottery) or random permutations arbitration is used, an ETP can also be derived, as already proven in [23].
Shared memory controller. The same approach used for buses can be applied to the arbitration in the memory controller. Thus, the latency of a shared memory controller can be upper bounded, which is fine for MBPTA compliance. Again, that measure is in line with findings for time-deterministic systems [25]. Thus, if latency is upper-bounded, the ETP for the memory controller will have the form $ETP_{memctrl} = \langle (latmemctrl_{max}), (1.0) \rangle$, where $latmemctrl_{max}$ stands for the maximum memory controller arbitration latency.
Note that random (lottery) or random permutations arbitration can alternatively be used, since ETPs exist for both policies [23]. However, memory latency can also vary with the last operation performed, because the latency of a read (or write) operation depends on whether the last operation was a read or a write. The authors of [25] describe how to upper-bound memory access latency, so an ETP can also be derived for this component with the form $ETP_{DRAM} = \langle (latDRAM_{max}), (1.0) \rangle$, where $latDRAM_{max}$ stands for the maximum memory access latency. Note that in this case latency cannot be randomised, since it depends on non-probabilistic events such as the particular memory accesses performed by tasks running in other cores, which are unlikely to be known at analysis time.
Shared cache. Cache partitioning has been proven to be a practical way to attenuate the interference effects from cache sharing. This solution was first shown for time-deterministic systems [24]. Since it eliminates all cache conflicts among tasks running on different cores, it cancels out the multicore side of the cache problem and allows using, for each core, the solutions devised for single-core processors.
An alternative approach has been put forward in [26], where a hardware feature is proposed to limit the eviction frequency caused by individual tasks on a shared time-randomised cache. That mechanism allows controlling inter-task interference without resorting to cache partitioning, which reduces the pWCET against the partitioned case, as long as inter-task interference distributes randomly across sets. The rationale behind that mechanism is as follows: during the analysis phase, the program under analysis is exposed to a given eviction rate in the shared cache. Then, during operation, tasks in other cores are not allowed to exceed that eviction rate. Hence, the ETP experienced at analysis time upper-bounds operation conditions. In other words, the miss rate during operation in the shared cache can only be lower than the one during the analysis phase. Therefore, the multicore case does not differ from the single-core case for the purposes of MBPTA.
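The condition that recurs for all shared resources above, namely that the ETP seen at analysis time upper-bounds the one seen during operation, can be checked mechanically by comparing exceedance probabilities. The sketch below is an illustration under stated assumptions: the latency values (6, 4 and 28 cycles for bus, memory controller and DRAM) and the operation-time ETP are hypothetical, and the helper names are not taken from the cited works.

```python
def compose_upper_bounds(*worst_case_latencies):
    """Sequentially compose upper-bounded resources: <(sum of maxima), (1.0)>."""
    return {sum(worst_case_latencies): 1.0}


def exceedance(etp, t):
    """Probability that the latency described by `etp` is strictly above t."""
    return sum(p for latency, p in etp.items() if latency > t)


def upper_bounds(analysis, operation, tolerance=1e-12):
    """True if `analysis` exceeds every latency at least as often as `operation`."""
    # Exceedance curves are step functions, so checking the union of all
    # latency points of both ETPs is sufficient.
    points = sorted(set(analysis) | set(operation))
    return all(
        exceedance(analysis, t) + tolerance >= exceedance(operation, t)
        for t in points
    )


# Hypothetical maxima for ETP_bus, ETP_memctrl and ETP_DRAM.
analysis_etp = compose_upper_bounds(6, 4, 28)

# Hypothetical latencies observed in operation, when contention is lower.
operation_etp = {20: 0.7, 30: 0.2, 38: 0.1}

analysis_is_safe = upper_bounds(analysis_etp, operation_etp)
```

With these values the analysis-time ETP is a single 38-cycle outcome with probability 1, and `analysis_is_safe` is `True`; swapping the two arguments makes the check fail, as expected for an ETP that underestimates contention.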
Software-only Alternatives
Recently, software-only solutions have been shown to achieve, for some time-deterministic COTS hardware resources (e.g., caches), the effects of the hardware design proposals presented above. So far, the design of those solutions has focused on cache memories [27, 28], seeking the same type of MBPTA-related benefits as provided by hardware-implemented random placement. The essence of those solutions is to place the data and the code of the application at random locations in memory, so that their placement in time-deterministic caches that implement modulo placement also becomes random and MBPTA requirements for caches are thus met. This random placement is entirely transparent to the application and has no functional effect on it. Next we review those solutions and compare their properties against their hardware counterparts.
Software-only Random Placement
Software-only random placement aims at causing cache conflicts in sets to occur randomly by placing objects at random memory locations. For instance, if an object is placed in a random memory location $Loc$, given a cache with $S$ cache sets, the particular set where the object will be placed in cache, $Loc \bmod S$, is also random.
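The mapping from a random memory location to a cache set can be sketched as follows. Several things here are assumptions rather than facts from the cited works: the geometry (128 sets of 32-byte lines), the reading of $Loc$ as a line index (byte address divided by line size), line-aligned placement, and all function names.

```python
import random
from collections import Counter

LINE_SIZE = 32   # bytes per cache line (assumed)
NUM_SETS = 128   # S, number of cache sets (assumed)


def cache_set(address, line_size=LINE_SIZE, num_sets=NUM_SETS):
    """Modulo placement: set index of the line that contains `address`."""
    return (address // line_size) % num_sets


def random_location(rng, region_size, line_size=LINE_SIZE):
    """Pick a line-aligned byte address uniformly within a memory region."""
    return rng.randrange(0, region_size, line_size)


def set_histogram(num_runs, region_size, seed=0):
    """Count which set an object's first line maps to over many placements."""
    rng = random.Random(seed)
    return Counter(
        cache_set(random_location(rng, region_size)) for _ in range(num_runs)
    )
```

When the region size is a multiple of `LINE_SIZE * NUM_SETS`, every set is reachable from the same number of candidate locations, so a uniformly random location yields a uniformly random set.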
At the present state of the art, software-only random placement operates on individual software functions (i.e., syntactically defined program fragments), static variables, and stack frames. As some padding is required for those entities to be moved in isolation, the memory footprint of the program grows as a result of the application of this technique. Current experience shows [27, 28] that the resulting bloat may be contained within acceptable limits.
Software vs Hardware Solutions
Hardware solutions place each cache line in an independent and random location in cache. Therefore, one can build an ETP for cache accesses of the
ability of conflicting in cache, whereas cache lines inside a given object have a fully deterministic behaviour among them. Still, this does not break MBPTA requirements, since the deterministic behaviours observed at analysis time stay exactly the same during operation: the memory location of a given object is randomised, but the lines that form the object retain their position relative to one another. Hence, there is a probability [29] that two lines from different objects are placed in the same cache set and are thus able to evict each other.
However, if those two lines belong to the same object, the probability of being in the same set is either 0 or 1, depending on whether their relative alignment differs from or matches the size of one cache way, respectively.
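The contrast between lines of the same object and lines of different objects can be verified by enumerating every possible placement. The sketch below assumes a modulo-placement cache with 128 sets of 32-byte lines and line-aligned object bases; these parameters and the function names are illustrative assumptions. It also treats any distance that is a multiple of the way size as a match, which generalises the single-way-size case mentioned in the text.

```python
LINE_SIZE = 32                     # bytes per line (assumed)
NUM_SETS = 128                     # number of sets (assumed)
WAY_SIZE = LINE_SIZE * NUM_SETS    # bytes covered by one cache way


def cache_set(address):
    """Modulo placement: set index of the line that contains `address`."""
    return (address // LINE_SIZE) % NUM_SETS


def same_object_conflict_fraction(offset_a, offset_b):
    """Fraction of object placements in which two of its lines share a set."""
    bases = range(0, WAY_SIZE, LINE_SIZE)  # every distinct set alignment
    hits = sum(
        cache_set(base + offset_a) == cache_set(base + offset_b) for base in bases
    )
    return hits / len(bases)


def different_objects_conflict_fraction():
    """Fraction of independent placements in which two objects share a set."""
    bases = range(0, WAY_SIZE, LINE_SIZE)
    hits = sum(cache_set(a) == cache_set(b) for a in bases for b in bases)
    return hits / (len(bases) ** 2)
```

Two lines of one object that are `WAY_SIZE` bytes apart conflict under every placement (fraction 1.0), two lines 64 bytes apart never do (fraction 0.0), and the first lines of two independently placed objects conflict in a fraction `1 / NUM_SETS` of the placements.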
Still, probabilities can be attached to all events and thus, one can also build an ETP of the form $ETP^{SW}_{cache} = \langle (l \ldots \rangle$ for cache accesses under software-only random placement. While latencies will be the same for $ETP^{HW}_{cache}$ and $ETP^{SW}_{cache}$, probabilities will not, given that the probabilities of the different latency outcomes differ across hardware and software-only
It is important to appreciate, however, that the actual values of probabilities need not be known in order for MBPTA to be applied. What is needed is that MBPTA requirements are satisfied, which is indeed the case for both hardware and software-only solutions. We can therefore contend that software-only solutions for cache placement can also be regarded as MBPTA compliant.
• Fetch stage. The IL1 is accessed (and the instruction TLB, ITLB, on an IL1 miss) to obtain the next instruction to be executed. Branches are always predicted to be taken.
• Decode stage. Instructions are decoded. This stage is, in essence, an extra delay in the pipeline.
• Register access. Instructions read their input registers with fixed latency.
• Execute stage. Non-memory instructions are executed with a fixed latency that depends solely on the type of operation. Although floating-point division (FDIV) and floating-point square root (FSQRT) instructions originally had input-data-dependent latencies, they have been modified as described later. Memory operations compute their addresses.
• Memory stage. Load instructions access the DL1 (and data TLB, DTLB, on a DL1 miss). Indeed, they also access the write buffer. Store operations are placed in the write buffer for their offline processing. If the write buffer is full the pipeline will be blocked.
• Exception stage. Exceptions are managed here.
• Write-back stage. Results (if any) are sent to the register file.
The IL1 and DL1 are 16KB in size and 4-way set-associative, with 16B lines in the IL1 and 32B lines in the DL1. All caches implement the random placement and replacement policies presented in [31]. The DL1 is write-through and no-write-allocate, so [...]

For the evaluation we use the EEMBC Automotive Benchmarks [32], a well-known benchmark suite representative of some existing real-time automotive functionalities. The description of each benchmark is provided in Table 1 for the sake of completeness.
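The cache geometry given above fixes the number of sets and the size of one way, which are the quantities that govern whether two lines can conflict. The following minimal sketch derives them from the stated parameters. The helper names are illustrative, and the conflict probability of `1/sets` holds only under the loudly stated assumption of idealised, uniform and independent placement; it is not a property measured on the actual design.

```python
# Cache parameters taken from the text: 16KB, 4-way,
# 16B lines (IL1) and 32B lines (DL1).
CACHE_BYTES = 16 * 1024
WAYS = 4


def num_sets(cache_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache."""
    return cache_bytes // (ways * line_bytes)


def way_bytes(cache_bytes, ways):
    """Size of one cache way in bytes."""
    return cache_bytes // ways


def p_same_set(sets):
    """Probability that two lines share a set.

    ASSUMPTION: idealised placement in which each line lands in a set
    uniformly and independently of every other line.
    """
    return 1.0 / sets


IL1_SETS = num_sets(CACHE_BYTES, WAYS, 16)
DL1_SETS = num_sets(CACHE_BYTES, WAYS, 32)

print("IL1 sets:", IL1_SETS, "DL1 sets:", DL1_SETS)
print("way size (bytes):", way_bytes(CACHE_BYTES, WAYS))
print("idealised conflict probability (IL1):", p_same_set(IL1_SETS))
```

The smaller DL1 set count follows directly from its longer lines: with the same capacity and associativity, doubling the line size halves the number of sets.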
Hardware Modifications
In the quest for MBPTA compliance, we have modified the cache placement and replacement policies, as well as selected floating-point (FP) operations whose jitter depends comparatively strongly on the input parameters. In the original processor design, all caches (DL1, IL1, DTLB, ITLB) implemented modulo placement and least recently used (LRU) replacement, whose sensitivity to the history of execution makes them unable to meet the MBPTA prerequisites [31] unless appropriate software support is provided to the application [27].
Random placement and replacement have been implemented as described in [31]. In particular, random placement implements the latest design as described in [33]. Random replacement relies on the use of a pseudo-random number generator. While the one described in [33] has been shown to be convenient, the one described in [34] appears to generate random numbers of similar quality while being amenable to a much easier implementation on an FPGA.
For the FP unit we concentrated on the FDIV and FSQRT operations, whose latency jitter is highly dependent on the input parameters. The FDIV latency [...]

From the processor design perspective, the actual latency of those operations does not occur with a given probability, and all that one can infer from the application program is the frequency of their execution, which is of no use for MBPTA. The solution described in Figure 3 (b) therefore needs to be applied. The implementation of FDIV and FSQRT has been modified so that, in the analysis mode, they always take 18 and 26 cycles respectively. As we noted earlier, modifications of this kind cause the pWCET estimates to incur some (though limited) pessimism, but they make the corresponding hardware resources MBPTA compliant, which is what we are after here.
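The effect of the analysis mode can be illustrated with a small sketch. Only the two fixed latencies (18 and 26 cycles) come from the text. The function name, the idea of passing the input-dependent latency as a parameter, and the behaviour outside the analysis mode are assumptions made for illustration.

```python
# Fixed analysis-mode latencies in cycles, as stated in the text.
ANALYSIS_LATENCY = {"FDIV": 18, "FSQRT": 26}


def fp_latency(op, data_dependent_latency, analysis_mode):
    """Latency charged to an FP operation.

    ASSUMPTION: outside the analysis mode the operation takes its
    input-dependent latency, which never exceeds the fixed bound.
    """
    bound = ANALYSIS_LATENCY[op]
    if data_dependent_latency > bound:
        raise ValueError("fixed latency would not be an upper bound")
    return bound if analysis_mode else data_dependent_latency


# Hypothetical input-dependent latencies, for illustration only.
for op, observed in (("FDIV", 7), ("FDIV", 18), ("FSQRT", 12)):
    at_analysis = fp_latency(op, observed, analysis_mode=True)
    at_operation = fp_latency(op, observed, analysis_mode=False)
    print(op, at_analysis, at_operation)
```

Under these assumptions the latency seen at analysis time is never lower than the one seen during operation, which is the property that makes the resource safe to measure.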
Deriving ETP
In view of the hardware modifications discussed above, the processor architecture [...]

We differentiate between two types of instructions: those that operate on the core (e.g. add, div, mult) and those that operate on memory (e.g. load, store). Core operations take a variable latency that depends on two factors. The first is whether they hit in the instruction cache and the instruction TLB, whose ETPs ($ETP_{IL1}$ and $ETP_{ITLB}$ respectively) are composed in parallel. The second is the memory latency: memory is accessed in case of a miss, and its ETP ($ETP_{DRAM}$) is composed sequentially with the composition of the instruction cache and the instruction TLB.
This leads to what we term the ETP of the front-end (fend): $ETP_{fend} = f_s(f_p(ETP_{IL1}, ETP_{ITLB}), ETP_{DRAM})$. The resulting ETP, $ETP_{fend}$, then needs to be composed with the ETP of the buffer between the front-end and the back-end ($ETP_{buf1}$), the ETP of the decode stage ($ETP_{dec}$), the buffer after decode ($ETP_{buf2}$), the register access stage ($ETP_{ra}$), the buffer after register access ($ETP_{buf3}$), the core operations ($ETP_{exec}$), the buffer after execution ($ETP_{buf4}$), the memory operations stage ($ETP_{mem}$), the buffer after memory operations ($ETP_{buf5}$), the exceptions stage ($ETP_{excep}$), the buffer after exceptions ($ETP_{buf6}$) and the write-back stage ($ETP_{wb}$).
While $ETP_{dec}$, $ETP_{ra}$, $ETP_{exec}$, $ETP_{mem}$, $ETP_{excep}$ and $ETP_{wb}$ have the [...] states can be found in [21]. If all actions occurred sequentially (thus omitting interactions in the buffer to memory), the ETP for core operations would be as follows:
$$ETP_{core} = f_s(ETP_{fend}, ETP_{buf1}, ETP_{dec}, ETP_{buf2}, ETP_{ra}, ETP_{buf3}, ETP_{exec}, ETP_{buf4}, ETP_{mem}, ETP_{buf5}, ETP_{excep}, ETP_{buf6}, ETP_{wb}) \qquad (1)$$
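To make the composition operators concrete, the sketch below represents an ETP as a mapping from latency to probability. It assumes, since this section does not define the operators, that sequential composition $f_s$ adds the latencies of independent resources (a convolution) and that parallel composition $f_p$ takes the latency of the slower resource (a maximum). All latency and probability values are hypothetical.

```python
from itertools import product


def compose(etp_a, etp_b, combine):
    """Combine two independent ETPs (dict: latency -> probability)."""
    out = {}
    for (lat_a, p_a), (lat_b, p_b) in product(etp_a.items(), etp_b.items()):
        lat = combine(lat_a, lat_b)
        out[lat] = out.get(lat, 0.0) + p_a * p_b
    return out


def f_s(*etps):
    """ASSUMED sequential composition: latencies add up."""
    result = {0: 1.0}
    for etp in etps:
        result = compose(result, etp, lambda a, b: a + b)
    return result


def f_p(*etps):
    """ASSUMED parallel composition: the slower resource dominates."""
    result = {0: 1.0}
    for etp in etps:
        result = compose(result, etp, max)
    return result


# Hypothetical ETPs, for illustration only (latencies in cycles).
ETP_IL1 = {1: 0.75, 3: 0.25}
ETP_ITLB = {1: 0.5, 3: 0.5}
ETP_DRAM = {0: 0.5, 20: 0.5}  # latency 0 stands for "memory not accessed"

ETP_fend = f_s(f_p(ETP_IL1, ETP_ITLB), ETP_DRAM)
for latency in sorted(ETP_fend):
    print(latency, ETP_fend[latency])
```

Both operators multiply probabilities, so the sketch is only meaningful when the composed resources behave independently, which is the property the hardware modifications aim to provide.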
Memory operations have the same ETP as core operations for the different stages and buffers except for the memory stage ($ETP_{mem}$). The memory latency, instead of depending on $ETP_{mem}$, depends on the time of the data memory path (dmpath), composed of the data cache and the data TLB, which are accessed in parallel, and the memory latency, which is accessed sequentially: $ETP_{dmpath} = f_s(f_p(ETP_{DL1}, ETP_{DTLB}), ETP_{DRAM})$. Therefore, the ETP for memory operations (still omitting interactions in the buffer to memory) is as follows:
$$ETP_{mem} = f_s(ETP_{fend}, ETP_{buf1}, ETP_{dec}, ETP_{buf2}, ETP_{ra}, ETP_{buf3}, ETP_{exec}, ETP_{buf4}, ETP_{dmpath}, ETP_{buf5}, ETP_{excep}, ETP_{buf6}, ETP_{wb}) \qquad (2)$$
Finally, we must consider that the misses occurring in the DL1/DTLB and in the IL1/ITLB are serialised in the buffer that connects the core to the memory controller. Again, this buffer has an ETP of the same form as any other buffer ($ETP_{bufDRAM}$). Unlike previous buffers, where an instruction could only be delayed due to activities of older instructions, here data requests from some instructions may get delayed by instruction requests of younger instructions. Still, the buffer can only have a finite number of states, and each state will have a probability that, hypothetically, could be derived by expanding the probability tree from the beginning of the execution of the program. Thus, $ETP_{bufDRAM}$ should be composed serially with the ETP of the memory accesses, so $ETP_{fend}$ and $ETP_{dmpath}$ should be $ETP_{fend} = f_s(f_p(ETP_{IL1}, ETP_{ITLB}), ETP_{bufDRAM}, ETP_{DRAM})$ and $ETP_{dmpath} = f_s(f_p(ETP_{DL1}, ETP_{DTLB}), ETP_{bufDRAM}, ETP_{DRAM})$ for a correct calculation of the ETP of core (Equation 1) and memory operations (Equation 2).

To assert independence we use the Ljung-Box test [35] (LB). The Ljung-Box test is a powerful method that tests autocorrelation for different lags simultaneously, that is, for each datum with the next one (lag 1), the one after (lag 2), and so on. In particular we test all lags up to 20 as shown appropriate [...]
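The Ljung-Box statistic over lags 1 to 20 can be sketched in a few lines of standard-library Python. The 5% significance level and the corresponding chi-square critical value for 20 degrees of freedom are assumptions made here for illustration; this passage does not state the level used in the experiments. The function names are also illustrative.

```python
import random

# ASSUMPTION: 5% significance level. 31.410 is the 0.95 quantile of the
# chi-square distribution with 20 degrees of freedom.
CHI2_95_DF20 = 31.410


def ljung_box_q(xs, max_lag=20):
    """Ljung-Box Q statistic over lags 1..max_lag.

    Requires non-constant data (the sample variance must be non-zero).
    """
    n = len(xs)
    mean = sum(xs) / n
    dev = [x - mean for x in xs]
    denom = sum(d * d for d in dev)
    acc = 0.0
    for k in range(1, max_lag + 1):
        # Sample autocorrelation at lag k.
        rho_k = sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        acc += rho_k * rho_k / (n - k)
    return n * (n + 2) * acc


def independence_rejected(xs):
    """True if the 20-lag test rejects independence at the assumed level."""
    return ljung_box_q(xs, max_lag=20) > CHI2_95_DF20


# A steadily growing series is strongly autocorrelated and is rejected.
trend = list(range(1000))
print("trend rejected:", independence_rejected(trend))

# Hypothetical execution times drawn independently, for comparison.
rng = random.Random(1)
sample = [rng.gauss(1000.0, 5.0) for _ in range(1000)]
print("Q for independent draws:", ljung_box_q(sample))
```

A single statistic covers all 20 lags at once, which is why one test per set of measurements suffices instead of one test per lag.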
pWCET
In this section we show the type of probabilistic WCET estimates that can [...] run (e.g., the number of cache accesses), the less likely that abrupt performance variations occur other than (if at all) at extreme exceedance thresholds. Thus, execution time variation is moderate and the pWCET curve is steep.
As the example processor architecture demonstrably meets the requirements needed for MBPTA, it can be argued that MBPTA can be applied to performance-aggressive hardware features. Interestingly, the MBPTA process stays unchanged in procedure and effort, while the pWCET estimates become considerably smaller (up to 9% in the specific experiment) than the engineering margin often applied by industry in measurement-based deterministic timing analysis (20% in the case of [38]).
To the best of our knowledge, complex architectures including caches, TLBs, and staged pipelines with buffers have not been used with static timing analysis without cautionary restrictions that mitigate the rapid degradation in the tightness of the WCET estimates that arises when the state of the resources being used cannot be determined exactly. MBDTA also is at a loss [...]
Related Work
There is an increasingly rich literature on the problem of WCET analysis.
One substantial part of the state of the art, with more history and tradition, [...] and additional flow-facts describing the operation of the software such as value ranges and memory addresses.
Conversely, MBPTA, the focus of this paper, requires amounts of information comparable to those obtained by end users in the context of MBDTA, but it scales to arbitrarily complex software running on top of high-performance hardware, easing the collection of evidence usable for certification purposes [43].
MBPTA has been used in the context of time-randomised architectures for single-path programs [4, 44] and multi-path programs [6, 19] .
At hardware level, random placement was proposed in [31] to enable the use of set-associative caches for MBPTA. [45] and [16] discuss the reliability of pWCET estimates obtained with MBPTA on top of random placement caches. In particular, the authors discuss representativeness in relation to the fact that some random events may have a low probability of being captured in the measurement runs, yet have a high impact on execution time. The latter work [16] and other recent works [17] conduct a thorough analysis of those scenarios in the context of MBPTA and propose ways to address them.
EVT has been applied to time-deterministic architectures to derive execution time bounds [46]. While randomisation (and creating deterministic bounds for jittery resources) is not needed for the application of EVT, deterministic architectures make it seriously difficult to derive a representativeness argument.
That is, with EVT-only approaches, building a representativeness argument that analysis-time execution conditions capture those that can arise during operation is left completely to the user. Instead, with MBPTA compliance, achieved through randomisation and deterministic upper bounding, the space of potential execution conditions is automatically, transparently and randomly sampled as the user makes more runs. Hence, representativeness just requires the user to perform enough runs to probabilistically capture the impact of the different sources of jitter, rather than designing specific experiments to reach that goal [47].
Conclusions and Future Work
In this paper we have shown that, in order for MBPTA to be usable economically and assuredly, the target processors should be designed such that every program instruction has a distinct probabilistic ETP. We have shown that this ETP can be built incrementally from the timing behaviour of the processor resources used by that instruction.
Using MBPTA on MBPTA-friendly processor architectures, the timing interference between competing applications, which is one of the key problems in mixed-criticality systems, can be studied from the angle of exceedance probability: the probability that the execution time of a program exceeds a given threshold. We have shown that this threshold is tight, owing to the natural attenuation of multiple worst-case events generated as i.i.d. random variables. We have shown that the probabilistic worst-case execution time bounds obtained with the proposed technique are only marginally greater (around 12% in our case study) than the average-case performance of time-deterministic processor architectures. This allows achieving higher guaranteed (feasible) utilisation for mixed-criticality systems, because little would be lost, if anything, in raw processor performance, and a great reduction would be had in the pessimistic over-provisioning incurred with traditional techniques. The use of Extreme Value Theory allows setting bounds for execution-time budgets at levels of exceedance probability that satisfy the system assurance requirements. Normal mitigation measures (i.e. adding some form of redundancy, setting up a safe state, etc.) can be taken if protection guarantees have to be provided for higher-criticality applications at conditions past the given exceedance threshold.
