INTRODUCTION
Information about the upper bound of task execution time is a key factor when designing real-time embedded systems. This "worst case execution time" (WCET) is defined as the longest time taken by the processor to execute a task in the absence of pre-emption [1]. WCET estimates can be used as the basis of many key design operations, such as scheduling and schedulability analysis, determining whether performance goals are met for periodic tasks, checking that interrupts have sufficiently short reaction times, and finding performance bottlenecks [2].
Determining WCET values is becoming more challenging as embedded designs become more complex and make use of faster and smaller processors and "system on chip" architectures [3], [4].
These modern processors often incorporate features (such as pipelines, caches, and branch predictors) that help to increase performance. Unfortunately, these features also make it difficult to determine the internal state of the system [5].
As a consequence, analysing the timing of the whole system and estimating the WCET requires a significant effort even for systems with a simple software architecture, such as time-triggered systems which use a table-driven task schedule [6]. In addition, use of fault-tolerance policies based on knowledge of WCET becomes more challenging with modern processors [7].
To cope with these difficulties some researchers base their scheduler on expected execution time rather than WCET [8], [9]. This represents, at best, a partial solution where precise timing behaviour is required.
A number of previous studies have been conducted to address the problem of accurately estimating WCET: this work has involved analysis and/or measurements (e.g. [2], [5], [10], [11]).
For example, Engblom and Jonsson [2] examined the timing of instructions from the perspective of static WCET using a mathematical model of instruction execution on in-order single-issue pipelined processors: the particular concern in this study was with timing effects between non-adjacent instructions. Engblom and Jonsson showed that there are negative long timing effects (LTEs) which can be safely ignored, and positive LTEs which have to be accounted for in a WCET analysis.
In another study, Rochange and Sainrat [10] showed that pure static analysis might not allow safe WCET computation for modern processors with speculative execution. In particular, they noted that when a branch instruction is predicted, the instructions belonging to the predicted path can be executed before the branch is resolved. They discuss the possible effects of executing the wrong path whenever a branch is mispredicted.
Deverge and Puaut [5] explore the issues to be addressed for designing a measurement-based method for WCET estimation. They propose generation of test data for program segments, using program clustering, to combine execution time of program segments and to obtain the WCET of the whole program.
A different approach has been proposed by Puschner and Burns [11]: this is called "the single-path programming paradigm". This approach is based on the idea of writing the program in a manner which ensures that there is only one execution path. They showed that this helps to produce a constant execution time.
In this paper, we highlight two issues with the techniques described by Puschner and Burns: (i) they are applicable only to hardware which supports "conditional move" or similar instructions; (ii) their balancing approach can increase power consumption. In the present paper, we address both of these problems with a modified set of single-path programming techniques. The effectiveness of these new techniques is demonstrated by means of an empirical study.
The remainder of this paper is organised as follows. Section 2 gives an overview of the "single path programming paradigm" as introduced by Puschner and Burns [11] . In Section 3 we introduce some code-balancing techniques which address two issues with this paradigm. In Section 4 we assess the proposed code-balancing techniques. In Section 5 we present our conclusions.
THE SINGLE PATH PROGRAMMING PARADIGM
This section gives an overview of the single-path programming paradigm as described previously ([11]-[14]).
Program code that complies with the single-path programming paradigm has only one execution path. This can be achieved by replacing input-data dependencies in the control flow by predicated (instead of branched) code. In predicated execution, instructions are associated with predicates: if the predicate evaluates to true the instruction is executed; otherwise the microprocessor internally replaces the instruction by a no-operation (NOP) instruction. It is assumed that a simple predicated execution model is used (such as the conditional move instruction in the M-Core processor, in which conditional instructions have a constant, data-independent execution time).
As an example, Figure 1 shows some pseudo code that indicates how a code branch using an if-then-else structure can be translated to the single-path form. In this example the conditional move instruction "movt" copies the value of "temp1" to "result" if the result of the "test" instruction is true; otherwise the processor performs a NOP instruction. The same can be said for the "movf" instruction; it will copy the value of "temp2" to "result" if the result of the "test" instruction is false; otherwise the processor performs a NOP instruction. This code can be easily modified for use with nested if statements [15].
Figure 1. Converting an if-then-else structure to single-path form (adapted from [14]).
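On a processor without conditional-move instructions, a similar effect can be approximated in C with mask arithmetic. The sketch below is only an illustration and is not the code of Figure 1: the function name `select_single_path` is hypothetical, and whether the compiler emits branch-free machine code depends on the toolchain and optimisation settings, so the generated assembly would need to be checked.

```c
#include <stdint.h>

/* Hypothetical helper: returns temp1 if cond is non-zero, else temp2,
 * without a data-dependent branch in the source code. */
uint32_t select_single_path(int cond, uint32_t temp1, uint32_t temp2)
{
    /* mask is 0xFFFFFFFF when cond is true and 0x00000000 otherwise */
    uint32_t mask = (uint32_t)0 - (uint32_t)(cond != 0);

    /* Both operands are always read; the mask keeps exactly one of them,
     * mirroring the role of the movt/movf pair described above. */
    return (temp1 & mask) | (temp2 & ~mask);
}
```

Both candidate values must already have been computed before the selection, which is the source of the extra work that single-path code performs.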
In a similar manner, a loop of variable length can be translated into a loop of constant length (provided we know the maximum size of the loop). Please note that less structured "goto" and "exit" statements are not considered in this approach.
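As a rough illustration of the loop transformation, the following C sketch always performs the maximum number of iterations and uses a mask to discard the work done in the padding iterations. The function `sum_first_n` and the bound `MAX_LEN` are hypothetical names chosen for this example, and the task (summing the first n elements) is not taken from the paper.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_LEN 8u  /* assumed maximum number of iterations */

/* Hypothetical example: sums the first n elements of data, but always
 * performs MAX_LEN iterations so the iteration count is input-independent. */
uint32_t sum_first_n(const uint32_t data[MAX_LEN], uint32_t n)
{
    assert(n <= MAX_LEN);  /* debug-only precondition check */

    uint32_t sum = 0u;
    for (uint32_t i = 0u; i < MAX_LEN; i++) {
        /* mask is all ones while i < n, zero for the padding iterations */
        uint32_t mask = (uint32_t)0 - (uint32_t)(i < n);
        sum += data[i] & mask;
    }
    return sum;
}
```

The cost of this form is visible directly: a call with a small n still pays for MAX_LEN iterations.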
It has been demonstrated ([11]-[14]) that using this method helps to produce a constant execution time. However, this method has some drawbacks:
i) Its usage is limited to hardware which supports "conditional move" or similar instructions.
ii) It is likely to increase power consumption because the CPU will always execute the single-path code for a fixed (maximum) period. During this time, the processor will be in "full power" mode.
PROPOSED CB1 TECHNIQUES
In this section we begin to address the drawbacks mentioned above by using a set of novel code-balancing techniques. For ease of reference, we refer to the approach described here as the "CB1 techniques" in the remainder of this paper.
Overview
The main idea behind the CB1 approach can be explained by considering an example. This example is intended to stabilise the time taken to complete a number of iterations in a given loop.
Assume that the time spent in performing "x" iterations of the loop is equal to Time(x), where 1 <= x <= MAX and MAX is the maximum number of iterations. The microcontroller is set to enter a power-saving mode for the period of time required to perform (MAX - x) iterations. This time can be approximated by Eq. (1).
Hardware timers can be used to measure Time(x), the time spent in performing x iterations, by starting the timer directly before the start of the loop and stopping it directly after the last iteration. Please note that it is assumed that the loop will be executed at least once for Eq. (1) to give real results. Also note that there is an approximation in calculating Time(MAX - x) given by Eq. (1), as it does not take into account the effects of the performance-enhancing features mentioned earlier.
The reason for this approximation is to simplify the calculations and the implementation process, so that the approach is suitable for any platform and avoids complex analysis of each specific feature.
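Eq. (1) is not reproduced in this text. A form consistent with the surrounding description (a linear extrapolation from the measured loop time, which requires at least one executed iteration) would be the following; it should be read as a reconstruction under that assumption rather than as the original equation.

```latex
\[
\mathrm{Time}(\mathrm{MAX} - x) \approx \frac{\mathrm{MAX} - x}{x}\,\mathrm{Time}(x),
\qquad 1 \le x \le \mathrm{MAX}
\]
```

For instance, with MAX = 10 and a measured Time(4) of 2 ms, this form gives Time(6) of approximately (6/4) x 2 ms = 3 ms to be spent in the power-saving mode.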
Based on this form of "sandwich delay", a set of balanced code structures can be used to reduce the variations in executing for and while loops. The approach can also be used to balance if-then-else structures, as explained in the following subsections.
Balanced for loop
Listing 1 shows pseudo code that can be used to stabilise the WCET of a for-loop for any number of iterations in the range [1, MAX], where MAX is the maximum number of iterations.
Please note that a small "safety margin" was added to the time calculated in Eq. (1) to ensure that there is time to enter sleep mode before the interrupt occurs, even at the maximum loop length.
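The balanced for-loop can be sketched in C as follows. This is a host-side simulation under stated assumptions, not the paper's listing: the remaining time is assumed to be estimated by linear extrapolation from the measured Time(x), and the timer and power-saving mode are replaced by the hypothetical stand-ins `timer_read` and `sleep_for`. The per-iteration cost of 5 ticks and the margin of 2 ticks are arbitrary values. On real hardware the sleep would be ended by a timer interrupt rather than by a function call.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_ITER      10u  /* assumed maximum number of iterations (MAX) */
#define SAFETY_MARGIN 2u   /* assumed margin, in timer ticks */

/* Simulated hardware (hypothetical stand-ins for a timer and sleep mode) */
static uint32_t sim_now;    /* free-running tick counter */
static uint32_t sim_slept;  /* total ticks spent in power-saving mode */

static uint32_t timer_read(void) { return sim_now; }
static void sleep_for(uint32_t ticks) { sim_now += ticks; sim_slept += ticks; }
static void loop_body(void) { sim_now += 5u; }  /* pretend: 5 ticks per pass */

/* Runs x iterations, then sleeps for the estimated time of the remaining
 * (MAX_ITER - x) iterations. Returns the total elapsed ticks. */
uint32_t balanced_for(uint32_t x)
{
    assert(x >= 1u && x <= MAX_ITER);  /* the estimate divides by x */

    uint32_t start = timer_read();
    for (uint32_t i = 0u; i < x; i++) {
        loop_body();
    }
    uint32_t elapsed = timer_read() - start;

    /* Linear extrapolation of the time for the iterations not executed */
    uint32_t remaining = ((MAX_ITER - x) * elapsed) / x + SAFETY_MARGIN;
    sleep_for(remaining);

    return timer_read() - start;
}
```

In this simulation every call takes the same total time regardless of x, while short loops spend most of that time asleep. On a real processor the integer division and the features discussed earlier would make the result approximate rather than exact.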
Balanced while loop used for waiting for input
Listing 2 shows pseudo code that can be used to stabilise the WCET of executing a while-loop which is used to wait, for a predefined maximum time (TMAX), for an input to be ready.
Please note that a small "safety margin" was (again) added to the time TMAX after the end of the while loop to ensure that there is time to enter sleep mode before the interrupt occurs, even in the case where the input becomes ready at time TMAX. The safety margin will typically be 1% of the value of TMAX.
(TMAX + "safety margin" - "timer value");
Send the microcontroller to power saving mode;
Listing 2. Pseudo code of a balanced while-loop used for waiting for input.
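A minimal C sketch of this balanced wait is given below. It simulates the hardware on a host machine: `timer_read`, `input_ready`, and `sleep_for` are hypothetical stand-ins, `TMAX` is set to an arbitrary 100 ticks, and the margin follows the 1% guideline mentioned above. It illustrates the sandwich-delay idea and is not a reconstruction of Listing 2.

```c
#include <assert.h>
#include <stdint.h>

#define TMAX          100u  /* assumed maximum waiting time, in ticks */
#define SAFETY_MARGIN 1u    /* 1% of TMAX */

/* Simulated hardware (hypothetical stand-ins) */
static uint32_t sim_now;         /* free-running tick counter */
static uint32_t input_ready_at;  /* tick at which the input becomes ready */

static uint32_t timer_read(void) { return sim_now; }
static int input_ready(void) { return sim_now >= input_ready_at; }
static void sleep_for(uint32_t ticks) { sim_now += ticks; }

/* Polls for the input for at most TMAX ticks, then sleeps for
 * (TMAX + margin - timer value). Returns 1 if the input arrived. */
int balanced_wait(uint32_t *total_ticks)
{
    uint32_t start = timer_read();
    int ready = input_ready();

    while (!ready && (timer_read() - start) < TMAX) {
        sim_now += 1u;  /* simulated cost of one polling pass */
        ready = input_ready();
    }

    uint32_t elapsed = timer_read() - start;
    assert(elapsed <= TMAX);
    sleep_for(TMAX + SAFETY_MARGIN - elapsed);

    *total_ticks = timer_read() - start;
    return ready;
}
```

Whether the input arrives early or the wait times out, the caller resumes after the same number of ticks in this simulation.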
Balanced if-then-else structure
Listing 3 shows pseudo code that can be used to stabilise the WCET of a general if-then-else structure.
Please note that the number of assignment instructions in the if-part must be equal to the number in the else-part ("NOP" padding or similar approaches must be used, if necessary).
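One way to apply the sandwich delay to an if-then-else structure is sketched below in C. This is an assumption about how such balancing could look, not a reconstruction of Listing 3: the branch costs (40 and 15 ticks), the bound `BRANCH_WCET`, and the helper names are all hypothetical, and the hardware is simulated on a host machine.

```c
#include <assert.h>
#include <stdint.h>

#define BRANCH_WCET   40u  /* assumed worst-case time of the longer branch */
#define SAFETY_MARGIN 2u   /* assumed margin, in timer ticks */

/* Simulated hardware and branch bodies (hypothetical stand-ins) */
static uint32_t sim_now;
static uint32_t timer_read(void) { return sim_now; }
static void sleep_for(uint32_t ticks) { sim_now += ticks; }
static void then_part(void) { sim_now += 40u; }  /* pretend: long branch */
static void else_part(void) { sim_now += 15u; }  /* pretend: short branch */

/* Executes one branch, then sleeps until the worst-case time (plus margin)
 * has passed. Returns the total elapsed ticks. */
uint32_t balanced_if(int cond)
{
    uint32_t start = timer_read();

    if (cond) {
        then_part();
    } else {
        else_part();
    }

    uint32_t elapsed = timer_read() - start;
    assert(elapsed <= BRANCH_WCET);
    sleep_for(BRANCH_WCET + SAFETY_MARGIN - elapsed);

    return timer_read() - start;
}
```

Unlike the predicated form, only one branch is executed here, and the time saved on the shorter branch is spent in the power-saving mode.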
PERFORMANCE OF THE CB1 TECHNIQUES
An empirical test was carried out to explore the effectiveness of the CB1 techniques. The procedure and the results obtained are detailed in this section.
Initial test
This test reuses an example that was used in [11] to assess the effectiveness of the single-path programming paradigm. The original example explores different implementations of a "bubble sort" for arrays of 10 elements, and sorts all the elements of the array. The version used here sorts the first x elements of the array (where x <= SIZE, the total number of array elements). This modification was made in order to explore the impact of different implementations on the execution time, jitter, and power consumption.
Our tests employed a time-triggered co-operative (TTC) scheduler [16]. The tick interval was set to 10 ms.
The main (sorting) task was run every two ticks.
Three additional tasks were also scheduled:
i) A "jitter-test" task. This low-priority task was scheduled to execute in the same tick as the sorting task. It was used to measure the effect of the variations in the execution time of the sorting task on the jitter in the start times of other tasks in the system.
ii) A "sort length" task used to increment the value of x from 2 to 10 and then back to 2.
iii) A "sort complexity" task used to initialise the array in "completely sorted" or "completely unsorted" forms, in order to vary the time taken to carry out the sort process.
The test was carried out on an NXP (formerly Philips) LPC2106 microcontroller running on a small evaluation board. The LPC2106 is based on an ARM7TDMI core and is typical of modern (low cost) embedded processors. Because this microcontroller does not support the conditional move instruction, the single-path code was modified to cope with this limitation, while keeping the main structure described in [11]. Listing 4, Listing 5, and Listing 6 show different implementations of the bubble sort using the traditional, single-path, and proposed CB1 code respectively. Table 1 and Table 2 show the measured minimum and maximum execution time, maximum jitter, and average power consumption resulting from each implementation. From these tables it can be seen that:
· Both the single-path code and CB1 code demonstrated a reduction in both the variation of the execution time and in jitter levels. These improvements were at the expense of an increase in the average power consumption and execution time.
· The jitter and the variations in execution time obtained by using the CB1 code were lower than those of the traditional code and higher than those of the single-path code.
· The average power consumption obtained by using the CB1 code was less than that of the single path code and higher than that of the traditional code. 
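The listings referred to above are not reproduced in this text. As a rough illustration of what a single-path sort of the first x elements might look like on a processor without conditional moves, the C sketch below always performs the same number of comparisons and uses mask arithmetic for the swap. The function name `bubble_sort_fixed_path` is hypothetical, the element type is assumed to be unsigned 32-bit, and this is not the modified single-path code used in the tests.

```c
#include <assert.h>
#include <stdint.h>

#define SIZE 10u  /* array length used in the test described above */

/* Sorts the first x elements of a into ascending order. The loop bounds
 * depend only on SIZE, so the iteration count is independent of x and of
 * the array contents. */
void bubble_sort_fixed_path(uint32_t a[SIZE], uint32_t x)
{
    assert(x <= SIZE);  /* debug-only precondition check */

    for (uint32_t i = 0u; i + 1u < SIZE; i++) {
        for (uint32_t j = 0u; j + 1u < SIZE; j++) {
            uint32_t lo = a[j];
            uint32_t hi = a[j + 1u];

            /* Swap only if the pair lies inside the unsorted part of the
             * first x elements and is out of order. */
            uint32_t active = (uint32_t)(j + 1u + i < x) & (uint32_t)(lo > hi);
            uint32_t mask = (uint32_t)0 - active;

            /* XOR swap gated by the mask: a no-op when mask is zero */
            uint32_t diff = (lo ^ hi) & mask;
            a[j] = lo ^ diff;
            a[j + 1u] = hi ^ diff;
        }
    }
}
```

Every call performs (SIZE - 1) x (SIZE - 1) compare-and-store steps, which illustrates why a single-path version can show a longer execution time and higher power consumption than the traditional code.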
Extended test
The experiment described in the previous section was repeated using two additional benchmark test cases which have been used in previous WCET studies [17], [18]:
i) The first test case implements a single nested loop used to calculate the Fibonacci series for up to 30 elements.
ii) The second test case implements a triple-nested loop used to calculate the matrix multiplication of two 2-D arrays of up to 20x20 elements in size.
The length of the tick interval of the TTC scheduler was set to 10 ms for the first test and 100 ms for the second test. Table 3 through Table 6 show the measured minimum and maximum execution time, maximum jitter, and average power consumption resulting from each case. These results were in line with the results obtained from the test described in the previous section.
CONCLUDING REMARKS
The CB1 techniques introduced in this paper involved two stages:
i) using an interrupt-based sandwich delay to keep the execution time of tasks fixed without requiring significant increases in system power consumption.
ii) calculating the maximum execution time (and required timer settings) for each form of branch / loop structure.
It has been demonstrated (using empirical studies) that:
· The variation in task execution time obtained by using the CB1 code was less than that of the traditional code and higher than that of the single-path code.
· As a consequence of the above, the task jitter levels obtained by using the CB1 code were also lower than those of the traditional code and higher than those of the single-path code.
· The average power consumption obtained by using the CB1 code was less than that of the single-path code and higher than that of the traditional code.
· These single-path and CB1 results were achieved at the expense of an increase in the maximum task execution time.
Further work will be carried out to extend the CB1 techniques.
ACKNOWLEDGMENTS
The authors would like to thank Kam L. Chan for his help in making the power measurements.
