Abstract-This paper presents automatic generation of fast and accurate timed models of streaming embedded applications, before the complete software-hardware platform is available. First, a measurement model is generated and executed, on the target processor, to predict the computation delays in the application. Next, the stochastic delays are annotated to the application code to generate a host-compiled model of the application. Our experiments show that such models can be generated and simulated in seconds to accurately predict the computation load offered by the application. Our results with large streaming media applications, such as music and voice codecs, show that the estimation errors are less than 3.3%, while providing very high simulation speed. Therefore, using our models, embedded system designers can perform early optimizations to the system architecture with high confidence.
I. INTRODUCTION
One of the primary goals of embedded system design is to co-optimize the software-hardware architecture for a given application. Designers typically tune various system parameters such as task priorities, scheduling policies, interrupt rates, type and number of hardware accelerators, and so on, in order to optimize system performance for a given application. Model-based design advocates early system models that are executable and semantically represent the software-hardware architecture. The architectural parameters are modified in each iteration and a new model is generated to measure the impact of design changes. A major modeling challenge is the need for early and accurate performance estimation for a given software-hardware architecture. Designers are particularly concerned with determining the computation load offered by an application, executing on the target processor.
Various methods, such as processor instruction-set simulation, source-level analysis and worst-case execution time (WCET) analysis, are used for performance estimation. Typically, instruction-set simulation is too slow to be used for extensive design space exploration, since each design choice may need to be evaluated by several hours of simulation. Source-level analysis can be fast and accurate, but relies on availability of source code of the entire application, including libraries. WCET analysis is excellent for hard real-time systems, but provides pessimistic timing estimates, which may not be relevant for most embedded applications.
In this paper, we propose a measurement-based stochastic technique for determining the load offered by the application to the processor. We are particularly interested in streaming applications, such as voice codecs and media codecs, that tend to be the most compute intensive embedded applications. Our technique capitalizes on the following two observations:
(1) Typically, the processor core is available on an evaluation board or as a virtual prototype. The design challenge is to create the optimal platform by integrating on-chip buffers, hardware accelerators, and various interfaces with the processor, and porting the system software to the platform.
(2) Most streaming embedded applications are designed as process networks, with little control flow inside the processes themselves. As such, the amount of computation inside a process depends largely on the amount of data being processed, as opposed to the data values themselves.
Before we delve into model generation, we present a motivating example of streaming application optimization. Fig.  1(a) shows the message sequence chart of a typical multitasking streaming application executing on an embedded platform. The application is mapped to a target embedded processor, running a Real-time Operating System (RTOS). The application is designed as a set of user tasks (T 1 , T 2 , …, T n ) that communicate amongst themselves using local buffers (buffer 1 , buffer 2 ,…buffer n ). The input data stream is received from file, network or other I/O, such as microphone or camera, and stored in an on-chip hardware buffer (IN buffer). The output decoded/encoded data is stored in another on-chip hardware buffer (OUT buffer). The OUT buffer data is consumed by the network interface or other I/O, such as speakers or display. An underflow interrupt from the OUT buffer is used to wake up the application tasks. The tasks, then, execute their respective encoding/decoding functions, write into their respective output buffers, and seek more data in their input buffers if needed.
An execution instance of the application in Fig. 1(a) would proceed as follows. The underflow interrupt from the OUT buffer triggers the interrupt service routine (ISR), which copies any existing processed data from buffer 1 into OUT buffer. If there is insufficient data in buffer 1 to fill OUT buffer, the ISR sends a message to T 1 to write more data into buffer 1 . Task T 1 is therefore activated and reads from its input buffer, buffer 2 , processes it and writes the output to buffer 1 . If there is insufficient data in buffer 2 , T 1 sends a message to activate T 2 in order for T 2 to write more data into buffer 2 and so on. Finally, T n is activated to read more raw data from IN buffer.
The sizing of the hardware buffers can have significant consequences on the performance of the system. For instance, if the OUT buffer is very small, it will take a short time to fill it, assuming a constant stream of incoming raw data in the IN buffer. As such, the decoding/encoding delays will be small, leading to good quality of service. However, the OUT buffer data will also be consumed in a short time, leading to frequent underflow interrupts. Every underflow interrupt causes a context switch by the RTOS, and the cache needs time to be warmed up with tasks T 1 to T n . Once the OUT buffer is filled, the RTOS may switch to other user applications or kernel tasks, thereby potentially evicting T 1 to T n from the cache. When the next underflow interrupt arrives, the cache must again be warmed up with tasks T 1 to T n .
On the other hand, if the OUT buffer is very large, it will take longer to fill it, assuming a constant stream of raw data in the IN buffer. As such, the decoding/encoding delays will be larger, leading to poorer quality of service. However, there will be less frequent underflow interrupts, more data processing done by the user tasks per underflow interrupt, and, consequently, fewer context switches. The cache will still need to be warmed up with tasks T 1 to T n after every underflow interrupt, but since the processor will be executing more iterations of the tasks per interrupt, the average cache behavior of the application will improve.
The consequences of buffer sizing are illustrated in Fig. 1(b). We define the output buffer size in terms of the time it takes to play (sink) the data in the buffer. For instance, a buffer size of 20 ms corresponds to the amount of decoded data needed to play back 20 ms of music or voice. In the first trace, the OUT buffer size is set to 20 ms, leading to an average of t ms of task execution per underflow interrupt. Therefore, the computation load offered by the application is t/20. If the OUT buffer size is increased to 40 ms, the computation time per interrupt will increase to t' ms, since the total amount of computation per underflow interrupt will double. However, the improved overall cache performance implies that t' < 2t. As such, the overall computation load offered by the application, t'/40, will be less than t/20, resulting in lower power consumption.
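The load argument above can be checked with simple arithmetic. The following sketch uses hypothetical values for t and t' (the text gives none); only the relation t' < 2t is taken from the discussion, and the function name `cpu_load` is ours.

```python
def cpu_load(busy_ms_per_interrupt: float, buffer_ms: float) -> float:
    """Fraction of time the CPU is busy when one underflow interrupt
    arrives every `buffer_ms` milliseconds of playback."""
    return busy_ms_per_interrupt / buffer_ms

# Hypothetical numbers, chosen only for illustration.
t = 8.0          # ms of task execution per interrupt with a 20 ms buffer
t_prime = 14.0   # ms per interrupt with a 40 ms buffer (t' < 2t)

load_20 = cpu_load(t, 20.0)        # t / 20
load_40 = cpu_load(t_prime, 40.0)  # t' / 40, smaller because t' < 2t
```

Any pair of values with t' < 2t gives the same ordering of the two loads; how much smaller the second load is depends on the actual cache behavior, which is what the simulation model is meant to predict.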
The computation delay of the tasks, per underflow interrupt, may vary greatly with the state of the cache. As such, it is impractical to statically determine the computation delays. Therefore an executable simulation model is needed to predict these delays, and ascertain the average load offered by the application. The system designer can use the predicted load to optimize the buffer size and other parameters for desired performance and quality of service.
In this paper, we present a modeling methodology, and corresponding model generation tools, to aid the designer in making early design decisions, specifically for streaming applications. The contributions of our work are (i) a method for stochastic modeling of task-level timing in streaming applications, and (ii) a method for automatic generation of a host-compiled, timed, system-level model before the RTOS and the complete hardware platform are available.
II. RELATED WORK
There are four fundamental technologies that can be used to estimate resource consumption by embedded software before hardware availability: (i) traffic generators, (ii) instruction-level simulation models, (iii) host-compiled timed transaction-level models (TLMs), and (iv) RTOS models in system-level design languages.
Traffic generators are used very early in the design flow, even before the application source code is available. The goal is to exercise the underlying hardware, or its model, with stochastic execution scenarios that can be expected from the application. Naturally, traffic generators are not an accurate representation of embedded software execution and are of limited use [1] .
Instruction-level simulation models are often used as virtual hardware platforms for software development, before the hardware is available. The accuracy of an instruction level model depends on the abstraction level at which the processor has been modeled. Virtual platforms are often used to develop the system software, such as RTOS and drivers, for a new hardware platform [2] [3] . In order to use the instruction-level virtual platform for estimating embedded software performance, the RTOS must already be available on the virtual platform.
TLMs are typically developed in system-level design languages, such as SystemC, and can be executed on a host machine [4] . There have been several approaches to automatically generate TLMs from a high level description of the target hardware [5] [6] [7] . A model of the RTOS scheduler in SystemC can also be developed [8] [9] [10] [11] . Some RTOS models also incorporate timing delays of kernel calls [12] [13] .
In host-compiled TLMs, timing is added to the application source code, and the annotated application is linked to an RTOS model for simulation [14] [15] [16] [17]. Typically, the timing is annotated at the function or basic-block level [18]. Source-level simulation techniques can be used for accurate instruction and data cache simulation [19] [20]. However, timing annotation in TLMs requires a data model of the processor, which is not always available due to intellectual property concerns. Moreover, the entire source of the application, including libraries, must be available. As such, source-level timing annotation, based on static code analysis, is not always practical. The work presented in this paper builds upon previous work on RTOS modeling in SystemC. We use an executable measurement model to determine timing of application code. Therefore, our model can use the RTOS-targeted application code as is, without requiring the library sources.
III. METHODOLOGY
Fig. 2 illustrates our modeling methodology. Our methodology requires two models, namely the measurement model and the simulation model. Consequently, we have two model generators, as shown. The common inputs to both generators are: (i) the system configuration, which is an abstract representation of the target hardware platform in XML, (ii) the application source code, and (iii) a SystemC model of the target RTOS. The RTOS model implements the scheduling policy of the target RTOS and the inter-task communication primitives in SystemC. We use the RTOS model as described in [21].
The application software is targeted to the RTOS. Our goal is to annotate the application source code with timing delays. The granularity of annotation is chosen to maximize simulation speed without losing accuracy. Following our second observation in Section I, we can use a coarse-grained annotation for streaming applications.
Consider the illustration of the application code in Fig. 2 . The application consists of tasks, where each task performs some computation and calls the RTOS kernel methods (K 1 , K 2 , K 3 , K 4 ) for inter-task communication. During execution, the application task may execute along one of many source paths (P 12 , P 13 , P 14 ) between the kernel calls. In most streaming applications, the delays of primitive computations depend, to a much larger degree, on the data size rather than on the data value. Moreover, the primitives usually operate on data frames of fixed sizes. Since the computation along a source path consists of such primitives, the path delay is largely independent of the data values. Therefore, we treat the inter-kernel-call source path as an atomic computation block for timing annotation.
The Measurement Model Generator produces a SystemC model of the application, with annotated hooks to measure and log the execution times of each block in the application. The application model is linked with the RTOS model to produce a binary that can be executed on a base operating system (OS), such as Linux, running on the target processor. After execution, a log of block delays, over several iterations, is obtained. Due to a variety of environmental factors such as processor state, cache state or scheduling in the base OS, the computation delay for a single block may vary across iterations. Hence, we generate a probability mass function (PMF) of delays for each block, to stochastically account for such variation.
In order to generate the simulation model, the stochastic delays are annotated to the application code for each block, and the model is linked with the SystemC model of the target RTOS. The simulation model is then executed on a host machine to obtain estimated CPU load offered by the application.
IV. MEASUREMENT MODEL GENERATION
The execution of the inter-kernel computation forms the bulk of the overall computation load offered by the application. For modeling purposes, we treat the inter-kernel computation as an atomic block. Therefore, we are interested in determining the total block delay, which will be used to model CPU resource consumption by the block, in the simulation model. Hence, in the measurement model, we are interested in executing the inter-kernel block without any interruptions, and measuring its delay. SystemC uses a non-preemptive simulation kernel for scheduling its tasks. As such, the application tasks, modeled as SystemC threads, will execute without preemption until they explicitly call a SystemC wait statement. We use this property of SystemC to determine the block delays. This section describes the semantics and structure of the measurement model. Fig. 3 shows an example of the execution of a measurement model. The application consists of two tasks, t 1 and t 2 , implemented as SystemC threads. We assume that t 2 has a higher priority than t 1 . The application is mapped to a processor (CPU), which is implemented as a SystemC module containing the threads. A Buffer module models an on-chip hardware buffer that generates underflow interrupts to the processor. A SystemC method, sensitive to the interrupt signal (int), is used to model the interrupt service routine (ISR). The tasks communicate amongst themselves, and with the ISR, using the communication primitives of the RTOS model.
A. Annotation of Measurement Code
There are two notions of time in the measurement model: the real time, which corresponds to the wall clock time and is maintained by a free-running hardware counter on the target processor; and the logical time of the SystemC kernel. The SystemC time is advanced only by wait statements in the SystemC model of the hardware, which explicitly models the delays between hardware interrupts. Fig. 3 illustrates the progress of both real and SystemC time during the execution of the measurement model.
The order of execution is as follows. Both real and SystemC times are assumed to be 0 at the reference starting point. We also assume that a task blocks on receiving a pulse or message.
(1) t 2 calls RecvPulse(), which consumes Δ 0 units of real time and is suspended, waiting for pulse from the ISR.
(2) t 1 executes block A, which consumes 10 units of real time, and calls Recv(t 2 ), thereby suspending on message from t 2 .
(3) Concurrently, the Buffer models the consumption of buffer data by calling wait for 5 time units. Therefore, the SystemC time is advanced by 5 units.
(4) The subsequent underflow interrupt is modeled by setting the interrupt signal, which activates the ISR task.
(5) The SystemC kernel switches context to the ISR method.
(6) ISR sends a pulse to t 2 .
(7) t 2 is unblocked, and executes block B, which consumes 5 units of real time.
(8) t 2 sends a message to t 1 and exits.
(9) t 1 resumes execution.
The correct simulation of the above scenario would model consumption of 10 units of SystemC time by t 1 for block A and 5 units of SystemC time by t 2 for block B. In order to compute the above delays, we must annotate code around blocks A and B to check the timer and log the delays. To measure block delays, we have automated the annotation of the application code to identify inter-kernel blocks. Fig. 4 illustrates the transformation applied to a sample source code to identify and measure its blocks. The if condition in Fig. 4(a) may result in execution of block (A, B) between kernel calls K 1 and K 2 , or the block (A, C) between kernel calls K 1 and K 3 . The annotated code in Fig. 4(b) is used to determine the executed block as well as the delay associated with the block.
Fig. 4: Annotation of measurement code
The measurement model generator parses the application code and assigns a unique identifier to RTOS kernel calls. As shown in Fig. 4(b), the generator introduces a variable begin, and assigns the kernel call identifier to it, after the call. In this example, the kernel call identifier would be K 1 . It is important to note that K 1 , K 2 , and K 3 are unique kernel-call identifiers; they may or may not be the same kernel function. The model generator adds code to start the time measurement of the block by starting the timer.
The generator also adds code before each kernel call to stop the timer and log the measured time corresponding to the executed block. The block is easily identified, since the begin variable holds the starting kernel call identifier of the block. The block delay is returned by the function timer_val(). The logged delays for each block are used to compute the probability mass functions to be used by the Simulation Model Generator.
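The annotation scheme described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the generated SystemC code: the helper names `start_block` and `stop_block`, the dictionary `block_log`, and the use of the host's `time.perf_counter_ns` in place of the target's hardware timer and timer_val() are all ours, and the kernel calls are reduced to comments.

```python
import time
from collections import defaultdict

block_log = defaultdict(list)   # (begin_id, end_id) -> logged delays in ns
_begin = None                   # identifier of the kernel call that opened the block
_t0 = 0                         # timer reading taken when the block started

def start_block(kernel_id):
    """Inserted after a kernel call returns: record its identifier, start the timer."""
    global _begin, _t0
    _begin = kernel_id
    _t0 = time.perf_counter_ns()

def stop_block(kernel_id):
    """Inserted before a kernel call: stop the timer and log the block delay."""
    if _begin is not None:
        block_log[(_begin, kernel_id)].append(time.perf_counter_ns() - _t0)

def task(flag):
    # K1() would be called here (placeholder for an RTOS kernel call)
    start_block("K1")
    work = sum(range(1000))        # block A
    if flag:
        work += sum(range(500))    # block B
        stop_block("K2")
        # K2() would be called here
    else:
        work -= sum(range(500))    # block C
        stop_block("K3")
        # K3() would be called here
    return work

for i in range(4):
    task(i % 2 == 0)
```

After the loop, `block_log` holds two delays for the block (K1, K2) and two for the block (K1, K3), which mirrors how the executed source path is recovered from the begin identifier and the identifier of the kernel call that closes the block.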
B. Probability Mass Function Generation
Once the measurements of the blocks are logged, the data are filtered and sorted. The execution times of the logged data for each block can vary due to environmental factors such as the behaviors of the cache, the scheduling, and the DRAM refresh rates. Execution of a block over several iterations produces very large logs of block delays. Fig. 5 shows the distribution of delays for an inter-kernel block in one of our example applications. In order to meaningfully use the delay information, we generate a Probability Mass Function (PMF) of delays for each block.
Not all logged delays are of interest. Recall that the measurement model is executing on a base OS, such as Linux. The base OS may preempt a task in the measurement model to perform its own kernel tasks or run other applications. While these scenarios are relatively rare, they produce extremely large block delays, since they include the time for the base OS to complete the preempting task and resume the task being measured. The resulting large delays can skew the distribution and should be treated as outliers.
Since it is impractical to determine exactly which delays resulted from an interruption by the base OS, we use a simple heuristic to filter the delays. We determine the minimum and median delay and consider only those delays that are separated by at most (median - minimum) from the median. In other words, we consider only the delays that are less than (2 * median - minimum), because the performance of the application is only of interest at a steady state with a stable cache. The rationale behind the heuristic is that the best-case and worst-case cache behaviors for a given block execution are likely to be equidistant from the median. It is expected that the minimum value corresponds to the best-case cache behavior, while the median value corresponds to the average cache behavior. The next step is to sort the filtered data into different sets. These sets are split evenly among the range of filtered block delays. The number of measured data points within each set's range divided by the total number of filtered data points determines the probability of the set. Each set's median delay is assigned as the representative value for the set. Fig. 5 shows how the measured data can be filtered and split among 10 sets. The probability of each set corresponds to a relative median value within its set (m 0 , m 1 , m 2 , …), as shown in Fig. 6 . The PMF Generator calculates the PMF of every block in the application.
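The filtering and binning steps above can be summarized in a short sketch. It is one possible reading of the heuristic rather than the authors' PMF Generator: the function name `build_pmf`, the inclusive cutoff, the handling of empty sets, and the sample data are assumptions made for illustration.

```python
from statistics import median

def build_pmf(delays, num_sets=10):
    """Return a list of (representative_delay, probability) pairs for one block."""
    lo = min(delays)
    med = median(delays)
    cutoff = 2 * med - lo                 # delays beyond this are treated as outliers
    # Assumption: the cutoff is inclusive ("at most (median - minimum) from the median").
    kept = sorted(d for d in delays if d <= cutoff)
    hi = kept[-1]
    # Split the filtered range [lo, hi] evenly; fall back to a unit width
    # when every kept delay is identical.
    width = (hi - lo) / num_sets or 1.0
    sets = [[] for _ in range(num_sets)]
    for d in kept:
        idx = min(int((d - lo) / width), num_sets - 1)
        sets[idx].append(d)
    # Each non-empty set is represented by its median; its probability is its
    # share of the filtered data points.
    return [(median(s), len(s) / len(kept)) for s in sets if s]

# Hypothetical log: ten steady-state delays plus one preempted measurement.
sample = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 500]
pmf = build_pmf(sample)
```

In this sample the median is 15 and the minimum is 10, so the cutoff is 20 and the 500-unit measurement is discarded before the remaining delays are divided into sets.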
V. SIMULATION MODEL GENERATOR
The Simulation Model Generator takes the generated block PMFs, along with the system configuration, application code and the RTOS model as inputs. It generates a SystemC model that can be executed on a host machine to estimate the CPU load offered by the application. The application software is reannotated for inter-kernel identification and for applying a SystemC time consumption function, provided by the RTOS model, to the identified blocks. The blocks' PMF delays are used by this consume function to model the CPU time consumption. Figure 7 illustrates the annotation of the application code during simulation model generation. The simulation model generator parses the application code to identify kernel calls. It introduces a variable begin and assigns the identifier of the kernel call to it. The simulation model generator also adds code before each kernel call to consume the time for the source path leading to that kernel call. For instance, if the then path is taken, the delay for the block (A, B) must be consumed. Conversely, if the else path is taken, the delay for the block (A, C) must be consumed. Furthermore, the delays are consumed stochastically, based on the PMF of the respective block.
In order to model the appropriate delay consumption, the simulation model generator introduces a probability variable p, as shown in Figure 7 . Before each kernel call, p is assigned a random real value between 0 and 1, by calling the rand function. Now, p represents the probability with which we will consume a given delay from the block's PMF. To obtain the actual delay, the range [0, 1] is divided into multiple bins. The number of bins is the same as the number of sets into which the raw measured delays for the block are divided (see Fig. 5 ). The size of a bin corresponds to the probability of the delay in the PMF. For instance, for the given PMF in Fig. 6 , we have a total of 10 bins. The size of the bin for m 3 is 0.262, while that for m 5 is 0.1. Over multiple iterations of the block, the value of p is expected to be uniformly distributed across the range [0, 1]. As such, over multiple iterations, the delay m 3 will be returned by the getDelay function with a probability of 0.262, for the example in Fig. 6 . Finally, the simulation generator adds code to apply the obtained time delay by calling the RTOS model's consume function.
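The bin lookup described above amounts to inverse sampling from a discrete distribution. The sketch below shows one way to write it; `get_delay` is a Python stand-in for the getDelay function named in the text, and the three-entry PMF is hypothetical rather than the one in Fig. 6.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def get_delay(pmf, p):
    """Map p in [0, 1) to a delay: bin i covers a slice of [0, 1] whose
    width equals the probability of delay i."""
    edges = list(accumulate(prob for _, prob in pmf))  # cumulative upper bin edges
    i = min(bisect_right(edges, p), len(pmf) - 1)      # clamp against rounding near 1
    return pmf[i][0]

# Hypothetical PMF as (delay, probability) pairs.
pmf = [(100, 0.2), (120, 0.5), (150, 0.3)]

rng = random.Random(0)   # fixed seed so the run is repeatable
samples = [get_delay(pmf, rng.random()) for _ in range(10_000)]
share_120 = samples.count(120) / len(samples)   # approaches 0.5 as samples grow
```

Because p is uniform on [0, 1), the fraction of draws that land in a bin converges to the bin width, which is why each delay is returned with the probability recorded in the PMF.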
From the example in Fig. 3 in Section IV.A, recall that blocks A and B were measured to take 10 and 5 units of physical time respectively. In the generated simulation model, the blocks consume the measured time, and hence advance the logical time by calling SystemC wait statements.
The execution of the simulation model is demonstrated in Figure 8 . At SystemC time 0, task t 2 is suspended, waiting for a pulse from the ISR. Therefore, the RTOS model schedules t 1 , which starts executing block A. However, at time 5, the interrupt signal is set by the buffer module, thereby triggering the ISR, which calls the SendPulse method of the RTOS model to activate t 2 . As a result, the RTOS model changes the state of t 2 to ready, and reschedules the tasks. Since t 2 has a higher priority than t 1 , t 1 is preempted after executing only 5 units of time of A (represented by sub-block α 0 ). Task t 2 executes block B, consumes 5 units of time to model its delay, and terminates after sending a message to t 1 . At time 10, t 1 resumes and consumes the remaining 5 units of time of A (represented by sub-block α 1 ).
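The trace above can be reproduced with a small hand-written sketch of the two-task scenario. It hard-codes the priority order and the single interrupt described in the text; it is not the RTOS model of [21], and the function name `simulate` and the segment labels are hypothetical.

```python
def simulate(block_a=10, block_b=5, interrupt_at=5):
    """Return (start, end, label) segments of logical time for the two-task example."""
    trace, now, remaining_a = [], 0, block_a
    # t1 (low priority) consumes block A until the interrupt fires or A completes.
    run = min(remaining_a, interrupt_at - now)
    trace.append((now, now + run, "t1:A"))
    remaining_a -= run
    now = max(now + run, interrupt_at)   # CPU idles if A finished before the interrupt
    # The ISR pulse makes t2 ready; it has higher priority, so it preempts t1.
    trace.append((now, now + block_b, "t2:B"))
    now += block_b
    # t2 has terminated; t1 resumes and consumes whatever is left of A.
    if remaining_a > 0:
        trace.append((now, now + remaining_a, "t1:A"))
    return trace

segments = simulate()
```

With the default arguments the segments are (0, 5) for the first part of A, (5, 10) for B, and (10, 15) for the remainder of A, matching the sub-blocks α 0 and α 1 in the description.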
The simulation model uses the TotalBusyTime variable in the RTOS model to estimate the total time during which the CPU is busy. The counter simply aggregates all the consumed times for all the tasks during simulation. The busy time excludes any time during which all the tasks are suspended, waiting for external hardware interrupts. The total simulated (SystemC) time at the end of simulation model execution is the sum of the estimated total busy time and total idle time. Hence, the overall computation load offered by the application to the CPU is simply the total busy time divided by the total simulated time.
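As a worked illustration of the load computation just described, the following sketch aggregates consumed times and divides by the total simulated time. The function name and the 20-unit simulated time are assumptions made for the example; only the 10-unit and 5-unit block delays come from the running example.

```python
def cpu_load(consumed_times, total_simulated_time):
    """Busy time is the sum of all consumed block delays; load is busy / total."""
    total_busy_time = sum(consumed_times)
    return total_busy_time / total_simulated_time

# Blocks A and B consume 10 and 5 units; assume the simulation ran for 20 units,
# so the CPU was idle (all tasks suspended) for the remaining 5 units.
load = cpu_load([10, 5], 20)
```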
VI. EXPERIMENTAL RESULTS
To demonstrate our model generation methodology, we consider a Smartphone application of MP3 and Voice encoding/decoding. The target platform is QNX RTOS [22] , running on a 600 MHz Geode LX embedded processor [23] . The application software, the RTOS model, and the system configuration file are used to generate the measurement and simulation models. We use a SystemC model of the QNX RTOS [21] . In order to quantify the quality of our methodology, we measured the model generation times and the model accuracy for the Smartphone application.
A. MP3 and Vocoder Case Study
Our Smartphone case study runs MP3 playback concurrently with a voice call. The caller wants to play an MP3 clip for the callee, while hearing it on his/her own handset. The audio from the MP3 file must be decoded and mixed with the audio from the phone call at both ends, so that they can sing along or make comments to each other while the music is playing. The application is designed using multiple tasks, with an architecture similar to that shown in Fig. 1(a).
For the first experiment, referred to as MP3, we disable the Vocoder to simulate only MP3 playback on the phone. As shown in TABLE I, MP3 has four tasks. The MP3 data is fetched from a file, and the decoded data is written to an on-chip serial buffer. The buffered data is played on the handset speaker. In the second experiment, MP3 + Vocoder, we enable concurrent MP3 playback and voice encoding/decoding. The encoded voice data of the callee is fetched from the network buffers and decoded. The decoded voice is mixed with the decoded MP3 and written into the on-chip serial buffer.
B. Model Generation & Execution Speed
TABLE I shows the generation time for the annotation of the models. The number of tasks and inter-kernel blocks identified indicates the complexity of the application. The generated lines of code include the annotated code, for both measurement and simulation, as well as the SystemC model of the platform. The models were generated and executed in under one second on an Intel i3 host machine running at 3.20 GHz.
Another important quality metric for our methodology is the execution speed of the generated models. Clearly, the model execution speed depends on the complexity of the application, the target and host platforms, and the length of time we want to measure or simulate. The generated measurement models are executed on the target processor (Geode LX). The generated simulation models are executed on the host machine (Intel i3). TABLE II shows the execution times for the MP3 measurement and simulation models over different simulated times. The serial buffer size is set to 32480 bytes for the MP3 design, which corresponds to 184.127 milliseconds of decoded stereo audio data at 44.1 kHz. As expected, the measurement model runs slower than the simulation model since the target platform has a less powerful processor than the host machine.
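The 184.127 ms figure quoted for the 32480-byte serial buffer can be reproduced with the arithmetic below. The sample width is not stated in the text; the sketch assumes 16-bit (2-byte) samples, which is the value that makes the numbers agree, and the function name is ours.

```python
def buffer_duration_ms(size_bytes, sample_rate_hz=44_100, channels=2, bytes_per_sample=2):
    """Playback time held by a PCM buffer of the given size, in milliseconds."""
    # bytes_per_sample=2 is an assumption (16-bit audio); the paper states only
    # the buffer size, the stereo format, and the 44.1 kHz sample rate.
    bytes_per_second = sample_rate_hz * channels * bytes_per_sample
    return 1000.0 * size_bytes / bytes_per_second

duration = buffer_duration_ms(32480)   # about 184.127 ms
```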
TABLE III shows the execution times for MP3 + Vocoder models, with serial buffer sizes of 20, 40, 60, 80, and 100 milliseconds, over 3 minutes of simulation time. As shown, the run-time of the measurement model decreases as the buffer size increases. As in the MP3 model, the execution time of the application tasks dominates the overall run-time of the measurement model. The trend seen here indicates that a larger buffer size results in increased performance due to better cache behavior. There are more cache hits because, whenever the ISR asks for more data for the larger serial buffer sizes, the processing loops in the tasks are repeated more often to fill the buffer. As a result, the instructions of the tasks are found in the cache more often, due to temporal locality, which reduces the time needed to fetch them.
It must be noted, however, that the measurement model's run time consists of several other overheads, such as the SystemC library calls, the kernel activities of the base OS and so on. The actual block delays are recorded in the logs and the corresponding PMFs. The simulation model speed is also very high as seen. Overall, we are able to generate models and predict CPU load for a given buffer size, and given simulated time, in the order of minutes.
C. Accuracy
The most important quality metric for our methodology is the accuracy of the predicted CPU load. TABLE IV shows the comparison of the predicted and actual CPU loads for the MP3 and MP3 + Vocoder applications. The estimated CPU load is obtained from the simulation model by dividing the total busy time by the simulated time. The actual CPU load is obtained from the time kernel call in QNX, which gives the busy time for the application during execution on target.
Our measurements are prone to a few errors. For instance, the cache behavior for the measurement model and the reference design might have some inconsistencies, since the measurement model runs in the context of the SystemC kernel and the base OS. The reference design runs the application on the target OS. Moreover, the state of the processor may be different while executing the same block in the measurement model and in the reference design. The block delay modeling itself may introduce errors, due to its stochastic nature. However, these errors do not have a significant impact, as shown in TABLE IV. The MP3 model has an error of only 0.8%. On the other hand, the MP3 + Vocoder models have a maximum error of only 3.33% across various buffer sizes. Therefore, our model generators can be used for accurate performance predictions.
Since the MP3 + Vocoder offers a very high CPU load (up to 50%), it is a critical application to be optimized. As we can see, increasing the buffer size significantly reduces the load up to a certain point, beyond which there are diminishing returns on efficiency. Increasing the buffer size for this case study also increases the delay in the callee's speech to be decoded and played on the caller's speaker. Therefore, the larger the buffer size the poorer the quality of service.
Based on the high accuracy of our simulation models, embedded system designers can investigate trade-offs between quality of service and CPU load for different buffer sizes. For instance, a buffer size of 60 ms provides an acceptable quality of service for a tolerable CPU load in the actual design, as confirmed by the model. If, however, quality of service is of primary concern, a buffer size of 20 ms can be chosen. In that case, it may be advisable to move some of the compute-intensive functions of the Vocoder to dedicated hardware accelerators. The 50.78% CPU load predicted by the model can be used to guide the exploration of hardware-accelerated platforms. As seen from the above, our modeling methodology supports high-speed and reliable design space exploration before the complete software-hardware platform is available.
VII. CONCLUSION
In this paper, we described a methodology and tools to generate fast and accurate simulation models for streaming applications before the hardware-software architecture is finalized. Our results show that the model generation is very fast and the estimated performance is accurate compared to the target platform. The accuracy of our models indicates that early analysis and optimization of embedded systems can be done, with high confidence, before the target hardware and the system software are available. In the future, we plan to apply our modeling techniques to multi-core platforms, as well as validate our methodology with more diverse RTOSes. We also plan to implement measurement at basic-block granularity to apply our methodology to control-dominated applications as well.
