Abstract: This paper describes a computer-aided software engineering (CASE) tool that helps designers analyze and fine-tune the timing properties of their embedded real-time software. Existing CASE tools focus on the software specification and design of embedded systems. However, they provide little, if any, support after the software has been implemented. Even if the developer used a CASE tool to design the system, their system most likely does not meet the specifications on the first try. This paper includes guidelines for implementing analyzable code, profiling a real-time system, filtering and extracting measured data, analyzing the data, and interactively predicting the effect of changes to the real-time system. The tool is a necessary first step towards automating the debugging and fine tuning of an embedded system's temporal properties.
INTRODUCTION
An extremely time-consuming task of producing an embedded real-time system is analyzing and fine-tuning the system's timing. Existing computer-aided software engineering (CASE) tools focus on the software specification and design of embedded systems. They provide little, if any, support after the software has been implemented. Even if the developer uses a CASE tool to design their system, it most likely does not meet the timing specifications on the first try. This happens because the CASE tool's software design and real-time analysis are based only on estimated data and idealized models. The tools do not take into account practical concerns such as operating system overhead, interrupt handling, limitations of the programming language or processor, inaccuracies in estimating worst-case execution time of each process, and software errors introduced at the implementation phase by the programmers.
While symbolic debuggers provide help in identifying functional problems, they do not provide any support for identifying timing problems in a system. Furthermore, they only help an engineer track down a problem. The tools do not provide any advice on how to fix a problem once it has been detected.
Performance monitoring tools, also called profiling tools, are the counterparts to symbolic debuggers for the time domain. They allow developers to obtain raw data from the underlying embedded system in real-time. Like the symbolic debuggers, these tools provide most, if not all, of the data needed to pinpoint the problem. Such data, however, is not provided in a symbolic fashion and, thus, could be very difficult to understand. The monitors only show what happened during runtime, without correlating those results to the original specifications. Performance monitors also do not perform any analysis on the data that is collected. As a result, there is no means to easily differentiate between parts of the execution that are "normal" versus those parts that have difficult-to-detect timing errors. Only an expert's eye can spot the differences after substantial effort.
We are investigating and developing tools that can be used to help embedded system designers analyze, debug, and fine tune the timing characteristics of their applications. Such a tool can have a major impact on embedded system development by allowing designers whose expertise is in an area other than real-time system analysis (such as communications, controls, or hardware design) to obtain valuable information on how to fix their code that is not performing according to specifications.
In this paper, we describe a prototype tool and provide results of experiments that show the potential accuracy of such a tool. The key contributions of this work are the following:
• Combining the real-time scheduling theory of various research groups, and extending the theory to use measured data and analytical models that more closely represent real implementations of embedded systems.
• Implementing a first prototype of an analysis and interactive predictor tool that assists in fine tuning embedded real-time systems after implementation.
Results from the initial experimental validation of the prototype are very promising. They show that there is a strong correlation between results from the analysis unit and results of executing a synthetic workload. We also found that, by using both the average and worst-case measured execution times as input to the analysis unit, we obtain a lower bound and upper bound on the amount of allowable variance of the CPU utilization. The research reported in this paper is the beginning of a new class of debugging and trade-off analysis tools for embedded designers.
RELATED WORK
Katcher et al. were among the first to introduce a methodology for incorporating the costs of scheduler implementations for fixed priority scheduling algorithms in a uniprocessor environment [7], [8]. Jeffay and Stone also provided extensions for scheduling equations using dynamic priority scheduling, specifically for deadline driven environments [5], [6]. Their extensions build upon some of their earlier works on nonpreemptive scheduling of periodic and sporadic tasks, as well as some of the work done by Liu and Layland on dynamic priority scheduling [14].
Kettler et al. later created a framework to account for implementation costs in real-time system theory [10], but this is more of a general modeling methodology that would allow embedded system developers to accurately model and evaluate their real-time operating system. Since general terms were created in their model, the authors admit that a high degree of expertise is required to create a valid model for any given real-time operating system; therefore, they suggested two additional steps to overcome the expertise barrier. These steps were the development of basic implementation cost components based on the operating system structural properties that affect cost, and the development of a database of real-time model properties. The tool described in this paper addresses some of these issues.
There already exist CASE tools that incorporate such analysis. The first few research tools included PERTS [15], CAISARTS [4], and Scheduler 1-2-3 [28]. Unlike these research tools, the work described in this paper targets analysis of real-time properties after implementation using measured data, not analysis during design using estimated data. There are now several commercial tools that perform actual measurements of timing data, including WindView by WindRiver Systems [29], TimeTrace by TimeSys Corporation [27], and CodeTEST by Applied Microsystems Corporation [1]. These tools are often categorized as performance monitoring tools. They can be either hardware or software based. In some cases, a hybrid of the approaches can be utilized.
These performance monitoring tools are the most likely candidates for incorporating research described in this paper. Currently, however, they show exactly what happened during runtime, but they do not correlate those results to the original specifications. The unique goals that distinguish our research from other related work and today's commercial products are the following:
1. Our work combines two types of real-time tools (performance monitors and system analyzers) into a single integrated tool.
2. We integrate novel prediction features that forecast the effect of small fine-tuning adjustments to the code. This helps designers quickly identify the best techniques for fixing the timing problems.
3. The analytical models are revised to better reflect realities of the target hardware platform by incorporating measurable operating system overhead, modeling interrupt handlers, and by giving a more realistic model of the execution time breakdown of aperiodic servers.
4. The user interface is targeted towards system developers with little real-time scheduling theory knowledge. In contrast, the displays from current profiling tools provide output that mimics a logic analyzer and, thus, targets expert users who are familiar with the fine-grain timing details.
The ultimate objective is to automate the data collection and analysis, so that the tool presents both the problems and the solutions to fix the problems. At best, today's tools can only provide a view of a system's timing, and leave the identification and solving of the problems to the designers.
OVERVIEW OF AFTER
As a first step towards meeting the goal to fully automate the analysis, debugging, and fine tuning of embedded systems, a prototype of a tool that we call AFTER (Assist in Fine Tuning Embedded Real-time systems) was developed. It incorporates both static and dynamic scheduling models, and support for periodic threads, aperiodic servers, and interrupts. These represent a sufficient set of features to demonstrate the concepts of automated analysis and prediction.
A diagram of AFTER is shown in Fig. 1. Numbers in this text (e.g., <1>) correspond to the numbers in the diagram and are used to help explain the process.
First, software is implemented using the developer's favorite environment (<1>). Automated analysis requires that software be analyzable. We define analyzable software as code that has definitive starting and stopping points, with timing specifications associated with the starting points (e.g., periods), and constraints associated with the stopping points (e.g., deadlines). These points serve as instrumentation points for time measurements. This restriction should not be limiting, as it translates into defining tasks using good structured design, rather than having arbitrary entry and exit points. A discussion of analyzable code is given in Section 4.1.
Next, the code is executed (<2>), and a profiling mechanism captures the real-time behavior of the system. Ideally, the real-time operating system (RTOS) provides the necessary instrumentation for collecting this timing data. Examples of such profiling mechanisms include WindView [29] and Chimera's ATP [22]. Alternatively, if such a feature is not available, hardware such as a logic analyzer or in-circuit emulator can be used. The data is uploaded from the profiling tool or equipment to the workstation that executes the AFTER tool. Profiling techniques that enable accurate analysis are discussed in Section 4.2.
The filter unit (<3>) preprocesses the raw data needed to represent the temporal state of the target system. The filter is specific to the profiling method, so that changing monitoring techniques requires only switching the filter module. The filter converts the data, whose format depends on the profiling tool, to a hardware-independent intermediate format that consists of a detailed list of events describing the real-time behavior of the system. An example of a filter unit used in our experimental setup is given in Section 4.3.
The extraction unit reads the output of the filter and correlates this measured data with the real-time specifications of the system (<4>). For example, it can determine the minimum, average, and worst-case execution time of each task, measured frequency of tasks, minimum interarrival time of interrupts, or compute the number of missed deadlines for any particular task. The extractor unit may consist of multiple modules, each of which can produce a different kind of information to the analysis unit, depending on the type of analysis to be performed. For example, one extractor module may only extract worst-case execution times, while a second module may extract the entire timing history of a specific task, and a third module produces the scheduling time-line immediately after a user-specified trigger operation. Since the function of these modules for the extractor is similar to WindView, we do not discuss the details of each module. Rather, in Section 4.4, we only discuss the issues of the extractor that relate specifically to providing accurate input data for the analysis and predictor units.
The results of the extractor are then displayed to the user. The display is designed to be intuitive by presenting results in a table that is similar to the original real-time specifications, except that all data included has been collected from the real system. The developer can easily understand the data and compare it to the original system specifications. The results are also used as input for the analysis unit.
The analysis unit (<5>) is the core of AFTER, and operates in two modes: analysis mode and predictor mode. The input to the unit is the data produced by the extractor modules (<4>) and premeasured overhead parameters that are specific to the target platform and RTOS (<6>). In the analysis mode, the unit only uses real data collected from the embedded system and performs a schedulability analysis using a mathematical model consistent with the scheduling algorithm and RTOS used in the embedded system. The results of the analysis, as well as any actual problems, such as tasks missing deadlines and processes using more CPU time than their allocated worst-case execution time, are reported to the user. The analysis unit is a continually evolving mathematical model of the real-time system, as described in Section 4.5.
If the analysis detects problems (<7>), then the analysis unit enters the predictor mode (<8>). Prediction queries are specified through a graphical interface, as shown in Fig. 2. For example, if the developer wants to predict the effect of switching from rate monotonic to earliest-deadline-first scheduling, the "Preemptive EDF" button can be selected. As another example, the designer can determine the effect of optimizing one or more of the threads, by reducing the execution time used by the thread.
The analysis unit then recomputes schedulability, but instead of using only measured data, it either uses a different analytical model, or merges the measured data with the user's proposed modified data (<10>). It then predicts what timing errors, if any, may occur if the change were actually implemented. If the analysis unit indicates potential improvement (<7>), then the designer can modify the code, and repeat steps <1> through <5>. If, however, the proposed change would have a negative or null effect on the final system, the designer does not need to spend any time implementing the change, and instead can try alternate adjustments by repeating the prediction procedure (<8>). Thus, the developer only performs a full compile/download/execute iteration for changes that are likely to improve the system. Predictor capabilities are described in Section 4.6.
If the analysis does not find any problems (<7>), then, from a temporal perspective, the system meets its specifications. If the results of the analysis used only raw data collected from the application (<9>) and did not involve any prediction queries (<8>), then no further work needs to be done and, from a system timing perspective, the product can be shipped. Even if no problems are detected, the predictor mode of the tool can be used during the maintenance phase of the software life cycle, to predict the effects of modifying the application, such as adding a new task or adding code, thus increasing the CPU utilization.
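To make the idea of a prediction query concrete, the following minimal sketch recomputes schedulability for two hypothetical queries: switching from rate monotonic to preemptive EDF, and reducing the execution time of some tasks. It is not the model used by AFTER. It assumes the classic utilization tests (the Liu and Layland bound for rate monotonic and U ≤ 1 for EDF) with deadlines equal to periods, and it ignores operating system overhead, interrupts, and aperiodic servers. The task sets and all function names are illustrative assumptions.

```python
# Sketch only: classic utilization-based tests, ignoring RTOS overhead,
# interrupts, and aperiodic servers (which AFTER's own models account for).

def utilization(tasks):
    """Total CPU utilization of (execution_time, period) pairs in the same unit."""
    return sum(c / t for c, t in tasks)

def rm_bound(n):
    """Liu and Layland sufficient utilization bound for n rate-monotonic tasks."""
    return n * (2 ** (1.0 / n) - 1)

def guaranteed_schedulable(tasks, policy):
    """True if the utilization test guarantees all deadlines (deadline = period)."""
    u = utilization(tasks)
    if policy == "EDF":
        return u <= 1.0                      # exact test for preemptive EDF
    if policy == "RM":
        return u <= rm_bound(len(tasks))     # sufficient, but not necessary
    raise ValueError("unknown policy: " + policy)

# Hypothetical measured worst-case execution times and periods, in msec.
measured = [(1, 5), (3, 10), (6, 30), (20, 100)]
# Hypothetical prediction query: two of the tasks are optimized.
optimized = [(1, 5), (2, 10), (6, 30), (10, 100)]

for name, tasks in (("measured", measured), ("optimized", optimized)):
    for policy in ("RM", "EDF"):
        print(name, policy, guaranteed_schedulable(tasks, policy))
```

Because the rate monotonic bound is only sufficient, a False result for "RM" means the test cannot guarantee schedulability, not that deadlines will necessarily be missed.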
As elaborated upon in Section 7.1, the analysis and prediction capabilities are an essential first step towards automating the fine tuning and debugging of temporal properties of a real-time system.
DESIGN OF A TOOL FOR ANALYSIS AND PREDICTION
The design of a tool that enables automated analysis of real measured data and interactive prediction to determine the effect of fine-tuning operations requires the steps that were shown in Fig. 1. Each step has been designed and implemented as a separate unit. The functions, issues we encountered, and our solutions for each unit are described in the remainder of this section.
Implementation
It would be nearly impossible to analyze the real-time properties of any arbitrarily designed application. From experience, we have discovered that some applications are simply not analyzable because the code is not written as a distinct set of tasks with clearly identified starting and ending points of each cycle. We do not consider such designs. Rather, code must be structured with a single entry and single exit point for each iteration of a periodic task or aperiodic server. This restriction should not be limiting, as it translates into defining tasks using good structured design, rather than having arbitrary entry and exit points. One exception, though, is exiting due to error conditions. Error and exception handling is a topic beyond the scope of this paper, although the topic has been partially addressed in [11], [12]. In this research, we propose that exception handling be treated as a separate mode, in which case real-time analysis would be performed specifically for that error mode of operation. If an error occurs, a task ends its normal execution and reconfiguration into an error mode occurs, so that each task still maintains the single entry/exit point criterion. In our experiments, the port-based object model [19], [20] is used to build analyzable tasks. We assume that an application is written as a set of concurrent tasks. The implementation should also have a clear separation between the mechanisms that provide the multitasking and the application that consists of the set of tasks. It does not matter if the multitasking support is nonpreemptive using a real-time multirate executive, or preemptive using a commercial RTOS. What is important is that the mechanism is distinct from the application tasks. Typically, this is not a problem for applications that use a commercial RTOS or executive. The problem does surface with implementations that do not use an RTOS, where the timing, scheduling, and application code are often intermixed.
An argument against mixing the mechanism and application code is that automated analysis is difficult or not possible. We only consider the analysis of code that has a clear separation between the multitasking mechanisms and the application tasks.
Tasks can be either periodic or aperiodic. An aperiodic task is typically referred to as an aperiodic server [17]. The basic model of both periodic tasks and aperiodic servers is shown in Fig. 3. The task executes one iteration of the loop per period (if periodic) or event (if aperiodic). At the beginning of the loop, any input needed is read, either as interprocess communication from another task or as I/O from a device. Computation is performed, then a new output is produced. In order to obtain accurate scheduling and no clock skew, the task should then block at the end of its cycle using a DELAY UNTIL specified time mechanism. Such a mechanism was incorporated into the Chimera II RTOS as the pause() routine [24]. Unfortunately, many commercial RTOSs do not provide such a function and instead only provide a DELAY FOR specified amount of time mechanism, such as the sleep() routine. A discussion of the clock skew problems associated with using the DELAY FOR instead of the DELAY UNTIL mechanism can be found in [30].
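The difference between the two blocking mechanisms can be illustrated with a small simulation. This is a simplified sketch, not code from Chimera or any commercial RTOS: it assumes a task that naively delays for one full period after finishing its computation, with a constant execution time, and compares the resulting release times with those of a task that delays until an absolute release time. The function names and the numeric values are hypothetical.

```python
# Simulated release times (in microseconds) of a periodic task.
# No real sleeping is done; time is advanced arithmetically.

def releases_delay_for(period_us, exec_us, cycles):
    """Task blocks with a relative delay (DELAY FOR) after each cycle."""
    now, releases = 0, []
    for _ in range(cycles):
        releases.append(now)
        now += exec_us      # computation of this cycle
        now += period_us    # relative delay starts only after the work is done
    return releases

def releases_delay_until(period_us, exec_us, cycles):
    """Task blocks until an absolute release time (DELAY UNTIL)."""
    next_release, releases = 0, []
    for _ in range(cycles):
        releases.append(next_release)
        # The wake-up time depends only on the period, not on exec_us,
        # as long as exec_us is shorter than the period.
        next_release += period_us
    return releases

print(releases_delay_for(10_000, 1_500, 5))
print(releases_delay_until(10_000, 1_500, 5))
```

In this model the DELAY FOR task drifts by one execution time per cycle, while the DELAY UNTIL task stays aligned to multiples of the period.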
An aperiodic server differs from a periodic task only in the way it blocks while awaiting its next cycle, as shown on the right-hand side of Fig. 3. While the periodic task awaits a time signal, the aperiodic server awaits a signal from an external source, such as a semaphore wakeup operation or a message arrival signal. An efficient implementation of an aperiodic server by using timing error detection and handling mechanisms is detailed in [21].
An application may have any number of interrupt handlers. It is extremely important to include all interrupt handling in the analysis of a system. This includes the interrupt handler associated with updating the system clock, which almost every RTOS has. Although the amount of time spent in a handler is usually small, the execution is at highest priority and has a significant effect on the schedulability of all the other tasks. For example, a 1 msec system clock interrupt that incurs 20 µsec per interrupt uses 20 msec of high priority execution time per second. As analysis that incorporates interrupt handling cost shows, this time is nonnegligible [6].
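The arithmetic behind the clock interrupt example can be checked with a few lines. This is only a worked calculation; the function name is an illustrative assumption.

```python
def handler_time_per_second_us(interarrival_us, cost_us):
    """CPU time (in microseconds) consumed per second by a periodic interrupt."""
    interrupts_per_second = 1_000_000 // interarrival_us
    return interrupts_per_second * cost_us

# 1 msec system clock interrupt, 20 microseconds per interrupt.
used_us = handler_time_per_second_us(1_000, 20)
print(used_us / 1_000, "msec per second")       # time spent in the handler
print(100 * used_us / 1_000_000, "% of the CPU")
```

The handler alone therefore consumes 2 percent of the processor at the highest priority.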
An analyzable application consists of tasks that are designed according to the models given above. For our experiments, we use the port-based object model of a process, as detailed in [20] . This model leads to software that is analyzable, as per our definition above. Furthermore, the model uses a state-variable real-time interprocess communication (RTIPC) mechanism that does not result in blocking; this has allowed us to analyze applications without the need for considering complex RTIPC such as the priority ceiling protocol. Other RTIPC mechanisms will be considered in later versions of the tool, as we elaborate on in Section 7. The importance of following the model is to have a single entry and exit point for each task during each period, so that instrumenting the application for purposes of profiling is straightforward, as we describe next.
Profiling Unit
Profiling, also called monitoring, involves the collection of timing and performance data while the system is running. Hardware profiling is achieved by attaching probes of a logic analyzer to the processor, system bus, or to input/output ports in order to observe the activity and collect the measurement data without disturbing the system. Software profiling is accomplished by adding measurement routines that record important events which can later be evaluated by the software application on the target system. Hybrid profiling is a combined hardware/software solution in which a logic analyzer with a computer interface is used. One of the keys to automating analysis is to use an appropriate profiling method to quickly and easily obtain accurate measurements of the real-time behavior.
Generally, the software-based methods are easier to use, but provide lower resolution, greater overhead, or are usable only on faster processors. On the other hand, hardware-based methods are very accurate and nonintrusive, but very difficult to set up and use. In this section, various profiling methods are reviewed, and we make an argument for using software methods to obtain coarse-grain information about the system and hybrid methods for performing the detailed analysis during the fine-tuning phases of a project.
A number of methods exist for measuring execution time, as outlined in our paper "Measuring Execution Time and Real-Time Performance" [25]. Alternatively, an Automated Task Profiling (ATP) mechanism can be built to allow the operating system to collect the data. ATP is a software profiling technique that we designed and incorporated into the Chimera RTOS [24], as detailed in [26]. It always collects runtime data transparently and provides the information to the user as part of the process status. An example of the output from ATP is shown in Fig. 4.
ATP is useful for obtaining coarse-grain information about execution time and missed deadlines, but it does not provide sufficient accuracy for detailed analysis. We do recommend, however, that RTOSs provide a mechanism similar to ATP to obtain at least coarse-grain estimates of worst-case execution time during the early stages of implementation. The more detailed profiling using a logic analyzer, as described next, is recommended for later stages of testing and fine tuning.
Instrumented Kernel with Logic Analyzer
Combining a software profiling mechanism like Chimera ATP or WindView, with measurement instrumentation such as a logic analyzer or bus analyzer, presents a hybrid approach that is suitable for profiling an application on any embedded processor. This section describes data collection using a logic analyzer, as this method provides the most accurate data suited for automatically analyzing and fine tuning real-time properties.
The start and end of any event to be profiled are marked by a write operation of the task ID and a code for the specific event to either a reserved memory location or a general purpose digital output port. To use reserved memory, the address and data bus must be available for connecting the probes of the logic analyzer. If the address and data bus are not available (or if they are multiplexed, which adds complexity), then a digital output port of 8 bits or more is used. In this case, the task ID and a code can either be merged into a single write operation (if enough bits are available), or output back-to-back. The logic or bus analyzer is set up in state mode to trigger only on the address of that memory location, and stores the data on each trigger.
The logic analyzer automatically time stamps every event. One advantage of time stamping by the analyzer is that the time is independent of the clock used by the RTOS. Many RTOSs may cheat and not provide the exact timing base requested by the designer. For example, the definition of "one second" in an RTOS might be approximate. Since most timers are based on the CPU's clock frequency, and prescaler factors are generally a multiple of $2^n$, it is not always possible to set up a system clock to exactly the desired rate. For example, rather than 1 msec, an RTOS may program the timer with a prescaler value that gives close to 1 msec, but it might be 998 µsec. This means that, when the designer believes one second has passed, it is really 998 milliseconds. Although this may not seem like much, it amounts to 172 seconds per day, or almost three minutes. If the system clock is used to show the time of day, customers will get upset because their clock is gaining three minutes per day. Using the logic analyzer to time stamp events instead of relying on an RTOS's timer or clock will expose such problems and allow the designers to adjust their design as necessary if accurate real time is needed. If the RTOS's timer is used, then it is possible that the time stamps in the event log are also skewed by this amount.
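The daily skew quoted above follows directly from the tick error. The following worked calculation is a sketch with an assumed function name; it computes how far the RTOS clock is ahead of real time after the RTOS believes one day has elapsed.

```python
SECONDS_PER_DAY = 86_400

def clock_gain_per_day_s(nominal_tick_us, actual_tick_us):
    """Seconds by which the RTOS clock runs ahead after one RTOS-counted day."""
    # Number of ticks the RTOS counts as one day.
    ticks_per_rtos_day = SECONDS_PER_DAY * 1_000_000 // nominal_tick_us
    # Real time that actually elapses during those ticks.
    real_elapsed_us = ticks_per_rtos_day * actual_tick_us
    return (SECONDS_PER_DAY * 1_000_000 - real_elapsed_us) / 1_000_000

# Nominal 1 msec tick that is really 998 microseconds long.
gain = clock_gain_per_day_s(1_000, 998)
print(gain, "seconds per day, or", gain / 60, "minutes")
```

For a 998 µsec tick this yields roughly 172.8 seconds per day, consistent with the figure of almost three minutes.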
The amount of data that can be collected while profiling depends on the depth of the logic analyzer and the characteristics of the application. It is desirable to have as long a profile as possible, since occasional timing glitches might not show up in a short profile. On the other hand, deep buffer logic analyzers needed to store long profiles can be very expensive and, thus, a compromise is needed. The length of a profile can be estimated as follows:
where E is the number of events to log per second, $T_i$ is the period of periodic task i or the minimum interarrival time of an interrupt or aperiodic server, $N_i$ is the number of interrupt handlers, and $N_t$ is the number of tasks. $A_b$ is the average number of times that each task calls a function that provides event log information. In our instrumented kernels, functions that provide event log information include semaphores, message passing, or other calls that may cause a task to block. Since $A_b$ is likely not known precisely, E, and therefore the length of the profile in seconds, is at best an approximation. For example, an application with a 1 msec system clock interrupt handler, four tasks with periods of 5 msec, 10 msec, 30 msec, and 100 msec, and on average two calls to functions that generate event logs would require approximately 4K of trace buffer on the logic analyzer for each second of profiling that is desired. With a 1 MByte trace buffer, 256 seconds (or about 4.3 minutes) worth of profiling can be performed. With more tasks or higher frequency tasks, the length of the profile would decrease accordingly.
An example of the data in an event log collected using state mode on a logic analyzer is shown in Fig. 5. In this case, the thread ID and event action have been integrated into a single 8-bit value, where the first four bits (or most significant hex digit) represent the action and the lowest four bits represent the thread ID. For simplicity in showing the log, this example only shows two action codes: 0 for start of task and 4 for end of task. It only shows three tasks, with IDs of 1, 2, and 3 and periods of 10, 25, and 40 msec. Extracting data from an event log to produce accurate results is not meant to be done manually. Rather, the event log is uploaded from the logic analyzer to a workstation. The data is first filtered to put it in a format that is independent of the specific logic analyzer used to collect the data, as described more in Section 4.3.
The data needed for the analysis is then extracted, as detailed in Section 4.4.
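The 8-bit event encoding described above (action in the most significant hex digit, thread ID in the least significant) can be packed and unpacked as in the following sketch. Only the two action codes named in the text are listed; the function and constant names are illustrative assumptions.

```python
# Action codes taken from the example: 0 = start of task, 4 = end of task.
ACTION_NAMES = {0x0: "start", 0x4: "end"}

def encode_event(action, thread_id):
    """Pack an action code and thread ID (4 bits each) into one 8-bit value."""
    if not (0 <= action <= 0xF and 0 <= thread_id <= 0xF):
        raise ValueError("action and thread_id must each fit in 4 bits")
    return (action << 4) | thread_id

def decode_event(value):
    """Split an 8-bit logged value into (action, thread_id)."""
    if not 0 <= value <= 0xFF:
        raise ValueError("event value must fit in 8 bits")
    return (value >> 4) & 0xF, value & 0xF

for raw in (0x01, 0x41, 0x03, 0x43):
    action, thread_id = decode_event(raw)
    print(hex(raw), ACTION_NAMES.get(action, "other"), "thread", thread_id)
```

On the target, the value produced by a routine like encode_event would be written with a single store to the reserved memory location or output port that the analyzer triggers on.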
The hybrid solution of instrumenting the kernel is a compromise between the software-only methods that use too much overhead or are not sufficiently accurate, and the hardware-only methods that involve expensive hardware and complex manual extraction of the desired timing parameters. This solution does require an instrumented kernel, but the overhead of instrumenting is minimal. Consider the example above, where 4K traces are needed per second. If each trace takes 1 µsec to write, then a total of 4 msec of execution time per second is used. In contrast, the Chimera mechanism required 6 µsec per trace and, although we do not have exact numbers for WindView, we estimate that on the same processor, the overhead of logging each trace would be similar to Chimera.
If an RTOS is not instrumented, it is still possible for the application designer to instrument the code. In this case, however, the event logging must be included manually at the beginning and end of every task, interrupt handler, and function that may cause a change in scheduling (such as semaphores or message passing). The measured data might also not be as accurate, as it will be more difficult to pinpoint the time of the various events that need to be logged.
Filter Unit
The goal of the filter unit is to convert data that is in a format dependent on the profiling mechanism, to a hardware-independent format that can be read by the extractor unit. While the logic analyzer may include a variety of data in each trace, most important for the filter unit is to obtain the task or interrupt ID, the event action, and the time stamp for each trace. Any other information stored by the analyzer can be discarded. The filtered information is then stored sequentially in an array, for easy processing by the extractor unit.
It should be possible to create a filter module for each RTOS/logic-analyzer pair. That is, if we assume that each RTOS vendor instrumented the kernel in its own way, and that each logic-analyzer manufacturer stores the trace data differently, then each combination of an RTOS and logic analyzer would need its own filter. The advantage of this approach is that, if an organization always uses the same one or two RTOSs and always the same logic analyzers, then only one or two filter modules need to be built initially, and they can then be reused for any application. If the hardware-independent output of the filter module is standardized, it would also be possible for RTOS vendors to provide the filter modules for the most popular logic analyzers, thus providing off-the-shelf solutions for the profiling and filtering phases.
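A filter module of this kind can be quite small. The sketch below assumes a purely hypothetical text export from a logic analyzer, with one comma-separated trace per line (sample index, time stamp in nanoseconds, address, 8-bit data in hex). It keeps only the task ID, the event action, and the time stamp, and stores them sequentially in a list. The record layout, field order, and names are assumptions for illustration and do not correspond to any particular analyzer or to the filter used in our experiments.

```python
from typing import Iterable, List, NamedTuple

class Event(NamedTuple):
    time_ns: int    # time stamp assigned by the logic analyzer
    action: int     # event action code (upper four bits of the data)
    task_id: int    # task or interrupt ID (lower four bits of the data)

def filter_trace(raw_lines: Iterable[str]) -> List[Event]:
    """Convert analyzer-specific trace lines into a hardware-independent list."""
    events = []
    for line in raw_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and header comments
        _sample, time_ns, _address, data_hex = line.split(",")
        value = int(data_hex, 16)
        events.append(Event(int(time_ns), (value >> 4) & 0xF, value & 0xF))
    return events

# Hypothetical export: task 1 starts at 1 microsecond and ends at 251 microseconds.
raw = [
    "# sample,time_ns,address,data",
    "0,1000,0xFF00,01",
    "1,251000,0xFF00,41",
]
print(filter_trace(raw))
```

Supporting a different analyzer would then mean replacing only the line-parsing step, while the Event list consumed by the extractor stays the same.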
Extractor Unit
The extractor unit parses the list of events in the filtered event log file, and creates one or more views of the temporal behavior of the system. While significant information about the temporal behavior of the system can be observed through these visual tools, existing tools do not extract the data needed to correlate the data to the specifications and perform schedulability analysis and predictions.
Although the extraction of the execution time for each task from an event log such as the one shown in Fig. 5 may seem obvious, there are several issues to address if we want sufficiently accurate data to use as the basis for analysis and prediction.
First, the execution time of task i, $C_i$, is calculated as $C_i = t_{end} - t_{start} - t_{preempt}$, where $t_{end}$ is the time the trace ended, $t_{start}$ is the time the trace began, and $t_{preempt}$ is the amount of time that another task with higher priority executed.
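The calculation of $C_i$ from a filtered event log can be sketched as follows. This is not the extractor module of AFTER. It assumes that events are sorted by time, that only start (0) and end (4) actions are present, and that preemptions nest strictly (a preempting task always ends before the preempted task resumes), so that a stack can track which task is running. The function name and the sample log are hypothetical.

```python
START, END = 0, 4   # action codes used in the example event log

def execution_times(events):
    """Return {task_id: [C for each completed cycle]} from (time, action, task_id)."""
    stack = []      # entries are [task_id, t_start, t_preempt] of active cycles
    result = {}
    for time, action, task_id in events:
        if action == START:
            # Any cycle already on the stack is preempted from now on.
            stack.append([task_id, time, 0])
        elif action == END:
            tid, t_start, t_preempt = stack.pop()
            if tid != task_id:
                raise ValueError("end event does not match the running task")
            elapsed = time - t_start
            result.setdefault(tid, []).append(elapsed - t_preempt)
            if stack:
                # The whole span of this cycle counts as preemption time
                # for the task that was running underneath it.
                stack[-1][2] += elapsed
    return result

# Hypothetical log: task 3 is preempted by task 2, which is preempted by task 1.
log = [(0, START, 3), (5, START, 2), (6, START, 1),
       (8, END, 1), (10, END, 2), (20, END, 3)]
print(execution_times(log))
```

The values produced this way still contain whatever overhead falls between the logged start and end instants.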
These measurements, however, already include the operating system overhead, but not in a consistent manner. Whenever a high-priority task preempts a low-priority task, the preemption overhead ($\Delta_{thr}$) is added to the execution time of the lower priority task, assuming that the kernel was instrumented to output to the logic analyzer immediately when it begins executing a task's period and again when the period ends. This is contrary to what is desired, where the overhead should be associated with the higher priority tasks. If the overhead is associated with the higher priority task, then it can be modeled as $2\Delta_{thr}$ for each task [8], [9].
We can best demonstrate this issue by example. Consider the schedule shown in Fig. 6. We desire to accurately measure $C_1$, $C_2$, and $C_3$, knowing that there is overhead during each context switch. The event log includes the start and stop timestamps at the instants indicated by the triangles. From this diagram, we see that the execution times should be the following if we assume the overhead is zero (where $C_{x,y}$ means task $x$, cycle $y$; $C_{idle}$ is the amount of time the processor is idle):
However, these are not the results that would be extracted from the event log. Rather, the results would include the overhead $\Delta_{thr}$ and result in measurements that produce the following times: Note the inconsistency in how the overhead becomes part of the measurements. The highest priority task never has any overhead associated with it, while the lower priority tasks accumulate $2\Delta_{thr}$ each time there is a preemption. Suppose there is a task with low priority that was preempted 40 times. Its measured execution time would then be $C_k + 80\Delta_{thr}$, which is much higher than its true execution time. On the other hand, the same task, if preempted only three times, has a measured execution time of $C_k + 6\Delta_{thr}$. Furthermore, notice how some of the overhead is not counted as part of any other task and thus accumulates as part of the time attributed to the idle task ($C_{idle}$). This cumulative effect of the overhead can lead to significant miscalculations of the measured execution time, especially for lower-priority tasks and for the idle task.
On the other hand, suppose that the start task event was logged before the overhead and the end task event was logged after the overhead. Then, we obtain the following measurements: The overhead for each task is now constant, such that the measurements conform to the overhead model presented in [8] , [9] .
Unfortunately, profiling data with these event log points is not practical to implement. Specifically, to output to the event log we would need to know which task is being swapped in; however, the selection of the task to swap in is what constitutes the bulk of the overhead. On the minor side, when a period ends, it is feasible to postpone outputting the event log until after the overhead, but this adds a small amount of overhead on each event log, since we would no longer output the task ID of the currently running task, but would need to keep track of the previously running task.
Assuming that the overhead is constant, the extractor unit can adjust the execution times of each task by keeping track of the depth of preemption. The extractor adjusts the execution time such that $2\Delta_{thr}$ is spread uniformly across every cycle for every task, rather than nonuniformly as described above. This translates to adding $2\Delta_{thr}$ to the measured execution time for any task that is never preempted and subtracting $2(n_p - 1)\Delta_{thr}$ for any task that is preempted, where $n_p$ is the number of preemptions that occur between the start and stop event tags. With this adjustment, schedulability analysis using the measured data will be more accurate. On the other hand, there still remains an approximation, as there is no guarantee that the operating system's overhead is constant. For example, many real-time schedulers maintain linked lists of ready tasks sorted by priority. The more tasks in the queue, the longer it takes to insert a new task into the ready queue and, thus, the more overhead. Nevertheless, as we discuss more in Section 5, assuming a constant overhead $\Delta_{thr}$ produces results that are sufficiently accurate for analysis and prediction.
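The adjustment rule can be written as a single expression, since adding $2\Delta_{thr}$ when $n_p = 0$ is the same as subtracting $2(n_p - 1)\Delta_{thr}$ with $n_p = 0$. The sketch below assumes a constant overhead, as the text does; the function name and the numeric values are illustrative only.

```python
def adjust_execution_time(measured, n_preempt, delta_thr):
    """Normalize a measured cycle time so it carries exactly 2 * delta_thr.

    measured  : execution time extracted from the event log for one cycle
    n_preempt : number of preemptions between the start and stop event tags
    delta_thr : context switch overhead, assumed constant
    """
    if n_preempt < 0:
        raise ValueError("n_preempt must be non-negative")
    # n_preempt == 0 adds 2 * delta_thr;
    # n_preempt >= 1 subtracts 2 * (n_preempt - 1) * delta_thr.
    return measured - 2 * (n_preempt - 1) * delta_thr


# Times in microseconds; delta_thr = 10 is a made-up value.
# Both cycles have a true execution time of 500.
never_preempted = adjust_execution_time(500, 0, 10)    # measured without overhead
preempted_40x = adjust_execution_time(1300, 40, 10)    # measured as 500 + 80 * 10
```

After adjustment both cycles report the same value, true execution time plus $2\Delta_{thr}$, regardless of how often they were preempted.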
A second issue to be dealt with by the extractor is identification of missed deadlines. The event log captures the start and end times of each cycle of a periodic task. However, the event log has no knowledge of the deadlines for each task; hence, it cannot identify missed deadlines. The best solution in this case is to have the RTOS detect the timing errors. We provided the details for efficiently building such a mechanism in [21] and demonstrated the use in the Chimera RTOS. In this case, it is easy to log an event whenever a missed deadline is detected. This is the method we used.
Unfortunately, most commercial RTOS do not provide missed deadline detection. In such a case, it is possible to deduce the deadlines for most (and sometimes all) the tasks from the event log, assuming that the deadline is always the start of the next period. To find all the deadlines for Task k, we search the event log for an instance where the start of a cycle for Task k preempts a lower priority task. This indicates a real start of a period, rather than a delayed start due to a high priority task using the CPU. Using the time of this event as a baseline, we can continually add (or subtract) the task's period to this time, to obtain every deadline for the task. This method fails only if a task never preempts a lower priority task, assuming that the idle task is the lowest priority task.
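Once a real period start has been found, deducing the deadlines is simple arithmetic. The sketch below is illustrative only: the function name and data layout are hypothetical, the baseline is assumed to have been identified already, and the log is assumed to contain every consecutive cycle from the baseline onward.

```python
def find_missed_deadlines(baseline_start, period, cycle_ends):
    """Return the indices of cycles that finish after their deduced deadline.

    baseline_start : time of a "real" period start found in the event log
    period         : period of the task
    cycle_ends     : end times of consecutive cycles, where cycle_ends[0]
                     belongs to the cycle released at baseline_start
    """
    missed = []
    for i, t_end in enumerate(cycle_ends):
        # The deadline is assumed to be the start of the next period.
        deadline = baseline_start + (i + 1) * period
        if t_end > deadline:
            missed.append(i)
    return missed


# Deduced deadlines are 3000, 5000, 7000, and 9000; the second cycle is late.
missed = find_missed_deadlines(1000, 2000, [1800, 5100, 6900, 8500])
```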
A third issue is determining the measured period. The event log does not directly store information about a task's period. While the task period is usually available in a specification file, it is still desirable to measure it, as the RTOS may have approximated it. For example, if an RTOS has a 1 msec system clock, it may round every period request to the nearest msec. So, a task with a desired frequency of 450 Hz may in fact run with a period of 2 msec, rather than 2.2 msec.
The period of a task can be deduced in the same way as a deadline. Two real starting points for the task must be found in the event log. The difference between these two points, divided by the number of iterations of the task between those two points, will yield an accurate result for the period. From this information, every starting point of the task can then be identified and the amount of jitter of the starting times can then also be computed.
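A minimal sketch of this calculation follows. It assumes the two real starting points have already been identified in the log; the function names and sample timestamps are hypothetical.

```python
def measured_period(real_start_a, real_start_b, iterations):
    """Period deduced from two real starts that are `iterations` cycles apart."""
    if iterations <= 0:
        raise ValueError("iterations must be positive")
    return (real_start_b - real_start_a) / iterations


def start_jitter(starts, baseline, period):
    """Delay of each observed start relative to its ideal release time.

    starts[0] is assumed to be the cycle released at `baseline`, and the
    list is assumed to contain every consecutive cycle after it.
    """
    return [s - (baseline + i * period) for i, s in enumerate(starts)]


# Observed cycle starts in microseconds; the first and last are taken to be
# real starts, three iterations apart.
starts = [0, 2000, 4100, 6000]
period = measured_period(starts[0], starts[3], 3)
jitter = start_jitter(starts, starts[0], period)
```

Here the third cycle started 100 microseconds after its ideal release time, which is the kind of delayed start caused by a higher priority task using the CPU.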
The goal of the extractor is to provide an accurate summary of the periods and execution time for each interrupt and task in the system. An example of such a summary in our AFTER tool is shown in Fig. 7 . Given this data, the analysis unit can then use real measured data, rather than specified or estimated data which might not be as accurate as the application developer expects.
Analysis Unit
The analysis unit is the core of our tool. It performs schedulability analysis on the data provided by the extractor unit based on a set of analytical equations. There are two operational modes for the unit: 1) the analysis mode that uses only real data collected from the embedded application to present the current temporal state of the system to the developer; 2) the predictor mode, in which the unit uses a combination of real data and data modified by the developer, to predict how a particular fine-tuning operation might change the timing characteristics of the system. The predictor mode is discussed in Section 4.6.
The following conventions are used throughout this section and the next:
• Interrupts are numbered as the highest priority processes, as $\tau_1, \tau_2, \ldots, \tau_{n_{intr}}$. Threads (both periodic and aperiodic) are numbered afterwards, as $\tau_{n_{intr}+1}, \tau_{n_{intr}+2}, \ldots, \tau_{n_{intr}+n_{thr}}$.
• For periodic threads, $C_i$ and $T_i$ are execution time and period, respectively; for aperiodic servers, $C_i$ is the server capacity and $T_i$ is the minimum interarrival time; for interrupt handlers, $C_i$ is the execution time and $T_i$ is the minimum interarrival time.
• We assume that the deadline $D_i$ of any thread is the end of the period $T_i$. We have not yet considered scheduling algorithms for cases where the deadline is earlier than the end of the period.
• Each interrupt or thread $\tau_i$ has a fixed priority, such that the priority of an interrupt is always greater than the priority of any thread. In the dynamic scheduling model, the fixed priority is ignored for threads.
• The threads and interrupts are numbered in decreasing order of fixed priority.
• $\Delta_{thr}$ is the context switch overhead for a thread and includes the time to execute the scheduler and to select the next thread. $\Delta_{intr}$ is the operating system overhead for servicing an interrupt. Operating system overheads $\Delta_{thr}$ and $\Delta_{intr}$ are incorporated into our equations in a similar manner as Katcher [8] incorporates overhead, although multiple terms have been combined such that the values are "measurable" given one of the profiling mechanisms described in Section 4.2.
Compared to Katcher's notation, we use $\Delta_x$ to represent operating system overhead, while $C_x$ always represents execution time of a user's thread or interrupt; Katcher uses $C_x$ for both user execution time and overhead.
To measure the overhead $\Delta_{thr}$ and $\Delta_{intr}$ for a particular operating system and platform, each thread constantly toggles a different bit in a digital I/O port. When a context switch or interrupt occurs, the specific bit being toggled will change. The output of the I/O port is captured by a logic analyzer, and the "dead time" when no bits are being toggled represents the corresponding operating system overhead.
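The dead time can be read off a captured trace as the gap between the last toggle on one bit and the first toggle on another. The sketch below assumes the trace has been reduced to a time-ordered list of `(timestamp, bit)` pairs, which is a hypothetical layout; each gap also includes up to one toggle interval, so it slightly overestimates the overhead.

```python
def dead_times(edges):
    """Gaps between the last edge on one bit and the first edge on the next.

    `edges` is a time-ordered list of (timestamp, bit) pairs, one per toggle
    captured by the logic analyzer. Returns (from_bit, to_bit, gap) tuples.
    """
    gaps = []
    for (t_prev, bit_prev), (t_next, bit_next) in zip(edges, edges[1:]):
        if bit_prev != bit_next:
            # A different thread is now toggling: the silence in between is
            # attributed to the operating system.
            gaps.append((bit_prev, bit_next, t_next - t_prev))
    return gaps


# Thread 0 toggles bit 0, thread 1 toggles bit 1; timestamps are made up.
trace = [(0, 0), (1, 0), (2, 0), (14, 1), (15, 1), (16, 1), (29, 0)]
gaps = dead_times(trace)
```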
Following are three analytical models that have been implemented in the AFTER prototype. They form a sufficient basis for us to investigate the analysis and prediction capabilities described in this paper. Additional models can be added to AFTER in order to support more analysis and prediction as described in Section 5. One contribution of our work worth noting is how the tool brings together technical results from several different research groups into a single comprehensive package.
Fixed-Priority Scheduling Model
This section presents the schedulability model for a thread and interrupt set executing on a fixed priority platform. A task set is schedulable in the worst case using fixed priority scheduling if and only if:
This model is a result of combining results of several different researchers. Liu and Layland [14] first presented the scheduling theory for fixed-priority systems and established the worst-case schedulable bound. Lehoczky et al. [13] proved that a more optimistic analysis of the task set can be performed. That is, a task set consisting of n periodic threads is schedulable if the following equation holds:
We evaluate each thread $\tau_i$ over its period, but only up to its deadline. The summation is evaluated at every scheduling point. If the minimum value of the workload is normalized by time and is less than unity, then the thread is schedulable, meaning the specific thread will meet all of its deadlines.
Katcher et al. [9] added overhead terms to (3), which we use to derive the following condition for checking the schedulability of $n$ periodic tasks in the absence of interrupts: $\forall i,\ 1 \le i \le n,\ \min$
To incorporate interrupts, we model an interrupt service routine (ISR) as a sporadic server with its capacity being the maximum execution time of the ISR and its period equal to the minimum interarrival time of the interrupt. Sprunt et al. [17] proved that, for analysis purposes, a sporadic server can be treated as a standard periodic thread with the same period and execution time as the server. This allows us to add the effect of interrupts to fixed priority scheduling analysis. Using this result, we obtain an equation similar to (4), but for n interrupts:
Combining (4) and (5) leads us to the analytical model that was shown in (2).
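To make the scheduling-point evaluation concrete, the following sketch implements the classic exact test in the style of Lehoczky et al. It is not a transcription of equation (2). It assumes integer time units, a deadline equal to the period, and execution times that already include the operating system overhead (as the extractor's adjusted measurements do). Interrupt handlers are simply listed first, as the highest priority periodic tasks. It accepts a normalized workload equal to one as schedulable, which is the usual form of the test.

```python
def scheduling_points(i, periods):
    """Multiples of T_j (for j <= i) that do not exceed T_i."""
    t_i = periods[i]
    return sorted({k * periods[j]
                   for j in range(i + 1)
                   for k in range(1, t_i // periods[j] + 1)})


def is_schedulable(tasks):
    """Exact fixed-priority test evaluated at the scheduling points.

    tasks: list of (C, T) integer pairs in decreasing order of priority,
           with interrupt handlers (if any) listed first.
    """
    periods = [t for _, t in tasks]
    for i in range(len(tasks)):
        meets_deadline = False
        for t in scheduling_points(i, periods):
            # Cumulative demand of task i and all higher priority tasks
            # over [0, t]; -(-t // p) is the integer form of ceil(t / p).
            demand = sum(c * -(-t // p) for c, p in tasks[:i + 1])
            if demand <= t:
                meets_deadline = True
                break
        if not meets_deadline:
            return False
    return True


ok_set = is_schedulable([(1, 4), (2, 5), (2, 10)])
overloaded = is_schedulable([(2, 4), (3, 6)])
```

The second task set has a utilization of exactly 100 percent yet fails the test, because the lower priority task cannot complete by either of its scheduling points.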
Dynamic Priority Scheduling Model
The dynamic priority scheduling model is used if threads in the underlying system are scheduled using the earliest-deadline-first (EDF) algorithm. EDF was chosen for this model for three reasons. First, EDF has been proven to be an optimal real-time scheduling algorithm. Second, there exist comprehensive analytical models that are already understood by many readers and that could be modified easily to take into account the operating system overheads. Third, EDF is a good algorithm for the fine-tuning phase because of its higher schedulable bound (100 percent). Embedded systems are often overloaded or near capacity, yet they use fixed priority scheduling, which provides a lower schedulable bound.
The model can also be used in the predictor mode if the target system is using a static scheduling algorithm, to determine the likely effect if the scheduling algorithm is changed.
As a basis, we use the equation derived from Liu and Layland's analysis of mixed-priority scheduling [14] . Threads have deadlines and are scheduled using EDF, but there is a set of interrupts with fixed higher priority that will preempt the threads. Katcher [8] extended the mixed priority schedulability analysis to incorporate operating system overhead: A task set consisting of k interrupt handlers and m user tasks is schedulable if and only if
where $t - f_k(t)$ is the availability function of the processor. The availability function of a processor for a set of tasks is defined as the accumulated processor time from 0 to $t$ that is available to this set of tasks. The function $f_k(t)$ is the processor time consumed in an interval of length $t$ by the $k$ fixed priority interrupt handlers. It can be defined as follows:
Combining and rearranging (6) and (7) yields the following:
where $\Delta'_{thr}$ is the operating system overhead from scheduling threads using EDF. Typically, $\Delta'_{thr}$ is greater than the corresponding overhead in a fixed priority scheduler, $\Delta_{thr}$, because of the need to dynamically recompute priorities.
While (8) gives necessary and sufficient conditions for the feasibility of the task set in a dynamic priority system with interrupts, it needs to be evaluated for all $t \ge 0$, which means that this equation is not a practical test.
Jeffay and Stone [6] have proven that, if we restrict ourselves to task systems for which processor utilization is less than 100 percent, we get a closed form schedulability analysis equation for (8) by restricting $t$ such that $t \in \mathcal{P}$, where

$$\mathcal{P} = \left\{\, kT_l \;\middle|\; (k \ge 1) \wedge (1 \le l \le n_{intr} + n_{thr}) \wedge kT_l < \max(T_l)\Big|_{l=1}^{n_{thr}+n_{intr}} \right\}$$
This represents the set of nonnegative multiples of a thread's period, less than the length of the thread with the longest period. The model gives a finite number of evaluations needed to test the schedulability of a task set with high priority interrupts and EDF scheduling, thus making it usable in our analysis tool.
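The set of evaluation points is finite and can be enumerated directly. The sketch below follows the prose description (multiples of each period with $k \ge 1$ that are strictly less than the longest period) and assumes integer time units; the function name is hypothetical, and the schedulability condition evaluated at each point is not reproduced here.

```python
def edf_test_points(periods):
    """Multiples k * T_l (k >= 1) that are strictly below the longest period.

    periods: integer periods (or minimum interarrival times) of all
             interrupts and threads.
    """
    longest = max(periods)
    return sorted({k * p
                   for p in periods
                   # k * p < longest  <=>  k <= (longest - 1) // p for integers
                   for k in range(1, (longest - 1) // p + 1)})


points = edf_test_points([4, 5, 10])
```

For periods of 4, 5, and 10 time units, only three points need to be evaluated instead of every $t \ge 0$.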
Static Scheduling with Aperiodic Servers
Aperiodic servers are threads that execute irregularly. Their execution generally provides a small amount of computation with fast response for an externally generated event, such as an I/O device driver requesting service. Sprunt presented a closed form schedulability analysis for the deferrable and sporadic servers in [17] .
Aperiodic servers are preferable over interrupts because they enable better use of the CPU's bandwidth, and they reduce the priority inversion problem by allowing such events to be scheduled, rather than always having the highest priority. On the other hand, interrupt handlers require less operating system overhead since only a partial context save and restore is needed for the interrupt handler, but a full context save and restore is needed for the aperiodic server. Furthermore, Sprunt's scheduling model is idealized. In real systems, an aperiodic server still needs to be signalled, usually in the form of an interrupt. Thus, it is not an issue of replacing an interrupt handler with an aperiodic server and vice versa. Rather, an interrupt handler that uses significant CPU time can be replaced with a much shorter interrupt handler, with the bulk of execution being moved to the aperiodic server.
This leads to a question as to whether the additional overhead of an aperiodic server justifies eliminating the potential priority inversion that may occur when interrupt handlers execute for extended periods. Our model for aperiodic servers is designed especially to help a designer analyze those specific trade offs.
In our model, an aperiodic event is composed of two separate elements: a minimal interrupt handler that signals the arrival of the event and an aperiodic server thread that performs the bulk of the computation in response to the event. All aperiodic server threads by definition have a higher priority than periodic threads, but lower priority than the fixed priority interrupts. The priority of an aperiodic server, however, can be lowered by the scheduler if it uses its entire capacity in a given cycle [17] , using the mechanism described in [21] .
Using our results in (2) and incorporating the sporadic server and the operating system overhead, we obtain a schedulability test for a system consisting of periodic threads and all interrupts converted to aperiodic servers:
where $\Delta_{sig}$ is the execution time of code within the interrupt handler needed to signal an aperiodic server. Using a modified version of the above equation, it is also possible to analyze the effect of only converting some, but not all, interrupts to aperiodic servers.
Equation (10) applies to fixed priority scheduling only. In our prior work on developing the maximum-urgency-first scheduling algorithm [23] , we showed that it is also possible to compute the schedulability for aperiodic servers in a dynamic scheduling environment. In this work, however, we showed the schedulability for a dynamic system with only a single aperiodic server. We are currently investigating a more generalized form of that solution for an arbitrary number of aperiodic servers.
Predictor Unit
The analysis unit presents a temporal image of the system to the user, using an analysis that is based on a correlation of the system specifications with the measured timing data. An interactive interface is then displayed, as was shown in Fig. 2 , and operation of the tool switches to the predictor mode.
The predictor unit provides two classes of prediction:
Effect of changing the system configuration. For example, changing the scheduling algorithm from static to dynamic or vice versa. To make such a prediction, we use the same measured raw timing data, but use a different model to analyze the data.
Effect of modifying the application's configurable design parameters. For example, modifying parameters such as execution time and period. For predicting the effect of such changes, we use the same model and most of the measured data as in the analysis mode, but replace some of the measured data with the new design parameters.
We do not expect to get perfect predictions. In particular, features such as variability of RTOS overhead, caches, pipelines, branching, randomly generated external events, and other known phenomena make measuring execution time in an embedded real-time system difficult. As a result, it might be impossible to create a perfect model of the system. However, perfection is not needed for the tool to be valuable. For example, in meteorology, predictions are used to inform the population of the upcoming weather. We all know that the predictions are not completely accurate. Nevertheless, they serve a very important function, especially for warning the public of strong storms and dangerous conditions in the area. In our tool, predictions about timing errors that are "close enough" can provide valuable information to the designer for rapidly tracking down and fixing the problems.
The accuracy of meteorological predictions largely depends on both the accuracy of the analytical models used and how precisely current atmospheric data can be collected. Similarly, the accuracy of our predictions will depend on how closely we can model the target execution environment and how precisely we can collect data from that environment. Thus, we expect the accuracy of our predictions to be a function of the accuracy of the theoretical models described in Section 4.5, and the profiling and extraction methods described in Section 4.2 and Section 4.4, respectively. Validating the accuracy of the tool must then be a continuous process, as discussed in Section 5.
Currently, we have incorporated models for both static and dynamic scheduling in the presence of interrupts, and extended the analysis of aperiodic servers to better reflect real embedded systems. Based on these analytical models, we can perform the following predictions.
Fixed vs. Dynamic Scheduling: AFTER gives the developer the ability to evaluate scheduling algorithms specifically for their application. The major advantage of using a dynamic scheduling algorithm is an increase in schedulable bound to 100 percent for any task set as compared to a fixed-priority algorithm. However, if a system has interrupts or possible transient overloads, there is no guarantee that using dynamic priorities will improve the system. Generally, there is also an increase in operating system overhead for a dynamic scheduler. Before changing scheduling algorithm, the developer can use AFTER to observe timing characteristics and only switch scheduling strategies if there are benefits. If AFTER suggests that such a change is not worthwhile, the developer can save a complete iteration of modifying the system unnecessarily.
Adding or Removing Threads: It is often desirable to know in advance the effect on the timing of the system if a new thread is to be added to the system. Conversely, the designer may want to know how removing an optional thread from the system may change the timing. Threads can easily be added or removed through the prototype graphical interface. The analysis then simply skips threads that have been removed and uses the estimated data for a thread that the designer considers adding. In the interface that was shown in Fig. 2, thread 08 is disabled for the prediction, and thread 12 is currently blank so that the designer can enter a new thread.
Interrupts vs. Aperiodic Servers: There is a trade off between interrupt handlers and aperiodic servers. Interrupt handlers are typically nonpreemptive and execute with the highest priority, which can decrease the predictability of the system. An aperiodic server, on the other hand, can be used to improve predictability, but at the cost of higher operating system overhead. Using the real data collected from the system, AFTER can predict the effect of converting one or more interrupt handlers to aperiodic servers. Conversely, given data from a system already using aperiodic servers, it can determine the schedulability of the system if some of the servers are converted to high priority interrupt handlers.
Frequency and Period: AFTER allows a developer to request a prediction on whether or not a system will be schedulable if the period or frequency of one or more tasks is modified. In some cases, changing the frequency of tasks can improve the system performance, while in other cases (such as when using the rate monotonic algorithm), reducing the frequency of some tasks could in fact lower the schedulable bound [13] , thus worsening system performance instead of improving it.
The period or frequency of one or more threads is modified in order to predict the likely effect in the real system if such a change were made. This prediction uses the same model as the original analysis, except that the period $T_j$ is replaced by $T_j + t_j$. The designer can then select the amount by which to modify the period of any specific thread $\tau_j$ by assigning a value to $t_j$. For other threads, $t_j$ is set to 0. Adjusting $t_j$ for an interrupt handler represents an adjustment of the minimum interarrival time.
Code Optimization: AFTER can be used by a software developer to estimate how much code optimization is needed. There are situations in which, after trying all possible combinations of parameters to make the system schedulable, the developer finds that the CPU is still overloaded. The only remaining option is to reduce the execution time of one or more tasks, either through optimization, by making some tasks soft real-time, or by removing nonessential functionality. In any case, a major development effort is usually required to perform the modifications. Unfortunately, even after significant effort, there is no guarantee that the system will meet timing requirements. The problem is that the designer does not have specific goals as to which modules need to be optimized, and more importantly by how much, in order to make a difference in the overall schedulability of the system. AFTER can help the developer determine that "if you reduce the execution of thread A by 1 msec and thread B by 0.5 msec, then the system is likely to work; optimizing thread C does not make any difference."
The worst-case execution time of a thread can be varied to obtain predictions. To do so, $C_j$ is replaced by $C_j + c_j$. If $c_j$ is negative, the designer is presented with predictions about what happens if the corresponding thread is optimized. If $c_j$ is positive, then the designer can determine how much code can be added to a particular thread without adversely affecting the timing in the system.
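The substitution of $C_j + c_j$ for $C_j$ (and likewise $T_j + t_j$ for $T_j$) can be sketched as a small helper that produces a modified task set for re-analysis. The names, the dictionary layout, and the numbers are hypothetical, and total utilization is used only as a stand-in for whichever analytical model the tool would actually apply.

```python
from fractions import Fraction


def apply_changes(tasks, dc=None, dt=None):
    """Return a copy of `tasks` with C_j + c_j and T_j + t_j applied.

    tasks : {name: (C, T)} measured execution times and periods
    dc    : {name: c_j} execution time changes (negative means optimization)
    dt    : {name: t_j} period changes; entries that are absent default to 0
    """
    dc = dc or {}
    dt = dt or {}
    changed = {}
    for name, (c, t) in tasks.items():
        new_c = c + dc.get(name, 0)
        new_t = t + dt.get(name, 0)
        if new_c < 0 or new_t <= 0:
            raise ValueError(f"invalid prediction parameters for {name}")
        changed[name] = (new_c, new_t)
    return changed


def utilization(tasks):
    """Total processor utilization; a placeholder for the real analysis."""
    return sum(Fraction(c, t) for c, t in tasks.values())


# Times in microseconds (made-up values): the measured set is fully loaded.
measured = {"A": (3000, 10000), "B": (2500, 5000), "C": (4000, 20000)}
before = utilization(measured)
# Predict the effect of optimizing A by 1 msec and B by 0.5 msec.
after = utilization(apply_changes(measured, dc={"A": -1000, "B": -500}))
```

Exact fractions are used so that comparisons against a bound are not affected by floating-point rounding.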
EXPERIMENTAL VALIDATION
An automated analysis and predictor tool will only benefit the designer if there is high confidence that the analysis results and solutions for fixing problems are accurate. Since the analysis and prediction can only be as accurate as the models and profiling methods used, it is necessary to continually validate models and profiling methods, always seeking better models that can improve the accuracy. Our approach to experimentally validating the tool and a sample of our results in doing so are described in this section.
The goal of the experimental validation is to identify the accuracy of the analysis as compared to actual execution. In the simplest terms, if we run the application on the hardware and find that no task ever missed a deadline, we want the analysis to report that the task set is schedulable. Conversely, if we run the application on the hardware and at least one task misses a deadline, then the analysis should reflect that the task set was not schedulable. For our experiments, the Chimera timing error detection and handling mechanism [21] was used to identify whether or not an application had missed any deadlines.
Randomly generating a hundred (or thousand) task sets, then both executing and analyzing them, is not sufficient to determine the accuracy for use in predictions. Such an approach would yield answers such as "98 percent of the time the tool gives the right answer," without providing any insight on why the tool is wrong for 2 percent of the time. Furthermore, the percentage of times the tool is correct would be a function of the range of utilization for the task sets. For example, if we select task sets with utilization requests in the range of 10 percent to 200 percent, then it would be very easy to get a high percentage of right answers. On the other hand, if we select task sets with utilization requests in the 80 percent to 120 percent range, then the number of times the tool is wrong might be much greater, as we would expect the actual execution and analysis of the execution to differ in results at the boundaries between a schedulable and nonschedulable task set.
To determine the accuracy of the analysis models and predictions, we instead focus on a single task set, but observe it with excruciating detail to analyze its behavior near the threshold of being schedulable or nonschedulable. The threshold is examined both analytically using the equations in Section 4.5 and, experimentally, by executing a synthetic workload and monitoring for missed deadlines.
Our first objective is to begin with a baseline task set that is schedulable, but on the threshold of being nonschedulable. To do so, we began with the task set shown in Table 1. The interrupt handler (task 0) is in fact the system clock; we did not have the ability to modify it in any way, but it is included because the high overhead has an impact on the schedulability analysis. Every other task executed a synthetic workload that used up approximately the amount of CPU time under the $C_{ref}$ column. The workload was implemented using the Chimera delay() function, which simply busy-waited. Due to natural variations in execution time when using Chimera's delay() function, we expect to get measurements near $C_{ref}$, but not necessarily exactly $C_{ref}$. All execution times that we report in this section include the $2\Delta_{thr}$ overhead, as discussed in Section 4.4. We then adjusted the execution time of each task uniformly until we had a schedulable task set that was on the threshold of being unschedulable. We define the threshold as the point at which increasing the execution time of the workload by 1 percent would cause one of the tasks to miss a deadline. Profiling and extraction were then used to obtain real measurements of $C_{avg}$ and $C_{max}$ for this baseline task set. These values are shown in the baseline columns of Table 1.
The baseline task set is on the threshold experimentally. Next, we use the measured values to determine whether or not the analysis unit believes the task set is schedulable. By applying (2) using $C_{max}$ of the baseline task set (i.e., the worst-case execution time), the analysis unit provided a result stating that the task set was not schedulable. This differs from the experimental results, but is in fact to be expected. Consider the case where we use $C_{avg}$ instead of $C_{max}$ in the analysis unit; the task set is then reported as schedulable. The analysis unit is thus telling us that there is the potential for missed deadlines, but we did not notice any during experimentation because we were lucky that not all tasks experienced their worst-case execution time at the same time.
To further investigate the difference in results of the analysis unit and the execution of the synthetic workload, we created three new task sets by modifying the execution time of the synthetic workload proportionally to the baseline task set. For convenience, we call these new task sets k50, k90, and k110. The name "k50" means that we multiplied the execution time of each task by 50 percent. For k50 and k90, this resulted in a task set that is schedulable, while for k110, it resulted in a task set that is not schedulable. Both the execution on real hardware of the synthetic workload and the analysis unit agreed.
This test at least validated that when we are not close to the threshold (where not close means at least 10 percent utilization away from the threshold), the analysis unit and the execution of the synthetic workload provide consistent answers. At this point, the question to ask is: as we slowly increase or decrease execution time to approach the baseline task set, at what point does the analysis unit's answer differ from what we observe when executing the synthetic workload? The answer to this question is especially important, as it also provides insight into how accurate our predictions will be, since predictions are performed by looking at the results of the analysis unit.
To answer this question, we repeated the following experiment for the k90 workload.
• Modify the synthetic workload so that the execution time requested for each task is 90 percent of the workload used in generating the baseline task set. Execute this workload to confirm that no deadlines are missed. Use the measured data as input to the analysis unit to confirm that the analysis unit accurately concludes that the task set is schedulable. The analysis unit computed schedulability twice, once using the measured $C_{max}$ and once using the measured $C_{avg}$; both showed the task set was schedulable. These values are shown in Table 2. Note that the values are not necessarily exactly 90 percent of the baseline: the requested time was 90 percent of baseline, but the actual measured time varies slightly from this amount, as shown in columns 4 and 5.
• Using the analysis unit, increase $C_{avg}$ of task 1 only, until we reach the threshold point above which the task set becomes nonschedulable. This value is noted in column 6 of Table 2 for TaskID=1. This was repeated using the analysis unit, but starting with the $C_{max}$ baseline, and is shown in column 7.
• Using the synthetic workload, increase $C_{avg}$ of task 1 only, until at least one missed deadline is detected by the RTOS. When a first deadline was missed, the measured $C_{avg}$ of task 1 was noted and is shown in column 8. $C_{max}$ for this same task set was measured and is shown in column 9.
The above steps were repeated for each of tasks 2 through 8, each time starting with the schedulable k90 workload and increasing the execution time of only one task. The results are shown in Table 2. Since we could not adjust the execution time of the interrupt handler, columns 6 to 9 of task 0 are blank.
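The analytical half of this procedure, increasing the execution time of one task until the analysis reports the set as nonschedulable, can be sketched as a binary search. The sketch is illustrative only: the function names and numbers are hypothetical, the search assumes that schedulability never improves as an execution time grows, and a plain utilization bound stands in for the analysis unit's model.

```python
from fractions import Fraction


def threshold_execution_time(tasks, index, is_schedulable, upper):
    """Largest integer C (at most `upper`) for tasks[index] that keeps the
    task set schedulable according to the supplied analysis predicate.

    tasks          : list of (C, T) integer pairs
    is_schedulable : callable taking a task list and returning a bool
    """
    c0, period = tasks[index]

    def ok(c):
        trial = list(tasks)
        trial[index] = (c, period)
        return is_schedulable(trial)

    if not ok(c0):
        raise ValueError("starting task set is not schedulable")
    lo, hi = c0, upper
    while lo < hi:
        mid = (lo + hi + 1) // 2      # round up so the loop always advances
        if ok(mid):
            lo = mid
        else:
            hi = mid - 1
    return lo


def utilization_test(tasks):
    """Stand-in analysis: total utilization must not exceed 100 percent."""
    return sum(Fraction(c, t) for c, t in tasks) <= 1


# Times in microseconds (made-up values); only task 0 is increased.
k_set = [(1000, 4000), (2000, 8000), (1000, 10000)]
limit = threshold_execution_time(k_set, 0, utilization_test, upper=4000)
```

Running the same search once from the measured $C_{avg}$ values and once from the measured $C_{max}$ values would give the two analytical thresholds that the experiment compares against the observed one.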
These results are also shown graphically in Fig. 8. The x dimension is the task ID. For each task ID, six bars are shown, representing columns 4 through 9. The y dimension is execution time in msec. For each task, the first two bars show the 90 percent of baseline execution times. The middle two bars show the threshold execution times computed by the analysis unit; these represent the endpoints for the predictions of changing execution time. The right two bars show the measured threshold. In all cases, the right two bars are both greater than the third bar and less than the fourth bar.
Most notably, when using measured C max as the baseline, the analysis unit always provided an estimate of execution time increase that is lower than what we obtained when measuring the synthetic workload. When using measured C avg as the baseline, the analysis unit always provided an estimate of execution time increase that is higher than the measured value in our synthetic workload. This is an important result, as it tells us that for purposes of prediction, the analysis unit can use both the C avg and C max measured execution times to establish the range within which the execution time of any one task can increase while the task set remains schedulable.
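As a rough illustration of how the two inputs bracket a prediction, the sketch below computes a per-task threshold twice, once from C max and once from C avg. It assumes the EDF utilization bound (total utilization at most 1, deadlines equal to periods, no overhead), which is only one of the tests such a tool might apply. The function names and the task values are hypothetical, and times are integer microseconds so that the arithmetic is exact.

```python
from fractions import Fraction


def edf_threshold(periods, exec_times, index):
    """Largest execution time for one task under the bound sum(C/T) <= 1."""
    others = sum(Fraction(c, t)
                 for k, (c, t) in enumerate(zip(exec_times, periods))
                 if k != index)
    return max(Fraction(0), (1 - others) * periods[index])


def threshold_range(periods, c_avg, c_max, index):
    """Return (lower, upper) thresholds for tasks[index].

    The C max input gives the conservative (lower) value and the C avg
    input gives the optimistic (upper) value.
    """
    lower = edf_threshold(periods, c_max, index)
    upper = edf_threshold(periods, c_avg, index)
    return lower, upper


# Hypothetical measurements in microseconds, not values from Table 2.
periods = [10000, 20000, 40000]
c_avg = [2000, 4000, 8000]
c_max = [2500, 5000, 10000]

lower, upper = threshold_range(periods, c_avg, c_max, index=0)
print(f"task 0 may grow to between {lower} and {upper} microseconds")
```

The experiment reported here observed the measured threshold falling between the two values; the sketch does not guarantee that outcome, it only shows how the range is formed.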
The above experiment was repeated with the k50 task set. Since this task set was much farther from the threshold, there is much more potential for variability between the results of the analysis unit and the results from executing the synthetic workload. We obtained results similar to those for the k90 task set. We were able to use the analysis unit to obtain a prediction of how much execution time can be added to any one task while still meeting all deadlines.
Using the k110 workload, the opposite approach is used for verification. The initial task set overloads the processor and produces tasks that continually miss deadlines. We ask: By how much can we reduce the execution time of any one task to make the task set schedulable? We obtained answers from the analysis unit and obtained the answer experimentally by modifying the synthetic workload. In this case, we found that, for tasks 4 through 8, removing the entire task would not make the task set schedulable. For task 3, the analysis unit indicated that we cannot reduce the execution time to obtain a schedulable task set, but, experimentally, we were able to. The reason is the same as above: The analysis unit provides conservative estimates when measured C max times are used as input.
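The reduction query for an overloaded task set can be sketched in the same spirit. The example below again assumes the EDF utilization bound with deadlines equal to periods and no overhead, which is a simplification rather than the analysis unit's actual model. `required_reduction` is a hypothetical name, and it returns `None` when even removing the task entirely would leave the set overloaded.

```python
from fractions import Fraction


def required_reduction(tasks, index):
    """Smallest cut in the execution time of tasks[index] that brings total
    utilization down to 1, for a list of (C, T) pairs in integer microseconds.

    Returns 0 if the set is already within the bound, or None if removing
    the task entirely would still not be enough.
    """
    total = sum(Fraction(c, t) for c, t in tasks)
    if total <= 1:
        return Fraction(0)
    c_i, t_i = tasks[index]
    excess = (total - 1) * t_i   # execution time this task would have to shed
    if excess > c_i:
        return None
    return excess


# Hypothetical overloaded set at 112 percent utilization.
overloaded = [(4000, 10000), (9000, 20000), (10000, 40000), (1000, 50000)]
for task_id in range(len(overloaded)):
    print(task_id, required_reduction(overloaded, task_id))
```

In this made-up set, the last task contributes so little utilization that no reduction of it alone can help, which mirrors the kind of answer reported above for tasks 4 through 8.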
SUMMARY
To summarize our initial experimental validation, the results are very promising. They show that there is a strong correlation between results from the analysis unit and results of executing a synthetic workload. Experiments on the accuracy of predictions were conducted only for predictions in response to modifying CPU time. We found that by using both the C avg and C max measured times as input to the analysis unit, we obtain a lower bound and an upper bound on the amount by which we can adjust CPU time. The prediction experiments also considered modifying the execution time of only one task at a time.
As noted in Section 4.6, experimental validation is not a one-shot deal, but rather a continual process that is used to evaluate the current state of the tool, and provide feedback for the parts of the system that can better be modeled to provide more accurate analysis and prediction. Experiments must also be performed for other types of prediction and for analysis using other models.
FUTURE WORK
The first implementation of our tool intentionally addresses only a small number of real-time issues. It allows us to keep the problem size reasonable while we validate the concept of automated analysis and prediction. In particular, we looked at a single RTOS/architecture pair, considered preemptive periodic tasks, sporadic servers, and interrupt handlers, and only considered rate monotonic and EDF scheduling algorithms. Such a tool, however, will only be practical if all resources used within an embedded system are properly modeled.
The experimental validation is a continual, lengthy process. Initial experiments are very promising. First, they show that analysis of measured data correlates with the measured data. Second, they show that using the analysis as a basis for predictions can provide a range within which the right answer is expected to be found. More accurate modeling will reduce this range and, thus, result in more accurate predictions.
Part of our future work is to expand on the models, by integrating research results from many groups and refining them when necessary to take into account practical concerns. After each new model is implemented, experimental validation of that model will be performed. The modeling we expect to perform over the next few years includes the following. For each one, we include a sample query that the model is designed to help answer.
• Microcontroller architectures. How will switching from an 8051 to a MC68HC912 affect our application?
• Processor speeds. How much additional CPU bandwidth will we gain by increasing our x486 processor speed from 33 MHz to 66 MHz?
• Different RTOS. If we switch from using VRTX to VxWorks, will our system still work?
• Synchronization primitives. What are the trade-offs of using priority ceiling protocol (with more RTOS overhead) in our system vs. priority inheritance protocol (with more context switching and possible deadlock)?
• Scheduling algorithms. Can we use a nonpreemptive EDF algorithm instead of a preemptive algorithm to reduce operating system overhead?
• Jitter. If we make the deadline 2 msec before the end of the period to reduce jitter, will we miss any deadlines?
• Hard vs. soft real-time threads. If we allow thread B to miss a deadline, will the system work better? [23]
• Multiprocessor support. What happens if we move thread A to CPU 2?
• Communication. Will speeding up the serial communication bit rate from 33K bps to 56K bps improve the application? Or will the additional frequency of the serial port interrupt handler cause missed deadlines?
• I/O devices. What will be the effect of using the lower cost analog-to-digital converter which is 10 μsec slower for each conversion?

In addition, we will continually enhance the existing analytical models to better reflect the real characteristics of the target environment in order to improve the accuracy of analysis and prediction.
Automated Debugging and Fine Tuning
The ultimate objective of our research is to fully automate the debugging and fine tuning of embedded systems, such that the tool not only identifies problems, but can also suggest possible solutions to the designer. If we are successful in creating an accurate analysis and predictor tool as described above, then it follows logically that the procedure can be automated. For example, suppose a system is not meeting all of its deadlines, and a designer uses the prediction capabilities to find out which thread to optimize. The designer can make the query, "what if we reduce thread A's execution time by 5 percent, will the system then meet its deadlines?" They can repeat the query, asking about reducing execution time by 10 percent, 15 percent, etc. They can then do the same for thread B, then thread C.
The procedure can be automated, so that the tool automatically searches through all possible queries. In the worst case, an exhaustive search is used and a list of all possible solutions is provided. However, this may take days or weeks to execute and may suggest many solutions that are impractical. Thus, a key research challenge in automating the debugging and fine tuning is to reduce the search space to a reasonable size and to sort results by the practicality of implementing them. The designer can then be provided with a list of the top five solutions that can improve the system. The designer then picks the one that they feel most comfortable implementing based on their own experience and their knowledge of "hidden specifications" that the analysis and predictor tools might have overlooked. If the designer is not very experienced, they can sequentially try the proposed solutions and select the first one that works.
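A minimal sketch of such an automated query loop is given below. It is not the tool's algorithm: it assumes a pluggable schedulability test (here the EDF utilization bound, as a stand-in), tries reductions in 5 percent steps for one thread at a time, and ranks suggestions by the size of the reduction as a crude proxy for practicality. All names, including `suggest_reductions`, are hypothetical.

```python
from fractions import Fraction


def edf_schedulable(tasks):
    """Stand-in test: total utilization of {name: (C, T)} is at most 1."""
    return sum(Fraction(c) / t for c, t in tasks.values()) <= 1


def suggest_reductions(tasks, is_schedulable, step_pct=5, max_pct=50, top_n=5):
    """Search single-thread execution time reductions that fix the task set.

    For each thread, record the smallest reduction (in percent, in steps of
    step_pct) that makes the set schedulable, then return the top_n
    suggestions ordered by reduction size.
    """
    if is_schedulable(tasks):
        return []                      # nothing to fix
    suggestions = []
    for name, (c, t) in tasks.items():
        for pct in range(step_pct, max_pct + 1, step_pct):
            trial = dict(tasks)
            trial[name] = (Fraction(c) * (100 - pct) / 100, t)
            if is_schedulable(trial):
                suggestions.append((pct, name))
                break                  # smallest sufficient cut for this thread
    suggestions.sort()
    return suggestions[:top_n]


# Hypothetical threads, (C, T) in microseconds, at 110 percent utilization.
threads = {"A": (4000, 10000), "B": (9000, 20000), "C": (10000, 40000)}
for pct, name in suggest_reductions(threads, edf_schedulable):
    print(f"reduce thread {name} by {pct} percent")
```

Searching one thread at a time keeps the query count linear in the number of threads; combinations of threads, and a ranking that reflects real implementation effort, are where the search space reduction discussed above becomes necessary.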
David B. Stewart received the BEng degree, with great distinction, in computer engineering from Concordia University, Montreal, Canada. He received the MS and PhD degrees in electrical and computer engineering from Carnegie Mellon University, Pittsburgh, Penn., where he was also a member of the Robotics Institute. He is Chief Technology Officer and Executive Vice President at Embedded Research Solutions (ERS), a small consulting and contracting company that focuses on building small, reliable, precision real-time embedded systems. He is also an adjunct associate professor in the Department of Electrical and Computer Engineering at University of Maryland, College Park. Before joining ERS, he was the director of the Software Engineering for Real-Time Systems (SERTS) Laboratory and a full-time faculty member at UMD. His research at the university focused on combining software engineering and real-time system advances. He is best known for his pioneering work on developing dynamically reconfigurable software specifically for real-time systems. In 1991, he was a visiting researcher at the Jet Propulsion Laboratory, California Institute of Technology. As a result of this work, he received a NASA Class I Tech Brief Award. In the past, he was also the recipient of many awards, including the Natural Sciences and Engineering Research Council of Canada (NSERC) 1967 Science and Engineering Scholarship, the Canadian Foundation for the International Space University Scholarship, also sponsored by NSERC, the Chait Medal and Computer Engineering Medal from Concordia University, and the Prize of Excellence from the Quebec Order of Engineers. He is a member of the IEEE Computer Society, the ACM, and SAE, and has been active in meetings sponsored by these associations.
Gaurav Arora received the Bachelor of Engineering degree in electronics and communications engineering from the Birla Institute of Technology, Mesra, India, in 1994, and the MS degree in electrical engineering from the University of Maryland, College Park, in 1997. Currently, he is a senior member of the technical staff at Hughes Network Systems, Germantown, Maryland, where he is involved with the design and development of terrestrial and satellite digital television receivers. His research interests include performance monitoring and analysis of real-time embedded software and multimedia system design.
