Abstract-Energy consumption has become a major focus in the design of embedded systems (e.g., mobile computing and wireless communication devices). In particular, a shift of emphasis from hardware-oriented low-energy design techniques to energy-efficient embedded software design has occurred progressively in the past few years. To that end, various techniques have been developed for the design of energy-efficient embedded software. In operating system (OS)-driven embedded systems, the OS has a significant impact on the system's energy consumption directly (energy consumption associated with the execution of the OS functions and services), as well as indirectly (interaction of the OS with the application software).
I. INTRODUCTION
The complexity of embedded system software and the underlying hardware, tight performance and power budgets, and aggressive time-to-market schedules, usually necessitate the use of sophisticated runtime software support. In embedded systems where a single processor executes many different system tasks (each system task may be further divided into communicating processes), the use of an embedded operating system (OS) for runtime execution support is quite common. The use of an embedded OS significantly impacts both performance and energy consumption. Representative investigations of performance issues related to the use of an OS can be found in [1]-[4]. However, in modern microprocessors, where idle or sleep modes are usually exploited by software, performance optimization does not necessarily lead to energy optimization. Not much literature deals with energy consumption issues related to an embedded OS. The effect of a real-time OS on the energy consumption of embedded software was illustrated in [5] using µC/OS and SPARClite processor, where it was shown that the OS could consume a significant portion of the total energy consumption. However, this was just a first step. Much remains to be done to fully appreciate and model the impact of an embedded OS on software energy consumption.

Manuscript received March 6, 2002; revised October 17, 2002. This work was supported by DARPA under Contract DAAB07-00-C-L516. This paper was recommended by Associate Editor R. Gupta.
T. K. Tan and N. K. Jha are with the Department of Electrical Engineering, Princeton University, NJ 08544 USA (e-mail: tktan@ee.princeton.edu; jha@ee.princeton.edu).
A. Raghunathan is with NEC Laboratories America, Princeton, NJ 08540 USA (e-mail: anand@nec-labs.com).
Digital Object Identifier 10.1109/TCAD.2003.816207
In this paper, we present an energy simulation framework that enables simulation of an embedded Linux OS on a typical hardware platform based on the Intel StrongARM processor. We have also made this energy simulation framework available for public use (please visit http://www.ee.princeton.edu/~tktan/emsim to download our simulator, EMSIM). As described in Section III, such a simulation imposes special requirements in terms of modeling the functionalities of the hardware platform. Function and process-based energy accounting has been implemented to provide energy analysis capability, enabling users to identify specific energy hot spots in embedded software programs, and better guide code optimizations.
The remainder of the paper is organized as follows. In the next section, we present some previous work and highlight our contributions. Section III presents the design and validation of our simulator. Section IV describes a case study that demonstrates the usefulness of our energy simulation framework. Section V discusses several limitations and design issues pertaining to our simulator. Section VI presents the conclusion and future work.
II. RELATED WORK AND OUR CONTRIBUTIONS
In this section, we discuss related work and highlight our contributions.
A. Related Work
Many instruction-level energy simulators based on the instruction-level power modeling approach have emerged since its introduction in [6]. Applications of the instruction-level modeling technique include [5], [7]-[15], and enhancements of the basic ideas, leading to either more accurate or more efficient modeling techniques, are presented in [7], [16]-[18]. Mehta et al. [10] described an energy simulator based on the DLX architecture, which they used to evaluate several compiler optimization techniques for low-energy software. Li et al. [8] presented an energy simulation framework for an embedded hardware-software system based on Fujitsu's SPARClite processor. Simunic et al. [7] enhanced cycle-accurate models of an embedded system based on the StrongARM SA-1100 processor to estimate energy consumption and battery life. Such embedded system simulators could be used to perform design space exploration for low power. JouleTrack [11], a web-based energy simulation tool, performs energy estimation using models that separate the energy consumption into a first-order component that depends only on the operating frequency and voltage, and a second-order component that considers instruction statistics. Its authors also presented techniques to separate the energy consumption due to leakage from the energy consumption due to switching.
A complementary approach to instruction-level power modeling is through the use of structure-based power modeling. In this approach, the power consumption of the microprocessor in each execution cycle is estimated by summing up the power consumption of all the active functional blocks in that cycle. The total energy consumption of a program is obtained by accumulation of processor power consumption over the execution time. Some representative work in this area includes [13], [14], [19]-[21].

0278-0070/03$17.00 © 2003 IEEE
In order to estimate and minimize the energy consumption of embedded systems, it is necessary to combine the energy models for the processor with energy models for other (heterogeneous) subsystems and components (e.g., memory subsystem, system bus, display and man-machine interface, peripherals, network subsystem, power supply subsystem, etc.). Efforts in this direction were presented in [5] , [7] , [8] , [22] , [23] .
As pointed out earlier, the use of embedded operating systems has gained popularity in the implementation of modern embedded systems. Performance and energy analyses that consider the embedded OS are very useful in driving embedded software design. Rosenblum et al. [24] presented SimOS, a machine simulator that simulates the hardware of MIPS-based multiprocessors in enough detail to boot and run an essentially unmodified version of Silicon Graphics' IRIX OS. Though not directly applicable to embedded systems, SimOS was a major attempt to conduct a detailed performance analysis and characterization of an OS-included system. Energy analysis of a real-time OS was first performed by Dick et al. [5], who developed an energy simulation and analysis framework for µC/OS-SPARClite based embedded systems.
A similar work by Cignetti et al. [22] modeled the power consumption of the PalmOS family of devices based on discrete device states and provided an energy simulator based on this power model. Flinn et al. [25] introduced PowerScope as a hardware instrumentation tool to map energy consumption to program structures such as processes and procedures. Their method is intrusive in the sense that a software daemon called System Monitor is built into the processor under test to record the execution state upon trigger by a current measuring multimeter. A recent work by Acquaviva et al. [26] also employed a hardware instrumentation method to perform energy characterization related to the OS.
B. Paper Contributions
A first step toward designing energy-efficient OS-driven embedded systems is to develop methodologies to analyze and characterize the energy-consumption characteristics of the embedded OS. Software energy simulation tools can provide invaluable feedback to designers regarding the energy consumption of their software. While several energy-simulation tools have been proposed for a wide range of microprocessors, many of them are targeted at providing feedback from the hardware architecture standpoint. In other words, they are meant to support the evaluation of different processor or memory subsystem architectures. A few have been adapted to evaluate the software energy consumption. Even fewer can evaluate the energy consumption of embedded systems considering the effects of an embedded OS. An energy simulation framework can also provide the means to formulate a set of systematic guidelines to design embedded software for low energy consumption.
Our work makes the following contributions.
1) We provide an efficient and accurate energy-simulation framework that can be used to study the energy consumption of system software in relation to the application software, identify the energy hot spots, and perform software architecture optimizations that exploit a knowledge of the OS energy-consumption characteristics and system/application software interactions.
2) In order to provide intuitive feedback to embedded software designers, we have developed an energy accounting methodology that (nonintrusively) tracks the execution flow of the software being simulated, and associates energy consumption in each cycle to the appropriate function within an application process, or to the relevant OS function.
3) As Linux gains popularity as a general-purpose OS, its use in embedded systems is also increasing. We have modeled the StrongARM microprocessor and the required peripherals in sufficient detail so as to run Linux OS starting from system initialization and boot to the spawning and execution of multiple application processes. We have also included energy models for various components in the modeled system so as to enable energy simulation of the embedded software (system and application). As far as we know, this is the first energy-simulation framework for an embedded system featuring a StrongARM microprocessor and Linux OS. Our energy modeling and accounting methodology can be easily applied to other embedded system platforms.
4) We also present a detailed validation of our simulator, the corresponding error analysis, and a case study to demonstrate its usefulness.
III. DESIGN OF THE SIMULATOR
To enable an OS such as Linux to run in the simulator, we need to model the system hardware in sufficient detail, while maintaining high efficiency. Besides modeling the core components such as instruction decoder, pipeline, caches, and memory management unit (MMU), other hardware components essential for the functionality of an OS should also be modeled correctly. For example, we need to have timers to set the pace for task scheduling. We need to model universal asynchronous receiver/transmitters (UARTs), since Linux can use one of the serial ports as the console output. Most importantly, we also need to have an interrupt controller to service the interrupt requests from the timers, UARTs and the internal software interrupts.
In the following sections, we discuss various aspects of the design in detail.
A. Basic Features
The simulation framework is shown in Fig. 1. The right half of Fig. 1 shows the simulation models for all the subsystems featured in the simulation framework, whereas the left half shows the sequence of steps involved in using it. The simulator includes the following components:
1) a model for the StrongARM SA-1100 core, consisting of an instruction set simulator (ISS), simulation models for the D-cache and I-cache, and a memory management unit (MMU);
2) a simulation model for 32 MB of system memory;
3) a simulation model for an interrupt controller;
4) simulation models for two timers;
5) simulation models for two UARTs conforming to the Intel 8250 series.
By default, we direct the transmitter of UART0 to the host terminal. We also connect the transmitter/receiver of UART1 to a TCP socket on the simulation host, so that two instances of the simulator can communicate with one another through their respective UART1s. This feature allows simulations that involve multiple target embedded systems communicating with one another through their UARTs.
We have modeled each component in sufficient detail to enable a version of Linux OS to execute on it. The Linux OS that we have adapted for the purpose of simulation has the following features.
1) It is arm-linux version 2.2.2, configured for the EBSA-110 platform.
2) We have configured the arm-linux to mount an initial RAM disk (initrd) as the root file system. The user application code is preinstalled into the initrd. At the end of the kernel initialization steps, the user application program is loaded from the RAM disk.
3) We have also configured the arm-linux to use UART0 as the console. Output from the kernel as well as from the user application can, thus, be displayed on the host terminal since the transmitter of UART0 is redirected to the host terminal.
B. Usage of the Simulator
The flow diagram depicting the usage of the simulator is shown in the left part of Fig. 1. The boxes with dark shading represent the auxiliary tools employed in the flow, whereas the cylinders with light shading represent the objects that are being manipulated at various stages. Given the arm-linux source code, we generate the arm-linux object code using a cross-compiler. In parallel, we also generate the user application object code from the application source code using the cross-compiler. As mentioned in the previous section, we need to preinstall the user application code into the initrd. To do that, we use an initrd generator. The output of the initrd generator is a compressed image of the initrd. After that, a linker is used to link the arm-linux object code and the compressed initrd image into a final ROM image.
At the start, the simulator loads the ROM image into the memory. Note that the application object code is also loaded by the simulator, as depicted in the flow diagram. This is necessary even though the application is already preinstalled in the initrd because we need the function symbol information directly from the application code to enable process-level and function-level energy profiling.
C. Energy Modeling in the Simulator
We now describe the energy models used in the various components of the simulator. Most of the concepts presented in this section are adapted from previous work on software and system energy estimation. The description is included here for the sake of completeness. Since the main purpose of building the simulator is to analyze the energy consumption from the software perspective, we do not resort to analytical energy models as described in [8] . In fact, Simunic et al. [7] have demonstrated that reasonably accurate estimation (within 5% error) can be obtained even if the power models of the constituent components are inferred directly from the data-sheet information for the system components. In our work, we follow this philosophy for most components of the simulator except for the processor, for which we have obtained the StrongARM SA-1100 instruction-level energy models from [11] .
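The accumulated energy consumption of the embedded system for the duration of a program's execution is calculated as follows:

$$E_T = E_T^{\mathrm{proc}} + E_T^{\mathrm{idle}} + E_T^{\mathrm{mem}} + E_T^{\mathrm{uart}} + E_T^{\mathrm{peri}} \tag{1}$$

where the terms on the right-hand side are the accumulated energy consumption of the processor core while executing instructions, the processor in idle mode, the memory, the UARTs, and the other peripherals, respectively. The processor term is accumulated over the executed instructions as

$$E_T^{\mathrm{proc}} = \sum_i E_{\mathrm{instr},i} \times N_{\mathrm{cyc},i} \tag{2}$$

where $E_{\mathrm{instr},i}$ is the energy consumption per core cycle of instruction $i$, and $N_{\mathrm{cyc},i}$ is the number of cycles required to execute instruction $i$.

The instruction-level energy models for the StrongARM SA-1100 presented in [11] already include the energy contributions of all parts of the processor core, including cache, MMU, clock generator, timer, interrupt controller, etc. Following the original methodology described in [6], the instruction-level energy models are applicable when cache hits occur. The data provided in [11] are in the form of instruction current consumption $I_{\mathrm{instr}}$. To calculate the corresponding energy consumption per processor core cycle when a particular instruction executes, we use the following conversion formula:

$$E_{\mathrm{instr}} = I_{\mathrm{instr}} \times V_{dd} \times T_{\mathrm{cyc}} \tag{3}$$

where $V_{dd}$ is the processor supply voltage and $T_{\mathrm{cyc}}$ is the core clock period.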
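The conversion from instruction current to per-cycle energy, and its accumulation over an instruction trace, can be sketched as follows. This is a minimal illustration, not the simulator's code: the per-instruction currents, the mnemonics, and the trace are hypothetical placeholders rather than the measured data of [11]. The 1.5 V supply and the 206 MHz core clock are the figures quoted elsewhere in this paper.

```python
# Sketch: per-cycle instruction energy from current, E = I * Vdd * Tcyc.
V_DD = 1.5             # processor supply voltage in volts (value quoted for the Itsy board)
F_CORE = 206e6         # core clock frequency in Hz
T_CYC = 1.0 / F_CORE   # core clock period in seconds

# Hypothetical instruction currents in amperes (illustrative only, not from [11]).
I_INSTR = {"add": 0.180, "ldr": 0.200, "b": 0.170}


def energy_per_cycle(current_a):
    """Energy in joules drawn during one core cycle at the given current."""
    return current_a * V_DD * T_CYC


def processor_energy(trace):
    """Sum the energy over a trace of (mnemonic, cycles) pairs."""
    return sum(energy_per_cycle(I_INSTR[op]) * cycles for op, cycles in trace)


# Hypothetical trace: each entry is an instruction and the cycles it took.
trace = [("add", 1), ("ldr", 2), ("b", 3)]
total_j = processor_energy(trace)
```

A table lookup per instruction keeps the inner simulation loop cheap, which matters when the simulated workload includes an entire OS boot.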
Our simulator also supports StrongARM's idle mode. In the idle mode, the clock to the processor core is stopped, but all other on-chip resources, such as clock generator, timers, interrupt controller, UARTs, power manager, etc., are still active. When an interrupt occurs, the core is reactivated.
When the processor is running the Linux kernel without a workload, the idle task in the kernel kicks the processor into idle mode. From [27], we obtain the idle mode power consumption of the processor chip to be 65 mW for a clock frequency of 206 MHz. Since the clock generator is still running, we can calculate the total idle energy based on the clock tick

$$E_T^{\mathrm{idle}} = P_{\mathrm{idle}} \times T_{\mathrm{cyc}} \times N_{\mathrm{cyc}}^{\mathrm{idle}} \tag{4}$$

where idle power $P_{\mathrm{idle}}$ = 65 mW, $T_{\mathrm{cyc}}$ is the core clock period, and $N_{\mathrm{cyc}}^{\mathrm{idle}}$ is the total number of core clock cycles during which the processor stays idle. When cache misses occur, additional energy is consumed as a result of external bus and memory access. We address this next.
To calculate $E_T^{\mathrm{mem}}$, we assume that our memory part is the same as that used in the Compaq Western Research Lab's Itsy pocket computer v1.5 [27], [28]. The energy consumption of the memory part in each bus clock cycle can be extracted from the data given in [27]. Note that clock switching [29] is normally enabled by the arm-linux kernel; hence, the bus clock frequency is half of the core clock frequency. From [27], we estimate the memory energy consumption for each bus clock cycle to be 4.70 nJ (denoted $E_{\mathrm{mem}}$). Given that, the accumulated memory energy consumption is calculated as

$$E_T^{\mathrm{mem}} = E_{\mathrm{mem}} \times N_{\mathrm{cyc}}^{\mathrm{mem}} \tag{5}$$

where $N_{\mathrm{cyc}}^{\mathrm{mem}}$ is the number of memory cycles in which the memory part is active.
$E_T^{\mathrm{uart}}$ can be calculated using data from [27], where it was reported that the additional power consumption incurred when UARTs are enabled is fairly constant at approximately 44 mW. At a core clock frequency of 206 MHz, we estimate the energy consumption of the UART module in each core cycle to be 0.21 nJ (denoted $E_{\mathrm{uart}}$). Given $E_{\mathrm{uart}}$, we calculate $E_T^{\mathrm{uart}}$ as

$$E_T^{\mathrm{uart}} = E_{\mathrm{uart}} \times N_{\mathrm{cyc}}^{\mathrm{uart}} \tag{6}$$

where $N_{\mathrm{cyc}}^{\mathrm{uart}}$ is the total number of core cycles in which the UART is active. Note that the UART can be active independent of whether the processor core is in the idle mode.
Energy models for other peripherals can be added in the same manner as for the UART. That is, the accumulated peripheral energy consumption $E_T^{\mathrm{peri}}$ can be calculated as

$$E_T^{\mathrm{peri}} = E_{\mathrm{peri}} \times N_{\mathrm{cyc}}^{\mathrm{peri}} \tag{7}$$

where $E_{\mathrm{peri}}$ is the per-cycle energy consumption of the peripheral, and $N_{\mathrm{cyc}}^{\mathrm{peri}}$ is the total number of core cycles in which the peripheral is active.
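The per-component models of this section can be combined in a few lines. The sketch below uses the constants quoted above (65 mW idle power, 4.70 nJ per bus cycle, 44 mW of additional UART power, 206 MHz core clock) and assumes that the system total is the plain sum of the component energies. The function name and the cycle counts in the usage line are hypothetical.

```python
# Constants quoted in the text; the cycle counts passed in are hypothetical.
F_CORE_HZ = 206e6               # core clock frequency
T_CYC_S = 1.0 / F_CORE_HZ       # core clock period
P_IDLE_W = 65e-3                # idle-mode power of the processor chip
E_MEM_J = 4.70e-9               # memory energy per bus clock cycle
P_UART_W = 44e-3                # additional power when the UARTs are enabled
E_UART_J = P_UART_W * T_CYC_S   # UART energy per core cycle, about 0.21 nJ


def system_energy(e_proc_j, n_idle_cyc, n_mem_cyc, n_uart_cyc):
    """Return each component's accumulated energy and their sum, in joules."""
    parts = {
        "proc": e_proc_j,                          # from the instruction-level model
        "idle": P_IDLE_W * T_CYC_S * n_idle_cyc,   # idle-mode energy
        "mem": E_MEM_J * n_mem_cyc,                # memory energy
        "uart": E_UART_J * n_uart_cyc,             # UART energy
    }
    parts["total"] = sum(parts.values())           # assumed: total is the plain sum
    return parts


report = system_energy(e_proc_j=0.010, n_idle_cyc=2_060_000,
                       n_mem_cyc=100_000, n_uart_cyc=4_120_000)
```

Dividing 44 mW by 206 MHz reproduces the 0.21 nJ per-cycle UART figure, which is a quick sanity check on the units.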
D. Energy Accounting
Correct accounting for task and function energy in the application and system software is one of the challenging parts of the design of our simulation framework. The energy accounting mechanism of our simulator is task-based. In other words, we keep a separate energy balance sheet (called the task energy balance sheet, or TEBS) for each task running in the simulator. This is illustrated in Fig. 2 . At the beginning of kernel initialization, there is only one task, the idle-task. The name idle-task might be misleading since this task does a lot of the hardware initialization before spawning another task called init-task. The init-task continues with kernel software initialization, mounts a root file system near the end, and executes the first user-mode program, which can then spawn other application tasks. This sequence of events during the initialization is illustrated in Fig. 3 . The energy balance sheets in the simulator have a one-to-one correspondence with the tasks running in the simulator. At the beginning, there is only idle-task, so there is only one energy balance sheet. When the init-task is created, a new energy balance sheet is created and the energy value for the new balance sheet is initialized to zero. Whenever a context switch occurs, the simulator is able to detect the context switch nonintrusively and attribute the energy consumption of the embedded system to the running task accordingly. Therefore, at any point in time, the energy value in each energy balance sheet indicates the energy consumption of the embedded system due to the corresponding task. The sum of energy values from all the energy balance sheets equals the total energy consumption of the embedded system. Energy profiling for individual software functions is more involved. Basically, we maintain a function energy stack (FES) for each task. An FES mirrors the function call stack for the corresponding task. 
The only additional information stored in each slot of the FES is the energy value of the task at the time of entry into a function. When the function exits, the current value of task energy minus the energy value at the entry is the energy consumption of that particular function instance. Upon function exit, the corresponding FES slot is closed after storing the energy consumption value in a database.

Fig. 4. Two-task scenario used to illustrate the energy accounting mechanism.
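The bookkeeping described above can be sketched with one dictionary of task energy balance sheets and one stack per task. This is an illustrative model only: the class and method names are hypothetical, and the way the actual simulator detects context switches and function entry or exit is not shown.

```python
class EnergyAccountant:
    """Task energy balance sheets (TEBS) plus one function energy stack (FES) per task."""

    def __init__(self, first_task):
        self.tebs = {first_task: 0.0}   # task -> accumulated energy in joules
        self.fes = {first_task: []}     # task -> stack of (function, task energy at entry)
        self.current = first_task       # task that is currently running
        self.database = []              # closed slots: (task, function, energy)

    def consume(self, joules):
        """Charge energy (for example, one cycle's worth) to the running task."""
        self.tebs[self.current] += joules

    def context_switch(self, task):
        """Make `task` the running task, creating its sheet and stack if needed."""
        self.tebs.setdefault(task, 0.0)
        self.fes.setdefault(task, [])
        self.current = task

    def enter(self, function):
        """Open an FES slot that records the task energy at function entry."""
        self.fes[self.current].append((function, self.tebs[self.current]))

    def leave(self):
        """Close the top FES slot and store that function instance's energy."""
        function, entry = self.fes[self.current].pop()
        used = self.tebs[self.current] - entry
        self.database.append((self.current, function, used))


acct = EnergyAccountant("task1")
acct.enter("f1")
acct.consume(1.0)
acct.enter("f1a")
acct.consume(2.0)
acct.context_switch("task2")   # energy consumed now is charged to task2 only
acct.consume(5.0)
acct.context_switch("task1")
acct.consume(0.5)
acct.leave()                   # closes f1a
acct.consume(1.0)
acct.leave()                   # closes f1
```

Because each slot stores the task's own energy at entry, time spent in other tasks between entry and exit is excluded automatically, and a function's energy includes that of its callees.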
We further illustrate the energy accounting mechanism with an example. Consider the scenario depicted in Fig. 4. In this scenario, there are two tasks in the system, namely tasks 1 and 2. The functions contained in each task are also shown in Fig. 4. Each box represents a function. The tree structure connecting the boxes represents the function call hierarchy. For example, function f1() starts out by executing a sequence of instructions local to itself. It then calls function f1a(). When function f1a() returns, it calls function f1b(). When function f1b() returns, function f1() continues to execute more instructions local to itself, before coming to its end and returning to its parent function. Fig. 5 shows the execution trace of this system starting at some point in time t0, where the program counter is at the beginning of function f1(). The first two rows show the functions that tasks 1 and 2, respectively, are in at any point in time. Associated with each task is a TEBS. Hence, the next two rows show the plots (TEBS1 and TEBS2) for their respective TEBSs. During the time when task 1 is executing, only the value of TEBS1 is increasing, whereas the value of TEBS2 stays unchanged. When task 2 is executing (between t7 and t8), the value of TEBS1 stays unchanged, whereas the value of TEBS2 is increasing. Note that, when a context switch occurs at time t7, task 2 is entered at function f2aa(), because this is where it was switched out before time t0.
The mechanism used for function-level energy accounting is illustrated in the last two rows of Fig. 5. The FESs of tasks 1 and 2 are selectively shown for some points in time (FESs for some points in time are not explicitly shown for ease of illustration). Some of the changes to the FESs are explained as follows.
1) At time t0, function f1() is entered. The TEBS1 value e0 at this point in time is pushed into FES1.
2) At time t1, function f1() calls function f1a(). The TEBS1 value e1 at this point in time is pushed into FES1.
3) At time t2, function f1a() calls function f1aa(). The TEBS1 value e2 at this point in time is pushed into FES1. FES1 now contains three entries.
4) At time t3, function f1aa() calls function f1aaa(). The TEBS1 value e3 at this point in time is pushed into FES1. FES1 now contains four entries.
5) At time t4, function f1aaa() returns. At this point in time, the TEBS1 value is e4. The fourth entry in FES1 is closed, and the energy difference e4 - e3 is stored in the energy database as the energy consumption of this instance of function f1aaa().
6) At time t5, function f1aa() returns. At this point in time, the TEBS1 value is e5. The third entry in FES1 is closed, and the energy difference e5 - e2 is stored in the energy database as the energy consumption of this instance of function f1aa().
7) At time t6, function f1a() returns. At this point in time, the TEBS1 value is e6. The second entry in FES1 is closed, and the energy difference e6 - e1 is stored in the energy database as the energy consumption of this instance of function f1a().
8) Between t6 and t7, similar energy accounting operations are performed. Details are omitted in this discussion.
9) When the context is switched to task 2 at time t7, all of the subsequent FES operations are applied to FES2 instead.
Since task creation in the Linux OS is always accomplished by task cloning, the FES is also cloned when a task is created. However, the entry energy values for all of the slots in the new function energy stack are reset to zero, as is the energy value of the new energy balance sheet. This is necessary to avoid double counting.
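The cloning rule can be sketched in isolation. The helper below is hypothetical and uses plain dictionaries for the balance sheets and stacks. It copies the parent's call chain into the child but zeroes every entry value along with the child's balance sheet, so that energy already charged to the parent is not attributed to the child as well.

```python
def clone_task(tebs, fes, parent, child):
    """Create the child's TEBS and FES from the parent's, with energies reset to zero."""
    tebs[child] = 0.0
    # Same function call chain as the parent, but every entry energy starts at zero.
    fes[child] = [(function, 0.0) for function, _entry in fes[parent]]


# Hypothetical state at the moment the parent clones itself.
tebs = {"parent": 7.5}
fes = {"parent": [("main", 0.0), ("spawn_worker", 6.0)]}
clone_task(tebs, fes, "parent", "child")
```

With zeroed entries, a function that later returns in the child is charged exactly the energy the child has consumed since its creation.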
The simulated program triggers the end of simulation by setting the program counter to 0200000000. Once this happens, the simulator starts consolidating the FESs for all the tasks in the simulator. Basically, all of the slots in the FESs are closed and the partial energy consumptions of the functions are stored in the energy database. After that, the energy database is stored into a file for postprocessing.
E. Simulator Validation
Validation of our simulator consists of two parts. The first part validates the functionality of the simulator, while the second part validates the energy modeling accuracy.
1) Functional Validation:
For functional validation, we ran a few commonly known multitask programs to make sure that the outputs of the simulations were the same as those obtained from host executions, as well as execution on targets such as the Itsy handheld [28] . The multitask programs we tried include:
Dining philosopher: This program has five tasks to represent five philosophers and another five tasks to represent five forks available. Acquisition of a fork by a philosopher is accomplished through message passing from the philosopher task to the fork task.
Priority inversion: This program demonstrates the priority inversion problem inherent in the Linux OS. It consists of three tasks. It makes calls to the sleep() library function and also tests the idle mode of the simulator.
Time-of-flight: The purpose of this program is to measure the time it takes for a message to be transferred across a message queue. The total time includes the time to make the call to msgsnd(), the time for the system to transfer the message, the time for context switch, and finally, the time for the other end to call msgrcv().
Two robots: This test uses two copies of the simulator linked together by a null-modem through their UARTs. Programs running in the two simulators emulate two robots exchanging data.
2) Energy Modeling Accuracy:
We also performed experiments targeted at validating the energy modeling accuracy of the simulator. We ran a few example programs both in the simulator and on an Itsy evaluation board [27] , [28] . The Itsy board has more components than what we have modeled in the simulator, hence, the total power is not comparable. However, since it also provides a built-in measurement circuit to allow convenient measurement of the current drawn from the 1.5 V supply by the StrongARM processor chip alone, we only make comparisons of the processor energy consumption.
It should be noted that validation of power consumption estimates is different from the validation of the energy consumption estimates. Achieving good power consumption estimates by structure-based power modeling of the microprocessor is not trivial because all of the modules in the microprocessor need to be properly accounted for in order to achieve good accuracy, and the internal hardware structure of commercial processors is not easily available. On the other hand, achieving good power consumption estimates by instruction-based power modeling is relatively easy, because the power numbers for the whole microprocessor are originally obtained from measurements. In fact, as long as the power numbers do not vary too much from instruction to instruction, which is true for StrongARM, the average power consumption estimate will typically be close to the measured power consumption during the validation. Since our simulator's power modeling is instruction-based, we do not present any power consumption validation, since such validations have been presented in previous work [11] . Energy consumption validation, however, is not trivial even for instruction-based power modeling. Besides accurate power modeling, which is the norm for instruction-based power modeling, accurate execution cycle accounting is also required to achieve good accuracy in the overall energy consumption estimates.
For our experiments, we built an energy measurement setup as illustrated in Fig. 6. The measurement setup consists of a host computer running National Instruments' data acquisition software, LabVIEW [30], a data acquisition card (PCI-6035E DAQ card), and a wire connector box (SCB-68). What we measure directly is the voltage across a sense resistor Rs, which is connected in series with the power supply.
The Itsy evaluation board contains a built-in sense resistor, whose value is sufficiently small that the measurement can be considered nonintrusive. From the measured voltage $V_s$, the power supplied to the processor chip is calculated as

$$P_s(t) = \frac{V_s(t)}{R_s} \times V_{dd} \tag{8}$$

where $V_{dd}$ is its supply voltage. At the beginning of each program, we insert code to set a general-purpose input-output (GPIO) pin to HIGH. At the end of the run of the program, we have code to reset the GPIO pin to LOW. We use the GPIO signal $S(t)$ as a gating window for the integration of power $P_s(t)$. The energy consumption of the program is calculated by

$$E_m = \int S(t) \, P_s(t) \, dt \tag{9}$$

The gating signal $S(t)$, which is 1 while the GPIO pin is HIGH and 0 otherwise, ensures that the integration accumulates the power consumption only when the specific programs of interest are running.
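On sampled data, the gated integration reduces to a sum. The sketch below assumes uniformly spaced samples, a rectangle-rule approximation, and a chip power equal to the sense-resistor current (sense voltage divided by Rs) times the supply voltage. The function name, the resistor value, and the sample values are hypothetical and are not taken from the measurement setup.

```python
def gated_energy(v_s, gate, r_s, v_dd, dt):
    """Approximate the integral of S(t) * P_s(t) dt from uniformly spaced samples.

    v_s  : sense-resistor voltage samples in volts
    gate : GPIO gating samples, 1 while the program of interest runs, else 0
    r_s  : sense resistance in ohms
    v_dd : processor supply voltage in volts
    dt   : sampling interval in seconds
    """
    energy = 0.0
    for v, s in zip(v_s, gate):
        power = (v / r_s) * v_dd   # current through Rs times the supply voltage
        energy += s * power * dt   # samples outside the gating window add nothing
    return energy


# Hypothetical samples: the program runs during the two middle samples only.
e_measured = gated_energy(v_s=[0.002, 0.004, 0.004, 0.002],
                          gate=[0, 1, 1, 0],
                          r_s=0.02, v_dd=1.5, dt=1e-3)
```

Gating in software after acquisition means the same recording can be reused even if the window edges need to be adjusted later.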
With this measurement setup, we conducted experiments for a set of eight example programs. Short descriptions of these programs are provided in Table I. Note that these programs differ in length and instruction mix. Fig. 7 shows the simulated versus measured energy consumption for these programs. In order to evaluate the accuracy quantitatively, we perform the following analysis. First, we describe the relationship between the simulated energy and the measured energy by an equation of the form

$$E_s = (\alpha + \epsilon) E_m \tag{10}$$

where $\alpha$ is a fixed coefficient representing the systematic correlation between the simulated energy $E_s$ and the measured energy $E_m$, and $\epsilon$ represents the statistical fluctuation. $1 - \alpha$ is then the systematic linear error, or the absolute error. $\alpha$ is calculated by the following equation:

$$\alpha = \frac{1}{N} \sum_{i=1}^{N} \frac{E_{s,i}}{E_{m,i}} \tag{11}$$

where $N$ is the number of example programs used in the validation, and $E_{s,i}$ and $E_{m,i}$ are the simulated and measured energy of program $i$. The statistical fluctuation $\epsilon$ can be characterized by its standard deviation $\sigma$, which is given by

$$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \frac{E_{s,i}}{E_{m,i}} - \alpha \right)^2} \tag{12}$$

Calculation based on the data shown in Fig. 7 reveals that $\alpha = 0.81$ and $\sigma = 0.04$ for our simulator. Though the simulated energy does not match the measured energy exactly, with a systematic scaling error of 19%, a good correlation between the two is observed since the relative error is only 4%. In other words, the prediction of our simulator is quite stable over different program execution lengths and instruction mixes. We believe that this level of relative accuracy is sufficient to enable software architectural exploration.
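The accuracy analysis can be reproduced on any set of simulated and measured energies with a few lines of code. The sketch below rests on two assumptions: the systematic coefficient is taken as the mean of the per-program ratios of simulated to measured energy, and the fluctuation as their population standard deviation. The energy values are hypothetical and are not the data of Fig. 7.

```python
import math


def fit_statistics(simulated, measured):
    """Return (alpha, sigma) for paired simulated and measured energies.

    alpha : mean of the per-program ratios E_s / E_m (assumed definition)
    sigma : population standard deviation of those ratios (assumed definition)
    """
    ratios = [s / m for s, m in zip(simulated, measured)]
    n = len(ratios)
    alpha = sum(ratios) / n
    sigma = math.sqrt(sum((r - alpha) ** 2 for r in ratios) / n)
    return alpha, sigma


# Hypothetical energies in joules for three programs.
alpha, sigma = fit_statistics(simulated=[0.8, 1.7, 3.2], measured=[1.0, 2.0, 4.0])
```

A small sigma with alpha well below 1 means the simulator is consistently off by a scale factor, which still preserves the ranking of design alternatives.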
IV. CASE STUDY
In this section, we describe a case study to demonstrate the utility of the energy simulation framework. This case study is taken from an example in [31] .
A. Background
Situational awareness subsystems are software subsystems used to monitor the biometric parameters of the subjects they are attached to. They could be installed on animals to collect biometric data related to the subjects' behavior. They could also be installed on firemen, soldiers, etc. Since these devices are usually mobile and battery-powered, they should be designed with low energy consumption in mind.
Development of software for an embedded system like the one above usually starts with a specification in the form of a data-flow diagram. A data-flow diagram models the functionality of the situational awareness subsystem. Implementation of the data-flow diagram as an actual embedded software program first requires refining this specification to a realizable software design, and subsequently implementing the design by performing the actual coding [32] . In the following, we present two designs of this subsystem and compare their energy consumption.
B. Alternative Software Designs
Consider an initial software design of the embedded situational awareness subsystem. The design is represented as a process configuration graph (PCG), as shown in Fig. 8. This design is the result of a one-to-one mapping from the data-flow diagram to the PCG. In the PCG, vertices represent software or hardware entities and are of two types.
• Software processes are represented by an ellipse with an arrow on the perimeter (e.g., process read_heart_rate in Fig. 8).
• Hardware devices are represented by a square box, which is optionally checked to indicate an active device (e.g., the UART peripheral in Fig. 8). An active device initiates data transfer spontaneously. A timer is a special active device that spontaneously initiates data (signal) transfer at regular intervals. A passive device, on the other hand, waits for the processor to initiate a data transfer.
Arrowed edges between any two vertices represent the communication of data or control messages. A small solid square at the termination of an arrowed edge indicates a blocking communication. Otherwise, the communication is assumed to be nonblocking. Communication between processes read_heart_rate, read_resp_rate, and take_picture and their respective devices is blocking because the cardiac monitor, respiration sensor, and camera interface are slow, passive devices. That is, after getting a read request from the processor, they require a significant amount of time (> 10 ms) to respond with the corresponding data.
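The PCG notation described above maps naturally onto a small graph data structure. The following is a minimal sketch, assuming a representation with typed vertices and edges that carry a blocking flag. The class names (`Vertex`, `Edge`, `PCG`), the method names, and the particular edges added below are hypothetical; in particular, the edges do not reproduce the topology of Fig. 8.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Vertex:
    name: str
    kind: str              # "process" or "device"
    active: bool = False   # only meaningful for devices

@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    blocking: bool = False  # a solid square at the edge end in the figure

@dataclass
class PCG:
    vertices: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_vertex(self, name, kind, active=False):
        self.vertices[name] = Vertex(name, kind, active)

    def add_edge(self, src, dst, blocking=False):
        if src not in self.vertices or dst not in self.vertices:
            raise KeyError("both endpoints must be declared first")
        self.edges.append(Edge(src, dst, blocking))

    def blocking_processes(self):
        """Processes taking part in at least one blocking communication."""
        names = set()
        for e in self.edges:
            if e.blocking:
                for n in (e.src, e.dst):
                    if self.vertices[n].kind == "process":
                        names.add(n)
        return sorted(names)

# Illustrative fragment only; the edge set is assumed, not taken from Fig. 8.
pcg = PCG()
pcg.add_vertex("timer", "device", active=True)
pcg.add_vertex("cardiac_monitor", "device")
pcg.add_vertex("battery_gauge", "device")
pcg.add_vertex("read_heart_rate", "process")
pcg.add_vertex("read_battery_gauge", "process")
pcg.add_edge("timer", "read_heart_rate")
pcg.add_edge("cardiac_monitor", "read_heart_rate", blocking=True)
pcg.add_edge("battery_gauge", "read_battery_gauge")

print(pcg.blocking_processes())
```

Keeping the blocking attribute on the edge rather than on the vertex mirrors the notation, where the same process may communicate with some neighbors in a blocking manner and with others in a nonblocking manner.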
The design in Fig. 8 is just a straightforward mapping from the data-flow diagram to the PCG. Knowing that the battery gauge and the thermometer are fast passive devices, we know that processes read_battery_gauge and read_thermometer can be merged into one without having any negative impact on performance. An improved design that reflects this optimization is shown in Fig. 9.
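The reasoning behind the merge can be expressed as a simple rule of thumb. The sketch below assumes that two processes are merge candidates when each of them talks only to a fast device, using the 10 ms figure mentioned earlier as the boundary between fast and slow. The rule itself, the latency values, and the process-to-device mapping are assumptions made for illustration; only the fast or slow classification of each device comes from the text.

```python
SLOW_THRESHOLD_MS = 10.0  # devices needing more than this are treated as slow

# Hypothetical response latencies in ms.
DEVICE_LATENCY_MS = {
    "battery_gauge": 0.1,
    "thermometer": 0.1,
    "cardiac_monitor": 50.0,
}

# Assumed one-to-one mapping between reader processes and devices.
PROCESS_DEVICE = {
    "read_battery_gauge": "battery_gauge",
    "read_thermometer": "thermometer",
    "read_heart_rate": "cardiac_monitor",
}

def can_merge(proc_a, proc_b):
    """True if both processes only wait on fast devices (assumed rule)."""
    return all(
        DEVICE_LATENCY_MS[PROCESS_DEVICE[p]] <= SLOW_THRESHOLD_MS
        for p in (proc_a, proc_b)
    )

def merge(proc_a, proc_b):
    """Return the merged process name and the devices it must service."""
    if not can_merge(proc_a, proc_b):
        raise ValueError("merging would serialize a slow device read")
    return f"{proc_a}+{proc_b}", [PROCESS_DEVICE[proc_a], PROCESS_DEVICE[proc_b]]

print(merge("read_battery_gauge", "read_thermometer"))
print(can_merge("read_battery_gauge", "read_heart_rate"))
```

The intuition is that a merged process reads its devices one after another, so a slow device would stall the other read, whereas two fast reads can share one process and save the overhead of a separate process.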
Both designs are implemented in C to run under the Linux OS. Our energy simulation framework is used to evaluate the energy consumption of both designs to see how much we gain from the optimization.
C. Software Energy Data
Our energy simulation framework simulates both designs and reports their energy consumption characteristics. Besides the total energy, the energy breakdown among the software processes is also reported. These energy data are shown in Table II. From Table II, we can compute the energy reduction to be 1.634 mJ. From the energy breakdown among all processes, we can also deduce that the major energy consumer is the send_status process. The next step may involve a careful look at the design of send_status to explore further improvement.
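The kind of analysis applied to the per-process breakdown can be sketched as follows. The function name `compare_designs`, the merged process name, and all energy values are hypothetical placeholders; they are not the contents of Table II and do not reproduce the 1.634 mJ figure.

```python
def compare_designs(before, after):
    """Return (absolute reduction, relative reduction, top consumer after).

    Both arguments map process names to energy in mJ.
    """
    total_before = sum(before.values())
    total_after = sum(after.values())
    reduction = total_before - total_after
    top_consumer = max(after, key=after.get)
    return reduction, reduction / total_before, top_consumer

# Hypothetical per-process energies in mJ.
initial_design = {
    "send_status": 6.0,
    "read_heart_rate": 2.0,
    "read_battery_gauge": 1.0,
    "read_thermometer": 1.0,
}
improved_design = {
    "send_status": 6.0,
    "read_heart_rate": 2.0,
    "read_battery_and_temp": 1.5,  # hypothetical name for the merged process
}

reduction, fraction, top = compare_designs(initial_design, improved_design)
print(f"reduction={reduction:.3f} mJ ({fraction:.1%}), top consumer={top}")
```

Reporting the top consumer alongside the total makes the next optimization target explicit, which is how the breakdown is used in the text to single out send_status.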
V. LIMITATIONS
We believe the energy simulation framework presented in this paper is powerful enough to be useful in many areas involving energy analysis of OS-driven embedded software. However, some of its limitations and design issues are still worth mentioning. 
A. Context Switch Detection
The energy accounting mechanism employed by the simulator is general enough to be applicable across different OSs. That is, an instruction-level energy simulator supporting another OS, instead of Linux, can employ the same energy accounting mechanism. However, since the detection of a context switch relies on the execution of a special function in Linux, the implementation of this aspect of energy accounting is not totally OS independent. A future version of arm-linux may employ a different mechanism to perform a context switch. In that case, the simulator needs to be slightly modified to cope with the changes. However, we do not anticipate this to require significant effort.
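The dependence on a specific kernel function can be seen in a toy version of task-level energy accounting. This is a minimal sketch, assuming that the simulator watches for the program counter reaching the entry of the kernel's context switch routine and then changes the task to which subsequent energy is charged. The address `CONTEXT_SWITCH_PC`, the trace format, and the `next_task` field (standing in for whatever the simulator would read from processor state) are all hypothetical.

```python
from collections import defaultdict

# Hypothetical address of the kernel routine that performs the switch.
CONTEXT_SWITCH_PC = 0xC0010000

def account_energy(trace, initial_task):
    """Charge each instruction's energy to the task running at that time.

    trace: iterable of (pc, energy_nj, next_task) tuples; next_task is only
    consulted when pc equals CONTEXT_SWITCH_PC.
    """
    energy = defaultdict(float)
    current = initial_task
    for pc, energy_nj, next_task in trace:
        energy[current] += energy_nj       # charge the running task
        if pc == CONTEXT_SWITCH_PC:
            current = next_task            # later instructions go to the new task
    return dict(energy)

# Hypothetical four-instruction trace with one context switch.
trace = [
    (0x8000, 2.0, None),
    (0x8004, 2.0, None),
    (CONTEXT_SWITCH_PC, 3.0, "task_b"),
    (0x9000, 1.0, None),
]
totals = account_energy(trace, "task_a")
print(totals)
```

Only the comparison against `CONTEXT_SWITCH_PC` is OS specific in this sketch; the accumulation logic is unchanged if the switch is detected some other way, which matches the claim that a port would need only a small modification.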
B. Asynchronous Operation
Our energy accounting mechanism only recognizes the notion of tasks and functions. Both of these are general conceptual entities present in many OSs. Therefore, the energy accounting mechanism is applicable across many different OSs. However, an important implication of this mechanism is its inability to trace asynchronous work flow taking place in the OS. Consider the following scenario.
A task requests a device to transmit some data, and the OS function handling this request buffers it for later processing, which may take place while another task is running. Our task-and-function-based energy accounting mechanism will not be able to recognize that the actual work done should be attributed to the task requesting it.
Fixing this problem requires the simulator to trace the flow of asynchronous operation at the instruction execution level. With the current complexity of state-of-the-art processors, this is difficult. Since the simulator is an instruction-level simulator, it only sees instructions. To trace the flow of the operation, it needs to be able to recognize high-level data structures in Linux. A simulator can do this only if it itself is an emulator for Linux. That is, it does not run Linux instruction by instruction, but emulates the operation of Linux. Therefore, although this issue is indeed a limitation of our simulator, it is also a limitation of all instruction-level energy simulation frameworks.
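The misattribution in the scenario above can be made concrete with a toy event log. In this sketch, each event records the task that is running, the task on whose behalf the work is done, and the energy spent. The second field is exactly the information an instruction-level simulator does not have; it is included here only to show the difference. The function name, task names, and energy values are hypothetical.

```python
def attribute(events):
    """Return (energy by running task, energy by requesting task).

    events: iterable of (running_task, requester, energy_uj) tuples.
    """
    by_running = {}
    by_requester = {}
    for running_task, requester, energy_uj in events:
        by_running[running_task] = by_running.get(running_task, 0.0) + energy_uj
        by_requester[requester] = by_requester.get(requester, 0.0) + energy_uj
    return by_running, by_requester

# Hypothetical events, energies in microjoules.
events = [
    ("task_a", "task_a", 1.0),  # task_a issues the request; the OS only buffers it
    ("task_b", "task_b", 4.0),  # task_b's own computation
    ("task_b", "task_a", 5.0),  # deferred transmission runs while task_b is scheduled
]

by_running, by_requester = attribute(events)
print("charged by running task:   ", by_running)
print("charged by requesting task:", by_requester)
```

Accounting by running task, which is what a task-and-function-based mechanism does, charges the deferred transmission to task_b even though task_a requested it.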
C. Energy Estimation Error
While validating the energy estimation accuracy of our simulator against hardware measurements (presented in Section III-E2), we reported an absolute error of 19% and a relative error of 4%. The absolute error may be large for some applications. However, since the main purpose of constructing the simulator is OS-driven software energy analysis, where relative accuracy is more important, we believe such a limitation is acceptable.
The accuracy of energy modeling rests heavily on a proper estimation of cycle counts. Unlike structure-based simulators such as Wattch [20], which simulate the actual details of a processor microarchitecture, our simulator is instruction-based. In other words, we estimate cycle counts based only on the timing information provided in the StrongARM SA-1100 reference manual [29]. This may lead to some cycle-count error and, hence, energy estimation error.
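How a cycle-count error carries over into an energy error can be illustrated with a deliberately simple model. The sketch below assumes a table of cycles per instruction class and a constant average power, so that energy is proportional to the cycle count. This is not the energy model of the simulator; the instruction classes, cycle values, clock frequency, and power figure are all hypothetical.

```python
import math

# Hypothetical timing table: cycles per instruction class.
CYCLES_PER_CLASS = {"alu": 1, "load": 2, "branch_taken": 3}

CLOCK_HZ = 200e6        # hypothetical clock frequency
AVERAGE_POWER_W = 0.5   # hypothetical constant average power

def estimate(instr_counts, cycle_table=CYCLES_PER_CLASS):
    """Return (total cycles, energy in joules) for the given instruction mix."""
    cycles = sum(cycle_table[cls] * n for cls, n in instr_counts.items())
    energy_j = cycles / CLOCK_HZ * AVERAGE_POWER_W
    return cycles, energy_j

mix = {"alu": 100, "load": 50}
cycles, energy_j = estimate(mix)

# Suppose loads really take 3 cycles (e.g., stalls the table does not capture).
true_table = dict(CYCLES_PER_CLASS, load=3)
true_cycles, true_energy_j = estimate(mix, true_table)

cycle_error = (cycles - true_cycles) / true_cycles
energy_error = (energy_j - true_energy_j) / true_energy_j
print(f"cycle error={cycle_error:.1%}, energy error={energy_error:.1%}")
```

Under this proportional model, the relative energy error equals the relative cycle-count error, which is why a timing table taken from a reference manual bounds the achievable absolute accuracy.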
Fixing the above problem requires high-fidelity modeling of the StrongARM's microarchitecture. However, this is not trivial. Simplescalar-ARM [33] does not exactly emulate the StrongARM SA-1100's five-stage pipeline. It uses parametric adjustments to get the cycle-count estimate close to actual measurements. Given the information available from the StrongARM SA-1100 reference manual [29], we have tried to emulate many aspects (e.g., virtual memory system, caches) of the StrongARM microarchitecture as accurately as possible. However, our simulator is not a high-fidelity model of the StrongARM SA-1100 pipeline. Although we could perform parametric adjustments to reduce the absolute error (as is always possible for all systematic 
VI. CONCLUSION AND FUTURE WORK
In this paper, we presented an energy simulation framework for an embedded system consisting of a StrongARM microprocessor, memory, and the required peripherals. The framework is detailed enough so that we can run the Linux OS starting from initialization to the spawning of multiple application tasks. The energy simulation framework has been validated for both its functionality and its modeling accuracy. The simulator paves the way for future work in software architectural exploration for low energy consumption.
