Hsin-Ming Chen

Andes Technology
Ing-Jer Huang
National Sun Yat-Sen University

AN IN-CIRCUIT EMULATOR (ICE) is part of the development environment for a microprocessor- or microcontroller-based system, called a target system. (We use the terms microprocessor and microcontroller interchangeably unless we need to differentiate them.) While retaining the same functionality and physical features as the original microprocessor, the ICE provides extra debug and test mechanisms to support designers in the test, development, debug, and maintenance of target systems' hardware and software. These mechanisms include single stepping, breakpoint setting and detection, tracing, and internal resource monitoring and modification.
Traditionally, designers have used an ICE mainly when debugging a microprocessor-based system design at the PCB (printed circuit board) level, as Figure 1 shows. (Although test, development, debug, and maintenance are different activities, they involve similar operations. We use the term debug to include all these activities unless there is a need to differentiate them.) To debug the system, designers pull the target microprocessor chip out of its slot on the board and insert the ICE into the slot to act as the target microprocessor. 1 The host computer's software controls the ICE's operation via a communication channel. After debug is complete, the designer disconnects the ICE and places the original target microprocessor chip back in its slot. In this scenario, the ICE functions only during debug and doesn't exist in the final product. The ICE's cost and performance affect the development system but not the final product.
In the SoC era, however, the ICE no longer plays a negligible role. Responding to the needs of higher performance, more functionality, and higher integration levels, manufacturers are permanently embedding an ICE with the microprocessor core in the final product. For example, in microprocessors developed by ARM 2 and IBM, 3 there is no way to remove the ICE from the chip. ICE performance, cost, power consumption, test and debug support, and hardware-software interfacing have become important considerations in microprocessor-based platform design.
Thus, it has become necessary to comprehensively investigate the effects of embedding an ICE in a microprocessor core. Unfortunately, the design of ICEs has mainly followed an ad hoc approach. Architecture platforms, hardware-software interfaces, and operating methodologies vary widely among ICEs for different microprocessors. Furthermore, many ICEs are proprietary commercial products for which in-depth design information is unavailable. Most available information is in the form of user manuals or application notes, which provide very limited design information. Therefore, it is difficult to perform a fair comparison of on-chip debug approaches and select appropriate approaches for future designs under various application requirements.
In-circuit emulators have become part of the permanent structure of microprocessor cores to support on-chip test and debug activities in highly integrated environments such as SoCs. However, ICE design styles and operation principles are quite diverse. This article presents a taxonomy based on the notions of foreground and background operations and hardware-software implementation alternatives to organize existing in-circuit emulation approaches.
In this article, our goal is to demystify ICE designs and their impact on the SoC environment. We classify existing ICE approaches, identify a basic design for each major category, and show how to instantiate it with an ARM7-based microprocessor. Finally, we conduct experiments to quantitatively analyze the hardware, software, and operational features of these on-chip debug approaches and draw conclusions about their applications in embedded-system design.
Classification of in-circuit emulation approaches
We divide in-circuit emulation operations into two modes: background debug mode (BDM) and foreground debug mode (FDM). In BDM, the user program executes normally, except that the ICE is active at the same time to monitor system status for trigger conditions such as timer timeout, breakpoint and watchpoint matching, single stepping, and trace buffer full. (Although breakpoints, watchpoints, single stepping, and traces are different activities, they can be implemented with similar basic operations. To simplify our discussion, we focus on the breakpoint activity.) Once a trigger condition is met, the operation mode switches into FDM, in which the ICE, rather than the user program, takes control of the system.
In FDM, while the user program is halted, the ICE can observe or configure the microprocessor's internal system status, including memory, registers, and other control or I/O signals. Alternatively, the ICE can communicate with the host computer to receive debug commands from the host and execute them or send back the internal system status to the host through a communication channel. Finally, the ICE can switch the operation mode back to BDM to resume user program execution.
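The switching between the two modes can be pictured as a two-state machine. The following is only a minimal sketch under our own naming; the class `IceModeController` and its methods are hypothetical and do not correspond to any particular ICE.

```python
from enum import Enum


class Mode(Enum):
    BDM = "background debug mode"
    FDM = "foreground debug mode"


class IceModeController:
    """Toy model of the BDM/FDM switch (all names are illustrative)."""

    def __init__(self):
        # The system starts in BDM with the user program running.
        self.mode = Mode.BDM
        self.user_program_running = True

    def on_trigger(self):
        """A trigger condition (for example, a breakpoint match) was detected."""
        if self.mode is Mode.BDM:
            self.mode = Mode.FDM
            self.user_program_running = False  # the ICE takes control

    def resume(self):
        """The ICE hands control back to the user program."""
        if self.mode is Mode.FDM:
            self.mode = Mode.BDM
            self.user_program_running = True
```

A trigger in BDM halts the user program and enters FDM; a resume request in FDM returns to BDM. Calls made in the wrong mode are simply ignored in this sketch.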
We can refine these two modes into more sophisticated debug modes. For example, in one variation of FDM, the user program can continue execution within a limited and safe context instead of halting completely, while the ICE communicates with the host. (Because of space limitations, we don't go into such details here.)
Both modes can be implemented with either software or hardware. Therefore, we can place all possible in-circuit emulation approaches into the four classes listed in Table 1. The software emulation class uses the all-software approach for both FDM and BDM, and hardware emulation uses the all-hardware approach. Hybrid emulation 1 uses software for FDM and hardware for BDM, and hybrid emulation 2 uses hardware for FDM and software for BDM. The table uses the notations F and B for foreground and background, and S and H for software and hardware. Thus, FSBH indicates a software foreground and a hardware background implementation and represents hybrid emulation 1.
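The F/B and S/H notation can be enumerated mechanically. The short sketch below builds the four class labels from the two implementation choices; the function and variable names are ours and serve only as illustration.

```python
from itertools import product

# Class names keyed by (FDM implementation, BDM implementation).
CLASS_NAMES = {
    ("S", "S"): "software emulation",
    ("H", "H"): "hardware emulation",
    ("S", "H"): "hybrid emulation 1",
    ("H", "S"): "hybrid emulation 2",
}


def classify(fdm, bdm):
    """Return (label, class name) for FDM/BDM choices given as 'S' or 'H'."""
    return f"F{fdm}B{bdm}", CLASS_NAMES[(fdm, bdm)]


# All four combinations of software/hardware for foreground and background.
ALL_CLASSES = [classify(f, b) for f, b in product("SH", repeat=2)]
```

For instance, `classify("S", "H")` yields the label FSBH together with the name hybrid emulation 1, matching the example in the text.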
BDM operations
BDM includes two major tasks: detecting trigger conditions while the user program is executing, and suspending the user program and switching to FDM when the conditions are met.
Software BDM approaches
The two basic approaches to BDM software implementation are instrumenting and single stepping. Instrumenting creates a specialized version of the user program. Programmers construct this version by patching special instructions into the target locations of the original user program. Executing these instructions raises a software interrupt that transfers control to an exception handler (also called an interrupt service routine), causing the processor to enter FDM immediately. Alternatively, the instrumented program can perform simple condition checking at the cost of performance overhead. If the conditions are met, the processor enters FDM; otherwise, it resumes user program execution. This pure software approach can be implemented in almost all microprocessors. Its disadvantage is that the instrumented program is different from the original user program; a bug-free instrumented user program doesn't guarantee a bug-free original user program.
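To make the instrumenting idea concrete, the sketch below models a program as a list of instruction mnemonics and patches a placeholder software-interrupt instruction into the breakpoint locations. It is a simplified simulation under our own assumptions (straight-line code, a symbolic `SWI` marker, hypothetical function names), not the mechanism of any specific debugger.

```python
SWI = "SWI"  # stand-in for the processor's software-interrupt instruction


def instrument(program, breakpoints):
    """Return a patched copy of the program and the overwritten instructions."""
    patched = list(program)          # the original program is left untouched
    saved = {}
    for addr in breakpoints:
        saved[addr] = patched[addr]  # remember what was overwritten
        patched[addr] = SWI
    return patched, saved


def run_until_break(patched, start=0):
    """Walk straight-line code; return the address of the first SWI hit."""
    for pc in range(start, len(patched)):
        if patched[pc] == SWI:
            return pc                # the exception handler would enter FDM here
    return None                      # the program ended without a breakpoint


def restore(patched, saved):
    """Put the original instructions back at the patched locations."""
    for addr, insn in saved.items():
        patched[addr] = insn
    return patched
```

The saved instructions show why the instrumented program differs from the original: the code that actually runs contains the patched instruction, and the original one must be restored or emulated before execution can continue past the breakpoint.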
The single-stepping approach can be used in microprocessors, such as Intel's x86 microprocessors, that support the single-stepping exception. 7 This single-stepping mechanism, once enabled, causes an exception to be raised after the execution of each instruction in the user program, and the corresponding exception handler takes control. The exception handler then activates the software monitor to check trigger conditions and decide whether to execute the next instruction in the user program or to switch to FDM. This approach's advantage is that it achieves debug without instrumenting the user program. On the other hand, debugging with this approach is very slow, so it is infeasible for even a medium-size user program. To avoid this problem, the user program must be instrumented to turn on the single-stepping mechanism only within a small range in the program.
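The cost of single stepping comes from invoking the handler once per instruction. The toy simulation below counts those invocations for straight-line code; the function name and the simplified execution model (each instruction only advances the program counter by one) are assumptions made for illustration.

```python
def run_single_stepped(num_instructions, breakpoints):
    """Simulate single stepping over straight-line code.

    Returns (outcome, pc, handler_calls), where pc is the address of the
    next instruction to execute and handler_calls counts how many times
    the single-step exception handler ran.
    """
    pc = 0
    handler_calls = 0
    while pc < num_instructions:
        pc += 1                # execute one user instruction
        handler_calls += 1     # the exception fires after every instruction
        if pc in breakpoints:  # the software monitor checks trigger conditions
            return "FDM", pc, handler_calls
    return "done", pc, handler_calls
```

In this model the handler runs as many times as instructions are executed, which illustrates why the text recommends enabling single stepping only within a small range of the program.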
In summary, the advantages of the software BDM approaches are that they are applicable to most microprocessors, flexible in design, easy to modify, and require little hardware support. They support a flexible number of breakpoints. In addition, they allow adjustment of the priority among the breakpoint exception and other hardware and software exceptions to protect critical tasks from being disrupted by the debug activity.
On the other hand, software BDM has several drawbacks. First, it takes up exception (interrupt) vector space and precious memory space, which might be limited in the SoC environment. Second, identifying a possible trigger condition takes a significant number of instruction cycles, making software BDM inappropriate for real-time debug. Third, software overhead makes it infeasible to detect trigger conditions with complications such as masking, data dependency, and range checking. Finally, software BDM can detect only software logic bugs; it has difficulty detecting hardware and timing-related bugs.
Hardware BDM approaches
The basic hardware support approach provides a mechanism to control the target microprocessor's program execution flow. 16 Implementing BDM in hardware usually involves a hardware comparator to monitor address and data buses, control signals, internal states, and I/O signals. The comparator contains a set of registers that can be programmed for several trigger conditions. The trigger conditions can be more sophisticated than those of the software approach, including masking and data dependency (equal, not equal, greater than, less than, range, and so forth), because these are easy to implement in hardware. Once the trigger conditions are met, the comparator stops the core clock or raises an exception to halt the user program and enter FDM.
Implementing BDM in hardware is a simple concept but requires careful design. An important issue is proper timing in halting the microprocessor after the trigger conditions are met to keep it in a stable, precise state. Another issue is handling instruction parallelism such as pipelining and superscalar execution so as to retain the logical sequence and eliminate false conditions caused by parallel execution.
The advantages of the hardware approach are that it allows trigger condition checking in real time, and the trigger conditions can be sophisticated because these extra functions take only a few extra gates. In addition, system status that is not directly accessible by software can be handled by hardware. Therefore, the hardware approach can detect hardware, software, and timing bugs. The disadvantages are hardware overhead, longer design and verification time, and inflexibility in modifying the ICE (such as increasing the number of supported breakpoints) after integration in the SoC.
FDM operations
FDM consists of three major tasks: accessing and modifying internal system status and configuring BDM, interacting with the host computer, and switching back to BDM and resuming the user program.
Software FDM approaches
The software FDM implementation has the form of a system service routine or software monitor that usually resides in the system memory area. 16 The software monitor consists of a command loop that interacts with the host computer to receive commands from the user and feed information back to the user through a communication channel. Upon receiving the user's command, the software monitor decodes the command and calls the corresponding service subroutine, such as setting breakpoints, accessing memory, accessing registers, resuming the user program, single stepping, or tracing.
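The command loop of a software monitor can be sketched as a decode-and-dispatch routine. The example below is a minimal model under our own assumptions: the command names (`break`, `readreg`, `readmem`, `writemem`, `resume`), the text-based protocol, and the use of Python containers for registers and memory are all hypothetical and are not taken from any real monitor.

```python
def soft_monitor(commands, registers, memory, breakpoints):
    """Process host commands until 'resume'; return the replies for the host.

    `commands` stands in for the stream read from the communication channel.
    Numeric arguments are hexadecimal strings.
    """
    replies = []
    for line in commands:
        if not line.strip():
            continue                      # ignore empty input
        op, *args = line.split()
        if op == "break":                 # set a breakpoint
            breakpoints.add(int(args[0], 16))
            replies.append("ok")
        elif op == "readreg":             # access a register
            replies.append(hex(registers[args[0]]))
        elif op == "readmem":             # read a memory location
            replies.append(hex(memory[int(args[0], 16)]))
        elif op == "writemem":            # modify a memory location
            memory[int(args[0], 16)] = int(args[1], 16)
            replies.append("ok")
        elif op == "resume":              # leave FDM and return to BDM
            replies.append("resuming")
            break
        else:
            replies.append("unknown command")
    return replies
```

Commands that arrive after `resume` are not processed, reflecting that the monitor exits its loop and the user program takes over again.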
Although a procedure call can invoke the software monitor, it is more efficient to invoke it through an interrupt (or exception) such as a software interrupt. The exception handler backs up the user program's system status, checks the exception's source, performs system mode switching (if necessary), and finally calls the software monitor. Once the service of the software monitor is complete, the system leaves FDM and returns to BDM by simply resuming the execution of the user program.
The main advantages and disadvantages of software FDM are similar to those of software BDM. An additional advantage is the smooth transition software provides between BDM and FDM: Entering (or leaving) the software monitor automatically suspends (or restarts) the user program. There is no need to release (or hold) the system clock to activate (or deactivate) the user program, as in the case of hardware FDM.
Hardware FDM approaches
Implementing FDM in hardware usually requires an I/O port specifically dedicated to debug, a set of registers for storing related information, and a debug controller for handling communication with the external world and executing FDM operations. Hardware FDM is independent of the microprocessor core and is thus driven by the test clock, which is different from the core clock. While the system is in FDM, the core clock halts and the hardware FDM is under the test clock's control. To switch back to BDM, the test clock halts and the core clock resumes.
Although there are many possible hardware FDM implementations, designs based on standard test mechanisms make core integration and software development easier. Such mechanisms include the IEEE 1149.1 JTAG architecture, 17 which provides serial test access to the chip, and the newer IEEE 1500 architecture, which provides both serial and parallel access to the chip. 18
Hardware FDM has two main advantages. First, the debug circuit is independent of the microprocessor core and thus takes no programming resources from the user program. No exception or service routine is necessary. Second, the test clock can run faster than the core clock to speed up debug operations, because the debug circuit is far simpler than the microprocessor core, which is not active during FDM. The main disadvantages are similar to those of hardware BDM.
FDM communication channels
An important distinction between software and hardware FDM is their communication channels. Figure 2 shows a generic block diagram of an SoC with an ICE. The SoC has two communication channels: the external I/O bus connected to the microprocessor's system (memory) bus, and the external test bus connected to the test access mechanism.
In software, the debug channel can be regarded as a regular I/O port, accessible through memory-mapped I/O addresses or distinct I/O ports, as defined by the instruction set architecture. Therefore, software FDM communicates with the external world through the external I/O bus. The advantage of this approach is its simplicity. The disadvantage is that other SoC components might be blocked from accessing the system bus or might have to share bus use with software FDM. Thus, the approach can slow down both system and debug performance.
Hardware FDM communicates with the external world through the external test bus. This bus is visible only to the test access mechanism, not the microprocessor software. The advantage is that debug access doesn't interfere with other activities on the system bus. The disadvantage is that additional I/O pins are necessary.
Classification examples
The debugger program for the Motorola 68HC11 evaluation board is an example of software emulation. 19 For FDM, it uses a software monitor called Buffalo, which resides at the top of the memory. The user can input a set of commands from the keyboard. The program performs a BDM breakpoint through a software interrupt (SWI) instruction patched into the target address. It performs single stepping through a counter (OC5timer) that generates an interrupt to halt the user program. The value set in the counter is the exact time required to run through the monitor and execute the next user instruction.
Another example of software emulation is the ARM Angel debug monitor. 8 Angel is a program that lets developers debug applications running on ARM-based hardware. Angel requires ROM or flash memory to store the debug monitor code, and RAM to store data. A typical Angel system's two main components are a host debugger and a debug monitor, which communicate through a physical connection such as a serial cable. The host debugger, acting as the FDM, runs on the host computer. The Angel debug monitor, acting as the BDM, runs on the target system. Angel uses its Angel Debug Protocol to communicate between the host and the target systems.
Intel's x86 microprocessors, such as the IA-32/64, support both software and hybrid emulation 1. ARM's ICE hardware supports only two breakpoints (which we call BP0 and BP1). To overcome this limitation, ARM's RealView Debugger, running on the host, uses an interesting technique to combine the hardware and software emulations. 11 RealView provides one hardware breakpoint and an unlimited number of software breakpoints. The hardware breakpoint refers to BP0 in hardware. The so-called software breakpoints in RealView are actually accomplished by BP1 in hardware, as opposed to software interrupts in the previously described software emulation method. When a programmer places software breakpoints in the program under debug, RealView replaces the instructions in the corresponding locations with the same specific binary pattern (for example, 0xFFFF FFFF). In addition, RealView configures BP1 in the ICE hardware as a watchpoint, with the binary pattern as the target value under watch. When program execution reaches such locations, the binary pattern is fetched as an instruction from program memory and appears on the data bus. The binary pattern appearing on the data bus triggers the watchpoint and causes the processor to halt accordingly. The host debugger software can then read back the program counter through the JTAG port to determine the halted location. With this technique, classified as hybrid emulation 2, a single breakpoint circuit in hardware can support an unlimited number of software breakpoints.
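The pattern-matching technique described above can be illustrated with a small simulation. The sketch below models program memory as a dictionary of word addresses, overwrites the chosen locations with the example pattern from the text, and uses a single comparison on the fetched word to stand in for the hardware watchpoint. The function names and the sequential-fetch model are our assumptions, not RealView's implementation.

```python
PATTERN = 0xFFFFFFFF  # the example binary pattern mentioned in the text


def place_breakpoints(memory, addresses):
    """Overwrite each breakpoint location with PATTERN; return the originals."""
    saved = {addr: memory[addr] for addr in addresses}
    for addr in addresses:
        memory[addr] = PATTERN
    return saved


def run(memory, pc, watch_value=PATTERN):
    """Fetch words sequentially and halt when the watchpoint value is fetched."""
    while pc in memory:
        fetched = memory[pc]        # the word appears on the data bus
        if fetched == watch_value:  # the single hardware watchpoint fires
            return pc               # the host would read back this PC value
        pc += 4                     # next 32-bit word
    return None                     # ran off the end without halting
```

Because every patched location carries the same pattern, one comparator is enough no matter how many locations are patched, which is the point the text makes about supporting an unlimited number of software breakpoints.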
The National Sun Yat-Sen University's retargetable embedded ICE module is another example of hardware emulation based on the JTAG architecture. 20 To make the ICE module retargetable to a wide range of microprocessor architectures, the developers decided that its operations should be controlled only through test access port (TAP) instructions, not through instruction set architecture features such as instructions, system flags, or proprietary configuration registers. Therefore, they defined a TAP instruction set extension and additional hardware for the module's JTAG architecture.
The Nexus 5001 Forum defined the IEEE-ISTO 5001-2003 debug interface specification to standardize the processor debug interface in embedded systems. 12 The standard adopts the hardware emulation approach. It uses the JTAG port to access the internal debug circuit and allows optional extra pins, defined by the designer, for higher debug throughput or more complex control. At least two hardware breakpoints are required to meet the standard. Vendors such as IPextreme provide Nexus 5001-compliant debug modules for microprocessors such as the ARM7 and ARM9, and on-chip bus interfaces such as the Advanced High-performance Bus (AHB). 13
Finally, any microprocessor core with a basic JTAG port, boundary scan cells, and appropriate software interrupt capability is a typical example of hybrid emulation 2. The JTAG-related circuits serve as the hardware FDM, and the software interrupt instruction can be patched into the user program to serve as the software BDM.
Representative ICE designs
We have presented a classification scheme of in-circuit emulation approaches from the hardware and software perspective. However, quantitatively analyzing and comparing such approaches is still difficult because existing designs are implemented on significantly different platforms and for different purposes. Here, we identify a typical design for each class of ICE and show how to implement it on the same ARM7 microprocessor platform, thus allowing fair analysis and comparison.
Software emulation (FSBS)
Figure 3a shows a block diagram of the software emulation scheme for the ARM7 microprocessor. At the right is the ARM7 microprocessor core. External memory is connected to the address and data buses of the microprocessor core. The external memory is conceptually partitioned into three portions: system memory, user memory, and the communication device. Software FDM and software BDM are located in system memory and user memory, respectively. Software FDM is implemented with a software program segment called SoftFDM, activated by the SWI exception handler, which also resides in system memory. Software BDM is the instrumented user program under debug. The communication devices are memory-mapped I/O devices.
Figure 3b shows the memory layout in more detail. The upper part (system memory) contains the table that stores breakpoint information, the pool that preserves the register contents of the user program upon entering FDM, the I/O buffers that hold information while SoftFDM communicates with the host computer, and SoftFDM, which is part of the SWI handler. The instrumented user program resides in the user program memory. A target location in the user program, where the user intends to set a breakpoint, is replaced with the special SWI instruction, which serves as the FDM trigger.
Figure 4 presents the basic structure of the SWI exception handler. Before entering SoftFDM, the SWI exception handler must back up the user register file and read the SWI instruction's data field to determine the exception service vector. After entering SoftFDM, the program's first task is to restore the registers polluted by the SWI exception handler. These housekeeping activities constitute software emulation's major performance overhead. SoftFDM is a command loop that receives commands from the host computer, decodes them, and performs corresponding operations.
In-Circuit Emulation
IEEE Design & Test of Computers
Hardware emulation (FHBH)
Figure 5 shows a block diagram of the hardware emulation scheme for the ARM7 microprocessor core. The hardware monitor is the hardware BDM. The JTAG controller and its related components, such as the five I/O pins and the boundary scan chains, serve as the hardware FDM. The hardware monitor is connected to the microprocessor core's address and data buses. The hardware monitor checks the trigger conditions on the buses.
The hardware monitor's major component is a comparator. Figure 6 shows the circuit diagram of the comparator, which supports two breakpoints. The figure shows the details of one breakpoint. Three kinds of information are necessary to configure a breakpoint: the control, data, and address signals. Each signal is further controlled by two parameters: the mask and the target value. It takes a total of six configuration registers to control a breakpoint setting. These configuration registers allow breakpoint checking to be data dependent and bitwise maskable. The hardware monitor is controlled by the debug-enable I/O pin and the hardware FDM. When a breakpoint is triggered, output signal breakpt is asserted. This disables the microprocessor core clock at the proper cycle to halt user program execution and switch the system into FDM, in which the system is under test clock control.
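The behavior of one breakpoint comparator can be modeled in software as three masked comparisons. The sketch below is a behavioral model only; it assumes that a mask bit of 1 means "compare this bit" and a mask bit of 0 means "don't care", which is our convention and may differ from the polarity used in the actual circuit of Figure 6. The class and field names are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Breakpoint:
    """Six configuration registers: a mask and a target value per signal group."""
    addr_value: int
    addr_mask: int
    data_value: int
    data_mask: int
    ctrl_value: int
    ctrl_mask: int

    def matches(self, addr, data, ctrl):
        """True when every unmasked bit of all three signal groups matches."""
        return (
            (addr & self.addr_mask) == (self.addr_value & self.addr_mask)
            and (data & self.data_mask) == (self.data_value & self.data_mask)
            and (ctrl & self.ctrl_mask) == (self.ctrl_value & self.ctrl_mask)
        )


def breakpt(breakpoints, addr, data, ctrl):
    """Model of the comparator output: asserted if any breakpoint matches."""
    return any(bp.matches(addr, data, ctrl) for bp in breakpoints)
```

With the data and control masks set to zero, the breakpoint depends on the address alone; setting the data mask makes the check data dependent, and clearing low address mask bits turns it into an aligned address-range check.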
Hardware FDM is implemented with the IEEE 1149.1 JTAG architecture. The serial access imposed by the JTAG standard could cause a performance bottleneck during debug. To improve debug performance, designers can use newer architectures with parallel test access, such as IEEE 1500, for the FDM implementation, at the cost of higher hardware overhead.
Hybrid emulation 1
Hybrid emulation 1 (FSBH) uses the software FDM from software emulation and the hardware BDM from hardware emulation. Figure 7 shows the block diagram and the memory layout for hybrid emulation 1. These are similar to those of software emulation because the FDM is implemented with software. However, a few modifications are worth noting. First, an additional hardware module, the hardware monitor, connects to the memory data and address buses. The hardware monitor implements the hardware BDM. Second, there is no instrumented code in the user program, because the hardware monitor performs breakpoint checking in the background. Third, the hardware monitor's behavior is similar to memory controllers such as memory management units. Thus, instead of holding the core clock for the microprocessor core as in hardware emulation, the hardware monitor halts the microprocessor core and enters the FDM by generating a data abort signal (using its breakpt output signal). Fourth, the software FDM is in the data abort exception handler, instead of the software interrupt handler as in software emulation. Fifth, an additional field called the hardware monitor registers is allocated in the system memory for configuration of the hardware monitor.
Hybrid emulation 2
Hybrid emulation 2 (FHBS) uses the hardware FDM from hardware emulation and the software BDM from software emulation. Figure 8 shows the block diagram and the memory layout for hybrid emulation 2. These illustrations are similar to those of hardware emulation because the FDM is implemented with hardware. However, again, we note a few modifications. First, the hardware module connected to the memory data and address buses in the hardware emulation scheme is not necessary here, because the software BDM checks trigger conditions. Second, although hardware performs the major FDM task, this design still needs a small software interface to manage the FDM controller. This is the FDM control routine in the SWI exception handler in Figure 8b. Third, because there is no hardware monitor to hold the core clock, the FDM control routine must hold the core clock by writing to a memory-mapped I/O address (0x0000001C in Figure 8b). While the clock is held, system control can be safely transferred to the FDM controller. The user can reactivate the core clock by properly configuring the related I/O circuit through the FDM controller.
Table 2 summarizes the implementation features of the four emulation approaches for the ARM7 microprocessor core.
Quantitative comparisons
We constructed an FPGA-based prototyping system to verify and demonstrate various in-circuit emulation approaches. We built the ICE designs just described with an academic synthesizable microprocessor core that implements the ARM7 instruction set. We downloaded the ICEs to the prototyping system for experiments, synthesized them to standard cells, and analyzed their gate-level features.
Hardware analysis
We synthesized the ICEs with TSMC's 0.35-micron standard cell library. The ARM7 core requires 46,167 gates. Table 3 presents our analysis of the ICE hardware for each of the four emulation approaches. Compared with the ARM7 core, the gate count overheads of the FSBS, FHBH, FSBH, and FHBS approaches are 0%, 15%, 11%, and 4%, respectively. The major gate count contributor is the hardware comparator. Therefore, the designer must be careful in deciding how many breakpoints and conditions are supported by hardware. Regarding the core clock speed, FHBH and FHBS have the same speed because they have the same critical path. FSBS and FSBH also have the same speed. In addition, FHBH and FHBS incur a minor clock speed overhead of 0.4% relative to FSBS and FSBH. The overhead is due to the scan cells on the critical path. The experiment shows that most of the ICE hardware components are not on the critical path and don't affect system performance.
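As a rough worked example, the percentages above can be converted into approximate gate counts for the ICE hardware. The results are estimates only: they are derived from the rounded percentages quoted in the text, not from exact synthesis figures, and the names in the sketch are ours.

```python
CORE_GATES = 46_167  # ARM7 core gate count quoted in the text

# Gate count overhead relative to the core, in percent (rounded values).
OVERHEAD_PCT = {"FSBS": 0, "FHBH": 15, "FSBH": 11, "FHBS": 4}


def approx_ice_gates(pct):
    """Approximate number of extra gates for a given percentage overhead."""
    return round(CORE_GATES * pct / 100)


APPROX_GATES = {name: approx_ice_gates(pct) for name, pct in OVERHEAD_PCT.items()}
```

Under these assumptions the hardware emulation approach adds on the order of 6,900 gates and hybrid emulation 2 fewer than 2,000, which is consistent with the observation that the comparator dominates the hardware cost.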
FHBH and FHBS have another clock, the test clock, which drives the hardware while the core clock is halted. The experiment shows that the test clock is about 20% faster than the core clock, because most of the complex system hardware modules are not used during test. This indicates that the hardware debug mechanism can operate at a faster speed than normal system speed.
Software aspects
The related software modules are written in the ARM7 assembly language, and assembled and linked with the ARM STD v2.5 development tool. The machine code is downloaded into the embedded memory in the chip. Table 4 presents our quantitative analysis of the ICE software for the four emulation approaches. Of all the approaches, FHBH needs no software code or resources. FSBH needs the largest software code and consumes one exception vector resource, but it can also debug the original user program. On the other hand, FHBS has the same software debug mechanism as FSBS but requires far less code in the SWI exception handler, thus saving precious system memory space. The analysis shows that when choosing the appropriate in-circuit emulation approach, the designer must consider the flexibility requirements for FDM and BDM in a specific SoC development environment.
ICE features
The last five ICE features in Table 5 are the latencies for various ICE operations. We show the latency with the physical time instead of the cycle count, because the operations in FHBH and FHBS are the collaboration of the core clock and the test clock running at their own speeds. In addition, because some operations involve interactions between the SoC and the external world, the bandwidth of the communication channel also affects the latencies.
Therefore, Table 5 shows two versions of latencies, whenever appropriate, designated by S and P to indicate serial and parallel access. For the hardware FDM, serial access refers to the IEEE 1149.1 JTAG architecture, and parallel access refers to the IEEE 1500 architecture. For the software FDM, serial access and parallel access refer to an external I/O bus with 1-bit and 32-bit bandwidth, respectively. Furthermore, some ICE operations can be broken down to three steps of operations, which are listed in parentheses in the table:
set up the debug command, execute the command, and send feedback to the user.
The analysis of the ICE operation latencies shows that FHBH has the shortest latencies, especially for breakpoint detection, where the only latency is the time spent waiting for the current instruction to complete before the break can be taken. This feature makes FHBH the best candidate for real-time debug. The next-best candidate is FHBS. The worst is FSBH because, once the hardware monitor detects a breakpoint and raises the data abort exception, the system must wait for the current instruction to complete, preserve the system status, and then transfer control to the software FDM.
Moreover, the latency breakdown analysis indicates that the major contributor of the latency is the time spent receiving commands from and sending feedback to the user, rather than the time to execute the debug command. This observation suggests that the ICE performance for an SoC can be greatly improved by employing the following strategies: First, develop a communication channel with a high bandwidth and an efficient protocol. Second, store macros of ICE operations on chip, similar to the concept of microprogramming, to avoid communicating a tremendous amount of primitive operations through the channel.
Application domains
Table 6 gives the suitable SoC application domains for the four emulation approaches.
THE QUANTITATIVE ANALYSES show that the FHBH hardware emulation is suitable for SoC designs where the extensive hardware cost is affordable and real-time hardware-software debug is a strict requirement. The FSBS software emulation is suitable for SoC designs with a rich memory resource and a simple I/O structure, in which functional software debug, and not timing behavior, is the primary concern. It can also be used as a supplement to the FHBH hardware to provide extra capacity that is not provided by the hardware, such as more breakpoints. The FSBH approach is suitable for SoC designs requiring low-
Table 6. Suitable SoC application domains of the four emulation approaches for the ARM7 microprocessor.
