This article presents an in-circuit emulation (ICE) 
Introduction
An in-circuit emulator (ICE) is part of the development environment for microprocessor (or microcontroller 1
1 Without loss of generality, we use the terms "microprocessor" and "microcontroller" interchangeably throughout the article, depending on the context. Both refer to the same thing: an instruction set processor (ISP). Traditionally, "microprocessor" tends to denote a high performance general purpose ISP, whereas "microcontroller" tends to denote a smaller (such as 8-bit or 16-bit) and perhaps lower performance ISP with I/O mechanisms suitable for industrial control.
EMDT/AM29xxx microprocessor [12] and the RISCWatch debugger for IBM's PowerPC microprocessors [19] .
At either the board or the chip level, ICE design has mostly been ad hoc. The architecture and the operating method differ significantly from one microprocessor to another, so it is very difficult to reuse the ICE module and its related software across different microprocessor cores in order to save engineering effort and shorten development time in the SoC era. The objective of this research is therefore to develop an ICE module that can be parameterized and retargeted to a range of microprocessors. With this motivation, we have defined the architecture of the proposed ICE module on top of the IEEE 1149.1 JTAG architecture and implemented it as a soft silicon intellectual property (soft IP: a Verilog RTL description with corresponding synthesis and simulation scripts and test patterns). Its functionality and retargetability have been demonstrated by integrating it with two microprocessors with significantly different architectures: the 8-bit industrial microcontroller HT48x00 [1] and a 32-bit ARM7-like embedded microprocessor. Both FPGA prototypes and a chip implementation have been completed.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 describes the ICE architecture. Section 4 discusses the integration issues. Section 5 demonstrates the technique with two industrial microprocessors. Section 6 provides the conclusion for this work.
Related Work
General discussions of debugging techniques are provided by [3], [15] and [18]. The approaches to the design of ICEs are primarily ad hoc. Most ICEs are board level designs built around extended versions of the microprocessor chips, such as [1], [9] and [10]. The extended chip exposes its internal information, such as registers, the data bus, the program counter, and certain control signals, to the outside world through many extra I/O pins that do not exist in regular chips. The other components on the ICE circuit board are responsible for the debugging support: interfacing between the host and the microprocessor chip, monitoring and recording the chip status, controlling the behavior of the chip, and so on. The major problems are two-fold. First, the extra I/O pins greatly increase the cost of the extended microprocessor chips and thus make the ICE much more expensive. Second, there is no standard regarding what kind of information should be made available to the outside world, and thus it is very difficult to reuse a board level design for different microprocessors.
When extra I/O pins for observing the internal status are not available, a software approach supported by some board level circuitry may be adopted. This approach requires at least two modes of software execution: supervisor mode (called monitor mode in [3]) and user mode [13]. The user program is executed in user mode. When a breakpoint is detected by the board level circuitry, an interrupt is raised to halt the microprocessor, and the operation mode is changed to supervisor mode by a procedure call to a pre-defined debugging routine (also called the monitor) that resides in the supervisor region of the memory. The debugging routine is responsible for the debugging operations. There are three major problems with the software approach. First, the debugging routine takes up precious system resources, such as interrupt vectors and memory space, so that some portions of the interrupt and memory spaces that are normally available to the user are no longer available. Second, the software approach may slow down the debugging process considerably. Third, again, there is no standard method that can be followed in the software approach.
To meet the challenge of a higher degree of integration and the demand for full speed debugging, some modern microprocessors have begun to embed the functionality of the ICE into the microprocessor chip. Depending on the complexity of the added circuitry, there are two approaches to accomplish this goal. The first approach is software oriented: additional instructions are provided to access the basic debugging circuitry. Most of the debugging activities are carried out through a pre-defined debugging subroutine and proper embedding of the debugging instructions into the user programs. An example is the Intel386DX embedded microprocessor, in which debugging is accomplished by embedding into the user program the "breakpoint" instruction and some regular data access instructions that set the debug registers (debug control register and breakpoint registers) and the single-step flag in the EFLAGS register [14]. The advantage is that debugging can be considered simply a software activity: debugging means executing an instrumented user program, just like executing a normal user program. The disadvantage is that the instrumented user program and the normal user program are actually two versions of the software. They are not the same thing and may not co-exist, for example during the field maintenance of an embedded system in a communication device in a remote desert area.
The second approach to embedding the ICE into the microprocessor chip is hardware oriented. Most of the debugging functions are implemented in hardware and accessed through a separate I/O port, such as the IEEE 1149.1 JTAG test access port [5], used in the ARM7TDMI embedded microprocessor [7], the EMDT/AM29XXX microprocessor [12] and the RISCWatch debugger for IBM's PowerPC microprocessors [19]. The advantages are two-fold. First, the software is consistent: the same version of the user program is used for both normal execution and debugging. Second, non-intrusive debugging can be achieved, since the debugging operation uses hardware resources different from the ones used by normal microprocessor execution and consequently does not conflict with the user program. The disadvantage is that an additional I/O port and the corresponding access method become necessary.
Our proposed approach follows the last approach, with an attempt to define a retargetable embedded ICE architecture that facilitates the reuse of hardware and software modules and experiences, in order to save precious human power and shorten the development time in the SoC era.
In addition to the above debugging implementations, researchers and organizations have been looking into defining new standards for on-chip debugging in order to meet the SoC challenges. For example, Alves and Ferreira analyze the limitations of the IEEE 1149.1 standard and propose extensions to the standard to facilitate on-chip debugging, such as modified boundary scan cells and the requirement that a "hold" signal be integrated into the logic of the circuit under test-and-debug [17]. Several companies, such as Motorola and Hitachi, formed the Nexus Global Embedded Processor Debug Interface Standard Consortium to define a debug interface standard [16]. Instead of striving to define an extensive standard for a very wide range of applications as these approaches do, we focus on the development of a feasible, low-cost and retargetable embedded ICE module, with as few modifications to the standard as possible, which can be easily integrated with microprocessor cores at the RTL level (called soft cores or soft IPs) without modification to the original microprocessor cores (except the insertion of the optional internal scan chains).
Architecture of the Embedded ICE
Our embedded ICE is based on the IEEE 1149.1 JTAG standard, similar to ARM7TDMI's approach. Figure 1 shows the architecture of the embedded ICE. There are two gray areas in the figure. The top gray area is the microprocessor core. The bottom gray area is the ICE module. The signals in between are the interface signals between the microprocessor core and the ICE module. There are two types of components in the ICE module: the JTAG components (the white boxes), which are presented in Sections 3.1 and 3.2, and the debugging related components (the black boxes), which are presented in Section 3.3 (the breakpoint detection unit, BDU) and in Section 3.4 (the interface between the ICE module and the microprocessor core). The trace unit is still under construction and is not covered in this article. The supported ICE operations are discussed in Section 3.5.
Fundamental IEEE 1149.1 JTAG Components
The IEEE 1149.1 JTAG architecture is a framework for standardized design-for-testability of integrated circuits for module-level testing [5]. It allows the inputs/outputs and, if desired, the internal signals of the digital logic of the integrated circuit to be accessed from outside modules. The advantage is that the controllability and observability of a module containing many components are vastly improved while the input/output overhead of the module is minimized. Our ICE is based upon this standard, with some extensions to support the ICE related operations.
Extension to the JTAG components
To support the debugging functionality and make the ICE module retargetable, several extensions are necessary. Three additional scan chains are defined: the breakpoint scan register, the internal scan chain, and the scan chain configuration register. The breakpoint scan register is the input buffer of the breakpoint detection unit. The configuration parameters for the breakpoint detection unit are first shifted into the breakpoint scan register from the TDI pin and then sent to the unit. The internal scan chain is for the observation and control of the internal status of the microprocessor core. This scan chain is optional, and it is up to the designer to determine which signals to make accessible, such as the program counter, the status register, the pipeline registers, etc. To speed up access to a selected group of scan cells in the scan chain while leaving the rest untouched, the scan cells can optionally be multiplexed, as shown in the upper right corner of Figure 2.
A scan chain configuration register is provided to manipulate the scan chains, including the selection of the active scan chain to be connected between the TDI and TDO pins, and the configuration of multiplexed internal scan chain (if any), as shown in the lower left corner of Figure 2 . The M field of the register controls the selection of the scan chain while the N field controls the multiplexors in the internal scan chain.
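As a rough illustration of how the scan chain configuration register might steer the serial path, the following Python sketch models each scan chain as a shift register between TDI and TDO and splits a configuration word into the M and N fields. The field widths, the bit ordering, the chain numbering, and the names ScanChain and decode_config are all assumptions made for illustration; none of them is taken from the actual ICE implementation.

```python
class ScanChain:
    """Behavioral model of one scan chain as a serial shift register."""

    def __init__(self, name, length):
        self.name = name
        # cells[0] is the cell nearest TDI, cells[-1] the cell nearest TDO.
        self.cells = [0] * length

    def shift(self, tdi):
        """Shift one bit in from TDI and return the bit that leaves at TDO."""
        tdo = self.cells[-1]
        self.cells = [tdi & 1] + self.cells[:-1]
        return tdo


def decode_config(config, m_bits=2):
    """Split a configuration word into (M, N).

    Assumption: M occupies the low m_bits bits and N the remaining upper bits.
    """
    m = config & ((1 << m_bits) - 1)
    n = config >> m_bits
    return m, n


# Hypothetical chain numbering; the real assignment is a design parameter.
chains = [
    ScanChain("boundary", 8),
    ScanChain("internal", 16),
    ScanChain("breakpoint", 4),
]

m, n = decode_config(0b0110)   # M = 2 selects the breakpoint chain, N = 1
active = chains[m]
for bit in (1, 0, 1, 1):       # serially load four bits from TDI
    active.shift(bit)
print(active.name, active.cells, n)   # breakpoint [1, 1, 0, 1] 1
```

In this model the first bit shifted in ends up in the cell nearest TDO, which is the usual behavior of a serial scan path.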
Additional TAP instructions are also defined to operate the ICE components. The instructions RESTARTM, RESTARTS, and RESTARTF configure the microprocessor core and the ICE to run in the monitoring mode, the single-stepping mode and the free running mode, respectively (these modes are described in detail in Section 3.5). The CONFIG instruction configures the scan chain configuration register. The TRACE_ON and TRACE_OFF instructions turn the trace unit (not covered in this article) on and off, respectively.
The above TAP instructions contribute greatly to the retargetability of the embedded ICE module, since all the debugging operations are encapsulated in TAP instructions. As long as a given microprocessor core is integrated with our embedded ICE according to the proper configuration of design parameters and the integration procedure, the entire ICE functionality can be easily retargeted to that microprocessor. In contrast, some commercial microprocessors rely on mechanisms that are not easily retargetable to other microprocessors. For example, the debugging operations of the Intel386DX embedded microprocessor are activated by the "breakpoint" instruction and the manipulation of the EFLAGS register through some data access instructions [14]. These instructions are part of the x86 instruction set and cannot be easily "cut and pasted" to retarget to other microprocessors. Similarly, the debugging operations of ARM7TDMI are activated by the execution of certain TAP instructions and the setting of some proprietary on-chip configuration registers, which makes them harder to retarget.
Finally, two additional external pins are required. The input pin EXT_DEBUG is added to allow the activation of the ICE operations from the outside world. It is assumed that this signal is synchronous with the chip clock. If not, then a synchronization circuit is necessary to construct a synchronous version of the signal. The output pin ICE_status is added to indicate the status of the ICE to the outside world (shown in Figure 5 ). 
Breakpoint Detection Unit (BDU)
The breakpoint detection unit (BDU) monitors the value(s) on the address bus and/or the data bus of the microprocessor. Once the value matches one of the target values, BDU stops the normal operation of the microprocessor, and the control is taken over by the TAP controller. The target values are stored in the internal registers (called breakpoint registers) of the BDU through the TAP controller by the user before debugging is activated. A target value is called a breakpoint if it is for an instruction fetch, whereas a target value is called a watchpoint if it is for a memory data access. To simplify the discussion, we will use the term breakpoint for both meanings.
The organization of our BDU is similar to that of the ARM7TDMI, as shown in Figure 3. In the center is a group of breakpoint registers whose values are compared with the address and/or data buses of the microprocessor core at the appropriate time indicated by the bus control signals. A breakpoint-matched signal is raised when at least one breakpoint is matched. The breakpoint registers are configured through the breakpoint configuration register, which is a scan chain connected between the TDI and TDO pins. The configuration parameters are serially shifted into this scan chain and applied to the breakpoint registers under the control of TAP instructions.
Two types of matches are supported: data independent and data dependent. The data independent match happens when the value of the address bus matches at least one of the target values in the breakpoint registers. The data dependent match happens only when both the values of the address bus and the data buses match the target values.
To provide further flexibility in debugging, the values can be masked before they are compared with the target values. The masking mechanism makes it possible to stop the microprocessor under some sophisticated conditions such as stopping when the address is word aligned, or when the data is even, or only when both conditions are met.
To achieve masked data dependent matching, a breakpoint register consists of four fields: target address, target address mask, target data, and target data mask. To reduce the size of the BDU, the designer can decide whether to remove the masking and/or the data dependency mechanism by eliminating the corresponding mask and/or target data fields and their corresponding circuits. The BDU has to be adjusted for the memory architecture of the microprocessor core. If the program memory and data memory are separated (Harvard architecture) and the user wants to detect both instruction and data breakpoints, the BDU must have breakpoint registers for both instruction and data. If the program and data memory share the same data and address bus (von Neumann architecture), then the BDU must know whether the microprocessor core is in the fetch stage or the execute stage. Therefore, each breakpoint register in the BDU requires an extra 1-bit field to record whether the target is for instruction or data.
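The masked matching described above can be sketched as a small behavioral model in Python. The mask polarity (a 1 bit means "compare this bit"), the field names, and the function names are assumptions chosen for this sketch; the actual BDU may use a different encoding.

```python
from dataclasses import dataclass


@dataclass
class BreakpointRegister:
    target_addr: int
    addr_mask: int               # assumption: 1 = compare this bit, 0 = ignore it
    target_data: int = 0
    data_mask: int = 0           # all zeros yields a data independent match
    is_instruction: bool = True  # the extra 1-bit field: instruction or data


def matches(bp, addr, data, is_fetch):
    """Return True if one bus cycle hits the given breakpoint register."""
    if bp.is_instruction != is_fetch:
        return False
    addr_hit = (addr & bp.addr_mask) == (bp.target_addr & bp.addr_mask)
    data_hit = (data & bp.data_mask) == (bp.target_data & bp.data_mask)
    return addr_hit and data_hit


def breakpoint_matched(registers, addr, data, is_fetch):
    """The breakpoint-matched signal: at least one register hits."""
    return any(matches(bp, addr, data, is_fetch) for bp in registers)


# Stop on any data access to a word aligned address (low two address bits zero).
word_aligned = BreakpointRegister(target_addr=0, addr_mask=0b11,
                                  is_instruction=False)
# Stop only when the address is word aligned and the data is even.
aligned_and_even = BreakpointRegister(target_addr=0, addr_mask=0b11,
                                      target_data=0, data_mask=0b1,
                                      is_instruction=False)

print(matches(word_aligned, 0x100, 0xFF, is_fetch=False))      # True
print(matches(aligned_and_even, 0x104, 0x23, is_fetch=False))  # False
```

Dropping the data fields (or leaving data_mask at zero) corresponds to the reduced BDU without the data dependency mechanism.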
Interface between the microprocessor core and the embedded ICE module
To connect the embedded ICE module to a given microprocessor core, a minimal set of interfacing signals is defined in Figure 5. These signals are defined for retargetability: as long as a given microprocessor core can receive and generate such signals, the embedded ICE can be integrated with the microprocessor core without difficulty. The signals from the microprocessor core to the ICE module are the following. The address and data buses for the memory access of the microprocessor core are sent to the ICE for monitoring. The mem_control signals are the control signals of the memory access that indicate the type of the memory access, such as read/write for instruction/data access. They are used by the ICE to interpret the activities on the address and data buses. The signals tdo_out1 and tdo_out2 are the outputs of the boundary scan chain and the internal scan chain of the microprocessor core, respectively. The signal flush indicates that the microprocessor core is flushing its prefetched instructions due to situations such as a taken branch, in which case a breakpoint triggered by such prefetched instructions on the address bus becomes false and should be discarded by the BDU. If the microprocessor core does not have a prefetch mechanism, as in a non-pipelined implementation, then this signal is not required. Finally, the eoi signal indicates the end of execution of the current instruction. The ICE uses this signal to wait until the microprocessor core has reached a safe state before halting it for further ICE operations.
The signals from the ICE module to the microprocessor core are the following. The core clock is the system clock for the microprocessor core. During normal execution, the chip clock, which is supplied from the external input pin, is passed directly to the core clock to drive the microprocessor core. During debugging operations, the ICE may halt the microprocessor core by holding the core clock. The scan mode signal is used to activate the scan chains in the microprocessor core, including the boundary and internal scan chains. These two signals are generated by the M/I interface in Figure 1, which is discussed in more detail below with Figure 6. The signals shiftDR_en, clockDR_en, tdi_in, shiftDR, clockDR and updateDR are the signals that control the scan cells [5]. The optional signal current instr, which is generated by the TAP controller, is used to direct the microprocessor core to finish the execution of the current instruction and flush the prefetched instructions, if any. If the microprocessor core does not accept such a signal, the same behavior can be accomplished by directly manipulating the pipeline registers of the microprocessor core through the JTAG port.
Compared with the other ICE components, the M/I interface requires the most attention during the integration of the ICE with a given microprocessor core, since the behavior of the interfacing signals (as defined in Figure 5) may differ from one microprocessor core to another. A basic template of the interface is depicted in Figure 6. The input signals to this interface are listed at the left side. The text in parentheses indicates the source of the signal: ICE indicates a signal coming from other ICE components, EXT indicates a signal from an external input pin, and MP indicates a signal from the microprocessor core. There are three major D flip-flops: DFF1, DFF2 and DFF3. DFF1 latches the pipeline flush signal from the microprocessor core. DFF2 latches the breakpoint signal from the BDU. DFF3 latches the eoi signal from the microprocessor core. The timing for latching or resetting the D flip-flops is controlled by the proper combination of the related signals, as shown by the glue logic connecting the enable, clock and reset ports of the D flip-flops. DFF1 is enabled when the signal restartm or restarts is high, which is set when the TAP instruction RESTARTM or RESTARTS (entering the monitoring mode or the single-stepping mode respectively, as defined in Section 3.5) is executed. DFF1 is reset when there is no breakpoint activity or the external test reset signal trst is asserted. DFF2 is enabled when the current instruction finishes its execution, and is reset when the ICE is not in the monitoring mode, the microprocessor core asserts the flush signal, or the external test reset signal trst is asserted. DFF3 is enabled when the TAP instruction RESTARTS (entering the single-stepping mode, as defined in Section 3.5) is executed, and is reset when the microprocessor core flushes its pipeline or the test reset signal is asserted.
The DFF2 and DFF3 signals are combined with the external_debug request signal and the chip clock to generate the core clock for the microprocessor core. The core clock is a gated version of the external chip clock signal. The core clock has to be halted at the high (low) level if the storage elements in the microprocessor core are triggered at the rising (falling) edge of the clock. The M/I interface generates both versions of the core clock. One of them, selected by the configuration signal MP Core FF Trigger Mode, is sent out to the microprocessor core. Finally, the generation of the scan mode signal is similar to that of the core clock. The circuit in Figure 6 serves as a basic template for understanding the interfacing behavior. Based on this template, several variations can be easily derived to accommodate different styles of microprocessor cores. For example, for some microprocessor cores it is more feasible to halt the core by disabling writes to important registers such as the register file, program counter, pipeline registers, status register, etc., while leaving the core clock running, instead of stopping the core clock completely. In this case, the designer must be able to modify the control logic of the microprocessor core by properly combining the scan_mode signal (or its variant) with the enabling signals for register writing [2].
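The core clock gating described above can be illustrated with a minimal Python model. It assumes that the combined halt condition (which in Figure 6 is derived from DFF2, DFF3 and the external debug request) is reduced to a single bit named halt, and that parking the clock is done with an OR gate for rising-edge cores and an AND gate for falling-edge cores. Both the signal name and the gate choice are assumptions of this sketch, not a description of the actual circuit.

```python
def core_clock(chip_clk, halt, rising_edge_triggered=True):
    """Gate the chip clock so that a halted core sees no active edge.

    Assumption: halt is a single bit that already combines the breakpoint,
    single-step and external debug conditions.
    """
    if rising_edge_triggered:
        return chip_clk | halt       # parked at the high level
    return chip_clk & (1 - halt)     # parked at the low level


def count_edges(wave, rising=True):
    """Count the rising (or falling) edges in a sampled waveform."""
    pairs = zip(wave, wave[1:])
    if rising:
        return sum(1 for a, b in pairs if a == 0 and b == 1)
    return sum(1 for a, b in pairs if a == 1 and b == 0)


chip = [0, 1, 0, 1, 0, 1, 0, 1]

# Rising-edge core: halt is raised while the clock is already high.
halt_r = [0, 0, 0, 1, 1, 1, 1, 1]
core_r = [core_clock(c, h, True) for c, h in zip(chip, halt_r)]
print(core_r, count_edges(core_r, rising=True))    # [0, 1, 0, 1, 1, 1, 1, 1] 2

# Falling-edge core: halt is raised while the clock is already low.
halt_f = [0, 0, 0, 0, 1, 1, 1, 1]
core_f = [core_clock(c, h, False) for c, h in zip(chip, halt_f)]
print(core_f, count_edges(core_f, rising=False))   # [0, 1, 0, 1, 0, 0, 0, 0] 2
```

In both cases the halt bit changes while the chip clock is already at the parked level, so the gated clock shows no extra active edge when the core is halted.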
Another example is that in certain multi-cycled implementations of some microprocessors the signals flush, eoi and breakpoint might be active during different cycles, and therefore, proper latching of these signals becomes necessary before they can serve as the input signals to the M/I interface.
A third example is that a false breakpoint might be detected during the prefetching of the instructions that follow a branch instruction which is still computing its branch condition and later decides to take the branch and abandon the instructions that have already been fetched. In such a case the flush signal might not be able to cancel the breakpoint signal in time. A solution to this problem is to preprocess the breakpoint signal by latching it for a certain number of cycles before it is sent to DFF1 for processing.
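The preprocessing idea can be sketched as a short delay line that holds the raw breakpoint signal for a fixed number of cycles and drops anything still pending when a flush arrives. The class name DelayedBreakpoint, the delay value, and the rule that a flush clears every pending entry are assumptions made for this illustration.

```python
from collections import deque


class DelayedBreakpoint:
    """Delay the raw breakpoint signal so that a late flush can cancel it."""

    def __init__(self, delay):
        if delay < 1:
            raise ValueError("delay must be at least one cycle")
        self.delay = delay
        self.pipe = deque([0] * delay, maxlen=delay)

    def clock(self, bp_raw, flush):
        """Advance one cycle and return the delayed breakpoint signal."""
        if flush:
            # Assumption: a flush invalidates every breakpoint still pending.
            self.pipe = deque([0] * self.delay, maxlen=self.delay)
            return 0
        out = self.pipe[0]            # oldest entry leaves the delay line
        self.pipe.append(bp_raw & 1)  # maxlen drops the entry just read
        return out


# A genuine breakpoint appears at the output two cycles after detection.
real = DelayedBreakpoint(2)
print([real.clock(b, f) for b, f in [(1, 0), (0, 0), (0, 0)]])          # [0, 0, 1]

# A breakpoint on a prefetched instruction is cancelled by a flush one cycle later.
false = DelayedBreakpoint(2)
print([false.clock(b, f) for b, f in [(1, 0), (0, 1), (0, 0), (0, 0)]])  # [0, 0, 0, 0]
```

The delay would be chosen to cover the number of cycles the core needs to resolve a branch, which depends on the particular pipeline.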
A fourth example concerns the timing of halting after a breakpoint has been detected. For ease of retargetability, the default halting policy of our ICE is to wait for an asserted eoi signal. However, in a pipelined microprocessor, some pending instructions that precede the breakpointed instruction may still remain in the pipeline even though one instruction has completed its execution and issued an eoi signal. In this case, the user can apply the single-stepping TAP instructions to drain these pending instructions and move the breakpointed instruction to the desired stage. Alternatively, a hardware-oriented solution is to latch the breakpoint signal in the pipeline registers until it reaches the stage where the halt (a break) is expected to happen. For example, in ARM7TDMI a breakpoint identified in the fetch stage does not halt the microprocessor until it reaches the execution stage.
ICE operation modes
Four modes of ICE operations are supported with the above hardware mechanisms.
• Free running mode. In the free running mode, the microprocessor core is active and ignores the breakpoint signal, acting as the bare microprocessor without the ICE circuitry. The microprocessor's core clock is the same as the chip clock, and therefore the system runs at the normal speed. The processor runs continuously until the external debug request occurs, in which case the ICE enters the TAP mode and waits for TAP instructions for further operations.
• TAP mode. In the TAP mode, the microprocessor core is halted; that is, the core's state remains stable. During this mode, the user can observe and control the boundary scan register and the internal scan registers of the microprocessor core by executing TAP instructions in the TAP controller. In this mode, the system is controlled by the test clock TCK.
• Monitoring mode. In the monitoring mode, the microprocessor core is active as in the normal case. In addition, the BDU keeps monitoring the microprocessor's status and stops the microprocessor core when the breakpoint is reached. The TAP controller is in the idle state. The microprocessor's core clock is the same as the chip clock, and therefore the system runs at the normal speed. When the breakpoint is reached, the ICE switches from the monitoring mode to the TAP mode for further operations.
• Step mode. The step mode is similar to the monitoring mode except that the microprocessor core is halted after the execution of the current instruction is completed, instead of when a breakpoint is reached. Once the microprocessor core is halted, the ICE switches to the TAP mode for further operations.
Figure 7 shows the relationship among these ICE modes. Switching between modes can be triggered by an external event (the asserted external_debug signal), by internal events such as breakpoints, or by the execution of the related TAP instructions. The TAP mode plays the central role in the various ICE operations. Once in the TAP mode, all other modes can be entered by the proper TAP instructions. More sophisticated switching between these modes is possible by defining additional TAP instructions or further internal/external signal relationships.
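The mode switching described above can be summarized as a small state machine. The Python sketch below encodes only the transitions stated in the text; the function and parameter names are hypothetical, and transitions the text does not spell out (for example, an external debug request arriving during the monitoring mode) are deliberately left out rather than guessed.

```python
from enum import Enum


class Mode(Enum):
    FREE = "free running"
    TAP = "TAP"
    MONITOR = "monitoring"
    STEP = "step"


# TAP instructions that leave the TAP mode (see the TAP instruction list).
RESTART_TARGETS = {
    "RESTARTF": Mode.FREE,
    "RESTARTM": Mode.MONITOR,
    "RESTARTS": Mode.STEP,
}


def next_mode(mode, ext_debug=False, bp_hit=False, eoi=False, tap_instr=None):
    """Return the next ICE mode for one event (illustrative model only)."""
    if mode is Mode.TAP:
        # Any other TAP instruction keeps the ICE in the TAP mode.
        return RESTART_TARGETS.get(tap_instr, Mode.TAP)
    if mode is Mode.FREE and ext_debug:
        return Mode.TAP
    if mode is Mode.MONITOR and bp_hit:
        return Mode.TAP
    if mode is Mode.STEP and eoi:
        return Mode.TAP
    return mode


mode = Mode.FREE
mode = next_mode(mode, ext_debug=True)         # external debug request
mode = next_mode(mode, tap_instr="RESTARTM")   # start monitoring
mode = next_mode(mode, bp_hit=True)            # breakpoint reached
print(mode)                                    # Mode.TAP
```

Note that the free running mode ignores the breakpoint signal in this model, in line with the description of that mode.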
ICE integration and prototyping

Integration procedure of the embedded ICE
To integrate the retargetable embedded ICE with a given microprocessor core, the designer first has to determine which signals within the core should be scanned (observed and/or controlled). If the core already has scan chains for these signals, they can be directly connected to the ICE. If not, and the core is a synthesizable (soft) core, the designer needs to replace the corresponding storage elements with appropriate scan cells. If the core is a hard core that does not provide access to the internal wires or storage elements, then only the external I/O pins can be made scannable, by allocating boundary scan cells for the I/O pins. If reading the internal status is still necessary, the user can scan a memory store instruction into the boundary scan cells of the data bus, activate the fetch operation of the microprocessor to fetch the scanned instruction, and then execute that instruction to read the required signals (such as the register file) and put them on the data bus, where they can be captured by the boundary scan cells. To write into the internal status, a memory load instruction can be executed with the same approach. This approach is used in the ARM7TDMI core.
Second, the designer has to instantiate the parameters for the BDU, which are (1) Once the scan chains have been allocated and the BDU has been instantiated, the ICE can then be integrated with the microprocessor core by constructing the M/I interface. If the given microprocessor core does not directly provide the signals described in Figure 5, some glue logic may be necessary to combine existing related signals to generate such signals. In addition, special attention must be paid to special processor architectures such as those with irregular pipeline behavior or different clocking requirements, as discussed in Section 3.4.
Cost and performance consideration of the embedded ICE
The cost (gate count) overhead of the embedded ICE includes the scan chains in the microprocessor core and the ICE itself.
208 878 988
8 634 225 193 396 1052 1255
16 634 377 384 770 1393 1779
32 634 1264 760 1521 2055 2816
Table 2. The gate counts of a simple ICE with one breakpoint for microprocessor cores with various address/data bit widths
The embedded ICE may affect the performance of the microprocessor core in two ways. First, a typical scan cell has a multiplexor in front of its D flip-flop. The multiplexor may introduce additional delay when writing the D flip-flop. Second, the ICE is connected to the memory address and data buses and thus increases the loading on the buses, which may increase the memory access time. Therefore, if the scan chain or the memory address/data bus is on the critical path, the delay of the critical path may increase. However, it is possible to compensate for the delay degradation by increasing the driving capability of the related circuits. Our case studies show that no performance degradation is experienced if the designer is able to use circuit cells with higher driving capability, as in the cases in Section 5.1, Section 5.2 and Section 5.3, and that the maximal performance degradation is kept within 19% if the designer connects the ICE directly to the microprocessor core without increasing the driving capability, as in the case in [2]. The analyses indicate that the embedded ICE approach makes it possible to perform real-time (on-line) debugging at full speed.
An FPGA prototyping system for the retargetable embedded ICE
An FPGA prototyping system has been developed to demonstrate the retargetability of the embedded ICE, as shown in Figure 8. The system consists of three boards. At the right side is a board with Xilinx's XC4010XL FPGA chip (about 40,000 gates in capacity), which houses the ICE module. At the left side is a board with Xilinx's XVC300 FPGA chip (about 300,000 gates in capacity) to accommodate the target microprocessor core or its bus model. The bus model of the microprocessor core is the set of cycle-based test vectors of the interfacing signals defined in Section 3.4, which can be used to mimic the behavior of the microprocessor core when the core is not available, such as during the early development phase of the microprocessor when a functioning core is not yet ready. This FPGA chip has recently been upgraded to an XVC800 chip (about 800,000 gates in capacity) to accommodate larger microprocessors. The board in the middle is for I/O control and the display of signals such as the TAP state, breakpoint values, the breakpoint signal, the system and test clocks, the program counter of the microprocessor, the internal and boundary scan chains of the microprocessor, etc. Although a single FPGA chip with large capacity such as Xilinx's XVC300 would suffice to accommodate both the microprocessor core and the ICE module, we chose to use two separate FPGA chips in order to investigate the interfacing mechanism between the microprocessor core and the ICE and the retargetability of the embedded ICE.
An experiment has been conducted to demonstrate the retargetability of the embedded ICE. We first downloaded a generic ICE into the small FPGA chip and the 8-bit industrial microcontroller HT48x00, described in Section 5.1, into the large FPGA chip, and then performed several operations including regular microcontroller operations and all modes of ICE operations. The HT48x00+ICE integration was successful; both the microcontroller and the ICE operated as desired. In the second step, with the original ICE in the small FPGA chip unmodified, we downloaded a 32-bit ARM7-like microprocessor core, described in Section 5.2, into the large FPGA chip and then repeated the same operations as in the HT48x00 case. The ARM7+ICE integration was also successful; both the microprocessor and the ICE worked as desired. Some small glue logic had to be added to the ICE, for example to deal with address buses of different bit widths, in order to make this experiment possible.
Case studies
In this section we demonstrate the effectiveness of the retargetable embedded ICE with three case studies. Section 5.1 presents the integration with the 8-bit industrial microcontroller HT48x00 (synthesizable Verilog RTL code), called HT48x00+ICE. Section 5.2 presents the integration with a 32-bit ARM7-like microprocessor core (synthesizable Verilog RTL code), called N_ARM7+ICE. The ICE operations of N_ARM7+ICE are not the same as those of the ARM7TDMI core [8]. Section 5.3 demonstrates that our ICE can be made to conform to the ICE specification of ARM7TDMI with minor modifications.
Embedded ICE with an 8-bit industrial microcontroller HT48x00: HT48x00+ICE
The HT48x00 microcontroller is a RISC-like microcontroller specifically designed for I/O control applications, such as remote controllers, fan/light/washing machine controllers, scales, and various subsystem controls. There are 64 instructions, each of which is executed in one or two instruction cycles. The microcontroller adopts a two-stage pipeline structure. Each stage takes four clock cycles (with the single-clock scheme) or four non-overlapping phases (with the multiple-clock scheme). The program memory ROM is 14 bits wide and the data memory RAM is 8 bits wide. There are 18 bi-directional I/O lines to communicate with the outside world. HT48x00 supports one external interrupt and one 8-bit programmable timer/event counter with overflow interrupt. A watchdog timer is incorporated to guard against software malfunction or the program sequence jumping to an unknown location with unpredictable results.
The parameters of the ICE embedded with HT48x00 are: boundary scan cells for all I/O pins; internal scan cells for the instruction register, program counter, stack registers, and the data/address registers for ROM and RAM respectively; and one breakpoint with the masking mechanism in the BDU. Figure 9 shows the simulation waveform of a debugging activity (setting a breakpoint at the address 013) of HT48x00+ICE. In the HT48x00 architecture, each pipeline stage takes four cycles for execution and the memory address is set at the first cycle. The simulation waveform shows that the breakpoint is raised as soon as the memory address is matched. However, the core clock is not halted until three cycles later, so that the pipeline can complete its current execution.
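The timing described above can be sketched as a small behavioral model in Python. This is only an illustration under stated assumptions, not the BDU implementation: the function names are hypothetical, a mask bit of 1 is assumed to mean that the bit takes part in the comparison, and the address value 0x013 is used only as an example.

```python
from typing import List, Optional, Tuple


def breakpoint_hit(addr: int, bp_value: int, bp_mask: int) -> bool:
    """Masked compare: only bits set in bp_mask take part (assumed polarity)."""
    return (addr & bp_mask) == (bp_value & bp_mask)


def first_halt(addresses: List[int], bp_value: int, bp_mask: int,
               cycles_per_stage: int = 4) -> Optional[Tuple[int, int]]:
    """Return (match_cycle, halt_cycle) for the first breakpoint hit, or None.

    One address is presented per pipeline stage, at the first cycle of the
    stage; the clock is halted at the last cycle of that stage so the stage
    can complete.
    """
    cycle = 0
    for addr in addresses:
        if breakpoint_hit(addr, bp_value, bp_mask):
            return cycle, cycle + cycles_per_stage - 1
        cycle += cycles_per_stage
    return None
```

For an address sequence 0x011, 0x012, 0x013 and an unmasked breakpoint at 0x013, the model reports a match at cycle 8 and a halt at cycle 11, i.e. three cycles after the match, which mirrors the behavior described for Figure 9.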
Both FPGA (on the FPGA prototype system in Section 4.3) and chip (with Holtek's 0.5µm standard cell library) implementations of HT48x00 (without ICE) and HT48x00+ICE have been accomplished. Figure 10 shows the chip layouts for HT48x00 and HT48x00+ICE and the package for the HT48x00+ICE chip. The ICE overhead is about 2000 gates in the selected configuration. However, the chip area of HT48x00+ICE is slightly smaller than that of HT48x00. The reason is that, from the experience of laying out the HT48x00 chip, the designer learned that the power requirement was not as demanding as originally planned. Therefore, although HT48x00+ICE requires five extra pins for ICE operations, its layout is 3.5% smaller than that of HT48x00 because a few power pins are saved. (Due to lack of time, the designer did not have the chance to redo the layout of the HT48x00 chip.) As for performance, both chips are capable of running at the same frequency of 25MHz. The ICE does not introduce performance overhead in this case.
This case study shows that for a small microprocessor such as an 8-bit microcontroller, although the gate count overhead is not negligible, it does not necessarily contribute to the final chip area, since a core of this size is usually bounded by the I/O pins rather than by the core logic. In addition, with careful synthesis, the possible performance overhead due to the ICE circuitry can be avoided, making real-time (on-line) debugging at full speed achievable.
Embedded ICE with a 32-bit microprocessor (N_ARM7): N_ARM7+ICE
N_ARM7 is an academic, synthesizable RTL 32-bit microprocessor core that is capable of executing the ARM7 instruction set. The ARM7 architecture adopts a RISC (reduced instruction set computer) architecture with special mechanisms to support certain CISC (complex instruction set computer) features, such that performance, cost and software code density are well tuned towards embedded applications such as cellular phones, PDAs, networking, and consumer entertainment. The major CISC features include instruction semantics overloading, conditional execution and multi-cycle execution. The ARM7 architecture has a typical three-stage pipeline: instruction fetch, instruction decode, and instruction execution. The instruction execution time ranges from one clock cycle to seventeen clock cycles.
The parameters of the ICE embedded with N_ARM7 are: boundary scan cells for all I/O pins and one breakpoint with the masking mechanism in the BDU. N_ARM7+ICE has been successfully synthesized and downloaded into our FPGA prototype system, as described at the end of Section 4.3. Since the ICE architecture is the same as in Section 5.1, the simulation waveform is similar to Figure 9 and is therefore not shown here. We also synthesized N_ARM7+ICE with TSMC's 0.35µm standard cell library. The retargetability of our ICE module greatly reduces the time needed to integrate N_ARM7 with the ICE. With the prior experience of the HT48x00+ICE integration, it took a master's student only one month (in a regular semester, while taking other courses) to integrate and verify N_ARM7+ICE. The required work included understanding the N_ARM7 RTL behavior, configuring the ICE, interfacing the two, developing test benches, simulating and prototyping.
Modification of the retargetable ICE for ARM7TDMI's ICE specification
As shown in the previous section, our retargetable ICE approach provides a fast and economical way to incorporate debugging and testing mechanisms into a microprocessor core. If necessary, the designer can then modify and fine-tune the integrated core to make it conform to industrial debugging specifications that are more extensive and sophisticated. In this section we investigate the feasibility of such an approach by modifying the N_ARM7+ICE of the previous section to conform to the ICE specification of ARM7TDMI [8]. We call this version N_ARM7+ICE ARM. The major differences between our retargetable ICE and ARM7TDMI's ICE are summarized in Table 4. The main reason for these differences is that, for the benefit of retargetability, our ICE encapsulates most of the ICE control operations into the TAP instructions, while ARM7TDMI relies on a combination of the microprocessor's instruction set, proprietary on-chip configuration registers/flags, TAP instructions, I/O pins, etc., which makes it harder to retarget its ICE module to other architectures. Table 5 shows the logic synthesis result of the modification using the same standard cell library as in the previous section. It takes only 1789 additional gates to support the more complicated behaviors of ARM7TDMI's ICE. The modification does not cause any performance degradation. Figure 11 shows the simulation waveform of a debugging activity of N_ARM7+ICE ARM. Subfigure (a) shows that the microprocessor is halted by an external debug request signal, and then a breakpoint is set up via the JTAG port. Subfigure (b) shows that the microprocessor resumes its execution and then a breakpoint is matched, which halts the microprocessor again. Subfigure (c) shows that some internal status is shifted out through the JTAG port for external observation.
Note that reading out the internal status is achieved by executing a memory store instruction in the microprocessor, which puts the status on the memory data bus; the data bus has boundary scan cells. The status can then be shifted out through the JTAG port.
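The read-out path described here (a store instruction places a value on the data bus, the boundary scan cells capture it, and the value is shifted out serially) can be illustrated with a minimal Python sketch. The function names are hypothetical, and the bit ordering (least significant bit nearest the serial output) is an assumption made only for this example.

```python
from typing import List


def capture(data_bus_value: int, width: int) -> List[int]:
    """Capture the data-bus value into boundary scan cells.

    Cell 0 is assumed to be the cell nearest the serial output and to hold
    the least significant bit.
    """
    return [(data_bus_value >> i) & 1 for i in range(width)]


def shift_out(cells: List[int]) -> List[int]:
    """Shift the chain one bit per test clock; return bits in output order."""
    chain = list(cells)
    out_bits = []
    while chain:
        out_bits.append(chain.pop(0))  # the cell nearest the output leaves first
    return out_bits


def reassemble(out_bits: List[int]) -> int:
    """Rebuild the word on the debugger side from the serial bit stream."""
    return sum(bit << i for i, bit in enumerate(out_bits))
```

A 32-bit register value therefore costs 32 test-clock shifts per read in this model, which gives a feel for why internal scan cells, where available, offer a more direct path to internal status.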
It took a master's student two months during a regular semester to accomplish the modification, including studying the ARM7TDMI specification, performing the necessary modifications, developing test benches, and verification.
Table 4. Major differences between our retargetable ICE and ARM7TDMI's ICE.
Breakpoint synchronization
Retargetable ICE: The breakpoint signal is synchronized with the first coming eoi signal.
ARM7TDMI's ICE: The breakpoint signal is synchronized with the second coming eoi signal.
Access into internal status
Retargetable ICE: (1) With internal scan cells; (2) activate the microprocessor to execute a memory store (load) instruction to write (read) internal status that is instruction accessible to (from) the data bus, which has boundary scan cells.
ARM7TDMI's ICE: Activate the microprocessor to execute a memory store (load) instruction to write (read) internal status that is instruction accessible to (from) the data bus, which has boundary scan cells.
External debug pin behavior (EXT_DEBUG)
Retargetable ICE: Needs to remain high for the entire debugging activity.
ARM7TDMI's ICE: Need not remain high for the entire debugging activity, since some of the control responsibility is taken over by the proprietary on-chip configuration registers.
Interfacing signals
Retargetable ICE: Minimal set as defined in Section 3.
Conclusions
We have presented an in-circuit emulation (ICE) module that can be embedded with a microprocessor core. The ICE module, based on the IEEE 1149.1 JTAG architecture, supports typical debugging and testing mechanisms, including boundary scan paths, partial scan paths, single stepping, internal resource monitoring and modification, breakpoint detection, and mode switching between debugging and normal modes. The architecture of the ICE module is parameterized and retargetable to different microprocessors. An FPGA prototype system has been constructed for the experiments of integrating the retargetable ICE
Figure 11. Simulation waveform of a debugging activity of N_ARM7+ICE ARM. Waveform annotations: wait for the breakpointed instruction to reach the execute stage before the µP is halted; µP resumes execution; breakpoint is matched; µP executes the "store" instruction (scanned in); the execution of the "store" instruction outputs the content of one register on the data bus; shift out the content of the data bus (through the boundary scan cells); the content of the boundary scan cells of the data bus is available on the output pin.
