Abstract: A comprehensive system debug methodology is presented, which combines state-of-the-art support for software, functional hardware and process technology debug. The application of this methodology to the 65-nm CMOS En-II SoC, which contains among others a high-performance ARM CPU and a TriMedia VLIW DSP, is described. The debug requirements and the implementation choices made are explained in detail.
Introduction
The continuous decrease in complementary metal-oxide semiconductor (CMOS) feature sizes allows design teams to integrate multiple processors with dedicated, hard-wired accelerators on a single die. This trend, combined with the continued growth in the amount of software that runs on top of these processors, makes it increasingly difficult to guarantee that all software and hardware design errors are detected and corrected before silicon is manufactured. Even though hardware/software co-verification is successful in screening out the majority of errors in software drivers using simulation models of the system chip, most application errors tend to show up only when the application software is run on the actual silicon. The noise margins in digital circuits are also becoming smaller because of the use of lower power supply voltages. Local voltage spikes and dips, temperature variations, crosstalk and substrate noise, which typically depend on unforeseen operational conditions, can cause chips to fail even though they passed all standard test procedures. However, because of the growing system complexity and the dramatic reduction in internal observability, the time required for system validation, that is, from first silicon to volume production, currently consumes more than 50% of the total project time. A supportive debug infrastructure is needed to help pinpoint the root cause of system failures quickly and accurately. If a system failure occurs, a consistent and complete system debug solution is essential to address the resulting time-to-market and quality concerns. In this paper, we show how a comprehensive system debug methodology, which combines support for software, functional hardware and process technology debug, has been applied to the En-II SoC design.
The remainder of this paper is organised as follows: Section 2 describes the En-II design. Section 3 lists the requirements for facilitating software, functional hardware and process technology debug. Section 4 describes the extensions of the IEEE 1149.1 test access port (TAP) controller to control the internal debug functionality. Debug execution control is described in Section 5, and details on the En-II's real-time trace capabilities are given in Section 6. The application of the process technology monitors is presented in Section 7. Section 8 reports on the area cost of this debug approach, and demonstrates an example debug scenario. We conclude this paper in Section 9.
En-II design
The high-level block diagram of the En-II design is shown in Fig. 1. The En-II is the successor to the En-I SoC [1] and includes both a high-performance ARM1176 processor core and a TriMedia VLIW processor on a single die. Both processors have access to a static RAM controller and a DDR controller through a shared system L2 cache interconnect. In addition to their connections to system memory, both processors can access a set of dedicated hard-wired peripherals for communication with the outside world via standardised interfaces, such as UART and I2C. Two expansion ports allow the chip to be used in a wide variety of applications.
The En-II chip has several voltage domains (ARM processor subsystem, TriMedia processor subsystem, RAM voltage domain, debug voltage domain, always-on domain, and SoC infrastructure voltage domain) that can be individually switched off. The En-II design also allows dynamic voltage and frequency scaling to achieve the lowest power consumption while the application performance requirements are still met. Individual clock domains exist for all controllable voltage domains.
When power is applied to the system, the embedded power-on-reset circuit within the power and reset controller generates a system reset event and pre-programmed reset sequences are applied to the SoC components. This controller is also responsible for resetting components that are returning to normal operation from a mode in which they were powered down.
  - Cross breakpoint support between the processors, the on-chip trace buffer, the trace port interface, the on-chip clock generation unit and external pins.
  - Clock control to stop all functional clocks on a breakpoint, and to allow any of these clocks to be switched to the TAP's TCK clock.
  - Access to the on-chip scan chains via the TAP for in-system state dumping.
• Real-time trace of key internal execution data.
  - Real-time observability of the software execution on the ARM and TriMedia processors.
  - Capture of real-time information in an on-chip trace buffer, and/or by external data acquisition equipment.
  - Access to system busses for debug control and observability of the hard-wired peripherals.
• Sensors to monitor the maximum process frequency, voltage drop, and local temperatures.
A good overview of debug solutions is given in [2]. However, each solution discussed addresses only part of our set of debug requirements for the En-II. The decision was made to base the larger part of the En-II debug infrastructure on the ARM CoreSight architecture [3], because of its native support for the ARM CPU that we use. For the En-II SoC, this base architecture was complemented with several in-house solutions to fulfil all debug requirements. In the following sections, each debug requirement described above is addressed in detail.
4 Debug control port
When software and hardware debug has to be carried out in a system environment, for example, to analyse system test fails or to analyse customer returns, the standard IEEE 1149.1 chip-level TAP and associated controller (CL TAPC) often provide the only available access mechanism not already used by the system itself. Fortunately, the CL TAPC can be easily re-used and extended to allow external control over on-chip debug functionality. For the En-II design, we implemented the Multi-TAP Controller architecture [4] as the primary debug access interface (see Fig. 2). This architecture allows us to access a CL TAPC and a CoreSight Debug Access Port (DAP) [3], while maintaining full IEEE 1149.1 compliance for board-level manufacturing test. The Multi-TAP controller architecture is part of the IEEE P1149.7 draft standard [5]. The CL TAPC is connected in series with the DAP to provide not only production test access to the IC, but also access and control for hardware and software debug. In addition to implementing the standard IEEE 1149.1 boundary scan instructions, the CL TAPC has been extended with additional debug instructions.
To enable the repetition of a silicon debug session for automated analysis, it is necessary that the debugger software can execute a functional reset of the chip. As the debugger software can only communicate with the on-chip debug hardware through the TAP, the CL TAPC was extended with a DBG_RESET instruction to give the debugger software control over the functional reset of the chip. When the TAP controller is not used and held in reset, this silicon debug reset signal is kept inactive, connecting the internal input of the power and reset controller directly to the device's reset pin.
At the heart of the CoreSight debug and trace architecture is the DAP, which interfaces among the off-chip debug tool, system bus connections and debug configuration/control connections. The TMS, TCK and TDI signals generated by a JTAG ICE are passed through the CL TAPC to the DAP. A debug tool (like the ARM software debugger) can issue high-level commands to the processor's debug unit, such as 'set a breakpoint at location 0x1234', or 'dump the contents from memory locations 0x0 to 0x100'.
The TDO_BYPASS_MUX multiplexer in Fig. 2 allows the DAP to be excluded from the serial chain between TDI and TDO during data register scans. This selection is controlled from the CL TAPC, and ensures IEEE 1149.1 compliant behaviour on the TAP during board-level manufacturing test operations. The other multiplexer in Fig. 2 allows direct access to the DAP, completely bypassing the CL TAPC. This is used for debug tools that cannot correctly handle a daisy-chain of on-chip TAP controllers. The JSEL input pin determines which operating mode is in use.
• When the JSEL input pin is equal to '1':
  - Both the CL TAPC and the DAP are part of the daisy-chain in the Shift-IR state.
  - All standard IEEE 1149.1 instructions, when applied to the CL TAPC, result in the selection of the CL TAPC's TDO in the Shift-DR state, and therefore in IEEE 1149.1-compliant behaviour.
  - Only the 'DBG_BYPASS1' instruction in the CL TAPC enables access to the DAP, by overruling the JSEL input.
• When the JSEL input pin is equal to '0':
  - The TAP device pins are directly connected to the DAP, whereas the CL TAPC is held in a stable state.
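The selection rules for the two operating modes can be summarised in a small behavioural model. The following Python sketch is illustrative only: the function name `dr_scan_path` and the string labels are our own, and the order of the two controllers in the chain is an assumption based on the series connection described in this section.

```python
from typing import List

def dr_scan_path(jsel: int, cl_tapc_instruction: str) -> List[str]:
    """Return which controllers sit between TDI and TDO during a data register scan.

    Behavioural sketch only; labels and chain order are assumptions.
    """
    if jsel == 0:
        # TAP pins are wired straight to the DAP; the CL TAPC is held stable.
        return ["DAP"]
    if cl_tapc_instruction == "DBG_BYPASS1":
        # The only CL TAPC instruction that lets the DAP into the data path.
        return ["CL_TAPC", "DAP"]
    # Any standard IEEE 1149.1 instruction: TDO comes from the CL TAPC alone.
    return ["CL_TAPC"]

print(dr_scan_path(1, "EXTEST"))       # board-level test view
print(dr_scan_path(1, "DBG_BYPASS1"))  # debug view through the daisy-chain
print(dr_scan_path(0, "EXTEST"))       # direct DAP access
```

With JSEL at '1', a board tester that only issues standard instructions never sees the DAP in the data path, which is what preserves IEEE 1149.1-compliant behaviour.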
5 Run-stop control
Debug run-stop control of the software executing on the two processors and of individual hardware components has been implemented to help pinpoint potential root causes of a system failure. Fig. 3 shows an overview of the breakpoint generation and distribution architecture used in the En-II. Two example breakpoint scenarios are shown in Fig. 3. Arrow 1 shows a scenario where a breakpoint in the ARM CPU causes the TriMedia DSP to stop. This scenario is useful for debugging the software interaction between the two cores. Arrow 2 shows a scenario where a breakpoint in the TriMedia DSP causes all functional clocks to be stopped using the clock generation unit (CGU). This scenario is used to create state dumps, as described in Section 5.4. The result of a system breakpoint is that either each processor enters its debug state, or that all functional clocks are stopped. This is programmable via the TAP and will typically depend on the amount of hardware detail required for a particular debug analysis.
Processor breakpoints
The ARM processor supports up to six thread-aware breakpoints and two watchpoints, by monitoring the ARM's address and data buses. These breakpoints and watchpoints are programmed either by the processor itself or through the DAP (see Section 4). Once a breakpoint condition occurs, the core asserts a debug acknowledge output to indicate that the core has entered the debug state. This signal is connected to the cross trigger interface #3 (CTI#3, see Fig. 3 and Section 5.3). CTI#3 also connects to the core's debug request signal to allow forcing the core into the debug state.
The TriMedia processor supports instruction, data address and data value breakpoints. Issued instruction addresses are compared against breakpoint low- and high-address values. A control register bit determines whether the instruction address has to be inside or outside of the specified range for a successful match. The TriMedia asserts one or more output signals when matches occur. These signals are routed to CTI#2 to signal a TriMedia breakpoint to, among others, the ARM core, so that both processors can be halted. One of the external interrupt inputs of the TriMedia is used to allow forcing a breakpoint in the TriMedia core itself.
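The inside/outside range comparison used for the TriMedia instruction breakpoints can be illustrated with a few lines of Python. This is a sketch under assumptions: the function and parameter names are hypothetical, and the bounds are taken to be inclusive, which the text does not specify.

```python
def instruction_breakpoint_match(addr: int, low: int, high: int,
                                 match_inside: bool) -> bool:
    """Range breakpoint: match inside or outside [low, high].

    `match_inside` stands in for the control register bit; inclusive
    bounds are an assumption made for this sketch.
    """
    inside = low <= addr <= high
    return inside if match_inside else not inside

# Break when execution enters a routine located at 0x4000..0x40FF ...
print(instruction_breakpoint_match(0x4010, 0x4000, 0x40FF, match_inside=True))
# ... or when execution leaves the region where the code is expected to stay.
print(instruction_breakpoint_match(0x8000, 0x4000, 0x40FF, match_inside=False))
```

The 'outside' mode is convenient for catching a program counter that strays from a known code region, for example after a corrupted return address.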
Bus monitor breakpoints
System bus monitors have been added to detect pre-programmed address and data values and generate breakpoint events. Processor addresses are compared to high and low values, whereas data values can be masked. Each match increases an internal counter, and when a pre-programmed counter value is reached, a signal is sent to CTI#0 to inform the other system components. In addition to address and data matches, these monitors can check the identification code of transactions. This identifies the bus master that is sending or requesting the specific data, and implements an additional filter on the transaction data that is monitored.
The monitors can additionally be programmed to calculate a checksum of the address and data values corresponding to matches. This checksum serves as a compact representation of the read and write sequences observed so far, and is therefore an easy means to validate long read and write transaction sequences. As bus values are observed locally, the checksum value can also assist in diagnosing crosstalk or other signal integrity issues on the system busses. The checksum serves as a higher-level reference for the state dumping functionality that, in some cases, is more meaningful than the clock cycle counters found in other systems.
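A behavioural sketch of such a monitor is given here to make the match conditions concrete. All names (`BusMonitor`, `observe`, the field names) are hypothetical, and the model assumes that the address range, masked data value and transaction ID must all match before the counter is incremented.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BusMonitor:
    addr_low: int
    addr_high: int
    data_value: int
    data_mask: int            # only bits set in the mask are compared
    master_id: Optional[int]  # None disables the transaction ID filter
    trigger_count: int        # number of matches before a trigger is raised
    hits: int = 0

    def observe(self, addr: int, data: int, txn_id: int) -> bool:
        """Process one bus transaction; return True when the trigger fires."""
        if not (self.addr_low <= addr <= self.addr_high):
            return False
        if (data & self.data_mask) != (self.data_value & self.data_mask):
            return False
        if self.master_id is not None and txn_id != self.master_id:
            return False
        self.hits += 1
        return self.hits == self.trigger_count

# Trigger on the second write of 0xAB.. by bus master 2 into 0x1000..0x1FFF.
monitor = BusMonitor(0x1000, 0x1FFF, 0xAB00, 0xFF00, master_id=2, trigger_count=2)
for txn in [(0x1004, 0xAB12, 2), (0x3000, 0xAB12, 2), (0x1008, 0xAB34, 2)]:
    print(txn, monitor.observe(*txn))
```

In the En-II, the event corresponding to a `True` return value would be the signal sent to CTI#0.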
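The use of a running checksum as a signature of a transaction sequence can be sketched as follows. The checksum algorithm used by the En-II monitors is not specified in this paper; CRC-32 from the Python standard library is used here purely as a stand-in, and the function names and 32-bit field widths are assumptions.

```python
import struct
import zlib

def update_checksum(checksum: int, addr: int, data: int) -> int:
    """Fold one matched (address, data) pair into the running checksum.

    CRC-32 is a stand-in; the actual on-chip algorithm is not given.
    """
    packed = struct.pack("<II", addr & 0xFFFFFFFF, data & 0xFFFFFFFF)
    return zlib.crc32(packed, checksum)

def sequence_checksum(transactions) -> int:
    checksum = 0
    for addr, data in transactions:
        checksum = update_checksum(checksum, addr, data)
    return checksum

expected = [(0x1000, 0xDEADBEEF), (0x1004, 0x12345678)]
observed = [(0x1000, 0xDEADBEEF), (0x1004, 0x12345679)]  # one bit flipped

# A mismatch flags that the observed sequence deviates from the reference.
print(sequence_checksum(expected) == sequence_checksum(observed))
```

Comparing a single checksum value read out via the debug interface against a value computed from simulation avoids having to transfer and compare the full transaction history.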
Breakpoint distribution
Debug events between cores are communicated via the embedded cross trigger (ECT) architecture. The En-II ECT consists of the cross trigger matrix (CTM), three cross trigger interfaces (CTIs) and a port for future extensions. The CTM routes trigger inputs and outputs to and from the connected cores through the CTIs. On the En-II, the ECT provides the standard mechanism for synchronised debug and trace, by routing triggers from a breakpoint source to a programmable set of trigger destinations.
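The programmable routing from trigger sources to trigger destinations can be pictured as a simple lookup table. The sketch below is a conceptual model only, not the CoreSight register interface; the class name and the trigger labels are hypothetical and were chosen to mirror the two scenarios of Fig. 3.

```python
class CrossTriggerMatrix:
    """Conceptual model: each trigger source fans out to a programmable set of destinations."""

    def __init__(self):
        self.routes = {}

    def connect(self, source: str, destination: str) -> None:
        self.routes.setdefault(source, set()).add(destination)

    def fire(self, source: str):
        """Return the destinations that receive a trigger from `source`."""
        return sorted(self.routes.get(source, ()))

ctm = CrossTriggerMatrix()
ctm.connect("arm_breakpoint", "trimedia_stop")        # arrow 1 in Fig. 3
ctm.connect("trimedia_breakpoint", "cgu_stop_clocks")  # arrow 2 in Fig. 3
ctm.connect("trimedia_breakpoint", "arm_debug_request")

print(ctm.fire("arm_breakpoint"))
print(ctm.fire("trimedia_breakpoint"))
```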
A common four-phase handshake protocol is used to ensure that trigger events are properly detected, even when multiple clock domains need to be crossed to reach the trigger's destination. Fig. 4 shows the operation of this handshake and its four phases.
This handshake uses a debug trigger request and acknowledge signal pair with return-to-zero signalling. When a trigger occurs, the request signal R_S is asserted and observed as R_D in the receiving clock domain. Upon detection of this assertion, the receiving clock domain asserts the acknowledge signal A_D. The request signal R_S remains asserted until the assertion of the acknowledge is detected on A_S, after which it is de-asserted. The de-assertion of the request signal R_S is followed by the de-assertion of the acknowledge signal A_D by the receiving clock domain.
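The four phases of this handshake can be reproduced with a small simulation of the request (R_S) and acknowledge (A_D) signals. This is a simplified sketch: both clock domains are advanced by one common tick, and each domain crossing is modelled as a fixed synchroniser delay (`sync_delay`), neither of which reflects the truly independent clocks of the En-II. The function names are our own.

```python
from collections import deque

def four_phase_handshake(sync_delay: int = 2):
    """Return the (R_S, A_D) values per tick for one trigger event."""
    req_sync = deque([0] * sync_delay)  # R_S as seen later in the receiver (R_D)
    ack_sync = deque([0] * sync_delay)  # A_D as seen later in the sender (A_S)
    r_s, a_d = 1, 0                     # a trigger occurs: request asserted
    ack_seen = False
    trace = [(r_s, a_d)]
    while not (ack_seen and r_s == 0 and a_d == 0):
        req_sync.append(r_s)
        r_d = req_sync.popleft()
        ack_sync.append(a_d)
        a_s = ack_sync.popleft()
        a_d = r_d                       # receiver: acknowledge follows the request
        if a_s:                         # sender: drop request once acknowledged
            r_s, ack_seen = 0, True
        trace.append((r_s, a_d))
    return trace

def phases(trace):
    """Collapse repeated states to show the order of the handshake phases."""
    out = []
    for state in trace:
        if not out or out[-1] != state:
            out.append(state)
    return out

print(phases(four_phase_handshake()))  # [(1, 0), (1, 1), (0, 1), (0, 0)]
```

Increasing `sync_delay` lengthens the trace without changing the order of the phases, which illustrates why trigger destinations react some cycles after the trigger source.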
The components that act as trigger destinations stop a few clock cycles later than the trigger source because of the time it takes to communicate the trigger event across clock domains. This is a recognised inconvenience [6, 7] that cannot be avoided in multiple clock domain SoCs with truly independent clocks. The consequences of this have to be taken into account while debugging.
State dumping
The En-II state dumping functionality is controlled from the CL TAPC, and is based on the architecture presented in [8]. It allows access to the complete state of the En-II that is reachable through scan. The breakpoint generation and distribution architecture can be programmed to stop a selection of clocks on the detection of a system event. Once the functional clocks have stopped, the source of the functional clocks is switched to a gated TCK clock. In addition, all scan chains are concatenated into one long, so-called scan probe. When the DBG_SCAN instruction is loaded in the CL TAPC, this scan probe is connected to the CL TAPC's TDI and TDO ports. The gated TCK clock is subsequently enabled when the CL TAPC's state machine is in the Shift-DR state. While remaining in this state, the complete state of the chip is observed at the TDO pin on consecutive TCK clock cycles. The resulting state dump can be compared to, for example, state dumps obtained from a simulation, or from an emulation setup. This comparison allows for quicker and more accurate detection of deviations from the chip's intended behaviour.
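The comparison of a silicon state dump against a reference can be as simple as a bit-wise difference over the scan probe. The sketch below assumes that both dumps are available as strings of '0' and '1' in scan order, and that the reference may contain 'x' for bits whose value is unknown in simulation; the function name and this encoding are assumptions, not part of the En-II tool flow.

```python
from typing import List

def diff_state_dumps(silicon: str, reference: str) -> List[int]:
    """Return the scan positions at which the silicon dump deviates from the reference.

    Reference bits marked 'x' (unknown) are skipped.
    """
    if len(silicon) != len(reference):
        raise ValueError("state dumps must cover the same scan probe length")
    return [
        position
        for position, (s, r) in enumerate(zip(silicon, reference))
        if r != "x" and s != r
    ]

# Position 2 differs; position 1 is unknown in the reference and is ignored.
print(diff_state_dumps("10110", "1x010"))  # [2]
```

Each reported position can then be mapped back to a flip-flop name using the scan chain ordering from the design database.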
6 Real-time trace
Trace data sources
Both the ARM and the TriMedia subsystem can act as a trace source. The ARM processor contains a trace macro cell (ETM11CS) that provides instruction and data traces for the processor. Programming is done through the DAP interface. While the core is running at full speed, the trace macro cell continuously monitors the core's buses. This allows information on the processor's activity to be captured both before and after a specific event.
Software-controllable filters and trigger logic allow the software programmer to select which instructions and data are captured by the ETM11CS before the information is compressed. The trace output conforms to the ARM AMBA trace bus interface specification. The TT3271 is the trace module for the TriMedia processor core. It produces a compressed bit stream that can contain information on the software executing on the TriMedia core. The TT3271 can be programmed to output the following types of trace information:
1. Program counter.
2. Event information on, among others, data cache write and read misses, software and hardware prefetch requests, and uncached reads and writes.
3. Number of stall cycles and the source of each stall.
4. Guarding information indicating which operations in an instruction are not executed because of guarding.
5. Re-synchronisation information for program and cycle counters to recover from trace data loss.
Experiments show that the average number of trace bits for the compressed TriMedia trace output is well below 2.5 bits per cycle. A protocol adapter translates the TT3271 trace output so that it conforms to the ARM AMBA trace bus interface specification, allowing the data to be combined with the trace data from the ARM core.
Trace data transport
The compressed data streams from the two trace sources are combined by a Trace Funnel before being passed to the Trace Replicator. The Funnel behaves like an arbiter and multiplexer. The software programmer selects an arbitration scheme to select which input trace stream to pass for each bus cycle. A priority register determines the priority for each source. The Trace Replicator enables the trace data stream to be sent to either the ETB (embedded trace buffer), the TPIU (trace port interface unit), or both.
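The arbitration performed by the Trace Funnel can be illustrated with a static-priority selection. This sketch covers only one possible arbitration scheme and assumes that a lower number in the priority register means a higher priority; the function name and the source labels are hypothetical.

```python
from typing import Dict, Optional

def funnel_select(pending: Dict[str, bool], priority: Dict[str, int]) -> Optional[str]:
    """Pick the trace source to forward in this bus cycle.

    Static priority only; a lower number wins (an assumption of this sketch).
    """
    ready = [source for source, has_data in pending.items() if has_data]
    if not ready:
        return None
    return min(ready, key=lambda source: priority[source])

priority = {"arm_etm": 0, "trimedia_tt3271": 1}
print(funnel_select({"arm_etm": True, "trimedia_tt3271": True}, priority))
print(funnel_select({"arm_etm": False, "trimedia_tt3271": True}, priority))
print(funnel_select({"arm_etm": False, "trimedia_tt3271": False}, priority))
```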
The En-II has multiple voltage and frequency domains. Asynchronous bridges are used whenever the trace architecture crosses domains. Because the read and write rates vary depending on the voltage and frequency conditions within a domain, these bridges use buffering to ensure correct trace communication. Level shifters and clamps are used within the receiving domain to bridge any voltage differences and ensure that no undefined control values are ever generated.
Trace data sinks
The ETB stores trace data on-chip at high rates using 32-bit data elements. These data can be retrieved at a later stage and at a lower rate, either by external equipment through the DAP interface or by the ARM processor. This provides access to real-time trace information whenever the number of trace pins and/or the trace frequency required cannot be provided by the device pins. For the En-II chip, an ETB size of 8 kB has been chosen. This size is a trade-off between the number of multi-cycle events that can be captured from multiple trace sources and the amount of memory dedicated to debug functionality. Source IDs are included in the data streamed to the trace buffer to distinguish individual trace sources. Owing to the high compression achieved, a single ETB can be shared among various trace sources.
The TPIU formats and transmits the trace data off-chip. The trace data format also contains a source ID to enable multiple trace sources to be transmitted through a single trace port. The trace port is programmable by software with respect to the trace frequency (from 1 kHz to the maximum supported by the external trace equipment). The stream is output at dual data rate on 16 dedicated device pins at 125 MHz, yielding a total available trace bandwidth of 4 Gbit/s. The sampling points are indicated to the capturing device by the rising and falling edges of the trace clock. At device level, these correspond to the centres of stable data on the trace control and data signals.
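The quoted trace bandwidth follows directly from the pin count, the trace clock and the use of both clock edges. The short calculation below reproduces it, and also gives a rough estimate of the capture window of the 8 kB ETB for a TriMedia-only trace at 2.5 bits per cycle. The second figure is an illustration under assumptions: 1 kB is taken as 1024 bytes and the overhead of source IDs and formatting is ignored.

```python
def trace_port_bandwidth(pins: int, clock_hz: int, edges_per_cycle: int = 2) -> int:
    """Trace port bandwidth in bit/s; two edges per cycle models dual data rate."""
    return pins * clock_hz * edges_per_cycle

def etb_capture_cycles(etb_bytes: int, bits_per_cycle: float) -> int:
    """Rough number of processor cycles that fit in the ETB (formatting overhead ignored)."""
    return int(etb_bytes * 8 / bits_per_cycle)

print(trace_port_bandwidth(16, 125_000_000))  # 4000000000 bit/s, i.e. 4 Gbit/s
print(etb_capture_cycles(8 * 1024, 2.5))      # 26214 cycles at 2.5 bits per cycle
```

As the measured average is below 2.5 bits per cycle, the capture window in practice would be longer than this estimate, before accounting for overhead.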
Process monitors
On-chip variations may severely affect the chip performance for deep sub-micron processes. To accurately measure these effects, the En-II design contains special process sensors [9] to allow real-time monitoring of process technology parameters that affect chip performance. These monitors are accessed via a separate TAP interface while the system is running. This separate interface is used to ensure that under all circumstances, that is, during manufacturing test and functional execution, the sensors can be accessed and the on-chip parameters measured, without impacting the chip's behaviour. Fig. 6 shows the TAP-based access architecture for the En-II process monitors.
In total, 15 units with process sensors have been implemented and spread across the design layout. Unit 0 contains only ring oscillators, whereas units 1-14 each contain a ring oscillator, a voltage drop sensor and a temperature sensor. The process sensors are controlled from the TAP data register, containing reference values and control and status signals (see Fig. 7).
A high-speed comparator inside the monitor compares the reference value with the locally measured quantity. The comparator output flag can be captured in the TAP data register and observed by external debugger software. By increasing and decreasing the reference value, the software can track the value of the measured quantity. The ring oscillators are daisy-chained and output on a single clock pin. Via the TAP controller, the ring oscillator whose clock signal needs to be made visible on this pin can be selected.
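One way in which debugger software could track the measured quantity is a binary search over the reference value, using the comparator flag as the only feedback. The sketch below is an assumption-laden illustration: the reference is modelled as an unsigned integer code of `bits` bits, the flag is assumed to be set when the measured quantity is greater than or equal to the reference, and `compare` is a hypothetical stand-in for one TAP data register access.

```python
from typing import Callable

def track_measurement(compare: Callable[[int], bool], bits: int = 8) -> int:
    """Find the largest reference code for which the comparator flag is set.

    Needs at most `bits` comparator reads, instead of stepping the
    reference up and down one code at a time.
    """
    low, high = 0, (1 << bits) - 1
    while low < high:
        mid = (low + high + 1) // 2
        if compare(mid):      # flag set: measured quantity >= reference
            low = mid
        else:
            high = mid - 1
    return low

# Mock sensor standing in for the on-chip comparator (value 137 is arbitrary).
measured = 137
print(track_measurement(lambda reference: measured >= reference))  # 137
```

Once the value has been located, small changes can be followed by incrementing or decrementing the reference around the last result, as described in the text.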
The measurements obtained from these sensors allow us both to pinpoint potential causes of system failures and to provide feedback on the predictive quality of the models and tools used during pre-silicon verification. These use cases allow us, on the one hand, to quickly contain a certain problem and find a work-around and, on the other hand, to prevent similar problems on future system chips.
8 Experimental results
Area cost
The cost of implementing the debug architecture described in this paper is summarised in Table 1 . The area cost is given as the percentage of the total number of logic gates. The trace sources, which take 'raw' data from the processor cores and subsequently use compression to significantly reduce the amount of trace bandwidth required, constitute the largest part of the debug infrastructure. The process sensors take up some 0.9% of the total area. This includes the area for the analog sensors as well as the IEEE 1149.1 program register. A sizeable amount of area is also consumed by the asynchronous bridges. Next to the logic gates, the ETB consists of 8 kB of SRAM, which constitutes 0.75% of the total embedded SRAM on the En-II. Fig. 8 shows an example state dump scenario that utilises the functionality described in this paper.
Example state dump scenario
Starting on the left-hand side of Fig. 8, the active-low power-on-reset is asserted first. After the power-on-reset is de-asserted, the TAP controller and the on-chip test and debug logic are reset. Next, the clock generation unit is programmed to stop the functional clocks on a breakpoint event. This is done using the PROGRAM_CGU_TPR instruction. Afterwards, the DAP is used to program the CTIs and CTM, and set up a breakpoint event. Some time after the breakpoint has been enabled, the breakpoint event is generated (dbg_stop_req) and propagated to the CGU. This indicates that the on-chip functional clocks need to be gated. As a result, all functional clocks are stopped. Fig. 8 shows three functional clock signals (arm_clk, tm_clk and ddr_ctrl_clk), which are gated after the dbg_stop_req signal is asserted. This condition is subsequently detected by the debugger software through the use of the QUERY_CGU_TPR instruction, which allows the debugger software to check the status of the dbg_stop_req signal. Once the debugger software has detected that the chip has stopped, it proceeds with reprogramming the CGU (using the PROGRAM_CGU_TPR instruction) to allow all functional clocks to be reactivated via the CL TAPC and derived from the TCK clock. Afterwards, the operating mode of the circuit is changed by reprogramming the Test Control Block (using the PROGRAM_TCB instruction). Finally, the DBG_SCAN instruction is selected in the CL TAPC, after which the TCK clock is passed to all selected functional clocks, and the state of the chip is shifted out on the TDO pin of the device.
Note that even though the clock signals look the same in Fig. 8 , the clock frequencies of the functional clocks before and after the breakpoint hit are different. Before the breakpoint, these clock signals have their appropriate functional application frequencies, whereas after the breakpoint has hit, and the circuit has been switched to debug scan mode, these clock signals are derived from the TCK signal and have the TCK frequency. 
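From the point of view of the debugger software, the scenario of Fig. 8 is a fixed sequence of TAP operations. The sketch below replays that sequence against a mock TAP driver. Only the instruction names (PROGRAM_CGU_TPR, QUERY_CGU_TPR, PROGRAM_TCB, DBG_SCAN) are taken from this paper; the `MockTap` class, its methods, the payload strings and the polling behaviour are hypothetical, and the initial reset steps are omitted.

```python
class MockTap:
    """Hypothetical stand-in for a TAP driver; records the operations issued."""

    def __init__(self, polls_until_stopped: int = 3, chip_state: str = "1011"):
        self.log = []
        self._polls_left = polls_until_stopped
        self._chip_state = chip_state

    def instruction(self, name: str, payload: str = ""):
        self.log.append(name)
        if name == "QUERY_CGU_TPR":
            self._polls_left -= 1
            return self._polls_left <= 0   # True once dbg_stop_req is seen
        if name == "DBG_SCAN":
            return self._chip_state        # bits shifted out on TDO
        return None

    def dap_program(self, what: str) -> None:
        self.log.append("DAP:" + what)

def state_dump(tap: MockTap) -> str:
    tap.instruction("PROGRAM_CGU_TPR", "stop clocks on breakpoint")
    tap.dap_program("CTIs, CTM and breakpoint")
    while not tap.instruction("QUERY_CGU_TPR"):  # poll until the clocks have stopped
        pass
    tap.instruction("PROGRAM_CGU_TPR", "derive clocks from TCK")
    tap.instruction("PROGRAM_TCB", "debug scan mode")
    return tap.instruction("DBG_SCAN")

tap = MockTap()
print(state_dump(tap))
print(tap.log)
```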
Conclusion
We have presented a comprehensive system debug methodology, which supports the debugging of software, hardware and process technology. This methodology has been implemented for the En-II SoC design. The En-II chip is a 65-nm CMOS chip and is currently in its final design stages. En-II silicon is expected in Q3 2007.
