University of California, Los Angeles; Rockwell Semiconductor Systems, Newport Beach

ABSTRACT

Due to the exponential growth of both design complexity and the number of gates per pin, functional debugging has emerged as a critical step in the development of a system-on-chip. We introduce a novel debugging approach for programmable systems-on-chip. The new method leverages the advantages of the two complementary functional execution approaches, emulation and simulation. We have developed a set of tools, transparent to both the design and debugging process, which enable the user to run long test sequences in emulation and, upon error detection, roll back to an arbitrary instance in execution time and switch over to simulation-based debugging for full design visibility and controllability. The efficacy of the approach depends on the method for transferring the computation from one execution domain to another. To enable effective transfer of the computation state, we have identified a set of optimization tasks, established their computational complexity, and developed an efficient suite of optimization algorithms.
1. INTRODUCTION
With the increasing complexity of modern designs, functional verification emerges as a time and cost dominant step in the development process. For example, verification of the UltraSPARC-I took twice as long as its design [Yan95]. Traditional approaches, such as system emulation and simulation, are becoming increasingly inefficient at addressing debugging needs. Emulation is fast, but provides limited design controllability and observability. Simulation has the required controllability and observability, but is six to ten orders of magnitude slower than emulation [mrtn97]. For simulation, state-of-the-art RT-level simulators are capable of performing error trace and timing analysis (Interra's Picasso [Int98]) and backtracking (Synopsys' Cyclone [Syn98]). For programmable processor simulation, instruction-set simulators provide full system visibility at various degrees of accuracy. The debugging circuitry in the emulator, usually implemented using a JTAG boundary scan methodology [Mau86], enables controllability and observability of particular internal states. The emulation testbeds have evolved into logic (functional) porting of the processor model into arrays of rapid prototyping modules (e.g. arrays of gates, FPGAs [Apt98, Qui98, Iko98]). Such emulation engines aim to provide both high execution speed and relatively high observability and controllability of all registers [Mar98]. These systems suffer from high cost and/or reduced controllability and observability (less than 1K signals). Recently, a technique which leverages the advantages of the two complementary functional execution approaches, emulation and simulation, has been presented [Kir97]. However, this technique targets only statically scheduled single-core ASIC designs.

ICCAD98, San Jose, CA, USA. © 1998 ACM.

Trends in the semiconductor industry show that programmable systems-on-chip are becoming a dominant design paradigm. We have developed a generalized methodology for coordinated simulation and emulation of multi-core programmable systems-on-chip. The developed approach enables the user to migrate the functional execution of the design back and forth between the simulator and the emulator. Long test sequences are run in emulation. Upon error detection, the computation is migrated to the simulation tool for full design visibility and controllability.

To explain how execution is transferred from one domain to another, we use the notion of a complete cut. A complete cut is a set of variables which fully determines the design state at an arbitrary time instance [Kir98]. The running design (simulation or emulation) periodically outputs its cuts. The cuts are saved by a monitoring workstation. When a transition to the alternate domain is desired, any one of the previously saved cuts can be used to initialize the alternate domain and then continue execution with preserved functional and timing accuracy.
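The cut-based hand-over described above can be illustrated with a minimal Python sketch. Everything in it is a hypothetical stand-in rather than the tool set of this paper: `step` is a toy two-register design whose state tuple plays the role of a complete cut, `Monitor` stands in for the monitoring workstation, and `emulate` and `simulate_from` mimic the fast and the fully visible execution domains.

```python
from dataclasses import dataclass, field


def step(state, x):
    """One clock cycle of a toy design; `state` plays the role of a complete cut."""
    acc, prev = state
    return ((acc + x * prev) % 251, x)


@dataclass
class Monitor:
    """Hypothetical stand-in for the workstation that stores periodic cuts."""
    period: int
    cuts: dict = field(default_factory=dict)  # cycle index -> saved cut

    def latest_before(self, cycle):
        # Most recent saved cut taken at or before the requested cycle.
        return max(c for c in self.cuts if c <= cycle)


def emulate(inputs, init, monitor):
    """Fast run: no internal visibility, only periodic cut output."""
    state = init
    for t, x in enumerate(inputs):
        if t % monitor.period == 0:
            monitor.cuts[t] = state
        state = step(state, x)
    return state


def simulate_from(inputs, monitor, target_cycle):
    """Slow run with full visibility, initialized from a saved cut."""
    start = monitor.latest_before(target_cycle)
    state = monitor.cuts[start]
    trace = []
    for t in range(start, target_cycle + 1):
        trace.append((t, state))  # every intermediate state is observable
        state = step(state, inputs[t])
    return trace
```

Because the cut fully determines the state, replaying only the cycles between the chosen cut and the cycle of interest reproduces the same states the fast run went through, which is the property the roll-back relies on.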
The debugging paradigm introduces a number of optimization problems and a need for efficient implementation mechanisms. We propose a suite of algorithms which effectively identifies the minimal computation state and post-processes the core design and system integration to enable I/O of variables of the identified computation state. We have conducted a set of experiments on standard multi-core benchmarks to quantify the overhead induced to enable the developed debugging methodology.
Figure 1: The targeted system core architecture, embedded software, and core integration.
2. HARDWARE AND COMPUTATION MODEL
The architecture template used to evaluate the developed debugging method is depicted in Figure 1. The architecture is typical for most modern consumer electronics devices. It contains a set of application-specific (ASIC) and slave programmable cores (SPC) connected to a shared bus. The system is controlled by a single master programmable core (MPC). Each ASIC contains a datapath and/or memory hierarchy.
We target the following heterogeneous model of computation. The backbone of the model is the semi-infinite stream (SIS) random access machine (RAM) model. The standard RAM model [Aho83] is relaxed by removing the requirement for algorithm termination. The SISRAM model provides high flexibility with well tested and widely used semantics and syntax (C and Java). The second component of the heterogeneous model is synchronous data flow (SDF) [Lee87], which is used for specifying a potentially empty set of statically scheduled islands of computation. This model facilitates optimization-intensive implementation on both ASIC and programmable platforms.
GLOBAL DESIGN-FOR-DEBUGGING FLOW
During the design of an application-specific core, debug functionality is added as a post-processing step. This functionality includes a set of register-to-output interconnections and a feature which enables the system integrator to select a specific cut. Since the system architecture is in general not known at the time the application-specific core is designed, this subset of interconnects should enable I/O of variables for a large number of candidate cuts. When numerous options exist for selecting a cut, the system integrator can coordinate the cut I/O more effectively.
The ASIC developer provides the system integrator with information about the set of cuts. For each cut, the variables and the control steps at which each variable can be dispensed through the virtual pins of the ASIC are given. The system integrator faces three design problems. First, for each ASIC a single cut has to be selected. Second, the selected cuts, jointly with the primary inputs and outputs, have to be scheduled for I/O over the available set of I/O pins. We integrated these two phases into a tight optimization loop which searches for a feasible solution. In the third step, if no feasible schedule is found, the system-on-chip cut I/O is spread in time by scheduling particular ASIC cuts sequentially.
Each PC has, in general, two components in its cut: instruction-accessible states (e.g. general-purpose registers), and states not accessible to machine code (e.g. branch prediction hardware). The part of the cut accessible to instructions undergoes I/O by means of code instrumentation. The portion of the core's state that is not accessible by instructions must be either reset (e.g. cache flush) or I/O. The debug instructions are instrumented into the object code in a way similar to Purify [Has92]. An example of instrumented code is given in Figure 1. Instructions O~@~); and if Debug D: =~ are sufficient to enable full controllability and visibility.
During the system software development, four subtasks are undertaken. In the first phase, the minimal-size cut for each statically scheduled "computational island" is identified. In the second step, the programs are instrumented for cut I/O. In the last two phases, we identify cut variables outside the SDF islands and instrument the non-SDF code running on the MPC and SPCs with instructions which control the system cut I/O.

Figure 2 illustrates the technical details of the process of cut I/O. The code instrumentation which runs on the MPC starts the cut I/O. As shown in Figure 2, the MPC first sends a signal (start ASIC) to the ASICs, which start their cut I/O. This process is statically scheduled. Cuts of cores ASIC1 (Cut1) and ASIC3 (Cut2) are interleaved and the cut of core ASIC4 (Cut3) is I/O sequentially. Due to static scheduling, the MPC knows when the ASIC cut I/O is complete. If buffering is used to resolve the problem of unscheduled cut variables, the MPC is responsible for explicit I/O of these variables. The MPC, using a round-robin policy, initiates (start SPC(i)) the cut I/O of each SPC. Upon receipt of this signal, the virtual tristate gate that controls the actual I/O of variables onto the shared bus is enabled and the cut I/O starts. The instrumented code running on the SPC has to be able to assure that exactly one cut I/O from each SPC (Cut SPC) is completed. Once its cut is dispensed, the SPC sends a signal back to the MPC which acknowledges one successful cut I/O. Finally, the MPC initiates its own cut I/O, which represents the end of the system cut I/O.

The problem of identifying a minimal-size cut is NP-hard since there is a one-to-one mapping between its special case, when all operations in the computation are executed exactly the same number of times, and the FEEDBACK ARC SET problem [Gar79]. For this problem, we have developed a heuristic summarized in Figure 3. Initially, the computation CDFG is partitioned into a set of strongly connected components (SCCs) using a breadth-first search algorithm [Cor90]. All trivial SCCs, which contain exactly one vertex, are deleted from the resulting set because they do not form cycles. Then, for each SCC several processing steps are performed. First, to reduce the solution search space, a graph compaction step is performed. Each path $P : A \leadsto B$ which contains only vertices $V \in P$, $V \neq A$, with exactly one input variable is replaced with a new edge $E_{A,B}$ which connects the source $A$ and the destination $B$. Second, for each node $V$ in the graph an objective is established. The objective evaluates the cardinalities of the newly created SCCs in the remaining graph when $V$ is deleted. The node that has the smallest objective is deleted from the graph and added to the resulting cut. The described process is repeated until the set of nontrivial SCCs in the graph is empty.
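The greedy loop of the heuristic can be sketched in Python as follows. This is an illustration under stated assumptions, not the pseudo-code of Figure 3: the SCC partition uses Kosaraju's depth-first method as a stand-in for the search cited above, the graph compaction step is omitted, the graph is assumed to have no self-loops, and the objective is assumed to be the sum of squared sizes of the nontrivial SCCs that remain after a node is deleted. The function names `sccs`, `remove`, `objective` and `select_cut` are hypothetical.

```python
def sccs(graph):
    """SCCs of {node: set(successors)}; every successor must also be a key."""
    order, seen = [], set()
    for root in graph:  # first pass: record DFS finishing order
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(graph[root]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:  # all successors explored
                order.append(node)
                stack.pop()
    rev = {n: set() for n in graph}
    for n, succs in graph.items():
        for s in succs:
            rev[s].add(n)
    comps, assigned = [], set()
    for root in reversed(order):  # second pass on the reversed graph
        if root in assigned:
            continue
        comp, todo = set(), [root]
        assigned.add(root)
        while todo:
            n = todo.pop()
            comp.add(n)
            for p in rev[n]:
                if p not in assigned:
                    assigned.add(p)
                    todo.append(p)
        comps.append(comp)
    return comps


def remove(graph, v):
    """Copy of the graph with node v and all its edges deleted."""
    return {n: {s for s in succs if s != v}
            for n, succs in graph.items() if n != v}


def objective(graph, v):
    """Assumed cost: sum of squared sizes of nontrivial SCCs left without v."""
    return sum(len(c) ** 2 for c in sccs(remove(graph, v)) if len(c) > 1)


def select_cut(graph):
    """Greedily delete nodes until no nontrivial SCC (cycle) remains."""
    cut = []
    while True:
        cyclic = [c for c in sccs(graph) if len(c) > 1]
        if not cyclic:
            return cut
        candidates = sorted(set().union(*cyclic))
        best = min(candidates, key=lambda v: objective(graph, v))
        cut.append(best)
        graph = remove(graph, best)
```

For a graph with two cycles that share node `c`, for example `{"a": {"b"}, "b": {"c"}, "c": {"a", "d"}, "d": {"c"}}`, deleting `c` leaves no nontrivial SCC, so the sketch selects `c` as a one-variable cut.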
ASIC Cut Selection
The debugging strategy comprises two modular phases, conducted by two parties in the system development process: the core provider and the system integrator. The core developer selects a minimal number of register-to-output interconnects such that a large number of complete cuts are available to the system integrator. An additional constraint is set on the timing occurrence of these cuts. Since the core developer does not know the multi-core system configuration in advance, the search for a set of register-to-output interconnects is targeted at a large number of non-overlapping small cuts. Such a subset of registers gives the system integrator flexibility in finding a solution to the cut scheduling problem. The definition of a debugging register subset forces selection of registers which define a large-cardinality set of cuts with small cardinality, long lifetimes of the contained cut variables, and non-overlapping lifetimes of variables in the set of cuts. The core developer faces an optimization problem to find a subset of registers with the smallest possible constants Max and K. A special case of this problem, with no register sharing among CDFG computation variables and no additional heuristic requirements, is NP-hard since it is equivalent to the FEEDBACK ARC SET problem [Gar79].

We have developed a heuristic to search for a debugging register subset in a scheduled and assigned CDFG. The algorithm is formally explained using the pseudo-code in Figure 4. The algorithm first partitions the CDFG into a set $S$ of SCCs. Then, for each register $R_i$, an objective function evaluates the set of SCCs $S_{new,i} \subseteq S_{new}$ which result from the deletion of all variables held by register $R_i$. The objective function used to quantify the register selection is defined in terms of two quantities. Scar returns the sum of squares of cardinalities of the set $S_{new,i}$ of SCCs created when all variables held by $R_i$ are deleted from the CDFG. LiveVars returns, for control step $C$, the sum of squares of the number of variables live at $C$ and held by the currently selected subset of registers $SR$. The register with the highest objective function is selected and added to the currently selected subset of registers $SR$, and all its variables are deleted from the original CDFG. The process of register selection is recursively repeated while the set of nontrivial SCCs is not empty.
Multi-ASIC Cut Selection and Scheduling
We introduce an algorithmic solution which enables I/O of cut variables from multiple statically scheduled ASICs. We use the common multiple (CM) of all ASIC iteration periods as the system debugging period. Within this period the algorithm tries to find a feasible schedule of the variables of all ASIC cuts such that the range of control steps between the moments when the first and the last variable in the ASIC subsystem cut is output is minimal. The NP-complete problem of scheduling a subset of variables in a CDFG [Kir97] is a special case of this problem.

We developed a most-constrained least-constraining heuristic described using the pseudo-code in Figure 5. Initially, for each ASIC, the available cuts are sorted in decreasing order with respect to the average lifetime of the contained variables. The selection and scheduling search loop selects one cut for each ASIC from its list of available cuts. Cuts that contain variables with longer average lifetimes are given priority. Next, within CM consecutive control steps, the algorithm searches for the subset of $M$ consecutive control steps in which the variables of the selected cuts can be scheduled. The search is initiated by determining the lower bound on the range of control steps $M = M_{min}$ for which all cuts can be dispensed. This bound is equal to the sum of the cardinalities of all ASIC cuts. Then, within CM consecutive control steps, a set $T$ is found where each element $T_P \in T$ represents a subset of $N_P$ consecutive control steps which contains at least $M_{min}$ idle control steps, and for each variable of all cuts there must be at least one idle step in which it is alive. For each combination of cuts within $T_P$, a scheduling heuristic is performed. The scheduling heuristic iteratively constructs the solution by selecting the $N$ most-constrained cut variables and scheduling them exactly at the $N$ least-constraining control steps. If a feasible schedule is found, the range of the solution $N_P$ is compared to the best current solution. Otherwise, the control step range $M$ is increased and the search procedure is repeated.
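The core ideas of this procedure can be sketched in Python under several simplifying assumptions that are not taken from Figure 5: the least common multiple is used as the common multiple, one cut per ASIC is assumed to be already selected, variables are placed one at a time rather than $N$ at a time, and the window search over $T$ is left out. The names `system_period`, `lower_bound` and `schedule`, as well as the input encoding (a map from each cut variable to the control steps at which it is alive), are hypothetical.

```python
from functools import reduce
from math import gcd


def system_period(periods):
    """Assumed debugging period: least common multiple of the ASIC periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods, 1)


def lower_bound(cuts):
    """M_min: one control step per cut variable, summed over all ASIC cuts."""
    return sum(len(cut) for cut in cuts)


def schedule(live, idle_steps):
    """Assign each cut variable to one idle control step in which it is alive.

    live: {variable: set of control steps at which the variable is alive}
    Returns {variable: step}, or None if this greedy pass finds no schedule.
    """
    free = set(idle_steps)
    pending = {v: steps & free for v, steps in live.items()}
    result = {}
    while pending:
        # Most-constrained variable: fewest remaining feasible steps.
        var = min(pending, key=lambda v: (len(pending[v]), v))
        options = pending.pop(var)
        if not options:
            return None
        # Least-constraining step: usable by the fewest other variables.
        step = min(options,
                   key=lambda s: (sum(s in o for o in pending.values()), s))
        result[var] = step
        for o in pending.values():
            o.discard(step)
    return result
```

With `live = {"x": {0, 1}, "y": {1}, "z": {1, 2, 3}}` and idle steps 0 to 3, the variable `y` is placed first because it has a single feasible step, and the resulting schedule occupies three consecutive control steps, which matches the lower bound for a system cut of three variables.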
EXPERIMENTS
We have conducted a set of experiments to evaluate the effectiveness of our system debugging paradigm. Table 1 shows experimental results for the ASIC design-for-debugging technique. All designs were synthesized using the HYPER system [Rab91]. Columns 2-6 of Table 1 present the iteration period in control steps, the number of variables in the computation, and the total area. Column eight presents the number of variables in the smallest cut, broken down into I/O and non-I/O cut variables. Finally, the last column shows the area overhead (OH).

Table 2 presents the experimental results which demonstrate the efficiency of our multi-ASIC cut-selection and scheduling technique. Columns 2-4 present the resulting system period, the number of variables in the system cut, and the range of control steps in which the cut transfer is accomplished. By comparing these two columns, it is clear that the available system I/O-idle control steps are efficiently utilized to I/O interleaved cuts.

Table 3: Cut selection for programmable machines.
6. CONCLUSION

We have presented the first debugging approach for programmable systems-on-chip that coordinates emulation and simulation. We have developed design-for-debugging algorithms for code instrumenting with cut I/O instructions and an optimization methodology for efficient cut scheduling of a set of ASICs on a shared bus. The effectiveness of the approach is demonstrated on a set of programmable and ASIC multi-core designs where full system observability and controllability have been enabled with low hardware and performance overhead.
