We have fabricated a chip multiprocessor prototype, code-named Merlot, to prove our novel speculative multithreading architecture.
INTRODUCTION
The rapid progress of LSI technology has realized the system on chip (SoC), which integrates embedded processors, devices, and memories. SoC enables small, low-cost, and low-power consumer equipment. LSI designers, however, now encounter design and verification problems because of the complexity of system-level integration and the requirement for shorter design time.
We have been researching a chip-multiprocessor architecture targeting higher performance with lower power consumption. We have proposed the FOPE (Fork Once Parallel Execution) architecture and fabricated a prototype chip code-named Merlot.
In section 2, the target of Merlot and its architecture are briefly described. In section 3, design issues of Merlot and our solutions are discussed. Finally, in section 4, we focus on the coverage improvement of functional verification, since we think it is crucial for complex system design.
MERLOT ARCHITECTURE
SMP (Symmetrical Multiprocessor) has been widely used in high-end computer systems. In SMP systems, processors are connected to a shared memory through a coherent cache system. Thread libraries for application programming on a monolithic SMP operating system relieve the burden of parallel programming. Even with them, however, multithread programming is not easy due to race conditions in shared memory.
The parallelization scheme is depicted in Figure 1. The architectural keys are:
1. Static thread scheduling (ordered threads) controlled by several additional instructions (fork, term, thcomt, thabort, etc.).
2. Speculative execution of threads beyond control flow determination and data disambiguation.
A statically defined thread order eliminates thread-scheduling hardware. With the constraint that a thread generates another thread only once throughout its life, a unique thread order is defined among all threads. Under this constraint, data communication and synchronization are unidirectional. Speculative execution enables us to exploit larger parallelism, though we have to pay the cost of tentative storage for register and memory values. This storage also resolves the hazards (Read-After-Write, Write-After-Read, Write-After-Write) that arise from parallel execution of threads.
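The effect of the fork-once constraint on thread order can be illustrated with a minimal Python sketch. This is only a software model under our own assumptions, not Merlot's hardware mechanism; the `Thread` class and `thread_order` function are hypothetical names introduced for illustration.

```python
class Thread:
    """Hypothetical model of a thread under the fork-once constraint."""

    def __init__(self, name):
        self.name = name
        self.child = None  # the single thread this thread may fork

    def fork(self, name):
        # A thread may generate another thread only once in its life.
        if self.child is not None:
            raise RuntimeError(f"{self.name} has already forked")
        self.child = Thread(name)
        return self.child


def thread_order(root):
    """Walk the fork chain; with one fork per thread it is a linear order."""
    order = []
    t = root
    while t is not None:
        order.append(t.name)
        t = t.child
    return order


t0 = Thread("t0")
t1 = t0.fork("t1")
t2 = t1.fork("t2")
print(thread_order(t0))
```

Because every thread has at most one child, the fork relation forms a chain rather than a tree, so no scheduler is needed to decide which thread precedes which.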
Chip specifications are shown in Table 1. The die plot and block diagram of Merlot are shown in Figure 2 and Figure 3, respectively.
Figure 4 Performance Estimation of Merlot
On Merlot, the processing elements share an instruction fetch unit, a register file, and a data cache with store reservation buffers (SRBs) for faster thread creation and data communication. Because of these shared resources, we designed the whole logic from scratch, including the processing element.
Performance estimation is shown in Figure 4. 
DESIGN AND VERIFICATION OF MERLOT
The design difficulty of Merlot is shown in Table 2. With much effort spent on functional verification, we successfully ran an MPEG-2 decoder with a simple OS on the first silicon, with several software workarounds and a bug fix with FIB. However, we still think verification is the biggest issue in complex microprocessor design. Before discussing this issue in section 4, we would like to summarize the design and verification of Merlot, focusing mainly on verification.
Criteria for Design Environment
For the design environment, we considered the following aspects: 1) repeatability or automation in order to remove human errors, 2) controllability and convergence to the design target, 3) expandability of the tools, and 4) information for management aids.
Design Environment
For logic design and verification of Merlot, we have provided:
1. Online documentation: we used the Web for ease of update and accessibility.
2. Strict design rules: signal and module naming rules predefined for globally unique and standardized identifiers. Maximum current, antenna effects.
3. [...] generator reduced the source to 10% in words in 53 files. The tool also works as the front end for RTL timing tools.
4. System level modeling and reference for simulation.
System Level Modeling
We combined the Merlot RTL with a commercial PCI modeler and a memory modeler. The PCI modeler originally required handwritten test scenarios for master operation. We modified the modeler to be triggered by the program counter in RTL, so that PCI stimulus can be specified in the assembly code of the tests. To match the RTL and ILsim (Instruction Level simulator), we first extracted the positions of events that are required to be clock accurate from RTL simulation, and inserted them into ILsim. However, it was difficult to remove false mismatches because of the simplified modeling in ILsim (no cache, sequential emulation of thread execution, etc.) and the visibility of cycles in system registers. To run applications on RTL or ILsim, we provided OS stubs that delegate system calls to the host OS. For early debugging of the bus interface unit (BIU), including the SDRAM and PCI interfaces, we isolated the BIU and tested it.
Command and status codes are defined for the communication between the BIU and the core, so we manually provided the command sequences and verified the log files.
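The idea of triggering bus stimulus from the program counter can be sketched as follows. This is an illustrative Python model only; the addresses, the `PciOp` record, and the `on_pc` hook are hypothetical and do not reflect the interface of the commercial modeler we used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PciOp:
    """Hypothetical PCI master operation issued by the bus modeler."""
    kind: str      # "read" or "write"
    address: int
    data: int = 0


class PcTriggeredModeler:
    """Issues a PCI operation when the simulated core reaches a given PC."""

    def __init__(self, triggers):
        self.triggers = dict(triggers)  # PC value -> PciOp
        self.issued = []

    def on_pc(self, pc):
        # Called once per executed instruction with the current PC.
        op = self.triggers.get(pc)
        if op is not None:
            self.issued.append((pc, op))


# In a real flow the trigger table would be derived from labels placed in
# the assembly code of the test; here it is written out by hand.
modeler = PcTriggeredModeler({
    0x1008: PciOp("write", 0xC0000000, 0xDEADBEEF),
    0x1010: PciOp("read", 0xC0000000),
})
for pc in range(0x1000, 0x1018, 4):  # a straight-line instruction stream
    modeler.on_pc(pc)
print(modeler.issued)
```

Keeping the stimulus keyed to program locations means the test and its bus activity live in one source file, which is the property that made handwritten scenarios unnecessary.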
Test Pattern Generation
We held a review of the test pattern design. According to the review, we provided a mixture of directed (manually written) tests and random tests as assembly instruction sequences, as seen in [8]. For random testing, 11 generation algorithms were finally implemented. We provided parallel OS prototypes for regression tests, in which we verified operation from OS boot through character I/O by application programs. Real applications with reduced data sets are also executed on RTL. Four hundred test sets are executed as regression tests every night on Mentor ModelSim on 6 Sun workstations (Table 3).
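A constrained random generator for assembly sequences might look like the sketch below. The instruction mnemonics, register numbering, and weights are hypothetical placeholders, not Merlot's instruction set or any of the 11 algorithms mentioned above; a fixed seed is assumed so that a nightly failure can be reproduced.

```python
import random

# Hypothetical instruction templates; not Merlot's actual instruction set.
TEMPLATES = {
    "add":   "add   r{d}, r{a}, r{b}",
    "load":  "load  r{d}, 0(r{a})",
    "store": "store r{b}, 0(r{a})",
    "fork":  "fork  L{label}",
}


def generate(seed, length, weights):
    """Return a reproducible random instruction sequence.

    `weights` biases the mix toward the scenario a test plan targets,
    e.g. many loads and stores to stress memory disambiguation.
    """
    rng = random.Random(seed)  # private generator: same seed, same test
    ops = list(weights)
    seq = []
    forked = False
    for i in range(length):
        op = rng.choices(ops, weights=[weights[o] for o in ops])[0]
        if op == "fork":
            if forked:
                op = "add"  # constraint: at most one fork per thread
            else:
                forked = True
        seq.append(TEMPLATES[op].format(
            d=rng.randrange(1, 8), a=rng.randrange(1, 8),
            b=rng.randrange(1, 8), label=i))
    return seq


program = generate(seed=1, length=10,
                   weights={"add": 2, "load": 3, "store": 3, "fork": 1})
print("\n".join(program))
```

The constraint on `fork` shows the trade-off in miniature: every rule that keeps sequences meaningful also narrows the space the generator can explore.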
Design Managements
Processor design is achieved through collaboration in a big team. We think success lies in 1) timely and proper recognition of problems, and 2) information sharing in the team. In other words, when these two things are achieved, problems are almost resolved. We have found that visualization is quite useful for accomplishing the recognition and sharing, and have provided visualization tools.
The following are some examples. Figure 5 is a graph automatically updated with the results of the nightly regression tests. Figure 6 shows the timing slack in each synthesis block.
Figure 6 Timing Improvement History of Merlot, Generated from RTL Timing Tools
EXPECTATION TO VERIFICATION
As seen in the bug curve (Figure 7), functional verification took a long time, even though we ran random tests every night. Furthermore, leaps in bug detection are observed in the bug curve. The leaps occurred after we introduced new random generation algorithms. This shows that the bug detection ability of random tests rapidly saturates, because we limit the randomness both to generate meaningful instruction sequences and to aim at specific corner cases defined in the test plan. If we can improve bug detection, chip design time would be dramatically shortened.
With this motivation, we would like to discuss the coverage improvement of functional verification in this section.
Microprocessor Design Flow
Before discussing functional verification, we would like to review the design phases of a microprocessor. This is helpful for understanding the reference model and target model of a design:
1. Instruction set level (architectural or behavioral) design.
2. Pipeline level (micro-architectural) design.
3. Register transfer level (RTL) design.
4. Gate level design.
Pipeline design adds the notion of clock or pipeline. In RTL design, logic implementation is considered, and behavior is modified to simplify control logic or data paths. Gate level design is manually generated or synthesized from RTL, where actual gate mapping and fan-out trimming are performed. The logical behavior of a gate level design differs slightly from that of RTL in the implementation of redundant states and initial states.
Functional Verification Methodologies
Considering design flow, functional verification methodologies are categorized as follows:
1. Full Chip Verification
i. Running self-checking instructions (Vector Based).
ii. Comparison between instruction simulator and RTL simulator (Vector Based).
iii. Dynamic verification [9] [10] (Vector Based).
2. Sub-block Verification
i. Property checking [11] [12] (Formal).
ii. Test bed generation libraries [13] (Vector Based).
iii. Formal equivalence checking (Formal).
Since the definition of instructions is fundamental and the instruction boundary is well preserved through all the design phases, methods (1-i, 1-ii) are commonly used. In 1-ii, discrepancies are observed for some instructions where the clock cycle is observable, such as clock/performance counters, interrupts, etc. The discrepancy makes comparison in an actual system configuration, which sometimes includes an OS and asynchronous I/O events, very difficult. The discrepancy in the Merlot design is observable in Figure 5. In dynamic verification (1-iii), simple RTL models of each pipeline stage are used for reference. Since the input of the reference model can be acquired from the target, the reference model can be simple and fast enough. In this approach, when a discrepancy is observed, the reference machine rewinds the state of the target machine and corrects the target machine. As a result, we can run a test vector beyond detected bugs to detect other bugs. Property checking is effective in both coverage and simulation time. In (2-ii), pattern generation libraries or device modeling are useful for quick generation of a test bed; however, it is not easy to provide a reference vector. For the reference, logs preserved after checking the behavior by hand are recommended as golden vectors [13]. A formal equivalence checker (2-iii) is useful for verifying a gate level design against RTL when an ECO (engineering change order) is manually applied to the gate level design.
... The features B) and C) are useful for design quality management.
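The comparison between an instruction simulator and an RTL simulator, together with the cycle-visible discrepancies it suffers from, can be illustrated with a minimal Python sketch. The trace format and the register names (`cycle_counter` in particular) are assumptions made for illustration; they are not the actual ILsim or RTL log formats.

```python
# Registers whose values depend on clock cycles and therefore legitimately
# differ between an instruction-level simulator and RTL (hypothetical names).
CYCLE_VISIBLE = {"cycle_counter"}


def first_mismatch(ref_trace, rtl_trace, ignore=CYCLE_VISIBLE):
    """Compare architectural state at each instruction boundary.

    Each trace is a list of (pc, {register: value}) tuples. Returns the
    index of the first mismatch, or None if the traces agree.
    """
    for i, ((ref_pc, ref_regs), (rtl_pc, rtl_regs)) in enumerate(
            zip(ref_trace, rtl_trace)):
        if ref_pc != rtl_pc:
            return i
        for reg, value in ref_regs.items():
            if reg not in ignore and rtl_regs.get(reg) != value:
                return i
    if len(ref_trace) != len(rtl_trace):
        return min(len(ref_trace), len(rtl_trace))
    return None


ilsim = [(0x100, {"r1": 5, "cycle_counter": 1}),
         (0x104, {"r1": 7, "cycle_counter": 2})]
rtl = [(0x100, {"r1": 5, "cycle_counter": 9}),
       (0x104, {"r1": 8, "cycle_counter": 14})]
print(first_mismatch(ilsim, rtl))                # genuine mismatch in r1
print(first_mismatch(ilsim, rtl, ignore=set()))  # false mismatch on the counter
```

Masking works only while the cycle-dependent value stays in a known register; once a program branches on it, the two traces diverge for good, which is why comparison under a real OS with asynchronous events is so hard.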
Current Vector Generation Methods
Considering the easy preparation of reference vectors in MPU verification, we focus on the method using assembly instruction sequences as test vectors.
Table 4 Vector Generation Methods
Table 4 summarizes current vector generation methods and their pros and cons, referring to the item labels of a good vector given in the previous section. For a design where multiple autonomous entities interact, unexpected race conditions are created by multiple asynchronous events, so it is quite difficult to write test vectors with good coverage. Random vector generation is commonly used for detecting such cases. However, it is still difficult to achieve good coverage. Furthermore, an error detected by a random vector is hard to debug due to (C), and hard to manage due to (B). In random test generation for MPUs, constraints or heuristics have been built into the scripts for random pattern generation to improve the effectiveness of bug detection (C) or to add a self-checking feature. Though the method was mostly ad hoc, recently a verification suite, Specman [14], has been applied to some SoC designs. Specman reduces the verification cost with a test bench description language and verification IPs, including constraint-driven random generation, functional coverage analysis, and protocol checking. For MPU testing, an integrated approach is also seen in a commercial tool, Raven [15].
With emulators on FPGAs, simulation speed has become fast enough to run real programs with I/O devices and a real OS. The approach still encounters the coverage problem if we cannot get realistic test sequences. In addition, visibility of the logic status at the error is a key for debugging. In newer approaches, the design is finalized at the behavior level (C/C++ level) and automatically converted to a real chip with a high-level synthesis tool and IP libraries [16]. In behavioral simulation, we can expect much better performance for system level simulation by software.
Architecture Level Design Checking
In multiprocessor design, race conditions in the mutual interactions of processors sometimes create bugs. It is a good idea to remove such race conditions at the instruction (architecture) level definition. We think it is useful if we can apply formal methods at this level.
In IP based design, the specification is given at the architecture level, and the architecture level race check is useful for such IP based SoC design.
RTL Level Coverage Improvement
It is not difficult to write properties for basic logic components such as a sequencer, FIFO, arbiter, etc. Property checking is useful for testing these basic components. However, even when we assemble verified components into a system, the system will not be bug free, and such bugs are difficult to detect in simulation.
We expect a tool that analyzes both the specification of instructions and the RTL description, and generates instruction sequences that may create effective conditions for bug detection. It is a big challenge because a lot of logic interacts even for executing a single instruction isolated from other instructions. In a real multiprocessor system, a lot of events interact in the sequences of instructions on multiple processors.
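As a rough illustration of properties for such a basic component, the sketch below exhaustively explores a small Python model of a FIFO up to a bounded depth and checks two properties: occupancy never exceeds capacity, and data leaves in the order it entered. This is explicit bounded state exploration over a hypothetical model, standing in for a formal property checker applied to RTL; the `Fifo` class and `check_fifo` function are names introduced only for this example.

```python
from collections import deque
from itertools import product


class Fifo:
    """Hypothetical FIFO model with full/empty protection."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def push(self, value):
        if len(self.items) < self.capacity:  # reject the push when full
            self.items.append(value)
            return True
        return False

    def pop(self):
        return self.items.popleft() if self.items else None


def check_fifo(capacity, depth):
    """Try every push/pop sequence of length `depth`.

    Returns a violating operation sequence, or None if both properties hold.
    """
    for ops in product(("push", "pop"), repeat=depth):
        fifo, accepted, popped, next_value = Fifo(capacity), [], [], 0
        for op in ops:
            if op == "push":
                if fifo.push(next_value):
                    accepted.append(next_value)
                next_value += 1
            else:
                value = fifo.pop()
                if value is not None:
                    popped.append(value)
            # Property 1: occupancy never exceeds capacity.
            if len(fifo.items) > capacity:
                return ops
        # Property 2: popped data is a prefix of the accepted data (ordering).
        if popped != accepted[:len(popped)]:
            return ops
    return None


print(check_fifo(capacity=2, depth=8))
```

The check covers all 256 operation sequences of length 8 for one isolated component; the number of sequences doubles with each added step, which hints at why assembled systems remain hard to cover.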
CONCLUSIONS
In this paper, we presented our design experience of a speculative multithreading processor code-named Merlot. We have recognized the difficulty of functional verification considering the asynchronous interaction of processing elements and embedded devices, so we tried to establish a strategic verification environment. However, we still spent too much time and manpower on functional verification since our verification approach was still brute force. In this paper, we tried to examine the functional verification of multiprocessor systems, and cast expectations on theoretical approaches to test vector generation.
