Tomorrow's silicon chips will hold more transistors than most embedded system designers could possibly use under the prevalent "describe-and-synthesize" design paradigm. Many have thus re-proposed the once popular "capture-and-simulate" paradigm, wherein pre-designed Intellectual Property software and hardware components are connected and co-simulated, to reduce this gap. However, major hurdles limit this paradigm to only very high-cost embedded systems. In this paper, we describe those hurdles and present a case for a new "configure-and-execute" paradigm for mainstream embedded systems, based on the idea of deconstructing rather than constructing systems, which takes advantage of the surplus transistors in a way that can overcome the hurdles and significantly reduce time-to-market.
INTRODUCTION
Trends in silicon chip capacity indicate that next decade's chips will have massive transistor counts compared to the previous decade, implementing entire systems-on-a-chip (SOC). Unfortunately, design methods have not kept pace. The result is a "productivity gap" that leaves transistors underutilized.
Synthesis is a commonly proposed solution. In the past, the "capture-and-simulate" paradigm [4] was prevalent, wherein one captures gates or register-transfer components in a structural schematic, and then simulates to verify correctness. Recently, the "describe-and-synthesize" design paradigm has taken hold, wherein one describes desired functionality, and then automatically synthesizes a structural schematic.
The productivity gap continues to increase, though, so many are proposing a paradigm based on design reuse. It is essentially a return to a capture-and-simulate paradigm in which the components, rather than being gates, are standard and custom processors, called Intellectual Property (IP) or more specifically "cores." However, capture-and-simulate has major problems for SOCs that didn't exist previously, which will result in high SOC design costs and thus limit SOC design to high-end niches, leaving most designers still under-utilizing potential transistors.
We can, however, still make good use of all those transistors by adopting a new paradigm, which we call "configure-and-execute." An embedded system designer acquires a silicon reference design for a particular application class (e.g., still video processing), and then develops his/her specific application (e.g., a digital camera) by repeatedly configuring the design and executing it in its real environment. The silicon reference design is an over-designed general-purpose working silicon SOC intended for the application class, with the SOC design costs amortized over numerous applications. When done developing, the embedded system designer can optionally spin specialized silicon, which is virtually guaranteed to be correct the first time because of the extensive in-circuit verification already done.

This paper summarizes current trends, describes problems with applying a capture-and-simulate paradigm for SOCs, introduces the configure-and-execute paradigm and argues for its use in mainstream embedded system design, and provides preliminary data for a digital camera example, developed as part of the Dalton project at UC Riverside, supporting those arguments.

CODES '99, Rome, Italy. Copyright ACM 1999 1-58113-132-1/99/05 ... $5.00
TRENDS
The driving trend is the existence of a huge number of transistors on chips in the near future. By 2006, an inexpensive 200 mm² die would have 75M logic transistors, or 375M SRAM transistors [8]; a high-end 1000 mm² die would have 400M logic transistors [18]. To illustrate the luxury of this silicon real estate, consider the following transistor counts for various cores [5]:
486DX4 core: 0.7 million
Pentium/MMX: 2.8 million
MPEG-2 encoder/decoder: 1.5/0.5 million
8051 microcontroller: 0.05 million

In other words, we could fit up to 7500 8051 microcontrollers on a single chip. Assuming a typical core has 0.5 million transistors, we could fit between 150 and 750 cores on a single chip.
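The core counts above follow from dividing a transistor budget by a core size. The short Python sketch below reproduces that arithmetic. It assumes, as the numbers suggest, that the upper figures (7500 and 750) come from the 375M SRAM-density budget and the lower figure (150) from the 75M logic budget; the function and constant names are illustrative only.

```python
# Transistor budgets, in thousands of transistors, taken from the figures in the text.
LOGIC_BUDGET_K = 75_000    # 75M logic transistors on the inexpensive die
SRAM_BUDGET_K = 375_000    # 375M SRAM transistors on the same die

# Core sizes, in thousands of transistors.
CORE_8051_K = 50           # 0.05 million transistors
TYPICAL_CORE_K = 500       # 0.5 million transistors (the "typical core" in the text)


def cores_that_fit(budget_k: int, core_size_k: int) -> int:
    """Return how many whole cores of the given size fit in the transistor budget."""
    return budget_k // core_size_k


if __name__ == "__main__":
    print("8051 cores:", cores_that_fit(SRAM_BUDGET_K, CORE_8051_K))
    print("typical cores (logic budget):", cores_that_fit(LOGIC_BUDGET_K, TYPICAL_CORE_K))
    print("typical cores (SRAM budget):", cores_that_fit(SRAM_BUDGET_K, TYPICAL_CORE_K))
```

The sketch ignores interconnect, memory, and pad overhead, so the results are upper bounds in the same spirit as the estimate in the text.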
Designers simply cannot build such complex systems under the cost and time constraints of most applications. The Semiconductor Industry Association (SIA) states that while transistors per chip have increased 58% per year, a designer's ability to use transistors has improved only 21% per year, leading to a widening productivity gap. This inability to use transistors has led to most systems not utilizing potential transistors: data from a major silicon vendor indicates that the actual number of transistors per design start has increased only 25% per year [13].
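To see how quickly two different growth rates diverge, the sketch below compounds the SIA figures quoted above (58% capacity growth versus 21% productivity growth per year). It is only a simple constant-rate compounding model, with illustrative names, and is not claimed to be how any projection cited in this paper was derived.

```python
def relative_gap(years: int,
                 capacity_growth: float = 0.58,
                 productivity_growth: float = 0.21) -> float:
    """Factor by which available transistors outgrow usable transistors.

    Assumes both quantities grow at constant annual rates and that the
    gap is normalized to 1.0 in year 0.
    """
    return ((1.0 + capacity_growth) / (1.0 + productivity_growth)) ** years


if __name__ == "__main__":
    for n in (1, 5, 10):
        print(f"after {n:2d} years the gap has grown by a factor of {relative_gap(n):.2f}")
```

Under these assumptions the gap widens by roughly 30% every year, which is why even modest differences in annual rates matter over a decade.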
CURRENT PROBLEMS
The industry has generally recognized that transistors are underutilized and many seem to agree that synthesis and IP reuse will help solve the problem [18] [20]. However, we argue that even these techniques will not enable designers to really utilize the large number of available transistors, because cost and time requirements would still be prohibitive for most applications.
Synthesis follows the "describe-and-synthesize" paradigm. One first describes desired functionality in a program-like language, and then synthesizes a structural implementation. A problem is that describing functionality is error-prone: some data suggests about 1 bug per 100 lines of code, independent of code size [1].
A similar problem has faced the software community for decades.
Therefore, synthesis will almost certainly need to be supplemented with IP reuse. However, the commonly proposed paradigm of connecting IP together, essentially representing a return to a capture-and-simulate paradigm at a higher level of abstraction than before, has major problems of its own.
Verification --the main problem
The main problem in building SOCs by connecting IP relates to verification of the correctness and completeness of a design's functionality. Some state that verification accounts for 1/3 to 1/2 of the system design process [18]; others say it accounts for 2/3 [3][14]. Thus we see that verification is a bottleneck in system design, and the idea of building systems by connecting IP does little to address this problem.
Simulation
Simulation is probably the most common verification technique.
It is straightforward for systems with thousands of transistors, but very difficult to perform well for million-transistor systems.
The main problem with simulation is that simulating a reasonable amount of real time requires an absurdly long amount of simulation time. For example, 100 seconds of real time for a million-transistor design requires 10+ years of RTL simulation [9] [13] (cycle-based simulators help only incrementally). As a result, designers typically simulate less than one second of real time. Cosimulation techniques gain perhaps an order of magnitude, which is not enough.
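The figures above imply a concrete slowdown factor. The sketch below derives it from the quoted example (100 seconds of real time costing 10 years of simulation) and then estimates the wall-clock cost of simulating a single second. It assumes a 365-day year and exactly 10 years, whereas the text says "10+", so the results are lower bounds; all names are illustrative.

```python
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumption: 365-day year


def slowdown_factor(real_seconds: float, simulation_years: float) -> float:
    """Simulation seconds needed per second of real time."""
    return simulation_years * SECONDS_PER_YEAR / real_seconds


def simulation_days(real_seconds: float, slowdown: float) -> float:
    """Wall-clock days needed to simulate the given amount of real time."""
    return real_seconds * slowdown / SECONDS_PER_DAY


if __name__ == "__main__":
    factor = slowdown_factor(real_seconds=100, simulation_years=10)
    print(f"slowdown factor: {factor:,.0f}x")
    print(f"days to simulate 1 s of real time: {simulation_days(1, factor):.1f}")
    # An order-of-magnitude gain (e.g., from cosimulation) still leaves days per second.
    print(f"with a 10x speedup: {simulation_days(1, factor / 10):.2f} days")
```

A slowdown in the millions means that even one second of real time costs over a month of simulation, which is consistent with designers simulating less than a second in practice.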
A second problem with simulation is that developing the system's environment for simulation (i.e., the testbench) can take enormous effort, since multi-million transistor systems often have very complex environments, unlike simpler multi-thousand transistor systems. Some claim that testbench development is the real bottleneck in verification [3]. Since most of this time is spent by the verification team trying to understand the environment and the system, there is not much potential for time reduction, and verification productivity tools like random test generators provide only incremental improvements.
A third problem is that environments often have undocumented features, which obviously can't be captured in a testbench no matter how much effort is expended.
We can conclude that solving these three problems requires at-speed (or nearly at-speed), in-circuit verification techniques.
Emulation
Emulation is commonly used to provide nearly at-speed verification. It usually consists of the use of a general-purpose FPGA (Field Programmable Gate Array) platform along with a microprocessor and other components configured to act like the system being designed. However, emulation has problems too.
The first problem is that emulation requires many weeks, often over a month, to set up [3].
The second problem is that compile times are long for large designs, often lasting almost a full day, and thus preventing frequent iteration among design and verification.
A third problem is that general-purpose emulators are expensive, ranging from $100,000 to $1,000,000, thus limiting their use to high-end applications, and sometimes creating a new bottleneck of different design teams having to share a single emulator.
A fourth problem is that emulators may still run 10 to 100 times slower than the eventual system [13], which may prevent proper functioning in the system's environment.
Silicon spins
To really begin verifying a system in its environment, we often require first silicon (i.e., a semi-custom chip) to be generated. Because of the above problems with simulation and emulation, first silicon usually still contains bugs, even after extensive simulation and emulation [13][16]. Each silicon spin can take months, and a recent study showed that the average number of spins required to verify a design was 3.5 [16], resulting in over 50% of development time occurring after first silicon.
Other problems
Test checks manufactured chips for defects. Testing costs are projected to increase [2]: the cost per automated test equipment divided by the number of chips it tests per hour rose from $600 in 1985 to $6000 in 1998, and is estimated to rise to $100,000 in 2005. Furthermore, the test cost per transistor is also increasing (while design costs per transistor are decreasing).
Integrating cores can be a very difficult task. Even when cores have been designed to interface to the same bus, detailed timing problems can often arise, as well as load problems. Furthermore, integrating cores often requires a good understanding of each core's functionality, which can take much time. Finally, there is a problem of "compounded risks" [13]: for example, if there is a 98% probability of successfully integrating a single core into an SOC, then the probability of successfully integrating 100 cores is only 13%.
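The compounded-risk figure can be checked directly. The sketch below assumes that each core integrates successfully with the same probability and independently of the others, which is the simplest reading of the example; the function names are illustrative.

```python
def integration_success(p_single: float, n_cores: int) -> float:
    """Probability that all n_cores integrate successfully.

    Assumes identical, independent per-core success probabilities.
    """
    return p_single ** n_cores


def required_per_core(p_overall: float, n_cores: int) -> float:
    """Per-core success probability needed to reach a target overall probability."""
    return p_overall ** (1.0 / n_cores)


if __name__ == "__main__":
    print(f"100 cores at 98% each: {integration_success(0.98, 100):.1%}")
    print(f"per-core probability needed for 90% overall: {required_per_core(0.90, 100):.4f}")
```

Under the same independence assumption, the second function shows the other side of the problem: reaching a high overall success probability with 100 cores requires each individual integration to be very nearly certain.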
Physical design has become "really, really, really hard" [17] in deep submicron technology. Chip design must be integrated with physical design, and designs must be created to ensure high yields. Custom techniques are becoming more necessary.
Time-to-market constraints continue to shrink, with average time from product conception to delivery reduced to 8 months [11], with further reduction likely. Such crushing constraints greatly increase the need for correct silicon on the first spin.
Summary of problems
Building SOCs, even with extensive IP reuse, is time-consuming and costly, because simulation covers a small fraction of real time and can't address undocumented environment features, emulation requires much time and is expensive, and silicon spins require months. Furthermore, testing costs, integration and physical design problems, and tight time-to-market constraints add to the difficulties in building SOCs. Large SOCs will therefore be limited to a very small percentage of designs [5]. Mainstream embedded system designers thus will build systems that greatly underutilize potential transistors. Even today, potential transistors are underutilized by a factor of ten [13]. This underutilization will likely get worse as chip capacity grows and the productivity gap widens; if the current gap continues its rate of growth, designers in 2006 will underutilize potential transistors by a factor of 68, and in 2012 by a factor of 287.
CONFIGURE-AND-EXECUTE
Since most designers can't build systems that take advantage of potential transistors, we can instead use a different paradigm that makes use of those transistors to greatly reduce time-to-market. The paradigm is based on two points. First, designs for different products within the same application class have similar hardware architectures. Second, deconstructing (configuring and adding to or deleting from) an existing design that subsumes one's desired design is much easier than constructing a design from scratch (as evidenced by many designers starting from past designs in practice). These points lead us to a "configure-and-execute" paradigm, involving providers and users of reference designs.
A reference design provider is a company with extensive SOC expertise that builds a silicon chip, implementing a reference design, for a given application class. A reference design is an over-designed working system for an application class, containing more cores than would likely ever be needed by any particular application, and built to be easily configured and modified to implement most instances of applications of its class. The discussion of the earlier sections demonstrated that there is plenty of room on the chip for such extra cores. The high cost of building a reference design could be amortized over a large number of applications in the class. Reference design builders would likely make extensive use of advanced synthesis, verification, testing, and physical-design techniques.
A reference design user is a typical embedded system designer, who acquires a reference design and then configures it for a specific application. The user thus has working silicon from the very start of the design process, running at real speeds and executing in a real environment. If the user decides to generate specialized silicon for a given configuration in order to reduce chip costs, power, size, etc., then first-time correct silicon is quite likely since extensive verification in the system's real environment has already been performed. Figure 1 illustrates the time-to-market advantage gained by the user. Whereas a simulation-based paradigm typically requires respins, the configure-and-execute paradigm is carried out on real silicon, and thus respins are very unlikely.
Reference designs -- "fig chips"
As described above, a key aspect of configure-and-execute is the existence of configurable reference design chips, or what we refer to as "fig chips." A fig chip is a fabricated

In contrast, product-oriented fig chips are intended for use in final products. Though still more general than a particular application instance, they would be small, cheap, and consume a small enough amount of power to be used in final products. In fact, because of their production in large quantities, and potential for optimized configuration, their power consumption could be even lower than obtainable by a typical custom SOC designer.
An example of a fig chip for control systems (e.g., closed-loop automatic control) would include a microprocessor, digital signal processor, cache, memory, direct-memory access controller, and a two-level bus (processor-local and peripheral). The peripheral bus would link a large number of cores specific to control systems, such as microcontrollers, numerous analog-digital converters, pulse-width modulators, counters, timers, serial communication devices (UARTs), and numerous blocks of field-programmable logic. In this case, the ideal core would be specifically optimized for the programmable logic, typically resulting in only a factor of 10 slowdown or less [21] compared to a core implemented as an ASIC (application-specific integrated circuit).
Otherwise, the core could be synthesized and mapped to the logic. The second purpose is to implement custom logic, which is estimated by Dataquest to occupy only about 10% of an SOC, with the remaining 90% of the SOC being made up of cores [6].
A key aspect of a fig chip is that it is designed for development and debugging, implying that its internal registers should be controllable and observable, and step-by-step execution should be supported. Fortunately, scan technology can be used to provide such features [13]. Fig chips would thus come with debug environments providing software control of the execution of the system from a development workstation or PC.
It is important to point out that a fig chip would be a complete working system. This means that any operating systems would be pre-installed, all drivers for controlling peripherals would be included, and template software would be running on the microprocessor exercising these items.
Execute
The designer has working silicon from the start of the design process. The silicon may run at speed, or perhaps nearly at-speed if numerous FPGA cores were used and result in a system slowdown compared to an ASIC design. The designer makes numerous iterations among configuring the design and executing it, thus supporting the spiral model of development.
Specialize
When using a prototype-oriented fig chip, the designer will want to eventually generate a specialized chip for use in the final product. This specialized chip is not custom, but rather a subset of the original fig chip. Selected parameters during configuration become "hardcoded" in the specialized chip. For example, the specialized chip may have a smaller cache with a particular associativity, a smaller bus with bus-invert, omitted or added cores, and programmable logic possibly replaced by custom logic. Specialization will differ depending on the relative importance of optimizing power, performance, size, cost, etc.
Because the system is extensively verified in its real environment during development, first-pass correct silicon is very likely.
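The idea that a specialized chip is a hardcoded subset of the fig chip can be sketched as a simple data structure. The Python below is purely illustrative: the class, its fields, the core names, and all numeric values are hypothetical and do not describe any actual fig chip or tool. It models only the "subset" aspect (smaller cache, narrower bus, omitted cores) and does not model added cores or the replacement of programmable logic by custom logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChipConfig:
    """Hypothetical configuration parameters; frozen to mimic 'hardcoded' values."""
    cache_kbytes: int
    cache_ways: int
    bus_width_bits: int
    bus_invert: bool          # for a reference design: whether bus-invert is available
    cores: frozenset          # names of the cores present (illustrative)


def is_specialization(reference: ChipConfig, chosen: ChipConfig) -> bool:
    """True if `chosen` asks for no more than `reference` provides."""
    return (chosen.cache_kbytes <= reference.cache_kbytes
            and chosen.cache_ways <= reference.cache_ways
            and chosen.bus_width_bits <= reference.bus_width_bits
            and (reference.bus_invert or not chosen.bus_invert)
            and chosen.cores <= reference.cores)


# Hypothetical reference design and one configuration chosen during development.
reference = ChipConfig(32, 4, 32, True,
                       frozenset({"cpu", "dma", "uart", "8051", "fpga"}))
camera = ChipConfig(8, 2, 16, True,
                    frozenset({"cpu", "uart", "fpga"}))

if __name__ == "__main__":
    print("camera is a valid specialization:", is_specialization(reference, camera))
```

Freezing the dataclass is a small design choice meant to mirror the text: once parameters are selected during configuration, they no longer change in the specialized chip.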
RELATED WORK
Many of the concepts in this paper are based on the "Rapid Silicon Prototyping" concepts described by Payne [13] of VLSI Technology Inc., which has already built reference design chips for several application classes [19], and some tools for specialization, such as a cache configuration tool. Reference designs for various application classes are beginning to appear, such as customizable networking chips.

Given this reference design, we then set out to build a digital camera system. The digital camera requires a CCD preprocessor core, not already part of the reference design, for image capture, thus requiring use of the FPGA. We could capture the remaining camera functionality using the existing cores, with the 8051 and DMA cores unused, as shown in Figure 2. Thus, design consisted of modifying the existing reference design software to implement the digital camera application, and of describing and synthesizing the CCD core onto the FPGA and integrating it with the other cores. Table 2 shows that the CCD core was designed in a couple of weeks and required another couple of weeks to integrate. The complete example is available at [22].
CONCLUSIONS
The current situation and trends indicate that designers will not be able to utilize potential transistors, and even an IP-based reuse paradigm is not likely to bridge the gap because of major problems related to simulation time, emulation time and expense, repeated silicon spins, and other problems. Rather than trying to enable designers to use all those transistors for their system's functionality, we propose instead to use them to reduce time-to-market, by using a "configure-and-execute" paradigm. Extensive research into the design, parameterization, configuration, and debugging of reference designs is thus needed, as well as techniques for optimization of configuration parameters for specialized silicon generation.
