To remain competitive, system-on-chip (SoC) designers must keep pace with silicon technology's rapid evolution. New communication, consumer, and computer product designs must exhibit rapid increases in functionality, reliability, and bandwidth, along with rapid declines in cost and power consumption.
All these improvements dictate increasing use of high-integration silicon, in which designers traditionally use register-transfer-level (RTL) hardware to realize data-intensive capabilities. Three forces (the design productivity gap, the growing cost of nanometer-level semiconductor manufacturing, and the global time-to-market imperative) put intense pressure on chip designers to develop more complex systems ever more quickly and cheaply.
One approach to speeding development of megagate SoCs uses multiple microprocessor cores to perform much of the processing currently relegated to RTL techniques. Although general-purpose embedded processors can handle many tasks, they often lack the bandwidth needed to perform particularly complex jobs such as audio and video processing. Hence the historic rise of RTL use in SoC design.
Developers can configure a new class of processor (automatically generated extensible microprocessor cores such as Tensilica's Xtensa, or user-modifiable cores such as MIPS Technologies' M4K) to bring the required amount and type of processing bandwidth to bear on many embedded tasks. Because these configurable processors employ firmware instead of RTL-defined hardware for their control algorithm, designers can develop and verify processor-based task engines for many embedded SoC tasks more quickly and easily than they could develop and verify RTL-based hardware blocks that perform the same tasks.
SOC DESIGN CHALLENGES
A few characteristics of typical deep-submicron integrated circuit (IC) design illustrate the challenge facing SoC design teams:
• In a generic 0.13-micron standard-cell foundry process, silicon density routinely exceeds 100,000 usable gates per square millimeter. Consequently, a low-cost chip, one with a core area of 50 square millimeters, can carry 5 million logic gates. Simply because it's possible, a system designer somewhere will find a way to exploit this immense computational potential in any given market.
• In the past, silicon capacity and design-automation tools limited the practical size of an RTL block to fewer than 100,000 gates. Improved synthesis, place-and-route, and verification tools have raised that ceiling. Blocks of 500,000 gates are now within the capacity of these tools, but existing design and verification methods are not keeping pace with silicon fabrication capacity, which can now put millions of gates on an SoC.
• The design complexity of a typical logic block grows more rapidly than its gate count, and system complexity increases more rapidly than the number of constituent blocks. Verification complexity has also increased disproportionately with gate count. Consequently, many teams that have recently developed real-world designs report that they now spend as much as 90 percent of their development effort on block- or system-level verification.
Chris Rowen, Tensilica
By using extensible processors, designers can develop and verify task engines for many embedded system-on-chip tasks more quickly than by using the traditional RTL-defined hardware design approach.
• The cost of a design bug is increasing. Industry analysts make much of the rising cost of deep-submicron IC masks: The cost of a full mask set approaches $1 million. However, mask charges represent just the tip of the iceberg with respect to design-bug costs. The risk of bugs compounds the costs. The combination of larger teams required to create complex SoC designs, higher staff costs, bigger nonrecurring engineering fees, and lost profitability and market share makes show-stopper design bugs intolerable. SoC design bugs can literally kill a company. As a result, design methods that reduce the occurrence of such show-stoppers, or permit painless workarounds for them, pay for themselves rapidly.
• All embedded systems now contain significant amounts of software. Software integration is typically the last step in the system-development process, and this step is routinely blamed for overall program delays. Analysts widely view earlier and faster hardware and software validation as a critical risk-reducer for new product development projects.
• Standard communication protocols are rapidly increasing in complexity. The need to conserve scarce communications spectrum, plus the inventiveness of modern protocol designers, has resulted in the creation of complex new standards such as IPv6 for packet forwarding, G.729 voice coding, JPEG2000 image compression, MPEG-4 video, and Rijndael AES encryption. These new protocols coupled with rising communication bit rates demand much greater computational throughput than their predecessors.
Competitive pressures have pushed development of the next-generation SoC, characterized by dozens of functions working together. Such designs illustrate the trend toward using many RTL-based logic blocks and mixing control and digital signal processors together on the same chip.
This ceaseless growth in integrated circuit complexity poses a central dilemma for SoC design. If developers could implement all these logic functions with multiple cheap, fast, and efficient heterogeneous processor blocks, a processor-based design approach would be ideal because using predesigned and preverified processor cores for an SoC's individual functional blocks moves the design effort largely to the coding of several relatively small software blocks. This approach to SoC design permits bug fixes in minutes instead of months because changing and verifying software is much easier than altering RTL hardware, especially if the code is stored in on-chip RAM. Unfortunately, for the most computationally demanding problems, general-purpose processor cores fall far short with respect to application throughput, cost, and power efficiency.
At the same time, designing the custom RTL logic for complex functions and emerging standards takes too long, and, once designed, the logic is too rigid to change easily. A closer look at the makeup of the typical RTL block, shown in Figure 1a, gives insight into this paradox, while Figure 1b shows a more flexible alternative.
In most RTL designs, the data path consumes the vast majority of the logic block's gates. A typical data path can be as narrow as 16 or 32 bits or it can be hundreds of bits wide. The data path's width is generally sized to the task at hand. A data path typically contains many data registers, representing intermediate computational states, and often has significant blocks of RAM, or interfaces to RAM, that it shares with other RTL blocks. These basic data path structures reflect the data's nature and are largely independent of the finer details of the specific algorithm that operates on that data.
By contrast, the RTL logic block's finite state machine contains nothing but control details. This RTL block subsystem captures all the nuances of sequencing data through the data path, all exception and error conditions, and all handshakes with other blocks. The state machine may consume only a few percent of the block's gate count, but it embodies most of the design and verification risk due to its complexity. If developers make a late design change in an RTL block, the change is more likely to affect the state machine than the data path's structure, heightening the design risk.
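The split between data path and state machine can be made concrete with a small behavioral model. The following is only an illustrative sketch, not taken from any real design: the `ChecksumBlock` class, its four states, and the 16-bit accumulator are all assumptions chosen to keep the example small. The registers stand in for the data path, while the `clock` method stands in for the hardwired control logic.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LOAD = auto()
    ACCUMULATE = auto()
    DONE = auto()


class ChecksumBlock:
    """Toy model of an RTL block: a small data path plus a hardwired state machine."""

    def __init__(self, words):
        self.words = list(words)   # stands in for a RAM the block reads from
        self.index = 0             # data path: address register
        self.operand = 0           # data path: operand register
        self.total = 0             # data path: 16-bit accumulator
        self.state = State.IDLE    # control: current state of the state machine

    def clock(self, start=False):
        """Advance one clock cycle; the sequencing is fixed by this transition logic."""
        if self.state is State.IDLE:
            if start:
                self.index, self.total = 0, 0
                self.state = State.LOAD if self.words else State.DONE
        elif self.state is State.LOAD:
            self.operand = self.words[self.index]
            self.state = State.ACCUMULATE
        elif self.state is State.ACCUMULATE:
            self.total = (self.total + self.operand) & 0xFFFF
            self.index += 1
            more = self.index < len(self.words)
            self.state = State.LOAD if more else State.DONE
        # DONE holds its state until the block is rebuilt


def run_block(words):
    """Start the block and clock it until it reports DONE; return (sum, cycles)."""
    block = ChecksumBlock(words)
    block.clock(start=True)
    cycles = 1
    while block.state is not State.DONE:
        block.clock()
        cycles += 1
    return block.total, cycles
```

In this toy model the registers and the adder would barely change if the algorithm were revised, whereas any change to sequencing, such as a new error condition or handshake, would land in the transition logic inside `clock`.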
Configurable, extensible processors, a fundamentally new form of microprocessor, provide a way of reducing the risk of state-machine design by replacing state machine logic blocks that are hard to design and verify with predesigned, preverified processor cores and application firmware.
CONFIGURABLE PROCESSORS
Rapidly increasing logic complexity and technology scaling, as characterized by Moore's law, make multimillion-gate designs feasible. Fierce product competition in system features and capabilities generates demand for these advanced silicon designs. A well-recognized SoC design gap, which lies between the growth in chip complexity and productivity growth in logic design tools, widens every year, as Figure 2 shows. Moreover, market trends favoring high-performance, low-power systems (such as long-battery-life cell phones, 4-megapixel digital cameras, fast and inexpensive color printers, high-definition digital televisions, and 3D video games) also increase the number of SoC designs. Unless something closes this design gap, it will soon become impossible to bring enhanced versions of these system designs to market.
As Figure 3 shows, the conventional SoC design model closely follows that of its predecessor: the board-level combination of a standard microprocessor, memory, and logic built as application-specific integrated circuits. Board-level, chip-to-chip interconnect is expensive and slow, so board-level designs typically use shared buses and narrow data paths, often only 32 bits wide. Designers frequently carry these relatively limited buses over to SoC designs because this approach provides the easiest solution to crafting an SoC architecture: Just reuse what's been done before.
Combining all these system components on a single piece of silicon increases maximum achievable clock frequency and decreases power dissipation relative to the equivalent board-level design. System reliability and cost often improve as well. These benefits alone can justify investment in SoC design. However, the shift to SoC integration does not automatically change a design's organization or architecture. Thus, the architecture of these chips typically inherits the assumptions, limitations, and tradeoffs of board-level design.
SOC INTEGRATION
The origins and evolution of microprocessors further constrain their use in traditional SoC design. Most popular embedded microprocessors, especially the 32-bit architectures, descend directly from 1980s desktop computer architectures such as ARM, MIPS, 68000/ColdFire, PowerPC, and x86. Designed to serve general-purpose applications, these processors typically support only the most generic data types, such as 8-, 16-, and 32-bit integers. Likewise, they support only the most common operations, such as integer load, store, add, shift, compare, and bitwise logical operations.
Their general-purpose nature makes these processors well suited to the diverse mix of applications run on computer systems. Their architectures perform equally well when running databases, spreadsheets, PC games, and desktop publishing. However, all these processors suffer from a common bottleneck: Their need for complete generality dictates their ability to execute an arbitrary sequence of primitive instructions on an unknown range of data types. Put another way, general-purpose processors are not optimized to deal with the specific data types of any given embedded task, which results in inefficiencies.
Compared to general-purpose computer systems, embedded systems comprise a more diverse group and individually show more specialization. A digital camera must perform a variety of complex image processing tasks, but it never executes SQL database queries. A network switch must handle complex communications protocols at optical interconnect speeds, but it doesn't need to process 3D graphics.
The specialized nature of individual embedded applications creates two issues for general-purpose processors in data-intensive embedded applications. First, the critical functions of many embedded applications and a processor's basic integer instruction set and register file are a poor match. Because of this mismatch, critical embedded applications require more computation cycles when they run on general-purpose processors.
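A rough illustration of this mismatch: adding four packed 8-bit pixel values with saturation, using only the generic shift, mask, add, compare, and OR operations of a basic 32-bit integer instruction set. The function name and the per-lane operation tally below are assumptions made for illustration; real cycle counts depend on the processor and compiler, and loop overhead is ignored.

```python
def saturating_add_u8x4_generic(a, b):
    """Saturating add of four packed unsigned bytes using only generic integer ops.

    Returns (result, op_count). op_count is a rough tally of the primitive
    operations a basic integer instruction set would issue, nine per lane.
    """
    result = 0
    ops = 0
    for lane in range(4):
        shift = 8 * lane
        x = (a >> shift) & 0xFF          # shift + mask
        y = (b >> shift) & 0xFF          # shift + mask
        s = x + y                        # add
        s = 0xFF if s > 0xFF else s      # compare + select
        result |= s << shift             # shift + or
        ops += 9
    return result, ops


packed, op_count = saturating_add_u8x4_generic(0x01FF10F0, 0x01011020)
print(hex(packed), op_count)
```

Under these assumptions the generic instruction set spends dozens of primitive operations on work that a data path built for packed pixel arithmetic could plausibly perform as a single operation.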
Second, more focused embedded devices cannot take full advantage of a general-purpose processor's broad capabilities. Expensive silicon resources built into the processor go to waste because the specific embedded task that's assigned to the processor doesn't need them.
Many embedded systems interact closely with the real world or communicate complex data at high rates. A hypothetical general-purpose microprocessor running at tremendous speed could perform these data-intensive tasks. This is the basic assumption behind the use of multi-GHz processors in today's PCs: Throw a fast enough processor at a problem, regardless of the cost in dollars or power dissipation, and you can solve any problem. For many embedded tasks, however, no such processor exists today as a practical alternative because the fastest available processors typically cost orders of magnitude too much and dissipate orders of magnitude too much power to meet embedded-system design goals. Instead, embedded-system hardware designers have traditionally turned to hardwired circuits to perform these data-intensive functions.
In the past 10 years, the wide availability of logic synthesis and ASIC design tools has made RTL design the standard for hardware developers. Reasonably efficient compared to custom transistor-level circuit design, RTL-based design can effectively exploit the intrinsic parallelism of many data-intensive problems. RTL design methods can often achieve tens or hundreds of times the performance a general-purpose processor achieves.
EXTENSIBLE PROCESSORS
Like RTL-based design using logic synthesis, extensible-processor technology enables the design of high-speed logic blocks tailored to a specific task. The two technologies differ in that RTL designers realize both specialized data paths and the control state machines in hardware, but when building logic blocks with extensible processors, designers can create optimized data paths in hardware while implementing the control functions entirely in firmware.
A fully featured, configurable, and extensible processor consists of a processor design and the design tool environment for configuring that processor. This environment permits significant adaptation of the base processor design by letting a system designer change major processor functions, thus tuning the processor to specific application requirements. Typical configurability forms include additions, deletions, and modifications to memories, to external bus widths and handshake protocols, and to commonly used processor peripherals. An important superset of configurable processors, the extensible processor, lets the application developer extend the processor's instruction set and include features that the processor's original designers never considered or imagined.
Extensible processors as RTL alternatives
Hardwired RTL design has many attractive characteristics including small area, low power, and high throughput. However, RTL technology's liabilities (difficult design, slow verification, and poor scalability to complex problems) have begun to overshadow its benefits now that designers work with millions of gates. A design methodology that retains most RTL efficiency benefits but reduces design time and risk has a natural appeal. Replacing complex RTL designs with application-specific processors can achieve this goal.
An application-specific processor can implement data path operations that closely match those of RTL functions. A chip architect can implement the equivalent of RTL data paths using the base processor's integer pipeline, plus additional execution units, registers, and other functions tailored to a specific application.
In one example of defining such a processor using TIE (the Tensilica Instruction Extension language, a variant of Verilog), designers optimize a processor for high-level specification of data path functions in the form of instruction semantics and encoding. More concise than an RTL description, a TIE description omits all sequential logic, including state-machine descriptions, pipeline registers, and initialization sequences. The firmware programmer has access to the new processor instructions and registers described in TIE via the same compiler and assembler that employ the processor's base instructions and register set. Firmware uses the processor's normal instruction fetch, decode, and execution mechanisms to control all operation sequencing within the processor's data paths. Developers use a high-level language such as C or C++ to write this firmware.
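The division of labor described here can be sketched with a behavioral model. The code below is not TIE and does not reflect Tensilica's tools; it is a Python stand-in in which `sad4` is a hypothetical added instruction, written as a pure function of its operands, and `block_distance` plays the role of firmware that would normally be written in C or C++.

```python
def sad4(a, b):
    """Hypothetical extension instruction: sum of absolute differences
    of four packed unsigned bytes.

    Written as a pure function of its operands to mirror a description that
    gives instruction semantics only, with no state machine or pipeline
    registers.
    """
    total = 0
    for lane in range(4):
        x = (a >> (8 * lane)) & 0xFF
        y = (b >> (8 * lane)) & 0xFF
        total += abs(x - y)
    return total


def block_distance(block_a, block_b):
    """Firmware-style routine: ordinary loop control around the added instruction."""
    distance = 0
    for word_a, word_b in zip(block_a, block_b):
        distance += sad4(word_a, word_b)
    return distance
```

In this sketch the data path operation carries no notion of sequencing; which words it sees, and in what order, is decided entirely by the surrounding loop.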
Extended processors used as RTL-block replacements routinely have the same structures as traditional data-path-intensive RTL blocks: deep pipelines, parallel execution units, problem-specific state registers, and wide data paths to local and global memories. These extended processors can sustain the same high computation throughput and support the same low-level data interfaces as typical RTL designs. The control of extended-processor data paths works very differently, however. Instead of hardwired state machines, processor-based task engines use firmware for data path control.
Explicit control scheme
With firmware-controlled state transitions, designers do not fix the cycle-by-cycle control of the processor's data paths. Instead, they make the sequence of operations explicit in the firmware executed by the processor, as Figure 1b shows. The processor makes control-flow decisions explicitly in branches, makes memory references explicit in load and store operations, and makes sequences of computations explicit in sequences of general-purpose and application-specific computational operations.
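As a minimal sketch of what explicit control looks like, the routine below scales samples held in a memory array and clips any that overflow. The function name, its parameters, and the flat list standing in for memory are illustrative assumptions, and Python is used here only as a stand-in for C firmware.

```python
def scale_samples(memory, src, dst, count, gain, limit=0x7FFF):
    """Firmware-style task in which each load, branch, computation,
    and store appears as an explicit step.

    Returns the number of samples that had to be clipped to `limit`.
    """
    clipped = 0
    i = 0
    while i < count:                  # branch: loop control
        sample = memory[src + i]      # load
        value = sample * gain         # computation
        if value > limit:             # branch: exception condition
            value = limit
            clipped += 1
        memory[dst + i] = value       # store
        i += 1
    return clipped


mem = [100, 20000, 3, 0, 0, 0]
print(scale_samples(mem, src=0, dst=3, count=3, gain=2), mem)
```

Handling a new exception case in this model means adding another branch to the loop body, an edit to program text, whereas the equivalent change to a hardwired controller would alter its transition logic.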
This design migration from hardwired state machine to firmware program control has the following implications:
• Flexibility.
Despite their attractions, application-specific processors may not be the best choice for all block designs. Consider three exceptions:
• Small, fixed-state machines. Some logic tasks are too trivial to warrant a processor. Bit-serial engines such as simple universal asynchronous receiver-transmitters fall into this category.
• Simple data buffering. Similarly, some logic tasks amount to no more than storage control. Memory operations within a processor can emulate a first-in, first-out controller built with random-access memory and some wrapper logic, but a basic FIFO is faster and simpler.
• Very deep pipelines. Some computation problems have so much regularity and so little state-machine control that a single very deep pipeline provides the ideal implementation. The common examples, 3D graphics and magnetic-disk read-channel chips, sometimes have pipelines hundreds of clock stages deep. An application-specific processor could be used to control such deep pipelines, but the benefits of instruction-by-instruction control would be of less help in these applications.
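The simple data buffering exception can be illustrated with a sketch of a FIFO emulated through ordinary memory operations. The `RamFifo` class and its head, tail, and count fields are assumptions chosen for illustration, not a description of any particular controller.

```python
class RamFifo:
    """FIFO emulated with indexed reads and writes into a fixed-size RAM array."""

    def __init__(self, depth):
        self.ram = [0] * depth
        self.depth = depth
        self.head = 0      # next location to read
        self.tail = 0      # next location to write
        self.count = 0     # number of entries currently stored

    def push(self, value):
        if self.count == self.depth:
            return False                           # full: caller retries later
        self.ram[self.tail] = value                # store
        self.tail = (self.tail + 1) % self.depth   # pointer update
        self.count += 1
        return True

    def pop(self):
        if self.count == 0:
            return None                            # empty
        value = self.ram[self.head]                # load
        self.head = (self.head + 1) % self.depth   # pointer update
        self.count -= 1
        return value
```

Every push or pop in this model costs several instructions for pointer updates and full or empty checks, which is the bookkeeping that a dedicated FIFO's wrapper logic performs in hardware.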
Aside from these few caveats, firmware program control's advantages make it a wise design choice.
The migration of functions from software to hardwired logic, over time, presents a well-known phenomenon. During early design exploration of prerelease protocol standards, processor-based implementations are common even for simple standards that clearly allow efficient logic-only implementations. Some common standards that have followed this path include popular video codecs such as MPEG-2, 3G wireless protocols such as W-CDMA, and encryption and security algorithms such as SSL and triple-DES.
However, the large gap in performance and design ease between software-based and RTL-based development has limited this migration. The emergence of configurable and extensible application-specific processors creates a new design path that's quick and easy enough for the development and refinement of new protocols and standards, yet efficient enough in silicon area and power to permit very-high-volume deployment.
Chris

