Introduction to the Systems Approach

System Architecture: an Overview
The past 40 years have seen amazing advances in silicon technology and resulting increases in transistor density and performance. In 1966 Fairchild [1] introduced a quad two-input NAND gate with about 10 transistors on a die. The latest Intel Itanium processor has almost 100 million times that many transistors. The aim of this book is to present an approach for computer system design that exploits this enormous transistor density. In part this is a direct extension of studies in computer architecture and design. However, it is also a study of system architecture and design. About 50 years ago, a seminal text on Systems Engineering [34] appeared. As the authors, H. Goode and R. Machol, pointed out, the systems view of engineering was created by a need to deal with complexity. As then, our ability to deal with complex design problems is greatly enhanced by computer-based tools.
A system-on-chip architecture is an ensemble of processors, memories, switches and buses tailored to an application domain. An example of such an architecture is the Emotion Engine [37, 39, 40] for the Sony PlayStation 2 (Figure 1.3). This system contains three essential components: a main processor of the Reduced Instruction Set Computer (RISC) style [35], and two vector processing units, VPU0 and VPU1, each of which contains four parallel processors of the Single-Instruction-Multiple-Datastream (SIMD) style [32]. We provide a brief overview of these components and our overall approach in the next few sections.
While the focus of the book is on the system, in order to understand the system one must first understand the components. So before returning to the issue of system architecture later in this chapter, we review the components that make up the system.
Components of the System: Processors, Memories and Interconnections
The term architecture represents the operational structure and the user's view of the system. Over time, it has evolved to include both the functional specification and the hardware implementation. At the system level, architecture defines the processor-level building blocks, such as processors and memories, and the interconnection among the building blocks. At a lower level, architecture determines the processor's programming model and its detailed implementation. The implementation of a processor is also known as microarchitecture (Figure 1.4).
The system designer has a programmer's or user's view of the system components, the system view of memory, the variety of specialized processors and the interconnection bus. In the next sections we look at basic components: the processor architecture, the memory and the bus or interconnect architecture. Figure 1.5 illustrates some of the basic elements of an SOC system. These include a number of heterogeneous processors interconnected to one or more memory elements with possibly an array of reconfigurable logic. Frequently, the SOC also has some analog circuitry for managing sensor data and A/D conversion or to support wireless data transmission. If all elements cannot be contained on a single chip, the implementation is probably best referred to as a system on a board, but it is often still called an SOC. What distinguishes a system on a board (or SOC) from the conventional general-purpose computer(s) plus memory on a board is the specific nature of the design target. The application(s) is assumed to be known and specified, so that the elements of the system can be selected, sized and evaluated during the design process. The emphasis on selecting, parameterizing and configuring system components tailored to a target application distinguishes a system architect from a computer architect.
In this chapter we primarily look at the higher-level definition of the processor (the programmer's view, or instruction set architecture, ISA), the basics of the processor microarchitecture, memory hierarchies, and interconnection structure. In later chapters, we shall study in more detail the implementation issues for these elements.
Hardware/Software Tradeoffs
A fundamental decision in SOC design is to choose which components in the system are to be implemented in hardware, and which in software. We first look at the features of conventional hardware and software implementations, and then describe two technological developments, custom instruction processors and reconfigurable devices, that blur the distinctions between hardware and software.
The main strengths and drawbacks of hardware and software implementations are summarized in Table 1.1.
The notable feature of a software implementation is that it is executed on a processor, which interprets instructions at run time. This architecture offers flexibility and adaptability, and provides a way of sharing resources among different applications; however, the hardware implementation of the instruction set architecture is generally slower and more power hungry than implementing the corresponding function directly in hardware, without the overhead of fetching and decoding instructions. With the advent of program development environments for high-level languages (optimizing compilers, performance profilers and the like), software developers are usually more productive than hardware developers.
In contrast, the direct implementation of applications in hardware often provides high performance at the expense of flexibility and productivity.
Given that hardware and software have complementary features, many SOC designs aim to combine the individual strengths of the two. The obvious method is to implement the performance-critical parts of the application in hardware, and the rest in software. For instance, if 90% of the software execution time of an application is spent on 10% of the source code, up to a 10 fold speed-up is achievable if that 10% of the code is efficiently implemented in hardware.
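The 10-fold bound in this example follows from Amdahl's law. The following is a minimal sketch, not a design tool: the function name and its parameters are our own illustrative choices, and it assumes that the accelerated portion speeds up uniformly while the rest of the program is unchanged.

```python
def overall_speedup(accelerated_fraction, hardware_speedup):
    """Amdahl's law: overall speed-up when only part of the run time is accelerated.

    accelerated_fraction: share of the original execution time moved to hardware.
    hardware_speedup: how much faster that share runs in hardware.
    """
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / hardware_speedup)


if __name__ == "__main__":
    # 90% of the execution time is moved to hardware, as in the example above.
    for s in (2, 10, 100, float("inf")):
        print(f"hardware speed-up {s}: overall {overall_speedup(0.9, s):.2f}x")
```

Under these assumptions the 10-fold figure is reached only when the hardware portion takes negligible time; a 10-fold faster hardware block yields an overall speed-up of only about 5.3.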
While hardware and software approaches may seem independent, two particular advances in technology are narrowing the gap between the two. The first advance involves processors with instruction sets customized for a specific application or domain. Custom instructions efficiently implemented in hardware are integrated into a base processor with a basic instruction set. This capability often improves upon the conventional approach of using standard instruction-sets to fulfill the same task while preserving its flexibility. Chapter 7 explores further some of the issues involving custom instructions.
The second advance in technology involves the use of reconfigurable devices as application-specific hardware accelerators. The best-known reconfigurable devices are field-programmable gate arrays (FPGAs). Because of the growing demand for reducing the time-to-market and the increasing cost of chip fabrication, user-reconfigurable devices are becoming more popular for implementing digital designs. Such devices typically contain computation units, memories and their interconnections, and all three are usually programmable at run time. Reconfigurable devices often offer a good compromise between the two worlds: they are faster than software, while being more flexible than conventional hardware implementations.
Processor Architectures
Typically processors are characterized either by their application or by their architecture (or structure), as shown in Table 1.2 and Table 1.3. The requirements space of an application is often large, and there is a range of implementation options. Thus, it is usually difficult to associate a particular architecture with a particular application. In addition, some architectures combine different implementation approaches as seen in the PlayStation example of Section 1.1. There the graphics processor consists of a four-element SIMD array of vector processing functional units. Other SOC implementations will consist of multiprocessors using very long instruction word (VLIW) and/or superscalar processors.
From the programmer's point of view, sequential processors execute one instruction at a time. However, many processors have the capability to execute several instructions concurrently in a manner that is transparent to the programmer. Pipelining is a powerful technique that is used in almost all current processor implementations. Techniques to extract and exploit the inherent parallelism in the code at compile-time or run-time are also widely used.
Exploiting program parallelism is one of the most important goals in computer architecture.
Instruction-level parallelism (ILP) means that multiple operations can be executed in parallel within a program. ILP may be achieved with hardware, compiler or operating system techniques. At the loop level, consecutive loop iterations are ideal candidates for parallel execution, provided that there is no data dependency between subsequent loop iterations. Next, there is parallelism available at the procedure level, which depends largely on the algorithms used in the program. Finally, multiple independent programs can execute in parallel.
Different computer architectures have been built to exploit this inherent parallelism. In general, a computer architecture consists of one or more interconnected processor elements that operate concurrently, solving a single overall problem.
[7]                          General   proprietary ISA   Array processor of 96 processing elements
PlayStation 2 [40, 37, 39]   Gaming    MIPS              Pipelined with 2 vector coprocessors
ARM VFP11 [13]               General   ARM               Configurable vector coprocessor
Processor: A Functional View
Processor: An Architectural View
The architectural view of the system describes the actual implementation, at least in a broad-brush way. For sophisticated architectural approaches, more detail is required to understand the complete implementation.
Simple Sequential Processor
Sequential processors directly implement the sequential execution model. These processors process instructions sequentially from the instruction stream. The next instruction is not processed until all execution for the current instruction is complete, and its results have been committed.
The semantics of the instruction determines that a sequence of actions must be performed to produce the specified result (Figure 1.6). These actions can be overlapped but the result must appear in the specified serial order. These actions include:
1. fetching the instruction into the instruction register (IF),
2. decoding the op code of the instruction (ID),
3. generating the address in memory of any data item residing there (AG),
4. fetching data operand(s) into executable registers (DF),
5. executing the specified operation (EX), and
6. writing back the result to the register file (WB).
During execution, a sequential processor executes one or more operations per clock cycle from the instruction stream. An instruction is a container that represents the smallest execution packet managed explicitly by the processor. One or more operations are contained within an instruction. The distinction between instructions and operations is crucial to distinguish between processor behaviors. Scalar and superscalar processors consume one or more instructions per cycle, where each instruction contains a single operation. Although conceptually simple, executing each instruction sequentially has significant performance drawbacks: a considerable amount of time is spent on overhead and not on actual execution. Thus the simplicity of directly implementing the sequential execution model has significant performance costs.
Pipelined Processor
Pipelining is a straightforward approach to exploiting parallelism that is based on concurrently performing different phases (instruction fetch, decode, execution, etc.) of processing an instruction. Pipelining assumes that these phases are independent between different operations and can be overlapped; when this condition does not hold, the processor stalls the downstream phases to enforce the dependency. Thus multiple operations can be processed simultaneously with each operation at a different phase of its processing. Figure 1.8 illustrates the instruction timing in a pipelined processor, assuming that the instructions are independent.
For a simple pipelined machine, there is only one operation in each phase at any given time; thus one operation is being fetched (IF), one operation is being decoded (ID), one operation is generating an address (AG), one operation is accessing operands (DF), one operation is in execution (EX), and one operation is storing results (WB). The most rigid form of a pipeline, sometimes called the static pipeline, requires the processor to go through all stages or phases of the pipeline whether required by a particular instruction or not. A dynamic pipeline allows the bypassing of one or more stages in the pipeline, depending on the requirements of the instruction. The more complex dynamic pipelines allow instructions to complete out of (sequential) order, or even to initiate out of order. Out-of-order processors must ensure that the sequential consistency of the program is preserved. This sequential execution behavior describes the sequential execution model, which requires each instruction to be executed to completion in sequence.
Figure: execution of a non-pipelined sequential processor.
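The benefit of overlapping the six phases can be estimated with a simple cycle count. The sketch below is illustrative only: it assumes that every phase takes exactly one cycle, that all instructions are independent, and that there are no stalls. The function names are ours, not part of any processor specification.

```python
STAGES = ("IF", "ID", "AG", "DF", "EX", "WB")


def sequential_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles when each instruction finishes every stage before the next starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles when a new instruction enters the pipeline every cycle (no stalls)."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


def timing_rows(num_instructions):
    """One text row per instruction; column k shows the stage occupied in cycle k."""
    rows = []
    for i in range(num_instructions):
        cells = ["--"] * i + list(STAGES)
        rows.append(" ".join(cells))
    return rows


if __name__ == "__main__":
    for row in timing_rows(4):
        print(row)
    print("sequential:", sequential_cycles(4), "cycles")
    print("pipelined: ", pipelined_cycles(4), "cycles")
```

With these assumptions, four instructions need 24 cycles sequentially but only 9 cycles when pipelined; for long instruction streams the pipelined rate approaches one instruction per cycle.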
Instruction-Level Parallelism
While pipelining does not necessarily lead to executing multiple instructions at exactly the same time, there are other techniques that do. These techniques may use some combination of static scheduling and dynamic analysis to perform concurrently the actual evaluation phase of several different operations, potentially yielding an execution rate of greater than one operation every cycle. This kind of parallelism exploits concurrency at the computation level. Since historically most instructions consist of only a single operation, this kind of parallelism has been named instruction-level parallelism (ILP).
Two architectures that exploit ILP are superscalar and VLIW (very long instruction word). They use different techniques to achieve execution rates greater than one operation per cycle. A superscalar processor dynamically examines the instruction stream to determine which operations are independent and can be executed. A VLIW processor relies on the compiler to analyze the available operations (OP) and to schedule independent operations into wide instruction words; the processor then executes these operations in parallel with no further analysis. Figure 1.10 shows the instruction timing of a pipelined superscalar or VLIW processor executing two instructions per cycle. In this case, all the instructions are independent so that they can be executed in parallel.
Superscalar Processor
Dynamic pipelined processors remain limited to executing a single operation per cycle by virtue of their scalar nature. This limitation can be avoided with the addition of multiple functional units and a dynamic scheduler to process more than one instruction per cycle. These superscalar processors [36] can achieve execution rates of several instructions per cycle (usually limited to 2 but more is possible depending on the application). The most significant advantage of a superscalar processor is that processing multiple instructions per cycle is done transparently to the user, and that it can provide binary code compatibility while achieving better performance.
Compared to a dynamic pipelined processor, a superscalar processor adds a scheduling instruction window that analyzes multiple instructions from the instruction stream each cycle. Although processed in parallel, these instructions are treated in the same manner as in a pipelined processor. Before an instruction is issued for execution, dependencies between the instruction and its prior instructions must be checked by hardware.
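The dependency check performed by the hardware can be illustrated in software. The following is a simplified sketch under stated assumptions: instructions are modeled only by their destination and source register names, issue is in order, and only register dependencies are checked. A real scheduler would also deal with register renaming, memory dependencies and functional unit availability; the names Instr, depends_on and issue_group are hypothetical.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str               # register written by the instruction
    srcs: Tuple[str, ...]   # registers read by the instruction


def depends_on(later, earlier):
    """True if 'later' must wait for 'earlier' (register dependencies only)."""
    read_after_write = earlier.dest in later.srcs
    write_after_write = earlier.dest == later.dest
    return read_after_write or write_after_write


def issue_group(window, width=2):
    """Return the in-order prefix of the window that can issue in the same cycle."""
    group = []
    for instr in window:
        if len(group) == width:
            break
        if any(depends_on(instr, prior) for prior in group):
            break
        group.append(instr)
    return group


if __name__ == "__main__":
    independent = [Instr("r1", ("r2", "r3")), Instr("r4", ("r5", "r6"))]
    dependent = [Instr("r1", ("r2", "r3")), Instr("r4", ("r1", "r5"))]
    print(len(issue_group(independent)), "instructions issue together")
    print(len(issue_group(dependent)), "instruction issues alone")
```

In the second window the later instruction reads r1, which the earlier one writes, so only one instruction can be issued in that cycle.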
Because of the complexity of the dynamic scheduling logic, high-performance superscalar processors are limited to processing four to six instructions per cycle. Although superscalar processors can exploit instruction-level parallelism from the dynamic instruction stream, exploiting higher degrees of parallelism requires other approaches. 
VLIW Processor
In contrast to dynamic analyses in hardware to determine which operations can be executed in parallel, VLIW processors rely on static analyses in the compiler. VLIW processors are thus less complex than superscalar processors, and have the potential for higher performance. A VLIW processor executes operations from statically scheduled instructions that contain multiple independent operations. Because the control complexity of a VLIW processor is not significantly greater than that of a scalar processor, the improved performance comes without the complexity penalties.
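The compiler's task of placing independent operations into wide instruction words can be sketched as follows. This is a deliberately naive illustration, not a production scheduler: it assumes unit operation latencies, keeps operations in program order, and conservatively starts a new instruction word on any register overlap that involves a write. The function name pack_bundles and the (dest, srcs) operation format are our own assumptions.

```python
NOP = None  # placeholder for an unused slot in an instruction word


def pack_bundles(ops, width=2):
    """Greedily pack (dest, srcs) operations, in order, into words of 'width' slots."""
    bundles, current = [], []
    for dest, srcs in ops:
        written = {d for d, _ in current}
        read = {s for _, ss in current for s in ss}
        # Conservative test: any overlap involving a write closes the current word.
        conflict = bool(written & set(srcs)) or dest in written or dest in read
        if conflict or len(current) == width:
            bundles.append(current)
            current = []
        current.append((dest, srcs))
    if current:
        bundles.append(current)
    # Pad incompletely filled words with NOPs.
    return [bundle + [NOP] * (width - len(bundle)) for bundle in bundles]


if __name__ == "__main__":
    program = [
        ("r1", ("r2", "r3")),
        ("r4", ("r5", "r6")),
        ("r7", ("r1", "r4")),
        ("r8", ("r7",)),
    ]
    for word in pack_bundles(program, width=2):
        print(word)
```

The last two operations form a dependency chain, so each occupies its own word with one slot left empty. Such unfilled slots are the cost of insufficient parallelism in a statically scheduled machine.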
VLIW processors rely on the static analyses performed by the compiler, and are unable to take advantage of any dynamic execution characteristics. For applications that can be scheduled statically to use the processor resources effectively, a simple VLIW implementation results in high performance. Unfortunately, not all applications can be effectively scheduled statically. In many applications, execution does not proceed exactly along the path defined by the code scheduler in the compiler. Two classes of execution variations can arise and affect the scheduled execution behavior:
1. Delayed results from operations whose latency differs from the assumed latency scheduled by the compiler.
2. Interruptions from exceptions or interrupts, which change the execution path to a completely different and unanticipated code schedule.
Although stalling the processor can handle delayed results, this solution can result in significant performance penalties. The most common execution delay is a data cache miss. Many VLIW processors avoid all situations that can result in a delay by avoiding data caches and by assuming worst-case latencies for operations. However, when there is insufficient parallelism to hide the exposed worst-case operation latency, the instruction schedule has many incompletely filled or empty instructions, resulting in poor performance.
SIMD Processors, Array and Vector Processors
The SIMD (Single Instruction, Multiple Data Stream) class of processor architecture includes both array and vector processors. The SIMD processor is a natural response to the use of certain regular data structures, such as vectors and matrices. From the view of an assembly-level programmer, programming a SIMD architecture appears to be very similar to programming a simple processor except that some operations perform computations on aggregate data. Since these regular structures are widely used in scientific programming, the SIMD processor has been very successful in these environments.
The two popular types of SIMD processor are the array processor and the vector processor. They differ both in their implementations and in their data organizations. An array processor consists of many interconnected processor elements, each having its own local memory space. A vector processor consists of a single processor that references a single global memory space, and has special function units that operate specifically on vectors.
An array processor or a vector processor need not be a standalone processor. Either type can be realized as an instruction set extension to an otherwise conventional machine. The extended instructions enable control over special resources in the processor, or in some sort of co-processor. The purpose of such extensions is to enable increased performance on special applications.
Array Processors
The array processor is a set of parallel processor elements connected via one or more networks, possibly including local and global inter-element communications and control communications. Processor elements operate in lockstep in response to a single broadcast instruction from a control processor. Each processor element has its own private memory, and data are distributed across the elements in a regular fashion that is dependent on both the actual structure of the data and also the computations to be performed on the data. Direct access to global memory or another processor element's local memory is expensive, so intermediate values are propagated through the array through local interprocessor connections. This requires that the data be distributed carefully so that the routing required to propagate these values is simple and regular. It is sometimes easier to duplicate data values and computations than it is to support a complex or irregular routing of data between processor elements.
Since instructions are broadcast, there is no means local to a processor element of altering the flow of the instruction stream; however, individual processor elements can conditionally disable instructions based on local status information, and these processor elements are idle when this condition occurs. The actual instruction stream consists of more than a fixed stream of operations: an array processor is typically coupled to a general-purpose control processor that provides both scalar operations as well as array operations that are broadcast to all processor elements in the array. The control processor performs the scalar sections of the application, interfaces with the outside world, and controls the flow of execution; the array processor performs the array sections of the application as directed by the control processor.
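The combination of broadcast instructions and local disabling can be modeled in a few lines. The sketch below is a software analogy only: the Python loop visits the elements one after another, whereas the hardware operates them in lockstep, and each element's local memory is reduced to a single value. The class and function names are hypothetical.

```python
class ProcessorElement:
    def __init__(self, value):
        self.local = value     # private local memory (a single word in this sketch)
        self.enabled = True    # local status bit that gates broadcast instructions


def set_mask(elements, predicate):
    """Each element enables or disables itself based on its own local data."""
    for pe in elements:
        pe.enabled = predicate(pe.local)


def broadcast(elements, operation):
    """The control processor broadcasts one operation; disabled elements stay idle."""
    for pe in elements:
        if pe.enabled:
            pe.local = operation(pe.local)


def run_demo():
    pes = [ProcessorElement(v) for v in (-3, 4, -1, 7)]
    set_mask(pes, lambda x: x < 0)    # only elements holding negative data stay active
    broadcast(pes, lambda x: -x)      # negate: together with the mask, an absolute value
    set_mask(pes, lambda x: True)     # re-enable every element
    broadcast(pes, lambda x: x * 2)   # uniform operation on all elements
    return [pe.local for pe in pes]


if __name__ == "__main__":
    print(run_demo())
```

Note that the elements holding 4 and 7 do no useful work during the negate step. This idle time is the price of handling data-dependent control flow with a single instruction stream.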
A suitable application for use on an array processor has several key characteristics: a significant amount of data which have a regular structure; computations on the data which are uniformly applied to many or all elements of the data set; simple and regular patterns relating the computations and the data. An example of an application that has these characteristics is the solution of the Navier-Stokes equations, although any application that has significant matrix computations is likely to benefit from the concurrent capabilities of an array processor. an example of the type of array processor chip that is directed at signal processing applications.
Vector Processors
A vector processor is a single processor that resembles a traditional single stream processor, except that some of the function units (and registers) operate on vectors: sequences of data values that are seemingly operated on as a single entity. These function units are deeply pipelined and have high clock rates. While the vector pipelines often have higher latencies compared to scalar function units, the rapid delivery of the input vector data elements together with the high clock rates result in a significant throughput.
Modern vector processors require that vectors be explicitly loaded into special vector registers and stored back into memory, the same course that modern scalar processors use for similar reasons. Vector processors have several features that enable them to achieve high performance. One feature is the ability to concurrently load and store values between the vector register file and main memory, while performing computations on values in the vector register file. This is an important feature, since the limited length of vector registers requires that vectors longer than the register length be processed in segments, a technique called strip-mining. Not being able to overlap memory accesses and computations would pose a significant performance bottleneck.
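Strip-mining can be shown with an ordinary loop over register-sized segments. In this sketch the vector register length of 64 elements is an assumed value, and Python list slices stand in for vector loads, a vector add and a vector store; the function names are illustrative.

```python
VECTOR_REGISTER_LENGTH = 64  # assumed register length, in elements


def segment_count(n, vlen=VECTOR_REGISTER_LENGTH):
    """Number of register-sized segments needed for a vector of n elements."""
    return -(-n // vlen)  # ceiling division


def strip_mined_add(a, b, vlen=VECTOR_REGISTER_LENGTH):
    """Add two long vectors one register-sized segment at a time."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    result = []
    for start in range(0, len(a), vlen):
        va = a[start:start + vlen]             # vector load into register A
        vb = b[start:start + vlen]             # vector load into register B
        vc = [x + y for x, y in zip(va, vb)]   # one vector add instruction
        result.extend(vc)                      # vector store back to memory
    return result


if __name__ == "__main__":
    n = 150
    total = strip_mined_add(list(range(n)), [1] * n)
    print(segment_count(n), "segments; last element is", total[-1])
```

A 150-element vector needs three passes here: two full segments of 64 elements and a final short segment of 22. In hardware, the loads for one segment would ideally overlap the computation on the previous one.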
Most vector processors support a form of result bypassing, in this case called chaining, that allows a follow-on computation to commence as soon as the first value is available from the preceding computation. Thus, instead of waiting for the entire vector to be processed, the follow-on computation can be significantly overlapped with the preceding computation that it is dependent on. Sequential computations can be efficiently compounded to behave as if they were a single operation, with a total latency equal to the latency of the first operation plus the pipeline and chaining latencies of the remaining operations, but none of the startup overhead that would be incurred without chaining. For example, division could be synthesized by chaining a reciprocal with a multiply operation. Chaining typically works for the results of load operations as well as normal computations.
Figure 1.14 Vector processor model.
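A rough timing model shows why chaining matters. The sketch below rests on simplifying assumptions of our own: each function unit delivers its first result after a fixed pipeline latency and then one result per cycle, chaining itself adds no delay, and the latencies of 14 and 7 cycles for the reciprocal and multiply units are made-up values for illustration.

```python
def unchained_cycles(latencies, vector_length):
    """Each operation waits for the complete result vector of its predecessor."""
    return sum(latency + vector_length - 1 for latency in latencies)


def chained_cycles(latencies, vector_length):
    """Each operation starts as soon as the first element of its input arrives."""
    return sum(latencies) + vector_length - 1


if __name__ == "__main__":
    # Division synthesized as a reciprocal followed by a multiply.
    latencies = [14, 7]   # assumed pipeline latencies, in cycles
    n = 64                # vector length, in elements
    print("without chaining:", unchained_cycles(latencies, n), "cycles")
    print("with chaining:   ", chained_cycles(latencies, n), "cycles")
```

Under this model the chained division behaves like a single operation with a 21-cycle pipeline, finishing in 84 cycles rather than 147.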
A typical vector processor configuration consists of a vector register file, one vector addition unit, one vector multiplication unit, and one vector reciprocal unit (used in conjunction with the vector multiplication unit to perform division); the vector register file contains multiple vector registers (elements). 
Multiprocessors
Multiple processors can cooperatively execute to solve a single problem, by using some form of interconnection for sharing results. In this configuration, each processor executes completely independently, although most applications require some form of synchronization during execution to pass information and data between processors. Most configurations are homogeneous with all processor elements being identical, although this is not a requirement.
Table 1.10 SOC multi-processors and multi-threaded processors.
[...] elements, and synchronizes the independent execution streams between processor elements. When the memory of the processor is distributed across all processors and only the local processor element has access to it, all data sharing is performed explicitly using messages and all synchronization is handled within the message system. When the memory of the processor is shared across all processor elements, synchronization is more of a problem: certainly messages can be used through the memory system to pass data and information between processor elements, but this is not necessarily the most effective use of the system.
When communications between processor elements are performed through a shared memory address space, either global or distributed between processor elements (called distributed shared memory to distinguish it from distributed memory), there are two significant problems that arise. The first is maintaining memory consistency: the programmer-visible ordering effects on memory references, both within a processor element and between different processor elements. This problem is usually solved through a combination of hardware and software techniques. The second is cache coherency: the programmer-invisible mechanism to ensure that all processor elements see the same value for a given memory location. This problem is usually solved exclusively through hardware techniques.
The primary characteristic of a multiprocessor system is the nature of the memory address space. If each processor element has its own address space (distributed memory), the only means of communication between processor elements is through message passing. If the address space is shared (shared memory), communication is through the memory system.
The implementation of a distributed memory machine is far easier than the implementation of a shared memory machine, when memory consistency and cache coherency are taken into account. However, programming a distributed memory processor can be much more difficult, since the applications must be written to exploit and not to be limited by the use of message passing as the only form of communication between processor elements. On the other hand, despite the problems associated with maintaining consistency and coherency, programming a shared memory processor can take advantage of whatever communications paradigm is appropriate for a given communications requirement, and can be much easier to program.
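The two programming styles can be contrasted with a small software analogy. In the sketch below, threads stand in for processor elements: the first function communicates through a shared variable protected by a lock, the second only through explicit messages. This is an illustration of the programming models, not of hardware behavior; Python threads always share one address space, so the message-passing version merely follows the discipline of never touching shared data. All function names are our own.

```python
import queue
import threading


def shared_memory_sum(chunks):
    """Workers accumulate into one shared location; a lock provides synchronization."""
    total = {"value": 0}          # lives in the shared address space
    lock = threading.Lock()

    def worker(chunk):
        partial = sum(chunk)
        with lock:                # serialize updates to the shared location
            total["value"] += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total["value"]


def message_passing_sum(chunks):
    """Workers keep their data private and send only their results as messages."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))   # explicit send; receipt also synchronizes

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(mailbox.get() for _ in chunks)


if __name__ == "__main__":
    data = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    print(shared_memory_sum(data), message_passing_sum(data))
```

In the message-passing version the data movement is visible in the program text, whereas in the shared-memory version it is implicit and correctness depends on the lock.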
Memory and Addressing
SOC applications vary significantly in memory requirements. In one case the memory size and structure can be so well defined that the program can be contained in an on-chip read-only memory and the data can be contained in one or more on-chip RAM arrays. In another case the memory system must support an elaborate operating system requiring a large off-chip memory (system on a board).
Why not simply include memory with the processor on the die? This has many attractions:
1. it improves the accessibility of memory, improving both memory access time and bandwidth;
2. it reduces the need for large caches;
3. it improves performance for memory-intensive applications.
But there are problems. The first problem is that the DRAM memory process technology differs from standard microprocessor process technology, and would cause some sacrifice in achievable bit density. The second problem is more serious: if memory were restricted to the processor die, its size would be correspondingly limited. Applications that require very large real memory space would be crippled. Thus the conventional processor die model has evolved (Figure 1.15) to implement multiple robust homogeneous processors sharing the higher levels of a two- or three-level cache structure with main memory off die, on its own multi-die module.
From a design complexity point of view, this has the advantage of being a "universal" solution: one implementation fits all applications, although not necessarily equally well. So while a great deal of design effort is required for such an implementation, the production quantities can be large enough to justify the costs. An alternative to this approach is clear. For specific applications, whose memory size can be bounded, we can implement an integrated memory SOC. This concept is illustrated in Figure 1.16 (also recall Figure 1.3).
A related but separate question is: does the application require virtual memory (mapping disk space onto memory) or is all real memory suitable? We look at the requirement for virtual memory addressing in the next section.
Finally, the memory can be centralized or distributed. Even here the memory can appear to the programmer as a single (centralized) shared memory even though it is implemented in several distributed modules. Several memory considerations are listed in Table 1.11.
The memory system comprises the physical storage elements in the memory hierarchy. These elements include those specified by the instruction set (registers, main memory and disk sectors) as well as those elements that are largely transparent to the user's program (buffer registers, cache and page-mapped virtual memory).
Table 1.13 Example SOC embedded memory macro cell (see Chapter 4 for discussion of cell types). T refers to the number of transistors in a one-bit cell.
SOC Memory Examples
Addressing: the Architecture of Memory
The user's view of memory primarily consists of the addressing facilities available to the programmer. Some of these facilities are available to the application programmer, and some to the operating system programmer. When the facilities are properly implemented and programmed, memory can be efficiently and securely accessed.
Conceptually, the physical memory address is determined by a sequence of (at least) three steps:
1. The application produces a process address. This, together with the process id, defines the virtual address: virtual address = offset + base + index, where the offset is specified in the instruction and the base and index values are in specified registers.
2. Since multiple processes must cooperate in the same memory space, the process addresses must be coordinated and relocated. This is typically done by a segment table. Upper bits of the virtual address are used to address a segment table, which has (predetermined) base and bound values for the process, resulting in a system address: system address = virtual address + (process) base, where the system address must be less than the bound.
3. With virtual memory, the memory space is implemented on disk and only the recently used regions (pages) are brought into memory. The available pages are located by a page table. The upper bits of the system address access a page table. If the data for this page have been loaded from the disk, the location in memory will be provided as the upper address bits of the "real" or physical memory address. The lower bits of the real address are the same as the corresponding lower bits of the virtual address.
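The three steps can be traced with a small worked example. The sketch below is a simplification under assumptions of our own: a 4 KB page size, a single segment whose base and bound are passed in directly rather than looked up in a segment table, and a page table represented as a Python dictionary. The function and exception names are hypothetical.

```python
PAGE_BITS = 12  # assumed page size of 4 KB


class SegmentBoundError(Exception):
    """The relocated address falls outside the bound for the process."""


class PageNotLoaded(Exception):
    """The page is not in memory and would have to be brought in from disk."""


def virtual_address(offset, base, index):
    # Step 1: offset from the instruction, base and index from registers.
    return offset + base + index


def system_address(vaddr, segment_base, segment_bound):
    # Step 2: relocate by the process base and check against the bound.
    saddr = vaddr + segment_base
    if saddr >= segment_bound:
        raise SegmentBoundError(hex(saddr))
    return saddr


def physical_address(saddr, page_table):
    # Step 3: upper bits select a page table entry; lower bits pass through.
    page_number = saddr >> PAGE_BITS
    page_offset = saddr & ((1 << PAGE_BITS) - 1)
    if page_number not in page_table:
        raise PageNotLoaded(hex(page_number))
    return (page_table[page_number] << PAGE_BITS) | page_offset


if __name__ == "__main__":
    vaddr = virtual_address(offset=0x10, base=0x1000, index=0x4)
    saddr = system_address(vaddr, segment_base=0x20000, segment_bound=0x30000)
    paddr = physical_address(saddr, page_table={0x21: 0x5})
    print(hex(vaddr), hex(saddr), hex(paddr))
```

Here the virtual address 0x1014 is relocated to the system address 0x21014, whose page number 0x21 maps to physical page 0x5, giving the physical address 0x5014. The lower 12 bits (0x014) are unchanged throughout.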
Usually, the tables (segment and page) that perform address translation are in memory, so a mechanism called the translation lookaside buffer (TLB) must be used to speed up the translation. The TLB is a simple register system, usually consisting of between 64 and 256 entries, that saves recent address translations for reuse. A small number of (hashed) virtual address bits address the TLB. The TLB entry holds both the real address and the complete virtual address (and id). If the virtual address matches, the real address from the TLB can be used. Otherwise a not-in-TLB event occurs and a complete translation must be performed (Figure 1.17).
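The TLB behaviour described here can be modelled roughly as follows. This is a sketch only: the 64-entry size is taken from the low end of the range quoted above, while the 4 KB page size, the XOR-folding hash, the direct-mapped organization and the names `TLB`, `translate` and `walk_tables` are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple

PAGE_BITS = 12    # assumption: 4 KB pages
TLB_ENTRIES = 64  # low end of the 64 to 256 entry range


class TLB:
    def __init__(self, walk_tables: Callable[[int, int], int]) -> None:
        # Each slot holds (process id, virtual page, real page) or None when empty.
        self.slots: List[Optional[Tuple[int, int, int]]] = [None] * TLB_ENTRIES
        self.walk_tables = walk_tables  # full (slow) segment and page table translation
        self.misses = 0

    @staticmethod
    def _index(vpage: int) -> int:
        # Assumed hash: fold higher page bits onto the low bits that address the TLB.
        return (vpage ^ (vpage >> 6)) % TLB_ENTRIES

    def translate(self, pid: int, vaddr: int) -> int:
        vpage = vaddr >> PAGE_BITS
        offset = vaddr & ((1 << PAGE_BITS) - 1)
        i = self._index(vpage)
        entry = self.slots[i]
        if entry is not None and entry[0] == pid and entry[1] == vpage:
            frame = entry[2]  # hit: the stored id and virtual address both match
        else:
            self.misses += 1  # not-in-TLB event: do the complete translation
            frame = self.walk_tables(pid, vpage)
            self.slots[i] = (pid, vpage, frame)
        return (frame << PAGE_BITS) | offset
```

Storing the process id in each entry is what lets two processes use the same virtual page number without receiving each other's translations.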
Memory for SOC Operating System
One of the most critical decisions (or requirements) concerning an SOC design is the selection of the operating system and its memory management functionality. Of primary interest to the designer is the requirement for virtual memory. If the system can be restricted to a real memory (physically, not virtually addressed) and the size of the memory can be kept to the order of tens of megabytes, the system can be implemented as a true system on a chip (all memory on die). The alternative is slower and significantly more expensive virtual memory. Table 1.14 illustrates some current SOC designs and their operating systems.
Table 1.14 Operating systems for SOC designs.
OS                    Vendor            Memory model
uClinux               NEC               Virtual
VxWorks (RTOS) [27]   Wind River        Real
Windows CE            Microsoft         Virtual
Nucleus (RTOS) [25]   Mentor Graphics   Real
MQX (RTOS) [22]       ARC               Real
Of course, fast and concise real memory designs come at the price of functionality. The user has limited ways of creating new processes and expanding the application base of the systems.
System level interconnection
System-on-Chip (SOC) technology relies on the interconnection of predesigned circuit modules (known as Intellectual Property or IP blocks) to form a complete system which can be integrated onto a single chip. In this way the design task is raised from a circuit level to a system level. Central to the system level performance and the reliability of the finished product is the method of interconnection used. A well designed interconnection scheme should have rigorous and efficient communication protocols, unambiguously defined as a published standard. This facilitates interoperability between IP blocks designed by different people from different organizations, and encourages design reuse. It should also provide efficient communication between different modules, maximizing the degree of parallelism achieved.
SOC interconnect methods can be classified into three main approaches based on buses, crossbar, and network-on-chip, as depicted respectively in Figures 1.18, 1.19 and 1.20.
Bus based approach
With the bus based approach, IP blocks are designed to conform to published bus standards (such as ARM's AMBA [12] or IBM's CoreConnect [18]). Communication between modules is achieved through the sharing of the physical connections of address, data and control bus signals. This is the most common method used for SOC system level interconnect. Usually two or more buses are employed in a system. These are organized in a hierarchical fashion. To optimize system level performance and cost, the bus closest to the CPU has the highest bandwidth and the bus furthest from the CPU has the lowest bandwidth.
Figure 1.18 SOC system level interconnection: Bus based approach [45].
Crossbar switch based approach
The second approach employs a crossbar switch to provide asynchronous communication channels between a variety of IP cores and I/O interfaces. Unlike the bus based approach, where all modules attached to a given bus operate synchronously to a bus clock signal, the crossbar approach uses asynchronous channels to connect synchronous modules that can operate at different clock frequencies. Systems using this method of communication are also known as Globally Asynchronous, Locally Synchronous (GALS) systems. The approach offers potentially higher throughput than a bus based system, while making integration of a multiple clock domain system much easier.
Figure 1.20 SOC system level interconnection: Network-on-chip approach [47].
SOC                          Application              Interconnect type
ClearSpeed CSX600 [7]        HPC                      ClearConnect Bus
NetSilicon NET+40 [2]        Networking               Custom bus
Sharp LH7A404 [10]           Networking               AMBA Bus
Intel PXA27x [9]             Mobile / Wireless        PXBus
Matsushita i-Platform [20]   Media                    Internal connect bus
Emulex InSpeed SOC320 [19]   Switching                Crossbar switch
MultiNOC [38]                Multiprocessing system   Network-on-Chip
Network-on-chip approach
A Network-on-chip system consists of a regular array of switches, each containing client logic forming the core module, and its own routing logic. The interconnect scheme is based on a two dimensional mesh topology. All communications between switches are conducted through data packets, routed through the router interface circuit within each tile. Since the interconnections between switches have a fixed distance, interconnect related problems such as wire delay and crosstalk noise are much reduced.
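To make the packet routing concrete, the sketch below traces a packet across a two dimensional mesh. The text does not say which routing algorithm the switches use; dimension-ordered (XY) routing is assumed here purely because it is a simple and widely used choice for meshes. The names `Tile` and `xy_route` are hypothetical.

```python
from typing import List, Tuple

Tile = Tuple[int, int]  # (x, y) position of a switch in the mesh


def xy_route(src: Tile, dst: Tile) -> List[Tile]:
    """Return every tile a packet visits, travelling along x first and then along y."""
    x, y = src
    path = [src]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1  # one hop to the neighbouring switch in x
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1  # one hop to the neighbouring switch in y
        path.append((x, y))
    return path


if __name__ == "__main__":
    route = xy_route((0, 0), (2, 1))
    print(route, "hops:", len(route) - 1)
```

Every hop covers the same fixed distance between neighbouring switches, so the wire delay of a transfer in this model grows with the hop count rather than with a long global wire.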
An Approach for SOC Design
Two important ideas in a design process are figuring out the requirements and specifications, and iterating through different stages of design towards completion.
Requirements and specifications
Requirements and specifications are fundamental concepts in any system design situation. There must be a thorough understanding of both before a design can begin. They are useful at the beginning and at the end of the design process: at the beginning to clarify what needs to be achieved, and at the end as a reference against which the completed design can be evaluated.
The system requirements are the largely externally generated criteria for the system. They may come from competition, from sales insights, from customer requests, from product profitability analysis, or from a combination. Requirements are rarely succinct or definitive of anything about the system. Indeed, requirements can frequently be unrealistic: "I want it fast, I want it cheap, and I want it now!"
It is important for the designer to analyse the stated requirements carefully, and to spend sufficient time understanding the market situation to determine all the factors expressed in the requirements and the priorities those factors imply. Some of the factors the designer considers in determining requirements include:
• Customer requests / complaints
• Sales reports
• Cost analysis
• Competitive equipment analysis
• Trouble reports (reliability) of previous products and competitive products
The designer can also introduce new requirements based on new technology, new ideas or new materials which have not been used in a similar systems environment.
The system specifications are the quantified and prioritized criteria for the target system design. The designer takes the requirements and must produce a succinct and definitive set of statements about the eventual system. The designer may have no idea of what the eventual system will look like, but usually there is some "straw man" design in mind which seems to provide a feasibility framework to the specification. In any good design process, it would be surprising if the final design resembled the "straw man" design.
The specification does not complete any part of the design process; it initializes the process. Now the design can begin with the selection of components and approaches, and the study of alternatives and the optimization of the parts of the system.
Design iteration
Design is always an iterative process. So the obvious question is how to get the very first, initial design. This is the design that we can then iterate through and optimize according to the design criteria. For our purposes we define several types of designs based on the stage of design effort:
Trial design: This is the first design created to simply meet the functional specifications. This design is usually done without great regard to the many performance and cost criteria. However if there is a strong real-time constraint, this design should plausibly satisfy that constraint. The processor or memory or I/O should be sized so that it appears to meet the real-time constraint.
Initial design: After the trial design is complete, the components must be further allocated, defined and parameterized. This process results in a component specification and a corresponding understanding of its expected idealized performance and cost. Idealized does not mean ideal; it refers to a simplified model of the expected area occupied and the computational or data bandwidth capability. It is usually a simple linear model of performance (e.g. the expected MIPS (million instructions per second) rate of a processor).
Optimized design: Once the base performance (or area) requirements are met and the base functionality is ensured, the goal is to minimize the cost (area) and/or the power or the design effort required to complete the design. This is the iterative step of the process. The first steps of this process use higher fidelity tools (simulations, trial layouts, etc.) to ensure that the initial design actually does satisfy the design specifications and requirements. The later steps refine, complete and improve the design according to the design criteria. System performance is limited by the component with the least capability. The other components can usually be modeled as simply presenting a delay to the critical component. In a good design the most expensive component is the one that limits the performance of the system. The system's ability to process transactions should closely follow that of the limiting component. Typically, this is the processor or memory complex.
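The statement that system performance is limited by the component with the least capability can be expressed as a very small model. The sketch below is illustrative only: the component names and the capability figures are invented, and the function name `limiting_component` is hypothetical.

```python
from typing import Dict, Tuple


def limiting_component(capability: Dict[str, float]) -> Tuple[str, float]:
    """Return the least capable component and the system rate it implies."""
    name = min(capability, key=lambda component: capability[component])
    return name, capability[name]


if __name__ == "__main__":
    # Hypothetical idealized capabilities, in millions of transactions per second.
    trial = {"processor": 200.0, "memory": 150.0, "interconnect": 400.0}
    name, rate = limiting_component(trial)
    print(f"{name} limits the system to about {rate} M transactions/s")
```

In this made-up example the memory is the limiting component, so further spending on the processor or the interconnect would not raise the system rate until the memory is improved.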
Usually designs are driven by either (1) a specific real-time requirement, after which functionality and cost become important, or (2) functionality and / or throughput under cost-performance constraints. In case 1 the real-time constraint is provided by I/O consideration, which the processor-memory-interconnect system must meet. The I/O system then determines the performance and any excess capability of the remainder of the system is usually used to add functionality to the system. In case 2 designs the object is to improve task throughput while minimizing the cost. Throughput is limited by the most constrained component, so the designer must fully understand the tradeoffs at that point. There is more flexibility in these designs and correspondingly more options in determining the final design.
The purpose of this book is to provide an approach for determining the trial design and then, by inspecting each system component, taking that design to an initial design. We do this in the next several chapters on a component by component basis. Following that the designer must optimize each component (processor, memory etc). This optimization process requires extensive simulation. We provide access to basic simulation tools through our associated web site.
System Architecture and Complexity
The basic difference between processor architecture and system architecture is that the system adds another layer of complexity and the complexity of these systems limits the cost savings. Historically the notion of a computer was a single processor plus a memory. As long as this notion is fixed (within broad tolerances), implementing that processor on one or more silicon die does not change the design complexity. Once die densities enable a scalar processor to fit on a chip, the complexity issue changes.
Suppose it takes about 100,000 transistors to implement a 32-bit pipelined processor with a small first-level cache. Let this be a processor unit of design complexity.
As long as we need to implement the 100,000 transistor processor, additional transistor density on the die does not much affect design complexity. More transistors per die, while increasing die complexity, simplify the problem of interconnecting the multiple chips that make up the processor. Once the unit processor is implemented on a single die, the design complexity issue changes. As transistor densities significantly improve after this point, there are obvious processor extension strategies to improve performance.
1. Additional cache. Here we add cache storage and, as large caches have slower access times, a second level cache.
2. A more advanced processor. We implement a superscalar or a VLIW processor that executes more than one instruction each cycle. Additionally, we speed up the execution units that affect the critical path delay, especially the floating point execution times.
3. Multiple processors. Now we implement multiple (superscalar) processors and their associated multi-level caches. This leaves us limited only by the memory access times and bandwidth.
The result of the above is a significantly greater design complexity (see Figure 1.23). Instead of the 100,000 transistor processor, our advanced processor has millions of transistors. The multi-level caches are also complex, as is the need to coordinate (synchronize) the multiple processors, since they require a consistent image of the contents of memory.
The obvious way to manage this complexity is to reuse designs. So reusing several simpler processor designs implemented on a die is preferable to a single new more advanced processor. This is especially true if we can select specific processor designs suited to particular parts of an application. For this to work we also need a robust interconnection mechanism to access the various processors and memory.
So when an application is well-specified, the system-on-a-chip approach includes:
1. Multiple (usually) heterogeneous processors, each specialized for specific parts of the application.
2. Main memory with (often) read-only memory for partial program storage.
3. Relatively simple, small (single level) cache structure or buffering schemes associated with each processor.
Even when the system-on-chip approach is technically attractive, it has economic limitations and implications. Given the processor and interconnect complexity, if we limit the usefulness of an implementation to a particular application, we have to either (1) ensure that there is a large market for the product, or (2) find methods for reducing the design cost through design reuse or similar techniques.
Product Economics and Implications for SOC
Factors affecting Product Costs
The basic cost and profitability of a product depend on many factors: its technical appeal, its cost, the market size and the effect the product has on future products. The issue of cost goes well beyond the product's manufacturing cost.
There are fixed and variable costs, as shown in Figure 1.24. Indeed, the engineering costs, frequently the largest of the fixed costs, are expended before any revenue can be realized from sales (Figure 1.25).
Depending on the complexity, designing a new chip requires a development effort of anywhere between 12 and 30 months before the first manufactured unit can be shipped. Even a moderately sized project may require up to 30 or 40 hardware and software engineers, CAD design, and support personnel. For instance, the paper describing the Sony Emotion Engine has 22 authors [37, 39] ! However, their salary and indirect costs might represent only a fraction of the total development cost.
Non-engineering fixed costs include manufacturing startup costs, inventory costs, initial marketing and sales costs, and administrative overhead. The marketing costs include obvious items such as market research, strategic market planning, pricing studies, competitive analysis, etc., as well as sales planning and advertising costs. Administrative (G & A) "overhead" includes a proportional share of the "front office": the executive management, personnel department (human resources), financial office, and other costs.
[Figure: Product cost, divided into manufacturing costs, engineering, and marketing, sales and administration.]
Later, in the beginning of the manufacturing process, unit cost remains high. It is not until many units are shipped that the marginal manufacturing cost can approach the ultimate manufacturing costs.
After this, manufacturing produces units at a cost increasingly approaching the ultimate manufacturing cost. Still, during this time there is a continuing development effort focused on extending the life of the product, and broadening its market applicability.
Will the product make a profit? From the preceding, it is easy to see how sensitive the cost is to the product life and to the number of products shipped. If market forces or competition are aggressive and produce rival systems with expanded performance, the product life may be shortened and fewer units may be delivered than expected. This could be disastrous even if the ultimate manufacturing cost is reached; there may not be enough units to amortize the fixed costs and ensure profit. On the other hand, if competition is not aggressive and the follow-on development team is successful in enhancing the product and continuing its appeal in the marketplace, the product can become one of those jewels in a company's repertoire, bringing fame to the designers and smiles to the stockholders.
Modeling Product Economics and Technology Complexity: the Lesson for SOC
To put all this into perspective, consider a general model of a product's average unit cost (as distinct from its ultimate manufactured cost).
unit cost = project cost/number of units
The product cost is simply the sum of all the fixed and variable costs. We represent the fixed cost as a constant, $K_f$. It is also clear that the variable costs are of the form $K_v \times n$, where n is the number of units. However, there are certain ongoing engineering, sales and marketing costs that are related to n, but not necessarily linearly.
Let us assume that we can represent this effect as a term that starts as 0.1 of $K_f$ and then slowly increases with n, say, as $\sqrt[3]{n}$. So we get:
$$\text{product cost} = K_f + 0.1\,K_f\,\sqrt[3]{n} + K_v\,n \qquad (1.1)$$
We can use Equation 1.1 to illustrate the effects of advancing technology on product design. We compare a design done in (say) 1995 with a more complex 2005 design which has much lower production cost. With $K_f$ fixed, Figure 1.26 shows the expected decrease in unit cost as the volume of 1995 products produced, n, increases. But the figure also shows that, if we increase the fixed costs (more complex designs) tenfold, even if we cut the unit costs ($K_v$) by the same factor, the 2005 unit product costs remain high until much larger volumes are reached. This might not be a problem for a "universal" processor design with a mass market, but it can be a challenge for SOC type designs. These designs are targeted at specific applications, which may have limited production volume.
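The comparison can be reproduced numerically with a short sketch. It assumes the cost model described in the text: a fixed cost K_f, an ongoing-effort term equal to 0.1 K_f times the cube root of n, and a variable cost of K_v per unit. The absolute values of K_f and K_v below are invented for illustration; only their tenfold ratios follow the text.

```python
def product_cost(n: float, k_f: float, k_v: float) -> float:
    """Total cost of producing n units under the assumed model."""
    return k_f + 0.1 * k_f * n ** (1.0 / 3.0) + k_v * n


def unit_cost(n: float, k_f: float, k_v: float) -> float:
    """Average cost per unit: total product cost divided by the volume."""
    return product_cost(n, k_f, k_v) / n


# Hypothetical parameters in arbitrary currency units.
DESIGN_1995 = {"k_f": 1.0e6, "k_v": 100.0}
DESIGN_2005 = {"k_f": 1.0e7, "k_v": 10.0}  # 10x the fixed cost, 1/10 the variable cost

if __name__ == "__main__":
    print(f"{'units':>12} {'1995 unit cost':>16} {'2005 unit cost':>16}")
    for exponent in range(3, 8):
        n = 10.0 ** exponent
        print(f"{n:12.0f} {unit_cost(n, **DESIGN_1995):16.2f} {unit_cost(n, **DESIGN_2005):16.2f}")
```

With these particular numbers the 2005 design only becomes cheaper per unit somewhere between one and two million units, which illustrates why limited production volume is a concern for application-specific SOC designs.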
Dealing with Design Complexity
As design cost and complexity increase, there is a basic tradeoff between design optimization of the physical product and the cost of the design. This is shown in Figure 1.27. The balance point depends on n, the number of units expected to be produced. There are several approaches to the design productivity problem. The most basic approaches are purchasing pre-designed components, and utilizing reconfigurable devices.
Buying Intellectual Property
If the goal is to produce a design optimized in the use of the technology, the fixed costs will be high, so the result must be broadly applicable. The alternative to this is to "reuse" existing designs. These may be sub-optimal for all the nuances of a particular process technology, but the savings in design time and effort can be significant. The purchase of such designs from third parties is referred to as the sale of IP (intellectual property).
The use of IP reduces the risk in design development: it is intended to reduce the design costs and improve time-to-market. The cost of an IP usually depends on the volume. Hence the adoption of an IP approach tends to reduce $K_f$ at the expense of increasing $K_v$ in Equation 1.1.
Specialized SOC designs often use several different types of processors. Non-critical and specialized processors are purchased as IP and integrated into the design. For example, the ARM7TDMI is a popular licensed 32-bit processor or "core" design. Generally, processor cores can be designed and licensed in a number of ways as shown in Table 1.
Clearly, the more optimized designs from the manufacturer are usually less customizable by the user, but they often have better physical, cost-performance tradeoffs. There are potential performance-cost-power overheads in delaying the customization process, since the design procedure and even the product technology itself would have to support user customization. Moreover, customizing a design may also necessitate reverification to ensure its correctness. Current technologies, such as the reconfiguration technology described below, aim to maximize the advantages of late customization such as risk reduction and improvement of time-to-market. At the same time, they aim to minimize the associated disadvantages, for instance by introducing hardwired, nonprogrammable blocks to support common operations such as integer multiplication; such hardwired blocks are more efficient than reconfigurable resources, but they are not as flexible.
Reconfiguration
The term reconfiguration refers to a number of approaches that enable the same circuitry to be reused in many applications. A reconfigurable device can also be thought of as a type of purchased IP in which the cost and risk of fabrication are eliminated, while the support for user customization raises the unit cost. In other words, the adoption of reconfigurable devices tends to reduce $K_f$ at the expense of increasing $K_v$ in Equation 1.1.
The best known example of this approach is FPGA (field programmable gate array) technology. An FPGA consists of a large array of cells. Each cell consists of a small lookup table, a flip flop and perhaps an output selector. The cells are interconnected by programmable connections, enabling flexible routing across the array (Figure 1.28). Any logic function can be implemented on the FPGA by configuring the lookup tables and the interconnections. Since an array can consist of over 100,000 cells, it can easily define a processor. An obvious disadvantage of the FPGA based soft processor implementation is its performance-cost-power tradeoff. The approach has many advantages, however.
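The way a lookup table implements an arbitrary logic function can be shown with a small model. This is a sketch of the idea only, not of any vendor's cell: the flip flop, output selector and routing are omitted, and the names `configure_lut` and `LookupTable` are hypothetical.

```python
from typing import Callable, List


def configure_lut(func: Callable[..., int], k: int) -> List[int]:
    """Build the 2**k configuration bits by evaluating func on every input pattern."""
    return [func(*[(address >> i) & 1 for i in range(k)]) & 1
            for address in range(2 ** k)]


class LookupTable:
    """A k-input lookup table: the inputs form an address into the stored bits."""

    def __init__(self, bits: List[int]) -> None:
        if len(bits) == 0 or len(bits) & (len(bits) - 1):
            raise ValueError("number of configuration bits must be a power of two")
        self.bits = list(bits)
        self.k = len(bits).bit_length() - 1

    def evaluate(self, *inputs: int) -> int:
        if len(inputs) != self.k:
            raise ValueError(f"expected {self.k} inputs")
        # The first input is the least significant address bit.
        address = sum((bit & 1) << i for i, bit in enumerate(inputs))
        return self.bits[address]


# Two 3-input tables configured as a one-bit full adder.
sum_lut = LookupTable(configure_lut(lambda a, b, c: a ^ b ^ c, 3))
carry_lut = LookupTable(configure_lut(lambda a, b, c: (a & b) | (a & c) | (b & c), 3))
```

Changing the function only changes the stored bits, not the circuit, which is why the same array of cells can be reused across many applications.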
Reconfiguration and FPGAs play an important part in efficient SOC design. We shall explore them in more detail in the next chapter.
Conclusions
Building modern processors or targeted application systems is a complex undertaking. The great advantages offered by the technology (hundreds of millions of transistors on a die) come at a price. That price is not the silicon itself, but the enormous design effort that is required to implement and support a design.
In the following chapters, we shall first take a deeper look at the basic tradeoffs in the technology: time, area, power, and reconfigurability. Then we shall look at some of the details that make up the system components: the processor, the cache and the memory.
The remainder of the text focuses on the issues that primarily affect SOC designs. These include the media and graphics applications and their requirements, the bus or switch interconnecting the various processors and memory, and the system evaluation tools.
The goal of the text is to help system designers identify the most efficient design choices, together with the mechanisms to manage the design complexity by exploiting the advances in technology.
Problem Set
1. Suppose the TLB in Figure 1.17 had 256 entries (directly addressed). If the virtual address is 32b, the real memory is 512MB and the page size is 4KB, show the possible layout of a TLB entry. What is the purpose of the user ID in Figure 1.17 and what is the consequence of ignoring it?
2. Discuss possible arrangement of addressing the TLB.
3. Find an actual VLIW instruction format. Describe the layout and the constraints on the program in using the applications in a single instruction.
4. Find an actual vector instruction for vector ADD. Describe the instruction layout. Repeat for vector load and vector store. Is overlapping of vector instruction execution permitted? Explain.
5. For the pipelined processor in Figure 1.7, suppose instruction #3 sets the CC at the end of WB and instruction #4 is the conditional branch.
8. For the pipelined processor in Figure 1.7, assume the cache miss rate is 0.05 per instruction execution, and the total cache miss delay is 20 cycles. For the processor, what is the achievable CPI (cycles per instruction)? (Ignore other delays, such as branch.)
9. Design validation is a very important SOC design consideration. Find several approaches specific to SOC designs. Evaluate each from the perspective of a small SOC vendor.
10. Find (on the Internet) two new VLIW DSPs. Determine the maximum number of operations issued each cycle and the makeup of the operations (number of integer, floating point, branch etc). What is the stated maximum performance (operations per second)? Find out how this number was computed.
11. Find (on the Internet) two new large FPGA parts. Determine the number of logic blocks (CLBs), the minimum cycle time and the maximum allowable power. What soft processors are supported?
