today's systems, many applications demand still greater computing power and speed. Such relentless demands may be largely due to the apparent ease with which systems designers have achieved current performance levels. After all, vendors roll out new systems on what seems to be a regular basis. Each new system is more powerful than the last, and frequently the rollouts are scant months apart. Appearances aside, however, it takes time-consuming design iterations, followed by performance/cost evaluation, before a system design is implemented.
In this article, I compare three costs (silicon area, substrate size, and power consumption) for three leading chip technologies: VLSI (very large-scale integration), MCM (multichip module), and WSI (wafer-scale integration). See the "Definitions" box for more information.
For discussion purposes I based my comparison on published data for two application case studies: a stand-alone computer and a DSP. I don't examine packaging costs because these vary considerably from one application to another; furthermore, packaging is not a direct result of the chip technology selection.
Computer case study
Computer fabrication is a good way to illustrate the cost estimation process. The computer in my design comparison is the Hitachi SH7600, 1 which is a 32-bit RISC processor that performs 16 MIPS (at a clock rate of 20 MHz). It has 1 million 32-bit words of DRAM and an interface unit, interconnected as shown in Figure 1. Although these parameters are somewhat dated, they allowed use of published chip information, which was an important consideration. The design is similar to an ultra-large-scale integrated system consisting of 44 Mbytes of DRAM, 384 Kbytes of SRAM, and an 18,000-gate array. 2 The CPU chip size is 8.7 mm × 9.5 mm, which includes 0.5 mm on each edge for pads (the core area is 7.7 mm × 8.5 mm). The memory consists of eight 1M-word × 4-bit dynamic RAM chips. 3 Each of the RAM chips is 4.8 mm × 11.1 mm, which includes 0.5 mm on each edge for pads. The interface is based on a preliminary design of an interface control unit for heterogeneous networks that measures 4.9 mm × 8 mm, including 0.5 mm on each edge for pads.
Silicon area. In determining the silicon area required by the different technologies, I used theoretical models.
VLSI. The VLSI implementation of the computer uses 10 cells (one central processing unit, eight RAMs, and one interface control unit), as the floorplan in Figure 2 shows. The layout is an incomplete 6 × 2 array of RAM cells with the CPU and ICU cells in the four spaces where RAM cells are omitted. The RAM array is 20.2 mm × 22.8 mm. The chip pad area increases the final size to 21.2 mm × 23.8 mm. I assumed the defect density to be 0.1 defect per cm², which gives an expected yield of 60% for this circuit. With these values, the silicon area, SA_v, is 8.4 cm².
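The VLSI figures above can be reproduced with a short calculation. The sketch below applies the negative-exponential yield model from the "Silicon area cost model" box to the quoted chip dimensions and defect density; the variable names are illustrative and are not taken from the article.

```python
import math

# Figures quoted in the text for the VLSI computer chip.
array_w_mm, array_h_mm = 20.2, 22.8   # cell array dimensions
pad_mm = 0.5                          # pad ring added on each edge
defect_density = 0.1                  # defects per cm^2

# Add the pad ring on both sides of each dimension.
chip_w_mm = array_w_mm + 2 * pad_mm
chip_h_mm = array_h_mm + 2 * pad_mm
chip_area_cm2 = chip_w_mm * chip_h_mm / 100.0   # 100 mm^2 = 1 cm^2

# Negative-exponential yield model: Y = exp(-D0 * A).
chip_yield = math.exp(-defect_density * chip_area_cm2)

# Silicon that must be fabricated per good chip.
silicon_area_cm2 = chip_area_cm2 / chip_yield

print(f"chip size: {chip_w_mm:.1f} mm x {chip_h_mm:.1f} mm")
print(f"yield: {chip_yield:.1%}, silicon area: {silicon_area_cm2:.1f} cm^2")
```

Under these inputs the model gives a yield of about 60% and a silicon area of about 8.4 cm², in line with the values quoted above.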
MCM. An MCM realization of this computer uses 10 dies, placed as shown in Figure 3. The floorplan differs somewhat from the VLSI floorplan because of the 0.5-mm border around each die for bonding pads and the 1.5-mm space between adjacent dies.
WSI. Figure 4 shows the WSI floorplan. It uses pairs of cells for the CPU and the ICU. This is one-from-two pooled sparing, which simply means, for example, that there are two CPUs, only one of which needs to be good. The yield of a single CPU cell is 93.6%. The probability that at least one of the two cells in the CPU pool is good is 99.6%. The pool's yield is the product of the probability that at least one cell is good and the yield of the interconnection and selection logic (which I assumed to be 99% because of its small size and simplicity). This gives a 98.6% yield for the CPU pool.
For the ICU, I computed the yield of the two-cell pool by calculating the probability that at least one of the two cells is good (99.8%). I multiplied that by the yield of the interconnection and selection logic (99%), thereby arriving at a 98.8% yield for the ICU pool.
For the RAM, I computed the yield of the nine-cell pool by calculating the probability that at least eight of the cells are good (95.7%). I multiplied that by the yield of the interconnection and selection logic (which I assumed to be 95% because this logic is larger and more complex than that of the CPU and ICU pools). This gave a 90.9% yield for the RAM pool.
The WSI substrate is 21.2 mm × 31.4 mm, for an area of 6.7 cm². Dividing this wafer area by the expected wafer yield (the product of the CPU, ICU, and RAM pool yields, or 88.6%), I determined SA_w to be 7.5 cm². For details, see the box, "Substrate size cost model."
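The pooled-sparing arithmetic can be sketched with a binomial "at least k good out of n" calculation. This is a minimal sketch under stated assumptions: cell failures are treated as independent, the 93.6% figure is read as the yield of a single CPU cell, the RAM cell yield is computed from an assumed 3.8 mm × 10.1 mm core (the 4.8 mm × 11.1 mm chip less its pads) at 0.1 defect per cm², and the ICU pool yield is taken directly from the text. The function and variable names are hypothetical.

```python
import math

def at_least_k_good(n, k, p):
    """Probability that at least k of n independent cells are good,
    where each cell is good with probability p."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(k, n + 1))

# CPU pool: one-from-two sparing, single-cell yield read from the text.
cpu_any = at_least_k_good(2, 1, 0.936)

# RAM pool: eight-from-nine sparing. Cell yield is an assumption:
# negative-exponential model on the 3.8 mm x 10.1 mm core area.
ram_cell_yield = math.exp(-0.1 * (3.8 * 10.1) / 100.0)
ram_any = at_least_k_good(9, 8, ram_cell_yield)

# Multiply by the assumed interconnection/selection logic yields.
cpu_pool = cpu_any * 0.99
ram_pool = ram_any * 0.95
icu_pool = 0.988            # quoted in the text, not recomputed here

wafer_yield = cpu_pool * icu_pool * ram_pool
silicon_area_cm2 = 6.7 / wafer_yield    # quoted wafer area / wafer yield

print(f"CPU pool {cpu_pool:.1%}, RAM pool {ram_pool:.1%}")
print(f"wafer yield {wafer_yield:.1%}, SA_w about {silicon_area_cm2:.1f} cm^2")
```

With these inputs the sketch lands on roughly the pool and wafer yields quoted above. The silicon area differs slightly from the quoted 7.5 cm² because the wafer area is used here in its rounded form.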
Power consumption. Power consumption is a particularly relevant cost measure in this era of portable and battery-operated systems. It is, however, difficult to estimate because of the many contributing variables. For static CMOS technology, the buffer power consumption is proportional to the product of the load capacitance and the frequency. For MCM implementations, the inter-die driver power consumption is the primary penalty relative to VLSI. With WSI, the inputs to the redundant circuits would normally be disabled so that they consume only quiescent power, which is generally negligible.
Silicon area cost model
A chip's manufacturing cost is proportional to the silicon area; therefore, our first cost measure is the silicon area that must be fabricated to produce a functional chip.
With a VLSI approach, the silicon area, SA_v, is the chip area divided by the yield, Y_v. The chip area is the sum of the core area of the chip, A_c, and the pad area, A_p. A simple model 1 for the chip yield as a function of the process defect density, D_0, and the chip area, A_c + A_p, is
$$Y_v = e^{-D_0 (A_c + A_p)}$$
Other models are possible, 1 but the negative exponential is a reasonable first approximation. The effect of the chip layout and floorplan on the yield is ignored in this first-order analysis, although they have been shown to be important for some cases. 2
With an MCM approach, the silicon area, SA_m, is the sum of the silicon areas of the constituent dies. The silicon area of each die is the die area, including the required pad area, divided by the bonded die yield. The bonded die yield is the product of the die yield and the bonding yield (the latter is assumed to be approximately the same for each chip).
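The MCM model can be sketched in a few lines. The inputs below are borrowed from the DSP case study later in the article (34 dies, a 97.7% die yield, a 95% bonding yield, and a 0.23 cm² die area). Treating the die as 4.8 mm on a side and reusing the 0.1 defect per cm² density from the computer study are assumptions, as are the function names. The total is what this simplified model gives under those assumptions, not a value quoted from the article's tables.

```python
import math

def die_yield(area_cm2, defect_density):
    """Negative-exponential die yield: exp(-D0 * A)."""
    return math.exp(-defect_density * area_cm2)

def mcm_silicon_area(die_areas_cm2, defect_density, bonding_yield):
    """Sum over dies of (die area incl. pads) / (die yield * bonding yield)."""
    total = 0.0
    for area in die_areas_cm2:
        bonded_yield = die_yield(area, defect_density) * bonding_yield
        total += area / bonded_yield
    return total

DEFECT_DENSITY = 0.1          # defects per cm^2 (assumed for the DSP dies)
BONDING_YIELD = 0.95          # quoted in the DSP case study
die_area_cm2 = 4.8 ** 2 / 100.0   # 4.8 mm x 4.8 mm die, in cm^2 (assumed shape)

single_die_yield = die_yield(die_area_cm2, DEFECT_DENSITY)
bonded = single_die_yield * BONDING_YIELD
sa_m = mcm_silicon_area([die_area_cm2] * 34, DEFECT_DENSITY, BONDING_YIELD)

print(f"die yield {single_die_yield:.1%}, bonded yield {bonded:.1%}")
print(f"SA_m for 34 dies: {sa_m:.2f} cm^2")
```

The die and bonded yields come out at about 97.7% and 92.8%, which agrees with the DSP figures and suggests that the same defect density was used there.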
With a WSI approach, the silicon area, SA_w, is the chip area divided by the wafer yield. Here the chip area is the sum of the wafer pad area and the area of the wafer (the sum of the core areas). Each core area is increased by a redundancy factor.
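One way to write the three silicon area models in symbols is sketched below. The die index i, the bonding yield Y_b, the redundancy factor R_i, the wafer pad area A_{p,w}, and the wafer yield Y_w are notational assumptions, since the text describes these quantities only in words.

```latex
\begin{align}
\mathit{SA}_v &= \frac{A_c + A_p}{Y_v}, \\
\mathit{SA}_m &= \sum_i \frac{A_{c,i} + A_{p,i}}{Y_i \, Y_b}, \\
\mathit{SA}_w &= \frac{A_{p,w} + \sum_i R_i \, A_{c,i}}{Y_w}.
\end{align}
```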
Model comparison shows that relative to the VLSI implementation, the MCM approach incurs an additive area penalty for the pads required by each die, while the WSI implementation incurs a multiplicative redundancy factor penalty.
To compare the power consumption, I ignored the internal consumption of the CPU, ICU, and RAM cells. I could ignore it because the power consumed by chip output buffers, not the internal consumption, is the primary difference between the various approaches.
I assumed that the output buffer consumption is proportional to the product of the load capacitance times the frequency. Table 2 lists the loadings and frequencies. The power depends on the actual loading, not the maximum load that the buffer is designed to drive.
On the VLSI and WSI chips, I assumed point-to-point connections to present a 0.5-pF load. As the address bus lines have eight destinations, I assumed them to present a 4-pF load. For the MCM implementation, I assumed the point-to-point connections and the address bus lines to present 5-pF and 40-pF loads, respectively. In all three cases, the 32 outputs (with 50-pF loading) dominate. The resulting buffer power is P_v = 0.47 W for VLSI, P_m = 1.1 W for MCM, and P_w = 0.47 W for WSI.
Comparison results. Table 3 summarizes my findings for the three approaches to the design of a stand-alone computer. The MCM implementation has the smallest silicon area, and as a result, it should cost less in mass production to produce the required silicon chips than either of the other approaches. The VLSI chip is the smallest, so it should cost less to package and will weigh less than the other two approaches. The VLSI and WSI implementations use about half the power of the MCM.
Substrate size cost model
The VLSI implementation size is the sum of the chip pad area and the sum of the cell areas.
The MCM implementation size is the sum of the module pad area, the die areas (which include the pad areas), and the spacing between dies. Designers use the spacing between dies to route signals, clocks, and power. I assume here that the inter-die spacing is three times the pads' width.
Similarly, the WSI implementation size is equal to that of the VLSI design multiplied by the redundancy factor.
Definitions
VLSI: A very large-scale integrated circuit is a single IC without redundancy.
MCM: A multichip module is an assemblage of multiple dies on a ceramic or silicon substrate.
WSI: A wafer-scale integrated circuit includes redundancy to allow correct operation even if the chip contains faults as manufactured.
From these definitions, then, we can see that VLSI requires the fabrication of large fault-free chips; MCM provides many small dies that are assembled to create the MCM; and WSI uses redundancy to provide chips that produce correct results even though they may contain some faults.
There is some overlap in these definitions: Current large DRAM chips include extra rows and columns for fault circumvention. These are best viewed as VLSI, however, because only a small amount of circuitry is redundant. 
DSP case study
To compare VLSI, MCM, and WSI technologies in the context of a high-speed DSP, I selected the design of a radix-4 fast Fourier transform butterfly processing element. This element is a building block for a high-performance pipeline FFT processor. 4 It consists of three complex multipliers and eight complex adders, interconnected as Figure 5 shows. A complex multiplier implementation requires four real multipliers and two real adders; a complex adder's implementation requires two real adders. As a result, the radix-4 element requires a total of 12 real multipliers and 22 real adders.
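The element count follows directly from the decomposition just described. The short sketch below spells out the arithmetic; the constant names are illustrative.

```python
# Composition of the radix-4 butterfly, as stated in the text.
COMPLEX_MULTIPLIERS = 3
COMPLEX_ADDERS = 8

# Real arithmetic units needed per complex unit.
REAL_MULTS_PER_COMPLEX_MULT = 4
REAL_ADDS_PER_COMPLEX_MULT = 2
REAL_ADDS_PER_COMPLEX_ADD = 2

real_multipliers = COMPLEX_MULTIPLIERS * REAL_MULTS_PER_COMPLEX_MULT
real_adders = (COMPLEX_MULTIPLIERS * REAL_ADDS_PER_COMPLEX_MULT
               + COMPLEX_ADDERS * REAL_ADDS_PER_COMPLEX_ADD)
total_cells = real_multipliers + real_adders

print(real_multipliers, real_adders, total_cells)
```

The total of 34 arithmetic cells is the figure used for the floorplans in the rest of this case study.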
Silicon area. I compared VLSI, MCM, and WSI approaches to the DSP in a manner similar to what I did for the stand-alone computer.
VLSI. As described, the VLSI implementation of the radix-4 FFT butterfly processing element uses 12 multipliers and 22 adders. My comparison assumed an aggressive 0.6-micron fabrication process with three metal layers. Consequently, the core area of either a 32-bit floating-point adder or a 32-bit floating-point multiplier is approximately 3.6 mm × 3.6 mm. As Figure 6 shows, placing the required 34 cells in a 6 × 6 array leaves two unoccupied positions (these could hold sine/cosine ROMs and clock drivers). The 6 × 6 array is 21.6 mm × 21.6 mm.
MCM. The MCM realization uses 34 arithmetic chips, which can be mounted in a 6 × 6 array, as Figure 7 shows. The multiplier and adder chips are comparably sized. Adding the pad area, which an MCM implementation requires, increases the die size to 4.8 mm × 4.8 mm, for which the yield is 97.7%. With a bonding yield of 95%, the bonded yield is 92.8%. The die area is 0.23 cm².
WSI. The WSI realization uses six pools of cells (floating-point adders or multipliers) for the 34 needed arithmetic elements, as Figure 8 shows. Pools 1, 2, and 3 are four-from-five pools of multipliers that provide the real multipliers needed to implement the three complex multipliers. Pool 4 is a six-from-seven pool of adders that provides the six adders needed for the three complex multipliers. Finally, pools 5 and 6 are eight-from-nine pools of adders. Each provides the eight adders needed for the four complex adders in one of the last two stages of the radix-4 butterfly processing element. The 40 cells are laid out in a 6 × 7 array with two positions unoccupied, as shown in Figure 8. Table 4 shows the yield of the six pools. In each case the yield of the sparing and interconnection logic (assumed to be 95%) chiefly determines the total pool yield. I divided the wafer area (5.9 cm²) by the wafer yield (the product of the yields of the six pools, or 69.3%) and determined the silicon area, SA_w, to be 8.5 cm².
Power consumption. Each of the 34 arithmetic elements produces a 32-bit result. Eight of these 32-bit results are butterfly outputs that exit the module, and I assumed these to drive a 50-pF load. For either the VLSI or the WSI implementation, I assumed the 26 elements that drive other cells to drive a 0.5-pF load; the total power required for the buffers is 8.3 W. For the MCM, I assumed the 26 elements that drive other chips within the module to drive a 5-pF load; the total power required for the MCM buffers is 10.6 W.
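Because buffer power is taken to be proportional to load capacitance times frequency, the ratio of the two power figures should track the ratio of total driven capacitance. The sketch below checks this. It assumes every output bit switches at the same frequency and supply voltage in all implementations, which the text does not state, and its names are illustrative.

```python
BITS_PER_RESULT = 32
EXTERNAL_ELEMENTS = 8       # butterfly outputs that leave the module
INTERNAL_ELEMENTS = 26      # results that drive other cells or chips
EXTERNAL_LOAD_PF = 50.0

def total_load_pf(internal_load_pf):
    """Total capacitance (pF) driven by all output buffers."""
    per_bit = (EXTERNAL_ELEMENTS * EXTERNAL_LOAD_PF
               + INTERNAL_ELEMENTS * internal_load_pf)
    return BITS_PER_RESULT * per_bit

on_chip_load = total_load_pf(0.5)   # VLSI or WSI: 0.5-pF internal loads
mcm_load = total_load_pf(5.0)       # MCM: 5-pF inter-die loads

load_ratio = on_chip_load / mcm_load
quoted_power_ratio = 8.3 / 10.6     # power figures quoted in the text

print(f"load ratio {load_ratio:.3f}, quoted power ratio {quoted_power_ratio:.3f}")
```

The two ratios agree to within about half a percentage point, so the quoted power figures are consistent with a purely capacitance-proportional model.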
Comparison. Table 5 summarizes my findings for the three approaches to the design of a butterfly processing element in the DSP case study. The MCM implementation uses less silicon, and as a result, its chips should cost less in mass production than those of either of the other approaches. The VLSI chip is the smallest, so it should cost less to package and will weigh less than the other two approaches. The VLSI and WSI implementations use about 20% less power than the MCM.
I DEALT WITH A COMPARISON of VLSI, MCM, and WSI technologies on the basis of silicon area, substrate size, and power consumption for two example applications: a general-purpose computer and a specialized DSP. An important extension to this work concerns packaging. Packaging affects both performance and cost. Performance effects result from the length and the parasitics of lines from the chips to the board. Cost effects depend on the size and type of the substrate. Packaging is likely to have the greatest negative impact on the MCM (since it must include a multilayer substrate) and least impact on the VLSI approach. Another important issue for MCM is the availability of "known-good dies." To the extent that good dies are available, testing and rework costs decrease markedly.
Earl E. Swartzlander, Jr. is a professor of electrical and computer engineering at the University of Texas at Austin where he holds the Schlumberger Centennial Chair in Engineering. His research interests are in application-specific processing and the interaction between computer architecture, computer arithmetic, and VLSI technology. Swartzlander received a BS from Purdue University, an MS from the University of Colorado, and a PhD from the University of Southern California. He is an IEEE fellow and an Outstanding Electrical Engineer and a Distinguished Engineering Alumnus of Purdue University. He has also received the Distinguished Engineering Alumnus Award from the University of Colorado.
Contact the author at Dept. of Electrical and Computer Engineering, Univ. of Texas, Austin, TX 78712; e.swartzlander@comp-mail.com. 
