Introduction
The chip industry obeys laws of various kinds: mathematical laws where accurate models can be formulated; physical laws, especially from solid-state physics, obtained by observation and induction; chemical laws pertinent to the manufacturing processes; and the economic and judicial laws that concern such industries. The most famous and most cited law of the chip industry is the one that became known as Moore's law (1). An even older law, also formulated after observing properties of early logic circuitry in computers, is known as Rent's rule:
$$\frac{dT}{dG} \propto \frac{T}{G} \qquad (2)$$
where T is the number of external connections of a part containing G gates. The proportionality constant is called the Rent exponent. Both laws seem to hold surprisingly accurately. Moore's law soon became the ultimate guideline for setting targets in the chip industry. In a sense it has thus become a self-fulfilling prophecy, although it is still remarkable that the industry was able to satisfy such ambitious goals. Rent's rule went through stages of neglect and popularity. A convincing case for the usefulness of such a law came with IBM's need for wire space estimations for gate arrays, as documented in Donath's landmark paper [5]. Both the Moore and the Rent exponents had to be tied to a more specific class of circuits. The recent report [17] of ICE established a Moore exponent of 0.2 for microprocessors and 0.4 for memory (figure 1). Bakoglu [1] showed Rent exponents between 0.12 and 0.63, distinguishing dynamic and static memory, microprocessors, gate arrays and high-speed processors.
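Integrating the differential form of Rent's rule gives a power law, T = t·G^r, where r is the Rent exponent and t is an integration constant that can be read as the number of connections of a single gate. The Python sketch below evaluates that power law for the two extreme exponents quoted from Bakoglu [1]. The value t = 4 and the block size of 10,000 gates are illustrative assumptions, not figures from the text.

```python
def rent_terminals(gates: float, rent_exponent: float,
                   terminals_per_gate: float = 4.0) -> float:
    """External connections T of a block with `gates` gates, from T = t * G**r.

    `terminals_per_gate` (t) is an assumed constant chosen for illustration.
    """
    return terminals_per_gate * gates ** rent_exponent


if __name__ == "__main__":
    # Extreme Rent exponents reported in the text (0.12 and 0.63).
    for r in (0.12, 0.63):
        print(f"r = {r}: T = {rent_terminals(10_000, r):.0f} connections")
```

The spread between the two results shows why the exponent has to be tied to a specific class of circuits before the rule can be used for estimation.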
Confinement

Memory-to-compute ratio
Embedded computer chips exhibit a trend where, with every new generation, an increasing percentage of the chip area is dedicated to memory, while an ever decreasing percentage is dedicated to computational structures.
This observation can be rationalised as follows. It has long been known [7] that a balanced computer system is equipped with an amount of memory that is proportional to the computational power of the processing unit. Gene Amdahl observed that mainframe computers follow the rule of 1 memory byte per instruction per second (i.e. a 10 MIPS CPU would come with 10 Mbytes of RAM).
To see how this rule affects the ratio of computational resources to memory resources on a chip, we note that each new generation of semiconductor process technology reduces the area of both computational and memory structures by a factor A, while increasing the maximum achievable clock frequency of a chip by a factor S. This law points to the conclusion that memory will increasingly dominate the available chip area in the future, while compute logic will necessarily be confined to a small fraction of the available on-chip silicon area. Also, because the compute logic is getting so small and the memories so big, the average wiring distance between the two is becoming relatively large, resulting in increased memory access latency, especially when expressed as the number of equivalent compute cycles. In section 2 we investigate two possible ways to deal with this confining technology trend.
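One way to make the argument concrete: assume that a compute unit and a memory byte both shrink in area by the factor A per generation, that the unit's instruction rate grows by S, and that Amdahl's rule keeps memory capacity proportional to instruction rate. The memory area needed per unit of compute area then grows by S each generation, independent of A. The Python sketch below tabulates the resulting memory share of the chip area. The value S = 1.4 and the starting ratio of 1 are hypothetical, chosen only for illustration.

```python
def memory_area_fraction(generation: int, speedup: float,
                         initial_ratio: float = 1.0) -> float:
    """Fraction of chip area taken by memory after `generation` process steps.

    Assumes the memory-to-compute area ratio is multiplied by `speedup` (S)
    per generation; `initial_ratio` is an assumed starting point.
    """
    ratio = initial_ratio * speedup ** generation
    return ratio / (1.0 + ratio)


if __name__ == "__main__":
    for n in (0, 5, 10, 20):
        share = memory_area_fraction(n, speedup=1.4)
        print(f"generation {n:2d}: memory takes {share:.1%} of the area")
```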
Buffer area
Global wires are defined to be interconnections whose delay can be improved by inserting buffers. It was shown in [11] that the delay then exceeds a critical delay, which is a process constant equal for all wiring layers (if [...]) [...] which still makes a trade-off necessary when determining the cross-section of a wire.
In order to calculate the total area taken by buffers we need to know the wire length distribution of the chip.
Suppose its probability density function is given by P(l); then the buffer area is proportional to
$$N_l \int_{l_{crit}}^{\infty} l\,P(l)\,dl,$$
$N_l$ being the total number of wires.
P(l) is usually obtained by making a model with some simplifying assumptions and requiring that (2) be satisfied. Concise derivations of Weibull distributions (the two-dimensional case is, however, not translation invariant!) and extensive calculations resulting in very long expressions (which is no objection when generating buffer area by computer) have been produced, but there is some agreement that the early result in [6],
$$P(l) \propto l^{2r-3},$$
with r the Rent exponent, captures the essence. Whatever is used, the percentage increase in buffer area is tremendous, not least because buffers become very large for deep submicron circuits.¹
¹ After this tutorial was submitted to ICCAD, another motivation for multilayer paradigms was presented at DAC 2000: S.J. Souri et al., "Multiple Si layer ICs: motivation, performance analysis and design implications" (Proceedings DAC 2000, pp. 213-220). They also make the buffer area argument, and show results where the area is larger when the Rent exponent is smaller. This is strange, but their model is not explained.
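The effect of the Rent exponent on buffer demand can be illustrated numerically. The Python sketch below assumes a power-law wire length distribution, P(l) proportional to l^(2r-3), on integer lengths 1..l_max, and assumes, purely for illustration, that a wire of length l needs floor(l / l_crit) buffers. The values of l_max, l_crit and the two exponents are hypothetical.

```python
def wire_length_pdf(max_length: int, rent_exponent: float) -> list[float]:
    """Normalised P(l) proportional to l**(2r - 3) for l = 1..max_length."""
    weights = [l ** (2.0 * rent_exponent - 3.0) for l in range(1, max_length + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def mean_buffers_per_wire(max_length: int, rent_exponent: float, l_crit: int) -> float:
    """Expected buffers per wire, assuming floor(l / l_crit) buffers on a wire of length l."""
    pdf = wire_length_pdf(max_length, rent_exponent)
    return sum((l // l_crit) * p for l, p in enumerate(pdf, start=1))


if __name__ == "__main__":
    # Hypothetical chip: lengths up to 1000 units, critical length 50 units.
    for r in (0.5, 0.7):
        print(f"r = {r}: {mean_buffers_per_wire(1000, r, 50):.4f} buffers per wire")
```

Multiplying the result by the total number of wires and the area of one buffer gives a total buffer area estimate. Under these assumptions a larger Rent exponent yields more long wires and therefore more buffers.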
Current drive capability
The increase in complexity predicted by Moore and realized by the industry was possible not least because of the increase in current drive capability $I_{D,SAT}/W$ over several technology generations. When feature sizes get very small and voltages scale at a slower rate, the electric field becomes high. At high field strengths the mobility of the carriers can no longer be considered constant, and the dependence of the drift velocity on the electric field thus departs from the linear relationship observed under low-field conditions. A semi-empirical formula for the drift velocity of the charge carriers is proposed in [4]:
The saturation velocity $v_{sat}$ can be considered, in a first approximation, the same for holes and for electrons.
$E_c$ is the critical electric field, and the coefficient $\gamma$ varies with the type of charge carrier: it is close to 1 for holes and close to 2 for electrons.
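The behaviour of such a formula is easy to explore numerically. The Python sketch below assumes the widely used form v(E) = v_sat (E/E_c) / [1 + (E/E_c)^γ]^(1/γ), which matches the symbols defined above but should be checked against [4]. The numeric values for v_sat and E_c are placeholders.

```python
def drift_velocity(field: float, v_sat: float, e_crit: float, gamma: float) -> float:
    """Drift velocity under the assumed model v = v_sat * x / (1 + x**gamma)**(1/gamma).

    x = field / e_crit; the low-field mobility is then v_sat / e_crit.
    """
    x = field / e_crit
    return v_sat * x / (1.0 + x ** gamma) ** (1.0 / gamma)


if __name__ == "__main__":
    V_SAT = 1.0e5    # m/s, placeholder value
    E_CRIT = 1.0e6   # V/m, placeholder value
    for field in (1.0e4, 1.0e6, 1.0e8):
        holes = drift_velocity(field, V_SAT, E_CRIT, gamma=1.0)
        electrons = drift_velocity(field, V_SAT, E_CRIT, gamma=2.0)
        print(f"E = {field:.0e} V/m: holes {holes:.3e} m/s, electrons {electrons:.3e} m/s")
```

At low fields both curves are linear in E; at high fields both approach v_sat, with the γ = 2 curve saturating more abruptly.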
Based on this formula, a general linear dependency between the drain saturation current and the drain-source saturation voltage was derived in [2], a dependency valid for all transistor lengths (L) and for p- as well as n-type transistors.
We recall here some elementary relations: for the drain current under the drift model, and the available mobile charge in the channel:
For the drift velocity we use (9) with $E = dV_{CS}/dy$. The contact-to-source bias $V_{CS}(y)$ at an arbitrary point C in the channel is a monotonically increasing function of y. The solution at the two ends of the channel satisfies the boundary conditions $V_{CS}(0) = 0$ and $V_{CS}(L) = V_{DS}$. Substituting (11) and (9) into (10) gives a differential equation for $V_{CS}(y)$. By separating the variables, this expression can be rewritten into (14), which allows us to find an implicit relation between the drain current and the drain-to-source voltage in the triode region by integrating (14) over the channel length:
Note that F(V, I) depends on V only, since I is uniquely determined by V: I = I(V). When the transistor operates at the border between the triode and saturation regimes, the first derivative of the drain current with respect to $V_{DS}$ equals zero, that is, $dI/dV_{DS} = 0$. If we now differentiate the extreme sides of (15) we get:
We are looking for the curve $\Gamma$ in the I-V plane that contains exactly the points where $dI/dV_{DS} = 0$, and therefore where $\frac{\partial F}{\partial V}(V, I) = 0$. As follows from the definition (15) of F(V, I) we have
This means that on $\Gamma$
So we found that, for the general case, the triode-saturation separation is given by the linear relation:
Expression (19) can be seen as the separation line between the triode and saturation regions, as in figure 5. It illustrates that, no matter how short the channels are, the saturation current per unit width, or current drive, is bounded above by
This maximum achievable current from a transistor does not depend on the channel length L. Consequently, in the quest for higher speed through the relative increase of the current drive by down-scaling of the transistor length, there is an inherent limitation.
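The chain of reasoning in this subsection can be sketched as follows, under stated assumptions. The sketch uses the standard gradual-channel expressions $I = W\,Q(y)\,v(y)$ and $Q(y) = C_{ox}(V_{GT} - V_{CS}(y))$ with $V_{GT} = V_{GS} - V_T$, the velocity model $v = \mu E / [1 + (E/E_c)^{\gamma}]^{1/\gamma}$ with $\mu = v_{sat}/E_c$, and the shorthand $K = W C_{ox} \mu$. These symbols and the exact form of F are assumptions made for illustration, not quotations from [2].

```latex
\begin{align*}
I &= K\,(V_{GT}-V_{CS})\,\frac{E}{\left[1+(E/E_c)^{\gamma}\right]^{1/\gamma}},
  \qquad E=\frac{dV_{CS}}{dy} \\
I\,dy &= \left[K^{\gamma}(V_{GT}-V_{CS})^{\gamma}-(I/E_c)^{\gamma}\right]^{1/\gamma} dV_{CS} \\
I\,L &= \int_{0}^{V_{DS}} \left[K^{\gamma}(V_{GT}-V)^{\gamma}-(I/E_c)^{\gamma}\right]^{1/\gamma} dV
  \;\equiv\; F(V_{DS},I) \\
L\,\frac{dI}{dV_{DS}} &= \frac{\partial F}{\partial V_{DS}}
  +\frac{\partial F}{\partial I}\,\frac{dI}{dV_{DS}}
  \;\;\Rightarrow\;\; \frac{\partial F}{\partial V_{DS}}=0 \text{ on } \Gamma \\
K\,(V_{GT}-V_{DSAT}) &= \frac{I_{DSAT}}{E_c}
  \;\;\Rightarrow\;\; I_{DSAT}=W\,C_{ox}\,v_{sat}\,(V_{GT}-V_{DSAT}) \\
\frac{I_{DSAT}}{W} &\le C_{ox}\,v_{sat}\,(V_{GS}-V_T)
\end{align*}
```

The second line follows by raising the first to the power γ and solving for E. The fifth line is the integrand evaluated at the upper limit, and the last line uses $V_{DSAT} \ge 0$. Neither the separation line nor the bound contains L, in agreement with the statement above.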
The drain current in the triode region is the implicit solution of equation (15). For a p-device the charge carriers in the channel are holes and, as mentioned before, the $\gamma$ coefficient takes values close to 1. In that case an explicit expression for the drain current is easily derived.
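For γ = 1 the integration can be carried out by hand. Assuming a standard gradual-channel model (drain current I = W·Q·v, channel charge Q = C_ox (V_GT - V_CS), velocity v = μE / (1 + E/E_c), with V_GT = V_GS - V_T and v_sat = μ E_c), one obtains I/W = C_ox μ (V_GT V_DS - V_DS²/2) / (L + V_DS/E_c) in the triode region, with saturation at V_DSAT = E_c L (sqrt(1 + 2 V_GT / (E_c L)) - 1). The Python sketch below evaluates the resulting current drive for shrinking channel lengths. The model is an assumption and all device parameters are placeholder values, not data from the text.

```python
import math


def saturation_voltage(v_gt: float, length: float, e_crit: float) -> float:
    """V_DSAT for gamma = 1: the drain voltage at which dI/dV_DS = 0."""
    el = e_crit * length
    return el * (math.sqrt(1.0 + 2.0 * v_gt / el) - 1.0)


def drain_current_per_width(v_ds: float, v_gt: float, length: float,
                            c_ox: float, mobility: float, e_crit: float) -> float:
    """Drain current per unit width (A/m); held constant beyond V_DSAT."""
    v = min(v_ds, saturation_voltage(v_gt, length, e_crit))
    return c_ox * mobility * (v_gt * v - 0.5 * v * v) / (length + v / e_crit)


if __name__ == "__main__":
    # Placeholder parameters in SI units (F/m^2, m^2/Vs, V/m, V).
    C_OX, MOBILITY, E_CRIT, V_GT = 1.0e-2, 1.0e-2, 1.0e7, 1.0
    bound = C_OX * MOBILITY * E_CRIT * V_GT   # equals C_ox * v_sat * V_GT
    for length in (1e-5, 1e-6, 1e-7, 1e-8):
        drive = drain_current_per_width(V_GT, V_GT, length, C_OX, MOBILITY, E_CRIT)
        print(f"L = {length:.0e} m: I_DSAT/W = {drive:.1f} A/m (bound {bound:.1f} A/m)")
```

Shortening the channel raises the current drive, but the gain per step shrinks as the value approaches the length-independent bound.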
For $\gamma \in (1,2]$ it leads to β-functions, and it is better to use numerical software to generate the I-V characteristics of n-devices, as was done for figure 5. This shows that, due to the velocity saturation effect, the current drive no longer improves significantly by scaling the transistor dimensions below a micron. Not only does the current drive improvement saturate, but the capacitive load that a gate has to drive also increases relative to the gate strength (another detrimental effect of the interconnect lateral capacitance).
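As a hedged illustration of such a numerical treatment, the Python sketch below handles γ = 2 under the assumption that the implicit triode relation has the form I·L = ∫ sqrt((K (V_GT - V))² - (I/E_c)²) dV, integrated from 0 to V_DS, with K = C_ox μ per unit width and V_GT = V_GS - V_T. This form follows from a standard gradual-channel model and is not quoted from the text. The integral is evaluated with a midpoint rule and the unknowns are found by bisection; all function names and parameter values are placeholders.

```python
import math


def f_integral(v_ds, current, v_gt, k, e_crit, steps=400):
    """Midpoint-rule value of the assumed F(V_DS, I) for gamma = 2."""
    h = v_ds / steps
    total = 0.0
    for i in range(steps):
        v = (i + 0.5) * h
        inner = (k * (v_gt - v)) ** 2 - (current / e_crit) ** 2
        total += math.sqrt(max(inner, 0.0))   # clamp tiny negative round-off
    return total * h


def bisect(func, lo, hi, iterations=60):
    """Root of an increasing function on [lo, hi] with func(lo) <= 0 <= func(hi)."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if func(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def saturation_voltage(v_gt, length, k, e_crit):
    """Voltage where the triode solution meets the line I = k * e_crit * (v_gt - V)."""
    def residual(v):
        i_edge = k * e_crit * (v_gt - v)
        return f_integral(v, i_edge, v_gt, k, e_crit) - i_edge * length
    return bisect(residual, 0.0, v_gt)


def drain_current(v_ds, v_gt, length, k, e_crit):
    """Drain current per unit width: implicit triode solution, flat in saturation."""
    v_dsat = saturation_voltage(v_gt, length, k, e_crit)
    if v_ds >= v_dsat:
        return k * e_crit * (v_gt - v_dsat)

    def residual(i):
        return i * length - f_integral(v_ds, i, v_gt, k, e_crit)
    return bisect(residual, 0.0, k * e_crit * (v_gt - v_ds))


if __name__ == "__main__":
    K, E_CRIT, V_GT = 1.0e-4, 1.0e7, 1.0      # placeholder values, SI units
    for length in (1e-6, 1e-7, 1e-8):
        i_sat = drain_current(V_GT, V_GT, length, K, E_CRIT)
        print(f"L = {length:.0e} m: I_DSAT/W = {i_sat:.1f} A/m "
              f"(bound {K * E_CRIT * V_GT:.1f} A/m)")
```

Sweeping v_ds from 0 to V_GT for a few gate voltages would produce a family of curves of the kind the text describes for n-devices.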
2 Escape routes
Homogeneous processors
Revisiting (6), we try to derive some of the consequences for future system-on-chip architectures. In this section, we focus on using (6) as a weapon against the increasing design complexity implied by (1) (Moore's law). The issue at stake is that it is becoming increasingly hard to design reliable systems-on-chip with the hundreds of millions of transistors that fit on new chip generations. The problems include high design costs, lack of engineers, slow simulators, and difficulties in managing these very complex design projects. At the same time, globalisation of the economy and bored consumers put increasing pressure on companies to bring new products to market in a very short time.
For these reasons it is of paramount importance to develop a system-on-chip methodology that scales trivially with (1). A simple approach could be to repeatedly place a self-contained computing unit on a chip until the available silicon area fills up. The units are then linked through a high-speed communication network so that the aggregate of compute units can work cooperatively on one or more algorithms.
Traditionally the problem with this type of system architecture is that the compute units must be sufficiently general-purpose, or otherwise the system is not usable in a sufficiently wide range of applications. But general-purpose computing engines often lag several orders of magnitude behind special-purpose hardware in terms of computational efficiency, i.e. speed and power consumption. This is where (6) comes in. Note that compute units in a cluster can have very specific functionality; for example, they could include a complete MPEG-2 video decoder or a 3D graphics rendering engine. Even though such units are expensive by today's measure, according to Postulate 1 we can afford to instantiate them in a cluster because memory will dominate future chip area anyway and therefore compute logic becomes relatively cheap.
Other compute units in a cluster can be more general-purpose; for example, microcontrollers, DSPs and maybe even a few FPGA-like units can be used to implement functions that don't happen to be available as precooked engines in the cluster. Also, the microcontrollers can be used to manipulate control registers of other special-purpose compute engines in the cluster and to set up their input and output streams.
In this way, configuring a cluster for a specific task can be done after chip manufacturing and could in fact be done in the field or at the customer site. The computational efficiency of a cluster can be very high, despite its being field-configurable, because usually most of the work can be handled by one or more dedicated compute engines in the cluster, provided the cluster is truly heterogeneous and covers a wide range of applications.
This then resolves the ever recurring arguments against programmable and configurable systems: that their computational efficiency is at least one and often several orders of magnitude lower than dedicated solutions, resulting in much higher power consumption and lower computing speeds.
It also solves the problem of simulating large systems-on-chip. Because the chips are matched to an application after fabrication, the system functionality can be verified using the actual silicon instead of using HDL simulators that are easily a billion times slower and less accurate than the real thing. Of course, real-time debugging is an important issue.
An interesting consequence of the cooperating heterogeneous multi-purpose clusters is that there is no longer any need for one cluster on a chip to have a different composition than any other cluster on that same chip. Since every cluster is multi-purpose, we assign specific tasks to the clusters based on their communication pattern, i.e. tasks that communicate a lot are assigned to adjacent clusters, or maybe even to the same cluster if enough resources inside that cluster are available. This ability of clusters to efficiently execute a wide range of tasks is therefore very helpful in avoiding long communication latencies and reducing power consumption for inter-cluster communications.
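The assignment rule can be phrased as a cost comparison. The Python sketch below is a hypothetical illustration, not a mapping algorithm from the text: clusters sit on a 2D grid, each pair of tasks has an assumed traffic volume, and the cost of a mapping is traffic multiplied by the Manhattan distance between the clusters hosting the two tasks (zero when they share a cluster). Task names and traffic figures are made up.

```python
from typing import Dict, Tuple

Cluster = Tuple[int, int]   # (column, row) position of a cluster on the chip


def communication_cost(traffic: Dict[Tuple[str, str], float],
                       mapping: Dict[str, Cluster]) -> float:
    """Sum over task pairs of traffic volume times Manhattan distance."""
    cost = 0.0
    for (task_a, task_b), volume in traffic.items():
        (xa, ya), (xb, yb) = mapping[task_a], mapping[task_b]
        cost += volume * (abs(xa - xb) + abs(ya - yb))
    return cost


if __name__ == "__main__":
    # Hypothetical task graph: decode and render exchange most of the data.
    traffic = {("decode", "render"): 100.0, ("render", "display"): 10.0}
    scattered = {"decode": (0, 0), "render": (3, 3), "display": (0, 1)}
    clustered = {"decode": (0, 0), "render": (0, 0), "display": (0, 1)}
    print("scattered:", communication_cost(traffic, scattered))
    print("clustered:", communication_cost(traffic, clustered))
```

Because every cluster is assumed to be able to run every task, a mapper is free to minimise this cost alone, subject only to the resources available inside each cluster.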
Multilayer processors
A different way of dealing with the confinement implied by (6) is to simply put most of the memory in different layers of the chip. In this way the silicon area dedicated to compute logic can scale with (1), escaping the confinement predicted by (6).
When the memory-to-compute ratio passes a certain threshold, a dedicated memory layer is added to the 3D stack. The wires run vertically through the stack and therefore their average length is significantly reduced compared to the 2D case. This is good news, because the execution time of many important applications depends heavily on the memory access latency, i.e. the time it takes to do a round trip from the compute logic to memory and back. In some multilayer technologies the vertical wire density is high enough (i.e. more than one via in 10,000 square feature sizes in 0.1 µm technology, although these vias do not scale very well yet) to enable very wide buses running between layers. This means that very high bandwidths can be sustained between the compute layer and the memory layers. This of course is vitally important, and in combination with short latency provides an excellent memory subsystem with very good performance characteristics.
In [12] a study is presented that compares a multilayer implementation of a RISC processor to a conventional implementation. The conclusion is that a multilayer implementation can benefit significantly from the low-latency, high-bandwidth connection to the first-level and second-level caches. In [3] an analysis is presented showing that many of the techniques used to tolerate growing memory latencies do so at the expense of increased bandwidth requirements. Clearly, a multilayer microprocessor implementation that improves both latency and bandwidth can significantly relax the off-chip bandwidth requirements, resulting in lower pin counts and cheaper packages.
Although these studies focus on microprocessor implementations rather than complete systems-on-chip, the same arguments apply to much more complex architectures like the homogeneous multiprocessor presented in section 2.1. In this case the first-level caches and local scratch memories could be allocated to a layer on top of the actual compute layer. On top of the caches and scratch memories, one or more layers can be stacked with DRAM, as dictated by the law in (6).
Liberation
Filling the layers
In an early paper [9] it was already stated that adding wiring layers could not reduce the essential interconnect complexity of circuit integration. The author also suggested that flashing the clock on the chip would not only be a temporary relief, but would also solve skew problems. With an active layer on top, not only would this be feasible, but selectivity with respect to particular clock phases also comes within reach. The same layer can be used for housing the buffers to speed up global interconnect, as suggested in [10]. Optimal buffering then depends on the properties of the top layers: the critical delay ($\tau_{crit}$) depends on the active components, while the optimum segmentation ($l_{crit}$) depends also on the properties of the interconnect in the global tier. Supply line shielding yields reliable interconnect characterization. This increases the line capacitance, and consequently the buffer area in the top layer and the power consumption of the global communication.
The processors, each with their own instruction and data caches, fill up the next layer in a regular formation, but each is individualized to perform its assigned operations efficiently. Four wiring layers, the global tier with segmentation and buffering and a tier for more local interconnect, are in between. The processing layer without doubt produces the most heat. Experience reported in the literature makes clear that this is not a major problem [14]*. The layer still suffers under (6), but access times to memory on other layers are certainly improved.
The other two layers are dedicated to memory: one to second-level cache and interface electronics for controlling main memory, and one to main memory. The latter is the base active layer, made in the most advanced technology, using aggressive design rules.
*The heat simulation in [12] is also of a four-layer processor, but the layers are not specified. Any different ordering in our case would only increase the problem, if any.
The supporting technology
For more than twenty years chip technology research has worked on so-called three-dimensional integration. However, over this period Moore's prediction could be fulfilled without having to break free from the single-active-layer confinement. In section 1 we discussed but three fundamental reasons why, in the near future, chips with a single active layer and conventional formation of the active components cannot maintain the growth in functionality and performance of the past decades. In section 3.1 an advantageous usage of four active layers has been outlined. The question whether this is economically justified, or even technologically feasible, was not touched. Several research groups have shown fabrication technologies for producing chips with active components outside the base active layer. Roughly, they can be classified as growing and stacking techniques. In the first category we find most of the early true integration solutions: recrystallization, layer growth and seeding. Their major disadvantage is that the base layer has to undergo all those additional process cycles of heating and cooling, which will degrade the properties of the components in that layer. In the proposal of section 3.1 this is the most sensitive layer, produced with extreme aggressiveness. This is clearly unacceptable. Recently low-temperature technologies for adding components outside the base layer have been published, but they are still far from "manufacturability".
Stacking implies the separate fabrication of active layers, later to be combined with each other. This has the obvious advantage of much improved control over the properties of the components. The individual layers do not even have to be produced in the same technology. One of the first multilayer processors was made by transferring an SOI film on top of a bulk-silicon CMOS chip [16]. There is also no obvious limit to how many layers can be stacked by such a process. The main disadvantage is the need to align the layers with respect to each other. The same exercise used vias of 6 µm on each side, and scalability was not expected soon.
But a number of advantages were easily recognized:
1. Interconnection lengths were considerably shorter, which in their case required proper partitioning. Folding the datapath over more layers and determining the optimum crossing points can shorten cycle time considerably.
2. The total footprint was of course much smaller, which is beneficial for yield and/or allows larger chips.
3. As mentioned, different technologies can quite easily be realized on the same chip as long as they allow contact vias on both sides. The quality of components in one technology is not compromised by favoring the quality of components in another technology. In the proposal optical receivers were included. Although buffers were planned in the same layer, their properties are not very critical.
Key remains the trade-off between via size and accurate alignment. Vias are expected to be big, requiring quite a bit of area overhead. The alignment requirements will impose strong geometrical constraints on the layout of the individual layers. In [16] one layer was made the dictator; in the dedicated-layer proposal, the enforced regularity of all but the top layer forces the placement.
Heat is not expected to be a problem for multilayer chips. In the proposal the heat generators are the top two layers, and all layers were targeted for bulk silicon processing. If several layers of SOI technology are used, overheating might occur and should be investigated. In general, according to the Wiedemann-Franz relation, good electrical conductors are good thermal conductors, but layers cannot be completely shielded by electrically conducting layers.
The supporting computer-aided design
Obviously, the escape routes proposed in section 2 require a completely different design flow. Homogeneous processors do not benefit much from parts of a traditional flow. The emphasis should be more on modeling applications as networks of communicating processes in a suitable specification language [8]. Equally important is the reuse of specification software, considering the short life spans of integrated circuits and the demand for short paths to the market.
General multilayer designs require completely new layout synthesis tools. Placement is obsolete ("modern placement is floorplan design plus legalization!") and even floorplan design for each layer is not adequate because of the strong geometrical constraints. Wire planning will be more of a must, but has to acquire a more precise meaning in this application.
