Based on special pipelining techniques, a new methodology for increasing the clock frequency and communication speed in monolithic-WSI systems is proposed. SPICE simulations show that wafer scale systems implemented in a 1.2 micron CMOS technology can be clocked well above 140MHz, which is approximately five times the maximum frequency of current systems [1]. It is also shown that pipelining principles can be applied to communication links. This strategy speeds up communication transfers on 5cm interconnection wires, such as those running across a wafer, by a factor between two and ten compared to the case in which no pipelining is used.
I. Introduction
Recent advances in monolithic-WSI technology [1,2] have demonstrated that clock and signal distribution can severely limit WSI system performance. In the case of WASP wafers, for instance, it has been shown [1] that the maximum frequency of the global clock should be limited to 40MHz, because of the skew between the two ends of a 5cm wire running across the wafer. This problem was anticipated at an early stage of the HiBus project, and a methodology minimizing the negative effects of these delays has been developed. The methodology combines three complementary strategies to improve the system speed: the fragmentation of signal wires, pipeline clocking, and signal retiming. The next sections demonstrate that clock frequencies much larger than 40MHz can be distributed on monolithic-WSI systems. Indeed, SPICE simulations of a simple and unoptimized design show that WSI systems such as WASP, implemented using this methodology, could operate at frequencies well above 140MHz, moving the performance bottleneck to the processors themselves. It is also shown that this capability can be leveraged to speed up communication channels.
II. High-speed signal propagation

A-Pipeline clocking
Pipeline clocking, first introduced by Fisher and Kung [3], consists of inserting some number of inverters or buffers along a VLSI wire (Fig. 1). This allows multiple clock edges to travel on the same line at once, and therefore significantly increases the maximum frequency that can be propagated on a long metal wire. Moreover, this maximum frequency is independent of the wire length. In order to obtain an estimate of the maximum frequency that could be achieved on a 5cm wire similar to the one used on the WASP wafer, SPICE simulations were carried out using the distributed line parameters of monolithic-WSI mentioned in [1] (line capacitance C,=30 fF/mm and line resistance R,=50 ohms/"). Moreover, a typical 1.2 micron CMOS process technology [4] was considered. Fig. 2 presents the result obtained when five 10:1 (L/W=1/10) inverters were inserted along the line, and when minimum size detecting inverters (modeling a distributed load) were connected as shown in Fig. 3. Note that the 10:1 inverters are separated by 1cm segments, since this corresponds approximately to the dimension of a WASP processor (ASP). The signal frequency was increased up to the point where a slight signal degradation could be observed. As can be seen in Fig. 2, a 155MHz signal can be easily propagated. No particular attempt has been made to optimize this arrangement.
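The length independence claimed above can be illustrated with a rough wire-only estimate. The sketch below is not taken from the SPICE runs: it uses the textbook 0.38·R·C approximation for the 50% delay of a distributed RC line, ignores driver and load, and assumes that the line resistance is expressed in ohms/mm. The names `wire_delay` and `segmented_wire_delay` are hypothetical helpers introduced for illustration.

```python
# Wire-only delay estimate for a distributed RC line (driver and load ignored).
# ASSUMPTIONS: the resistance unit is taken to be ohms/mm, and the
# 0.38*R*C 50%-delay approximation is a textbook estimate, not a SPICE result.

C_PER_MM = 30e-15   # line capacitance, F/mm (value quoted in the text)
R_PER_MM = 50.0     # line resistance, ohms/mm (unit is an assumption)


def wire_delay(length_mm: float) -> float:
    """Approximate 50% delay (seconds) of an unbuffered distributed RC wire."""
    r_total = R_PER_MM * length_mm
    c_total = C_PER_MM * length_mm
    return 0.38 * r_total * c_total


def segmented_wire_delay(length_mm: float, segments: int) -> float:
    """Sum of the wire delays when the line is cut into equal buffered segments.

    The delay of the inserted inverters themselves is deliberately excluded.
    """
    return segments * wire_delay(length_mm / segments)


if __name__ == "__main__":
    print(f"unbuffered 5 cm wire: {wire_delay(50.0) * 1e9:.3f} ns")
    print(f"five 1 cm segments  : {segmented_wire_delay(50.0, 5) * 1e9:.3f} ns")
    print(f"one 1 cm segment    : {wire_delay(10.0) * 1e9:.3f} ns")
```

Under these assumptions the delay of one segment depends only on the segment length, so adding more 1cm segments lengthens the line without changing the delay that bounds the clock period.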
B-Wire fragmentation
Wire fragmentation, introduced in a previous study from our team [5], is an extension of the pipeline clocking strategy. The basic idea of the technique is to insert flip-flop components along a communication line, so that the line itself becomes a communication pipeline (Fig. 4). The benefits of this technique are two-fold. Firstly, by breaking the communication line into smaller segments, the propagation delay of the signal through each segment becomes much smaller. Note that for wires whose line resistance cannot be neglected, the propagation delay can be proportional to the square of the wire length [6]. Secondly, converting the communication line into a pipeline allows data transfers to operate at the speed of a single stage of the pipeline, provided that the pipeline is full. Of course, latency is incurred when filling up the pipeline, but the overall improvement in the time required to transmit several bytes far outweighs this negative effect. Based on the magnitude of the propagation delays of data and clock signals (skew) as a function of the segment length, one can determine the optimum number of stages of the pipeline. Indeed, the total transmission time T(M,N) of an M-byte message through an N-stage, 8-bit wide pipeline is equal to:
T(M,N) = ... (1)
where the parameters include the wire thickness, the underlying oxide thickness (Eox), the size ratio between successive stages of an exponential driver, and the time required by a minimal inverter to charge another minimal inverter.
One can compute the minimum transmission time by solving dT(M,N)/dN=0. Fig. 5 presents the transmission times obtained when considering a 5cm 8-bit bus implemented in a 1.2 micron CMOS technology through which a 64-byte message is transmitted. More details and discussions on the various aspects of this technique can be found in [5]. Note that the flip-flop components were assumed to have a response time of 5ns. This has been determined from a prototype design of this component (shown in Fig. 6). As can be seen in Fig. 5, the minimum message transmission time occurs when there are 12 stages (imposing 12 latency cycles), and it is 2.1 times smaller than the time that would have been required if the wire had not been converted into a pipeline (that is, when N=1 in Fig. 5). One could select a smaller number of stages if factors such as area, complexity and power costs were considered in the analysis. For a 12-stage pipeline, one can obtain a speedup of at least 2.1 by using a communication pipeline running at approximately 100MHz (obtained by adding the response time of the flip-flop to a conservative estimate of the delay through one of the 12 segments, that is 1/(5ns+4ns), or roughly 100MHz) instead of a 30MHz link (obtained from Fig. 5 at N=1, that is 1/(2.1µs/64)=30MHz). Note that, since the emitter and the receiver are certainly more complex, and consequently slower, than the flip-flop components, the overall speedup (measured with respect to the emitter or receiver speed) is expected to be larger.
III. An architectural solution for increasing the system speed
The two techniques mentioned in the previous section can be combined, so that the maximum operating frequency and the performance of the system are significantly increased. The strategy is summarized in Fig. 7, where a pipeline clock is used to synchronize two successive stages (flip-flop components) of the fragmented communication link. This link is implemented here as two unidirectional pipelines, one for each direction. Moreover, each processor is clocked with the same signal as its associated flip-flop components, which serve as I/O interfaces to the communication link. SPICE simulations presented in Fig. 8 show that the skew between two successive stages of a 5cm communication link is approximately 0.7ns when the link is fragmented into 5 segments using 10:1 inverters. According to the common rule of thumb stating that the clock skew must be no greater than 10% of the clock period [2], this would indicate that the maximum clock frequency of a string (row) regrouping 5 processors should be 140MHz. Of course, the skew can vary along the link due to process variations, but one can consider the worst case value. Note that if a single 50:1 inverter had been used to generate the clock signal on the 5cm line (with no 10:1 inverter inserted), the end-to-end skew would have been equal to 3.5ns, which would imply a maximum frequency of 30MHz, according to the same rule of thumb. This last result corroborates a similar result mentioned in [1]. Using a more realistic estimation of the delay through each of the five segments (less than 2ns [...] quite significant considering that one cannot expect a speedup larger than the ratio 140MHz/30MHz, i.e. 4.7. Moreover, the maximum frequency (140MHz) allowed by the proposed methodology does not depend on the length of the link, whereas with the single inverter solution mentioned before, the maximum frequency decreases as the length increases.
If the string is to operate in SIMD mode, common control signals must be sent to these processors. Recognizing that control and communication signals could be propagated in a similar fashion, one can fragment the control lines and operate the string as a pipeline in which all the processors execute a given instruction at different times (Fig. 9). This will not affect operations except when communication operations are carried out. In these cases, one can insert latches in the communication path (which is in fact a simple retiming strategy) in order to restore synchronization for these operations.
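The frequency limits quoted in this section follow directly from the skew rule of thumb. The short sketch below reproduces that arithmetic, assuming a skew budget of 10% of the clock period; the function name `max_clock_mhz` is a hypothetical helper, and the skew values are the ones reported from the SPICE simulations.

```python
# Skew rule of thumb: the clock skew may use at most a fixed fraction
# of the clock period. ASSUMPTION: the fraction is taken to be 10%.

SKEW_FRACTION = 0.10


def max_clock_mhz(skew_ns: float, fraction: float = SKEW_FRACTION) -> float:
    """Largest clock frequency (MHz) whose period keeps skew within budget."""
    min_period_ns = skew_ns / fraction
    return 1e3 / min_period_ns


if __name__ == "__main__":
    pipelined = max_clock_mhz(0.7)  # skew between successive stages
    single = max_clock_mhz(3.5)     # end-to-end skew with a single driver
    print(f"pipelined clock: {pipelined:.1f} MHz")
    print(f"single driver  : {single:.1f} MHz")
    print(f"ratio          : {pipelined / single:.2f}")
```

With these inputs the limits come out near 143MHz and 29MHz, which the text rounds to 140MHz and 30MHz; the ratio of 4.7 quoted above is computed from the rounded figures.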
IV. Conclusion
An architectural approach that significantly increases the maximum speed of monolithic wafer scale systems has been presented. This approach combines pipeline clocking, wire fragmentation and retiming strategies in order to break the speed limitation imposed by signal propagation on long interconnection wires. With this technique, the maximum frequency and the size of large integrated systems can be decoupled. SPICE simulations using typical parameters of monolithic-WSI and a 1.2 micron CMOS technology show that a system similar to the WASP architecture could be operated at 140MHz, provided that the processors could run at such a frequency. A prototype of the main component (flip-flop) allowing such a frequency increase has been designed and fabricated.
