With the escalation of clock frequencies and the increasing ratio of wire delays to gate delays, clock skew is a major problem to be overcome in tomorrow's high-speed VLSI chips. In addition, as more stages switch simultaneously, peak power consumption rises. In our past work, we proposed a novel scheme called Counterflow-Clocked (C^2) Pipelining to combat these problems, and discussed methods for composing C^2 pipelined stages. In this paper, we analyze, in great detail, the timing constraints to be obeyed in designing basic C^2 pipelined stages as well as in composing C^2 pipelined stages. C^2 pipelining is well suited for systems that exhibit mostly uni-directional data flows and possess mostly nearest-neighbor connections.
I. Introduction

... to a high speed system distribution in C^2 pipelined realizations.
Another major concern when building high performance VLSI systems is to employ high performance pipelined structures in conjunction with high speed clocks. Pipelining is a technique for reducing the clock period as well as increasing the amount of parallel circuit activity by splitting deep logic structures into shallower structures that are separated by pipeline latches. Although design methods for conventionally pipelined systems are well known [9], serious problems due to rigid clock synchronization may arise in very high speed pipelined designs. Strictly speaking, however, pipelining and clocking are orthogonal concepts. One can build asynchronous pipelines, known as micropipelines [10], that do not employ clocks. However, the time penalty paid for generating the completion signals, as well as for handshaking [11], has prevented micropipelines from finding widespread use in high-performance VLSI systems. One can also implement wavepipelining [12], where the "latches" are realized by the inherent combinational delays of logic structures. Despite their inherent performance advantages, wavepipelined systems require considerably more design effort to balance combinational delays, and consequently have received only limited usage. C^2 pipelining is a synchronous design scheme that (as pointed out before) comes with clock-distribution methods as well as pipeline design and composition methods.
A feature of C^2 pipelining is that the clock signals travel opposite to the direction of data movement. Back-propagating clock signals have been considered previously [2, 13], but never widely used in actual circuits. These previous back-propagating circuits were rigidly clocked, and hence offered no real advantages over H-tree distributed clocks; in fact, they actually increased the clock period. Another clocking method is buffered clocking, mentioned in El-Amawy [14], and originally described as pipelined clocking by Fisher et al. [7] (who do not assign any particular direction to pipelined clocks). This method also suffers from an increased clock period.
In C^2 pipelined systems, every pipeline stage employs clock buffers, as shown in Figure 1(a); a detailed explanation is given in succeeding sections. These inverter buffers not only deliberately skew the clock (the exact one-sided constraints will be presented later) but also restore the clock edge. This scheme achieves temporally distributed clocking. Clock amplification is also carried out in a distributed fashion. Conventional two-phase clocked pipelining is also illustrated in the figure for comparison. The C^2 pipelining idea was first introduced in [15], where we presented many actual uses in the context of a subband vector quantizer (SB/VQ) chip. In this paper, we focus on analyzing the timing constraints of C^2 pipelining. In Section V we review the results of a C^2 pipelining network for the SB filtering chip.
Another feature of C^2 pipelined systems is that they enable one to use simple and efficient dynamic latches, which offer extremely low latch delays and areas, and avoid special latch designs [1, 2]. The C^2 pipelining method also staggers the switching activities of the latches, thus reducing the peak power consumption. This, in turn, reduces internal switching noise and also simplifies power-line routing, making it easier to distribute high speed clocks. The pipeline interconnection methods to be described actually make the idea of C^2 pipelining more useful than pipelines with only nearest neighbor connections. In [15], we introduced such methods for 1) data forwarding, in which data skips a few pipeline stages in the direction of the data flow, 2) data backwarding, in which data skips a few pipeline stages backwards (commonly used for iterative computations), 3) sequential connection of different pipelines, 4) pipeline fork and join methods to combine pipeline functionality in parallel, and 5) synchronization methods to synchronize incoming data and outgoing data to a clock signal. The timing constraints involved in these methods are also discussed in detail in Section III of this paper.
In Section II, basic C^2 pipelining architectures are described and analyzed. Basic composition methods of data forwarding and data backwarding are analyzed in Section III. Section IV presents the extended composition methods of sequential connection, pipeline fork and join, and synchronization. These methods are explained using the analysis results of Section III. Section V gives a practical assessment of C^2 pipelining with a design and layout example. Conclusions are given in the final section.

II.

Figure 1 shows the difference between C^2 pipelining and conventional clock distribution for a pipeline. Circuit C2 in Figure 1(a) employs a chain of inverters to provide local clock signals. Local buffers attached to the chain provide appropriate output power to control local latches. Figure 1(b) shows the conventional method, in which a non-overlapping two-phase clock generator is located at the center of the clock distribution network. This clock generator is designed to cope with the clock loads of the entire clock network.
C^2 pipelining can be realized in several ways, as shown in Figure 2. We now discuss the clock timing of a particular latch i with respect to its neighboring latches. First, the pipelining involves "go-throughs" during clock periods I and III shown in Figure 3(b) (because C^2 pipelining implements overlapping clocks). For instance, during period III, the output of stage i-1 can "go through" to stage i+1 because latch i-1 is in hold while latches i and i+1 are transparent. Go-through should be avoided in a rigidly clocked synchronous system with a non-overlapping clock. However, this go-through does not make stage i+1 produce a wrong output in a C^2-pipelined system.
A possible scenario involving a go-through is the following: stage-latch i-1 stabilizes its output by period II; however, stage i-1 delays this output, which reaches the input of latch i only during period III (note the distinction between stage and stage-latch); the output of stage i (not stage-latch) can be generated early during period III and be sent to stage-latch i+1, which is also transparent. In this scenario, the output generated by latch i-1 gets processed by stages i-1 and i and is applied to the input of stage i+1, all during period III. This go-through is not harmful because it causes the stage i+1 output to tend towards the same value as it will evaluate to in the absence of go-through (much like chaining [16]). The go-through possible in period I can be analyzed in the same way. In fact, go-throughs can actually help shorten the clock period by allowing a stage to absorb a fraction of the long-path delays associated with the stage preceding it. This can potentially be an advantage if the stage delays are not exactly balanced. The other periods involved (II, IV, V and VI) do not allow go-throughs to happen.

Figure 4 illustrates the overall latch operations for a C^2 pipeline. This figure shows staggered latch operations, where each latch alternates between transparent and opaque states. The vertical bold lines emanating from one period of the latch i operation mark a sending window, involving a transparent state and the succeeding opaque state of a latch, and a matching receiving window of the following latch. The latter latch, i+1, is in the transparent state between the two bold lines. This shows that the latches operate as described in the previous paragraphs.
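The staggered latch operation and the overlap window in which a go-through can occur can be illustrated with a small model. This is only a sketch under assumptions that are not taken from the paper: an ideal 50% duty-cycle clock injected at the last stage, one inverting buffer of uniform delay `d_c` per stage, no wire delay, latches transparent while their local clock is high, and integer time units. All function names are hypothetical.

```python
def local_clock_high(stage, t, n_stages, period, d_c):
    """Return True if the local clock of `stage` is high at integer time t.

    The clock enters at stage n_stages-1 and travels upstream (against the
    data), passing one inverting buffer of delay d_c per stage (assumed model).
    """
    hops = (n_stages - 1) - stage                  # inverters between source and this stage
    source_high = ((t - hops * d_c) % period) < period // 2
    return source_high if hops % 2 == 0 else not source_high


def overlap_steps(stage, n_stages, period, d_c):
    """Count time steps per period in which latches `stage` and `stage+1`
    are both transparent (the window in which a go-through is possible)."""
    return sum(
        1
        for t in range(period)
        if local_clock_high(stage, t, n_stages, period, d_c)
        and local_clock_high(stage + 1, t, n_stages, period, d_c)
    )


def peak_simultaneous_edges(n_stages, period, d_c):
    """Largest number of latches whose clock changes level at the same instant."""
    peak = 0
    for t in range(period):
        edges = sum(
            1
            for s in range(n_stages)
            if local_clock_high(s, t, n_stages, period, d_c)
            != local_clock_high(s, t - 1, n_stages, period, d_c)
        )
        peak = max(peak, edges)
    return peak
```

With the assumed values of 4 stages, a period of 100 time units and `d_c = 10`, every adjacent latch pair is simultaneously transparent for `d_c` time units per period, and no two latches switch at the same instant. Setting `d_c = 0` in the same model makes all four latches switch together, which is the situation the staggering is meant to avoid.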
The novelty of C^2 pipelining results from the use of intentionally inserted delays on clock lines.
These delays not only provide the pipeline speed-up described above, but also partition the clock line into many small pieces, enabling one to avoid global clock skew problems. This leads to a locality property of the timing constraints for the whole pipeline: i.e., the whole pipeline works properly if the local delay constraints are assured for all stages. In very high speed designs, the delays associated with segments of wires cannot be ignored. These are taken into consideration in the following calculations. The inequality in (2) can always be satisfied because the clock period is externally controllable, as in conventional synchronous clocking. The inequality in (1) is the condition that is most important.
Hence, C^2 pipelining results in one-sided timing constraints. Also, notice that (2) is independent of clock skew, which confirms the observation that C^2 pipelining is an attractive method for GHz-clocked circuits, where skew is expected to become a major problem with conventional rigid-clocking methods.
III. Basic composition methods
During the composition of C^2 pipeline blocks, there will arise situations in which the data needs to (1) move downstream (with respect to the data movement) to be consumed by a functional block with typically several inputs (Figure 7(a)), and/or (2) move upstream to be consumed by a functional block with several inputs (typically in iteration structures) (Figure 7(b)). As we expect such "stage skipping" connections to be infrequent and to skip only a small number of stages, we do not provide any special circuits to resynchronize the data; instead, we obtain timing constraints to be obeyed. Skips over longer distances have to proceed as several short skips in sequence, with corresponding adjustments in the data timing. Timing constraints required for data forwarding and backwarding are analyzed in the sections to follow.

When the destination is a latch with $clk_{i+2j+1}$, as shown in waveform (3), the clock for this latch is inverted and leading by $(2j+1)\,d_c$, where $d_c$ is a clock buffer delay. This leading duration can be extended up to $P - S$, as shown by waveform (4) (i.e., the latest data sent by latch i must fall before the set-up time window, marked S, of a latch with $clk_{i+2k+1}$ on waveform (4)). Note that data forwarding by whole cycles is possible. However, such extended forwarding needs to be avoided, since the amount of delay in a long chain of inverters can vary significantly with temperature, operating voltage and fabrication process parameters, and hence may not reliably track the cycle time.
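The forwarding limit stated above can be turned into a quick feasibility check. The sketch below assumes that P is the clock period and S the set-up time, reads the limit as the strict inequality $(2j+1)\,d_c < P - S$, and ignores wire delays; the function name and the integer picosecond units are illustrative choices, not part of the original analysis.

```python
def max_forward_skip(period, setup, d_c):
    """Largest j such that (2j+1)*d_c < period - setup.

    Returns None if even j = 0 (the adjacent latch) does not fit.
    All arguments share one time unit, e.g. picoseconds (assumed).
    Wire delays are ignored in this simplified reading of the limit.
    """
    if d_c <= 0:
        raise ValueError("d_c must be positive")
    budget = period - setup            # maximum tolerable clock lead
    if d_c >= budget:
        return None
    j = 0
    # Grow j while the next odd multiple of d_c still fits in the budget.
    while (2 * (j + 1) + 1) * d_c < budget:
        j += 1
    return j
```

For example, with assumed values `period=1000`, `setup=50` and `d_c=100`, the function returns 4, meaning a lead of 900 time units to latch i+9. As the text cautions, a practical design would stay well below such a bound, because the inverter-chain delay varies with temperature, voltage and process.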
Waveforms (1) and (5) give an example of an incorrect data go-through situation occurring from latch i to latch i+2l+1. This results from a violation of the above-stated forwarding limit. In this example, imagine that latch i presents incorrect data at the beginning of its transparent state and correct data only at the end of its transparent state. By the time correct data is presented, however, latch i+2l+1 could have become opaque, as can be seen in waveform (5).

... value on connection wire between the two latches will restrict the number of stages to be forwarded.

Figure 9 shows the overall latch operations for a data forwarding that sends data directly from latch i to latch i+3. The vertical bold lines on the latch i operation show a sending window, and the transparent state between the bold lines on the latch i+3 operation shows a receiving window. If data from latch i were sent directly to latch i+n in the figure, hazardous latch operation would result, as discussed above.

Figure 10 shows an example circuit of a line memory unit designed in C^2 pipelining to provide delayed data [15], which is often necessary in image data processing. This design has the advantage of staggering transistor switching activity among the line memory blocks, due to the deliberately skewed nature of the C^2-pipelined clock. The peak power consumption can be much lower, which increases the noise margin and lowers the power rail capacity required. Data forwarding 1 is achieved directly, since the difference between the data bundle clock $C_i$ and the destination clock $C_{i+5}$ is five, which is odd. However, data forwarding 2, with a clock difference of four, which is not odd, is achieved in two steps by dividing the four into one and three: data forwarding from clock $C_{i+1}$ to $C_{i+2}$ and data forwarding from clock $C_{i+2}$ to $C_{i+5}$, at the cost of extra latches for clock $C_{i+2}$.

Figure 11(a) shows data backwarding ignoring wire delays.
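Returning to the two-step forwarding of the line memory example, the rule that each direct forwarding hop must span an odd clock difference can be sketched as a small splitting routine. This is an illustrative greedy procedure, not the authors' method: the function name and the `max_hop` parameter (the largest hop the forwarding limit allows) are assumptions.

```python
def split_into_odd_hops(diff, max_hop):
    """Split a clock-index difference into odd forwarding hops, each <= max_hop.

    Returns the hops in ascending order; every hop beyond the first
    needs one extra bank of intermediate latches (assumed cost model).
    """
    if diff <= 0 or max_hop < 1:
        raise ValueError("diff and max_hop must be positive")
    hops = []
    remaining = diff
    while remaining > 0:
        hop = min(remaining, max_hop)
        if hop % 2 == 0:
            hop -= 1                   # only odd differences can be forwarded directly
        hops.append(hop)
        remaining -= hop
    return sorted(hops)
```

With `max_hop=5`, a difference of five is forwarded in a single hop, while a difference of four is split into hops of one and three, matching the division described for data forwarding 2.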
The simplest form of data backwarding is feeding data from a latch with $clk_i$ to a latch with $clk_{i-2}$ over a blank data path (waveforms (1) and (2) of Figure 11(a), respectively). Waveform (2) in the figure shows that latch i-2 is controlled by a non-inverted and delayed clock. The clock delay on the clock line provides the timing margin to send data from latch i to latch i-2, because the latch closing time on waveform (2) is delayed by the clock delay amount from the latch closing time on waveform (1). When the destination is a latch i-2j, as can be seen in waveform (3) ...

Waveforms (1) and (5) give an example of an incorrect data go-through situation occurring from latch i to latch i-2l, resulting from a violation of the above-stated backwarding limit. In this example, imagine that latch i presents correct data, to be passed, during its opaque state and incorrect data at the beginning of its succeeding transparent state. By the time the incorrect data is presented, latch i-2l could still be transparent, as shown in waveform (5).
The timing constraint for data backwarding, ignoring wiring delays, is: $2m\,d_c < P - H$.
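As a worked illustration of this backwarding limit, consider the following numbers, which are assumed for the example and do not come from the paper: $P = 10$ ns, $H = 0.5$ ns and $d_c = 1$ ns, reading $P$ as the clock period and $H$ as the latch hold time.

```latex
\begin{align*}
2m\,d_c &< P - H \\
2m \cdot 1\,\text{ns} &< 10\,\text{ns} - 0.5\,\text{ns} = 9.5\,\text{ns} \\
m &< 4.75 \quad\Longrightarrow\quad m \le 4
\end{align*}
```

Under these assumed values, data could be sent back by at most $2m = 8$ latches, i.e. from latch i to latch i-8, before wire delays are taken into account.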
The timing constraint with wiring delays taken into account can be derived as follows (see also Figure 11(b)). (Note: this derivation may be skipped during initial reading.) The same wire delay conventions used for data forwarding are used. When our signal-and-data observation point is at latch i-2m, the worst case scenario is: 1) the earliest data validation time to the input of the ...

... Figure 13, which shows a multiplication and accumulation unit that is frequently used in digital signal processing. Data backwarding is needed for the iterative calculation, since the output data must be fed back to the input.

Although both data forwarding and data backwarding provide basic means to build a system, it is better to have extended composition methods such as pipeline fork and join, sequential connection, and a synchronization interface to build a C^2-pipelined system. Figure 14(a) shows pipeline fork and join, which connects two or more pipelines in parallel when the functionality of the pipelines is to be combined in parallel. Figure 14(b) shows sequential pipeline connection, which connects pipelines directly in sequence. Figure 14(c) shows pipeline synchronization, in which the incoming data and outgoing data of a C^2 pipelined block need to be synchronized to a particular clock. This is needed when a block or a pipeline should be synchronized to a particular local or global clock signal. Use of this synchronization method provides a way to build a system with hybrid clocking, combining counterflow clocking and conventional clocking. This can be valuable, for example, in high performance processors that have dedicated DSP hardware. These three extended composition methods, in addition to the two basic composition methods, provide a VLSI designer with the means to build systems of non-trivial size using C^2 pipelining. This section analyzes those extended methods.
To connect pipelines in parallel, it is ideal to have pipelines of the same length (number of stages) sharing a clock distribution line. When the lengths are not the same, a clock timing adjustment should be made while sending or receiving data. Figure 14(a) illustrates one of those two situations, showing the timing adjustment to feed data. Since the outgoing clock from the shorter pipeline is downstream, in the data flow, of the data to be fed, the data feeding should use the data forwarding method to adjust the clock timing associated with the data. That outgoing clock needs to be terminated, and the outgoing clock from the longer pipeline should instead be fed to the preceding pipelines, since it has the correct timing for the incoming data. Then, all the local timing constraints associated with the inputs and outputs of the parallel-connected pipelines are satisfied.
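The clock selection rule for parallel pipelines of unequal length can be sketched as follows. The helper is hypothetical: it assumes that each branch is summarized by the total delay of its outgoing clock, keeps the most delayed one for the preceding pipelines, and reports how far each terminated clock is advanced relative to it, which is the lead that data forwarding into that branch has to accommodate. The branch labels and delay values in the usage note are illustrative only.

```python
def choose_upstream_clock(outgoing_delays):
    """Pick the outgoing clock to feed to the preceding pipelines.

    outgoing_delays maps a branch clock name to its total clock delay
    (one time unit throughout, e.g. picoseconds; assumed representation).
    Returns (kept_clock, leads), where leads maps every terminated clock
    to the amount by which it is advanced relative to the kept clock.
    """
    if not outgoing_delays:
        raise ValueError("at least one branch is required")
    kept = max(outgoing_delays, key=outgoing_delays.get)   # longest-delay branch
    leads = {
        name: outgoing_delays[kept] - delay
        for name, delay in outgoing_delays.items()
        if name != kept
    }
    return kept, leads
```

For instance, `choose_upstream_clock({"c2": 300, "c4": 700})` keeps `c4` and reports that `c2` is advanced by 400 time units. Each reported lead would still have to respect the forwarding limit of Section III.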
B. Sequential Connection
The sequential connection method is the same as that used to connect two adjacent pipeline stages. The output from the preceding pipeline in Figure 14(b) is fed to its succeeding pipeline using one inverting clock delay and wiring between latches, so as to satisfy the local wiring constraints.
C. Synchronization Interfaces
Since the synchronization method uses a particular clock signal for its input and its output, the timing at the input or output of a C^2 pipeline should be adjusted. Figure 14(c) illustrates such a method, employing data backwarding for the input data. Since the input data and the C^2 pipeline share the particular clock clk, the outgoing clock signal from the pipeline is delayed by the amount of its clock line delay. This necessitates data backwarding to adjust the clock timing between the input data and the pipeline input. The output data of the pipeline can be consumed directly by a receiver using the clk signal.
With this scheme, a physical layout issue pertaining to wiring arises. The fact that the output data normally should move from the end of the pipeline to its front necessitates a long connection wire when the two ends are physically far apart. In this case the timing constraint is satisfied with a scheme in which the clock clk1 is delayed and inverted from the clock clk. Then, the timing constraints with respect to the clock clk are satisfied for the input and the output. This interface provides a way to implement a functional block in C^2 pipelining within a conventionally synchronous-clocked system. Thus C^2 pipelining can also be used selectively in large VLSI chips.
V. A practical assessment of C^2 pipelining
This section describes a specific C^2 pipelining design and its associated layout issues. Figure 15 shows the design and layout of a subband filtering chip (SB chip) for processing HDTV input image data into four subband images [15, 17]. The chip size was found to be 17.6 mm × 15.8 mm in 2-μm CMOS technology, with 483,000 transistors. The target clock speed of the chip, which is preliminary (using 2-μm CMOS technology), is 4.5 MHz. The speed is reduced by a factor of four from 18 MHz, which is needed to process 72 MHz input pixel data, using 0.

The units in Figure 15 are connected sequentially: the 2D-FIFO unit to the incoming pixel data, two line memory units (unit I at left and unit II at right) in the middle in parallel, and the filterbank unit to produce output data of subband images, as shown in the figure. The unidirectional data flow is well suited to C^2 pipelining. The clocks flow counter to the direction of the data flow, as shown in the figure. Each unit was designed using C^2 pipelining except the 2D-FIFO unit, which was designed using conventional rigid synchronization.
Regarding wiring lengths between the units, there are several things to mention. (1) The clock c1 line delay (to be included in the clock delay, $d_c$) and the longest delay on data bus d1 provide timing margin for the data forwarding between the two units, using the inequality derived in (3) of Section III. (2) Similarly, the data forwarding involving the clock c3 line and data bus d2 has more timing margin, due to longer wire lengths, than that for c1 and d1. (3) Between the two clocks c2 and c4, c4 has a longer delay than c2, due to the longer physical wire for c3, which is about half of the chip width. Thus c4 needs to be fed to the preceding unit and c2 is terminated. (4) The data forwarding by the d3 data bus and c4 clock (to line memory unit I) has timing margin realized through wire delays. The data forwarding by the d3 data bus and c2 clock (to line memory unit II) has more timing margin than that, due to the temporally advanced clock c2.

There are two other noteworthy things about the design: the double-frequency clock (2f clk) fed to a particular location of the chip, and the 12-bit connections to another chip. The 2f clk needs timing adjustment to align it with the c5 clock. The adjustment can be done by an adjustable delay element. The 12-bit connections involve the data forwarding technique.
This section has shown how a large image-processing chip can be designed and implemented in C^2 pipelining. Another large chip, a vector quantizer chip, which will process a subband image from the SB chip, has also been designed and implemented in C^2 pipelining [17].
VI. Conclusions
The development of C^2 pipelining was motivated by the fact that an effective high speed clocking technique is essential for building high performance VLSI systems. It was observed that rigid synchronization over a chip or a system makes design easier. However, for systems whose VLSI realizations have mostly uni-directional data flows and mostly nearest-neighbor connections, the assumption of rigidly synchronized clocking is not necessary, and can result in lost performance when enforced (including waste of clock period due to clock skews and less noise margin due to simultaneous firing of latches). C^2 pipelining was developed from the observation that many high speed systems show mostly uni-directional data flows and mostly nearest-neighbor connections. C^2 pipelining adopts back-propagating clock signals [2, 13], which are known to be safe but are usually avoided (due to the extended clock period), in combination with pipelined clocking [7, 14]. A C^2-pipelined system can be built by using only "local delay constraints", which is a prominent feature for achieving very high speed clocking. This paper introduced C^2 pipelining technology, including 1) basic C^2 pipelining architectures, 2) the composition methods of data forwarding and data backwarding, and 3) the extended composition methods of pipeline fork and join, sequential connection and synchronization interfaces.
Those basic architectures and composition methods provide a VLSI designer with a means to build a system with many one-sided local delay constraints, without concern about global clock distribution problems and skew control. A C^2 pipelining design and layout example for a subband filtering chip was reported in Section V. By building a system in C^2 pipelining, one can shorten the clock period significantly when interconnection delays are larger than gate-propagation delays [5]. The trade-off is between using inverter chains and some extra latches for data forwarding or data backwarding (in case receiving windows do not match their corresponding sending windows) and building an elaborate clock distribution network to supply a global clock. This paper concludes that, by applying C^2 pipelining and its composition methods to build a system, clock periods can be much shorter than with rigidly clocked synchronization when interconnection delays are larger than gate propagation delays. In addition, peak power bumps can be reduced by the staggered operation of latches. These two factors are essential for building large, very high speed clocked systems.
