I. INTRODUCTION
A TWO-DIMENSIONAL (2-D) discrete cosine transform (DCT) macrocell is key to image and video de/compression LSI's because various standards, including MPEG1/2 (Moving Picture Experts Group) [1], [2], CCITT H.261 [3], and JPEG (Joint Photographic Experts Group) [4], have adopted DCT-based coding. Among them, the MPEG2 standard in particular covers HDTV-rate video signals, which require DCT processing of more than 100 M samples (pixels) per second. A 21 mm² DCT macrocell that can operate at 100 MHz was reported in [5]. However, that macrocell was still too slow and too large for the final goal of "a single-chip HDTV video codec" in cost-sensitive consumer products.
This DCT macrocell consists of a set of iterative multiplier-accumulators (MAC's) and buffer memories [6], as do most dedicated DSP's. To speed up the clock rate in the MAC's, deep pipelining and fast addition techniques like carry look ahead (CLA) and/or carry select adders [7] are usually used, but they unfortunately consume much additional area. The approach taken here, on the other hand, emphasizes a fast circuit technique and a simple adder algorithm with shallow pipeline stages to achieve a fast and small chip.
Manuscript received May 9, 1994; revised August 21, 1994.
M. Matsui is with STAR Laboratory, Stanford University, Stanford, CA 94305-4055 USA, on leave from Semiconductor Device Engineering Laboratory, Toshiba Corporation, Kawasaki 210, Japan.
H. Hara, T. Nagamatsu, and T. Sakurai are with Semiconductor Device Engineering Laboratory, Toshiba Corporation, Kawasaki 210, Japan.
Y. Uetani is with Research and Development Center, Toshiba Corporation, Kawasaki 210, Japan.
Y. Watanabe, A. Chiba, and K. Matsuda are with Toshiba Microelectronics Corporation, Kawasaki 210, Japan.
L.-S. Kim was with Semiconductor Device Engineering Laboratory, Toshiba Corporation, Kawasaki 210, Japan. He is now with the Korea Advanced Institute of Science and Technology, Seoul, Korea.
IEEE Log Number 9406272.
This paper describes a 13.3 mm² dedicated macrocell which can execute 8 × 8 2-D DCT's at 200 MHz with one pixel-per-clock throughput [8]. A new circuit technique, named SA-F/F (sense-amplifying pipeline flip-flop), is implemented, in which a special flip-flop used as a pipeline latch also acts as a sense-amplifier to regenerate low-swing differential inputs. Applying the scheme to a simple carry skip adder in the DCT MAC's drastically shortens propagation time and also reduces the macrocell size.
The next section discusses the concept of the SA-F/F scheme, explaining in some detail why it is useful. The basic architecture and implementation of the DCT macrocell are given in Section III. The fabrication and results of the macrocell are presented in Section IV, followed by the conclusion in the final section.
II. SA-F/F SCHEME
A. Concept
Sense-amplifying techniques are widely used in memory LSI's, in which complementary inputs with swings lower than 100 mV are differentially detected and regenerated to full rail-to-rail swings by a sense-amplifier. This technique significantly speeds up signal propagation when it is applied to heavily loaded and slow dual-rail signals, such as a bitline pair in a static RAM. In contrast, these techniques have not been utilized in logic LSI's, except in on-chip memory macrocells. One obvious reason is that most logic gates are single rail. Recently, however, dual-rail logic [10], [11], [16] has become popular in data-path design as a way to achieve higher speed than conventional CMOS single-rail logic. Another reason is that it is difficult to generate a timing signal to activate a sense-amplifier. The timing would be optimum if the signal were activated at the moment when the difference between the levels on the dual rails passes the input-offset voltage of the sense-amplifier. Unfortunately, the offset voltage is affected by process variations, noise, and so on, and is hence unpredictable. In memory LSI's the timing signal is generally generated from delay lines using self-timing, and these must be carefully tuned and optimized with timing margins large enough to tolerate process variations. However, this kind of tuning among racing signals is usually avoided in the design of logic LSI's, because there is a risk of a fatal malfunction which cannot be corrected by lowering the system clock frequency. Therefore, a simple solution must be found for the sense-amplifying mechanism to migrate easily into synchronous design.
The basic concept of the sense-amplifying pipeline flip-flop (SA-F/F) scheme proposed in this paper is shown in Fig. 1(a).
In this scheme, a sense-amplifier is merged into a latch, which is the element that synchronizes data to the system clock. The SA-F/F amplifies low-swing differential inputs (D, D̄) and latches data in the same way as a conventional static delay flip-flop (D-F/F), synchronously to a single clock (CLK), as shown in Fig. 1(b). Q and Q̄ are the full-swing outputs of the SA-F/F. Unlike ordinary reduced-voltage-swing circuits that use self-timing, the scheme does not require the latch timing of the sense-amp to be optimized, because the SA-F/F uses the system clock itself as the signal that activates the sense-amp. As a result, the latch timing varies as the system clock frequency changes, and the optimized timing can be measured as the maximum clock frequency if the path including the SA-F/F is critical. In other words, the timing margin is always optimized and there is no need to generate a critical timing signal that stays constant regardless of the system clock frequency. Therefore, this scheme can naturally bring the sense-amplifying mechanism into the conventional single-phase clocking systems widely used in recent VLSI design.
The SA-F/F scheme in combination with nMOS dynamic differential logic is shown in Fig. 5(a). The differential inputs are generated from an nMOS differential logic network controlled by a φP pulse. The timing diagram is shown in Fig. 5. [...] (used in Fig. 2(a)) makes it possible to detect differential inputs of less than 100 mV whose common-mode value is close to ground. A significant limitation of this scheme is that it can be applied only to the last block of a pipeline stage, because the outputs of the nMOS network are directly connected to latches (i.e., SA-F/F's) and its inputs must be full-swing. Moreover, the scheme requires the generation of the precharge pulse φP. Usually, the clock (CLK) is used for φP, in which case only the latter half of a clock cycle can be used for evaluation of the network.
Another option for generating φP is to use self-timing. That option does have a racing-signal hazard between φP and the inputs of the nMOS network. However, the φP pulse is much easier to generate than the sense-amp activation signal, because all the related signals are generated by conventional full-swing CMOS gates, which are insensitive to the input-offset voltage and hence much more predictable. For these reasons, the scheme is always accompanied by gates built in other primary logic styles, such as conventional static CMOS, DPTL, and CPL. Gates using the SA-F/F scheme would be clearly slower than those using other logic styles like DPTL and CPL if the scheme were applied to a simple gate like that of Fig. 5(c). As stated earlier, the sense-amplifying mechanism is efficient only when it is applied to heavily loaded and slow dual-rail signals. The SA-F/F scheme makes it possible to construct large nMOS differential logic networks with deep logic depths, which would be too slow if realized with the conventional differential nMOS logic families. An example is shown in the next section.
C. Carry Skip Adder
The nMOS differential logic style in combination with the SA-F/F is applied to a carry skip (bypass) adder. Fig. 6(a) shows a [...]. The speed of the carry propagation is determined by the transmission-line RC delay of the chain, whose time constant is derived from the equivalent resistance and capacitance of the chain. In the adder, the SA-F/F can detect an input voltage difference (ΔVin) of a mere 100 mV on the dual-rail carry chains.
In contrast, the inverter used as a detector in the conventional Manchester carry chain adder with a single-rail pass-transistor carry chain requires a 1.5 V input voltage swing, which is the logic threshold of the inverter. Therefore, the carry propagation of the new adder is roughly 15 times faster than that of the conventional one. It should be noted that the amplifying time of the SA-F/F, which is on the order of 1 ns, is not included in the addition time but is counted in the clock-to-data-out delay of the pipeline register. This delay is usually not in the critical path.
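The factor of 15 quoted above can be read as a first-order ratio of the required voltage swings. The expression below is a sketch under an assumption that is not stated in the text: the carry chain charges its load at a roughly constant rate, so the detection delay scales linearly with the swing the detector needs. The symbols t_conv, t_SA, and ΔV_inv are introduced here for illustration only.

```latex
\frac{t_{\mathrm{conv}}}{t_{\mathrm{SA}}}
  \approx \frac{\Delta V_{\mathrm{inv}}}{\Delta V_{\mathrm{in}}}
  = \frac{1.5\ \mathrm{V}}{0.1\ \mathrm{V}}
  = 15
```

An exponential RC charging model would give a somewhat different number, so this ratio should be taken as an order-of-magnitude estimate rather than a simulated result.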
Since the differential input voltage of the SA-F/F is about 100 mV and the low level of the inputs is ground, the threshold voltage drop across the nMOS pass-transistors and pull-up transistors does not hinder the function of the SA-F/F, even in low-voltage operation. The area penalty of the nMOS differential logic network compared to ordinary CMOS gates is small because only nMOS transistors are used. Thus, in the case of a 20 bit adder, the resulting circuit with no additional CLA has about a 30% area advantage as well as a 50% speed advantage over a conventional CMOS implementation with CLA. Since neither the current-controlled latch sense-amp employed in the SA-F/F nor the conventional delay flip-flop consumes dc power, and the voltage swing in the carry chain is reduced in high-speed operation, the new circuit is comparable to conventional CMOS circuits in terms of power consumption. Therefore, in terms of speed, area, and power, the resulting adder is superior or equal to a conventional CMOS design using CLA.
The simulated performance of the carry skip adder with various bit lengths using 0.5 µm CMOS transistors is shown in Fig. 7. The addition times were estimated using an input offset voltage of 100 mV. It is assumed that the adder is constructed simply by connecting the 4 bit carry skip adders shown in Fig. 6 serially. Only the width of the transistors used for carry bypass was optimized. The 20 bit addition time is estimated to be 1.6 ns and the 64 bit time is 3.5 ns, which is faster than or competitive with adders using asymptotically faster but area-consuming algorithms such as CLA or carry select. For adders longer than 64 bits, it is necessary to use a multiple carry skip technique [13] to remain competitive.
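The carry skip principle used by this adder can be illustrated at the logic level. The following is a minimal behavioral sketch in Python, not a model of the dual-rail nMOS circuit or its timing. The function name `carry_skip_add` and its parameters are hypothetical. The 20 bit width and the 4 bit blocks follow the text.

```python
def carry_skip_add(a, b, width=20, block=4, carry_in=0):
    """Behavioral model of a carry skip (bypass) adder.

    Returns (sum, carry_out, skipped_blocks). A block is "skipped" when
    every bit position in it propagates, so its carry-out equals its
    carry-in and the bypass path can be used instead of the ripple path.
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    carry = carry_in
    total = 0
    skipped = 0
    for lo in range(0, width, block):
        n = min(block, width - lo)
        block_in = carry
        all_propagate = True
        for i in range(lo, lo + n):
            ai = (a >> i) & 1
            bi = (b >> i) & 1
            p = ai ^ bi              # propagate signal
            g = ai & bi              # generate signal
            total |= (p ^ carry) << i
            carry = g | (p & carry)  # ripple carry inside the block
            all_propagate = all_propagate and p == 1
        if all_propagate:
            # Bypass: the block carry-out is simply the block carry-in.
            carry = block_in
            skipped += 1
    return total, carry, skipped


if __name__ == "__main__":
    # Worst-case style input: a carry generated at bit 0 travels to the top.
    print(carry_skip_add(0xFFFFF, 1))
```

In this example the carry generated in the lowest block bypasses the four upper blocks, which is the situation where the bypass transistor sizing mentioned above matters most.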
III. DCT IMPLEMENTATION
A. Architecture
The 2-D DCT processor macrocell, which executes a two-dimensional 8 × 8 DCT and inverse DCT (IDCT), is implemented using the row-column decomposition method based on Chen [14]. The macrocell also has a regularized parallel architecture based on distributed arithmetic by Sun [16], which delivers high-throughput DCT/IDCT processing of one sample (pixel) per clock.
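Row-column decomposition computes the 2-D transform as two passes of a 1-D transform with a transposition in between. The Python sketch below checks this against a direct 2-D evaluation. It assumes the orthonormal DCT-II scaling, which may differ from the scaling used in the macrocell, and all function names are hypothetical.

```python
import math

N = 8


def scale(k):
    """Orthonormal DCT-II scale factor (an assumed normalization)."""
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def dct_1d(x):
    """8-point 1-D DCT-II of a sequence x."""
    return [
        scale(k) * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                       for n in range(N))
        for k in range(N)
    ]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dct_2d_row_column(block):
    """2-D DCT as a row pass, a transposition, and a column pass."""
    rows = [dct_1d(r) for r in block]             # first 1-D processor
    cols = [dct_1d(c) for c in transpose(rows)]   # transposition buffer, second 1-D processor
    return transpose(cols)


def dct_2d_direct(block):
    """Direct evaluation of the 2-D DCT definition, used as a reference."""
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            acc = 0.0
            for i in range(N):
                for j in range(N):
                    acc += (block[i][j]
                            * math.cos((2 * i + 1) * u * math.pi / (2 * N))
                            * math.cos((2 * j + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * acc
    return out


if __name__ == "__main__":
    blk = [[(i * 8 + j) % 17 for j in range(N)] for i in range(N)]
    a = dct_2d_row_column(blk)
    b = dct_2d_direct(blk)
    print(max(abs(a[u][v] - b[u][v]) for u in range(N) for v in range(N)))
```

The two-pass form needs 2 × 8 one-dimensional transforms per block instead of a full 64-term sum per output coefficient, which is why it maps naturally onto two 1-D processors and a transposition memory.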
There are two 1-D DCT/IDCT processors: one for the row DCT/IDCT and another for the column DCT/IDCT. A transposition RAM is used as a buffer between them, as shown in Fig. 8.
[Eqs. (1) and (2): piecewise definition with one case for i = 0, 1, 2, 3 and one for i = 4, 5, 6, 7; not recoverable from the source text.]
This decomposition reduces the total number of multiplications from 64 to 32. Therefore, the result can be calculated by a MAC operation in the following iterative way:
[Eqs. (3)-(5): not recoverable from the source text.]
In (5), [...]. The data sequence X0, X1, ..., X7 is stored sequentially into an input buffer memory with a bit-parallel structure. With a latency of 8 cycles, the contents of the buffers are read out concurrently in bit-serial form, least significant bit first. The buffer memory is a special-purpose memory for parallel-to-serial transposition, which has a capacity of 16 word × 16 bit. The bit-serial data are loaded into the 8 MAC units concurrently and calculated iteratively. The resultant sums from the MAC units are sent to the butterfly stage. The MAC unit, which realizes the expression in (5), is implemented with ROM's, accumulators, and shifters. Fig. 10 shows the block diagram of the MAC unit and its circuit implementation. Two partial products from two different ROM's are first added in parallel and then accumulated, as shown in the block diagram. The output has 20 bit accuracy. In the circuit implementation, two bits from the ROM's and one bit from the accumulation register are first added by a full adder, and then the full adder outputs are loaded into a 20 bit carry propagation adder. This carry save addition technique eliminates the need for another carry propagation adder.
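The ROM-and-accumulator MAC described above follows the general idea of distributed arithmetic: the bits of all inputs at one bit position form a ROM address, and the ROM returns the matching precomputed sum of coefficients. The Python sketch below illustrates that idea for a generic inner product. It is a simplification that uses a single ROM and integer coefficients, whereas the text describes two ROM's per MAC unit. The function names and the example coefficients are hypothetical.

```python
def build_rom(coeffs):
    """ROM[addr] holds the sum of the coefficients whose address bit is 1."""
    n = len(coeffs)
    return [
        sum(c for j, c in enumerate(coeffs) if (addr >> j) & 1)
        for addr in range(1 << n)
    ]


def da_inner_product(coeffs, samples, width=16):
    """Compute sum(c * x) bit-serially, least significant bit first.

    `samples` are signed integers that fit in `width` bits of
    two's-complement. No multiplier is used: each cycle performs one
    ROM lookup and one shifted add (or subtract for the sign bit).
    """
    rom = build_rom(coeffs)
    acc = 0
    for bit in range(width):
        # Gather bit `bit` of every sample into one ROM address.
        addr = 0
        for j, x in enumerate(samples):
            addr |= ((x >> bit) & 1) << j
        partial = rom[addr]
        if bit == width - 1:
            acc -= partial << bit   # the sign bit carries negative weight
        else:
            acc += partial << bit
    return acc


if __name__ == "__main__":
    coeffs = [3, -5, 7, 2]           # illustrative values, not DCT coefficients
    samples = [100, -200, 50, -1]
    print(da_inner_product(coeffs, samples))
    print(sum(c * x for c, x in zip(coeffs, samples)))
```

A hardware accumulator would typically shift its running sum right each cycle rather than shift the partial product left. The left shift is used here only to keep the arithmetic exact and easy to check.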
A 20 bit differential carry skip adder with the SA-F/F scheme is employed as the final adder. Owing to the high-speed nature of the SA-F/F scheme, no pipeline latch is required anywhere in the MAC stage, which means that two pipeline latches were eliminated compared to the previous work [5] shown in Fig. 11. This is crucial for area reduction. The DCT macrocell requires 16 MAC units, which occupy 60% of the total macro area. Because the 20 bit adders with the SA-F/F have a smaller area, the overall macro size is reduced by 15% compared to a conventional CMOS implementation.
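The carry save step that precedes the final adder in the MAC unit can be sketched at the word level. In the Python sketch below, three 20 bit operands (standing in for the two ROM outputs and the accumulator value) are reduced by a row of full adders to a sum word and a carry word, so that only one carry-propagating addition is needed. The function names are hypothetical and the model ignores the bit-serial shifting of the real datapath.

```python
WIDTH = 20
MASK = (1 << WIDTH) - 1


def carry_save_step(rom_a, rom_b, acc):
    """Reduce three operands to (sum_word, carry_word) with no carry ripple.

    Each bit position acts as an independent full adder: the sum bit is
    the XOR of the three inputs and the carry bit is their majority.
    """
    rom_a &= MASK
    rom_b &= MASK
    acc &= MASK
    sum_word = rom_a ^ rom_b ^ acc
    carry_word = (rom_a & rom_b) | (rom_a & acc) | (rom_b & acc)
    # Carries have twice the weight of the position that produced them.
    return sum_word, (carry_word << 1) & MASK


def final_add(sum_word, carry_word):
    """The single carry-propagating addition (the carry skip adder's role)."""
    return (sum_word + carry_word) & MASK


if __name__ == "__main__":
    s, c = carry_save_step(0x12345, 0x0FEDC, 0x00FF0)
    print(hex(final_add(s, c)))
    print(hex((0x12345 + 0x0FEDC + 0x00FF0) & MASK))
```

Because the full adder row has a delay of one gate stage regardless of word length, the carry skip adder remains the only length-dependent delay in the accumulation loop.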
IV. FABRICATION AND RESULTS
The DCT test chip was fabricated using 0.8 µm base-rule double-metal CMOS technology. 0.5 µm nMOSFET's and 0.6 µm pMOSFET's are used for 3.3 V operation. Features of the macrocell are summarized in Table I. A die microphotograph is shown in Fig. 12. The macrocell is designed using fully [...]. The macrocell is designed to operate at 200 MHz at 3.3 V and at room temperature.
[...] switchable and fully satisfies IEEE 1180-1990. The fast speed and small area are achieved using the novel SA-F/F scheme. In the scheme, a special flip-flop, the SA-F/F, was used in combination with nMOS differential logic. The SA-F/F can be used as a differential sense-amplifier synchronous to the system clock and can amplify dual-rail inputs with swings lower than 100 mV. A 1.6 ns 20 bit carry skip adder was designed using the same scheme and used in the DCT macrocell. The adder is 50% faster and 30% smaller than a conventional CMOS carry look ahead adder, which reduces the macrocell size by 15% compared to a conventional CMOS implementation.
