Abstract-This work presents a robust and easily implemented clock generator for custom designs. It is a fully digital design suitable for both high-speed clocking and low-voltage applications, and it avoids analog methods like phase-locked loops or delay-line loops. Instead, the clock generator is based on a ring counter which stops a ring oscillator after the correct number of cycles. Both a 385 MHz clock and a 15 MHz custom DSP application using the on-chip clocking strategy are described. The prototypes have been fabricated in a 0.8 µm standard CMOS process. The major advantages of this clocking method are robustness, small size, low power consumption, and the ability to operate at a very low supply voltage.
Manuscript received November 29, 1994; revised October 9, 1995. This work has been supported by NUTEK, the Swedish National Board for Industrial and Technical Development.
The authors are with the Department of Applied Electronics, University of Lund, 22100 Lund, Sweden.
Publisher Item Identifier S 0018-9200(96)03404-X.
I. INTRODUCTION
MONOLITHIC implementation of clock generators in digital systems has become more popular recently as advances in packing density have reached a level where complicated digital systems can be integrated on a single chip or a few chips. As in any electronic system, the obvious advantages of integration are reduction of cost, simplified large-volume production, and small physical size. When targeting portable battery-operated equipment like cellular digital mobile telephone systems, power consumption becomes increasingly important [2]. An off-chip clock costs power both to generate and to distribute on the PC board. In contrast, the on-chip clock is used locally near the clocked circuit. Thus, the total loading capacitance for the generator is kept at a minimum, and thereby less power is consumed. The method presented here is digital: the sample rate signal starts the clock, and a cycle counter then takes over the control and stops the oscillator when all cycles have been generated. Other approaches, often analog, such as phase-locked loops (PLL's), are used both for clock recovery [8] and for on-chip frequency multiplication [3], [9]. The reasons for using a PLL for on-chip clocking are low clock skew, low phase noise, and low clock jitter. However, integrating a PLL in a noisy digital environment is difficult. In addition to noise issues, the PLL is also sensitive to process variations. Another method for on-chip clocking is to use a delay-line loop (DLL). In [12], a DLL for frequency doubling is presented. However, even the DLL technique uses a sort of "analog" control loop.
In contrast, the presented fully digital on-chip clocking method is a simple and robust method. Moreover, fully digital methods are more desirable for both low-voltage and low-power clock generation. Both the static and the dynamic power consumption are lower compared to analog PLL-based methods.
Large designs often suffer from clock skew between different parts of the chip. The impact of clock skew can be reduced by partitioning the design into several self-timed parts [10]. For the same reason, distributed PLL-based clock generators have also been used in [6]. The same approach can be applied with these on-chip clocks. Low clock skew is reached by partitioning the processor into parts where each part is clocked with a separate on-chip clock. With sufficient handshaking, these parts can even be clocked at different frequencies. This gives several opportunities in custom ASIC design. Low-speed parts can be optimized with less driving capability, which then consumes less power.
On-chip clocking is particularly useful in bit-serial applications. A disadvantage of bit-serial arithmetic has been that the throughput is limited by the maximum off-chip clock frequency. To increase the throughput in bit-serial arithmetic, methods for reducing the number of pipeline stages are practiced. The on-chip clocking approach is to increase the throughput by having a local monolithic clock as near the design as possible.
The clocking method is fully digital, which means that the method is robust and relatively stable against process variations. A digital solution is also a necessity for low-voltage applications. The clocking technique has been tested down to 0.7 V in a standard 5 V CMOS process.
In contrast to analog clocking methods, the clock can be simulated with a digital simulator. A digital simulator such as IRSIM [4] can be used for the digital clock generator, even together with a complete DSP.
As an example, a high-speed on-chip clock is shown. The clock has been fabricated and verified to run at 385 MHz in a 0.8 µm standard CMOS process. The digital on-chip clock is also applied to a digital intermediate frequency filter for the American digital mobile telephone system. The filter, a 12th-order wave digital lattice filter [7], is realized with fixed coefficient [5] bit-serial arithmetic [11]. Section II gives an overview of the clocking principle, and Section III describes the clock in detail. In Section IV, the results from the fabricated high-speed clock are shown. In the same section, the on-chip clocked IF-filter is described.
0018-9200/96$05.00 © 1996 IEEE
II. OVERVIEW
In some DSP architectures, a fixed number of clock cycles is needed to process the input signal. For these systems one would like an on-chip clock or a burst generator that can be hard wired to a fixed number of cycles. Fig. 1 shows the basic idea. An existing off-chip signal at sample rate is used as a trigger for the clock generator. Each clock cycle is counted by a cycle counter. When all the clock cycles are generated, the counter stops the generator. Besides the number of cycles, another critical parameter is the clock period time. The clock period time is designed so that all the cycles are spread evenly over the sample period. Two constraints must be fulfilled. First, the period time must be long enough compared to the delay time in the logic. Second, the period time must be short enough that all the cycles fit within one period of the external data rate. To check these constraints, a simulator like SPICE is sufficient. However, this is a drawback compared to a PLL-based clock generator, since a small timing margin must be allowed.
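The two constraints on the clock period can be sketched as a simple bounds check. This is only a minimal illustration: the function name `period_bounds` and the example numbers (sample rate, cycle count, logic delay) are assumptions chosen for the example, not values from the designs described here.

```python
def period_bounds(sample_rate_hz, n_cycles, logic_delay_s):
    """Return (shortest, longest) feasible clock period in seconds.

    shortest: the period may not be shorter than the worst-case logic delay.
    longest:  all n_cycles must fit within one sample period.
    Raises ValueError when no period satisfies both constraints.
    """
    t_min = logic_delay_s
    t_max = 1.0 / (sample_rate_hz * n_cycles)
    if t_min > t_max:
        raise ValueError("logic too slow: burst does not fit in one sample period")
    return t_min, t_max


# Illustrative numbers only (assumed): 1 MHz sample rate,
# 32 cycles per sample, 10 ns worst-case logic delay.
t_min, t_max = period_bounds(1e6, 32, 10e-9)
print(f"period must lie between {t_min * 1e9:.2f} ns and {t_max * 1e9:.2f} ns")
```

In practice the designed period would sit somewhat below the upper bound, which corresponds to the small margin mentioned above.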
The local clock strategy can be used on a single design on a chip, but it is useful for several processors on the same chip as well. Each processor usually has its own requirements concerning clock frequency and number of cycles. The communication between the different designs is done by a handshaking at the sample rate. A serial input data-stream would also require the local clock signal off the chip to synchronize with the input signal. Therefore, all input and output signals are preferably handled in parallel at sample rate, if the clock shall be kept local.
If we design an entire system or a chip with several processors using one single clock frequency, all parts have to be designed with regard to the driving capability of the most demanding part. The parts that are ready first must also stay in "wait state" until all the different processors are ready. A large synchronously driven chip also requires large on-chip decoupling capacitors to stabilize the power supply.
In the on-chip clocking method, the clock cycles for each processor are spread out over most of the sample period. Thus, each processor can be designed for its specific frequency requirements. The driving capability of the internal circuitry is adjusted in such a way that a minimum of power is required. Furthermore, the local clock is turned off when each single computation is ready, and thus no dynamic power is wasted.
To save power, the internal circuitry is adapted to its specific loads. All cells that have a large load are traditionally provided with a buffer (or large transistors in the last stages). Fig. 2 shows an example of a bit-serial processor. It is natural to make the designs so that all the long wires are loaded only during one of the clock phases (φ). In addition, some of these long wires are terminated with a larger load compared to circuitry within a block. By changing the duty cycle, the loading times are better matched to the time each clock phase requires. Thus, these circuits can be designed with reduced buffer size, or even without any buffers.
III. CLOCK DESIGN
This section describes the circuitry of the digital monolithic clock. Fig. 3 shows a block diagram of the clocking principle. The triggered clock generator has two main parts: a controllable ring oscillator and a cycle counter. The ring oscillator is controlled both by the off-chip Sample rate signal and the Hold signal from the cycle counter. When the Sample rate signal goes high, the ring oscillator begins to oscillate. It will then clock the cycle counter through the small buffer, which in turn drives the Hold signal to a low state. A low Hold signal keeps the oscillator in the running mode even if the Sample rate signal goes low again. When the cycle counter has counted the wanted number of cycles, the Hold signal goes high again and thereby stops the oscillator.
The clock signal from the ring oscillator is taken as the DSP clock via the large buffer. Of course, the DSP clock can also be used to clock the cycle counter. However, the clock generator will then be dependent on the total load of the DSP clock. A large load will affect the time it takes to stop the clock, which can be hazardous since an extra cycle can be generated. From that perspective, it is preferable to clock the counter from a small buffer placed before the large DSP buffer. Such a strategy will simplify the reuse of the clock generator.
The circuit diagram of the ring oscillator is shown in Fig. 4. This is a controllable ring oscillator with an odd number of inverting elements. These elements are connected in a loop, and the oscillator is controlled by the NAND gate. The control signal is provided either from the Sample rate signal or from the Hold signal: a high control signal for running mode and a low control signal for the stop or hold mode.
There are two appropriate methods for adjusting the frequency. First, adding more inverters in the ring oscillator chain will increase the clock period time since each inverter adds an extra delay. The other method to increase the delay is to use longer channel lengths or narrower channel widths of the transistors. The two methods are preferably used in combination. When changing the duty cycle, a special scheme of transistor dimensions is used. Tests have been done on the IF filter described in Section IV, where the clock was found to be well balanced when the first clock phase (φ) was shortened to 25% of the period time. Fig. 5 shows the basic transistor sizing for a clock with 25% duty cycle (MOSIS design rules). Every second transistor width W and every other length L is doubled compared with the minimum dimensions. This is repeated in a "complementary zig-zag" pattern. After the fabrication, it turned out that the shown scaling scheme gave 21-24% duty cycle in the range of 3-5 V supply. However, due to short channel effects the pulse width decreased down to 9% when the supply voltage was lowered to 1 V. Thus, a scheme where the transistor lengths are constant and only the widths are varied will be less sensitive to changes in the supply voltage.
If the conditions are critical, the ring oscillator can be designed for tuning. Known methods are, for instance, to use current-starved inverters [3] or a "variable C_load" [9]. If the clock is designed for tuning, the chip needs another external pin which, unfortunately, must be fed with an "analog control voltage."
[Figure: The cycle counter is designed with a ring counter which has a circulating one. The number of shift-register cells gives the number of cycles that will be generated each sample.]
The cycle counter in Fig. 3 is shown in detail in Fig. 6. It is a shift-register connected as a ring counter. The Hold signal is the output of one of the register cells. Initially, a signal pattern with a single "1" and zeros in all the other cells is loaded into the shift-register. One of the external pins is used for the loading. The loading is only done once in the initial phase. The "1" is loaded at the cell that provides the Hold signal. Thus, the clock begins in the Hold mode. To count the cycles, the pattern is circulated once every clock burst. This circulation is repeated every sample. When the first clock cycle in the burst is executed, the "1" is shifted one step, which sets the Hold signal low. After one cycle the cycle counter takes over the control of the clock generator, and this running mode continues until the "1" has circulated one round in the loop. Thus, the number of cycles is equal to, and hard wired to, the number of register cells in the ring counter.
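The counting behavior of the ring counter can be illustrated with a small behavioral model. This is a sketch under simplifying assumptions: the Start pulse and all circuit delays are ignored, the Hold cell is placed at index 0 (an arbitrary choice), and the function name `clock_burst` is hypothetical.

```python
def clock_burst(n_cells, max_cycles=1000):
    """Behavioral model of the ring-counter cycle counter.

    The register is loaded with a single 1 in the cell that drives Hold
    (index 0 here), so the generator starts in hold mode. Each clock
    cycle shifts the pattern one step; the burst ends when the 1 is
    back in the Hold cell.
    """
    cells = [1] + [0] * (n_cells - 1)      # initial LOAD pattern
    hold_trace = []
    cycles = 0
    while cycles < max_cycles:
        cells = [cells[-1]] + cells[:-1]   # one clock cycle: circulate one step
        cycles += 1
        hold = cells[0]
        hold_trace.append(hold)
        if hold == 1:                      # the 1 has gone once around: stop
            break
    return cycles, hold_trace


cycles, hold_trace = clock_burst(32)
print(cycles)  # number of cycles generated in one burst
```

In this model Hold is low from the first cycle until the last one, and the burst length equals the number of register cells, which matches the hard-wired cycle count described above.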
The shift register gives a relatively large cycle counter especially for long bursts. A binary counter with an appropriate decoding net will save area. However, the upper frequency limit is not bound to the minimum of three inverting elements in the ring oscillator. It is rather limited by the time it takes to turn off the oscillator, where the proposed counter only adds one shift-register delay. In the binary counter we have to add the ripple through all the counter cells plus the time in the decoding logic.
Before the signal processing starts, the clock generator must be initialized. It is important that the initialization can be done regardless of the state of the ring oscillator. The oscillator will not work properly before the cycle counter is loaded. It is also important that all nodes in the register cells are properly initialized at power up. Otherwise, any undefined state will propagate through the shift-register cells. Fig. 7 shows the circuit diagrams for the shift-register cells used, which are designed for two-phase logic. Here, the reset cell is shown, where I stands for initialize (LOAD in Fig. 6). The set cell is the same upside down and with all transistors changed to their opposite polarity. [...] path is gated with the clock φ and φ̄. Thus, the initialization I of all dynamic nodes is done independently of the clock.
The control part in Fig. 3 is shown in greater detail in Fig. 8. When the Sample rate signal arrives, it must be low at least as long as the time it takes for the Hold signal to go low. A Start pulse is formed from the Sample rate signal via the edge trigger. The edge trigger is used to separate the Sample rate signal from the clock generator. Only a negative edge will start a clock burst. The start of the clock generator gives a small latency that may reduce the performance slightly compared to a PLL-based method. However, if the number of cycles is more than a few, the effect will only be marginal.
IV. PROTOTYPE RESULTS

A. A High-Speed Prototype
A 385 MHz, 0.2 mm² prototype of a clock generator was designed and fabricated in a 0.8 µm two-metal-layer standard CMOS process at AMS in Austria. A microphotograph of the test chip is shown in Fig. 9. The two lower blocks belong to the cycle counter. This is a 32-bit ring counter which gives a clock burst of 32 cycles. Above the cycle counter is one row containing the control part and the ring oscillator, including the clock drivers. The remaining part is a frequency divider which divides the clock down to a signal that can be easily measured off-chip.
Fig. 10 is a timing diagram of the clock control. The edge trigger provides a Start pulse that is long enough for the counter. The clock starts to circulate the "1", which gives a low Hold signal. The Hold signal is low until the circulating "1" has passed through all the register cells after 32 cycles. After that, the signal goes high again and stops the clock. Note that at least one of the control signals is low while the clock is running.
As shown in Fig. 11, the upper clock frequency is measured to be 385 MHz with a 5 V supply voltage. As expected, the frequency decreases when the supply voltage is lowered. However, the maximum frequency is still as high as 290 MHz with a 3 V supply voltage. If the voltage is reduced further, the frequency will go down fast since the fabrication is done in a standard 5 V process. A low-voltage process with threshold voltages down to 0.2 V will improve the current drive at low voltages [2], since the delay in a CMOS gate is proportional to $1/(V_{dd} - V_t)^2$. The lower limit for the clock is at 1.45 V supply voltage, where the frequency is measured to be 90 MHz. These frequencies are achieved with a ring oscillator containing seven inverting elements.
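The measured frequencies can be turned into a rough average delay per inverting element. The sketch below assumes the usual first-order ring oscillator relation, in which one period equals two passes around the ring, and it treats all seven elements (including the controlling gate) as having equal delay. Both are simplifying assumptions, and the function name `stage_delay` is hypothetical.

```python
def stage_delay(freq_hz, n_stages):
    """Average delay per inverting element for a ring oscillating at freq_hz.

    One period is two trips around the ring: T = 2 * n_stages * t_d.
    """
    return 1.0 / (2 * n_stages * freq_hz)


N_STAGES = 7  # inverting elements in the high-speed prototype
for vdd, f_mhz in [(5.0, 385), (3.0, 290), (1.45, 90)]:
    t_d = stage_delay(f_mhz * 1e6, N_STAGES)
    print(f"{vdd:>4} V: {f_mhz} MHz -> about {t_d * 1e12:.0f} ps per stage")
```

Under these assumptions the 385 MHz measurement corresponds to roughly 186 ps per stage, including the loading from the buffers attached to the ring.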
The power consumption versus supply voltage is also shown in Fig. 11. The diagram shows the power consumption including the pads and the frequency divider. All the measurements are done with a Sample rate signal corresponding to 1 MHz, i.e., 32 cycles are generated every 1 µs. The power consumption is measured to be 22 mW at 5 V supply. The relationship between the dynamic power consumption and the supply voltage is $P_d = C_L \times f \times V_{dd}^2$. Consequently, when the supply voltage is lowered, the power consumption will decrease quadratically. At 3 V the power consumption is 5 mW.
B. An IF-Filter Prototype
Another application where the clock is used is an intermediate frequency filter. The filter is an IF-filter for the American digital mobile telephone system based on the IS-54 specification [1]. The filter algorithm is based on lattice wave digital filter theory, and it is realized with fixed coefficient bit-serial arithmetic [5], [11]. In today's mobile telephone systems, an analog ceramic or crystal filter is used for the IF-filtering. These filters are expensive since the filter specification is very tight, and therefore each single filter must be measured or tuned. The general idea for this prototype is to move most of the filtering to the digital domain, as illustrated in Fig. 12. In the analog case, the IF filter suppresses the adjacent channels just before the signal is A-to-D converted. In the digital case, a low-cost noncritical analog filter is used for antialiasing, and the main filtering is done in the digital domain. To fulfill the requirements, a 12th-order wave digital lattice filter is used [7].
The filter chip is clocked internally with a local on-chip two-phase clock. Two external pins are used to control the clock. One pin is used for the Sample rate trigger to start a clock burst. The other pin is used to initialize the cycle counter (the I-signal in Fig. 7). The initialization is only needed when the power is switched on. Since no external components are used, these two pins are the only control from the outside. One sample is computed in 38 internal cycles. Thus, the clock generator is hard wired as a 38-cycle burst generator, i.e., the cycle counter described in Section III has 38 shift-register cells. The sample frequency is chosen as 380 kHz, which is four times the center frequency of the IF-filter. This gives a needed internal clock frequency of 14.44 MHz. However, a small frequency margin for the clock generator is practiced. Fig. 13 shows two oscilloscope photos of the 38-cycle burst. The lower trace is the Sample rate signal, and the upper trace shows the internal clock. A 3 V supply is used in the photo, which gave a frequency of 21.5 MHz. However, the filter works at the correct speed down to 2.2 V supply, where the clock frequency crosses 14.44 MHz. The clock was also tested with further lowering of the supply voltage. The IF-filter is still working below 2.2 V but not at the correct speed. At 1.2 V supply the filter gave correct calculations, but the on-chip clock frequency was only 1.8 MHz. The clock works down to about one threshold voltage (0.7 V), which is lower than the voltage at which the rest of the filter works. This differs from the high-speed clock, which stopped oscillating at 1.45 V supply. The reason is the different design conditions. The high-speed clock is designed with regard to minimal propagation time, and the filter clock is designed with regard to "maximal" propagation time. Thereby, the filter clock is more tolerant to low supply voltage. However, the frequency is very low at voltages as low as 0.7 V, where it is measured to be 3 kHz.
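The required clock frequency and the share of the sample period occupied by the burst follow directly from the figures above. In the worked equations below, the symbols $N$ (cycles per sample), $f_s$ (sample frequency), $T_s$ (sample period), and $f_{clk}$ (burst clock frequency) are notation introduced here for illustration only.

```latex
\begin{align*}
T_s &= \frac{1}{f_s} = \frac{1}{380\ \text{kHz}} \approx 2.63\ \mu\text{s}, \\
f_{clk,\min} &= N f_s = 38 \times 380\ \text{kHz} = 14.44\ \text{MHz}, \\
\left.\frac{T_{burst}}{T_s}\right|_{3\ \text{V}} &= \frac{N / f_{clk}}{T_s}
  = \frac{14.44\ \text{MHz}}{21.5\ \text{MHz}} \approx 0.67 .
\end{align*}
```

At 3 V the 38-cycle burst therefore fills about two thirds of each sample period, and the remaining third is the frequency margin available before the 14.44 MHz limit is reached near 2.2 V.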
The photo at the bottom in Fig. 13 shows the first cycles in the burst. A duty cycle of 25% was expected, and it was measured to be 24% at 5 V supply voltage. A minor reduction in power consumption was achieved with the asymmetrical clock approach compared to the symmetrical buffered approach. Four hundred and fifty-one cells are used in the filter. In the symmetrical case, 16 (4%) of them were replaced with cells using larger transistors in the last stages. The chip with the asymmetrical clock consumed 4% less power.
The IF-filter chip is, like the high-speed clock, fabricated at AMS in a 0.8 µm standard CMOS process. The die size is 2.1 × 2.2 mm, or 4.6 mm², and the core area is 2.9 mm². The clock generator is only a minor part of the chip. With a size of 0.8 × 0.2 mm including the clock drivers, it is as little as 3% of the total chip area. It is a 10 MIPS chip using 11 000 transistors. Using a 3 V supply, the chip consumes 8 mW, of which 1 mW is consumed by the clock generator.
A die photo of the chip is shown in Fig. 14. The left part is the data-path, which is mapped directly from the wave-digital data-flow. At the right side several blocks [...] clock buffers, an H-tree-like structure of the clock lines is applied. The clock lines are first routed up and down in the middle vertical channel. After that, they are routed horizontally to left and right through each row of cells.
V. CONCLUSION
A method for local on-chip clocking of custom digital signal processors is presented. The clocking method is well suited for standard CMOS processes, and it is useful for both high-speed and low-voltage applications. The clock uses no analog parts, which makes it less sensitive to process variations. No external components are needed, and only two external pins are used for the clock control. It is shown by some examples that the silicon area cost is small. The local clocking method is also advantageous for both low-voltage and low-power applications compared to off-chip clocking and compared to PLL-based on-chip clocking methods.
He is currently pursuing the Ph.D. degree in the digital signal processing group at the same department, in the field of silicon implementation of custom DSP's related to the communication area, especially with parallel-serial architectures.
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 31, NO. 5, MAY 1996
Mats Torkelson (S'83-M'88) received the M.S.E.E. degree in electrical engineering in 1980 at ETH Zurich, and the Tekn. Lic. and Ph.D. degrees from Lund University, Lund, Sweden, in 1985 and 1990, respectively. He has worked with AD/DA converters for professional audio tape recorders at Willi Studer AG, Switzerland, and with maritime X-band radar collision avoidance systems at Lund University. From 1984 to 1986 he worked part time at the University of California, Berkeley. He heads the digital signal processing group at the Department of Applied Electronics, Lund University, which he initiated in 1986. Since 1994, he has worked part time with Ericsson Radio Systems, Stockholm, Sweden. His current interests are mobile communication, algorithm implementation, and amplifier design.
