Abstract-A new architecture for digital implementation of the adaptive equalizer in Class IV partial-response maximum likelihood (PRML) channels employing parallelism and pipelining is described. The architecture was used in a prototype integrated circuit in a 1.2 µm CMOS technology to implement an eight-tap adaptive equalizer and Viterbi sequence detector which consumes a total of 70 mW from a 3.3 V supply operating at an input sampling rate of 50 MHz.
I. INTRODUCTION
SAMPLED-DATA techniques such as Class IV partial response with maximum likelihood detection (PRML) are being applied to magnetic disk storage channels in order to increase transfer rates and recording densities [1]-[2]. In order to provide a robust implementation of the functions required in the read path of these channels, key blocks such as timing recovery, adaptive equalization, and sequence detection are often implemented in the digital domain. The power consumed by these blocks can be appreciable due to the high speeds of operation required, with data rates currently on the order of 50-100 Mb/s and expected to exceed this in the future. For example, the power dissipation of the CMOS logic alone in a recently reported BiCMOS 65 Mb/s read channel IC implementing digital equalization, sequence detection, and other associated functions was 1 W [3]. Powerful battery-operated portable personal computers are becoming increasingly prevalent, and the performance requirements of the storage devices in these systems continually increase. Power consumption is a critical parameter in such systems, both to increase battery life and to minimize heat dissipation effects due to the proximity of the electronics to the magnetic media in drives with decreasing form factors. The net result is the need for increased speed performance of the electronics at a reduced overall power consumption.
A block diagram of a typical PRML read channel is shown in Fig. 1. The output of the magnetic disk is first amplified by the read amplifier before being passed on to the analog front end, which includes a variable gain amplifier (controlled in an automatic gain control loop not shown), low-pass filter, sampler, and analog-to-digital converter. These are followed by the functions of adaptive equalization, sequence detection, and timing recovery in the digital domain.
The adaptive equalizer operates on the 6 b samples coming from the A/D converter, equalizing these samples for subsequent detection by the sequence detector and use by the timing recovery block.
This paper describes a low-power architecture for digital implementation of an adaptive equalizer suitable for use in PRML magnetic disk drive channels. The new architecture was used in a prototype IC containing an eight-tap adaptive equalizer and Viterbi detector which, at a 50 MHz input sampling rate, consumes a total power of 70 mW operating from a 3.3 V power supply in a 1.2 µm CMOS technology [14]. The reduction in operating power relative to the current practice was made possible through the use of an optimum mixture of parallelism and pipelining, which in turn allowed the use of a 3.3 V supply while meeting the throughput requirements.
II. LOW-POWER ADAPTIVE EQUALIZER ARCHITECTURE AND DESIGN
As has been recently demonstrated in other work [4], major reductions in power dissipation in such digital functions are possible through the use of lower power supply voltages, since the power consumed by a CMOS digital circuit is proportional to $CV^2 f$. The corresponding increase in gate delay is only linear with supply voltage, all else being equal, so that the effective power-delay product is improved at lower voltages. The increased gate delay can be accommodated through the use of a combination of parallelism and pipelining, requiring more gates for the implementation, but giving major reductions in overall power dissipation. The optimum combination of parallelism and pipelining is highly dependent on the particular function being performed, and the exploration of these tradeoffs for the implementation of an adaptive equalizer suitable for use in PRML magnetic disk drive read channels is the main topic of this paper.
Fig. 2. Four multipliers operating staggered in phase by the output period T to realize an increased multiplying rate.
Magnetic disk drive read channels are unique both because of the high sampling and bit rates involved, and because of the relatively modest SNR of the signals detected off the disk. Most implementations of PRML channels use a 6 b signal representation into the adaptive equalizer, dictated by the off-channel signal-to-noise ratio. This small word width has major implications for the minimization of power through the use of parallelism, since multipliers for this width are relatively compact. Extensive simulations were performed to explore various permutations of parallelism and pipelining in the implementation of a conventional multiplier [5] for filter sampling rates above 50 MHz and a power supply of 3.3 V. It was found that for implementation of a 6 × 6 b multiplier, the use of four multipliers operating in parallel and staggered in phase by the output period T, as shown in Fig. 2, resulted in the solution dissipating the lowest overall power. This advantage of low power comes at the cost of increased silicon area which, however, will scale with technology. For the required resolution of 6 b, the overhead associated with pipeline latches in a pipelined implementation, as opposed to a parallel implementation of the multiplier, increases the power while decreasing the attainable speed (due to latch setup times).
While the parallel architecture helps to increase overall throughput for a given supply voltage, system latency is also increased. This latency will have implications from a system standpoint, particularly in applications in which the block employing parallelism is nested in one or more feedback loops. The resulting delay will affect the stability of the feedback loops and must be considered in the system design.
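The voltage-scaling argument made at the start of this section can be illustrated numerically. The following is a minimal sketch, not a model of the prototype: the switched capacitance, the 10% latch and multiplexing overhead, and the premise that a single unit would need a 5.0 V supply to reach the full rate are all assumptions chosen only to show how the $CV^2 f$ relation behaves.

```python
def dynamic_power(c_farads, vdd_volts, f_hertz):
    """Dynamic CMOS switching power, P = C * V^2 * f."""
    return c_farads * vdd_volts ** 2 * f_hertz


# Hypothetical numbers for illustration only (not taken from the paper).
C_UNIT = 10e-12     # switched capacitance of one multiplier in farads (assumed)
F_OUT = 50e6        # required output rate in Hz
N_PARALLEL = 4      # number of parallel units
OVERHEAD = 1.1      # assumed 10% extra capacitance for latches and muxing

# One fast unit, assumed to need 5.0 V to run at the full output rate.
p_single = dynamic_power(C_UNIT, 5.0, F_OUT)

# Four slower units at 3.3 V, each clocked at one-fourth the output rate.
p_parallel = N_PARALLEL * dynamic_power(C_UNIT * OVERHEAD, 3.3, F_OUT / N_PARALLEL)

ratio = p_parallel / p_single
print(f"single: {p_single * 1e3:.2f} mW, "
      f"parallel: {p_parallel * 1e3:.2f} mW, ratio: {ratio:.3f}")
```

Under these assumed values the total clocked capacitance grows, but each unit runs at one-fourth the rate, so the saving comes almost entirely from the squared supply-voltage term.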
A. Parallel Filter Architecture
A block diagram of a filter stage used to implement the equalizer is shown in Fig. 3(a); it comprises a delay line, a set of multipliers, and an accumulator. The morphology of the filter stage resembles the block diagram of an FIR filter. In order to use multipliers which take four output periods to perform multiplication, four filter stages are used in parallel. The multiplier and accumulator sections are each clocked at one-fourth the output rate of the equalizer and operate staggered in phase by one output period. The parallel architecture and timing diagrams for this approach are shown in Fig. 3(b) and (c), respectively. The use of input latches in both the multipliers and accumulator enables the functions of multiplication and accumulation to be pipelined, as shown for one filter stage in Fig. 3(d). During each filter stage clock cycle, the current eight inputs into a filter stage are multiplied by their respective tap weights. The resulting products are summed by the accumulator during the next clock cycle of that filter stage.
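The equivalence between the four staggered filter stages and a single full-rate FIR filter can be checked with a short behavioral sketch. This is only a functional model under stated assumptions: the function names are hypothetical, arithmetic is unbounded Python integers rather than 6 b words, and the latch timing of Fig. 3 is not modeled.

```python
def direct_fir(samples, taps):
    """Reference FIR: y[k] = sum_i taps[i] * x[k - i], with x = 0 before the start."""
    out = []
    for k in range(len(samples)):
        acc = 0
        for i, c in enumerate(taps):
            if k - i >= 0:
                acc += c * samples[k - i]
        out.append(acc)
    return out


def parallel_fir(samples, taps, stages=4):
    """Same filter computed by `stages` interleaved filter stages.

    Stage s produces outputs k = s, s + stages, s + 2*stages, ...; each stage
    therefore has `stages` output periods to finish its multiplies and sum.
    """
    n = len(taps)
    out = [0] * len(samples)
    for s in range(stages):
        for k in range(s, len(samples), stages):
            # Vector of the current n inputs seen by this stage.
            window = [samples[k - i] if k - i >= 0 else 0 for i in range(n)]
            products = [c * x for c, x in zip(taps, window)]  # multiply phase
            out[k] = sum(products)                            # accumulate phase
    return out


# Hypothetical eight-tap weights and input samples, for illustration only.
taps = [0, -1, 3, -8, 20, -8, 3, -1]
samples = [5, -7, 12, 0, -31, 18, 4, -2, 9, 30, -16, 1, 7]
print(parallel_fir(samples, taps) == direct_fir(samples, taps))
```

Each stage only ever produces every fourth output, which is what allows its multipliers to take four output periods per product.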
B. Multiplier Circuit Design
Much of the discussion thus far has focused on block level considerations to reduce power consumption of the multiplier function. Further optimization can occur with appropriate design choices at the circuit level. The multipliers were constructed using the Baugh-Wooley two's complement parallel array algorithm [5]. The benefits of two logic styles, complementary pass logic and static CMOS, were combined to implement the full adders, as shown in Fig. 4. To reduce the number of transistors, and consequently the power consumption associated with extra parasitic drain and source capacitances, complementary pass logic was used in the sum generation portion of the full adder. To increase the speed and drive capability, static CMOS was used both at the output of the sum generator and to implement the carry generation signal.
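For readers unfamiliar with the Baugh-Wooley scheme, the bit-level sketch below shows how a two's complement product can be formed from an array of single-bit partial products that are all added with positive weight. It uses the commonly described modified form (complemented sign-row partial products plus two constant bits), which is an assumption here; the paper does not state which variant or array arrangement the prototype uses, and the function name is hypothetical.

```python
def baugh_wooley_multiply(a, b, n=6):
    """Multiply two n-bit two's complement integers using only positively
    weighted partial-product bits (modified Baugh-Wooley form)."""
    mask = (1 << n) - 1
    a_bits = [((a & mask) >> i) & 1 for i in range(n)]
    b_bits = [((b & mask) >> j) & 1 for j in range(n)]

    total = 0
    for i in range(n):
        for j in range(n):
            pp = a_bits[i] & b_bits[j]
            # Partial products involving exactly one sign bit carry negative
            # weight; they are replaced by their complements.
            if (i == n - 1) != (j == n - 1):
                pp ^= 1
            total += pp << (i + j)

    # Constant bits that compensate for the complemented partial products.
    total += (1 << n) + (1 << (2 * n - 1))

    # Keep 2n bits and interpret the result as two's complement.
    total &= (1 << (2 * n)) - 1
    if total >> (2 * n - 1):
        total -= 1 << (2 * n)
    return total


# Exhaustive check over all 6 b operand pairs.
all_ok = all(
    baugh_wooley_multiply(a, b) == a * b
    for a in range(-32, 32)
    for b in range(-32, 32)
)
print(all_ok)
```

Because every partial product is added rather than subtracted, the array can be built from identical full-adder cells, which is what makes the circuit-level choices described above apply uniformly across the multiplier.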
C. Coefficient Update
The least mean square (LMS) algorithm is often used to update the tap weights in the adaptive equalizer. In the LMS algorithm, each coefficient C is updated using the equation
$$C_{k+1} = C_k + \beta e_k x_k$$
where $\beta$ is the step size, $e_k$ the input error to the slicer at time k, and $x_k$ the input to the particular tap weight at time k. Implementation of this algorithm requires two multiplies following generation of the slicer error in order to obtain the correction term which is added to the current value of the coefficient, as shown in Fig. 5(a). The time required for these multiplies adds directly to the latency in the coefficient update. Simulation results indicated that the latency associated with the parallel multipliers degraded stability in the adaptation loop, in turn requiring very small values of $\beta$ and long adaptation times to eliminate stability problems.
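The full LMS update described above can be sketched in a few lines to make the two multiplies per tap explicit. This is a floating-point illustration with hypothetical names and example values; it assumes the error is defined so that the correction term is added, and it does not model the fixed-point word widths or the update latency.

```python
def lms_update(coeffs, inputs, error, beta):
    """One full LMS step: C[k+1] = C[k] + beta * e[k] * x[k] for every tap.

    Each tap needs two multiplies (beta * e, then * x) before the add.
    """
    scaled_error = beta * error                 # first multiply, shared by all taps
    return [c + scaled_error * x                # second multiply, one per tap
            for c, x in zip(coeffs, inputs)]


# Hypothetical example values.
coeffs = [0.0, 1.0]
inputs = [2.0, -1.0]
new_coeffs = lms_update(coeffs, inputs, error=0.5, beta=0.25)
print(new_coeffs)
```

In hardware, both multiplies sit inside the adaptation loop, which is why their latency bears directly on loop stability.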
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 30, NO. 3, MARCH 1995

In order to reduce this feedback latency in the update of the coefficients, the sign-sign LMS algorithm was employed, where the coefficients are updated using the equation
$$C_{k+1} = C_k + \beta \, \mathrm{sgn}(e_k) \, \mathrm{sgn}(x_k)$$
where sgn(.) is the signum function, which is +1 or -1 for a positive or negative argument, respectively. The coefficient update then reduces to
$$C_{k+1} = C_k \pm \beta$$
depending on the product $\mathrm{sgn}(e_k) \cdot \mathrm{sgn}(x_k)$. In the implementation, both $C_k + \beta$ and $C_k - \beta$ are computed in parallel, while $\mathrm{sgn}(e_k) \cdot \mathrm{sgn}(x_k)$ is computed using an exclusive-OR operation on the sign bits of both the error and the input to the respective tap. Depending on the outcome of the exclusive-OR operation, $C_k + \beta$ or $C_k - \beta$ is selected. This is shown schematically in Fig. 5(b). This approach reduces the feedback latency from 16 periods down to 8, as listed in Fig. 5(c), and performs robustly. Another advantage of the sign-sign LMS algorithm is that the multiplications required in the full LMS algorithm reduce to a single exclusive-OR, which results in significant power and area savings. Information on the stability of the sign-sign LMS algorithm can be found in the literature.
Fig. 6. Routing of the filter coefficients through each filter stage.
Because the disk drive channel is slowly time varying, the output of only one of the four parallel filter stages is used in the update of the coefficients. This same set of coefficients is used by all four filter stages. In order to reduce timing requirements on the availability of the coefficient set to each of the four filter stages, the approach shown in Fig. 6 is used. New coefficients from the coefficient update block need be available only during the latching instant of Filter Stage 1. The coefficients are held by the Filter Stage 1 latches for the multipliers there, as well as applied to the inputs of the latches in Filter Stage 2, where they are sampled one output period later. This process is repeated until the coefficients are latched into Filter Stage 4.
Since the outputs from the coefficient update block need only be available during the latching instant of Filter Stage 1, their availability requirements are much less than if the coefficient update block were to feed each of the filter stages directly. Furthermore, the amount of parasitic capacitance which would otherwise have to be driven by one set of buffers if all four filter stages were to receive the coefficients directly is greatly reduced.
A three-tap raised-cosine equalizer response with tap weights -K, 1.0, -K was assumed for the initial equalizer impulse response before convergence. An externally controlled initialization signal allows the fourth through sixth coefficients to be initialized to tap weight values provided off-chip. While these three taps are set to their respective values (the fourth and sixth tap weights are the same), the other five tap weights are forced to zero. The ability to externally set each of the tap weights individually could easily have been implemented, but was avoided in the prototype for simplicity. The value of the step size $\beta$ is also provided externally, and can be programmed to implement gear-shifting algorithms for faster convergence.
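The sign-sign update described in this subsection, in which both candidate coefficients are formed and an exclusive-OR of the sign bits selects between them, can be sketched as follows. The names are hypothetical, integers stand in for the fixed-point coefficient words, and treating a zero operand as positive is an assumption that matches a two's complement sign bit rather than anything stated in the paper.

```python
def sign_bit(value):
    """Two's complement style sign bit: 1 for negative, 0 otherwise.

    Zero is treated as positive here (an assumption).
    """
    return 1 if value < 0 else 0


def sign_sign_update(coeffs, inputs, error, beta):
    """One sign-sign LMS step: C[k+1] = C[k] + beta * sgn(e[k]) * sgn(x[k])."""
    updated = []
    for c, x in zip(coeffs, inputs):
        plus, minus = c + beta, c - beta               # both candidates formed in parallel
        signs_differ = sign_bit(error) ^ sign_bit(x)   # exclusive-OR of the sign bits
        updated.append(minus if signs_differ else plus)
    return updated


# Hypothetical example values.
coeffs = [0, 8, -4]
inputs = [5, -3, 7]
new_coeffs = sign_sign_update(coeffs, inputs, error=-2, beta=1)
print(new_coeffs)
```

No multiplier appears anywhere in the update path; the only data-dependent operation is the single exclusive-OR that drives the selection.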
D. Input Data Flow
The chip receives two 6 b words at a one-half rate, relaxing the timing requirements for latching of the input data while having no effect on the equalizer performance. Input samples are supplied via a tapped delay line (Fig. 7) and passed from Filter Stage 1 to Filter Stage 2, to be latched along with one new sample on the rising edge of Clock 2. This process is again repeated for Filter Stages 3 and 4 on the rising edge of Clocks 3 and 4, respectively. This approach reduces the amount of data needing to be passed around the chip at high speed and relaxes timing requirements on the movement of most of the input data throughout the chip. Note that the desired convolution occurs as a vector of input samples moves diagonally through the parallel stages.
III. SEQUENCE DETECTOR
The sequence (Viterbi) detector is designed for a PR-IV target, and as such is realized by two interleaved independent half-rate Viterbi detectors [6], one operating on the outputs of Filter Stages 1 and 3 and the other on the outputs of Filter Stages 2 and 4, as shown in Fig. 8. The input words into the Viterbi detector blocks are 6 b wide. The outputs of both detectors contribute to a two-wide one-half rate bit stream which is output off-chip (commutator shown in the figure). Descriptions of Viterbi detector implementations can be found in the literature [16]-[17].
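The reason two independent half-rate detectors suffice is that the Class IV target, $1 - D^2$, couples each sample only to the sample two periods earlier, so the even and odd sample streams each see a $1 - D$ (dicode) target. The sketch below illustrates this with a textbook two-state Viterbi detector. It is not the prototype's detector: bits in {0, 1}, a known all-zero starting state, floating-point squared-error metrics, and full path storage are all simplifying assumptions, and the function names are hypothetical.

```python
def pr4_channel(bits):
    """Ideal Class IV samples y[k] = a[k] - a[k-2], with a = 0 before the start."""
    return [bits[k] - (bits[k - 2] if k >= 2 else 0) for k in range(len(bits))]


def dicode_viterbi(samples):
    """Two-state Viterbi detector for a 1 - D target (state = previous bit)."""
    inf = float("inf")
    metric = [0.0, inf]          # assumed known starting state: previous bit = 0
    paths = [[], []]
    for y in samples:
        new_metric = [inf, inf]
        new_paths = [None, None]
        for prev in (0, 1):
            if metric[prev] == inf:
                continue
            for bit in (0, 1):
                m = metric[prev] + (y - (bit - prev)) ** 2   # squared-error branch metric
                if m < new_metric[bit]:
                    new_metric[bit] = m
                    new_paths[bit] = paths[prev] + [bit]
        metric, paths = new_metric, new_paths
    best = 0 if metric[0] <= metric[1] else 1
    return paths[best]


def pr4_detect(samples):
    """Detect a Class IV sequence with two interleaved half-rate detectors."""
    out = [0] * len(samples)
    out[0::2] = dicode_viterbi(samples[0::2])   # even-indexed samples
    out[1::2] = dicode_viterbi(samples[1::2])   # odd-indexed samples
    return out


# Hypothetical test pattern with a small deterministic disturbance added.
bits = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0]
ideal = pr4_channel(bits)
noisy = [y + (0.2 if k % 2 else -0.2) for k, y in enumerate(ideal)]
print(pr4_detect(ideal) == bits, pr4_detect(noisy) == bits)
```

Each half-rate detector processes one sample every two output periods, which fits naturally with the staggered filter stage outputs.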
IV. CHIP PLAN AND LAYOUT
Due to the regularity in the flow of both the input data and the filter taps throughout the chip (shown in Figs. 6 and 7), a close mapping of the block diagram to the layout was possible. Fig. 9 indicates the flow of the input data and coefficient into each multiplier cell. The two words enter their respective latches, where they are held for the duration of the multiply as well as passed through the cell to the next filter stage. A high-level view illustrating the flow of the input data and coefficients through the filter is shown in Fig. 10. A die photo of the chip is shown in Fig. 11. Note that the layout closely resembles the block diagram shown in Fig. 10.
V. EXPERIMENTAL RESULTS
A prototype IC including the adaptive equalizer and sequence detector was fabricated in a 1.2 µm MOSIS CMOS process.
To verify performance, the chip was supplied via a logic analyzer with pseudorandom channel samples generated using Ptolemy, a system simulation tool [11]. A linear channel was assumed and modeled using a Lorentz pulse [18] to represent transitions on the disk. PW50/T is a parameter used in the Lorentz model as a relative measure of the intersymbol interference (ISI) in the channel, where higher numbers for PW50/T correspond to more ISI. Typical disk drive channels today can be characterized with values of PW50/T between 1.5 and 2.0, and will be approaching 2.5 in the future. Since it requires the most boost from the equalizer, a PW50/T value of 2.5 was used to generate channel samples for all functional testing of the prototype. Experiments were performed by initializing the equalizer taps to intentionally introduce misequalization, feeding the generated channel samples via the logic analyzer to the prototype, and observing the equalizer outputs and the tap weight values over time. The equalizer outputs approached the desired target response, while the tap weights converged to their desired values for pseudorandom data. Functionality was further verified by comparing the equalizer outputs to previous system simulations.
The power consumption for a supply voltage (Vdd) of 3.3 and 5.0 V is plotted as a function of output rate in Fig. 12. The prototype circuit was originally designed to operate at 100 MHz with a power supply of 3.3 V. Due to an error in the extraction procedure during the design process, the capacitance in the critical path through the accumulator and coefficient update circuitry was underestimated. As a result, in order to achieve proper operation at 100 MHz, the supply voltage had to be increased to 5.0 V. Extrapolating the power consumption at low frequencies suggests that at 3.3 V, it is feasible that 100 MHz operation could be achieved in
VI. SUMMARY
In this paper, an architecture for implementation of adaptive equalizers as required in applications such as the PRML magnetic disk read channel was described. The architecture utilizes parallelism and pipelining, which enables high-speed operation at a reduced power supply, resulting in a speed-power ratio of 1.4 mW/MHz (for eight taps), which compares favorably to conventional approaches. The design of a prototype digital detector including the functions of adaptive equalization and sequence detection was described. The prototype demonstrates the potential of the proposed architecture to implement these key functions at high speed in a relatively conservative technology at a low power consumption.
The speed-power ratio advantage of the described approach comes at the cost of increased hardware, system complexity, and latency. The optimum combination of parallelism and pipelining depends on the requirements of a particular system and the available power supply values. In most applications, the power supply values are set by those already available in the particular system, although efficient dc-dc conversion techniques are specifically being developed to optimize the combination of system performance and power consumption [15]. Latency affects adaptive algorithms such as those used to control the coefficients of the equalizer (in the case of the adaptive equalizer), as well as those which operate on the equalizer outputs. The allowable latency depends upon the convergence requirements of the different algorithms within a particular system, and simulations must be performed to explore the specific performance/power savings tradeoffs particular to that system.
