Abstract-A high-throughput low-latency digital finite impulse response (FIR) filter has been designed for use in partial-response maximum-likelihood (PRML) read channels of modern disk drives. The filter is a hybrid synchronous-asynchronous design. The speed-critical portion of the filter is designed as a high-performance asynchronous pipeline sandwiched between synchronous input and output portions, making it possible for the entire filter to be embedded within a clocked system. A novel feature of the filter is that the degree of pipelining is dynamically variable, depending upon the input data rate. This feature is critical in obtaining a very low filter latency throughout the range of operating frequencies. The filter is a ten-tap six-bit FIR filter, fabricated in a 0.18-μm CMOS process. Resulting chips were fully functional over a wide range of supply voltages, and exhibited throughputs of over 1.3 giga-items/s, and latencies of 2-5 clock cycles. Interestingly, the filter throughput was limited by the synchronous portion of the chip; the internal asynchronous pipeline was estimated to be capable of significantly higher throughputs, around 1.8 giga-items/s. More importantly though, the adaptively pipelined nature of the filter allows it to offer a worst-case latency of only 10 ns, which is half the worst-case latency of the best previously reported comparable fully-synchronous implementation by Rylov et al.
stream through the filter, a process known as equalization. The filter output is then passed through a partial-response maximum-likelihood (PRML) detector [3], [4], which uses a finite history of inputs to compute the likelihood of the current input being a "1" or a "0." The filter itself belongs to a larger category called finite impulse response (FIR) filters [5].
The design of the filter chip is an interesting case study for several reasons. First, the chip has two distinct timing domains, one clocked and the other asynchronous. Second, the filter pipeline uses a mix of static and dynamic logic function blocks: the asynchronous domain uses dynamic blocks, and the clocked one uses static logic. Third, the degree of pipelining dynamically adapts to the data rate. Finally, as a real-world case study, the design exhibits a highly varied datapath, ranging from 30 to 216 wires in width over the length of the pipeline.
This work provides a concrete demonstration of the performance benefits of asynchrony, and shows how those benefits can be realized even when asynchronous design is used in only a portion of an overall synchronous system. In particular, performance benefits of asynchrony are demonstrated in an industrial-strength design, with direct comparison to the leading previous synchronous implementation. Further, this design shows that by exploiting domain-specific knowledge, mixed synchronous-asynchronous operation can avoid performance degradation typically associated with crossing timing domains. Such demonstrations will hopefully open up new avenues for the broader application of asynchronous design and demonstrate its viability for leading-edge industrial applications.
The main novelty of the filter is the dynamically variable pipeline depth, and hence a variable latency (as measured in clock cycles), which can adapt to varying input data rates. This behavior is intrinsic to the asynchronous nature of the pipeline; no architectural modifications are needed. In particular, since the asynchronous datapath can contain a variable number of data items, the effective "synchronous pipeline depth" of the overall filter is dynamically variable. The advantage of this adaptive pipelining feature is that the chip naturally handles slow synchronous environments with a low latency penalty (in terms of number of clock cycles), yet can still accommodate fast synchronous environments as a highly pipelined design. This adaptive nature was the main motivation for pursuing a mixed synchronous-asynchronous approach for the design of the filter. In contrast, a comparable fully clocked pipeline would be limited to a fixed pipelining depth, with significantly longer nanosecond latencies at low input data rates.
The pipeline style used for the asynchronous portion of the filter is the high-capacity (HC) pipeline style introduced in [6] and [7]. This style is for dynamic logic implementations, and provides the benefits of high throughput, low latency, and a 100% storage capacity without the use of explicit latches (i.e., each stage is latchless, yet can hold a distinct data item). A prior design by Lines, called precharge full buffers [8], also achieves full capacity, but is a dual-rail style which requires completion detectors [9], and as a result has significantly longer cycle times. Several other high-speed asynchronous pipeline styles have been proposed recently [10]-[13], but each of these has disadvantages compared to HC. The pipelines of [10] and [11] have the drawback of more complex timing constraints, requiring aggressive circuit techniques and much designer effort. HC pipelines, on the other hand, have a much simpler implementation and less stringent timing requirements. The pipelines of [12] and [13], although comparable to HC in performance and ease of design, have only half the storage capacity.
The remainder of this paper is organized as follows. Section II gives background on read channel filters, and on high-capacity pipelines. Section III gives an overview of the filter architecture, and then Section IV presents the detailed implementation. Section V discusses the operation of the filter, focusing on the adaptive pipelining feature. Performance analysis is provided in Section VI, and measurements of chip performance are given in Section VII. Finally, Section VIII gives conclusions.
II. BACKGROUND

A. Read Channel FIR Filters
Read channel filters are used in all magnetic and optical disk drives. The function of a read channel is to take the noisy data picked up by the read head, and turn it into a clean stream of "0" and "1" symbols. With ever-increasing data rates available from magnetic and optical media, high-speed read channel filters have become key to the design of modern disk drives. This subsection provides a brief background on read channels, and reviews the theory and implementation of commonly used digital read channel filters.
1) Disk Drive Read Channels:
The architecture of a typical read channel is shown in Fig. 1 (adapted from [14] and [15] ). The magnetic data picked up by the read head first undergoes analog processing, consisting of amplification and removal of high-frequency noise. The analog signal is then sampled and converted to digital form for digital processing. The digital portion of the channel performs equalization to partially remove intersymbol interference, and then uses a PRML detector (e.g., a Viterbi detector) to decode the data value. The equalization step is performed by an FIR filter. The filter's output is also used in two feedback loops in the system: 1) in a timing recovery circuit that generates the clock used by the sampling circuitry and 2) in the part of the system that controls the amplification of the variable-gain analog amplifier.
There are several design requirements for read channel filters. First, such filters must handle high data rates. For example, rates of up to 1 GHz (in m) are commonly used in consumer electronics and PCs. Second, the filter must be able to handle wide variations in the data rate, often a factor of up to 5, because the bit rate varies as the disk head moves between tracks. Finally, the filter must exhibit a low latency because it is on the critical path of two feedback control loops: the timing recovery loop, and the gain control loop (see Fig. 1). The simultaneous requirement of high throughput and low latency is a design challenge. In particular, high throughput is typically achieved by pipelining the datapath at a fine granularity. However, such pipelining results in a large number of pipeline stages and, therefore, longer latency due to storage overheads. This problem is exacerbated at lower data rates; when the clock rate (recovered from incoming data) is slowed down, the latency of the pipeline, which is fixed in terms of number of clock cycles, becomes significantly longer in terms of nanoseconds. The longer latency can severely impact the performance and stability of the feedback control system.
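The effect of a fixed cycle count on nanosecond latency can be illustrated with a short calculation. The following is only a sketch: the function name `latency_ns` and the clock rates are assumptions chosen for illustration (spanning the factor-of-5 range mentioned above), and the depths of 5 and 2 cycles are simply the two ends of the 2-5 cycle range quoted in the abstract, not a measured mapping of depth to clock rate.

```python
def latency_ns(depth_cycles: int, clock_ghz: float) -> float:
    """Latency in nanoseconds of a pipeline that is `depth_cycles` clock cycles deep."""
    return depth_cycles / clock_ghz

# Hypothetical clock rates covering a factor-of-5 range of data rates.
for clock_ghz in (1.0, 0.5, 0.2):
    fixed = latency_ns(5, clock_ghz)  # depth fixed at 5 cycles at every rate
    print(f"{clock_ghz:.1f} GHz: fixed 5-cycle depth -> {fixed:.1f} ns")

# At the slowest rate, a shallower effective depth keeps the latency lower.
print(f"0.2 GHz: 2-cycle depth -> {latency_ns(2, 0.2):.1f} ns")
```

With a fixed depth, the latency in nanoseconds grows in inverse proportion to the clock rate, which is the problem an adaptive pipeline depth is meant to avoid.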
A key goal of this research is to obtain high throughput yet an acceptably low nanosecond latency at all data rates. The use of an asynchronous datapath at the core of the filter is critical to achieving these twin objectives, as discussed in depth in Section V.
2) Theory: In a digital FIR filter [5], the output at any given time, $y_t$, is a weighted sum of the $k$ most recent inputs, $x_t, x_{t-1}, \ldots, x_{t-k+1}$:
$$y_t = \sum_{i=0}^{k-1} w_i \, x_{t-i} \qquad (1)$$
where $w_i$ are the constant weights by which the inputs are weighted. Such a filter, with $k$ weights, is said to be a "$k$-tap" filter. Each of the terms, $w_i x_{t-i}$, is called a "partial sum."
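The weighted sum above can be sketched directly in a few lines. This is a minimal reference model, not the chip's implementation: the function name `fir_filter`, the three-tap weights, and the assumption that inputs before the first sample are zero are all illustrative choices.

```python
from collections import deque

def fir_filter(samples, weights):
    """Yield the weighted sum of the len(weights) most recent inputs for each new sample.

    Inputs that precede the first sample are assumed to be zero.
    """
    history = deque([0] * len(weights), maxlen=len(weights))
    for x in samples:
        history.appendleft(x)  # history[i] now holds the input from i samples ago
        yield sum(w * h for w, h in zip(weights, history))

weights = [1, 2, 3]  # hypothetical three-tap weights
outputs = list(fir_filter([1, 0, 0, 4], weights))
print(outputs)
```

Feeding a single nonzero sample followed by zeros makes the output trace out the weights one by one, which is why such a filter is said to have a finite impulse response.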
3) Implementation: Distributed Arithmetic Style: Several implementations of a read channel filter are possible. For example, one could use one or more multiplier units to compute each of the product terms, and then use one or more adders to produce the final result.
The filter design of this paper uses a particular approach that is very well-suited for a high-performance implementation: the distributed arithmetic architecture [16] . This approach does not use multiplier units. Instead, partial sums are precomputed and stored in a lookup table (LUT), indexed by the input data values. As a result, each multiplication can be performed quite fast, typically in a single clock cycle. Details on the implementation of this architecture are provided later in Sections III and IV.
Several techniques are used to keep the size of the LUT manageable. First, the entire multiplication operation is bit-sliced. Second, within each bit slice, the input values are separated into two groups, one containing only the even-indexed values and the other only odd-indexed values, with each group having its own distinct LUT. Finally, a particular data representation style is used which introduces symmetry into the LUT, further reducing the amount of storage needed. Each of these techniques is now discussed in detail.
a) Bit slicing: Suppose each input value has $m$ bits. The expression (1) can be evaluated separately for each bit position in the input stream, and then the individual results can be appropriately aligned and added together, to produce the same result as would be obtained if the computation were performed directly with the $m$-bit inputs.
As a further optimization, the result of this entire expression can be precomputed and stored in an LUT. For a $k$-tap filter, the $k$ most recent input values for that particular bit position form a $k$-bit word that is used as the address to access the table. Each LUT will therefore have $2^k$ entries. For a ten-tap filter, this corresponds to a table with 1024 entries.
b) Partitioning: The size of the LUT can be significantly reduced by partitioning the inputs into even- and odd-indexed groups. That is, starting with the current input, every other input belongs to the "even group"; the remaining inputs belong to the "odd group." The even and odd groups have their own LUTs. Therefore, for a ten-tap filter, there are two LUTs, each having a five-bit address word and only 32 entries. This represents a dramatic reduction in the size of memory required, from one 1024-entry table to two 32-entry tables. However, there is a slight tradeoff: twice the number of partial sums are generated, requiring an additional adder stage to combine them.
c) Exploiting symmetry: To further cut the table size in half, a data representation scheme is used that makes the table symmetric. In particular, the signed-digit offset binary notation [16] is used, in which the symbols "0" and "1" stand for negative and positive coefficients of powers of 2. For example, in this notation, the four-bit number "1001" stands for the value $+2^3 - 2^2 - 2^1 + 2^0 = 3$. The advantage of this representation is that arithmetic negation is simply achieved by complementing each bit: "0110" stands for the value $-2^3 + 2^2 + 2^1 - 2^0 = -3$. An interesting feature of the filter (1) is that, if all the inputs are negated, the filter output is also negated. Consequently, when this representation is used, if two address words for the lookup table are bitwise complements of each other, then the corresponding table entries will also be bitwise complements of each other. Exploiting this symmetry, half of the table can be discarded.
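The three techniques can be combined in a small software model that checks a table-based result against a direct multiply-and-add. This is a sketch under stated assumptions: the weights are hypothetical integers, all function names are invented for illustration, and the symmetry is applied with ordinary integer negation rather than the bitwise complement of stored entries used in hardware.

```python
K, M = 10, 6  # taps and bits per input, as in the filter described here
WEIGHTS = [3, -1, 4, 1, -5, 9, 2, -6, 5, 3]  # hypothetical integer weights

def sd_value(word, m=M):
    """Value of an m-bit word in signed-digit offset binary: bit 1 -> +2^j, bit 0 -> -2^j."""
    return sum((1 if (word >> j) & 1 else -1) << j for j in range(m))

def build_half_lut(weights):
    """Precompute the signed sums of weights for addresses whose top bit is 0 (half the table)."""
    n = len(weights)
    return [sum(w if (addr >> i) & 1 else -w for i, w in enumerate(weights))
            for addr in range(1 << (n - 1))]

def lookup(half_lut, addr, n):
    """Full-table entry for an n-bit address: a complemented address gives a negated entry."""
    if (addr >> (n - 1)) & 1:
        return -half_lut[~addr & ((1 << (n - 1)) - 1)]
    return half_lut[addr]

EVEN_W, ODD_W = WEIGHTS[0::2], WEIGHTS[1::2]  # partition into even and odd groups
EVEN_LUT, ODD_LUT = build_half_lut(EVEN_W), build_half_lut(ODD_W)

def pack(bits):
    """Pack a list of bits into an address word, first bit in the least significant position."""
    return sum(b << p for p, b in enumerate(bits))

def da_output(history):
    """history[i] is the M-bit input from i samples ago; the output uses LUT lookups only."""
    total = 0
    for j in range(M):  # one bit slice per input bit position
        bits = [(x >> j) & 1 for x in history]
        partial = (lookup(EVEN_LUT, pack(bits[0::2]), len(EVEN_W))
                   + lookup(ODD_LUT, pack(bits[1::2]), len(ODD_W)))
        total += partial * (1 << j)  # align the slice result and accumulate
    return total

def direct_output(history):
    """Reference result: decode each input, multiply by its weight, and add."""
    return sum(w * sd_value(x) for w, x in zip(WEIGHTS, history))

history = [0b101101, 0b000000, 0b111111, 0b010101, 0b100001,
           0b011110, 0b001100, 0b110011, 0b101010, 0b000111]
print(da_output(history), direct_output(history), len(EVEN_LUT), len(ODD_LUT))
```

Each stored half-table has 16 entries, so the model needs two 16-entry tables where an unpartitioned, unfolded table would need 1024 entries.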
B. High-Capacity Asynchronous Pipelines
High-capacity (HC) asynchronous dynamic pipelines [6] , [7] are used to implement the asynchronous portion of the filter chip. An HC pipeline is latchless: the datapath consists only of dynamic logic gates (staticized using weak inverter feedback); no explicit latches are used. Key to HC pipelines is a novel communication protocol that maximizes the pipelines' storage capacity by allowing every latchless dynamic stage to hold a distinct data item. In contrast, in previous latchless asynchronous dynamic pipelines (e.g., [12] , [13] , and [17] ), alternating stages usually must contain "spacers," or "reset tokens," limiting the pipeline capacity to 50%.
The key idea in the HC approach is one of decoupled control: the pull-up and pull-down of the dynamic gates are made separately controllable. Therefore, the precharge and evaluate controls can both be simultaneously deasserted, allowing the gate to enter a special "isolate phase" (between "evaluation" and "precharge") in which its output is protected from further input changes. As a result, every pipeline stage can store a distinct data item, providing the capability of supporting 100% storage capacity with no explicit latches between stages. In addition, the decoupled control leads to increased overall pipeline concurrency, which in turn directly results in a significantly increased throughput.
1) Structure: Fig. 2 shows a simple block diagram of an HC pipeline. Each stage consists of three components: a function block, a completion generator and a stage controller. In steady-state operation, the function block alternately produces data tokens and spacers.
HC pipelines use a single-rail bundled datapath [9], [18]. A control signal, Req, indicates arrival of new inputs to a stage: a high value of Req indicates the arrival of new data; a low value indicates the arrival of a spacer. For correct operation, a simple timing constraint must be satisfied: Req must arrive after the data inputs to the stage are stable and valid. This requirement is met by inserting a "matched delay" of sufficient latency to match the worst-case delay through the function block.
a) Function block: Fig. 3 shows one gate of a dynamic function block in a pipeline stage. In general, for a multiple output function block, there will be one such dynamic gate for each output. The pc input controls the pull-up network and the eval input controls the "foot" of the pull-down network. Precharge occurs when pc is asserted low and eval is deasserted low. Evaluation occurs when eval is asserted high and pc is deasserted high. In HC pipelines, the two control signals, pc and eval, are separately generated and are decoupled. Therefore, when both signals are deasserted, the gate output is effectively isolated from the gate inputs; thus, it enters the "isolate phase." To avoid a short circuit, pc and eval are never simultaneously asserted.
b) Completion generator: A generalized C-element, gC [19], is used as a completion generator. The gC's output, Done, is set when the stage has begun to evaluate, i.e., when two conditions occur: the stage has entered its evaluate phase (eval is high), and the previous stage has supplied valid data (completion signal Req of previous stage is high). Done is reset simply when the stage is enabled to precharge (pc asserted low).
The gC element output is fed through the matched delay, whose latency (when combined with that of the completion generator) matches the worst-case latency of the function block. Note that, for extremely fine-grain or "gate-level" pipelines [20] , the matched delay is often unnecessary: the gC delay itself often already matches the function block delay, so no additional matched delay is required.
Finally, the completion signal in turn is fed to three components: 1) the previous stage's controller, indicating the current stage's state; 2) the current stage's controller (through the matched delay); and 3) the next stage's completion generator (also through the matched delay).
c) Stage controller:
The stage controller produces the control signals for the function block and the completion generator. It receives two inputs, the delayed Done of the current stage, S, and the Done of the next stage, T, and produces the two decoupled control signals, pc and eval. Details of the stage controller's implementation will be discussed shortly, after presenting the desired protocol.
2) Protocol: An HC pipeline stage simply cycles through three phases, as shown in Fig. 4. After it completes its evaluate phase, it then enters its isolate phase and subsequently its precharge phase. As soon as precharge is complete, it enters the evaluate phase again, thereby completing the cycle. The novelty of the approach is seen in the highly concurrent protocol that governs the interaction between stages. Nearly all existing protocols for asynchronous dynamic pipelines (with no explicit latches) have two explicit backward synchronizations between adjacent stages [12], [17]. In contrast, a novelty of HC pipelines is that they have only one backward synchronization. In particular, in HC, once stage $N$ has completed its evaluation, it enables the previous stage, $N-1$, to perform its entire next cycle: precharge, isolate, and evaluate a new data item. As usual, there is also one implicit forward synchronization: the dependence of stage $N$'s evaluation on its predecessor $N-1$'s evaluation (i.e., data dependence). The complete protocol is shown in Fig. 4.
The introduction of the isolate phase is the key to the new protocol. Once a dynamic stage finishes evaluation, it immediately isolates itself from its inputs by a self-resetting operation regardless of whether this stage is allowed to enter its precharge phase. As a result, the previous stage can not only precharge, but even safely evaluate the next data token, since the current stage will remain isolated. The isolate phase effectively allows lightweight storage at 100% capacity: no explicit latches are needed between adjacent stages.
There are two benefits of this protocol: 1) higher throughput, since a stage $N-1$ can evaluate the next data item even before stage $N$ has begun to precharge; and 2) higher capacity for the same reason, since adjacent pipeline stages are now capable of simultaneously holding distinct data tokens, without requiring separation by spacers.
3) Stage Controller Implementation: Fig. 5 shows a complete implementation of the stage controller along with the rest of the stage. The implementation is very simple, with the two outputs, pc and eval, and an internal state variable, ok2pc, each implemented using a single gate. The state variable is set when the current stage's delayed Done, S, is high and the next stage's Done, T, is low; the variable is reset simply when S is low. This behavior is implemented using an asymmetric C-element, aC, which is a special case of the generalized C-element [19].
Note that the stage controller is designed to have a low latency of only one gate delay. While ok2pc appears to add an extra gate delay to the control path to pc, this is not the case: the protocol allows ok2pc to be set in "background mode," so that ok2pc is typically set before T gets asserted. As a result, the critical path to asserting precharge (pc low) is only one gate delay: from input T through the 3-input NAND gate, NAND3, to the output pc. Moreover, to also shorten the critical path to deasserting precharge (pc high), the delayed Done of the current stage, S, is fed into the NAND gate as a logically redundant input (a low S already causes ok2pc to go low). The net impact is that the stage controller's latency in starting and stopping both precharge and evaluation is only one gate delay.
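The controller's behavior can be mimicked by a zero-delay logic model that steps one stage through its three phases. This is a hedged sketch rather than the circuit of Fig. 5: the name S for the delayed Done of the current stage, the set condition of ok2pc (S high, T low), and the assumption that eval is simply the inverse of S are inferred from the description, and gate delays are ignored entirely.

```python
def next_ok2pc(ok2pc, s, t):
    """Asymmetric C-element: set when S is high and T is low, reset when S is low, else hold."""
    if not s:
        return False
    if not t:
        return True
    return ok2pc

def control_signals(ok2pc, s, t):
    """Return (pc, ev): pc is active low (NAND3 output); ev is active high (assumed inverter on S)."""
    pc = not (ok2pc and s and t)
    ev = not s
    return pc, ev

def phase(pc, ev):
    """Name the phase implied by the two control signals."""
    if not pc:
        return "precharge"
    return "evaluate" if ev else "isolate"

def trace(inputs):
    """Apply a sequence of (S, T) values and record the resulting phases."""
    ok2pc, phases = False, []
    for s, t in inputs:
        ok2pc = next_ok2pc(ok2pc, s, t)
        pc, ev = control_signals(ok2pc, s, t)
        assert pc or not ev  # pc (low) and eval (high) are never asserted together
        phases.append(phase(pc, ev))
    return phases

# Stage idle, stage has evaluated, next stage has evaluated, stage has precharged.
phases = trace([(0, 0), (1, 0), (1, 1), (0, 1)])
print(phases)
```

In this model ok2pc is already set by the time T rises, so the transition into precharge depends only on T passing through the NAND3, consistent with the one-gate-delay argument above.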
4) Analytical Cycle Time and Latency:
A complete cycle of events for stage $N$ can be traced in Fig. 2. From the evaluation of one data item to the evaluation of the next data item in stage $N$, the cycle consists of three operations: 1) stage $N$ evaluates; 2) stage $N+1$ evaluates, which in turn enables stage $N$'s controller to assert the precharge input low of $N$; and 3) stage $N$ precharges, the completion of which, passing through stage $N$'s controller, enables $N$ to evaluate once again (eval asserted high). Let the evaluation and precharge times for a stage be denoted by $t_{\mathrm{Eval}}$ and $t_{\mathrm{Prech}}$, the delay through the completion generator by $t_{gC}$, and the delays through the NAND3 and the inverter of the stage controller by $t_{\mathrm{NAND3}}$ and $t_{\mathrm{INV}}$. The cycle time, $T_{HC}$, is then
$$T_{HC} = t_{\mathrm{Eval}} + t_{\mathrm{Prech}} + t_{gC} + t_{\mathrm{NAND3}} + t_{\mathrm{INV}} \qquad (2)$$
The forward latency through a stage, $L_{HC}$, is simply the evaluation delay of the stage
$$L_{HC} = t_{\mathrm{Eval}} \qquad (3)$$
If the datapath is more than a few bits wide, additional control buffers are typically required to provide sufficient drive for the pc and eval control signals. These buffers are inserted at the outputs of the NAND3 and INV gates, thereby adding the buffer delay $t_{\mathrm{buf}}$ to each. As a result of this modification, the cycle time with such buffering is
$$T_{HC} = t_{\mathrm{Eval}} + t_{\mathrm{Prech}} + t_{gC} + t_{\mathrm{NAND3}} + t_{\mathrm{INV}} + 2\,t_{\mathrm{buf}} \qquad (4)$$
An optimization is presented later in Section IV-B to reduce the overhead of control buffering.
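A small calculation makes the cost of control buffering concrete. The sketch below assumes the cycle time is the sum of the delays along the cycle traced above (evaluation, completion generator, NAND3, precharge, inverter), plus one buffer delay on each of the two control outputs; the function name `cycle_time_ps` and every delay value are hypothetical and do not come from the fabricated chip.

```python
def cycle_time_ps(t_eval, t_prech, t_gc, t_nand3, t_inv, t_buf=0):
    """Cycle time as the sum of the delays on the cycle, with one buffer after NAND3 and one after INV."""
    return t_eval + t_prech + t_gc + t_nand3 + t_inv + 2 * t_buf

# Hypothetical gate delays in picoseconds, chosen only for illustration.
delays = dict(t_eval=150, t_prech=150, t_gc=100, t_nand3=80, t_inv=60)

unbuffered = cycle_time_ps(**delays)
buffered = cycle_time_ps(**delays, t_buf=40)

for label, t in (("unbuffered", unbuffered), ("buffered", buffered)):
    # 1000 / (cycle time in ps) gives throughput in giga-items per second.
    print(f"{label}: {t} ps per item, {1000 / t:.2f} giga-items/s")
```

The forward latency is unaffected by these terms, since it depends only on the evaluation delay of each stage.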
5) Timing Constraints: HC pipelines require three timing constraints for correct operation. These constraints are all simple, single-sided, and easy-to-satisfy in practice (see Section VII-A).
a) State variable: The state variable ok2pc is set once the current stage has evaluated, and the next stage has precharged. Subsequently, T goes high as a result of evaluation by the next stage. For correct operation, ok2pc must complete its rising transition before T goes high (5). In practice, this constraint is easily satisfied. Moreover, the presence of control buffering adds a further safety margin equal to the buffer delay, $t_{\mathrm{buf}}$, because of the buffer at the output of the inverter (6).
b) Precharge width: For correct operation, an adequate precharge width must be enforced, i.e., once precharge is asserted for a stage, it should not be deasserted before the precharge is complete. Suppose T just went high for stage 1. At this point, stage 1's NAND3 is triggered, thereby starting the precharge of stage 1 (in Fig. 2). Concurrently, T will be reset after a path through stage 2's matched delay, stage 3's gC element, stage 2's NAND3, and gC, thereby deasserting the output of stage 1's NAND3. Therefore, for correct precharge, the following must hold:
For a general stage $N$, the constraint can be written as (7). Note that this timing constraint also is exported to the left environment, requiring it to precharge reasonably fast. In practice, this constraint is easily satisfied as well. Furthermore, if control buffering is used, the constraint is further relaxed (assuming buffer delays in neighboring stages are similar) (8).
c) Bundling constraint: As with all bundled-data pipelines, a timing constraint must be met by the request signal: the request to the next stage must be sufficiently delayed to allow time for all of the associated data bits to be valid and stable. This constraint is satisfied by inserting an appropriate delay element whose latency, $t_{\mathrm{delay}}$, together with the latency of the gC element, $t_{gC}$, matches or exceeds the worst-case latency of the stage's datapath, $t_{\mathrm{Eval}}$ (see Fig. 5):
$$t_{gC} + t_{\mathrm{delay}} \geq t_{\mathrm{Eval}} \qquad (9)$$
III. OVERVIEW OF FILTER ARCHITECTURE
Fig. 6 shows the top-level architecture of the digital filter. The filter is a ten-tap six-bit FIR filter using the distributed arithmetic architecture [16]. The figure gives a detailed view of one bit slice; as indicated, there are actually six such bit slices, stacked on top of each other. Data inputs enter from the left, and are processed by the filter as they flow to the right. The filter can be divided into three portions, from left to right. The leftmost portion is clocked, from the input side to the domino latches. The middle portion, from the XOR gates to the end of the carry lookahead adder, is asynchronous. Finally, the rightmost portion, consisting of an output latch, is again clocked.
The architecture of the filter is best understood by following the flow of data from left to right. As the stream of data enters the filter, it first passes through a shift register, which stores the most recent input values that are needed to compute the filter output. In particular, for a $k$-tap filter, for each bit, there is a $k$-place shift register that stores the most recent history for that bit. These stored input values are then multiplied by their respective filter weights. The multiplication is accomplished very efficiently by fetching precomputed results from an LUT. In the figure, the LUT is composed of two banks of registers containing the precomputed results (called even and odd partial sums) and two output multiplexors. The entire multiplication process is bit-sliced, with one slice for each bit of the input data. The result of the multiplications is a set of partial sums which are fed to the asynchronous portion of the filter pipeline for addition. The asynchronous portion is a nine-stage pipeline that adds all of the partial sums together, and produces the result. Finally, this result is latched by a clocked latch and output to the right environment.
IV. FILTER IMPLEMENTATION
The filter implementation is now considered in more detail. The overall structure is a pipeline consisting of an asynchronous portion sandwiched between two synchronous portions, as shown in Fig. 6 . The left synchronous portion receives data from the environment, and processes it into partial sums. The asynchronous portion adds the partial sums to compute the final result. The right synchronous portion resynchronizes the result to the clock and produces it as output for the environment.
A. Synchronous Portion
The synchronous portion of the filter consists of two parts, a left synchronous portion and a right synchronous portion, as shown in Fig. 6 . The synchronous portions enclose the asynchronous portion in the middle, thereby allowing the entire filter chip to appear synchronous to the environment. The two synchronous parts are now described in detail.
1) Left Synchronous Portion:
This part receives the input to the filter from the environment (see Fig. 6 ), and computes the partial sums that are subsequently processed by the remainder of the filter pipeline. The basic operation and structure of this part is now described, as well as two optimizations for reducing the complexity of its implementation.
The first stage is a ten-slot shift register that stores the ten most recent data values. These stored input values are needed to compute the current filter output, which is a weighted sum of these values. Each input value is six bits wide.
The next stage performs multiplication of inputs by their respective filter weights. This operation is accomplished very efficiently by precomputing all possible products and storing them into an LUT. The entire multiplication is bit-sliced, with one slice for each of the six bits in the input data. Therefore, within each bit slice, there are ten input bits which together form a ten-bit address for accessing the LUT.
The size of the LUT is reduced by employing two optimizations, as discussed in Section II-A3. First, partitioning is used: the ten-bit address is divided into two five-bit addresses, one composed of only the even-index bits, and the other composed of the odd-index bits, with a distinct LUT for each, as shown in Fig. 6. Only the five even bits are actually directly used; they are forked to the even multiplexor as its select bits, and also to a clocked register where, after one clock cycle delay, they become the odd-index select bits to the bottom multiplexor during the next clock cycle. Appropriate entries in the even and odd LUTs are then selected and sent to the domino latches. Finally, a second optimization is used: a signed-digit offset binary notation [16] is used to represent table entries and addresses, which enables the separation of the sign bit from each address, further shortening the addresses to four-bit words (see Section II-A3). As a result, the table size is dramatically reduced: two tables with only $2^4 = 16$ entries each are needed, as opposed to one table with $2^{10} = 1024$ entries. The LUTs are implemented using registers and multiplexors, as shown in Fig. 6. Each table has 16 registers, each of which can store an eight-bit entry, per bit slice. Each of the tables has a 16:1 multiplexor at its output, controlled by the four-bit address word. The odd-index address word is generated from the even-index address word by delaying it by one clock cycle.
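The reason a one-cycle delay suffices to produce the odd-index address can be checked with a short model of one bit slice's shift register. The sketch below is illustrative only: the random input stream, the seed, and the variable names are assumptions, and position 0 of the history is taken to be the current input.

```python
from collections import deque
import random

random.seed(0)                         # fixed seed so the run is repeatable
history = deque([0] * 10, maxlen=10)   # ten-place shift register for one bit slice
previous_even = None
delay_matches = True

for _ in range(50):
    history.appendleft(random.randint(0, 1))  # shift in the newest input bit
    bits = list(history)
    even_bits = bits[0::2]             # positions 0, 2, 4, 6, 8
    odd_bits = bits[1::2]              # positions 1, 3, 5, 7, 9
    if previous_even is not None:
        # Each even-position bit moved one place, into an odd position, on this clock.
        delay_matches = delay_matches and (odd_bits == previous_even)
    previous_even = even_bits

print(delay_matches)
```

Since every stored bit shifts by one place per clock, the bits at even positions in one cycle are exactly the bits at odd positions in the next, so registering the even address for one cycle yields the odd address.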
The result of the multiplication is a set of products, called partial sums, that is sent to the asynchronous pipeline for addition, through the synchronous-asynchronous (i.e., left) interface.
2) Right Synchronous Portion: The right synchronous portion (see Fig. 6 ) receives the final result from the asynchronous portion of the filter, and makes it available as the output to the environment. This part simply consists of a latch that resynchronizes the result received from the asynchronous pipeline to the clock. Since this portion of the system is quite small, its implementation is discussed in Section IV-C2, which describes in detail the interfaces between the asynchronous pipeline and the two synchronous portions.
B. Asynchronous Portion
The asynchronous portion of the filter is a self-timed pipeline that is sandwiched between the two synchronous portions, as shown in Fig. 6 . The role of this asynchronous pipeline is to take the partial sums generated by the left synchronous portion, add them up to produce the filter result, and send it to the right synchronous portion for final output.
The asynchronous portion is a nine-stage pipeline, and is shown in greater detail in Fig. 7 . The first stage is a layer of XOR gates that restores the correct sign to the partial sums. This step is needed because the sign and the magnitude of each number were earlier separated in order to introduce symmetry and thereby halve the storage requirements for precomputed products (see Sections IV-A and II-A3). The next five pipeline stages correspond to five layers of carry-save adders [22] . The last three pipeline stages implement a carry-lookahead adder [22] . Taken together, this nine-stage pipeline performs signed addition of the partial sums, and generates the final value of the filter result.
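The adder structure can be sketched in software as layers of 3:2 carry-save compressors followed by one final addition. This is a simplified model under stated assumptions: it takes 12 operands (one even and one odd partial sum from each of the six bit slices), treats them as already aligned integers, and uses a straightforward grouping into triples; the function names are hypothetical and the actual arrangement of adders on the chip may differ.

```python
def carry_save_add(a, b, c):
    """3:2 compressor on integers: returns (s, carry) such that a + b + c == s + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry

def reduce_with_csa(operands):
    """Reduce the operands to two values using layers of compressors; also count the layers."""
    values, layers = list(operands), 0
    while len(values) > 2:
        full = len(values) - len(values) % 3
        next_values = []
        for i in range(0, full, 3):
            next_values.extend(carry_save_add(*values[i:i + 3]))
        next_values.extend(values[full:])  # leftover operands pass to the next layer
        values, layers = next_values, layers + 1
    return values, layers

partial_sums = list(range(-6, 6))          # 12 hypothetical partial sums
(final_a, final_b), layers = reduce_with_csa(partial_sums)
result = final_a + final_b                 # stands in for the final carry-lookahead addition
print(layers, result)
```

With this grouping, 12 operands shrink to 8, 6, 4, 3, and finally 2 values, which takes five layers of carry-save addition before the final two-operand add.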
The asynchronous pipeline is implemented using the HC style [6], [7], but with three modifications to the basic structure shown earlier in Fig. 5. First, the dynamic datapath is implemented using two rails for each bit (instead of single-rail) because both true and complemented values of the data bits are needed to compute the XOR and addition functions using dynamic gates. However, this implementation is not like a typical asynchronous dual-rail pipeline: there are no completion detectors [9]; instead, matched delays are still used for generating completion signals. Thus, the pipeline uses the same bundled data protocol as presented in Section II-B, but with a wider datapath. The datapath is quite wide at the input to the first stage: 216 wires (12 partial sums × (8 data bits + 1 sign bit) × 2 rails). The output of the last stage is a 15-bit result represented using 30 wires.
Second, since the filter has a very fine-grain datapath, no explicit matched delays are required. The delay of each function block is matched by the completion generator's gC element itself, through appropriate device sizing.
Third, since the filter's datapath is quite wide (up to 216 wires), the basic HC controller of Fig. 5 needs a modification to handle the load. In particular, buffers must be inserted in order to amplify the control signals which are broadcast to the entire width of the datapath.
Two different versions of the control were designed, as shown in Fig. 8 : a faster one using control kiting [11] , and a more robust one without kiting. The two versions differ in the placement of the amplifying buffers. The first version, Fig. 8(a) , is robust to variations in buffer delays because the completion signals are delayed by the same amount as the datapath. However, the buffers are on the critical path, resulting in the longer cycle time of (4). In the second version, Fig. 8(b) , the completion generators use control signals that are tapped off from before the buffers, resulting in the shorter cycle time of (2). However, each stage's function block now lags behind its completion generator by an amount equal to the buffer delay. Consequently, for the pipeline to function correctly, all the stages throughout the pipeline are required to have comparable buffer delays.
C. Mixed-Timing Interfaces
There are two interfaces between the asynchronous and synchronous portions of the filter, as shown in Fig. 7 . The left interface connects the left synchronous portion to the asynchronous portion. Similarly, the right interface connects the asynchronous portion to the right synchronous portion.
These mixed-timing interfaces must mediate certain differences in data representation and control sequencing. In particular, the asynchronous datapath uses dual-rail dynamic logic (although with a bundled data protocol, as described in the previous subsection), whereas the synchronous portions of the chip use single-rail static logic. Moreover, the asynchronous pipeline communicates by means of local handshakes (using req's and ack's) at each end, whereas the synchronous portion uses global clocking. At each interface, special latches are used to perform data conversion (from single-rail to dual-rail, and vice versa), and pulse generators are used to mimic the handshaking protocol used by the asynchronous pipeline.
1) Left Interface:
The left interface consists of a special master-slave pair of latches and an associated pulse generator (see Fig. 7). The master-slave latch is a special-purpose latch that is half static and half dynamic, and is therefore referred to here as a static-dynamic latch. The master portion is a standard transparent D-latch with single-rail inputs and complementary dual-rail outputs. The D-latch is controlled by the clock. The slave portion consists of a pair of dynamic buffers, one for each rail of the data, controlled by a pulse generator. The pulse generator (see Fig. 9) emits a high-going pulse every time the clock goes low (provided incoming data is valid). Thus, this special master-slave latch behaves somewhat analogously to a negative-edge-triggered flipflop, emitting new data items on downward clock transitions.
a) Operation: The pulse generator receives the clock signal and a signal that indicates the validity of incoming data, Data Valid. When data is not valid, pulse generation is suppressed, thereby conserving energy by preventing garbage data from being passed into the asynchronous datapath. Data is invalid whenever, for example, the disk read head is moving between tracks, or when the disk motor is starting up or spinning down. When data is valid, each downward clock transition produces a pulse on the compute control for the slave portion of the static-dynamic latch, thereby launching a new data item into the asynchronous pipeline, along with a bundled request, Req. The acknowledgment received from the first stage of the pipeline, Ack, is simply ignored; its purpose is performed instead by the compute pulse, which precharges the dynamic latch in a timed fashion.
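The launch rule of the left interface (one new data item per downward clock transition, suppressed while Data Valid is low) can be illustrated with a small behavioral model. This is only a sketch, not the circuit of Fig. 9: the function name `launch_indices` and the representation of the waveforms as lists of 0/1 samples are assumptions made for illustration.

```python
def launch_indices(clock, data_valid):
    """Return the sample indices at which a new item is launched.

    clock and data_valid are equal-length sequences of 0/1 samples
    (a hypothetical discretized waveform, not taken from the paper).
    A launch occurs on each falling clock edge while data_valid is high.
    """
    launches = []
    for i in range(1, len(clock)):
        falling_edge = clock[i - 1] == 1 and clock[i] == 0
        if falling_edge and data_valid[i]:
            launches.append(i)
    return launches


clock = [1, 0, 1, 0, 1, 0, 1, 0]
data_valid = [1, 1, 1, 1, 0, 0, 1, 1]
print(launch_indices(clock, data_valid))  # [1, 3, 7]
```

The falling edge at index 5 produces no launch because Data Valid is low there, which corresponds to the energy-saving suppression described above.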
b) Timing analysis: The input interface must meet three timing constraints for correct operation. The circuit of Fig. 9 was designed specifically to meet these constraints, which were verified by circuit simulation to be easily satisfied (see Section VII-A).
First, the pulse generator must meet minimum and maximum width requirements. The pulse generated on Compute0 must be wide enough to allow not only enough time for the static-dynamic latch to complete its evaluation phase, but also for the first pipeline stage to latch the data before the data is precharged. If this pulse width is denoted by $t_{pulse}$, then, relative to the start of the pulse, the time when valid data is produced by the static-dynamic latch is $t_{Eval}$, and the time when that data is precharged is $t_{pulse} + t_{Prech}$. For the data to be correctly latched by the first pipeline stage, the static-dynamic latch must not be precharged until at least a hold time, $t_{hold}$, after the data was produced:
$$t_{pulse} + t_{Prech} \ge t_{Eval} + t_{hold}$$
Second, the HC pipeline style also imposes a maximum width constraint on the pulse generator: the data must be reset before the next pipeline stage is ready to evaluate again. Otherwise, the next stage may evaluate again on stale data. This constraint is normally subsumed by the constraint of (7), and therefore is automatically satisfied by pipeline stages that satisfy (7). However, the interface must explicitly satisfy this constraint. Thus, the data must be precharged at least a setup time, $t_{setup}$, before the time, $t_{next}$, at which the next stage is ready to evaluate the subsequent data item:
$$t_{pulse} + t_{Prech} + t_{setup} \le t_{next}$$
For the dynamic stages, $t_{Eval}$ and $t_{Prech}$ are both approximately two inverter delays, $t_{hold}$ is around two inverter delays, $t_{next}$ is around eight inverter delays, and $t_{setup}$ is negligible. Thus, combining the two constraints, we get
$$t_{Eval} + t_{hold} - t_{Prech} \le t_{pulse} \le t_{next} - t_{Prech} - t_{setup} \tag{10}$$
The circuit realization of Fig. 9 uses a pulse width of five inverter delays, thereby adequately satisfying the constraint.
The final timing constraint that must be met by the pulse generator is the bundling constraint. A matched delay of four inverters is used, sufficient for matching the latency through the dynamic latch.
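The minimum and maximum pulse-width requirements on Compute0 can be checked with a short calculation. The sketch below is illustrative only: the function name and parameter names are hypothetical, and the assignment of the approximate delays (two, two, two, eight, and zero inverter delays) to evaluation, precharge, hold, next-stage readiness, and setup is an assumed reading of the timing analysis, not a set of figures confirmed by the source.

```python
def pulse_width_window(t_eval, t_prech, t_hold, t_next, t_setup):
    """Return (min_width, max_width) for the Compute0 pulse.

    Minimum: precharge must not start until a hold time after the
             data is produced:  t_pulse + t_prech >= t_eval + t_hold.
    Maximum: data must be precharged a setup time before the next
             stage can evaluate again:
             t_pulse + t_prech + t_setup <= t_next.
    All arguments share one unit (here: inverter delays).
    """
    min_width = t_eval + t_hold - t_prech
    max_width = t_next - t_prech - t_setup
    return min_width, max_width


# Assumed delay values, in inverter delays (see the lead-in).
lo, hi = pulse_width_window(t_eval=2, t_prech=2, t_hold=2, t_next=8, t_setup=0)
print(lo, hi, lo <= 5 <= hi)  # 2 6 True
```

Under these assumptions, the five-inverter pulse width quoted in the text falls inside the allowed window with one inverter delay of slack on the upper side.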
2) Right Interface: The right interface (Fig. 7) consists of a latch, a pulse generator, and a programmable delay line. The role of this interface is to receive the results computed by the asynchronous portion and resynchronize them to the clock. A key aspect of this interface's design is the avoidance of metastability by exploiting knowledge of the latency of the asynchronous portion. A synchronous D-latch is used to receive the computed result from the asynchronous portion of the filter. Only the true value of the dual-rail output data is used; the complements are simply ignored. A pulse generator (see Fig. 10 ) is used to produce the acknowledgment, Ack, for the last stage of the pipeline. Its function is to produce a high pulse on Ack when the clock transitions high, provided Result Valid is asserted.
The synchronous delay line is shown in detail at the top of Fig. 7. The first flipflop provides one cycle delay to match the latency through the left interface, while the remaining flipflops provide a programmable latency of 1-4 clock cycles to match the asynchronous datapath latency. This implementation does not provide for latencies longer than four cycles because it was determined through simulation that optimal operation over the desired frequency range would require the pipeline to be operated with 1-4 data tokens. However, experimental results for 5-9 data tokens are obtained during chip testing by initializing the pipeline to contain an appropriate number of data tokens, instead of starting it empty (for details, see Section VII).
a) Operation: The interface avoids issues of metastability by using the synchronous delay line to match the worst-case latency through the asynchronous datapath. In this scheme, the asynchronous request from the last stage of the pipeline is ignored. Instead, arrival of a new valid result at the output of the pipeline is inferred from a delayed version of the valid signal associated with that data item. In particular, the input Data Valid to the input-side interface of the pipeline is simply delayed by an integer number of clock cycles to produce Result Valid, which is then used in place of the output Req at the right end of the pipeline. The latency through the delay line, however, must be greater than the latency through the asynchronous pipeline. An advantage of this approach is that the output of the asynchronous pipeline is resynchronized to the clock without any metastability issues.
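The way Result Valid is derived from Data Valid can be sketched as a shift register of flipflops. This is a cycle-level illustration under stated assumptions, not the circuit of Fig. 7: the function name `result_valid_stream`, the list-of-samples representation, and the reset-to-zero initial state are all hypothetical.

```python
from collections import deque


def result_valid_stream(data_valid, programmed_cycles):
    """Delay data_valid by 1 + programmed_cycles clock cycles.

    One fixed flipflop matches the left-interface latency; the
    remaining 1-4 flipflops match the asynchronous datapath latency.
    data_valid holds one 0/1 sample per clock cycle.
    """
    if not 1 <= programmed_cycles <= 4:
        raise ValueError("programmed latency must be 1-4 cycles")
    depth = 1 + programmed_cycles
    line = deque([0] * depth, maxlen=depth)  # flipflops assumed reset to 0
    out = []
    for v in data_valid:
        out.append(line[0])  # oldest stored value reaches the output
        line.append(v)       # newest value enters; oldest is dropped
    return out


print(result_valid_stream([1, 1, 0, 1, 0, 0], programmed_cycles=2))
# [0, 0, 0, 1, 1, 0]
```

With a programmed latency of two cycles, each Data Valid sample reappears as Result Valid three cycles later (one fixed cycle plus two programmed cycles).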
b) Timing analysis: If $L_F$ is the latency through each of the nine asynchronous stages, $T$ is the clock period, and $n$ is the number of clock cycles of programmed latency, then the following must hold for correct operation:
$$n \cdot T \ge 9 \cdot L_F \tag{11}$$
For optimal performance, i.e., minimal latency, the programmed delay should be set at the smallest integer that satisfies (11):
$$n = \left\lceil \frac{9 \cdot L_F}{T} \right\rceil \tag{12}$$
As discussed later in the results section, $9 \cdot L_F$ is approximately 2.6 ns, and $T$ is 0.75 ns or greater, which implies that a programmable delay of 1 to 4 clock cycles is adequate.
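The choice of programmed delay can be illustrated numerically. The sketch below assumes the rule stated above (the smallest whole number of clock cycles that covers the asynchronous latency) and uses the approximate 2.6 ns latency quoted in the text; the function name and the intermediate clock periods of 2.0 ns and 1.0 ns are hypothetical values chosen for illustration.

```python
import math


def programmed_delay_cycles(async_latency_ns, clock_period_ns):
    """Smallest integer n with n * clock_period_ns >= async_latency_ns."""
    return math.ceil(async_latency_ns / clock_period_ns)


ASYNC_LATENCY_NS = 2.6  # approximate latency quoted in the text

for period_ns in (5.0, 2.0, 1.0, 0.75):
    cycles = programmed_delay_cycles(ASYNC_LATENCY_NS, period_ns)
    print(f"T = {period_ns} ns -> {cycles} cycle(s)")
```

At the slowest clock considered here (5 ns, i.e., 200 MHz) one cycle suffices, while at the shortest period of 0.75 ns four cycles are needed, which is consistent with the 1-4 cycle range of the delay line.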
The pulse generator of Fig. 10 must also satisfy a minimum pulse width requirement: the Ack pulse, of width $t_{Ack}$, must be wide enough to allow the last pipeline stage to complete its precharge, which takes a time $t_{Prech}$ (about three inverter delays):
$$t_{Ack} \ge t_{Prech} \tag{13}$$
This constraint is easily satisfied by the circuit of Fig. 10. In particular, because Result Valid is the output of a positive-edge-triggered flipflop, the width of the Ack pulse is one flipflop latency less than a half clock cycle, which easily meets the constraint.
V. FILTER OPERATION
A. Performance Goals: Discussion
The filter is designed to work over a wide range of clock frequencies, because the input data rate to a read channel can vary greatly. In fact, as the disk read head moves from the innermost track to the outermost track, the data rate can vary by a ratio of as much as 1:5. A separate analog circuit ("clock recovery" unit) generates the clock for the digital filter; the frequency and phase of this clock are synchronized with the input data stream.
While high throughput is an important requirement, a second key design goal is to keep the latency as low as possible. The filter, along with the clock recovery unit, is part of a closed feedback loop that monitors the clock frequency and phase, and corrects any misalignment of clock with respect to input data. In order to ensure that the clock closely tracks the input data stream, this feedback loop must have a fast response time. Consequently, the filter, which is a critical part of the loop, must have a very low latency. This goal of low latency is achieved in the new FIR filter by a novel feature: adaptive pipelining.
B. Adaptive Pipelining: Operation
The filter's latency can be varied with varying data rates, and the filter appears to the environment to have a variable synchronous depth, i.e., it can contain a variable number of data tokens. This variable-latency behavior is intrinsic to the asynchronous nature of the pipeline, and cannot be easily achieved in fully synchronous implementations without reconfiguring the structure of the pipeline. In particular, as the input rate is increased, the behavior of the asynchronous portion progressively becomes more pipelined. At these clock rates, the latency through the asynchronous datapath is longer than one clock period, and, therefore, multiple data items are present in the datapath at any given time. Accordingly, the programmable delay line, which helps interface the right end of the asynchronous pipeline with the synchronous portion of the chip, is set to one, two or more clock period delays. Thus, from the viewpoint of the composite system, at higher clock rates, the latency (again, measured in terms of clock cycles) is progressively higher.
The reprogramming of the filter's latency is performed as part of a systemwide reconfiguration of the read channel when the disk drive head moves to a different track. In particular, the clock timing recovery circuit (see Fig. 1 ) requires several cycles to lock onto the new frequency and phase. In the meantime, the output data from the filter is discarded. Computation begins anew once the electromechanical system has stabilized. Therefore, metastability in the right interface during the brief period of the reprogramming of the delay line is not an issue.
In conclusion, while the asynchronous pipeline has a fixed number of stages (nine) and has roughly a fixed overall latency in nanoseconds (2.6 ns), the effective latency as seen by the overall synchronous system (now measured in clock cycles) can be highly varied. This feature is taken advantage of to reduce the overall filter latency. As a contrast, consider the fully synchronous filter implementation of [23], which also has a dynamic logic datapath and implements an identical functionality. This synchronous implementation has a fixed latency of four data samples. Unfortunately, the result can be a serious penalty: at low input data rates (e.g., 200 MHz), the latency can be inordinately long (20 ns), thus degrading the performance of the clock recovery loop. In contrast, our implementation has a latency of only two clock cycles at low data rates (one cycle for the asynchronous portion plus one cycle for the synchronous portion), which translates to 10 ns at 200 MHz, thereby providing a significantly faster response time than that of [23]. Further details are provided in Section VII.
VI. PERFORMANCE ANALYSIS
A. Theoretical Analysis of Throughput
The filter throughput is determined by the throughputs of the synchronous and asynchronous portions, and the mixed-timing interfaces. The asynchronous throughput itself is a function of the number of data items present in the pipeline. When the number of data items is small, the throughput is low, and the pipeline is said to be "data-limited." On the other hand, when nearly every stage of the pipeline is filled with data items, the throughput is once again limited because empty stages, or "holes," are needed to allow data items to flow through the pipeline; in this scenario, the pipeline is said to be congested, or "hole-limited."
1) Data-Limited Operation: Suppose there is only one data item in the asynchronous pipeline at any given time. On every clock cycle, this data item is removed from the right side, and, simultaneously, a new data item is introduced into the pipeline on the left side. For correct operation, the clock period, $T$, must be no shorter than the forward latency through the entire pipeline: $T \ge 9 \cdot L_F$, where $L_F$ is the forward latency through one stage of the nine-stage pipeline. Similarly, if there are $k$ data items in the asynchronous pipeline, then the latency of $k$ clock cycles must be no shorter than the pipeline's forward latency:
$$k \cdot T \ge 9 \cdot L_F \tag{14}$$
2) Hole-Limited Operation: Suppose all of the nine stages of the asynchronous pipeline are holding distinct data items. At the next rising clock edge, the synchronous portion on the right side consumes a data item, effectively injecting a hole at that end. This hole percolates through the pipeline, and arrives at the first stage of the pipeline after nine "hole latencies." A hole latency or reverse latency, $L_R$, is the time from the completion of precharge in a stage (the arrival of a hole in that stage), to the completion of the subsequent precharge in the previous stage (the movement of a hole into the previous stage). For correct operation, the hole must arrive at the left end of the pipeline before the left synchronous portion deasserts the new data item (the domino latches precharge). This deassertion of the new data item occurs a half clock cycle after the hole is injected at the right side of the pipeline, because the right interface produces acknowledgments on high clock transitions whereas the left interface produces new data items on low clock transitions. Therefore, $\tfrac{1}{2} T \ge 9 \cdot L_R$. More generally, if there are $k$ items in the pipeline, then there are $9 - k$ holes, each of which can be filled with new data before a new hole injected into the right end of the pipeline is required to reach the leftmost stage of the pipeline. Therefore
$$\left(9 - k + \tfrac{1}{2}\right) \cdot T \ge 9 \cdot L_R \tag{15}$$
Fig. 11. Upper bounds on the maximum filter frequency. The shaded area represents the operating region.
3) Overall Upper-Bound on Throughput:
Equations (14) and (15) provide upper-bounds on filter throughput, $1/T$, as a function of the number of data items in the asynchronous pipeline, $k$:
$$\frac{1}{T} \le \min\left(\frac{k}{9 \cdot L_F},\; \frac{9.5 - k}{9 \cdot L_R}\right) \tag{16}$$
where $L_F$ and $L_R$ are the forward and reverse latencies through one stage. Fig. 11 shows a plot of the maximum filter throughput versus the number of data items in the asynchronous pipeline. The rising portion represents the data-limited region, while the falling portion represents the hole-limited region. The figure also shows a horizontal line, which corresponds to the longest local cycle time within the entire system [17]. In general, this horizontal line may either represent the maximum speed of the slowest stage in the asynchronous pipeline, or the maximum operating rate of the filter's synchronous portion or the mixed-timing interfaces. The overall filter operation will always be constrained to lie under the canopy formed by the three curves.
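The canopy of Fig. 11 can be sketched numerically. The code below assumes the bounds take the form implied by the analysis above: a data-limited bound of k tokens per total forward latency, a hole-limited bound in which the pipeline behaves like a 9.5-stage ring, and a flat ceiling from the synchronous portion. The function name and all parameter names are hypothetical, and the per-stage reverse latency of 0.4 ns is a placeholder, not a measured value; only the 2.6 ns total forward latency and the 1.3 GHz synchronous limit come from the text.

```python
def canopy_bound_ghz(k, forward_ns, reverse_ns, stages=9,
                     sync_limit_ghz=float("inf")):
    """Upper bound on throughput (giga-items/s) with k tokens in flight.

    forward_ns / reverse_ns are per-stage forward and hole latencies.
    """
    data_limited = k / (stages * forward_ns)
    # The extra 0.5 reflects the half-cycle offset between interfaces.
    hole_limited = (stages + 0.5 - k) / (stages * reverse_ns)
    return min(data_limited, hole_limited, sync_limit_ghz)


FORWARD_NS = 2.6 / 9   # per-stage share of the ~2.6 ns total latency
REVERSE_NS = 0.4       # placeholder value, assumed for illustration

for tokens in range(1, 10):
    bound = canopy_bound_ghz(tokens, FORWARD_NS, REVERSE_NS,
                             sync_limit_ghz=1.3)
    print(f"{tokens} tokens: {bound:.2f} giga-items/s")
```

The printed values rise while the pipeline is data-limited, flatten where the synchronous ceiling applies, and fall once the pipeline becomes hole-limited; the exact shape depends on the placeholder reverse latency.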
There is a significant advantage to using the HC pipeline style even though the pipeline will be operated with at most four data items for best throughput. If a different style, such as PS0 [17] were used instead, the pipeline could accommodate only half as many data items as in the above analysis (i.e., less than five), and the operation would be hole-limited beyond as few as two data items. As a result, the overall operating region under the canopy would be much smaller, and the maximum achievable throughput significantly lower.
B. Practical Issues
In our particular filter design, the overall throughput was limited by the synchronous portion; the asynchronous portion was capable of higher throughput. In particular, the latencies (both forward and reverse) through all of the asynchronous pipeline stages were fairly uniform, and therefore the local cycle times of all of the asynchronous stages were nearly the same. In this case, the maximum throughput of the asynchronous pipeline, to a first approximation, is given by the intersection of the rising and falling curves in Fig. 11 [17] . The horizontal line, however, represents the maximum operating rate that can be sustained by the synchronous portion and the mixed-timing interfaces. This synchronous rate limited the overall filter throughput to a level lower than the maximum asynchronous throughput. The peak of the canopy graph given by the intersection of the two sloping lines represents the theoretical maximum asynchronous throughput, provided it were not limited by the synchronous portion's throughput. In practice, however, the peak asynchronous throughput potential is slightly lower due to second-order electrical effects (e.g., "Charlie effect" [24] and drafting effect [25] , which cause slight delay variations depending on the time difference between the arrival of inputs), and also because the intersection of the two sloping lines may lie at a nonintegral value of the number of tokens. In spite of these second-order effects, the peak of the graph is still useful as an approximation of the throughput potential of the asynchronous pipeline.
C. Comparison to Williams' Analysis of Self-Timed Rings
Interestingly, a linear asynchronous pipeline embedded within a synchronous environment can be modeled as a self-timed ring, as shown in Fig. 12. In particular, in each clock cycle, exactly one data item is removed from the right end of the pipeline, and exactly one new data item is inserted into the left end. Although these two data items are distinct, one can assume for modeling purposes that they represent the same item. Therefore, Williams' analysis [17] is directly applicable to the filter, though with a small caveat: the left and right interfaces operate on complementary phases of the clock. This difference in clock phases is responsible for the half-cycle term in (15), effectively making the filter appear as a self-timed ring with 9.5 stages.
VII. EXPERIMENTAL RESULTS
A. Layout and Fabrication
The chip was laid out and fabricated using the IBM 0.18-μm CMOS-7SF process with copper interconnect and 1.8-V nominal voltage supply. Fig. 13 shows the chip micrograph. The filter core occupies an area of mm . The layout of the chip is part standard-cell and part full-custom. The entire synchronous portion is composed of standard cells from the IBM ASIC SA-27E cell library. In the asynchronous portion, the datapath is implemented using full-custom dynamic gates. The asynchronous control uses a mixture of standard cells (for basic gates) and full-custom cells (for C- and generalized C-elements). The mixed-timing interfaces use mostly standard cells.
Placement and routing is automated using the Silicon Ensemble tool. To simplify the task, the filter is divided into eight parts: the self-timed control block, the XOR block, the carry-save adder, the carry-lookahead adder, the left and right synchronous portions, and the two synchronous-asynchronous interfaces. Next, each part is placed and routed individually using the automated tool. Finally, the tool is used for top-level place and route, to assemble all of the parts together. No resizing of gates is performed after place and route.
The chip implements two versions of the filter, differing only in the pipeline control circuits used. One version uses the control circuit of Fig. 8(a) which has fewer timing assumptions, at the cost of some throughput. The second uses the circuit of Fig. 8(b) which is faster, but has stronger timing assumptions. The two versions are placed side-by-side on the same chip, but are otherwise independent: each has its own copy of the datapath and its own pins. This section gives performance numbers only for the latter version. The performance of the conservative version is around 20% lower, as expected.
1) Timing Constraints: All of the timing assumptions made in the design of the filter are satisfied during design and layout, and verified through simulation. These assumptions fall into two categories: 1) timing constraints inherent in the HC pipeline style [(5), (7) , and (9)] and 2) timing constraints for the two mixed-timing interfaces [(10) and (13) ]. In particular, the bundling constraint (9) is satisfied by designing the dynamic function block in each stage to have a precharge and evaluate latency of no more than 180 ps, and appropriately sizing the completion generator (generalized C-element) to have a latency of around 290 ps. These latencies obviate the need for insertion of an explicit matched delay, yet provide a healthy margin of 110 ps for the bundling constraint. The remaining two HC timing constraints [(5) and (7)] do not need any special design effort to satisfy; the transistor and gate sizing used to optimize the design for speed already satisfies these constraints by comfortable margins: 220 and 380 ps, respectively. Finally, the constraints on the mixed-timing interfaces [(10) and (13) ] are also satisfied with margins of about 200 ps. The ease with which all of the timing constraints are satisfied for this complex filter design bolsters our confidence in the practicality of HC pipelines for use in high-speed real-world applications.
B. Testing
A level-sensitive scan design (LSSD) [26] approach is used to test the filter chip at low speed. A scannable shift register is built onto the chip to provide input data to the filter. An output multiplexor is placed on the chip to select one of the high-speed outputs of the filter, for observation on an oscilloscope. For testing the asynchronous datapath, an additional input (labeled "burn-in" in Fig. 7) is used to convert the dynamic datapath into a pseudo-NMOS combinational circuit: both precharge and evaluate controls are deasserted, and a weak pull-up is asserted. The chip is initialized, loaded with test data, and tested for functional correctness at a low clock speed.
C. Measured Performance
1) Throughput and Power: Fig. 14 shows plots of the measured maximum throughput, and the corresponding power dissipation. The graphs show the variation in throughput and power as the number of data tokens in the asynchronous portion of the filter pipeline is varied. For each token count, the clock rate is gradually increased until a failure is detected in the output data. The figure shows plots for a few representative voltages, although the chips were fully functional from around 1.0 V to over 2.1 V.
The performance measurements show the benefits of adaptive pipelining. At the lowest filter frequencies, the asynchronous portion appears externally as a block of flow-through combinational logic, with a single clock cycle latency. As the frequency is increased, the latency of the programmable delay line is increased to 2, 3, or 4 clock cycles, increasing the apparent depth of pipelining provided by the asynchronous portion.
The filter is also evaluated with more than 4 data items in the asynchronous datapath. (Under normal circumstances, however, this mode of operation is not used, since it provides poorer latency for the same throughput as for 4 or fewer tokens.) Operation with more than 4 tokens is achieved by initializing the datapath to contain a nonzero number of tokens, instead of starting empty. In particular, the request and acknowledgment into the first and last pipeline stages, respectively, are externally controllable using additional test circuitry. Through direct external manipulation (i.e., handshaking), the pipeline can be initialized to contain 1-5 data tokens. Together with a programmed latency of 4 cycles, the nonempty initialization allows the pipeline's operation to be tested with 5-9 data tokens. The observed performance exactly matches the behavior predicted by our theoretical model. As the number of tokens is increased from 1, the pipeline throughput increases. However, beyond 4 tokens, the maximum throughput decreases because the pipeline becomes congested. Between 2 and 4 tokens, the performance levels off: in this region, the filter throughput is limited by the speed of the synchronous portions of the chip which cannot operate as fast as the native throughput of the asynchronous pipeline.
The best observed performance for the filter chip is around 1.1 giga-items/s, with three or four tokens and 2.1-V power supply. The asynchronous pipeline, however, is capable of somewhat higher performance. The native throughput of the asynchronous portion is estimated (modulo second-order effects; see Section VI) by extrapolating the left and right ends of the curves of Fig. 14(a) , and noting their intersection. Using this technique, the maximum asynchronous throughput is estimated to be 1.5 giga-items/s at 2.1 V. Several chip samples were thus tested. The fastest sample had an overall filter throughput of 1.32 giga-items/s at 2.1 V, with the asynchronous portion estimated to be capable of throughputs up to 1.8 giga-items/s. Fig. 16 shows the filter outputs as seen on an oscilloscope, along with a "sync" signal at 1/16th of the clock frequency.
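The extrapolation technique described above (extend the rising and falling ends of the throughput curve as straight lines and read off their intersection) can be sketched as follows. The helper names are hypothetical, and the data points are made-up placeholders chosen for easy arithmetic; they are not measurements from Fig. 14.

```python
def line_through(p, q):
    """Slope and intercept of the straight line through points p and q."""
    (x0, y0), (x1, y1) = p, q
    slope = (y1 - y0) / (x1 - x0)
    return slope, y0 - slope * x0


def intersection(line_a, line_b):
    """Intersection (x, y) of two non-parallel (slope, intercept) lines."""
    (ma, ba), (mb, bb) = line_a, line_b
    x = (bb - ba) / (ma - mb)
    return x, ma * x + ba


# Placeholder (tokens, giga-items/s) readings, NOT measured data.
rising = line_through((1, 0.4), (2, 0.8))    # data-limited end
falling = line_through((8, 0.6), (9, 0.2))   # hole-limited end

tokens, peak = intersection(rising, falling)
print(f"estimated peak: {peak:.2f} giga-items/s at {tokens:.2f} tokens")
```

As noted in Section VI, the intersection may fall at a nonintegral token count, so the result is best read as an approximation of the asynchronous throughput potential rather than an achievable operating point.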
The fastest previously reported comparable read channel filter [23] has a peak throughput of 2.3 giga-items/s, in the same silicon process. However, the filter of [23] is a "half-rate" design, i.e., it consists of two pipelines in parallel, each having a peak throughput of 1.15 giga-items/s. Therefore, the filter chip of this paper is effectively 15% faster. However, the main novelty of the new filter is the dynamically variable pipeline depth, and, hence, a variable latency (as measured in clock cycles), which can adapt to varying input data rates.
2) Energy Efficiency: The energy efficiency of the filter is computed from its throughput and power consumption. Fig. 15 shows two commonly used composite metrics: $E\tau^2$ (energy-delay-squared) and switched capacitance. The first metric was computed as $P/f^3$, where $P$ is the power dissipation and $f$ is the throughput, and provides a voltage-independent measure of a design's combined energy and throughput performance [27]. The latter metric was computed as $P/(f V^2)$, where $V$ is the supply voltage, and is an estimate of the amount of capacitance switched per data item processed by the filter. As expected, experimental results show that both metrics are quite voltage-invariant. Further, the graph shows once again that the design's composite energy and throughput performance is best (i.e., minimum $E\tau^2$) when the filter is operated with three or four tokens. The second graph shows that the amount of capacitance switched per data item processed is fairly constant (approximately 0.12 nF) irrespective of the number of tokens or supply voltage. Toward the extremes (a single token or more than seven), the capacitance computed is a slight overestimate because leakage currents, which become relatively more significant at lower throughputs, could not be excluded from the computation.
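The two composite metrics can be illustrated with a short calculation. The sketch assumes the usual definitions: energy per item is power divided by throughput, delay per item is the reciprocal of throughput, and switched capacitance is energy per item divided by the square of the supply voltage. The function name and the operating point (0.5 W, 1 giga-item/s, 2.0 V) are hypothetical and not taken from the measurements.

```python
def energy_metrics(power_w, throughput_hz, vdd_v):
    """Return (energy-delay-squared in J*s^2, switched capacitance in F)."""
    energy_per_item = power_w / throughput_hz     # joules per data item
    delay_per_item = 1.0 / throughput_hz          # seconds per data item
    e_tau2 = energy_per_item * delay_per_item ** 2
    switched_cap = energy_per_item / vdd_v ** 2   # from E = C * V^2
    return e_tau2, switched_cap


# Hypothetical operating point, chosen only for illustration.
e_tau2, cap = energy_metrics(power_w=0.5, throughput_hz=1.0e9, vdd_v=2.0)
print(f"E*tau^2 = {e_tau2:.2e} J*s^2, C = {cap * 1e9:.3f} nF")
```

For this made-up operating point the switched capacitance works out to 0.125 nF, the same order of magnitude as the roughly 0.12 nF reported in the text.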
3) Latency: While the latency of the filter is not directly measurable, the latency of the asynchronous datapath is estimated from the canopy graph to be approximately 2.6 ns at 1.8 V. To the synchronous environment, this asynchronous latency ranges from 1-4 clock cycles over the useful operating range of 200 MHz to 1.3 GHz. Since the synchronous portion of the design has an additional latency of 1 clock cycle, the overall filter has a latency of 2-5 clock cycles. A synchronous implementation of this filter would incur significantly higher latencies. In particular, the fully synchronous filter of [23] is a comparable implementation that also uses dynamic logic and delivers identical functionality. That design, however, is a "half-rate" design: it consists of two copies of the datapath, each pipelined into two stages and running on a half-rate clock. Its latency, therefore, is two half-rate clock cycles, or four data samples; this latency is fixed, and no adaptive techniques are used. Hence, for typical input data rates varying from 1.3 GHz down to 200 MHz, the latency of the fully clocked implementation of [23] ranges from 3.1 ns to as high as 20 ns. In contrast, our mixed-timed implementation's latency ranges from 3.8 ns (five cycles at 1.3 GHz) to only 10 ns (two cycles at 200 MHz). Thus, our implementation overall has significantly better latency, which is critical for the performance and stability of the feedback control system of which this filter is a key component, thereby contributing to better disk drive read channel performance.
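The latency figures quoted above follow from dividing the latency in cycles (or data samples) by the data rate. The short sketch below reproduces that arithmetic; the function name is hypothetical, and the cycle counts and frequencies are the ones given in the text.

```python
def latency_ns(cycles, clock_ghz):
    """Latency in nanoseconds for a given number of cycles at clock_ghz."""
    return cycles / clock_ghz


FIXED_SAMPLES = 4  # fixed latency of the design in [23], in data samples

# (data rate in GHz, adaptive latency in clock cycles) from the text
for clock_ghz, adaptive_cycles in ((0.2, 2), (1.3, 5)):
    fixed = latency_ns(FIXED_SAMPLES, clock_ghz)
    adaptive = latency_ns(adaptive_cycles, clock_ghz)
    print(f"{clock_ghz} GHz: fixed {fixed:.1f} ns, adaptive {adaptive:.1f} ns")
```

At 200 MHz the adaptive design needs half the time of the fixed-latency design (10 ns versus 20 ns), while at 1.3 GHz the two are within a nanosecond of each other, so the benefit is concentrated at the low data rates where worst-case latency occurs.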
VIII. CONCLUSION
This paper presents the design of an experimental digital FIR filter for use in the read channels of modern high-performance disk drives. The filter is an interesting case study in mixed-timed design. The speed-critical portion of the filter is implemented as an asynchronous pipeline using the high-capacity style [6], [7], obtaining a high throughput yet low latency. The synchronous portion forms a wrapper around the asynchronous pipeline, making it possible for the filter to be used in a clocked environment. The design exhibits a highly varied datapath, ranging from 30 to 216 wires in width along the pipeline, thus demonstrating the scalability of the approach. Measured performance of fabricated chips easily meets or exceeds design specifications. Interestingly, the throughput is limited by the synchronous portion (1.3 giga-items/s); the asynchronous portion is estimated to be capable of up to 1.8 giga-items/s. More importantly, the adaptive nature of the design helps shorten the worst-case latency by half.
As a case study, this work provides a concrete demonstration of some benefits of asynchronous design for an important class of high-speed industrial designs: by combining an asynchronous processing core with a synchronous wrapper, substantial performance gains are obtained over the best existing comparable purely synchronous design. We hope that demonstrations such as this work will open up new avenues for the broader application of asynchronous design and demonstrate its viability for leading-edge industrial applications.
