Abstract-We propose a sampled-analog rank-order filter (ROF) architecture of complexity $O(N^2)$, where $N$ is the number of inputs. It yields a very compact structure because the devices used are essentially of minimum geometry. Because its sole active building block is the simple CMOS inverter, the circuit exhibits excellent low-voltage compatibility. Furthermore, it can support a rail-to-rail common-mode input range. It is inherently fast due to fully parallel signal processing, and its speed is expected to increase with technological scaling at the same rate as that of purely digital circuitry. Finally, it supports full programmability of the rank by means of an analog reference voltage. The ROF is based on a pair of multiple-winners-take-all (mWTA) circuits and a set of AND gates. The paper includes a description of the architecture and a detailed analysis of the mWTA. The most relevant design issues are addressed, and experimental results obtained from a fabricated ROF are presented.
I. INTRODUCTION

Preserving rapidly changing signals while attenuating impulsive noise is a fundamental filtering problem encountered frequently in a wide variety of signal processing applications. Nonlinear filters successfully overcome this problem. However, software implementations of these filters lack the processing speed desired in many real-time applications. Dedicated digital hardware, on the other hand, only marginally improves the speed and requires a large silicon area, hence an elevated cost. These facts have made dedicated analog hardware an attractive option, because it can offer enhanced speed if parallel processing is accomplished and can potentially cost much less if compact architectures can be devised.
Min, max, and median filters are the most popular types of nonlinear filters. The rank-order filter (ROF) generalizes all of these by identifying and transmitting the $k$th highest ranking element of an $N$-dimensional signal vector [1]. This turns the ROF into an extremely versatile signal processor, particularly if the rank is programmable. It finds use in such applications as seismic signal processing, image enhancement, speech processing, biomedical imaging, and pattern recognition [2].
In this paper, we present design issues and experimental evaluation of a programmable sampled-analog CMOS ROF architecture. The precursors of the proposed architecture were published in [3] . The proposed ROF is based on a pair of multiple-winners-take-all (mWTA) circuits built by extending the capacitive synapse concept of [4] into the analog domain.
Although the resulting architecture has complexity $O(N^2)$, it not only proves to be very compact but also affords high resolution and speed, and it is amenable to scaling owing to its excellent low-voltage compatibility.
Architectural features of the proposed ROF are discussed in Section II together with a detailed analysis of its main circuit block, the mWTA. General design issues and the test circuit fabricated in 1.2-μm CMOS are the subject of Section III. Finally, in Section IV, we present a brief description of the existing analog ROF architectures and summarize the conclusions.
II. ARCHITECTURAL DESCRIPTION AND CIRCUIT ANALYSIS
The architecture of the proposed ROF is shown in Fig. 1. It contains a pair of mWTA blocks, marked $(k)$WTA and $(k-1)$WTA. These blocks receive the $N$-dimensional analog input vector $V_{\mathrm{in},i}$, $i = 1, \ldots, N$, in parallel, but are programmed to select the $k$ and $k-1$ highest ranking inputs, respectively. The binary outputs, $x_i$ of $(k)$WTA and $y_i$ of $(k-1)$WTA, identify each high ranking input with a binary 1 and each low ranking input with a binary 0. The respective output pairs are evaluated by a set of $N$ logic gates, each performing an AND operation between $x_i$ and the complement of $y_i$. Consequently, the sole AND-gate output to produce a binary 1 belongs to the input of rank $k$, while all others remain at binary 0. Since each of these binary outputs controls a transmission gate tied between the corresponding analog input and the filter output, only the input of rank $k$ is transmitted to the output.

Fig. 2 shows the mWTA circuit, which consists of $N$ inverters whose inputs are coupled through unit capacitors $C$ to $N$ capacitor bottom-plate columns. The column connections can be switched to: 1) a reference voltage $V_R$, which determines $k$; 2) the circuit inputs $V_{\mathrm{in},i}$; or 3) the outputs of other inverters. The resulting capacitor matrix is symmetrical with zero diagonals. The input and output voltages of these inverters are denoted by $v_i$ and $u_i$, respectively. The circuit operates under a three-phase nonoverlapping clock scheme and, in the most general case, passes through three epochs in one full cycle, as explained next with the aid of Figs. 2 and 3. Fig. 3 shows the simulated waveforms of $v_i$, $u_i$, and the capacitor bottom-plate column voltages for $N = 5$ and $k = 2$. The simulated exemplar operates with V and processes the input vector {0.5 V, 1.0 V, 1.5 V, 2.0 V, 2.5 V}.
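The selection logic of Fig. 1 can be summarized in a few lines. The Python sketch below is a purely behavioral model: each mWTA block is idealized as a function that flags its $m$ highest inputs (an assumption that ignores all circuit effects), and the AND-with-complement rule then picks the input of rank $k$. The names `top_m_flags` and `rank_order_filter` are hypothetical.

```python
from typing import List, Sequence


def top_m_flags(inputs: Sequence[float], m: int) -> List[int]:
    """Ideal (m)WTA: binary 1 for each of the m highest inputs, 0 otherwise."""
    order = sorted(range(len(inputs)), key=lambda i: inputs[i], reverse=True)
    winners = set(order[:m])
    return [1 if i in winners else 0 for i in range(len(inputs))]


def rank_order_filter(inputs: Sequence[float], k: int) -> float:
    """Transmit the k-th highest input (k = 1 is max, k = N is min)."""
    if not 1 <= k <= len(inputs):
        raise ValueError("rank k must satisfy 1 <= k <= N")
    x = top_m_flags(inputs, k)        # outputs of the (k)WTA block
    y = top_m_flags(inputs, k - 1)    # outputs of the (k-1)WTA block
    # One AND gate per input: x_i AND (NOT y_i) closes exactly one transmission gate.
    select = [xi & (1 - yi) for xi, yi in zip(x, y)]
    assert sum(select) == 1, "exactly one transmission gate should close"
    return inputs[select.index(1)]


if __name__ == "__main__":
    v_in = [0.5, 1.0, 1.5, 2.0, 2.5]  # the input vector of the Fig. 3 exemplar
    print([rank_order_filter(v_in, k) for k in range(1, 6)])  # [2.5, 2.0, 1.5, 1.0, 0.5]
```

Because the two blocks differ only in the number of winners, the winner sets are nested, which is why a single AND gate per input suffices.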
In epoch 1, all capacitor bottom-plate columns are tied to $V_R$ and all inverters are in an autozero mode. This resets all $v_i$ to the logic-threshold voltage $V_T$ and, thus, registers the following total charge at each inverter input node:
$$Q_i = (N-1)\,C\,(V_T - V_R) + C_p V_T \qquad (1)$$
where $C_p$ represents the parasitic capacitance of each inverter input node and is assumed to be identical for all inverters. Autozero switches are turned off at the end of epoch 1, leaving all inverter input nodes floating for the rest of the cycle; therefore, the charge described by (1) is conserved until the beginning of the next processing cycle. The columns are also disconnected from $V_R$ at the end of epoch 1, but this is slightly delayed with respect to autozero turnoff in order to keep the column voltages stable until autozero turnoff is complete.
In epoch 2, the circuit input voltages $V_{\mathrm{in},j}$ are applied to the capacitor bottom-plate columns. As a result, the inverter input voltages $v_i$ are perturbed from the autozero level $V_T$. As indicated by the waveforms of Fig. 3, the order in which the $v_i$ are now ranked is just the opposite of the order of the circuit input voltages $V_{\mathrm{in},i}$. This reversal may seem odd at first sight, but it is indeed the main purpose of epoch 2 and is made possible by omitting $C$ along the diagonal of the capacitor matrix. For a formal proof of the reversal, first note the following equation, which is derived from charge conservation and describes the input perturbation of an arbitrary inverter, say, inverter $i$, in epoch 2:
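$$v_i = V_T + \frac{C}{C_T}\sum_{m \ne i}\left(V_{\mathrm{in},m} - V_R\right) \qquad (2)$$
where $C_T = (N-1)C + C_p$ is the total capacitance at the input node of an inverter. Writing a similar equation for another arbitrarily selected inverter, say, inverter $j$, and subtracting it from (2), we obtain
$$v_i - v_j = -\frac{C}{C_T}\left(V_{\mathrm{in},i} - V_{\mathrm{in},j}\right) \qquad (3)$$
which clearly shows that the order of ranking of $v_i$ is opposite to that of $V_{\mathrm{in},i}$. Since this is valid for any pair of inverters, the conclusion can be generalized to the vectors $\mathbf{v}$ and $\mathbf{V}_{\mathrm{in}}$ in their entirety. Of course, the order of ranking is reversed once more at the outputs due to inversion. Therefore, the inverters enter epoch 3 with their outputs $u_i$ having the same order of ranking as that of the circuit input vector $\mathbf{V}_{\mathrm{in}}$.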
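The reversal expressed by (3) can be checked numerically. The sketch below solves the charge balance of each floating inverter input node directly, starting from the charge registered in epoch 1 as in (1), and compares the result with the closed form (2). The component values (a 15-fF unit capacitor, a 60-fF parasitic $C_p$, $V_T = 1.5$ V, $V_R = 1.1$ V) and the function names are illustrative assumptions only.

```python
from typing import List, Sequence


def epoch2_from_charge(v_in: Sequence[float], v_r: float, v_t: float,
                       c: float, c_p: float) -> List[float]:
    """Solve each input node's charge balance after the columns switch to v_in."""
    n = len(v_in)
    c_t = (n - 1) * c + c_p
    q = (n - 1) * c * (v_t - v_r) + c_p * v_t      # charge held since epoch 1, eq. (1)
    total = sum(v_in)
    # Node i sees every column except its own (zero diagonal):
    #   q = sum_{j != i} c * (v_i - v_in[j]) + c_p * v_i
    return [(q + c * (total - v_in[i])) / c_t for i in range(n)]


def epoch2_closed_form(v_in: Sequence[float], v_r: float, v_t: float,
                       c: float, c_p: float) -> List[float]:
    """Eq. (2): v_i = V_T + (C/C_T) * sum_{j != i} (V_in,j - V_R)."""
    n = len(v_in)
    c_t = (n - 1) * c + c_p
    total = sum(v_in)
    return [v_t + (c / c_t) * ((total - vi) - (n - 1) * v_r) for vi in v_in]


if __name__ == "__main__":
    v_in = [0.5, 1.0, 1.5, 2.0, 2.5]
    v = epoch2_from_charge(v_in, 1.1, 1.5, 15e-15, 60e-15)
    print(v)  # decreasing while v_in increases: the ranking is reversed
```

With these values $C/C_T = 0.125$, so the lowest input (0.5 V) produces the highest node voltage, 1.825 V.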
In epoch 3, the inverter outputs $u_i$ are switched to the capacitor bottom-plate columns of their respective indexes and, thus, capacitive positive-feedback paths are established. The circuit, if designed properly, now evolves into a steady state in which the $k$ inverters of highest initial output voltage converge to a binary-high output $V_H$, while the rest converge to a binary-low output $V_L$. Since the output voltages initially have the same order of ranking as the circuit inputs, the final output voltages at the end of epoch 3 reflect the outcome of an mWTA function performed on $\mathbf{V}_{\mathrm{in}}$. As observed in Fig. 3, two outputs of the exemplar circuit converge to $V_H$, while the remaining three converge to $V_L$. Note that the former two have the same indexes as the two highest ranking circuit inputs, 2.0 V and 2.5 V. The outstanding question is how $k$ is determined by the circuit. For an answer, consider the fact that, in the steady state of epoch 3, an inverter of output $V_H$ receives from the other inverters a total of $k-1$ feedback voltages at level $V_H$ and $N-k$ voltages at level $V_L$. Invoking charge conservation, we can express the input-node voltage of such an inverter as follows:
$$v_i = V_T + \frac{C}{C_T}\left[(k-1)V_H + (N-k)V_L - (N-1)V_R\right] \qquad (4)$$
For an inverter to generate a binary-high $V_H$, however, its input voltage must be less than the logic-threshold voltage $V_T$, which, from (4), implies the condition
$$V_R > \frac{(k-1)V_H + (N-k)V_L}{N-1} \qquad (5)$$
Repeating this derivation for an inverter of output $V_L$, which receives in epoch 3 a total of $k$ inputs at $V_H$ and a total of $N-k-1$ inputs at $V_L$ and needs an input voltage in excess of $V_T$, we obtain the following counterparts of (4) and (5):
$$v_i = V_T + \frac{C}{C_T}\left[kV_H + (N-k-1)V_L - (N-1)V_R\right] \qquad (6)$$
$$V_R < \frac{kV_H + (N-k-1)V_L}{N-1} \qquad (7)$$
Obviously, the circuit can be programmed to any rank $k$ by selecting $V_R$ between the limits defined by (5) and (7). However, $V_H$ and $V_L$ must be determined in advance. To see how, suppose that $V_R$ is selected just in the middle of the range defined by (5) and (7), that is
$$V_R = \frac{\left(k-\tfrac{1}{2}\right)V_H + \left(N-k-\tfrac{1}{2}\right)V_L}{N-1} \qquad (8)$$
whose substitution into (4) and (6) yields
$$v_i\big|_{u_i = V_H} = V_T - \frac{C}{2C_T}\left(V_H - V_L\right) \qquad (9)$$
$$v_i\big|_{u_i = V_L} = V_T + \frac{C}{2C_T}\left(V_H - V_L\right) \qquad (10)$$
It is obvious from (9) and (10) that the inverter input voltages at the two stable operating points are to be symmetrical with respect to $V_T$, and that the slope of the line connecting these two points on the inverter transfer characteristic has to be $-C_T/C$. These two constraints uniquely specify $V_H$ and $V_L$. In the case of a symmetrical inverter transfer characteristic, like the one in Fig. 4, one can draw a line of slope $-C_T/C$ through the threshold point and determine the binary operating points by intersecting this line with the characteristic.
A final conclusion about these two binary levels is that, since they are determined solely by (9), (10), and the inverter transfer characteristic, all of which are independent of $k$, their values do not depend on the programmed rank.
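The graphical procedure of Fig. 4 and the limits (5), (7), and (8) can be sketched numerically. Purely for illustration, the code assumes a symmetrical inverter characteristic modeled by a hyperbolic tangent with peak gain $A$ at $V_T$ (a hypothetical model, not the fabricated inverter). It intersects the line of slope $-C_T/C$ through the threshold point with that characteristic by bisection, then tabulates the programming window for every rank, reusing the same $V_H$ and $V_L$ for all $k$.

```python
import math
from typing import Tuple


def inverter(v: float, v_t: float, half_swing: float, gain: float) -> float:
    """Hypothetical symmetric inverter: slope -gain at v_t, output within v_t +/- half_swing."""
    return v_t - half_swing * math.tanh(gain * (v - v_t) / half_swing)


def binary_levels(v_t: float, half_swing: float, gain: float,
                  ct_over_c: float) -> Tuple[float, float]:
    """Intersect the line u - V_T = -(C_T/C)(v - V_T) with the characteristic."""
    if gain <= ct_over_c:
        raise ValueError("A > C_T/C is violated: no bistable operating points")
    # Find x > 0 with (C_T/C) * x = half_swing * tanh(gain * x / half_swing).
    lo, hi = 1e-12, half_swing / ct_over_c          # residual > 0 at lo, <= 0 at hi
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if half_swing * math.tanh(gain * mid / half_swing) - ct_over_c * mid > 0:
            lo = mid
        else:
            hi = mid
    delta = ct_over_c * lo                          # output offset from V_T
    return v_t + delta, v_t - delta                 # (V_H, V_L)


def vr_window(n: int, k: int, v_h: float, v_l: float) -> Tuple[float, float, float]:
    """Lower limit (5), upper limit (7), and midpoint (8) of V_R for rank k."""
    lower = ((k - 1) * v_h + (n - k) * v_l) / (n - 1)
    upper = (k * v_h + (n - k - 1) * v_l) / (n - 1)
    return lower, upper, 0.5 * (lower + upper)


if __name__ == "__main__":
    # Illustrative values only: V_T = 1.5 V, outputs within 0..3 V, A = 20, C_T/C = 8.
    v_h, v_l = binary_levels(1.5, 1.5, 20.0, 8.0)
    for k in range(1, 6):
        lo, hi, mid = vr_window(5, k, v_h, v_l)
        print(f"k={k}: {lo:.3f} V < V_R < {hi:.3f} V, midpoint {mid:.3f} V")
```

The `ValueError` branch mirrors the requirement that the characteristic be steeper than the load line at $V_T$; otherwise the line meets the curve only at the threshold point.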
The mWTA operation described so far represents the most general case of running three epochs per cycle. Since epoch 2 samples the input and epoch 3 generates the corresponding output, these two must indeed be repeated in every cycle. Epoch 1, on the other hand, is used only for registering $V_R$, whose value remains constant unless the ROF needs reprogramming. It is, therefore, unnecessary to include epoch 1 in every cycle. Still, it must be inserted periodically even if $k$ is not to be reprogrammed because, otherwise, the continuous leakage of the inverter input-node charge would eventually result in a loss of rank information. The frequency of this refresh operation is determined by $C_T$, the leakage current, and the tolerable degradation of $V_R$; in any case, it is expected to be much lower than the main cycle frequency.
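The complete cycle can be illustrated with a behavioral simulation. The sketch rests on assumptions that are not taken from the paper: a hypothetical tanh inverter characteristic, a first-order lag with normalized time constant at each inverter output, ideal switches, and no charge injection or leakage. Epoch 1 fixes the node charge (1), epoch 2 produces the reversed perturbation (2), and epoch 3 integrates the capacitive positive feedback with forward-Euler steps. With $V_R$ set by the midpoint rule (8), the outputs that end up high should be those of the $k$ highest inputs.

```python
import math
from typing import List, Sequence, Tuple

# Illustrative assumptions (not the fabricated chip): V_T = 1.5 V, outputs within
# 0..3 V, peak inverter gain A = 20, and C_T/C = 8, i.e. C_T = 2(N - 1)C for N = 5.
V_T, HALF_SWING, GAIN, CT_OVER_C = 1.5, 1.5, 20.0, 8.0


def inverter(v: float) -> float:
    """Hypothetical symmetric single-stage inverter with slope -GAIN at V_T."""
    return V_T - HALF_SWING * math.tanh(GAIN * (v - V_T) / HALF_SWING)


def binary_levels() -> Tuple[float, float]:
    """V_H, V_L where the line of slope -C_T/C through (V_T, V_T) meets the curve."""
    lo, hi = 1e-12, HALF_SWING / CT_OVER_C
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if HALF_SWING * math.tanh(GAIN * mid / HALF_SWING) > CT_OVER_C * mid:
            lo = mid
        else:
            hi = mid
    return V_T + CT_OVER_C * lo, V_T - CT_OVER_C * lo


def node_voltages(columns: Sequence[float], v_r: float) -> List[float]:
    """Floating-node charge conservation, cf. (2) and (4):
    v_i = V_T + (C/C_T) * sum_{j != i} (column_j - V_R)."""
    n, total = len(columns), sum(columns)
    return [V_T + (total - col - (n - 1) * v_r) / CT_OVER_C for col in columns]


def mwta_cycle(v_in: Sequence[float], k: int,
               dt: float = 0.01, steps: int = 5000) -> List[int]:
    """Return the binary mWTA outputs (1 = V_H) after one full cycle."""
    n = len(v_in)
    if n - 1 > CT_OVER_C:
        raise ValueError("C_T = (N - 1)C + C_p cannot be below (N - 1)C")
    v_h, v_l = binary_levels()
    v_r = ((k - 0.5) * v_h + (n - k - 0.5) * v_l) / (n - 1)    # midpoint (8)
    # Epoch 1: autozero at V_T with columns at V_R; the registered charge (1)
    # is embedded in node_voltages through V_T and V_R.
    # Epoch 2: columns driven by the inputs; outputs settle to the inverted perturbation.
    u = [inverter(v) for v in node_voltages(v_in, v_r)]
    # Epoch 3: columns driven by the outputs; forward-Euler integration of
    # tau * du/dt = f(v) - u with a normalized output time constant tau = 1.
    for _ in range(steps):
        v = node_voltages(u, v_r)
        u = [ui + dt * (inverter(vi) - ui) for ui, vi in zip(u, v)]
    return [1 if ui > V_T else 0 for ui in u]


if __name__ == "__main__":
    print(mwta_cycle([0.5, 1.0, 1.5, 2.0, 2.5], k=2))   # [0, 0, 0, 1, 1]
```

Because two inverters with equal outputs see equal input voltages, the update never lets two outputs swap order, which is the property that ties the final winners to the input ranking.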
III. DESIGN ISSUES AND TEST CHIP
The mWTA circuit described in Section II has a very simple and perfectly regular configuration comprising essentially minimum-geometry transmission gates, inverters, and unit capacitors, yet its device-level design is constrained by a number of considerations. At the root of these constraints lies the stability of the circuit during epoch 3, which is the main reason why we deploy single-stage inverters. The nonsaturated output levels, $V_H$ and $V_L$, of a single-stage inverter also facilitate programming for the extreme cases of $k = N$ and $k = 0$ (the latter being required by the $(k-1)$WTA block when $k = 1$), because the corresponding conditions $V_R > V_H$ and $V_R < V_L$, set by (5) and (7), respectively, can be met without necessitating a $V_R$ value outside the rail-to-rail range. The nonsaturated outputs do not pose any problem as long as $V_H$ and $V_L$ are within the respective noise margins of the AND gates driven by these outputs. If they are not, $V_H$ and $V_L$ can be restored to rail levels with additional inverters outside the loop before being fed into the AND gates. A single-stage inverter, however, provides a limited gain, whose magnitude peaks at $A$ around the logic-threshold point. In order for the loop gain to exceed unity, as needed for the bistability illustrated in Fig. 4, this gain must satisfy the condition
$$A > \frac{C_T}{C} \qquad (11)$$
This is not the sole constraint involving $C_T/C$; this ratio also affects the common-mode input range (CMR), as explained next. Consider the case in which the ROF is intended to operate as a min filter with $k = N$. As indicated by (8), the $V_R$ of the $(k)$WTA block is then to be set high, close to $V_{DD}$, implying that all capacitors enter epoch 2 with their upper plates at $V_T$ and their bottom plates at $V_R$. If all circuit inputs happen to be close to ground, then the bottom plates of all capacitors will swing downward by approximately $V_{DD}$ during epoch 2. Unless $C_T/[(N-1)C]$ is at least 2 (for $V_T \approx V_{DD}/2$), the voltage of the upper plates and, hence, of the inverter inputs will drop below ground, causing loss of charge via the junctions of the autozero MOSFETs.
An opposite but equally harmful situation arises when the ROF is programmed to operate as a max filter ($k = 1$) and all inputs happen to be close to $V_{DD}$. In this case, the inverter inputs can rise sufficiently above $V_{DD}$ to cause excessive charge leakage due to impact ionization at the same junctions. If the CMR is to be rail-to-rail, one must inevitably add dummy capacitors to all inverter inputs, so that $C_T$ is made large enough to attenuate the perturbation at the inverter inputs. Assuming that the mWTA is designed with $C_T = 2(N-1)C$ to secure a rail-to-rail CMR, (11) implies the upper limit $N < 1 + A/2$. For a typical basic inverter, this limit is no less than 10.
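Both constraints can be evaluated with a few lines of arithmetic. The sketch below finds the largest $N$ that still satisfies $A > C_T/C$ when $C_T = 2(N-1)C$, and computes the epoch-2 inverter input voltage from (2) when all inputs sit at the same level. The 3-V supply, $V_T = V_{DD}/2$, the worst-case choice $V_R = V_{DD}$, and the function names are assumptions for illustration.

```python
def max_inputs(gain: float) -> int:
    """Largest N satisfying (11), A > C_T/C, when C_T = 2(N - 1)C."""
    n = 1
    while gain > 2 * n:      # would N = n + 1 still satisfy A > 2((n + 1) - 1)?
        n += 1
    return n


def worst_case_input(n: int, ct_over_c: float, v_t: float,
                     v_r: float, v_in_common: float) -> float:
    """Eq. (2) with all N circuit inputs at the same level v_in_common."""
    return v_t + (n - 1) / ct_over_c * (v_in_common - v_r)


if __name__ == "__main__":
    v_dd, n = 3.0, 9
    print(max_inputs(20.0))                                          # 10
    # Min filter (V_R near V_DD) with all inputs at ground:
    print(worst_case_input(n, 2 * (n - 1), v_dd / 2, v_dd, 0.0))     # 0.0, just at ground
    print(worst_case_input(n, n - 1, v_dd / 2, v_dd, 0.0))           # -1.5, no dummy capacitors
```

With the dummy capacitors, the inverter input just reaches ground in this worst case; without them ($C_p \approx 0$), it would be driven 1.5 V below ground.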
According to (5) and (7), the width of the window of programming voltages for which the mWTA is programmed to a given $k$ is
$$\Delta V_R = \frac{V_H - V_L}{N-1} \qquad (12)$$
which is independent of $k$.
Since $V_H - V_L$ is at least a few volts, an mWTA of ten inputs supports a window $\Delta V_R = (V_H - V_L)/9$ of at least a few hundred millivolts, which provides a comfortable tolerance for $V_R$, as needed for the robustness of calibration-free programmability against manufacturing variations.
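As a worked instance of the window implied by (5) and (7), take $N = 10$ and assume, for illustration only, $V_H - V_L = 2.5$ V:
$$\Delta V_R = \frac{V_H - V_L}{N - 1} = \frac{2.5\ \text{V}}{9} \approx 278\ \text{mV}$$
With the midpoint choice (8), $V_R$ can therefore drift by about $\pm 139$ mV before the programmed rank changes.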
The mWTA is relatively immune to second-order effects of inverter mismatch and charge injection. The autozero process in epoch 1 is specifically introduced to cope with the effect of inverter logic-threshold voltage mismatch. The effect of charge injection, on the other hand, is minimized by the regular layout and the competitive nature of the mWTA process. Injection from the autozeroing switches at the end of epoch 1, for example, adds to $Q_i$ of (1), but since all inverter inputs receive identical injection, the order of ranking in epoch 2 is not altered. The capacitor bottom-plate columns also receive injection at the end of epoch 1 as they disconnect from $V_R$. If this injection occurs before the autozero switches are completely shut off, the registered inverter input charge will be altered. Considering the possibility of clock skew, this alteration may vary considerably among inverters. As mentioned before, controlling the upper column switches with a slightly delayed replica of the autozero clock completely solves this problem. Finally, all columns receive yet another injection at the end of epoch 2 as the input vector is disconnected. The charge injected into each column is linearly proportional to the input voltage sampled onto that column. Therefore, not only is the ranking order of the sampled input voltages preserved but, as a bonus, the differences between input voltages are also enhanced by this injection.
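The resolution of the mWTA is determined mainly by mismatch among the unit capacitors. Assuming a large $N$ and identical input voltages $V_{\mathrm{in}}$, the standard deviation $\sigma_{\Delta v}$ of the difference $\Delta v$ between any two inverter input voltages during epoch 2 is approximately described by (13).
The expected value of $\Delta v$ is related to the difference $\Delta V_{\mathrm{in}}$ between the corresponding circuit input voltages by $E[\Delta v] = -(C/C_T)\,\Delta V_{\mathrm{in}}$, as follows from (3). Defining the resolution as the value of $|\Delta V_{\mathrm{in}}|$ that makes $|E[\Delta v]|$ equal to $\sigma_{\Delta v}$, we obtain the following expression for resolution:
$$\Delta V_{\min} = \frac{C_T}{C}\,\sigma_{\Delta v} \qquad (14)$$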
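The link between capacitor mismatch and resolution can be explored with a Monte Carlo sketch. It rests on assumptions that are not the paper's derivation of (13): every unit capacitor deviates independently by a Gaussian relative error `sigma_rel`, $C_p$ is perfectly matched, and all inputs sit at a common level. Each trial solves the epoch-2 charge balance of two inverter input nodes, and the resolution is taken as $(C_T/C)\,\sigma_{\Delta v}$, following the definition above. For comparison, a first-order expansion of this same assumed model gives $|V_{\mathrm{in}} - V_R|\sqrt{2(N-1)}\,\sigma_{\mathrm{rel}}\,C_p/C_T$. All component values and function names are hypothetical.

```python
import math
import random
import statistics
from typing import Sequence, Tuple


def node_voltage(caps: Sequence[float], c_p: float, v_t: float,
                 v_r: float, v_in: float) -> float:
    """Epoch-2 voltage of one floating node whose N - 1 capacitors all see v_in.
    Charge balance: S*(v_t - v_r) + c_p*v_t = S*(v - v_in) + c_p*v, S = sum(caps)."""
    s = sum(caps)
    return v_t + s * (v_in - v_r) / (s + c_p)


def mc_resolution(n: int = 9, c: float = 15e-15, c_p: float = 120e-15,
                  sigma_rel: float = 0.005, v_t: float = 1.5, v_r: float = 2.0,
                  v_in: float = 0.5, trials: int = 20000,
                  seed: int = 1) -> Tuple[float, float]:
    """Return (sigma of delta v, resolution (C_T/C) * sigma) for two inverters."""
    rng = random.Random(seed)
    c_t = (n - 1) * c + c_p
    diffs = []
    for _ in range(trials):
        # Each inverter owns its own N - 1 unit capacitors (independent errors).
        row_i = [c * (1.0 + rng.gauss(0.0, sigma_rel)) for _ in range(n - 1)]
        row_j = [c * (1.0 + rng.gauss(0.0, sigma_rel)) for _ in range(n - 1)]
        diffs.append(node_voltage(row_i, c_p, v_t, v_r, v_in)
                     - node_voltage(row_j, c_p, v_t, v_r, v_in))
    sigma_dv = statistics.pstdev(diffs)
    return sigma_dv, (c_t / c) * sigma_dv


def linearized_resolution(n: int, c: float, c_p: float, sigma_rel: float,
                          v_r: float, v_in: float) -> float:
    """First-order estimate for the same assumed mismatch model."""
    c_t = (n - 1) * c + c_p
    return abs(v_in - v_r) * math.sqrt(2 * (n - 1)) * sigma_rel * c_p / c_t


if __name__ == "__main__":
    sigma_dv, res = mc_resolution()
    print(sigma_dv, res, linearized_resolution(9, 15e-15, 120e-15, 0.005, 2.0, 0.5))
```

In this assumed model, the mismatch enters only through the ratio $S/(S + C_p)$, so the $C_p$ that is matched across inverters partly dilutes the capacitor errors.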
Since the relative capacitor mismatch $\sigma_C/C$ is a decreasing function of $C$, the resolution can be improved by using larger capacitors. Capacitor size can be increased without any adverse effect on the overall circuit size as long as the pitch of the regular circuit layout is set by the peripheral circuitry of switches and inverters. Speed, however, suffers from any increase in unit capacitor size, as explained next.
The speed of the ROF is determined mainly by mWTA dynamics. The transient period in epoch 1 is proportional to autozero switch resistance, , , and , and is inversely proportional to inverter transconductance. In epoch 2, , , and input switch resistance determine the duration of the where the time constant is defined by (19) It is obvious from (18) and (19) that the time of divergence and, hence, the duration of epoch 3, is inversely proportional to , a linearly increasing function of , a quadratically increasing function of , and a logarithmically decreasing function of the initial difference between and . These predictions are confirmed by SPICE simulations, a sample of which is presented in Fig. 6. These are semilogarithmic plots of versus time for , , and . The corresponding values of the time constant $\tau$, as calculated from the slopes of these plots, are 2.55, 5.10, and 8.30 ns, respectively. These are in perfect agreement with the predictions of (19) for . As (19) indicates, $\tau$ can be reduced by increasing the inverter transconductance, but since the latter calls for increasingly wider inverter transistors, the parasitic capacitance $C_p$ also increases. This not only sets a lower limit on $\tau$ but, according to (3), also reduces $C/C_T$ and, hence, the initial difference between the inverter input voltages, which in turn tends to increase the time of divergence in epoch 3. Clearly, therefore, there exists an optimum transistor width for which the speed is maximized.
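Extracting $\tau$ from the slope of a semilogarithmic plot, as done for Fig. 6, can be sketched as a least-squares fit of $\ln|\Delta(t)|$ against $t$, assuming purely exponential divergence $\Delta(t) = \Delta_0 e^{t/\tau}$. Under the same assumption, the time to grow from an initial difference $\Delta_0$ to a decisive level $\Delta_f$ is $\tau \ln(\Delta_f/\Delta_0)$, which reproduces the logarithmic dependence on the initial difference. The data below are synthetic (generated with $\tau = 2.55$ ns), and the function names are hypothetical.

```python
import math
from typing import Sequence


def tau_from_semilog(t: Sequence[float], delta: Sequence[float]) -> float:
    """Least-squares slope of ln|delta| versus t; tau is its reciprocal."""
    y = [math.log(abs(d)) for d in delta]
    t_mean = sum(t) / len(t)
    y_mean = sum(y) / len(y)
    num = sum((ti - t_mean) * (yi - y_mean) for ti, yi in zip(t, y))
    den = sum((ti - t_mean) ** 2 for ti in t)
    return den / num          # 1 / slope


def divergence_time(tau: float, delta_0: float, delta_f: float) -> float:
    """Time for an exponentially growing difference to go from delta_0 to delta_f."""
    return tau * math.log(delta_f / delta_0)


if __name__ == "__main__":
    tau_true = 2.55e-9
    t = [i * 0.5e-9 for i in range(20)]
    delta = [3e-3 * math.exp(ti / tau_true) for ti in t]   # synthetic, noise-free
    print(tau_from_semilog(t, delta))                      # ~2.55e-09
    # Halving the initial difference adds tau * ln 2 (~1.77 ns) to the divergence time.
    print(divergence_time(tau_true, 15e-3, 1.0) - divergence_time(tau_true, 30e-3, 1.0))
```

On measured or simulated waveforms, only the portion where the difference grows exponentially (before the outputs saturate) should be included in the fit.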
We have designed in 1.2-μm AMI technology a fully integrated ROF with nine inputs ($N = 9$), which has been fabricated via MOSIS. In addition to the $(k)$WTA and $(k-1)$WTA blocks and the glue logic, it includes digital-to-analog conversion circuits to generate $V_R$ from an external 4-bit-wide instruction. The capacitive matrices have been built with small 15-fF capacitors, but all routine layout procedures have been followed for good matching. The total area occupied by the ROF is 840 × 840 μm². A photomicrograph is given in Fig. 7. Tests have been conducted with an externally supplied $V_R$ because the on-chip reference generator has not been fully functional. The ROF can be programmed as intended for all ranks with the exception of . Although we do not know the exact reason for this malfunction, it is definitely not associated with the $(k)$WTA block, whose externally observable outputs produce correct signals. We suspect the $(k-1)$WTA unit, which has not been made observable due to pin limitations. The inverters have been designed for V and V. Table I lists the limits of the $V_R$ range for each rank as calculated from (5) and (7) and as measured on the $(k)$WTA block. The measured limits closely match the calculated ones. These tests, conducted on three different chips, indicate a maximum shift of 100 mV in the range limits.
The measured worst-case input resolution is 30 mV, which, from (14), implies a relative standard deviation of about 0.5% for the capacitors used. This is in good agreement with the figures cited in [5] for small poly-to-poly capacitors in 1.2-μm CMOS. Test results have also highlighted the low-voltage compatibility of the ROF, indicating full functionality for $V_{DD}$ as low as 2 V. This is not unexpected, because the only active building block used in the entire architecture is the CMOS inverter, which can function with a supply voltage as low as two device-threshold voltages. The speed, as measured by the inverse of the minimum cycle time, is 4 MHz, which exactly matches the simulated value for a 30-mV minimum input difference. The measured total standby power is 9 mW, which is mainly consumed by the inverters and agrees reasonably well with the simulated value of 7 mW.
IV. CONCLUSION
The sampled-analog ROF architecture we propose has a configurational complexity of $O(N^2)$ but still yields a very compact structure because the devices used are essentially of minimum geometry. With the CMOS inverter as its sole active building block, the circuit exhibits excellent low-voltage compatibility. Furthermore, it can support a rail-to-rail common-mode input range. Its speed is inversely proportional to , but the fully parallel implementation of all signal processing tasks makes the architecture inherently fast. Moreover, the speed is expected to increase with scaling at the same rate as that of purely digital circuitry. Finally, the proposed architecture supports full programmability of the rank by means of an analog reference voltage without necessitating calibration.
Among the analog ROF architectures proposed to date, the one presented in [6] contains no sorting subsystem and operates in continuous time. Although its architectural complexity is $O(N)$, it needs one differential transconductance amplifier of 15 nonminimum-geometry transistors per input. Therefore, its low order of complexity can pay off only for very large $N$. Furthermore, it has a systematically low resolution, which is likely to be aggravated by the mismatch-dependent imprecision of subthreshold device operation. Although speed is indeed a generic advantage of systems, the subthreshold operation of this particular ROF does not favor fast convergence either. Another continuous-time ROF is proposed in [7]. It has very good area efficiency for $k = 1$ and $k = N$ (max and min) applications, but needs a very complex rank selector circuit for any other rank. Another ROF architecture, proposed in [8], is based on the sequential operation of two-input WTA circuits. Its area efficiency is comparable to that of [6], but it is inherently slow because it requires one cycle per input.
