We detail the design of multiresolution analog filter banks, linear models of cochlear function, with power dissipation being a prime engineering constraint. We propose that a reasonable goodness criterion is the information rate through the system per Watt of power dissipated. Speech applications requiring filter banks with a wide frequency tuning range, from 20 Hz to 20 kHz, and low power consumption make the transconductance-C integrator in subthreshold CMOS the preferable integrator structure. By way of example, the dynamic range of a lowpass filter is computed and subsequently used to design a filter bank that faithfully models cochlear micro-mechanics. The power consumption of the entire filter bank is computed from analytical expressions and is estimated as 355 nW, at an overall information rate of 68 kbits/sec at the output of the system.
Introduction
The ultimate goal of our research is the efficient extraction of information from sensory signals (auditory and visual) by small, lightweight, highly mobile, autonomous hardware systems. In a truly autonomous system, all processing must be performed within the physical boundaries of the machine with strict constraints on the availability of energetic resources, i.e., only power sources within the physical boundaries of the information processing system can be used.
At the system level, and at each processing stage, we seek to maximize the mutual information [1] $I(x; y)$ between the input $x$ and output $y$, defined in terms of probability distributions as:
$$I(x; y) = \iint P(x, y) \, \log_2 \frac{P(x, y)}{P(x)\,P(y)} \, dx \, dy \qquad (1)$$
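For intuition, the double integral defining the mutual information can be approximated by a sum over a discretized joint distribution. The following is a minimal sketch under that assumption; the function name and the two toy distributions are illustrative and do not come from the systems discussed in this paper.

```python
import math


def mutual_information_bits(joint):
    """Mutual information I(x; y) in bits for a discrete joint distribution.

    joint[i][j] is P(x=i, y=j); entries are non-negative and sum to 1.
    """
    px = [sum(row) for row in joint]          # marginal P(x)
    py = [sum(col) for col in zip(*joint)]    # marginal P(y)
    total = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0.0:                     # 0 * log(0) is taken as 0
                total += pxy * math.log2(pxy / (px[i] * py[j]))
    return total


# Statistically independent input and output: no information is conveyed.
independent = [[0.25, 0.25], [0.25, 0.25]]
# Noiseless binary channel with equiprobable inputs: one bit is conveyed.
noiseless = [[0.5, 0.0], [0.0, 0.5]]

print(mutual_information_bits(independent))
print(mutual_information_bits(noiseless))
```

The two cases bracket the behavior of interest: a stage whose output is independent of its input conveys nothing, while a noiseless stage conveys the full entropy of its input.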
Let the output at each stage be encoded in N independent physical channels under the constraint that each channel carry an equal amount of information. This is commensurate with physical systems for which the intrinsic bandwidth limitations in the basic elements are comparable to or even smaller than the bandwidth of the signals to be processed, for example neural systems, and therefore information must be distributed in space and processed in parallel. This is also true for CMOS VLSI sensory systems operating in subthreshold.
However, maximizing the mutual information between input and output of the different processing stages is not adequate; decisions about the input signals and encoding/decoding within the system must be made using the smallest possible power consumption. Thus, as a tentative measure of goodness for evaluating the performance of such autonomous systems, we propose the following optimizing criterion: maximize the information rate per Watt of power consumed, in bits/Joule. A high information rate is the desired outcome of processing, while power consumption represents the cost.
The choice of signal representations can yield different figures of merit (see Chapter 5 in [2]). For the purpose of this paper we consider continuous-time, continuous-value, analog signal representations and circuits.
The maximum number of bits per second, or channel capacity, can be computed from the dynamic range of an analog continuous-time filter using information theory [1] and can be rendered in the following form [2]:
where $f_p$ is the bandwidth of the filter. The above equation applies to linear systems under additive Gaussian noise conditions. It assumes a peak amplitude constraint, which is appropriate for circuits which must operate within a certain voltage range to avoid distortion and clipping.
Biological systems serve as working models of sensory information processing as they seem to achieve a very high information rate at very low levels of power consumption. Therefore, we stand to learn by abstracting from known function and organization of information processing in biological systems when we attempt to solve similar problems using VLSI. To date, several biologically-inspired VLSI systems for vision and audition have emerged from such an undertaking, including the silicon retina and the silicon cochlea [3, 4, 5].
In this paper, we present a design strategy for hardware cochlear filter bank models, addressing issues at both the architecture and circuit levels. Total power dissipation is a prime engineering constraint and, as such, this work finds applications in the areas of portable speech-recognition equipment, hearing aids, and cochlear implants.
Filter Bank Architecture
The efficiency and performance of any information processing system, both hardware and software, can be improved by incorporating prior knowledge in the design phase. Indeed, this is the keystone for success in all statistical speech recognition systems [6]. In the problem at hand, we know a priori that the system will process speech signals which, in the framework of random signals, can be described by linear statistics such as the mean and variance of the amplitude.
The input signal power spectral density $S_{V_{in}}(\omega)$ and amplitude distribution $p(V_{in})$ are two examples of prior knowledge which will be exploited in the synthesis and characterization of the cochlear filter bank.
Prior knowledge is sometimes referred to as the "model" and represents the structure of the information processing system. A parametric description with a minimal number of parameters is desirable. A static, linear filter bank model of the basilar membrane in the cochlea has been proposed by Liu [7, 8]. The design is a result of the effective bandwidth concept and faithfully reproduces the results from hydrodynamic simulation of a one-dimensional fluid-mechanical model of the cochlea [9]. The filter bank structure has only four tuning parameters yet is flexible enough that an appropriate set of parameters can be found to fit the neurophysiological data. A block diagram of the architecture is found in Fig. 1. It can be viewed as a single-input, multiple-output filter bank. With N output nodes, the transfer function from the input node to output n is given by:
where $\omega_{c_i}$ is the cutoff frequency of the lowpass filter and the center frequency of the bandpass filters for the $i$th section, and $Q_{3_i}$ is the 3-dB quality factor of the bandpass filters for the $i$th section. The four filter tuning parameters are the center frequency range, $\omega_{c_1}$ to $\omega_{c_N}$, and the quality factor range, $Q_{3_1}$ to $Q_{3_N}$.
Given an input distribution of $S_{V_{in}}(\omega)$, a maximum entropy constraint is imposed on the output distribution, i.e., signal power is divided uniformly among the N output channels. If $S_{V_{in}}(\omega) \propto 1/\omega$, it can be shown that the signal power is evenly distributed among the N output channels when the center frequencies of the channels are spaced exponentially along the frequency axis [2]. In this case, the linear filter bank representation of cochlear function approximates wavelet analysis in a scale domain that preserves good temporal and frequency resolution [8].
Therefore, under the simplifying assumption that speech signals follow a $1/\omega$ distribution, the set of cochlear model parameters which spreads the signal power evenly across all channels can be expressed as [2]
In Fig. 2 we plot the transfer function and group delay for a 16-channel filter bank using exact equations for $Q_3 = 2.6$ and a frequency range of 100 to 8000 Hz. Two preliminary lowpass filters have been added in order to achieve a more uniform peak magnitude response.
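As an illustration of exponential spacing, the sketch below generates center frequencies with a constant ratio between neighboring channels. It assumes that the first and last channels sit exactly at the two ends of the frequency range and that channels are ordered from high to low frequency, as in a cochlear cascade; both choices, and the function name, are assumptions made for this example rather than the exact parametrization used in the design.

```python
def center_frequencies(f_low, f_high, n_channels):
    """Exponentially spaced center frequencies, from f_high down to f_low.

    Assumes the end channels sit exactly at f_high and f_low, so that
    adjacent channels differ by a constant frequency ratio.
    """
    ratio = (f_low / f_high) ** (1.0 / (n_channels - 1))
    return [f_high * ratio ** i for i in range(n_channels)]


# 16 channels spanning 100 Hz to 8000 Hz, as in the example filter bank.
freqs = center_frequencies(100.0, 8000.0, 16)
for i, f in enumerate(freqs, start=1):
    print(f"channel {i:2d}: {f:8.1f} Hz")
```

With a constant quality factor, each channel then has a bandwidth proportional to its center frequency, which is what allows a $1/\omega$ input spectrum to deliver equal power to every channel.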
Having distributed the signal power evenly among the N output channels, we must ensure that the noise power is also uniformly distributed across the different channels. In this way, each channel carries approximately the same amount of information in bits. The noise properties of the cochlear model will be investigated as we proceed with the particular design implementation.
However, the performance of any fixed-model based system will inevitably degrade if the operating environment does not match the environment under which the system was originally designed, making adaptation an absolute necessity. For speech communication systems, the origin of variability can be divided into two broad classes. The first source of variability, exogenous, is due to the environment through which acoustic signals propagate. The second source of variability is from the computational substrate where speech is produced (encoded) and processed (decoded). In a human-computer communication channel, the variability in the speech production apparatus of the individual speakers, and the structural variability in the hardware that will process speech (real or silicon cochleas), as well as noise in a thermodynamic sense, are the two sources of endogenous variability. Two software systems that employ adaptation at the acoustic processor stage in order to solve problems due to both endogenous and exogenous sources of variability have been developed in the area of speech recognition. Neti [10] used a software model of the basilar membrane proposed by Liu [7], followed by temporal feature extraction proposed by Yang [11], as a front-end for a large-vocabulary isolated-word speech recognition experiment. By adjusting the four tuning parameters of the basilar membrane in response to changes in the level of additive babble noise, Neti reported more than a 50% decrease in word error rate at moderate signal-to-noise ratios as compared with the more conventional acoustic processing scheme. More recently, Kamm et al. [12] described a system which employs adaptation of one parameter in the acoustic processor to compensate for the variability in the vocal-tract length of the individual speakers. They reported an overall improvement in the word error rate of 5% for continuous-word speaker-independent telephone speech.
Thus, although adaptation is not treated in the present paper, it is understood by the authors that this topic must be addressed in a final design.
Subthreshold CMOS Integrator
The design of optimum continuous-time filter banks requires the optimization of a single integrator. The two types of integrators considered in [13] are the MOSFET-C and transconductance-C integrators. The main advantage of the MOSFET-C integrator is that it can approach the dynamic-range maximum, as defined by Groenewold [13]. Its chief disadvantage is the relatively narrow frequency tuning range of approximately half an octave. Such a tuning range is generally adequate to compensate for parametric variations in the fabrication process; however, filter banks for processing wideband signals, such as speech, require a frequency tuning range of 2 decades or more, encompassing much of the normal hearing range of 20 Hz to 20 kHz. A second disadvantage of the MOSFET-C implementation is the need for a high-gain, low-output-impedance amplifier. Thus, one possible road towards the design of optimum dynamic-range integrators for speech processing is to develop a MOSFET-C implementation with a much broader tuning range and a low-power amplifier.
The alternate approach employs transconductance-C integrators and is the one adopted in our work. By exploiting the exponential current-to-voltage relationship in subthreshold MOS transistors, integrators that span several decades in frequency can be easily made. One has the added bonus of low-power and low-voltage operation, as the device physics of subthreshold CMOS enable the lowest possible saturation voltage [14, 4]. The main disadvantage of transconductors operating in the subthreshold region is the relatively poor linear range. There exist, however, several techniques to linearize these inherently nonlinear transconductors [15, 16, 14, 17]. In short, for low-voltage, low-power systems, transconductance-C integrators appear to be the best, if not the only, option [16].
In order to achieve a low-noise design, a class-AB transconductor with only the minimal number of active elements is sought. As a starting point, the two-transistor circuit of Fig. 3a is selected. This transconductor is tunable over a wide frequency range via the supply voltages or substrate terminals. It has no bias element, and therefore has the most favorable noise properties among all CMOS transconductor configurations at a given bias current [13, 14]. In the following section, the dynamic range is computed for the self-biased transconductance-C integrator configured as a lowpass filter.
Dynamic Range of the Self-biased Transconductance-C Integrator
The high end of the dynamic range of an integrator is the maximal signal level it can handle. For applications in which linearity in the signal is of utmost importance, the maximal signal level is the level at which distortion products are just equal to the noise floor. We refer to this type of dynamic range as the distortion-free dynamic range. For applications in which a certain amount of distortion is tolerable, the maximal signal level is the level at which distortion products are just equal to the maximal allowable fraction of the signal level. We call this type of dynamic range the distortion-limited dynamic range. These two types can be described formally in terms of the signal, distortion, and noise levels (the noise level being denoted $V_n^2$) and the percent allowable distortion $c$. The convention adopted here is to refer all voltage levels to the input.
Let $I_b$ denote the current through each of the two transistors in Fig. 3a for $V_{in} = 0$. Assuming subthreshold operation as described in Appendix A, an expression for the output current is given by
The transconductance around the operating point, $V_{in} = 0$, is $G = -2\kappa I_b / V_t$.
Assuming the noise produced by the two transistors is independent, the output current noise power spectrum is $2qI_b + 2qI_b$ (see Appendix A). The input-referred power spectrum is found by dividing $4qI_b$ by $G^2$, to obtain $$S_{V_{in}}(\omega) = \frac{4kT}{2\kappa|G|} = \frac{4kT\gamma}{|G|} \qquad (7)$$
where the noise factor $\gamma = 1/(2\kappa)$. For $\kappa > 0.5$, it appears that the self-biased transconductor has a lower noise factor than an equivalent passive conductance. To determine the input-referred noise level, multiply $S_{V_{in}}(\omega)$ by the equivalent noise bandwidth of the particular filter.
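The relation between shot noise and the noise factor can be checked numerically. The sketch below assumes that the transconductance is $G = -2\kappa I_b/V_t$, with $\kappa$ the gate efficiency of Appendix A and $V_t = kT/q$, and uses an arbitrary bias current of 1 nA and $\kappa = 0.7$; these values and the function names are chosen for illustration only.

```python
K_BOLTZMANN = 1.380649e-23     # Boltzmann constant, J/K
Q_ELECTRON = 1.602176634e-19   # elementary charge, C


def input_referred_noise_psd(i_bias, kappa, temperature=300.0):
    """Input-referred noise PSD (V^2/Hz) and transconductance G (A/V).

    Assumes G = -2*kappa*I_b/V_t and independent shot noise of 2*q*I_b
    from each of the two transistors.
    """
    v_t = K_BOLTZMANN * temperature / Q_ELECTRON   # thermal voltage kT/q
    g = -2.0 * kappa * i_bias / v_t                # small-signal transconductance
    s_out = 2.0 * Q_ELECTRON * i_bias + 2.0 * Q_ELECTRON * i_bias
    return s_out / g ** 2, g


def noise_factor(i_bias, kappa, temperature=300.0):
    """Ratio of the input-referred PSD to 4kT/|G|, that of a passive conductance."""
    s_in, g = input_referred_noise_psd(i_bias, kappa, temperature)
    return s_in * abs(g) / (4.0 * K_BOLTZMANN * temperature)


gamma = noise_factor(i_bias=1e-9, kappa=0.7)
print(f"noise factor = {gamma:.4f}, 1/(2*kappa) = {1.0 / (2.0 * 0.7):.4f}")
```

Under these assumptions the ratio is independent of both bias current and temperature, and it falls below unity whenever the gate efficiency exceeds 0.5.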
The mean-square distortion may be written as
where the expectation operator $E$ can be with respect to a deterministic time-domain signal, $V_{in}(t)$, or with respect to a stationary amplitude distribution, $p(V_{in})$. It can be shown that the distortion level is independent of the bias current $I_b$ and integrating capacitance $C$. If the input signal is assumed to follow a normal amplitude distribution with zero mean and variance $\sigma^2$, the mean-square distortion can be computed numerically from (8). For comparison with the more traditional measure of harmonic distortion, (8) is also computed using a pure tone. Since we know that we will be processing speech signals, a better approximation for the input can be obtained using the double-gamma distribution [18]; however, this distortion measure does not converge for this combination of transconductance function and input distribution. Further work on the numerical algorithms is necessary.
As an example, we compute the dynamic range of the self-biased transconductor configured as a lowpass filter, as in Fig. 3b. The input-referred noise spectrum is double that of a single integrator.
Multiplying $4kT/(\kappa|G|)$ by the equivalent noise bandwidth of the circuit, which is $|G|/4C$, the input-referred noise level is $kT/(\kappa C)$. The distortion-free dynamic range is 41.7 dB for the normal distribution and 44.3 dB for a pure tone. The distortion-limited dynamic range is 45.5 dB and 49.4 dB, respectively, for the same parameter values and a maximum of 2% distortion, as shown in Fig. 4b. Our results suggest that, by using prior knowledge of the amplitude distribution of input signals, a more accurate estimate of the system performance can be obtained.
Fig. 5 shows the second-order bandpass circuit, which, together with the first-order lowpass circuit of Fig. 3b, can be used as building blocks for the basilar membrane model of Fig. 1. Their transfer characteristics are summarized in Table 1. These filters are based on RC and RLC prototypes. One of their desirable properties is the insensitivity of the peak response to component mismatch, i.e., it is always unity. A second is that the noise level in a cascade structure increases only linearly with the number of stages. Finally, this particular second-order section is the optimum design for low-Q, wide-frequency-range filters [19]. Transconductors labeled $G_1$ behave as a one-sided resistor, while those labeled $G_2$ and $G_3$ constitute a gyrator, which effectively converts the capacitor $C$ on the right into an inductor. The parasitic capacitance $C_p$ is assumed to be much less than $C$.
An important problem to consider is the biasing of the transconductors. A linear change in the supply voltage results in an exponential change in the bias current, and hence in the filter corner frequencies. Alternately, a linear change in the substrate voltage results in an exponential change in the bias current. If it were possible to bias both the n-substrate and the p-substrate, as in a twin-tub process, the latter solution would be the best. The substrate sinks and sources very little current, and hence little power would be consumed in setting the bias currents. Both biasing schemes are depicted in Fig. 6.
Filter Bank Elements
Assuming ideally-matched transconductance elements exhibiting only white thermal noise, Fig. 7a shows the theoretical output noise power spectrum of the entire 16-channel filter cascade with parameters $Q_3$ and frequency range as before. The output channel noise is dominated by the quality factor of the second-order sections. Integrating the power spectrum across all frequencies, the output channels have an almost constant RMS noise of 0.105 mV, as shown in Fig. 7b.
Computing the exact distortion for the entire filter structure is beyond the scope of this work. However, we can estimate the dynamic range of each channel using the distortion measure of just a single integrator. Allowing a maximum of 2% distortion, the maximum RMS input signal is only 6.43 mV for a normally distributed input. The peak gain of each output channel is approximately 0.42, due to overlap between the lowpass and bandpass filters. As such, the distortion-limited dynamic range for each channel is approximately $(6.43 \times 0.42 / 0.105)^2$, or 28.2 dB. We can estimate the maximum information rate from (2), noting that the message bandwidth $f_p$ is approximately $\omega_c / 2Q_3$ for each channel. Assuming non-overlapping channels and independence between channels, the total maximum information rate is
For the parameter values chosen earlier, the system capacity is calculated as 68 kbits/sec.
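The per-channel dynamic range quoted above can be verified with a few lines of arithmetic. The sketch below re-evaluates the ratio of the output signal level (maximum RMS input times peak channel gain) to the RMS output noise, using only the values stated in this section; the function name is illustrative.

```python
import math


def dynamic_range_db(v_in_rms, peak_gain, v_noise_rms):
    """Dynamic range in dB as the power ratio of output signal to output noise."""
    amplitude_ratio = v_in_rms * peak_gain / v_noise_rms
    return 10.0 * math.log10(amplitude_ratio ** 2)


# 6.43 mV maximum RMS input, peak gain 0.42, 0.105 mV RMS output noise.
dr_db = dynamic_range_db(6.43e-3, 0.42, 0.105e-3)
print(f"distortion-limited dynamic range per channel: {dr_db:.1f} dB")
```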
The current consumption in the filter bank, not including that needed for tuning, which, admittedly, can be quite large, is theoretically 237 nA. Using a 1.5-Volt power supply, the total power dissipated is approximately 355 nW. Finally, we estimate the maximum information rate per Watt, or number of bits per Joule, as 0.19 bits/pJ.
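The power and efficiency figures follow directly from the quoted current, supply voltage, and information rate. The short sketch below repeats that arithmetic; the function names are illustrative, and the tuning circuitry is excluded, as in the estimate above.

```python
def power_watts(current_amps, supply_volts):
    """Static power drawn from the supply."""
    return current_amps * supply_volts


def bits_per_joule(bit_rate, power):
    """Information rate per Watt, i.e. bits delivered per Joule consumed."""
    return bit_rate / power


power = power_watts(237e-9, 1.5)            # 237 nA from a 1.5 V supply
efficiency = bits_per_joule(68e3, power)    # 68 kbits/sec system capacity

print(f"power: {power * 1e9:.1f} nW")
print(f"efficiency: {efficiency * 1e-12:.2f} bits/pJ")
```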
Conclusions
The problem of optimal extraction of information from sensory signals by "real" computing hardware, in terms of maximum information rate per unit of power consumed, has not been completely resolved in this work. Rather, having experimented with one transconductor circuit design and one architecture, we leave open the possibility of future improvements by way of: (1) integrators with higher dynamic range and/or lower power consumption, (2) enhanced filter architectures, and (3) adaptation.
A Subthreshold CMOS Model
A model for the current in an NMOS device operating below threshold is given by [14, 3, 4]
where $\kappa$ is the gate efficiency, typically 0.6-0.9, and all other terms are by convention. We ignore the effects of a non-zero drain conductance. In addition, the process constants $I_0$ and $\kappa$ are assumed to be the same for both NMOS and PMOS devices. In this way, we are dealing with ideal devices and hence our results represent an upper bound on the achievable dynamic range.
The noise in a subthreshold MOS transistor can be reasonably well modeled as a bias-dependent shot noise with a one-sided power spectrum $S_{i,n}(\omega) = 2qI_{DS}$. According to [20], the power spectrum of flicker noise is proportional to $I_{DS}^2$ and inversely proportional to the gate capacitance. Thus, by operating at low current levels with large transistor dimensions, the flicker noise corner frequency can be made to fall below the audio frequency range.

