Abstract: This work presents a sub-6 µW acoustic frontend for speech/non-speech classification in a voice activity detection (VAD) system in 90 nm CMOS. Power consumption of the VAD system is minimized by architectural design around a new power-proportional sensing paradigm and the use of machine-learning-assisted moderate-precision analog analytics for classification. Power-proportional sensing allows for hierarchical and context-aware scaling of the frontend's power consumption depending on the complexity of the ongoing information extraction, while the use of analog analytics brings increased power efficiency through switching ON/OFF the computation of individual features depending on the features' usefulness in a particular context. The proposed VAD system reduces the power consumption by 10× as compared to state-of-the-art (SotA) systems and yet achieves an 89% average hit rate (HR) for a 12 dB signal-to-acoustic-noise ratio (SANR) in babble context, which is on par with software-based VAD systems.
I. INTRODUCTION
TECHNOLOGICAL innovations are changing the way we interact with electronic devices. Interactions like voice control and gesture recognition are rapidly gaining popularity. Such natural interactive systems need not only many integrated sensors but also always-awake, reactive sensor frontends. These frontends generate large amounts of raw signals that state-of-the-art (SotA) frontends immediately digitize for processing on a DSP. This approach is very robust but not power efficient, as not all raw sensor signals are equally relevant. The net information rate of a sensed signal is quite often significantly smaller than its Nyquist rate [1]-[7]. Existing works such as information-rate processing [1], [2], analog-to-information conversion [3]-[5], and compressed sensing [6], [7] show power savings by extracting or compressing the information from signals before digitizing the data. However, these schemes operate in a static way: the compression or extraction parameters are set beforehand. Yet the information content in raw signals and its application relevance vary dynamically depending on the operating context.
Operating such systems efficiently thus requires dynamic system adaptation depending on the context or signal information content. Existing systems do not exhibit such fine-grained adaptive behavior, which severely limits their power savings, as shown by the solid line in Fig. 1.
We propose a self-scalable, power-proportional sensing paradigm, which gracefully scales the system's power consumption with the amount and complexity of extracted information, i.e., the power consumption for such a system increases only as the task of information extraction gets more complex. To this end, in this paper, we propose key enablers for power-proportionality and apply them to a proof-of-concept acoustic frontend for voice activity detection (VAD).
VAD systems distinguish speech from non-speech in different background noise contexts for varying signal-to-acoustic-noise ratios (SANR). SotA VAD systems [8]-[10] extract complex features like mel-frequency cepstral coefficients and DCT to differentiate speech from non-speech. The high computational complexity of such features results in large power consumption, typically about 50−100 µW [8]-[11], in addition to the power consumption of the required active microphone. Such a continuously large power consumption is unacceptable for battery-powered always-on sensor frontends. This work exploits our new power-proportional sensing paradigm along with moderate-precision, computationally inexpensive analog feature extraction, coupled with an embedded mixed-signal classifier, to reduce power consumption by more than 10× over SotA without compromising classification accuracy.
The outline of this paper is as follows. Section II discusses insights into the design principles for power-proportional sensing and explains the rationale behind the analog feature extraction instead of the commonly used digital scheme. Section III describes the architecture and specification set for VAD while the detailed implementation is discussed in Section IV. Measurement results for the chip and for the full VAD system are discussed in Section V.
II. KEY PRINCIPLES FOR POWER-EFFICIENT SENSING
This section details the two key principles that allow our always-on sensing system to scale its power consumption with the information extracted, saving 10× power over SotA VAD systems.
A. Power-Proportional Sensing
The core premise for power-proportional sensing is that power consumption of the sensing system scales proportionally with the complexity of the sensing task. The sensing process with the target of information extraction can increase in complexity along two dimensions. First, the amount of information extracted from the incoming signal can scale in complexity. Consider, e.g., the task of speaker identification versus speech detection. The former task entails the latter as a prerequisite first step, thus justifying the increase in power consumption. Enabling hierarchical operation for tasks of increasing complexity allows scaling of power consumption with complexity of information extraction. In such an architecture, each processing stage extracts more complex information than the previous stage while consuming more power. This enables information extraction by necessity, as shown along the horizontal axis in Fig. 1.

0018-9200 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
Second, even if the amount of extracted information remains the same, distinguishing the useful information from the background noise (the context) is subject to varying levels of difficulty. For this case, consider the complexity of speech detection in a quiet office, in contrast to a noisy street environment. The amount of information needed is the same in both cases, but in the latter case, as the background noise maps directly onto the information spectrum, it creates in-band interference on the desired signal. As such, distinguishing speech from non-speech becomes more complex, hence justifying the increase in power consumption. Context-awareness enables power-proportional sensing to scale power as the background noise context scales the complexity of information extraction, as shown in bold in Fig. 1. For the example above, context-awareness allows the use of a much smaller discriminating feature subset in a low-noise environment and a relatively larger subset for noisy background contexts, hence scaling power.
SotA sensing systems do not exploit the power scaling opportunity offered by the above scenarios, and typically operate constantly in full processing mode. This plateaus the on-state power consumption for SotA sensing systems independent of system utility as shown in Fig. 1 .
B. Power Efficiency Through Analog Analytics
The power-proportional sensing paradigm highlighted in the previous section needs complexity- and precision-dependent power-scalable hardware blocks. Such power scaling with precision is very different for analog and digital implementations. Analog power consumption scales gradually for thermal-noise-limited systems with low-to-medium precision, while digital has a logarithmic power versus precision profile. As shown in [12] and in Fig. 2, for a 0.25 µm CMOS technology, analog computation is not only more power efficient than digital for low-to-medium resolution processing but also exhibits better scalability.

Fig. 2. [...] [12] and impact on efficiency crossover point due to voltage scaling and due to digital assistance by machine learning and/or calibration.
Reduction in supply voltage due to technology scaling allows more power-efficient digital circuits and questions the beneficial analog behavior in advanced technologies. This is because with scaling, the cost of maintaining the same precision in analog increases, as a larger bias current is needed to reduce the noise floor, compensating for the reduction in signal swing. Assuming that the supply voltage has scaled from 2.5 V for 0.25 µm to 0.9 V for a 40 nm technology, the active digital power has scaled down by $10\log_{10}(2.5^2/0.9^2) \approx 9$ dB while analog power consumption goes up by 4.5 dB [12] for subthreshold design. Contrasting effects of reduction in average capacitance per node and increase in subthreshold leakage on digital power consumption are not considered here. The above discussion implies that while analog keeps its favorable scalability, the analog-digital efficiency crossover point moves to the left by 2 bits. This renders analog computation cheaper than digital for up to 7 bits of precision, as shown in Fig. 2.
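The voltage-scaling arithmetic above can be checked with a short sketch. The supply voltages and the 4.5 dB analog penalty are taken from the text; the conversion of the combined shift into bits assumes roughly 6.02 dB per bit, which is an assumption of this sketch rather than a figure stated in the paper.

```python
import math

def digital_power_saving_db(v_old: float, v_new: float) -> float:
    """Active digital power scales with V^2; return the saving in dB."""
    return 10.0 * math.log10((v_old ** 2) / (v_new ** 2))

saving_db = digital_power_saving_db(2.5, 0.9)   # 0.25 um -> 40 nm supply
analog_penalty_db = 4.5                         # quoted from [12], not derived here
relative_shift_db = saving_db + analog_penalty_db

# Assumption of this sketch: one bit of precision corresponds to ~6.02 dB.
shift_bits = relative_shift_db / 6.02

print(f"digital saving: {saving_db:.2f} dB")
print(f"relative analog/digital shift: {relative_shift_db:.2f} dB "
      f"(about {shift_bits:.1f} bits)")
```

Under that assumption the combined shift comes out at a little over 2 bits, which is in line with the 2-bit move of the crossover point described above.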
Digital enhancements, such as machine learning and calibration, can restore some of the lost benefit of analog over digital computation for always-on sensing or classification tasks, because these tasks often do not need perfect signal reconstruction but only error-resilient processing such as detection or classification. Specifically, such tasks do not require accurate absolute computations but only relative comparisons of the computed feature values to on-chip trained thresholds, as we will show in the design presented in this paper. Hence, absolute precision requirements for such systems are rather modest, and mismatches and offset impairments are automatically taken care of by the embedded trained classifier in the loop. As demonstrated by this work, as well as some existing works, machine-learning assistance [13], [14] and/or digital calibration [15] can improve SNR by 6-10 dB for comparable power, which pushes the efficiency crossover point to the right, as shown in Fig. 2. These estimations support the use of analog computation for systems requiring scalability up to 8 bits of precision.
III. SYSTEM ARCHITECTURE AND SPECIFICATIONS
This section highlights the use of the aforementioned key principles in the developed VAD architecture [16] and derives the specifications for the analog/mixed-signal building blocks.
A. VAD System Architecture
The top-level block diagram of the proposed power-proportional VAD system is shown in Fig. 3. The main sub-blocks of the system are the threshold-based wakeup block, the analog feature-extractor, the mixed-signal classifier, and the microcontroller, which operate in the described power-proportional sensing fashion as follows.
An always-awake threshold-based wakeup block keeps checking the passive microphone for sound activity. When any signal (not necessarily a useful one) is detected, it wakes up the analog feature-extractor, which translates the input signal into a set of features. The on-chip classifier uses these computed features to classify the incoming signal as speech/non-speech. If the signal is speech, the classifier wakes up the microcontroller for more advanced processing.
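The wake-up chain described above can be summarized as a small state-selection function. This is only an illustrative sketch: the mode names and the `next_mode` helper are hypothetical and do not correspond to signals or blocks named in the paper.

```python
from enum import Enum

class Mode(Enum):
    DETECT = 1    # only the threshold-based wakeup block is active
    CLASSIFY = 2  # analog feature-extractor and on-chip classifier are active
    PROCESS = 3   # microcontroller is awake for more advanced processing

def next_mode(sound_detected: bool, is_speech: bool) -> Mode:
    """Pick the lowest-power mode that can still handle the current input."""
    if not sound_detected:
        return Mode.DETECT
    if not is_speech:
        return Mode.CLASSIFY
    return Mode.PROCESS

# Example: silence, non-speech sound, and speech.
for detected, speech in [(False, False), (True, False), (True, True)]:
    print(detected, speech, next_mode(detected, speech).name)
```

Each later stage is only reached when the earlier, cheaper stage has flagged the input as worth the extra power.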
Such hierarchical activation of information extraction hardware allows the VAD system to be in the lowest power-mode possible, while still able to compute the necessary information. This allows scaling the power with necessary information as outlined in Section II-A1. Also, as not all computed analog features carry information under all background noise contexts, machine-learning-based context-awareness allows dynamically disabling the computation of features that do not assist in classification. Such context-aware computing allows further power scaling depending on the number of useful features necessary as explained in Section II-A2. The control of feature activation and classifier configuration is done by the embedded microcontroller. This microcontroller periodically wakes up to check for background noise context changes and upon detecting a change, retrains the classifier, and activates the required features for the new context. As further modeled in Section III-B, considering that the analog feature-extraction blocks are in the loop during this training operation, all static analog impairments such as mismatch, gain errors, or offsets are absorbed in the trained feature thresholds and do not affect the classification accuracy. This justifies the usage of low-precision analog analytics for feature computation, as discussed in Section II-B. Before detailing the design of individual sub-blocks in Section IV, Section III-B derives specifications for the targeted VAD system.
B. Specifications for VAD System
This section first derives the system-level specifications and then the specifications for individual analog blocks. The system computes an analog feature-set for the acoustic signal by decomposing the signal into different frequency bands and then extracting the average value of the rectified signal in each frequency band. Mathematically, each analog feature $af_i$ is defined as

$$af_i = \overline{\mathrm{abs}\big(A\,x(t) * h_i(t)\big)} \qquad (1)$$

where $A\,x(t)$ is the amplified acoustic signal, $h_i(t)$ is the impulse response of the bandpass filter (BPF) used to decompose the input signal into a smaller frequency band, and abs, $*$, and $\overline{x}$ represent the absolute value, convolution, and averaging, respectively. The features thus represent the average power present in every frequency band. It is, therefore, important to determine the required frequency range, number of observed frequency bands, and the necessary precision, as these parameters will strongly influence the classification accuracy as well as the system's power consumption. Such system specifications are evaluated based on a MATLAB model of the analog feature-extractor of the VAD system based on (1).
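As an illustration of (1), the following discrete-time sketch computes one feature by bandpass filtering, rectifying, and averaging. It is not the authors' MATLAB model: the biquad coefficients (a standard second-order bandpass form), the sampling rate, the Q value, and the function names are all assumptions made for this example.

```python
import math

def bandpass_biquad(fc, q, fs):
    """Second-order bandpass coefficients with unity peak gain (assumed form)."""
    w0 = 2.0 * math.pi * fc / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (alpha / a0, 0.0, -alpha / a0)
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a

def analog_feature(x, fc, q=2.0, fs=16000.0, gain=1.0, f_avg=16.0):
    """af_i = average(abs(A*x filtered by h_i)); averaging is a one-pole LPF."""
    (b0, b1, b2), (a1, a2) = bandpass_biquad(fc, q, fs)
    k = 1.0 - math.exp(-2.0 * math.pi * f_avg / fs)  # one-pole LPF coefficient
    x1 = x2 = y1 = y2 = 0.0
    avg = 0.0
    for sample in x:
        xin = gain * sample                              # A * x(t)
        y = b0 * xin + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2  # BPF h_i
        x1, x2, y2, y1 = xin, x1, y1, y
        avg += k * (abs(y) - avg)                        # rectify, then average
    return avg

# A 1 kHz test tone should excite the 1 kHz band far more than the 200 Hz band.
fs = 16000.0
tone = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(8000)]
af_1k = analog_feature(tone, 1000.0, fs=fs)
af_200 = analog_feature(tone, 200.0, fs=fs)
print(af_1k, af_200)
```

The gain `A` multiplies every feature by the same factor, which is why a static gain error can be absorbed into trained thresholds rather than corrected in hardware.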
Along the frequency axis, the bulk of energy for speech and acoustic noise is concentrated in the frequency range 100 Hz to 4 kHz [17]. The MATLAB model varies the number of computed features in this frequency range by scaling the Q factor of the BPFs. This ensures that the entire frequency range is always populated with filters, with an increasing frequency resolution as the number of computed features increases. The results of this simulation are shown in Fig. 4(a). More features improve classification accuracy, yet accuracy gains diminish beyond 16 features, allowing us to limit our design to a maximum of 16 features that can be individually activated or deactivated. Further, the model also evaluates the impact of static analog impairments, e.g., by degrading the gain in the signal path. As seen in Fig. 4(b), as long as these impairments occur within the training loop, they are absorbed in the thresholds learnt for classification and thus have no impact on classification accuracy. The histogram in Fig. 5 shows the relative relevance of each of the 16 analog features in the speech versus non-speech classification for the exhibition noise context with 0 dB SANR. It is clear that the middle-frequency features $af_5$ to $af_{12}$ are more commonly used. Hence, we only pass these features to the on-chip classifier, while the full feature-set is passed on to a microcontroller only when needed for more complex tasks, such as context-change detection.
Another important group of parameters are the maximum input-referred noise and the necessary gain for the system. The specifications for input-referred noise and gain strongly depend on the input signal level, which depends on the type and make of the microphones used in the system. The active microphones used by SotA VADs consume 20−50 µW [18], [19] in addition to the power consumption of the VAD circuitry itself. This is unacceptably high for always-on acoustic sensing systems. Such systems thus necessitate the use of passive microphones in low-power-budget applications. Such passive microphones typically have a sensitivity down to −60 dBV. This translates to an rms signal level of 30 µV at 65 dB sound pressure level (SPL) for a nominal conversation at 1 m distance [20]. This limits the maximum allowable noise floor to less than 30 µVrms and also decides the minimum gain necessary in the amplifier depending on the LSB size, being 45 dB to achieve 8 bit precision over 1 V. This design has a gain range from 20 to 80 dB in 20 dB steps to cover a wide range of input signals, although we anticipate that only up to 60 dB would be necessary. Also, the averaging time depends on the frequency of classification, which in a typical VAD system is every 10-16 ms [8]-[10]. This averaging is implemented as an LPF with an $f_{-3\,\mathrm{dB}}$ of 16 Hz. A summary of the VAD system specifications is highlighted in Table I.

IV. SYSTEM IMPLEMENTATION

This section details the implementation nuances of the individual system blocks discussed in Section III-B, namely the wakeup detector, the analog feature-extractor, and the embedded mixed-signal classifier. A further section discusses system training for the complete VAD system before discussing one-time calibration and measurement results in Section V.
A. Wakeup Detector
The always-awake threshold-based wakeup detector acts as the system's watchdog that wakes up the analog feature-extractor only when a signal of sufficient strength is detected. A single bit of information indicating the presence or absence of an acoustic signal is needed. The wakeup detector is a low-power three-phase comparator and its schematic is shown in Fig. 6. As the input signal level for this comparator can be as low as 30 µV and the comparator reference $V_{\mathrm{ref,comp}}$ is generated using a 1.2 V, 8 bit DAC, at least 45 dB gain is necessary in the preamplifier to keep the signal swing greater than 1 LSB ∼ 4.5 mV.
The preamplifier is a cascade of four single-stage amplifiers. Each amplifier is a PMOS-input source-coupled single-ended differential amplifier designed to provide a midband gain of 20 dB, and each can be turned ON/OFF individually to save power depending on the microphone's signal level. The $f_{-3\,\mathrm{dB}}$ of the amplifier is limited to 2 kHz, as only the speech envelope needs to be detected. The comparator reference $V_{\mathrm{ref,comp}}$ can potentially vary with the ambient noise level, but this is beyond the scope of this work. Measured power consumption of this block is 700 nW with all four amplifier stages turned ON, excluding the external bias.
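The gain requirement can be sanity-checked with a short calculation. The sketch below uses the 30 µV signal level, 1.2 V reference range, and 8 bit resolution quoted in the text; the helper name `min_gain_db` is hypothetical, and treating the rms level directly as the swing to be compared against one LSB is a simplifying assumption.

```python
import math

def min_gain_db(v_signal: float, v_full_scale: float, bits: int):
    """Gain needed to lift v_signal up to one LSB of a DAC with the given range."""
    lsb = v_full_scale / (2 ** bits)
    return 20.0 * math.log10(lsb / v_signal), lsb

gain_db, lsb = min_gain_db(30e-6, 1.2, 8)
print(f"LSB = {lsb * 1e3:.2f} mV, minimum gain = {gain_db:.1f} dB")
```

This gives an LSB of roughly 4.7 mV and a minimum gain just under 44 dB, which is consistent with the 45 dB figure above once some margin is allowed.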
B. Analog Feature-Extractor
On receiving the wakeup signal from the threshold-based wakeup detector, the analog feature-extractor decomposes the input signal into the set of 16 features. The on-chip classifier evaluates whether the signal is potentially speech or background noise by comparing a feature subset to trained thresholds in a decision tree (DT) topology (see Section IV-C). This section first describes the flow of the acoustic signal through the analog feature-extractor, followed by the implementation details of the individual blocks that participate in feature extraction.

The output of each BPF is averaged by a rectification and LPF operation, which results in 16 analog features $af_1$ to $af_{16}$, from which the subset $af_5$ to $af_{12}$ is used by the on-chip classifier.
The partitioning of the amplification between the shared LNA and the individual frequency bands allows a finer control over necessary amplification in each band. This contributes to power-proportional information extraction, as it allows turning OFF amplifier stages of unused features along with all other circuitry involved in individual feature computation. This enables context-aware power savings, as discussed in Section II-A2. The sub-blocks of the analog feature-extractor are now explained in more detail.
1) LNA and Amplifiers: The LNA is interfaced with a passive microphone and is designed to provide a midband gain of 20 dB up to a frequency of 5 kHz while keeping the rms integrated input-referred noise smaller than 30 µV. The LNA is shared across all 16 bands, as can be seen from Fig. 7. Further amplification in each band is done through a cascade of four individually controllable single-stage amplifiers, with each stage designed to provide 20 dB gain, as in Fig. 7. A single-stage amplifier topology was chosen for both the LNA and the in-band amplifiers for efficiency reasons, to avoid the power overhead of pushing nondominant pole(s) beyond the unity-gain bandwidth. The closed-loop gain error introduced by insufficient open-loop gain is a static error and is, as discussed, absorbed in the training phase.
The pseudoresistive feedback fixes the output bias point of the amplifier, as shown in Fig. 8. As the area of the input transistors is large (80 µm × 10 µm) to reduce the flicker noise, gate leakage current of up to 20 pA can shift the output bias point by as much as 50 mV due to the voltage drop across the pseudoresistor. The interstage capacitive coupling, however, ensures that the bias point shift is not cascaded to the next stage.
As will be discussed later, the BPFs across the bands have increasing center frequencies. To accommodate this, the $f_{-3\,\mathrm{dB}}$ of the amplifiers in each band also increases progressively from band 1 to band 16. This is illustrated by the simulated magnitude response of the amplifiers in Fig. 9.
2) Bandpass Filters: The amplifier output in each of the 16 bands is passed through a BPF whose center frequency ($f_c$) increases exponentially from 75 Hz in band 1 to 5 kHz in band 16. The $f_c$ for a second-order gm-C filter (see Fig. 10) is scaled by varying the bias current across the bands. From the BPF frequency response in Fig. 11, it can be seen that stop-band attenuation for individual filters is better than −40 dB, but the adjacent band rejection is only −1.5 dB. This adds redundancy in the extracted features, leading to a high correlation between features of adjacent channels. This makes the system tolerant to shifts in the center frequency of BPFs.
3) Averaging Circuit: The output of each BPF is averaged individually by first rectifying and then low-pass filtering with an $f_{-3\,\mathrm{dB}}$ of 16 Hz to result in 16 analog features ($af_1$ to $af_{16}$). The architecture of the current-mode averaging is shown in Fig. 12. Normally-off transistors used for rectification (in dotted box) turn ON based on the direction of current from the BPF. The current steering network makes the current direction unipolar and is read across the gm-based resistors. A first-order gm-C LPF extracts the average value of this unipolar signal. Such normally-off transistors result in asymmetric rectification (dashed line), as in Fig. 13. This adds a dc offset to the computed feature level, shown by the averaged line (dash-dotted line) in Fig. 13. Such offsets can be learnt during the training phase and do not affect classification accuracy.
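To give a feel for the band placement, the sketch below generates 16 center frequencies between 75 Hz and 5 kHz. It assumes that "exponentially" means a constant ratio between adjacent bands; the actual on-chip center frequencies are not listed in the text and may differ.

```python
def center_frequencies(f_first=75.0, f_last=5000.0, bands=16):
    """Geometrically spaced center frequencies (constant ratio is an assumption)."""
    ratio = (f_last / f_first) ** (1.0 / (bands - 1))
    return ratio, [f_first * ratio ** i for i in range(bands)]

ratio, fcs = center_frequencies()
print(f"adjacent-band ratio: {ratio:.3f}")
for band, fc in enumerate(fcs, start=1):
    print(f"band {band:2d}: {fc:7.1f} Hz")
```

With this spacing each band sits about a third higher in frequency than its neighbor, which helps explain why adjacent second-order filters overlap strongly.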
C. DT-Based Classifier
The extracted feature subset $af_5$ to $af_{12}$ is passed on to the on-chip classifier (Fig. 5), while the complete feature-set $af_1$ to $af_{16}$ can be passed on to an off-chip ADC for more complex information extraction, such as context-change detection and retraining the classifier as in [22]. In these cases, the aggregate Nyquist sampling rate for the features is only 16 features × 2 × 16 Hz = 512 Hz instead of 8 kHz for audio. The external ADC is not needed for embedded speech/non-speech classification.
The implementation of the on-chip 7-node, 3-level mixed-signal DT classifier is shown in Fig. 14. Each node of the DT can be configured to select one feature out of $af_5$ to $af_{12}$. The selected feature ($sf_i$) is then compared with a reference voltage ($V_{\mathrm{ref},i}$) determined by a modified C4.5 machine-learning algorithm [22], generating the output decision $b_i$ of each node

$$b_i = \left(sf_i > V_{\mathrm{ref},i}\right) \oplus inv_i \qquad (2)$$

where the $inv_i$ bit sets the comparison direction. The decision fusion logic shown in Fig. 14 combines the outputs of all DT nodes.
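A behavioral sketch of such a 7-node, 3-level tree is given below. Each node compares its selected feature with a reference and optionally inverts the result, as described above. The heap-style node ordering, the leaf labels, and the names `Node` and `classify` are assumptions of this sketch; the text does not specify how the fusion logic is wired.

```python
from dataclasses import dataclass

@dataclass
class Node:
    feature: int   # index of the selected feature, 5..12 (af_5..af_12)
    vref: float    # trained reference voltage
    inv: bool      # flips the comparison direction

    def decide(self, af) -> bool:
        # b_i = (sf_i > Vref_i) XOR inv_i
        return (af[self.feature] > self.vref) != self.inv

def classify(nodes, leaves, af) -> bool:
    """nodes: 7 Nodes in heap order (root 0, children 2i+1 and 2i+2).
    leaves: 8 booleans (True = speech), one per root-to-leaf path."""
    i = 0
    for _ in range(3):                       # three comparison levels
        i = 2 * i + 1 + (1 if nodes[i].decide(af) else 0)
    return leaves[i - 7]                     # indices 7..14 are the leaves

# Hypothetical configuration: the root alone decides the outcome.
nodes = [Node(8, 0.5, False)] + [Node(5, 0.1, False) for _ in range(6)]
leaves = [False] * 4 + [True] * 4
features = {k: 0.0 for k in range(5, 13)}
features[8] = 0.9
print(classify(nodes, leaves, features))
```

In hardware all comparators can evaluate in parallel and the fusion logic selects the relevant path; the sequential walk here is only a functional equivalent.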
D. VAD System Training
The DT configuration and the individual feature activation are done using machine learning, which selects the most discriminative features between speech and the current background noise context. To this end, the on-chip DT classifier is trained with our modified C4.5 algorithm with 160 s of labeled data from the standardized NOIZEUS database [23]. The traditional C4.5 algorithm selects a feature-set to maximize the total information-gain. Our modification to C4.5 maximizes the information-gain/watt and therefore outputs a resource-efficient model that maximizes the information capture while minimizing the power [22]. This is enabled because each feature extracts information from a higher frequency band, so that the power cost increases from $af_1$ to $af_{16}$. This maximization of information-gain/watt furthers power-proportionality by increasing power consumption only for more (complex) information. The training runs on the microcontroller to generate a discriminating feature subset and reference levels for the comparison in the DT. The training results of the past context are not stored but dynamically learnt as a context change is detected [22].
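The idea of ranking splits by information gain per unit power can be sketched as follows. This is not the algorithm of [22]: the scoring rule (plain information gain divided by a per-feature power cost), the candidate thresholds, the function names, and the power numbers in the example are all assumptions made for illustration.

```python
import math

def entropy(labels):
    """Binary entropy of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0.0)

def info_gain(values, labels, threshold):
    """Entropy reduction obtained by splitting at value <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / n
    return entropy(labels) - remainder

def best_split(features, labels, power_uw):
    """Return (score, feature, threshold) maximising gain per unit power."""
    best = None
    for name, values in features.items():
        for t in sorted(set(values))[:-1]:   # skip max so both sides are non-empty
            score = info_gain(values, labels, t) / power_uw[name]
            if best is None or score > best[0]:
                best = (score, name, t)
    return best

# Hypothetical data: both features separate the classes, af5 is cheaper.
features = {"af5": [0.1, 0.2, 0.8, 0.9], "af12": [0.1, 0.2, 0.8, 0.9]}
labels = [0, 0, 1, 1]
power_uw = {"af5": 0.2, "af12": 0.4}
best = best_split(features, labels, power_uw)
print(best)
```

When two features are equally informative, the cheaper one wins, so higher-frequency (more power-hungry) bands are only activated when they add information the lower bands cannot provide.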
V. MEASUREMENT SETUP AND RESULTS
The proposed system has been implemented on a 2 mm² chip in 90 nm CMOS, as shown in Fig. 15. This section details the measurement results for the chip and for the complete VAD system.
A. Chip Performance Results
This section first discusses the measurement results for the LNA and some individual blocks in the 16th feature band in the chip followed by measurement results for complete bands.
The input-referred noise for the LNA is shown in Fig. 16 . The noise has been measured at the LNA output over a frequency range of 10 Hz to 10 kHz. The rms input-referred integrated noise over the range of 75 Hz to 10 kHz is 32.5 µV. The total input-referred noise is expected to be 15% larger as this does not include the contributions from subsequent amplifier stages. For 3% and 5% THD, dynamic range is measured to be 40.2 and 45.4 dB, respectively, at 1 kHz.
Frequency responses of the individual blocks in the 16th feature band are shown in Fig. 17. Compared to simulation, the midband gain of the LNA is reduced by 4 dB, which is estimated to be due to insufficient open-loop gain. The large-signal frequency response for the complete bands is shown in Fig. 18. As the $f_c$ of the BPFs increases across bands, each band progressively processes higher frequency content to compute a feature; hence, for a constant capacitive load, the power consumption increases from band 1 to band 16, as can be seen from Fig. 19. As already mentioned in Section IV-D, this allows power-aware learning to enable efficient classification. Finally, the measured rms noise at the output of each band is less than 2 mV. For an output signal range of 400 mV, this gives 7.5 bits of precision.
B. System Measurement Results
The chip is integrated with the microcontroller using external level-shifters and DACs to form the complete VAD. Fig. 20 shows a one-time calibration to characterize mismatch in the ADC and DAC paths. This section also presents the classification accuracy results for the complete VAD system and illustrates the achieved power-proportionality.
Receiver operating characteristic (ROC) curves characterize classifier systems and depict hit rates (HR) for the variables under observation [24]. The ROC curve in Fig. 21 shows that the classification accuracy of our on-chip classifier is on par with the software-based VAD systems of [8], [9], and [25]. Further, Fig. 22 validates the classification capacity over multiple contexts with different background noise conditions.

Table II illustrates the power-proportional sensing in our VAD system by showing the gradual increase in system power consumption with the sensing task complexity. The power consumption for signal detection is measured to be below 1 µW, whereas power consumption for classification varies depending on the complexity of the operating context and has an upper bound of 6 µW. The power consumption for background context-change detection and relearning the DT is estimated to be 57 µW on a Cortex-M4 microcontroller. It is predicted that the VAD system will spend 80% of the time in detection mode, 15% in classification mode, and about 5% of the time performing complex tasks, such as relearning the context or DT training. The resulting duty-cycled power consumption is 3.8 µW for the babble noise context. Further, the power overhead for generating on-chip (currently off-chip) reference voltages for the comparators is leakage limited and is estimated to be less than 50 nW per reference value [26], as the reference voltage needs to drive only the gate nodes at near-dc speed.

Table III compares our work to SotA VADs [8]-[10], [25] and similar systems [27]. While maintaining the same classification accuracy as software VADs, our system reduces the power consumption by 10×. Although hierarchical information extraction adds a maximum latency of 100 ms to the VAD decision task, this does not cause significant information loss, as this latency is smaller than the average duration of a spoken vowel [28].
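The duty-cycled figure can be bracketed with a weighted average over the three operating modes. The sketch below uses the time fractions from the text together with the quoted upper bounds for detection (1 µW) and classification (6 µW) and the 57 µW estimate for relearning, so it yields a ceiling rather than the reported babble-context value; the function name is hypothetical.

```python
def duty_cycled_power(modes):
    """modes: list of (fraction_of_time, power_in_uW); fractions must sum to 1."""
    total_fraction = sum(fraction for fraction, _ in modes)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(fraction * power for fraction, power in modes)

upper_bound_uw = duty_cycled_power([
    (0.80, 1.0),    # detection mode, measured below 1 uW
    (0.15, 6.0),    # classification mode, upper bound of 6 uW
    (0.05, 57.0),   # context-change detection / relearning, estimated
])
print(f"duty-cycled power ceiling: {upper_bound_uw:.2f} uW")
```

The ceiling of about 4.55 µW is compatible with the reported 3.8 µW for babble noise, which presumably reflects per-mode values below these bounds (Table II is not reproduced here).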
VI. CONCLUSION
This work demonstrates a power-efficient acoustic sensing frontend for speech/non-speech classification in a VAD system. The power efficiency is achieved by the use of machine-learning-assisted analog feature computation and by infusing the power-proportionality paradigm in various ways throughout the architecture. The use of analog features for information extraction allows individual turning ON/OFF of features depending on the usefulness of a feature in a particular context, while the power-proportionality concept controls the hierarchical activation of different sub-blocks depending on the complexity of the information extraction task. The idea of power-proportional sensing is demonstrated for an acoustic sensing system and can be extended to other systems such as motion and image sensing systems.

He is currently a Postdoctoral Researcher with CS-DTAI, KU Leuven. His research interests include machine learning, data mining, and artificial intelligence, in general, for industrial applications.
