I. INTRODUCTION

THE development of sophisticated digital watches has stimulated research into voice input systems which simplify the commands for their various functions. The main difficulty, however, is to find a system which is speaker dependent, in order to achieve voice recognition accuracy, and which is capable of extracting and storing the most important features of the vocabulary words in a small RAM.
For this particular application, the voice command sequences are precisely defined. This feature allows the simplification of the recognition system by dividing its vocabulary into two subsets: 1) a mode selection subset with the words WATCH, ALARM, TIMER, CHRONO, and HOME-TIME, and 2) a ten-digit subset. Thus, a direct man-machine interaction is possible through the display, which gives information about the vocabulary subset in use. The reference templates for each user are created from the above isolated words. In addition, the structure of the system is conditioned by the severe requirements concerning the power consumption and the relatively low-frequency clock (32 kHz) which already exists in the watch.
Manuscript received October 22, 1982. N. C. Bui was with Asulab S. A., Neuchatel, Switzerland.
II. SYSTEM DESCRIPTION
The most restrictive factor encountered in the design of speaker-dependent voice recognition systems is the size of the RAM for storing reference templates. Since the density of the RAM in 6 μm CMOS technology is about 100 bits/mm², both analog and digital processing techniques must be applied to the acoustical signal in order to extract significant features and to store a 15 word resident vocabulary in an allocated RAM area of 10 mm² (capacity of 1 kbit).
The analog preprocessing method used in the system is characterized by the following steps. The incoming speech waveform from a miniature electric microphone is passed through an amplifier and preemphasis filter. The purpose of this filter is to increase the energy of the high frequencies, thereby compensating the 6 dB/octave dropoff of the speech spectrum. The filter response presents a constant amplitude up to 300 Hz, a positive slope of 6 dB/octave from 300 Hz to 3 kHz, a constant amplitude from 3 to 6 kHz, and a negative slope of 6 dB/octave beyond this last frequency.
Next, the waveform is analyzed within the frequency range of 190-4800 Hz. As shown in Fig. 1, the analysis consists of breaking the frequency range into seven regions by means of a filterbank. The energy from each channel is then extracted by a rectifier-averager circuit. The output signals are digitized with a variable threshold which tracks the level of the input acoustic signal; this procedure avoids an extra automatic gain control circuit at the input. By sampling every 10 ms, a data stream of 700 bits/s is obtained at the parallel outputs which form the 7-bit data bus. To achieve low power consumption, all the circuitry employed in the analog signal preprocessing, such as the fourth-order bandpass filterbank and the energy extractors, is designed in a switched-capacitor technique [13]-[19]. The fourth-order bandpass filters are obtained by cascading two second-order sections, as shown in Fig. 2.
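The rectify, average, and threshold steps can be illustrated with a short software sketch. It is not the switched-capacitor circuit itself: the function names are hypothetical, and the threshold is assumed to be the mean energy over the seven channels, following the description of the rectifier-averager given in Section III. Because the threshold scales with the input level, the resulting bits do not depend on loudness, which is why no separate gain control is needed.

```python
FRAME_PERIOD_MS = 10  # one sample of every channel each 10 ms (from the text)
CHANNELS = 7          # seven filterbank channels (from the text)


def channel_energy(samples):
    """Rectify (absolute value) and average one channel's filter output."""
    return sum(abs(x) for x in samples) / len(samples)


def digitize_frame(channel_samples):
    """Return one 7-bit frame: 1 where a channel exceeds the tracking threshold.

    Assumption: the threshold is the mean energy over all channels.
    """
    energies = [channel_energy(s) for s in channel_samples]
    threshold = sum(energies) / len(energies)
    return [1 if e > threshold else 0 for e in energies]


def bit_rate():
    """Bits per second on the 7-bit bus: 7 channels x 100 frames/s."""
    return CHANNELS * (1000 // FRAME_PERIOD_MS)
```

Scaling every channel by the same factor leaves the frame unchanged, and `bit_rate()` reproduces the 700 bits/s figure quoted above.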
Further digital processing of the signal is performed by a specially designed microprocessor.
Elimination of redundancy in the voice signal, coding and compression of data, and word length normalization are the main tasks of the microprocessor.
Fig. 3. Normalized matrices corresponding to words ZERO and SIX.
In particular, coding and compression of the 700 bit/s data stream delivered by the analog preprocessor is accomplished by the well-known method of run-length coding, which is enhanced by a nonlinear digital filtering routine, as explained later on. The word length is normalized to a constant value of 20 samples. Normalized matrices corresponding to the words "ZERO" and "SIX" are shown in Fig. 3, where the seven columns represent the seven channels ordered from left to right by increasing frequency.
The coding method is represented in Fig. 4. Each run corresponding to the energy of a filter channel requires 13 bits of RAM, distributed in three fields as follows: one 3-bit field reserved for the channel address, one 5-bit field for the run start time, and one 5-bit field for the run end time (an alternative is to normalize the word length to 16 samples, resulting in a gain of one bit per start time and end time). This method is only efficient if a small number of runs (typically 5-7) exists in the matrix. The number of runs in the same channel is reduced by means of a postfiltering routine, which first concatenates two consecutive runs separated only by a single sample, and then suppresses those containing only one sample. This operation also eliminates the effect of energy fluctuation during detection.
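The run coding and postfiltering described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the microprocessor routine: the function names are hypothetical, and the order of the three fields inside the 13-bit word (channel, then start, then end) is assumed, since the text gives only the field widths.

```python
def runs_in_channel(bits):
    """Return [start, end] pairs (inclusive sample indices) of the runs of 1s."""
    runs, start = [], None
    for t, b in enumerate(bits):
        if b and start is None:
            start = t
        elif not b and start is not None:
            runs.append([start, t - 1])
            start = None
    if start is not None:
        runs.append([start, len(bits) - 1])
    return runs


def postfilter(runs):
    """Merge runs separated by a single sample, then drop one-sample runs."""
    merged = []
    for start, end in runs:
        if merged and start - merged[-1][1] == 2:  # exactly one empty sample
            merged[-1][1] = end
        else:
            merged.append([start, end])
    return [r for r in merged if r[1] > r[0]]


def encode_run(channel, start, end):
    """Pack a run into 13 bits: 3-bit channel, 5-bit start, 5-bit end.

    The field order is an assumption made for this sketch.
    """
    assert 0 <= channel < 8 and 0 <= start <= end < 32
    return (channel << 10) | (start << 5) | end


def encode_word(matrix):
    """Encode a normalized word: 20 rows (samples) by 7 columns (channels)."""
    codes = []
    for ch in range(7):
        column = [row[ch] for row in matrix]
        for start, end in postfilter(runs_in_channel(column)):
            codes.append(encode_run(ch, start, end))
    return codes
```

As a rough budget check, 1 kbit shared by 15 words leaves under 70 bits per template, which is about five 13-bit runs; this is why the postfilter's reduction of the run count matters.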
The linear normalization of word templates described above can in some cases lead to a predominance of the vowel coding. For instance, in the template of the word "ALARM," the samples corresponding to the consonant "L" are drastically reduced as compared to those of the vowel "A." To avoid this effect, a vowel-cutting technique has been introduced. It consists of keeping only a limited number of identical samples for the same run (typically six), and dropping the rest.
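The effect of vowel cutting on linear normalization can be seen in a small sketch. It assumes that "identical samples" means consecutive identical 7-bit frames and that normalization picks frames at evenly spaced positions; both are interpretations, and the function names are hypothetical.

```python
def vowel_cut(frames, max_repeat=6):
    """Keep at most max_repeat consecutive identical frames, drop the rest."""
    out, count = [], 0
    for f in frames:
        f = tuple(f)
        count = count + 1 if out and f == out[-1] else 1
        if count <= max_repeat:
            out.append(f)
    return out


def normalize_length(frames, target=20):
    """Linearly resample a word to a fixed number of frames (20 in the text)."""
    n = len(frames)
    return [frames[(i * n) // target] for i in range(target)]


# A long steady vowel followed by a short consonant (made-up frames).
A = (1, 1, 0, 0, 0, 0, 0)
L = (0, 1, 1, 0, 0, 0, 0)
word = [A] * 15 + [L] * 3

plain = normalize_length(word)
cut = normalize_length(vowel_cut(word))
```

In this example the consonant occupies 3 of the 20 normalized frames without vowel cutting and 6 of 20 with it, so it carries more weight in the template.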
The same microprocessor is designed to perform the correlation operations between the input spoken word and the reference words. Referring to Fig. 4, the correlation process consists of comparing the templates of each reference word to that of the input spoken word. The distance is defined as the ratio of the disjoined surface S0 to the sum of the surfaces S1 + S2 of these templates. The estimated distance is then multiplied by a warping factor depending upon the length ratio of the two words before normalization. This allows a good distinction between a short and a long word.
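The distance measure can be written out on binary templates as follows. The disjoined surface S0 is taken here as the number of cells in which the two matrices differ (an exclusive-OR), and S1 and S2 as the number of set cells in each template. The exact form of the warping factor is not given in the text; the ratio of the longer to the shorter original length used below is purely an assumption for illustration, as are the function names.

```python
def surface(template):
    """Number of set cells in a binary template (rows of 0/1 values)."""
    return sum(sum(row) for row in template)


def distance(t1, t2, len1, len2):
    """Return warp * S0 / (S1 + S2) for two normalized templates.

    len1 and len2 are the word lengths before normalization. The warping
    factor max/min is an assumed form, not the one used in the chip.
    """
    s0 = sum(a ^ b for r1, r2 in zip(t1, t2) for a, b in zip(r1, r2))
    total = surface(t1) + surface(t2)
    if total == 0:
        return 0.0
    warp = max(len1, len2) / min(len1, len2)
    return warp * s0 / total
```

Without the warping factor the ratio lies between 0 (identical templates) and 1 (templates with no cell in common), so a length mismatch can only push two words further apart.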
III. CIRCUIT DESCRIPTION
As mentioned above, the system is composed of two parts: the analog preprocessor and the digital correlator-processor. In the following, the most important circuits of these parts will be described.
The main circuit of the analog preprocessor is the seven-channel filter bank which performs the spectral analysis of the voice input signal. These filters have the same quality factor Q, and their central frequencies are distributed according to a logarithmic law [10], as shown in Fig. 5. In Table I clock's frequency is now four times lower. The last channel is implemented in the same manner. The second-order sections have also been integrated in a separate test chip. Their structure is insensitive to parasitic capacitors, and the following parameters have been optimized: output swing of the op-amps and capacitor spread values. The measurements of several samples gave the results listed below.
gain: 3 dB with Q = 3 for each section
total consumption: 30 μA with 3 V power supply
signal-to-noise ratio: 50 dB.
The measured frequency responses of the six sections for a 32 kHz clock frequency are reported in Fig. 6.
The rectifier-averager circuit, which extracts the energy from each channel, is also designed in switched-capacitor technology and is shown in Fig. 7. The rectifying operation is performed in two steps. During the first step, the charge is stored in the input capacitor Cs and the polarity of the instantaneous input signal Vmux is detected by the comparator. In the second step, the charge previously stored in Cs is transferred into one of the four output capacitors Ca-Cd. The charge transfer is positive or negative according to the polarity of the comparator output signal. The resulting voltage across any of the output capacitors is the mean value of the rectified input signal. This dc voltage is compared to the mean value of the voltages across all the output capacitors for detection and digitization purposes. To reduce chip area, the same circuit is applied to the outputs of four channels which are multiplexed.
The architecture of the digital correlator/processor has a standard 8-bit microprocessor organization, apart from the sequencer and the instruction decoder. As shown in Fig. 8, this processor contains a program ROM with a capacity of 512 × 21 bits, 1 kbit of RAM, an ALU which implements the functions Exclusive-OR, OR, NAND, INVERSION, and ADDITION between two 8-bit words (directly and/or indirectly addressed), eight registers, an instruction decoder, and a sequencer.
The design of the correlator/processor was made taking into account the relatively low clock frequency.
This implies a very efficient instruction sequencing. Thus, a well-suited set of 25 instructions has been defined, as illustrated in Table II. Each instruction has a length of 21 bits and is composed of one 5-bit op-code field and two 8-bit fields specifying either address or data. The execution timing of each instruction is ensured by the sequencer, which contains a 256 × 6-bit microprogrammed PLA. This feature facilitates possible modifications or optimization of the instruction set. For example, the algorithms of digital processing (coding, data compression, normalization, and correlation between templates) are programmed with only 180 instructions.
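The 21-bit instruction format can be illustrated by packing and unpacking the three fields. The placement of the op-code in the most significant bits and the helper names are assumptions of this sketch; the text specifies only the field widths (5 + 8 + 8 bits).

```python
OPCODE_BITS = 5   # up to 32 op-codes, enough for the 25 instructions
FIELD_BITS = 8    # each operand field holds an address or a data byte
ROM_WORDS = 512   # program ROM depth (from the text)


def pack(opcode, field_a, field_b):
    """Pack one instruction into a 21-bit word (assumed field order)."""
    assert 0 <= opcode < (1 << OPCODE_BITS)
    assert 0 <= field_a < (1 << FIELD_BITS)
    assert 0 <= field_b < (1 << FIELD_BITS)
    return (opcode << (2 * FIELD_BITS)) | (field_a << FIELD_BITS) | field_b


def unpack(word):
    """Split a 21-bit word back into (opcode, field_a, field_b)."""
    return (word >> 16) & 0x1F, (word >> 8) & 0xFF, word & 0xFF


INSTRUCTION_BITS = OPCODE_BITS + 2 * FIELD_BITS
ROM_BITS = ROM_WORDS * INSTRUCTION_BITS
```

With these widths the ROM holds 512 × 21 = 10 752 bits, and the 180 instructions of the processing algorithms occupy well under half of it.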
In order to optimize the chip area and to allow a modular design, the instruction decoder is distributed around the concerned units underneath the control bus rather than located at a fixed place [11] .
IV. ADAPTIVE RECOGNITION
The voice recognition system as described above is designed in low-power CMOS technology for battery-operated consumer products like watches, clocks, toys, etc. For such consumer products, it is essential to avoid any tedious training mode. The automatic adaptive recognition procedure introduced into the system is based on detecting the repetition of a spoken word, which can be interpreted as a systematic error. This detection is done by the correlation between the present spoken word and the previous one, within a delay (typically 4 s). As explained at the bottom of Fig. 9, each time the speaker receives an erroneous answer to the spoken word, he has only to repeat it. This repetition is detected by the system, which will then display not the best matching word (given by the correlation of the spoken word with the references), as previously, but the second best one, and so on. At the end of the scanning procedure, which is generally limited to two or three steps, the templates of the last proposed answer are replaced by the templates extracted from the user's spoken word. This algorithm performs the continuous adaptation of the system to the voice of the user. It is important to note that this adaptation is bypassed in noisy conditions since the system will never find the correlation between a spoken word superimposed with noise and the previous one.
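The adaptive procedure can be sketched as a small state machine. Several details are assumptions made for illustration: templates are flat tuples of bits, a repetition is declared when the distance to the previous utterance is below a hypothetical `repeat_threshold`, and the template replacement is committed when the scan ends (here, when `finish` is called or a non-repeated word arrives). The class and method names are not from the original system.

```python
def distance(a, b):
    """Disjoined surface over summed surfaces, on flat bit tuples."""
    s0 = sum(x ^ y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return s0 / total if total else 0.0


class AdaptiveRecognizer:
    def __init__(self, references, repeat_threshold=0.1, delay_s=4.0, max_steps=3):
        self.references = dict(references)
        self.repeat_threshold = repeat_threshold  # assumed value
        self.delay_s = delay_s                    # typically 4 s (from the text)
        self.max_steps = max_steps                # scan limited to 2-3 steps
        self.prev = None
        self.prev_time = None
        self.rank = 0
        self.answer = None

    def hear(self, template, now):
        """Return the proposed word for an utterance heard at time now (s)."""
        repeated = (
            self.prev is not None
            and now - self.prev_time <= self.delay_s
            and distance(template, self.prev) <= self.repeat_threshold
        )
        if repeated:
            # The user repeated the word: propose the next best candidate.
            limit = min(self.max_steps, len(self.references)) - 1
            self.rank = min(self.rank + 1, limit)
        else:
            self.finish()
        ranked = sorted(
            self.references,
            key=lambda w: distance(template, self.references[w]),
        )
        self.answer = ranked[self.rank]
        self.prev, self.prev_time = template, now
        return self.answer

    def finish(self):
        """End of a scan: adopt the user's utterance as the new template."""
        if self.rank > 0:
            self.references[self.answer] = self.prev
        self.rank = 0
```

A noisy utterance does not correlate with the previous one, so `repeated` stays false and no template is replaced, which mirrors the bypass described above.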
V. CHIP EVALUATION
Several blocks constituting the voice recognition system have been successfully integrated in low-power CMOS technology: the switched-capacitor filterbank and level discriminator for the analog preprocessor, as well as the ALU, the RAM, and the sequencer, shown in Fig. 10, for the correlator/processor. The complete chip will be integrated according to the final product specifications. The estimated chip area for a 15 word vocabulary is 35 mm², and its distribution is given in Table III. Power consumption under the 3 V supply voltage is expected to reach 200 μW.
The system error rate was measured on a simulation breadboard which used exactly the same algorithms mentioned above [12]. Tests were carried out under realistic user conditions. Fig. 11 reports the results of the error rate measurements performed on 13 persons. From these results, it follows that the recognition accuracy peaks at around 95 percent for a relative displacement of three samples, i.e., only seven template correlations are necessary for each word. It is important to note that there is no word rejection during the measurement of the results since the system always proposes an answer to any input word.
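The count of seven correlations follows from trying displacements of -3 to +3 samples. The sketch below reads "relative displacement" as a time shift of the input template against the reference, with the best (smallest) distance retained; this reading, the zero padding at the edges, and the function names are assumptions.

```python
def distance(t1, t2):
    """Disjoined surface over summed surfaces for two binary matrices."""
    s0 = sum(a ^ b for r1, r2 in zip(t1, t2) for a, b in zip(r1, r2))
    total = sum(map(sum, t1)) + sum(map(sum, t2))
    return s0 / total if total else 0.0


def shift(template, d):
    """Delay a template by d samples (negative d advances it), zero padded."""
    n = len(template)
    zero = [0] * len(template[0])
    return [template[t - d] if 0 <= t - d < n else zero for t in range(n)]


def best_distance(inp, ref, max_shift=3):
    """Smallest distance over the 2 * max_shift + 1 relative displacements."""
    shifts = range(-max_shift, max_shift + 1)
    return min(distance(shift(inp, d), ref) for d in shifts)


def make(rows):
    """Hypothetical 20 x 7 test template with channel 0 set on given rows."""
    return [[1 if t in rows else 0] + [0] * 6 for t in range(20)]
```

For two templates that differ only by a two-sample offset, the unshifted distance is nonzero while the shifted search finds an exact match.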
VI. CONCLUSIONS
An integrated voice recognition system for small vocabularies has been realized. It satisfies two fundamental requirements: low power consumption for battery operation, typically 200 μW under a 3 V supply voltage, and fast recognition (less than 1 s) for a 15 word vocabulary with a 32 kHz clock. This has been made possible by the choice of speech processing algorithms, combined with an optimized hardware design using CMOS technology.
The system is simple enough to be implemented in a low-cost, small-area chip, occupying 35 mm² of silicon with a conservative 6 μm technology, and it can use a low-performance miniature electric microphone.
This allows its incorporation in portable consumer products.
Breadboard measurements under realistic environmental conditions have shown that the recognition accuracy is about 95 percent, which is quite acceptable for this kind of product. It is therefore of little use to increase the complexity of the system in order to obtain a better recognition accuracy, since the latter depends not only upon the intrinsic quality of the system but also upon the external conditions.
The adaptive recognition algorithm has been shown to track any short-term modification of the user's voice, as well as long-term modifications, like the replacement of a word by another one, without the introduction of an extra training mode. The implementation of such an algorithm is inexpensive.
Jean G. Michel (M'81)
These two LSI circuits, fabricated with a standard metal-gate CMOS process, have several features. 1) They can be operated with a single power supply over a 3-7 V range. 2) Their power dissipations are, respectively, 0.6 mW (PARCOR) and 0.18 mW (ADM) with the supply at 3 V. 3) High-accuracy 9- and 10-bit R-2R D/A converters are constructed in each LSI circuit.
In the PARCOR system, various high-quality, low-data-rate speech outputs are obtainable. The ADM system is used for voice transmission and synthesis. By adequately applying these two systems to the wide needs of the market, it is possible to achieve good cost performance.
speech time <10 s) by reason of its simple algorithm, a small amount of hardware, and a low clock frequency, although its data rate is relatively high.
From the viewpoint of applicability to the wide needs in the markets, it is better and more economical to use these two techniques properly than to use only one technique.
II. PARCOR SPEECH ANALYSIS-SYNTHESIS TECHNIQUE
Voice signals are nonstationary, but they can be regarded as nearly stationary over a short time period of 20-30 ms. This time period is called a frame. Moreover, human perception is much more sensitive to the amplitude characteristics of the voice spectrum than to the phase characteristics. These facts show that voice signals can be characterized by the patterns of change of the short-time spectrum. The important information in the short-time spectrum is classified into that of the vocal cords and that of the vocal tract. The former is given by the distinction between voiced and unvoiced sounds and by the voiced excitation period (pitch). In the PARCOR analysis-synthesis algorithm, the latter is characterized by several PARCOR coefficients.
These PARCOR coefficients give the essential correlation between two sampling points. They are derived by eliminating the indirect influences exerted through the other sampling points (partial autocorrelation method). Physically, the PARCOR coefficients are equivalent to the reflection coefficients at each discontinuous section of the nonuniform-diameter cylinder model of the vocal tract.
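One standard way to obtain partial autocorrelation coefficients from a frame is the Levinson-Durbin recursion on its autocorrelation sequence. The sketch below uses that textbook recursion as a stand-in; it is not claimed to be the computation performed by the LSI circuit. The sign convention (a positive coefficient for positively correlated samples) and the function names are choices made here, and other texts use the opposite sign for reflection coefficients.

```python
def autocorr(x, max_lag):
    """Autocorrelation sums r[0..max_lag] of one frame of samples."""
    return [
        sum(x[n] * x[n - k] for n in range(k, len(x)))
        for k in range(max_lag + 1)
    ]


def parcor(r, order):
    """Partial autocorrelation coefficients via the Levinson-Durbin recursion.

    a holds the prediction-error filter 1 + a[1] z^-1 + ... + a[m] z^-m.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    coeffs = []
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        p = acc / err                      # partial correlation at lag m
        coeffs.append(p)
        new_a = a[:]
        for i in range(1, m):
            new_a[i] = a[i] - p * a[m - i]
        new_a[m] = -p
        a = new_a
        err *= 1.0 - p * p                 # remaining prediction error
    return coeffs
```

For a first-order process whose autocorrelation decays as 0.5 to the power of the lag, only the first coefficient is nonzero, which shows how the indirect influence of intermediate samples is removed at higher lags.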
The speech synthesis with the calculated parameters is performed by means of tracing the inverse process of analysis.
0018-9200/83/0200-008i $01.00 © 1983 IEEE
