A digital-based isolated word recognition system has been implemented in a module of dedicated hardware that uses a microprocessor and programmable digital signal processing circuitry.
Introduction
Most microprocessor implementations of speech recognition hardware preclude the precision afforded by digital processing of speech by selecting analog preprocessing of the speech waveform for feature extraction. This is because digital-based speech recognition presents a heavy computational load. Feature extraction by digital means requires on the order of 200,000 multiply-add operations per second of speech, and usually requires complicated hardware to perform in real time. Hardware for analog feature extraction, on the other hand, requires much less speed to operate in real time. For example, one common analog feature extractor uses 16 parallel channels of bandpass filters followed by rectifiers and low pass filters with the channel output digitized and recorded every 10 msec. This requires only 1600 read operations per second of speech. The recognizer described here uses a microprocessor and peripheral speech processing circuitry consisting of large scale integrated circuits to achieve near real time response in a compact, all digital module.
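The two throughput figures above can be checked with a few lines of arithmetic. The analog figure follows directly from the text. The digital breakdown is an assumption made here for illustration: it counts, per sample and per overlapping analysis frame, one window multiply plus one multiply-add for each autocorrelation lag, using the 150 µsec sampling period, three overlapping frames, and eighth-order analysis given later in the paper. The function names are hypothetical.

```python
# Rough throughput comparison. The analog count follows the text; the digital
# breakdown is an illustrative assumption, not a figure taken from the paper.

def analog_reads_per_second(channels=16, frame_period_ms=10):
    """Channel outputs read per second of speech for the analog filter bank."""
    return channels * (1000 // frame_period_ms)


def digital_macs_per_second(sample_period_us=150, frames_per_sample=3, lpc_order=8):
    """Assumed count: for every sample and every overlapping frame, one window
    multiply plus one multiply-add for each autocorrelation lag 0..p."""
    samples_per_second = 1_000_000 / sample_period_us
    macs_per_sample = frames_per_sample * (1 + (lpc_order + 1))
    return samples_per_second * macs_per_sample


if __name__ == "__main__":
    print("analog reads/s:", analog_reads_per_second())
    print("digital multiply-adds/s:", round(digital_macs_per_second()))
```

Under these assumptions the digital count lands on the order of 200,000 operations per second, more than a hundred times the analog read rate.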
The recognition algorithm is based upon a feature set of linear predictive coding (LPC) parameters and pattern comparison using dynamic time warping (DTW), as proposed by Itakura in 1975.1 Since then, the Acoustics Research Department at Bell Laboratories has carried out tests on a version of the recognizer which uses a Data General Eclipse minicomputer and CSP MAP-200 array processor. These tests used experienced and inexperienced talkers speaking over dialed-up telephone lines to examine the performance of most aspects of the recognition algorithm. The tests included speaker-trained isolated word recognition,1,2 speaker-independent isolated word recognition, connected digit recognition,4 methods of endpoint detection,5 techniques of dynamic time warping,9 and procedures for training.7 Other tests have embedded the recognizer in systems which used vocabulary partitioning, directory searches, and syntactic analysis to perform such voice-activated tasks as repertory dialing of telephone numbers,8 retrieving telephone directory information,9 and making airline reservations.10 In all simulations, the recognizer was shown to attain performance sufficient for practical use over phone lines for a wide variety of talkers. Therefore, the recognizer may be considered sufficiently well understood to warrant the design of dedicated hardware.

CH1610-5/81/0000-0746 $00.75 © 1981 IEEE
The goal of this effort was to develop dedicated hardware and software that 1) met the computational requirements of the algorithm with sufficient speed to provide a reasonable (<1 sec) response time, 2) was compact, economical, and independent of a host minicomputer, and 3) was sufficiently versatile to follow the evolution of research activities in more advanced speech recognition based upon the same acoustic processor (speaker independence, connected words, syntactic analysis). Therefore, a widely used, well-supported 16-bit microprocessor was combined with a special-purpose signal processor designed for the recognition algorithm. The microprocessor is thus fully supported with development systems and high-level languages, which can be used to simplify programming of global, non-computational functions such as process control, vocabulary partitioning, and syntax analysis. (However, programming for the acoustic processing is in assembly language due to speed requirements.) The signal processor is referred to here as the DSPP (Digital Speech Processing Peripheral). The DSPP, while presently constructed of commercial components, was limited to a complexity that did not exceed the single-chip integration capabilities of current VLSI technology.
System Architecture
The recognizer consists of two processors sharing common address, data, and control buses (Fig. 1). The host processor is a 16-bit microcomputer based upon the Intel 8086 microprocessor. The second processor is the DSPP, which performs the bulk of the 300,000 multiply-add operations required during the recognition of a word from a 40-word vocabulary. Each processor has its own data and program memory, and the two can run simultaneously. The DSPP functions as either 1) a real-time autocorrelation analyzer for three simultaneous analysis frames (Mode 1), 2) a comparator of feature vectors, performing the Itakura log-likelihood distance [...]

[Figure: Microprocessor-based isolated word recognizer]

The recognizer is contained on five S-100 bus cards and is partitioned as follows (Fig. 2): 16-bit microcomputer with program and scratch-pad memory and input and output ports (1 card); DSPP (2 cards); 16K 12-bit words of dynamic random access memory for storage of up to 160 reference templates (1 card); speech prefiltering and digitization circuitry and user space for custom output circuitry, e.g. a telephone dialer (1 card).
Execution of Computations
The recognition algorithm is shown in block diagram form (Fig. 3) and has been described in detail elsewhere.1,9 In the preparation of reference and test patterns, the autocorrelation method is used to perform an eighth-order LPC analysis on a speech signal digitized at a 6.67 kHz sampling rate. An analysis frame size of 300 samples (45 msec) is used, with a new frame beginning every 100 samples (15 msec). Thus, each speech sample falls within three consecutive analysis frames. Simulation results indicated that changes that reduced computations (increasing L or N, or decreasing p) led to significant increases in error rate, while changes that increased computation did not provide enhanced performance. Therefore, no sacrifice in choice of analysis parameters was made to facilitate hardware design. The reference templates (autocorrelations of LPC coefficients) were quantized to 12-bit coefficients after scaling each coefficient to use the entire 12 bits. Scaling factors were based on histograms for the various coefficients obtained over several talkers.11 A simulation indicated that representing templates in this manner did not incur any increase in error rate over the floating-point representation used in the minicomputer studies.
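The template quantization described above can be illustrated with a minimal sketch. It assumes a signed 12-bit code and a single caller-supplied full-scale value per coefficient. The paper derives its scaling factors from histograms over several talkers, which is not reproduced here. The names `quantize_12bit` and `dequantize_12bit` are hypothetical.

```python
# Illustrative 12-bit quantization of template coefficients.
# Assumptions (not from the paper): signed two's-complement codes and a
# full-scale value chosen by the caller for each coefficient position.

MAX_CODE = 2**11 - 1    # 2047, largest signed 12-bit value
MIN_CODE = -(2**11)     # -2048, smallest signed 12-bit value


def quantize_12bit(values, full_scale):
    """Scale values so that +/-full_scale spans the 12-bit range, then round and clip."""
    codes = []
    for v in values:
        code = round(v / full_scale * MAX_CODE)
        codes.append(max(MIN_CODE, min(MAX_CODE, code)))
    return codes


def dequantize_12bit(codes, full_scale):
    """Map 12-bit codes back to approximate coefficient values."""
    return [c * full_scale / MAX_CODE for c in codes]
```

Choosing the full-scale value per coefficient, rather than one value for the whole vector, is what lets each coefficient use the entire 12 bits.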
An important difference between the minicomputer simulations and the hardware implementation is the change from floating point arithmetic to fixed point arithmetic. All portions of the computation have at least 16 bits of dynamic range. Both the log likelihood distance calculation and the autocorrelation analysis are carried out with 32 bits of precision. Prior to LPC analysis, the autocorrelation vector for a completed analysis frame is scaled by shifting to attain 16 bits of precision in the zero-order coefficient. Thus, the LPC calculation is carried out with 16-bit precision regardless of frame energy. For the LPC calculation, 16 bits has been shown to be adequate for a pre-emphasized signal at this sampling rate.12
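The scaling of the autocorrelation vector by shifting can be sketched as follows. This is one interpretation, assuming signed 16-bit words so that the zero-order coefficient is shifted to occupy 15 magnitude bits. The paper does not give the exact bit layout, and `normalize_autocorrelation` is a hypothetical name.

```python
# Illustrative block scaling of an integer autocorrelation vector.
# Assumption (not from the paper): results are held in signed 16-bit words,
# so r[0] is shifted until it lies in [2**14, 2**15).

def normalize_autocorrelation(r, word_bits=16):
    """Shift every coefficient by the same amount so r[0] fills a signed word.

    Returns (scaled_vector, shift). A positive shift is a right shift,
    a negative shift is a left shift.
    """
    r0 = r[0]
    if r0 <= 0:                       # silent frame: nothing to normalize
        return list(r), 0
    shift = r0.bit_length() - (word_bits - 1)
    if shift > 0:
        scaled = [v >> shift for v in r]     # arithmetic shift keeps the sign
    else:
        scaled = [v << -shift for v in r]
    return scaled, shift
```

Because every lag satisfies |R(m)| <= R(0), shifting the whole vector by the amount chosen for R(0) keeps all coefficients within the same word, which is why the subsequent LPC calculation sees the same precision for loud and quiet frames.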
We next examine the partitioning of computation among the various sections of hardware shown in Fig. 1, beginning with feature extraction.
The recognizer accomplishes most of the feature extraction in synchrony with the incoming speech samples. This has several advantages. The entire waveform of the utterance is never stored, thus reducing storage requirements from the 100 words of memory per frame required for waveform storage to 9 words per frame to store feature vectors of autocorrelation coefficients. Furthermore, the signal energy of all previous analysis frames is available for endpoint detection. This allows the recognizer to make a keep/discard decision as speech is in progress, further reducing storage by eliminating frames of silence before and after the word. Finally, most of the operations associated with feature extraction are completed by the time the word ends, reducing response time.
The following operations are performed on a sample-by-sample basis and are completed within the 150 µsec sampling period:

1) the speech sample $s(n)$ is read from the analog-to-digital converter;

2) the signal is preemphasized to yield

$\tilde{s}(n) = s(n) - a\,s(n-1)$,

where $a = 15/16$.

The following operations are performed three times per sample (once for each analysis frame $j$, $j = 0, 1, 2$):

3) allocation to overlapping frames $j$, $j = 0, 1, 2$, where the sample is placed in the final third of one frame ($j = 0$), the middle third of the next frame ($j = 1$), and the first third of the next ($j = 2$);

4) windowing, for $j = 0, 1, 2$:

$x_j(n)\,w(n - jL)$, (3)

$w(n) = 0.54 - 0.46\cos\left[\frac{2\pi n}{N-1}\right]$ (4)

(Hamming window), where $L$, the shift between frames, is 100 samples and $N$, the frame size, is 300 samples. Every $L$ samples, an analysis frame is completed and read from the DSPP, and the signal energy is used for real-time endpoint detection.5
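The sample-synchronous processing described above can be sketched in software as follows. This is a minimal floating-point illustration, not the DSPP microprogram: it assumes the standard Hamming definition with denominator N - 1, it tracks each frame by its position in the window rather than by the index j, and the class name `FrameAnalyzer` is hypothetical. Only the constants L, N, the LPC order, and the preemphasis coefficient come from the text.

```python
import math

L = 100          # shift between frames, in samples (from the text)
N = 300          # frame size, in samples (from the text)
P = 8            # LPC order, giving lags 0..8 (from the text)
A = 15 / 16      # preemphasis coefficient (from the text)

# Assumed standard Hamming window definition.
HAMMING = [0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1)) for n in range(N)]


class FrameAnalyzer:
    """Streaming autocorrelation over three overlapping frames (illustrative)."""

    def __init__(self):
        self.prev_sample = 0.0
        # Frames start L samples apart, so each sample lies in up to three.
        self.frames = [
            {"pos": -k * L, "hist": [0.0] * (P + 1), "acc": [0.0] * (P + 1)}
            for k in range(3)
        ]

    def push(self, s):
        """Process one sample; return the autocorrelation vectors completed by it."""
        pre = s - A * self.prev_sample              # preemphasis
        self.prev_sample = s
        completed = []
        for f in self.frames:
            if f["pos"] >= 0:
                x = pre * HAMMING[f["pos"]]         # windowing
                f["hist"] = [x] + f["hist"][:P]     # keep the p+1 newest samples
                for m in range(P + 1):              # one multiply-add per lag
                    f["acc"][m] += x * f["hist"][m]
            f["pos"] += 1
            if f["pos"] == N:                       # frame complete: emit and restart
                completed.append(f["acc"])
                f["pos"] = 0
                f["hist"] = [0.0] * (P + 1)
                f["acc"] = [0.0] * (P + 1)
        return completed
```

After the first 300 samples, `push` returns one completed nine-element vector every 100 samples, and the waveform itself is never stored beyond the nine most recent windowed samples per frame.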
After the end of the utterance, the utterance length is normalized to a fixed length, typically 40 frames for isolated words of short length, by interpolating between frames. Then Durbin's recursion13 is used to transform vectors of autocorrelation coefficients to vectors of LPC coefficients. The LPC coefficients are then autocorrelated, scaled, rounded to 12 bits, and stored as a reference template. In these computations, the DSPP operates in Mode 3.
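The transformation from autocorrelation coefficients to a reference template can be sketched as below. This is a textbook floating-point form of Durbin's recursion followed by autocorrelation of the resulting coefficient vector. The fixed-point scaling, rounding to 12 bits, and length normalization described in the text are omitted, and the function names are hypothetical.

```python
def durbin(r, order):
    """Durbin's recursion: autocorrelations r[0..order] -> (a, err).

    a[0] is 1 and a[1..order] are the coefficients of the prediction error
    filter; err is the residual energy. Assumes r[0] > 0.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[k] * r[i - k] for k in range(1, i))
        k_i = -acc / err                       # reflection coefficient
        new_a = a[:]
        for k in range(1, i):
            new_a[k] = a[k] + k_i * a[i - k]
        new_a[i] = k_i
        a = new_a
        err *= 1.0 - k_i * k_i
    return a, err


def autocorrelate_coefficients(a):
    """Autocorrelation of the coefficient vector, lags 0..p."""
    p = len(a) - 1
    return [sum(a[k] * a[k + m] for k in range(p + 1 - m)) for m in range(p + 1)]
```

For an eighth-order analysis, `durbin(r, 8)` consumes the nine autocorrelation values of a frame, and `autocorrelate_coefficients` returns the nine values that would then be scaled and quantized for template storage.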
The pattern similarity calculation uses the log likelihood distance metric1 to obtain a distance d(i,j) between frame i of the test pattern and frame j of reference pattern a, as follows: [...]

During the pattern similarity calculation, the computation is partitioned between the DSPP, which calculates d(i,j), and the microprocessor, which determines the accumulated distance D(i,j). The computation proceeds on a frame-by-frame basis through the test pattern. After computing a distance score for each reference template, the list of distances is ordered and used for the decision operation to complete the recognition.

The DSPP performs nearly all of the multiply-add operations required during recognition. Its central element is a 16-bit by 16-bit multiplier-accumulator integrated circuit (TRW 1010J). The additional circuitry around the multiplier accesses and stores operands and results for the multiply-add operations. The multiplier receives its 16-bit operands X and Y and outputs the upper word of their 32-bit product P on the three data buses X, Y, and P (Fig. 4). The lower word of the product is accessed via the Y bus, and the Y and P buses may also be used as a single 32-bit bus for preloading the accumulator and retrieving results.
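The partition of the pattern similarity calculation between the DSPP, which calculates d(i,j), and the microprocessor, which determines D(i,j), can be sketched as follows. Several assumptions are made for illustration and are not taken from the text: the test vector is taken to be pre-normalized so that the distance reduces to the logarithm of a (p+1)-term dot product with the reference template; the local path constraint allows predecessors j, j-1, and j-2 without any restriction on repeated horizontal steps; and no global constraints are applied. All function names are hypothetical.

```python
import math


def itakura_distance(test_vec, ref_vec):
    """Frame distance d(i,j): log of a (p+1)-term dot product.

    Assumes test_vec holds suitably normalized test autocorrelations and
    ref_vec holds the autocorrelations of the reference LPC coefficients.
    """
    return math.log(sum(t * r for t, r in zip(test_vec, ref_vec)))


def dtw_score(test, ref, dist=itakura_distance):
    """Accumulated distance D for aligning all test frames to all reference frames."""
    inf = float("inf")
    prev = [inf] * len(ref)
    prev[0] = dist(test[0], ref[0])           # path is pinned at the first frames
    for i in range(1, len(test)):
        cur = [inf] * len(ref)
        for j in range(len(ref)):
            best = min(
                prev[j],
                prev[j - 1] if j >= 1 else inf,
                prev[j - 2] if j >= 2 else inf,
            )
            if best < inf:
                cur[j] = dist(test[i], ref[j]) + best
        prev = cur
    return prev[-1]                           # path is pinned at the last frames


def recognize(test, templates, dist=itakura_distance):
    """Return (score, word) pairs ordered from best to worst match."""
    return sorted((dtw_score(test, ref, dist), word) for word, ref in templates.items())
```

In this sketch `dist` stands in for the work done by the DSPP and the surrounding loops stand in for the microprocessor's bookkeeping; the first entry of the list returned by `recognize` is the candidate passed to the decision operation.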
Each data bus has its own 16-bit, 64-word data buffer. The X buffer and YP buffer pair have separate address buses, AX and AYP. A programmable address generator circuit provides the desired addresses. To control the operation of the DSPP, a 16-bit-wide microprogram of 16 words is loaded into program memory on the DSPP. The output microprogram bits fall into the four categories of multiplier control, buffer address generation, buffer read/write control, and program sequence control. The final block of the DSPP is the direct memory access channel, which is used to bring reference feature vector coefficients from template memory into the multiplier via the Y bus.
It is instructive to examine the means by which the DSPP retrieves operands and stores results for the multiplier. The multiplier may be preloaded with a 32-bit value into its P port, may then be loaded with operands into ports X and Y, and will then calculate [...] where $S$, the address displacement between pages of different $j$, is 16. In (19), the address sequence is circular, with each new sample $x_j(n)$ overwriting the oldest sample $x_j(n-1-p)$.
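The paged, circular addressing described above can be illustrated with a small software model. The page displacement S = 16, the order p = 8, and the 64-word buffer size come from the text. The exact slot layout within a page is an assumption, and the names `PagedSampleBuffer` and `accumulate` are hypothetical.

```python
P = 8             # LPC order (from the text)
S = 16            # address displacement between pages of different j (from the text)
BUFFER_WORDS = 64 # size of one DSPP data buffer (from the text)


class PagedSampleBuffer:
    """One buffer holding, for each frame j, the p+1 most recent samples x_j."""

    def __init__(self, frames=3):
        self.mem = [0] * BUFFER_WORDS
        self.head = [0] * frames      # slot of the newest sample within each page

    def write(self, j, x):
        """Store x_j(n); the slot reused is the one that held x_j(n-1-p)."""
        self.head[j] = (self.head[j] + 1) % (P + 1)
        self.mem[j * S + self.head[j]] = x

    def lagged(self, j, m):
        """Return x_j(n-m) for 0 <= m <= p."""
        return self.mem[j * S + (self.head[j] - m) % (P + 1)]


def accumulate(buf, acc, j, x):
    """Add the contribution of new sample x to the lag sums of frame j."""
    buf.write(j, x)
    for m in range(P + 1):
        acc[m] += x * buf.lagged(j, m)
```

Only nine words per frame are ever live, so three pages spaced 16 words apart fit comfortably in a single 64-word buffer.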
In Mode 2, the DSPP performs the calculation indicated by (8). The test feature vector of frame i is first written to the DSPP [...] for $m = 0, 1, \ldots, p$. The global constraints on the warping path define a minimum and maximum reference frame index, j, as a function of test frame index i. These constraints determine the reference frames to which the test frame is compared. The DSPP iterates through the following steps for all allowed j, $j_{\min}(i) \le j \le j_{\max}(i)$: [...]

The DSPP is an independent processor that performs the feature extraction and frame-to-frame similarity measurement about 80 times more rapidly than is possible with the microprocessor alone. The DSPP is the key to using a standard microprocessor system for the intense computation required by digital-based speech recognition. During recognition, the DSPP proceeds at an average speed of about 1 µsec per multiply-add, with each multiply-add consisting of the following steps: 1) calculate X and Y operand addresses, 2) send X and Y operands to the multiplier, 3) calculate the address of the accumulating sum, 4) preload the multiplier with the accumulating sum, 5) multiply X and Y, 6) add to the accumulator, 7) write the result to the appropriate YP buffer, 8) increment the iteration counter.
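The timing figures quoted above can be combined into a rough response-time estimate. The multiply-add count, the time per multiply-add, and the factor of 80 come from the text. Applying that factor to the whole recognition workload is a simplifying assumption, and the function names are hypothetical.

```python
def dspp_recognition_time_s(multiply_adds=300_000, us_per_multiply_add=1.0):
    """Time the DSPP spends on multiply-adds for one word, in seconds."""
    return multiply_adds * us_per_multiply_add * 1e-6


def microprocessor_only_time_s(speedup=80, **kwargs):
    """Assumed time for the same work without the DSPP, in seconds."""
    return speedup * dspp_recognition_time_s(**kwargs)


if __name__ == "__main__":
    print("DSPP:", dspp_recognition_time_s(), "s")
    print("microprocessor alone (assumed):", microprocessor_only_time_s(), "s")
```

With the quoted figures, the DSPP portion comes to roughly 0.3 sec for a 40-word vocabulary, inside the stated response-time goal of under 1 sec, whereas the same arithmetic on the microprocessor alone would take on the order of tens of seconds.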
Conclusion
We have described the hardware implementation of a speaker-trained isolated word recognizer with the following features: 1) recognition achieved by digital processing of the speech waveform according to the principles of minimum prediction residual, 2) recognition algorithm and analysis parameters supported by minicomputer simulations, 3) greater processing speed, smaller size, and lower cost than the minicomputer and array processor used in the simulations, 4) operation over telephone lines, 5) a host processor which is an industry-supported microprocessor, 6) a custom digital peripheral processor whose complexity does not exceed single-chip integration capability.
Although the hardware implementation of a word recognizer could be done more simply using analog feature extraction, by duplicating the algorithm and analysis parameters of a familiar digital system, we maintain the ability to 1) make experimentally valid choices concerning all features of the recognizer, 2) evolve naturally into more advanced recognition tasks based on simulation results, 3) simulate and analyze recognizer performance in unanticipated adverse conditions imposed by practical use, and 4) exploit the ever-increasing ability to achieve large-scale integration of digital circuits.
