Abstract-Time domain harmonic scaling (TDHS) has been realized in real time on the Bell Laboratories digital signal processing (DSP) integrated circuit. It is an algorithm that can expand or compress the bandwidth and sampling rate of speech by taking advantage of the pitch structure in the speech signal. As such it is useful in a variety of speech applications including speech coding, speech enhancement, and rate modification. A single DSP can perform compression and a second DSP can perform expansion. Both operations require pitch information to be supplied with the input speech. Included in the system is a real-time pitch/periodicity detector which has also been implemented on a single DSP. Its design is based on a novel modification of the autocorrelation function type pitch detector. This paper presents details of both the TDHS and pitch detector implementations and discusses their performance. In particular, we discuss a 2:1 compression and expansion system that has been used as part of a 9.6 kbit/s speech coder. TDHS was previously thought to require a much larger buffer than the RAM memory available in the DSP. We show that for all the compression/expansion ratios of interest the buffer size needed is twice the maximum pitch period.
I. INTRODUCTION
TIME domain harmonic scaling (TDHS) is a method of speech bandwidth compression and expansion developed by Malah [1]. It is particularly useful as an element in speech coding systems in which the speech is compressed before coding and expanded back after decoding. For example, a 2:1 TDHS compression followed by expansion only slightly degrades the speech quality for low bit rate coding. By using it in this manner an almost 2:1 reduction in overall bit rate can be achieved without a significant loss in quality. In a previous study it was found that this technique in combination with subband coding could potentially lead to an economical approach for speech coding at a bit rate of 9.6 kbit/s [2]. In this paper we discuss a real-time implementation of the TDHS algorithm based on the Bell Laboratories DSP integrated circuit. It is one component of a 9.6 kbit/s real-time speech coder discussed elsewhere [3], [4]. We also show that the realization of the TDHS algorithm can be accomplished with a minimum buffer size that is twice the maximum pitch period.
In addition to speech coding, TDHS can also be used for enhancement of noisy speech [5] and in rate modification of speech [1]. Thus the implementation of TDHS discussed in this paper can also be applied to these types of applications. As its name implies, TDHS is implemented in the time domain. It is based on pitch synchronous processing and requires pitch measurement for its implementation [6]. Thus, as part of the system we have also implemented a real-time pitch/periodicity detector on a DSP, based on a modified version of the autocorrelation algorithm [7]. This realization can also be applied to a variety of other speech processing applications as well as the TDHS system described here. It is capable of estimating the long-term periodicity of a signal through peak-picking of the autocorrelation function (ACF) of the input signal. It can also supply information concerning values of the autocorrelation function. We refer to this design as a pitch/periodicity detector to distinguish it from the notion of a pitch detector, sometimes used in vocoders, where both a pitch estimate and a voiced/unvoiced decision are made. This design does not include a voiced/unvoiced decision, since it is not needed for the implementation of TDHS.
In this paper we discuss the TDHS algorithm and its implementation on the Bell Laboratories digital signal processing (DSP) integrated circuit [8].
Although the Bell Labs DSP is not available outside the Bell System, several other similar devices are now available [15]-[20] and more are being planned [21]. This paper describes the problems which were confronted for an implementation based on the Bell Labs chip. Both the problems and their solutions are representative of the types of problems encountered with implementations based on these types of devices. The problems are usually due to one or a combination of the following three limitations: limited data (RAM) memory, limited program (ROM) memory, and the limited speed of the device. We have attempted to keep the discussion of our solutions as general as possible in order to make them as instructive as possible for others doing similar types of implementations.
The organization of this paper is as follows. We first discuss the TDHS algorithm and its requirements, then we discuss how these requirements were met using the Bell Labs DSP. Next we discuss the pitch detector algorithm and how it was tailored to fit the requirements of TDHS and to fit onto the DSP. Finally we discuss the performance of both the pitch detector and the overall TDHS system.
II. THE TDHS ALGORITHM
Although TDHS is a time domain technique, it can be interpreted in either the time or the frequency domain. The upper spectrum of Fig. 1 illustrates a voiced speech sound where the pitch harmonics are shown as spectral "teeth." They are not just spectral lines but actually have a finite width. Nevertheless there are gaps between these teeth where there is essentially no frequency content and it is this property that is taken advantage of in TDHS compression. The 2:1 TDHS compression algorithm modulates each "tooth" down in frequency such that its center frequency is one half of its previous value. The width of each tooth remains the same, as illustrated in the lower spectrum of Fig. 1; however, the gaps are now filled in.
The algorithm for implementing this operation for 2:1 compression is realized in a pitch synchronous manner in the time domain as shown in Fig. 2(a). Two pitch periods of the input waveform s(n) are shown. Each period is multiplied by a triangular window where the second window, 1 - w(n), is the complement of the first, w(n), and where each window is one pitch period p in duration. The two windowed periods are then added together to form a single period of compressed speech $s_c(n)$. Note that because of the choice of the two windows, the beginning of the output period in Fig. 2(a) resembles the beginning of the first input period while the end of the output period resembles the end of the second input period. In this way the output speech is continuous across the concatenated output periods (i.e., without end effects).
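The windowing operation just described can be sketched in a few lines of Python. This is only an illustration for a constant pitch period: the function names are hypothetical, and the window is assumed to fall linearly from 1 to 0 in steps of 1/(p - 1).

```python
def triangular_window(length):
    """Linearly decreasing window from 1 to 0 over `length` samples (assumed form)."""
    if length == 1:
        return [1.0]
    return [1.0 - n / (length - 1) for n in range(length)]


def tdhs_compress_2to1(s, p):
    """2:1 compression of the sequence s with a fixed pitch period p (in samples)."""
    w = triangular_window(p)
    out = []
    # Each pass consumes two input periods and emits one output period.
    for start in range(0, len(s) - 2 * p + 1, 2 * p):
        for n in range(p):
            first = s[start + n]         # sample from the first input period
            second = s[start + n + p]    # sample from the second input period
            out.append(w[n] * first + (1.0 - w[n]) * second)
    return out
```

For a perfectly periodic input the two windowed periods are identical, so the output reproduces the same period at half the length of the input. For a general input the first output sample of a block equals the first sample of the first period and the last output sample equals the last sample of the second period.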
When the output speech is played at its original sample rate it has the same pitch and spectrum as the original speech but it occurs at twice the rate (i.e., it sounds like a fast speaker). Since only the pitch property of speech is exploited, other properties such as amplitude nonstationarity and formant structure remain intact. This means that these properties can subsequently be exploited by a speech coder.

periods are used again to produce another two periods of expanded output. The beginning of the first output period resembles the beginning of the second input period and the end of the second output period resembles the end of the second input period. Thus, the second input period is being interpolated into two output periods. As the process continues the third input period will be interpolated into two more output periods, etc.
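A corresponding sketch of 2:1 expansion for a constant pitch period is given below. The function name is hypothetical, and the assignment of w(n) to the newer input samples and 1 - w(n) to the samples one period older is an assumption chosen so that each pair of output periods begins like the start of one input period and ends like the end of that same period, as described above.

```python
def tdhs_expand_2to1(sc, p):
    """2:1 expansion of the sequence sc with a fixed pitch period p (in samples)."""
    length = 2 * p                                    # window spans two output periods
    w = [1.0 - n / (length - 1) for n in range(length)]
    out = []
    # The window advances by one input period per pass and emits two output periods.
    for start in range(p, len(sc) - 2 * p + 1, p):
        for n in range(length):
            newer = sc[start + n]          # input sample at index n of this window
            older = sc[start + n - p]      # input sample one pitch period earlier
            out.append(w[n] * newer + (1.0 - w[n]) * older)
    return out
```

Each input period after the first is stretched into two output periods, so the output is roughly twice as long as the input, and a perfectly periodic input is reproduced unchanged apart from its length.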
A. Mathematical Description and Buffer Sizes
Mathematically, we can express these algorithms as follows. First, consider compression. From a pitch detector we are given the information that the pitch period is p. The formula for $B_c$ samples of the compressed speech is

$$s_c(n) = w(n)\,s(n) + [1 - w(n)]\,s(n + p), \qquad n = 0, 1, \ldots, B_c - 1 \tag{1}$$

where $B_c$ is the length of the window. From [1] we find that for a rational compression ratio $\alpha$ the window length is given by

$$B_c = \frac{p}{\alpha - 1}. \tag{2}$$

Note that if $\alpha$ is less than two, $B_c$ is longer than p and if $\alpha$ is greater than two, $B_c$ is shorter than p. When $B_c$ is longer than p, the compression window and its complement overlap, which means that some points are used twice. If $B_c$ is shorter than p, then there are actually points which will not be used for the interpolation. For the case $\alpha = 2$, the window and its complement just touch and all points are used once, as seen in Fig. 2(a). The formula for a triangular window is given by

$$w(n) = 1 - \frac{n}{B_c - 1}, \qquad n = 0, 1, \ldots, B_c - 1. \tag{3}$$

A simple equation can be used to compute the size of the data buffer needed:

$$\text{buffer size} = (\text{samples in} - \text{samples out}) + p. \tag{4}$$

The term in parentheses represents the net samples accumulated in the buffer while processing a block of $B_c$ samples. Clearly, $\alpha B_c$ samples are input while only $B_c$ samples are output. (The accumulated samples are balanced out at the end of the block when the window is shifted.) The second term p represents the additional samples required for the complementary window, i.e., the second term in (1). When we add these two terms together we get

$$\text{buffer size} = (\alpha - 1)B_c + p = 2p. \tag{5}$$
We get the surprising result that the minimum buffer size needed is the same for all rational compression ratios! Clearly, since p varies, the minimum buffer size is $2p_{\max}$.
We can treat expansion in a similar manner. The formula for $B_e$ samples of expanded speech is

$$\hat{s}(n) = w(n)\,\hat{s}_c(n) + [1 - w(n)]\,\hat{s}_c(n - p), \qquad n = 0, 1, \ldots, B_e - 1 \tag{6}$$

where $\hat{s}(n)$ represents the expanded speech and $\hat{s}_c(n)$ represents the input speech. From [1] the expansion window length $B_e$ is given by

$$B_e = \frac{\alpha p}{\alpha - 1}. \tag{7}$$

Note that since $\alpha$ is always greater than one, $B_e$ is always longer than p. This means there will always be overlap between a window and its complement. In turn this means that input points will be used more than once. The formula for the window is similar to (3) and is given by

$$w(n) = 1 - \frac{n}{B_e - 1}, \qquad n = 0, 1, \ldots, B_e - 1. \tag{8}$$

A simple formula, similar to (4), can be used to compute the buffer size. It is

$$\text{buffer size} = (\text{samples out} - \text{samples in}) + p. \tag{9}$$

This time more samples are output than input. To keep the buffer from underflowing, the buffer must accumulate input samples. Each time the window is shifted this accumulation takes place, and during the processing it is depleted. The number of output samples is $B_e$. Thus, the number of input samples is $B_e/\alpha$. Again the second term represents samples required for the complementary window. Solving for the buffer size we get

$$\text{buffer size} = B_e - \frac{B_e}{\alpha} + p = 2p. \tag{10}$$
Again we get the same surprising result that a buffer size of $2p_{\max}$ is sufficient for any rational expansion ratio. The above results show exactly what the requirements of the TDHS compression and expansion algorithms are: 1) a pitch detector to supply p, 2) a multiplier to implement (1) and (6), and 3) a buffer size of $2p_{\max}$. The pitch detector requirement can be separated from the TDHS algorithm by implementing it separately and then transmitting pitch to the TDHS algorithm in a multiplexed fashion with the input speech data. In computer simulations we found that updating the pitch at least every 64 ms for compression and at least every 80 ms for expansion was required for good performance.
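The buffer size result can be checked numerically with exact rational arithmetic. The sketch below assumes window lengths of p/(α - 1) for compression and αp/(α - 1) for expansion, which are consistent with the properties of the window lengths noted above; the function names are hypothetical.

```python
from fractions import Fraction


def compression_buffer(alpha, p):
    """(samples in - samples out) + p for one compression block."""
    b_c = Fraction(p) / (alpha - 1)          # assumed compression window length
    return (alpha * b_c - b_c) + p


def expansion_buffer(alpha, p):
    """(samples out - samples in) + p for one expansion block."""
    b_e = alpha * Fraction(p) / (alpha - 1)  # assumed expansion window length
    return (b_e - b_e / alpha) + p


for alpha in (Fraction(5, 4), Fraction(3, 2), Fraction(2), Fraction(3)):
    for p in (25, 80, 120):
        print(alpha, p, compression_buffer(alpha, p), expansion_buffer(alpha, p))
```

Every printed buffer size equals 2p regardless of the ratio, so a maximum pitch period of 120 samples leads to a 240-sample buffer.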
Before discussing the implementation of the above algorithm on the DSP it is necessary to understand some of the capabilities and limitations of the DSP. This is accomplished in the next section. The design of the pitch detector has also been accomplished in a DSP and this realization is discussed in Section III.
B. The Bell Laboratories DSP
The Bell Laboratories DSP is a powerful, single chip, programmable processor that is especially suited for performing digital signal processing functions in a stream processing manner [8]. It has an 800 ns machine cycle time which is established by a 5 MHz clock. It contains provisions for a 1024 X 16 bit ROM memory for storage of a program, tables, and coefficients. A 128 X 20 bit RAM memory is available in the processor for the storage of dynamic data and state variables. The DSP has a main arithmetic unit with a multiplier and accumulator for numerical processing and a separate unit for updating address registers. In one machine cycle all of the following operations can be performed: a) decode an instruction, b) fetch and store data in memory, c) perform a multiply and accumulate, and d) round or truncate numbers from the accumulator. There are also a number of instructions for branching, conditional statements, and logic operations as well as instructions for mu-law-to-linear and linear-to-mu-law conversion.

structure that is amenable to the architecture and I/O timing structure of the DSP. In this section we discuss this approach and point out how the TDHS algorithm was adapted to fit within the constraints of the DSP. As discussed in Section II-A the minimum buffer size needed for TDHS is twice the maximum pitch period. For 8 kHz sampled speech the maximum pitch period was assumed to be 120 samples (15 ms or 66.7 Hz), so that a buffer of 240 samples is required. Since the DSP has only 128 RAM locations, it would appear that it cannot be used to implement TDHS. However, since the speech data can be conveniently converted to mu-law format, we can pack two 8 bit mu-law words in each 20 bit RAM location and effectively double the memory capacity of the DSP. In this way only 120 RAM locations are needed for data storage. Conceptually, this takes care of our data storage requirements. We will postpone discussion of the precise implementation of 2:1 mu-law packing until after we discuss how to organize the data storage.
In the discussion of Section II-A we computed the amount of data storage that was necessary, but made no assumptions about its organization. The most convenient organizational model is that of a shift register. Thus we propose the shift register model shown in Fig. 3 as the model for our data storage. There are 240 memory locations in the shift register, numbered from 0 to 239. As each new data sample arrives all previous data are shifted down one location. The new sample goes in location 239 and the data previously in location 0 are discarded. At any instant of time we define the data in the shift register as x(0), x(1), ..., x(239) as illustrated in Fig. 3. In the remainder of this section we will discuss the implementation of TDHS using this system for the data storage. We find that it has many good properties for implementing TDHS.
Consider the case of 2:1 TDHS compression first. If s(n) in (1) is stored in x(k) in the shift register then s(n + p) is in x(k + p). Therefore (1) can be rewritten as

$$s_c(n) = w(n)\,x(k) + [1 - w(n)]\,x(k + p). \tag{11}$$
Before computing the next output point two new data samples arrive. Assuming s(n + 1) was in location k + 1 when we computed $s_c(n)$, it will be in location k - 1 when we compute $s_c(n + 1)$ due to the arrival of the two new input samples. We see, then, that for compression the index k must be decremented by one before each output sample is computed. When the pitch synchronous window is moved to the next block, this corresponds to incrementing k by p - 1. So, before the first output sample for a new pitch period, k is incremented by the previous value of p minus one. For the second and all subsequent output values, k is decremented one output sample at a time. This means that k always has the same value at the start of any period and it will be decremented at most by 119. Therefore k = 119 is the starting value for k for each new output pitch period independent of the new value of pitch p.
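The index bookkeeping above can be checked with a small simulation of the 240-location shift register. The sketch below is illustrative only: it assumes a constant pitch period, a window decrement of 1/(p - 1), and a hypothetical function name, and it primes the register so that s(0) sits in location 119 when the first output is formed.

```python
def compress_with_shift_register(s, p, size=240):
    """Stream 2:1 compression through a shift register (constant pitch p <= size // 2)."""
    half = size // 2
    x = [0.0] * size                  # x[size - 1] always holds the newest sample
    delta = 1.0 / (p - 1)             # assumed window decrement
    out = []

    def push(sample):
        x.pop(0)                      # location 0 is discarded, the rest shift down
        x.append(sample)              # the new sample enters the top location

    arrived = 0
    while arrived < half + 1:         # prime until s(0) sits in location half - 1
        push(s[arrived])
        arrived += 1

    k, w, n = half - 1, 1.0, 0        # n counts outputs inside the current period
    while True:
        out.append(w * x[k] + (1.0 - w) * x[k + p])
        if arrived + 2 > len(s):
            break
        push(s[arrived])              # two new inputs arrive per output sample
        push(s[arrived + 1])
        arrived += 2
        n += 1
        if n == p:                    # window moves to the next block
            k += p - 1
            w, n = 1.0, 0
        else:
            k -= 1
            w -= delta
    return out
```

With this bookkeeping k returns to 119 at the start of every output period, and the streamed output agrees with the direct block formulation of 2:1 compression.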
For each output sample the value of w(n) must also be computed. At the beginning of a new pitch period w(n) = 1 as illustrated in Fig. 2(a). Before each additional output sample it is decremented by $\Delta$ where

$$\Delta = \frac{1}{p - 1}. \tag{12}$$

Then we can express this relation as

$$w(n + 1) = w(n) - \Delta. \tag{13}$$
The value $\Delta$ can be stored in a look-up table so that it does not need to be computed for each new value of p. This avoids the need for a divide operation in the DSP. For future reference we also rewrite (11) in slightly different form as

$$s_c(n) = x(k + p) + w(n)\,[x(k) - x(k + p)]. \tag{14}$$
In this form it can be seen that we only need to store and decrement w(n) and not its complement 1 - w(n).

Next consider the case of 2:1 TDHS expansion. The shift register model for this realization is shown in Fig. 5. If $\hat{s}_c(n)$ is stored in x(k + p) and $\hat{s}_c(n - p)$ is in x(k) we can rewrite (6) as

$$\hat{s}(n) = w(n)\,x(k + p) + [1 - w(n)]\,x(k). \tag{15}$$

For each input sample two output samples must be produced. Depending on whether a new input sample arrives before the next output sample, k may or may not be incremented. That is, k is incremented only before the second output sample produced per input sample. When the end of a window is reached, k is decremented by p - 1 to account for shifting the window. In order to keep the buffer from underflowing it is necessary to begin each new pitch period with the second output in a pair of output points. Since all expansion windows are of even length (2p) it is sufficient to start the algorithm this way. If, instead, the algorithm started with the first output sample, k might need to be incremented 120 times, in which case we would run out of real time for the last sample. In the case of expansion k + p always has the same value, k + p = 120, for the start of each new pitch period. Fig. 6 shows a flowchart for the realization of 2:1 expansion including the timing relationships, incrementing of k, decrementing of w(n), and start up for new windows.
Note that for expansion the window decrement $\Delta$ is

$$\Delta = \frac{1}{2p - 1}.$$
Last, we note that (15) can also be written as

$$\hat{s}(n) = x(k) + w(n)\,[x(k + p) - x(k)] \tag{18}$$

to again avoid decrementing both w(n) and its complement 1 - w(n).
At this point, one more very important aspect of both of these implementations should be pointed out. For both algorithms it is always the case that $0 \le k \le 119$ and $120 \le k + p \le 239$. This implies that one address pointer, k, always addresses the "older" half of the shift register data, while the other pointer, k + p, addresses the "newer" half [see Fig. 7(a)].
Since two 8 bit mu-law samples of data are packed into a single DSP RAM location this leads to a very convenient implementation for a 120 word mu-law packed shift register. In each RAM word we store one 8 bit mu-law sample from the lower half of the shift register and the other 8 bit mu-law sample from the upper half of the shift register. Depending on which pointer is accessing the data, there is never any ambiguity as to which sample is needed. Fig. 7(b) shows how these data are organized. The upper 8 bits of RAM contain the newest data. The middle 8 bits contain the older samples. The bottom 4 bits are not used. As data are shifted out of the high 8 bits they are shifted back around to the low 8 bits as seen in Fig. 7(b).
D. Implementation of the Shift Register Model of TDHS on the DSP by Means of a Cyclic Buffer
In the above discussion we have used a shift register model to simplify the discussion of how the TDHS compression and expansion algorithms can be implemented in a stream processing manner. A direct implementation of the shift register on the DSP, however, would require too much data to be shifted in each sample time.
Fortunately, the equivalent of a shift register can be accomplished in the DSP by using the notion of a cyclic buffer [9] . In this section we discuss this method and its realization on the DSP.
Consider first the 240-word shift register model of Fig. 7(a). An equivalent realization of this model can be accomplished using a fixed 240-word RAM memory by organizing it in terms of the cyclic buffer shown in Fig. 8(a). An offset pointer labeled OFF is used to identify the newest sample in memory (x(239)) and (OFF + 1) mod 240 identifies the oldest sample in memory (x(0)). As each new sample arrives the pointer OFF is incremented by one (modulo 240) and the new data are stored in this location. In this way the newest input sample overwrites the oldest sample in memory. The data in the fixed RAM memory therefore correspond to a circularly rotated version of what appears in the shift register model, with the pointer OFF identifying the amount of rotation or offset. To access a value x(k) we must use the address (OFF + 1 + k) mod 240 in the fixed RAM memory. Thus instead of shifting the data, we are modifying the address pointers.
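The equivalence between the cyclic buffer and the shift register can be illustrated with the following sketch. The class name and method names are hypothetical; only the addressing rule (OFF + 1 + k) mod 240 is taken from the text.

```python
class CyclicBuffer:
    """Fixed RAM with an offset pointer that emulates a shift register."""

    def __init__(self, size=240):
        self.size = size
        self.ram = [0] * size
        self.off = size - 1               # address of the newest sample, x(size - 1)

    def push(self, sample):
        self.off = (self.off + 1) % self.size
        self.ram[self.off] = sample       # overwrites the oldest sample in memory

    def x(self, k):
        """Return shift register element x(k) without moving any data."""
        return self.ram[(self.off + 1 + k) % self.size]
```

Each push writes a single RAM location, whereas a literal shift register would move all 240 values per sample.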
This concept can be extended to the case of the 120 word mu-law packed shift register of Fig. 7(b).
This realization is shown in Fig. 8(b) and it is the one actually used in the DSP for the TDHS algorithm. When a new sample s(n) arrives, the pointer OFF is first incremented by one (modulo 120). The high 8 bits from this memory location are first shifted down to the low 8 bits, i.e., the oldest sample in high memory becomes the newest sample in low memory. The 8 bit mu-law sample s(n) is then written into the high 8 bits of this location.
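A sketch of the packed cyclic buffer follows. The bit positions are an assumption based on the description of Fig. 7(b): the newest sample occupies the top 8 bits of a 20 bit word, the older sample the next 8 bits, and the bottom 4 bits are unused. The class and method names are hypothetical, and the samples are treated simply as 8 bit codes.

```python
class PackedCyclicBuffer:
    """120 words of 20 bits, each holding two 8 bit sample codes."""

    WORDS = 120

    def __init__(self):
        self.ram = [0] * self.WORDS
        self.off = self.WORDS - 1

    def push(self, code):
        """Store one 8 bit code (0..255) as the newest sample."""
        self.off = (self.off + 1) % self.WORDS
        old_high = (self.ram[self.off] >> 12) & 0xFF
        # The old high byte moves down one byte; the new code takes the top byte.
        self.ram[self.off] = (code << 12) | (old_high << 4)

    def x(self, k):
        """Return shift register element x(k), 0 <= k <= 239."""
        word = self.ram[(self.off + 1 + k) % self.WORDS]
        if k >= self.WORDS:
            return (word >> 12) & 0xFF    # newer half lives in the high byte
        return (word >> 4) & 0xFF         # older half lives in the lower byte
```

Because one pointer only ever reads the older half and the other only the newer half, the byte to extract is always known from which pointer is in use.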
To implement the compression and expansion algorithms two more pointers corresponding to k and k + p are needed. They represent the absolute values of the RAM locations used for storing the cyclic buffer. For compression we let IX1 be the counterpart of k. For the start of a new window IX1 is initialized to OFF. After each new output sample is formed, IX1 is simply incremented by 1 (modulo 120). The pointer IX2 is the counterpart of k + p and it is initialized to a value of (OFF + p) mod 120 at the start of each new pitch period.

final flowcharts for the actual DSP implementations using this cyclic buffer approach.
The last important aspect of the DSP implementation is the 2:1 data packing of RAM memory. We used three separate subroutines for dealing with these data. The first subroutine is used to enter new data into memory. It shifts the old data from high memory to low memory and writes the new data into high memory. Our assumption about the format of the input data is different for the two programs. For compression the input data are already in mu-law format. For expansion the input data are linear [3]. The other two subroutines are used for reading data from high and low memory, respectively. The routine for reading high data uses the pointer IX1 and increments it by one (modulo 120) each time a read is made. Similarly, the routine for reading low data uses the pointer IX2 and increments it by one (modulo 120) for each read. In all of these routines it is necessary to use the regular arithmetic unit of the DSP to update the address arithmetic. This is because mod 120 arithmetic cannot be performed with the I, J, or K registers, i.e., pointers IX1, IX2, and OFF are kept in regular RAM memory rather than using I, J, and K [8].

The pitch detector algorithm described in this paper was implemented as part of the 9.6 kbit/s speech coder using TDHS [3], but its applicability is not limited to TDHS. Other potential applications for the pitch/periodicity detector are in pitch predicting implementations of adaptive predictive coders [8] and subband coders [13], as well as in other applications involving harmonic scaling [1]-[5].
The design of the pitch/periodicity detector is based on a novel modification of the autocorrelation function (ACF) type pitch detector.
These modifications were made in part to tailor the form of the algorithm to the memory and real-time constraints of the DSP. In Section III-A we first review the basic ACF algorithm. In Section III-B we discuss the modifications and design parameters of the algorithm which allow it to be realized within the constraints of the DSP. In Section III-C we point out some of the details of this implementation and point out how the features of the DSP have been utilized.
A. The Autocorrelation Function (ACF) Pitch/Periodicity Detection Algorithm
The ACF pitch detector has been shown to be a reliable and time honored method for pitch detection and it has been used in a variety of speech processing systems [10], [11]. It is partly for this reason that this algorithm was selected for implementation on the DSP. A second compelling reason was that the ACF method, with appropriate modifications, is well suited to the computational structure of the DSP and can be fully implemented within the memory and real-time constraints of the DSP.

The input signal, s(n), is first low-pass filtered by the filter h(n) to eliminate the effects of the higher formant structure on the autocorrelation function. Typically, a cutoff frequency of around 1 kHz is used. This will, in general, preserve a sufficient number of pitch harmonics for accurate pitch detection but eliminate the second and higher formants.
The resulting low-pass filtered signal, x(n), is used to compute the ACF estimate $r_n(m)$ where n refers to the sample time of the estimate and m refers to the autocorrelation lag.
Fig. 12. Frequency response of h(n).
The set of values of m, denoted as {m}, over which the ACF is computed determines the range of the pitch/periodicity detector and it generally corresponds to the normal range of the pitch period for human speech. We also refer to $\{r_n(m)\}$ as the set of corresponding ACF estimates. Each element in this set can be expressed as

$$r_n(m) = \sum_{k} f(n - k)\,x(k)\,x(k - m) \tag{19}$$

where f(n) corresponds to the analysis window over which $r_n(m)$ is computed. Fig. 11(b) illustrates the manner in which this computation is often performed. As seen from this diagram, the window f(n) plays the role of a low-pass filter which smoothes the product signal x(n) x(n - m).

Once the set of ACF values $\{r_n(m)\}$ is determined, the pitch period estimate is obtained as the value of m associated with the largest weighted value of $r_n(m)$ in the set [see Fig. 11(a)]. That is, the pitch period is defined as the lag $m_0$ over the range of allowed values {m} such that

$$g(m_0)\,r_n(m_0) \ge g(m)\,r_n(m) \quad \text{for all } m \text{ in } \{m\} \tag{20}$$

where g(m) is a weighting factor. The purpose of this weighting factor is to reduce the possibility of "pitch doubling" errors which may occur in instances where the magnitude of the ACF is larger at a value of m which is twice that of the actual pitch value. Details of this weighting factor are described later.

B. Modifications and Design Parameters of the ACF Algorithm

In Section II-B, the Bell Labs DSP was described together with some of its performance specifications. The primary points worth reiterating about it are that 1) it has a limited amount of RAM memory, 128 locations, 2) its cycle time is 800 ns, 3) its program is limited to 1024 words with instructions requiring either one or two words of ROM, and 4) the device is stream-processing oriented, as opposed to block processing. These are the most important features of the DSP as far as influencing our pitch detector design. In this section we discuss how the ACF pitch/periodicity detection algorithm and its design parameters were modified for realization on the DSP.

The low-pass filter h(n) was realized with an 8 tap finite impulse response (FIR) filter with a cutoff frequency of 1 kHz. Table I gives the coefficient values of this filter and Fig. 12 shows the resulting frequency response for an 8 kHz sampled signal. This filter design performed well and could be realized with a minimal amount of computation.

The computation of the autocorrelation function $r_n(m)$ is performed using a first-order infinite impulse response (IIR) filter f(n) which leads to an exponential window. This window has the form

$$f(n) = \begin{cases} \gamma^{n}, & n \ge 0 \\ 0, & n < 0. \end{cases} \tag{21}$$

It allows the ACF estimates, $r_n(m)$, to be sequentially updated according to the first-order difference equation

$$r_n(m) = \gamma\,r_{n-1}(m) + x(n)\,x(n - m). \tag{22}$$
The choice of $\gamma$ determines the time constant or duration of the window. The form of (22) is sometimes referred to as a "leaky integrator" implementation. Typically a value of the "leak factor" $\gamma = 0.9875$ works well. Fig. 13 shows the resulting impulse response and frequency response for this design (assuming an 8 kHz sampling rate).
The form of (22) leads to an implementation which requires only two multiplies and one addition to update each ACF coefficient. It is considerably more efficient in terms of computation and storage than the conventional block by block FIR (Hamming) window approach often used in computer implementations [10]. Also, since it is a stream processing (sample by sample) procedure it is more conveniently implemented in the DSP. By itself, however, a straightforward implementation of (22) for updating values of $r_n(m)$ for all desired values of m requires more computation and storage than is available in a single DSP.
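As an illustration of the leaky integrator update, the following sketch advances every ACF estimate by one sample using two multiplies and one addition per lag. It is the straightforward form that updates all lags at every sample; the function name and the data layout (a dictionary of estimates and a list of past samples) are choices made here for clarity and are not those of the DSP program.

```python
def update_acf(r, history, x_new, lags, gamma=0.9875):
    """Advance every ACF estimate by one sample.

    r        dict mapping each lag m to its current estimate
    history  list of past filtered samples, newest last (modified in place)
    """
    history.append(x_new)
    for m in lags:
        # x(n - m); samples before the start of the signal are taken as zero
        delayed = history[-1 - m] if len(history) > m else 0.0
        r[m] = gamma * r[m] + x_new * delayed
    return r


lags = range(25, 121)
r = {m: 0.0 for m in lags}
history = []
for n in range(400):
    update_acf(r, history, 1.0 if n % 40 == 0 else 0.0, lags)
```

For this impulse train with a period of 40 samples, the largest estimate occurs at the lag m = 40, with smaller peaks at the multiples 80 and 120.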
To further reduce the amount of required computation and storage the above method of computing the ACF coefficients was modified so that each coefficient is updated only once every Qth sample. Equation (22) then takes the form

$$r_n(m) = \gamma\,r_{n-Q}(m) + x(n)\,x(n - m). \tag{23}$$

Besides reducing computation, the form of (23) can also be applied in a novel way to reduce the amount of required storage of the delayed signal values x(n - m). This is achieved by interleaving, in time, the order in which the values $r_n(m)$ are computed. Thus at each sample time n only one out of every Q values of $r_n(m)$ is updated. We can formalize this concept by dividing the time n into blocks of Q samples or cycles,

$$n = Qd + q, \qquad q = 0, 1, \ldots, Q - 1 \tag{24}$$

and by updating in cycle q only those lags of the form

$$m = Qc + q. \tag{25}$$

From (24) and (25) it is seen that

$$x(n - m) = x(Qd + q - Qc - q) = x(Q(d - c)) \tag{26}$$
which implies that only every Qth value of x(n) is required to be stored in order to compute the necessary values of $r_n(m)$. This results in a savings factor of Q in storing the delayed values of x(n - m). (Recall that a value of Q = 4 was used in the present design.) Another key factor which determines the amount of computation required in the pitch/periodicity detector is the range and resolution of pitch desired. For many applications, especially in speech coding and compression, a b bit quantization of the pitch period is desired. Thus a total of $2^b$ elements are necessary in the set $\{r_n(m)\}$ and one codeword can be assigned to each element of the set for quantization. Typically a b = 6 bit quantization is sufficient for many types of coding applications. In the 9.6 kbit/s coder mentioned earlier, a 6 bit pitch estimate, updated every 10 ms, was found to give good results [3]. Thus a 6 bit quantization was chosen for the design of the pitch detector, giving 64 elements in the set $\{r_n(m)\}$.
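The storage saving from interleaving can be verified with a short sketch. It assumes that at sample time n only the lags m with the same remainder modulo Q as n are updated; the function names are hypothetical.

```python
Q = 4


def lags_for_cycle(q, lags):
    """Lags updated at sample times n with n mod Q == q (assumed schedule)."""
    return [m for m in lags if m % Q == q]


def delayed_indices(n, lags):
    """Sample indices n - m that are needed at time n under this schedule."""
    return [n - m for m in lags_for_cycle(n % Q, lags)]


lags = range(25, 121)
needed = {i for n in range(200, 240) for i in delayed_indices(n, lags)}
print(all(i % Q == 0 for i in needed))
```

Every delayed sample that is ever requested has an index that is a multiple of Q, so only one out of every Q values of x(n) has to be retained, while each lag is still visited once in every block of Q samples.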
The pitch/periodicity detector was designed to handle a range of pitch frequencies from 66.7-320 Hz. This spans the range for most speakers and it corresponds to a range of values of m from 25-120 (assuming an 8 kHz sampling rate).
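In informal listening (based on TDHS) we determined that for female speakers with high pitch, i.e., short pitch periods, a finer resolution of the pitch period is needed than for speakers with low pitch. The set {m} therefore contains every lag from 25 to 56 and only the even lags from 58 to 120,

$$\{m\} = \{25, 26, \ldots, 56, 58, 60, \ldots, 120\}. \tag{27}$$

In this way a more uniform performance across the full range of speakers is obtained.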
A final consideration in the design of the pitch/periodicity detection algorithm is the choice of the weighting function g(m) in the peak picking process. As pointed out earlier, the purpose of this weighting factor is to reduce the possibility of "pitch doubling" errors. In our design a weighting factor of g(m) = 1.0 was used for values of m less than 40, and an exponentially decreasing value of g(m), from 1.0 to 0.8, was used for values of m from 40 to 120. This function is plotted in Fig. 14. In the next section we consider the details of this implementation on the DSP.
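One weighting function with the stated end points is sketched below. The exact exponential law is an assumption made here for illustration; only the values g(m) = 1.0 below m = 40 and the decay from 1.0 at m = 40 to 0.8 at m = 120 are taken from the text.

```python
def g(m):
    """Assumed weighting: flat below m = 40, exponential decay to 0.8 at m = 120."""
    if m < 40:
        return 1.0
    return 0.8 ** ((m - 40) / 80.0)


# A hypothetical case in which the raw ACF is slightly larger at twice the true lag.
acf = {50: 1.00, 100: 1.05}
unweighted_pick = max(acf, key=acf.get)
weighted_pick = max(acf, key=lambda m: g(m) * acf[m])
print(unweighted_pick, weighted_pick)
```

In this example the unweighted maximum lies at the doubled lag m = 100, while the weighted comparison selects m = 50.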
C. Implementation of the Modified ACF Algorithm on the DSP
In the previous section, we considered several modifications to the ACF algorithm which led to a reduction in the amount of computation and storage in its implementation. In this section we discuss how this modified algorithm was further structured to fit the architecture of the DSP and point out some of the software techniques that were used.
The most time-consuming computation in the algorithm is the ACF update and the peak picking. In the course of 4 sample times (Q = 4) a total of 64 ACF updates and peak picking decisions must be computed. Also, for each input sample the 8 tap FIR filter h(n) must be computed. On the DSP this requires 13 instructions per sample. Assuming an 8 kHz sampling rate and an 800 ns instruction time, it can be determined that a maximum average of 8.9 DSP instructions are available for each ACF update and peak picking decision (the remainder is required for the filter h(n)). The manner in which this is performed is discussed below.
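The instruction budget quoted above follows from simple arithmetic, reproduced here as a short sketch. The function and parameter names are hypothetical; the numerical values are those given in the text.

```python
def instructions_per_acf_update(sample_period_ns=125_000, cycle_ns=800,
                                q=4, filter_instr=13, num_lags=64):
    """Average instructions available per ACF update and peak picking decision."""
    per_sample = sample_period_ns / cycle_ns       # 156.25 instructions per sample
    total = q * per_sample                         # instructions in Q sample times
    available = total - q * filter_instr           # minus the FIR filter cost
    return available / num_lags


print(instructions_per_acf_update())
```

The result is 573/64, or about 8.95 instructions, which matches the figure of 8.9 quoted above.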
Recall that the equations that must be computed are
and the index $m_0$ must be found such that

The order in which the set of $r_n(m)$ values is computed is not sequential with m. Also, because of the manner in which the shift register for values of x(n - m) must be updated, it is more convenient to compute and peak pick values of $r_n(m)$ in decreasing order of m. The shift register can then be addressed sequentially from right to left. The array of ACF coefficients can also be addressed sequentially through the course of 4 cycles by storing them in a "scrambled" order, i.e., the order in which they are actually computed. If we define the address location of this array as extending from k = 0 to k = 63 then the values of m are stored in the "scrambled" order shown in Table II. It can be observed from this table that more computation is required for cycles 0 and 2, which contain even values of m. This is because odd values from 57 to 119 are not included in the set {m} defined by (27). Table II also defines the manner in which the pitch is quantized. This is trivially accomplished by associating the 6 bit codeword with the address of the table. The table then serves as the quantization "code book" to translate between code words and pitch values.
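The uneven load across cycles can be seen by grouping the 64 lags by cycle. The sketch below assumes that cycle q handles the lags with m mod 4 = q and lists them in decreasing order of m; it does not claim to reproduce the actual address order of Table II, and the names are hypothetical.

```python
Q = 4
# All lags from 25 to 56, then only the even lags from 58 to 120.
LAGS = list(range(25, 57)) + list(range(58, 121, 2))


def lags_in_cycle(q):
    """Lags handled in cycle q, in decreasing order of m (assumed grouping)."""
    return sorted((m for m in LAGS if m % Q == q), reverse=True)


for q in range(Q):
    print(q, len(lags_in_cycle(q)))
```

Under this grouping cycles 0 and 2 each carry 24 updates and cycles 1 and 3 carry 8 each, for a total of 64.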
With the above framework we can now define how the ACE update is computed in the DSP. Two DSP registers for ac- The computation of the peak picking of Fn(m) in (29) and (30) is more subtle. As each value of Fn(m) is updated it must be compared with the largest weighted ACF value found so far. This running estimate of the maximum will be denoted as F* and its address location (in Table 11 ) will be denoted as k*. Then if the inequality is true the following two operations must be performed
where k is the address location in Table I1 associated with the ACF lag m.
An equivalent condition to that of (31) which turns out to be more efficient to compute in the DSP is given by comparing r,(m) to the inverse weighted value of r*, i.e., where (33) (34)
The form of (33) suggests that the weighted maximum value, r*/g(m), can be obtained from the previously computed value, Based on (32)-(35), the DSP instructions discussed above for the ACE update can be amended to include the peak picking decision. The architecture of the DSP allows us to perform these instructions as well as those for updating r,(m) in a total of only 8 instructions. This includes automatically setting the pointers for updating the next lag.
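The equivalence between comparing the weighted ACF value with the running maximum and comparing the unweighted value with the inverse weighted maximum can be illustrated with a short sketch. This is not the DSP code: the weights g below are hypothetical positive values chosen only for the demonstration, the test data are random, the arrays are indexed by table address k rather than by lag m, and exact rational arithmetic is used so that the two forms agree without rounding effects.

```python
from fractions import Fraction
import random

def peak_pick_weighted(r, g):
    """Direct form: compare g[k] * r[k] with the running weighted maximum."""
    best_k, best_val = 0, g[0] * r[0]
    for k in range(1, len(r)):
        weighted = g[k] * r[k]
        if weighted > best_val:
            best_k, best_val = k, weighted
    return best_k

def peak_pick_inverse(r, g):
    """Equivalent form: compare r[k] with best_val / g[k] (requires g[k] > 0)."""
    best_k, best_val = 0, g[0] * r[0]
    for k in range(1, len(r)):
        if r[k] > best_val / g[k]:
            best_k, best_val = k, g[k] * r[k]
    return best_k

random.seed(0)
r = [random.randint(-1000, 1000) for _ in range(64)]   # stand-in ACF values
g = [Fraction(200 - k, 200) for k in range(64)]        # hypothetical weights, all > 0

k_direct = peak_pick_weighted(r, g)
k_inverse = peak_pick_inverse(r, g)
print(k_direct, k_inverse)
```

Both functions select the same address because dividing both sides of the comparison by a positive weight does not change its outcome. In this sketch the inverse form spends its multiplication only when a new maximum is found, whereas the direct form multiplies for every lag.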
In the implementation of the pitch/periodicity detector the set of 8 instructions is effectively duplicated 64 times in software with a different value of k in each set. This avoids the use of a "tight loop" which would require several extra instructions per ACF update and peak picking decision. At the beginning of each cycle the value of r*/g(m) is appropriately rescaled for the new value of m. A new input sample s(n) is obtained and low-pass filtered to give a new value of x(n).
The main (and only) loop of the computation occurs after 4 cycles of processing are completed. After the fourth cycle all values of the ACF have been updated and the K register in the DSP (which contains the value k) contains the address location of the maximum (weighted) ACF coefficient. Since the contents of the K register cannot be read directly, it is Table II or it can be a table of the "codewords" from 0-63 if a 6-bit quantization of pitch is desired. A new estimate of pitch or quantized pitch, therefore, is available after every four input samples of speech.
With the above approach, the pitch/periodicity detection algorithm "just fits" within the real-time processing capability of the DSP. For an 8 kHz sampling rate the DSP remains idle less than 2 percent of the time. The 8 instructions shown above have also been carefully chosen so that the maximum number of normal arithmetic instructions are used in the form which requires only one ROM location for storage rather than two. The total program occupies in excess of 1000 ROM locations out of the 1024 locations available. Thus it essentially fully utilizes the program memory capabilities of the DSP as well as its real-time capabilities. Finally, the program requires a total of 118 out of the 128 available RAM locations in the DSP.
IV. PERFORMANCE
In this section we first describe the performance of the pitch/periodicity detector and then the performance of the entire system.
A. Performance of the Pitch/Periodicity Detector
The performance of the pitch/periodicity detector was evaluated in two ways. First, it was compared against several other well-known pitch detectors on the same speech files. Second, it was evaluated in conjunction with the time domain harmonic scaling system.
In the first approach the above algorithm was compared with the Gold-Rabiner parallel processing pitch detector [10] sons for a female and a male speaker, respectively, where the pitch period estimates are plotted as a function of time. The speech was obtained from a high quality microphone. In the voiced regions for the female speaker the performance is very similar for all three methods. For the male speaker the results of the above algorithm occasionally differ by one sample since the pitch period in this range is quantized with a two-sample resolution (see Table II) whereas the parallel processing and homomorphic algorithms had a one-sample resolution in this range. No other significant errors are seen except for an occasional pitch doubling. The three pitch detection algorithms were tested and compared for telephone speech, for speech in high noise environments, and for several simultaneous speakers. In all of these conditions the above modified ACF algorithm performed well in comparison to the other two algorithms. Its main failing is that it is slightly more prone to pitch doubling or tripling errors. These types of error, however, are not disastrous in applications such as time domain harmonic scaling or pitch predictive coding. For example, when pitch doubling occurs, 4 pitch periods are combined into two, which is not a severe degradation. The amount of high-pass filtering done to the speech influences the rate of such errors. Generally, the more low frequency content that has been removed (e.g., the loss of the fundamental harmonic), the greater the number of pitch doubling or tripling errors.
The performance of the modified ACF algorithm in the TDHS application was very similar to that of the parallel processing and homomorphic algorithms. Also, the performance of the real-time DSP implementation agrees very closely with that of computer simulations.
One critical parameter that is important to the time domain harmonic scaling algorithm is the delay of the pitch detector. This delay is about 76 samples (for an 8 kHz sampling rate) for the modified ACF algorithm. The instantaneous delay depends on the pitch of the speaker. High pitched speakers have shorter periods and therefore less delay. Low pitched voices have more delay. This delay must be accounted for in the implementation of TDHS so that the pitch is properly aligned with the data.
B. Performance of the TDHS System
In terms of quality the real-time DSP version of TDHS matches the quality of the computer simulations [2]. The degradations introduced by TDHS compression/expansion are generally perceived as a form of reverberance. The most critical element in any TDHS system is the pitch detector. A poor pitch detector degrades TDHS quality considerably. As described above our pitch detector is quite sufficient. The next most critical aspect is the proper alignment of pitch with the data. For both expansion and compression the pitch should be delayed about 120 samples, or half the buffer. This means that for expansion the delay is twice that of compression since the input sample rate is only one half as much. Improper pitch alignment is perceived as hoarseness.
The delay of the system is about 120 samples plus p for compression and twice that for expansion. Assuming an average pitch period of 55 samples for all speakers, this gives a total delay of 525 samples or about 65 ms. Any coder delay would be double the normal amount because of the reduced sampling rate of the input to the coder.
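The total delay figure can be reproduced from the numbers given above. The following sketch assumes that p denotes the pitch period in samples and that the total is the sum of the compression and expansion delays; the variable names are illustrative only.

```python
# End-to-end TDHS delay, using the figures quoted in the text.
SAMPLE_RATE_HZ = 8000
ALIGNMENT_DELAY = 120    # samples, about half the buffer
AVG_PITCH_PERIOD = 55    # samples, average pitch period assumed in the text

compression_delay = ALIGNMENT_DELAY + AVG_PITCH_PERIOD   # 120 + p
expansion_delay = 2 * compression_delay                  # twice the compression delay
total_delay = compression_delay + expansion_delay
total_delay_ms = 1000.0 * total_delay / SAMPLE_RATE_HZ

print(total_delay, "samples,", round(total_delay_ms, 1), "ms")
```

With these values the compressor contributes 175 samples and the expander 350 samples, giving 525 samples, or 65.6 ms at an 8 kHz sampling rate.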
In a recent study involving TDHS processing of telephone network speech (i.e., speech that is band limited to 200-3200 Hz and preemphasized) it was found that the perceived reverberance due to the TDHS processing was more noticeable than that for high quality microphone speech [4] , [14] . This implies that some caution must be exercised in applications involving a direct tandem connection of TDHS with the telephone network environment.
Later we will show how this effect can be mitigated to some extent. For applications where the characteristics of the transducer can be controlled, this poses no problem.
The reason for the increased reverberance for telephone network speech was found to be a consequence of the strong preemphasis due to the 500 type telephone set specifications and the tight bandpass filtering (200-3200 Hz) of the D-channel bank. Fig. 18 shows the resulting frequency response of the network environment [14]. This can be compared to the flat response of a high quality microphone.
The network frequency response of Fig. 18 has two effects on the perceived quality of TDHS processing. Both effects stem from the fact that TDHS depends strongly on an accurate pitch measurement for its performance as well as the assumption that voiced speech can be modeled as a pseudoperiodic signal. For example, if Δf represents the error in the measurement of the fundamental pitch harmonic (due to measurement error or quantization of the pitch), the nth harmonic will have an error of nΔf. This means that TDHS processing will always degrade the high frequency regions of speech more than the low frequency regions. For high quality microphone speech this high frequency reverberance is not very apparent because for voiced sounds the high frequency regions are relatively low in amplitude. Additionally, the strong energy in the first formant helps to perceptually mask any degradations in the high frequency regions. During unvoiced regions the noise-like character of speech tends to mask most of the effects of TDHS processing.
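The linear growth of the harmonic error with harmonic number can be made concrete with a small numerical sketch. The 55-sample pitch period and the one-sample period error used below are illustrative assumptions, not measured values, and the function name is ours.

```python
def harmonic_error_hz(sample_rate_hz, period_samples, period_error_samples, n):
    """Frequency error of the nth harmonic when the pitch period estimate is off."""
    f0_true = sample_rate_hz / period_samples
    f0_est = sample_rate_hz / (period_samples + period_error_samples)
    delta_f = abs(f0_true - f0_est)   # error in the fundamental
    return n * delta_f                # the error scales with the harmonic number

# Illustrative case: 8 kHz sampling, 55-sample period, one-sample period error.
for n in (1, 5, 10, 20):
    err = harmonic_error_hz(8000, 55, 1, n)
    print(f"harmonic {n:2d}: error {err:6.2f} Hz")
```

For this case the fundamental is in error by only a few hertz, while the 20th harmonic, which lies near the top of the telephone band, is in error by twenty times that amount.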
The effect of the telephone preemphasis in Fig. 18 is to amplify the high frequency region of the speech by 15-20 dB relative to the low frequency region. This also amplifies the high frequency reverberance of TDHS. The high-pass filtering from 0-200 Hz removes most of the fundamental harmonic of the speech and consequently reduces the masking process. Thus both effects contribute to enhancing the perceived reverberance of the TDHS processing. The problem is therefore one of perception rather than a failure of the TDHS algorithm. We found that if high quality microphone speech is filtered with the same filter response as that of Fig. 18, either before or after TDHS processing, we get the same effect as with telephone network speech.
The effects of the preemphasis in Fig. 18 on TDHS processing can be mitigated to some degree by undoing some of the preemphasis. It cannot restore the fundamental harmonic, however, which is removed by 0-200 Hz high-pass filtering. Thus the perceived reverberance can be reduced in amplitude (along with the amplitude of the high frequency content of the speech) but the masking effects due to the fundamental harmonic cannot be restored. The deemphasis filter used in the DSP realization was a simple first-order filter with the difference equation is the input signal of the deemphasis filter. This filter can be inserted or removed in the DSP realization by controlling one of the input flags to the DSP with a switch.
A final observation that we have made is that TDHS appears to perform slightly better with electret microphone speech than carbon button microphone speech. We speculate that this is due to the increased harmonic distortion of the carbon button microphone over that of the electret.
V. CONCLUSIONS
A realization of the TDHS algorithm for bandwidth expansion or compression of speech on the Bell Labs DSP integrated circuit has been described. It is accomplished through the use of double packing the DSP RAM data storage in order to obtain sufficient dynamic memory. Each component of the TDHS system uses one DSP, i.e., one DSP for the pitch/periodicity detector, one DSP for the TDHS compressor, and one DSP for the TDHS expander. This system can be combined with a variety of coders in order to obtain different approaches to low bit rate speech coding [2]. Specifically, it has been used with subband coding for a real-time speech coder at 9.6 kbits/s [3]. It also has potential for use in speech enhancement systems [5].
We have also described the design and implementation of a pitch/periodicity detector based on a modification of the ACF algorithm. The significance of this result is that it is realized in a single programmable integrated circuit. The pitch detector does not include a voiced/unvoiced decision; however, this is not required in applications such as waveform coding or time domain harmonic scaling.
In describing our work we have discussed some of the limitations of the DSP chip. The three fundamental limitations are data memory size, program memory size, and instruction execution time. These limitations caused problems which we solved and have discussed here. The problems are representative of the types of problems which would be encountered with all DSP-type chips. Their solutions should be instructive for others planning similar implementations.
