Abstract - This paper addresses the problem of improving the execution performance of saturated reduction loops on fixed-point instruction-level parallel Digital Signal Processors (DSPs). We first introduce "bit-exact" transformations that are suitable for use in the ETSI and the ITU speech coding applications. We then present "approximate" transformations, the relative precision of which we are able to compare. Our main results rely on the properties of saturated arithmetic.
INTRODUCTION
The latest generation of fixed-point Digital Signal Processors (DSPs), which comprises the Texas Instruments C6200 and C6400 series [8], the StarCore SC140 [6], and the STMicroelectronics ST120 [7], relies on the exploitation of instruction-level parallelism and on fractional arithmetic support in order to deliver high performance at low cost in telecommunication and mobile applications.
In particular, the instruction set of these so-called DSP-MCUs [3] is tailored to the efficient implementation of standardized speech coding algorithms, such as those published by the ITU (International Telecommunication Union) [5] and the ETSI (European Telecommunications Standards Institute) [4] for Voice over Network (VoN), H.324, and H.323 applications.
In the ETSI and the ITU reference implementations of these algorithms, data are stored as 16-bit integers under the Q1.15 fractional representation, and are operated on as 32-bit integers under the Q1.31 fractional representation.
The Qn.m fractional representation interprets an (n + m)-bit integer x as a fixed-point number with n integer bits, sign included, and m fractional bits [3]. The product of two Q1.15 numbers is a 32-bit Q2.30 number, which is first shifted left by one bit in order to align the fractional point. The result is a 33-bit Q2.31 number, which is saturated back to a 32-bit Q1.31 number. The value resulting from saturate(a[i] * y[i] << 1) is then added to s to yield a 33-bit Q2.31 number, which is saturated again to 32 bits, yielding the new value of s under the Q1.31 representation.
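The sequence of operations described above can be sketched in portable C as follows. This is a minimal illustration, not the ETSI / ITU code: the names saturate32 and frac_mac are hypothetical, and a 64-bit intermediate stands in for the 33-bit Q2.31 values.

```c
#include <stdint.h>

/* Hypothetical helper: clamp a wide intermediate value to the 32-bit
   range, that is, bring a Q2.31 value back to Q1.31. */
int32_t saturate32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Hypothetical fractional MAC: saturate(s + saturate(a * y << 1)).
   a and y are Q1.15 operands, s is a Q1.31 accumulator. */
int32_t frac_mac(int32_t s, int16_t a, int16_t y)
{
    int64_t product = (int64_t)a * y;           /* Q2.30, fits in 32 bits */
    int32_t aligned = saturate32(product * 2);  /* shift left by one, then saturate */
    return saturate32((int64_t)s + aligned);    /* saturated accumulation */
}
```

With these definitions, frac_mac(0, 16384, 16384) yields 536870912, that is, 0.5 * 0.5 = 0.25 in Q1.31, while frac_mac(0, -32768, -32768) saturates to INT32_MAX, since (-1) * (-1) = +1 is not representable in Q1.31.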
In order to efficiently execute speech coding applications under the requirement of "bit-exactness" with the ETSI and the ITU reference implementations, a DSP must support fractional arithmetic in its instruction set. As a matter of fact, today's leading fixed-point DSPs offer multiply-accumulate (MAC) instructions, including fractional multiply-accumulates that compute saturate(s + saturate(a[i] * y[i] << 1)).
On the new DSP-MCUs, exposing more instruction-level parallelism than a single MAC per cycle is required in order to reach peak performance. Indeed, these processors are able to execute two (TI C6200, STM ST120) to four (TI C6400, StarCore SC140) MACs per cycle. Unfortunately, saturated additive reductions are neither commutative nor associative, unlike modular integer additive reductions. Therefore, a main issue when optimizing the ETSI and the ITU reference implementations for a particular DSP-MCU is to expose instruction-level parallelism in saturated reduction loops.
When porting an ETSI / ITU reference implementation to a particular DSP, the first step is to redefine the basic operators as intrinsic functions, that is, functions known to the target C compiler and inlined into one or a few instructions of the target processor. Efficient inlining is challenging for the compiler, as virtually all the ETSI / ITU basic operators have a side-effect on a C global variable named Overflow, which is set whenever saturate effectively saturates its argument. Compiler data-flow analysis is used to isolate the reductions whose side-effects can be safely ignored.
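The order sensitivity of saturated reductions, and the side-effect on Overflow, can be observed on a three-term example. The sketch below is only an illustration: sat_add and sat_sum are hypothetical names, and the Overflow handling is a simplified model of the behaviour described in the text.

```c
#include <stdint.h>

/* Global flag mirroring the Overflow variable described in the text. */
int Overflow = 0;

/* Hypothetical saturated 32-bit addition that records saturation. */
int32_t sat_add(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + b;
    if (sum > INT32_MAX) { Overflow = 1; return INT32_MAX; }
    if (sum < INT32_MIN) { Overflow = 1; return INT32_MIN; }
    return (int32_t)sum;
}

/* Saturated reduction of n terms, accumulated strictly in order. */
int32_t sat_sum(const int32_t *t, int n)
{
    int32_t s = 0;
    for (int i = 0; i < n; i++)
        s = sat_add(s, t[i]);
    return s;
}
```

For the terms {INT32_MAX, 1, -1}, the in-order reduction saturates at the second step and returns INT32_MAX - 1, whereas the reordered terms {1, -1, INT32_MAX} return INT32_MAX without setting Overflow. Regrouping or reordering the terms therefore changes both the result and the side-effect.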
Once efficient inlining of the ETSI / ITU basic operators is achieved, the performance bottlenecks are identified in order to trigger more aggressive compiler optimizations. On the ETSI / ITU speech coding algorithms, many of these bottlenecks involve saturated reduction loops. Typical examples of such loops, from the EFR-5.1.0 vocoder, are illustrated in figure 2. In these cases, parallel execution can be achieved without introducing any overhead, thanks to the unroll-and-jam compiler optimization.
Unroll-and-jam [1] can be described as outer loop unrolling, followed by the fusion of the resulting inner loops. The main issues with this transformation are checking the validity of the inner loop fusion, and dealing with iteration bounds of the inner loop that are outer loop variant. Unroll-and-jam of the codes of figure 2 is illustrated in figure 3.
[Figure 3: unroll-and-jam of the (residu) and (convolve) codes.]
In the residu code, the compiler Inter-Procedural Analysis (IPA) infers that lg is even, and that y does not alias a or x. Likewise in the convolve code, the IPA infers that L is even, and that y is alias-free. Thanks to this information, remainder code for the outer loop unrolling is avoided, and the inner loop fusion is found legal. Unlike residu, the convolve loop nest has a triangular iteration domain, so its inner loop fusion generates residual computations.
Unroll-and-jam of saturated reduction loops is not always an option. Either the memory dependencies carried by the outer loop prevent the fusion of the inner loops after unrolling, as in the synfilt code found in the ETSI EFR-5.1.0 and the ETSI AMR-NB, or there are no outer loops suitable for unrolling. In such cases, parallel execution can still be achieved by using the arithmetic properties of the saturated additive reduction.
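To make the transformation concrete, the following sketch applies unroll-and-jam by two to a simplified filter loop nest. It is not the EFR source: the function names, the filter order M, and the x[n + i] indexing are assumptions chosen to keep the example self-contained. As in the text, the unrolled version assumes that lg is even and that y aliases neither a nor x.

```c
#include <stdint.h>

#define M 10   /* hypothetical filter order */

int32_t saturate32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Fractional MAC: saturate(s + saturate(a * x << 1)). */
int32_t frac_mac(int32_t s, int16_t a, int16_t x)
{
    return saturate32((int64_t)s + saturate32((int64_t)a * x * 2));
}

/* Upper 16 bits of a 32-bit accumulator (assumes arithmetic right shift). */
int16_t high_half(int32_t s)
{
    return (int16_t)(s >> 16);
}

/* Original nest: one saturated reduction per output; x holds lg + M samples. */
void filter_ref(const int16_t *a, const int16_t *x, int16_t *y, int lg)
{
    for (int n = 0; n < lg; n++) {
        int32_t s = 0;
        for (int i = 0; i <= M; i++)
            s = frac_mac(s, a[i], x[n + i]);
        y[n] = high_half(s);
    }
}

/* Unroll-and-jam by two: outer loop unrolled, inner loops fused. */
void filter_uaj(const int16_t *a, const int16_t *x, int16_t *y, int lg)
{
    for (int n = 0; n < lg; n += 2) {
        int32_t s_0 = 0, s_1 = 0;
        for (int i = 0; i <= M; i++) {
            /* Two independent saturated chains; each keeps its own order. */
            s_0 = frac_mac(s_0, a[i], x[n + i]);
            s_1 = frac_mac(s_1, a[i], x[n + 1 + i]);
        }
        y[n + 0] = high_half(s_0);
        y[n + 1] = high_half(s_1);
    }
}

/* Returns 1 when both versions agree on a small input that saturates. */
int uaj_matches_ref(void)
{
    enum { LG = 8 };
    int16_t a[M + 1], x[LG + M], y0[LG], y1[LG];
    for (int i = 0; i <= M; i++) a[i] = 32767;
    for (int i = 0; i < LG + M; i++) x[i] = (int16_t)(i % 3 ? 32767 : -20000);
    filter_ref(a, x, y0, LG);
    filter_uaj(a, x, y1, LG);
    for (int n = 0; n < LG; n++)
        if (y0[n] != y1[n]) return 0;
    return 1;
}
```

Because each accumulator still sees its terms in the original order, the result stays bit-exact; the parallelism comes from the two chains being independent of each other.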

Exploitation of a 4-MAC DSP with 40-bit Accumulators
We now introduce a first technique, based on the arithmetic properties of the saturated additive reduction, that enables the "bit-exact" parallel execution of saturated reductions. This technique requires a DSP that executes four MACs per cycle to achieve an effective throughput of two iterations of the original saturated reduction loop per cycle.
Exploitation of a 4-MAC DSP with 32-bit Accumulators
One problem with the method of section 1.2 is that it requires a target DSP that computes three saturated 32-bit MACs, plus one non-saturated 40-bit MAC, per cycle. In this section, we show that Program 5 in figure 7 is bit-exact. The corresponding C code in Program 6 achieves an effective throughput of two iterations of the original saturated reduction loop per cycle on a 4-MAC DSP, and only requires 32-bit accumulations: three with saturation, and one that uses 32-bit modular integer arithmetic denoted +.
Equation (9) is implied by (8), while (10) holds because z and s ∈ [min32, max32].
Since S, and s, are computed the same way, except for the modular 32-bit addition in case of s,, we are left with only three possibilities: Proof identical to the proof of Corollary 1.
min, -min32, thus reducing to case (7).
-s > min, -min32 are exclusive.
□
APPROXIMATE PARALLEL REDUCTIONS
Problem Statement and Notations
Section 1 illustrates that satisfying the "bit-exact" requirements when unroll-and-jam does not apply wastes computational resources: four parallel MACs are required in order to run twice as fast as the original saturated reduction. In this section, we discuss several "approximate" transformations of the saturated reductions, that are suited to DSPs fitted with only two parallel MACs. All these transformations run twice as fast as the original saturated reduction, but more or less approximate the "bit-exact" result.
The approximate parallel saturated reductions discussed are:
The common theme of these approximate algorithms is to split the reduction into sub-reductions, which are computed in parallel and combined at the end using a saturated addition. When using non-saturated arithmetic (modular integer arithmetic), overflows must be avoided by using wider precision arithmetic, such as 40-bit integers on the new DSP-MCUs. Another difference between approximate algorithms is that some of them expose spatial locality of memory accesses, such as S3 and S4. On the new DSP-MCUs, spatial locality enables memory access packing, that is, loading or storing a pair of 16-bit numbers as a single 32-bit memory access.
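The common theme described above can be sketched for the interleaved variant, where two saturated sub-reductions over the even and odd terms are combined by a final saturated addition. The function names are hypothetical, and the terms t[i] stand for already saturated products; this is an illustration under those assumptions rather than the exact code of any of the algorithms discussed.

```c
#include <stdint.h>

int32_t sat_add(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + b;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Reference: in-order saturated reduction of n terms. */
int32_t sat_reduce(const int32_t *t, int n)
{
    int32_t s = 0;
    for (int i = 0; i < n; i++)
        s = sat_add(s, t[i]);
    return s;
}

/* Approximation: two interleaved saturated sub-reductions (n even),
   combined at the end by a saturated addition. */
int32_t sat_reduce_interleaved(const int32_t *t, int n)
{
    int32_t s_even = 0, s_odd = 0;
    for (int i = 0; i < n; i += 2) {
        s_even = sat_add(s_even, t[i]);      /* first MAC unit */
        s_odd  = sat_add(s_odd, t[i + 1]);   /* second MAC unit */
    }
    return sat_add(s_even, s_odd);
}
```

When no sub-reduction saturates, both functions agree, for instance on {1, 2, 3, 4}. On {INT32_MAX, INT32_MAX, INT32_MIN, INT32_MIN}, the reference returns INT32_MIN while the interleaved version returns -2, which shows how far the approximation can drift once saturation occurs.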
The correctness of an approximate algorithm mainly depends on the potential saturation of the sub-reductions. Let us introduce the notations:
Then the different approximation cases we shall discuss for 
Summary of the Approximation Results
The proofs of this section are omitted. (They can be found in [2] .)
In this section, we denote: S Er Xz. @yi:(z, U [ ; ] ) .
[...] [8], the StarCore SC140 [6], and the STMicroelectronics ST120 [7]. On the ETSI and the ITU reference implementations, several saturated reduction loops can be parallelized by applying unroll-and-jam, that is, unrolling the outer loop into the inner loop so as to create more parallel work. When unroll-and-jam is not applicable, the arithmetic properties of the saturation operator make it possible to compute two saturated reduction steps per cycle, at the expense of four multiply-accumulate operations per cycle. Our main technique requires one 40-bit integer accumulation and three 32-bit saturated accumulations per cycle. We then show how to replace this 40-bit integer accumulation by a 32-bit integer accumulation.
Lemma 3 (Saturation of S).
When the "bit-exact" requirement can be relaxed, more efficient but approximate techniques can be used to parallelize the saturated reductions. Based on further arithmetic properties of the saturation operator, we compare the approximate techniques to each other. In particular, the commonly used technique S, that computes two interleaved saturated sub-reductions achieves the worst approximation. We find that the most precise approximate technique S1 computes the first half of the reduction in order using 32-bit saturation, and the second half using 40-bit integer arithmetic.
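The most precise approximate technique described above can be sketched as follows. Several points are assumptions made for illustration: the function names are hypothetical, int64_t stands in for a 40-bit accumulator, the terms t[i] stand for already saturated products, and the two halves are combined by a single saturated addition at the end.

```c
#include <stdint.h>

int32_t saturate32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Reference: in-order saturated reduction of n terms. */
int32_t sat_reduce(const int32_t *t, int n)
{
    int32_t s = 0;
    for (int i = 0; i < n; i++)
        s = saturate32((int64_t)s + t[i]);
    return s;
}

/* Approximation (n even): first half reduced in order with 32-bit
   saturation, second half summed in a wide non-saturated accumulator. */
int32_t sat_reduce_split(const int32_t *t, int n)
{
    int half = n / 2;
    int32_t s_first = 0;    /* saturated 32-bit accumulator */
    int64_t w_second = 0;   /* models the 40-bit accumulator */
    for (int i = 0; i < half; i++) {
        s_first = saturate32((int64_t)s_first + t[i]);
        w_second += t[half + i];
    }
    return saturate32((int64_t)s_first + w_second);
}
```

On {INT32_MAX, INT32_MAX, INT32_MIN, INT32_MIN}, both functions return INT32_MIN, since the saturation of the first half is reproduced exactly. On {1, 1, INT32_MAX, INT32_MIN}, the reference saturates inside the second half and returns -1, while the split version returns 1, so the result remains an approximation.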
