Abstract -The SIMD parallelization of IIR or recursive filters is more difficult than that of FIR filters due to additional data dependencies. While other methods concentrate on appropriate scheduling to enable SIMD parallel execution, this paper proposes a new method where data dependencies are resolved by fusing the recursive application of filter taps into single coefficients. In this way the overhead over perfect parallelity can be reduced to one vector multiply-accumulate operation. Speedups from 1.5 to 4.5 can be obtained with the 4-way SIMD Intel SSE extension, depending on the number of filter taps.
INTRODUCTION
The parallelization of FIR filters has been investigated thoroughly, especially for wavelet filters, for old SIMD arrays [1] , [2] , [3] and SIMD extensions of modern general purpose processors in the 1-D case [4] , [5] and the 2-D case [6] , [7] , [8] . The parallelization of IIR filters is more difficult due to data dependencies. There are several approaches including space-time transformations of loop iterations [9] , [10] and algebraic transformations [11] . The approach presented in this paper also uses algebraic transformations, although different ones.
From a computational point of view, the difference between FIR and IIR filters lies in the dependencies between loop iterations. Basically, there are two loops, one over signal data and one over filter taps. In the FIR case, iterations of the outer loop, i.e. entire inner loops, are independent of each other, leading to a rather straightforward SIMD parallelization where the two loops (inner and outer) are interchanged for a number of outer iterations equal to the SIMD vector size p. See [4] , [5] and Fig. 1(a) . In the IIR case, the dependencies are more complicated since all previous output values are required to calculate a new one. See Fig. 1(b) . Therefore, SIMD parallelization is more difficult.
This work presents a new approach where data dependencies are resolved by fusing the recursive application of filter taps into single coefficients. In this way it is possible to reduce the overhead of vector operations to only one vector multiply-accumulate operation, apart from several necessary shuffle operations, if the number of filter taps is not smaller than the SIMD parallelity.
All results in this work have been conducted on an Intel Pentium 4 CPU with 3.2GHz and 2MB cache size using the SSE extension with packed words of 4 single precision numbers. All implementations use the same amount of code optimization, i.e. memory access through incremented pointers instead of indexed arrays, and compilation with gcc 4.1.2 with the -O3 option. SIMD operations are implemented using gcc's built-in intrinsics for vector extensions and the -msse option. Note that in order to have full control over generated code, no automatic vectorization is applied.
The results are compared to hand-optimized code by human experts, i.e. the Intel Integrated Performance Primitives (IPP) v5.3, and are able to compete with and outperform it depending on the number of filter taps. Note that the IPP library also uses SIMD operations, but the applied methods are not known to the author.
SEQUENTIAL ALGORITHM
The goal of IIR filtering is to calculate the signal y from the signal x by
where the second term is an FIR part with coefficients b i and the first term is the IIR part with coefficients a i . M is the number of FIR filter taps and N is the number of IIR filter taps. The formula reveals the outer loop over n and two inner loops over i.
The sequential implementation is optimized for performance in order to allow a reasonable comparison to the SIMD parallelized version. It turns out that maintaining a pointer for y n and x n and addressing x n−i and y n−i via relative addressing is fastest. Using extra buffers or local register variables for reused values does not improve the performance. Therefore, a similar implementation style is adopted for the SIMD parallelization.
SIMD PARALLELIZATION OF THE FIR PART
For SIMD parallelization, it is best when neighboring data has to be processed independently because this leads to a natural sequence of vector operations without the need to combine elements of the same vector, which would involve shuffle operations and incomplete vector operations. The approach to interchange the inner and outer loop for as many outer loop iterations as there are vector elements, leads exactly to this situation and turns out to be near optimal. See Fig. 1(a) . The algorithm can be expressed by where ⊙ is the vectorized multiplication and p is the vector size.
However, the input signal x has to be read in a nonaligned way in this approach. This requires a shuffle operation for each non-aligned read. Moreover, there are architectures where not all shuffle operations are possible, e.g. the Intel Pentium architecture. This special architecture demands that the first two of four vector elements originate from the first vector operand and the other two from the second operand. Fig. 2 shows how all possible realignments can nevertheless be implemented on this architecture with one shuffle operation for each non-aligned read.
SIMD PARALLELIZATION OF THE IIR PART
The IIR part can be parallelized in just the same way for those iteration where i ≥ p, i.e. where the source vector y (n−i,...,n−i+p−1) does not overlap with the destination vector y (n,...,n+p−1) that is being calculated. The iterations i = 0, . . . , p − 1 might be implemented sequentially after computing the others in a vectorized way first by
followed by
A first attempt to parallelize the latter part is to split it into two phases. The first phase treats those terms that reference y n+k−i where n + k − i < n, i.e. values that are already available.
The second phase uses those elements of v that already represent y n+k values. At the beginning, only v 0 = y n . So, v 0 a i can be added to v i for i = 1, . . . , p − 1. After that v 1 = y n + 1. Repeating this for v 2 , v 3 , . . ., leads to the following algorithm:
This first approach yields an overhead of p − 1 multiplyaccumulate vector operations, since each phase has p − 1 iterations, resulting in 2(p−1) operations, where only p−1 would be necessary if there were no problems with data dependencies. Now, we will develop the novel approach that fuses filter taps to resolve data dependencies. Let us look at the second iteration (k = 1) of the last algorithm. Here, Following this approach even further recursively, we get the following algorithm that substitutes both phases.
..,n+p−1) ← v s(i) holds the fused filter tap coefficients and has the following form:
. . .
This approach finally has an overhead of only one multiplyaccumulate vector operation, since it has p iterations. For better comprehensibility, let us write the algorithm or the case of p = 4 as in the Intel SSE architecture:
Of course, each of these multiply-accumulate operations requires at least one shuffle operation, maybe two on the Intel SSE architecture.
If the number of IIR-taps N is smaller than the vector size p, the above approach unfortunately only reduces to p − 1 operations. In this case, some divide-and-conquer algorithm might further reduce the overhead. However, ⌈log 2 (p + 1)⌉ seems to be the lower bound, since y n+p−1 depends on the p + 1 values u 0 , . . . , u p−1 , y n−1 if N takes the minimal value 2.
PERFORMANCE
In [4] , [5] , it turned out that the performance of an implementation of a filtering algorithm possibly depends on whether the signal data is in the cache or not. The method to find this out is to vary the data length and to repeat the filtering several times. If the data is short, then it will remain in the cache, otherwise it will not. This approach is also adopted here.
The calculation time is expected to depend linearly on the data size and on the number of filter taps N + M . Therefore, we calculate the execution time per sample point and filter tap from the total execution time of the algorithm by t total /S/(N + M ), where S is the data size. Fig. 3 shows the results for N = M = 2 and N = M = 10. It also includes performance measures of the Intel IPP library. While the IPP library code seems to depend a little on the data size, the major reason for this seems to be startup-overhead when filling the delay-lines, which is significant only for small data sizes. The sequential algorithm and the SIMD algorithm are completely independent of the cache state.
For small numbers of taps, the IPP library code seems to be faster. This is also shown in Fig. 4 . For N = M ≤ 5, the SIMD algorithm cannot compete with the IPP code. The reason is probably that hand-optimizing assembler code, as in the IPP library, is more important for short loops. For N > 5, however, the new SIMD approach outperforms the IPP library by a speedup of about 1.7 and also shows more regular behavior. Compared to the sequential algorithm, speedups from 1.5 for small N to 4.5 for large N are obtained.
CONCLUSION
A successful approach for FIR filtering on SIMD architectures is based on the interchange of the inner loop over filter taps and the outer loop over the signal data. If this is done on a number of outer iterations that is equal to the vector size p, a convenient SIMD algorithm can be generated easily.
In the IIR case, additional data dependencies disturb this scheme. However, it has been shown that the scheme can be left unchanged for all but the first p filter taps if the order of filter tap application is reversed. For those taps, coefficients can be fused in order to resolve data dependencies. This algebraic transformation manages to limit the parallel overhead to one vector multiply-accumulate operation.
The approach yields speedups of 1.5 to 4.5 compared to the sequential version on an Intel Pentium SSE processor. The fact that the speedups exceed p = 4 is probably due to improved cache usage. It even outperforms the handoptimized code of the Intel IPP library by a speedup of about 1.7 if the number of filter taps is 6 or higher.
