Abstract. Many digital signal processors (DSPs) and microprocessors employ the single-instruction multiple-data (SIMD) paradigm for controlling their data paths. While this can provide high computational power and efficiency, not all applications can profit from this feature. One important application of DSPs is recursive filtering. Due to their data dependencies, recursive filters cannot exploit the capabilities of SIMD-controlled DSPs. This paper introduces enhancements of the SIMD control paradigm to accommodate recursive filters. Three methods for calculating recursive filters on SIMD-controlled DSPs and their requirements for control and data transfer are presented. Their performance and hardware requirements are evaluated to determine the most efficient solution in terms of the AT-product.
Introduction
Digital signal processors (DSPs) are increasingly used in signal processing applications because they provide the flexibility needed in a world of evolving standards, changing requirements, and costly bug fixes in dedicated application-specific integrated circuits (ASICs). To minimize the size and performance penalty paid for this flexibility and to achieve high computational power, many state-of-the-art DSPs employ the single-instruction multiple-data (SIMD) paradigm for control. SIMD schemes are also finding their way into the multimedia extensions of microprocessors, e.g., MMX™ or SSE™.
One important application of DSPs is recursive filtering. Recursive filters are employed mainly in audio processing, e.g. [1], but also in other applications [2]. Due to their data dependencies, they cannot exploit the capabilities of SIMD-controlled DSPs as non-recursive (FIR) filters can. There are multiple ways of speeding up the computation of recursive filters: previously, we analyzed the data dependencies [3] and introduced algebraic transformations that allow recursive filters to be computed in SIMD structures at the cost of computational overhead [4]. Other approaches in the literature employ pipelining instead of parallelization for speedup [5] or require multiple-instruction schemes for parallel calculations [6, 7]. This paper introduces another approach: enhancing the SIMD control scheme itself to accommodate recursive filters without introducing much hardware overhead. Several techniques for enhancing SIMD control are presented, and their hardware overhead and performance are analyzed to find the most efficient solution.
Parallel Calculation of Recursive Filters

Data Dependencies
The data dependencies limiting the available data parallelism can be seen in equation 1.
For the calculation of each output value $y_k$, the previous output values $y_{k-1}$ through $y_{k-M-1}$ must be available. Hence, it is impossible to complete the calculation of multiple output values at the same time, as can be done for FIR filters [8]. Instead, if multiple output values are to be computed in parallel, their calculation can only be completed one output value at a time. This leads to the parallel calculation methods derived in [3], which are elaborated upon below. We will review the data transfer and control requirements of these methods in the following sections.
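The dependency can be made concrete with a minimal Python sketch of a sequential recursive filter. It assumes a common direct form in which feed-forward coefficients `a` weight the inputs and feedback coefficients `b` weight the previous outputs; the function name, the coefficient names, and the sign convention are illustrative assumptions and may differ from equation 1.

```python
def recursive_filter(x, a, b):
    """Sequential reference: y[k] = sum_i a[i]*x[k-i] + sum_j b[j]*y[k-1-j].

    a: feed-forward coefficients, b: feedback coefficients applied to
    y[k-1], y[k-2], ... (assumed naming). Samples before k = 0 are zero.
    """
    y = []
    for k in range(len(x)):
        acc = 0.0
        for i, coeff in enumerate(a):
            if k - i >= 0:
                acc += coeff * x[k - i]
        for j, coeff in enumerate(b):
            if k - 1 - j >= 0:
                # y[k] cannot be finished before y[k-1-j] exists.
                acc += coeff * y[k - 1 - j]
        y.append(acc)
    return y
```

The inner feedback loop reads values that the outer loop has only just produced, which is why the outputs cannot all be completed in the same step.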
Method 1: Calculation of one output value per data path
A data flow graph of an exemplary filter with N = 4 and M = 5, mapped onto a processor with 3 or more data paths, is shown in figure 1. Each output value is calculated on one particular data path. From the data flow graph, the following observations can be made about the data transfer and control requirements:
- Coefficients are always loaded into data path number 0 (left) and then transferred to the other data paths by a Zurich-Zip.
- Data are loaded into data path number 0 as well as into higher-numbered data paths. This requires a parameter for the load instruction. The data are then broadcast to multiple, but sometimes not all, data paths. Again, this requires special parameters for the broadcast instruction.
- The calculated output values are written back to memory and also broadcast to the data paths to the right of the calculating data path. The broadcast is already covered by the previous point; the store operation requires a special parameter for the source data path.
- During the prologue, data paths have to be activated one by one; during the epilogue, they have to be deactivated again.
- For fixed-point processors, e.g. one with 16 bit, no contents of accumulators, which are e.g. 40 bit wide, have to be transferred between data paths.
If more output samples than available data paths have to be calculated, multiple sets of calculations must be performed. The epilogues and prologues of consecutive sets can be overlapped. However, this requires a higher control effort. For overlapping sets, the time to calculate Y outputs of a filter with N + M − 1 taps on a processor with P data paths and $T_T$ cycles per tap can be determined as:
For non-overlapping sets, further denoted as method 1a, this execution time increases to:
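The data flow of method 1 can be illustrated with a small Python simulation. This is a minimal sketch under stated assumptions, not the implementation evaluated in this paper: it models non-overlapping sets (method 1a), lets each data path start one tap after its left neighbour, orders the feedback taps from oldest to newest so that every previous output is available just in time, and counts abstract tap steps rather than processor cycles. The function name, the coefficient lists `a` (feed-forward) and `b` (feedback), and the sign convention are illustrative choices.

```python
def method1_nonoverlapping(x, a, b, num_paths):
    """Simulate one output per data path with a one-tap skew between paths."""
    n_out = len(x)
    y = [None] * n_out                      # None marks "not yet computed"
    taps = len(a) + len(b)
    steps = 0
    for k0 in range(0, n_out, num_paths):   # one set of outputs per pass
        lanes = min(num_paths, n_out - k0)
        acc = [0.0] * lanes
        for s in range(taps + lanes - 1):   # prologue, steady state, epilogue
            finished = []
            for p in range(lanes):
                t = s - p                   # tap handled by path p in this step
                if not 0 <= t < taps:
                    continue                # path p is inactive in this step
                k = k0 + p
                if t < len(a):              # feed-forward tap
                    if k - t >= 0:
                        acc[p] += a[t] * x[k - t]
                else:                       # feedback tap, oldest output first
                    j = len(b) - 1 - (t - len(a))
                    src = k - 1 - j
                    if src >= 0:
                        # The left neighbours must already have written y[src].
                        assert y[src] is not None
                        acc[p] += b[j] * y[src]
                if t == taps - 1:
                    finished.append(p)
            for p in finished:              # write back at the end of the step
                y[k0 + p] = acc[p]
            steps += 1
    return y, steps
```

In this model, a full set of P outputs occupies `taps + P - 1` tap steps, which reflects the prologue and epilogue during which only some data paths are active.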
Method 2: Calculation of one filter tap per slice
The data flow graph of the second method for calculating multiple output values in parallel is shown in figure 2. Here, each filter tap is calculated on one data path. This requires transferring the accumulated intermediate results between data paths. Since in fixed-point processors the accumulated values have a higher bit width than the samples or coefficients, this can lead to higher bandwidth requirements for the data transfers. Nevertheless, essentially the same kinds of data transfer operations as before are required. For calculating filters where the number of taps is larger than the number of data paths, multiple sets of calculations have to be performed for one output. Intermediate instead of final results have to be stored and later reloaded for further accumulation. Also, if the number of taps is not divisible by the number of data paths, some data paths will be idle during the last set of calculations. Again, consecutive sets of operations can be overlapped at the cost of more parallel load and store operations. For overlapped calculations, the calculation time can be determined as:
For non-overlapping sets, further denoted as method 2a, this execution time increases to:
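The accumulator transfers of method 2 can be sketched in Python as follows. This is a simplified model, assuming exactly as many data paths as taps (so no intermediate results are stored and reloaded), one tap step per transfer, and feedback taps placed on the rightmost data paths from oldest to newest; the function name and the coefficient lists `a` and `b` are illustrative assumptions.

```python
def method2_one_tap_per_path(x, a, b):
    """Each data path owns one tap; partial sums move one path right per step."""
    # Tap order along the paths: feed-forward taps, then feedback taps oldest first.
    taps = [("x", i, c) for i, c in enumerate(a)]
    taps += [("y", j, b[j]) for j in reversed(range(len(b)))]
    n_out, num_paths = len(x), len(taps)
    y = [0.0] * n_out
    pipe = [None] * num_paths           # pipe[q] = (k, partial sum) leaving path q
    for step in range(n_out + num_paths - 1):
        new_pipe = [None] * num_paths
        for q, (kind, lag, coeff) in enumerate(taps):
            # Path 0 starts a new output; the others take the neighbour's accumulator.
            incoming = (step, 0.0) if q == 0 else pipe[q - 1]
            if incoming is None or incoming[0] >= n_out:
                continue
            k, acc = incoming
            src = k - lag if kind == "x" else k - 1 - lag
            if src >= 0:
                acc += coeff * (x[src] if kind == "x" else y[src])
            new_pipe[q] = (k, acc)
        pipe = new_pipe
        if pipe[-1] is not None:        # last path completed an output: write back
            k, acc = pipe[-1]
            y[k] = acc
    return y
```

Unlike in method 1, the value handed from path to path here is the accumulator itself, which is the wider quantity on a fixed-point processor.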
Method 3: Parallel calculation of one output value using tree-addition
For comparison, we also include a third method, where just one output value is calculated at a time. First, the taps are calculated in parallel; second, they are summed up in a tree pattern. This method requires special data transfers for the tree addition. They deviate from the SIMD paradigm but are otherwise fairly simple to implement, since they can reuse the connections that are required for the Zurich-Zips mentioned above. Method 3 achieves the following calculation times:
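The tree addition of method 3 can be sketched in Python as follows. The sketch assumes that neighbouring data paths are combined pairwise in each addition level and that an unpaired value is passed through unchanged; the function names and the coefficient lists `a` (feed-forward) and `b` (feedback) are illustrative assumptions.

```python
def tree_sum(values):
    """Sum per-path products pairwise; returns (total, number of tree levels)."""
    level = list(values)
    levels = 0
    while len(level) > 1:
        # Each pair of neighbouring data paths is combined in one level.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])       # odd element passes through unchanged
        level = nxt
        levels += 1
    return (level[0] if level else 0.0), levels


def method3_output(x, y, k, a, b):
    """One output value: all tap products in parallel, then a tree addition."""
    products = [c * x[k - i] for i, c in enumerate(a) if k - i >= 0]
    products += [c * y[k - 1 - j] for j, c in enumerate(b) if k - 1 - j >= 0]
    return tree_sum(products)
```

For eight products, this pairing needs three addition levels; the next output can only be started once the current total is known.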
Required Control Structures
After presenting three different methods for calculating recursive filters in parallel, we will now analyze the required control structures. During the epilogue, data paths have to be switched off one after another once they have calculated all their filter taps. Also, data are only transferred between the active data paths. For method 1, after calculating the last filter taps, denoted by the darker grey boxes, the outputs have to be written back to memory and transferred to the other data paths. This requires different instructions than the calculation of the preceding taps.
This lets us conclude that we have to solve two problems: First, we need to activate and deactivate data paths. Second, we need to provide addresses to the data transfers to control their source and destination.
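The first problem, activating and deactivating data paths, can be sketched as a per-step activity mask. The sketch below assumes a single, non-overlapped set in which each data path starts one step after its left neighbour and stays active for a fixed number of tap steps; the function name and this timing model are illustrative assumptions.

```python
def activity_masks(num_paths, taps):
    """Return one list of activity bits per step for a single set."""
    masks = []
    for step in range(taps + num_paths - 1):
        # Path p is active while it works on tap index step - p.
        masks.append([1 if 0 <= step - p < taps else 0 for p in range(num_paths)])
    return masks
```

For three data paths and two taps, the masks are `[1, 0, 0]`, `[1, 1, 0]`, `[0, 1, 1]`, and `[0, 0, 1]`: the first two steps form the prologue and the last two the epilogue.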
If we want to overlap two sets of calculations, the required control structures become somewhat more complex, as can be seen in figure 3 b). A true prologue is only performed for the first set of calculations, and a true epilogue only for the last. Overlapping the epilogue and prologue of two sets of calculations yields a new phase of operation in which no data paths have to be (de-)activated, but different operations have to be performed in certain data paths and data transfers have to be split. These split data transfers require more addresses.
The following section lays out how these control requirements can be met.
Enhanced SIMD-control
After analyzing the control requirements in the previous section, we will now propose control techniques that can fulfill these requirements. For each technique, a qualitative assessment of the hardware requirements is presented; exact numbers follow in the next section.
Software Control Using Immediates
A very simple way of controlling data path activity and data transfers is to supply immediates with each instruction. One immediate bit is set for each active data path (its activity bit), and the source and destination of data transfers are also specified as numerical values. Since the immediates must be set in every cycle, program code cannot be written as loops, which may increase the required program memory size.
Hardware Requirements
This technique requires P bits for data path activation, ld P bits for each load or store instruction, and 2 · ld P bits for each broadcast or Zurich-Zip instruction, where ld denotes the binary logarithm.
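As a worked example of these bit counts, the following Python sketch adds up the immediate bits for one instruction word. It assumes that the address width ld P is rounded up to a whole number of bits when P is not a power of two and that one instruction word may contain several load/store and transfer slots; the function and parameter names are hypothetical.

```python
def immediate_bits(num_paths, loads_stores=0, transfers=0):
    """Immediate bits needed in one instruction word (assumed slot model)."""
    # ceil(ld P) computed with integers: bits needed to address P data paths.
    addr_bits = (num_paths - 1).bit_length()
    activity = num_paths                          # one activity bit per data path
    return activity + loads_stores * addr_bits + transfers * 2 * addr_bits
```

For P = 8 data paths, one load and one broadcast in the same instruction word would need 8 + 3 + 6 = 17 immediate bits under these assumptions.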
Software Control Using Registers
To allow loops to be used in program code, we designed a second scheme in which the activity bits and addresses are supplied by control registers. These control registers, in turn, are manipulated by small ALUs, which have to be controlled by instructions in each cycle. Since these instructions can remain the same for some time, e.g. an increment instruction, they can be included in loop bodies.
Hardware Requirements
The width of the instruction word remains about the same as in the previous method, since the registers must be specified in the data transfer instructions. Additionally, the register manipulation units must be controlled. However, the overall program memory requirements are lower, since fewer lines of code are required. Chip area is also required for the registers and ALUs, but it is quite small.
Hardware Control Using State-Machines
The previous technique used registers manipulated by software-controlled ALUs. Since this software control is fairly regular, we replace it with hardware state machines in this technique. The idea is illustrated in figure 4. The left part shows [...]
Hardware Requirements
The hardware requirements for this technique consist of the registers and the ALUs for manipulating them and are very small. The instructions for setting up the state machine can use already existing instruction slots, e.g. for program control, and hence do not increase the size of the instruction word.
Skewed SIMD Control
When looking at the required control structure, one can note that it is basically a skewed SIMD structure. This means that the instruction executed on one data path is the same instruction that was executed on the neighboring data path one loop iteration before. We exploit this observation to create a control technique that does not rely on activating and deactivating data paths but on skewed SIMD instructions. The technique is illustrated in figure 5.
The left side depicts the control structure to be realized. The right side shows a block diagram of the control hardware. In addition to the usual program memory, instruction decoder, and distribution of control signals, a buffer, a second instruction decoder, and a SIMD control unit are added. The buffer holds a copy of the program code of the first phase. One decoder decodes the instructions of the first phase; the second decoder decodes the instructions of the second phase. The decoded control signals of both decoders are distributed to all data paths, and the SIMD control unit selects one of the two sets of control signals for each data path. The SIMD control unit again contains a state machine, which is set up at the beginning of the calculation of a recursive filter. For each loop iteration, it advances its state so that data paths are switched from the first to the second phase. If the loop body contains just a single instruction, the control technique can be simplified to a pipeline-like structure, depicted in figure 6. Here, the decoded signals are delayed between the data paths just like in a pipeline.
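The pipeline-like variant can be sketched in Python as a delay line for decoded control signals. The sketch assumes that one control word is issued per step, that each data path receives the word its left neighbour received one step earlier, and that idle data paths see a hypothetical `"nop"`; the function name and the instruction mnemonics are illustrative only.

```python
def skewed_issue(instructions, num_paths, idle="nop"):
    """Delay each decoded control word by one step per data path."""
    stages = [idle] * num_paths                 # stages[p]: control word at path p
    trace = []
    # Append idle words so the last instruction reaches the last data path.
    stream = list(instructions) + [idle] * (num_paths - 1)
    for instr in stream:
        stages = [instr] + stages[:-1]          # shift towards higher path numbers
        trace.append(list(stages))
    return trace
```

For example, `skewed_issue(["mac", "store"], 3)` yields the rows `["mac", "nop", "nop"]`, `["store", "mac", "nop"]`, `["nop", "store", "mac"]`, and `["nop", "nop", "store"]`, so prologue and epilogue arise from the delay itself without separate activity bits.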
Hardware Requirements
The hardware requirements are determined by the program buffer, the second decoder, and the SIMD control unit. Depending on the ISA and buffer size, the buffer and decoder will form the largest part. For the pipeline technique, the hardware requirements are determined by the number and size of the pipeline registers.
Comparison of Hardware Requirements
After proposing the various control techniques, we will compare their hardware requirements in order to find the most efficient solution. The hardware requirements were estimated for the UMC 0.13 µm silicon technology, where one gate equivalent occupies 5.13 µm² and one bit of single-port memory, used for program memory, occupies about 3 µm². We also assume that the processor features a VLIW-like ISA with instruction words of about 170 bits and very simple decoders. These assumptions stem from the DSP we are currently developing, for which the work described in this paper was performed.
The hardware requirements for the proposed control methods are presented in figure 7 a). It can be seen that the software control method with immediates requires the largest chip area due to its program memory consumption. The methods for skewed SIMD control require a fairly large chip area, too. However, these values are for our particular ISA and might be much smaller for ISAs with smaller instruction words. The smallest method is the hardware control with state machines.
However, area consumption is not the only criterion for an efficient solution. Performance also has to be taken into account. For this paper, we calculated the execution time T for each of the presented calculation schemes for 200 output values of a filter with 20 taps. This should be a reasonable assumption for applications where data is processed in frames. The cycle time of the processor is not affected by the control method, since it is determined by the data paths. Hence, we can calculate the efficiency $E = \frac{1}{A \cdot T}$, where A represents the consumed die area.
The efficiency for just the control hardware, normalized to the method with the lowest efficiency, is presented in figure 7 b). Due to its small area consumption, hardware control with state machines yields by far the best efficiency. However, the area consumption must be viewed for the whole DSP, where the area for memories and data paths must also be accounted for. This has been done in figure 7 c), where we included the area for 1 Mbit of data memory. It can be seen that the different hardware requirements of the control schemes play less of a role than the choice of calculation scheme. Hence, issues like compiler- and programmer-friendliness should also be taken into account when choosing a method for enhancing SIMD control.
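The efficiency measure and its normalization can be sketched in Python as follows. The function name and the dictionary layout are assumptions made for illustration, and the design names and numbers in the usage note are arbitrary placeholders, not results from this paper.

```python
def normalized_efficiency(designs):
    """designs maps a name to (area A, execution time T); returns E / min(E)."""
    # Efficiency as the inverse of the AT-product.
    eff = {name: 1.0 / (area * time) for name, (area, time) in designs.items()}
    worst = min(eff.values())                   # least efficient design maps to 1.0
    return {name: e / worst for name, e in eff.items()}
```

With the placeholder inputs `{"design_a": (2.0, 4.0), "design_b": (1.0, 2.0)}`, the second design has a quarter of the AT-product of the first and therefore a normalized efficiency of 4.0.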
Conclusions
In this paper, we first presented three methods for speeding up the calculation of recursive filters by exploiting data-level parallelism. We analyzed the required control structures for these methods and presented several techniques for enhancing SIMD-controlled DSPs to perform parallel calculation of recursive filters. The hardware requirements of the proposed techniques were determined, and their efficiency was compared by means of the AT-product.
