Driven by the ever increasing algorithm complexity on the field of mobile communications systems, SIMD DSP architectures have emerged as an approach that offers the necessary processing power at reasonable levels of die size and power consumption. However, this kind of DSP architectures imposes new challenges for programmers, since algorithms have to be designed to exploit the available parallelism on the processor. Taking as a starting point an algebraic framework that captures the SIMD computational model, we report in this paper about our efforts to design and automatically generate object code for our family of DSP architectures independent of the available SIMD parallelism. We show how these algebraic structures can be used as a high level programming language that offers a unified approach to design and describe algorithms using SIMD parallelism. Moreover, we show how these algebraic structures offer concise rules for the automatic code generation.
I. INTRODUCTION
SIMD (Single Instruction Multiple Data) parallelism offers the potential to speed up the computation of signal processing algorithms. This is of paramount importance in high end applications where the required processing power to compute the algorithms must be trade-off with power consumption and die size issues.
However, SIMD DSP processors are a major challenge for programmers, since algorithms have to be designed to exploit the available parallelism. In [1] we have presented a novel DSP architecture that offers SIMD parallelism. In fact, STA (Synchronous Triggered Architecture) based processor cores can be automatically generated by means of our integrated design flow. Thus, we are able to generate processor cores with different levels of SIMD parallelism. This imposes a new challenge to algorithm designers: scalability. Hence, the question here arises is not only how a programmer can design an algorithm that exploit the available parallelism, but how this algorithm can be designed in such a way that it can be easily scaled.
Our approach to deal with these challenges is based on a set of parameterizable algebraic structures that captures the SIMD computational model. This algebraic components reveal the kind of computations that our family of STA processors can support. Thus, the task of algorithm design reduces to find the way an algorithm can be expressed by means of these algebraic constructs. We use Matlab syntax as the language with which our algebraic structures are programmed. Thus, algebra can be regarded as a unified language used in order to design algorithms into SIMD DSP architectures. Moreover, it provides a set of rules that can be used to automatically generate object code.
II. ALGEBRAIC FRAMEWORK
A fundamental role on the algebraic framework plays the Kronecker product. Davio [2] in his classical paper developed an algebraic framework, where he established the connection between the Kronecker product of matrices and stride permutations. In his paper he proved the important commutation theorem of Kronecker products
where P (L, s) is a matrix that represents the permutation of L elements with stride s and A ⊗ B is a matrix known as the Kronecker product of the m A ×n A matrix A and the m B ×n B matrix B that is defined as follows:
This remarkable result enabled researchers to formulate different fast algorithms for signal transformations like discrete fourier transforms, walsh-hadamard transforms, discrete cosine, sine transform, etc. using the same algebraic framework. Tolimieri [3] uses Kronecker products in order to formulate FFT algorithms for different parallel architectures. Tolimiere makes also the observation that different models of parallel computation can be described by means of Kronecker products. This has originated a lot of research in order to automatically generate code of signal transformations like FFT, Cosine and Sine transform for different parallel architectures [4] . Although there has been a lot of research on the generation of signal transformations, there has been scarce research on extending these ideas in order to support a wider range of digital signal processing algorithms.
In order to characterize the SIMD computational model let A be an m × m matrix, x a Nm × 1 vector, I N the N × N identity matrix and consider the following important expression:
where using Matlab syntax we define the vector of N elements
Equation (3) is known as a Kronecker vector factor [3] , since it expresses the computation of m vectors y i of size N × 1 as vector operations that deal with vectors of size N . This is better illustrated if we rewrite equation (3) as follows:
for i = 0, 1, 2, . . . , m − 1 and using Matlab syntax again y i = y iN : (iN + N − 1) . Equations (3) and (4) captures the SIMD computation model. Further we assume for the level of SIMD parallelism N = 2 n for n = 1, 2, 3, . . .
III. APPLICATION EXAMPLE 1: TENSOR PRODUCT REPRESENTATION OF FIR FILTERS
In this section we show by means of a simple example how the algebraic notation presented in section II can be used in order to describe the SIMD computation of an algorithm. Let x and y be input and output vectors defined as
and a vector of coefficients defined as
We consider the following computation
where the transformation matrix is formed taking the coefficients of the vector h and has a Toeplitz structure as follows:
Assuming that two results can be computed in parallel, we can write for the computation of (5) the following equations
Using the notation introduced in section II, we can write for the computation of equation (5) the following
and the vector x j is defined as
The equations we developed above can be generalized for an arbitrary level of SIMD parallelism N and for a vector of coefficients:
where p/N is a positive integer greater that one. For the input and output vectors we can write
Further, we assume that the number of input samples to be processed L is also a multiple of N, thus the following holds:
In order to formulate the scalable computation of equation (5), we partition the input and output vectors in sub-vectors of size N . Thus, for the input vectors we can write
for 0 ≤ j < J. Otherwise, x j is a vector with N zero elements. For the output vectors, we can write
for 0 ≤ i < J. Finally the computation of (5) for an arbitrary level of SIMD parallelism can be expressed by means of the following expression:
In this equation I N and Z N are the N × N identity and shift matrix respectively. Thus, equation (6) describes an algorithm for the implementation of the core computations of an FIR filter with an scalable level of SIMD parallelism. This algorithm can be extended by any suitable partitioning technique like for example overlapping save in order to obtain the time-continuous computation of the filter.
IV. APPLICATION EXAMPLE 2:TENSOR PRODUCT REPRESENTATION OF RECURSIVE FILTERS
In [5] , we presented a block formulation of pure recursive filters using Kronecker vector factors. The block formulation was derived using a lifting-isomorphism [6] with a raising factor N applied to the state-space equations of the serial formulation of the recursive filter. Our starting point for the derivation of the raised algorithm is the serial formulation of a recursive filter of order p, namely
The state-space representation of this algorithm is given by
where the state vector x(k) = [x 1 (k) x 2 (k) . . . x p (k)] T contains past computed samples. Thus,
x i (k) = y(k − i) for i = 1, 2, . . . , p
For the system matrices we have
We define the input and output vectors of the raised algorithm as follows:
where N is the raising factor. Thus, and according to [5] , we can write for the Kronecker vector formulation of the algorithm the following
where c i are vectors formed by the columns of a raised system matrix
and Z N is the N × N shift matrix. The importance of equation (7) resides on the matter that it expresses the computation of the algorithm in terms of Kronecker vector factors. The algorithm deals with input vectors of size N and produces output vectors of size N and thus, it is independent of the available level of SIMD parallelism.
V. KRONECKER COMPILER
The automatic code synthesis is based on the matter that algorithms expressed as in equations (6) and (7) uses certain algebraic structures, which have a direct interpretation in processor instructions. For example, let us take the following expression of equation (7) y(Nk − i) ⊗ I N This expression means that the scalar y(Nk − i) has to be broadcasted to all the data paths of the parallel DSP architecture. Thus, this algebraic structure will be translated into a broadcast data transfer instruction. Let us take a new example
This expression contains two actions that have to be carried out by the processor: 1.-broadcast of y(NK − i), 2.-componentwise vector multiplication. As we can see from this example, the algorithm is expressed in such a way that enables scalability, since all the components involved in the equation have as a parameter the number of data paths of the processor N . Finally, we consider the expression
where p vector MAC operations are described. In the example above, we have presented a simple data transfer, namely broadcasting. However, we have extended the number of elementary algebraic structures in order to support a wider range of algorithms. This was the case of the N × N shift matrix Z N , which resembles the Zurich-Zip data transfer. More complex transfers that are supported by our processor like stride permutations P (L, s) are also part of our algebraic repertoire.
Speedup for a 32-Tap FIR Filter
In the algorithm of equation (6), we find the operation q N . This can be easily implemented in a processor as a right shift by n positions, where N = 2 n holds. Moreover, this operation determines that a new vector of input samples x i− q N is loaded from memory only when the index q becomes a multiple of N . This makes clear the advantageous property of FIR filters implemented into SIMD processors by which the number of expensive memory accesses are decreased with an increasing level of SIMD parallelism.
Matlab is a very popular language among digital signal processing designers. The matrix oriented capability of the language allows for easily programming algorithms using the elementary algebraic structures of equations (6) and (7). For example, the first part of equation (7) can be written using matlab syntax as follows: mac=zeros(N,1) for i=1:p, mac=mac+kron(y(Nk-i),eye(N))*C_r(1:N,i) end
In this example kron() and eye() are matlab functions that implement the Kronecker product and the identity matrix respectively. In this example is also important to note that the index k runs over the number of input vectors that have to be processed by the algorithm.
VI. RESULTS
Using equation (6), we have implemented an FIR filter in Matlab and processor code was automatically generated for different levels of parallelism. In figure 1 , we can observe the resulting speedup factors for a 32-taps FIR filter. The speedup is defined as the relation between the time that is required to compute a certain number of input samples with a processor architecture with one data path and the time required to compute the same number of input samples on the parallel processor.
Likewise, using equation (7) we have implemented a pure IIR filter in Matlab and processor code was automatically for different levels of parallelism. In figure 2, we can observe the resulting speedup factors for a 32-taps pure IIR filter. 
Speedup for a 32-T ap Pure IIR Filter

A. Discussion
As we can observe from figures 1 and 2, the achievable speedup factor depends not only on the available processing power but it also depends on the data dependencies characteristics of the algorithm. In fact, due to the direct data dependencies of the pure IIR filter, the speedup factors we obtain for this algorithm are moderate in comparison with the speedup factors we obtain for the FIR filter. Moreover, in [5] we showed that the achievable speedup factor for pure recursive filters is upper bounded and once this upper bound has been reached no more gains on the speedup factor can be obtained even if the level of parallelism is incremented. These results illustrate the fact that in determining the optimal level of SIMD parallelism for a certain application not only power and area constraints have to be taken into account, but the characteristics of the algorithms play a paramount role. This stresses the necessity for a HW/SW design strategy, where this exploration can be supported.
VII. CONCLUSION
In this paper we have presented our approach for the code generation for our family of SIMD DSP architectures. We have presented the mathematical background of our compiler technique and taking an FIR and a recursive filter as examples, we have explained how our compiler generates a sequential instruction list from a set of basic algebraic structures. We believe that is approach offer an interesting way to close the gap between real-time signal processing, compiler technology and processor architecture.
