Abstract
Introduction
The widespread adoption of mobile computing and wireless communication systems has fostered research on new architectures that deal efficiently with communication issues on hardware-constrained platforms such as PDAs, mobile phones and pagers. Some tasks, such as encoding, decoding and data compression, are better implemented in dedicated hardware modules than on standard general-purpose processors. However, the exploding cost of integrated circuit fabrication, together with shorter device lifetimes, makes the design of ASICs (Application Specific Integrated Circuits) a very expensive alternative. The growing capacity of FPGAs (Field Programmable Gate Arrays) and the possibility of reconfiguring them to implement different architectures make them a good solution for this rapidly changing wireless market. Their flexibility opens a wide range of alternatives for implementing algorithms directly in hardware. In this context, it is very important to provide methods and tools to rapidly model and evaluate different architectures that implement a given algorithm.
We propose the use of rewriting-logic to specify and evaluate dynamically reconfigurable systolic architectures. We show how the conceived architectures are suited to the efficient implementation of algebraic operations such as matrix multiplication and the Fast Fourier Transform (FFT).
After the seminal work of Knuth and Bendix on the completion of algebraic equational specifications [17], rewriting has been successfully applied in different areas of computer science as an abstract formalism for assisting the simulation, verification and deduction of complex computational objects and processes. In the context of computer architectures, rewriting theory has been applied as a tool for reasoning about hardware design. To mention only a few approaches in this direction: Kapur has used his well-known Rewrite Rule Laboratory (RRL) for verifying arithmetic circuits [15, 14, 16], and Arvind's group has treated the specification of processors over simple architectures [2, 24, 25], the rewrite-based description and synthesis of simple logical digital circuits [27] and the description of cache protocols over memory systems [26]. We have also contributed to this field by showing how rewriting theory can be applied to the specification of processors over simple architectures (as Arvind's group does) as well as to the purely rewrite-based simulation, verification and analysis of the specified processors [3]. To achieve this we applied rewriting-logic, which extends the pure rewriting paradigm by allowing logical control of rule application through strategies [21, 7]. Important rewriting-logic computational environments are ELAN [9, 7], Maude [21, 8] and Cafe-OBJ [12]. The impact of rewriting-logic as a successful programming paradigm in computer science, as well as the applicability of the related programming environments, is witnessed by [20]. All our experiments were implemented in ELAN because of its great flexibility and its easy manipulation of strategies.
Sections 2 and 3 provide basic concepts and present the rewrite-based specification and simulation of systolic arrays used for implementing simple algebraic operations such as vector and matrix multiplication. Section 4 discusses the use of rewriting-logic for specifying a dynamically reconfigurable system that efficiently implements the FFT, and Section 5 concludes.
Background
We include the minimal needed notions on rewriting and, specifically, on rewriting-logic and systolic arrays. For detailed presentations see [5, 20] and [18, 19] , respectively.
Rewriting theory
A Term Rewriting System (TRS for short) is defined as a triple ⟨R, S, S_0⟩, where S and R are, respectively, a set of terms and a set of rewrite rules of the form l → r if p(l), with l and r terms and p a predicate, and where S_0 is the subset of initial terms of S. The terms l and r are called the left-hand and right-hand sides of the rule, and p its condition.
In the architectural context of [25] , terms and rules represent states and state transitions, respectively.
A term s can be rewritten or reduced to the term t, denoted by s → t, whenever there exists a subterm s' of s that can be transformed according to some rewrite rule into a term s'' such that replacing the occurrence of s' in s with s'' gives t. A term that cannot be rewritten is said to be in normal form. The relation over S given by this rewrite mechanism is called the rewrite relation of R and is denoted by →. Its inverse is denoted by ←, its reflexive-transitive closure by →* and its equivalence closure by ↔*. The important notions of termination and confluence are defined as usual. They correspond to practical computational aspects: the finiteness of processes and their determinism, respectively.
• a TRS is said to be terminating if there are no infinite sequences of the form s_0 → s_1 → ...
• a TRS is said to be confluent if, whenever s →* t_1 and s →* t_2, there exists a term u such that t_1 →* u and t_2 →* u.
Using these notions one can model the operational semantics of algebraic operators and functions. Although in the pure rewriting context rules are applied in a truly nondeterministic manner, in practice it is necessary to control the order in which rules are applied. Rewriting combined with logic, known as rewriting-logic, has been shown to be of practical use in the specification of processors, since it can represent, at the necessary level of detail, the many hardware elements involved in processors.
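The notions of rewrite step and normal form can be made concrete with a minimal Python sketch. It is not ELAN code and is not taken from our specifications: the term encoding (nested tuples), the two Peano-addition rules and the helper names `rewrite_once`, `normal_form` and `num` are assumptions chosen only for illustration.

```python
# Terms are nested tuples: ('add', t1, t2), ('s', t) or the constant '0'.
# A rule is a function that returns the rewritten term, or None if it does not apply.

def rule_add_zero(t):
    # add(0, y) -> y
    if isinstance(t, tuple) and t[0] == 'add' and t[1] == '0':
        return t[2]
    return None

def rule_add_succ(t):
    # add(s(x), y) -> s(add(x, y))
    if (isinstance(t, tuple) and t[0] == 'add'
            and isinstance(t[1], tuple) and t[1][0] == 's'):
        return ('s', ('add', t[1][1], t[2]))
    return None

RULES = [rule_add_zero, rule_add_succ]

def rewrite_once(term, rules):
    """Perform one rewrite step s -> t, trying the root first and then the subterms."""
    for rule in rules:
        result = rule(term)
        if result is not None:
            return result
    if isinstance(term, tuple):
        for i, sub in enumerate(term[1:], start=1):
            result = rewrite_once(sub, rules)
            if result is not None:
                return term[:i] + (result,) + term[i + 1:]
    return None  # no rule applies anywhere: the term is in normal form

def normal_form(term, rules, max_steps=10_000):
    """Rewrite until no rule applies (the step bound guards against non-termination)."""
    for _ in range(max_steps):
        nxt = rewrite_once(term, rules)
        if nxt is None:
            return term
        term = nxt
    raise RuntimeError("step bound exceeded; the system may not terminate")

def num(n):
    """Build the Peano numeral s(s(...s(0)...)) with n successors."""
    return '0' if n == 0 else ('s', num(n - 1))

result = normal_form(('add', num(2), num(1)), RULES)
```

This particular system is terminating and confluent, so the fixed order in which `rewrite_once` tries rules and positions does not affect the result; for systems without these properties the order matters, which is where strategies come in.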
Systolic arrays and reconfigurable systems
A systolic array is a mesh-connected pipe network of DPUs (datapath units), using only nearest neighbor (NN) interconnect. The DPUs operate synchronously, processing streams of data that traverse the network. Systolic arrays provide a large amount of parallelism and are well adapted to a restricted set of computational problems, i.e., those which can be efficiently mapped to a regular network of operators. Figure 1 shows a simple systolic example of a matrix-vector multiplication. The vector elements are stored in the cells and are multiplied by the matrix elements, which are shifted bottom-up. On the first cycle, the first cell (DPU1) computes x_1·a_{11}, while the second and third cells (DPU2 and DPU3) multiply their values by 0. On the second cycle, the first cell computes x_1·a_{21}, while the second cell computes x_1·a_{11} + x_2·a_{12}, where the first term is taken from the first cell and added to the product computed in the second cell. On the third cycle, the third cell produces the first result: y_1 = x_1·a_{11} + x_2·a_{12} + x_3·a_{13}. In the following two cycles, y_2 and y_3 are output by the third cell.
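The cycle-by-cycle behavior just described can be reproduced with a short Python sketch. The function name `systolic_matvec` and the way the skewed matrix stream is indexed are assumptions made for illustration; cycles and indices are counted from 0 in the code, whereas the text counts from 1.

```python
def systolic_matvec(A, x):
    """Simulate a linear systolic array computing y = A x.

    Cell j stores x[j]. Element A[r][j] reaches cell j at cycle r + j
    (a skewed stream); at all other cycles the cell receives 0.
    Each cell adds its product to the partial sum latched by its left neighbor.
    """
    rows, n = len(A), len(x)
    partial = [0] * n            # value latched by each cell after the previous cycle
    results = []
    for cycle in range(rows + n - 1):
        new_partial = [0] * n
        for j in range(n):
            r = cycle - j                              # matrix row arriving at cell j
            a = A[r][j] if 0 <= r < rows else 0        # zeros pad the skewed stream
            left = partial[j - 1] if j > 0 else 0      # sum coming from the left cell
            new_partial[j] = left + x[j] * a
        partial = new_partial
        if cycle >= n - 1:                             # last cell emits one y per cycle
            results.append(partial[n - 1])
    return results

y = systolic_matvec([[1, 2, 3], [4, 5, 6], [7, 8, 9]], [1, 2, 3])
```

For a 3x3 matrix the loop runs for 5 cycles: the first result leaves the last cell on the third cycle and the remaining two on the fourth and fifth, as in the description of Figure 1.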
There are several alternative configurations of functional cells, each one tailored to a particular class of computing problems. However, one of the main criticisms of systolic arrays is their restriction to applications with strictly regular data dependencies, as well as their lack of flexibility. These limitations may be circumvented by using reconfigurable circuits, the most representative of them being the FPGA. Figure 2 shows the internal structure of a RAM-based FPGA. The small boxes represent the logic cells and the larger blocks, with the letter S, are programmable switches. An FPGA can have its behavior redefined in such a way that it can implement completely different digital systems on the same chip. Fine-grain FPGAs allow the user to define a circuit at gate level, working with bit-wide operators. This kind of architecture provides high flexibility, but takes more time to reconfigure than coarse-grain reconfigurable platforms (rDPAs, reconfigurable datapath arrays, i.e., arrays of rDPUs). In the latter, the user does not provide details at gate level but specifies the configuration in terms of word-wide operations, i.e., a functional unit is configured to operate over n-bit data, and the configuration just selects one among a set of available operations. The number of configuration bits in this case is much smaller than in fine-grain FPGAs.
The design of reconfigurable systolic architectures [13, 23] aims to overcome the restrictions of pure systolic circuits while keeping the benefits of a large degree of parallelism. In this approach, the operations performed by each functional unit, as well as their interconnections, may be reconfigured in order to adapt them to different applications. Moreover, it is possible to change the configuration of the circuit at run time, which is called dynamic reconfiguration and further broadens the architectural alternatives. A dynamically reconfigurable system, at a given instant of time t, processes data d(t) using a configuration cfg(t). Instead of referring to an instruction stream and a data stream, as in Flynn's taxonomy, one can describe a reconfigurable system by its configuration stream and its data stream. Optimizing such systems relies on the choice of a reconfigurable hardware structure and a corresponding reconfiguration scheme for a given application under a set of constraints. This is a complex task, since there are no commercial tools available that are well adapted to this kind of problem. Prototyping alternatives in VHDL or even SystemC, as a first approach, may be too cumbersome.
The variety of implementations that arise from the combination of systolic architectures and dynamically reconfigurable computing requires adequate tools for modeling and simulation of design decisions, providing a framework for design space exploration.
Systolic arrays via rewriting-logic
Rewriting-logic based specifications of simple systolic arrays for vector and matrix multiplication are presented. In these systems each component (a DPU as in Figure 1) is called a MAC (Multiplier/Adder). We first explain the modeling of the matrix-vector multiplier presented in Figure 1. The type definition for each MAC is shown in Table 1 and the structure of the systolic array in Figure 3.
Type definition in ELAN has the following syntax (
The rule sole (Table 2) describes the behavior of the processor during one execution cycle: after one step of reduction with this rule, all necessary changes in the specified processor are done. First, d1, d2 and d3, at the top of the DataStream, are removed from the three lists of data and placed into the first ports of the three MACs. Next, the product of the contents of each first port pi1 and the corresponding constant is placed in the first register ri1 of each MAC, and the sum of the first register ri1 and the second port pi2 is placed in the second register ri2 of each MAC, for i = 1, 2 and 3. Finally, the data in the second register ri2 of each MAC is transferred to the second port p(i+1)2 of the next component, for i = 1, 2. All of this is done simultaneously by a single application of the rewriting rule sole. Notice the need for extra zeros with respect to the original proposal in Figure 1. A simple mechanism of reconfiguration consists of changing the constants in each MAC. A computation with the systolic array then consists of two phases: a reconfiguration phase, where the constants are set, and the subsequent processor execution phase with the previously defined rule sole.
Table 3 shows the rule conf created for reconfiguring the processor. It simply changes the contents of the constant part of the MACs (to the vector (1,0,0)). Note that under the pure rewriting paradigm this rule applies infinitely often, since nothing prevents it from being applied again to its own result.
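To illustrate why unrestricted application of conf does not terminate, and how a strategy that applies conf exactly once and then exhausts sole avoids the problem, here is a minimal Python sketch. It is not the ELAN specification: the MAC model is deliberately simplified (one latched partial sum per cell instead of separate ports and registers), and the names `normalise`, `seq`, `skewed_streams` and the dictionary layout of the state are assumptions made for illustration.

```python
def conf(state, constants=(1, 0, 0)):
    """Rule conf: always applicable; overwrites the constants of the MACs."""
    return {**state, "consts": list(constants)}

def sole(state):
    """Rule sole: one execution cycle; not applicable once a data stream is empty."""
    streams = state["streams"]
    if not all(streams):
        return None
    heads = [s[0] for s in streams]
    old = state["partial"]
    new = [(old[j - 1] if j > 0 else 0) + state["consts"][j] * heads[j]
           for j in range(len(heads))]
    return {**state,
            "streams": [s[1:] for s in streams],
            "partial": new,
            "out": state["out"] + [new[-1]]}

def normalise(rule):
    """Strategy: apply `rule` until it is no longer applicable."""
    def strategy(state):
        while True:
            nxt = rule(state)
            if nxt is None:
                return state
            state = nxt
    return strategy

def seq(*strategies):
    """Strategy: apply the given strategies one after the other."""
    def strategy(state):
        for s in strategies:
            state = s(state)
        return state
    return strategy

def skewed_streams(A):
    """Column j of A, delayed by j cycles and padded with the extra zeros."""
    n = len(A)
    return [[0] * j + [row[j] for row in A] + [0] * (n - 1 - j) for j in range(n)]

# normalise(conf) would loop forever, because conf never returns None.
# Applying conf once and then normalising with sole terminates:
withconf = seq(conf, normalise(sole))

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
initial = {"consts": [0, 0, 0], "streams": skewed_streams(A),
           "partial": [0, 0, 0], "out": []}
final = withconf(initial)
```

With the constants set to (1,0,0), the values leaving the last cell after the two fill cycles are the first column of A, i.e., the product of A with the vector (1,0,0).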
Thus, to control its application, we define a logical strategy, called withconf, which executes one step of reduction with the rule conf (the first stage, reconfiguration) followed by a normalization with the rule sole (the second stage, processor execution).
Table 3. Rule conf of reconfiguration
Figure 4 shows the structure of a systolic array for 4x4 matrix multiplication. Its description is given in Table 4. The approach adopted here differs from the previous one in order to reduce the number of variables needed for its description. One solution is to split the cycle by defining independent rewriting rules, to be applied under a reasonable strategy, to simulate the internal process. We define a rule for each of the sixteen components, which propagates the contents of its second and third registers to its North and East connected components, respectively.
[16,p(p1),p(p2),r(r1),r(r2),r(r3),c] m09 m10 m11 m12 m05 m06 m07 m08 m01 m02 m03 m04 (lS1 lS2 lS3 lS4) >
=> < (lW1 lW2 lW3 lW4) m13 m14 m15 [16,p(p1),p(p2),r(p1*c),r(r1+p2),r(p1),c] m09 m10 m11 m12 m05 m06 m07 m08 m01 m02 m03 m04 (lS1 lS2 lS3 lS4) >
end ... end
To complete a whole execution cycle, the sixteen rules should be applied from right to left and from top to bottom, as a consequence of the direction in which data is transferred between the MACs.
All these rules are very similar; one of them is presented in Table 5. Observe that the rules for the South (mac01, mac02, mac03, mac04) and West (mac01, mac05, mac09, mac13) boundary components of the processor load the data (dS and dW) from the head of the corresponding list of the data stream (lS1, lS2, lS3, lS4 and lW1, lW2, lW3, lW4). Also observe that the rules for the MACs on the North (mac13, mac14, mac15, mac16) and East (mac04, mac08, mac12, mac16) boundaries of the processor only transfer data to their East and North neighbor MACs, respectively; except, of course, for mac16. Different orderings of application of these rules are possible to complete a cycle of the processor. Table 6 presents a possible strategy, called onecycle, which defines one such ordering. To complete the simulation of execution with this simple processor, one defines a normalization based on this strategy: normalise(onecycle).
The built-in strategy normalise applies onecycle until a normal form is reached. In this rewriting-logic setting our specification can easily be modified to allow parts of the processor to be interpreted as reconfigurable components. At first glance, one could regard the constants of the 16 MACs as a reconfigurable component. In this way the processor can be adapted to be either a 4-vector by 4x4-matrix multiplier or vice versa, and the 4x4 matrix may be modified to represent, for example, either the identity or the F_4 matrix of the Discrete Fourier Transform (DFT), which is discussed in the next section.
Run time efficient FFT modeling
The FFT is an implementation of the DFT, which is widely used in signal processing. Given an n-array of complex numbers a = (a_0, …, a_{n-1}), its DFT is the n-array F_n × a. The FFT is an O(n ln n) run time implementation of the DFT based on a recursive algorithm proposed by Cooley and Tukey. This algorithm can be implemented in dataflow hardware as presented in classical textbooks on algorithms [10, 6, 1]. The number of data points is a power of 2. The network of nodes is a butterfly circuit. Each node implements a complex multiply-accumulate operation on its inputs: b_j = u_j + z·v_j.
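As a reference point for the hardware discussion, the recursive scheme can be sketched in Python. The sketch assumes the common textbook convention in which the k-th DFT component is the sum over j of a_j·ω^{kj} with ω = e^{2πi/n}; this sign convention, and the function names `dft_naive` and `fft`, are assumptions of the sketch rather than definitions taken from our specification.

```python
import cmath

def dft_naive(a):
    """Direct O(n^2) evaluation of the DFT under the assumed convention."""
    n = len(a)
    w = cmath.exp(2j * cmath.pi / n)
    return [sum(a[j] * w ** (k * j) for j in range(n)) for k in range(n)]

def fft(a):
    """Recursive radix-2 FFT; len(a) must be a power of 2."""
    n = len(a)
    if n == 1:
        return list(a)
    u = fft(a[0::2])   # transform of the even-indexed coefficients
    v = fft(a[1::2])   # transform of the odd-indexed coefficients
    out = [0] * n
    for j in range(n // 2):
        z = cmath.exp(2j * cmath.pi * j / n)   # n-th root of unity raised to j
        out[j] = u[j] + z * v[j]               # butterfly: b_j = u_j + z*v_j
        out[j + n // 2] = u[j] - z * v[j]      # the paired output uses -z
    return out

sample = [1, 2, 3, 4, 0, -1, 2j, 1 - 1j]
spectrum = fft(sample)
```

Each loop iteration corresponds to one butterfly of the circuit: two nodes share the inputs u_j and v_j and differ only in the sign of the multiplier z.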
The architecture of two 8-arrays that we use for computing F_8 is based on these circuits, and its (operational semantics and) correctness rests on the adequate application of dynamic reconfiguration of the operators, constants and data selection registers. Reconfiguration and execution steps run simultaneously, alternating between the two 8-arrays of MACs. The structure of each MAC is presented in Figure 5. We distinguish between reconfigurable (shaded) and fixed components. The former are the data selection registers, Ar1 and Ar2; the operators, Op1 and Op2; and the constant, C1. The latter are the ports and registers: P1, P2, R1 and R2.
The registers, ports and the constant store complex numbers and consist of two components: real and imaginary. The operators can be reconfigured as any operation over complex numbers. In particular, for implementing the FFT we use only addition (+), subtraction (-) and multiplication (×). The two data selection registers, Ar1 and Ar2, indicate, for each of the eight MACs of one of the two 8-arrays, the origin of the data to be loaded into the respective ports, P1 and P2. The options for configuring these address registers are either the input (I) (as input we supply the coefficients of a given polynomial, adequately permuted) or the output (second register R2) of one of the eight nodes of the opposite 8-array of MACs (indexed by 0, 1, ..., 7). In any reconfiguration the constant may be set to an arbitrary complex number. For computing the FFT, we set these constants to the adequate complex roots of unity.
The two 8-array of MACs system
Figure 6 shows the basic idea behind the two 8-array system. The North and South rows are each composed of 8 nodes with the architecture depicted in Figure 5. The node outputs of one row are fed back to the inputs of the other row through a reconfigurable interconnection network (RIN). The RIN can provide any MAC output or an external input to the MAC ports.
The configuration of the data selection registers Ar1 and Ar2 selects from the RIN the specific node inputs in a given iteration of the algorithm. In the first step, one of the 8-arrays receives as input zeros and the coefficients of an input polynomial a_0 + a_1·x + ... + a_7·x^7 in the adequate ordering (bit-reversal permutation), taken from the primary (external) inputs. Then, at each step, the interconnections and the node operations are reconfigured in order to implement the corresponding butterfly slice, alternating from one row to the other. In this way, while the MACs in one row are executing, the others are being reconfigured, which eliminates from the run time analysis the time spent on reconfiguration, except for the initial reconfiguration. The initial reconfiguration parameters are given by the sequence:
The first zero indicates that the North row is being reconfigured while the South row is executing vacuous operations. The other reconfiguration parameters indicate that node 0 receives its inputs from the corresponding external inputs, that its first operator is configured as addition, its constant component as 1, and its second operator as multiplication. The same holds for the remaining seven nodes. After this reconfiguration, the operations in the North row are executed while the system is being reconfigured according to the parameters:
Execution in the North row leaves in the output register (R2) of each node the coefficients a_0, a_4, a_2, a_6, a_1, a_5, a_3 and a_7, respectively. Observe that this second step provides the same input again, but now arranged to be processed in the South row, which is being simultaneously reconfigured according to the above parameters. The first "1" in the above reconfiguration parameters means that the South row is being reconfigured while the North row is executing, as explained.
The other reconfiguration parameters mean that the first and second data selection registers of nodes 0 and 1 are loaded with 0 and 1. Thus, the outputs of nodes 0 and 1 are loaded into the associated ports, and these are added in the first node and subtracted in the second node. All nodes are configured with the constant 1 in this iteration, except for the fourth and eighth, where the constant is the complex number i. The second operator remains multiplication. After this second reconfiguration and the third execution, over the South row (while the North row is being reconfigured), we obtain as respective outputs the values: a_0 + a_4
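The alternation between the two rows can be simulated with the following Python sketch. It is one schedule consistent with the description above, not a transcription of the reconfiguration parameter sequences of our specification: the tuple layout (ar1, ar2, op1, c1, op2), the function names, and the choice to store in each node's constant the root of unity needed by the following butterfly slice are assumptions of the sketch.

```python
import cmath

OPS = {"+": lambda p, q: p + q, "-": lambda p, q: p - q, "*": lambda p, q: p * q}

def bit_reverse(i, bits):
    """Reverse the `bits`-bit binary representation of i."""
    return int(format(i, f"0{bits}b")[::-1], 2)

def twiddle(pos, half, n):
    """Constant that the butterfly slice of half-size `half` expects at node `pos`."""
    if half >= n:                       # no further slice: leave the value unchanged
        return 1
    q = pos % (2 * half)
    return cmath.exp(2j * cmath.pi * (q - half) / (2 * half)) if q >= half else 1

def build_schedule(n):
    """One list of node configurations (ar1, ar2, op1, c1, op2) per step."""
    schedule = [[("I", "I", "+", twiddle(p, 1, n), "*") for p in range(n)]]  # load step
    half = 1
    while half < n:
        step = []
        for p in range(n):
            c1 = twiddle(p, 2 * half, n)          # prepared for the next slice
            if p % (2 * half) < half:
                step.append((p, p + half, "+", c1, "*"))
            else:
                step.append((p - half, p, "-", c1, "*"))
        schedule.append(step)
        half *= 2
    return schedule

def execute(configs, external, other_row):
    """Execute one row: R1 = P1 op1 P2, then R2 = R1 op2 C1, for every node."""
    out = []
    for pos, (ar1, ar2, op1, c1, op2) in enumerate(configs):
        p1 = external[pos][0] if ar1 == "I" else other_row[ar1]
        p2 = external[pos][1] if ar2 == "I" else other_row[ar2]
        r1 = OPS[op1](p1, p2)
        out.append(OPS[op2](r1, c1))
    return out

def fft_two_rows(a):
    """Compute the DFT of a (length a power of 2) by alternating two rows of MACs."""
    n = len(a)
    bits = n.bit_length() - 1
    external = [(a[bit_reverse(p, bits)], 0) for p in range(n)]  # coefficient and zero
    rows = [[0] * n, [0] * n]          # R2 outputs of the North (0) and South (1) rows
    active = 0                         # the North row executes first
    for configs in build_schedule(n):
        rows[active] = execute(configs, external, rows[1 - active])
        active = 1 - active            # the other row was reconfigured meanwhile
    return rows[1 - active]

coefficients = [1, 2, 3, 4, 5, 6, 7, 8]
spectrum = fft_two_rows(coefficients)
```

For n = 8 the schedule has four steps (one load step and three butterfly slices). In the second step nodes 0 and 1 both select outputs 0 and 1 of the opposite row, node 0 adds and node 1 subtracts, and only the fourth and eighth nodes carry the constant i, in agreement with the description above.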
Specification of the two 8-array in ELAN
The key operators of our ELAN specification of this system have the type description given in the Table 7 . Notation "<@ @> : ( num num ) complexUnit;" means that "< >" is a binary operator of type complexUnit with two parameters of type num.
Our system is described as the operator:
< @ @ @ @ > : ( int list[ReconfParameter]
MACsArray MACsArray ) Proc;
whose last two parameters are the two 8-arrays of MACs, of type MACsArray; the first parameter, of type int, identifies the 8-array being reconfigured, and the second parameter is a list of reconfiguration parameters. Each MACsArray consists of eight MACs, where the operator MAC is defined by "[@ # @] : ( fixMAC recMAC )", with fixMAC and recMAC the types of the operators for its fixed and reconfigurable parts, as described in Figure 5.
Each simultaneous execution-reconfiguration step of this system is specified by rewriting rules such as the one presented in Table 8. This rule changes the first (North) 8-array MACsArray1 to MACsArray1Res by applying the EXECUTE strategy:
MACsArray1Res :=(EXECUTE) MACsArray1
while the second (South) 8-array MACsArray2 is being reconfigured according to the head parameter of the list of reconfiguration parameters. The second 8-array finishes this step by loading its ports, according to the address selection registers of its MACs, with the corresponding output registers of the first 8-array. The latter is done by means of the operator propagateRegsValuesFromTo. All operators are defined by rewriting rules.
The execution cycle is split into four rewriting rules (MAC01, MAC23, MAC45, MAC67), one for each pair of MACs. The specification of the rule MAC01, for the first pair of MACs of one 8-array, is presented in Table 9. In this rule the values in the ports of the first two MACs are operated on according to the configuration of the first operator in each MAC (cRegRes1 := () operate(cPort1, cPort2, op1) and cRegRes3 := () operate(cPort3, cPort4, op3)); then these results, which are loaded into the first register of the corresponding MACs, are combined with the configured constants according to the configuration of the second operator (cRegRes2 := () operate(cRegRes1, cConst1, op2) and cRegRes4 := () operate(cRegRes3, cConst2, op4)), and the results are loaded into the second register of each MAC.
The execution over an 8-array of MACs is implemented via the logical strategy EXECUTE => MAC01; MAC23; MAC45; MAC67. In theory a single rule would suffice for the execution; it is split in this way because ELAN restricts the maximum number of different variables that can be used in the description of a rewriting rule.
The reconfiguration over an 8-array (which is applied simultaneously with the previously described execution over the other 8-array) is guided by the rewriting rule in Table 10. The first argument of the operator reconfigure is an 8-array of MACs, whose MACs are reconfigured according to the reconfiguration parameters given by eight arguments of type MAConfig (see Table 7). Each of these arguments includes two values for the address selection registers, two numbers for the reconfigurable constant (real and imaginary parts) and two values for the reconfiguration of the operations.
As input of this system both data and a reconfiguration stream are given. When no reconfiguration is necessary one can use a reconfiguration called continue with vacuous effect over the reconfigurable part of each MAC.
Now we explain how we use logical strategies for simulating the desired execution with the simultaneous dynamic reconfigurations.
The key to a correct simulation of our processor is in fact a very simple logical strategy, which simulates the execution-reconfiguration steps. The execution part corresponds to the use of the strategy EXECUTE and the reconfiguration part to the application of the rewriting rules of reconfiguration (see Table 8). PROCESS basically organizes the application of rules for propagating the input data and the reconfiguration stream, repeating the oneCycle rules (see Table 8) as long as possible and then giving the output (i.e., the contents of register R2 of the MACs belonging to the 8-array in execution during the last cycle).
The use of logical strategies for guiding the application of rules in ELAN allows for a natural separation between the execution and reconfiguration steps in our proposed processors. We believe that this is a clean way to specify and simulate this kind of (dynamically) reconfigurable architecture. By clean we mean realistic with respect to eventual physical implementations of the conceived hardware.
By providing appropriate reconfiguration streams this two 8-array system can be adapted to solve other operations, like matrix multiplication, inverse of the DFT, string matching, etc.
It should be stressed here that one of the main advantages of this rewriting formalism is the direct reduction of the correctness proof of our specification of the FFT to the usual algebraic proof as presented in [6] .
A physical in-place implementation of the FFT
Our system uses two 8-arrays so that execution and reconfiguration steps alternate between them and run simultaneously during each cycle. In this way the reconfiguration time is removed from the run time complexity, which makes our implementation of the FFT as efficient as the usual software implementations. This is possible because operations on complex numbers take longer than a reconfiguration, which hides the reconfiguration overhead. But our system is not space optimal for implementing the FFT. In fact, in a system consisting of a single 8-array of MACs, steps of reconfiguration and execution can also be alternated. In this approach, data processing must be interrupted while reconfiguration takes place. Over this one 8-array system it is thus possible to implement the FFT by alternating reconfigurations with the steps of the FFT computation.
The use of a single array of MACs makes this physical system optimal in its use of space, like the well-known in-place algorithmic solutions of the FFT [4].
Of course, in this one 8-array system we have to take into account, when computing the run time complexity, the time required for reconfiguration. For both proposed systems, the number of reconfiguration and execution steps needed for computing F_8 is four (and in the general case log_2(n)+1).
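The difference between the two systems can be made explicit with a rough timing estimate. The symbols t_conf and t_exec (time of one reconfiguration and of one execution step of a row) and the totals T_two and T_one are introduced here only for illustration, under the assumption stated earlier that an execution step takes at least as long as a reconfiguration.

```latex
% Assumption: t_exec >= t_conf, so every reconfiguration after the first
% is hidden behind the execution step of the other row.
\begin{align*}
T_{\mathrm{two}}(n) &\approx t_{\mathrm{conf}} + (\log_2 n + 1)\, t_{\mathrm{exec}}, \\
T_{\mathrm{one}}(n) &\approx (\log_2 n + 1)\,\bigl(t_{\mathrm{conf}} + t_{\mathrm{exec}}\bigr).
\end{align*}
% For n = 8:
%   T_two = t_conf + 4 t_exec,   T_one = 4 t_conf + 4 t_exec.
```

Under this estimate the single array saves half of the MACs at the price of (log_2 n) additional visible reconfigurations, i.e., three of them for F_8.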
The one 8-array architecture was modeled and simulated in ELAN using a similar approach. The implementations are available on our web site: www.mat.unb.br/~ayala/TCgroup.
Although our specifications were proved correct, we have also verified their functionality, even for complex polynomials, by comparing our outputs with those given by the computer algebra system Maple.
Conclusions
The examples in the paper describe reconfiguration using rewriting-logic strategies. Representing the reconfiguration in this way, outside of the rewrite rules, may seem unnecessary: one can argue that it could be expressed as rules using conditions on appropriate state variables (functional approaches for describing digital circuits are nothing new [11]). But in our rewriting-logic based setting, we showed how one can naturally profit from the separation between rewriting and logical strategies to simplify the purely rewrite-based specification, experimentation, simulation (and even verification [3]) of reconfigurable systems.
With rewriting-logic, even sophisticated dynamic reconfiguration appears as a very natural mechanism to be simulated via logical strategies.
As digital systems get more and more complex, modeling the various architectural trade-offs in the context of reconfigurable systems may benefit from the high abstraction level provided by rewriting-logic environments. Our experiments with ELAN targeted reconfigurable systolic arrays and their use for the efficient implementation of algebraic operations. For the implementation of complex operators such as the FFT, we have conceived physical systems that are run time efficient (O(n ln n)) as well as space efficient (in place).
Hardware description languages like VHDL, Verilog, and SystemC, do not provide the degree of abstraction and flexibility found in rewriting(-logic) systems. In fact, they do not compete in this field, since the detailed hardware design still must pass through a hardware description language (VHDL is the "assembly language" in this context). We do not need their architectural and circuit details for mapping an application onto a rDPA, nor design space exploration to optimize, for instance, KressArray platforms [22] .
