This paper presents a new, retargetable method to identify patterns of instructions with direct support in coarse-grained processing elements (PEs). The method uses a three-address code SSA (static single assignment) representation of the kernel being mapped, and Rewriting Logic for template matching and algebraic optimizations. The approach identifies sets of SSA instructions that can be mapped to the different PE complexities available in coarse-grained reconfigurable computing architectures. As a proof of concept, we include results on the coverage of template instructions for a number of benchmark kernels.
Introduction
The use of VLIW-based templates to accelerate loop-intensive behavior has recently been the focus of renewed research efforts [1]. VLIW-based approaches rely on a one-dimensional (1D) array of PEs (typically ALUs, multipliers, load/store units, etc.) directly used to execute typical single- and two-operand arithmetic and logical operations. They use a register file to store data loaded from main memory and/or computed by the PEs. More advanced VLIW templates use clusters of simple VLIW architectures (PEs and register file) in a distributed scheme. Coarse-grained reconfigurable architectures [2], on the other hand, are not tied to a simple template such as the VLIW one. They can use different routing topologies, 1D or 2D matrices of PEs, PEs with support for more complex operations, heterogeneous or homogeneous PEs, distributed memories, etc. In this respect, a VLIW template can be thought of as a specific case of coarse-grained reconfigurable array architectures.
Although also suitable for identifying templates of instructions for generating ISEs (Instruction Set Extensions), the work presented in this paper targets the exploration of different coarse-grained architecture templates based on 1D or 2D arrays of PEs. One interesting design exploration is to evaluate the impact on performance when different PEs supporting complex operations are used. However, such exploration needs a retargetable compiler able to identify instructions that resemble the templates directly supported by the PEs. For such exploration, this paper shows an approach using an SSA (Static Single Assignment) form [3], consisting of a three-address representation generated by the Nau compiler [4] from the Java bytecodes of a given method [5]. Figure 1 presents the framework under development. The environment uses an extended SSA form and a Term Rewriting System (TRS) to identify patterns of instructions.
Note, however, that the integration of a simulator engine and of techniques to estimate performance results with different coarse-grained architectures is the current focus of our work and is outside the scope of this paper.
- Additionally, the approach is able to consider optimizations (e.g., expression tree transformations and algebraic simplifications) to achieve better template matching results;
- To the best of our knowledge, this is the first time Rewriting Logic is used to accomplish the referred goals;
- Experimental results to validate the concept are presented for different instruction templates applied to a number of kernels from the image and signal processing domains.
This paper is structured as follows. The next section introduces coarse-grained reconfigurable architectures. Section 3 explains concepts about term rewriting and rewriting logic. In Section 4 the intermediate representation is presented. The proposed methodology is explained in Section 5, and in Section 6 experimental results are presented. In Section 7, the related work is introduced and discussed. Finally, Section 8 draws some conclusions.
Coarse-Grained Array Architectures
Coarse-grained array architectures [2] consist of a number of PEs interconnected by certain routing topologies. Various architectures have been proposed with different routing topologies and/or different types of functional units (FUs) that can be implemented by each PE (e.g., Morphosys [6], ADRES [7], PACT XPP [8]). Regarding PE functionality, the simplest architectures support multiplier and ALU operations on each PE. Each PE usually has two inputs and one output. In this case, simple one- or two-operand arithmetic operations can be directly mapped to each PE. Other architectures use more complex PEs. As an example, the architecture proposed in [9] uses a coarse-grain component for each PE. In this kind of architecture, a large number of groups of operations can be implemented using each PE. For instance, we may program a PE to execute A×B+C×D, A×B-C×D, A×B+D, A+B+C+D, A+B+C, etc. Thus, the patterns of operations that can be mapped to a single PE largely depend on the target architecture. Another example of PE complexity is support for implementing a counter in a single PE of the architecture, as is the case in the XPP [8].
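To make this dependence on the target architecture concrete, the following minimal sketch describes two hypothetical PEs by the expression shapes they accept and checks whether a whole expression fits in a single PE. The tuple encoding, the helper names `fits` and `supported`, and the two PE definitions are assumptions made for illustration; they do not describe any of the architectures cited above.

```python
# Expressions and templates are nested tuples: (operator, left, right).
# In a template, a string such as "A" is an input port of the PE.
# In an expression, a string is a variable name.

def fits(template, expr):
    """Return True if expr has exactly the operator structure of template."""
    if isinstance(template, str):      # input port: only a plain operand fits
        return isinstance(expr, str)
    if isinstance(expr, str):          # template needs an operation here
        return False
    op_t, left_t, right_t = template
    op_e, left_e, right_e = expr
    return op_t == op_e and fits(left_t, left_e) and fits(right_t, right_e)

# Hypothetical PE capabilities (assumed for illustration only).
SIMPLE_PE = [("add", "A", "B"), ("mul", "A", "B")]
COMPLEX_PE = SIMPLE_PE + [
    ("add", ("mul", "A", "B"), ("mul", "C", "D")),   # A*B + C*D
    ("sub", ("mul", "A", "B"), ("mul", "C", "D")),   # A*B - C*D
    ("add", ("mul", "A", "B"), "D"),                 # A*B + D
]

def supported(pe_templates, expr):
    """Return True if at least one template of the PE accepts expr."""
    return any(fits(t, expr) for t in pe_templates)

expr = ("add", ("mul", "x", "y"), ("mul", "z", "w"))
print(supported(SIMPLE_PE, expr), supported(COMPLEX_PE, expr))
```

Under these assumptions the expression fits in one complex PE, while the simple PE can only host its three operations one at a time. Note that this purely structural check does not recognize `d + x*y` as an instance of A×B+D, since operand order is compared literally.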
To explore this large design space, we need a strategy able to map an input program representation to the target architecture based on the specification of the templates supported by each PE. The next section presents the foundations of the strategy proposed in this paper.
Term Rewriting and Rewriting Logic
Term Rewriting [10] is the formal mathematical framework for the reduction of expressions using matching and substitution of terms. Term rewriting is applied in the form of rewriting rules that define how the term is transformed.
Rewriting rules are of the form:
$$ l \rightarrow r \quad \text{if } c $$
meaning that a sub-term that matches the left-hand side $l$ of the rule will be replaced by the right-hand side $r$ when the condition "c" holds. These operational semantics are the same as those involved in functional environments and have been promoted in functional programming languages since the well-known McCarthy LISP of the 1950s. Rewriting Logic is the result of using logic strategies to control how and when the term-rewriting rules are applied. Term Rewriting Systems can be efficiently implemented using term rewriting computational environments, two of the most popular being ELAN [11] and Maude [12].
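As an illustration of conditional rewriting, the sketch below implements a very small rewriting engine in Python (not in ELAN or Maude). Each rule is a triple of a left-hand-side pattern, a condition, and a function that builds the right-hand side. The term encoding, the `?x` variable convention, and the three example rules are assumptions chosen for brevity.

```python
# Terms are nested tuples (op, arg1, arg2); leaves are variable names or
# integer constants. Pattern variables are strings starting with "?".

def match(pattern, term, env):
    """Return env extended with bindings if term matches pattern, else None."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in env:
            return env if env[pattern] == term else None
        return {**env, pattern: term}
    if isinstance(pattern, tuple):
        if not isinstance(term, tuple) or len(term) != len(pattern):
            return None
        for p, t in zip(pattern, term):
            env = match(p, t, env)
            if env is None:
                return None
        return env
    return env if pattern == term else None

def is_power_of_two(env):
    c = env["?c"]
    return isinstance(c, int) and c > 1 and c & (c - 1) == 0

# Each rule: (left-hand side, condition on bindings, right-hand side builder).
RULES = [
    (("mul", "?x", 1), lambda e: True, lambda e: e["?x"]),   # x * 1 -> x
    (("add", "?x", 0), lambda e: True, lambda e: e["?x"]),   # x + 0 -> x
    # x * c -> x << log2(c)   if c is a power of two
    (("mul", "?x", "?c"), is_power_of_two,
     lambda e: ("shl", e["?x"], e["?c"].bit_length() - 1)),
]

def normalize(term, rules):
    """Rewrite sub-terms first, then the term itself, until no rule applies."""
    if isinstance(term, tuple):
        term = (term[0],) + tuple(normalize(t, rules) for t in term[1:])
    for lhs, cond, rhs in rules:
        env = match(lhs, term, {})
        if env is not None and cond(env):
            return normalize(rhs(env), rules)
    return term

print(normalize(("add", ("mul", "a", 8), ("mul", "b", 1)), RULES))
```

The third rule only fires when its condition holds, so `a * 8` becomes a shift by 3 while a multiplication by a constant such as 6 would be left untouched.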
Intermediate Representation
The SSA form [3] in three-address code format (see Figure 2 and Figure 3 for an example) is used as the starting point of our approach. As can be seen in Figure 3 [...]. This representation is generated with the Nau compiler [4] and has been selected as the input intermediate representation [...] (Figure 3(b)). This way, the initial specification required for generating the code to program the PEs of the target architecture, and for the simulation step, is maintained. An important additional step deals with the handling of redundant instructions, as explained in the next section.
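For readers less familiar with this kind of representation, a three-address SSA fragment can be modeled roughly as follows. This is only a sketch: the `Instr` record, the variable names, and the helper functions are hypothetical and do not reproduce the actual format emitted by the compiler.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    dest: str      # variable defined by this instruction
    op: str        # opcode, e.g. "IMUL", "IADD", "ASSIGN"
    args: tuple    # operand names (str) or integer constants

def is_ssa(instrs):
    """Check the single-assignment property: each variable is defined once."""
    dests = [ins.dest for ins in instrs]
    return len(dests) == len(set(dests))

def use_counts(instrs):
    """Count how many times each variable is read as an operand."""
    counts = {}
    for ins in instrs:
        for arg in ins.args:
            if isinstance(arg, str):
                counts[arg] = counts.get(arg, 0) + 1
    return counts

# y = a * b + c written as three-address SSA (names are hypothetical).
kernel = [
    Instr("t.0", "IMUL", ("a.0", "b.0")),
    Instr("y.0", "IADD", ("t.0", "c.0")),
]

print(is_ssa(kernel), use_counts(kernel))
```

Because every variable has a single definition, the number of reads of a variable such as `t.0` can be obtained with one pass over the instructions.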
Methodology
The proposed methodology is supported by the steps illustrated in Figure 4. The input of the methodology is the SSA representation of a function or procedure. The Rewriting Logic rules and strategies are used to group sets of SSA instructions that can be directly mapped to the PEs of the target architecture. The capabilities of the PEs are defined as a set of templates expressed as Term Rewriting rules. It is important to note that a PE may not have outputs for all its internal intermediate blocks; it may therefore be necessary to duplicate one or more of the intermediate instructions when their results are used in other nodes of the expression tree. The current version maintains the original instructions and adds copies of the grouped instructions in the TRS step. Replicated instructions that are not required are automatically removed in the following step (see Figure 4). The Maude System [12] with the strategy language extensions [13] is used to implement the proposed methodology. Rewriting Logic is used to exploit the mapping of high-level procedural programming languages to coarse-grained reconfigurable arrays using the following optimizations:
- Mapping of groups of instructions to the expressions directly supported by the PEs of the target architecture under evaluation (e.g., merging operation trees into multiple-input operators, as is the case when merging a MUL-ADD tree into a MAC instruction);
- Expression tree transformations, e.g., tree height reduction to decrease the critical path delay;
- Algebraic optimizations applying properties such as commutativity, associativity, etc. These transformations may increase the potential for template matching;
- Identification of counters related to, e.g., loop iteration control;
- Operator strength reduction (e.g., mapping of multiplications by constants to shifts and additions/subtractions).
Other optimizations that are planned to be included are: decomposition of instructions into subparts (e.g., decomposing a 32-bit operation into 16-bit operations); merging of operations to be implemented as SIMD (Single Instruction Multiple Data) operations; and merging of operations working on packed data.
The grammar of the SSA intermediate form was defined in the TRS as two modules: one for the basic grammar structure and a second one for the instruction set. The types defined for the basic grammar are shown in Table I. The syntax of the grammar is described as operators in the TRS. The opcodes and the syntax for the PEs were defined in the same way, but adding a section with the functional behavior of the PE. [...] (Figure 5) was written as a set of rewriting rules. A list of some of the implemented rules is presented in Table II. Table III shows some rule definitions related to the template PE3x1BA.
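The first optimization in the list, together with the removal of replicated instructions that are no longer required, can be sketched on a straight-line SSA fragment as follows. This is a simplified Python illustration under stated assumptions, not the Maude implementation: instructions are plain `(dest, opcode, args)` tuples, the `ISUB` opcode and all variable names are hypothetical, PHI instructions are ignored, and the cleanup is an ordinary backward liveness pass.

```python
def group_mac(instrs):
    """Replace each IADD fed by an IMUL with a MAC; the IMUL is kept for now."""
    defs = {dest: (op, args) for dest, op, args in instrs}
    out = []
    for dest, op, args in instrs:
        if op == "IADD":
            for k in (0, 1):
                producer = defs.get(args[k], (None, None))
                if producer[0] == "IMUL":
                    a, b = producer[1]
                    # MAC computes a * b + c inside a single PE.
                    out.append((dest, "MAC", (a, b, args[1 - k])))
                    break
            else:
                out.append((dest, op, args))   # no IMUL operand: keep the IADD
        else:
            out.append((dest, op, args))
    return out

def remove_unused(instrs, live_out):
    """Drop instructions whose result is neither read later nor a kernel output."""
    needed = set(live_out)
    kept = []
    for dest, op, args in reversed(instrs):
        if dest in needed:
            kept.append((dest, op, args))
            needed.update(a for a in args if isinstance(a, str))
    kept.reverse()
    return kept

# The product t.0 is also read outside the MAC, so the IMUL must survive.
shared = [
    ("t.0", "IMUL", ("a.0", "b.0")),
    ("y.0", "IADD", ("t.0", "c.0")),
    ("z.0", "ISUB", ("t.0", "d.0")),
]
# The product t.0 is read only by the addition, so the IMUL becomes redundant.
private = shared[:2]

print(remove_unused(group_mac(shared), {"y.0", "z.0"}))
print(remove_unused(group_mac(private), {"y.0"}))
```

In the first fragment the original multiplication is preserved because its result is needed by another instruction; in the second, only the MAC remains after cleanup.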
The application of the term rewriting rules is controlled by logic strategies. The simplest strategy is to select one set of rules and apply them until the term is in a normal form (i.e., the term cannot be further reduced by the selected rules). A slightly more complex strategy is to normalize the term by using a sequence of different rules. These two types of strategies were used to analyze the maximum coverage of each template for each benchmark. A list of the implemented strategies and their sequence of rules is given in Table IV.
As mentioned above, in order to preserve functionality, the TRS step includes instructions that can be redundant. Consider the two examples shown in Figure 6. In case A, instruction 2 must be preserved, since variable "3.0" is used by instruction 5 and the PE3x1 used has only one output and no way to output both variables "3.0" and "5.0" (in this case, variable "3.0" has only the scope of the PE). Case B uses PE3x2, which can output both variables, and thus instruction 2 is removed from the final SSA form. This final representation preserves the initial behavior, adds information about the template matching as annotations, and supports templates with instructions across basic block boundaries (e.g., counters for loop control).
Experimental Results
Table V shows the benchmarks used in our experiments. They resemble typical computationally intensive signal and image processing tasks using integer and fixed-point arithmetic. Their SSA complexity ranges from 13 to 210 instructions (an average of about 67 instructions). Figure 7 shows the average percentage of each SSA instruction in those benchmarks. Instructions of type PHI# represent phi SSA form instructions [3], where the integer represents the number of inputs. We can see that integer addition (IADD) is the most used operation (about 22%). The next most frequent are IMUL (integer multiplication) and ASSIGN (e.g., assignment of a constant to a variable).
Next we show the results of applying our approach to the examples, considering the templates illustrated in Figure 5. Figure 8 shows the usage percentage of each template for each benchmark. As can be seen from the results, the template PE3x1BB is the one with the highest coverage in the benchmarks used. A 13.7% average usage has been identified for this type of template. The average coverage of PE4x1 and PE4x2 is low (2.3% and 0.2%, respectively), but they are over 8% and 10% for the fdct example.
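How many template instances are found can depend on the strategy used to apply the rules. The following sketch, written in Python rather than in the Maude strategy language, contrasts the two strategy types described in the methodology: normalizing with a single rule set, and normalizing with a sequence of rule sets. The term encoding, the two example rules, and the `mac` template are assumptions for illustration and are not the rules or strategies listed in the tables.

```python
def rewrite_once(term, rules):
    """Apply the first applicable rule at the outermost position, if any."""
    for rule in rules:
        result = rule(term)
        if result is not None:
            return result, True
    if isinstance(term, tuple):
        for i in range(1, len(term)):
            new_arg, changed = rewrite_once(term[i], rules)
            if changed:
                return term[:i] + (new_arg,) + term[i + 1:], True
    return term, False

def normalize(term, rules):
    """Strategy type 1: apply one rule set until a normal form is reached."""
    changed = True
    while changed:
        term, changed = rewrite_once(term, rules)
    return term

def sequence(term, rule_sets):
    """Strategy type 2: normalize with each rule set in turn."""
    for rules in rule_sets:
        term = normalize(term, rules)
    return term

def is_mul(t):
    return isinstance(t, tuple) and t[0] == "mul"

def commute_add(term):
    # x + (a * b) -> (a * b) + x   if x is not itself a multiplication
    if (isinstance(term, tuple) and term[0] == "add"
            and is_mul(term[2]) and not is_mul(term[1])):
        return ("add", term[2], term[1])
    return None

def match_mac(term):
    # (a * b) + c -> mac(a, b, c)
    if isinstance(term, tuple) and term[0] == "add" and is_mul(term[1]):
        _, a, b = term[1]
        return ("mac", a, b, term[2])
    return None

ALGEBRA = [commute_add]     # algebraic rule set (assumed)
TEMPLATE = [match_mac]      # template rule set (assumed)

expr = ("add", "c", ("mul", "a", "b"))
print(normalize(expr, TEMPLATE))              # template rules alone
print(sequence(expr, [ALGEBRA, TEMPLATE]))    # algebra first, then template
```

With the template rules alone the expression is already in normal form and no `mac` is found; applying the commutativity rule first exposes the match.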
As an example of grouping of SSA instructions across basic blocks, the matching of counter-type templates has resulted in an average coverage of 4.5% for the benchmarks being used (a maximum of 11.8% has been achieved for the smooth example). These results were expected, since the counters are directly related to loops in the code.
Related Work
[...] [14]. However, the compiler relies on a subsequent mapping step to bind operations to the PEs of the architecture. The notion of a retargetable compiler has been used to a limited extent when targeting coarse-grained arrays. Such tools have usually been specific to certain architectures, with some variations that permit targeting a set of architectures that preserve common features. Examples are the DRESC compiler [15] and the KressArray Xplorer [16], which provides a dataflow compiler and a complete system for hardware design space exploration (based on KressArray features).
There has been renewed interest in VLIW-type architectures [1]. Architectures with different PE complexities are being studied, and algorithms to identify the sets of operations grouped to each PE are being proposed [17] [19]. Our work focuses more on the second approach, while being able to retarget different architectures and to exploit some of the PEs' characteristics. Term Rewriting Systems and Rewriting Logic have recently been used in a number of applications, especially in the context of prototyping algebraic operations in reconfigurable systems [20], verification of arithmetic circuits [21], and hardware synthesis [22].
Conclusions
Ongoing work intends to extend the approach by exploring the strategies that must be applied to reach better decisions on template coverage. A high-level model, able to capture the main characteristics of the target architectures being exploited, is under development in order to estimate the latency to execute each kernel and to serve as a figure of merit for design decisions.
