Abstract
Introduction
Combining programmability and efficiency, custom instruction processors are emerging as basic building blocks in the design of complex systems-on-chip. Typically, a base processor is extended with custom functional units that implement application-specific instructions. A dedicated link between custom functional units and the base processor provides an efficient communication interface. Re-using a pre-verified, pre-optimized base processor reduces design complexity and time to market. Commercial examples include Tensilica Xtensa, ARC 700, Altera Nios II, MIPS Pro Series, Xilinx MicroBlaze, and Stretch S6000.
Techniques for the automated synthesis of custom instructions from high level application descriptions have received considerable attention in recent years. The typical approach limits the maximum number of input and output operands that custom instructions can have to the number of available register file ports [6, 9, 16]. Although these constraints can be prohibitive on some architectures, most existing customizable processors (such as Tensilica Xtensa) allow custom instructions to have more input and output operands than the available register file ports through custom state registers that can temporarily hold some of the operands. In fact, recent work shows that input/output constraints deteriorate the solution quality on architectures where there is no explicit limit on the number of custom instruction operands [4, 12, 13, 15]. Thus, there is a need for new algorithms that can efficiently explore custom instruction candidates within application DFGs without imposing a limit on the number of input and output operands.
In this work, we develop an efficient subgraph enumeration approach for the automated identification of custom instructions. Similar to the work of Pothineni [12] and the work of Verma [15] , we enumerate only convex subgraphs of application DFGs that are maximal, imposing no constraints on the number of input and output operands for custom instructions. Our main contributions in this work are:
1. a tight upper bound on the number of maximal convex subgraphs within a given DFG (Sections 3 and 4);
2. a novel maximal convex subgraph enumeration algorithm for custom instruction synthesis (Sections 5 and 6);
3. a demonstration of the scalability of our algorithm on a set of benchmarks, on which the identified custom instructions can achieve an order of magnitude speed-up with respect to a single issue base processor (Section 7).
Related Work
Most of the early work and some of the recent work [8, 14] in automated instruction-set customization relied on heuristic clustering of related DFG nodes. Gradually, the attention shifted towards closer to optimal solutions, such as subgraph enumeration [6, 7, 9, 16] and integer linear programming (ILP) [5, 11]. All of these approaches assumed constraints on the number of input and output operands for custom instructions. In particular, subgraph enumeration based approaches explicitly make use of the input/output constraints to prune the search space and reduce the exponential computational complexity. It was recently shown that ILP based techniques [4] scale well with the relaxation of the input/output constraints. However, subgraph enumeration based approaches become intractable, as the computational complexity grows exponentially with the relaxation of the input/output constraints. In fact, recent work [7] showed that the worst case time complexity of enumerating subgraphs having $N_{in}$ input and $N_{out}$ output operands in a DFG with $N$ nodes is $O(N^{N_{in}+N_{out}+1})$.
Pothineni et al. [12] targeted the maximal convex subgraph enumeration problem. Given a DFG, Pothineni et al. first define an incompatibility graph, where the edges represent pairwise incompatibilities between DFG nodes. Pothineni et al. define the ancestors and the descendants of an invalid node as incompatible. A node clustering step identifies groupwise incompatibilities and reduces the size of the incompatibility graph. The incompatibility graph representation allowed Pothineni et al. to formulate the maximal convex subgraph enumeration problem as a maximal independent set enumeration problem. Pothineni et al. indicate that the complexity of enumeration is $O(2^{N_C})$, where $N_C$ represents the number of nodes in the incompatibility graph.
Verma et al. [15] used maximal clique enumeration instead of maximal independent set enumeration. However, the two problems can be directly transformed into each other (see for example, Garey and Johnson [10, p.54] ). Therefore, the approach of Verma et al. [15] and the approach of Pothineni et al. [12] are essentially the same.
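The equivalence mentioned above rests on the complement graph: a set of nodes is independent in a graph exactly when it is a clique in the complement. The following is a minimal sketch of that transformation on a small hypothetical graph; the function names and the example graph are illustrative assumptions, not taken from [12] or [15].

```python
from itertools import combinations


def complement(nodes, edges):
    """Return the edge set of the complement of an undirected graph."""
    edge_set = {frozenset(e) for e in edges}
    all_pairs = {frozenset(p) for p in combinations(nodes, 2)}
    return all_pairs - edge_set


def is_independent_set(subset, edges):
    """True if no two nodes of `subset` are joined by an edge."""
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset(p) not in edge_set for p in combinations(subset, 2))


def is_clique(subset, edges):
    """True if every two nodes of `subset` are joined by an edge."""
    edge_set = {frozenset(e) for e in edges}
    return all(frozenset(p) in edge_set for p in combinations(subset, 2))


# Hypothetical incompatibility graph: a path a - b - c - d.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d")]
co_edges = complement(nodes, edges)

# Every subset is independent in the graph iff it is a clique in the complement.
for size in range(len(nodes) + 1):
    for subset in combinations(nodes, size):
        assert is_independent_set(subset, edges) == is_clique(subset, co_edges)
```

Because the correspondence holds for every subset, it also holds for the maximal ones, which is why the two enumeration problems are interchangeable.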
Problem Formulation
We assume that the source program is converted into an intermediate representation (IR), where every statement in the IR is a branch, or an assignment with at most two source operands and one destination operand. We represent an application basic block as a DFG $G(V_b, E_b)$, where the nodes $V_b$ represent statements within the basic block, and the edges $E_b$ represent flow dependencies between nodes.
The subset $V_b^f \subseteq V_b$ represents forbidden statements in G that cannot be included in custom instructions, either because of the limitations of the custom processor architecture, or because of the limitations of the custom datapath, or by the choice of the designer.
A custom instruction candidate is a subgraph S of G induced by a set of nodes $V_s \subseteq V_b \setminus V_b^f$.
A subgraph S is convex if there exists no path in G from a node $u \in V_s$ to another node $w \in V_s$ which involves a node $v \notin V_s$. The convexity constraint is imposed on the subgraphs to ensure that no cyclic dependency is introduced in G and that a feasible schedule can be achieved for the instruction stream. We associate with every graph node $v_i \in V_b$ a binary variable $x_i$ that represents whether the node is included in the subgraph ($x_i = 1$) or not ($x_i = 0$).
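The convexity definition above can be checked directly with two reachability passes. The sketch below is an illustration only: it assumes the DFG is stored as a dictionary mapping each node to its successors, and the names `reachable` and `is_convex` are hypothetical.

```python
def reachable(succ, sources):
    """Return all nodes reachable from `sources` by following one or more edges."""
    seen, stack = set(), list(sources)
    while stack:
        u = stack.pop()
        for v in succ.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def is_convex(succ, subset):
    """True if no path between two nodes of `subset` passes through an outside node."""
    subset = set(subset)
    # Outside nodes that are descendants of some node in the subset.
    outside = reachable(succ, subset) - subset
    # Convexity is violated iff one of them leads back into the subset.
    return not (reachable(succ, outside) & subset)


# Hypothetical DFG with edges a -> b, b -> c, and a -> c.
dfg = {"a": ["b", "c"], "b": ["c"], "c": []}

convex_ab = is_convex(dfg, {"a", "b"})   # c is only a sink, so {a, b} is convex
convex_ac = is_convex(dfg, {"a", "c"})   # the path a -> b -> c leaves and re-enters {a, c}
```

In this example `{a, c}` is rejected because collapsing it into a single instruction would make that instruction both a producer for and a consumer of `b`.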
In this work, we assume a generic and simple optimization model for custom instruction identification. We represent our problem using the following indices:
We associate with every graph node $v_i \in V_b$ a software latency $s_i \in \mathbb{Z}^+$, which gives the time in clock cycles that it takes to execute $v_i$ on the pipeline of the base processor. The objective of optimization is to identify the subgraph S with the maximum accumulated software latency:

$$\max \sum_{v_i \in V_b} s_i x_i \qquad (1)$$
Theorem 1. A maximal subgraph that satisfies Equation (2) also satisfies Equation (3).
Proof. Assume that Equation (2) holds for subgraph S, i.e., no node in $V_b^f$ has both an ancestor and a descendant in $V_s$. Assume also that S is maximal, i.e., no additional node can be included in $V_s$ without violating Equation (2). We are going to show that S satisfies Equation (3) as well.
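Suppose that a node $v_i \in V_b \setminus V_b^f$, $v_i \notin V_s$, violates Equation (3), i.e., $v_i$ has both an ancestor and a descendant in $V_s$. We are going to show that including $v_i$ in $V_s$ does not violate Equation (2).
In a convex solution, there exist three possible choices for each forbidden node: it has an ancestor but no descendant in $V_s$, a descendant but no ancestor in $V_s$, or neither. Including $v_i$ does not change which of these holds: any forbidden node that has $v_i$ as an ancestor also has the ancestor of $v_i$ in $V_s$ as an ancestor, and any forbidden node that has $v_i$ as a descendant also has the descendant of $v_i$ in $V_s$ as a descendant. Hence, given a node $v_i$ that violates Equation (3), we can safely include it in $V_s$ without violating Equation (2). However, this contradicts the maximality of S. Therefore, a node that violates Equation (3) cannot exist in the maximal S satisfying Equation (2).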
Corollary 1. The maximal subgraph S that satisfies Equation (2) is a maximal convex subgraph.
We note that there exist only three valid $a_j$, $d_j$ combinations for each $j \in I_1$. Given a valid $a_j$, $d_j$ combination for $j \in I_1$, the associated maximal convex subgraph can be found as follows:

For each $j \in I_1$, it is sufficient to evaluate two possible choices (i.e., $a_j = 1, d_j = 0$ or $a_j = 0, d_j = 1$). Each choice is associated with a single maximal solution S, which can be inferred directly using Equation (4). Table 1 shows the solutions associated with each of these four choices.
Figure 3 shows the incompatibility graph generated by Pothineni's algorithm [12] for the DFG of Figure 2. The incompatibility graph contains seven nodes. According to Pothineni's work, the worst case complexity of maximal convex subgraph enumeration for this graph is $2^7$. On the other hand, we have shown that it is possible to enumerate all maximal convex subgraphs in the DFG in only $2^2$ steps.
A Novel Enumeration Algorithm
We have shown in Section 4 that there exists an upper bound of $2^{|V_b^f|}$ on the number of maximal convex subgraphs given a graph with $|V_b^f|$ forbidden nodes. Therefore, the time complexity of maximal convex subgraph enumeration algorithms should not have an exponential factor higher than $2^{|V_b^f|}$. In this section, we describe a novel enumeration algorithm that further reduces the execution time.
Similar to the work of Pothineni et al. [12] and the work of Verma et al. [15], we first apply a node clustering step that reduces the size of the DFG and the number of forbidden nodes. In particular, if ... In order to demonstrate further ways of reducing the complexity, we construct a new graph $G'$ from $G$, where the nodes of $G'$ represent the forbidden nodes of $G$. We introduce a directed edge between two forbidden nodes ... (where $a_i = 1 - d_i$ always) is much smaller than $2^{|V_b^f|}$. We exploit this property in order to design a simple and efficient algorithm for maximal convex subgraph enumeration.
Figure 5 shows the pseudo-code of our algorithm. We first apply a node clustering step on $G$. Next, we derive $G'$ from the clustered graph. After that, we order the nodes of $G'$ topologically, such that if there is a path from $v_i$ to $v_j$ in $G'$, $v_i$ is associated with a lower index value than $v_j$. Our algorithm generates combinations of $d_i$ values. We note that each combination, in turn, is associated with a maximal convex subgraph that can be derived using Equation (4).
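The recursive search of Figure 5 can be sketched in runnable form as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph of forbidden nodes is assumed to be given as a list `descendants`, where `descendants[i]` holds the indices of all descendants of node `i`, nodes are assumed to be indexed in topological order, and the two top-level calls (one per choice for the first node) are an assumption about how the search is started. The names `enumerate_combinations` and `newly_disabled` are hypothetical.

```python
def enumerate_combinations(descendants):
    """Enumerate the combinations of d_i values visited by the recursive search.

    descendants[i] is the set of indices of all descendants of forbidden node i
    in the graph of forbidden nodes (assumed representation).
    """
    n = len(descendants)
    current = [0] * n          # current combination of d_i values
    disabled = [False] * n     # a disabled node may only take the choice 0
    results = []

    def search(index, choice):
        current[index] = choice
        if index == n - 1:
            results.append(tuple(current))
            return
        newly_disabled = []
        if choice == 0:
            # Choosing 0 for a node forces 0 on all of its descendants.
            for j in descendants[index]:
                if not disabled[j]:
                    disabled[j] = True
                    newly_disabled.append(j)
        search(index + 1, 0)
        if not disabled[index + 1]:
            search(index + 1, 1)
        # Undo only the disabling performed by this call.
        for j in newly_disabled:
            disabled[j] = False

    if n > 0:
        search(0, 0)
        search(0, 1)
    return results


# Hypothetical chain of three forbidden nodes: 0 -> 1 -> 2.
chain = [{1, 2}, {2}, set()]
chain_combinations = enumerate_combinations(chain)

# Three forbidden nodes with no paths between them.
independent = [set(), set(), set()]
independent_combinations = enumerate_combinations(independent)
```

For the chain, the search visits 4 combinations instead of all $2^3 = 8$, whereas three unrelated forbidden nodes still require all 8. Tracking which nodes were disabled by the current call keeps the re-enabling step from undoing a restriction imposed by an earlier ancestor.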
Overall Approach
Once the enumeration of all maximal convex subgraphs within a basic block is complete, we first pick the maximum convex subgraph as the most promising custom instruction candidate. After that, we prune the maximal convex subgraphs (1) that overlap with the chosen subgraph, or (2) that are cyclically dependent with the chosen subgraph. Again, we pick the largest one among the remaining maximal convex subgraphs, and we continue the same process until no more profitable maximal convex subgraphs can be found.
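The iterative selection described above can be sketched as a greedy loop. The sketch makes several assumptions that are not stated in the text: the DFG is a dictionary of successor lists, subgraph size is measured by accumulated software latency, two subgraphs are treated as cyclically dependent when each can reach the other, and the profitability test is omitted. All names are hypothetical.

```python
def reachable(succ, sources):
    """Return all nodes reachable from `sources` by following one or more edges."""
    seen, stack = set(), list(sources)
    while stack:
        u = stack.pop()
        for v in succ.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def cyclically_dependent(succ, a, b):
    """True if some node of b depends on a and some node of a depends on b."""
    return bool(reachable(succ, a) & b) and bool(reachable(succ, b) & a)


def select_candidates(succ, latency, candidates):
    """Greedily pick the largest subgraph, prune conflicting ones, and repeat."""
    remaining = [frozenset(c) for c in candidates if c]

    def weight(nodes):
        return sum(latency[v] for v in nodes)

    chosen = []
    while remaining:
        best = max(remaining, key=weight)
        chosen.append(best)
        # Drop subgraphs that overlap with, or are cyclically dependent on, the choice.
        remaining = [c for c in remaining
                     if not (c & best) and not cyclically_dependent(succ, c, best)]
    return chosen


# Hypothetical chain a -> b -> c -> d with per-node software latencies.
dfg = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
latency = {"a": 1, "b": 1, "c": 2, "d": 2}
candidates = [{"a", "b"}, {"c", "d"}, {"b", "c"}]
selected = select_candidates(dfg, latency, candidates)
```

Here `{c, d}` is picked first, `{b, c}` is pruned because it overlaps with the choice, and `{a, b}` survives to be picked in the second round.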
Figure 5. Pseudo-code of the enumeration algorithm.

    ALGORITHM: search(index, choice, graph)
      current_combination[index] = choice;
      if index == num_nodes_in_graph - 1 then
        store current_combination;
        return;
      end if
      if choice == 0 then
        ensure that all descendants of index in graph are disabled;
      end if
      index = index + 1;
      search(index, 0, graph);
      if index is not disabled then
        search(index, 1, graph);
      end if
      if choice == 0 then
        ensure that the disabled descendants are again enabled;
      end if

We apply the same procedure on all application basic blocks, and generate a unified set of subgraphs. After that, we group structurally equivalent subgraphs that can be implemented using the same hardware. We estimate the software execution latency Z(S) of a subgraph S by scheduling the subgraph in software under base processor resource constraints. We obtain the hardware execution latency H(S) of S through hardware synthesis using Synopsys Design Compiler. We estimate the communication latency C(S) of S by calculating the number of cycles required to transfer its input and output operands under register file port constraints, in a similar way as described in [4]. Given the frequency of execution F(S) of the subgraph S, we estimate the amount of reduction in the schedule length of the application by moving S from software to hardware as follows:
Finally, we obtain the area costs of subgraphs using Synopsys synthesis, and we choose the most promising subgraphs under area constraints using a Knapsack model [8] .
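The final selection step can be illustrated with a standard 0/1 knapsack dynamic program. This is a generic sketch and not necessarily the formulation used in [8]: areas are assumed to be expressed in integer units, and the item names, gains, and areas below are made-up numbers for illustration only.

```python
def knapsack_select(items, area_budget):
    """Choose items maximizing total gain without exceeding the area budget.

    items: list of (name, gain, area) tuples with non-negative integer areas.
    Returns (total_gain, names_of_chosen_items).
    """
    # best[cap] holds the best (gain, names) achievable with area at most cap.
    best = [(0, ())] * (area_budget + 1)
    for name, gain, area in items:
        # Iterate capacities downwards so that each item is used at most once.
        for cap in range(area_budget, area - 1, -1):
            candidate_gain = best[cap - area][0] + gain
            if candidate_gain > best[cap][0]:
                best[cap] = (candidate_gain, best[cap - area][1] + (name,))
    return best[area_budget]


# Hypothetical candidates: (name, estimated cycle reduction, area in arbitrary units).
items = [("S1", 60, 10), ("S2", 100, 20), ("S3", 120, 30)]
total_gain, chosen = knapsack_select(items, area_budget=50)
```

With a budget of 50 units, the program selects `S2` and `S3` for a total gain of 220, which beats any combination that includes `S1`.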
Experiments and Results
We integrated our algorithms into the Trimaran compiler [3]. We marked the memory access and branch instructions as forbidden instructions. We applied our algorithms on eight benchmarks from the multimedia and cryptography domains.
We carried out our experiments on an Intel Pentium 4 3.2-GHz workstation with 1-GB main memory, running Linux. We developed our algorithms in C/C++ and compiled them with gcc-3.4.3 using the -O2 optimization flag.
In Table 2, we compare the run-time of our enumeration algorithm with an ILP based approach [4], using both a commercial solver (CS) [1] and a public domain solver (PDS). The second and third columns of Table 2 show the number of nodes and the number of forbidden nodes, respectively, in the largest basic block of each benchmark. The run-time of our algorithm for identifying the maximum convex subgraph within the largest basic block is given in the fourth column. The remaining two columns show the respective ILP results using CS and PDS. We observe that, except for AES encryption, our approach is at least an order of magnitude faster than the ILP based approach. The advantage of using our technique is more evident when PDS is used instead of CS. In fact, CS [1] can automatically recognize constraint patterns that correspond to maximum independent set and maximum clique problems, and includes efficient solvers targeted specifically for these problems. We observe that in the case of AES, the ILP based approach is faster than ours even if PDS is used. This outcome is possible because ILP solvers can reduce the search space not only based on the constraints, but also based on the definition of the objective function.
The only run-time result provided by Verma et al. [15] is for AES, which is reported to be around 30 seconds. Pothineni et al. [12], on the other hand, report a run-time of two seconds for DES. Although we have not implemented these two techniques, our work appears to be faster for AES and DES. We note that other existing subgraph enumeration algorithms [6, 7, 16] do not scale well with the relaxation of input/output constraints, and fail to identify the maximum convex subgraphs within several hours given benchmarks with very large basic blocks, such as AES and DES.
Using the Trimaran framework, we defined a single issue base processor that implements an instruction set similar to the MIPS IV instruction set. We synthesized the custom instructions to UMC's 130nm standard cell library using Synopsys Design Compiler. We note that our custom instructions are pipelined in order not to increase the cycle time of the base processor, which we estimated to be around the critical path delay of a 32-bit carry propagate adder.
Figure 6 shows the speed-up results we obtain using custom instructions with respect to the base processor on our benchmarks. Assuming a base processor register file with 32 32-bit registers, we evaluate four register file read and write port combinations: (2,1), (2,2), (4,2), and (4,4). We observe that increasing the number of read and write ports supplied by the register file decreases the communication overhead of the custom instructions, which improves the speed-up. Finally, above each column, we show the cell area for the associated custom datapath in terms of $\mu m^2$. We note that we observed only marginal differences between our speed-up results and those of the ILP based approach [4].
Figure 7 shows the most promising custom instruction our algorithms identify from the DES C code. The software implementation fully unrolls DES round transformations within a single basic block, which consists of 822 base processor instructions. The custom instruction implements the combinational logic between the memory access layers of two consecutive DES rounds. We note that X and Y represent the DES encryption state. Eight of the inputs of the custom instruction are static look-up table entries (SBs), and two of the inputs (SK1, SK2) contain the DES round key. Accordingly, eight of the outputs contain the addresses of the look-up table entries that should be fetched for the next round. We observe that the size of the look-up tables is rather small (256 bytes only).
We could avoid all the related main memory accesses and address calculations by embedding the eight look-up tables in local memories. Similarly, the DES scheduled key is only 32 bytes wide, and can be embedded in local memories, again eliminating a number of main memory accesses. Once these optimizations are done, the size of the core basic block of DES drops from 822 instructions to only 22 instructions, which incorporate only three different types of custom instructions, each one having only a single cycle execution latency. The result is roughly a 30-fold improvement in the performance of DES.
Summary
This paper provides novel theoretical and practical results for improving efficiency of automatically-generated custom instruction processors and their design. Current and future work includes extending our approach to cover additional application domains, and to support implementations targeting field-programmable devices. 
