To meet the growing performance and power demands of embedded applications, processors are now augmented with application-specific functional units. Customized functional units, together with their corresponding instructions, are added to the base processor to improve computational efficiency for a target application. Both when generating these complex instructions and when generating code for the extended processor, a critical challenge for the compiler is to perform instruction matching and selection automatically, quickly, and efficiently. In this project, we developed a novel and efficient algorithm for matching multiple-output complex functional units (FUs). We also illustrate that the assumption underlying most current covering methodologies may not always hold. Current covering algorithms generally aim to find the optimal cover within each basic block, that is, the cover that minimizes the number of selected matches. Fewer matches translate to fewer operations to schedule, and the increased scheduling freedom is expected to lead to a better (shorter) schedule. We provide examples showing that minimizing the number of matches does not necessarily minimize the execution time.
INTRODUCTION
Demand has recently exploded for mobile phones, PDAs, iPads, digital video cameras, audio devices and other high-performance special-purpose electronic devices. These devices need application-specific hardware in order to meet challenging cost, power and performance demands. One known strategy for providing such hardware is to build a system comprising a low-cost core processor, such as an ARM [1], and a number of highly specialized application-specific integrated circuits (ASICs). The ASICs are specially designed hardware units that accelerate the computationally demanding portions of the application, which would otherwise run too slowly on the core processor alone. Clark et al. [2] and Pothineni et al. [3] mention that, while this approach is effective, ASICs are costly to design and offer only a hardwired solution that permits almost no post-programmability. An alternative approach is a processor-centric system with customized accelerators. Here the core processor is augmented with an instruction set that can significantly improve the performance and power of the application in a cost-effective manner, compared to a general-purpose system. The system is also post-programmable, and with minor modifications the customized instructions can easily be generalized for use across a set of applications. Given these potential benefits, a couple of commercial efforts [4, 5] have also been made to support the high-level design of custom processors. An application-specific instruction-set processor (ASIP) needs to exploit the instruction-level parallelism (ILP) available in the given application efficiently so that it can deliver high performance.
The VLIW architecture provides a better opportunity for customization, so we consider a VLIW architecture in which a core set of functional units (FUs) is augmented with application-specific coarse-grained functional units. Constructing custom FUs, especially for frequently occurring complex operations in a given application, can lead to very significant performance gains.
RELATED WORK
Matching and covering algorithms are well known in the fields of code generation and logic synthesis. Keutzer [6] was the first to recognize the similarity between the software compiler's task of generating code and the technology mapping problem in automated VLSI design. Both problems can be handled with a matching algorithm, which finds all possible instantiations of patterns (instructions or standard cells), followed by a covering algorithm, which selects the matches that optimize some criterion (execution time, code size, VLSI area or latency, etc.). Many researchers have studied the potential utility of customizing the instruction set, but most of them do not describe methods to automate the process. Some algorithms [7, 8] evaluate each node of the data flow graph (DFG) by exploring the corresponding complete binary tree to decide whether it is a possible candidate. Their exponential time complexity limits the size of DFG that can be handled in a timely manner. For code generation, we need to select from the given set of instructions (including the complex ones) in a way that effectively utilizes the VLIW processor and the Application Specific Functional Units (AFUs). The goal is to minimize the schedule length. The complexity arises because we want the covering to be architecture driven and because the I/O time-shape of an AFU may be distributed; a distributed time-shape lets us use the resources (functional units and registers) efficiently, thereby reducing the execution time. Most methods [7] generally aim to find the optimal cover within each basic block, that is, the cover that minimizes the number of selected matches. Fewer matches translate to fewer operations to schedule, and the increased scheduling freedom is expected to lead to a better (shorter) schedule.
Instruction Matching
As described in [9], there are two main approaches to the matching problem in technology mapping: the Boolean approach and the structural approach. The Boolean approach can be applied only to networks of Boolean functions, whereas structural matching works on networks containing nodes of any type of function.
In a more recent work, Kukimoto et al. [10] introduced a structural matching method that can handle DAG-shaped subject graphs, allowing reconvergence within the graph. None of the previously described matching algorithms allow pattern graphs to have more than one output. Peymandoust et al. [11] employ symbolic algebra, using commercial computer algebra systems such as Maple and Mathematica, which can perform matching even for multiple-output pattern graphs. However, this works only for arithmetic data flow segments, which makes the approach infeasible for most embedded applications. Ienne et al. [12] use symbolic algebra tools for instruction selection. In contrast to tree-covering algorithms, their algorithms perform mapping simultaneously with algebraic manipulation. As mentioned above, this too is restricted to arithmetic data flow segments. The matching algorithm proposed by Arnold and Corporaal [13, 14] does not have this restriction, making it possible to exploit a larger family of pattern graphs. They perform a detailed search space exploration. We show experimentally that our instruction matching algorithm is considerably more efficient than theirs.
PROBLEM DEFINITION
Let CFU-lib be a library of Complex Functional Units (CFUs), possibly with multiple outputs; let G pat-i be the data flow graph corresponding to the ith customized instruction (or CFU) in CFU-lib (Figure 1); and let G sub be a data flow graph (DFG) (Figure 2a) within the control flow graph of a C application.
Instruction Matching:
Given G sub and CFU-lib, find, for each G pat-i in CFU-lib, all dataflow segments of G sub that match G pat-i.
ALGORITHM FOR INSTRUCTION MATCHING
Given G sub and CFU-lib, we must find all matches between the pattern graphs G pat (from the pattern library CFU-lib) and subgraphs of G sub. The strategy for this is described in Section 4.1. After all matches are found, we look for the best cover, or selection of matches, that is, the set of matches that, when implemented, minimizes the data-ready time of the longest path through the subject graph. The covering approach is described in Section 4.2.
Instruction matching
Primary nodes: These are the nodes of G pat whose input operands all belong to the set of source operands of the customized instruction corresponding to that G pat.
Two algorithms for partial match identification
For the analysis, let G sub and G pat have n and m nodes, respectively, and let the number of primary nodes in each G pat be p. We compare our algorithm (Algorithm 3.4) for identifying partial matches with the partial match identification algorithm (Algorithm 3.5) proposed by Arnold [13]. Figure 2 illustrates that the common assumption (a heuristic used by many current methods) that using fewer patterns yields a shorter schedule may not always hold.
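To see where a bound of n^p partial matches comes from, the following sketch enumerates candidate assignments of subject nodes to the p primary nodes of a pattern. It assumes that a partial match is an assignment of distinct subject nodes whose opcodes equal those of the primary nodes; this is a simplification for illustration and is not a reproduction of Algorithm 3.4 or Algorithm 3.5. The function name and the example opcodes are hypothetical.

```python
from itertools import product

def partial_matches(primary_opcodes, subject_opcodes):
    """Yield assignments of distinct subject nodes to the primary nodes.

    primary_opcodes: opcodes of the p primary nodes of the pattern
    subject_opcodes: dict mapping each of the n subject nodes to its opcode
    """
    # Candidate subject nodes for each primary node: same opcode.
    candidates = [
        [node for node, opcode in subject_opcodes.items() if opcode == wanted]
        for wanted in primary_opcodes
    ]
    for combo in product(*candidates):
        if len(set(combo)) == len(combo):  # one subject node per primary node
            yield combo

subject = {"n1": "add", "n2": "mul", "n3": "add", "n4": "sub", "n5": "mul"}
matches = list(partial_matches(["add", "mul"], subject))
print(len(matches))  # 4, compared with the upper bound n**p = 25
```

Each primary node has at most n candidates, so there are at most n^p assignments; opcode filtering alone usually leaves far fewer.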
Insn_matching ( )
In Fig 2(b), the number of FUs used is 4, but the schedule takes 9 cycles. In Fig 2(c), the number of FUs used is 5, but the schedule takes only 7 cycles.
Fewer matches translate to fewer operations for the schedule: this is generally considered the prime assumption in developing efficient covering algorithms. Current covering algorithms therefore generally aim to find the optimal cover within each basic block. The above example shows that this assumption does not necessarily achieve the goal of minimizing the execution time. matching algorithm. Afterwards, it is adopted in the Trimaran framework with some changes to the data structures used for traversing along the data-flow edges.
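The same effect can be reproduced on a small made-up example. This is not the graph of Figure 2: the operation names, the latencies, the unlimited-resource as-soon-as-possible schedule, and the assumption that a fused unit starts only when all its inputs are ready are chosen purely for illustration.

```python
def schedule_length(ops):
    """ops maps an operation name to (latency, predecessor names).

    Returns the finish time of the last operation under an as-soon-as-possible
    schedule with unlimited resources (an assumption of this sketch).
    """
    finish = {}

    def done(name):
        if name not in finish:
            latency, preds = ops[name]
            finish[name] = max((done(p) for p in preds), default=0) + latency
        return finish[name]

    return max(done(name) for name in ops)

# Cover 1: simple operations only. 'a' feeds the chain c1..c4, while 'b'
# needs 'a' and a value 'y' that becomes available late.
simple_cover = {
    "y": (5, []), "a": (1, []), "b": (1, ["a", "y"]),
    "c1": (1, ["a"]), "c2": (1, ["c1"]), "c3": (1, ["c2"]), "c4": (1, ["c3"]),
}
# Cover 2: 'a' and 'b' fused into one two-output unit that waits for 'y'.
fused_cover = {
    "y": (5, []), "ab": (2, ["y"]),
    "c1": (1, ["ab"]), "c2": (1, ["c1"]), "c3": (1, ["c2"]), "c4": (1, ["c3"]),
}
print(len(simple_cover), schedule_length(simple_cover))  # 7 6
print(len(fused_cover), schedule_length(fused_cover))    # 6 11
```

The fused cover has one operation fewer but delays the chain that depends on the early result, so its schedule is longer.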
EVALUATION FRAMEWORK

ANALYSIS AND RESULTS
To evaluate the performance of the proposed instruction matching algorithm, we used the bitwise benchmarks new life, histogram and bubble sort. Our CFU-lib consists of 6 patterns (customized instructions), shown in Fig 3. Table 1 shows the number of complete (valid) matches found in a particular basic block of each application. The number of partial matches is O(n^p) in the worst case, but experimentally we observed it to be much smaller than this bound. We also analysed the effect of the heuristics in pruning the search space. Table 2 shows the fraction of eligible matches for different values of p and for different benchmarks. The eligible matches are those partial matches that satisfy the out-degree constraint, the common-sink constraint, etc. The fraction of eligible matches is computed as (number of eligible matches) / (number of partial matches). A best case of 0% for a given benchmark and p means that, for some basic block (DFG) in that benchmark, none of the partial matches remained an eligible candidate after the heuristics were applied. In this way we filter out many unsuitable partial matches without the complicated task of traversing the data-flow edges. Notably, for p = 3, the matches considered for a full match are only 0.1 to 2% of the partial matches identified in stage 1 of our algorithm. Thus, for higher values of p, the eligibility criteria are very effective in pruning the search space. Applying the eligibility criteria requires computing the Reachability matrix and the Commonsink matrix beforehand. The overhead is the time to compute these two matrices, but since this is done only once per basic block and prunes the search space effectively, the overhead is well worth paying. Theoretically, the number of comparisons performed while checking the eligibility criteria is num_partial_matches*(p*(p-1)/2).
We compare this with the number of comparisons actually performed. On average, for p = 2, the ratio of the actual number of comparisons to num_partial_matches*(p*(p-1)/2) is about 2.5, and for p = 3 this ratio is 1 (Table 3). Comparing the two algorithms for finding partial matches: We have described in Section 4.2 the two algorithms for finding partial matches. Table 4 compares, for different benchmarks, the number of partial matches found by the two algorithms (these are the matches that must then be evaluated for eligibility and validity). We observed that Algorithm 3.4 is much more efficient than Algorithm 3.5: Algorithm 3.5 identifies many more partial matches than our Algorithm 3.4.
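The precomputation and the pairwise eligibility check can be sketched as follows. The sketch assumes that two nodes satisfy the common-sink constraint when some node is reachable from both (a node counts as reachable from itself), and it stores reachability as sets rather than as a matrix. These definitions, the function names and the example graph are assumptions for illustration; the out-degree constraint is omitted.

```python
from itertools import combinations

def reachability(succ):
    """Map every node of a DAG to the set of nodes reachable from it."""
    reach = {}

    def visit(u):
        if u not in reach:
            reach[u] = set()
            for v in succ.get(u, []):
                reach[u].add(v)
                reach[u] |= visit(v)
        return reach[u]

    for u in succ:
        visit(u)
    return reach

def common_sink(u, v, reach):
    """True if some node is reachable from both u and v."""
    return bool((reach[u] | {u}) & (reach[v] | {v}))

def eligible(partial_match, reach):
    """Check every pair of matched nodes: at most p*(p-1)/2 comparisons."""
    return all(common_sink(u, v, reach)
               for u, v in combinations(partial_match, 2))

# Illustrative subject graph: n1 and n2 feed n4, n3 and n5 feed n6.
succ = {"n1": ["n4"], "n2": ["n4"], "n3": ["n6"], "n5": ["n6"],
        "n4": [], "n6": []}
reach = reachability(succ)  # computed once per basic block
partials = [("n1", "n2"), ("n1", "n5"), ("n3", "n2"), ("n3", "n5")]
kept = [pm for pm in partials if eligible(pm, reach)]
print(kept)  # [('n1', 'n2'), ('n3', 'n5')]
```

In this example half of the partial matches are discarded using set lookups alone, before any traversal of data-flow edges would be needed for a full match.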
CONCLUSION AND FUTURE WORK
We have presented a novel and very efficient algorithm for instruction matching. It successfully matches even multiple-output complex functional units against subgraphs in the DFG of an application. We observed that the concept of a Commonsink (or common descendant) plays a very significant role in pruning the search space effectively. Matching only the primary nodes of the graph G pat corresponding to a customized instruction, together with the concept of the Commonsink, constitutes the crux of the algorithm. We evaluated the performance of the matching algorithm on several benchmarks and compared its efficiency with that of an existing algorithm. At present, the algorithm handles commutative cases but does not include complex arithmetic-logic reductions. In future, the algorithm can be extended to match a customized instruction (G pat) with a subgraph (G sub) in the DFG that performs the same computation as G pat even when the topologies of G pat and G sub differ.
