Abstract: Extensible processors provide an efficient mechanism to boost the performance of the whole system without losing much flexibility. However, due to the intense demand for low cost and low power consumption, customizing an embedded system has become more difficult than ever. In this paper, we present a framework for custom instruction generation that considers both area constraints and resource sharing. We also show how the process can be sped up through pruning and library-based design space exploration.
I. INTRODUCTION
In recent years, the demand for high-performance, low-power systems in the consumer electronics market has been growing exponentially. In those systems, extensible or configurable processors can help overcome the challenges of cost, performance, and time-to-market pressure. Typically, an extensible processor is optimized for a specific application to obtain high performance and low power consumption that cannot be achieved by a general purpose processor (GPP), while maintaining a level of flexibility that is not possible with an application-specific integrated circuit (ASIC). Several commercial products already exist, such as Altera Nios II [1], Xilinx MicroBlaze [2], Tensilica Xtensa [3], and ARC 700 [4].
Instead of designing the processor entirely from scratch, the designers of an extensible processor are given a base processor and allowed to add their own custom instructions (CIs) and/or other peripherals to meet the application and design constraints, which makes the design process much easier. However, due to the lack of a fully automated design tool chain that finds the best CIs [5], configuring an extensible processor still depends heavily on the experience of the designers.
There has been much research on automatic CI generation [5]-[13], and many different methodologies have been developed to solve parts of the problem. However, none of them studied generating CIs considering both area constraints and resource sharing. In this paper, we propose an efficient framework to explore the design space and generate an optimal set of CIs considering these two factors. This is an extension of our previous work [15], where only one CI is selected among the candidate CIs, which may overlap with each other.
II. PREVIOUS WORK
Typically, CI generation starts from execution time analysis. One simple method is to profile the target application with an instruction set simulator (ISS) of the base processor. Time-consuming basic blocks, known as kernels, are identified from the application code. The code of each kernel is then converted to a directed acyclic graph (DAG) representation. This work can be easily done with the SUIF compiler [14]. There are two main approaches to generating CIs from the DAG: one focuses on reusability of each CI [8], whereas the other focuses on the performance gain of a single CI [7]. There is no guarantee that one methodology outperforms the other. And as stated in [5], it is important to obtain a large CI in realistic cases. For these reasons, our work also takes the second approach.
An effective single-CI generation algorithm based on the second approach is proposed in [7]. The basic idea of the algorithm is to enumerate all possible solutions in the DAG with a search tree. After a topological sort, operations are assigned to corresponding levels of the tree. Fig. 1 shows a binary search tree of four levels, which is used for generating one CI from the given DAG. The notion of a cut is introduced to represent a subset of nodes in a DAG. At each node, the edge (labeled 1) to the right child indicates "adding" the operation corresponding to that level to the cut, whereas the edge (labeled 0) to the left child indicates "ignoring" that operation. By visiting all the nodes in the tree, all possible solutions can be tested and an exact solution can be found. The cuts that satisfy all the constraints, such as the number of input ports, the number of output ports, and the convexity constraint, are called candidate CIs. Notice that the total number of cuts is O(2^n), where n is the number of nodes in the DAG. A branch-and-bound technique is implemented to reduce the runtime of this exponential-complexity algorithm. While traversing the search tree, whenever a constraint violation occurs, all the descendant nodes are pruned, because adding more nodes will not recover the violation. It is stated that the runtime is quite reasonable.
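The enumeration of cuts described above can be sketched as follows. This is a minimal illustration, not the algorithm of [7]: it visits every one of the 2^n cuts exhaustively and omits the branch-and-bound pruning. The function name, the bitmask encoding of a cut, the modeling of primary inputs as non-selectable nodes, and the toy DAG are all assumptions made for this example.

```python
def enumerate_candidate_cuts(n, edges, primary_inputs, live_out, max_in, max_out):
    """Return every cut (bitmask over node ids) that meets the I/O and
    convexity constraints. Node ids must be in topological order (u < v)."""
    preds, succs = [0] * n, [0] * n
    for u, v in edges:
        preds[v] |= 1 << u
        succs[u] |= 1 << v

    def nodes(mask):
        return [i for i in range(n) if (mask >> i) & 1]

    # Transitive closure: ancestors in forward order, descendants in reverse.
    anc, desc = [0] * n, [0] * n
    for v in range(n):
        anc[v] = preds[v]
        for u in nodes(preds[v]):
            anc[v] |= anc[u]
    for v in reversed(range(n)):
        desc[v] = succs[v]
        for w in nodes(succs[v]):
            desc[v] |= desc[w]

    in_mask = sum(1 << i for i in primary_inputs)
    out_mask = sum(1 << i for i in live_out)
    candidates = []
    for cut in range(1, 1 << n):
        if cut & in_mask:
            continue  # primary inputs are values, not operations
        members = nodes(cut)
        feeding = 0
        for v in members:
            feeding |= preds[v]
        n_in = bin(feeding & ~cut).count("1")  # distinct values entering the cut
        n_out = sum(1 for v in members
                    if (succs[v] & ~cut) or ((out_mask >> v) & 1))
        # Convex: no outside node lies on a path between two cut members.
        convex = all(not ((anc[v] & cut) and (desc[v] & cut))
                     for v in range(n) if not (cut >> v) & 1)
        if n_in <= max_in and n_out <= max_out and convex:
            candidates.append(cut)
    return candidates


# Hypothetical toy DAG: nodes 0-2 are primary inputs, nodes 3-6 are operations.
EDGES = [(0, 3), (1, 3), (3, 4), (2, 4), (3, 5), (4, 6), (5, 6)]
cuts = enumerate_candidate_cuts(7, EDGES, primary_inputs={0, 1, 2},
                                live_out={6}, max_in=2, max_out=1)
```

With a 2-input, 1-output constraint, the cut {3, 5, 6} is rejected even under looser port limits because node 4 lies on a path from node 3 to node 6, which violates convexity.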
To extend the problem to multiple-CI selection, one can build up a tree with multiple branches (one branch for each CI). However, due to the exponential nature, such a method is unreasonably slow in practice. So in [5] , a heuristic multiple-CI identification approach called iterative selection is proposed. This approach is based on a relatively fast single-CI identification algorithm. They use the single-CI identification algorithm iteratively on the given DAG and select the best CI for each iteration. Nodes of the CI selected in an iteration are set as forbidden nodes in the following iterations. The experiment shows that the performance difference between this heuristic and the optimal solution is reasonably small, and runtime is significantly reduced.
Lee et al. [9] extended the above work in three ways. First, CIs are implemented as sequential logic including memory load/store units. As a result, memory operations can be included in CIs, which had been forbidden in other approaches. Secondly, register operations are serialized, so the number of I/O ports of a CI is not limited by the architecture. The direct impact of these two improvements is that CIs become larger. Thirdly, area is saved by resource sharing among CIs through the high-level synthesis (HLS) process. However, this algorithm still suffers from the complexity problem of [7]. Moreover, including memory operations and serializing register operations decreases the chance of pruning (a branch that includes a memory operation should not be pruned), which makes the algorithm even slower. Furthermore, if the designers need to set an area constraint, there is no better way to obtain the optimal resource allocation (how many multipliers, how many adders, etc.) than to check all the combinations.
III. MOTIVATIONS AND CONTRIBUTIONS
Die area is of great value in processor design, especially for embedded processors, where cost and (static/dynamic) power consumption are important. So every small piece of die area should be used carefully. Under an area constraint, there can be various combinations of resources that satisfy the constraint. Depending on the application, different combinations of resources can yield very different performance. For instance, consider two different combinations having similar total area: one containing three adders and one shifter, and the other containing one adder and three shifters. For applications with more concurrent addition operations, the first combination is preferable. In contrast, if the application has more concurrent shift operations, the second combination is better. Thus proper allocation of resources deserves great attention.
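The adder/shifter example can be made concrete with a small resource-constrained scheduling sketch. It is only an illustration under stated assumptions: every operation takes one cycle, the scheduler is a simple greedy list scheduler rather than the HLS flow used in this work, and the kernel descriptions and the name schedule_length are hypothetical.

```python
def schedule_length(ops, deps, alloc):
    """Cycles needed to run a kernel on the given functional units.

    ops:   {op_name: fu_type}
    deps:  {op_name: [predecessor op_names]}
    alloc: {fu_type: number of units}
    Assumes single-cycle operations and greedy list scheduling.
    """
    done, cycles = set(), 0
    while len(done) < len(ops):
        free = dict(alloc)
        issued = []
        for name, fu in ops.items():
            if name in done:
                continue
            ready = all(p in done for p in deps.get(name, []))
            if ready and free.get(fu, 0) > 0:
                free[fu] -= 1
                issued.append(name)
        if not issued:
            raise ValueError("allocation lacks a unit type the kernel needs")
        done.update(issued)  # results become visible in the next cycle
        cycles += 1
    return cycles


# Two hypothetical kernels: one add-heavy, one shift-heavy.
add_heavy = ({"a1": "add", "a2": "add", "a3": "add", "s1": "shift"},
             {"s1": ["a1", "a2", "a3"]})
shift_heavy = ({"s1": "shift", "s2": "shift", "s3": "shift", "a1": "add"},
               {"a1": ["s1", "s2", "s3"]})

alloc_a = {"add": 3, "shift": 1}  # three adders, one shifter
alloc_b = {"add": 1, "shift": 3}  # one adder, three shifters

add_on_a = schedule_length(*add_heavy, alloc_a)
add_on_b = schedule_length(*add_heavy, alloc_b)
shift_on_a = schedule_length(*shift_heavy, alloc_a)
shift_on_b = schedule_length(*shift_heavy, alloc_b)
```

In this toy setting, the add-heavy kernel finishes in two cycles on the first allocation and four on the second, and the shift-heavy kernel shows the opposite behavior.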
Given a kernel, we try to find a number of CIs and optimal resource allocation under an area constraint, considering resource sharing. Resource sharing is achieved through HLS. This problem can be formulated as follows:
Problem: Find cuts $C_1, C_2, \ldots, C_{N_{CI}}$ for CIs and a resource allocation $(X_1, X_2, \ldots, X_n)$ in the set of feasible resource allocations (FRAs) given by
$$\sum_{i=1}^{n} A_i X_i \le A_{\mathrm{constraint}}$$
such that the total performance gain
$$\sum_{j=1}^{N_{CI}} \bigl( T_{SW}(C_j) - T_{HW}(C_j) \bigr)$$
is maximized, where $A_{\mathrm{constraint}}$ is the area constraint, $A_i$ is the area of a functional unit (FU) of type $i$, $X_i$ is the number of functional units of type $i$, $T_{SW}(C_j)$ is the number of cycles required to execute $C_j$ with base instructions, and $T_{HW}(C_j)$ is the number of cycles required to execute $C_j$ with the generated CI for the cut. The algorithm in [9] could be applied directly to explore the design space. However, two problems make this naive method unfavorable. First, depending on the number of resource types and the total area constraint, the number of FRAs can be huge. Secondly, for a given resource allocation, the algorithm for finding the best CI is exponential, even though it can be made more efficient with the branch-and-bound technique. Since the algorithm has to be executed for every FRA, it can be too time-consuming, especially when the kernel is large.
In this paper, we consider two techniques to reduce the overall runtime of design space exploration. One is to prune the resource allocation space, and the other is to decrease the runtime for finding a CI for each resource allocation. Fig. 2 shows an example of a resource allocation space with two types of resources: FU1 (e.g., multiplier) and FU2 (e.g., adder). The dashed line represents the area constraint. Each dot under the dashed line represents an FRA. We assume only data-dominated circuits, where the areas of registers, multiplexers, control units, and interconnects are negligible. Although there are many FRAs in our example, not all of them need to be explored.
IV. CI GENERATION FRAMEWORK
1. Pruning of Resource Allocation Space
FRAs in the design space can be divided into two categories, namely tight FRAs (shown as black dots in Fig. 2) and loose FRAs (shown as gray dots in Fig. 2). Tight FRAs are those to which adding any further resource would violate the area constraint. The rest of the dots are loose FRAs. Notice that, for any loose FRA, we can always find a tight FRA that has at least one more resource of a certain type and no fewer resources of any other type. Allocating more resources cannot decrease the performance, since if the extra resources do not help, we can simply ignore them during the resource binding phase. Thus loose FRAs can never give a better solution, and therefore all of them are pruned from the resource allocation space.
Moreover, we can also prune some of the tight FRAs when there is not enough concurrency in the kernel to exploit all the allocated resources. For each type of FU, we can set an upper bound on the number of instances used for implementing a CI, as shown in Fig. 2. We obtain such upper bounds by performing HLS for every candidate CI with unlimited resources, using as-soon-as-possible (ASAP) scheduling. For each type of FU, by recording the actual number of instances used in each candidate CI, we obtain an upper bound for that CI. By taking the maximum of the upper bounds over all candidate CIs, the global upper bound is obtained.
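The two pruning steps can be sketched as follows. This is an illustrative reading of the text, not the implementation used in the experiments. It assumes integer FU areas and interprets the upper-bound pruning as keeping only allocations that are maximal inside the box defined by the global upper bounds; the function names and numbers are hypothetical.

```python
from itertools import product


def global_upper_bounds(per_candidate_usage):
    """Per-type maximum of the FU counts recorded for each candidate CI
    (e.g., from ASAP scheduling with unlimited resources)."""
    return [max(column) for column in zip(*per_candidate_usage)]


def pruned_allocations(areas, area_limit, upper_bounds=None):
    """Enumerate FRAs and keep only those worth exploring.

    Without upper_bounds this returns the tight FRAs: adding one more unit
    of any type would exceed area_limit. With upper_bounds, a type that has
    reached its bound is treated as saturated (assumed interpretation).
    """
    n = len(areas)
    if upper_bounds is None:
        upper_bounds = [area_limit // a for a in areas]
    kept = []
    for x in product(*(range(u + 1) for u in upper_bounds)):
        used = sum(a * k for a, k in zip(areas, x))
        if used > area_limit:
            continue  # not feasible
        saturated = all(x[i] == upper_bounds[i] or used + areas[i] > area_limit
                        for i in range(n))
        if saturated:
            kept.append(x)
    return kept


# Hypothetical numbers: FU1 has area 3, FU2 has area 2, area limit 12.
tight = pruned_allocations([3, 2], 12)
bounds = global_upper_bounds([[1, 4], [2, 2]])  # FU usage of two candidate CIs
bounded = pruned_allocations([3, 2], 12, bounds)
```

In this toy case, 19 feasible allocations reduce to five tight ones, and the concurrency bounds leave only two to explore.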
2. Library-Based Design Space Exploration
The approaches in [6] and [9] perform CI identification together with CI selection. However, separating these two processes creates another opportunity for decreasing the runtime of design space exploration. Note that CI identification simply enumerates all the possible cuts and checks the constraints on the number of input ports, the number of output ports, and convexity. It has nothing to do with the performance gain that could be obtained by the CI implementation. Thus, for a given kernel, the result of CI identification does not change at all across different FRAs. CI selection, in contrast, is based on the performance gain of each CI, and the performance gain is determined by the HLS process under the given FRA. CI identification can therefore be decoupled from HLS and CI selection. By reusing the identified CIs again and again during CI selection, we can reduce the runtime significantly.
Fig. 3 shows the benefit of the proposed framework, where we assume two different FRAs and up to two CIs. In the traditional design space exploration approach (simply running Lee's algorithm [9] iteratively for each resource constraint), there are two iterations for each FRA. Within each iteration, CI identification is performed first to find all candidate CIs, HLS is performed over the candidates, and the CI with the largest performance gain is selected. Notice that the runtime of the identification process in the second iteration is reduced compared to that of the first iteration, because the nodes included in the first CI are set as forbidden. The entire process is repeated for the second FRA, and the better of the two best solutions is selected. In our proposed method, however, we perform the CI identification process only once and reuse the stored candidate CIs for the HLS and selection processes.
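The structure of the library-based exploration can be sketched as below. This is a simplified illustration that selects a single CI per FRA. The callables identify and gain are hypothetical placeholders for CI identification and for the HLS-based performance gain estimate, and the toy data exists only to show that identification runs once regardless of the number of FRAs.

```python
def explore_with_library(kernel, allocations, identify, gain):
    """Identify candidate CIs once, then reuse them for every FRA.

    identify(kernel)  -> non-empty list of candidate cuts (the "library")
    gain(cut, alloc)  -> performance gain of the cut under that allocation
    Both callables are placeholders assumed for this sketch.
    """
    library = identify(kernel)  # CI identification runs exactly once
    best = None
    for alloc in allocations:
        # Only the FRA-dependent evaluation and selection are repeated.
        pg, cut = max((gain(cut, alloc), cut) for cut in library)
        if best is None or pg > best[0]:
            best = (pg, cut, alloc)
    return best


# Toy stand-ins: cuts are bitmasks, and the gain is capped by the unit count.
calls = {"identify": 0}


def toy_identify(kernel):
    calls["identify"] += 1
    return [0b011, 0b110, 0b100]


def toy_gain(cut, alloc):
    return min(bin(cut).count("1"), alloc[0])


best = explore_with_library("toy-kernel", [(1,), (2,)], toy_identify, toy_gain)
```

In the traditional flow, the call to identify would sit inside the loop over allocations (and inside each selection iteration), which is the cost the library avoids.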
3. Multiple-CI Selection
In our framework, we adopt the idea of iterative selection from [5], but we implement it in a different way. The basic principle of multiple-CI selection is that no two cuts may have overlapping nodes; in other words, any two CIs must be disjoint. In [5], nodes from previously selected cuts are set as forbidden nodes, so that in the next iteration the forbidden nodes are not included in the search tree.
In our implementation, the path to each node from the root is encoded as a label of the node. Each '0' bit in the label indicates that the corresponding operation is not included, while each '1' bit indicates that it is included. So if two candidate cuts overlap, both labels must have a '1' bit at the same position. Based on this observation, we use the labels of the candidate cuts to select disjoint cuts. The pseudo code is shown in Fig. 4. In each iteration, the label of each candidate CI is compared with the mask. Among the candidate CIs that do not overlap with the mask, we find the CI with the largest PG and update the solution and the mask.
4. Proposed Framework
Fig. 5 shows the design flow of the proposed framework. The CI identifier identifies candidate CIs following the approach in [9]. It receives the DAG representation of the given kernel and checks the feasibility of the considered cuts based on the relaxed I/O constraint and the convexity constraint. The selection process is repeated until all N_CI CIs are found or no more CIs can be found. The entire process is repeated for each FRA in the resource allocation space. In the end, the FRA with the largest performance gain is selected for the final design.
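The label-and-mask selection of disjoint cuts can be sketched as follows. This is not the pseudo code of Fig. 4, which is not reproduced here. It assumes that bit i of a label corresponds to operation i and that the performance gain (PG) of each candidate is a fixed number, whereas in the framework the PG comes from HLS under the current FRA. The function name and candidate data are hypothetical.

```python
def select_disjoint_cis(candidates, max_cis):
    """Greedy iterative selection of non-overlapping candidate CIs.

    candidates: list of (label, pg) pairs; label is a bitmask of operations.
    Returns the chosen labels and the final mask of covered operations.
    """
    mask, chosen = 0, []
    for _ in range(max_cis):
        # A candidate overlaps an earlier choice if it shares a '1' bit
        # with the mask.
        free = [(pg, label) for label, pg in candidates
                if (label & mask) == 0 and pg > 0]
        if not free:
            break  # no further CI can be found
        pg, label = max(free)
        chosen.append(label)
        mask |= label
    return chosen, mask


# Hypothetical candidates over four operations.
candidates = [(0b0111, 9), (0b0110, 7), (0b1000, 3), (0b0001, 2)]
chosen, mask = select_disjoint_cis(candidates, max_cis=3)
```

Here the first iteration picks label 0b0111, which masks out the two overlapping candidates, so the second iteration picks 0b1000 and the third finds nothing left.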

V. EXPERIMENTAL RESULTS
We have selected basic blocks from the MiBench [16] and DSPstone [17] benchmarks to show the merit of our proposed multiple-CI generation framework. The kernels have been converted to DAG representations with the SUIF compiler [14]. We have used an extensible processor whose instruction set architecture is compatible with ARM7. We assume that two register reads and one register write are supported in each cycle. Memory operations are also supported. The relaxed I/O constraint is set to 4/2 (at most four inputs and two outputs). We use the fastest implementation for each functional unit and obtain delay and area through synthesis using Synopsys Design Compiler.
The profile of each kernel is shown in Table I. The size of each kernel is shown in the second row of the table. The number of visited nodes is the number of nodes in the binary search tree visited during CI identification. As this number increases, the runtime of the identification process increases. The number is affected by the size of the kernel as well as the topology of the DAG. Usually, as the size of the kernel increases, the number of visited nodes grows exponentially. That is why our proposed framework is more effective for larger kernels. The number of candidate CIs indicates the speed of CI selection. The ratio of the number of visited nodes to the number of candidate CIs indicates the speedup achieved by our approach.
Fig. 6 shows the set of Pareto points obtained by the proposed approach for the CONVOLUTION example with different area constraints and up to four CIs. The horizontal axis represents the actual area cost of implementing the generated CIs, and the vertical axis represents the number of cycles required to execute one iteration of CONVOLUTION. The point with zero CI area represents no instruction extension, which takes 73 cycles. By using four CIs and more area (more resources) for the CIs, we can reduce the execution time down to 22 cycles.
To see the efficiency of our approach, we have measured the runtime of the proposed method as well as the traditional method. In this experiment, the pruning of the resource allocation space has been performed for both methods. The area constraints have been set such that ten different FRAs remain, and then only one best CI has been generated from each example kernel. Thus in the traditional method, Lee's algorithm is executed 11 times (once for the resource allocation space pruning, and 10 times for CI selection). For each example used in this experiment, we have generated three CIs. Fig. 7 shows the speedups (vertical axis) achieved by the proposed method compared to the previous approach [9]. For the DOT_PRODUCT example, due to its small size, our proposed method shows a speedup of only 2.1 times. For the other examples, however, the speedup is much larger, because the runtime of CI identification is large. Because of its relatively complex topology, SUSAN has the largest number of candidate CIs and a relatively small ratio of A to B, as shown in Table I. Thus its speedup is smaller than that of other examples, such as REAL_UPDATES, FIR, LMS, SHA, and CONVOLUTION.
In our last experiment, to show the benefit of multiple-CI selection, we have set the area constraint to 30,000 µm² and generated one to four CIs for each example. Fig. 8 shows the experimental results. Notice that, for most of the examples, it is worth generating multiple CIs, since the performance increases as we generate more CIs. In the SUSAN example in particular, the second CI brings a significant performance gain. However, we could not generate four CIs for some examples. We could generate only one CI for the DOT_PRODUCT example, and up to three CIs for the SUSAN and CONVOLUTION examples.
VI. CONCLUSIONS
In this paper, we have proposed an efficient framework for multiple CI generation, considering area constraint and resource sharing. To the best of our knowledge, this is the first attempt to find an optimal set of CIs under area constraint. To speed up the design 
