Domain-specific application analysis for customized instruction identification by Amarasinghe Arachchilage, Madhushika et al.
Microprocessors and Microsystems, vol. 38, no. 7, pp. 637-648, 2014.
Domain-specific Application Analysis for
Customized Instruction Identification
Madhushika M. E. Karunarathna, Yu-Chu Tian 1 and
Colin Fidge
School of Electrical Engineering and Computer Science
Queensland University of Technology
Brisbane QLD 4001, Australia
Abstract
With the increasing importance of Application Domain Specific Processor (ADSP)
design, a significant challenge is to identify special-purpose operations for implemen-
tation as a customized instruction. While many methodologies have been proposed
for this purpose, they all work for a single algorithm chosen from the target ap-
plication domain. Such algorithm-specific approaches are not suitable for designing
instruction sets applicable to a whole family of related algorithms. For an entire
range of related algorithms, this paper develops a methodology for identifying com-
pound operations, as a basis for designing “domain-specific” Instruction Set Archi-
tectures (ISAs) that can efficiently run most of the algorithms in a given domain.
Our methodology combines three different static analysis techniques to identify
instruction sequences common to several related algorithms: identification of (non-
branching) instruction sequences that occur commonly across the algorithms; iden-
tification of instruction sequences nested within iterative constructs that are thus
executed frequently; and identification of commonly-occurring instruction sequences
that span basic blocks. Choosing different combinations of these results enables us to
design domain-specific special operations with different desired characteristics, such
as performance or suitability as a library function. To demonstrate our approach,
case studies are carried out for a family of thirteen string matching algorithms.
Finally, the validity of our static analysis results is confirmed through independent
dynamic analysis experiments and performance improvement measurements.
Key words: Customized instructions; special purpose operations; static analysis;
domain-specific analysis
1 Corresponding author: Y.-C. Tian. Phone: +61 7 3138 2177, fax: +61 7 3138
2703. Email: y.tian@qut.edu.au.
Preprint submitted to Elsevier Final version on June 28, 2014
1 Introduction
Application Domain Specific Processors (ADSPs) are designed around oper-
ations that are executed frequently by a domain, or family, of related algo-
rithms, so that a specific property such as the run-time performance of the
implementation is maximized. Dedication of the processor core to a specific
target application domain is provided by a customized Instruction Set Archi-
tecture (ISA). Customized instructions, also known as Specialized Instructions
(SIs), are tailored to execute special-purpose operations found in the target
domain. They improve the performance and power consumption of the pro-
cessor while providing more flexibility [1].
Generation of customized instruction sets has become a significant area of re-
search due to the rapid development of ADSPs. It consists of two steps: custom
instruction identification, and custom instruction implementation [2]. In the
first step, special-purpose operations which can profitably form customized
instructions in an application-specific ISA are identified, while in the second
step the identified operations are formalized and implemented in hardware.
The work of this paper focuses on the first step, i.e., custom instruction iden-
tification. Since the target application domain for an ADSP may not be just
a single algorithm but could be a whole family of related algorithms, this pa-
per presents an analysis method to identify candidate sets of special-purpose
operations that span multiple algorithms.
There are two basic types of approaches for customized instruction identi-
fication: static approaches and dynamic approaches. Existing static analysis
approaches are built on the concept of template matching. With template
matching, a target application is represented as a graph. Then, it is ana-
lyzed to find occurrences of subgraph templates that can be candidate custom
instructions of an application specific processor [3–9]. Alternative to static
analysis approaches, dynamic approaches have also been developed from dif-
ferent perspectives. Making use of the results obtained from profiling tools,
dynamic approaches identify “hot spots” in the application to form the basis
of special-purpose operations [10–12].
Although these approaches have achieved good results to some extent, they
are limited to analysis of a single algorithm at a time for the target application
domain. Consequently, the resultant customized instruction set may not be ef-
fective for other related algorithms which solve the same or similar problems.
Therefore, the goal of this paper is to develop a methodology which is funda-
mentally different from these existing single application dependent approaches
so that the results can be made applicable to a whole family of related al-
gorithms. This will support design of domain specific ISAs whose users can
choose any suitable algorithm from the whole family of algorithms for their
particular requirements.
To achieve the goal, we combine several different existing static analysis tech-
niques to identify candidate instruction sets applicable to multiple algorithms
from a particular family that all solve the same or similar computational
problems. In practice, it is unlikely that any single ADSP instruction set will
benefit all algorithms in a large family, so we exploit different analysis tech-
niques, each of which should find some similarities across a large proportion
of the algorithms. The results of the different analyses can then be merged to
provide the best fit with the algorithms most likely to be employed.
In this paper, we prefer static analyses to dynamic ones for designing the in-
struction set because dynamic analysis results depend on the particular train-
ing data set used. Therefore, dynamic analysis does not guarantee consistent
results for all inputs [13]. However, as will be seen later, we have used dy-
namic analyses to independently evaluate our methodology for thirteen string
matching algorithms in an experimental setup.
The frequency of execution of an operation is considered to be the most
important parameter for measuring how critical an operation is in a static
application analysis. In an automated environment, there are basically two
ways to estimate the frequency of execution of an operation: the number of
its occurrences in the code, and its depth of control-flow nesting. In the first
case, all the occurrences of an operation in the given algorithms are found;
in the second case, deeply nested operations are identified. Either way,
the operations are assumed to execute frequently, either because they occur
frequently in the application code or because the loops that contain them
execute them repeatedly at runtime. The approach in this paper identifies special-purpose
operations in both of these aspects. However, in our automated environment,
this static analysis is performed on basic blocks which are limited to non-
branching operations. Therefore, in order to consider branching code, an ad-
ditional technique based on block-structured control flow analysis has been
used. These approaches work as complementary components in order to cover
all the essential properties of multiple algorithms. The results of these analyses
are then used to guide the final choice of candidate instruction sequences for
building custom operations.
The paper is organized as follows. Section 2 reviews related work and mo-
tivates our research. Section 3 proposes our methodology. Section 4 carries
out case studies. In Section 5, the results obtained from static analyses are
evaluated independently using dynamic analysis. The performance gains of
our approaches are illustrated in Section 6 in an ADSP environment. Finally,
Section 7 concludes the paper.
2 Related Work
During the last few decades, many application analysis methods have been
introduced to identify critical operations of an algorithm. They can be catego-
rized into two groups based on the underlying concepts in the analysis: static
and dynamic approaches. In dynamic approaches, many sophisticated profil-
ing tools have been developed to locate hot-spots in the source code. However,
a particular weakness of dynamic analyses is that they are dependent on the
test data set used during the analysis. If this test set is not characteristic
of the actual data set, then the results can be skewed. Therefore, such dy-
namic analyses are not considered to be a general methodology for different
types of application domains. Compared with dynamic analysis, static analysis
approaches are more reliable. They are more often used than dynamic
approaches because their results do not depend on a particular input data set
and therefore generalize to future executions. For this reason, we favour
static analyses over dynamic ones in the approach developed in this paper.
Most existing static analysis methods use the concept of template matching
to identify special-purpose operations [3–7,14–16]. The objective algorithm is
presented in intermediate representations such as data flow graphs (DFGs)
and control flow graphs (CFGs). Then, previously-defined or automatically-generated
sub-graphs known as templates are considered as potential special-purpose
operations. They are matched throughout the original graph to find
those templates which are most profitable to implement as customized instruc-
tions. Although this approach can produce good results, it suffers from high
computational complexity [2] as well as the challenge of producing the tem-
plates themselves. Therefore, various heuristics or cost functions are deployed
to prune the candidate custom instructions.
The approach by Zhao et al. [3] represents the algorithm as a DFG and per-
forms sub-graph matching to identify special-purpose operations. The results
obtained from static profiling are used to generate TIE (Tensilica Instruction
Extension) customized instructions on Tensilica’s Xtensa tool set [17]. Similarly,
Cong et al. [4] use control data flow graphs (CDFGs) derived from the
objective algorithm, and match instruction patterns with the graph to identify
customized instructions. Using the same concept, another two methods were
proposed by Zhao and Bian [5] and Yang et al. [6], respectively. Zhao and Bian
introduced a “peeling” algorithm to speed up execution while Yang et al. pro-
posed a two-phase design flow of automatic instruction generation. However,
both approaches targeted only a single objective algorithm at a time. These
approaches focused on performance improvement of the template matching
technique for single algorithm analysis. Lin and Fei [8] made use of DFGs and
a template matching technique to identify customized instructions to improve
performance under design matrix restrictions. Similarly, the work presented by
Atasu et al. [7,14] and Tao et al. [16] introduced efficient techniques to identify
custom instructions with or without hardware and other architectural spec-
ifications. However, these approaches also did not fulfill our requirement for
domain-specific ISA design spanning multiple algorithms.
As in this paper, Fanucci et al. [9] discussed designing common
customized instructions for multiple algorithms within an objective applica-
tion domain. However, a set of special-purpose operations was found based on
previous knowledge of the application in their approach. The approach did not
introduce a process to identify previously-unknown common operations. Shirai
et al. [18] performed static analysis on a basic block representation of the tar-
get algorithm. Their purpose was to identify special-purpose operations which
could be used as customized instructions by considering their static frequency.
Although they concentrated on designing a special purpose architecture for
a set of thirteen algorithms, no method was proposed to identify common
operations across the algorithms. Arnold and Corporaal [7] carried out static
analysis on DFG graphs and proposed a template matching algorithm to iden-
tify special-purpose operations. Even though their approach considered a set
of benchmark algorithms as the target application domain, it did not focus
on identifying operations common to the objective algorithms. Notably, Clark
et al. [1] positively addressed this problem by presenting a general methodol-
ogy to identify critical operations of a given application domain. In spite of
targeting multiple algorithms, they design custom instructions for a selected
set of special-purpose operations from a single algorithm. The other objective
algorithms are then generalized so that they can use the already
designed custom instructions. However, the resultant custom instructions of
this method cannot always be generalized to the whole application domain.
This is because other available algorithms are not considered when the special-
purpose operations are selected to implement as the new custom instructions.
This is also confirmed and acknowledged by their experimental results.
Approaches based on dynamic analysis techniques produced good results in the
application analysis stage. However, the main drawback of dynamic techniques
is their high dependence on the specific application considered and the specific
test data set used. Mbaye et al. [10] performed dynamic analysis to identify hot
spots and hence special-purpose operations in the algorithm, and optimized its
execution by substituting customized instructions. Choi et al. [11] measured
the dynamic frequency of each micro operation of the algorithm using profiling
tools. From these results, they identified special-purpose operations which can
be implemented as customized instructions. Making use of cost functions,
Sun et al. [12] employed both dynamic and static analysis results in order to
prune the customized instruction design space. However, all these mentioned
approaches are dependent on one algorithm and thus cannot be directly used
to identify common operations across multiple algorithms.
As we have seen, most traditional special-purpose operation identification
methods target single algorithms. They are not suitable for designing domain-
specific ISAs. To obtain a multi-algorithm solution, correlated analyses are
required over all algorithms [19]. Our methodology to be presented in this
paper does this by applying three different static analysis techniques across a
family of related target algorithms.
3 Development of our Methodology
Figure 1 shows our overall application analysis methodology. The initial input
of the method is a family of related target algorithms which have been imple-
mented in C/C++. From the high-level language source code, assembly-level
and intermediate basic-block representations are obtained prior to the analysis
steps. Most of our analysis strategies are performed at assembly level. This
makes it easier to examine the operations deeply and hence facilitates identifi-
cation of similarities between the operations invisible from the high level imple-
mentation. The final output of the method is a candidate set of special-purpose
operations which can form customized instructions for a domain-specific ISA.
[Figure 1: block diagram. Target Application -> Implementation in C/C++ ->
High level source code & assembly code -> three analyses:
    Sequential Code Analysis -> High frequency special operations
    Loop Nesting Analysis    -> Deeply nested special operations
    Control Flow Analysis    -> Long special operations
-> Merging -> Candidate Special-purpose Operations.]
Fig. 1. Block diagram of our analysis methodology.
As discussed in Section 2, existing automatic static analysis approaches suffer
from their high computational complexity and costly time consumption (even
for a single algorithm) due to their need to construct and traverse complex data
structures. As our analysis considers multiple algorithms, it is important to
reduce the complexity of the problem by developing an alternative light-weight
method. This is particularly important in the costly operation matching phase
involving multiple algorithms. For this purpose, we use a syntactic approach
by working directly on the source code. From the syntactic approach, we derive
instruction operations, operands and data flow and control flow relations. In
our approach, assembly instructions with their operands are considered as
separate individual strings. Thus, the matching phase is narrowed down to a
problem of exact string matching, which is a much lighter approach than a
problem of pattern matching of graphs.
Taking application source codes as inputs, three complementary analysis steps
are possible: sequential code analysis, loop nesting analysis and control flow
analysis. In each of these steps, the applications are analysed in terms of
different characteristics such that different types of critical operations common
to the given algorithms can be identified. In sequential code analysis, non-
branching sequences of instructions that appear most frequently in the code
are identified. In loop nesting analysis, instruction sequences that occur deeply
nested within iterative constructs are identified. In both of these analyses, non-
branching sequences are found within basic blocks, so control flow analysis is
then used to find common branching instruction sequences that span basic
blocks. As will be demonstrated in Section 4, each of these steps produces
different results. These results can then be combined depending on the desired
characteristics of the special-purpose instructions or a chosen target subset of
the algorithm family.
The three analysis steps can be performed in any order since the processes are
independent of one another. They are described below in further detail.
3.1 Generation of Valid Operations at the Assembly Level
As seen in Figure 1, the initial step of the process is to compile high level
source code for a target architecture to obtain assembly instruction code. In
static analysis, the source code of an algorithm needs to be deeply analysed to
identify characteristics and performance behaviour of the algorithm. Assembly
level algorithm representations facilitate understanding or studying function-
alities involved in the algorithm from a low level where micro operations can
be clearly seen. Therefore, in order to achieve deep investigations in the fre-
quency analysis process, we use assembly level analyses with necessary help
of the high level implementation. Conversion from high level to assembly level
is performed with all compiler level optimizations turned off. This allows di-
rect mapping of assembly instructions to corresponding high level source code
whenever it is needed. Before the assembly code is used in the next steps of
the process, the code is filtered to remove function handling instructions such
as call, ret, push and pop, which are not part of this analysis process.
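The filtering step can be illustrated with a minimal Python sketch. It is not the authors' tool: the function names are hypothetical, and the mnemonic list simply repeats the four instructions named in the text, whereas a real target ISA would need its own list of call and return variants.

```python
# Hypothetical mnemonic list taken from the text; a real ISA needs its own list.
FUNCTION_HANDLING = {"call", "ret", "push", "pop"}


def opcode(line):
    """Return the mnemonic of one assembly line, or '' for a blank line."""
    parts = line.split()
    return parts[0] if parts else ""


def filter_function_handling(lines):
    """Drop blank lines and function-handling instructions, keep the rest in order."""
    return [ln.strip() for ln in lines
            if opcode(ln) and opcode(ln) not in FUNCTION_HANDLING]


asm = ["push a1", "l32i.n a1, a2, a3", "call foo", "addi a4, a5, a6", "ret"]
kept = filter_function_handling(asm)
```

Only the two non-function-handling instructions remain in `kept`, in their original order.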
The filtered assembly source code is then used to generate all possible non-
branching operations of a given algorithm. Assembly instructions are read
line by line (one instruction at a time) and stored consecutively in a string
array. Starting from a size of two (instruction sequences which consist of two
instructions) to the maximum allowed length of expected sequences, all pos-
sible instruction sequences are generated by binding sequential instructions
together as sequences. For instance, the assembly instruction source segment
shown in Figure 2 forms a maximum of six different instruction sequences in-
cluding itself. In this example, the smallest operation is two instructions long
and the longest one is limited to four instructions since the original assembly
code is only four instructions long.
Assembly instruction code:
    l32i.n   a1, a2, a3
    addi     a4, a5, a6
    addx4    a7, a8, a9
    l32i.n   a10, a11, a12

Other possible operations:
    (1) l32i.n   a1, a2, a3
        addi     a4, a5, a6

    (2) addi     a4, a5, a6
        addx4    a7, a8, a9

    (3) addx4    a7, a8, a9
        l32i.n   a10, a11, a12

    (4) l32i.n   a1, a2, a3
        addi     a4, a5, a6
        addx4    a7, a8, a9

    (5) addi     a4, a5, a6
        addx4    a7, a8, a9
        l32i.n   a10, a11, a12

Fig. 2. Sample assembly instruction code segment and possible instruction
subsequences.
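The enumeration described above can be sketched in a few lines of Python. This is an illustrative sketch under our own naming (`contiguous_sequences` and `max_len` are hypothetical), using the four instructions of Figure 2 as input.

```python
def contiguous_sequences(instructions, max_len):
    """Return every run of 2..max_len consecutive instructions as a tuple."""
    n = len(instructions)
    result = []
    for length in range(2, min(max_len, n) + 1):
        for start in range(n - length + 1):
            result.append(tuple(instructions[start:start + length]))
    return result


block = [
    "l32i.n a1, a2, a3",
    "addi a4, a5, a6",
    "addx4 a7, a8, a9",
    "l32i.n a10, a11, a12",
]
sequences = contiguous_sequences(block, max_len=4)
```

For the four-instruction segment this yields three sequences of length two, two of length three and the segment itself, which agrees with the count of six given in the text.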
However, in order to map the generated instruction sequences to all possible
valid operations, further analyses are carried out. Instructions belonging to
different basic blocks are executed with different frequencies and hence their
characteristics can be different from each other. However, they may appear to
be consecutive in the compiled code, making it difficult to differentiate their
different execution frequencies. Therefore, in this analysis, by tracing bound-
ary operations, basic blocks are identified and separated from each other.
These basic blocks are then considered as a single block of operations which
act independently. The possible instruction sequences are formed only within
the boundaries of basic blocks. Such operations generated by all the considered
algorithms are analysed during the Sequential Code Analysis and Loop Nest-
ing Analysis steps, which consider operations within basic blocks. In contrast,
in the Control Flow Analysis step the operations may span basic blocks.
Assembly instruction code:
    l32i.n   a1, a2, a3          (basic block 1)
    addi     a4, a5, a6          (basic block 1)
    movi     a7, a8              (basic block 1)
    bne      a9, a10, a11        (branch: basic block boundary)
    addx4    a12, a13, a14       (basic block 2)
    l32i.n   a15, a16, a17       (basic block 2)
    s32i.n   a18, a19, a20       (basic block 2)

All possible operations:
    (1) l32i.n   a1, a2, a3
        addi     a4, a5, a6

    (2) addi     a4, a5, a6
        movi     a7, a8

    (3) l32i.n   a1, a2, a3
        addi     a4, a5, a6
        movi     a7, a8

    (4) addx4    a12, a13, a14
        l32i.n   a15, a16, a17

    (5) l32i.n   a15, a16, a17
        s32i.n   a18, a19, a20

    (6) addx4    a12, a13, a14
        l32i.n   a15, a16, a17
        s32i.n   a18, a19, a20

Fig. 3. Sample assembly instruction code segments belonging to two basic blocks
and all their possible instruction sequences.
In the automatic implementation, separation of the basic blocks and genera-
tion of valid operations are performed using string matching techniques. Typically,
at the assembly instruction level, instructions like “branch”, “call”
and “return” imply boundaries of basic blocks (entry and exit points) of
the source code. While reading the assembly instructions, each instruction is
matched with a pre-defined list of operations, which can be a potential entry
or exit point of a basic block. Once these instructions are found in the code,
the sets of instructions are broken into different basic blocks. An example of
such a scenario is presented in Figure 3. In this example, the original assembly
code is divided into two parts by the branch instruction “bne”. Therefore, the
instructions before and after “bne” (shown with differently coloured
backgrounds in the figure) belong to two different basic blocks. From these
two basic blocks, six distinct operations are generated.
Operations resulting from such a validation step are called valid instruction
sequences, which are truly executed sequentially at runtime.
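The validation step can be sketched as follows in Python. This is a simplified illustration, not the authors' implementation: the boundary list `BOUNDARY_MNEMONICS` is a hypothetical example, boundary instructions are simply dropped, and block entry points at branch targets (labels) are not modelled.

```python
# Hypothetical list of mnemonics treated as basic block boundaries.
BOUNDARY_MNEMONICS = {"beq", "bne", "bge", "blt", "j", "call", "ret"}


def split_basic_blocks(lines):
    """Cut the instruction list at every boundary instruction."""
    blocks, current = [], []
    for line in lines:
        if line.split()[0] in BOUNDARY_MNEMONICS:
            if current:
                blocks.append(current)
            current = []          # the boundary instruction itself is dropped
        else:
            current.append(line)
    if current:
        blocks.append(current)
    return blocks


def valid_sequences(lines, max_len):
    """Enumerate runs of 2..max_len instructions that stay inside one block."""
    sequences = []
    for block in split_basic_blocks(lines):
        for length in range(2, min(max_len, len(block)) + 1):
            for start in range(len(block) - length + 1):
                sequences.append(tuple(block[start:start + length]))
    return sequences


code = [
    "l32i.n a1, a2, a3", "addi a4, a5, a6", "movi a7, a8",
    "bne a9, a10, a11",
    "addx4 a12, a13, a14", "l32i.n a15, a16, a17", "s32i.n a18, a19, a20",
]
blocks = split_basic_blocks(code)
valid = valid_sequences(code, max_len=4)
```

On the code of Figure 3 the sketch finds two basic blocks of three instructions each and six valid sequences, none of which crosses the “bne” instruction.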
3.2 Sequential Code Analysis
Once all possible valid instruction sequences are generated within basic blocks,
two different frequency analyses are carried out: sequential code analysis and
loop nesting analysis. In this section, Sequential Code Analysis is described
in detail. In our method, the static frequency of each operation in the code
is counted automatically. All the generated valid instruction sequences from
the above step are considered as candidate operations; and for each operation,
their occurrences within each algorithm are counted by pattern matching.
Matching of two operations is performed as an exact string matching task
so that computational complexity and time consumption can be significantly
reduced in comparison with existing methods. Valid operation lists obtained
for each objective algorithm are read line by line to consider one instruction
sequence at a time.
In the matching phase, operations are matched both syntactically and
semantically. As illustrated in Figure 4, in syntactic matching, the sequences
of instruction opcodes are matched, while in semantic matching the corresponding
operands are matched. The syntactic matching is much more straightforward
as it only needs to match the opcodes. However, checking whether the operands
are consistent requires a more sophisticated semantic analysis, as will be ex-
plained below.
Sequence 1:                      Sequence 2:
    l32i.n a1, a2, a3                l32i.n b1, b2, b3
    addi   a4, a5, a6                addi   b4, b5, b6
    movi   a7, a8                    movi   b7, b8

Syntactic matching (opcodes):
    l32i.n addi movi                 l32i.n addi movi

Semantic matching (operands):
    a1, a2, a3, a4, a5, a6, a7, a8   b1, b2, b3, b4, b5, b6, b7, b8

Fig. 4. Syntactic and semantic matching of two instruction sequences.
In addition, to improve the efficiency of the matching phase, the candidate
sequences are grouped by length (number of instructions) and matching is
carried out only among sequences of the same length. This cuts down the
huge set of operations to be matched against the current sequence. Thus, if the
maximum length of instruction sequences is n, the program will separate all
possible sequences into n − 1 sets (from sequence length 2 to n) and matching
will be performed on one set at a time.
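The length-based partitioning and the syntactic (opcode-only) comparison can be sketched as below. The helper names are hypothetical, and the three input sequences are taken from the Figure 4 example plus one shorter sequence added for contrast.

```python
from collections import defaultdict


def opcode_signature(sequence):
    """Opcodes of a sequence, ignoring operands (used for syntactic matching)."""
    return tuple(instr.split()[0] for instr in sequence)


def bucket_by_length(sequences):
    """Group candidate sequences so that only equal-length ones are compared."""
    buckets = defaultdict(list)
    for seq in sequences:
        buckets[len(seq)].append(seq)
    return buckets


def syntactic_matches(candidate, buckets):
    """Sequences of the same length whose opcode lists equal the candidate's."""
    wanted = opcode_signature(candidate)
    same_length = buckets.get(len(candidate), [])
    return [seq for seq in same_length if opcode_signature(seq) == wanted]


seq1 = ("l32i.n a1, a2, a3", "addi a4, a5, a6", "movi a7, a8")
seq2 = ("l32i.n b1, b2, b3", "addi b4, b5, b6", "movi b7, b8")
short = ("addi a4, a5, a6", "movi a7, a8")
buckets = bucket_by_length([seq1, seq2, short])
matches = syntactic_matches(seq1, buckets)
```

Because `short` lands in a different bucket, it is never compared with `seq1`; this is the saving that the partitioning is meant to provide.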
To perform semantic matching, previously identified syntactically matched
operations are considered. When the sequences are read one by one, a list of
symbols is substituted for the original operands, based on the type and the
value of the operand. This is done by consistently renaming operands based on
their type and usage in order to normalize the representation of the instruc-
tion sequence. Thus, if there are two semantically equal sequences, their list
of standardized operand symbols will be the same and hence the two
corresponding sequences can be considered similar. Figure 5 illustrates an
example of two such sequences. In Figure 5, the common representation is the
sequence obtained by assigning symbols to the operands of both sequences.
Consequently, semantically similar sequences will have the same pattern of
operation names and operand symbol lists. The static frequency of instruction
sequences is then obtained by counting the number of occurrences of the
normalized instruction sequences.
Sequence 1:                      Sequence 2:
    mov    a1, 60007494              mov    a10, 80557065
    addi.n a2, a2, a1                addi.n a6, a6, a10
    subi.n a2, a2, 25                subi.n a6, a6, 52
    s32i.n a4, a2, a1                s32i.n a0, a6, a10

Common representation:
    mov    A, Add
    addi.n B, B, A
    subi.n B, B, Im
    s32i.n C, B, A

Fig. 5. Semantic matching of two instruction sequences.
After performing the above analysis for the objective algorithms, the instruc-
tion sequences with high frequencies are selected as the final set of candidate
operations from the sequential code analysis step.
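Operand normalization and frequency counting can be sketched as follows. This is a minimal Python illustration rather than the authors' implementation. Two rules are our own assumptions: registers are recognized by the pattern a0, a1, ..., and a literal of exactly eight hexadecimal digits is classified as an address ("Add") while any other literal is an immediate ("Im").

```python
import re
from collections import Counter
from string import ascii_uppercase

REGISTER = re.compile(r"^a\d+$")            # assumed register naming: a0, a1, ...
ADDRESS = re.compile(r"^[0-9a-fA-F]{8}$")   # assumed rule: 8 hex digits = address


def symbol(index):
    """A, B, C, ... for the first 26 registers, then R26, R27, ..."""
    return ascii_uppercase[index] if index < 26 else "R%d" % index


def normalize(sequence):
    """Rename operands consistently, in order of first appearance."""
    names = {}
    out = []
    for instr in sequence:
        parts = instr.split(None, 1)
        mnemonic = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        symbols = []
        for op in (o.strip() for o in rest.split(",") if o.strip()):
            if REGISTER.match(op):
                if op not in names:
                    names[op] = symbol(len(names))
                symbols.append(names[op])
            elif ADDRESS.match(op):
                symbols.append("Add")
            else:
                symbols.append("Im")
        out.append(mnemonic + " " + ", ".join(symbols))
    return tuple(out)


seq1 = ("mov a1, 60007494", "addi.n a2, a2, a1",
        "subi.n a2, a2, 25", "s32i.n a4, a2, a1")
seq2 = ("mov a10, 80557065", "addi.n a6, a6, a10",
        "subi.n a6, a6, 52", "s32i.n a0, a6, a10")
frequency = Counter(normalize(s) for s in (seq1, seq2))
```

Both sequences of Figure 5 normalize to the same tuple, so the counter records a static frequency of two for that common representation.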
3.3 Loop Nesting Analysis
This is the second analysis step to determine how critical operations are in
terms of their execution frequency. If an operation is nested within loops, it
is likely to execute frequently at runtime. Therefore, it can become a critical
operation of the algorithm. This analysis takes valid instruction sequences
generated for all considered algorithms as input and produces these operations
with their corresponding nesting levels. As shown in Figure 1, in order to map
nesting levels to corresponding instruction sequences, high level source code
is also used.
Determining nesting levels in assembly instructions is not trivial. This is be-
cause loops and iterative codes cannot be clearly identified. Therefore, firstly,
nesting levels of the algorithm are marked by examining its high level lan-
guage implementation. Then, the high level source code is mapped to the
corresponding assembly instructions and the nesting level of each line is ap-
pended to all its corresponding instructions. Since the assembly instructions
are obtained with compiler optimizations turned off, the high level to assembly
level mapping can be performed unambiguously.
To perform this task in an automated environment, we have developed a parser
that reads high level source code in C/C++ and identifies loops and other
specific programming statements such as conditional statements. According
to pre-defined criteria, a numerical value is assigned as the nesting level of
each line based on its syntactic nesting in the code. Different weightings are
used depending on whether the code segment of interest appears within a loop or
a conditional statement. Code appearing within loops is assumed to execute
more often. But if a code segment is executed conditionally, its frequency
is assumed to be reduced. Therefore, for instance, if a statement is situated
inside a loop, the nesting level is increased by one; and if it is inside a
conditional statement, then the value is decreased proportionally to the number
of conditional branches as illustrated in the example in Figure 6.

[Figure 6: tree of nested if / else if / else statements annotated with
execution probabilities. Each branch receives the probability of its parent
multiplied by one over the number of alternative branches:
    1 * (1/2) = 0.5 and 1 * (1/3) = 0.33 at the first level,
    0.5 * (1/2) = 0.25 at the second level,
    0.25 * (1/3) = 0.083 at the third level.]
Fig. 6. Demonstration example for calculating the execution probability of
conditional statements (here the probability of execution of a root statement
is considered as 1).
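The weighting rule can be written as a small Python function. This is a sketch under stated assumptions: the encoding of the enclosing constructs as a list of tuples is hypothetical, and the way loop increments and conditional factors are combined along a mixed path (applied in order from the outermost construct inwards) is our own reading of the rule, not something the text specifies.

```python
def nesting_value(path, root=1.0):
    """Nesting value of a statement given its enclosing constructs.

    path lists the enclosing constructs from outermost to innermost, each
    either ("loop",) or ("cond", number_of_branches).
    """
    value = root
    for construct in path:
        kind = construct[0]
        if kind == "loop":
            value += 1                       # inside a loop: one level deeper
        elif kind == "cond":
            value *= 1.0 / construct[1]      # one of n alternative branches
        else:
            raise ValueError("unknown construct: %r" % (kind,))
    return value


two_way = nesting_value([("cond", 2)])
three_way = nesting_value([("cond", 3)])
deep = nesting_value([("cond", 2), ("cond", 2), ("cond", 3)])
in_loop = nesting_value([("loop",)])
```

With a root value of 1, the conditional-only paths give the values annotated in Figure 6 (0.5, 0.33 and 0.083 after rounding), and a statement directly inside one loop gets the value 2.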
In the implementation, we have used a pretty printer with custom settings
to standardize the C/C++ source code. It automatically inserts the scoping
brackets in the loops and the conditional statements wherever applicable in
the code. Thus, the implementations of all the algorithms are made consistent
before they are fed into the parser. In our case studies, the open source Artis-
tic Style 2.04 pretty printer is used with customized options. Thus, a com-
plete parser tool set is capable of identifying for, while, do-while, if-else,
switch-case statements accurately, simply by counting opening and closing
braces.
To perform this task, a string array is used to act as a stack. When the
parser enters a loop or a conditional statement, the opening bracket “{” is
identified and the type of statement is pushed onto the stack. When the closing
bracket “}” is read, the parser exits a statement and the stack is popped. This
functionality is illustrated in Figure 7.
Sample source code:
    if((f = fopen(path, "rb")) == NULL)
    {
        md5_starts(&ctx);
        while((n = fread(buf, 1, sizeof(buf), f)) > 0)
        {
            md5_update(&ctx, buf, n);
            if(ferror(f) != 0)
            {
                fclose(f);
                for(i = 0; i < keylen; i++)
                {
                      ctx -> ipad[i] = (unsigned char)(ctx -> ipad[i] ^ key[i] );
                      ctx -> opad[i] = (unsigned char)(ctx -> opad[i] ^ key[i] );
                }
                return(POLARSSL_ERR_MD5_FILE_IO_ERROR);
            }
        }
        md5_finish(&ctx, output);
        memset(&ctx, 0, sizeof(md5_context)) ;
        return(POLARSSL_ERR_MD5_FILE_IO_ERROR);
    }

Bracket stack (in push order):
    ifBracket
    whileBracket
    ifBracket
    forBracket

Fig. 7. Stacking the brackets in Loop Nesting Analysis.
In the code segment shown in Figure 7, when the parser reads the first line,
it identifies the if condition and its opening bracket (“{”) and pushes an
ifBracket onto the bracket stack. Likewise, whileBracket, ifBracket and finally
forBracket are pushed in turn when their opening brackets are read by the
parser. Then, when the closing bracket of the for loop is read, the top bracket
of the stack is popped. Each time an opening bracket is popped from
the stack, the nesting value is calculated based on the type of the bracket
(for, while, if, else, etc.). Thus, nesting levels for all lines in the algorithm
are determined and appended as comments to each line. This is shown in
Figure 8.
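The bracket-stack mechanism can be sketched in Python as follows. This is an illustration only, with hypothetical names. It assumes the source has already been normalized so that every brace stands on its own line, it counts loop brackets only (the conditional weighting is left out for brevity), and it starts from a base level of 1.

```python
import re

KEYWORD = re.compile(r"^\s*(else\s+if|for|while|do|if|else|switch)\b")
LOOPS = {"for", "while", "do"}


def annotate(lines, base=1):
    """Pair each source line with its loop nesting level."""
    stack = []        # a plain list plays the role of the bracket stack
    pending = None    # statement type waiting for its opening brace
    annotated = []
    for line in lines:
        stripped = line.strip()
        if stripped == "{":
            stack.append(pending or "block")    # push the statement type
            pending = None
        elif stripped.startswith("}"):
            if stack:
                stack.pop()                     # leave the statement
        else:
            match = KEYWORD.match(line)
            pending = match.group(1) if match else None
        level = base + sum(1 for kind in stack if kind in LOOPS)
        annotated.append((line, level))
    return annotated


source = ["for(i = 0; i < 255; i++)", "{", "    pow[i] = x;", "}", "low[x] = x;"]
levels = [level for _, level in annotate(source)]
```

For this five-line fragment the levels come out as 1, 2, 2, 1, 1: the loop header and the code after the loop stay at the base level, while the opening brace and the loop body are one level deeper.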
The nesting values of the high-level source code lines are then mapped to the
corresponding instructions (Figure 8) and, in turn, to instruction sequences.
Thus, the resultant instruction sequences of all algorithms carry a measure of
their criticality in terms of their nesting levels.
3.4 Control Flow Analysis
Apart from the sequential code analysis and loop nesting analysis outlined
above, control flow analysis is also used to identify special-purpose operations
in our method. In the previous two steps, sequential instruction sequences have
been identified only within basic blocks, so potential special-purpose
operations which span more than one basic block are ignored. This step of control
flow analysis aims to identify commonly-occurring branching code segments,
 
for(i = 0; i < 255; i++)   //,1
{   //,2
     pow[i] = x;   //,2
}   //,1
low[x] = x;   //,1
l32i.n     a1, a2, 0   //,2
addmi    a3, a2, 0x800   //,2
addi       a3, a3, 32   //,2
addx4    a1, a1, a2   //,2
addmi    a4, a2, 0x800   //,2
l32i.n     a4, a4, 20   //,2
s32i.n    a4, a1, 0   //,2
addmi    a5, a2, 0x800   //,1
l32i.n     a5, a5, 16   //,1
addi.n    a5, a5, 1   //,1
addmi    a4, a2, 0x800   //,1
s32i.n    a5, a4, 16   //,1
addmi    a6, a2, 0x800   //,1
l32i.n     a6, a6, 16   //,1
movi      a7, 254   //,1
bge        a7, a6, 60007494   //,1
l32i.n     a8, a2, 4   //,1
addmi    a9, a2, 0x800   //,1
addi       a9, a9, 80   //,1
addx4    a10, a10, a9   //,1
movi      a11, 255   //,1
s32i.n    a11, a10, 0   //,1
Fig. 8. Nesting level assignment of a sequence of instructions according to
corresponding high level source code.
which may be longer than those restricted to just one basic block. In our
control flow analysis, our main focus is on identification of critical loops in
the algorithms. These code segments are likely to form more beneficial custom
instructions due to their repetitive execution at runtime [10].
To perform control flow analysis, the control flow graph (CFG) for each
algorithm is obtained from its intermediate representation. The nodes of a CFG
represent basic blocks of an algorithm, and the edges represent control flow
between them. The CFGs are then analysed in order to identify the iterative
code segments (loops) of the algorithm. Firstly, the dominators of the nodes
of the graph are identified. A node A dominates node B if and only if A is the
unique immediate predecessor of B or A is a dominator of all immediate
predecessors of B [20] (informally, every path from the entry node to B must
pass through A). For instance, in Figure 9, node BB1 is a dominator of node
BB5. Then, back edges are identified. A “back edge” is an edge where the node
at the head is a dominator of the node at the tail. Next, all the nodes and
edges in between the head and tail of the back edge are identified. The
selected set of basic blocks is considered to be a loop (Figure 9) [21] if,
apart from the back edge, incoming edges enter the set only at the basic block
at the head of the back edge, and the basic block at the tail of the back edge
has only one other outgoing edge. In this way, all the iterative code segments
of the objective algorithms are found.
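The loop identification steps above (dominators, back edges, loop body and the
entry/exit check) can be sketched in Python as follows. This is an
illustrative sketch only: the function names are hypothetical, the dominator
sets are computed with the standard iterative intersection scheme rather than
any procedure taken from the case studies, and the edge set of the example
graph is an assumption loosely modelled on Figure 9, whose exact edges are not
reproduced here.

```python
def predecessors(cfg):
    """Map each node to the set of nodes that have an edge into it."""
    preds = {node: set() for node in cfg}
    for src, successors in cfg.items():
        for dst in successors:
            preds[dst].add(src)
    return preds


def dominators(cfg, entry):
    """Iteratively compute, for every node, the set of nodes dominating it."""
    preds = predecessors(cfg)
    dom = {node: set(cfg) for node in cfg}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for node in cfg:
            if node == entry or not preds[node]:
                continue
            common = set.intersection(*(dom[p] for p in preds[node]))
            updated = {node} | common
            if updated != dom[node]:
                dom[node] = updated
                changed = True
    return dom


def back_edges(cfg, dom):
    """Edges (tail, head) whose head dominates their tail."""
    return [(tail, head) for tail, succs in cfg.items()
            for head in succs if head in dom[tail]]


def loop_body(cfg, tail, head):
    """Nodes lying between the head and the tail of a back edge."""
    preds = predecessors(cfg)
    body, work = {head}, []
    if tail != head:
        body.add(tail)
        work.append(tail)
    while work:
        for p in preds[work.pop()]:
            if p not in body:
                body.add(p)
                work.append(p)
    return body


def is_single_entry_exit_loop(cfg, body, tail, head):
    """Entry only at the head; the tail has one outgoing edge besides the back edge."""
    preds = predecessors(cfg)
    entry_ok = all(p in body for n in body if n != head for p in preds[n])
    other_out = [s for s in cfg[tail] if s != head]
    return entry_ok and len(other_out) == 1


# Hypothetical edge set for a six-block graph in the spirit of Figure 9.
cfg = {"BB1": ["BB2"], "BB2": ["BB3", "BB4"], "BB3": ["BB5"],
       "BB4": ["BB5"], "BB5": ["BB2", "BB6"], "BB6": []}
dom = dominators(cfg, "BB1")
edges = back_edges(cfg, dom)
body = loop_body(cfg, *edges[0])
print(edges, sorted(body), is_single_entry_exit_loop(cfg, body, *edges[0]))
```

With the assumed edges, BB1 dominates BB5, the only back edge runs from BB5 to
BB2, and the loop consists of BB2, BB3, BB4 and BB5.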
The above analysis is applied over all objective algorithms to find their loops.
Then, the most frequently occurring such loops are selected as the final set of
special-purpose operations obtained from this step.
[Figure: control flow graph with basic blocks BB1 to BB6 and one back edge.]
Fig. 9. Loop segment in a control flow graph.
3.5 Selecting the Final Candidate Instruction Sequences
After the three analyses described above have been completed, the final step
of our method is to choose a set of instruction sequences that will form our
special-purpose ADSP operations. As shown in Figure 1, the main objective of
this step is to merge the three sets of candidate instruction sequences obtained
in the previous steps. Therefore, all the selected instruction sequences are
firstly matched with each other to eliminate duplicates.
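As a minimal sketch of this merging step, the three candidate lists can be
combined by treating each instruction sequence as a whitespace-normalised
tuple and recording which analyses produced it. The function name
merge_candidates and the example inputs are hypothetical; the individual
sequences are taken from Tables 2 and 3, but the overlap between the two lists
is made up purely to show the de-duplication.

```python
def merge_candidates(**candidates_by_analysis):
    """Merge candidate sequences, keeping one entry per distinct sequence.

    Returns a dict mapping each normalised sequence (a tuple of instruction
    strings) to the list of analyses that reported it.
    """
    merged = {}
    for analysis, sequences in candidates_by_analysis.items():
        for sequence in sequences:
            # Collapse runs of whitespace so that formatting differences
            # do not hide a duplicate.
            key = tuple(" ".join(instr.split()) for instr in sequence)
            sources = merged.setdefault(key, [])
            if analysis not in sources:
                sources.append(analysis)
    return merged


add_two_integers = ("mov a, %X", "add b, %X")
compare_element = ("mov a, %X", "shl A, %X", "add c, %X",
                   "mov (%X), %X", "cmp al, dl")

merged = merge_candidates(
    sequential=[add_two_integers],
    # Hypothetical duplicate (with different spacing) plus a longer sequence.
    loop_nesting=[("mov a,   %X", "add b, %X"), compare_element],
    control_flow=[],
)
for sequence, sources in merged.items():
    print(len(sequence), "instructions, found by", sources)
```

Keeping the list of originating analyses alongside each sequence makes it easy
to see later which selection parameters apply to a given candidate.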
The target is to select the most beneficial set of special-purpose operations
from the resultant candidate set. When designing a customized instruction,
there are many software and hardware parameters to be considered in finding an
optimized solution. From the three analysis approaches carried out to find the
special-purpose operations, a list of parameters is identified, as shown in
Table 8, as the most important set for selecting suitable customized
instructions in these case studies. For instance, since custom instructions
need to be used by all or most of the algorithms of the application domain,
the commonality of an operation among them is important. Likewise, other
important parameters such as execution frequency and hardware cost
(instruction latency, energy consumption and area cost) are also considered
when selecting suitable operations. The resultant instruction sequences
from this step are the final set of candidate special-purpose operations suitable
for implementation as customized instructions of a domain-specific ISA.
4 Case Studies
To validate our approach, we have performed a number of large-scale case
studies to show how the overall process works. The case studies involve a
typical algorithm family where common special-purpose operations are not
immediately obvious. They have been carried out on a set of thirteen string
matching algorithms. The implementation of these algorithms is taken from
the Handbook of Exact String Matching Algorithms by Charras and Lecroq [22].
4.1 Experimental Design
String matching algorithms are used to identify all occurrences of a given
substring, known as a pattern, in a given text [23]. They are fundamental to
many application areas, e.g., text processing, information retrieval,
natural language processing and bioinformatics. Because the expected
performance of the matching process varies depending on the objective
application area, string matching algorithms have been developed to improve
different parameters such as accuracy, speed, software/hardware resources and
power consumption.
For our experiment, we choose the set of commonly used string matching
algorithms shown in Table 1. All these algorithms perform the same match-
ing functionality in different ways. Our experimental results are useful in the
challenging task of designing an ADSP suitable for executing a large subset
of these algorithms.
In the experimental set-up, the thirteen string matching algorithms are divided
into two groups. Nine of them are used as a training set, while the remaining
four are kept in reserve as a test set. The test set algorithms are selected to
represent different deviations of related algorithms based on their historical
background. As the AG, KMP, RT and GG algorithms are historically different
from each other, they are selected as the test set.
The experiments are conducted in several phases. The first phase is to
analyze the training set algorithms using our three analysis steps described in
Section 3. From these analyses, the most critical set of operations common
to these algorithms is identified. Then, such special-purpose operations for
the test algorithms are also identified independently using our methodology.
After that, the resultant operations are compared with each other for both
the training set and test set. From these comparisons, the accuracy of our
analyses is evaluated quantitatively.
Table 1
Thirteen string matching algorithms.
Set           Algorithm              Abbreviation
Training Set  Boyer-Moore            BM
              Brute Force            BF
              Colussi                CL
              Horspool               HP
              Morris-Pratt           MP
              Turbo Boyer-Moore      TBM
              Zhu-Takaoka            ZT
              Not So Naive           NSN
              Shift Or               SO
Test Set      Apostolico-Giancarlo   AG
              Knuth-Morris-Pratt     KMP
              Galil-Giancarlo        GG
              Raita                  RT
4.2 Results of Sequential Code Analysis for Training Algorithms
As explained in Section 3.2, assembly codes are derived from their C/C++
implementations for the nine training algorithms. Although our case studies
are carried out on an x86 32-bit and Xtensa RISC ISA (for evaluation
purposes) [17], the methodology is independent of the particular processor
because it does not use any architecture dependent analysis. After the assembly
codes are obtained, syntactic and semantic analyses are performed on all the
nine algorithms.
Not surprisingly, for such a large set of algorithms, no significant operations
common to all the nine objective algorithms have been found from our
sequential analysis. Given the different characteristics of these nine
algorithms, there is little chance of finding a non-trivial operation
common to all of them. However, it is observed that some subsets of
the algorithms perform the same types of operations, so these algorithms
can be grouped together based on the similarities of their operations. Table 2
shows common special-purpose operations identified from one such subset of
algorithms, comprising the Colussi, Boyer-Moore and Horspool algorithms. It
lists the most frequently found operations. It is seen from Table 2 that
at this simple level of analysis, the common instruction sequences identified
are relatively short.
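To make the notion of occurrence counts concrete, the following sketch counts
how often a normalised instruction sequence appears in each algorithm's
listing by sliding a window over it. It is an illustration under stated
assumptions: the function names and the three toy listings ALG1 to ALG3 are
hypothetical and are not the listings analysed in the case studies, and
operands are assumed to be already normalised to placeholders such as a, b
and %X, in the style of Table 2.

```python
def count_occurrences(listing, pattern):
    """Count (possibly overlapping) occurrences of pattern in listing."""
    pattern = tuple(pattern)
    width = len(pattern)
    return sum(1 for start in range(len(listing) - width + 1)
               if tuple(listing[start:start + width]) == pattern)


def commonality(listings, pattern):
    """Number of algorithms whose listing contains the pattern at least once."""
    return sum(1 for listing in listings.values()
               if count_occurrences(listing, pattern) > 0)


# Hypothetical, already-normalised toy listings.
toy_listings = {
    "ALG1": ["mov a, %X", "add b, %X", "mov %X, c", "mov a, %X", "add b, %X"],
    "ALG2": ["mov a, %X", "add b, %X"],
    "ALG3": ["mov a, %X", "mov b, %Y"],
}
add_two_integers = ("mov a, %X", "add b, %X")  # operation 2 in Table 2

counts = {name: count_occurrences(listing, add_two_integers)
          for name, listing in toy_listings.items()}
print(counts, commonality(toy_listings, add_two_integers))
```

In this toy data the sequence occurs twice in ALG1, once in ALG2 and not at
all in ALG3, so it is common to two of the three listings.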
Table 2
Sequential code analysis results for the Colussi, Boyer-Moore and Horspool
algorithms (A: Immediate value; X, Y: Registers; a, b: Stack addresses;
m, n, i, j: Integer values; p: Integer array).
No.  Operation                  Instruction seq.        CL   BM   HP
1    p[i]                       mov a, %X               38   19   5
     (Calculate address of      mov b, %Y
     an array element)          mov (%Y, %X, 4), %Y
2    m + n                      mov a, %X               37   44   12
     (Add two integers)         add b, %X
3    p[i + j]                   mov a, %X               8    6    3
     (Obtain next array         add A, %X
     element)                   mov b, %Y
                                mov (%Y, %X, 4), 4
4.3 Results of Loop Nesting Analysis for Training Algorithms
The training set of the string matching algorithms in Table 1 is analyzed
to identify commonly-occurring operations that appear nested within loops.
As with the previous sequential analysis, subsets of the group of nine
training algorithms are found to have loop-nested operations in common. For
instance, Tables 3 and 4 show the special-purpose operations identified in one
such group, comprising the Boyer-Moore, Zhu-Takaoka and Turbo Boyer-Moore
algorithms. For this particular subset, much longer instruction sequences have
been identified in comparison with the previous sequential analysis. Such
long sequences are good candidates for special-purpose ADSP operations.
4.4 Results of Control Flow Analysis for Training Algorithms
In control flow analysis, all nine algorithms in the training set are represented
as Control Flow Graphs in order to identify iterative code segments as
described in Section 3.4. Then, the identified operations are filtered to select
only common operations as candidate special-purpose operations.
In the experiment, loop implementations such as for, while and do-while
loops are found, as shown in Table 5. Once again, while no such construct
is common to all nine algorithms, it is seen from Table 5 that certain loop
structures occur in several of the algorithms. Such large code segments also
offer good choices of special-purpose ADSP operations.
Table 3
Approximately three nesting levels: loop nesting analysis results for the
Boyer-Moore, Zhu-Takaoka and Turbo Boyer-Moore algorithms (A, B: Immediate
values; X, Y: Registers; a, b, c, d, e: Stack addresses; al, dl: 8-bit
Register values; m, n, i: Integer values; p: Integer arrays).
No.  Operation                  Instruction seq.
1    p[i] == p[m + i]           mov a, %X
     (Comparing next            mov b, %X
     character of               add c, %X
     pattern with text)         movz (%X), %Y
                                mov e, %X
                                add b, %X
                                add d, %X
                                movz (%X), %X
                                cmp al, dl
2    p[i] = m − 1 − i           mov a, %X
     (Calculating next          shl A, %X
     shift of the               mov %X, %Y
     window)                    add d, %X
                                mov c, %X
                                sub B, %X
                                sub d, %X
                                mov %X, (%Y)
3    p[i] == m                  mov a, %X
     (Compare value of          shl A, %X
     an array element)          add c, %X
                                mov (%X), %X
                                cmp al, dl
4.5 Results for the Test Algorithms Set
Having identified common operations from the training set, our analysis steps
are then applied to the four test set algorithms. In these four algorithms,
the number of occurrences, nesting levels and existence of the corresponding
operations are again counted. Table 6 summarizes the most critical operations
found from all three analyses for the test algorithms.
It is seen from Table 6 that most of the resultant critical operations identified
from the test set algorithms are similar to those found from the training set
algorithms. This is confirmed by the comparisons shown in Table 7. Therefore,
our analysis process is capable of recognizing similar operations both within
and between sets of algorithms.
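The agreement between the two sets can be made explicit with a small overlap
count over the ranked operations listed in Table 7. The sketch below is an
illustration only: the function name rank_agreement and the two measures it
reports (operations shared, and operations at the same rank) are assumptions
and are not claimed to be the quantitative measure used in this paper. The
control flow entries refer to the operation numbers of Table 5.

```python
def rank_agreement(training, test):
    """Return (number of shared operations, number at an identical rank)."""
    shared = len(set(training) & set(test))
    same_rank = sum(1 for a, b in zip(training, test) if a == b)
    return shared, same_rank


# Ranked operations as listed in Table 7 (training set, test set).
table7 = {
    "Sequential Code Analysis": (
        ["p[i]", "m + n", "p[i + j]"],
        ["p[i]", "m + n", "p[i + j]"],
    ),
    "Loop Nesting Analysis": (
        ["p[i] == q[m + i]", "p[i] = m - i - 1", "p[i] == m"],
        ["p[i] == m", "p[i] == q[m + i]", "p[i] = m - i - 1"],
    ),
    "Control Flow Analysis": (
        ["(2)", "(3)", "(4)"],
        ["(2)", "(3)", "(1)"],
    ),
}

agreement = {name: rank_agreement(training, test)
             for name, (training, test) in table7.items()}
for name, (shared, same_rank) in agreement.items():
    print(f"{name}: {shared} of 3 shared, {same_rank} at the same rank")
```

On the Table 7 data, the sequential and loop nesting analyses share all three
top-ranked operations (in a different order for loop nesting), while the
control flow analysis shares two of three.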
Table 4
Approximately two nesting levels: loop nesting analysis results for the
Boyer-Moore, Zhu-Takaoka and Turbo Boyer-Moore algorithms (A, B: Immediate
values; X, Y, Z: Registers; a, b, c, d: Stack addresses; al: 8-bit Register
values; m, i: Integer values; p, q: Integer arrays).
No.  Operation                  Instruction seq.
4    p[q[i]] = m − 1 − i        mov a, %X
     (Initializing “bad         add b, %X
     character shift”           movz (%X), %X
     array elements)            mov al, %X
                                shl A, %X
                                mov %X, %Y
                                add c, %Y
                                mov a, %Z
                                mov d, %X
                                sub %Z, %X
                                sub B, %X
                                mov %X, (%X)
5    p[m − 1 − q[i]] =          mov a, %X
     m − 1 − i                  sub A, %X
     (Filling “good             mov b, %Y
     suffix” array              mov c, %Z
     elements)                  mov (%Z, %Y, 4), %Z
                                mov %X, %Y
                                sub %Z, %Y
                                mov %Y, %Z
                                shl B, %Z
                                mov %Z, %Y
                                add d, %Y
                                mov a, %Z
                                sub A, %Z
                                sub b, %Z
                                mov %Z, (%Y)
4.6 Combining the Analyses
From the results shown in Sections 4.2 to 4.5, different types of candidate
special-purpose operations have been identified using the three different forms
of analysis for the whole algorithm set including both the training and test sets.
In sequential code analysis, simple and short sequences are identified; while in
control flow analysis, long and complex operations are found. However, in loop
nesting analysis, both simple and complex operations are found.
All operations identified from loop nesting analysis are general operations
Table 5
Control flow analysis results for nine string matching algorithms (the omitted
elements are 0).
No. Special Operations MP BM TBM ZT HO CO NSN SO BF
1
while (j > −1 and x[i] ≠ x[j])
    j = a[j]
1
2
for (i = 0; i < A; ++i)
    x[i] = a
2 2 1 1 1
3
for (i = 0; i < a − 1; ++i)
    y[x[i]] = a − i − 1
1 1 1
4
for (i = 0; i ≤ a − 2; ++i)
    b[a − 1 − x[i]] = a − 1 − i
1 1 1
Table 6
Resultant critical operations of the test algorithms from the analysis steps.
Analysis      Algm  Operation 1        Operation 2        Operation 3
Sequential    AG    p[i]               m + n              p[i + j]
              KMP   p[i]               m + n              m = p[i]
              GG    p[i]               m + n              p[i + j]
              RT    p[i]               m + n              p[i + j]/m = p[i]
Loop Nesting  AG    p[i] == q[m + i]   p[i] = m − i − 1   p[i] == m
              KMP   m = p[i]           p[i] ≠ q[j]        p[i] == p[j]
              GG    p[i] == q[m + i]   p[i] = m           p[i] == m
              RT    p[i] == m          p[i] = m − i − 1   m = p[i]
Control Flow  AG    (2) - Table 5      (3) - Table 5      (4) - Table 5
              KMP   (1) - Table 5
              GG    (2) - Table 5
              RT    (2) - Table 5      (3) - Table 5
common to many algorithms, while in the other two approaches more specific
operations are found. Many of these identified operations are suitable for
implementation as customized instructions. The challenge now is how to merge
these outcomes to produce a final design for a “string matching ADSP”. While
how to choose the final set is beyond the scope of this paper, we offer some
heuristic strategies below.
Table 8 shows a list of parameters we have used to select the most useful
special-purpose operations from the lists of candidate operations found via the
three analysis steps. The final set of special-purpose operations is selected
Table 7
Results comparison of training set and test set algorithms.
Analysis                  Rank  Training Set       Test Set
Sequential Code Analysis  1     p[i]               p[i]
                          2     m + n              m + n
                          3     p[i + j]           p[i + j]
Loop Nesting Analysis     1     p[i] == q[m + i]   p[i] == m
                          2     p[i] = m − i − 1   p[i] == q[m + i]
                          3     p[i] == m          p[i] = m − i − 1
Control Flow Analysis     1     (2) - Table 5      (2) - Table 5
                          2     (3) - Table 5      (3) - Table 5
                          3     (4) - Table 5      (1) - Table 5
in a way that all or most of the algorithms will benefit from the customized
instructions. Considering the parameters in Table 8, the final set of
special-purpose operations is selected as shown in Table 9. These operations
can be considered as the most promising application-specific operations
according to our selection process for the whole set of the algorithms. For
instance, the operation “p[i] == m”, which is identified in the loop nesting
analysis step, fulfils many selection parameters such as commonality, a deep
nesting level and speciality towards the application domain. Thus, the final
set of operations in Table 9 is critical and specific to the string matching
application domain.
Table 8
Special-purpose operations selection parameters.
Sequential code analysis:
(1) Commonality in algorithms
(2) No. of occurrences
(3) Speciality towards the application domain
(4) No. of reduced instructions
(5) Hardware cost
Loop nesting analysis:
(1) Commonality in algorithms
(2) Nested level
(3) Speciality towards the application domain
(4) No. of cycles in each loop
(5) No. of reduced instructions
(6) Hardware cost
Control flow analysis:
(1) Commonality in algorithms
(2) Speciality towards the application domain
(3) No. of reduced instructions
(4) Hardware cost
Table 9
Candidate set of special-purpose operations.
Operation                 Functionality                    Reference
p[i]                      Calculate the index of an        Table 2,
                          array element                    Operation No. 1
p[i] == m                 Compare the value of an          Table 3,
                          array element                    Operation No. 3
p[m] == q[m + i]          Comparing next character of      Table 3,
                          pattern with text                Operation No. 1
for (i = 0; i < A; ++i)   Initializing the “alphabet”      Table 5,
    x[i] = a              array in pre-processing          Operation No. 2
5 Result Evaluation in Dynamic Environments
Since the final set of special-purpose operations has been obtained from a
purely static analysis methodology, evaluation of the results in a real execution
environment is important to confirm their criticalness and usefulness to the
algorithms. Therefore, a dynamic analysis is carried out independently for
the selected candidate special-purpose operations in Table 9 considering both
training and test algorithms.
In the evaluation, the execution frequency of each special-purpose operation
is counted using an experimental dataset and method as described below. The
frequency of each special-purpose operation is measured for different problem
sizes to analyse different frequency variations. In order to obtain smooth and
reliable results, for each measurement, 30 trials are executed and the average
value from these 30 trials is calculated.
To obtain a comprehensive set of experimental results, the following
experimental parameters are used. The length of the input “text” data ranges
from 10,000 to 200,000 characters, while the length of the input “pattern”
ranges from 5 to 100 characters. A standard library of genome sequences is used
to generate the “text” and “pattern” strings. In the practical set-up, the
“text” string is generated by using the same set of genomes, while the
“patterns” are generated randomly for each trial. The problem size is defined
as “text length − pattern length + 1”, where text length is the length of the
text and pattern length is the length of the pattern. It represents the number
of distinct locations at which the pattern may appear in the text.
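As a small worked example of these definitions, the following Python sketch
computes the problem size for the extreme text and pattern lengths quoted
above and averages a per-trial count over 30 trials. The function names and
the count_operation callback are hypothetical placeholders; the real
per-trial frequency measurement is not reproduced here.

```python
from statistics import mean


def problem_size(text_length, pattern_length):
    """Number of distinct positions at which the pattern may start."""
    if not 0 < pattern_length <= text_length:
        raise ValueError("pattern must be non-empty and no longer than the text")
    return text_length - pattern_length + 1


def average_frequency(count_operation, trials=30):
    """Average the count returned by count_operation(trial) over all trials."""
    return mean(count_operation(trial) for trial in range(trials))


smallest = problem_size(10_000, 100)   # shortest text, longest pattern
largest = problem_size(200_000, 5)     # longest text, shortest pattern
print(smallest, largest)

# Hypothetical stand-in for one measurement run.
example_average = average_frequency(lambda trial: 100 + trial)
print(example_average)
```

For the stated ranges, the problem size therefore varies between 9,901 and
199,996 candidate positions.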
Selected dynamic evaluation results are depicted in Figures 10, 11, 12 and 13
for different cases. Figure 10 shows the number of trivial array indexing
operations for each algorithm. Array indexing, p[i], is the first special-purpose
operation in Table 9. Notably, algorithms that are related to one another
generally appear together:
• The BM, TBM, ZT and AG algorithms, which all lie at the bottom of the
graph, are related to each other since they are different variants of the
original BM algorithm that uses a “bad character shift” [22].
• The RT and HP algorithms also use similar functionalities in matching and
hence appear together at the bottom of the graph [22].
• The CL and GG algorithms are related to each other because GG is a
refinement of CL [22]. This is evident in the graph because these algorithms
appear close to each other.
• It is observed from the graph that KMP and AC behave similarly. These two
algorithms are also functionally related as both use the “shift table”
technique for window shifting [22]. However, although algorithm MP also uses
the same technique, it appears at the top of the graph, since KMP and AC are
extended versions of the original MP algorithm and hence are optimized
further [22].
It is also noted that algorithm NSN remains isolated as it does not have any
similarity to any other algorithm.
Fig. 10. Frequency of operation p[i] for pattern size = 100.
Figure 11 shows similar results for the more complex array comparison
operation, p[i] == m. In this case, only a subset of the algorithms benefits
significantly from this choice of operation. The algorithms appearing near the
top of the graph are AC, HP, RT and TBM, while all remaining algorithms
lie at the bottom of the graph. For those algorithms which appear near the
top of the graph:
• The HP and RT algorithms are closely related and both rely heavily on this
operation.
• The TBM algorithm introduces this operation to effect a shift and appears
near the top of the graph, whereas the original BM algorithm appears near
the bottom.
• The AC algorithm uses this operation in order to calculate shift indexes
under particular conditions in its core matching phase and appears at the
top of the graph, while the original MP algorithm lies at the bottom of the
graph [22].
Fig. 11. Frequency of operation p[i] == m for pattern size = 100.
Fig. 12. Dynamic frequencies of all selected operations for a pattern size of 100.
Finally, Figure 12 shows all special-purpose operations from Table 9 as used by
all algorithms, and Figure 13 expands the view of the collection of algorithms
at the bottom. The results depicted in Figures 12 and 13 confirm the relations
between algorithms found from Figure 10 by showing their similar variations.
As an exception, the NSN algorithm reaches the top of the graph, indicating
that it benefits the most from the selected operations.
Fig. 13. Zoom-in of the lower group of algorithms from Figure 12 (the pattern size
is 5).
Another observation is that all the lines are linearly increasing. This is
consistent with the fundamental nature of searching problems. According to
Figure 11, this trend of linear increase is followed by all the algorithms and
selected operations regardless of the different absolute frequencies they have
achieved.
6 Validation of the Selected Operations in an ADSP environment
In order to illustrate that the operations selected by our approach as custom
instructions truly improve the performance of the application domain, this
section evaluates the performance of the selected operations in an ADSP
environment. The special-purpose operations selected in the previous section
are implemented as custom instructions in a simulation environment provided
by Tensilica’s Xtensa customizable processor [17]. Xtensa enables the design of
custom operations (referred to as “TIE” instructions) and attaches them to
the core ISA so that new processor configurations with a new ISA can be
automatically built within its software tool set, which consists of a compiler
and a linker for the new ISA.
The performance of our new custom instructions is measured against the
basic ISA. Several algorithms are executed on top of both ISAs and profiled to
measure performance differences in terms of various parameters such as
cycle count, instruction count, and energy consumption. Figure 14 shows the
performance gain obtained from this evaluation. As shown in Figure 14, all
algorithms have a significant percentage speed-up. This is particularly evident
for the ZT algorithm.
Fig. 14. Performance gain comparison of original and new domain-specific ISAs for
string matching algorithms.
7 Conclusion
Existing approaches for custom instruction generation produce results for a
single algorithm as the target application domain. In practice, however, there
will be several different algorithms available to solve a particular problem,
each with different characteristics and limitations. This paper has therefore
developed a “domain-specific” approach to custom instruction identification.
The approach finds operations common to several different algorithms
belonging to the same family. It uses a number of existing static analysis
techniques, which can find superficial similarities in the algorithms’ compiled
code, without the need for “heavyweight” semantic analyses. By combining the
results of all these syntactic analyses, an instruction set can then be
designed, which will benefit the largest possible number of algorithms from
the family of the algorithms. Case studies have been carried out to
demonstrate our static analysis based approach for a family of thirteen string
matching algorithms. Independent dynamic analysis experiments have also been
conducted to further analyze the criticalness of the resultant set of
candidate special-purpose operations in terms of their real-time execution
frequencies. The performance improvement of the derived custom instructions
has been measured as well via simulation, showing a significant percentage
speed-up for the whole family of the algorithms. The final set of selected
operations is indeed frequently used
and is beneficial to implement as customized instructions of a domain-specific
ISA.
Acknowledgements
The authors would like to acknowledge Dr Ross Hayward for useful discussions
on selecting the x86 architecture in our case studies.
References
[1] N.T. Clark, Hongtao Zhong, and S.A. Mahlke. Automated custom instruction
generation for domain-specific processor acceleration. IEEE Transactions on
Computers, 54(10):1258–1270, Oct 2005.
[2] C. Galuzzi and K. Bertels. The instruction-set extension problem: A survey.
ACM Transactions on Reconfigurable Technology and Systems, 4(2):18:1–18:28,
May 2011.
[3] K. Zhao, J. Bian, S. Dong, Y. Song, and S. Goto. Automated specific
instruction customization methodology for multimedia processor acceleration.
In Proceedings of the 9th International Symposium on Quality Electronic
Design, pages 321–324, Washington DC, USA, 2008.
[4] J. Cong, Y. Fan, G. Han, and Z. Zhang. Application-specific instruction
generation for configurable processor architectures. In Proceedings of ACM
International Symposium on Field-Programmable Gate Arrays, pages 183–189,
2004.
[5] K. Zhao and J. Bian. Peeling algorithm for custom instruction identification.
In IEEE Asia Pacific Conference on Circuits and Systems (APCCAS’2010),
pages 720–723, Dec 2010.
[6] S. Yang, C. Lin, C. Hung, J. Wu, and Y. Wang. Application-specific instruction
generation for SOC processors. In IEEE International Symposium on Circuits
and Systems, pages 3752–3755, May 2007.
[7] M. Arnold and H. Corporaal. Automatic detection of recurring operation
patterns. In the 7th International Workshop on Hardware/Software Codesign,
pages 22–26, 1999.
[8] H. Lin and Y. Fei. Exploring custom instruction synthesis for application-
specific instruction set processors with multiple design objectives. In Proceedings
of the 16th ACM/IEEE International Symposium on Low Power Electronics and
Design (ISLPED’10), pages 141–146, New York, NY, USA, 2010.
[9] L. Fanucci, M. Cassiano, S. Saponara, D. Kammler, E.M. Witte, and
O. Schliebusch. ASIP design and synthesis for non linear filtering in image
processing. In Proceedings of Design, Automation and Test in Europe (DATE
’06), volume 2, pages 1–6, March 2006.
[10] M. Mbaye, N. Belanger, Y. Savaria, and S. Pierre. Application specific
instruction-set processor generation for video processing based on loop
optimization. In IEEE International Symposium on Circuits and Systems, pages
3515–3518, May 2005.
[11] H. Choi, J. Kim, C. Yoon, I. Park, S. Hwang, and C. Kyung. Synthesis
of application specific instructions for embedded DSP software. IEEE
Transactions on Computers, 48(6):603–614, Jun 1999.
[12] F. Sun, A. Raghunathan, S. Ravi, and N. K. Jha. Custom-instruction synthesis
for extensible-processor platforms. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 23:216–228, 2004.
[13] Michael D. Ernst. Static and dynamic analysis: Synergy and duality. In ICSE
Workshop on Dynamic Analysis (WODA’2003), pages 24–27, Portland, OR,
May 2003.
[14] K. Atasu, W. Luk, O. Mencer, C. Ozturan, and G. Dundar. Fish: Fast
instruction synthesis for custom processors. IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, 20(1):52–65, Jan 2012.
[15] K. Atasu, C. Ozturan, G. Dundar, O. Mencer, and W. Luk. Chips: Custom
hardware instruction processor synthesis. IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems, 27(3):528–541, March 2008.
[16] Tao Li, Zhigang Sun, Wu Jigang, and Xicheng Lu. Fast enumeration of maximal
valid subgraphs for custom-instruction identification. In Proceedings of the
2009 International Conference on Compilers, Architecture, and Synthesis for
Embedded Systems (CASES’09), pages 29–36, New York, NY, USA, 2009.
[17] Tensilica. Xtensa customizable processors, accessed in April 2014.
http://www.tensilica.com/products/xtensa-customizable.htm.
[18] K. Shirai, T. Ikenaga, and H. Kitabatake. Design system for special
purpose processor executing algorithms described by higher level language. In
Proceedings of the 3rd Annual IEEE ASIC Seminar and Exhibit, pages P7/1.1
–P7/1.4, Sep 1990.
[19] Madhushika M. E. Karunarathna, Yu-Chu Tian, Colin Fidge, and Ross
Hayward. Algorithm clustering for multi-algorithm processor design. In
Proceedings of the 2013 IEEE 31st International Conference on Computer
Design (ICCD’2013), pages 451–454, Asheville, NC, USA, 6-9 Oct 2013.
[20] F. E. Allen. Control flow analysis. SIGPLAN Not., 5(7):1–19, Jul 1970.
[21] R. Tarjan. Testing flow graph reducibility. In Proceedings of the fifth annual
ACM symposium on Theory of computing (STOC’73), pages 96–107, New York,
NY, USA, 1973.
[22] Christian Charras and Thierry Lecroq. Handbook of Exact String Matching
Algorithms. King’s College Publications, 2004.
[23] M. J. Fischer and M. S. Paterson. String-matching and other products.
Technical report, Massachusetts Institute of Technology, Cambridge, MA, USA,
1974.
