Code Development Method for Embedded Processors with Application Domain Specific Instruction Set, by Tanaka, Hiroaki et al.
Osaka University
Title Code Generation Method for Embedded Processors with Application Domain Specific Instruction Set
Author(s) Tanaka, Hiroaki
Citation
Issue Date
Text Version ETD
URL http://hdl.handle.net/11094/23435
DOI
Rights

Code Generation Method for Embedded Processors 
with Application Domain Specific Instruction Set 
Submitted to
Graduate School of Information Science and Technology 
Osaka University 
January 2008
Hiroaki TANAKA 
Publications
Journal Articles (Refereed)
[J1] Hiroaki Tanaka, Yoshinori Takeuchi, Keishi Sakanushi, Masaharu Imai, Hiroki Tagawa,
Yutaka Ota and Nobu Matsumoto: "Generation of Pack Instruction Sequence for Media
Processors Using Multi-Valued Decision Diagram," IEICE Trans. on Fundamentals of
Electronics, vol. E90, no. 12, pp. 2800-2809, Dec., 2007.
[J2] Hiroaki Tanaka, Yoshinori Takeuchi, Keishi Sakanushi and Masaharu Imai: "A Code
Optimization Technique for Processors with SIMD Instructions Considering Permutation
Instructions," IPSJ Journal (in Japanese, to appear).
International Conference Papers (Refereed) 
[I1] Hiroaki Tanaka, Shinsuke Kobayashi, Yoshinori Takeuchi, Keishi Sakanushi and
Masaharu Imai: "A Code Selection Method for SIMD Processors with PACK Instructions," in
Proceedings of the 7th International Workshop on Software and Compilers for Embedded
Systems (SCOPES), pp. 66-80, Sep., 2003.
[I2] Hiroaki Tanaka, Yoshinori Takeuchi, Keishi Sakanushi, Masaharu Imai, Yutaka Ota, Nobu
Matsumoto and Masaki Nakagawa: "Pack Instruction Generation for Media Processors
Using Multi-valued Decision Diagram," in Proceedings of the 4th International Conference
on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pp. 154-159,
Oct., 2006.
[I3] Hiroaki Tanaka, Shiro Kobayashi, Yoshinori Takeuchi, Keishi Sakanushi and Masaharu
Imai, "A Block-Floating-Point Processor for Rapid Application Development," in
Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing
2007 (ICASSP2007), vol. II, pp. 65-68, 2007.
Domestic Conference Papers
[D1] Hiroaki Tanaka, Shinsuke Kobayashi, Yoshinori Takeuchi, Keishi Sakanushi and
Masaharu Imai: "A Code Selection Method for SIMD Processors with PACK Instructions,"
Technical report of IEICE. VLD, vol. 103, no. 145, pp. 43-48, Jun., 2003 (in Japanese).
[D2] Hiroaki Tanaka, Hassan M. AbdElSalam, Shiro Kobayashi, Yoshinori Takeuchi, Keishi
Sakanushi and Masaharu Imai, "Implementation of 2D-FFT on a Block-Floating-Point
DSP," IPSJ Symposium Series, vol. 2004, no. 8, pp. 91-96, Jul., 2004 (in Japanese).
[D3] Hiroaki Tanaka, Keishi Sakanushi, Yoshinori Takeuchi and Masaharu Imai, "A Code
Generation Method for Processors with SIMD Instructions Considering Dependency and
Distance between Operations," IPSJ Symposium Series, vol. 2005, no. 9, pp. 79-84,
Aug., 2005 (in Japanese).
[D4] Hiroaki Tanaka, Shiro Kobayashi, Yoshinori Takeuchi, Keishi Sakanushi and Masaharu
Imai, "Implementation and Evaluation of an Instruction Set Processor with Hierarchical
Block-Floating-Point Arithmetic Capability," in Proceedings of the 21st Signal Processing
Symposium, Nov., 2006 (in Japanese).
Summary 
Application specific instruction set processors (ASIPs) realize embedded systems with high
cost-performance. However, software development for ASIPs poses a difficult problem. In typical
software development for general purpose processors, programmers write programs in a
high-level programming language and then generate assembly code using compilers. In software
development for ASIPs, on the other hand, programmers often have to write assembly code
because compilers cannot efficiently generate application specific instructions.
Programming in assembly code requires much development effort and sacrifices the
portability of programs. Therefore, there is a great need for a code generation technology
that produces assembly code which brings out the processor's performance.
This thesis discusses two types of code generation methods for processors with application
domain specific instruction sets. First, this thesis proposes a code generation method in which
special functions in a high-level programming language are mapped onto specific instructions.
To ease software development in the target application domain, a programming scheme and a code
generation method for block-floating-point processors are proposed. Experimental results show
that the proposed code generation method successfully generates assembly code for
block-floating-point processors and that the quality of the generated code is good enough for
practical applications.
The second code generation method is automatic generation of application specific
instructions. To enable programmers to use application specific instructions without much
programming effort, automatic utilization of application specific instructions is proposed.
Media processors are selected as the target processors in this thesis. The difficulties in
compilation for media processors are extracting operations that can be performed in parallel
and dealing with the positions of data in registers. Media processors have two types of
instructions: SIMD instructions, which operate on subword data in registers, and permutation
instructions, which reorder or repack data in registers. Utilization of both SIMD and
permutation instructions is crucial to maximize the performance of media processors. In this
study, the code optimization problem for media processors is formulated as an integer linear
programming problem. Then, optimum and heuristic code generation methods are proposed and
evaluated. Experimental results show that the proposed methods generate assembly code with
SIMD and permutation instructions and that they achieve a high performance improvement.
Acknowledgements 
I would like to thank my supervisor, Prof. Masaharu Imai, for his guidance throughout my
undergraduate and graduate research. I am especially grateful for his useful advice, which
developed my expertise in the research area and my academic mentality.
I would like to thank Prof. Takao Onoye, Prof. Akihisa Yamada and Dr. Yoshinori Shigeta
for reviewing this thesis and for their useful discussion and comments on my study. Also, I
would like to thank Prof. Onoye for his friendliness throughout my student life. I very much
enjoyed his talks and thoughts.
I would like to thank Prof. Yoshinori Takeuchi. His comments greatly improved this thesis. I
am also grateful for his guidance, support and encouragement through everything in my student
life.
I would like to thank Prof. Keishi Sakanushi for his guidance and comments on my
research activities.
I would like to thank Prof. Shinsuke Kobayashi and Mr. Kentaro Mita for introducing me
to this research area and guiding my early study. They instructed me in research activity,
presentation skills, programming skills, etc.
I would like to thank Dr. Yuki Kobayashi, Mr. Noboru Yoneoka and Mr. Tatsuhiro Yoshimura.
They gave me a good time during my days in the Integrated System Design Laboratory at
Osaka University. I always enjoyed all kinds of activities in the laboratory with them.
I thank Dr. Kyoko Ueda, Dr. Mohamed AbdElSalam Hassan, Miss Yukako Nishikawa,
Mrs. Motoko Higashide, Mr. Ittetsu Taniguchi, Mr. Takashi Hamabe, Mr. Hirofumi Iwato,
Mr. Takuji Hieda, Mr. Takeshi Shiro, Mr. Takahiro Itoh, Mr. Akira Kobashi, Mr. Yu Okuno,
Mr. Hitoshi Nakamura, Mr. Ayataka Kobayashi, Miss Aiko Watanabe, Mr. Kazuhiro Kobashi,
Mr. Hideyuki Okajima and other current and former members of the Integrated System Design
Laboratory at Osaka University. They gave me many comments on my study. They also
made my life enjoyable with their talks and acts.
I would like to thank my collaborators throughout my doctoral research. I would like to thank
Dr. Shiro Kobayashi and Mr. Wang David from Asahi Kasei Corporation for their discussion
and comments on the study of the block-floating-point processor. They also helped me with
the implementation of the processor and the programs for evaluation. I would like to thank
Mr. Nobu Matsumoto, Mr. Yutaka Ota and Mr. Hiroki Tagawa from Toshiba Corporation. They
provided me with discussion and comments on the study of code generation for media
processors. They also provided me with tools for evaluation, including a compiler and a
simulator.
Last, I would like to thank my father Yasuo, my mother Nobuko, and my brothers Katsuyuki
and Masahiro. They gave me encouragement and continuous support throughout my life in
Osaka.
Contents
1 Introduction
1.1 Overview of Embedded Processors
1.1.1 Digital Signal Processors
1.1.2 Media Processors
1.1.3 Microcontrollers
1.1.4 Application Specific Instruction Set Processors
1.2 Requirements for Compilers
1.3 Contributions
1.4 Thesis Outline
2 Related Work
2.1 Overview of Code Generation Method for Embedded Processors
2.1.1 Assembly Language or Compiler Intrinsic Functions
2.1.2 Language Extension or Construction
2.1.3 Automatic Code Generation and Optimization
2.2 Code Generation for Block-Floating-Point Processors
2.2.1 Block-Floating-Point Processors
2.2.2 Related Work on Code Generation Method for Block-Floating-Point Processors
2.3 Code Generation for Media Processors
2.3.1 Media Instruction Set
2.3.2 Related Work on Code Generation for Media Processors
3 Code Generation for Block-Floating-Point Instructions
3.1 Hierarchical Block-Floating-Point Arithmetic
3.1.1 Conventional Block-Floating-Point Arithmetic
3.1.2 Introduction of Hierarchical Block-Floating-Point Arithmetic
3.1.3 Principle of H-BFP Arithmetic
3.2 H-BFP Processor and its Compiler
3.2.1 Design Concept
3.2.2 Hardware Implementation
3.2.3 Processor Description
3.2.4 Compiler Support
3.3 Experimental Results
3.3.1 Experimental Setup
3.3.2 Results
3.4 Summary
4 Optimal Code Generation for Media Instructions
4.1 SIMD Instructions
4.2 Code selection
4.3 SIMD Instruction Formulation
4.3.1 Rules for SIMD instructions
4.3.2 Constraints on selection of rules
4.3.3 ILP formulation
4.4 SIMD Instruction Formulation with permutation Instructions
4.4.1 IR and Rules for Data Packing and Moving
4.4.2 Constraints on selection of rules
4.4.3 ILP Formulation
4.5 Experimental results
4.6 Summary
5 Efficient Code Generation Algorithm for Media Instructions
5.1 Generation of SIMD Instructions
5.1.1 Grouping SIMD Operations
5.1.2 Ordering SIMD Operations in Registers
5.2 Generation of Permutation Instructions
5.2.1 Introduction of MDDs for Representation of a Set of Permutations
5.2.2 Permutation Operation Manipulation on MDDs
5.2.3 Permutation Instruction Generation Algorithm
5.3 Experimental Results
5.3.1 Experimental setup
5.3.2 Results
5.4 Summary
6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work
6.2.1 Automatic ASIP Design Space Exploration
6.2.2 Compilation Techniques for Low Power
6.2.3 Compilation Techniques for Multi Processor SoC
List of Figures
2.1 H-BFP Arithmetic
2.2 SIMD and permutation instructions
3.1 Block-Floating-Point Data Format
3.2 H-BFP Arithmetic
3.3 Comparison of conventional BFP and H-BFP
3.4 H-BFP Data Path
3.5 H-BFP Processor Architecture
3.6 Comparison of Application Development Flow
3.7 Program Refinement for H-BFP processor
3.8 Relative Ratio of Execution Time of DLX-HBFP to DLX-FP
4.1 Example of SIMD instructions.
4.2 Permutation instructions.
4.3 Consistency of nonterminals.
4.4 Schedulability.
4.5 Nodes insertion for data transfers.
4.6 Rule of permutation instructions.
4.7 Example of permutation instructions.
4.8 Identification of a register in which source values are located.
4.9 The ratio of generated code size.
4.10 The ratio of execution cycles.
4.11 convolution.
4.12 n real update.
5.1 Operation grouping
5.2 Operation ordering
5.3 MDDs for { abcd } and { abcd,abdc }
5.4 Adding permutations on MDDs
5.5 Reordering on MDDs
5.6 An example permutation instruction
5.7 Testing Target Permutation Generation
5.8 Expression Tree Construction
5.9 Target processor architecture
5.10 Permutation instructions of the target processor
5.11 Code length reduction ratio.
5.12 Code length reduction ratio.
5.13 Speedup against without SIMD instructions.
5.14 Speedup against without SIMD instructions.
5.15 Permutation in rgbgray
List of Tables
3.1 Results of logic synthesis
3.2 Breakdown of gate count of DLX-HBFP
3.3 Comparison of the number of insns. among different implementations
3.4 SNR of each program run on DLX-HBFP and DLX-FP
4.1 Generated code size and execution cycles.
4.2 The number of DFT nodes, variables and constraints in ILP and CPU time.
5.1 Breakdown of generated instructions
5.2 Comparison of compilation time between [1] and proposed method.
5.3 Permutation count and MDD node count of sets of permutations
Chapter 1 
Introduction 
Progress in semiconductor manufacturing technology and in the design methodology of
integrated circuits enables us to realize various electronic devices based on digital circuits.
As the integration scale of semiconductors grows, electronic devices can be manufactured in
smaller sizes. Progress in design methodology makes it possible to implement large-scale
electronic systems. Nowadays, electronic systems are applied to products in various fields:
mobile platforms such as cell phones and PDAs, media players such as MP3 players and DVD
players, engine controllers for vehicles, and all kinds of consumer electronics. These kinds of
electronic systems are referred to as embedded systems. Embedded systems have several features
which general purpose systems like personal computers do not have.
• Embedded systems perform application specific tasks.
• Embedded systems have to meet tight design constraints which come from the application
specification.
Performance requirements for embedded systems are becoming higher, and functional
requirements are becoming more varied. These requirements result in complex embedded systems,
and product vendors have to make a large effort to develop electronic products. Rising
development cost is the most serious problem in the design of electronic systems. In the early
era of electronic system development, embedded systems were developed as Application Specific
Integrated Circuits (ASICs), which are dedicated to specific applications. However, designing
ASICs for all products is not a realistic approach in recent embedded system design. Flexible
hardware which can be used not only for a specific application but also for various
applications is required.
To address the problem of design productivity, instruction set processors are widely used in
embedded system design. Instruction set processors realize the desired functionality with given
application programs. There is no need to develop new hardware for new products if
instruction set processors are employed, so they greatly reduce development cost
in embedded system design. Many applications are realized by instruction set processors in
today's electronic products. Generally, instruction set processors are designed for general
purposes. However, general purpose processors are often inefficient from the point of view of
cost-performance since they are not optimized for a specific application. Such inefficiency
causes high manufacturing cost of products or low energy efficiency at application run time.
Recent electronic devices require high performance and high energy efficiency, and general
purpose processors usually do not meet these requirements.
To achieve the required performance in a short development period, Application Specific
Instruction set Processors (ASIPs) are developed and used. ASIPs are instruction set processors
which have custom hardware and instructions for dedicated applications. Such custom
instructions can process applications efficiently. High performance is especially required in
the digital signal processing field, and processors for digital signal processing applications
have been widely used in recent years. Though ASIPs realize embedded systems with high
cost-performance, there is a considerable problem from the point of view of software
development for ASIPs. In typical software development for general purpose processors,
programmers write programs in a high-level programming language and then generate object code
using compilers. On the other hand, in software development for ASIPs, programmers often have
to write assembly code because compilers can hardly generate application specific instructions
in some cases. Programming in assembly code requires much development effort and sacrifices
the portability of programs. Therefore, there is a great need for a technology to generate
assembly code which brings out the processor's best performance.
This chapter gives an overview of embedded processors and then summarizes the requirements
for compilers for these processors. Then, the contributions of this thesis are described.
Finally, the outline of this thesis is given.
1.1 Overview of Embedded Processors 
This section gives a brief survey of several types of embedded processors and their
characteristics.
1.1.1 Digital Signal Processors 
Digital Signal Processors (DSPs) are dedicated to real-time digital signal processing such as
FIR filtering, IIR filtering and FFT. To meet the severe performance requirements of
real-time processing, many architectural enhancements have been introduced into digital signal
processors.
Several concepts are common in the design of DSPs.
• Efficient memory access mechanism to process a large amount of data, such as dual 
memory access, modulo addressing and memory accesses with address modifications. 
• Complex operations to process data efficiently such as multiply-accumulation, memory 
access with shift and parallel arithmetic operations. 
• Zero overhead loop mechanism to reduce loop overhead. 
• Instruction set which supports Fixed-Point Arithmetic. 
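As an illustration of how the listed features fit together, the following minimal Python sketch models an FIR filter the way a DSP kernel is typically organized: a circular delay line (what modulo addressing provides in hardware), a multiply-accumulate step per tap, and fixed-point (Q15) data. The function name `fir_q15` and the Q15 format are assumptions made for this example and are not taken from any particular processor.

```python
def fir_q15(samples, coeffs):
    """FIR filter over Q15 integers using a circular delay line (illustrative model)."""
    n_taps = len(coeffs)
    delay = [0] * n_taps          # circular buffer; a DSP would address it modulo n_taps
    head = 0                      # index of the newest sample
    out = []
    for x in samples:
        head = (head + 1) % n_taps        # modulo addressing: the pointer wraps around
        delay[head] = x
        acc = 0                           # wide accumulator, as in a MAC unit
        idx = head
        for c in coeffs:
            acc += c * delay[idx]         # one multiply-accumulate per tap
            idx = (idx - 1) % n_taps      # step back to the next older sample
        out.append(acc >> 15)             # rescale the Q30 sum back to Q15
    return out


# Two taps of 0.5 (16384 in Q15) average each sample with its predecessor.
print(fir_q15([2, 4, 6], [16384, 16384]))
```

On a DSP with the features above, each iteration of the inner loop would ideally map to a single instruction that fetches both operands, updates the addresses and accumulates, while the zero overhead loop mechanism removes the loop control cost.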
The first successful DSP in industry was the TMS320C1x series [2] from Texas Instruments.
Because the TMS320C1x belongs to an early generation of DSPs, it does not have all of the
characteristics enumerated above. However, some DSP specific features such as memory access
with address modification, complex operations and support for fixed-point arithmetic are found.
The next generation of TI's DSPs, the TMS320C2x series [3], has all of the features listed
above. These features make it possible to fetch two operands from data memories and perform one
multiply-accumulate operation within one cycle. This is one of the most remarkable features of
the TMS320C2x.
1.1.2 Media Processors 
Media processors target real-time media encoding and decoding. In recent years, standards
for audio CODECs, video CODECs and picture compression have been proposed and used in
many kinds of electronic devices. Since encoding and decoding of multimedia digital data
demand high performance from electronic devices, existing general purpose microprocessors and
DSPs did not have enough processing ability to implement the codec standards. The emerging
technologies to meet the higher performance requirements exploit parallelism at two different
levels.
• SIMD instructions which operate on subword data in registers 
• Multiple instruction issue mechanism as VLIW architecture 
SIMD instructions exploit data level parallelism in applications. In media processing
applications, there usually exist many operations which can be performed in parallel. By
mapping such operations onto one instruction, higher performance can be obtained than with
conventional instructions, which perform one operation per instruction. The multiple
instruction issue mechanism, referred to as VLIW (Very Long Instruction Word), exploits
instruction level parallelism. Assembly code contains instructions which have no dependences
among them. By issuing such instructions simultaneously, higher performance can be obtained
than with single issue processors.
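To make the notion of subword data concrete, the sketch below emulates a 32-bit register holding four 8-bit lanes and a lane-wise add in Python. The lane width, the lane count and the names `pack`, `unpack` and `simd_add` are assumptions chosen for illustration; real media instruction sets differ in widths, saturation behavior and naming.

```python
LANE_BITS = 8                      # assumed subword width
LANES = 4                          # assumed number of subwords per register
MASK = (1 << LANE_BITS) - 1


def pack(values):
    """Pack LANES small integers into one register value, lane 0 in the low bits."""
    reg = 0
    for i, v in enumerate(values):
        reg |= (v & MASK) << (i * LANE_BITS)
    return reg


def unpack(reg):
    """Split a register value back into its subwords."""
    return [(reg >> (i * LANE_BITS)) & MASK for i in range(LANES)]


def simd_add(a, b):
    """Lane-wise wrapping add: no carry propagates from one lane into the next."""
    return pack([(x + y) & MASK for x, y in zip(unpack(a), unpack(b))])


r = simd_add(pack([1, 2, 3, 250]), pack([10, 20, 30, 10]))
print(unpack(r))   # four additions performed by one emulated instruction
```

The point of the model is that one instruction performs four independent additions, and that each addition only ever combines subwords sitting at the same lane index in the two source registers.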
Media processors have been provided by several companies. Texas Instruments has provided
the TMS320C6x series [4]. The TMS320C6x series is an advanced generation of the TMS320C2x and
is dedicated not only to digital signal processing applications but also to digital media
applications. Processors in the TMS320C6x series have all the features of DSPs as well as both
SIMD instructions and VLIW architecture features. They are VLIW processors which can issue up
to 8 instructions simultaneously, and they support 2-way and 4-way parallel SIMD instructions.
The PNX1300 media processors [5] provided by NXP Semiconductors are also VLIW based
media processors. They are able to issue up to 5 instructions and support
SIMD instructions. Several DSP features are also supported by the PNX1300 media processors.
1.1.3 Microcontrollers 
Microcontrollers are used for controlling peripheral devices. Microcontrollers receive
signals from various peripheral devices, such as sensor signals and interrupts, and then send
signals to control other peripheral devices. A growing use of microcontrollers in recent
years is in controllers for vehicles. Conventional vehicle control mechanisms, that is,
mechanically controlled systems such as oil pressure systems, are being replaced with
electronic control systems. The performance required of microcontrollers is not as high as
that of media processors. However, microcontrollers must be dependable, and the requirement of
real-time reactivity must be met for the sake of drivers' safety.
1.1.4 Application Specific Instruction Set Processors 
All processors reviewed in the previous sections are designed by processor vendors, and system
developers just use them without any modification. Unlike those processors, configurable
processors are redesigned or modified by product developers using processor design tools
provided by tool vendors. Since processor designers can configure their processors, processors
optimized for specific applications can be obtained. Software development tools such as
compilers, assemblers and linkers are typically provided by tool vendors.
Many academic and commercial processor design tools have been presented.
Chess/Checkers [6] [7] is an embedded processor design tool-suite provided by Target
Technologies. In the Chess/Checkers processor development environment, processor designers
design processors using the nML processor description language. Chess takes a target processor
description written in nML and an application program written in C, compiles
the application program, and generates assembly code for the target processor. The processor
performance can be estimated using Checkers, a retargetable instruction set simulator. The
processor HDL is generated by the HDL generator Go from the processor description written in
nML. Processor designers can explore the processor design space by modifying the processor
descriptions and using the Chess/Checkers tool-suite.
Processor Designer [8] [9] provided by CoWare is also an embedded processor design tool.
Processor Designer uses the LISA processor description language, which plays the same role as
nML in Chess/Checkers. Generation of the HDL of processors and of software development tools
such as compilers, assemblers and simulators is also supported by Processor Designer.
Tensilica's Xtensa [10] is a configurable processor whose architecture parameters and
instruction set are customizable. Processor designers are able to add instructions using the
TIE language, which describes additional instructions [11]. As with Chess/Checkers and Processor
Designer, HDL generation of processors and generation of software development tools are
supported. A distinguishing feature of Tensilica's processor design tool-suite is its support
for automated processor customization. Xtensa Xplorer takes a target application written in C
and then automatically customizes the target processor to process the target programs
efficiently.
1.2 Requirements for Compilers 
The main role of compilers in embedded system design is to ease software development.
Writing assembly code is a hard task because of the low level of abstraction of its programming
model. Additionally, the huge variety of functionality required of recent embedded systems
makes embedded software large and complex. Developing all embedded software in assembly
language is not feasible in recent embedded system development. Although embedded software
developers used to program in assembly language in the past, using compilers to develop
embedded software is no longer unusual.
The most significant task of compilers is to translate programs written in a high-level
programming language into assembly code, which is an instruction sequence of a target processor.
Moreover, there are some additional requirements for compilers.
• Compilers should support a programming language which gives a suitable programming 
model to describe applications in the target application domain. 
• Compilers should generate optimized assembly code which takes advantage of the target
processor.
The first requirement stems from the need for high software productivity. The abstraction
of the programming language should be not only high-level but also suitable for the target
application domain. If the programming model is suitable for the application domain,
translating the algorithms of applications into programs is easy for programmers. As a result,
programmers can develop software rapidly.
The second requirement stems from the requirements for performance. Unless the application
specific architecture features and instructions are utilized, high performance cannot be
obtained. Traditional compilation methodology does not take most application
specific features into account. Generally, it is difficult for compilers to utilize
architecture specific features. However, compilation methods that utilize application specific
features are essential for embedded system development.
1.3 Contributions 
This thesis discusses code generation methods for application domain specific instruction set
processors. The challenges raised by the requirements mentioned in the previous section are
tackled.
As a compilation method for an application specific programming model, a code generation
method for block-floating-point processors is studied. Block-floating-point processors have an
instruction set that performs operations based on block-floating-point arithmetic. A challenge
in compilation for block-floating-point processors is to bridge the gap between the programming
model of block-floating-point arithmetic and the usual programming model of a programming
language. The method to describe the computation procedure of block-floating-point arithmetic
in a high-level programming language and to compile such programs is the main topic of this
study.
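For readers unfamiliar with the data representation, the following Python sketch shows the general idea of conventional block-floating-point: a block of values shares a single exponent, and each value keeps only an integer mantissa. It is a generic illustration, not the hierarchical H-BFP scheme of this thesis; the names `bfp_encode` and `bfp_decode` and the 16-bit mantissa width are assumptions made for the example.

```python
import math


def bfp_encode(block, mantissa_bits=16):
    """Quantize a block of floats to signed integer mantissas with one shared exponent."""
    peak = max((abs(v) for v in block), default=0.0)
    if peak == 0.0:
        return [0] * len(block), 0
    # frexp gives peak = m * 2**e with 0.5 <= m < 1, so peak < 2**e.
    # Choose the shared exponent so that the largest value fills the mantissa range.
    exp = math.frexp(peak)[1] - (mantissa_bits - 1)
    limit = (1 << (mantissa_bits - 1)) - 1
    mantissas = []
    for v in block:
        m = round(math.ldexp(v, -exp))                # scale by 2**(-exp) and round
        mantissas.append(max(-limit - 1, min(limit, m)))   # clamp to the signed range
    return mantissas, exp


def bfp_decode(mantissas, exp):
    """Reconstruct approximate float values from mantissas and the shared exponent."""
    return [math.ldexp(m, exp) for m in mantissas]


mantissas, exp = bfp_encode([0.5, -0.25, 0.125])
print(mantissas, exp, bfp_decode(mantissas, exp))
```

The gap mentioned above becomes visible here: an ordinary program manipulates individual scalars, whereas block-floating-point arithmetic has to track an exponent that belongs to a whole block of data.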
A compiler optimization method for media processors is also studied in this thesis. Media
processors play an important role in recent embedded system design because of the spread of
digital media applications among many kinds of electronic devices. A compilation method that
fully utilizes media processors is crucial for the development of embedded software. This study
focuses on an assembly code generation method for a class of instructions called SIMD
instructions. The difficulties in compilation for SIMD instructions are extracting operations
that can be performed in parallel and dealing with the positions of data in registers. To
minimize the number of arithmetic operation instructions, the compiler should map as many
operations as possible onto SIMD instructions. However, since each operation in a SIMD
instruction processes data located at the same position in different registers, permutation
instructions which reorder or repack data in registers may be required. Permutation
instructions take additional execution cycles, hence it is desirable to use as few permutation
instructions as possible. If the positions of data in registers are determined appropriately,
the number of permutation instructions will be small. However, if the positions of data in
registers are determined inappropriately, the overhead caused by permutation instructions
becomes large. The code generation method that uses SIMD instructions while considering
permutation instructions is the main topic of the latter half of this thesis.
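The effect of data positions can be seen in a small Python model in which a register is a list of lanes. The computation c[i] = a[i] + b[3 - i], the helper names `simd_add` and `permute`, and the four-lane register are all assumptions chosen for this sketch.

```python
def simd_add(x, y):
    """Lane-wise add: lane i of the result uses only lane i of each source."""
    return [p + q for p, q in zip(x, y)]


def permute(reg, order):
    """Permutation instruction model: lane i of the result is reg[order[i]]."""
    return [reg[i] for i in order]


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
want = [a[i] + b[3 - i] for i in range(4)]          # scalar reference result

# With b held in memory order, a plain SIMD add pairs the wrong lanes.
misaligned = simd_add(a, b)

# One permutation first moves each b element to the lane where it is needed.
fixed = simd_add(a, permute(b, [3, 2, 1, 0]))

# If b had been produced in reversed lane order to begin with, no permutation is needed.
b_reversed_layout = [40, 30, 20, 10]
free = simd_add(a, b_reversed_layout)

print(want, misaligned, fixed, free)
```

In this toy case the second layout saves one instruction out of two. Choosing such layouts over a whole program, where one value may feed several SIMD operations with conflicting lane requirements, is the optimization problem that the text describes.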
The main contributions of this thesis are as follows. The first contribution is a compilation
method for block-floating-point processors.
• First, the code generation method for block-floating-point processors is proposed. With
a view to describing block-floating-point arithmetic in a high-level programming
language, the processor architecture, programming scheme and compilation
method are studied. A complete evaluation of the processor and compiler has been
performed.
The second contribution is a compilation method for media processors.
• Code generation for media processors with SIMD and permutation instructions is
formulated as an integer linear programming (ILP) problem.
• The code generation method using an ILP solver is proposed. The effectiveness of the
method is demonstrated by applying it to a set of programs from the domain of digital
signal processing applications.
• A heuristic for code generation with SIMD and permutation instructions based on the
data flow graph representation of programs is proposed. The output of this heuristic is
not always optimal. However, this heuristic outputs solutions in a much shorter time than
the ILP based code generation method.
• The method to generate the permutation instruction sequences which are necessary for the
heuristic code generation method is presented. This method computes an instruction sequence
that generates a desired data permutation with a given permutation instruction set.
• The heuristic code generation method is evaluated using a real commercial media
processor and several digital signal processing applications.
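The problem of finding an instruction sequence for a desired permutation can be stated compactly in code. The sketch below solves a simplified version by plain breadth-first search; it is not the MDD-based method of this thesis, and it assumes that every permutation instruction reads one register and writes one register. The instruction names `rot1` and `swap_lo` and the function `find_sequence` are hypothetical.

```python
from collections import deque


def apply(perm, data):
    """Apply a permutation instruction: lane i of the result is data[perm[i]]."""
    return tuple(data[i] for i in perm)


def find_sequence(instructions, start, target, max_len=8):
    """Return a shortest list of instruction names turning start into target, or None."""
    start, target = tuple(start), tuple(target)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if state == target:
            return path
        if len(path) == max_len:
            continue                      # do not extend sequences past the length bound
        for name, perm in instructions.items():
            nxt = apply(perm, state)
            if nxt not in seen:           # each data arrangement is expanded only once
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None


# A hypothetical two-instruction permutation set on a four-lane register.
isa = {"rot1": (1, 2, 3, 0), "swap_lo": (1, 0, 2, 3)}
print(find_sequence(isa, "abcd", "cdab"))
```

Breadth-first search guarantees a shortest sequence but enumerates data arrangements one by one, which grows quickly with the number of lanes and source registers. That cost is one motivation for a compact representation of sets of permutations such as the decision diagrams introduced later in the thesis.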
1.4 Thesis Outline 
The remainder of this thesis is organized as follows.
In chapter 2, work related to this thesis is summarized. Code generation methods for
application domain specific instruction set processors are reviewed. Ideally, assembly
code for any processor should be generated from descriptions in processor independent
high-level programming languages. However, such compilers are not always provided, or the
desired performance cannot always be obtained. Several research efforts and practical
approaches to deal with this problem are reviewed. Compilation techniques to generate high
performance code from processor independent high-level programming languages are also reviewed.
In chapter 3, a compilation method for block-floating-point processors is described. An
instruction set processor supporting H-BFP arithmetic and its application development method
are proposed. The processor is designed based on the RISC architecture to enable
compiler-based development. The programming scheme using intrinsic functions and the
compilation method for the block-floating-point processor are presented and evaluated.
In chapter 4, a code generation method for SIMD instructions considering permutation in-
structions is described. The code generation method is based on a code selection problem for-
mulated into integer linear programming problem. Code generat~on by solving the formulated 
problem using ILP solver is evaluated. 
In chapter 5, a heuristic code generation technique for SIMD instructions with permutation 
instructions is described. This method identifies SIMD instructions by finding and grouping the 
same operations in programs. After the SIMD instruction identification, permutation instruc-
tions are generated. In this permutation instruction generation, Multi~va1ued Decision Diagram 
(MDDY is introduced to represent and to manipulate sets of packed data. The code genera-
tion method integrated the heuristic to identify SIMD instructions and permutation instruction 
sequence is evaluated . 
Conclusion and future work are described in chapter 6. 
Chapter 2 
Related Work 
This chapter discusses code generation methods for embedded processors. Then, work related
to the topics studied in this thesis is summarized. 
2.1 Overview of Code Generation Method for Embedded Processors 
There are several code generation methods for embedded processors. A difficulty in code
generation for embedded processors is utilizing instructions dedicated to a specific application
domain. The most desirable way to use application specific instructions is automatic generation
of assembly code by compilers. Code optimization techniques for application specific
instructions or architectures have been studied for many kinds of instructions and architectures
[12], [13], [14], [15]. 
Automatic generation of assembly code from processor independent programs in high-level
languages is a challenging task because of the gap between the model of the programming
language and the model of applications in the target domain. Several approaches have been
investigated to manage this gap [16], [17], [18], [19], [20], [21], [22]. After the review of code
generation methods, existing code generation methods for two classes of embedded processors
are discussed. 
2.1.1 Assembly Language or Compiler Intrinsic Functions 
When compilers are not available or do not generate application specific instructions, the
simplest way to use application specific instructions is to write software in assembly code.
Assembly programming of performance critical processes is still a prevailing solution in
current software development. Programming with compiler intrinsic functions, which are
functions in high-level programming languages that are mapped to specific instructions, is
also used to access application specific instructions. There are several variations of this approach. 
• Program entire software in assembly language. 
• Program performance critical functions in assembly language, then integrate them with 
programs written in high-level programming languages at linking time. 
• Use previously designed libraries. 
• Embed assembly code in programs written in high-level programming languages. 
• Use compiler intrinsic functions in programs written in high-level programming lan-
guages. 
If compilers are not available or the target application is small, assembly programming of
the entire software is a possible way to use application specific instructions. Programming
performance critical functions in assembly language is another possible approach to exploit
the processor's potential. Since typically only a small portion of the software is executed
frequently at run-time, programming only the critical portion of the code in assembly is effective. 
Some processor vendors provide pre-compiled functions as libraries, which are frequently 
used for applications in target application domains. Application programmers can use
application specific instructions through functions collected in libraries [23], [24]. 
Using application specific instructions in a high-level language is also a possible approach to
exploit application specific instructions. Some compilers have a mechanism called inline
assembly. Inline assembly emits fragments of assembly code written in programs of a high-level
programming language as they are. Programmers can embed assembly code directly into target
application programs. Compiler intrinsic functions can be used in some compilers. Compiler
intrinsic functions are functions in high-level languages that are mapped into specific
instructions by compilers. Programmers can use specific instructions by writing compiler
intrinsic functions like usual functions. 
2.1.2 Language Extension or Construction 
To ease application development and tasks of compilers, special high-level languages have been
proposed. Either existing high-level languages
are extended or new languages for a specific application domain are constructed. Compilation
of general high-level languages for application specific instructions would be a hard task due
to the gap between the programming model of languages and instruction sets. Limited data types,
arithmetic operations and program control flow models in high-level programming languages
make production of efficient assembly code difficult. The sequential programming model is an
obstacle to exploiting instructions that perform parallel operations. 
Language extension to break the limitation of existing high-level languages by introducing
new data types, new operations, new control flow models and parallelism has been studied before
[19], [20], [21]. Construction of application domain specific languages has also been studied
[22]. These approaches allow programmers to use application specific instructions easily. 
2.1.3 Automatic Code Generation and Optimization 
The most desirable way to use application specific instructions for programmers is automatic
generation of assembly code making the best use of target processors. Many code generation
and optimization methods for many kinds of processors have been studied [25], [26], [12], [15].
There are several approaches to generate and optimize assembly code or programs [26], [15].
One approach to optimize programs is transformation of the intermediate representation of
programs. Processor independent and dependent optimization algorithms at the intermediate
representation level have been studied extensively [26]. Another approach to optimize programs
and assembly code is generation of high quality assembly code in the assembly code generation
phase [15]. In assembly code generation, compilers have three tasks: instruction selection,
instruction scheduling and register allocation. Optimization algorithms for these tasks have also
been studied. Since this thesis focuses on methods to utilize application specific instructions,
only the code generation and optimization methods at the intermediate representation and
instruction selection levels are reviewed in this section. 
Early generations of digital signal processors have a single-issue architecture and complex
instructions, such as memory access operations with complex addressing modes and
multiply-accumulate operations, to achieve efficient instruction encoding [3]. With such an
instruction set, a given program can be translated into assembly code in multiple ways. The
problem of translating programs into the most efficient assembly code has been studied by many
researchers [27], [28], [29], [30]. 
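As a hedged illustration of why a multiply-accumulate instruction gives several possible translations, the following Python sketch selects instructions for a small expression tree. It is a generic tree-covering sketch, not the method of [27]-[30]; the instruction names add, mul and mac, the function name select, and the unit cost per instruction are assumptions made only for this example.

```python
def select(tree):
    """Return (cost, instruction list) for an expression tree.

    A tree is either a leaf name (str) or a tuple (op, left, right).
    Every instruction is assumed to cost 1 (hypothetical cost model).
    """
    if isinstance(tree, str):
        return 0, []                      # leaves are assumed to be in registers
    op, left, right = tree
    cl, il = select(left)
    cr, ir = select(right)
    best = (cl + cr + 1, il + ir + [op])  # cover the root with a plain add/mul
    if op == "add":
        # Try the multiply-accumulate pattern add(acc, mul(y, z)) on either side.
        for acc, prod in ((left, right), (right, left)):
            if isinstance(prod, tuple) and prod[0] == "mul":
                ca, ia = select(acc)
                cy, iy = select(prod[1])
                cz, iz = select(prod[2])
                cand = (ca + cy + cz + 1, ia + iy + iz + ["mac"])
                if cand[0] < best[0]:
                    best = cand
    return best


print(select(("add", "a", ("mul", "b", "c"))))   # (1, ['mac'])
```

For a + b*c, the sequence mul, add and the single mac instruction are both valid translations; the sketch keeps the cheaper one under the assumed cost model.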
In the research on code generation for digital signal processors, code generation methods
to achieve high signal processing quality for fixed-point digital signal processors have been
studied. In this topic, the accuracy of the outputs of signal processing is the matter to be
considered. Program transformation from floating-point programs into fixed-point programs at
the high-level programming language level has been studied [17], [18], [16]. 
2.2 Code Generation for Block-Floating-Point Processors 
2.2.1 Block-Floating-Point Processors 
There are two major types of digital signal processors (DSPs), classified by the supported
arithmetic. The first type supports floating-point arithmetic and the other supports fixed-point
arithmetic. Each of these DSPs has different features. From the application software development
point of view, software for floating-point DSPs can be efficiently developed, because high signal
quality can be easily achieved due to the nature of floating-point arithmetic. However,
floating-point units are expensive and not suitable for realizing cost-effective digital signal
processing systems. On the other hand, fixed-point DSPs do not have expensive hardware. Because
of their low hardware cost, fixed-point DSPs are widely used in consumer electronics. However,
the development of software for fixed-point DSPs is a time consuming task because it is difficult
to develop a fixed-point implementation which achieves high signal processing quality. 
Figure 2.1: H-BFP Arithmetic 
As a compromise between floating-point and fixed-point, block-floating-point (BFP)
implementation is also used on fixed-point DSPs. Block-floating-point is based on the concept of
floating-point; that is to say, block-floating-point systems use a floating-point format as the
number representation and perform arithmetic operations in a floating-point manner. The
difference between floating-point and block-floating-point is that several numbers share an
exponent in block-floating-point systems, while each number has its own exponent in
floating-point systems. The advantages of block-floating-point arithmetic are a lower performance
requirement for processing devices than floating-point, and computation steps that are suitable
for fixed-point DSPs. These advantages have motivated system developers to use
block-floating-point arithmetic. 
While a BFP implementation can realize high signal processing quality similar to
floating-point on low-cost fixed-point hardware, developing a BFP implementation is still a hard
task. BFP has been applied to some limited applications [31] [32], but efficient BFP
implementations for other applications are not well known. Actually, there is a considerable
trade-off between accuracy and hardware cost in BFP arithmetic [33]. This trade-off makes
development of BFP systems difficult. In order to solve the problem of lacking a good
implementation approach, hierarchical block-floating-point (H-BFP) arithmetic [34] [33] and a
processor with an H-BFP instruction set have been proposed [35]. The basic concept of H-BFP is
to keep data in memory in floating-point format, while processing data in fixed-point format. 
Fig. 2.1 illustrates the concept of H-BFP arithmetic. The data in memory are represented as
floating-point numbers. The floating-point to fixed-point converter is used to convert the data.
Data processing is performed on the fixed-point data path, and then the results are converted
into floating-point representation by the fixed-point to floating-point converter. The
fixed-point to floating-point converter performs floating-point normalization. The results are
stored to the data memory. With H-BFP arithmetic, the desired signal processing quality and a
reasonable hardware implementation can be obtained simultaneously, as with usual BFP. Details of
H-BFP are described in chapter 3. 
2.2.2 Related Work on Code Generation Method for Block-Floating-Point Processors 
There exist processors which use block-floating-point arithmetic, such as conventional
fixed-point DSPs [4] and H-BFP DSPs [35]. However, compilation methods for block-floating-point
instruction sets have not been studied in the past. The application development approach
presented in [35] is writing assembly code from scratch, referring to the floating-point
implementation of the target application. The block-floating-point implementation of FFT on
conventional fixed-point DSPs presented in [36] also consists of assembly programs. Manual
translation of programs from a high level language into assembly language is an error prone and
time consuming task. A software development method which offers high productivity is required. 
In this thesis, an instruction-set processor supporting H-BFP arithmetic and its application
development method are proposed. Since this is the first research on code generation by
compilers for an H-BFP instruction set, a RISC based processor is proposed in order to separate
the compiler's task from difficulties which do not stem from H-BFP features. In the proposed
software development flow, H-BFP programs are implemented by modifying usual floating-point
programs. The only required modification is to add special functions. The proposed code
generation method converts the special functions into H-BFP instructions. Since the modification
does not require complex program transformations by hand, H-BFP programs can be easily
developed. 
(a) a SIMD instruction (b) a permutation instruction 
Figure 2.2: SIMD and permutation instructions 
2.3 Code Generation for Media Processors 
2.3.1 Media Instruction Set 
Multimedia applications have become popular applications which run on a wide variety of
platforms. Most multimedia applications demand high performance from the electronic devices
which execute them. Recent microprocessors are often customized to execute multimedia
applications efficiently. A favorable property of most multimedia applications is their data
parallelism. Therefore, many processors adopt SIMD (Single Instruction Multiple Data)
instructions to exploit data parallelism. SIMD instructions perform operations on multiple data
packed in registers as shown in Fig.2.2(a). When a SIMD instruction is executed, the same
operation is executed on all packed data at the same time. Obviously, the processing efficiency
of SIMD instructions is higher than that of conventional instructions which perform one
operation at a time. Moreover, no special hardware is required to implement SIMD instructions. 
SIMD instructions are useful, but most compilers have limited ability to use SIMD
instructions. In view of this limitation, in order to exploit SIMD instructions, programmers
need to use compiler intrinsics, which are special functions in high level programming languages
that are mapped to specific instructions, or to write programs in assembly languages. However,
using compiler intrinsics or writing assembly programs is a time consuming task and decreases
the portability of programs. Therefore, a technique for automatically generating assembly
programs including SIMD instructions is required. 
The difficulty of code generation that exploits SIMD instructions stems from the data
parallelism in registers. When using SIMD instructions, the positions of data in registers must
be taken into account. When a SIMD instruction which applies a binary operator is executed, the
operands of each operation performed by the SIMD instruction must be at the same position in
the registers. If data and operations in the target application are well coordinated, SIMD
instructions can be generated easily. If not, generation of additional permutation instructions
which reorder or repack data in registers is needed. The permutation instructions take two
source registers with packed data, take some data elements from each register, and put these
data elements into one target register. Fig.2.2(b) shows a typical permutation instruction which
takes two elements from each source register and packs them into the target register. Although
such data repacking instructions take run-time execution cycles, the total number of execution
cycles can decrease due to the effect of SIMD instructions. The combination of SIMD and data
repacking instructions can achieve high performance improvements compared with the case where
SIMD instructions are unused. 
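The interplay of SIMD and permutation instructions described above can be sketched in Python as follows. This is only an illustrative model under stated assumptions: registers are modelled as 4-element tuples, and simd_add and permute are hypothetical stand-ins for a SIMD addition and a two-source permutation instruction, not instructions of any particular processor.

```python
def simd_add(ra, rb):
    """Lane-wise addition of two packed registers (tuples of equal length)."""
    return tuple(a + b for a, b in zip(ra, rb))


def permute(ra, rb, select):
    """Build a register from lanes of ra (indices 0-3) and rb (indices 4-7)."""
    both = ra + rb
    return tuple(both[i] for i in select)


A = (1, 2, 3, 4)        # a0, a1, a2, a3
B = (10, 20, 30, 40)    # b0, b1, b2, b3

# Goal: a0+a2, a1+a3, b0+b2, b1+b3. The operands of each addition are not at
# the same lane position, so two permutations repack them first.
lo = permute(A, B, (0, 1, 4, 5))   # (1, 2, 10, 20)
hi = permute(A, B, (2, 3, 6, 7))   # (3, 4, 30, 40)
sums = simd_add(lo, hi)            # (4, 6, 40, 60)
print(sums)
```

In this model, two permutations and one SIMD addition replace four scalar additions; whether this pays off on a real processor depends on the cost of the repacking instructions.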
There are many problems around code generation with SIMD instructions. One of the most
essential topics is finding a permutation instruction sequence which generates the required
packed data from given packed data with given permutation instructions. Most processors do not
have all possible permutation instructions, but only several of them. In addition, the number of
all combinations of data repacking is very large. Therefore, the permutation instruction
sequence which generates the required packed data is not always found easily, because of the
limitation of available permutation instructions and the large search space of data repacking.
This is one of the most significant problems in the exploitation of SIMD instructions. 
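To make the search problem concrete, the following Python sketch enumerates, round by round, every packed register that can be produced from the inputs with a small permutation instruction set, and then reads back one instruction sequence for the target. It is a plain exhaustive search given only for illustration, not the MDD-based technique proposed in this thesis; the instruction names pack_lo, pack_hi and interleave_lo and their lane selectors are hypothetical.

```python
from itertools import product

# Hypothetical 4-lane permutation instructions. A selector indexes into the
# concatenation of the two sources (lanes 0-3 from src1, lanes 4-7 from src2).
INSTRUCTIONS = {
    "pack_lo": (0, 1, 4, 5),
    "pack_hi": (2, 3, 6, 7),
    "interleave_lo": (0, 4, 1, 5),
}


def run(select, r1, r2):
    both = r1 + r2
    return tuple(both[i] for i in select)


def find_sequence(inputs, target, max_rounds=3):
    """Return a list of (instr, src1, src2, result) that builds target, or None."""
    producer = {r: None for r in inputs}   # register contents -> how it was made
    for _ in range(max_rounds):
        if target in producer:
            break
        known = list(producer)             # snapshot: sources come from earlier rounds
        for (name, sel), r1, r2 in product(INSTRUCTIONS.items(), known, known):
            producer.setdefault(run(sel, r1, r2), (name, r1, r2))
    if target not in producer:
        return None
    seq, done = [], set()

    def emit(reg):
        # Emit the producers of reg before reg itself; inputs need no instruction.
        step = producer[reg]
        if step is None or reg in done:
            return
        done.add(reg)
        name, r1, r2 = step
        emit(r1)
        emit(r2)
        seq.append((name, r1, r2, reg))

    emit(target)
    return seq


A = ("a0", "a1", "a2", "a3")
B = ("b0", "b1", "b2", "b3")
for step in find_sequence([A, B], ("a0", "a2", "a1", "a3")):
    print(step)
```

The number of reachable registers grows quickly with the number of lanes and instructions, which is the search space problem described above.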
2.3.2 Related Work on Code Generation for Media Processors 
Many publications have addressed automatic code generation of SIMD instructions [37, 38, 15,
39, 40, 41, 1, 42, 43]. 
In [37], loops are vectorized to generate SIMD instructions using several analyses and loop
transformations. Though [37] aims at exploitation of SIMD instructions, it does not take data
reordering into account. [39], [40] and [41] present a vectorization technique in the presence
of misaligned memory access and data type conversions in loop bodies. Using the concept of
virtual vectors, statements in loop bodies are vectorized. Misaligned memory accesses
and unaligned vector operations are aligned by pack and unpack operations. For statements
with mixed data lengths, data conversion instructions are also generated. [42] proposes
a vectorization technique for loop bodies with interleaved data access of arrays. Utilizing two
classes of data packing instructions, extract and interleave instructions, a data repacking
instruction sequence which arranges interleaved data in a non-interleaved form is generated. SIMD
instructions are generated for vector operations on the rearranged data. [43] proposes a SIMD
instruction utilization technique combined with software pipelining. Functional units used by
SIMD instructions and those used by conventional scalar operation instructions are separately
equipped in most processors which have SIMD instructions. Therefore, good utilization of both
SIMD functional units and scalar functional units leads to higher performance than conventional
vectorization techniques. In [43], operations in loops are mapped to scalar operation
instructions or SIMD instructions so that the number of instructions to be performed in a loop
iteration is minimized. 
All the above approaches target loops. On the other hand, there are some approaches
targeting basic blocks or unrolled loop bodies. The advantage of the basic block level approaches
is that they are applicable to a wide range of programs. The loop level approaches such as [37,
39, 40, 41] and [43] generally target well-formed loops. However, well-formed loops do not
appear frequently in practical programs. Additional tasks such as manually analyzing and
rewriting programs may be necessary before applying loop level SIMD optimization so that the
optimization works on the programs. On the other hand, basic block level approaches are easily
applicable to practical programs, because they are not sensitive to control structures in
programs. In addition, parallelism within a basic block can also be utilized in basic block
approaches, which is not considered in loop level approaches. In [15], the pattern matching and
covering problem with SIMD instructions is formulated as Integer Linear Programming (ILP). By
solving the covering problem using an ILP solver, highly optimized assembly code with SIMD
instructions is obtained. However, this approach takes too much time to solve ILP problems. For
the latest SIMD instructions which handle 4 or 8 packed data, the time required to solve ILP
problems may not be acceptable. Moreover, reordering or repacking data in registers is not
handled in this method. In [38], SIMD instructions are generated by grouping statements in a
basic block. Using data dependency and alignment information, statements which are executable in
parallel 
are grouped into a Pack Set to minimize data packing cost. Performance improvements are larger
than with traditional vectorization. However, the way to generate permutation instructions and
the related problem of packing are not mentioned. In [1], generation of SIMD and permutation
instructions is presented. SIMD instructions are generated by grouping operations in a basic
block represented by a Data Flow Graph (DFG). After the grouping, permutation instructions are
inserted between SIMD instructions. In [1], in order to generate permutations, which represent
the ordering of packed data in registers, a permutation decomposition backward tree and a forward
tree are used. The backward tree represents the decomposition of the target permutation by
permutation instructions. On the other hand, the forward tree represents the generation of
permutations by permutation instructions from input permutations. By matching the leaves of the
backward and forward trees and searching paths from the target permutation to input permutations,
a permutation instruction sequence to generate the target permutation is obtained. The advantage
of [1] is that more efficient code can be generated compared to [15], because of the utilization
of permutation instructions. Also, a method to generate instruction sequences that produce
permutations is presented, which is not mentioned in [38]. However, the size of the backward tree
used in [1] grows exponentially with the depth of the backward tree. The larger the number of
data elements in registers or the number of permutation instructions becomes, the longer the
compilation time becomes. 
In this thesis, code generation methods for media processors are proposed. First, the
code generation problem with SIMD and permutation instructions is formulated as an integer
linear programming (ILP) problem, extending [15]. Then, a code generation method
using an ILP solver is proposed. In this method, data move operations are introduced in the
directed acyclic graphs representing programs. Moreover, permutation instructions are introduced
in the ILP formulation. The problem can be solved by using an ILP solver. Consequently, compilers
can generate assembly code including SIMD and permutation instructions. The advantage of the
proposed method is that the SIMD instruction utilization is higher than without permutation
instructions. As a result, performance and code size are improved at the same time. Secondly,
a fast code generation method for media processors is proposed. This method generates
SIMD instructions by finding operations which are executable in parallel and mapping them
to SIMD instructions. In this method, a permutation instruction generation technique based on 
MDD is presented. Using MDD, the proposed technique exhaustively generates permutations
and finds feasible instruction sequences which reorder or repack data elements in registers.
Since MDD can represent and manipulate sets of permutations efficiently, permutation
instruction sequences can be generated in a short time. 
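The idea of finding operations which are executable in parallel and mapping them to SIMD instructions can be sketched as follows. This Python fragment is a simplified illustration under stated assumptions, not the heuristic of chapter 5: statements are assumed to be in single-assignment form, and operations are grouped when they have the same operator and the same depth in the data flow graph. The function name group_for_simd and the statement format are hypothetical.

```python
from collections import defaultdict


def group_for_simd(stmts, width=4):
    """Group statements (dest, op, src1, src2) into packs of at most `width`.

    Statements are given in program order and every dest is defined once.
    Two statements at the same depth cannot depend on each other, because a
    dependent statement always has a strictly larger depth than its producer.
    """
    depth = {}
    groups = defaultdict(list)
    for dest, op, a, b in stmts:
        d = 1 + max(depth.get(a, 0), depth.get(b, 0))   # program inputs have depth 0
        depth[dest] = d
        groups[(d, op)].append(dest)
    packs = []
    for d, op in sorted(groups):
        dests = groups[(d, op)]
        for i in range(0, len(dests), width):
            packs.append((op, dests[i:i + width]))
    return packs


stmts = [
    ("t0", "mul", "a0", "b0"),
    ("t1", "mul", "a1", "b1"),
    ("s0", "add", "t0", "t1"),
    ("t2", "mul", "a2", "b2"),
    ("t3", "mul", "a3", "b3"),
    ("s1", "add", "t2", "t3"),
]
print(group_for_simd(stmts, width=2))
```

Each resulting pack is a candidate for one SIMD instruction; permutation instructions are still needed wherever the operands of a pack are not already located at matching positions in registers.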
Chapter 3 
Code Generation for 
Block-Floating-Point Instructions 
This chapter is organized as follows. The H-BFP arithmetic is summarized in section 3.1,
and an instruction-set processor with H-BFP arithmetic and its compiler are described in section
3.2. Experimental results are described in section 3.3. Finally, this chapter is concluded in
section 3.4. 
3.1 Hierarchical Block-Floating-Point Arithmetic 
In this section, conventional block-floating-point arithmetic and hierarchical
block-floating-point arithmetic [34] [33] are briefly summarized. 
3.1.1 Conventional Block-Floating-Point Arithmetic 
Block-floating-point arithmetic (BFP) is based on the concept of floating-point. Each number
is represented as a pair of a scale-factor (exponent) and a mantissa, and arithmetic operations
are performed in a similar way to usual floating-point (FP), that is, by computing and
normalizing the mantissa and then computing the scale-factor. The difference between usual FP
and BFP is that the scale-factor is shared by several numbers in BFP, while each number has its
own scale-factor in usual FP. In BFP, a set of numbers sharing a scale-factor is referred to as a 
[Figure 3.1 content: (a) Binary Numbers (r0 to r3); (b) Block-Floating-Point Representation
(mantissas m0 to m3 and the block-scale-factor; legend: redundant bit, sign bit, mantissa bit)] 
Figure 3.1: Block-Floating-Point Data Format 
data block, and the shared scale-factor is referred to as a block-scale-factor. All numbers in
a data block are normalized based on the block-scale-factor. Fig.3.1 shows a block-floating-point
representation of 4 binary numbers. Fig.3.1(a) shows 4 binary numbers, and Fig.3.1(b) shows
a block-floating-point representation of Fig.3.1(a). The block-scale-factor of the data block is
determined by finding the largest scale-factor in the data block. 
Arithmetic operations are done on a block basis. Scaling and normalizing of mantissas are
performed in the same way as in usual FP. However, the computational cost is less than that of
usual FP, because the number of scale-factor manipulations is reduced by sharing the
scale-factor. Also, the main computation of BFP consists of integer operations such as shift,
addition and multiplication. This means that arithmetic operations of BFP are suitable for
execution on fixed-point DSPs. These are the motivations for introducing BFP. 
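The block-floating-point representation summarized above can be sketched in Python as follows. This is a minimal behavioral model under stated assumptions, not the exact data format of Fig.3.1: mantissas are signed integers of mant_bits bits, the scale-factor of a number v is the smallest exponent e with |v| < 2^e, low-order bits are dropped by rounding down, and the function names bfp_encode and bfp_decode are hypothetical.

```python
import math


def bfp_encode(values, mant_bits=8):
    """Return (block_scale_factor, mantissas) for a data block.

    The block-scale-factor is the largest scale-factor found in the block.
    """
    block_e = max(math.frexp(v)[1] for v in values)
    # Align every number to the shared scale-factor; lower bits are dropped.
    mantissas = [math.floor(math.ldexp(v, mant_bits - 1 - block_e)) for v in values]
    return block_e, mantissas


def bfp_decode(block_e, mantissas, mant_bits=8):
    """Recover the real values represented by a BFP data block."""
    return [math.ldexp(m, block_e - (mant_bits - 1)) for m in mantissas]


block = [6.625, -6.5625, -2.96875, 0.390625]   # illustrative values only
e, ms = bfp_encode(block)
print(e, ms)                 # 3 [106, -105, -48, 6]
print(bfp_decode(e, ms))     # [6.625, -6.5625, -3.0, 0.375]
```

The two largest numbers are represented exactly, while the smaller ones lose low-order bits because they are aligned to the shared block-scale-factor.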
3.1.2 Introduction of Hierarchical Block-Floating-Point Arithmetic 
Hierarchical block-floating-point arithmetic, an improved version of BFP, has been proposed
in [33]. In H-BFP, each data element in a data block is represented as a floating-point number
when it is in memory, and it is represented as a mantissa and a common block-scale-factor
when it is on the computational data path. Note that the block-scale-factor is still available
if all data elements in a data block are in memory, though the representation of a data element
does not depend on the value of the block-scale-factor. During data processing, data elements
are loaded from memories to the registers, and then they are converted from floating-point
format into fixed-point format in the block-floating-point manner. Arithmetic operations such as
addition, multiplication and accumulation are performed on the data represented in fixed-point
format. After the computations, results are converted into floating-point representation again
and stored into memories. Fig. 3.2 illustrates H-BFP arithmetic. As mentioned above, a set of
data is located in data memory in floating-point representation. The floating-point to
fixed-point converter is used to convert the data. Data processing is performed on the
fixed-point data path, and then the results are converted into floating-point representation by
the fixed-point to floating-point converter. The fixed-point to floating-point converter
performs floating-point normalization. The results are stored to the data memory. 
The goal of H-BFP is to obtain a reasonable implementation of digital signal processing
systems with less development effort. H-BFP has an advantage over conventional BFP
arithmetic. In the development of conventional BFP based systems, design quality, such as signal
quality, hardware cost and processing time, heavily depends on the implementation options.
Application developers have to make an effort to find a reasonable implementation. On the other
hand, though H-BFP requires additional overhead in execution time due to data format
conversions, high precision signal processing and low hardware cost are ensured. Fig.3.3 shows
a comparison of the behavior of BFP and H-BFP. The upper part shows the case of BFP, and
the lower part shows the case of H-BFP. Fig.3.3 depicts how the precision of numbers in a
data block varies during a sequence of operations. In Fig.3.3, a data block is loaded from a
data memory, used as operands for certain operations, stored to a data memory, then loaded
from a data memory again for the next computation. Compared with BFP, H-BFP requires
additional memory usage to keep the element-scale-factors and additional computational cost for
floating-point normalization. However, there is a remarkable point in H-BFP. In the data block
loaded from memory for the next computation, which is the leftmost data block
in Fig.3.3, some lower significant bits still hold mantissa bits, while those of the data block
of BFP lose some mantissa bits. Hence, H-BFP achieves higher accuracy than conventional BFP. 
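The difference in retained precision can be illustrated with a small Python model. It is a sketch under simplifying assumptions (8-bit signed mantissas, low-order bits dropped by rounding down, no guard bits), and the function names quantize, store_bfp and store_hbfp are hypothetical; it only contrasts storing a block with one shared scale-factor against storing each element with its own element-scale-factor.

```python
import math


def quantize(v, exponent, mant_bits):
    """Keep mant_bits - 1 fraction bits of v relative to 2**exponent."""
    m = math.floor(math.ldexp(v, mant_bits - 1 - exponent))
    return math.ldexp(m, exponent - (mant_bits - 1))


def store_bfp(block, mant_bits=8):
    # Conventional BFP: every element is stored relative to the shared
    # block-scale-factor, so small elements lose low-order mantissa bits.
    e = max(math.frexp(v)[1] for v in block)
    return [quantize(v, e, mant_bits) for v in block]


def store_hbfp(block, mant_bits=8):
    # H-BFP: every element is stored as a floating-point number with its own
    # element-scale-factor, so its mantissa bits are kept in memory.
    return [quantize(v, math.frexp(v)[1], mant_bits) for v in block]


block = [6.625, 0.0107421875]
print(store_bfp(block))    # [6.625, 0.0]
print(store_hbfp(block))   # [6.625, 0.0107421875]
```

With these assumed widths, the small element vanishes completely when stored in conventional BFP, while the H-BFP style store keeps it, at the price of one scale-factor per element.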
3.1.3 Principle of H-BFP Arithmetic 
In this section, H-BFP arithmetic operations are described in detail. 
Figure 3.2: H-BFP Arithmetic 
[Figure 3.3 content: a data block passing through the stages Load, Computation, Store and Load,
alternating between memory and data path. Upper row: BFP, with block normalization (accuracy is
lost). Lower row: H-BFP, with block normalization and floating-point normalization (accuracy is
kept). Legend: redundant bit, sign bit, mantissa bit, block-scale-factor, element-scale-factor.] 
Figure 3.3: Comparison of conventional BFP and H-BFP 
As mentioned in section 3.1.2, floating-point and block-floating-point representations are
used in H-BFP. The floating-point representation of a data element consists of an
element-scale-factor and a mantissa. Let the element-scale-factor and mantissa be represented as
two's complement binary numbers in this paper. The block-floating-point representation of a data
element consists of a block-scale-factor and a mantissa. The block-scale-factor and mantissa
in a block-floating-point representation are also two's complement binary numbers. In the
block-floating-point representation, guard bits are reserved at the most significant bits of a
mantissa. Guard bits prevent overflow in data processing and allow input data scaling-down to be
omitted [33]. This leads to higher accuracy than without guard bits. The bit width of the
mantissa in memory and the one on the data path do not have to be the same. Generally, the bit
width of the mantissa on the data path is longer than the one in memory, to save memory capacity
and memory access cost. 
The procedural steps of an H-BFP arithmetic operation are listed as follows. The input data
blocks, that is, data elements represented as floating-point numbers and block-scale-factors,
are given in memory. The block-scale-factor of a data block is the maximum value of the
element-scale-factors in the data block. The computation results are obtained as output data
blocks, that is, data elements and block-scale-factors. 
1. Compute input block-scale-factors, which are the block-scale-factors used for binary
alignment of mantissas during computation. Initialize output block-scale-factors, which are
the block-scale-factors of the computation results. 
2. Repeat the steps below for every arithmetic operation. 
(a) Load data elements from memories, then extract the mantissas and convert them into
fixed-point numbers referring to the input block-scale-factors. 
(b) Process data by fixed-point operations. 
(c) Convert fixed-point numbers into floating-point numbers, then store them to
memories. Update output block-scale-factors simultaneously. 
3. Save output block-scale-factors. 
Looking the procedural steps, some H-BFP specific primitive operations can be found. Han~ 
dling block-scale-factors, converting floating-point number to fixed-point mantissa and convert-
ing fixed-point mantissa are required to realize H-BFP. Realization of those hardware functions 
is the main topic of this paper. 
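The handling of block-scale-factors in steps 1, 2(c) and 3 can be sketched in C as follows. This is only an illustrative model, not the hardware described in this paper: the function names block_scale_factor and update_output_bsf and the use of plain int scale-factors are assumptions.

```c
#include <stddef.h>

/* Hypothetical helper: the block-scale-factor of a data block is the
 * maximum element-scale-factor found in the block (assumes n >= 1). */
int block_scale_factor(const int *element_sf, size_t n)
{
    int bsf = element_sf[0];
    for (size_t i = 1; i < n; ++i) {
        if (element_sf[i] > bsf)
            bsf = element_sf[i];
    }
    return bsf;
}

/* Hypothetical helper for step 2(c): keep the maximum of the temporary
 * output block-scale-factor and the newly generated element-scale-factor. */
int update_output_bsf(int temp_obsf, int new_element_sf)
{
    return new_element_sf > temp_obsf ? new_element_sf : temp_obsf;
}
```

Calling update_output_bsf once per stored element yields, at the end of the block, the value that step 3 saves.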
3.2 H-BFP Processor and its Compiler 
3.2.1 Design Concept 
The goal of this study is to realize an instruction-set processor that efficiently executes
H-BFP operations and to provide a compiler for the processor. To achieve this goal, the H-BFP
processor is designed according to the following principles:
• Base the design on a typical 32-bit RISC processor with an integer instruction set.
• Build a data path for H-BFP operations into the processor pipeline.
• Implement three additional types of instructions for H-BFP.
- Integer arithmetic operations on scale-factors
- Load floating-point data from memory, then convert them into fixed-point data
- Convert fixed-point data into floating-point data, then store them to memory
The first principle comes from a requirement on compilers: compilation technology for 32-bit
RISC processors is available today, and it can be used for the proposed processor as it is.
The second and third principles follow from the requirements of H-BFP operations. The
fixed-point data path is already provided by an integer processor, whereas handling of
scale-factors and support for data format conversions are additionally necessary to execute
H-BFP operations. Hence, by implementing such operations on an integer processor, H-BFP can
be executed on the processor.
3.2.2 Hardware Implementation 
A data path structure to execute H-BFP specific operations is shown in Fig.3.4. There are
three parts enclosed with dotted lines: float to fixed conversion, scale-factor manipulation,
and fixed to float conversion. The scale-factor manipulation is further divided into two
parts, align-factor computation and scale-factor computation. The scale-factor register file
is shown in the center of Fig.3.4, and the fixed-point data path is shown at the bottom. The
data to be processed come in from the top left and go out to the top right through the two
conversions and the fixed-point data path. The implementations of the individual components
are as follows.
Figure 3.4: H-BFP Data Path
3.2.2.1 Scale-Factor Register File
The scale-factor register file is a register file that keeps block-scale-factors. Input and
output block-scale-factors (IBSF, OBSF) and temporary block-scale-factors (not shown in
Fig.3.4) are kept there during data processing. The bit width of a register in this register
file should be the same as the bit width of the block-scale-factors of the H-BFP system.
3.2.2.2 Float to Fixed Conversion 
In float to fixed conversion, floating-point numbers are converted into fixed-point numbers. A
barrel shifter (Alignment Shift) is used in this conversion. The inputs of the alignment shift
are a mantissa and a scale-factor. The mantissa is shifted by the value of the scale-factor
generated by scale-factor manipulation, and is then extended by the guard bit width.
3.2.2.3 Align-Factor Computation 
In align-factor computation, the shift amount of the mantissa for floating-point to
fixed-point conversion is computed. Typically, the difference between the input
block-scale-factor and the element scale-factor is used.
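The float to fixed conversion together with the align-factor computation can be modeled in C as below. The sketch assumes a 16-bit mantissa in memory and a 32-bit data path, so that sign extension supplies the guard bits; the names asr32 and to_fixed are hypothetical and do not correspond to actual instructions of the processor.

```c
#include <stdint.h>

/* Arithmetic shift right written so that it is well defined in ISO C
 * even for negative values (s must be >= 0). */
static int32_t asr32(int32_t x, int s)
{
    if (s > 31)
        s = 31;
    return x < 0 ? ~(~x >> s) : x >> s;
}

/* Hypothetical model of float to fixed conversion.
 * The element value is mantissa * 2^esf; the result is aligned to the
 * input block-scale-factor ibsf. Since ibsf is the maximum scale-factor
 * of the block, the align-factor ibsf - esf is assumed to be >= 0. */
int32_t to_fixed(int16_t mantissa, int esf, int ibsf)
{
    int align = ibsf - esf;                   /* align-factor computation */
    return asr32((int32_t)mantissa, align);   /* alignment shift */
}
```

Bits shifted out on the right are lost, which is where the precision difference between block-based and element-based processing comes from.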
3.2.2.4 Fixed to Float Conversion 
In fixed to float conversion, fixed-point numbers are converted into floating-point numbers.
A barrel shifter (Normalize Shift) is used in this conversion. The inputs of the normalize
shift are a mantissa as a fixed-point number and a shift amount to normalize the mantissa. The
shift amount for this normalization is computed by a leading sign counter (Leading sign),
which is implemented using a priority encoder. The element scale-factor is generated by
scale-factor manipulation. The normalized mantissa and the element scale-factor are
concatenated and then output.
3.2.2.5 Scale-Factor Computation 
In scale-factor computation, the element scale-factors of floating-point numbers are computed.
The element scale-factor is computed as (input block-scale-factor - leading sign count + guard
bit count). The output block-scale-factors are also computed in this data path. The output
block-scale-factor of a data block is the maximum element-scale-factor in the data block.
Therefore, the output block-scale-factor can be computed by taking the maximum of the
temporary output block-scale-factor and the generated element scale-factor every time a
floating-point number is generated.
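The normalization and the element scale-factor formula above can be checked with a small C model. It assumes a 32-bit fixed-point word whose low 16 bits hold a normalized mantissa, i.e., 16 guard bits; GUARD_BITS and the function names are assumptions made for this sketch only.

```c
#include <stdint.h>

#define GUARD_BITS 16   /* assumed: 32-bit data path, 16-bit mantissa */

/* Arithmetic shift right, well defined for negative values (s >= 0). */
static int32_t asr32(int32_t x, int s)
{
    if (s > 31)
        s = 31;
    return x < 0 ? ~(~x >> s) : x >> s;
}

/* Leading sign count: number of bits below bit 31 that merely repeat
 * the sign bit (31 for the values 0 and -1). */
int leading_sign_count(int32_t x)
{
    uint32_t u = x < 0 ? ~(uint32_t)x : (uint32_t)x;
    int n = 0;
    for (int bit = 30; bit >= 0 && ((u >> bit) & 1u) == 0; --bit)
        ++n;
    return n;
}

/* Hypothetical model of fixed to float conversion: writes the normalized
 * 16-bit mantissa and returns the element scale-factor
 * ibsf - leading sign count + guard bit count. */
int to_float(int32_t fixed, int ibsf, int16_t *mantissa)
{
    int lsc = leading_sign_count(fixed);
    if (lsc <= GUARD_BITS)   /* value spilled into the guard bits */
        *mantissa = (int16_t)asr32(fixed, GUARD_BITS - lsc);
    else                     /* small value: scale up to normalize */
        *mantissa = (int16_t)(fixed * (int32_t)(1 << (lsc - GUARD_BITS)));
    return ibsf - lsc + GUARD_BITS;
}
```

With this layout an already normalized mantissa has a leading sign count of 16, so its element scale-factor equals the input block-scale-factor.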
3.2.3 Processor Description 
Fig.3.5 shows the proposed H-BFP processor architecture. The H-BFP processor is based on a
standard RISC processor with a 5-stage pipeline, i.e., instruction fetch, instruction decode,
execution, memory access, and write back stages, and the H-BFP data path is embedded in the
pipeline. The H-BFP data path shown in Fig.3.4 is decomposed and distributed over 4 stages.
The scale-factor register file is placed in the second stage so that scale-factors are
available in later stages. The float to fixed conversion and align-factor computation data
paths are placed in the fifth stage, so that floating-point data loaded from data memory can
be converted into fixed-point data before they are stored to the general purpose register
file. The fixed to float conversion and scale-factor computation data paths are placed in the
third stage, so that floating-point data converted from fixed-point data can be stored to data
memory right after the conversion.
Figure 3.5: H-BFP Processor Architecture
For H-BFP based processing, instructions which perform the primitive operations of H-BFP
are implemented. As mentioned in section 3.2.1, there are three types of H-BFP specific
instructions. For scale-factor manipulation, 13 instructions are implemented, including
instructions that move block-scale-factors between the scale-factor register file and data
memory and operations such as addition, subtraction, and maximum selection. These instructions
support various kinds of block-based processing. Load instructions from data memory with
floating-point to fixed-point conversion and store instructions with fixed-point to
floating-point conversion are also implemented. There are 4 instructions each for load and
store. A variety of conversions can be performed by those instructions.
3.2.4 Compiler Support 
In this section, the software development approach for the H-BFP processor is introduced.
Software for the H-BFP processor is developed as follows. First, a program based on
floating-point arithmetic is written for the target application in a high level programming
language. Then, the program is rewritten into a program based on H-BFP arithmetic. The
computation models of floating-point and H-BFP differ in two respects: data are not blocked
in the former and blocked in the latter, and data conversions before and after operations are
needed only in the latter. Hence, the software for the H-BFP processor can be developed by
adding H-BFP specific operations to the floating-point arithmetic based program. To enable
programmers to add such operations, a compiler technique called compiler intrinsics is used.
Compiler intrinsic functions are functions in the high level programming language which are
mapped to specific instructions of the target processor. Using compiler intrinsics, H-BFP
specific operations in H-BFP programs can be directly mapped to instructions of the H-BFP
processor.
Fig.3.6 shows the development flows of the conventional approaches and the proposed approach.
In conventional fixed-point or BFP based software development, and in the H-BFP based software
development presented in [35], a floating-point arithmetic based program is developed first.
Then, the target application is developed using the floating-point based program as a
reference model. In the conventional fixed-point or BFP based software development flow,
several implementations must be considered, and the trade-off between signal processing
quality and costs has to be analyzed. In the flow of [35], while the features of H-BFP ease
development, assembly programming is still needed. On the other hand, in the proposed
approach, the target application program can be obtained easily because all that has to be
done is to refine the program by inserting compiler intrinsic functions.
Fig.3.7 shows an example of program refinement of a floating-point program. There are three
program fragments in Fig.3.7. The left program is the floating-point implementation for
processors supporting floating-point arithmetic. The upper and lower programs on the right of
Fig.3.7 are the H-BFP implementation and the floating-point implementation for the H-BFP
processor, respectively. The upper right program can be obtained by inserting scale-factor
manipulations and data conversions in a block-floating-point manner. On the other hand, the
lower right program can be obtained by inserting H-BFP specific operations in a floating-point
manner. The compiler intrinsic functions sfcselect, tofix and toflt are mapped to their
corresponding instructions of the H-BFP processor.
(a) Conventional approach (b) Proposed approach
Figure 3.6: Comparison of Application Development Flow
All these example programs compute the addition of two vectors. In the upper right program
in Fig.3.7, the addition of two vectors is interpreted as the addition of two vectors which
belong to different data blocks. The variables asfb and bsfb in Fig.3.7 hold the
block-scale-factors for data blocks a and b, respectively. The sfcselect function computes the
block-scale-factor which determines the fixed-point data format for the addition. In the loop
body, the tofix function performs floating-point to fixed-point conversion, and the toflt
function performs fixed-point to floating-point conversion. By inserting the scale-factor
manipulation such that the H-BFP processor performs it before every addition, as shown in the
lower right program, a floating-point implementation on the H-BFP processor can be obtained.
The floating-point implementation takes more execution cycles at runtime than the H-BFP
implementation. However, the floating-point implementation achieves higher precision of
arithmetic operations than the H-BFP implementation.
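The difference between the two refinements can be illustrated with a software model written in plain C. The intrinsic names sfcselect, tofix and toflt are taken from Fig.3.7, but their signatures, the hfloat type, and the assumption that sfcselect returns the larger of two scale-factors are hypothetical choices made only for this sketch.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { int16_t mant; int sf; } hfloat;   /* value = mant * 2^sf */

static int32_t asr32(int32_t x, int s)   /* arithmetic shift right, s >= 0 */
{
    if (s > 31)
        s = 31;
    return x < 0 ? ~(~x >> s) : x >> s;
}

/* Assumed behavior: select the larger scale-factor for the addition. */
static int sfcselect(int sf_a, int sf_b) { return sf_a > sf_b ? sf_a : sf_b; }

/* Align a mantissa to scale-factor sf (sf >= x.sf is assumed). */
static int32_t tofix(hfloat x, int sf) { return asr32(x.mant, sf - x.sf); }

/* Normalize a fixed-point value with scale-factor sf to a 16-bit mantissa. */
static hfloat toflt(int32_t fixed, int sf)
{
    hfloat r;
    while (fixed > INT16_MAX || fixed < INT16_MIN) { fixed = asr32(fixed, 1); ++sf; }
    while (fixed != 0 && fixed >= -16384 && fixed <= 16383) { fixed *= 2; --sf; }
    r.mant = (int16_t)fixed;
    r.sf = sf;
    return r;
}

/* H-BFP manner: one scale-factor for the whole block; returns the output
 * block-scale-factor (assumes n >= 1). */
int hbfp_add(const hfloat *a, int asfb, const hfloat *b, int bsfb,
             hfloat *c, size_t n)
{
    int sf = sfcselect(asfb, bsfb);
    int csfb = INT_MIN;
    for (size_t i = 0; i < n; ++i) {
        c[i] = toflt(tofix(a[i], sf) + tofix(b[i], sf), sf);
        if (c[i].sf > csfb)
            csfb = c[i].sf;
    }
    return csfb;
}

/* Floating-point manner: the scale-factor is selected before every addition. */
void fp_add(const hfloat *a, const hfloat *b, hfloat *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        int sf = sfcselect(a[i].sf, b[i].sf);
        c[i] = toflt(tofix(a[i], sf) + tofix(b[i], sf), sf);
    }
}
```

For a block that contains one large and one very small element, hbfp_add shifts the small element out of the mantissa, whereas fp_add keeps it at the price of a scale-factor selection in every iteration.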
[Program fragments shown in the figure: the floating-point (FP) implementation for processors
with FP, the H-BFP implementation for H-BFP processors (with block-scale-factor computation
for the addition and float to fixed conversions), and the FP implementation for H-BFP
processors.]
Figure 3.7: Program Refinement for H-BFP processor
3.3 Experimental Results 
In this section, the experimental results are presented. 
3.3.1 Experimental Setup 
An HDL model of the H-BFP processor has been designed using an ASIP development tool,
ASIP Meister [44], and a compiler with compiler intrinsics has been developed. The H-BFP
processor is based on DLX-Int, which is the DLX [45] without floating-point instructions.
By adding the H-BFP data path and H-BFP instructions to DLX-Int, an H-BFP processor,
DLX-HBFP, is developed. Note that the original DLX [45] has no integer multiplication
instruction; it is supposed that floating-point multiplication is used when the original DLX
computes an integer multiplication. Therefore, a 16-bit multiplier and a multiplication
instruction are added to DLX-HBFP. A 16-bit multiplier is commonly implemented in commercial
digital signal processors. The bit width of the integer and fixed-point data path of DLX-HBFP
is 32 bits. The floating-point format in memory is composed of an 8-bit exponent and a 16-bit
mantissa.
To compare the H-BFP processor with a usual floating-point processor, DLX-FP, which is
DLX-Int with a floating-point unit, has also been developed. The floating-point unit of DLX-FP
provides single precision floating-point computation conforming to the IEEE standard [46]. It
performs arithmetic operations such as addition and multiplication, as well as floating-point
to integer conversion, integer to floating-point conversion, etc. DLX-FP takes 10 cycles per
single precision floating-point addition/subtraction and 13 cycles per floating-point
multiplication. When DLX-FP issues a floating-point operation, the internal pipeline of DLX-FP
stalls until the floating-point operation finishes.
A compiler that generates assembly code from an H-BFP program, the H-BFP compiler, has also
been developed for this experiment. The H-BFP compiler generates H-BFP instructions from the
intrinsic functions in a high-level program.
In these experiments, the design quality of the hardware and the application processing
performance of DLX-HBFP are studied. DLX-HBFP, DLX-Int and DLX-FP are synthesized, and their
delay, frequency and gate count are estimated to evaluate DLX-HBFP in terms of design quality.
The performance of DLX-HBFP is evaluated using the DSPstone benchmark suite [47]. DSPstone is
a set of programs taken from digital signal processing applications. The static instruction
count of the generated assembly code is compared between the H-BFP processor and the
floating-point processor. Moreover, the execution times of the programs run on DLX-HBFP and
DLX-FP are compared to evaluate the performance of the H-BFP processor. Finally, to evaluate
the accuracy of the output, the outputs of DLX-HBFP and DLX-FP are compared with the output
of double precision floating-point computation.
3.3.2 Results 
In this section, experimental results are shown. 
Table 3.1: Results of logic synthesis
            Delay [ns]   Frequency [MHz]   Area [gate]
DLX-Int     4.82         207.4             54161
DLX-HBFP    5.36         186.6             63107
DLX-FP      6.08         164.5             79087
Table 3.2: Breakdown of gate count of DLX-HBFP
Base (DLX-Int)                54161
Incremental gate count
  H-BFP components             1809
  Pipeline register            2712
  Multiplexer                  2279
  Control                      2016
  Other                         115
  (sum)                        8931
Total                         63107
3.3.2.1 Hardware Evaluation 
The hardware area of the H-BFP processor was estimated. The HDL models of DLX-Int, DLX-HBFP
and DLX-FP were synthesized using a 0.14 µm process. The synthesis results are summarized
in Tab.3.1, which shows the estimated delay, frequency and gate count for DLX-Int, DLX-HBFP
and DLX-FP. The total area of DLX-HBFP is about 63 Kgates with a maximum frequency of about
186.6 MHz. Compared with DLX-Int, the gate count of DLX-HBFP increased by about 8.9 Kgates,
while that of DLX-FP increased by about 25 Kgates.
The breakdown of the gate count of the synthesized DLX-HBFP is shown in Tab.3.2. Tab.3.2
shows the total gate count of DLX-Int together with the breakdown of the incremental gate
count. The gate count of the H-BFP components is just 1.8 Kgates. The additional gate count
to embed the H-BFP components into the processor pipeline, which is the sum of pipeline
register, multiplexer, control and other, is about 7.1 Kgates. On the other hand, the gate
count of DLX-FP increased by about 25 Kgates. A floating-point coprocessor typically costs
about 20-40 Kgates [48][49]. This means that the capability of H-BFP can be added with little
hardware and delay overhead.
3.3.2.2 Performance
To confirm the performance of the H-BFP processor/compiler, the size of the assembly programs
for the H-BFP processor has been evaluated. DLX-HBFP was compared with DLX-FP. Table
3.3 shows the number of instructions to process each program for DLX-HBFP and DLX-FP.
Comparing DLX-HBFP with DLX-FP, the number of instructions of DLX-HBFP is comparable
to that of DLX-FP in 7 cases. This result indicates that the H-BFP processor/compiler can
process applications as efficiently as a usual processor/compiler supporting floating-point
arithmetic.
Table 3.3: Comparison of the number of insns. among different implementations,
N: the size of vector, M: the width/height of matrix, T: the number of taps
                     DLX-HBFP [# of insns]        DLX-FP [# of insns]
n_real_updates       14N+26                       14N+13
n_complex_updates    44N+30                       36N+13
complex_multiply     26N+15                       26N+7
convolution          12N+20                       12N+12
dot_product          12N+22                       12N+11
fir                  12NT+30                      10NT+9
matrix1              12M^3+16M^2+7M+11            10M^3+14M^2+7M+4
matrix2              12M^3+25M^2+7M+12            12M^3+23M^2+7M+4
mat1x3               12M^2+14M+11                 12M^2+12M+4
fir2dim              30M^2T+23M^2+5M+11           30M^2T+24M^2+6M+5
Figure 3.8: Relative Ratio of Execution Time of DLX-HBFP to DLX-FP
Fig.3.8 shows the relative ratio of the execution time of DLX-HBFP to that of DLX-FP. The
execution time is defined as the execution cycle count divided by the frequency. In Fig.3.8,
the relative ratio ranges from about 0.5 up to 0.8. This is because the execution cycle count
of DLX-HBFP is less than that of DLX-FP and the frequency of DLX-HBFP is higher than that of
DLX-FP. Admittedly, the floating-point operations of DLX-FP take a larger number of cycles
than those of a usual floating-point unit. However, the number of dynamically executed
instructions of DLX-HBFP is almost the same as that of DLX-FP, and the computational cost of
H-BFP operations is smaller than that of floating-point operations. Hence, DLX-HBFP can
process applications more efficiently than DLX-FP.
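The definition of execution time used for Fig.3.8 amounts to the following computation. The function names are hypothetical, and any cycle counts passed to them are placeholders to be taken from simulation; only the frequencies would come from Tab.3.1.

```c
/* Execution time in microseconds: cycle count divided by frequency in MHz. */
double exec_time_us(unsigned long cycles, double freq_mhz)
{
    return (double)cycles / freq_mhz;
}

/* Relative ratio of the execution time of DLX-HBFP to that of DLX-FP.
 * A value below 1.0 means that DLX-HBFP finishes earlier. */
double relative_exec_time(unsigned long cycles_hbfp, double freq_hbfp_mhz,
                          unsigned long cycles_fp, double freq_fp_mhz)
{
    return exec_time_us(cycles_hbfp, freq_hbfp_mhz) /
           exec_time_us(cycles_fp, freq_fp_mhz);
}
```

Both a smaller cycle count and a higher clock frequency lower the ratio, which is why the two effects named in the text add up.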
3.3.2.3 Signal Processing Quality 
The C programs of the DSPstone benchmark, which are implemented with floating-point
arithmetic, have been modified for DLX-HBFP. The HDL model of DLX-HBFP with the object code
generated by the compiler has been simulated on an HDL simulator. White noise has been used
as the input of the programs.
The signal processing quality of the H-BFP implementation has been evaluated using a
signal-to-noise ratio measure which is defined as
\[ \mathrm{SNR} = 10 \cdot \log_{10}\left[\frac{\sum_{n=0}^{N-1} R(n)^2}{\sum_{n=0}^{N-1} \{R(n) - S(n)\}^2}\right] \tag{3.1} \]
where N is the number of output samples of the application, R(n) is the n-th output of
the double precision floating-point computation, and S(n) is the n-th output obtained by HDL
simulation of the H-BFP processor, respectively.
Table 3.4: SNR of each program
                     DLX-HBFP [dB]   DLX-FP [dB]
n_real_updates       83.13           146.67
n_complex_updates    81.66           148.33
complex_multiply     83.91           141.89
convolution          80.85           151.48
dot_product          82.68           137.00
fir                  80.38           146.67
matrix1              81.01           146.51
matrix2              81.34           146.27
mat1x3               81.27           142.54
fir2dim              83.57           145.92
Tab.3.4 shows the SNR of DLX-HBFP and DLX-FP for each program. The first column
shows the names of the programs, and the second and third columns show the SNR of DLX-HBFP
and DLX-FP, respectively. In Tab.3.4, the SNRs of DLX-HBFP range from 80.85 to 83.91. The
SNRs of DLX-FP are higher than those of DLX-HBFP. This is because the SNR depends on the bit
width of the mantissa. The bit width of the mantissa of DLX-HBFP is 16 bits in memory. On the
other hand, the bit width of the mantissa of DLX-FP is 23 bits because of the single precision
floating-point format. According to the previous study, the SNRs of H-BFP are expected to be
up to 80 dB, which is sufficient for practical applications [33]. The results of Tab.3.4 are
consistent with the previous work.
3.4 Summary 
In this chapter, a processor supporting hierarchical block-floating-point arithmetic and a
software development method for the processor are proposed. In the experiments, several
applications have been implemented and simulated on the H-BFP processor. It is confirmed that
the H-BFP processor can achieve high signal quality at low hardware cost. Using the proposed
method, signal processing applications can be easily developed.
Chapter 4
Optimal Code Generation for Media 
Instructions 
This chapter describes the optimal code generation method for media processors considering
permutation instructions. This chapter is organized as follows: Section 4.1 describes SIMD
instructions. Section 4.2 introduces a code selection method using tree parsing and dynamic
programming [50]. Section 4.3 explains Leupers's method [15]. Section 4.4 describes the
proposed method. Section 4.5 shows experimental results. Section 4.6 concludes this chapter.
4.1 SIMD Instructions 
In SIMD instructions, a value in a register is assumed to consist of several values. Fig. 4.1(a)
shows a SIMD instruction that performs two additions on the upper and lower parts of registers.
LOAD/STORE instructions are also regarded as SIMD instructions. Fig. 4.1(b) shows an
example of SIMD LOAD/STORE instructions. Usually, processors with SIMD instructions also
have permutation instructions. Permutation instructions transfer several values from a couple
of registers into a register. Permutation instructions are useful for executing SIMD
instructions effectively because they produce packed data.
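The behavior of the "ADD2" instruction of Fig. 4.1(a) can be emulated in C as follows. The function name add2 and the wrap-around behavior on overflow are assumptions of this sketch; an actual media processor may, for instance, saturate instead.

```c
#include <stdint.h>

/* Emulation of a two-way 16-bit SIMD addition on 32-bit registers.
 * The upper and lower halves are added independently, so a carry out
 * of the lower half never reaches the upper half. */
uint32_t add2(uint32_t a, uint32_t b)
{
    uint32_t hi = ((a >> 16) + (b >> 16)) & 0xFFFFu;
    uint32_t lo = ((a & 0xFFFFu) + (b & 0xFFFFu)) & 0xFFFFu;
    return (hi << 16) | lo;
}
```

A plain 32-bit addition of 0x0000FFFF and 0x00000001 would yield 0x00010000, whereas add2 yields 0x00000000 because the halves are isolated.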
(a) "ADD2" instruction (b) "SIMD" LOAD/STORE instructions
Figure 4.1: Example of SIMD instructions.
Fig. 4.2 shows an example of permutation instructions. First, a[i] and a[i+1], and b[i] and
b[i+1] are loaded by SIMD LOAD instructions. Since the source values of the additions are not
located at corresponding positions in the registers, a SIMD instruction cannot be applied
right after loading. However, by rearranging the values with permutation instructions, SIMD
instructions can be applied and the program is executed efficiently.
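The situation of Fig. 4.2 can be traced with the following C emulation. All names (simd_load, simd_store, pack_hh, pack_ll, add_adjacent_pairs) are hypothetical, and the sketch assumes that the element with the lower index is placed in the upper half of the register.

```c
#include <stdint.h>

static uint32_t add2(uint32_t a, uint32_t b)   /* two-way 16-bit SIMD add */
{
    uint32_t hi = ((a >> 16) + (b >> 16)) & 0xFFFFu;
    uint32_t lo = ((a & 0xFFFFu) + (b & 0xFFFFu)) & 0xFFFFu;
    return (hi << 16) | lo;
}

/* "SIMD" LOAD/STORE: two adjacent 16-bit values share one 32-bit register. */
static uint32_t simd_load(const int16_t *p)
{
    return ((uint32_t)(uint16_t)p[0] << 16) | (uint16_t)p[1];
}

static void simd_store(int16_t *p, uint32_t r)
{
    p[0] = (int16_t)(uint16_t)(r >> 16);
    p[1] = (int16_t)(uint16_t)(r & 0xFFFFu);
}

/* Hypothetical permutation instructions combining halves of two registers. */
static uint32_t pack_hh(uint32_t x, uint32_t y) { return (x & 0xFFFF0000u) | (y >> 16); }
static uint32_t pack_ll(uint32_t x, uint32_t y) { return (x << 16) | (y & 0xFFFFu); }

/* c[i] = a[i] + a[i+1]; c[i+1] = b[i] + b[i+1]; using one SIMD addition. */
void add_adjacent_pairs(const int16_t *a, const int16_t *b, int16_t *c, int i)
{
    uint32_t ra = simd_load(&a[i]);        /* (a[i],   a[i+1]) */
    uint32_t rb = simd_load(&b[i]);        /* (b[i],   b[i+1]) */
    uint32_t first = pack_hh(ra, rb);      /* (a[i],   b[i])   */
    uint32_t second = pack_ll(ra, rb);     /* (a[i+1], b[i+1]) */
    simd_store(&c[i], add2(first, second));
}
```

Without the two pack operations the operands of each addition would sit in the same register, and four scalar loads and two scalar additions would be needed instead.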
4.2 Code Selection
Code selection is usually implemented by using tree pattern matching and dynamic programming
[50].
Let us assume a DAG G = (V, E) representing a given basic block. Here, v ∈ V represents
an IR level operation such as an arithmetic, logical, load or store operation, and e ∈ E
represents a data dependency. A DAG is divided at its CSEs (Common Sub Expressions) into
DFTs (Data Flow Trees). Consequently, a set of DFTs is obtained for a basic block.
short a[N], b[N], c[N];
c[i] = a[i] + a[i+1];
c[i+1] = b[i] + b[i+1];
[The figure shows "SIMD" LOADs of a[i],a[i+1] and b[i],b[i+1] into registers, the permutation
of the loaded values, and a "SIMD" STORE of c[i],c[i+1].]
Figure 4.2: permutation instructions.
In the tree pattern matching and dynamic programming technique, an instruction set is modeled
as a tree grammar. A tree grammar consists of a set of terminals, a set of nonterminals, a set
of rules, a start symbol and a cost function for rules. Terminals represent operators in a
DFT. Nonterminals represent hardware resources that can store data, such as registers and
memories. The cost function assigns a cost to each rule, which is usually the number of
execution cycles of the instruction corresponding to the rule. Rules are used to represent the
behavior of instructions. For example, an ADD instruction which adds the contents of two
registers and stores the result to a register is represented as follows.
reg → PLUS(reg, reg)
Code selection for a DFT is carried out by finding a derivation of the DFT which has minimal
cost. In order to find such a derivation, dynamic programming is used. In a bottom-up
traversal, all nodes v in the DFT are labeled with a set of triples (n, p, c), where n is a
nonterminal, p is a rule, and c is the cost of the subtree whose root is v. This means that
node v can be reduced to n by applying rule p at cost c.
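The bottom-up labeling can be illustrated with a toy tree grammar. The grammar below has a single nonterminal reg and three rules with unit cost; these rules, the node layout, and the function name label are all hypothetical and serve only to show how the cheapest rule is recorded at each node.

```c
#include <stddef.h>

enum op { OP_CONST, OP_PLUS };
enum rule { RULE_NONE, RULE_LOADI, RULE_ADD, RULE_ADDI };

struct node {
    enum op op;
    struct node *left, *right;
    enum rule best_rule;   /* p: cheapest rule that reduces this node to reg */
    int best_cost;         /* c: cost of the subtree rooted at this node */
};

/* Toy grammar (all rules are assumptions for illustration):
 *   RULE_LOADI: reg -> CONST             cost 1
 *   RULE_ADD:   reg -> PLUS(reg, reg)    cost 1
 *   RULE_ADDI:  reg -> PLUS(reg, CONST)  cost 1
 * Labels every node bottom-up and returns the minimal cost of v. */
int label(struct node *v)
{
    if (v->op == OP_CONST) {
        v->best_rule = RULE_LOADI;
        v->best_cost = 1;
        return v->best_cost;
    }
    int left = label(v->left);
    int right = label(v->right);

    /* RULE_ADD always matches a PLUS node. */
    v->best_rule = RULE_ADD;
    v->best_cost = left + right + 1;

    /* RULE_ADDI covers the constant child itself, so only the cost of
     * the left subtree is added. */
    if (v->right->op == OP_CONST && left + 1 < v->best_cost) {
        v->best_rule = RULE_ADDI;
        v->best_cost = left + 1;
    }
    return v->best_cost;
}
```

After labeling, a top-down walk that follows best_rule from the root emits the selected instructions; with several nonterminals, one (rule, cost) pair per nonterminal would be stored at each node.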
4.3 SIMD Instruction Formulation 
In this section, the formulation and solution in reference [15] are summarized.
4.3.1 Rules for SIMD instructions 
The set of DFTs mentioned in section 4.2 is considered. The flow of this method is as follows:
first, a set of rules is computed at each node in a DFT by pattern matching. Next, a rule is
selected from the set such that the cost is minimal. For the sake of simplicity, we discuss
the case of two data placed in a register. However, it is easy to extend this method to the
case of three or more data placed in a register.
When an N-bit processor with SIMD instructions performs an operation on N/2-bit data, there
are three options to execute the operation.
• Execute an instruction which operates on the N-bit register
• Execute a SIMD instruction, where the operation is performed on the upper part of a register
• Execute a SIMD instruction, where the operation is performed on the lower part of a register
In the tree grammar, it is necessary to distinguish full registers as well as upper and lower
subregisters. To represent the operations on the upper and lower parts of a register, the
additional nonterminals reg_hi and reg_lo are introduced. Using reg_hi and reg_lo, the three
options mentioned above can be represented.
• Arithmetic and logical operations
For example, 32-bit addition and the upper and lower parts of SIMD addition are represented
as follows.
reg → PLUS(reg, reg)
reg_hi → PLUS(reg_hi, reg_hi)
reg_lo → PLUS(reg_lo, reg_lo)
Other operations can be represented similarly to the example of addition.
• Loads and stores
Similar to arithmetic and logical operations, 16-bit load operations are represented as
follows.
reg → LOAD_SHORT(addr)
reg_hi → LOAD_SHORT(addr)
reg_lo → LOAD_SHORT(addr)
16-bit store operations are represented as follows.
S → STORE_SHORT(reg, addr)
S → STORE_SHORT(reg_hi, addr)
S → STORE_SHORT(reg_lo, addr)
• Common sub expressions
The definition and the use of a CSE are respectively represented as follows.
S → DEF_SHORT_CSE(reg)
S → DEF_SHORT_CSE(reg_hi)
S → DEF_SHORT_CSE(reg_lo)
reg → USE_SHORT_CSE
reg_hi → USE_SHORT_CSE
reg_lo → USE_SHORT_CSE
M(vj) = { R1 = reg → MUL(reg, reg), R2 = reg_lo → MUL(reg_lo, reg_lo),
R3 = reg_up → MUL(reg_up, reg_up) }
M(vi) = { R4 = reg → PLUS(reg, reg), R5 = reg_lo → PLUS(reg_lo, reg_lo),
R6 = reg_up → PLUS(reg_up, reg_up) }
Figure 4.3: Consistency of nonterminals.
Figure 4.4: Schedulability.
4.3.2 Constraints on selection of rules
In the matching phase, a set of rules is annotated at each node. In the next phase, a rule is
selected from the set, where the selection has to be done under the following constraints.
• Selection of a single rule
For each node vi, exactly one rule has to be selected.
• Consistency of nonterminals
Let vj and vk be children of vi in a DFT. Here, the nonterminal on the left hand side of a
rule is called the target nonterminal. The target nonterminals of the rules selected for vj
and vk have to be consistent with the corresponding arguments of the rule selected for vi.
Fig. 4.3 shows an example of consistency of nonterminals. If R2 is selected for vi, R5
has to be selected for vj.
• Common sub expressions
The nonterminal of the rule selected for the definition vi of a CSE and the nonterminal of
the rule selected for its use vj must be identical.
• Node pairing
When vi is executed by a SIMD instruction, there must exist another node vj which is executed
by the identical SIMD instruction.
• Schedulability
When determining which nodes are executed by SIMD instructions, the data dependencies
between the pairs have to be considered. As shown in Fig. 4.4, if vi and vj are executed by
an identical SIMD instruction, vk and vl cannot be executed at the same time.
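The consistency constraint on nonterminals can be expressed as a simple predicate. The rule representation below (a target nonterminal plus the nonterminals expected from two children) and the function name rules_consistent are assumptions of this sketch.

```c
enum nonterminal { NT_REG, NT_REG_HI, NT_REG_LO, NT_STMT };

struct rule {
    enum nonterminal target;    /* left hand side of the rule */
    enum nonterminal args[2];   /* nonterminals expected from the children */
};

/* Returns 1 if the rules selected for the left and right child deliver
 * exactly the nonterminals that the rule selected for the parent expects. */
int rules_consistent(const struct rule *parent,
                     const struct rule *left,
                     const struct rule *right)
{
    return left->target == parent->args[0] &&
           right->target == parent->args[1];
}
```

For instance, a parent rule reg_lo → PLUS(reg_lo, reg_lo) is consistent only with child rules whose target nonterminal is reg_lo.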
4.3.3 ILP formulation 
Let V = {v_1, ..., v_n} be the set of DFG nodes, and let M(v_i) be the set of all rules R_j
matching v_i. Boolean solution variables x_ij are defined as follows:
\[ x_{ij} = \begin{cases} 1, & \text{if } R_j \text{ is selected for } v_i \\ 0, & \text{otherwise} \end{cases} \tag{4.1} \]
After the ILP is solved, the variables x_ij denote which rule is selected for v_i from M(v_i).
A pair of nodes (v_i, v_j) is called a SIMD pair if it satisfies the conditions below.
• v_i and v_j can be executed in parallel. Namely, there is no path from v_i to v_j or from
v_j to v_i in the DFG.
• v_i and v_j represent the same operation.
• M(v_i) contains a rule with target nonterminal reg_hi, and M(v_j) contains a rule with
target nonterminal reg_lo.
• If v_i and v_j are LOAD or STORE operations which work on memory addresses p_i and p_j,
then p_i - p_j is equal to the number of bytes occupied by a 16-bit value.
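The four conditions for a SIMD pair can be checked mechanically. The following C sketch uses a small adjacency-matrix DFG; the struct layout, the MAX_NODES bound, and the function names are hypothetical, and the address condition is coded literally as p_i - p_j = 2 bytes.

```c
#include <stdbool.h>

#define MAX_NODES 8

enum op_kind { OP_ADD, OP_LOAD, OP_STORE };

struct dfg {
    int n;                               /* number of nodes, <= MAX_NODES */
    bool edge[MAX_NODES][MAX_NODES];     /* edge[u][v]: v depends on u */
    enum op_kind op[MAX_NODES];
    bool has_hi_rule[MAX_NODES];         /* M(v) has a rule targeting reg_hi */
    bool has_lo_rule[MAX_NODES];         /* M(v) has a rule targeting reg_lo */
    long addr[MAX_NODES];                /* byte address of a LOAD/STORE */
};

/* Depth-first search; terminates because the DFG is acyclic. */
static bool reaches(const struct dfg *g, int from, int to)
{
    if (from == to)
        return true;
    for (int v = 0; v < g->n; ++v) {
        if (g->edge[from][v] && reaches(g, v, to))
            return true;
    }
    return false;
}

/* True if (v_i, v_j) satisfies the four SIMD pair conditions, with v_i
 * mapped to the upper and v_j to the lower part of the register. */
bool is_simd_pair(const struct dfg *g, int i, int j)
{
    if (i == j)
        return false;
    if (reaches(g, i, j) || reaches(g, j, i))     /* must be parallel */
        return false;
    if (g->op[i] != g->op[j])                     /* same operation */
        return false;
    if (!g->has_hi_rule[i] || !g->has_lo_rule[j])
        return false;
    if (g->op[i] == OP_LOAD || g->op[i] == OP_STORE)
        return g->addr[i] - g->addr[j] == 2;      /* adjacent 16-bit data */
    return true;
}
```

Enumerating all ordered node pairs with this predicate yields the set P of SIMD pairs over which the auxiliary variables are defined.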
Boolean auxiliary variables y_ij are defined as follows:
\[ y_{ij} = \begin{cases} 1, & \text{if } v_i \text{ and } v_j \text{ are executed by an identical SIMD instruction} \\ 0, & \text{otherwise} \end{cases} \tag{4.2} \]
where y_ij = 1 denotes that v_i and v_j are executed by an identical SIMD instruction such
that the result of the operation on v_i is stored to the upper part of the destination
register and the result of the operation on v_j is stored to the lower part of the
destination register.
The constraints described above are represented as follows.
• Selection of a single rule
Since exactly one x_ij becomes 1 for each v_i, this constraint is represented as follows.
\[ \forall v_i : \sum_{R_j \in M(v_i)} x_{ij} = 1 \tag{4.3} \]
• Consistency of target nonterminals
Let R_j ∈ M(v_i), R_j = n_1 → t(n_2, n_3) for a terminal t and nonterminals n_1, n_2, n_3, and
let v_l and v_r be the left and right child of v_i. Let M_N(v) ⊆ M(v) denote the subset
of rules matching v that have N as the target nonterminal. If R_j = n_1 → t(n_2, n_3) is
selected for v_i, then the rules chosen for v_l and v_r must have the target nonterminals n_2
and n_3, respectively. This constraint is represented as follows.
\[ x_{ij} \le \sum_{R_k \in M_{n_2}(v_l)} x_{lk} \tag{4.4} \]
\[ x_{ij} \le \sum_{R_k \in M_{n_3}(v_r)} x_{rk} \tag{4.5} \]
• Common subexpressions
Definitions of 16-bit CSEs and uses of 16-bit CSEs have been defined as follows.
R_1 = S → DEF_SHORT_CSE(reg)
R_2 = S → DEF_SHORT_CSE(reg_hi)
R_3 = S → DEF_SHORT_CSE(reg_lo)
R_4 = reg → USE_SHORT_CSE
R_5 = reg_hi → USE_SHORT_CSE
R_6 = reg_lo → USE_SHORT_CSE
Therefore, if v_d is the definition of a CSE and v_u is a use of the CSE, it is clear that
M(v_d) = {R_1, R_2, R_3} and M(v_u) = {R_4, R_5, R_6}. This constraint is represented as follows.
(4.6) 
• Node pairing
Let P denote the set of SIMD pairs. If R_k ∈ M_hi(v_i) is selected for v_i, there must be a
node v_j and a rule R_l ∈ M_lo(v_j) selected for v_j such that (v_i, v_j) ∈ P. This condition
is represented as follows.
\[ \forall v_i : \sum_{R_k \in M_{hi}(v_i)} x_{ik} = \sum_{j : (v_i, v_j) \in P} y_{ij} \tag{4.7} \]
\[ \forall v_i : \sum_{R_k \in M_{lo}(v_i)} x_{ik} = \sum_{j : (v_j, v_i) \in P} y_{ji} \tag{4.8} \]
• Schedulability
Let $X(v)$ denote the set of nodes that must be executed before $v$, and let $Y(v)$ denote the set
of nodes that must be executed after $v$. If $(v_i, v_j) \in P$, then the set $Z_{ij}$ defined below has
to be empty.
(4.9) 
This constraint is represented as follows. 
(4.10) 
• Objective function 
Figure 4.5: Node insertion for data transfers.
The optimization goal is to make maximum use of SIMD instructions. Since the target
nonterminals of the rules for SIMD instructions are reg_hi or reg_lo, the objective
function is represented as follows.
(4.11) 
4.4 SIMD Instruction Formulation with Permutation Instructions
In this section, the proposed method is explained. The proposed method extends
Leupers's method [15] so that data transfers for SIMD instructions are considered during instruction
selection in the compiler. The following sections explain the details.
4.4.1 IR and Rules for Data Packing and Moving 
To represent data transfers on a DFT, nodes that represent data transfer operations are introduced.
Since candidates for data transfers appear between operations, nodes for data transfers are
inserted between all operations. Fig. 4.5 shows the node insertion for data transfer: DT1, DT2 and
DT3 are added to the DFT. Moreover, rules for data transfer are also introduced. When a
processor executes a permutation instruction, there are three conditions according to the locations
where the data exist.
Figure 4.6: Rule of permutation instructions. (a) reg_hi → PACK(reg_hi) (b) reg_hi → PACK(reg_lo) (c) reg_hi → PACK(reg)
• Two values are located in a register. The value to be packed is in the upper part
of the register.
• Two values are located in a register. The value to be packed is in the lower part
of the register.
• A single value is located in a register.
These three conditions are shown in Fig. 4.6. Fig. 4.6(a) shows a data transfer from the upper part
of a source register to the upper part of a destination register. To represent permutation instructions,
the terminal PERM is used. Fig. 4.6(a) represents the rule reg_hi → PERM(reg_hi). Similarly,
Fig. 4.6(b) represents the rule reg_hi → PERM(reg_lo). Fig. 4.6(c) shows a data transfer
from a source register occupied by a single value to the upper part of a destination register, and
represents the rule reg_hi → PERM(reg). Data transfer to the lower part of the destination
register is represented in the same way as the transfer to the upper part described above.
These conditions for permutation instructions are formulated as the additional rules shown below,
reg_lo → PERM(reg_lo)
reg_hi → PERM(reg_lo)
reg_lo → PERM(reg_hi)
reg_hi → PERM(reg_hi)
reg_lo → PERM(reg)
reg_hi → PERM(reg)
where a permutation instruction consists of two rules: one has reg_hi as its target nonterminal,
and the other has reg_lo as its target nonterminal.
Figure 4.7: Example of permutation instructions (PACK2, PACKLH2, PACKHL2, PACKH2).
For example, consider the four permutation instructions of the TMS320C62x [4] shown in
Fig. 4.7. These permutation instructions can be represented using the rules introduced above. PACKH2
consists of two data transfers: one from the upper part of a source register to the upper part of the
destination register, and the other from the upper part of a source register to the lower part of the
destination register. The former data flow is represented by reg_hi → PERM(reg_hi) and the latter
by reg_lo → PERM(reg_hi). Therefore, the PACKH2 instruction can be represented by the
pair of rules reg_hi → PERM(reg_hi) and reg_lo → PERM(reg_hi).
Moreover, rules for UNPACK, an instruction that moves a value located in the upper or
lower part of a register into a whole register, are adopted. These rules are represented as follows.
reg → UNPACK(reg_lo)
reg → UNPACK(reg_hi)
In addition, rules that indicate no operation, called "NOMOVE", are introduced.
reg → NOMOVE(reg)
reg_lo → NOMOVE(reg_lo)
reg_hi → NOMOVE(reg_hi)
These rules are selected when it is not necessary to move data.
PERM and UNPACK have some cost since actual instructions are executed if they are
selected. However, NOMOVE has no cost since it corresponds to no actual instruction.
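The data transfer rules listed above can be written down as plain data. The following Python sketch is only an illustration: the `Rule` type, the function name, and the unit costs (1 for PERM and UNPACK, 0 for NOMOVE) are assumptions made here, since the text states only that PERM and UNPACK have some cost and NOMOVE has none.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    target: str    # nonterminal produced: "reg", "reg_hi" or "reg_lo"
    terminal: str  # "PERM", "UNPACK" or "NOMOVE"
    source: str    # nonterminal consumed
    cost: int      # assumed unit cost: 1 if an instruction is emitted, else 0

HALVES = ("reg_hi", "reg_lo")

def data_transfer_rules() -> list[Rule]:
    """Enumerate the PERM, UNPACK and NOMOVE rules of Section 4.4.1."""
    rules = []
    # PERM moves a value from any location into the upper or lower half.
    for target in HALVES:
        for source in HALVES + ("reg",):
            rules.append(Rule(target, "PERM", source, 1))
    # UNPACK moves one half of a register into a whole register.
    for source in HALVES:
        rules.append(Rule("reg", "UNPACK", source, 1))
    # NOMOVE leaves the value where it is and emits no instruction.
    for nonterminal in ("reg",) + HALVES:
        rules.append(Rule(nonterminal, "NOMOVE", nonterminal, 0))
    return rules

# PACKH2 as the pair of rules given in the text: both halves of the
# destination are filled from the upper half of a source register.
PACKH2 = (
    Rule("reg_hi", "PERM", "reg_hi", 1),
    Rule("reg_lo", "PERM", "reg_hi", 1),
)
```

Calling `data_transfer_rules()` yields eleven rules: six for PERM, two for UNPACK and three for NOMOVE.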
Figure 4.8: Identification of the register in which source values are located.
4.4.2 Constraints on selection of rules 
Because of the additional DFT nodes and rules, the following constraints have to be considered.
• Node pairing for permutation instructions
PERM, UNPACK and NOMOVE match the DFT nodes for data transfer. These rules
must be selected under the constraints shown below.
- If PERM is selected for $v_i$, another node $v_j$ for which PERM is selected must exist,
and both are executed by an identical PERM instruction.
- If UNPACK is selected for $v_i$, no other node is executed together with $v_i$.
- If NOMOVE is selected for $v_i$, $v_i$ is not paired with other nodes even if the target
nonterminal is reg_hi or reg_lo, because the behavior of NOMOVE does not depend
on the other part of the register. However, when SIMD instructions are executed
successively, the nodes for data transfers between the SIMD instructions must be selected as
NOMOVE and must be paired with each other.
• Identification of the register in which source values are located
When a SIMD instruction is executed, the left arguments have to be located in one identical
register, and the right arguments also have to be located in one identical register. Fig. 4.8 shows
an example of the identification of registers. The results of $v_{il}$ and $v_{jl}$, and those of $v_{ir}$ and $v_{jr}$, must each
be located in an identical register to perform $v_i$ and $v_j$ as a SIMD instruction.
4.4.3 ILP Formulation 
In this section, the ILP formulation for permutation instructions is explained.
• Node pairing for permutation
Boolean auxiliary variables $a_{ij}$ and $b_{ij}$ are defined as follows:
\[ a_{ij} = \begin{cases} 1, & \text{if } v_i \text{ and } v_j \text{ are executed by an identical PERM instruction} \\ 0, & \text{otherwise} \end{cases} \]
\[ b_{ij} = \begin{cases} 1, & \text{if } v_i \text{ and } v_j \text{ stay in an identical register} \\ 0, & \text{otherwise} \end{cases} \]
Let $V_{MOVE}$ denote the set of nodes for data transfers, and let $M_N^{OP}(v)$ denote the subset
of rules in $M_N(v)$ that have $OP$ as the terminal. This constraint is represented as
follows.
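\[ \sum_{R_k \in M_{hi}^{PERM}(v_i)} x_{ik} = \sum_{j : (v_i, v_j) \in P} a_{ij} \tag{4.12} \]
\[ \sum_{R_k \in M_{lo}^{PERM}(v_i)} x_{ik} = \sum_{j : (v_j, v_i) \in P} a_{ji} \tag{4.13} \]
\[ \sum_{R_k \in M_{hi}^{NOMOVE}(v_i)} x_{ik} \ge \sum_{j : (v_i, v_j) \in P} b_{ij} \tag{4.14} \]
\[ \sum_{R_k \in M_{lo}^{NOMOVE}(v_i)} x_{ik} \ge \sum_{j : (v_j, v_i) \in P} b_{ji} \tag{4.15} \]
According to the definitions of $y_{ij}$, $a_{ij}$, and $b_{ij}$, the following constraint is needed.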
(4.16) 
• Identification of the register in which source values are located
Let $v_{il}$ and $v_{ir}$ be the left and right children of $v_i$, and let $v_{jl}$ and $v_{jr}$ be the left and right children of $v_j$.
In order to execute a SIMD instruction for $v_i$ and $v_j$, the results of $v_{il}$ and $v_{jl}$, and those of $v_{ir}$ and
$v_{jr}$, must each be located in one register. When $v_{il}$ and $v_{jl}$ are executed by an identical SIMD
instruction, the results of $v_{il}$ and $v_{jl}$ are stored in one register. Therefore, to execute a
SIMD instruction for $v_i$ and $v_j$, $v_{il}$ and $v_{jl}$, and likewise $v_{ir}$ and $v_{jr}$, must each be executed by one SIMD
instruction. $y_{ij}$ denotes that a SIMD instruction is executed for $v_i$ and $v_j$. This constraint
is represented as follows.
\[ \forall (v_i, v_j) \in P, v_i \in V : y_{ij} \le y_{i_l j_l} \tag{4.17} \]
\[ \forall (v_i, v_j) \in P, v_i \in V : y_{ij} \le y_{i_r j_r} \tag{4.18} \]
• Objective function
The optimization goal is to minimize code size. For the variables $x_{ij}$ and $y_{ij}$ of arithmetic and
logical operations and of load/store, $y_{ij}$ corresponds to a SIMD instruction, and $x_{ij}$ for a
rule which has reg as its target nonterminal corresponds to an instruction. On the other
hand, for the variables $x_{ij}$, $a_{ij}$ and $b_{ij}$ of data transfer operations, $a_{ij}$ corresponds to a
permutation instruction, $x_{ij}$ for UNPACK corresponds to a data transfer instruction, and
$x_{ij}$ and $b_{ij}$ for NOMOVE correspond to no instruction. Let $P_{MOVE}$ denote the set of pairs
of nodes for data transfer. The code size can then be represented as follows.
\[ f = \sum_{(v_i, v_j) \in P - P_{MOVE}} y_{ij} + \sum_{(v_i, v_j) \in P_{MOVE}} a_{ij} \tag{4.19} \]
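As a small illustration of how objective (4.19) is evaluated for a given assignment, the following Python sketch sums the selected SIMD pairs and the selected PERM pairs. The function name, the dictionary encoding of $y_{ij}$ and $a_{ij}$, and the node names in the usage note are hypothetical; an actual implementation would hand the objective to an ILP solver rather than evaluate it by hand.

```python
def code_size(y, a, pairs, move_pairs):
    """Evaluate objective (4.19) for one assignment of the 0/1 variables.

    y and a map a pair (v_i, v_j) to 0 or 1; pairs is the set P and
    move_pairs is the subset P_MOVE of pairs of data transfer nodes.
    Missing entries are treated as 0.
    """
    simd_instructions = sum(y.get(p, 0) for p in pairs if p not in move_pairs)
    perm_instructions = sum(a.get(p, 0) for p in move_pairs)
    return simd_instructions + perm_instructions
```

For instance, with `pairs = {("add0", "add1"), ("mul0", "mul1"), ("dt0", "dt1")}`, `move_pairs = {("dt0", "dt1")}`, both operation pairs selected in `y` and the transfer pair selected in `a`, the function returns 3: two SIMD instructions plus one permutation instruction.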
4.5 Experimental results 
The proposed formulation was implemented using the CoSy compiler development
environment [51] on RedHat Linux 8.0. For evaluation, a DLX-based processor was used, which had the DLX
instruction set without floating-point arithmetic operations but with SIMD instructions such as
ADD2 and MULT2 and several permutation instructions. The ADD2 instruction performs two
additions on 16-bit values, the MULT2 instruction performs two multiplications on 16-bit values, and
the permutation instructions are PACKL, PACKLH, PACKHL and PACKHH. To
compare the quality of the generated code, three compilers were used: (1) a compiler generated by the
compiler generator of ASIP Meister [52], (2) a compiler that applies Leupers's method on top of
compiler (1), and (3) a compiler that applies the proposed method on top of compiler (1).
The programs for evaluation, iir_biquad_one_section, complex_multiply, convolution,
dot_product, fir, matrix and n_real_updates, were selected from the DSPstone benchmark
[47]. The original codes of convolution, dot_product, fir, matrix and n_real_updates were unrolled
so that parallel executions could be extracted easily.
Figure 4.9: The ratio of generated code size (Leupers's method [15] and the proposed method).
Table 4.1 shows the generated code size and the number of execution cycles of each program
compiled by each compiler. Fig. 4.9 shows the ratio of the code size generated by (2) and (3) to
that generated by (1), and Fig. 4.10 shows the ratio of execution cycles of the generated code. Table 4.2
shows the number of DFT nodes, the number of variables and constraints in the ILP, and the CPU
time.
In Fig. 4.9 and Fig. 4.10, comparing Leupers's method with the compiler without SIMD optimization, Leupers's
method was effective only in n_real_updates. However, the proposed method reduced code
size and execution cycles in convolution, dot_product, fir, matrix, and n_real_updates.
Leupers's method can select SIMD instructions only in the case where a sequence of instructions
consists of SIMD instructions only, because it does not consider data transfer.
However, such conditions are not often fulfilled. On the other hand, the proposed method inserts
data transfer instructions when SIMD instructions can be applied by moving values, or when
unpacked data is required. In fact, in convolution, the proposed method selected a permutation
instruction to adapt the location of values for a SIMD multiplication instruction and then selected that instruction.
Figure 4.10: The ratio of execution cycles (Leupers's method [15] and the proposed method).
In Table 4.2, comparing Leupers's method and the proposed method, the proposed method
takes much more time to solve the ILP. This is because the proposed method has a wider solution
space than Leupers's method, and therefore spends more time to reach an
optimum solution. However, the proposed method can select SIMD instructions effectively.
For every program in Table 4.1, the code size and the execution cycles of the proposed method are no larger
than those of Leupers's method.
Table 4.2 shows that the proposed method compiles 6 programs within a minute. However,
the proposed method takes more than 5000 seconds to compile FIR. Because it is proved that
Table 4.1: Generated code size and execution cycles.
                        no SIMD optimization   Leupers's method     proposed method
program                 code    execution      code    execution    code    execution
                        size    cycles         size    cycles       size    cycles
iir_biquad_N_section    132     420            132     420          132     420
complex multiply        126     562            126     562          126     562
convolution             62      784            62      784          54      514
dot product             57      162            57      162          44      118
FIR                     88      828            88      828          67      365
matrix                  137     5268           137     5268         127     4458
n real update           95      1162           53      634          53      634
the ILP belongs to the class of NP-complete problems [53], it is expected that the compilation time for large
programs will be very long. The compilation time depends not only on the size of programs but
also on the characteristics of programs. The characteristics of a program are reflected in the number of variables,
the number of constraints and the solution space of the ILP instance. Comparing
convolution and n real update in Table 4.2, the number of nodes and the number of constraints of n real
update are smaller than those of convolution. However, the compilation time of n real update is longer
than that of convolution. This indicates that the ILP instance of convolution can be solved more
easily than that of n real update. Fig. 4.11 and Fig. 4.12 show the program fragments of convolution and
n real update. In Fig. 4.11 and Fig. 4.12, 4 additions and 4 multiplications can be found in each
program. However, only the multiplications are executable in parallel in Fig. 4.11, while the additions are
also executable in parallel in Fig. 4.12. This means that the constraints in the ILP instance of
convolution are tighter than those of n real update. As a result, the optimum solution of convolution can be
found in a shorter time than that of n real update.
There are some points to be considered to shorten the compilation time. The first point is that
the input DFGs were not modified before the ILP formulation in these experiments. Generally, there
are several DFG representations for a specific program fragment. Processor-independent
optimization techniques such as sharing common subexpressions and redundant expression elimination
Table 4.2: The number of DFT nodes, variables and constraints in ILP and CPU time.
                        Leupers's method                   proposed method
program                 # of    # of   # of    CPU time    # of    # of   # of    CPU time
                        nodes   var.   cons.   [sec]       nodes   var.   cons.   [sec]
iir_biquad_N_section    40      189    190     0.11        69      2304   7974    0.99
complex multiply        16      62     69      0.09        30      776    1789    0.18
convolution             34      149    174     0.09        60      2062   7504    1.99
dot product             18      67     88      0.08        32      704    1522    0.18
fir                     48      305    627     0.17        81      3660   20097   5679.00
matrix                  6       21     25      0.12        10      92     101     3.79
n real update           28      129    137     0.12        51      2166   4557    22.72

    y+=px[i+0]*ph[N-1-i+3];
    y+=px[i+1]*ph[N-1-i+2];
    y+=px[i+2]*ph[N-1-i+1];
    y+=px[i+3]*ph[N-1-i+0];
Figure 4.11: convolution.

    p_d[i+0]=p_c[i+0]+p_a[i+0]*p_b[i+0];
    p_d[i+1]=p_c[i+1]+p_a[i+1]*p_b[i+1];
    p_d[i+2]=p_c[i+2]+p_a[i+2]*p_b[i+2];
    p_d[i+3]=p_c[i+3]+p_a[i+3]*p_b[i+3];
Figure 4.12: n real update.
reduce the number of nodes in DFGs. If the number of nodes in DFGs decreases, the
compilation time can be shortened because the ILP instance becomes smaller. The second
point is that the ILP instance is solved without any modification. Redundant variables and
constraints may be found in the ILP instance derived by the proposed ILP
formulation. If variables or constraints can be removed by some analysis before solving the
instance, the ILP solver may solve the modified problem in a shorter time than
the original instance.
4.6 Summary 
In this chapter, a code selection method for SIMD instructions that considers data transfer has been
proposed. In the proposed method, nodes for data transfer have been added to DAGs, and rules
for data transfer have been introduced. As in Leupers's method, the code selection problem
was formulated as an ILP, and the problem was solved using an ILP solver. Experimental results
show that the proposed method, which uses data transfer instructions to exploit SIMD instructions,
can generate more efficient code than Leupers's method.
Chapter 5 
Efficient Code Generation Algorithm
for Media Instructions
This chapter describes a fast code generation method for media processors. The organization
of this chapter is as follows: the way to find SIMD operations in a high-level language program
is described in Section 5.1. Permutation instruction generation based on MDDs is presented in
Section 5.2. Experimental results are shown in Section 5.3. This chapter is concluded in Section
5.4.
5.1 Generation of SIMD Instructions
The proposed code generation approach mainly consists of two parts. The first part is SIMD
instruction generation, and the second part is permutation instruction generation.
In the first part, using the tree matching in [50] and [15], a data flow graph (DFG) whose nodes
are elements of SIMD operations is constructed. After the DFG construction, the DFG is
divided into data flow trees (DFTs), and then operations are grouped into SIMD instructions. Finally,
a DFG whose nodes are SIMD instructions is constructed. In this section, the grouping of SIMD
operations is explained. Pattern matching, DFG construction and DFT construction are similar
to [1].
Figure 5.1: Operation grouping. (a) Selection of DFT nodes (b) Grouping DFT nodes
5.1.1 Grouping SIMD Operations 
Groups of operations performed by SIMD instructions are determined as follows. First, the leaves
of the DFTs which have the same operation are selected. Then, if the number of selected
nodes is no more than the number of operations that one SIMD instruction can perform, the selected
nodes are grouped as a SIMD instruction. If not, the selected nodes are divided into smaller
groups whose number of elements is no more than the number of operations of a SIMD instruction.
Finally, the nodes in the selected group are removed from the DFTs. This grouping is
repeated until all nodes are removed from the DFTs. In this process, the nodes corresponding to
load and store operations that can be executed by one SIMD instruction are restricted by
their memory addresses. Since misaligned memory access is unavailable or causes a large cycle
penalty, consecutive and aligned memory access operations are grouped as SIMD instructions.
Fig. 5.1 shows an example of operation grouping. The load operations have been removed from
the DFTs as shown in Fig. 5.1(a). The add operations, nodes with a plus operator, are grouped and
added to the DFT in Fig. 5.1(b). Then, the add nodes are removed from Fig. 5.1(a). After
this grouping, the DFT in Fig. 5.1(b) is constructed. This process continues until all nodes are
removed from the DFTs.
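The grouping loop described above can be sketched in Python as follows. This is a simplified illustration, not the implementation used in the experiments: the node encoding, the function name and the use of name order to split large groups are assumptions, and the address and alignment restriction on load and store operations is deliberately left out.

```python
from collections import defaultdict

def group_simd_operations(nodes, width):
    """Group DFT nodes level by level into SIMD candidates.

    nodes maps a node name to (operation, list of child names).
    width is the number of operations one SIMD instruction performs.
    Returns a list of (operation, member names) in grouping order.
    """
    remaining = dict(nodes)
    groups = []
    while remaining:
        # Current leaves: nodes whose children have all been removed.
        leaves = [name for name, (_, children) in remaining.items()
                  if all(child not in remaining for child in children)]
        if not leaves:
            raise ValueError("the input is not a forest of trees")
        by_operation = defaultdict(list)
        for name in sorted(leaves):  # assumption: name order stands in for address order
            by_operation[remaining[name][0]].append(name)
        for operation in sorted(by_operation):
            members = by_operation[operation]
            # Split into groups of at most `width` operations.
            for start in range(0, len(members), width):
                groups.append((operation, members[start:start + width]))
        for name in leaves:
            del remaining[name]
    return groups

example = {
    "ld_a0": ("load", []), "ld_a1": ("load", []),
    "ld_b0": ("load", []), "ld_b1": ("load", []),
    "add0": ("add", ["ld_a0", "ld_b0"]),
    "add1": ("add", ["ld_a1", "ld_b1"]),
}
groups = group_simd_operations(example, width=2)
```

With `width=2`, the four loads are grouped first into two pairs, and the two additions become leaves and are grouped in the next round.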
5.1.2 Ordering SIMD Operations in Registers 
The order of operations in a register is determined as follows. The load and store operations are
uniquely ordered by the memory addresses they access, because the available groups
of memory access operations are limited by memory address and alignment as mentioned in
Section 5.1.1. The order of operations other than load and store is determined by the order of the
load and store operations. For each operation, the position most frequently used by its sources and
destinations is selected. Fig. 5.2 shows an example of operation ordering.
In Fig. 5.2(a), a part of a grouped graph is shown. The leftmost add operation has two sources, a1
and b0, and one destination, d1. In their grouped nodes, a1 is the second element, b0 is
the first and d1 is the second. Therefore, the leftmost add operation in Fig. 5.2(a) is reordered
to the second position in the grouped node as shown in Fig. 5.2(b). Similarly, the second add operation
is reordered to the leftmost position in the grouped node.
Figure 5.2: Operation ordering. (a) graph of grouped operations (b) the reordering result
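The ordering rule can be sketched as a majority vote over the positions of the sources and destinations of each operation. In the Python sketch below, positions are counted from 0, ties are broken toward the lower position, and an operation whose preferred position is already taken falls back to the lowest free one; these three choices and the operand positions of the second add operation are assumptions, since the text does not specify them.

```python
from collections import Counter

def preferred_slot(operand_slots):
    """Return the position used most often by the sources and destinations."""
    counts = Counter(operand_slots)
    top = max(counts.values())
    return min(slot for slot, count in counts.items() if count == top)

def order_group(operations, width):
    """Assign each operation of one group (at most `width` members) a position.

    operations maps an operation name to the positions of its operands
    inside their own grouped nodes.
    """
    free = set(range(width))
    placement = {}
    for name, operand_slots in operations.items():
        slot = preferred_slot(operand_slots)
        if slot not in free:      # assumed fallback when the preferred slot is taken
            slot = min(free)
        free.remove(slot)
        placement[name] = slot
    return placement

# Leftmost add of Fig. 5.2(a): a1 is second (1), b0 is first (0), d1 is second (1).
# The operand positions of the second add are hypothetical.
placement = order_group({"add_left": [1, 0, 1], "add_right": [0, 1, 0]}, width=2)
```

Here `add_left` is placed in position 1 (the second element) and `add_right` in position 0, matching the reordering described for Fig. 5.2(b).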
5.2 Generation of Permutation Instructions 
This section describes how the permutation instructions are generated. Hereafter, the contents
of packed data are called permutations because they are naturally represented by permutations.
The generation method consists of two steps. In the first step, it is examined
whether the target permutation can be generated using the given permutation instructions. The
basic concept of the first step is to generate all permutations from the source permutations using
the available permutation instructions. In the second step, the expression tree representing the
construction process of the target permutation from the input permutations is generated. The tree
construction starts from the root, which is the node corresponding to the target permutation,
and the tree grows from the root toward the leaves by adding nodes representing permutation
instructions. In the permutation instruction generation, the Multi-valued Decision Diagram (MDD) [54] is
utilized to represent and manipulate sets of permutations. Using MDDs, a permutation
operation can be applied not to a single pair of permutations but to a pair of sets of permutations. This
characteristic enables efficient generation of permutations. In the rest of this section, MDDs
are introduced first. Then the permutation instruction generation algorithm is described.
5.2.1 Introduction of MDDs for Representation of a Set of Permutations
In this section, the representation of a set of permutations using a Multi-valued Decision
Diagram (MDD) is introduced.
Consider a register which holds $n$ elements of packed data. Let $S = \{s_1, s_2, \ldots\}$ be the set
of given sub-word data. Let $r \in S^n$ be the permutation representing the content of a register.
Let $R \in 2^{S^n}$ be a set of permutations. When a set of permutations $R$ and a permutation $r$
are given, a function $F_R : S^n \to \{0, 1\}$ is defined as follows:
\[ F_R(r) = \begin{cases} 1, & \text{if } r \in R \\ 0, & \text{if } r \notin R \end{cases} \tag{5.1} \]
According to the definition, the function $F_R$ implicitly represents the set of permutations $R$.
Here, variables $x_1, \ldots, x_n$ whose domain is $S$ are introduced. Let $x_i$ be the $i$th element
of $r$, which is the input of $F_R$. Using $x_i$, equation (5.1) can be expressed as follows:
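\[ F_R(x_1, \ldots, x_n) = \begin{cases} 1, & \text{if } (x_1, \ldots, x_n) \in R \\ 0, & \text{if } (x_1, \ldots, x_n) \notin R \end{cases} \tag{5.2} \]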
In equation (5.2), $F_R$ is defined as a multi-valued input, binary-valued output
function. Such functions can be represented by a Multi-valued Decision Diagram. Fig. 5.3 shows two
MDDs for {abcd} and {abcd, abdc}. In Fig. 5.3(a), only one path exists from the root
to the 1-terminal, through the edges a, b, c, d. On the other hand, in Fig. 5.3(b), two paths
exist from the root to the 1-terminal, through a, b, c, d and a, b, d, c. Considering the sequences of
the symbols labeled on the edges as elements of a set, a set of permutations can be represented by
an MDD. Moreover, some MDD manipulations correspond to operations on sets of
permutations. The logical-or operation on MDDs corresponds to the union operation on sets of
permutations. Similarly, the logical-and operation on MDDs corresponds to the intersection
operation. For example, the MDD shown in Fig. 5.3(b) is constructed by the logical-or of the MDDs
representing {abcd} and {abdc}, which corresponds to the union of {abcd} and {abdc}.
Figure 5.3: MDDs for {abcd} and {abcd, abdc}. (a) {abcd} (b) {abcd, abdc}
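The correspondence between paths and set elements can be illustrated with a small Python sketch. It uses a nested dictionary with one level per position and `True` as the 1-terminal, which is a simplification assumed here: unlike a reduced MDD, it does not share isomorphic subgraphs, and all permutations are assumed to have the same length. The function names are hypothetical.

```python
def make_mdd(perms):
    """Build a decision diagram (as a trie) whose root-to-True paths spell perms."""
    root = {}
    for perm in perms:
        node = root
        for symbol in perm[:-1]:
            node = node.setdefault(symbol, {})
        node[perm[-1]] = True          # True plays the role of the 1-terminal
    return root

def member(mdd, perm):
    """Evaluate F_R(perm): 1 if a path labeled perm reaches the 1-terminal."""
    node = mdd
    for symbol in perm:
        if node is True or symbol not in node:
            return 0
        node = node[symbol]
    return 1 if node is True else 0

def union(a, b):
    """Logical-or of two diagrams, i.e. the union of the two sets."""
    if a is True or b is True:
        return True
    out = dict(a)
    for symbol, child in b.items():
        out[symbol] = union(out[symbol], child) if symbol in out else child
    return out

def intersection(a, b):
    """Logical-and of two diagrams, i.e. the intersection of the two sets."""
    if a is True and b is True:
        return True
    out = {}
    for symbol in a.keys() & b.keys():
        child = intersection(a[symbol], b[symbol])
        if child:                      # drop branches that reach no 1-terminal
            out[symbol] = child
    return out

def paths(mdd, prefix=""):
    """List the set of permutations represented by a diagram."""
    if mdd is True:
        return {prefix}
    result = set()
    for symbol, child in mdd.items():
        result |= paths(child, prefix + symbol)
    return result

both = union(make_mdd(["abcd"]), make_mdd(["abdc"]))
```

The diagram `both` has the two paths of Fig. 5.3(b), and intersecting it with the diagram of {abdc} leaves only the path a, b, d, c.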
5.2.2 Permutation Operation Manipulation on MDDs
Using MDDs, basic operations such as union and intersection can be applied to sets of permutations.
Similar to such basic operations, permutation operations can also be performed on MDDs.
Consider a permutation operation $p$ which takes two permutations $r_1$, $r_2$ and returns a
permutation $r_3$. Let $r_i(j)$ be the $j$th element of a permutation $r_i$. Let $\sigma(k)$ be a function defined
by $\sigma(k) = (i_k, j_k)$, $i_k \in \{1, 2\}$, $j_k \in \{1, \ldots, n\}$ for $k = 1, \ldots, n$. Let $q_\sigma(k)$ be the $j_k$th element
of the $i_k$th input permutation of $p_\sigma$. Given a function $\sigma$, a permutation operation $p_\sigma(r_1, r_2)$ is
defined as follows:
\[ p_\sigma(r_1, r_2) = (q_\sigma(1), \ldots, q_\sigma(n)) = (r_{i_1}(j_1), \ldots, r_{i_n}(j_n)) \tag{5.3} \]
Let $P_\sigma$ be the permutation operation on sets of permutations $R_1$ and $R_2$. $P_\sigma(R_1, R_2)$ is
defined as follows:
\[ P_\sigma(R_1, R_2) = \bigcup_{r_1 \in R_1,\, r_2 \in R_2} \{ p_\sigma(r_1, r_2) \} \tag{5.4} \]
The result of $P_\sigma(R_1, R_2)$ is the set of permutations whose elements are the results of $p_\sigma$ on all
pairs of elements of the input sets.
The direct computation of equation (5.4) is hard when $|R_1|$ and $|R_2|$ are large. However,
using MDDs, permutation operations on sets of permutations can be manipulated efficiently. In
other words, there is no need to execute the permutation operation for each pair. The way to
compute $P(R_1, R_2)$ using MDDs consists of three primitive manipulations.
1. For each $R_i$, make $R_i'$ from $R_i$ by adding all permutations whose elements at the positions
used by the permutation operation are the same as those of some permutation in $R_i$.
This computation on MDDs is simply implemented. Every node which corresponds to an
unused element is replaced with the union of its children as shown in Fig. 5.4.
2. For each $R_i'$, make $R_i''$ by reordering the elements of all permutations in $R_i'$ to match
the order of the output of the permutation operation.
This computation on MDDs is almost the same as the conventional variable ordering
technique for decision diagrams. The difference between this reordering and conventional
variable ordering is that the level of a variable is not changed in this reordering whereas it
is changed in conventional variable ordering. Fig. 5.5 shows the reordering of elements
on MDDs.
3. Finally, $P(R_1, R_2)$ is obtained by computing the intersection of $R_1''$ and $R_2''$. The intersection
operation corresponds to the logical-and operation on MDDs.
Figure 5.4: Adding permutations on MDDs
Figure 5.5: Reordering on MDDs
To explain the permutation operation manipulation, consider the permutation
operation $p$ shown in Fig. 5.6. Assume the input sets of permutations are $R_1$ = {abcd} and
$R_2$ = {dcba}. The elements of $R_1'$ are all permutations matching ab**. As a result, $R_1'$ is
obtained as follows:
$R_1'$ = {abaa, abba, abca, abda,
abab, abbb, abcb, abdb,
abac, abbc, abcc, abdc,
abad, abbd, abcd, abdd}
Similarly, the elements of $R_2'$ are all permutations matching dc**. In the second step of
the permutation operation manipulation, the elements in $R_1'$ and $R_2'$ are reordered according to the
permutation operation. The elements of $R_1''$ and $R_2''$ are all permutations matching a*b* and
*d*c respectively. Finally, the intersection of $R_1''$ and $R_2''$ is computed. It
is the set of permutations matching both a*b* and *d*c. As a result, {adbc} is
obtained for this example.
Figure 5.6: An example permutation instruction
In the example shown in this section, the length of the permutations is 4, and the input sets of
permutations have only one element each. However, it is clear that these are not restrictions, because
the manipulations are independent of such parameters.
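The worked example can be checked with a short Python sketch that computes the result in two ways: directly by equation (5.4), and by intersecting the output patterns contributed by each input. The sketch works on explicit sets rather than MDDs, merges steps 1 and 2 into one filter, and assumes that the operation of Fig. 5.6 is $\sigma = ((1,1), (2,1), (1,2), (2,2))$, which is inferred from the patterns a*b* and *d*c in the text. All names are hypothetical.

```python
from itertools import product

# Assumed reading of Fig. 5.6: output = (r1[1], r2[1], r1[2], r2[2]), 1-based.
SIGMA = ((1, 1), (2, 1), (1, 2), (2, 2))

def apply_perm(sigma, r1, r2):
    """Equation (5.3): build the output from the selected input elements."""
    inputs = {1: r1, 2: r2}
    return "".join(inputs[i][j - 1] for i, j in sigma)

def apply_direct(sigma, set1, set2):
    """Equation (5.4): apply the operation to every pair of inputs."""
    return {apply_perm(sigma, r1, r2) for r1 in set1 for r2 in set2}

def apply_by_patterns(sigma, set1, set2, symbols):
    """Steps 1 to 3 on explicit sets instead of MDDs."""
    n = len(sigma)
    everything = {"".join(t) for t in product(symbols, repeat=n)}
    result = everything
    for which, source in ((1, set1), (2, set2)):
        # Output position k must equal element j of one permutation of this
        # input; positions fed by the other input stay unconstrained.
        used = [(k, j - 1) for k, (i, j) in enumerate(sigma) if i == which]
        compatible = {out for out in everything
                      if any(all(out[k] == r[j] for k, j in used) for r in source)}
        result = result & compatible   # step 3: intersection
    return result

direct = apply_direct(SIGMA, {"abcd"}, {"dcba"})
by_patterns = apply_by_patterns(SIGMA, {"abcd"}, {"dcba"}, "abcd")
```

Both computations give {adbc} for the example. The pattern-based version enumerates all of $S^n$ and is meant only to show why the intersection yields the same set; the benefit described in the text comes from doing these steps on MDDs.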
5.2.3 Permutation Instruction Generation Algorithm 
In this section, the permutation instruction generation algorithm is explained.
The inputs of the algorithm are a set of permutations $R_0$, a required permutation $r_o$ and a
set of permutation operations $P$. The output is an expression tree whose operations are $p \in P$
and whose leaves are $r \in R_0$. The result of evaluating the tree has to be $r_o$. Note that the subscript
of $R$ is used here to distinguish among the variants of the set of permutations generated in this
algorithm, although $R_i$ means the $i$th input of the permutation operation in Section 5.2.2.
The permutation instruction generation algorithm consists of two sub-procedures.
1. Examine whether the required permutation can be generated using the given permutation
operations.
2. Build the expression tree whose intermediate nodes are permutation operations and whose leaves
are input permutations.

CanGeneratePermutation($R_0$, $P$, $r_o$)
1: i ← 0
2: while $r_o \notin R_i$ do
3:   ∀ $P_j \in P$: $R_{i+1,j}$ ← $P_j(R_i, R_i)$
4:   $R_{i+1}$ ← $(\bigcup_j R_{i+1,j}) \cup R_i$
5:   if $R_{i+1} = R_i$ then
6:     return false
7:   end if
8:   i ← i + 1
9: end while
10: $n_{depth}$ ← i
11: return true
Figure 5.7: Testing Target Permutation Generation
The first sub-procedure CanGeneratePermutation is shown in Fig. 5.7. The basic concept
of CanGeneratePermutation is to generate all permutations from the source permutations using
the available permutation operations until $r_o$ is generated. The main process is the while loop in
lines 2 to 9. The variable $i$ is initialized to 0 and incremented in every iteration. $R_i$ holds
all permutations generated in iterations 0, ..., $i$. In lines 3 and 4, $R_{i+1}$ is made from $R_i$
by adding the permutations generated by the available permutation operations. In line 5, $R_{i+1}$ and
$R_i$ are compared. If $R_{i+1}$ is equal to $R_i$, this sub-procedure finishes and returns "false", since
this means that no more permutations can be generated and the required permutation cannot
be generated by the available permutation operations. The while loop is executed repeatedly until
the required permutation is generated or no other permutations can be generated. When this
sub-procedure finishes, the number of iterations is obtained as a constant $n_{depth}$. The constant
$n_{depth}$ and the sets of permutations $R_1, \ldots, R_{n_{depth}}$ generated in this sub-procedure are also
used in the second sub-procedure.
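The fixpoint iteration of Fig. 5.7 can be sketched in Python as follows. The sketch uses explicit Python sets in place of MDDs, so it shows the control flow only and not the efficiency argument; the two operations `TAKE_HIGH` and `TAKE_LOW` on two-element registers are hypothetical examples, not instructions of a particular processor.

```python
def apply_perm(sigma, r1, r2):
    """Equation (5.3) with sigma given as 1-based (input, element) pairs."""
    inputs = {1: r1, 2: r2}
    return "".join(inputs[i][j - 1] for i, j in sigma)

def can_generate_permutation(sources, operations, target):
    """Return (reachable, levels), where levels[i] is the set R_i.

    When the target is reachable, n_depth equals len(levels) - 1.
    """
    levels = [set(sources)]
    while target not in levels[-1]:
        current = levels[-1]
        grown = set(current)
        for sigma in operations:
            # P_j(R_i, R_i): apply the operation to every pair in R_i.
            grown |= {apply_perm(sigma, r1, r2) for r1 in current for r2 in current}
        if grown == current:       # nothing new: the target cannot be generated
            return False, levels
        levels.append(grown)
    return True, levels

# Hypothetical operations on 2-element registers.
TAKE_HIGH = ((1, 1), (2, 1))   # output = (first of r1, first of r2)
TAKE_LOW = ((1, 2), (2, 2))    # output = (second of r1, second of r2)

reachable, levels = can_generate_permutation({"ab", "cd"}, [TAKE_HIGH, TAKE_LOW], "da")
```

For these inputs the target "da" first appears in $R_2$, so the loop stops with $n_{depth} = 2$. A target that contains a symbol absent from the sources makes the sets stop growing, and the function reports that it cannot be generated.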
GetExpressionTree($R_{require}$, $i$)
1: if $R_{require} \cap R_0 \neq \emptyset$ then
2:   return a leaf corresponding to $r \in R_{require} \cap R_0$
3: end if
4: if $R_{require} \cap R_i = \emptyset$ then
5:   return nil
6: end if
7: for all $P_j \in P$ do
8:   ($R^j_{require,l}$, $R^j_{require,r}$) ← $P_j^{-1}(R_{require})$
9:   $T^l_j$ ← GetExpressionTree($R^j_{require,l}$, $i - 1$)
10:  $T^r_j$ ← GetExpressionTree($R^j_{require,r}$, $i - 1$)
11:  if $T^l_j \neq$ nil and $T^r_j \neq$ nil then
12:    $T_j$ ← a tree with $P_j$ as root whose subtrees are $T^l_j$ and $T^r_j$
13:  else
14:    $T_j$ ← nil
15:  end if
16: end for
17: if $\exists j$: $T_j \neq$ nil then
18:   return $T_j$ such that the number of nodes is minimal
19: else
20:   return nil
21: end if
Figure 5.8: Expression Tree Construction

The second sub-procedure GetExpressionTree is shown in Fig. 5.8. The inputs are a set of
permutations $R_{require}$ and an integer $i$. An expression tree representing an expression that
generates one of the elements in $R_{require}$ is returned. The second input $i$ indicates the depth of the tree to
be built. The depth of the obtained tree will be less than or equal to $i$. This sub-procedure constructs
an expression tree recursively. At the start, GetExpressionTree is invoked with $i = n_{depth}$ and
$R_{require} = \{r_o\}$. In lines 1 to 3 of Fig. 5.8, a leaf for a permutation in $R_{require} \cap R_0$ is
returned if $R_{require}$ includes any permutation in $R_0$. In lines 4 to 6, nil is returned
if the condition is satisfied, since the condition indicates that no required permutation is in $R_i$.
In lines 7 to 16, for each permutation operation $P_j$, a tree whose root is $P_j$ is
constructed. $P_j^{-1}$ is the inverse permutation operation. In line 8, $P_j^{-1}$ returns the pair of sets of
permutations ($R^j_{require,l}$, $R^j_{require,r}$) from which $R_{require}$ is produced. In lines 9 and 10,
GetExpressionTree is recursively invoked to build the left and right subtrees, $T^l_j$ and $T^r_j$. In lines
11 to 15, if neither $T^l_j$ nor $T^r_j$ is nil, a tree $T_j$ whose root is $P_j$ and whose subtrees are $T^l_j$
and $T^r_j$ is built. Finally, in lines 17 to 21, the $T_j$ which has minimal cost is returned. If
every $T_j$ is nil, nil is returned.
In GetExpressionTree, the function calls itself recursively 2 · |P| times. Therefore,
GetExpressionTree is called (2 · |P|)^n_depth times in the worst case. However, lines 4 to 6 of
GetExpressionTree check whether the necessary permutations can be generated, and the
construction of a subtree is pruned when they cannot. Since sub expression trees that are not
part of a feasible expression tree are not searched, the computation time of GetExpressionTree
in practice depends on the number of feasible expression trees. This greatly reduces the
computation time needed to search for a desired expression tree. In CanGeneratePermutation, on the
other hand, permutation operations are performed n_depth · |P| times. This is reasonable, since it is
polynomial in both the number of permutation operations and the depth of the tree.
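To give a feeling for the gap between the two bounds, the following sketch evaluates them for |P| = 6, the number of permutation instructions of the target processor used later, and an illustrative depth of 5. The function names are hypothetical and the depth value is an assumption chosen only for the example.

```python
def worst_case_tree_calls(num_ops: int, n_depth: int) -> int:
    """Worst-case bound (2 * |P|) ** n_depth on calls to GetExpressionTree."""
    return (2 * num_ops) ** n_depth

def permutation_op_count(num_ops: int, n_depth: int) -> int:
    """Number of permutation operations n_depth * |P| in CanGeneratePermutation."""
    return n_depth * num_ops

print(worst_case_tree_calls(6, 5))  # 248832
print(permutation_op_count(6, 5))   # 30
```

The first bound grows exponentially with the depth, which is why the pruning in lines 4 to 6 matters in practice.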
This permutation instruction generation algorithm generates a feasible expression tree with
the minimum number of permutation instructions among the feasible expression trees of minimum depth.
In general, the expression tree of the best solution does not necessarily have minimum depth, and
common subexpressions should also be considered. Finding the best solution would therefore require
much more time. With this restriction, the proposed algorithm can find a good solution in reasonable time.
5.3 Experimental Results 
In this section, the proposed method is evaluated. 
5.3.1 Experimental setup 
To confirm the effectiveness of the SIMD and permutation instruction generation algorithm, the
algorithm was implemented. The Media embedded Processor, MeP [55], was used as the target
processor. MeP is a configurable processor core. The base processor, the MeP core, is a 32 bit RISC
architecture and has no SIMD instructions. SIMD instruction capability can be added to the MeP
core by customizing its configuration. There are several configuration options for MeP, such as
embedding user designed logic, adding coprocessors, and so on. The configuration option
used in these experiments was the coprocessor option. A dual-issue coprocessor, which has a 64
bit register file and supports 8-parallel byte, 4-parallel halfword, and 2-parallel word SIMD
instructions, was added to the MeP core. Fig. 5.9 shows the target architecture in these experiments.
There are 5 components in Fig. 5.9: the MeP core, a coprocessor, a local memory, a global
bus interface and a data memory access controller. The MeP core and the coprocessor share
the local memory, and both can directly access the local memory
through a 64 bit data bus. Data transfers among the MeP core, the coprocessor and the local
memory are available. The local memory consists of a data cache, a data RAM, an instruction cache and
an instruction RAM. The MeP core and the coprocessor fetch instructions and processing data
from the local memory. The data memory access controller manages data transfers between the
local memory and external memories. The target processor communicates with other
components through the global bus interface. The coprocessor has 2 SIMD pipeline data paths. Both
data paths support 2/4/8 way arithmetic operations such as addition, subtraction and
multiplication. 6 permutation instructions were implemented in the coprocessor. These permutation
instructions take data elements in the lower or upper part of the register from 2 source registers,
then interleave them and store them into 1 destination register. Fig. 5.10 shows all permutation
instructions of the coprocessor. The additional coprocessor works in parallel with the MeP core.
Therefore, the target processor behaves as a 3-way VLIW processor. If the bit width of the
processing data is 16, the target processor can perform up to 9 operations simultaneously: the
MeP core processes one operation, and each SIMD pipeline in the coprocessor processes 4
operations. By the same token, if the bit width of the processing data is 8, the target processor can
perform up to 17 operations.

Figure 5.9: Target processor architecture
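The interleaving behaviour of the permutation instructions can be sketched at the bit level as follows. This is a minimal illustration under stated assumptions, not the MeP coprocessor specification: a register is modelled as a 64 bit Python integer, lane 0 is assumed to be the least significant element, and the function names are hypothetical. No claim is made about how the variants map to the mnemonics shown in Fig. 5.10.

```python
REGISTER_BITS = 64  # register width stated for the coprocessor

def split_lanes(reg: int, bits: int) -> list:
    """Split a 64 bit register value into lanes of the given width."""
    mask = (1 << bits) - 1
    return [(reg >> (bits * k)) & mask for k in range(REGISTER_BITS // bits)]

def join_lanes(lanes: list, bits: int) -> int:
    """Pack lanes back into one 64 bit register value."""
    mask = (1 << bits) - 1
    reg = 0
    for k, value in enumerate(lanes):
        reg |= (value & mask) << (bits * k)
    return reg

def interleave_halves(src1: int, src2: int, bits: int, upper: bool) -> int:
    """Interleave the lower (or upper) halves of two source registers."""
    a, b = split_lanes(src1, bits), split_lanes(src2, bits)
    half = len(a) // 2
    start = half if upper else 0
    out = []
    for k in range(start, start + half):
        out += [a[k], b[k]]
    return join_lanes(out, bits)

# Two byte-wise registers holding lanes 0..7 and 8..15.
r1 = join_lanes(list(range(0, 8)), 8)
r2 = join_lanes(list(range(8, 16)), 8)
print(split_lanes(interleave_halves(r1, r2, 8, upper=False), 8))
print(split_lanes(interleave_halves(r1, r2, 8, upper=True), 8))
```

With element widths of 8, 16 and 32 bits and a lower and an upper variant each, this model yields six operations, which matches the number of permutation instructions stated in the text.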
In these experiments, 12 programs were used to evaluate the proposed method. To evaluate
the ability to generate data reordering instruction sequences, the following 4 programs, which
perform only data reordering, were used:
• matrix transpose: matrix transposition
• bitreverse: bit reversed reordering
• reverse: reversed reordering
• shuffle: shuffle reordering
The other 8 programs were used to evaluate the entire SIMD instruction utilization technique.
SIMD instructions could not be used to execute those programs without data reordering. The
following 8 programs were selected as benchmarks:
• complex multiply: vector multiplication of complex numbers
• complex update: vector multiplication and addition of complex numbers
• convolution: 1 dimensional convolution
• dot product: inner product of two vectors
• matrix: multiplication of two matrices
• fft: Fast Fourier Transform of 16 complex numbers
• rgbgray: color conversion of an image from RGB to gray scale
• rgbcmyk: color conversion of an image from RGB to CMYK
5 of the 8 programs, complex multiply, complex update, convolution, dot product and matrix, were
selected from the DSPstone benchmark [47]. The other 3 programs, fft, rgbgray and rgbcmyk, were
coded from scratch.

Figure 5.10: Permutation instructions of the target processor. Panels include (b) unpackl.b,
(d) unpackl.h and (f) unpackl.w.
These programs were suitable for confirming the ability to use SIMD and permutation
instructions, because all of them include operations that can be executed in parallel, and data
reordering was necessary to process those operations with SIMD instructions. Each program was
evaluated with data elements of 16 bits and of 8 bits. Accordingly, 4-parallel or 8-parallel SIMD
instructions were used for the 16 bit or 8 bit versions of the programs, respectively.
The proposed method was implemented as a translator that translates a plain C program into a
C program using built-in functions which are mapped to SIMD and permutation instructions.
In our implementation, the compiler analyzed the given C program and unrolled the loops in it.
Then, the proposed method was applied to the unrolled loop body. Finally, a C program
with built-in functions for SIMD and permutation instructions was generated. For these
experiments, an MDD package which performs the operations needed to realize the proposed
method was used. MDDs were represented as shared ROMDDs [56] in the MDD package, and
no special technique to reduce the size of MDDs, such as edge negation, was used. To
evaluate the proposed method, we also implemented the permutation instruction generation method
based on backward and forward trees proposed in [1], and developed another translator with
the method of [1] for comparison. This translator was the same as the translator with the proposed
method except for the permutation instruction generation method. [1] proposes not only a method
to generate permutation instruction sequences but also a method to generate SIMD instructions.
However, we implemented only the permutation instruction generation method of [1], because the
main objective of these experiments is to evaluate methods for generating permutation instruction
sequences.
The SIMD instruction utilization methods were applied to the application programs by the
translators. Then, the output programs of the translators were compiled by the MeP C compiler and
simulated by a cycle accurate instruction set simulator (ISS). The compiler and ISS
provided by Toshiba Corp. were used for compilation and simulation. These experiments were
performed on the Red Hat Enterprise Linux operating system running on an Intel Xeon 2.8 GHz processor
with 2 GB of memory.
In these experiments, the SIMD instruction generation methods were evaluated in terms of
the number of instructions and execution cycles. The number of instructions in the main loops
of the programs was counted for each assembly code generated by the compilers with and without
the SIMD instruction generation methods. The number of execution cycles was measured using the
ISS. These experiments assumed that all processing data were located in the local memory, and that
the results of processing were also stored into the local memory. The number of execution cycles
measured in these experiments was the sum of the cycle counts for loading data from the local
memory, processing the data, and storing data into the local memory.

Figure 5.11: Code length reduction ratio.
The size of a data element is 16 bits; 4 parallel SIMD instructions are used.
5.3.2 Results 
Fig. 5.11 and 5.12 show the reduction ratio in the number of instructions in the main loop of
the assembly code, comparing the code generated by the compiler with the proposed method or the
method of [1] to the code generated without SIMD instructions. Compared to [1], the
proposed method achieved the same or a higher reduction ratio for most of the programs. Both
methods successfully generated permutation instruction sequences for all programs. However,
the permutation instruction sequences generated for the same program differed between the two
methods for almost all programs. For the data reordering programs (matrix transpose, bitreverse,
reverse, shuffle), a high reduction ratio was achieved by SIMD and permutation instructions.
Without SIMD instructions, load and store instructions were generated for each data element.
With SIMD instructions, however, 4 or 8 data elements were loaded by one wide memory access
instruction. The data elements were reordered using permutation instructions, and the reordered
data elements were then stored by one wide memory access instruction. For the other programs, not
only memory accesses but also data processing operations such as addition and multiplication were
mapped to SIMD instructions.

Figure 5.12: Code length reduction ratio.
The size of a data element is 8 bits; 8 parallel SIMD instructions are used.

Figure 5.13: Speedup against code without SIMD instructions.
The size of a data element is 16 bits; 4 parallel SIMD instructions are used.
Fig. 5.13 and 5.14 show the speedup obtained by utilizing SIMD and permutation instructions.
Since the proposed method achieved a higher code reduction ratio than the method of [1], as
shown in Fig. 5.11 and 5.12, the proposed method achieved a higher speedup than [1] for most of
the programs. In the case of the 8 bit version of fft, the speedup of [1] was lower than that of the
proposed method, although the code reduction ratio of [1] was higher. This is
because the instruction level parallelism in the assembly code generated by the proposed method
is higher than that of the method of [1]. The final assembly code generated by the proposed
method became more efficient through the instruction scheduling performed by the MeP C compiler.
For the 16 bit versions of the programs (Fig. 5.13), the speedups ranged from about 1.7 to about
5 times over the code without SIMD and permutation instructions. For the 8 bit versions
(Fig. 5.14), the speedups ranged from about 1.8 to about 8.5 times. The target processor behaves
as a RISC processor when SIMD instructions are not used, but as a 3-way VLIW processor when SIMD
instructions are used. Therefore, instruction level parallelism also contributed to the speedup.
A speedup exceeding the SIMD parallelization factor was obtained for the 8 and 16 bit versions of
matrix transpose, as a result of both data level and instruction level parallelism. Among
the data reordering programs, the 8 bit version of matrix transpose was about 2.0 times faster than
the 16 bit version, and the 8 bit versions of bitreverse, reverse and shuffle were about 1.5 times
faster. Among the programs other than data reordering, the 8 bit versions of 6 of the 8 programs
were about 1.25 times faster than the 16 bit versions. However, for fft and rgbcmyk, the 8 bit
version was slower than the 16 bit version. This is because a large number of permutation
instructions was generated as the number of data elements increased in fft and rgbcmyk. The
run-time overhead caused by the permutation instructions was greater than the execution cycles
saved by the SIMD instructions.

Figure 5.14: Speedup against code without SIMD instructions.
The size of a data element is 8 bits; 8 parallel SIMD instructions are used.
Table 5.1 shows the breakdown of the generated instructions, focusing on the number of permutation
instructions. "4 SIMD" and "8 SIMD" mean that the size of a data element is 16 bits and 8 bits,
respectively. The number of permutation instructions, the total number of generated instructions
and the percentage of permutation instructions are shown for each program. In
4 SIMD, the ratio of permutation instructions ranged from 22 to 85%. The number
of generated permutation instructions was small for shuffle, dot product and matrix.
About half of the instructions were permutation instructions for matrix transpose, bitreverse,
complex multiply and complex update. A high ratio of permutation instructions was
found for reverse, fft, rgbgray and rgbcmyk. A large number of permutation instructions was
generated because complex data permutations were required to
perform those programs using SIMD instructions. For example, rgbgray takes an input vector
composed of three colors, red, green and blue, and computes an output vector whose elements
are given by output[i] = input[3 * i] + (input[3 * i + 1] >> 1) + (input[3 * i + 2] >> 2).
In this case, 6 load instructions, 2 store instructions, 8 SIMD arithmetic instructions and 30
permutation instructions were generated. Fig. 5.15 shows the data flow of the generated code
for this program. As shown in Fig. 5.15, the permutation was very complex. Moreover, it took 5
permutation instructions for one permutation. Although the number of permutation instructions
was large, the total number of instructions was reduced, and writing such a complex permutation
instruction sequence by hand would be too difficult for application programmers. In 8 SIMD, the
ratio of permutation instructions was higher than in 4 SIMD. In some programs, such as
convolution, dot product, matrix and fft, the ratio was much higher than in
4 SIMD. The reasons include the complex permutations as well as unoptimized data
permutation of the intermediate results. The proposed method determines the order of data in
registers based on the inputs and outputs of DFGs. This scheme often generates data permutations
which require many permutation instructions. Better schemes or optimization techniques for
determining the order of data in registers, such as [57], would reduce the number of permutation
instructions.

Table 5.1: Breakdown of generated instructions

                              4 SIMD                          8 SIMD
                      # of insn.       Ratio of       # of insn.       Ratio of
                    # of perm.  Total  perm. [%]    # of perm.  Total  perm. [%]
matrix transpose         32       68      47             24       44      55
bitreverse               32       66      48             32       65      49
reverse                  12       16      75             12       14      86
shuffle                   4       14      29              2        6      33
complex multiply         52       96      54             48       78      62
complex update           64      129      50             58       95      71
convolution               9       23      39             35       40      88
dot product               4       18      22             23       32      72
matrix                    4       17      24             23       31      74
fft                     109      173      63            177      219      81
rgbgray                  30       46      65             24       33      73
rgbcmyk                 118      139      85            117      141      83
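The rgbgray kernel quoted in the text can be written as a scalar reference in a few lines, which also makes the source of the permutations visible. The sketch below is only an illustration. The function names are hypothetical, and the final remark on load and store counts is an observation that is consistent with the numbers in the text, not a statement about how the translator unrolls the loop.

```python
def rgbgray(pixels: list) -> list:
    """Scalar reference: output[i] = R + (G >> 1) + (B >> 2) for interleaved RGB input."""
    n = len(pixels) // 3
    return [
        pixels[3 * i] + (pixels[3 * i + 1] >> 1) + (pixels[3 * i + 2] >> 2)
        for i in range(n)
    ]

def lane_sources(lanes: int, channel: int) -> list:
    """Input indices that one SIMD operand needs for the given color channel."""
    return [3 * i + channel for i in range(lanes)]

print(rgbgray([8, 4, 4, 16, 2, 8]))  # [11, 19]
for channel in range(3):
    # Each 4-lane operand gathers elements with stride 3 from three loaded registers.
    print(channel, lane_sources(4, channel))
```

Because each operand needs elements at stride 3, the contents of consecutively loaded registers must be regrouped by channel before any SIMD arithmetic can be applied. With 4 lanes, 8 output elements correspond to 24 input elements, that is 6 registers of input and 2 registers of output, which agrees with the 6 loads and 2 stores reported above.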
Figure 5.15: Permutation in rgbgray

In these experiments, the quality of the generated code cannot be assessed precisely, because
the optimum solution cannot be obtained due to the high computational complexity of
optimum code generation for media instructions. However, there is potential for improving
the performance on the target processor. The method for determining the order of data in
registers has room for improvement, as mentioned above. Many permutation instructions were
generated by the current method, as shown in Table 5.1. The number of permutation instructions
would be reduced by improving this method. In addition, the quality of the code would be improved by
considering architecture features such as multiple instruction issue and data path pipelining
in code generation. The target processor in these experiments has one RISC core
and 2 SIMD cores, and supports 4 and 8 way SIMD instructions. Ideally, speedups of up to
9/17 times could be achieved when using 4/8 way SIMD instructions. However, the speedups
shown in these experiments were about 2 times on average. This was caused not only by the many
permutation instructions but also by the architecture features that were not considered. Code
generation with instruction scheduling that considers the VLIW and pipelined architecture would
improve the quality of the code. Because the generation of application specific instructions was
the main topic of this study, other code generation and optimization techniques such as instruction
scheduling and register allocation were out of scope. Code generation combined with other code
optimization techniques is future work.
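The ideal figures of 9 and 17 follow from simple counting, as the short sketch below shows. It assumes that one scalar operation and two full-width SIMD operations can be issued together, and that the number of lanes equals the 64 bit register width divided by the element width. The constant and function names are hypothetical.

```python
REGISTER_BITS = 64   # width of the coprocessor registers
SIMD_PIPELINES = 2   # SIMD data paths in the coprocessor
SCALAR_CORES = 1     # the MeP core

def simd_lanes(element_bits: int) -> int:
    """Number of data elements packed into one register."""
    return REGISTER_BITS // element_bits

def peak_operations(element_bits: int) -> int:
    """Upper bound on operations issued simultaneously."""
    return SCALAR_CORES + SIMD_PIPELINES * simd_lanes(element_bits)

print(peak_operations(16))  # 9  (4 way SIMD)
print(peak_operations(8))   # 17 (8 way SIMD)
```

These values are upper bounds only. They ignore permutation instructions, memory accesses and scheduling constraints, which is why the measured speedups are much lower.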
Table 5.2 shows the compilation time of [1] and the proposed method for each program. "4
SIMD" and "8 SIMD" mean that the size of a data element is 16 bits and 8 bits, respectively. In
4 SIMD, the compilation time of [1] was shorter than that of the proposed method for all programs
except shuffle. Both methods compiled most of the programs within 10 seconds. fft and rgbcmyk
took more than 10 seconds to compile, because the necessary data reordering was complex and
appeared several times. In contrast, in 8 SIMD, the compilation time of the proposed method was
shorter than that of [1]. [1] took more than 1000 seconds to compile most of the programs.
With the proposed method, on the other hand, 6 programs each took less than 100 seconds to
compile. fft and rgbcmyk took more than 1000 seconds with the proposed method.
However, the compilation time was about a quarter of that of [1].

Table 5.2: Comparison of compilation time between [1] and the proposed method

                          4 SIMD                 8 SIMD
                    Tree([1])   MDD        Tree([1])   MDD
                     [sec.]    [sec.]       [sec.]    [sec.]
matrix transpose      7.47      8.51         2800      202
bitreverse            7.36      8.44         2700      17.7
reverse               0.65      0.89          409      37.3
shuffle               1.23      1.2           732      0.75
complex multiply      7.89      9.73         3380      271.3
complex update       11.9      14            4040      294
convolution           0.99      1.45         2550      40.0
dot product           0.74      1.04         2220      2.87
matrix                0.72      1.03         2220      2.77
fft                  14.7      18            7410      1100
rgbgray               2.5       3.58         1060      290
rgbcmyk              24.3      34.3          8060      2320
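The statements about the 8 SIMD case can be checked directly against the figures of Table 5.2. The sketch below simply transcribes the 8 SIMD columns of the table into a dictionary and recomputes the counts and ratios. The variable names are hypothetical.

```python
# 8 SIMD compilation times in seconds: (Tree([1]), MDD), transcribed from Table 5.2.
TIMES_8SIMD = {
    "matrix transpose": (2800, 202),
    "bitreverse": (2700, 17.7),
    "reverse": (409, 37.3),
    "shuffle": (732, 0.75),
    "complex multiply": (3380, 271.3),
    "complex update": (4040, 294),
    "convolution": (2550, 40.0),
    "dot product": (2220, 2.87),
    "matrix": (2220, 2.77),
    "fft": (7410, 1100),
    "rgbgray": (1060, 290),
    "rgbcmyk": (8060, 2320),
}

# Programs the MDD-based method compiles in under 100 seconds.
mdd_under_100 = [name for name, (_, mdd) in TIMES_8SIMD.items() if mdd < 100]
# Programs for which the tree-based method needs more than 1000 seconds.
tree_over_1000 = [name for name, (tree, _) in TIMES_8SIMD.items() if tree > 1000]
# Ratio of MDD time to tree time per program.
ratios = {name: mdd / tree for name, (tree, mdd) in TIMES_8SIMD.items()}

print(len(mdd_under_100), len(tree_over_1000))
print(round(ratios["fft"], 2), round(ratios["rgbcmyk"], 2))
```

For the two slowest programs, the recomputed ratios are roughly 0.15 for fft and 0.29 for rgbcmyk.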
Table 5.3 shows the number of permutations (# perm.) in R_i,j and R_i generated in
CanGeneratePermutation() of the proposed method, and the number of nodes of the MDDs (# nodes)
representing R_i,j and R_i. The input set of permutations was R_0 = {abcdefgh}.
CanGeneratePermutation() was run so as to generate as many permutations as possible. All possible
permutations were generated after the main while-loop of CanGeneratePermutation() had been performed
5 times. The numbers of permutations in R_i,j and R_i become large as the value of i increases.
However, the largest MDD was that of R_3, and the size of the MDDs becomes small as the value
of i increases further. For all R_i,j and R_i except R_3, the number of MDD nodes was less than 1000,
even when the number of permutations was more than 8000000. The sets of permutations were
represented efficiently by MDDs.
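Why a very large set can have a tiny decision diagram can be seen with a toy reduced ordered MDD. The sketch below is not the MDD package used in the experiments. It is a minimal builder written for illustration, with one variable per register position, hypothetical function names, and a node count that includes the terminal nodes. Note that 16777216 equals 8^8, the number of ways to fill 8 positions with 8 symbols.

```python
from itertools import product

def build_mdd(tuples, n_vars, domain):
    """Build a reduced ordered MDD for a set of tuples.

    Node ids 0 and 1 are the terminals; internal nodes are stored in `unique`.
    """
    unique = {}  # (level, children) -> node id

    def rec(level, subset):
        if not subset:
            return 0                      # empty set: terminal 0
        if level == n_vars:
            return 1                      # all variables fixed: terminal 1
        children = tuple(
            rec(level + 1, [t for t in subset if t[level] == v]) for v in domain
        )
        if len(set(children)) == 1:
            return children[0]            # redundant node: skip it
        return unique.setdefault((level, children), len(unique) + 2)

    return rec(0, list(tuples)), unique

def count_nodes(root, unique):
    """Count nodes reachable from the root, terminals included."""
    by_id = {node_id: key for key, node_id in unique.items()}
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in by_id:
            stack.extend(by_id[node][1])
    return len(seen)

# A single arrangement of 8 symbols: one internal node per position plus two terminals.
single = build_mdd([tuple("abcdefgh")], 8, "abcdefgh")
print(count_nodes(*single))

# The full set over a small domain collapses to the terminal 1 alone.
full = build_mdd(list(product("abc", repeat=3)), 3, "abc")
print(count_nodes(*full))
```

Under this counting convention the singleton set needs 10 nodes and a full set needs 1 node, which are the same node counts that Table 5.3 reports for the sets with 1 permutation and for the sets with 16777216 permutations.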
5.4 Summary 
In this chapter, a code generation technique for SIMD and permutation instructions was
presented. Utilization of permutation instructions is essential for exploiting SIMD
instructions. In the presented algorithm, the packed data in registers are represented and manipulated
by MDDs. By utilizing MDDs, permutation instructions can be generated efficiently. The
experimental results show that the permutation instruction generation algorithm can generate SIMD and
permutation instructions, reduce the number of instructions, and speed up the execution of
programs.
Table 5.3: Permutation count and MDD node count of sets of permutations
generated by CanGeneratePermutation().
R_0 = {abcdefgh}; the number of permutation instructions, |P|, is 6.

Set      # perm.   # nodes        Set      # perm.    # nodes
R_1,0          1        10        R_4,0     8667135        86
R_1,1          1        10        R_4,1     8667135        79
R_1,2          1        10        R_4,2     8667135        24
R_1,3          1        10        R_4,3     8667135        86
R_1,4          1        10        R_4,4     8667135        79
R_1,5          1        10        R_4,5     8667135        24
R_1            7        33        R_4      14407168       454
R_2,0         36        69        R_5,0    16777216         1
R_2,1         36        57        R_5,1    16777216         1
R_2,2         36        24        R_5,2    16777216         1
R_2,3         36        69        R_5,3    16777216         1
R_2,4         36        57        R_5,4    16777216         1
R_2,5         36        24        R_5,5    16777216         1
R_2          188       205        R_5      16777216         1
R_3,0       7744       569
R_3,1       7744       553
R_3,2       7744        60
R_3,3       7744       616
R_3,4       7744       602
R_3,5       7744        62
R_3        34000      1795
Chapter 6 
Conclusion and Future Work
This chapter concludes this thesis. Then, the future direction of compilers for application
specific instruction-set processors is discussed.
6.1 Conclusion 
Compilation methods for all kinds of processors are crucial to ease the software development
effort. Compilers have to provide suitable programming models for describing applications and
translate high-level programs into low-level assembly code. Emerging processors with new
architecture features demand new compilation methodologies and optimization techniques.
Block-floating-point processors are specialized to perform arithmetic operations in a
block-floating-point manner. In spite of the advantages of block-floating-point arithmetic in
performance and hardware area, it has rarely been employed in embedded
systems because of the difficulty of programming. A challenge in code generation for
block-floating-point processors is to bridge the gap between the programming model of
block-floating-point and the model of general programming languages.
A programming scheme and a code generation method for block-floating-point processors are
presented in this thesis. Compiler intrinsic functions are introduced to describe
block-floating-point operations. Floating-point programs are easily translated into
block-floating-point programs by using the compiler intrinsic functions. Experimental results showed
that the proposed compilation method successfully generates assembly code for block-floating-point
processors. The generated assembly code fulfilled the performance requirements for
block-floating-point processors. The proposed method provides application developers with an easy
way to use block-floating-point arithmetic.
On the other hand, the importance of media processors and their code optimization techniques
has been increasing with the spread of media applications. A challenge in compilation for media
processors is to exploit the data level parallelism in application programs. Traditional parallelizing
techniques for supercomputers, based on vectorization of loops, are not suitable for fully exploiting
the advantages of media instruction sets. Utilizing SIMD instructions together with data
permutation instructions maximizes the performance of media processors. In this thesis, the code
optimization problem for SIMD instructions considering permutation instructions is formulated
mathematically as an Integer Linear Programming problem. It is shown that the optimal
assembly code is obtained by solving the formulated problem. A heuristic code generation method for
SIMD instructions is also proposed in this thesis. This method makes it possible to generate assembly
code with a high degree of SIMD parallelism, up to 8. The proposed method achieved a speedup
ratio of up to about 8.5, a significant performance improvement.
6.2 Future Work 
Future work includes the following items.
6.2.1 Automatic ASIP Design Space Exploration 
Current ASIP design tools provide interfaces for customizing processors more easily than RTL
design tools. However, the progress of semiconductor process technology brings demands
for higher productivity in electronic system design. In the next generation of ASIP design
technology, ASIPs should be customized automatically rather than manually. Compiler
technologies such as processor dependent and independent program transformation and
optimization, and retargetable code generation and optimization, are indispensable for automating ASIP
design.
6.2.2 Compilation Techniques for Low Power
As concern about environmental issues, including global warming, grows, electronic
systems are required to be ecological. To meet the demand for low power in embedded systems,
both hardware and software techniques to reduce power consumption are crucial. Compilation
technologies for low power, such as data localization in the memory hierarchy, software controlled
hardware activation, and low energy usage of registers, have been studied. Many low power
techniques for compilers exist; however, the relationships among different low power
techniques and the integration of several techniques have not been studied well. The mutual relation
among low power techniques is considered to be future work.
6.2.3 Compilation Techniques for Multi Processor SoC
The performance requirements for embedded systems are expected to become higher than those of
current embedded systems. Current embedded systems have up to about 10 processor cores. However, more
and more processor cores, up to 100 or 1000, will be integrated in the next generation of
embedded systems. Known compiler technologies for supercomputing, which target
computing systems consisting of up to 1000 or 10000 processing elements, may be applicable to
the next generation of multi processor embedded systems. However, the characteristics of
multi processor embedded systems are not well known. Compiler technology for multi processor
embedded systems will be a hot topic in future compiler research.
Bibliography 
[1] A. Kudriavtsev and P. Kogge, "Generation of Permutations for SIMD Processors," Proc.
of the 2005 ACM SIGPLAN/SIGBED Conference on Languages, Compilers, and Tools
for Embedded Systems, pp.147-156, Jun. 2005.
[2] Texas Instruments, TMS320C1x User's Guide, 1991.
[3] Texas Instruments, TMS320C20x User's Guide, 1999.
[4] Texas Instruments, TMS320C64x/C64x+ DSP CPU and Instruction Set Reference Guide,
2007.
[5] Philips Semiconductors, PNX1300 Series Media Processors - Data Book, 2002.
[6] J.V. Praet, D. Lanneer, W. Geurts, and G. Goossens, "Processor Modeling and Code
Selection for Retargetable Compilation," ACM Trans. on Design Automation of Electronic
Systems, vol.6, no.3, pp.277-307, 2001.
[7] G. Goossens, D. Lanneer, W. Geurts, and J.V. Praet, "Design of ASIPs in Multi-Processor
SoCs using the Chess/Checkers Retargetable Tool Suite," Proc. 2006 International
Symposium on System-on-Chip, pp.1-4, 2006.
[8] CoWare, "Processor Designer." http://www.coware.com/products/processordesigner.php,
2007.
[9] A. Hoffmann, H. Meyr, and R. Leupers, Architecture Exploration for Embedded
Processors with LISA, Kluwer Academic Publishers, 2003.
[10] R.E. Gonzalez, "Xtensa: a configurable and extensible processor," IEEE Micro, vol.20,
no.2, pp.60-70, 2000.
[11] N. Cheung, J. Henkel, and S. Parameswaran, "Rapid Configuration and Instruction
Selection for an ASIP: A Case Study," DATE '03: Proc. of the Conference on Design,
Automation and Test in Europe, pp.802-807, 2003.
[12] P. Marwedel and G. Goossens, Code Generation for Embedded Processors, Kluwer
Academic Publishers, 1995.
[13] C. Liem, Retargetable Compilers for Embedded Core Processors, Kluwer Academic
Publishers, 1997.
[14] R. Leupers, Retargetable Code Generation for Digital Signal Processors, Kluwer
Academic Publishers, 1997.
[15] R. Leupers, Code Optimization Techniques for Embedded Processors, Kluwer Academic
Publishers, 2000.
[16] T. Aamodt and P. Chow, "Embedded ISA Support for Enhanced Floating-Point to
Fixed-Point ANSI C Compilation," 3rd International Conference on Compilers, Architecture,
and Synthesis for Embedded Systems (CASES), pp.128-137, November 2000.
[17] K.I. Kum, J. Kang, and W. Sung, "A floating-point to integer C converter with shift
reduction for fixed-point digital signal processors," ICASSP '99: Proc. of the IEEE
International Conference on Acoustics, Speech, and Signal Processing, 1999, pp.2163-2166,
1999.
[18] D. Menard, D. Chillet, and O. Sentieys, "Floating-to-fixed-point conversion for digital
signal processors," EURASIP J. Appl. Signal Process., vol.2006, no.1, pp.77-77, uary.
[19] R.J. Fisher, General-Purpose SIMD within a Register: Parallel Processing on Consumer
Microprocessors, Ph.D. thesis, Purdue University, 2003.
[20] S. Kyo, S. Okazaki, and I. Kuroda, "An Extended C Language and a SIMD Compiler for
Efficient Implementation of Image Filters on Media Extended Micro-Processor," Proc. of
Acivs 2003 (Advanced Concepts for Intelligent Vision Systems), pp.234-241, 2003.
[21] "Embedded C." http://www.embedded-c.org, 2007.
[22] J. Xiong, J. Johnson, R.W. Johnson, and D. Padua, "SPL: A Language and Compiler for
DSP Algorithms," Programming Languages Design and Implementation (PLDI),
pp.298-308, 2001.
[23] Texas Instruments, TMS320C64x DSP Library Programmer's Reference, 2003.
[24] "Intel® Math Kernel Library."
http://www.intel.com/cd/software/products/asmo-na/eng/307757.htm, 2007.
[25] M.S. Lam, R. Sethi, J.D. Ullman, and A.V. Aho, Compilers: Principles, Techniques, and
Tools, Addison-Wesley, 2006.
[26] K. Kennedy and J.R. Allen, Optimizing Compilers for Modern Architectures: a
dependence-based approach, Morgan Kaufmann Publishers Inc., 2002.
[27] S. Liao, S. Devadas, K. Keutzer, and S. Tjiang, "Instruction Selection using Binate
Covering for Code Size Optimization," ICCAD '95: Proc. of the 1995 IEEE/ACM International
Conference on Computer-Aided Design, pp.393-399, 1995.
[28] G. Araujo and S. Malik, "Code Generation for Fixed-point DSPs," ACM Trans. on Design
Automation of Electronic Systems, vol.3, no.2, pp.136-161, 1998.
[29] C.H. Gebotys, "An Efficient Model for DSP Code Generation: Performance, Code Size,
Estimated Energy," ISSS '97: Proc. of the 10th International Symposium on System
Synthesis, pp.41-47, 1997.
[30] R. Leupers and P. Marwedel, "Time-Constrained Code Compaction for DSPs," ISSS '95:
Proc. of the 8th International Symposium on System Synthesis, pp.54-59, 1995.
[31] K. Raley and P. Bauer, "Implementation Options for Block Floating Point Digital Fil-
ters," ICASSP '97: Proc. of the IEEE International Conference on Acoustics, Speech, 
and Signal Processing, 1997., pp.2197~2200, 1997. 
[32] A. Mitra and M. Chakraborty, "The NLMS Algorithm in Block Floating Point Format,"
IEEE Signal Processing Letters, pp.301-304, 2004.
[33] S. Kobayashi and G. Fettweis, "A Hierarchical Block-Floating-Point Arithmetic," Journal
of VLSI Signal Processing, vol.24, no.1, pp.19-30, 2000.
[34] S. Kobayashi and G. Fettweis, "A New Approach for Block-floating-point Arithmetic,"
ICASSP '99: Proc. of the IEEE International Conference on Acoustics, Speech, and
Signal Processing, 1999, pp.2009-2012, 1999.
[35] S. Kobayashi, I. Kozuka, and T. Kino, "Rapid Application Software Development on
A Block-Floating-Point DSP," Proc. 2003 International Signal Processing Conference,
2003.
[36] D. Elam and C. Iovescu, A Block Floating Point Implementation for an N-Point FFT on
the TMS320C55x DSP, Texas Instruments, 2003.
[37] A.J.C. Bik, M. Girkar, P.M. Grey, and X. Tian, "Automatic Intra-Register Vectorization
for the Intel® Architecture," International Journal of Parallel Programming, vol.30, no.2,
pp.65-98, Apr. 2002.
[38] S. Larsen and S. Amarasinghe, "Exploiting Superword Level Parallelism with Multime-
dia Instruction Sets," Proc. of the Conference on Programming Language Design and 
Implementation, pp.145-156, Jun. 2000. 
[39] A.E. Eichenberger, P. Wu, and K. O'Brien, "Vectorization for SIMD Architectures with
Alignment Constraints," Proc. of the ACM SIGPLAN 2004 Conference on Programming
Language Design and Implementation, pp.82-93, Jun. 2004.
[40] P. Wu, A.E. Eichenberger, and A. Wang, "Efficient SIMD Code Generation for Runtime
Alignment and Length Conversion," CGO '05: Proc. of the International Symposium on
Code Generation and Optimization, pp.153-164, 2005.
[41] P. Wu, A.E. Eichenberger, A. Wang, and P. Zhao, "An Integrated Simdization Framework
Using Virtual Vectors," ICS '05: Proc. of the 19th Annual International Conference on
Supercomputing, pp.169-178, 2005.
[42] D. Nuzman, I. Rosen, and A. Zaks, "Auto-vectorization of Interleaved Data for SIMD,"
PLDI '06: Proc. of the 2006 ACM SIGPLAN Conference on Programming Language
Design and Implementation, pp.132-143, 2006.
[43] S. Larsen, R. Rabbah, and S. Amarasinghe, "Exploiting Vector Parallelism in Software
Pipelined Loops," MICRO 38: Proc. of the 38th annual IEEE/ACM International
Symposium on Microarchitecture, pp.119-129, 2005.
[44] M. Imai, "ASIP Meister: A Configurable Processor Core Development System," Proc. ITI
3rd International Conference on Information & Communications Technology (ICICT),
2005.
[45] P.M. Sailer and D.R. Kaeli, The DLX Instruction Set Architecture Handbook, Morgan
Kaufmann Publishers, Inc., 1996.
[46] ANSI/IEEE Standard 754, IEEE Standard for Binary Floating Point Arithmetic, 1985.
[47] V. Zivojnovic, J. Martinez, C. Schläger, and H. Meyr, "DSPstone: A DSP-Oriented
Benchmarking Methodology," International Conference on Signal Processing Applications and
Technology, Oct. 1994.
[48] Tensilica, "Xtensa Processor Floating Point Unit." http://www.tensilica.com/products/x7_floating_point.htm,
2007.
[49] GB3 Digital Systems, "ARM7 Floating-Point Co-Processor," 2005. 
[50] A.V. Aho, M. Ganapathi, and S.W.K. Tjiang, "Code Generation Using Tree Matching and
Dynamic Programming," ACM Trans. on Programming Languages and Systems, vol.11,
no.4, pp.491-516, Oct. 1989.
[51] ACE, "CoSy compiler development system." http://www.ace.nl/compiler/cosy.html,
2007.
[52] S. Kobayashi, K. Mita, Y. Takeuchi, and M. Imai, "A Compiler Generation Method for
HW/SW Codesign Based on Configurable Processors," IEICE Trans. on Fundamentals of
Electronics, Communications and Computer Sciences, vol.E85-A, no.12, pp.2586-2595,
2002.
[53] M.R. Garey and D.S. Johnson, Computers and Intractability: A Guide to the Theory of
NP-Completeness, W.H. Freeman & Co., 1979.
[54] A. Srinivasan, T. Kam, S. Malik, and R.K. Brayton, "Algorithms for Discrete
Function Manipulation," Proceedings of the IEEE International Conference on
Computer-Aided Design, pp.92-95, Nov. 1990.
[55] T. Miyamori, J. Tanabe, Y. Taniguchi, K. Furukawa, T. Kozakaya, H. Nakai, Y. Miyamoto,
K. Maeda, and M. Matsui, "Development of Image Recognition Processor Based on
Configurable Processor," Journal of Robotics and Mechatronics, vol.17, no.4, pp.437-446,
2005.
[56] D.M. Miller and R. Drechsler, "Implementing a Multiple-Valued Decision Diagram
Package," International Symposium on Multi-Valued Logic, pp.52-57, May 1998.
[57] G. Ren, P. Wu, and D. Padua, "Optimizing data permutations for simd devices," PLDI 
'06: Proc. of the 2006 ACM SIGPLAN conference on Programming language design and 
implementation, New York, NY, USA, pp.118-131, ACM Press, 2006. 

