Efficient software development for microprocessor based embedded system. by Tang, Tze Yeung Eric. & Chinese University of Hong Kong Graduate School. Division of Computer Science and Engineering.
Efficient Software Development for Microprocessor 
based Embedded System 
Tang Tze Yeung Eric 
A Thesis Submitted in Partial Fulfillment 
of the Requirements for the Degree of 
Master of Philosophy 
in 
Department of Computer Science & Engineering 
© The Chinese University of Hong Kong 
July 2003
The Chinese University of Hong Kong holds the copyright of this thesis. Any 
person(s) intending to use a part or the whole of the materials in this thesis in a 
proposed publication must seek copyright release from the Dean of the Graduate 
School. 



















The current generation of embedded microprocessors provides a flexible, high
performance and low power consumption platform for the development of embedded
system based mobile devices. This platform can reduce the development cost and the
time to market. The deployment of such embedded systems makes the development
process more software-centric. In this thesis, we propose a methodology for
developing more efficient software for embedded systems with respect to the
merits and faults of the microprocessors used in them. The
methodology covers three different strategies to optimize the software development
process.
In source code optimization, we have developed rules for writing source
codes in order to enhance the overall performance. Some of the rules are applicable
before compilation. We have also studied ways to further optimize the
performance of the codes generated by a sophisticated optimizing compiler.
The float-to-fixed optimization is crucial for a microprocessor without a floating
point unit. We can improve the speed significantly by replacing all the floating point
operations with fixed-point operations.
The third and last strategy is the domain specific optimization. In most cases, 
we can replace the original implementations with more efficient implementations by 
taking advantage of the domain specific requirements of the applications. 
In order to show the effectiveness of this methodology, we have conducted
several case studies for each of the optimization strategies. We show that an
impressive improvement in speed can be gained using the methodology. In addition,
we discuss ways to evaluate the inaccuracies that may be introduced when the
float-to-fixed and domain specific optimizations are not carefully used. In short,
our contribution is a generalized, systematic methodology for creating efficient
software for mobile devices.
ACKNOWLEDGMENT 
I want to express my gratitude to all the people who have helped and supported me
over the years. In particular, this thesis owes its existence to my supervisor, Dr. Y. S.
Moon. He has enriched my knowledge in computer science and engineering. Under
his patient guidance, I was able to overcome many difficulties in this research, as
well as to improve my writing and presentation skills. Through his close connection
with the industry, I also gained experience with the industry's most current
technology, which tremendously benefits both my research and my future career
development. I express again my deep appreciation for his continuous support.
I would like to thank Prof. David Y. L. Wu and Prof. K.H. Lee for their patience
and intelligent comments. The discussions with my colleagues also
inspired and helped me a lot, especially Mr. Chan Ka Cheong, Mr. Cheng Po Sum,
and Mr. Fong Kut Fai in the embedded system and smartcard laboratories. Without
their enthusiastic help, I could not have finished all the experiments and
demonstrations on time. I am also indebted to my girlfriend, Tany Kwee, for her
love and consideration.
Last but not least, thanks as always to my family for their unconditional
support. I specially dedicate this thesis to my father in heaven.
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGMENT
1 INTRODUCTION
1.1 EMBEDDED SYSTEM
1.2 EMBEDDED PROCESSOR
1.3 EMBEDDED SYSTEM DESIGN
1.3.1 Current Embedded System Design Challenges
1.3.2 Embedded System Design Trend
1.4 EFFICIENT SOFTWARE DEVELOPMENT FOR MICROPROCESSOR
1.4.1 Efficient Software Development Methodology
1.5 THESIS ORGANIZATION
2 SOURCE CODE OPTIMIZATION
2.1 SOURCE CODE OPTIMIZATION STRATEGY
2.2 SOURCE CODE TRANSFORMATIONS
2.2.1 Strength Reduction
2.2.2 Function Inlining
2.2.3 Table Lookup
2.2.4 Loop Transformations
2.2.5 Software Pipelining
2.2.6 Register Allocation
2.3 CASE STUDY: SOURCE CODE OPTIMIZATION ON THE STRONGARM (SA1110) PLATFORM
2.3.1 StrongARM architecture
2.3.2 StrongARM pipeline hazard illustration
2.3.3 Source Code Optimization on StrongARM
2.3.4 Instruction Set Optimization of StrongARM
2.4 CONCLUSION
3 FLOAT-TO-FIXED OPTIMIZATION
3.1 INTRODUCTION TO FIXED-POINT
3.1.1 Fixed-point representation
3.1.2 Fixed-point implementation
3.1.3 Mathematical functions implementation
3.2 CASE STUDY: FINGERPRINT MINUTIAE EXTRACTION ALGORITHMS ON THE STRONGARM PLATFORM
3.2.1 Fingerprint Verification Overview
3.2.2 Fixed-point Implementation of Fingerprint Minutiae Extraction Algorithm
3.2.3 Experimental Results
3.3 CONCLUSION
4 DOMAIN SPECIFIC OPTIMIZATION
4.1 CASE STUDY: FONT RASTERIZATION ON THE STRONGARM PLATFORM
4.1.1 Outline Font
4.1.2 Font Rasterization
4.1.3 Experiments
4.2 CONCLUSION




LIST OF TABLES
Table 2.1 Common strength reduction examples
Table 2.1 Results of the optimized codes
Table 2.2 SHA profiling result
Table 2.3 SHA Source codes optimization results
Table 2.4 Different constant values comparison
Table 2.5 Different immediate values comparison
Table 3.1 Floating point and Fixed-point speed comparison
Table 3.2 Fixed-point arithmetic
Table 3.3 Speed Comparison of MACRO and Function Call
Table 3.4 Input Signs and Ranges relationship of atan2
Table 3.5 Speed comparison of fixed-point and floating point implementations
Table 4.1 Speed comparison of Method 1 (Math.) and Method 2 (Approx.)
LIST OF FIGURES
Figure 2.1 StrongARM: Five stages pipelining demonstration
Figure 2.2 Data forwarding path
Figure 2.3 Data hazard illustration
Figure 2.4 Control hazard illustration
Figure 2.5 Original source code and assembly code generated
Figure 2.6 Introduction of the temporary variable
Figure 2.7 Software pipelining example
Figure 2.7 Code generated for expression: 113*a + 40
Figure 3.1 Arc Tangent Function Plot
Figure 3.2 Bifurcation and Termination
Figure 3.3 Circular mask with region Ri and Rn
Figure 3.4 Ridge Line Tracing Illustration
Figure 3.5 FAR comparison between float and fixed-point version
Figure 3.6 FRR comparison between float and fixed-point version
Figure 3.7 Different fixed-point notations comparison
Figure 4.1 A TrueType font character outline and its control points
Figure 4.2 Quadratic Bezier curve with three control points (A, B and C)
Figure 4.3 Converting outline font character to bitmap data
Figure 4.4 Approximation of quadratic Bezier curve
Figure 4.5 Performance Test on PDA (SA1110 206 MHz). Performance Boost: 61.4%
Figure 4.6 Intersection points calculated from Method 1 (Maths.) & Method 2 (Approx.)
Chapter 1 Introduction 1 
CHAPTER 1 
1 Introduction 
The popularity of embedded devices grows every year because of the diversity of 
embedded applications, ranging from a simple calculator to an advanced Personal 
Digital Assistant (PDA). Embedded system design is very complicated because of 
the need to consider both hardware and software. Moreover, factors like production 
cost, performance and time-to-market constraints are crucial too. To begin our
discussion, we give an overview of embedded system architecture and of the
trends in embedded system design.
1.1 Embedded System 
An embedded system is any electronic system with a built-in processor dedicated
to one application or to multiple applications. Typical examples in our daily life
include mobile phones, calculators, watches, refrigerators and military robots. The
input and output units themselves are usually built in too. In most cases, the
software in an embedded system is not supposed to be modified by users because of
its specific purpose. Therefore, an embedded system is a combination of a processor,
I/O devices, memories and preinstalled software.
1.2 Embedded Processor 
There are various kinds of embedded processors designed for different
applications. We give a brief survey here [Leupers 2000] [Restle 2000]:
• Microcontroller 
It is mostly used in control-oriented applications. Since data processing is
usually not an essential part of its functions, a microcontroller is not fast.
CISC architecture is widely adopted in microcontrollers because its
high code density fits into the small memory footprint commonly found
in this kind of embedded system.
• Microprocessor (μP)
A microprocessor is widely used in embedded systems that are required to
handle multiple tasks, such as interfacing with I/O devices and running
different kinds of processing software. The instruction set architecture (ISA)
of these processors is, therefore, designed to fit different kinds of
applications. In most cases, a RISC architecture is adopted in this kind of
embedded processor because of its high performance and low power consumption.
Typical examples are the ARM [ARM 2003] and MIPS [MIPS 2003] processors.
• Digital Signal Processor 
A Digital Signal Processor (DSP) is a special kind of embedded processor
designed for digital signal processing applications such as Finite Impulse
Response (FIR) filters and the Fast Fourier Transform (FFT). Because of
their specific tasks, the instruction set and internal architecture of a DSP are
tuned to support high performance, repetitive, numerically intensive tasks
[Berkeley 2000]. Depending on its cost and functions, a DSP supports either
floating point or fixed-point arithmetic.
• Application Specific Integrated Circuit (ASIC) 
It is a chip specially designed for a particular task. An ASIC is produced in 
large quantities and is not programmable. Examples include the MPEG-3 
Decoder ASIC and Video Compression Chip ASIC. 
• Field Programmable Gate Array (FPGA) 
A Field Programmable Gate Array is a programmable chip. Using an FPGA, an
embedded system designer can use a Hardware Description Language (HDL)
to rapidly prototype an integrated circuit for an application.
1.3 Embedded System Design 
Embedded system design needs to satisfy many different constraints, which makes
the design process more complicated and difficult. We first discuss the
embedded design challenges and then the current trends in embedded
system design.
1.3.1 Current Embedded System Design Challenges 
[Wolf 2001] [Ferrari 1999] [Koopman 1996] 
Small in size, low weight 
More and more embedded systems, such as mobile phones and Personal 
Digital Assistants (PDA), are designed to be portable. 
Low power consumption 
Most portable, battery-operated embedded systems are expected to operate
for long hours. It is critical that systems such as watches and mobile
phones are power efficient.
High performance
Today, multimedia applications are common in embedded systems like
mobile phones, digital cameras and PDAs. Such applications, like real
time encoding and decoding of MPEG files, require extensive CPU power.
Network 
In today's networked world, most consumer-oriented embedded systems 
need to be network-connected. They may connect directly to the Internet or 
to other local networks like Bluetooth, etc. Therefore, many high-end 
embedded processors have built-in networking capabilities. 
Development Cost & Time 
The development cost and time depend on the complexity of the embedded 
applications and choice of hardware / software platforms. The complexity is 
proportional to the functionalities and the hardware design issues. The 
development cost includes the costs of development tools and the manpower 
involved. By selecting proper platforms and tools, development time can be 
minimized to meet the tight time-to-market criteria. 
Production Cost 
Production cost depends on the cost of the hardware components and the 
nonrecurring engineering (NRE). NRE cost refers to the development and 
design costs that are not re-useable. 
1.3.2 Embedded System Design Trend 
The production cost, the development cost and the time-to-market are the main
concerns in designing an embedded system nowadays, because the other constraints
described above can be addressed by advances in embedded processor
technology. In order to reduce the production cost, we have to minimize the
nonrecurring engineering (NRE) cost, which means increasing the reusability of
the hardware developed. Similarly, if we can re-use previously designed hardware
and software, the development cost and the time-to-market can be reduced too.
Hardware/Software Co-Design 
Hardware/software co-design is often used for designing embedded systems. It
uses various Computer Aided Design (CAD) tools (e.g. commercial simulators) for
the cooperative design of the hardware and software in an embedded system [De
Micheli 1997]. In such a design, programmers can test their software with system
simulation tools prepared by the hardware designers, which speeds up the whole
embedded system design process. As the hardware and software designs are
done concurrently, the development time and cost can be reduced. This
methodology helps us to find the optimal hardware and software combination to
achieve the desired performance.
Platform-based Design 
The main idea of the platform-based design methodology is to design a common
architecture that can support a variety of applications, rather than designing the
optimal hardware and software configuration for one particular embedded application.
If the chosen hardware platform is well designed for multiple applications, more
embedded systems can share the same hardware platform and the hardware
production volume increases. In this way, the production cost can be lowered [Ferrari
1999]. Moreover, design re-use can reduce the time-to-market and the
development effort. Besides the re-use of the hardware platform, the re-use of software
is also an essential part of platform-based design [Keutzer 2000]. Hence, we also
need an abstract layer that makes the software portable and enhances its reusability
regardless of the underlying hardware components. This abstract layer is called the
software platform [Ferrari 1999].
The abstract layer can be achieved by using the following:
• Embedded Operating System [Martin 2000]
The operating system provides system calls and device drivers for the software
to access hardware resources such as I/O devices. It also provides a
network communication sub-system to encapsulate the low-layer network
connectivity. Consequently, it eases software development and
enhances software portability and reusability.
• High Level Language
The software should be written in a high level language so that it can be
easily ported to run on different hardware components. For instance, C
is one of the high level languages commonly used in
embedded systems.
By combining both the hardware and software platforms, most designs can be
re-used so as to greatly reduce the cost of the embedded system design. We believe
that the platform-based design strategy will dominate embedded system design in the
near future.
Microprocessor based Embedded System Design 
In platform-based design, the hardware platform should consist of programmable
cores, an input-output (I/O) sub-system and memories in order to support a variety of
applications [Martin 2000]. The programmable cores can be DSPs, FPGAs and
microprocessors. We propose that the hardware platform must include a
microprocessor, rather than relying heavily on DSPs and FPGAs, for the
reasons described below.
• High Performance, Low Power Consumption
Many microprocessors now achieve high performance with low power consumption,
so there is a push to use high performance microprocessors in
embedded systems [Harvey 1995]. According to Moore's law, which
predicts that the number of transistors per chip doubles every year [Wolf 1994],
microprocessors with even higher performance and lower power consumption will be
produced in the near future. These microprocessors can support many
computation intensive tasks like MPEG4 video encoding and decoding [Prasad
2002]. To be comparable with DSPs, some microprocessors, like
the ARM9, have specially designed hardware to speed up digital signal
processing applications [Francis 2001]. The Harvard architecture, which is often
used in DSPs to separate the data and program memories, is used in this
kind of microprocessor so that data can be read from a separate data cache to
enhance the overall performance. Moreover, a faster multiplier and a barrel shifter
are also available for faster execution. An example is the multiply and
accumulate (MLA) instruction that is used very often in digital filtering.
Microprocessors are also capable of implementing peripherals that are
usually implemented with hardware logic or a DSP because of their real time
constraints. For example, a microprocessor can be used to implement a complete
embedded system with UART, keypad controller and LCD controller together
efficiently, without the help of hardwired logic [Lioupis 2001].
• Embedded Operating System Support
The embedded operating system is a very important layer that encapsulates the
underlying hardware components to ease software development and enhance
reusability. A wide range of embedded operating systems support
microprocessors, for instance VxWorks RTOS, QNX, eCos, embedded Linux
[Lennon 2001] and Microsoft Windows CE [Santo 2001]. Many microprocessors
also provide specialized instructions to facilitate some operating system
mechanisms, such as memory management, synchronization and protection [Furber
2000]. On the other hand, there is only a small amount of operating system
support for DSPs, due to their slow context switching and the absence of
sophisticated operating system mechanisms at the instruction level [Restle
2000].
• High Level Language Support and Efficient Code Generation
There are various kinds of high level language support for microprocessors.
Translators for languages like ANSI C, C++ and Java are widely available. With
such support, efficient code generation that satisfies memory, power and
real-time constraints is becoming an important topic in embedded system
research. Current compilers for DSPs produce less efficient code, in terms
of code size and performance, than hand optimized code.
Due to the special architecture of DSPs, high level language
compilation is not well supported on them [Leupers 2000]. Because of this
limitation, many developers still need to use assembly language to
develop efficient software for DSPs. Such an approach decreases
software portability and consumes more development time.
The above advantages show that microprocessors are well suited to
complex embedded applications. Many existing platform-based embedded
systems are equipped with microprocessors. Two examples are the digital
video platform equipped with a 32-bit MIPS processor from Philips Semiconductors
[Philips 2003] and the personal Internet client platform equipped with a 32-bit ARM
processor from Intel Corporation [Intel 2003].
1.4 Efficient Software Development for Microprocessor 
With the help of the software platform provided by the abstraction of the operating
system and high level languages, we can develop and test the embedded software on
a desktop computer instead of performing all tests on the hardware platform. We can
test the embedded software with large sets of data samples that do not fit into the
memories provided by the hardware platform. Moreover, a wide range of
open source software is available on the Internet, so that we can modify existing
source codes instead of writing the software from scratch. Hence, developing
embedded software on a desktop PC can save a lot of time and ease
debugging. After a series of tests, we can port the software to the embedded system
for testing. The ported software needs to be fine-tuned for the microprocessor
used in the hardware platform because the performance of that microprocessor is
limited when compared to a desktop PC. We propose the efficient software
development methodology here to show how we can maximize the performance of
the software on the microprocessor in an embedded system.
1.4.1 Efficient Software Development Methodology 
In order to maximize the performance of a microprocessor based embedded system,
we have to understand the architecture of the microprocessor and optimize the
software with respect to its merits and faults. In particular cases, we can also modify
the algorithms of the embedded software to make it run faster. The following are
some common strategies for efficient software development in such an environment:
• Source Code Optimization
As we use high level languages to develop the application software, the quality
of the codes generated by the compiler is critical. The compiler should have a
deep understanding of the target microprocessor architecture in order to
generate efficient codes. Although compilers for microprocessors are
sophisticated today, there is still room for improvement. We can rewrite the
source codes to further optimize them for faster performance. One
optimization strategy is to enhance the instruction parallelism of the source
codes. Sometimes, the compiler may not make use of all the instructions available.
In this situation, we may need to rewrite the code segments in assembly language.
• Float-to-Fixed Point Optimization
Low power consumption is one of the main concerns in embedded systems. In
order to achieve this objective, many microprocessors used in embedded
systems are not equipped with a floating point unit (FPU), which saves power and
reduces the chip size. Without an FPU, floating point operations must be
emulated in software, which is slow. To achieve faster program execution, we
therefore convert all the floating point arithmetic to fixed-point arithmetic. The
problem is that the dynamic range of fixed-point numbers is much more limited
than that of floating point numbers. Therefore, we have to determine the best
fixed-point representation that does not cause overflow and underflow problems during
program execution. In this thesis, we will employ a fingerprint minutiae
extraction algorithm as a case study.
• Domain Specific Optimization
This strategy refers to optimizing the software performance by modifying the
algorithm. Such modifications can involve changing some parameters or
simplifying some calculations, and they can occasionally degrade the final
results. In most cases, knowledge specific to the application domain
suggests where such optimizations are possible. In this thesis, we will use the font
rasterization process as a case study to show the domain specific optimization
process.
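The float-to-fixed conversion described above can be illustrated with a minimal C sketch. It assumes a Q16.16 format (a 32-bit signed integer with 16 fractional bits); the type fix16 and the helper names are hypothetical and are not the fixed-point library developed in this thesis, which is described in Chapter 3. The sketch shows why multiplication needs a wider intermediate value and a rescaling step, which is where overflow and loss of precision arise.

```c
#include <stdint.h>

/* Assumed format for illustration: Q16.16, a 32-bit signed integer
 * whose lower 16 bits hold the fraction. */
typedef int32_t fix16;
#define FIX16_ONE 65536 /* 1.0 in Q16.16 */

/* Convert an integer to Q16.16; x must lie within -32768..32767. */
static fix16 fix16_from_int(int x) { return (fix16)(x * FIX16_ONE); }

/* Addition needs no rescaling, but it can overflow like any integer add. */
static fix16 fix16_add(fix16 a, fix16 b) { return a + b; }

/* The product of two Q16.16 values carries 32 fractional bits, so it is
 * formed in 64 bits and then scaled back to 16 fractional bits. */
static fix16 fix16_mul(fix16 a, fix16 b)
{
    return (fix16)(((int64_t)a * b) / FIX16_ONE);
}
```

With this format, 1.5 is stored as 98304 and 2.5 as 163840, and fix16_mul returns 245760, which is 3.75. Values outside roughly -32768 to 32767 cannot be represented, which is the limited dynamic range mentioned above.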
The efficient software development strategies described above can be used
together or separately to enhance the performance. The optimizations are carried
out in software only, without modifying the hardware platform. In view of the growing
demand for compute bound multimedia applications, this methodology provides a
crucial solution for enabling new applications in microprocessor based
embedded systems.
1.5 Thesis Organization 
In this thesis, we concentrate on the efficient software development methodology for
microprocessor based embedded systems. The embedded systems discussed in this
thesis are portable embedded systems and are therefore battery-operated.
The processor examined is an embedded microprocessor designed for portable
embedded systems. We have already discussed the importance of microprocessor
based embedded systems in Chapter 1. In the later chapters, we describe the details of
our efficient software development methodology. Here, efficient software development
means developing software optimized for speed.
In Chapter 2, we review some of the source code optimization techniques.
We then study the practice of source code optimization with reference to the
target processor architecture and compiler behavior through case studies. In Chapter
3, we show the entire floating point to fixed-point optimization process and the
underlying theories. We also show the improvement in speed of the fixed-point
implementation of the fingerprint minutiae extraction algorithm. In Chapter 4, we
present the domain specific optimization and optimize the font rasterization engine
with two different implementations to show the improvements. Lastly, we give a
conclusion of this thesis.
Chapter 2 Source Code Optimization 11 
CHAPTER 2 
2 Source Code Optimization 
Software developers need to keep the source codes clean and easy to understand so
that they remain readable and easy to modify. Consequently, we usually leave the
optimization tasks to an optimizing compiler that optimizes the source codes with
respect to the architecture of the target processor, using the 'best' combinations of
resources including the instructions and registers. Although compilers are
sophisticated, we can still further optimize the codes by rewriting some of the source
codes or even writing assembly codes instead. In this chapter, we will review some
of the source code optimization techniques and show how the optimizations relate
to the target architecture through a case study of the StrongARM platform.
2.1 Source Code Optimization Strategy 
A commonly used strategy to optimize the source codes is described below:
1. Source code profiling
There are many profiling tools available to help us analyze our programs.
They help us to find out which functions or codes are executed most often
so that we can concentrate on optimizing those critical codes. The GNU
Profiler is an example [GPROF 2003].
2. Optimize Source Codes
We try to optimize the critical codes by rewriting them or replacing them
with assembly codes.
3. Optimize codes with Compiler
We use the optimizing compiler to optimize the modified source codes.
4. Assembly codes checking
The assembly codes generated by the compiler are checked to see whether the
desired level of optimization has been reached. If not, we repeat from step 2.
2.2 Source Code Transformations 
Source code transformations refer to rewriting the source codes to improve the
performance of the executable codes. Different compilers can generate differently
optimized codes. Reviews covering source code transformations that boost
performance can be found in [Bacon 1994] [ARM 1998]. Some transformations may
already be implemented in compilers. Therefore, we can study their results by
examining the assembly codes generated by the compiler. We give an overview of
some useful transformations here.
2.2.1 Strength Reduction 
Strength reduction replaces expensive operators with less expensive ones, that is,
with operations that require fewer clock cycles.
One of the common reduction methods is replacing multiplication with addition.
Some other common reductions are shown below:
Expensive operators     Replacements     Explanations
x * 2^c                 x << c           Left shifting instead of multiplication
x / 2^c                 x >> c           Right shifting instead of division
x / y                   x * (1 / y)      Multiplication instead of division
x mod 2^c, for x > 0    x & (2^c - 1)    Bitwise operator (AND) instead of modulus
Table 2.1 Common strength reduction examples.
In strength reduction, we should find out which operators are cost effective on
the target processor. For instance, since there is no division hardware in the StrongARM
processor, division is emulated by a function call and thus consumes
more clock cycles. Hence, we should avoid division, or write
specialized functions to handle division, to enhance performance on the
StrongARM processor.
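As a minimal sketch of the reductions in Table 2.1, the hypothetical helpers below (the names are illustrative and do not come from the case study) pair each expensive form with its cheaper replacement. Unsigned operands and a shift count c smaller than 32 are assumed, so that the shifts and masks agree exactly with the arithmetic. An optimizing compiler may already perform these rewrites for constant operands, so the generated assembly should be checked before applying them by hand.

```c
#include <stdint.h>

/* x * 2^c  ->  x << c */
static inline uint32_t mul_pow2(uint32_t x, unsigned c) { return x << c; }

/* x / 2^c  ->  x >> c */
static inline uint32_t div_pow2(uint32_t x, unsigned c) { return x >> c; }

/* x mod 2^c  ->  x & (2^c - 1) */
static inline uint32_t mod_pow2(uint32_t x, unsigned c)
{
    return x & ((1u << c) - 1u);
}

/* Multiplication replaced with addition: instead of computing i * step
 * in every iteration, keep a running sum that grows by step. */
static void fill_multiples(uint32_t *out, unsigned n, uint32_t step)
{
    uint32_t acc = 0;
    for (unsigned i = 0; i < n; i++) {
        out[i] = acc;
        acc += step;
    }
}
```

For example, mod_pow2(37, 3) masks with 7 and gives 5, the same result as 37 % 8, without invoking the division routine.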
2.2.2 Function Inlining 
Function inlining is very useful for eliminating the call overhead of frequently called
functions. After inlining, the compiler can also analyze the source codes better
and produce more optimized codes. The disadvantage of function inlining is that the
code size will increase. Commonly, most small functions are inlined
because the call overhead is expensive relative to the work they do. In the C
programming language, inlining can be achieved by using a macro or the keyword "inline".
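The two inlining mechanisms can be compared with a short sketch. The names SQUARE_MACRO, square_inline and sum_of_squares are hypothetical, and the "inline" keyword is assumed to be available (C99, or a compiler extension in older compilers). The keyword is only a hint, so whether the call is actually removed should be confirmed in the generated assembly.

```c
/* Macro form: expanded textually by the preprocessor, so the argument
 * expression is evaluated twice and is not type-checked. */
#define SQUARE_MACRO(x) ((x) * (x))

/* Function form: type-checked and the argument is evaluated once;
 * the compiler decides whether to inline it. */
static inline int square_inline(int x) { return x * x; }

/* A small, frequently executed loop body is a typical candidate. */
static int sum_of_squares(const int *v, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += square_inline(v[i]);
    return sum;
}
```

The macro form should not be given arguments with side effects, such as SQUARE_MACRO(i++), because the argument appears twice in the expansion.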
2.2.3 Table Lookup 
Table lookup can be used to replace expensive mathematical functions like
sine and cosine. We construct one or more tables that represent the values of a
mathematical function with only the desired accuracy, so as to minimize the
memory usage. For example, we have used table lookup intensively in our fixed-point
library to implement some mathematical functions in our case study, which will be
discussed in Chapter 3.
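A table lookup of this kind might look like the sketch below. It is only an illustration under stated assumptions and is not the table used in the fixed-point library of Chapter 3: the table stores sin(angle) scaled by 256 for 0 to 90 degrees in steps of 10 degrees, and the hypothetical function sin_lookup accepts only multiples of 10 between 0 and 350. The symmetry of the sine function lets one quadrant of data cover the full circle, which keeps the table small.

```c
/* sin(angle) * 256, rounded, for 0, 10, ..., 90 degrees. */
static const short sin_table[10] = {
    0, 44, 88, 128, 165, 196, 222, 241, 252, 256
};

/* Returns sin(deg) * 256 for deg in {0, 10, ..., 350}. */
static int sin_lookup(int deg)
{
    int sign = 1;
    if (deg >= 180) {   /* sin(x) = -sin(x - 180) */
        deg -= 180;
        sign = -1;
    }
    if (deg > 90)       /* sin(x) = sin(180 - x) */
        deg = 180 - deg;
    return sign * sin_table[deg / 10];
}
```

A finer step or more fractional bits increases the accuracy at the cost of a larger table, which is the trade-off between accuracy and memory usage described above.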
2.2.4 Loop Transformations 
Loops are usually where most of a program's execution time is spent. Therefore,
many loop transformations have been developed to improve the execution
performance.
• Loop Unrolling
Loop unrolling can reduce the overheads of small loops. The overheads
of a loop refer to the updating of the loop counter and the execution of
branching instructions. Loop unrolling is easy to achieve; we simply
duplicate the body of the loop and adjust the increment of the loop counter.
Consider the following code segments:
Original Codes: 
    for (i = 0; i < 100; i++) {
        total[i] = total[i] + a[i];
    }
Optimized Codes:
    for (i = 0; i < 100; i += 2) {
        total[i] = total[i] + a[i];
        total[i+1] = total[i+1] + a[i+1];
    }
• Loop Fusion 
We can combine different loops with the same number of iterations to reduce the
loop overheads.
Original Codes:
    for (i = 0; i < 100; i++) {
        total[i] = total[i] + c;
    }
    for (j = 0; j < 100; j++) {
        sum[j] = sum[j] * 4;
    }
Optimized Codes:
    for (i = 0; i < 100; i++) {
        total[i] = total[i] + c;
        sum[i] = sum[i] * 4;
    }
• Loop Invariant Code Motion 
Loop invariant code refers to code whose result does not change from one
iteration of the loop to the next. We should move it out of the loop, or first
cache the value needed in the loop in a register. In the following codes, strlen()
determines the length of the string "article" and is called in every iteration of
the loop. The length of the string "article" is independent of the loop;
therefore, we should use a temporary variable to cache it so that the strlen()
function is not called each time.
Original Codes:
    for (i = 0; i < strlen(article); i++) {
        sum = sum + article[i];
    }
Optimized Codes:
    length = strlen(article); /* String Length */
    for (i = 0; i < length; i++) {
        sum = sum + article[i];
    }
    /* the length of the article is invariant */
2.2.5 Software Pipelining 
Pipelining in a processor divides the execution of an instruction into several stages, 
such as fetch, decode and execute. Similarly, software pipelining breaks the loop body 
into several stages to remove data dependencies in the code and enhance parallelism. 
It has proven to be an efficient technique for generating independent instructions for 
Very Long Instruction Word (VLIW) and superscalar processors, which execute several 
instructions in parallel [Lam 1988][Hennessy 2003]. We can also use this technique 
to remove the data hazards of a simple pipelined processor to obtain its optimal 
performance. There are many algorithms designed to achieve software pipelining. 
A comparison of these methods is given in [Allen 1995]. Basically, we can 
construct a software pipelined loop by following the steps below [Su 1999]: 
• Unroll the loop body 
• Pipeline the unrolled loop body 
• Choose the instructions from each iteration 
• Construct the new loop with prelude and postlude 
Consider the following loop: 
Original Codes: 
for (i=0; i < N; i++){ 
    total[i] = total[i] + c; /* A */ 
    sum = sum + total[i]; /* B */ 
    b = sum + d; /* C */ 
} 
Software Pipelining: 
1) Unroll the loop body: 
total[0] = total[0] + c; 
sum = sum + total[0]; 
b = sum + d; 
total[1] = total[1] + c; 
sum = sum + total[1]; 
b = sum + d; 
total[2] = total[2] + c; 
sum = sum + total[2]; 
b = sum + d; 
2) Pipeline the unrolled loop: 
total[0] = total[0] + c; 
sum = sum + total[0];     total[1] = total[1] + c; 
b = sum + d;              sum = sum + total[1];     total[2] = total[2] + c; 
                          b = sum + d;              sum = sum + total[2]; 
                                                    b = sum + d; 
3) Select the instructions: 
b = sum + d; sum = sum + total[1]; total[2] = total[2] + c; 
4) The pipelined loop: 
total[0] = total[0] + c; /* Prelude */ 
sum = sum + total[0]; 
total[1] = total[1] + c; 
for (i=1; i<N-1; i++){ 
    b = sum + d; 
    sum = sum + total[i]; 
    total[i+1] = total[i+1] + c; 
} 
b = sum + d; /* Postlude */ 
sum = sum + total[i]; 
b = sum + d; 
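The four steps above can be checked with a small C sketch. This is only an illustration under stated assumptions: the function names loop_original and loop_pipelined, the int element type and the pointer-based handling of sum are hypothetical, and the pipelined version assumes N >= 2 so that the prelude may touch total[1].

```c
/* Sketch only: names, types and the n >= 2 requirement are assumptions. */

/* Original loop: statements A, B and C run in every iteration. Returns b. */
int loop_original(int *total, int n, int c, int d, int *sum)
{
    int i, b = 0;
    for (i = 0; i < n; i++) {
        total[i] = total[i] + c;   /* A */
        *sum = *sum + total[i];    /* B */
        b = *sum + d;              /* C */
    }
    return b;
}

/* Software-pipelined loop following step 4: each pass of the loop body
   mixes C of iteration i-1, B of iteration i and A of iteration i+1. */
int loop_pipelined(int *total, int n, int c, int d, int *sum)
{
    int i, b;
    int s = *sum;                  /* local copy of the running sum */

    /* Prelude */
    total[0] = total[0] + c;       /* A(0) */
    s = s + total[0];              /* B(0) */
    total[1] = total[1] + c;       /* A(1) */

    for (i = 1; i < n - 1; i++) {
        b = s + d;                         /* C(i-1) */
        s = s + total[i];                  /* B(i)   */
        total[i + 1] = total[i + 1] + c;   /* A(i+1) */
    }

    /* Postlude: here i == n - 1 */
    b = s + d;                     /* C(n-2) */
    s = s + total[i];              /* B(n-1) */
    b = s + d;                     /* C(n-1) */

    *sum = s;
    return b;
}
```

Calling both functions on identical copies of an array should leave the arrays, the sum and the returned b equal, which is a quick way to confirm that the prelude and postlude cover exactly the iterations removed from the loop.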
2.2.6 Register Allocation 
Some variables are used more often than others; therefore, keeping these 
variables in registers can enhance performance. Nevertheless, we should be 
aware that not all kinds of variables can be kept in registers. Registers cannot 
safely be allocated to global variables and reference variables passed to functions 
because they can be changed anywhere outside the current program scope. To solve 
this, we can introduce local temporary variables to direct the compiler to allocate 
registers to them explicitly. In the following code, the variable "sum" is an 
external reference and thus it is loaded from memory in every iteration of the loop 
unnecessarily. Therefore, we introduce a temporary variable "tmp" to hold the value 
of "sum" and use it in the loop. Thus, we can reduce the number of memory accesses 
and of instructions executed. When the loop is finished, we store the result back to 
the external reference "sum". 
Original Codes: 
void func(unsigned char* array, int *sum, int* ptr){ 
    int i; 
    for (i=0;i<100;i++){ 
        *sum += array[i]; 
    } 
} 
Optimized Codes: 
void func(unsigned char* array, int *sum, int* ptr){ 
    int i; 
    int tmp = *sum; 
    for (i=0;i<100;i++){ 
        tmp += array[i]; 
    } 
    *sum = tmp; 
} 
2.3 Case Study: Source Code Optimization on the StrongARM 
(SA1110) Platform 
2.3.1 StrongARM architecture 
StrongARM (SA1110) is a 32-bit system-on-chip RISC processor based on the ARM 
architecture [Furber 2000]. It supports the ARM v4 instruction set and comes with a 
5-stage pipeline architecture to enhance the throughput of the processor [Intel 2000]. 
A RISC embedded microprocessor commonly employs pipelining to boost 
performance. StrongARM separates each instruction into several steps such that 
several instructions can be executed in parallel. We have to understand the processor 
pipeline behavior in order to optimize the source code to enhance instruction level 
parallelism. 
The five pipeline stages of the StrongARM are described below [Intel 1998] [Furber 
2000]: 
1. FETCH 
Fetch instruction from instruction cache. 
2. DECODE 
Decode instruction, read values from register. 
3. EXECUTE 
Perform shifting and arithmetic operations. 
4. BUFFER 
Data cache and memory access. 
5. WRITEBACK 
Write results to registers. 
Cycle | FETCH | DECODE | EXECUTE | BUFFER | WRITEBACK 
1 | Fetch from PC | | | | 
2 | Fetch from PC+4 | Decode instruction at PC | | | 
3 | Fetch from PC+8 | Decode instruction at PC+4 | Execute instruction at PC | | 
4 | Fetch from PC+12 | Decode instruction at PC+8 | Execute instruction at PC+4 | Memory access for instruction at PC, etc. | 
5 | Fetch from PC+16 | Decode instruction at PC+12 | Execute instruction at PC+8 | Memory access for instruction at PC+4, etc. | Write results to register of instruction at PC 
Figure 2.1 StrongARM: Five stages pipelining demonstration. 
When all the stages shown in Figure 2.1 are active in the same machine cycle, 
the pipeline is said to operate in its optimal condition. To achieve 
this and keep all the processing units active, we have to 
prevent pipeline hazards. 
Pipeline Hazard 
There are several kinds of pipeline hazards [Hamacher 2002]: 
1. Structural Hazard 
It happens when two different instructions compete for the same resource, such as 
memory or the arithmetic unit. 
2. Data Hazard 
It happens when an operand of one instruction relies on the result of a prior 
instruction which has yet to complete. 
3. Control Hazard 
It happens when the next instruction is not available. This often 
occurs in branching. 
When the above hazards arise, the pipeline is stalled until the hazard is 
resolved. A stall is an idle period during which other instructions wait in the 
pipeline. Stalls slow down the processor, but sometimes they cannot be prevented. 
2.3.2 StrongARM pipeline hazard illustration 
The StrongARM processor core employs several techniques to resolve pipeline 
hazards. Structural hazards, commonly found during simultaneous memory 
accesses, are resolved by employing the Harvard architecture so that 
data and program are stored in separate memories. Data hazards are resolved by 
forwarding result data immediately after execution to the following instruction. In 
Figure 2.2, dedicated paths are added between the processing units to 
forward the result data to the inputs of the ALU immediately. Therefore, the following 
instruction can be executed in the next cycle without fetching values from registers. 
Even with the dedicated data forwarding path, a data dependency associated with a 
load instruction can still cause a data hazard. The scenario is shown in Figure 2.3 
[Intel 1998]. Control hazards, which occur in the execution of branching instructions, 
are unpreventable. However, their impact can be minimized by adding special hardware to 
predict the new instruction location (PC) during the decode stage. The scenario is shown 
in Figure 2.4 [Intel 1998]. Besides the hazards mentioned above, instructions 
that need more than one cycle to execute may also 
introduce stalls. 
[Diagram: Register Read, ALU, Buffer/D-Cache and Register Write stages linked by the data forwarding path] 
Figure 2.2 Data forwarding path 
Cycle | FETCH | DECODE | EXECUTE | BUFFER | WRITEBACK 
1 | Fetch from PC | | | | 
2 | Fetch from PC+4 | Decode LDR r1,[r0,+4]! | | | 
3 | Fetch from PC+8 | Decode MOV r2,r1 | Calculate r0+4 | | 
4 | Stall | Stall | Stall | Read data from [r0+4] | 
5 | Fetch from PC+12 | Decode instruction at PC+8 | Execute MOV r2,r1 | Stall | Write data to r1 
0 LDR r1, [r0, +4]! //Load data from address [r0, +4] to r1 (Assume Data Cache hit) 
4 MOV r2, r1 //Move data from r1 to r2 
Figure 2.3 Data hazard illustration 
Cycle | FETCH | DECODE | EXECUTE | BUFFER | WRITEBACK 
1 | Fetch from PC | | | | 
2 | Fetch from PC+4 | Decode CMP r2,r3 | | | 
3 | Fetch from PC+8 | Decode BEQ .FUNC and calculate new PC (.FUNC) | Calculate r2-r3 and set control flags | | 
4 | Fetch from .FUNC | Stall | Stall | Stall | Stall 
5 | Fetch from .FUNC+4 | Decode instruction at .FUNC | Stall | Stall | Stall 
0 CMP r2, r3 //Compare the value of r2 and r3 (Assume r2 == r3) 
4 BEQ .FUNC //Branch if r2 == r3 
Figure 2.4 Control hazard illustration. 
2.3.3 Source Code Optimization on the StrongARM 
Source code optimization is closely tied to the optimizing compiler used. If the 
compiler is sophisticated enough to handle an optimization well, it is better to 
keep the original source code unmodified. In our experiments, we used the standard 
ARM GNU tool chain including GCC (version 2.95.1) [GNU 2003] on the Linux 
operating system. We examined the assembly code generated by the compiler to see 
whether further source code optimization was necessary. Moreover, we checked the 
performance of the source code using SimIt-ARM, an ARM instruction-set 
simulator and cycle-accurate simulator for the StrongARM architecture [Qin 2003]. 
The simulator provides the total number of cycles used and the overall instructions 
per cycle ratio. 
To illustrate the process of source code optimization, we use a simple quick 
sort program as a demonstration. Our quick sort implementation has two 
functions, "SWAP" and "QSORT". The SWAP function swaps two 
elements while sorting. QSORT is the main sort function, which divides the list and 
compares the list elements. There is only one loop in the function QSORT. We 
therefore concentrate on this loop for optimization. The loop is shown in Figure 2.5. 
Source code: 
p=l; 
for (i=l+1; i<=r; i++){ 
    if (array[i] < array[l]) { 
        p++; 
        swap(array,p,i); 
    } 
} 
Assembly code generated: 
.L24: 
    ldr r2, [r4, r0, asl #2] 
    ldr r3, [r4, lr, asl #2] 
    cmp r2, r3 
    bge .L23 
    add r5, r5, #1 
    ldr r3, [r4, r5, asl #2]    @ swap inlined 
    str r2, [r4, r5, asl #2]    @ swap inlined 
    str r3, [r4, r0, asl #2]    @ swap inlined 
.L23: 
    add r0, r0, #1 
    cmp r0, r6 
    ble .L24 
Figure 2.5 Original source code and assembly code generated. 
The GNU C compiler provides many levels of optimization; we chose the 
maximum optimization flag provided (-O3) to enable all the useful optimizations. 
This optimization level supports function inlining, so the swap function was 
inlined automatically by the compiler because of its simplicity. After looking at the 
assembly code, we noticed that array[l] (ldr r3, [r4, lr, asl #2]) was 
always loaded from memory although it was loop invariant and should have been kept 
in a register instead. The compiler cannot handle this because array is an external 
reference which may be modified by other functions, so the 
compiler decided to load the content from memory each time. We can improve this 
by introducing a temporary variable that the compiler can allocate to a register. 
Source code: 
p=l; 
val = array[l]; 
for (i=l+1; i<=r; i++) { 
    if (array[i] < val) { 
        p++; 
        swap(array, p, i); 
    } 
} 
Assembly code generated: 
.L24: 
    ldr r1, [r4, r2, asl #2] 
    cmp r1, r0 
    bge .L23 
    add r5, r5, #1 
    ldr r3, [r4, r5, asl #2]    @ swap inlined 
    str r1, [r4, r5, asl #2]    @ swap inlined 
    str r3, [r4, r2, asl #2]    @ swap inlined 
.L23: 
    add r2, r2, #1 
Figure 2.6 Introduction of the temporary variable. 
In Figure 2.6, we found that the compiler uses register r0 to hold the value of 
array[l] after the temporary variable val is introduced. 
Since data dependencies between instructions may cause data hazards, the compiler 
tries to reschedule the instructions to eliminate their effects. However, there are still 
some cases in which such problems cannot be avoided. Consider the following code 
segment from QSORT: 
ldr r1, [r4, r2, asl #2] 
cmp r1, r0 
bge .L23 
There is a data dependency on register r1 between the ldr and cmp 
instructions. The data forwarding path in the StrongARM cannot resolve this data 
dependency. We tried to remove the data dependency by using the software 
pipelining described before. The result is shown in Figure 2.7. 
#Only loop body is shown. 
Source code: 
p=l; 
val = array[l]; 
sp = array[l+1]; 
for (i=l+1; i<r; i++){ 
    if (sp < val) { 
        p++; 
        swap(array,p,i); 
    } 
    sp = array[i+1]; 
} 
if (sp < val) { 
    p++; 
    swap(array,p,i); 
} 
Assembly code generated: 
.L24: 
    cmp r3, r1 
    bge .L25 
    ldr r2, [r4, r0, asl #2] 
    add r5, r5, #1 
    ldr r3, [r4, r5, asl #2] 
    str r2, [r4, r5, asl #2] 
    str r3, [r4, r0, asl #2] 
.L25: 
    add r0, r0, #1 
    cmp r0, r6 
    ldr r3, [r4, r0, asl #2] 
    blt .L24 
Figure 2.7 Software pipelining example. 
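The source-level transformation of the partition loop can be exercised with the following C sketch. It is an illustration only: swap_int, partition_plain and partition_pipelined are hypothetical names, l and r are taken to be the inclusive bounds of the sub-array, and the pipelined version assumes l < r. The look-ahead load is written as array[i + 1] so that sp always holds the element that the next test needs.

```c
/* Sketch only: function names and the l < r requirement are assumptions. */

static void swap_int(int *array, int a, int b)
{
    int t = array[a];
    array[a] = array[b];
    array[b] = t;
}

/* Partition loop with the pivot cached in val. */
int partition_plain(int *array, int l, int r)
{
    int i, p = l;
    int val = array[l];
    for (i = l + 1; i <= r; i++) {
        if (array[i] < val) {
            p++;
            swap_int(array, p, i);
        }
    }
    return p;
}

/* Software-pipelined variant: the element for the next comparison is
   loaded one iteration ahead, so the load and the compare that uses it
   are no longer adjacent. */
int partition_pipelined(int *array, int l, int r)
{
    int i, p = l;
    int val = array[l];
    int sp = array[l + 1];          /* prelude: first look-ahead load */
    for (i = l + 1; i < r; i++) {
        if (sp < val) {
            p++;
            swap_int(array, p, i);
        }
        sp = array[i + 1];          /* load for the next iteration */
    }
    if (sp < val) {                 /* postlude: last element, i == r */
        p++;
        swap_int(array, p, i);
    }
    return p;
}
```

Both functions should return the same pivot position and leave the array in the same order, since the swap only touches indices no greater than i and therefore never disturbs the preloaded element.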
For simplicity, we treated the body of the if-statement as one task in the loop. 
Hence, we used the previously described method to pipeline the two tasks in the 
loop in order to remove the data dependency. Moreover, we observed that the compiler 
could not recognize that one of the elements in "array" was preloaded in register r3 
for software pipelining. For this reason, one extra load instruction was added in 
swapping. We can remove it by rewriting the swap function code ourselves. Finally, 
we compared the performance of all the optimized versions with the simulator and 
the StrongARM PDA to check the improvement. We tested the quick sort program 
with 50000 random integers. The results are shown in Table 2.1. 
The instructions per cycle (IPC) figure gives a brief picture of the 
pipeline behavior of the program. After the use of software pipelining, we found that 
the execution of the loop achieved the highest IPC of all the versions. 
This was not just because of the removal of the data dependency; the 
branching overhead of the loop was also reduced. There were two branches in the 
loop: the loop body branch and the if-statement branch. Branching causes 
stalls in the pipeline and thus affects the instruction level parallelism. In our case, if 
more instructions can be executed before branching, the branching overhead can be 
reduced. That is the reason why the IPC drops after the removal 
of the extra instructions in both versions 2 and 4. 
Version | Optimization added (accumulative) | IPC (instructions per cycle) | Total number of instructions executed | Estimated execution time (simulation) | Execution time on StrongARM PDA 
1 | Compiler only | 0.7206 | 99043931 | 0.6658s | 0.6244s 
2 | Temporary variable introduced | 0.7185 | 98138473 | 0.6617s | 0.6185s 
3 | Software pipelined | 0.7248 | 98524009 | 0.6584s | 0.6159s 
4 | Rewrite swap | 0.7237 | 98389667 | 0.6565s | 0.6139s 
Table 2.1 Results of the optimized codes. 
The execution time of the program depends not only on the instructions per cycle, 
but also on the clock rate of the processor and the total number of 
instructions executed. The execution time can be obtained from the following 
equation [Hamacher 2002]: 
$T_{exec} = T_{clock} \times \frac{n}{IPC}$ 
$T_{exec}$ - execution time 
$T_{clock}$ - clock period (the reciprocal of the clock rate) 
$n$ - total number of instructions executed 
$IPC$ - instructions per cycle 
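As a quick numerical check of this relation, the sketch below computes the execution time from an instruction count, an IPC value and a clock frequency. The function name exec_time_seconds is hypothetical, and the 206 MHz clock used in the usage note is an assumption taken from the StrongARM figure quoted elsewhere in this thesis.

```c
/* Sketch only: exec_time_seconds is an illustrative name. */

/* Execution time = (n / IPC) cycles, each lasting 1 / clock_hz seconds. */
double exec_time_seconds(double n_instructions, double ipc, double clock_hz)
{
    double cycles = n_instructions / ipc;   /* total clock cycles */
    return cycles / clock_hz;               /* T_clock = 1 / clock_hz */
}
```

With the version 1 figures from Table 2.1 (99043931 instructions, IPC 0.7206) and an assumed 206 MHz clock, exec_time_seconds(99043931.0, 0.7206, 206e6) gives roughly 0.667 s, close to the simulated 0.6658 s reported in the table.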
Therefore, we can also reduce the execution time by reducing the total number of 
instructions executed. In particular, the improvements in execution time in 
versions 2 and 4 were due to the decreases in the total number of instructions. Although 
the improvement in speed was not very significant in the quick sort experiment, we 
have shown the entire optimization process. The improvement from source code 
optimization depends strongly on how often the optimized code segments 
are executed. If they are executed millions of times, then a more 
significant result can be expected. 
Lastly, in order to show the generality of the approach, we optimized the NIST Secure 
Hash Algorithm (SHA) source code taken from MiBench, a free, commercially 
representative embedded benchmark suite [Guthaus 2001]. The secure hash 
algorithm produces a 160-bit message digest of the input stream. It is 
commonly used in digital signature and cryptographic applications. We first used the 
GNU profiler to find the critical functions in the source code. 
Time(%) | Calls | Function Name 
87.93 | 50744 | sha_transform 
12.07 | 1 | sha_stream 
0.00 | 1 | sha_final 
0.00 | 1 | sha_print 
Table 2.2 SHA profiling result. 
From Table 2.2, we found that the function sha_transform used most of the time. 
It was called 50744 times. Hence, it is important to optimize this function by using 
the source code optimization techniques discussed. The results are shown in Table 2.3. 
Version | Instructions per cycle (IPC) | Estimated execution time (simulation) | Execution time on StrongARM PDA 
Original version | 0.8833 | 0.0743s | 0.0842s 
Optimized version | 0.9094 | 0.0623s | 0.0727s 
Speed up (%) | | [values illegible in source] 
Table 2.3 SHA source code optimization results. 
We found that we can obtain about a 14% reduction in execution time on the 
StrongARM PDA after optimization. This shows that source code optimization 
does achieve a significant improvement in speed. Moreover, we repeated the 
experiments with Embedded Visual C++ 3.0, which is the standard development 
tool for the Pocket PC PDA, and found that we can achieve the same improvement in 
performance. Therefore, the two most common compilers used for the StrongARM 
PDA do not yet fully optimize the source code provided, and source code 
optimizations are still necessary. 
2.3.4 Instruction Set Optimization of StrongARM 
The optimal use of instructions is vital for a microprocessor based embedded system. 
Since high level languages are often used for developing complicated software, it is 
crucial to ensure that the compiler chooses the most efficient and suitable 
instructions when translating the programs. In embedded microprocessors, 
there are often special instructions available to improve the performance of 
digital signal processing applications. A multiply-and-accumulate (MLA) instruction, 
which is commonly found in DSP processors, is available in the StrongARM 
processor. This instruction can be used efficiently in digital filtering and image 
processing. In this work, we studied whether the compiler can make use of this 
instruction to gain speed on the StrongARM. The compiler used is the Linux ARM 
GCC used in the previous experiments. 
MLA, which stands for Multiply-And-Accumulate, is a single instruction that 
performs a multiplication and an addition together. We evaluated the efficiency of the 
GCC compiler in using the MLA instruction through the following test code: 
for (j = 0; j < 10000; j++) 
    for (i = 0; i < 3000; i++) { 
        tmp[i] = a*b + c; //MLA expression 
    } 
The optimizing compiler optimizes the code according to the nature of a, 
b and c in the MLA expression above. Consider the following cases of the 
expression a*b + c in the loop: 
1. a, b, c are variables and vary in each iteration. 
For example, they are loaded each time from different memory locations in 
the loop. The compiler generated the MLA instruction for the expression to 
gain maximum speed. 
2. a is a variable that varies in each iteration, whereas b and c are constants. 
For example, the expression: a * 113 + 40. 
The compiler did not generate the MLA instruction for the expression. 
Instead, it generated a series of shift, addition and subtraction instructions 
intended to avoid the use of the hardware multiplier [Lefevre 1992]. 
Typically, the multiplier takes 1-3 cycles to complete a multiplication 
depending on the value of the multiplicand. Therefore, it would introduce 
pipeline stalls and thus affect the overall performance. For instance, if we want to 
multiply a variable a by the constant 113, we can convert the multiplication 
to a series of shifts, additions and subtractions by the following method 
[Bernstein 1986]: 
$113 = 1110001_2 = 2^6 + 2^5 + 2^4 + 1$ 
By using the identity $2^6 + 2^5 + 2^4 = 2^7 - 2^4$, 
$\Rightarrow 113 = 2^7 - 2^4 + 1$ 
$\Rightarrow 113a = 2^4(2^3 a - a) + a$ 
In a computer system, multiplication by a power of two is equal to 
a left shift ("<<"). Therefore, 113a can be evaluated as follows: 
$7a \leftarrow (a \ll 3) - a$ 
$113a \leftarrow (7a \ll 4) + a$ 
Thus, only two instructions are needed to perform the multiplication. This is 
more efficient than the multiplication/MLA instruction. The assembly code 
generated by the compiler for the expression 113*a + 40 is shown in Figure 2.7. 
.L10: 
    bl rand 
    rsb r3, r0, r0, asl #3    @ MLA expression: r3 = 7a 
    add r0, r0, r3, asl #4    @ MLA expression: r0 = 113a 
    add r0, r0, #40           @ MLA expression: r0 = 113a + 40 
    str r0, [r5, r4, asl #2] 
    add r4, r4, #1 
    cmp r4, r7 
    ble .L10 
Figure 2.7 Code generated for expression: 113*a + 40. 
However, there are some cases which require more instructions to compute 
the multiplication. These may be slower than a single multiplication or MLA 
instruction. Consider the following example: 
$172 = 10101100_2 = 2^7 + 2^5 + 2^3 + 2^2$ 
$\Rightarrow 172 = 2^7 + 2^6 - 2^4 - 2^3 + 2^2$ 
$\Rightarrow 172 = 2^2(2(2^3(2 + 1) - (2 + 1)) + 1)$ 
$\Rightarrow 172a = 2^2(2(2^3(2a + a) - (2a + a)) + a)$ 
Therefore, 172a can be evaluated as follows: 
$3a \leftarrow (a \ll 1) + a$ 
$21a \leftarrow (3a \ll 3) - 3a$ 
$43a \leftarrow (21a \ll 1) + a$ 
$172a \leftarrow 43a \ll 2$ 
It requires four instructions to perform the multiplication and it is slower than 
the MLA instruction. As the binary representation of the constant affects the 
number of instructions generated for multiplication, we performed an 
experiment with different constant values and checked the results against the 
speed of hand-optimized code with one single MLA instruction. 
b | c | No. of instructions (compiler) | No. of instructions (MLA) | Time (compiler) | Time (MLA) 
3 | 0 | 1 | 1 | 16.28s | 16.43s 
9 | 0 | 1 | 1 | 16.28s | 16.43s 
19 | 0 | 2 | 1 | 16.42s | 16.43s 
22 | 0 | 3 | 1 | 16.57s | 16.43s 
52 | 0 | 3 | 1 | 16.57s | 16.43s 
172 | 0 | 4 | 1 | 16.71s | 16.43s 
1374 | 0 | 5 | 1 | 16.86s | 16.43s 
Table 2.4 Different constant values comparison. 
In Table 2.4, we consider only the multiplication in the MLA expression, 
so we set the value of c to zero. The compiler generated various 
numbers of instructions to replace the multiplication for different values of b. 
If there were more than two instructions, the code was slower than a single MLA or 
multiplication instruction. For example, the compiler generated five instructions for 
the multiplication by the value 1374, which is slower than the MLA 
instruction. Through this experiment, we discovered that the compiler did not 
take the speed of the internal multiplier into account and sometimes replaced the 
multiplication with less efficient instructions. 
Furthermore, the addition of the value c in the MLA expression also affects 
the number of instructions generated. This is because the StrongARM cannot 
always pack an immediate value into one single instruction. It uses 12 bits 
to hold the immediate value in the instruction: 8 bits for the 
value and 4 bits for the shift operand. Therefore, only immediate values 
that satisfy the following formula can be stored in one single instruction: 
$[\text{8-bit value}] \times 2^{2n}$, where $n$ is the 4-bit shift operand 
Consider the immediate value 1000. 
$1000 = 1111101000_2 = (11111010_2) \times 2^2$ 
The calculation above shows that it can be stored in one single instruction 
and thus only one addition is required. 
In contrast, consider the immediate value 5000. 
$5000 = 1001110001000_2$ 
The binary representation of 5000 does not fit the formula discussed 
above and hence it cannot be stored in one single instruction. 
Instead, the compiler splits it as 4992 + 8: 
$5000 = 4992 + 8$ 
$4992 = 1001110000000_2 = (01001110_2) \times 2^6$ 
$8 = 1000_2 = (00001000_2) \times 2^0$ 
Therefore, two additions are required for the addition of the value 5000. 
We examined the speed of different combinations of constant b and c and 
checked how many instructions were generated for each case. The results are 
shown in Table 2.5. 
(b, c) | No. of instructions (compiler) | No. of instructions (MLA) | Time (compiler) | Time (MLA) 
(3, 1000) | 2 | 1 | 16.43s | 16.43s 
(3, 5000) | 3 | 1 | 16.57s | 16.43s 
(19, 1000) | 3 | 1 | 16.57s | 16.43s 
(19, 5000) | 4 | 1 | 16.72s | 16.43s 
(52, 1000) | 4 | 1 | 16.72s | 16.43s 
(52, 5000) | 5 | 1 | 16.87s | 16.43s 
(1374, 1000) | 6 | 1 | 17.01s | 16.43s 
(1374, 5000) | 7 | 1 | 17.16s | 16.43s 
Table 2.5 Different immediate values comparison. 
In Table 2.5, we found that the MLA instruction was never slower than the code 
generated by the compiler, and it was faster in every case except (3, 1000). This is 
because at least one addition instruction was 
generated by the compiler. Moreover, the addition of 5000 was separated 
into two additions of 4992 and 8; thus it was slower than the addition of 1000 
even with the same value of b. 
3. a is a variable and it is loop variant. Although b and c are also variables, 
they are loop invariant. For example, we assigned values to the variables b and c 
before entering the loop. Thus, they would not be changed in the loop. The 
compiler generated a multiplication instruction followed by an addition 
instruction. In the time span in which MLA performs two operations, 
only one single multiplication can be performed. Therefore, the generated code is 
slower than one single MLA instruction. 
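The shift-and-add decompositions worked out in case 2 can be written directly in C. The sketch below is an illustration only: the function names mul113 and mul172 are hypothetical, and unsigned 32-bit arithmetic is assumed so that overflow simply wraps around.

```c
#include <stdint.h>

/* Sketch only: mul113 and mul172 are illustrative names. */

/* 113a = 2^4(2^3 a - a) + a: two shift-and-add/subtract steps. */
uint32_t mul113(uint32_t a)
{
    uint32_t a7 = (a << 3) - a;      /* 7a   */
    return (a7 << 4) + a;            /* 113a */
}

/* 172a = 2^2(2(2^3(2a + a) - (2a + a)) + a): four steps. */
uint32_t mul172(uint32_t a)
{
    uint32_t a3  = (a << 1) + a;     /* 3a   */
    uint32_t a21 = (a3 << 3) - a3;   /* 21a  */
    uint32_t a43 = (a21 << 1) + a;   /* 43a  */
    return a43 << 2;                 /* 172a */
}
```

Each C statement corresponds to one ARM data-processing instruction with a shifted operand, which is why the constant 113 costs two instructions and the constant 172 costs four.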
One single MLA instruction can perform better than the 
code optimized by the compiler in the cases described above. Therefore, we have to 
use assembly language to further optimize the code. Through these experiments, we 
exposed a weakness of the compiler: it does not have enough knowledge of the 
target processor to fully optimize the source code. 
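The immediate-value rule discussed in case 2 can also be tested in software. The sketch below is an illustration, not code from the experiments: the function name arm_imm_encodable is hypothetical, and it relies on the fact that an ARM data-processing immediate is an 8-bit value rotated right by twice the 4-bit field, which covers the "8-bit value times an even power of two" cases used in the text.

```c
#include <stdint.h>

/* Sketch only: arm_imm_encodable is an illustrative name. */

/* Returns 1 if v can be expressed as an 8-bit value rotated right by an
   even number of bits (0, 2, ..., 30), and 0 otherwise. */
int arm_imm_encodable(uint32_t v)
{
    int rot;
    for (rot = 0; rot < 32; rot += 2) {
        /* Rotating v left by rot undoes a rotate-right by rot. */
        uint32_t x = (rot == 0) ? v : ((v << rot) | (v >> (32 - rot)));
        if (x <= 0xFFu)
            return 1;
    }
    return 0;
}
```

Under this check 1000, 4992 and 8 are each encodable in one instruction while 5000 is not, which matches the split of 5000 into 4992 + 8 described above.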
2.4 Conclusion 
We have reviewed some common techniques in source code optimization and also 
examined the possibility of optimizing source code with respect to the architecture 
of the embedded processor. We showed that the source code optimization techniques 
can give satisfactory improvement. It might be possible for compilers to use 
these optimization techniques to generate better executable code in the future. 
Moreover, we showed that the compiler does not always use the optimal instructions 
available for the target processor. Therefore, we have to use assembly language to 
further optimize our code, which degrades the portability of the software. 
CHAPTER 3 
3 Float-to-Fixed Optimization 
One of the major concerns in portable embedded systems is power 
consumption. In order to reduce the power consumption, a floating point unit is often 
not included in such systems. Instead, software emulation is used to perform floating 
point operations. To appreciate the speed of such emulation, a typical performance 
comparison (in seconds) of two processors is shown below [Moon and Luk 2002]: 
Operation | StrongARM (206MHz) | Pentium II (266MHz) 
Integer Arithmetic 
Addition/Subtraction | [illegible] | 0.05 
Multiplication | 0.09 | 0.06 
Division | 0.12 | 0.22 
Floating Point Arithmetic (Software Emulation) 
Addition/Subtraction | [illegible] | 0.05 
Multiplication | [illegible] | 0.77 
Division | 10.09 | 0.06 
Mathematical functions (Software Emulation) 
Sine | 152.18 | 0.55 
Cosine | 151.89 | 0.55 
Tangent | 270.73 | 0.71 
Arc Tangent | 191.52 | 1.26 
Square Root | 28.25 | 0.33 
#Each test runs 1 million times. 
Table 3.1 Floating point and Fixed-point speed comparison. 
From the data in Table 3.1, it is obvious that the StrongARM, an embedded 
microprocessor commonly used in Personal Digital Assistants (PDAs), performs 
floating point operations much more slowly than the Pentium II even though the clock 
speeds of both processors are similar. Therefore, it is not uncommon that software 
developed on a desktop PC cannot be directly ported to an embedded system because 
of the absence of a floating point unit. Here, we propose the replacement of floating 
point operations with fixed-point operations to boost the performance of programs 
used in embedded systems. A fixed-point number is merely an integer with a decimal 
point defined explicitly by the programmer. Therefore, its arithmetic is as fast as 
integer arithmetic. In this chapter, we first review fixed-point arithmetic and then 
show the complete float-to-fixed process using a case study of fingerprint minutiae 
extraction performed on the StrongARM platform. 
3.1 Introduction to Fixed-point 
Besides floating point representation, fixed-point is another representation of real 
numbers in computer systems. Fixed-point is used very often in DSP for a 
variety of applications such as speech recognition [Hui 1998], MPEG-4 AAC audio 
decoding [Mesarovic 2000], JPEG2000 encoding [Tai 2001], etc. Fixed-point is 
used in DSP because the implementation of a floating point DSP requires a large 
chip area and its operation also consumes more power [Jersak 1998]. 
3.1.1 Fixed-point representation 
Fixed-point uses an integer to represent a real number. The fixed-point 
representation within an integer is as follows [Kang 1997][Willems 1997]: 
WL - Word Length, total number of bits used to store the fixed-point number. 
The WL of the platform is usually chosen, e.g. 32 bits. 
IWL - Integral Word Length, total number of bits assigned to the integral part. 
FWL - Fractional Word Length, total number of bits assigned to the fractional part. 
S - Unsigned or two's complement (signed) 
| S | IWL | . | FWL | 
(the decimal point lies between the IWL and FWL fields) 
The range of the fixed-point numbers supported is: 
$-2^{IWL} \le N < 2^{IWL}$ (Quantization step: $2^{-FWL}$) 
For a 32-bit word, a common choice is: IWL = 15, 1 sign bit, FWL = 16, WL = 32. 
This is referred to as the 15.16 signed notation. 
The minimum number that can be represented is $-2^{15} = -32768.0$ (0x80000000) and 
the maximum number is $2^{15} - 2^{-16} \approx 32767.99998$ (0x7FFFFFFF). 
The range of fixed-point is quite limited when compared to the floating point 
representation, which uses an exponent and a mantissa in a 32-bit word to represent 
its numbers. Therefore, overflow and underflow problems are quite serious in 
fixed-point representation. We should take special care when choosing a suitable 
fixed-point notation for our applications. 
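To make the 15.16 signed notation concrete, the following C sketch converts between a double and a 32-bit fixed-point value. It is an illustration under stated assumptions: the names fixed_t, fixed_from_double and fixed_to_double are hypothetical, the conversion truncates toward zero, and no range checking is performed.

```c
#include <stdint.h>

/* Sketch only: type and function names are illustrative. */
typedef int32_t fixed_t;      /* signed 15.16: 1 sign, 15 integral, 16 fractional bits */
#define FWL 16                /* fractional word length */

/* Scale by 2^FWL and truncate; the caller must keep x within [-32768, 32768). */
fixed_t fixed_from_double(double x)
{
    return (fixed_t)(x * (double)(1 << FWL));
}

/* Divide by 2^FWL to recover the real value. */
double fixed_to_double(fixed_t f)
{
    return (double)f / (double)(1 << FWL);
}
```

For example, 1.5 is stored as 0x00018000, the smallest representable step is 1/65536, and the bit patterns 0x80000000 and 0x7FFFFFFF map back to -32768.0 and just under 32768.0 respectively.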
3.1.2 Fixed-point implementation 
Fixed-point types and operations are absent from most, if not all, definitions of 
modern programming languages. Therefore, the implementation of a fixed-point 
library becomes a necessity when we want to use fixed-point in our programs. 
There are two different implementations of the fixed-point type: 
• Multiple Precision 
As fixed-point representations have narrower number ranges, sometimes it is 
not enough to use only one single fixed-point notation throughout the 
entire application program. For example, some of the variables in a program 
need to use the 3.28 signed notation while others need to use the 15.16 signed 
notation to prevent overflow. Because different notations are used in the 
same program, the operands of arithmetic between two different fixed-point 
notations need to be pre-aligned to the same decimal point before the actual 
operations are performed. Mostly, the operand with the smaller IWL will be 
right-shifted to align with the other operand. In order to determine the IWL for 
each of the variables used in the program, we need to find the absolute 
maximum of each variable by running numerous test cases for the program. 
The conversion of floating point to fixed-point is so labor intensive that 
automatic simulation and conversion tools become necessary. A DSP case 
study can be found in [Kum 2000]. 
• Single Precision 
Single precision is used because of its simplicity. It uses the same notation 
throughout the whole program. Since we do not have to align the decimal 
point before performing arithmetic, the single precision 
implementation is faster and simpler than the multiple precision 
implementation. This technique is commonly found in game programming 
and MPEG audio (MP3) decoding [MAD 2003]. 
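The pre-alignment step required by the multiple precision approach can be sketched as follows. This is an illustration only: the function name add_q3_28_to_q15_16 is hypothetical, and it assumes that the compiler implements >> on negative signed integers as an arithmetic shift, which is common but implementation-defined in C.

```c
#include <stdint.h>

/* Sketch only: the function name is illustrative. */

/* Adds a signed 3.28 value to a signed 15.16 value; the result is in 15.16.
   The 3.28 operand has the smaller IWL, so it is right-shifted by
   28 - 16 = 12 bits to bring its decimal point into the 15.16 position. */
int32_t add_q3_28_to_q15_16(int32_t x_3_28, int32_t y_15_16)
{
    int32_t x_15_16 = x_3_28 >> (28 - 16);   /* drops 12 fractional bits */
    return x_15_16 + y_15_16;
}
```

For instance, 1.5 in 3.28 notation (0x18000000) added to 2.25 in 15.16 notation (0x00024000) yields 3.75 in 15.16 notation (0x0003C000). The extra shift on every mixed operation is the overhead that the single precision approach avoids.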
As we want the fixed-point library to run fast and we need to deal with only 
small integral and fractional parts in our applications, we implemented 
our fixed-point library with the single precision 15.16 signed notation. We used 
the ANSI C language to implement our applications and the fixed-point library. 
C was chosen because of efficiency and compatibility 
considerations. The simplified arithmetic rules are listed in the following table: 
Operation        Implementation
Addition         X + Y
Subtraction      X - Y
Multiplication   (X * Y) >> FWL
Division         (X << FWL) / Y
# X and Y are fixed-point values in the same notation; FWL is 16.
# >> is a signed right shift, << is a signed left shift.
Table 3.2 Fixed-point arithmetic
Addition and subtraction are performed in the same way as integer addition and
subtraction. However, in order to preserve the accuracy of the 15.16 signed
notation, multiplication and division are implemented using a temporary
accumulator of 2*WL bits, i.e. 64 bits. Most modern compilers for 32-bit
targets provide a convenient 64-bit "long long" data type that can serve as the
accumulator, which keeps our implementation platform independent. Assembly
language could also be used to perform the multiplication and division, but it
is inconvenient and makes the resulting code platform dependent. We have used
the C integer type to represent the signed 15.16 fixed-point type and the
64-bit long long type as the accumulator in MUL and DIV. Most importantly, C
macros are used to implement the fixed-point library operations, such as ADD,
SUB, MUL and DIV, since implementations using function calls may require many
stack operations for storing temporary variables. These operations are used so
frequently that their speed is a major concern to us. To compare the efficiency
of the two implementation methods (macros and function calls), we conducted a
series of speed tests. Each of the following tests ran 10 million times. The
results were recorded as follows:
                              StrongARM SA1110    Intel Pentium 4
                              (206 MHz)           (1.4 GHz)
Fixed ADD (macro)             0.58 s              0.04 s
Fixed ADD (function call)     2.86 s              0.22 s
Fixed DIV (macro)             [illegible]         [illegible]
Fixed DIV (function call)     15.952 s            [illegible]
Table 3.3 Speed Comparison of MACRO and Function Call
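As an illustration, the macro-based operations of Table 3.2 could be sketched in C as follows. This is a minimal sketch under stated assumptions, not a listing of the library used in the experiments: the names fixed_t, FX_ONE, FX_ADD, FX_SUB, FX_MUL, FX_DIV and fx_from_int are hypothetical, int64_t stands in for the "long long" accumulator, and the right shift of a negative value is assumed to be an arithmetic shift, which is implementation-defined in C but is the behaviour of common compilers.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_t;               /* signed 15.16 value */

#define FWL     16                     /* fractional word length */
#define FX_ONE  ((fixed_t)1 << FWL)    /* 1.0 in 15.16 */

#define FX_ADD(x, y)  ((x) + (y))
#define FX_SUB(x, y)  ((x) - (y))
/* Widen to 64 bits (2*WL) so no bits of the intermediate product are lost. */
#define FX_MUL(x, y)  ((fixed_t)(((int64_t)(x) * (int64_t)(y)) >> FWL))
/* Scaling by FX_ONE is the "X << FWL" of Table 3.2, written as a multiply
   so that a negative dividend never has to be left-shifted. */
#define FX_DIV(x, y)  ((fixed_t)(((int64_t)(x) * FX_ONE) / (y)))

/* Convert a small integer to 15.16; the range check guards against overflow. */
static fixed_t fx_from_int(int n)
{
    assert(n > -32768 && n < 32768);
    return (fixed_t)(n * FX_ONE);
}
```

For example, FX_DIV(fx_from_int(3), fx_from_int(2)) yields 98304, which is 1.5 in the 15.16 notation. Because the operations are macros, their arguments are evaluated inline without any call overhead, but arguments with side effects should be avoided.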
3.1.3 Mathematical functions implementation 
Mathematical functions frequently used in calculations are implemented as
function calls to mathematical libraries in high level languages. Since these
libraries use floating point arithmetic in their implementations, we have to
implement our own fixed-point mathematical functions instead. The
mathematical functions can be implemented either with table lookup methods or
with other mathematical approximations, such as the Taylor series and the Padé
rational approximation [Abramowitz 1972]. In our fixed-point library, we have
employed the table lookup method for speed. We implemented only the
mathematical functions needed in our fingerprint minutiae extraction case
study: the arc tangent, sine and cosine functions.
Sine and Cosine implementation 
To facilitate bit masking and wrap-around, it is better to choose the number
of table entries as a power of two. Hence, we divide a full circle into 256
divisions, so that each division equals about 1.4 degrees (360/256) in our
application. The maximum absolute error when using 256 divisions is about
0.02455. With the aid of a PC equipped with a floating point unit, we generate
the lookup table for the trigonometric functions by converting all the values
to fixed-point representation. The storage of 256 entries requires
256 * 4 bytes (15.16 signed) = 1024 bytes. Moreover, we use a single cosine
lookup table for both the sine and cosine functions, avoiding duplication by
means of the identity sin(θ) = -cos(90° + θ).
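The table lookup described above could be sketched in C as follows. The sketch is illustrative only and rests on several assumptions: the names cos_table, fx_trig_init, fx_cos and fx_sin are hypothetical, angles are expressed in table units (256 units = 360 degrees), and the table is filled by a short Taylor series so that the example needs no math library, whereas on a host PC one would normally call cos() from <math.h>. The mask relies on two's complement integers so that negative angles also wrap correctly.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_t;          /* signed 15.16 value */
#define FWL     16
#define CIRCLE  256               /* table units per full circle */

static fixed_t cos_table[CIRCLE];

/* Host-side cosine from its Taylor series; 20 terms are ample on [0, 2*pi). */
static double cos_series(double x)
{
    double term = 1.0, sum = 1.0;
    for (int n = 1; n <= 20; n++) {
        term *= -x * x / ((2 * n - 1) * (2 * n));
        sum += term;
    }
    return sum;
}

/* Build the table once: entry k holds cos(k * 360/256 degrees) in 15.16. */
static void fx_trig_init(void)
{
    const double two_pi = 6.283185307179586;
    for (int k = 0; k < CIRCLE; k++) {
        double v = cos_series(two_pi * k / CIRCLE) * (1 << FWL);
        cos_table[k] = (fixed_t)(v >= 0.0 ? v + 0.5 : v - 0.5);  /* round */
    }
}

/* Angles are table units. The mask wraps any index into [0, 255]. */
static fixed_t fx_cos(int angle)
{
    assert(cos_table[0] != 0);    /* fx_trig_init() must have run */
    return cos_table[angle & (CIRCLE - 1)];
}

/* sin(t) = -cos(90 degrees + t); 90 degrees is CIRCLE / 4 = 64 units. */
static fixed_t fx_sin(int angle)
{
    return -fx_cos(angle + CIRCLE / 4);
}
```

After fx_trig_init() has been called once, fx_sin(64) returns 65536 (1.0 in 15.16) and fx_cos(-64) returns the same entry as fx_cos(192). Choosing a power of two for the table size is what allows the wrap-around to be a single AND operation.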
Arc tangent implementation 
Compared with the cosine lookup table, an arc tangent table is more difficult
to implement, because tangent values approach infinity near 90/270 degrees, so
the distribution of the function values is very uneven. Instead, we
implemented the arc tangent function with the help of a tangent lookup table.
There are two arc tangent function prototypes commonly used in programming:
atan(σ) { output range: [-90°, 90°] } and atan2(y, x) { output range:
[-180°, 180°] }.
We implemented the atan(σ) function by searching the tangent lookup table for
the entry whose value is closest to σ. To save resources, we use only the 64
entries lying between 0 and 90 degrees (256/4 = 64); the table size is
64 * 4 = 256 bytes. To handle very large values of σ, we studied the arc
tangent function and found that
\sigma > 57.29 \Rightarrow 89^\circ \le \tan^{-1}(\sigma) \le 90^\circ
[Plot of the arc tangent function: output from -2 to 2 against input ratio from -80 to 80, with the ratio 57.29 marked]
Figure 3.1 Arc Tangent Function Plot
As each entry in our lookup table covers about 1.4 degrees, values between 89
and 90 degrees belong to the same entry in our table. The implementation of
the arc tangent functions can be summarized in the following pseudo code:
atan(σ : fixed-point type){
    s := sign(σ);        /* sign of σ */
    a := abs(σ);         /* absolute value of σ */
    angle := index of the closest value to a in the tangent
             lookup table;
    return s * angle;
}

atan2(y : integer, x : integer){
    fixed-point type temp := (y << 16) / x;
                         /* calculate the ratio */
    angle := atan(temp);
    if (x is negative){
        if (angle < 0){
            angle := angle + 128;
            /* index 128 is equal to 180 degrees */
        }else{
            angle := angle - 128;
        }
    }
    return angle;
}
atan2(y, x) uses atan(σ) internally. We compute the value of σ as y/x. Then
we convert the value returned from atan(σ), based on the signs of the
arguments y and x, according to the following table:
sign(y)/sign(x)    +/+       +/-         -/-           -/+
Range (degrees)    [0,90]    [90,180]    [-180,-90]    [-90,0]
Table 3.4 Input Signs and Ranges relationship of atan2.
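The two arc tangent functions could be sketched in C roughly as below. Everything specific in this sketch is an assumption made for illustration: the names tan_table, fx_atan_init, fx_atan and fx_atan2, the use of a binary search to find the closest table entry, the return of 64 units (90 degrees) once the ratio exceeds 57.29, and the handling of x = 0, which the pseudo code does not cover. Angles are returned in table units (256 units = 360 degrees), and the table is generated with a Taylor series only to keep the example free of library dependencies.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t fixed_t;             /* signed 15.16 value */
#define FWL      16
#define QUARTER  64                  /* table units in 90 degrees */
#define HALF     128                 /* table units in 180 degrees */
#define SATURATE ((fixed_t)(57.29 * (1 << FWL)))  /* tan(89 degrees) */

static fixed_t tan_table[QUARTER];   /* tan(k * 360/256 degrees), k = 0..63 */

/* Host-side tan(x) for 0 <= x < pi/2 from the sine and cosine Taylor series. */
static double tan_series(double x)
{
    double s = x, c = 1.0, ts = x, tc = 1.0;
    for (int n = 1; n <= 12; n++) {
        tc *= -x * x / ((2 * n - 1) * (2 * n));
        ts *= -x * x / ((2 * n) * (2 * n + 1));
        c += tc;
        s += ts;
    }
    return s / c;
}

static void fx_atan_init(void)
{
    const double unit = 6.283185307179586 / 256.0;   /* radians per unit */
    for (int k = 0; k < QUARTER; k++)
        tan_table[k] = (fixed_t)(tan_series(k * unit) * (1 << FWL) + 0.5);
}

/* Arc tangent of a 15.16 ratio, returned in table units within [-64, 64]. */
static int fx_atan(fixed_t ratio)
{
    int sign = (ratio < 0) ? -1 : 1;
    int64_t a = (ratio < 0) ? -(int64_t)ratio : (int64_t)ratio;
    int lo = 0, hi = QUARTER - 1;

    assert(tan_table[1] > 0);        /* fx_atan_init() must have run */
    if (a > SATURATE)
        return sign * QUARTER;       /* beyond 89 degrees: treat as 90 */
    while (hi - lo > 1) {            /* binary search: table is increasing */
        int mid = (lo + hi) / 2;
        if (tan_table[mid] <= a) lo = mid; else hi = mid;
    }
    /* a lies between entries lo and hi (or beyond hi): pick the nearer one */
    return sign * ((a - tan_table[lo] <= tan_table[hi] - a) ? lo : hi);
}

/* Four-quadrant arc tangent of integer y and x, in table units. */
static int fx_atan2(int32_t y, int32_t x)
{
    if (x == 0)                      /* not covered by the pseudo code */
        return (y > 0) ? QUARTER : (y < 0) ? -QUARTER : 0;
    int64_t ratio = ((int64_t)y * (1 << FWL)) / x;
    if (ratio > INT32_MAX) ratio = INT32_MAX;     /* clamp before narrowing */
    if (ratio < -INT32_MAX) ratio = -INT32_MAX;
    int angle = fx_atan((fixed_t)ratio);
    if (x < 0)
        angle += (angle < 0) ? HALF : -HALF;      /* quadrant correction */
    return angle;
}
```

With this sketch, fx_atan2(1, -1) returns 96 units (135 degrees) and fx_atan2(-1, -1) returns -96 units (-135 degrees), in agreement with Table 3.4. Forming the ratio in a 64-bit accumulator and clamping it keeps very large y/x values from overflowing the 15.16 range.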
3.2 Case Study: Fingerprint Minutiae Extraction Algorithms on 
the StrongARM platform 
With the tremendous development of mobile embedded systems, such as PDAs and
mobile phones, fingerprint verification technology is migrating from the
desktop computer environment to mobile e-commerce platforms. A major obstacle
to porting this technology to mobile devices is the limited processing power
of the embedded microprocessors inside the devices. In particular, these
microprocessors have no floating point hardware. In this work, we used
fingerprint minutiae extraction as a case study to show how fixed-point
arithmetic can enhance performance on a microprocessor based embedded system.
The StrongARM platform is a common standard platform for PDAs nowadays.
There are several ways to support floating point computation on
StrongARM/Linux:
• Floating Point Coprocessor
A floating point unit is available for the StrongARM processor core, but
it is not employed in the StrongARM platform because of its extra cost and
power consumption.
• Floating Point Software Library
The floating point software library is a software implementation of the
IEEE 754 floating point standard that can be linked statically with a
program. It is not used in the standard Linux development toolchain of
StrongARM/Linux because it would increase the executable size.
• Floating Point Unit Emulation
StrongARM/Linux [Bambrough 1999] adds a floating point unit emulator to
its kernel, which enables it to run floating point coprocessor machine
code. This is the standard way of supporting floating point on the
StrongARM/Linux platform. We have used this floating point unit
emulation as a reference for speed comparison with the fixed-point
implementation of our fingerprint minutiae extraction algorithms.
We first give an overview of the fingerprint verification concept and discuss
the corresponding algorithms, and then focus on the fixed-point
implementation of the fingerprint minutiae extraction algorithm. In
particular, we demonstrate through experiments that the fixed-point
implementation achieves both speed and reliability.
3.2.1 Fingerprint Verification Overview 
Fingerprint images, captured by different kinds of sensors, are made up of
gray-scale ridgelines. A fingerprint is characterized by its minutiae points
obtained from the ridges. The most common minutiae are (i) bifurcation, where
a ridge branches into two different directions, and (ii) termination, where a
ridge ends, as shown in Figure 3.2.
Many researchers have employed binarization and thinning algorithms
[Jain 1997][Wahab 1998] as the basic step of fingerprint minutiae extraction.
Nonetheless, Maio and Maltoni presented the direct gray-scale minutiae
detection algorithm [Maio 1997] and showed that it is more efficient and
accurate. In this case study, we focus on how to optimize this direct
gray-scale algorithm on our embedded system platform without the support of a
floating point unit.
Figure 3.2 Bifurcation and Termination
A fingerprint verification system can be divided into 4 stages: 
1. Image segmentation 
2. Core point detection 
3. Minutia extraction 
4. Minutia matching 
Image Segmentation 
Before we extract minutiae points, we first segment the fingerprint image to
find the regions of interest (ridges) by considering the orientation of areas
of size W x W. The orientation of the ridges in each grid is given by
equation 3.1 [Hong 1998]. Before computing the orientation θ, we apply the
Sobel filter to each pixel (u,v) of the fingerprint image in order to extract
the gradients in the X and Y directions. The gradients are stored as gx(u,v)
and gy(u,v) respectively.
V_x(i,j) = \sum_{u=i-W/2}^{i+W/2} \sum_{v=j-W/2}^{j+W/2} 2 g_x(u,v) g_y(u,v)
V_y(i,j) = \sum_{u=i-W/2}^{i+W/2} \sum_{v=j-W/2}^{j+W/2} ( g_x^2(u,v) - g_y^2(u,v) )    (3.1)
\theta(i,j) = \frac{1}{2} \tan^{-1}( V_x(i,j) / V_y(i,j) )
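The two block sums used by the orientation estimate, V_x as the sum of 2 gx gy and V_y as the sum of gx^2 - gy^2 over the block, need only integer arithmetic. A minimal C sketch is given below for illustration; the function name orientation_sums, the row-major int16_t gradient buffers and the stride parameter are assumptions of the sketch. For Sobel gradients of an 8-bit image (magnitude at most 1020) and a block of up to 16 x 16 pixels, the sums stay within a signed 32-bit integer.

```c
#include <assert.h>
#include <stdint.h>

/* Accumulate V_x and V_y over the block centred on pixel (i, j), covering
   u = i - w/2 .. i + w/2 and v = j - w/2 .. j + w/2.
   gx and gy are row-major gradient images with `stride` pixels per row. */
static void orientation_sums(const int16_t *gx, const int16_t *gy, int stride,
                             int i, int j, int w,
                             int32_t *vx, int32_t *vy)
{
    int32_t sum_x = 0, sum_y = 0;

    assert(i - w / 2 >= 0 && j - w / 2 >= 0);
    assert(i + w / 2 < stride && j + w / 2 < stride);  /* square image assumed */
    for (int u = i - w / 2; u <= i + w / 2; u++) {
        for (int v = j - w / 2; v <= j + w / 2; v++) {
            int32_t dx = gx[u * stride + v];
            int32_t dy = gy[u * stride + v];
            sum_x += 2 * dx * dy;          /* term of V_x */
            sum_y += dx * dx - dy * dy;    /* term of V_y */
        }
    }
    *vx = sum_x;
    *vy = sum_y;
}
```

The resulting integer pair can then be passed directly to a fixed-point atan2 routine, so no floating point value is needed at any stage of the orientation computation.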
Then we compute the certainty level of the orientation field of the area
with equation 3.2. If the certainty level is below a certain threshold T, the
pixel is marked as background; otherwise, it is included in our region of
interest [Maio 1997].
CL(i,j) = \frac{ \sqrt{ V_x(i,j)^2 + V_y(i,j)^2 } }{ V_e(i,j) }    (3.2)
where
V_e(i,j) = \sum_{u=i-W/2}^{i+W/2} \sum_{v=j-W/2}^{j+W/2} ( g_x^2(u,v) + g_y^2(u,v) )
Core point detection 
The core point is defined as the point where the maximum curvature of the
ridges is found [Jain 2000], or as the topmost point on the innermost ridge
[Zhang 2002] of a fingerprint. Combining the two definitions, a fingerprint
image can contain at most one such point. The core point is extremely
important: it can be used to classify the fingerprint image and acts as a
reference point for minutiae matching.
We first find the curvature of the ridges by using the sine component of the
orientation θ of each pixel (i,j). The sine component of each pixel is
computed from the orientation θ at that pixel with equation 3.3:
\varepsilon(i,j) = \sin( \theta(i,j) )    (3.3)
The topmost point on the innermost ridge has a minimal sine component, since
the ridge lies horizontally there, while the sine components on both of its
sides must be maximal, since the ridge becomes almost vertical. Therefore, we
locate the core point by searching the sine component map for a point that
satisfies these criteria. If the sine component of a point is less than a
threshold and both sides are almost vertical, θ(i-1, j) > π/2 and
θ(i+1, j) < π/2, we compute the difference D between the two regions R_I and
R_II, defined by the circular mask of radius r and threshold angle α shown in
Figure 3.3, from equation 3.4.
D = \sum_{R_I} \varepsilon(i,j) - \sum_{R_{II}} \varepsilon(i,j)    (3.4)
Figure 3.3 Circular mask with regions R_I and R_II
The difference D calculated with the help of the circular mask represents the
change of curvature in the concave ridges. We calculate D for the sine
component of each pixel and locate the pixel with the maximum value of D. The
maximum of D marks the sharpest change in curvature, which is the likely
location of the core point.
The orientation of the core point (coreX, coreY) is obtained by analyzing the
orientation of its surrounding points. The procedure is described below:
1. For each row of the smoothed orientation map, find the X and Y coordinates
where the orientation value is minimum and nearest to the core point.
2. Compute the averages of the X and Y coordinates of all the points found in
step 1.
3. Calculate the orientation of the core point from AvgX and AvgY obtained
in step 2, where AvgX is the average of the X coordinates and AvgY is the
average of the Y coordinates. The orientation is obtained as:
core\_Orient = \arctan\left( \frac{-AvgY}{AvgX} \right)
Minutia extraction 
The fingerprint image is a gray level array, and the ridges are the pixels at
the local maxima of the cross sections of the ridgeline taken orthogonal to
its direction [Maio 1997]. Based on this concept, we can follow a ridge by
tracing the local maxima, as illustrated in Figure 3.4 [Maio 1997]:
Figure 3.4 Ridge Line Tracing Illustration
The algorithm in [Maio 1997] can be summarized in the following pseudo code:
ridge_line_tracing(i_s, j_s, φ_0) {
    end := false;
    (i_c, j_c) := (i_s, j_s);
    φ_c := φ_0;
    while (¬end) {
        (i_t, j_t) := (i_c, j_c) + μ pixels along direction φ_c;
        Ω := section set centred in (i_t, j_t)
             with direction φ_c + π/2 and length 2σ+1;
        (i_n, j_n) := local maximum over Ω;
        store(i_n, j_n);
        end := check stop criteria on (i_c, j_c), (i_t, j_t), (i_n, j_n);
        (i_c, j_c) := (i_n, j_n);
        φ_c := tangent direction in (i_c, j_c);
    }
}
Each time the ridge line tracing routine executes, we start from a new tracing
point (i_s, j_s) with a tracing direction equal to the orientation of the grid
that contains the tracing point. Then we move forward μ pixels along that
direction. Afterwards, we take a cross section Ω that extends σ pixels to each
side along the direction normal to the current orientation. The starting point
(i_start, j_start) and the ending point (i_end, j_end) of the cross section
are calculated with the following equations:
(i_{start}, j_{start}) = ( round(i_t - \sigma \cos(\theta + \pi/2)), round(j_t - \sigma \sin(\theta + \pi/2)) )
(i_{end}, j_{end}) = ( round(i_t + \sigma \cos(\theta + \pi/2)), round(j_t + \sigma \sin(\theta + \pi/2)) )    (3.5)
From the cross section, we find the local maximum and use the point at the
local maximum as the next tracing point. To stop the ridge tracing procedure,
we check the following stopping criteria at each iteration:
• If there is no local maximum, the point is likely to be a termination minutia.
• If the next tracing point is outside the region of interest, we can omit it.
• If the next tracing point has been traced previously, it is likely to be a
bifurcation minutia.
• If the ridge direction changes dramatically, it is likely to be a tracing
error and the procedure should be stopped.
During the ridge line tracing procedure, all minutia information (x and y
coordinates and the minutia orientation) is recorded.
Minutia matching 
Using the location and orientation of the core point obtained previously, we
convert all the minutiae points into polar coordinates (r_i, θ_i) with respect
to the core point, which acts as the origin:
r_i = \sqrt{ (x_i - core_x)^2 + (y_i - core_y)^2 }
\theta_i = \tan^{-1}\left( \frac{ y_i - core_y }{ x_i - core_x } \right) - core_{orient}
where (x_i, y_i) are the cartesian coordinates of minutia i,
(r_i, θ_i) are the polar coordinates of minutia i,
(core_x, core_y) are the cartesian coordinates of the reference point, and
core_orient is the reference point orientation.
The polar coordinates are then ready for matching with the set of polar
coordinates extracted from the live fingerprint. The matching score of two
fingerprints is obtained by equation 3.6. The value of a matching score is
between 0 and 100. We can later analyze our fingerprint software with a large
number of fingerprints to determine the optimal matching score threshold.
matching score = \frac{ \text{number of matches} }{ \text{maximum number of minutiae among the two fingerprints} } \times 100    (3.6)
3.2.2 Fixed-point Implementation of Fingerprint Minutiae Extraction 
Algorithm 
After investigating the fingerprint minutiae extraction algorithm carefully,
we conclude that only three major parts need to be implemented with
fixed-point arithmetic, while integer computation can be used in the other
parts. The sine, cosine and arc tangent functions are implemented with lookup
tables as described before. In the ridge line tracing algorithm, we have used
fixed-point arithmetic in the steps shown below:
1. Calculating the orientation of the ridge
\theta(i,j) = \frac{1}{2} \tan^{-1}( V_x(i,j) / V_y(i,j) )
where Vx and Vy are sums over terms in gx(u,v) and gy(u,v). Their values can
become extremely large. Thus, we use 32-bit integers to store Vx and Vy and
apply the fixed-point atan2(Vx, Vy) so that the corresponding angles can be
obtained efficiently.
2. Detecting the core point
As mentioned before, the sine function is used in the core point detection
process. The θ calculated in step 1 becomes the argument of the sine function,
so that the sine component of each grid can be obtained.
3. Finding the new trace point and the cross section coordinates
i_t = round( i_c + \mu \cos(\theta) )
j_t = round( j_c - \mu \sin(\theta) )
where (i_t, j_t) is the new tracing point, μ is the step length and
(i_c, j_c) is the previous trace point.
Rounding is necessary because the coordinates (i_t, j_t) must be integral
values. The dimension of our fingerprint images is 256x256 (i.e. i_t and
j_t < 256). As we have to round back to integers, we can analyze the FWL as
follows:
\mu \cdot 2^{-(FWL+1)} < 0.05
FWL > -\log_2( 0.05 / \mu ) - 1
for \mu = 3, FWL > 4.90
Similarly, we can calculate the starting and ending coordinates of the cross
section:
(i_{start}, j_{start}) = ( round(i_t - \sigma \cos(\theta + \pi/2)), round(j_t - \sigma \sin(\theta + \pi/2)) )
(i_{end}, j_{end}) = ( round(i_t + \sigma \cos(\theta + \pi/2)), round(j_t + \sigma \sin(\theta + \pi/2)) )
where (i_t, j_t) is the new trace point,
(i_start, j_start) are the starting coordinates,
(i_end, j_end) are the ending coordinates and
σ is the cross section length.
Hence,
\sigma \cdot 2^{-(FWL+1)} < 0.05
FWL > -\log_2( 0.05 / \sigma ) - 1
for \sigma = 4, FWL > 5.32
From the above analysis, FWL should be larger than 5.32, i.e. FWL = 6.
Moreover, as the range of the angle θ is [-255,255] and the dimension of the
fingerprint images used is 256x256, IWL = 8 is required. Therefore, we propose
8.6 signed as the minimum fixed-point notation. We look for the minimum
fixed-point notation because we want to confirm that the 15.16 signed notation
is more than enough for the fingerprint minutiae extraction algorithm. Related
experiments are shown in a later section.
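The FWL bounds quoted above can be checked numerically. The C sketch below assumes that the bound has the form scale * 2^-(FWL+1) <= 0.05, i.e. that the half-LSB quantization error of a table value, multiplied by the step length μ or the cross section length σ, must stay within 0.05; this reading reproduces the values 4.90 and 5.32 but is an assumption of the sketch, as is the function name min_fwl.

```c
#include <assert.h>

/* Smallest integer FWL such that scale * 2^-(FWL+1) <= max_err. */
static int min_fwl(double scale, double max_err)
{
    int fwl = 0;
    double half_lsb = 0.5;             /* 2^-(FWL+1) for FWL = 0 */

    assert(scale > 0.0 && max_err > 0.0);
    while (scale * half_lsb > max_err) {
        half_lsb *= 0.5;               /* one more fractional bit */
        fwl++;
    }
    return fwl;
}
```

Under this assumption, min_fwl(3.0, 0.05) gives 5 and min_fwl(4.0, 0.05) gives 6, which are the bounds 4.90 and 5.32 rounded up to whole bits; the larger of the two determines the fractional part of the 8.6 notation.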
3.2.3 Experimental Results 
1. Accuracy 
To measure the accuracy of our embedded fingerprint verification system,
we employed a SecuGen optical fingerprint sensor to build up a
fingerprint database. The database originates from 383 different fingers.
Each individual enrolled the same finger 3 times, giving a database of
1149 fingerprint images.
False Acceptance Rate (FAR) and False Rejection Rate (FRR) are
commonly used to determine the accuracy. FAR is the probability that a false
match occurs, i.e. an unauthorized user is accepted by the system. FRR is the
probability that an authorized user is rejected by the system. To evaluate the
accuracy of our algorithm, we matched each individual fingerprint with those of
the other individuals (imposters) and formed a FAR curve from the results. In
addition, we calculated the FRR curve by matching each fingerprint with the
other two fingerprints of the same individual (genuine). The FAR and FRR
curves are the normalized cumulative distributions of the matching score
[0, 100]. Here we use FAR and FRR together to determine whether our
fixed-point implementation can achieve the same accuracy as the floating point
implementation. In the figures, the vertical axis is the frequency normalized
by the total number of matchings and the horizontal axis is the matching
score.
Figure 3.5 FAR comparison between float and fixed-point version
(curves: FAR(float), FAR(Fixed))
Figure 3.6 FRR comparison between float and fixed-point version
(curves: FRR(float), FRR(Fixed))
The FARs of the floating point and fixed-point versions are very similar.
For the FRR comparison, the curve of the fixed-point version is slightly
higher, which means that its FRR increases a bit faster than that of the
floating point version. Nevertheless, the overall accuracy of the fingerprint
verification process is not affected. To measure the overall accuracy of the
system, we use the Equal Error Rate (EER), obtained by overlaying the FAR curve
on the FRR curve and finding their intersection point, i.e. the point at which
FAR and FRR are equal. In our experiment, the EERs of both versions are quite
similar: the EER of the floating point version is 0.0594 whereas that of the
fixed-point version is 0.0601. This performance is comparable with other
fingerprint verification systems, which generally have an EER of 5 - 7%
[Maio 2002].
In addition, we ran several FAR/FRR tests with different fixed-point
parameters to investigate their impact on the EER. The results are shown in
Figure 3.7. The resulting FARs were essentially the same in all cases, so we
show only the floating point FAR curve. From the figure, we found that the EER
starts to rise when the FWL is less than 6. Based on this EER analysis, we
conclude that the 8.6 signed notation is accurate enough for our fingerprint
minutiae extraction process. This observation agrees with what we deduced from
the mathematical analysis shown previously. For flexibility and convenience in
the 32-bit word environment, we have used the 15.16 signed notation throughout
our implementation.
[Plot against matching score with curves labelled FAR(float), FLOAT, 8.4, 8.5 and 8.6, accompanied by a table of EER for each fixed-point notation]
Figure 3.7 Different fixed-point notations comparison
2. Speed Test
We used 20 different fingerprint images, each a 256x256 pixel gray-scale
image. Each sample was tested 10 times, and the average computational time
is shown below.
Average Computational Time:
                  StrongARM SA1110    ZFx86 with FPU
                  (206 MHz)           (133 MHz)
Fixed-point       0.90 s              2.84 s
Floating point    21.0 s              2.85 s
Table 3.5 Speed comparison of fixed-point and floating point implementation
The two tests above illustrate the difference between running our
implementation on a system with and without a Floating Point Unit (FPU).
On the StrongARM SA1110, a system without an FPU, the use of fixed-point
arithmetic is a great success: the fixed-point version takes 0.90 s against
21.0 s for the floating point version, a speed-up of about 23 times. The
result also shows that the floating point unit emulation in the Linux kernel
is unacceptably slow, so fixed-point arithmetic is essential.
The ZFx86 is an embedded platform equipped with an x86 compatible
microprocessor and an FPU for floating point calculations. The result of our
test on the ZFx86 shows that the performance of the two versions is nearly the
same when an FPU is present.
3.3 Conclusion 
In this chapter, we have studied how fixed-point arithmetic can be used
effectively in an embedded system without a floating point unit. We have
discussed the implementation issues of the fixed-point library, including the
use of a single precision fixed-point data type and the associated
mathematical functions. One of the disadvantages of fixed-point is its narrow
dynamic range. As a result, we have to verify fixed-point software through
experiments or mathematical analysis so as to avoid overflow and underflow
problems. To give a complete picture of the use of fixed-point, we conducted a
case study on a StrongARM platform that shows the entire float-to-fixed
process.
We chose the fingerprint verification algorithms as the case study because of
their complexity and the growing popularity of biometrics in e-commerce. We
compared the performance and the accuracy of the floating point and
fixed-point implementations of the fingerprint minutiae extraction algorithms.
Our fixed-point implementation achieves both speed and reliability on the
embedded system, while the floating point implementation is unacceptably
slow.
Chapter 4 Domain Specific Optimization 57 
CHAPTER 4 
4 Domain Specific Optimization 
Different implementations of the same task can result in different
performance. The rationale behind domain specific optimization is to optimize
the software with specific domain knowledge. In many cases, we can boost the
speed by using very efficient implementations at the cost of a certain
degradation in accuracy.
We used a font rasterization engine, which rasterizes an outline font to
bitmap data, as the case study to illustrate the optimization process. In this
case study, we implemented two different font rasterization engines to show
how the performance and accuracy are affected. We first give an overview of
the outline font and the font rasterization process. Then, we compare the
performance of the two implementations of the font rasterization engine on the
StrongARM platform.
4.1 Case Study: Font Rasterization on the StrongARM platform 
4.1.1 Outline Font 
A character is described by its outline. The outline is made up of straight
lines and curves. In particular, the TrueType font format developed by Apple
Computer Inc. [Apple 2003] and Microsoft Corporation [Microsoft 2003] uses
quadratic Bezier curves in the outline description. A TrueType font character
and its control points are shown in Figure 4.1.
Figure 4.1 A TrueType font character outline and its control points.
[Apple 2003]
The straight lines of the outline fonts are described by two consecutive
control points (Ax, Ay) and (Bx, By) [Poon 93]. A point (X, Y) between
(Ax, Ay) and (Bx, By) is defined by
X = A_x + t (B_x - A_x)
Y = A_y + t (B_y - A_y),    where 0 \le t \le 1
The quadratic Bezier curve is defined by three control points (Ax, Ay),
(Bx, By) and (Cx, Cy) [Poon 93]. A point (X, Y) on the curve between (Ax, Ay)
and (Cx, Cy) is defined by
X = A_x (1-t)^2 + 2 B_x t (1-t) + C_x t^2
Y = A_y (1-t)^2 + 2 B_y t (1-t) + C_y t^2
The quadratic Bezier curve is illustrated in Figure 4.2.
Figure 4.2 Quadratic Bezier curve with three control points (A, B and C).
4.1.2 Font Rasterization 
Font rasterization refers to the process of converting an outline font
character to bitmap data suitable for output on a raster device such as a
monitor or printer. The rasterization process is very time consuming,
especially the calculations involved in rasterizing the quadratic Bezier
curves. The situation becomes even more severe when the outline of a character
is described by many curves [Moon and Cheang 91]. The font rasterization
algorithm, which is also called scan conversion, is described in the
following:
Scan Conversion 
Scan conversion is the algorithm for converting an outline character to a
bitmap. The algorithm is summarized below [Poon 93]:
Y scan-line first.
For scan-line = 0 to bitmap size - 1 do
    Calculate all the intersection points (X) of this scan-line
    with the font outline.
    Sort all the intersection points (X) with respect to the value
    of X.
    For each pair of X_i, X_i+1 do
        Determine which pixels should be turned ON.
    End
End
Repeat with X scan-line.
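The inner part of the scan conversion loop, sorting the intersections of one scan-line and turning on the pixels between successive pairs, might look as follows in C. This is only a sketch: the names cmp_double and fill_scanline are hypothetical, the intersections are assumed to be already computed and stored as doubles, and the rule that a pixel is turned ON when its centre lies between the two intersections of a pair is an assumption, since the pixel selection rule is not specified above.

```c
#include <assert.h>
#include <stdlib.h>

/* Comparison callback for qsort: ascending order of X. */
static int cmp_double(const void *a, const void *b)
{
    double da = *(const double *)a, db = *(const double *)b;
    return (da > db) - (da < db);
}

/* Turn on the pixels of one bitmap row between the intersection pairs
   (X1,X2), (X3,X4), ...  The array xs is sorted in place. */
static void fill_scanline(unsigned char *row, int width, double *xs, int n)
{
    assert(n % 2 == 0);    /* a closed outline is crossed an even number of times */
    qsort(xs, (size_t)n, sizeof xs[0], cmp_double);
    for (int k = 0; k + 1 < n; k += 2) {
        int first = (int)(xs[k] + 0.5);      /* first pixel centre >= X_k   */
        int last = (int)(xs[k + 1] + 0.5);   /* first pixel centre >= X_k+1 */
        if (first < 0) first = 0;
        if (last > width) last = width;
        for (int x = first; x < last; x++)
            row[x] = 1;
    }
}
```

For instance, with the intersections 6.7, 1.2, 4.3 and 5.6 on a row of 8 pixels, the sketch turns on pixels 1 to 3 and pixel 6. The sort is needed because the outline segments are visited in outline order, not in order of X.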
Figure 4.3 Converting outline font character to bitmap data.
During the scan conversion, we have to find the intersection points of all the
curves and lines with the scan line, which is parallel to the x- or y-axis.
This is a time consuming task because it involves many calculations such as
divisions, multiplications and square roots. A scan conversion is illustrated
in Figure 4.3, where X1, X2, X3 and X4 are the intersection points. The
calculations of the intersection points are discussed below:
Intersection Point Calculations [Poon 93]
The outline of a font character is described by straight lines and curves. The
intersection point of a straight line and a scan-line can be obtained as
follows:
(A_x, A_y), (B_x, B_y) - Control points of the straight line
Y = S_y - Y scan-line
(X, Y) - Intersection point
X = A_x + (S_y - A_y) (B_x - A_x) / (B_y - A_y)
Similarly, by using X = Sx, we can calculate the intersection point for an X
scan-line.
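The straight line case can be written in a few lines of C. The sketch below is illustrative: the Point type and the function name line_scan_intersect are hypothetical, and double precision coordinates are used for clarity rather than to reflect the arithmetic of the actual engine. A segment parallel to the scan-line is reported as having no intersection, which is a choice made for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { double x, y; } Point;

/* Intersection of the segment A-B with the horizontal scan-line Y = sy.
   Returns true and stores the X coordinate when the scan-line crosses
   the segment; returns false otherwise. */
static bool line_scan_intersect(Point a, Point b, double sy, double *x_out)
{
    assert(x_out != NULL);
    if (a.y == b.y)
        return false;                       /* parallel to the scan-line */
    double t = (sy - a.y) / (b.y - a.y);    /* parameter along A -> B */
    if (t < 0.0 || t > 1.0)
        return false;                       /* crossing lies outside the segment */
    *x_out = a.x + t * (b.x - a.x);
    return true;
}
```

Only one division and one multiplication are needed per segment, which is why replacing curves by straight lines pays off in the rasterization speed.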
Finding the intersection point between a quadratic Bezier curve and the
scan-line is more complex and requires more calculations than for a straight
line. The intersection points can be obtained as follows:
(A_x, A_y), (B_x, B_y), (C_x, C_y) - Control points of the quadratic Bezier curve
Y = S_y - Y scan-line
(X, Y) - Intersection point
Solving a t^2 + b t + c = 0 for t,
t = \frac{ -b \pm \sqrt{ b^2 - 4ac } }{ 2a },  where a = A_y - 2B_y + C_y,  b = 2(B_y - A_y),  c = A_y - S_y
if 0 \le t \le 1,  X = A_x (1-t)^2 + 2 B_x (1-t) t + C_x t^2
Similarly, by using X = Sx, we can calculate the intersection point for an X
scan-line.
Getting the intersection point of a quadratic Bezier curve with a scan line
involves solving a quadratic equation, which requires multiplication, division
and a square root. The presence of many curves in an outline font character
therefore leads to slow rasterization, and reducing the complexity of the
calculation for rasterizing a quadratic Bezier curve is important for the
overall rasterization speed. One commonly used method is to replace each
quadratic Bezier curve by a set of straight lines in the definition of the
outline font character.
Approximation of quadratic Bezier curve [Angel 2000]
The approximation is made by joining the first and third control points of a
quadratic Bezier curve with a straight line. However, we need to check whether
the curve can be approximated by one straight line with acceptable error. If
not, we subdivide the curve recursively and approximate it with more straight
lines until the error becomes acceptable. The algorithm is shown in Figure 4.4.
q1 = (p1 + p2) / 2
q2 = (p2 + p3) / 2
q3 = (q1 + q2) / 2
where p1, p2, p3 are the control points of the Bezier curve.
Figure 4.4 Approximation of quadratic Bezier curve.
In each step, we first check the error N between the quadratic Bezier curve
and the straight line (p1-p3). The error of the approximation is taken as the
distance between q3 and p2. In order to save computational time, we consider
only the absolute coordinate differences between q3 and p2. Let Nx be the
difference of the x coordinates and Ny the difference of the y coordinates. By
choosing a suitable threshold value T,
if Nx > T and Ny > T
    Subdivide the curve and repeat with the two new curves:
    Left side (p1, q1, q3)
    Right side (q3, q2, p3)
else
    Store the straight line (p1, p3)
The computation of the new points (q1, q2, q3) is very easy, involving only two
additions and two divisions. A smaller value of T leads to more subdivisions and hence
more straight lines. As a result, the approximated curve becomes
smoother. We have chosen 1 as the value of T in our system.
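The subdivision procedure described above can be sketched in C as follows. This is an illustrative sketch, not the code used in our rasterization engine: the type and function names (`Point`, `SegmentList`, `flatten`), the fixed output capacity and the recursion-depth guard are all assumptions introduced for the example. The guard is added because, with integer midpoints, a threshold of 0 might otherwise never be satisfied.

```c
#include <assert.h>
#include <stdlib.h>

/* All names below are hypothetical; they do not come from the thesis. */
typedef struct { int x, y; } Point;

#define MAX_SEGMENTS 256   /* capacity of the output list (assumption) */
#define MAX_DEPTH    8     /* recursion guard added for this sketch    */

typedef struct {
    Point start[MAX_SEGMENTS];
    Point end[MAX_SEGMENTS];
    int   count;
} SegmentList;

/* Midpoint of two points, using integer arithmetic only. */
static Point midpoint(Point a, Point b)
{
    Point m;
    m.x = (a.x + b.x) / 2;
    m.y = (a.y + b.y) / 2;
    return m;
}

/* Append the straight line a-b; lines beyond the capacity are dropped. */
static void store_line(SegmentList *out, Point a, Point b)
{
    if (out->count < MAX_SEGMENTS) {
        out->start[out->count] = a;
        out->end[out->count] = b;
        out->count++;
    }
}

/* Approximate the curve (p1, p2, p3) by straight lines with threshold T. */
static void flatten(Point p1, Point p2, Point p3, int T, int depth,
                    SegmentList *out)
{
    Point q1 = midpoint(p1, p2);
    Point q2 = midpoint(p2, p3);
    Point q3 = midpoint(q1, q2);
    int nx = abs(q3.x - p2.x);   /* Nx in the text */
    int ny = abs(q3.y - p2.y);   /* Ny in the text */

    assert(out != NULL);
    if (depth < MAX_DEPTH && nx > T && ny > T) {
        flatten(p1, q1, q3, T, depth + 1, out);   /* left side  */
        flatten(q3, q2, p3, T, depth + 1, out);   /* right side */
    } else {
        store_line(out, p1, p3);
    }
}
```

A caller would set `count` to 0 and invoke `flatten(p1, p2, p3, 1, 0, &list)`. The stored lines then form a connected chain from p1 to p3, because each left sub-curve ends at the q3 where the right sub-curve begins.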
4.1.3 Experiments 
We implemented two font rasterization engines: 
1. Direct mathematical derivations
Compute the intersection points of both straight lines and quadratic Bezier
curves.
2. Approximation by straight lines
Approximate the quadratic Bezier curves by straight lines and solve for
the intersection points of straight lines only.
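For the second engine, the only intersection routine needed is the one for a straight line and a scan line. A minimal sketch for a horizontal scan line is given below; the function name `line_scan_intersect` and its signature are hypothetical, and the sketch uses integer arithmetic with truncating division. A production rasterizer would also need a convention for lines that end exactly on a scan line, which is not handled here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from the thesis.
 * Returns 1 and writes the x coordinate at which the line (ax, ay)-(bx, by)
 * crosses the horizontal scan line y = sy; returns 0 if there is no single
 * crossing point (the line is horizontal or lies entirely on one side). */
static int line_scan_intersect(int ax, int ay, int bx, int by,
                               int sy, int *x_out)
{
    assert(x_out != NULL);
    if (ay == by)
        return 0;   /* horizontal line: no unique crossing */
    if ((sy < ay && sy < by) || (sy > ay && sy > by))
        return 0;   /* scan line misses the line */
    /* Linear interpolation: one multiplication and one division. */
    *x_out = ax + (bx - ax) * (sy - ay) / (by - ay);
    return 1;
}
```

For example, the line from (0, 0) to (10, 20) crosses the scan line y = 10 at x = 5. Unlike the quadratic case, no square root is needed.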
We used the "Times New Roman" TrueType font in all our tests because of its
generality and popularity. We first extracted the outline parameters of all letters
(A-Z, a-z) from the TrueType font file, including all the control points of the quadratic
Bezier curves and straight lines, into a data file. Our test programs read the
parameters from the data file and rasterize each letter to a 50x50 pixel bitmap.
Rasterization Performance Testing 
We compare the rasterization performance of the two implementations on
a PDA with the StrongARM processor (SA-1110, 206 MHz) [Intel 2000]. The test
program rasterizes each character 200 times and records the time consumed. The
experimental results are shown in Figure 4.5.
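A timing loop of this kind might look like the sketch below. It is not our actual test program: the `RasterizeFn` interface and the stub rasterizer are hypothetical stand-ins, and the standard C `clock()` function is assumed to be available on the target platform.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical interface: rasterizes one character into a bitmap. */
typedef void (*RasterizeFn)(char ch);

/* Stand-in rasterizer so that the sketch is self-contained;
 * it only counts how often it is called. */
static unsigned long stub_calls = 0;
static void stub_rasterize(char ch)
{
    (void)ch;
    stub_calls++;
}

/* CPU time in seconds taken to rasterize `ch` `repeats` times. */
static double time_character(RasterizeFn rasterize, char ch, int repeats)
{
    clock_t begin, end;
    int i;

    assert(rasterize != NULL && repeats > 0);
    begin = clock();
    for (i = 0; i < repeats; i++)
        rasterize(ch);
    end = clock();
    return (double)(end - begin) / CLOCKS_PER_SEC;
}
```

Calling `time_character(stub_rasterize, 'Q', 200)` mirrors the 200 repetitions used in the test; repeating the work amortizes the limited resolution of the clock over many calls.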
The speed of rasterization varies from character to character because the
number and size of the curves and lines that make up each character differ. Figure 4.5
shows that Method 2 (Approx.) speeds up the rasterization process. The average
performance boost is about 50%. The shapes of the curves also affect the speed of
rasterization. For instance, "Q", "O" and "S", which are mainly curve-based characters,
require many intersection point calculations to rasterize their quadratic Bezier curves.
As a result, straight-line approximation of their curves yields a significant
improvement in rasterization performance. Compared with other characters, the speed-up
for "Q" (67.8%), "O" (67.5%) and "S" (67.8%) is very significant.
[Bar chart: Performance test on PDA (SA-1110 206 MHz); horizontal axis: Character (A-Z); series: Method 1 (Math.) and Method 2 (Approx.)]
Figure 4.5 Performance Test on PDA (SA-1110 206 MHz)
In addition, we compare the rasterization speed on an English article in
order to simulate a real-life situation. The article is copied from the Hong Kong
Government website [HKSAR 2003]. It contains 3749 letters. The results are
shown below:
PDA (SA-1110 206 MHz)
Method 1 (Math.)       54.32 s
Method 2 (Approx.)     20.93 s
Performance Boost      61.4%
Table 4.1 Speed comparison of Method 1 (Math.) and Method 2 (Approx.)
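The reported boost is consistent with measuring the fraction of time saved relative to Method 1. The formula is not stated explicitly in the text, so the definition below is an assumption, with $T_1$ and $T_2$ denoting the run times of Method 1 and Method 2:

```latex
\[
\text{boost} = \frac{T_1 - T_2}{T_1}
             = \frac{54.32 - 20.93}{54.32}
             = \frac{33.39}{54.32} \approx 0.6147
\]
```

This is roughly 61%, which matches Table 4.1 up to rounding. Expressed as a ratio of run times, $54.32 / 20.93 \approx 2.6$, so Method 2 is about 2.6 times as fast on this article.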
Through these experiments, we show that using straight lines to approximate the 
curves could speed up the rasterization process. However, the enhancement of 
performance sacrifices the accuracy of the rasterization process. Hence, we should 
check the output bitmaps to ensure the degradation in accuracy is acceptable. 
Accuracy Evaluation 
[Figure: outline bitmaps of the characters A to X]
Figure 4.6 Intersection points calculated from Method 1 (Math.) & Method 2 (Approx.)
We evaluate the accuracy by comparing the intersection points calculated by
the two methods. First, we draw the intersection points calculated by
Method 1 in black as the reference outline. Then, we draw the outline
obtained from Method 2 in red to examine the differences. Each
bitmap is originally 50x50 pixels; we enlarged it to 200% to show the differences
clearly in Figure 4.6. We can compare the positions of the red and black pixels
to see the errors. There are slight distortions in characters with large curvature,
like "Q" and "O". However, the shapes of the curved outlines are still well
maintained. Moreover, the font's outline will be further enhanced by anti-aliasing
when the outlines are filled. Therefore, the characters will appear even smoother
after post-processing.
Consider a PDA with a 320x240 pixel Liquid Crystal Display (LCD) screen:
only 4-5 bitmap characters of 50x50 pixels can be displayed in a row. Hence, 50x50
pixels is a suitable size for displaying large characters in the PDA environment. To
conclude, we have shown that approximating the curves can speed up the font
rasterization process with reasonable accuracy.
4.2 Conclusion 
In this chapter, we have discussed domain specific optimization. Some
implementations are more efficient but may lead to poorer accuracy.
Therefore, we need domain specific knowledge of the particular task
to select only those implementations with acceptable accuracy. We have conducted
a case study using a font character rasterization engine on the StrongARM
platform as an example. We used two different implementations to show the
difference in performance and accuracy. We evaluated the accuracy of the font
rasterization engine from the user's perspective, based on the resultant bitmap characters.
Chapter 5 Conclusion 67 
CHAPTER 5 
5 Conclusion 
In this thesis, we have presented an efficient software development methodology for 
microprocessor based embedded systems. Here, we will review the three studied 
optimization strategies for the embedded systems and point out their merits and 
deficiencies. 
1. Source Code Optimization 
We can apply many source code transformation techniques to improve
performance. This can be achieved by maximizing the overall instructions per
cycle (IPC) ratio of the programs. Moreover, we have shown that it is not
uncommon for a compiler to miss the 'best' instructions available in
a microprocessor when generating machine code. In serious cases, we
have to write assembly code to replace parts of the compiler output. In
particular, we have suggested possible places to look for such replacements.
2. Float-to-Fixed Optimization 
The absence of a floating point unit in a microprocessor makes the floating point
operations in many applications run extremely slowly. We have shown that
the use of fixed-point arithmetic can be the solution. In a case study, the fixed-point
implementation of a fingerprint minutiae extraction algorithm runs about
twenty times faster than the floating point version on the StrongARM
platform.
3. Domain Specific Optimization
Sometimes, we have to replace the existing implementation with a more
efficient implementation that exploits properties of the problem
domain. In doing so, the overall accuracy may be sacrificed. Therefore,
domain specific knowledge should be studied carefully on a case-by-case
basis. In the font rasterization engine case study, we achieve a performance
boost of about 61% on the test article while retaining most of the overall
functionality.
Each of the above strategies has been studied carefully with the support of
experiments. In practice, all of the strategies can be applied simultaneously to an
application whose execution needs to be accelerated on microprocessor based devices.
Unfortunately, no universal tools are available to implement these strategies
automatically. In fact, the strategies stem from different areas of computer science,
namely software engineering, numerical analysis and domain knowledge such as
computer graphics (in the font character rasterization case) or multimedia processing
(if we were to construct software for MPEG decoding and encoding), making it
difficult or impossible to automate the complete optimization procedure. In a certain
sense, the optimization is similar to designing an Application Specific Integrated
Circuit (ASIC) for an application, except that all implementations are now done in
software.
Because of the complexity of the optimization steps, formal evaluation of the
strategies must consider two aspects: speed and quality. Speed is an obvious and
easy-to-measure parameter, but quality is not. In the case of fingerprint verification,
the equal error rate (EER) is a good measure. However, it is difficult to evaluate the results
of a font rasterizer or an MPEG decoder objectively.
As mobile devices become more and more popular in our daily life, we can
foresee important contributions of the results of this work to the relevant
software industry. In the past, different optimization strategies have been proposed.
However, these strategies were never studied systematically. The main contribution
of this thesis is to unify them and demonstrate how each can be used. In this
regard, we have paved the way for software ASIC for the design of modem 




Abramowitz. M., Stegun. I.A., "Handbook of Mathematical Functions", 
Dover Publications Inc, New York, Ninth printing, December 1972 
[Allan 1995] 
Allan. V. H., Jones. R.B., Lee. R. M., Allan. S. J., "Software Pipelining",





ARM Inc., "Writing Efficient C for ARM", Application Note 34, January
1998
[Angel 2000]
Angel. E., "Interactive computer graphics: a top-down approach with
OpenGL", Addison Wesley Longman, Inc., Jan 2000, pp. 448-450
[Apple 2003]
Apple Computer, Inc, "DIGITIZING LETTERFORM DESIGNS",
http://developer.apple.com/fonts/TTRefMan/RM01/Chap1.html
[Bacon 1994] 
Bacon. D., Graham. S., Sharp. O., "Compiler transformations for
high-performance computing", ACM Computing Surveys, Dec. 1994, Vol. 26, pp.
345-420
[Bambrough 1999]
Bambrough. S., "NetWinder Floating Point Notes",
http://www.netwinder.org/~scottb/notes/FP-Notes.html
[Berkeley 2000]
Berkeley Design Technology, Inc., "Choosing a DSP Processor", 2000
[Bernstein 1986]
Bernstein. R., "Multiplication by integer constants", Software: Practice and
Experience, Volume 16, Issue 7, July 1986, pp. 641-652.
[De Michell 1997]
De. Michell. G., Gupta. R. K., "Hardware/Software Co-Design", Proceedings
of the IEEE, Volume: 85, Issue: 3, March 1997, pp. 349-365
[Francis 2001]
Francis. H., "ARM DSP-Enhanced Extensions", ARM White Paper, ARM
Ltd., May 2001.
[Ferrari 1999]
Ferrari. A., Sangiovanni-Vincentelli. A., "System design: traditional concepts
and new paradigms", Proceedings of 1999 IEEE International Conference on
Computer Design: VLSI in Computers and Processors, Oct. 1999, pp. 2-12
[Furber 2000]
Furber. S., "ARM system-on-chip architecture second edition", Addison
Wesley, 2000.
[Guthaus 2001]
Guthaus. R. M., Ringenberg. S. J., Ernst. D., Austin. T. M., Mudge. T.,
Brown. R.B., "MiBench: A free, commercially representative embedded
benchmark suite", IEEE 4th Annual Workshop on Workload
Characterization, December 2001.
[GCC 2003]
GNU Compiler Collection,
http://gcc.gnu.org/
[GPROF 2003]
The GNU Profiler,
http://www.gnu.org/manual/gprof-2.9.1/
[Hamacher 2002]
Hamacher. C., Vranesic. Z., Zaky. S., "Computer Organization: fifth edition",
McGraw-Hill, 2002.
[Harvey 1995]
Harvey. M., "32-bit embedded processors: the push for higher performance",
Proceedings of the IEEE 1995 National Aerospace and Electronics
Conference, Volume: 1, 1995, pp. 371-375
[Hennessy 2003]
Hennessy. L. J., Patterson. A. D., "Computer Architecture: A Quantitative
Approach", third edition, Morgan Kaufmann, 2003.
[HKSAR 2003] 





Hong. L., Yifei. W., Jain. A., "Fingerprint image enhancement: algorithm
and performance evaluation", IEEE Transactions in Pattern Analysis and
Machine Intelligence, Volume: 20, Issue: 8, Aug. 1998, pp. 777-789.
[Hui 1998]
Hui. Ganghui., Ho. K.C., "A robust speaker-independent speech recognizer
on ADSP2181 fixed-point DSP", Proceedings of the 1998 Fourth
International Conference on Signal Processing, 12-16 Oct 1998, pp. 694-697
[Intel 1998]
Intel Corporation, "StrongARM SA-110 Microprocessor Instruction
Timing", September 1998.
[Intel 2000]
Intel Corporation, "Intel StrongARM SA-1110 Microprocessor developer's
Manual", June 2000.
[Intel 2003]
Intel Corporation, "Intel Personal Internet Client Architecture white paper",
2003
[Jain 1997]
Jain. A.K., Hong. L., Bolle. R., "On-line Fingerprint Verification", IEEE
Transaction on Machine Intelligence, Volume: 19, Issue: 4, April 1997, pp. 302-314
[Jain 2000]
Jain. A., Prabhaker. S., Hong. L., Pankanti. S., "Filterbank-based Fingerprint
Matching", IEEE Transaction on Image Processing, Volume: 9, Issue: 5, May 2000, pp. 846-859
[Jersak 1998]
Jersak. M., Willems. M., "Fixed-point extended C compiler to allow more
efficient high-level programming of fixed-point DSPs", Proceedings of the
9th International Conference on Signal Processing Applications &
Technology (ICSPAT), Toronto, Oct 1998, pp. 531-535
[Kang 1997] 
Kang. J., Sung. W., "Fixed-point C Compiler for TMS320C50 Digital Signal
Processor", Proceedings of the International Conference on Acoustics,
Speech, and Signal Processing '97, Apr 1997, pp. 707-710
[Keutzer 2000]
Keutzer. K., Newton. A. R., Rabaey. J.M., Sangiovanni-Vincentelli. A.,
"System-Level Design: Orthogonalization of Concerns and Platform-Based
Design", Proceedings of IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, Volume: 19, Issue: 12, Dec 2000, pp. 1523-1543
[Koopman 1996]
Koopman. P., "Embedded system design issues (the rest of the story)",
Proceedings of IEEE International Conference on Computer Design: VLSI in
Computers and Processors, ICCD '96, Oct. 1996, pp. 310-317
[Kum 2000]
Kum. K., Kang. J., "AUTOSCALER for C: An Optimizing Floating-Point to
Integer C Program Converter For Fixed-Point Digital Signal Processor",
IEEE Transactions on Circuits and System-II: Analog and Digital Processing,
Vol. 47, No. 9, Sept 2000.
[Lam 1988]
Lam. M., "Software Pipelining: An Effective Scheduling Technique for
VLIW Machines", ACM SIGPLAN'88 Symposium on Compiler
Construction, pp. 318-328, June 1988.
[Lefevre 1992]
Lefevre. V., "Multiplication by an Integer Constant", Technical Report, The
French National Institute For Research in Computer Science and Control,
1992.
[Lennon 2001]
Lennon. A., "Embedded Linux", IEE Review, Volume: 47, Issue: 3, May
2001, pp. 33-37
[Leupers 2000]
Leupers. R., "Code generation for embedded processors", Proceedings of
The 13th International Symposium on System Synthesis, 20-22 Sept 2000, pp. 173-178
[Lioupis 2001] 
Lioupis. D., Papagiannis. A., Psihogiou. D., "A Systematic Approach to
Software Peripherals for Embedded Systems", Ninth International
Symposium on Hardware/Software Codesign, CODES 2001, Copenhagen,
Denmark, April 25-27, 2001.
[MAD 2003]
Underbit Technologies, Inc.,
http://www.underbit.com/products/mad/
[Maio 1997]
Maio. D., Maltoni. D., "Direct gray-scale minutiae detection in fingerprints",
IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume:
19, Issue: 1, Jan. 1997, pp. 27-40
[Maio 2002]
Maio. D., Maltoni. D., Cappelli. R., Wayman. J.L., Jain. A. K., "FVC2000:
fingerprint verification competition", IEEE Transaction on Pattern Analysis
and Machine Intelligence, Volume: 24, Issue: 3, March 2002, pp. 402-412
[Martin 2000]
Martin. G., Lennard. C., "Improving embedded software design and
integration in SOCs", Proceedings of IEEE 2000 Custom Integrated Circuits
Conference, 21-24 May 2000, pp. 101-108
[Mesarovic 2000]
Mesarovic. V., Hemkumer. N. D., Dokic. M., "MPEG-4 AAC audio
decoding on a 24-bit fixed-point dual-DSP architecture", Proceedings of the
2000 IEEE International Symposium on Circuits and Systems, 28-31 May
2000, pp. 706-709
[Moon and Cheang 91]
Moon. Y.S. and Cheang. S.M., "Screen Fonts for Chinese Windowing
Systems", Proceedings of the International Conference of Computer
Processing of Chinese and Oriental Languages, August 1991, pp. 63-68
[Moon and Luk 2002]
Moon. Y.S., Luk. F.T., Ho. H.C., Tang. T.Y., Chan. K. C., Leung. C.W.,
"Fixed-point Arithmetic for Mobile Devices - A Fingerprint Verification
Case Study", Proceedings of SPIE - Advanced Signal Processing Algorithms,
Architectures and Implementation XII, July 2002, pp. 144-149
[Microsoft 2003] 




MIPS technologies Inc., 2003 
http://www.mips.com/content/Products/ProductInfo 
[Philips 2003] 
Philips Electronics, 2003 
http://www.semiconductors.philips.com/products/nexperia/ 
[Poon 93] 
Poon. C.C., "A New Approach to the Generation of Gray Scale Chinese 
Fonts", Master Dissertation, June 1993, pp. 32-41 
[Prasad 2002] 
Prasad. R. S. V., Ramkishor. K., "Efficient Implementation of MPEG-4
Video Encoder on RISC Core", International Conference on Consumer
Electronics (ICCE), Los Angeles, June 2002.
[Qin 2003]
Qin. W., Malik. S., "Flexible and Formal Modeling of Microprocessors with
Application to Retargetable Simulation", Proceedings of 2003 Design
Automation and Test in Europe Conference (DATE 03), Mar 2003, pp. 556-561
[Restle 2000]
Restle. R. C., "Choosing between DSPs, FPGAs, µPs and ASICs to
implement digital signal processing", Conference Proceedings of ICSPAT:
DSP World, Autumn 2000
[Santo 2001]
Santo. B., "Embedded battle royal [embedded operating systems]", IEEE
Spectrum, Volume: 38, Issue: 12, Dec. 2001, pp. 36-41
[Su 1999]
Su. B., Wang. J., "A Source-level loop optimization for DSP code generation",
Proceedings of 1999 IEEE International Conference on Acoustics Speech,
and Signal Processing, vol. 4, 15-19 Mar 1999, pp. 2155-2158
[Tai 2001]
Tai. H.M., Long. Men., Yang. S., Zhou. D., "Implementation of JPEG2000
codec on a fixed-point DSP", Proceedings of International Conference on
Consumer Electronics, 2001, ICCE, 19-21 June, pp. 128-129.
[Wahab 1998]
Wahab. A., Chin. S.H., Tan. E.G., "Novel approach to automated fingerprint
recognition", Proceedings of the IEEE in Vision Image and Signal
Processing, Volume: 146, Issue: 3, June 1998, pp. 160-166
[Willems 1997] 
Willems. M., Buersgens. V., Meyr. K., Meyr. H., "FRIDGE: An Interactive
Fixed-point Code Generation Environment for HW/SW CoDesign",
Proceedings of the International Conference on Acoustics, Speech and Signal
Processing '97, Apr. 1997, pp. 687-690
[Wolf 1994]
Wolf. W. H., "Hardware-Software Codesign of Embedded Systems",
Proceedings of the IEEE, Volume: 82, Issue: 7, 1994.
[Wolf 2001]
Wolf. W., "Computers as components: principles of embedded computing
system design", Morgan Kaufmann, 2001
[Yang 2001]
Yang. H., Gao. G. R., Marquez. A., Cai. G., Hu. Z., "Power and Energy
Impact by Loop Transformations", Workshop on Compilers and Operating
System for Low Power (COLP'01), September 2001.
[Zhang 2002]
Zhang. Q., Huang. K., Yan. H., "Fingerprint classification based on
extraction and analysis of singularities and pseudoridges", Selected papers
from 2001 Pan-Sydney Workshop on Visual Information Processing,
Conferences in Research and Practice in Information Technology, 2002



























