Survey of the Itanium architecture from a programmer's perspective by Parker, Steven G. & Gribble, Christiaan Paul
TECHNICAL REPORT
A Survey of the Itanium Architecture from a
Programmer’s Perspective
Christiaan Paul Gribble and Steven G. Parker
UUSCI-2003-003
Scientific Computing and Imaging Institute 
University of Utah 
Salt Lake City, UT 84112 USA
August 30, 2003
Abstract:
The Itanium family of processors represents Intel’s foray into the world of Explicitly Parallel
Instruction Computing and 64-bit system design. This survey contains an introduction to the Itanium
architecture and instruction set, as well as some of the available implementations. Taking a
programmer’s perspective, we have attempted to distill the relevant information from a variety of
sources, including the Intel Itanium architecture documentation.
In varying levels of detail, we cover the important characteristics of the Itanium architecture, a
large portion of the Itanium instruction set, program performance factors and optimizations, and
several of the available Itanium implementations. While this survey does not provide exhaustive
discussions of these topics, we hope that it will serve as a practical introduction to creating new
applications for the Itanium architecture.
Copyright © 2003
Christiaan Paul Gribble and Steven G. Parker
All Rights Reserved
Contents
1 The Itanium Architecture
2 Instruction Set Architecture
2.1 Information Units and Data Types
2.1.1 Integers
2.1.2 Floating-Point Numbers
2.1.3 Alphanumeric Characters
2.2 Instruction Formats
2.3 Instruction Classes
2.4 Addressing Modes
2.4.1 Immediate Addressing
2.4.2 Register Direct Addressing
2.4.3 Register Indirect Addressing
2.4.4 Autoincrement Addressing
3 Architectural Registers
3.1 Instruction Pointer
3.2 General-Purpose Registers
3.3 Floating-Point Registers
3.4 Branch Registers
3.5 Predicate Registers
3.6 Application Registers
3.7 System Information Registers
3.8 Other Processor Registers
4 Itanium Assembler Statements
5 Integer Instructions
5.1 Arithmetic Instructions
5.1.1 Addition
5.1.2 Subtraction
5.1.3 Shift Left and Add
5.1.4 Multiplication and Division of 64-bit Integers
5.1.5 Multiplication of 16-bit Integers
5.1.6 Special-Case Arithmetic Operations
5.2 Data Access Instructions
5.2.1 Load Instructions
5.2.2 Store Instructions
5.2.3 Move Long Immediate Instruction
5.2.4 Accessing Specialized Registers
5.3 Miscellaneous Integer Instructions
5.3.1 Zero-Extend Instruction
5.3.2 Sign-Extend Instruction
5.3.3 Instructions for Narrow Data Types
6 Comparison and Branching Instructions
6.1 Comparison Instructions
6.1.1 Signed Comparison
6.1.2 Unsigned Comparison
6.1.3 Unconditional Comparison
6.2 Branch Instructions
7 Logical and Bit-Level Instructions
7.1 Logical Instructions
7.2 Bit-Level Instructions
7.2.1 Bit-Shift Instructions
7.2.2 Shift Right Pair Instruction
7.2.3 Extract and Deposit Instructions
7.2.4 Single-Bit Test Instruction
8 Floating-Point Instructions
8.1 Arithmetic Instructions
8.1.1 Addition, Subtraction, and Multiplication
8.1.2 Fused Multiply-Add and Multiply-Subtract
8.1.3 Reciprocal and Square Root Approximations
8.1.4 Maximum and Minimum Instructions
8.1.5 Normalization
8.2 Data Access Instructions
8.2.1 Load Instructions
8.2.2 Store Instructions
8.3 Miscellaneous Floating-Point Instructions
8.3.1 Floating-Point Compare Instruction
8.3.2 Logical Instructions
8.3.3 Assembler Pseudo-Ops
8.3.4 Floating-Point Merge Instruction
8.3.5 Floating-Point Value Classification
8.4 Floating-Point Operations on Integer Values
8.4.1 Data Conversion
8.4.2 Integer Multiplication
9 Parallel Instructions
9.1 Integer Instructions
9.2 Floating-Point Instructions
10 Structured Programming Constructs
10.1 If...Then...Else Structures
10.1.1 Standard Implementation
10.1.2 Predicated Implementation
10.1.3 Nested If...Then...Else Structures Using Predication
10.2 Case Selection Structures
10.3 Loop Structures
10.3.1 Counter-controlled Loops
10.3.2 Loops Controlled by an Address Limit
10.3.3 Loops with a Conditional Entrance
10.3.4 Using the Loop Count Register
11 Using Procedures and Functions
11.1 Itanium Stack Structures
11.1.1 Itanium Memory Stacks
11.1.2 Itanium Register Stacks
11.2 Calling Procedures and Functions
11.2.1 Register Conventions
11.2.2 Call and Return Branch Instructions
11.2.3 Argument Passing
11.2.4 A Practical Example
12 Program Performance
12.1 Processor-Level Parallelism
12.2 Instruction-Level Parallelism
12.3 Explicit Parallelism
12.3.1 Instruction Templates
12.3.2 Data Dependencies and Speculation
12.3.3 Control Dependencies and Speculation
12.4 Program Optimizations
12.4.1 Performance Considerations
12.4.2 Low-level Optimization Hints
12.4.3 Performance Monitoring
12.5 Loop Optimization as a Practical Example
12.5.1 Loop Unrolling
12.5.2 Software-Pipelined Loops
12.5.3 Writing a Software Pipelined Loop
13 Itanium Implementations
13.1 The Itanium-Family Processors
13.1.1 Cache Hierarchy
13.1.2 Execution Units and Issue Ports
13.1.3 Pipelines
13.2 The Ski Simulator
List of Figures
1 A slightly modified version of the SQUARES program from Evans and Trimper
2 The Itanium extract and deposit instructions
3 A slightly modified version of the DOTCLOOP program from Evans and Trimper
4 Passing arguments via registers and the memory stack
5 A slightly modified version of the BOOTH function from Evans and Trimper
6 A slightly modified version of the DECNUM3 program from Evans and Trimper
7 A slightly modified version of the DOTCTOP2 program from Evans and Trimper




List of Tables
1 Size and numeric range (in decimal notation) of Itanium integer data types
2 IEEE floating-point representation
3 Itanium instruction types and the corresponding execution units
4 Itanium addressing modes and effective addresses
5 The names and uses of the Itanium general-purpose registers
6 The names and uses of the Itanium floating-point registers
7 The names and uses of the Itanium branch registers
8 The names and uses of the Itanium predicate registers
9 Comparison of the Itanium integer and floating-point instructions
10 Meanings of the special IEEE floating-point representations
11 Assembler mnemonics for the fclass instruction
12 The registers, and their uses, of the DECNUM3 stack frame
13 Itanium instruction templates
14 Characteristics of the current Itanium processors
15 Characteristics of the Itanium and Itanium 2 cache structures
16 Possible dual-issue instruction bundles for the Itanium and Itanium 2 processors
17 The Itanium and Itanium 2 pipelines
Preface
The Itanium family of processors represents Intel’s foray into the world of Explicitly Parallel Instruction
Computing and 64-bit system design. This survey contains an introduction to the Itanium architecture
and instruction set, as well as some of the available implementations. Taking a programmer’s perspective,
we have attempted to distill the relevant information from the thousands of pages of Itanium
documentation and reference materials cited at the end of this work.
This survey largely follows the structure, form, and content of an excellent book by James Evans and
Gregory Trimper, entitled Itanium Architecture for Programmers. We have, of course, taken the liberty of
rearranging the topics, omitting the less important details, and expanding the most relevant discussions with
appropriate information from other sources; in other words, we do more than simply summarize the book.
Nevertheless, we gratefully acknowledge the significant impact that their work has had on this survey.
We cover the following topics in varying levels of detail:
•  the important characteristics of the Itanium architecture (Sections 1-3),
•  programming with the Itanium instruction set (Sections 4-11),
•  program performance factors and optimization techniques (Section 12), and
•  several implementations of the Itanium architecture (Section 13).
It is not our intention to provide exhaustive discussions of the Itanium architecture, its instruction set, or 
any of the available implementations. We have made an effort to include those topics and details that we 
found most useful during our initial experimentation with the Itanium architecture. Likewise, where useful 
or important details have been omitted intentionally, due either to space and formatting constraints or to the 
intended scope of this work, we have made an effort to cite specific sections and pages within the reference 
materials that will enhance the included discussion.
Our hope is that this survey will serve as a practical introduction to creating new applications for the
Itanium architecture.
Salt Lake City, Utah 
August 2003
1 The Itanium Architecture
A computer system may be categorized in terms of two basic characteristics: its organization and its
architecture. The organization of a computer system describes its overall structure and the elements of which the
system is composed. In this survey, we are concerned with the details of Intel’s Itanium architecture and 
not the more general topic of computer organization. Many instructive resources cover both of these topics 
simultaneously; in particular, we recommend the well-known texts by John Hennessy and David Patterson 
(see the section entitled “Bibliography and Additional Resources” on page 70).
The architecture of a computer system, on the other hand, describes the structure and operation of the 
system as visible to an assembly language programmer. A computer architecture is therefore an abstraction, 
consisting of the programming interface for controlling the operation of the system.
A related, but wholly distinct, concept is that of an implementation. An implementation is the realization 
of the structure and operation prescribed by a computer architecture using various hardware and software 
components. For example, the Itanium 2 processor is just one of the several available implementations of the 
Itanium architecture.
An Analogy from Evans and Trimper. In their book, Itanium Architecture for Programmers, James Evans
and Gregory Trimper offer a useful analogy, based on pianos, to help clarify the distinction between an 
architecture and an implementation. We paraphrase their example and include it here:
A piano architecture is defined by the specification of the keyboard. The keyboard is the player’s interface 
to the instrument, and it consists of 88 keys: 36 black keys and 52 white keys. Notes of various specified 
frequencies are sounded by striking a particular key. The size and arrangement of the keys are identical for 
all modern piano keyboards, so a person who can play the piano can play any piano.
Many implementations of the piano architecture are possible. Implementations may be distinguished by 
the types of materials used to construct the piano, by the size and shape of the instrument, or any number 
of other decisions made by a particular manufacturer concerning the details of the piano. Nevertheless, any 
piano player will be able to play the final product.
Likewise, a computer architecture specifies the programmer’s interface for controlling the operation of 
the system. Many implementations of the computer architecture are possible, and they are distinguished by 
size, cost, and performance characteristics. However, any computer program that runs on one machine should 
run on any machine conforming to the same architecture.
Explicitly Parallel Instruction Computing. Modern computer architectures are generally classified as one
of three types. Complex Instruction Set Computers (CISC) usually provide a large number of machine
instructions, each of which may exhibit many different styles. CISC machines are often difficult to implement
because each type of instruction may require a large portion of the available die area. In contrast, Reduced
Instruction Set Computers (RISC) provide far fewer machine instructions and far fewer instruction styles. As a
result, faster circuitry may be possible, and RISC programs may execute faster than their CISC counterparts,
even though they are typically composed of a larger number of machine instructions.
A few, mostly experimental, RISC architectures employ very long instruction words (VLIW) to guide
the simultaneous execution of several RISC-like instructions. In the past, the advantages of the VLIW
approach were overshadowed by its disadvantages. Analyzing and implementing instruction-level parallelism
required very sophisticated compilers, and accommodating the architectural latency among the instructions
required that software programs be recompiled (and thus redistributed) for each new hardware
implementation. However, after a thorough analysis by B. Rau, minimal modifications to the VLIW approach enabled the
architecture-implementation difficulties to be overcome. These results led directly to the third, and newest,
class of modern computer architecture: Explicitly Parallel Instruction Computing (EPIC). Intel’s Itanium
architecture is the first EPIC design.
64-Bit Systems. A computer system may also be classified according to the width of its datapath. This width 
describes the number of bits that can “flow” through the computer’s internal conduits in parallel.
Although EPIC architectures are a relatively recent development, processors built around a 64-bit
datapath are not. The Alpha processor, marketed by Digital Equipment Corporation, was the first 64-bit RISC
computer to find commercial success. Other manufacturers, such as Hewlett-Packard, have also marketed 
64-bit systems with varying degrees of success.
The Itanium architecture is Intel’s first 64-bit design. While it is too soon to declare the Itanium
architecture an overwhelming success, we are hopeful that the implications of EPIC principles, when combined with
a 64-bit design, will lead to a viable and affordable platform for building and running large-scale scientific 
and high-performance computing applications.
In the following pages, we discuss those aspects of the Itanium architecture that are most relevant to both 
high-level and assembly language programmers. If our hope is realized, you will be well-prepared to face the 
new programming challenges.
2 Instruction Set Architecture
Just as a computer architecture abstracts the structure and operation of a computer system, an instruction set 
architecture (ISA) abstracts the interface between a computer’s hardware and lowest-level software. With an 
understanding of the ISA, a programmer knows, in principle, what the computer system can and cannot do 
and how to accomplish a given task efficiently.
Instruction sets may be classified according to the number of addresses contained within the typical
instruction. Like most RISC designs, the Itanium architecture describes a two-address machine with respect
to its load and store operations: one operand is a memory address and the other is a processor register. Most
other Itanium instructions involve at least three addresses (for example, the add r1 = r2, r3 instruction)
and therefore specify two source operands (r2, r3) and one destination operand (r1).
2.1 Information Units and Data Types
The basic unit of information in the Itanium architecture is the 8-bit byte. Unlike previous Intel architectures, 
the Itanium architecture assigns each byte a 64-bit address. The architecture also describes several multi-byte 
units that are composed of groups of adjacent bytes: the 16-bit word (2 bytes), the 32-bit double word (4 
bytes), and the 64-bit quad word (8 bytes). Each of these multi-byte units is also addressable.
2.1.1 Integers
Although the Itanium load and store instructions are able to manipulate information units smaller than 64 
bits in width, integer arithmetic instructions only operate on quad word data. Likewise, Itanium logical 
instructions only work with quad word data; these instructions, however, provide some access to the data at 
the bit or group-of-bits level. Table 1 expresses the size and numeric range of the available Itanium integer 
data types.
Unit         Bits  Bytes  Signed Integer                                             Unsigned Integer
Byte            8      1  -128 to +127                                               0 to 255
Word           16      2  -32,768 to +32,767                                         0 to 65,535
Double word    32      4  -2,147,483,648 to +2,147,483,647                           0 to 4,294,967,295
Quad word      64      8  -9,223,372,036,854,775,808 to +9,223,372,036,854,775,807   0 to 18,446,744,073,709,551,615
Table 1: Size and numeric range (in decimal notation) of Itanium integer data types
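The ranges in Table 1 follow directly from the width of each unit. The following Python sketch recomputes them, assuming two's-complement encoding for signed values (an assumption that the tabulated ranges imply but that this section does not state). The names UNITS, signed_range, and unsigned_range are illustrative only.

```python
# Widths of the Itanium integer information units, in bits (from Table 1).
UNITS = {"byte": 8, "word": 16, "double word": 32, "quad word": 64}


def signed_range(bits):
    """Return (min, max) for a two's-complement integer of the given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def unsigned_range(bits):
    """Return (min, max) for an unsigned integer of the given width."""
    return 0, (1 << bits) - 1


if __name__ == "__main__":
    for name, bits in UNITS.items():
        lo, hi = signed_range(bits)
        _, umax = unsigned_range(bits)
        print(f"{name:12s} {bits:2d} bits  signed {lo:,} to {hi:+,}  unsigned 0 to {umax:,}")
```

Running the script prints one row per unit, which can be compared against the table entry by entry.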
2.1.2 Floating-Point Numbers
In addition to integers, the Itanium architecture provides instructions for manipulating floating-point numbers. 
These numbers are typically represented by a significand that is multiplied by some power of two. An 
exponent and sign are also packed with the significand.
Historically, computer manufacturers defined their own formats for floating-point numbers. Only after
the set of standards documented in ANSI/IEEE 754, IEEE Standard for Binary Floating-Point Arithmetic,
emerged was there any agreement between manufacturers concerning floating-point representation.
The standard defines four floating-point formats: single, double, extended single, and extended double. 
The former two formats have been supported by nearly every new architecture that has been developed in 
the time since the standard was defined, while the latter two offer flexibility for supporting older, proprietary 
floating-point representations. For the purposes of this survey, we only consider the widely supported IEEE
single- and double-precision floating-point numbers. Table 2 lists some important characteristics of each 
floating-point format that we consider here. Note that in IEEE representation, the significand consists of an 
implicit “hidden bit” followed by the fraction; this bit is suppressed to reduce storage requirements.
Characteristic        Single            Double
Size in memory
  Sign                1 bit             1 bit
  Exponent            8 bits            11 bits
  Fraction            23 bits           52 bits
  Total               32 bits           64 bits
Bias for exponent     127               1023
Minimum magnitude     1.175 * 10^-38    2.225 * 10^-308
Maximum magnitude     3.403 * 10^+38    1.798 * 10^+308
Precision
  Binary              24 bits           53 bits
  Decimal             6 digits          16 digits
Table 2: IEEE floating-point representation
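The minimum and maximum magnitudes in Table 2 can be derived from the exponent width, fraction width, and bias. The sketch below does so in Python for normalized numbers only. It assumes the IEEE 754 convention that the all-zeros and all-ones exponent fields are reserved for special values, which this section does not spell out. The names FORMATS, min_normal, and max_finite are illustrative.

```python
# Format parameters taken from Table 2: (exponent bits, fraction bits, bias).
FORMATS = {"single": (8, 23, 127), "double": (11, 52, 1023)}


def min_normal(fmt):
    """Smallest normalized magnitude: significand 1.0, exponent field 1."""
    _, _, bias = FORMATS[fmt]
    return 2.0 ** (1 - bias)


def max_finite(fmt):
    """Largest finite magnitude: all-ones fraction, largest non-reserved exponent field."""
    exp_bits, frac_bits, bias = FORMATS[fmt]
    # Assumption: the all-ones exponent field is reserved, so the largest usable field is 2^k - 2.
    largest_exponent = (1 << exp_bits) - 2 - bias
    return (2.0 - 2.0 ** -frac_bits) * 2.0 ** largest_exponent


if __name__ == "__main__":
    for fmt in FORMATS:
        print(f"{fmt}: min {min_normal(fmt):.3e}, max {max_finite(fmt):.3e}")
```

The printed values, rounded to four significant digits, should agree with the minimum and maximum magnitude rows of the table.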
Single precision. An IEEE single-precision floating-point number consists of four adjacent bytes in memory.
In a little-endian representation the bits are labeled from right to left, as follows:
Bits    31   30-23      22-00
Field   S    Exponent   Fraction
Generally, the value of the number is given by
$$(-1)^{S} \times 1.F \times 2^{E}$$
where S is the sign of the number (0 for positive, 1 for negative), F is the binary fraction, 1.F is the
significand, E is the true exponent (the stored exponent field holds E + B), and B is the bias. There are special
cases when these bits are interpreted differently, for example, when manipulating an integer value stored in a
floating-point register.
When an IEEE single-precision floating-point number is stored in an Itanium processor register, the
regions are arranged as follows:
Bits    81   80-64      63   62-40      39-00
Field   S    Exponent   1    Fraction   Zeros
Note that the “hidden bit” in the IEEE representation is made explicit in an Itanium processor register.
Double precision. An IEEE double-precision floating-point number consists of eight adjacent bytes in 
memory. In a little-endian representation the bits are labeled from right to left, as follows:
  bit 63     bits 62-52     bits 51-00
  S          Exponent       Fraction
Again, the value of the number is generally given by
  \[ (-1)^{S} \times 1.F \times 2^{E} \]
where S, F, 1.F, E, and B are as before. There are again special cases when these bits are interpreted 
differently.
When an IEEE double-precision floating-point number is stored in an Itanium processor register, the 
regions are arranged as follows:
  bit 81     bits 80-64     bit 63     bits 62-11     bits 10-00
  S          Exponent       1          Fraction       Zeros
As with single-precision floating-point numbers, the hidden bit is made explicit in an Itanium processor 
register.
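To make the field layouts concrete, the following Python sketch splits IEEE single- and double-precision values into their sign, exponent, and fraction fields and rebuilds the value of a normalized number. The function names are our own, and `register_significand` is only an illustration of how the hidden bit becomes explicit in the 64-bit significand of a register; it assumes a normalized input and ignores the special cases mentioned above.

```python
import struct

def decode_single(x):
    """Return (sign, biased exponent, fraction) of x as an IEEE single."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def decode_double(x):
    """Return (sign, biased exponent, fraction) of x as an IEEE double."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

def value_of(sign, biased_exp, fraction, frac_bits, bias):
    """Normalized numbers only: (-1)^S * 1.F * 2^(stored exponent - bias)."""
    significand = 1 + fraction / (1 << frac_bits)
    return (-1) ** sign * significand * 2.0 ** (biased_exp - bias)

def register_significand(fraction, frac_bits):
    """Illustrative 64-bit significand: explicit 1, then fraction, then zeros."""
    return (1 << 63) | (fraction << (63 - frac_bits))

sign, exp, frac = decode_single(-6.5)          # -6.5 = -1.625 * 2^2
print(sign, exp - 127, hex(frac))              # true exponent is exp - bias
print(value_of(sign, exp, frac, 23, 127))
print(hex(register_significand(frac, 23)))
```

The same helpers work for double precision by passing 52 fraction bits and a bias of 1023.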
2.1.3 Alphanumeric Characters
Binary numbers can encode any information, including alphanumeric characters (letters, numerals, 
punctuation marks, etc.). Numerous encoding schemes exist, and the supported schemes are largely dependent 
upon which operating system and programming environment are used. For this reason, we omit any further 
discussion of this data type, except to say that the Itanium architecture features instructions for 
manipulating narrow information units (that is, bytes, words, and double words), many of which are discussed 
in later sections. Managing strings of alphanumeric data is relegated to the programmer or compiler.
We recommend that you consult your system’s documentation to see which encoding schemes and character 
sets are supported.
2.2 Instruction Formats
The Itanium architecture specifies a seemingly awkward 41-bit instruction width. While the rationale behind 
this choice is largely unimportant to the discussion at hand, it suffices to say that, in the final design, Itanium 
instructions are always fetched in groups of three and are packaged with a 5-bit instruction template, as 
follows:
  bits 127-87          bits 86-46           bits 45-05           bits 04-00
  Instruction slot 2   Instruction slot 1   Instruction slot 0   Template
This layout yields a 128-bit instruction bundle, where the template supplies additional information 
instructing the CPU how to decode and execute the three instructions. Instruction bundles are always treated 
as little-endian structures, as shown in the figure above, and are always 16-byte aligned; that is, the four 
lowest-numbered bits of a bundle’s address are always zeros.
Instruction templates are one of 32 predefined bit patterns that describe the three instructions contained 
within the bundle. We defer any further discussion of these structures until Section 12.
While we are more concerned with the actual function of the Itanium instructions, it is useful to consider 
briefly the bit-field layout of a single Itanium instruction.
Itanium load and store operations require two operand specifiers (a source and a destination), and the 
arithmetic and logic operations require three operand specifiers (two sources and a destination). Some 
Itanium instructions have two destination operands, and so require four operand specifiers. Thus, the Itanium 
instruction layout may have as many as six main bit-fields.
The field labeled qp provides for a qualifying predicate, the fields labeled field1 to field4 provide 
for up to four operands, and the highest four bits specify the major opcode. Bits may be reinterpreted when an 
instruction requires fewer than four operands, or when numeric constants are packaged within the instruction 
itself as immediate data.
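The bundle and instruction layouts described above can be sketched in a few lines of Python. The helper names are our own. The 5-bit template, the three 41-bit slots, and the 4-bit major opcode come from the text; the 6-bit width of the qp field is our assumption, based on the 64 predicate registers that it must be able to name.

```python
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1

def bundle_from_bytes(raw):
    """Interpret 16 bytes of memory as a little-endian 128-bit bundle."""
    if len(raw) != 16:
        raise ValueError("an instruction bundle is exactly 16 bytes")
    return int.from_bytes(raw, "little")

def split_bundle(bundle):
    """Return (template, slot0, slot1, slot2) from a 128-bit bundle."""
    template = bundle & 0x1F                 # bits 4-0
    slot0 = (bundle >> 5) & SLOT_MASK        # bits 45-5
    slot1 = (bundle >> 46) & SLOT_MASK       # bits 86-46
    slot2 = (bundle >> 87) & SLOT_MASK       # bits 127-87
    return template, slot0, slot1, slot2

def major_opcode(slot):
    """The highest four bits of a 41-bit instruction."""
    return (slot >> 37) & 0xF

def qualifying_predicate(slot):
    """The lowest six bits of a 41-bit instruction (assumed qp width)."""
    return slot & 0x3F

def is_bundle_aligned(address):
    """Bundles are 16-byte aligned: the four lowest address bits are zero."""
    return address & 0xF == 0
```

Decoding the operand fields themselves depends on the instruction type and is deliberately left out of this sketch.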
2.3 Instruction Classes
Itanium instructions can be divided into six basic classes:
•  Type A instructions include standard arithmetic and logic operations on integers (add, multiply, Boolean 
AND, etc.), as well as comparison operations on data values.
•  Type I instructions include other operations on integer data types, for example, bit-shifting, moving 
data to and from special purpose registers, and multimedia instructions.
•  Type M instructions include the load and store operations for both integer and floating-point data, 
the operations for moving data between general-purpose integer registers and floating-point registers, 
and the instructions that give the programmer a limited degree of control over the system’s memory 
hierarchy.
•  Type B instructions include the branching and jumping operations, as well as those for calling and 
returning from functions or procedures.
•  Type F instructions include those operations on floating-point data that are not Type M instructions.
•  Type X instructions include a few special Itanium instructions that encode more information than would 
normally fit into the 41-bit instruction width. These instructions consume two slots in an instruction 
bundle.
Table 3 shows the Itanium instruction types and the execution units that actually perform the 
operations. Not surprisingly, there is a correspondence between the I, M, B, and F instructions and the I, M, 
B, and F execution units that decode and execute them. Type A instructions, which are the most common, can 
be executed by both I- and M-units, providing the potential for a high degree of instruction-level parallelism.
An implementation of the Itanium architecture may include more than one of each type of execution unit. 
As an example, the Itanium 2 processor includes four M-units (two load, two store), two I-units, and 
two F-units.
Operation                          Instruction Type   Execution Unit
Arithmetic, logic, comparison      A                  Any available I- or M-unit
Other integer operations           I                  I-unit
Memory access and data movement    M                  M-unit
Branches and calls                 B                  B-unit
Floating-point operations          F                  F-unit
Special two-slot instructions      X                  I- or B-unit, depending on operation
Table 3: Itanium instruction types and the corresponding execution units
2.4 Addressing Modes
The Itanium ISA supports four addressing modes: immediate, register direct, register indirect, and 
autoincrement. Table 4 captures the important characteristics of the available Itanium addressing modes, 
each of which we describe fully below.
2.4.1 Immediate Addressing
When immediate addressing is used, the instruction itself contains the operand data. Because the data is 
already in the CPU, no additional address calculations or memory fetches are required.
We have already encountered immediate addressing briefly: numeric constants whose values are known 
at the time of program assembly or compilation can be packaged within the bit-field of a given instruction. 
Immediate addressing is almost always used for these sorts of operands. Also, you will recognize that 
immediate addressing is useful only for source, and not destination, operands.
2.4.2 Register Direct Addressing
An instruction may contain an address that points to the operand data; this addressing mode is called direct 
addressing. The Itanium ISA is a register-to-register architecture and allows only the load and store 
instructions to operate on data in memory. Thus, only register direct addressing, where the bits within the instruction 
specify the “address” (name or number) of a processor register that contains the operand data, is permitted 
by the Itanium architecture.
2.4.3 Register Indirect Addressing
An instruction may also contain the address of a pointer to the operand data; this addressing mode is called 
indirect addressing. The bits within the instruction contain the register address, say rX. When the instruction 
executes, the contents of this register rX are interpreted as the effective address of the information unit 
containing the actual data. For the Itanium architecture, this two-phase addressing mode is more strictly 
called register indirect addressing.
2.4.4 Autoincrement Addressing
Often it is useful to refer to operand data using register indirect addressing and then adjust the address 
contained within that register to point to the next identically sized information unit. Stepping through an array of 
data is an example of a common task where this addressing mode proves useful.
The Itanium ISA supports this capability with its autoincrement addressing mode. The postincrement 
value is not limited to the size of particular data types. For store operations, the value is expressed as a 9-bit 
signed immediate constant within the instruction. For load operations, it can also be specified dynamically 
using a value in an Itanium general-purpose register.
Specifying the postincrement value as a signed constant allows the programmer to step through an array 
in either direction, depending on whether the register points to the first or last data element.
Addressing Mode     Assembler Syntax    Effective Address
Immediate           imm                 Bits packaged within the instruction are interpreted as an
                                        integer value, typically signed, or as an instruction
                                        “subcode” that is used to select specific cases of the
                                        instruction
Register Direct     rX                  The named register
Register Indirect   [rX]                Contents of the named register
Autoincrement       [rX],imm            Contents of the named register; the register value is then
                    or [rX],rY          postincremented by the signed quantity given statically as
                                        imm (load and store operations) or dynamically in register
                                        rY (load operations only)
Table 4: Itanium addressing modes and effective addresses
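As a rough model of autoincrement addressing, the Python sketch below reads a value through a register and then postincrements that register. The dictionary-based memory, the register names, and the helper functions are our own illustrative assumptions, not part of the Itanium tools; only the 9-bit signed range of the immediate comes from the text.

```python
MASK64 = (1 << 64) - 1

def fits_imm9(increment):
    """True if the increment fits the 9-bit signed immediate form."""
    return -256 <= increment <= 255

def load8_postinc(memory, regs, base, increment):
    """Model of a load with postincrement: use the old address, then adjust it."""
    address = regs[base]
    value = memory[address]
    regs[base] = (address + increment) & MASK64
    return value

# Three 8-byte elements starting at a hypothetical address 0x1000.
memory = {0x1000: 10, 0x1008: 20, 0x1010: 30}

regs = {"r14": 0x1000}                      # points to the first element
forward = [load8_postinc(memory, regs, "r14", 8) for _ in range(3)]

regs = {"r14": 0x1010}                      # points to the last element
backward = [load8_postinc(memory, regs, "r14", -8) for _ in range(3)]

print(forward, backward)
```

A negative increment walks the array from the last element toward the first, which is the reason the immediate is signed.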
3 Architectural Registers
The Itanium architecture includes an unprecedented number of registers, including an instruction pointer, 128 
general-purpose registers, 128 floating-point registers, 8 branch registers, 64 predicate registers, as many as
128 special-purpose (application) registers, various system information registers, and many others. Such a 
large number of registers enables numerous computations to be performed without the need to repeatedly 
spill and fill intermediate results to memory.
The Itanium registers vary greatly in their size, features, and uses. Following the nomenclature used by 
Evans and Trimper, we characterize the Itanium registers with the following terms:
•  A register is constant if its value has been permanently defined at the hardware level.
•  A register is special if it has some purpose assigned to it, either at the hardware level or by software 
convention.
•  A register is scratch if it may be used freely by a procedure or function at any calling level; the caller 
must save any important contents of these registers.
•  A register is preserved if a calling routine depends on its contents; any called procedure must save and 
restore the contents of these registers for its caller.
•  A register is automatic if its name only has a dynamic correspondence to a physical register; these 
registers are automatically spilled to and filled from memory during allocation by the hardware, as 
necessary.
•  A register is read-only if its value is dynamically maintained at the hardware level or by the operating 
system; read-only registers cannot be modified by an application program.
Please refer to these descriptions as we detail the Itanium architectural registers.
3.1 Instruction Pointer
The Itanium instruction pointer (IP) supports the instruction fetch cycle; it points to the currently executing 
instruction bundle. In most other architectures, this register is called the program counter. The Itanium IP is
64 bits wide and can accommodate full address pointers:
  bits 63-04     bits 03-00
  non-zero       0000
Itanium instructions are always fetched three at a time, as 128-bit instruction bundles. The lowest four 
bits of the IP are, therefore, always zero.
3.2 General-Purpose Registers
The Itanium architecture specifies 128 general-purpose registers, named Gr0-Gr127. Each of these registers 
is 64 bits wide and can accommodate both full address pointers and either signed or unsigned integers.
Each general register has an associated 65th bit, called the Not a Thing (NaT) bit. This bit is used to 
indicate whether the value stored in a register is valid. When the contents of a marked register are used by 
an operation, the NaT bit of the destination register will automatically be set. The invalid condition may 
be carried through a sequence of instructions, to be dealt with when convenient. NaT bits are important for 
software that utilizes speculative loads.
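The propagation of the NaT bit through a chain of operations can be pictured with a small Python model. The `GeneralRegister` class and the `add` function are hypothetical names of our own; the sketch shows only the propagation rule described above and does not model what happens when an invalid value is finally consumed.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1

@dataclass
class GeneralRegister:
    value: int = 0      # the 64-bit contents
    nat: bool = False   # the associated 65th (NaT) bit

def add(src_a, src_b):
    """Add two registers; the result is marked if either source is marked."""
    total = (src_a.value + src_b.value) & MASK64
    return GeneralRegister(total, src_a.nat or src_b.nat)

valid = GeneralRegister(5)
marked = GeneralRegister(0, nat=True)   # e.g. the result of a failed speculative load

step1 = add(valid, marked)              # marked, because one source was marked
step2 = add(step1, valid)               # still marked further down the chain
print(step1.nat, step2.nat)
```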
Table 5 lists the names and standardized uses of the Itanium general-purpose registers.
Register      Assembler Name   Other Name    Class       Notes
Gr0           r0                             Constant    Always contains 0; writes are illegal
Gr1           r1               gp            Special     Global data pointer
Gr2, Gr3      r2, r3                         Scratch     Often useful with addl instruction
Gr4-Gr7       r4-r7                          Preserved
Gr8-Gr11      r8-r11           ret0-ret3     Scratch     Integer values returned by a function
Gr12          r12              sp            Special     Stack pointer (always modulo 16)
Gr13          r13              tp            Special     Thread pointer (requires operating system support)
Gr14-Gr31     r14-r31                        Scratch
Gr32-Gr39     r32-r39          in0-in7       Automatic   Up to 8 input arguments to a function
Gr32-Gr127    r32-r127                       Automatic   Stacked input registers; safe
Gr32-Gr127    r32-r127         loc0-loc95    Automatic   Stacked local registers; safe
Gr32-Gr127    r32-r127         out0-out95    Automatic   Stacked output registers
Gr32-Gr127    r32-r127                       Automatic   Rotating registers (groups of 8); they overwrite the
                                                         stacked registers of the current procedure
Table 5: The names and uses of the Itanium general-purpose registers
3.3 Floating-Point Registers
The Itanium architecture specifies 128 floating-point registers, named Fr0-Fr127. Each of these registers is 82 
bits wide and can accommodate expanded forms of IEEE single- or double-precision floating-point values, 
as well as signed or unsigned 64-bit integers:
  bit 81     bits 80-64     bits 63-00
  s          biased exp     significand
The floating-point registers do not have an associated invalidity bit; rather, a special value called Not a 
Thing Value (NaTVal) indicates whether the contents of a floating-point register are valid.
Table 6 lists the names and standardized uses of the Itanium floating-point registers.
Register      Assembler Name   Other Name    Class       Notes
Fr0           f0                             Constant    Always +0.0; writes are illegal
Fr1           f1                             Constant    Always +1.0; writes are illegal
Fr2-Fr5       f2-f5                          Preserved
Fr6-Fr7       f6-f7                          Scratch
Fr8-Fr15      f8-f15                         Scratch     Floating-point arguments to a function and values
                                                         returned by a function
Fr16-Fr31     f16-f31                        Preserved
Fr32-Fr127    f32-f127                       Scratch     Rotating registers
Table 6: The names and uses of the Itanium floating-point registers
3.4 Branch Registers
The Itanium architecture specifies eight branch registers, named Br0-Br7. Each of these registers is 64 bits 
wide and can therefore accommodate full address pointers:
  bits 63-04     bits 03-00
  non-zero       0000
Itanium instructions are always fetched three at a time, as 128-bit instruction bundles. The lowest four 
bits of a branch register are, therefore, always zero.
Branch registers specify the target address of indirect branches. Table 7 lists the names and standardized 
uses of the Itanium branch registers.
Register      Assembler Name   Other Name    Class       Notes
Br0           b0               rp            Scratch     Return link
Br1-Br5       b1-b5                          Preserved
Br6-Br7       b6-b7                          Scratch
Table 7: The names and uses of the Itanium branch registers
3.5 Predicate Registers
The Itanium architecture specifies 64 predicate registers, named Pr0-Pr63. Each of these registers is only one 
bit wide and can therefore accommodate either a Boolean true (1) or false (0) value.
These registers control the conditional execution of instructions and conditional branches. Table 8 lists 
the names and standardized uses of the Itanium predicate registers. Each bit in the 64-bit predicate vector is 
individually addressable. No predicate registers are automatically stacked at the time of a procedure call; if 
necessary, the entire predicate vector may be saved to a general-purpose register.
Register      Assembler Name   Other Name    Class       Notes
Pr0           p0                             Constant    Always true (1); writes are discarded
Pr1-Pr5       p1-p5                          Preserved   Fixed; safe
Pr6-Pr15      p6-p15                         Scratch     Fixed; unsafe
Pr16-Pr63     p16-p63          pr.rot        Preserved   Rotating registers
Table 8: The names and uses of the Itanium predicate registers
3.6 Application Registers
The Itanium architecture allows for as many as 128 application registers (named Ar0-Ar127) to be defined. 
These 64-bit registers can accommodate full address pointers and either signed or unsigned integers.
Application registers perform specific tasks associated with various instructions in an application-level 
program. We omit any further details of the Itanium application registers here. Consult the Intel Itanium 
architecture documentation for more information.
3.7 System Information Registers
The Itanium architecture also includes registers that provide information concerning the hardware 
implementation to application programs.
For example, an application can determine implementation-dependent features, such as the processor’s 
manufacturer or its family and model numbers, by reading the cpuid (processor identification) registers.
The Itanium architecture also allows for as many as 256 64-bit pmd (performance monitor data) registers. 
These registers record certain aspects of the system’s performance that can be used for tuning applications. 
An operating system may allow read-only access to these registers by application-level code. The Itanium 
architecture requires that, at a minimum, eight pmd registers be implemented.
3.8 Other Processor Registers
In addition to those we have discussed, the Itanium architecture includes a number of other registers:
•  Various state management registers, like the ar.pfs (previous function state) register, where prior 
state information can be preserved in hardware rather than a slower memory stack.
•  Various system control registers, like the psr (processor status) register, where the operating system 
and hardware can track critical aspects of machine state.
•  Several more, highly specialized registers that generally require privileged instructions to access them.
For more details concerning these other Itanium registers, we recommend the Intel Itanium architecture 
documentation.
4 Itanium Assembler Statements
Before diving into a detailed discussion of the available instructions, we introduce the basic syntax of Itanium 
assembler statements:
  [label:] [(qp)] mnemonic[.comp] dst=src[;;] [// comment]
where [ ] denotes an optional syntactical element and:
•  label is a symbolic address in the form of a character string terminated by a colon (:),
•  (qp) specifies a qualifying predicate register,
•  mnemonic specifies a name that uniquely identifies an Itanium instruction,
•  comp specifies one or more instruction completers to indicate variations on a base instruction mnemonic,
•  dst specifies the destination operand(s),
•  src specifies the source operand(s),
•  ;; is an explicit stop used to identify Itanium instruction bundles or data dependencies, and
•  // comment is a human-language description of the assembler statement.
While not altogether void of traits in common with assembly statements for other architectures, many of these 
elements are unique to the Itanium architecture.
Typically, each line in an assembly language program is one statement that may be imperative, declarative, 
or controlling:
•  Imperative statements represent machine instructions in symbolic form. These statements are the most 
common type.
•  Declarative statements control the allocation of memory or perform naming functions. These 
statements do not generate machine instructions to be executed at runtime; rather, they set aside space, 
define symbols, or initialize the contents of particular memory locations.
•  Controlling statements give the programmer a limited degree of control over certain aspects of the 
assembly process.
To illustrate the Itanium assembly language, we include a (slightly modified) example program from 
Evans and Trimper called SQUARES. The program populates a table in memory with the squares of the first 
three integers using tabular differences. The SQUARES code is given in Figure 1.
Even without any knowledge of particular Itanium instructions, you should be able to recognize many 
elements of the Itanium architecture that we have already discussed: various processor registers, like the 
general-purpose registers Gr20-Gr22 and the branch register Br0, or perhaps the use of register indirect 
addressing, as in st8 [r14]=r20;;. You will see, too, that the SQUARES program includes imperative, 
declarative, and controlling assembler statements.
While SQUARES is a trivial example and is not particularly useful except for illustrative purposes, the 
program sets the stage nicely for our discussion of the Itanium instruction set.
// SQUARES: Populate a table of squares
        .data                               // Declare data section
        .align  8                           // Specify desired alignment
sq1:    .skip   8                           // To store 1 squared
sq2:    .skip   8                           // To store 2 squared
sq3:    .skip   8                           // To store 3 squared
        .text                               // Declare code section
        .align  32                          // Specify desired alignment
        .global main                        // Mark mandatory program entry
        .proc   main
main:
        .body                               // Begin procedure 'main'
first:  mov     r21=1;;                     // Gr21 = 1st tabular difference
        mov     r22=2;;                     // Gr22 = 2nd tabular difference
        mov     r20=1;;                     // Gr20 = 1st square
        addl    r14=@gprel(sq1),gp;;        // Point to storage for 1st square
        st8     [r14]=r20;;                 // Store 1st square
        add     r21=r22,r21;;               // Adjust 1st tabular difference
        add     r20=r21,r20;;               // Gr20 = 2nd square
        addl    r14=@gprel(sq2),gp;;        // Point to storage for 2nd square
        st8     [r14]=r20;;                 // Store 2nd square
        add     r21=r22,r21;;               // Adjust 1st tabular difference
        add     r20=r21,r20;;               // Gr20 = 3rd square
        addl    r14=@gprel(sq3),gp;;        // Point to storage for 3rd square
        st8     [r14]=r20;;                 // Store 3rd square
done:   mov     r8=0;;                      // Signal completion
        br.ret.sptk.many b0;;               // Return to command line
        .endp   main                        // End procedure 'main'
Figure 1: A slightly modified version of the SQUARES program from Evans and Trimper
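For readers unfamiliar with tabular differences, the arithmetic performed by SQUARES can be restated in Python. This is our own sketch of the same method rather than a translation of the assembler program; the variable names simply mirror the roles of Gr20, Gr21, and Gr22.

```python
def squares_by_differences(count):
    """Return the first `count` squares using only additions."""
    if count <= 0:
        return []
    first_diff = 1     # role of Gr21: 1st tabular difference
    second_diff = 2    # role of Gr22: 2nd tabular difference (constant)
    square = 1         # role of Gr20: current square
    table = [square]
    for _ in range(count - 1):
        first_diff += second_diff   # differences between squares: 3, 5, 7, ...
        square += first_diff        # next square: 4, 9, 16, ...
        table.append(square)
    return table

print(squares_by_differences(3))
```

The method works because consecutive squares differ by successive odd numbers, and those odd numbers in turn differ by the constant 2.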
5 Integer Instructions
A large number of computer programs run simply by manipulating integer data types. Many integer 
operations are provided by the Itanium ISA, so we begin our discussion of the architecture’s instruction set 
with these instructions.
5.1 Arithmetic Instructions
The Itanium integer arithmetic instructions include addition, subtraction, bit-shift left with addition, and 
parallel multiplication of 16-bit values. These Type A instructions are among the most commonly used.
5.1.1 Addition
Several forms of the Itanium addition instruction are available:
  add   r1=r2,r3       // r1 <- r2 + r3
  add   r1=r2,r3,1     // r1 <- r2 + r3 + 1
  adds  r1=imm14,r3    // r1 <- sext(imm14) + r3
  addl  r1=imm22,r3    // r1 <- sext(imm22) + r3
  add   r1=imm,r3      // r1 <- sext(imm) + r3
where sext denotes that the immediate constant is sign-extended to 64 bits before being used in the 
operation. The register designations r1, r2, and r3 refer to the particular encoding found in fields in the bit 
layout of an instruction. Any one of the Itanium general-purpose registers Gr0-Gr127 may be specified with 
this instruction; however, only Gr0-Gr3 may be used as the source register r3 with the addl form.
The last form of the Itanium add instruction is an example of an assembler pseudo-op. A pseudo-op is 
a convenient form of an instruction that the assembler will “transform” according to context. For example, if 
the constant imm can be represented as a two’s complement integer using 14 or fewer bits, the assembler will 
generate the appropriate adds instruction. On the other hand, if the representation of that constant requires 
more than 14 bits, the appropriate addl instruction will be generated. Note that in this case, the choice of 
registers for the second source operand is constrained to one of Gr0-Gr3, as mentioned above.
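The sign extension and the choice between adds and addl can be illustrated with a short Python sketch. The function names are ours, and `select_add_form` is only a guess at the kind of decision an assembler makes for the add pseudo-op, based on the rules stated above; it is not taken from any actual assembler.

```python
def sext(value, bits):
    """Sign-extend the low `bits` bits of value (two's complement)."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit

def fits_signed(value, bits):
    """True if value is representable as a `bits`-bit two's complement integer."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))

def select_add_form(imm, src_reg):
    """Pick adds or addl for 'add r1=imm,r3'; src_reg is the number of r3."""
    if fits_signed(imm, 14):
        return "adds"
    if fits_signed(imm, 22):
        if src_reg > 3:
            raise ValueError("addl requires a source register in Gr0-Gr3")
        return "addl"
    raise ValueError("immediate does not fit in 22 bits")

print(sext(0x3FFF, 14))            # all ones in 14 bits
print(select_add_form(100, 20))
print(select_add_form(10000, 2))
```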
5.1.2 Subtraction
Fewer forms of the Itanium subtraction instruction are available:
  sub   r1=r2,r3       // r1 <- r2 - r3
  sub   r1=r2,r3,1     // r1 <- r2 - r3 - 1
  sub   r1=imm8,r3     // r1 <- sext(imm8) - r3
where the notational conventions are the same as for addition. Note that only one, relatively narrow, 
representation can be used for the immediate constant.
5.1.3 Shift Left and Add
A third integer arithmetic operation is available, and it combines a bit-shift left with addition, as follows:
  shladd  r1=r2,count,r3    // r1 <- 2^count * r2 + r3
where count specifies the number of bit positions, ranging from a minimum of one to a maximum of four, 
that the value in the first source register will be shifted to the left.
Special cases of integer multiplication. When Gr0 is specified as the second source register, the shladd 
instruction can compute 2, 4, 8, or 16 times the value contained in the first source register, depending on 
the value of count. When the two source registers are the same general-purpose register, say Grn, this 
instruction can compute 3, 5, 9, or 17 times the value contained in Grn, again, depending on the value of 
count.
Array indexing. Computing the address of a particular element in an array can be completed in one step 
with the shladd instruction:
  address = (element index) * (size of data type) + (starting address of array) 
becomes
  shladd  addr=index,count,array_addr
where addr is the general-purpose register used to hold the element’s computed address, index is a 
general-purpose register holding the element number (zero-based indexing), count is the base-2 logarithm of 
the size in bytes of the individual array elements (for example, 3 for 8-byte elements), and array_addr is a 
general-purpose register holding the starting address of the array. With the shladd instruction, it is possible 
to work with the whole array using only two general-purpose registers: one containing the array’s starting 
address and one containing the current element number.
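The shift-left-and-add operation is easy to model in Python, which makes the multiplication and array-indexing special cases simple to check. The function below is our own sketch of the behavior described above; results are truncated to 64 bits as they would be in a general-purpose register.

```python
MASK64 = (1 << 64) - 1

def shladd(r2, count, r3):
    """Model of shladd r1=r2,count,r3: r1 <- 2^count * r2 + r3."""
    if not 1 <= count <= 4:
        raise ValueError("count must be between 1 and 4")
    return ((r2 << count) + r3) & MASK64

x = 7
print(shladd(x, 3, 0))        # second source is Gr0: 8 * x
print(shladd(x, 2, x))        # both sources the same: 5 * x

# Address of element 5 in an array of 8-byte elements at a hypothetical 0x2000.
print(hex(shladd(5, 3, 0x2000)))
```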
5.1.4 Multiplication and Division of 64-bit Integers
A multiply or divide instruction requires more stages to implement at the digital logic level than a simple 
addition or subtraction instruction. This requirement implies that multiplication and division instructions 
will take longer to execute than other instructions. RISC and EPIC architectures strive to make instruction 
execution times as consistent as possible across the entire instruction set. As a result, the Itanium architecture 
does not provide instructions for full width integer multiplication or division.
The Itanium ISA does include a special instruction that will perform integer multiplication using the 
floating-point registers. We discuss this topic further in Section 8. No such instruction is available for integer 
division, but virtually all programming environments provide some form of a software substitute. Consult 
your system’s documentation to see which (possibly unpublished) internal routine or inline instruction 
sequence is used.
5.1.5 Multiplication of 16-bit Integers
The Itanium ISA includes two forms of a parallel instruction that multiplies two 16-bit signed integer pairs 
and produces two independent 32-bit signed integer products:
  pmpy2.l  r1=r2,r3    // parallel multiply, left form
  pmpy2.r  r1=r2,r3    // parallel multiply, right form
With the left form, the result of multiplying bits <63:48> of each source register is placed in bits 
<63:32> of the destination register, while the result of multiplying bits <31:16> of both sources is placed 
in <31:0> of the destination.
In contrast, with the right form, the result of multiplying bits <47:32> of each source register is placed 
in bits <63:32> of the destination register, while the result of multiplying bits <15:0> of both sources is 
placed in <31:0> of the destination.
The Itanium ISA provides a number of other parallel operations; these instructions are introduced in 
Section 9.
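To check one's understanding of which 16-bit fields each form multiplies, the two forms of pmpy2 can be modeled in Python. The helper names, including `pack16`, are our own; the bit positions follow the description above.

```python
def signed16(x):
    """Interpret the low 16 bits of x as a signed integer."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def pack16(f3, f2, f1, f0):
    """Build a 64-bit value from four 16-bit fields, f3 in bits <63:48>."""
    fields = (f3, f2, f1, f0)
    return sum((f & 0xFFFF) << (48 - 16 * i) for i, f in enumerate(fields))

def pmpy2(r2, r3, form):
    """Model of pmpy2.l (form='l') and pmpy2.r (form='r')."""
    high_pos, low_pos = {"l": (48, 16), "r": (32, 0)}[form]
    high = signed16(r2 >> high_pos) * signed16(r3 >> high_pos)
    low = signed16(r2 >> low_pos) * signed16(r3 >> low_pos)
    # Each signed product occupies an independent 32-bit half of the result.
    return ((high & 0xFFFFFFFF) << 32) | (low & 0xFFFFFFFF)

a = pack16(3, -2, 5, 7)
b = pack16(4, 6, -5, 9)
print(hex(pmpy2(a, b, "l")))   # products 3*4 and 5*(-5)
print(hex(pmpy2(a, b, "r")))   # products (-2)*6 and 7*9
```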
5.1.6 Special-Case Arithmetic Operations
We saw earlier that the Itanium assembler provides a pseudo-op for the add instruction with an 
immediate constant. Assembler pseudo-ops may be provided for very common or useful operations, largely as a 
convenience to the programmer.
An ISA will typically provide instructions for general operations that include the more common 
operations as special cases. Some architectures may provide machine instructions for the special cases, others may 
provide pseudo-ops, and others still may not provide either. Here, we discuss some common operations for 
which the Itanium ISA does not provide actual machine instructions but that can be written as special cases 
of more general instructions or as assembler pseudo-ops.
Negation. Many architectures include an operation for arithmetic negation of integers in two’s complement 
notation. The Itanium ISA, however, includes neither a machine instruction nor an assembler pseudo-op 
to accomplish this task. Arithmetic negation can be accomplished using either one of two special cases of 
subtraction:
    sub r1=0,r3     // r1 <- 0 - r3 = -r3
    sub r1=r0,r3    // r1 <- r0 - r3 = 0 - r3 = -r3
With either form, the value contained in r3 is subtracted from zero, and r1 will contain the negated value.
Complementation. Likewise, the Itanium ISA does not include an instruction for computing the one’s 
complement (bitwise complement) of a value. Again, either one of two special cases of subtraction can be 
used to accomplish this common task:
    sub r1=-1,r3      // r1 <- -1 - r3
    sub r1=r0,r3,1    // r1 <- r0 - r3 - 1 = 0 - r3 - 1 = -1 - r3
Having examined both negation and complementation, it should now be clear why the sub instruction syntax
specifies that the value in register r3 be subtracted from the immediate constant.
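The two identities used above (0 - x is the two's complement negation, and -1 - x is the bitwise complement) can be verified with 64-bit wraparound arithmetic. This Python sketch is a model under our own naming; the helper sub and its minus1 flag are hypothetical stand-ins for the two sub forms, not an assembler interface.

```python
MASK64 = (1 << 64) - 1

def sub(a, b, minus1=False):
    """Model of sub r1=a,b (or sub r1=a,b,1) with 64-bit wraparound."""
    return (a - b - (1 if minus1 else 0)) & MASK64

r3 = 42
negated = sub(0, r3)                    # sub r1=r0,r3
complement = sub(0, r3, minus1=True)    # sub r1=r0,r3,1
complement_imm = sub(-1, r3)            # sub r1=-1,r3

print(f"{negated:#018x}")       # 0xffffffffffffffd6
print(f"{complement:#018x}")    # 0xffffffffffffffd5
```

Note that the complement is always one less than the negation, which is the familiar rule that negating a two's complement value means complementing it and adding one.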
Copying. The mov instruction used in the SQUARES program is actually an assembler pseudo-op. The 
Itanium ISA lacks a machine instruction that moves data between general-purpose registers or that moves a 
constant value into a register. However, the assembler recognizes the mov pseudo-op and implements two 
forms:
    mov r1=imm22    becomes    addl r1=imm22,r0
    mov r1=r3       becomes    adds r1=0,r3
The second form results in having copies of the same data value in both registers.
Clearing. The Itanium ISA also lacks an instruction to clear the contents of a general-purpose register. 
However, any one of several other instructions will suffice:
    mov r1=0                 // becomes addl r1=0,r0
    sub r1=rn,rn             // r1 <- rn - rn = 0
    shladd r1=r0,count,r0    // the value of count is irrelevant
In all cases, register r1 will contain zero.
5.2 Data Access Instructions
Most modern computer designs include cache structures that attempt to reduce the time required to access 
data stored in memory. In some designs, the presence and type of cache are only matters of concern for the 
implementation, and the ISA will not include instructions for interacting with the cache structures. In other 
designs, the presence and type of cache are matters of concern for both the architecture and the implementation.
In this case, the ISA may include instructions for influencing the behavior of the cache structures.
The Itanium architecture specifies that the cache structures be explicitly visible to the assembly language
programmer. The Itanium ISA includes instructions for prefetching a line of data that will soon be needed by 
a program into the cache and for flushing a line of data that is no longer needed back to memory. We defer 
any further discussion of the Itanium cache structures until Section 13.
At the moment, our concern with the cache structures involves the instruction completers that can be used 
with the integer load and store operations to provide hints to these structures. We discuss the load and store 
instructions below.
5.2.1 Load Instructions
The Itanium ISA includes two forms of the integer load instruction:
    ldsz.ldtype.ldhint r1=[r3]         // r1 <- mem[r3]
    ldsz.ldtype.ldhint r1=[r3],r2      // r1 <- mem[r3]
                                       // r3 <- r3 + r2
    ldsz.ldtype.ldhint r1=[r3],imm9    // r1 <- mem[r3]
                                       // r3 <- r3 + sext(imm9)
    ld8.fill.ldhint r1=[r3]            // fill data and NaT bit
    ld8.fill.ldhint r1=[r3],r2         // fill data and NaT bit
                                       // r3 <- r3 + r2
    ld8.fill.ldhint r1=[r3],imm9       // fill data and NaT bit
                                       // r3 <- r3 + sext(imm9)
where sz is the size (1, 2, 4, or 8 bytes) of the information unit in memory at the address held in register
r3; that many bytes are copied into the lowest-order bytes of register r1. The loaded data
is zero-extended to the full width of the register, as necessary. Note that the load instructions use register
indirect addressing for the source operand and register direct addressing for the destination operand.
Several valid values for ldtype, the load type completer, are available. If this instruction completer is
omitted, then an ordinary load is executed. Some of the other values can be used to indicate ordered, biased,
speculative, and/or advanced loads and are discussed in Section 12.
There are three valid values for ldhint, the load hint completer: none, nt1, and nta. None, which
corresponds to omitting the load hint completer, indicates an ordinary load operation; the processor hardware
then assumes that the program associates temporal locality in the L1 cache with the loaded value. nt1
provides the hint that the program considers the loaded value to have nontemporal locality in just the L1
cache, while nta hints that the program considers the loaded value to have nontemporal locality in all levels
of the memory hierarchy. Using the load hint completers may avoid evicting data from the cache structures
that might be reused.
The load instructions provide for postmodification of the pointer value in register r3 by a full 64-bit
signed value stored in register r2 or by a 9-bit signed constant, with values ranging from -256 to +255.
Finally, the fill form of this instruction loads 8 bytes and the NaT bit associated with register r1. This
form is used to restore register contents when an operating system switches process contexts or when an
application uses a preserved register.
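To make the zero-extension and postmodification rules concrete, the following Python sketch models an ordinary load. It is an illustration under several assumptions of ours: memory is a flat byte array, byte order is little-endian (the architecture supports both orders), and the ldtype, ldhint, and NaT behavior are ignored. The function ld and its keyword arguments are hypothetical names.

```python
MASK64 = (1 << 64) - 1

def sext(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def ld(mem, regs, sz, r1, r3, inc=None, imm9=None):
    """Model of ldsz r1=[r3] with optional postmodification of r3."""
    addr = regs[r3]
    # Reading sz bytes into an unsigned integer zero-extends the value.
    regs[r1] = int.from_bytes(mem[addr:addr + sz], "little")
    if inc is not None:                      # ldsz r1=[r3],r2
        regs[r3] = (addr + regs[inc]) & MASK64
    elif imm9 is not None:                   # ldsz r1=[r3],imm9
        regs[r3] = (addr + sext(imm9, 9)) & MASK64

mem = bytearray(range(0xF0, 0x100))          # 16 bytes: 0xF0 .. 0xFF
regs = {"r1": 0, "r2": 4, "r3": 8}

ld(mem, regs, 2, "r1", "r3", imm9=-8)        # ld2 r1=[r3],-8
after_first = dict(regs)                     # r1 = 0xF9F8, r3 = 0

ld(mem, regs, 1, "r1", "r3", inc="r2")       # ld1 r1=[r3],r2
after_second = dict(regs)                    # r1 = 0xF0, r3 = 4
```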
5.2.2 Store Instructions
The Itanium ISA includes two forms of the integer store instruction:
    stsz.sttype.sthint [r3]=r2         // mem[r3] <- r2
    stsz.sttype.sthint [r3]=r2,imm9    // mem[r3] <- r2
                                       // r3 <- r3 + sext(imm9)
    st8.spill.sthint [r3]=r2           // spill data and NaT bit
    st8.spill.sthint [r3]=r2,imm9      // spill data and NaT bit
                                       // r3 <- r3 + sext(imm9)
where sz is the size (1, 2, 4, or 8 bytes) of the information unit in memory; the lowest-order sz bytes of the
quantity in register r2 are copied to the memory address specified in register r3. Note that the store
instruction uses register direct addressing for the source operand and register indirect addressing for the
destination operand.
There are two valid values for sttype, the store type completer: none and rel. None, which corresponds
to omitting the store type completer, indicates an ordinary store operation. We omit any discussion of
the rel store type completer but recommend the Itanium architecture documentation for further details.
There are two valid values for sthint, the store hint completer: none and nta. None, which corresponds
to omitting the store hint completer, indicates an ordinary store operation; the processor hardware
then assumes that the program associates temporal locality in the L1 cache with the stored value. nta
provides the hint that the program considers the stored value to have nontemporal locality at all levels of the
memory hierarchy. The use of nta may avoid evicting data from the cache structures that might be reused.
The store instructions provide for postmodification of the pointer value in register r3 by a 9-bit signed
constant, with values ranging from -256 to +255.
Finally, the spill form of this instruction stores 8 bytes and the validity bit associated with register r2.
This form is used to save register contents when an operating system switches process contexts or when an
application uses a preserved register.
5.2.3 Move Long Immediate Instruction
The Itanium ISA provides a special instruction, called movl, that can accommodate a 64-bit immediate
value:
    movl r1=imm64    // r1 <- imm64
    movl r1=label    // r1 <- 64-bit address for label
The 64-bit immediate value, or the full 64-bit address of label (determined by the linker), is copied into
the general-purpose register r1. The movl instruction can use any general-purpose register in the range
Gr1-Gr127, unlike the addl instruction we discussed previously. It is often used to establish pointers for
subsequent load and store operations. Note that movl occupies two slots in an Itanium instruction bundle as
a result of the 64-bit immediate value and is therefore one of the few Type X instructions.
5.2.4 Accessing Specialized Registers
Many of the Itanium’s specialized registers contain information that is useful to an application-level program. 
Three mov assembler pseudo-ops provide the ability to copy values between specialized registers and
general-purpose registers:
    mov r1=reg      // r1 <- contents of reg
    mov reg=r2      // reg <- contents of r2
    mov reg=imm8    // reg <- sext(imm8)
Note that the Itanium IP can be read, but not modified, using a mov instruction. Also, the last form, which
uses an 8-bit immediate value as the source operand, can only be employed when the destination is one of the
Itanium application registers (Ar0-Ar127).
5.3 Miscellaneous Integer Instructions
While most of the Itanium integer instructions operate on full 64-bit data, we have seen that the load and store 
operations can also manipulate narrower information units. The load instruction automatically zero-extends 
data that is less than 64 bits wide, but the Itanium ISA also includes a separate instruction to zero-extend
data in a register to the full width. A similar instruction is provided for sign-extension.
5.3.1 Zero-Extend Instruction
Bit-masking is often used to force some bits of a register value to zero. This task can be accomplished with a 
Boolean AND operation. The Itanium architecture also includes an instruction to zero-extend a value to the 
full width of a register:
    zxtxsz r1=r3    // r1 <- zext(r3)
where xsz is 1, 2, or 4 to select the range of bit positions (<63:8>, <63:16>, or <63:32>) in the destination
register r1 that will be set to zero. The lowest-order 1, 2, or 4 bytes are copied from the source register r3.
A full width load followed by zero-extension is preferable to using the narrow load instructions when 
accessing individual bytes in memory is slower than this instruction pair. Which method is faster will typically 
depend on the implementation; nevertheless, either instruction sequence will produce the desired result on 
any Itanium implementation.
5.3.2 Sign-Extend Instruction
Similarly, the Itanium ISA includes an instruction that sign-extends a value to the full width of a register:
    sxtxsz r1=r3    // r1 <- sext(r3)
where xsz is 1, 2, or 4 to select the bit position (7, 15, or 31) in the source register r3 that will be propagated
as the sign bit in the destination register r1. Sign-extension is useful for constructing signed quantities from
small information units that have been loaded by a narrow load instruction.
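The difference between zero-extension and sign-extension is easiest to see on a value whose narrow sign bit is set. The Python sketch below models both instructions on 64-bit values; the function names zxt and sxt mirror the mnemonics but are otherwise our own illustration.

```python
MASK64 = (1 << 64) - 1

def zxt(r3, xsz):
    """Model of zxtxsz: keep the low xsz bytes and clear all higher bits."""
    return r3 & ((1 << (8 * xsz)) - 1)

def sxt(r3, xsz):
    """Model of sxtxsz: copy bit 8*xsz-1 into all higher bit positions."""
    bits = 8 * xsz
    low = r3 & ((1 << bits) - 1)
    if low & (1 << (bits - 1)):
        low |= MASK64 & ~((1 << bits) - 1)
    return low

value = 0x1234_5678_9ABC_DE80
print(f"{zxt(value, 1):#018x}")   # 0x0000000000000080
print(f"{sxt(value, 1):#018x}")   # 0xffffffffffffff80
print(f"{zxt(value, 4):#018x}")   # 0x000000009abcde80
print(f"{sxt(value, 4):#018x}")   # 0xffffffff9abcde80
```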
5.3.3 Instructions for Narrow Data Types
If the full numeric precision or large address space of the Itanium architecture is not necessary, the
effectiveness of the Itanium’s cache structures can actually improve if 32-bit address pointers and narrow
information units are used. The Itanium provides arithmetic instructions for smaller data widths:
•  Several parallel instructions, like the pmpy2 instruction that we have already encountered, that operate
on multiple narrow integer values simultaneously using the Itanium’s 64-bit datapath; and
•  addp4 and shladdp4, which produce 64-bit address pointers from 32-bit addresses; these instructions
are useful for migrating 32-bit code.
We discuss the parallel operations in Section 9 but omit any further details of the other narrow integer 
instructions. Consult the Itanium architecture documentation for more information.
6 Comparison and Branching Instructions
The power and flexibility of computer programming lies in the ability to control the logical flow of execution 
based upon currently calculated conditions. We now consider the features of the Itanium architecture that 
enable programmers to control the program’s flow of execution.
6.1 Comparison Instructions
The Boolean true or false outcome of a comparison operation is typically used to choose between two alternative
code sequences. Comparison operations are thus intimately tied to the concept of predication, where
one set of actions is executed if a given premise is true and a different set is completed if that premise is false.
The Itanium architecture supports predication more fully than any previous architecture. The Itanium predicate
registers, which we introduced in Section 3, can capture the Boolean true or false result of a comparison
operation and thus “control” which statements execute.
The Itanium ISA includes a number of 32- and 64-bit integer, double-precision floating-point, and parallel 
comparison instructions. We discuss the more common integer versions here; Section 8 and Section 9 cover 
the remaining versions.
6.1.1 Signed Comparison
There are six useful cases for comparing two values: equal (=), not equal (!=), less than (<), less than or
equal (<=), greater than or equal (>=), and greater than (>).
Two versions of the signed comparison instruction are supported: cmp, for 64-bit quad word data values, 
and cmp4, for 32-bit double word data values. We describe only the syntax for the 64-bit instruction; the 
32-bit cmp4 operates in exactly the same manner.
Several forms of the instruction are available:
    cmp.crel.ctype p1,p2=r2,r3      // two registers
    cmp.crel.ctype p1,p2=imm8,r3    // immediate and one register
    cmp.crel.ctype p1,p2=r0,r3      // compare 0 to one register
    cmp.crel.ctype p1,p2=r3,r0      // compare one register to 0
where two predicate registers (Pr0-Pr63) must always be specified for p1 and p2. Pr0, which is always true,
may be used in either position.
Typically, the comparison statements are read from left to right: In the two-register form, for example,
p1 is set to true and p2 to false if r2 crel r3 is true, and vice versa if the comparison is false.
There are six valid values for crel, the comparison relationship completer, each of which corresponds
to one of the six comparison cases described above: eq, ne, lt, le, ge, and gt.
Several valid values for ctype, the comparison type completer, exist: none, unc, or, and, or.andcm,
orcm, andcm, and and.orcm. None, which corresponds to omitting the comparison type completer, indicates
an ordinary comparison operation, like the one described above. We discuss the unconditional comparison
type completer unc momentarily. The remaining completers are used with the parallel comparison
operations; parallel operations are introduced in Section 9.
6.1.2 Unsigned Comparison
When comparing two unsigned quantities for equality, the same instructions used for signed values will 
suffice: If the two bit patterns match at all bit positions, then the quantities are equal; otherwise, the quantities 
are not equal.
However, because of the two’s complement representation used by most binary computers, the signed
versions of the comparison instructions will not work for the remaining cases. The Itanium ISA provides four
additional crel completers for dealing with unsigned values: ltu, leu, geu, and gtu.
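A single pair of bit patterns is enough to show why the signed relations fail for unsigned data. In this Python sketch, cmp_lt and cmp_ltu are our own models of cmp.lt and cmp.ltu; each returns the pair (p1, p2) that the instruction would write.

```python
def to_signed(x):
    """Interpret a 64-bit pattern as a two's complement signed integer."""
    return x - (1 << 64) if x & (1 << 63) else x

def cmp_lt(r2, r3):
    """Model of cmp.lt p1,p2=r2,r3: returns (p1, p2)."""
    p1 = to_signed(r2) < to_signed(r3)
    return p1, not p1

def cmp_ltu(r2, r3):
    """Model of cmp.ltu p1,p2=r2,r3: returns (p1, p2)."""
    p1 = r2 < r3
    return p1, not p1

a = 0xFFFF_FFFF_FFFF_FFFF   # -1 when signed, 2**64 - 1 when unsigned
b = 0x0000_0000_0000_0001

print(cmp_lt(a, b))    # (True, False): -1 < 1
print(cmp_ltu(a, b))   # (False, True): 2**64 - 1 is not less than 1
```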
6.1.3 Unconditional Comparison
We introduced the unconditional comparison type completer unc above. Valid forms of the compare instruction
using this completer include:
    cmp.crel.unc p1,p2=r2,r3      // two registers
    cmp.crel.unc p1,p2=imm8,r3    // immediate and one register
    cmp.crel.unc p1,p2=r0,r3      // compare 0 to one register
    cmp.crel.unc p1,p2=r3,r0      // compare one register to 0
where the valid crel values are as before.
An ordinary comparison instruction executes and sets both predicate registers according to the comparative
test when predicated true, but does nothing at all when predicated false. In the latter case, the values
contained in the predicate registers will not change.
The unconditional comparison instruction behaves in exactly the same manner when predicated true. However,
if predicated false, this form sets the values in both predicate registers to false without actually performing
a comparison. This form of the instruction is useful for constructing nested if...then...else
structures, as we show in Section 10.
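The three cases (predicated true, ordinary and predicated false, unconditional and predicated false) can be sketched as a small model. This Python example is an illustration only: the predicate file is a plain list, cmp_eq is a hypothetical helper standing in for (qp) cmp.eq and (qp) cmp.eq.unc, and the choice of Pr5 as the qualifying predicate is arbitrary.

```python
def cmp_eq(pr, qp, p1, p2, r2, r3, unc=False):
    """Model of (qp) cmp.eq[.unc] p1,p2=r2,r3 on the predicate file pr."""
    if pr[qp]:
        # Predicated true: both forms perform the comparison.
        result = (r2 == r3)
        pr[p1], pr[p2] = result, not result
    elif unc:
        # Predicated false, unconditional form: clear both targets.
        pr[p1], pr[p2] = False, False
    # Predicated false, ordinary form: the targets keep their old values.
    pr[0] = True   # Pr0 always reads as true

pr = [True] + [False] * 63
pr[1] = pr[2] = True                      # stale values from earlier code

cmp_eq(pr, 5, 1, 2, 7, 7)                 # (p5) cmp.eq p1,p2=...; p5 is false
ordinary = (pr[1], pr[2])                 # (True, True): unchanged

cmp_eq(pr, 5, 1, 2, 7, 7, unc=True)       # (p5) cmp.eq.unc p1,p2=...
unconditional = (pr[1], pr[2])            # (False, False): both cleared

cmp_eq(pr, 0, 1, 2, 7, 7)                 # predicated on p0, always executes
executed = (pr[1], pr[2])                 # (True, False)
```

Clearing both targets is what makes nesting safe: instructions guarded by either predicate of an inner comparison are suppressed whenever the outer condition is false.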
6.2 Branch Instructions
The Itanium ISA provides numerous branching abilities, among them the simple conditional and unconditional
branch types. Five forms involving several instruction completers are available:
    (qp) br.brtype.bwh.ph.dh target25     // relative to IP
    (qp) br.brtype.bwh.ph.dh b2           // indirect addressing
    br.ph.dh target25                     // unconditional (pseudo-op)
    br.ph.dh b2                           // unconditional (pseudo-op)
    (qp) brl.brtype.bwh.ph.dh target64    // relative to IP
where qp specifies the qualifying predicate register.
The branch target address can be specified using IP-relative addressing or indirect addressing with the
branch register b2. If IP-relative addressing is used, the programmer can specify the address of the target
instruction bundle using a symbolic label. The compiler or assembler will compute the appropriate 25-bit
signed offset as target25 - IP, resulting in a branch range of 2^24 bytes (2^20 instruction bundles).
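The range figures follow directly from the offset width: a 25-bit signed byte offset spans -2^24 to 2^24 - 1, and dividing by the 16-byte (128-bit) bundle size gives 2^20 bundles in each direction. The Python sketch below checks whether a target is reachable; br_offset is a hypothetical helper, and the addresses in the usage lines are arbitrary.

```python
BUNDLE_BYTES = 16   # one Itanium instruction bundle is 128 bits

def br_offset(ip, target):
    """Return target - ip if it fits a 25-bit signed, bundle-aligned offset."""
    if ip % BUNDLE_BYTES or target % BUNDLE_BYTES:
        raise ValueError("addresses must be bundle-aligned")
    offset = target - ip
    if not -(1 << 24) <= offset < (1 << 24):
        raise ValueError("target out of range for an IP-relative br")
    return offset

ip = 0x4000_0000
farthest_forward = br_offset(ip, ip + (1 << 24) - BUNDLE_BYTES)
farthest_backward = br_offset(ip, ip - (1 << 24))
bundles_each_way = (1 << 24) // BUNDLE_BYTES   # 2**20
```

A target that falls outside this window needs one of the longer-range mechanisms.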
To execute longer-range jumps, it is possible to load a full 64-bit address into one of the eight branch
registers (Br0-Br7) and then specify that register as the branch target. The Itanium 2 processor includes
hardware support for the brl instruction, which encodes a 64-bit offset using two slots of an instruction
bundle.
There are a total of ten valid values for brtype, the branch type completer: none, cond, call, ret,
ia, cloop, ctop, cexit, wtop, and wexit. Note that none is synonymous with cond. We defer
discussions of some of the remaining completers until we describe the programming constructs that utilize
them, while discussions of other branch type completers are omitted altogether; for these, we recommend the
Itanium architecture documentation.
There are four valid values for bwh, the branch whether completer: spnt, sptk, dpnt, and dptk. The
programmer or compiler can statically (s) predict (p) whether a branch will be taken (tk) or not taken (nt);
prediction can also be completed dynamically (d) by the hardware. Branch prediction is an
implementation-dependent feature, so the sptk hint is always available.
There are three valid values for ph, the prefetch hint completer: none, few, and many. Note that none is
synonymous with few. This hint indicates how many lines should be prefetched into the Itanium instruction
cache, beginning with the target instruction bundle.
There are two choices for dh, the branch cache deallocation hint completer: none and clr. This hint
indicates whether the small cache dedicated to branch target addresses should be left alone (none) or flushed
(clr). Note that the existence of such a cache is an implementation-dependent feature.
All branches incur time penalties. The Itanium architecture takes great pains to reduce the impact of 
branch instructions and includes many features that improve branch performance. For instance, the execution 
of a predicated branch instruction in a B-unit can be overlapped with the immediately prior comparison 
instruction that is computing the branch’s predicate result in an I- or M-unit. The impact of the time required 
to compute the branch target’s address is thus reduced by executing these instructions in parallel. This ability 
allows zero latency between the compare and branch instructions.
When combined with predication, the branching instructions give the programmer a powerful tool for 
controlling the logical flow of their programs.
7 Logical and Bit-Level Instructions
As we have seen, most Itanium instructions operate on the full 64-bit data value contained in a register. 
However, several instructions that can be used to manipulate individual bits within a register are also available; 
we discuss these instructions here.
7.1 Logical Instructions
Several Itanium logical instructions are available:
    and r1=r2,r3        // r1 <- r2 & ~0 & r3 = r2 & r3
    and r1=imm8,r3      // r1 <- sext(imm8) & r3
    andcm r1=r2,r3      // r1 <- r2 & ~r3
    andcm r1=imm8,r3    // r1 <- sext(imm8) & ~r3
    or r1=r2,r3         // r1 <- r2 | r3
    or r1=imm8,r3       // r1 <- sext(imm8) | r3
    xor r1=r2,r3        // r1 <- r2 ^ r3
    xor r1=imm8,r3      // r1 <- sext(imm8) ^ r3
where &, |, and ^ denote the Boolean AND, OR, and XOR operations, and ~ denotes the bitwise complement.
The logical functions supported by the Itanium ISA can be used to set, clear, toggle, and test individual 
bits within a value using bit-masking techniques. Evans and Trimper discuss the details of using the Itanium 
logical instructions to accomplish these common programming tasks (Section 6.1.3, page 158).
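As a brief illustration of those bit-masking tasks, the Python sketch below models the four register forms on 64-bit values and uses them to set, clear, toggle, and test bit 5. The function names are our own, and andcm is taken to compute r2 AND (NOT r3), as in the Intel instruction reference.

```python
MASK64 = (1 << 64) - 1

def and_(r2, r3):
    return r2 & r3

def andcm(r2, r3):
    """AND with the complement of the second operand."""
    return r2 & ~r3 & MASK64

def or_(r2, r3):
    return r2 | r3

def xor(r2, r3):
    return r2 ^ r3

mask = 1 << 5                       # select bit 5
value = 0b1000_0001

with_bit = or_(value, mask)         # set:    0b1010_0001
without_bit = andcm(with_bit, mask) # clear:  0b1000_0001
toggled = xor(value, mask)          # toggle: 0b1010_0001
is_set = and_(with_bit, mask) != 0  # test:   True
```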
7.2 Bit-Level Instructions
We have already seen the shladd arithmetic instruction, which combines bit-shifting with integer addition.
The Itanium ISA includes several other useful bit manipulation instructions.
7.2.1 Bit-Shift Instructions
Several forms of the Itanium bit-shift instructions are available:
    shl r1=r3,r2          // r1 <- r3 shifted left r2 bits
    shl r1=r3,count6      // r1 <- r3 shifted left count6 bits
    shr r1=r3,r2          // r1 <- r3 shifted right r2 bits
    shr r1=r3,count6      // r1 <- r3 shifted right count6 bits
    shr.u r1=r3,r2        // r1 <- r3 shifted right r2 bits
                          // unsigned
    shr.u r1=r3,count6    // r1 <- r3 shifted right count6 bits
                          // unsigned
where count6 is a 6-bit unsigned immediate that specifies the shift count. Note that left shifts with shl are
always unsigned, while shr produces an arithmetic shift, unless the .u instruction completer is specified.
If an immediate value is used with these instructions, then shl is actually an assembler pseudo-op for a
special case of the Itanium deposit instruction, while shr and shr.u are pseudo-ops for special cases of the
Itanium extract instruction. These instructions are introduced momentarily.
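The distinction between the arithmetic and unsigned right shifts matters only when bit 63 is set. The following Python sketch models the immediate forms; it is an illustration with our own function names and restricts the count to the 0 to 63 range of count6.

```python
MASK64 = (1 << 64) - 1

def shl(r3, count6):
    """Model of shl r1=r3,count6: bits shifted past bit 63 are lost."""
    return (r3 << count6) & MASK64

def shr_u(r3, count6):
    """Model of shr.u r1=r3,count6: vacated bits are filled with zeros."""
    return r3 >> count6

def shr(r3, count6):
    """Model of shr r1=r3,count6: vacated bits are copies of bit 63."""
    signed = r3 - (1 << 64) if r3 & (1 << 63) else r3
    return (signed >> count6) & MASK64

value = 0x8000_0000_0000_0010
print(f"{shr(value, 4):#018x}")     # 0xf800000000000001
print(f"{shr_u(value, 4):#018x}")   # 0x0800000000000001
print(f"{shl(value, 4):#018x}")     # 0x0000000000000100
```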
7.2.2 Shift Right Pair Instruction
The Itanium ISA also includes a “long shift” instruction, which shifts a bit pattern that is twice the width of 
a single register:
    shrp r1=r2,r3,count6    // r1 = [r2:r3] shifted right count6 bits
This instruction treats the two 64-bit source registers r2 and r3 as a “single” 128-bit value and shifts the bits
count6 positions to the right. The rightmost 64 bits of the result are placed in the destination register r1.
The shrp instruction can be used to rotate the bit pattern if the same register is specified for both source
operands.
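The rotation trick can be seen in a short model. In this Python sketch, shrp follows the description above, and rotate_right is a hypothetical convenience wrapper that passes the same value for both sources.

```python
MASK64 = (1 << 64) - 1

def shrp(r2, r3, count6):
    """Model of shrp r1=r2,r3,count6: low 64 bits of [r2:r3] >> count6."""
    return (((r2 << 64) | r3) >> count6) & MASK64

def rotate_right(value, count6):
    """Rotate a 64-bit value right by using it as both shrp sources."""
    return shrp(value, value, count6)

value = 0x0123_4567_89AB_CDEF
print(f"{rotate_right(value, 8):#018x}")   # 0xef0123456789abcd
print(f"{shrp(0x1, 0x0, 1):#018x}")        # 0x8000000000000000
```

A left rotation by n bits is obtained by rotating right by 64 - n bits.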
7.2.3 Extract and Deposit Instructions
Two instructions that enable the programmer to read or write any number of bits within a register are available.
The Itanium extract instruction isolates some contiguous block of bits from the source register and places 
those bits, right-justified, into the destination register:
    extr r1=r3,pos6,len6      // signed form
    extr.u r1=r3,pos6,len6    // unsigned form
where pos6 is a bit position (in the range <63:0>) and len6 specifies the number of bits to extract. Bits
<pos6+len6-1:pos6> from the source register r3 are copied into the destination register r1 as bits
<len6-1:0>. If the .u instruction completer is used, the remaining bits in the destination are set to zero;
otherwise, bit <pos6+len6-1> from the source is propagated as the sign bit.
The Itanium deposit instruction isolates a contiguous span of bits from the right-hand side of the source 
register and repositions that span anywhere within the destination register:
    dep.z r1=r2,pos6,len6       // zero form
    dep.z r1=imm8,pos6,len6     // zero form, with immediate
    dep r1=r2,r3,pos6,len4      // merge form
    dep r1=imm1,r3,pos6,len6    // merge form, with immediate
where pos6 is a bit position (in the range <63:0>), len4 and len6 specify the number of bits to deposit,
and imm1 and imm8 are 1- and 8-bit immediate values.
The zero form copies bits <len6-1:0> from the source register r2 into the destination register r1 as bits
<pos6+len6-1:pos6>, setting all other bits of the destination to zero. If an immediate value is used, it is
first sign-extended before bits <len6-1:0> are deposited into the destination.
The merge form copies bits <len4-1:0> from the source register r2 into the destination register r1
as bits <pos6+len4-1:pos6>, and all other bits of the destination are copied from the corresponding
positions within the source register r3. Note that at most 16 bits can be copied from register r2, a result of
the 4-bit width of len4. If an immediate value is used, it is sign-extended and bits <len6-1:0> serve as
the first source segment.
Figure 2 (next page) demonstrates how these instructions operate, using quad word operands. In this 
figure, pos6=32 and len6=16.
Figure 2: The Itanium extract and deposit instructions
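Using the same field as Figure 2 (pos6=32 and a 16-bit length), the extract and deposit operations can be modeled in a few lines. This Python sketch covers only the register forms; the function names and operand values are our own illustration.

```python
MASK64 = (1 << 64) - 1

def extr(r3, pos6, len6, unsigned=False):
    """Model of extr / extr.u: right-justify bits <pos6+len6-1:pos6>."""
    field = (r3 >> pos6) & ((1 << len6) - 1)
    if not unsigned and field & (1 << (len6 - 1)):
        field |= MASK64 & ~((1 << len6) - 1)   # propagate the sign bit
    return field

def dep_z(r2, pos6, len6):
    """Model of dep.z: place the low len6 bits of r2 at pos6, zero elsewhere."""
    return ((r2 & ((1 << len6) - 1)) << pos6) & MASK64

def dep(r2, r3, pos6, length):
    """Model of dep (merge form): like dep.z, but other bits come from r3."""
    field_mask = (((1 << length) - 1) << pos6) & MASK64
    return (r3 & ~field_mask & MASK64) | (dep_z(r2, pos6, length) & field_mask)

src = 0x1111_ABCD_2222_3333
new = 0x9999_8888_7777_BEEF

print(f"{extr(src, 32, 16, unsigned=True):#018x}")   # 0x000000000000abcd
print(f"{extr(src, 32, 16):#018x}")                  # 0xffffffffffffabcd
print(f"{dep_z(new, 32, 16):#018x}")                 # 0x0000beef00000000
print(f"{dep(new, src, 32, 16):#018x}")              # 0x1111beef22223333
```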
7.2.4 Single-Bit Test Instruction
The Itanium ISA also provides the tbit instruction to set qualifying predicates based on a test of any single
bit within a 64-bit register:
    tbit.trel.ctype pt,pf=r3,pos6
where pt and pf are predicate registers and pos6 is an unsigned value that encodes the particular bit position
(in the range <63:0>) within the register r3 that will be tested.
There are two valid values for trel, the test relationship completer: nz (nonzero) and z (zero). The
predicate register pt is set to true and pf to false if the tested bit satisfies the specified trel relationship,
and vice versa if it does not.
The valid values for the ctype completer are the same as those for the other comparison operations,
which were discussed in Section 6.
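A minimal model of the ordinary (no ctype) form makes the role of trel explicit. In this Python sketch, tbit is our own function and returns the pair (pt, pf) that the instruction would write.

```python
def tbit(r3, pos6, trel="nz"):
    """Model of tbit.trel pt,pf=r3,pos6: returns (pt, pf)."""
    bit = (r3 >> pos6) & 1
    pt = (bit == 1) if trel == "nz" else (bit == 0)
    return pt, not pt

flags = 0b100
print(tbit(flags, 2, "nz"))   # (True, False): bit 2 is nonzero
print(tbit(flags, 2, "z"))    # (False, True)
print(tbit(flags, 0, "z"))    # (True, False): bit 0 is zero
```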
8 Floating-Point Instructions
Floating-point data types and the now standard IEEE representations were introduced in Section 2, while the 
Itanium floating-point registers were introduced in Section 3. We turn our attention to the Itanium instructions 
that operate on the floating-point data types using the associated registers.
Many of the Itanium integer instructions have direct floating-point counterparts, while others do not. 
Those that do are listed in Table 9 with their floating-point analogues. Those integer instructions that do 
not have similar floating-point versions include the bit-shift instructions and data-independent branching 
instructions. Similarly, many floating-point instructions that we will discuss have no corresponding integer 
version.
Type of Instruction    Integer               Floating-Point
Arithmetic             add                   fadd (pseudo-op)
                       sub                   fsub (pseudo-op)
                       xmpy (pseudo-op)      fmpy (pseudo-op)
Load & Store           ld1, ld2, ld4, ld8    ldfs, ldfd, ldfe, ldf8

                       or                    for
                       xor                   fxor
Table 9: Comparison of the Itanium integer and floating-point instructions
You will recall from our earlier discussions that the Itanium’s 128 floating-point registers are 82 bits wide. 
This seemingly strange 82-bit width was chosen to accommodate Intel’s 80-bit extended double-precision 
format, which has been carried over from the IA-32 architecture. As a consequence, these registers enable 
greater accuracy when using intermediate, “register-only” results. Floating-point values are converted to the 
appropriate IEEE format only when storing the final results of a calculation in memory.
In addition to the single- and double-precision floating-point formats, the IEEE standard defines encodings
for various “special” values, like positive and negative infinity, for example. Table 10 lists these
encodings and their meanings. Note that denormal numbers are those whose fractions have not been shifted 
far enough to the left to give the significand the “hidden bit” mentioned in Section 2. These numbers have a 
value between zero and the smallest of the normalized numbers given in Table 2 (on page 4).
Biased Exponent    Fraction    Meaning
All ones           Nonzero     NaN
All ones           Zero        ± Infinity
Zero               Nonzero     ± Denormal
Zero               Zero        Zero
Other              Any         Nonzero, normalized
Table 10: Meanings of the special IEEE floating-point representations
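The rules in Table 10 translate directly into a classification routine. The Python sketch below applies them to the IEEE double-precision format (11-bit biased exponent, 52-bit fraction); classify_double is our own helper, and it examines Python floats rather than Itanium register contents.

```python
import struct

def classify_double(x):
    """Classify a Python float by the fields of its IEEE double encoding."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    exponent = (bits >> 52) & 0x7FF       # 11-bit biased exponent
    fraction = bits & ((1 << 52) - 1)     # 52-bit fraction
    if exponent == 0x7FF:
        return "NaN" if fraction else "Infinity"
    if exponent == 0:
        return "Denormal" if fraction else "Zero"
    return "Normalized"

for sample in (float("nan"), float("-inf"), 5e-324, -0.0, 1.0):
    print(sample, classify_double(sample))
```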
8.1 Arithmetic Instructions
The Itanium architecture, like most modern designs, includes native support for some common floating-point
arithmetic operations.
8.1.1 Addition, Subtraction, and Multiplication
The Itanium ISA provides for addition, subtraction, and multiplication for floating-point data types:
    fadd.pc.sf f1=f3,f2     // f1 <- f3 + f2
    fsub.pc.sf f1=f3,f2     // f1 <- f3 - f2
    fmpy.pc.sf f1=f3,f4     // f1 <- f3 * f4
    fnmpy.pc.sf f1=f3,f4    // f1 <- -(f3 * f4)
where f1, f2, f3, and f4 can be any of the Itanium floating-point registers (Fr0-Fr127). However, unlike
their integer counterparts, these floating-point instructions do not support immediate constants of any sort.
There are three valid values for pc, the precision completer: none, s, and d. None, which corresponds
to omitting the precision completer, is used to handle special circumstances like the IA-32 double extended
format. Our focus will be on the IEEE single- (s) and double- (d) precision.
Five valid values for sf, the status field completer, are available: none, s0, s1, s2, and s3. None
is synonymous with s0. These values refer to four settings in a floating-point status register that we do
not describe further. The default value (none) is used throughout the remainder of this section. Evans and
Trimper describe the status field completer briefly (Section 8.4.5 on page 242). For the details concerning the
floating-point status registers, consult the Itanium architecture documentation.
Each of the basic arithmetic operations described here is actually an assembler pseudo-op for special cases 
of the more general “fused” floating-point arithmetic instructions. We introduce these powerful instructions 
next.
8.1.2 Fused Multiply-Add and Multiply-Subtract
Three instructions that multiply two source operands and then add or subtract a third operand are also
available:
fma.pc.sf   f1=f3,f4,f2    // f1 <- f3*f4+f2
fms.pc.sf   f1=f3,f4,f2    // f1 <- f3*f4-f2
fnma.pc.sf  f1=f3,f4,f2    // f1 <- -(f3*f4)+f2
where f1, f2, f3, and f4 can again be any of the Itanium floating-point registers (Fr0-Fr127). Note that the
intermediate product of f3 and f4 is not rounded in any way before adding or subtracting f2, so the final
value is rounded only once and is as accurate as the destination precision allows.
The valid values for pc and sf are the same as those for the non-fused assembler pseudo-ops described
above.
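The effect of skipping the intermediate rounding can be illustrated in Python. The sketch below is not Itanium code: it emulates a fused multiply-add with exact rational arithmetic (fractions.Fraction) followed by a single rounding to double precision, and compares it with an ordinary multiply followed by an add. The function names are hypothetical.

```python
from fractions import Fraction

def fma_emulated(f3: float, f4: float, f2: float) -> float:
    """Compute f3*f4 + f2 exactly, then round once to double precision."""
    exact = Fraction(f3) * Fraction(f4) + Fraction(f2)
    return float(exact)  # the only rounding step

def mul_then_add(f3: float, f4: float, f2: float) -> float:
    """Separate multiply and add: the product is rounded before the addition."""
    return f3 * f4 + f2

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
fused = fma_emulated(a, b, -1.0)     # keeps the -2**-60 term of the exact product
unfused = mul_then_add(a, b, -1.0)   # product rounds to 1.0, so the sum is 0.0
```

Here the exact product is 1 - 2**-60, which is not representable as a double. The unfused version loses that term entirely, while the fused version preserves it.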
8.1.3 Reciprocal and Square Root Approximations
The IEEE standard also includes requirements for division, remainder, and unary square root operations with 
floating-point data types. Some RISC architectures provide full hardware support for these operations, but 
because they take longer to execute than other instructions, these operations tend to cause pipeline stalls.
In contrast, the Itanium architecture employs lookup tables that store approximations to reciprocals and 
reciprocal square roots with a known accuracy. The execution time for a table lookup is the same as for the 
fma instruction. The approximation obtained by these instructions can be refined to the desired accuracy 
using series expansion. As a result of instruction-level parallelism, this approach can be as fast as, if not 
faster than, the hardware-only implementations.
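As an illustration of the refinement idea, the following Python sketch applies Newton-Raphson iteration to a deliberately coarse reciprocal. This is one common refinement scheme and not necessarily the exact sequence a given compiler emits. The table lookup is modeled by rounding the true reciprocal to about 8 significant bits, which is an assumption made for the example, and all function names are our own.

```python
import math

def approx_reciprocal(b: float, bits: int = 8) -> float:
    """Stand-in for a table lookup: 1/b rounded to roughly `bits` significant bits."""
    mantissa, exponent = math.frexp(1.0 / b)          # mantissa in [0.5, 1)
    scale = 1 << bits
    return math.ldexp(round(mantissa * scale) / scale, exponent)

def refine_reciprocal(b: float, y: float, steps: int = 3) -> float:
    """Newton-Raphson refinement; each step has the shape of two fused operations."""
    for _ in range(steps):
        e = 1.0 - b * y      # remaining error (an fnma-shaped operation)
        y = y + y * e        # correction (an fma-shaped operation)
    return y

seed = approx_reciprocal(3.0)
refined = refine_reciprocal(3.0, seed)
```

Each step squares the relative error, so a seed that is good to about 8 bits reaches double precision in three steps.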
Floating-Point Reciprocal Approximation. Using two source operands and two destination operands, the
reciprocal approximation instruction can compute either an approximate reciprocal or an IEEE-mandated
quotient:
frcpa.sf f1,p2=f2,f3    // p2 = 1 and f1 <- 1/f3, or
                        // p2 = 0 and f1 <- f2/f3
where sf can take on any of the valid values from the previously discussed floating-point instructions. If the
instruction is used with a qualifying predicate and that predicate is zero, then the predicate register p2 is set
to zero and the contents of f1 remain unmodified. If the qualifying predicate is one, then either:
•  p2 is set to one and f1 is set to the reciprocal of f3, or
•  p2 is set to zero and f1 is set to the IEEE-mandated quotient, f2/f3.
Evans and Trimper include a brief discussion demonstrating how this instruction can be used to compute
a refined result (Section 8.8.1, page 254). Their example follows that given by Peter Markstein in his book
IA-64 and Elementary Functions: Speed and Precision. We recommend both of these resources for more
details concerning the use of the frcpa instruction.
Floating-Point Reciprocal Square Root Approximation. In the same manner, the frsqrta instruction
computes either an approximate reciprocal square root or an IEEE-mandated square root:
frsqrta.sf f1,p2=f2,f3    // p2 = 1 and f1 <- 1/sqrt(f3), or
                          // p2 = 0 and f1 <- sqrt(f3)
where the valid values for sf are as before. If the instruction is used with a qualifying predicate and that
predicate is zero, then the predicate register p2 is set to zero and the contents of f1 remain unmodified. If
the qualifying predicate is one, then either:
•  p2 is set to one and f1 is set to the reciprocal of sqrt(f3), or
•  p2 is set to zero and f1 is set to the IEEE-mandated square root, sqrt(f3).
Here, too, Evans and Trimper include a brief discussion demonstrating how this instruction can be used to 
compute a refined result (Section 8.8.2, page 255), again following Markstein. For the details, please consult 
either of these excellent resources.
Floating-Point Division. As noted, the Itanium architecture does not include an instruction for floating-point
division, but virtually all programming environments provide some form of a software substitute. Consult
your system’s documentation to see which (possibly unpublished) internal routine or inline instruction
sequence is used.
Markstein discusses numerous algorithms that compute these and various other mathematical functions 
using the Itanium instruction set while conforming to the IEEE conventions for rounding and exception 
reporting. Refer to his book for the details of using the Itanium floating-point instructions to compute refined 
arithmetic results.
8.1.4 Maximum and Minimum Instructions
The Itanium ISA includes instructions for determining the maximum or minimum of two floating-point
values:
famax.sf f1=f2,f3    // f1 <- larger of f2 and f3 (absolute value)
famin.sf f1=f2,f3    // f1 <- smaller of f2 and f3 (absolute value)
fmax.sf  f1=f2,f3    // f1 <- larger of f2 and f3
fmin.sf  f1=f2,f3    // f1 <- smaller of f2 and f3
where the valid values for sf are as before. If the values in registers f2 and f3 are equal in value (fmax,
fmin) or magnitude (famax, famin), then register f1 is set to the value of f3.
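The tie-breaking rule matters in a few corner cases, such as signed zeros or operands of equal magnitude and opposite sign. The Python sketch below models only the rule stated above (f3 wins a tie); it makes no claim about NaN or NaTVal handling, and the function names simply mirror the mnemonics for readability.

```python
import math

def fmax(f2: float, f3: float) -> float:
    """Larger of f2 and f3; f3 is returned when the two compare equal."""
    return f2 if f2 > f3 else f3

def famax(f2: float, f3: float) -> float:
    """Operand with the larger magnitude; f3 is returned on a magnitude tie."""
    return f2 if abs(f2) > abs(f3) else f3

tie_value = fmax(0.0, -0.0)        # the zeros compare equal, so f3 (-0.0) is chosen
tie_magnitude = famax(-2.0, 2.0)   # equal magnitudes, so f3 (+2.0) is chosen
```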
8.1.5 Normalization
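The Itanium ISA includes an assembler pseudo-op for “normalizing” and rounding a floating-point value
after a series of calculations:
fnorm.pc.sf f1=f3    becomes    fma.pc.sf f1=f3,f1,f0
where the valid values for pc and sf are as before. In the fma form, the source operands f1 and f0 name the
fixed registers Fr1 and Fr0, which always read as +1.0 and +0.0, so the instruction computes f3*1.0+0.0
and rounds the result to the precision selected by pc.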
8.2 Data Access Instructions
Like their integer analogues, the floating-point load and store operations can use instruction completers to 
provide hints to the Itanium cache structures regarding how the program expects the values to be used. We 
discuss the available floating-point load and store instructions below.
8.2.1 Load Instructions
The Itanium ISA includes instructions for loading floating-point values in several different forms.
Standard floating-point load. There are three forms of the standard floating-point load instructions:
ldffsz.fldtype.ldhint   f1=[r3]         // f1 <- mem[r3]
ldffsz.fldtype.ldhint   f1=[r3],r2      // f1 <- mem[r3]
                                        // r3 <- r3 + r2
ldffsz.fldtype.ldhint   f1=[r3],imm9    // f1 <- mem[r3]
                                        // r3 <- r3 + sext(imm9)
ldf8.fldtype.ldhint     f1=[r3]         // f1<63:0> <- mem[r3]
ldf8.fldtype.ldhint     f1=[r3],r2      // f1<63:0> <- mem[r3]
                                        // r3 <- r3 + r2
ldf8.fldtype.ldhint     f1=[r3],imm9    // f1<63:0> <- mem[r3]
                                        // r3 <- r3 + sext(imm9)
ldf.fill.ldhint         f1=[r3]         // f1 <- mem[r3]
ldf.fill.ldhint         f1=[r3],r2      // f1 <- mem[r3]
                                        // r3 <- r3 + r2
ldf.fill.ldhint         f1=[r3],imm9    // f1 <- mem[r3]
                                        // r3 <- r3 + sext(imm9)
where fsz is the size of the information unit at the address specified in register r3 from which a converted
value is placed into register f1. The valid values for fsz are s for a single-precision value, d for a
double-precision value, and e for an IA-32 80-bit extended double-precision value, giving the mnemonics
ldfs, ldfd, and ldfe. Note that the load instruction uses register indirect addressing for the source operand
and register direct addressing for the destination.
There are several valid values for fldtype, the load type completer. None, which corresponds to omitting
the load type completer, indicates an ordinary load operation. The remaining types correspond to a check
load, speculative load, or advanced load, and are considered further in Section 12.
Three valid values for ldhint, the load hint completer, exist: none, nt1, and nta. These completers
provide the same hints to the Itanium cache structures as their integer counterparts, which were discussed in
Section 5.
Also like the integer versions, the floating-point load instructions provide for postmodification of the
pointer value in register r3 by a full 64-bit signed value stored in register r2 or by a 9-bit signed constant,
with values ranging from -256 to +255.
The ldf8 form of this instruction loads 8 bytes from the quad word memory location specified in register
r3 into the significand bits (<63:0>) of register f1. This form is typically used to load integer data types
into a floating-point register. Accordingly, the sign bit (<81>) is set to zero and the biased exponent field
(bits <80:64>) is set to 0x1003E (2^63) to indicate that the value in the significand bits should be interpreted
as a 64-bit integer.
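To see why the exponent value 0x1003E marks the significand as an integer, the register contents can be decoded by hand. The Python sketch below assumes the register format uses an exponent bias of 0xFFFF and a significand with an explicit integer bit at position 63; neither detail is stated in this section, so treat both as assumptions. The function name is hypothetical.

```python
from fractions import Fraction

EXPONENT_BIAS = 0xFFFF       # assumed bias of the 17-bit exponent field
INTEGER_EXPONENT = 0x1003E   # biased exponent written by ldf8

def register_value(sign: int, biased_exponent: int, significand: int) -> Fraction:
    """Exact value of a (sign, exponent, significand) register triple."""
    # The significand is read as a fixed-point number with the binary point
    # just after bit 63, that is, significand / 2**63.
    scale = Fraction(2) ** (biased_exponent - EXPONENT_BIAS)
    magnitude = Fraction(significand, 2**63) * scale
    return -magnitude if sign else magnitude

loaded = register_value(0, INTEGER_EXPONENT, 42)   # register contents after ldf8 of 42
```

Because 0x1003E - 0xFFFF = 63, the scale factor of 2^63 cancels the implied binary point and the value of the register equals the integer held in the significand.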
Finally, the fill form of this instruction loads 16 bytes and the appropriate fields are placed into register
f1 without conversion. This form is used to restore register contents when an operating system switches
process contexts or when an application uses a preserved register.
Floating-point load pair. The Itanium ISA includes an instruction that will load a pair of floating-point 
values. Several forms are available:
ldfps.fldtype.ldhint   f1,f2=[r3]       // f1 <- mem[r3]
                                        // f2 <- mem[r3+4]
ldfps.fldtype.ldhint   f1,f2=[r3],8     // f1 <- mem[r3]
                                        // f2 <- mem[r3+4]
                                        // r3 <- r3 + 8
ldfpd.fldtype.ldhint   f1,f2=[r3]       // f1 <- mem[r3]
                                        // f2 <- mem[r3+8]
ldfpd.fldtype.ldhint   f1,f2=[r3],16    // f1 <- mem[r3]
                                        // f2 <- mem[r3+8]
                                        // r3 <- r3 + 16
ldfp8.fldtype.ldhint   f1,f2=[r3]       // f1<63:0> <- mem[r3]
                                        // f2<63:0> <- mem[r3+8]
ldfp8.fldtype.ldhint   f1,f2=[r3],16    // f1<63:0> <- mem[r3]
                                        // f2<63:0> <- mem[r3+8]
                                        // r3 <- r3 + 16
Data from two successive information units (starting at the address specified in register r3) are converted
and placed into the destination registers f1 and f2. Note that one of the destination registers must be an
odd-numbered register and the other even-numbered, but they do not have to be consecutively numbered.
(For example, specifying f9 and f12 as the destination registers is valid.) Other restrictions apply; consult
the Itanium architecture documentation for the details.
Like the standard floating-point load instruction, this operation provides for postmodification of the
pointer value in register r3 by an amount equal to the aggregate size of the two values, and is useful for
stepping through an array, for example.
The valid values for fldtype and ldhint are the same as those for the standard floating-point load
instructions.
8.2.2 Store Instructions
The Itanium ISA provides three forms of the floating-point store instruction:
stffsz.sthint      [r3]=f2         // mem[r3] <- f2
stffsz.sthint      [r3]=f2,imm9    // mem[r3] <- f2
                                   // r3 <- r3 + sext(imm9)
stf8.sthint        [r3]=f2         // mem[r3] <- f2<63:0>
stf8.sthint        [r3]=f2,imm9    // mem[r3] <- f2<63:0>
                                   // r3 <- r3 + sext(imm9)
stf.spill.sthint   [r3]=f2         // mem[r3] <- f2
stf.spill.sthint   [r3]=f2,imm9    // mem[r3] <- f2
                                   // r3 <- r3 + sext(imm9)
where fsz is the size of the information unit at the address specified by register r3 into which the value in
register f2 is converted and stored. The valid values for fsz are s for a single-precision value, d for a
double-precision value, and e for an IA-32 80-bit extended double-precision value. Note that the store
instruction uses register direct addressing for the source operand and register indirect addressing for the
destination.
Unlike the floating-point loads, the store instructions have no counterpart to the load type completer
(fldtype), as the syntax above shows. The check, speculative, and advanced forms considered in Section 12
apply only to loads.
Two valid values for sthint, the store hint completer, exist: none and nta. These completers provide
the same hints to the Itanium cache structures as their integer counterparts, which were discussed in Section 5.
Also like the integer versions, the floating-point store instructions provide for postmodification of the
pointer value in register r3 by a 9-bit signed constant, with values ranging from -256 to +255.
The stf8 form of this instruction stores the significand bits (<63:0>) of register f2 in the quad word
memory location specified by register r3.
Finally, the spill form of this instruction stores the contents of register f2 into the 16-byte memory
location specified by r3. This form is used to save register contents when an operating system switches
process contexts or when an application uses a preserved register.
8.3 Miscellaneous Floating-Point Instructions
The Itanium ISA includes a few other instructions that operate on floating-point data types. We discuss these 
here.
8.3.1 Floating-Point Compare Instruction
In Section 6, the notion of predication was introduced. The Itanium ISA includes a floating-point compare
instruction that can be used to set qualifying predicate registers based on the values in floating-point registers.
The behavior and syntax of this instruction are similar to those of its integer analogue:
fcmp.fcrel.fctype p1,p2=f2,f3    // always uses two registers
where two predicate registers p1 and p2 must always be specified and can be any of Pr0-Pr63.
Typically, a comparison statement is read from left to right: p1 is set to true and p2 to false if f2 fcrel f3
is true, and vice versa if the comparison is false.
There are several valid values for fcrel, the conditional relationship completer, including: eq, ne, lt,
le, ge, gt. Each of these has the same meaning as the corresponding symbol in the integer versions of the
compare instruction. A number of other values for the conditional relationship completer are also available:
nlt, nle, nge, and ngt. Here, n stands for “not”, so these completers provide mnemonics for the logically
opposite relationships.
The IEEE standard also defines a special “unordered” relation that is true if one or both operand values
are NaN (not a number). The fcmp instruction can use the unord completer to test for this relationship and
the ord completer for the Boolean opposite.
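The unordered relation explains why the “not” forms are worth having: when an operand is NaN, nlt is not the same as ge. The Python sketch below evaluates a handful of the relations on ordinary Python floats. It assumes that nlt means exactly “not (f2 < f3)”, as the description above suggests, and it models only the Boolean outcome, not the predicate registers.

```python
import math

def fcmp(relation: str, f2: float, f3: float) -> bool:
    """Evaluate a subset of the fcrel relations on two Python floats."""
    unordered = math.isnan(f2) or math.isnan(f3)
    table = {
        "lt": f2 < f3,
        "ge": f2 >= f3,
        "nlt": not (f2 < f3),    # "not less than": also true when unordered
        "nge": not (f2 >= f3),   # "not greater or equal": also true when unordered
        "unord": unordered,
        "ord": not unordered,
    }
    return table[relation]

nan = float("nan")
```

With ordered operands, fcmp("ge", 2.0, 1.0) and fcmp("nlt", 2.0, 1.0) agree. With a NaN operand, ge is false while nlt is true.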
Two valid values for fctype, the comparison type, exist: none and unc. None, which corresponds to
omitting the comparison type completer, indicates an ordinary comparison, like that just described. The unc
completer indicates an unconditional comparison and behaves just like the unconditional integer comparison
operation, which was described in Section 6.
8.3.2 Logical Instructions
The Itanium ISA includes logical instructions that operate on the significand of a floating-point value:
fand     f1=f2,f3       // significand of f1 <- f2 & f3
fandcm   f1=f2,f3       // significand of f1 <- f2 & ~f3
for      f1=f2,f3       // significand of f1 <- f2 | f3
fxor     f1=f2,f3       // significand of f1 <- f2 ^ f3
fselect  f1=f3,f4,f2    // significand of f1 <- (f3 & f2) | (f4 & ~f2)
where &, |, ^, and ~ denote the Boolean AND, OR, XOR, and complement operations. Each instruction sets
the sign of f1 to positive and the biased exponent field to 0x1003E.
The fselect instruction copies the significand bits of f3 from the positions where the bits of f2 are one
and the bits of f4 from the positions where the bits of f2 are zero.
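The bitwise behavior of fandcm and fselect is easy to check with plain 64-bit integers standing in for the significands. This Python sketch ignores the sign and exponent fields entirely; the function names simply mirror the mnemonics.

```python
MASK64 = (1 << 64) - 1

def fandcm(f2: int, f3: int) -> int:
    """AND with complement: f2 & ~f3 on 64-bit significands."""
    return f2 & ~f3 & MASK64

def fselect(f3: int, f4: int, f2: int) -> int:
    """Take bits of f3 where the mask f2 is one and bits of f4 where it is zero."""
    return ((f3 & f2) | (f4 & ~f2)) & MASK64

# Upper half comes from the first operand, lower half from the second.
mixed = fselect(0xAAAAAAAAAAAAAAAA, 0x5555555555555555, 0xFFFFFFFF00000000)
```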
8.3.3 Assembler Pseudo-Ops
The Itanium ISA also provides a number of assembler pseudo-ops for copying floating-point values between
registers. These include:
mov      f1=f3    // f1 <- f3
fabs     f1=f3    // f1 <- abs(f3)
fneg     f1=f3    // f1 <- -f3
fnegabs  f1=f3    // f1 <- -abs(f3)
where f1 and f3 may be any of the Itanium floating-point registers (Fr0-Fr127). Like all assembler
pseudo-ops, these represent common special cases of the more general floating-point instructions discussed above.
8.3.4 Floating-Point Merge Instruction
We know that floating-point numbers are stored and manipulated as sign and magnitude quantities. There
are several forms of a merge instruction to manipulate the sign bit of a floating-point number, both with and
without the biased exponent field:
fmerge.s   f1=f2,f3    // f1 <- sign(f2) with rest(f3)
fmerge.ns  f1=f2,f3    // f1 <- -sign(f2) with rest(f3)
fmerge.se  f1=f2,f3    // f1 <- sign(f2) and exp(f2) with rest(f3)
where f1, f2, and f3 may be any of the Itanium floating-point registers (Fr0-Fr127). These instructions are
useful for composing a new floating-point value using a combination of the various elements from the source
operands.
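The merge forms can be modeled on a register represented as a (sign, exponent, significand) triple. The Python sketch below uses a hypothetical FReg tuple for that purpose. The constructions of fabs and fneg from fmerge reflect our understanding of how such pseudo-ops can be built and should be checked against the Itanium Instruction Set Reference; the exponent value used for -2.0 assumes a bias of 0xFFFF.

```python
from typing import NamedTuple

class FReg(NamedTuple):
    sign: int         # 1 bit
    exponent: int     # 17-bit biased exponent
    significand: int  # 64 bits

def fmerge_s(f2: FReg, f3: FReg) -> FReg:
    """Sign of f2 with the exponent and significand of f3."""
    return FReg(f2.sign, f3.exponent, f3.significand)

def fmerge_ns(f2: FReg, f3: FReg) -> FReg:
    """Negated sign of f2 with the exponent and significand of f3."""
    return FReg(f2.sign ^ 1, f3.exponent, f3.significand)

def fmerge_se(f2: FReg, f3: FReg) -> FReg:
    """Sign and exponent of f2 with the significand of f3."""
    return FReg(f2.sign, f2.exponent, f3.significand)

F0 = FReg(0, 0, 0)                       # stand-in for a register holding +0.0
minus_two = FReg(1, 0x10000, 1 << 63)    # -2.0, assuming an exponent bias of 0xFFFF

absolute = fmerge_s(F0, minus_two)           # same effect as fabs
negated = fmerge_ns(minus_two, minus_two)    # same effect as fneg
```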
8.3.5 Floating-Point Value Classification
The fclass instruction allows the programmer to determine the nature of the current value in a
floating-point register:
fclass.fcrel.fctype p1,p2=f2,fclass9    // is f2 as expected?
where two predicate registers must always be specified and fclass9 is a bit pattern encoding the
characteristics sought about the contents of register f2.
Predicate register p1 is set to true and p2 to false if f2 fcrel fclass9 is true, and vice versa if the
relationship is false.
There are two valid values for fcrel, the conditional relationship completer: m (is a member) and nm
(is not a member).
There are two valid values for fctype, the comparison type completer: none and unc. None, which
corresponds to omitting the completer, indicates an ordinary comparison. The unc completer indicates an
unconditional comparison, and behaves as previously described.
Itanium assemblers will recognize the mnemonics in Table 11 for the fclass9 bit pattern. These
mnemonics can be OR’d together using the | operator.
Floating-Point Class    Mnemonic    Bit Value in fclass9
NaTVal                  @nat        0x100
Quiet NaN               @qnan       0x080
Signaling NaN           @snan       0x040
Positive                @pos        0x001
Negative                @neg        0x002
Zero                    @zero       0x004
Un-normalized           @unorm      0x008
Normalized              @norm       0x010
Infinity                @inf        0x020
Table 11: Assembler mnemonics for the fclass instruction
The floating-point number will agree with the fclass9 pattern if one of the following three conditions
is true:
•  The value is NaTVal and @nat was specified.
•  The value is NaN and either @qnan or @snan was specified.
•  The value’s sign agrees with @pos or @neg, if specified, and the value’s type agrees with the remainder
of the specified characteristics.
Note that a value of 0x1ff for fclass9 will test whether the value in register f2 is any supported
floating-point type.
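The membership test can be approximated for ordinary Python floats using the bit values from Table 11. The sketch below is an illustration under several stated assumptions: NaTVal cannot be represented in Python and is not modeled; quiet and signaling NaNs are not distinguished; IEEE denormals stand in for the un-normalized class; and, following the wording above, a pattern that names no sign is treated as matching either sign. The function name is_member is our own.

```python
import math
import sys

# Bit values from Table 11
NAT, QNAN, SNAN = 0x100, 0x080, 0x040
POS, NEG = 0x001, 0x002
ZERO, UNORM, NORM, INF = 0x004, 0x008, 0x010, 0x020

def is_member(value: float, fclass9: int) -> bool:
    """Approximate the 'm' relation of fclass for an ordinary Python float."""
    if math.isnan(value):
        # Quiet and signaling NaNs are not told apart here; accept either bit.
        return bool(fclass9 & (QNAN | SNAN))
    sign_bit = NEG if math.copysign(1.0, value) < 0 else POS
    if fclass9 & (POS | NEG) and not fclass9 & sign_bit:
        return False                   # a sign was requested and it disagrees
    if value == 0.0:
        kind = ZERO
    elif math.isinf(value):
        kind = INF
    elif abs(value) < sys.float_info.min:
        kind = UNORM                   # below the smallest normalized double
    else:
        kind = NORM
    return bool(fclass9 & kind)
```

For example, is_member(-0.0, NEG | ZERO) corresponds to testing a register against the pattern @neg | @zero.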
8.4 Floating-Point Operations on Integer Values
We noted in Section 5 that the Itanium floating-point registers could be used to do full width multiplication of
64-bit integers. Before investigating the several forms of the floating-point instruction that makes this
operation possible, we introduce the instructions for converting between integer and floating-point representations.
8.4.1 Data Conversion
We have already seen that the floating-point load and store operations will convert data to and from the IEEE
single- and double-precision formats when working with data in the floating-point registers. The Itanium ISA
also includes instructions for converting quad word signed integer values.
Rounding and truncation. Several instructions that modify the format of a floating-point value as it moves
between Itanium registers are available:
fcvt.fx.sf         f1=f2    // round to integer
fcvt.fx.trunc.sf   f1=f2    // truncate to integer
fcvt.fxu.sf        f1=f2    // round to unsigned integer
fcvt.fxu.trunc.sf  f1=f2    // truncate to unsigned integer
where the result of each operation is placed into the significand of register f1. The biased exponent of f1 is
set to 0x1003E and the sign bit is set to zero. If the floating-point value in register f2 is negative, then the
sign of the result in f1 is given by bit <63> of the significand. This qualification applies only to the signed
forms of these instructions.
The valid values for sf are the same as those previously described for the floating-point arithmetic
instructions.
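The difference between the rounding and truncating forms, and the way a negative result appears in the significand, can be illustrated in Python. This sketch assumes round-to-nearest-even for the rounding form (the actual mode is selected through the status field) and does not model out-of-range inputs. The function names mirror the mnemonics; as_signed is a hypothetical helper.

```python
import math

MASK64 = (1 << 64) - 1

def fcvt_fx(value: float) -> int:
    """Round to a signed integer and return the 64-bit significand pattern."""
    return round(value) & MASK64       # round() rounds halves to even

def fcvt_fx_trunc(value: float) -> int:
    """Truncate toward zero and return the 64-bit significand pattern."""
    return math.trunc(value) & MASK64

def as_signed(significand: int) -> int:
    """Interpret a 64-bit pattern as two's complement; bit 63 carries the sign."""
    return significand - (1 << 64) if significand >> 63 else significand
```

Converting -2.7 yields a significand whose bit <63> is set, even though the register's own sign bit is zero; as_signed recovers -3 (rounded) or -2 (truncated).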
Integer to floating-point conversion. The Itanium ISA also provides an instruction for converting a 64-bit
integer (stored in the significand of a floating-point register) into a normalized floating-point value:
fcvt.xf f1=f2    // convert to normalized floating-point
This operation is always exact, a result of the extended exponent range of the Itanium floating-point registers.
Note that no instruction completers are necessary. An assembler pseudo-op that converts an unsigned 64-bit
integer into a floating-point value using the fma instruction is also available:
fcvt.xuf.pc.sf f1=f3    becomes    fma.pc.sf f1=f3,f1,f0
Here the source operands f1 and f0 of the fma form name the fixed registers Fr1 (+1.0) and Fr0 (+0.0).
Rounding may be necessary if the integer value is too large; a truncation operation is not available.
The valid values for pc and sf are the same as those for the floating-point arithmetic operations.
Data movement. Several forms of the getf instruction, which moves values from the floating-point
registers to the general-purpose registers, are available:
getf.s    r1=f2    // r1 <- single-precision representation of f2
getf.d    r1=f2    // r1 <- double-precision representation of f2
getf.exp  r1=f2    // r1<17:0> <- sign and exponent of f2
getf.sig  r1=f2    // r1 <- significand of f2
Bits <63:32> of r1 are set to zero with getf.s. Likewise, bits <63:18> of r1 are set to zero with
getf.exp. If f2 contains NaTVal, then the NaT bit of r1 is marked for all forms of this instruction.
Similar instructions are available for moving values from a general-purpose register to a floating-point
register:
setf.s    f1=r2    // f1 <- single-precision value in r2
setf.d    f1=r2    // f1 <- double-precision value in r2
setf.exp  f1=r2    // sign and exponent of f1 <- r2<17:0>
setf.sig  f1=r2    // significand of f1 <- r2
With the setf.exp instruction, bits <17:0> of r2 are placed in the sign and exponent fields of f1, and its
significand is set to 0x8000000000000000, so that only the explicit integer bit (bit <63>) is set. With
setf.sig, the value in r2 is copied into the significand of f1, its sign field is set to zero, and the biased
exponent is set to 0x1003E. If the NaT bit of register r2 is set, then the conversion is skipped and register f1
is set to NaTVal.
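The getf.s and getf.d forms hand the general-purpose register the same bit patterns that would be written to memory. In Python, the struct module can reproduce those patterns for ordinary floats. This is a sketch of the data layout only; NaT handling is not modeled, and the function names merely echo the mnemonics.

```python
import struct

def getf_d(value: float) -> int:
    """64-bit IEEE double-precision bit pattern of value."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]

def getf_s(value: float) -> int:
    """32-bit IEEE single-precision bit pattern, zero-extended to 64 bits."""
    return struct.unpack("<I", struct.pack("<f", value))[0]

def setf_d(bits: int) -> float:
    """Inverse of getf_d: reinterpret a 64-bit pattern as a double."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]
```

For instance, getf_d(1.0) is 0x3FF0000000000000 and getf_s(1.0) is 0x3F800000, with the upper 32 bits of the latter left at zero.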
8.4.2 Integer Multiplication
The Itanium ISA provides a fused multiply-add instruction for multiplying 64-bit integer data types stored in
floating-point registers. Several forms of the xma instruction are available (some of which are assembler
pseudo-ops):
xma.l    f1=f3,f4,f2    // low form
xma.lu   f1=f3,f4,f2    // low form (pseudo-op)
xma.h    f1=f3,f4,f2    // high form
xma.hu   f1=f3,f4,f2    // high unsigned form
xmpy.l   f1=f3,f4       // low form (pseudo-op)
xmpy.lu  f1=f3,f4       // low form (pseudo-op)
xmpy.h   f1=f3,f4       // high form (pseudo-op)
xmpy.hu  f1=f3,f4       // high unsigned form (pseudo-op)
where a 128-bit intermediate result is formed by either a signed or unsigned multiplication of the significands
of registers f3 and f4 and (possibly) adding the significand of register f2. Note that the significand of f2 is
zero-extended as necessary. Either the lower or the upper 64 bits of this result are then stored in the
significand of register f1.
There is no fused multiply-subtract instruction for 64-bit integer data types, a result of zero-extending,
and not sign-extending, the significand of f2.
The sign bit of f1 is set to zero, and the biased exponent to 0x1003E. If any source operand’s value is
NaTVal, then the multiplication is skipped and f1 is set to NaTVal.
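Python's arbitrary-precision integers make it straightforward to model the 128-bit intermediate result. The sketch below treats each significand as a 64-bit pattern and selects the low or high half of the sum. The single xma function with high and unsigned flags is our own packaging of the forms listed above, and NaTVal handling is omitted.

```python
MASK64 = (1 << 64) - 1

def to_signed(x: int) -> int:
    """Interpret a 64-bit pattern as a two's complement integer."""
    return x - (1 << 64) if x >> 63 else x

def xma(f3: int, f4: int, f2: int, high: bool, unsigned: bool) -> int:
    """Multiply two 64-bit significands, add zero-extended f2, keep 64 bits."""
    if unsigned:
        product = f3 * f4
    else:
        product = to_signed(f3) * to_signed(f4)
    total = (product + f2) & ((1 << 128) - 1)   # 128-bit intermediate result
    return (total >> 64) & MASK64 if high else total & MASK64

minus_one = MASK64   # 64-bit pattern of -1
```

The low 64 bits do not depend on signedness, which is consistent with the unsigned low form being a pseudo-op. The high halves differ: multiplying the pattern for -1 by itself gives a high half of 0 when signed and 0xFFFFFFFFFFFFFFFE when unsigned.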
9 Parallel Instructions
As we know, the Itanium architecture operates on 64-bit integer and 82-bit floating-point data types.
Sometimes, however, the precision or range of values enabled by the full width of the appropriate registers is not
necessary. To optimize the circumstances where this situation is true, many modern architectures offer
parallel (sometimes called “multimedia”) instructions. These instructions operate on several narrow data types,
packed into a full-width architectural register, in parallel.
The Itanium architecture provides several integer and floating-point parallel instructions. The use of 
parallel instructions is extremely complex and often requires special algorithms and data layouts to achieve 
optimal execution. As a result, we give only a cursory overview of these instructions here.
For further details concerning the Itanium parallel instructions, consult the appropriate entries in Section 3 
of the Itanium Instruction Set Reference.
9.1 Integer Instructions
A large number of parallel integer instructions, which perform their specified operations on multiple bytes, 
words, and double words packed into a 64-bit general-purpose register, are available, but we do not list the 
instruction mnemonics here.
The parallel integer instructions include: typical arithmetic operations, like add, subtract, multiply; many
useful others, such as maximum, minimum, average, bit shifts, comparisons, etc.; and the necessary
instructions for packing and unpacking general-purpose registers with multiple narrow data types.
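The idea of operating on several narrow values packed into one register can be sketched in Python with a 64-bit integer standing in for a general-purpose register. The example below adds four 16-bit lanes independently with wrap-around (modular) arithmetic. The helper names are hypothetical, and saturating variants are not modeled.

```python
def unpack_lanes(register: int, lane_bits: int) -> list[int]:
    """Split a 64-bit register into lanes, least significant lane first."""
    mask = (1 << lane_bits) - 1
    return [(register >> shift) & mask for shift in range(0, 64, lane_bits)]

def pack_lanes(lanes: list[int], lane_bits: int) -> int:
    """Inverse of unpack_lanes; each lane is truncated to lane_bits."""
    mask = (1 << lane_bits) - 1
    register = 0
    for index, lane in enumerate(lanes):
        register |= (lane & mask) << (index * lane_bits)
    return register

def parallel_add(a: int, b: int, lane_bits: int = 16) -> int:
    """Add corresponding lanes independently, wrapping within each lane."""
    lanes_a = unpack_lanes(a, lane_bits)
    lanes_b = unpack_lanes(b, lane_bits)
    return pack_lanes([x + y for x, y in zip(lanes_a, lanes_b)], lane_bits)

result = parallel_add(0x0001_FFFF_0003_0004, 0x0001_0001_0001_0001)
```

The lane holding 0xFFFF wraps to 0x0000 without carrying into its neighbor, which is the essential difference from an ordinary 64-bit add.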
Note that these instructions have latencies greater than one and may require execution unit I0 in
particular. Their use is further complicated by interdependencies with other Itanium instructions. It is
possible that the analogous nonparallel instructions will exhibit better performance, but this result is, of course,
implementation-dependent.
9.2 Floating-Point Instructions
Similarly, many parallel floating-point instructions, which perform their specified operations on two
single-precision floating-point values packed into an 82-bit Itanium floating-point register, are available, but we do
not list the instruction mnemonics here.
The parallel floating-point instructions include: typical arithmetic operations, like the fused multiply-add
and multiply-subtract; many useful others, such as maximum, minimum, negation, comparison, the reciprocal
approximations, etc.; and the necessary instructions for loading and storing the packed data values.
These instructions have a four-cycle latency. In principle, then, the Itanium architecture can sustain twice
as many parallel single-precision operations as nonparallel double-precision operations. Of course, data
dependencies and other factors, like the number of F-units, will limit the achievable speed-up provided by
the parallel instructions.
10 Structured Programming Constructs
We saw in Section 6 that the Itanium ISA includes several comparison and branching instructions to control
the flow of a program’s execution. Many of the common control structures provided by high-level languages,
including logical or data-dependent constructs (if...then...else and case selection structures) and
loops (do...until or while...do), can benefit from implementation using the Itanium’s advanced
features. We turn our attention to the low-level bases for building these high-level constructs using the
Itanium ISA.
10.1 If...Then...Else Structures
The if...then...else block is one of the simplest structured programming constructs. It is also one of
the most powerful.
10.1.1 Standard Implementation
An assembly-level if...then...else construct will typically look like the following:
        prior code
if:     cmp.crel pt,pf=ra,rb     // predicates for a rel b
   (pf) br.cond else;;           // skip THEN block
then:   <do THEN block>;;
        br end;;                 // skip ELSE block
else:   <do ELSE block>;;
end:    subsequent code
where pt and pf are the predicate registers set for the true and false outcomes of the compare instruction.
When the outcome is true, pf will contain a zero, the first conditional branch will fall through, the THEN
block will execute, and the unconditional branch will skip the ELSE block. When the outcome is false, pf
will contain a one, the first conditional branch skips the THEN block, and the ELSE block will execute. Note
that in either case one branch must execute, which is very time-consuming.
10.1.2 Predicated Implementation
Most Itanium instructions can be predicated, so both branch instructions encountered above can be 
eliminated:
      <prior code>
if:   cmp.crel pt,pf=ra,rb;;      // predicates for a rel b
then: (pt) <do THEN block>
else: (pf) <do ELSE block>
end:  <subsequent code>
Predication of the THEN and ELSE blocks can impact performance significantly. The CPU executes both 
streams of instructions, but the values in the predicate registers determine which stream actually has an effect 
when results are written into destination registers or memory.
Instructions that are to execute regardless of the construct’s comparison outcome can be interleaved at the 
desired position within the instruction sequence. Most other architectures would require that these 
instructions be duplicated in each code block.
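As a rough C-level analogy of the two implementations (a sketch only: the function names `select_branchy` and `select_predicated`, and the choice of `x + y` and `x - y` as the THEN and ELSE blocks, are our own illustrative assumptions), the predicated form evaluates both blocks and lets a pair of complementary predicates decide which result is committed:

```c
#include <assert.h>
#include <stdint.h>

/* Branching form: exactly one of the two blocks runs. */
int64_t select_branchy(int64_t a, int64_t b, int64_t x, int64_t y)
{
    int64_t r;
    if (a < b)          /* cmp.lt pt,pf=ra,rb ; (pf) br.cond else */
        r = x + y;      /* THEN block */
    else
        r = x - y;      /* ELSE block */
    return r;
}

/* Predicated form: both blocks are evaluated; the predicates gate the write. */
int64_t select_predicated(int64_t a, int64_t b, int64_t x, int64_t y)
{
    int pt = (a < b);           /* predicate for the true outcome  */
    int pf = !pt;               /* predicate for the false outcome */
    assert(pt != pf);           /* the two predicates are complementary */

    int64_t then_val = x + y;   /* computed regardless of pt */
    int64_t else_val = x - y;   /* computed regardless of pf */
    int64_t r = 0;
    if (pt) r = then_val;       /* models (pt) mov r=then_val */
    if (pf) r = else_val;       /* models (pf) mov r=else_val */
    return r;
}
```

Both functions return the same value for every input; only the amount of work performed differs.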
You will recall that the Itanium architecture enables zero latency between compare and branch instructions 
by performing the operations in separate I- or M-units (the compare) and B-units (the branch). Without 
an explicit stop (;;) after the compare instruction, the CPU is permitted to execute the compare and the 
predicated instructions of the if...then...else construct in parallel, possibly using stale values in 
the predicate registers. The explicit stop ensures that the comparison has completed before the qualifying 
predicates are used.
When there are only a few instructions in the THEN and ELSE blocks, the programmer or compiler should 
remove as many stops as possible. Also, interleaving the small number of instructions from the THEN block 
with those from the ELSE block can ensure that instruction bundles are filled with useful work rather than 
no-ops.
Because the CPU actually executes the instructions from both code blocks and uses predication to 
determine which instructions have an effect, lengthy THEN and ELSE blocks may negatively impact the CPU’s 
overall throughput. Although branch instructions are expensive, there is a performance trade-off between 
the predicated and standard implementations of the if...then...else construct, and at some point the 
standard version will execute more quickly. Where this crossover point lies is, of course, an 
implementation-dependent result.
Similarly, when the number of instructions in each block is severely imbalanced, using the standard 
if...then...else implementation may be more effective. This situation is particularly true if the 
shorter block belongs to the more probable outcome, because the CPU is executing a large number of 
instructions that have no effect (those that are predicated false) only to ensure the effect of a small number of 
instructions (those that are predicated true). With the standard implementation, the execution time will reflect 
only those few instructions and the one branch taken.
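The trade-off can be made concrete with a deliberately crude cost model. Everything in this sketch is an assumption for illustration: it counts instruction slots rather than cycles, ignores parallel issue, and treats the cost of a taken branch as a single parameter (`branch_cost`) whose real value depends on the implementation.

```c
#include <assert.h>

/* Expected instruction slots for the branching form: one block runs,
 * weighted by its probability, plus the cost of the one taken branch. */
double branching_cost(double n_then, double n_else, double p_then,
                      double branch_cost)
{
    assert(p_then >= 0.0 && p_then <= 1.0);
    return p_then * n_then + (1.0 - p_then) * n_else + branch_cost;
}

/* Predicated form: both blocks always occupy issue slots. */
double predicated_cost(double n_then, double n_else)
{
    return n_then + n_else;
}

/* Nonzero when this toy model favors the predicated implementation. */
int prefer_predication(double n_then, double n_else, double p_then,
                       double branch_cost)
{
    return predicated_cost(n_then, n_else)
        <= branching_cost(n_then, n_else, p_then, branch_cost);
}
```

With two short, balanced blocks the model favors predication; with a 100-instruction THEN block that is rarely taken and a 4-instruction ELSE block, it favors the branching form, which matches the qualitative argument above.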
10.1.3 Nested If...Then...Else Structures Using Predication
You learned earlier that the unconditional comparison instruction, when predicated false, sets the values in 
both predicate registers to false without actually performing a comparison. This form of the comparison 
instruction is useful for implementing nested if...then...else structures, as follows:
      <prior code>
if:   cmp.crel pt,pf=ru,rv              // outer conditional test
then: (pt) cmp.crel.unc pa,pb=rw,rx     // THEN-block conditional test
      (pa) <do inner THEN block A>
      (pb) <do inner ELSE block B>
else: (pf) cmp.crel.unc pc,pd=ry,rz     // ELSE-block conditional test
      (pc) <do inner THEN block C>
      (pd) <do inner ELSE block D>
end:  <subsequent code>
This branch-free implementation can be very useful for short code blocks. As before, all code blocks are 
loaded and executed, but only the sequence that has been predicated true will have an effect. If one or more 
of the code blocks is very long, consider using a branching implementation.
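The behavior of the nested predicates can be modeled in C. This is a sketch under stated assumptions: the helper `cmp_lt_unc` is a hypothetical stand-in for a predicated unconditional compare (with "less than" chosen arbitrarily as the relation), and both inner compares are modeled with the unconditional form so that the predicates of the path not taken are cleared.

```c
#include <assert.h>
#include <stdbool.h>

struct preds { bool pa, pb, pc, pd; };

/* Model of "(qp) cmp.lt.unc p1,p2=x,y": when qp is false, both target
 * predicates are cleared and no comparison result is used. */
static void cmp_lt_unc(bool qp, long x, long y, bool *p1, bool *p2)
{
    *p1 = qp && (x < y);
    *p2 = qp && !(x < y);
}

struct preds nested_predicates(long u, long v, long w, long x, long y, long z)
{
    struct preds p;
    bool pt = (u < v);                    /* outer conditional test */
    bool pf = !pt;
    cmp_lt_unc(pt, w, x, &p.pa, &p.pb);   /* THEN-block conditional test */
    cmp_lt_unc(pf, y, z, &p.pc, &p.pd);   /* ELSE-block conditional test */
    return p;
}

/* Returns the letter of the one inner block whose predicate is true. */
char selected_block(long u, long v, long w, long x, long y, long z)
{
    struct preds p = nested_predicates(u, v, w, x, y, z);
    assert(p.pa + p.pb + p.pc + p.pd == 1);   /* exactly one block takes effect */
    return p.pa ? 'A' : p.pb ? 'B' : p.pc ? 'C' : 'D';
}
```

Exactly one of the four predicates is true for any input, which is what allows all four blocks to be issued without branches.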
10.2 Case Selection Structures
Case selection structures can be implemented very compactly using the Itanium compare instructions and 
predication. Consider the following simple example, expressed in the C programming language:
      switch (W)
      {
          case 1:
              R = P - Q;
              break;
          case 2:
              R = P + Q;
              break;
          case 4:
              R = P;
              break;
      }
Using a sequence of compare instructions with predication, this code becomes:
case1: cmp.eq p1,p0=1,rW         // if rW == 1, p1 <- 1
case2: cmp.eq p2,p0=2,rW         // if rW == 2, p2 <- 1
case4: cmp.eq p4,p0=4,rW;;       // if rW == 4, p4 <- 1
       (p1) sub rR=rP,rQ         // R = P - Q
       (p2) add rR=rP,rQ         // R = P + Q
       (p4) mov rR=rP            // R = P
where the notation rR means the general-purpose register containing the value of R, and so forth. The Itanium 
architecture allows more than one instruction to target the same register (in this example, the register rR) if 
those instructions are mutually exclusive; that is, if only one of the instructions will be predicated true. If this 
were not the case, then more stops would be required.
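The predicated case selection can be mirrored in C to check that it agrees with the `switch` statement. This is only a model: the function name `case_select` is ours, and the incoming value of `r` stands for whatever the register held before the sequence, which is left unchanged when no case matches.

```c
#include <assert.h>
#include <stdint.h>

int64_t case_select(int64_t w, int64_t p, int64_t q, int64_t r)
{
    int p1 = (w == 1);          /* cmp.eq p1,p0=1,rW */
    int p2 = (w == 2);          /* cmp.eq p2,p0=2,rW */
    int p4 = (w == 4);          /* cmp.eq p4,p0=4,rW */

    /* The three writers of r may share an instruction group only
     * because at most one of them can be predicated true. */
    assert(p1 + p2 + p4 <= 1);

    if (p1) r = p - q;          /* (p1) sub rR=rP,rQ */
    if (p2) r = p + q;          /* (p2) add rR=rP,rQ */
    if (p4) r = p;              /* (p4) mov rR=rP    */
    return r;
}
```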
10.3 Loop Structures
Like the if...then...else construct and case selection structures, various types of loop structures can 
be implemented using predicated instruction execution.
10.3.1 Counter-controlled Loops
A counter-controlled loop can be expressed as follows:
      <enter loop with rc = number of traversals>
loop: <instructions of loop body>
      add rc=-1,rc;;               // decrement loop counter
      cmp.eq p0,pf=rc,0            // is loop counter == 0?
(pf)  br.cond.sptk loop;;          // no, execute loop again
      <subsequent code>            // yes, continue
This loop uses only one predicate register. Furthermore, the static prediction hint (sptk) is given, so this 
loop assumes many traversals.
10.3.2 Loops Controlled by an Address Limit
An address can also be used to control the execution of a loop. For example, consider the following code that 
processes the quad word elements of an array:
      <enter loop with rc = address of first element>
loop: ld8 rt=[rc],8;;              // load current element and
                                   // increment pointer
      <process current element>
      cmp.gtu p0,pf=rc,rl          // past the last element?
(pf)  br.cond.sptk loop;;          // no, execute loop again
      <subsequent code>            // yes, continue
where rl contains the address of the last element. Note that an unsigned comparison is used for the addresses. 
Also, a loop of this sort implicitly assumes that the array is non-null.
10.3.3 Loops with a Conditional Entrance
Suppose that it is permissible for the array in the previous example to be null. Clearly this situation calls for 
a different sort of loop, one that would not attempt to operate on an empty array. We can remedy the situation 
by positioning the conditional test at the top of the loop:
       <enter loop with rc = address of first element>
loop:  cmp.gtu pt,p0=rc,rl         // past the last element?
(pt)   br.cond.spnt leave;;        // yes, exit the loop
       ld8 rt=[rc],8;;             // load current element and
                                   // increment pointer
       <process current element>
       br loop;;                   // look for next element
leave: <subsequent code>
where the notational conventions are as before. Here, we provide the static prediction hint spnt, indicating 
that the branch will not be taken.
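In C terms, the loops of Sections 10.3.2 and 10.3.3 correspond to a bottom-tested and a top-tested loop over a pointer range. The following sketch assumes, purely for illustration, that "processing" an element means adding it to a running sum; the function names are ours, and an empty range is represented by `first > last`.

```c
#include <assert.h>
#include <stdint.h>

/* Bottom-tested loop (address-limit form): the body always runs at
 * least once, so the range must not be empty. */
int64_t sum_bottom_tested(const int64_t *first, const int64_t *last)
{
    assert(first <= last);          /* implicit assumption: non-null array */
    int64_t sum = 0;
    const int64_t *rc = first;
    do {
        sum += *rc++;               /* ld8 rt=[rc],8 */
    } while (!(rc > last));         /* cmp.gtu p0,pf=rc,rl ; (pf) br.cond loop */
    return sum;
}

/* Top-tested loop (conditional entrance): an empty range is skipped. */
int64_t sum_top_tested(const int64_t *first, const int64_t *last)
{
    int64_t sum = 0;
    const int64_t *rc = first;
    while (!(rc > last)) {          /* cmp.gtu pt,p0=rc,rl ; (pt) br.cond leave */
        sum += *rc++;               /* ld8 rt=[rc],8 */
    }
    return sum;
}
```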
10.3.4 Using the Loop Count Register
Counter-controlled loops are very common structures in application programming. Often, these structures 
are nested, so the need to handle the innermost loops efficiently is very important.
The Itanium architecture provides two mechanisms to implement loops efficiently: the ar.lc (loop 
count) application register and the br.cloop branch instruction.
The ar.lc register must be initialized, prior to entering the loop body, to one less than the total number 
of desired traversals using a mov pseudo-op. The br.cloop instruction, placed at the bottom of the loop, 
tests the value of ar.lc against zero after each traversal. If ar.lc is not zero, it is decremented and the 
branch is taken. If ar.lc is zero, the branch is not taken and execution falls through to the next instruction.
Here, the body of the loop will be executed at least once. If this is not the desired behavior, programmers 
or compilers must add the appropriate tests prior to the beginning of the loop body.
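The counting rule described above (initialize to one less than the number of traversals, then test, decrement, and branch) can be checked with a small C model. The names `br_cloop` and `count_traversals` are hypothetical helpers for this sketch, not architectural features.

```c
#include <assert.h>
#include <stdint.h>

/* Model of br.cloop: if ar.lc is nonzero, decrement it and take the
 * branch (return 1); otherwise fall through (return 0). */
static int br_cloop(uint64_t *ar_lc)
{
    if (*ar_lc != 0) {
        (*ar_lc)--;
        return 1;
    }
    return 0;
}

/* Counts how many times the loop body runs for n desired traversals. */
uint64_t count_traversals(uint64_t n)
{
    assert(n >= 1);                 /* the body always executes at least once */
    uint64_t ar_lc = n - 1;         /* mov ar.lc = n-1 */
    uint64_t traversals = 0;
    do {
        traversals++;               /* loop body */
    } while (br_cloop(&ar_lc));     /* br.cloop at the bottom of the loop */
    return traversals;
}
```

Calling `count_traversals(n)` returns `n` for any `n >= 1`, which shows why the initial value must be `n - 1` rather than `n`.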
We have included another (slightly modified) example from Evans and Trimper to illustrate how ar.lc 
and br.cloop are used. The program, called DOTCLOOP, computes the scalar product of two vectors. The 
DOTCLOOP code is given in Figure 3 (next page).
The Itanium architecture provides only one ar.lc register, so any routine that uses the register must 
save and restore its contents for the calling routine. In this example, the general-purpose register r9 can 
be used because the program’s main procedure is a “leaf” procedure; that is, main does not call any other 
routines. Generally, however, a stack-based mechanism should be used to save and restore register contents 
for previous calling levels.
// DOTCLOOP: Compute the scalar product of two vectors
        N       = 3                       // Declare a constant
        .data                             // Declare data section
        .align  8                         // Specify desired alignment
P:      .skip   8                         // To store the product
X:      data2   -1,+3,+5                  // First vector of 16-bit values
Y:      data2   -2,-4,+6                  // Second vector of 16-bit values
        .text                             // Declare code section
        .align  32                        // Specify desired alignment
        .global main                      // Mark mandatory 'main' program entry
        .proc   main
main:
        .prologue                         // Begin prologue section
        .save   ar.lc,r9
        mov     r9=ar.lc;;                // Save caller's ar.lc
        .body                             // Begin procedure 'main'
first:  movl    r14=X;;                   // Gr14 = pointer to X
        movl    r15=Y;;                   // Gr15 = pointer to Y
        movl    r16=P;;                   // Gr16 = pointer to P
        mov     r20=0                     // r20 = scalar product
        mov     r17=N-1;;                 // One less than the number of traversals
        mov     ar.lc=r17                 // Initialize ar.lc
top:    ld2     r21=[r14],2               // Load element from X and increment pointer
        ld2     r22=[r15],2;;             // Load element from Y and increment pointer
        pmpy2.r r21=r22,r21;;             // Multiply element from X by element from Y
        sxt4    r21=r21;;                 // Sign-extend result to 64 bits
        add     r20=r20,r21               // Update scalar product
        br.cloop.sptk.few top             // More elements to process?
        st8     [r16]=r20;;               // No, store the scalar product
done:   mov     r8=0;;                    // Signal completion
        mov     ar.lc=r9                  // Restore caller's ar.lc
        br.ret.sptk.many b0;;             // Return to command line
        .endp   main                      // End procedure 'main'
Figure 3: A slightly modified version of the DOTCLOOP program from Evans and Trimper
In addition, the DOTCLOOP program introduces the prologue section of a program. The prologue, marked 
by .prologue, occurs at the beginning of the text segment and extends until the .body directive. You can 
see that the prologue includes both assembler directives (.save, for example) and actual Itanium 
instructions. Programs may require an epilogue section as well. Not surprisingly, the epilogue occurs at the bottom 
of the text segment, but there are no special directives to explicitly mark the epilogue section.
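For comparison, the computation performed by DOTCLOOP can be written in a few lines of C. This is our own sketch rather than code from Evans and Trimper; the function name `dot_product16` is an assumption, and the comments indicate which assembly instructions each step loosely corresponds to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar product of two vectors of 16-bit signed values. */
int64_t dot_product16(const int16_t *x, const int16_t *y, size_t n)
{
    assert(x != NULL && y != NULL);
    int64_t sum = 0;                                   /* mov r20=0 */
    for (size_t i = 0; i < n; i++) {
        /* 16-bit by 16-bit signed multiply giving a 32-bit product (pmpy2.r) */
        int32_t prod = (int32_t)x[i] * (int32_t)y[i];
        /* widen to 64 bits (sxt4) and accumulate (add) */
        sum += (int64_t)prod;
    }
    return sum;
}
```

With the vectors from Figure 3, X = (-1, 3, 5) and Y = (-2, -4, 6), the products are 2, -12, and 30, so the value stored at P should be 20.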
11 Using Procedures and Functions
We now turn our attention to the Itanium mechanisms, instructions, and calling conventions that support 
procedures and functions.
11.1 Itanium Stack Structures
The Itanium architecture includes support for both memory-based stacks and register stacks. We describe 
both types here, highlighting only those details necessary to provide a basic understanding of the low-level 
mechanisms prescribed by the Itanium architecture that support the use of procedures and functions.
11.1.1 Itanium Memory Stacks
By convention, the general-purpose register Gr12 serves as the Itanium stack pointer, and it is initialized to 
point to the memory stack when a program is loaded. The stack pointer requires 16-byte alignment, also by 
convention. Note that Itanium assemblers will recognize sp as a synonym for Gr12.
Calling procedures will automatically provide a 16-byte “scratch” area for the callee. If more than 
16 bytes are required, a procedure frame must be established: the called procedure must decrement sp 
by the frame size in its prologue section.
The procedure frame contains five areas: the local storage region, a dynamic allocation region, the frame 
marker region, an outgoing parameters region, and the 16-byte scratch area. We omit any further details of the 
procedure frame, but Evans and Trimper cover the topic thoroughly (Section 7.1.3, page 191). We note that:
•  The frame size must always be a multiple of 16 bytes.
•  It is the responsibility of the programmer (or compiler) to save the previous stack pointer in the prologue 
and restore it in the epilogue.
•  The Itanium architecture allows the programmer or compiler to define any number or type of stack 
structures in a program’s data segment and to use the general-purpose registers as user-maintained 
stack pointers.
Evans and Trimper also cover more details of user-defined stacks (Section 7.1.4, page 192).
11.1.2 Itanium Register Stacks
In addition to memory stacks, the Itanium architecture supports a hardware-based register stack. The archi­
tecture prescribes that 32 static registers and a register stack of at least 96 registers (managed by the register 
stack engine, which we discuss momentarily) be provided by any implementation. We discuss this topic 
briefly, leaving the details to Evans and Trimper (Section 7.3, page 196) and other resources.
The alloc instruction. A new stack frame on the Itanium register stack is allocated using the following 
instruction:
      alloc r1=ar.pfs,ins,locs,outs,rots
where ins, locs, outs, and rots specify the sizes of the input, local, output, and rotating regions of the 
stack frame. The size of the frame (sof) is given by ins + locs + outs. The size of the local region 
(sol) is given by ins + locs; there is no distinction between the inputs and the locals. The size of the 
rotating region (sor) is given by rots, which must be a multiple of eight and cannot exceed sof.
The alloc instruction has introduced the idea of rotating registers. This powerful concept allows data in 
registers to remain accessible by incrementing the logical names of the registers within the set using special 
instructions. We defer any further discussion of rotating register sets until Section 12.
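The frame-size arithmetic for alloc is simple enough to express as a small C helper. This is a sketch: the struct and function names are ours, and the upper bound of 96 registers per frame is our reading of the stacked register range r32 through r127.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct frame { unsigned sof, sol, sor; };

/* Computes sof, sol, and sor from the four alloc operands.
 * Returns false when the operands do not describe a valid frame. */
bool alloc_frame(unsigned ins, unsigned locs, unsigned outs, unsigned rots,
                 struct frame *f)
{
    assert(f != NULL);
    unsigned sof = ins + locs + outs;       /* size of frame */
    if (sof > 96)
        return false;                       /* only r32-r127 are stacked */
    if (rots % 8 != 0 || rots > sof)
        return false;                       /* multiple of eight, at most sof */
    f->sof = sof;
    f->sol = ins + locs;                    /* inputs and locals together */
    f->sor = rots;                          /* size of rotating region */
    return true;
}
```

For example, an allocation with no inputs, six locals, two outputs, and no rotating registers yields sof = 8, sol = 6, and sor = 0.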
The Register Stack Engine. The Itanium Register Stack Engine (RSE) provides transparent access to a 
large virtual register stack in memory. In this sense, it functions similarly to a typical cache structure. Using 
cues from the operating system, the RSE manages the limited number of physical registers in the stack by 
spilling and filling register contents to a backing store (typically a dedicated region of memory) when a new 
allocation or procedure return requires that action.
The RSE asynchronously moves data to and from memory without direct intervention by the CPU using 
direct memory access. In this sense, the RSE functions as an I/O device that is active only periodically and 
whose operation is largely decoupled from instruction execution. Of course, if the RSE is waiting for data 
from the CPU, or vice versa, execution may stall.
11.2 Calling Procedures and Functions
In order to reduce possible contention over the vast register resources provided by the Itanium architecture, 
programmers and compilers must follow the conventions for using these resources. Section 3 touched on the 
standardized uses of each type of Itanium register. Here, we describe the conventions for calling procedures 
and functions more fully.
11.2.1 Register Conventions
Many conventions reflect differences at the hardware level. For example, some registers are global in scope 
(Gr0-Gr31), others have constant values (Gr0, Fr0-Fr1), and yet others are managed by the RSE. 
Following Evans and Trimper, we characterized the Itanium registers according to their size, features, and uses in 
Section 3. We reiterate the characterizations pertaining to the use of procedures and functions:
•  A register is scratch if it may be used freely by a procedure or function at any calling level; the caller 
must save any important contents of these registers.
•  A register is preserved if a calling routine depends on its contents; any called procedure must save and 
restore the contents of these registers for its caller.
•  A register is automatic if its name only has a dynamic correspondence to a physical register; these 
registers are automatically spilled to and filled from memory during allocation by the hardware, as 
necessary.
Note that the location within the program where the contents of important registers are saved depends upon 
the programming language and environment, as well as the operating system.
11.2.2 Call and Return Branch Instructions
The Itanium architecture provides instruction completers for the more general branch instructions to indicate 
a call or return from a function:
      br.call.bwh.ph.dh b1=target25      // IP-relative
      br.call.bwh.ph.dh b1=b2            // indirect addressing
      br.ret.bwh.ph.dh  b2               // indirect addressing only
where the valid values for bwh, ph, and dh are the same as those for the branch instructions described in 
Section 6. Note that the target address must be aligned with an instruction bundle; that is, the four 
lowest-order bits of the address must be zero. Also, note that calls and returns may have a qualifying predicate.
As a result of the br.call instruction, the return address becomes IP+16 and is stored in register b1. 
Then, several values are saved into the ar.pfs application register; we do not provide the details of this 
step, but they can be found in Appendix D.7 of Evans and Trimper. The register stack is adjusted and, finally, 
the IP is set to the target address either by adding the sign-extended offset, target25, or by copying the 
address in register b2.
The br.ret instruction copies the value of b2 into the IP and restores from ar.pfs those values saved 
by br.call. The calling procedure’s stack frame is also restored.
Figure 4: Passing arguments via registers and the memory stack
11.2.3 Argument Passing
The power of procedures and functions comes from their ability to take input arguments, operate on those 
arguments, and possibly return a result to the calling procedure.
Calling routines can pass as many as eight input arguments using registers; beyond eight, they must be 
passed using the memory stack. Up to eight general-purpose registers are therefore allocated as outs and are 
used to pass 64-bit arguments. Floating-point registers Fr8-Fr15 are used to pass single- or double-precision 
floating-point arguments. Stack space for passing more than eight arguments can be claimed by decrementing 
the stack pointer by the appropriate amount (including the required 16-byte scratch space). Figure 4 shows 
schematically how argument passing works.
The rules for passing integer and floating-point data types differ. For instance, suppose we make the 
following C function call:
      result = my_function(a, b, r, s, t, c, x, d, y);
where a, b, c, and d are integer values and r, s, t, x, and y are floating-point values. These arguments 
must correspond to the sequentially numbered argument slots, as depicted in Figure 4: a with arg0, b with 
arg1, and so forth.
Integer arguments in arg0-arg7 are placed in the corresponding output registers out0-out7. Other 
output registers are not used. In contrast, floating-point arguments are placed in sequentially numbered 
floating-point registers f8-f15; registers are not skipped. Any remaining arguments (arg8 and above) are 
stored in quad word information units beginning at sp+16, where sp is the decremented stack pointer that 
will be used by the callee.
So, for my_function, a is passed in out0, b in out1, r in f8, s in f9, t in f10, c in out5, x 
in f11, d in out7, and y, the ninth argument, is stored at sp+16. Note that out2, out3, out4, and out6 are 
unused, because their argument slots are occupied by floating-point values.
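The slot rules above can be captured in a short C routine that assigns a location to each argument. This sketch covers only the simple cases described in this section (64-bit integers and floating-point scalars); the type and function names are hypothetical, and the full software conventions contain further rules that are not modeled here.

```c
#include <assert.h>
#include <stddef.h>

enum arg_kind { ARG_INT, ARG_FP };
enum loc_kind { LOC_OUT, LOC_FP, LOC_STACK };

/* index is N in outN or fN, or the byte offset N in sp+N. */
struct arg_loc { enum loc_kind kind; unsigned index; };

void assign_arg_locations(const enum arg_kind *kinds, size_t n,
                          struct arg_loc *out)
{
    assert(kinds != NULL && out != NULL);
    unsigned next_fp = 8;                   /* next free register in f8-f15 */
    for (size_t slot = 0; slot < n; slot++) {
        if (slot >= 8) {
            /* arg8 and above: quad words starting at sp+16 */
            out[slot].kind = LOC_STACK;
            out[slot].index = 16u + 8u * (unsigned)(slot - 8);
        } else if (kinds[slot] == ARG_FP) {
            /* floating-point registers are used sequentially, none skipped */
            out[slot].kind = LOC_FP;
            out[slot].index = next_fp++;
        } else {
            /* integer in slot k goes to out k; skipped outs stay unused */
            out[slot].kind = LOC_OUT;
            out[slot].index = (unsigned)slot;
        }
    }
}
```

Applied to the nine arguments of the example call, this reproduces the assignment listed above, including the ninth argument at sp+16.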
11.2.4 A Practical Example
To illustrate the preceding topics, we again include some (slightly modified) example code from Evans and 
Trimper. Two listings, BOOTH and DECNUM3, combine to form a program for converting a positive integer 
value into a string of ASCII encoded decimal digits. Figure 5 (page 48) shows the code for a function that 
computes a 128-bit signed product from two 64-bit inputs using Booth’s algorithm, and Figure 6 (page 49) 
shows the code for the test program.
Rather than concentrate on the algorithms, we choose to highlight only those features of the code that 
are pertinent to the discussion at hand. The details of Booth’s algorithm, as well as the process of converting 
integers in hexadecimal representation into ASCII encoded decimal digits, are covered by Evans and Trimper 
(Section 6.5.1, pages 170-173; Section 6.6, pages 175-178; Section 7.2, pages 194-196; Section 7.6, pages 
214-218).
BOOTH. The booth function takes two 64-bit integers, multiplies them together, and produces a 128-bit 
signed product. The function expects the multiplicand to be in the first argument slot and the multiplier to 
be in the second. To satisfy these requirements, the calling procedure must place the multiplicand on the 
register stack in out0 and the multiplier in out1; these registers are accessible within the booth function 
as in0 and in1. The function “claims” these two input registers using the .regstk directive:
      .regstk ins, locs, outs, rots
where ins, locs, and outs determine the sof, and rots specifies the number of rotating registers.
Typically, integer functions will return a value to their caller in register r8 (ret0). The Itanium 
architecture permits up to four integer return values to be placed in general-purpose registers. Booth’s algorithm 
computes a full, 128-bit product, which is clearly larger than a single general-purpose register. Thus, the 
function makes the low-order bits of the resulting product available to the caller by placing them in ret0; the 
high-order bits are placed in register r9 (ret1).
Note that the booth function largely uses the scratch registers and does not allocate a new stack frame. 
The function does save the caller’s ar.lc register (a preserved register) in r31.
DECNUM3. The DECNUM3 test program also makes use of the scratch registers. However, because the 
contents of these registers are undefined upon the return from a procedure call (in this example, the call to 
booth), the calling routine is responsible for saving and restoring any important values of these registers. 
As a consequence, Evans and Trimper point out that, when designing and using procedure calls, it is important 
to enumerate the registers that must be preserved and then devise an efficient way to save and restore their 
contents.
The DECNUM3 program allocates a new stack frame, using the alloc instruction, that will hold the six 
local and two output values. Table 12 lists the registers in the DECNUM3 stack frame and their uses.
Register      Purpose
loc0 (r32)    Preserve rp
loc1 (r33)    Preserve ar.pfs
loc2 (r34)    Pointer to user-defined stack
loc3 (r35)    Approximation to 0.8
loc4 (r36)    Previous quotient for the remainder calculation
loc5 (r37)    Preserve gp (r1)
out0 (r38)    Pass multiplicand to booth
out1 (r39)    Pass multiplier to booth
Table 12: The registers, and their uses, of the DECNUM3 stack frame
We close this discussion by noting that the expected result is 4,886,718,345, the decimal representation 
of 0x123456789. The program can be tested by using the debugger included with your programming 
environment to inspect the contents of the 80 bytes starting at address A3.
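The division-by-ten step at the heart of DECNUM3 (multiply by the constant 0xcccccccccccccccd, keep the high 64 bits, shift right by three) can be reproduced in portable C. This is a sketch under stated assumptions: `mulhi64`, `div10`, and `to_decimal` are our own names, and the sketch uses an unsigned high multiply directly rather than calling a signed Booth routine and correcting the high half afterwards, as the assembly does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DOT8 UINT64_C(0xcccccccccccccccd)   /* approximately 0.8 * 2^64 */

/* High 64 bits of the unsigned 128-bit product a*b, built from 32-bit halves. */
static uint64_t mulhi64(uint64_t a, uint64_t b)
{
    uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;
    uint64_t lo_lo = a_lo * b_lo;
    uint64_t hi_lo = a_hi * b_lo;
    uint64_t lo_hi = a_lo * b_hi;
    uint64_t hi_hi = a_hi * b_hi;
    uint64_t cross = (lo_lo >> 32) + (hi_lo & 0xFFFFFFFFu) + lo_hi;
    return hi_hi + (hi_lo >> 32) + (cross >> 32);
}

/* Quotient value/10: multiply by ~0.8 and divide by 8 (shr.u by 3). */
uint64_t div10(uint64_t value)
{
    return mulhi64(value, DOT8) >> 3;
}

/* Writes the decimal digits of value into buf; returns the digit count. */
size_t to_decimal(uint64_t value, char *buf, size_t buflen)
{
    char digits[20];                 /* a 64-bit value has at most 20 digits */
    size_t n = 0;
    do {
        uint64_t q = div10(value);
        uint64_t rem = value - 10 * q;       /* remainder from previous quotient */
        digits[n++] = (char)('0' + rem);     /* convert to ASCII */
        value = q;
    } while (value != 0);
    assert(buflen > n);              /* room for the digits and the terminator */
    for (size_t i = 0; i < n; i++)
        buf[i] = digits[n - 1 - i];  /* digits were produced in reverse order */
    buf[n] = '\0';
    return n;
}
```

Digits are produced least-significant first, which is why the assembly version pushes them onto a user-defined stack before copying them out.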
// BOOTH: Full-width integer multiplication using Booth's algorithm
        W       = 64                      // Declare a constant
        .text                             // Declare code section
        .align  32                        // Specify desired alignment
        .global booth                     // Mark entry point for 'booth'
        .proc   booth
booth:
        .prologue                         // Begin prologue section
        .regstk 2,0,0,0                   // Declare 2 ins
        .save   ar.lc,r31
        mov     r31=ar.lc                 // Save caller's ar.lc
        .body                             // Begin procedure 'booth'
first:  add     r2=W-1,r0                 // Number of traversals
        mov     ar.lc=r2                  // Initialize ar.lc
        mov     r19=0                     // Set bit n-1 to zero
        mov     ret0=in1                  // Set R to multiplier
        mov     ret1=0;;                  // Set L to zero
cycle:  and     r22=0x1,ret0              // Isolate lowest bit of R
        xor     r23=r19,r22;;             // r23 <- whether to act
        cmp.ne  p6,p0=0,r23               // p6 <- whether to act
        mov     r19=r22                   // Bit n-1 for next iteration
(p6)    cmp.eq.unc p7,p8=0,r22            // Add, subtract, or no-op?
(p7)    add     ret1=ret1,in0             // Add X to L
(p8)    sub     ret1=ret1,in0;;           // Subtract X from L
        shrp    ret0=ret1,ret0,1          // New R of shifted LR
        shr     ret1=ret1,1               // New L of shifted LR
        br.cloop.sptk.few cycle;;         // More cycles?
done:   mov     ar.lc=r31                 // Restore caller's ar.lc
        br.ret.sptk.many b0               // Return to caller
        .endp   booth                     // End procedure 'booth'
Figure 5: A slightly modified version of the BOOTH function from Evans and Trimper
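The structure of the BOOTH listing is perhaps easier to follow in C. The sketch below is our own transcription of the loop, not code from Evans and Trimper; the variable names follow the listing's comments (L, R, X), and it assumes the multiplicand is not the most negative 64-bit value, for which the intermediate sum would overflow.

```c
#include <assert.h>
#include <stdint.h>

struct product128 { uint64_t hi, lo; };     /* ret1 (high) and ret0 (low) */

struct product128 booth_multiply(int64_t multiplicand, int64_t multiplier)
{
    assert(multiplicand != INT64_MIN);      /* assumption of this sketch */
    uint64_t x = (uint64_t)multiplicand;    /* X (in0)                  */
    uint64_t l = 0;                         /* L: high half (ret1)      */
    uint64_t r = (uint64_t)multiplier;      /* R: low half (ret0)       */
    unsigned prev = 0;                      /* bit n-1 (r19)            */

    for (int i = 0; i < 64; i++) {
        unsigned low = (unsigned)(r & 1u);  /* isolate lowest bit of R  */
        if (low != prev) {                  /* act only when the bits differ */
            if (low == 0)
                l += x;                     /* add X to L        */
            else
                l -= x;                     /* subtract X from L */
        }
        prev = low;                         /* bit n-1 for next iteration */
        r = (r >> 1) | (l << 63);           /* shrp: new R of shifted LR  */
        /* shr: arithmetic shift of L, written with unsigned operations */
        l = (l >> 1) | (l & UINT64_C(0x8000000000000000));
    }
    struct product128 p = { l, r };
    return p;
}
```

The two halves together form the two's-complement 128-bit product, so a negative result has all ones in the high half.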
// DECNUM3: Convert a positive hexadecimal integer into a
//          string of ASCII encoded decimal digits
        LEN     = 20                      // Declare a constant
        DOT8    = 0xcccccccccccccccd      // approximately 0.8
        .global booth                     // External reference to 'booth'
        .data                             // Declare data section
        .align  8                         // Specify desired alignment
X3:     data8   0x123456789               // Number to convert
A3:     .skip   80                        // Storage for ASCII string
STACK:  .skip   LEN                       // User-defined stack
        .text                             // Declare code section
        .align  32                        // Specify desired alignment
        .global main                      // Mark the mandatory program entry
        .proc   main
main:
        .prologue 12,r32                  // Begin procedure 'main'
        alloc   loc1=ar.pfs,0,6,2,0       // Allocate a new stack frame
        .save   rp,loc0
        mov     loc0=b0                   // Save return address
        .body                             // Begin procedure 'main'
first:  add     loc2=@gprel(STACK),gp     // loc2 points to STACK
        movl    loc3=DOT8                 // loc3 = DOT8 (approximately 0.8)
        mov     loc5=gp                   // Save global pointer
new:    add     r15=@gprel(X3),gp;;       // r15 points to input number
        ld8     r9=[r15]                  // Load input number
        st1     [loc2]=r0,1;;             // Push zero as a flag
again:  mov     loc4=r9                   // Save previous quotient
        mov     out0=r9                   // arg0 of 'booth' is multiplicand
        mov     out1=loc3                 // arg1 of 'booth' is multiplier
        br.call.sptk.many b0=booth        // Call 'booth', r8:r9 <- out0*out1
        mov     gp=loc5                   // Restore global pointer
nosign: add     r9=r9,loc4;;              // Add X to L
        shr.u   r9=r9,3                   // r9 <- quotient = loc4/10
        shladd  r3=r9,2,r9                // r3 <- 5*quotient
        add     r3=r3,r3                  // r3 <- 10*quotient
        sub     r3=loc4,r3                // r3 <- remainder
        or      r3=0x30,r3;;              // Convert to ASCII
        st1     [loc2]=r3,1               // Store the character
        cmp.ne  p6,p0=r9,r0               // Is quotient non-zero?
(p6)    br.cond.sptk.few again            // Yes, repeat the cycle
        st1     [r16]=r0                  // No, NULL-terminate A3 (stringz)
        add     loc2=1,loc2               // Adjust stack pointer
done:   mov     r8=0                      // Signal completion
        mov     b0=loc0                   // Restore return address
        mov     ar.pfs=loc1               // Restore caller's ar.pfs
        br.ret.sptk.many b0;;             // Return to command line
        .endp   main                      // End procedure 'main'
Figure 6: A slightly modified version of the DECNUM3 program from Evans and Trimper
12 Program Performance
Performance is typically the driving factor behind the development of new hardware and software designs. 
Science and engineering applications, where large-scale simulations and time-consuming computations 
dominate, often demand optimal performance from both the low-level hardware and the user-level code.
We now examine the mechanisms provided by the Itanium architecture for software optimization, as well 
as more general optimization guidelines that can be applied to a variety of applications. We leave hardware 
optimization considerations to expert sources such as Hennessy and Patterson.
Where possible, we discuss these topics in an implementation-independent manner; however, several of 
the more advanced topics require at least minor consideration of the architecture’s implementation. When 
this is true, we use the Itanium 2 processor as our guide. For these sections, it may be useful to refer to the 
details of that implementation, which can be found in Section 13.
12.1 Processor-Level Parallelism
At various points throughout this survey, we have alluded to a feature of modern processor designs called 
instruction pipelining. Pipelining can be likened to an assembly line in a manufacturing process: At each 
stage of the pipeline, a highly specialized component receives an input from the preceding stage, performs
one highly specialized function on that input, and produces an output for the next stage. To achieve maximum 
throughput, each component in the pipeline must perform some sort of useful work at each time-step; this is 
often referred to as “keeping the pipe full”.
Instruction pipelining, then, involves specialized hardware components executing one stage of the instruc­
tion cycle. Pipelines have an associated depth, which describes the number of stages that perform distinct 
operations. For example, the instruction pipeline of the Itanium 2 processor involves eight distinct stages; the 
pipeline’s depth is thus eight.
Ideally, each stage performs its operation in the same amount of time as all of the other stages in the 
pipeline. However, if any one stage takes longer to execute than any other, the steady flow of input-operate- 
output is temporarily interrupted because at least one stage is left waiting for input; such a situation is known 
as a pipeline stall or bubble. We describe several factors that contribute to these stalls shortly.
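The cost of a bubble can be illustrated with a small model. The following Python sketch is ours and is not taken from the Itanium documentation. It assumes an idealized pipeline in which one instruction enters per time-step, every stage takes exactly one time-step, and each bubble delays completion by exactly one time-step. The function names are hypothetical, and the default depth of eight is the Itanium 2 figure quoted above.

```python
def pipeline_cycles(n_instructions, depth=8, bubbles=0):
    """Time-steps needed to retire n_instructions on an idealized pipeline.

    Assumes one instruction enters per time-step, every stage takes one
    time-step, and each bubble costs exactly one extra time-step.
    """
    if n_instructions == 0:
        return 0
    # The first instruction needs `depth` steps to reach the end of the pipe;
    # each later instruction retires one step after its predecessor.
    return depth + (n_instructions - 1) + bubbles


def unpipelined_cycles(n_instructions, depth=8):
    """Time-steps needed if each instruction must finish before the next starts."""
    return depth * n_instructions


if __name__ == "__main__":
    n = 1000
    print("pipelined, no bubbles:", pipeline_cycles(n))
    print("pipelined, 50 bubbles:", pipeline_cycles(n, bubbles=50))
    print("not pipelined:        ", unpipelined_cycles(n))
```

Under these assumptions a full pipe approaches one retired instruction per time-step, and every bubble adds directly to the total.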
Three additional terms are often used to characterize instructions in the context of instruction pipelining:
An instruction is issued when it is permitted to pass from one stage to the next. The latency of a pipeline stage
describes the number of time-steps actually required by that stage to perform its operation on an instruction. 
Finally, an instruction is retired when it has passed through the final stage of the pipeline.
We have already mentioned that pipeline stalls can negatively impact the performance of modern CPUs. 
These stalls typically result from resource conflicts, procedural dependencies, or data dependencies.
Resource conflicts result when instructions in different stages of the pipeline require access to the same 
area in memory or the same functional unit, for example. Procedural dependencies generally result from 
branching instructions because partially completed instructions must be halted without altering the state of 
the machine. Finally, data dependencies typically arise when an instruction requires data that has not yet been 
computed, loaded, or otherwise made accessible.
The steady flow of input-operate-output in pipelined processors enables maximum performance but can 
be interrupted by any of the following:
•  multiple-issue conflicts,
•  branch-induced pipeline flushing,
•  producer-consumer dependencies, and
•  data stalls from the cache and memory hierarchy.
In EPIC designs, the programmer or compiler must ensure that these situations are avoided whenever possible.
Units are listed in slot order (slot 0, slot 1, slot 2); ";;" marks a stop after that slot.

Code  Units          Code  Units          Code  Units          Code  Units
0x0   M   I   I      0x8   M   M   I      0x10  M   I   B      0x18  M   M   B
0x1   M   I   I;;    0x9   M   M   I;;    0x11  M   I   B;;    0x19  M   M   B;;
0x2   M   I;; I      0xa   M;; M   I      0x12  M   B   B      0x1a  *
0x3   M   I;; I;;    0xb   M;; M   I;;    0x13  M   B   B;;    0x1b  *
0x4   M   L   X      0xc   M   F   I      0x14  *              0x1c  M   F   B
0x5   M   L   X;;    0xd   M   F   I;;    0x15  *              0x1d  M   F   B;;
0x6   *              0xe   M   M   F      0x16  B   B   B      0x1e  *
0x7   *              0xf   M   M   F;;    0x17  B   B   B;;    0x1f  *
Table 13: Itanium instruction templates
12.2 Instruction-Level Parallelism
Pipelining makes efficient use of the available hardware resources. This technique does not, however, reduce 
the execution time of any single instruction. Instruction-level parallelism, in which the hardware executes 
several instructions in parallel, can be used to reduce execution time and achieve greater throughput.
A processor supports superscalar execution when it has two or more instruction pipelines that can operate 
on independent data items in parallel. A superscalar processor fetches, decodes, and (possibly) executes two 
or more instructions at the same time. Of course, these processors are also susceptible to pipeline stalls, but
the additional instruction pipelines typically lead to a performance improvement over a processor with only 
a single pipeline.
Note that the multiple pipelines in a superscalar processor need not be identical; in fact, multiple spe­
cialized pipelines, each handling some subset of an architecture’s instruction set, will generally lead to better 
performance. For example, different integer and floating-point pipelines are common in many modern archi­
tectures.
12.3 Explicit Parallelism
Some architectures prescribe that the hardware will direct the execution of program instructions to maximize 
the available execution resources. Instruction-level parallelism is thus transparent to the programmer. In 
contrast, EPIC architectures rely on the programmer or compiler to schedule instructions in an intelligent and 
productive way; this is the “explicitly parallel” part of EPIC designs.
12.3.1 Instruction Templates
In Section 2, we introduced the notion of an instruction bundle: three 41-bit Itanium instructions packaged 
with a 5-bit instruction template. We noted that the templates provide extra information to the CPU regarding 
how the instructions contained within the bundle should be executed.
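The arithmetic of the bundle format (5 + 3 x 41 = 128 bits) can be made concrete with a short sketch. The Python below is illustrative only. It assumes that the template occupies the five least-significant bits of the bundle and that slots 0, 1, and 2 follow in ascending bit order; the function names `split_bundle` and `join_bundle` are hypothetical.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
BUNDLE_BITS = TEMPLATE_BITS + 3 * SLOT_BITS  # 128


def split_bundle(bundle):
    """Split a 128-bit bundle (a Python int) into (template, slot0, slot1, slot2)."""
    if not 0 <= bundle < (1 << BUNDLE_BITS):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    # Assumed layout: slot i starts at bit 5 + 41*i.
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask for i in range(3)
    )
    return (template,) + slots


def join_bundle(template, slot0, slot1, slot2):
    """Inverse of split_bundle under the same assumed layout."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    bundle = template
    for i, slot in enumerate((slot0, slot1, slot2)):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle
```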
51
A  S u r v e y  o f  t h e  I t a n i u m  A r c h i t e c t u r e
f r o m  a  P r o g r a m m e r ’ s P e r s p e c t i v e e x p l i c i t  Pa r a l l e l i s m
In particular, the template dictates which execution units are required by the instructions in the bundle. 
The 5-bit field indicates that 32 instruction templates are possible; however, only 24 of these have been 
defined. Table 13 lists the available templates. In this table, “*” indicates that the template
code has been reserved for future extensions to the architecture.
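The contents of Table 13 can also be expressed as a lookup table, which is convenient when experimenting with bundle layouts. The sketch below is ours. The dictionary simply transcribes the table, and the helper `describe` (a hypothetical name) returns the unit required by each slot together with a flag indicating whether a stop follows that slot.

```python
# Template code -> units for slots 0..2; ";;" marks a stop after that slot.
# Codes missing from the dictionary are reserved.
TEMPLATES = {
    0x00: "M I I",   0x01: "M I I;;",   0x02: "M I;; I", 0x03: "M I;; I;;",
    0x04: "M L X",   0x05: "M L X;;",
    0x08: "M M I",   0x09: "M M I;;",   0x0A: "M;; M I", 0x0B: "M;; M I;;",
    0x0C: "M F I",   0x0D: "M F I;;",   0x0E: "M M F",   0x0F: "M M F;;",
    0x10: "M I B",   0x11: "M I B;;",   0x12: "M B B",   0x13: "M B B;;",
    0x16: "B B B",   0x17: "B B B;;",
    0x18: "M M B",   0x19: "M M B;;",
    0x1C: "M F B",   0x1D: "M F B;;",
}


def describe(code):
    """Return [(unit, stop_after), ...] for a template code, or None if reserved."""
    if not 0 <= code < 32:
        raise ValueError("template codes are 5 bits wide")
    text = TEMPLATES.get(code)
    if text is None:
        return None
    return [(slot.rstrip(";"), slot.endswith(";;")) for slot in text.split()]


if __name__ == "__main__":
    for code in range(32):
        print(hex(code), describe(code))
```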
Itanium assemblers recognize special directives for manually assigning templates to instruction bundles:
    { .mmi                              // use an MMI template for this bundle
        Type M or Type A instruction
        possible stop                   // if M;;MI or M;;MI;;
        Type M or Type A instruction
        Type I instruction
        possible stop                   // if M;;MI;;
    }
Here, we have illustrated the .mmi directive; similar directives are available for the other instruction tem­
plates.
You will recall that Type A instructions, the most common type, can be executed by either I- or M-units; 
this gives programmers and compilers a high degree of latitude when grouping program instructions into 
bundles.
Typically, instruction templates are not assigned by hand; the compiler or assembler assumes this re­
sponsibility. However, for a more thorough discussion, including exercises comparing results from the most 
popular Itanium compilers, we recommend Evans and Trimper, Section 10.3.1 (pages 302-307).
12.3.2 Data Dependencies and Speculation
Four general cases of data dependencies within an Itanium instruction group can be identified:
•  Read-after-write (RAW) dependencies are not permitted for Itanium registers (there are a few
exceptions) but are permitted for memory: A load from a location in memory to which data has recently been
written will retrieve the stored value.
•  Write-after-write (WAW) dependencies are also not generally permitted for registers, but several
compare instructions are permitted within the same instruction group. Besides these exceptions, Itanium
registers are permitted to occur as a destination operand only once per instruction group. Although
WAW dependencies are not explicitly forbidden by the architecture, these should be avoided as multiple
writes may cause resource conflicts.
•  Write-after-read (WAR) dependencies are allowed for both registers and memory.
•  Read-after-read (RAR) dependencies are allowed for both registers and memory.
Templates that include the explicit stop must be used for RAW and WAW dependencies involving data from
a source register.
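A compiler or assembler must detect the register cases above before deciding where stops are required. The following Python sketch shows one way such a check might look. It is a simplification under stated assumptions: each instruction is reduced to a destination register and a list of source registers, memory dependencies are ignored, and none of the architectural exceptions (such as the special compare cases) are modeled. The name `find_register_hazards` is hypothetical.

```python
def find_register_hazards(group):
    """Return RAW and WAW register hazards among instructions with no stop between them.

    `group` is a list of (dest, sources) pairs in program order, where dest is
    a register name (or None) and sources is a list of register names.
    Each hazard is reported as (kind, producer_index, consumer_index, register).
    """
    hazards = []
    written = {}  # register -> index of the most recent instruction writing it
    for i, (dest, sources) in enumerate(group):
        for src in sources:
            if src in written:
                hazards.append(("RAW", written[src], i, src))
        if dest is not None:
            if dest in written:
                hazards.append(("WAW", written[dest], i, dest))
            written[dest] = i
    return hazards


if __name__ == "__main__":
    # ld8 r20=[r15] followed by add r20=1,r20 with no stop in between.
    print(find_register_hazards([("r20", ["r15"]), ("r20", ["r20"])]))
    # A WAR pair, which is allowed: add r3=r4,r0 then add r4=r5,r0.
    print(find_register_hazards([("r3", ["r4"]), ("r4", ["r5"])]))
```

Any hazard reported by such a check would have to be resolved by placing a stop between the two instructions involved.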
The Itanium architecture includes support for data speculation. This technique requires special hardware 
and attempts to minimize the risk or impact of data stalls. These stalls occur when the data needed by an 
instruction is not yet available in a register because, as we know, a load from memory may take many, many 
cycles. To mitigate the impact of data stalls, most compilers handle this situation by rearranging the code so 
that it is logically equivalent to the programmer’s intent but with the load appearing earlier in the instruction 
sequence.
Itanium load instructions that have been moved must fetch the data speculatively. Itanium processors 
have a special internal structure called the advanced load address table (ALAT), which stores a register name 
and associated memory address. Itanium store instructions query the ALAT, invalidating all entries whose 
memory address overlaps with any portion of the data to be stored. Data speculations with an invalidated 
ALAT entry will then fail. Recovery code must also be inserted to handle the situations when a speculative 
load has failed.
Consider the following example:

No Data Speculation
    <code block A>
    st8       [r14]=r24
    ld8       r20=[r15];;
    add       r20=1,r20
    <code block B>
    st8       [r16]=r20

With Data Speculation
    ld8.a     r20=[r15];;               // Advanced load
    <code block A>
    st8       [r14]=r24
    ld8.c.clr r20=[r15];;               // Check load
    add       r20=1,r20
    <code block B>
    st8       [r16]=r20
If the compiler does not know whether registers r14 and r15 will point to overlapping regions of memory,
then moving the load instruction is a speculative decision and not guaranteed to produce the correct results.
This situation requires an advanced load (ld8.a) with a check load (ld8.c.clr) functioning as
the recovery routine.
As noted, the load type completer .a indicates an advanced load. This instruction inserts an entry for
the address contained in register r15 into the ALAT (possibly displacing some other entry). Later, the check
load completer .c.clr invokes an ALAT query, searching for register r20. If found, the ALAT entry is cleared
and the recovery load is not necessary. However, if the appropriate entry is not found, the load is executed and
the value in r20 is refreshed. The Itanium ISA also provides the .c.nc load type completer if the ALAT
entry should not be flushed after an ALAT hit.
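The interaction between advanced loads, stores, and check loads can be modeled in a few lines. The Python sketch below is a deliberately simplified model of the ALAT and is not a description of any real implementation: it assumes unlimited capacity and full address comparison, whereas hardware tables are small and entries may be displaced. The class and method names are hypothetical.

```python
class ALAT:
    """Toy advanced load address table: register name -> (address, size)."""

    def __init__(self):
        self.entries = {}

    def advanced_load(self, reg, addr, size):
        """ld.a: record the target register and the bytes it was loaded from."""
        self.entries[reg] = (addr, size)

    def store(self, addr, size):
        """st: invalidate every entry that overlaps the stored bytes."""
        self.entries = {
            reg: (a, s)
            for reg, (a, s) in self.entries.items()
            if a + s <= addr or addr + size <= a  # keep only disjoint ranges
        }

    def check_load(self, reg, clear=True):
        """ld.c: True on a hit (no reload needed), False on a miss.

        clear=True models .clr (entry removed on a hit);
        clear=False models .nc (entry retained on a hit).
        """
        hit = reg in self.entries
        if hit and clear:
            del self.entries[reg]
        return hit


if __name__ == "__main__":
    alat = ALAT()
    alat.advanced_load("r20", addr=0x1000, size=8)  # ld8.a r20=[r15]
    alat.store(addr=0x2000, size=8)                 # st8 [r14]=r24, no overlap
    print("reload needed:", not alat.check_load("r20"))
```

When the store address overlaps the advanced load, `check_load` reports a miss and the caller must perform the load again, which corresponds to the refresh of r20 described above.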
If the ALAT still holds the necessary entry and no store conflicts have arisen, then the speculation
succeeds, minimizing the load latency by executing <code block A> while the memory hierarchy satisfies
the requested load.
The effectiveness of this speculation depends in part upon the length of <code block A>. If the
instructions composing this block execute quickly, then a data stall might still occur. The compiler may thus
choose a more aggressive rearrangement of the code, as follows:
No Data Speculation
    <code block A>
    st8       [r14]=r24
    ld8       r20=[r15];;
    add       r20=1,r20
    <code block B>
    st8       [r16]=r20
    <subsequent code>

With Aggressive Data Speculation
    ld8.a     r20=[r15];;               // Advanced load
    <code block A>
    add       r20=1,r20
    <code block B>
    st8       [r14]=r24
    chk.a.clr r20,recover               // Check load
back:
    st8       [r16]=r20
    <subsequent code>

recover:                                // Some other address
    ld8       r20=[r15];;               // Reload
    add       r20=1,r20
    <code block B>
    br        back
In this example, the recovery code (beginning at recover) must reload register r20 and execute the
entire instruction sequence (add and <code block B>) again, if the speculation fails. Such aggressive
speculation will lead to large code segments, but can nevertheless prove effective in certain circumstances.
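To see why the recovery code preserves correctness, it helps to execute both orderings against a model of memory. The Python sketch below is ours and makes several simplifying assumptions: memory is a dictionary of 8-byte words keyed by address, two accesses overlap only when their addresses are equal, and code blocks A and B are omitted. The function names are hypothetical.

```python
def run_original(mem, r14, r15, r16, r24):
    """Program order: store, load, add, store."""
    mem[r14] = r24          # st8 [r14]=r24
    r20 = mem[r15]          # ld8 r20=[r15]
    r20 += 1                # add r20=1,r20
    mem[r16] = r20          # st8 [r16]=r20
    return mem


def run_speculative(mem, r14, r15, r16, r24):
    """Load hoisted above the store, with chk.a-style recovery."""
    r20 = mem[r15]          # ld8.a r20=[r15]
    alat_addr = r15         # the ALAT remembers where r20 came from
    r20 += 1                # add r20=1,r20 (uses the speculative value)
    mem[r14] = r24          # st8 [r14]=r24
    if alat_addr == r14:    # the store invalidated the entry
        # chk.a.clr r20,recover: redo the load and the dependent add
        r20 = mem[r15]
        r20 += 1
    mem[r16] = r20          # st8 [r16]=r20
    return mem


if __name__ == "__main__":
    initial = {0: 5, 8: 7, 16: 0}
    for store_addr in (0, 8):   # 8 aliases the load address
        a = run_original(dict(initial), store_addr, 8, 16, 99)
        b = run_speculative(dict(initial), store_addr, 8, 16, 99)
        print(store_addr, a == b, a[16])
```

In both the aliasing and non-aliasing cases the two versions leave memory in the same state; only the aliasing case pays for the recovery path.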
The Itanium ISA provides the chk.a instruction to query the ALAT and conditionally branch to the
recovery routine. Either .clr or .nc must be used to complete the instruction; these completers affect the
ALAT as described before. Note that while chk.a has the same branch range as other IP-relative branch
instructions, this instruction executes in an M-unit and not a B-unit.
Finally, we note that the Itanium ISA provides several forms of an instruction to manually invalidate 
entries in the ALAT:
    invala          // Invalidate all ALAT entries
    invala.e r1     // Invalidate entry for general-purpose register r1
    invala.e f1     // Invalidate entry for floating-point register f1
This instruction will do nothing if no matching entry is found when a register is specified.
12.3.3 Control Dependencies and Speculation
Control dependency describes situations in which speculative execution is contingent on the logical flow of
the program. In these cases, any exceptions that might occur (for example, a floating-point exception) should 
not be raised if the speculatively executed instructions would not have otherwise been encountered. The 
NaT bit (for general-purpose registers) and the special NaTVal (for floating-point registers) can suppress this 
behavior.
In previous sections, we have noted that an operation will propagate either the NaT bit or NaTVal if any of
its operands is so marked, to be dealt with when convenient. The Itanium ISA includes the tnat instruction
to test a register’s NaT bit:
    tnat.trel.ctype pt,pf=r3
where the test relationship completer trel can be z (zero) or nz (non-zero) and ctype can be any of those
for the compare instructions.
A similar test can be performed for NaTVal in floating-point registers using the fclass instruction.
Consider the following example:
No Control Speculation
    <code block A>
    (px) br.cond notdo
    ld8       r20=[r15];;
    add       r20=1,r20
    <code block B>
notdo:
    <subsequent code>

With Control Speculation
    ld8.s     r20=[r15];;
    <code block A>
    (px) br.cond notdo
    chk.s     r20,recover
back:
    add       r20=1,r20
    <code block B>
notdo:
    <subsequent code>

recover:                                // Some other address
    ld8       r20=[r15];;               // Reload
    br        back
Here, the load with a long latency depends upon falling through the predicated branch, and the compiler 
generates a speculative code sequence to overlap execution of other useful work and the load instruction.
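The deferral of exceptions in this sequence can be modeled as follows. The Python sketch is illustrative only: a sentinel object stands in for a register whose NaT bit is set, an invalid address is modeled as a missing dictionary key, and a fault is modeled as a `KeyError`. All names are hypothetical, and code blocks A and B are omitted.

```python
NAT = object()  # stands in for a register whose NaT bit is set


def speculative_load(mem, addr):
    """ld8.s: return the value, or NAT instead of raising a fault."""
    return mem.get(addr, NAT)


def load(mem, addr):
    """Ordinary ld8: faults (here, KeyError) on an invalid address."""
    return mem[addr]


def add(a, b):
    """Arithmetic propagates NaT if any operand is so marked."""
    return NAT if a is NAT or b is NAT else a + b


def run(mem, r15, px):
    r20 = speculative_load(mem, r15)   # ld8.s r20=[r15]
    if px:                             # (px) br.cond notdo
        return None                    # result never used, so no fault is raised
    if r20 is NAT:                     # chk.s r20,recover
        r20 = load(mem, r15)           # recovery: non-speculative reload, may fault
    return add(r20, 1)                 # add r20=1,r20


if __name__ == "__main__":
    print(run({8: 41}, 8, px=False))   # load succeeds speculatively
    print(run({}, 8, px=True))         # bad address, but the branch skips its use
```

The fault is raised only when the branch falls through and the value is actually needed, which is the behavior that the unspeculated program would have exhibited.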
The Itanium ISA includes the chk.s assembler pseudo-op to branch conditionally if the NaT bit of a
register is set. It has the same branch range as other IP-relative branching instructions and can execute in
either an I- or M-unit.
It is also possible to combine advanced and speculative loading with the ld.sa instruction, causing the
ALAT to track success or failure of the load while deferring any exceptions.
12.4 Program Optimizations
Software optimization is considered by many to be a “black art”. A good resource outlining general software 
optimization guidelines is Richard Gerber’s book, The Software Optimization Cookbook. We also recommend 
the Itanium 2 Processor Reference Manual for Software Development and Optimization, which is part of the
Itanium architecture documentation.
12.4.1 Performance Considerations
Several architectural factors may contribute to a program’s performance. We outline many of these that are of
concern for the Itanium architecture. The lfetch instruction for software prefetching is also described.
Addressing modes. You will recall from our discussion in Section 2 that the Itanium architecture supports the 
immediate, register direct, and register indirect addressing modes. For best performance, programs should 
retain the most frequently used data in processor registers and minimize interaction with system memory. 
While the cache structures help to alleviate the impact of time-consuming memory operations, that impact
cannot be totally eliminated, and performance degradation can result.
Code size. Code size is much less of a concern in modern architectures than those of even the not-so-distant 
past. Loop unrolling, which is a common optimization technique employed by modern compilers, may 
unnecessarily contribute to code bloat. Performance enhancements cannot be guaranteed by the technique 
but it will always increase the total code size. Loop unrolling can be beneficial, however, due to the relatively 
high cost of branch instructions.
A typical compiler for the Itanium architecture will offer unrolling as one of many possible methods for 
loop optimization, but software pipelining using the architecture’s rotating register sets will often yield better 
results. We discuss loop optimization more thoroughly at the end of this section.
Instruction reordering. We encountered instruction reordering previously, when we introduced the Itanium 
advanced and speculative load instructions. Here, instructions were rearranged so that better performance 
might result without impacting the logical intent of the program.
At a lower level, instructions within an instruction bundle are mutually independent and can be reordered 
to enhance the efficiency with which the instructions are executed. This low-level reordering may result 
in fewer no-op instructions or it may make better use of the available execution units. Typically, Itanium 
compilers can perform these optimizations but Itanium assemblers cannot.
Inline functions and recursion. You saw in Section 11 that using procedure calls involves a significant 
amount of overhead. Function inlining, where the body of a procedure or function is inserted directly into the 
code of the calling routine, can eliminate these overheads at the expense of greater code size.
Software prefetching. In addition to the advanced and speculative loads that we have already discussed, the 
Itanium ISA includes instructions for prefetching lines into the cache structures:
    lfetch.lftype.lfhint [r3]                // mem[r3]
    lfetch.lftype.lfhint [r3],r2             // mem[r3]
                                             // r3 <- r3+r2
    lfetch.lftype.lfhint [r3],imm9           // mem[r3]
                                             // r3 <- r3+sext(imm9)
    lfetch.lftype.excl.lfhint [r3]           // mem[r3]
    lfetch.lftype.excl.lfhint [r3],r2        // mem[r3]
                                             // r3 <- r3+r2
    lfetch.lftype.excl.lfhint [r3],imm9      // mem[r3]
                                             // r3 <- r3+sext(imm9)
where the line containing the address in register r3 is brought into the cache, followed (optionally) by a
postincrement of that address, either by the value in register r2 or by the 9-bit sign-extended immediate
value imm9.
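The sign extension applied to imm9 means that the postincrement can move the address backward as well as forward. The following Python sketch shows the arithmetic; the helper names are hypothetical, and the wrap to 64 bits reflects the width of the general-purpose registers.

```python
def sext(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def lfetch_postincrement(r3, imm9):
    """Value of r3 after `lfetch [r3],imm9`, wrapped to 64 bits."""
    return (r3 + sext(imm9, 9)) & ((1 << 64) - 1)


if __name__ == "__main__":
    print(sext(0x0FF, 9))                            # largest positive increment
    print(sext(0x100, 9))                            # most negative increment
    print(hex(lfetch_postincrement(0x1000, 0x1C0)))  # steps backward
```

A 9-bit immediate therefore covers increments from -256 to +255 bytes.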
A line will be marked exclusive when the prefetch instruction includes the excl completer; this form is
useful when programs will quickly write to an address within the prefetched line.
There are two valid values for lftype, the line prefetch type completer: none and fault. None,
which corresponds to omitting this instruction completer, ignores all faults associated with an ordinary load
operation, while the fault completer will raise these faults as necessary.
There are four valid values for lfhint, the line prefetch hint completer: none, nt1, nt2, and nta.
None, which corresponds to omitting this instruction completer, indicates that the program associates
temporal locality in the L1 cache with the prefetched line. The remaining completers, nt1, nt2, and nta,
indicate that the program associates nontemporal locality in the L1, L2, or all levels of the hierarchy with the
prefetched line.
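The completer combinations described above can be enumerated mechanically. The Python sketch below builds the mnemonic for a given choice of lftype, excl, and lfhint and rejects values outside the sets listed in the text. The function name is hypothetical, and the sketch produces only the mnemonic, not an encoded instruction.

```python
LFTYPES = (None, "fault")
LFHINTS = (None, "nt1", "nt2", "nta")


def lfetch_mnemonic(lftype=None, excl=False, lfhint=None):
    """Assemble an lfetch mnemonic in the order lftype, excl, lfhint."""
    if lftype not in LFTYPES:
        raise ValueError("lftype must be omitted or 'fault'")
    if lfhint not in LFHINTS:
        raise ValueError("lfhint must be omitted, 'nt1', 'nt2', or 'nta'")
    parts = ["lfetch"]
    if lftype:
        parts.append(lftype)
    if excl:
        parts.append("excl")
    if lfhint:
        parts.append(lfhint)
    return ".".join(parts)


if __name__ == "__main__":
    for lftype in LFTYPES:
        for excl in (False, True):
            for lfhint in LFHINTS:
                print(lfetch_mnemonic(lftype, excl, lfhint))
```

Two type values, the optional excl completer, and four hint values give sixteen distinct mnemonics.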
Other factors. Clearly the discussion of performance factors that we have provided here is not exhaustive.
Several other factors, including instruction size, instruction power, and function recursion, are considered 
more fully by Evans and Trimper (Section 10.6, pages 325-334). We also recommend the sources mentioned 
previously (Gerber’s book and the Itanium documentation) for a wealth of useful information.
12.4.2 Low-level Optimization Hints
The following optimization hints have been adapted from the Intel Itanium 2 Processor Reference Manual 
for Software Development and Optimization and so do not apply to the Itanium architecture in general. The
details of the Itanium 2 processor are discussed in Section 13.
Instruction scheduling. Intel offers a number of guidelines that can be used to minimize the chances of 
implicit stops and other pipeline stalls. We summarize them here:
•  Schedule the most restrictive instructions early in a bundle. Doing so will help minimize conflicts with 
generic instruction subtypes that might otherwise consume the specific port required by a restricted 
instruction.
•  When placing Type A instructions in an instruction bundle’s I-slot, try to schedule actual Type I in­
structions first. Doing so will enable the Type A instructions, which do not require the I-units, to be 
completed by an M-unit if necessary. Note that it is preferable to place Type A instructions in an M-slot 
whenever possible.
•  Control-speculation (advanced and check) and pair (ldfps) floating-point loads require the first two
M-units, while other floating-point loads can be performed by any of the Itanium 2 processor’s four 
M-units. When mixing the restrictive and regular floating-point load instructions, schedule the regular 
loads late in an instruction bundle to ensure that they do not unnecessarily consume the first two M- 
units and delay the restricted load types.
•  Avoid nop.f, the floating-point no-op, as unintended floating-point stalls may result from long-latency
floating-point instructions.
•  Several more dual-issue templates have been added to the Itanium 2 processor. (We discuss this concept 
in Section 13.) As a result, the .mfi instruction template should be avoided.
Branch prediction. We know that branch instructions incur a significant performance impact because they 
interrupt the sequential flow of program execution. The Itanium architecture offers branch prediction as a 
means to mitigate the negative impacts.
The branch whether hint (bwh) completers were discussed in Section 6. The performance impact of these
completers depends on other branch instructions contained within a two-bundle window, as well as other 
branch information that the processor maintains.
Dynamic branch prediction (.dpxx) is the recommended default value. However, if the ctop or cloop
branch type completers are used, static branch prediction (.spxx) is recommended.
Static prediction is also recommended for very short (1 or 2 cycle) loops, because loops will not have to 
wait for the processor to regenerate a new dynamic prediction. If dynamic prediction is used, the processor 
may stall the loop while it updates the prediction.
Branch prediction hints are not recommended for .bbb instruction bundles, as unexpected behavior may
result. However, the .clr completer can be specified for the slot 0 branch when the use of a .bbb bundle
cannot (or should not) be avoided.
Correctly predicted indirect branches always incur a two-cycle bubble, while the penalty associated with 
an incorrect result is at least six cycles. To minimize the impact of mispredictions with indirect branch targets, 
Intel recommends the following:
•  Separate the branch register write and indirect branch by at least six accesses to the L1 instruction cache.
Add an additional write to the branch register above the actual write as a hint to the target.
•  Use different branch registers for each indirect branch instance to minimize conflicts with other indirect 
branches.
Instruction prefetching. Instruction prefetching, in which instruction cache lines are moved into the L1 
instruction cache, is supported by the Itanium 2 processor and is available in two forms: streaming and hint.
Streaming prefetching corresponds to the use of the .many completer for branch instructions and causes
the prefetch engine to continuously issue prefetch requests at a rate of one request per cycle. The lines are 
prefetched from 64 or 128 bytes (depending on the alignment of the branch target) beyond the target address. 
Streaming prefetching terminates when one of the following conditions is met:
•  A predicted-taken branch is encountered.
•  A branch misprediction occurs.
•  The brp instruction is encountered.
The brp instruction is used to inform the hardware of an upcoming branch instruction. When used without
the .imp completer, it is assumed that the prefetch engine will have already prefetched beyond the upcoming
branch and further prefetches would be useless.
Hint prefetching begins with the brp and mov br instructions. Several completers for brp are available;
we recommend Section 8.2 of the Intel Itanium 2 Processor Reference Manual for Software Development and
Optimization (herein referred to as “the optimization manual”) and Section 3:28 of the Itanium Instruction
Set Reference for further details of hint prefetching.
We have omitted any discussions of prefetch flushing hints or the brl instruction; Section 8.2 of the
optimization manual covers the details of these topics.
These low-level hints are of more concern to assembly language programmers or compiler writers than 
those using a high-level language. However, as we have maintained throughout this survey, understanding 
the low-level mechanisms employed by an architecture will enable the high-level programmer to write better 
code. For more details of the topics we have just considered, consult the optimization manual.
12.4.3 Performance Monitoring
The Itanium architecture provides a number of features for advanced performance monitoring, among them 
the pmd registers that were introduced in Section 3. You will recall that the architecture requires that at least 
eight pmd registers be implemented. The Itanium 2 processor provides four 48-bit performance counters. In
addition, there are over 100 monitorable events and a rich set of advanced monitoring features.
We would certainly stray beyond the intended scope of this work if we were to describe all of the
performance monitoring features and abilities of the Itanium 2 processor. Instead, we provide an extremely
concise overview and refer you to specific pages within Sections 10 and 11 of the optimization manual
for more thorough discussions.
The Itanium 2 processor provides two programming models for performance monitoring: workload char­
acterization and profiling. Both of these models are discussed in Section 10.2 (pages 10-1 through 10-12) of 
the optimization manual.
The Itanium 2 processor performance monitoring events are broken into several categories, including ba­
sic events (clock cycles, for example), branch events, and system events, among others. Using the associated 
event counters, common performance metrics can be derived, giving the programmer insight into the behavior 
of the application. These and related topics are taken up in Section 11 of the optimization manual.
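As a hedged illustration of how raw event counts become performance metrics, the Python sketch below derives two common ratios. The dictionary keys are placeholder names of our own choosing and are not the event names used by the Itanium 2 processor; the optimization manual lists the actual events.

```python
def derive_metrics(counts):
    """Derive simple metrics from raw event counts.

    `counts` uses hypothetical keys: 'cycles', 'instructions_retired',
    'branches', and 'branch_mispredicts'.
    """
    cycles = counts["cycles"]
    branches = counts["branches"]
    return {
        # Instructions retired per clock cycle.
        "ipc": counts["instructions_retired"] / cycles if cycles else 0.0,
        # Fraction of branches that were mispredicted.
        "branch_mispredict_rate": (
            counts["branch_mispredicts"] / branches if branches else 0.0
        ),
    }


if __name__ == "__main__":
    sample = {
        "cycles": 1000,
        "instructions_retired": 2500,
        "branches": 200,
        "branch_mispredicts": 10,
    }
    print(derive_metrics(sample))
```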
12.5 Loop Optimization as a Practical Example
Having at least introduced many of the factors contributing to program performance, we now turn to a prac­
tical example: loop optimization. Loops are a common structure provided by high-level languages, and they 
make many repetitive tasks easy to accomplish with concise and readable code. Because of these features, 
loop optimization is an important and useful topic to consider.
12.5.1 Loop Unrolling
We mentioned previously that loop unrolling is an optimization technique employed by many compilers.
While we noted that this technique may lead to unnecessary code bloat, there are also advantages that can 
lead to better performance.
Consider an example where the number of registers required by the loop body is significantly less than the
total number of available processor registers. In this instance, an optimizing compiler will “unroll” the loop; 
that is, it will duplicate the instructions composing the loop’s body some number of times. How many times 
depends upon a number of factors, including the number of free registers. By doing so, it can reduce the total 
number of loop traversals and eliminate some amount of overhead. In addition, it may be able to rearrange 
the instruction sequence more significantly and gain further speed-ups.
There are trade-offs besides code size that must be considered. For example, if the number of loop 
traversals is not a convenient factor of the number of free processor registers, then special code to handle the 
remaining traversals must be inserted. Likewise, if that number is determined dynamically and not known at 
compile-time, then unrolling the loop is significantly more difficult, if not impossible. Finally, as the size of 
the loop’s body increases, there is a greater likelihood that all of its instructions will not fit into the instruction 
cache. This situation can lead to thrashing, which will negatively impact the loop’s performance.
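The shape of an unrolled loop, including the special code for leftover traversals, is shown in the sketch below. We use Python purely to illustrate the transformation; an interpreter will not show the speed-up that a compiler targeting the Itanium architecture might obtain. The unroll factor of four and the function names are arbitrary choices of ours.

```python
def sum_rolled(values):
    """Reference loop: one element per traversal."""
    total = 0
    for v in values:
        total += v
    return total


def sum_unrolled_by_4(values):
    """Same computation with the loop body duplicated four times."""
    n = len(values)
    main = n - (n % 4)          # largest multiple of 4 not exceeding n
    t0 = t1 = t2 = t3 = 0       # independent accumulators, one per copy
    for i in range(0, main, 4):
        t0 += values[i]
        t1 += values[i + 1]
        t2 += values[i + 2]
        t3 += values[i + 3]
    total = t0 + t1 + t2 + t3
    for i in range(main, n):    # remainder loop for the leftover traversals
        total += values[i]
    return total


if __name__ == "__main__":
    data = list(range(1, 11))   # ten elements: two unrolled passes, two leftovers
    print(sum_rolled(data), sum_unrolled_by_4(data))
```

The four accumulators are independent of one another, which is what gives a scheduler the freedom to rearrange the duplicated instructions.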
12.5.2 Software-Pipelined Loops
Often, a better way to handle loop optimization is with software pipelines. You will recall that many modern
processors support instruction pipelining at the hardware level. A similar principle can be implemented in 
software, using the Itanium architecture’s support for qualifying predicates, special branch instructions, and 
rotating register sets. We diverge for a moment to introduce the Itanium rotating registers.
Rotating registers. Floating-point registers Fr32-Fr127 and predicate registers Pr16-Pr63 are designated
rotating register sets. General-purpose registers can also operate as rotating registers; they must be alloc’d
in groups of eight, beginning with Gr32.
Each set of rotating registers will be renamed by incrementing the register number when a special branch 
instruction is encountered. For example, Fr32 becomes Fr33, Fr33 becomes Fr34, and so on. Rotating registers 
allow data contained in a register from a previous iteration to remain accessible in subsequent iterations, but 
in a differently named register; each new iteration is given a new group of registers with which to operate. 
Note that the hardware automatically handles rotation and renaming of all three rotating register sets.
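A toy model may make the renaming easier to picture. The Python class below is a sketch under simplifying assumptions: it models a single set of eight rotating registers, and the class name, method names, and the `base` field are hypothetical. A value written under one name becomes visible under the next higher name after each rotation, and the highest name wraps around to the lowest.

```python
class RotatingRegisters:
    """Toy model of one rotating register set (names first..first+size-1)."""

    def __init__(self, first=32, size=8):
        self.first = first
        self.size = size
        self.physical = [0] * size
        self.base = 0            # rename base, adjusted on every rotation

    def _slot(self, name):
        if not self.first <= name < self.first + self.size:
            raise IndexError(f"r{name} is not in the rotating set")
        return (name - self.first + self.base) % self.size

    def write(self, name, value):
        self.physical[self._slot(name)] = value

    def read(self, name):
        return self.physical[self._slot(name)]

    def rotate(self):
        # Moving the base down by one makes every value appear one name higher.
        self.base = (self.base - 1) % self.size
```

For example, after `regs.write(32, 11)` and one call to `regs.rotate()`, the value 11 is read back with `regs.read(33)`, while r32 is free to receive the next iteration's data.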
Modulo-scheduling. Now, using a rotating register set, programmers are able to create a software pipeline 
and modulo-schedule the instructions in a loop, enabling instructions from different iterations of the loop to 
execute in parallel.
Modulo-scheduled loops typically consist of three phases: the prolog phase, the kernel phase, and the 
epilog phase. During the prolog phase, the software pipeline fills. Some instructions of the loop will be 
predicated false.
During the kernel phase, a new iteration of the loop will begin with each cycle, while some previous 
iteration completes. Here, the software pipeline is full, and generally all of the instructions are predicated 
true.
Finally, in the epilog phase, the software pipeline drains; no new iterations are started and the uncompleted 
iterations are finished. With each cycle, one predicate register will be set to false.
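The three phases can be made concrete with a small, idealized schedule. The Python sketch below assumes that every pipeline stage takes exactly one cycle and that a new iteration may start on every cycle; the function names are illustrative only. It lists which iteration occupies which stage in each cycle and classifies the cycle as prolog, kernel, or epilog.

```python
def software_pipeline(iterations, stages):
    """Return, for each cycle, the (stage, iteration) pairs that execute."""
    cycles = []
    for cycle in range(iterations + stages - 1):
        active = [(stage, cycle - stage)
                  for stage in range(stages)
                  if 0 <= cycle - stage < iterations]
        cycles.append(active)
    return cycles


def phase(cycle, iterations, stages):
    """Classify a cycle of the idealized schedule."""
    if cycle >= iterations:
        return "epilog"      # no new iteration starts; the pipeline drains
    if cycle < stages - 1:
        return "prolog"      # the pipeline is still filling
    return "kernel"          # every stage holds a live iteration
```

With five iterations and three stages, cycles 0 and 1 form the prolog, cycles 2 through 4 the kernel, and cycles 5 and 6 the epilog. When there are fewer iterations than stages, the pipeline never fills completely and no cycle is classified as kernel.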
Branch instructions for software pipelining. The Itanium ISA provides special branch instructions for loop 
control that are used with rotating registers:
    br.ctop.bwh.ph.dh target          // rotate and return to top
                                      // exit when ar.lc = 0 and
                                      // ar.ec = 1
    br.cexit.bwh.ph.dh target         // rotate and fall through
                                      // exit when ar.lc = 0 and
                                      // ar.ec = 1
    (qp) br.wtop.bwh.ph.dh target     // rotate and return to top
                                      // exit when qp = 0 and
                                      // ar.ec = 1
    (qp) br.wexit.bwh.ph.dh target    // rotate and fall through
                                      // exit when qp = 0 and
                                      // ar.ec = 1
where the valid values for bwh, ph, and dh are the same as those for the regular branch instructions.
The first two forms (br.ctop and br.cexit) are used for counted loops. We have already encountered 
the Itanium loop count register ar.lc. The values in this register, and in ar.ec, the epilog count register, 
are used to control register rotation and the program’s flow of execution.
During the prolog and kernel phases, the value of ar.lc will be greater than zero. In this case, the 
br.ctop and br.cexit instructions will decrement ar.lc, set predicate register p63 to one, and rotate 
all three register sets. So, for the next iteration, p16 will contain a one because p63 was rotated “into” p16.
Then, during the epilog phase, the value of ar.lc will be zero, and ar.ec will be greater than one. 
Here, the br.ctop and br.cexit instructions will decrement ar.ec, set predicate register p63 to zero, 
and rotate all three register sets. So, for the next iteration, p16 will contain a zero because p63 was rotated 
“into” p16.
Finally, at the end of the epilog, br.ctop will fall through instead of branching and br.cexit will 
branch to target rather than fall through.
The following schematic demonstrates register rotations for counted loops:
    Pipeline Phase    ar.lc          ar.ec          p63
    Prolog            decremented    unchanged      1
    Kernel            decremented    unchanged      1
    Epilog            0              decremented    0
Upon exit, ar.lc, ar.ec, and p63 will typically be zero.
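The behavior of the counted-loop branches can be summarized as a small state-transition function. The Python model below is a sketch based on the description above; the function names are illustrative, and the handling of the case in which ar.ec is already zero (no rotation, no branch) is an assumption that goes beyond the text.

```python
def br_ctop(lc, ec):
    """One execution of br.ctop: returns (lc, ec, p63, rotated, taken)."""
    if lc > 0:                       # prolog or kernel phase
        return lc - 1, ec, 1, True, True
    if ec > 1:                       # epilog phase
        return lc, ec - 1, 0, True, True
    if ec == 1:                      # end of the epilog: fall through
        return lc, 0, 0, True, False
    return lc, ec, 0, False, False   # assumed behavior when ar.ec is zero


def br_cexit(lc, ec):
    """br.cexit updates the same state but inverts the branch decision."""
    lc, ec, p63, rotated, taken = br_ctop(lc, ec)
    return lc, ec, p63, rotated, not taken


def count_traversals(lc, ec):
    """Number of times a loop closed by br.ctop executes its body."""
    traversals = 0
    while True:
        traversals += 1
        lc, ec, _p63, _rotated, taken = br_ctop(lc, ec)
        if not taken:
            return traversals
```

Starting from ar.lc = 2 and ar.ec = 7, `count_traversals(2, 7)` yields nine traversals, after which both counters are zero.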
The last two forms (br.wtop and br.wexit) are used for while loops. These instructions operate 
similarly to those for counted loops; however, the qualifying predicate register qp and the values in ar.ec 
control register rotation and the program’s flow of execution.
During the prolog and kernel phases, the value of qp will be one. In this case, the br.wtop and 
br.wexit instructions will set predicate register p63 to zero and rotate all three register sets. So, for 
the next iteration, p16 will contain a zero because p63 was rotated “into” p16.
After the value of qp becomes zero, but while ar.ec is greater than one (the epilog phase), the br.wtop 
and br.wexit instructions will decrement ar.ec, set predicate register p63 to zero, and rotate all three 
register sets. So, for the next iteration, p16 will contain a zero because p63 was rotated “into” p16.
Finally, at the end of the epilog, br.wtop will fall through instead of branching and br.wexit will 
branch to target rather than fall through.
The following schematic demonstrates register rotations for while loops:
    Pipeline Phase    qp    ar.ec          p63
    Prolog            1     unchanged      0
    Kernel            1     unchanged      0
    Epilog            0     decremented    0
Upon exit, qp, ar.ec, and p63 will typically be zero.
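The while-loop branch can be modeled in the same spirit. This Python sketch follows the description above; the function name is illustrative, and the behavior when ar.ec is already zero is again an assumption. Note that p63 is always written with zero, so the program itself must set the stage predicates that keep the pipeline running.

```python
def br_wtop(qp, ec):
    """One execution of br.wtop: returns (ec, p63, rotated, taken)."""
    if qp:                       # prolog or kernel: loop condition still holds
        return ec, 0, True, True
    if ec > 1:                   # epilog phase
        return ec - 1, 0, True, True
    if ec == 1:                  # end of the epilog: fall through
        return 0, 0, True, False
    return ec, 0, False, False   # assumed behavior when ar.ec is zero
```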
12.5.3 Writing a Software Pipelined Loop
To illustrate software pipelined loops, we again use (slightly modified) example code from Evans and Trimper. 
Like the DOTCLOOP program in Section 10, the code in Figure 7 (page 64) computes the scalar product of 
two vectors. However, unlike the previous version, this example uses the special branch instructions, rotating 
register sets, and qualifying predicates to modulo-schedule the program’s main loop.
This code is written for the Itanium 2 processor. While it will run on any Itanium implementation, it 
achieves optimal performance on this particular implementation because it accounts for instruction latencies 
and other factors that are specific to the Itanium 2 processor.
Evans and Trimper also include an implementation-independent version of this program (DOTCTOP) in 
Section 10.5.1 (pages 316-321); see their book for details.
Determining the rotating register sets. First, we begin by recognizing that an integer load (from the L1 
cache) has a minimum latency of two cycles when the data are to be operated on by the pmpy2.r instruction, 
which itself has a latency of 3 cycles. Now, we can outline the software pipeline for the program’s main loop, 
using a three-element vector as an example, as follows:
    Cycle   Stage 0   Stage 1   Stage 2   Stage 3   Stage 4   Stage 5   Stage 6
    0       ld2(1)
    1       ld2(2)
    2       ld2(3)              pmpy(1)
    3                           pmpy(2)
    4                           pmpy(3)
    5                                                         sxt(1)
    6                                                         sxt(2)    add(1)
    7                                                         sxt(3)    add(2)
    8                                                                   add(3)
where stage 1 is a no-op to account for the extra load latencies, and stages 3 and 4 are also no-ops, accounting 
for the latency of the pmpy2.r instruction used in the program.
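The stage numbers in this outline follow directly from the instruction latencies. The Python sketch below derives them; the two-cycle load and three-cycle pmpy2.r latencies come from the text, while the one-cycle latencies for sxt4 and add are assumptions inferred from the diagram. The function and constant names are illustrative.

```python
# (mnemonic, latency in cycles), listed in dependence order.
DOT_CHAIN = [("ld2", 2), ("pmpy2.r", 3), ("sxt4", 1), ("add", 1)]


def assign_stages(chain):
    """Place each instruction in the first stage where its input is ready."""
    stages = {}
    stage = 0
    for mnemonic, latency in chain:
        stages[mnemonic] = stage
        stage += latency        # the consumer must wait this many stages
    return stages, stage        # `stage` is now the total number of stages


def noop_stages(chain):
    """Stages that exist only to cover latency."""
    stages, total = assign_stages(chain)
    return sorted(set(range(total)) - set(stages.values()))
```

For `DOT_CHAIN` this yields stages 0, 2, 5, and 6 for the four instructions, seven stages in total, and no-op stages 1, 3, and 4.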
Using this pipeline, the elements of vector X are valid in stages 0-2; thus, three rotating registers are 
required and we use r32, r33, and r34. The products of the elements of vectors X and Y are valid in stages 2-5. By 
reallocating r34 as the destination in stage 2, only three more rotating registers are required; we use r35, 
r36, and r37. Results from the sign-extend instruction are valid in stages 5 and 6, but r37 can be reallocated 
as the destination in stage 5, so only one new register is required (r38). Finally, the elements of vector Y will 
occupy three additional registers; we use r39, r40, and r41. A total of ten rotating registers, r32-r41, 
are required.
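The register bookkeeping above reduces to simple arithmetic: a value written under one name in some stage is visible under a name that is higher by the number of stages elapsed. The Python sketch below checks the assignments in this paragraph; the table of lifetimes restates the text, and all names in the code are illustrative.

```python
def name_after(written_as, written_in_stage, read_in_stage):
    """Register number under which a rotating value is visible later on."""
    return written_as + (read_in_stage - written_in_stage)


# (value, name when written, stage written, last stage in which it is read)
LIFETIMES = [
    ("X element", 32, 0, 2),
    ("Y element", 39, 0, 2),
    ("product", 34, 2, 5),
    ("sign-extended product", 37, 5, 6),
]


def rotating_registers_needed(lifetimes, first=32):
    """Count registers from `first` up to the highest name any value reaches."""
    highest = max(name_after(reg, w, r) for _, reg, w, r in lifetimes)
    return highest - first + 1


def allocation_size(needed, group=8):
    """Round up to the group size required when allocating rotating registers."""
    return -(-needed // group) * group
```

Here `rotating_registers_needed(LIFETIMES)` gives ten, and rounding up to a multiple of eight gives an allocation of sixteen rotating registers.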
Now we consider the rotating predicate registers that will be used to control the flow of execution within 
the software pipeline. Note that at most two operational stages are “active” during any given execution cycle. 
Four predicate registers (p16, p18, p21, and p22) will be assigned to the operational pipeline stages, while 
p17, p19, and p20 correspond to the no-op stages.
The loop count register (ar.lc) should be zero at the cycle after which p63 will become zero, making 
p16 zero on the following cycle. This requirement typically implies that ar.lc should be set to one less than 
the number of traversals required by the loop. The epilog counter (ar.ec) should be set to seven because 
there are seven pipeline stages to be drained. The following schematic demonstrates this behavior:
    Cycle   ar.lc   ar.ec   p16   p17   p18   p19   p20   p21   p22
    0       2       7       1     0     0     0     0     0     0
    1       1       7       1     1     0     0     0     0     0
    2       0       7       1     1     1     0     0     0     0
    3       0       6       0     1     1     1     0     0     0
    4       0       5       0     0     1     1     1     0     0
    5       0       4       0     0     0     1     1     1     0
    6       0       3       0     0     0     0     1     1     1
    7       0       2       0     0     0     0     0     1     1
    8       0       1       0     0     0     0     0     0     1
Note that the loop counter should stop on the value zero, while the epilog counter should stop on the value 
one. In addition, the number of software pipeline cycles is ar.lc + ar.ec; here, 2 + 7 = 9.
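The schematic can be regenerated mechanically by rotating a list of stage predicates once per cycle. The Python sketch below is a simplified model: it tracks only the predicates p16 through p22 and treats the value written to p63 as arriving in p16 on the next cycle. The function name and the row layout are illustrative choices.

```python
def predicate_trace(traversals, stages):
    """Rows of (ar.lc, ar.ec, [p16, p17, ...]) as seen during each cycle."""
    lc, ec = traversals - 1, stages      # initial ar.lc and ar.ec
    preds = [1] + [0] * (stages - 1)     # p16 set, later stage predicates clear
    rows = []
    while True:
        rows.append((lc, ec, list(preds)))
        if lc > 0:
            lc, p63 = lc - 1, 1          # prolog or kernel
        else:
            ec, p63 = ec - 1, 0          # epilog
        if ec == 0:
            break                        # the loop-closing branch falls through
        preds = [p63] + preds[:-1]       # rotation: p63 -> p16, p16 -> p17, ...
    return rows
```

Calling `predicate_trace(3, 7)` returns nine rows, one per cycle, matching ar.lc + ar.ec = 2 + 7 = 9.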
Optimizing the instruction schedule. Next, consider the order of instructions that compose the main loop 
in the original DOTCLOOP:
top:
        ld2     r21=[r14],2
        ld2     r22=[r15],2;;
        pmpy2.r r21=r21,r22
        sxt4    r21=r21;;
        add     r20=r20,r21
        br.cloop.sptk.few top;;
Using this same order in the software pipelined version of the program, three instruction bundles are 
necessary:
top:
  (p16) ld2     r32=[r14],2         // M
  (p16) ld2     r39=[r15],2         // M
  (p17) pmpy2.r r34=r34,r41         // I

        nop.m                       // M
  (p18) sxt4    r37=r37             // I
  (p19) add     r20=r20,r38         // I

        nop.m                       // M
        nop.f                       // F
        br.ctop.sptk.few top;;      // B;;
You will note that some explicit stops have been eliminated, which is possible because the instructions in 
the software pipelined loop are predicated. Nevertheless, even if we ignore the delays of loading data from 
memory and the actual latencies of instructions on a specific implementation, then this schedule requires at 
least two cycles per iteration.
By taking advantage of the characteristics of Type A instructions, we can switch the order of the sxt4 and 
add instructions to condense the sequence into two instruction bundles:
top:
  (p16) ld2     r32=[r14],2         // M
  (p16) ld2     r39=[r15],2         // M
  (p17) pmpy2.r r34=r34,r41         // I
  (p19) add     r20=r20,r38         // M
  (p18) sxt4    r37=r37             // I
        br.ctop.sptk.few top;;      // B;;
resulting in a smaller code size and better instruction cache usage. In addition, execution time on an Itanium 2 
processor will be reduced because it has four M-units. Note that the explicit stop after the br.ctop will 
ensure the proper logical ordering from cycle to cycle.
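The effect of the reordering on bundling can be checked by grouping the unit letters from the listings above into three-slot bundles. The Python sketch below uses the ten template names that appear in Table 16 and ignores stop-bit variants; the function name and the error handling are illustrative assumptions.

```python
TEMPLATES = {"MII", "MLX", "MMI", "MFI", "MMF", "MIB", "MBB", "BBB", "MMB", "MFB"}


def bundle_templates(units):
    """Group a flat sequence of unit letters into three-slot bundle templates."""
    if len(units) % 3:
        raise ValueError("instructions must fill whole bundles")
    found = ["".join(units[i:i + 3]) for i in range(0, len(units), 3)]
    for template in found:
        if template not in TEMPLATES:
            raise ValueError(f"{template} is not a bundle template")
    return found


# Unit letters taken from the comments in the two listings.
THREE_BUNDLE_ORDER = "MMI" "MII" "MFB"   # includes nop.m, nop.m, and nop.f
TWO_BUNDLE_ORDER = "MMI" "MIB"           # add moved to an M slot
```

The original order needs the templates MMI, MII, and MFB, with three slots wasted on no-ops, whereas the reordered sequence fits exactly into MMI followed by MIB.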
An analysis of this optimized instruction schedule by Evans and Trimper shows that the minimum number 
of cycles is 13, while the original version requires 21 cycles, indicating a significant savings for the optimized, 
modulo-scheduled loop. (The details of this analysis can be found in Section 10.5.1, page 319 of Evans and 
Trimper.)
Modulo-scheduling and software pipelining are powerful optimization techniques, but they can be confusing. 
We recommend a careful review of these topics, particularly the DOTCTOP2 code in Figure 7.
// DOTCTOP2: Compute the scalar product of two vectors
        N       = 3                     // Declare a constant
        .data                           // Declare data section
        .align  8                       // Specify desired alignment
P:      .skip   8                       // To store the product
X:      data2   -1,+3,+5                // First vector of 16-bit values
Y:      data2   -2,-4,+6                // Second vector of 16-bit values
        .text                           // Declare code section
        .align  32                      // Specify desired alignment
        .global main                    // Mark mandatory 'main' program entry
        .proc   main
main:
        .prologue                       // Begin prologue section
        .save   ar.lc,r9
        mov     r9=ar.lc;;              // Save caller's ar.lc
        .body                           // Begin procedure 'main'
first:  alloc   r10=ar.pfs,0,16,0,16    // Allocate a new stack frame
        movl    r14=X                   // Gr14 = pointer to X
        movl    r15=Y                   // Gr15 = pointer to Y
        movl    r16=P                   // Gr16 = pointer to P
        mov     r20=0                   // r20 = scalar product
        mov     ar.lc=N-1               // Initialize ar.lc
        mov     ar.ec=7                 // Initialize ar.ec
        mov     pr.rot=0x10000;;        // Initialize predicates
top:
  (p16) ld2     r32=[r14],2             // Load element, increment pointer
  (p16) ld2     r39=[r15],2             // Load element, increment pointer
  (p18) pmpy2.r r34=r34,r41             // Multiply elements
  (p22) add     r20=r20,r38             // Update scalar product
  (p21) sxt4    r37=r37                 // Sign-extend result to 64 bits
        br.ctop.sptk.few top;;          // More elements to process?
        st8     [r16]=r20               // No, store the scalar product
done:   mov     ret0=0                  // Signal completion
        mov     ar.lc=r9                // Restore caller's ar.lc
        mov     ar.pfs=r10              // Restore caller's ar.pfs
        br.ret.sptk.many b0;;           // Return to command line
        .endp   main                    // End procedure 'main'
Figure 7: A slightly modified version of the DOTCTOP2 program from Evans and Trimper
13 Itanium Implementations
As of this writing, Intel has marketed two processor implementations of the Itanium architecture: the (original) 
Itanium processor and the Itanium 2 processor. In addition, the Hewlett-Packard Company (HP) has 
developed a software-based Itanium ISA simulator, called Ski. In the final section of this survey, we discuss 
the more relevant details of each implementation.
13.1 The Itanium-Family Processors
The first “market-ready” implementation of the Itanium architecture was, not surprisingly, Intel’s Itanium 
processor. The life-cycle of the original Itanium processor was relatively short, for a number of reasons that 
we do not discuss. Only one year later, Intel released the Itanium 2 processor, which includes a number of 
enhancements over the original implementation.
Table 14 (next page) compares many characteristics of the two Itanium processors. We elucidate only the 
important details, largely for comparative purposes.
13.1.1 Cache Hierarchy
The Itanium processors include three distinct levels in the cache hierarchy, differing in terms of size, speed, 
and physical location. The cache structure “closest” to the processor is the level 1 (L1) cache. The Itanium 
processors divide this cache into two regions, one for instructions (L1-I) and one for data (L1-D), following 
the Harvard memory architecture. The L1-I cache is read-only, while the L1-D cache is read-write. Each 
cache structure has a separate connection to the CPU.
In contrast, the level 2 (L2) and level 3 (L3) cache structures use a von Neumann memory architecture, 
where no distinction is made between instructions and data. The L3 cache, which is not a common feature of 
most contemporary architectures, functions as a backside cache, because it connects to the processor using a 
separate bus. The L3 cache monitors L2 activity and mimics the data access patterns while retaining a longer 
history of that activity due to its larger size.
Figure 8 shows the structure of the cache hierarchy for the Itanium processors.
In general, the cache structures of the Itanium processor are smaller and slower than those of the later 
Itanium 2 processor. Although the L3 cache of an Itanium processor can be larger than that of an Itanium 2, 
access times are slower because the cache is merely in-package, rather than on-chip as it is in the Itanium 2 
processor. The original Itanium processor exhibits fixed load latencies (ranging from 2-24 cycles, depending 




Figure 8: Structure of the cache hierarchy for the Itanium processors
Characteristic                     Itanium                 Itanium 2
Development code name              Merced                  McKinley
Year of market release             2001                    2002
Chip technology
  CPU speeds                       733 MHz, 800 MHz        900 MHz, 1.0 GHz, 1.3 GHz, 1.4 GHz, 1.5 GHz
  Process feature size             180 nm                  180 nm
  Number of transistors            25 × 10^6               222 × 10^6
  Number of layers                 6                       6
  Operating voltage                1.5 V                   1.5 V
  Power consumption                116-130 W               130 W
Processor features
  Physical stacked registers       96                      96
  RSE modes                        Enforced lazy           Enforced lazy
  Integer units                    2 M-units, 2 I-units    4 M-units, 2 I-units
  Memory units                     2 load, 2 store         2 load, 2 store
  Parallel floating-point units    2                       1

  Physical address bits            44                      50
  Virtual address bits             54                      64
  Data bus width                   64 bits                 128 bits
  Maximum page size                256 MB                  4 GB
System bus
  Speed                            266 MHz                 400 MHz
  Width                            64 bits                 128 bits
  Bandwidth                        2.1 GB/s                6.4 GB/s
L3 Cache
  Size                             2 MB, 4 MB              3 MB, 4 MB, 6 MB
  Location                         off-chip, in-package    on-chip
Table 14: Characteristics of the current Itanium processors
Benchmarks have shown that the Itanium 2, which exhibits several optimizations including those in the 
cache hierarchy, achieves approximately twice the instruction throughput of the original Itanium processor, 
even when running with clock speeds that are less than a factor of two faster.
Table 15 (next page) summarizes the relevant differences of the Itanium and Itanium 2 cache structures.
13.1.2 Execution Units and Issue Ports
Table 14 shows that the processors also differ in the number and kinds of execution units. The Itanium 2 adds 
two more M-units and two additional instruction issue ports. 
Processor    Cache level        Size                Line size    Latency
Itanium      L2                 96 KB               64 B         6 cycles (integer)
                                                                 9 cycles (floating-point)
             L3                 2 MB, 4 MB          64 B         21 cycles (integer)
                                                                 24 cycles (floating-point)

Itanium 2    L1                                                  1 cycle (instruction)
                                                                 1 cycle (integer)
             L2                 256 KB              128 B        5-9 cycles (integer)
                                                                 6-10 cycles (floating-point)
                                                                 7-11 cycles (instruction)
             L3                 3 MB, 4 MB, 6 MB    128 B        12-16 cycles (integer)
                                                                 13-17 cycles (floating-point)
                                                                 14-18 cycles (instruction)
             Physical memory    < 1 PB                           >100 cycles
Table 15: Characteristics of the Itanium and Itanium 2 cache structures
You will recall that Type A instructions can execute in either M- or I-units; the additional M-units raise 
the superscalar degree for these instructions to six. Furthermore, many more pairings of instruction bundle 
templates can be issued in a given clock cycle. Table 16, in which the row represents the first bundle of the 
pair and the column, the second, captures the possible dual-issue pairs for each processor.
       MII    MLX    MMI    MFI    MMF    MIB    MBB    BBB    MMB    MFB
MII    12     none   12     12     12     12     both   both   12     both
MLX    12     12     12     both   12     both   12     both   12     both
MMI    12     12     12     12     12     12     12     both   12     12
MFI    12     both   12     both   12     both   both   both   12     both
MMF    12     12     12     12     12     12     12     both   12     12
MIB    12     both   12     both   12     both   both   none   12     both
MBB    none   none   none   none   none   none   none   none   none   none
BBB    none   none   none   none   none   none   none   none   none   none
MMB    12     12     12     12     12     12     12     12     none   12
MFB    both   both   12     both   12     both   both   none   12     both
Table 16: Possible dual-issue instruction bundles for the Itanium and Itanium 2 processors
Note the significantly higher number of cells labeled “12”, indicating those pairings that only the Itanium 2 
processor supports. The additional M-units and issue ports significantly impact the degree of possible 
parallelism in the Itanium 2 processor.
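For use in a scheduling tool or a quick check, Table 16 can be transcribed into a small lookup structure. The Python sketch below is one possible encoding; the single-character codes, the function names, and the processor strings are assumptions made for illustration, and the data are copied from the table as printed.

```python
TEMPLATES = ["MII", "MLX", "MMI", "MFI", "MMF", "MIB", "MBB", "BBB", "MMB", "MFB"]

# Rows: first bundle. Columns: second bundle, in TEMPLATES order.
# "b" = both processors, "2" = Itanium 2 only, "-" = neither processor.
ROWS = {
    "MII": "2-2222bb2b",
    "MLX": "222b2b2b2b",
    "MMI": "2222222b22",
    "MFI": "2b2b2bbb2b",
    "MMF": "2222222b22",
    "MIB": "2b2b2bb-2b",
    "MBB": "----------",
    "BBB": "----------",
    "MMB": "22222222-2",
    "MFB": "bb2b2bb-2b",
}


def can_dual_issue(first, second, processor):
    """True if the pair (first, second) can issue together on `processor`."""
    cell = ROWS[first][TEMPLATES.index(second)]
    if cell == "b":
        return True
    return cell == "2" and processor == "Itanium 2"


def count_pairs(processor):
    """Number of template pairings the given processor can dual-issue."""
    return sum(can_dual_issue(a, b, processor)
               for a in TEMPLATES for b in TEMPLATES)
```

Comparing `count_pairs("Itanium")` with `count_pairs("Itanium 2")` quantifies the difference that the table shows cell by cell.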
13.1.3 Pipelines
Table 14 also shows that the implementations differ in the length of their execution pipelines. The pipeline 
stages of each processor are described in Table 17.
Processor    Stage   Mnemonic   Description
Itanium      1       IPG        Generate instruction pointer
             2       FET        Prefetch up to 6 instructions per cycle; predict branch direction
             3       ROT        Rotate instructions of current group into position; calculate branch address
             4       EXP        Issue up to 6 instructions through 9 ports
             5       REN        Rename registers
             6       WLD        Deliver data loaded from memory; requires latency of at least one cycle
             7       REG        Deliver data from the Gr, Fr, and Pr registers
             8       EXE        Execute operations
             9       DET        Detect exceptions; abandon results if predicate register was false
             10      WRB        Store results in Gr, Fr, and Pr registers, as necessary
Itanium 2    1       IPG        Generate instruction pointer
             2       ROT        Rotate instructions of current group into position
             3       EXP        Issue up to 6 instructions through 11 ports
             4       REN        Rename registers; decode instructions
             5       REG        Deliver data from the Gr, Fr, and Pr registers
             6       EXE        Execute operations
             7       DET        Detect exceptions; abandon results if predicate register was false;
                                correct mispredicted branches
             8       WRB        Store results in Gr, Fr, and Pr registers, as necessary
Table 17: The Itanium and Itanium 2 pipelines
In the original Itanium processor, each stage belongs to one of four larger phases: the front-end (IPG, 
FET, ROT); instruction delivery (EXP, REN); operand delivery (WLD, REG); and execution, with a change 
of machine state (EXE, DET, WRB).
However, in the Itanium 2 processor, there are only two larger phases: the front-end (IPG, ROT) and the 
back-end (EXP, REN, REG, EXE, DET, WRB).
It should be obvious that we have not attempted an exhaustive comparison of the two Itanium-family processors. 
Several more details of these processors are available from the Intel web site (http://www.intel.com/).
13.2 The Ski Simulator
HP maintains a freely available software-based Itanium ISA simulator, called Ski. The simulation environment 
executes on IA-32 architectures running the Linux operating system. Ski functionally simulates the 
Itanium instruction set architecture, not a specific processor implementation. As a result, it is extremely fast. 
However, the Ski simulator cannot be used to measure the actual performance of a simulated program because 
it does not simulate the micro-architectural characteristics of an Itanium implementation.
The HP Ski web site states that the simulator is well-suited for:
•  Itanium application development on non-native hardware,
•  Itanium compiler tuning,
•  operating system and firmware development for Itanium architectures, and
•  functional instruction verification of Itanium processor implementations.
The simulator provides two execution modes: system-mode, in which both the system-level and the application-level 
instructions are simulated, and user-mode, in which only the application-level instructions are simulated. 
While user-mode simulation is faster, it does not support many useful or necessary features, multi-threading 
being a prime example. All system calls are intercepted and translated into calls for the host machine’s 
operating system. In system-mode, applications execute along with an actual Itanium Linux kernel. For better 
simulation accuracy, system-mode execution is required.
HP also provides the Native User Environment (NUE), which emulates the IA-64 Linux environment 
and is intended for use with the Ski simulator. NUE provides the compiler, linker, assembler, libraries, and 
execution environment necessary to develop IA-64 Linux applications. We note that the compiler included 
with NUE is not an optimizing compiler, but such a compiler is available from SGI, Inc. See SGI’s web site 
(http://oss.sgi.com/projects/Pro64/) for more information.
We have found the Ski simulator and NUE to be extremely useful for porting existing applications to the 
Itanium architecture. Information about obtaining and installing the Ski and NUE software can be found at 
HP’s IA-64 Linux Developer Tools web site (http://www.software.hp.com/products/LIA64/).
Bibliography and Additional Resources
Evans, James S. and Gregory L. Trimper, Itanium Architecture for Programmers, Upper Saddle River, NJ: 
Prentice Hall, Inc., 2003.
Gerber, Richard, The Software Optimization Cookbook, Hillsboro, OR: Intel Press, 2002.
Hennessy, John L. and David A. Patterson, Computer Architecture: A Quantitative Approach, 3rd ed. San 
Francisco, CA: Morgan Kaufmann Publishers, Inc., 2002.
Hennessy, John L. and David A. Patterson, Computer Organization and Design: The Hardware/Software 
Interface, 2nd ed. San Francisco, CA: Morgan Kaufmann Publishers, Inc., 1997.
Hewlett-Packard Company, “HP IA-64 Linux Simulator (Ski)”, IA-64 Linux Developer’s Kit web site. 
Available at http://www.software.hp.com/products/LIA64/overview1a.htm.
Hewlett-Packard Company, “IA-64 Linux Developer’s Kit (CD)”, IA-64 Linux Developer’s Kit web site. 
Available at http://www.software.hp.com/products/LIA64/overview4a.htm.
Hewlett-Packard Company, “IA-64 Root File System”, IA-64 Linux Developer’s Kit web site.
Available at http://www.software.hp.com/products/LIA64/overview3a.htm.
Hewlett-Packard Company, “Native User Environment (NUE)”, IA-64 Linux Developer’s Kit web site. 
Available at http://www.software.hp.com/products/LIA64/overview2a.htm.
Intel Corporation, “Application Architecture”, revision 2.1, Intel Itanium Architecture Software Developer’s 
Manual, Vol. 1, 2002.
Intel Corporation, “Instruction Set Reference”, revision 2.1, Intel Itanium Architecture Software Developer’s 
Manual, Vol. 3, 2002.
Intel Corporation, “System Architecture”, revision 2.1, Intel Itanium Architecture Software Developer’s 
Manual, Vol. 2, 2002.
Intel Corporation, Itanium 2 Processor Reference Manual for Software Development and Optimization, 
2003.
Intel Corporation, Itanium Architecture Assembler User’s Guide, 2000.
Markstein, Peter, IA-64 and Elementary Functions: Speed and Precision. Upper Saddle River, NJ: Prentice 
Hall PTR, 2000.
Rau, B. Ramakrishna, “Dynamic Scheduling Techniques for VLIW Processors,” HP Technical Report HPL-
93-52 (June 1993). Available at http://www.hpl.hp.com/techreports/.
Rau, B. Ramakrishna, and Joseph A. Fisher, “Instruction-Level Parallel Processing: History, Overview, and 
Perspective,” HP Technical Report HPL-92-132 (October 1992). Available at 
http://www.hpl.hp.com/techreports/.
SGI, Incorporated, “Pro64”, Developer Central Open Source web site. Available at 
http://oss.sgi.com/projects/Pro64/.
