An evaluation of Rockwell's Advanced Architecture Microprocessor for digital signal processing applications by Albin, Kenneth Lee.
AN EVALUATION OF ROCKWELL'S
ADVANCED ARCHITECTURE MICROPROCESSOR
FOR DIGITAL SIGNAL PROCESSING APPLICATIONS
by
KENNETH LEE ALBIN
B. S. , Kansas State University, 1981
A MASTER'S THESIS
submitted in partial fulfillment of the
requirements for the degree
MASTER OF SCIENCE
Department of Electrical and Computer Engineering
KANSAS STATE UNIVERSITY
Manhattan, Kansas
1984
Approved by:
TABLE OF CONTENTS
Introduction
Features of the AAMP 3
Software environment 3
Process Stack 9
Executive Process 14
Event handling 19
Evaluation procedure 21
Code used for evaluation 21
Widrow algorithm modification 28
Optimization for the AAMP 32
Compiler optimization 43
Performance measurements 47
Results and Conclusions 63
Acknowledgements 68
References 69
Appendices 70
Appendix A : Hand compiled listings 70
Notes on hand compiled listings 70
Standard Widrow listing 73
Standard Lattice listing 79
Modified Widrow listing 84
Optimized Widrow listing 89
Optimized Lattice listing 94
Appendix B : Ada-subset listings 99
Notes on Ada-subset compiler 99
Widrow fixed-point listings 100
Widrow floating-point listings 107
Lattice fixed-point listings 111
Lattice floating-point listings 116
ADATESTS fixed-point listings 120
ADATESTS floating-point listings 127
LIST OF FIGURES
Figure 1 AAMP addressing modes 6
Figure 2 Process stack during procedure call 10
Figure 3 Process stack and linkages 11
Figure 4 User PSD and process stack 17
Figure 5 Executive and User data structures 18
Figure 6 Widrow algorithm 22
Figure 7 Lattice algorithm 23
Figure 8 Weight updating process 30
Figure 9 Sample and error array updating scheme 31
Figure 10 Sample and weight pair updating 31
Figure 11 Synchronization timing diagrams 54
Figure 12 AAMP Timing worksheet 59
Figure 13 Execution rate quick estimate worksheet 60
LIST OF TABLES
Table 1 AAMP data types 7
Table 2 AAMP arithmetic operations 8
Table 3 AAMP Widrow algorithm execution rates 26
Table 4 AAMP Lattice algorithm execution rates .... 26
Table 5 Stack-updating vs other operations 33
Table 6 Opcode to stack update mapping 34
Table 7 Non-optimum stack-updating instructions 35
Table 8 Microcycles required by the AAMP 48
Table 9 Estimated vs measured execution rates 52
Table 10 Widrow compilation differences 62
Table 11 Lattice compilation differences 62
Table 12 Widrow implementation comparisons 64
Table 13 Lattice implementation comparisons 65
Introduction
With the advent of low-cost and relatively high performance
microprocessors, digital signal processing has found application
in a wide variety of fields. One such application is the use of
adaptive linear prediction in intruder detection devices. These
algorithms reduce false alarms by adapting to correlated
background noise and passing only intruder signals. Many
processors are capable of performing these algorithms in real-
time, but few of these have the low power requirements desirable
for field applications. The Electrical and Computer Engineering
Department at Kansas State University, in conjunction with Sandia
National Laboratories, has attempted to identify processors
which are most appropriate for such use. The ideal processor
would require very little power, be easy to interface, perform
multiplications very quickly and use floating-point arithmetic.
Processors which have been previously evaluated include the Zilog
Z80, Intel 8748, RCA ATMAC [1] and National NSC800 [2,3]. These
processors were successful to varying degrees, but still left
much room for improvement.
In the winter of 1983-1984, a processor that satisfies the
above criteria became available for evaluation. This
microprocessor, the Advanced Architecture Microprocessor (AAMP),
was designed by Rockwell-Collins in Cedar Rapids, Iowa and is
produced by Rockwell in Anaheim, California. It is a CMOS/SOS
microprocessor that has a stack architecture with a 16-bit wide
data path. Single and double precision integer and fractional as
well as single and extended precision floating-point data types
are supported on a single chip. It consumes approximately 50 mW
at its rated 20 MHz clock rate and uses a single 5 volt supply.
The purpose of this thesis is to examine the architecture of
the AAMP and attempt to estimate performance on signal processing
algorithms. Special attention is paid to both strong points and
bottlenecks of the processor. The relative efficiency that can be
achieved with high-level languages is also investigated.
The remainder of this thesis consists of three parts. The
first part is an introduction to the AAMP's architecture,
instruction set and data structures. This description is not
exhaustive but seeks to highlight the processor's properties
which are significant to the evaluation at hand and to supplement
the detailed treatment available from Rockwell-Collins. The
second part details the investigation and findings from the
evaluation. Included in this section is a discussion of ways to
optimize the Widrow and Lattice algorithms for the processor's
architecture. The third part contains the results and
conclusions of the evaluation in a concise form.
Gary Mauersberger is currently completing a hardware
oriented evaluation of the AAMP which includes the development of
a minimal system. The hardware evaluation combined with this
thesis should provide a comprehensive view of the AAMP and form a
basis for future comparisons of microprocessors.
Features of the AAMP
The purpose of this chapter is to provide an introduction to
the architecture and capabilities of the AAMP. The discussion is
directed toward an Electrical Engineer with a limited knowledge
of computer run-time structures. A concise but detailed
description can also be found in the August 1982 issue of IEEE
Micro [4]; a very detailed description is contained in a document
from Collins-Rockwell titled AAMP, CAPS-7 and CAPS-10 INSTRUCTION
SET ARCHITECTURE [5].
Software environment
The primary run-time structure found in the AAMP is the
process stack. This process stack contains the environment of
the currently active procedure and the status of procedures that
were suspended in the calling process. This will be discussed in
more detail below.
Because the AAMP has nearly a pure stack architecture, that
is, it has no user-accessible registers, nearly all of its
instructions fall into four main categories:
1) Memory reference instructions which place the contents of
the specified memory location on the top of the stack. Also,
literal instructions which place constants on the top of the
stack.
2) Operators which perform actions on operands which reside
on the top of the stack, deleting the operands and placing the
result on the top of stack.
3) Memory assignment instructions which remove data from the
top of the stack and place them in the specified memory location.
4) Control instructions such as SKIP, CALL and RETURN which
affect the sequence in which instructions are executed.
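The four categories can be illustrated with a toy stack machine. The following is a minimal sketch in Python and not AAMP code: the mnemonics LIT, REF, ADD, MUL, ASSIGN and SKIP, the `run` function and the dictionary used as memory are all hypothetical stand-ins chosen for illustration.

```python
def run(program, memory):
    """Execute (mnemonic, operand) pairs on a toy stack machine.

    The mnemonics are hypothetical; they only mirror the four
    instruction categories described in the text.
    """
    stack = []
    pc = 0
    while pc < len(program):
        op, arg = program[pc]
        pc += 1
        if op == "LIT":        # category 1: literal placed on top of stack
            stack.append(arg)
        elif op == "REF":      # category 1: memory reference
            stack.append(memory[arg])
        elif op == "ADD":      # category 2: operands removed, result pushed
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif op == "MUL":      # category 2
            b = stack.pop()
            a = stack.pop()
            stack.append(a * b)
        elif op == "ASSIGN":   # category 3: top of stack stored to memory
            memory[arg] = stack.pop()
        elif op == "SKIP":     # category 4: control, skip `arg` instructions
            pc += arg
        else:
            raise ValueError("unknown mnemonic: " + op)
    return stack


# z = x * y + 5 written in stack form
memory = {"x": 3, "y": 4, "z": 0}
program = [
    ("REF", "x"),
    ("REF", "y"),
    ("MUL", None),
    ("LIT", 5),
    ("ADD", None),
    ("ASSIGN", "z"),
]
leftover = run(program, memory)
```

Note that the expression needs no register names at all: every operator finds its operands on top of the stack, which is why the instructions can be so short.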
The AAMP uses a 24 bit address word to select 16 bit memory
words. Since all AAMP opcodes are 8 bits long, the 16 bit word
containing the opcode byte is read and a 25th bit is used
internally to select the proper byte. Constructing the 24 bit
address by concatenating the top two stack locations is known
as the Universal addressing mode (see Figure 1a).
Because the data path is only 16 bits wide, it becomes
awkward to specify the full address. In order to increase
efficiency, the Global addressing form specifies the least
significant 16 bits and automatically uses the upper address bits
specified when the procedure started. The 8 most significant
address bits for data constitute the Data Environment (DENV).
The Code Environment (CENV) consists of the 9 most significant
address bits for the area of memory containing the opcodes. The
16 least significant address bits are specified by the word on
the top of the stack (see Figure 1b) or by two immediate bytes
following the opcode. Note that the DENV and CENV can both refer
to the same area if desired.
A third form of addressing is yet more efficient and is used
to reference variables local to the current procedure (see Figure
1c). A reference or assignment using Local addressing can
specify any of 16 locations in a single byte opcode or any of 256
locations using a one byte opcode with an immediate byte. Local
addressing is also very useful because of the nature of block
structured languages and their emphasis on local variables.
Instructions are available to provide the absolute address of a
local storage location in the current procedure (LOCL) and in
calling procedures (LOCNL).
Finally, memory may be accessed through an Indexed
addressing mode where the index into the array is contained in
the stack and the base address of the array is either on top of
the stack or in an immediate word following the opcode. The
array base and index are used to calculate the address of the
element, taking into account the data type specified in the
instruction (see Figure 1d). Another addressing mode is the
Constant Offset form which is essentially the same as the Indexed
immediate mode with the offset in an immediate byte and the array
base on the top of the stack. The calculation of the element's
address consists of adding the base and the offset together
without taking into account the data type being accessed as the
Indexed mode does.
Each addressing mode discussed above can be used to access
single (16-bit) words and double (32-bit) or triple (48-bit)
words stored in the form of consecutive 16-bit memory locations.
Also, a byte indexed mode is available wherein a byte offset is
added to a base (both of which must be on the stack) to access a
byte.
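The address formation for the word addressing modes can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the placement of the upper 8 address bits in the low byte of the second stack word follows the example values of Figure 1, and the scaling of the index by the element length in words is an assumption about how the data type is "taken into account" in the Indexed mode.

```python
def universal_address(tos, next_word):
    """24-bit address from two stack words (Universal mode).

    TOS holds the low 16 bits; the low byte of the next word
    holds the upper 8 bits.
    """
    return ((next_word & 0xFF) << 16) | (tos & 0xFFFF)


def global_address(denv, low16):
    """24-bit address from the Data Environment and a 16-bit word (Global mode)."""
    return ((denv & 0xFF) << 16) | (low16 & 0xFFFF)


# ASSUMPTION: the index is scaled by the element length in 16-bit words.
WORDS_PER_ELEMENT = {"single": 1, "double": 2, "triple": 3}


def indexed_address(denv, base, index, data_type="single"):
    """Element address for the Indexed mode."""
    return global_address(denv, base + index * WORDS_PER_ELEMENT[data_type])


def constant_offset_address(denv, base, offset):
    """Constant Offset mode: base plus offset, with no data type scaling."""
    return global_address(denv, base + offset)


# Example values taken from Figure 1 (hexadecimal)
print(hex(universal_address(0xBBBB, 0x00AA)))
print(hex(global_address(0xCC, 0xBBBB)))
print(hex(indexed_address(0xCC, 0x1000, 0x0040)))
```

The Global form saves one stack word per reference, which matters on a machine whose data path is only 16 bits wide.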
a) REFSD: reference single word with Universal addressing mode.
      before:  TOS-> [ BBBB ]          after:  TOS-> [ 7777 ]
                     [ xxAA ]
      memory:  address AABBBB contains [ 7777 ]

b) REFS: reference single word with Global addressing mode.
      before:  TOS-> [ BBBB ]          after:  TOS-> [ 6666 ]
      DENV = xxCC
      memory:  address CCBBBB contains [ 6666 ]

c) REFSL.0: reference single Local from location 0.
      before:  LENV-> [ 5555 ], the first Local storage location, which
               lies below the stack mark (SPCR, CENV, PROCID, LENV) and
               above the stack frame of the calling procedure
      after:   TOS-> [ 5555 ] on the accumulator stack; the stack mark,
               Local storage and calling procedure's stack frame are
               unchanged

d) REFSX: reference single word indexed.
      before:  TOS-> [ 1000 ]          after:  TOS-> [ 4444 ]
                     [ 0040 ]
      DENV = xxCC
      memory:  address CC1040 contains [ 4444 ]

Figure 1. AAMP Addressing Modes
The three word-lengths correspond to the data types
supported in the instruction set with arithmetic and conversion
functions. The data types are shown in Table 1.
Table 1. AAMP data types

Data type        Precision   Length    Notation used
Boolean                      16 bits   0 = FALSE, else TRUE
Integer          Single      16 bits   Two's complement
Integer          Double      32 bits   Two's complement
Fractional       Single      16 bits   Two's complement, msb = 2^-1
Fractional       Double      32 bits   Two's complement, msb = 2^-1
Floating-point   Single      32 bits   Signed, hidden-bit, 8 bit XS128 exponent
Floating-point   Extended    48 bits   Signed, hidden-bit, 8 bit XS128 exponent
In the floating-point notation, the mantissa is represented
in a positive normalized form with the sign bit and an assumed
binary point to the left of the most significant bit. Since a
properly aligned floating-point number (the AAMP automatically
handles alignment) will have a one for the most significant bit
(except in the case of zero), the bit can be omitted. The
representation of zero is defined to be the case where the sign,
mantissa and exponent are all zero. The exponent is represented
in excess 128 form in the least significant byte. Extended
precision floating-point numbers are the same except for 16
additional bits of precision in the mantissa.
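As an illustration of this notation, the sketch below decodes a single precision floating-point value held in a 32-bit integer. The exact bit positions are an assumption made for the example (sign in the most significant bit, 23 stored mantissa bits below it, exponent in the least significant byte); the text above fixes only the hidden bit, the binary point to the left of the mantissa and the excess 128 exponent in the low byte. The function name is hypothetical.

```python
def decode_single(word32):
    """Decode a 32-bit value in the floating-point notation described above.

    ASSUMED layout: bit 31 = sign, bits 30..8 = stored mantissa bits
    (leading one omitted), bits 7..0 = excess-128 exponent.
    """
    if word32 == 0:
        return 0.0                         # sign, mantissa and exponent all zero
    sign = -1.0 if (word32 >> 31) & 1 else 1.0
    stored = (word32 >> 8) & 0x7FFFFF      # 23 stored mantissa bits
    exponent = (word32 & 0xFF) - 128       # remove the excess-128 bias
    # Restore the hidden bit; binary point is left of the msb, so 0.5 <= m < 1
    mantissa = ((1 << 23) | stored) / float(1 << 24)
    return sign * mantissa * 2.0 ** exponent


print(decode_single(0x00000081))   # mantissa 0.5, exponent +1
print(decode_single(0x40000080))   # mantissa 0.75, exponent 0
```

Because the mantissa is always at least one half for a nonzero number, the all-zero pattern is free to represent zero, as the text notes.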
The six arithmetic operations available for each of the
above non-boolean data types and their execution times are shown
in Table 2. Other instructions perform boolean (AND, OR, NOT and
XOR) and numeric data type conversion operations.
Table 2. Arithmetic operations and execution times
(all times in microseconds)

                    Fixed-point              Floating-point
Operation           single      double       single      extended
                    precision   precision    precision   precision
Addition             0.55        0.75         7.55       11.35
Subtraction          0.55        0.75         8.65       12.25
Multiplication       4.75       14.95        19.15       30.25
Division             5.55       15.75        19.75       34.65
Negation             0.55        0.75         0.75        0.95
Absolute value       0.75        0.85         0.35        0.55
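These figures allow a quick lower bound on the arithmetic cost of a signal processing kernel. The sketch below, using only the single precision addition and multiplication times from Table 2, estimates the arithmetic time of a 16-term sum of products. The function and dictionary names are hypothetical, and memory references, stack manipulation and loop control are deliberately left out, so the real execution time will be higher.

```python
# Execution times in microseconds, copied from Table 2
TIMES_US = {
    "fixed_single": {"add": 0.55, "mul": 4.75},
    "float_single": {"add": 7.55, "mul": 19.15},
}


def sum_of_products_time(n_terms, fmt):
    """Arithmetic-only time for n_terms multiply-accumulate steps."""
    t = TIMES_US[fmt]
    return n_terms * (t["mul"] + t["add"])


fixed_us = sum_of_products_time(16, "fixed_single")
float_us = sum_of_products_time(16, "float_single")
print(round(fixed_us, 2), round(float_us, 2))
```

For 16 terms this gives roughly 85 microseconds in fixed-point and 427 microseconds in floating-point, a ratio of about five to one from arithmetic alone.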
The preceding paragraphs have briefly described the primary
data types available and the instructions to manipulate them.
The procedures doing the manipulations are rooted in a process
stack which is dedicated to a particular task. This means that a
task's procedures, functions and subroutines and their associated
local variables, accumulator stack and parameters are all
contained in the stack. The AAMP supports the concept of task
concurrency, that is, having multiple independent process stacks.
An executive stack initializes the system on reset and provides
the means for transferring control between tasks.
Process Stack
The process stack for a task has an active stack frame for
the currently active procedure on the top and its calling
procedures' stack frames below it in the calling order. When a
procedure is called, a new stack frame is set up on top of the
current one and becomes active. When the procedure ends, the new
stack frame is deactivated and, in effect, becomes lost as the
previous stack frame becomes active. This is illustrated in
Figure 2. Each of these stack frames consists of three main
areas: 1) the accumulator stack, 2) the local storage area and
3) the stack mark.
A procedure's accumulator stack is the area on the top of
the stack where nearly all operations on data are performed.
This area is the logical equivalent of registers in conventional
architecture microprocessors. The accumulator stack is initially
empty but grows as literal and reference instructions place data
on it and shrinks as words of data are removed by operations and
assignment instructions. If the current procedure calls another
procedure, the accumulator stack is left unused under the new
stack frame until the new stack frame is removed when the new
procedure returns.
Each panel lists the process stack from top to bottom.

a) Proc. A executing:
      TOS->   Proc. A accumulator stack
              Proc. A stack mark
      LENV->  Proc. A Local storage
              previous stack frames

b) Proc. B executing:
      TOS->   Proc. B stack mark
      LENV->  Proc. B Local storage
              Proc. A accumulator stack
              Proc. A stack mark
              Proc. A Local storage
              previous stack frames

c) Proc. A executing:
      TOS->   Proc. A accumulator stack
              Proc. A stack mark
      LENV->  Proc. A Local storage
              previous stack frames

Figure 2. Process stack for Procedure A, after Procedure A has
called Procedure B, and after Procedure B has returned.
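Process stack (top to bottom):
   Proc. B active stack frame:
      DENV,TOS->   [ SPCR ] [ CENV ] [PROCID] [ LENV ]   (stack mark)
      DENV,LENV->  [ loc0 ] [ loc1 ] [ loc2 ] ... [locN-1]
   Proc. A suspended stack frame:
      [ acc ] [ acc ] [ acc ]                            (accumulator stack)
      [ SPCR ] [ CENV ] [PROCID] [ LENV ]                (stack mark)
      [ loc0 ] [ loc1 ] ... [locM-1]
Code:
   Procedure B: header [ N ] followed by code
   Procedure A: header [ M ] followed by code containing [call B ]
Linkages from the stack mark of Procedure B:
   CENV and SPCR -> instruction following [call B ] in Procedure A
   PROCID        -> header of Procedure A
   LENV          -> loc0 of Procedure A

Figure 3. Process stack and linkages after Procedure A calls
Procedure B.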
The Local storage area is the area of the stack frame below
both the accumulator stack and stack mark. This area is used for
variables needed for the procedure associated with the stack
frame. This local variable area has four advantages: 1) the
quickest access times, 2) freedom from side-effects from other
procedures, 3) space is reclaimed automatically when the
procedure ends and 4) independence from a particular location in
memory or calling order. Local variable locations are created
when the stack frame is set up by leaving unused a specified
number of stack locations between the calling procedure's
accumulator stack and the stack mark of the stack frame being
created. This is shown in Figure 2.
The stack mark is the linkage between a procedure and its
calling procedure as shown in Figure 3. Recorded in the stack
mark is the calling procedure's Syllable (byte) Program Counter
Register (SPCR), Code Environment (CENV), Procedure ID (PROCID)
and Local Environment (LENV). The Code Environment is
concatenated with the Syllable Program Counter Register to form
the byte address of the instruction of the calling procedure
which is to be executed upon return from the called procedure.
The Procedure ID is an identification number for the calling
procedure which happens to be the byte address of the header of
its code body. The Local Environment is a pointer to the
location of the first Local storage location of the calling
procedure. These four words of data give the processor
information it needs to restart the calling procedure when the
called procedure ends.
At this point, it is appropriate to ask what makes up a
procedure. A procedure is a body of code with a header at
the location given by PROCID. This single word header defines
the number of words of storage to allocate for Local variables
between the calling procedure's accumulator stack and the new
stack frame's stack mark. The least significant byte of the word
following the header contains the first opcode to be executed in
the new procedure. Each time a procedure is called, a stack
frame is created to be associated with the procedure. Each time
a procedure is exited, the stack frame associated with that
activation is discarded. Thus, as long as Universal or Global
references are not used, the procedure may be called by different
procedures, by itself or even by different tasks and work well,
free from unwanted side-effects. Procedures are therefore
recursive with the above qualifications.
The calling sequence has been described above, but there is
one more detail: argument passing. To pass arguments to a
called procedure, the arguments are simply placed on top of the
calling procedure's accumulator stack before the CALL instruction
is executed. Since these arguments (and the rest of the calling
procedure's accumulator stack) are just below the called
procedure's Local variables, the called procedure can access them
using the Local addressing mode. The number of Local variables
and their relative locations are assigned and incorporated into
the procedure's header and instructions at the time the code is
compiled.
When a RETURN is executed, the top of the accumulator stack
must contain a number. This number tells the processor how many
storage locations below the stack mark to "deallocate." The
locations deallocated can include the called procedure's Local
storage, passed arguments and locations on the calling
procedure's accumulator stack. The called procedure's stack
mark is used to restore the processor state and is then discarded
along with the indicated number of local variables and calling
procedure arguments. Any locations between the called
procedure's stack mark and the deallocation number are considered
to be arguments to be returned and are copied onto the newly
determined top of the calling procedure's stack. Note that
parameters can also be returned if they reside in the local
storage locations immediately adjacent to calling procedure's
accumulator stack. The number of locations to be deallocated
would simply be the total number of local storage locations less
the number of locations to be left on the stack.
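The call and return sequence just described can be modelled in a few lines. The sketch below is a simplified illustration and not a description of the actual microcode: the class and method names are hypothetical, the stack is a Python list whose last element is the top of stack, and the ordering of words inside the stack mark and of the Local locations is an assumption based on Figure 3. Local location 0 is taken to sit immediately below the stack mark, so passed arguments appear as Local locations N, N+1, and so on.

```python
class ProcessStack:
    """Toy model of AAMP stack frames (layout details are assumptions)."""

    MARK_WORDS = 4  # saved LENV, PROCID, CENV and SPCR of the caller

    def __init__(self):
        self.words = []   # index 0 is the bottom, the last element is TOS
        self.lenv = -1    # index of Local location 0 of the active procedure

    def push(self, word):
        self.words.append(word)

    def pop(self):
        return self.words.pop()

    def local(self, i):
        """Read Local location i; i >= N reaches the passed arguments."""
        return self.words[self.lenv - i]

    def call(self, n_locals, procid=0, cenv=0, spcr=0):
        """Allocate n_locals words, then write the stack mark above them."""
        self.words.extend([0] * n_locals)
        new_lenv = len(self.words) - 1
        self.words.extend([self.lenv, procid, cenv, spcr])
        self.lenv = new_lenv

    def ret(self):
        """RETURN: TOS holds the number of words to deallocate below the mark."""
        count = self.pop()
        mark_start = self.lenv + 1
        results = self.words[mark_start + self.MARK_WORDS:]
        saved_lenv, procid, cenv, spcr = self.words[
            mark_start:mark_start + self.MARK_WORDS]
        del self.words[mark_start - count:]   # drop mark, locals and arguments
        self.words.extend(results)            # copy returned values to new TOS
        self.lenv = saved_lenv
        return procid, cenv, spcr


# Caller passes 3 and 4; callee (1 local) returns their sum.
stack = ProcessStack()
stack.push(99)                      # unrelated word on the caller's accumulator
stack.push(3)
stack.push(4)
stack.call(n_locals=1, procid=0x100, spcr=0x20)
stack.push(stack.local(1) + stack.local(2))   # arguments seen as locals 1 and 2
stack.push(3)                       # deallocate 1 local + 2 arguments
resume = stack.ret()
```

After the return the caller finds only the result in place of its two arguments, and the saved PROCID, CENV and SPCR tell the processor where to resume.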
Executive process
In a system that may have multiple process stacks, the
mechanism which organizes the transfer of control between
processes is the Executive process. The Executive process begins
execution on reset through use of the Executive Entry Table. The
Executive Entry Table is located at memory addresses 0-8 and
contains information in three categories: 1) a Continuation
Status Pointer, 2) initialization information and 3) PROCIDs for
procedures handling special events that might arise.
When the processor is reset, there must be some way for it
to tell if it is starting cold or if it was executing a procedure
which needs to continue. The Continuation Status Pointer at
location 0 contains the address of a memory location. If the
memory location pointed to contains zero, it indicates that
initialization should take place upon reset. Nonzero contents
indicate that the processor was interrupted in the middle of some
process, the status of which has been preserved and may now be
recovered to resume execution. Note that a pointer was used
because the Executive Entry Table will nearly always be located
in ROM and the indicated location in RAM. If the processor is
always to be initialized on reset, a zero in location 0 will
ensure this.
The three pieces of data in the Executive Entry Table used
in initialization are the Initial Stack Limit, Initial Top of
Stack and the Initial PROCID. The first two elements define the
location and extent of the Executive stack and the third
element gives the location of the instructions needed to perform
initialization. In addition, the processor automatically sets
LENV = DENV = CENV = 0 and disables interrupts. The resulting
processor state is known as the Initialization State.
If a suspended process is to be restarted, the conditions
which existed before interruption must be recovered from the
process's Processor State Descriptor (PSD). For the Executive
process, this PSD is written out just before the processor halts.
This halting can occur from executing the HALT instruction or
from any of a number of error conditions that have been trapped.
Recorded in this Executive PSD are the contents of internal
registers that make up the processor state: Stack Limit (SKLM),
Top of Stack (TOS), LENV, DENV, SPCR and CENV. In addition, the
interrupt enable flip-flop status and an error code giving the
reason the processor was halted are provided. This dumping of
the processor status happens just below the Initial Executive Top
of Stack (the base of the Executive stack). The processor can be
restarted only if the error code indicates that the stoppage was
caused by the HALT instruction. No other errors can be corrected
by the processor (bus failure, etc.) and all are considered
fatal.
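The decision made on reset can be summarized in a short sketch. It is illustrative only: memory is modelled as a Python list of words, the table offsets 1 to 3 are read from the order of entries in Figure 5, the initial LENV, DENV and CENV are taken to be zero, and `HALT_ERROR_CODE` is a hypothetical placeholder because the numeric error codes are not given here.

```python
HALT_ERROR_CODE = 0   # hypothetical value; the real code is not given in the text


def reset_action(memory, exec_error_code=None):
    """Decide what the processor does on reset.

    memory[0] is the Continuation Status Pointer; entries 1..3 of the
    Executive Entry Table are assumed to be the Initial Stack Limit,
    Initial Top of Stack and Initial PROCID.
    """
    status_address = memory[0]
    if memory[status_address] == 0:
        # Cold start: enter the Initialization State
        return {
            "action": "initialize",
            "SKLM": memory[1],
            "TOS": memory[2],
            "PROCID": memory[3],
            "LENV": 0, "DENV": 0, "CENV": 0,
            "interrupts_enabled": False,
        }
    if exec_error_code == HALT_ERROR_CODE:
        # State was saved by a HALT instruction: recover it from the Executive PSD
        return {"action": "continue"}
    # Any other recorded error is considered fatal
    return {"action": "fatal"}


memory = [0] * 32
memory[0] = 16                      # pointer to the continuation status word
memory[1], memory[2], memory[3] = 0x200, 0x2FF, 0x40
cold = reset_action(memory)
memory[16] = 1                      # nonzero: a process was interrupted
warm = reset_action(memory, HALT_ERROR_CODE)
```

The indirection through location 0 is what lets the table itself live in ROM while the status word it points to lives in RAM.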
Once the Executive process has been started, it may then
call other procedures and perform operations on the Executive
stack. This single task system is the simplest configuration.
If multi-tasking is to take place, the Executive task must take
the responsibility for scheduling the tasks and initiating their
execution. Each User task (any task except the Executive task)
has its own PSD. This PSD contains the processor status (SKLM,
TOS, LENV, DENV, SPCR and CENV) of the task when execution was
stopped or the initial status if it has not yet been executed.
In addition, the PROCID and CENV for both the task and exception
handling routines are recorded. The User PSD and its
relationship with its process stack is shown in Figure 4.
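User PSD:
   [ SKLM         ]
   [ TOS          ]  -> top of the active stack frame
   [ LENV         ]  -> loc0 of the active stack frame
   [ DENV         ]
   [ SPCR         ]  \  together locate the next instruction
   [ CENV         ]  /  in the code
   [ TASK.PROCID  ]  \  together locate the task's
   [ TASK.CENV    ]  /  procedure
   [ EXCEP.PROCID ]  \  together locate the header of the
   [ EXCEP.CENV   ]  /  Exception Procedure
Process stack (top to bottom):
   Proc. B active stack frame:
      [ SPCR ] [ CENV ] [PROCID] [ LENV ]   (stack mark)
      [ loc0 ] [ loc1 ] [ loc2 ] ... [locN-1]
   Proc. A suspended stack frame:
      [ acc ] [ acc ] [ acc ]               (accumulator stack)
      [ SPCR ] [ CENV ] [PROCID] [ LENV ]   (stack mark)
      [ loc0 ] [ loc1 ] ... [locM-1]
Code:
   Procedure B: header [ N ] followed by code
   Procedure A: header [ M ] followed by code containing [call B ]
   Exception Procedure: header followed by code

Note: DENV is concatenated with SKLM, TOS and LENV.
Figure 4. User PSD and Process stack.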
EXECUTIVE ENTRY TABLE (addresses 0-8):
   0  [CONT.STAT.PTR  ]  -> [CONTINUE.STAT]
   1  [INIT.EXEC.SKLM ]
   2  [INIT.EXEC.TOS  ]
   3  [INIT.PROCID    ]
   4  [BUS.ERROR.PROC ]
   5  [NMI.PROCID     ]
   6  [INT.PROCID     ]
   7  [TRAP.PROCID    ]
   8  [EXCEP.PROCID   ]
EXEC. STACK, with the following words at its base:
      [USER.PSD.PTR   ]  -> User PSD of Task A
   EXEC. PSD:
      [EXEC.SKLM      ]
      [EXEC.TOS       ]
      [EXEC.LENV      ]
      [EXEC.DENV      ]
      [EXEC.SPCR      ]
      [EXEC.CENV      ]
      [INT.ENABLE.FF  ]
      [EXEC.ERR.CODE  ]
User PSDs for Task A, Task B and Task C, each containing:
      [SKLM] [TOS] [LENV] [DENV] [SPCR] [CENV]
      [TASK.PROCID] [TASK.CENV] [EXCEP.PROCID] [EXCEP.CENV]

Note: Task B and Task C are inactive
Figure 5. Executive and User data structures
The Executive process initiates a User task by executing a
RETURN from the outermost Executive procedure. A pointer to the
PSD of the User task must be stored in the initial Executive TOS.
The processor uses the information in the indicated User task PSD
to set up the proper state in the processor and resume (or begin)
execution of the task. This is called Outer Procedure Return
Processing. The relationship between the Executive Entry Table,
Executive PSD and User PSD is shown in Figure 5.
The other instance of Outer Procedure Return processing is
when the User task execution is terminated. This could be due to
either an interrupt or a trap. The most common type of trap is
that which is generated when a procedure attempts to return with
no previous procedures on the User task's process stack. In any
case, the status of the processor is written into the task's PSD
so that execution can resume in the future and a pointer to the
User task's PSD remains in the initial top of the Executive
process stack. If the process has terminated itself, the PSD is
reset to its initial start-up state.
Event handling
There are three special kinds of events that are handled by
the processor: interrupts, traps and exceptions. Interrupts are
generated externally by a reset, by a bus error condition or by
an external device asking for service. Traps are essentially
interrupts generated by the CPU itself. A trap can be caused by
the TRAP instruction, by an illegal instruction or by data
accessing problems such as stack overflow. Interrupts and traps
are handled on a system-wide basis by specific Executive
procedures. The PROCID of the routine corresponding to the type
of interrupt or trap can be found in the Executive Entry Table.
If a User task is executing at the time a trap or interrupt
occurs, the processor first performs an Outer Procedure Return to
save the User task status and to return to the Executive mode
where the proper servicing routine can be activated. In the case
of a trap, the trap number is placed on the top of the Executive
process stack so that it will be passed to the trap handling
procedure. The trap handling procedure may then use the trap
number to select and call a procedure appropriate to handle the
trap.
Exceptions, the other type of event, are handled separately
by each task. Exceptions occur when the data being processed
cause arithmetic overflows, division by zero, etc. These events
can be handled in a default manner if no exception handling
procedures are specified (EXCEP.PROCID = 0). If an exception
procedure is specified, it is handled as a normal procedure call
on the currently active process stack with the exception type
number passed as an argument.
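The routing of the three kinds of events can be sketched as a small dispatch function. This is a simplified illustration: the function name, the event kind strings and the returned tuples are hypothetical, and the Executive Entry Table offsets 4 to 7 are read from the order of entries in Figure 5.

```python
# ASSUMED Executive Entry Table offsets, taken from the order in Figure 5
ENTRY_TABLE_SLOT = {"bus_error": 4, "nmi": 5, "interrupt": 6, "trap": 7}


def route_event(kind, entry_table, task_psd, number=None):
    """Return (where it is handled, handler PROCID, arguments passed)."""
    if kind in ENTRY_TABLE_SLOT:
        # Interrupts and traps: system-wide Executive procedures.
        handler = entry_table[ENTRY_TABLE_SLOT[kind]]
        args = [number] if kind == "trap" else []   # trap number is passed on
        return ("executive", handler, args)
    if kind == "exception":
        # Exceptions: handled per task on the currently active process stack.
        handler = task_psd["EXCEP.PROCID"]
        if handler == 0:
            return ("default", None, [])
        return ("task", handler, [number])
    raise ValueError("unknown event kind: " + kind)


table = [0, 0, 0, 0x40, 0x50, 0x60, 0x70, 0x80, 0x90]
print(route_event("trap", table, {"EXCEP.PROCID": 0}, number=3))
print(route_event("exception", table, {"EXCEP.PROCID": 0x123}, number=2))
```

The sketch omits the Outer Procedure Return that saves the User task status before an interrupt or trap is serviced.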
Evaluation Procedure
This chapter discusses the research performed toward
completion of this thesis. Two main evaluation areas were
addressed. The first area was the identification of potential
strengths and weaknesses in the processor's architecture and
implementation. This was accomplished by first conducting a
general study of the processor, followed by a specific analysis
of its potential performance in the execution of signal
processing algorithms. At the same time, an attempt was made to
estimate how efficiently the AAMP can execute high-level compiled
languages. The second area was the testing of the validity of
the first analysis by running benchmark programs coded in the
first step. This was accomplished through the cooperation of
Collins-Rockwell in Cedar Rapids during a Sandia-sponsored visit.
Also, the author assisted Gary Mauersberger this spring in the
interfacing of a prototyping board supplied by Rockwell to the
Electrical Engineering department's HP9845B testing system. This
allowed the AAMP's initialization procedure and specific transfer
sequences to be confirmed on a logic analyzer.
Code Used for Evaluation
To compare the AAMP to other processors in the execution of
signal processing algorithms, it was necessary to use programs
which are representative of the class of algorithms which would
be ultimately run on the processor. Two adaptive linear
prediction algorithms, the Lattice and Widrow, were selected
because they had been used for this purpose in previous
evaluations. These algorithms are shown in Figures 6 and 7.
These choices seem to be valid ones because even though the
specific algorithms or implementations may not be used in future
designs, the type of operations performed will be similar.
Specifically, both algorithms involve large numbers of
multiplications and array handling in a real-time environment.
These factors were examined closely in addition to the general
performance of the stack-architecture for this type of algorithm.
(1) $g(m) = \sum_{k=1}^{n} b(m,k)\, f(m-k)$        n = 16 (# of weights), m = iteration;
                                                    delta = 1 implied by subscripts
(2) $e(m) = f(m) - g(m)$
(3) $b(m,k) = u\, b(m-1,k) + v\, e(m)\, f(m-k)$     k = 1, 2, ..., n
(4) $q(m) = \frac{1}{L} \sum_{k=1}^{L} e(m-k+1)$    L = 16 (MAF window size)
(5) $q2(m) = q(m)\, q(m)$                            output is squared for
                                                    threshold detection
Figure 6. Widrow adaptive linear prediction algorithm
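For readers who prefer code to equations, the five steps of Figure 6 can be sketched in Python as follows. This is a floating-point illustration and not one of the AAMP listings from the appendices. The class name, the default values of u and v, and the zero initial conditions are hypothetical choices. One interpretation is also assumed: the prediction in step (1) uses the weights left by the previous iteration, since the new weights in step (3) depend on the error computed in step (2).

```python
from collections import deque


class WidrowPredictor:
    """Sketch of the Widrow adaptive linear predictor of Figure 6."""

    def __init__(self, n=16, L=16, u=1.0, v=0.01):
        self.b = [0.0] * n                          # weights b(k), k = 1..n
        self.history = deque([0.0] * n, maxlen=n)   # history[k-1] = f(m-k)
        self.errors = deque([0.0] * L, maxlen=L)    # last L error values
        self.u, self.v, self.L = u, v, L

    def step(self, f_m):
        """Process one sample f(m); return (e(m), q2(m))."""
        # (1) prediction from past samples (weights from the previous iteration)
        g = sum(bk * fk for bk, fk in zip(self.b, self.history))
        # (2) prediction error
        e = f_m - g
        # (3) weight update
        self.b = [self.u * bk + self.v * e * fk
                  for bk, fk in zip(self.b, self.history)]
        # (4) moving average of the last L errors
        self.errors.append(e)
        q = sum(self.errors) / self.L
        # shift the new sample into the history for the next iteration
        self.history.appendleft(f_m)
        # (5) squared output for threshold detection
        return e, q * q


# A perfectly predictable (constant) input is adapted away.
predictor = WidrowPredictor()
for _ in range(300):
    e, q2 = predictor.step(1.0)
```

Each call performs n multiplications for the prediction and 2n more for the weight update, which is why multiplication speed dominates the evaluation that follows.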
e(1) = adc_input
w1(1) = e(1)
do l = 1,n
   e(l+1) = e(l) - k(l) * w1(l)
   w(l+1) = w1(l) - k(l) * e(l)
   v(l) = beta * v(l) +
          beta1 * (e(l) * e(l) + w1(l) * w1(l))
   k(l) = k(l) + alpha * (e(l+1) * w1(l) + e(l) * w(l+1)) / v(l)
   w1(l) = w(l)
enddo
w1(n+1) = w(n+1)
dac_out = e(n+1)
loop back to the beginning
Figure 7. Lattice adaptive linear prediction algorithm
Because of the potential speed of the processor and the
unique architecture (among micros), two versions of both the
Lattice and Widrow algorithms were coded. The listings of these
programs can be found in the appendices. The first version was
coded from a high-level representation to indicate what one would
expect from a compiler. The second version takes advantage of
assembly language "tricks" to optimize performance. The results
of this comparison are discussed in detail in the following
sections. After the Widrow algorithm had been coded, it was
discovered that previous evaluations had used what appeared to be
a less efficient method of implementation. The fifth listing was
written to correspond to the earlier implementations and see how
much was gained from the algorithm modifications. The gain was
18% for floating-point and 35% for fixed-point calculations.
This modification is discussed in detail in the section titled
"Widrow Algorithm Modification".
The AAMP was designed as a stack-machine to enhance support
of high-level language constructs. This efficiency coupled with
the processor's speed allows high-level language implementation
of algorithms which previously required assembly language
programming. The ability to implement algorithms in high-level
languages is a big advantage because it decreases required
programmer time and increases program reliability, portability
and quality of documentation.
Beginning with high-level pseudo-language, the Widrow and
Lattice algorithms were coded and then converted into AAMP
assembly language. Initialization was not included because it is
quite language dependent and would not affect the performance of
the algorithm once begun. It was assumed, however, that the most
frequently used variables were declared as local variables and
that the proper variable type declarations had been made. It was
also assumed that the default exception handling (divide by zero,
etc.) was used. An attempt was made to avoid restructuring the
high-level language representations to take advantage of
knowledge of the low-level structures except in the hand-
optimized versions of the algorithms. Optimizations obvious at
the high-level were used, such as the Widrow modification,
avoiding references to array members where local variables could
be used (as in the Lattice) and performing a multiplication once
outside a loop instead of every time through the loop. It was
assumed that the compiler would correctly select the addressing
modes and literal lengths, use the increment instruction, and in
general take advantage of the facilities offered by the
processor. This turned out to be a reasonable assumption.
For the purposes of this evaluation, single-precision
integer and single-precision floating-point versions of the
algorithms were coded. Table 3 shows the execution times for the
Widrow and Table 4 shows execution times for the Lattice. The
AAMP has equivalent instructions available for each data format.
Because of this, the two versions were coded side by side, each
with the correct form of the arithmetic instructions and proper
length memory reference instructions. Execution statistics for
the fractional data format are identical to those of the integer
version, so it was not coded again. Another possibility which offers a
compromise between the fixed-point and floating-point is the
double-precision fixed-point format. Because of the word length,
double-precision fixed-point data transfers are the same as
single-precision floating-point transfers. Double-precision
fixed-point execution statistics have been estimated from single-
precision floating-point execution estimates. Shorter execution
times of the double-precision fixed-point arithmetic instructions
were taken into account. Extended-precision floating-point
implementations were not investigated because they would execute
much more slowly and the added precision seemed unnecessary.
Table 3. AAMP Widrow Execution Times
(all times in microseconds)

                                Add/     Stack                      Samples
Algorithm           Multiply  Subtract   Update    Other     Total    /sec

Fixed pt:
  Standard            237.50     19.25   507.60   588.20   1352.55     739
  Modified            237.50     19.25   345.60   402.10   1004.45     996
  Optimized           237.50     19.25     0.00   411.25    668.05    1497

Double-precision fixed pt:
  Standard            757.50     26.25   907.20   705.15   2396.10     417
  Modified            757.50     26.25   691.20   486.40   1961.35     510
  Optimized           757.50     26.25     0.00   566.25   1350.00     741

Floating-point:
  Standard            957.50    272.05   907.20   712.90   2849.65     351
  Modified            957.50    272.05   691.20   494.15   2414.90     414
  Optimized           957.50    272.05     0.00   574.00   1803.55     554
Table 4. AAMP 16-Stage Lattice Execution Times
(all times in microseconds)

                    Multiply/   Add/     Stack                      Samples
Algorithm            Divide   Subtract   Update    Other     Total    /sec

Fixed pt:
  Standard            772.80     52.80   345.60   758.10   1929.30     518
  Optimized           772.80     52.80     0.00   648.70   1474.30     678

Double-precision fixed pt:
  Standard           2436.80     72.00  1152.00  1001.25   4662.05     214
  Optimized          2436.80     72.00     0.00  1134.75   3643.55     274

Floating-point:
  Standard           3073.60    756.80  1152.00  1009.00   5991.40     167
  Optimized          3073.60    756.80     0.00  1142.50   4972.90     201
It should be pointed out that the integer versions of both
algorithms make no provision for scaling. If needed, scaling
operations should not seriously affect the performance of the
algorithm. It is possible that using fractional notation could
take care of the scaling problem, but this has not been
investigated in any detail. Also, the AAMP automatically invokes
exception handling for overflows and division by zero. The
default exception handling should be adequate for most uses and
requires very little execution time overhead. If necessary,
however, user-supplied exception handling routines can be used at
the cost of the time needed to transfer to, execute and return
from the routines.
After the programs described above were coded and execution
rates were calculated, a trip to Collins-Rockwell at Cedar
Rapids, Iowa was arranged. With the cooperation of the Rockwell
personnel, the Lattice and Widrow algorithms were coded and
executed on their test equipment. Originally, Rockwell's PL/I
and Ada-subset were to be used, but due to lack of time and
accessibility only the Ada-subset was used. The object code
produced is discussed under the appropriate sections below and
more detail is provided in the "Performance measurements"
section.
The viability of using compiled output for time-critical
real-time signal processing depends on the efficiency of the
generated object code. Another program was written to test the
compiler's optimizations; it consists of a number of structures
that are commonly optimized, particularly those which appear
quite often in signal processing applications. The listings for
this program, ADATESTS, can be found in the appendices.
Widrow Algorithm Modification
While most of the Widrow algorithm's execution time can be
attributed to multiplications, a significant amount of time is
spent aligning the weight (b), input (f) and error (e) arrays.
This has been done either with block moves of the arrays or
through maintaining circulating buffers. The following is a
description of a method which is simpler and more efficient to
implement. The author came across this method by examining the
algorithm closely and has not found any previous use of this
method.
A common form of the Widrow algorithm is shown in Figure 6.
An important point to note is that the summation in step 1 will
be correct as long as all of the corresponding weight and input
pairs are multiplied and summed, regardless of order. The same
weight-sample pairs as in step 1 are used in the weight updating
in step 3, with one weight updated at a time. Again, the new
weights will be correct regardless of the order the updating
process uses. In fact, the only time a particular member of f is
needed is when the oldest sample is replaced with the new sample
(in a circular buffer).
Note that as long as a pointer is maintained to the oldest
sample in f, we need only concern ourselves with providing the
proper pairings of samples and weights for steps 1 and 3.
Usually, the sample array is advanced by one to simulate passage
of time. Instead, the weight array can be "moved back" by one to
create the same pairings of samples and weights. This turns out
to be convenient since newly calculated weight values must be
written into the array anyway. The updated weights are simply
written into the correct position for the next iteration's
pairings. Figure 8 illustrates this process.
Because the order is irrelevant as long as the pairings are
correct, steps 1 and 3 can be efficiently performed by proceeding
from one end of the arrays to the other. In assembly language,
it is most efficient to go from the largest to smallest buffer
addresses, terminating when the array index equals zero. The
weight updating then requires only one additional memory transfer
to complete the circulation. This eliminates the overhead needed
for circulating pointers.
The pointer to the oldest sample circulates and thus must be
checked, but this occurs only once per sample. Also, the index
to the oldest value of e is the same as the index into f,
allowing a single index to be used for both purposes. Figure 9
illustrates the updating of the sample (f) and error (e) arrays.
Finally, Figure 10 shows that the pairs match correctly after
updating.
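The scheme can be checked with a short sketch. The Python below is an
illustration only: it assumes the usual textbook LMS form of the Widrow
algorithm (output, error, weight update) rather than the exact steps of
Figure 6, it omits the error array e, and all names (lms_block_move,
lms_moved_weights, mu) are hypothetical. The first routine shifts the
sample array every iteration; the second leaves the samples in place,
overwrites only the oldest one, and writes each updated weight one
position along so that the pairings are already correct for the next
iteration.

```python
def lms_block_move(samples, desired, n_taps, mu):
    """Reference version: the sample array is shifted (block move) each iteration."""
    f = [0.0] * n_taps              # f[0] is the newest sample, f[-1] the oldest
    b = [0.0] * n_taps              # b[k] is the weight for the sample k steps old
    outputs = []
    for x, d in zip(samples, desired):
        f = [x] + f[:-1]            # advance every sample by one position
        y = sum(bk * fk for bk, fk in zip(b, f))         # step 1: filter output
        e = d - y                                        # step 2: error
        b = [bk + mu * e * fk for bk, fk in zip(b, f)]   # step 3: weight update
        outputs.append(y)
    return outputs, b


def lms_moved_weights(samples, desired, n_taps, mu):
    """Modified version: samples stay put; updated weights move one slot along."""
    f = [0.0] * n_taps
    b = [0.0] * n_taps
    oldest = 0                      # index of the oldest sample (circulates)
    outputs = []
    for x, d in zip(samples, desired):
        f[oldest] = x               # only the oldest sample is physically replaced
        y = 0.0
        for i in range(n_taps - 1, -1, -1):     # largest to smallest index
            y += b[i] * f[i]
        e = d - y
        wrapped = b[n_taps - 1] + mu * e * f[n_taps - 1]  # the one extra transfer
        for i in range(n_taps - 2, -1, -1):
            b[i + 1] = b[i] + mu * e * f[i]     # write into next iteration's slot
        b[0] = wrapped
        oldest = (oldest + 1) % n_taps          # pointer checked once per sample
        outputs.append(y)
    # Report the weights in age order so the two versions can be compared.
    by_age = [b[(oldest - k) % n_taps] for k in range(n_taps)]
    return outputs, by_age
```

Run on the same input, the two routines should produce the same outputs
and the same weights to within floating-point rounding (the products are
summed in a different order), which is the sense in which the order of
processing is irrelevant as long as the pairings are correct.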
The results of the comparison between the block-move updates
and the modified version show an improvement of 35% for the fixed
point version and 18% for the floating point version. The timing
difference is actually greater in the floating-point version
because the buffers are twice as large, but the proportion of the
total is less. One would expect that the savings would be less
dramatic on processors that have block-move instructions using
register pointers. Dwight Gordon's NSC800/hardware multiplier
evaluation [3] showed that block-moves represented approximately
10% of the total execution time, which is what would be expected.
[Figure: weight array b(1)-b(16) shown at Step #1, Step #16 and Step #17
of the update/process sequence]
Figure 8. Weight Updating Process
Note: The prime indicates variables for the next iteration.
[Figure: sample array f(1)-f(16) and error array e(1)-e(16), each shown
before and after the "discard old, insert new value" step]
Figure 9. Sample and Error Array Updating Scheme
Note: Only the oldest sample and error values are physically
replaced; the rest are left in the same place and merely
relabeled.
-> = pointer to oldest, ' = for next iteration
[ f(16) ] <--> [ b(16) ]        [ f( 1)' ] <--> [ b( 1)' ]
[ f(15) ] <--> [ b(15) ]        [ f(16)' ] <--> [ b(16)' ]
[ f(14) ] <--> [ b(14) ]        [ f(15)' ] <--> [ b(15)' ]
[ f(13) ] <--> [ b(13) ]        [ f(14)' ] <--> [ b(14)' ]
[ f(12) ] <--> [ b(12) ]        [ f(13)' ] <--> [ b(13)' ]
[ f(11) ] <--> [ b(11) ]        [ f(12)' ] <--> [ b(12)' ]
[ f(10) ] <--> [ b(10) ]        [ f(11)' ] <--> [ b(11)' ]
[ f( 9) ] <--> [ b( 9) ]        [ f(10)' ] <--> [ b(10)' ]
[ f( 8) ] <--> [ b( 8) ]        [ f( 9)' ] <--> [ b( 9)' ]
[ f( 7) ] <--> [ b( 7) ]        [ f( 8)' ] <--> [ b( 8)' ]
[ f( 6) ] <--> [ b( 6) ]        [ f( 7)' ] <--> [ b( 7)' ]
[ f( 5) ] <--> [ b( 5) ]        [ f( 6)' ] <--> [ b( 6)' ]
[ f( 4) ] <--> [ b( 4) ]        [ f( 5)' ] <--> [ b( 5)' ]
[ f( 3) ] <--> [ b( 3) ]        [ f( 4)' ] <--> [ b( 4)' ]
[ f( 2) ] <--> [ b( 2) ]        [ f( 3)' ] <--> [ b( 3)' ]
[ f( 1) ] <--> [ b( 1) ]        [ f( 2)' ] <--> [ b( 2)' ]
Figure 10. Sample-Weight Pairs Before and After Updating
Note: The pairings remain the same between the arrays.
' = for next iteration
Optimization for the AAMP
The following section discusses a few of the features of the
AAMP that significantly affect the execution rates of programs.
These factors can be dealt with through coding style and compiler
optimization.
As a stack machine, the AAMP performs operations on the top
elements of the stack, popping the arguments off and pushing the
result. To speed this process, there is a provision to hold up
to four of the top values on the stack in registers inside the
processor itself. These registers are transparent to the
programmer; transfers into and out of these registers are handled
automatically by the processor. As elements are placed on the
stack, they are put in the processor registers. If a new value
is to be pushed onto the top of the stack when all processor
registers are full, the bottom element is moved out into memory
to free a register. Later, when the top elements of the stack
are removed by operations, the registers become empty and the
values in memory must be brought back into the registers. Thus,
each time the stack grows to more than four elements, stack
updating must take place. This storage and later retrieval of
stack elements is handled automatically by the processor but is
very costly in terms of processing time. Table 5 compares the
stack updating action with other operations. Each time a stack
element is moved from the registers to memory it must later be
returned to the registers. Therefore, the stack updating time in
this report is the combined storage and retrieval times.
Table 5. Execution Times of Some Common Operations
(all times in microseconds)

Operation                     Execution time
Fixed-point multiply                4.75
Floating-point multiply            19.15
Fixed-point addition                0.55
Floating-point addition             7.75
Local reference, 16-bit             0.85
Stack update                        3.60
For the preliminary performance estimates, it was assumed
that the processor displaced into memory only enough registers to
make room for the element being pushed onto the stack. Also, the
processor retrieved only enough elements from memory to perform
the current instruction. This is the optimum approach since it
avoids unnecessary transfers and seemed to be the logical way to
implement the stack updating scheme. Tables 3 and 4 demonstrate
the significance of the stack updating in the execution time. If
any other scheme is used, it could degrade performance
significantly.
Due to "real estate" problems encountered in the
implementation of the AAMP on a single chip, it was not possible
to have optimal stack-updating. Instead, Rockwell used the
mapping of opcodes to internal stack status shown in Table 6.
This mapping is nearly optimal; the non-optimal instructions are
listed in Table 7 and discussed below.
              Stack Elements
Opcodes          Allowed
00-1F              0-3
20-3F              0-2
40-5F              1-4
60-7F              2-2
80-9F              4-4
A0-BF              3-4
C0-DF              2-4
E0-FF              2-4

Table 6. Opcode to stack update mapping
There are three types of non-optimal stack-updates:
1) Unnecessary - a stack update which provides a range of
stack elements that is more restrictive than required by the
instruction. There is a 0.5 probability that this action must be
reversed in subsequent instructions, causing inefficiency.
2) Inadequate - a stack update which provides a range of
stack elements that is less restrictive than required by the
instruction. The instruction must then continue the stack
updating (if needed) to meet its more restrictive requirements.
This type is harmless.
3) Destructive - a stack update which provides a condition
that must be immediately corrected before the instruction can be
executed.
Key to symbols:
( ) = a harmless state
<- = stack not used by instruction
!? = stack-update which must be immediately undone
# = optimums are 0-0 for PROCID = CENV =
0-4 for PROCID = CENV <>
Opcode   Mnemonic   Implemented   Optimum
 1B      INTE          0-3         0-4
 1D      SKIPI         0-3         0-4
(1F      CALLPI        0-3         0-0)
 20      NOP           0-2         0-4 <-
(23      CALLI         0-2         0-0)
(26      LIT48         0-2         0-1)
 28      LIT4B.8       0-2         0-3
 29      LIT4B.9       0-2         0-3
 2A      LIT4B.A       0-2         0-3
 2B      LIT4B.B       0-2         0-3
 2C      LIT4B.C       0-2         0-3
 2D      LIT4B.D       0-2         0-3
 2E      LIT4B.E       0-2         0-3
 2F      LIT4B.F       0-2         0-3
(58      TRAP          1-4         1-1)
(5D      CALL          1-4         1-1)
(5E      CALLP         1-4         1-1)
 65      CVTSD         2-2         1-3
 66      LOCU          2-2         1-3
 67      REFD          2-2         1-3
 68      REFDC         2-2         1-3
 69      REFDXI        2-2         1-3
 6A      DUP           2-2         1-3
 6C      CVTDFE        2-2         2-3
 6D      CVTFFE        2-2         2-3
 6E      REFTXI        2-2         1-2
 6F      REFTX         2-2         2-3
 74      REFTI         2-2         0-1 !?
 75      REFT          2-2         1-2
 76      REFTC         2-2         1-2
 77      REFTLE        2-2         0-1 !?
 78      REFTU         2-2         2-3
 79      DUPT          2-2         3-4 !?
 7A      INCSLE        2-2         0-4 <-
 7B      INCSI         2-2         0-4 <-
 7C      INCS          2-2         1-4
 7D      DECSLE        2-2         0-4 <-
 7E      DECSI         2-2         0-4 <-
 7F      DECS          2-2         1-4
 B7      POPD          3-4         2-4
 B8      ARS           3-4         2-4
 BD      EXCEPT0       3-4         #
 BE      EXCEPT1       3-4         #
 BF      EXCEPT2       3-4         #
 F4      NOT           2-4         1-4
 F8      CVTBIT        2-4         1-4
 FE      HALT          2-4         0-4?<-

Table 7. Instructions with non-optimum stack-updating
Examination of the assembly listings for the Widrow and
Lattice algorithms led to the conclusion that the nonoptimum
instructions did not have a significant effect on execution rate.
The hand-compiled versions used the LIT4B, REFDXI, DUP and DECSL
instructions while the Ada-subset compiler output used only the
LIT4B instruction. Nonoptimal instructions would be used quite
frequently, however, if triple words were being accessed but
would not be very significant compared with the accompanying
relatively slow extended floating-point operations.
For hand-compilation of algorithms for the performance
estimates, it was assumed that the compiler was not "smart"
enough to generate object code that would not cause stack
updating. This turned out to be the case with the Ada-subset
compiler. There are two methods of avoiding stack updating: 1)
rearranging arguments and 2) storing intermediate results in
temporary locations. These two methods are complementary, and
both have been used in the hand-optimized versions of the
algorithms.
Rearranging arguments is the most desirable way of avoiding
stack updating because it does not require any extra
instructions. This rearranging of arguments is commonly used by
owners of RPN calculators when equations are entered beginning
with the innermost parenthetical expression. Also,
multiplication and division terms are evaluated before addition
and subtraction terms whenever possible. This does not always
work, as in the following case:
F=A*B + C*D
In this case, both product terms must be evaluated before they
can be summed. As the second multiplication is about to take
place, the product A*B, C and D are on the stack. If being
performed with floating-point numbers, each variable takes up two
storage locations, forcing the stack to have six members (or more
depending on previous actions). This causes at least two stack
updates.
The second way of avoiding stack updates is to store
intermediate answers in temporary locations. In the example
above, the product A*B would be stored in a temporary location
until C*D had been evaluated. Then the value would be retrieved
and the summation could take place. This method is only
economical if the temporary location can be efficiently accessed.
The addressing mode must use immediate data (to avoid pushing
more data on the stack!) or, preferably, the local addressing
mode.
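The effect of the two methods on the example can be seen with a small
model. The Python sketch below is an illustration under stated
assumptions, not Rockwell's implementation: it models four top-of-stack
registers with the optimum policy assumed for the preliminary estimates
(displace only to make room, retrieve only what the current instruction
needs), treats each floating-point value as two words, and uses made-up
operation names ("push", "binop", "store").

```python
REGISTERS = 4            # top-of-stack words held inside the processor
STACK_UPDATE_US = 3.60   # combined store-and-retrieve time from Table 5


def count_stack_updates(program):
    """Count words displaced to memory while running (op, words) steps.

    op is "push" (place `words` words on the stack), "binop" (replace two
    operands of `words` words each with one result) or "store" (pop
    `words` words into a temporary location).
    """
    in_regs = 0      # stack words currently held in registers
    in_mem = 0       # stack words displaced to memory
    spills = 0
    for op, words in program:
        if op == "push":
            for _ in range(words):
                if in_regs == REGISTERS:   # registers full: displace bottom word
                    in_mem += 1
                    spills += 1
                else:
                    in_regs += 1
        else:
            needed = 2 * words if op == "binop" else words
            if in_regs < needed:           # retrieve only what is required
                refill = needed - in_regs
                in_mem -= refill
                in_regs += refill
            in_regs -= needed
            if op == "binop":
                in_regs += words           # result replaces the operands
    return spills


# F = A*B + C*D in floating point (two words per value), evaluated directly.
direct = [("push", 2), ("push", 2), ("binop", 2),      # A*B
          ("push", 2), ("push", 2), ("binop", 2),      # C*D
          ("binop", 2)]                                # sum

# Same expression with A*B parked in a temporary location.
with_temp = [("push", 2), ("push", 2), ("binop", 2), ("store", 2),
             ("push", 2), ("push", 2), ("binop", 2),
             ("push", 2), ("binop", 2)]
```

For the direct form this model gives two stack updates (about 7.2
microseconds at the Table 5 figure), in agreement with the count given
above; with the temporary it gives none, at the cost of the extra store
and reference instructions.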
Both of the above methods were used in the optimized
versions of the algorithms. Tables 3 and 4 show that the time
spent on "other" operations increased when the stack updates were
taken out. This was due to the added overhead from the temporary
variables.
A second important feature of the AAMP is its addressing
modes. The methods of addressing array elements are of
particular interest for the evaluation of signal processing
algorithms. The AAMP provides three addressing modes which can
be used for accessing array members: 1) Indexed, 2) Indexed
Immediate, and 3) Constant Offset.
The Indexed addressing mode adds the top two elements of the
stack to form an address. The top element of the stack
represents the base address of the array while the element next
to the top represents the index into the array. Before adding,
the index is multiplied by two for double-length word accesses
and by three for triple-word accesses. Through use of this
addressing mode, an array element may be specified by index
regardless of the element's word length.
The Indexed Immediate addressing mode is the same as the
indexed mode except that the base address of the array is in
immediate data in the instruction instead of on the top of the
stack.
The third addressing mode is Constant Offset. This mode
works essentially the same as Indexed Immediate except that the
base and offset are added together without taking into account
the word-length of the data being accessed. In other words, the
two numbers are added together without any multiplication of the
offset.
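The address arithmetic of the three modes can be summarized in a few
lines. The Python below is only a sketch of the descriptions above; the
function names are hypothetical, addresses are taken to be word
addresses, and the stack is represented simply by passing its top
elements as arguments.

```python
def indexed(base_on_stack, index_on_stack, words_per_element):
    """Indexed: base (top of stack) plus index (next to top), scaled by length."""
    return base_on_stack + index_on_stack * words_per_element


def indexed_immediate(immediate_base, index_on_stack, words_per_element):
    """Indexed Immediate: same arithmetic, base taken from the instruction."""
    return immediate_base + index_on_stack * words_per_element


def constant_offset(base, offset):
    """Constant Offset: base and offset added with no scaling by word length."""
    return base + offset


# Element 3 of an array whose base is at word address 0x1000.
single_word = indexed(0x1000, 3, 1)            # single-word elements
double_word = indexed(0x1000, 3, 2)            # double-word elements
triple_word = indexed_immediate(0x1000, 3, 3)  # triple-word elements
unscaled = constant_offset(0x1000, 3)          # offset used as given
```

With the scaled modes the same index selects element 3 whatever the
element length, whereas Constant Offset always lands three words past
the base.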
These are not the only methods of referencing arrays, but
they are the most convenient. Other methods include more
complicated calculation of offset and pre-calculating addresses
when the index into the array being used for a particular
reference is constant.
The AAMP's addressing modes are very convenient for
referencing data in tables and other common structures, but there
are some actions that are awkward at best for the AAMP to
perform. In particular, block moves and other actions which
require the use of one or more pointers fairly intensively are
awkward. The processor is fast enough to make these actions
reasonable, but the processor will yield better performance when
other programming techniques are used to perform these tasks.
This was the factor that led to the modified version of the
Widrow algorithm.
Because of the nature of array addressing in the AAMP, the
array's base address must be obtained from either the stack or
the accessing instruction's immediate data bytes. The array
index must be taken from the top of the stack. To get the index
onto the top of the stack, another memory access must have taken
place. If an array member is to be accessed more than once, it
becomes advantageous to store the member in a temporary local
location during its first access. From the next reference until
the last reference, the locally stored variable is accessed.
Quite often, the last access involving a variable is to assign a
new value to it before the next iteration is begun. This allows
the new value to be written directly to the actual array storage
location only.
Another possible optimization is in the case of a loop that
references element k+1 during the kth iteration. If k+1 is
used more than once in the loop, k+1 might be calculated once and
stored for future references, but a better solution at the
assembly language level is to specify an offset array base so
that when the Indexed address is calculated, it automatically
includes the +1 offset. This turned out to be one of the few
optimizations the Ada-subset compiler made.
The most efficient storage is in a 16-word Local area.
These locations should be used for the most frequently used
variables. For example, during a block move, the pointers must
be continually retrieved from memory. If the Local addressing
mode is used to access the pointers, significant improvements in
performance can be achieved.
Another significant factor in signal processing programs is
how efficiently loop structures can be implemented. The AAMP has
a pair of instructions, DO and ENDO, for that express purpose.
Before DO can be executed, the information necessary for control
of the loop must be put on the stack: loop variable address,
initial value, final value and increment value. The DO
instruction is then executed, initializing the variable and
executing the loop. In the process, the initial loop value on
the stack is replaced by the address of the beginning of the loop
(the instruction following DO). At the end of the loop, the ENDO
instruction performs the incrementing and comparison necessary to
determine whether to execute the loop again or to exit. To do
this, the four stack locations containing the information must
all be brought into the registers. Note that the DO instruction
is executed only once but the ENDO instruction is executed every
time the loop is executed.
The assumption made when hand-compiling the algorithms was
that the compiler would use these instructions to implement most
if not all loop structures. This had a rather significant effect
on the execution rates of the algorithms because of the large
number of stack updates it caused. Since all four stack
locations had to be brought into the processor for the ENDO
instruction, any word placed onto the stack automatically led to
a stack-update. For large loops, this would not constitute a
very significant part of the execution time, but for small loops
the effect is quite significant. In the hand optimizations, the
DO and ENDO instructions were not used but the actions were coded
explicitly. This eliminates the need for four words of
information to be stored on the stack during the loop and does
not cause stack-updates. This change alone accounted for much of
the improved performance in the hand-optimized versions of the
algorithms. It was discovered that the Ada-subset compiler also
discards the DO/ENDO instructions and codes the structures
explicitly.
In the implementation of large signal processing programs,
it is desirable if not necessary to partition the program into
functions and subroutines or procedures. This partitioning
offers the advantages of making the program easier to understand,
maintain and modify. The following discusses the resources
available in the AAMP to implement such partitioning.
The structures of functions, subroutines and procedures can
all be implemented through use of the AAMP's CALL and RETURN
instructions. These instructions are powerful because they allow
the programmer to set up a local environment and to pass
parameters using only a few instructions. Working with this
mechanism is the instruction LOCNL (locate nonlocal) which will
search through the process stacks until it finds the specified
PROCID and locates the specified variable. This mechanism allows
variables to be accessed from calling procedures without passing
the variables as arguments or making the variables global. An
example of where this would be useful is when a procedure which
has created a local array calls another procedure to manipulate
the array. Using LOCNL, the called procedure can locate the
array base and calculate the positions of the individual
elements.
The chief advantages of the AAMP's procedure calling
mechanism are the ease of programming and the flexibility it
allows in the calling order of procedures. The alternative is to
put a return address and arguments on the stack and SKIP to the
subroutine. The subroutine would return by SKIPing to the return
address left on the stack. The advantage of this alternative
method is that it requires less execution time. The
disadvantages are that more care must be taken in accessing
variables and passing parameters and that new local storage is
not set up for temporary variables used by the function. A
break-even point can be calculated where the savings from local
referencing set up by a procedure call become greater than the
procedure call's overhead. Both single and double word
references and assignments take one more microcycle and one half
more instruction fetch for Local Extended than for Local
addressing. The CALLI, LIT4A and RETURN instructions require 31
microcycles, 4.5 fetch cycles, 3 read cycles and 4 write cycles.
In addition, 6 microcycles, 1 read cycle and 1 write cycle are
required for each argument returned. Using these figures, 27
Local Extended accesses require the same amount of execution time
as 27 Local accesses plus a procedure call with no arguments
returned. Thus, 27 or more accesses make the procedure call
economical. The other advantages, however, should encourage the
use of the CALL/RETURN mechanism more often than what is strictly
economical.
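The break-even figure can be reproduced from the counts above. The
sketch below is illustrative; it assumes the cycle times adopted for
this study (a 200 ns microcycle, 250 ns per fetch or read and 200 ns per
write), and the constant and function names are hypothetical.

```python
T_CYCLE = 200.0   # ns per microcycle (20 MHz external clock divided by 4)
T_FETCH = 250.0   # ns allowed per instruction fetch
T_READ = 250.0    # ns allowed per data read
T_WRITE = 200.0   # ns allowed per data write


def time_ns(cycles=0.0, fetches=0.0, reads=0.0, writes=0.0):
    """Execution time in nanoseconds for the given cycle counts."""
    return (cycles * T_CYCLE + fetches * T_FETCH
            + reads * T_READ + writes * T_WRITE)


# Extra cost of one Local Extended access relative to one Local access.
extra_per_access = time_ns(cycles=1, fetches=0.5)

# CALLI, LIT4A and RETURN with no arguments returned.
call_overhead = time_ns(cycles=31, fetches=4.5, reads=3, writes=4)

# Additional cost for each argument returned.
per_argument = time_ns(cycles=6, reads=1, writes=1)

# Number of accesses at which the procedure call pays for itself.
break_even_accesses = call_overhead / extra_per_access
```

With these assumptions the call overhead is 8.875 microseconds and each
Local Extended access costs 0.325 microseconds more than a Local access,
giving a break-even point of about 27.3 accesses, consistent with the
figure of 27 quoted above. Each argument returned adds 1.65
microseconds, or roughly five more accesses.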
Another possibility is that of doing away with the looping
structures altogether and using "in-line" code. This is not a
very graceful solution but should be considered, especially in
light of the good code density characteristic of AAMP. Instead
of a looping structure where each iteration processes a
corresponding stage, in-line code would have a specific set of
instructions for each stage. Instead of using the loop variable
as an index into the arrays, sections of in-line code would
contain the exact address of the element of interest, coded as
bytes of immediate data.
Compiler optimizations
In the past, digital signal processing programs written for
microprocessors have had to be hand-coded, with the utilization
of as many assembly language "tricks" as possible. As a result,
program efficiency was to a large degree a function of the
programmer's cleverness. Unfortunately, there always seems to be
a shortage of clever programmers and the cleverness must often
later be unraveled. If feasible, the ability to program in high-
level languages would greatly decrease the programming and
maintenance time.
The efficiency of execution of compiled high-level languages
seems to be dependent on three factors: 1) the ability of the
compiler to manipulate the program without altering the
semantics, 2) the mapping of compiled structures into machine
language instructions and 3) the execution speed of these
instructions.
The first factor, the manipulation of the algorithm, is
dependent upon the compiler. After the compiler has converted
the program into an internal form, often an abstract syntax tree,
this internal form can be rearranged and condensed.
Rearrangement uses commutativity to produce a more efficient
order of evaluation. The internal form can be condensed by
the calculation of constant expressions, elimination of common
subexpressions, etc. [6].
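As an illustration of condensing the internal form, the following
Python sketch folds constant subexpressions in a small expression tree.
It is not taken from any Rockwell compiler; the tuple representation
and the function name are assumptions made for the example.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def fold_constants(node):
    """Return an equivalent tree with constant subexpressions evaluated.

    A node is a number, a variable name (str) or a tuple (op, left, right).
    """
    if not isinstance(node, tuple):
        return node                       # leaf: constant or variable
    op, left, right = node
    left = fold_constants(left)
    right = fold_constants(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)       # both operands known at compile time
    return (op, left, right)


# (2 + 3) * x  condenses to  5 * x
folded = fold_constants(("*", ("+", 2, 3), "x"))
```

Only the constant part of the tree is evaluated; the multiplication by
the variable is left for run time.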
The second factor, the translation of compiled constructs
into machine language instructions, is mostly dependent on the
microprocessor's architecture. Most register-oriented
microprocessors must use several machine language instructions to
implement a complex operation such as a procedure call or
a floating-point multiply. In particular, the allocation of
their registers is critical to performance. Because of the
AAMP's instruction set and stack architecture, high-level
languages map quite directly into machine language instructions.
Another way to state this is to say that AAMP programs exhibit
good code densities. Stack machines have no register allocation
problems but instead must strive to keep few arguments on the
stack. This is somewhat less of an optimization problem than
register allocation.
The third factor, the execution speed of the instructions,
depends on the technology of the implementation and appears to be
quite adequate in the case of the AAMP. The specifics of this
can be found elsewhere. [7]
The potential efficiency of compiled high-level languages
was first assessed when the Widrow and Lattice algorithms were
compiled. To examine the first two factors more closely, a
program titled ADATESTS was coded in both integer and floating-
point versions. This program was then compiled by the ICSC Ada-
subset compiler at Collins-Rockwell. The results of this test
show directly only what had been implemented on this compiler but
should indicate problems other compiler implementations might
encounter. Many common structures and commonly optimized
expressions were placed in this program:
- while, for and loop structures
- function and procedure calls
- commutative rearrangement of expressions
- optimization between statements
- constant expression evaluation
- loop invariant expressions
- use of the increment instruction
Examination of the compiled object code revealed little
optimization but did reveal an efficient translation of high-
level structures into machine language instructions. Exceptions
found were the DO/ENDO instructions discussed previously and the
DUP and INC instructions. The compiler, however, did calculate a
constant expression in the integer version. Also, in the Lattice
and Widrow programs, Local Extended addressing was used to access
specific array elements by computing the address of the element
at compile time.
The optimizations not present in the Ada-subset compiler
are practical but had not yet been added due to lack of time.
Another pass could be added to the compiler to take the
intermediate code macros it produces and optimize them before
the macro assembler generates object code. With this, the
compiler's output would be very close to that produced by a
skilled programmer. Optimization on the programmer's part would
then be performed only by simplifying the high-level
representation.
Performance measurements
Table 8 shows the total numbers of various types of cycles
for each of the versions of the algorithms coded. These numbers
were derived from the program listings and Rockwell's "AAMP
Instruction Execution Statistics" document [7]. These totals are
provided here to show how the data shown in earlier tables were
arrived at and to allow performance calculations for various
memory speeds and processor clock speeds. The following
equation was supplied in Rockwell documents to calculate
execution times. The following section will develop a variant of
this equation which was used for the execution rate estimations.
Briefly, the equation from Rockwell does not take into account
synchronization.
Te = Nc * Tc
+ Nf * (Tf + (S+3) * Tc/4)
+ Nr * (Tr + (S+3) * Tc/4)
+ Nw * (Tw + (S+2) * Tc/4)
+ Nb * Tb
where:
T = time
N = number of actions
S = set-up time configuration of the processor
e = total execution
c = internal cycles
f = instruction fetches
r = data reads
w = data writes
b = bus cycles
Table 8. Microcycles Required by the AAMP

Algorithm               Cycles    Fetches    Reads    Writes

Fixed Point:
  Standard Lattice        7311      735        594      228
  Optimized Lattice       5531      581        482      164

Floating Point:
  Standard Lattice       26270      785       1107      566
  Optimized Lattice      21818      791        835      405

Fixed Point:
  Standard Widrow         4953      482.5      463      261
  Modified Widrow         3827.5    339.5      334      173
  Optimized Widrow        2290      352        271       77

Floating Point:
  Standard Widrow        11765      506.5      708      441
  Modified Widrow        10286.5    363        520      309
  Optimized Widrow        7541      411.5      424      149
Since the AAMP is being evaluated for use on small systems,
bus arbitration is unnecessary and Tb = 0, canceling the last
term in the equation. Rockwell has done all of its
specifications using a 20 MHz external clock and seems to be
getting parts to run at that speed or better, so 20 MHz was used
for this study. The microcycle clock frequency is the external
clock frequency divided by 4; thus, Tc = 200 ns. The fetch, read
and write times depend on the memory used and on the method of
generating control signals. To be conservative, 200 ns was
allowed for each write and 250 ns for each fetch and read cycle.
This could represent any one of the following conditions:
1) Tw = Tr = Tf = 100 ns, S = 0.
2) Tw = Tr = Tf = 50 ns, S = 1.
3) Tw = Tr = Tf = 0 ns, S = 2.
These seem to represent a wide variety of set-up times which
should surely allow a common RAM to be used. It is possible that
such a system will be able to run without wait-states (S=0, Tf=0,
Tr=0 and Tw=0), but this needs to be investigated further.
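As a worked illustration, the Rockwell equation given above can be evaluated numerically. The following Python sketch is not part of the original study; the function name `execution_time_ns` and its parameter names are hypothetical, and the cycle counts are the fixed-point Standard Widrow totals from Table 8. It also checks that the three memory conditions listed above give the same allowances of 250 ns per fetch or read and 200 ns per write.

```python
def execution_time_ns(nc, nf, nr, nw, tc=200.0, tf=100.0, tr=100.0,
                      tw=100.0, s=0, nb=0, tb=0.0):
    """Evaluate the Rockwell equation, all times in nanoseconds:
    Te = Nc*Tc + Nf*(Tf+(S+3)*Tc/4) + Nr*(Tr+(S+3)*Tc/4)
         + Nw*(Tw+(S+2)*Tc/4) + Nb*Tb
    """
    clock = tc / 4.0  # one external clock period (50 ns at 20 MHz)
    return (nc * tc
            + nf * (tf + (s + 3) * clock)
            + nr * (tr + (s + 3) * clock)
            + nw * (tw + (s + 2) * clock)
            + nb * tb)


# Fixed-point Standard Widrow totals taken from Table 8.
WIDROW_FIXED = dict(nc=4953, nf=482.5, nr=463, nw=261)

# The three equivalent memory conditions: (access time in ns, S).
CONDITIONS = [(100.0, 0), (50.0, 1), (0.0, 2)]

times = [execution_time_ns(tf=t, tr=t, tw=t, s=s, **WIDROW_FIXED)
         for t, s in CONDITIONS]

if __name__ == "__main__":
    for (t, s), te in zip(CONDITIONS, times):
        print(f"T = {t:5.1f} ns, S = {s}: Te = {te / 1000.0:.3f} us, "
              f"{1e9 / te:.1f} samples/s")
```

Because this equation ignores synchronization, the figure it produces should be read only as the uncorrected estimate.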
To check the validity of the estimates, the programs were
written in Ada and executed on a Rockwell development system at
Collins-Rockwell in Cedar Rapids, Iowa. The output of the Ada-
subset front-end is in the form of macro-instructions for a
general stack machine, which is then translated into machine
instructions for either the VAX, 8086 or in this case, the AAMP.
The object code output was down-loaded into the AAMP test
equipment and executed. A Hewlett-Packard logic analyzer was
used to monitor execution rates of the programs.
Because there were no analog-to-digital or digital-to-analog
converters available on the AAMP development system, memory
locations (variables in the high-level notation) were read from
and written to respectively. This should approximate memory-
mapped real devices except for the lack of control signals and
possible overflows and underflows due to non-varying input
(unchanged memory contents) into the adaptive digital filter.
The algorithms have been checked by several people and appear to
be correct, but have not been run with actual data.
In the Ada-subset coding, a pragma was used to instruct the
compiler to generate code which does not check for array bounds
errors during execution. This checking would be very costly when
the number of array references is taken into account. Including
this pragma increases the execution rate significantly but puts
the burden on the programmer of guaranteeing correctness of
references. In these small programs, this represented no
problem. In larger programs, the pragma could be omitted until
the program is debugged and then inserted to increase execution
speed.
Table 9 compares estimated and measured sampling rates for
the Widrow and Lattice algorithms. The measured values were
obtained during the April 16-17 visit to the Rockwell facility in
Cedar Rapids, Iowa. These algorithms were coded in Ada and were
executed on a test system.
The timing differences between the estimated and measured
values are due primarily to coding differences. To check the
timing estimates, the following major coding differences were
taken into account. First, the Ada-subset compiler does
virtually no optimization. In the Widrow algorithm, a multiplication that
was moved outside of a loop in the estimated version was left
inside in the measured version. The Ada-subset compiler did not
use the DO/ENDO instructions, and therefore produced fewer stack-
updates and faster execution. These differences are shown in
Tables 10 and 11. Some differences in execution time remain
unaccounted for, but they would probably disappear if all coding
differences were reconciled. Also, floating-point instruction
times vary according to the data used.
Table 9. Estimated vs. measured execution rates (samples/second)

Algorithm           Timing       Estimated   Actual     Actual
                    Parameters   (20 MHz)    (20 MHz)   (30 MHz)
Widrow, integer     s=0            780         826        1156
                    s=1            773         766        1140
                    s=2            698         741        1042
                    s=3            693         694        1031
Widrow, floating    s=0            357
                    s=1            354
                    s=2            335
                    Actual (20 MHz): 422, 392   Actual (30 MHz): 606, 565
Lattice, integer    s=0            488
                    s=1            483
                    s=2            446
                    Actual (20 MHz): 481, 427   Actual (30 MHz): 625, 617
Lattice, floating   s=0            168
                    s=1            166
                    s=2            161
                    Actual (20 MHz): 176, 168   Actual (30 MHz): 258, 245
All measured times use XAQ=XRQ, BG=BR and Tb<0 ns,
During this benchmarking, the use of an odd number of set-up
cycles (S) caused erratic measurements. This problem was due to
the bus delays in the test system and a synchronization action of
the AAMP which is discussed below.
Although the AAMP interfaces with external devices in an
asynchronous manner, the signals are synchronized internally.
Figures 11a and 11b contain timing diagrams showing this
synchronization for read and write cycles respectively. A bus
cycle begins in the middle of the 5 MHz microcycle clock and
stops the microcycle clock until the bus cycle is complete. The
microcycle clock is then restarted and continues the second half
of the microcycle.
When the microcycle clock is stopped, the AAMP attempts to
assert Bus Request. Bus Request can only become active when Bus
Grant, Transfer Acknowledge and Transfer Error are inactive. In
this manner, the AAMP will not disrupt other processors which
might be using the bus or use a bus which has failed. External
bus arbitration logic responds to Bus Request with a Bus Grant
when appropriate. If there is only one processor on the bus, Bus
Request and Bus Grant can be tied together, by-passing the bus
arbitration logic. When received, however, Bus Grant is
synchronized internally so that for a 20 MHz clock, tying Bus
Grant to Bus Request (Tb = 0) gives no time improvements over a
bus acquisition time of slightly less than one clock cycle (Tb <
50 ns).
Figure 11a. Read Cycle Synchronization
Figure 11b. Write Cycle Synchronization
(Timing diagrams: 20 MHz and 10 MHz clocks, the microcycle CLOCK with the
interval during which it is stopped, and the BR, BG, XRQ and XAK signals.)
When Bus Grant has been received, the address, data and
status line drivers are enabled immediately. These signals are
then given time to propagate through the bus interface before a
Transfer Request is asserted. This time comes from the Bus Grant
synchronization delay plus one clock cycle for reads, or plus two
clock cycles for writes. The S1 and S0 pins provide a means of
externally selecting an additional set-up time of from zero to
three cycles. Thus, if Bus Request is tied to Bus Grant and S0 =
S1 = 0, there will be nearly two clock cycles (100 ns) from Bus
Request to Transfer Request for a read and three clock cycles
(150 ns) for a write.
The device being accessed is responsible for generating a
Transfer Acknowledge in response to a Transfer Request, allowing
itself enough time to operate correctly. Transfer Acknowledge
is, however, synchronized internally with the 10 MHz clock,
probably to ensure the microcycle clock restarts correctly.
Because of this synchronization, there must be an integer number
of 10 MHz clock cycles and thus an even number of 20 MHz clock
cycles between Bus Request and the internally synchronized
Transfer Acknowledge.
The microcycle clock is restarted when the synchronized
Transfer Acknowledge is received in the case of a write, or after
an additional 10 MHz clock cycle in the case of read. Transfer
Request and the other address, data and status lines remain
active until the end of the microcycle (100 ns after the clock is
restarted). Hold and Bus Request remain active until the middle
of the next microcycle, the point where another bus transaction
could begin. Because of this, the processor can make consecutive
bus transactions without relinquishing the bus, thus eliminating
the bus arbitration logic delay. The minimum, however, is the
same as the case where Bus Request is tied to Bus Grant due to
synchronization.
The key to the previously mentioned problem with an odd
number of set-up cycles is that because of the 10 MHz
synchronization, the Transfer Request assertion is delayed
without lengthening the total bus transaction time. The amount
of time Transfer Request is active is thus shortened. Usually,
Transfer Request is used to select the memory device and erratic
operation may result if it is not selected for a sufficient
amount of time.
The above analysis is summarized in Figure 12a in the form
of a timing worksheet. Note that Transfer Request is active from
the end of the set-up time until the end of the microcycle.
Applying the worksheet to the conditions that existed for the
benchmarking shows that at 20 MHz and S1 = S0 = 0, Transfer Request
is active six clock cycles (300 ns) for a read and three clock
cycles (150 ns) for a write. With S1 = 0 and S0 = 1 (one set-up
cycle), Transfer Request is shortened to five clock cycles (250
ns) for a read and lengthened to four cycles (200 ns) for a
write. The shortened read select in combination with bus delays
most probably caused the erratic operation.
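The worksheet logic described above can be sketched in a few lines of Python. This is an illustration only: the function name `bus_cycles` is hypothetical, and the one-cycle values for Bus Request to Bus Grant and for Transfer Request to Transfer Acknowledge are assumptions, chosen because they reproduce Cf = Cr = 6 from Figure 12b and the Transfer Request active times quoted in the text.

```python
def bus_cycles(is_read, setup, bg=1, xak=1):
    """Return (worksheet total in 20 MHz clock cycles,
    clock cycles during which Transfer Request is active).

    bg  : Bus Request to Bus Grant, in cycles (assumed 1)
    xak : Transfer Request to Transfer Acknowledge, in cycles (assumed 1)
    """
    overhead = 1 if is_read else 2           # set-up overhead: 1 read, 2 write
    before_xrq = bg + overhead + setup       # cycles before Transfer Request
    subtotal = before_xrq + xak
    if subtotal % 2:                         # 10 MHz synchronization:
        subtotal += 1                        # round up to an even count
    total = subtotal + (2 if is_read else 0)  # one extra 10 MHz cycle for reads
    # Transfer Request stays active until the end of the microcycle,
    # two clock cycles (100 ns) after the microcycle clock restarts.
    xrq_active = total + 2 - before_xrq
    return total, xrq_active


if __name__ == "__main__":
    for setup in (0, 1):
        for is_read in (True, False):
            total, active = bus_cycles(is_read, setup)
            kind = "read " if is_read else "write"
            print(f"S={setup} {kind}: total {total} cycles, "
                  f"Transfer Request active {active} cycles ({active * 50} ns)")
```

With one set-up cycle the write subtotal becomes odd and is rounded up, which lengthens the write select, while the read total is unchanged and the later Transfer Request assertion shortens the read select.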
Once the timing worksheet has been filled out, it is then
possible to calculate the execution times of instructions.
Figure 12b shows the numbers used for the performance estimates.
Because of the predictable nature of translation from high-level
representations to AAMP instructions, it is possible to quickly
estimate the execution rate of an algorithm from a high level
language representation. This process is summarized in another
worksheet provided in Figure 13. By counting the occurrences of
various types of references, arithmetic operations and loops, the
majority of operations have been accounted for and a reasonable
estimate of the execution rate of an algorithm can be obtained.
This quick estimate is for single and double precision fixed-
point and single precision floating-point only. Less frequently
used operations such as type conversion are not included. By far
the most important operation omitted is the stack-updating. The
resulting estimate assumes that the coding was such that no
stack-updating occurred. As discussed earlier, this could be
brought about by making the compiler optimize more or by hand-
coding the program. Otherwise, a hand-optimized version will
likely yield only a small improvement unless high-level
optimizations, such as moving loop-invariant computations out of
loops, were neglected.
The original performance estimates used the equations
supplied by Rockwell in [7] which failed to take into account the
synchronization action. These estimates have since been
recalculated to reflect the correct timing using the equations at
the bottom of Figure 13.
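The equations at the bottom of Figure 13 (total clock cycles, then iterations per second) can be written out as a short Python sketch. The function names are hypothetical. Cf = Cr = 6 is taken from Figure 12b; Cw = 4 is an assumption of this sketch, and the counts are the fixed-point Standard Widrow totals from Table 8, used only as sample input.

```python
def total_clock_cycles(nc, nf, nr, nw, cf, cr, cw):
    """TOTAL CYCLES = Nc*4 + Nf*Cf + Nr*Cr + Nw*Cw, in external clock cycles.

    Each microcycle is four external clock cycles; Cf, Cr and Cw are the
    per-transaction cycle counts from the AAMP Timing Worksheet.
    """
    return nc * 4 + nf * cf + nr * cr + nw * cw


def iterations_per_second(total_cycles, clock_hz=20e6):
    """ITERATIONS/SECOND = (CYCLES/SECOND) / TOTAL CYCLES."""
    return clock_hz / total_cycles


# Sample input: fixed-point Standard Widrow totals from Table 8.
# cw=4 is an assumed value, not a figure from the source.
cycles = total_clock_cycles(nc=4953, nf=482.5, nr=463, nw=261,
                            cf=6, cr=6, cw=4)
rate = iterations_per_second(cycles)

if __name__ == "__main__":
    print(f"total cycles = {cycles:.0f}, rate = {rate:.1f} iterations/s")
```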
Another undocumented feature of the AAMP is a limited
prefetch feature. A single portion of the AAMP's
microinstruction word controls either its shift registers or its
bus cycle logic. During long instructions which are not
performing any shifts, the instruction word containing the next
opcode can be fetched if it is not already in the upper byte of
the word in the instruction latch. This prefetching does not
appear to increase the execution rate because the opcode fetch
microcycle must be performed by the instruction anyway. The only
difference is that the time for the bus transaction is taken
during a different microcycle.
Figure 12a. AAMP Timing Worksheet
read write
Bus Request to Bus Grant
Set-up overhead
Set-up cycles (0-3)
Transfer Request to
Transfer Acknowledge
SUBTOTAL
Add 1 if SUBTOTAL is odd
10 MHz cycle for read
TOTAL CYCLES Cf = Cr = Cw =.
Figure 12b. Timing parameters used for estimates
Bus Request to Bus Grant 1 * 1 *
Set-up overhead 1 2
Set-up cycles (0-3)
Transfer Request to
Transfer Acknowledge
SUBTOTAL
Add 1 if SUBTOTAL is odd
10 MHz cycle for read
TOTAL CYCLES Cf = Cr = 6 Cw =.
Note: * indicates that the number should be rounded up to the
next highest integer; the minimum is one cycle.
Figure 13. Execution Rate Estimate Worksheet
Operation totals
Nc Nf Nr Nw Instances Nc Nf Nr Nw
Reference Variable (non-indexed)
s.p. fixed 2
d.p. fixed 3
s.p. floating 3
Reference Fixed-index variable
s.p. fixed 3
d.p. fixed 4
s.p. floating 4
Reference Indexed variable
s.p. fixed 7
d.p. fixed 9
s.p. floating 9
Assign Variable (not indexed)
s.p. fixed 2 0.5 1
d.p. fixed 3 0.5 2
s.p. floating 3 0.5 2
Assign Fixed-index variable
s.p. fixed 3
d.p. fixed 4
s.p. floating 4
Assign Indexed variable
s.p. fixed 7
d.p. fixed 9
s.p. floating 9
Constants
s.p. fixed 1
d.p. fixed 4
s.p. floating 4
Addition
s.p. fixed 2
d.p. fixed 3
s.p. floating 38
0.5 1
2
2
0.5
0.5
1 1
2
2
1
1
3 2
3
3
3
3
0.5
1 2
1 2
0.5
0.5
0.5
1 1
1 2
1 2
3 11
3 12
3 1 2
Figure 13 (continued). Execution Rate Worksheet
Subtraction
s.p. fixed 2
d.p. fixed 3
s.p. floating 38
0.5
0.5
0.5
0.5
0.5
0.5
Multiplication
s.p. fixed 23
d.p. fixed 74
s.p. floating 94
Division
s.p. fixed 27
d.p. fixed 78
s.p. floating 98
Procedure/ function
Call and Return
30 4 3 4
plus for N return parameters
6N N N
Loop structure
initial 15.5 5.5 1 1
plus for N iterations
24.5 9 2 2
If...then...else
If branching 2 1.5
else branching 2 1.5
Goto
TOTAL
1.5
0.5
0.5
0.5
TOTAL CYCLES = Nc * 4 + Nf * Cf + Nr * Cr + Nw * Cw
Note: Cf, Cr and Cw come from the AAMP Timing Worksheet
ITERATIONS/SECOND = (CYCLES/SECOND) / TOTAL CYCLES
Table 10. Widrow coding differences
Fixed-point Floating-point
Nc Nf Nr Nw Nc Nf Nr Nw
Estimated 4953 482.5 463 261 11765 506.5 708 441
Invariant -30 -3
Multiplication +384 +16
Loop structure +724 +376
Stack updates -2115
Index ref.s +280 +70
Loop var refs +140 +7
Constants -16
Cnst Indxd vars -6 -3
k+1 calculation -90 -30
New estimate 4230 962,
-1 -1
-106 -4 -2 -2
+1584 +2 4 +3 2
+50 +724 +376 +50
-141 -141 -2820 +188 +188
+140 +7
+140 +7
+3 2 -32 -48 +6 4
-6
-4.5 +3
-90 -3
413 119 11299 95 9 667 251
Table 11. Lattice coding differences
Fixed-point Floating-point
Nc Nf Nr Nw Nc Nf Nr Nw
Estimated 7311 735 594 228 26270 785 1107 566
Loop structure +247 +128 +17 +247 128 17
Stack updates -720 -48 -48 -960 -64 -64
Constants -24 +48 -48 -72 +64
DUP not used +160 +48 +64 +160 +48 +48
New estimate 7494 1015 691 196 25957 1017 1188 518
Results and Conclusions
The AAMP offers significant improvements over previously
evaluated processors due to its powerful instruction set and low
power consumption. The execution statistics for previously
evaluated processors are shown in Tables 12 and 13 for the Widrow
and Lattice algorithms respectively. Unfortunately, information
on digital signal processing execution by 16-bit microprocessors
was not available. Other microprocessors which might compete
with the AAMP, such as Motorola's 68000 or National's 16032,
are not currently available in low-power versions. The AAMP's
closest rival would probably be Intel's 80C86, which has to use
an external multiplier board or an external 8087 math chip, thus
increasing power consumption. Execution data comparing the AAMP and
these other microprocessors for other types of programs have been
published [4]. The AAMP appears to live up to published
performance claims and (unlike other recently released advanced
microprocessors) no bugs were observed. The undocumented features
found were limited instruction prefetching, nonoptimal stack
updating and synchronized timing. None of these features appears
to affect performance significantly.
Table 12. Widrow Implementation Comparisons
(table contents not recoverable)
Table 13. Lattice Implementation Comparisons
(all times in microseconds)

                                     fixed-point            floating-point
                     NSC800       AAMP      AAMP          AAMP      AAMP
Action               (8-Stage)              (optimized)             (optimized)
Multiply type        HW           HW        HW            HW        HW
Multiply time          74.5         4.75      4.75         19.15     19.15
Input from A/D         32.25       11.65      8.20         17.75     14.30
  and init. loop
Loop:
  Compute e(l+1)      122.0        22.60     13.50         57.40     41.25
  Compute w(l+1)      119.25       11.80      8.70         43.00     34.05
  Compute v(l)        364.50       38.15     28.30        128.35    110.60
  Update weights      403.25       37.70     30.10        131.70    109.70
  wl(l)=w(l)           27.50        3.25      2.90          4.25      4.25
  Loop overhead                     6.10      7.85          8.10      9.85
Output                 10.50        4.05      4.05          8.85      8.85
Totals:
  8-Stage            8334.75      972.50    743.05       3009.00   2500.75
  16-Stage              -        1929.30   1474.30       5991.40   4972.90
Sampling rate (in Hz):
  8-Stage             121        1028      1346           332       400
  16-Stage              -         518       678           167       201
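The totals and sampling rates in Table 13 follow from the per-stage times by simple arithmetic: the input time, plus the number of stages times the sum of the loop entries, plus the output time. The Python sketch below checks this for the unoptimized fixed-point AAMP column; the function name `sample_time_us` is hypothetical and the numbers are those of the table.

```python
# Loop entries for the fixed-point AAMP column of Table 13, in microseconds:
# compute e(l+1), compute w(l+1), compute v(l), update weights,
# wl(l)=w(l), loop overhead.
LOOP_TIMES_US = [22.60, 11.80, 38.15, 37.70, 3.25, 6.10]


def sample_time_us(stages, input_us=11.65, loop_us=LOOP_TIMES_US,
                   output_us=4.05):
    """Time to process one sample through a lattice with `stages` stages."""
    return input_us + stages * sum(loop_us) + output_us


def sampling_rate_hz(total_us):
    """Samples per second for a per-sample time in microseconds."""
    return 1e6 / total_us


if __name__ == "__main__":
    for stages in (8, 16):
        t = sample_time_us(stages)
        print(f"{stages}-stage: {t:.2f} us per sample, "
              f"{sampling_rate_hz(t):.0f} Hz")
```

The same calculation applied to the other columns reproduces their totals, which also confirms which column each loop entry belongs to.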
After a preliminary learning period, the AAMP instruction set
was quite easy to use. Coding was easy because of the relative
symmetry of the instruction set, which is demonstrated by the
side-by-side listings of the fixed-point and floating-point
versions.
The lack of registers eliminates register usage optimization
but introduces other optimization problems. First, the number of
arguments on the stack must be limited to avoid stack-updates.
Secondly, the Local memory locations must be used wisely for most
efficient operation. Both of these problems, especially the
former, can probably be dealt with more easily than register
optimizations on other microprocessors. The ease of
optimization and high-level language support structures indicate
that compilers could produce code very nearly as efficient as
that of assembly language programmers. While not very important
for this application, the high code density characteristic of the
AAMP is an indicator of the instruction set's efficiency.
To further ensure efficiency, compilers could be modified to
optimize structures common to signal processing programs.
An important factor in using a microprocessor is the
availability and completeness of documentation. With the
exception of timing specifications, the documentation provided is
as good as, if not better than, that available for most commercially
marketed microprocessors. The lack of timing specifications was
due to Rockwell's use of the processor primarily on a CPU board
with a bus interface. The bus timing information was supplied,
however, and the Rockwell personnel were cooperative in answering
questions.
Most signal processing algorithms rely heavily on arrays of
values, which can be efficiently implemented with index
registers. The AAMP's lack of index registers is compensated for
by its speed, but best performance can be achieved by avoiding
structures such as block-transfers which require pointers.
Acknowledgements
This work was sponsored and funded by the Systems
Engineering Division, Organization 5238, Sandia National
Laboratories, Kirtland Air Force Base, Albuquerque, New Mexico.
The author would like to thank Dr. M.S.P. Lucas, Dr. M. Van
Swaay and Dr. D.H. Hummels for serving as committee members.
Special thanks go to John Rasure and David Hardin for their
careful proofreading and support. Appreciation is also extended
to D.W. Best and C.E. Kress of Rockwell for their cooperation.
References
1. D.J. Nickel, An Evaluation of Various Microprocessor
Implementations of an Adaptive Digital Predictor for Intrusion
Detection, A Master's Report, Kansas State University, 1979.
2. M.A. Cody, An Evaluation of the NSC800 8-bit Microprocessor
for Digital Signal Processing Applications, A Master's Report,
Kansas State University, 1981.
3. D.W. Gordon, An NSC800 Development System, A Master's Report,
Kansas State University, 1982.
4. D.W. Best, C.E. Kress, N.M. Mykris, J.D. Russell, and W.J.
Smith, "An Advanced-Architecture CMOS/SOS Microprocessor," IEEE
Micro, Vol. 2, No. 3, Aug. 1982, pp. 10-26.
5. AAMP, CAPS-7, and CAPS-10 Instruction Set Architecture,
Processor Technology Section, Advanced Technology and
Engineering, Collins Avionics Division, Rockwell International,
1982.
6. W.A. Barrett and J.D. Couch, Compiler Construction: Theory
and Practice, Science Research Associates, Chicago, 1979.
7. N.M. Mykris, AAMP Instruction Execution Statistics, Processor
Technology Section, Advanced Technology and Engineering, Collins
Avionics Division, Rockwell International, 1982.
Appendix A: Notes on Widrow and Lattice Listings
The following are lists of variables used in the hand-
compiled Widrow and Lattice algorithms. These lists may help to
explain the use of certain addressing modes in the algorithms.
It was assumed that the necessary declarations were made in the
high-level language to establish the named variables as local.
Not shown but common to all implementations are the digital-to-
analog and analog-to-digital converters which are assumed to be
memory-mapped and are referenced through use of the Universal
addressing mode. The universal addressing mode allows a complete
address to be specified, which might make upper address line
decoding easier.
Standard Lattice:
Local variables
present_w
present_e
next_w
next_e
*l (loop variable)
Non-local variables
k( )
wl( )
v( )
Optimized Lattice:
Local variables
present_w
present_e
next_w
next_e
*l (loop variable)
wl_l
k_l
**temp
Non-local variables
k( )
wl( )
v( )
Traditional Widrow:
Local variables
f
g
*k (loop variable)
e
c
q
Non-local variables
b( )
f( )
e( )
Modified Widrow:
Local variables
f
g
*k (loop variable)
e
c
q
*ptr
Non-local variables
b( )
f( )
e( )
Optimized Widrow:
Local variables
f
g
*k (loop variable)
e
c
q
*ptr
**temp
Non-local variables
b( )
f( )
e( )
Since only the execution times were important, no attempt
was made to write initialization or exception handling routines,
assign actual memory addresses to variables or include opcodes.
The single asterisk denotes variables that are always
integer. The rest of the variables and arrays vary according to
the number system in use. The double asterisk denotes variables
that are used only in the floating-point or double-precision
fixed-point implementations.
An important point is that there are only 16 local memory
words. These 16-bit words can store either 16 single-precision
numbers, 8 floating-point (or double-precision fixed-point)
numbers or some combination of the two. Fifteen of the sixteen
available locations were used in the optimized Lattice, but some
programs could require more, thus calling for careful coding or a
good compiler to make the most of these locations.
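The count of fifteen local words for the optimized Lattice can be reproduced from the variable list given earlier in this appendix. The Python sketch below is illustrative only; the names `OPTIMIZED_LATTICE_LOCALS` and `local_words` are hypothetical. It assumes one 16-bit word per single-precision or integer value and two words per floating-point (or double-precision) value, as stated above.

```python
# Local variables of the optimized Lattice and how each is stored:
#   "int"       always a one-word integer (single asterisk in the list)
#   "data"      follows the number system in use
#   "wide_only" exists only in the floating-point or double-precision
#               versions (double asterisk in the list)
OPTIMIZED_LATTICE_LOCALS = [
    ("present_w", "data"),
    ("present_e", "data"),
    ("next_w", "data"),
    ("next_e", "data"),
    ("l", "int"),
    ("wl_l", "data"),
    ("k_l", "data"),
    ("temp", "wide_only"),
]


def local_words(variables, wide):
    """Count 16-bit local memory words; `wide` selects two-word values."""
    words = 0
    for _name, kind in variables:
        if kind == "int":
            words += 1
        elif kind == "data":
            words += 2 if wide else 1
        elif wide:  # "wide_only" variables exist only in the wide versions
            words += 2
    return words


if __name__ == "__main__":
    print("floating-point:", local_words(OPTIMIZED_LATTICE_LOCALS, True))
    print("single-precision fixed:",
          local_words(OPTIMIZED_LATTICE_LOCALS, False))
```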
A point that might not be immediately obvious is that the
five instructions generated from a "do" statement initialize the
loop and are only executed once. On the other hand, the "endo"
instruction is executed each time the loop is executed.
Unlike the first three listings, the two optimized listings
use low-level manipulations. The majority of the changes come
from the following optimizations:
- Using a series of instructions to replace the "do" and
"endo" instructions to avoid stack penalties.
- Changing an incrementing loop to a decrementing loop to
allow easier testing for the final value (0).
- Rearranging arguments to avoid stack penalties.
- Using DUP to leave an argument on the stack for the next
operation.
- Storing frequently referenced indexed variables into local
memory for more efficient access.
- Using REFSC in place of REFSXI because both do the same
thing but REFSC has a shorter instruction (fewer instruction
fetches).
Standard Widrow listing
This listing approximates the output of a non-optimizing
compiler for the algorithm given below. The program was
translated quite directly and few assembly language modifications
were made.
loop: f = adc_in
      g = 0
      do k = 1,16
         g = g + b(k) * f(k)
      endo
      e = f - g
      c = v * e
      do k = 1,16
         b(k) = u * b(k) + c * f(k)
      endo
      q = q - e(16) + e
      dac_out = q * q
      do k = 1,15
         e(k+1) = e(k)
         f(k+1) = f(k)
      endo
      e(1) = e
      f(1) = f
      goto loop
Note: + = indicates a stack update
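For readers who want to run the algorithm, the pseudocode above translates into the following Python sketch. It is not the code that was timed. The names `new_state` and `widrow_step` and the default values of `u` and `v` are hypothetical. Arrays are 0-based here, and the history shift runs from the high index down, which is an assumption of this sketch so that each value is moved before it is overwritten.

```python
N = 16  # number of filter weights


def new_state():
    """Weights b, sample history f, error history e, running error sum q."""
    return {"b": [0.0] * N, "f": [0.0] * N, "e": [0.0] * N, "q": 0.0}


def widrow_step(sample, state, u=1.0, v=0.01):
    """Process one input sample and return the value written to the DAC."""
    b, f_hist, e_hist = state["b"], state["f"], state["e"]
    f = sample                                   # f = adc_in
    g = sum(b[k] * f_hist[k] for k in range(N))  # g = sum of b(k) * f(k)
    e = f - g                                    # prediction error
    c = v * e
    for k in range(N):                           # weight update
        b[k] = u * b[k] + c * f_hist[k]
    q = state["q"] - e_hist[N - 1] + e           # q = q - e(16) + e
    state["q"] = q
    # Shift both histories from the high index down (assumed ordering).
    for k in range(N - 1, 0, -1):
        e_hist[k] = e_hist[k - 1]
        f_hist[k] = f_hist[k - 1]
    e_hist[0] = e                                # e(1) = e
    f_hist[0] = f                                # f(1) = f
    return q * q                                 # dac_out = q * q


if __name__ == "__main__":
    state = new_state()
    for x in (1.0, 1.0, 0.5):
        print(widrow_step(x, state))
```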
comments            fixed-point                     floating-point
                    opcode      Nc    Nf  Nr  Nw    opcode      Nc    Nf  Nr  Nw

'loop: f = adc_in'
read ADC            LIT32      5.0   2.5   -   -    LIT32      5.0   2.5   -   -
                    REFSU      5.0   0.5   1   -    REFSU      5.0   0.5   1   -
convert to f.p.     -                               CVTSD      2.0   0.5   -   -
                    -                               CVTDF     21.0   0.5   -   -
store f             ASNSL      2.0   0.5   -   1    ASNDL      3.0   0.5   -   2
                               12.   3.5   1   1               36.   4.5   1   2

'g = 0'
get 0               LIT4       1.0   0.5   -   -    LITD0      2.0   0.5   -   -
store g             ASNSL      2.0   0.5   -   1    ASNDL      3.0   0.5   -   2
                                3.   1.    -   1                5.   1.    -   2

'do k = 1,16'
loop var. addr.     LIT16      3.0   1.5   -   -    LIT16      3.0   1.5   -   -
initial value       LIT4       1.0   0.5   -   -    LIT4       1.0   0.5   -   -
final value         LIT8       2.0   1.0   -   -    LIT8       2.0   1.0   -   -
increment           LIT4       1.0   0.5   -   -    LIT4       1.0   0.5   -   -
                    DO        10.0   2.0   -   1    DO        10.0   2.0   -   1
                               17.   5.5   -   1               17.   5.5   -   1

'g = g + b(k) * f(k)'
get g               +REFSL     2.0   0.5   1   -    ++REFDL    3.0   0.5   2   -
get k               +REFSL     2.0   0.5   1   -    +REFSL     2.0   0.5   1   -
get b(k)            REFSXI     4.0   1.5   1   -    +REFDXI    5.0   1.5   2   -
get k               +REFSL     2.0   0.5   1   -    +REFSL     2.0   0.5   1   -
get f(k)            REFSXI     4.0   1.5   1   -    +REFDXI    5.0   1.5   2   -
multiply            MPYI      23.0   0.5   -   -    MPYF      95.0   0.5   -   -
add                 ADD        2.0   0.5   -   -    ADDF      38.0   0.5   -   -
store in g          ASNSL      2.0   0.5   -   1    ASNDL      3.0   0.5   -   2
                               41.   6.    5   1              153.   6.    8   2

'endo'
end of loop         ENDO       9.0   1.0   1   1    ENDO       9.0   1.0   1   1
                                9.   1.    1   1                9.   1.    1   1

iteration total                50    7     6   2              162    7     9   3
loop total                    800  112    96  32             2592  112   144  48

'e = f - g'
get f               REFSL      2.0   0.5   1   -    REFDL      3.0   0.5   2   -
get g               REFSL      2.0   0.5   1   -    REFDL      3.0   0.5   2   -
subtract            SUB        2.0   0.5   -   -    SUBF      40.0   0.5   -   -
store e             ASNSL      2.0   0.5   -   1    ASNDL      3.0   0.5   -   2
                                8.   2.    2   1               49.   2.    4   2

'c = v * e'
get v               LIT16      3.0   1.5   -   -    LIT32      5.0   2.5   -   -
get e               REFSL      2.0   0.5   1   -    REFDL      3.0   0.5   2   -
multiply            MPYI      23.0   0.5   -   -    MPYF      95.0   0.5   -   -
store c             ASNSL      2.0   0.5   -   1    ASNDL      3.0   0.5   -   2
                               30.   3.    1   1              106.   4.    2   2

'do k = 1,16'
loop var. addr.     LIT16      3.0   1.5   -   -    LIT16      3.0   1.5   -   -
initial value       LIT4       1.0   0.5   -   -    LIT4       1.0   0.5   -   -
final value         LIT8       2.0   1.0   -   -    LIT8       2.0   1.0   -   -
loop increment      LIT4       1.0   0.5   -   -    LIT4       1.0   0.5   -   -
                    DO        10.0   2.0   -   1    DO        10.0   2.0   -   1
                               17.   5.5   -   1               17.   5.5   -   1

'b(k) = u * b(k) + c * f(k)'
get u               +LIT16     3.0   1.5   -   -    ++LIT32    5.0   2.5   -   -
get k               +REFSL     2.0   0.5   1   -    +REFSL     2.0   0.5   1   -
get b(k)            REFSXI     4.0   1.5   1   -    +REFDXI    5.0   1.5   2   -
multiply            MPYI      23.0   0.5   -   -    MPYF      95.0   0.5   -   -
get c               REFSL      2.0   0.5   1   -    REFDL      3.0   0.5   2   -
get k               +REFSL     2.0   0.5   1   -    +REFSL     2.0   0.5   1   -
get f(k)            REFSXI     4.0   1.5   1   -    +REFDXI    5.0   1.5   2   -
multiply            MPYI      23.0   0.5   -   -    MPYF      95.0   0.5   -   -
add                 ADD        2.0   0.5   -   -    ADDF      38.0   0.5   -   -
get k               REFSL      2.0   0.5   1   -    REFSL      2.0   0.5   1   -
store b(k)          ASNSXI     4.0   1.5   -   1    ASNDXI     5.0   1.5   -   2
                               71.   9.5   6   1              257.  10.5   9   2

'endo'
end the loop        ENDO       9.0   1.0   1   1    ENDO       9.0   1.0   1   1
                                9.   1.    1   1                9.   1.    1   1

iteration total                80   10.5   7   2              266   11.5  10   3
loop total                   1280  168   112  32             4256  184   160  48

'q = q - e(16) + e'
get q               REFSL      2.0   0.5   1   -    REFDL      3.0   0.5   2   -
get 16              LIT8       2.0   1.0   -   -    LIT8       2.0   1.0   -   -
get e(16)           REFSXI     4.0   1.5   1   -    REFDXI     5.0   1.5   2   -
subtract            SUB        2.0   0.5   -   -    SUBF      40.0   0.5   -   -
get e               REFSL      2.0   0.5   1   -    REFDL      3.0   0.5   2   -
add                 ADD        2.0   0.5   -   -    ADDF      38.0   0.5   -   -
duplicate q         DUP        1.0   0.5   -   -    DUPD       2.0   0.5   -   -
store in q          ASNSL      2.0   0.5   -   1    ASNDL      3.0   0.5   -   2
                               17.   5.5   3   1               96.   5.5   6   2

'dac_out = q * q'
duplicate q         DUP        1.0   0.5   -   -    DUPD       2.0   0.5   -   -
square q            MPYI      23.0   0.5   -   -    MPYF      95.0   0.5   -   -
convert from f.p.   -                               CVTFD     17.0   0.5   -   -
                    -                               CVTDS      3.0   0.5   -   -
write to DAC        LIT32      5.0   2.5   -   -    LIT32      5.0   2.5   -   -
                    ASNSU      5.0   0.5   -   1    ASNSU      5.0   0.5   -   1
                               34.   4.    -   1              127.   5.    -   1

'do k = 1,15'
loop var. addr.     LIT16      3.0   1.5   -   -    LIT16      3.0   1.5   -   -
initial value       LIT4       1.0   0.5   -   -    LIT4       1.0   0.5   -   -
final value         LIT8       2.0   1.0   -   -    LIT8       2.0   1.0   -   -
loop increment      LIT4       1.0   0.5   -   -    LIT4       1.0   0.5   -   -
                    DO        10.0   2.0   -   1    DO        10.0   2.0   -   1
                               17.   5.5   -   1               17.   5.5   -   1

'e(k+1) = e(k)'
get k               +REFSL     2.0   0.5   1   -    +REFSL     2.0   0.5   1   -
get e(k)            REFSXI     4.0   1.5   1   -    +REFDXI    5.0   1.5   2   -
get k               +REFSL     2.0   0.5   1   -    +REFSL     2.0   0.5   1   -
get 1               +LIT4      1.0   0.5   -   -    +LIT4      1.0   0.5   -   -
add                 ADD        2.0   0.5   -   -    ADD        2.0   0.5   -   -
store e(k+1)        ASNSXI     4.0   1.5   -   1    ASNDXI     5.0   1.5   -   2
                               15.   5.    3   1               17.   5.    4   2

'f(k+1) = f(k)'
get k               REFSL      2.0   0.5   1   -    REFSL      2.0   0.5   1   -
get f(k)            REFSXI     4.0   1.5   1   -    REFDXI     5.0   1.5   2   -
get k               REFSL      2.0   0.5   1   -    REFSL      2.0   0.5   1   -
get 1               LIT4       1.0   0.5   -   -    LIT4       1.0   0.5   -   -
add                 ADD        2.0   0.5   -   -    ADD        2.0   0.5   -   -
store f(k+1)        ASNSXI     4.0   1.5   -   1    ASNDXI     5.0   1.5   -   2
                               15.   5.    3   1               17.   5.    4   2

'endo'
end the loop        ENDO       9.0   1.0   1   1    ENDO       9.0   1.0   1   1
                                9.   1.    1   1                9.   1.    1   1

iteration total                39   11     7   3               43   11     9   5
loop total                    585  165   105  45              645  165   135  75

'e(1) = e'
get e               REFSL      2.0   0.5   1   -    REFDL      3.0   0.5   2   -
get 1               LIT4       1.0   0.5   -   -    LIT4       1.0   0.5   -   -
store in e(1)       ASNSXI     4.0   1.5   -   1    ASNDXI     5.0   1.5   -   2
                                7.   2.5   1   1                9.   2.5   2   2

'f(1) = f'
get f               REFSL      2.0   0.5   1   -    REFDL      3.0   0.5   2   -
get 1               LIT4       1.0   0.5   -   -    LIT4       1.0   0.5   -   -
store in f(1)       ASNSXI     4.0   1.5   -   1    ASNDXI     5.0   1.5   -   2
                                7.   2.5   1   1                9.   2.5   2   2

'goto loop'
repeat              LIT8N      2.0   1.0   -   -    LIT8N      2.0   1.0   -   -
                    SKIP       2.0   1.0   -   -    SKIP       2.0   1.0   -   -
                                4.   2.0   -   -                4.   2.0   -   -
total instr. cycles 2838 482.5 322 120 7985 506.5 456 189
stack updates 2115 - 141 141 3780 - 252 252
TOTAL 4953 482.5 463 261 11765 506.5 708 441
Standard Lattice listing
This listing approximates the output of a non-optimizing
compiler for the algorithm given below. The program was
translated quite directly and few assembly language modifications
were made.
begin  present_w = adc_input
       present_e = present_w
loop   do l = 0,15
          next_e = present_e - k(l) * wl(l)
          next_w = wl(l) - k(l) * present_e
          v(l) = beta * v(l) + beta1 * (present_e *
                 present_e + wl(l) * wl(l))
          k(l) = k(l) + alpha * (next_e * wl(l) + present_e
                 * next_w)/v(l)
          wl(l) = present_w
          present_w = next_w
          present_e = next_e
       endo
       dac_out = present_e
       goto begin
Note: + indicates a stack update.
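The Lattice pseudocode above can likewise be expressed as a runnable Python sketch. It is an illustration only and not the timed code. The names `new_lattice` and `lattice_step`, the default values of `alpha`, `beta` and `beta1`, and the nonzero initial value of `v` (chosen to avoid a division by zero on the first sample) are all assumptions.

```python
STAGES = 16  # stages l = 0 .. 15


def new_lattice(v_init=1.0):
    """Reflection coefficients k, delayed backward errors wl, power v."""
    return {"k": [0.0] * STAGES,
            "wl": [0.0] * STAGES,
            "v": [v_init] * STAGES}


def lattice_step(sample, st, alpha=0.01, beta=0.9, beta1=0.1):
    """Process one input sample and return the value written to the DAC."""
    k, wl, v = st["k"], st["wl"], st["v"]
    present_w = sample
    present_e = present_w
    for stage in range(STAGES):
        next_e = present_e - k[stage] * wl[stage]
        next_w = wl[stage] - k[stage] * present_e
        v[stage] = beta * v[stage] + beta1 * (present_e * present_e
                                              + wl[stage] * wl[stage])
        k[stage] = k[stage] + alpha * (next_e * wl[stage]
                                       + present_e * next_w) / v[stage]
        wl[stage] = present_w
        present_w = next_w
        present_e = next_e
    return present_e                 # dac_out = present_e


if __name__ == "__main__":
    st = new_lattice()
    for x in (2.0, 1.0, 0.5):
        print(lattice_step(x, st))
```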
comments
'begin present_w
present_e
read ADC
convert to f.p.
duplicate data
store present_w
store present_e
fixed-point
opcode Nc Nf Nr Nw
adc_input
present_w"
LIT32 5.0 2.5
REFSU 5.0 0.5 1 -
DUP 1.0 0.5 - -
ASNSL 2.0 0.5 - 1
ASNSL 2.0 0.5 - 1
15.0 4.5 1 2
floating-point
opcode Nc Nf Nr Nw
LIT32 5.0 2.5 - -
REFSU 5.0 0.5 1 -
CVTSD 2.0 0.5 - -
CVTDF 21.0 0.5 - -
DUPD 2.0 0.5 - -
ASNDL 3.0 0.5 - 2
ASNDL 3.0 0.5 - 2
41.0 5.5 1 4
loop do 1 = 0,15'
loop var. addr. LIT16 3.0 1.5 - - LIT16 3.0 1.5 - -
initial value LIT4 1.0 0.5 - - LIT4 1.0 0.5 - -
final value LIT4 1.0 0.5 - - LIT4 1.0 0.5 - -
increment LIT4 1.0 0.5 - - LIT4 1.0 0.5 - -
DO 10.0 2.0 - 1 DO 10.0 2.0 - 1
16.0 5.0 - 1 16.0 5.0 - 1
'next_e = present_e - k(l) * wl(l) 1
get present_e +REFSL 2.0 0.5 1 - ++REFDL 3.0 0.5 2 -
get 1 +REFSL 2.0 0.5 1 - +REFSL 2.0 0.5 1 -
get k(l) REFSXI 4.0 1.5 1 - +REFDXI 5.0 1.5 2 -
get 1 +REFSL 2.0 0.5 1 - +REFSL 2.0 0.5 1 -
get wl(l) REFSXI 4.0 1.5 1 - +REFDXI 5.0 1.5 2 -
multiply MPYI 23.0 0.5 - - MPYF 95.0 0.5 - -
subtract SUB 2.0 0.5 - - SUBF 40.0 0.5 - -
store next_e ASNSL 2.0 0.5 — 1 ASNDL 3.0 0.5 - 2
41.0 6.0 5 1 155.0 6.0 8 2
"next_w = wl(l) - k(l) * present_e'
get 1 REFSL 2.0 0.5 1
get wl(l) REFSXI 4.0 1.5 1
get 1 REFSL 2.0 0.5 1
get k(l) REFSXI 4.0 1.5 1
get present_e REFSL 2.0 0.5 1
multiply MPYI 23.0 0.5 -
subtract SUB 2.0 0.5 -
store next w ASNSL 2.0 0.5 -
REFSL 2.0 0.5 1 -
REFDXI 5.0 1.5 2 -
REFSL 2.0 0.5 1 -
REFDXI 5.0 1.5 2 -
++REFDL 3.0 0.5 2 -
MPYF 95.0 0.5 - -
SUBF 40.0 0.5 - -
ASNDL 3.0 0.5 - 2
80
41.0 6.0 5 1 155.0 6.0 8 2
v(l) = beta * v(l) + betal *
(present_e * present_e + wl(l) * wl(D)
get beta LIT16
get 1 REFSL
get v(l) REFSXI
beta*v(l) MPYI
get betal LIT16
get present_e REFSL
square present_e +DUP
MPYI
get 1 REFSL
get wl(l) REFSXI
square wl (1) +DUP
MPYI
sum ADD
multiply MPYI
sum ADD
get 1 REFSL
store v(l) ASNSXI
3.0 1.5
2.0 0.5 1 -
4.0 1.5 1 -
23.0 0.5
3.0 1.5
2.0 0.5 1 -
1.0 0.5
23.0 0.5
2.0 0.5 1 -
4.0 1.5 1 -
1.0 0.5
23.0 0.5
2.0 0.5 - -
23.0 0.5
2.0 0.5
2.0 0.5 1 -
4.0 1.5 - 1
124. 13.5 6 1
LIT32 5.0 2.5 - -
REFSL 2.0 0.5 1 -
REFDXI 5.0 1.5 2 -
MPYF 95.0 0.5 - -
LIT32 5.0 2.5 - -
++REFDL 3.0 0.5 2 -
++DUPD 2.0 0.5 - -
MPYF 95.0 0.5 - -
REFSL 2.0 0.5 1 -
REFDXI 5.0 1.5 2 -
++DUPD 2.0 0.5 - -
MPYF 95.0 0.5 - -
ADDF 38.0 0.5 - -
MPYF 95.0 0.5 - -
ADDF 38.0 0.5 - -
REFSL 2.0 0.5 1 -
ASNDXI 5.0 1.5 — 2
494. 15.5 9 2
"k(l) = k(l) + alpha * (next_e * wl(l) +
        present_e * next_w) / v(l)"
get l                 REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
get k(l)              REFSXI   4.0    1.5   1   -    REFDXI   5.0    1.5   2   -
get alpha             LIT16    3.0    1.5   -   -    LIT32    5.0    2.5   -   -
get next_e            REFSL    2.0    0.5   1   -    ++REFDL  3.0    0.5   2   -
get l                 REFSL    2.0    0.5   1   -    +REFSL   2.0    0.5   1   -
get wl(l)             REFSXI   4.0    1.5   1   -    +REFDXI  5.0    1.5   2   -
multiply              MPYI     23.0   0.5   -   -    MPYF     95.0   0.5   -   -
get present_e         REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
get next_w            +REFSL   2.0    0.5   1   -    ++REFDL  3.0    0.5   2   -
multiply              MPYI     23.0   0.5   -   -    MPYF     95.0   0.5   -   -
add                   ADD      2.0    0.5   -   -    ADDF     38.0   0.5   -   -
multiply              MPYI     23.0   0.5   -   -    MPYF     95.0   0.5   -   -
get l                 REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
get v(l)              REFSXI   4.0    1.5   1   -    REFDXI   5.0    1.5   2   -
divide                DIVI     27.0   0.5   -   -    DIVF     98.0   0.5   -   -
add                   ADD      2.0    0.5   -   -    ADDF     38.0   0.5   -   -
get l                 REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
store k(l)            ASNSXI   4.0    1.5   -   1    ASNDXI   5.0    1.5   -   2
                               133.   14.   10  1             501.   15.   16  2
"wl(l) = present_w"
get present_w         REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
get l                 REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
store wl(l)           ASNSXI   4.0    1.5   -   1    ASNDXI   5.0    1.5   -   2
                               8.     2.5   2   1             10.    2.5   3   2
"present_w = next_w"
get next_w            REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
store present_w       ASNSL    2.0    0.5   -   1    ASNDL    3.0    0.5   -   2
                               4.     1.    1   1             6.     1.    2   2
"present_e = next_e"
get next_e            REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
store present_e       ASNSL    2.0    0.5   -   1    ASNDL    3.0    0.5   -   2
                               4.     1.    1   1             6.     1.    2   2
"endo"
end loop              ENDO     9.0    1.0   1   1    ENDO     9.0    1.0   1   1
                               9.     1.    1   1             9.     1.    1   1
iteration total                364    45    31  8             1336   48    49  15
loop total                     5824   720   496 128           21376  768   784 240
"dac_output = present_e"
get present_e         REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
convert from f.p.     -                              CVTFD    17.0   0.5   -   -
                      -                              CVTDS    3.0    0.5   -   -
store to DAC          LIT32    5.0    2.5   -   -    LIT32    5.0    2.5   -   -
                      ASNSU    5.0    0.5   -   1    ASNSU    5.0    0.5   -   1
                               12.    3.5   1   1             33.    4.5   2   1
"goto begin"
'begin'               LIT8N    2.0    1.0   -   -    LIT8N    2.0    1.0   -   -
                      SKIP     2.0    1.0   -   -    SKIP     2.0    1.0   -   -
                               4.0    2.0   -   -             4.0    2.0   -   -
total instr. cycles 5871 735 498 132 21470 785 787 246
stack updates 1440 - 96 96 4800 - 320 320
TOTAL 7311 735 594 228 26268 785 1107 566
Modified Widrow listing
This listing approximates the output of a non-optimizing
compiler. Although the algorithm is modified, the translation
was quite direct and few assembly language modifications were
made.
loop: f = adc_in
      g = 0
      do k = 1,16
         g = g + b(k) * f(k)
      endo
      e = f - g
      c = v * e
      do k = 16,1
         b(k+1) = u * b(k) + c * f(k)
      endo
      b(1) = b(17)
      q = q - e(ptr) + e
      dac_out = q * q
      e(ptr) = e
      f(ptr) = f
      if ptr = 16
         then ptr = 1
         else ptr = ptr + 1
      goto loop
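As a cross-check on the control flow, the pseudocode above can be transcribed almost line for line into Python. The sketch below is an illustration only and is not part of the evaluated code: the names WidrowState and widrow_step are hypothetical, the default values of u and v are placeholders, and list index 0 is left unused so that the indices match the 1-based pseudocode.

```python
from dataclasses import dataclass, field

N = 16  # number of weights and stored samples in the listings


@dataclass
class WidrowState:
    # Index 0 is unused so the lists can be indexed 1..N like the pseudocode.
    b: list = field(default_factory=lambda: [0.0] * (N + 2))  # b(1..17)
    f: list = field(default_factory=lambda: [0.0] * (N + 1))  # f(1..16)
    e: list = field(default_factory=lambda: [0.0] * (N + 1))  # e(1..16)
    q: float = 0.0   # accumulator q from the pseudocode
    ptr: int = 1     # circular-buffer pointer, 1..16
    u: float = 1.0   # constant u (placeholder value)
    v: float = 0.0   # constant v (placeholder value)


def widrow_step(s, adc_in):
    """Run one pass of the 'loop:' body and return dac_out."""
    f_new = adc_in
    g = 0.0
    for k in range(1, N + 1):            # do k = 1,16
        g += s.b[k] * s.f[k]
    e_new = f_new - g
    c = s.v * e_new                      # constant inside the next loop
    for k in range(N, 0, -1):            # do k = 16,1
        s.b[k + 1] = s.u * s.b[k] + c * s.f[k]
    s.b[1] = s.b[N + 1]                  # b(1) = b(17)
    s.q = s.q - s.e[s.ptr] + e_new
    dac_out = s.q * s.q
    s.e[s.ptr] = e_new
    s.f[s.ptr] = f_new
    s.ptr = 1 if s.ptr == N else s.ptr + 1
    return dac_out
```

With all weights zero and v = 0 the weights never change, so g stays 0 and dac_out is simply the square of the accumulated error q.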
Note: + = indicates a stack update
comments              fixed-point                    floating-point
                      opcode   Nc     Nf    Nr  Nw   opcode   Nc     Nf    Nr  Nw
"loop: f = adc_in"
read ADC              LIT32    5.0    2.5   -   -    LIT32    5.0    2.5   -   -
                      REFSU    5.0    0.5   1   -    REFSU    5.0    0.5   1   -
convert to f.p.       -                              CVTSD    2.0    0.5   -   -
                      -                              CVTDF    21.0   0.5   -   -
store f               ASNSL    2.0    0.5   -   1    ASNDL    3.0    0.5   -   2
                               12.    3.5   1   1             36.    4.5   1   2
"g = 0"
get 0                 LIT4     1.0    0.5   -   -    LITD0    2.0    0.5   -   -
store g               ASNSL    2.0    0.5   -   1    ASNDL    3.0    0.5   -   2
                               3.     1.    -   1             5.     1.    -   2
"do k = 1, 16"
loop var. addr. LIT16 3.0 1.5 - - LIT16 3.0 1.5 - -
initial value LIT4 1.0 0.5 - - LIT4 1.0 0.5 - -
final value LIT8 2.0 1.0 - - LIT8 2.0 1.0 - -
increment LIT4 1.0 0.5 - - LIT4 1.0 0.5 - -
DO 10.0 2.0 - 1 DO 10.0 2.0 - 1
17. 5.5 - 1 17. 5.5 - 1
"g = g + b(k) * f(k)"
get g                 +REFSL   2.0    0.5   1   -    ++REFDL  3.0    0.5   2   -
get k                 +REFSL   2.0    0.5   1   -    +REFSL   2.0    0.5   1   -
get b(k)              REFSXI   4.0    1.5   1   -    +REFDXI  5.0    1.5   2   -
get k                 +REFSL   2.0    0.5   1   -    +REFSL   2.0    0.5   1   -
get f(k)              REFSXI   4.0    1.5   1   -    +REFDXI  5.0    1.5   2   -
multiply              MPYI     23.0   0.5   -   -    MPYF     95.0   0.5   -   -
add                   ADD      2.0    0.5   -   -    ADDF     38.0   0.5   -   -
store in g            ASNSL    2.0    0.5   -   1    ASNDL    3.0    0.5   -   2
                               41.    6.    5   1             153.   6.    8   2
"endo"
end of loop           ENDO     9.0    1.0   1   1    ENDO     9.0    1.0   1   1
                               9.     1.    1   1             9.     1.    1   1
iteration total                50     7     6   2             162    7     9   3
loop total                     800    112   96  32            2592   112   144 48
"e = f - g"
get f                 REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
get g                 REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
subtract              SUB      2.0    0.5   -   -    SUBF     40.0   0.5   -   -
store e               ASNSL    2.0    0.5   -   1    ASNDL    3.0    0.5   -   2
                               8.     2.    2   1             49.    2.    4   2
"c = v * e"
get v                 LIT16    3.0    1.5   -   -    LIT32    5.0    2.5   -   -
get e                 REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
multiply              MPYI     23.0   0.5   -   -    MPYF     95.0   0.5   -   -
store c               ASNSL    2.0    0.5   -   1    ASNDL    3.0    0.5   -   2
                               30.    3.    1   1             106.   4.    2   2
"do k = 16,1"
loop var. addr.       LIT16    3.0    1.5   -   -    LIT16    3.0    1.5   -   -
initial value         LIT8     2.0    1.0   -   -    LIT8     2.0    1.0   -   -
final value           LIT4     1.0    0.5   -   -    LIT4     1.0    0.5   -   -
loop increment        LIT4     1.0    0.5   -   -    LIT4     1.0    0.5   -   -
                      DO       10.0   2.0   -   1    DO       10.0   2.0   -   1
                               17.    5.5   -   1             17.    5.5   -   1
"b(k+1) = u * b(k) + c * f(k)"
get u                 +LIT16   3.0    1.5   -   -    ++LIT32  5.0    2.5   -   -
get k                 +REFSL   2.0    0.5   1   -    +REFSL   2.0    0.5   1   -
get b(k)              REFSXI   4.0    1.5   1   -    +REFDXI  5.0    1.5   2   -
multiply              MPYI     23.0   0.5   -   -    MPYF     95.0   0.5   -   -
get c                 REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
get k                 +REFSL   2.0    0.5   1   -    +REFSL   2.0    0.5   1   -
get f(k)              REFSXI   4.0    1.5   1   -    +REFDXI  5.0    1.5   2   -
multiply              MPYI     23.0   0.5   -   -    MPYF     95.0   0.5   -   -
add                   ADD      2.0    0.5   -   -    ADDF     38.0   0.5   -   -
get k                 REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
get 1                 LIT4     1.0    0.5   -   -    LIT4     1.0    0.5   -   -
add                   ADD      2.0    0.5   -   -    ADD      2.0    0.5   -   -
store b(k+1)          ASNSXI   4.0    1.5   -   1    ASNDXI   5.0    1.5   -   2
                               74.    10.5  6   1             260.   11.5  9   2
"endo"
end the loop          ENDO     9.0    1.0   1   1    ENDO     9.0    1.0   1   1
                               9.     1.    1   1             9.     1.    1   1
iteration total                83     11.5  7   2             269    12.5  10  3
loop total                     1328   184   112 32            4304   200   160 48
"b(1) = b(17)"
get 17                LIT8     2.0    1.0   -   -    LIT8     2.0    1.0   -   -
get b(17)             REFSXI   4.0    1.5   1   -    REFDXI   5.0    1.5   2   -
get 1                 LIT4     1.0    0.5   -   -    LIT4     1.0    0.5   -   -
store in b(1)         ASNSXI   4.0    1.5   -   1    ASNDXI   5.0    1.5   -   2
                               11.    4.5   1   1             13.    4.5   2   2
"q = q - e(ptr) + e"
get q                 REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
get ptr               REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
get e(ptr)            REFSXI   4.0    1.5   1   -    REFDXI   5.0    1.5   2   -
subtract              SUB      2.0    0.5   -   -    SUBF     40.0   0.5   -   -
get e                 REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
add                   ADD      2.0    0.5   -   -    ADDF     38.0   0.5   -   -
duplicate q           DUP      1.0    0.5   -   -    DUPD     2.0    0.5   -   -
store in q            ASNSL    2.0    0.5   -   1    ASNDL    3.0    0.5   -   2
                               17.    5.    4   1             96.    5.    7   2
"dac_out = q * q"
duplicate q           DUP      1.0    0.5   -   -    DUPD     2.0    0.5   -   -
square q              MPYI     23.0   0.5   -   -    MPYF     95.0   0.5   -   -
convert from f.p.     -                              CVTFD    17.0   0.5   -   -
                      -                              CVTDS    3.0    0.5   -   -
write to DAC          LIT32    5.0    2.5   -   -    LIT32    5.0    2.5   -   -
                      ASNSU    5.0    0.5   -   1    ASNSU    5.0    0.5   -   1
                               34.    4.    -   1             127.   5.    -   1
"e(ptr) = e"
get e                 REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
get ptr               REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
store in e(ptr)       ASNSXI   4.0    1.5   -   1    ASNDXI   5.0    1.5   -   2
                               8.     2.5   2   1             10.    2.5   3   2
"f(ptr) = f"
get f                 REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
get ptr               REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
store in f(ptr)       ASNSXI   4.0    1.5   -   1    ASNDXI   5.0    1.5   -   2
                               8.     2.5   2   1             10.    2.5   3   2
"if ptr = 16"
get ptr               REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
get 16                LIT8     2.0    1.0   -   -    LIT8     2.0    1.0   -   -
equal?                EQ       3.0    0.5   -   -    EQ       3.0    0.5   -   -
if not go to else     SKIPZI   3.5    1.5   -   -    SKIPZI   3.5    1.5   -   -
                               10.5   3.5   1   -             10.5   3.5   1   -
"then ptr = 1"
get 1                 LIT4     1.0    0.5   -   -    LIT4     1.0    0.5   -   -
store in ptr          ASNSL    2.0    0.5   -   1    ASNSL    2.0    0.5   -   1
jump past else        SKIPI    2.0    1.5   -   -    SKIPI    2.0    1.5   -   -
                               5.0    2.5   -   1             5.0    2.5   -   1
"else ptr = ptr + 1"
increment ptr         INCSLE   5.0    1.0   1   1    INCSLE   5.0    1.0   1   1
                               5.     1.    1   1             5.     1.    1   1
"goto loop"
repeat                LIT8N    2.0    1.0   -   -    LIT8N    2.0    1.0   -   -
                      SKIP     2.0    1.0   -   -    SKIP     2.0    1.0   -   -
                               4.     2.0   -   -             4.     2.0   -   -
total instr. cycles            2387.5 339.5 238 77            7404.5  363   328 117
stack updates                  1440   -     96  96            2880    -     192 192
TOTAL                          3827.5 339.5 334 173           10286.5 363   520 309
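The iteration and loop totals in these worksheets are column sums: the per-instruction entries of the loop body and the loop-closing instruction are added, and the result is multiplied by the 16 passes. A minimal sketch of that arithmetic for the fixed-point "g = g + b(k) * f(k)" loop of the Modified Widrow listing is given below; the row values are copied from the listing, while the names LOOP_BODY, LOOP_END and column_totals are hypothetical. Stack updates (the "+" marks) are tallied separately in the listing and are not included here.

```python
# Rows are (opcode, Nc, Nf, Nr, Nw), copied from the fixed-point columns of
# the "g = g + b(k) * f(k)" table; a "-" entry in the listing is entered as 0.
LOOP_BODY = [
    ("REFSL",   2.0, 0.5, 1, 0),   # get g
    ("REFSL",   2.0, 0.5, 1, 0),   # get k
    ("REFSXI",  4.0, 1.5, 1, 0),   # get b(k)
    ("REFSL",   2.0, 0.5, 1, 0),   # get k
    ("REFSXI",  4.0, 1.5, 1, 0),   # get f(k)
    ("MPYI",   23.0, 0.5, 0, 0),   # multiply
    ("ADD",     2.0, 0.5, 0, 0),   # add
    ("ASNSL",   2.0, 0.5, 0, 1),   # store in g
]
LOOP_END = [("ENDO", 9.0, 1.0, 1, 1)]  # end of loop
PASSES = 16


def column_totals(rows):
    """Sum the Nc, Nf, Nr and Nw columns of a list of worksheet rows."""
    return tuple(sum(row[i] for row in rows) for i in range(1, 5))


iteration_total = column_totals(LOOP_BODY + LOOP_END)
loop_total = tuple(PASSES * x for x in iteration_total)
print("iteration total:", iteration_total)
print("loop total:     ", loop_total)
```

The two tuples reproduce the fixed-point entries 50 / 7 / 6 / 2 and 800 / 112 / 96 / 32 given for this loop in the listing.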
Optimized Widrow listing
The following listing is a hand-optimized version of the Widrow
algorithm given below. The program generally follows the equations
below but uses some assembly language "tricks" to improve
efficiency.
loop: f = adc_in
      g = 0
      do k = 16,1
         g = g + b(k) * f(k)
      endo
      e = f - g
      c = v * e
      do k = 16,1
         b(k+1) = u * b(k) + c * f(k)
      endo
      b(1) = b(17)
      q = q - e(ptr) + e
      dac_out = q * q
      e(ptr) = e
      f(ptr) = f
      if ptr = 16
         then ptr = 1
         else ptr = ptr + 1
      goto loop
Note: + = indicates a stack update
comments              fixed-point                    floating-point
                      opcode   Nc     Nf    Nr  Nw   opcode   Nc     Nf    Nr  Nw
"loop: f = adc_in"
read ADC              LIT32    5.0    2.5   -   -    LIT32    5.0    2.5   -   -
                      REFSU    5.0    0.5   1   -    REFSU    5.0    0.5   1   -
convert to f.p.       -                              CVTSD    2.0    0.5   -   -
                      -                              CVTDF    21.0   0.5   -   -
store f               ASNSL    2.0    0.5   -   1    ASNDL    3.0    0.5   -   2
                               12.    3.5   1   1             36.    4.5   1   2
"g = 0"
get 0                 LIT4     1.0    0.5   -   -    LITD0    2.0    0.5   -   -
store g               ASNSL    2.0    0.5   -   1    ASNDL    3.0    0.5   -   2
                               3.     1.    -   1             5.     1.    -   2
"do k = 16,1"
initialize count      LIT8     2.0    1.0   -   -    LIT8     2.0    1.0   -   -
store count           ASNSL    2.0    0.5   -   1    ASNSL    2.0    0.5   -   1
                               4.     1.5   -   1             4.     1.5   -   1
"g = g + b(k) * f(k)"
get g                 REFSL    2.0    0.5   1   -    -
get k                 REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
get b(k)              REFSC    3.0    1.0   1   -    REFDXI   5.0    1.5   2   -
get k                 REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
get f(k)              REFSC    3.0    1.0   1   -    REFDXI   5.0    1.5   2   -
multiply              MPYI     23.0   0.5   -   -    MPYF     95.0   0.5   -   -
get g                 -                              REFDL    3.0    0.5   2   -
add                   ADD      2.0    0.5   -   -    ADDF     38.0   0.5   -   -
store in g            ASNSL    2.0    0.5   -   1    ASNDL    3.0    0.5   -   2
                               39.    5.    5   1             153.   6.    8   2
"endo"
decrement count       DECSLE   5.0    1.0   1   1    DECSLE   5.0    1.0   1   1
get count             REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
loop if count<>0      LIT8N    2.0    1.0   -   -    LIT8N    2.0    1.0   -   -
                      SKIPNZ   3.5    1.0   -   -    SKIPNZ   3.5    1.0   -   -
                               12.5   3.5   2   1             12.5   3.5   2   1
iteration total                51.5   8.5   7   2             165.5  9.5   10  3
loop total                     824    136   112 32            2648   152   160 48
"e = f - g"
get f                 REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
get g                 REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
subtract              SUB      2.0    0.5   -   -    SUBF     40.0   0.5   -   -
duplicate e           DUP      1.0    0.5   -   -    DUPD     2.0    0.5   -   -
store e               ASNSL    2.0    0.5   -   1    ASNDL    3.0    0.5   -   2
                               9.     2.5   2   1             51.    2.5   4   2
"c = v * e"
get v                 REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
multiply              MPYI     23.0   0.5   -   -    MPYF     95.0   0.5   -   -
store c               ASNSL    2.0    0.5   -   1    ASNDL    3.0    0.5   -   2
                               27.    1.5   1   1             101.   1.5   2   2
"do k = 16,1'
initialize count      LIT8     2.0    1.0   -   -    LIT8     2.0    1.0   -   -
store count           ASNSL    2.0    0.5   -   1    ASNSL    2.0    0.5   -   1
                               4.     1.5   -   1             4.     1.5   -   1
"b(k+1) = u * b(k) + c * f(k)"
get u                 REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
get k                 REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
get b(k)              REFSC    3.0    1.0   1   -    REFDXI   5.0    1.5   2   -
multiply              MPYI     23.0   0.5   -   -    MPYF     95.0   0.5   -   -
store in temp         -                              ASNDL    3.0    0.5   -   2
get c                 REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
get k                 REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
get f(k)              REFSC    3.0    1.0   1   -    REFDXI   5.0    1.5   2   -
multiply              MPYI     23.0   0.5   -   -    MPYF     95.0   0.5   -   -
retrieve temp         -                              REFDL    3.0    0.5   2   -
add                   ADD      2.0    0.5   -   -    ADDF     38.0   0.5   -   -
get k                 REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
get 1                 LIT4     1.0    0.5   -   -    LIT4     1.0    0.5   -   -
add                   ADD      2.0    0.5   -   -    ADD      2.0    0.5   -   -
store b(k+1)          ASNSC    3.0    1.0   -   1    ASNDXI   5.0    1.5   -   2
                               70.    8.    7   1             264.   10.5  13  4
"endo"
decrement count       DECSLE   5.0    1.0   1   1    DECSLE   5.0    1.0   1   1
get count             REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
loop if count<>0      LIT8N    2.0    1.0   -   -    LIT8N    2.0    1.0   -   -
                      SKIPNZ   3.5    1.0   -   -    SKIPNZ   3.5    1.0   -   -
                               12.5   3.5   2   1             12.5   3.5   2   1
iteration total                82.5   11.5  9   2             276.5  14    15  5
loop total                     1320   184   144 32            4424   224   240 80
"b(1) = b(17)"
get b(17)             REFSLE   3.0    1.0   1   -    REFDLE   4.0    1.0   2   -
store b(1)            ASNSLE   3.0    1.0   -   1    ASNDLE   4.0    1.0   -   2
                               6.     2.    1   1             8.     2.    2   2
"q = q - e(ptr) + e"
get q                 REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
get ptr               REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
get e(ptr)            REFSC    3.0    1.0   1   -    REFDXI   5.0    1.5   2   -
subtract              SUB      2.0    0.5   -   -    SUBF     40.0   0.5   -   -
get e                 REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
add                   ADD      2.0    0.5   -   -    ADDF     38.0   0.5   -   -
duplicate q           DUP      1.0    0.5   -   -    DUPD     2.0    0.5   -   -
store in q            ASNSL    2.0    0.5   -   1    ASNDL    3.0    0.5   -   2
                               16.    4.5   4   1             96.    5.    7   2
"dac_out = q * q"
duplicate q           DUP      1.0    0.5   -   -    DUPD     2.0    0.5   -   -
square q              MPYI     23.0   0.5   -   -    MPYF     95.0   0.5   -   -
convert from f.p.     -                              CVTFD    17.0   0.5   -   -
                      -                              CVTDS    3.0    0.5   -   -
write to DAC          LIT32    5.0    2.5   -   -    LIT32    5.0    2.5   -   -
                      ASNSU    5.0    0.5   -   1    ASNSU    5.0    0.5   -   1
                               34.    4.    -   1             127.   5.    -   1
"e(ptr) = e"
get e                 REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
get ptr               REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
store in e(ptr)       ASNSC    3.0    1.0   -   1    ASNDXI   5.0    1.5   -   2
                               7.     2.    2   1             10.    2.5   3   2
"f(ptr) = f"
get f                 REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
get ptr               REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
store in f(ptr)       ASNSC    3.0    1.0   -   1    ASNDXI   5.0    1.5   -   2
                               7.     2.    2   1             10.    2.5   3   2
"if ptr = 16
    then ptr = 1
    else ptr = ptr + 1"
increment ptr         INCSLE   5.0    1.0   1   1    INCSLE   5.0    1.0   1   1
get ptr               REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
load mask pattern     LIT16    3.0    1.5   -   -    LIT16    3.0    1.5   -   -
mask ptr              AND      1.0    0.5   -   -    AND      1.0    0.5   -   -
store ptr             ASNSL    2.0    0.5   -   1    ASNSL    2.0    0.5   -   1
                               13.    4.    2   2             13.    4.    2   2
"goto loop"
repeat                LIT8N    2.0    1.0   -   -    LIT8N    2.0    1.0   -   -
                      SKIP     2.0    1.0   -   -    SKIP     2.0    1.0   -   -
                               4.     2.0   -   -             4.     2.0   -   -
program total                  2290   352   271 77            7541   411.5 424 149
(no stack updates)
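The pointer update near the end of the optimized listing replaces the "if ptr = 16" test with an unconditional increment followed by an AND with a mask pattern. A minimal sketch of the two forms of wrap-around follows. The function names are hypothetical, and the mask value is an assumption: the listing shows only that a 16-bit literal is loaded, so the sketch assumes a 16-entry buffer indexed 0..15 and a mask of 15.

```python
BUFFER_SIZE = 16


def next_ptr_branch(ptr):
    """Compare-and-branch wrap for a pointer that runs 1..16."""
    return 1 if ptr == BUFFER_SIZE else ptr + 1


def next_ptr_mask(ptr, mask=BUFFER_SIZE - 1):
    """Increment-and-mask wrap for a pointer that runs 0..15 (mask assumed)."""
    return (ptr + 1) & mask
```

Masking only works when the buffer length is a power of two. In the worksheets the masked update is charged 13 cycles, against 10.5 cycles for the test plus 5 for either branch in the Modified Widrow listing.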
Optimized Lattice listing
The following listing is a hand-optimized version of the
Lattice algorithm presented below. The program generally follows
the equations below but uses some assembly language "tricks" to
improve efficiency.
begin  present_w = adc_input
       present_e = present_w
loop   do l = 0,15
          next_e = present_e - k(l) * wl(l)
          next_w = wl(l) - k(l) * present_e
          v(l) = beta * v(l) + betal * (present_e *
                 present_e + wl(l) * wl(l))
          k(l) = k(l) + alpha * (next_e * wl(l) + present_e
                 * next_w)/v(l)
          wl(l) = present_w
          present_w = next_w
          present_e = next_e
       endo
       dac_out = present_e
       goto begin
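The equations above can be transcribed directly into Python to make the data flow between stages explicit. This is a sketch for illustration only, not the code that was timed: the function name lattice_step is hypothetical, the arrays are ordinary Python lists of floats, and v(l) is assumed to be nonzero so that the division is defined.

```python
STAGES = 16


def lattice_step(k, wl, v, adc_input, beta, betal, alpha):
    """Process one input sample through all stages and return dac_out.

    k, wl and v are lists of length STAGES and are updated in place.
    """
    present_w = adc_input
    present_e = present_w
    for l in range(STAGES):                      # do l = 0,15
        next_e = present_e - k[l] * wl[l]
        next_w = wl[l] - k[l] * present_e
        v[l] = beta * v[l] + betal * (present_e * present_e + wl[l] * wl[l])
        k[l] = k[l] + alpha * (next_e * wl[l] + present_e * next_w) / v[l]
        wl[l] = present_w                        # save this stage's input
        present_w = next_w
        present_e = next_e
    return present_e                             # dac_out = present_e
```

With every k(l) equal to zero and alpha = 0 the coefficients never adapt, so the error passes through unchanged and wl simply acts as a delay line for the input samples.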
Note: + = indicates a stack update
comments              fixed-point                    floating-point
                      opcode   Nc     Nf    Nr  Nw   opcode   Nc     Nf    Nr  Nw
"begin present_w = adc_input
       present_e = present_w"
read ADC              LIT32    5.0    2.5   -   -    LIT32    5.0    2.5   -   -
                      REFSU    5.0    0.5   1   -    REFSU    5.0    0.5   1   -
convert to f.p.       -                              CVTSD    2.0    0.5   -   -
                      -                              CVTDF    21.0   0.5   -   -
duplicate data        DUP      1.0    0.5   -   -    DUPD     2.0    0.5   -   -
store present_w       ASNSL    2.0    0.5   -   1    ASNDL    3.0    0.5   -   2
store present_e       ASNSL    2.0    0.5   -   1    ASNDL    3.0    0.5   -   2
                               15.    4.5   1   2             41.    5.5   1   4
"loop do l = 0,15"
init loop count       LIT8     2.0    1.0   -   -    LIT8     2.0    1.0   -   -
store count           ASNSL    2.0    0.5   -   1    ASNSL    2.0    0.5   -   1
                               4.     1.5   -   1             4.     1.5   -   1
"The following sequence of steps seeks to reduce the referencing
time of array members by copying them into local memory locations
at the beginning of each iteration. The floating point version
must do some of this using extra references to avoid the stack
updates caused by having more than two floating point numbers on
the stack at once."
"next_e = present_e - k(l) * wl(l)"
get l                 -                              REFSL    2.0    0.5   1   -
duplicate l           -                              DUP      1.0    0.5   -   -
get wl(l)             -                              REFDXI   5.0    1.5   2   -
store wl_l            -                              ASNDL    3.0    0.5   -   2
get k(l)              -                              REFDXI   5.0    1.5   2   -
duplicate k(l)        -                              DUPD     2.0    0.5   -   -
store k_l             -                              ASNDL    3.0    0.5   -   2
get present_e         REFSL    2.0    0.5   1   -    -
get l                 REFSL    2.0    0.5   1   -    -
get k(l)              REFSC    3.0    1.0   1   -    -
duplicate k_l         DUP      1.0    0.5   -   -    -
store k_l             ASNSL    2.0    0.5   -   1    -
get l                 REFSL    2.0    0.5   1   -    -
get wl(l)             REFSC    3.0    1.0   1   -    REFDL    3.0    0.5   2   -
duplicate wl_l        DUP      1.0    0.5   -   -    -
store wl_l            ASNSL    2.0    0.5   -   1    -
multiply              MPYI     23.0   0.5   -   -    MPYF     95.0   0.5   -   -
get present_e         -                              REFDL    3.0    0.5   2   -
reorder arguments     -                              EXCHD    6.0    0.5   -   -
subtract              SUB      2.0    0.5   -   -    SUBF     40.0   0.5   -   -
store next_e          ASNSL    2.0    0.5   -   1    ASNDL    3.0    0.5   -   2
                               45.    7.    5   3             171.   8.5   9   6
'next_w = wl(l) - k(l) * present_e'
get wl_l REFSL 2.0 0.5 1 - -
get k_l REFSL 2.0 0.5 1 - REFDL 3.0 0.5 2 —
get present_e REFSL 2.0 0.5 1 - REFDL 3.0 0.5 2 -
multiply MPYI 23.0 0.5 - - MPYF 95.0 0.5 - -
get wl_l - REFDL 3.0 0.5 2 -
reorder arguments - EXCHD 6.0 0.5 - -
subtract SUB 2.0 0.5 - - SUBF 40.0 0.5 - -
store next_w ASNSL 2.0 0.5 — 1 ASNDL 3.0 0.5 — 2
33. 3. 3 1 153. 3.5 6 2
"v(l) = beta * v(l) + betal *
(present_e * present_e + wl(l) * wl(l))"
get wl_l REFSL 2.0 0.5 1 - REFDL 3.0 0.5 2 -
square wl_l DUP 1.0 0.5 - - DUPD 2.0 0.5 - -
MPYI 23.0 0.5 - - MPYF 95.0 0.5 - -
store in temp - ASNDL 3.0 0.5 - 2
get present_e REFSL 2.0 0.5 1 - REFDL 3.0 0.5 2 -
square present_e DUP 1.0 0.5 - - DUPD 2.0 0.5 - -
MPYI 23.0 0.5 - - MPYF 95.0 0.5 - -
retrieve temp - REFDL 3.0 0.5 2 -
sum squares ADD 2.0 0.5 - - ADDF 38.0 0.5 - -
get betal REFSL 2.0 0.5 1 - LIT32 5.0 2.5 - -
multiply MPYI 23.0 0.5 - - MPYF 95.0 0.5 - -
store in temp - ASNDL 3.0 0.5 - 2
get 1 REFSL 2.0 0.5 1 - REFSL 2.0 0.5 1 -
get v(l) REFSC 3.0 1.0 1 - REFDXI 5.0 1.5 2 -
get beta REFSL 2.0 0.5 1 - LIT32 5.0 2.5 - -
multiply MPYI 23.0 0.5 - - MPYF 95.0 0.5 - -
retrieve temp - REFDL 3.0 0.5 2 -
sum expressions ADD 2.0 0.5 - - ADDF 38.0 0.5 - -
get 1 REFSL 2.0 0.5 1 - REFSL 2.0 0.5 1 -
store in v(l) ASNSC 3.0 1.0 — 1 ASNDXI 5.0 1.5 — 2
116. 9. 7 1 502. 16. 12 6
"k(l) = k(l) + alpha * (next_e * wl(l) +
present_e * next_w) / v(l)"
get present_e REFSL 2.0 0.5 1 - REFDL 3.0 0.5 2 -
get next_w REFSL 2.0 0.5 1 - REFDL 3.0 0.5 2 -
multiply MPYI 23.0 0.5 - - MPYF 95.0 0.5 - -
store in temp - ASNDL 3.0 0.5 - 2
get next_e REFSL 2.0 0.5 1 - REFDL 3.0 0.5 2 -
get wl_l REFSL 2.0 0.5 1 - REFDL 3.0 0.5 2 -
multiply MPYI 23.0 0.5 - - MPYF 95.0 0.5 - -
retrieve temp - REFDL 3.0 0.5 2 -
sum products ADD 2.0 0.5 - - ADDF 38.0 0.5 - -
get alpha REFSL 2.0 0.5 1 - LIT32 5.0 2.5 - -
multiply MPYI 23.0 0.5 - - MPYF 95.0 0.5 - -
get 1 REFSL 2.0 0.5 1 - REFSL 2.0 0.5 1 -
get v(l) REFSC 3.0 1.0 1 - REFDXI 5.0 1.5 2 -
divide DIVI 27.0 0.5 - - DIVF 98.0 0.5 - -
get k_l REFSL 2.0 0.5 1 - REFDL 3.0 0.5 2 -
sum expression ADD 2.0 0.5 - - ADDF 38.0 0.5 - -
get 1 REFSL 2.0 0.5 1 - REFSL 2.0 0.5 1 -
store k(l) ASNSC 3.0 1.0 — 1 ASNDXI 5.0 1.5 — 2
122. 9. 9 1 499. 13. 16 4
"wl(l) = present_w"
get present_w         REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
get l                 REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
store wl(l)           ASNSC    3.0    1.0   -   1    ASNDXI   5.0    1.5   -   2
                               7.     2.    2   1             10.    2.5   3   2
"present_w = next_w"
get next_w            REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
store present_w       ASNSL    2.0    0.5   -   1    ASNDL    3.0    0.5   -   2
                               4.     1.    1   1             6.     1.    2   2
"present_e = next_e"
get next_e            REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
store present_e       ASNSL    2.0    0.5   -   1    ASNDL    3.0    0.5   -   2
                               4.     1.    1   1             6.     1.    2   2
"endo"
decrement count       DECSLE   5.0    1.0   1   1    DECSLE   5.0    1.0   1   1
get count             REFSL    2.0    0.5   1   -    REFSL    2.0    0.5   1   -
loop if count<>0      LIT8N    2.0    1.0   -   -    LIT8N    2.0    1.0   -   -
                      SKIPNZ   3.5    1.0   -   -    SKIPNZ   3.5    1.0   -   -
                               12.5   3.5   2   1             12.5   3.5   2   1
iteration total                343.5  35.5  30  10            1359.5 49    52  25
loop total                     5496   568   480 160           21752  784   832 400
"dac_output = present_e"
get present_e         REFSL    2.0    0.5   1   -    REFDL    3.0    0.5   2   -
convert from f.p.     -                              CVTFD    17.0   0.5   -   -
                      -                              CVTDS    3.0    0.5   -   -
store to DAC          LIT32    5.0    2.5   -   -    LIT32    5.0    2.5   -   -
                      ASNSU    5.0    0.5   -   1    ASNSU    5.0    0.5   -   1
                               12.    3.5   1   1             33.    4.5   2   1
"goto begin"
repeat                LIT8N    2.0    1.0   -   -    LIT8N    2.0    1.0   -   -
                      SKIP     2.0    1.0   -   -    SKIP     2.0    1.0   -   -
                               4.     2.0   -   -             4.     2.0   -   -
program total                  5499   571.5 482 164           21786  781.5 835 405
(no stack updates)
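The note ahead of the "next_e" table describes the main restructuring in this listing: k(l) and wl(l) are read from their arrays once per iteration and kept in the local locations k_l and wl_l. The sketch below shows that restructuring for a single stage, side by side with the directly indexed form. The function names stage_indexed and stage_cached are hypothetical, and the code illustrates the idea only; it says nothing about the AAMP cycle counts.

```python
def stage_indexed(k, wl, v, l, present_e, present_w, beta, betal, alpha):
    """One lattice stage, indexing the arrays every time a value is needed."""
    next_e = present_e - k[l] * wl[l]
    next_w = wl[l] - k[l] * present_e
    v[l] = beta * v[l] + betal * (present_e * present_e + wl[l] * wl[l])
    k[l] = k[l] + alpha * (next_e * wl[l] + present_e * next_w) / v[l]
    wl[l] = present_w
    return next_e, next_w


def stage_cached(k, wl, v, l, present_e, present_w, beta, betal, alpha):
    """The same stage, reading k(l) and wl(l) once into locals k_l and wl_l."""
    k_l = k[l]
    wl_l = wl[l]
    next_e = present_e - k_l * wl_l
    next_w = wl_l - k_l * present_e
    v[l] = beta * v[l] + betal * (present_e * present_e + wl_l * wl_l)
    k[l] = k_l + alpha * (next_e * wl_l + present_e * next_w) / v[l]
    wl[l] = present_w
    return next_e, next_w
```

Both functions perform the same arithmetic in the same order, so they return identical results. In the fixed-point worksheets the saving comes from replacing indexed references (REFSC, 3.0 cycles) with local references (REFSL, 2.0 cycles).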
Appendix B: Ada-subset listings
Notes on the Ada-subset compiler
The Ada-subset compiler used is resident on a VAX-11/780 at
the Rockwell Collins facility in Cedar Rapids, IA. The output of
the compiler front-end is in the form of macro instructions for a
stack machine. These macro instructions are then translated into
instructions for a particular machine, in this case the AAMP. For
the compiler to produce object code that uses Local variables,
the code must be inside a procedure within the package. If the
code is placed in the package outside a procedure, the variables
are addressed using the global addressing mode, resulting in a
considerable loss in efficiency.
Loop variables created in a program are allocated after the
declared variables, so they often reside in the Local Extended
memory area. To use the more efficient Local memory area, declare
an integer variable with a different name and, at the beginning
of the loop, immediately assign the loop variable's value to it.
The new variable is then referenced during the rest of the loop.
This is economical in long loops which contain many loop variable
references, such as the Lattice.
For each of the following three programs there is an integer
version of the source listing and both integer and floating-point
versions of the object listing. Only one source listing is given
because the integer and floating-point sources are the same
except for the type of number_system and the constants.
Standard Widrow Source listing
with TEXT_IO, portpack;
use TEXT_IO, portpack;

-- Standard Widrow Algorithm
-- April 17, 1984
-- Ken Albin, Dept. of Electrical and Computer Engineering
-- Kansas State University, Manhattan, KS 66506
-- This program is based on the Standard Widrow coded in the
-- AAMP preliminary report.

package WIDROWI is
   procedure WIDROW;
end WIDROWI;

package body WIDROWI is
   procedure WIDROW is
      pragma SUPPRESS (INDEX_CHECK);
      pragma SUPPRESS (RANGE_CHECK);
      -- The loop variable k is always an integer.
      -- Other variables will reflect the number system.
      subtype number_system is integer;
      k : integer;        -- loop variable
      f : number_system;  -- current input value
      g : number_system;  -- summation value
      e : number_system;  -- current error value (filter output)
      q : number_system;  -- "alarm" output
      -- c is left out to test the compiler's optimization
      -- c represents an expression which is constant within a loop
      type values is array (1..16) of number_system;
      b_array : values;   -- weight array
      f_array : values;   -- sample array
      e_array : values;   -- error array
      u : constant := 1;
      v : constant := 0;
   -- *************************************************************
   begin
      -- initialization sequence goes here
      for k in 1..16 loop
         b_array(k) := 1;
         f_array(k) := 1;
         e_array(k) := 1;
      end loop;
      -- end initialization
      -- begin main loop
      loop
         f := adc_in;
         g := 0;
         for k in 1..16 loop
            g := g + b_array(k) * f_array(k);
         end loop;
         e := f - g;
         for k in 1..16 loop
            b_array(k) := u * b_array(k) + v * e * f_array(k);
         end loop;
         q := q - e_array(16) + e;
         dac_out := q * q;
         for k in 1..15 loop
            e_array(k+1) := e_array(k);
            f_array(k+1) := f_array(k);
         end loop;
         e_array(1) := e;
         f_array(1) := f;
      end loop;
   end WIDROW;
begin
   null;
end WIDROWI;
Integer Standard Widrow Object listing
Macro/Instruction Definitions will be read from module
[TDJ.AAMP16]AAMP16.MLB
Program Size For Counter 1 = 102 Words Decimal.
CAPS Macro Assembler listing for module WIDROWI.OBJ
IDENT.
XREF.
XREF.
XREF.
PACKAGE.
XDEF.
XDEF.
PROCDEF
Opcodes Instruction Macro
0036 { procedure header }
11 LIT4A.1 LIT.
35 5C ASNSLE ASNL.
L#1002:;
35 IE REFSLE REFL.
10 18 LIT8 LIT.
EC GR GRT.
1D5B SKIPNZI JUMPT.
'widrowi',' AAMP/ACAPS Code
Generator Version 1.6';
standard;
text_io;
portpack;
widrowi;
$init. widrowi. 0000;
widrow. widrowi. 0001;
widrow. widrowi. 0001, 54 ,12;
Macro args.
1,1; (init loop varaible k}
1,53;
1,53; (check loop variable k)
1,16;
1;
L#1001;
11 LIT4A.1 LIT. 1,1; (b_array(k) := 1}
35 IE REFSLE REFL. 1,53;
14 LIT4A.4 ASNLX 1,4;
53 LOCL
A6 ASNSX
11 LIT4A.1 LIT. l,l; (f_array(k) := 1>
35 IE REFSLE REFL. 1,53;
14 18
53
A6
LIT8
LOCL
ASNSX
ASNLX
.
1,20;
11 LIT4A.1 LIT. l,l; fe_array(k) := 1}
351E REFSLE REFL. 1,53;
2418 LIT8 ASNLX. 1,36;
53 LOCL
A6 ASNSX
3 51E REFSLE REFL. 1,53; (increment k}
11 LIT4A.1 LIT. l,l;
E4 ADD ADD. l;
355C ASNSLE ASNL. 1,53;
2319 LIT8N
59 SKIP
L#1001
L#1003
L#1004
0000 1C REFSI
41 ASNSL.l
10 LIT4A.0
42 ASNSL.2
11 LIT4A.1
355C ASNSLE
L#1007:;
351E REFSLE
1018 LIT8
EC GR
18 5B SKIPNZI
02 REFSL.2
351E REFSLE
14 LIT4A.4
53 LOCL
DO REFSX
35 IE REFSLE
JUMP. L#1002; {go to loop check)
14 18
53
E6
LIT8
LOCL
DO REFSX
MPYI
E4 ADD
42 ASNSL.2
351E REFSLE
11 LIT4A.1
E4 ADD
355C ASNSLE
1E19 LIT8N
59 SKIP
L#1006:;
L#1008:
;
REFS.
ASNL.
LIT.
ASNL.
LIT.
ASNL.
REFL.
LIT.
GRT.
JUMPT,
REFL.
REFL.
REFLX,
REFL.
REFLX
MPY.
ADD.
ASNL.
REFL.
LIT.
ADD
ASNL.
JUMP.
1,0, portpack; {f:=adc_in>
1,1;
1,0;
1,2;
1,1;
1,53;
{g := 0}
(init loop variable k)
1,53;
1,16;
1;
L#1006;
1,2; {g := g + b_array(k) *
1,53; f_array (k)
}
1,4;
1,53;
1,20;
If
If
1,2;
1,53; (increment k>
1,1;
1;
1,53;
L#1007; {go to loop check)
01 REFSL.l REFL. 1,1; {e := f - g>
02 REFSL.2 REFL. 1,2;
E5 SUB SUB. l;
43 ASNSL.3 ASNL. 1,3;
11 LIT4A.1 LIT. 1,1; {init loop variable k)
355C ASNSLE
L#1010:
ASNSL.
•
1,53;
351E REFSLE REFL. 1,53; {check loop variable k)
1018 LIT8 LIT. 1,16;
EC GR GRT. 1;
5B SKIPNZI JUMPT. L#1009;
11 LIT4A.1 LIT.
351E REFSLE REFL.
14 LIT4A.4 REFLX
53 LOCL
DO REFSX
E6 MPYI MPY.
10 LIT4A.0 LIT.
03 REFSL.3 REFL.
E6 MPYI MPY.
35 IE REFSLE REFL.
14 18 LIT8 REFLX
53 LOCL
DO REFSX
E6 MPYI MPY.
E4 ADD ADD.
35 IE REFSLE REFL.
14 LIT4A.4 ASNLX
53 LOCL
A6 ASNSX
351E REFSLE REFL.
11 LIT4A.1 LIT.
E4 ADD ADD.
355C ASNSLE ASNL.
2619 LIT8N JUMP.
59 SKIP
L#1009:;
L#1011:;
04 REFSL.4 REFL.
341E REFSLE REFL.
E5 SUB SUB.
03 REFSL.3 REFL.
E4 ADD ADD.
44 ASNSL.4 ASNL.
04 REFSL.4 REFL.
04 REFSL.4 REFL.
E6 MPYI MPY.
0001 54 ASNXI ASNS.
11 LIT4A.1 LIT.
35 5C ASNSLE
L#1013:;
ASNL.
35 IE REFSLE REFL.
2F LIT4B.F LIT.
EC GR GRT.
21 5B SKIPNZI JUMPT
35 IE REFSLE REFL.
24 18 LIT8 REFLX
53 LOCL
• DO REFSX
35 IE REFSLE REFL.
25 18 LIT8 ASNLX
If 1; (b_array (k) :=u*b_array (k) +
1,53; v*e*f_array(k)
}
1,4;
l;
1,0;
1,3;
1;
1,53;
1,20;
l;
1;
1,53;
1,4;
1,53; (increment k}
1,1;
l;
1,53;
L#1010; (go to loop check)
1,4; {q:=q-e_array (16) +e)
1,52;
l;
1,3;
1;
1,4;
1,4; (dac_out := q * q}
1,4;
1;
1 ,l,portpack;
1,1; (init loop variable k)
1,53;
1,53; (check loop variable k)
1,15;
l;
L#1012;
1,53; (e_array(k+l) :=
1,36; e_array (k)
}
1,53; (note: an optimization!
1,37; base+1 is calculated)
53 LOCL
A6 ASNSX
3 5 IE REFSLE
14 18 LIT8
53 LOCL
DO REFSX
35 IE REFSLE
15 18 LIT8
53 LOCL
A6 ASNSX
3 5 IE REFSLE
11 LIT4A.1
E4 ADD
35 5C ASNSLE
26 19 LIT8N
59 SKIP
L#1012:;
L#1014:;
03 REFSL.3
25 5C ASNSLE
01
155C
9519
59
36 18
5F
REFSL.l
ASNSLE
LIT8N
SKIP
L#1005:
;
L#1000:;
LIT8
RETURN
REFL. 1,53; {f_ar ray (k+1 ) :=
REFLX. 1,20; f_array(k)}
REFL. 1,53;
ASNLX. 1,21; {note: an optimization!
base+1 is calculated}
REFL. 1,53; (increment k}
LIT. 1,1;
ADD. 1;
ASNL. 1,53;
JUMP. L#1013; (go to loop check}
REFL. 1,3; (e_array(l) := e}
ASNL. 1,37;
REFL. 1,3; {f_array(D := f}
ASNL. 1,21;
JUMP. L#1004; {go to beginning}
PROCEND. 54,0; {procedure return}
0000
00 0023
0000 23
{procedure
CALLI
CALL I
L#2000:
10 LIT4A.0
5F RETURN
PKGDEF. $init.widrowi. 0000,12;
header for package body}
CALLGS. $i ni t.textio. 0000, text io;
CALLGS. $init.portpack.0000,portpack;
PKG END. 0;
FINI
Widrow Floating-point Object listing
Macro/Instruction Definitions will be read from module
[TDJ.AAMP16]AAMP16.MLB
Program Size For Counter 1 = 120 Words Decimal.
CAPS Macro Assembler listing for module WIDROWF.OBJ
IDENT.
XREF.
XREF.
XREF.
PACKAGE
XDEF.
XDEF.
PROCDEF
Opcodes Instruction Macro
006E { procedure header }
0000
0000 8125 LIT32 LIT.
69 F7 ASNDLE ASNL.
0000
0000 25 LIT32 LIT.
6BF7 ASNDLE ASNL.
11 LIT4A.1 LIT.
6D 5C ASNSLE ASNL.
L#1002:• •1 /
6D IE REFSLE REFL.
10 18 LIT8 LIT.
EC GR GRT.
295B SKIPNZI JUMPT.
00
0000 8125 LIT32 LIT.
6D IE REFSLE REFL.
17 LIT4A.7 ASNLX.
53 LOCL
8C ASNDX
'widrowfV AAMP/ACAPS Code
Generator Version 1.6';
standard;
text_io;
portpack;
widrowf
;
$init.widrowf .0000;
widrow. widrowf . 0001
;
widrow. widrowf .0001 ,110,12;
Macro args.
2,1.00000000;
2,105;
2,0,00000000;
2,107;
1,1; (init loop varaible k}
1,109;
1,109; (check loop variable k}
1,16;
l;
L#1001;
2,1.00000000;
1,109; (b_array(k) := 1}
2,7;
00
0000 8125 LIT32
6D IE REFSLE
27 18 LIT8
53 LOCL
8C ASNDX
LIT. 2,1.00000000;
REFL. 1,109; (f_array(k) := 1}
ASNLX. 1,3 9;
0000
0081 25 LIT32 LIT. 2,1.00000000;
6D1E REFSLE
4718 LIT8
53 LOCL
8C ASNDX
REFL. 1/109; {e_array(k) := 1}
ASNLX. 2,71;
6D1E REFSLE
11 LIT4A.1
E4 ADD
6D5C ASNSLE
2F19 LIT8N
59 SKIP
L#1001
L#1003
L#1004
0000 1C REFSI
65 CVTSD
D9 CVTDF
41 ASNDL.l
0000
0000 25 LIT32
C3 ASNDL.3
11 LIT4A.1
6D5C ASNSLE
L#1007:;
6D1E REFSLE
1018 LIT8
EC GR
18 5B SKIPNZI
33 REFDL.3
6D1E REFSLE
17 LIT4A.7
53 LOCL
D7 REFDX
6D IE REFSLE
27 18 LIT8
53 LOCL
D7 REFDX
86 MPYF
84 ADDF
C3 ASNDL.3
6D1E REFSLE
11 LIT4A.1
E4 ADD
6D5C ASNSLE
1E19 LIT8N
59 SKIP
REFL,
LIT.
ADD.
ASNL,
JUMP.
1,109; {increment k}
1,1;
If
1,109;
L#1002; {go to loop check)
REFS.
CONVERT,
ASNL.
LIT.
ASNL.
LIT.
ASNL.
REFL.
LIT.
GRT.
JUMPT.
REFL.
REFL.
REFLX,
REFL.
REFLX
MPY.
ADD.
ASNL.
REFL.
LIT.
ADD.
ASNL.
JUMP.
1 ,0,portpack; {f:=adc_in>
1,5,0,0;
2,1;
2,0.00000000;
2,3;
1,1; {init loop variable k>
1,109;
1,109;
1,16;
l;
L#1006;
2,3; {g := g + b_array(k) *
1,109; f_array(k)>
2,7;
1,109;
1,39;
1;
1;
2,3;
1,109; {increment k)
1,1;
l;
1,109;
L#1007; {go to loop check)
L#1006:
;
L#1008:;
31 REFDL.l
3 3 REFDL.3
85 SUBF
C5 ASNDL.5
11 LIT4A.1
6D5C ASNSLE
6D1E REFSLE
1018 LIT8
EC GR
22 5B SKIPNZI
L#1010
69 22
6D IE
17
D7
6B 22
35
I
6D IE
27 18
53
53
86
86
D7
86
6D IE
17
84
53
8C
REFDLE
REFSLE
LIT4A.7
LOCL
REFDX
MPYF
REFDLE
REFDL.5
MPYF
REFSLE
LIT8
LOCL
REFDX
MPYF
ADDF
REFSLE
LIT4A.7
LOCL
ASNDX
REFL. 2,1; Ce := f - g}
REFL. 2,3;
SUB. 2;
ASNL. 2,5;
LIT. 1,1; {init loop variable k)
ASNSL. 1/109;
REFL. 1,109; (check loop variable k}
LIT. 1,16;
GRT. 1
;
JUMPT. L#1009;
REFL. 2,105; {b_array (k) : =u*b_array (k)
+
REFL. 1/109; v*e*f_ar r ay (k)
}
REFLX. 2,7;
MPY. 5;
REFL. 2,107;
REFL
.
2,5;
MPY
.
5
REFL. 1,109;
REFLX. 2,3 9;
MPY. 5;
ADD. 5;
REFL. 1,109;
ASNLX
.
2,7;
6D1E REFSLE
11 LIT4A.1
E4 ADD
6D5C ASNSLE
2819 LIT8N
59 SKIP
L#1009:;
L#1011:;
37 REFDL.7
6722 REFDLE
85 SUBF
35 REFDL.5
84 ADDF
C7 ASNDL.7
37 REFDL.7
37 REFDL.7
86 MPYF
0001 54 ASNXI
11 LIT4A.1
REFL. 1,109; {increment k}
LIT. 1,1;
ADD. 1;
ASNL. 1,109;
JUMP. L#1010; (go to loop check}
REFL. 2,7; {q: =q-e_ar ray ( 16) +e}
REFL. 2,103;
SUB. 5;
REFL. 2,5;
ADD. 5;
ASNL. 2,7;
REFL. 2,7; {dac_out := q * q}
REFL. 2,7;
MPY. 5;
ASNS. 1 ,l,portpack;
LIT. 1,1; {init loop variable k}
6D
6D
5C
IE
2F
ASNSLE ASNL.
L#1013:;
EC
21 5B
6D
47
6D
49
6D
27
6D
29
6D
6D
26
IE
18
53
]
IE
18
53
D7
8C
IE
18
53
]
IE
18
53
D7
8C
IE
11
i
5C
19
59
E4
35
49 F7
6E
00
0000
31
29F7
9P19
59
18
5F
0000
0023
23
REFSLE
LIT4B.F
GR
SKIPNZI
REFSLE
LIT8
LOCL
REFDX
REFSLE
LIT8
LOCL
ASNDX
REFSLE
LIT8
LOCL
REFDX
REFSLE
LIT8
LOCL
ASNDX
REFSLE
LIT4A.1
ADD
ASNSLE
LIT8N
SKIP
L#1012:;
L#1017:;
REFDL .
5
ASNDLE
REFDL.
1
ASNDLE
LIT8N
SKIP
L#1005:;
L#1000:;
LIT8
RETURN
REFL.
LIT.
GRT.
JUMPT.
REFL.
REFLX.
REFL.
ASNLX
REFL.
REFLX
REFL.
ASNLX
.
REFL.
LIT.
ADD.
ASNL.
JUMP.
REFL.
ASNL.
REFL.
ASNL.
JUMP.
1,109;
1/109; (check loop variable k)
1,15;
1;
L#1012;
1,109; {e_array(k+D : =
2,71; e_array(k)}
1,109; (note: an optimization!
2,73; base+1 is calculated}
1,109; {f_array(k+D : =
1,39; f_array(k)>
1,109;
2,41; {note: an optimization!
base+1 is calculated}
1,109; (increment k}
1,1;
1;
1,109;
L#1013; (go to loop check}
2,5; (e_array(l) := e}
2,73;
2,1; (f_array(l) := f}
2,41;
L#1004; (go to beginning}
PROCEND. 110,0; (procedure return}
10
5F
PKGDEF. $init.widrowf .0000,12;
(procedure header for package body}
CALLI CALLGS. $init. textio. 0000 , textio;
CALLI CALLGS. $init.portpack .0000
,
portpack
;
L#2000:
;
LIT4A.0 PKGEND. 0;
RETURN
FINI
Integer Standard Lattice Source listing
with portpack;
use portpack;

-- Standard Lattice Algorithm
-- April 18, 1984
-- Ken Albin, Dept. of Electrical and Computer Engineering
-- Kansas State University, Manhattan, KS 66506
-- This program is based on the Standard Lattice coded in the
-- AAMP preliminary report.

package LATTICEI is
   procedure LATTICE;
end LATTICEI;

package body LATTICEI is
   procedure LATTICE is
      pragma suppress (index_check);
      pragma suppress (range_check);
      stages : constant integer := 16;
      type number_system is new integer;
      type values is array (1..stages) of number_system;
      loop_count : integer;
      present_w : number_system;
      present_e : number_system;
      next_w : number_system;
      next_e : number_system;
      k : values;
      wl : values;
      v : values;
      beta : constant := 1;
      betal : constant := 2;
      alpha : constant := 0;
   begin
      loop
         present_w := number_system(adc_in);
         present_e := present_w;
         for i in 1..stages loop
            loop_count := i;
            next_e := present_e -
                      k(loop_count) * wl(loop_count);
            next_w := wl(loop_count) -
                      k(loop_count) * present_e;
            v(loop_count) := beta * v(loop_count) +
                             betal * (present_e * present_e +
                             wl(loop_count) * wl(loop_count));
            k(loop_count) := k(loop_count) + alpha *
                             (next_e * wl(loop_count) +
                             present_e * next_w) / v(loop_count);
            wl(loop_count) := present_w;
            present_w := next_w;
            present_e := next_e;
         end loop;
         dac_out := integer(present_e);
      end loop;
   end lattice;
begin
   null;
end latticei;
Integer Lattice Object listing
Macro/Instruction Definitions will be read from module
[TDJ.AAMP16]AAMP16.MLB
Program Size For Counter 1 = 71 Words Decimal.
CAPS Macro Assembler listing for module LATTICEI.OBJ
IDENT.
XREF.
XREF.
PACKAGE.
XDEF.
XDEF.
PROCDEF.
Opcodes Instruction Macro
0036 (procedure header}
L#1001:;
00 001C REFSI REFS.
41 ASNSL.l ASNL.
01 REFSL.l REFL.
42 ASNSL.2 ASNL.
11 LIT4A.1 LIT.
35 5C ASNSLE ASNL.
L#1004:;
3 5 IE REFSLE REFL.
10 18 LIT8 LIT.
EC GR GRT.
6A5B SKIPNZI JUMPT.
351E REFSLE REFL.
40 ASNL.O ASNL.
02 REFSL.2 REFL.
00 REFSL.O REFL.
14 LIT4A.4 REFLX.
53 LOCL
DO REFSX
00 REFSL.O REFL.
14 18 LIT8 REFLX.
53 LOCL
DO REFSX
E6 MPYI MPY.
E5 SUB SUB.
44 ASNSL.4 ASNL.
00 REFSL.O REFL.
' latticei', ' AAMP/ACAPS Code
Generator Version 1.6';
standard;
portpack;
latticei
;
$init. latticei. 0000;
lattice. latticei. 0001;
lattice. latticei. 0001, 54, 12;
Macro args.
1 ,0
,
portpack; {present_w :=
1,1; number_system(adc_in)
}
1,1; (present_e := present_w>
1,2;
1,1; (init loop variable i>
1,53;
1,53; (loop count check)
1,16;
l;
L#1003;
1,53; (loop_count := i>
1,0;
1,2; (next_e := present_e
1,0; - k (loop_count) *
1,4; wl (loop_count)
}
1,0;
1,20;
1;
l;
1,4;
1,0; (next_w := wl (loop_count)
14 18      LIT8        REFLX.   1,20;            - k(loop_count)
53         LOCL                                  * present_e}
D0         REFSX
00         REFSL.0     REFL.    1,0;
14         LIT4A.4     REFLX.   1,4;
53         LOCL
D0         REFSX
02         REFSL.2     REFL.    1,2;
E6         MPYI        MPY.     1;
E5         SUB         SUB.     1;
43         ASNSL.3     ASNL.    1,3;
11         LIT4A.1     LIT.     1,1;            {v(loop_count) := beta
00         REFSL.0     REFL.    1,0;             * v(loop_count) +betal
24 18      LIT8        REFLX.   1,36;            (present_e*present_e+
53         LOCL                                  wl(loop_count) *
D0         REFSX                                 wl(loop_count))}
E6         MPYI        MPY.     1;
12         LIT4A.2     LIT.     1,2;
02         REFSL.2     REFL.    1,2;
02         REFSL.2     REFL.    1,2;
E6         MPYI        MPY.     1;
00         REFSL.0     REFL.    1,0;
14 18      LIT8        REFLX.   1,20;
53         LOCL
D0         REFSX
00         REFSL.0     REFL.    1,0;
14 18      LIT8        REFLX.   1,20;
53         LOCL
D0         REFSX
E6         MPYI        MPY.     1;
E4         ADD         ADD.     1;
E6         MPYI        MPY.     1;
E4         ADD         ADD.     1;
00         REFSL.0     REFL.    1,0;
24 18      LIT8        ASNLX.   1,36;
53         LOCL
A6         ASNSX
00         REFSL.0     REFL.    1,0;            {k(loop_count) :=
14         LIT4A.4     REFLX.   1,4;             k(loop_count) +
53         LOCL                                  alpha * (next_e *
D0         REFSX                                 wl(loop_count) +
10         LIT4A.0     LIT.     1,0;             present_e * next_w)
04         REFSL.4     REFL.    1,4;             /
00         REFSL.0     REFL.    1,0;             v(loop_count)}
14 18      LIT8        REFLX.   1,20;
53         LOCL
D0         REFSX
E6         MPYI        MPY.     1;
02         REFSL.2     REFL.    1,2;
03         REFSL.3     REFL.    1,3;
E6         MPYI        MPY.     1;
E4         ADD         ADD.     1;
E6         MPYI        MPY.     1;
00         REFSL.0     REFL.    1,0;
24 18      LIT8        REFLX.   1,36;
53         LOCL
D0         REFSX
E7         DIVI        DIV.     1;
E4         ADD         ADD.     1;
00         REFSL.0     REFL.    1,0;
14         LIT4A.4     ASNLX.   1,4;
53         LOCL
A6         ASNSX
01         REFSL.1     REFL.    1,1;            {wl(loop_count) :=
00         REFSL.0     REFL.    1,0;             present_w}
14 18      LIT8        ASNLX.   1,20;
53         LOCL
A6         ASNSX
03         REFSL.3     REFL.    1,3;            {present_w := next_w}
41         ASNSL.1     ASNL.    1,1;
04         REFSL.4     REFL.    1,4;            {present_e := next_e}
42         ASNSL.2     ASNL.    1,2;
35 1E      REFSLE      REFL.    1,53;           {increment i}
11         LIT4A.1     LIT.     1,1;
E4         ADD         ADD.     1;
35 5C      ASNSLE      ASNL.    1,53;
70 19      LIT8N       JUMP.    L#1004;         {go to loop check}
59         SKIP
L#1003:;
L#1005:;
02         REFSL.2     REFL.    1,2;            {dac_out :=
0001 54    ASNSI       ASNS.    1,1,portpack;    integer(present_e)}
80 19      LIT8N       JUMP.    L#1001;         {go to beginning}
59         SKIP
L#1002:;
L#1000:;
36 18      LIT8        PROCEND. 54,0;           {procedure return}
5F         RETURN                               {never used}
                       PKGDEF.  $init.latticei.0000,12;
0000                                            {procedure header}
00 0023    CALLI       CALLGS.  $init.portpack.0000,portpack;
L#2000:;
10         LIT4A.0     PKGEND.  0;
5F         RETURN
20         NOP
FINI
Floating-point Lattice Object listing
Macro/Instruction Definitions will be read from module
[TDJ.AAMP16]AAMP16.MLB
Program Size For Counter 1 = 85 Words Decimal.
CAPS Macro Assembler listing for module LATTICEF.OBJ
IDENT. 'latticef', 'AAMP/ACAPS Code
Generator Version 1.6';
XREF. standard;
XREF. portpack;
PACKAGE. latticef;
XDEF. $init.latticef.0000;
XDEF. lattice.latticef.0001;
PROCDEF. lattice.latticef.0001,112,12;
Opcodes    Instruction Macro    Macro args.
0070                                            {procedure header}
0000 8125  LIT32       LIT.     2,1.00000000;
00
69 F7      ASNDLE      ASNL.    2,105;
0082 25    LIT32       LIT.     2,2.00000000;
0000
6B F7      ASNDLE      ASNL.    2,107;
0000 0025  LIT32       LIT.     2,0.00000000;
00
6D F7      ASNDLE      ASNL.    2,109;
L#1001:;
0000 1C    REFSI       REFS.    1,0,portpack;
65         CVTSD       CONVERT. 1,5,0,0;        {present_w :=
D9         CVTDF                                 number_system(adc_in)}
C1         ASNDL.1     ASNL.    2,1;
31         REFDL.1     REFL.    2,1;            {present_e := present_w}
C3         ASNDL.3     ASNL.    2,3;
11         LIT4A.1     LIT.     1,1;            {init loop variable i}
6F 5C      ASNSLE      ASNL.    1,111;
L#1004:;
6F 1E      REFSLE      REFL.    1,111;          {loop count check}
10 18      LIT8        LIT.     1,16;
EC         GR          GRT.     1;
6D 5B      SKIPNZI     JUMPT.   L#1003;
6F 1E      REFSLE      REFL.    1,111;          {loop_count := i}
40         ASNSL.0     ASNL.    1,0;
33         REFDL.3     REFL.    2,3;            {next_e := present_e
00         REFSL.0     REFL.    1,0;             - k(loop_count)
17         LIT4A.7     REFLX.   2,7;             * wl(loop_count)}
53         LOCL
D7         REFDX
00         REFSL.0     REFL.    1,0;
27 18      LIT8        REFLX.   2,39;
53         LOCL
D7         REFDX
86         MPYF        MPY.     5;
85         SUBF        SUB.     5;
C7         ASNDL.7     ASNL.    2,7;
00         REFSL.0     REFL.    1,0;            {next_w := wl(loop_count)
27 18      LIT8        REFLX.   2,39;            - k(loop_count)
53         LOCL                                  * present_e}
D7         REFDX
00         REFSL.0     REFL.    1,0;
17         LIT4A.7     REFLX.   2,7;
53         LOCL
D7         REFDX
33         REFDL.3     REFL.    2,3;
86         MPYF        MPY.     5;
85         SUBF        SUB.     5;
C5         ASNDL.5     ASNL.    2,5;
69 22      REFDLE      REFL.    2,105;          {v(loop_count) := beta
00         REFSL.0     REFL.    1,0;             * v(loop_count) +betal
47 18      LIT8        REFLX.   2,71;            (present_e*present_e
53         LOCL                                  + wl(loop_count)
D7         REFDX                                 * wl(loop_count))}
86         MPYF        MPY.     5;
6B 22      REFDLE      REFL.    2,107;
33         REFDL.3     REFL.    2,3;
33         REFDL.3     REFL.    2,3;
86         MPYF        MPY.     5;
00         REFSL.0     REFL.    1,0;
27 18      LIT8        REFLX.   2,39;
53         LOCL
D7         REFDX
00         REFSL.0     REFL.    1,0;
27 18      LIT8        REFLX.   2,39;
53         LOCL
D7         REFDX
86         MPYF        MPY.     5;
84         ADDF        ADD.     5;
86         MPYF        MPY.     5;
84         ADDF        ADD.     5;
00         REFSL.0     REFL.    1,0;
47 18      LIT8        ASNLX.   2,71;
53         LOCL
8C         ASNDX
00         REFSL.0     REFL.    1,0;            {k(loop_count) :=
17         LIT4A.7     REFLX.   2,7;             k(loop_count) + alpha
53         LOCL                                  * (next_e*wl(loop_count)
D7         REFDX                                 + present_e*next_w)
                                                 /
86         MPYF        MPY.     5;               v(loop_count)}
33         REFDL.3     REFL.    2,3;
35         REFDL.5     REFL.    2,5;
86         MPYF        MPY.     5;
84         ADDF        ADD.     5;
86         MPYF        MPY.     5;
00         REFSL.0     REFL.    1,0;
47 18      LIT8        REFLX.   2,71;
53         LOCL
D7         REFDX
87         DIVF        DIV.     5;
84         ADDF        ADD.     5;
00         REFSL.0     REFL.    1,0;
17         LIT4A.7     ASNLX.   2,7;
53         LOCL
8C         ASNDX
31         REFDL.1     REFL.    2,1;            {wl(loop_count) :=
00         REFSL.0     REFL.    1,0;             present_w}
27 18      LIT8        ASNLX.   2,39;
53         LOCL
8C         ASNDX
35         REFDL.5     REFL.    2,5;            {present_w := next_w}
C1         ASNDL.1     ASNL.    2,1;
37         REFDL.7     REFL.    2,7;            {present_e := next_e}
C3         ASNDL.3     ASNL.    2,3;
6F 1E      REFSLE      REFL.    1,111;          {increment i}
11         LIT4A.1     LIT.     1,1;
E4         ADD         ADD.     1;
6F 5C      ASNSLE      ASNL.    1,111;
73 19      LIT8N       JUMP.    L#1004;         {go to loop check}
59         SKIP
L#1003:;
L#1005:;
33         REFDL.3     REFL.    2,3;            {dac_out :=
DB         CVTFD       CONVERT. 5,1,0,0;         integer(present_e)}
DA         CVTDS
0001 54    ASNSI       ASNS.    1,1,portpack;
87 19      LIT8N       JUMP.    L#1001;         {go to beginning}
59         SKIP
L#1002:;
L#1000:;
70 18      LIT8        PROCEND. 112,0;          {procedure return}
5F         RETURN                               {never used}
                       PKGDEF.  $init.latticef.0000,12;
0000                                            {procedure header for package}
00 0023    CALLI       CALLGS.  $init.portpack.0000,portpack;
L#2000:;
10         LIT4A.0     PKGEND.  0;
5F RETURN
20 NOP
FINI
Integer ADATESTS Source listing
Loop Structure Test
April 18, 1984
Ken Albin, Dept. of Electrical and Computer Engineering
Kansas State University, Manhattan, KS 66506
This program attempts to test the efficiency of various
compiled structures available in Ada.
package adatests is
procedure dummy;
procedure stuff;
end adatests;
package body adatests is
procedure dummy is
-- Nothing goes on here - this is just to look at calling code.
begin
null;
end;
procedure stuff is
-- This section tests various control structures found in Ada.
type number_system is new integer;
done: boolean := false;
A, B, C, D, E, F, G: number_system;
function add_seven ( junk_in: number_system)
return number_system is
begin
return junk_in + 7;
end add_seven;
begin
while not done loop
done := true;
end loop;
for count in 1..5 loop
null;
end loop;
loop
null;
exit;
end loop;
-- The following is a test to see the reordering (if any)
-- performed.
A := B + C * (D + E * (F + G));
-- The following is a test to see if an optimization is made to
-- avoid storing and then immediately retrieving a variable.
-- First argument matches last assigned (A).
A := B + C;
D := A + G;
-- Second argument matches last assigned (B).
C := D + E;
B := F + C;
-- Common subexpression elimination test.
A := (B + C) * D - (B + C);
-- Duplicate argument instead of fetch again.
A := D * D;
A := (D + 5) * (D + 5);
-- Removal of loop invariant expressions.
for count in 1..5 loop
A := 1;
E := 1 + 3;
B := C + D;
end loop;
-- Test to see if the increment instruction is used.
A := A + 1;
-- Sample procedure call.
dummy;
-- Sample function call.
B := add_seven(A);
end stuff;
begin
null;
end adatests;
Integer ADATESTS Object listing
Macro/Instruction Definitions will be read from module
[TDJ.AAMP16]AAMP16.MLB
Program Size For Counter 1 = 68 Words Decimal.
CAPS Macro Assembler listing for module ADATESTS.OBJ
IDENT. 'adatests', 'AAMP/ACAPS Code
Generator Version 1.6';
XREF. standard;
PACKAGE. adatests;
XDEF. $init.adatests.0000;
XDEF. dummy.adatests.0001;
XDEF. stuff.adatests.0002;
PROCDEF. dummy.adatests.0001,0,12;
Opcodes    Instruction Macro    Macro args.
0000                                            {procedure header for dummy}
L#1000:;
10         LIT4A.0     PROCEND. 0,0;            {null procedure body}
5F         RETURN
0000                                            {procedure header for add_seven}
00         REFSL.0     REFL.    1,0;            {return junk_in + 7}
17         LIT4A.7     LIT.     1,7;
E4         ADD         ADD.     1;
11         LIT4A.1     RETURN.  1;
5F         RETURN
L#2000:;
11         LIT4A.1     PROCEND. 0,1;
5F         RETURN
20         NOP         PROCDEF. stuff.adatests.0002,9,12;
0009                                            {procedure header for stuff}
10         LIT4A.0     LIT.     1,0;
40         ASNSL.0     ASNL.    1,0;
L#3001:;
00         REFSL.0     REFL.    1,0;            {initialize done:=fal
05 5B      SKIPNZI     JUMPT.   L#3002;
11         LIT4A.1     LIT.     1,1;
40         ASNSL.0     ASNL.    1,0;
07 19      LIT8N       JUMP.    L#3001;         {end loop}
59         SKIP
L#3003:;
L#3002:;
11         LIT4A.1     LIT.     1,1;            {init count := 1}
48         ASNSL.8     ASNL.    1,8;
L#3005:;
08         REFSL.8     REFL.    1,8;            {check count}
15         LIT4A.5     LIT.     1,5;
EC         GR          GRT.     1;
07 5B      SKIPNZI     JUMPT.   L#3004;
                                                {null loop body}
08         REFSL.8     REFL.    1,8;            {increment count}
11         LIT4A.1     LIT.     1,1;
E4         ADD         ADD.     1;
48         ASNSL.8     ASNL.    1,8;
0B 19      LIT8N       JUMP.    L#3005;         {go to loop check}
59         SKIP
L#3004:;
L#3006:;
L#3007:;                                        {begin loop}
03 1D      SKIPI       JUMP.    L#3008;         {exit loop}
04 19      LIT8N       JUMP.    L#3007;         {end loop}
59         SKIP
L#3009:;
L#3008:;
02         REFSL.2     REFL.    1,2;            {A:=B+C*(D+E*(F+G))}
03         REFSL.3     REFL.    1,3;
04         REFSL.4     REFL.    1,4;
05         REFSL.5     REFL.    1,5;
06         REFSL.6     REFL.    1,6;
07         REFSL.7     REFL.    1,7;
E4         ADD         ADD.     1;
E6         MPYI        MPY.     1;
E4         ADD         ADD.     1;
E6         MPYI        MPY.     1;
E4         ADD         ADD.     1;
41         ASNSL.1     ASNL.    1,1;
02         REFSL.2     REFL.    1,2;            {A := B + C}
03         REFSL.3     REFL.    1,3;
E4         ADD         ADD.     1;
41         ASNSL.1     ASNL.    1,1;
01         REFSL.1     REFL.    1,1;            {D := A + G}
07         REFSL.7     REFL.    1,7;
E4         ADD         ADD.     1;
44         ASNSL.4     ASNL.    1,4;
04         REFSL.4     REFL.    1,4;            {C := D + E}
05         REFSL.5     REFL.    1,5;
E4         ADD         ADD.     1;
43         ASNSL.3     ASNL.    1,3;
06         REFSL.6     REFL.    1,6;            {B := F + C}
03         REFSL.3     REFL.    1,3;
E4         ADD         ADD.     1;
42         ASNSL.2     ASNL.    1,2;
02         REFSL.2     REFL.    1,2;            {A:=(B+C)*D-(B+C)}
03         REFSL.3     REFL.    1,3;
E4         ADD         ADD.     1;
04         REFSL.4     REFL.    1,4;
E6         MPYI        MPY.     1;
02         REFSL.2     REFL.    1,2;
03         REFSL.3     REFL.    1,3;
E4         ADD         ADD.     1;
E5         SUB         SUB.     1;
41         ASNSL.1     ASNL.    1,1;
04         REFSL.4     REFL.    1,4;            {A := D * D}
04         REFSL.4     REFL.    1,4;
E6         MPYI        MPY.     1;
41         ASNSL.1     ASNL.    1,1;
04         REFSL.4     REFL.    1,4;            {A:=(D+5)*(D+5)}
15         LIT4A.5     LIT.     1,5;
E4         ADD         ADD.     1;
04         REFSL.4     REFL.    1,4;
15         LIT4A.5     LIT.     1,5;
E4         ADD         ADD.     1;
E6         MPYI        MPY.     1;
41         ASNSL.1     ASNL.    1,1;
11         LIT4A.1     LIT.     1,1;            {init count := 1}
48         ASNSL.8     ASNL.    1,8;
L#3011:;
08         REFSL.8     REFL.    1,8;            {check count}
15         LIT4A.5     LIT.     1,5;
EC         GR          GRT.     1;
0F 5B      SKIPNZI     JUMPT.   L#3010;
11         LIT4A.1     LIT.     1,1;            {A := 1}
41         ASNSL.1     ASNL.    1,1;
14         LIT4A.4     LIT.     1,4;            {E := 1 + 3}
45         ASNSL.5     ASNL.    1,5;            {note: an optimiza
03         REFSL.3     REFL.    1,3;            {B := C + D}
04         REFSL.4     REFL.    1,4;
E4         ADD         ADD.     1;
42         ASNSL.2     ASNL.    1,2;
08         REFSL.8     REFL.    1,8;            {increment count}
11         LIT4A.1     LIT.     1,1;
E4         ADD         ADD.     1;
48         ASNSL.8     ASNL.    1,8;
13 19      LIT8N       JUMP.    L#3011;         {go to check c
59         SKIP
L#3010:;
L#3012:;
01         REFSL.1     REFL.    1,1;            {A := A + 1}
11         LIT4A.1     LIT.     1,1;
E4         ADD         ADD.     1;
41         ASNSL.1     ASNL.    1,1;
0000 23    CALLI       CALLG.   dummy.adatests.0001;
01         REFSL.1     REFL.    1,1;            {B := add_seven(A)}
0004 23    CALLI       CALLL.   add_seven.adatests.0000;
42         ASNSL.2     ASNL.    1,2;
L#3000:;
29         LIT4B.9     PROCEND. 9,0;
5F         RETURN
20         NOP         PKGDEF.  $init.adatests.0000,12;
0000                                            {procedure header}
L#4000:;
10         LIT4A.0     PKGEND.  0;
5F         RETURN
FINI
Floating-point ADATESTS Object listing
Macro/Instruction Definitions will be read from module
[TDJ.AAMP16]AAMP16.MLB
Program Size For Counter 1 = 83 Words Decimal.
CAPS Macro Assembler listing for module ADATESTSF.OBJ
IDENT. 'adatestsf', 'AAMP/ACAPS Code
Generator Version 1.6';
XREF. standard;
PACKAGE. adatestsf;
XDEF. $init.adatestsf.0000;
XDEF. dummy.adatestsf.0001;
XDEF. stuff.adatestsf.0002;
PROCDEF. dummy.adatestsf.0001,0,12;
Opcodes    Instruction Macro    Macro args.
0000                                            {procedure header for dummy}
L#1000:;
10         LIT4A.0     PROCEND. 0,0;            {null body of dummy}
5F         RETURN
0000                                            {procedure header for function add_seven}
30         REFDL.0     REFL.    2,0;            {arg passed on stack}
0083 25    LIT32       LIT.     2,7.00000000;
6000
84         ADDF        ADD.     5;              {return junk_in + 7}
12         LIT4A.2     RETURN.  2;
L#2000:;
12         LIT4A.2     PROCEND. 0,2;
5F         RETURN
20         NOP         PROCDEF. stuff.adatestsf.0002,16,12;
0010                                            {procedure header for stuff}
10         LIT4A.0     LIT.     1,0;            {init done:=false}
40         ASNSL.0     ASNL.    1,0;
L#3001:;
00         REFSL.0     REFL.    1,0;            {while test}
05 5B      SKIPNZI     JUMPT.   L#3002;
11         LIT4A.1     LIT.     1,1;            {set done:=true}
40         ASNSL.0     ASNL.    1,0;
07 19      LIT8N       JUMP.    L#3001;         {go to while test}
59         SKIP
L#3003:;
L#3002:;
11         LIT4A.1     LIT.     1,1;            {init loop variable}
4F         ASNSL.F     ASNL.    1,15;
L#3005:;
0F         REFSL.F     REFL.    1,15;           {test loop variable}
15         LIT4A.5     LIT.     1,5;
EC         GR          GRT.     1;
07 5B      SKIPNZI     JUMPT.   L#3004;
                                                {null body of loop}
0F         REFSL.F     REFL.    1,15;           {inc loop variable}
11         LIT4A.1     LIT.     1,1;
E4         ADD         ADD.     1;
4F         ASNSL.F     ASNL.    1,15;
0B 19      LIT8N       JUMP.    L#3005;         {go to loop test}
59         SKIP
L#3004:;
L#3006:;
L#3007:;                                        {beginning of loop}
03 1D      SKIPI       JUMP.    L#3008;         {exit loop}
04 19      LIT8N       JUMP.    L#3007;         {go to loop beginning}
59         SKIP
L#3009:;
L#3008:;
33         REFDL.3     REFL.    2,3;            {A:=B+C*(D+E*(F+G))}
35         REFDL.5     REFL.    2,5;
37         REFDL.7     REFL.    2,7;
39         REFDL.9     REFL.    2,9;
3B         REFDL.B     REFL.    2,11;
3D         REFDL.D     REFL.    2,13;
84         ADDF        ADD.     5;
86         MPYF        MPY.     5;
84         ADDF        ADD.     5;
86         MPYF        MPY.     5;
84         ADDF        ADD.     5;
C1         ASNDL.1     ASNL.    2,1;
33         REFDL.3     REFL.    2,3;            {A:=B+C}
35         REFDL.5     REFL.    2,5;
84         ADDF        ADD.     5;
C1         ASNDL.1     ASNL.    2,1;
31         REFDL.1     REFL.    2,1;            {D:=A+G}
3D         REFDL.D     REFL.    2,13;
84         ADDF        ADD.     5;
C7         ASNDL.7     ASNL.    2,7;
37         REFDL.7     REFL.    2,7;            {C:=D+E}
39         REFDL.9     REFL.    2,9;
84         ADDF        ADD.     5;
C5         ASNDL.5     ASNL.    2,5;
3B         REFDL.B     REFL.    2,11;           {B:=F+C}
35         REFDL.5     REFL.    2,5;
84         ADDF        ADD.     5;
C3         ASNDL.3     ASNL.    2,3;
33         REFDL.3     REFL.    2,3;            {A:=(B+C)*D-(B+C)}
35         REFDL.5     REFL.    2,5;
84         ADDF        ADD.     5;
37         REFDL.7     REFL.    2,7;
86         MPYF        MPY.     5;
33         REFDL.3     REFL.    2,3;
35         REFDL.5     REFL.    2,5;
84         ADDF        ADD.     5;
85         SUBF        SUB.     5;
C1         ASNDL.1     ASNL.    2,1;
37         REFDL.7     REFL.    2,7;            {A:=D*D}
37         REFDL.7     REFL.    2,7;
86         MPYF        MPY.     5;
C1         ASNDL.1     ASNL.    2,1;
37         REFDL.7     REFL.    2,7;            {A:=(D+5.0)*(D+5.0
0000 8325  LIT32       LIT.     2,5.00000000;
20
84         ADDF        ADD.     5;
37         REFDL.7     REFL.    2,7;
0083 25    LIT32       LIT.     2,5.00000000;
2000
84         ADDF        ADD.     5;
86         MPYF        MPY.     5;
C1         ASNDL.1     ASNL.    2,1;
11         LIT4A.1     LIT.     1,1;            {init count:=1}
4F         ASNSL.F     ASNL.    1,15;
L#3011:;
0F         REFSL.F     REFL.    1,15;           {loop test}
15         LIT4A.5     LIT.     1,5;
EC         GR          GRT.     1;
1D 5B      SKIPNZI     JUMPT.   L#3010;
0000 8125  LIT32       LIT.     2,1.00000000;   {A:=1.0}
00
C1         ASNDL.1     ASNL.    2,1;
0000 8125  LIT32       LIT.     2,1.00000000;   {E:=1+3}
00
0082 25    LIT32       LIT.     2,3.00000000;
4000
84         ADDF        ADD.     5;
C9         ASNDL.9     ASNL.    2,9;
35         REFDL.5     REFL.    2,5;            {B:=C+D}
37         REFDL.7     REFL.    2,7;
84         ADDF        ADD.     5;
C3         ASNDL.3     ASNL.    2,3;
0F         REFSL.F     REFL.    1,15;           {increment count}
11         LIT4A.1     LIT.     1,1;
E4         ADD         ADD.     1;
4F         ASNSL.F     ASNL.    1,15;
21 19      LIT8N       JUMP.    L#3011;         {go to loop test}
59         SKIP
L#3010:;
L#3012:;
31         REFDL.1     REFL.    2,1;            {A:=A+1.0}
0000 8125  LIT32       LIT.     2,1.00000000;
00
84         ADDF        ADD.     5;
C1         ASNDL.1     ASNL.    2,1;
0000 23    CALLI       CALLG.   dummy.adatestsf.0001;
31         REFDL.1     REFL.    2,1;
0004 23    CALLI       CALLL.   add_seven.adatestsf.0000;
C3         ASNDL.3     ASNL.    2,3;
L#3000:;
10 18      LIT8        PROCEND. 16,0;
5F         RETURN
                       PKGDEF.  $init.adatestsf.0000,12;
0000                                            {procedure header}
L#4000:;
10         LIT4A.0     PKGEND.  0;              {null procedure body}
5F         RETURN
FINI
AN EVALUATION OF ROCKWELL'S
ADVANCED ARCHITECTURE MICROPROCESSOR
FOR DIGITAL SIGNAL PROCESSING APPLICATIONS
by
KENNETH LEE ALBIN
B.S., Kansas State University, 1981
AN ABSTRACT OF A MASTER'S THESIS
submitted in partial fulfillment of the
requirements for the degree
MASTER OF SCIENCE
Department of Electrical and Computer Engineering
KANSAS STATE UNIVERSITY
Manhattan, Kansas
1984
ABSTRACT
This thesis examines the architecture of Rockwell's Advanced
Architecture Microprocessor (AAMP) and predicts performance on
signal processing algorithms. Performance that can be achieved
with high-level languages is also investigated.
The Electrical and Computer Engineering Department at Kansas
State University, in conjunction with Sandia National
Laboratories, has attempted to identify processors which are
most appropriate for implementation of real-time adaptive linear
prediction in intruder detection devices. The ideal processor
would require very little power, be easy to interface, perform
multiplications very quickly and use floating-point arithmetic.
The AAMP is a CMOS/SOS microprocessor that has a stack
architecture with a 16-bit wide data path. Single and double
precision integer and fractional as well as single and extended
precision floating-point data types are supported on a single
chip. It consumes approximately 50 mW at its rated 20 MHz clock
rate and uses a single 5 volt supply.
This thesis consists of three parts. The first part is an
introduction to the AAMP's architecture, instruction set and data
structures. The second part details the investigation and
findings from the evaluation. Included in this section is a
discussion of ways to optimize the Widrow and Lattice algorithms
for the processor's architecture. The third part contains the
results and conclusions of the evaluation in a concise form.


