The study of trace cache memory on superscalar DLX processor by Apisake, Hongwitayakorn
The Study of Trace Cache Memory on
Superscalar DLX Processor
Apisake Hongwitayakorn
Thesis submitted for the degree of
Master of Engineering Science













1.3 Trace Cache Memory
1.4 Contribution of the Thesis
1.5 Outline of the Thesis
2 Background
2.1 Overview
2.2 Trace Cache Architecture
2.2.1 The trace cache
2.3 Related Works





2.2.2 The fill unit
2.2.3 The branch predictor


















2.3.2 Other high bandwidth fetch mechanism
2.4 Conclusion




3.2.2 DLX data types
3.2.3 DLX addressing modes
3.2.4 DLX instruction types
3.3 The Superscalar DLX Model
3.4 The Fetch Unit
3.5 Conclusion
4 Experimental Setup
4.1 Trace Cache in the Superscalar DLX Processor













4.5 Simulation Testbench Configuration




5.2 Hits and Misses of the Trace Cache
5.3 Percentage of Trace Cache Hits and Misses
5.3.1 Trace cache hits










4.3.2.2 Fill-logic and fill-policy
Trace cache memory
4.3.3.1 Trace cache memory structure
4.3.3.2 Buffer-cache transfer
























5.4 Trace Cache Space Usage







A Companion CD-ROM Contents
A.1 DLX Source code
A.2 Test Programs
A.3 Simulation log files




C Excerpts from log files of DCT





















1.1 Organization of superscalar architecture.
Trace cache overview.
Trace cache architecture diagram.
The trace cache fetch mechanism.
A loop contains 3 segments
The trace cache fetch mechanism.
The Branch Address Cache.
Collapsing Buffer.
Big Endian byte ordering.
DLX instruction format.
Superscalar DLX structure.
Instruction cache structure.
Address-translation and cache-access.
Branch-target-buffer structure.
Trace cache placement in the superscalar DLX machine.









































Anatomy of Trace Information and Trace Content.
Trace cache memory structure.
Trace information portion of the trace cache memory.
Trace cache line selector is extracted starting from bit 3






5.1a Hits and misses of the trace cache and the instruction cache on bs-a
5.1b Hits and misses of the trace cache and the instruction cache on bs-r
5.1c Hits and misses of the trace cache and the instruction cache on bs-d
5.1d Hits and misses of the trace cache and the instruction cache on pn-20
5.1e Hits and misses of the trace cache and the instruction cache on pn-50
5.1f Hits and misses of the trace cache and the instruction cache on pn-100
5.1g Hits and misses of the trace cache and the instruction cache on Permute
5.1h Hits and misses of the trace cache and the instruction cache on DCT
5.2 Comparison between TC hit and Compulsory Miss and
Conflict Miss of DCT










Percentage of total trace cache hit of TC-4 and TC-8
Percentage of TC First Tag Hit of TC-4 and TC-8



























5.7 Percentage of TC Compulsory Miss of TC-4 and TC-8
5.8 Percentage of TC Conflict Miss of TC-4 and TC-8
5.9 TC-4 - Percentage of Cache Space Usage























Trace cache hits comparison table of bs-r on TC-4.














Instruction-level parallelism (ILP) is a technique to increase processor performance
through the simultaneous execution of multiple instructions. Superscalar processor
architectures implement ILP by providing multiple execution units to process
instructions in parallel. To achieve high performance, the execution units must be
occupied by a continuous series of instructions. Hence, the front-end of the processor
has to be expanded in order to supply a continuous stream of instructions for the
execution units. Although instruction-cache memory has been successfully used to
enhance the fetch mechanism of superscalar processors for years, it cannot perform
well enough for contemporary processors because of the nature of the statically
ordered instructions stored in the cache. Branch instructions are the major problem
because of the two possible directions of the branch outcome. They break up the
continuity of the static code into short run-length basic blocks. Therefore, a line of an
instruction cache can contain instructions that might be abandoned if they follow a
branch that will be taken.
Trace cache architecture has been developed to reduce the effect of the
problem. It has a sophisticated logic unit to capture dynamic instruction traces,
possibly including multiple basic blocks, and store them in a single line. Therefore, it
is most likely able to supply a larger segment of useful instructions in one hit.
Moreover, the trace cache was deliberately designed not to lengthen the processor
pipeline. It has been shown that trace cache can outperform instruction caches in
large-scale microprocessors, e.g. 16-instruction wide processors.
This research studies the effect of trace cache memory on smaller-scale
microprocessors like the superscalar DLX model that can process only 2 instructions
simultaneously. The study will investigate the performance of the experimental trace
cache compared to the existing instruction cache and also investigate the trade-offs in
varying trace cache size.
To my parents,
my wife, my family, my incoming child,
and everyone who believes in me.
This work contains no material which has been accepted for the award of any
other degree or diploma in any university or other tertiary institution and, to the best
of my knowledge and belief, contains no material previously published or written by
another person, except where due reference has been made in the text.
I give consent to this copy of my thesis, when deposited in the University Library,
being available for loan and photocopying.




First, I would like to thank Michael J. Liebelt, my supervisor, for his help and
support in everything. Without his excellent guidance and patience, my work
would definitely not have been achievable.
I appreciate the help that all staff in the Department of Electrical and Electronic
Engineering have given me over these years, since the first day I was here. Thank you
for everything.
I also want to thank my colleagues at Silpakorn University and my students
out there for their faith and belief that I can do this. In addition, a big thank-you to
AusAID and the Thai Royal Government for the scholarship.
And, of course, I'd like to thank all my teachers who have given me good knowledge
from the very first day at school.
I have a very long list of friends and relatives whom I want to thank. If I wrote it
down, it would dominate the thesis. So, I would like to thank everyone with all my
heart.
Last but definitely not least, I want to thank my mom, dad, sister, and brothers
for everything. Also, I want to thank my wife, who is always there for me and who, at
the time I wrote this, is about to deliver the best gift I've ever got. It is the first child






The performance requirements of high performance computers are escalating
tremendously in order to respond to the complexity of modern software applications.
Much research has been conducted on techniques to improve the performance of
microprocessors as they are deployed in almost every level of modern computers. The
objective is to increase the number of instructions that can be executed per unit time.
Researchers in the field of semiconductor technology propose to increase processor
clock frequency, the reciprocal of the clock cycle time. Meanwhile, computer architects
attempt to modify processor microarchitecture and improve compiler technology in
order to execute multiple instructions simultaneously.
Instruction-level parallelism (ILP) is the dominant technique exploited in
modern processor microarchitecture. Parallelism of incoming static sequential
instructions is detected in order to execute multiple instructions concurrently. This
technique can be implemented using both software and hardware approaches
depending on the type of processor. VLIW (Very Long Instruction Word) and
superscalar are two types of ILP processors [23], [29]. The former aggressively uses
compiler techniques to obtain high levels of parallelism. Hardware techniques are
used in the latter to capture incoming instructions and dynamically determine those
that can be executed in parallel. Consequently, software applications can be run on
superscalar processors without recompiling [12]. In this thesis, we focus on





The operation cycle of a superscalar processor begins with fetching instructions
from a static program into the processor using the instruction fetching mechanism and
decoding them at the decoder unit. After this stage, the decoded instructions will be
dispatched and temporarily accumulated in an instruction buffer called the window of
execution. These instructions are no longer constrained by static program order.
Therefore, they are free to be executed in parallel and ready to be issued
simultaneously into the appropriate functional units located in the instruction
execution mechanism after their operands become available, subject to data
dependence and resource constraints [14], [15], [25], [32]. Figure 1.1 shows the
diagram of superscalar architecture organization.
Figure 1.1: Organization of superscalar architecture
Exploiting ILP effectively means improving superscalar processor performance by
widening the window of execution, which increases the possibility of finding data-
independent instructions. More functional units are also required in order to be able to
execute more instructions concurrently. Ideally, instruction-fetching bandwidth
should correspond to the peak instruction dispatch and issue rate, to avoid the
bottleneck problem [25]. However, the constraint imposed by control dependence
impedes the ability of the fetching mechanism to fetch instructions continuously, so it
becomes important to overcome this constraint.





1.3 Trace Cache Memory
The fetch unit must be able to feed a continuous stream of instructions to the
window of execution as quickly as possible. It would be much easier if instructions
were all lined up in contiguous fashion from start to finish. Unfortunately, such
behavior is not found in typical application programs because they possess branch
instructions. Branch instructions, the causes of control dependence, are very common
in typical programs [8] and cause the instruction fetching mechanism to wait for
branch outcomes to determine whether the branches are taken or not taken. In [1], the
term basic block has been defined as an instruction group which has one entry point
and one exit point. Whenever a branch instruction is encountered, that will be the end
of the basic block. Typically, the average run-length of a basic block is about 4 to 6
instructions [24]. Therefore, the sequentiality of instruction addressing is disrupted
and the program is divided into numerous small basic blocks.
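As a minimal sketch of this definition, the following Python fragment splits an instruction stream into basic blocks by closing a block at every branch, and then computes the average run-length. The mnemonics in `BRANCH_OPS` and the function names are illustrative assumptions, not taken from the DLX model used in this thesis. The sketch follows the simplified rule stated above and does not start a new block at branch targets.

```python
# Hypothetical set of control-transfer mnemonics (an assumption for illustration).
BRANCH_OPS = {"beqz", "bnez", "j", "jal", "jr", "jalr"}

def split_basic_blocks(instructions):
    """Split a list of (address, mnemonic) pairs into basic blocks.

    A block is closed by each branch instruction, so every block has one
    entry point (its first instruction) and one exit point (its last).
    """
    blocks, current = [], []
    for addr, op in instructions:
        current.append((addr, op))
        if op in BRANCH_OPS:
            blocks.append(current)
            current = []
    if current:  # trailing instructions that are not closed by a branch
        blocks.append(current)
    return blocks

def average_run_length(blocks):
    """Average number of instructions per basic block (0.0 if there are none)."""
    return sum(len(b) for b in blocks) / len(blocks) if blocks else 0.0
```

Feeding a short stream containing two branches into `split_basic_blocks` yields three blocks, which shows how quickly branches fragment otherwise sequential code.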
The branch prediction method was introduced to lessen the problem of control
dependence by speculatively predicting the outcome of branches. However, there is a
problem of non-contiguous location of individual basic blocks inside the conventional
instruction cache. Basically, there are useless instructions lurking between useful
basic blocks that are scattered among different cache lines, so a single fetch might not
be so effective. Trace cache memory [24] is proposed not only to overcome this
crucial drawback, which blocks the possibility of fetching multiple basic blocks
concurrently, but also to diminish the latency of fetching, which is the flaw of related
prior research on high bandwidth fetching mechanisms. Moreover, the trace cache
was designed to work outside the main pipeline of the processor. Therefore, it does
not introduce an additional pipeline stage that would increase processing time.
Trace cache research has been conducted for very high performance
microprocessors, i.e. 16 instruction-wide superscalar processors. Rotenberg et al. [24]
showed that the fetching performance of a processor using a trace cache is improved
by 34% for integer benchmarks and 16% for floating-point benchmarks. Meanwhile,
the trace cache work on enhanced features conducted by Patel [21] showed that a
trace cache can outperform an aggressive instruction cache scheme by 14% of overall
performance and increase the fetch bandwidth by 34%. Recently, Intel Corporation
adopted trace cache technology for the Intel NetBurst micro-architecture in its
mainstream commercial processor, the Pentium 4 [11].
Introduction
There has been no reported study of trace cache performance for a small-scale
microprocessor. Therefore, this research studies the effect of different trace
cache memory configurations on a VHDL model of a superscalar DLX machine [10],
which can process only 2 instructions simultaneously. The experimental trace cache
is designed in two main configurations, TC-4 and TC-8, for 4 instruction-wide and
8 instruction-wide trace caches respectively. Each configuration
is studied with a varying number of trace cache lines to understand the trade-offs
between performance and cache size.
1.4 Contribution of the Thesis
The contributions of this work can be summarized as follows:
• An analysis of the trade-offs between performance and trace cache size for a
narrow-issue superscalar processor.
• An indication of whether trace caches are a worthwhile enhancement for a
narrow-issue superscalar DLX processor.
• A greater understanding of the performance characteristics of trace caches.
1.5 Outline of the Thesis
The thesis is organized into 6 chapters. Chapter 2 describes the background of
trace cache design and related research work. Details of the DLX architecture and the
superscalar DLX model that have been used in this research are presented in
Chapter 3.
Chapter 4 explains the experimental setup and methods. The results of the
experiment and the corresponding analysis are presented in Chapter 5. This chapter also







Although the conventional instruction cache has served as a good source of
instructions at high fetch rates for a long time, it cannot satisfy the high instruction
consumption of wide issue processors. Instructions residing in the instruction cache
are placed in compiled order and, unfortunately, typical programs possess many
branch instructions. Consequently, several small basic blocks exist in run-time
execution and disrupt the continuity of the static instruction sequence in a wide
instruction cache line. Even though the processors are designed to fetch several
instructions in each line at the same time, many fetched instructions are abandoned.
Therefore, fetching efficiency is low in this circumstance. To avoid this instruction-
supply bottleneck, the trace cache was introduced to increase effective instruction
fetch bandwidth.
In a superscalar architecture, the sequences of executed instructions from the
pipeline are dynamic and divided into several basic blocks by control instructions
(e.g. branches and returns). These are called instruction traces. Several such
instructions grouped together resemble a VLIW instruction, but are formed in
dynamic sequence. The trace cache counts on two important properties of dynamic
sequences of instructions, i.e. temporal locality and branch behavior [24]. That is, the
most recently used instructions are most likely to be reused in the near future and
branches are mostly biased to one direction. If these dynamic traces are collected in a
special kind of cache memory, the performance of the fetch mechanism will possibly
be increased. There will be no need to fetch several times from different lines of the










Figure 2.1 Trace cache overview
Figure 2.1 demonstrates the principle of the trace cache scheme. There are
four basic blocks (A, B, C, and D) residing in non-contiguous locations in the
instruction cache. They are logically connected together in run-time manner.
Unfortunately, they are split in physical location due to static-compiled order; this is
called "partial fetch" since each fetch could obtain just some part of all of the desired
instructions. Time is wasted reading these instructions, as 3 cache reads are required
(in this example). When these basic blocks are issued through the pipeline of the
processor core they are rearranged in dynamic sequence or trace order (A, B, C, and
D) to perform the task. This trace can be collected in the trace cache line. According
to temporal locality and branch behavior as mentioned earlier, this trace is most likely
to be used again in exactly the same sequence corresponding to the matching of fetch
address and multiple predicted branches. Then, all instructions in this trace can be
read in one fetch from the trace cache to the pipeline. This scheme obviously has the




2.2 Trace Cache Architecture
The trace cache architecture is composed of four main components:
1. the trace cache (trace container),
2. the fill unit,
3. the branch predictor, and
4. the instruction cache.
As shown in figure 2.2, instructions can be read from the instruction cache or
the trace cache depending on the outcome of the hit logic, which processes the
incoming fetch address and the outcomes of the branch prediction unit. If it signals a hit,
the trace cache will deliver instructions. Otherwise, instructions are supplied from the
instruction cache. Instructions residing in the trace cache are collected by the fill unit,
which copies instruction traces entering the processor execution pipeline.
Figure 2.2: Trace cache architecture diagram













2.2.1 The Trace Cache
The trace cache container is an array of fast-access memory, which dominates
the area of the trace cache circuit. It collects several lines of trace issued from the fill
unit. Each individual line of the trace cache carries information similar to that
in an ordinary instruction cache, i.e. a valid bit to indicate availability of data in the
line and a tag to identify the starting address of the trace. Moreover, there are some
extended fields related to branch addresses because there might be more than one
basic block inside the trace. All of this information is processed by the trace cache hit
logic to determine whether an instruction fetch results in a trace cache hit or miss.
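The line fields and hit test described above can be sketched as follows. This is an illustrative Python model under stated assumptions, not the VHDL design of this thesis: the names `TraceLine` and `TraceCache`, the direct-mapped indexing on word-aligned addresses, the use of the full start address as the tag, and the encoding of the branch fields as a tuple of taken/not-taken flags are all hypothetical choices.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class TraceLine:
    valid: bool = False                    # line holds usable data
    tag: int = 0                           # starting address of the trace
    branch_flags: Tuple[bool, ...] = ()    # assumed encoding: outcome of each embedded branch
    instructions: List[str] = field(default_factory=list)

class TraceCache:
    def __init__(self, num_lines: int = 8):
        self.lines = [TraceLine() for _ in range(num_lines)]

    def _index(self, fetch_addr: int) -> int:
        # Assumption: direct-mapped, indexed by the word address.
        return (fetch_addr >> 2) % len(self.lines)

    def fill(self, start_addr: int, branch_flags, instructions) -> None:
        """Store a finalized trace, overwriting whatever occupied the line."""
        self.lines[self._index(start_addr)] = TraceLine(
            True, start_addr, tuple(branch_flags), list(instructions))

    def lookup(self, fetch_addr: int, predictions) -> Optional[List[str]]:
        """Return the trace on a hit, or None on a miss.

        A hit needs a valid line, a matching tag, and branch predictions
        that agree with every branch outcome recorded in the line.
        """
        line = self.lines[self._index(fetch_addr)]
        n = len(line.branch_flags)
        hit = (line.valid
               and line.tag == fetch_addr
               and len(predictions) >= n
               and tuple(predictions[:n]) == line.branch_flags)
        return line.instructions if hit else None
```

On a miss (`None`), a fetch unit built this way would fall back to the conventional instruction cache.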
2.2.2 The Fill Unit
The fill unit is an essential component of the trace cache organization as all of
the instructions accommodated in the trace cache come from this section. It gathers
dynamic instruction sequences from the processor pipeline, merges the incoming
instructions with existing instructions to form a packet, provides the attached
information for each trace cache line as described above, and sends the packet to a
line of the trace cache container. The essential step in the formation of a trace packet
is packet finalization. The maximum number of instructions n and the number of
predicted branches m are the main trace-packet delimiters. Both Patel [21] and
Rotenberg et al. [24] have built models which carry 16 instructions (n = 16) with a
maximum of three branch predictions (m = 3). Then, four conditions for finalizing the
trace-packet are:
1. the packet contains 16 instructions, or
2. the packet contains 3 conditional branches, or
3. the packet contains a single indirect jump, return, or trap instruction, or
4. incoming instructions could not be concatenated with the existing
instructions since the sum would exceed 16 instructions.
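The four finalization conditions can be sketched in Python as shown below. This is a simplified illustration under assumptions: the `Instr` type, its `kind` strings, and the notion that instructions arrive in groups (for example, one basic block at a time) are hypothetical and are not taken from the models in [21] or [24].

```python
from dataclasses import dataclass
from typing import List

MAX_INSTRUCTIONS = 16   # n in the text
MAX_BRANCHES = 3        # m in the text
TERMINATORS = {"indirect_jump", "return", "trap"}  # assumed kind names

@dataclass
class Instr:
    kind: str  # e.g. "alu", "cond_branch", "indirect_jump", "return", "trap"

def build_traces(groups: List[List[Instr]]) -> List[List[Instr]]:
    """Merge incoming instruction groups into finalized trace packets.

    A packet that has not met any finalization condition when the input
    ends stays pending and is not returned.
    """
    packets: List[List[Instr]] = []
    current: List[Instr] = []
    branches = 0

    def finalize() -> None:
        nonlocal current, branches
        if current:
            packets.append(current)
        current, branches = [], 0

    for group in groups:
        # Condition 4: the incoming group would not fit in the packet.
        if len(current) + len(group) > MAX_INSTRUCTIONS:
            finalize()
        for ins in group:
            current.append(ins)
            if ins.kind == "cond_branch":
                branches += 1
            if (len(current) == MAX_INSTRUCTIONS       # condition 1
                    or branches == MAX_BRANCHES        # condition 2
                    or ins.kind in TERMINATORS):       # condition 3
                finalize()
    return packets
```

For instance, an 11-instruction group followed by a 6-instruction group produces one finalized packet of 11 instructions, with the second group left pending in a new packet.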
2.2.3 The Branch Predictor
The performance of any fetch mechanism relies on the precision of the branch
predictor because an incorrect branch prediction causes a time penalty due to
instruction recovery. In the case of a wide issue processor, a single branch predictor
seems to be inadequate because a line of trace cache is likely to contain multiple basic
blocks, as mentioned earlier. Therefore, a trace in the trace cache would be more
effective if the predictor can cover all of the branch instructions in a line and if the
outcome of the prediction is sufficiently accurate. Otherwise, the penalty would be
more severe and waste more time.
Unfortunately, at present, the technology of multiple branch predictors is
still immature and the accuracy is less than that of single branch predictors. However,
the scheme known as two-level branch prediction [34] showed an impressive prediction
accuracy of 97%. This method can be implemented within the trace cache scheme to
predict three branch outcomes in a single cycle.
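To illustrate the general idea of two-level prediction, the sketch below models a single-branch predictor in Python: a global history register (first level) selects one 2-bit saturating counter from a pattern table (second level). The history length, the initial counter value, and the class name are assumptions made for illustration, and the multiple-branch extension of [34] that yields three predictions per cycle is not modeled.

```python
class TwoLevelPredictor:
    """Global-history two-level predictor with 2-bit saturating counters."""

    def __init__(self, history_bits: int = 4):
        self.history_bits = history_bits
        self.history = 0                             # first level: recent branch outcomes
        self.counters = [1] * (1 << history_bits)    # second level: start weakly not-taken

    def predict(self) -> bool:
        """Predict taken when the selected counter is in its upper half."""
        return self.counters[self.history] >= 2

    def update(self, taken: bool) -> None:
        """Train the selected counter, then shift the outcome into the history."""
        c = self.counters[self.history]
        self.counters[self.history] = min(3, c + 1) if taken else max(0, c - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask
```

Because the counter is selected by recent history, a regular pattern such as alternating taken and not-taken outcomes becomes predictable after a short warm-up, which a single 2-bit counter alone cannot achieve.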
2.2.4 The Instruction Cache
Even though the trace cache plays an important role supplying instructions for
the processor, the conventional instruction cache is still needed. When the hit logic
signals a trace cache miss, the instruction cache has to provide the requested
instructions, instead. Moreover, the instruction cache, itself, is the instruction gateway
connected between main memory and the processor. However, the size of the
instruction cache might be trimmed down to suit such less frequent activities.
2.3 Related Work
There is a large amount of published research, using both hardware- and
software-based approaches, on high bandwidth fetch mechanisms. Some hardware-
based approaches are listed here for the purpose of tracing back the history of the
trace cache. Some of these are currently adopted in parts of the trace cache scheme.
The others are significant competitors of the trace cache approach.
2.3.1 Trace Cache History
The history of trace cache development begins with the fill unit, which was
introduced as hardware proposed to increase the front-end performance of the VAX
architecture. Melvin et al. [16] showed that the parallelism of such a sophisticated
instruction set architecture can be exploited by using a fill unit to create large
execution atomic units (EAUs) dynamically. Hypothetically, the larger EAUs contain
more microoperations able to be executed simultaneously. Each EAU is stored in the
decoded instruction cache to be reused by the execution unit. In subsequent work
[17], Melvin and Patt varied the size of EAUs of the dynamically scheduled machines
using a fill unit to gather two or more instruction basic blocks in the associated
cache. The results showed that larger EAUs effectively enhance the performance of
the processor because of the higher utilization of processor pipeline slots.
In 1994, Franklin and Smotherman [6] adopted the fill unit for their multiple
instruction issue architecture. The fill unit dynamically packs multiple instructions
into VLIW-type instructions and stores them in the shadow cache. When the
instructions in a shadow cache line are required, they can be issued and executed
simultaneously. The proposed fill unit also includes logic for checking data
dependencies of stored instructions as well as a unit for dealing with delayed
branches. There is also a branch predictor to assist the fetching mechanism with
speculative execution in order to create effective cache lines.
In 1994, Peleg and Weiser [22] patented their new instruction cache design,
which is similar to the trace cache, namely the Dynamic Flow Instruction Cache. This
scheme enhances the fetching mechanism for superscalar machines by storing 2
instruction basic blocks in a cache line. The branch instruction at the end of the first
basic block has been predicted and the outcome of the prediction is the physical
address of the first instruction of the following basic block of the cache line.
Instructions in the cache are collected dynamically from the instruction flow and all
instructions in a cache line can be fetched in a single access. The difference between
this cache scheme and the current trace cache is that in the former each basic block is
used as a starting point for each trace packet created.
The other trace cache lookalike is the Expanded Parallel Instruction Cache
(EPIC) proposed by Johnson in 1994 [13]. This architecture has been designed to
enhance in-order superscalar machines by reducing the complexity of the instruction
decoding and issuing mechanism. Each line of the Expansion Cache contains decoded
and dependency-analyzed instructions, which have been routed to certain execution units.
Therefore, it can reduce the processing time once the instructions are fetched. The
performance of this design is approximately equal to that of more complex out-of-
order superscalar machines with a traditional instruction cache.
Rotenberg et al. [24] designed the trace cache scheme consisting of a small
cache with a large instruction cache embedded in a 16-wide issue superscalar
cache design, fill unit design, and in particular, multiple branch prediction. They
showed that a large trace cache assisted by a small instruction cache outperforms
alternative configurations [24]. Therefore, the instruction cache can be designed less
aggressively as it is subject to fewer instruction accesses. Patel et al. continued their
work to improve the performance of the trace cache as reported in [18], [19], and [20].
They explored several enhancements to the trace cache model in order to overcome
performance limitations. Recently, Patel assembled all of his previous works and
some new features of the trace cache into his Ph.D. dissertation [21]. He describes and
evaluates the basic trace cache fetch mechanism, which outperforms an aggressive
instruction cache. High performance was achieved through the use of several
enhancements including:
• Partial Matching - the ability to pick up the useful blocks in a matching
trace line and to discard the rest instead of wasting the whole trace due to
branch prediction mismatch.
• Inactive Issue - instead of totally discarding useless blocks because of
branch prediction mismatch as in Partial Matching, Inactive Issue allows
the whole trace to be fetched and marks the mismatched blocks as inactive
blocks. There is no effect on fetching performance if the branch prediction
was correct. Otherwise, the inactive blocks would offer useful instructions
to be executed.
• Branch Promotion - in order to reduce the bandwidth of the branch
predictor and increase the effectiveness of the fetch mechanism, Branch
Promotion embeds statically predicted information (taken/not taken) into
strongly biased branch instructions [25].
• Trace Packing - this enhancement sacrifices trace cache area in order to
increase individual fetching capability within the loop as shown in figure
2.4. In the case of a 16-instruction trace, segment AB already occupies 11
slots and leaves 5 slots for the next segment. Unfortunately, segment C has 6
instructions and cannot fit in. Therefore, the possible traces would be AB,
CA, and BC. Using Trace Packing will store 6 combinations for the
dynamically unrolled loop as follows: AoBsCs, CrAoBs, CoAeB¿, B1C6A6'







Figure 2.4: A loop contains 3 segments
The aggregation of Partial Matching, Inactive Issue, Branch Promotion, and
Trace Packing makes the trace cache outperform the state-of-the-art Sequential-Block
instruction cache scheme both in processor performance (IPC metric) and in average
fetch rate. Furthermore, Patel's analysis showed that as fetch rate increases, branch
resolution time increases. Lastly, a next-generation processor implementation is
described which achieves high fetch rates at high branch prediction accuracy. Figure

















Figure 2.5: The trace cache fetch mechanism [19]
Comparing figures 2.3 and 2.5, even though they are both based on the trace
cache fetch mechanism, there are some differences between them affecting overall
performance. The former model delivers dynamic instruction streams that have been
captured before they are sent to the decoder. On the other hand, in the latter model
decoded instructions are sent to the fill unit before being dispatched to the execution
engine. Therefore, when a trace cache hit is signaled, instructions go directly through
the execution engine without passing to the decoder/routing again. Furthermore, these
instructions are already analyzed for dependencies and pre-routed to appropriate
execution units. The other difference between the models is the information contained
in each trace cache line. The latter model includes not only the branch target address
for checking trace cache hit/miss, but also path information which facilitates the path
enhancements of the model, i.e. Partial Matching and Inactive Issue.
2.3.2 Other High Bandwidth Fetch Mechanisms
The Branch Address Cache [34] and Collapsing Buffer [4] have been
previously mentioned as multiple basic block fetch mechanisms. They achieve high
effective fetch rate, although they cannot perform as well as a trace cache. However,
it is worthwhile to examine them to see why this is so.
In 1993, Yeh et al. proposed the branch address cache scheme [34] shown in
figure 2.6. It generates multiple fetch addresses in a single cycle resulting from the
branch address cache working together with the branch predictor. These addresses
will be calculated as indices to point to the exact location of each basic block residing
in the interleaved instruction cache. Finally, all targeted instructions are passed
through the alignment and masking network in order to form a packet ready for issue.
















Figure 2.6: The Branch Address Cache
Conte et al. proposed the Collapsing Buffer [4] as shown in figure 2.7. Two
nonadjacent cache lines can be fetched together since the scheme uses two passes
through an interleaved branch target buffer (BTB). Each pass through the branch target
buffer produces a fetch address. Moreover, the BTB can detect any number of
branches in a cache line. Therefore, it can detect intrablock branches and eliminate the


















unused instructions by using the collapsing buffer in the interchange/masking
network. Likewise, this approach adds more process stages to the fetching pipeline







Figure 2.7: Collapsing Buffer
2.4 Conclusion
In summary, the trace cache mechanism can perform better than other
aggressive approaches in respect of fetching ability but it needs sophisticated logic to
create effective traces and a substantial memory area. Therefore, a trace cache might
not be cost-effective for general-purpose processors at present. However, the


























Hence, this research focuses on the effectiveness of trace cache on narrow-issue
processors. The objective is to find out the significance and trade-offs of TC
parameters that affect the performance of the cache scheme and the usage of cache




Experimental Processor Model
3.1 Overview
The trace cache experiments in this thesis are based on simulation. A
superscalar implementation of the DLX architecture has been chosen as the
experimental processor model. The VHDL language is used to describe the simulation
model, since the language facilitates both model construction and testbench
simulation. In addition, the working model could be used as a foundation to
synthesize the processor using suitable VHDL synthesis tools. Fortunately, there is a
superscalar DLX processor model [10] in VHDL that is suitable for the proposed
experimentation.
3.2 DLX Architecture Summary
The DLX architecture was first introduced by Hennessy and Patterson [9]. It
possesses features that are commonly found in several successful processors
based on the RISC philosophy.
The significant features of the DLX architecture are
- an uncomplicated load/store instruction set,
- pipelining effectiveness,
- an easily decoded fixed-length instruction set, and





There are three register types in the DLX architecture. Firstly, the general-
purpose registers (GPRs) comprise thirty-two 32-bit registers named R0, R1, ..., R31.
The value of R0 is permanently set to zero. The GPRs are used for all integer
operations and memory addressing modes. Secondly, the floating-point registers
(FPRs) comprise thirty-two 32-bit single-precision floating point registers named F0,
F1, ..., F31. They can be used as double-precision floating point registers (64-bit) by
coupling odd and even registers into a register pair (F0, F2, ..., F30). These registers
are used only for floating-point operations. Lastly, the special-purpose registers
comprise several registers for purposes such as masks and flags.
3.2.2 DLX Data Types
There are 8-bit (byte), 16-bit (half word), and 32-bit (word) integer data types, plus
32-bit single-precision and 64-bit double-precision floating-point data types. They









Figure 3.1: Big Endian byte ordering.
3.2.3 DLX Addressing Modes
The explicitly supported data-addressing modes in the DLX are immediate and
displacement, using 16-bit fields as immediate data and displacement address fields,
respectively. However, putting 0 in the 16-bit displacement field accomplishes the
register-deferred mode, and using register R0 as the base register together with the 16-bit
field accomplishes absolute addressing. Therefore, there are four effective






3.2.4 DLX Instruction Types
There are three different instruction types: I-type (immediate), R-type












Figure 3.2: DLX instruction format.
Since all instructions are of fixed-length format, instruction decoding is very simple.
DLX is an easy architecture to understand and, moreover, widely studied and
modeled. Consequently, it is a useful processor on which to base the study of the trace
cache.
3.3 The Superscalar DLX Model
The superscalar DLX model used in this research was created by Horch in the
VHDL language [10]. Both the source code and documentation are provided at
http://www.rs.e-technik.tu-darmstadt.de/TUD/res/dlxdocu/superscalarDLX.htm.
Although the documents were written in German, the source code is commented in
English and is quite simple to follow. Figure 3.3 shows the structure of the superscalar
DLX processor.



























Figure 3.3: Superscalar DLX structure. [10]
The microarchitecture of this model is a pipelined superscalar processor. It can
fetch a maximum of two instructions simultaneously in a single cycle. The instruction
fetch unit is supported by a 64-byte instruction cache coupled with a 4-entry
instruction address translation buffer (ITB). There is a branch target buffer (BTB) to
provide the speculative target of branch instructions.
The dispatcher is the heart of the processor since it connects to every major
unit of the model. Accordingly, it generates control signals to manipulate all processor
activity from instruction entry until instruction commit. Moreover, the dispatcher also
manages precise exception processing. This is assisted by the reorder-buffer, which
works with the commit unit to commit instructions in program order.
There are four execution units, each with a reservation station: pipelined load-
store unit, integer unit (arithmetic logic unit or ALU), multiply-divide unit (MDU),
and branch resolve unit. The load-store unit works cooperatively with the write buffer
and 64-byte data cache equipped with 4-entry data address translation buffer (DTB).
The ALU executes all logical, shift, and set-on-comparison instructions. Moreover, it
mainly does the integer arithmetic calculation for addition and subtraction.
Integer multiplication and division can be performed by the MDU but the
implementation of MDU is slightly different from the original DLX architecture. In
the original architecture, multiply and divide instructions can be performed only with
floating-point registers (F0-F31). Therefore, data type conversion instructions from
integer to floating-point and vice versa (i.e. MOVI2FP and MOVFP2I) are available
to enable integer multiplication and division using the floating-point multiply/divide
unit. To avoid any implementation of floating-point operations, Horch defined a
unique register file that can be addressed as GPRs (R0-R31) or FPRs (F0-F31). R0
and F0 are the same physical register and so on. Consequently, multiply and divide
instructions (MULT, MULTU, DIV, and DIVU) perform integer multiplication and
division on the GPRs. This variation from the standard architecture required some
code modification, which will be described in chapter 4.
Lastly, the branch resolve unit determines actual branch outcomes, determines
the target address to insert in the BTB and also indicates when a branch misprediction
has occurred.
3.4 The Fetch Unit
The fetch unit is the part that is of most interest in this research, since the trace
cache is intended to improve the fetching performance beyond the conventional
instruction cache. So, the original fetch unit will be described in detail, to provide
information on the original model design.
The fetch unit has been designed to fetch a maximum of two instructions from
the instruction cache in a single cycle if the address of the first instruction in the
program counter is double-word aligned. Word order within a double word is Big
Endian (i.e. 0x00000000 is the high word and 0x00000004 is the low word). The
registers for storing the fetched instructions are divided into the stage A register and
the stage B register. Both of them can store either the high word or the low word. Normally,
stage A stores the high word and stage B stores the low word. However, stage A can
store the low word in which case stage B will become invalid. In the case of fetching
two instructions when the address is double word aligned, but when stage A is not
available, stage B can store the low word and the program counter will be increased
by 4 bytes.
As mentioned above, there are two main units associated with the fetch unit:
the instruction cache and the branch-target-buffer (BTB). The instruction
cache in the original model has a small capacity and is configured as a direct-mapped
cache. It has 8 lines containing two instructions each, so it can contain only 16
instructions (16 × 4 = 64 bytes) at a time. The availability of instructions in each cache
block is indicated by the valid bit and the tag field used for address matching.
Figure 3.4: Instruction cache structure. Each line holds a valid bit (1 bit), a tag field (26 bits), and an instruction cache block (2 instructions per block).
The instruction cache cooperates with the instruction-address-translation
buffer (ITB) to convert a virtual page number (bits 31 to 7 of the program counter)
into a physical page number, which is then joined with bit 6 of the program counter. The
result is compared with the tags of the instruction blocks to determine a cache hit
or miss. The ITB has a 128-byte page size. Figure 3.5 shows the address translation
mechanism of the instruction cache and instruction-address-translation buffer. This


















Figure 3.5: Address-translation and cache-access (cache size: 64 bytes; page size: 128 bytes). [10]
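The address split described above can be illustrated with a minimal behavioural sketch in Python. It is not taken from the VHDL model: the function names, the dictionary-based ITB, and the use of bits 5 to 3 as the line index (which follows from 8 lines of 8 bytes each) are assumptions made for illustration only.

```python
def split_fetch_address(pc):
    """Split a 32-bit fetch address into the fields used by the ITB and the cache."""
    return {
        "virtual_page": (pc >> 7) & 0x1FFFFFF,  # bits 31..7, translated by the ITB
        "page_bit6": (pc >> 6) & 0x1,           # bit 6, joined to the physical page number
        "line_index": (pc >> 3) & 0x7,          # bits 5..3: one of 8 lines (assumed)
        "word_select": (pc >> 2) & 0x1,         # bit 2: 0 = high word, 1 = low word
    }


def make_tag(physical_page, page_bit6):
    """Join the 25-bit physical page number with bit 6 to form the 26-bit tag."""
    return (physical_page << 1) | page_bit6


def icache_hit(cache, itb, pc):
    """Return True when the selected line is valid and its tag matches.

    cache: list of 8 (valid, tag) pairs (hypothetical representation).
    itb:   dict mapping virtual page numbers to physical page numbers.
    """
    fields = split_fetch_address(pc)
    physical_page = itb.get(fields["virtual_page"])
    if physical_page is None:
        return False  # no translation available in this sketch
    valid, tag = cache[fields["line_index"]]
    return valid and tag == make_tag(physical_page, fields["page_bit6"])
```

The 25-bit page number plus bit 6 gives the 26-bit tag width shown in Figure 3.4.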
The branch-target-buffer (BTB) is a memory that contains destination
addresses of previously executed branch instructions. These addresses are most likely
to be the target of future branches. When one of these branches is fetched again, the
BTB will speculate the direction of the next instructions without waiting for the
outcome of branch condition determination. In this model, there are four slots within
the BTB to store destination addresses. Like the instruction cache, each entry
















Figure 3.6: Branch-target-buffer structure. Each entry holds a valid bit (1 bit), a tag field (28 bits) whose least significant bit distinguishes between a branch in the high word and a branch in the low word, and a branch destination address (32 bits).
Indexing to the BTB slot uses bits 4 and 3 of the program counter (2 bits = 4
combinations). Dropping the last three bits makes all entries represent 8-byte-aligned
addresses. Consequently, the destination address stored in each slot has to indicate
whether the branch is a high-word or low-word instruction. This is accomplished by
attaching an extra bit as the least significant bit of the 28-bit tag portion. The extra bit
comes from bit 2 of the program counter.
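The BTB indexing and tagging described above can be sketched as follows. This is an illustrative Python model, not the VHDL source: the function names and the tuple layout of a BTB entry are hypothetical, and taking the upper 27 tag bits from bits 31 to 5 of the program counter is an assumption that is consistent with the 28-bit tag width.

```python
def btb_index(pc):
    """Bits 4 and 3 of the program counter select one of the four BTB slots."""
    return (pc >> 3) & 0x3


def btb_tag(pc):
    """Build the 28-bit tag: upper address bits plus bit 2 as the least significant bit."""
    upper = (pc >> 5) & 0x7FFFFFF   # bits 31..5 (27 bits, assumed)
    word_bit = (pc >> 2) & 0x1      # 0 = high-word branch, 1 = low-word branch
    return (upper << 1) | word_bit


def btb_lookup(btb, pc):
    """Return the stored destination address, or None when there is no match.

    btb: list of four (valid, tag, destination) tuples (hypothetical layout).
    """
    valid, tag, destination = btb[btb_index(pc)]
    if valid and tag == btb_tag(pc):
        return destination
    return None
```

Two branches in the same double word share a slot index, and only the extra tag bit tells them apart.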
3.5 Conclusion
The superscalar DLX model [10] is a narrow-issue processor model, which
was written in VHDL. It can execute integer programs including integer
multiply and divide instructions without conversion between floating-point and
integer data types. The implementation details of a trace cache on this processor model

















This research was conducted to find out whether and how a trace cache
memory can help a narrow-issue superscalar processor to fetch instructions. As
mentioned in chapter 3, an existing superscalar DLX processor model [10] was
chosen to avoid spending the time required to build the processor from scratch and
to give more time to focus on the trace cache model, which is the target of this research.
This chapter explains the experimental setup that was used to
implement the trace cache and to gain results for experimental analysis.
4.1 Trace Cache in the Superscalar DLX Processor
The trace cache is a source of instructions that contains dynamic traces of
instructions instead of static ones as an ordinary instruction cache does. Therefore, the
trace cache is supposed to be an alternative repository, to compete with the embedded

















Figure 4.1: Trace cache placement in the superscalar DLX machine.
Figure 4.1 shows the placement of a trace cache in a way that would fit in with
the original superscalar DLX processor model. The task of the trace cache is to collect
instructions from the fetch unit and pack these into traces with help from control
signals of the fetch unit itself and other units in the processor core, which determine
how to pack them. The trace cache can feed these traces back to the fetch unit.
However, the existing instruction cache is still the main supplier but also is the
competitor of the trace cache. Therefore, the trace cache system must be equipped
with trace cache 'hit' logic to make a decision on which instruction supplier would do
the job.
It is possible to build a trace cache by gathering instructions from the main
processor pipeline and producing instruction traces. However, incorporating a trace
cache would add a lot of complexity, because the original processor model was
not designed for this kind of expansion. Hence, the other way to accomplish this
mission is to leave the original superscalar DLX model untouched and examine the
utilization of the trace cache passively. In other words, we build the whole trace
cache mechanism and investigate whether instructions collected in the trace cache
match the current fetch address in the program counter register of the processor.
This allows us to determine the effectiveness of the trace cache by measuring the
trace cache hit rate, but we do not provide instructions from the trace cache to the








increase due to the trace cache. In other words, we will assess trace cache
performance in terms of trace cache hit rate and not be concerned with processor
performance improvement due to the higher fetch rate produced by the trace cache.
The latter is highly dependent on implementation technology.
4.2 Trace Cache Line Size
In prior trace cache models [27, 24], the superscalar processor models have
wide instruction paths, which are significantly different from the modest 2-
instruction-wide superscalar DLX machine we use in this research. In this




is designed to fit 16 instructions and is able to feed a maximum of 16 instructions
simultaneously to the fetch unit. This approach would not fit the 2-instruction-
wide processor (i.e. making the trace cache accommodate only 2 instructions in
a cache line) because the average basic block for typical applications is about 4 to 6
instructions [24]. More importantly, the central idea of the trace cache system is to
provide a trace that covers the basic block and to overcome the penalty of branch
misprediction. In this circumstance, we provide a trace cache line size of 4
instructions or 8 instructions to cover at least one basic block. The reason
for making 2 versions of the trace cache line size is to find an appropriate
configuration from the experimental results. In this project, we call the 4-instruction-
wide model and the 8-instruction-wide model TC_4 and TC_8, respectively.
4.3 Trace Cache Model Components
There are 4 main parts that work in concert, starting from gathering
instructions from the fetch unit until determining a trace cache 'hit' or 'miss'.
4.3.1 Instruction Gathering Unit
This unit works closely with the fetch unit. In the original processor model,







could be dispatched through the execution windows simultaneously or just one of
them depending on the availability schedule of the required execution unit for each
instruction. The instruction that was left behind will be fed in the next clock period in
which the functional unit is available. To avoid double copying of the same
instruction from different fetch cycles, the instruction-gathering unit must be able to
determine how many instructions could be collected and which one of them should be
collected in the case that only one instruction could be dispatched. The determination
can be made by consulting a group of control signals. These signals are created by the
dispatcher, where they are originally used for checking the validity of incoming
instructions at the fetch stage. After the determination is accomplished, each
valid instruction is ready to be placed into the appropriate position of the fill-buffer
under the fill-policy for creating a dynamic instruction trace.
4.3.2 Fill-Buffer
This buffer is a temporary memory which stores valid incoming instructions as
traces before transferring them to the trace cache memory space. It is the
most significant part of the mechanism because the usability of packed traces depends
directly on the fill-policy that crafts each individual trace. Because of the different trace
cache line sizes and the different number of basic blocks covered, the fill-policy is
different from the previous trace cache works in [21] and [24]. Details of the fill-policy
are described in section 4.3.2.2 Fill-Logic and Fill-Policy.
4.3.2.1 Fill-Buffer Configuration
The buffer comprises two main sections, Trace Content and Trace
Information. Trace Content stores collected instructions and their addresses.
Meanwhile, Trace Information stores information associated with the trace, which is
used during fill-buffer to trace cache transfers. Trace Content is constructed as 4 lines
(line numbers 0-3) by the number of instructions (4 and 8 instruction slots for TC_4





o Branch Existing Flag: This flag indicates whether or not there is a branch
instruction in the buffer line.
o Branch Position: Together with the previous flag, this field pinpoints the
location of the branch instruction in the line.
The bit-length of the information fields Trace Size and Branch Position
depends on the trace cache line size. Each of them is 2 bits for TC_4 and 3 bits for TC_8.
Note that, as explained in the following section, each trace cache line will contain no
more than one conditional branch (i.e. 2 basic blocks).
4.3.2.2 Fill-Logic and Fill-Policy
The most significant part of the whole fill-buffer is the fill-logic, which
determines how to fill instructions into the buffer, because the usefulness of traces
directly depends on the characteristics of the traces themselves. The implementation of
the fill-logic is ruled by the fill-policy, which defines how to construct and when to
terminate a trace from instructions collected from the instruction gathering unit.
The collected instructions will be put into available slots one after another in
the current incomplete trace until the trace is terminated by one of the following
conditions:
1. when the size of the current trace including one or both of the new
incoming instructions equals the buffer line size, or
2. when either one of the new incoming instructions is an unconditional
branch (jump), RFE, or trap, or
3. when the current trace already possesses a conditional branch and either
one of the new incoming instructions is also a conditional branch. It was
decided that there must be a maximum of 2 basic blocks per line because
general programs have a basic block run-length of about 4-6 instructions [24].
Hence, narrow cache widths like TC_4 and TC_8 will rarely be able to
accommodate more than 2 basic blocks.
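The three termination rules above can be sketched as a small behavioural model in Python. This is only an illustration under stated assumptions, not the VHDL fill-logic: the class and field names are hypothetical, instructions are handled one at a time (one or two calls per fetch cycle), and completed lines are kept in an unbounded list rather than in the 4-line buffer of the real model.

```python
from dataclasses import dataclass, field

# Instruction kinds that always end a trace (rule 2).
PATH_CHANGING = {"jump", "rfe", "trap"}


@dataclass
class BufferLine:
    instrs: list = field(default_factory=list)  # (address, kind) pairs
    has_branch: bool = False                    # 'branch existing flag'
    branch_pos: int = 0                         # 'branch position'
    ready: bool = False                         # 'buffer line ready' flag


class FillBuffer:
    def __init__(self, line_size):
        self.line_size = line_size   # 4 for TC_4, 8 for TC_8
        self.done = []               # terminated lines (unbounded in this sketch)
        self.current = BufferLine()

    def _terminate(self):
        self.current.ready = True
        self.done.append(self.current)
        self.current = BufferLine()

    def add(self, address, kind):
        """Place one gathered instruction according to the fill-policy."""
        # Rule 3: a second conditional branch starts a new line.
        if kind == "cond_branch" and self.current.has_branch:
            self._terminate()
        line = self.current
        if kind == "cond_branch":
            line.has_branch = True
            line.branch_pos = len(line.instrs)
        line.instrs.append((address, kind))
        # Rule 2 (path-changing instruction) or rule 1 (line is full).
        if kind in PATH_CHANGING or len(line.instrs) == self.line_size:
            self._terminate()
```

The trace size of a terminated line is simply `len(line.instrs)` in this sketch.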
Accordingly, the fill-logic fundamentally consists of a set of pointers used to
locate the current row and slot in the trace content space of the fill-buffer for each of
the incoming instructions and their addresses. The most important task of this unit is




Rule 1 is the simplest way of terminating the line, when an incoming
instruction makes the current trace reach the limit of the buffer line size. Note that
there might be either one or two instructions collected from the fetch unit. In case
there is only one instruction, if the length of the incomplete current line plus this
instruction equals the buffer line size, the instruction will be placed into the line and
will also terminate this line. Meanwhile, the pointers will be updated to the beginning of
the next line. However, if there are 2 incoming instructions, there could be two
distinct cases.
o Case 1: The first incoming instruction of the two occupies the last slot of the
current buffer line. Therefore, this line will be terminated and the other
instruction will be placed in the beginning slot of the next line.
o Case 2: There are two slots left in the current line while there also are 2
incoming instructions. The logic can place both instructions into the slots,
terminate the current line, and start the next line for the next incoming
instructions.
If the trace is terminated by rule 1, it means that the line is fully occupied
by instructions that are most likely coming from the same basic block, and they can be
put in the buffer very easily in practice. This scenario is quite rare in reality because
programs tend to break into several small basic blocks [24]. Therefore,
rule 2 and rule 3 are often the ones that terminate the trace.
Rule 2 and rule 3 handle instruction-path changing instructions (i.e.
unconditional branches, RFE, and traps) and conditional branches, respectively.
Therefore, it is necessary to enable the fill-buffer to classify instruction types. If the
former is detected in either one of the incoming instructions, rule 2 will be applied.
Basically, that instruction will be put in the current position provided by the pointers
and then the line will be terminated immediately.
According to the structure of the fill-buffer (see figure 3.3), there are 2 fields
in the trace information concerning branch instructions. The 'branch existing flag'
indicates whether or not there is a branch existing in the line yet. This field is set
once a conditional branch instruction is inserted in the line. If the flag is set and
one of the incoming instructions is detected as a conditional branch, the logic will
push that branch to start in the new line next to the current line even though there is a




Apart from manipulating pointers to place instructions into their places by
applying the fill-policy rules, the fill-logic also has to complete the trace information
of each buffer line. This task has to be done in parallel with the pointer
manipulation.
o The 'buffer line ready' flag is set immediately after the current line is
terminated.
o The 'trace size' field is the counter that counts the number of instructions
placed in the buffer line continuously until the line is terminated. Once the
line is terminated, this field tells how many instructions are in the
particular buffer line.
o The 'branch existing flag' was described above.
o The 'branch position' indicates the location of the branch instruction
within the trace. This field is updated once the branch instruction is
placed in the buffer line. This information can be extracted from the
pointers that locate the position of the instruction.
4.3.3 Trace Cache Memory
4.3.3.1 Trace cache memory structure
Figure 4.4: Trace cache memory structure.











The structure of the trace cache memory is quite similar to that of the fill-
buffer. Figure 4.4 shows the structure of the trace cache memory space. There are also
2 sections: Trace Information and Trace Instructions.
o Trace Instructions stores only sequences of instructions since it is not
necessary to store instruction addresses anymore. However, significant
instruction addresses of the trace (i.e. the address of the first instruction
and the address of the branch target instruction (if any) of the trace) are
kept and appear as tags, which will be stored in the trace information
portion.








Figure 4.5: Trace Information portion of the trace cache memory.
- Valid bit: This is a flag to indicate whether or not the particular
trace cache line is occupied by valid cache content.
- Tag_1: This is the tag field of the first instruction in the line.
- Tag_2: This is the tag field of the branch destination instruction
address if there is a branch instruction available in the line.
Otherwise, this field is an identical copy of Tag_1.
- Trace Size.
- Branch Existing Flag.
- Branch Position.
The last three fields of trace information are identical copies of the 'trace
size', 'branch existing flag', and 'branch position' fields of the associated
fill-buffer entry as described in section 4.3.2.1 Fill-Buffer Configuration.
Experiment Setup
In this experiment, the number of trace cache memory lines is one of the
interesting parameters to investigate. It ranges from 4 up to 512 cache lines
(increasing by a factor of 2) to analyze the effect of trace cache memory size on the
trace cache utilization. This parameter will be varied for both the TC_4 and TC_8
configurations.
4.3.3.2 Buffer-cache transfer
Every clock cycle, there must be a procedure to check whether there are any
traces ready in the fill-buffer waiting to be transferred into the trace cache memory
space. The buffer-cache transfer unit was built to accomplish this task. Moreover, the
unit has to make a decision whether the new trace should be placed into the trace
cache memory or dropped out.
- Trace cache memory line selection
When there is a ready fill-buffer line, the address of the first instruction of the
buffer line will be used as the trace cache memory line selector. Based on a direct-
mapped cache, the number of extracted bits used for line selection depends on the
number of lines of trace cache memory (i.e. 2, 3, 4, 5, 6, and 7 bits are for 4, 8, 16,
32, 64, and 128 lines, respectively). The position of the extracted bits starts from
the third bit of the address (see figure 4.6).
[Figure 4.6 layout: address of the first instruction of the buffer line (32-bit length), divided into the tag, the line selector starting from bit 2, and bits 1 and 0.]
Figure 4.6: Trace cache line selector is extracted starting from bit 3 of the address word.
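The line selection described above amounts to a shift and a mask. The following Python sketch shows it for an arbitrary power-of-two number of lines; the function names are hypothetical and the code is an illustration rather than a transcription of the VHDL model.

```python
def selector_bits(num_lines):
    """Number of address bits needed to select one of num_lines lines."""
    if num_lines < 1 or num_lines & (num_lines - 1):
        raise ValueError("num_lines must be a power of two")
    return num_lines.bit_length() - 1  # e.g. 4 lines -> 2 bits, 128 lines -> 7 bits


def line_selector(first_address, num_lines):
    """Extract the line selector, skipping bits 1 and 0 of the word-aligned address."""
    width = selector_bits(num_lines)
    return (first_address >> 2) & ((1 << width) - 1)
```

For example, an address of 0x2C selects line 3 in a 4-line cache and line 11 in a 16-line cache.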
- Commencing the transfer
Once the destination line is decoded, the transfer is commenced if:
o There is more than one instruction in the ready-to-transfer fill-buffer line. This
avoids single-instruction traces, which are not likely to be very useful, from
occupying an entire TC line.
o The trace size of the ready-to-transfer fill-buffer line is longer than the existing
trace in the selected memory line. Hypothetically, a longer trace provides
more instructions and this would increase the probability of finding more
useful instructions.
- The contents to transfer
The contents from the fill-buffer line are:
o All instructions in the trace (note: addresses of these instructions will be
abandoned).
o The extract (bits 31 to 2) of the address of the first instruction and the
address of the branch instruction destination (if any) to fit in 'Tag_1' and
'Tag_2' of the cache line.
o Identical copies of 'Trace Size', 'Branch Existing Flag', and 'Branch
Position' from the buffer line for each field with the same name of the
cache line.
- Finishing the transfer
After the transfer is complete, there are 2 tasks to be done.
o Reset all fields in the buffer line to make it ready to accommodate a new
trace.
o Set the 'Valid Bit' of the selected cache line to signal the validity of the
content.
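The transfer procedure above can be summarised in a short Python sketch. It is a simplified illustration, not the VHDL transfer unit: the dictionary layout, the `branch_target` key, and the function names are hypothetical, the two commencing conditions are assumed to apply together, and an invalid cache line is treated as holding a trace of size 0. Resetting the buffer line is left to the caller.

```python
def empty_line():
    """An invalid trace cache line (hypothetical dictionary layout)."""
    return {"valid": False, "tag_1": 0, "tag_2": 0, "trace_size": 0,
            "has_branch": False, "branch_pos": 0, "instrs": []}


def transfer(cache, buf_line):
    """Try to move a ready fill-buffer line into the direct-mapped trace cache.

    Returns True when the trace was stored and False when it was dropped.
    """
    first_address = buf_line["addresses"][0]
    index = (first_address >> 2) & (len(cache) - 1)  # line selector from bit 2 upward
    old = cache[index]
    size = len(buf_line["instrs"])
    old_size = old["trace_size"] if old["valid"] else 0
    # Drop single-instruction traces and traces no longer than the stored one.
    if size <= 1 or size <= old_size:
        return False
    tag_1 = first_address >> 2                       # bits 31..2 of the first address
    target = buf_line["branch_target"]               # None when the line has no branch
    cache[index] = {
        "valid": True,
        "tag_1": tag_1,
        "tag_2": tag_1 if target is None else target >> 2,
        "trace_size": size,
        "has_branch": buf_line["has_branch"],
        "branch_pos": buf_line["branch_pos"],
        "instrs": list(buf_line["instrs"]),          # instruction addresses are not kept
    }
    return True
```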
4.3.4 Trace Cache Hit Logic
As mentioned earlier, the original DLX model is left untouched and the
performance measurement is done passively. The trace cache hit logic is the unit
assigned to find out whether the instruction at the current address in the program
counter register and its successors can be found in the trace cache. Therefore, this
unit is the point at which the trace cache hit rate is measured.
The typical instruction-cache 'hit' is the outcome of a comparison between the
value in the program counter register of the processor and the tag of the selected line.




instruction of the line since its tag covers all of the instructions of the selected line.
This is different from the trace cache 'hit' definition, particularly for the trace cache
configuration of this experiment.
Trace cache information for 'hit' or 'miss'
The trace cache is supposed to collect instructions from the dynamic
instruction stream. Although a trace cache line has a fixed size into which
instructions are placed, we cannot forecast which instruction will be the first
instruction of the cache line and how many instructions it can collect for a trace.
Moreover, some traces may contain a branch instruction with a destination
instruction whose address is not in consecutive order. Consequently, it is not
possible to make the tag address cover all of the instructions in a trace cache line.
In addition, the execution path of the processor is only 2 instructions wide. Thus,
all instructions from the selected trace cache line cannot flow through the
instruction path simultaneously like those in the original instruction cache. One
trace cache line might contain instructions to be fed through the instruction path in
several successive cycles. Therefore, the trace cache 'hit' or 'miss' depends on the
corresponding trace information, and the trace information must be able to
indicate
o how many instructions there are in a particular trace,
o the addresses of those instructions,
o whether the trace possesses a branch, and
o the direction (taken / not taken) of that branch instruction.
Trace cache 'hit' or 'miss' determination
There are 2 types of trace cache 'hit': a hit on the first instruction of the cache
line and a hit on the rest of the line. The former can be detected by matching the
current value in the program counter (PC) with the 'Tag_1' of the selected cache
line. After a hit on the first instruction, it is possible to have a hit on the rest of
the line in the next fetch. Thus, there must be a line-hit flag to indicate that the
first instruction of that line has been hit. This method enables the hit logic to
check the remaining instructions.
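The two kinds of hit and the line-hit flag can be sketched in Python as below. This is a strongly simplified illustration and not the hit logic of the VHDL model: it checks one fetch address at a time, the class and key names are hypothetical, and each cache line in the sketch carries an explicit `addresses` list, whereas the real trace information would have to derive the expected addresses from Tag_1, Tag_2, the trace size, and the branch fields.

```python
class HitLogic:
    """Passive hit detection against a direct-mapped trace cache (sketch)."""

    def __init__(self, cache):
        self.cache = cache     # list of line dictionaries, power-of-two length
        self.line_hit = False  # set after a hit on the first instruction of a line
        self.pending = []      # addresses still expected from the hit line

    def fetch(self, pc):
        """Classify one fetch address as 'first', 'rest', or 'miss'."""
        # Hit on the rest of the line: only possible while the line-hit flag is set.
        if self.line_hit and self.pending and pc == self.pending[0]:
            self.pending.pop(0)
            self.line_hit = bool(self.pending)
            return "rest"
        # Otherwise start over and look for a hit on a first instruction.
        self.line_hit = False
        self.pending = []
        line = self.cache[(pc >> 2) & (len(self.cache) - 1)]
        if line["valid"] and line["tag_1"] == pc >> 2:
            self.pending = list(line["addresses"][1:])
            self.line_hit = bool(self.pending)
            return "first"
        return "miss"
```

A fetch that leaves the recorded trace clears the flag, so the next address is again compared against Tag_1 only.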
2. Add the run-time startup code crt1.o to the start of the compiler output.
3. Run this file through the standard link editor, ld.
4. Edit the a.out file to set the load addresses for the text and data segments, as
required by the simulation model.
5. Use the perl script to transform the floating-point instructions.
6. Edit the file to add nops around the jr instructions.
7. Assemble the resulting file ("dlx.asz") into object code ("dlx.out") using
dlxasm.
The crt1.o file and perl script are listed in Appendix D and are included in the
companion CD-ROM.
At the end, the assembly code was assembled into binary code as a .out file for
the processor simulation. The assembler, named dlxasm, was downloaded from
http://www.ashenden.com.au/designers-guide/DG-DLX-material.html.
4.5 Simulation Testbench Configuration
The testbench configuration for simulating the superscalar DLX processor
model has been set to run DLX binary-assembled files. A program to be run on the
simulation must be named 'dlx.out' and fit within a 32-kilobyte memory range
(0x0000 to 0x7FFF). Originally, the capacity of the main memory was only 16
kilobytes, but this was expanded to accommodate larger test programs. Note that there
must not be floating-point instructions in the test programs due to the processor
design. The output file will be created as 'dlx.dump' if the program generates
outputs.
4.6 Measuring the Trace Cache
In order to analyse the performance of the trace cache, the numbers of TC hits
and misses were collected. Hit and miss counts of the original instruction cache and
the total cache accesses were also required for reference purposes. The final
sum of trace cache hits and misses over all trace cache lines is too coarse a metric for
detailed analysis. Therefore, the activities of each line of trace cache
memory were recorded as described in Appendix A.
In addition, there must also be analysis of the cache space usage because the
trace cache model occupies real estate on the chip once it is implemented.
4.7 Conclusion
This research project benefits from the use of an existing processor model in
that it was not necessary to set up the experiment from scratch. However, this model
constrains the implementation of the original processor and the ability to expand the
instruction cache. This chapter has described the way in which the trace cache was
constructed and the method used to measure the trace cache performance in
terms of usefulness relative to the instruction cache and space usage. VHDL source





The trace cache was simulated in two configurations: a 4-instruction wide
(TC_4) and an 8-instruction wide (TC_8) trace cache. In each case the number of
trace cache lines was varied from 4 to 512. We will use 4L, 8L, and so on to denote
individual cache line configurations. Each one of them will be simulated on 4
different test programs: bubblesort (bs-a, bs-r, and bs-d), primenumber (pn-20, pn-50,
and pn-100), permutation, and DCT. For bubblesort and primenumber, the
simulations were performed for 4L to 128L only, because trace cache performance
became steady before reaching 128L and certainly would not vary for the 256L and 512L
configurations.
5.1 Overview
This experiment is meant to determine the effect of a trace cache on a narrow-
issue processor like the superscalar DLX, in order to determine whether the
trace cache is worth considering for implementation on this kind of microprocessor.
Obviously, a performance comparison between the trace cache and the originally
embedded instruction cache seems to be inevitable. Unlike the trace cache, however,
it proved to be impractical to increase the capacity of the instruction cache in order to
make a fair comparison between the two. For this reason the instruction cache
capacity was not varied in these studies. The instruction cache can hold a maximum
of 16 instructions, whereas the trace cache capacity can be increased almost without limit. The best case
for a fair comparison would be TC_4 with 4 lines of trace cache, in which the total
capacity of the cache is 16 instructions (in TC_4, 1 line contains 4 instructions).
Results
Therefore, the analysis of this experiment will not focus on a head-to-head
comparison of the performance between the two caches. Instead we will focus on the
performance of different trace cache configurations.
There are three sections analyzing the performance of the trace cache from
different points of view. The first section shows the hit and miss counts on the trace
cache while the capacity of the cache is increasing in both the width of the trace cache
line and the number of trace cache lines. This section also shows hit and miss counts
of the instruction cache to provide a reference point for trace cache performance.
The next section shows the percentage of hits and misses of the trace cache for
different test programs. Hits and misses are presented in separate graphs to facilitate
analysis of each of them individually. The last section displays how much of the trace
cache space has been used and how much of it was left unused when the capacity of
the trace cache is expanding. Please note that the words trace cache and instruction
cache might be, from time to time, replaced with the abbreviations TC and IC,
respectively.
5.2 Hits and Misses of the Trace Cache
Fundamentally, the number of cache hits and misses is the performance
indicator of cache memories: more cache hits and fewer misses represent better
cache performance. This experiment has two main parameters that affect the
performance of the trace cache when they vary: the size of a trace cache line and the
total number of trace cache lines. The product of these parameters is the capacity of
the trace cache, but different parameter combinations may give different results for
the same capacity because of the trace cache mechanism. Generally, a bigger trace
cache capacity should perform better than a smaller one. However, it is essential to
observe the actual results as these parameters interact with the fill policy in order
to understand the design trade-offs.
The results are presented as graphs with associated data tables of individual
test programs (figures 5.1a to 5.1h). Each of them shows the acquired number of hits















Figure 5.1a : Hits and misses of the trace cache and the instruction cache on bs-a.
Figure 5.1b : Hits and misses of the trace cache and the instruction cache on bs-r.
Figure 5.1c : Hits and misses of the trace cache and the instruction cache on bs-d.
Figure 5.1d : Hits and misses of the trace cache and the instruction cache on pn-20.
Figure 5.1e : Hits and misses of the trace cache and the instruction cache on pn-50.
Figure 5.1f : Hits and misses of the trace cache and the instruction cache on pn-100.
Figure 5.1g : Hits and misses of the trace cache and the instruction cache on Permute.
Figure 5.1h : Hits and misses of the trace cache and the instruction cache on DCT.
(Each chart plots IC misses, TC_8 misses, TC_4 misses, IC hits, TC_8 hits, and TC_4
hits against the number of TC lines: 4L to 128L for the bubblesort and primenumber
programs, 4L to 512L for Permute and DCT.)







In general, TC_8 performs better than TC_4 at the same number of TC lines,
except for DCT. The longer traces of TC_8 increase the opportunity to find useful
instructions in a single trace, and the larger space allows more instructions to fit in
and also decreases the chance of overwriting useful ones due to space contention.
However, this advantage is not effective in every program, as mentioned in the
explanation above. The effectiveness depends on the pattern of dynamic execution
of each particular program, so the advantage is not perfectly predictable.
For bubblesort, TC_8 at 4L is not as effective as TC_4 for two reasons that are
evident from the raw simulation data (see the companion CD-ROM). This analysis is
based on the comparison of 4L and 8L of bs-r and bs-d. The first reason is that 4L
provides too little space to hold useful traces long enough to supply the required
instructions; those traces were replaced by other traces that are not well used yet
live too long and therefore cause many misses. The other reason is that the same
line is overwritten too frequently, so the useful traces cannot live long enough to
produce hits. Both are chiefly a problem of cache space contention combined with
the direct-mapped scheme. Therefore, more TC lines relax this drawback and yield
more TC hits, as the results show.
DCT is an exception among the graphs mentioned above. The performance
of TC_8 is not better than that of TC_4 at the same number of TC lines. This
peculiarity can be explained by the comparison graphs of TC hit, compulsory miss,
and conflict miss of TC_4 and TC_8 in figure 5.2.
In the following discussion we refer to misses as either compulsory misses or
conflict misses. When a line is selected for the first time and there is no instruction
in it (valid bit = 0), it is called a compulsory miss. In contrast, if the selected line












Figure 5.2: Comparison between TC hit, compulsory miss, and conflict miss of DCT
(x-axis: No. of TC lines; series: hit, compulsory miss, and conflict miss of TC_8 and TC_4)
Figure 5.2 shows the comparison of three features (TC hit and both TC miss
types) of TC_4 and TC_8. Suppose that the trend of the TC_4 graphs draws the
baseline of normal behavior, in which the TC hit rate increases, representing better
performance, when there are more TC lines. Meanwhile, the conflict miss rate falls
and the compulsory miss rate rises slightly as the number of lines increases. We
can see that the behavior of the TC_8 graphs is different. If TC_8 consistently
performed better than TC_4, either its compulsory miss rate or its conflict miss rate
should be distinctly lower than that of TC_4 from 4L to 512L. But it is only at 4L
that TC_8 performs better than TC_4, because of its low conflict miss rate. At 8L, the
conflict miss rate of TC_8 is higher than that of TC_4. This unexpected effect has to
be explored by consulting the raw data in Appendix C, which contains excerpts from
the corresponding simulation log files of DCT. Comparing the data of TC_4 and TC_8
at 8L for DCT reveals that trace line number 2 of TC_8 has a noticeably higher
conflict miss count than that of TC_4. Although there are only a few overwrites in
TC_8, the conflict miss rate is high. This means that useful traces are overwritten by
less useful ones. Moreover, the less useful traces occupy lines for too long. In the
same line of TC_4, by comparison, there are more trace overwrites but fewer conflict
misses.
At 16L, although the conflict miss rate decreases the compulsory miss rate









rate of TC_8 still less than TC_4's. After 16L, the falling conflict miss rate of TC_8 is
offset by an increasing compulsory miss rate. The raw data for TC_8 at 16L to 64L of
DCT give insight into this occurrence: trace cache line number 2 has a high
compulsory miss count, which dominates the total compulsory misses.
The above comparison between TC_4 and TC_8 for the same number of TC
lines is actually not fair, because TC_8 naturally has more room for instructions. At a
given number of TC lines, TC_8 has twice the capacity of TC_4. For example, the
8L trace cache of TC_8 is able to store 64 instructions, as is the 16L trace cache of
TC_4. Hence, it is interesting to make the comparison between TC_4 and TC_8 at
equal total capacity, as shown in table 5.2.
Table 5.2: The equivalent cache capacity of different trace cache configurations
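The pairing behind table 5.2 follows directly from capacity = line width x number of lines. The following is a minimal sketch in Python, not part of the thesis model: the function names are hypothetical, and the only inputs taken from the text are the two line widths (4 and 8 instructions) and the line counts 4L to 512L used in the experiment.

```python
# Sketch: pair TC_4 and TC_8 configurations that hold the same number of
# instructions. All names here are illustrative, not from the DLX model.

LINE_COUNTS = [4, 8, 16, 32, 64, 128, 256, 512]  # 4L .. 512L


def capacity(width: int, lines: int) -> int:
    """Total instruction capacity of a trace cache configuration."""
    return width * lines


def equivalent_pairs(line_counts=LINE_COUNTS):
    """Return (capacity, TC_4 lines, TC_8 lines) for every equal-capacity pair."""
    pairs = []
    for lines_tc4 in line_counts:
        for lines_tc8 in line_counts:
            if capacity(4, lines_tc4) == capacity(8, lines_tc8):
                pairs.append((capacity(4, lines_tc4), lines_tc4, lines_tc8))
    return pairs


if __name__ == "__main__":
    for cap, lines_tc4, lines_tc8 in equivalent_pairs():
        print(f"{cap:5d} instructions: TC_4 at {lines_tc4}L = TC_8 at {lines_tc8}L")
```

Under these assumptions the pairs run from TC_4 at 8L with TC_8 at 4L (32 instructions) up to TC_4 at 512L with TC_8 at 256L, which matches the 8L-512L range of TC_4 quoted in the text.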
For the bubblesort and primenumber programs, the comparison will stop at
128L of TC_4, while all of the above configurations (8L-512L of TC_4) will be shown
for Permute and DCT. Figures 5.3a to 5.3h show the graphs of hits and misses based







Figure 5.3 : Hit and miss comparison between TC_4 and TC_8 at the same cache
capacity of (a) bs-a, (b) bs-r, (c) bs-d, (d) pn-20, (e) pn-50, and (f) pn-100.
(x-axis: Total Cache Capacity (instructions); series: TC_8 misses, TC_4 misses,
TC_8 hits, TC_4 hits)
Figure 5.3 (continued) : Hit and miss comparison between TC_4 and TC_8 at the same
cache capacity of (g) Permute, and (h) DCT.
(x-axis: Total Cache Capacity (instructions))
Figures 5.3a to 5.3h show that TC_8 does not significantly outperform TC_4,
as it might be expected to, when the total cache space of the two is equal. For
Permute and DCT, TC_8 generally performs worse than TC_4. The hypothesized
reason for this is the importance of the number of TC lines. Although the wider TC
line offers the opportunity to get more hits on the contents of traces, it is also
possible that not all of the contents are useful and, in addition, not all fully useful
traces are longer than 4 instructions. Therefore, when comparing TC_4 with TC_8 on
a fair basis (equal capacity), TC_4 generally performs better than TC_8. Permute and
DCT are a good basis for this comparison, since bubblesort and primenumber are too
short to show the difference because their TC hit rate saturates early.
In the end, these results are useful for deciding on chip area investment at
the stage of hardware implementation in which the choice between increasing the









5.3 Percentage of Trace Cache Hits and Misses
This section focuses on the percentage of trace cache hits and misses in order
to show and compare the nature of the hits and misses of each test program. Trace




Total trace cache hit - This is the percentage of trace cache hits relative to
total cache accesses. Total trace cache hit is the sum of the trace cache
first tag hit and the trace cache line content hit.
Trace cache first tag hit - The first tag of the trace cache line represents
the first instruction of the trace. When the first tag is hit, the line can
start to supply instructions. However, whether the other instructions in that
line can be fetched depends on another hit signal, the trace cache line
content hit.
Trace cache line content hit - As mentioned above, this hit is counted
when instructions after the hit first instruction are eligible to be fetched.
Compulsory misses and conflict misses were defined earlier; both are
components of the total trace cache miss percentage, which is 100 minus the
percentage of total trace cache hits.
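The relationship between these percentages can be written out as a small calculation. The sketch below is illustrative only: the function name and argument names are hypothetical, and it assumes that the total number of trace cache accesses equals hits plus misses, which the text implies but does not state as a formula.

```python
# Sketch: derive the hit and miss percentages defined above from raw counts.
# Assumption: total accesses = all hits + all misses.

def hit_miss_percentages(first_tag_hits, content_hits,
                         compulsory_misses, conflict_misses):
    """Return the percentages described in this section as a dict."""
    total_hits = first_tag_hits + content_hits        # total TC hit = sum of both hit kinds
    total_misses = compulsory_misses + conflict_misses
    accesses = total_hits + total_misses
    if accesses == 0:
        raise ValueError("no trace cache accesses recorded")
    total_hit_pct = 100.0 * total_hits / accesses
    return {
        "total_hit": total_hit_pct,
        "first_tag_hit": 100.0 * first_tag_hits / accesses,
        "content_hit": 100.0 * content_hits / accesses,
        "total_miss": 100.0 - total_hit_pct,           # miss % = 100 - hit %
    }


if __name__ == "__main__":
    # Made-up counts, chosen only to make the arithmetic easy to follow.
    print(hit_miss_percentages(first_tag_hits=60, content_hits=20,
                               compulsory_misses=5, conflict_misses=15))
```

The counts in the usage example are invented for illustration and are not taken from the simulation logs.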
Each of figures 5.4 to 5.8 shows the graphs of the percentage of a particular hit
or miss type for TC_4 and TC_8 together for comparison purposes.
5.3.1 Trace cache hits
Figure 5.4 shows the graph of percentage of total trace cache hit of (a) TC_4
and (b) TC_8. Hence, they provide comparable figures among all test programs in





Figure 5.4 : Percentage of total trace cache hit of (a) TC_4 and (b) TC_8
(x-axis: No. of TC lines, 4L to 512L; one series per test program)












































5.3.2 Trace cache misses
There are two kinds of TC misses, compulsory miss and conflict miss, as
explained earlier. Figures 5.7 and 5.8 show the percentage of each kind of miss for
both TC_4 and TC_8, taking the total miss count as 100% and expressing each kind of
miss as its share of the total miss count.
We would expect the percentage of compulsory misses to increase as the
number of TC lines grows, while the percentage of conflict misses goes the opposite
way. This can be explained by the nature of the cache scheme. When there are only
a few TC lines, it is most likely that an existing trace is replaced by a new incoming
trace, since they are mapped to the same line. Therefore, when that line is occupied
by a new trace that does not match the request of the fetch unit, it signals a conflict
miss. Conflict misses can be resolved by increasing the number of TC lines, and
eventually, when there is enough room to store most or all of the instructions, the
conflict miss rate is zero. Likewise, the behavior of compulsory misses follows from
the same effect of increasing the number of TC lines: more TC lines increase the
probability of selecting an empty cache line. Hence, eventually all of the TC misses
are compulsory misses when the trace cache is big enough to cover all executed
instructions.
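The expected trend can be reproduced with a toy direct-mapped model. This is a simplified sketch, not the thesis's VHDL design: a trace is identified only by an abstract start identifier, the line selector is assumed to be that identifier modulo the number of lines, every miss immediately fills the line, and the fill unit, fill policy, and content hits are ignored. All names are hypothetical.

```python
# Sketch: classify accesses to a direct-mapped trace cache as hit,
# compulsory miss (line empty, valid bit = 0) or conflict miss
# (line holds a different trace).

def simulate(trace_starts, num_lines):
    """Count hits and both miss kinds for a sequence of trace start identifiers."""
    lines = [None] * num_lines            # None models valid bit = 0
    counts = {"hit": 0, "compulsory": 0, "conflict": 0}
    for start in trace_starts:
        index = start % num_lines         # assumed line selector
        if lines[index] is None:
            counts["compulsory"] += 1
            lines[index] = start          # fill the empty line
        elif lines[index] == start:
            counts["hit"] += 1
        else:
            counts["conflict"] += 1
            lines[index] = start          # replace the existing trace
    return counts


def miss_shares(counts):
    """Share of each miss kind, taking the total miss count as 100%."""
    total = counts["compulsory"] + counts["conflict"]
    if total == 0:
        return {"compulsory": 0.0, "conflict": 0.0}
    return {
        "compulsory": 100.0 * counts["compulsory"] / total,
        "conflict": 100.0 * counts["conflict"] / total,
    }


if __name__ == "__main__":
    loop = [0, 4, 8, 12] * 10             # four traces executed in a loop
    for n in (4, 8, 16):
        c = simulate(loop, n)
        print(n, c, miss_shares(c))
```

With this invented access pattern, the 4-line cache maps all four traces to one line and almost every miss is a conflict miss, whereas the 16-line cache gives each trace its own line, so the only misses left are the four compulsory ones.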
Looking at the results of TC misses, each of them follows the hypothesized
tendency. However, the results of each test program are different and
unpredictable when two parameters, the trace size and the number of TC lines, are
varied. Therefore, no firm conclusion can be made about how the variation of trace






Figure 5.7 : Percentage of TC Compulsory Miss of (a) TC_4 and (b) TC_8
(x-axis: No. of TC lines; one series per test program)
Figure 5.8 : Percentage of TC Conflict Miss of (a) TC_4 and (b) TC_8
(x-axis: No. of TC lines; one series per test program)




5.4 Trace Cache Space Usage
Increasing trace cache space seems to improve the hit rate but also introduces
additional cost. Therefore, it is important to obtain some indication of how
efficiently each configuration uses the available memory. Figures 5.9 and 5.10 show
the percentage of trace cache space usage of TC_4 and TC_8, respectively. Each
percentage was calculated as the total number of instructions stored in the cache
divided by the maximum instruction capacity of the cache.
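The usage figure can be computed as follows. This is a minimal sketch under stated assumptions: the function name is hypothetical, and the input is taken to be the number of instructions held in each TC line, with 0 for a line that was never written.

```python
# Sketch: percentage of trace cache space usage
# = instructions stored / maximum instruction capacity * 100.

def space_usage_percent(trace_sizes, line_width):
    """trace_sizes[i] is the number of instructions stored in TC line i."""
    if line_width <= 0 or not trace_sizes:
        raise ValueError("need a positive line width and at least one line")
    if any(size < 0 or size > line_width for size in trace_sizes):
        raise ValueError("a trace cannot be longer than the line width")
    stored = sum(trace_sizes)
    capacity = line_width * len(trace_sizes)
    return 100.0 * stored / capacity


if __name__ == "__main__":
    # TC_4 with 4 lines: one full trace, two short traces, one unused line.
    print(space_usage_percent([4, 2, 0, 1], line_width=4))
```

In the made-up example, 7 of 16 instruction slots are occupied, so both short traces and the unused line lower the percentage.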




























Figure 5.9 and Figure 5.10 : Percentage of trace cache space usage of TC_4 (figure
5.9) and TC_8 (figure 5.10) (x-axis: No. of TC lines)
5.4.1 Results of TC_4 and TC_8
According to figures 5.9 and 5.10, there is one common feature among all test
programs: the utilization of cache space drops as the cache capacity expands.
5.4.2 Analysis
This result is quite predictable since the more cache space is available the





















primenumber. Although Permute and DCT are longer in terms of program span, at
512L there is clearly much more space available than required. According to the raw
data (on the CD-ROM), there are two main effects that waste cache space. First, some
cache lines were occupied by short traces (fewer than 4 and 8 instructions for TC_4
and TC_8, respectively). Second, some cache lines were never occupied by any
trace; this gets worse as the number of TC lines increases. Both effects are
inevitable. For the former, the width of an individual trace is not fixed: owing to the
fill-policy, one trace might contain only 1 instruction while another contains up to
the maximum number (4 or 8 instructions). For the latter, the mapping of a trace
depends entirely on the address of its first instruction. Some cache lines, then, might
be unused because it is unlikely that the cache line following the line occupied by
the current trace will be the place for the succeeding trace.
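The second effect can be illustrated with a short calculation. The sketch below is hypothetical: it assumes 4-byte instructions and a line selector equal to the word address of the first instruction modulo the number of lines, which may differ from the selector bits used in the actual design.

```python
# Sketch: count TC lines that are never occupied when each trace is mapped
# by the address of its first instruction (assumed selector, see lead-in).

def unused_lines(trace_start_addresses, num_lines, bytes_per_instruction=4):
    """Number of TC lines that no trace start address maps to."""
    occupied = set()
    for address in trace_start_addresses:
        word_address = address // bytes_per_instruction
        occupied.add(word_address % num_lines)
    return num_lines - len(occupied)


if __name__ == "__main__":
    # Four consecutive traces of 4 instructions each: starts at bytes 0, 16, 32, 48.
    starts = [0, 16, 32, 48]
    for n in (4, 8, 16):
        print(f"{n:2d} lines: {unused_lines(starts, n)} never occupied")
```

Because consecutive traces start several instructions apart, their selectors skip over neighbouring lines; in this invented example 12 of 16 lines are never used, which is consistent with the falling space usage described above.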
5.5 Conclusion
The results show that the hit rate of the trace cache tends to increase as the
trace cache size increases, through both a larger number of instructions in each line
and a larger number of trace cache lines. However, the hit rate saturates once the
trace cache lines can cover all of the instructions of the test program.
In short test programs, the results show that the hit rate of the trace cache in
TC_8 increases and becomes steady earlier than in TC_4. The larger cache space
allows more instructions to fit in, so the probability of finding the right instructions
increases. Meanwhile, the hit rate also saturates faster because of the larger cache
space, which can cover all of the instructions in fewer lines. On long programs, the
results show that extending cache capacity, both by increasing the number of trace
cache lines and by increasing the line width, increases the performance of the trace
cache as well.
Increasing the number of TC lines might improve the performance of the trace
cache, but it also wastes trace cache space because the number of unused trace
cache lines escalates under the direct-mapped scheme.
The wider trace of TC_8 also increases the performance of the trace cache
over TC_4 if the comparison is made at the same number of TC lines. But if we
compare them at the same capacity, for example, 16L of TC_4 and 8L of TC_8 for the










This research was conducted to study the performance of trace cache
memory on a small-scale superscalar microprocessor. The superscalar DLX machine
[9] was chosen as the basis for the experiment. The original model can process two
instructions simultaneously with the help of a very small instruction cache to supply
instructions. The trace cache memory was designed with less complexity than in
previous works [20], [23]. There is no sophisticated branch prediction unit for
packing the instruction traces, and the number of instructions in one trace cache line
was reduced to balance with the issue width of the processor. The experiment was
performed on 2 main configurations, TC_4 and TC_8, which are 4-instruction-wide
and 8-instruction-wide trace caches, respectively. Each configuration has a number
of cache lines varied from 4 to 512 lines. The test programs used in this experiment
can be categorized into 2 groups: the short ones (e.g. bubblesort and primenumber)
and the longer ones (e.g. Permute and DCT).
6.2 Conclusions
The experiment shows that the crucial parameters affecting the performance of
the trace cache are the number of TC lines and the width of the trace. Increasing
both parameters leads to better performance of the trace cache, as indicated by the
increasing number of hits. However, increasing the number of TC lines also causes
more unused cache space because of the nature of the cache scheme.
Conclusion
The performance of TC_8 is generally better than that of TC_4 if the comparison
is made at the same number of TC lines, but if we compare them at the same cache
capacity, TC_8 does not really outperform TC_4, and most of the time the performance
of the former is lower than that of the latter.
Apart from those parameters that affect the performance of the trace cache, the
policy of the fill-unit and the logic unit for transferring traces from the fill-buffer
to the trace cache memory are also crucial. Investigation of the effect of these
policies is a matter for future work. From these results, the trace cache is not able to
demonstrate a clear advantage over the instruction cache as expected. On the other
hand, even this trace cache model without sophisticated fill strategies is quite complex
compared with the original instruction cache. In that case, we found no evidence that
it is worthwhile to invest the chip area to implement such a model, while a simple
instruction cache works quite well for narrow-issue processors.
6.3 Further Work
From the results of these experiments, we gain some insight into the
characteristics of the designed trace cache on a narrow-issue processor and also some
indications of pitfalls of the model. This section discusses these drawbacks,
which were not resolved in this research because of time limitations.
The results show that the number of TC lines and the width of traces are
crucial parameters for trace cache performance. However, there is another
parameter that is also vital but received less attention: the functional unit for
transferring traces from the fill-buffer to the trace cache memory. Some trace cache
lines are not as useful as they should be and cause a significant number of misses.
This is because the strategies for placing a trace into the destination TC line were
not as effective in avoiding misses as they could have been. Therefore, this unit
should be investigated further to find the optimum strategies. Clearly, this unit
involves significant complexity and adds to the implementation cost of the trace cache.
The advantage of a trace cache is the ability to contain two or more
non-consecutive basic-blocks, which an instruction cache cannot. The best metric we
have to evaluate the usefulness of the trace cache is the TC Line Content Hit, in which
the hit counts indicate the possibility of taking advantage of the trace cache. This
feature should have been explicitly gathered for the purpose of trace cache analysis.
Yet, it was not implemented in the design because of the complexity of modifying the
existing DLX model to collect this data.
In this experiment, only two test programs, Permute and DCT, were added to
the originally provided programs, bubblesort and primenumber, for the simulation.
They can be categorized as long programs and short programs according to the span
of their sourcecode. The results and analysis would be more reliable if more long
programs were simulated. However, two main constraints prevented us from
gathering more test programs. The first is that converting the sourcecode of the
prospective test programs to binary code is a time-consuming process because of the
hand conversion of the source code described in chapter 4. The second is that the
simulation of long programs takes a very long time.
Finally, the experiment reveals that another parameter that should be taken into
account to gain more insight into the trace cache is the strategy used to decide
whether to hold an existing trace or to replace it with a newly arriving trace that is
mapped to the same TC line. A more intelligent scheme would improve the performance
of the trace cache because it can hold the useful trace and ignore the less useful one
at the right time. However, investigation of this feature needs more time and certainly






This CD contains essential materials that can be used to reproduce the
simulation, together with the results created from our simulation for reference
purposes. In the root directory, there are three subdirectories: DLX Sourcecode,
Test Programs, and Simulation log files.
A.1 DLX Sourcecode
This subdirectory contains sets of VHDL sourcecode of the Superscalar DLX
processor model categorized by processor configurations. Each set of VHDL
sourcecode comprises four files: Dlx.vhd, DlxPackage.vhd, Environment.vhd, and
Testbench.vhd.
Dlx.vhd is the main VHDL file that describes the architecture of the DLX
processor. Most of the trace cache code is in this file.
DlxPackage.vhd is the package file that contains the types, subtypes, constants,
and functions used along with Dlx.vhd. This file also includes some code for the
trace cache.
Environment.vhd creates the environment for the simulation. It describes how
the processor model interfaces with the outside world and the system organization,
including the memory configuration. Originally, the memory capacity was 16 Kbyte,
which was not enough to run Permute and DCT. Therefore, this file was modified to
increase the memory capacity to 32 Kbyte.
Testbench.vhd connects all files together to make the simulation possible. This
file is the only original file that was not modified for the trace cache.
Companion CD-ROM Contents
To run each simulation configuration, we have to compile DlxPackage.vhd
first, followed by Dlx.vhd, Environment.vhd, and Testbench.vhd.
A.2 Test Programs
There are two subdirectories: out files and Test programs sourcecode. The out
files subdirectory contains .out files for use as test programs. To use these files, we
have to change the filename of the desired one to dlx.out and place it in the same
directory as the desired DLX VHDL code. For example, if we want to run DCT in the
simulation of TC_4 at 8L, rename dct.out to dlx.out and put it into directory
tc4_8l. When the simulation halts, it creates a result.log file of DCT for the
chosen configuration.
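The staging step can be scripted. The following is a hypothetical helper, not something shipped on the CD-ROM: it copies the chosen .out file into a configuration directory under the name dlx.out, leaving the original file in place. The paths passed to it are whatever the local copy of the CD-ROM uses.

```python
# Sketch: place a test program next to the DLX VHDL code as dlx.out.
# The helper name and its arguments are illustrative assumptions.
import shutil
from pathlib import Path


def stage_test_program(out_file: Path, config_dir: Path) -> Path:
    """Copy out_file into config_dir as dlx.out and return the new path."""
    if not out_file.is_file():
        raise FileNotFoundError(str(out_file))
    if not config_dir.is_dir():
        raise NotADirectoryError(str(config_dir))
    target = config_dir / "dlx.out"
    shutil.copyfile(out_file, target)     # keeps the original .out file intact
    return target
```

Copying rather than renaming is a design choice of this sketch, so that the same dct.out can be staged into several configuration directories in turn.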
In Test programs sourcecode, there are assembly files of all test programs.
Bubblesort and Primenumber consist of the original assembly sourcecode (bs-r and
pn-20) and the manually modified assembly sourcecode (bs-a, bs-d, pn-50, and
pn-100). Permute and DCT are the ones created by GNU-dlxcc from permute.c and
dct.c in the C sourcecode subdirectory and patched as described in 4.4, which is
included in that subdirectory. All assembly files are in the Assembly sourcecode
subdirectory.
A.3 Simulation log files
This directory contains the results created by each TC configuration as .log files.
Each file was renamed from the result.log created by the simulation to the name of
the test program. Each file contains information as follows:
• Log file banner - identifies the trace cache configuration and the name of the
test program used in the simulation.
Example:
*************************************************
TC_4 : 4 Lines * Test Program: bs-a.out
*************************************************
• General Information - this section provides the number of instructions that
have been fetched into the processor (Total Fetched Instructions), committed by
the Commit Unit (Committed Instructions), rejected (Omitted Instructions), and
the number of instructions that have been accessed from the instruction cache
(Cache Memory Access (fetch)).
• Instruction-cache Info - indicates how many instruction cache hits there are in
the simulation of the test program and the percentage of hits relative to the total
instruction cache accesses.
• Trace-cache Info - this section shows how many trace cache hits (including
TC-First Tag Hit and TC-Content Hit) and misses (including Compulsory
Misses and Conflict Misses) there are in the simulation of the test program and
the percentage of hits and misses.
• Information Collected From Individual Trace Cache Line - this table is the
information gathered from each TC line and contains the following items:
o Line - the trace cache line number,
o Comp-Miss - the number of compulsory misses on a TC line,
o Conf-Miss - the number of conflict misses on a TC line,
o TC-Write - the number of traces written on a TC line,
o TC-O Write - the number of traces that were overwritten with a
different trace content on a TC line,
o TC-Size - the longest trace size existing on a TC line,
o TC-Hit - the number of hits on a TC line,
o FTag-Hit - the number of TC-First Tag Hit on a TC line, and
o Cont-Hit - the number of TC-Content Hit on a TC line.
• Percentage of Cache Space Usage - this section indicates the percentage of
cache space that has been written by traces.
Appendix B
VHDL Code of Trace Cache
Three files have been modified from the original Superscalar DLX model.
They are Dlx.vhd, DlxPackage.vhd, and Environment.vhd.
B.1 Dlx.vhd
This file describes the architecture of the trace cache. The first part is signal
declaration.
-- Fill Buffer structure
signal FB_InstrBuffer : TypeArrayInstr( 0 to cInstrRow, 0 to cInstrSlot );
signal FB_InstrAddrBuffer : TypeArrayInstr( 0 to cInstrRow, 0 to cInstrSlot );
signal FB_BufferReady : unsigned( 0 to cInstrRow );
signal FB_TraceSize : TypeArraySlotCount( 0 to cInstrRow );
signal FB_BranchExisting : unsigned( 0 to cInstrRow );
signal FB_BranchSlot : TypeArraySlot( 0 to cInstrRow );
-- Signal to inform trace line counter when there are 2 instructions sitting in the line simultaneously
signal FB_TraceCount2Up : bit;
-- Buffer line termination signals
signal FB_RowTerminatedByA : bit;
signal FB_RowTerminatedByB : bit;
signal FB_FinishRowNumber_A : TypeRow;
signal FB_FinishRowNumber_B : TypeRow;
signal FB_BranchInstrA_Row : TypeRow;
signal FB_BranchInstrB_Row : TypeRow;
-- Instruction Write Enable
signal FB_InstrAWrite : bit;
signal FB_InstrBWrite : bit;
signal FB_InstrWrite_A_B : unsigned( 0 to 1 );








signal FB_InstrA_Row : TypeRow := 0;
signal FB_InstrA_Slot : TypeSlot := 0;
signal FB_InstrB_Row : TypeRow := 0;
signal FB_InstrB_Slot : TypeSlot := 0;
signal FB_CurrentRow : TypeRow := 0;
signal FB_CurrentSlot : TypeSlot := 0;
-- Instruction Type Flags
signal FB_InstrA_IsBranch : bit
signal FB_InstrB_IsBranch : bit
signal FB_InstrA_IsDelimiter :






signal FB_LastInstr : TypeWord;
signal FB_LastInstrShift : TypeWord;
signal FB_LastInstrIsBranch : bit := '0';
signal FB_Test : bit := '0';
  '1' when Clock = '0' and FB_InstrAWrite = '1' and
    ( FB_InstrA_IsDelimiter = '1' or
      FB_CurrentSlot = cInstrSlot or
      ( FB_InstrA_IsBranch = '1' and FB_BranchExisting(FB_CurrentRow) = '1'
    ) else
  '1' when Clock = '0' and FB_InstrWrite_A_B = "11" and
    FB_InstrA_IsBranch = '1' and
    FB_InstrB_IsBranch = '1' else
FB_RowTerminatedByB <=
  -- Test
  '0' when Clock = '0' and FB_InstrWrite_A_B = "11" and
    FB_CurrentSlot /= 0 and
    ( Equal( IF_InstrAddrRegA_Input, FB_LastInstrShift ) = '0' ) and
    ( FB_LastInstrIsBranch = '0' ) else
  '1' when Clock = '0' and FB_InstrWrite_A_B = "01" and
    FB_CurrentSlot /= 0 and
    ( Equal( IF_InstrAddrRegB_Input, FB_LastInstrShift ) = '0' ) and
    ( FB_LastInstrIsBranch = '0' ) else
  -- When only instruction B is coming
  '1' when Clock = '0' and FB_InstrWrite_A_B = "01" and
    ( FB_InstrB_IsDelimiter = '1' or
      FB_CurrentSlot = cInstrSlot or
      ( FB_InstrB_IsBranch = '1' and FB_BranchExisting(FB_CurrentRow)
  -- When both instructions are coming
  '0' when Clock = '0' and FB_InstrWrite_A_B = "11" and
    FB_InstrA_IsBranch = '1' and
    FB_BranchExisting(FB_CurrentRow) = '1' and
    FB_InstrB_IsBranch = '0' and
    FB_InstrB_IsDelimiter = '0' else
  '0' when Clock = '0' and FB_InstrWrite_A_B = "11" and
    FB_InstrA_IsBranch = '1' and
    FB_InstrB_IsBranch = '1' and
    FB_BranchExisting(FB_CurrentRow) = '0' else
  '1' when Clock = '0' and FB_InstrWrite_A_B = "11" and
    FB_InstrA_IsBranch = '1' and
    FB_InstrB_IsBranch = '1' and
    FB_BranchExisting(FB_CurrentRow) = '1' else
  '1' when Clock = '0' and FB_InstrWrite_A_B = "11" and
    ( FB_InstrB_IsDelimiter = '1' or
      FB_CurrentSlot = cInstrSlot-1 or
      ( FB_InstrA_IsBranch = '0' and
        FB_InstrA_IsDelimiter = '0' and
        FB_InstrB_IsBranch = '1' and
        FB_BranchExisting(FB_CurrentRow) = '1' ) ) else
-- Identify which buffer line(s) is (are) terminated.
FB_FinishRowNumber_A <=
  FB_CurrentRow when Clock = '0' and FB_InstrAWrite = '1' and
    FB_CurrentSlot /= 0 and
    ( Equal( IF_InstrAddrRegA_Input, FB_LastInstrShift ) = '0' ) and
    ( FB_LastInstrIsBranch = '0' ) else
  FB_CurrentRow when Clock = '0' and FB_InstrWrite_A_B = "11" and
    FB_InstrA_IsBranch = '1' and
    FB_BranchExisting(FB_CurrentRow) = '1' and
    FB_InstrB_IsBranch = '0' and
    FB_InstrB_IsDelimiter = '0' else
    ) else
  FB_InstrA_Row when Clock = '0' and FB_InstrAWrite = '1' and
    ( FB_InstrA_IsDelimiter = '1' or
      ( FB_CurrentSlot = cInstrSlot and
        FB_InstrA_IsBranch = '0' ) or
      ( FB_CurrentSlot = cInstrSlot and
        FB_InstrA_IsBranch = '1' and
        FB_BranchExisting(FB_CurrentRow) = '0'
  FB_CurrentRow when Clock = '0' and FB_InstrAWrite = '1' and
    ( FB_InstrA_IsBranch = '1' and
      FB_BranchExisting(FB_CurrentRow) = '1' ) else
  FB_InstrA_Row when Clock = '0' and FB_InstrWrite_A_B = "11" and
    FB_InstrA_IsBranch = '1' and
    FB_InstrB_IsBranch = '1' else
  unaffected;
FB_FinishRowNumber_B <=
    ) ) else
  -- Test
  FB_CurrentRow when Clock = '0' and FB_InstrWrite_A_B = "01" and
    FB_CurrentSlot /= 0 and
    ( Equal( IF_InstrAddrRegB_Input, FB_LastInstrShift ) = '0' ) and
    ( FB_LastInstrIsBranch = '0' ) else
  -- When only instruction B is coming
  FB_InstrB_Row when Clock = '0' and FB_InstrWrite_A_B = "01" and
    ( FB_InstrB_IsDelimiter = '1' or
      ( FB_CurrentSlot = cInstrSlot and
        FB_InstrB_IsBranch = '0' ) ) else
  FB_CurrentRow when Clock = '0' and FB_InstrWrite_A_B = "01" and
    ( FB_InstrB_IsBranch = '1' and
      FB_BranchExisting(FB_CurrentRow) = '1' ) else
  -- When both instructions are coming
  FB_InstrA_Row when Clock = '0' and FB_InstrWrite_A_B = "11" and
    FB_InstrA_IsBranch = '0' and
    FB_InstrA_IsDelimiter = '0' and
    FB_InstrB_IsBranch = '1' and
    FB_BranchExisting(FB_CurrentRow) = '1' else
  FB_InstrA_Row when Clock = '0' and FB_InstrWrite_A_B = "11" and
    FB_InstrA_IsBranch = '1' and
    FB_InstrB_IsBranch = '1' and
    FB_BranchExisting(FB_CurrentRow) = '1' else
  FB_InstrB_Row when Clock = '0' and FB_InstrWrite_A_B = "11" and
    ( FB_InstrB_IsDelimiter = '1' or
      FB_CurrentSlot = cInstrSlot-1 or
      ( FB_InstrA_IsBranch = '0' and
        FB_InstrA_IsDelimiter = '0' and
        FB_InstrB_IsBranch = '1' and
        FB_BranchExisting(FB_CurrentRow) = '1' ) ) else
  unaffected;
-- Logic for informing when there are 2 instructions sitting in the same line
-- ( for trace line counter mechanism )
FB_TraceCount2Up <= '1' when Clock = '0' and Clock'event and
    FB_InstrWrite_A_B = "11" and
    FB_CurrentSlot <= cInstrSlot-1 and
    ( ( FB_InstrA_IsBranch = '0' and
        FB_InstrB_IsBranch = '0' and
        FB_InstrA_IsDelimiter = '0' ) or
      ( ( ( FB_BranchExisting(FB_CurrentRow) = '0' and
            ( ( FB_InstrA_IsBranch = '1' and
                FB_InstrB_IsBranch = '0' ) or
              ( FB_InstrA_IsBranch = '0' and
                FB_InstrA_IsDelimiter = '0' and
                FB_InstrB_IsBranch = '1' ) ) ) or
          ( FB_BranchExisting(FB_CurrentRow) = '1' and
            ( FB_InstrA_IsBranch = '1' and
              FB_InstrB_IsDelimiter = '0' and
              FB_InstrB_IsBranch = '0' ) ) ) ) ) else
  '0';
-- Determining whether the incoming branch would sit in the current row or the next possible row
FB_BranchInstrA_Row <=
  0 when ( Clock = '0' and Clock'event ) and
    ( FB_InstrA_IsBranch = '1' and FB_BranchExisting( FB_CurrentRow ) = '1' ) and
    FB_CurrentRow = cInstrRow else
  FB_CurrentRow + 1 when ( Clock = '0' and Clock'event ) and
    ( FB_InstrA_IsBranch = '1' and FB_BranchExisting(
      FB_CurrentRow ) = '1' ) and
    FB_CurrentRow < cInstrRow else
  FB_CurrentRow;
FB_BranchInstrB_Row <=
  -- When only instruction B is coming
  0 when ( Clock = '0' and Clock'event ) and
    FB_InstrWrite_A_B = "01" and
    ( FB_InstrB_IsBranch = '1' and FB_BranchExisting( FB_CurrentRow ) = '1' ) and
    FB_CurrentRow = cInstrRow else
  FB_CurrentRow + 1 when ( Clock = '0' and Clock'event ) and
    FB_InstrWrite_A_B = "01" and
    ( FB_InstrB_IsBranch = '1' and FB_BranchExisting(
      FB_CurrentRow ) = '1' ) and
    FB_CurrentRow < cInstrRow else
  -- When both instructions are coming
  0 when ( Clock = '0' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    ( ( FB_CurrentRow = cInstrRow and
        ( ( FB_InstrA_IsBranch = '0' and
            FB_InstrA_IsDelimiter = '0' and
            FB_InstrB_IsBranch = '1' and
            FB_CurrentSlot = cInstrSlot ) or
          ( FB_InstrA_IsBranch = '0' and
            FB_InstrA_IsDelimiter = '0' and
            FB_InstrB_IsBranch = '1' and
            FB_BranchExisting(FB_CurrentRow) = '1' ) or
          ( FB_InstrA_IsDelimiter = '1' and
            FB_InstrB_IsBranch = '1' ) or
          ( FB_InstrA_IsBranch = '1' and
            FB_InstrB_IsBranch = '1' and
            FB_BranchExisting(FB_CurrentRow) = '1' ) ) ) or
      ( FB_CurrentRow = cInstrRow-1 and
        FB_InstrA_IsBranch = '1' and
        FB_InstrB_IsBranch = '1' and
        FB_BranchExisting(FB_CurrentRow) = '1' ) ) else
  FB_CurrentRow + 1 when ( Clock = '0' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    FB_CurrentRow < cInstrRow and
    ( ( FB_InstrA_IsBranch = '0' and
        FB_InstrA_IsDelimiter = '0' and
        FB_InstrB_IsBranch = '1' and
        ( FB_CurrentSlot = cInstrSlot or
          ( FB_BranchExisting(FB_CurrentRow) = '1' ) ) ) or
      ( FB_InstrA_IsDelimiter = '1' and
        FB_InstrB_IsBranch = '1' ) or
      ( FB_InstrA_IsBranch = '1' and
        FB_InstrB_IsBranch = '1' and




  FB_CurrentRow + 2 when ( Clock = '0' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    FB_CurrentRow < cInstrRow-1 and
    FB_InstrA_IsBranch = '1' and
  FB_CurrentRow;





-- Index for instruction A
FB_InstrA_Row <=
  -- Test
  0 when ( Clock = '0' and Clock'event ) and
    FB_InstrAWrite = '1' and
    FB_CurrentRow = cInstrRow and
    FB_CurrentSlot /= 0 and
    ( Equal( IF_InstrAddrRegA_Input, FB_LastInstrShift ) = '0' ) and
    ( FB_LastInstrIsBranch = '0' ) else




  FB_CurrentRow+1 when ( Clock = '0' and Clock'event ) and
    FB_InstrAWrite = '1' and
    FB_CurrentRow < cInstrRow and
    FB_CurrentSlot /= 0 and
    ( Equal( IF_InstrAddrRegA_Input, FB_LastInstrShift ) = '0' ) and
    ( FB_LastInstrIsBranch = '0' ) else
  0 when ( Clock = '0' and Clock'event ) and
    FB_InstrA_IsBranch = '1' and
    ( FB_BranchExisting(FB_CurrentRow) = '1' and FB_CurrentRow = cInstrRow ) and
    ( FB_InstrWrite_A_B = "10" or FB_InstrWrite_A_B = "11" ) else
  FB_CurrentRow+1 when ( Clock = '0' and Clock'event ) and
    FB_InstrA_IsBranch = '1' and
    ( FB_BranchExisting(FB_CurrentRow) = '1' and FB_CurrentRow
    ( FB_InstrWrite_A_B = "10" or FB_InstrWrite_A_B = "11" )
  FB_CurrentRow when ( Clock = '0' and Clock'event ) and
    ( FB_InstrWrite_A_B = "10" or FB_InstrWrite_A_B = "11" ) else
  unaffected;
FB_InstrA_Slot <=
  -- Test
  0 when ( Clock = '0' and Clock'event ) and
    FB_InstrAWrite = '1' and
    FB_CurrentSlot /= 0 and
    ( Equal( IF_InstrAddrRegA_Input, FB_LastInstrShift ) = '0' ) and
    ( FB_LastInstrIsBranch = '0' ) else
  0 when ( Clock = '0' and Clock'event ) and
    FB_InstrA_IsBranch = '1' and
    ( FB_InstrWrite_A_B = "10" or FB_InstrWrite_A_B = "11" ) and
    ( FB_BranchExisting(FB_CurrentRow) = '1' ) else
  FB_CurrentSlot when ( Clock = '0' and Clock'event ) and
    ( FB_InstrWrite_A_B = "10" or FB_InstrWrite_A_B = "11"
  unaffected;
-- Index for instruction B
FB_InstrB_Row <=
  -- Test
  0 when ( Clock = '0' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    FB_CurrentRow = cInstrRow and
    FB_CurrentSlot /= 0 and
    ( Equal( IF_InstrAddrRegA_Input, FB_LastInstrShift ) = '0' ) and
    ( FB_LastInstrIsBranch = '0' ) else
  FB_CurrentRow+1 when ( Clock = '0' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    FB_CurrentRow < cInstrRow and
    FB_CurrentSlot /= 0 and
    ( Equal( IF_InstrAddrRegA_Input, FB_LastInstrShift ) = '0' ) and
    ( FB_LastInstrIsBranch = '0' ) else
  0 when ( Clock = '0' and Clock'event ) and
    FB_InstrWrite_A_B = "01" and
    FB_CurrentSlot /= 0 and
    FB_CurrentRow = cInstrRow and
    ( Equal( IF_InstrAddrRegB_Input, FB_LastInstrShift ) = '0' ) and
    ( FB_LastInstrIsBranch = '0' ) else
  FB_CurrentRow+1 when ( Clock = '0' and Clock'event ) and
    FB_InstrWrite_A_B = "01" and
    FB_CurrentSlot /= 0 and
    FB_CurrentRow < cInstrRow and
    ( Equal( IF_InstrAddrRegB_Input, FB_LastInstrShift ) = '0' ) and
    ( FB_LastInstrIsBranch = '0' ) else
  0 when ( Clock = '0' and Clock'event ) and
    FB_InstrAWrite = '1' and
    FB_CurrentSlot /= 0 and
    FB_CurrentRow = cInstrRow and
    ( Equal( IF_InstrAddrRegA_Input, FB_LastInstrShift ) = '0' ) and
    ( FB_LastInstrIsBranch = '0' ) else




    FB_InstrAWrite = '1' and
    FB_CurrentSlot /= 0 and
    FB_CurrentRow < cInstrRow and
    ( Equal( IF_InstrAddrRegA_Input, FB_LastInstrShift ) = '0' ) and
    ( FB_LastInstrIsBranch = '0' ) else
  -- Current row has a branch and both instructions are coming
  1 when ( Clock = '0' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    FB_BranchExisting(FB_CurrentRow) = '1' and
    FB_CurrentRow = cInstrRow and
    FB_InstrA_IsBranch = '1' and
    FB_InstrB_IsBranch = '1' else
  FB_CurrentRow+2 when ( Clock = '0' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    FB_BranchExisting( FB_CurrentRow )




sBranch = '1' and
sBranch = '1' else
  0 when ( Clock = '0' and Clock'event ) and
    ( FB_InstrWrite_A_B = "11" and
      FB_BranchExisting(FB_CurrentRow) = '1' ) and
    ( ( FB_CurrentRow = cInstrRow and
        ( ( FB_InstrA_IsBranch = '1' and FB_InstrB_IsBranch = '0' ) or
          ( FB_InstrA_IsBranch = '0' and FB_InstrB_IsBranch = '1' ) ) ) or
      ( FB_CurrentRow = cInstrRow-1 and
        FB_InstrA_IsBranch = '1' and
        FB_InstrB_IsBranch = '1' ) ) else
  FB_CurrentRow+1 when ( Clock = '0' and Clock'event ) and
    ( FB_InstrWrite_A_B = "11" and
      FB_BranchExisting(FB_CurrentRow) = '1' ) and
    ( FB_CurrentRow < cInstrRow and
      ( ( FB_InstrA_IsBranch = '1' and FB_InstrB_IsBranch = '0' ) or
        ( FB_InstrA_IsBranch = '0' and FB_InstrB_IsBranch = '1' ) ) ) else
  -- Current row has NO branch and both instructions are coming
  0 when ( Clock = '0' and Clock'event ) and
    ( FB_InstrWrite_A_B = "11" and
      FB_BranchExisting(FB_CurrentRow) = '0' and
      FB_CurrentRow = cInstrRow ) and
    ( ( FB_CurrentSlot = cInstrSlot and
        ( ( FB_InstrA_IsBranch = '1' and FB_InstrB_IsBranch = '0' ) or
          ( FB_InstrA_IsBranch = '0' and FB_InstrB_IsBranch = '1' ) ) ) or
      ( FB_InstrA_IsBranch = '1' and
        FB_InstrB_IsBranch = '1' ) ) else
  FB_CurrentRow+1 when ( Clock = '0' and Clock'event ) and
    ( FB_InstrWrite_A_B = "11" and
      FB_BranchExisting(FB_CurrentRow) = '0' and
      FB_CurrentRow < cInstrRow ) and
    ( ( FB_CurrentSlot = cInstrSlot and
        ( ( FB_InstrA_IsBranch = '1' and FB_InstrB_IsBranch = '0' ) or
          ( FB_InstrA_IsBranch = '0' and FB_InstrB_IsBranch = '1' ) ) ) or
      ( FB_InstrA_IsBranch = '1' and
        FB_InstrB_IsBranch = '1' ) ) else
  -- Current row has a branch, only instruction B is coming, and it is a branch
  0 when ( Clock = '0' and Clock'event ) and
    FB_InstrWrite_A_B = "01" and
    FB_BranchExisting(FB_CurrentRow) = '1' and
    FB_InstrB_IsBranch = '1' and
    FB_CurrentRow = cInstrRow else
  FB_CurrentRow+1 when ( Clock = '0' and Clock'event ) and
    FB_InstrWrite_A_B = "01" and
    FB_BranchExisting( FB_CurrentRow )
    FB_InstrB_IsBranch = '1' and




  -- In case Instruction A is a delimiter instruction (Jumps, Trap, RFE)
  0 when ( Clock = '0' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    FB_CurrentRow = cInstrRow and
    FB_InstrA_IsDelimiter = '1' else
  FB_CurrentRow+1 when ( Clock = '0' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    FB_CurrentRow < cInstrRow and
    FB_InstrA_IsDelimiter = '1' else
  -- Original cases
  FB_CurrentRow when ( Clock = '0' and Clock'event ) and FB_InstrWrite_A_B = "01" else
  0 when ( Clock = '0' and Clock'event ) and
    ( FB_InstrWrite_A_B = "11" and FB_CurrentRow = cInstrRow and FB_CurrentSlot =
  FB_CurrentRow+1 when ( Clock = '0' and Clock'event ) and
    ( FB_InstrWrite_A_B = "11" and FB_CurrentSlot = cInstrSlot )




FB_InstrB_Slot <=
  0 when ( Clock = '0' and Clock'event ) and
    FB_InstrWrite_A_B = "01" and
  1 when
    FB_InstrWrite_A_B = "11" and
    FB_BranchExisting(FB_CurrentRow) = '1' and
    FB_CurrentRow < cInstrRow and
    ( ( FB_InstrA_IsBranch = '1' and
        FB_InstrB_IsBranch = '0' and
        FB_InstrB_IsDelimiter = '0' ) or
      ( FB_InstrA_IsBranch = '0' and
        FB_InstrA_IsDelimiter = '0' and
        FB_InstrB_IsBranch = '1' ) or
      ( FB_InstrA_IsDelimiter = '1' and
        FB_InstrB_IsBranch = '1' ) ) else
    ( Clock = '1' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    FB_BranchExisting(FB_CurrentRow) = '1' and
    FB_CurrentRow = cInstrRow and
    ( ( FB_InstrA_IsBranch = '1' and





sBranch = '1' and
sBranch = '1' ) ) else
  0 when ( Clock = '1' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    FB_BranchExisting(FB_CurrentRow) = '1' and
    ( ( FB_CurrentRow = cInstrRow and
        ( ( FB_InstrA_IsBranch = '1' and
            FB_InstrB_IsBranch = '0' and
            FB_InstrB_IsDelimiter = '0' ) or
          ( FB_InstrA_IsBranch = '0' and
            FB_InstrA_IsDelimiter = '0' and
            FB_InstrB_IsBranch = '1' ) or
          ( FB_InstrA_IsDelimiter = '1' and
            FB_InstrB_IsBranch = '1' ) ) ) or
      ( FB_CurrentRow = cInstrRow-1 and
        ( ( FB_InstrA_IsBranch = '1' and
            FB_InstrB_IsDelimiter = '1' ) or
          ( FB_InstrA_IsBranch = '1' and
            FB_InstrB_IsBranch = '1' ) ) ) ) else
  -- Current row has NO branch and both instructions are coming
  1 when ( Clock = '1' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    FB_BranchExisting(FB_CurrentRow) = '0' and
    FB_CurrentSlot = cInstrSlot and
    FB_CurrentRow = cInstrRow and
    FB_InstrA_IsBranch = '1' and
    FB_InstrB_IsDelimiter = '1' else
  0 when ( Clock = '1' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    FB_BranchExisting(FB_CurrentRow) = '0' and
    FB_CurrentSlot = cInstrSlot and
    ( ( FB_CurrentRow = cInstrRow and
        ( ( FB_InstrA_IsBranch = '1' and
            FB_InstrB_IsBranch = '0' and
            FB_InstrB_IsDelimiter = '0' ) or
          ( FB_InstrA_IsBranch = '0' and
            FB_InstrA_IsDelimiter = '0' and
            FB_InstrB_IsBranch = '1' ) or
          ( FB_InstrA_IsDelimiter = '1' and
            FB_InstrB_IsBranch = '1' ) ) ) or
      ( FB_CurrentRow = cInstrRow-1 and
        FB_InstrA_IsDelimiter = '1' and
        FB_InstrB_IsDelimiter = '1' ) ) else
  FB_CurrentRow+1 when ( Clock = '1' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    FB_CurrentRow <= cInstrRow-1 and
    FB_InstrA_IsBranch = '1' and
    FB_InstrB_IsBranch = '1' else
  FB_CurrentRow+1 when ( Clock = '1' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    FB_BranchExisting(FB_CurrentRow) = '0' and
    FB_CurrentSlot = cInstrSlot and
    ( -- ( FB_CurrentRow <= cInstrRow-1 and
      --   FB_InstrA_IsBranch = '1' and
      --   FB_InstrB_IsBranch = '1' ) or
      ( FB_CurrentRow < cInstrRow and
        ( ( FB_InstrA_IsBranch = '1' and
            FB_InstrB_IsBranch = '0' and
            FB_InstrB_IsDelimiter = '0' ) or
          ( FB_InstrA_IsBranch = '0' and
            FB_InstrA_IsDelimiter = '0' and
            FB_InstrB_IsBranch = '1' ) ) ) or
      ( FB_CurrentRow = cInstrRow-1 and
        FB_InstrA_IsDelimiter = '1' and
        FB_InstrB_IsBranch = '1' ) ) else
  -- Current row has a branch and instruction A is coming (it is a branch)
  0 when ( Clock = '1' and Clock'event ) and
    FB_InstrWrite_A_B = "10" and
    FB_BranchExisting(FB_CurrentRow) = '1' and
    FB_CurrentRow = cInstrRow and
    FB_InstrA_IsBranch = '1' else
  FB_CurrentRow+1 when ( Clock = '1' and Clock'event ) and
    FB_InstrWrite_A_B = "10" and
    FB_BranchExisting( FB_CurrentRow ) = '1' and
    FB_CurrentRow < cInstrRow and
    FB_InstrA_IsBranch = '1' else
  -- Current row has a branch and instruction B is coming (it is a branch)
  0 when ( Clock = '1' and Clock'event ) and
    FB_InstrWrite_A_B = "01" and
    FB_BranchExisting( FB_CurrentRow ) =
    FB_CurrentRow = cInstrRow and





  FB_CurrentRow+1 when ( Clock = '1' and Clock'event ) and
    FB_InstrWrite_A_B = "01" and
    FB_BranchExisting( FB_CurrentRow )
    FB_CurrentRow < cInstrRow and
    FB_InstrB_IsBranch = '1' else
  -- If Instruction A is a delimiter instruction
  FB_InstrA_Row+1 when ( Clock = '1' and Clock'event ) and
    FB_InstrWrite_A_B = "10" and
    FB_InstrA_IsDelimiter = '1' and
    FB_InstrA_Row < cInstrRow else




    '1' and Clock'event ) and
    FB_InstrWrite_A_B = "10" and
    FB_InstrA_IsDelimiter = '1' and
    FB_InstrA_Row = cInstrRow else
  FB_InstrB_Row when ( Clock = '1' and Clock'event ) and
    FB_InstrA_IsDelimiter = '1' and
    FB_InstrB_IsDelimiter = '0' else
  -- If Instruction B is a delimiter instruction
  FB_InstrB_Row+1 when ( Clock = '1' and Clock'event ) and
    FB_InstrB_IsDelimiter = '1' and
    FB_InstrB_Row < cInstrRow else
  0 when ( Clock = '1' and Clock'event ) and
    FB_InstrB_IsDelimiter = '1' and
    FB_InstrB_Row = cInstrRow else
  -- Original cases
  FB_CurrentRow+1 when ( Clock = '1' and Clock'event ) and
    ( ( FB_InstrWrite_A_B = "10" and
        FB_InstrA_Row < cInstrRow and
        FB_InstrA_Slot = cInstrSlot ) or
      ( FB_InstrWrite_A_B = "01" and
        FB_InstrB_Row < cInstrRow and
        FB_InstrB_Slot = cInstrSlot ) or
      ( FB_InstrWrite_A_B = "11" and
        ( ( FB_InstrB_Row /= 0 and FB_InstrA_Slot =
          ( FB_InstrB_Row < cInstrRow and FB_InstrB_Slot =
    Clock = '1' and Clock'event ) and
    ( ( FB_InstrWrite_A_B = "10" and
        FB_InstrA_Row = cInstrRow and
        FB_InstrA_Slot = cInstrSlot ) or
      ( FB_InstrWrite_A_B = "01" and
        FB_InstrB_Row = cInstrRow and
        FB_InstrB_Slot = cInstrSlot ) or
      ( FB_InstrWrite_A_B = "11" and








  2 when ( Clock = '1' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    FB_CurrentSlot /= 0 and
    ( Equal( IF_InstrAddrRegA_Input, FB_LastInstrShift ) = '0' ) and
    ( FB_LastInstrIsBranch = '0' ) else
  1 when ( Clock = '1' and Clock'event ) and
    FB_InstrAWrite = '1' and
    FB_CurrentSlot /= 0 and
    ( Equal( IF_InstrAddrRegA_Input, FB_LastInstrShift ) = '0' ) and
    ( FB_LastInstrIsBranch = '0' ) else
  1 when ( Clock = '1' and Clock'event ) and
    FB_InstrWrite_A_B = "01" and
    FB_CurrentSlot /= 0 and
    ( Equal( IF_InstrAddrRegB_Input, FB_LastInstrShift ) = '0' ) and
    ( FB_LastInstrIsBranch = '0' ) else
  -- Current row has a branch and both instructions are coming
  2 when ( Clock = '1' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    FB_BranchExisting(FB_CurrentRow) = '1' and
    FB_InstrA_IsBranch = '1' and
    FB_InstrB_IsBranch = '0' and
    FB_InstrB_IsDelimiter = '0' else
  0 when ( Clock = '1' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    FB_BranchExisting(FB_CurrentRow) = '1' and
    FB_InstrA_IsBranch = '1' and
    FB_InstrB_IsDelimiter = '1' else
    ( Clock = '1' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    FB_BranchExisting(FB_CurrentRow) = '1' and
    ( ( FB_InstrA_IsBranch = '0' and
        FB_InstrA_IsDelimiter = '0' and







      ( FB_InstrA_IsBranch =





  1 when ( Clock = '1' and Clock'event ) and
    FB_InstrWrite_A_B = "10" and
    FB_BranchExisting(FB_CurrentRow) = '1' and
    FB_InstrA_IsBranch = '1' else
  1 when ( Clock = '1' and Clock'event ) and
    FB_InstrWrite_A_B = "01" and
    FB_BranchExisting(FB_CurrentRow) = '1' and
    FB_InstrB_IsBranch = '1' else
  -- Current row has NO branch and both instructions are coming
  1 when ( Clock = '1' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    FB_BranchExisting(FB_CurrentRow) = '0' and
    ( ( FB_CurrentSlot = cInstrSlot and
        FB_InstrA_IsBranch = '0' and
        FB_InstrA_IsDelimiter = '0' and
        FB_InstrB_IsBranch = '1' ) or
      ( FB_InstrA_IsBranch = '1' and
        FB_InstrB_IsBranch = '1' ) ) else
  0 when ( Clock = '1' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    FB_BranchExisting(FB_CurrentRow) = '0' and
    ( ( FB_CurrentSlot = cInstrSlot-1 and
        FB_InstrA_IsBranch = '0' and
        FB_InstrA_IsDelimiter = '0' and
        FB_InstrB_IsBranch = '1' ) or
      ( FB_InstrA_IsDelimiter = '1' and
        FB_InstrB_IsBranch = '1' ) ) else
  FB_CurrentSlot+2 when ( Clock = '1' and Clock'event ) and
    FB_InstrWrite_A_B = "11" and
    FB_BranchExisting(FB_CurrentRow) = '0' and
    FB_CurrentSlot < cInstrSlot-1 and
    FB_InstrA_IsBranch = '0' and
    FB_InstrA_IsDelimiter = '0' and
    FB_InstrB_IsBranch = '1' else
  0 when ( Clock = '1' and Clock'event ) and
    ( ( FB_InstrWrite_A_B = "10" and
        FB_InstrA_IsDelimiter = '1' ) or
      FB_InstrB_IsDelimiter = '1' ) else
  FB_InstrB_Slot+1 when ( Clock = '1' and Clock'event ) and
    FB_InstrA_IsDelimiter = '1' and
    FB_InstrB_IsDelimiter = '0' else
  -- Original cases
  0 when ( Clock = '1' and Clock'event ) and
    ( ( FB_InstrWrite_A_B = "10" and FB_InstrA_Slot = cInstrSlot ) or
      ( FB_InstrWrite_A_B = "01" and FB_InstrB_Slot = cInstrSlot ) or
      ( FB_InstrWrite_A_B = "11" and FB_InstrB_Slot = cInstrSlot )
  1 when ( Clock = '1' and Clock'event ) and
    ( FB_InstrWrite_A_B = "11" and FB_InstrA_Slot = cInstrSlot and
      ( FB_InstrA_IsBranch = '1'
        FB_InstrB_IsBranch = '1'














    Clock = '1' and Clock'event ) and
    ( ( FB_InstrWrite_A_B = "10" and FB_InstrA_Slot <=
      ( FB_InstrWrite_A_B = "01" and FB_InstrB_Slot <=
    Clock = '1' and Clock'event ) and
    ( FB_InstrWrite_A_B = "11" and FB_InstrB_Slot <=
This part is the concurrent part of the trace cache memory and the trace cache hit logic.





























    '1' and
    FB_TraceSize(1)-1 and
  FB_InstrAddrBuffer(2,FB_BranchSlot(2)+1) when FB_TraceSize(2) > 1 and
    FB_BufferReady(2) =
    FB_BranchSlot(2) <
  FB_InstrAddrBuffer(3,FB_BranchSlot(3)+1) when FB_TraceSize(3) > 1 and
    FB_BufferReady(3) =
    FB_BranchSlot(3) < FB_TraceSize(3)-1 and
    FB_BranchExisting(3) = '1' else
  unaffected;
TC_SelectedEntry <= To_Integer(TC_FirstInstrAddr(3 downto 2)); -- <--- No of bit for TC line amount
rc_Firsrras'* <: Equar( 'L*'iËlìljilaläiìiillli¿åliÌî;:ï*;:å:31:"ii3lÌ3.311,1."',,,, ,
    '1' & IF_InstrCounterReg(31 downto 2) ); -- ^^^^ Change here for No of bit for TC line amount
TC_OtherHit <= '1' when TC_HitLine = '1' and
    TC_FirstTagHit = '0' and
    ( To_Integer(IF_InstrCounterReg) >
      To_Integer(TC_TagReg_01(TC_HitLineNumber) & "00") ) and
    ( ( TC_BranchExisting(TC_HitLineNumber) = '0' and
        To_Integer(IF_InstrCounterReg) <
        To_Integer(TC_TagReg_01(TC_HitLineNumber) & "00" + ( TC_TraceSize(TC_HitLineNumber) * 4 ))
      ) or




        To_Integer(TC_TagReg_02(TC_HitLineNumber) & "00" + ( ( TC_TraceSize(TC_HitLineNumber) -
          TC_BranchSlot(TC_HitLineNumber)+1) * 4 ))
  else
-- Trace Cache Hit Logic
TC_Hit <= TC_FirstTagHit or TC_OtherHit;
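The hit test in the listing separates a hit on the first instruction of a trace line (TC_FirstTagHit) from a hit that falls inside a line (TC_OtherHit). The address-range idea behind it can be sketched as a small stand-alone model. This is only an illustration under stated assumptions: the entity tc_hit_sketch, its port names and the bound of 16 instructions per line are hypothetical and do not appear in Dlx.vhd, it uses ieee.numeric_std instead of the bit-based types of the listing, and it ignores both the branch-slot case and the TC_HitLine state used by the original code.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical stand-alone model; names and widths are illustrative only.
entity tc_hit_sketch is
  port (
    pc         : in  unsigned(31 downto 0);  -- fetch address (byte address)
    line_valid : in  std_logic;              -- valid bit of the selected line
    line_tag   : in  unsigned(29 downto 0);  -- word address of the line's first instruction
    trace_size : in  natural range 0 to 16;  -- number of instructions held in the line
    first_hit  : out std_logic;              -- pc is the first instruction of the trace
    other_hit  : out std_logic;              -- pc lies inside the trace, past the first instruction
    hit        : out std_logic
  );
end entity tc_hit_sketch;

architecture rtl of tc_hit_sketch is
  signal start_addr : unsigned(31 downto 0);
  signal end_addr   : unsigned(31 downto 0);
  signal first_i    : std_logic;
  signal other_i    : std_logic;
begin
  -- Byte address of the first instruction and of the first byte after the trace.
  start_addr <= line_tag & "00";
  end_addr   <= start_addr + to_unsigned(trace_size * 4, 32);

  -- First-tag hit: the word address of pc equals the stored tag.
  first_i <= '1' when line_valid = '1' and pc(31 downto 2) = line_tag else '0';

  -- Other hit: pc is strictly inside the address range covered by the trace.
  other_i <= '1' when line_valid = '1' and first_i = '0' and
                      pc > start_addr and pc < end_addr else '0';

  first_hit <= first_i;
  other_hit <= other_i;
  hit       <= first_i or other_i;
end architecture rtl;
```

The final OR mirrors the TC_Hit assignment of the listing; everything else is a simplification.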
This section is the sequential portion of the VHDL code that defines the
conditions for putting instructions and their associated addresses into the fill-buffer.
-- Fill Buffer
-- Place instruction(s) and address(es) into fill-buffer
if FB_InstrAWrite = '1' then
  FB_InstrBuffer( FB_InstrA_Row, FB_InstrA_Slot ) <= IF_InstrRegA_Input;
  FB_InstrAddrBuffer( FB_InstrA_Row, FB_InstrA_Slot )




if FB_InstrBWrite = '1' then
  FB_InstrBuffer( FB_InstrB_Row, FB_InstrB_Slot ) <= IF_InstrRegB_Input;
  FB_InstrB_Row, FB_InstrB_Slot )
end if;
-- for Experiment
if FB_InstrWrite_A_B = "10" then
  FB_LastInstr <= IF_InstrAddrRegA_Input;
  FB_LastInstrShift <= IF_InstrAddrRegA_Input+4;
  FB_LastInstrIsBranch <= IsBranch(IF_InstrRegA_Input);
end if;
if FB_InstrWrite_A_B = "01" or FB_InstrWrite_A_B = "11" then
  FB_LastInstr <= IF_InstrAddrRegB_Input;
  FB_LastInstrShift <= IF_InstrAddrRegB_Input+4;




-- Any row termination(s)?
-- If so, which instruction? (A and/or B did it) and which line?
-- When known, set the "Buffer Ready flag" to indicate the incident
if FB_RowTerminatedByA = '1' then
  FB_BufferReady( FB_FinishRowNumber_A ) <= '1';
end if;
if FB_RowTerminatedByB = '1' then
  FB_BufferReady( FB_FinishRowNumber_B ) <= '1';
end if;
-- Counting the trace size of fill-buffer line(s)
if FB_InstrWrite_A_B = "10" then
  FB_TraceSize( FB_InstrA_Row ) <= FB_TraceSize( FB_InstrA_Row
end if;
if FB_InstrWrite_A_B = "01" then
  FB_TraceSize( FB_InstrB_Row ) <= FB_TraceSize( FB_InstrB_Row
end if;
if FB_InstrWrite_A_B = "11" then
  if FB_TraceCount2Up = '1' then








    <= FB_TraceSize( FB_InstrA_Row )
    1;
    <= FB_TraceSize( FB_InstrA_Row ) +
    <= FB_TraceSize( FB_InstrB_Row ) +
-- Updating "Branch Existing Flag" and "Branch Slot" of trace information when a branch arrives
if FB_InstrA_IsBranch = '1' then
  FB_BranchExisting( FB_BranchInstrA_Row ) <= '1';
  FB_BranchSlot( FB_InstrA_Row ) <= FB_InstrA_Slot;
end if;
if FB_InstrB_IsBranch = '1' then
  FB_BranchExisting( FB_BranchInstrB_Row ) <= '1';
  FB_BranchSlot( FB_InstrB_Row ) <= FB_InstrB_Slot;
end if;
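The bookkeeping done per fill-buffer line (trace size, branch flag, branch slot and the ready flag) can be illustrated for a single line in isolation. The following is a minimal sketch under assumptions, not the code of Dlx.vhd: the entity fb_line_sketch, its ports, the generic SLOTS and the use of a rising clock edge with std_logic are all hypothetical, and the model accepts only one instruction per cycle, whereas the listing handles instructions A and B together.

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical single-line model; entity, ports and generic are illustrative only.
entity fb_line_sketch is
  generic (
    SLOTS : positive := 8   -- assumed number of instruction slots per line
  );
  port (
    clk             : in  std_logic;
    reset           : in  std_logic;
    wr_en           : in  std_logic;  -- an instruction is written into this line
    is_branch       : in  std_logic;  -- the written instruction is a branch
    terminate       : in  std_logic;  -- the written instruction ends the line
    trace_size      : out natural range 0 to SLOTS;
    branch_existing : out std_logic;
    branch_slot     : out natural range 0 to SLOTS - 1;
    buffer_ready    : out std_logic
  );
end entity fb_line_sketch;

architecture rtl of fb_line_sketch is
  signal size_i : natural range 0 to SLOTS := 0;
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if reset = '1' then
        size_i          <= 0;
        branch_existing <= '0';
        branch_slot     <= 0;
        buffer_ready    <= '0';
      elsif wr_en = '1' and size_i < SLOTS then
        -- Remember where the branch sits so it can be stored as trace information.
        if is_branch = '1' then
          branch_existing <= '1';
          branch_slot     <= size_i;
        end if;
        -- One more instruction now belongs to the trace in this line.
        size_i <= size_i + 1;
        -- Flag the line as ready for transfer once it has been terminated.
        if terminate = '1' then
          buffer_ready <= '1';
        end if;
      end if;
    end if;
  end process;

  trace_size <= size_i;
end architecture rtl;
```

A testbench would assert reset for one cycle before driving wr_en, since the flag outputs have no defined value before the first reset.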
This section describes the transfer of instructions from the fill-buffer to the
trace cache memory and the trace cache hit consideration.
-- Trace transfer function and line reset
for line in FB_BufferReady'range loop
  if FB_BufferReady( line ) = '1' then
    -- If there is more than one instruction in the line, the transfer function will commence
    -- Otherwise, the line would be abandoned.
    if ( FB_TraceSize( line ) > 1 ) and
       (
         (TC_FirstInstrAddr(31 downto 2) /= TC_TagReg_01(TC_SelectedEntry)) or
         (FB_TraceSize( line ) >= TC_TraceSize(TC_SelectedEntry))
       ) then -- Transfer function commencing
      TC_ValidBit(TC_SelectedEntry) <= '1';
      TC_TagReg_01(TC_SelectedEntry) <= TC_FirstInstrAddr(31 downto 2);
      if ( FB_BranchExisting(line) = '1' and FB_BranchSlot(line) <






      TC_TagReg_02(TC_SelectedEntry) <= TC_FirstInstrAddr(31
      TC_TraceSize(TC_SelectedEntry) <= FB_TraceSize(line);
      TC_BranchExisting(TC_SelectedEntry) <= FB_BranchExisting(line);
      TC_BranchSlot(TC_SelectedEntry) <= FB_BranchSlot(line);
      downto 2) ) and
      FB_TraceSize(line)-1 ) )
      downto 2) ) and
      FB_BranchSlot(line) < FB_TraceSize(line)-1 ) )
      -- For counting the number of writing to the line
      TC_TraceWrite(TC_SelectedEntry) <= TC_TraceWrite(TC_SelectedEntry) + 1;
      not ( FB_BranchExisting( line )
      ) then
      TC_TraceOverwrite(TC_SelectedEntry) <= TC_TraceOverwrite(TC_SelectedEntry) + 1;
      -- Number of overwriting
      if ( ( TC_TraceSize(TC_SelectedEntry) /= FB_TraceSize(line) ) or
           ( ( TC_TagReg_01(TC_SelectedEntry) /= TC_DestInstrAddr(31
             ( FB_BranchExisting(line) = '1' and FB_BranchSlot(line) <








      end if;
      -- Recording the longest trace in a particular line
      if ( FB_TraceSize( line ) > TC_LongestTrace( TC_SelectedEntry ) ) then
        TC_LongestTrace( TC_SelectedEntry ) <= FB_TraceSize( line );
      end if;
      for trace_slot in 0 to cInstrSlot loop
        TC_Instr(TC_SelectedEntry,trace_slot) <= FB_InstrBuffer(line, trace_slot);
        TC_InstrAddr(TC_SelectedEntry,trace_slot) <= FB_InstrAddrBuffer(line, trace_slot);
      end loop;
    end if;
    FB_BufferReady( line ) <= '0';
    FB_TraceSize( line ) <= 0;
    FB_BranchExisting( line ) <= '0';
    FB_BranchSlot( line ) <= 0;
    for slot in 0 to cInstrSlot loop
      FB_InstrBuffer( line, slot ) <= ( others => '0' );





if TC_FirstTagHit = '1' then
  TC_HitLine <= '1';
  TC_HitLineNumber <= To_Integer(IF_InstrCounterReg(3 downto 2));
  -- Change here for TC lines
end if;
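The transfer condition used above (a finished line is copied only if it holds more than one instruction and either its tag differs from the stored tag or the new trace is at least as long as the stored one) can be isolated as a pure function. The sketch below assumes a package named tc_fill_policy_sketch and a function should_transfer; both names and the parameter list are hypothetical and serve only to restate the condition of the listing in compact form.

```vhdl
-- Hypothetical package; names are illustrative and not part of Dlx.vhd.
package tc_fill_policy_sketch is
  -- Returns true when a finished fill-buffer line should be copied into
  -- the selected trace cache entry.
  function should_transfer(
    fb_trace_size : natural;    -- instructions in the finished fill-buffer line
    tc_trace_size : natural;    -- instructions currently stored in the selected entry
    same_tag      : boolean     -- true if the line starts at the address already stored
  ) return boolean;
end package tc_fill_policy_sketch;

package body tc_fill_policy_sketch is
  function should_transfer(
    fb_trace_size : natural;
    tc_trace_size : natural;
    same_tag      : boolean
  ) return boolean is
  begin
    -- Lines holding a single instruction are abandoned.
    if fb_trace_size <= 1 then
      return false;
    end if;
    -- A different start address always replaces the entry; the same start
    -- address replaces it only if the new trace is at least as long.
    return (not same_tag) or (fb_trace_size >= tc_trace_size);
  end function should_transfer;
end package body tc_fill_policy_sketch;
```

For example, should_transfer(4, 6, true) is false because the stored trace with the same start address is longer, while should_transfer(4, 6, false) is true.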
-- External RESET
if Reset = '1' then
  IF_ValidFlagA <= '0';
  IF_ValidFlagB <= '0';
  BTB_ValidFlag <= ( others => '0' );
  DP_HaltFlag <= '0';
  DP_InterruptEnableFlag <= '0';
  DP_ProcessIdentifierReg <= ( others
  RB_ValidFlag <= ( others => '0' );
  BRU_ValidFlag <= '0';
  ALU_ValidFlag <= '0';
  MDU_ValidFlag <= '0';
  LSU_ValidFlag <= '0';
  LSU_EA_ValidFlag <= '0';
  LSU_SPR_ValidFlag <= '0';
  CU_NextCommitPointerReg <= "10000";
  ITB_ValidFlag <= ( others => '0' );
  IC_ValidFlag <= ( others => '0' );
  DTB_ValidFlag <= ( others => '0' );
  DC_ValidFlag <= ( others
  WB_EntranceValidFlag <= '0';
  WB_ValidFlag <= ( others => '0' );
  BIU_ActiveLoadFlag <= '0';
  BIU_ActiveFetchFlag <= '0';
  BIU_ActiveStoreFlag <= '0';
  BIU_FirstBusClockOfActiveCycleFlag <=
  TC_TraceWrite <= ( others => 0 );
  TC_TraceOverwrite <= ( others => 0 );
This section shows the resetting of the trace cache signal at the start of the
simulation (in bold).
-- Initialize Pointer
This is the last part of Dlx.vhd for writing the log file of the simulation.
end if;
-- Write acquired data to file --




wait on IncomingClock until IncomingClock = '1';




-- Name of experiment set
constant exp_name : string(1 to 49)
constant exp_name_ul : string(1 to 49)
-- Type of Hit/Miss
constant TraceMiss : string(1 to 26) := "No. of Trace Cache Miss : "
constant TCAccessMiss : string(1 to 26)
constant CompMiss : string(1 to 28)
constant ContMiss : string(1 to 28)
-- Result file name
file log : text open write_mode is "result.log";
variable log_line : line;
-- Variables for experiment result
variable TraceMissCount : natural := 0;
variable TCAccessMissCount : natural := 0;
variable CompMissCount : natural := 0;
variable ContMissCount : natural := 0;
variable CacheAccessCount : natural := 0;
Compulsary Miss = ";
Conflict Miss = ";
constant CacheAccess : string(1 to 26) := "No. of All Cache Access = ";
use std.textio.all;
-- Disposal
constant ICache_Hit : string(1 to 30) := "Total Instruction Cache Hit = ";
constant TCache_FirstTagHit : string(1 to 21) := "TC (First Tag) Hit = ";
constant TCache_OtherHit : string(1 to 17) := "TC (other) Hit = ";
constant PC : string(1 to 26) := "Program Counter Address : ";
constant InstrA : string(1 to 26) := "Instruction A Address : ";
constant InstrB : string(1 to 26) := "Instruction B Address : ";
constant Comit : string(1 to 30) := "Comitted Instruction Count = ";
constant Omit : string(1 to 28) := "Omitted Instruction Count: ";







variable ICache_HitCount : natural := 0;
variable TCache_FirstTagHitCount : natural := 0;
variable TCache_OtherHitCount : natural := 0;
variable FirstTagHitCount : TraceCacheLine;
variable ContentHitCount : TraceCacheLine;
type TraceCacheLine is array (0 to cTC_Entry) of natural;
variable CompMissLineCount : TraceCacheLine;






Comit_Count : natural := 0;
MemFetch_Count : natural := 0;
PC_Word : string(1 to 8);
InstrA_Word : string(1 to 8);
InstrB_Word : string(1 to 8);
variable CacheSpaceUsage : integer := 0;
-- Datatype conversion functions
function NumberToDigit( Number : natural ) return character is
begin
  if (Number >= 0) and (Number <= 9) then
    return character'val( character'pos('0') + Number );
  elsif (Number >= 10) and (Number <= 15) then







function NaturalToString( Number : natural ) return string is
  variable StringResult : string(1 to 8);
  variable WorkNumber : natural := Number;
begin
  for i in 8 downto 1 loop
    StringResult( i ) := NumberToDigit( WorkNumber





function WordToString( Word : unsigned ) return string is
  variable StringResult : string(1 to 8);
  variable Digit : unsigned( 3 downto 0 );
begin
  for i in 1 to 8 loop
    Digit := Word( 4*(8-i)+3 downto 4*(8-i) );




    To_Integer( Digit ))
begin
  wait on Clock until Clock = '1';
  PC_Word := WordToString(IF_InstrCounterReg);
  InstrA_Word := WordToString(IF_InstrAddrRegA_Input);
  InstrB_Word := WordToString(IF_InstrAddrRegB_Input);
  if IF_InstrCounterRegWrite = '1' then
    CacheAccessCount := CacheAccessCount + 1;
    if not ( TC_FirstTagHit = '1' or TC_OtherHit = '1' ) then
      TraceMissCount := TraceMissCount + 1;
    end if;
    if ( TC_FirstTagHit = '0' and TC_OtherHit = '0' ) then
      TCAccessMissCount := TCAccessMissCount + 1;
      if TC_ValidBit(To_Integer(IF_InstrCounterReg(3 downto 2))) = '0' then
        CompMissCount := CompMissCount + 1;
        CompMissLineCount(To_Integer(IF_InstrCounterReg(3 downto 2))) :=
          CompMissLineCount(To_Integer(IF_InstrCounterReg(3 downto 2))) + 1;
      else
        ContMissCount := ContMissCount + 1;
        ContMissLineCount(To_Integer(IF_InstrCounterReg(3 downto 2))) :=
          ContMissLineCount(To_Integer(IF_InstrCounterReg(3 downto 2))) + 1;
      end if;
    end if;
    if TC_FirstTagHit = '1' then
      FirstTagHitCount(To_Integer(IF_InstrCounterReg(3 downto 2))) :=
        FirstTagHitCount(To_Integer(IF_InstrCounterReg(3 downto 2))) + 1;
    end if;
    if TC_OtherHit = '1' then
      ContentHitCount(TC_HitLineNumber) := ContentHitCount(TC_HitLineNumber) + 1;
    end if;
  end if;
  if DP_HaltDlx = '1' then
    -- Display the experiment set name
    write(log_line, exp_name_ul);
    writeline(log, log_line);
    write(log_line, exp_name);
    writeline(log, log_line);
    write(log_line, exp_name_ul);
    writeline(log, log_line);




    write(log_line, string'("-------------------"));
    writeline(log, log_line);
    -- Fetched Instruction Count
    write(log_line, string'("Total Fetched Instructions = "));
    write(log_line, Instr_Count);
    writeline(log, log_line);
    -- Comitted/Omitted Instructions
    write(log_line, string'("Comitted Instructions : "));
    write(log_line, Comit_Count);
    writeline(log, log_line);
    write(log_line, string'("Omitted Instructions = "));
    write(log_line, Instr_Count-Comit_Count);
    writeline(log, log_line);
    -- Cache Information
    -- Cache Access
    write(log_line, string'("Cache Memory Access (fetch) = "));
    write(log_line, CacheAccessCount);
    writeline(log, log_line);




  write(log_line, string'("----------------------"));
  writeline(log, log_line);
  write(log_line, string'("Instruction-Cache Hit = "));
  write(log_line, ICache_HitCount);
  writeline(log, log_line);
  write(log_line, string'("Percent of IC Hit = "));
  write(log_line, real(ICache_HitCount*100)/real(CacheAccessCount), digits => 2);
  writeline(log, log_line);
  writeline(log, log_line);
  -- Trace Cache
  write(log_line, string'("----------------")); writeline(log, log_line);
  write(log_line, string'("Trace-Cache Info")); writeline(log, log_line);
  write(log_line, string'("----------------"));
  writeline(log, log_line);
  write(log_line, string'("Trace-Cache Hit = "));
  write(log_line, TCache_FirstTagHitCount + TCache_OtherHitCount);
  write(log_line, string'(" ( "));
  write(log_line, string'("TC-First Tag Hit = "));
  write(log_line, TCache_FirstTagHitCount);
  write(log_line, string'(" / TC-Content Hit = "));
  write(log_line, TCache_OtherHitCount);
  write(log_line, string'(" ) "));
  writeline(log, log_line);
  write(log_line, string'("Percent of TC Hit = "));
  write(log_line, real((TCache_FirstTagHitCount +
        TCache_OtherHitCount)*100)/real(CacheAccessCount), digits => 2);
  write(log_line, string'(" ( "));
  write(log_line, string'("TC-First Tag Hit = "));
  write(log_line, real(TCache_FirstTagHitCount*100)/real(CacheAccessCount),
        digits => 2);
  write(log_line, string'(" / TC-Content Hit = "));
  write(log_line, real(TCache_OtherHitCount*100)/real(CacheAccessCount), digits => 2);
  write(log_line, string'(" ) "));
  writeline(log, log_line);
  writeline(log, log_line);
  write(log_line, string'("Total Trace Cache Miss = "));
  write(log_line, TraceMissCount);
  write(log_line, string'(" ( "));
  write(log_line, string'("TC - Compulsary Miss = "));
  write(log_line, CompMissCount);
  write(log_line, string'(" / TC - Conflict Miss = "));
  write(log_line, ContMissCount);
  write(log_line, string'(" ) "));
  writeline(log, log_line);
  write(log_line, string'("Percent of ( "));
  write(log_line, string'("Compulsary Miss = "));
  write(log_line, real(CompMissCount*100)/real(TraceMissCount), digits => 2);
  write(log_line, string'(" / Conflict Miss = "));
  write(log_line, real(ContMissCount*100)/real(TraceMissCount), digits => 2);
  write(log_line, string'(" ) "));
  writeline(log, log_line);
  writeline(log, log_line);
  -- Information Table
  write(log_line, string'("------ ------------------"));
  writeline(log, log_line);
  write(log_line, string'("Information Collected From Individual Trace Cache Line"));
  writeline(log, log_line);
  write(log_line, string'("------------- -----------"));
  writeline(log, log_line);
  writeline(log, log_line);
  write(log_line, string'("Line Comp-Miss Conf-Miss TC-Write TC-O write TC-Size TC-Hit FTag-Hit Cont-Hit")); writeline(log, log_line);
  write(log_line, string'("----------------------------------")); writeline(log, log_line);
  for index in 0 to cTC_Entry loop
    write(log_line, index, justified =>
    write(log_line, string'(" "));
    write(log_line, CompMissLineCount(index), justified => right, field => 10);
    write(log_line, string'(" "));
    write(log_line, ContMissLineCount(index), justified => right, field => 10);
    write(log_line, string'(" "));
    write(log_line, TC_TraceWrite(index), justified => right, field => 10);
    write(log_line, string'(" "));
    write(log_line, TC_TraceOverWrite(index), justified => right, field => 10);
    write(log_line, string'(" "));
    write(log_line, TC_LongestTrace(index), justified => right, field => 10);
    write(log_line, string'(" "));
    write(log_line, FirstTagHitCount(index)+ContentHitCount(index), justified => right, field => 10);
    write(log_line, string'(" "));
    write(log_line, FirstTagHitCount(index), justified => right, field => 10);
    write(log_line, string'(" "));
    write(log_line, ContentHitCount(index), justified => right, field => 10);
    write(log_line, string'(" "));
    writeline(log, log_line);
    -- Calculate the sum of Cache Space Usage
    CacheSpaceUsage := CacheSpaceUsage + TC_LongestTrace(index);
  end loop;
  writeline(log, log_line);
  write(log_line, string'("---")); writeline(log, log_line);
  write(log_line, string'("Cache Space Usage")); writeline(log, log_line);
  write(log_line, string'("---")); writeline(log, log_line);
  write(log_line, string'("Space Usage = "));
  write(log_line, real(CacheSpaceUsage*100/((cTC_Entry+1)*(cInstrSlot+1))), digits => 2);
  write(log_line, string'(" %"));
  writeline(log, log_line);
end if;
-- Counting Mechanism --
if IF_InstrCounterRegWrite = '1' then
  if IC_Hit = '1' then
    ICache_HitCount := ICache_HitCount+1;
  end if;
  if TC_FirstTagHit = '1' then
    TCache_FirstTagHitCount := TCache_FirstTagHitCount+1;
  end if;
  if TC_OtherHit = '1' then
    TCache_OtherHitCount := TCache_OtherHitCount+1;
  end if;
end if;
if IC_FetchRequest = '1' then
  MemFetch_Count := MemFetch_Count+1;
end if;
if IF_InstrCounterRegWrite = '1' then
  if ( IF_StageA_Write = '1' and IF_StageB_Write = '1' ) then
    if ( IF_InstrCounterReg = IF_InstrAddrRegA_Input ) and




        ( IF_InstrCounterReg /= IF_InstrAddrRegA_Input ) and
        ( IF_InstrCounterReg = IF_InstrAddrRegB_Input ) ) or
        ( IF_InstrCounterReg = IF_InstrAddrRegA_Input ) and
        ( IF_InstrCounterReg = IF_InstrAddrRegB_Input ) ) then
      Instr_Count := Instr_Count+1;
    end if;
  elsif ( IF_StageA_Write = '1' and IF_StageB_Write = '0




if ( CU_ComitInstrA = '1' and CU_ComitInstrB = '1' ) then
  Comit_Count := Comit_Count+2;
elsif ( CU_ComitInstrA = '1' and CU_ComitInstrB = '0' ) or
      ( CU_ComitInstrA = '0' and CU_ComitInstrB = '1' ) then






This section defines the types, subtypes, and constants used in the trace cache.
-- For Trace Cache Model
type TypeArrayInstr is array (natural range<>, natural range<>) of unsigned(31 downto 0);
-- TC4: cInstrSlot = 3
-- TC8: cInstrSlot = 7
constant cInstrRow : integer := 3;
constant cInstrSlot : integer := 3;
-- These constants have to be incremented by 1 to get the actual amount,
-- since they are mainly used for counters that start at 0 instead of 1.
constant cTC_Entry : integer := 3; -- <--- Lines of trace cache
subtype TypeRow is integer range 0 to cInstrRow;
subtype TypeSlot is integer range 0 to cInstrSlot;
subtype TypeSlotCount is integer range 0 to cInstrSlot+1;
type TypeArraySlot is array (natural range<>) of TypeSlot;
type TypeArraySlotCount is array (natural range<>) of TypeSlotCount;
type TypeArrayWriteCount is array (natural range<>) of integer;
type TypeArrayTag is array (natural range<>) of unsigned(31 downto 2);
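
The zero-based convention of these constants can be made explicit with a small helper package. The following is a minimal sketch, not part of the thesis sources: the package name tc_size_sketch, the derived constants, and the function SpaceUsagePercent are hypothetical names introduced only for illustration. It assumes the TC4 configuration shown above and mirrors the space-usage calculation in the log code, which divides by (cTC_Entry+1)*(cInstrSlot+1).

```vhdl
-- Hypothetical package, for illustration only; not part of the thesis sources.
package tc_size_sketch is
  -- Zero-based upper bounds, as in the listing above (TC4 configuration).
  constant cInstrSlot : integer := 3;  -- highest slot index within a line
  constant cTC_Entry  : integer := 3;  -- highest line index in the trace cache

  -- The actual amounts are the bounds plus one.
  constant cSlotsPerLine : integer := cInstrSlot + 1;
  constant cLineCount    : integer := cTC_Entry + 1;
  constant cCapacity     : integer := cLineCount * cSlotsPerLine;

  -- Percentage of all instruction slots that are occupied.
  function SpaceUsagePercent(UsedSlots : natural) return real;
end package tc_size_sketch;

package body tc_size_sketch is
  function SpaceUsagePercent(UsedSlots : natural) return real is
  begin
    return real(UsedSlots * 100) / real(cCapacity);
  end function SpaceUsagePercent;
end package body tc_size_sketch;
```

With both bounds set to 3, the cache holds 4 lines of 4 instruction slots each, that is, 16 slots in total.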
The additional functions used in the trace cache are declared next. Function
IsBranch checks whether an instruction is any kind of branch instruction, and
function IsDelimiterInstr checks whether an instruction is a jump, trap, or
rfe.
-- Functions for TraceCache Model
function IsBranch( Instruction : TypeWord ) return bit;
function IsDelimiterInstr( Instruction : TypeWord ) return bit;
The bodies of these functions are as follows:
-- Functions for TraceCache Model
function IsBranch( Instruction : TypeWord ) return bit is
  alias InstructionOpcode : TypeDlxOpcode is Instruction( 31 downto 26 );




    when cOpcode_beqz => Result := '1';
    when cOpcode_bnez => Result := '1';
    -- not a branch





function IsDelimiterInstr( Instruction : TypeWord ) return bit is
  alias InstructionOpcode : TypeDlxOpcode is Instruction( 31 downto 26 );




    when cOpcode_j => Result := '1';
    when cOpcode_jr => Result := '1';
    when cOpcode_jal => Result := '1';
    when cOpcode_jalr => Result := '1';
    -- trap
    when cOpcode_trap => Result := '1';
    -- rfe
    when cOpcode_rfe => Result := '1';
    -- other instructions
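
Since the listing above shows only the individual choices of each function, the following self-contained package is a minimal sketch of how two such predicates could be written in full. It is not the thesis code: the package name, the subtype names, and the function names ending in Sketch are hypothetical, and the opcode values are an assumption based on the commonly published DLX encoding. They should be checked against the opcode constants of the DLX model before use.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical package, for illustration only; not part of the thesis sources.
package tc_delimiter_sketch is
  subtype TypeWordSketch   is unsigned(31 downto 0);
  subtype TypeOpcodeSketch is unsigned(5 downto 0);

  -- Assumed opcode encodings (verify against the model's own constants).
  constant cOp_j    : TypeOpcodeSketch := "000010";
  constant cOp_jal  : TypeOpcodeSketch := "000011";
  constant cOp_beqz : TypeOpcodeSketch := "000100";
  constant cOp_bnez : TypeOpcodeSketch := "000101";
  constant cOp_rfe  : TypeOpcodeSketch := "010000";
  constant cOp_trap : TypeOpcodeSketch := "010001";
  constant cOp_jr   : TypeOpcodeSketch := "010010";
  constant cOp_jalr : TypeOpcodeSketch := "010011";

  function IsBranchSketch(Instruction : TypeWordSketch) return bit;
  function IsDelimiterSketch(Instruction : TypeWordSketch) return bit;
end package tc_delimiter_sketch;

package body tc_delimiter_sketch is
  -- Returns '1' for the conditional branches beqz and bnez.
  function IsBranchSketch(Instruction : TypeWordSketch) return bit is
    variable Opcode : TypeOpcodeSketch;
    variable Result : bit := '0';
  begin
    Opcode := Instruction(31 downto 26);
    if Opcode = cOp_beqz or Opcode = cOp_bnez then
      Result := '1';
    end if;
    return Result;
  end function IsBranchSketch;

  -- Returns '1' for jumps (j, jr, jal, jalr), trap, and rfe.
  function IsDelimiterSketch(Instruction : TypeWordSketch) return bit is
    variable Opcode : TypeOpcodeSketch;
    variable Result : bit := '0';
  begin
    Opcode := Instruction(31 downto 26);
    if Opcode = cOp_j or Opcode = cOp_jr or
       Opcode = cOp_jal or Opcode = cOp_jalr or
       Opcode = cOp_trap or Opcode = cOp_rfe then
      Result := '1';
    end if;
    return Result;
  end function IsDelimiterSketch;
end package body tc_delimiter_sketch;
```

The sketch uses if statements instead of a case statement so that it does not depend on the opcode constants being usable as case choices; every instruction that is not matched falls through with the default result '0'.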





This file has been modified to increase the memory capacity from 16 Kbyte to
32 Kbyte in order to run Permute and DCT. Therefore, these two lines are changed.
constant cMemorySize : positive := 32768;
constant cHighAddress_unsigned : unsigned := X"0000_7FFF";
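
The two values are linked: the highest valid byte address is the memory size minus one. The following minimal sketch derives the address from the size so that the two cannot drift apart. The package name and the constants ending in Sketch are hypothetical and only illustrate the arithmetic; the model itself declares the two constants separately, as shown above.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical package, for illustration only; not part of the thesis sources.
package mem_size_sketch is
  -- 32 Kbyte = 32 * 1024 = 32768 bytes.
  constant cMemorySizeSketch : positive := 32 * 1024;

  -- Highest valid byte address: 32768 - 1 = 32767 = X"0000_7FFF".
  constant cHighAddressSketch : unsigned(31 downto 0) :=
    to_unsigned(cMemorySizeSketch - 1, 32);
end package mem_size_sketch;
```

By the same arithmetic, a 16 Kbyte memory (16384 bytes) ends at address X"0000_3FFF".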





Appendix C
Excerpts from log files of DCT























































Appendix D
Runtime Startup Code and Perl Script Listings











; Starting point for simulations: loads r29 with memsize and calls main with
; argc and argv
lhi r29, (((memsize-8)>>16)&0xffff)




add r1, r0, r0
lhi r1, ((argc>>16)&0xffff)
addui r1, r1, (argc&0xffff)
lw r2, (r1)
sw (r29), r2




























r1, ((argv>>16)&0xffff)
r1, r1, (argv&0xffff)
r2, (r1)
4(r29), r2
r1, r0, r0
r1, ((_environ>>16)&0xffff)
r1, r1, (_environ&0xffff)
r2, (r1)
8(r29), r2
main
r29, r29, #16
exit
Perl Script (filter.pl)
#!/usr/local/bin/perl
eval 'exec /usr/local/bin/perl -S $0 ${1+"$@"}'
    if $running_under_some_shell;
        # this emulates #! processing on NIH machines.
        # (remove #! line above if indigestible)
eval '$'.$1.'$2;' while $ARGV[0] =~ /^([A-Za-z_0-9]+=)(.*)/ && shift;
        # process any FOO=bar switches
$[ = 1;         # set array base to 1
$, = ' ';       # set output field separator
$\ = "\n";      # set output record separator
%regmap = ();
line: while (<>) {
    chop;    # strip record separator
    @Fld = split(' ', $_, 9999);
    if ($Fld[1] eq 'movi2fp') {
        print ";;; " . $_;
        @parlist = split(',', $Fld[2], 9999);    #should strip ; first
        $regmap{$parlist[1]} = $parlist[2];
        next;
    }
    if ($Fld[1] =~ /mult/ || $Fld[1] =~ /div/ || $Fld[1] =~ /multu/ ||
        $Fld[1] =~ /div/) {
        print ";;; " . $_;
        @parlist = split(',', $Fld[2], 9999);    #should strip ; first
        $operation = $Fld[1];
        $oprnd1 = $parlist[1];
        $oprnd2 = 'f' . substr($regmap{$parlist[2]}, 2, 999999);
        $oprnd3 = 'f' . substr($regmap{$parlist[3]}, 2, 999999);
        next;
    }
    if ($Fld[1] eq 'movfp2i') {
        print ";;; " . $_;
        @parlist = split(',', $Fld[2], 9999);    #should strip ; first
        if ($parlist[2] ne $oprnd1) {    #???




            print "\t" . "nop";
            print "\t" . "nop";
            print "\t" . $operation . 'f' . substr($parlist[1], 2,







