Mississippi State University

Scholars Junction
Theses and Dissertations

5-8-2004

A Study for Reducing Conflict Misses in Data Cache
Rami J. Ammari

Follow this and additional works at: https://scholarsjunction.msstate.edu/td

Recommended Citation
Ammari, Rami J., "A Study for Reducing Conflict Misses in Data Cache" (2004). Theses and Dissertations.
268.
https://scholarsjunction.msstate.edu/td/268

This Graduate Thesis - Open Access is brought to you for free and open access by the Theses and Dissertations at
Scholars Junction. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of
Scholars Junction. For more information, please contact scholcomm@msstate.libanswers.com.

A STUDY FOR REDUCING CONFLICT MISSES IN DATA CACHE

Rami J. Ammari

A Thesis
Submitted to the Faculty of
Mississippi State University
in Partial Fulfillment of the Requirements
for the Degree of Master of Science
in Computer Engineering
in the Department of Electrical and Computer Engineering
Mississippi State, Mississippi
May 2004

Copyright by
Rami J. Ammari
2004

A STUDY FOR REDUCING CONFLICT MISSES IN DATA CACHE

By
Rami J. Ammari

Approved:

Yul Chu
Assistant Professor of Electrical and
Computer Engineering
(Director of Thesis)

Nicolas H. Younan
Professor of Electrical and Computer
Engineering
(Graduate Program Director)

Bob Reese
Associate Professor of Electrical and
Computer Engineering
(Committee Member)

A. Wayne Bennett
Dean of the College of Engineering

Edward Luke
Assistant Professor of Computer Science
(Committee Member)

Name: Rami J. Ammari
Date of Degree: May 8, 2004
Institution: Mississippi State University
Major Field: Computer Engineering
Major Professor: Dr. Yul Chu
Title of Study: A STUDY FOR REDUCING CONFLICT MISSES IN DATA CACHE
Pages in Study: 63
Candidate for Degree of Master of Science

During the last two decades, CPU performance has improved much faster than
memory performance. To narrow the performance gap between the CPU and
memory, cache memories are placed between the two.
In general, a cache memory is a small, fast buffer that reduces memory access time
by holding data before the CPU uses it. There are two types of cache memory:
instruction cache and data cache. In addition, the memory hierarchy (memory and cache
memories) can contain multiple cache levels (Level 1, Level 2, etc.): the Level 1
(on-chip) cache is the closest to the CPU and affects system performance directly.
In this study, we evaluated two factors in designing an efficient Level 1 data
cache: the distance between two data in an array, and multiple xor mapping
functions in a bank.
We designed a data cache called SLDC (Store/Load Dependent Cache, Two-way)
to implement the first factor. This cache uses the distance between the data addresses of
two data-transfer instructions (load and store). It groups close data together and
places each group in the same bank. The cache we designed for the second factor is called
Multi-XOR (MXOR). MXOR splits the cache virtually into several zones (2 to 6
areas); a different xor mapping function is used in each zone to index data (for better
cache utilization).
In this study, we used the SimpleScalar simulator to implement the data
caches and evaluated them with SPEC2000FP benchmark programs. Based on the
experimental results, we recommend considering those factors when designing an
efficient cache memory, since SLDC and MXOR show some improvement (5 to 10%)
compared to a conventional two-way set-associative cache memory.

DEDICATION
I would like to dedicate this research to my parents Jihad and Etaf, my brothers Raed and
Ramzi, and my sister Rita.


ACKNOWLEDGEMENTS

I sincerely express my gratitude to my major professor, Dr. Yul Chu, for his
guidance and support for the successful completion of this research work. I would like to
thank Dr. Edward Luke for his invaluable advice for completion of this thesis and serving
as a thesis committee member. I would also like to thank Dr. Bob Reese for his
willingness to be on the thesis committee and for providing useful knowledge during the
graduate program. I would like to acknowledge the Research Initiate Program (RIP) of
Mississippi State University for funding this research work.


TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES

CHAPTER

I. INTRODUCTION
   1.1. Cache Memory
        1.1.1. Principle of localities
        1.1.2. Cache Memory Organization
        1.1.3. Cache Memory Read
        1.1.4. Cache Memory Write
   1.2. Cache Memory Performance
        1.2.1. Changing Cache Memory Parameters
        1.2.2. Splitting Cache Memory into Instruction and Data
        1.2.3. Multi-level Caches
        1.2.4. Victim Cache
        1.2.5. Write Buffer
   1.3. Motivation for this Research

II. SIMPLESCALAR TOOLSET
   2.1. Introduction to SimpleScalar
   2.2. Running SimpleScalar
   2.3. Coding SimpleScalar
   2.4. Customizing SimpleScalar

III. RELATED WORK
   3.1. The Two-Way Skewed Cache
   3.2. Thrashing-Avoidance Cache (TAC)
        3.2.1. Basic Operation of TAC
        3.2.2. Performance of TAC

IV. THE STORE/LOAD DEPENDENT DATA CACHE
   4.1. Basics of SLDC
   4.2. Architecture of SLDC
   4.3. Operation of SLDC
   4.4. Simulation Results and Analysis

V. THE MultiXOR CACHE
   5.1. Case Study
   5.2. Basics of MXOR cache
        5.2.1. An example of xor mapping functions
        5.2.2. Replacement policy used
        5.2.3. Method of Implementation
   5.3. Simulation Results and Analysis

VI. CONCLUSION AND FUTURE WORK
   6.1. Summary
   6.2. Future Work

REFERENCES

LIST OF TABLES

TABLE
2.1  SimpleScalar Different Simulators Information
4.1  Spec2000 Benchmark Programs Used in SLDC Simulation

LIST OF FIGURES

FIGURE
1.1  Memory Hierarchy
1.2  Memory Hierarchy with Multi-Level Cache
1.3  Mapping Simple Memory Address to Different Cache Organizations
1.4  Multi-Level Cache Hierarchy
1.5  Victim Cache Memory Hierarchy
1.6  Write Buffer in Memory Hierarchy
2.1  SimpleScalar Software Architecture
3.1  An example of Splitting the Address in Skewed Cache
3.2  Data Conflicting in Bank0 but not in Bank1 in A Two-Way Skewed Cache
3.3  Grouping Instructions Separated by a Call Instruction
3.4  The Operation of BoPLRU
3.5  Comparison between Conventional Cache Schemes, the Two-Way Skewed Cache and the TAC
4.1  A case where the Distance Between Data Addresses is within the Limit of a Group
4.2  A case where the Distance Between Data Addresses is not within the Limit of a Group
4.3  Hardware Implementation of SLDC
4.4  The Operation of the Registers in SLDC
4.5  Algorithms for SLDC
4.6  SLDC operation
4.7  Simulation Results for SLDC
5.1  In Two-Way Skewed Cache, Two Sets Having A Set Conflict After The Xoring
5.2  A Possible Slicing For Two-Way MXOR With Four Xor Functions Applied In Each Zone
5.3  Possible Address Mapping For Two Sets In Bank0 Using Four Xor Mapping Functions
5.4  Results for MXOR Cache Using Two Functions
5.5  Results for MXOR Cache Using Four Functions
5.6  Results for MXOR Cache Using Six Functions
5.7  The Number of Conflict Misses in MXOR and Two-Way Set Associative for Quake
5.8  Results for MXOR Cache When Simulated With Quake Benchmark

CHAPTER I
INTRODUCTION
One of the main bottlenecks in computer architecture is the speed gap between
the CPU and memory. CPU speed is increasing at a faster rate than memory speed, so a
great deal of time is wasted waiting for memory to respond. This problem is referred to as
the processor-memory performance gap [1,2].
Having the memory unit off-chip contributes to the overall latency of the system,
because data propagating between the CPU and memory through an external data path
slows down the traffic between these two units.
Computer architects use a combination of fast and expensive memory units (static
random access memory-SRAM) and cheaper but slower memory units (dynamic random
access memory-DRAM), forming what is called the memory hierarchy.
The memory hierarchy narrows the gap between the CPU and memory,
thereby increasing the performance of the system. In this organization, the CPU
communicates directly with the fast memory (SRAM), which is organized as a cache
memory that may itself be split into more than one level. The cache memory in turn
communicates with the regular, slower memory (DRAM), fetching more than the
requested data based on the cache policies. Finally, memory fetches data from a
permanent storage device such as the hard disk. Figure 1.1 shows this typical memory
hierarchy.

[Figure 1.1 diagram: CPU (with CPU registers), cache memory (SRAM), main memory
(DRAM), and external/internal storage device, from top to bottom.]

Figure 1.1: Memory Hierarchy.

Performance increased tremendously after cache memory was introduced [3,4] in
1982. Since then, CPU speed has increased at a faster rate than memory speed, and
cache memories have been used to keep up with data access demands. New cache techniques
have been proposed, such as multi-level cache architectures, which help reduce memory
access time. In addition, splitting the cache into instruction and data sub-caches has
greatly benefited this field, giving designers more options and flexibility.
In a multi-level cache, the first level can be on-chip and so provides high-speed
access to the stored data. The second level can be off-chip, providing a slower access
time than the first level but a much faster access time than main memory. Figure 1.2
shows the multi-level cache architecture.

[Figure 1.2 diagram: CPU with L1 on-chip cache, L2 off-chip cache (SRAM), main memory
(DRAM), and external/internal storage device, from top to bottom.]

Figure 1.2: Memory hierarchy with multi-level cache

These days, huge databases shared on the Internet and on intranets require fast
data cache architectures. Network processors also depend on data caches to store and
retrieve the lookup tables used in routing. Conventional cache schemes have been successful
in obtaining a good access time, but those architectures are not always successful in
utilizing the cache and dispersing data across it properly, which creates the need for
new designs.
This chapter presents cache concepts such as temporal and spatial locality, write
policies, replacement policies, mapping functions and conventional cache schemes.


1.1. Cache Memory
1.1.1 Principle of localities
A cache is based on the principle of locality, which is divided into spatial and
temporal locality.
1. Spatial locality: During the lifetime of a program, given that a block in
memory is being accessed, there is a high probability that adjacent
blocks will be accessed in the near future.
2. Temporal locality: During the lifetime of a program, given that a block in
memory is being accessed, there is a high probability that the same block
will be accessed again in the near future.
A cache hit occurs when the requested data is found in the cache, while a cache
miss occurs when the data is not found. In the case of a cache miss, the missing data is
fetched from memory into the cache for future accesses, exploiting both temporal and
spatial locality.

1.1.2 Cache Memory Organization
Organization of the cache specifies the way lines are split and organized within
the cache. Based on the way memory addresses are mapped, there are three basic cache
organizations:

1. Direct mapped: the simplest organization, where each memory reference is
mapped to exactly one possible location in the cache.
2. Set associative: uses several direct-mapped caches (banks), so that any memory
reference can be mapped to more than one possible location in the cache,
one in each of those banks; the number of banks is the associativity.
This organization is more complicated but gives much better performance
than a direct-mapped cache.
3. Fully associative: the most complicated scheme, where a memory reference
can be mapped to any line in the cache. This requires more time searching
for data, but gives better performance than both set-associative and
direct-mapped caches in terms of cache miss ratio.
Figure 1.3 shows an example of simple memory address mapped to different
cache organizations.

[Figure 1.3 diagram: the 8-bit address 10101110 mapped to three cache organizations.
Direct-mapped cache (lines indexed 000 to 111): address split as tag 1010, index 111,
block offset 0.
2-way set-associative cache (sets indexed 00 to 11, two tag/data pairs per set): address
split as tag 10101, index 11, block offset 0.
Fully associative cache (a tag/data pair per line, no index): address split as
tag 1010111, block offset 0.]

Figure 1.3: Mapping a simple memory address to different cache organizations.

The memory address is split into three portions, starting from the right:
• Block offset: specifies which word in the cache line should be selected for
reading or writing. The number of bits depends on the block size of the
cache:
Number of block offset bits (b) = log2 (Block Size).
• Index: comes after the block offset, moving left, and specifies
which line in the cache the data should be read from or written to. The number of
bits is determined by the cache size, block size and associativity according to
this relationship:
Number of index bits (l) = log2 [Cache Size (B) / (Associativity * Block Size (B))].
There is no index in a fully associative cache, since data can be placed in any
block of the cache, which means that the tags of all blocks must be searched
every time the cache is accessed.
• Tag: the part of the address that is left after removing the block offset and index
bits, so:
Number of tag bits = number of address bits - (l + b).
The tag is used for validation purposes: when data is stored in the cache, the tag is
stored on the same line. When it is time to read or write again, the stored tag is
compared with the tag part of the requested address.

Returning to the example in Figure 1.3, assume a block size of 2B and a cache
size of 16B, so only one bit of the address is needed as the block offset. In the
direct-mapped case, since the cache size is 16B, log2 [16/(1*2)] = 3 bits are needed for
the index. The rest of the address is the tag.
In the two-way set-associative case, with the same number of bits for the block
offset, log2 [16/(2*2)] = 2 bits are needed for the index to map four sets. For the fully
associative cache, there is no need for an index; only the block offset is needed and the
rest becomes the tag.
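
The bit counts in the Figure 1.3 example can be checked with a short sketch. The function name split_address and its parameters are illustrative choices, not part of the thesis or of SimpleScalar; the sketch assumes power-of-two sizes and treats a fully associative cache as one set whose associativity equals the number of blocks.

```python
def split_address(addr, cache_size, block_size, ways):
    """Split a byte address into (tag, index, block offset).

    Assumes cache_size, block_size and ways are powers of two.
    A fully associative cache is modeled as ways == cache_size // block_size,
    which leaves a single set and therefore zero index bits.
    """
    offset_bits = block_size.bit_length() - 1        # b = log2(block size)
    num_sets = cache_size // (ways * block_size)
    index_bits = num_sets.bit_length() - 1           # l = log2(number of sets)

    offset = addr & (block_size - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# The address from Figure 1.3, with a 16B cache and 2B blocks.
addr = 0b10101110
for name, ways in (("direct-mapped", 1), ("two-way", 2), ("fully associative", 8)):
    tag, index, offset = split_address(addr, cache_size=16, block_size=2, ways=ways)
    print(f"{name}: tag={tag:b} index={index:b} offset={offset:b}")
```

For the fully associative case the index is always 0 because there are no index bits; only the tag and block offset carry information.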

1.1.3 Cache memory read
During this process, the address requested by the CPU is checked against the one
stored in the cache. First, the index part of the address is mapped to the corresponding
cache set; then the tag part is compared with the tag stored in the cache for that specific
block. Another bit, called the valid bit “V”, is also stored: on a tag match, if the valid
bit is set, then the stored data is valid and can be read into the CPU safely.
In a direct-mapped organization, only one block is selected to have its stored tag
compared. For set-associative caches, all the banks are searched for a tag match, so the
associativity determines how many stored tags are compared in parallel with the
requested address tag.
Finally, if there is a cache hit, after a tag match is found in the block, the block
offset selects the word that should be read into the CPU. If there is a cache miss, the
block must be updated with the correct data. The requested data is fetched from
memory and placed in the cache, and both the valid bit and the tag are updated. In the
case of a set-associative or fully associative cache, there is more than one possible
candidate for block replacement. Replacement policies decide which one is replaced. The
best known [2] are:
1. Random replacement: the new block replaces a randomly
selected block in the set. This implementation is very simple but shows
average performance.
2. Least Recently Used (LRU): the least recently used block (in terms of reads
or writes) is replaced with the new one. LRU is more expensive
to implement since it is more complicated than random replacement, but it
shows much better performance.
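
The read procedure and LRU replacement described above can be sketched as a small model. This is a minimal illustration, not the SimpleScalar implementation: the class name SetAssociativeCache and its fields are hypothetical, only tags are tracked (no data), and the presence of a tag in a set stands in for a valid bit that is set.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Tag-only model of a set-associative cache with LRU replacement."""

    def __init__(self, num_sets, ways, block_size):
        self.num_sets = num_sets
        self.ways = ways
        self.block_size = block_size
        # One OrderedDict per set; keys are tags, ordered from least to
        # most recently used. A tag being present means the line is valid.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr):
        """Look up a byte address; return True on a hit, False on a miss."""
        block = addr // self.block_size
        index = block % self.num_sets
        tag = block // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)        # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) == self.ways:
            lines.popitem(last=False)     # evict the least recently used tag
        lines[tag] = True                 # fill the line with the new block
        return False


# Two addresses that collide in a 16B direct-mapped cache with 2B blocks.
trace = [0, 16] * 4
direct = SetAssociativeCache(num_sets=8, ways=1, block_size=2)
two_way = SetAssociativeCache(num_sets=4, ways=2, block_size=2)
for a in trace:
    direct.access(a)
    two_way.access(a)
print("direct-mapped misses:", direct.misses)
print("two-way misses:", two_way.misses)
```

With ways=1 the same class behaves as a direct-mapped cache, so the two organizations can be compared on the same trace. The alternating trace is chosen so that both addresses fall in the same set, which is the conflict-miss situation this study is concerned with.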

1.1.4 Cache memory write
There are two types of write-hit policies, write-through and write-back. In
write-through, the data is written into the cache and the corresponding data is also
updated in the lower level of the memory hierarchy. In write-back, the data is written
only into the cache; when a block with updated data is selected for replacement, the
data is written to main memory before the block is replaced.
Write-through is a slower mechanism than write-back. On the other hand, some
difficulties may arise when using write-back due to inconsistencies between main
memory and cache contents. This issue is referred to as the “cache coherence” problem.
There are two write-miss policies as well, write-allocate and write-no-allocate. In
write-allocate, the block is brought from main memory into the cache, followed by
the write procedure to update the block. This policy is usually used with write-back
because subsequent writes will be captured by the cache. In write-no-allocate, the block
is updated in main memory directly, without being brought into the cache. This policy is
usually used with write-through because the writes have to go to main memory anyway.
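
The difference in memory traffic between the two usual pairings can be illustrated with a deliberately tiny model. This is a sketch under strong assumptions: the cache holds a single block, only stores are modeled, and any dirty block still in the cache at the end is flushed so the two policies are compared fairly. The function name memory_writes is hypothetical.

```python
def memory_writes(store_blocks, policy):
    """Count main-memory writes for a sequence of stores to block numbers.

    policy is "write-through" (paired with write-no-allocate) or
    "write-back" (paired with write-allocate). The cache is assumed to
    hold exactly one block.
    """
    if policy not in ("write-through", "write-back"):
        raise ValueError("unknown policy: " + policy)

    cached = None     # block currently held by the single cache line
    dirty = False     # True if the cached block differs from memory
    writes = 0
    for block in store_blocks:
        if policy == "write-through":
            writes += 1               # every store also goes to memory
        else:
            if block != cached:
                if dirty:
                    writes += 1       # write the evicted dirty block back
                cached = block        # write-allocate: bring the block in
            dirty = True              # the store only updates the cache
    if policy == "write-back" and dirty:
        writes += 1                   # final flush of the last dirty block
    return writes


stores = [5, 5, 5, 5, 9, 9]
print(memory_writes(stores, "write-through"))
print(memory_writes(stores, "write-back"))
```

Repeated stores to the same block are absorbed by the cache under write-back, which is why it is usually paired with write-allocate.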

1.2. Cache memory performance
Caches are faster than main memory, so finding the data in the cache speeds up an
application’s execution. On the other hand, a miss costs the system more time,
since both the cache and memory have to be accessed. Thus, the miss rate is a primary
measure of cache performance.
Improving the miss rate is the main issue in cache design. Average memory access
time (AMAT) is given by the following equation [5]:
AMAT = Hit time + (Miss rate * Miss penalty).
Therefore, to optimize cache performance, AMAT should be minimized. Reducing any of
its three factors (hit time, miss rate and miss penalty) contributes to improving the
overall performance of the cache. The following subsections introduce cache design
techniques that reduce those factors, along with the associated tradeoffs.

1.2.1 Changing cache memory parameters
Parameters of the cache like block size, cache size, mapping functions and
associativity can be changed to improve the performance of the cache. There are many
tradeoffs while designing a cache scheme involving those parameters. Therefore, a
balance should be achieved for optimized performance.
For example, increasing the block size reduces the miss rate by making use of spatial
locality. On the other hand, bigger block sizes increase the miss penalty,
because on a miss, larger blocks are fetched into the cache. Increasing cache size or
associativity decreases the miss rate as well, but increases the access time
due to the time needed to search a larger storage area and compare more tags. Different
mapping functions can be applied to reduce conflict misses, but their success comes at
the cost of a more complicated implementation.

1.2.2 Splitting cache memory into instruction and data
Splitting the cache into two separate caches, one for instructions and one for data,
has a great impact on cache bandwidth and access capability. On the other hand, the
miss rate of a split cache is higher than that of a unified cache, while the AMAT of a
split cache is lower than that of a unified cache.
In modern cache designs, separate instruction and data caches are implemented
because of the flexibility and better bandwidth they offer. Bigger cache sizes and new
mapping functions can offset the higher miss rate associated with the split cache
technique.

1.2.3 Multi-level caches
Multi-level caches can reduce the miss rate substantially if their sizes are tuned
carefully [5]. Figure 1.4 shows a multi-level cache hierarchy.

[Figure 1.4 diagram: CPU registers, L1 cache, L2 cache, and main memory, in order of
increasing distance from the CPU.]

Figure 1.4: Multi-level cache hierarchy.

In this hierarchy, faster caches are closer to the CPU while slower ones are closer
to main memory. The problem in these organizations is the large miss penalty, especially
if the data is not found in any level of the cache and has to be looked up in main
memory.
The following example shows the benefit of adding a second-level cache to a
direct-mapped architecture [5]:
For a direct-mapped cache with the following characteristics: 95% hit rate, 4 ns hit time
and 100 ns miss penalty, the AMAT would be
AMAT = Hit Time + Miss Rate * Miss Penalty
     = 4 + 0.05 * 100 = 9 ns
whereas the new AMAT after adding an L2 cache with a 20 ns hit time and a 50% hit rate
would be
AMAT = Hit Time_L1 + Miss Rate_L1 * (Hit Time_L2 + Miss Rate_L2 * Miss Penalty_L2)
     = 4 + 0.05 * (20 + 0.5 * 100) = 7.5 ns
The average memory access time with a two-level cache is less than with only one level.
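
The two AMAT calculations above can be reproduced with a few lines of code. The function names amat and amat_two_level are illustrative; the second simply treats the L2 access time as the miss penalty of L1, as in the equation above.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time for a single cache level (times in ns)."""
    return hit_time + miss_rate * miss_penalty


def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, l2_miss_penalty):
    """AMAT when an L1 miss is served by an L2 cache before main memory."""
    l1_miss_penalty = amat(l2_hit, l2_miss_rate, l2_miss_penalty)
    return amat(l1_hit, l1_miss_rate, l1_miss_penalty)


# Values from the example: 95% L1 hit rate, 4 ns hit time, 100 ns memory
# penalty; L2 with 20 ns hit time and 50% hit rate.
one_level = amat(4, 0.05, 100)
two_level = amat_two_level(4, 0.05, 20, 0.5, 100)
print(f"one level:  {one_level:.1f} ns")
print(f"two levels: {two_level:.1f} ns")
```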

1.2.4 Victim cache
A more recent technique for reducing the miss rate is the victim cache. The
victim cache is a fully associative cache that sits between the main cache and
main memory. Although fully associative schemes are expensive to build, a small
victim cache does not add much complexity to the design.
Figure 1.5 shows the victim cache in the memory hierarchy. When a block is
discarded from the cache on a miss, it is kept in the victim cache. On a miss in the main
cache, and before main memory is checked, the victim cache is searched for the requested
data. If the data is found in the victim cache, that block and the corresponding block in
the main cache are swapped.

[Figure 1.5 diagram: CPU registers, main cache with an attached victim cache, and main
memory.]

Figure 1.5: Victim cache in memory hierarchy.
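
The victim cache behavior described above can be sketched as follows. This is a simplified model, not a hardware description: the main cache is assumed to be direct-mapped, whole block numbers are stored instead of tags, and the victim cache is assumed to discard its oldest entry when full (the text does not specify its replacement policy). The class name VictimCachedDM is hypothetical.

```python
from collections import OrderedDict


class VictimCachedDM:
    """Direct-mapped main cache backed by a small fully associative victim cache."""

    def __init__(self, num_lines, victim_entries):
        self.num_lines = num_lines
        self.victim_entries = victim_entries
        self.lines = [None] * num_lines    # block number per line, None = invalid
        self.victim = OrderedDict()        # evicted block numbers, oldest first
        self.main_hits = 0
        self.victim_hits = 0
        self.misses = 0

    def _put_victim(self, block):
        """Keep a block evicted from the main cache (assumed oldest-first discard)."""
        if self.victim_entries == 0:
            return
        if len(self.victim) >= self.victim_entries:
            self.victim.popitem(last=False)
        self.victim[block] = True

    def access(self, block):
        """Access a block number; return 'main', 'victim' or 'miss'."""
        index = block % self.num_lines
        resident = self.lines[index]
        if resident == block:
            self.main_hits += 1
            return "main"
        if block in self.victim:
            # Swap: the requested block moves to the main cache and the
            # block it displaces takes its place in the victim cache.
            del self.victim[block]
            outcome = "victim"
            self.victim_hits += 1
        else:
            outcome = "miss"               # would be fetched from main memory
            self.misses += 1
        self.lines[index] = block
        if resident is not None:
            self._put_victim(resident)
        return outcome


cache = VictimCachedDM(num_lines=8, victim_entries=1)
print([cache.access(b) for b in [0, 8, 0, 8, 0, 8]])
```

Blocks 0 and 8 map to the same line of the direct-mapped cache, so without the victim cache every access in this trace would miss; with a single victim entry only the first two accesses go to main memory.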

1.2.5 Write Buffer
With a write-through policy, on a write miss, much time is wasted waiting to write data
into main memory and then replace the data in the cache. A write buffer is used to
temporarily store the data coming from the cache and then write it to main memory while
the cache continues with normal processing. This technique reduces the latency caused
by the write policy when updating memory.
A problem arises with the write buffer. When there is a read miss in the cache and the
data must be brought from memory, if the corresponding data in main memory has not yet
been updated by the write buffer, then stale data is transferred. To overcome this
problem, on a miss, the write buffer is searched for the data before main memory is
accessed; if the data is found, it is sent directly to the CPU. Figure 1.6 shows a write
buffer scheme.

[Figure 1.6 diagram: CPU registers, cache with a write buffer between the cache and main
memory, and main memory.]

Figure 1.6: Write buffer in memory hierarchy.
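
The read-miss check against the write buffer can be sketched as below. This is an illustrative model only: main memory is a Python dict, the buffer drains one entry at a time in first-in, first-out order, and the class and method names (WriteBuffer, drain_one, read_miss) are hypothetical.

```python
from collections import OrderedDict


class WriteBuffer:
    """Buffer of pending writes sitting between the cache and main memory."""

    def __init__(self, memory):
        self.memory = memory              # dict: address -> value (main memory)
        self.pending = OrderedDict()      # address -> newest value not yet in memory

    def write(self, addr, value):
        """Queue a write; the cache can continue without waiting for memory."""
        self.pending[addr] = value
        self.pending.move_to_end(addr)

    def drain_one(self):
        """Retire the oldest pending write to main memory, if there is one."""
        if self.pending:
            addr, value = self.pending.popitem(last=False)
            self.memory[addr] = value

    def read_miss(self, addr):
        """Serve a cache read miss, checking the buffer before main memory."""
        if addr in self.pending:
            return self.pending[addr]     # newest value, not yet in memory
        return self.memory[addr]


memory = {0x10: 1}
buffer = WriteBuffer(memory)
buffer.write(0x10, 2)
print(memory[0x10], buffer.read_miss(0x10))   # memory is stale, buffer is current
buffer.drain_one()
print(memory[0x10], buffer.read_miss(0x10))
```

Reading memory directly between the write and the drain would return the old value, which is the stale-data problem the buffer lookup avoids.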

1.3. Motivation for this research
This research is motivated by the search for new cache schemes that make better use
of the cache and disperse data in it more efficiently.
It was shown that the thrashing-avoidance cache (TAC) [6,7] improved the miss
rate, especially with C++ based applications. TAC operates on the instruction cache, using
call instructions as a parameter to split instructions into members to be grouped in
each bank of the cache. The Store/Load dependent data cache (SLDC) proposed in this
research applies the same concept to the data cache. SLDC uses the distance between data
addresses as a parameter to split data into related members. Groups of related members
are virtually created and placed in each bank, with the aim of reducing the miss rate of
the cache.
The skewed cache proposed in [8] showed an effective way of applying xor-mapping
functions, achieving better utilization of the cache. However, the skewed cache was
not adopted in industry because of its complicated architecture and replacement
policy. The Multi-XOR cache is presented in this research as a possible substitute for the
skewed cache. Multi-XOR splits the cache into zones of operation; in each zone, a
different xor mapping function is applied. The Multi-XOR cache tries to spread data
better throughout the cache with minimum implementation complexity.
Multi-XOR is very flexible because the number of functions and the zones they
operate on can be adjusted; simpler functions can be used as well to achieve the best
miss rate possible.
This research presents those two schemes, SLDC and Multi-XOR, analyzes their
performance and reports whether they can be improved further in the future.

CHAPTER II
SIMPLESCALAR TOOLSET
In the past, most computer architecture work, including design and
implementation, was performed only at large institutions and companies where the
necessary tools could be developed. Building tools capable of measuring and evaluating
new architectures required a huge effort and a great deal of time.
These days, engineers and designers build software models of the
hardware they want to construct; for that purpose, they use regular programming
languages or hardware description languages. They then test those designs by applying
standard workloads called benchmarks to validate functionality and evaluate
performance.
The SimpleScalar toolset provides an infrastructure for simulation and
architectural modeling. Researchers use it to build modeling applications to test and
analyze hardware performance. In this research, different benchmarks were compiled and
simulated on the Sim-Cache simulator, which is part of the SimpleScalar toolset.
This simulator generates statistics about the modified or standard cache architecture,
such as cache misses, cache hits and their ratios.


2.1. Introduction to SimpleScalar
SimpleScalar is part of the “Multiscalar” project that was directed by Gurindar Sohi
[9] at the University of Wisconsin, and was written in C in 1992 by Todd
Austin. The toolset was first released to the public as open source in 1995 with the
help of Doug Burger. SimpleScalar is now freely available through the SimpleScalar
website at http://www.simplescalar.com.
In the year 2000, more than one-third of the papers published in the area of high
performance computer architecture used the SimpleScalar toolset to evaluate
their new designs [9]. This figure gives an idea of how popular the simulator has become
in the computer architecture research community since its release. SimpleScalar’s
popularity stems from its offering a complete toolset to implement a design and then
evaluate its performance by running different benchmarks on it. During the simulation,
SimpleScalar, through its different functions, dynamically measures hardware
characteristics and performance, giving accurate statistics on how close the design is to
the required characteristics.
SimpleScalar includes simulators of varying complexity; they range
from Sim-Safe, which emulates only the instruction set, to Sim-Outorder, a detailed
microarchitectural model with dynamic scheduling. Table 2.1 lists those simulator
models along with their complexity in terms of the number of lines of code and simulation
speed.

Table 2.1 SimpleScalar different simulators information

Simulator      Description                                Lines of code   Simulation speed
Sim-Safe       Simple functional simulator                320             6 MIPS
Sim-Fast       Speed-optimized functional simulator       780             7 MIPS
Sim-Profile    Dynamic program analyzer                   1,300           4 MIPS
Sim-Bpred      Branch predictor simulator                 1,200           5 MIPS
Sim-Cache      Multilevel cache memory simulator          1,400           4 MIPS
Sim-Fuzz       Random instruction generator and tester    2,300           2 MIPS
Sim-Outorder   Detailed microarchitectural timing model   3,900           0.3 MIPS

The different SimpleScalar simulators are listed as follows:
• Sim-Fast: Speed-optimized simulator that does not account for the cache, the pipeline
  or any other part of the microarchitecture.
• Sim-Safe: Checks the alignment and access permissions for all memory operations; used
  if Sim-Fast crashes with some applications.
• Sim-Cheetah: Complicated cache simulator for different cache levels.
• Sim-Cache: Can include up to two levels of cache; shows the statistics of both cache
  and TLB configurations for both data and instruction cache. This simulator has been
  modified to integrate the proposed cache schemes in this research.
• Sim-Profile: More detailed simulator that produces profiles on addresses, memory
  accesses, branches and data symbols.
• Sim-Outorder: Very detailed simulator, which can produce statistics for an
  out-of-order superscalar processor with two-level cache memory and main memory.


Figure 2.1 [9] shows the hardware model’s software architecture for SimpleScalar.
Execution–driven simulation is used during the simulation process; this simulation
requires an instruction set emulator and I/O emulation module.

[Figure: block diagram with the labels User Programs, Prog/Sim Interface, Program Binary,
Target ISA, I/O Interface, Functional Core (Target ISA emulator, I/O emulator),
Performance Core (Bpred, Resource, Cache, Loader, Regs, Memory, Stats, Dlite!, Simulator
Core), Host Interface and Host Platform.]
Figure 2.1: SimpleScalar software architecture

The simulator code defines the model’s organization and instrumentation, while the
I/O emulation module handles the input and output of the simulated program. Finally,
the instruction set emulator interprets the instructions, as shown in Figure 2.1.

2.2. Running SimpleScalar
The following sample shows how SimpleScalar times instruction execution. Every
instruction takes a single cycle; a load or store instruction additionally pays the data
cache latency, so that a cache hit takes two cycles in total and a cache miss takes ten
cycles.

/* Simulator core */
counter_t insn;     /* total instructions executed */
counter_t cycle;    /* total cycles elapsed */

void sim_main()
{
  stat_register("insn", &insn, "total instructions");
  stat_register("cycle", &cycle, "total cycles");
  stat_formula("IPC", "insn/cycle", "inst/cycle");
  while (!sim_done)
    {
      inst = sim_execute_insn();
      insn++;
      cycle++;                               /* one cycle per instruction */
      if (inst.flags & F_MEMOP)
        cycle += cache_access(inst.addr);    /* extra cycles for loads and stores */
    }
}

/* Cache component */
time_t cache_access(addr_t addr)
{
  word_t index = cache_hash(addr);
  if (tag[index] == addr)
    {
      /* hit: one extra cycle */
      cache_update_lru(index);
      return 1;
    }
  else
    {
      /* miss: nine extra cycles */
      cache_handle_miss(addr);
      return 9;
    }
}


For this research, only three files were modified to implement the different cache
schemes: cache.c, cache.h and sim-cache.c.
• cache.c: Part of the SimpleScalar modules; implements data and instruction caches
  along with replacement policies, mapping functions, write policies, etc.
• cache.h: Contains the declarations for the cache.c file.
• sim-cache.c: Implements a functional cache simulator and produces the statistics for
  the cache and TLB configurations, which may include up to two levels of data and
  instruction cache and one level of instruction and data TLBs.

After successful installation of SimpleScalar, different cache architectures can be
checked and simulated. A cache configuration is formatted as follows:
<name>:<nsets>:<bsize>:<assoc>:<repl>
<name>: Unique cache name.
<nsets>: The number of sets needed.
<bsize>: The block size in bytes.
<assoc>: The associativity of the cache; it should be a power of two.
<repl>: The replacement policy of the cache; l: LRU (least recently used), f: FIFO (first
in first out) and r: random replacement.
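As a minimal sketch of how such a configuration string could be split into its five
fields and checked, consider the following C fragment. The structure and function names
(cache_cfg, parse_cache_cfg, cache_bytes) are hypothetical and are not part of the
SimpleScalar sources; the validity checks simply restate the rules listed above.

```c
#include <stdio.h>

/* Hypothetical container for the five fields of a cache configuration
 * string; the field names mirror the placeholders used in the text. */
struct cache_cfg {
    char name[64];
    int  nsets;   /* number of sets */
    int  bsize;   /* block size in bytes */
    int  assoc;   /* associativity */
    char repl;    /* 'l' = LRU, 'f' = FIFO, 'r' = random */
};

/* Returns 1 when v is a positive power of two. */
static int is_pow2(int v) { return v > 0 && (v & (v - 1)) == 0; }

/* Parses "<name>:<nsets>:<bsize>:<assoc>:<repl>".
 * Returns 0 on success and -1 on a malformed or invalid string. */
int parse_cache_cfg(const char *s, struct cache_cfg *out)
{
    if (sscanf(s, "%63[^:]:%d:%d:%d:%c",
               out->name, &out->nsets, &out->bsize,
               &out->assoc, &out->repl) != 5)
        return -1;
    if (out->nsets <= 0 || out->bsize <= 0 || !is_pow2(out->assoc))
        return -1;
    if (out->repl != 'l' && out->repl != 'f' && out->repl != 'r')
        return -1;
    return 0;
}

/* Total data capacity in bytes: sets x block size x associativity. */
long cache_bytes(const struct cache_cfg *c)
{
    return (long)c->nsets * c->bsize * c->assoc;
}
```

For instance, a string such as dl1:512:32:2:l would describe a two-way cache with 512
sets of 32-byte blocks, that is 512 x 32 x 2 = 32768 bytes (32 KB).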

When using the cache module on a Solaris operating system, the command line
would be:
./sim-cache <sim-name> <cache-config> <benchmark> "<" <input to benchmark>
For example, to simulate an instruction cache named example, with 512 sets and a block
size of 8 using an LRU replacement policy, on the SPEC95-Quake benchmark, the command
line would be:
./sim-cache -cache:il1 example:512:8:<assoc>:l ./quake.ss < ./inp.in
where <assoc> stands for the chosen associativity. Notice that -cache:il1 is used for a
level-one instruction cache and -cache:dl1 for a level-one data cache. The benchmark and
the input in this example reside in the same folder as the simulator.

2.3. Coding SimpleScalar
In cache.h the user instantiates a new cache using the function cache_new(), then
the user specifies the geometry of the cache (number of sets, block size and
associativity); the cache module traces latency through the simulation process and shows
the number of hits and misses.
In this file two important structures were modified for the different cache schemes
studied in this research. Those structures are “cache_set_t” and “cache_t,” which specify
different functions and arguments for sets and the cache respectively.

In cache.c, a few functions and structures are worth mentioning for the role they
played in implementing the cache schemes:
• update_way_list(): Modified and then used to implement different mapping functions.
• cache_create(): Includes all the parameters controlling the characteristics of the
  cache; many parameters were added to give new functionality to the new cache schemes.
• cache_reg_stats(): Used to print the statistics on the screen. New statements were
  added to show the counts of the new parameters.
• cache_access(): The function that was modified the most; it contains the cache_miss()
  and cache_hit() functions, and new functions were added to it for each cache scheme.
  This function defines the cache behavior, such as the write policy, mapping
  functions, replacement functions, etc.

2.4. Customizing SimpleScalar
The toolset in its original form was not sufficient for the newly proposed cache
schemes. As mentioned previously, three files were modified: cache.c, cache.h and
sim-cache.c. The major changes took the form of adding new functions, adding new address
masking or modifying existing functions.

Along with this toolset, Spec2000 benchmarks, which are real application programs,
were used as a standard way of measuring the system’s performance. After each simulation,
detailed statistics, including the cache miss rate, were reported and saved to a file.
The main problem was simulation time; some benchmarks took over two days
to finish a single simulation. Because of this, Unix scripts were used to
automatically start simulations in sequence and store the results in files. These files
were checked later and the miss rates were extracted into an Excel spreadsheet.

CHAPTER III
RELATED WORK
Over the past decade, cache memory has attracted many researchers because of the
role it plays in narrowing the gap between the processor and memory. New schemes and
designs have been tested, and a few researchers have come up with good, interesting
results. This chapter briefly discusses two of the successful cache designs, the skewed
cache and the thrashing-avoidance cache (TAC).

3.1. The Two-Way Skewed Cache
Seznec [8] proposed the skewed cache, which can enhance the cache performance
by reducing conflict misses. The skewed cache offers a better miss rate without
increasing the size of the cache or its associativity.
The basic principle of the skewed cache is to employ two or more mapping functions to
control the flow of data from and to the cache, one mapping function per bank. In a
two-way skewed cache, an address requested by the CPU does not map to the same set in
the two banks: it is mapped to a certain set in the first bank and then skewed to
map into a different set in the second bank. To achieve this skewed mapping, a shuffling
function is used to change the order of the address bits. This shuffling, along with
xoring portions of the address together, produces a unique new address. The skewed cache
has a simulator called CACHESKEW [10], which can be found on the author’s website
Http://www.irisa.fr/caps/PROJECTS/Architecture/CACHESKEW.html.

[Figure: address fields as labeled in the figure: TAG, A2, A1, Block Offset.]
Figure 3.1: An example of splitting the address in skewed cache

Figure 3.1 shows a possible division of the address. Part of the tag is shuffled and
then xored with the index or with part of it. The address can be divided with no
restrictions as long as A1 ⊕ A2 produces an appropriate address for blocks in the banks
and as long as neither A1 nor A2 overlaps with the block offset part. A different
shuffling is applied in each successive bank of the cache.
An example would be:
For bank 0: A2 ⊕ A1
For bank 1: θ1(A2) ⊕ A1
For bank 2: θ2(A2) ⊕ A1
For bank 3: θ3(A2) ⊕ A1, where θ is a shuffling function.
The shuffling function reorders the bits of the address in a random way. Any
function that uniquely maps the blocks in a bank would work as well. Skewing the
addresses this way helps to reduce the conflict misses by mapping those blocks that could
collide in one bank to different sets in the other bank.
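The two-bank case of this mapping can be sketched in C as follows. The field widths and
the shuffle used here (a one-bit rotation of A2) are assumptions chosen only for
illustration; they are not the functions proposed by Seznec, and any one-to-one
reordering of the bits could take the place of the rotation.

```c
#include <stdint.h>

/* Illustrative geometry (an assumption, not taken from the thesis). */
#define OFFSET_BITS 5u                      /* 32-byte blocks */
#define INDEX_BITS  8u                      /* 256 sets per bank */
#define INDEX_MASK  ((1u << INDEX_BITS) - 1u)

/* A1: the conventional index field, just above the block offset. */
static uint32_t field_a1(uint32_t addr)
{
    return (addr >> OFFSET_BITS) & INDEX_MASK;
}

/* A2: an index-sized slice of the tag, just above A1. */
static uint32_t field_a2(uint32_t addr)
{
    return (addr >> (OFFSET_BITS + INDEX_BITS)) & INDEX_MASK;
}

/* Stand-in shuffle: rotate the INDEX_BITS-wide value left by one bit. */
static uint32_t shuffle(uint32_t v)
{
    return ((v << 1) | (v >> (INDEX_BITS - 1u))) & INDEX_MASK;
}

/* Set index in bank 0: A2 xor A1. */
uint32_t skew_index_bank0(uint32_t addr)
{
    return field_a2(addr) ^ field_a1(addr);
}

/* Set index in bank 1: shuffle(A2) xor A1. */
uint32_t skew_index_bank1(uint32_t addr)
{
    return shuffle(field_a2(addr)) ^ field_a1(addr);
}
```

Under these assumptions, the addresses 0x2060 (A2 = 1, A1 = 3) and 0x6020 (A2 = 3,
A1 = 1) share set 2 in bank 0 but are sent to sets 1 and 7 in bank 1.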

The two-way skewed cache uses a pseudo-LRU replacement policy, which along
with the xoring and shuffling actions produces a complex architecture; on the other hand,
the skewed cache produces the same hit ratio as a four-way set associative cache of the
same size [8]. The mapping functions used in the skewed cache provide a low probability
for addresses with the same index to collide in the cache by skewing them to other sets
based on the chosen xor function. Ideally, those mapping functions are chosen so that the
data is scattered equally over all the lines of the cache, which gives a much better
utilization of the cache.

Figure 3.2: Data conflicting in bank0 but not in bank1 in a two-way skewed cache.

Figure 3.2 [9] shows three addresses A, B and C, which collide in the first bank, but
are scattered over the other bank by the second, different mapping function.
This use of mapping functions eliminates conflict misses that could happen in a
two-way set associative cache, which uses one mapping function for the two banks. Results
show that the main advantage of the two-way skewed cache is reducing conflict misses
significantly [11].

3.2. Thrashing-Avoidance Cache (TAC)
Chu and Ito [6,7] proposed the TAC scheme, which reduces conflict misses
significantly. The TAC uses the idea of grouping; it places
groups of instructions separated by a call instruction in one bank. The targeted bank is
selected according to two replacement policies: Bank Selection Logic (BSL) and
Bank-oriented pseudo LRU (BoPLRU). The TAC shows that grouping can enhance the cache
performance by reducing the conflict misses between call instructions.
Calder et al. [12] showed that call instructions are executed seven times more often in
object-oriented programs (like ones written in C++) than in regular programs (like
C-based ones). Making use of such frequent call instructions to reduce the
miss rate is the main idea behind this cache.
Call instructions serve as a barrier between different sequential instructions. TAC
groups those instructions and spreads them into the banks of the cache to achieve
minimum conflict. This study concentrates on two-way architectures, so in the following
sections, each TAC term refers to a two-way TAC architecture.

3.2.1 Basic operation of TAC
Figure 3.3 [6] shows the operation of the Bank Selection Logic (BSL), which is the
first replacement policy applied in the cache. In the figure, a few instructions are
executed in sequence, and then a call instruction transfers execution to another
sequence of instructions. Instructions B and I are call instructions and are called group
separators.

[Figure: instruction sequences Inst. A, Inst. B; Inst. H, Inst. I, Inst. J; and Inst. X,
Inst. Y, arranged into Group A, Group H and Group X, with CALL arrows between the groups.]
Figure 3.3: Grouping instructions separated by a call instruction

In TAC, there are two steps in deciding in which bank the replacement will take place.
BSL decides the bank of replacement in the first step, whereas BoPLRU either confirms the
first decision or corrects it in the second step. Based on BSL, the number of call
instructions decides in which bank the incoming line is to be placed in the cache, as a
first guess. Every two call instructions, the bank chosen for replacement is switched.
In Figure 3.3, call instructions B and I are the last members of their groups; they
serve as separators between consecutive groups. For the first two groups, BSL decides
that the replacement should occur in bank0, and the third group should be placed in
bank1. So, groups A and H are selected to be placed in bank0 while group X is to be
placed in bank1. Within each bank, XOR mapping functions are applied to achieve a better
utilization of the cache sets.
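The first-guess rule of BSL can be sketched in a few lines of C. This is only an
illustration of the behavior described above (the bank switches after every second call
instruction); the structure and function names are hypothetical and do not come from
the TAC implementation.

```c
/* Bank Selection Logic sketch for a two-way TAC. */
struct bsl {
    unsigned calls;   /* call instructions (group separators) seen so far */
};

/* Record one executed call instruction. */
void bsl_on_call(struct bsl *b)
{
    b->calls++;
}

/* First-guess bank for the current group: the choice switches after every
 * second call, so the first two groups go to bank 0, the next two to
 * bank 1, and so on. */
int bsl_first_guess(const struct bsl *b)
{
    return (int)((b->calls >> 1) & 1u);
}
```

With the groups of Figure 3.3, group A (no call seen yet) and group H (one call seen)
both obtain bank 0, while group X (two calls seen) obtains bank 1.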
After selecting which bank is the candidate for replacement, the second replacement
policy (BoPLRU) is applied as a correction mechanism for the BSL. Figure 3.4 shows the
operation of BoPLRU.

[Figure: on a cache miss, (1) BSL makes the first guess, here Bank1; (2) BoPLRU makes the
second guess: if the flag of Bank1 is 1, replace in Bank1 and update the flag to 0; if the
flag of Bank1 is 0, replace in Bank0 and update the flag to 1.]
Figure 3.4: The operation of BoPLRU

For BoPLRU to operate correctly, the blocks in each bank are supplied with flags (a
one-bit register each). When BSL decides to replace in either bank, XOR mapping is
applied to locate the correct set in that bank. Then the block’s flag is checked; if it
holds the value ‘0’, the other bank’s block is updated and the flag is set to ‘1’.
Otherwise, if the flag holds the value ‘1’, the same block is updated and the flag’s
value is changed to ‘0’. For example, if BSL chose bank 1, the mapped block’s
flag is checked after applying the XOR mapping functions; if the flag’s value is 1, the
block is updated with the missed line from memory, and the flag’s value is switched to
‘0’. Otherwise, the block in the other bank is updated and the corresponding flag’s
value is changed to ‘1’.
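The correction step of BoPLRU can be illustrated with the following C sketch. It assumes
that the flag being updated is the one that was checked, that is, the flag of the block
in the first-guess bank, which is how Figure 3.4 reads; the names used are hypothetical.

```c
/* One-bit flags of the two candidate blocks an address maps to,
 * one block per bank (flag[0] for bank 0, flag[1] for bank 1). */
struct set_flags {
    unsigned char flag[2];
};

/* Given the first-guess bank from BSL, return the bank in which the
 * replacement finally takes place, and update the checked flag. */
int boplru_pick(struct set_flags *s, int guess)
{
    if (s->flag[guess] == 1) {
        s->flag[guess] = 0;     /* flag is 1: confirm the first guess */
        return guess;
    }
    s->flag[guess] = 1;         /* flag is 0: divert to the other bank */
    return 1 - guess;
}
```

Starting from cleared flags, two consecutive misses whose first guess is bank 1 are
therefore placed in bank 0 and then in bank 1.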
BoPLRU is a modified version of the Pseudo LRU replacement policy; it guarantees
that different groups are placed in different banks safely by directing instructions to their
final destination in the cache; thus, recent groups can be placed in the same bank.
This procedure makes sense because instructions in the same group have a lower
chance of conflicting in the same bank, whereas a group of sequenced instructions has a
higher chance of conflicting in the other bank. Applying both BSL and BoPLRU
ensures, to a certain extent, that groups are spread between the two banks.

3.2.2 Performance of TAC
Simulation was performed on the SPEC95 CINT benchmarks (programs written in C)
and on a suite of programs written in C++ to investigate the
difference in performance. The TACSim tool was used for the simulation process.
The simulation results [6] for both deltablue (C++ program) and m88ksim (C
program) are in Figure 3.5. The simulation was performed with a 1 KB cache size and a
16B block size for direct-mapped, conventional two-way set associative, conventional
four-way set associative, two-way skewed cache, two-way TAC
and conventional sixteen-way set associative architectures.

[Figure: bar chart of miss ratio (%), from 0 to 4, for the DM, 2WA, 4WA, 2W-skew and
2W-TAC caches, with one series each for m88ksim [C] and deltablue [C++].]
Figure 3.5: Comparison between conventional cache schemes, the two-way skewed cache and the TAC.

According to [6,7], the TAC cache has a better performance than two-way skewed,
especially when running C++ programs with no increase in the complexity of the
hardware implementation.

CHAPTER IV
THE STORE/LOAD DEPENDENT DATA CACHE

TAC succeeded in improving the miss rate by using the concept of grouping
related instructions separated by some parameter (a call instruction). The TAC operated
on instruction cache, and concentrated on group separators between the instructions to
perform the algorithm. This research attempts to apply the same concept on the data
cache, by using the distance (difference between data addresses) to group related data.
The store/load dependent data cache (SLDC) is a two-way set associative with
modified mapping functions and replacement policies. SLDC is a new cache scheme that
adopts the TAC’s concept of grouping and tries to improve the miss rate in the data
cache.
This chapter presents the Store/Load Dependent Data Cache, along with
algorithm diagrams, followed by the simulation results.


4.1. Basics of SLDC
When checking the sequence of instructions that access the data cache, a sequence of
same-type data-transfer instructions is seen frequently (consecutive loads
followed by consecutive stores). Since this cache tries to group related data in these
sequenced data-transfer instructions, SLDC employs separate registers for each
instruction type (load registers and store registers).
The basic idea behind the SLDC algorithm is that, if the address of the data
requested by same-type data-transfer instructions (either two loads or two stores) is close
enough to the previous one (the range of the difference, or “distance”, is predefined), then
both addresses are assumed to be within the same group, and should be placed in the
same bank. On the other hand, if the new address is not within the range, then it is saved
in the register and the counter is toggled to divert the next data to the other bank.
SLDC employs a register to monitor the distance between the data of two same-type
consecutive data-transfer instructions (either two loads or two stores). The addresses of
the data requested by those consecutive instructions are subtracted and the value is kept
in a control register called the distance register (D register); comparing the value of the
distance register to a predefined constant then decides the value of the control registers
called counters (load counter and store counter). Replacement decisions are based on
counters and flags in the cache using a concept similar to the one used in the TAC
described earlier. The store/load counters can be implemented in hardware as one-bit
registers, whose values either flip or stay the same.
A constant is kept in a register to be compared with the value of the D register;
this predefined constant defines the maximum distance between addresses of data
requested by same-type data-transfer instructions to be considered in the same group. The
D register is the parameter that splits groups of data in the cache and specifies the route in
the algorithm for the missed line to follow.
Figure 4.1 shows a case where the value in the D register is less than the constant
that defines the group limits, and both data members, current and previous, are assumed
to be in the same group; the data should be placed in the same bank.

[Figure: the stored address (Addr-Data x) and the current address (Addr-Data z) give the
distance D, which is compared with the constant to produce the “first guess” of the bank
replacement candidate. Data members x and z are in the same group because D < Constant.]
Figure 4.1: A case where the distance between data addresses is within the limit of a group.

Figure 4.2 shows the case where the previous and current data members do not
belong to the same group because the difference between their addresses exceeds the
predefined constant. Those data members should be placed in different banks since they
belong to different groups.

[Figure: the stored address (Addr-Data x) and the current address (Addr-Data z) give the
distance D, which is compared with the constant to produce the “first guess” of the bank
replacement candidate. Data members x and z are not in the same group because D > Constant.]
Figure 4.2: A case where the distance between data members’ addresses is not within the group limit.

4.2. Architecture of SLDC
The hardware implementation of the SLDC consists of the following. One register
stores the distance parameter for the whole cache; a single register serves this purpose
because the stored value is overwritten every time the cache is accessed.
Keeping the constant that defines the group limits requires another register. In addition,
the cache has two one-bit counters (store/load counters) to decide in which bank the
incoming data will reside; this decision is a first guess of the candidate bank for
replacement. One-bit flags are needed as well in each of the blocks (two per set), in the
hope that these control elements will redirect the data member to the right bank.
Figure 4.3 shows the main hardware of the SLDC.

[Figure: the Store Register, Load Register, Store counter, Load counter and D register,
alongside Bank0 and Bank1; each block of the accessed set carries a flag (flag0, flag1).]
Figure 4.3: Hardware implementation of SLDC

The whole cache has two registers to keep the last load and store addresses
requested, one register to keep the distance, and two counters, one for load and one for
store. In addition, each block has a one-bit flag to help direct the new block
lines to one of the two banks.


4.3. Operation of SLDC
The algorithm starts by determining whether the instruction is a store or a load; based
on that, the address of the requested data is compared with the corresponding load or
store register. The difference between the two addresses, the address of the current data
and the one stored in the store/load register, is the distance between two consecutive
data members requested.
This research assumes that if the difference between data addresses requested by
same-type data-transfer instructions is within a specific range, then the data of these two
instructions is within the same group and should be placed together in the same bank;
the algorithm assumes that members of the same group have less chance of
conflict in the same bank. Otherwise, if the distance is greater than the predefined range,
the corresponding counter (load counter for a load instruction and store counter for a
store instruction) is toggled; this counter is checked later to place the new data in one
of the banks. Figure 4.4 shows the operation of the registers in the algorithm.
[Figure: the previous data address held in the store/load register and the current
store/load data address are subtracted to give D; D is compared with the constant, and
the result sets the store/load counter.]
Figure 4.4: The operation of registers in SLDC.

When the distance is greater than the predefined range, the data is assumed not to
be within the same group as the last one stored in the cache. Toggling the counter helps in
directing the incoming data into the other bank. The corresponding register, either store
or load, is updated afterwards with the new address.
Checking for a cache hit is performed by comparing the blocks’ tags with the tag part
of the address. In case of a cache miss, the corresponding store/load counter either
flips or stays the same, based on the value stored in the distance register. The
counter is then checked to make the first guess and decide the bank that will host the
new data. Then the flag in the block corrects this decision, if necessary. This
correction may result in diverting the data to the other bank. Finally, the flag is updated.
Figure 4.5 shows the SLDC’s algorithm.
[Figure: flowchart of the SLDC algorithm for a load/store (L/S) instruction, with the
decision boxes “Hit in the Cache?”, “Hit in B0?”, “|D| > XX?”, “L/S 1-bit Counter = 0?”,
“B0 flag = 1?” and “B1 flag = 1?”, and the action boxes “B0-flag = 0”, “B1-flag = 0”,
“D = L/S last addr. - L/S current addr.”, “update L/S Reg. with current addr.”,
“Toggle L/S 1-bit counter”, “Rep. B0, Toggle flag”, “Rep. B1, Toggle flag”,
“Take to CPU (LOAD) / Follow write hit policy (STORE)” and “Fetch data from Memory
(LOAD) / Follow write miss policy (STORE)”.
Where: B0: Bank 0; B1: Bank 1; D: Distance; XX: Predefined constant.]
Figure 4.5: Algorithm for SLDC


In summary, based on the data-transfer instruction (either store or load), the
corresponding counter is checked, and the algorithm makes a first guess about which
bank the new data should reside in; then the flag of the same block is checked and the
algorithm decides whether to replace data in this bank or in the other one. Finally, the
algorithm updates the block’s flag in the selected bank.
Figure 4.6 shows the operation of this cache using a simple example. The first guess
based on the instruction counter was to replace in bank1; after checking the flag of the
block in bank1, the decision had changed to replace data in bank0 since the flag holds the
value 0, indicating a replacement should occur in the other bank. Finally, the algorithm
updates the flag.

[Figure: on a cache miss, (1) the counter gives the first guess, Bank1; (2) the Bank1 flag
gives the second guess, Bank0. Data is replaced in Bank0, and the flag is updated (flipped).]
Figure 4.6: SLDC operation
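The miss path of the SLDC can be summarized by the following C sketch, given here only
as an illustration of the steps described in this section. Several details are
assumptions rather than statements of the implementation: the counter value is used
directly as the bank number, the flag that is flipped is the one of the first-guess
bank, and all structure and function names are hypothetical.

```c
#include <stdint.h>

/* Control state of the SLDC sketch; index 0 is the load side,
 * index 1 the store side. */
struct sldc {
    uint32_t last_addr[2];   /* load register and store register */
    unsigned counter[2];     /* one-bit load counter and store counter */
    uint32_t limit;          /* predefined constant bounding a group */
};

/* First guess on a miss: compute the distance to the previous same-type
 * access, toggle the counter when it exceeds the limit, then remember
 * the current address. Returns the first-guess bank. */
int sldc_first_guess(struct sldc *c, int is_store, uint32_t addr)
{
    int k = is_store ? 1 : 0;
    uint32_t prev = c->last_addr[k];
    uint32_t d = addr > prev ? addr - prev : prev - addr;   /* |D| */

    if (d > c->limit)
        c->counter[k] ^= 1u;      /* outside the group: switch banks */
    c->last_addr[k] = addr;       /* kept for the next same-type access */
    return (int)c->counter[k];    /* assumed: counter value = bank number */
}

/* Second guess: flags[b] is the flag of the candidate block in bank b.
 * A flag of 1 confirms the first guess; a flag of 0 diverts the data to
 * the other bank. The checked flag is flipped in both cases. */
int sldc_final_bank(unsigned char flags[2], int guess)
{
    int bank = flags[guess] ? guess : 1 - guess;
    flags[guess] ^= 1u;
    return bank;
}
```

With a limit of 128 and cleared state, loads from 0x1000 and 0x1040 fall in the same
group and receive the same first guess, whereas a following load from 0x2000 toggles
the load counter; the store counter is unaffected by the loads.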


4.4. Simulation Results and Analysis
The simulations use the modified SimpleScalar toolset. Table 4.1
shows the Spec2000 benchmark programs used in the simulation. Different
simulations used different distance parameters (8, 128, 1024 and 4096); for each of
them, three different cache block sizes were used (32, 16 and 8 bytes)
with the same cache size (32 KB).

Table 4.1: Spec2000 benchmark programs used in SLDC simulation

Benchmark                 Language     Description
Quake (Floating Point)    C            Seismic Wave Propagation Simulation
Applu (Floating Point)    C            Parabolic / Elliptic Partial Differential Equations
Swim (Floating Point)     Fortran 77   Shallow Water Modeling
Parser (Integer)          C            Word Processing
Bzip2 (Integer)           C            Compression

The simulation results are shown in figure 4.7. For a block size of 8B, the Parser
benchmark shows a slight improvement for all distance values. For block size of 16B,
Bzip2 shows good results compared to the conventional cache schemes. For block sizes
of 16B and 32B, the swim benchmark shows promising results. In general, SLDC had no
significant improvements over the conventional cache schemes.

[Figure: three bar charts of miss ratio, titled “SLDC, 32KB-8B block”, “SLDC, 32KB-16B
block” and “SLDC, 32KB-32B block”. Each chart covers the quake, applu, swim, parser and
Bzip2 benchmarks with the series 2-way, D=8, D=128, D=1024 and D=4096.]
Figure 4.7: Simulation results for SLDC

In all simulations, the results were not appreciably different between the two-way
set associative cache and the SLDC, because changing the value of the distance could
not reduce cache misses significantly.
This research assumed that data for consecutive data-transfer instructions might
have locality based on the distance between two data. If a group of related data
members shows good locality in the bank, the distance between two data may not be the
right parameter to separate this group from others. In this research, the distance is the
parameter that decides the value of the D control register in the algorithm.
In two-way set associative, locality comes naturally within the block fetched from
memory based on the address. This behavior creates conflict misses in each bank.
Therefore, SLDC was designed to define the locality of groups, where data with close
addresses are considered to be in one group, to reduce conflicts among groups instead of
individual data.
Results show that a static (fixed) distance might not be a good factor for grouping data
in the data cache since it covers only part of the data. Accessing the members of a data
array in the form A[1][0], A[1][1], …, A[1][n] touches the data in sequence; using
a static distance to group such data has a good chance of decreasing the miss rate. On
the other hand, if the array is accessed in the form A[1][0], A[2][0], …, A[n][0], then
a static distance might not cover the whole group efficiently, resulting in more conflict
misses.
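The two access patterns can be compared numerically with a short C sketch. It assumes a
row-major array layout, as used by C, and the helper names are hypothetical; the test
for group membership follows the comparison |D| > constant used by the algorithm.

```c
#include <stddef.h>

/* Byte distance between consecutive accesses when walking along a row,
 * A[i][j] -> A[i][j+1], in a row-major array. */
size_t row_walk_distance(size_t elem_size)
{
    return elem_size;
}

/* Byte distance between consecutive accesses when walking down a column,
 * A[i][j] -> A[i+1][j], in a row-major array with ncols columns. */
size_t col_walk_distance(size_t elem_size, size_t ncols)
{
    return elem_size * ncols;
}

/* 1 when two consecutive accesses fall in the same group for a given
 * static distance limit, 0 when the counter would be toggled. */
int same_group(size_t distance, size_t limit)
{
    return distance <= limit;
}
```

For an array of 8-byte elements with 512 columns, a row walk gives a distance of 8
bytes, which stays within every distance value simulated, while a column walk gives
4096 bytes, which exceeds the limits 8, 128 and 1024 and toggles the counter on every
access.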

CHAPTER V
THE MultiXOR CACHE
One flaw associated with conventional mapping functions is that all references
with the same index are forced into the same set, which results in replacing the already
existing blocks with new ones.
The most critical problem in cache design is overcoming the conflict miss, since
very little can be done to decrease capacity or compulsory misses [5]. In a cache, this
problem occurs because a portion of the address called the index is used as a reference to
the data stored in the cache. Addresses with the same index portion map to the same
place in the cache; this results in conflict misses.
Although skewed cache shows good miss rate results, it has some disadvantages.
MultiXOR (MXOR) tries to use the xor functions and the skewed cache technique to
have a more dynamic utilization in the cache. This chapter presents MXOR cache, with
SimpleScalar toolset implementation and Spec2000 benchmark programs simulation.

5.1. Case Study
The skewed cache was proposed by Seznec [8]. He claims that the skewed cache
avoids most of the conflict misses in the cache. However, conflict misses may still occur
in some cases; if the addresses that cause conflict misses in
the skewed cache are accessed frequently, the cache will have a high miss rate.
The following example shows a conflict miss in skewed cache:
Let Add1: A3, A2, A1, A0
Add2: A3, A1, A2, A0
Figure 5.1 shows a special case where two addresses conflict in the first bank.
Applying the simple xor function “A1⊕A2” is causing a conflict for the two addresses
shown in the figure.

[Figure: Bank1 and Bank0; Add1 (A3,A2,A1,A0) and Add2 (A3,A1,A2,A0) both pass through
A2 ⊕ A1 and reach the same target block.]
Figure 5.1: In two-way skewed cache, two sets having a set conflict after the xoring.
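The conflict follows directly from the symmetry of the xor operation: swapping the A1
and A2 fields leaves A2 ⊕ A1 unchanged. A small C sketch makes this concrete; the field
widths (8 bits for A1 and A2, 5 bits for A0) and the function names are assumptions made
for the example only.

```c
#include <stdint.h>

#define F_BITS 8u   /* assumed width of the A1 and A2 fields */
#define O_BITS 5u   /* assumed width of the block offset A0 */

/* Build an address from its four fields, most significant first. */
uint32_t make_addr(uint32_t a3, uint32_t a2, uint32_t a1, uint32_t a0)
{
    return (a3 << (2u * F_BITS + O_BITS)) |
           (a2 << (F_BITS + O_BITS)) |
           (a1 << O_BITS) | a0;
}

/* Simple xor mapping function: set index = A2 xor A1. */
uint32_t xor_index(uint32_t addr)
{
    uint32_t mask = (1u << F_BITS) - 1u;
    uint32_t a1 = (addr >> O_BITS) & mask;
    uint32_t a2 = (addr >> (F_BITS + O_BITS)) & mask;
    return a2 ^ a1;
}
```

For example, the two distinct addresses built with (A2, A1) = (0x3C, 0x5A) and
(A2, A1) = (0x5A, 0x3C) both yield the set index 0x66.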

To solve this problem, complex mapping functions are used in both banks; the
address has to go through shuffling before further xoring is performed on it [10]. These
operations add to the hardware complexity [13]. Many different replacement policies can
be used with the skewed cache; however, it is hard to find a simple hardware
implementation of an LRU (Least Recently Used) replacement policy that gets most of
the skewed cache potential, because the block lines are changed with the new data
replacing old ones [11]. Pseudo-LRU is one way of implementing a replacement policy
that approximates the work of an LRU [8]. Another way is to use a pseudorandom
replacement in which flags are set and reset to choose different blocks for replacement;
this policy is called Not Recently Used Not Recently Written (NRUNRW) [11].
MXOR employs the power of xor mapping functions and distributes the functions
in different zones in the cache with less hardware complexity and a simpler replacement
policy.

5.2. Basics of MXOR cache
Cache conflict misses increase the miss rate and decrease the utilization of the cache.
The MXOR cache tries to avoid some of them by applying a small number of xor
mapping functions to achieve good overall utilization of the cache with a simpler algorithm.
In MXOR, the cache is virtually sliced; each bank is sliced into different
mapping-function zones, where each zone represents a number of adjacent blocks in the
cache and one mapping function is applied in each zone. The mapping function may differ
in all zones. In the requested address, A1 (the index) is used as an indicator for the
bank zone, serving as a key to determine which corresponding xor mapping function to apply.
Figure 5.2 shows a possible slicing of the cache into four zones. First, the index
decides which set is the key; in bank1, either F0 or F2 is applied based on the location of
the set. In bank0, either F1 or F3 is applied, also based on the set location. Flags
associated with the sets decide which bank to choose for the final replacement.

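[Figure: Bank1 and Bank0 with sets labeled Set 1, Set 2, ..., Set k, Set k+1, Set k+2, ...,
Set n. For the address Add (A3,A2,A1,A0), Bank1 is sliced into zones using F0 and F2, and
Bank0 into zones using F1 and F3.]
Figure 5.2: A possible slicing for two-way MXOR with four xor functions applied in each zone.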

In Figure 5.2 the cache has been sliced evenly, although even slicing is not a
requirement for MXOR. The slicing and the functions can be chosen to
give the best utilization and the lowest miss rate for the cache. In this research, the cache
was sliced evenly and simple xor mapping functions were applied. The questions of which
mapping functions to use and how the cache should be sliced can be studied in future
work.
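As a rough illustration of even slicing, the sketch below maps a set index to its zone and to the function label that Figure 5.2 associates with that zone in each bank. The names `zone_of`, `functions_for` and `FIG_5_2` are hypothetical, and the assumption that F0 and F1 cover the lower half of the sets while F2 and F3 cover the upper half is only one reading of the figure, not the thesis code.

```python
def zone_of(index, num_sets, num_zones):
    """Zone that holds a set index when the sets are sliced evenly."""
    if not 0 <= index < num_sets:
        raise ValueError("index out of range")
    return index * num_zones // num_sets


# Assumed reading of Figure 5.2: two zones per bank, one function per zone.
FIG_5_2 = {
    "bank1": ("F0", "F2"),
    "bank0": ("F1", "F3"),
}


def functions_for(index, num_sets):
    """Function label used in each bank for the given set index."""
    zone = zone_of(index, num_sets, 2)
    return {bank: labels[zone] for bank, labels in FIG_5_2.items()}


if __name__ == "__main__":
    for idx in (0, 255, 256, 511):
        print(idx, functions_for(idx, 512))
```

An uneven slicing would only change `zone_of`, for example by comparing the index against a list of zone boundaries.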
Figure 5.3 shows a possible distribution of the data when using the same slicing as
in Figure 5.2. Data that conflict in bank1 of the skewed cache, between zone 0 and zone 1
(refer to section 5.1), have a much lower probability of conflicting in MXOR because
two different mapping functions are used in those two zones. As Figure 5.3 shows, a set
in bank0 can be placed in more than one location in bank1.

[Figure: Bank1 and Bank0 drawn side by side; Add1 enters at Set 1 and Add2 at Set k+2.]

Figure 5.3: Possible address mapping for two sets in bank0 using four xor mapping functions.

Figure 5.3 shows clearly that Add1 and Add2 no longer conflict in
bank1 and have more possible placements in bank0. This of course does not remove the
possibility of conflicts within the same zone, where the mapping function used
determines the degree of conflict. Having zones of operation with different mapping
functions gives MXOR great flexibility; the designer can specify the number of
functions and the complexity of each to give the best utilization of the cache.

5.2.1 An example of xor mapping functions
Up to six mapping functions were simulated in this research. The functions were
selected as follows:
Let the block address be (A3, A2, A1, A0), where:
A0: Block offset.
A1: Set portion (the index).
A2: Set equivalent taken from the tag.
A3: The rest of the tag.
Functions:
F0 = A2 ⊕ A1
F1 = (~T.A2) ⊕ A1
F2 = (T.A2) ⊕ A1
F3 = (~W.A2) ⊕ A1
F4 = (W.A2) ⊕ A1
F5 = (W.T.A2) ⊕ A1, where:
T = 2730 = 101010101010 (shifted to the right to meet the number of set bits)
W = 3556 = 110111100100 (shifted to the right to meet the number of set bits)
~ : Is the complement operator.
. : Is the AND operator.
⊕ : Is the XOR operator.
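The six functions can be sketched in a few lines of Python. This is a minimal illustration and not the thesis code: the helper names `split_address` and `mapping_functions` are hypothetical, the address fields are read as A3 (rest of the tag), A2 (set-sized slice of the tag), A1 (index) and A0 (offset), and "shifted to the right" is assumed to mean that the 12-bit constants T and W are shifted right by 12 minus the number of set bits.

```python
T12 = 0b101010101010   # T = 2730, written with 12 bits as in the text
W12 = 0b110111100100   # W = 3556, written with 12 bits as in the text


def split_address(addr, offset_bits, set_bits):
    """Split an address into the fields (A3, A2, A1, A0)."""
    mask = (1 << set_bits) - 1
    a0 = addr & ((1 << offset_bits) - 1)
    a1 = (addr >> offset_bits) & mask
    a2 = (addr >> (offset_bits + set_bits)) & mask
    a3 = addr >> (offset_bits + 2 * set_bits)
    return a3, a2, a1, a0


def mapping_functions(set_bits):
    """Return [F0, ..., F5]; each takes (a2, a1) and returns a set index."""
    if not 1 <= set_bits <= 12:
        raise ValueError("this sketch assumes 1 to 12 set bits")
    mask = (1 << set_bits) - 1
    t = T12 >> (12 - set_bits)   # assumed meaning of "shifted to the right"
    w = W12 >> (12 - set_bits)
    return [
        lambda a2, a1: (a2 ^ a1) & mask,            # F0
        lambda a2, a1: ((~t & a2) ^ a1) & mask,     # F1
        lambda a2, a1: ((t & a2) ^ a1) & mask,      # F2
        lambda a2, a1: ((~w & a2) ^ a1) & mask,     # F3
        lambda a2, a1: ((w & a2) ^ a1) & mask,      # F4
        lambda a2, a1: ((w & t & a2) ^ a1) & mask,  # F5
    ]


if __name__ == "__main__":
    a3, a2, a1, a0 = split_address(0x12345, offset_bits=5, set_bits=9)
    print([f(a2, a1) for f in mapping_functions(9)])
```

Because every function xors a masked copy of A2 into A1, two addresses that share A1 but differ in A2 can land in different sets, and which A2 bits matter depends on the zone's function.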

5.2.2 Replacement policy used
This research proposes MXOR as a simpler and more practical form of the skewed
cache, so a simple replacement policy was applied. A one-bit flag
is kept with each block; whenever the algorithm chooses a block for replacement, the
corresponding flag is set to one, meaning that the block has been updated recently.
First, the index of the requested address is checked to decide the zone of operation.
Then, the corresponding functions are applied; finally, the replacement occurs based on
the flags of the two selected blocks. The replacement occurs in the bank whose flag holds
the value zero; the value of this flag is then flipped to one and the flag of the other bank
is reset to zero. If both flags hold the same value, replacement occurs in the first bank if
both values are one and in the second bank if both values are zero. The flags are
updated afterwards.
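The victim selection described above can be written as a small function. This is a sketch under two assumptions that the text does not settle: "first bank" is taken to mean bank0 and "second bank" bank1, and in the tie cases the flags are updated the same way as in the normal case (the chosen bank's flag becomes one, the other becomes zero). The name `choose_victim` is hypothetical.

```python
def choose_victim(flag0, flag1):
    """Pick the bank to replace in and return (bank, new_flag0, new_flag1).

    flag0 and flag1 are the one-bit flags of the two candidate blocks
    in bank0 and bank1.
    """
    if flag0 != flag1:
        bank = 0 if flag0 == 0 else 1   # replace where the flag is zero
    elif flag0 == 1:
        bank = 0                        # both one: first bank (assumed bank0)
    else:
        bank = 1                        # both zero: second bank (assumed bank1)
    # Assumed update rule: chosen bank's flag is one, the other is zero.
    return (bank, 1, 0) if bank == 0 else (bank, 0, 1)


if __name__ == "__main__":
    for flags in ((0, 1), (1, 0), (1, 1), (0, 0)):
        print(flags, "->", choose_victim(*flags))
```

Only one bit per block is stored, which is what keeps this policy cheaper than the flag-flushing scheme used with the skewed cache.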

5.2.3 Method of implementation
MXOR was implemented in the SimpleScalar toolset as a level-1 cache. The files
cache.c and cache.h were modified and new functions were added to the code. In
SimpleScalar, the replacement policy and the associativity are tied to the
cache data structure, because each set is represented as a linked list. For conventional
cache schemes, this data structure works well. For MXOR, however, where a
different replacement policy is used, the structure does not help, and the
linked list becomes impractical for the implementation.

The linked list was removed, and flags were added to each set to serve as the flag
indicators. In addition, each block was represented by a pointer, so each set had
two pointers representing the two blocks in the two banks. This made the code
practical for further changes, with the possibility of adding separate parameters for each
block.
To index into the cache, the algorithm first checks whether there is a hit in bank0; if
there is no hit, the algorithm xors the index and then checks bank1. If there is a
miss again, the replacement policy is applied.
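The lookup order just described can be illustrated with a toy model. This is a sketch, not the modified SimpleScalar code: the class name `TwoBankXorCache` is hypothetical, bank0 is assumed to be probed with the unmodified index and bank1 with the zone function's output, the full block address stands in for the tag, and the tie-breaking and flag updates follow one reading of section 5.2.2 (first bank taken as bank0).

```python
class TwoBankXorCache:
    """Toy two-bank cache; bank1 is indexed through a per-zone xor function."""

    def __init__(self, num_sets, offset_bits, zone_functions):
        self.num_sets = num_sets                  # sets per bank, a power of two
        self.set_bits = num_sets.bit_length() - 1
        self.offset_bits = offset_bits
        self.zone_functions = zone_functions      # one callable (a2, a1) per zone
        self.blocks = [[None] * num_sets for _ in range(2)]
        self.flags = [[0] * num_sets for _ in range(2)]

    def _indices(self, addr):
        mask = self.num_sets - 1
        a1 = (addr >> self.offset_bits) & mask
        a2 = (addr >> (self.offset_bits + self.set_bits)) & mask
        zone = a1 * len(self.zone_functions) // self.num_sets  # even slicing
        return a1, self.zone_functions[zone](a2, a1) & mask

    def access(self, addr):
        """Return True on a hit, False on a miss (the block is then filled)."""
        block = addr >> self.offset_bits          # full block address as the tag
        i0, i1 = self._indices(addr)
        if self.blocks[0][i0] == block:           # 1) probe bank0, plain index
            return True
        if self.blocks[1][i1] == block:           # 2) probe bank1, xored index
            return True
        f0, f1 = self.flags[0][i0], self.flags[1][i1]
        if f0 != f1:                              # 3) miss: bank whose flag is zero
            victim = 0 if f0 == 0 else 1
        else:                                     # tie: bank0 if both one, else bank1
            victim = 0 if f0 == 1 else 1
        self.blocks[victim][(i0, i1)[victim]] = block
        self.flags[0][i0] = 1 if victim == 0 else 0
        self.flags[1][i1] = 1 if victim == 1 else 0
        return False


if __name__ == "__main__":
    cache = TwoBankXorCache(4, 0, [lambda a2, a1: a2 ^ a1, lambda a2, a1: a1])
    print([cache.access(a) for a in (1, 1, 5, 1, 5)])
```

In the usage lines, addresses 1 and 5 share the index A1 = 1 but differ in A2, so the xor function sends them to different sets of bank1 and both stay resident.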
To verify the correctness of this implementation, the conventional cache schemes
were first run with no index xoring; the results were the same as those of the
original SimpleScalar code before modification.

5.3. Simulation Results and Analysis
Simulations were run using five SPEC benchmarks: quake, applu, swim, parser and
bzip2. A cache size of 32KB with block sizes of 8, 16 and 32 bytes was tested. The
simulations were performed on a two-way set-associative cache and on MXOR using two,
four and six functions.
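For reference, the number of sets and index bits implied by these configurations follows from sets = cache size / (block size x ways). The short sketch below works this out; the function name `geometry` is hypothetical, and treating each MXOR bank as having the same number of sets as the two-way set-associative cache is an assumption.

```python
def geometry(cache_bytes, block_bytes, ways):
    """Sets per way plus offset and index widths for power-of-two sizes."""
    sets = cache_bytes // (block_bytes * ways)
    return {
        "sets": sets,
        "offset_bits": block_bytes.bit_length() - 1,
        "set_bits": sets.bit_length() - 1,
    }


if __name__ == "__main__":
    for block in (8, 16, 32):
        print(block, geometry(32 * 1024, block, ways=2))
```

For the three block sizes this gives 2048, 1024 and 512 sets, that is 11, 10 and 9 index bits.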
Figures 5.4-5.6 show simulations for two, four and six mapping functions
respectively, where the functions were distributed evenly on both banks of the data cache.

[Figure: three bar charts of miss ratio for quake, applu, swim, parser and bzip2, comparing
2W, 2W-Mxor and 2W-Skew; cache size 32KB with block sizes of 8B, 16B and 32B.]

Figure 5.4: Results for MXOR cache using two functions.

[Figure: three bar charts of miss ratio for quake, applu, swim, parser and bzip2, comparing
2W, 2W-Mxor and 2W-Skew; cache size 32KB with block sizes of 8B, 16B and 32B.]

Figure 5.5: Results for MXOR cache using four functions.

[Figure: three bar charts of miss ratio for quake, applu, swim, parser and bzip2, comparing
2W, 2W-Mxor and 2W-Skew; cache size 32KB with block sizes of 8B, 16B and 32B.]

Figure 5.6: Results for MXOR cache using six functions.

As the results show, this new setup achieved an improvement in the miss rate.
Although its performance was not as good as that of the skewed cache, this simple cache
shows that it is possible to perform better than the conventional cache schemes.
The results in Figure 5.5 show that with four functions and a block size of 32 bytes,
most of the benchmarks had a lower miss rate than the conventional two-way
set associative (5% - 10% improvement).
Figure 5.7 shows the conflict misses in the quake benchmark for each bank of the cache,
for a cache size of 32 KB and a block size of 8 bytes, using MXOR and two-way set
associative.

[Figure: bar chart titled "Conflict Misses in Quake", comparing 2W-Set and MXOR for Bank0
and Bank1; vertical axis from 0 to 25000000.]

Figure 5.7: Number of conflict misses in MXOR and two-way set associative.

The figure shows clearly that the number of conflict misses in quake was reduced; on the
other hand, when using MXOR, the applu benchmark results in more conflict misses than
with the two-way set associative.

The results in Figure 5.6 show that with six functions and a block size of 32 bytes,
four of the five benchmarks had a lower miss rate than the conventional
two-way set associative. The applu benchmark had almost the same results as the two-way
set associative.
Figure 5.8 shows the results for the quake benchmark only. The graph shows
good performance for this benchmark for all block sizes and for all numbers
of functions.

[Figure: miss ratio (0 to 0.07) against block size (8B, 16B, 32B) for 2-way, 2 Func.,
4 Func. and 6 Func.]

Figure 5.8: Results for MXOR cache when simulated with quake benchmark.

The results were not good enough for MXOR to replace the two-way set associative, since
the skewed cache gives almost half the miss rate of both MXOR and the two-way set
associative. The MXOR tried to remove the conflict misses in the banks, but the
selection of functions and zones of operation was the major reason this cache did not
achieve better results than the skewed cache.
The performance of the MXOR can be further improved by carefully selecting simple
mapping functions that result in fewer conflict misses for different zones. Figure 5.8 also
shows that the choice of the number of functions can result in a better miss rate in the
cache.

CHAPTER VI
CONCLUSION AND FUTURE WORK

6.1. Conclusion
Currently, cache memory has become a very important part of computer systems; it
can be found in all new systems. The main issue in designing cache memory is achieving
a low miss rate and a reasonable hardware complexity.
Many cache schemes have been proposed in the last ten years. Some of the new
cache schemes succeeded in achieving a low miss rate, but their hardware
implementation was complex, and those schemes were not implemented in commercial
computer systems. The goal of this study is to design a simple cache scheme that utilizes
the cache and achieves a low miss rate.
The store/load dependent data cache was the first attempt to build a cache with the
properties mentioned above. Although some simulation results were encouraging, this
study concluded that grouping related data members in each bank of the cache was not
successful. This cache failed because distance was not a successful parameter to separate
the data into groups.

As a second attempt, the MXOR was designed and implemented in SimpleScalar
and then simulated. Employing different xor mapping functions in different zones of the
cache was the basic principle behind MXOR.
The MXOR was simpler to implement than the skewed cache, and achieved
better miss rates than the two-way set associative scheme in most of the benchmarks.
Although the MXOR did not perform as well as the skewed cache, further study
of the xor functions that should be applied could make it perform better in the future.
Using xor mapping functions also achieves better utilization of the cache space,
since it disperses the data all over the cache in equal weights based on the mapping
functions chosen.

6.2. Future Work
It will be interesting to study the behavior of data members in the data cache, and to
study the possibility of grouping members of data based on some criteria and using a
specific grouping parameter.
For example, the distance between instruction addresses as a parameter can be
studied; data that have addresses within a certain limit can be grouped together in each
bank of the cache.

Applying dynamic distance as a parameter might be a good solution for several
cases. In this way, most of the code can be covered and data are grouped together in a
more accurate way to remove conflict misses.
Another important study can be made on SLDC when applying xor mapping
functions in an organized way, without mixing the groups between the banks.
For MXOR, a follow-up on this research would be to study different types of
mapping functions and to determine an optimized number of slices for the cache, that is,
the number of zones in which each function is applied.
Applying the same replacement policy technique used in the skewed cache, which flushes
the flags every predefined period, can also be an interesting issue to test; results can be
compared to the original MXOR scheme, and tradeoffs can be studied to balance the use
of such a complicated replacement policy.

REFERENCES

[1] Patt Y. N., Patel S. J., Evers M., Friendly D. H., and Stark J., “One Billion
Transistors, One Processor, One Chip,” IEEE Computer, September 1997, pp.
51-57.

[2] Burger D., Goodman J. R., and Kagi A., “Memory Bandwidth Limitations of
Future Microprocessors,” in Proceedings of ISCA ’96, 5/96, USA.

[3] Handy J., “The Cache Memory Book,” Second Edition, Academic Press, New
York, 1998, pp. 188-198.

[4] Smith A. J., “Cache Memories,” Computing Surveys, vol. 14, 3, September 1982.

[5] Hennessy J. L., Patterson D., “Computer Architecture: A Quantitative Approach,”
Second Edition, Morgan Kaufmann Publishers, Inc., San Francisco, California,
1996, pp. 390-426.

[6] Yul Chu, M. R. Ito, “The 2-way Thrashing-Avoidance Cache (TAC): An Efficient
Cache Scheme for Object-Oriented Languages,” IEEE Int. Conf. on Computer
Design, 2000.

[7] Yul Chu, M. R. Ito, “An Efficient Instruction Cache Scheme for Object-Oriented
Languages,” IEEE Int. Conf. on Computer Design, 2001.

[8] A. Seznec, “A case for two-way skewed associative caches,” Proceedings of the
20th International Symposium on Computer Architecture, May 1993, pp. 169-178.

[9] Todd Austin, Eric Larson, Dan Ernst, “SimpleScalar: An Infrastructure for
Computer System Modeling,” IEEE Computer, 35(2): 56-67, Feb 2002.

[10] A. Seznec and J. Hedouin, The CACHESKEW simulator,
Http://www.irisa.fr/caps/PROJECTS/Architecture/CACHESKEW.html,
September 1997.

[11] F. Bodin, A. Seznec, “Skewed associativity improves performance and enhances
predictability,” Proceedings of the 22nd Int. Symposium on Computer Architecture,
Santa-margarita, June 1995.

[12] B. Calder, D. Grunwald, and B. Zorn, “Quantifying Behavioral Differences
between C and C++ Programs,” Journal of Programming Languages, 1994, Vol.
2, No. 4, pp. 313-351.

[13] Djordjalian, A., “Minimally-skewed-associative caches,” Computer Architecture
and High Performance Computing 2002, Proceedings. 14th Symposium 28-30
Oct, Pages 100-107.

