A Shared memory multiprocessor system architecture utilizing a uniform by Casilio, Frank
Rochester Institute of Technology 
RIT Scholar Works 
Theses 
8-1-1998 
A Shared memory multiprocessor system architecture utilizing a 
uniform 
Frank Casilio 
Recommended Citation 
Casilio, Frank, "A Shared memory multiprocessor system architecture utilizing a uniform" (1998). Thesis. 
Rochester Institute of Technology. Accessed from 
A Shared Memory Multiprocessor System Architecture





Partial Fulfillment of the





Committee Member: Date:
Roy S. Czernikowski, Professor and Department Head
Date:
Tony H. Chang, Professor
Committee Member:
Department of Computer Engineering
College of Engineering
Rochester Institute of Technology
Rochester, New York
August 1998
THESIS RELEASE PERMISSION FORM
Rochester Institute of Technology
College of Engineering
A Shared Memory Multiprocessor System Architecture
Utilizing a Uniformly Shared Level 2 Data-Only Cache
I, Frank Casilio, hereby grant permission to any individual or organization to reproduce





Due to VLSI lithography problems and the limits on further architectural
enhancements, uniprocessor systems are nearing the end of their life cycle. Therefore, it
is believed that Symmetric Multiprocessing (SMP) systems will be the next mainstream
computers. These systems allow multiple processors, accessing the same memory image,
to cooperate on a number of computational tasks as a single entity.
While multiprocessor systems can offer a substantial performance increase
compared to uniprocessor systems, major design considerations must be addressed to
achieve desired system efficiency levels. Managing cache coherence is a significant
problem in multiprocessor systems. Current implementations cope with this problem by
utilizing a cache coherence protocol. This protocol puts a large amount of overhead on
the system bus to ensure proper program execution, effectively decreasing overall system
performance. This thesis approaches the cache coherence problem from a new angle.
Instead of utilizing a cache coherence protocol, a new memory system is proposed which
eliminates the need for a cache coherence protocol, by utilizing a shared level 2 data-only
cache. This new architecture allows for better utilization of the system and improved
performance and scalability.
A data rate analysis is conducted to demonstrate the potential performance
increase from the proposed architecture over conventional approaches. The data rate
model clearly shows an increase in system performance and utilization when using the
architecture proposed in this thesis.
To My Parents,
without their constant love and




I would like to thank the following individuals for their support during the
completion of this thesis. First and foremost, I would like to thank my graduate
committee members, Dr. Roy S. Czernikowski, Dr. Tony Chang, and especially Dr.
Muhammad Shaaban for the help and insight he offered into this thesis.
Secondly, I would like to thank all of my professors, managers, coworkers, and
peers who have given me the privilege to learn, experience, and grow with them during
my educational career.
Trademarks






Table of Contents vii
List of Figures ix
List of Tables x
List of Equations xi
Glossary xii
1 Introduction 1
1.1 VLSI Advancements 2
1.2 Architectural Advancements 4
1.2.1 Pipelining 4
1.2.2 Branch Prediction 6
1.2.3 Superscalar Design 7
1.2.4 Cache 7





1.4 The Quest for a Mainstream Supercomputer Architecture 14
1.4.1 SMP 15
2 Cache Coherence 19
2.1 Cache Basics 20
2.1.1 Cache Organization 20
2.1.1.1 Direct Mapped Cache 21
2.1.1.2 Fully Associative 21
2.1.1.3 Set Associative 22
2.1.2 Cache Block Lookup 22
2.1.3 Replacement Strategy 23
2.1.3.1 Least Recently Used (LRU) 23
2.1.3.2 Random 24
2.1.3.3 First-In, First-Out (FIFO) 24
2.1.4 Write Policy 24
2.1.4.1 Write-Through 24
2.1.4.2 Write-Back 25
2.2 Multiprocessor Cache Coherence 25
2.2.1 Data Sharing 26
2.2.2 Process Migration 26
2.3 Ways to Handle Cache Coherence 27
2.3.1 Directory Based Protocols 27
2.3.2 Snoopy Bus Protocols 28




2.3.2.3 Write-Once Protocol 30
2.4 Consistency Models 31
2.4.1 Sequential Consistency 31
2.4.2 Weak Consistency 33
3 Design of MP Architectures 35
3.1 Current Multiprocessor Implementation 35
3.1.1 The Chipset 36
3.1.1.1 Circuit Switched Buses 37
3.1.1.2 Split Transaction Bus 38
3.1.2 Memory Type 39
3.1.2.1 Reading a Cache Block From Memory 40
3.1.3 The MESI Cache Coherence Protocol 41
3.2 Proposed Multiprocessor Architecture 43
3.2.1 Cache Arbitration Unit (CAU) 46
3.2.2 Shared L2* Cache 47
3.2.3 Shared L2* bus 48
3.2.4 Processor Requirements 48
4 Performance Analysis 50
4.1 Performance Analysis Methods 50
4.2 Data Rate Analysis 51
4.2.1 Data Rate Analysis for Current SMP Memory Systems 52
4.2.2 Data Rate Analysis of the Proposed Architecture 54
4.2.3 Invalidation Overhead 54






Figure 1-1: Grand Challenge Applications 1
Figure 1-2: Chip Density for Intel Microprocessors 2
Figure 1-3: Projected CPU Frequency for next 15 years 3
Figure 1-4: Five stage pipeline 5
Figure 1-5: Memory Hierarchy 9
Figure 1-6: Cache Effect on a Systems Performance 11
Figure 1-7: Block Diagram of an SMP System with 4 Processors 15
Figure 1-8: Cache effect on a SMP system 16
Figure 2-1: Cache Organization Schemes 21
Figure 2-2: A Typical Cache Block Address 22
Figure 2-3: Data Inconsistency due to Data Sharing 26
Figure 2-4: Data Inconsistency due to Process Migration 27
Figure 2-5: Initial State of the Memory System 28
Figure 2-6: State of the System after a Write-Invalidation Operation 29
Figure 2-7: State of the Memory System after a Write-Update Operation 30
Figure 2-8: State Diagram for the Write-Once Protocol 31
Figure 2-9: The Sequential Consistency Model 32
Figure 2-10: The TSO Weak Consistency Model 34
Figure 3-1: The Intel Dual Pentium II Processor Memory System 36
Figure 3-2: A Circuit Switched Bus 38
Figure 3-3: A Split Transaction Bus 39
Figure 3-4: The MESI Write-Invalidate Protocol with Write-Back 41
Figure 3-5: Modified Architecture to Support a L2* Cache 44
Figure 3-6: Modification to the CPU Packaging 45
Figure 4-1: Data Rate of a Typical Program 52
Figure 4-2: Program Data Rate of Current SMP Memory Systems 53
Figure 4-3: Program Data Rate of Proposed Memory System 54
Figure 4-4: Effect of Invalidation on Cache Misses while Varying Block Size 56
Figure 4-5: Effect of Invalidation on Cache Misses while Varying Cache Size 57
Figure 4-6: Effect of Data Sharing on Bus Utilization 58
Figure 4-7: Performance Comparison when Programs Exhibit Fine Grain Sharing 61
Figure 4-8: Performance Difference when Programs Exhibit per-processor locality 62
List of Tables
Table 1: Range of sizes and access times in each level in the memory hierarchy 10
List of Equations
Equation 1: Mapping for Block in a Direct Mapped Cache 21
Equation 2: Mapping for Block in a Set Associative Cache 22
Equation 3: Execution Time for the Current Architecture 59




Bus,
A set of conductors connecting various functional units in a computer. A shared memory
bus specifically denotes a bus connecting the processors to the chipset
Branch Prediction,
A method to predict the destination of conditional branch instructions to reduce stalls in
the instruction pipeline
Cache,
A relatively small amount of high-speed memory that contains frequently used
instructions and data. It is intended to reduce the access times to the next higher level of
the memory hierarchy
Cache Coherence,
A problem which occurs in multiprocessor systems when multiple private caches have
different values of the same cache block
Cache Hit,
The data block requested by the processor exists in cache
Cache Miss,
The data word requested by the processor does not exist in cache. The entire cache block
containing the data word must be read from the next higher level of memory
Central Processing Unit (CPU),
Responsible for processing instructions in the computer system and managing cache
coherence in multiprocessor systems
Chipset,
Responsible for controlling all major functions in computer systems. The chipset
controls all access to memory and controls the system bus
Circuit Switched Bus,
A bus arbitration scheme which gives the bus master exclusive control over the bus until
its request has been filled
Consistency Model,
Specifies the order by which the events from one process should be observed by other
processes in the machine
Direct Mapped Cache,
A cache organization that allows a block to be placed in a specific location only inside
the cache
Dynamic Random Access Memory (DRAM),
A type of semiconductor memory in which the information is stored in capacitors on an
integrated circuit. Typically each bit is stored as an amount of electrical charge in a
storage cell consisting of a capacitor and a transistor.
Extended Data Out Dynamic Random Access Memory (EDO DRAM),
Allows the data outputs from memory to be kept active after the control signals have
gone inactive. This can be used in pipelined systems for overlapping accesses where the
next cycle is started before the data from the last cycle is removed from the bus.
First-In, First-out (FIFO) Replacement,
A cache block replacement strategy that removes the block that has been in the cache for
the longest period of time
Fully Associative Cache,
A cache organization that allows a block to be placed anywhere inside the cache
Least Recently Used (LRU) Replacement,
A cache block replacement strategy that removes the block which has not been used in
the longest period of time.
Level 2 (L2) Cache,
A second level of cache that exists between the processor and main memory
Massively Parallel Processor (MPP),
A computer system made from commodity processors that uses physically distributed
memory to achieve a high level of parallelism through a high bandwidth interconnect
Multiprocessor (MP),
See Symmetric Multiprocessing
Multiple Instruction Multiple Data (MIMD),
Each processor fetches its own instruction and operates on its own data
Multiple Instruction Single Data (MISD),
Each processor fetches its own instruction, but all processors operate on the same data
Pipelining,
An architectural enhancement where multiple instructions are overlapped in execution
Rambus DRAM (RDRAM),
Intended to replace SDRAM in future computer systems. It offers sustained transfer rates
of around 1000 Mbps, so faster buses can be implemented.
Random Access Memory (RAM),
A data storage device for which the order of access to different locations does not affect
the speed of access
Reduced Instruction Set Computer (RISC),
A processor whose design is based on the rapid execution of a sequence of simple
instructions rather than on the provision of a large variety of complex instructions
Scalability,
The measure of how system performance increases as system resources are increased
Set Associative Cache,
A cache organization which divides the entire cache into separate sets which can house a
specific set of blocks
Single Instruction Single Data (SISD),
See Uniprocessor
Single Instruction Multiple Data (SIMD),
The same instruction is executed by multiple processors using different data
Snoopy Bus,
A bus based protocol, commonly utilized in shared memory multiprocessor cache
coherence protocols
Split Transaction Bus (STP),
A bus arbitration scheme by which a master does not hold onto the bus if the slave device
cannot respond immediately. Instead control is given to another device which can use it
at that moment
Superscalar,
An architectural enhancement for microprocessors by which multiple instructions are
processed simultaneously using dynamic scheduling along with compiler optimizations
Symmetric Multiprocessing (SMP),
A system configuration in which multiple identical processors are connected together
via the same shared bus and have equal access to all resources
Synchronous Dynamic Random Access Memory (SDRAM),
A form of DRAM which adds a separate clock signal to the control signals. SDRAM
chips can contain more complex state machines, allowing them to support "burst" access
modes that clock out a series of successive bits
Uniprocessor,
A computer that has only one processor
Write Back,
A caching mechanism where cache blocks are written back to the next level in the
memory hierarchy only when needed
Write-Once Protocol,
A cache coherence mechanism which forces a cache block to be written back to the next
higher level of memory only after the first write by the processor
Write Invalidate,
A type of cache coherence protocol that invalidates all other copies of the cache block in
other processors' L2 caches
Write Through,
A caching mechanism by which cache blocks are written back to the next higher level of
memory after each write to the cache block
Write Update,
A type of cache coherence protocol that updates all other copies of the cache block in
other processors' L2 caches
Very Large Scale Integration (VLSI),
A term describing semiconductor integrated circuits composed of hundreds of thousands
of logic elements or memory cells.
1 Introduction
For the past 20 years the majority of improvements in computing have come from
more powerful processors. In today's information age computing power is being
challenged at all levels, from multimedia applications to the grand challenge problems.
In 1992 the President instituted the five-year federal High Performance Computing and
Communications Initiative. This has spurred the development of advanced processor
technology and was initially focused on the solution of the grand challenges shown in
Figure 1-1. These are fundamental problems in science and engineering, with broad
economic and scientific impact, whose solution could be advanced by applying high





























































[Figure 1-1 plot: x-axis Year; legend: Massively Parallel Microprocessor, Modestly Parallel Supercomputer, Sequential Supercomputer]
Figure 1-1: Grand Challenge Applications
What was once considered a supercomputer dedicated to solving particular
problems now fits in a tiny handheld device. Hence, the need for faster processors
will always exist.
The ability to produce faster processors has been possible due to advances in
VLSI (Very Large Scale Integration) technology and computer architecture over the past
20 years.
1.1 VLSI Advancements
In 1965 Gordon Moore observed that the number of transistors per square inch on
integrated circuits had doubled every year since the integrated circuit was invented.
Moore predicted that this trend would continue for the foreseeable future. In subsequent
years the pace slowed somewhat, and data density has doubled approximately every 18
months. To this point the prediction has held true, and in many cases the actual increase
has exceeded Moore's prediction. Figure 1-2 [INTEL] shows the transistor count for




























[Figure 1-2 plot: x-axis Year, 1995 to 2015, with projected values]
Figure 1-2: Chip Density for IntelMicroprocessors
At the current rate of growth, processors with 1 billion transistors should surface
around 2010. At this point, clock frequencies of processors will be around 10GHz as

















Figure 1-3: Projected CPU Frequency for next 15 years
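As a rough check on the 1 billion transistor projection, the doubling rule can be worked through numerically. The sketch below is illustrative only: the 1997 baseline of 7.5 million transistors is an assumption introduced here (roughly a Pentium II-class part), not a figure taken from this thesis, and `years_to_reach` is a hypothetical helper name.

```python
import math

def years_to_reach(target, baseline, doubling_period_years):
    """Years for a count to grow from baseline to target when it doubles
    once every doubling_period_years."""
    doublings = math.log2(target / baseline)
    return doublings * doubling_period_years

# Assumed baseline (not from the thesis): 7.5 million transistors in 1997.
BASELINE_YEAR = 1997
BASELINE_TRANSISTORS = 7.5e6
TARGET_TRANSISTORS = 1e9

for period in (1.5, 2.0):  # doubling every 18 months or every 24 months
    years = years_to_reach(TARGET_TRANSISTORS, BASELINE_TRANSISTORS, period)
    print(f"doubling every {period} years: about {BASELINE_YEAR + years:.0f}")
```

Under these assumptions the target is reached a little over ten years out with an 18-month doubling period and a little over fourteen years out with a 24-month period, which brackets the estimate of around 2010.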
This exponential increase has been accomplished by the incredible advancements
in VLSI technology. The feature size of modern processors has reached the 0.25 µm
mark and is dropping further. This smaller feature size has allowed designers to produce
smaller, cooler processors with a higher clock frequency.
However, it is believed that current lithography techniques for silicon will not be
applicable at feature sizes less than 0.1 µm. Even with a 0.1 µm feature size, 1 billion
transistors would occupy an enormous amount of space and consume a large amount of
power. It is predicted that within the next 5 years current silicon VLSI technology will
reach its limit. Once this point is reached an alternative to silicon, such as Gallium
Arsenide (GaAs), will be needed. However, this new technology would require
comprehensive modification of current VLSI technology and a complete retooling of
fabrication facilities, which would take an enormous amount of time and money. This




The second reason that microprocessors have been able to keep up with the desire for
more power is the advancement of computer architecture. Computer architecture
deals largely with the instruction set architecture and performance enhancement issues of
CPU design.
There have been a number of dramatic changes to CPU architecture since the first
IC-based CPU was created. The list below is by no means a complete list of
advancements in computer architecture, but it serves as a point of reference to the impact
that architectural advancements have made.
1.2.1 Pipelining
Pipelining has had the most dramatic impact on the performance of the CPU.
This architectural improvement is an implementation technique that exploits parallelism
among instructions in a sequential instruction stream. It has the substantial advantage
that, unlike some speedup techniques, it is not visible to the programmer. Most modern
processors use some type of linear synchronous pipeline with added features such as data
forwarding and branch prediction.
A linear pipelined processor is a cascade of processing stages that are linearly
connected to perform a fixed function over a stream of data flowing from one end to the
other. The intent is to be able to introduce a new instruction into the pipeline at every
clock cycle so that no stage in the pipeline is ever left idle. If this is accomplished then
the pipeline is considered full.
A linear pipelined processor is constructed with k processing stages. External
inputs (operands) are fed into the pipeline at the first stage, S_1. The processed results
are passed from stage S_i to stage S_{i+1}, for all i = 1, 2, ..., k-1. The
final result emerges at the last stage, S_k. Each result is passed to the next stage based
upon the clock cycle of the pipeline. Ideally, we expect the clock pulses to arrive at all
the stages at the same time. However, due to a problem known as clock skew, the
same clock pulse may arrive at different stages with a time offset. To allow for this, the
clock cycle of the pipeline must be at least the execution time of the longest stage of the
pipeline plus the maximum clock skew. The block diagram of a five stage pipeline is




























Figure 1-4: Five stage pipeline
Each stage of the pipeline performs one part of the processing of an instruction.
Therefore, up to five different instructions can be processed simultaneously.
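The timing rule described above can be sketched numerically. This is a minimal illustration, assuming made-up stage delays and skew (the thesis gives no numbers) and the standard idealized model in which a k-stage pipeline with no stalls completes n instructions in k + n - 1 cycles. All function names are hypothetical.

```python
def clock_period(stage_delays_ns, skew_ns):
    """Clock period must cover the slowest stage plus the worst-case skew."""
    return max(stage_delays_ns) + skew_ns

def pipelined_time(n_instructions, stage_delays_ns, skew_ns):
    """Ideal (stall-free) time: k cycles to fill, then one result per cycle."""
    k = len(stage_delays_ns)
    return (k + n_instructions - 1) * clock_period(stage_delays_ns, skew_ns)

def unpipelined_time(n_instructions, stage_delays_ns):
    """Each instruction passes through every stage before the next starts."""
    return n_instructions * sum(stage_delays_ns)

# Hypothetical delays for five stages, in nanoseconds.
stages = [2.0, 2.5, 3.0, 2.5, 2.0]
skew = 0.5
n = 1000
speedup = unpipelined_time(n, stages) / pipelined_time(n, stages, skew)
print(f"clock period: {clock_period(stages, skew)} ns, speedup: {speedup:.2f}")
```

Even in this stall-free model the speedup stays below the stage count of five, because the clock is set by the slowest stage plus skew rather than by the average stage delay.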
While pipelining has resulted in a tremendous increase in the throughput of a
CPU, it has a large amount of overhead associated with its implementation. In addition,
resource and data dependencies among the instructions being processed in the pipeline
prevent full utilization of the pipeline. This manifests itself in terms of pipeline stall
cycles. Therefore, pipelining complicates the traditional processor by introducing the
need for additional advanced architectural concepts such as data forwarding and branch
prediction.
1.2.2 Branch Prediction
Data dependencies and branch instructions limit the performance of pipelined
processors due to the additional logic needed to cope with them. Branch instructions are
very common in most programs. When a pipeline is full and a conditional branch is
encountered, the address of the next instruction to be processed is not known until the
preceding instruction has finished executing, since the processor condition codes will
not yet have been set. When a branch is
encountered, a branch prediction unit is responsible for picking the next instruction to be
fetched into the pipeline. There are many advanced algorithms for choosing this
instruction based on the past history of execution. This architectural enhancement is
essential to maintain the throughput of the pipeline at an acceptable level. Some branch
prediction schemes have been able to reach 95% accuracy. If the prediction is wrong,
the pipeline must be flushed and execution restarted from the last correct instruction
known. This is very costly and results in a large performance degradation.
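One simple history-based scheme is the two-bit saturating counter, sketched below. It is offered only as an illustration of predicting from past behavior, not as the predictor used by any processor discussed in this thesis. The class name, table size, branch address, and the test pattern are all assumptions.

```python
class TwoBitPredictor:
    """Per-branch two-bit saturating counters: 0-1 predict not taken, 2-3 taken."""

    def __init__(self, table_size=16):
        self.table_size = table_size
        self.counters = [1] * table_size  # start weakly not taken

    def _index(self, branch_address):
        return branch_address % self.table_size

    def predict(self, branch_address):
        return self.counters[self._index(branch_address)] >= 2

    def update(self, branch_address, taken):
        i = self._index(branch_address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

def accuracy(predictor, branch_address, outcomes):
    """Fraction of outcomes predicted correctly, updating after each branch."""
    correct = 0
    for taken in outcomes:
        if predictor.predict(branch_address) == taken:
            correct += 1
        predictor.update(branch_address, taken)
    return correct / len(outcomes)

# Hypothetical loop branch: taken 9 times, then falls through, repeated 10 times.
pattern = ([True] * 9 + [False]) * 10
print(f"accuracy: {accuracy(TwoBitPredictor(), 0x40, pattern):.2f}")
```

On this loop-like pattern the counter mispredicts the very first branch and then only the loop exit of each pass, which is why schemes of this kind do well on code dominated by loops.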
1.2.3 Superscalar Design
Superscalar designs incorporate additional functional units that are used to
process a number of instructions simultaneously. These processors are sometimes called
multiple-issue processors, since more than one instruction can be issued to functional
units in a single clock cycle. The processor issues a varying number of instructions per
clock, which may be statically scheduled by the compiler or dynamically scheduled by
the hardware. The instructions issued together must be independent and must satisfy
dependency constraints, which include resource, control and data
dependencies. If some instruction in the instruction stream is dependent or does not meet
the issue criteria, only the instructions preceding that one in the sequence will be issued,
hence the variability in issue rate. Most modern processors have a superscalar design,
with some being able to issue up to 6 instructions at once if the conditions are
appropriate.
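The variable issue rate described above can be illustrated with a small in-order issue model. This is a simplified sketch under stated assumptions: it checks only register data dependencies within one issue packet (an instruction may not read or write a register written earlier in the same packet) and ignores resource and control constraints. The `Instr` type, register names, and `issue_groups` are hypothetical.

```python
from collections import namedtuple

Instr = namedtuple("Instr", ["dest", "srcs"])

def issue_groups(program, width):
    """Group instructions into per-cycle issue packets, in program order.

    An instruction joins the current packet only if the packet has room and
    it neither reads nor writes a register written earlier in that packet.
    The first instruction that fails stops issue for the cycle.
    """
    groups, current, written = [], [], set()
    for ins in program:
        conflict = ins.dest in written or any(s in written for s in ins.srcs)
        if current and (len(current) == width or conflict):
            groups.append(current)
            current, written = [], set()
        current.append(ins)
        written.add(ins.dest)
    if current:
        groups.append(current)
    return groups

# Hypothetical four-instruction sequence; r3 depends on r1 and r2.
program = [
    Instr("r1", ("r8", "r9")),
    Instr("r2", ("r8", "r10")),
    Instr("r3", ("r1", "r2")),
    Instr("r4", ("r8", "r8")),
]
for cycle, group in enumerate(issue_groups(program, width=4)):
    print(cycle, [ins.dest for ins in group])
```

Although the issue width is four, the dependence of r3 on r1 and r2 ends the first packet after two instructions, so the sequence takes two issue cycles rather than one.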
1.2.4 Cache
Like pipelining, cache has an enormous impact on the
throughput of a CPU. In the earliest microprocessor days instructions and data were
stored in main memory, which is not located on the processor chip. Every access
therefore incurs the latency of the memory access time. When dealing with main memory,
this latency can lead to a lot of processor idle time, since each access requires a request
over the external memory bus.
To deal with this, a small, fast chunk of memory is placed close to the CPU to
hold data that might be needed again soon. This small amount of memory is
called cache and it works on the principle of locality.
Locality states that most programs do not access all code or data uniformly. This
principle, plus the guideline that smaller hardware is faster, led to the hierarchy based on
memories of different speeds and sizes. Since fast memory is expensive, a memory
hierarchy is organized into several levels, each smaller, faster and more expensive per
byte than the next level. The goal is to provide a memory system with cost almost as low
as the cheapest level of memory and speed almost as fast as the fastest level. The levels
of the hierarchy usually subset one another; all data in one level is also found in the level
below, and all data in that lower level is found in the one below it.
The memory hierarchy of a computer system starts at the processor level with its
internal registers. These are the fastest and easiest for the processor to access. Next in
line is the Level 1 (L1) cache, located on the same die as the processor. This level of cache
also has a very low latency since it operates at the same speed as the chip. The Level
2 cache, which is the next level in the hierarchy, can be in the same package as the CPU
or on the mainboard. In either case it has a higher latency since data must come through
the memory bus into the CPU. Main memory is the next level and is much larger (by
orders of magnitude) than L2 cache. Programs that are currently running are
stored in main memory and are accessed through the memory bus when they are needed.
Hard disk is generally considered the lowest level of the memory hierarchy. This
level is the largest and by far the slowest level since it involves an actual physical










Table 1 [HENPAT96] shows the range of sizes and access times of each level in
the memory hierarchy for machines ranging from low-end desktops to high-end servers.
Level       1          2            3            4
Called      Registers  Cache        Main Memory  Disk Storage
                                    CMOS DRAM    Magnetic disk
Backed by   Cache      Main Memory  Disk         Tape/CD
Table 1 : Range of sizes and access times in each level in the memory hierarchy
The need for cache is due to the fact that CPU performance has advanced faster
than memory performance. CPU performance improved 35% per year until 1986 and
55% per year since, while memory performance improved only 7% per year. With this in
mind, cache has proven to be a very effective way to improve overall system
performance.
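The effect of these differing growth rates can be seen by compounding them. The sketch below uses the 55% and 7% annual rates quoted above and assumes, purely for illustration, that both rates stay constant and that CPU and memory performance start out equal. `relative_gap` is a hypothetical helper name.

```python
def relative_gap(years, cpu_growth=0.55, memory_growth=0.07):
    """Ratio of CPU to memory performance after compounding for `years`,
    with both normalized to 1.0 at the start."""
    return ((1 + cpu_growth) / (1 + memory_growth)) ** years

for years in (1, 5, 10):
    print(f"after {years:2d} years the gap is about {relative_gap(years):.1f}x")
```

Because the ratio compounds, the gap widens by roughly 45% every year under these assumptions, which is why each level of cache becomes more valuable over time.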
Figure 1-6 shows the speed of the dot product of two vectors on the cache-based
RS/6000-980. For vector lengths greater than 2000 the cache cannot accommodate all















Figure 1-6: Cache Effect on a Systems Performance
While each of these enhancements individually can increase processor
performance, modern processors use them all to make an extremely advanced design.
These enhancements, coupled with VLSI technology advancements, have made it
possible for computing power to grow at an exponential rate.
1.3 Flynn's Classification of Computer Architectures
All uniprocessor systems follow the von Neumann model. The von Neumann
architecture is characterized by a CPU and central memory system, with instructions and
data being read from memory.
After an instruction has been read, it is decoded, any relevant operands are
fetched from memory, the instruction is executed, and the result is stored back in
memory. The single data path between the CPU and memory, over which both
instructions and data must pass, and the sequential nature of instruction execution
together limit the performance possible from the computer. This is sometimes known as
the von Neumann bottleneck. Uniprocessors mitigate it with pipelining and
superscalar design enhancements.
One form of classification for von Neumann machines is based on the number of
instructions that can be executed at any one time and on the number of pieces of data
that can be operated on at a time. In 1972 Michael Flynn introduced a classification of
various computer architectures based on notions of instruction and data streams. The
number of instructions is given as either SI for single instruction or MI for multiple
instruction, and the number of pieces of data is given as either SD for single data or MD
for multiple data. Machines can thus be classified as SISD, MISD, SIMD or MIMD.
1.3.1 SISD
The classical von Neumann machine can be regarded as a single-instruction-single-data
machine in that at any one time only a single instruction is being executed,
and only a single piece of data is being operated upon. This is where part of the problem
arises, since we often want to perform the same instruction on many different pieces of
data, and the von Neumann machine requires us to fetch the same instruction many times,
once for each piece of data. In fact the situation is worse, since a von Neumann
machine will usually require us to create a loop, and so we will need to execute several
instructions for each piece of data. This can make the machine many times slower than
the rate the arithmetic unit is capable of sustaining.
1.3.2 MISD
The multiple instruction single data (MISD) architecture is the least common
one. In this architecture, the same data stream flows through a linear array of processors
executing different instructions on the stream. This kind of architecture is also known as
a systolic array for pipelined execution of specific algorithms.
1.3.3 SIMD
For problems in which the same operation needs to be performed on many pieces
of data, particularly those involving vectors and arrays, SIMD (single-instruction
multiple-data) architectures are often capable of high speeds. A single CPU controls
many arithmetic units, each of which operates on its own data. Each arithmetic unit
executes the same instruction as determined by the CPU, but uses data found in its own
memory. Thus all the elements of two vectors could be added together simultaneously,
increasing the speed of the operation many times over a SISD machine.
In practice, the provision of many arithmetic units is expensive, particularly since
many of them will not be in use at any given time. Even if a large number of arithmetic
units are provided, the size of vectors and arrays will rarely be a multiple of the number
of arithmetic units and so some inefficiency in the use of the arithmetic units will arise.
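The inefficiency that arises when the vector length is not a multiple of the number of arithmetic units can be quantified with a short calculation. This is a sketch under the assumption that a vector is processed in passes, each pass occupying all units for one step whether or not every unit has an element to work on. The machine size of 64 units and the name `simd_utilization` are hypothetical.

```python
import math

def simd_utilization(vector_length, num_units):
    """Fraction of arithmetic-unit slots doing useful work when a vector is
    processed in passes of num_units elements each."""
    passes = math.ceil(vector_length / num_units)
    return vector_length / (passes * num_units)

# Hypothetical machine with 64 arithmetic units.
for n in (64, 100, 129):
    print(f"length {n}: utilization {simd_utilization(n, 64):.2f}")
```

A vector only one element longer than a multiple of the unit count forces an extra pass in which almost every unit is idle, so utilization can drop sharply for unlucky lengths.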
A more effective use of hardware can be obtained by pipelining the arithmetic
unit. A hardware floating point accelerator will already contain dedicated hardware for
each part of the calculation of a floating point operation. By pipelining the use of this
hardware, significant improvements can be made in processor performance. This
technique will not give as high a performance as a true SIMD machine, but the
improvements can be significant.
1.3.4 MIMD
The most general form of von Neumann architecture is the multiple-instruction-multiple-data
machine. A MIMD machine is usually a number of separate processors
connected together through some interconnection network. The actual form of
interconnection between the processors can vary, depending on the type of
problem which the machine is designed to solve. This is the most common architecture
chosen for multiple processor machines because modern processors have the control
logic for parallel systems built in. It is also attractive because software,
replacement parts and additions to the system are readily available.
1.4 The Quest for a Mainstream Supercomputer Architecture
As stated previously, CPU performance advancements have come from two main
areas, VLSI and computer architecture improvements. It seems certain that the
advancements in VLSI technologies are approaching their limits. Also, most architectural
enhancements have already been implemented in current designs. With this in mind, the
future of mainstream computing is in need of an alternative computer platform. This
alternative lies in parallel processing. Parallel processing involves utilizing more than one
CPU in a computer system, working cooperatively to achieve increased performance. As
shown in the previous section, there are four architectures that could be used to implement
parallel machines. It is believed that the appropriate choice for future machines will be of the
shared memory MIMD "tightly-coupled" variety. These systems will usually contain
between 2 and O(10) CPUs on a single system board with uniformly shared memory
between the processors and an interconnection network on the board. Boards of this
nature are referred to as "tightly-coupled" due to the fact that the processors lie close to




An SMP node contains several identical processors, each typically with its own
on-chip cache and a larger off-chip cache, which have uniform access to a shared
memory and other resources such as the network interface. Figure 1-7 shows a block
diagram of a symmetric multiprocessing system.
Figure 1-7: Block Diagram of an SMP System with 4 Processors
In this scenario there are four processors, each of which has its own local L2
cache outside of the CPU in addition to the L1 cache inside the processor. In SMP
systems each processor shares the same memory image. This means that if two different
processors accessed the same memory location, they would receive identical values.
Some important characteristics of SMPs include:
High-Speed Memory Bus - Since several processors need to get access to main
memory, a dedicated, high-throughput memory bus is required. Design of the memory
bus is critical in producing an efficient SMP architecture.
Separate Secondary Cache - Each processor in the system has its own secondary
(level 2) cache. The provision of separate caches for each processor requires complex
logic in the cache controller to make sure that a processor never works on stale data that
has been updated in another processor's cache. This problem is addressed through cache
coherence protocols that make sure the most recent value in a processor's cache
corresponds to the data in memory. The primary advantage of a dedicated-cache design
is the ability to increase the number of processors in a system without saturating the
memory bus. This approach seems to be the most popular for high-end multiprocessor
servers because it ensures optimum performance even when a system is scaled to its
maximum configuration. The size of the cache itself is also relevant to performance. As
a general rule, the larger the secondary cache, the better an SMP system will scale as
extra processors are added. Figure 1-8 shows the effect in TPS (transactions per second)

















[Plot not recoverable from the source; horizontal axis: Mix Name]
Figure 1-8: Cache effect on an SMP system
I/O to Memory Bus Bridge - In systems today the I/O bus interfaces with the
memory bus rather than directly with a CPU. This creates even more contention in SMP
systems, since every CPU must go through the bridge to access I/O resources. Therefore a
high-speed I/O to Memory Bus Bridge is required.
Multiprocessor systems will be the main thrust once VLSI and architectural
advancements have reached the end of their lifetime. These systems will be found in
homes and businesses alike.
The MIMD architecture seems to be the future of mainstream high performance
computing. However, it involves system design complexity. A major obstacle in these
systems is cache coherence. Since there are multiple processors working cooperatively
on a single task or on multiple tasks, data is constantly being shared between the
processors. When one processor makes a change to some piece of shared data (currently
stored in its local cache), the other processors need to know about it in case they need
to use the same piece of data. If other processors are not informed immediately, the value
of the data that they use may not be the most current. This problem is called cache
consistency. To alleviate this problem, current systems use a cache coherence protocol to
ensure that the data in a processor's local cache is always up to date. This extra
processing and memory bus access results in a large amount of overhead for SMP systems.
Unfortunately, no current implementation eliminates this overhead. Instead, current
systems deal with the problem in different ways, each of which incurs a large overhead
due to the coherence protocol used.
This thesis will present a new architecture for multiple processor systems, which
removes the cache coherence protocol required in shared memory MIMD architectures.
In chapter 2, we will investigate cache coherence and the various approaches to
cope with it. Chapter 3 then analyzes the current MIMD architectures versus the
proposed architecture. In chapter 4 we will compare the benefits gained from using
the new architecture along with the changes that need to be made for it to be
implemented feasibly. To this end, a data rate model of SMP systems is
employed to illustrate the performance of each of the architectures. Finally, chapter 5
will conclude the thesis and present directions for future work.
2 Cache Coherence
As discussed in section 1.2.4, L2 cache exists between the CPU and main memory
in a computer system. The purpose of L2 cache is to further reduce effective memory
access time by reducing the L1 cache miss penalty, since main memory is much
slower than the CPU's internal registers and L1 cache.
In uniprocessor systems, cache is easily implemented with very little added design
overhead and complexity. When the processor needs information that does not exist in
its internal registers or L1 cache, it checks for the data in the L2 cache. If the data does
not exist in L2 cache, then the data is read from main memory, which may in turn need to
go to a mass storage device to retrieve the information, generating a page fault. If a piece
of data is found in a cache it is considered a cache hit; otherwise it is a cache miss. The
cache miss ratio is simply one minus the cache hit ratio, which is the ratio of the number
of items found in cache to the number of items requested. Once the data is found in
memory it is transferred to L2 cache in the form of a cache block. When the processor is
finished using a piece of data, it is updated in L1 and L2 cache. Main memory updates
depend on the actual write policy used, either write-through or write-back, as discussed in
section 2.1.4. This method of program execution is the backbone of all uniprocessors
following the von Neumann model.
A sufficiently fast memory bus must be implemented between L2 cache and main
memory to meet the processor's demand for instructions and data.
2.1 Cache Basics
Before the cache coherence problem can be detailed it is important to have a good
understanding of how caches handle data. The memory hierarchy of computers breaks
information up into blocks of data. These blocks of data are moved in and out of cache
as needed. An entire block of data (which includes many memory locations) is moved at
a time, due to the principle of spatial locality. Spatial locality states that items whose
addresses are near one another tend to be referenced close together in time [HENPAT96].
Therefore, when a requested item is brought into the cache it is beneficial to bring in the
surrounding data also, since it will most likely be needed in the near future. The
design of cache subsystems involves four major issues that need to be addressed.
2.1.1 Cache Organization
The organization of a cache dictates where a block can be placed when it is
brought in from main memory. There are three cache organizations used today: direct
mapped, fully associative, and set associative. Figure 2-1 illustrates each of the three
organizations, which are described in the following sections.
[Figure panels, each showing a cache with blocks numbered 0 to 7]
Fully associative: block 12 can go anywhere
Direct mapped: block 12 can go only into block 4 (12 mod 8)
Set associative: block 12 can go anywhere in set 0 (12 mod 4)






Figure 2-1: Cache Organization Schemes
2.1.1.1 Direct Mapped Cache
In a direct mapped cache, each block has only one place where it can go into the
cache. The mapping for a block in a direct mapped cache is shown in Equation 1.
(block address) MOD (number of blocks in cache)
Equation 1: Mapping for Block in a Direct Mapped Cache
2.1.1.2 Fully Associative
In a fully associative cache, each block can appear anywhere in the cache. Fully




Finally, set associative caches limit the number of places in which a block can be placed.
Blocks in the cache are grouped into sets. Each block in memory is mapped
into a single set, generally using Equation 2.
(block address) MOD (number of sets in the cache)
Equation 2: Mapping for Block in a Set Associative Cache
Once a memory block has been assigned to a set it can be placed anywhere inside
the set.
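Equations 1 and 2 can be checked against the example in Figure 2-1 with a short sketch. The function name `candidate_frames` and the `ways` parameter (the number of blocks per set) are illustrative assumptions rather than notation from this thesis. A direct mapped cache is treated as one block per set and a fully associative cache as a single set holding every block.

```python
def candidate_frames(block_address, num_blocks, ways):
    """Return the cache block numbers in which a memory block may be placed.

    ways is the number of blocks per set (hypothetical parameter name):
    ways == 1 models a direct mapped cache, ways == num_blocks models a
    fully associative cache, and anything in between is set associative.
    """
    num_sets = num_blocks // ways
    set_index = block_address % num_sets          # Equation 2
    first = set_index * ways
    return list(range(first, first + ways))


# Example from Figure 2-1: memory block 12, cache with 8 blocks.
print(candidate_frames(12, 8, ways=1))   # direct mapped
print(candidate_frames(12, 8, ways=2))   # set associative, 4 sets
print(candidate_frames(12, 8, ways=8))   # fully associative
```

With one block per set the set index coincides with Equation 1, so the same function covers all three organizations.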
2.1.2 Cache Block Lookup
Now that block placement has been described, the process of reading from a cache is
detailed next. Each block in a cache has an address tag associated with it. When a
processor wishes to retrieve data from the cache, it uses the block's address tag to
reference the data. The tag of each block (the actual number of blocks checked depends
upon the organization of the cache) is compared against the tag




Figure 2-2: A Typical Cache Block Address
The index portion of the address is used to select the set in the cache, while the
block offset is used to select the actual piece of data in the cache block. The tag portion
of the address requested by the processor is compared to the tag stored with each
candidate block.
A fully associative cache would have no index field since a block is not restricted
to any single set. In a set associative cache the index field is used to select the set
that contains the data, while in a direct mapped cache the index field selects the
actual block containing the data.
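The tag, index, and block offset fields described above can be extracted with a few shifts and masks. The following is a minimal sketch that assumes byte addressing and that both the block size and the number of sets are powers of two. The function name and the example sizes are hypothetical.

```python
def split_address(address, block_size, num_sets):
    """Split an address into (tag, index, offset).

    Assumes block_size and num_sets are powers of two. A fully associative
    cache is modelled with num_sets == 1, which leaves no index bits.
    """
    offset_bits = block_size.bit_length() - 1
    index_bits = num_sets.bit_length() - 1
    offset = address & (block_size - 1)
    index = (address >> offset_bits) & (num_sets - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, index, offset


# Hypothetical cache: 16-byte blocks and 64 sets.
print(split_address(0x1234, block_size=16, num_sets=64))
# Fully associative: the whole block address becomes the tag.
print(split_address(0x1234, block_size=16, num_sets=1))
```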
2.1.3 Replacement Strategy
The replacement strategy dictates which block is replaced when a cache miss
occurs. A cache miss occurs when the tag requested by the CPU is not found in the
cache. When a miss occurs, a block in the cache must be replaced with a block from the
next level of the memory hierarchy, the level farther from the processor. The block to be
replaced is selected by the cache controller. In a direct mapped cache
there is no need for a replacement strategy since there is only one location that a block is
capable of going into. There are many replacement strategies that are used in cache
controllers. Three replacement strategies are Least-Recently Used, Random, and First-In,
First-Out (FIFO).
2.1.3.1 Least Recently Used (LRU)
The LRU replacement strategy records all accesses to cache blocks. When a
cache miss occurs the cache block that is replaced is the one that has gone unused for the
longest amount of time. This follows along the same lines as temporal locality.
Temporal locality states that a cache block that has been recently used is likely to be used
again in the near future. LRU replacement can become extremely expensive, especially
in large caches, since all accesses need to be recorded internally in the cache.
2.1.3.2 Random
The simplest strategy to employ is a random replacement strategy that is spread
uniformly across the cache. When a cache miss occurs a random block number is
selected and the selected block is replaced. Studies have found that while the random
replacement strategy may not be the most intuitive strategy its results are quite
impressive. The attractiveness in a random replacement strategy is in the ease of
implementation.
2.1.3.3 First-In, First-Out (FIFO)
FIFO replacement strategy replaces the cache block that has been in cache for the
longest period of time. This strategy has proven to yield worse results than the LRU and
random replacement strategies.
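The difference between the LRU and FIFO strategies can be illustrated with a small simulation. This is a sketch under stated assumptions: it models a single fully associative cache, the function name `count_misses` is hypothetical, and the reference trace is made up for illustration. It shows one trace only and is not evidence about the relative performance of the strategies in general. Random replacement is omitted so that the result stays deterministic.

```python
from collections import OrderedDict


def count_misses(trace, capacity, policy):
    """Count misses for a fully associative cache holding `capacity` blocks.

    policy is "LRU" or "FIFO". The OrderedDict keeps the next victim at the
    front: the least recently used block for LRU, the oldest for FIFO.
    """
    cache = OrderedDict()
    misses = 0
    for block in trace:
        if block in cache:
            if policy == "LRU":
                cache.move_to_end(block)   # a hit refreshes recency
            continue
        misses += 1
        if len(cache) == capacity:
            cache.popitem(last=False)      # evict the block at the front
        cache[block] = True
    return misses


trace = [1, 2, 3, 1, 4, 1, 5]              # hypothetical block references
print(count_misses(trace, 3, "LRU"))
print(count_misses(trace, 3, "FIFO"))
```

On this trace FIFO evicts block 1 just before it is referenced again, while LRU keeps it because it was used recently.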
2.1.4 Write Policy
The final aspect of cache design is the write policy. When a data value has been
modified in the processor registers, it is immediately written back to L1 cache and L2
cache. Updating the data in main memory depends on the particular write policy being
employed in the system. There are two write policies that are used in cache design today.
2.1.4.1 Write-Through
In a write-through cache, when data is written to L2 cache it is also written to
main memory simultaneously. Therefore, main memory always contains an exact copy
of the data that is in the L2 cache of the processor. Write-through caches place a large
amount of overhead on the memory bus, since it is not always necessary to have an
updated copy of the data in main memory. For this reason, write-through caches are not
widely used. However, write-through caches are extremely easy to implement since all
writes are sent to L2 and main memory at the same time and no additional logic is needed
in the cache.
2.1.4.2 Write-Back
In write-back caches, data is written back only to the L2 cache. When the cache
becomes full or the cache block is being replaced, the data is then updated in main
memory. Therefore, all writing to main memory from L2 cache is done when a cache
miss is encountered. The advantage of the write-back cache is that all writes from the
processor occur locally at the speed of the cache memory (much faster than main
memory). Since main memory is only updated when the cache block is replaced, the
memory bandwidth requirements of a write-back policy are much more lenient than those
of the write-through policy. This frees up bandwidth for other devices in the system,
most importantly, other processors in multiprocessor systems.
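The bandwidth difference between the two write policies can be made concrete by counting main memory writes. The sketch below is a simplified model under stated assumptions: the event format and the function name are hypothetical, capacity is not modelled, and replacement is supplied explicitly as an "evict" event.

```python
def count_memory_writes(events, policy):
    """Count writes to main memory for a sequence of (operation, block) events.

    operation is "write" (the processor writes a block in the cache) or
    "evict" (the block is replaced). policy is "write-through" or "write-back".
    """
    dirty = set()
    memory_writes = 0
    for operation, block in events:
        if operation == "write":
            if policy == "write-through":
                memory_writes += 1         # every write also goes to memory
            else:
                dirty.add(block)           # memory is updated later
        elif operation == "evict" and block in dirty:
            dirty.discard(block)
            memory_writes += 1             # write-back on replacement
    return memory_writes


# Hypothetical trace: three writes to block 7, one to block 9, then both replaced.
events = [("write", 7)] * 3 + [("write", 9), ("evict", 7), ("evict", 9)]
print(count_memory_writes(events, "write-through"))
print(count_memory_writes(events, "write-back"))
```

Repeated writes to the same block cost one memory write each under write-through, but only a single memory write per replaced block under write-back.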
2.2 Multiprocessor Cache Coherence
In a multiprocessor system, data inconsistency can occur between adjacent levels of
memory or within the same level. It is possible for the data in main
memory to differ from its most recent value, since the most recent value may be stored
only in a processor's local cache (depending on the write policy, it may also be in main
memory). In addition, L2 caches of other processors may contain even older data values
of the same memory location. This is not possible in uniprocessor systems since there is
only one processor in the system that will ever modify the data. There are two possible
ways inconsistent data can appear in a cache: data sharing and process migration.
2.2.1 Data Sharing
Since data in a multiprocessor system is commonly shared between many processes
executing on different processors, it is possible for the private caches of the processors
to contain different copies of the same shared data. Figure 2-3 [HWANG93] shows how
inconsistency can occur when dealing with shared data.
[Figure panels: Before update, Write-through, Write-back; components shown: Processors, Bus]
Figure 2-3: Data Inconsistency due to Data Sharing
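The situation in Figure 2-3 can be reproduced with a few lines of code. This is a minimal sketch, assuming two processors that have both cached a shared variable X with initial value 0 and no coherence mechanism of any kind. The function name and the values are illustrative.

```python
def share_and_update(policy):
    """Return (P1 cache, P2 cache, main memory) values of X after P1 writes X = 1.

    policy is "write-through" or "write-back". No coherence protocol is
    modelled, so P2's cached copy is never refreshed.
    """
    memory = {"X": 0}
    cache_p1 = {"X": memory["X"]}      # both processors read X
    cache_p2 = {"X": memory["X"]}
    cache_p1["X"] = 1                  # P1 updates its cached copy
    if policy == "write-through":
        memory["X"] = 1                # memory is updated immediately
    return cache_p1["X"], cache_p2["X"], memory["X"]


print(share_and_update("write-through"))
print(share_and_update("write-back"))
```

Under write-through only the second cache is stale; under write-back main memory is stale as well until the block is written back.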
2.2.2 Process Migration
In multiprocessor systems, processes frequently migrate from one processor to
another. Unfortunately, shared data that is residing in a processor's local cache does not
migrate with the process. Therefore, it is possible for a processor to update a shared data
value in its L2 cache, get interrupted, and hand the process over to the OS, which in turn
hands it over to another processor whose cache does not hold the updated value. Process
migration is a common occurrence and a frequent source of data inconsistencies. Figure 2-4 [HWANG93] visually



















[Figure panels: Before migration, Write-through, Write-back]
Figure 2-4: Data Inconsistency due to Process Migration
2.3 Ways to Handle Cache Coherence
Multiprocessor systems handle inconsistent data in memory in widely varying
ways. In massively parallel processors (MPPs) a directory-based scheme is used,
while in SMP systems snoopy bus protocols are used.
2.3.1 Directory Based Protocols
Directory based coherence protocols are commonly used in large scale,
distributed memory systems where a fast interconnection network exists between each of
the nodes. In a distributed directory scheme each memory unit has a directory structure
which lists the caches that currently hold copies of its memory blocks. When
a read miss occurs in a cache, a request message is sent to the memory unit that holds
the cache block. The memory unit then updates its value from the cache
with the most current copy and sends a copy to the requesting cache. Central directory
based schemes have a main directory which contains all the information relating to a
memory block's location in a processor's cache. Lookups are done using this central
directory only. Contention on the central directory has limited adoption of this scheme
in actual systems.
2.3.2 Snoopy Bus Protocols
SMP systems use shared memory connected to a high-speed memory bus. These
systems also follow the von Neumann model of execution, allowing all the CPUs to
access the main memory asynchronously with respect to each other. In parallel
applications, data sharing between processes running on different CPUs requires
advanced features to keep data coherent.
Since caches are used in these designs, there will be data consistency problems in
main memory, since the most recent data values would be stored in caches. To this end,
SMP systems employ a cache coherence mechanism to ensure that the value a processor
is reading from its cache is the most current one. Cache coherence requires both
hardware and software support to achieve acceptable levels of performance.
The hardware support for cache coherence comes in the form of a snoopy bus,
operating under a snoopy-bus protocol. When multiple private caches are tied together
on a single bus, the methods used to ensure consistency entail changes to the write
policies. The snoopy bus allows all processors in the system to monitor the traffic on the
memory bus. Each processor can then take appropriate action depending upon the
write policy. Figure 2-5 shows an SMP system with a shared memory variable loaded into






Figure 2-5: Initial State of the Memory System
In the next sections we will see how the caches and memory are modified to cope
with cache coherence.
2.3.2.1 Write-Invalidate Policy
In a write-invalidate policy, when a processor writes a value to a cache block in its
private cache it also sends an invalidation signal to all other caches which contain the











Figure 2-6: State of the System after a Write-Invalidation Operation
Figure 2-6 [HWANG93] shows the state of the system after a write-invalidate
operation by Pi. Since Pi has a write-through cache, main memory contains the
updated value that it received over the snoopy bus from Pi.
2.3.2.2 Write-Update Policy
In a write-update policy, when a processor writes a value to a cache block in its
private cache it also updates other caches (if they contain the cache block) and main
memory with the new value. The update is done using the features of the snoopy bus,
which allows other processors to monitor bus activity. Figure 2-7 [HWANG93] shows
the state of an SMP system after the write-update operation, in which all caches now











Figure 2-7: State of the Memory System after a Write-Update Operation
The write-update policy is extremely effective at ensuring data consistency.
However, it places an unnecessarily large amount of traffic on the memory bus since not
all processors may need the updated value.
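The two snoopy policies can be contrasted with a small model. This is a sketch under stated assumptions: the class `SnoopyBus` and its methods are hypothetical names, each private cache is a dictionary, and the caches are write-through so that main memory is updated on every write. Timing and bus arbitration are not modelled.

```python
class SnoopyBus:
    """Shared memory plus one private cache (a dict) per processor."""

    def __init__(self, num_cpus, policy):
        if policy not in ("invalidate", "update"):
            raise ValueError("policy must be 'invalidate' or 'update'")
        self.policy = policy
        self.memory = {}
        self.caches = [{} for _ in range(num_cpus)]

    def read(self, cpu, address):
        cache = self.caches[cpu]
        if address not in cache:                       # miss: fetch from memory
            cache[address] = self.memory.get(address, 0)
        return cache[address]

    def write(self, cpu, address, value):
        self.caches[cpu][address] = value
        self.memory[address] = value                   # write-through assumption
        for other, cache in enumerate(self.caches):    # every cache snoops the bus
            if other == cpu or address not in cache:
                continue
            if self.policy == "invalidate":
                del cache[address]                     # drop the stale copy
            else:
                cache[address] = value                 # refresh the copy in place


for policy in ("invalidate", "update"):
    bus = SnoopyBus(3, policy)
    for cpu in range(3):
        bus.read(cpu, "X")                             # all three caches load X
    bus.write(0, "X", 5)
    print(policy, bus.caches)
```

After the write, the invalidate policy leaves X only in the writer's cache, so the other processors miss and re-read it on their next access. The update policy leaves a fresh copy in every cache whether or not it is needed again.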
2.3.2.3 Write-Once Protocol
In 1983, James Goodman proposed the write-once cache coherence protocol for bus-based
multiprocessors. In order to reduce unnecessary bus traffic, the very first write of a cache
block uses a write-through policy. This will keep main memory consistent with the local
cache after the first write. After the first write, memory is updated using the write-back
policy [GOOD83]. Figure 2-8 [HWANG93] details Goodman's protocol, which uses 4







Figure 2-8: State Diagram for the Write-Once Protocol
Each transaction in the figure represents extra overhead that is placed on the
memory bus. This traffic reduces the bus bandwidth available to other
processors, which in turn degrades the overall performance of the machine.
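The effect of the write-once idea on bus traffic can be sketched by following successive write hits to a single block. The state names Valid, Reserved, and Dirty follow the commonly cited description of Goodman's protocol and are an assumption here, since the state list is not legible in this copy. The table covers local write hits only; read misses, write misses, and snooped transactions are left out.

```python
# (next state, bus write needed) for a write hit in each state.
# State names are assumed from the usual textbook description of write-once.
WRITE_HIT = {
    "Valid": ("Reserved", True),     # first write: written through to memory
    "Reserved": ("Dirty", False),    # second write: kept local (write-back)
    "Dirty": ("Dirty", False),       # later writes: kept local
}


def write_repeatedly(times, state="Valid"):
    """Return (final state, number of bus writes) after `times` write hits."""
    bus_writes = 0
    for _ in range(times):
        state, uses_bus = WRITE_HIT[state]
        bus_writes += int(uses_bus)
    return state, bus_writes


print(write_repeatedly(1))
print(write_repeatedly(4))
```

Only the first write to a block appears on the bus, whereas a pure write-through cache would place every one of the writes on the bus.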
2.4 Consistency Models
Parallel applications executing on a parallel machine require data to be used in
multiple CPUs. Because of this it is very important to make other processors aware
of any changes to data of which they may also hold a copy. Consistency models specify
the order in which the events from one process should be observed by other processes in
the machine [HWANG93]. The two main consistency models are sequential and
weakened consistency.
2.4.1 Sequential Consistency
Sequential consistency is when "the result of any execution [of the program] is the
same as if the operations of all processors were executed in some sequential order, and




[LAMP79]. Since data is loaded and stored identically to uniprocessor







Figure 2-9: The Sequential Consistency Model
Figure 2-9 [HWANG93] shows how the sequential consistency model can be
described. Each processor is connected to memory through the same switch ensuring that
no processor can update main memory out of order.
The single ported memory ensures that there is only one memory access operation
in progress at any one time. Therefore some queuing mechanism is needed to order and
serialize the memory references while they wait to be serviced.
In 1992 Sindhu, Frailong, and Cekleov specified that for sequential consistency to exist
the following five axioms must be true [SINDHU92]:
1) A load by a processor always returns the value written by the latest
store to the same location by other processors.
2) The memory order conforms to a total binary order in which
shared memory is accessed in real time over all loads and stores
with respect to all processor pairs and location pairs.
3) If two operations appear in a particular program order, then they
appear in the same memory order.
4) The swap operation is atomic with respect to other stores, meaning
that no other store can intervene between the load and store parts
of a swap.
5) All stores and swaps must eventually terminate.
The sequential consistency model is enforced in hardware, on the fly. In this
model, all memory accesses are atomic and tightly ordered to ensure the accuracy of the
model. Therefore, all memory accesses must be global and a processor cannot issue
another memory access until the most recent shared memory access by a processor has
been performed globally. These mechanisms ensure that the correct program order is
observed.
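What sequential consistency permits can be seen by enumerating every interleaving of a small two-processor program. The sketch below uses a classic example that is not taken from this thesis: P0 stores to x and then loads y, while P1 stores to y and then loads x. All names are illustrative, and memory is modelled as a dictionary accessed one operation at a time, as with the single-ported memory described above.

```python
from itertools import permutations

# Each operation is (kind, location, value written or destination register).
P0 = [("store", "x", 1), ("load", "y", "r0")]
P1 = [("store", "y", 1), ("load", "x", "r1")]


def sc_outcomes(programs):
    """Return every (r0, r1) result reachable under sequential consistency."""
    ops = [(p, i) for p, prog in enumerate(programs) for i in range(len(prog))]
    outcomes = set()
    for order in permutations(ops):
        # Discard interleavings that violate a processor's program order.
        if any(order.index((p, i)) > order.index((p, i + 1))
               for p, prog in enumerate(programs)
               for i in range(len(prog) - 1)):
            continue
        memory = {"x": 0, "y": 0}
        registers = {}
        for p, i in order:
            kind, location, arg = programs[p][i]
            if kind == "store":
                memory[location] = arg
            else:
                registers[arg] = memory[location]
        outcomes.add((registers["r0"], registers["r1"]))
    return outcomes


print(sorted(sc_outcomes([P0, P1])))
```

The result (0, 0) never appears: at least one of the stores must precede both loads in any order that respects program order.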
2.4.2 Weak Consistency
The sequential consistency model demands the most memory bandwidth and
additional support (both hardware and software) to ensure its accuracy. To reduce the
amount of bandwidth and extra work needed, various degrees of weaker consistency
models have been created. The TSO weak consistency model was developed by Sun
Microsystems' SPARC architecture group. Sindhu, Frailong, and Cekleov have specified
the TSO weak consistency model with the following six behavioral axioms [SINDHU92].
1) A load access is always returned with the latest store to the same
memory location issued by any processor in the system.
2) The memory order is a total binary relation over all pairs of store
operations.
3) If two stores appear in a particular program order, then they must
also appear in the same memory order.
4) If a memory operation follows a load in program order, then it
must also follow the load in memory order.
5) A swap operation is atomic with respect to other stores. No other
stores can interleave between the load and store parts of a swap.




















Figure 2-10: The TSO Weak Consistency Model
Figure 2-10 [HWANG93] shows the TSO consistency model, which is a
relaxed version of the sequential consistency model. The weak consistency model was
created to offer better performance in multiprocessors. Many times it is not necessary to
enforce a fully sequential program order. However, the weak consistency model requires
more hardware and software support. This makes it a more expensive option than the
sequential consistency model. Various degrees of the weak consistency model exist
which allow for stricter event ordering enforcement without making it entirely sequential.
While all systems ensure that the data is consistent it is important to investigate
how much work is being done to create a coherent system.
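The relaxation that TSO introduces can be illustrated with a simplified model of a store buffer. The axioms above order stores with respect to each other and order operations that follow a load, but they do not require a store to reach memory before a later load by the same processor. In the sketch below, which uses hypothetical names throughout, P0 stores 1 to x and then loads y, and P1 stores 1 to y and then loads x. A store becomes visible only when its "commit" event occurs. The flag decides whether that commit may fall after the processor's own later load.

```python
from itertools import permutations


def outcomes(allow_store_load_reordering):
    """Return every (r0, r1) pair, where r0 is P0's load of y and r1 is P1's load of x."""
    events = ["commit_x", "load_y", "commit_y", "load_x"]
    results = set()
    for order in permutations(events):
        if not allow_store_load_reordering:
            # Sequentially consistent baseline: each store commits before
            # the same processor's following load.
            if (order.index("commit_x") > order.index("load_y")
                    or order.index("commit_y") > order.index("load_x")):
                continue
        memory = {"x": 0, "y": 0}
        loaded = {}
        for event in order:
            kind, location = event.split("_")
            if kind == "commit":
                memory[location] = 1
            else:
                loaded[location] = memory[location]
        results.add((loaded["y"], loaded["x"]))
    return results


print(sorted(outcomes(allow_store_load_reordering=False)))
print(sorted(outcomes(allow_store_load_reordering=True)))
```

With the relaxation enabled the result (0, 0) becomes possible, because both loads can read memory while both stores are still waiting in their buffers. Store forwarding from a processor's own buffer is not modelled, which is safe here only because neither processor loads the location it stores to.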
3 Design of MP Architectures
Multiprocessor architectures require a large number of design considerations to be
addressed in order to be efficient. Low-cost, high performance microprocessors have led to
the demand for small to medium-scale shared memory multiprocessors utilizing a shared
bus interconnect [CHIS092]. These systems have become popular for two reasons:
1.) The shared bus interconnect is easy to implement.
2.) The shared bus interconnect allows an easy solution to the cache
coherence problem described in Section 2.2.
However, shared bus multiprocessor systems require a memory system capable of
supplying the necessary bandwidth to keep the processors busy. This task alone is
difficult to implement due to line transmission problems, memory access speeds, board
layout constraints, cost and other factors. Moreover, cache coherence introduces an
additional level of overhead onto the shared memory interconnect model. This overhead
is essential to ensure correct program execution and system dependability. The
current model of a multiprocessor system will be addressed next to identify the design
requirements needed when creating a shared memory multiprocessor.
3.1 Current Multiprocessor Implementation
The most important aspect of any multiprocessor is its memory system. Figure
3-1 [PEI100] shows the memory system of the Intel Pentium II dual processor system.
The following components are used in the figure:
82443BX - The System Chipset
CK100 - Clock Synthesizer
Pentium II - Microprocessor
Memory - Main Memory for the System



























Figure 3-1: The Intel Dual Pentium II Processor Memory System
The L2 cache is not shown explicitly in the above figure; however, it resides inside
the packaging of the Pentium II processor. From Figure 3-1, it can be seen that the
major factors that affect the memory system are the chipset, memory type, system
bus speed, bus arbitration scheme, bus utilization, and invalidation overhead. It is
important to note that the invalidation overhead results in an enormous amount of traffic
that is placed on the system bus due to the cache coherence protocol.
3.1.1 The Chipset
The chipset is the most important component in a computer system. It is
responsible for controlling the system bus, communicating between memory and the
processors, and controlling the communication from the processor to all other devices in
the system. Therefore, the chipset must be able to respond to multiple requests in the
correct order dictated by the consistency model. All references to memory must pass
through the chipset before continuing on to main memory, so proper memory timing is
essential.
The bus arbitrator is an integral part of the chipset that is responsible for granting
bus ownership to the different devices connected to the bus. In Figure 3-1 the bus
arbitrator logic is contained inside the chipset, and bus ownership is granted to one of the
processors through the control lines of the bus. The actual arbitration scheme chosen
depends on the characteristics of the particular system bus being implemented. Each time
a processor wishes to place a transaction on the memory bus, it must first send a request
to the bus arbitration logic in the chipset. The bus arbitrator then determines when
the processor is allowed to control the system bus. Buses fall into two categories with
respect to ownership: circuit switched buses and split transaction buses. In circuit
switched buses, once the processor is given control of the bus, it retains ownership until
its transaction is complete.
3.1.1.1 Circuit Switched Buses
In a circuit switched bus, the bus master is granted exclusive use of the bus until
the entire transaction is complete. Therefore, the total time that the bus is held by the
master includes the latency of the slave device (i.e. memory or an invalidation broadcast).









Figure 3-2: A Circuit Switched Bus
Circuit switched buses are used in most existing bus designs because they are easy to
implement and result in lower latency.
3.1.1.2 Split Transaction Bus
A split transaction bus, or STP bus, does not grant ownership of the
bus to the bus master for the entire duration of the transaction if the slave device is not
able to respond to the request immediately. In this case, the bus is released by the
master and made available to other processors in the system. When the slave device is
able to respond to the request, it obtains ownership of the bus and transfers data to the
requesting device. STP buses can deliver improved performance in shared memory
systems [CHIS092]; however, they incur a higher implementation cost. A bus master












Figure 3-3: A Split Transaction Bus
An STP bus handles bus contention much better than a circuit switched bus, since
devices are not allowed to hold the bus if a request cannot be filled immediately.
Therefore, the effective bandwidth of an STP bus is higher than that of the circuit
switched bus.
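The claim about effective bandwidth can be illustrated with a back-of-the-envelope occupancy model. This is a sketch under loudly stated assumptions: the cycle counts are hypothetical, every transaction has the same request, latency, and transfer phases, and the additional arbitration a split transaction bus needs for the response phase is ignored.

```python
def bus_busy_cycles(transactions, request, latency, transfer, split):
    """Bus cycles during which the bus is held for a batch of transactions.

    request, latency, and transfer are cycle counts per transaction. On a
    circuit switched bus the master holds the bus through the slave's
    latency; on a split transaction bus the bus is released during it.
    """
    if split:
        per_transaction = request + transfer
    else:
        per_transaction = request + latency + transfer
    return transactions * per_transaction


# Hypothetical workload: 10 reads, 1 request cycle, 4 latency cycles, 4 transfer cycles.
print(bus_busy_cycles(10, 1, 4, 4, split=False))
print(bus_busy_cycles(10, 1, 4, 4, split=True))
```

In this model the cycles saved are exactly the slave latency of each transaction, which is the time a split transaction bus makes available to other processors.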
3.1.2 Memory Type
Memory type has an enormous impact on the memory system performance.
Commercial multiprocessors employ two major types of memory, EDO and SDRAM.
EDO memory generally has a 60 ns access time and is optimized for system buses running
at 60-66 MHz. SDRAM was introduced to take advantage of faster memory buses capable of
operating at 100 MHz. This allows for better utilization of the shared bus, so that the
processors can be fed data at a faster rate. Memory type also directly affects cost per
megabyte. In addition, the chipset has to be capable of dealing with each type of memory
that can be placed in the system. SDRAM is currently the preferred memory type in SMP
systems. Future SMP systems will support faster memories such as RDRAM [CR1SP97].
3.1.2.1 Reading a Cache Block From Memory
When data is needed from main memory, a penalty is associated with the read.
The penalty derives from the following 3 components: the amount of time to gain
ownership of the bus, the amount of time to request data from main memory, and the
amount of time that memory takes to fill the request. For current multiprocessor
architectures, a processor speed in the range of 400-600 MHz is reasonable, with the L2
cache running at the same speed as the processor. In addition, a system bus clock of
100 MHz will be assumed. Due to the 100 MHz system bus speed, SDRAM will also be
assumed.
For a processor to read a value from memory, the approximate latency incurred is
directly related to the cache line fill rate. The cache line fill rate represents the
amount of time needed to fill each word in the cache line. Typical fill rates in systems
today average 2-1-1-1 when using SDRAM. This means that it takes 2 memory
cycles to fill the first word and 1 cycle to fill each additional word. For most caches a
block size of four words is ideal. Therefore, it would take 5 memory cycles to fill the
cache line. Since the system bus operates 4 times slower than the processor (100 MHz as
compared to 400 MHz), it would take a minimum of 20 processor cycles to complete
the read request, ignoring bus arbitration and any bus contention overhead. During this
time the processor sits idle, waiting for its request to be filled. This is an enormous
amount of time for the processor to be in a wait state. Therefore, it is important to reduce
the amount of processor to memory communication, which would also reduce the amount
of bus contention. Superscalar microprocessor designs have attempted to hide the latency
incurred when reading from main memory by executing more than one instruction per
clock cycle. The best way to reduce bus utilization is to reduce the
miss rates in the cache, since each cache miss results in a read from memory.
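The arithmetic behind the 20-cycle figure can be written out as a small helper. The function name is hypothetical, and the calculation ignores bus arbitration and contention, as the text does. It also assumes the processor clock is a whole multiple of the bus clock.

```python
def line_fill_processor_cycles(fill_pattern, cpu_mhz, bus_mhz):
    """Processor cycles spent waiting for one cache line fill.

    fill_pattern lists the memory cycles needed for each word, for example
    (2, 1, 1, 1). The processor to bus clock ratio converts memory cycles
    into processor cycles.
    """
    memory_cycles = sum(fill_pattern)
    return memory_cycles * cpu_mhz // bus_mhz


# 2-1-1-1 fill on a 100 MHz bus, at the low and high ends of the assumed CPU range.
print(line_fill_processor_cycles((2, 1, 1, 1), cpu_mhz=400, bus_mhz=100))
print(line_fill_processor_cycles((2, 1, 1, 1), cpu_mhz=600, bus_mhz=100))
```

At the upper end of the assumed processor range the same five memory cycles cost 30 processor cycles, so the penalty grows with the gap between processor and bus clocks.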
In actuality, current multiprocessor configurations such as the one in Figure 3-1
place an enormous amount of extra communication on the bus. This extra bus traffic is
due to the required communication in the cache coherence protocol.
3.1.3 The MESI Cache Coherence Protocol
The cache coherence protocol used in Figure 3-1 is the Modified-Exclusive-Shared-Invalid
(MESI) protocol [ANDSHA95]. MESI is a write-invalidate snoopy
protocol used in many current systems. The protocol is capable of supporting both
write-back and write-through L2 caches. Figure 3-4 [HWAXU98] shows the state transition
diagram of the MESI protocol.
[Figure 3-4 legend, partially recoverable] Events: Local Read; INVD: Invalidate w/o WB;
WBINVD: WB & Invalidate; v: Logical OR; ^: Logical AND
Figure 3-4: The MESI Write-Invalidate Protocol with Write-Back
Most coherence protocols, including MESI, incorporate a write-invalidate
strategy with write-back L2 caches since it places less overhead on the system bus.
The MESI protocol establishes a state machine for each cache block, driven by the reads
and writes that each processor performs on the blocks in its L2 cache. The state
transitions in the MESI protocol result in extra
traffic placed on the bus so that each cache block can keep track of its current state and
adjust accordingly. To verify each cache block's state, bus snooping logic in the CPU
must constantly monitor all transactions that occur on the system bus. When a CPU
places a transaction on the system bus, all other CPUs must check to ensure that the
transaction does not affect any data that is currently stored in their local L2 caches. If
the transaction does affect the monitoring processor's L2 cache, then a state change must
be carried out in its L2 cache. This snooping activity is constantly active in each processor.
In a typical case, when a processor needs a cache line that is not currently in its L2
cache, the processor reads the line from main memory, incurring a latency of more than 20
cycles as described in section 3.1.2.1. Once the processor has the cache block in
its L2 cache, the block enters the exclusive state of the protocol, meaning that this
processor alone has a valid copy of the cache block. In this state the processor may read
or write the cache line at will (after the first write, the block moves into the modified
state). However, if another processor requests the cache line from memory, the processor
that currently holds the line in the modified state must notice the request on the bus and
supply the valid copy of the cache block, simultaneously moving the cache
block in its L2 cache into the shared state. The shared state signifies that more than one
processor currently has a valid copy of the cache block in its L2 cache. Now, if either
processor would like to modify this cache block, it must write back the data to main
memory first, which requires placing a transaction on the system bus. Note also that the
system bus runs four times slower than the processor. The other
processor notices the request on the bus and changes its cache block's state to
invalid, so it must read the most current value from main memory before it can modify the
cache block. This read involves another latency of more than 20 processor cycles. Also, if the
processor that currently has the valid block writes to it again, main memory is not
updated after this write. If another processor wishes to read the most current value, the
processor with the valid data must notice the request and place the data on the
system bus, updating main memory once again.
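The sequence of transitions described above can be traced with a small simulation. The following Python sketch is a simplified model under stated assumptions: it tracks a single cache block, follows the textbook MESI transitions (a write to a shared block is modeled as one invalidation broadcast), and counts every snooped request as one bus transaction. The names `State` and `MesiBlock` are illustrative and do not come from the thesis.

```python
from enum import Enum

class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"

class MesiBlock:
    """Tracks one cache block across n_cpus private L2 caches (illustrative model)."""

    def __init__(self, n_cpus):
        self.states = [State.INVALID] * n_cpus
        self.bus_transactions = 0  # every snooped request counts as one

    def read(self, cpu):
        if self.states[cpu] is not State.INVALID:
            return  # hit: no bus traffic
        self.bus_transactions += 1  # read miss goes on the bus
        others = [i for i, s in enumerate(self.states)
                  if i != cpu and s is not State.INVALID]
        for i in others:
            # A Modified owner supplies the data; all holders drop to Shared.
            self.states[i] = State.SHARED
        self.states[cpu] = State.SHARED if others else State.EXCLUSIVE

    def write(self, cpu):
        s = self.states[cpu]
        if s in (State.MODIFIED, State.EXCLUSIVE):
            self.states[cpu] = State.MODIFIED  # silent upgrade, no bus traffic
            return
        self.bus_transactions += 1  # invalidate (from S) or read-for-ownership (from I)
        for i in range(len(self.states)):
            if i != cpu:
                self.states[i] = State.INVALID
        self.states[cpu] = State.MODIFIED

block = MesiBlock(n_cpus=2)
block.read(0)    # CPU0 miss -> Exclusive
block.write(0)   # Exclusive -> Modified, no bus traffic
block.read(1)    # CPU1 miss; CPU0 supplies data, both Shared
block.write(1)   # CPU1 invalidates CPU0's copy
block.read(0)    # CPU0 misses again because of the invalidation
print([s.value for s in block.states], block.bus_transactions)
```

In this trace, four of the five accesses place a transaction on the bus, which illustrates how sharing a single block between two processors generates coherence traffic.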
Each state transition in the MESI protocol can be interpreted as extra transactions
that need to be placed on the system bus, resulting in increased bus utilization, contention
and wait states. Moreover, it is extremely common for a processor to read a cache block
from another processor and write to it, only for a separate CPU that needs the cache block
to place a request on the bus immediately afterward. The first processor must then
read the value again before its next write. It is striking that far more
research is devoted to coping with cache coherence than to removing cache coherence.
3.2 Proposed Multiprocessor Architecture
In this thesis, we propose a new multiprocessor architecture that will remove the
need for a cache coherence protocol, ultimately increasing the performance of the system.
Moreover, it will be shown that the changes required to implement the proposed system
are minimal. The new architecture will have far reaching abilities well into new
generations of multiprocessor systems, due to the increased demand for multiprocessor
systems.































Figure 3-5: Modified Architecture to Support a L2* Cache
In addition to the system board changes, Figure 3-6 shows the changes that need
to be made to the processor packaging to incorporate the modifications needed for the














Figure 3-6: Modification to the CPU Packaging
The proposed architecture shown in Figure 3-5 adds an additional cache and bus
to the system. These additions will allow us to remove the cache coherence protocol and
any logic associated with its implementation. The reason that the cache coherence
protocol is not needed is that the proposed architecture inherently ensures consistency at
all levels of memory. For this to occur, a distinction in the type of data being used in the
system is necessary. This distinction already exists in all multiprocessor systems;
however, it is not efficiently utilized. When a parallel application is compiled, data that
is shared between more than one process is marked as shared. This is a normal
operation that is part of any parallel program compilation, and is needed to guarantee
proper program execution in all current multiprocessor systems. Therefore, data that is
shared can be treated separately, since it is always identifiable to the CPU. In the
proposed architecture, all shared data is stored in a separate L2 cache, marked as L2*,
which is not part of the current processor packaging.
When a piece of shared data is needed, the processor looks for the cache block in
the shared L2* cache instead of its local L2 cache. If the cache block does not exist in
the shared L2*, it is read from main memory following the Von Neumann model of
program execution. However, if the cache block is found in the L2* cache, the processor
is given exclusive access to only the word needed in the cache block. In this
configuration the processor is assumed to be running at 400MHz, the system bus is
operating at 100MHz, the L2 cache also has a clock rate of 400MHz, and the shared L2*
cache is running at 1/2 the processor clock rate, or 200MHz. These values represent
typical configurations in commercial multiprocessor systems.
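The clock rates above translate directly into access costs measured in processor cycles. The following Python sketch shows the arithmetic, assuming (as a simplification that is not stated in the text) that one access occupies exactly one cycle of the device being accessed. The function name is illustrative.

```python
def processor_cycles_per_access(cpu_mhz, device_mhz):
    """Processor cycles that elapse during one cycle of a slower device."""
    return cpu_mhz / device_mhz

CPU_MHZ = 400  # processor clock assumed in the text

for name, mhz in [("L2", 400), ("L2*", 200), ("system bus", 100)]:
    print(name, processor_cycles_per_access(CPU_MHZ, mhz))
```

Under this assumption a local L2 access costs one processor cycle, an L2* access costs two, and each system bus cycle costs four.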
The major changes needed to the current multiprocessor configuration are the





changes to the control logic of the processor.
3.2.1 Cache Arbitration Unit (CAU)
The CAU is responsible for identifying whether a request from the processor is intended
for the shared L2* cache. As part of the control lines of the bus between the processor
and the caches, a single line would indicate which cache the read request is intended for.
If the request was intended for the local L2 cache, the CAU would have no effect.
However, if the request was intended for the shared L2* cache, the CAU would act
as a switch, enabling the processor to read from the shared L2* cache. The CAU block
would be a very small piece of logic placed on the same package as the
processor and local L2 cache. The extra delay incurred from the CAU would be
minimal since it requires the check of a single bit. A similar check would also




3.2.2 Shared L2* Cache
The shared L2* cache would be similar to the other caches in the system, with
some minor requirements to ensure atomicity. First and foremost are four "in-use" bits
for each cache block. It is assumed that a cache block is four words wide to ensure
compatibility with current multiprocessor implementations. These bits would allow only
the actual data word that is needed at that time to be locked. Therefore, if another processor
needs a different word in the same block, it is allowed to access it without having to




bit is set) no other processor is allowed to read or modify the
data. A simple queuing mechanism can be incorporated to grant outstanding requests in
the proper order. This queue would also handle cases where two processors request the
same cache block in the shared L2* cache, and it does not currently exist in the cache.
Only one processor (the first to request) should initiate the request to main memory for
the cache block. This logic would be implemented in the cache controller located inside
the cache. The cache controller would also act as the bus arbitrator for this additional
shared bus.
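The word-level locking and queuing described above can be sketched as follows. This Python model is only an illustration under assumptions: the class and method names are hypothetical, one FIFO queue is kept per word, and releasing a word hands it directly to the oldest waiting processor. The thesis specifies only that a simple queuing mechanism grants outstanding requests in the proper order.

```python
from collections import deque

WORDS_PER_BLOCK = 4  # block width assumed in the text

class SharedBlock:
    """One L2* cache block with a per-word in-use bit and a FIFO of waiters."""

    def __init__(self):
        self.in_use = [False] * WORDS_PER_BLOCK
        self.waiting = [deque() for _ in range(WORDS_PER_BLOCK)]

    def acquire(self, cpu, word):
        """Return True if cpu gets the word now, else queue it and return False."""
        if self.in_use[word]:
            self.waiting[word].append(cpu)
            return False
        self.in_use[word] = True
        return True

    def release(self, word):
        """Hand the word to the oldest waiter, or clear the bit; return new owner or None."""
        if self.waiting[word]:
            return self.waiting[word].popleft()  # bit stays set for the new owner
        self.in_use[word] = False
        return None

blk = SharedBlock()
print(blk.acquire(cpu=0, word=2))  # word 2 is free, so CPU0 locks it
print(blk.acquire(cpu=1, word=2))  # CPU1 must wait for word 2
print(blk.acquire(cpu=1, word=3))  # a different word in the same block is free
print(blk.release(2))              # CPU0 finishes; the word passes to CPU1
```

Because the lock granularity is a word rather than a block, the third request succeeds even though another word of the same block is in use.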
A random replacement strategy could be used due to its inexpensive
implementation cost and small computational needs. However, a cache block that has at
least one "in-use" bit set cannot be replaced.
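A minimal sketch of this replacement rule is given below, assuming the in-use bits of every block are visible to the cache controller as a list of four booleans. The function name `choose_victim` and the data layout are hypothetical.

```python
import random

def choose_victim(in_use_bits, rng=random):
    """Pick a random block index whose in-use bits are all clear, or None.

    in_use_bits holds one list of per-word in-use flags for each cache block.
    """
    candidates = [i for i, bits in enumerate(in_use_bits) if not any(bits)]
    return rng.choice(candidates) if candidates else None

blocks = [
    [False, False, False, False],  # block 0: replaceable
    [True, False, False, False],   # block 1: one word locked, must stay
    [False, False, False, False],  # block 2: replaceable
]
print(choose_victim(blocks))
```

When every block has at least one bit set the function returns None, in which case the controller would have to delay the fill until a word is released.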
Since the cache is running at 1/2 the speed of the processor, larger (compared to
the local L2) caches could be used without a major increase in cost. Therefore, a read
from the shared L2* cache would take 2 processor cycles, ignoring the CAU delay. A
shared L2* cache size of 2-4MB is sufficient to supply adequate performance. The L2*
cache would be physically located on the system board.
3.2.3 Shared L2* Bus
The shared L2* bus would lie on the system board itself and would interface with the
processor via pins on the packaging of the CPU board. This would result in only a minor
increase in price, since only a packaging change would be needed. Current slot
packaging contains a two-sided connection; an additional connection could be added to
support the additional bus, or a larger slot could be used. This bus would be running at
1/2 the processor clock speed since access to the bus would require an off-package




One advantage in the proposed architecture is that it would result in only minor
changes to the processor design. Once these changes have been incorporated, no
additional logic would be needed in the future. The most important change in the processor
would be the addition of a control line for cache selection. This control line would allow
the CAU to determine if a request to cache is intended for the shared L2* cache or its
local L2 cache. In all commercial processors, extra reserved pins are already
incorporated on the processor die. These reserved pins are set aside for any additional
future changes that the CPU vendor may need to make in the design.
Before a processor requests data from either L2 cache, it would first check the
type of data being requested. This information would be available to the processor since
parallel compilers tag data as being shared or private, as discussed in section 3.2. A
shared piece of data would require the processor to set this control line to an enabled state
to access the shared L2* cache.
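The role of the cache-select control line can be illustrated with a short sketch. The Python model below is an assumption-laden illustration: `shared_tags` is a hypothetical set of addresses that the compiler has marked as shared, and the hit latencies are the one-cycle local L2 and two-cycle L2* lookups used elsewhere in this thesis.

```python
L2_HIT_CYCLES = 1       # local L2 runs at the processor clock
L2_STAR_HIT_CYCLES = 2  # shared L2* runs at 1/2 the processor clock

def cache_select(is_shared):
    """Value of the cache-select control line: 1 selects L2*, 0 selects local L2."""
    return 1 if is_shared else 0

def route_request(address, shared_tags):
    """Return (target cache, hit latency in processor cycles) for one request.

    shared_tags is a hypothetical set of addresses the compiler marked shared.
    """
    if cache_select(address in shared_tags):
        return "L2*", L2_STAR_HIT_CYCLES
    return "L2", L2_HIT_CYCLES

shared_tags = {0x1000, 0x1004}
print(route_request(0x1000, shared_tags))  # shared data goes to L2*
print(route_request(0x2000, shared_tags))  # private data stays in local L2
```

The decision depends on a single bit, which is why the delay added by the CAU is expected to be small.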
The proposed architecture would result in a substantial increase in system
performance due to the removal of the cache coherence protocol. Also, since there is no
cache coherence protocol, the main system bus can be more efficiently utilized.




There is no doubt that the cache coherence protocol is a limiting factor in the
performance of multiprocessor systems. However, it is important to understand the
extent of overhead that the protocol requires. This chapter will detail the degradation of
systems due to the cache coherence protocol. Also, the current and proposed
architectures will be modeled and analyzed. Further, it will be evident that the
architecture proposed in section 3.2 would offer a tremendous increase in system
performance, and scalability.
4.1 Performance Analysis Methods
To design a multiprocessor memory system with an adequate performance, the
designer must have access to sophisticated performance analysis tools. Analytical
models are computationally much cheaper than trace-driven simulation and allow a much
larger design space to be explored. However, they are generally considered to be less
accurate than trace-driven simulation. Hence, for evaluating design choices in
multiprocessor systems, the ideal tool would be a trace-driven simulation based upon
traces generated from actual execution of sample benchmark programs. Unfortunately,
trace-driven simulation is expensive, both in execution time and storage requirements (to
store the traces). Also, trace driven simulation requires specialized software tools not
always available or configured for the system being considered. The storage expense of
trace driven simulation can be reduced by parameterized trace driven simulation. In
parameterized simulation, artificial traces are generated on the fly using probability
distributions that have the same characteristics as the actual program traces. Therefore,
many separate traces would not be needed before simulation. Parameterized simulation
is still computationally expensive and is generally not considered to be as accurate as
actual trace driven simulation. Also, an analytical performance model can be developed
for the system that would allow for a mathematical investigation to be conducted.
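To make the idea of parameterized simulation concrete, the following Python sketch generates an artificial reference trace on the fly from two probabilities. It is only an illustration: the function name, the trace format, and the parameter values in the usage line are assumptions, and a real parameterized simulator would fit its distributions to measured program traces.

```python
import random

def synthetic_trace(n, p_read, p_shared, seed=0):
    """Generate n artificial references as (operation, data kind) pairs.

    p_read is the probability that a reference is a read, and p_shared is
    the probability that it touches shared data. Both are hypothetical
    parameters standing in for distributions measured from real traces.
    """
    rng = random.Random(seed)  # seeded so that runs are repeatable
    trace = []
    for _ in range(n):
        op = "R" if rng.random() < p_read else "W"
        kind = "shared" if rng.random() < p_shared else "private"
        trace.append((op, kind))
    return trace

trace = synthetic_trace(n=10, p_read=0.7, p_shared=0.2)
print(trace)
```

Since the trace is produced as needed, nothing has to be stored before the simulation starts, which is the storage saving mentioned above.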
For this thesis a data rate model of program execution will be used. This approach was
chosen since no known simulator lets the user specify an entirely new memory system.
The model analyzes the rate of bus requests of programs
executing on multiprocessor systems to interpret the loads being placed on the memory
bus, local caches, and memory.
4.2 Data Rate Analysis
Programs executing on multiprocessor systems exhibit a particular rate of data
transfer between memory and processors. This can be coupled with the effects that are
incurred due to the coherence protocol to compare relative performance gains of memory
systems.
It is important to first closely study both architectures to identify how their
features will affect their system performance. It is obvious that no coherence protocol is
needed for the proposed architecture in Figure 3-5. How this change will manifest itself
in terms of access speeds and cache hit/miss rates is the key to the data rate model. Since
the processor will never receive invalid data from the L2* cache, cache miss rates will
decrease dramatically in both the L2* and local L2 cache. Also, the proposed
architecture places much less contention on the memory bus, since fewer requests to
memory will be needed.
The performance of a memory system can be approximated by the rate that each
processor requests data. For a specific compilation of a program there will be a
corresponding data rate exhibited by the system. This data rate will be dependent on
cache hit/miss rates, memory latency, and the amount of data sharing, all of which affect
the overall execution of the program. The default data rate will be shown first and then
values can be assessed based upon the architecture being examined.
The data rate that a program exhibits on the system bus can be broken up into three
categories: reads, writes, and bus contention. Furthermore, reads
can be broken down further into reads of shared and non-shared data, and a similar
breakdown can be used for writes. Bus contention will be ignored since it will be
factored into penalties for a cache miss. The breakdown of the data rate for a typical
program can be seen in Figure 4-1.
|             Read              |             Write             |
|  Non-Shared   |    Shared     |  Non-Shared   |    Shared     |
|  Hit  | Miss  |  Hit  | Miss  |  Hit  | Miss  |  Hit  | Miss  |
Figure 4-1: Data Rate of a Typical Program
Reads constitute a much larger portion of bus activity than writes, since all
instructions must be read from main memory before being processed. By calculating
hit/miss penalties, the performance of the memory system can be evaluated.
4.2.1 Data Rate Analysis for Current SMP Memory Systems
Figure 4-2 gives the memory reference penalties for each access in the system.
|             Read              |              Write              |
|  Non-Shared   |    Shared     |  Non-Shared   |     Shared      |
|  Hit  | Miss  |  Hit  | Miss  |  Hit  | Miss  |  Hit   |  Miss  |
|  L2   |  MM   |  L2   |  MM   |  L2   |  MM   | L2+Inv | MM+Inv |
Figure 4-2: Program Data Rate of Current SMP Memory Systems
Non-shared data is always obtained from either L2 or main memory. This follows
the execution model discussed in section 1.2.4. The penalty associated with a lookup in
the L2 cache is 1 cycle since the L2 cache is running at the same speed as the processor. The
penalty associated with a read from main memory is >20 cycles, which was calculated in
section 3.1.2.1. Non-shared data incurs no overhead from invalidation, except for any
contention on the bus, which can be ignored for this study. While bus contention
will be ignored, it offers another reason to abandon current SMP configurations. The
values for non-shared data are identical to the ones that a study of uniprocessors would
yield if bus contention were ignored.
The penalties associated with shared data are also shown in Figure 4-2. The
memory access time for a shared write is the time to fetch the data from its
associated memory source plus additional cycles for the invalidation overhead (increased
bus utilization and snoopy protocol overhead), which is assumed to be 1 cycle. Also, the
system bus is running at 1/2 the processor clock speed. The invalidation overhead is only
present in writes, since reads never modify a value. Therefore, the access time for a
write to a piece of shared data in L2 cache would be 2 processor cycles. A read from
main memory would still take >20 cycles.
4.2.2 Data Rate Analysis of the Proposed Architecture
The data rate for programs executing on the proposed memory system does not incur
any extra overhead except for an additional cycle in the L2* lookup time. The
data rate of the proposed architecture is shown in Figure 4-3.
Read Write
Non Shared Shared Non Shared Shared






Figure 4-3: Program Data Rate of Proposed Memory System
The change in the data rate that will affect the performance of the system
concerns shared data only. Non-shared data will have no effect on the data rate of the
program. The time to read a value from main memory is >20 processor cycles, since the
system bus speed is the same in both architectures. However, the time to read a value
from the shared L2* cache is 2 processor cycles since the L2* bus is running at 1/2 the
processor clock speed, or 200MHz for this thesis. In the next section we will investigate
the overhead incurred in typical write-invalidate systems. This overhead will result in
different cache hit/miss rates for the two architectures.
4.2.3 Invalidation Overhead
In write-invalidate protocols, there are two sources of bus-related coherency
overhead. The first is the invalidation signal needed to maintain coherent caches. The
second is the additional cache misses that occur when processors need to reference
invalidated data. These misses, due to invalidated data, would not have occurred had
there been no sharing. They exist because the shared data had previously been written,
and therefore invalidated, by another processor.
The pattern of references to shared data can be characterized by two distinct
modes of behavior. In the first case, per-processor locality within the cache block, a
particular processor makes multiple, consecutive writes to words within a block,
uninterrupted by accesses from other processors. In the other case, fine-grain sharing,
processors contend for one or more words within the block, and the number of
per-processor consecutive writes is very low.
Whether a program exhibits per-processor locality or fine-grain sharing affects the
amount of coherency overhead incurred. Per-processor locality reduces the coherency
overhead by decreasing both the number of invalidations and the number of invalidation
misses. In fine-grain sharing, the number of invalidations and cache misses is higher.
The greater the number of processors contending for an address, the greater the number
of cache misses.
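The difference between the two modes can be seen by counting invalidations for two write sequences to a single block. The Python sketch below is a deliberately simplified model: it assumes that every write by a processor other than the most recent writer invalidates exactly one copy, and it ignores reads. The function name and the two example sequences are illustrative.

```python
def count_invalidations(writers):
    """Count invalidations for a sequence of writes to one cache block.

    writers lists the CPU id issuing each consecutive write. Under
    write-invalidate, a write by a CPU that is not the current owner
    invalidates the previous owner's copy.
    """
    owner = None
    invalidations = 0
    for cpu in writers:
        if owner is not None and cpu != owner:
            invalidations += 1
        owner = cpu
    return invalidations

locality = [0] * 4 + [1] * 4  # long write runs by each processor
fine_grain = [0, 1] * 4       # processors alternate on every write

print(count_invalidations(locality))
print(count_invalidations(fine_grain))
```

Both sequences contain eight writes, but the alternating pattern invalidates a copy on almost every write, while the pattern with per-processor locality does so only once.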
The invalidation overhead also affects bus utilization, specifically the effective
bus bandwidth. The bandwidth of the bus in a shared memory multiprocessor is the
most crucial bottleneck in the system. Few processors can be attached to the bus if
caching is not used. For multiprocessor systems the most important consideration in a
memory and cache system should be how the cache organization affects bus utilization.
When caches have a high miss ratio the bandwidth becomes even more important, since
writes will involve additional invalidation overhead.
If a parallel program exhibits fine-grain sharing then the effective bus bandwidth
is reduced dramatically. This limits the scalability of the system, since the bus will be
saturated with a small number of processors.
Figure 4-4 [EGGK89] shows the increased cache miss rate from TOPOT
[DEVA87], compared to uniprocessors, while varying cache block sizes. TOPOT does
topological compaction of Metal Oxide Semiconductor (MOS) circuits, using dynamic







[Plot of Figure 4-4. Horizontal axis: Block Size (bytes). Legend includes a UNIPROCESSOR curve.]
Figure 4-4: Effect of Invalidation on Cache Misses while Varying Block Size
TOPOT has an extremely high number of invalidations because it exhibits fine-grained
sharing. Almost all of the cache misses are attributed to
invalidations. Figure 4-5 [EGGK89] shows the increases in cache misses of TOPOT,




















[Plot of Figure 4-5. Horizontal axis: Cache Size (Kbytes).]
Figure 4-5 : Effect of Invalidation on Cache Misses while Varying Cache Size
Once again invalidations are responsible for almost all cache misses that occur.
Cache size and block size are the two most optimized features in SMP memory systems.
Varying either one has only a minor effect in either figure. A typical cache size for a
multiprocessor is in the range of 512Kbytes, which results in high miss rates for TOPOT.
These values will result in an extremely inefficient multiprocessor system for this
application.
Bus utilization also varies drastically with the amount of sharing.
Programs that exhibit per-processor locality do not reduce the bandwidth as much as
fine-grained sharing programs. However, bus utilization does limit the scalability of the
multiprocessor in any parallel program. Figure 4-6 [EGGK89] shows the bus utilization























Figure 4-6: Effect of Data Sharing on Bus Utilization
Due to the amount of sharing in this application the total number of bus cycles is
extremely high. While the amount of sharing varies greatly depending on compiler
optimizations, the figures offer a visual depiction of the impact that the cache coherence
problem has on current SMP systems.
4.2.4 Comparison of Data Rate Models
The data rate models created in sections 4.2.1 and 4.2.2 can be compared by
varying the type (shared vs. non-shared) and rate of data (cache miss rates). The
following list of variables will be used for the data rate analysis model.
I = Total Number of Bus Transactions
R = Total Number of Bus Transactions that are Reads
W = Total Number of Bus Transactions that are Writes
P_SR = Percentage of Reads that are for Shared Data
P_NR = Percentage of Reads that are for Non-Shared Data
P_SW = Percentage of Writes that are to Shared Data
P_NW = Percentage of Writes that are to Non-Shared Data
E = Total Number of Execution Units
M_SR = Miss Rate for Shared Reads
M_NR = Miss Rate for Non-Shared Reads
M_SW = Miss Rate for Shared Writes
M_NW = Miss Rate for Non-Shared Writes
Reads are responsible for approximately 70% of the traffic on the bus due to
instructions and requests to read cache blocks due to invalidations. Therefore, a 70/30
ratio is used for reads/writes. The majority of P_NR is due to instruction reads when the
instruction does not reside in the level 2 cache. P_SW will impact the system the most since
it is directly related to the cache miss ratio in the current system. However, this value
will also affect P_SR since most invalidation requests result in shared read misses by other
processors. In addition, the non-shared percentages, P_NR and P_NW, will affect their cache miss
rates due to false sharing inside of cache blocks.
Equation 3 gives the total number of execution units that it takes for a process to
complete on the current system.
\[
E = R\bigl(P_{SR}(1 + M_{SR} \cdot 20) + P_{NR}(1 + M_{NR} \cdot 20)\bigr) + W\bigl(P_{SW}(2 + M_{SW} \cdot 20) + P_{NW}(1 + M_{NW} \cdot 20)\bigr)
\]
Equation 3: Execution Time for the Current Architecture
Equation 3 takes into account the invalidation overhead that will have a direct
effect on bus transactions. One factor that is not taken entirely into account is bus
contention since it is heavily dependent on the actual application. However, the bus
contention would be greater in the current architecture since invalidations result in
additional bus accesses. By varying the percentages of shared data and miss rates
accordingly an approximate execution time for the system can be determined.
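As a worked illustration, Equation 3 can be evaluated directly. The Python sketch below makes two assumptions that are not stated in the text: percentages are expressed as fractions between 0 and 1, and the non-shared percentages are the complements of the shared ones. The function name and the sample values in the usage lines are hypothetical.

```python
def execution_units_current(R, W, p_sr, p_sw, m_sr, m_nr, m_sw, m_nw, mem=20):
    """Evaluate Equation 3 for the current architecture.

    R and W are the read and write counts, p_sr and p_sw are the shared
    fractions of reads and writes, and the m_* arguments are miss rates.
    mem is the main-memory penalty in processor cycles.
    """
    p_nr = 1.0 - p_sr  # assumption: non-shared fraction is the complement
    p_nw = 1.0 - p_sw
    reads = R * (p_sr * (1 + m_sr * mem) + p_nr * (1 + m_nr * mem))
    writes = W * (p_sw * (2 + m_sw * mem) + p_nw * (1 + m_nw * mem))
    return reads + writes

# 70/30 read/write split, half of all references shared, no misses at all
print(execution_units_current(R=70, W=30, p_sr=0.5, p_sw=0.5,
                              m_sr=0.0, m_nr=0.0, m_sw=0.0, m_nw=0.0))
```

With no misses the only cost above one unit per reference is the extra invalidation cycle on shared writes, so raising the shared write fraction or the miss rates shows how quickly the total grows.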
For the proposed architecture, Equation 4 gives the total execution time for a
process. Note that the equation accounts for the off-package request latency since the
shared cache is running at 1/2 the processor speed. This means that L2* lookups would
take 2 processor cycles, while L2 lookups take 1 processor cycle.
\[
E = R\bigl(P_{SR}(2 + M_{SR} \cdot 20) + P_{NR}(1 + M_{NR} \cdot 20)\bigr) + W\bigl(P_{SW}(2 + M_{SW} \cdot 20) + P_{NW}(1 + M_{NW} \cdot 20)\bigr)
\]
Equation 4: Execution Time for the Proposed Architecture
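The two execution-time equations can be compared numerically. The following Python sketch is a simplified illustration rather than the model used to produce the figures in this chapter: it uses a single shared fraction for both reads and writes, a single shared miss rate and a single non-shared miss rate, and miss-rate values that are purely hypothetical. The only structural difference between the two functions is the shared read hit time (1 cycle for the current architecture, 2 cycles for the proposed one).

```python
MEM = 20  # main-memory penalty in processor cycles, as used in the equations

def units(R, W, p_s, m_s, m_n, shared_read_hit, shared_write_hit=2):
    """Execution units for one architecture under the simplifications above."""
    p_n = 1.0 - p_s  # assumption: non-shared fraction is the complement
    reads = R * (p_s * (shared_read_hit + m_s * MEM) + p_n * (1 + m_n * MEM))
    writes = W * (p_s * (shared_write_hit + m_s * MEM) + p_n * (1 + m_n * MEM))
    return reads + writes

def current(R, W, p_s, m_s, m_n):
    """Current architecture: shared read hits cost 1 cycle."""
    return units(R, W, p_s, m_s, m_n, shared_read_hit=1)

def proposed(R, W, p_s, m_s, m_n):
    """Proposed architecture: shared read hits cost 2 cycles (L2* lookup)."""
    return units(R, W, p_s, m_s, m_n, shared_read_hit=2)

# Hypothetical miss rates: invalidations inflate shared misses in the
# current system, while the proposed system avoids invalidation misses.
for p_s in (0.0, 0.25, 0.5, 0.75, 1.0):
    cur = current(70, 30, p_s, m_s=0.30, m_n=0.02)
    new = proposed(70, 30, p_s, m_s=0.05, m_n=0.02)
    print(p_s, round(cur, 1), round(new, 1))
```

With no shared data the two results coincide, and the gap in favor of the proposed architecture widens as the shared fraction grows, provided its shared miss rate really is lower. If the miss rates were equal, the extra L2* lookup cycle would make the proposed architecture slightly slower.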
The inherent difference in the proposed architecture is that as the degree of
sharing increases, the system will exhibit better scalability characteristics. This will
allow more processors to be connected to the shared bus. Also, as the degree of sharing
increases, the number of transactions placed on the bus will be reduced since miss rates in
the proposed architecture will be much lower due to its guarantee of consistency. The
following figure shows the resulting graph when the percentage of shared data is varied

































Figure 4-7: Performance Comparison when Programs Exhibit Fine Grain Sharing
Figure 4-7 shows the performance difference of the architectures with a program
that exhibits fine grain sharing. Notice that when there is very little sharing the proposed
architecture does not offer any performance gain or loss. This is because the L2*
cache is not being utilized and all requests from the processor are satisfied
in the local L2 cache. However, as the amount of shared data increases, the invalidation
requests begin to overwhelm the bus and the cache miss rates increase in the current
system. For programs exhibiting per-processor locality, Figure 4-8 shows the expected





































[Plot of Figure 4-8. Horizontal axis: % of Shared Data. Series: Proposed, Current.]
Figure 4-8: Performance Difference when Programs Exhibit per-processor locality
When there is very little sharing, the proposed architecture will have a slightly
longer execution time since the L2* cache would add some additional delays to cache
lookups. Also, at this point invalidation requests would not overwhelm the bus.
However, as the amount of sharing increases the proposed architecture would result in a
lower execution time due to the elimination of invalidations.
It has been shown that the proposed architecture can increase performance in
multiprocessor systems. However, the exact amount of increase depends upon the
amount of sharing between processors, compiler optimizations, and the programmers'
knowledge of the parallel application being developed.
5 Conclusions
With the limitation of VLSI advancements, and the shortcoming of additional
architectural enhancements, designers are forced to look to other places to increase
system performance. The most obvious choice to increase system performance is SMP.
Multiprocessing is widely accepted as the next step in mainstream computing systems.
However, some questions need to be addressed before SMP becomes a conventional
solution for individual users.
When designing an SMP system, trace driven simulation is a designer's ideal tool
for modeling memory system architectures. However, even without such tools a good
performance evaluation can be conducted. The lack of an efficient simulation tool has
led to the use of the data rate model used for the performance comparison.
In this thesis, a new architecture for multiprocessor memory systems was
investigated. This architecture removes the overhead that is placed on the system due to
the snoopy bus based cache coherence protocol detailed in section 2.3.2 commonly used
in current systems. The actual amount of overhead placed on the system can vary greatly
depending on the type of data sharing, and compiler optimizations. In addition, proper
parallel program techniques must be understood for the programmer to create efficient
applications.
The architecture we presented in section 3.2 does effectively remove all need for a
cache coherence protocol in the system. The modification to current SMP architectures
results in two advantages. First, there are no invalidation signals being utilized from the
cache coherence protocol. This results in better bus utilization and less bus contention,
increasing scalability. Second, the cache miss rates will decrease due to the fact that a
processor can never invalidate data in another processor's L2 cache. Therefore, the
effective bus bandwidth will increase allowing more processors to be added to increase
performance.
Another goal of this thesis was to avoid requiring major modifications to the current
memory system and CPU. Another bus and cache must be added to the system board to
support the architecture. Also, minor modifications to the CPU and its packaging must
be made. These modifications would eliminate the snoopy logic contained in the cache
controller inside the processor.
A performance increase was shown from the data rate model detailed in section
4.2. The amount of overhead involved in current multiprocessor systems is quite
significant and does reduce the system performance dramatically as shown in the figures
in section 4.2.3. While the data rate model shows promising results, a full-fledged
trace driven simulator is required to completely analyze any disadvantages that this
architecture might present.
5.1 Future Work
Future work might involve the use of this idea on a daughter card for a system
board. The daughter card would contain a number of CPUs along with the L2* bus and
cache. This daughter card could be made compatible with a uniprocessor daughter card.
Therefore, when individuals wish to upgrade their machine to increase performance they
could simply swap their CPU card without any other modifications to the system. This is
an attractive solution to the quickly changing computer world. It would allow users to
increase the performance of their systems without requiring them to purchase large
amounts of additional hardware.
6 References




IEEE. Trans. Computers, 1979




in Dubois and Thakkar, Scalable Shared-Memory
Multiprocessors, Kluwer Academic Publishers, Boston, MA, 1992.
[GOOD83] J. R. Goodman, "Using Cache Memory to Reduce Processor-Memory
Traffic", ISCA '83, pp. 124-131, June 1983.
[CHISO92] M. Chiang and G. S. Sohi, "Evaluating Design Choices for Shared Bus




Computers, pp. 297-317, March 1992
[HWAXU98] K. Hwang and Z. Xu, "Scalable Parallel Computing", WCB/McGraw-Hill, 1998
[HWANG93] K. Hwang, "Advanced Computer Architecture", McGraw-Hill, 1993
[HENPAT96] J. Hennessy and D. Patterson, "Computer Architecture: A Quantitative
Approach", Morgan Kaufmann Publishers, Second Edition, 1996
[PII100] Intel 100MHz Pentium II processor/440BX AGPset Dual Processor
Customer Reference Schematics, Revision 1.0, April 1998,
http://developer.intel.com/design/pcisets/designex/BXDPDG10.htm
[ANDS95] J. Anderson and T. Shanley, "Pentium Processor System Architecture",
Second Edition, Addison-Wesley, 1995





[DUBRIG82] M. Dubois and F. Briggs, "Effects of Cache Coherence in
Multiprocessors", IEEE Trans. Comp., November 1982
[MBC86] M. Marsan, G. Balbo and G. Conte, "Performance Models of
Multiprocessor Systems", MIT Press, 1986
[LENW95] D. Lenoski and W. Weber, "Scalable Shared-Memory Multiprocessing",
Morgan Kaufmann Publishers, 1995






[GELEN89] E. Gelenbe, "Multiprocessor Performance", John Wiley & Sons, 1989
[MCKER88] P. McKerrow, "Performance Measurement of Computer Systems",
Addison-Wesley, 1988
[SUZU92] N. Suzuki, "Shared Memory Multiprocessing", MIT Press, 1988
[GUSML97] E. Gustafsson and B. Nilbert, "Cache Coherence in Parallel
Microprocessors", UPPSALA, February 1997
[NAYK94] B. Nayfeh and K. Olukotun, "Exploring the Design Space for a
Shared-Cache Multiprocessor", IEEE Press, 1994
[SMITH87] A. Smith, "Design of CPU Cache Memories", Proc. of IEEE TENCON,
August 1987
[STENS90] P. Stenstrom, "A Survey of Cache Coherence Schemes for
Multiprocessors", IEEE Computer, Vol. 23, Number 6, June 1990
[PHM88] S. Prybylski, M. Horowitz, and J. Hennessy, "Performance tradeoffs in
cache design", ISCA 1988
[MITCH] B. S. Mitchell, "Managing Cache Coherence in Multiprocessor Computer
Systems", Widener University
[WANDU90] J. Wang and M. Dubois, "Memory-Access Penalties in Write-invalidate
Cache Coherence Protocols", Cache and Interconnect Architectures in
Multiprocessors, 1990




ACM Computing, Vol. 25,
No. 3, 1993




IEEE Trans. on Computer-Aided Design, Nov. 1987
[CRISP97] R. Crisp, "Direct Rambus Technology: The New Main Memory Standard",
IEEE Micro, Nov. 1997
