Performance of parallel algorithms on a broadcast-based architecture by Narravula, Harsha V.
Performance of Parallel Algorithms on a Broadcast-Based Architecture
A Thesis
Submitted to the Faculty
of
Drexel University
by
Harsha V. Narravula
in partial fulfillment of the
requirements for the degree
of
Doctor of Philosophy
November 2003
ACKNOWLEDGEMENTS
This dissertation would not have been possible without the help of my advisor,
Dr. Constantine Katsinis. Special thanks to him for his patience and guidance over
the years.
I would also like to thank Dr. Jeremy Johnson, Dr. Warren Rosen, Dr. Prawat
Nagvajara and Dr. Oleh Tretiak for serving on my committee.
Special thanks to Diane for letting me use and extend her simulator. Thanks
also to my friend and colleague Zhu, for his help in writing the thesis.
I would also like to thank my roommates Harish, Madhu and Ishan for their
friendship and support.
Lastly, I would like to thank my family for their support and guidance.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
CHAPTER 1. INTRODUCTION
1.1 Current Architectures
1.2 Distributed-Shared-Memory
1.3 Broadcast-Based Networks
1.4 Structure of the Thesis
CHAPTER 2. SOME-BUS ARCHITECTURE
2.1 Overview
2.1.1 The Receiver Architecture
2.2 DSM Implementation on the SOME-Bus
2.3 Cache and Directory Controller Architecture
CHAPTER 3. SOME-BUS ARCHITECTURAL ENHANCEMENTS
3.1 Introduction
3.2 Data-Acknowledge Message Multicasting (Combining)
3.3 Block Capture
3.4 Block Prefetch
CHAPTER 4. CASE STUDIES
4.1 Matrix-Vector Multiplication on the SOME-Bus
4.1.1 Matrix-Vector Multiplication Using Message Passing
4.1.2 Matrix-Vector Multiplication Using DSM
4.1.3 Cache Behaviour: No Enhancements
4.1.4 Data-Acknowledge Message Multicasting
4.1.5 Matrix-Vector Multiplication Using Block Capture
4.1.6 Matrix-Vector Multiplication Using Block Prefetch
4.1.7 Block Prefetch in Exclusive State
4.2 Matrix-Matrix Multiplication on the SOME-Bus
4.2.1 Matrix-Matrix Multiplication Using Block Capture
4.2.2 Matrix-Matrix Multiplication Using Block Prefetch
4.2.3 Performance Analysis
4.3 LU Block Decomposition
4.4 Sorting on the SOME-Bus Using DSM
4.4.1 Performance Analysis
CHAPTER 5. SCHEDULING OF ALGORITHMS
5.1 Matrix-Vector Multiplication Using Prefetch
5.2 Matrix-Vector Multiplication Using Block Capture
5.3 Matrix-Vector Multiplication Using Block Capture and Prefetch
5.4 A General Procedure to Use Capture and Prefetch
5.5 Comparison With a Traditional Architecture
CHAPTER 6. CONCLUSIONS AND FUTURE WORK
LIST OF REFERENCES
APPENDIX A. DESCRIPTION OF THE SIMULATOR
VITA
LIST OF TABLES
4.1 LU Decomposition Performance Summary
5.1 Matrix-Vector Multiplication Accesses at Each Processor
5.2 Matrix-Vector Multiplication Accesses with Prefetching
5.3 Block Capture (Example)
5.4 Dynamic Code Accesses
LIST OF FIGURES
2.1 Parallel Receiver Array
2.2 Readout Geometry
2.3 Optical Interface
2.4 Processor Interface
2.5 Channel Controller, Cache and Directory
3.1 Queue at One Input
3.2 All Input Queues at an Input
3.3 Capture Hardware at Each Input of the Receiver
3.4 Global Control of Capture Hardware
4.1 Matrix-Vector Multiplication Program
4.2 Matrix-Vector Multiplication Program (Showing Misses)
4.3 Matrix-Vector Multiplication (P Accesses)
4.4 Multiplication-Vector Program Using Block Capture
4.5 Multiplication-Vector Program Using Prefetch
4.6 Matrix-Vector Multiplication Program Using e_prefetch
4.7 Matrix-Matrix Multiplication Program
4.8 Matrix-Matrix Multiplication Using Block Capture
4.9 Matrix-Matrix Multiplication Using Block Prefetch
4.10 Miss Rate Reduction in Matrix Multiplication program
4.11 LU Block Decomposition Algorithm
4.12 Terminology
4.13 Step 2 with Exclusive Prefetch
4.14 Calculation of Transformation Ratio (u)
4.15 Calculation of Transformation Ratio (optimized)
4.16 Sorting Procedure
4.17 Merge Cases
4.18 Example of Case 3
4.19 Example of Case 3 (simple)
5.1 Matrix-Vector Multiplication (simple)
5.2 Matrix-Vector Multiplication with Reduced Congestion
5.3 Matrix Multiplication with Block Capture and Prefetch
5.4 Example of a More Dynamic Code
5.5 Dynamic Code (Case 1)
5.6 Dynamic Code (Case 2)
5.7 Finer Scheduling of Block Accesses
5.8 Matrix Multiplication Accesses (No Schedule)
5.9 Matrix Multiplication Accesses (Schedule A)
5.10 Matrix Multiplication Accesses (Schedule AB)
5.11 Matrix Multiplication Program (Schedule AB)
ABSTRACT
Performance of Parallel Algorithms on a Broadcast-Based Architecture
Harsha V. Narravula
Constantine Katsinis, Ph.D.
Research in high-end computing has produced enormous benefits to society.
While new data- and computation-intensive applications are appearing all the
time, there is evidence that present scalable parallel architectures may not be well
suited for these applications. To achieve petaflops computing, advances in hardware
technology, architecture, system software, and programming environments
are needed.
Due to advances in fiber optics and VLSI technology, interconnection networks
that allow multiple simultaneous broadcasts are becoming feasible. The Simultaneous
Optical Multiprocessor Exchange Bus (SOME-Bus) is a low-latency,
high-bandwidth, fiber-optic network with the unique feature that every processor
is directly connected to every other processor through a dedicated broadcast/output
channel. This thesis presents the multiprocessor architecture of the SOME-Bus and
examines the performance of representative algorithms for matrix operations and
sorting using the message-passing and distributed-shared-memory paradigms. It
shows that simple enhancements to the network interface and the cache and di-
rectory controllers can greatly improve the performance; for example, the commu-
nication time of a matrix-vector multiplication algorithm is reduced to O(1) using
DSM.
Existing parallel loop schemes are extended to make them suitable for the high-
end system under study. Efficient mapping of existing parallel software to the sys-
tem is studied. Software is implemented, tested and evaluated for performance
on a simulator developed for the system. The thesis also presents enhancements
to the network interface and the cache and directory controllers, which allow sig-
nificant overlap of processing time with the communication time due to compul-
sory misses. Results from the simulated execution of simple algorithms such as
the matrix-matrix multiplication on the SOME-Bus show that block capture and
prefetch combined with an effective block replacement policy succeed in signifi-
cantly reducing the miss rate due to compulsory misses as the cache size increases,
while a similar increase of cache size in traditional architectures leaves the miss
rate (due to compulsory misses) unaffected.

CHAPTER 1. INTRODUCTION
1.1 Current Architectures
High-performance computing is required for many applications, including simulation
of physical phenomena, simulation of integrated circuits and neural networks,
weather modeling, aerodynamics, and image processing. It increasingly
relies on microprocessor-based computer nodes, groups of which are
interconnected to form a distributed-memory multicomputer system. Large net-
works capable of accommodating arbitrary traffic patterns must necessarily have
large bisection bandwidths and usually have some regular topology. Typical in-
stallations use Quadrics [1] or Myrinet [2] switches to create fat trees, Clos net-
works or multidimensional tori, containing several dozen or hundreds of process-
ing nodes. Each node contains 4 to 8 microprocessors and a network interface card
attached to the I/O bus of the processing node. Point-to-point links connect inter-
face cards to switches. For example, a switch-based Myrinet architecture with 128
processing nodes can be constructed using 24 16-port switches in a Clos network
topology [2, 3].
Programming tends to rely on the message-passing paradigm requiring pro-
grammers to manage the distribution of data explicitly and use send/receive prim-
itives, which typically need the intervention of the operating system. Such inter-
ventions dramatically increase the latency of the operation to the extent that the
interconnection network capabilities (or lack thereof) become secondary. When the
initiation of a send operation requires microseconds, there is no significance to the
fact that the interconnection network can transfer messages with latencies on the
order of nanoseconds. As a result, there is a tendency to use large messages to
spread the cost of the high latencies. Consequently, scientific and data-processing
applications that require frequent small messages or that are sensitive to latency
achieve low processor utilization and scale poorly.
There is an increasing imbalance among processor speed, communications per-
formance and data access, resulting in high-end systems that are inefficient for
large-scale applications. Many complex simulation codes demonstrate poor com-
putational efficiency and scalability (utilizing less than a dozen processors for op-
timized codes), and poor single-processor utilization (as low as 1% to 5% of peak
processor performance). This effect increases the time and the cost of operation
and programming. Developing good software is very hard. System resources
are at different distances (with varying access costs) requiring complicated per-
formance models of the underlying hardware and software, and sophisticated re-
source managers and compilers to achieve some acceptable mapping of the appli-
cation on the underlying platform. Dynamically partitioning an application is even
harder if needs or resource availability change. Consequently, current teraflop
systems are severely underutilized, while at the same time they require enormous
effort to set up and maintain.
One additional issue which is critical to achieving high performance is con-
tention on the interconnection network. In [4] experiments using benchmarks
on hardware DSM multiprocessors show that contention has a significant impact,
which may account for as much as 34% of execution time in programs with irreg-
ular memory access patterns. The author describes an algorithm that alleviates
contention using dynamic page migration. The algorithm integrates a locality-
sensitive page migration criterion with a criterion that balances the number of re-
mote memory accesses and the associated protocol traffic across the DSM nodes.
Dai and Panda [5] simulated a detailed network model that captures all types of
contention in every part of the network of a hardware DSM multiprocessor. They
have shown that network contention can have a significant impact on performance
and conducted a sensitivity analysis of architectural parameters that might affect
contention, such as the design of caches, CPU speed and network speed.
1.2 Distributed-Shared-Memory
DSM provides a simple programming model where communication is per-
formed by writing and reading shared memory locations, and barriers provide
synchronization. A natural implementation leads to the use of a single memory ad-
dress space distributed over all the processing elements; some memory segments
are only accessed locally while other segments are globally shared and managed
by a DSM protocol, enforcing the required memory consistency.
Many parallel applications are easier to formulate and solve using the shared-
memory paradigm rather than message passing. A distributed-shared-memory
(DSM) system can be viewed as a set of nodes or clusters, with local memories,
communicating over an interconnection network. DSM hides the message-passing
mechanism and provides a shared-memory model, attempting to combine ease of
programming and reduced contention. On each access to shared space, hardware
must determine if the requested data is in the local memory, and if not, the data
must be copied from remote memory. Actions are also needed when data is written
in shared space to preserve the coherence of shared data. DSM requires more archi-
tectural complexity than message-passing. This is one reason why software-based
DSM systems have been developed [6, 7]. However, these systems tend to rely on
the operating system to manage the replication of relatively large memory blocks
(pages) and consequently suffer from some of the problems relating to message-
passing implementations, in addition to false sharing and page thrashing. The
addition of full hardware support allows the true benefits of DSM to become a
reality. Initial versions of such hardware supporting shared-memory [8] still rely on
interfaces attached to the I/O bus and therefore cannot provide cache coherence.
Future versions of interfaces would be connected directly to the memory bus [9]
to support shared-memory and implement the DSM protocol in hardware. A fully
hardware-supported DSM requires very little support from the operating system
and consequently the latencies experienced are much smaller. This is critical as
messages are also very small and are sent very frequently. The correct organiza-
tion and design of the interconnection network becomes a critical factor in that
case, especially as processors become faster or are replaced by larger SMPs.
The effects of interconnection network properties and data consistency proto-
cols have been the focus of extensive research. A DSM multiprocessor based on
a two-dimensional mesh is examined in [10] using a queuing network model and
simulation. For large values of remote memory request probability, it is observed
that the interconnection network saturates, and processor utilization stays below
35%. Both theoretical and simulation techniques are used in [11] to study a clustered
DSM multiprocessor with crossbars interconnecting processors and memories
within a cluster, and processors with global memory. A study of four architectures
with hardware support of shared memory is reported in [12]. Significant
latency is found, even under optimistic assumptions, especially in characterizing
the cache misses which result in traffic over the interconnection network, as evi-
denced by the very small network utilization in three architectures. A DSM imple-
mentation on a 16-node nCUBE is described in [13]. Experiments with four paral-
lel programs show reduced performance of matrix operations on distributed data
requiring a significant amount of data-transfer time compared to node-intensive
computation time. It is observed in [13] that such programs are unsuitable for
DSM unless a technique can be found to reduce the communication. A shared
memory multiprocessor based on a 4x4 mesh network with wormhole routing is
studied in [14]. The performance of two hardware-based prefetching schemes is
evaluated with simulation. Cache miss rates in the range from 1% to 10% are observed
using five popular applications.
1.3 Broadcast-Based Networks
High-performance (and high-complexity) interconnection networks have been
proposed [15]. More practical networks have also been proposed, including net-
works based on the mesh with additional broadcast buses. The distributed cross-
bar switch hypermesh (an implementation of multidimensional hypergraph net-
works) is examined in [16], where blocking probabilities and average values of
message delays are calculated. Similarly, an optical implementation of hyperme-
shes using electrical and optical crossbars is examined in [17]. Although multiple
wavelengths are used, multiple senders may use the same wavelength, requiring
contention resolution.
A network which can offer an alternative to current networks relies on one-
to-all broadcast, where each processor can directly communicate with any other
processor; from the point of view of any processor, all other processors appear
the same. Such a network allows the user, or the compiler, to structure the data
and operations in the application code to better reflect the parallelism inherent in
the applications with the resulting benefit that messages experience smaller laten-
cies. It also allows the operating system to perform extensive thread placement
and migration dynamically to successfully manage the level of parallelism present
in large applications. The most useful properties of such a network of worksta-
tions are high bandwidth (scaling directly with the number of workstations), low
latency, no arbitration delay, and non-blocking communication. A network with
these properties can be constructed using optoelectronic devices (and
multiple-wavelength data representations), relying on sources, modulators and arrays of
detectors, all being coupled to local electronic processors.
When communication traffic is relatively equally distributed over all nodes,
large bandwidth and scalability can be achieved only by carefully designed, reg-
ular, small-diameter networks; it is only the unloaded latency that is independent
of distance in wormhole networks. The effects of interconnection network prop-
erties and data consistency protocols have been the focus of extensive research
over the past several years. Traditional architectures (including crossbars, meshes
and hypercubes) have been examined [10–14]. Typical results show relatively
large data-transfer times and low processor utilization. High-performance (and
high-complexity) interconnection networks have also been examined including the
mesh with additional broadcast buses and the Generalized Hypercube [15]. The
distributed crossbar switch hypermesh (an implementation of multidimensional
hypergraph networks) is examined in [18]. Similarly, an optical implementation
of hypermeshes using electrical and optical crossbars is examined in [17]. Sev-
eral networks rely on optics but although multiple wavelengths are used, multiple
senders may use the same wavelength, requiring contention resolution.
Many modern large-scale applications have irregular and dynamic communi-
cation patterns. Even after extensive efforts of software tuning [19, 20], results
show that performance is often poor or moderate, mostly due to load imbalance,
barrier synchronization, and collective communication patterns. The main reason
for this limited success lies in the nature of currently available interconnection
topologies (small-degree networks with large diameters), and in the mismatch between
interconnection architecture and application structure. The result is a
proliferation of high-end systems that exhibit large variation in performance
depending on the application at hand. Good performance is observed only if the
application structure happens to match the system characteristics.
The current challenges in development and use of high-end systems and applications
are programming productivity, performance, portability, scalability and
reliability. Specifically, high-bandwidth, low-latency, hierarchical memory systems
must be developed based on emerging technologies. Scalable computer systems
should be designed from an overall system perspective, balancing the performance
of processors, memory systems, interconnects, system software, and programming
environments, and system efficiency must be improved for a broad class of user
applications.
1.4 Structure of the Thesis
The rest of the thesis is organized as follows. Chapter 2 presents the SOME-Bus
multiprocessor architecture, DSM operation and a detailed design of the network
interface. Chapter 3 presents architectural enhancements to the SOME-Bus.
Chapter 4 discusses case studies (matrix-vector multiplication, matrix-matrix
multiplication, LU block decomposition and sorting) using message passing and
distributed shared memory. Chapter 5 presents a more general approach to
scheduling algorithms, and Chapter 6 presents conclusions and future work.
CHAPTER 2. SOME-BUS ARCHITECTURE
2.1 Overview
The Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus) [21] is a
low-latency, high-bandwidth, fiber-optic network with the unique feature that every
processor is directly connected to every other processor through a dedicated
broadcast/output channel. Its properties distinguish it from other optical networks
examined in the past [16, 17, 22, 23]. A receiver array at each node contains amorphous
silicon (a-Si) detectors built on the surface of a CMOS chip. The yield and cost of the
receiver are determined by the yield and cost of the CMOS device itself, as no
subsequent patterning is required. This is due to the very low conductivity of the a-Si
layer (the photodetectors) used to build superstructures on the surface of the electronic
processing devices. The SOME-Bus also uses slant Bragg gratings written directly into
the fiber as narrow-band, inexpensive couplers. Since the receiver array itself does not
perform any routing, its hardware complexity (including detector, logic and packet
memory storage) is small. The dedicated output channel per node eliminates
blocking of a node due to contention for shared switching logic with other
transmitters.
In general, the SOME-Bus contains K fibers, each carrying M wavelengths or-
ganized in M/W channels, where each channel is composed of W wavelengths.
The total number of fibers is K = PW/M. A simple configuration with 128 nodes
(P = 128 channels) and W = 1 wavelength per channel would require K = 32
fibers with M = 4 wavelengths per fiber, and a receiver array at each node con-
taining 128 detectors organized as 32x4 over the surface of a single chip. Current
or foreseeable technology can be used to create configurations with more wavelengths
per fiber, allowing each channel to reach higher bandwidth. In fact, such
optoelectronic CMOS chips with 256 laser drivers, 256 receivers and associated
buffers have already appeared [24]. Each of P nodes also has an input channel
interface based on an array of P receivers (each with W detectors) which simulta-
neously monitors all P channels.
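The relationship between P, W, M and K given above can be checked with a short calculation. The following is a minimal sketch, not part of the SOME-Bus design itself; the function name and the divisibility checks are assumptions introduced for illustration.

```python
def some_bus_config(nodes, wavelengths_per_channel, wavelengths_per_fiber):
    """Return (fibers K, channels per fiber M/W, detectors per node P*W).

    Hypothetical helper: P nodes each own one channel of W wavelengths,
    and every fiber carries M wavelengths, so K = P*W/M fibers are needed.
    """
    P, W, M = nodes, wavelengths_per_channel, wavelengths_per_fiber
    # Assumption: channels are not split across fibers.
    if M % W != 0:
        raise ValueError("M must be a multiple of W")
    # Assumption: only whole fibers are used.
    if (P * W) % M != 0:
        raise ValueError("P*W must be a multiple of M")
    channels_per_fiber = M // W
    fibers = (P * W) // M
    # Each node monitors all P channels with W detectors per channel.
    detectors_per_node = P * W
    return fibers, channels_per_fiber, detectors_per_node


# The 128-node example from the text: W = 1, M = 4.
print(some_bus_config(128, 1, 4))  # (32, 4, 128)
```

For the configuration quoted in the text this gives 32 fibers with 4 channels each, and 128 detectors per node, matching the 32x4 detector arrangement.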
Such an organization also eliminates any need for global arbitration and pro-
vides bandwidth that scales with the number of nodes in the system. Another
consequence of such an architecture is the ability to support multiple simultane-
ous broadcasts with distributed high-speed barrier synchronization mechanisms,
efficient implementation of cache consistency protocols, and support for process
group partitioning within the receiver array.
Recent advances in optical communications, Dense Wavelength Division Multiplexing
(DWDM), and optoelectronics provide motivation for a physical implementation
of the SOME-Bus. Figure 2.1 shows the parallel receiver array and the
output coupler. Slant Bragg gratings [24–28] are written directly into the fiber core
and are used as narrow-band, inexpensive output couplers. This coupling of the
evanescent field allows the traffic to continue and eliminates the need for regen-
eration. The SOME-Bus also uses amorphous silicon (a-Si) photodetectors built
as superstructures on the surface of electronic processing devices. Due to the low
conductivity of the a-Si layer, no subsequent patterning is required, and therefore
the yield and cost of the receiver are determined by the yield and cost of the CMOS
device itself. Optical power budget analysis of a system with 128 nodes, 32 wavelengths
per fiber and 10 mW of power inserted into the fiber shows that the worst
case for output power, occurring where light from the first node is coupled out
by the receiver at the last node, is 2.46 µW [21], which is more than sufficient for
present detectors.
Figure 2.1: Parallel Receiver Array
To increase wavelength selectivity, a Bragg reflector can be created at the bottom
of the absorption region, and the region’s thickness can be chosen to form a cavity
where photons recirculate. Thus, each detector becomes a filter which is tuned to
the proper wavelength. An additional benefit of this design is that reduced ab-
sorption thickness results in decreased response time. Bragg gratings on the fiber
produce a spot of light which must be coupled to a planar photodetector, part of
whose surface is covered by metal. If the detector area is too large, capacitance
of the detector dominates its speed of response, but reduction of detector area re-
quires high signal intensity to maintain detectivity of light. Cylindrical micro-optic
lens arrays [29] can be used to collect the light and focus it on the active area of the
detectors, away from metal areas. This allows the detector area, and hence capacitance,
to be reduced while retaining the low signal intensity requirement. The
readout geometry is shown in Figure 2.2.
Figure 2.2: Readout Geometry
Since the receiver array does not need to perform any routing, its hardware
complexity (including detector, logic, and packet memory storage) is small. This
organization eliminates the need for global arbitration and provides bandwidth
that scales directly with the number of nodes in the system. No node is ever
blocked from transmitting by another transmitter or due to contention for shared
switching logic. The ability to support multiple simultaneous broadcasts is a unique
feature of the SOME-Bus which efficiently supports high-speed, distributed barrier
synchronization mechanisms and cache consistency protocols and allows process
group partitioning within the receiver array.
2.1.1 The Receiver Architecture
Once the logic level signal is restored from the optical data, it is directed to the
input channel interface which consists of two parts: the optical interface and the
processor interface. Figure 2.3 shows the optical interface which includes physical
signaling, address filtering, barrier processing, length monitoring and type decod-
ing.
Figure 2.3: Optical Interface
Each receiver generates a data stream which is examined to detect the start
of the packet and the packet header. The header decode circuitry examines the
header field, which includes information on the message type, destination address
(or addresses) and length, to determine whether or not the message is a synchro-
nization message. If the message is a synchronization message, it is handled by the
barrier circuitry, otherwise the destination address is compared to the set of valid
addresses contained in the address decode circuitry. In addition to recognizing the
local node address, the address filter can recognize multicast group addresses as
well as broadcast addresses. Once a valid address has been identified, the message
is placed in a queue. If the address does not match, the message is ignored.
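The header-decode and address-filter decisions described above can be sketched in software as follows. This is only an illustrative model of the decision sequence, not the hardware design; the class names, the "sync" and "data" type labels, and the BROADCAST marker are hypothetical.

```python
from dataclasses import dataclass, field

BROADCAST = "broadcast"  # hypothetical marker for the broadcast address


@dataclass
class Header:
    msg_type: str            # "sync" or "data" (labels assumed for illustration)
    destinations: frozenset  # destination address or addresses
    length: int


@dataclass
class Receiver:
    local_address: int
    multicast_groups: set = field(default_factory=set)
    queue: list = field(default_factory=list)
    barrier_msgs: list = field(default_factory=list)

    def accept(self, source, header, payload=None):
        """Apply the decision sequence described in the text to one message."""
        # Synchronization messages bypass the address filter.
        if header.msg_type == "sync":
            self.barrier_msgs.append(source)
            return "barrier"
        # Valid addresses: local node, joined multicast groups, broadcast.
        valid = {self.local_address, BROADCAST} | self.multicast_groups
        if header.destinations & valid:
            self.queue.append((source, payload))
            return "queued"
        # No address match: the message is ignored.
        return "ignored"


rx = Receiver(local_address=3, multicast_groups={"g1"})
print(rx.accept(0, Header("data", frozenset({3}), 8)))     # queued
print(rx.accept(1, Header("data", frozenset({"g1"}), 8)))  # queued
print(rx.accept(2, Header("data", frozenset({5}), 8)))     # ignored
print(rx.accept(4, Header("sync", frozenset(), 0)))        # barrier
```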
A special register is maintained at the receiver for implementing barriers in the
SOME-Bus. A processor specifies all the other processors involved in the barrier
by writing a ’1’ in the appropriate bits of the register. The hardware then clears the
appropriate bits as and when barrier sync messages arrive from the other proces-
sors. When the mask is completely cleared, synchronization is complete and the
processor can continue with its next computation.
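The behavior of the barrier register can be illustrated with a small bit-mask model. This is a sketch under simplifying assumptions: the class and method names are hypothetical, and sync messages that arrive before the register is written are not modeled.

```python
class BarrierRegister:
    """Software model of the per-receiver barrier mask (names are hypothetical)."""

    def __init__(self, num_processors):
        self.num_processors = num_processors
        self.mask = 0

    def arm(self, participants):
        """Processor writes a 1 in the bit of every other processor in the barrier."""
        for p in participants:
            if not 0 <= p < self.num_processors:
                raise ValueError(f"processor index {p} out of range")
            self.mask |= 1 << p

    def on_sync_message(self, source):
        """Clear the sender's bit; return True once the mask is fully cleared."""
        self.mask &= ~(1 << source)
        return self.mask == 0


reg = BarrierRegister(num_processors=8)
reg.arm([1, 4, 6])
print(bin(reg.mask))            # 0b1010010
print(reg.on_sync_message(4))   # False
print(reg.on_sync_message(1))   # False
print(reg.on_sync_message(6))   # True
```

Because each receiver monitors every channel simultaneously, the bits can be cleared in any order and independently of other traffic, which is consistent with the O(1) synchronization time claimed for the SOME-Bus.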
Figure 2.4 shows the processor interface, which includes a routing network
(resolver circuit) and a queuing system. One queue is associated with each in-
put channel, allowing messages from any number of processors to arrive and be
buffered simultaneously, until the local processor is ready to remove them. The
Resolver circuit receives a request signal (Ri) from each non-empty queue and
produces the index of the next queue to be accessed under either the limited or
the exhaustive service disciplines.
The local processor can force the next queue selection through the Pin input. A
straightforward implementation of the resolver as a selection tree, using logic gates
to select the next queue and multiplexers to forward the corresponding queue index,
requires only several hundred gates organized in log2(P) levels. The time
required to select the next queue (polling walk time) is consequently very small
and can be overlapped with the queue access time. Arbitration may be required
only locally in a receiver array when multiple input queues contain messages.
Figure 2.4: Processor Interface
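The queue selection performed by the resolver can be sketched in software as follows. The cyclic scan order is an assumption made for illustration (the text specifies only a selection tree), as are the function names; the limited and exhaustive disciplines follow their usual meaning in polling systems, where a visit to a queue removes at most a fixed number of messages or continues until the queue is empty.

```python
from collections import deque


def resolve(queues, last, forced=None):
    """Return the index of the next non-empty queue, or None if all are empty.

    Assumption: queues are scanned cyclically starting after `last`.
    `forced` models the Pin input that lets the processor pick a queue.
    """
    n = len(queues)
    if forced is not None and queues[forced]:
        return forced
    for step in range(1, n + 1):
        idx = (last + step) % n
        if queues[idx]:  # a non-empty queue raises its request signal Ri
            return idx
    return None


def drain(queues, limit=None):
    """Remove all messages; limit=None is exhaustive, limit=k is limited service."""
    order = []
    last = len(queues) - 1  # so that the first scan starts at queue 0
    while True:
        idx = resolve(queues, last)
        if idx is None:
            return order
        served = 0
        while queues[idx] and (limit is None or served < limit):
            order.append(queues[idx].popleft())
            served += 1
        last = idx


make = lambda: [deque(["a1", "a2"]), deque(), deque(["c1"])]
print(drain(make()))           # ['a1', 'a2', 'c1']
print(drain(make(), limit=1))  # ['a1', 'c1', 'a2']
```

The two runs show the difference between the disciplines: exhaustive service empties queue 0 before moving on, while limited service with a limit of one message interleaves the queues.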
Etched micro-mirrors [30, 31] can be used to insert a signal into a fiber. Each
node uses a separate mirror and a separate laser source to insert each wavelength
of its channel. It is possible to integrate the transmitter sources on the same chip
with the detectors and the associated electronic circuits.
The SOME-Bus may appear to be equivalent to a crossbar but it has much more
functionality. A major consequence of this architecture is that, due to the mul-
tiple broadcast capability, no node is ever blocked from transmitting by another
transmitter, no arbitration is required and network bandwidth scales directly with
the number of nodes. No communication is ever blocked through contention for
shared switching logic. With P nodes, the diameter of the SOME-Bus is 1, the time
needed for all-to-all communication with distinct messages is O(P) and the time
needed for synchronization is O(1). Unlike a fully-connected point-to-point network,
where the number of transmitters and channels increases as O(P^2), the number
of transmitters and channels of the SOME-Bus is O(P), considerably smaller than the
number required in other popular architectures, such as the hypercube or the torus. The
number of receivers is P^2, which is larger than the number required in other architectures.
They are arranged so that P receivers are fabricated as amorphous silicon
structures constructed as a thin film directly on the surface of a digital CMOS de-
vice, with no lithography required. Because of the low conductivity of the amor-
phous silicon layer, no subsequent patterning is required, and therefore the yield
and cost of the receiver is determined by the yield and cost of the CMOS device
itself. Since the receiver does not need to perform any routing, its hardware com-
plexity (including detector, logic, and packet memory storage) is small keeping the
cost small, too. The full receiver array can be implemented on a single chip even
for large values of P (P > 128). Therefore, the total receiver cost is approximately
O(P) instead of O(P^2). The expansion of the system from P to 2P nodes is accomplished
by incorporating a second receiver chip in each node and by using four
SOME-Bus segments to create twice the number of channels with each channel
being twice as long.
2.2 DSM Implementation on the SOME-Bus
The SOME-Bus can most readily support a CC-NUMA system where the shared
virtual address space is distributed across local memories, which can be accessed
both by the local processor and by processors from remote nodes, with different
access latencies. Traffic on the interconnection network consists of request and
data acknowledge messages, due to a cache miss on one (local) node, directed to
the memory of another (remote) node, and additional messages, which allow the
caches to maintain data consistency. Although the SOME-Bus can utilize software
techniques for implementing cache coherence, it allows strong integration of the
transmitter, receiver and cache controller hardware to produce a highly integrated
system-wide cache coherence mechanism. This hardware-oriented system uses
cache blocks to enforce coherence, resulting in reduced probability of false sharing
and thrashing, compared to software-oriented systems, which use larger blocks
such as virtual memory pages.
On the SOME-Bus, every processor may simply broadcast (on its own channel)
messages which cause updates or invalidations at remote caches. Every receiver
can also monitor its input channels for invalidation messages and signal the
cache controller to take appropriate action when locally cached data is affected.
Although the possibility of interconnection network saturation is eliminated, intense
cache consistency traffic can saturate the cache controller. A SOME-Bus-based system
can take advantage of directory-based techniques which notify only those remote
caches with affected data blocks. This can be simply accomplished by including
a list of destinations in the invalidation message header. Messages are still
broadcast over the sending node’s output channel, but the decision to accept or
reject an input message is performed at the receiver input rather than the cache
controller of each remote node.
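A minimal sketch of this filtering, assuming the destination list is encoded as one bit per node, is given below. The header layout and the names inval_header, make_invalidation and receiver_accepts are hypothetical and serve only to illustrate the accept or reject decision made at the receiver input.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical header layout: one destination bit per node (P <= 64). */
typedef struct {
    uint64_t dest_mask;  /* bit i set: node i must accept the message */
    uint32_t block;      /* block number the invalidation refers to */
} inval_header;

/* The home directory builds the header from the set of sharers. */
static inval_header make_invalidation(uint64_t sharers, uint32_t block) {
    inval_header h = { sharers, block };
    return h;
}

/* Address filter at the receiver input: the message is enqueued only
   if this node appears in the destination list, so the cache
   controller never sees messages meant for other nodes. */
static int receiver_accepts(const inval_header *h, unsigned node_id) {
    assert(node_id < 64);
    return (int)((h->dest_mask >> node_id) & 1u);
}
```

Every node still sees the broadcast, but only the nodes named in dest_mask pass the message on to their cache controllers.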
In addition to the processor and the cache, each node contains a directory which
maintains coherence information on the section of the distributed memory which
is implemented in that node. The directory is used to support the typical MESI,
write-invalidate protocol. A node that contains the portion of global memory and
corresponding directory entry for a particular data block is called the home node
for that block. A cache block can be in an invalid, shared, or exclusive state. Direc-
tory entries can be in an unowned, shared, or exclusive state.
A sequentially-consistent execution model is assumed: each processor executes
a program which continues until it encounters a cache miss. If the miss can be
serviced locally, the program waits for the memory access and then continues run-
ning. If it is a global cache miss which requires data or permission from a remote
node, the program is blocked until the required action is completed (data trans-
ferred from a remote memory or permission received), at which time it resumes
execution. A global miss may be due to a read or write miss at the local cache.
Accordingly, a data-request or ownership-request message is enqueued for trans-
mission on the output channel. After transmission, the message is enqueued at the
destination (remote) node, is serviced by the directory at the remote node and an-
other message is sent back to the originating node with data or acknowledgment.
The remote node directory controller performs the necessary memory accesses,
creates the response message and enqueues it for transmission on the output chan-
nel of the remote node. As part of servicing a message, a directory may send
messages to other nodes and receive data or acknowledgments. When the mes-
sage is received, the required data block may be in shared state or modified state.
The home directory may send a data message back to the requestor (in the case
of reading a shared block) or it may send downgrade or invalidation messages
and collect invalidation acknowledge messages. A waiting-message queue is used
to store previous requests waiting for acknowledge messages so the directory can
service multiple requests simultaneously.
As Figure 2.5 shows, the receiver array is integrated on a single chip which can
appear in the processor-memory bus, so that the processor may access it as part of
its memory. In addition, the receiver array is connected to the cache and directory
controllers for efficient protocol implementation.
Figure 2.5: Channel Controller, Cache and Directory
2.3 Cache and Directory Controller Architecture
Figure 2.5 shows the major blocks of the channel controller as well as the related
functionality of the cache and the directory. A processor reference may result in
a cache miss and a subsequent interaction with the directory. If necessary, a request
is created by the directory and is forwarded to the channel controller which creates
a request message and enqueues it in the output channel queue for transmission.
To support DSM, the channel receiver must pass incoming messages to either
the cache or the directory. As Figure 2.5 shows, incoming data or ownership re-
quest messages are sent to the directory, while invalidation requests are sent to
the cache. Similarly, incoming invalidation acknowledgments are collected by the
directory, and data or ownership acknowledgments are sent to the cache. In addi-
tion there is communication between cache and directory as needed to complete
the protocol. To maintain high performance, the receiver is capable of extracting
two messages from the channel input queues simultaneously. One message is di-
rected to the cache and one to the directory.
This is achieved through the use of queues which are dual-ported on the read
side. The queue structure has a status register which indicates if any messages exist
which are targeted for either the cache or the directory. It also has a dual resolver
(one for each attribute) to allow automatic polling. Both cache and directory keep
requesting the next message as long as the corresponding status bit indicates that
there are such messages. Each resolver generates the index of the next queue to be
polled among the queues with the corresponding attribute.
To support message passing, a direct connection is created between the channel
controller and the main memory. Small messages may be read by the processor di-
rectly from the selected queue. Large messages can be transferred from the proper
queue into the regular memory in a cut-through manner by the DMA controller.
Figure 2.5 also shows DMA channels connecting the transmitter and receiver to
the memory bus.
The AAN (Associative Attribute Network) is search logic that can search
the queues associatively for messages with a particular attribute (block number,
type, etc.). For example, the directory controller searches for DATA_REQ messages
by specifying that message type on an attribute signal. The input queues are then
searched in parallel, and the controller is notified of the presence or absence of such
a message. If a message is found, the controller can indicate to the queues which
message it needs to extract. An enhancement to this design and its use are
discussed further in Section 3.2.
CHAPTER 3. SOME-BUS ARCHITECTURAL ENHANCEMENTS
3.1 Introduction
Scientific programs tend to spend a significant part of their runtime stalled
on memory requests. Researchers have long recognized the destructive effect of
communication times on performance, and have developed sophisticated algo-
rithms, which attempt to minimize the amount of needed communication on tra-
ditional architectures, and new architectures to either reduce the communication
or overlap several communication activities. Research in this area is extensive,
and an example representative of related work is given. In [32], a pipelined fiber-
optic architecture is proposed which allows multiple processors to inject messages
into the network in a synchronized fashion. Matrix multiplication on this archi-
tecture is examined in [33]. The standard algorithm is shown to require time
O(N^a/p + (N^2 · log(p))/(p^(2/a))), where a = 3 and p is the number of processors.
The second term in the above equation is due to communication between pro-
cessors. The authors continue in [33] and [34] and examine the parallelization of
faster algorithms (with a < 3) on the pipelined fiber-optic architecture as well as
the torus and the hypercube. In all architectures examined in these articles, the
contribution of communication time to total run time is significant. The drawback
of such networks is their large diameter and the limited connectivity between pro-
cessors. Current fiber-optic networks often suffer from similar problems. They
attempt to offer higher bandwidth, but limit the time when a processor may access
the network (through arbitration or pipelining as in [32]) and use optical switches
which cause congestion in the network and increased latencies. Similar work in
architectures or algorithms is reported in [18, 24, 35–40].
Scientific programs frequently perform large, dense matrix operations which
typically exhibit little data reuse and therefore can cause caching strategies to be-
come inefficient. In simple caching policies, the cache controller fetches data from
memory only after the processor requests the data and the data is not found in
the cache. The processor is likely to stall because the computation cannot resume
until the main memory supplies the requested data. Under such a policy, compul-
sory misses (as well as coherence misses) become significant. More sophisticated
policies try to avoid many of these cache misses by using data prefetch, anticipat-
ing cache misses and fetching data from the memory system before the processor
needs the data. As a result, data transfer overlaps with computation, avoiding
stalls. Popular techniques include software and hardware initiated prefetching
which have been studied extensively. Several prefetching techniques are summa-
rized in [41], adaptive prefetching and additional latency reduction techniques are
examined in [42], and a prefetching scheme that relies on neighborhoods is devel-
oped in [43]. Software prefetching is examined in [44, 45], a combination of soft-
ware prefetching and clustering of independent read misses is examined in [46],
and software prefetching in combination with tiling and loop transformations is
examined in [47]. Also, producer-consumer coordination to prefetch data blocks is
presented in [48], and producer-initiated latency-reduction techniques are exam-
ined and compared in [49].
On the SOME-Bus architecture, latency reduction techniques such as block cap-
ture and prefetching can achieve an entirely higher level of functionality and po-
tential. The reason is that since all nodes are monitoring all channels, a block which
is prefetched by a node A may be acquired by another node B, if node B decides
that the block is potentially useful. Several cache enhancements are made possible
by the SOME-Bus architecture, so that a block can be found in the cache even on
the very first access. In following chapters, the effect of these enhancements on
performance is studied using several algorithms including matrix algorithms and
sorting.
3.2 Data-Acknowledge Message Multicasting (Combining)
One basic improvement that can be made at the directory of each home node
is to design the directory controller so that it scans the input queues for several
requests that can be served together. The typical case when this can be done is
when the directory controller extracts from its queue a DATA_REQ message from
some processor Pa and after examining the rest of its queue, it finds DATA_REQ
messages from processors Pb and Pc for the same memory block. Then, the di-
rectory controller creates one DATA_ACK message with multiple destinations (Pa,
Pb, Pc in this example) and enqueues it for transmission on the output channel.
Because of the SOME-Bus broadcast, all destination nodes specified in the header
will receive the message at the same time.
Figures 3.1 and 3.2 show the relevant hardware in detail. Blocks from an in-
put channel are stored in a linked list structure that can be accessed by the cache
and the directory controller independently through the use of two different “front”
pointers. If the directory finds a DATA_REQ message enqueued in the list, it extracts
the block number from the message and passes it on to an AAN (Associative
Attribute Network) through the “BLOCK #” signal shown in Figures 3.1 and 3.2.
When this signal is active, the control logic at each of the input queues searches
the queue associatively to find other DATA_REQ messages referring to the same
block. The control logic has access to certain attributes of each of the incoming
blocks, such as the type and block number, and the search in the queues is based
on this attribute information.
Figure 3.1: Queue at One Input
If no more messages are found, the DATA_ACK message is generated with
just one destination, and enqueued in the output queue. If more messages are
found for the same memory block, the directory then obtains their addresses and
invalidates the messages in the queues, and generates a DATA_ACK message with
the appropriate destinations. Hardware is designed such that, in the worst case
where there are no similar messages, the extra delay suffered by the message is
just one hardware clock cycle.
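The combining step can be sketched in C as follows. The sequential loop stands in for the parallel associative search of the input queues, and the message layout, the enumeration values and the function name combine_data_reqs are assumptions made for illustration; they do not describe the actual hardware.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum msg_type { DATA_REQ, OWNR_REQ, INVAL };

typedef struct {
    enum msg_type type;
    uint32_t block;   /* requested memory block */
    unsigned src;     /* requesting node */
    int valid;        /* cleared once the request has been absorbed */
} message;

/* Serves the DATA_REQ at index `head` and absorbs every other valid
   DATA_REQ for the same block. Returns the destination set of the
   single DATA_ACK as a bit mask (bit i = node i). */
static uint64_t combine_data_reqs(message *q, size_t n, size_t head) {
    assert(head < n && q[head].valid && q[head].type == DATA_REQ);
    uint64_t dests = 0;
    for (size_t i = 0; i < n; i++) {
        if (q[i].valid && q[i].type == DATA_REQ &&
            q[i].block == q[head].block && q[i].src < 64) {
            dests |= UINT64_C(1) << q[i].src;
            if (i != head)
                q[i].valid = 0;  /* invalidate the absorbed request */
        }
    }
    q[head].valid = 0;           /* the head request is now served */
    return dests;
}
```

If no other request matches, the returned set contains only the original requester, which corresponds to the single-destination case described above.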
Figure 3.2: All Input Queues at an Input
3.3 Block Capture
In the absence of any enhancements, data is transferred after a miss; a miss is
allowed to occur and then the requesting node must wait for the data to be transferred
from the home node. In the SOME-Bus, the combination of cache prefetching
and a technique resembling data forwarding [48] can reduce or even eliminate
the waiting time due to a miss. This is achieved by
improving the design of the caches and the address filters at the receiver input to
capture any data block that happens to be within some predefined neighborhood.
Under the original design described earlier, the address decoder at each receiver
of a node monitors the address list in the destination field of each message and
signals the buffer controller to enqueue the incoming message if the node address
is included in the message destination list. In addition to the destination list, each
acknowledge message also contains the address of the associated data block. The
inclusion of the block address is necessary to allow multiple outstanding misses
to exist (and be served) simultaneously, either from a single thread or multiple
threads. The address matching hardware (called “Address Decode” in Figure 2.3)
can be augmented to monitor the block address of the incoming message, when the
message is of the proper type. The buffer controller is instructed to enqueue any
message of type DATA_ACK whose data block address is within the predefined
neighborhood.
A fundamental issue is how the neighborhood is defined; how the receiver at
a node B decides which block is useful if node B has not asked for it already. In
general, nodes may declare (and change) what they consider their neighborhood.
For example, after synchronization, a group of nodes works together to do some
matrix operation. Before starting the calculations, they determine who their neigh-
bors are, and inform their own network controller. From there on, any block in-
jected in the network by anyone in the neighborhood is captured by everyone in
the neighborhood. This concept can be refined by designing the network controller
to perform tests and come to a decision to acquire a block. These tests use address
information from the block and a list of addresses generated by the local processor.
A neighborhood may simply be defined using a distance in the address space. A
number of least significant address bits are masked out of the address comparison
to select the neighborhood range. The center of the neighborhood may be placed
at some specific address (which may be different for each processor), or multiple
neighborhoods may be defined around a number of fixed addresses. The processor
uses special capture instructions to identify its future needs in terms of neighbor-
hood locations and sizes. Even more flexibility may be achieved by defining the
neighborhood centers as the block addresses of some blocks currently in the cache.
All the blocks in the cache or only a number of the most recently used blocks may
be used to define neighborhood centers.
Figures 3.3 and 3.4 show the design of the network input that supports block
capture in the two ways described above. A table is maintained at the input
queues (one table entry per input channel), where each entry has four fields (low,
high, mask, valid). The instruction select_channel channel_num is used to select
the input channel for which a new neighborhood is to be defined. The instruction
capture_setup b0,b1,mask (the three arguments would be in registers) would
then write the supplied information to the selected entry. The mask can be
encoded into any regular expression to define the neighborhood. Once the neighborhoods
are defined, all entries of the table would be accessed simultaneously. An
incrementer was used for the mask in the simulations, so that block b is captured
from an input channel if b0 <= b <= b1 and (b − b0) mod mask = 0. The instruction
capture_clear channel_num is used to invalidate a neighborhood.
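The capture test used in the simulations can be written out as a small C model. The struct and the C functions below mirror the four table fields and the capture_setup and capture_clear instructions, but they are only a sketch: the field widths and the function signatures are assumptions, and the selection of the channel entry is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* One entry per input channel, with the four fields named in the text. */
typedef struct {
    uint32_t low;    /* b0: first block of the neighborhood */
    uint32_t high;   /* b1: last block of the neighborhood */
    uint32_t mask;   /* stride used by the incrementer */
    int valid;
} capture_entry;

/* Models capture_setup b0,b1,mask on the already selected entry. */
static void capture_setup(capture_entry *e, uint32_t b0, uint32_t b1,
                          uint32_t mask) {
    assert(b0 <= b1 && mask > 0);
    e->low = b0;
    e->high = b1;
    e->mask = mask;
    e->valid = 1;
}

/* Models capture_clear: the neighborhood is invalidated. */
static void capture_clear(capture_entry *e) {
    e->valid = 0;
}

/* Capture test: b0 <= b <= b1 and (b - b0) mod mask = 0. */
static int should_capture(const capture_entry *e, uint32_t b) {
    return e->valid && b >= e->low && b <= e->high &&
           (b - e->low) % e->mask == 0;
}
```

For instance, with b0 = 100, b1 = 200 and mask = 4, blocks 100, 104, ..., 200 would be captured from that channel, while block 102 would not.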
Figure 3.3: Capture Hardware at Each Input of the Receiver
Figure 3.4: Global Control of Capture Hardware
If the processor makes a memory reference to a block that has just been captured,
it will encounter a miss because the block has not been transferred to the
cache yet. Hence, the implementation of the input queues was changed to accommodate
this problem. Whenever there is a read miss in the cache, the input queues
are searched for the block. The input queues, again, are searched based on an at-
tribute (in this case, the block number) using the AAN. Only a miss at the input
queues causes the cache controller to generate a message for a remote read.
In addition, Figure 3.3 shows a FIFO (called capture FIFO) for the most recently
fetched blocks. When a miss or prefetch is encountered, and the cache instructs
the network controller to send a DATA_REQ message into the network, the cache
also sends the block number to the FIFOs at all inputs. The block number of any
passing message is compared in parallel with all entries in the FIFO to decide if
that block belongs to any neighborhood. A number of least significant bits may be
masked out of the address comparison to select the neighborhood limits. (Invali-
dated blocks need not be removed from the FIFO). To start the capture process, a
processor issues a few prefetches to place in the capture FIFOs the “seed” blocks
that will “attract” more blocks as the application progresses.
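The comparison against the capture FIFO can be sketched as follows. The FIFO depth of 8, the names, and the use of a right shift to mask out the least significant address bits are assumptions of this sketch; the loop stands in for the parallel comparison performed by the hardware.

```c
#include <assert.h>
#include <stdint.h>

#define FIFO_DEPTH 8   /* assumed depth, chosen for illustration */

typedef struct {
    uint32_t block[FIFO_DEPTH];  /* most recently fetched block numbers */
    unsigned count;              /* number of valid entries */
    unsigned next;               /* slot that is overwritten next */
    unsigned mask_bits;          /* low-order bits ignored in the compare */
} capture_fifo;

static void fifo_init(capture_fifo *f, unsigned mask_bits) {
    assert(mask_bits < 32);
    f->count = 0;
    f->next = 0;
    f->mask_bits = mask_bits;
}

/* Called when the cache sends a DATA_REQ (miss or prefetch): the block
   number becomes a neighborhood center. The oldest entry is overwritten
   when the FIFO is full. */
static void fifo_push(capture_fifo *f, uint32_t b) {
    f->block[f->next] = b;
    f->next = (f->next + 1) % FIFO_DEPTH;
    if (f->count < FIFO_DEPTH)
        f->count++;
}

/* A passing block belongs to a neighborhood if it equals some FIFO
   entry once the low-order bits are masked out. */
static int fifo_matches(const capture_fifo *f, uint32_t b) {
    for (unsigned i = 0; i < f->count; i++)
        if ((f->block[i] >> f->mask_bits) == (b >> f->mask_bits))
            return 1;
    return 0;
}
```

A single seed block pushed by a prefetch then attracts every passing block that shares its high-order address bits.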
At the compiler level, it would be desirable to have directives that specify that
the blocks of a whole array can be captured. For example, a typical Parallel-C
declaration such as the following instructs the compiler to create individual sub-arrays
on all processors and maintain proper pointer information (so that px = x; *(px++);
is possible even as we cross processor boundaries).
shared double x[1024];
This declaration can be changed as shown below. The keyword capture would
instruct the compiler to insert instructions in the C-startup code to create capture
specifications within each node, so that any data acknowledge message containing
a block of array x is captured by all processors.
capture shared double x[1024];
Block capture has its maximum effect when all accesses to blocks within a
specific group are visible to all processors in a specific set. Since blocks reside
within the memory contained within some node it is possible that some block ac-
cesses are local with respect to some processor and therefore normally would not
be visible to other processors. Normally, a reference to a local block causes a local
cache miss, which results in a data request getting delivered by the local network
controller to the local directory. The local network controller is involved, so that lo-
cal and remote accesses to a particular location are serialized properly, to maintain
sequential consistency [29]. A data-request message is enqueued in the network
controller queue and is eventually discarded at the proper time since the network
controller knows the address range of local blocks. If the local block is in SHARED
state, a local miss does not cause any messages to appear in the network, and con-
sequently, that read is not visible to the other nodes in the system. This function,
although appropriate under normal operation, tends to defeat the block-capture
function. To avoid this problem, it is required that all blocks that can be captured
be specified by a capture specification. Furthermore, any data-request message in
the network controller queue, which is due to a local miss, is sent into the network
if the block address satisfies any capture specification. The message is discarded
(and not transmitted) if no capture specification is satisfied.
3.4 Block Prefetch
Since a regular reference that encounters a miss will cause the processor to be-
come blocked, it is necessary to have another type of reference that simply causes
the cache controller to behave as if a miss were encountered but without block-
ing the processor. Such an instruction, usually called prefetch, is already available
and its functionality is used in superscalar processors to allow several outstand-
ing misses due to several issued instructions (An example of such a processor is
the MIPS R10000 with its PREF instruction). It can be produced by the compiler
when a reference without assignment or test is encountered. The prefetch instruc-
tion causes a reference similar to a read reference with the following exceptions:
1. If a hit is encountered, the instruction has no effect on the processor, and the
block becomes the most recently used.
2. If a miss is encountered and the block is not marked pending, a DATA_REQ
message is sent, the block is marked pending, and the processor continues
executing instructions.
3. If a miss is encountered and the block is marked pending, the processor is
blocked.
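The three cases above can be summarized as a small state machine. The sketch below is a model of the rules only; the state and outcome names are hypothetical, and the update of the replacement order on a hit is not modeled.

```c
#include <assert.h>

enum block_state { NOT_PRESENT, PENDING, PRESENT };
enum outcome { CONTINUE, SEND_DATA_REQ_AND_CONTINUE, BLOCK_PROCESSOR };

/* Applies the three prefetch rules to one cache block and reports
   what the processor and the cache controller do. */
static enum outcome prefetch(enum block_state *s) {
    assert(s != 0);
    switch (*s) {
    case PRESENT:       /* rule 1: hit, no effect on the processor */
        return CONTINUE;
    case NOT_PRESENT:   /* rule 2: miss, block not marked pending */
        *s = PENDING;
        return SEND_DATA_REQ_AND_CONTINUE;
    case PENDING:       /* rule 3: miss, block already marked pending */
    default:
        return BLOCK_PROCESSOR;
    }
}

/* Arrival of the DATA_ACK makes the block available in the cache. */
static void data_ack_arrived(enum block_state *s) {
    *s = PRESENT;
}
```

A first prefetch of an absent block sends the request and lets the processor continue; a second prefetch of the same block before the data arrives blocks the processor.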
Prefetching alone would potentially be sufficient, if it is always known what to
prefetch and when. But this approach ignores the effect on the network:
1. If processors prefetch different blocks and prefetches occur at about the same
time, then on traditional switch-based networks there would appear bursts of
traffic causing congestion in the network. In the SOME-Bus the congestion is
smaller, but still the directory utilization would increase causing an increase
in the mean queue waiting time at the directory and a corresponding increase
in latency.
2. If two or more processors attempt to prefetch the same block at about the
same time, then the remote directory will create separate data acknowledge
messages which will be transmitted sequentially. Consequently, the prefetch
whose data acknowledge message was transmitted last will observe a much
larger latency than the other prefetches. These large (and variable) latencies
may completely eliminate the prefetch benefits. In fact, prefetching relies
on the ability of the programmer (or the compiler) to correctly predict laten-
cies. However, congestion at the network or at the remote processor results
in latencies being unpredictable, causing prefetches to be ineffective. (In the
SOME-Bus, data acknowledge combining may partially alleviate this prob-
lem, if prefetches occur close enough in time).
Case 2 is the more realistic, since it is possible that all processors may need
to access shared data stored in one or a few processors, or that the algorithm requires
access to shared data in a particular sequence, which is identical for several
processors.¹ Consequently, prefetching alone is not only insufficient but can potentially
cause a decrease in performance by increasing the network traffic. In fact,
it is very important to decrease network traffic in the SOME-Bus (as is true for any
network) because each processor has only one output channel into the network,
and congestion can appear at that channel.
Simulation results have shown that if a processor becomes a hot spot, there
is a marked increase in the queue waiting time at the output channel causing an
increase in latency. The benefit of the SOME-Bus compared to other architectures
is that traffic related to other nodes not associated with the hot spot is not affected.
The most effective approach is to use prefetching and block capture in combina-
tion. Block capture reduces network traffic by making many prefetches unneces-
sary. Prefetch instructions may still be included in the program but they result in
cache hits and have minimal effect on performance.
¹ In fact, the SOME-Bus supports applications of this type, which exhibit a large degree of sharing and
small granularity. These applications cause a lot of communication between processors and become
completely ineffective when executed on other architectures. However, the SOME-Bus supports
them successfully.
The cost of write misses can be hidden through the introduction of an addi-
tional exclusive prefetch instruction (e_prefetch), which causes the cache to send
an OWNR_REQ message to the proper directory without blocking the processor.
The e_prefetch instruction causes a reference similar to a write reference with the
following exceptions:
1. If a hit is encountered, the instruction has no effect on the processor, and the
block becomes the most recently used.
2. If a miss is encountered and the block is not marked pending, an OWNR_REQ
message is sent, the block is marked pending, and the processor continues
executing instructions.
3. If a miss is encountered and the block is marked pending, the processor is
blocked.
CHAPTER 4. CASE STUDIES
4.1 Matrix-Vector Multiplication on the SOME-Bus
In this section and the following ones, N and P indicate the problem size and
the number of processors, respectively. For ease of discussion N and P are assumed
to be powers of 2, and N = P · 2^b, where b > 1. Time is measured in clock
cycles. Also, it is assumed that in one clock cycle, a channel transfers one byte and
a processor executes one instruction.
For DSM implementations, the cache block size C (in double-precision words)
is assumed to be a power of 2, the transfer time of a message that carries a data
block is TL and the transfer time of a message that does not have a cache block is
TS. Messages of the first type include data acknowledge messages while request
messages and invalidations belong to the second type. When a regular memory
reference causes a remote miss, the processor is blocked and stops executing in-
structions. The communication cost of a local miss is negligible compared to the
cost of a remote miss. Therefore, the total execution time is the sum of the network
communication time and the time that the processor is busy executing instructions.
4.1.1 Matrix-Vector Multiplication Using Message Passing
Assuming that the multiplication is performed by row-wise striping, each
processor has N/P rows of matrix A and N/P elements of vector x. Each processor
broadcasts its part of vector x in time O(N/P). Each processor calculates N/P elements
of vector y in time TC, which is O(N^2/P). For comparison, the performance
of a similar parallel algorithm on a mesh with P processors and cut-through rout-
ing is also examined. The multiplication is done by checkerboard partitioning and
the vector x is initially stored in one column of processors. The communication
time is O((N/P)(1 + log2(√P))) due to the initial broadcast of the x vector blocks
over rows and the collection of partial results.
4.1.2 Matrix-Vector Multiplication Using DSM
The performance of the parallel multiplication algorithm, which calculates y =
Ax using distributed shared memory, is examined in the following. When executing
algorithms of this type, each processor reads data, which may be local or remote,
and calculates a part of the overall result. Performance is adversely affected
by cache misses: capacity misses tend to have a small effect as cache size becomes
relatively large; coherence misses also have a minor effect as processors mostly
read remote data. The most pronounced effect, which reduces performance, is due
to compulsory (first-reference) misses. Since a block cannot be in the cache on the
very first access, in traditional architectures the miss rate due to compulsory misses
remains constant as cache size is increased and becomes a major cause of perfor-
mance loss. The SOME-Bus architecture allows the enhancement of the caches so
that a block can be found in the cache even on the very first access. This section
examines several such enhancements and analyzes their effect on performance,
using the matrix-vector multiplication algorithm.
Since most algorithm analyses focus on communication time, in the follow-
ing analysis we begin by examining the time spent transferring data through the
network. However, ways to overlap the network communication time with processing
time are presented at the end of this section. This overlap is most effective
when the data in one block receives processing time which is approximately equal
to, or larger than, the time necessary to obtain a copy of that block from a remote
directory.
It is assumed that the rows of array A are in the proper nodes, either in the
node’s local memory or in its cache. It is also assumed that neither vector x nor
vector y is in the proper node, so that all initial references to elements of the vectors
cause misses. The basic multiplication program is shown in Figure 4.1. Each
processor is assigned N/P rows and performs the multiplication with vector x to
produce elements of vector y.
double A[N][N], x[N], y[N];
int NumberOfRows = N/P;   /* to each processor */
parallel for (p=0; p < P; p++)
{
  FirstRow = p * NumberOfRows;
  multiply(A, x, y, FirstRow, NumberOfRows);
}
multiply (A, x, y, f, d)
{
  for (i=f; i < f + d; i++) /* LOOP A */
  {
    y[i]=0; /* possible write miss */
    for (j=0; j<N; j++) /* LOOP B */
    {
      y[i] += A[i][j] * x[j]; /* read misses */
    }
  }
}
Figure 4.1: Matrix-Vector Multiplication Program
If C is the number of elements of vector x in a single cache block, the process-
ing time per block is approximately TC = 4C, since there are approximately four
instructions in each iteration of loop B in Figure 4.1. Since it is assumed that the
number of instructions per second is approximately equal to the number of bytes
per second transferred by a channel, the time to transfer a block is also approxi-
mately TL = 4C (and slightly larger due to the header bytes). Consequently, the
time in network clock cycles to transfer a DATA_ACK (data acknowledge) mes-
sage containing a block of vector x is approximately equal to the processing time
of this (or any other) block of vector x.
4.1.3 Cache Behaviour: No Enhancements
In Figure 4.1, when index i is equal to f , the loop B causes N/C global misses
at each processor. Assuming that vector x fits in the caches, no more read misses
on vector x will occur for i > f. The main focus in the following is on the misses
caused by loop B during the first iteration of loop A (i = f), which Figure 4.2 shows separately.
multiply (A, x, y, f, d)
{
  y[f] = 0; /* write miss */
  for(j=0; j<N; j++) /* LOOP A */
    y[f] += A[f][j] * x[j]; /* read misses */
  for(i=f+1; i< f + d; i++) /* LOOP B */
  {
    y[i] = 0; /* possible write miss */
    for (j=0; j<N; j++) /* LOOP C */
      y[i] += A[i][j] * x[j]; /* no read misses */
  }
}
Figure 4.2: Matrix-Vector Multiplication Program (Showing Misses)
The following analysis assumes that all N/C misses are remote, resulting in network
traffic. Loop A is executed by all processors at the same time. Read misses oc-
cur when the following references are made: x[0], x[C], x[2C],. . . , x[N−C]. With no
special hardware support, each home node will receive DATA_REQ (data request)
messages for the same block from P−1 nodes and will send the same block sequen-
tially to each requester. The last processor to be served will wait TS+(P−2)TL time
and will receive the requested data block at TS+(P −1)TL. Once this phenomenon
occurs due to all processors referencing x[0], they fall in step: the processor who
received service earlier on the first miss proceeds with its computation earlier and
will encounter the next miss earlier. Assuming that successive memory blocks of
vector x reside in different processors, all processors will reference x[C], and these
references will occur TS apart, and they will be directed to a directory different
from the one involved in the previous miss. The processor who received service
last on the first miss, receives data blocks within TS+TL time of each of the remain-
ing requests and will receive the last block at time TS+(P−2)TL+(N/C−1)(TS+TL)
since there are N/C misses on each processor.
4.1.4 Data-Acknowledge Message Multicasting
Using this improvement, the directory response times can be reduced. When all
processors encounter the first miss on x[0], they all send DATA_REQ messages to
the node who is home to that memory location. Since there will be some variation
on the transmission time of those messages, one will arrive first (say from node
Pa) and will be processed by the receiving directory. While the home directory
is sending a DATA_ACK to node Pa, all DATA_REQ messages will be enqueued,
and the home directory will subsequently send one DATA_ACK message to all
remaining nodes. Node Pa will receive the requested block in TS + TL time and
all other nodes will receive the requested block in TS + 2TL time. Subsequently,
node Pa will complete its computation earlier and encounter the next miss earlier.
During the time that the miss from node Pa is serviced by some home directory,
the requests from all other nodes (due to a miss on the same block) arrive at the
same directory. Then, all nodes will receive DATA_ACK messages (TS + TL) time
after their request. The last DATA_ACK message will be delivered at time TS +
2TL + (N/C − 1)(TS + TL).
In both cases described above, loop A will cause all read misses to vector x.
Assuming that both x and y can fit in the caches, there will be no more read misses
after loop A terminates. Loop B will cause (d/C) − 1 write misses (d is the number
of elements of vector y which will be calculated by each processor). These write
misses will cause OWNR_REQ (ownership request) messages to be sent to different
nodes, and therefore loop B requires ((d/C) − 1)(TS + TL) time.
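The expressions derived for the two cases can be evaluated numerically. The following is a minimal sketch in C that only encodes the formulas stated above; the function names and the use of floating-point values for TS and TL are illustrative assumptions and are not taken from the simulator.

```c
/* Time at which the last-served processor receives its final block of
 * vector x when each DATA_REQ is answered separately (Section 4.1.3):
 * TS + (P-2)*TL + (N/C - 1)*(TS + TL).
 * Assumes N is a multiple of C and P >= 2. */
double read_time_unicast(int N, int P, int C, double TS, double TL)
{
    return TS + (P - 2) * TL + (N / C - 1) * (TS + TL);
}

/* Same quantity with DATA_ACK multicasting (Section 4.1.4):
 * TS + 2*TL + (N/C - 1)*(TS + TL). It does not depend on P. */
double read_time_multicast(int N, int C, double TS, double TL)
{
    return TS + 2.0 * TL + (N / C - 1) * (TS + TL);
}

/* Time due to the (d/C - 1) write misses on vector y, with d = N/P:
 * (d/C - 1)*(TS + TL). Assumes d is a multiple of C. */
double write_time(int N, int P, int C, double TS, double TL)
{
    int d = N / P;
    return (d / C - 1) * (TS + TL);
}
```

For example, with N = 1024, P = 16, C = 8, TS = 10 and TL = 32 (arbitrary values chosen only for illustration), the three functions return 5792, 5408 and 294 cycles respectively.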
4.1.5 Matrix-Vector Multiplication Using Block Capture
The multiply() function described above can benefit from the capture functional-
ity. Since all processors perform the same amount of work and have almost iden-
tical progress, they all encounter a miss of the same block of vector x. The first
DATA_REQ to arrive at the home node will cause a DATA_ACK message to be
sent to that requesting node, and since all other nodes have requested the same
block, their capture hardware will acquire that block from the bus and store it
in the cache. This function requires an improvement in the directory design, be-
cause the directory controller must be able to ignore all other requests for the same
block. Such an event will occur in the example discussed here, since all except the
first DATA_REQ messages arrive after the corresponding DATA_ACK has already
been composed and placed in the node output queue by the directory controller.
One way to accomplish this functionality is to design the directory to ignore all
requests that arrive before the DATA_ACK message starts transmission. This en-
tails a small risk that some requester may fail to capture and still be ignored by
the directory. (The flow control protocol will force the requesting node to resend
the message). However, as will be shown below, the real benefit of the capture
function comes from completely eliminating the need to send data requests. In the
example discussed here, the processors will progress in step with one processor
encountering a miss first and all processors acquiring the requested data block.
Since loop A causes N/C misses, the total communication time is (N/C)(TS + TL).
The lack of significant improvement is due to the fact that in loop A, misses are
allowed to occur causing the processors to remain idle while the requested data is
transferred.
The key to increasing the performance of the multiply() function is to force several
misses to occur at the same time, causing multiple directories to transmit multiple
blocks simultaneously. All such blocks will be captured simultaneously by the
capture hardware before references occur which would have caused misses.
Loop A in Figure 4.2 causes misses when the following accesses occur: x[k ∗C],
k = 0,1, . . . ,(N/C) − 1. It is necessary to rewrite loop A so that the first P misses
will be caused by different processors at the same time. Let Q = P · C be the number
of vector x elements contained in P blocks. Then, the loop
for (k = 0; k < Q; k++)
    y[f] += A[f][k] * x[k];
references the first P blocks of vector x starting with x[0]. The loop shown in
Figure 4.3 also references the first P blocks of vector x, but it starts with x[p · C],
makes all the references up to x[Q − 1], then x[0], and all the references up to
x[p · C − 1]. Figure 4.4 shows the resulting program. If this loop is executed
in parallel by all P processors, then processor p will first encounter a miss on
x[p · C]. These P references are to different blocks. Assuming that these blocks are
on different processors, all these P misses cause P DATA_REQ messages and P
for (k = 0; k < Q; k++) /* loop A */
{
      t = (k + p * C) mod Q;
      y[f] += A[f][t] * x[t];
}
Figure 4.3: Matrix-Vector Multiplication (P Accesses)
DATA_ACK messages in parallel. All DATA_ACK messages will be acquired by
all P processors and consequently every processor will only encounter one miss
during the execution of loop A in Figure 4.3. The size of the capture neighborhood
is equal to P blocks. Since this loop references Q elements of vector x, it must be
executed N/Q times (with the proper starting index into vector x). Since loop A
in Figure 4.4 causes N/Q = N/(P · C) misses, the total time due to read misses
is (N/(P · C))(TS + TL). The program in Figure 4.4 also encounters d/C write
misses (where d = N/P ), and therefore the total network time due to all misses is
2(N/(P · C))(TS + TL).
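The two properties that the rotated loop relies on can be checked mechanically: at k = 0 the P processors reference P distinct blocks, and each processor still visits every one of the Q elements exactly once. The following is a small sketch of such a check; the helper names and the fixed array bounds are assumptions introduced only for this illustration.

```c
/* Index of vector x referenced by processor p at step k of the rotated
 * loop of Figure 4.3: t = (k + p*C) mod Q, with Q = P*C. */
int rotated_index(int k, int p, int C, int P)
{
    int Q = P * C;
    return (k + p * C) % Q;
}

/* Returns 1 if the first references (k = 0) of the P processors fall in
 * P distinct cache blocks. Assumes P <= 64. */
int first_refs_distinct(int C, int P)
{
    int seen[64] = {0};
    for (int p = 0; p < P; p++) {
        int block = rotated_index(0, p, C, P) / C;
        if (seen[block])
            return 0;
        seen[block] = 1;
    }
    return 1;
}

/* Returns 1 if processor p references each of the Q elements exactly
 * once over k = 0..Q-1. Assumes Q <= 1024. */
int covers_all(int p, int C, int P)
{
    int Q = P * C;
    int count[1024] = {0};
    for (int k = 0; k < Q; k++)
        count[rotated_index(k, p, C, P)]++;
    for (int t = 0; t < Q; t++)
        if (count[t] != 1)
            return 0;
    return 1;
}
```

With C = 4 and P = 8, processor 3 starts at x[12] (block 3) and wraps around to x[0] at k = 20, so the result accumulated in y[f] is unchanged while the first misses are spread over all home nodes.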
4.1.6 Matrix-Vector Multiplication Using Block Prefetch
In the execution of loop A in Figure 4.4, there is network activity only when k
is 0. Still, the performance of loop A is reduced because processors must wait after
a miss for the data to be transferred from the home node. To further increase the
performance of the loop, it is necessary to cause more data to be transferred before
a miss occurs at all. Loop A in Figure 4.4 is augmented with a reference that causes
a prefetch instruction. Figure 4.5 shows the resulting multiply() function. When k
is equal to C, all P misses have already occurred and data has been transferred.
At that time, the execution of the prefetch instructions by all processors causes all
the blocks which would have been necessary on the next iteration of the A loop
multiply (A,x,y,f,d)
{
  y[f] = 0; /* write miss */
  for (j = 0; j < N/Q; j++) /* LOOP A */
    for (k = 0; k < Q; k++) /* LOOP B */
    {
        t = (k + p * C) mod Q + j * Q;
        y[f] += A[f][t] * x[t];
    }
  for (i = f+1; i < f + d; i++) /* LOOP C */
  {
    y[i] = 0; /* possible write miss */
    for (j = 0; j < N; j++) /* LOOP D */
      y[i] += A[i][j] * x[j]; /* no read misses */
  }
}
Figure 4.4: Matrix-Vector Multiplication Program Using Block Capture
to be transferred immediately and stored in all caches well in advance of the time
that they are actually needed. No more misses occur after the j = 0 iteration of
the A loop and therefore the total network time due to read misses is (TS + TL).
The program in Figure 4.5 still encounters d/C write misses (where d = N/P ), and
therefore the total network time due to all misses is ((N/(P · C)) + 1)(TS + TL).
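The prefetch target and the two network-time totals can be written out as follows. This is a hedged sketch that transcribes the index expression of Figure 4.5 and the totals stated for block capture and for block prefetch; the function names are hypothetical.

```c
/* Element of vector x whose reference triggers the prefetch issued by
 * processor p while working on group j (Figure 4.5):
 * x[p*C + (j+1)*Q], with Q = P*C. */
int prefetch_index(int p, int j, int C, int P)
{
    int Q = P * C;
    return p * C + (j + 1) * Q;
}

/* Total network time with block capture only (Section 4.1.5):
 * 2 * (N/(P*C)) * (TS + TL). Assumes N is a multiple of P*C. */
double net_time_capture(int N, int P, int C, double TS, double TL)
{
    return 2.0 * (N / (P * C)) * (TS + TL);
}

/* Total network time with capture and prefetch (Section 4.1.6):
 * (N/(P*C) + 1) * (TS + TL). */
double net_time_prefetch(int N, int P, int C, double TS, double TL)
{
    return (N / (P * C) + 1) * (TS + TL);
}
```

Processor p therefore touches block p + (j+1)·P, so the P processors together request all P blocks of the next group. For N = 1024, P = 16, C = 8, TS = 10 and TL = 32 (illustrative values), the totals are 672 and 378 cycles.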
4.1.7 Block Prefetch in Exclusive State
A closer look into the operation of the program in Figure 4.5 shows that the
cost of most write misses can be hidden through the introduction of an addi-
tional exclusive prefetch instruction (e_prefetch), which causes the cache to send
an OWNR_REQ message to the proper directory without blocking the processor.
This instruction requires very careful use as it may violate sequential consis-
tency. It also requires the use of a compiler directive to cause the compiler to trans-
late a regular assignment to an e_prefetch instruction.
#define Q (P * C)
multiply (A,x,y,f,d)
{
  y[f] = 0; /* write miss */
  for (j = 0; j < N/Q; j++) /* LOOP A */
    for (k = 0; k < Q; k++) /* LOOP B */
    {
      t = (k + p * C) mod Q + j * Q;
      y[f] += A[f][t] * x[t];
      if ((k == C) && (j != N/Q-1)) x[p*C+(j+1)*Q]; /* prefetch reference */
    }
  for (i = f+1; i < f + d; i++) /* LOOP C */
  {
    y[i] = 0; /* possible write miss */
    for (j = 0; j < N; j++) /* LOOP D */
      y[i] += A[i][j] * x[j]; /* no read misses */
  }
}
Figure 4.5: Matrix-Vector Multiplication Program Using Prefetch
In Figure 4.5, it is clear that the blocks containing the elements of vector y can be
requested in exclusive state during the execution of loop A, since there is a signifi-
cant amount of time where the processors are executing instructions and there is no
network activity. Figure 4.6 shows that during the execution of loop A, (Q−2)N/Q
blocks of vector y are prefetched in exclusive state. The few remaining blocks can
also be prefetched in exclusive state using a similar technique during the execution
of loop C in Figure 4.6. If all write misses are hidden, the only network time is due
to the initial read misses and therefore the total network time is (TS + TL). The
dependency of the total execution time on network time is therefore O(1).
While the matrix multiplication algorithm is able to take advantage of the SOME-
Bus architecture, not all algorithms are as regular as this one, and processors will
not always progress in such a strongly coordinated fashion. Since synchronization
is very inexpensive on the SOME-Bus (its cost is O(1)), a following section (Section
multiply (A,x,y,f,d)
{
  s = (f+1) * C;
  y[f] = 0; /* write miss */
  for (j = 0; j < N/Q; j++) /* LOOP A */
    for (k = 0; k < Q; k++) /* LOOP B */
    {
      t = (k + p * C) mod Q + j * Q;
      y[f] += A[f][t] * x[t];
      if ((k == C) && (j != N/Q-1)) x[p*C+(j+1)*Q]; /* prefetch reference */
      if (((k mod C) == 0) && (k/C > 1))
      {
        y[s] = 0; /* e_prefetch instruction */
        s += C;
      }
    }
  for (i = f+1; i < f + d; i++) /* LOOP C */
  {
    y[i] = 0; /* possible write miss */
    for (j = 0; j < N; j++) /* LOOP D */
      y[i] += A[i][j] * x[j]; /* no read misses */
  }
}
Figure 4.6: Matrix-Vector Multiplication Program Using e_prefetch
4.3) examines whether it is useful to make the processors synchronize more often
to take advantage of block capture.
4.2 Matrix-Matrix Multiplication on the SOME-Bus
The performance of the parallel matrix-matrix multiplication algorithm, which
calculates Z = X · Y , is examined in this section. Let N and P indicate the matrix
size and the number of processors, respectively. If each processor has N/P rows
of matrix X and matrix Y , then the algorithms discussed in section 4.1 are directly
applicable. Using the algorithm of Figure 4.6 with the prefetch instructions, the
communication time of the matrix-matrix multiplication algorithm is O(N).
A more likely distribution of the matrix elements is in blocks rather than rows
over the processors. To simplify the discussion, it is assumed that $P = R^2$, so
that the matrices may be allocated logically on a square grid of processors with
$R = \sqrt{P}$ processors on each side. Each processor contains a square block of each
of matrices X and Y, of size $D^2$, and calculates a block of matrix Z of size $D^2$,
where $D = N/R$. The direct application of the matrix multiplication algorithm
shown in Figure 4.7 requires each processor to access D rows and D columns, or
a total of $2 \cdot (N - D) \cdot D/C$ blocks (since some blocks are local to the processor).
Since $D = N/R$, the communication cost is $O(N^2/\sqrt{P})$ and it is due to compulsory
(first-reference) misses.
double X[N][N], Y[N][N], Z[N][N];
int D = N / R; /* to each processor */
parallel for (p = 0; p < P; p++)
  multiply(p,X,Y,Z,D);
/* p is processor number, D is number of rows (and columns) in block */
multiply (p,X,Y,Z,D)
{
  int px,py;/* processor location in logical grid */
  int i, j, k, FirstCol, FirstRow;
  py = p / R;
  px = p mod R;
  FirstCol = D * px;
  FirstRow = D * py;
  for (i = FirstRow; i < FirstRow + D; i++)
    for (j = FirstCol; j < FirstCol + D; j++)
      for (k = 0; k < N; k++)
        Z[i][j] += X[i][k] * Y[k][j];
}
Figure 4.7: Matrix-Matrix Multiplication Program
4.2.1 Matrix-Matrix Multiplication Using Block Capture
If the cache at each node has enough capacity to hold the required blocks,
then compulsory cache misses are the primary reason for performance loss. In
the SOME-Bus architecture, the cache enhancements described in Chapter 3 can
also be used here to eliminate most compulsory misses. The multiplication algo-
rithm can be rewritten to take advantage of the block-capture hardware. Figure 4.8
shows that each processor starts the row-column multiplication at a different starting
value of index k, encountering misses only while accessing elements in the first
$D^2$ block of either matrix X or matrix Y (when index kt is less than D). When a
processor accesses elements corresponding to values of index kt greater than or equal
to D, no miss will occur since the relevant block has been previously accessed by
another processor and has been captured by this processor.
Scheduling the calculations properly allows a processor to access a block cap-
tured recently. Although some unneeded blocks may be captured, the cache re-
placement algorithm will remove them from the cache if not accessed by the pro-
cessor soon after the capture. The number of blocks that may be captured and the
related cache replacement policy are discussed later in this section. Since only the
access of $D^2$ elements of matrices X and Y causes misses, the number of blocks that
must be read by a processor is $2 \cdot D^2/C = 2 \cdot N^2/(C \cdot R^2)$, and the communication
cost of each processor is $O(N^2/P)$.
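The reduction in the number of blocks that each processor must read can be illustrated by evaluating the two block counts side by side: the count for the direct algorithm of Figure 4.7 and the count with block capture. The sketch below only transcribes those two expressions; the function names are assumptions made for this illustration.

```c
/* Blocks read remotely by one processor with the direct algorithm
 * (Figure 4.7): 2*(N - D)*D/C, where D = N/R.
 * Assumes N is a multiple of R and D is a multiple of C. */
long blocks_direct(long N, long R, long C)
{
    long D = N / R;
    return 2 * (N - D) * D / C;
}

/* Blocks that still cause misses with block capture: 2*D*D/C. */
long blocks_with_capture(long N, long R, long C)
{
    long D = N / R;
    return 2 * D * D / C;
}
```

For N = 64, R = 4 and C = 8 the counts are 192 and 64, a ratio of R − 1 = 3, which is consistent with the change from O(N^2/√P) to O(N^2/P).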
Figure 4.8 shows that each processor accesses D rows of matrix X from processors
on the same row of the logical $\sqrt{P} \times \sqrt{P}$ processor grid and D columns of
matrix Y from processors on the same column of the logical grid. Once a block of
matrix Y is read, all its elements are used in the calculation of the result, in order
to avoid creating a burst of misses when a column is first accessed. Once the block
rows and columns are specified, the multiplication and accumulation steps start
double X[N][N], Y[N][N], Z[N][N];
int D = N / R; /* to each processor */
parallel for (p = 0; p < P; p++)
  multiply(p,X,Y,Z,D);
/* p is processor number, D is number of rows (and columns) in block */
multiply (p,X,Y,Z,D)
{
  int px, py; /* processor at location x, y */
  int c, i, j, jt, k, kt, FirstCol, FirstRow;
  py = p / R;
  px = p mod R;
  /* calculate ipx, jpy */
  FirstCol = D * px;
  FirstRow = D * py;
  for (i = FirstRow; i < FirstRow + D; i++)
    for (jt = 0; jt < D; jt += C)
    {
      j = FirstCol + jt;
      for (kt = 0; kt < N; kt++)
      {
        k = (kt +FirstCol +D +FirstRow) mod N;
        for (c = 0; c < C; c++)
          Z[i][j+c] += X[i][k] * Y[k][j+c];
      }
    }
}
Figure 4.8: Matrix-Matrix Multiplication Using Block Capture
with index k being offset by a value equal to FirstRow + FirstCol + D, so that
neighboring processors (in the logical grid) start encountering misses on different
blocks allowing other processors to capture all blocks. Consequently, a processor
encounters misses only when index kt is less than D. All the related elements
(whose access causes a miss) in matrices X and Y are referred to as the critical
submatrices with respect to the selected processor. If block capture is successful,
no misses occur when elements outside the critical submatrices are accessed.
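The effect of the offset on index k can be checked with a short sketch. It computes the first value of k used by each processor in Figure 4.8 (the value at kt = 0) and verifies that processors on the same row of the logical grid start at different positions. The function names are hypothetical, and N is assumed to be a multiple of R.

```c
/* First value of index k used by processor p in Figure 4.8 (kt = 0):
 * k = (FirstCol + D + FirstRow) mod N, with D = N/R,
 * FirstCol = D*px, FirstRow = D*py, px = p mod R, py = p / R. */
int start_k(int p, int R, int N)
{
    int D = N / R;
    int px = p % R;
    int py = p / R;
    return (D * px + D + D * py) % N;
}

/* Returns 1 if the R processors on grid row py all start at different
 * values of k, so that their first misses fall on different blocks. */
int row_starts_distinct(int py, int R, int N)
{
    for (int a = 0; a < R; a++)
        for (int b = a + 1; b < R; b++)
            if (start_k(py * R + a, R, N) == start_k(py * R + b, R, N))
                return 0;
    return 1;
}
```

With R = 4 and N = 16, the processors of the first grid row start at k = 4, 8, 12 and 0, so each one misses on a different D-wide segment while the others capture it.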
4.2.2 Matrix-Matrix Multiplication Using Block Prefetch
To eliminate the occurrence of the remaining misses, the prefetch capabilities
can be utilized, in such a way that immediately after restarting following a miss,
the processor performs the proper access which will transfer the next block that
might otherwise cause a miss. Figure 4.9 shows the resulting program.
While elements from one row of the critical submatrix in X are accessed, the
immediately following block is prefetched. Also, during access of rows of the fol-
lowing submatrix (after the critical one), the first block of the following row in the
critical submatrix is prefetched. Similarly, during the access of elements from a
column in the critical submatrix Y , the following block is prefetched. Also, during
access of the following submatrix in Y , the required blocks of the next column are
prefetched. As a result, each processor encounters two misses, one on the access of
the first block of a row in matrix X and one on the access of the first block on a col-
umn of matrix Y . Using the e_prefetch instruction, the write misses can be avoided
in a similar fashion as described in section 4.1. Consequently, the dependency of
the total execution time on network time is O(1).
4.2.3 Performance Analysis
It is important to note that although the expressions used to cause prefetch may
appear complicated, they are simple integer manipulations, requiring few instruc-
tions. In addition, they are executed very infrequently. Specifically, the column
block prefetches occur only while the first row of results is calculated, assuming
that the cache at each processor has enough capacity to hold the required blocks
and avoid thrashing. Consequently, the calculation overhead involves a few test
instructions which most often fail the test, and is therefore insignificant.
As mentioned in Chapter 3, the address decoding hardware at each receiver
Figure 4.9: Matrix-Matrix Multiplication Using Block Prefetch
is augmented to extract the address of the block contained in each DATA_ACK
message that appears on the corresponding channel and to compare it against the
addresses of the blocks in the node cache. The extent of the capture is controlled by
selecting the number of least significant bits which are masked off and are not used
in the block address comparison. Although this number may be programmable,
simulations show that the application performance does not critically depend on
the particular value. The reason is that if the calculations are properly scheduled,
capturing more blocks than absolutely necessary has no adverse effect, since some
blocks will be quickly accessed (achieving the purpose of avoiding the occurrence
of a miss) and the remaining blocks will be soon replaced as specified by the
replacement policy. The block-capture parameters together with the prefetch in-
structions and the block replacement policy can be combined to avoid most of the
possible cache misses. Blocks in the cache are assigned two additional attributes
indicating if the block is in the cache due to capture or prefetch. Simulation results
show that one effective replacement policy is the following: 1) a block is captured
if it satisfies the address comparison and it can be stored in the cache by replac-
ing another (least recently used) captured block, 2) a prefetched block replaces the
least recently used captured block, 3) a block received due to a miss replaces the
least recently used captured block, and 4) if no captured blocks exist in the cache,
then no more blocks are captured; prefetched or regular blocks may then replace
the least recently used block.
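The masked address comparison and the four replacement rules can be sketched in software as follows. This is only an illustrative model under stated assumptions: the cache is taken to be full, recency is represented by a hypothetical last_use counter, and the type and function names are not taken from the simulator.

```c
/* How a block entered the cache (the two additional attributes). */
enum kind { REGULAR, PREFETCHED, CAPTURED };

struct line {
    unsigned long addr;      /* block address */
    enum kind kind;          /* origin of the block */
    unsigned long last_use;  /* smaller value = less recently used */
};

/* Capture test: the low mask_bits bits are masked off before the
 * address on the bus is compared with the address of a cached block. */
int in_capture_neighborhood(unsigned long bus_addr,
                            unsigned long cached_addr,
                            unsigned mask_bits)
{
    return (bus_addr >> mask_bits) == (cached_addr >> mask_bits);
}

/* Returns the index of the line to replace, or -1 if the incoming
 * block is dropped (possible only for captured blocks).
 * Rules 1-3: any incoming block replaces the least recently used
 * captured block. Rule 4: if no captured block exists, captured blocks
 * are dropped and other blocks replace the least recently used block. */
int choose_victim(const struct line *c, int n, enum kind incoming)
{
    int lru_captured = -1;
    int lru_any = -1;
    for (int i = 0; i < n; i++) {
        if (lru_any < 0 || c[i].last_use < c[lru_any].last_use)
            lru_any = i;
        if (c[i].kind == CAPTURED &&
            (lru_captured < 0 || c[i].last_use < c[lru_captured].last_use))
            lru_captured = i;
    }
    if (lru_captured >= 0)
        return lru_captured;
    if (incoming == CAPTURED)
        return -1;
    return lru_any;
}
```

The model makes the consequence discussed next easy to see: once prefetched and regular blocks fill the cache, choose_victim returns -1 for every captured block and capture stops.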
A consequence of this policy is that fewer blocks are captured if the cache con-
tains several prefetched blocks. The multiplication in Figure 4.9 uses this effect to
prefetch blocks that belong to columns of matrix Y (which are used repeatedly)
and allows blocks belonging to rows of matrix X to be captured (and used fewer
times). The size of the cache (or the fraction of the cache allocated to captured and
prefetched blocks) has a direct effect on the application performance. Naturally, if
the cache size is smaller than the number of prefetched blocks, prefetched blocks
will be replaced and regular misses will occur from that point on, with subsequent
loss of performance. As the cache size increases, the presence of captured blocks
eliminates most compulsory misses and the application performance increases.
Figure 4.10: Miss Rate Reduction in Matrix Multiplication program
In the multiplication algorithm considered here, the total number of remote
blocks accessed by each processor is $2 \cdot D \cdot (N - D)/C = 2 \cdot N^2 \cdot (R - 1)/(C \cdot R^2)$.
In our simulation, a cache size of $h \cdot N^2/(C \cdot R)$, for $1 < h < 2$, is considered. The
lower limit of this range is approximately equal to the total number of blocks that
are prefetched. The upper limit is slightly higher than the total number of remote
blocks that must be received by a processor. Figure 4.10 shows the performance
of the multiplication algorithm as the cache size is increased. Appendix A gives a
detailed description of the simulator. The horizontal axis indicates the values of h.
The figure shows a reduction of the cache miss rate, mostly because of the elimination
of compulsory cache misses, a result due to the SOME-Bus cache enhancements.
When the cache size is small, the cache miss rate is significant and as a result the
processor utilization is low. As the cache size increases, there is space in the cache
for captured blocks and the miss rate drops to almost zero. A traditional archi-
tecture (with no cache enhancements) would have resulted in a flat miss rate as
larger cache sizes would have no effect on the compulsory misses. The elimina-
tion of the compulsory misses allows the processors to be busy most of the time.
The channel utilization remains approximately the same and tends to decrease as
thrashing disappears. Interestingly, the channel utilization stays quite low, an indi-
cation that applications with even more demanding communication patterns can
be supported by the SOME-Bus architecture without loss of performance.
4.3 LU Block Decomposition
The performance of the parallel LU block decomposition algorithm is exam-
ined in this section. As mentioned previously, it is more likely that the distribution
of the matrix elements is in blocks rather than rows over the processors. Let A be
the matrix to be decomposed, with N and P being the matrix size and the num-
ber of processors respectively. The algorithm for the decomposition is shown in
Figure 4.11.
$P = S^2$ is assumed so that the matrices may be allocated logically on a square
grid of processors with S processors on each side. Thus, each processor contains a
square sub-block, of size N/S, of the whole matrix A.
For K = 0, .., N−2 do
    1. Find maximum element in column K of array A from row K and below.
    2. Exchange row K and row where the maximum element is found.
    3. For all rows R from K+1 to end of the array:
        a. Find ratio u = A[R][K]/A[K][K]
        b. Store u in A[R][K]
        c. Change row R so that
            New_row_R = row_R − u*row_K for all elements to the right of the diagonal
            for (R = K + 1; R < N; R++)
            {
                u = A[R][K]/A[K][K];
                A[R][K] = u;
                for (J = K+1; J < N; J++)
                {
                    A[R][J] = A[R][J] - u * A[K][J];
                }
            }
Figure 4.11: LU Block Decomposition Algorithm
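For reference, the three steps can be collected into a sequential, single-processor routine. The following is a minimal sketch, not the parallel program analyzed in this section. It assumes that the maximum is taken by magnitude (the usual partial-pivoting convention), and the names lu_decompose, LU_N and perm are introduced only for illustration.

```c
#define LU_N 3   /* matrix size used for this illustration */

static double magnitude(double v)
{
    return v < 0.0 ? -v : v;
}

/* In-place LU decomposition following the steps of Figure 4.11.
 * perm[K] records the row exchanged with row K.
 * Returns 0 on success, -1 if a zero pivot is found. */
int lu_decompose(double A[LU_N][LU_N], int perm[LU_N])
{
    for (int K = 0; K < LU_N - 1; K++) {
        /* Step 1: find the maximum element in column K, rows K and below */
        int max_row = K;
        for (int R = K + 1; R < LU_N; R++)
            if (magnitude(A[R][K]) > magnitude(A[max_row][K]))
                max_row = R;
        if (A[max_row][K] == 0.0)
            return -1;

        /* Step 2: exchange row K and the row holding the maximum */
        perm[K] = max_row;
        if (max_row != K) {
            for (int J = 0; J < LU_N; J++) {
                double temp = A[K][J];
                A[K][J] = A[max_row][J];
                A[max_row][J] = temp;
            }
        }

        /* Step 3: store the ratio and update the rows below row K */
        for (int R = K + 1; R < LU_N; R++) {
            double u = A[R][K] / A[K][K];
            A[R][K] = u;
            for (int J = K + 1; J < LU_N; J++)
                A[R][J] = A[R][J] - u * A[K][J];
        }
    }
    perm[LU_N - 1] = LU_N - 1;
    return 0;
}
```

On return, the entries below the diagonal hold the ratios u and the remaining entries hold the upper triangular factor. In the parallel version each of the three steps is separated from the next by a barrier, as discussed below.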
The three steps of the algorithm are well defined with very little possibility of
interaction or exchange of data between them to improve performance. That is,
the steps need to be performed in that order, and a step needs to be finished before
the next one can start execution at any processor. Hence, the only way to opti-
mize these steps is individually, and separate them with barriers to synchronize
the processors. Figure 4.12 describes the terminology used in the analysis.
Given K, only the processors to the right and below the pivot processor are
involved in the calculations, so, it is sufficient to examine the behavior of the pro-
gram for a fixed pivot processor Ptt and then iterate for t = 0, . . . , S − 1. The
following is an analysis of the different stages of the algorithm with and without
optimizations:
Terminology:
    C = Cache block size in array elements
    P = SxS processors
    Subarray at each processor = DxD, where D = N/S
    Pyx = Processor at y,x
Note: Uppercase indices refer to the full NxN array and 
lowercase indices refer to the subarray allocated at some processor.
    Row K = Pivot row.
    Pivot processor = Processor containing part or
      whole of row K of A.
    Major processor row (MPR) = Ptx for x=t+1,..,S−1, 
      The row of processors to the right of pivot processor along 
      the same row.
    Major processor column (MPC) = Pyt for y=t+1,..S−1, 
      The column of processors below the pivot processor along the
      same column.
    Major processor block = the square block of processors to the right 
      of and below the Major processor column and Major processor
      row respectively.
Figure 4.12: Terminology
Step 1: Find maximum element in column K of array A from row K and below
The processors on the major column independently find the maximum value
of their part of the array column, and store the index and the value in a predeter-
mined location in their memory space. All processors synchronize and then the
pivot processor reads the maxima found by the other processors in the major col-
umn. Since there are S − t − 1 processors below the pivot processor, there will be
S − t− 1 misses.
Since all local maxima are located in different processors, the pivot processor
may issue prefetches for all of them. Then, after the last prefetch is issued, it at-
tempts to read the first local maximum. If it encounters a miss it will wait for
the block and then will read it to find the first local maximum. By that point in
time, the remaining prefetched blocks have arrived at the pivot processor and it
encounters no more misses.
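The order of operations at the pivot processor (issue all prefetches first, then read) can be sketched as follows. The sketch is an approximation only: it uses the GCC/Clang builtin __builtin_prefetch as a stand-in for the SOME-Bus prefetch instruction, a contiguous array stands in for the predetermined locations on the remote nodes, and the struct and function names are assumptions.

```c
/* Local maximum published by one processor of the major column.
 * The value is assumed to be comparable directly (e.g. a magnitude). */
struct local_max {
    double value;
    int row;
};

/* Pivot processor: request every local maximum before reading any of
 * them, so that at most the first read waits for a block transfer. */
struct local_max reduce_maxima(const struct local_max *m, int count)
{
    for (int i = 0; i < count; i++)
        __builtin_prefetch(&m[i], 0);   /* read prefetch, non-blocking */

    struct local_max best = m[0];
    for (int i = 1; i < count; i++)
        if (m[i].value > best.value)
            best = m[i];
    return best;
}
```

Since the prefetches go to different home nodes, the transfers proceed in parallel, which is why the number of visible misses drops from S − t − 1 to one.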
Step 2: Exchange row K and row where maximum element is found (row R)
All processors Ptx, x = 0, . . . , S − 1, on the row coinciding with the major row
are involved in the exchange (this is the full major row). If row R falls within the
rows contained by the processors in the major row, then no misses are encoun-
tered. Here, the worst case of row R not being in the major row of processors is
assumed, except for the last few values of K when the pivot processor is the one
at the bottom right. All processors on the full major row read their part of the remote
row R, store it locally, and write their part of row K to the remote row.
for (k = 0; k < D; k++)
{
   temp = REMOTE_ARRAY(r,k);
   REMOTE_ARRAY(r,k) = LOCAL_ARRAY(i,k);
   LOCAL_ARRAY(i,k) = temp;
}
Note that indices are lower case indicating rows and columns of the subarray
within the proper processor. The first statement in the loop causes D/C read misses
and the second statement causes D/C write misses.
Exclusive prefetch instructions are used to overlap the communication time
with calculation time.
As shown in Figure 4.13, the next cache block is prefetched before processing
the current cache block. This results in complete overlap of the calculations with
the block transfer time. The processors might still encounter the first miss as there
are not enough instructions per block to be executed, but they will see a smaller
cache miss time.
k = 0;
exclusive_prefetch(REMOTE_ARRAY(r,k));
for (k = 0; k < D; k++)
{
   if ((k % C) == 0)
     exclusive_prefetch(REMOTE_ARRAY(r,k+C));
   temp = REMOTE_ARRAY(r,k);
   REMOTE_ARRAY(r,k) = LOCAL_ARRAY(i,k);
   LOCAL_ARRAY(i,k) = temp;
}
Figure 4.13: Step 2 with Exclusive Prefetch
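A compilable approximation of the exchange loop is sketched below. It assumes that __builtin_prefetch (a GCC/Clang builtin) with its second argument set to 1, meaning prepare for a write, may stand in for the exclusive prefetch instruction. It also adds a bound check so that no prefetch is issued past the end of the row segment, which Figure 4.13 omits. The function name is hypothetical.

```c
/* Exchange D elements between a local row segment and a (logically
 * remote) row segment, requesting the next cache block for writing one
 * block ahead of its use. C is the cache block size in array elements. */
void swap_rows_prefetch(double *remote, double *local, int D, int C)
{
    __builtin_prefetch(&remote[0], 1);              /* first block */
    for (int k = 0; k < D; k++) {
        if ((k % C) == 0 && k + C < D)
            __builtin_prefetch(&remote[k + C], 1);  /* next block */
        double temp = remote[k];
        remote[k] = local[k];
        local[k] = temp;
    }
}
```

On a conventional processor the builtin is only a hint and does not acquire ownership through a directory, so the sketch illustrates the loop structure rather than the OWNR_REQ traffic itself.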
Step 3:
1. Find ratio u = A[R][K]/A[K][K]
2. Store u in A[R][K]
3. Change row R so that New_row_R = row_R − u · row_K for all elements to
the right of the diagonal.
Since there is no dependence between different rows, all rows of the array, for
R >= K + 1, may be processed simultaneously. In reality, each processor works
sequentially on all the rows that it contains. All active processors are synchro-
nized through barriers, so that in a horizontal row of processors, all processors are
working on the same row R.
Given row R, the ratio u is calculated by the processor on the major column
that contains row R. For each value of K, the processor will encounter one miss
because the pivot row is new due to the exchange in step 2. All processors on
that processor row go through a barrier and read the value. This causes S − t −
1 misses on the same block. Here, the ability of the directory is relied upon to
combine DATA_ACKs to different requesters when the same block is involved. In
the worst case, two DATA_ACK messages are sent. If the directory is idle when the
first request arrives, the directory will respond to it separately. All other requests
will arrive during that response and the directory will respond to them with one
multicast DATA_ACK. If the directory is busy when the first request arrives, all
requests will arrive while the directory is busy, and one multicast DATA_ACK is
sent.
After each processor receives the value of the ratio, it processes its row R and
accesses the corresponding part of row K. Assume that the pivot processor is not
the processor at bottom right. Then, the processors on the major row do not encounter
misses when they access row K, but the processors below the major row
will encounter misses. These processors must process their part of D rows.
for (j = 0; j < D; j++)
{
   LOCAL_ARRAY_A[r][j] = LOCAL_ARRAY_A[r][j] -
                            u * REMOTE_ARRAY[k][j];
}
All processors operate on all D elements, except the processors on the major
column, which operate only on the elements to the right of column K. To eliminate the
different behavior of those processors, the above loop can easily be converted so
that all D elements of the remote array are accessed but only the proper elements of
the local row are affected. Let Ptt be the pivot processor. Each processor below the
pivot processor must read D/C cache blocks from the corresponding processor at
the major row. Since all these blocks are coming from the same processor, the best
that can be done here is to prefetch blocks and take advantage of the ability of the
directory controller to combine and multicast DATA_ACKs.
j = 0;
prefetch(REMOTE_ARRAY[k][j]);
for (j = 0; j < D; j++)
{
  if ((j % C) == 0) 
    prefetch(REMOTE_ARRAY[k][j+C]);
  LOCAL_ARRAY_A[r][j] = LOCAL_ARRAY_A[r][j] -
                         u * REMOTE_ARRAY[k][j];
}
Here, there are enough instructions per cache block so that the processing time
and transfer time are approximately equal and only the first miss is visible.
The number of misses in each step of the algorithm without and with optimiza-
tions is given in Table 4.1, for a given K and t = K/D.
Table 4.1: LU Decomposition Performance Summary
Step                 Without Optimizations      With Optimizations
1.  Find maximum     S-t-1                      1
2.  Exchange         D/C                        1 or D/C with reduced cost (a)
3a. Find u           1                          1
3b. Procs read u     (S-t-1)*D                  1D or 2D
3c. Read pivot row   D/C                        1
Total misses         1+2*D/C+(S-t-1)*(D+1)      3+2D+a*D/C
Also, note that some processors (containing the array diagonal) may do less
work and encounter fewer misses, but still must wait at a barrier for the other
processors. Hence only the worst case is shown.
Step 3 is done D times, so step 3b contains a factor of D in the number of misses.
It is assumed that the cache is large enough to contain the D elements of the pivot
row, so that the misses that are encountered reading the pivot row occur only once.
As K increases from 0 to N − 2, t varies from 0 to S − 1. The first D values
of K result in t equal to 0, the next D values of K result in t equal to 1, and so on.
Then the total number of misses without optimizations is
\[ \sum_{t=0}^{S-2} D \left( 1 + \frac{2D}{C} + (S - t - 1)(D + 1) \right) \]
and the total number of misses with optimizations is
\[ \sum_{t=0}^{S-2} D \left( 3 + 2D + \frac{aD}{C} \right) \]
When t = S − 1, only the bottom right processor is involved and there are no
misses. Then, the total number of misses with optimizations is
\[ D \left( 3 + 2D + \frac{aD}{C} \right) (S - 1) \]
and the total number of misses without optimizations is
\[ \sum_{t=0}^{S-2} D \left( 1 + \frac{2D}{C} + (S - t - 1)(D + 1) \right) \]
On simplification, the above expression reduces to
\[ D \left( 1 + \frac{2D}{C} \right) (S - 1) + \frac{D (D + 1) S (S - 1)}{2} \]
It can be seen that the total number of misses with optimizations is
$O(SD^2)$ and the number of misses without optimizations is $O(SD^2) + O(S^2 D^2)$,
so that the reduction in the number of misses is $O(S^2 D^2)$, or $O(N^2)$.
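The simplification can be checked numerically. The sketch below evaluates the sum over t term by term and compares it with the closed form; the function names are hypothetical, C is assumed to divide D so that D/C is exact in integer arithmetic, and the reduced-cost factor a is passed as a plain number.

```c
#include <assert.h>

/* Misses without optimizations, summed term by term over t = 0..S-2.
   Assumes C divides D so that D/C is exact. */
long misses_without_opt_sum(long S, long D, long C)
{
    assert(S >= 1 && D >= 1 && C >= 1 && D % C == 0);
    long total = 0;
    for (long t = 0; t <= S - 2; t++)
        total += D * (1 + 2 * D / C + (S - t - 1) * (D + 1));
    return total;
}

/* Closed form of the same sum. */
long misses_without_opt_closed(long S, long D, long C)
{
    return D * (1 + 2 * D / C) * (S - 1) + D * (D + 1) * S * (S - 1) / 2;
}

/* Misses with optimizations; a is the reduced-cost factor of Step 2. */
double misses_with_opt(long S, long D, long C, double a)
{
    return (double)D * (3.0 + 2.0 * D + a * D / C) * (double)(S - 1);
}
```

For example, with S = 4, D = 8 and C = 4 both versions of the unoptimized count give 552, and the optimized count with a = 1 is 504.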
It is clear from the above table that the major factor that limits further reductions
in the number of cache misses is step 3b. This step causes the total number
of misses to be $O(D^2)$ since, for each K and all D rows, the processors must read
the value of u from the processor on the major column. Most of those misses can
be eliminated by allowing each processor to calculate u on its own.
Given K, and t = K/D, pick a value of y such that t < y < S. Then processor
Pyt is on the major column. In the current version of the algorithm, Pyt calculates
u = A[R][K]/A[K][K]. Figure 4.14 shows the basic code (note: I and R refer to the
same row)
for (I = IMIN; I <= IMAX; I++)
{
   /* if this processor contains column K then 
        it must calculate A[I][K] / A[K][K] */
   if (this_proc_col == K / D)
   {
      temp = ARRAY[I][K] / ARRAY[K][K];
      LOCAL_ARRAY[i][k] = temp;
      flt_info[this_proc] = temp;
   }
   GBARRIER(BarrierGroup)
   /* t is the processor that calculated A[I][K] / A[K][K] */
   t = this_proc_row * S + K / D;
   temp = flt_info[t];
   /* calculate a part of the row */
   for (J = JMIN; J <= JMAX; J++)
   {
      temp1 = ARRAY[K][J];
      LOCAL_ARRAY[i][j] -= temp * temp1;
   }
   GBARRIER(BarrierGroup)
}
Figure 4.14: Calculation of Transformation Ratio (u)
For each value of R (total of D values), the processors to the right of Pyt must
read this value, causing a total of D misses. The algorithm can be rewritten so that
each processor Pyx, t <= x < S calculates u. Figure 4.15 shows the modified code.
for (I = IMIN; I <= IMAX; I++)
{
   if ((K % C) == 0) prefetch(ARRAY[I][K]);
   temp = ARRAY[I][K] / ARRAY[K][K];
   /* if this proc contains column K then it must 
                  store the ratio */
   if (this_proc_col == K / D)
       LOCAL_ARRAY[i][k] = temp;
   /* calculate a part of the row */
   for (J = JMIN; J <= JMAX; J++)
   {
       temp1 = ARRAY[K][J];
       LOCAL_ARRAY[i][j] -= temp * temp1;
   }
   GBARRIER(BarrierGroup)
}
Figure 4.15: Calculation of Transformation Ratio (optimized)
Then, on the first value of K, processor Pyx will encounter one miss to read
A[K][K] from the pivot processor and D misses to read A[R][K] from its corresponding
major-column processor. However, for the next value of K, processor
Pyx will not encounter any misses to read A[R][K], since these two successive values
of A[R][K] are in the same cache block. This discussion assumes that the D cache
blocks containing A[R][K] for all the values of R that are contained in Pyt remain in
the cache of Pyx. Successive blocks of A[R][K] can be prefetched, so that no more
misses are encountered. In Table 4.1, the entry indicating the cost of step
3b with optimizations must be changed to “1 or 2”, which is the cost of several
processors reading A[K][K] at the same time, and there is the additional cost of
D misses only for K = 0. Then, the total number of misses with optimizations is
$D + D \cdot (3 + 2 + a \cdot D/C) \cdot (S - 1)$, or $O(aSD^2)$, if there is partial overlap of processor
busy time and block fetch time. If there is full overlap, the total number of misses
with optimizations is $D + D \cdot (3 + 2 + 1) \cdot (S - 1)$, or $O(SD)$.
4.4 Sorting on the SOME-Bus Using DSM
Sorting is a fundamental problem in computer science. Following a well-known
algorithm, a sequence of N elements is partitioned into smaller parts, each part is
sorted separately, and the sorted parts are merged. Such an algorithm has been
proposed in the past [50]. A summary of the algorithm is as follows. The original
sequence of N elements is partitioned into N/P parts of size P each. For
ease of discussion, N and P are assumed to be powers of 2, and $N = P \cdot 2^b$, where
b > 1. Using the BSR model, P processors cooperate to sort each part of P elements
in separate steps. After this step, memory contains N/P parts, each separately
sorted. Then, all processors cooperate to merge those parts into sorted parts of size
$2P, 4P, \ldots, 2^b P$, until all elements are sorted. The merge operation is performed
$b = \log_2(N/P)$ times. It is shown in [50] that the total merge time dominates the initial
sorting time and therefore the total time is $O((N/P) \log_2(N/P))$.
The algorithm on the SOME-Bus follows a similar principle, with the slight variation
that load balancing is taken into consideration to some extent. Ideally, at the
end of execution of the algorithm, each processor would have an approximately equal
number of elements, thus providing load balancing in terms of memory usage and
possibly performance in the later stages of the application.
Each processor has a part with approximately the same number N/P of elements
of the sequence in its local memory. Initially, each processor sorts its own
part locally. Then groups of $2^j$, j = 1, . . . , b processors cooperate in parallel to
merge parts of the sequence into new, sorted parts of twice as many elements, until
the whole sequence is sorted. Figure 4.16 illustrates the procedure using P = 8
processors.
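The sort-then-merge structure can be sketched sequentially as follows. This is a stand-in, not the SOME-Bus implementation: the names sort_by_parts, merge_runs and cmp_int are hypothetical, the p parts play the role of the processors' local memories, p is assumed to be a power of two that divides n, and the overlap cases and load balancing discussed in this section are not modeled.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Merge sorted runs a[lo..mid) and a[mid..hi) through scratch buffer tmp. */
static void merge_runs(int *a, int *tmp, size_t lo, size_t mid, size_t hi)
{
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        tmp[k++] = (a[j] < a[i]) ? a[j++] : a[i++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof *a);
}

/* Sequential stand-in for the parallel procedure: n elements, p parts,
   with p a power of two that divides n. Returns 0 on success. */
int sort_by_parts(int *a, size_t n, size_t p)
{
    assert(p > 0 && (p & (p - 1)) == 0 && n % p == 0);
    size_t part = n / p;
    int *tmp = malloc((n ? n : 1) * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    for (size_t i = 0; i < p; i++)          /* each "processor" sorts locally */
        qsort(a + i * part, part, sizeof *a, cmp_int);
    for (size_t run = part; run < n; run *= 2)   /* log2(p) merge rounds */
        for (size_t lo = 0; lo < n; lo += 2 * run)
            merge_runs(a, tmp, lo, lo + run, lo + 2 * run);
    free(tmp);
    return 0;
}
```

Each pass of the outer merge loop corresponds to one iteration in which groups of twice as many parts are combined.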
Figure 4.16: Sorting Procedure
Slanted lines represent a sequence of numbers sorted in increasing order. A
slanted line over two or more processors means that the sorted sequence of numbers
is distributed over those processors in the proper order, either evenly or unevenly.
The algorithm is recursive and the processors perform the following operations
at each iteration:
1. Each processor reads the key locations of every other processor in its group. A
group here is defined as the set of processors involved in the merging of two
subsequences. The group size is $2^{step_i + 1}$, where $step_i$ is the iteration number
in the recursive algorithm.
2. The processors then map the two subsequences onto each other to figure out
how they overlap with each other. Figure 4.17 lists all the possible cases in
which the two subsequences can overlap with each other.
The figure also lists the action to be taken for each of the cases. Case 2, as
shown in the figure, needs a complete swap of the two subsequences. This can
be performed in parallel by all processors holding the two subsequences.
Case 3 is explained further through an example. Assuming that the overlap is
as in Figure 4.18, processor P1 starts to merge the two sequences from points A
and E and stops only when it reaches the end of its sequence at B. In parallel, processor
P2 starts to merge the two sequences from points B and F. However, this processor
does not know the index or the exact location of the number F in the memory
of P4. Hence it starts reading the sequence from the beginning of the subsequence
at processor P4 until it gets to F (F is simply the first number greater than
B), and continues merging thereafter. The same applies to processor P3. At
the end of the merge, processor P4 will have zero elements, while the other processors
have non-zero but different numbers of elements.
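The search that P2 performs for F can be sketched as a linear scan. The name first_greater and the use of a plain int array for the remote subsequence are assumptions made for illustration; on the SOME-Bus each element read would be a DSM access to the memory of P4.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the first element of the ascending sequence seq
   that is greater than bound, or n if there is none. Scans from the
   start, as P2 does in the example because it does not know where F is. */
size_t first_greater(const int *seq, size_t n, int bound)
{
    size_t i = 0;
    while (i < n && seq[i] <= bound)
        i++;
    return i;
}
```

With the subsequence {2, 4, 6, 8, 10} and B = 5, the scan stops at index 2, the position of F = 6.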
In a simpler instance of case 3, for example Figure 4.19, load balancing is taken
care of by making the processors at either end of the overlap merge in parallel in
opposite directions (processor P3 merges from points A and B, while processor P4
merges from points C and D) until each of them reaches its number of elements,
thus ensuring an equal number of elements in both processors.
Figure 4.17: Merge Cases
Case 4 is treated similarly to case 3, except that there is additional swapping done
in parallel by the processors on either side of the overlap region.
Case 5 is also treated similarly to case 3 and it happens to be the worst-case scenario
from a load balancing perspective. Processors on the left end up with the
whole merged sequence, with no elements in the processors on the right.
Figure 4.18: Example of Case 3
Figure 4.19: Example of Case 3 (simple)
Given that a barrier operation is much cheaper to perform on the SOME-Bus,
a more aggressive approach can be taken by adding another synchronization step
after step (1), to exchange the indices of the start and end points of overlaps, i.e.
A, B, C, D, etc. in Figure 4.18. Given the indices, all processors in the group know
exactly (in terms of the block address itself) where the overlap starts and ends at each
processor, and hence can work independently to achieve perfect load balancing.
4.4.1 Performance Analysis
Each group of processors in an iteration works independently, hence the communication
time at step (1) to exchange the key locations is $O(2^{step_i - 1})$. The worst
case scenario for step (2) is when one processor has to read in a whole subsequence
from one half of the processors, to merge with its own subsequence. The communication
time is $O(2^{step_i} \cdot N/P)$.
Prefetching
Prefetching can be used either while swapping elements (case 2, etc.) or while merging.
The processors will encounter a “full” miss on the very first access, and a
prefetch request for the next block at every iteration will reduce the latency of a
read miss by only a small amount. The amount of computation to be performed at each
iteration is extremely small (writing the element to local memory), so there is too
little computation to overlap with the block transfer and the advantage of prefetching is insignificant.
Block Capture
The problem statement is such that very little sharing of data is required
between the processors. Block capture can be used to some extent in step (1) of
the algorithm, where all processors in a group need access to the key locations of all the
other processors in the same group. Again, the advantage is small,
considering that there is very little data to be shared.
CHAPTER 5. SCHEDULING OF ALGORITHMS
This chapter shows some fundamental procedures that can be used to produce
a schedule of block accesses by each processor, to take advantage of the block cap-
ture and prefetch abilities of the SOME-Bus architecture. It must be noted that in
general this functionality is useful only if the application software exhibits some
regularity in the way shared memory is accessed. Dense matrix calculations and
similar operations benefit the most from this functionality. In contrast, an
application where processors walk through different parts of a linked list, the elements
of which may be anywhere in shared memory, would not benefit as
significantly.
Consider the matrix-vector multiplication algorithm:
for (j = 0; j < N; j++)    /* LOOP A */
      y[f] += A[f][j] * x[j];
If C is the size of the cache block, the above loop has N/C misses at x[0], x[C],
x[2C], . . . , x[N − C], that is, at x[k · C] for k = 0, 1, . . . , N/C − 1, without any prefetching or
capture of blocks.
Table 5.1 lists the sequence of accesses at each processor for an array of size
16x16 elements over 4 processors with the cache block size being equal to the size
of each element in the array.
N = 16, P = 4, C = 1
Table 5.1: Matrix-Vector Multiplication Accesses at Each Processor
P0 P1 P2 P3
x[0] x[0] x[0] x[0]
x[1] x[1] x[1] x[1]
x[2] x[2] x[2] x[2]
x[3] x[3] x[3] x[3]
x[4] x[4] x[4] x[4]
x[5] x[5] x[5] x[5]
x[6] x[6] x[6] x[6]
x[7] x[7] x[7] x[7]
x[8] x[8] x[8] x[8]
x[9] x[9] x[9] x[9]
x[10] x[10] x[10] x[10]
x[11] x[11] x[11] x[11]
x[12] x[12] x[12] x[12]
x[13] x[13] x[13] x[13]
x[14] x[14] x[14] x[14]
x[15] x[15] x[15] x[15]
As mentioned before, all block accesses are misses and there is large congestion
at every iteration of the algorithm, since all processors have misses and send data
request messages to the same node at the same time.
5.1 Matrix-Vector Multiplication Using Prefetch
A solution with simple prefetching reduces the read misses considerably but
does not help reduce the congestion as shown in Figure 5.1.
Prefetching can be done with a different starting point as shown in Figure 5.2.
Table 5.2 lists the sequence of prefetch requests and accesses at each iteration,
shown as “pr x[]” and “x[]” respectively.
There is a miss at t = 0 (first iteration), but the network is completely busy
even for such a simple program. Also, there are no more prefetch opportunities
available except to have a prefetch for the very first access even before the for loop.
for (j = 0; j < N; j++)     /* LOOP A */
{
    Prefetch(x[j+1]);
    y[f] += A[f][j] * x[j];
}
Figure 5.1: Matrix-Vector Multiplication (simple)
for (t = 0; t < N; t++)
{
    j= (t + p) mod N;
    Prefetch(x[(j+1) mod N]);
    y[f] += A[f][j] * x[j];
}
Figure 5.2: Matrix-Vector Multiplication with Reduced Congestion
Table 5.2: Matrix-Vector Multiplication Accesses with Prefetching
P0 P1 P2 P3
pr x[1] x[0] pr x[2] x[1] pr x[3] x[2] pr x[4] x[3]
pr x[2] x[1] pr x[3] x[2] pr x[4] x[3] pr x[5] x[4]
pr x[3] x[2] pr x[4] x[3] pr x[5] x[4] pr x[6] x[5]
pr x[4] x[3] pr x[5] x[4] pr x[6] x[5] pr x[7] x[6]
pr x[5] x[4] pr x[6] x[5] pr x[7] x[6] pr x[8] x[7]
pr x[6] x[5] pr x[7] x[6] pr x[8] x[7] pr x[9] x[8]
pr x[7] x[6] pr x[8] x[7] pr x[9] x[8] pr x[10] x[9]
pr x[8] x[7] pr x[9] x[8] pr x[10] x[9] pr x[11] x[10]
pr x[9] x[8] pr x[10] x[9] pr x[11] x[10] pr x[12] x[11]
pr x[10] x[9] pr x[11] x[10] pr x[12] x[11] pr x[13] x[12]
pr x[11] x[10] pr x[12] x[11] pr x[13] x[12] pr x[14] x[13]
pr x[12] x[11] pr x[13] x[12] pr x[14] x[13] pr x[15] x[14]
pr x[13] x[12] pr x[14] x[13] pr x[15] x[14] pr x[0] x[15]
pr x[14] x[13] pr x[15] x[14] pr x[0] x[15] pr x[1] x[0]
pr x[15] x[14] pr x[0] x[15] pr x[1] x[0] pr x[2] x[1]
pr x[0] x[15] pr x[1] x[0] pr x[2] x[1] pr x[3] x[2]
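The entries of Table 5.2 follow directly from the index expressions in Figure 5.2. The sketch below restates them as functions and checks that, at every iteration, no two processors access the same element; the function names are hypothetical and the C remainder operator stands in for the mod of the figure, which is safe here because all operands are non-negative.

```c
#include <assert.h>

/* Index accessed by processor p at iteration t (Figure 5.2). */
int staggered_access(int t, int p, int n)
{
    return (t + p) % n;
}

/* Index prefetched by processor p at iteration t. */
int staggered_prefetch(int t, int p, int n)
{
    return (staggered_access(t, p, n) + 1) % n;
}

/* Returns 1 if, at every iteration, the first nproc processors access
   pairwise different elements (requires nproc <= n). */
int accesses_are_disjoint(int nproc, int n)
{
    for (int t = 0; t < n; t++)
        for (int p = 0; p < nproc; p++)
            for (int q = p + 1; q < nproc; q++)
                if (staggered_access(t, p, n) == staggered_access(t, q, n))
                    return 0;
    return 1;
}
```

For N = 16, processor P3 at the first iteration accesses x[3] and prefetches x[4], matching the first row of the table.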
5.2 Matrix-Vector Multiplication Using Block Capture
Network traffic can be reduced considerably further by using block capture
with the same technique of making the processors start at different points in the
sequence of data to be accessed. A neighborhood definition enables each processor
to capture data from the bus that it does not require immediately, but will need in
the near future.
Q = P * C;
for (j = 0; j < N/Q; j++) /* LOOP A */
   for (k = 0; k < Q; k++) /* LOOP B */
   {
      t = (k + p * C) mod Q + j * Q;
      y[f] += A[f][t] * x[t];
   }
When j = 0, given x[t] for t = (k + p · C) mod Q, processor p makes the following
references: x[p · C], x[p · C + 1], . . . , x[Q − 1], x[0], . . . , x[p · C − 1], in that order.
For example, for an array of size 16x16 elements (N = 16) over 4 proces-
sors (P = 4) and a cache block size of one array element (C = 1), we have
Q = 4, N/Q = 4.
for (j = 0; j < 4; j++) /* LOOP A */
   for (k = 0; k < 4; k++) /* LOOP B */
   {
      t = (k + p) mod 4 + j * 4;
      y[f] += A[f][t] * x[t];
   }
Table 5.3 lists the sequence of accesses and whether or not these accesses can be
enhanced to a “hit” or “capture”.
Table 5.3: Block Capture (Example)
P0 P1 P2 P3 Notes
x[0] x[1] x[2] x[3] misses, all capture
x[1] x[2] x[3] x[0] hits
x[2] x[3] x[0] x[1] hits
x[3] x[0] x[1] x[2] hits
x[4] x[5] x[6] x[7] misses, all capture
x[5] x[6] x[7] x[4] hits
x[6] x[7] x[4] x[5] hits
x[7] x[4] x[5] x[6] hits
x[8] x[9] x[10] x[11] misses, all capture
x[9] x[10] x[11] x[8] hits
x[10] x[11] x[8] x[9] hits
x[11] x[8] x[9] x[10] hits
x[12] x[13] x[14] x[15] misses, all capture
x[13] x[14] x[15] x[12] hits
x[14] x[15] x[12] x[13] hits
x[15] x[12] x[13] x[14] hits
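The pattern in Table 5.3 can be reproduced with a small idealized simulation. It assumes that every block fetched by any processor in a step is captured by all processors in that same step, so all caches hold the same blocks and one flag per block is enough; the function names are hypothetical and replacement, timing and the directory are not modeled.

```c
#include <assert.h>
#include <stdlib.h>

/* Element index accessed by processor p at step (j, k); Q = P * C. */
int capture_index(int j, int k, int p, int C, int Q)
{
    return (k + p * C) % Q + j * Q;
}

/* Idealized count of misses seen by processor 0 for a vector of N
   elements: every block that any processor fetches is assumed to be
   captured by all P processors in the same step.
   Returns -1 on allocation failure. */
int misses_with_capture(int N, int P, int C)
{
    int Q = P * C;
    assert(N > 0 && P > 0 && C > 0 && N % Q == 0);
    char *cached = calloc((size_t)(N / C), 1); /* one shared flag per block */
    if (cached == NULL)
        return -1;
    int misses_p0 = 0;
    for (int j = 0; j < N / Q; j++)
        for (int k = 0; k < Q; k++) {
            /* check processor 0 before this step's blocks are captured */
            misses_p0 += !cached[capture_index(j, k, 0, C, Q) / C];
            /* all blocks touched in this step end up in every cache */
            for (int p = 0; p < P; p++)
                cached[capture_index(j, k, p, C, Q) / C] = 1;
        }
    free(cached);
    return misses_p0;
}
```

For N = 16, P = 4 and C = 1 the function returns 4, one miss per group of Q accesses, which corresponds to the four "misses, all capture" rows of the table.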
5.3 Matrix-Vector Multiplication Using Block Capture and Prefetch
In this case of matrix-vector multiplication, the two techniques can be com-
bined to reduce both the network traffic and the number of misses considerably, as
follows.
Q = P * C;
for (j = 0; j < N/Q; j++) /* LOOP A */
  for (k = 0; k < Q; k++) /* LOOP B */
  {
    t = (k + p * C) mod Q + j * Q;
    y[f] += A[f][t] * x[t];
    /* if ((k == C) && (j != N/Q-1)) x[p*C+(j+1)*Q]; */
    if (k == C) x[p*C+(j+1)*Q];
  }
When j = 0, given x[t] for t = (k + p · C) mod Q, processor p makes the following
references: x[p · C], x[p · C + 1], . . . , x[Q − 1], x[0], . . . , x[p · C − 1].
For an array of size 16x16 elements (N = 16) over 4 processors (P = 4) with
a cache block size equal to the size of one element in the array (C = 1), we have
Q = 4, N/Q = 4. Figure 5.3 lists the sequence of accesses and prefetches as before.
for (j = 0; j < 4; j++) /* LOOP A */
  for (k = 0; k < 4; k++) /* LOOP B */
  {
    t = (k + p) mod 4 + j * 4;
    y[f] += A[f][t] * x[t];
    if (k == 1) x[p*C+(j+1)*Q];
  }
5.4 A General Procedure to Use Capture and Prefetch
The scheduling in the previous section involved analyzing the algorithm for
accesses, and the optimal solution is clearly specific to that problem. There is
a need for a more generic approach to utilizing the architectural enhancements
of prefetching and capture. A compile-time or run-time analysis of the piece
of code at hand reveals its data access pattern, and a generic set of procedures needs
to be laid out that can optimize the performance of the code, given the
data access pattern. An attempt is made to lay out such guidelines in this section.
Most algorithms, for example the LU block decomposition, require barriers to
synchronize the processors, and hence the following analysis revolves around the
location of these barriers and the data access patterns between each pair of barriers
in the code. The analysis should also apply to algorithms that do not necessarily
use barriers within them (for example, the matrix-vector multiplication), because
the whole algorithm can be treated as lying between two barriers, as is usually the case
when it is used as part of a super-computing application.
Figure 5.3: Matrix Multiplication with Block Capture and Prefetch
Let $B = \{b_k\}$, $k = 0, \ldots, N - 1$, listed in increasing order, be the set of blocks that all
P processors read between barriers. Then,
1. Schedule the block accesses in groups of P over the P processors, such that
within that group every processor starts with a different block access.
Block access 0 at processor p is $b_p$, for p = 0, . . . , P − 1.
Block access 1 at processor p is $b_{(p+1) \bmod P}$, for p = 0, . . . , P − 1, etc.
2. Repeat scheduling the block accesses of each group, such that for each group
j, the P block accesses $b_t$ for $t = j \cdot P + p$ are performed by different processors.
Each processor then performs the remaining P − 1 block accesses. Each
processor performs the N accesses in N/P groups of P accesses. When group
j is scheduled, processor p accesses blocks $b_t$ where $t = j \cdot P + (k + p) \bmod P$,
k = 0, . . . , P − 1.
3. When group j is scheduled, and k = 1, in processor p, schedule a prefetch of
block $r = (j + 1) \cdot P + p$.
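The three rules can be restated as index functions and checked mechanically. In this sketch the function names are hypothetical, P is assumed to divide N, and N is limited to 1024 blocks only to keep the bookkeeping array on the stack.

```c
#include <assert.h>

/* Block accessed by processor p as its k-th access within group j (rule 2). */
int scheduled_block(int j, int k, int p, int P)
{
    return j * P + (k + p) % P;
}

/* Block prefetched by processor p while group j is processed (rule 3). */
int prefetch_block(int j, int p, int P)
{
    return (j + 1) * P + p;
}

/* Returns 1 if processor p reads every block 0..N-1 exactly once,
   assuming P divides N and N <= 1024. */
int schedule_covers_all(int N, int P, int p)
{
    int seen[1024] = {0};
    assert(P > 0 && N > 0 && N <= 1024 && N % P == 0);
    for (int j = 0; j < N / P; j++)
        for (int k = 0; k < P; k++)
            seen[scheduled_block(j, k, p, P)]++;
    for (int b = 0; b < N; b++)
        if (seen[b] != 1)
            return 0;
    return 1;
}
```

Note that the block prefetched under rule 3 is exactly the first block that the same processor accesses in group j + 1, so the prefetch targets the one access of the next group that capture cannot cover.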
When the block references are more dynamic, a different version of the above
algorithm is used. We continue to assume that there are common block accesses
between processors, but the specific sequence of blocks is data dependent and is
determined dynamically at run time. The representative case is the execution of the
following code by P processors, where two (or more) different arrays are accessed
depending on some previously calculated value.
The algorithm is not directly applicable to the code in Figure 5.4, since it relies on each of
the P processors making a specific sequence of block accesses, which may be captured
by all other processors. Since it is unknown at compile time which processors will
reference the x or y arrays, the augmented scheduling algorithm must schedule
prefetches together with block accesses. Depending on the particular execution of
the program, some prefetches will result in hits because the corresponding block
was captured.
As before, for an array of size 16x16 elements (N = 16) spread over 4 processors
(P = 4) with a cache block size equal to one element in the array (C = 1), we have
Q = 4, N/Q = 4. Each processor may execute the code in either Figure 5.5 or Figure 5.6.
Table 5.4 shows two processors executing the first code and the other two executing
the second code. This example shows that a large number of prefetches become
hits due to partial capturing of some cache blocks. Since the specific accesses
of each processor are unknown at compile time, the simplest capture specification
is to allow each processor to capture both x and y blocks. The blocks that are not
accessed soon are replaced by the cache management protocol.
while (Condition1)
{
    ...Code...
    if (Condition2)
    {
        for (i = 0; i < M; i++)
          ... = y[i];
    }
    else
    {
        for (i = 0; i < M; i++)
          ... = x[i];
    }
    ...Code...
    barrier();
}
Figure 5.4: Example of a More Dynamic Code
for (j = 0; j < 4; j++) /* LOOP A */
    for (k = 0; k < 4; k++) /* LOOP B */
    {
        t = (k + p) mod 4 + j * 4;
        z[f] += A[f][t] * x[t];
        if (k == 1) x[p*C+(j+1)*Q];
    }
Figure 5.5: Dynamic Code (Case 1)
for (j = 0; j < 4; j++) /* LOOP A */
    for (k = 0; k < 4; k++) /* LOOP B */
    {
        t = (k + p) mod 4 + j * 4;
        z[f] += A[f][t] * y[t];
        if (k == 1) y[p*C+(j+1)*Q];
    }
Figure 5.6: Dynamic Code (Case 2)
Table 5.4: Dynamic Code Accesses
P0 P1 P2 P3 Notes
pref acc pref acc pref acc pref acc
x[1] x[0] y[2] y[1] y[3] y[2] x[0] x[3] Misses
x[2] x[1] y[3]-h y[2] y[0] y[3] x[1]-h x[0] All hits
x[3] x[2] y[0]-h y[3] y[1] y[0] x[2]-h x[1] All hits
x[4] x[3] y[5] y[0] y[6] y[1] x[7] x[2] All hits
x[5] x[4] y[6]-h y[5] y[7] y[6] x[4]-h x[7] All hits
x[6] x[5] y[7]-h y[6] y[4] y[7] x[5]-h x[4] All hits
x[7]-h x[6] y[4]-h y[7] y[5] y[4] x[6]-h x[5] All hits
x[8] x[7] y[9] y[4] y[10] y[5] x[11] x[6] All hits
x[9] x[8] y[10]-h y[9] y[11] y[10] x[8]-h x[11] All hits
x[10] x[9] y[11]-h y[10] y[8] y[11] x[9]-h x[8] All hits
x[11]-h x[10] y[8]-h y[11] y[9] y[8] x[10]-h x[9] All hits
x[12] x[11] y[13] y[8] y[14] y[9] x[15] x[10] All hits
x[13] x[12] y[14]-h y[13] y[15] y[14] x[12]-h x[15] All hits
x[14] x[13] y[15]-h y[14] y[12] y[15] x[13]-h x[12] All hits
x[15]-h x[14] y[12]-h y[15] y[13] y[12] x[14]-h x[13] All hits
x[?] x[15] y[?] y[12] y[?] y[13] x[?] x[14] All hits
Block accesses can also be finely scheduled so that block capture in combination
with properly placed barriers results in the elimination of almost all misses. This is
shown in Figure 5.7.
Figure 5.8 shows the subarrays accessed by each processor (DPROC = 4). Two
different sequences of blocks are accessed (belonging to arrays A and B), and consequently,
the procedures described earlier can be applied initially to schedule the
accesses of the blocks belonging to array A and then to schedule the accesses of the
blocks belonging to array B (without damaging the earlier schedule).
Figure 5.9 shows one schedule of accesses to A blocks, which allows all proces-
sors to capture all blocks belonging to A after the first iteration. Still, this figure
shows several processors accessing the same blocks belonging to array B at the
same time.
p = processor_number;
y = p / DPROC;
x = p % DPROC;
/* node p calculates subarray Cyx=sum of Ayk*Bkx over k */
C = pc[p];
for (i = 0; i < SUBARRAYSIZE; i++)
  for (j = 0; j < SUBARRAYSIZE; j++)
    ARRAY_C(i,j) = 0;
for (k = 0; k < DPROC; k++)
{
  /* Ayk * Bkx */
  A = pa[y*DPROC+k];
  B = pb[k*DPROC+x];
  for (i = 0; i < SUBARRAYSIZE; i++)
    for (j = 0; j < SUBARRAYSIZE; j++)
    {
      u = 0;
      aa = A + i * SUBARRAYSIZE;
      bb = B + j;
      for (t = 0; t < SUBARRAYSIZE; t++)
      {
        u += (*aa) * (*bb);
        aa += 1;
        bb += SUBARRAYSIZE;
      }
      ARRAY_C(i,j) += u;
    }
}
Figure 5.7: Finer Scheduling of Block Accesses
The final schedule is shown in Figure 5.10, which relies on the fact that accesses
to blocks of A can be scheduled in several ways (still maintaining sequential ac-
cess). This schedule allows all processors to also capture all blocks belonging to B
after the first iteration.
Figure 5.11 shows the resulting program. The major difference compared to the
original program lies in the adjustment of the indices of the loops, which perform
the access to arrays A and B.
Figure 5.8: Matrix Multiplication Accesses (No Schedule)
5.5 Comparison With a Traditional Architecture
The capability of the SOME-Bus with the enhancements proposed in Chapter 3 is
further highlighted when considering extending these enhancements to a more
conventional architecture such as the mesh. Message combining, for example, although
beneficial, cannot provide the same level of performance improvement as on
the SOME-Bus. This is because a processor usually receives messages sequentially
over a single link, from the switch it is connected to. Even though several remote
processors send DATA_REQs simultaneously, the unequal latencies between
nodes reduce the message combining opportunities significantly. Block capture,
in its proposed form, cannot be implemented in a mesh architecture simply because
the feature requires inherent broadcast/snoop capability in the underlying
architecture. Only a primitive form of block capture can be implemented through
the use of intelligent switches.
Prefetching, although quite effective, still suffers under some of the studied test
cases (LU block decomposition) due to the large number of synchronizations required
by the algorithm (a prefetch cannot be issued for data that is being modified until
after a synchronization, to assure data integrity across the application). The SOME-Bus,
with its O(1) complexity barrier, is extremely efficient compared to most architectures.
Figure 5.9: Matrix Multiplication Accesses (Schedule A)
Figure 5.10: Matrix Multiplication Accesses (Schedule AB)
p = processor_number;
y = p / dproc;
x = p % dproc;
/* node p calculates subarray Cyx=sum of Ayk*Bkx over k */
C = pc[p];
for (i = 0; i < SUBARRAYSIZE; i++)
  for (j = 0; j < SUBARRAYSIZE; j++)
    ARRAY_C(i,j) = 0;
ks = y + x + 1;
for (kt = 0; kt < dproc; kt++)
{
  BARRIER
  k = (ks + kt) % dproc;
  /* Ayk * Bkx */
  A = pa[y*dproc+k];
  B = pb[k*dproc+x];
  for (i = 0; i < SUBARRAYSIZE; i++)
    for (j = 0; j < SUBARRAYSIZE; j++)
    {
      u = 0;
      aa = A + i * SUBARRAYSIZE;
      bb = B + j;
      for (t = 0; t < SUBARRAYSIZE; t++)
      {
        u += (*aa) * (*bb);
        aa += 1;
        bb += SUBARRAYSIZE;
      }
      ARRAY_C(i,j) += u;
    }
}
Figure 5.11: Matrix Multiplication Program (Schedule AB)
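One property of the index shift ks = y + x + 1 can be checked in isolation: at every step kt, no two processors start on the same subarray of A or on the same subarray of B. The sketch below tests this for a dproc x dproc grid of processors; the function names are hypothetical, dproc is limited to 16 only to bound the bookkeeping arrays, and the check says nothing about timing or about which processor actually captures a block.

```c
#include <assert.h>

/* Value of k used by processor (y, x) at step kt in Figure 5.11. */
int scheduled_k(int y, int x, int kt, int dproc)
{
    return (y + x + 1 + kt) % dproc;
}

/* Returns 1 if, at every step, no two processors start on the same
   subarray of A or on the same subarray of B (dproc <= 16 assumed). */
int schedule_ab_is_conflict_free(int dproc)
{
    assert(dproc > 0 && dproc <= 16);
    for (int kt = 0; kt < dproc; kt++) {
        int a_used[256] = {0}, b_used[256] = {0};
        for (int y = 0; y < dproc; y++)
            for (int x = 0; x < dproc; x++) {
                int k = scheduled_k(y, x, kt, dproc);
                int a = y * dproc + k;   /* index into pa[] */
                int b = k * dproc + x;   /* index into pb[] */
                if (a_used[a]++ || b_used[b]++)
                    return 0;
            }
    }
    return 1;
}
```

With the unshifted loop of Figure 5.7 (k running from 0 in every processor), all processors in a row would request the same subarray of A at the same step, which is the situation the schedule is meant to avoid.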
CHAPTER 6. CONCLUSIONS AND FUTURE WORK
A decrease in network traffic and an overlap of communication time with computation
time are essential to obtain good performance for a parallel program. Also,
most parallel programs require a significant amount of synchronization during their
course of execution. The SOME-Bus architecture provides a unique opportunity
to address these issues. The inherent nature of the SOME-Bus (a dedicated output
channel per processor) significantly reduces the synchronization time between
processors and is also scalable. It is shown that additional hardware support incorporated
into the SOME-Bus provides the programmer with additional features
that can be used while writing parallel programs.
Block capture can help reduce the network traffic significantly by making the
processors capture data off the bus that has not been requested by the processor
yet, but that it is known to need in the near future. Support for the definition of neighborhoods
by the programmer, through the use of special instructions, is shown
to greatly increase the flexibility of the architecture with regard to writing high
performance parallel programs. The block prefetch technique helps overlap the communication
time with computation time, by making the processor request data
that it will need in the near future, instead of waiting until it actually needs that
data. Finally, the message combining technique helps increase the efficiency
of the directory controller by making it reply to several processors requesting the
same memory block with a single multicast message.
The above features, combined with the inherent advantages of the SOME-Bus
architecture (fast barriers, high network bandwidth and low latency), are shown to
provide excellent performance for some very popular parallel algorithms. A complete
overlap of communication time with computation time is achieved in matrix-vector
and matrix-matrix multiplication. There is more data dependency in the LU
block decomposition algorithm, in that the producers produce data and the consumers
need to wait for the data to be available before consumption. Despite the
inherent nature of the algorithm, the hardware features of the SOME-Bus are shown
to reduce the number of misses to O(SD). The theoretical results were verified
using a full-fledged SOME-Bus simulator.
Compiler support can be very beneficial for performance; the compiler can
identify the sharing between producers and consumers and insert the proper
capture instructions. The UPC language includes the concept of affinity of an
address (a pointer), which identifies the node that the address belongs to.
The UPC compiler generates a set of instructions to determine the affinity of
an address every time a memory reference is encountered. However, because of
the operating system and other intermediate layers, the number of such
instructions was considerable. A more direct approach was taken in the
simulations, where the programmer could specify the affinity of data directly
in the program.
A much more powerful notion of affinity is defined as a vector of attributes
associated with each cache block address. Any time a message in the network
carries a block, the affinity information is also included in the message and
is used to assist each receiver in deciding whether or not to capture the
block. A push mechanism is feasible with such an implementation: the producer
knows all the processors that need access to a block of data being generated
and automatically multicasts the data to the appropriate processors. This has
the potential to considerably reduce the latencies involved in remote misses.
The implementation of synchronization among processors is different with such
a technique, and writing programs that can take advantage of such hardware
could be a very interesting research area.
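One possible reading of the affinity vector is sketched below. It assumes,
purely for illustration, that the vector is encoded as a bit mask and that
each receiver holds an interest mask of its own; the encoding, the attribute
meanings and the function names are hypothetical and are not specified by the
thesis.

```python
def should_capture(block_affinity, receiver_interest):
    """Return True when any attribute bit of the block matches the receiver."""
    return (block_affinity & receiver_interest) != 0

def push_block(block_affinity, receivers):
    """Return the ids of the receivers that would capture a broadcast block.

    receivers: dict mapping node id to its (assumed) interest bit mask.
    """
    return [node for node, interest in sorted(receivers.items())
            if should_capture(block_affinity, interest)]

# Hypothetical attribute bits: bit 0 = "row panel", bit 1 = "column panel".
receivers = {0: 0b01, 1: 0b10, 2: 0b11, 3: 0b00}
captured_by = push_block(0b10, receivers)
```

Here the producer sends the block once, and nodes 1 and 2 capture it because
their interest masks share a bit with the block's affinity, while nodes 0 and
3 ignore it.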
With reference to Section 3.3, with little additional logic, more complicated
regular expressions can be used to define neighborhoods for each processor.
This is especially helpful when the memory access patterns of a parallel
algorithm are regular. A more detailed study of the feasibility of such
expressions could be an interesting area of research.
85
LIST OF REFERENCES
[1] “quadrics, available from http://www.quadrics.com,” .
[2] “myrinet, available from http://www.myrinet.com,” .
[3] Norton. C.D. and Cwik. T.A., “Early experiences with the myricom 2000
switch on an smp beowulf class cluster for unstructured adaptive meshing,”
in International Conference on Cluster Computing, 2001, pp. 7–14.
[4] Nikolopoulos. D. S., “Quantifying and resolving remote memory access con-
tention on hardware dsm multiprocessors,” in 16th International Parallel and
Distributed Processing Symposium, 2002, pp. 262–71.
[5] Panda DK Donglai Dai, “How much does network contention affect dis-
tributed shared memory performance?,” in 1997 International Conference on
Parallel Processing, 1997, pp. 454–61.
[6] C. Amza, A. L. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu,
and W. Zwaenepoel, “Treadmarks: Shared memory computing on networks
of workstations,” IEEE Computer, vol. 29, no. 2, pp. 18–28, 1996.
[7] Speight. E. and J.K. Bennett, “Brazos: A third generation dsm system,” in
Proceedings of the 1997 USENIX Windows/NT Workshop, 1997, pp. 95–106.
[8] “dolphinics, available from http://www.quadrics.com,” .
[9] A. G. Nowatzyk, M. C. Browne, E. J. Kelly, and M. Parkin, “S-connect: from
networks of workstations to supercomputer peformance,” in International
Symposium on Computer Architecture, 1995, pp. 71–82.
[10] Nemawarkar.S.S. and Agarwal. V.K. Govindarajan. R., Gao. G.R., “Analysis
of multithreaded multiprocessors with distributed shared memory,” in IEEE
Symposium on Parallel Distributed Processing, 1993, pp. 114–121.
[11] Forward. K.E Mabbs. S.A., “Performance analysis of mr-1, a clustered shared-
memory multiprocessor,” Journal of Parallel and Distributed Computing, vol. 20,
no. 2, pp. 158, 1994.
[12] Milutinovic. V. Grujic. A., Tomasevic. M., “A simulation study of hardware-
oriented dsm approaches,” IEEE Parallel and Distributed Technology, vol. 4, no.
1, pp. 74, 1996.
86
[13] Amit Agarwala. Chita R. Das., “Experimenting with a shared virtual memory
environment for hypercubes,” Journal of Parallel and Distributed Computing,
vol. 29, no. 2, pp. 228–235, Sept 1995.
[14] Stenstrom. P Dahlgren. F, “Evaluation of hardware-based stride and sequen-
tial prefetching in shared-memory multiprocessors,” IEEE Transactions on Par-
allel and Distributed Systems, vol. 7, no. 4, pp. 385, 1996.
[15] L.N. Bhuyan and D. P. Agrawal, “Generalized hypercube and hyperbus struc-
tures for a computer network,” 1EEE Transactions on Computers, vol. 33, no. 4,
pp. 323–333, 1984.
[16] Ould-Khaoua. M., “Comparative evaluation of hypermesh and multi-stage
interconnection network,” Computer Journal, vol. 39, no. 3, pp. 232, 1996.
[17] Szymanski. T., “Hypermeshes. optical interconnection network for parallel
computing,” Journal Parallel and Distributed Computing, vol. 26, no. 1, pp. 1,
1995.
[18] Kin. C.W. Hamdi. M., Tong. J., “Fast sorting algorithms on reconfigurable
array of processors with optical buses", parallel and distributed systems,” in
1996 International Conference on Parallel and Distributed Systems, 1996, p. 183.
[19] Ken Kennedy and Randy Allen, Optimizing Compilers for Modern Architectures:
A Dependence-based Approach, Morgan Kaufmann, 2001.
[20] Michael Wolfe, Optimizing Supercompilers for Supercomputers, MIT Press, 1989.
[21] C. Katsinis, “Performance analysis of the simultaneous optical multiprocessor
exchange bus,” Parallel Computing Journal, vol. 27, no. 8, pp. 1079–1115, 2001.
[22] Melhem. R. Gravenstreter. G., “Realizing common communication patterns
in partitioned optical passive stars (pops) networks,” IEEE Transactions on
Computers, vol. 47, no. 9, pp. 998–1013, 1998.
[23] Sahni. S. Rajasekaran. S., “Sorting, selection, and routing on the array with
reconfigurable optical buses,” IEEE Transactions on Parallel and Distributed Sys-
tems, vol. 8, no. 11, pp. 1123–1132, 1997.
[24] Plant.D.V., Venditti. M.B., Laprise. E., Faucher. J., Razavi. K., Chateauneuf. M.,
Kirk. A.G., and Ahearn. J.S., “256 channel bidirectional optical interconnect
using vcsels and photodiodes on cmos,” Journal of Lightwave Technolgy, vol.
19, no. 8, pp. 1093–1103, 2001.
[25] M. A. G. Abushagur Bouzid, A., “Thin-film approximate modeling of in-core
fiber gratings,” Optical Engineering, vol. 35, no. 10, pp. 2793–2797, 1996.
87
[26] Reekie. L Dong. L., Ortega. B., “Coupling characteristics of claddding modes
in tilted optical fiber gratings,” Applied Optics, vol. 37, no. 22, pp. 5099–5105,
1998.
[27] J. Erdogan. T., Sipe, “Tilted fiber phase gratings,” Journal of the Optical Society
of America, vol. 13, no. 2, pp. 296–313, 1996.
[28] Little. G. Lee. M., “Study of radiation modes for 45-deg tilted fiber phase
gratings,” Optical Engineering, vol. 37, no. 10, pp. 2687–2698, 1998.
[29] B. Nabet. L.G. Neto. M. A. Romero. J. W. Swart Ozelo. H. F. B., L.E.M. de Bar-
ros Jr., “Msm photodetector with an integrated microlens array for improved
optical coupling,” in Int. Microwave and Optoelectronics Conference (IMOC’99),
1999, pp. 472–475.
[30] T. Wang Li Y., “Distribution of light power and optical signals using embed-
ded mirrors inside polymer optical fibers,” IEEE Photonics Technology Letters,
vol. 8, no. 10, pp. 1352–1354, 1996.
[31] Li Y., T. Wang, and K. Fasanella, “Cost-effective side-coupling polymer fiber
optics for optical interconnections,” Journal of Lightwave Technology, vol. 16,
no. 5, pp. 892–901, 1998.
[32] K. Li Pan. Y., “Linear array with a reconfigurable pipelined bus system-
concepts and application,” Information Sciences–An International Journal, vol.
106, no. 3-4, pp. 237–258, 1998.
[33] Pan. V.Y. Keqin Li, “Parallel matrix multiplication on a linear array with a
reconfigurable pipelined bus system,” IEEE Transactions on Computers, vol.
50, no. 5, pp. 519–525, 2001.
[34] Keqin Li, “Scalable parallel matrix multiplication on distributed memory par-
allel computers,” Journal of Parallel and Distributed Computing, vol. 61, no. 12,
pp. 1709–31, 2001.
[35] B. Abali. F. Ozguner. and A. Bataineh., “Balanced parallel sort on hypercube
multiprocessors,” IEEE Transactions on Parallel and Distributed Systems, vol. 4,
no. 5, pp. 572–581, 1993.
[36] Al Ayyoub A. Ould Khaoua M. Day K., “On the performance of parallel
matrix factorisation on the hypermesh,” Journal of Supercomputing, vol. 20, no.
1, pp. 37–53, Aug 2001.
[37] Gaudiot J L Cerin C, “Algorithms for stable sorting to minimize communi-
cations in networks of workstations and their implementations in bsp,” in
IEEE Computer Society International Workshop on Cluster Computing, 1999, pp.
112–20.
88
[38] Walker DW Choi Jaeyoung, Dongarra JJ, “Parallel matrix transpose algo-
rithms on distributed memory concurrent computers,” in Scalable Parallel Li-
braries Conference, 1994, pp. 245–52.
[39] Venkatesh R Lau KK, Kumar MJ, “Parallel matrix inversion techniques,” in
IEEE Second International Conference on Algorithms and Architectures for Parallel
Processing, 1996, pp. 515–21.
[40] Jun Gu Qian Ping Gu, “Algorithms and average time bounds of sorting on
a mesh connected computer,” IEEE Transactions on Parallel and Distributed
Systems, vol. 5, no. 3, pp. 308–315, 1994.
[41] Lilja. D.J. Vander Wiel. S.P., “When caches aren’t enough: data prefetching
techniques,” IEEE Computer, vol. 30, no. 7, pp. 23–30, 1997.
[42] Stenstrom. P Dahlgren. F., Dubois. M., “Performance evaluation and cost
analysis of cache protocol extensions for shared memory multiprocessors,”
IEEE Transactions on Computers, vol. 47, no. 10, pp. 1041–1055, 1998.
[43] Koppelman. D.M, “Neighborhood prefetching on multiprocessors using in-
struction history,” in International Conference on Parallel Architectures and Com-
pilation Techniques, 2000, pp. 123–132.
[44] Pen Chung Yew Hock Beng Lim, “Efficient integration of compiler directed
cache coherence and data prefetching,” in 14th International Parallel and Dis-
tributed Processing Symposium, 2000, pp. 331–340.
[45] Baer. J. L. Ortega. D., Ayguade. E. and Valero. M., “Cost effective compiler
directed memory prefetching and bypassing,” in International Conference on
Parallel Architectures and Compilation Techniques, 2002, pp. 189–198.
[46] Adve. S.V. Pai. V.S., “Comparing and combining read miss clustering and
software prefetching,” in International Conference on Parallel Architectures and
Compilation Techniques, 2001, p. 0292.
[47] Saavedra. R.H., Weihua Mao, Daeyeon Park, Chame. J., and Sungdo Moon,
“The combined effectiveness of unimodular transformations, tiling, and soft-
ware prefetching,” in The 10th International Parallel Processing Symposium,
1996, pp. 39–45.
[48] Milutinovic. V. Milenkovic. A., “Cache injection on bus based multiproces-
sors,” in Seventeenth IEEE Symposium on Reliable Distributed Systems, 1993, pp.
341–346.
[49] Flynn. M.J. Byrd. G.T., “Producer consumer communication in distributed
shared memory multiprocessors,” Proceedings of the IEEE, vol. 87, no. 3, pp.
456–466, Mar 1999.
89
[50] Xiang. L and Ushijima K., “On time bounds, the work time scheduling prin-
ciple, and optimality for bsr,” IEEE Transactions on Parallel and Distributed
Systems, vol. 12, no. 9, pp. 912–21, 2001.
90
APPENDIX A. DESCRIPTION OF THE SIMULATOR
An assembly language driven simulator was created in order to examine the
performance of DSM with prefetch and grab operations on the SOME-Bus archi-
tecture. The simulator provides a detailed model of the processes and memory
and the DSM operation of every node on the SOME-Bus. It keeps track of every
memory access by each processor and its effect on individual data blocks.
The parameters of the CC-NUMA multiprocessor to be simulated are specified
as inputs to the simulator. Examples of these parameters include the number of
nodes, the number of threads per node, amount of memory allocated to each node
and the cache structure of each node (cache size, number of cache blocks and num-
ber of bytes per block).
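The input parameters listed above could be grouped as in the following sketch.
The class name, field names and numeric values are assumptions made for
illustration; the simulator's actual input format is not described here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SimulatorConfig:
    """Hypothetical container for the simulator inputs named in the text."""
    nodes: int
    threads_per_node: int
    memory_bytes_per_node: int
    cache_blocks: int
    bytes_per_block: int

    @property
    def cache_size_bytes(self):
        # Cache size follows from the block count and the block size.
        return self.cache_blocks * self.bytes_per_block

    @property
    def total_threads(self):
        return self.nodes * self.threads_per_node

# Example values only; they are not taken from the thesis experiments.
config = SimulatorConfig(
    nodes=16,
    threads_per_node=4,
    memory_bytes_per_node=1 << 20,
    cache_blocks=256,
    bytes_per_block=64,
)
```

Deriving the cache size from the other two cache parameters keeps the three
values consistent with one another.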
The applications to be simulated consist of MIPS assembly language instruc-
tions created by compiling C programs with the Unified Parallel C (UPC) compiler.
Each thread has its own register set and sequence of instructions. The execution of
the application is complete when all threads have finished processing their respec-
tive assembly instructions. If an instruction does not involve a memory access,
any necessary changes to the thread’s register set are performed directly. If an
instruction involves a memory reference, a cache lookup is performed. If the ad-
dress causes a cache hit the thread remains in the RUNNING state and the next
instruction in the thread’s sequence will be examined on the next simulator clock
cycle. If the memory reference causes a miss in the cache but can be obtained from
the local memory, the thread waits for the access to complete in the SUSPENDED
state. When the access completes, the thread transitions back to the RUNNING
state and the next instruction in the thread’s sequence will be examined on the
next simulator clock cycle. If the memory reference causes a miss in the cache and
the memory access cannot be filled by the local node, the thread is placed in the
BLOCKED state and another thread from the pool of ready threads is chosen and
placed in the RUNNING state. The new thread will begin running by processing
the next instruction in its sequence on the next simulator clock cycle.
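The thread state rules described above can be summarized in a short sketch.
The state names come from the text; the function and its arguments are
illustrative and do not reproduce the simulator's code.

```python
from enum import Enum

class ThreadState(Enum):
    RUNNING = "RUNNING"
    SUSPENDED = "SUSPENDED"
    BLOCKED = "BLOCKED"

def next_state(is_memory_ref, cache_hit, in_local_memory):
    """Return the state of a thread after it issues one instruction."""
    if not is_memory_ref or cache_hit:
        # Register-only instruction or cache hit: keep running.
        return ThreadState.RUNNING
    if in_local_memory:
        # Cache miss served by local memory: wait for the access to complete.
        return ThreadState.SUSPENDED
    # Cache miss that needs a remote node: another ready thread is scheduled.
    return ThreadState.BLOCKED

state_on_remote_miss = next_state(True, False, False)
```

Only the BLOCKED case triggers a thread switch; a SUSPENDED thread resumes as
soon as its local access completes.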
The operation of the processor, directory controller, cache controller and chan-
nel controller is simulated in detail in order to provide information needed to com-
pare the performance of each component for the DSM system before and after the
proposed prefetch and grab operations have been applied.
The PREF instruction of the MIPS IV instruction set was implemented in order
to provide prefetch capabilities. When the simulator processes the prefetch in-
struction, it performs a cache lookup and if the result is a cache miss a PREFETCH
Request message is sent to the Home node. The grab capability is achieved by
using a pre-defined “neighborhood”. When a message containing a data block is
sent to a node, the neighborhood is examined to determine if other nodes should
receive the data as well. All nodes that share the necessary neighborhood charac-
teristics for the data block will receive a copy of the outgoing message.
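A minimal sketch of the prefetch and grab behavior follows. It assumes the
cache is modeled as a set of block addresses and that each neighborhood is a
predicate over block addresses; both modeling choices, along with the message
fields, are hypothetical.

```python
def handle_pref(cache, block, home_node):
    """Return a PREFETCH request message on a cache miss, or None on a hit."""
    if block in cache:
        return None
    return {"type": "PREFETCH_REQUEST", "block": block, "to": home_node}

def grab_recipients(destination, block, neighborhoods):
    """Return all nodes that receive a copy of a message carrying `block`.

    neighborhoods: dict mapping node id to a predicate over block addresses
    (an assumed representation of the pre-defined neighborhood).
    """
    recipients = {destination}
    for node, wants in neighborhoods.items():
        if wants(block):
            recipients.add(node)
    return sorted(recipients)

# Example neighborhoods (made up): node 1 grabs even blocks, node 2 low blocks.
neighborhoods = {1: lambda b: b % 2 == 0, 2: lambda b: b < 8, 3: lambda b: False}
request = handle_pref(cache=set(), block=4, home_node=7)
recipients = grab_recipients(destination=0, block=4, neighborhoods=neighborhoods)
```

The block is sent once to node 0, and nodes 1 and 2 capture the same message
because block 4 falls inside their neighborhoods.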
All simulation results are based on time units equal to one clock cycle. The
performance of the multiprocessor is evaluated/verified in terms of processor and
channel utilization, the number of simulation cycles required to execute an ap-
plication, average waiting times for the queues, and average round-trip times for
Data and Ownership request messages.
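The metrics named above can be computed from cycle counts as in the sketch
below. The formulas are the usual definitions (busy cycles over total cycles,
and the mean of reply cycle minus request cycle); the thesis does not state
them in this appendix, so they are assumptions, as are the function names.

```python
def utilization(busy_cycles, total_cycles):
    """Fraction of simulated cycles during which a component was busy."""
    return busy_cycles / total_cycles if total_cycles else 0.0

def average_round_trip(request_reply_cycles):
    """Mean round-trip time, in cycles, over (sent, reply) cycle pairs."""
    trips = [reply - sent for sent, reply in request_reply_cycles]
    return sum(trips) / len(trips) if trips else 0.0

# Example numbers only; they are not results from the thesis.
processor_utilization = utilization(busy_cycles=750, total_cycles=1000)
data_round_trip = average_round_trip([(10, 50), (20, 80)])
```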
VITA
Harsha Vardhan Narravula was born in Kurnool, India and graduated from
Osmania University (Hyderabad, India) in 2000, with a bachelor’s degree in Elec-
trical and Computer Engineering. He joined Drexel University in 2000, and started
pursuing his PhD under Dr. Constantine Katsinis. His current research interests
include switch design, interconnection networks of parallel and distributed sys-
tems, parallel algorithms, and compiler design. During his years at Drexel, he has
held both research and teaching assistantships. He has been a teaching assistant
for several courses including “Computer Structures”, “Microcontrollers”, “Secure
Computing” and “Electrical Design Lab”.
He also worked as a consultant for Rydal Research for two years and was re-
sponsible for design, implementation and test of several components of a low-
latency high-performance processor interconnect. He has co-authored two journal
and three conference papers.

