Parallel processing for scientific computations by Alkhatib, Hasan S.
NASA-CR-198013
School of Engineering
Department of Computer Engineering
Santa Clara, CA 95053
Parallel Processing for Scientific
Computations
Final Report
Overview of Research Project Accomplishments
NASA Agreement Number NCC 2-644
March 29, 1995
Submitted to
Mr. Ken Stevens, Jr.
NASA-Ames Research Center
NAS System Division, Mail Stop 258-5
Moffett Field, CA 94035
(415)604-5949
(NASA-CR-198013) PARALLEL
PROCESSING FOR SCIENTIFIC
COMPUTATIONS Final Report
(Santa Clara Univ.) 49 p
N95-24563
Unclas
63/62 0045510
https://ntrs.nasa.gov/search.jsp?R=19950018143 2020-06-16T07:25:46+00:00Z
1. Introduction
The scope of this project dealt with the investigation of the requirements to support
distributed computing of scientific computations over a cluster of cooperative workstations.
Various experiments on computations for the solution of simultaneous linear equations were
performed in the early phase of the project to gain experience in the general nature and
requirements of scientific applications. A specification of a distributed integrated computing
environment, DICE, based on a distributed shared memory communication paradigm has
been developed and evaluated. The distributed shared memory model facilitates porting
existing parallel algorithms that have been designed for shared memory multiprocessor
systems to the new environment. The potential of this new environment is to provide
supercomputing capability through the utilization of the aggregate power of workstations
cooperating in a cluster interconnected via a local area network.
Figure 1
Clusters of Cooperating Workstations
The great majority of scientific applications require a fairly large amount of memory to
execute a task. If a task is to be partitioned into threads (sub-tasks) that are executed in
parallel, memory sharing is very desirable since it allows sharing variables among threads
within the same task. Shared memory multiprocessor systems have been the predominant
platform selected for executing large scientific applications for these reasons.
Workstations, generally, do not have the computing power to tackle complex scientific
applications, making them primarily useful for visualization, data reduction, and filtering as
far as complex scientific applications are concerned. There is a tremendous amount of
computing power that is left unused in a network of workstations. Very often a workstation is
simply sitting idle on a desk. A set of tools can be developed to take advantage of this
potential computing power to create a platform suitable for large scientific computations.
The integration of several workstations into a logical cluster of distributed, cooperative,
computing stations presents an alternative to shared memory multiprocessor systems. In this
project we designed and evaluated such a system.
Attached to this report are three papers published or accepted for publication, resulting from
this research project. These articles are:
1. Hasan S. AlKhatib, Qiang Li, Chi-Jiunn Jou, Tiekun Chen and Hassan Arafeh "DICE
- a Distributed Integrated Computing Environment for Multi-threaded Parallel
Processing", Proceedings of the Third International Systems Integration Conference,
Sao-Paulo, Brazil, August 15-19, 1994, pp 612-621.
2. Chi-Jiunn Jou, Hasan S. AlKhatib, Qiang Li and Tiekun Chen "Coherency Protocol and
Algorithm of the DICE Distributed Shared Memory", Proceedings of the ISCA
International Conference on Parallel and Distributed Computing Systems, Las Vegas,
NV, October 6-8, 1994. pp 796-801.
3. Chi-Jiunn Jou, Hasan S. AlKhatib and Qiang Li "Two-Tier Paging and Its Performance
Analysis for Network-based Distributed Shared Memory Systems", accepted for
publication in the IEICE Transactions on Information and Systems.
2. DICE Overview
DICE is a computing environment for executing multi-threaded tasks on a cluster of
networked workstations. In DICE, threads of a parallel task may run on separate
workstations sharing the same virtual address space. Threads communicate with each other
using shared memory. An overall system structure of DICE is shown in Figure 2.
DICE consists of three interactive subsystems: a distributed shared memory (DSM), a parallel
scheduler (PS), and a distributed run-time subsystem (DRS). DSM provides mechanisms for
sharing distributed memory among threads of a parallel task and hence supports the
underlying computing and communication paradigm. PS provides tools to initiate both local
and remote threads and to coordinate their execution over different workstations. DRS
provides the programmer's interface to develop parallel tasks as well as the run-time
environment for their execution.
Figure 2
System structure for DICE
(Diagram: workstations 1 through n, each running Threads over the DRE and the OS, connected by a Network.)
3. Distributed Shared Memory
In DICE, the physical memories of individual workstations in a cluster are treated as
resources for the virtual space of a multi-threaded parallel task. Pages of the address space of
a task can be shared among the threads of the same task. A task consists of multiple threads
that can run on different workstations in a cluster simultaneously. The virtual memory of
DICE is divided into private and shared spaces. Private space is local to a single workstation,
and is not shared among threads. An example of private space is the stack of a thread. Shared
space is global to all workstations, and is shared among all threads of a parallel task. Shared
space is further divided into read-only code and read-write data spaces. The initial
implementation of DICE will only support the shared data space.
DICE presents a new distributed shared memory design to attack the problems of false
sharing and thrashing. False sharing may occur in a typical distributed shared memory system
such as Ivy[1], since its consistency or access unit (e.g. a word) is smaller than the sharing unit
(a page). The single-writer nature of its coherency protocol may cause a "ping-pong"
behavior between multiple writers of a shared page, known as the thrashing problem. To overcome
these problems, Mach[2] uses a shared memory server to perform the fault scheduling via a
queueing mechanism[3]. Mether[4,5] avoids these problems through the use of
inconsistent memory. Clouds[6] avoids these problems by using single-writer-single-reader strict
coherence semantics. Mirage[7] reduces the effect of these problems by using a time window
scheme, in which the system guarantees that the writer of a page retains access to a page for a
fixed period of time. Munin[8] minimizes these problems by using multiple type-specific
coherency protocols.
To overcome these false sharing and thrashing problems, DICE DSM uses a hybrid memory
granularity and supports multiple coherency protocols. Shared memory is structured as a
two-layer paging system. The higher layer is a page, which is the same as the one in an existing
system. The lower layer is a paragraph, which is a small fixed-sized memory region within a
page. The memory sharing unit is a page, while the coherency unit is a paragraph. Each page
in the shared address space is divided into several small equal-sized paragraphs. Each
paragraph uses one and only one specific protocol at a time. The protocol used on a
paragraph can be changed to adapt to new application requirements. The default protocol
used on a paragraph is that of inconsistent memory, which only provides memory sharing
without coherency. Other coherency protocols include write-invalidate, write-update,
write-read-migrate, home-read-write, release-update, and entry-invalidate.
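The split between the sharing unit (page) and the coherency unit (paragraph) can be illustrated with a minimal sketch. The page and paragraph sizes below are assumptions chosen for illustration only, since the report does not state them, and the function names are hypothetical rather than part of DICE.

```python
# Illustrative sizes only; the report does not specify page or paragraph sizes.
PAGE_SIZE = 4096        # bytes per page (assumed)
PARAGRAPH_SIZE = 256    # bytes per paragraph (assumed)
PARAGRAPHS_PER_PAGE = PAGE_SIZE // PARAGRAPH_SIZE


def locate(address):
    """Split a shared virtual address into (page, paragraph, offset)."""
    page, in_page = divmod(address, PAGE_SIZE)
    paragraph, offset = divmod(in_page, PARAGRAPH_SIZE)
    return page, paragraph, offset


def falsely_shared(addr_a, addr_b, unit_size):
    """True if two distinct addresses fall inside the same coherency unit."""
    return addr_a != addr_b and addr_a // unit_size == addr_b // unit_size


# Two unrelated variables written by different threads (hypothetical addresses).
a, b = 0x2010, 0x2410
print(locate(a), locate(b))
print("same page:", falsely_shared(a, b, PAGE_SIZE))
print("same paragraph:", falsely_shared(a, b, PARAGRAPH_SIZE))
```

With these assumed sizes the two addresses share a page but not a paragraph, so a page-grained coherency unit would make the two writers contend while a paragraph-grained unit would not.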
Write-invalidate, write-update, write-read-migrate, and home-read-write protocols
provide a strict consistency on copies of a shared paragraph. They resemble the
read-replication, full-replication, migration, and central algorithms in [9] respectively. Both
release-update and entry-invalidate protocols provide a weak consistency memory model on
copies of a paragraph. The weak consistency memory model is different from the strict
consistency memory model in that it does not guarantee memory coherency without the use of
explicit high-level synchronization operations. Parallel programs, therefore, would need to
impose an ordering on accesses to shared memory by using synchronization operations. This
model treats shared memory accesses differently from synchronization variable accesses.
The model supports two types of synchronization accesses: acquire and release. Similar to the
software release consistent protocol used in [8], the release-update protocol ensures that all
previously modified data is updated before the release is performed on a synchronization
variable. Similar to the entry consistent protocol used in [10], the entry-invalidate protocol
ensures that consistent copies of paragraphs are pre-fetched when the acquire or entry of a
synchronization variable is performed.
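The release-update idea can be sketched as follows. This is a simplified, single-process illustration: writes are buffered locally and only pushed to other replicas when a release is performed. The class and method names are hypothetical, and the sketch ignores networking, acquire operations, and concurrent writers, which a real protocol would have to handle.

```python
class ReleaseUpdateParagraph:
    """One replica of a shared paragraph under a release-update style protocol."""

    def __init__(self):
        self.data = {}      # contents visible to local threads
        self.pending = {}   # local writes not yet propagated
        self.peers = []     # other replicas of the same paragraph

    def write(self, key, value):
        self.data[key] = value
        self.pending[key] = value   # propagation is deferred until release

    def read(self, key):
        return self.data.get(key)

    def release(self):
        """Push every pending write to all peers before the release completes."""
        for peer in self.peers:
            peer.data.update(self.pending)
        self.pending.clear()


writer, reader = ReleaseUpdateParagraph(), ReleaseUpdateParagraph()
writer.peers.append(reader)

writer.write("x", 1)
before = reader.read("x")   # not yet visible: no release has been performed
writer.release()
after = reader.read("x")    # visible once the release has propagated updates
print(before, after)
```

The point of the sketch is that a reader has no coherency guarantee between synchronization operations, which is why programs using this model must order their accesses with explicit acquire and release.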
DICE DSM is similar to the Munin[8] system, since both use multiple type-specific
coherency protocols. However, the protocols supported and their designs differ
between the two systems. A more significant difference lies in the memory structure and
granularity. DICE DSM uses a flat memory space of fixed-sized paragraphs, while Munin uses
a memory space structured as variable-sized objects. The advantage of using fixed-sized paragraphs
is that it allows the DSM to be implemented in hardware, like MemNet [11]. This will improve
the performance significantly, and is the goal for the final prototype of DICE DSM.
DICE separates the synchronization mechanism from shared memory. It supports two kinds of
synchronization variables: locks and barriers. Whereas locks are used primarily for access
control, that is, to resolve competition among parallel threads, barriers are used for sequence
control, that is, to ensure correct timing among cooperating threads. Other kinds of
synchronization variables can be built on top of them. DICE uses distributed queueing
schemes for both lock and barrier synchronization protocols.
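The distinction between access control and sequence control can be seen in a small local analogue. The sketch below uses Python's standard threading module on a single machine; it is not the DICE synchronization interface, and the variable names are chosen for illustration only.

```python
import threading

NUM_THREADS = 4
total = 0
results = [None] * NUM_THREADS
lock = threading.Lock()                    # access control
barrier = threading.Barrier(NUM_THREADS)   # sequence control


def worker(index):
    global total
    # The lock resolves competition: only one thread updates total at a time.
    with lock:
        total += index + 1
    # The barrier enforces timing: no thread proceeds until all have contributed.
    barrier.wait()
    results[index] = total


threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```

Because every thread reads the total only after the barrier, each one observes the complete sum rather than a partial one.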
4. Parallel Scheduler
DICE PS is a self-optimizing, application-specific scheduler. It is responsible for thread
scheduling and synchronization. The PS is implemented as a thread within the parallel task.
Each parallel task has one PS running on the workstation where the task initially starts to run.
This special thread is created during application load-time.
When an application needs to create another thread or to terminate itself by joining with
other threads, it passes control of the execution to the PS. The PS will find the fastest way to
run the application by using the information in the task execution dependence tree, which is
created as an auxiliary file during the compiling of the source program.
The PS decides whether the local workstation has enough resources to run the different
threads, which threads to send to remote workstations to run, and which remote workstations
to send them to. It uses several tools to make intelligent decisions at run time. Those tools
are: CPU load estimator, network load estimator, an intelligent database, and the bidding
process.
The CPU load estimator runs on every workstation on the network and keeps track of the
load on that workstation. The network load estimator monitors the traffic on the network,
and helps the parallel scheduler in avoiding heavily loaded networks. A small and efficient
database records thread performance on each workstation under different CPU and
network load conditions. This database helps the bidding process by giving the workstations a
reasonable estimate of the expected run times of various threads.
When the parallel scheduler decides that it is best to send some threads to a remote
workstation to run, it needs a way to pick those workstations. Instead of forcing other,
possibly heavily loaded, workstations to take some of the threads, the parallel scheduler asks
for help through the bidding process. It simply asks for help in running a given thread and tells
the other workstations about the memory and CPU requirements of the thread. This
information is found in the intelligent database. The detailed design of PS is based on our
previous work [12].
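The bidding process described above can be sketched as follows. This is an assumption-laden illustration: the report does not give the bid format or the selection rule, so the run-time estimate, the field names, and the "lowest estimated run time wins" policy are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Workstation:
    name: str
    cpu_load: float       # fraction of CPU already busy, 0.0 to 1.0 (assumed metric)
    free_memory_mb: int
    speed: float          # relative speed factor (assumed metric)

    def bid(self, cpu_seconds, memory_mb):
        """Return an estimated run time, or None to decline the thread."""
        if self.free_memory_mb < memory_mb or self.cpu_load >= 1.0:
            return None
        return cpu_seconds / (self.speed * (1.0 - self.cpu_load))


def choose_host(workstations, cpu_seconds, memory_mb):
    """Announce the thread's requirements and pick the lowest bid, if any."""
    bids = []
    for ws in workstations:
        estimate = ws.bid(cpu_seconds, memory_mb)
        if estimate is not None:
            bids.append((estimate, ws.name))
    return min(bids)[1] if bids else None


cluster = [
    Workstation("A", cpu_load=0.9, free_memory_mb=512, speed=1.0),
    Workstation("B", cpu_load=0.2, free_memory_mb=64, speed=1.0),
    Workstation("C", cpu_load=0.5, free_memory_mb=256, speed=2.0),
]
print(choose_host(cluster, cpu_seconds=10.0, memory_mb=128))
```

In this example workstation B declines for lack of memory and the heavily loaded workstation A submits a poor bid, so no workstation is forced to accept work it cannot run well.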
5. Distributed Run-Time Subsystem
DRS transforms the DICE DSM from a flat space into an object-oriented structured space.
DICE DRS consists of a set of tools that implement the DICE Application Programmer's
Interface (API), which provides users with programming tools to develop and execute DICE
multi-threaded applications. The tools used during program development include a parallel
language and its compiler, library interface functions, and a linker.
A new Object-Oriented Dataflow Language (OODL) will be designed and used as the parallel
language in DICE. One of the important features of object-oriented programming is
information hiding and encapsulation [13,14]. It provides a higher level of data abstraction in
modeling real world objects. Such concepts are helpful in designing parallel programs [13]. In
general, parallel programs are difficult to design because the programmer must consider
multiple execution threads instead of a single thread. All possible interactions among the
threads must be considered. Also, parallel programs are hard to maintain because a simple
change may affect the interaction pattern and result in global consequences. Information
hiding helps in reducing possible interactions that need to be considered, while data
encapsulation helps in minimizing the maintenance effort when program changes are needed.
While the object-oriented model provides a high level of programming abstraction, it does
not naturally exploit parallelism of applications constructed with objects. A dataflow model
can expose and exploit the maximum amount of parallelism, as well as express data
dependence from different levels of abstraction in a very natural way. The combination of the
object-oriented and dataflow concepts makes it easier for programmers to design large scale
multi-threaded parallel programs, and to build re-usable concurrent software modules.
The OODL language, in DICE, will be an extension of the object-oriented programming
language C++. Dataflow constructs will be added to allow programmers to express
parallelism explicitly. The parallel compiler can be realized using a preprocessor to translate
the extended source code into C++ programs, which in turn are compiled into object code
using an existing C++ compiler.
The run-time library interface functions provide a collection of library routines that are
linked with each parallel program. They are invoked to support the service requests made by
system processes at run-time. The OODL compiler will use these functions to realize the
parallelism expressed in the application programs. These functions can also be used by the
application directly.
6. Conclusions
The key results accomplished in this project include:
1. A design of a distributed shared memory system for distributed networked computing that
solves the problem of false-sharing. The DSM employs a two-tier paging scheme and a
set of management protocols and algorithms suitable for hardware support within the
architecture of a workstation.
2. The DSM scheme was evaluated analytically. The results verify the benefit of
the two-tier paging scheme in solving the problem of false-sharing.
3. The DSM was also simulated using the Block Oriented Network Simulator, BONES, and
was driven by a trace from a scientific application chosen from Stanford's SPLASH
benchmarks. The results of the simulation confirmed the results of the analytical work
and also verified the utility of the two-tier paging scheme.
The papers attached to this summary report contain further details of the work performed
under this project.
References
1. K. Li, "IVY: A Shared Virtual Memory System for Parallel Computing," In
Proceedings of the 1988 International Conference on Parallel Processing, pp. 94-101,
August 1988.
2. A. Forin, J. Barrera, and R. Sanzi, "The Shared Memory Server," Proceedings 1989
Winter USENIX Technical Conference, February, 1989, pp. 229-244.
3. A. Forin, J. Barrera, and R. Sanzi, Design, Implementation, and Performance
Evaluation of A Distributed Shared Memory Server for Mach, Technical Report
CMU-CS-88-165, Carnegie-Mellon University, Computer Science Department,
August, 1988.
4. R.G. Minnich and D. J. Farber, "The Mether System: A Distributed Shared Memory
for SunOS 4.0," In Usenix - Summer 89, Usenix, 1989.
5. R.G. Minnich and D. J. Farber, "Reducing Host Load, Network Load, and Latency in
a Distributed Shared Memory," Proceedings of the 10th International Conference on
Distributed Computing Systems, Paris, France, June 1990.
6. U. Ramachandran, M. Ahamad, and M. Khalida, "Unifying Synchronization and Data
Transfer in Maintaining Coherence of Distributed Shared Memory," Proceedings of
the 1989 International Conference on Parallel Processing, pp. 160-169, August 1989.
7. B.D. Fleisch, G. J. Popek, "Mirage: A Coherent Distributed Shared Memory
Design," Proceedings of the 12th ACM Symposium on Operating System Principles,
December 1989, pp. 211-222.
8. J.B. Carter, J. K. Bennett, and W. Zwaenepoel, "Implementation and Performance of
Munin," The 13th ACM Symposium on Operating Systems Principles, October 1990,
pp. 152-164.
9. M. Stumm, and S. Zhou, "Algorithms Implementing Distributed Shared Memory,"
IEEE Computer, Vol 23, No. 5, May 1990, pp. 54-64.
10. B. N. Bershad, M. J. Zekauskas, Midway: Shared Memory Parallel Programming
with Entry Consistency for Distributed Memory Multiprocessors, Technical Report
CMU-CS-91-170, Carnegie-Mellon University, September 1991.
11. G. S. Delp, The Architecture and Implementation of MemNet: A High Speed
Shared-Memory Compute Communication Network. Ph.D. Thesis, University of
Delaware, Department of Electrical Engineering, Newark, DE, May 1988.
12. H. Arafeh and H. S. AlKhatib, and H. Barraclough, "MOPPS: A Scheme for
Managing Parallel Scientific Programs in a Distributed Architecture," Proceedings of
COMPCON'90, the Annual International Computer Conference of the IEEE
Computer Society, February 25 - March 2, 1990, San Francisco, CA, pp 387-395.
13. B. Cox, Object Oriented Programming - An Evolutionary Approach,
Addison-Wesley, 1986.
14. B. Stroustrup, "What is "Object Oriented Programming"?," IEEE Software, Vol 5,
No. 3, May 1988, pp. 10-20.
15. Y. Wu, T. G. Lewis, "Parallelism Encapsulation in C++," In Proceedings of the
1990 International Conference on Parallel Processing, pp. 35-42, 1990.
Proceedings of the ISCA
International Conference
PARALLEL AND DISTRIBUTED COMPUTING
SYSTEMS
Las Vegas, Nevada U.S.A.
October 6-8, 1994
A Publication of
The International Society for
Computers and Their Applications - ISCA
ISBN: 1-880843-09-9
Coherency Protocol and Algorithm of the DICE Distributed Shared Memory
Chi-Jiunn Jou, Hasan S. AlKhatib, Qiang Li, and Allen Tiekun Chen
Abstract
DICE (Distributed Integrated Computing Environment)
DSM (Distributed Shared Memory) is an experimental system,
being developed at Santa Clara University, which supports
the execution of multiple threads on a cluster of networked
workstations. This paper presents the coherency
protocol and algorithm of DICE DSM, which is a novel approach
to the design of the virtual-memory based DSM. In
DICE DSM, the shared memory uses a two-tier paging system.
The first tier, page, is the common page used in an operating
system. The second tier is called a paragraph, which is a
smaller fixed-sized unit of memory contained within a page.
The introduction of paragraphs improves system performance
by reducing the probability of false sharing as well as the size
of the unit of information transferred over the network for
maintenance of memory coherency.
Keywords: coherency protocol and algorithm, distributed
shared memory, local area network.
Computer Engineering Department
Santa Clara University
Santa Clara, CA 95053
1. Introduction
A Distributed Shared Memory (DSM) system supports
the sharing of a virtual address space among processes running
on loosely-coupled processors. A number of DSM systems
over LANs have been developed [8]. Among them, Ivy
[5] is implemented on a network of Apollo workstations. The
memory is paged, and copies of pages may be replicated in
different hosts. Strict coherency semantics are used, and the
memory coherency is maintained by a write-invalidate with
dynamic ownership protocol. The owner of a page is located
via either a centralized manager, fixed distributed managers,
or an individual host which forwards the request. Ivy is used
for applications employing multi-threaded tasks. All threads
share the same virtual address space. False sharing may occur
in this system, since its consistency or access unit (e.g. word)
is less than the sharing unit (page). In addition, the single-write
nature of its protocol may cause a "ping-pong" behavior
between multiple writers of a shared page.
To overcome false-sharing and thrashing, some systems
employ special schemes. Clouds [7] avoids them by using a
single-writer-single-reader strict coherence semantics. Mirage
[3] reduces thrashing by using a time window scheme,
in which the system guarantees that the writer of a page retains
access to a page for a fixed period of time. Munin [2] handles it
by using multiple consistency protocols and software release
consistency. Mether [6] reduces false sharing and thrashing
through the use of incoherent memory.
DICE (Distributed Integrated Computing Environment) [1]
presents a novel approach to handle the problem of
false sharing and thrashing. The shared memory is structured
as a two-tier paging system. The first tier, called page, is the
page commonly used in an operating system. The second tier
is called a paragraph, which is a smaller fixed-sized block of
memory within a page. Paragraph is the coherency unit. The
introduction of paragraphs improves system performance by
reducing the probability of false sharing as well as the size of
the unit of information transferred over the network for
maintenance of memory coherency.
An overview of the DICE DSM architecture is given in
section 2. Section 3 presents the memory coherency protocol.
The algorithm for realizing the complete DSM protocol is
presented in section 4. Section 5 discusses the expected system
performance and concludes.
* This work was supported by NASA-Ames Research Center grant
number NCC 2-644 entitled "Parallel Processing for Scientific
Computations".
2. The DICE DSM Architecture
DICE is an experimental distributed computing system
which aims at providing a computing environment for the
execution of multi-threaded tasks. A parallel task may consist
of multiple threads that can be scheduled to run on a cluster of
workstations simultaneously. A thread is an active program
entity that provides the notion of a computation. Threads on
separate workstations also share the same virtual address
space, and communicate with each other using shared
memory. Synchronization of threads accessing shared resources
is done using functions provided by a distributed run-time
library.
Figure 1 shows the system structure of DICE. It consists
of three interactive subsystems. DRS (distributed run-time
subsystem) provides users with programming tools to develop
and execute DICE multi-threaded applications. DSM (distributed
shared memory) provides the underlying communication
and computing paradigm for threads of a parallel task.
PS (parallel scheduler) is a self-optimizing application-specific
scheduler, and is responsible for thread scheduling and
synchronization.
In addition to a host processor and memory, each node in
DICE also has a network processor and a Distributed Shared
Memory Management Unit (DSMMU). The DSMMU is an extension
of the traditional MMU which supports paragraph
validation/invalidation to achieve efficient management of the
DSM. When data is not available locally and needs to be
fetched from a remote host, the DSMMU triggers a special access
fault; otherwise, the DSMMU performs the traditional
TLB operations.
Figure 1. System Architecture of DICE
(Diagram: workstations 1 through n, each with Threads, DRS, PS, DSM, and OS layers, connected by a LAN.)
3. Coherency Protocol
In DICE, a parallel task consists of multiple threads that
run on a cluster of workstations (hosts) simultaneously.
Shared data can be distributed and replicated on the physical
memory of the members of a cluster. The DSM system supports
the sharing of virtual pages, and maintains coherency
among replicated data copies across the network. A parallel
task has a root host, on which it was first loaded and executed.
The root host maintains the state information for all shared
pages used by the task. Other hosts in the cluster maintain the
state information for the shared pages that are currently in
their local physical memories.
In DICE each shared page of a parallel task has a home
host. A home host maintains the state information for its
pages, ensures that the last copy of a page is not purged,
and keeps track of all copies of the paragraphs of its pages.
Other hosts in the cluster that have a copy of a page keep a
pointer to the home host. When a thread makes an attempt to
access a page for which it does not have a copy, it communicates
with the home host of the respective page in order to
complete the memory access transaction. When a host does
not know the home host for a certain page, a home-info fault
will be triggered and a home-info request will be sent to the
root host. The root host replies with the information about the
home host for the requested page. If the home host is not yet
assigned, the root host will assign the first requesting host as
the home host for the requested page. The root host will then
update its database and send to the requesting host a reply
informing it of this assignment.
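The home-host lookup described above can be sketched in a few lines. This is a single-process illustration under stated assumptions: message passing is replaced by a direct method call, and the class, method, and attribute names are hypothetical rather than taken from the DICE implementation.

```python
class RootHost:
    """Keeps the page-to-home-host mapping for one parallel task."""

    def __init__(self):
        self.home_of = {}   # page number -> home host id

    def home_info_request(self, page, requester):
        # If no home host is assigned yet, the first requester becomes the home host.
        return self.home_of.setdefault(page, requester)


class Host:
    def __init__(self, host_id, root):
        self.host_id = host_id
        self.root = root
        self.known_homes = {}   # locally cached pointers to home hosts

    def home_host(self, page):
        if page not in self.known_homes:
            # Stands in for a home-info fault followed by a request to the root host.
            self.known_homes[page] = self.root.home_info_request(page, self.host_id)
        return self.known_homes[page]


root = RootHost()
h1, h2 = Host("h1", root), Host("h2", root)
print(h1.home_host(7), h2.home_host(7), h2.home_host(8))
```

Here h1 asks about page 7 first and therefore becomes its home host; h2 later learns the same answer from the root host, and becomes the home host of page 8 by asking first.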
The memory coherency of DICE DSM is maintained at
the paragraph level. A paragraph can simultaneously be read
by multiple hosts, but it can only be written by one host at a
time. Access rights to a paragraph can be read-write, read-only,
or none. An owner host is the most recent host that has
read-write access to that paragraph. The ownership of a paragraph
may be transferred from one host to another. There is
no ownership when two or more hosts have read-only access
rights to that paragraph. The information about the ownership
of a paragraph is maintained at the home host of the page
containing the paragraph.
When a read operation is issued to a paragraph by a host
with none rights, a read-data fault will be triggered and a
read-data request will be sent to the paragraph's home host.
When a write operation is issued to a paragraph by a host with
none rights, a write-data fault will be triggered and a write-data
request will be sent to the paragraph's home host. When a
write operation is issued to a paragraph by a host with read-only
access rights, a write-access fault will be triggered and a
write-access request will be sent to the paragraph's home
host. In each case the home host directly or indirectly
responds with the requested information.
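The three fault types above follow directly from the pair (current access right, operation). A minimal sketch of that mapping is given below; the enum and function names are hypothetical and only the fault names come from the text.

```python
from enum import Enum


class Right(Enum):
    NONE = "none"
    READ_ONLY = "read-only"
    READ_WRITE = "read-write"


def classify_access(right, operation):
    """Return the fault triggered by an access, or None if it proceeds locally."""
    if operation == "read":
        # Any valid copy (read-only or read-write) satisfies a read.
        return "read-data" if right is Right.NONE else None
    if operation == "write":
        if right is Right.NONE:
            return "write-data"     # needs both the data and write permission
        if right is Right.READ_ONLY:
            return "write-access"   # data is present, only permission is missing
        return None
    raise ValueError(f"unknown operation: {operation!r}")


for right in Right:
    for op in ("read", "write"):
        print(right.value, op, classify_access(right, op))
```

The distinction between write-data and write-access matters for network traffic: in the second case the requester already holds a valid copy, so only a confirmation needs to be returned.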
At initialization, a home host is the default owner host for
all paragraphs within its respective pages. Any other host will
send a remote request to the home host when it tries to access a
paragraph of this page. If a read-data request is received, the
home host will return a reply containing the most recent copy
of the desired paragraph when it is the owner host or there is no
owner host of that paragraph. The access rights of both home
and requesting hosts are changed to read-only. If the home
host is not the owner host, it will forward this read-data
request to the owner host of that paragraph. The latter changes
its access right to read-only, and then sends to both home and
requesting hosts a reply containing the most recent copy of
that paragraph. After receiving the reply, both home and
requesting hosts change their access rights to read-only. The home
host will also reset the owner host of that paragraph to none. If
the home host is itself the requesting host, it will directly send the
read-data request to the owner host. The latter changes its
access right to read-only, and then sends back a reply which
contains the most recent copy of that paragraph. Having received
this reply, the home host changes its access right to read-only
and resets the owner host of that paragraph to none.
If a write-data request is received, the home host will
return a reply containing the most recent copy of the desired
paragraph if it is the owner of this paragraph. If multiple valid
copies exist, the home host will send invalidate requests to all
hosts holding the copies, and wait for confirmations from all
of them before returning the reply. Upon receiving the
invalidate request, each host changes its access right of that
paragraph to none and returns its confirmation to the home host.
The access right of the home host is then changed to none,
while the requesting host becomes the owner host and its
access right is changed to read-write. If the home host is not the
owner host, it will forward this write-data request to the owner
host of that paragraph. The latter changes its access right to
none, and then directly sends to the requesting host a reply
containing the most recent copy of that paragraph. After
receiving the reply, the requesting host changes its access right
to read-write and sends a confirmation message to the home
host. Having received this confirmation message, the home
host updates its database and records that the requesting host
becomes the owner host of that paragraph. If the home host is
the requesting host, it will directly send the write-data request
to the owner host. The latter changes its access right to none,
and then sends back a reply which contains the most recent
copy of that paragraph. Having received this reply, the home
host changes its access right to read-write and becomes the
owner host of that paragraph.
If a write-access request is received, the home host will
return the write-access confirmation when it is the owner of
that paragraph. If multiple valid copies exist, the home host
will send invalidate requests to all hosts (except the requesting
one) holding the copies, and wait for confirmations from
all of them before returning the confirmation message. Upon
receiving the invalidate request, each host changes its access
right of that paragraph to none and returns its confirmation to
the home host. The access right of the home host is then
changed to none, while the requesting host becomes the owner
host and its access right is changed to read-write. If the home
host is the requesting host, it will directly send the invalidate
requests to all hosts (except the requesting one) holding the
copies and wait for confirmations from all of them. Upon
receiving the invalidate request, each host changes its access
right to none and returns its confirmation to the home host.
The home host then changes its access right to read-write and
becomes the owner host of that paragraph.
Figure 2 shows the state diagram representing the location
of a valid paragraph. This state diagram reflects the protocol
described above. At any time, the location state of a valid
paragraph is either none, at home host, at owner host, or at
multiple hosts. The state is initially set to none when a home
host has not yet been assigned. A home-info fault and request
made by any host forces the root host to assign the requesting
host to become the home host. The state is then changed to at
home host. In this case the home host is the owner of the
paragraph.
Figure 2. State diagram for the location of a valid paragraph
A paragraph will leave the at home host state when either
a read-data or a write-data fault occurs. A read-data fault
and request at any non-home host causes the paragraph to
transit to the at multiple hosts state. In this case there is no
owner host and multiple hosts have valid copies (with read-only
access rights) of the paragraph. Note that these multiple
hosts always include the home host. A write-data fault and
request causes the paragraph to transit to the at owner host
state, where the requesting host becomes the owner host of the
paragraph.
The paragraph will leave the at multiple hosts state when
either a write-access or a write-data fault occurs. A write-access
or a write-data fault and request at any other non-home
host causes the paragraph to transit to the at owner host
state. A write-access fault and request at the home host
causes the paragraph to transit to the at home host state. A
read-data fault and request at any other non-home host will
still keep the paragraph in the at multiple hosts state. Note that
a read-data or a write-data fault will never occur at the home
host, since a home host has a valid copy of the paragraph (with
read-only access rights) in the at multiple hosts state.
The paragraph may leave the at owner host state when either
a read-data or a write-data fault occurs. A read-data
fault and request at any other host causes the paragraph to
transit to the at multiple hosts state. A write-data fault and
request at the home host causes the paragraph to transit to the
at home host state. A write-data fault and request at any other
non-home host causes a change of ownership, but the paragraph
will still be in the at owner host state.
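The transitions described above can be collected into a small lookup table. The following Python sketch is an illustration only: the `TRANSITIONS` table and `next_state` function are hypothetical names, and the boolean in each key (whether the fault is raised at the home host) is an encoding chosen here, not one given in the paper.

```python
# Location states of a valid paragraph.
NONE = "none"
AT_HOME = "at home host"
AT_OWNER = "at owner host"
AT_MULTIPLE = "at multiple hosts"

# (current state, fault type, fault raised at the home host?) -> next state
TRANSITIONS = {
    (NONE, "home-info", False): AT_HOME,           # no home exists yet
    (AT_HOME, "read-data", False): AT_MULTIPLE,
    (AT_HOME, "write-data", False): AT_OWNER,
    (AT_MULTIPLE, "read-data", False): AT_MULTIPLE,
    (AT_MULTIPLE, "write-access", False): AT_OWNER,
    (AT_MULTIPLE, "write-data", False): AT_OWNER,
    (AT_MULTIPLE, "write-access", True): AT_HOME,
    (AT_OWNER, "read-data", False): AT_MULTIPLE,
    (AT_OWNER, "read-data", True): AT_MULTIPLE,
    (AT_OWNER, "write-data", True): AT_HOME,
    (AT_OWNER, "write-data", False): AT_OWNER,     # ownership changes hands
}


def next_state(state, fault, at_home=False):
    """Return the next location state, or raise if the fault cannot occur."""
    try:
        return TRANSITIONS[(state, fault, at_home)]
    except KeyError:
        raise ValueError(f"fault {fault!r} cannot occur in state {state!r}") from None
```

Combinations missing from the table, such as a read-data fault at the home host in the at multiple hosts state, raise `ValueError`, which mirrors the remark that such faults never occur.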
4. Coherency Algorithm
To support the above protocol, a page table (PT) and a
paragraph table (ParT) are used to maintain the state information
about shared pages and paragraphs. Each DICE application
maintains its own set of these tables. In addition to the
address mapping information and flags, PT also maintains the
information about the location of the home host of each shared
page. This location information is denoted by the home host
identifier or hid. ParT maintains the information about the access
rights to each paragraph (acc). The ParT of the home
host also maintains the location of the owner host of a paragraph
(oid), and the set of hosts (excluding the home host)
which have read-only copies of the paragraph (copyset).
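As an illustration of these tables, the following Python sketch models a PT entry and a ParT entry and builds the tables a host might hold right after it becomes the home of a page. The class names, the `init_home_page` helper, and the initial values (home host as owner with read-write access to every paragraph) are assumptions made for this sketch; address mapping information and flags are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class PTEntry:
    hid: Optional[int] = None     # home host identifier of the page


@dataclass
class ParTEntry:
    acc: str = "none"             # local access right: read-write, read-only, or none
    oid: Optional[int] = None     # owner host (kept by the home host only)
    copyset: Set[int] = field(default_factory=set)  # read-only holders, home excluded


def init_home_page(p: int, k: int, home: int
                   ) -> Tuple[Dict[int, PTEntry], Dict[Tuple[int, int], ParTEntry]]:
    """Build PT and ParT entries for page p with k paragraphs at its new home."""
    pt = {p: PTEntry(hid=home)}
    # Assumed starting point: the home host owns every paragraph of the page.
    part = {(p, g): ParTEntry(acc="read-write", oid=home) for g in range(k)}
    return pt, part
```

Keying ParT by the pair (p, g) follows the convention that p and g denote the current page and paragraph numbers.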
The coherency algorithm handles various kinds of paragraph
validation faults as described in section 3. These faults
include home-info, read-data, write-data, and write-access
faults. We divide the algorithm into four parts, corresponding
to the four fault types. Each part of this algorithm consists of a
fault handler and its server, as illustrated in Figures 3 to 6 for
the respective fault type. Note that p and g, which are used
within the algorithm, denote the current page and paragraph
numbers, respectively.
[Figure 3. Home-info fault handler and server (pseudocode listing not legible in the source scan).]
5. Discussions and Conclusions
We have presented the memory coherency protocol and
algorithm of DICE DSM. The coherency protocol for this
two-tier paging system is now being simulated in software.
The performance of the DICE DSM system has been studied using
an analytical model [4], which derives an expression for
the speedup of the parallel part of an application (or Sp). In
this analysis, a high-speed and low-latency ATM network is chosen
as the underlying platform, and the queuing time on the network
is assumed to be negligible. The memory access unit is
assumed to be four bytes (or one word). Each page has P bytes
and k paragraphs per page. An application is executed by N
hosts, and uses M bytes of shared memory space. The behavior
of an application is represented by the percentage of data
memory accesses for total instructions (denoted by d); the
probabilities of read and write faults (denoted by Nrf and Nwf,
which are the number of read faults and write faults per
1,000,000 memory references per host); temporal locality
(denoted by Xt, which is the number of times that the same
paragraph is accessed continuously by a host); and spatial
locality factor (denoted by Xs, which is the probability of a
certain region of shared memory being accessed by a specific
host). The temporal locality Xt is further represented by a
step uniform distribution (with parameters N0, N1, and g,
which are the starting pointer, ending pointer, and window size
of this step), which approximates the bell-like normal distribution
reflecting the intuition that the chance of a memory
location being accessed by a host decreases as the distance
grows from the previously accessed location.

[Figure 4. Read-data paragraph fault handler and server (pseudocode listing not legible in the source scan).]
The effects of changing the system structure and
application behavior on Sp have been studied, and some of these
results are shown below. Figure 7 shows that the gain in Sp becomes
smaller and smaller as the network data rate Rn increases.
This may justify the above assumption that the
queuing time on the network is negligible in a high-speed and
low-latency network. Figure 8 shows that Sp decreases as
processor speed Rp increases. Note that the total execution
time for an application will still be reduced as Rp increases,
although Sp decreases.
Figure 9 shows that Sp increases as the number of paragraphs
per page, k, increases up to a certain point. After that
point, Sp slightly decreases as k further increases. Furthermore,
Sp is approximately the same for a fixed paragraph
size, which is P/k. This behavior demonstrates the usefulness of
using a paragraph with a smaller granularity than a page.
Figure 10 shows a similar behavior for Sp in relationship
with the number of hosts N.
The analysis of this performance model demonstrates the
effect of using a paragraph, which has a smaller granularity than
a page. This smaller granularity reduces the probability of
false sharing and the amount of data to be transferred over the
network. The performance of DICE DSM is also going to be
evaluated by a trace-driven simulation model, which will
take network queuing delay into consideration and give more
realistic results.

[Figure 5. Write-data paragraph fault handler and server (pseudocode listing not legible in the source scan).]
The concept of using a paragraph is different from that of
using a cache line or from the ones just using a small page size.
Cache-based DSM has been used in multiprocessor systems,
which need to build their own interconnection network interface
and use their own message-based communication
scheme. In contrast, paragraph-based or page-based DSM is
used in systems over LANs, using the existing network interface
with standard packet-based or cell-based network communication
protocols. As compared with a small page size, the
paragraph reduces the complexity of the shared memory management
due to the use of a small page table and the two-layered
hierarchical page/paragraph structure, while allowing
a host to continue using the larger page size, following the trend of
current memory design in uniprocessor computer systems.
This reduction of complexity is also due to the use of home
hosts in the protocol, which makes it easy to locate the desired
memory unit while distributing the management of
memory over the hosts on a LAN.

[Figure 6. Write-access paragraph fault handler and server (pseudocode listing not legible in the source scan).]
References
[1] H. S. AlKhatib, Q. Li, C. Jou, Z. Chen, and H. Arafeh,
"DICE - A Distributed Integrated Computing Environment
for Multi-Threaded Parallel Processing," to
appear in the Proceedings of International Conference
on System Integration, August 15-19, 1994, Sao
Paulo, Brazil.
[2] J. B. Carter, J. K. Bennett, and W. Zwaenepoel, "Implementation
and Performance of Munin," The 13th ACM
Symposium on Operating Systems Principles, October
1990, pp. 152-164.
[3] B. D. Fleisch, G. J. Popek, "Mirage: A Coherent Distributed
Shared Memory Design," Proceedings of the
12th ACM Symposium on Operating System Principles,
December 1989, pp. 211-222.
[4] C. Jou, H. S. AlKhatib, and Q. Li, Performance Analysis
of DICE Distributed Shared Memory System, Distributed
Computing Lab Technical Report No.
03231994, Santa Clara University, 1994.
[5] K. Li, "IVY: A Shared Virtual Memory System for Parallel
Computing," In Proceedings of the 1988 International
Conference on Parallel Processing, pp. 94-101,
August 1988.
[6] R. G. Minnich and D. J. Farber, "Reducing Host Load,
Network Load, and Latency in a Distributed Shared
Memory," Proceedings of the 10th International Conference
on Distributed Computing Systems, Paris,
France, June 1990.
[7] U. Ramachandran, M. Ahamad, and M. Khalidi, "Unifying
Synchronization and Data Transfer in Maintaining
Coherence of Distributed Shared Memory," Proceedings
of the 1989 International Conference on
Parallel Processing, pp. 160-169, August 1989.
[8] M. Tam, J. M. Smith, and D. J. Farber, "A Taxonomy-Based
Comparison of Several Distributed Shared
Memory Systems," ACM Operating Systems Review,
Vol. 24, No. 3, July 1990, pp. 40-67.

Figure 7. Sp vs k for different Rn. N=16, Rp=50 MIPS,
M=64 kbytes, P=4 kbytes, d=0.4, Nrf=500,
Nwf=10, Xs=0.5, N0=10, N1=100, g=100.
Figure 8. Sp vs k for different Rp. N=16, Rn=150 Mbps,
M=64 kbytes, P=4 kbytes, d=0.4, Nrf=500,
Nwf=10, Xs=0.5, N0=10, N1=100, g=100.
Figure 9. Sp vs k for different P. N=16, Rn=150 Mbps,
Rp=50 MIPS, M=64 kbytes, d=0.4, Nrf=500,
Nwf=10, Xs=0.5, N0=10, N1=100, g=100.
Figure 10. Sp vs N for different k. Rn=150 Mbps, d=0.4,
Rp=50 MIPS, M=64 kbytes, P=[illegible],
Nrf=500, Nwf=10, N0=10, N1=100, g=100.
801
A Two--Tier Paging Scheme for Network-based Distributed Shared Memory Systems
Chi-diunn Jou, Hasan S. AlKhatib, and Qiang Li
Abstract - Distributed computing over a network of workstations continues to be an elusive goal. Its
main obstacle is the delay penalty due to network protocol and OS overhead. We present in this paper a low
level hardware supported scheme for managing distributed shared memory (DSM), as an underlying paradigm
for distributed computing. The proposed DSM is novel in that it employs a two-tier paging scheme that re-
duces the probability of false sharing and facilitates an efficient hardware implementation. The scheme em-
ploys a standard OS page and divides it into fixed smaller memory units called paragraphs, similar to cache
lines.
An application address space is viewed as consisting of a shared data region, an unshared data region, a
stack region and a code region. Code, stack and unshared data regions are handled by the OS in the standard
manner without modification. The proposed scheme manages the shared data regions only. A hardware exten-
sion of a traditional MMU, Distributed MMU or DMMU, is introduced to support the DSM. Shared memory
coherency is maintained through a write-invalidate protocol. An analytical model is built to evaluate the sys-
tem sensitivity to various parameters and to assess its performance.
Keywords - distributed shared memory; false sharing; hardware support for distributed computing;
memory coherency protocol; performance evaluation; networks of workstations.
1. Introduction
Despite the tremendous progress made in local area networking over the past decade and a half, the
operating system and network protocol technologies have yet to address the main obstacle to distributed
computing, namely the delay due to the network overhead. Network speed has reached several
hundreds of Mbps, but the real issue is the network overhead latency in addition to sustained through-
put.
*This work was supported by NASA-Ames Research Center grant number NCC 2-644 entitled "Parallel Processing for
Scientific Computations".
The problem consists of a myriad of sub-problems, and is not simple to resolve. It requires a system-wide
consideration on the full integration of networks into the operating system, and a re--examination of network
protocols and the overall system architecture, including hardware support for both network protocols and the
OS. This integrated view is underway in a project at Santa Clara University, called DICE, a Distributed Integrated
Computing Environment [1]. DICE supports a distributed shared memory paradigm, DSM. This paper
presents the design and performance of DICE DSM.
A number of DSM systems based on LANs have been developed over the past decade [18]. Among them,
Ivy [13] is implemented on a network of Apollo workstations. The memory is paged, and copies of pages may
be replicated in different hosts. A multiple-readers-and-single-writer strict coherency semantics is used on the
page level. Memory coherency is maintained via a dynamic ownership protocol with a write-invalidate proce-
dure. The owner of a page is located using either a centralized manager, a group of fixed distributed managers,
or the individual host which forwards the request. Ivy is designed for multi-threaded applications. All threads
share the same virtual address space. False sharing may occur in this system, since its consistency or access
unit (e.g. word) is less than the sharing unit (page). In addition, the single-writer nature of its protocol may
cause a "ping-pong" behavior between multiple writers of a shared page, leading to thrashing.
The problems of false-sharing and thrashing have been addressed by other DSM systems. Clouds [15]
avoids them by using a single-writer-single-reader strict coherence semantics introducing instead significant
blocking delays. Mirage [9] reduces thrashing by using a time window scheme, in which the system guaran-
tees that the writer of a page retains access to a page for a fixed period of time, suffering again from blocking
delays. Munin [3] handles it by using multiple consistency protocols and software release consistency, hence
placing the burden on the user. Mether [14] eliminates false sharing and thrashing by ignoring memory coher-
ency altogether, leaving its burden to the application software.
DICE represents a novel approach to handling the problem of false sharing and thrashing. The shared portion
of memory is structured as a two-tier paging system. The first tier is a normal page, and the second is
called a paragraph, which is a smaller fixed-size block of memory within a page. Coherency is maintained at
the level of a paragraph. The introduction of paragraphs improves system performance by reducing the proba-
bility of false sharing as well as the size of the unit of information transferred over the network for maintenance
of memory coherency. A Distributed Memory Management Unit, DMMU, an extension of the tradition-
al MMU, is designed to support the paragraph validation, and a special network controller is used to
support the accesses to the remote memory and the maintenance of memory coherence.
Section 2 of this paper gives the overview of the DICE architecture. The design of the DICE distributed
shared memory is described in section 3. An analytical model and the expected system performance are pre-
sented and discussed in section 4. Section 5 concludes this work and compares it to other approaches.
2. Overview of the DICE Architecture
DICE is an experimental distributed environment for executing multi-threaded tasks. A parallel task may
consist of multiple threads that can be scheduled to run simultaneously on a cluster of workstations. Threads
executing on separate workstations share the same virtual address space, and communicate with each other
using shared memory. Synchronization of threads accessing shared resources is done using functions provided
by a distributed run-time library.
DICE consists of three interactive subsystems. The DSM provides the underlying communication para-
digm among threads of a parallel task. The DRS (distributed run-time subsystem) provides users with pro-
gramming tools to develop and execute DICE multi-threaded applications. The PS (parallel scheduler) is a
self-optimizing application-specific scheduler, and is responsible for thread scheduling and synchronization.
3. Design Issues of the DICE DSM
DICE DSM is designed for a cluster of workstations connected via a high-speed, low-latency local area
network. The architecture of a node in a DICE system is shown in Figure 1. Each node consists of a host
processor and a physical memory module. The traditional MMU is replaced by a DMMU. The network interface
is attached directly to the memory bus and contains a network processor and a dual ported memory visible
both to the host and network processors, simultaneously. The dual ported memory holds data structures for
managing the shared memory.

Figure 1. The Architecture of a DICE Node
(Legend: VA: virtual address; PA: physical address; I/D: instruction & data path)
3.1. Programmer's View of DICE DSM Environment
In DICE, a parallel task consists of multiple threads that can run on a cluster of workstations (nodes), simul-
taneously. Memory pages required by each thread, whether code or data, are allocated physical memory
blocks, at the respective node, where the thread is running. Shared data pages are distributed and repli-
cated among the nodes as needed by the threads. The DSM system is designed to support the sharing of
data pages. The DSM system also maintains the coherency among replicated data copies.
Each parallel task has a root node, on which it was first loaded and executed. The root node main-
tains state information for all pages, including shared pages used in the application, while other nodes
maintain the state information for the pages that are loaded in their local systems.
Code and non-shared data pages of a thread are loaded in the physical memory of the node where
the thread is scheduled for execution. Shared data pages, on demand, are first loaded into the physical
memory of the node. That node becomes the home node for the page. A home node maintains the com-
plete state information for its pages. It ensures that the last copy of a page is not purged, and keeps track of all
copies of paragraphs belonging to its pages. Other nodes in the cluster, that have a copy of a shared page, keep a
pointer to the page's home node. When a thread makes an attempt to access a page for which it does not have a
copy, it interacts with the home node of that page in order to complete the memory access. When a node does
not know the home for a certain page, a home-info fault is triggered and a home-info request is sent to the root
node. The root node replies with the information about the home node for the requested page. If a home is not
yet assigned for the page, the root node assigns the first requesting node the status of home for that page. The
root node then updates its table and sends the page to the requesting node. The requesting node, upon receiving
the page and the assignment of home status, updates its page table and creates a paragraph map table for that
page.
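The root node's handling of a home-info request, as described above, amounts to a first-come assignment. A minimal Python sketch follows; the `handle_home_info` function and the `home_of` dictionary (page number to home node) are hypothetical names, and the transfer of the page itself is left out.

```python
from typing import Dict, Tuple


def handle_home_info(home_of: Dict[int, int], page: int, requester: int) -> Tuple[int, bool]:
    """Root-node side of a home-info request.

    Returns the home node of the page and a flag telling whether the
    requester was just assigned the home status.
    """
    if page not in home_of:
        home_of[page] = requester     # the first requesting node becomes the home
        return requester, True
    return home_of[page], False
```

For example, if node 3 and then node 5 ask about the same unassigned page, node 3 is made the home and node 5 is told that the home is node 3.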
3.2. Coherency Protocol
The memory coherency of DICE DSM is maintained at the paragraph level. A paragraph can simultaneously
be read by multiple nodes, but it can only be written by one node at a time. Access rights to a paragraph
can be read-write, read-only, or none. An owner node of a paragraph is the node that has read-write access to
that paragraph. The ownership of a paragraph may be transferred from one node to another upon demand.
There is no owner for a paragraph when two or more hosts have read-only access rights to that paragraph. The
information about the owner of a paragraph is maintained by the home node of the page containing the paragraph.
When a read operation is issued to a paragraph by a node with none rights, a read fault is triggered and a
read request is sent to the paragraph's home. When a write operation is issued to a paragraph by a node with
none rights, a write-data fault is triggered and a write-data request is sent to the paragraph's home. When a
write operation is issued to a paragraph by a node with read-only access rights, a write-access fault is triggered
and a write-access request is sent to the paragraph's home. In each case the home directly or indirectly responds
with the requested information. The coherency of paragraphs is basically maintained through a write-invalidate
protocol. The details of this protocol and its algorithm are shown in [11].
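The three fault rules above can be written as a small classifier. This Python sketch is illustrative only; the `classify_fault` function and its string arguments are hypothetical names chosen for this example.

```python
from typing import Optional


def classify_fault(operation: str, rights: str) -> Optional[str]:
    """Return the fault triggered by an access, or None if the access proceeds."""
    if rights not in ("read-write", "read-only", "none"):
        raise ValueError(f"unknown access rights: {rights!r}")
    if operation == "read":
        # A read needs at least read-only rights.
        return "read" if rights == "none" else None
    if operation == "write":
        if rights == "none":
            return "write-data"       # both the data and the right are missing
        if rights == "read-only":
            return "write-access"     # the data is local, only the right is missing
        return None
    raise ValueError(f"unknown operation: {operation!r}")
```

The distinction between write-data and write-access matters because only the former requires the paragraph contents to be shipped over the network.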
3.3. Management of Shared Memory
Page and paragraph tables are used to maintain the state information for shared pages and their paragraphs,
respectively. Each DICE application maintains its own set of these tables. A Page Table (PT), similar to a
traditional page table, provides the information about mapping the virtual addresses of pages to their corre-
sponding physical addresses, at their respective nodes. A Paragraph Validation Table (PVT), maintains the
information about the access rights of the page's paragraphs. Each entry of a PVT contains a 2-bit field maintaining
the access rights of the local node to the respective paragraph. Note that there is no address translation
for paragraphs. Each node keeps a Page Table for Home information (PTH), which maintains the information
about the homes for its shared pages. Each home node of a page maintains a Paragraph Table (ParT) for that
page containing a pointer to the current owner of each paragraph and a list of nodes with read-only copies of
the paragraph. There is only one ParT for a page in the system. It is maintained by the home node of that page.
The PT and PVT are maintained in the dual-ported memory, inside the LAN interface. They are used by both
host and network processors. The PTH and ParT are maintained in the network subsystem, and are only used
by the network processor. Figure 2 shows the data structures for these tables.
DMMU is an extension of the traditional MMU. It is designed to support paragraph validation for efficient
handling of distributed shared memory. When data is not available locally and needs to be fetched from a
remote node, the DMMU triggers special access faults via an embedded hardware unit, PVLB (Paragraph Vali-
dation Lookaside Buffer) - to validate the access rights of paragraphs. The DMMU performs the traditional
TLB operations for all non-shared pages as well. When the DMMU does not find the entry it needs in its TLB,
it fetches the entry from the appropriate PT in memory. When an entry is loaded from the PT into the TLB, all
entries of its associated PVT (2 bits per paragraph) are simultaneously fetched and stored into the associated
PVLB. When an entry of the TLB is replaced, all entries of its associated PVLB are also replaced. Note, there
are no PVLBs for non-shared pages.
In Dual-ported Memory:
  PT (entries 0, 1, ...): flags, shared bit, physical page frame number, pointer to PVT
  PVT: k access-rights fields (acc rights), one per paragraph
In Network Memory:
  PTH (entries 0, 1, ..., n): home id, home bit (set to 1 if local node is home), pointer to ParT (if home)
  ParT (for home only): owner id and copyset for each paragraph
Figure 2. Page and Paragraph Tables for Shared Pages in DICE
Figure 3 shows the structure of the TLB and the PVLB. Each entry in the TLB contains an address tag, a
physical page frame number, flags, and an S bit. The S bit is used to distinguish shared pages from non-shared
pages. Each TLB entry of a shared page has an associated PVLB, which has k two-bit access rights fields,
where k is the number of paragraphs within a page. The virtual address is grouped into three fields: a page
number, a paragraph number, and a paragraph offset. The page number is used as a key to match the address
tags in the TLB, while the paragraph number directly addresses the PVLB entries corresponding to the same
paragraph number. The latter operation will simultaneously select n PVLB entries, where n is the number of
PVLBs in the DMMU. Each PVLB has an associated logic L, which validates the access rights of the refer-
enced paragraph. By checking the stored two-bit access rights field and the current memory access type R/W,
logic L generates a Trap signal. The Trap signal is ON when any paragraph validation fault occurs. The Trap
causes a system trap and requires the software to distinguish the type of the current access fault and resolve it.
virtual address: [ page number | para number | para offset ]
TLB entry: tag, pfn, flags, S
PVLB entry: acc rights (inputs to logic L: acc rights, R/W; output: Trap)

Logic L Function Table
R/W   acc rights   Trap   comments
R     read-write   no
R     read-only    no
R     none         yes    read-data fault
W     read-write   no
W     read-only    yes    write-access fault
W     none         yes    write-data fault

Figure 3. TLB and PVLB Structures
If there is no Trap, the physical access to the paragraph proceeds without interruption. The function of logic L
is shown in the table inside Figure 3. The S bit of the selected TLB entry is used as a gate to control the final
selection of the Trap signal generated from the previously selected n PVLB entries. Note that the operations on
the PVLB are executed in parallel with the operations on the TLB, except for the final selection of the PVLB
output. Hence, if a memory reference does not generate a paragraph validation Trap, no significant extra delay
will be suffered by going through this additional PVLB unit compared to a traditional MMU.
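The address split and the gated Trap signal described above can be modelled in software as follows. This is a sketch under stated assumptions: the 2-bit encoding of the access rights, the function names, and the use of power-of-two page and paragraph sizes are choices made here for illustration and are not specified in the text.

```python
# Assumed 2-bit encoding of the access rights field.
NONE, READ_ONLY, READ_WRITE = 0, 1, 2


def split_address(va: int, page_bits: int, para_bits: int):
    """Split a virtual address into (page number, paragraph number, paragraph offset).

    page_bits is log2 of the page size in bytes and para_bits is log2 of the
    number of paragraphs per page (k).
    """
    offset_bits = page_bits - para_bits
    page = va >> page_bits
    para = (va >> offset_bits) & ((1 << para_bits) - 1)
    offset = va & ((1 << offset_bits) - 1)
    return page, para, offset


def trap(shared: bool, acc: int, is_write: bool) -> bool:
    """Trap signal: logic L output gated by the S bit of the selected TLB entry."""
    if not shared:
        return False                  # non-shared pages have no PVLB
    if acc == NONE:
        return True                   # read-data or write-data fault
    return is_write and acc == READ_ONLY   # write-access fault
```

With a 4-kbyte page (page_bits = 12) and k = 16 paragraphs (para_bits = 4), each paragraph spans 256 bytes, so address 0x12345 falls in page 0x12, paragraph 3, at offset 0x45.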
The control unit of the DMMU contains the logic to manage the retrieval of entries from the PTs and the
PVTs in the dual-ported memory. It also controls the TLB and PVLB update operations, and handles other
related activities. When the retrieval of the entries of the PVT fails, the DMMU triggers a PVT trap resulting
in a home-info fault as described in section 3.1. Other paragraph validation faults are generated by the PVLB
as described above.
4. Performance Analysis
The performance of a DICE DSM system is mainly affected by the delays encountered in handling different
paragraph validation faults, which in turn depend on the execution delay of messages sent over the network
to resolve paragraph faults. In the following analysis, a performance metric is first defined. The system
and network model is presented. Thereafter, the application behavior model along with the protocol cost are
described. Finally, the performance results for different combinations of system configurations and applica-
tion profiles are shown and discussed.
4.1. Performance Metric
The performance of parallel systems is often measured in terms of speedup, which is the ratio of the execu-
tion time of a program run on a single processor to that run on a parallel system. We limit ourselves to the
speedup for the parallel part of an application only. We define the speedup for the parallel part of an application,
S_p, as the ratio of the execution time of the parallel part of an application running on a single processor to
that running on a DICE DSM system.
Let us denote by T_s and T_dsm the total execution time for the parallel part of an application by a single
node and by N nodes in a DICE DSM system, respectively. Let the processor speed of a single node be denoted
by R_p MIPS. Let the total number of instructions required to be executed in the parallel part of the application
be denoted by I_a, and the average rate of shared data memory accesses per instruction be denoted by d_s. Then,

\[ T_s = \frac{I_a}{R_p} \quad \text{and} \quad T_{dsm} = \frac{1}{N}\left(\frac{I_a}{R_p} + d_s I_a T_{pcost}\right) \tag{1} \]

where T_pcost denotes the average protocol cost per shared data memory access, and will be derived in the following
subsections, using an analytical system model. The term d_s I_a T_pcost represents the total overhead
when using the DICE DSM. The speedup for the parallel part of an application S_p is therefore:

\[ S_p = \frac{T_s}{T_{dsm}} = \frac{N}{1 + d_s R_p T_{pcost}} \tag{2} \]
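The speedup expression can be checked numerically. In the Python sketch below the function names are hypothetical and the numeric values used with them are purely illustrative, not results from the paper; the processor speed is taken in instructions per microsecond (numerically equal to MIPS) and the protocol cost in microseconds, so that their product is dimensionless.

```python
def execution_times(i_a: float, r_p: float, d_s: float, t_pcost: float, n: int):
    """Return (T_s, T_dsm) in microseconds for i_a instructions on 1 and on n nodes."""
    t_s = i_a / r_p
    t_dsm = (i_a / r_p + d_s * i_a * t_pcost) / n
    return t_s, t_dsm


def speedup(n: int, d_s: float, r_p: float, t_pcost: float) -> float:
    """Speedup of the parallel part: N / (1 + d_s * R_p * T_pcost)."""
    return n / (1.0 + d_s * r_p * t_pcost)
```

With an assumed average protocol cost of 0.05 microseconds per shared access, d_s = 0.4 and R_p = 50 MIPS, the overhead term equals 1, so 16 nodes give a speedup of 8; with zero protocol cost the speedup is the ideal value N.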
4.2. Network and System Model
In this analysis, a high-speed, low-latency ATM network is assumed to be the underlying local computer
network. The queuing time on the network is assumed to be small enough to be neglected. (A future study is
examining the effects of queuing delays.) The memory access unit is assumed to be one word (or four bytes).
Each paragraph has G words. An application is executed by N nodes.
A typical ATM network consists of a set of nodes connected via a mesh of switches. In an ATM network,
data is segmented into small fixed-length cells, routed, then reassembled at the destination using header information
contained in the cells. Due to the efficient structure of ATM frames, the waiting time for accessing the
network can be designed to be very short. In this model, each network message with length L_msg takes
t_fix + n_cell t_var processing time at the transmitting and the receiving nodes, n_cell L_cell / R_n transmission time,
and n_cell t_net processing time through an ATM switch; where n_cell is the number of cells needed to transmit the
whole message, or the ceiling of L_msg / (L_cell - L_hd); L_cell and L_hd are the cell size and header length, respectively;
t_fix and t_var are the fixed and variable parts of the processing delays in the communicating nodes, respectively; R_n is
the network data rate; and t_net is the average network switch latency a cell goes through in a typical ATM network.
Note that the processing time at the nodes includes the time for copying data between host memory and network
buffer, network processor latency, interrupt handling on reception of frames, and segmentation/reassembly
times.
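The per-message cost of this model can be sketched as a short Python function. The function names are hypothetical; the default cell size of 53 bytes with a 5-byte header is the standard ATM cell format and is an assumption here, since the text does not state the values used. Lengths are in bytes, the data rate in bytes per microsecond, and delays in microseconds.

```python
import math


def cells_needed(l_msg: int, l_cell: int = 53, l_hd: int = 5) -> int:
    """Number of cells needed to carry a message of l_msg bytes."""
    return math.ceil(l_msg / (l_cell - l_hd))


def message_time(l_msg: int, r_n: float, t_fix: float, t_var: float,
                 t_net: float, l_cell: int = 53, l_hd: int = 5) -> float:
    """Total time for one network message under the model in the text."""
    n_cell = cells_needed(l_msg, l_cell, l_hd)
    processing = t_fix + n_cell * t_var       # at the communicating nodes
    transmission = n_cell * l_cell / r_n      # whole cells go on the wire
    switching = n_cell * t_net                # latency through the ATM switch
    return processing + transmission + switching
```

A 48-byte message fits in a single cell, while a 256-byte message needs six cells, because each cell carries only 48 bytes of payload.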
The protocol cost is analyzed based on the time it takes for handling different kinds of paragraph validation
faults. This analysis includes all but home-info faults, since they only occur when a page is accessed by a node
for the first time. The fault handling time is expressed in terms of the total time for handling network messages,
including required interrupt handling delays at the local and remote nodes.
The whole message for either a fault request or an invalidation request fits into a single ATM cell. A
data reply message has the size of a paragraph, which may need one, two, or more ATM cells depending
on the size of the paragraph. The costs for these two sizes of network messages, denoted as request
messages, msg-r, and data messages, msg-d, are

$$ t_{msg-r} = \frac{L_{cell}}{R_n} + t_{fix} + t_{var} + t_{net} \tag{3} $$

$$ t_{msg-d} = \left\lceil \frac{4G}{L_{cell} - L_{hd}} \right\rceil \frac{L_{cell}}{R_n} + t_{fix} + \left\lceil \frac{4G}{L_{cell} - L_{hd}} \right\rceil (t_{var} + t_{net}) \tag{4} $$

where $4G$ is the paragraph size in bytes (G words of four bytes each).
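As a rough illustration of the two message costs, the following Python sketch evaluates them for a given paragraph size. The parameter values are the typical values quoted later in Section 4.4 (53-byte cells, 5-byte headers, 150 Mbps, 10 and 20 microseconds of node processing, 10 microseconds of switch latency); the function and constant names are hypothetical and not part of the original model.

```python
# Minimal sketch of the request and data message costs (assumed names).
CELL_BYTES = 53      # L_cell
HEADER_BYTES = 5     # L_hd
RATE_BPS = 150e6     # R_n, network data rate in bits per second
T_FIX = 10e-6        # t_fix, fixed node processing time (s)
T_VAR = 20e-6        # t_var, per-cell node processing time (s)
T_NET = 10e-6        # t_net, per-cell switch latency (s)


def cells_needed(msg_bytes):
    """Number of cells: ceiling of L_msg / (L_cell - L_hd)."""
    payload = CELL_BYTES - HEADER_BYTES
    return -(-msg_bytes // payload)  # integer ceiling division


def message_time(n_cells):
    """Total time (s) for a message that occupies n_cells cells."""
    transmission = n_cells * CELL_BYTES * 8 / RATE_BPS
    return T_FIX + n_cells * (T_VAR + T_NET) + transmission


def request_time():
    """t_msg-r: a request always fits into one cell."""
    return message_time(1)


def data_time(paragraph_bytes):
    """t_msg-d: a data reply carries one whole paragraph."""
    return message_time(cells_needed(paragraph_bytes))


if __name__ == "__main__":
    print(f"request: {request_time() * 1e6:.2f} us")
    for size in (32, 64, 128, 256):
        print(f"data, {size:3d} bytes: {data_time(size) * 1e6:.2f} us")
```

The per-cell terms dominate under these values, so the cost of a data reply grows in steps with the number of cells rather than smoothly with the paragraph size.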
From the memory coherency protocol, one can count the number of network messages involved in each
kind of fault. This message count also depends on the home and owner node relationship, as well as the number
of nodes within the copyset (the list of nodes with read-only copies of a paragraph) when a fault occurs. After
examining the protocol, one concludes that the message costs are as follows: $t_{msg-r} + t_{msg-d}$ for case e1
and case nrd, $2t_{msg-r} + 2t_{msg-d}$ for case e2, $(2N_{set} + 1)t_{msg-r} + t_{msg-d}$ for case nwd, and
$2N_{set} t_{msg-r}$ for case nwa. Here, $N_{set}$ denotes the number of nodes within the copyset when a fault
occurs. Cases e1 and e2 represent the situation when a fault occurs while the copyset on the home node is empty.
The former is the case when the owner is the home, or when the requesting node is the home node. The latter is
the case when the owner is not the home and the requesting node is not the home node. Cases nrd, nwd, and nwa
represent the situations for a read fault, a write-data fault, and a write-access fault, respectively, when the
copyset on the home node is not empty.
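The per-case message costs listed above can be collected into a small lookup, sketched here in Python. The function name and the string labels for the cases are illustrative choices; the cost expressions are the ones stated in the text.

```python
def fault_cost(case, t_msg_r, t_msg_d, n_set=0):
    """Network message cost of handling one paragraph fault.

    case    -- one of "e1", "e2", "nrd", "nwd", "nwa"
    t_msg_r -- cost of one request message
    t_msg_d -- cost of one data message
    n_set   -- number of nodes in the copyset when the fault occurs
    """
    if case in ("e1", "nrd"):
        return t_msg_r + t_msg_d
    if case == "e2":
        return 2 * t_msg_r + 2 * t_msg_d
    if case == "nwd":
        return (2 * n_set + 1) * t_msg_r + t_msg_d
    if case == "nwa":
        return 2 * n_set * t_msg_r
    raise ValueError(f"unknown fault case: {case}")
```

For example, with three nodes in the copyset a write-access fault costs six request messages and no data message, since the faulting node already holds a valid copy.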
The average time spent for handling a paragraph fault depends on the probability of each of the above
cases as well as the probability of the number of nodes within the copyset when a fault occurs. These
probabilities are estimated by simple probability models in this work. When a fault occurs, each node has a
probability of 1/N of having accessed, and of (1 - 1/N) of not having accessed, this paragraph since the last time
the copyset was empty. Hence, the copyset is empty when a fault occurs if either no node or exactly one node
has accessed this paragraph. The number of nodes within the copyset is i when a fault occurs, with probability
denoted by $P\{N_{set} = i\}$, if exactly i+1 nodes have accessed the paragraph. Therefore, we have

$$ P\{N_{set} = 0\} = \binom{N}{0}\left(\frac{1}{N}\right)^{0}\left(1 - \frac{1}{N}\right)^{N} + \binom{N}{1}\left(\frac{1}{N}\right)^{1}\left(1 - \frac{1}{N}\right)^{N-1} \tag{5} $$

$$ P\{N_{set} = i\} = \binom{N}{i+1}\left(\frac{1}{N}\right)^{i+1}\left(1 - \frac{1}{N}\right)^{N-i-1} \quad \text{for } i = 1, 2, 3, \ldots, N-1 \tag{6} $$
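The copyset-size distribution is a binomial distribution with the first two outcomes merged. A minimal Python sketch, with a hypothetical function name, makes this concrete and lets one check that the probabilities sum to one.

```python
from math import comb


def copyset_size_probs(n_nodes):
    """Return [P{N_set = 0}, P{N_set = 1}, ..., P{N_set = N-1}].

    Each node has accessed the paragraph with probability 1/N.
    Zero or one accessing node gives an empty copyset; i+1 accessing
    nodes give a copyset of size i.
    """
    p = 1.0 / n_nodes
    q = 1.0 - p

    def accessed_by(k):
        # Binomial probability that exactly k of the N nodes accessed it.
        return comb(n_nodes, k) * p**k * q**(n_nodes - k)

    probs = [accessed_by(0) + accessed_by(1)]
    probs += [accessed_by(i + 1) for i in range(1, n_nodes)]
    return probs


if __name__ == "__main__":
    probs = copyset_size_probs(16)
    print(f"P(empty copyset) = {probs[0]:.4f}, total = {sum(probs):.6f}")
```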
In the DICE DSM, it is expected that a paragraph is accessed by its home node most frequently. Let $x_s$
denote the probability that a paragraph is accessed by its home node. Other nodes are assumed to exhibit a
paragraph access probability that is uniformly distributed among all the non-home nodes with a total
probability of $1 - x_s$. Note that $x_s$ reflects the processor locality of parallel program behavior as described in [8].
The probability of each case is estimated by finding the conditional probabilities of each case, when either
read or write access faults occur. The probability of a case nrd fault is 1. The rest of the probabilities are

$$ p_{e2} = \frac{(N - 2)(1 - x_s)}{2(N - 1)x_s + (N - 2)(1 - x_s)}, \qquad p_{e1} = 1 - p_{e2} \tag{7} $$

$$ p_{nwd}(N_{set}) = \frac{(N - 1 - N_{set})(1 - x_s)}{(N - 1)x_s + N_{set}(1 - x_s) + (N - 1 - N_{set})(1 - x_s)}, \qquad p_{nwa}(N_{set}) = 1 - p_{nwd}(N_{set}) \tag{8} $$

where $p_{e1}$, $p_{e2}$, $p_{nwd}$, and $p_{nwa}$ are the probabilities of case e1, case e2, case nwd, and case nwa, respectively.
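To see how the home-node locality $x_s$ shifts these case probabilities, Equations (7) and (8) can be evaluated directly. The sketch below uses hypothetical function names and assumes the equations read as given above.

```python
def p_e2(n_nodes, x_s):
    """Equation (7): probability of case e2 given an empty copyset."""
    non_home = (n_nodes - 2) * (1 - x_s)
    return non_home / (2 * (n_nodes - 1) * x_s + non_home)


def p_e1(n_nodes, x_s):
    return 1 - p_e2(n_nodes, x_s)


def p_nwd(n_nodes, x_s, n_set):
    """Equation (8): probability of case nwd for a copyset of size n_set."""
    outside = (n_nodes - 1 - n_set) * (1 - x_s)
    denom = (n_nodes - 1) * x_s + n_set * (1 - x_s) + outside
    return outside / denom


def p_nwa(n_nodes, x_s, n_set):
    return 1 - p_nwd(n_nodes, x_s, n_set)


if __name__ == "__main__":
    for x_s in (0.25, 0.5, 0.75):
        print(x_s, round(p_e2(16, x_s), 4), round(p_nwd(16, x_s, 3), 4))
```

Note that the denominator of Equation (8) simplifies to N - 1, so $p_{nwd}$ falls linearly as the copyset grows and reaches zero when every other node already holds a copy.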
The average times spent for handling a paragraph read fault and a paragraph write fault, denoted by
$t_{rf}$ and $t_{wf}$, can be obtained from Equations (3) through (8). After some simplification, we have

$$ t_{rf} = \left(1 + P\{N_{set} = 0\}\, p_{e2}\right)(t_{msg-r} + t_{msg-d}) \tag{9} $$

$$ t_{wf} = \left[ P\{N_{set} = 0\}(1 + p_{e2}) + \sum_{i=1}^{N-1} P\{N_{set} = i\}\, p_{nwd}(i) \right](t_{msg-r} + t_{msg-d}) + \left[ \sum_{i=1}^{N-1} i\, P\{N_{set} = i\} \right](2 t_{msg-r}) \tag{10} $$
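Equations (9) and (10) can be evaluated numerically once the message costs are known. The following self-contained Python sketch does so; the function name is hypothetical, and the message costs are passed in as plain numbers so that any units may be used.

```python
from math import comb


def average_fault_times(n_nodes, x_s, t_msg_r, t_msg_d):
    """Return (t_rf, t_wf) following Equations (5) through (10)."""
    p = 1.0 / n_nodes
    q = 1.0 - p

    def accessed_by(k):
        return comb(n_nodes, k) * p**k * q**(n_nodes - k)

    # Equations (5) and (6): copyset size distribution.
    p_set = [accessed_by(0) + accessed_by(1)]
    p_set += [accessed_by(i + 1) for i in range(1, n_nodes)]

    # Equation (7).
    non_home = (n_nodes - 2) * (1 - x_s)
    p_e2 = non_home / (2 * (n_nodes - 1) * x_s + non_home)

    # Equation (8).
    def p_nwd(i):
        outside = (n_nodes - 1 - i) * (1 - x_s)
        return outside / ((n_nodes - 1) * x_s + i * (1 - x_s) + outside)

    round_trip = t_msg_r + t_msg_d

    # Equation (9): average read fault time.
    t_rf = (1 + p_set[0] * p_e2) * round_trip

    # Equation (10): average write fault time.
    fetch = p_set[0] * (1 + p_e2)
    fetch += sum(p_set[i] * p_nwd(i) for i in range(1, n_nodes))
    invalidations = sum(i * p_set[i] for i in range(1, n_nodes))
    t_wf = fetch * round_trip + invalidations * 2 * t_msg_r

    return t_rf, t_wf
```

The second term of the write fault time counts two request messages (an invalidation and its acknowledgement) for each node in the copyset, averaged over the copyset size.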
4.3. Application Behavior Model and Average Protocol Cost
Torrellas et al. [19] proposed a model of sharing, which is classified into true sharing and false sharing.
Based on this sharing model, we divide the average protocol cost, Tpcost , into two parts: one part is caused by
true sharing misses, the other part is caused by false sharing misses. A miss is a true sharing miss, when a
processor or node misses, because the word was previously used by another node. A false sharing miss is
caused by multiple processors or nodes accessing different words within the same paragraph.
In this analysis, we first consider the application behavior independent of system architecture. The sharing
misses are based on an access unit (word), in the same way as in the work by Eggers and Katz [7], instead
of a coherency unit (paragraph). Then, we integrate it with the effects of using a paragraph size consisting of
multiple words.
True sharing misses vary significantly among parallel applications, since they inherently depend
on the program behavior. True sharing misses are expected to increase as the number of processors or
nodes increases, since the frequencies and degrees of sharing increase. Hence, we use a simple linear
relationship to model this behavior. Let $f_r$ and $f_w$ denote the average rate of read faults per shared read
data memory access and the average rate of write faults per shared write data memory access, respectively.
Then, we have

$$ f_r = f_{r0} + f_{rx} N \quad \text{and} \quad f_w = f_{w0} + f_{wx} N \tag{11} $$

where $f_{r0}$ and $f_{w0}$ are the base points of $f_r$ and $f_w$, respectively; $f_{rx}$ and $f_{wx}$ are the
incremental rates of $f_r$ and $f_w$, respectively, when the number of nodes is changed. Note that $f_r$ and $f_w$
reflect the temporal locality of parallel program behavior.
When paragraphs larger than a single word are taken into account, the true sharing misses are expected to
drop as the paragraph size increases. This is due to the spatial locality of parallel program behavior, with
neighboring data having been prefetched before being used. Note that we consider only the sharing misses
caused by the coherency protocol, and ignore those caused by insufficient physical memory to allocate space.
We use the ratio of miss ratios, proposed by Smith in [16], to model the effects of this behavior. Let $m_{r1}$ and
$m_{rr}$ denote the ratio of miss ratios when a paragraph size is G to that when a paragraph size is one word, and
when a paragraph size is G to that when a paragraph size is G/2, respectively. Then, we have

$$ m_{r1} = m_{rr}^{\log_2 G} \tag{12} $$
Several research results [2,6,18] indicate that false sharing increases when either the number of
processors or the coherency unit size is increased. Hence, we also use a simple linear relationship to model this
behavior. Let $e_{fs}$ denote the probability of false sharing misses. Then, we have

$$ e_{fs} = e_{fs0} + e_{fsx} N + e_{fsy} G \tag{13} $$

where $e_{fs0}$ is the base point of $e_{fs}$; $e_{fsx}$ and $e_{fsy}$ are the incremental rates of $e_{fs}$ when N
and G are changed, respectively.
Combining Equations (9) through (13), one can derive the average protocol cost $T_{pcost}$ as

$$ T_{pcost} = m_{r1}\left[(1 - w) f_r t_{rf} + w f_w t_{wf}\right] + e_{fs}\left[(1 - w) t_{rf} + w t_{wf}\right] \tag{14} $$

where w denotes the average rate of write operations per shared data memory access. In the above equation, the
two terms on the right side represent the protocol costs caused by true sharing misses and false sharing misses,
respectively.
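The application behavior model of Equations (11) through (14) is simple enough to evaluate directly. The Python sketch below is illustrative only: the function names are hypothetical, the default parameter values are the typical values quoted in Section 4.4, and G is taken in words as in the model definition (the figures plot G in bytes).

```python
from math import log2


def miss_ratio_scale(g_words, m_rr):
    """Equation (12): m_r1 = m_rr ** log2(G)."""
    return m_rr ** log2(g_words)


def protocol_cost(n_nodes, g_words, t_rf, t_wf, w=0.1,
                  f_r0=0.001, f_rx=0.001, f_w0=0.001, f_wx=0.001,
                  m_rr=0.6, e_fs0=1e-4, e_fsx=1e-5, e_fsy=1e-5):
    """Equation (14): average protocol cost per shared data access."""
    # Equation (11): true sharing fault rates grow linearly with N.
    f_r = f_r0 + f_rx * n_nodes
    f_w = f_w0 + f_wx * n_nodes
    # Equation (12): larger paragraphs reduce true sharing misses.
    m_r1 = miss_ratio_scale(g_words, m_rr)
    # Equation (13): false sharing grows linearly with N and G.
    e_fs = e_fs0 + e_fsx * n_nodes + e_fsy * g_words

    true_sharing = m_r1 * ((1 - w) * f_r * t_rf + w * f_w * t_wf)
    false_sharing = e_fs * ((1 - w) * t_rf + w * t_wf)
    return true_sharing + false_sharing
```

The two terms pull in opposite directions as G grows: the true sharing term shrinks geometrically through $m_{r1}$, while the false sharing term and the fault handling times grow, which is what produces an optimum paragraph size.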
4.4. Analytical Results
This section shows the effects of changing system structure and application profile on the speedup, $S_p$. A
typical value for each parameter is chosen to reflect a typical system architecture and a target parallel
application profile. We analyzed the effects on $S_p$ by changing only one or two parameters at a time and fixing
the other parameters to their typical values.
For program behavior parameters, the typical degree of sharing and access pattern are chosen to be 0.1 for
both $d_s$ and w. The typical fault rates are chosen to be 0.001 for both $f_{r0}$ and $f_{w0}$, and 0.001 for both
$f_{rx}$ and $f_{wx}$. The typical locality factors are chosen to be 0.6 and 0.5 for $m_{rr}$ and $x_s$. Typical
false sharing factors are chosen to be 0.0001 for $e_{fs0}$ and 0.00001 for both $e_{fsx}$ and $e_{fsy}$. These
typical values are intended to represent suitable network-based DSM applications and to reflect the significant
effects of localities as well as false sharing.
For system parameters, the lengths for an ATM cell and header are fixed to 53 and 5 bytes, respectively.
Other parameters are varied to reflect the changes in system technology and architecture. The typical system
is chosen to have 16 nodes and 100 MIPS. The typical network data rate is chosen to be 150 Mbps. The typical
ATM processing time is chosen to be 10 and 20 microseconds for $t_{fix}$ and $t_{var}$, respectively. This is derived
from the actual measurements of an ATM host-network interface in [4]. The switch latency $t_{net}$ is chosen to be
10 microseconds, which corresponds to the store-and-forward delay of a single switch in an ATM LAN.
Figures 4 through 13 show the expected behavior of $S_p$ when the size of a paragraph, G, is changed. This
behavior indicates that $S_p$ increases as the paragraph size G increases up to a certain point. After that
point, $S_p$ starts decreasing as G increases further. The peak value of $S_p$ occurs when the paragraph size G is
between 32 and 256 bytes. This is less than the page size used in a common operating system. This behavior
demonstrates the advantage of using a two-tier paging scheme. Note that the fixed small cell size (53 bytes)
used in ATM networks leads to the abnormal dent at a granularity of 64 bytes shown in Figures 5 through 9 and
11.
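One plausible way to see where the dent at 64 bytes comes from is to count the cells each paragraph size needs. With a 53-byte cell and a 5-byte header, each cell carries 48 bytes of payload, so power-of-two paragraph sizes do not pack evenly. The short Python sketch below (hypothetical function name) tabulates this.

```python
def cells_for_paragraph(paragraph_bytes, cell_bytes=53, header_bytes=5):
    """Cells needed to carry one paragraph (ceiling division by payload)."""
    payload = cell_bytes - header_bytes
    return -(-paragraph_bytes // payload)


if __name__ == "__main__":
    for size in (16, 32, 64, 128, 256):
        cells = cells_for_paragraph(size)
        used = size / (cells * 48)  # fraction of carried payload that is data
        print(f"{size:4d} bytes -> {cells} cell(s), payload use {used:.0%}")
```

Going from 32 to 64 bytes doubles the cell count from one to two, whereas going from 64 to 128 bytes adds only one more cell, so the per-cell costs rise disproportionately at 64 bytes.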
Figures 4 and 5 show that $S_p$ decreases as the average rate of shared data memory accesses per instruction,
$d_s$, and the average rate of write operations per shared data memory access, w, increase, respectively. Figures 6
and 7 show that $S_p$ decreases as the fault rate parameters (i.e. $f_{r0}$, $f_{rx}$, $f_{w0}$, and $f_{wx}$) increase.
Figure 8 shows that $S_p$ decreases as the ratio of miss ratios $m_{rr}$ increases. Figure 9 shows that $S_p$
decreases as the false sharing parameters (i.e. $e_{fs0}$, $e_{fsx}$, and $e_{fsy}$) increase. Figure 10 shows that
$S_p$ increases as the probability of a paragraph being accessed by its home node, $x_s$, increases.

[Figure 4. Sp vs G for different ds. Horizontal axis: G (bytes/para).]
[Figure 5. Sp vs G for different w. Horizontal axis: G (bytes/para).]
[Figure 6. Sp vs G for different fr0 and frx. Horizontal axis: G (bytes/para).]
[Figure 7. Sp vs G for different fw0 and fwx. Horizontal axis: G (bytes/para).]
[Figure 8. Sp vs G for different mrr. Horizontal axis: G (bytes/para).]
[Figure 9. Sp vs G for different efs0, efsx, and efsy. Horizontal axis: G (bytes/para).]
[Figure 10. Sp vs G for different xs. Horizontal axis: G (bytes/para).]
[Figure 11. Sp vs G for different Rp. Horizontal axis: G (bytes/para).]
[Figure 12. Sp vs G for different Rn. Horizontal axis: G (bytes/para).]
[Figure 13. Sp vs G for different Tfix, Tvar, and Tsw. Horizontal axis: G (bytes).]
[Figure 14. Sp vs N for different G. Horizontal axis: N (# of nodes).]
[Figure 15. Sp/N vs N for different G. Horizontal axis: N (# of nodes).]
Figure 11 shows that $S_p$ decreases as the processor speed $R_p$ increases, as the benefits of parallel processing
diminish due to the increase in the ratio of network overhead to execution time on each node. Note that the total
execution time for an application will still drop as $R_p$ increases, although $S_p$ decreases. This confirms an
important, expected fact: as processor speeds increase, network overhead must be reduced in order to
achieve the same level of speedup.
Figure 12 shows that $S_p$ increases as the network data rate $R_n$ increases, and that the margin of gain in
$S_p$ becomes smaller and smaller as $R_n$ increases. Figure 13 shows that $S_p$ decreases as
the ATM processing and switching times (i.e. $t_{fix}$, $t_{var}$, and $t_{net}$) increase.
Figures 14 and 15 demonstrate the relationship of $S_p$ and $S_p/N$, respectively, with the number of nodes for
different paragraph sizes. $S_p$ increases as N increases, and the margin of gain in $S_p$ becomes smaller when
N is large.
5. Conclusions
In this paper, we present the design of a two-tier paging system for distributed shared memory,
where a paragraph, a much smaller memory unit than a page, is employed as the unit of coherency. The
system is modeled and the analysis demonstrates the benefits of the multiple granularity memory manage-
ment. The problem of false-sharing is alleviated, especially for systems with large page size and large objects.
The network latency for coherence maintenance is significantly reduced, since only a small amount of data has
to be transferred across the network for each remote memory access fault. Furthermore, the overhead of the
coherency protocol processing is reduced by introducing hardware support.
The proposed two-tier paging scheme is different from the two-level paging method used in a uniprocessor
system. The latter bears two levels of address translations. In our two-tier paging design, the page is the
only address translation unit and the paragraph is the validation unit. There is no address translation for
paragraphs.
The concept of using a second-tier page, namely a paragraph, is different from that of using a cache line.
The size of a paragraph is normally larger than a cache line. Although the paragraph coherency protocol and
algorithm are similar to those used in cache-based DSM multiprocessor systems, the design and implementation
considerations are quite different. In a network-based distributed shared memory system, communication
latency is significantly higher than that seen in a multiprocessor distributed shared memory system such as
DASH [12]. Network-based DSMs are implemented in software with hardware support, while multiprocessor-based
DSMs are implemented in hardware. Therefore, the allocation of and access to the coherency directories
are quite different.
The use of paragraphs, as opposed to using a small page size, reduces the complexity of the shared memory
management. If a small page size is used, very large page map tables will be required. By preserving the large
page size and using paragraphs only for shared pages, the page map tables stay small and additional paragraph
map tables are needed for shared pages only. In addition to using the home node scheme, we have distributed
the management of paragraphs to the home nodes of the pages only. Hence, the root node acts as the clearing
house for all application pages, and the home nodes act as the clearing houses for the paragraphs in their
respective pages.
A trace-driven simulation model that takes into consideration network queuing delays is under development.
This simulation model will be used to validate the analytical model described in section 4. The simulation
model is built with BONES DESIGNER [5], and the traces are generated by Tango Lite [10] when running
the parallel applications of Stanford SPLASH [17].
The current DICE DSM design is based on a strict consistency model and a write-invalidate coherency
protocol. Extensions using multiple consistency and coherency protocols are under consideration. In a future
version of DICE we plan to incorporate support for a relaxed consistency model to hide the large latency of
remote memory accesses by allowing buffering and merging.
Acknowledgements
The referees provided valuable comments on the contents of this work and the presentation of this paper.
References
[1] H. S. AlKhatib, Q. Li, C. Jou, T. Chen, and H. Arafeh, "DICE - A Distributed Integrated Computing
Environment for Multi-Threaded Parallel Processing," Proceedings of the Third International Conference
on System Integration, Vol. 1, August 1994, pp. 612-621.
[2] W. J. Bolosky, and M. L. Scott, "False Sharing and its Effect on Shared Memory Performance,"
Proceedings of the USENIX Symposium on Experiences with Distributed and Multiprocessor Systems
(SEDMS IV), September 1993, pp. 57-72.
[3] J. B. Carter, J. K. Bennett, and W. Zwaenepoel, "Implementation and Performance of Munin," The
13th ACM Symposium on Operating Systems Principles, October 1990, pp. 152-164.
[4] A.T. Chandramohan, and H. M. Levy, "Limits to Low-Latency Communication on High-Speed Net-
works", ACM Transactions on Computer Systems, Vol. 11, No. 2, pp. 179-203, 1993.
[5] Comdisco Systems, Inc. BONES DESIGNER User's Guides. Comdisco Systems, Inc., 1993
[6] S. J. Eggers, and T. E. Jeremiassen, "Eliminating False Sharing," Proceedings of the 1991 Internation-
al Conference on Parallel Processing, I- Architecture, August 1991, pp. 377-381.
[7] S. J. Eggers, and R. H. Katz, "A Characterization of Sharing in Parallel Programs and its Applicability
to Coherency Protocol Evaluation", Proceedings of the 15th International Symposium on Computer
Architecture, May 1988, pp. 373-382.
[8] S. Eggers, and R. Katz, "The Effect of Sharing on the Cache and Bus Performance of Parallel Pro-
grams," Proceedings of the Third ASPLOS, April 1989, pp. 257-270.
[9] B. D. Fleisch, G. J. Popek, "Mirage: A Coherent Distributed Shared Memory Design," Proceedings of
the 12th ACM Symposium on Operating System Principles, December 1989, pp. 211-222.
[10] S. R. Goldschmidt, Simulation of Multiprocessors: Accuracy and Performance, Ph.D. Thesis,
Stanford, 1993.
[11] C. Jou, H. S. AlKhatib, Q. Li, and A. T. Chen, "Coherency Protocol and Algorithm of The DICE
Distributed Shared Memory System," Proceedings of the Seventh International Conference on Parallel
and Distributed Computing Systems, October 1994, pp. 796-801.
[12] D. Lenoski, J. Laudon, K. Gharachorloo, W. Weber, A. Gupta, J. Hennessy, M. Horowitz, and
M. S. Lam, "The Stanford Dash Multiprocessor," IEEE Computer, Vol. 25, No. 3, March 1992, pp.
63-79.
[13] K. Li, "IVY: A Shared Virtual Memory System for Parallel Computing," In Proceedings of the
1988 International Conference on Parallel Processing, pp. 94-101, August 1988.
[14] R. G. Minnich, Mether: A Memory System For Network Multiprocessors, Ph.D. Thesis, University
of Pennsylvania, 1991.
[15] U. Ramachandran, M. Ahamad, and M. Khalida, "Unifying Synchronization and Data Transfer in
Maintaining Coherence of Distributed Shared Memory," Proceedings of the 1989 International Con-
ference on Parallel Processing, pp. 160-169, August 1989.
[16] A. J. Smith, "Line (Block) Size Choice for CPU Cache Memories," IEEE Transactions on Comput-
ers, Vol. C-36, No. 9, September 1987, pp. 1063 - 1075.
[17] J. P. Singh, W.-D. Weber, and A. Gupta, SPLASH: Stanford Parallel Applications for Shared-Memory.
Technical Report CSL-TR-91-469, Stanford University Computer Systems Lab, April 1991.
[18] M. Tam, J. M. Smith, and D. J. Farber, "A Taxonomy-Based Comparison of Several Distributed
Shared Memory Systems," ACM Operating System Review, Vol. 24, No. 3, July 1990, pp. 40-67.
[19] J. Torrellas, M. S. Lam, and J. L. Hennessy, "Shared Data Placement Optimizations to Reduce
Multiprocessor Cache Miss Rates," Proceedings of the 1990 International Conference on Parallel
Processing, II - Software, August 1990, pp. 266-270.
Volume I
SÃO PAULO, SP, BRAZIL, AUGUST 15-19, 1994
EDITED BY:
PETER A. NG
New Jersey Institute of Technology
FUAD GATTAZ SOBRINHO
Instituto Internacional de Integração
de Sistemas
C. V. RAMAMOORTHY
University of California, Berkeley
RAYMOND T. YEH
International Software Systems, Inc.
LAURENCE C. SEIFERT
Global Manufacturing and
Engineering, AT&T
DICE - a Distributed Integrated Computing
Environment for Multi-Threaded Parallel Processing*
Hasan S. AlKhatib, Qiang Li, Chi-Jiunn Jou, Tiekun Chen, and Hassan Arafeh
Department of Computer Engineering, Santa Clara University
Santa Clara, CA 95053
Abstract - Often, the computing power of networks of
workstations is left unused. The objective of this project is to
develop a set of tools to take advantage of this potential com-
puting power and to create a platform suitable for large scien-
tific computations. This paper presents the architecture of a
Distributed Integrated Computing Environment (DICE)
consisting of a cluster of networked workstations. DICE
consists of three interactive subsystems. DSM (distributed
shared memory) provides the underlying communication
and computing paradigm for threads of a parallel task to
execute on a cluster of cooperating workstations. DRS
(distributed run-time subsystem) provides users with
programming tools to develop and execute DICE
multi-threaded applications. PS (parallel scheduler) is a
self-optimizing, application-specific scheduler, and is
responsible for thread scheduling and synchronization.
1. Introduction
The majority of scientific applications require a fairly
large amount of memory to execute a task. If a task is to
be partitioned into threads (sub-tasks) that are executed
in parallel, memory sharing is very desirable, since it
allows sharing variables among threads within the same
task. Also, software based on shared memory is more
portable and machine independent than software based
on distributed memory, which is architecture dependent.
For these reasons, the shared memory multiprocessor
system has become increasingly popular for executing
large scientific applications.
On the other hand, there is a tremendous amount of
computing power that is left unused in networks of
workstations. Very often a workstation is simply sitting idle on
a desk. A set of tools can be developed to take advantage
of this potential computing power to create a platform
suitable for large scientific computations. The integration
of several workstations into a logical cluster of
distributed, cooperative computing stations presents an
alternative solution to shared memory multiprocessor
systems.

*This work was supported by NASA-Ames Research Center grant
number NCC 2-644 entitled "Parallel Processing for Scientific
Computations".
DICE (Distributed Integrated Computing Environment)
is designed to meet these objectives. DICE employs
virtual memory supported distributed shared
memory (DSM) as its underlying computing and
communication paradigm. It integrates DSM with a parallel
scheduling subsystem as well as a parallel programming subsystem.
In Figure 1, a distributed task '1' is running on four work-
stations, while a distributed task '2' is running on three
workstations. These distributed tasks are independent
of each other, and a workstation may have threads of two
or more tasks running on it, concurrently.
This paper presents the DICE architecture in the
following sections. Section 2 identifies the related work in
this area. Section 3 describes the system architecture of
DICE. It consists of three subsystems, which are
described in sections 4 to 6, respectively. The interaction
among these subsystems is delineated in section 7. The
expected system performance is shown in section 8.
Finally, section 9 gives a summary of the results
accomplished with this work.
Figure 1. Clusters of Cooperating Workstations
0-8186-5502-X/94 $04.00 © 1994 IEEE
2. Related Work
There are several systems designed to utilize the pro-
cessor power of idle workstations. These systems include
Sprite [24], V system [33], NEST [1], Butler [23], Condor
[20], REM [30], Stealth [17], and Sidle [16]. These sys-
tems provide remote execution or process migration faci-
lities. In addition to these features, DICE provides the
distributed shared memory (DSM) paradigm while using
these idle workstations.
A number of DSM systems over LANs have been
developed recently [31]. Among them, Ivy [18,19] is
implemented on a network of Apollo workstations. The
memory is paged, and copies of pages may be replicated
in different hosts. Strict coherency semantics are used,
and the memory coherency is maintained by a
write-invalidate protocol with dynamic ownership. The owner of
a page is located via either a centralized manager, fixed
distributed managers, or an individual host which
forwards the request. Ivy is used for applications employing
a multi-threaded task. All threads share the same
virtual address space. False sharing may occur in this system,
since its consistency or access unit (e.g., a word) is smaller than
the sharing unit (a page). In addition, the single-writer
nature of its protocol may cause a "ping-pong" behavior
between multiple writers of a shared page, also known as
the thrashing problem.
To overcome false sharing and thrashing, some
systems employ special schemes. Mach [14] supports
DSM with a shared memory server. False sharing and
thrashing are handled by fault scheduling via a queuing
mechanism [13]. Clouds [27,2] is an object-oriented
distributed operating system where objects can migrate
across processors. False sharing and thrashing are
avoided, since Clouds uses single-writer-single-reader
strict coherence semantics.
Mirage [12] is a DSM system implemented in the
kernel of the Locus distributed system [34]. Thrashing is
reduced by using a time window scheme, in which the
system guarantees that the writer of a page retains access to
the page for a fixed period of time. Munin [6,7] is a DSM
system implemented on top of the V kernel [9],
which allows programmers to associate types with shared
data. Hence, multiple consistency protocols can be used.
A delayed write-update scheme is used for a read-mostly
protocol. Hence, thrashing can be reduced by using
different combinations of data types.
Mether [21,22] is a software DSM implemented on
SunOS 4.0. It allows a process to access memory as either
consistent or inconsistent, and allows only a subset of a page to
be transferred. It also provides both demand-driven and
data-driven semantics for updating pages. All of these
operations are encoded in a few address bits in the
virtual address. False sharing and thrashing are reduced
through the use of the incoherent memory.
DICE presents a novel approach to handle the
problem of false sharing and thrashing. The shared memory
is structured as a two-tier paging system. The first tier is
a page, which is the common page used in an operating
system. The second tier is a paragraph, which is a smaller
fixed-sized block of information contained within a page.
The introduction of a small paragraph size improves
system performance, since it reduces the chance of false
sharing and the amount of data needed to be transferred
over the network.
The distributed run-time subsystem, DRS, is another part of
DICE. A survey of object-oriented languages for
parallel environments is presented in [36]. Other
programming languages and systems developed for distributed
systems are presented in [4]. Amber [8] and Orca [5,32]
are two such systems.
The parallel scheduler, PS, is the third part of DICE.
Several approaches have been taken by researchers working on the
problem of parallel scheduling. They range from
centralized control, where global knowledge of the system is
maintained in one place [25,26], to distributed control,
where all nodes have equal knowledge of the system.
Methods used vary from Bayesian decision theory [28] to
data flow graphs [10].
The parallel scheduler in DICE is an extension to our
prior work done in MOPPS [3]. MOPPS is a self-tuning
parallel scheduler. It partitions the given application
into small tasks, schedules and coordinates these tasks
among network resources, and maintains a balanced load
between workstations without overburdening the com-
munication network.
3. System Structure
DICE is an experimental system which aims at
providing a computing environment for the execution of
multi-threaded tasks. Figure 2 illustrates the system
structure of DICE. A parallel task may consist of
multiple threads that can be scheduled to run on a cluster of
workstations simultaneously. A thread is an active entity
that provides the notion of a computation. Threads on
separate workstations also share the same virtual
address space, and communicate with each other using
shared memory. Synchronization of threads to access
shared resources is done using functions provided by the
distributed run-time library.

[Figure 2. System Architecture of DICE. Labels: workstation 1, workstation 2, ..., workstation n; PS, DSM, OS.]
4. Distributed Shared Memory
The DICE DSM system consists of a cluster of
workstations connected by a high-speed and low-latency local
area network. In addition to a host processor and memory,
each node also has a network processor and a Distributed
Shared Memory Management Unit (DSMMU). The DSMMU is
an extension of the traditional MMU that allows DSM to
handle shared memory efficiently. When data is not
available locally and needs to be fetched from a remote
host, the DSMMU will trigger special access faults.
Otherwise, the DSMMU just performs the traditional TLB
operations. An example of the architecture of a single host
system is shown in Figure 3. Note that this example uses
dual-ported memory, which allows both the host processor
and the network processor to access the data structures for
managing shared memory.
Each page of DICE DSM is the same as the common
page in a typical operating system, such as SunOS.
Each page is further divided into several small
equal-sized paragraphs. Paragraphs are used as the unit for
coherency. Pages are used as the unit for sharing. Memory
is allocated in a segment which may contain one or more
pages. Figure 4 illustrates the hybrid nature of this
memory structure.

[Figure 3. The system structure of a host system. Labels: host processor, host memory, dual-ported memory, LAN. Legend: VA: virtual address; PA: physical address; I/D: instruction & data.]
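Because paragraphs are equal-sized subdivisions of a page, locating the paragraph that covers an address is plain integer arithmetic. The Python sketch below is illustrative only: the 4096-byte page and 64-byte paragraph are assumed values, not sizes fixed by the DICE design, and the function name is hypothetical.

```python
def locate(address, page_bytes=4096, paragraph_bytes=64):
    """Return (page number, paragraph index within the page).

    The page number is what the address translation works with; the
    paragraph index only selects which coherency unit to validate.
    """
    if page_bytes % paragraph_bytes != 0:
        raise ValueError("a page must hold a whole number of paragraphs")
    page, offset = divmod(address, page_bytes)
    return page, offset // paragraph_bytes


if __name__ == "__main__":
    print(locate(4096 + 130))
```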
Coherency Protocol
DICE mainly provides the computing environment
for the execution of multi-threaded tasks. A parallel
task consists of multiple threads that are scheduled to
run on a cluster of workstations, simultaneously. The
shared data of memory pages are also distributed and
replicated among these hosts. The DSM system sup-
ports the sharing of those pages, and maintains the co-
herency among replicated data copies. Each running
application has a root host, on which it was loaded and
executed. The root host maintains the state information
for all shared pages used in the application, while other
hosts maintain the state information for only the shared
pages that are used in their local systems.
DICE is a home-based virtual DSM system, in which
each shared page has a home host. A home host main-
tains the state information for its pages, ensures that the
last copy of a page is not purged, and keeps track of all
copies of the paragraphs within its pages. Other hosts
only keep the information about the locations of the
home host. A remote request for handling memory ac-
cess faults is always sent to the home host of the target
page. When a host tries to access a page whose home host
it does not know, a home-info fault will be triggered and a
home-info request will be sent to the root host. If the home
host is not yet assigned, the root host will assign the first
requesting host to be the home host of that page, update its
own database, and send back a reply confirming this
assignment. Otherwise, the root host simply sends back a
reply giving the information of the home host for that page.
Figure 4. Segments, pages, and paragraphs in the DICE DSM
structure (segments in the shared virtual address space,
pages in a segment, paragraphs in a page).
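The home-info exchange at the root host can be sketched as a small lookup
table. This is only an illustration under assumed names (RootDirectory and
handle_home_info are hypothetical); it models the assignment rule stated
above, not the actual message format.

```python
class RootDirectory:
    """Root-host table mapping each shared page to its home host."""

    def __init__(self):
        self.home_of = {}  # page number -> home host identifier

    def handle_home_info(self, page, requester):
        """Reply to a home-info request with the home host of `page`."""
        if page not in self.home_of:
            # The first requesting host becomes the home host of the page.
            self.home_of[page] = requester
        return self.home_of[page]
```

A second host asking about the same page simply receives the identity of
the host that was assigned first.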
The memory coherency of DICE DSM is maintained
at the paragraph level. Each paragraph has an owner host,
which holds the ownership of this paragraph. An owner
host always has an up-to-date copy of its paragraph, and
is the only host permitted to write to the paragraph.
The ownership of a paragraph may be transferred from
one host to another according to the coherency protocol.
Information about the current owner of a paragraph is
maintained at the home host of the page containing the
paragraph.
A paragraph can simultaneously be read by multiple
hosts, but it can only be written by its current owner host.
The access right of a paragraph for a particular host may
be either read-write, read-only, or none. A host can ac-
quire or upgrade its access rights by sending requests to
the home host of the page in which the desired paragraph
resides.
A host can immediately perform read and write operations
on a paragraph if it has read-write access for that
paragraph, or perform read operations on a paragraph if
it has read-only access for that paragraph. When a read
operation is issued to a paragraph whose access right is
none, a read-data fault will be triggered and a read-data
request will be sent to its home host. When a write
operation is issued to a paragraph whose access right is
none, a write-data fault will be triggered and a write-data
request will be sent to its home host. When a write
operation is issued to a paragraph with read-only access, a
write-access fault will be triggered and a write-access
request will be sent to its home host.
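The fault rules above amount to a small decision table. A minimal sketch,
assuming access rights and operations are represented as plain strings (the
function name classify_access is hypothetical):

```python
def classify_access(right, op):
    """Return the fault raised by `op` under the local access `right`, or None."""
    if op not in ("read", "write"):
        raise ValueError(f"unknown operation: {op}")
    if right == "read-write":
        return None  # reads and writes proceed locally
    if right == "read-only":
        return None if op == "read" else "write-access"
    if right == "none":
        return "read-data" if op == "read" else "write-data"
    raise ValueError(f"unknown access right: {right}")
```

Each non-None result names both the fault and the request that would be
sent to the home host of the page.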
When a page is initialized, the home host is the default
owner host for all paragraphs within this page. Any other
host will send a remote request to the home host when it
tries to access any paragraph of this page. If a read-data
request is received and the home host is itself the owner
host of the desired paragraph, it returns a reply containing
the most recent copy of that paragraph, and the access
rights of both the home and requesting hosts are changed to
read-only. If it is not the owner host, the home host will
forward this read-data request to the owner host of that
paragraph. The latter changes its access right to read-only,
and then directly sends to the requesting host a reply which
contains the most recent copy of that paragraph. After it
receives the reply, the requesting host changes its access
right to read-only and sends to the home host an
acknowledgement with the received reply. Having received
this acknowledgement with reply, the home host also changes
its access right to read-only and becomes the owner host of
that paragraph. If the home host is itself the requesting
host, it will directly send the read-data request to the
owner host. The latter changes its access right to
read-only, and then sends back a reply which contains the
most recent copy of that paragraph. Having received this
reply, the home host changes its access right to read-only
and becomes the owner host of that paragraph.
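In all three read-data cases above the end state is the same: the previous
owner, the requesting host, and the home host hold read-only copies, and the
home host is the owner. The sketch below records only that end state for one
paragraph; the directory layout (a dict with "owner" and "rights") and the
function name are assumptions, and the message exchange itself is not
modeled.

```python
def handle_read_data(entry, home, requester):
    """Apply the read-data transitions to one paragraph's directory entry.

    `entry` is {"owner": host, "rights": {host: right}}. Returns the host
    that supplies the most recent copy of the paragraph.
    """
    supplier = entry["owner"]
    rights = entry["rights"]
    rights[supplier] = "read-only"   # the owner downgrades before replying
    rights[requester] = "read-only"  # the requester receives a readable copy
    rights[home] = "read-only"       # the home host keeps a valid copy
    entry["owner"] = home            # ownership rests with the home host
    return supplier
```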
If a write-data request is received and the home host is
itself the owner host, it returns a reply containing the
most recent copy of the desired paragraph when no other
host has a valid copy of that paragraph. If multiple valid
copies exist, the home host will send invalidate requests to
all hosts on which those copies are located, and wait for
confirmations from all of them before returning the reply.
Upon receiving the invalidate request, each host changes its
access right of that paragraph to none and returns its
confirmation to the home host. The access right of the home
host is then changed to none, while the requesting host
becomes the owner host and its access right is changed to
read-write. If it is not the owner host, the home host will
forward this write-data request to the owner host of that
paragraph. The latter changes its access right to none, and
then directly sends to the requesting host a reply which
contains the most recent copy of that paragraph. After it
receives the reply, the requesting host changes its access
right to read-write and sends an acknowledgement to the home
host. Having received this acknowledgement, the home host
updates its database to indicate that the requesting host
has become the owner host of that paragraph. If the home
host is itself the requesting host, it will directly send
the write-data request to the owner host. The latter changes
its access right to none, and then sends back a reply which
contains the most recent copy of that paragraph. Having
received this reply, the home host changes its access right
to read-write and becomes the owner host of that paragraph.
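The write-data cases above likewise share one end state: every other valid
copy is invalidated and the requesting host becomes the owner with
read-write access. A minimal sketch of that state change, using the same
assumed directory layout as a plain dict (the names are hypothetical and
the invalidate and confirmation messages are not modeled):

```python
def handle_write_data(entry, requester):
    """Apply the write-data transitions to one paragraph's directory entry.

    `entry` is {"owner": host, "rights": {host: right}}. Returns the sorted
    list of hosts whose copies were invalidated.
    """
    rights = entry["rights"]
    invalidated = sorted(
        host for host, right in rights.items()
        if right != "none" and host != requester
    )
    for host in invalidated:
        rights[host] = "none"        # each holder drops its copy
    rights[requester] = "read-write"
    entry["owner"] = requester       # ownership moves to the writer
    return invalidated
```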
If a write-access request is received, the home host will
return the write-access confirmation when no other host has
a valid copy of that paragraph. If three or more valid
copies exist, the home host will send invalidate requests to
all hosts (except itself and the requesting host) on which
those copies are located and wait for confirmations from all
of them before returning the write-access confirmation. Upon
receiving the invalidate request, each host changes its
access right of that paragraph to none and returns its
confirmation to the home host. The access right of the home
host is then changed to none, while the requesting host
becomes the owner host and its access right is changed to
read-write. If the home host is itself the requesting host,
it will directly send the invalidate requests to all other
hosts which have valid copies of that paragraph and wait for
confirmations from all of them. Upon receiving the
invalidate request, each host changes its access right to
none and returns its confirmation to the home host. The home
host then changes its access right to read-write and becomes
the owner host of that paragraph.
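The requirement that the home host wait for every invalidate confirmation
before granting write access can be sketched with a set of outstanding
hosts. The class and method names here are hypothetical, and timeouts or
lost messages are outside the scope of this illustration.

```python
class PendingWriteAccess:
    """Tracks invalidate confirmations for one write-access request."""

    def __init__(self, home, requester, copy_holders):
        # Invalidate requests go to every copy holder except the home
        # host and the requesting host.
        self.waiting = set(copy_holders) - {home, requester}

    def ready(self):
        """True once the write-access confirmation may be returned."""
        return not self.waiting

    def confirm(self, host):
        """Record one confirmation and report whether all have arrived."""
        self.waiting.discard(host)
        return self.ready()
```

When only the home and requesting hosts hold copies, the waiting set is
empty and the confirmation can be returned at once.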
5. Distributed Run-time Subsystem
DICE DRS transforms the DICE DSM from a flat
space into an object-oriented structured space. DRS
consists of a set of tools that implement DICE. The
Application Programmer's Interface (API) provides users with
programming tools to develop and execute DICE
multi-threaded applications. The tools used during program
development include a parallel language and its compil-
er, library interface functions, a linker, and other system
services.
A new Object-Oriented Dataflow Language (OODL) is
being designed as the parallel language used in DICE.
One of the important features of object-oriented
programming is information hiding and encapsulation
[11,29]. It provides a higher level of data abstraction in
modeling real world objects. Such constructs are helpful
in designing parallel programs [35]. In general, parallel
programs are difficult to design because the programmer
must consider multiple execution threads instead of a
single thread. All possible interactions among the
threads must be considered. Also, parallel programs are
hard to maintain because a simple change may affect the
interaction pattern and result in global consequences.
Information hiding helps in reducing possible interac-
tions that need to be considered, while data encapsula-
tion helps in minimizing the maintenance effort when
program changes are needed.
While the object-oriented model provides a high level
of programming abstraction, it does not naturally exploit
parallelism of applications constructed with objects. A
dataflow model can expose and exploit the maximum
amount of parallelism, as well as express data dependence
from different levels of abstraction in a very natural
way. The combination of the object-oriented and dataflow
concepts makes it easier for programmers to
design large scale multi-threaded parallel programs, and
to build re-usable concurrent software modules.
The OODL language, in DICE, is an extension of
C++. Dataflow constructs are added to allow programmers
to express parallelism explicitly. The parallel compiler
can be realized using a preprocessor to translate the
extended source code into C++ programs, which in
turn are compiled into object code using an existing
C++ compiler.
The run-time library interface functions provide a
collection of library routines that are linked with each
parallel program. They are invoked to support the service
requests made by system processes at run-time. The
OODL compiler will use these functions to realize the
parallelism expressed in the application programs.
These functions can also be used by the application
directly.
The linker will create a standard execution file such as
a.out and an execution dependency tree called a.tree.
The information kept in the dependency tree includes
the names of the parallel threads; information about the
resources of the threads, such as starting address and
memory requirements; and the predecessors and successors
of each thread. This information will be used by the
parallel scheduler to create and allocate shared memory
segments, and to schedule threads on different worksta-
tions at run-time. The linker will arrange shared vari-
ables into shared segments, to simplify the management
of shared memory by the DSM subsystem.
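The on-disk layout of a.tree is not specified here, so the following is only
a sketch of an in-memory form holding the fields listed above (thread name,
starting address, memory requirement, predecessors, successors). The names
ThreadNode and runnable are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadNode:
    """One thread entry of an assumed in-memory dependency tree."""
    name: str
    start_address: int
    memory_bytes: int
    predecessors: list = field(default_factory=list)
    successors: list = field(default_factory=list)


def runnable(tree, finished):
    """Names of unfinished threads whose predecessors have all finished."""
    return sorted(
        node.name for node in tree.values()
        if node.name not in finished
        and all(p in finished for p in node.predecessors)
    )
```

A scheduler could call runnable after each thread terminates to find the
threads that are now eligible to be placed on a workstation.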
DRS also provides services for executing applications
at load-time and run-time. These services include the
use of the DICE daemon(s), as well as the automatic cre-
ation of a root process and alias remote processes for a
parallel task.
For each workstation that participates in DICE, a dae-
mon process has to be present. This daemon is responsi-
ble for invoking DICE alias processes on remote worksta-
tions. Each DICE application creates a root process
when it starts. The workstation where the root process is
running is referred to as the root workstation. A DICE
application may have zero or more alias processes. An
alias process is created by the root process on a remote
workstation through a DICE daemon as needed.
The root process is a multi-threaded process which
runs on the root workstation. It is created when the
parallel task is submitted to the system. In DICE, the thread
is the unit of execution, while a process is the unit of re-
source allocation. Each process contains one or more
threads. The root process provides the virtual address
and system resources for threads running on the root
workstation. The root thread is the first thread of a par-
allel task. It is responsible for creating the parallel
scheduler and DSM manager threads before any applica-
tion threads start to run. It then becomes the first appli-
cation thread running on the root workstation. The root
process terminates when the parallel task is done.
An alias process is a reincarnation of the root process
on each remote workstation. An alias process is created
when a thread is scheduled to run on a remote worksta-
tion for the first time. The alias process supports the
same virtual address as the root process and system re-
sources for threads running on its workstation. These
threads include an alias primary thread, DSM manager,
and application threads. An alias primary thread is
responsible for creating its local DSM manager as well as
the first application thread running on its local
workstation. This alias primary thread, then, listens to
thread-create requests coming from the network. Subsequently,
it creates these requested threads of its own parallel task
on its local workstation. The alias primary thread and
DSM manager of a remote workstation will remain when
all of its application threads are terminated. The alias
primary thread waits for thread-creation requests from
the parallel scheduler, while DSM manager waits for
memory access requests from other workstations. When
the root process is done, the parallel scheduler sends out
a termination signal to all the alias processes of that par-
ticular task. This is to ensure that all alias processes are
terminated before the termination of the root process.
The DICE daemon process is a server that is responsi-
ble for invoking alias processes on a remote workstation.
After invoking an alias process, the daemon process will
have nothing to do with this application task. It will go
back to listening to requests from the network. If a
workstation does not want to participate in DICE, it can simply
terminate this daemon process. A DSM manager is an
active entity on each workstation responsible for
handling memory access faults. Each DSM manager maintains
a memory mapping table that maps each memory page to
its local workstation or other remote workstations.
6. Parallel Scheduler
DICE PS is a self-optimizing application-specific
scheduler. It is responsible for thread scheduling and
synchronization. PS is implemented as a thread within
the parallel task. Each parallel task has one PS running
on the workstation where the task initially starts to run.
This special thread is created during the task load-time.
When an application needs to create another thread
or to terminate itself by joining with other threads, it
passes control of execution to the PS. The PS will find
the fastest way to run the application by using the
information in a Task Execution Dependence Tree, which is
created as an auxiliary file during the compilation of the
source program.
The PS decides whether the local workstation has
enough resources to run the different threads, which
threads to send to remote workstations to run, and which
remote workstations to send them to. It uses several
tools to make intelligent decisions at run time. Those
tools are: a CPU load estimator, a network load estimator,
an intelligent database, and a bidding process.
The CPU load estimator runs on every workstation on
the network and keeps track of the load on that worksta-
tion. When the time comes to run a thread on the local
CPU, PS looks at the CPU load estimator for information
about the load on the local CPU. Similarly, when a bid
arrives at a workstation, the decision whether to accept
the bid or not depends partially on the readings taken by
the CPU load estimator.
The network load estimator monitors the traffic on the
network and gives PS an up-to-date reading of the network
traffic. Smaller partitions that take a relatively short
time to execute can become too expensive to ship if
transmission times become too severe. In that case, it
might be better to keep them on the local workstation,
defer shipping them, or combine two or more into larger
partitions.
The network load estimator has the responsibility to
provide PS with real time network traffic information.
The network load estimator can be as simple as a bus
monitor which continually updates a register (interpreted as
an integer) signifying network utilization levels of high,
medium, or low.
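A bus monitor of this kind could reduce a measured utilization to the
three-level integer reading described above. The thresholds (0.3 and 0.7)
and the integer encoding below are assumptions chosen for illustration; the
text does not fix them.

```python
LOW, MEDIUM, HIGH = 0, 1, 2  # assumed integer encoding of the register


def utilization_level(busy_fraction, medium_at=0.3, high_at=0.7):
    """Map a measured busy fraction of the network (0.0 to 1.0) to a level."""
    if not 0.0 <= busy_fraction <= 1.0:
        raise ValueError("busy_fraction must lie between 0.0 and 1.0")
    if busy_fraction >= high_at:
        return HIGH
    if busy_fraction >= medium_at:
        return MEDIUM
    return LOW
```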
A small and efficient database records thread
performance on each workstation under different CPU and
network load conditions. This database allows the bidding
process to generate a reasonable estimate of the expected
run time of a thread on a particular target workstation.
The intelligent database is designed to categorize
different higher level operations of modules and parameterize
their computational and communication time requirements.
The contents of the intelligent database are
tailored to the installation where it resides. The database
is initiated with the types of applications being run,
and its contents are updated as new applications are
introduced.
When PS decides that it is best to send some threads to
a remote workstation to run, it needs a way to pick those
workstations. Instead of forcing other, possibly heavily
loaded, workstations to take some of the threads, PS asks
for help through the bidding process. It simply asks for
help in running a given thread and tells the other work-
stations about the memory and CPU requirements of the
thread. This information is found in the intelligent data-
base.
Upon each task completion, the intelligent database is
updated to reflect the most current experience. When
no data is available about an application, we can run it the
first time with gross overestimates, or underestimates,
and let the intelligent database learn about it. Simulation
may also be used to obtain initial estimates.
It is essential that the intelligent database be queried and
updated quickly, as it would otherwise become a system
bottleneck and might slow down the entire system.
Ultimately the intelligent database can be implemented
in hardware as a content addressable memory.
In the bidding algorithm, PS weighs execution time
versus shipping and management time for each resident
module. If execution time is greater than shipping and
management time and the load on the local workstation
is higher than a predefined threshold, the parallel
scheduler broadcasts a global message through the network
asking for help. This "help wanted" message includes
enough information about the module to be sent to enable
other workstations to determine if they can offer
their help. The information includes the estimated module
execution time, memory and disk requirements, and
any other information that is useful in making the
decision.
Those workstations which can potentially bid to accept
the module for processing will examine this workload
information and determine whether it is feasible to bid. If a
workstation is capable of assisting, it will return a
message stating its availability, and will commit to this bid for
a period long enough for the asking workstation to
receive the return message and act on it. Through this
process, workstations that bid for help and are not accepted
will waste little time before considering later "help
wanted" messages.
Each workstation will monitor the network before
sending its reply to determine if any other workstation
has responded to the bid, and will not send its reply if any
workstation did respond. It is assumed that the first
workstation to reply will get the job, and there is no need
for others to do so. PS sends the module to the first
workstation that replies to the request.
PS repeats the help wanted messages for a given task
until either it receives a response or the task is at the
point where it has to be executed in order not to delay the
rest of the tasks.
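The two decisions in the bidding algorithm above, whether to broadcast a
"help wanted" message and which bidder receives the module, can be sketched
as follows. The function names, the time units, and the representation of a
reply as an (arrival time, workstation) pair are assumptions made for this
example.

```python
def should_broadcast(exec_time, ship_time, mgmt_time, local_load, load_threshold):
    """True when PS should ask other workstations for help with a module."""
    worth_shipping = exec_time > ship_time + mgmt_time
    overloaded = local_load > load_threshold
    return worth_shipping and overloaded


def pick_bidder(replies):
    """Return the workstation whose reply arrived first, or None.

    `replies` is a list of (arrival_time, workstation) pairs.
    """
    if not replies:
        return None  # no response yet; PS repeats the message
    return min(replies)[1]
```

A module that runs for 10 time units but costs 3 to ship and manage would be
offered to other workstations only while the local load exceeds the
threshold.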
As a thread is scheduled on a remote workstation to
run, its respective virtual address space segments are
allocated physical memory blocks on the same workstation.
PS takes into consideration the memory resources available
on a workstation when scheduling a thread there.
7. Interactions and Integration
DSM, DRS, and PS are three separate subsystems of
DICE. They interact with each other to provide an inte-
grated environment and to cooperatively work to provide
the distributed computing paradigm for a parallel task.
After a parallel task is compiled and linked, a task ex-
ecution tree file a.tree is created. PS uses this tree to
perform the parallel thread scheduling. When a thread
is to be created, the root process will transfer execution
control to PS. The latter will use the a.tree file to schedule it
on a local or remote workstation, and then transfer
execution control back to the application. Similarly,
execution control will be transferred to PS when a thread
terminates itself by joining other threads. Figure 5 shows
the overall interaction between DRS and PS. The paral-
lel compiler and linker create the image of virtual
memory segments and the task execution tree. PS is in-
voked when a thread needs to fork or join with other
threads.
Furthermore, the root thread of DRS is responsible
for creating PS. The alias thread on each remote work-
station listens to remote thread creation requests sent by
PS, and creates threads locally.
Similarly, the active entity of the DSM subsystem, the
DSM thread, is created by the root thread or alias threads
on different workstations. In the meantime, the data
structures needed by the DSM thread are also created and
initialized.
The efficiency of handling shared memory by DSM
subsystem is significantly affected by the layout of shared
variables on DSM memory segments and the allocation
of physical memory on different workstations by the par-
allel programming subsystem and parallel scheduler.
Figure 6 shows an example of the run time behavior of
the DSM subsystem.
Figure 5. Scheduling of a Parallel Task on a Cluster of
Workstations.
In Figure 6, the virtual address space of the parallel
task is on the left side. Each shadowed paragraph within
the virtual address space represents a single virtual
memory segment. The physical address spaces on different
hosts are on the right side. The shadowed paragraph
within a host denotes a block of physical memory, and
the other structure represents the segment map table.
The paragraphs with arrowheads represent the corresponding
mappings between the memory segments and
the physical memory blocks on different hosts.
8. Performance and Discussion
The performance of the DICE DSM system has been studied
using an analytical model, which derives an expression
for the speedup of the parallel part of an application
(Sp). The effects of system structure and application
behavior on Sp are shown and discussed in [15].
Some of these results are shown in this section. The system
and application parameters used in this model are
summarized in Table 1 in the Appendix.
A high-speed and low-latency ATM LAN is assumed in
this model. We also assume that queuing time on the
network is negligible. This assumption is justified by the
results shown in Figure 7 (Appendix), which indicate
that the gain in Sp becomes smaller and smaller as the
network data rate Rn is increased. Figure 8 (Appendix)
shows that Sp decreases as processor speed Rp increases.
Note that the total execution time for an application
will still be reduced as Rp increases, although
Sp decreases.
Figure 9 (Appendix) shows that Sp increases as the
number of paragraphs per page, k, increases up to a certain
point. After that point, Sp slightly decreases as k
further increases. Furthermore, Sp is approximately
the same for a fixed paragraph size, which is P/k. This
behavior demonstrates the usefulness of a paragraph
with a smaller granularity than a page. Figure 10
(Appendix) shows a similar behavior for Sp in relationship
with the number of hosts N.
9. Conclusions
In this paper, we present the architecture of a distributed
computing environment, DICE, which integrates
distributed shared memory with parallel scheduling and
distributed run-time management. The analysis of the
performance model demonstrates the usefulness of a
paragraph with a smaller granularity than a page in the
DICE system. This smaller granularity reduces the
chance of false sharing and the data size needed to be
transferred over the network.
The coherency protocol for this two-tier paging sys-
tem is also being simulated in software. The perform-
ance of DICE DSM is also being evaluated using a simu-
lation model, which takes into consideration network
queuing delay. The Object-Oriented Dataflow Lan-
guage and self-tuning Parallel Scheduler are under de-
velopment.
The current DICE DSM design is based on the strict
consistency model and write-invalidate coherency pro-
tocol. This design is intended to be extended by using
multiple consistency and coherency protocols. Multiple
protocols will be used to meet a broader range of application
requirements. DICE will incorporate the DSM design
with a relaxed consistency model to hide the large laten-
cy of remote memory accesses by allowing buffering and
merging.
Figure 6. Distributed Shared Virtual Memory (virtual
address space; root workstation; remote workstations 1 and 2).
References
[1] R. Agrawal and A. K. Ezzat, "Location Independent
Remote Execution in NEST," IEEE Transactions on Software
Engineering, Vol. 13, No. 8, 1987, pp. 905-912.
[2] R. Ananthanarayanan, S. Menon, A. Mohindra, and U.
Ramachandran, "Experiences in Integrating Distributed
Shared Memory with Virtual Memory Management,"
ACM Operating System Review, Vol. 26, No. 3, July 1992,
pp. 4-26.
[3] H. ,aa'afeh, H. S. AlKhatib, and H. Barmclough,
"MOPPS: A Scheme for Managing Parallel Scientific
Programs in a Distributed Architecture," Proceedings of
COMPCON'90, the Annual International Computer
Conference of the IEEE Computer Society, February 25
- March 2, 1990, San Francisco, CA, pp. 387-395.
[4] H. E. Bal, J. G. Steiner, and A. S. Tanenbaum,
"Programming Languages for Distributed Computing
Systems," ACM Computing Surveys, September 1989, pp.
261-322.
[5] H. E. Bal, M. F. Kaashoek, and A. S. Tanenbaum, "Orca:
A Language For Parallel Programming of Distributed
Systems," IEEE Transactions on Software Engineering,
Vol. 18, No. 3, March 1992, pp. 190-205.
[6] J. K. Bennett, J. B. Carter, and W. Zwaenepoel,
"Adaptive Munin: Distributed Shared Memory Based on
Type-Specific Memory Coherence," Proceedings of the 2nd
ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming, 1990, pp. 168-175.
[7] J. B. Carter, J. K. Bennett, and W. Zwaenepoel,
"Implementation and Performance of Munin," The 13th ACM
Symposium on Operating Systems Principles, October
1990, pp. 152-164.
[8] J. S. Chase, F. G. Amador, E. D. Lazowska, H. M. Levy,
and R. J. Littlefield, "The Amber System: Parallel
Programming on a Network of Multiprocessors,"
Proceedings of the 12th ACM Symposium on Operating
System Principles, December 1989, pp. 147-158.
[9] D. R. Cheriton, "The V Distributed System,"
Communications of the ACM, Vol. 31, No. 3, pp. 314-333, 1988.
[10] W. W. Chu and L. M-T. Lan, "Task Allocation and
Precedence Relations for Distributed Real-Time
Systems," IEEE Transactions on Computers, Vol. C-36, No. 6,
June 1987, pp. 667-679.
[11] B. Cox, Object Oriented Programming - An
Evolutionary Approach, Addison-Wesley, 1986.
[12] B. D. Fleisch and G. J. Popek, "Mirage: A Coherent
Distributed Shared Memory Design," Proceedings of the 12th
ACM Symposium on Operating System Principles,
December 1989, pp. 211-222.
[13] A. Forin, J. Barrera, and R. Sanzi, Design,
Implementation, and Performance Evaluation of A Distributed
Shared Memory Server for Mach, Technical Report
CMU-CS-88-165, Carnegie-Mellon University,
Computer Science Department, August 1988.
[14] A. Forin, J. Barrera, and R. Sanzi, "The Shared
Memory Server," Proceedings 1989 Winter USENIX
Technical Conference, February 1989, pp. 229-244.
[15] C. Jou, H. S. AlKhatib, and Q. Li, Performance Analysis
of DICE Distributed Shared Memory System, Distributed
Computing Lab Technical Report No. 03281994,
Department of Computer Engineering, Santa Clara
University, 1994.
[16] J. Ju, G. Xu, and J. Tao, "Parallel Computing Using Idle
Workstations," Operating System Review, July 1993, pp.
87-96.
[17] P. Krueger and R. Chawla, "The Stealth Distributed
Scheduler," Proc. 11th ICDCS, 1991.
[18] K. Li, Shared Virtual Memory on Loosely Coupled
Multiprocessors, Ph.D. Thesis, Yale, September 1986.
[19] K. Li, "IVY: A Shared Virtual Memory System for
Parallel Computing," In Proceedings of the 1988
International Conference on Parallel Processing, pp. 94-101,
August 1988.
[20] M. T. Litzkow, "Condor - A Hunter of Idle
Workstations," Proc. 8th ICDCS, 1988, pp. 104-111.
[21] R. G. Minnich and D. J. Farber, "The Mether System:
A Distributed Shared Memory for SunOS 4.0," In Usenix -
Summer 89, Usenix, 1989.
[22] R. G. Minnich and D. J. Farber, "Reducing Host Load,
Network Load, and Latency in a Distributed Shared
Memory," Proceedings of the 10th International
Conference on Distributed Computing Systems, Paris, France,
June 1990.
[23] D. A. Nichols, "Using Idle Workstations in a Shared
Computing Environment," Proceedings of the 11th ACM
Symposium on Operating Systems Principles, December
1987, pp. 5-12.
[24] J. K. Ousterhout, A. R. Cherenson, F. Douglis, M. N.
Nelson, and B. B. Welch, "The Sprite Network Operating
System," IEEE Computer, February 1988, pp. 23-36.
[25] J. Pasquale, Knowledge-Based Distributed Systems
Managements, Report No. UCB/CSD 86/295, UC
Berkeley, Computer Science Division, June 1986.
[26] J. Pasquale, Using Expert Systems to Manage
Distributed Computer Systems, Report No. UCB/CSD 87/334,
UC Berkeley, Computer Science Division, January 1987.
[27] U. Ramachandran, M. Ahamad, and M. Khalidi,
"Unifying Synchronization and Data Transfer in Maintaining
Coherence of Distributed Shared Memory," Proceedings
of the 1989 International Conference on Parallel
Processing, pp. 160-169, August 1989.
[28] J. A. Stankovic, "An Application of Bayesian Decision
Theory to Decentralized Control of Job Scheduling,"
IEEE Transactions on Computers, Vol. C-34, No. 2,
February 1985.
[29] B. Stroustrup, "What is "Object Oriented
Programming"?," IEEE Software, Vol. 5, No. 3, May 1988, pp.
10-20.
[30] G. C. Shoja, "A Distributed Facility for Load Sharing
and Parallel Processing Among Workstations," Journal of
System and Software, Vol. 14, No. 3, pp. 163-172.
[31] M. Tam, J. M. Smith, and D. J. Farber, "A
Taxonomy-Based Comparison of Several Distributed Shared
Memory Systems," ACM Operating System Review, Vol.
24, No. 3, July 1990, pp. 40-67.
[32] A. S. Tanenbaum, M. F. Kaashoek, and H. E. Bal,
"Parallel Programming Using Shared Objects and
Broadcasting," IEEE Computer, Vol. 18, No. 3, August 1992, pp.
10-19.
[33] M. Theimer, K. Lantz, and D. Cheriton, "Preemptable
Remote Execution Facilities for V-System," Proceedings
of the 10th ACM Symposium on Operating Systems
Principles, December 1985, pp. 2-12.
[34] B. Walker, G. Popek, R. English, C. Kline, and G.
Thiel, "The LOCUS Distributed Operating System,"
Proceedings of the 9th Symposium on Operating System
Principles 17, 5 (November 1983), pp. 49-70.
[35] Y. Wu and T. G. Lewis, "Parallelism Encapsulation in
C++," In Proceedings of the 1990 International
Conference on Parallel Processing, pp. 35-42, 1990.
[36] B. B. Wyatt, K. Kavi, and S. Hufnagel, "Parallelism in
Object-Oriented Languages: A Survey," IEEE Software,
November 1992, pp. 56-86.
Appendix
Parameter    Meaning
N            the number of hosts executing an application
Rn           network data rate
Rp           processor speed
M            the total bytes of shared memory space for the running application
P            page size
k            the number of paragraphs per page
d            the percentage of data memory accesses for total instructions
Nrf          the number of read faults per 1,000,000 memory references per host
Nwf          the number of write faults per 1,000,000 memory references per host
Xs           spatial locality factor
N0, N1, g    temporal locality factors
Table 1. System and application parameters in the performance model.
Figure 7. Sp vs k for different Rn. N=16, Rp=50Mips,
M=64kbytes, P=4kbytes, d=0.4, Nrf=500,
Nwf=10, Xs=0.5, N0=10, N1=100, g=100.
Figure 8. Sp vs k for different Rp. N=16, Rn=150Mbps,
M=64kbytes, P=4kbytes, d=0.4, Nrf=500,
Nwf=10, Xs=0.5, N0=10, N1=100, g=100.
Figure 9. Sp vs k for different P. N=16, Rn=150Mbps,
Rp=50Mips, M=64kbytes, d=0.4, Nrf=500,
Nwf=10, Xs=0.5, N0=10, N1=100, g=100.
Figure 10. Sp vs N for different k. Rn=150Mbps, d=0.4,
Rp=50Mips, M=64kbytes, P=4kbytes, Xs=0.5,
Nrf=500, Nwf=10, N0=10, N1=100, g=100.
