AS-COMA: An adaptive hybrid shared memory Architecture by Carter, John B. & Kuo, Chen-Chi
AS-COMA: An Adaptive Hybrid Shared Memory Architecture *
Chen-Chi Kuo, John B. Carter, Ravindra Kuramkote, Mark Swanson
{chenchi, retrac, kuramkot, swanson}@cs.utah.edu
WWW: http://www.cs.utah.edu/projects/avalanche
UUCS-98-010
Department of Computer Science
University of Utah, Salt Lake City, UT 84112
March 23, 1998
Abstract
Scalable shared memory multiprocessors traditionally use either a cache coherent non-uniform
memory access (CC-NUMA) or simple cache-only memory architecture (S-COMA)
memory architecture. Recently, hybrid architectures that combine aspects of
both CC-NUMA and S-COMA have emerged. In this paper, we present two improvements
over other hybrid architectures. The first improvement is a page allocation algorithm that
prefers S-COMA pages at low memory pressures. Once the local free page pool is drained,
additional pages are mapped in CC-NUMA mode until they suffer sufficient remote misses
to warrant upgrading to S-COMA mode. The second improvement is a page replacement
algorithm that dynamically backs off the rate of page remappings from CC-NUMA to
S-COMA mode at high memory pressure. This design dramatically reduces the amount of
kernel overhead and the number of induced cold misses caused by needless thrashing of the
page cache. The resulting hybrid architecture is called adaptive S-COMA (AS-COMA).
AS-COMA exploits the best of S-COMA and CC-NUMA, performing like an S-COMA
machine at low memory pressure and like a CC-NUMA machine at high memory pressure.
AS-COMA outperforms CC-NUMA under almost all conditions, and outperforms other
hybrid architectures by up to 17% at low memory pressure and up to 90% at high memory
pressure.
Keywords: Distributed shared memory, multiprocessor computer architecture, memory
architecture, CC-NUMA, S-COMA, hybrid.
Technical Areas: Architecture.
*This work was supported by the Space and Naval Warfare Systems Command (SPAWAR) and Advanced Research
Projects Agency (ARPA), Communication and Memory Architectures for Scalable Parallel Computing, ARPA order





1 Introduction
Scalable hardware distributed shared memory (DSM) architectures have become increasingly
popular as high-end compute servers. One of the purported advantages of shared memory
multiprocessors compared to message passing multiprocessors is that they are easier to program, because
programmers are not forced to track the location of every piece of data that might be needed.
However, naive exploitation of the shared memory abstraction can cause performance problems,
because the performance of DSM multiprocessors is often limited by the amount of time spent
waiting for remote memory accesses to be satisfied. When the overhead associated with accessing
remote memory impacts performance, programmers are forced to spend significant effort managing
data placement, migration, and replication, which is the very problem that shared memory is designed to
eliminate. Thus, the value of DSM architectures is directly related to the extent to which observable
remote memory latency can be reduced to an acceptable level.
The two basic approaches for addressing the memory latency problem are building
latency-tolerating features into the microprocessor and reducing the average memory latency. Because of
the growing gap between microprocessor cycle times and main memory latencies, modern
microprocessors incorporate a variety of latency-tolerating features such as fine-grained multithreading,
lockup free caches, split transaction memory busses, and out-of-order execution [1, 11, 15]. These
features reduce the performance bottleneck of both local and remote memory latencies by
allowing the processor to perform useful work while memory is being accessed. However, other than
the fine-grained multithreading support of the Tera machine [1], which requires a large amount
of parallelism and an expensive and proprietary microprocessor, these techniques can hide only a
fraction of the total memory latency. Therefore, it is important to develop memory architectures
that reduce the overhead of remote memory access.
Remote memory overhead is governed by three issues: (i) the number of cycles required to satisfy
each remote memory request, (ii) the frequency with which remote memory accesses occur, and (iii)
the software overhead of managing the memory hierarchy. The designers of high-end commercial
DSM systems such as the SUN UE10000 [18] and SGI Origin 2000 [6] have put considerable effort
into reducing the remote memory latency by developing specialized high speed interconnects. These
efforts can reduce the ratio of remote to local memory latency to as low as 2:1, but they require
expensive hardware available only on high-end servers costing hundreds of thousands of dollars. In
this paper, we concentrate on the second and third issues, namely reducing the frequency of remote
memory accesses while ensuring that the software overhead required to do this remains modest.
Previous studies have tended to ignore the impact of software overhead [5, 12, 16], but our findings
indicate that the effect of this factor can be dramatic.
Scalable shared memory multiprocessors traditionally use either a cache coherent non-uniform
memory access (CC-NUMA) architecture or a simple cache-only memory architecture
(S-COMA) [16]. Each architecture performs well under different conditions, as follows.
CC-NUMA is the most common DSM memory architecture. It is embodied by such machines
as the Stanford DASH [7], SUN UE10000 [18], and SGI Origin 2000 [6]. In a CC-NUMA, shared
physical memory is evenly distributed amongst the nodes in the machine, and each page of shared
memory has a home location. The home node of data can be determined from its global physical
address. Processors can access any piece of global data by mapping a virtual address to the
appropriate global physical address, but the amount of remote shared data that can be replicated
on a node is limited by the size of a node’s processor cache(s) and remote access cache (RAC) [8].
Thus, CC-NUMA machines generally perform poorly when the rate of conflict or capacity misses
is high, such as when a node’s caches are too small to hold the entire remote working set or when
the data access patterns and cache organization cause cached remote data to be purged frequently.
S-COMA architectures employ any unused DRAM on a node as a cache for remote data [16],
which significantly increases the amount of storage available on each node for caching remote
data. The performance of pure S-COMA machines is heavily dependent on the memory pressure
of a particular application. Put simply, memory pressure is a measure of the amount of physical
memory in a machine required to hold an application’s instructions and data. A 20% memory
pressure indicates that 20% of a machine’s pages must be used to hold the initial (home) copy of
the application’s instructions and data. At this low memory pressure, on average 80% of a node’s
physical memory is available to be used as a page-grained cache of remote data. Although this
ability to cache remote data in local memory can dramatically reduce the number of remote memory
operations, pure S-COMA has a number of drawbacks. Page management can be expensive. The
page-grained allocation of the remote data cache can lead to a large amount of internal fragmentation,
and the requirement that all shared data accessed by a node must be backed by a local DRAM
page can lead to thrashing at high memory pressures.
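As a worked illustration of the memory pressure definition, the following minimal sketch computes how much of a node's memory is left over for the S-COMA page cache. The function names and the 1000-page node size are assumptions introduced here for illustration, and the sketch assumes home pages are spread evenly across nodes, as in the text's "on average" statement.

```python
def page_cache_fraction(memory_pressure: float) -> float:
    """Fraction of a node's physical memory left for caching remote pages.

    memory_pressure is the fraction of pages needed to hold the home copy
    of the application's instructions and data (0.2 means 20%).
    """
    if not 0.0 <= memory_pressure <= 1.0:
        raise ValueError("memory pressure must be between 0 and 1")
    return 1.0 - memory_pressure


def page_cache_pages(total_pages: int, memory_pressure: float) -> int:
    """Number of local pages available as a page-grained remote data cache."""
    home_pages = round(total_pages * memory_pressure)
    return total_pages - home_pages


# Hypothetical node with 1000 physical pages.
print(page_cache_pages(1000, 0.2))  # 800 pages free for caching at 20% pressure
print(page_cache_pages(1000, 0.9))  # 100 pages free for caching at 90% pressure
```

At 90% memory pressure only a tenth of the node's memory remains for remote pages, which is the regime in which the thrashing described above becomes likely.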
Recently, hybrid architectures that combine aspects of both CC-NUMA and S-COMA have
emerged, such as the Wisconsin reactive CC-NUMA (R-NUMA) [5] and the USC victim cache
NUMA (VC-NUMA) [12]. Intuitively, these hybrid systems attempt to map the remote pages for
which there are the highest number of conflict misses to local S-COMA pages, thereby eliminating
the greatest number of expensive remote operations. All other remote pages are mapped in
CC-NUMA mode. Ideally, such systems would exploit unused available DRAM for caching without
penalty, but the proposed implementations fail to achieve this goal under certain conditions.
In this paper, we present two improvements over R-NUMA and VC-NUMA. The first
improvement is a page allocation algorithm that prefers S-COMA pages at low memory pressures. Once the
local free page pool is drained, additional pages are initially mapped in CC-NUMA mode until they
suffer sufficient remote misses to warrant upgrading to S-COMA mode. The second improvement
is a page replacement algorithm that dynamically backs off the rate of page remappings between
CC-NUMA and S-COMA mode at high memory pressure. This design dramatically reduces the
amount of kernel overhead and the number of induced cold misses caused by needless thrashing of
the page cache. The resulting hybrid architecture is called adaptive S-COMA (AS-COMA).
R-NUMA [5] and VC-NUMA [12] initially map all pages in CC-NUMA mode, and then identify
remote pages that are suffering inordinate numbers of conflict misses to remote nodes, so-called
hot pages. Unfortunately, under heavy memory pressure, there are not enough local pages to
accommodate all hot remote pages and thrashing occurs, which severely degrades performance.
In addition to the interrupt handling and flushing overheads induced by a remap request, page
remapping also increases the cold miss rate, because the contents of both the hot page and any
victim page that was downgraded to make room for it must be flushed from the processor cache(s).
AS-COMA initially maps pages in S-COMA mode to exploit S-COMA’s superior performance
at low memory pressures. Doing so eliminates remote conflict misses and remapping overhead when
there is enough free memory to cache all of a node’s working set in its local memory. To combat
page thrashing under heavy memory pressures, which occurs in S-COMA and to a lesser degree in
R-NUMA and VC-NUMA, AS-COMA uses a page replication backoff algorithm to detect thrashing
and aggressively reduce its rate of page remapping. Under extreme circumstances, AS-COMA goes
so far as to disable CC-NUMA <-> S-COMA remappings entirely.
We used detailed execution-driven simulation to evaluate a number of AS-COMA design
tradeoffs and then compared the resulting AS-COMA design against CC-NUMA, pure S-COMA,
R-NUMA, and VC-NUMA. We found that AS-COMA’s hybrid design provides the best behavior of
both CC-NUMA and S-COMA. At low memory pressures, AS-COMA acts like S-COMA and
outperforms other hybrid architectures by up to 17%. At high memory pressures, AS-COMA avoids the
performance dropoff induced by thrashing and aggressively converges to CC-NUMA performance,
thereby outperforming the other hybrid architectures by up to 90%. In addition, AS-COMA
outperforms CC-NUMA under almost all conditions, and at its worst only underperforms CC-NUMA
by 5%.
The remainder of this paper is organized as follows. In Section 2 we describe the basics of
all scalable shared memory architectures, followed by an in-depth description of existing DSM
models. Section 3 presents the design of our proposed AS-COMA architecture. We describe our
simulation environment, test applications, and experiments in Section 4, and present the results of
these experiments in Section 5. Finally, we draw conclusions and discuss future work in Section 6.
2 Background
In this section, we discuss the organization of the existing DSM architectures: CC-NUMA, S-COMA,
R-NUMA, and VC-NUMA.
2.1 Directory-based DSM Architectures
All of the shared memory architectures that we consider share a common basic design, illustrated
in Figure 1. Individual nodes are composed of one or more commodity microprocessors with private
caches connected to a coherent split-transaction memory bus. Also on the memory bus is a main
memory controller with shared main memory and a distributed shared memory controller connected
to a node interconnect. The aggregate main memory of the machine is distributed across all nodes.
The processor, main memory controller, and DSM controller all snoop the coherent memory bus,
looking for memory transactions to which they must respond.
The internals of a typical DSM controller also are illustrated in Figure 1. It consists of a memory
bus snooper, a control unit that manages locally cached shared memory (cache controller), a control
unit that retains state associated with shared memory whose “home” is the local main memory
(directory controller), a network interface, and some local storage. In all of the design alternatives
that we explore, the local storage contains DRAM that is used to store directory state.
When a local processor makes an access to shared data that is not satisfied by its cache, a
memory request is put on the coherent memory bus where it is observed by the DSM controller.
The bus snooper detects that the request was made to shared memory and forwards the request
to the DSM cache controller. The DSM cache controller will then take one of the following two
actions. If the data is in main memory, e.g., this node is the memory’s “home” or the data is cached
in a local S-COMA page, a coherency response is given that allows the main memory controller to
satisfy the request. Otherwise the request is forwarded to the appropriate remote node. Once a
response has been received, the DSM cache controller supplies the requested data to the processor
and potentially also stores it to main memory.
Figure 1 Typical Scalable Shared Memory Architecture
A request for data that is received from a remote node is forwarded to the directory controller,
which tracks the status of each line of shared data for which it is the home node. If the remote
request can be supplied using the contents of local memory, the directory controller simply responds
with the requested data and updates its directory state. If the directory controller is unable to
respond directly, e.g., because a remote node has a dirty copy of the requested cache line, it forwards
the request to the appropriate node(s) and updates its directory state.
The remote access overhead of these architectures can be represented as:
$$(N_{pagecache} \times T_{pagecache}) + (N_{remote} \times T_{remote}) + (N_{cold} \times T_{remote}) + T_{overhead}$$
$N_{pagecache}$ and $N_{remote}$ represent the number of conflict misses that were satisfied by the page
cache or remote memory, respectively. $N_{cold}$ represents the number of cold misses induced by
flushing and remapping pages, and thus is zero only in the CC-NUMA model. $T_{pagecache}$ and $T_{remote}$
represent the latency of fetching the line from the local page cache or remote memory, respectively.
$T_{overhead}$ represents the software overheads of the S-COMA and the hybrid models to support page
remapping, e.g., flushing.
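To make the cost model concrete, the sketch below evaluates the overhead expression for two miss profiles. The class and function names, the latencies (50 and 200 cycles), and the miss counts are hypothetical values chosen only for illustration; they are not measurements from this study.

```python
from dataclasses import dataclass


@dataclass
class MissProfile:
    n_pagecache: int    # conflict misses satisfied by the local page cache
    n_remote: int       # conflict misses satisfied by remote memory
    n_cold: int         # cold misses induced by flushing and remapping
    t_overhead: float   # software overhead for page remapping, in cycles


def remote_overhead(profile: MissProfile, t_pagecache: float, t_remote: float) -> float:
    """Evaluate (Npagecache*Tpagecache) + (Nremote*Tremote) + (Ncold*Tremote) + Toverhead."""
    return (profile.n_pagecache * t_pagecache
            + profile.n_remote * t_remote
            + profile.n_cold * t_remote
            + profile.t_overhead)


# Hypothetical latencies, in cycles (assumptions, not values from the paper).
T_PAGECACHE = 50.0
T_REMOTE = 200.0

# CC-NUMA: every conflict miss to remotely homed data goes remote, no remapping.
cc_numa = MissProfile(n_pagecache=0, n_remote=1000, n_cold=0, t_overhead=0.0)
# S-COMA at low memory pressure: the same misses hit the local page cache.
s_coma_low = MissProfile(n_pagecache=1000, n_remote=0, n_cold=0, t_overhead=0.0)

print(remote_overhead(cc_numa, T_PAGECACHE, T_REMOTE))     # 200000.0
print(remote_overhead(s_coma_low, T_PAGECACHE, T_REMOTE))  # 50000.0
```

Under these assumed numbers the page cache cuts the overhead by the ratio of the two latencies; nonzero cold-miss and software-overhead terms erode that advantage as remapping becomes frequent.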
Table 1 summarizes the remote memory overhead for each architecture and the critical factors
determining performance, assuming a fixed amount of memory. Table 2 provides the cost in terms
of the storage and complexity for each of the models. These issues will be explained in the following
sections along with how each model works.

Model                | Remote Overhead                               | Performance Factors
---------------------+-----------------------------------------------+----------------------
CC-NUMA              | $(N_{remote} \times T_{remote})$              | Network speed
---------------------+-----------------------------------------------+----------------------
S-COMA               | $(N_{pagecache} \times T_{pagecache}) +$      | 1. Network speed
                     | $(N_{cold} \times T_{remote}) +$              | 2. Software overhead
                     | $T_{overhead}$                                |
---------------------+-----------------------------------------------+----------------------
Hybrid Architectures | $(N_{pagecache} \times T_{pagecache}) +$      | 1. Network speed
                     | $(N_{remote} \times T_{remote}) +$            | 2. Software overhead
                     | $(N_{cold} \times T_{remote}) + T_{overhead}$ |
Table 1 Remote Memory Overhead of Various Models

Model                | Storage Cost                | Complexity
---------------------+-----------------------------+-------------------------------------
CC-NUMA              | None                        | None
---------------------+-----------------------------+-------------------------------------
S-COMA               | Page cache state:           | 1. Page cache state lookup
                     | 1. 2 bits per block         | 2. local <-> remote page map
                     | 2. 44 bits per page         | 3. Page-daemon and VM kernel
---------------------+-----------------------------+-------------------------------------
Hybrid Architectures | Page cache state:           | 1. Page cache state controller
                     | 1. 2 bits per block         | 2. local <-> remote page map
                     | 2. 44 bits per page         | 3. Page-daemon and VM kernel
                     | Refetch count:              | 4. Refetch counter, comparator
                     | 6 bits per page per node    |    and interrupt generator
Table 2 Cost and Complexity of Various Models
2.2 CC-NUMA
In CC-NUMA, the first page access on each node to a particular page causes a page fault, at which
time the local TLB and page table are loaded with a page translation to the appropriate global
physical page. The home node of each page can be determined from its physical address. When
the local processor suffers a cache miss to a line in a remote page, the DSM controller forwards the
memory request to the memory’s home node, incurring a significant access delay. Remote data can
only be cached in the processor cache(s) or an optional remote access cache (RAC) on the DSM
controller. Applications that suffer a large number of conflict misses to remote data, e.g., due to
the limited amount of caching of remote data, perform poorly on CC-NUMAs [5]. Unfortunately,
these applications are fairly common [5, 14, 16]. Careful page allocation [2, 9], migration [21], or
replication [21] can alleviate this problem by carefully selecting or modifying the choice of home
node for a given page of data, but these techniques have to date only been successful for read-only
or non-shared pages.
The conflict miss cost in the CC-NUMA model is represented by $(N_{remote} \times T_{remote})$, that is,
all misses to shared memory with a remote home must be remote misses. To reduce this overhead,
designers of some such systems have adopted high speed interconnects to reduce $T_{remote}$ [6, 13,
18].
2.3 S-COMA
In the S-COMA model [16], the DSM controller and operating system cooperate to provide access
to remotely homed data. In S-COMA, a mapping from a global virtual address to a local physical
address is created at the first page fault to that shared memory page. The page fault handler selects
an available page from the local DRAM page cache. At this time, the cache state information is
updated in the local DSM controller to indicate which global page this local page is caching. In
addition, the valid bit associated with each cache line in the page is set to invalid to indicate that,
while the page mapping is valid, no remote data is actually cached in the local page yet. If there
are no free pages in the page cache when a page fault occurs, the page fault handler selects another
S-COMA page to replace, flushes this page’s cache lines from the local processor cache, and then
maps the faulting page.
When a local processor suffers a cache miss to remote data, the DSM cache controller examines
the valid bit for the line. If the valid bit is set, the page cache contains valid data for that line, so
it can be supplied directly from main memory, thereby avoiding an expensive remote operation. If,
however, the requested line is invalid, the DSM cache controller must perform a remote request to
acquire a copy of the desired data. When the remote node responds with the data, it is written to
the page cache, supplied to the processor, and the valid bit is set.
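The per-line valid bit check described above can be sketched in software as follows. This is only an illustrative model of the hardware behavior: the class, the function, and the `fetch_remote` callback are hypothetical names introduced here, and coherence actions such as invalidations are deliberately left out.

```python
class SComaPage:
    """A local page-cache page that backs one global (remote) page."""

    def __init__(self, global_page: int, lines_per_page: int):
        self.global_page = global_page
        # At mapping time every line is marked invalid: the mapping exists,
        # but no remote data has been cached in the local page yet.
        self.valid = [False] * lines_per_page
        self.data = [None] * lines_per_page


def handle_cache_miss(page: SComaPage, line: int, fetch_remote):
    """Return (data, went_remote) for a processor cache miss to `line`."""
    if page.valid[line]:
        # Valid bit set: satisfy the miss from local main memory.
        return page.data[line], False
    # Valid bit clear: perform a remote request, then fill the page cache.
    value = fetch_remote(page.global_page, line)
    page.data[line] = value
    page.valid[line] = True
    return value, True


# Usage: the first miss goes remote, the second is served locally.
page = SComaPage(global_page=7, lines_per_page=4)
remote_requests = []


def fetch_remote(global_page, line):
    remote_requests.append((global_page, line))
    return f"line-{global_page}-{line}"


print(handle_cache_miss(page, 2, fetch_remote))  # ('line-7-2', True)
print(handle_cache_miss(page, 2, fetch_remote))  # ('line-7-2', False)
```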
S-COMA’s aggressive use of local memory to replicate remote shared data can completely
eliminate $N_{remote}$ when the memory pressure on a node is low. However, pure S-COMA’s performance
degrades rapidly for some applications as memory pressure increases. Because all remote data
must be mapped to a local physical page before it can be accessed, there can be heavy contention
if the number of local physical pages available for S-COMA page replication is small. Under these
circumstances, thrashing occurs, not unlike thrashing in a conventional VM system. Given the high
cost of page replacement, this can lead to dismal performance.
In the S-COMA model, the conflict miss cost is represented by $(N_{pagecache} \times T_{pagecache}) + (N_{cold} \times
T_{remote}) + T_{overhead}$. When memory pressure is low enough that all of the remote data a node needs
can be cached locally, page remapping does not occur and both $N_{cold}$ and $T_{overhead}$ are zero. As the
memory pressure increases, and thus more remote pages are accessed by a node than can be cached
locally, $N_{cold}$ and $T_{overhead}$ increase due to remapping. $N_{cold}$ increases because the contents of any
pages that are replaced from the local page cache must be flushed from the processor cache(s).
Subsequent accesses to these pages will suffer cold misses in addition to the cost of remapping.
An even worse problem is that as memory pressure approaches 100%, the time spent in the kernel
flushing and remapping pages ($T_{overhead}$) skyrockets. Sources of this overhead include the time
spent context switching between the user application and the pageout daemon, flushing blocks
from the victim page(s), and remapping pages.
2.4 Hybrid DSM Architectures
Two hybrid CC-NUMA/S-COMA architectures have been proposed: R-NUMA [5] and VC-NUMA
[12]. We describe these architectures in this section.
The basic architecture of an R-NUMA machine [5] is that of a CC-NUMA machine. However,
unlike CC-NUMA, which “wastes” local physical memory not required to hold home pages,
R-NUMA uses this otherwise unused storage to cache frequently accessed remote pages, as in
S-COMA. This mechanism requires a number of modest modifications to a conventional CC-NUMA’s
DSM engine and operating system, as described below.
In addition to its normal CC-NUMA operation, the directory controller in an R-NUMA machine
maintains an array of counters that tracks, for each page, the number of times that each processor
has refetched a line from that page, as follows. Whenever a directory controller receives a request
for a cache line from a node, it checks to see if that node is already in the copyset of nodes for
that line. If it is, this request is a refetch caused by a conflict miss, and not a coherence or cold
miss, and the node’s refetch counter for this page is incremented. The per-page/per-node counter
is used to determine which CC-NUMA pages are generating frequent remote refetches, and thus are
good candidates to be mapped to an S-COMA page on the accessing node. When a refetch counter
crosses a configurable threshold (e.g., 64), the directory controller piggybacks an indication of this
event with the data response. This causes the DSM engine on the requesting node to interrupt the
processor with an indication that a particular page should be remapped to a local S-COMA page.
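The refetch-counting mechanism can be sketched as follows. This is a simplified software model rather than R-NUMA's actual hardware: the class and method names are hypothetical, coherence invalidations that would remove nodes from a copyset are not modeled, and signalling exactly once, when the count first reaches the threshold, is an assumption about what "crosses" means.

```python
from collections import defaultdict


class DirectoryController:
    """Home-node directory that counts per-page, per-node refetches."""

    def __init__(self, threshold: int = 64):
        self.threshold = threshold
        self.copyset = defaultdict(set)        # (page, line) -> nodes holding a copy
        self.refetch_count = defaultdict(int)  # (page, node) -> refetches so far

    def handle_request(self, node: int, page: int, line: int) -> bool:
        """Process a line request; return True if the data response should
        carry an indication that `page` is a candidate for S-COMA mode."""
        holders = self.copyset[(page, line)]
        if node in holders:
            # The node already held this line, so this is a refetch caused
            # by a conflict miss rather than a cold or coherence miss.
            self.refetch_count[(page, node)] += 1
            # Assumption: signal once, when the count reaches the threshold.
            return self.refetch_count[(page, node)] == self.threshold
        holders.add(node)  # first fetch of this line by this node
        return False


# Usage with a small threshold so the effect is visible.
directory = DirectoryController(threshold=3)
signals = [directory.handle_request(node=1, page=0, line=5) for _ in range(4)]
print(signals)  # [False, False, False, True]
```

The first request is a cold miss and is not counted; the remap indication accompanies the response to the third refetch.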
Pages are remapped from CC-NUMA mode to S-COMA mode using essentially the same
mechanism as is used by S-COMA to remap pages. First, all lines of the page being upgraded must be
flushed from the local processor cache(s) and RAC. Then, if a free page already exists, the global
virtual address is mapped to the selected local physical address, and the DSM engine is informed of
the new mapping. If no free page exists, the fault handler first must select a victim page to replace,
the victim’s data must be flushed from the page cache, and its corresponding global virtual address
must be remapped back to its home global physical address.
By supporting both CC-NUMA and S-COMA access modes in the same machine, an R-NUMA
machine is able to exploit available local memory as a large page cache for CC-NUMA pages. By
tracking refetch counts, it is able to select dynamically which CC-NUMA pages should populate
the S-COMA cache based on access behavior. In a recent study [5], R-NUMA’s flexibility and
intelligent selection of pages to map in S-COMA mode caused it to outperform the best of pure
CC-NUMA and pure S-COMA by up to 37% on some applications.
However, although R-NUMA frequently outperforms both CC-NUMA and S-COMA, it was also
observed to perform as much as 57% worse on some applications [5]. This poor performance can be
attributed to two problems. First, R-NUMA initially maps all pages in CC-NUMA mode, and only
upgrades them to S-COMA mode after some number of remote refetches occur, which introduces
needless remote refetches when memory pressure is low. Second, R-NUMA always upgrades pages
to S-COMA mode when their refetch threshold is exceeded, even if it must evict another hot page
to do so. When memory pressure is high, and the number of hot pages exceeds the number of free
pages available for caching them, this behavior results in frequent expensive page remappings for
little value. This leads to performance worse than CC-NUMA, which never remaps pages.
VC-NUMA [12] treats its RAC as a victim cache for the processor cache(s), i.e., only remote
data evicted from the processor cache(s) is placed in its RAC. VC-NUMA reduces memory overhead
by using the victim cache tags and page indices to identify the relocation candidates, instead of
maintaining multiple refetch counters per page in the directory controller as in R-NUMA. However,
this solution requires significant modifications to the processor cache controller and bus protocol,
changes that are not feasible in systems built from commodity nodes. The designers of VC-NUMA
noticed the tendency of hybrid models to thrash at high memory pressure and suggested a thrashing
detection scheme to address the problem. Their scheme requires a local refetch counter per S-COMA
page, a programmable break even number that depends on the network latency and overhead of
relocating pages, and an evaluation threshold that depends on the total number of free S-COMA
pages in the page cache. Although VC-NUMA frequently outperforms R-NUMA, the study did not
isolate the benefit of the thrashing detection scheme from that of the integrated victim cache. Thus,
the effectiveness of their thrashing detection scheme under different architecture configurations was
not measured, and the necessity of the extra hardware support was not clearly justified.
In these hybrid models, the conflict miss cost is represented by $(N_{pagecache} \times T_{pagecache}) + (N_{remote} \times
T_{remote}) + (N_{cold} \times T_{remote}) + T_{overhead}$. $N_{pagecache}$ and $N_{remote}$ closely depend on the relocation
mechanisms. Remappings between CC-NUMA and S-COMA modes account for the increased cold
miss rate ($N_{cold}$), as described earlier. $T_{overhead}$ is the software overhead required for the kernel to
handle interrupts, flush pages, and remap pages.
When there are plentiful free local pages, the difference between the hybrid models and
S-COMA is that S-COMA does not suffer from as many initial conflict misses, nor does it pay for
page remapping. In such a case, the relative costs between the two models can be represented as:
$$N_{remote.hybrid} + N_{cold.hybrid} > N_{cold.scoma} \approx 0 \qquad (1)$$
$$T_{overhead.hybrid} > T_{overhead.scoma} \approx 0 \qquad (2)$$
$$N_{pagecache.scoma} \approx N_{pagecache.hybrid} \qquad (3)$$
As the memory pressure increases, R-NUMA and VC-NUMA suffer from the same problems
as pure S-COMA, although to a lesser degree. Even hot pages already in the page cache begin to
be remapped. When this occurs, the local page cache becomes less effective at satisfying conflict
misses, and $N_{remote.hybrid} + N_{cold.hybrid}$ increases. As before, the extra cold misses are induced by the
cache flushes performed during remapping. Also as in S-COMA, as memory pressure approaches
100%, thrashing causes kernel overhead ($T_{overhead.hybrid}$) to become significant. As a result, the
performance of the hybrid models drops dramatically under high memory pressure, albeit not as
dramatically as pure S-COMA. The primary reason that the hybrids’ performance dropoff is less
dramatic is that remappings occur only every N (e.g., 64) remote refetches, not on every remote
access as in S-COMA. In a worst case, the relative cost between the hybrid models and CC-NUMA
under high memory pressure can be represented as:
$$N_{remote.hybrid} + N_{cold.hybrid} > N_{remote.ccnuma} \qquad (4)$$
$$T_{overhead.hybrid} > T_{overhead.ccnuma} \approx 0 \qquad (5)$$
Relations (1), (2) and (3) suggest that one way to improve the hybrid models at low memory
pressure is to accelerate their convergence to S-COMA. Likewise, relations (4) and (5) suggest that
performance can be improved by throttling CC-NUMA <-> S-COMA transitions at high memory
pressure. Unlike S-COMA, in which remapping is required for the architecture to operate correctly,
the hybrid architectures can choose to stop remapping and leave pages in CC-NUMA mode.
In summary, the performance of hybrid S-COMA/CC-NUMA architectures is significantly
influenced by the memory pressure induced by a particular application. Since it is common for users
to run the largest applications they can on their hardware, the performance of an architecture at
high memory pressures is particularly important. Therefore, it is crucial to conduct performance
studies of S-COMA or hybrid architectures across a broad spectrum of memory pressures. An
improved hybrid architecture, motivated by the analysis above, that performs well regardless of
memory pressure is discussed in the following section.
3 Adaptive S-COMA
At low memory pressure, S-COMA outperforms CC-NUMA, but the converse is true at high
memory pressure [16]. Thus, our goal when designing AS-COMA was to develop a memory architecture
that performed like pure S-COMA when memory for page caching was plentiful, and like CC-NUMA
when it was not.
To exploit S-COMA’s superior performance at low memory pressures, AS-COMA initially maps
pages in S-COMA mode. Thus, when memory pressure is low, AS-COMA will suffer no remote
conflict or capacity misses, nor will it pay the high cost of remapping (i.e., cache flushing, page
table remapping, TLB refill, and induced cold misses). Only when the page cache becomes empty
does AS-COMA begin remapping.
Like th e  previous hybrid arch itectu res, A S-C O M A  reac ts  to  increasing m em ory pressure by 
evicting “cold” pages from , and  rem apping  “h o t” pages to , th e  local page cache. However, w ha t 
sets A S-C O M A  a p a r t  from  th e  o th er hybrid a rch itec tu res  is its  ab ility  to  ad a p t to  differing m em ory 
pressures to  fully utilize th e  large page cache a t  low m em ory pressures and to  avoid th rash in g  a t  
high m em ory pressures. I t does so by dynam ically  ad ju stin g  th e  re fe tch  th resh o ld  th a t  triggers 
rem apping, increasing it when it notices th a t  m em ory pressure is high. If th e  refetch th resho ld  is 
too  low, rem appings will occur too  frequently, which leads to  th rash in g . If it is to o  high, rem appings 
th a t  could be usefully m ade will be delayed. By dynam ically  ad ju stin g  th e  refetch th resho ld  based 
on b o th  s ta tic  in fo rm ation  (e.g., th e  cost of re locating  a  page) and  dynam ic in form ation  (e.g., th e  
ra te  of page rem appings), A S-C O M A  is able to  ad a p t sm ooth ly  to  differing m em ory pressures.
AS-COMA uses the kernel’s VM system to detect thrashing, as follows. The kernel maintains
a pool of free local pages that it can use to satisfy allocation or relocation requests. The pageout
daemon attempts to keep the size of this pool between free_target and free_min pages. Whenever
the size of the free page pool falls below free_min pages, the pageout daemon attempts to evict
enough “cold” pages to refill the free page pool to free_target pages. Only S-COMA pages are
considered for replacement. To replace a page, its valid blocks are flushed from the processor
cache, and then its corresponding global virtual address is remapped to its home physical address.
Cold pages are detected using a second chance algorithm: the TLB reference bit associated with
each S-COMA page is reset each time it is considered for eviction by the pageout daemon. If the
reference bit is zero when the pageout daemon next runs, the page is considered cold.
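
As a rough illustration of the second chance test described above, the following sketch models one
pass of the pageout daemon over the S-COMA pages. The `Page` class, its `referenced` field (standing
in for the TLB reference bit), and the function name are hypothetical and are not taken from the
simulated kernel.

```python
from dataclasses import dataclass


@dataclass
class Page:
    vpn: int                  # hypothetical page identifier
    referenced: bool = False  # stands in for the TLB reference bit


def scan_for_cold_pages(scoma_pages):
    """One pageout-daemon pass over the S-COMA pages.

    A page whose reference bit is still clear since the previous pass is
    reported as cold. The bit of every page considered is then reset, so
    a page must be touched again before the next pass to stay "hot".
    """
    cold = [p for p in scoma_pages if not p.referenced]
    for p in scoma_pages:
        p.referenced = False
    return cold


pages = [Page(0), Page(1), Page(2)]
pages[1].referenced = True           # page 1 was touched before the first pass
first = scan_for_cold_pages(pages)   # pages 0 and 2 are cold
pages[2].referenced = True           # page 2 is touched before the second pass
second = scan_for_cold_pages(pages)  # pages 0 and 1 are cold
```

In this sketch a page that is touched between two passes survives the second pass, which is the
"second chance" the algorithm's name refers to.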
Under low to moderate memory pressure, allocation or relocation requests can be performed
immediately because there will be pages in the free page pool. However, at heavy memory pressure,
the pageout daemon will be unable to find sufficient cold pages to refill the free page pool. Whenever
the pageout daemon is unable to reclaim at least free_target free pages, AS-COMA begins allocating
pages in CC-NUMA mode under the assumption that local memory cannot accommodate the
application’s entire working set. In addition, it raises the refetch threshold by a fixed amount to
reduce the rate at which “equally-hot” pages in the page cache replace each other. It also increases
the time between successive invocations of the pageout daemon. Should the number of hot pages
drop, e.g., because of a phase change in the program that causes a number of hot pages to grow
cold, the pageout daemon will detect it by detecting an increase in the number of cold pages. At
this point, it can reduce the refetch threshold.
Using this backoff scheme, the rate at which destructive flushing and remapping occurs is
decreased, as is the number of cold misses induced by remapping. In addition, the frequency at
which the pageout daemon is invoked is reduced, which eliminates context switches and pageout
daemon execution time. Overall, we found this back pressure on the replacement mechanism to
be extremely important. As will be shown in Section 5, it alleviates the performance slowdowns
experienced by R-NUMA or VC-NUMA when memory pressure is high.
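
The backoff scheme can be sketched as a small state update applied after each pageout daemon pass.
This is only an illustration under stated assumptions: the initial threshold of 32 and the step of 8
are the values used in our experiments (Section 4.1), but the class and function names, the doubling
of the daemon interval, and the rule for lowering the threshold again are hypothetical choices made
for the example.

```python
from dataclasses import dataclass


@dataclass
class RemapPolicy:
    refetch_threshold: int = 32  # initial threshold used in the experiments
    threshold_step: int = 8      # increment used in the experiments
    daemon_interval: int = 1     # hypothetical unit: ticks between daemon runs
    allocate_scoma: bool = True  # new pages are mapped in S-COMA mode while True


def after_pageout_pass(policy, free_pages, free_target, floor=32):
    """Adjust the policy once the pageout daemon has finished a pass."""
    if free_pages < free_target:
        # Not enough cold pages could be reclaimed: treat this as thrashing.
        policy.allocate_scoma = False                      # fall back to CC-NUMA mapping
        policy.refetch_threshold += policy.threshold_step  # make remapping rarer
        policy.daemon_interval *= 2                        # assumption: double the interval
    else:
        # Cold pages are plentiful again, so remapping may be made easier.
        # Assumption: step back down by the same amount, never below `floor`.
        policy.refetch_threshold = max(
            floor, policy.refetch_threshold - policy.threshold_step
        )
    return policy


policy = RemapPolicy()
after_pageout_pass(policy, free_pages=3, free_target=7)  # threshold 40, interval 2
after_pageout_pass(policy, free_pages=2, free_target=7)  # threshold 48, interval 4
after_pageout_pass(policy, free_pages=9, free_target=7)  # threshold back to 40
```

Repeated failed passes keep raising the threshold, which is how remapping can be throttled until it
effectively stops.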
AS-COMA’s conflict miss cost is identical to S-COMA’s when there are enough local free pages
to accommodate the application’s working set. In such cases, the remote refetch cost of AS-COMA
will be close to ($N_{pagecache} \times T_{pagecache}$). Until memory pressure gets high, $N_{rem}$
will grow slowly. Eventually the page cache will no longer be large enough to hold all hot pages.
Ideally AS-COMA’s performance would simply degrade smoothly to that of CC-NUMA,
($N_{rem} \times T_{rem}$), as memory pressure approaches 100%. Realizable AS-COMA models will
fare somewhat worse due to the extra kernel overhead incurred before the system stabilizes.
Nevertheless, AS-COMA is able to converge rapidly to either S-COMA or CC-NUMA mode, depending on
the memory pressure.
4 Performance Evaluation
4.1 Experimental Setup
All experiments were performed using an execution-driven simulation of the HP PA-RISC
architecture called Paint (PA-interpreter) [17, 19]. Paint was derived from the Mint simulator [20]. Our
simulation environment includes detailed simulation modules for a first level cache, system bus,
memory controller, network interconnect, and DSM engine. It provides a multiprogrammed
process model with support for operating system code, so the effects of OS/user code interactions are
modeled. The simulation environment includes a kernel based on 4.4BSD that provides scheduling,
interrupt handling, memory management, and limited system call capabilities. The modeled
physical page size is 4 kilobytes. The VM system was modified to provide the page translation,
allocation, and replacement support needed by the various distributed shared memory models. All
three hybrid architectures we study adopt BSD4.4’s page allocation mechanism and paging
policy [10] with minor modifications. Free_min and free_target (see Section 3) were set to 5% and 7%
of total memory, respectively. We extended the first touch allocation algorithm [9] to distribute
home pages equally to nodes by limiting the number of home pages that are allocated at each node
to a proportional share of the total number of pages. Once this limit is reached, remaining pages
are allocated in a round robin fashion to nodes that have not reached the limit.
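
The extended first touch policy can be sketched as follows. The class and method names are
hypothetical, and rounding the proportional share up to a whole page is an assumption made for the
example; only the two-step rule (first touch until the per-node limit, then round robin among nodes
below the limit) comes from the description above.

```python
class HomeAllocator:
    """First-touch home allocation with a per-node cap and round-robin overflow."""

    def __init__(self, num_nodes, total_pages):
        self.counts = [0] * num_nodes
        # Assumption: the proportional share is rounded up to a whole page.
        self.limit = -(-total_pages // num_nodes)
        self.cursor = 0  # next node to try when the toucher is already full

    def allocate(self, toucher):
        """Return the node that becomes home for a page first touched by `toucher`."""
        if self.counts[toucher] < self.limit:
            node = toucher
        else:
            n = len(self.counts)
            for i in range(n):
                candidate = (self.cursor + i) % n
                if self.counts[candidate] < self.limit:
                    node = candidate
                    self.cursor = (candidate + 1) % n
                    break
            else:
                raise RuntimeError("all nodes are at their home-page limit")
        self.counts[node] += 1
        return node


alloc = HomeAllocator(num_nodes=4, total_pages=8)  # limit is 2 pages per node
homes = [alloc.allocate(toucher=0) for _ in range(6)]
print(homes)  # [0, 0, 1, 2, 3, 1]
```

Node 0 keeps its first two pages; the remaining pages it touches are spread over the other nodes
in turn.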
The modeled processor and DSM engine are clocked at 120MHz. The system bus modeled is
HP’s Runway bus, which is also clocked at 120MHz. All cycle counts reported herein are with
respect to this clock. The characteristics of the L1 cache, RACs, and network that we modeled are
shown in Table 3.

For most of the SPLASH2 applications we studied, the data sets provided have a primary
working set that fits in an 8-kbyte cache [22]. We, therefore, model a single 8-kilobyte
direct-mapped processor cache to compensate for the small size of the data sets, which is consistent with
previous studies of hybrid architectures […, 12].

We model a 4-bank main memory controller that can supply data from local memory in 58
cycles. The size of main memory and the amount of free memory used for page caching was varied
to test the different models under varying memory pressures.

We modeled a sequentially-consistent write-invalidate consistency protocol. DSM data is moved
in 128-byte (4-line) chunks to amortize the cost of remote communication and reduce the memory
overhead of directory state information. As part of a remote memory access, the DSM engine writes
the received data back to the RAC or main memory as appropriate. Our CC-NUMA and hybrid
models are not “pure,” as we employ a 128-byte RAC containing the last remote data received as
part of performing a 4-line fetch. This minor optimization had a larger impact on performance
than we had anticipated, as is described in the next section. We do not consider different RAC
configurations in the hybrid architectures for this study.

Component   Characteristics
L1 Cache    Size: 8 kilobytes. 32-byte lines, direct-mapped, virtually indexed, physically tagged,
            non-blocking, up to one outstanding miss, write back, 1-cycle hit latency
RAC         128-byte lines, direct-mapped, non-inclusive, non-blocking, up to one outstanding miss
Networks    1-cycle propagation, 2x2 switch topology, port contention (only) modeled.
            Fall-through delay: 4 cycles (ratio of remote to local memory access latencies: 3:1)

Table 3 Cache and Network Characteristics

An initial relocation threshold of 32, the number of remote refetches required to initiate a page
remapping, is used in all three hybrid architectures. The relocation thresholds were incremented by
8 whenever thrashing was detected by AS-COMA’s software scheme or by VC-NUMA’s hardware scheme;
R-NUMA does not employ a backoff scheme. VC-NUMA uses a breakeven number of 16 for its thrashing
detection mechanism. We did not simulate VC-NUMA’s victim-cache behavior, because we considered the
use of non-commodity processors or busses to be beyond the scope of this study. Thus, the results
reported for VC-NUMA are only relevant for evaluating its relocation strategy, and not the value of
treating the page cache as a victim cache [12].
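
To make the role of the relocation threshold concrete, the sketch below counts remote refetches per
page and signals a remapping once the count reaches the threshold. The class name, the dictionary of
counters, and the decision to reset a page's count after it triggers are assumptions for
illustration; they do not describe how the simulated DSM engine stores its counters.

```python
from collections import defaultdict


class RefetchCounters:
    """Per-page remote refetch counts compared against a relocation threshold."""

    def __init__(self, threshold=32):
        self.threshold = threshold      # 32 is the initial value in the experiments
        self.counts = defaultdict(int)  # hypothetical storage: page id -> refetches

    def record_refetch(self, page):
        """Count one remote refetch; return True if the page should be remapped."""
        self.counts[page] += 1
        if self.counts[page] >= self.threshold:
            self.counts[page] = 0  # assumption: counting restarts after a trigger
            return True
        return False

    def back_off(self, step=8):
        """Raise the threshold after thrashing is detected (step of 8 in the experiments)."""
        self.threshold += step


counters = RefetchCounters()
triggers = [counters.record_refetch(page=7) for _ in range(32)]
# Only the 32nd refetch of page 7 requests a remapping.
counters.back_off()
# After backing off, a page needs 40 refetches before it is remapped.
```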
Finally, Table 4 shows the minimum latency required to satisfy a load or store from various
locations in the global memory hierarchy. The average latency in our simulation is considerably
higher than this minimum because of contention for various resources (bus, memory banks,
networks, etc.), which we accurately model. The remote to local memory access ratio is about 3:1.
Note that our network model only accounts for input port contention.
4.2 Benchmark Programs
We used six programs to conduct our study: barnes, fft, lu, ocean, and radix from the SPLASH-2
benchmark suite [22] and em3d from a shared memory implementation of the Split-C benchmark [4,
3]. Table 5 shows the inputs used for each test program. The column labeled Home pages indicates
the number of shared data pages initially allocated at each node. These numbers indicate that
each node manages from 0.5 megabytes (barnes) to 2 megabytes (lu, em3d, and ocean) of home
data.
The Maximum remote pages column indicates the maximum number of remote pages that are
accessed by a node for each application, which gives an indication of the size of the application’s
global working set. The Ideal pressure column is the memory pressure below which S-COMA
and AS-COMA machines act like a “perfect” S-COMA, meaning that every node has enough free
memory to cache all remote pages that it will ever access. Below this memory pressure, S-COMA
and AS-COMA never experience a conflict miss to remote data, nor will they suffer any kernel or
page daemon overhead to remap pages.

Due to its small default problem size and long execution time, lu was run on just 4 nodes; all
other applications were run on 8 nodes.

Data Location    Latency
L1 Cache         1 cycle
Local Memory     58 cycles
RAC              23 cycles
Remote Memory    147 cycles

Table 4 Minimum Access Latency

5 Results
Figures 2 and 3 show the performance of CC-NUMA, S-COMA, and three hybrid CC-NUMA/S-COMA
architectures (AS-COMA, VC-NUMA, R-NUMA) on the six applications. The left column
in each figure displays the execution time of the various architectures relative to CC-NUMA, and
indicates where this time was spent by each program¹. The right column in each figure displays
where cache misses to shared data were satisfied². Note that for readability, these graphs are
adjusted to focus on the remote data accesses, and thus the origin of the Y-axis is non-zero. We
simulated the applications across a range of memory pressures between 10% and 90%. Only one
result is shown for CC-NUMA, since it is not affected by memory pressure. As can be seen in the
graphs, the relative performance of the different architectures can vary dramatically as memory
pressures change. All results include only the parallel phase of the various programs.






Program  Input                                        Home pages  Maximum remote pages  Ideal pressure (%)
barnes   16K particles                                102         552                   16
em3d     40K nodes, 15% remote, 20 iters              491         778                   39
FFT      256K points, tuned for cache sizes           390         1254                  24
LU       1024x1024 matrix, 16x16 blocks, contiguous   514         405                   56
ocean    258x258 ocean                                473         356                   57
radix    1M keys, radix = 1024                        259         1306                  17

Table 5 Programs and Problem Sizes Used in Experiments
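
The Ideal pressure values in Table 5 can be cross-checked with a few lines of arithmetic. The sketch
below assumes that memory pressure is the fraction of a node's memory occupied by its home pages,
so that the ideal pressure is reached when the remaining memory exactly holds the maximum number of
remote pages. This definition is an assumption inferred from the table values, and the function
name is hypothetical.

```python
# (home pages, maximum remote pages, reported ideal pressure in %) from Table 5
TABLE5 = {
    "barnes": (102, 552, 16),
    "em3d": (491, 778, 39),
    "fft": (390, 1254, 24),
    "lu": (514, 405, 56),
    "ocean": (473, 356, 57),
    "radix": (259, 1306, 17),
}


def ideal_pressure(home_pages, max_remote_pages):
    """Pressure (%) at which free memory just fits every remote page ever accessed."""
    return 100.0 * home_pages / (home_pages + max_remote_pages)


for name, (home, remote, reported) in TABLE5.items():
    print(f"{name:7s} computed {ideal_pressure(home, remote):5.1f}%  reported {reported}%")
```

Under this assumption the computed values round to the reported percentage for all six programs.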
¹ U-SH-MEM: stalled on shared memory. K-BASE: performing essential kernel operations (i.e., those required by
all architectures). K-OVERHD: performing architecture-specific kernel operations, such as remapping pages and
handling relocation interrupts. U-INSTR and U-LC-MEM: performing user-level instructions or non-shared memory
operations. SYNC: performing synchronization operations.
² HOME: the local node is the data’s home, so it is supplied from local DRAM. S-COMA: misses satisfied from
the local page cache. RAC: misses satisfied from the local RAC. COLD: cold misses satisfied on a remote node,
including both essential cold misses and cold misses induced by remapping. CONF/CAPC: conflict/capacity misses
not satisfied locally but that instead result in remote accesses.
[Figure 2 charts not reproduced. Left panel legend: SYNC, U-LC-MEM, U-INSTR, K-OVERHD, K-BASE,
U-SH-MEM. Right panel legend: CONF/CAPC, COLD, RAC, SCOMA.]
Figure 2 Performance Charts for barnes, em3d and fft. (Left: Relative Execution
Time. Right: Where Misses Were Satisfied)




Figure 3 Performance Charts for lu, ocean, and radix. (Left: Relative Execution
Time. Right: Where Misses Were Satisfied)




5.1 Initial Allocation Schemes
We will first focus on the effect of the initial allocation policies. Recall from Table 5 that the “ideal”
memory pressure for the six applications ranged from 16% to 57%. Below this memory pressure,
the local page cache is large enough to store the entire working set of a node. To isolate the impact
of initially allocating pages in S-COMA, we simulated S-COMA and the hybrid architectures at
a memory pressure of 10%, when no page remappings beyond any initial ones will occur. Table 6
shows the percentage of remote pages that are refetched at least 32 times, and thus will be remapped
from CC-NUMA to S-COMA mode in R-NUMA or VC-NUMA, as a fraction of the total number of remote
pages accessed. This percentage exhibits a broad range, from under 1% in fft to over 95% in lu
and radix.
First, to illustrate the importance of employing a hybrid memory architecture over a vanilla
CC-NUMA architecture, examine their relative results at 10% memory pressure, in Figures 2 and
3. Under these circumstances, AS-COMA, like S-COMA, outperforms CC-NUMA by 20-35% for
four of the applications (lu, radix, barnes, and em3d). Looking at the hybrid architectures in
isolation, we can see that for radix, AS-COMA outperforms R-NUMA and VC-NUMA by 17%.
In radix, the percentage and total number of remote pages that need to be remapped are both
quite high, 98% and 10236 respectively. In the other applications, the initial page allocation policy
had little impact on performance. There is no strong correlation between the number of pages that
need to be remapped and performance. We can observe a 5% performance benefit in lu, where
the percentage of relocated remote pages is very high (99%), but the total number is fairly small
(1606).
There are two primary reasons why the initial allocation policy did not have a stronger impact
on performance. First, our interrupt and relocation operations are highly optimized, requiring only
2000 and 6000 cycles, respectively, to perform. Thus, the impact of the unnecessary remappings
and flushes is overwhelmed by other factors. Second, as an artifact of our experimental setup,
the initial remappings for several applications were not included in the performance results, as
they took place before the parallel phase when our measurements are taken. This was the case
for barnes and em3d. The final two applications, fft and ocean, only access a small number of
remote pages enough times to warrant remapping, and thus the impact of initially mapping pages
in S-COMA mode is negligible.

Program  Total Remote Pages  Relocated Pages  % of Relocated Pages
barnes   4416                3498             80%
em3d     6224                1868             29%
FFT      10032               5                0.05%
LU       1620                1606             99%
ocean    2848                569              20%
radix    10448               10236            98%

Table 6 Number of Remote Pages Ever Accessed versus Conflicted Frequently

In summary, if memory pressure is low and local pages for replication are abundant, an
S-COMA-preferred initial allocation policy can improve the performance of hybrid architectures
by accelerating their convergence to pure S-COMA behavior. However, the performance boost is
modest.
5.2 Thrashing Detection and Backoff Schemes
The performance of hybrid DSM architectures depends heavily on the memory pressure.
Performance seriously degrades when the page cache cannot hold all “hot” pages and those pages start
to evict one another. Intuitively, when this begins to occur, the memory system should simply
treat the page cache as a place to store a reasonable set of hot pages, and stop trying to fine tune
its contents since this tuning adds significant overhead. Previous studies have not considered the
kernel overhead ($T_{overhead}$), but we found it to be very significant at high memory pressures. Once
the page cache holds only hot pages, further attempts to refine its contents lead to thrashing, which
involves unnecessary flushing of hot data, cache flushes, and induced cold misses. Since one hot
page is replacing another, the benefit of this remapping is likely to be minimal compared to the
cost of the remapping itself. As a result, the performance of a hybrid architecture will quickly drop
below that of CC-NUMA if a mechanism is not put in place to avoid thrashing. As described in
Section 3, the pageout daemon in AS-COMA detects thrashing when it cannot find cold pages to
replace, at which point it reduces the rate of page remappings, going so far as to stop them completely if
necessary. As can be seen in Figures 2 and 3, this can lead to significant performance improvements
compared to R-NUMA and VC-NUMA under heavy memory pressure.
We can divide the six applications into two groups: (i) applications where there are sufficient
remote conflict misses that handling thrashing effectively can lead to large performance gains (barnes,
em3d, and radix), and (ii) applications in which minimal efforts to avoid thrashing are sufficient
for handling high memory pressure (fft, ocean, and lu).
The behavior of em3d shows the danger of focusing solely on reducing remote conflict misses
when designing a memory architecture. As shown in Figure 2, the performance of em3d on the
hybrid architectures is quite sensitive to memory pressure. R-NUMA outperforms CC-NUMA
until memory pressure approaches 70%, after which time its performance drops quickly. CC-NUMA
outperforms R-NUMA by 5% at 70% memory pressure and by 50% at 90%. Looking at the detailed
breakdown of where time is spent, we can see that increasing kernel overhead is the culprit. In em3d,
approximately 29% of remote pages, i.e., 230 pages, are eligible for relocation (see Table 6), but at
70% memory pressure there are only 210 free local pages. It turns out that for em3d, most of the
remote pages ever accessed are in the node’s working set, i.e., they are “hot” pages. Thus, above
70% memory pressure, R-NUMA begins to thrash and its performance degrades badly. Looking at
the right column of Figure 2, we can see that this performance dropoff occurs even though there
are significantly fewer remote conflict misses (CONF/CAPC) in R-NUMA than in CC-NUMA or
AS-COMA. The cost of constantly remapping pages between CC-NUMA and S-COMA mode and
the increase in remote cold misses overwhelms the benefit of the reduced number of remote conflict
misses. This behavior emphasizes the importance of detecting thrashing and reducing the rate of
remappings when it occurs.
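
The page counts quoted above for em3d can be reproduced with a short calculation. The sketch
assumes that memory pressure is the fraction of a node's memory occupied by its home pages, so the
free local pages at a given pressure are what remains of the node's memory. That definition is an
assumption, the function name is hypothetical, and the 491 home pages and 230 hot pages are the
figures from Table 5 and the text.

```python
def free_local_pages(home_pages, pressure_pct):
    """Free pages per node when home pages occupy `pressure_pct` percent of memory."""
    total_pages = home_pages * 100 / pressure_pct
    return round(total_pages - home_pages)


EM3D_HOME_PAGES = 491  # Table 5
EM3D_HOT_PAGES = 230   # pages eligible for relocation, as quoted in the text

for pct in (50, 60, 70, 80, 90):
    free = free_local_pages(EM3D_HOME_PAGES, pct)
    status = "hot pages fit" if free >= EM3D_HOT_PAGES else "hot pages do not fit"
    print(f"{pct}% pressure: {free} free pages, {status}")
```

Under this assumption the page cache stops covering the hot pages at 70% memory pressure (210 free
pages against 230 hot pages), which matches the point at which R-NUMA starts to thrash.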
Recognizing this problem, VC-NUMA uses extra hardware to detect thrashing. However, its
mechanism is not as effective as AS-COMA’s. VC-NUMA starts to underperform CC-NUMA at
the same memory pressure that R-NUMA does, 70%. While VC-NUMA outperforms R-NUMA by
22% at 90% memory pressure, it underperforms CC-NUMA by 27% and AS-COMA by 31%. In
contrast, AS-COMA outperforms CC-NUMA even at 90% memory pressure, when the other hybrid
architectures are thrashing. It does so by dynamically turning off relocation as it determines that
this relocation has no benefits because it is simply replacing hot pages with other hot pages. This
results in more remote conflict/capacity misses than the other hybrid architectures, but it reduces
the number of cold misses caused by flushing pages during remapping and the kernel overhead
associated with handling interrupts and remapping. As a result, AS-COMA outperforms
VC-NUMA by 31% and R-NUMA by 53% at 90% memory pressure. Moreover, despite having only
a small page cache available to it and a remote working set larger than this cache, AS-COMA
outperforms CC-NUMA.
Barnes exhibits very high spatial locality. It accesses large dense regions of remote memory, and
thus can make good use of a local S-COMA page cache³. As shown in Table 5, barnes’s ideal
memory pressure is 16%. Like em3d, most of the remote pages that are accessed are part of the
working set and “hot” for long periods of execution. We observed that thrashing begins to occur
at 50% memory pressure. As in em3d, R-NUMA reduces the number of remote conflict/capacity
misses at high memory pressures, at the cost of increasing kernel overhead and remote cold misses.
As a result, it is able to outperform CC-NUMA at low memory pressure, but is only able to break
even by the time memory pressure reaches 70%. Similarly, VC-NUMA’s backoff mechanism is not
sufficiently aggressive at moderate memory pressures to stop the increase in kernel overhead or cold
misses. In particular, VC-NUMA only checks its backoff indicator when an average of two
replacements per cached page have occurred, which is not sufficiently often to avoid thrashing. As shown
in the previous study [12], VC-NUMA does not significantly outperform R-NUMA until memory
pressure exceeds 87.5%. Once again, AS-COMA’s adaptive replacement algorithm detects
thrashing as soon as it starts to occur, and the resulting backoff mechanism causes performance to degrade
only slightly as memory pressure increases. As a result, it consistently outperforms CC-NUMA by
20% across all ranges of memory pressures, and outperforms the other hybrid architectures by a
similar margin at high memory pressures.

³ Note that barnes is very compute-intensive, and a problem size that can be simulated in a reasonable amount of
time requires only approximately 100 home pages per node of data. Since there are only about 50 free pages per
node available for page replication at 70% memory pressure, we did not simulate barnes at higher memory pressures
since the results would be heavily skewed by small sample size effects.

Unlike barnes, radix exhibits almost no spatial locality. Every node accesses every page of
shared data at some time during execution. As such, it is an extreme example of an application
where fine tuning of the S-COMA page cache will backfire: each page is roughly as “hot” as any
other, so the page cache should simply be loaded with some reasonable set of “hot” pages and
left alone. With an ideal memory pressure of 17% and low spatial locality, the performance of
pure S-COMA is 6.7 times worse than CC-NUMA’s at memory pressures as low as 30%. Although
the performance of both R-NUMA and VC-NUMA is significantly more stable than that of
S-COMA, they too suffer from thrashing by the time memory pressure reaches 70%. The source of
this performance degradation is the same as in em3d and barnes: increasing kernel overhead and
(to a lesser degree) induced cold misses. Once again, R-NUMA induces fewer remote accesses than
CC-NUMA, but the kernel overhead required to support page relocation is such that R-NUMA
underperforms CC-NUMA by 75% at 70% memory pressure and by almost a factor of two at 90%
memory pressure. Once again, VC-NUMA’s backoff algorithm proves to be more effective than
R-NUMA’s, but it still underperforms CC-NUMA by roughly 40% at high memory pressures.
AS-COMA, on the other hand, deposits a reasonable subset of “hot” pages into the page cache and then
backs off from replacing further pages once it detects thrashing. As a result, even for a program
with almost no spatial locality, AS-COMA is able to converge to CC-NUMA-like performance (or
better) across all memory pressures. At 90% memory pressure, AS-COMA outperforms VC-NUMA
by 35% and R-NUMA by 90%, and it remains within 5% of CC-NUMA’s
performance. The slight degradation compared to CC-NUMA is due to the short period of thrashing
that occurs before AS-COMA can detect it and completely stop relocations.
Applications in the second category (fft, ocean, and lu) exhibit good page-grained locality.
All three applications only have a small set of “hot” pages, which can be easily replicated using
a small page cache, or references to remote pages are so localized that the small (128-byte) RAC
in our simulation was able to satisfy a high percentage of remote accesses. As a result, thrashing
never occurs and the various backoff schemes are not invoked. Thus, the performance of the three
hybrid algorithms is almost identical.
The performance results for fft and ocean are almost identical, albeit for different reasons.
For these applications, all of the architectures performed equally well, except for pure S-COMA,
which performs poorly at high memory pressures. As can be seen in Table 6, only a tiny fraction
of pages in fft are accessed enough to be eligible for relocation, so all of the hybrid architectures
effectively become CC-NUMAs. S-COMA must maintain inclusion between the processor cache
and the page cache, so kernel overhead due to thrashing occurs at 90% memory pressure, which
causes performance to drop significantly. Somewhat surprisingly, fft has such high spatial locality
in its references to remote memory that the 128-byte RAC plays a major role in satisfying remote
accesses locally. The reason that performance is stable across all memory pressures in ocean can
be seen in the right hand graph of Figure 3. Even at 90% memory pressure, only 3% of cache
misses are to remote data, and most such accesses can be supplied from a local S-COMA page or
the RAC. As a result, all of the architectures other than pure S-COMA, which suffers the same
problem as in fft, perform within 3% of one another.
Finally, in lu, each process accesses every remote page enough times to warrant remapping (see
Table 6), similar to radix. However, every process uses each set of shared pages in the problem
set for only a short time before moving to another set of pages. Thus, unlike radix, only a small
set of remote pages are active at any time, and a small page cache can hold each process’s active
working set completely. So, while 7% of CC-NUMA’s cache misses must be satisfied by remote
nodes, practically all cache misses are satisfied locally in the other architectures. As a result,
all of the hybrid architectures outperform CC-NUMA by approximately 33% across all memory
pressures. Even pure S-COMA outperforms CC-NUMA at a 90% memory pressure, although its
overall performance is 15% worse than the hybrid architectures because of load imbalance.
In summary, for applications that do not suffer frequent remote cache misses or for which the
active working set of remote pages is small at any given time, all of the hybrid architectures perform
quite well, often outperforming CC-NUMA. However, for applications with less spatial locality or
larger working sets, the more aggressive remapping backoff mechanism used by AS-COMA is crucial
to achieving good performance. In such applications, AS-COMA outperformed the other hybrid
architectures by 20% to 90%, and either outperformed or broke even with CC-NUMA even at
extreme memory pressures. Given programmers’ desire to run the largest problem size that they
can on their machines, this stability of AS-COMA at high memory pressures could prove to be an
important factor in getting hybrid architectures adopted.
6 Conclusions
The performance of hardware distributed shared memory is governed by three factors: (i) remote
memory latency, (ii) the number of remote misses, and (iii) the software overhead of managing
the memory hierarchy. In this paper, we evaluated the performance of five DSM architectures
(CC-NUMA, S-COMA, R-NUMA, VC-NUMA, and AS-COMA) with special attention to the third
factor, system software overhead. Furthermore, since users of SMPs tend to run the largest
applications possible on their hardware, we paid special attention to how well each architecture
performed under high memory pressure.
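The interaction of these three factors can be sketched with a first-order cost model. This is only an illustration under stated assumptions: the function name, the additive form, and every number below are hypothetical and are not measurements from this study.

```python
def remote_overhead_cycles(remote_misses, remote_latency, remaps, remap_cost):
    """First-order estimate: time spent on remote misses plus time spent remapping pages."""
    return remote_misses * remote_latency + remaps * remap_cost


# All inputs are made-up values chosen only to show the trade-off.
REMOTE_LATENCY = 100   # cycles per remote miss (assumed)
REMAP_COST = 5000      # cycles of kernel work per page remapping (assumed)

no_remapping = remote_overhead_cycles(1000, REMOTE_LATENCY, 0, REMAP_COST)
thrashing = remote_overhead_cycles(600, REMOTE_LATENCY, 200, REMAP_COST)
backed_off = remote_overhead_cycles(400, REMOTE_LATENCY, 10, REMAP_COST)
print(no_remapping, thrashing, backed_off)
```

Under these assumed inputs, remapping pays off only when the remote misses it removes outweigh the software cost of the remappings themselves, which is why the third factor matters most once the page cache starts to thrash.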
We found that at low memory pressure, architectures that were most aggressive about mapping
remote pages into the local page cache (S-COMA and AS-COMA) performed best. In our study,
S-COMA and AS-COMA outperformed the other architectures by up to 17% at low memory pressures.
As memory pressure increased, however, it became increasingly important to reduce the rate at
which remote pages were remapped into the local page cache. S-COMA's performance usually
dropped dramatically at high memory pressures. The performance of VC-NUMA and R-NUMA
also dropped at high memory pressures, albeit not as severely as S-COMA, due to thrashing. This
thrashing phenomenon has been largely ignored in previous studies, but we found that it had a
significant impact on performance, especially at the high memory pressures likely to be preferred
by power users.
In contrast, AS-COMA's software-based scheme to detect thrashing and reduce the rate of page
remappings caused it to outperform VC-NUMA and R-NUMA by up to 90% at high memory
pressures. AS-COMA is able to fully utilize even a small page cache by mapping a subset of
"hot" pages locally, and then backing off further remapping. This mechanism caused AS-COMA to
outperform even CC-NUMA in five out of the six applications we studied, and only underperform
CC-NUMA by 5% in the sixth.
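The general idea of threshold-based remapping with backoff can be sketched as follows. This is a minimal toy model, not the AS-COMA implementation: the class name, the FIFO replacement, the "a previously evicted page is hot again" thrashing test, and the threshold-doubling rule are all assumptions made for illustration.

```python
class BackoffPageCache:
    """Toy page cache: pages start in CC-NUMA mode and are remapped locally when hot."""

    def __init__(self, capacity, threshold=2, backoff=True):
        self.capacity = capacity    # local pages available for S-COMA-style mappings
        self.threshold = threshold  # remote misses needed before a page is remapped
        self.backoff = backoff      # whether thrashing raises the threshold
        self.local = []             # locally mapped pages, oldest first (FIFO, assumed)
        self.refetches = {}         # remote misses per page since its last remapping
        self.evicted = set()        # pages that were once mapped locally and evicted
        self.remaps = 0             # number of remappings performed (kernel overhead)

    def access(self, page):
        """Service one cache miss to `page`; return where it was satisfied."""
        if page in self.local:
            return "local"
        self.refetches[page] = self.refetches.get(page, 0) + 1
        if self.refetches[page] >= self.threshold:
            self._remap(page)
        return "remote"

    def _remap(self, page):
        self.refetches[page] = 0
        if self.backoff and page in self.evicted:
            # A page we evicted is hot again: treat this as thrashing and back off.
            self.threshold *= 2
        if len(self.local) >= self.capacity:
            self.evicted.add(self.local.pop(0))
        self.local.append(page)
        self.remaps += 1


def run(backoff, rounds=100):
    """Cycle over 4 hot pages with room for only 2, a deliberately thrashing workload."""
    cache = BackoffPageCache(capacity=2, backoff=backoff)
    for _ in range(rounds):
        for page in range(4):
            cache.access(page)
    return cache


adaptive = run(backoff=True)
fixed = run(backoff=False)
print(adaptive.remaps, adaptive.threshold, fixed.remaps, fixed.threshold)
```

On this trace the fixed-threshold variant keeps swapping pages in and out for the whole run, whereas the backoff variant settles on a subset of the hot pages and then performs far fewer remappings.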
Consequently, we believe that hybrid CC-NUMA/S-COMA architectures can be made to perform
effectively at all ranges of memory pressures. At low memory pressures, aggressive use of
available DRAM can eliminate most remote conflict misses. At high memory pressures, reducing
the rate of page remappings and keeping only a subset of "hot" pages in the small local page cache
can lead to performance close to or better than CC-NUMA. To achieve this level of performance,
the overhead of system software must be carefully considered, and careful attention must be given to
avoiding needless system overhead. AS-COMA achieves these goals.
References
[1] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith. The Tera computer
system. In Proceedings of the 1990 International Conference on Supercomputing, pages 1-6, September
1990.
[2] W.J. Bolosky, R.P. Fitzgerald, and M.L. Scott. Simple but effective techniques for NUMA memory
management. In Proceedings of the 12th ACM Symposium on Operating Systems Principles, pages
19-31, December 1989.
[3] S. Chandra, J.R. Larus, and A. Rogers. Where is time spent in message-passing and shared-memory
programs? In Proceedings of the 6th Symposium on Architectural Support for Programming Languages
and Operating Systems, pages 61-73, October 1994.
[4] D. E. Culler, A. Dusseau, S. C. Goldstein, A. Krishnamurthy, S. Lumetta, T. von Eicken, and K. Yelick.
Parallel programming in Split-C. In Proceedings of Supercomputing '93, pages 262-273, November 1993.
[5] B. Falsafi and D.A. Wood. Reactive NUMA: A design for unifying S-COMA and CC-NUMA. In
Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 229-240,
June 1997.
[6] J. Laudon and D. Lenoski. The SGI Origin: A ccNUMA highly scalable server. In SIGARCH 97, pages
241-251, June 1997.
[7] D. Lenoski, J. Laudon, K. Gharachorloo, A. Gupta, and J. Hennessy. The directory-based cache
coherence protocol for the DASH multiprocessor. In Proceedings of the 17th Annual International
Symposium on Computer Architecture, pages 148-159, May 1990.
[8] D. Lenoski, J. Laudon, K. Gharachorloo, W.-D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. S.
Lam. The Stanford DASH multiprocessor. IEEE Computer, 25(3):63-79, March 1992.
[9] M. Marchetti, L. Kontothanassis, R. Bianchini, and M.L. Scott. Using simple page placement policies
to reduce the cost of cache fills in coherent shared-memory systems. In Proceedings of the Ninth
ACM/IEEE International Parallel Processing Symposium (IPPS), April 1995.
[10] M.K. McKusick, K. Bostic, M.J. Karels, and J.S. Quarterman. The Design and Implementation of the
4.4BSD Operating System, chapter 5 Memory Management, pages 117-190. Addison-Wesley Publishing
Company Inc, 1996.
[11] MIPS Technologies Inc. MIPS R10000 Microprocessor User's Manual, Version 2.0, December 1996.
[12] A. Moga and M. Dubois. The effectiveness of SRAM network caches in clustered DSMs. In Proceedings
of the Fourth Annual Symposium on High Performance Computer Architecture, 1998.
[13] A. Nowatzyk, G. Aybay, M. Browne, E. Kelly, M. Parkin, B. Radke, and S. Vishin. The S3.mp
scalable shared memory multiprocessor. In Proceedings of the 1995 International Conference on Parallel
Processing, 1995.
[14] S. E. Perl and R.L. Sites. Studies of Windows NT performance using dynamic execution traces. In
Proceedings of the Second Symposium on Operating System Design and Implementation, pages 169-184,
October 1996.
[15] V. Santhanam, E.H. Gornish, and W.-C. Hsu. Data prefetching on the HP PA-8000. In Proceedings of
the 24th Annual International Symposium on Computer Architecture, pages 264-273, June 1997.
[16] A. Saulsbury, T. Wilkinson, J. Carter, and A. Landin. An argument for Simple COMA. In Proceedings
of the First Annual Symposium on High Performance Computer Architecture, pages 276-285, January
1995.
[17] L.B. Stoller, R. Kuramkote, and M.R. Swanson. PAINT - PA instruction set interpreter. Technical
Report UUCS-96-009, University of Utah - Computer Science Department, September 1996.
[18] Sun Microsystems. Ultra Enterprise 10000 System Overview. http://www.sun.com/servers/datacenter/products/starfire.
[19] M. Swanson and L. Stoller. Shared memory as a basis for conservative distributed architectural
simulation. In Parallel and Distributed Simulation (PADS '97), 1997. Submitted for publication.
[20] J.E. Veenstra and R.J. Fowler. Mint: A front end for efficient simulation of shared-memory
multiprocessors. In MASCOTS 1994, January 1994.
[21] B. Verghese, S. Devine, A. Gupta, and M. Rosenblum. Operating system support for improving data
locality on CC-NUMA compute servers. In Proceedings of the 7th Symposium on Architectural Support
for Programming Languages and Operating Systems, October 1996.
[22] S.C. Woo, M. Ohara, E. Torrie, J.P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization
and methodological considerations. In Proceedings of the 22nd Annual International Symposium on
Computer Architecture, pages 24-36, June 1995.
