Software fine-grain distributed shared memory (FGDSM) 
Introduction
Clusters of small-scale symmetric multiprocessors (SMPs) are emerging as a promising approach to building cost-effective large-scale parallel computers. The relatively high volumes of small-scale SMP servers make them extremely cost-effective as building blocks. By connecting these low-cost nodes, system designers hope to construct large-scale parallel machines with better cost-performance than has been previously possible [4].
To preserve application compatibility while maintaining a low system cost, many designers implement software distributed shared memory (DSM) over a network of SMPs. Most software DSMs [7, 15] use standard virtual memory translation mechanisms to maintain coherence at a page granularity (or larger) across SMPs. Transparent page-level coherence, however, can result in poor performance for applications with fine-grain sharing.
Alternatively, some systems implement fine-grain distributed shared memory (FGDSM) which allows for sharing data across the nodes at a cache block (e.g., 32-128 byte) granularity. FGDSMs are particularly attractive for implementing DSM on a network of SMPs because they transparently (i.e., without the involvement of the application programmer) extend the SMP fine-grain shared-memory abstraction across all the nodes.
SMP nodes provide an opportunity to improve performance in software FGDSM [22]. By sharing a single large memory cache for remote data among multiple processors, SMPs improve memory utilization. Processors within a node can directly share memory using fast SMP hardware mechanisms. Multiple processors can overlap computation with protocol handling to reduce execution time. SMP nodes can also reduce remote miss frequency by allowing data fetched by one processor to be used by others.
Sharing a node's resources also comes at a cost. Shared-memory semantics dictate that coherence operations appear to execute atomically [22]. When protocol execution overlaps with computation, coherence checks in the application (on one processor) may execute simultaneously with coherence operations in a protocol handler (on another). While some systems directly support atomic coherence operations in hardware (e.g., Typhoon-0 [20]), others implement these operations in a non-atomic sequence of software instructions (e.g., Blizzard-S [25] or Shasta [23]). Non-atomic coherence operations require additional synchronization and may result in low SMP-node performance.
Contention for resources in an SMP node may also lower performance. Commodity network interface cards are typically placed far from processors on a slow peripheral bus and do not provide support for multiple message queues [6, 8]. As such, frequent network communication using a single pair of message queues on an SMP may result in a bottleneck [12]. Multiplexing computation and protocol execution on processors may also lead to cache interference, lower cache performance, and result in higher memory bus contention.
Besides performance, clustering processors into SMP nodes also impacts the cost trade-off. SMPs typically charge higher price premiums than uniprocessors. However, for a system with a given aggregate number of processors and amount of memory, SMPs substantially reduce the networking hardware requirement by reducing the number of nodes in the system as compared to uniprocessors.
In this paper we present Sirocco, a family of software FGDSMs derived from Wisconsin Blizzard [25] and implemented on a network of Sun SparcStation 20s interconnected by Myrinet [5]. Sirocco systems range from an all-software design to a design with minimal custom hardware support for coherence operations. We identify and evaluate the sources of overhead in SMP-node implementations of software FGDSM. We compare Sirocco's performance on SMP nodes against uniprocessor nodes for systems with a given aggregate number of processors and amount of memory. We use performance measurements running eight shared-memory applications together with simple cost models to ask the question: "are SMPs cost-effective building blocks for software FGDSM?"
Our results indicate that SMP nodes: (i) result in performance competitive with uniprocessor nodes, (ii) substantially reduce hardware requirements and are more cost-effective than uniprocessor nodes, (iii) significantly benefit from hardware support for coherence, and (iv) especially benefit systems with high-overhead coherence operations.
Our results also indicate that SMP-node performance may be highly sensitive to the protocol scheduling policy. In Sirocco, an idle processor on a node can handle protocol operations on behalf of another. Scheduling one processor to handle protocol messages for another may result in adverse cache effects in applications with bursty communication patterns.
The rest of the paper is organized as follows. Section 2 describes how Sirocco implements FGDSM on an SMP node. Section 3 qualitatively analyzes the impact of SMP nodes on FGDSM performance. Section 4 and Section 5 evaluate the performance and cost-effectiveness of the Sirocco systems, respectively. Finally, Section 6 concludes the paper.
Sirocco: FGDSM on an SMP Node
Figure 1 illustrates the anatomy of a software FGDSM node [25]. The figure depicts the protocol-only resources in light gray, and resources used by both the protocol and the application in dark gray. The protocol maintains memory block sharing information in a directory and uses a pair of send/receive message queues to communicate with other nodes. A remote cache in memory temporarily stores fetched remote data. Shared data pages are distributed among designated home nodes. A set of fine-grain tags enforces access semantics for shared data in the remote cache and home pages. Upon a block access fault (i.e., an access violation on a shared memory block), the system inserts the relevant information into a fault queue. Processors execute both the application and the protocol software.
Sirocco extends the software FGDSM in Blizzard [25] to target small-scale SMP rather than uniprocessor nodes. Unlike other SMP-node software FGDSMs (e.g., Shasta [22]), Sirocco fully shares a node's resources among SMP processors. A single remote cache improves memory utilization by eliminating redundant copies of shared remote data. Sharing a memory cache especially benefits FGDSMs because memory caches typically suffer from page fragmentation [11]. In Sirocco, SMP processors directly share data in the remote cache and home pages using shared-memory hardware and obviate the need for intra-node messaging. Sharing memory also enables combining request messages from multiple processors for a single memory block and allows a processor to use memory blocks fetched by others. Sharing protocol resources (e.g., the directory, message queues) allows idle processors to execute protocol handlers while other processors are busy computing [12, 7].
Sharing resources, however, may violate the shared-memory access semantics. Shared-memory semantics dictate that coherence operations on data in the remote cache and home pages must appear to execute atomically [22, 25]. Figure 2 illustrates examples of atomic sequences required in FGDSM coherence operations. Coherence operations either correspond to fine-grain tag lookups upon a memory load or store operation in the application, or protocol actions (e.g., a writeback request for a dirty block) which require an atomic pair of accesses to the fine-grain tags and memory.
Uniprocessor-node implementations of FGDSM [25] or SMP-node implementations that do not allow resource sharing [23] guarantee atomicity of coherence operations in three ways. First, the resources are replicated among the processors and each processor always executes its own protocol handlers. As such, an application and a protocol handler simultaneously executing on multiple processors never access the same resources. Second, protocol handlers are only invoked if there is an access violation or through polling for messages and always execute to completion. As such, protocol actions always appear to execute atomically with respect to the application. Third, FGDSMs that implement tag lookup in software (e.g., Blizzard-S [25] and Shasta [23]) carefully insert polling code to avoid handling messages in the middle of a coherence lookup.
In the rest of the section, we describe Sirocco's approach to sharing resources among SMP processors. The next section describes the protocol dispatch and execution model and how Sirocco coordinates accesses to protocol-only resources.
Protocol Dispatch and Execution Model
The protocol-only resources may require access coordination if multiple processors simultaneously execute protocol handlers. FGDSM protocol handlers, however, only consist of code to move a small data block between memory and the network, and update the corresponding protocol state. As such, moving network data in/out of memory often dominates a handler's execution time. Parallel handler execution is only beneficial if the network interface card provides mechanisms to efficiently transfer data blocks in/out of the network [17], and implements either multiple message queues [6] or mechanisms to efficiently dispatch messages from a single queue to multiple handlers [8]. Unfortunately, many commodity network interface cards fail to satisfy the above requirements and hence preclude efficient simultaneous execution of multiple protocol handlers.
Unlike Shasta [22], Sirocco obviates the need for synchronization around the protocol-only resources (Figure 1) by serializing handler execution. In Sirocco, processors contend for a (software) lock to assume the role of the protocol processor upon an access violation or a message arrival, or while waiting at a barrier synchronization. The network interface card signals a message arrival by setting a flag in a user-accessible memory location. We use executable editing [16] and instrument the application code to poll the flag on every loop-backedge. Backedge polling obviates the need for user-level message interrupts, which incur prohibitively high overheads (~70 µs) in our commodity operating system. A protocol processor always goes back to computation by releasing the lock when it no longer needs to wait (e.g., when a remote block arrives or all messages are received).
Sirocco multiplexes computation with running protocol handlers on all the processors. Alternatively, the system could dedicate one processor on every node to execute protocol handlers. A recent study [12], however, concludes that a dedicated processor is not advantageous for slow commodity networking hardware and small-scale SMP nodes. While alternative scheduling policies are possible, they are beyond the scope of this paper.
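The dispatch model described above can be illustrated with a minimal Python sketch. The class and method names (Node, deliver, poll_at_backedge) are hypothetical and not part of Sirocco, and the non-blocking lock attempt is an assumption made for illustration; the text only states that processors contend for a software lock. The sketch models the network interface by a direct method call, so it ignores races with a real device.

```python
import threading


class Node:
    """Hypothetical model of one SMP node's protocol-dispatch state."""

    def __init__(self):
        self.protocol_lock = threading.Lock()  # one lock serializes all handlers
        self.message_flag = False               # raised by the modeled network interface
        self.inbox = []                         # pending messages (protocol-only resource)
        self.handled = []                       # (cpu, message) pairs, kept for inspection

    def deliver(self, message):
        """Model the network interface: enqueue a message and raise the flag."""
        self.inbox.append(message)
        self.message_flag = True

    def poll_at_backedge(self, cpu):
        """Code an instrumented loop backedge might run; returns True if this
        processor acted as the protocol processor."""
        if not self.message_flag:       # cheap common case: nothing arrived
            return False
        # Assumption: try the lock without blocking so computation can continue
        # when another processor is already the protocol processor.
        if not self.protocol_lock.acquire(blocking=False):
            return False
        try:
            while self.inbox:           # handlers run to completion, one at a time
                self.handled.append((cpu, self.inbox.pop(0)))
            self.message_flag = False
            return True
        finally:
            self.protocol_lock.release()  # go back to computation
```

Because only the lock holder touches the inbox, the message queue itself needs no further synchronization in this model, which mirrors the motivation for serializing handler execution.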
Sirocco uses an array of per-processor fault records to implement the shared fault queue. Fault array accesses are of a producer-consumer nature in which the application always inserts new data and the protocol simply removes them. A simple per-processor signal flag in the fault array guarantees that fault information is correctly handed off to the protocol.
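A fault array of this kind might look like the following Python sketch. The record fields, the names FaultRecord and FaultArray, and the rule that a processor has at most one outstanding fault are assumptions for illustration; the text specifies only a per-processor record with a signal flag.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FaultRecord:
    pending: bool = False            # signal flag: set by the application, cleared by the protocol
    address: Optional[int] = None    # faulting address (hypothetical field)
    is_write: bool = False           # access type (hypothetical field)


class FaultArray:
    """One record per processor, so producers never share a slot."""

    def __init__(self, num_processors: int):
        self.records: List[FaultRecord] = [FaultRecord() for _ in range(num_processors)]

    def post(self, cpu: int, address: int, is_write: bool) -> None:
        """Application side: publish a fault for this processor."""
        record = self.records[cpu]
        if record.pending:
            # Assumption: a faulting processor waits, so it has one fault at a time.
            raise RuntimeError("processor already has an outstanding fault")
        record.address = address
        record.is_write = is_write
        record.pending = True        # written last, so a set flag implies a complete record

    def drain(self) -> List[Tuple[int, int, bool]]:
        """Protocol side: collect and clear every pending fault."""
        faults = []
        for cpu, record in enumerate(self.records):
            if record.pending:
                faults.append((cpu, record.address, record.is_write))
                record.pending = False
        return faults
```

Writing the flag last is what makes the hand-off safe without a lock, provided stores become visible in program order, as in the sequentially consistent memory system the paper assumes.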
Support for Atomic Coherence Operations
Support for atomic coherence operations depends on the fine-grain tag implementation. Much like Blizzard [25], Sirocco provides a spectrum of tag implementations including custom hardware tags in a snoopy board [20], ECC-based tags managed by the memory controller [25], and table-based tags maintained in software [23, 25]. Hardware tags perform a lookup atomically with the memory reference and eliminate overhead either entirely or for most memory references. Software tags perform a lookup in a (non-atomic) sequence of instrumented instructions and require explicit software synchronization to guarantee atomicity. Tag implementations also vary in the degree of support for atomic coherence operations in handlers. In the rest of the section, we describe in detail how various Sirocco systems support atomic coherence operations.
Sirocco-T0: Custom Board SRAM Tags
Sirocco-T0 uses the Typhoon-0 (T0) custom board [20] to snoop memory bus transactions, perform fine-grain access control tests through an SRAM lookup, and coordinate intra-node communication. T0 enforces fine-grain access semantics by asserting a bus error in response to memory transactions that incur access violations (e.g., a read or write to an invalid memory block, or a write to a read-only block). An optimized kernel trap table delivers the bus error to the user level [26, 18]. The user-level code inserts the appropriate fault information into an array of per-processor fault records that the protocol code polls.
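The access semantics enforced by the fine-grain tags reduce to a small decision rule, sketched below in Python. The three-state tag and the function name violates are illustrative names rather than T0's actual interface.

```python
from enum import Enum


class Tag(Enum):
    INVALID = 0
    READ_ONLY = 1
    READ_WRITE = 2


def violates(tag: Tag, is_write: bool) -> bool:
    """Return True if an access with this tag must raise an access violation."""
    if tag is Tag.INVALID:
        return True                      # any read or write to an invalid block faults
    return is_write and tag is Tag.READ_ONLY  # writes to read-only blocks fault
```

Every tag implementation in the paper (SRAM, ECC plus page protection, or a software table) is a different mechanism for evaluating this same rule.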
Sirocco-T0 supports atomic coherence operations from the protocol directly in hardware. By writing to a T0 control register, handlers can atomically read a memory block while invalidating/downgrading the corresponding tag. Upon a write to the control register, T0 updates the tag and reads the data into a handler-accessible block buffer. When placing a fetched block into memory, a handler must atomically execute a sequence of memory writes and a tag upgrade. T0 provides uncached page-mapping aliases to memory [25] to allow bypassing hardware tag lookup while writing the data. Because handlers in Sirocco always execute to completion without interruption, the non-atomic sequence of memory writes and tag upgrade appears to execute atomically with respect to the application. Any application access violations during handler execution are caught by the system and are resumed immediately after the tag update.
Sirocco-E: Error-Correcting Code (ECC)
Sirocco-E (a descendant of Blizzard-E [25]) uses deliberately incorrect error-correcting code (ECC) bits to distinguish invalid from read-only/read-write blocks. To distinguish read-only from read-write blocks, Sirocco-E uses virtual memory page protection. Sirocco-E manipulates page protection and implements ECC invalidates/downgrades in the OS using a custom system call interface. Atomicity is guaranteed by suspending memory activity on all but one of a node's processors, and through handshakes in the kernel. In Sirocco, there is a master processor on the memory bus capable of masking bus arbitration. A system call to invalidate a block issued from any processor but the master will send an interprocessor interrupt to the master. The master masks bus arbitration, reads the data into a user-accessible buffer in memory, writes incorrect ECC to memory, and releases bus arbitration. Downgrading the tag (from read-write to read-only) may involve changing the page protection and consequently a TLB shootdown. Sirocco-E performs atomic tag upgrades using uncached page-mapping aliases as in Sirocco-T0.
Sirocco-S: Software Tags
Sirocco-S stores the tags in memory and uses executable editing [16] to insert access control tests around shared-memory loads and stores. Unlike its predecessor Blizzard-S [25], Sirocco-S uses two forms of tests to detect access violations. Invalid memory is marked with a sentinel value that has a low probability of occurring in the program [24]. The most common test case uses a sequence of 3 instructions (3 cycles) to detect word and doubleword load operations to invalid memory blocks. When the test detects a sentinel, it performs a complete table lookup in order to distinguish access violations from innocent uses of the sentinel value. The rest of the memory operations (i.e., all stores and some loads) use a sequence of 5 test instructions (6 cycles) to index a tag table prior to the memory reference to detect access violations.
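The two forms of test can be sketched in Python as follows. The sentinel constant, the word size, and all names are assumptions chosen for illustration; the real tests are short instruction sequences inserted by executable editing, not function calls. The 128-byte block size matches the protocol block size used in the evaluation.

```python
SENTINEL = 0xDEADBEEF   # hypothetical value that rarely occurs in program data
WORD_BYTES = 4          # assumed word size
BLOCK_BYTES = 128       # coherence block size
INVALID, READ_ONLY, READ_WRITE = range(3)


class SharedMemory:
    """Toy word-addressed memory with one software tag per block."""

    def __init__(self, num_blocks):
        # Invalid memory is filled with the sentinel value.
        self.words = [SENTINEL] * (num_blocks * BLOCK_BYTES // WORD_BYTES)
        self.tags = [INVALID] * num_blocks

    def install(self, block, values, tag):
        """Place fetched data first, then upgrade the tag."""
        start = block * BLOCK_BYTES // WORD_BYTES
        self.words[start:start + len(values)] = values
        self.tags[block] = tag


def checked_load(mem, addr):
    """Sentinel test: returns (value, fault)."""
    value = mem.words[addr // WORD_BYTES]
    if value != SENTINEL:
        return value, False                   # fast path: no table lookup
    # Slow path: full lookup separates a real violation from an innocent sentinel.
    return value, mem.tags[addr // BLOCK_BYTES] == INVALID


def checked_store(mem, addr, value):
    """Table test before the store: returns True on an access violation."""
    if mem.tags[addr // BLOCK_BYTES] != READ_WRITE:
        return True                           # fault: the store is not performed
    mem.words[addr // WORD_BYTES] = value
    return False
```

The load path touches the tag table only when the loaded value happens to equal the sentinel, which is why the common case is cheaper than the table test used for stores.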
Unlike hardware tags, software tag table lookups use memory instructions and are not atomic with respect to data references. Sirocco-S guarantees atomicity through a software handshake between the application and the protocol handlers. The handshake augments the instrumentation with a pair of store and clear instructions to per-processor memory locations that protocol handlers poll on (Figure 3). Upon invalidating/downgrading a block, the protocol handler can safely modify the tag in advance, but must guarantee that all writes to the data from the application have completed.
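One plausible reading of this handshake is sketched below as a deterministic, single-threaded Python model. Figure 3 is not reproduced here, so the exact instruction placement is an assumption, as are the names AccessHandshake and Downgrade. The sketch assumes the handler needs to observe each processor's location clear only once after changing the tag, since any access that starts later sees the new tag.

```python
class AccessHandshake:
    """Per-processor locations written by the instrumented application."""

    def __init__(self, num_processors):
        self.busy = [False] * num_processors

    def enter(self, cpu):
        """The 'store' before a non-atomic tag lookup and data access."""
        self.busy[cpu] = True

    def leave(self, cpu):
        """The 'clear' after the data access has completed."""
        self.busy[cpu] = False


class Downgrade:
    """Handler side: change the tag first, then wait out in-flight accesses."""

    def __init__(self, handshake, tags, block, new_tag):
        tags[block] = new_tag                       # tag is modified in advance
        self.handshake = handshake
        self.waiting_on = set(range(len(handshake.busy)))

    def poll(self):
        """Drop processors seen outside an access; True once none remain."""
        self.waiting_on = {cpu for cpu in self.waiting_on
                           if self.handshake.busy[cpu]}
        return not self.waiting_on
```

For example, if processor 1 is inside an instrumented access when the tag changes, poll keeps returning False until processor 1 leaves, even if processor 0 starts a new access in the meantime.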
Unfortunately, the handshake overhead may be high for applications with a large number of non-atomic instrumentations (i.e., all stores and some loads). Moreover, frequent handshaking with the protocol is unnecessary in applications with less frequent protocol activity. Sirocco-SB (B stands for backedge) addresses this problem and only uses a single clear instruction in loop-backedges in the application (Figure 3) . Upon a tag update, a handler sets the flags for all processors and simply verifies that all processors have reached a loop backedge at least once before reading the block. Sirocco-SB reduces overhead in applications with a low frequency of protocol activity while increasing the protocol waiting time in communication-intensive applications.
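The loop-backedge variant can be modeled in the same spirit. In the Python sketch below, the names are hypothetical, and excluding the protocol processor's own flag is an assumption based on the later observation that a handler on a quad-processor node waits for three processors.

```python
class BackedgeHandshake:
    """Sirocco-SB style handshake: one clear per loop backedge."""

    def __init__(self, num_processors):
        self.flags = [False] * num_processors

    def backedge(self, cpu):
        """The single clear instruction inserted at every loop backedge."""
        self.flags[cpu] = False

    def start_tag_update(self, protocol_cpu):
        """Handler side: after updating the tag, flag every other processor."""
        for cpu in range(len(self.flags)):
            if cpu != protocol_cpu:
                self.flags[cpu] = True

    def all_passed_backedge(self):
        """True once every flagged processor has reached a backedge."""
        return not any(self.flags)
```

The application pays one store per loop iteration regardless of how many loads and stores the loop body contains, while the handler may wait as long as the longest loop body among the other processors.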
Upon an access violation the test code in Sirocco-S (Sirocco-SB) inserts the fault information into the fault array. To place a fetched block in memory, a handler first writes the data and then upgrades the tag. Because handlers execute to completion without interruptions, such an operation appears to execute atomically with respect to the application.
Our handshake methods in Sirocco assume a sequentially consistent memory system. Weaker memory models require fence instructions which may incur high overheads. Shasta replicates fine-grain tags among the processors and uses intra-node messaging to obviate the need for a software handshake and fence instructions on an Alpha Server [22]. Modern microprocessors, however, are using aggressive speculative techniques to provide sequentially consistent systems with performance competitive to weaker models [14]. Since these techniques also enhance performance of fence instructions in processors with weaker models, we expect our handshake methods to remain low-overhead alternatives to intra-node messaging in future systems.
Sirocco-ES: A Hybrid of ECC and Software Tags
Sirocco-ES is an attempt to take advantage of features in both Sirocco-E and Sirocco-S. Sirocco-ES uses ECC to identify invalid memory blocks and software tags to distinguish read-write from read-only blocks. In comparison to Sirocco-S, Sirocco-ES eliminates instrumentation overhead on load operations altogether, but introduces the cost of maintaining ECC tags. Compared to Sirocco-E, Sirocco-ES eliminates the use of high-overhead page protection mechanisms.
Factors Affecting SMP-Node Performance
An FGDSM's performance on a network of SMPs depends on both application and system characteristics. Clustering processors into SMP nodes is beneficial if the application sharing patterns favor fast (local) SMP hardware shared-memory mechanisms over high-latency (remote) FGDSM mechanisms. An SMP node also provides the opportunity for an idle processor to overlap running protocol handlers with computation on other processors. SMP nodes, however, introduce additional overheads, which may result in lower overall performance. In this section we identify the sources of overhead in SMP-node implementations, and quantify overhead for common FGDSM operations.
We classify overhead into correctness and contention overhead. Correctness overhead corresponds to the minimum overhead associated with SMP-node implementations in the absence of contention among processors. Contention overhead refers to additional overhead due to resource sharing among multiple SMP processors.
Correctness Overhead
Table 1 depicts the cost of common FGDSM operations in a base system without SMP-node support, and the additional overhead of supporting SMP nodes. Software handshake incurs between 0 (for sentinel) and 2 (for table lookup) cycles of overhead upon tag lookup in Sirocco-S and Sirocco-ES, and 1 cycle of overhead upon loop-backedges in Sirocco-SB and Sirocco-ESB. Manipulating tags in ECC implementations may require a system call which incurs about 30 µs. SMP nodes may require an additional interprocessor interrupt (Section 2.2) for an overhead of 30 µs if the system call originates from a processor incapable of masking memory bus arbitration. The tag update overhead for software tags corresponds to the handshake cost in the handlers (Figure 3).
The table also presents remote miss times in Sirocco. The measurements correspond to minimum roundtrip miss times for a 128-byte block software protocol between two machine nodes. The range of miss times corresponds to the three types of remote misses: a read miss, a write miss, and an upgrade (write to a read-only block) miss. SMP nodes incur the additional overheads of passing information through a fault array and acquiring/relinquishing the protocol lock upon an access violation. In comparison, a uniprocessor-node implementation (such as Blizzard) directly calls the appropriate protocol handler upon access violation and passes the fault information through processor registers. SMP nodes increase roundtrip miss times by about 7-18%.
Contention Overhead
SMP nodes also incur contention overhead due to resource sharing among multiple processors. While contention for (local) memory accesses can lead to queuing delays on the memory bus, contention for (remote) memory accesses can result in queuing delays for running protocol handlers. Applications not benefiting from clustering increase the demand for protocol execution by increasing the aggregate frequency of remote misses on a node. Allowing one processor to execute protocol handlers on behalf of others may also pollute the protocol processor's cache and increase latency by requiring a cache-to-cache transfer of data between a requesting processor and the receiving protocol processor [12] .
Performance Evaluation
In this section, we first present architectural details of our network of SMPs. Next, we present application speedups for our base systems, which are uniprocessor-node FGDSM implementations (as in Blizzard) incurring no SMP-node overhead. We use the base systems for performance comparisons against Sirocco in the rest of the paper. We proceed by evaluating the correctness overhead in Sirocco and finally measure the impact of clustering processors into SMP nodes on application performance.
Our ing a dual-processor module. The T0 custom board occupies one of the bus slots in Sirocco-T0 and therefore allows for only dual-processor nodes. The SMPs are interconnected using Myricom's Myrinet [5] switch-based network. Myrinet network interface cards connect to a node via a 25 MHz I/O bus. We use a 128-byte Stache software coherence protocol [19] to implement shared memory across the nodes.
Base System Performance
Table 2 presents speedups from shared-memory applications running on our base systems. We also take advantage of software DSM's flexibility, and use customized protocols that bypass shared memory and use direct messaging in two of our applications, em3d and appbt [9]. Speedups vary depending on an application's inherent parallelism, and its interaction with the FGDSM system. Sirocco-T0 implements the fine-grain tags in hardware and always achieves the best speedups. Sirocco-S always incurs instrumentation overhead and therefore favors applications with frequent access violations (e.g., em3d). In contrast, protocol coherence operations in Sirocco-E are expensive and hence it favors applications with less frequent (e.g., tomcatv and water) or no (e.g., em3d-cs and appbt-cs) access violations.
Page protection overhead in Sirocco-E can degrade performance even in the absence of sharing if an application incurs frequent writes to read-only pages (i.e., pages with at least one read-only block). For instance, lu achieves reasonable speedups on Sirocco-T0, but exhibits a much lower performance on Sirocco-E. Sirocco-ES addresses this problem by performing tag lookups for stores in software and obviating the need for page protection. Sirocco-ES often either outperforms both Sirocco-E and Sirocco-S or performs close to the best of the two.
Correctness Overhead in SMP Nodes
We measure correctness overhead by comparing the Sirocco systems running on uniprocessor nodes against our base systems (incurring no SMP-node overhead). Figure 4 illustrates application execution times on Sirocco systems normalized to those on the corresponding base systems. On average, correctness overhead is negligible (< 3%) in hardware tags. The software tags require an application-protocol handshake and incur a higher overhead of up to 11%.
The performance impact of correctness overhead also varies across applications. In Sirocco-T0, applications with high sharing activity (e.g., em3d, appbt, barnes, and em3d-cs) incur higher correctness overhead. Correctness overhead in Sirocco-S has a higher performance impact on applications with frequent non-atomic instrumentations (e.g., barnes and lu). The loop-backedge handshake on average lowers the incurred correctness overhead in Sirocco-S (Sirocco-ES) by up to 4%.
Performance Impact of Clustering
In this section, we investigate the impact of clustering (i.e., grouping processors into SMP nodes) on application performance. We evaluate clustering by comparing Sirocco's (SMP-node) performance against that of our base system, while keeping the aggregate number of processors and amount of memory in the system constant.
Clustering affects the number of accesses to both local and remote memory. An SMP must satisfy all of the clustered processors' local memory accesses. While clustering converts certain memory accesses among neighboring (clustered) processors from remote to local, it aggregates all of the clustered processors' remote accesses. Because clustered processors share the remote cache and home pages in memory, a processor fetching remote data may also (implicitly) prefetch and convert remote accesses by others to local cache accesses.
Clustering affects performance in applications with dominant local memory accesses in three ways. First, clustered implementations at a minimum incur the SMP-node correctness overhead. Second, an increase in local accesses can introduce queuing delays in the node's memory bus. Third, executing protocol handlers on one processor on behalf of another may degrade cache performance and increase the number of local accesses. Likewise, clustering affects performance in applications with dominant remote memory access patterns in two ways. First, the aggregate remote memory accesses increase the demand for executing protocol handlers. Second, SMP-node processors can improve performance by overlapping computation with protocol execution.
The per-node aggregate number of remote accesses in a clustered configuration depends on an application's sharing patterns (Figure 5). Sharing patterns vary from strictly nearest-neighbor sharing in em3d and tomcatv, to mostly all-to-all sharing in barnes. Nearest-neighbor sharing results in the same per-node aggregate number of remote accesses in both clustered and uniprocessor-node configurations; on every node, there are exactly two immediate neighboring remote processors in all configurations. In more complex sharing patterns, the per-node aggregate number of remote accesses depends on the degree of sharing in the remote cache and home pages. When the network is the bottleneck, performance improvements with clustering are due to implicit prefetching of shared memory blocks, which cannot occur in applications with nearest-neighbor sharing.
Figure 6 presents application execution times on sixteen uniprocessor nodes, eight dual-processor nodes, and four quad-processor nodes for all Sirocco systems. The results are normalized to the corresponding base uniprocessor-node systems. Tomcatv, lu, water, em3d-cs and appbt-cs are computation-intensive and primarily access local data, em3d and appbt are communication-intensive and frequently access remote data, and barnes accesses moderate amounts of remote data. Em3d, tomcatv and em3d-cs all exhibit nearest-neighbor sharing patterns (Figure 5). As such, clustering does not affect the per-node aggregate number of remote accesses in these applications. Appbt uses shared-memory spin-locks and incurs frequent remote accesses on the critical path of execution. Clustering substantially reduces these remote accesses by converting them to local spin-lock accesses. The per-node aggregate number of remote accesses, however, increases in all the other applications (up to 50% in barnes).
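The nearest-neighbor observation above can be checked with a small Python model. The model assumes processors form a one-dimensional ring and that each node holds a contiguous run of processors; this layout and the function names are assumptions for illustration, not the applications' actual data partitioning.

```python
def remote_neighbors(num_processors, node_size):
    """For each node, the set of remote processors adjacent to any local one."""
    per_node = []
    for first in range(0, num_processors, node_size):
        local = set(range(first, first + node_size))
        remote = set()
        for p in local:
            for q in ((p - 1) % num_processors, (p + 1) % num_processors):
                if q not in local:
                    remote.add(q)
        per_node.append(remote)
    return per_node


def interior_processors(num_processors, node_size):
    """Processors per node whose two ring neighbors are both on the same node."""
    local = set(range(node_size))  # by symmetry, inspect the first node only
    return sum(
        1 for p in local
        if (p - 1) % num_processors in local and (p + 1) % num_processors in local
    )


for size in (1, 2, 4):
    counts = {len(r) for r in remote_neighbors(16, size)}
    print(size, counts, interior_processors(16, size))
```

Under these assumptions every node has two remote neighbors for node sizes 1, 2, and 4, and a quad-processor node has two interior processors that communicate only locally.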
Our overall results indicate that clustering offers competitive performance, especially for hardware tag implementations. Dual-processor nodes perform very close to uniprocessor nodes for Sirocco-T0 and Sirocco-E and increase execution time on average by 13% in Sirocco-S and 11% in Sirocco-SB. Quad-processor nodes also exhibit performance competitive to uniprocessor nodes, and are especially beneficial for Sirocco-E, converting high-overhead FGDSM operations (e.g., write to read-only pages) to fast SMP local accesses. This result corroborates previous findings for (high-overhead) DVSM implementations on a network of SMPs [28]. Quad-processor nodes also increase synchronization time in the loop-backedge handshake because a protocol handler must wait for three processors to reach a loop-backedge. This result indicates that the loop-backedge handshake may be suitable for small-scale SMP nodes while instrumented synchronization may be suitable for larger SMP nodes.
Em3d-cs consistently exhibits the largest performance degradation across all tag implementations. At the end of each iteration in em3d-cs, the system schedules one processor (the first one to become the protocol processor) to receive all the incoming data. Because the data are received in the protocol processor's cache, subsequent accesses to the data by the consuming (i.e., computing) processor miss. Data belonging to other processors also pollute the protocol processor's cache. The combined effect significantly increases computation time in em3d-cs (> 50% for quad-processor nodes). This result suggests that systems should allow custom protocols to schedule protocol execution on a particular processor for effective cache utilization. Our transparent shared memory protocol does not exhibit performance sensitivity to the scheduling policy because a (protocol) processor resumes computation as soon as the block it is waiting for arrives.
Appbt and em3d significantly benefit from clustering across tag implementations. In appbt, clustering improves performance by reducing the number of remote accesses. The performance impact is more pronounced for systems with high-overhead coherence operations such as Sirocco-E and Sirocco-ES. Appbt performs 89% and 32% better respectively on quad-processor nodes than uniprocessor nodes. Em3d makes effective use of the processors' idle time to service remote requests. Because of nearest-neighbor sharing patterns, quad-processor nodes eliminate remote accesses for two of the processors. In every iteration, these processors complete computation more quickly than others and compete to act as the protocol processor on behalf of the node, overlapping protocol execution with computation and improving performance by up to 18%.
Lu exhibits irregular clustering trends on Sirocco-E. Clustered neighbors are located along the y dimension of the 2-dimensional input matrix. Partitioning the matrix among sixteen processors using uniprocessor and dual-processor nodes results in a large number of writes to read-only pages. Because protocol execution is serialized, queuing delays at the protocol processor in dual-processor nodes degrade performance. In quad-processor nodes, all the processors along the y dimension belong to the same node, eliminating the read-only pages. As such, performance significantly improves by 96% since the high-overhead FGDSM accesses are converted to local SMP accesses.
Not surprisingly, the choice of a handshake method impacts the clustering performance for software tags. Frequent access violations in em3d (on dual-processor nodes) and barnes, as well as large loop bodies with sparse loop-backedges in tomcatv increase handshake waiting time in the loop-backedge method. The latter, however, consistently improves performance for lu and water because of the low frequency of protocol activity in these applications.
Cost-Effectiveness of SMP Nodes
Although the manufacturing cost of computer products is typically related to the cost of components, cost from the perspective of a customer is related to price, which is also dictated by market forces [13] . High-performance products, for instance, tend to target smaller markets and as such carry larger margins, charging higher premiums.
A software FGDSM may use either uniprocessors or SMPs as building blocks. Depending on the degree of multiprocessing, SMP products can belong to either a low-margin desktop market or a high-margin server market. When clustering processors into SMP nodes, the counts of many machine components, such as processors and memory modules, remain fixed across platforms. A clustered system, however, reduces the number of nodes in the machine and thus requires fewer motherboards, network interface cards, and network switches/routers. Because SMPs can carry higher price premiums, the savings from a reduction in the number of these components may be offset by an increase in the cost of a node.
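The component reduction described above can be illustrated with a minimal sketch. It assumes one motherboard and one network interface card per node, which the text implies but does not state; the function name `component_counts` and its dictionary keys are hypothetical and chosen only for illustration.

```python
def component_counts(total_processors: int, processors_per_node: int) -> dict:
    """Count per-node components for a fixed total number of processors.

    Assumption (not from the paper): each node needs exactly one
    motherboard and one network interface card.
    """
    if processors_per_node <= 0 or total_processors % processors_per_node != 0:
        raise ValueError("processors_per_node must evenly divide total_processors")
    nodes = total_processors // processors_per_node
    return {
        "processors": total_processors,   # fixed across configurations
        "nodes": nodes,
        "motherboards": nodes,            # assumed one per node
        "network_interface_cards": nodes, # assumed one per node
    }


# The 16-processor platforms considered in this section, at three clustering degrees.
for degree in (1, 2, 4):
    print(degree, component_counts(16, degree))
```

Under these assumptions, moving from uniprocessor to quad-processor nodes cuts the number of motherboards and network interface cards from 16 to 4 while the processor count stays at 16.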
In this section we ask the question: "are SMPs cost-effective building blocks for software FGDSM?" We use cost-performance as the metric [27, 10], and our base uniprocessor-node FGDSM systems as the reference point. We say an SMP-node system is cost-effective if it has a lower cost-performance than a uniprocessor-node system. A system is most cost-effective when it achieves the lowest cost-performance of all compared systems.
We compare and contrast uniprocessor and SMP products from two vendors representing the low and high ends of the price spectrum respectively. Vendor prices vary over time but the trends suggest that the relative prices remain the same. We evaluate cost-performance (i.e., cost multiplied by execution time) using application execution times (from Section 4.3) and cost estimates for sample systems built from DELL [1] and Sun Microsystems [3] products, representing the two ends of the price spectrum for desktops and servers. Our platforms consist of a total of 16 processors and 1 GB of memory and use Myrinet [2] for networking. DELL products are low-end uniprocessor Dimension XPS PCs, high-end dual-processor Model 400 workstations, and quad-processor flagship PowerEdge 6100 servers. Sun Microsystems products are single and dual-processor desktops Model E1300, and low-end quad-processor Enterprise 450 servers.
Table 3 depicts the cost-performance of Sirocco-S's clustered configurations normalized to that of its corresponding base uniprocessor-node system. Numbers lower than one indicate a cost-effective clustered configuration. Without a cost advantage (i.e., in equal-cost systems), clustering results in machines that are not cost-effective for most of the applications; clustered implementations incur higher overheads and therefore lower performance.
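The metric defined in this section (cost multiplied by execution time, normalized to the uniprocessor-node base system) can be written as a short sketch. The function names and all numeric values below are hypothetical placeholders for illustration only; they are not prices or execution times from the paper.

```python
def cost_performance(cost: float, execution_time: float) -> float:
    """Cost-performance as defined in the text: cost multiplied by execution time."""
    return cost * execution_time


def normalized_cost_performance(cluster_cost: float, cluster_time: float,
                                base_cost: float, base_time: float) -> float:
    """Cost-performance of a clustered system relative to its base system."""
    return (cost_performance(cluster_cost, cluster_time)
            / cost_performance(base_cost, base_time))


def is_cost_effective(cluster_cost: float, cluster_time: float,
                      base_cost: float, base_time: float) -> bool:
    """A clustered configuration is cost-effective when the normalized value is below one."""
    return normalized_cost_performance(
        cluster_cost, cluster_time, base_cost, base_time) < 1.0


# Hypothetical example: the clustered system costs 20% less but runs 10% slower.
ratio = normalized_cost_performance(cluster_cost=80.0, cluster_time=11.0,
                                    base_cost=100.0, base_time=10.0)
print(round(ratio, 2), is_cost_effective(80.0, 11.0, 100.0, 10.0))
```

With equal costs the ratio reduces to the ratio of execution times, so in an equal-cost comparison any clustering slowdown pushes the value above one.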
For all vendor platforms and applications, quad-processor nodes perform best except for lu on DELL-based systems; lu exhibits a large performance degradation in a clustered system and therefore results in worse (i.e., higher) cost-performance. Sun's quad-processor configurations offer a significant cost-performance advantage over their uniprocessor and dual-processor counterparts. SMP nodes also improve cost-performance for DELL-based platforms, but the high price premium of DELL's SMP products prevents them from having a large impact on cost-performance. These results, however, are based on a high-overhead clustered implementation (Sirocco-S) and may be conservative. Implementations with hardware assist will be more favorable towards quad-processor nodes.
Conclusions
In this paper, we presented Sirocco, a family of FGDSM systems implemented on a network of SMP workstations. Unlike previous implementations of FGDSM, Sirocco targets low-cost SMPs rather than uniprocessors as building blocks. SMP nodes provide an opportunity to improve performance by allowing processors to communicate within an SMP using fast hardware shared-memory mechanisms. Multiple SMP processors can also overlap the application's execution with protocol execution, thereby reducing execution time. Simultaneous sharing of a node's resources (e.g., memory) between the application and the protocol, however, requires mechanisms for guaranteeing atomic accesses. Contention for shared resources among SMP processors may also result in queueing delays and lower performance.
We measured performance for various Sirocco implementations ranging from an all-software approach with no additional hardware to a mostly-software approach with custom hardware support for atomic coherence operations. We evaluated the impact of clustering (i.e., grouping processors into SMP nodes) by comparing the performance of clustered implementations against that of a uniprocessor-node implementation while keeping the aggregate number of processors and amount of memory in the system constant. Using simple cost models for desktop/server products, we finally asked the question "are SMPs cost-effective building blocks for software FGDSM?"
Experimental results from running shared-memory applications indicated that SMP nodes: (i) result in performance competitive with uniprocessor nodes, (ii) substantially reduce hardware requirements and are more cost-effective than uniprocessor nodes, (iii) significantly benefit from hardware support for coherence operations, and (iv) are especially beneficial for FGDSMs with high-overhead coherence operations.
