Shared memory provides an attractive and intuitive programming model for large-scale parallel computing, but requires a coherence mechanism to allow caching for performance while ensuring that processors do not use stale data in their computation. Implementation options range from distributed shared memory emulations on networks of workstations to tightly coupled, fully cache-coherent distributed shared memory multiprocessors. Previous work indicates that performance varies dramatically from one end of this spectrum to the other. Hardware cache coherence is fast, but also costly and time-consuming to design and implement, while DSM systems provide acceptable performance on only a limited class of applications. We claim that an intermediate hardware option, memory-mapped network interfaces that support a global physical address space without cache coherence, can provide most of the performance benefits of fully cache-coherent hardware at a fraction of the cost. To support this claim we present a software coherence protocol that runs on this class of machines, and use simulation to conduct a performance study. We look at both programming and architectural issues in the context of software and hardware coherence protocols. Our results suggest that software coherence on NCC-NUMA machines is a more cost-effective approach to large-scale shared-memory multiprocessing than either pure distributed shared memory or hardware cache coherence. © 1995 Academic Press, Inc.