Exploring Shared Memory Protocols in FLASH
ABSTRACT The goal of this project was to improve the performance of large scientific and engineering applications through collaborative hardware and software mechanisms for managing the memory hierarchy of non-uniform memory access time (NUMA) shared-memory machines, as well as their individual component processors. In spite of the programming advantages of shared-memory platforms, obtaining good performance for large scientific and engineering applications on such machines can be challenging. Because communication between processors is managed implicitly by the hardware, rather than expressed by the programmer, application performance may suffer from unintended communication: communication that the programmer did not consider when developing the application. In this project, we developed and evaluated a collection of hardware, compiler, language, and performance-monitoring tools to obtain high performance on scientific and engineering applications on NUMA platforms by managing communication through alternative coherence mechanisms. Alternative coherence mechanisms have often been discussed as a means for reducing unintended communication, although architectural implementations of such mechanisms are quite rare. This report describes an actual implementation of a set of coherence protocols that support coherent, non-coherent, and write-update accesses for a CC-NUMA shared-memory architecture, the Stanford FLASH machine. Such an approach has the advantage of applying alternative coherence only where it is beneficial, and also provides an evolutionary migration path for improving application performance. We present data on two computations, RandomAccess from the HPC Challenge benchmarks and a forward solver derived from LS-DYNA, showing the performance advantages of the alternative coherence mechanisms. For RandomAccess, the non-coherent and write-update versions can outperform the coherent version by factors of 5 and 2.5, respectively.
In LS-DYNA, we obtain improvements of 18% on average using the non-coherent version. We also present data on the SpecOMP benchmarks, showing that the protocols have a modest overhead of less than 3% in applications where the alternative mechanisms are not needed. In addition to the selective coherence studies on the FLASH machine, in the last six months of this project ISI performed research on compiler technology for the transactional memory (TM) programming model being developed at Stanford. As part of this research, ISI developed a compiler that recognizes transactional memory “pragmas” and automatically generates parallel code for the TM programming model.
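The RandomAccess workload mentioned above is a useful concrete illustration of "unintended" coherence traffic, since each update touches an essentially random table location. A minimal sketch of its update loop follows; the table size and update count are illustrative parameters, not values from the report:

```python
# Minimal sketch of the HPC Challenge RandomAccess (GUPS) update loop.
# Each update XORs a pseudo-random value into a random table entry, so
# consecutive updates hit unrelated cache lines -- exactly the pattern
# for which coherent caching on a CC-NUMA machine buys little and
# non-coherent access can pay off.

POLY = 0x7  # feedback polynomial of the benchmark's 64-bit LFSR

def random_access_updates(log2_table_size, num_updates):
    table_size = 1 << log2_table_size
    table = list(range(table_size))          # T[i] = i initially
    mask64 = (1 << 64) - 1
    ran = 1
    for _ in range(num_updates):
        # one step of the 64-bit linear feedback shift register
        ran = ((ran << 1) ^ (POLY if ran & (1 << 63) else 0)) & mask64
        table[ran & (table_size - 1)] ^= ran  # random, scattered update
    return table
```

Because each iteration depends only on `ran`, the updates commute (XOR is associative and commutative), which is what lets relaxed-coherence versions run them without ordering traffic.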
Cache Inclusion And Processor Sampling In Multiprocessor Simulations
The evaluation of cache-based systems demands careful simulations of entire benchmarks. Simulation efficiency is essential to realistic evaluations. For systems with large caches and a large number of processors, simulation is often too slow to be practical. In particular, the optimized design of a cache for a multiprocessor is very complex with current techniques. This paper addresses these problems. First we introduce necessary and sufficient conditions for cache inclusion in uniprocessors and in multiprocessors with and without invalidations. Second, under cache inclusion, we show that an accurate trace for a given processor or for a cluster of processors can be extracted from a multiprocessor trace. With this methodology, possible cache architectures for a processor or for a cluster of processors are evaluated independently of the rest of the system, resulting in a drastic reduction of the trace length and simulation complexity. Moreover, many important system-wide metrics can be estimated.
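The inclusion property at the center of this abstract can be checked mechanically on a trace. The sketch below is a simplified illustration, not the paper's formal conditions: two direct-mapped caches with the same line size, where inclusion is preserved when the L2 set count is a multiple of the L1 set count, and can be violated otherwise. All sizes are illustrative:

```python
# Sketch: checking multilevel cache inclusion on an address trace.
# Inclusion holds when every block resident in L1 is also resident in L2.

LINE = 16  # bytes per cache line (illustrative)

def simulate_inclusive(trace, l1_sets, l2_sets):
    """Run a trace through direct-mapped L1/L2; return True iff
    inclusion held after every reference."""
    l1 = {}  # set index -> resident block (direct-mapped)
    l2 = {}
    for addr in trace:
        block = addr // LINE
        l1[block % l1_sets] = block
        l2[block % l2_sets] = block
        # inclusion check: every L1-resident block must be L2-resident
        if not all(b in l2.values() for b in l1.values()):
            return False
    return True
```

For example, blocks 0 and 2 conflict in a 2-set L2 but not in a 4-set L1, so a reference to block 2 evicts block 0 from L2 while it remains in L1, breaking inclusion.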
A Tile Selection Algorithm for Data Locality and Cache Interference
Loop tiling is a well-known compiler transformation that increases data locality, exposes parallelism and reduces synchronization costs. Tiling increases the amount of data reuse that can be exploited by reordering the loop iterations so that accesses to the same data are closer together in time. However, tiled loops often suffer from cache interference in the direct-mapped or low-associativity caches typically found in state-of-the-art microprocessors. A solution to this problem is to choose a tile size that does not exhibit self interference. In this paper, we propose a new tile selection algorithm for eliminating self interference and simultaneously minimizing capacity and cross-interference misses. We have automated the algorithm in the SUIF compiler and used it to generate tiles for a range of problem sizes for three scientific computations. Our experimental results show that the algorithm consistently finds tiles that yield lower miss rates than existing tile selection algorithms.
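The transformation itself can be sketched on the classic example, matrix multiply. The tile size `T` below is a plain parameter; the paper's contribution, not reproduced here, is the algorithm that picks `T` so a tile maps into a direct-mapped cache without self-interference:

```python
# Sketch of loop tiling for matrix multiply: the k and j loops are
# blocked so each T x T tile of B is reused across a whole stripe of A
# before it can be evicted, instead of being streamed through the cache.

def matmul_tiled(A, B, T):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for kk in range(0, n, T):              # tile the reduction loop
        for jj in range(0, n, T):          # tile the column loop
            for i in range(n):
                for k in range(kk, min(kk + T, n)):
                    a = A[i][k]            # scalar kept in a register
                    for j in range(jj, min(jj + T, n)):
                        C[i][j] += a * B[k][j]
    return C
```

The result is identical to the untiled triple loop for any `T >= 1`; only the order of the additions (and hence the cache behavior) changes.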
Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures
In this paper, we describe an algorithm and implementation of locality optimizations for architectures with instruction sets such as Intel's SSE and Motorola's AltiVec that support operations on superwords, i.e., aggregate objects consisting of several machine words. We treat the large superword register file as a compiler-controlled cache, thus avoiding unnecessary memory accesses by exploiting reuse in superword registers. This research is distinguished from previous work on exploiting reuse in scalar registers because it considers not only temporal but also spatial reuse. As compared to optimizations that exploit reuse in cache, the compiler must also manage replacement, and thus explicitly name registers in the generated code. We describe an implementation of our approach integrated with a compiler that exploits superword-level parallelism (SLP). We present a set of results derived automatically on 4 multimedia kernels and 2 scientific benchmarks. Our results show speedups ranging from 1.3X to 2.8X on the 6 programs as compared to using SLP alone, and we eliminate the majority of memory accesses.
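The kind of reuse the abstract describes can be illustrated with a small sketch. It is not the paper's algorithm; it simulates a 4-wide superword register with an ordinary variable in a stencil-style loop, where consecutive superword loads overlap (spatial reuse) and the previous chunk is carried into the next iteration (temporal reuse):

```python
# Sketch of compiler-controlled caching in superword registers:
# out[i] = a[i] + a[i+1]. Each W-wide chunk of `a` is loaded from
# memory exactly once; the variable `cur` stands in for a superword
# register holding the chunk across iterations, so the reference to
# a[i+1] in the next chunk never causes a redundant load of `cur`.

W = 4  # superword width, e.g. one SSE register of four 32-bit values

def stencil_superword(a):
    n = len(a)
    out = []
    cur = a[0:W]                           # initial superword load
    for base in range(0, n - 1, W):
        nxt = a[base + W: base + 2 * W]    # the only new load per step
        window = cur + nxt                 # unaligned access via 2 regs
        for i in range(min(W, n - 1 - base)):
            out.append(window[i] + window[i + 1])
        cur = nxt                          # temporal reuse: no reload
    return out
```

A cache would absorb the redundant loads too, but here the compiler, not the hardware, decides what stays resident, which is why it must also manage replacement and name registers explicitly.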
Evaluation of Architectural Paradigms for Addressing the Processor-Memory Gap
Many high performance applications run well below the peak arithmetic performance of the underlying machine, with inefficiencies often attributed to poor memory system behavior. In the context of scientific computing we examine three emerging processors designed to address the well-known gap between processor and memory performance through the exploitation of data parallelism. The VIRAM architecture uses novel PIM technology to combine embedded DRAM with a vector co-processor for exploiting its large bandwidth potential. The DIVA architecture incorporates a collection of PIM chips as smart-memory coprocessors to a conventional microprocessor, and relies on superword-level parallelism to make effective use of the available memory bandwidth. The Imagine architecture provides a stream-aware memory hierarchy to support the tremendous processing potential of SIMD controlled VLIW clusters. First we develop a scalable synthetic probe that allows us to parameterize key performance attributes of VIRAM, DIVA and Imagine while capturing the performance crossover points of these architectures. Next we present results for scientific kernels with different sets of computational characteristics and memory access patterns. Our experiments allow us to evaluate the strategies employed to exploit data parallelism, isolate the set of application characteristics best suited to each architecture, and show a promising direction towards interfacing leading-edge processor technology with high-end scientific computations.
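The idea of a synthetic probe that sweeps an application characteristic can be sketched simply. The kernel below is illustrative only (the paper's actual probe is not reproduced here): it exposes arithmetic intensity, flops per loaded word, as a parameter, so sweeping it would reveal where an architecture crosses from memory-bound to compute-bound behavior:

```python
# Sketch of a parameterized synthetic probe: one memory access per
# element, followed by a dependent chain of `flops_per_word`
# multiply-adds. Sweeping flops_per_word shifts the kernel from
# bandwidth-limited (0 flops/word) toward compute-limited.

def probe(data, flops_per_word):
    acc = 0.0
    for x in data:                       # one load per element
        v = x
        for _ in range(flops_per_word):
            v = v * 1.000001 + 0.5       # dependent multiply-add chain
        acc += v
    return acc
```

Timing `probe` over a range of `flops_per_word` values on each machine would locate the crossover point at which added arithmetic stops being hidden behind memory traffic.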