21 research outputs found

    Cache Inclusion And Processor Sampling In Multiprocessor Simulations

    The evaluation of cache-based systems demands careful simulation of entire benchmarks. Simulation efficiency is essential to realistic evaluations. For systems with large caches and a large number of processors, simulation is often too slow to be practical. In particular, optimizing the cache design for a multiprocessor is very complex with current techniques. This paper addresses these problems. First, we introduce necessary and sufficient conditions for cache inclusion in uniprocessors and in multiprocessors with and without invalidations. Second, under cache inclusion, we show that an accurate trace for a given processor or for a cluster of processors can be extracted from a multiprocessor trace. With this methodology, possible cache architectures for a processor or for a cluster of processors are evaluated independently of the rest of the system, resulting in a drastic reduction of the trace length and simulation complexity. Moreover, many important system-wide metrics can be estimated…
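
    To make the inclusion property concrete, the sketch below (not taken from the paper) replays a short address trace through a toy two-level cache model and checks after each access that every block resident in L1 is also resident in L2. The cache geometry, the trace, and names such as check_inclusion are hypothetical; with a 2-way L1 over a direct-mapped L2 and no back-invalidation, the check fails, which is exactly the kind of configuration that conditions like the paper's characterize.

        /* Illustrative sketch, not the paper's algorithm: replay a short
         * address trace through a toy two-level cache model and check the
         * inclusion property (every block resident in L1 is also in L2).
         * All sizes and names (L1_SETS, check_inclusion, ...) are made up. */
        #include <stdio.h>
        #include <stdint.h>
        #include <stdbool.h>

        #define BLOCK    64                 /* block size in bytes              */
        #define L1_SETS  32                 /* L1: 2-way, 64 blocks total       */
        #define L2_SETS  256                /* L2: direct-mapped, 256 blocks    */

        static uint64_t l1[L1_SETS][2];     /* tags, 0 = empty                  */
        static int      l1_lru[L1_SETS];    /* which L1 way is least recent     */
        static uint64_t l2[L2_SETS];

        static bool l2_has(uint64_t blk) { return l2[blk % L2_SETS] == blk; }

        static void access_block(uint64_t addr) {
            uint64_t blk = addr / BLOCK;
            l2[blk % L2_SETS] = blk;                 /* fill L2, no back-invalidate */
            uint64_t s = blk % L1_SETS;
            if (l1[s][0] == blk)      l1_lru[s] = 1; /* L1 hit: other way is LRU    */
            else if (l1[s][1] == blk) l1_lru[s] = 0;
            else { l1[s][l1_lru[s]] = blk; l1_lru[s] ^= 1; }  /* L1 fill (LRU)      */
        }

        static bool check_inclusion(void) {
            for (int s = 0; s < L1_SETS; s++)
                for (int w = 0; w < 2; w++)
                    if (l1[s][w] && !l2_has(l1[s][w]))
                        return false;                /* L1 block missing from L2    */
            return true;
        }

        int main(void) {
            /* blocks 0x1000/64 and 0x41000/64 share an L2 set but fit together
               in one 2-way L1 set, so the second access breaks inclusion here. */
            uint64_t trace[] = { 0x1000, 0x41000 };
            for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++) {
                access_block(trace[i]);
                printf("after access %zu: inclusion %s\n", i,
                       check_inclusion() ? "holds" : "violated");
            }
            return 0;
        }

    Running the sketch reports that inclusion holds after the first access and is violated after the second, because the direct-mapped L2 evicts a block that still resides in the more associative L1.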

    A Tile Selection Algorithm for Data Locality and Cache Interference

    Loop tiling is a well-known compiler transformation that increases data locality, exposes parallelism and reduces synchronization costs. Tiling increases the amount of data reuse that can be exploited by reordering the loop iterations so that accesses to the same data are closer together in time. However, tiled loops often suffer from cache interference in the direct-mapped or low-associativity caches typically found in state-of-the-art microprocessors. A solution to this problem is to choose a tile size that does not exhibit self interference. In this paper, we propose a new tile selection algorithm for eliminating self interference and simultaneously minimizing capacity and cross-interference misses. We have automated the algorithm in the SUIF compiler and used it to generate tiles for a range of problem sizes for three scientific computations. Our experimental results show that the algorithm consistently finds tiles that yield lower miss rates than existing tile selection algorithms…
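
    As a point of reference for the transformation being tuned, here is a standard tiled matrix multiply in C; it is only an illustration of loop tiling in general, not the paper's tile-selection algorithm, and the values of N and T are placeholders for the problem and tile sizes such an algorithm would choose.

        /* Illustrative tiled matrix multiply; N and T are hypothetical, and T
         * stands in for the tile size a selection algorithm would pick to
         * avoid self interference in the cache. */
        #include <stddef.h>

        #define N 512
        #define T 64                         /* tile size (chosen by the compiler) */

        void matmul_tiled(const double A[N][N], const double B[N][N], double C[N][N])
        {
            for (size_t ii = 0; ii < N; ii += T)
                for (size_t kk = 0; kk < N; kk += T)
                    for (size_t jj = 0; jj < N; jj += T)
                        /* work on one T x T tile at a time so the touched rows
                           of A, B and C stay resident in the cache while reused */
                        for (size_t i = ii; i < ii + T && i < N; i++)
                            for (size_t k = kk; k < kk + T && k < N; k++)
                                for (size_t j = jj; j < jj + T && j < N; j++)
                                    C[i][j] += A[i][k] * B[k][j];
        }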

    Compiler-Controlled Caching in Superword Register Files for Multimedia Extension Architectures

    In this paper, we describe an algorithm and implementation of locality optimizations for architectures with instruction sets such as Intel's SSE and Motorola's AltiVec that support operations on superwords, i.e., aggregate objects consisting of several machine words. We treat the large superword register file as a compiler-controlled cache, thus avoiding unnecessary memory accesses by exploiting reuse in superword registers. This research is distinguished from previous work on exploiting reuse in scalar registers because it considers not only temporal but also spatial reuse. As compared to optimizations that exploit reuse in cache, the compiler must also manage replacement, and thus explicitly name registers in the generated code. We describe an implementation of our approach integrated with a compiler that exploits superword-level parallelism (SLP). We present a set of results derived automatically on 4 multimedia kernels and 2 scientific benchmarks. Our results show speedups ranging from 1.3X to 2.8X on the 6 programs as compared to using SLP alone, and we eliminate the majority of memory accesses.
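
    The following hand-written fragment illustrates the underlying idea with SSE intrinsics; it is not the paper's compiler pass, and the function and array names are invented. A superword of coefficients is loaded into a register once and kept ("cached") there for the whole loop instead of being re-read from memory on every iteration.

        /* Hand-written illustration of superword register reuse, not the
         * paper's generated code.  Assumes 16-byte-aligned pointers and
         * n a multiple of 4. */
        #include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_load_ps, ... */

        void scale4(float *out, const float *in, const float *coeff, int n)
        {
            /* Temporal reuse: coeff[0..3] is loaded into a register once and
               held there across the loop rather than reloaded each pass. */
            __m128 c = _mm_load_ps(coeff);
            for (int i = 0; i < n; i += 4) {
                __m128 x = _mm_load_ps(in + i);          /* spatial reuse of a  */
                _mm_store_ps(out + i, _mm_mul_ps(x, c)); /* whole superword     */
            }
        }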

    Evaluation of Architectural Paradigms for Addressing the Processor-Memory Gap

    Many high-performance applications run well below the peak arithmetic performance of the underlying machine, with inefficiencies often attributed to poor memory system behavior. In the context of scientific computing, we examine three emerging processors designed to address the well-known gap between processor and memory performance through the exploitation of data parallelism. The VIRAM architecture uses novel PIM technology to combine embedded DRAM with a vector co-processor for exploiting its large bandwidth potential. The DIVA architecture incorporates a collection of PIM chips as smart-memory coprocessors to a conventional microprocessor, and relies on superword-level parallelism to make effective use of the available memory bandwidth. The Imagine architecture provides a stream-aware memory hierarchy to support the tremendous processing potential of SIMD-controlled VLIW clusters. First, we develop a scalable synthetic probe that allows us to parameterize key performance attributes of VIRAM, DIVA and Imagine while capturing the performance crossover points of these architectures. Next, we present results for scientific kernels with different sets of computational characteristics and memory access patterns. Our experiments allow us to evaluate the strategies employed to exploit data parallelism, isolate the set of application characteristics best suited to each architecture, and show a promising direction towards interfacing leading-edge processor technology with high-end scientific computations.
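
    The probe described above is the authors' own design; as a rough illustration of the idea, the sketch below sweeps the stride of a simple reduction so that the arithmetic performed stays constant while spatial locality degrades, which is the kind of parameterized memory-access probe the passage refers to. The sizes and names (N, probe) are assumptions.

        /* Minimal strided-access probe: every element is touched exactly once
         * regardless of stride, so compute stays fixed while locality varies. */
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>

        #define N (1 << 24)               /* working set: 16M doubles (128 MB) */

        static double probe(const double *a, size_t stride)
        {
            double sum = 0.0;
            for (size_t s = 0; s < stride; s++)       /* cover every element   */
                for (size_t i = s; i < N; i += stride)
                    sum += a[i];                      /* one load, one add     */
            return sum;
        }

        int main(void)
        {
            double *a = malloc(N * sizeof *a);
            if (!a) return 1;
            for (size_t i = 0; i < N; i++) a[i] = 1.0;

            for (size_t stride = 1; stride <= 64; stride *= 2) {
                clock_t t0 = clock();
                double s = probe(a, stride);
                double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
                printf("stride %3zu: %.3f s (checksum %.0f)\n", stride, secs, s);
            }
            free(a);
            return 0;
        }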