34 research outputs found

    Comparative Evaluation and Case Studies of Shared-Memory and Data-Parallel Execution Patterns

    Quantitative performance evaluation of SCI memory hierarchies

    Exploiting cache locality at run-time

    With the increasing gap between processor and memory speeds, memory access has become a major performance bottleneck in modern computer systems. Symmetric Multi-Processor (SMP) systems have recently emerged as a major class of high-performance platforms. Improving the memory performance of parallel applications with dynamic memory-access patterns on SMPs is a hard problem, and solving it is critical to the successful use of SMP systems because dynamic memory-access patterns occur in many real-world applications. This dissertation is aimed at solving this problem.

    Based on a rigorous analysis of cache-locality optimization, we propose a memory-layout oriented run-time technique to exploit the cache locality of parallel loops, and we have implemented it in a run-time system. Using simulation and measurement, we have shown that our run-time approach achieves performance comparable to compiler optimizations for regular applications, whose load balance and cache locality can be well optimized by tiling and other program transformations. For applications with dynamic memory-access patterns, which are usually hard to optimize with static compiler techniques, our approach significantly improves memory performance.

    This dissertation makes several contributions. We present models that characterize the complexity of cache-locality optimization and a framework for solving it. We present an effective estimation technique for memory-access patterns to support efficient locality optimizations and information integration. We present a memory-layout oriented run-time technique for locality optimization, efficient scheduling algorithms that trade off locality against load imbalance, and a detailed performance evaluation of the run-time technique.
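
    To make the locality/load-balance trade-off concrete, here is a minimal sketch in the spirit of such scheduling algorithms (it is not the dissertation's algorithm; the page-to-processor map, the tolerance parameter, and all names are illustrative assumptions). Iterations are first grouped by the "home" processor of the data they touch, then overloaded groups shed work until the imbalance is acceptable.

    ```python
    # Sketch: locality-first loop scheduling with a load-balance correction.
    # Hypothetical parameters throughout; not the dissertation's implementation.

    def schedule_iterations(touched_page, num_iters, num_procs, imbalance_tol=0.25):
        """touched_page(i) -> memory page accessed by iteration i."""
        # Phase 1: locality-first assignment by the page's (assumed) home processor.
        queues = [[] for _ in range(num_procs)]
        for i in range(num_iters):
            home = touched_page(i) % num_procs   # assumed page-to-processor map
            queues[home].append(i)

        # Phase 2: move iterations from the longest to the shortest queue while
        # the imbalance exceeds the tolerance. Each move trades a likely cache
        # miss for better load balance.
        target = num_iters / num_procs
        while True:
            longest = max(queues, key=len)
            shortest = min(queues, key=len)
            if len(longest) - len(shortest) <= max(1, imbalance_tol * target):
                break
            shortest.append(longest.pop())
        return queues

    if __name__ == "__main__":
        # 1000 iterations whose data pages cluster on two of four processors.
        plan = schedule_iterations(lambda i: i // 100, 1000, 4)
        print([len(q) for q in plan])
    ```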

    Efficient techniques to provide scalability for token-based cache coherence protocols

    Cache coherence protocols based on tokens can provide low latency without relying on non-scalable interconnects, thanks to their use of efficient unordered requests. However, when these unordered requests contend for the same memory block, they may cause protocol races. To resolve the races and ensure the completion of all cache misses, token protocols use a starvation prevention mechanism that is inefficient and non-scalable in terms of required storage structures and generated traffic. In addition, token protocols use non-silent invalidations, which increase the latency of write misses in proportion to the system size. All of these problems make token protocols non-scalable. To overcome them and increase scalability, we propose a new starvation prevention mechanism named Priority Requests, which resolves contention with an efficient, elegant, and flexible method based on ordered requests. Furthermore, thanks to Priority Requests, efficient techniques can be applied to limit the storage requirements of the starvation prevention mechanism, to reduce the total traffic generated for managing protocol races, and to reduce the latency of write misses. Thus, the main problems of token protocols can be solved, which in turn improves their efficiency and scalability.

    Cuesta Sáez, BA. (2009). Efficient techniques to provide scalability for token-based cache coherence protocols [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/6024
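
    A toy model of the token-counting substrate may help: a block has T tokens, a reader needs at least one, a writer needs all T, and racing requests are resolved here by a single ordered queue, mirroring the idea behind ordered Priority Requests. This is an assumption-laden sketch, not the thesis's mechanism; the class and node names are hypothetical.

    ```python
    # Toy token-counting model: reads need 1 token, writes need all of them.
    # Contended requests drain from one ordered queue, so every node sees the
    # same resolution order (the intuition behind ordered priority requests).
    from collections import deque

    class Block:
        def __init__(self, tokens):
            self.tokens = {"memory": tokens}     # holder -> token count

        def grant(self, node, needed):
            """Pull tokens toward `node` until it holds `needed` of them."""
            for holder in list(self.tokens):
                have = self.tokens.get(node, 0)
                if holder == node or have >= needed:
                    continue
                take = min(self.tokens[holder], needed - have)
                self.tokens[holder] -= take
                self.tokens[node] = have + take
                if self.tokens[holder] == 0:
                    del self.tokens[holder]
            return self.tokens.get(node, 0) >= needed

    T = 4                                        # one token per potential sharer
    blk = Block(T)
    ordered_requests = deque([("n0", "read"), ("n1", "write"), ("n2", "read")])
    while ordered_requests:
        node, kind = ordered_requests.popleft()
        need = 1 if kind == "read" else T
        ok = blk.grant(node, need)
        print(f"{node} {kind}: {'done' if ok else 'retry'} tokens={blk.tokens}")
    ```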

    Cache-coherent distributed shared memory: perspectives on its development and future challenges

    The Performance of SCI Memory Hierarchies

    This paper presents a simulation-based performance evaluation of a shared-memory multiprocessor using the Scalable Coherent Interface (IEEE 1596). The machines are assembled with one to 16 processors connected in a ring. The multiprocessor's memory hierarchy consists of split primary caches, coherent secondary caches, and memory. For a workload of two parallel loops and three thread-based programs, secondary cache latency has the strongest impact on performance. For programs with high miss ratios, 16-node rings exhibit high network congestion, whereas 4- and 8-node rings perform better. With these same programs, doubling the processor speed yields between 20% and 70% speed gains, with higher gains on the smaller rings. The Scalable Coherent Interface (SCI) is an IEEE standard for high-performance interconnects supporting a physically distributed, logically shared memory [18]. SCI consists of physical interfaces, a logical communication protocol, and a distributed cache-coherence protocol.
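
    A back-of-envelope model (illustrative only; every constant below is an assumption, not a number from the paper's simulator) shows why larger rings congest under high miss ratios: a remote miss crosses about half of a unidirectional ring each way, and per-link load grows with ring size for a fixed per-node miss rate.

    ```python
    # Illustrative analytic model of a unidirectional ring interconnect.
    # link_ns, node_ns, packet size, and link bandwidth are assumed values.

    def remote_miss_latency(nodes, link_ns=10.0, node_ns=8.0):
        """Mean request+response latency across the ring (ns)."""
        mean_hops = nodes / 2.0        # average one-way distance on the ring
        return 2 * mean_hops * (link_ns + node_ns)

    def link_utilization(nodes, misses_per_node_us, bytes_per_miss=80,
                         link_bytes_per_us=1000.0):
        """Fraction of per-link bandwidth consumed by miss traffic."""
        # Each miss packet crosses nodes/2 links on average; traffic from all
        # nodes spreads evenly over the ring's links, so per-link load scales
        # linearly with ring size at a fixed per-node miss rate.
        return misses_per_node_us * (nodes / 2) * bytes_per_miss / link_bytes_per_us

    for n in (4, 8, 16):
        print(f"{n:2d} nodes: {remote_miss_latency(n):6.1f} ns/miss, "
              f"{100 * link_utilization(n, misses_per_node_us=5):6.1f}% link load")
    ```

    With these assumed constants the 16-node ring is already past saturation at the same per-node miss rate that leaves a 4-node ring with headroom, which is consistent with the congestion behavior the paper reports.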

    Software-Oriented Distributed Shared Cache Management for Chip Multiprocessors

    This thesis proposes a software-oriented distributed shared cache management approach for chip multiprocessors (CMPs). Unlike hardware-based schemes, our approach offloads the cache management task to a trace analysis phase, allowing flexible management strategies.

    For single-threaded programs, a static 2D page coloring scheme is proposed that uses oracle trace information to derive an optimal data placement schema for a program. In addition, a dynamic 2D page coloring scheme is proposed as a practical solution that tries to approach the performance of the static scheme. The evaluation results show that the static scheme achieves a 44.7% performance improvement over the conventional shared cache scheme on average, while the dynamic scheme performs 32.3% better than the shared cache scheme.

    For latency-oriented multithreaded programs, a pattern recognition algorithm based on the K-means clustering method is introduced. The algorithm identifies data access patterns that can be used to guide the placement of private data and the replication of shared data. The experimental results show that data placement and replication based on these access patterns lead to a 19% performance improvement over the shared cache scheme. The reduced remote cache accesses and aggregate cache miss rate result in much lower bandwidth requirements for the on-chip network and the off-chip main memory bus.

    Lastly, for throughput-oriented multithreaded programs, we propose a hint-guided data replication scheme that identifies memory instructions of a target program that access data with high reuse. The derived hints are then used to guide data replication at run time. By balancing the amount of data replication against local cache pressure, the proposed scheme has the potential to achieve performance comparable to the best existing hardware-based schemes.

    Our software-oriented shared cache management approach is an effective way to manage program performance on CMPs, and it provides an alternative direction for research on the distributed cache management problem. Given the known difficulties (e.g., scalability and design complexity) of hardware-based schemes, this software-oriented approach may receive serious consideration from researchers in the future. In this respect, the thesis makes valuable contributions to the computer architecture research community.
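
    The placement idea can be sketched in a few lines. This is a deliberately simplified, hypothetical version: the thesis's 2D scheme colors along two dimensions, whereas the sketch below only picks a home cache slice per page from an oracle trace, placing each page in the slice of the tile that touches it most.

    ```python
    # Sketch of trace-driven page placement for a tiled CMP (assumed 4 tiles,
    # one shared-cache slice per tile). Not the thesis's 2D coloring scheme.
    from collections import Counter

    TILES = 4                      # hypothetical 2x2 CMP

    def color_pages(trace):
        """trace: iterable of (tile_id, page) accesses from an oracle trace.
        Returns page -> slice, placing each page in the slice of the tile
        that accesses it most often (the 'static' oracle placement)."""
        counts = {}                # page -> Counter of accessing tiles
        for tile, page in trace:
            counts.setdefault(page, Counter())[tile] += 1
        return {page: c.most_common(1)[0][0] for page, c in counts.items()}

    # Toy trace: tile 0 dominates page 0x10; tiles 1 and 3 split page 0x20.
    trace = [(0, 0x10)] * 9 + [(2, 0x10)] + [(1, 0x20)] * 3 + [(3, 0x20)] * 3
    print({hex(p): s for p, s in color_pages(trace).items()})
    ```

    A dynamic variant would recompute these counts periodically from sampled accesses instead of a full oracle trace, which is the gap the thesis's dynamic scheme tries to close.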

    Performance analysis of the high-performance network in a parallel computing infrastructure, through an HPC application, as a guide for the execution of compute processes

    The objective is to perform a performance analysis of the high-performance network in a parallel computing infrastructure, through an HPC application, in order to produce a guide of best practices for the execution of compute processes on the Quinde I supercomputer of Empresa Pública Siembra E.P. This research work comprises a performance analysis of the InfiniBand high-performance network on the parallel computing architecture of the Quinde I supercomputer, using message passing over the MPI interface with the b_eff application (also called the b_eff benchmark). The analysis is carried out over selected scenarios based on the HPC Challenge Benchmark and on the interconnection of the compute nodes, defining the baseline parameters for analyzing network performance in parallel environments. Once the data have been obtained with b_eff, the scenarios are analyzed and the results are compared against another high-performance infrastructure; in addition, the b_eff results are compared, as a reference, against data from the Linpack test. Part of this research applies the parameters used in the network performance analysis to a practical case, in order to study the behavior of compute processes in a real environment within a scientific research project. After the scenario analysis, the data comparison, and the practical case, a guide of best practices is presented for the execution of compute processes on parallel architectures, with the goal of obtaining the best performance from the InfiniBand high-performance network in any research project carried out on the Quinde I supercomputer.
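
    For readers unfamiliar with this kind of measurement, a minimal MPI ping-pong probe captures the quantity being studied: point-to-point bandwidth over the interconnect. This is not the b_eff benchmark itself (b_eff averages over many message sizes and ring/random communication patterns); it is a small sketch using mpi4py, and it requires an MPI runtime to execute.

    ```python
    # Minimal ping-pong bandwidth probe between ranks 0 and 1.
    # Run with: mpirun -np 2 python pingpong.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    REPS = 50

    for size in (1 << 10, 1 << 16, 1 << 20):       # 1 KiB .. 1 MiB messages
        buf = np.zeros(size, dtype=np.uint8)
        comm.Barrier()
        t0 = MPI.Wtime()
        for _ in range(REPS):
            if rank == 0:
                comm.Send(buf, dest=1)             # ping
                comm.Recv(buf, source=1)           # pong
            elif rank == 1:
                comm.Recv(buf, source=0)
                comm.Send(buf, dest=0)
        if rank == 0:
            # Each repetition moves `size` bytes out and back.
            mb_s = 2 * size * REPS / (MPI.Wtime() - t0) / 1e6
            print(f"{size:>8} B: {mb_s:10.1f} MB/s")
    ```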

    Eureka: a distributed shared memory system based on the Lazy Data Merging consistency model

    Distributed Shared Memory (DSM) provides an abstraction of shared memory on a network of workstations. Problems with existing DSM systems are a lack of portability, due to compiler and/or operating system modification requirements, and reduced performance, due to significant synchronization and communication costs compared to their message-passing counterparts (e.g., PVM and MPI). Our approach was to introduce a new DSM consistency model, Lazy Data Merging (LDM), which extends Data Merging (DM). LDM is optimized for software runtime implementations and differs from DM by 'lazily' placing data updates across the communication network only when they are required. It is our belief that LDM can significantly reduce communication costs, particularly for applications that make extensive use of locks. We have completed the design of "Eureka", a prototype DSM system that provides a software implementation of the LDM consistency model. To ensure portability and efficiency, we use only standard UNIX system calls and a publicly available software thread package, Cthreads, from the University of Utah. Furthermore, we have implemented and tested some of Eureka's core components, specifically the set of communication and hybrid (invalidate/update) coherence primitives, which are essential for follow-on work in building the complete DSM system. The question of efficiency remains open, because we did not compare Eureka with other DSM implementations.

    http://archive.org/details/eurekadistribute1094535209
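
    The 'lazy' placement idea can be illustrated with a toy sketch. The assumptions are loud ones: LDM's actual merge rules are richer than this, and the class and method names below are hypothetical. The sketch only shows the core intuition that writes buffered at lock release cross the network once, coalesced, when the next node acquires the same lock.

    ```python
    # Toy sketch of lazy update propagation tied to lock transfer.
    # Not Eureka's implementation; illustrates deferred, coalesced updates.

    class LazyDSM:
        def __init__(self):
            self.pending = {}           # lock -> buffered {addr: value} diffs

        def release(self, lock, local_writes):
            # Defer: record the diff locally, send nothing yet.
            self.pending.setdefault(lock, {}).update(local_writes)

        def acquire(self, lock, node_memory):
            # Only now do updates cross the network, merged into the acquirer.
            diff = self.pending.pop(lock, {})
            node_memory.update(diff)    # apply the coalesced updates
            return diff

    dsm = LazyDSM()
    dsm.release("L", {"x": 1, "y": 2})  # node A writes under lock L
    dsm.release("L", {"y": 3})          # node A overwrites y, still lazy
    mem_b = {}
    sent = dsm.acquire("L", mem_b)      # node B acquires L: one merged diff
    print(sent, mem_b)                  # {'x': 1, 'y': 3} {'x': 1, 'y': 3}
    ```

    Note how the two releases collapse into a single transfer and the intermediate value of y never crosses the network, which is where the communication savings for lock-heavy applications would come from.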