238 research outputs found

    An O(1) time complexity software barrier

    Get PDF
    technical reportAs network latency rapidly approaches thousands of processor cycles and multiprocessors systems become larger and larger, the primary factor in determining a barrier algorithm?s performance is the number of serialized network latencies it requires. All existing barrier algorithms require at least O(log N) round trip message latencies to perform a single barrier operation on an N-node shared memory multiprocessor. In addition, existing barrier algorithms are not well tuned in terms of how they interact with modern shared memory systems, which leads to an excessive number of message exchanges to signal barrier completion. The contributions of this paper are threefold. First, we identify and quantitatively analyze the performance deficiencies of conventional barrier implementations when they are executed on real (non-idealized) hardware. Second, we propose a queue-based barrier algorithm that has effectively O(1)time complexity as measured in round trip message latencies. Third, by exploiting a hardware write-update (PUT) mechanism for signaling, we demonstrate how carefully matching the barrier implementation to the way that modern shared memory systems operate can improve performance dramatically. The resulting optimized algorithm only costs one round trip message latency to perform a barrier operation across N processors. Using a cycle-accurate execution-driven simulator of a future-generation SGI multiprocessor, we show that the proposed queue-based barrier outperforms conventional barrier implementations based on load-linked/storeconditional instructions by a factor of 5.43 (on 4 processors) to 93.96 (on 256 processors)

    Directory Based Cache Coherency Protocols for Shared Memory Multiprocessors

    Get PDF
    Directory based cache coherency protocols can be used to build large scale, weakly ordered, shared memory multiprocessors. The salient feature of these protocols is that they are interconnection network independent, making them more scaleable than snoopy bus protocols. The major criticisms of previously defined directory protocols point to the size of memory heeded to store the directory and the amount of communication across the interconnection network required to maintain coherence. This thesis tries solving these problems by changing the entry format of the global table, altering the architecture of the global table, and developing new protocols. Some alternative directory entry formats are described, including a special entry format for implementing queueing semaphores. Evaluation of the various entry formats is done with probabilistic models of shared cache blocks and software simulation. A variable length global table organization is presented which can be used to reduce the size of the global table, regardless of the entry format. Its performance is analyzed using software simulation. A protocol which maintains a linked list of processors which have a particular block cached is presented. Several variations of this protocol induce less interconnection network traffic than traditional protocols

    The M-Machine Multicomputer

    Get PDF
    The M-Machine is an experimental multicomputer being developed to test architectural concepts motivated by the constraints of modern semiconductor technology and the demands of programming systems. The M- Machine computing nodes are connected with a 3-D mesh network; each node is a multithreaded processor incorporating 12 function units, on-chip cache, and local memory. The multiple function units are used to exploit both instruction-level and thread-level parallelism. A user accessible message passing system yields fast communication and synchronization between nodes. Rapid access to remote memory is provided transparently to the user with a combination of hardware and software mechanisms. This paper presents the architecture of the M-Machine and describes how its mechanisms maximize both single thread performance and overall system throughput

    Lock-Based cache coherence protocol for chip multiprocessors

    Get PDF
    Chip multiprocessor (CMP) is replacing the superscalar processor due to its huge performance gains in terms of processor speed, scalability, power consumption and economical design. Since the CMP consists of multiple processor cores on a single chip usually with share cache resources, process synchronization is an important issue that needs to be dealt with. Synchronization is usually done by the operating system in case of shared memory multiprocessors (SMP). This work studies the effect of performing synchronization by the hardware through its integration with the cache coherence protocol. A novel cache coherence protocol, called Lock-based Cache Coherence Protocol (LCCP) was designed and its performance was compared with MESI cache coherence protocol. Experiments were performed by a functional multiprocessor simulator, MP_Simplesim, that was modified to do this work. A novel interconnection network was also designed and tested in terms of performance against the traditional bus approach by means of simulation

    RICH: implementing reductions in the cache hierarchy

    Get PDF
    Reductions constitute a frequent algorithmic pattern in high-performance and scientific computing. Sophisticated techniques are needed to ensure their correct and scalable concurrent execution on modern processors. Reductions on large arrays represent the most demanding case where traditional approaches are not always applicable due to low performance scalability. To address these challenges, we propose RICH, a runtime-assisted solution that relies on architectural and parallel programming model extensions. RICH updates the reduction variable directly in the cache hierarchy with the help of added in-cache functional units. Our programming model extensions fit with the most relevant parallel programming solutions for shared memory environments like OpenMP. RICH does not modify the ISA, which allows the use of algorithms with reductions from pre-compiled external libraries. Experiments show that our solution achieves the performance improvements of 11.2% on average, compared to the state-of-the-art hardware-based approaches, while it introduces 2.4% area and 3.8% power overhead.This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), and by Generalitat de Catalunya (contracts 2017- SGR-1414 and 2017-SGR-1328). V. Dimić has been partially supported by the Agency for Management of University and Research Grants (AGAUR) of the Government of Catalonia under Ajuts per a la contractació de personal investigador novell fellowship number 2017 FI_B 00855. M. Moretó has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramón y Cajal fellowship number RYC-2016-21104. M. Casas has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2017-23269. This manuscript has been co-authored by National Technology & Engineering Solutions of Sandia, LLC. under Contract No. DENA0003525 with the U.S. Department of Energy/National Nuclear Security AdministrationPeer ReviewedPostprint (author's final draft

    Performance Aspects of Synthesizable Computing Systems

    Get PDF
    corecore