
    A comparison of software and hardware synchronization mechanisms for distributed shared memory multiprocessors

    Efficient synchronization is an essential component of parallel computing. The designers of traditional multiprocessors have included hardware support only for simple operations such as compare-and-swap and load-linked/store-conditional, while higher-level synchronization primitives such as locks, barriers, and condition variables have been implemented in software [9,14,15]. With the advent of directory-based distributed shared memory (DSM) multiprocessors with significant flexibility in their cache controllers [7,12,17], it is worth considering whether this flexibility should be used to support higher-level synchronization primitives in hardware. In particular, as part of maintaining data consistency, these architectures maintain lists of processors holding a copy of a given cache line, which is most of the hardware needed to implement distributed locks. We studied two software and four hardware implementations of locks and found that hardware implementation can reduce lock acquire and release times by 25%-94% compared to well-tuned software locks. In terms of macrobenchmark performance, hardware locks reduce application running times by up to 75% on a synthetic benchmark with heavy lock contention and by 3%-6% on a suite of SPLASH-2 benchmarks. In addition, emerging cache coherence protocols promise to increase the time spent synchronizing relative to the time spent accessing shared data, and our study shows that hardware locks can reduce SPLASH-2 execution times by 10%-13% if the time spent accessing shared data is small. Although the overall performance impact of hardware lock mechanisms varies tremendously depending on the application, the added hardware complexity on a flexible architecture like FLASH [12] or Avalanche [7] is negligible, and thus hardware support for high-level synchronization operations should be provided.
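
    For reference, the software baseline in studies like this one is typically built from exactly the primitives named above. The C11 sketch below shows a test-and-test-and-set spinlock layered on compare-and-swap; it is an illustrative textbook lock, not the authors' tuned implementation, and all names are hypothetical.

        #include <stdatomic.h>

        /* Illustrative test-and-test-and-set spinlock built on compare-and-swap,
         * the style of software lock such papers compare against hardware locks. */
        typedef struct {
            atomic_int held;   /* 0 = free, 1 = held */
        } spinlock_t;

        static void spinlock_init(spinlock_t *l) {
            atomic_init(&l->held, 0);
        }

        static void spinlock_acquire(spinlock_t *l) {
            for (;;) {
                /* Spin on a plain read first (the "test" phase) to avoid
                 * hammering the cache line with read-modify-write traffic. */
                while (atomic_load_explicit(&l->held, memory_order_relaxed) != 0)
                    ;
                int expected = 0;
                /* Attempt the atomic compare-and-swap (the "set" phase). */
                if (atomic_compare_exchange_weak_explicit(
                        &l->held, &expected, 1,
                        memory_order_acquire, memory_order_relaxed))
                    return;
            }
        }

        static void spinlock_release(spinlock_t *l) {
            atomic_store_explicit(&l->held, 0, memory_order_release);
        }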

    Parallel hierarchical radiosity rendering

    The radiosity equation is examined and is found to contain a previously unexploited symmetry. This symmetry is formalized, and a solution method previously unused in the field of computer graphics (conjugate gradients) is shown to be superior to all methods currently in use. A detailed analysis of all solution techniques previously applied to the radiosity problem is conducted, and results are presented.

    So-called hierarchical methods have reduced the operational complexity of the N-body problem from O(N²) to O(N log N), assuming a preset error tolerance. An algorithm following the same basic tenets has been applied to radiosity rendering by other researchers, and has reduced the operational complexity from O(N²) to (arguably) O(N).

    Shortcomings in the state-of-the-art hierarchical radiosity method are pointed out, and enhancements are offered. A consistent treatment of the various types of error is found to be absent from present methods. Catastrophic error is possible in the visibility assessment between two polygons. A self-consistency check is possible during the solution process, but is never exploited.

    Until now, supercomputer-class computers have not been used to solve radiosity problems at a production-quality level, even though realistic image synthesis has always been a prodigious consumer of computer time. A state-of-the-art hierarchical radiosity code is implemented on an nCUBE-2 parallel computer and discussed in detail. The algorithm is found to have ample sources of parallelism, in both data and operational modes. Its performance is analyzed in detail.

    The hierarchical method has only been applied to realistic image synthesis since 1991. Not surprisingly, many avenues of further research are open. Some are pointed out, including: analytic determination of coupling factors, quantifying discretization error, incorporating specular light reflection modes into the hierarchical treatment, and exploring what other important physical problems might benefit from the hierarchical approach.
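
    Since the thesis credits conjugate gradients as the superior solver once the radiosity system's symmetry is exploited, a textbook conjugate gradient iteration for a symmetric positive-definite system A x = b is sketched below in C; the dense matrix storage, fixed tolerance, and function names are illustrative assumptions, not the thesis code.

        #include <math.h>
        #include <stdlib.h>

        /* Dot product of two length-n vectors. */
        static double dot(const double *u, const double *v, int n) {
            double s = 0.0;
            for (int i = 0; i < n; i++) s += u[i] * v[i];
            return s;
        }

        /* Dense matrix-vector product y = A x (row-major A). */
        static void matvec(const double *A, const double *x, double *y, int n) {
            for (int i = 0; i < n; i++) {
                y[i] = 0.0;
                for (int j = 0; j < n; j++) y[i] += A[i * n + j] * x[j];
            }
        }

        /* Textbook conjugate gradient for symmetric positive-definite A x = b,
         * starting from the initial guess already stored in x. */
        void conjugate_gradient(const double *A, const double *b, double *x,
                                int n, int max_iter, double tol) {
            double *r  = malloc(n * sizeof *r);   /* residual b - A x  */
            double *p  = malloc(n * sizeof *p);   /* search direction  */
            double *Ap = malloc(n * sizeof *Ap);

            matvec(A, x, Ap, n);
            for (int i = 0; i < n; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
            double rr = dot(r, r, n);

            for (int k = 0; k < max_iter && sqrt(rr) > tol; k++) {
                matvec(A, p, Ap, n);
                double alpha = rr / dot(p, Ap, n);          /* step length   */
                for (int i = 0; i < n; i++) { x[i] += alpha * p[i]; r[i] -= alpha * Ap[i]; }
                double rr_new = dot(r, r, n);
                double beta = rr_new / rr;                  /* direction mix */
                for (int i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
                rr = rr_new;
            }
            free(r); free(p); free(Ap);
        }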

    Tightly-Coupled Multiprocessing for a Global Illumination Algorithm

    A prevailing trend in computer graphics is the demand for increasingly realistic global illumination models and algorithms. Although the computational power of uniprocessors is increasing, it is clear that much greater computational power is required to achieve satisfactory throughput. The obvious next step is to employ parallel processing. The advent of affordable, tightly-coupled multiprocessors makes such an approach widely available for the first time. We propose a tightly-coupled parallel decomposition of FIAT, a global illumination algorithm based on space subdivision and power balancing that we have recently developed. This algorithm is somewhat ambitious and severely strains existing uniprocessor environments. We discuss techniques for reducing memory contention and maximising parallelism. We also present empirical data on the actual performance of our parallel solution. Since the model of parallel computation that we have employed is likely to persist for quite some time, our techniques are applicable to other algorithms based on space subdivision.

    Dynamic Energy Management for Chip Multi-processors under Performance Constraints

    We introduce a novel algorithm for dynamic energy management (DEM) under performance constraints in chip multi-processors (CMPs). Using the novel concept of a delayed-instructions count, performance-loss estimates are calculated at the end of each control period for each core. In addition, a Kalman-filtering-based approach is employed to predict the workload in the next control period, for which voltage-frequency pairs must be selected. This selection is done with a novel dynamic voltage and frequency scaling (DVFS) algorithm whose objective is to reduce energy consumption without degrading performance beyond a user-set threshold. Using our customized Sniper-based CMP system simulation framework, we demonstrate the effectiveness of the proposed algorithm on a variety of benchmarks for 16-core and 64-core network-on-chip based CMP architectures. Simulation results show consistent energy savings across the board. Our work can also be read as an investigation of the energy reduction achievable via DVFS, with Kalman-filter workload prediction, under different performance-penalty thresholds.
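
    To make the control loop concrete, the C sketch below pairs a scalar Kalman filter (predicting next-period workload from the workload just measured, under a random-walk model) with a toy frequency-selection table. All constants, the selection rule, and the names are illustrative assumptions, not the paper's tuned DEM policy.

        /* Scalar Kalman filter state for one core's workload estimate. */
        typedef struct {
            double x;  /* workload estimate (e.g., instructions per period) */
            double p;  /* estimate variance */
            double q;  /* process noise     */
            double r;  /* measurement noise */
        } kalman_t;

        /* Fold in the workload measured over the control period that just
         * ended; return the prediction for the next period (random-walk model). */
        static double kalman_step(kalman_t *kf, double measured) {
            kf->p += kf->q;                      /* predict variance   */
            double k = kf->p / (kf->p + kf->r);  /* Kalman gain        */
            kf->x += k * (measured - kf->x);     /* correct estimate   */
            kf->p *= (1.0 - k);                  /* update variance    */
            return kf->x;                        /* next-period guess  */
        }

        /* Pick the lowest frequency (GHz) whose capacity covers the predicted
         * demand; the slack margin stands in for the user-set performance
         * threshold. capacity_at_1ghz is work completed per period at 1 GHz. */
        static double select_frequency(double predicted_load,
                                       double capacity_at_1ghz, double slack) {
            static const double freqs[] = {0.8, 1.2, 1.6, 2.0, 2.4};
            for (int i = 0; i < 5; i++) {
                if (freqs[i] * capacity_at_1ghz >= predicted_load * (1.0 + slack))
                    return freqs[i];
            }
            return freqs[4]; /* saturate at the highest frequency */
        }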

    Energy Efficient Network-on-Chip Architectures for Many-Core Near-Threshold Computing System

    Near-threshold computing (NTC) has opened up a promising design space for energy-efficient computing. However, it is still plagued by sub-optimal system performance. Application characteristics and hardware non-idealities of conventional architectures (those optimized for nominal voltage) prevent us from fully leveraging the potential of NTC systems. Increasing the computational core count still forms the bedrock of a multitude of contemporary works that address the problem of performance degradation in NTC systems. However, these works do not categorically address the shortcomings of the conventional on-chip interconnect fabric in a many-core environment. In this work, we quantitatively demonstrate the performance bottleneck created by a conventional NoC architecture in many-core NTC systems. To reclaim the performance lost to a sub-optimal NoC in many-core NTC systems, we propose BoostNoC, a power-efficient, multi-layered network-on-chip architecture. BoostNoC improves system performance by nearly 2× over a conventional NTC system, while largely sustaining its energy benefits. Further, capitalizing on application characteristics, we propose two BoostNoC derivative designs, (i) PG BoostNoC and (ii) Drowsy BoostNoC, which improve energy efficiency by 1.4× and 1.37×, respectively, over a conventional NTC system.

    Validation of Weak Form Thermal Analysis Algorithms Supporting Thermal Signature Generation

    Extremization of a weak form of the continuum energy conservation principle differential equation naturally implements fluid convection and radiation as flux (Robin) boundary conditions associated with unsteady heat transfer. Combining a spatial semi-discretization via finite element trial space basis functions with time-accurate integration generates a totally node-based algebraic statement for computing. Closure for gray-body radiation is a newly derived node-based radiosity formulation generating piecewise discontinuous solutions, while that for natural-forced-mixed convection heat transfer is extracted from the literature. Algorithm performance, mathematically predicted by asymptotic convergence theory, is subsequently validated with data obtained in 24-hour diurnal field experiments for flat plates of distinct thicknesses and a cube-shaped three-dimensional object.
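
    As a minimal illustration of the recipe described above (finite element trial functions in space, time-accurate integration, and a convective Robin flux entering through the boundary term of the weak form), the C sketch below solves 1-D transient conduction on a rod with backward Euler. The 1-D setting, material constants, and lumped mass matrix are assumptions for brevity, not the validated algorithm.

        #include <stdio.h>

        #define N 21   /* nodes on the 1-D rod (illustrative problem size) */

        int main(void) {
            const double L = 1.0, k = 1.0, rho_c = 1.0;  /* geometry, material   */
            const double hconv = 10.0, T_inf = 300.0;    /* Robin: -k T' = hconv*(T - T_inf) */
            const double dt = 0.01, h = L / (N - 1);
            double T[N], lo[N], di[N], up[N], rhs[N];

            for (int i = 0; i < N; i++) T[i] = 400.0;    /* initial temperature  */

            for (int step = 0; step < 100; step++) {
                /* Assemble (M/dt + K) with lumped mass and linear elements. */
                for (int i = 0; i < N; i++) {
                    double m = rho_c * h * ((i == 0 || i == N - 1) ? 0.5 : 1.0);
                    di[i] = m / dt;
                    lo[i] = up[i] = 0.0;
                    rhs[i] = m / dt * T[i];
                }
                for (int e = 0; e < N - 1; e++) {  /* element stiffness (k/h)*[1 -1; -1 1] */
                    di[e]     += k / h;  di[e + 1] += k / h;
                    up[e]     -= k / h;  lo[e + 1] -= k / h;
                }
                di[N - 1]  += hconv;          /* Robin BC enters as a boundary term */
                rhs[N - 1] += hconv * T_inf;  /* ...and a boundary load             */

                /* Thomas algorithm: solve the tridiagonal system in place. */
                for (int i = 1; i < N; i++) {
                    double w = lo[i] / di[i - 1];
                    di[i]  -= w * up[i - 1];
                    rhs[i] -= w * rhs[i - 1];
                }
                T[N - 1] = rhs[N - 1] / di[N - 1];
                for (int i = N - 2; i >= 0; i--)
                    T[i] = (rhs[i] - up[i] * T[i + 1]) / di[i];
            }
            printf("T(0)=%.2f  T(L)=%.2f\n", T[0], T[N - 1]);
            return 0;
        }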