
    When parallel speedups hit the memory wall

    After Amdahl's trailblazing work, many other authors proposed analytical speedup models, but none considered the limiting effect of the memory wall. These models exploited aspects such as problem-size variation, memory size, communication overhead, and synchronization overhead, but assumed data-access delays to be constant. In practice, such delays vary, for example, with the number of cores used and with the ratio between processor and memory frequencies. Given the large number of operating-frequency and core-count configurations that current architectures offer, speedup models that describe the variation across these configurations are highly desirable for off-line or on-line scheduling decisions. This work proposes new parallel speedup models that account for variations of the average data-access delay in order to describe the limiting effect of the memory wall on parallel speedups. Analytical results indicate that the proposed models capture the desired behavior, and experiments on real hardware validate them. Additionally, we show that when accounting for parameters that reflect intrinsic characteristics of the applications, such as degree of parallelism and susceptibility to the memory wall, our proposal has significant advantages over machine-learning-based modeling. Moreover, our experiments show that conventional machine-learning modeling, besides being a black-box approach, needs about one order of magnitude more measurements to reach the level of accuracy achieved by our models.
    Comment: 24 pages
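    As a rough illustration of the idea (not the paper's actual formulation), the sketch below contrasts the classic Amdahl speedup with a hypothetical variant in which the average data-access delay grows with the number of cores; the function names, the linear contention term, and all parameter values are assumptions made for illustration only.

    # Illustrative sketch: Amdahl-style speedup vs. a memory-wall-aware variant
    # in which the average data-access delay grows with the core count.
    # (Hypothetical model form; not the paper's equations.)

    def amdahl_speedup(p, f):
        """Classic Amdahl speedup for parallel fraction f on p cores."""
        return 1.0 / ((1.0 - f) + f / p)

    def memory_wall_speedup(p, f, delay_share, contention=0.05):
        """Speedup when delay_share of single-core time is memory waiting and
        that delay is assumed to grow linearly with the core count."""
        compute = 1.0 - delay_share
        memory = delay_share * (1.0 + contention * (p - 1))  # grows with p
        return 1.0 / (compute * ((1.0 - f) + f / p) + memory)

    if __name__ == "__main__":
        for cores in (1, 2, 4, 8, 16, 32):
            print(cores,
                  round(amdahl_speedup(cores, 0.95), 2),
                  round(memory_wall_speedup(cores, 0.95, 0.2), 2))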

    Examining subgrid models of supermassive black holes in cosmological simulation

    While supermassive black holes (SMBHs) play an important role in galaxy and cluster evolution, at present they can only be included in large-scale cosmological simulations via subgrid techniques. However, these subgrid models have not been studied in a systematic fashion. Using a newly developed fast, parallel spherical overdensity halo finder built into the simulation code FLASH, we perform a suite of dark-matter-only cosmological simulations to study the effects of subgrid model choice on the relations between SMBH mass and dark matter halo mass and velocity dispersion. We examine three aspects of SMBH subgrid models: the choice of initial black hole seed mass, the test for merging two black holes, and the frequency with which the subgrid model is applied. We also examine the role that merging alone can play in determining the relations, ignoring the complicating effects of SMBH-driven accretion and feedback. We find that the choice of subgrid model can dramatically affect the black hole merger rate, the cosmic SMBH mass density, and the low-redshift relations to halo properties. We also find that it is possible to reproduce observations of the low-redshift relations without accretion and feedback, depending on the choice of subgrid model.
    Comment: 12 pages, 12 figures, revised from referee comments, accepted by Ap
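    To make the three subgrid choices above concrete, here is a minimal, purely illustrative Python sketch of a seed-mass rule, a distance-based merger test, and an application-frequency switch; the thresholds, data layout, and function names are assumptions and do not reflect the FLASH implementation.

    # Hypothetical SMBH subgrid rules: seeding, merger test, application cadence.

    from dataclasses import dataclass

    @dataclass
    class Halo:
        mass: float                 # halo mass [Msun]
        position: tuple             # comoving position [Mpc]
        bh_mass: float = 0.0        # hosted SMBH mass [Msun], 0 if unseeded

    def seed_black_hole(halo, halo_threshold=1e11, seed_mass=1e5):
        """Seed a fixed-mass SMBH in any sufficiently massive, unseeded halo."""
        if halo.bh_mass == 0.0 and halo.mass >= halo_threshold:
            halo.bh_mass = seed_mass

    def should_merge(halo_a, halo_b, merge_radius=0.05):
        """Merger test: merge the SMBHs of halos closer than merge_radius."""
        d2 = sum((a - b) ** 2 for a, b in zip(halo_a.position, halo_b.position))
        return d2 <= merge_radius ** 2

    def apply_subgrid_model(halos, step, every_n_steps=10):
        """Application frequency: run the subgrid rules only every n-th step."""
        if step % every_n_steps:
            return
        for halo in halos:
            seed_black_hole(halo)
        # a pairwise merger pass over `halos` would follow here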

    Thread-Modular Static Analysis for Relaxed Memory Models

    We propose a memory-model-aware static program analysis method for accurately analyzing the behavior of concurrent software running on processors with weak consistency models such as x86-TSO, SPARC-PSO, and SPARC-RMO. At the center of our method is a unified framework for deciding the feasibility of inter-thread interferences, which avoids propagating spurious data flows during static analysis and thus boosts the performance of the static analyzer. We formulate the checking of interference feasibility as a set of Datalog rules that are both efficiently solvable and general enough to capture a range of hardware-level memory models. Compared to existing techniques, our method can significantly reduce the number of bogus alarms as well as unsound proofs. We implemented the method and evaluated it on a large set of multithreaded C programs. Our experiments show that the method significantly outperforms state-of-the-art techniques in terms of accuracy, with only moderate run-time overhead.
    Comment: revised version of the ESEC/FSE 2017 paper
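    The paper casts these feasibility checks as Datalog rules; the simplified Python sketch below only illustrates the underlying intuition with one such check, namely that an interference in which a load reads from a store is infeasible whenever the load is already ordered before the store by the must-happen-before edges the memory model admits. The event encoding and the edge relation are hypothetical.

    # Simplified interference-feasibility check (illustration only).

    from itertools import product

    def must_happen_before_closure(edges):
        """Transitive closure of the given (before, after) ordering edges."""
        closure = set(edges)
        changed = True
        while changed:
            changed = False
            for (a, b), (c, d) in product(list(closure), repeat=2):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
        return closure

    def interference_feasible(store, load, edges):
        """store -> load is a spurious interference if the load is forced to
        happen before the store under the memory model's ordering edges."""
        return (load, store) not in must_happen_before_closure(edges)

    # Under a weaker model such as TSO, fewer program-order edges are added to
    # `edges` than under sequential consistency, so more interferences remain
    # feasible and must be propagated by the analysis.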

    Dynamic Power Management for Neuromorphic Many-Core Systems

    This work presents a dynamic power management architecture for neuromorphic many-core systems such as SpiNNaker. A fast dynamic voltage and frequency scaling (DVFS) technique is presented which allows the processing elements (PEs) to change their supply voltage and clock frequency individually and autonomously within less than 100 ns. This capability is exploited by the neuromorphic simulation software flow, which sets the performance level (PL) of each PE based on its actual workload within each simulation cycle. A test chip in 28 nm SLP CMOS technology has been implemented. It includes 4 PEs which can be scaled from 0.7 V to 1.0 V with frequencies from 125 MHz to 500 MHz at three distinct PLs. Measurements of three neuromorphic benchmarks show that the total PE power consumption can be reduced by 75%, with an 80% reduction of baseline power and a 50% reduction of the energy per neuron and synapse computation, all while maintaining temporary peak system performance to achieve biological real-time operation of the system. A numerical model of this power management scheme is derived, which allows DVFS architecture exploration for neuromorphic systems. The proposed technique is to be used in the second-generation SpiNNaker neuromorphic many-core system.
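    A minimal numerical sketch of the per-cycle performance-level choice is given below; the 0.85 V / 250 MHz intermediate point, the effective capacitance, and the workload figure are assumptions for illustration, with only the 0.7 V / 125 MHz and 1.0 V / 500 MHz endpoints taken from the abstract.

    # Hypothetical per-cycle DVFS performance-level (PL) selection and a
    # standard switching-power estimate P = C_eff * V^2 * f.

    PERFORMANCE_LEVELS = [      # (supply voltage [V], clock frequency [Hz])
        (0.70, 125e6),
        (0.85, 250e6),          # assumed intermediate PL
        (1.00, 500e6),
    ]

    def choose_pl(cycles_needed, cycle_time=1e-3):
        """Pick the lowest PL that finishes the workload within one 1 ms
        simulation cycle (biological real time); saturate at the top PL."""
        for volt, freq in PERFORMANCE_LEVELS:
            if cycles_needed / freq <= cycle_time:
                return volt, freq
        return PERFORMANCE_LEVELS[-1]

    def dynamic_power(volt, freq, c_eff=1e-9):
        """Dynamic switching power for the chosen operating point."""
        return c_eff * volt ** 2 * freq

    if __name__ == "__main__":
        volt, freq = choose_pl(cycles_needed=2e5)   # light cycle -> low PL
        print(volt, freq, dynamic_power(volt, freq))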

    Implementation of digital pheromones in PSO accelerated by commodity Graphics Hardware

    In this paper, a model for Graphics Processing Unit (GPU) implementation of Particle Swarm Optimization (PSO) using digital pheromones to coordinate swarms within n-dimensional design spaces is presented. Previous work by the authors demonstrated the capability of digital pheromones within PSO to search n-dimensional design spaces with improved accuracy, efficiency, and reliability in both serial and parallel computing environments using traditional CPUs. Modern GPUs have proven to outperform CPUs in floating-point throughput thanks to their inherently data-parallel architecture and higher memory bandwidth. The advent of programmable graphics hardware has further provided a suitable platform for scientific computing, particularly in the field of design optimization. However, the data-parallel architecture of GPUs requires a specialized formulation to leverage its computational capabilities. When the objective function computations are appropriately formulated for GPUs, it is theorized that the solution efficiency (speed) can be significantly increased while maintaining solution accuracy. The developed method is tested on a number of multi-modal unconstrained test problems, and the results are presented in this paper.
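    As a hedged illustration of the data-parallel formulation, the NumPy sketch below evaluates the whole swarm with vectorized operations (standing in for the GPU kernels) and adds a pheromone-attraction term to the standard PSO velocity update; the coefficients and the way the pheromone target is formed are assumptions, not the authors' exact scheme.

    # Vectorized PSO step with an extra digital-pheromone attraction term.

    import numpy as np

    def pso_step(pos, vel, pbest, gbest, pheromone,
                 w=0.7, c1=1.5, c2=1.5, c3=1.0,
                 rng=np.random.default_rng(0)):
        """One update of all particles: inertia + cognitive + social terms
        plus a pull toward an assumed pheromone target location."""
        r1, r2, r3 = (rng.random(pos.shape) for _ in range(3))
        vel = (w * vel
               + c1 * r1 * (pbest - pos)        # toward personal best
               + c2 * r2 * (gbest - pos)        # toward global best
               + c3 * r3 * (pheromone - pos))   # toward pheromone target
        return pos + vel, vel

    def sphere(pos):
        """Example objective, evaluated for every particle at once."""
        return np.sum(pos ** 2, axis=1)

    # pos has shape (n_particles, n_dims), so both the objective and the
    # update run over the entire swarm in single vectorized calls, which is
    # the property the GPU formulation exploits.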