267 research outputs found

    Multicore-optimized wavefront diamond blocking for optimizing stencil updates

    The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, especially the main memory interface. In this work we combine the ideas of multi-core wavefront temporal blocking and diamond tiling to arrive at stencil update schemes that show large reductions in memory pressure compared to existing approaches. The resulting schemes show performance advantages in bandwidth-starved situations, which are exacerbated by the high bytes per lattice update case of variable coefficients. Our thread groups concept provides a controllable trade-off between concurrency and memory usage, shifting the pressure between the memory interface and the CPU. We present performance results on a contemporary Intel processor

    Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory

    New algorithms and optimization techniques are needed to balance the accelerating trend towards bandwidth-starved multicore chips. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach that makes explicit use of shared caches in multicore environments and minimizes synchronization and boundary overhead. For clusters of shared-memory nodes we demonstrate how temporal blocking can be employed successfully in a hybrid shared/distributed-memory environment. Comment: 9 pages, 6 figures

    National Natural Science Foundation of China

    The advent of multi-core/many-core chip technology offers both an extraordinary opportunity and a profound challenge. In particular, computer architects and system software designers have a unique opportunity to introduce new architecture features as well as adequate compiler technology; together they may have a profound impact. This paper presents a case study (using the 1D stencil computation) of compiler-amenable performance optimization techniques on the many-core architecture Godson-T. The Godson-T architecture has several unique features that were chosen for this study: (1) chip-level globally addressable memory, in particular the scratchpad memories (SPM) local to the processing cores; (2) fine-grain memory-based synchronization (e.g. full-empty bits). Leveraging state-of-the-art performance optimization methods for 1D stencil parallelization (e.g. timed tiling and variants), we developed and implemented a number of many-core-based optimizations for Godson-T. Our experimental study shows good performance improvements in both execution-time speedup and scalability, validates the value of globally accessible SPM and the fine-grain synchronization mechanism (full-empty bits) on Godson-T, and provides some useful guidelines for future compiler technology for many-core chip architectures

    Temporal blocking of finite-difference stencil operators with sparse "off-the-grid" sources

    Stencil kernels dominate a range of scientific applications, including seismic and medical imaging, image processing, and neural networks. Temporal blocking is a performance optimization that aims to reduce the required memory bandwidth of stencil computations by re-using data from the cache for multiple time steps. It has already been shown to be beneficial for this class of algorithms. However, applying temporal blocking to the stencils of practical applications remains challenging. These computations often consist of sparsely located operators not aligned with the computational grid (“off-the-grid”). Our work is motivated by modelling problems in which source injections result in wavefields that must then be measured at receivers by interpolation from the gridded wavefield. The resulting data dependencies make the adoption of temporal blocking much more challenging. We propose a methodology to inspect these data dependencies and reorder the computation, leading to performance gains in stencil codes where temporal blocking has not been applicable. We implement this novel scheme in the Devito domain-specific compiler toolchain. Devito implements a domain-specific language embedded in Python to generate optimized partial differential equation solvers using the finite-difference method from high-level symbolic problem definitions. We evaluate our scheme using isotropic acoustic, anisotropic acoustic, and isotropic elastic wave propagators of industrial significance. After auto-tuning, performance evaluation shows that this enables substantial performance improvement through temporal blocking, of up to 1.6x, over highly optimized, vectorized, spatially blocked code
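    For readers unfamiliar with temporal blocking, the idea can be sketched on a 1D three-point stencil. This is a minimal illustration only, assuming a simple overlapped (ghost-zone) tiling with redundant computation; it is not the diamond, wavefront, or Devito scheme described in the abstracts above. The names `step`, `naive`, and `overlapped_temporal_blocking` and the parameters `tile` and `tb` are hypothetical. Pure Python gives no real cache benefit here; the sketch only shows that advancing each tile several time steps at once reproduces the result of sweeping the whole grid step by step.

```python
def step(u):
    """One 3-point averaging sweep; the two end points are held fixed."""
    v = list(u)
    for i in range(1, len(u) - 1):
        v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return v


def naive(u, steps):
    """Reference: sweep the whole grid once per time step."""
    for _ in range(steps):
        u = step(u)
    return u


def overlapped_temporal_blocking(u, steps, tile=16, tb=4):
    """Advance each spatial tile by up to `tb` time steps before moving on.

    Assumption: overlapped tiling. Each tile is extended by a halo of `nt`
    cells per side; values in the halo become stale as the local sweeps
    proceed, but the staleness moves inward by only one cell per step, so
    the tile interior [a, b) is still exact after `nt` steps.
    """
    n = len(u)
    u = list(u)
    for t0 in range(0, steps, tb):
        nt = min(tb, steps - t0)          # time steps in this block
        out = [0.0] * n
        for a in range(0, n, tile):
            b = min(a + tile, n)
            lo = max(a - nt, 0)           # halo, clipped at the domain edge
            hi = min(b + nt, n)
            local = u[lo:hi]              # small working set reused nt times
            for _ in range(nt):
                local = step(local)
            out[a:b] = local[a - lo:b - lo]
        u = out
    return u
```

    The halo costs some redundant work per tile; schemes such as diamond or wavefront tiling avoid that redundancy at the price of more intricate dependency handling.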

    Design and implementation of a 10 Gigabit Ethernet XAUI test systems

    10 Gigabit Ethernet has been standardized (IEEE 802.3ae), and products based on this standard are being deployed to interconnect MANs, WANs, Storage Area Networks, and very high speed LANs. The XAUI portion of the standard is primarily concerned with short range (up to 50 cm) chip-to-chip communication across printed circuit board traces. The UNH-IOL 10 Gigabit Ethernet Consortium, an industry-supported organization, performs PHY layer testing on products using a test system that has been partially implemented on a Xilinx ML321 evaluation board using the Virtex II-Pro FPGA. A new implementation of the 10 Gigabit Ethernet XAUI test system on the existing ML321 evaluation board is presented in this thesis. The new design removes a number of limitations present in the original Xilinx test system, and it adds new features to the existing transmit and receive sub-systems that enable test engineers to expand the range of test cases and analyze them while simultaneously increasing the speed of testing. The new test system also eliminates the need for expensive test instruments