17,340 research outputs found

    Generalized Methodology for Array Processor Design of Real-time Systems

    Many techniques and design tools have been developed for mapping algorithms to array processors. Linear mapping is usually used for regular algorithms. Large and complex problems are not regular by nature, and regularization may introduce computational overhead that prevents real-time deadlines from being met. In this paper, a systematic design methodology for mapping partially regular as well as regular dependence graphs is presented. In this approach, the set of all optimal solutions is generated under the given constraints. Due to the nature of the problem and the tight timing constraints of real-time systems, the set of alternative solutions is limited. An image processing example is discussed.
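    The abstract contrasts its methodology with the linear mapping commonly applied to regular algorithms. As a point of reference only (not the paper's method), the sketch below shows the classical linear space-time mapping of a regular dependence graph in C: a node with index vector i is assigned to processor P·i and executed at time s·i, and the schedule vector s must delay every dependence vector d by at least one step (s·d ≥ 1). The projection vector, schedule vector, and dependence vectors used here are illustrative assumptions.

```c
/* Illustrative sketch (not the paper's methodology): the classical linear
 * space-time mapping used for regular dependence graphs.  A node with index
 * vector i = (i1, i2) is assigned to processor p = P.i and executed at time
 * t = s.i; the schedule vector s must satisfy s.d >= 1 for every dependence
 * vector d so that data are produced before they are consumed. */
#include <stdio.h>

#define N 4  /* size of the (hypothetical) 2-D index space */

int main(void) {
    /* Example projection and schedule vectors (assumed, for illustration). */
    int P[2] = {1, 0};   /* processor allocation: p = i1     */
    int s[2] = {1, 1};   /* schedule:             t = i1 + i2 */

    /* Dependence vectors of a simple regular algorithm (assumed). */
    int d[2][2] = { {1, 0}, {0, 1} };

    /* Causality check: every dependence must be delayed by at least one step. */
    for (int k = 0; k < 2; ++k) {
        int delay = s[0] * d[k][0] + s[1] * d[k][1];
        if (delay < 1) {
            printf("schedule vector violates dependence %d\n", k);
            return 1;
        }
    }

    /* Print the resulting space-time mapping of every node. */
    for (int i1 = 0; i1 < N; ++i1)
        for (int i2 = 0; i2 < N; ++i2) {
            int proc = P[0] * i1 + P[1] * i2;
            int time = s[0] * i1 + s[1] * i2;
            printf("node (%d,%d) -> processor %d, time step %d\n",
                   i1, i2, proc, time);
        }
    return 0;
}
```

    Enumerating the (P, s) pairs that pass such checks under additional timing constraints is, loosely speaking, how a space of alternative mappings can be generated; handling partially regular graphs, as the paper does, goes beyond this toy.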

    Parallelization of the Wolff Single-Cluster Algorithm

    A parallel OpenMP (Open Multiprocessing) implementation of the Wolff single-cluster algorithm has been developed and tested for the three-dimensional (3D) Ising model. The procedure generalizes to other lattice spin models, and its effectiveness depends on the specific application at hand. The applicability of the methodology is discussed in the context of applications where a sophisticated shuffling scheme is used to generate high-quality pseudorandom numbers and an iterative method is applied to find the critical temperature of the 3D Ising model with great accuracy. For a lattice with linear size L=1024, we reach a speedup of about 1.79 on two processors and about 2.67 on four processors, compared to the serial code. According to our estimates, a speedup of about three on four processors is reachable for the O(n) models with n ≥ 2. Furthermore, the OpenMP code allows us to simulate larger lattices because of the larger shared memory available.
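    The abstract does not include implementation details, so the following is only a rough sketch, in C with OpenMP, of one way a single Wolff cluster update can be parallelized: the current frontier of cluster sites is examined in parallel, and neighbouring sites are claimed atomically with the usual Wolff probability p_add = 1 − exp(−2β) (ferromagnetic Ising couplings, J = 1). The rand_r calls and flat frontier arrays are placeholders, not the sophisticated shuffling generator or data layout described by the authors.

```c
/* Minimal sketch (not the authors' code) of frontier-parallel Wolff cluster
 * growth with OpenMP.  Assumes spin[] has been initialised to +/-1 elsewhere. */
#include <math.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <omp.h>

#define L 64                      /* linear lattice size (toy value) */
#define NSITE (L * L * L)         /* simple 3-D cubic lattice        */

static int spin[NSITE];
static atomic_char in_cluster[NSITE];

/* 6 nearest neighbours on a periodic cubic lattice. */
static int neighbour(int site, int dir) {
    int x = site % L, y = (site / L) % L, z = site / (L * L);
    switch (dir) {
        case 0: x = (x + 1) % L; break;  case 1: x = (x + L - 1) % L; break;
        case 2: y = (y + 1) % L; break;  case 3: y = (y + L - 1) % L; break;
        case 4: z = (z + 1) % L; break;  default: z = (z + L - 1) % L; break;
    }
    return x + L * (y + L * z);
}

/* Grow one Wolff cluster from 'seed' at inverse temperature beta and flip it. */
void wolff_update(int seed, double beta) {
    double p_add = 1.0 - exp(-2.0 * beta);   /* J = 1 assumed */
    int s0 = spin[seed];

    int *frontier = malloc(NSITE * sizeof(int));
    int *next     = malloc(NSITE * sizeof(int));
    int nf = 0, nn = 0;

    atomic_store(&in_cluster[seed], 1);
    frontier[nf++] = seed;

    while (nf > 0) {
        nn = 0;
        /* Examine the whole frontier in parallel. */
        #pragma omp parallel
        {
            /* Placeholder thread-local RNG; not the shuffled generator of the paper. */
            unsigned rng_state = 12345u + 977u * (unsigned)omp_get_thread_num() + (unsigned)nf;
            #pragma omp for
            for (int k = 0; k < nf; ++k) {
                for (int dir = 0; dir < 6; ++dir) {
                    int j = neighbour(frontier[k], dir);
                    if (spin[j] != s0) continue;
                    if ((double)rand_r(&rng_state) / RAND_MAX >= p_add) continue;
                    /* Atomically claim the site so only one thread adds it. */
                    if (atomic_exchange(&in_cluster[j], 1) == 0) {
                        int slot;
                        #pragma omp atomic capture
                        slot = nn++;
                        next[slot] = j;
                    }
                }
            }
        }
        /* The newly added sites become the next frontier. */
        for (int k = 0; k < nn; ++k) frontier[k] = next[k];
        nf = nn;
    }

    /* Flip every site that joined the cluster and reset the flags. */
    for (int i = 0; i < NSITE; ++i)
        if (atomic_load(&in_cluster[i])) { spin[i] = -spin[i]; in_cluster[i] = 0; }

    free(frontier);
    free(next);
}
```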

    Efficient algorithms for reconfiguration in VLSI/WSI arrays

    The issue of developing efficient algorithms for reconfiguring processor arrays in the presence of faulty processors and fixed hardware resources is discussed. The models discussed consist of a set of identical processors embedded in a flexible interconnection structure configured as a rectangular grid. An array grid model based on single-track switches is considered. An efficient polynomial-time algorithm is proposed for determining feasible reconfigurations for an array with a given distribution of faulty processors. In the process, it is shown that the set of conditions in the reconfigurability theorem is not necessary. A polynomial-time algorithm is also developed for finding feasible reconfigurations in an augmented single-track model and in array grid models with multiple-track switches.
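    The paper's reconfiguration algorithms are not reproduced in the abstract; the toy below merely illustrates, in C, the flavour of compensation-based repair in a grid with spare cells: each fault is compensated by shifting cells along a straight path towards a spare, and the shift fails when the path is blocked. This greedy heuristic is an assumption made for illustration and is far weaker than the polynomial-time algorithms the paper proposes.

```c
/* Toy sketch (a greedy heuristic, not the paper's polynomial-time algorithm):
 * each faulty processor is "compensated" by shifting the cells to its right
 * towards a spare column on the array boundary.  The shift is feasible only
 * if the straight compensation path contains no other fault and the spare
 * cell of that row has not been claimed yet -- a crude stand-in for the
 * single-track switch constraints described in the abstract. */
#include <stdio.h>
#include <stdbool.h>

#define ROWS 4
#define COLS 4   /* logical array; one spare column is assumed on the right */

int main(void) {
    /* 1 marks a faulty processor (example fault pattern, assumed). */
    int faulty[ROWS][COLS] = {
        {0, 0, 1, 0},
        {0, 0, 0, 0},
        {1, 0, 0, 1},
        {0, 1, 0, 0},
    };
    bool spare_used[ROWS] = {false};
    bool ok = true;

    for (int r = 0; r < ROWS; ++r) {
        for (int c = 0; c < COLS; ++c) {
            if (!faulty[r][c]) continue;
            /* The compensation path runs from (r, c+1) to the spare of row r. */
            bool clear = !spare_used[r];
            for (int k = c + 1; k < COLS && clear; ++k)
                if (faulty[r][k]) clear = false;
            if (clear) {
                spare_used[r] = true;
                printf("fault (%d,%d): shift row %d right into spare column\n",
                       r, c, r);
            } else {
                printf("fault (%d,%d): no straight compensation path\n", r, c);
                ok = false;
            }
        }
    }
    printf(ok ? "reconfiguration found\n"
              : "heuristic fails; a real algorithm would also try other tracks\n");
    return 0;
}
```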

    Density Functional Theory calculation on many-cores hybrid CPU-GPU architectures

    The implementation of a full electronic structure calculation code on a hybrid parallel architecture with Graphics Processing Units (GPUs) is presented. The code on which our implementation is based is a GNU-GPL code built on Daubechies wavelets. It shows very good performance, systematic convergence properties, and excellent efficiency on parallel computers. Our GPU-based acceleration fully preserves all of these properties. In particular, the code can run on many cores which may or may not have a GPU associated with them. It can thus run on parallel and massively parallel hybrid environments, even with a non-homogeneous CPU/GPU ratio. With double-precision calculations, we may achieve considerable speedups, ranging from a factor of 20 for some operations to a factor of 6 for the whole DFT code. Comment: 14 pages, 8 figures
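    The abstract notes that the code runs on cores that may or may not have a GPU attached, with a possibly non-homogeneous CPU/GPU ratio. A minimal C sketch of that dispatch idea is given below; the function names are hypothetical and the convolution bodies are stubs, so this illustrates the pattern rather than the actual code.

```c
/* Minimal sketch of the dispatch idea described in the abstract: every process
 * checks at start-up whether it has a GPU attached and routes the heavy
 * wavelet convolutions either to a GPU kernel or to the CPU path.  The names
 * wavelet_convolution_cpu / _gpu are hypothetical; the GPU branch is compiled
 * only when USE_CUDA is defined. */
#include <stdio.h>
#include <stdbool.h>

static void wavelet_convolution_cpu(const double *in, double *out, int n) {
    /* Placeholder CPU path: copy-through instead of a real Daubechies filter. */
    for (int i = 0; i < n; ++i) out[i] = in[i];
}

#ifdef USE_CUDA
#include <cuda_runtime.h>
static void wavelet_convolution_gpu(const double *in, double *out, int n) {
    /* Real code would transfer the data and launch a convolution kernel;
     * kept as a stub that falls back to the CPU path here. */
    wavelet_convolution_cpu(in, out, n);
}
#endif

/* Decide once, at start-up, whether this process drives a GPU. */
static bool rank_has_gpu(void) {
#ifdef USE_CUDA
    int ndev = 0;
    if (cudaGetDeviceCount(&ndev) == cudaSuccess && ndev > 0) return true;
#endif
    return false;
}

int main(void) {
    enum { N = 1024 };
    static double in[N], out[N];
    void (*convolve)(const double *, double *, int) = wavelet_convolution_cpu;

#ifdef USE_CUDA
    if (rank_has_gpu()) convolve = wavelet_convolution_gpu;
#endif
    /* Non-homogeneous CPU/GPU ratios are handled naturally: each process
     * simply takes whichever branch matches its local hardware. */
    convolve(in, out, N);
    printf("ran %s path\n", rank_has_gpu() ? "GPU" : "CPU");
    return 0;
}
```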

    Design and optimization of a portable LQCD Monte Carlo code using OpenACC

    The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPU processors, which support a wide class of applications but deliver moderate computing performance, to many-core GPUs, which exploit aggressive data-parallelism and deliver higher performance for streaming computing applications. In this scenario, code portability (and performance portability) becomes necessary for easy maintainability of applications; this is very relevant in scientific computing, where code changes are very frequent and keeping different code versions aligned is tedious and error-prone. In this work we present the design and optimization of a state-of-the-art, production-level LQCD Monte Carlo application using the directive-based OpenACC programming model. OpenACC abstracts parallel programming to a descriptive level, relieving programmers of the need to specify how codes should be mapped onto the target architecture. We describe the implementation of a code fully written in OpenACC and show that we are able to target several different architectures, including state-of-the-art traditional CPUs and GPUs, with the same code. We also measure performance, evaluating the computing efficiency of our OpenACC code on several architectures, comparing it with GPU-specific implementations and showing that a good level of performance portability can be reached. Comment: 26 pages, 2 PNG figures, preprint of an article submitted for consideration in International Journal of Modern Physics
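    As an illustration of the descriptive style the abstract refers to, the toy C code below marks a lattice loop with OpenACC directives and leaves the mapping onto the target (GPU or multi-core CPU) to the compiler. It is a simple nearest-neighbour stencil, not the production LQCD operators.

```c
/* Toy illustration of the OpenACC style described in the abstract: the data
 * region keeps the field resident on the accelerator and the compiler decides
 * how to map the loop onto the target.  Without an OpenACC compiler the
 * pragmas are ignored and the code runs serially. */
#include <stdio.h>

#define L 16
#define IDX(x, y, z, t) ((((t) * L + (z)) * L + (y)) * L + (x))

int main(void) {
    static double phi[L * L * L * L], res[L * L * L * L];
    for (int i = 0; i < L * L * L * L; ++i) phi[i] = 1.0;

    #pragma acc data copyin(phi) copyout(res)
    {
        #pragma acc parallel loop collapse(4)
        for (int t = 0; t < L; ++t)
        for (int z = 0; z < L; ++z)
        for (int y = 0; y < L; ++y)
        for (int x = 0; x < L; ++x) {
            /* Periodic nearest neighbours in the +x and +t directions only,
             * to keep the example short. */
            int xp = (x + 1) % L, tp = (t + 1) % L;
            res[IDX(x, y, z, t)] =
                phi[IDX(xp, y, z, t)] + phi[IDX(x, y, z, tp)];
        }
    }
    printf("res[0] = %f\n", res[0]);
    return 0;
}
```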