31,043 research outputs found

    Task-based adaptive multiresolution for time-space multi-scale reaction-diffusion systems on multi-core architectures

    Get PDF
    A new solver featuring time-space adaptation and error control has been recently introduced to tackle the numerical solution of stiff reaction-diffusion systems. Based on operator splitting, finite volume adaptive multiresolution and high order time integrators with specific stability properties for each operator, this strategy yields high computational efficiency for large multidimensional computations on standard architectures such as powerful workstations. However, the data structure of the original implementation, based on trees of pointers, provides limited opportunities for efficiency enhancements, while posing serious challenges in terms of parallel programming and load balancing. The present contribution proposes a new implementation of the whole set of numerical methods including Radau5 and ROCK4, relying on a fully different data structure together with the use of a specific library, TBB, for shared-memory, task-based parallelism with work-stealing. The performance of our implementation is assessed in a series of test-cases of increasing difficulty in two and three dimensions on multi-core and many-core architectures, demonstrating high scalability

    A Multi-Kernel Multi-Code Polar Decoder Architecture

    Get PDF
    Polar codes have received increasing attention in the past decade, and have been selected for the next generation of wireless communication standard. Most research on polar codes has focused on codes constructed from a 2×22\times2 polarization matrix, called binary kernel: codes constructed from binary kernels have code lengths that are bound to powers of 22. A few recent works have proposed construction methods based on multiple kernels of different dimensions, not only binary ones, allowing code lengths different from powers of 22. In this work, we design and implement the first multi-kernel successive cancellation polar code decoder in literature. It can decode any code constructed with binary and ternary kernels: the architecture, sized for a maximum code length NmaxN_{max}, is fully flexible in terms of code length, code rate and kernel sequence. The decoder can achieve frequency of more than 11 GHz in 6565 nm CMOS technology, and a throughput of 615615 Mb/s. The area occupation ranges between 0.110.11 mm2^2 for Nmax=256N_{max}=256 and 2.012.01 mm2^2 for Nmax=4096N_{max}=4096. Implementation results show an unprecedented degree of flexibility: with Nmax=4096N_{max}=4096, up to 5555 code lengths can be decoded with the same hardware, along with any kernel sequence and code rate

    Design and optimization of a portable LQCD Monte Carlo code using OpenACC

    Full text link
    The present panorama of HPC architectures is extremely heterogeneous, ranging from traditional multi-core CPU processors, supporting a wide class of applications but delivering moderate computing performance, to many-core GPUs, exploiting aggressive data-parallelism and delivering higher performances for streaming computing applications. In this scenario, code portability (and performance portability) become necessary for easy maintainability of applications; this is very relevant in scientific computing where code changes are very frequent, making it tedious and prone to error to keep different code versions aligned. In this work we present the design and optimization of a state-of-the-art production-level LQCD Monte Carlo application, using the directive-based OpenACC programming model. OpenACC abstracts parallel programming to a descriptive level, relieving programmers from specifying how codes should be mapped onto the target architecture. We describe the implementation of a code fully written in OpenACC, and show that we are able to target several different architectures, including state-of-the-art traditional CPUs and GPUs, with the same code. We also measure performance, evaluating the computing efficiency of our OpenACC code on several architectures, comparing with GPU-specific implementations and showing that a good level of performance-portability can be reached.Comment: 26 pages, 2 png figures, preprint of an article submitted for consideration in International Journal of Modern Physics
    • …
    corecore