Shared Memory Architecture for Simulating Sediment-Fluid Flow by OpenMP
A simulation of fluid flow using the shallow water equations (SWE), and of the sediment movement below it using the Exner equation, is given. The two equations are combined using a splitting technique, in which the SWE are computed using the Harten-Lax-van Leer-Einfeldt (HLLE) numerical flux and the Exner equation is then computed semi-implicitly. This paper elaborates the steps of constructing the SWE-Exner model. To show the agreement of the scheme, two problems are elaborated: (1) a comparison between the analytical and numerical solutions, and (2) parallelization using OpenMP for transcritical flow over a granular bump. The first problem gives the discrete L1-, L2-, and L∞-norm errors of the scheme, and the second shows the simulation result, speedup, and efficiency of the scheme, which is around
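The discrete error norms used in the first test can be computed directly from the numerical and analytical grid values. A minimal Python sketch (the grid functions below are made-up stand-ins, not the paper's actual test case):

```python
import math

def discrete_norms(numerical, analytical, dx):
    """Discrete L1-, L2-, and Linf-norm errors between two grid functions."""
    diffs = [abs(u - v) for u, v in zip(numerical, analytical)]
    l1 = dx * sum(diffs)
    l2 = math.sqrt(dx * sum(d * d for d in diffs))
    linf = max(diffs)
    return l1, l2, linf

# Hypothetical example: a "numerical" solution that deviates from the
# exact profile by a uniform 1e-3 on an 11-point grid with dx = 0.1
xs = [i * 0.1 for i in range(11)]
num = [math.sin(x) + 1e-3 for x in xs]
exact = [math.sin(x) for x in xs]
l1, l2, linf = discrete_norms(num, exact, dx=0.1)
```

With a uniform pointwise error e, the norms reduce to e·(N·dx), e·sqrt(N·dx), and e, which makes the sketch easy to sanity-check by hand.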
Preliminary Experiments with XKaapi on Intel Xeon Phi Coprocessor
This paper presents preliminary performance comparisons of parallel applications developed natively for the Intel Xeon Phi accelerator using three different parallel programming environments and their associated runtime systems. We compare Intel OpenMP, Intel Cilk Plus, and XKaapi on the same benchmark suite, and we provide comparisons between an Intel Xeon Phi coprocessor and a Sandy Bridge Xeon-based machine. Our benchmark suite is composed of three computing kernels: a Fibonacci computation that allows us to study the overhead and scalability of the runtime system, an NQueens application generating irregular and dynamic tasks, and a Cholesky factorization algorithm. We also compare the Cholesky factorization with the parallel algorithm provided by the Intel MKL library for the Intel Xeon Phi. The performance evaluation shows that our XKaapi data-flow parallel programming environment has the lowest overhead of all and is highly competitive with the native OpenMP and Cilk Plus environments on the Xeon Phi. Moreover, the efficient handling of data-flow dependencies between tasks makes XKaapi expose more parallelism for some applications, such as the Cholesky factorization. In that case, we observe substantial gains with up to 180 hardware threads over the state-of-the-art MKL, including a 47% performance increase at 60 hardware threads.
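Recursive Fibonacci stresses a task runtime precisely because the number of spawned tasks grows exponentially with n; a common mitigation (not necessarily what these benchmarks do) is a sequential cutoff below which a call runs serially as a single task. A small Python sketch that counts the tasks a task-parallel runtime would create:

```python
def fib_tasks(n, cutoff=0):
    """Return (fib(n), number of 'tasks' a task-parallel runtime would spawn).

    Below `cutoff`, the call is treated as one serial leaf task; above it,
    each call spawns two child tasks, as in Cilk- or OpenMP-task fib.
    """
    if n < 2:
        return n, 1
    if n < cutoff:
        # serial leaf: compute iteratively, costing a single task
        a, b = 0, 1
        for _ in range(n - 1):
            a, b = b, a + b
        return b, 1
    f1, t1 = fib_tasks(n - 1, cutoff)
    f2, t2 = fib_tasks(n - 2, cutoff)
    return f1 + f2, t1 + t2 + 1

value, no_cutoff = fib_tasks(20)            # task count without a cutoff
_, with_cutoff = fib_tasks(20, cutoff=10)   # far fewer tasks with a cutoff
```

The uncut tree spawns 2·fib(n+1) − 1 tasks (21,891 for n = 20), which is why per-task overhead dominates and makes this kernel a good probe of runtime cost.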
Power Distribution Management System revisited: Single-thread vs. Multithread Performance
A Power Distribution Management System (PDMS) uses very sophisticated algorithms to deliver reliable and efficient functioning of power distribution networks (PDNs). PDNs are represented using very large sparse matrices, whose processing is computationally very demanding. Dividing large PDNs into smaller sub-networks results in smaller sparse matrices, and processing each sub-network in parallel significantly improves the performance of the PDMS. Using multithreading to further process each sub-network, however, degrades PDMS performance. Single-thread processing of the sub-network sparse matrices gives much better results, mainly due to the structure of these matrices (indefinite and very sparse) and the synchronization overhead involved in multithreaded operations. In this paper an overview of the PDMS is presented, and its single-thread and multithread performance is compared. The results show that for some applications, a single-threaded implementation in a multi-process parallel environment gives better performance than a multithreaded implementation.
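The decomposition described above can be pictured as a block-structured sparse matrix whose decoupled blocks are each processed serially by an independent worker, with no intra-block synchronization. A minimal Python sketch (the CSR-style layout and the tiny blocks are illustrative, not the PDMS data structures):

```python
def csr_matvec(indptr, indices, data, x):
    """Sparse matrix-vector product in CSR (compressed sparse row) form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y

# Two decoupled sub-networks: each block can be handed to a separate
# process and computed single-threaded, with no locks or shared state,
# which is the strategy the paper found faster than multithreading
# inside each block.
blocks = [
    # 2x2 block: [[4, -1], [-1, 4]] applied to [1, 2]
    ([0, 2, 4], [0, 1, 0, 1], [4.0, -1.0, -1.0, 4.0], [1.0, 2.0]),
    # 1x1 block: [[3]] applied to [5]
    ([0, 1], [0], [3.0], [5.0]),
]
results = [csr_matvec(*blk) for blk in blocks]
```

Because the blocks share no entries, the list comprehension could be swapped for a process pool without changing the per-block code at all.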
Towards an algorithmic skeleton framework for programming the Intel® Xeon Phi™ processor
The Intel® Xeon Phi™ is the first processor based on Intel's MIC (Many Integrated Cores) architecture. It is a co-processor specially tailored for data-parallel computations, whose basic architectural design is similar to that of GPUs (Graphics Processing Units), leveraging many simple integrated cores to perform parallel computations. The main novelty of the MIC architecture, relative to GPUs, is its compatibility with the Intel x86 architecture. This enables the use of many of the tools commonly available for parallel programming of x86-based architectures, which may lead to a smaller learning curve. However, programming the Xeon Phi still entails aspects intrinsic to accelerator-based computing in general, and to the MIC architecture in particular.
In this thesis we advocate the use of algorithmic skeletons for programming the Xeon Phi. Algorithmic skeletons abstract the complexity inherent to parallel programming, hiding details such as resource management, parallel decomposition, and inter-execution-flow communication, thus removing these concerns from the programmer's mind. In this context, the goal of the thesis is to lay the foundations for the development of a simple but powerful and efficient skeleton framework for programming the Xeon Phi processor. For this purpose we build upon Marrow, an existing framework for the orchestration of OpenCL™ computations in multi-GPU and CPU environments. We extend Marrow to execute both OpenCL and C++ parallel computations on the Xeon Phi.
To evaluate the newly developed framework, several well-known benchmarks, such as Saxpy and N-Body, are used not only to compare its performance on the co-processor to that of the existing framework, but also to assess the performance of the Xeon Phi versus a multi-GPU environment. The author acknowledges projects PTDC/EIA-EIA/113613/2009 (Synergy-VM) and PTDC/EEI-CTP/1837/2012 (SwiftComp) for financing the purchase of the Intel® Xeon Phi™.
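The skeleton idea described above — the framework owns partitioning and worker management while the user supplies only the kernel — can be illustrated with a toy data-parallel map skeleton in Python (a deliberately simplified sketch, unrelated to Marrow's actual OpenCL-based API):

```python
from concurrent.futures import ThreadPoolExecutor

class MapSkeleton:
    """A toy 'map' algorithmic skeleton: the user supplies only the
    element-wise kernel; chunking and worker management stay hidden."""

    def __init__(self, kernel, workers=4):
        self.kernel = kernel
        self.workers = workers

    def __call__(self, data):
        # partition the input into roughly equal chunks
        chunk = max(1, len(data) // self.workers)
        chunks = [data[i:i + chunk] for i in range(0, len(data), chunk)]
        # apply the kernel to each chunk on a managed worker pool
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            parts = pool.map(lambda c: [self.kernel(x) for x in c], chunks)
        return [y for part in parts for y in part]

# A Saxpy-like kernel (y = a*x), written with no parallel boilerplate
saxpy = MapSkeleton(lambda x: 2.0 * x)
out = saxpy([1.0, 2.0, 3.0, 4.0])
```

The caller never touches threads, chunk sizes, or result gathering, which is exactly the separation of concerns the thesis argues for.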
Sigmoid: An auto-tuned load balancing algorithm for heterogeneous systems
A challenge that heterogeneous-system programmers face is leveraging the performance of all the devices that make up the system. This paper presents Sigmoid, a new load balancing algorithm that efficiently co-executes a single OpenCL data-parallel kernel on all the devices of a heterogeneous system. Sigmoid splits the workload proportionally to the capabilities of the devices, drastically reducing response time and energy consumption. It is designed around several features: it is dynamic, adaptive, guided, and effortless, as it does not require the user to supply any parameter, adapting to the behaviour of each kernel at runtime. To evaluate Sigmoid's performance, it has been implemented in Maat, a system abstraction library. Experimental results with different kernel types show that Sigmoid exhibits excellent performance, reaching a utilization of 90%, together with energy savings of up to 20%, while always reducing programming effort compared to OpenCL and facilitating portability to other heterogeneous machines. This work has been supported by the Spanish Science and Technology Commission under contract PID2019-105660RB-C22 and the European HiPEAC Network of Excellence.
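The core of capability-proportional splitting can be sketched as follows. This is a simplified static split; the actual Sigmoid algorithm is dynamic and adaptive, and the throughput figures below are invented for illustration:

```python
def proportional_split(total_items, throughputs):
    """Split `total_items` work-items proportionally to per-device
    throughputs, rounding so the chunks sum exactly to the total."""
    total_tp = sum(throughputs)
    chunks = [int(total_items * tp / total_tp) for tp in throughputs]
    # hand any rounding remainder to the fastest device
    fastest = max(range(len(chunks)), key=lambda i: throughputs[i])
    chunks[fastest] += total_items - sum(chunks)
    return chunks

# Hypothetical system: one GPU ~3x faster than each of two CPU devices
chunks = proportional_split(1000, throughputs=[300.0, 100.0, 100.0])
```

Each chunk would then become the NDRange assigned to one device, so all devices finish at roughly the same time instead of the fastest one idling.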
Analysis of hybrid parallelization strategies: simulation of Anderson localization and Kalman Filter for LHCb triggers
This thesis presents two experiences of hybrid programming applied to condensed matter and high energy physics. The two projects differ in various aspects, but both aim to analyse the benefits of using accelerated hardware to speed up calculations in current research scenarios. The first project enables massive parallelism in a simulation of the Anderson localisation phenomenon in a disordered quantum system. The code represents a Hamiltonian in momentum space, executes a diagonalization of the corresponding matrix using linear algebra libraries, and finally analyses the energy-level spacing statistics averaged over several realisations of the disorder. The implementation combines different parallelization approaches in a hybrid scheme. The averaging over the ensemble of disorder realisations exploits massive parallelism with a master-slave configuration based on both multithreading and the Message Passing Interface (MPI). This framework is designed and implemented to easily interface with similar applications commonly adopted in scientific research, for example in Monte Carlo simulations. The diagonalization uses multi-core and GPU hardware, interfacing with the MAGMA, PLASMA, or MKL libraries. The access to the libraries is modular to guarantee portability, maintainability, and extensibility in the near future.
The second project is the development of a Kalman filter, including its porting to GPU architectures and auto-vectorization, for the online LHCb triggers. The developed codes provide information about the viability and advantages of applying GPU technologies in the first triggering step of the Large Hadron Collider beauty (LHCb) experiment. The optimisations introduced in the CPU and GPU codes delivered a relevant speedup of the Kalman filter. The two GPU versions, in CUDA® and OpenCL™, have similar performance and are adequate to be considered in the upgrade and in the corresponding implementations of the Gaudi framework. In both projects we implement optimisation techniques in the CPU code. This report presents extensive benchmark analyses of the correctness and performance of both projects.
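The Kalman filter's predict/update cycle, in its simplest scalar form, looks as follows. This is a generic textbook sketch, far simpler than the vectorized track-fitting filter used in the LHCb trigger:

```python
def kalman_step(x, p, z, q, r):
    """One scalar Kalman predict/update step.

    x, p : prior state estimate and its variance
    z    : new measurement
    q, r : process and measurement noise variances
    """
    # predict (identity dynamics: the state is assumed constant)
    x_pred, p_pred = x, p + q
    # update: blend prediction and measurement by the Kalman gain
    k = p_pred / (p_pred + r)
    x_new = x_pred + k * (z - x_pred)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new

# Filter a noisy constant signal fluctuating around 1.0
x, p = 0.0, 1.0
for z in [1.1, 0.9, 1.05, 0.95]:
    x, p = kalman_step(x, p, z, q=1e-4, r=0.1)
```

Each step is a handful of independent arithmetic operations per track, which is what makes the filter a natural target for both SIMD auto-vectorization and GPU offloading.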