
    Shared Memory Architecture for Simulating Sediment-Fluid Flow by OpenMP

    A simulation of fluid flow using the shallow water equations (SWE), coupled with the movement of the sediment below it governed by the Exner equation, is presented. The two equations are combined using a splitting technique, in which the SWE are computed using the Harten-Lax-van Leer-Einfeldt (HLLE) numerical flux and the Exner equation is then computed semi-implicitly. This paper elaborates the steps of constructing the SWE-Exner model. To demonstrate the agreement of the scheme, two problems are examined: (1) a comparison between the analytical and numerical solutions, and (2) OpenMP parallelism for transcritical flow over a granular bump. The first problem reports the discrete $L^1$-, $L^2$-, and $L^\infty$-norm errors of the scheme; the second shows the simulation results, speedup, and efficiency of the scheme, which is around 56.44%.
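
    The interface-flux evaluation in such a scheme is an independent loop over cell faces, which is what makes shared-memory OpenMP parallelism attractive. Below is a minimal sketch of one explicit 1-D SWE step with HLLE fluxes parallelised by OpenMP; the grid layout, variable names, dam-break setup, and the omitted semi-implicit Exner bed update are illustrative assumptions, not the paper's implementation.

        // One explicit 1-D shallow-water step with HLLE fluxes, OpenMP-parallel.
        // Sketch only: the semi-implicit Exner bed update is not shown.
        #include <algorithm>
        #include <cmath>
        #include <cstdio>
        #include <vector>

        const double g = 9.81;
        struct State { double h, hu; };              // depth and discharge

        // HLLE numerical flux between left and right cell states.
        State hlle(const State& L, const State& R) {
            double uL = L.hu / L.h, uR = R.hu / R.h;
            double cL = std::sqrt(g * L.h), cR = std::sqrt(g * R.h);
            double sL = std::min(uL - cL, uR - cR);  // Einfeldt speed bounds
            double sR = std::max(uL + cL, uR + cR);
            State fL = {L.hu, L.hu * uL + 0.5 * g * L.h * L.h};
            State fR = {R.hu, R.hu * uR + 0.5 * g * R.h * R.h};
            if (sL >= 0) return fL;
            if (sR <= 0) return fR;
            double w = 1.0 / (sR - sL);
            return {w * (sR * fL.h  - sL * fR.h  + sL * sR * (R.h  - L.h)),
                    w * (sR * fL.hu - sL * fR.hu + sL * sR * (R.hu - L.hu))};
        }

        void step(std::vector<State>& q, double dt, double dx) {
            std::vector<State> f(q.size() - 1);
            #pragma omp parallel for                 // fluxes are independent
            for (long i = 0; i < (long)f.size(); ++i)
                f[i] = hlle(q[i], q[i + 1]);
            #pragma omp parallel for                 // so are the cell updates
            for (long i = 1; i + 1 < (long)q.size(); ++i) {
                q[i].h  -= dt / dx * (f[i].h  - f[i - 1].h);
                q[i].hu -= dt / dx * (f[i].hu - f[i - 1].hu);
            }
        }

        int main() {
            std::vector<State> q(200, {1.0, 0.0});   // dam break: deeper left half
            for (int i = 0; i < 100; ++i) q[i].h = 2.0;
            for (int n = 0; n < 100; ++n) step(q, 0.001, 0.01);
            std::printf("h at midpoint: %f\n", q[100].h);
        }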

    Preliminary Experiments with XKaapi on Intel Xeon Phi Coprocessor

    This paper presents preliminary performance comparisons of parallel applications developed natively for the Intel Xeon Phi accelerator using three different parallel programming environments and their associated runtime systems. We compare Intel OpenMP, Intel CilkPlus, and XKaapi on the same benchmark suite, and we provide comparisons between an Intel Xeon Phi coprocessor and a Sandy Bridge Xeon-based machine. Our benchmark suite is composed of three computing kernels: a Fibonacci computation that allows us to study the overhead and scalability of the runtime system, an NQueens application generating irregular and dynamic tasks, and a Cholesky factorization algorithm. We also compare the Cholesky factorization with the parallel algorithm provided by the Intel MKL library for the Intel Xeon Phi. Performance evaluation shows that our XKaapi data-flow parallel programming environment exposes the lowest overhead of all and is highly competitive with the native OpenMP and CilkPlus environments on the Xeon Phi. Moreover, the efficient handling of data-flow dependencies between tasks makes our XKaapi environment exhibit more parallelism for some applications, such as the Cholesky factorization. In that case, we observe substantial gains with up to 180 hardware threads over the state-of-the-art MKL, with a 47% performance increase for 60 hardware threads.
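
    For context, the Fibonacci kernel is the classic probe of task-creation overhead: each call spawns two tiny tasks, so the runtime's scheduling cost dominates the arithmetic. The sketch below is a plain OpenMP-tasks variant of that kind of benchmark, with an illustrative problem size; the XKaapi and CilkPlus versions follow the same recursive shape.

        // Recursive Fibonacci with OpenMP tasks: a stress test of runtime
        // overhead, since the work per task is negligible.
        #include <cstdio>
        #include <omp.h>

        long fib(long n) {
            if (n < 2) return n;      // no sequential cut-off, to stress the runtime
            long a, b;
            #pragma omp task shared(a)
            a = fib(n - 1);
            #pragma omp task shared(b)
            b = fib(n - 2);
            #pragma omp taskwait      // join both children before summing
            return a + b;
        }

        int main() {
            long r = 0;
            #pragma omp parallel
            #pragma omp single        // one thread seeds the task tree
            r = fib(30);
            std::printf("fib(30) = %ld\n", r);
        }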

    Power Distribution Management System revisited: Single-thread vs. Multithread Performance

    A Power Distribution Management System (PDMS) uses very sophisticated algorithms to deliver reliable and efficient functioning of power distribution networks (PDNs). PDNs are represented using very large sparse matrices, whose processing is computationally very demanding. Dividing large PDNs into smaller sub-networks results in smaller sparse matrices, and processing each sub-network in parallel significantly improves PDMS performance. Using multithreading to further process each sub-network, however, degrades PDMS performance. Single-threaded processing of the sub-network sparse matrices gives much better results, mainly due to the structure of these matrices (indefinite and very sparse) and the synchronization overhead involved in multithreaded operations. This paper presents an overview of the PDMS and compares its single-thread and multithread performance. The results show that for some applications, a single-threaded implementation in a multi-process parallel environment gives better performance than a multithreaded implementation.
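
    The winning pattern can be sketched as coarse-grained parallelism across sub-networks with a strictly sequential solve inside each one, so no thread synchronization occurs within a factorization. In the hypothetical sketch below, SubNetwork and solve_sequential are placeholders, not PDMS APIs.

        // Parallelism across sub-networks only; each sparse solve stays
        // single-threaded, avoiding the intra-solve synchronization overhead
        // the paper measures for these indefinite, very sparse matrices.
        #include <cstdio>
        #include <vector>

        struct SubNetwork { int id; /* sparse matrix of one PDN island */ };

        // Stand-in for a single-threaded sparse factorization of one island.
        void solve_sequential(SubNetwork& n) { std::printf("island %d done\n", n.id); }

        int main() {
            std::vector<SubNetwork> nets(8);
            for (int i = 0; i < 8; ++i) nets[i].id = i;
            #pragma omp parallel for schedule(dynamic)  // one worker per island
            for (long i = 0; i < (long)nets.size(); ++i)
                solve_sequential(nets[i]);
        }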

    Exascale Co-Design Center for Materials in Extreme Environments (ExMatEx) Annual Report - Year 2


    Towards an algorithmic skeleton framework for programming the Intel® Xeon Phi™ processor

    The Intel® Xeon Phi™ is the first processor based on Intel's MIC (Many Integrated Cores) architecture. It is a co-processor specially tailored for data-parallel computations, whose basic architectural design is similar to that of GPUs (Graphics Processing Units), leveraging the use of many integrated cores of low computational power to perform parallel computations. The main novelty of the MIC architecture, relative to GPUs, is its compatibility with the Intel x86 architecture. This enables the use of many of the tools commonly available for the parallel programming of x86-based architectures, which may lead to a smaller learning curve. However, programming the Xeon Phi still entails aspects intrinsic to accelerator-based computing in general, and to the MIC architecture in particular. In this thesis we advocate the use of algorithmic skeletons for programming the Xeon Phi. Algorithmic skeletons abstract the complexity inherent to parallel programming, hiding details such as resource management, parallel decomposition, and inter-execution-flow communication, thus removing these concerns from the programmer's mind. In this context, the goal of the thesis is to lay the foundations for the development of a simple but powerful and efficient skeleton framework for programming the Xeon Phi processor. For this purpose we build upon Marrow, an existing framework for the orchestration of OpenCL™ computations in multi-GPU and CPU environments. We extend Marrow to execute both OpenCL and C++ parallel computations on the Xeon Phi. To evaluate the newly developed framework, several well-known benchmarks, such as Saxpy and N-Body, are used not only to compare its performance to the existing framework when executing on the co-processor, but also to assess performance on the Xeon Phi versus a multi-GPU environment. Projects PTDC/EIA-EIA/113613/2009 (Synergy-VM) and PTDC/EEI-CTP/1837/2012 (SwiftComp) financed the purchase of the Intel® Xeon Phi™.
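
    To make the skeleton idea concrete, the sketch below shows a minimal map skeleton in C++: the framework owns decomposition and scheduling, and the user supplies only the per-element function. This is a toy stand-in, not Marrow's actual API, and it uses OpenMP where Marrow would orchestrate OpenCL computations.

        // A minimal "map" algorithmic skeleton: parallel decomposition is
        // hidden inside the skeleton, out of the programmer's sight.
        #include <cstdio>
        #include <vector>

        template <class T, class F>
        std::vector<T> map_skel(const std::vector<T>& in, F f) {
            std::vector<T> out(in.size());
            #pragma omp parallel for      // resource management lives here
            for (long i = 0; i < (long)in.size(); ++i)
                out[i] = f(in[i]);
            return out;
        }

        int main() {                      // Saxpy-like use of the skeleton
            std::vector<float> x(1000, 2.0f);
            const float a = 3.0f;
            std::vector<float> y = map_skel(x, [a](float v) { return a * v + 1.0f; });
            std::printf("y[0] = %f\n", y[0]);
        }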

    Sigmoid: An auto-tuned load balancing algorithm for heterogeneous systems

    A challenge that heterogeneous-system programmers face is leveraging the performance of all the devices that make up the system. This paper presents Sigmoid, a new load balancing algorithm that efficiently co-executes a single OpenCL data-parallel kernel on all the devices of a heterogeneous system. Sigmoid splits the workload proportionally to the capabilities of the devices, drastically reducing response time and energy consumption. It is designed around several features: it is dynamic, adaptive, guided, and effortless, as it does not require the user to supply any parameter, adapting to the behaviour of each kernel at runtime. To evaluate Sigmoid's performance, it has been implemented in Maat, a system abstraction library. Experimental results with different kernel types show that Sigmoid exhibits excellent performance, reaching a utilization of 90%, together with energy savings of up to 20%, while always reducing programming effort compared to OpenCL and facilitating portability to other heterogeneous machines. This work has been supported by the Spanish Science and Technology Commission under contract PID2019-105660RB-C22 and the European HiPEAC Network of Excellence.
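
    The core idea, splitting the iteration space proportionally to device capability, can be sketched as follows. The sketch shows only a static proportional split with made-up throughput figures; Sigmoid itself is dynamic and adapts the shares to each kernel's behaviour at runtime.

        // Proportional split of a kernel's work-items across devices.
        // Throughput values are illustrative, not measured.
        #include <cstdio>
        #include <vector>

        // Returns the chunk (in work-items) assigned to each device.
        std::vector<size_t> split(size_t total, const std::vector<double>& tput) {
            double sum = 0;
            for (double t : tput) sum += t;
            std::vector<size_t> chunk(tput.size());
            size_t given = 0;
            for (size_t d = 0; d + 1 < tput.size(); ++d) {
                chunk[d] = (size_t)(total * tput[d] / sum);
                given += chunk[d];
            }
            chunk.back() = total - given;   // remainder goes to the last device
            return chunk;
        }

        int main() {
            // e.g. a GPU about 4x faster than the CPU gets ~80% of the kernel
            std::vector<size_t> c = split(1 << 20, {4.0, 1.0});
            std::printf("GPU: %zu items, CPU: %zu items\n", c[0], c[1]);
        }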

    Analysis of hybrid parallelization strategies: simulation of Anderson localization and Kalman Filter for LHCb triggers

    This thesis presents two experiences of hybrid programming applied to condensed matter and high-energy physics. The two projects differ in various aspects, but both aim to analyse the benefits of using accelerated hardware to speed up calculations in current scientific-research scenarios. The first project enables massive parallelism in a simulation of the Anderson localisation phenomenon in a disordered quantum system. The code represents a Hamiltonian in momentum space, executes a diagonalization of the corresponding matrix using linear algebra libraries, and finally analyses the energy-level spacing statistics averaged over several realisations of the disorder. The implementation combines different parallelization approaches in a hybrid scheme. The averaging over the ensemble of disorder realisations exploits massive parallelism with a master-slave configuration based on both multithreading and the Message Passing Interface (MPI). This framework is designed and implemented to easily interface with similar applications commonly adopted in scientific research, for example Monte Carlo simulations. The diagonalization uses multi-core and GPU hardware, interfacing with the MAGMA, PLASMA, or MKL libraries. Access to the libraries is modular to guarantee portability, maintainability, and extensibility in the near future. The second project is the development of a Kalman Filter, including its porting to GPU architectures and autovectorization, for online LHCb triggers. The developed codes provide information about the viability and advantages of applying GPU technologies in the first triggering step of the Large Hadron Collider beauty (LHCb) experiment. The optimisation introduced in both codes for CPU and GPU delivered a relevant speedup of the Kalman Filter. The two GPU versions, in CUDA® and OpenCL™, have similar performance and are adequate to be considered in the upgrade and in the corresponding implementations of the Gaudi framework. In both projects we implement optimisation techniques in the CPU code. This report presents extensive benchmark analyses of the correctness and performance of both projects.
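
    The ensemble average over disorder realisations parallelises naturally across MPI ranks. The sketch below uses a simplified static cyclic distribution in place of the thesis's master-slave scheme, and a stub where the real code would diagonalise the Hamiltonian through MAGMA, PLASMA, or MKL; all names and sizes are illustrative.

        // Ensemble average over disorder realisations, distributed with MPI.
        #include <cstdio>
        #include <mpi.h>

        // Stub for one disorder realisation: build the Hamiltonian,
        // diagonalise it, and return a level-spacing statistic.
        double run_realisation(int seed) { return (double)(seed % 7); }

        int main(int argc, char** argv) {
            MPI_Init(&argc, &argv);
            int rank, size;
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);

            const int total = 1024;                    // ensemble size
            double local = 0;
            for (int s = rank; s < total; s += size)   // cyclic distribution
                local += run_realisation(s);

            double global = 0;                         // reduce onto rank 0
            MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            if (rank == 0) std::printf("ensemble mean = %f\n", global / total);
            MPI_Finalize();
        }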