415 research outputs found

    An approach for an efficient execution of SPMD applications on Multi-core environments

    Get PDF
    Executing traditional Message Passing Interface (MPI) applications on multi-core cluster balancing speed and computational efficiency is a difficult task that parallel programmers have to deal with. For this reason, communications on multi-core clusters ought to be handled carefully in order to improve performance metrics such as efficiency, speedup, execution time and scalability. In this paper we focus our attention on SPMD (Single Program Multiple Data) applications with high communication volume and synchronicity and also following characteristics such as: static, local and regular. This work proposes a method for SPMD applications, which is focused on managing the communication heterogeneity (different cache level, RAM memory, network, etc.) on homogeneous multi-core computing platform in order to improve the application efficiency. In this sense, the main objective of this work is to find analytically the ideal number of cores necessary that allows us to obtain the maximum speedup, while the computational efficiency is maintained over a defined threshold (strong scalability). This method also allows us to determine how the problem size must be increased in order to maintain an execution time constant while the number of cores are expanded (weak scalability) considering the tradeoff between speed and efficiency. This methodology has been tested with different benchmarks and applications and we achieved an average improvement around 30.35% of efficiency in applications tested using different problems sizes and multi-core clusters. In addition, results show that maximum speedup with a defined efficiency is located close to the values calculated with our analytical model with an error rate lower than 5% for the applications tested.This research has been supported by the MINECO (MICINN) Spain under contracts TIN2011-24384 and TIN2014- 53172-PPeer ReviewedPostprint (author's final draft

    Parallel numerical modeling of hybrid-dimensional compositional non-isothermal Darcy flows in fractured porous media

    Get PDF
    This paper introduces a new discrete fracture model accounting for non-isothermal compositional multiphase Darcy flows and complex networks of fractures with intersecting, immersed and non immersed fractures. The so called hybrid-dimensional model using a 2D model in the fractures coupled with a 3D model in the matrix is first derived rigorously starting from the equi-dimensional matrix fracture model. Then, it is dis-cretized using a fully implicit time integration combined with the Vertex Approximate Gradient (VAG) finite volume scheme which is adapted to polyhedral meshes and anisotropic heterogeneous media. The fully coupled systems are assembled and solved in parallel using the Single Program Multiple Data (SPMD) paradigm with one layer of ghost cells. This strategy allows for a local assembly of the discrete systems. An efficient preconditioner is implemented to solve the linear systems at each time step and each Newton type iteration of the simulation. The numerical efficiency of our approach is assessed on different meshes, fracture networks, and physical settings in terms of parallel scalability, nonlinear convergence and linear convergence

    An approach for an efficient execution of SPMD applications on Multi-core environments

    Get PDF
    Executing traditional Message Passing Interface (MPI) applications on multi-core cluster balancing speed and computational efficiency is a difficult task that parallel programmers have to deal with. For this reason, communications on multi-core clusters ought to be handled carefully in order to improve performance metrics such as efficiency, speedup, execution time and scalability. In this paper we focus our attention on SPMD (Single Program Multiple Data) applications with high communication volume and synchronicity and also following characteristics such as: static, local and regular. This work proposes a method for SPMD applications, which is focused on managing the communication heterogeneity (different cache level, RAM memory, network, etc.) on homogenous multi-core computing platform in order to improve the application efficiency. In this sense, the main objective of this work is to find analytically the ideal number of cores necessary that allows us to obtain the maximum speedup, while the computational efficiency is maintained over a defined threshold (strong scalability). This method also allows us to determine how the problem size must be increased in order to maintain an execution time constant while the number of cores are expanded (weak scalability) considering the tradeoff between speed and efficiency. This methodology has been tested with different benchmarks and applications and we achieved an average improvement around 30.35% of efficiency in applications tested using different problems sizes and multi-core clusters. In addition, results show that maximum speedup with a defined efficiency is located close to the values calculated with our analytical model with an error rate lower than 5% for the applications tested

    Performance and area evaluations of processor-based benchmarks on FPGA devices

    Get PDF
    The computing system on SoCs is being long-term research since the FPGA technology has emerged due to its personality of re-programmable fabric, reconfigurable computing, and fast development time to market. During the last decade, uni-processor in a SoC is no longer to deal with the high growing market for complex applications such as Mobile Phones audio and video encoding, image and network processing. Due to the number of transistors on a silicon wafer is increasing, the recent FPGAs or embedded systems are advancing toward multi-processor-based design to meet tremendous performance and benefit this kind of systems are possible. Therefore, is an upcoming age of the MPSoC. In addition, most of the embedded processors are soft-cores, because they are flexible and reconfigurable for specific software functions and easy to build homogenous multi-processor systems for parallel programming. Moreover, behavioural synthesis tools are becoming a lot more powerful and enable to create datapath of logic units from high-level algorithms such as C to HDL and available for partitioning a HW/SW concurrent methodology. A range of embedded processors is able to implement on a FPGA-based prototyping to integrate the CPUs on a programmable device. This research is, firstly represent different types of computer architectures in modern embedded processors that are followed in different type of software applications (eg. Multi-threading Operations or Complex Functions) on FPGA-based SoCs; and secondly investigate their capability by executing a wide-range of multimedia software codes (Integer-algometric only) in different models of the processor-systems (uni-processor or multi-processor or Co-design), and finally compare those results in terms of the benchmarks and resource utilizations within FPGAs. All the examined programs were written in standard C and executed in a variety numbers of soft-core processors or hardware units to obtain the execution times. However, the number of processors and their customizable configuration or hardware datapath being generated are limited by a target FPGA resource, and designers need to understand the FPGA-based tradeoffs that have been considered - Speed versus Area. For this experimental purpose, I defined benchmarks into DLP / HLS catalogues, which are "data" and "function" intensive respectively. The programs of DLP will be executed in LEON3 MP and LE1 CMP multi-processor systems and the programs of HLS in the LegUp Co-design system on target FPGAs. In preliminary, the performance of the soft-core processors will be examined by executing all the benchmarks. The whole story of this thesis work centres on the issue of the execute times or the speed-up and area breakdown on FPGA devices in terms of different programs

    CFD investigation of a complete floating offshore wind turbine

    Get PDF
    This chapter presents numerical computations for floating offshore wind turbines for a machine of 10-MW rated power. The rotors were computed using the Helicopter Multi-Block flow solver of the University of Glasgow that solves the Navier-Stokes equations in integral form using the arbitrary Lagrangian-Eulerian formulation for time-dependent domains with moving boundaries. Hydrodynamic loads on the support platform were computed using the Smoothed Particle Hydrodynamics method. This method is mesh-free, and represents the fluid by a set of discrete particles. The motion of the floating offshore wind turbine is computed using a Multi-Body Dynamic Model of rigid bodies and frictionless joints. Mooring cables are modelled as a set of springs and dampers. All solvers were validated separately before coupling, and the loosely coupled algorithm used is described in detail alongside the obtained results

    Evaluating the performance of legacy applications on emerging parallel architectures

    Get PDF
    The gap between a supercomputer's theoretical maximum (\peak") oatingpoint performance and that actually achieved by applications has grown wider over time. Today, a typical scientific application achieves only 5{20% of any given machine's peak processing capability, and this gap leaves room for significant improvements in execution times. This problem is most pronounced for modern \accelerator" architectures { collections of hundreds of simple, low-clocked cores capable of executing the same instruction on dozens of pieces of data simultaneously. This is a significant change from the low number of high-clocked cores found in traditional CPUs, and effective utilisation of accelerators typically requires extensive code and algorithmic changes. In many cases, the best way in which to map a parallel workload to these new architectures is unclear. The principle focus of the work presented in this thesis is the evaluation of emerging parallel architectures (specifically, modern CPUs, GPUs and Intel MIC) for two benchmark codes { the LU benchmark from the NAS Parallel Benchmark Suite and Sandia's miniMD benchmark { which exhibit complex parallel behaviours that are representative of many scientific applications. Using combinations of low-level intrinsic functions, OpenMP, CUDA and MPI, we demonstrate performance improvements of up to 7x for these workloads. We also detail a code development methodology that permits application developers to target multiple architecture types without maintaining completely separate implementations for each platform. Using OpenCL, we develop performance portable implementations of the LU and miniMD benchmarks that are faster than the original codes, and at most 2x slower than versions highly-tuned for particular hardware. Finally, we demonstrate the importance of evaluating architectures at scale (as opposed to on single nodes) through performance modelling techniques, highlighting the problems associated with strong-scaling on emerging accelerator architectures

    OPTIMIZATION OF SIMULATIONS IN OPENSIMPPLLE

    Get PDF
    Computer software has become an integral tool in exploring scientific concepts and computational models. Models, such as OpenSIMPPLLE, use a complex set of rules developed by experts to predict the impact of fires, disease, and wildlife on large scale landscapes. OpenSIMPPLLE’s simulations are time-consuming when projecting far into the future. OpenSIMPPLLE needs to execute more efficiently to allow for faster completion of simulations. The increase in speed will also enable users to run simulations with more timesteps in shorter periods. There are plenty of ways to accomplish this. The work described here identifies three different methods for increasing efficiency. The first method is refactoring expensive operations, the second is applying design patterns, and the third is to introduce parallelism. The main objective of this work is to examine whether the intersection of parallelism and efficient design will combine in an optimal runtime while analyzing the best approaches to implement parallel techniques

    Dynamic load balancing of parallel road traffic simulation

    Get PDF
    The objective of this research was to investigate, develop and evaluate dynamic load-balancing strategies for parallel execution of microscopic road traffic simulations. Urban road traffic simulation presents irregular, and dynamically varying distributed computational load for a parallel processor system. The dynamic nature of road traffic simulation systems lead to uneven load distribution during simulation, even for a system that starts off with even load distributions. Load balancing is a potential way of achieving improved performance by reallocating work from highly loaded processors to lightly loaded processors leading to a reduction in the overall computational time. In dynamic load balancing, workloads are adjusted continually or periodically throughout the computation. In this thesis load balancing strategies were evaluated and some load balancing policies developed. A load index and a profitability determination algorithms were developed. These were used to enhance two load balancing algorithms. One of the algorithms exhibits local communications and distributed load evaluation between the neighbour partitions (diffusion algorithm) and the other algorithm exhibits both local and global communications while the decision making is centralized (MaS algorithm). The enhanced algorithms were implemented and synthesized with a research parallel traffic simulation. The performance of the research parallel traffic simulator, optimized with the two modified dynamic load balancing strategies were studied
    • …
    corecore