32 research outputs found

    GPU parallelization strategies for metaheuristics: a survey

    Get PDF
    Metaheuristics have been showing interesting results in solving hard optimization problems. However, they become limited in terms of effectiveness and runtime for high dimensional problems. Thanks to the independency of metaheuristics components, parallel computing appears as an attractive choice to reduce the execution time and to improve solution quality. By exploiting the increasing performance and programability of graphics processing units (GPUs) to this aim, GPU-based parallel metaheuristics have been implemented using different designs. RecentresultsinthisareashowthatGPUstendtobeeffectiveco-processors forleveraging complex optimization problems.In thissurvey, mechanisms involvedinGPUprogrammingforimplementingparallelmetaheuristicsare presentedanddiscussedthroughastudyofrelevantresearchpapers. Metaheuristics can obtain satisfying results when solving optimization problems in a reasonable time. However, they suffer from the lack of scalability. Metaheuristics become limited ahead complex highdimensional optimization problems. To overcome this limitation, GPU based parallel computing appears as a strong alternative. Thanks to GPUs, parallelmetaheuristicsachievedbetterresultsintermsofcomputation,and evensolutionquality

    Scalable and Reliable Sparse Data Computation on Emergent High Performance Computing Systems

    Get PDF
    Heterogeneous systems with both CPUs and GPUs have become important system architectures in emergent High Performance Computing (HPC) systems. Heterogeneous systems must address both performance-scalability and power-scalability in the presence of failures. Aggressive power reduction pushes hardware to its operating limit and increases the failure rate. Resilience allows programs to progress when subjected to faults and is an integral component of large-scale systems, but incurs significant time and energy overhead. The future exascale systems are expected to have higher power consumption with higher fault rates. Sparse data computation is the fundamental kernel in many scientific applications. It is suitable for the studies of scalability and resilience on heterogeneous systems due to its computational characteristics. To deliver the promised performance within the given power budget, heterogeneous computing mandates a deep understanding of the interplay between scalability and resilience. Managing scalability and resilience is challenging in heterogeneous systems, due to the heterogeneous compute capability, power consumption, and varying failure rates between CPUs and GPUs. Scalability and resilience have been traditionally studied in isolation, and optimizing one typically detrimentally impacts the other. While prior works have been proved successful in optimizing scalability and resilience on CPU-based homogeneous systems, simply extending current approaches to heterogeneous systems results in suboptimal performance-scalability and/or power-scalability. To address the above multiple research challenges, we propose novel resilience and energy-efficiency technologies to optimize scalability and resilience for sparse data computation on heterogeneous systems with CPUs and GPUs. First, we present generalized analytical and experimental methods to analyze and quantify the time and energy costs of various recovery schemes, and develop and prototype performance optimization and power management strategies to improve scalability for sparse linear solvers. Our results quantitatively reveal that each resilience scheme has its own advantages depending on the fault rate, system size, and power budget, and the forward recovery can further benefit from our performance and power optimizations for large-scale computing. Second, we design a novel resilience technique that relaxes the requirement of synchronization and identicalness for processes, and allows them to run in heterogeneous resources with power reduction. Our results show a significant reduction in energy for unmodified programs in various fault situations compared to exact replication techniques. Third, we propose a novel distributed sparse tensor decomposition that utilizes an asynchronous RDMA-based approach with OpenSHMEM to improve scalability on large-scale systems and prove that our method works well in heterogeneous systems. Our results show our irregularity-aware workload partition and balanced-asynchronous algorithms are scalable and outperform the state-of-the-art distributed implementations. We demonstrate that understanding different bottlenecks for various types of tensors plays critical roles in improving scalability

    Parallel algorithms for computational fluid dynamics on unstructured meshes

    Get PDF
    La simulació numèrica directa (DNS) de fluxos complexes és actualment una utopia per la majoria d'aplicacions industrials ja que els requeriments computacionals son massa elevats. Donat un flux, la diferència entre els recursos computacionals necessaris i els disponibles és cobreix mitjançant la modelització/simplificació d'alguns termes de les equacions originals que regeixen el seu comportament. El creixement continuat dels recursos computacionals disponibles, principalment en forma de super-ordinadors, contribueix a reduir la part del flux que és necessari aproximar. De totes maneres, obtenir la eficiència esperada dels nous super-ordinadors no és una tasca senzilla i, per aquest motiu, part de la recerca en el camp de la Mecànica de Fluids Computacional es centra en aquest objectiu. En aquest sentit, algunes contribucions s'han presentat en el marc d'aquesta tesis. El primer objectiu va ser el desenvolupament d'un codi de CFD de propòsit general i paral·lel, basat en la metodologia de volums finits en malles no estructurades, per resoldre problemes de multi-física. Aquest codi, anomenat TermoFluids (TF), té un disseny orientat a objectes i pensat per ser usat de forma altament eficient en els super-ordinadors actuals. Amb el temps, ha esdevingut pel grup una eina fonamental en projectes tant de recerca bàsica com d'interès industrial. En el context d'aquesta tesis, el treball s'ha focalitzat en el desenvolupament de dos de les llibreries més bàsiques de TermoFluids: i) La Basics Objects Library (BOL), que es una plataforma de software sobre la qual estan programades la resta de llibreries del codi, i que conté els mètodes algebraics i geomètrics fonamentals per la implementació paral·lela dels algoritmes de discretització, ii) la Linear Solvers Library (LSL), que conté un gran nombre de mètodes per resoldre els sistemes d'equacions lineals derivats de les discretitzacions. El primer capítol d'aquesta tesi conté les principals idees subjacents al disseny i la implementació de la BOL i la LSL, juntament amb alguns exemples i algunes aplicacions industrials. En els capítols posteriors hi ha una explicació detallada de solvers específics per algunes aplicacions concretes. En el segon capítol, es presenta un solver paral·lel i directe per la resolució de l'equació de Poisson per casos en els quals una de les direccions del domini té condicions d'homogeneïtat. En la simulació de fluxos incompressibles, l'equació de Poisson es resol almenys una vegada en cada pas de temps, convertint-se en una de les parts més costoses i difícils de paral·lelitzar del codi. El mètode que proposem és una combinació d'una descomposició directa de Schur (DDS) i una diagonalització de Fourier. La darrera descompon el sistema original en un conjunt de sub-sistemes 2D independents que es resolen mitjançant l'algorisme DDS. Atès que no s'imposen restriccions a les direccions no periòdiques del domini, aquest mètode és aplicable a la resolució de problemes discretitzats mitjançat l'extrusió de malles 2D no estructurades. L'escalabilitat d'aquest mètode ha estat provada amb èxit amb un màxim de 8192 CPU per malles de fins a ~10⁹ volums de control. En el darrer capitol capítol, es presenta un mètode de resolució per l'equació de Transport de Boltzmann (BTE). La estratègia emprada es basa en el mètode d'Ordenades Discretes i pot ser aplicat en discretitzacions no estructurades. El flux per a cada ordenada angular es resol amb un mètode de substitució equivalent a la resolució d'un sistema lineal triangular. La naturalesa seqüencial d'aquest procés fa de la paral·lelització de l'algoritme el principal repte. Diversos algorismes de substitució han estat analitzats, esdevenint una de les heurístiques proposades la millor opció en totes les situacions analitzades, amb excel·lents resultats. Els testos d'eficiència paral·lela s'han realitzat usant fins a 2560 CPU.Direct Numerical Simulation (DNS) of complex flows is currently an utopia for most of industrial applications because computational requirements are too high. For a given flow, the gap between the required and the available computing resources is covered by modeling/simplifying of some terms of the original equations. On the other hand, the continuous growth of the computing power of modern supercomputers contributes to reduce this gap, reducing hence the unresolved physics that need to be attempted with approximated models. This growth, widely relies on parallel computing technologies. However, getting the expected performance from new complex computing systems is becoming more and more difficult, and therefore part of the CFD research is focused on this goal. Regarding to it, some contributions are presented in this thesis. The first objective was to contribute to the development of a general purpose multi-physics CFD code. referred to as TermoFluids (TF). TF is programmed following the object oriented paradigm and designed to run in modern parallel computing systems. It is also intensively involved in many different projects ranging from basic research to industry applications. Besides, one of the strengths of TF is its good parallel performance demonstrated in several supercomputers. In the context of this thesis, the work was focused on the development of two of the most basic libraries that compose TF: I) the Basic Objects Library (BOL), which is a parallel unstructured CFD application programming interface, on the top of which the rest of libraries that compose TF are written, ii) the Linear Solvers Library (LSL) containing many different algorithms to solve the linear systems arising from the discretization of the equations. The first chapter of this thesis contains the main ideas underlying the design and the implementation of the BOL and LSL libraries, together with some examples and some industrial applications. A detailed description of some application-specific linear solvers included in the LSL is carried out in the following chapters. In the second chapter, a parallel direct Poisson solver restricted to problems with one uniform periodic direction is presented. The Poisson equation is solved, at least, once per time-step when modeling incompressible flows, becoming one of the most time consuming and difficult to parallelize parts of the code. The solver here proposed is a combination of a direct Schur-complement based decomposition (DSD) and a Fourier diagonalization. The latter decomposes the original system into a set of mutually independent 2D sub-systems which are solved by means of the DSD algorithm. Since no restrictions are imposed in the non-periodic directions, the overall algorithm is well-suited for solving problems discretized on extruded 2D unstructured meshes. The scalability of the solver has been successfully tested using up to 8192 CPU cores for meshes with up to 10 9 grid points. In the last chapter, a solver for the Boltzmann Transport Equation (BTE) is presented. It can be used to solve radiation phenomena interacting with flows. The solver is based on the Discrete Ordinates Method and can be applied to unstructured discretizations. The flux for each angular ordinate is swept across the computational grid, within a source iteration loop that accounts for the coupling between the different ordinates. The sequential nature of the sweep process makes the parallelization of the overall algorithm the most challenging aspect. Several parallel sweep algorithms, which represent different options of interleaving communications and calculations, are analyzed. One of the heuristics proposed consistently stands out as the best option in all the situations analyzed. With this algorithm, good scalability results have been achieved regarding both weak and strong speedup tests with up to 2560 CPUs

    Software for Exascale Computing - SPPEXA 2016-2019

    Get PDF
    This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG) presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer’s series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA’s first funding phase, and provides an overview of SPPEXA’s contributions towards exascale computing in today's sumpercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest

    Supercomputing Frontiers

    Get PDF
    This open access book constitutes the refereed proceedings of the 7th Asian Conference Supercomputing Conference, SCFA 2022, which took place in Singapore in March 2022. The 8 full papers presented in this book were carefully reviewed and selected from 21 submissions. They cover a range of topics including file systems, memory hierarchy, HPC cloud platform, container image configuration workflow, large-scale applications, and scheduling

    PC-grade parallel processing and hardware acceleration for large-scale data analysis

    Get PDF
    Arguably, modern graphics processing units (GPU) are the first commodity, and desktop parallel processor. Although GPU programming was originated from the interactive rendering in graphical applications such as computer games, researchers in the field of general purpose computation on GPU (GPGPU) are showing that the power, ubiquity and low cost of GPUs makes them an ideal alternative platform for high-performance computing. This has resulted in the extensive exploration in using the GPU to accelerate general-purpose computations in many engineering and mathematical domains outside of graphics. However, limited to the development complexity caused by the graphics-oriented concepts and development tools for GPU-programming, GPGPU has mainly been discussed in the academic domain so far and has not yet fully fulfilled its promises in the real world. This thesis aims at exploiting GPGPU in the practical engineering domain and presented a novel contribution to GPGPU-driven linear time invariant (LTI) systems that are employed by the signal processing techniques in stylus-based or optical-based surface metrology and data processing. The core contributions that have been achieved in this project can be summarized as follow. Firstly, a thorough survey of the state-of-the-art of GPGPU applications and their development approaches has been carried out in this thesis. In addition, the category of parallel architecture pattern that the GPGPU belongs to has been specified, which formed the foundation of the GPGPU programming framework design in the thesis. Following this specification, a GPGPU programming framework is deduced as a general guideline to the various GPGPU programming models that are applied to a large diversity of algorithms in scientific computing and engineering applications. Considering the evolution of GPU’s hardware architecture, the proposed frameworks cover through the transition of graphics-originated concepts for GPGPU programming based on legacy GPUs and the abstraction of stream processing pattern represented by the compute unified device architecture (CUDA) in which GPU is considered as not only a graphics device but a streaming coprocessor of CPU. Secondly, the proposed GPGPU programming framework are applied to the practical engineering applications, namely, the surface metrological data processing and image processing, to generate the programming models that aim to carry out parallel computing for the corresponding algorithms. The acceleration performance of these models are evaluated in terms of the speed-up factor and the data accuracy, which enabled the generation of quantifiable benchmarks for evaluating consumer-grade parallel processors. It shows that the GPGPU applications outperform the CPU solutions by up to 20 times without significant loss of data accuracy and any noticeable increase in source code complexity, which further validates the effectiveness of the proposed GPGPU general programming framework. Thirdly, this thesis devised methods for carrying out result visualization directly on GPU by storing processed data in local GPU memory through making use of GPU’s rendering device features to achieve realtime interactions. The algorithms employed in this thesis included various filtering techniques, discrete wavelet transform, and the fast Fourier Transform which cover the common operations implemented in most LTI systems in spatial and frequency domains. Considering the employed GPUs’ hardware designs, especially the structure of the rendering pipelines, and the characteristics of the algorithms, the series of proposed GPGPU programming models have proven its feasibility, practicality, and robustness in real engineering applications. The developed GPGPU programming framework as well as the programming models are anticipated to be adaptable for future consumer-level computing devices and other computational demanding applications. In addition, it is envisaged that the devised principles and methods in the framework design are likely to have significant benefits outside the sphere of surface metrology.EThOS - Electronic Theses Online ServiceGBUnited Kingdo

    Simulating the nonlinear QED vacuum

    Get PDF

    A Dataflow Framework For Developing Flexible Embedded Accelerators A Computer Vision Case Study.

    Get PDF
    The focus of this dissertation is the design and the implementation of a computing platform which can accelerate data processing in the embedded computation domain. We focus on a heterogeneous computing platform, whose hardware implementation can approach the power and area efficiency of specialized designs, while remaining flexible across the application domain. The multi-core architectures require parallel programming, which is widely-regarded as more challenging than sequential programming. Although shared memory parallel programs may be fairly easy to write (using OpenMP, for example), they are quite hard to optimize; providing embedded application developers with optimizing tools and programming frameworks is a challenge. The heterogeneous specialized elements make the problem even more difficult. Dataflow is a parallel computation model that relies exclusively on message passing, and that has some advantages over parallel programming tools in wide use today: simplicity, graphical representation, and determinism. Dataflow model is also a good match to streaming applications, such as audio, video and image processing, which operate on large sequences of data and are characterized by abundant parallelism and regular memory access patterns. Dataflow model of computation has gained acceptance in simulation and signal-processing communities. This thesis evaluates the applicability of the dataflow model for implementing domain-specific embedded accelerators for streaming applications

    Improving Compute & Data Efficiency of Flexible Architectures

    Get PDF
    corecore