7 research outputs found

    A Fast Solver for Large Tridiagonal Systems on Multi-Core Processors (Lass Library)

    Get PDF
    [Abstract]: Many problems of industrial and scientific interest require the solving of tridiagonal linear systems. This paper presents several implementations for the parallel solving of large tridiagonal systems on multi-core architectures, using the OmpSs programming model. The strategy used for the parallelization is based on the combination of two different existing algorithms, PCR and Thomas. The Thomas algorithm, which cannot be parallelized, requires the fewest number of floating point operations. The PCR algorithm is the most popular parallel method, but it is more computationally expensive than Thomas. The method proposed in this paper starts applying the PCR algorithm to break down one large tridiagonal system into a set of smaller and independent ones. In a second step, these independent systems are concurrently solved using Thomas. The paper also contains an analytical study of which is the best point to switch from PCR to Thomas. Also, the paper addresses the main performance issues of combining PCR and Thomas proposing a set of alternative implementations, some of them even imply algorithmic changes. The performance evaluation shows that the best implementation achieves a peak speedup of 4 with respect to the Intel MKL counterpart routine and 2.5 with respect to a single-threaded Thomas.This work was supported in part by the European Union鈥檚 Horizon 2020 Framework Programme for Research and Innovation under the Specific Grant Agreements Human Brain Project SGA1 and Human Brain Project SGA2 under Grant 720270 and Grant 785907, in part by the Spanish Ministry of Economy and Competitiveness under the Project Computaci贸n de Altas Prestaciones VII under Grant TIN2015-65316-P, in part by the Departament d鈥橧nnovaci贸, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Programaci贸 i Entorns d鈥橢xecuci贸 Paral暖lels under Grant 2014-SGR-1051, in part by the Juan de la Cierva under Grant IJCI-2017-33511, in part by the Fujitsu under the Barcelona Supercomputing Center-Fujitsu Joint Project: Math Libraries Migration and Optimization, in part by the Ministerio de Econom铆a, Industria y Competitividad of Spain, in part by the Fondo Europeo de Desarrollo Regional Funds of the European Union under Grant TIN2016-75845-P, and in part by the Xunta de Galicia co-founded by the European Regional Development Fund (ERDF) under the Consolidation Programme of Competitive Reference Groups under Grant ED431C 2017/04, and in part by the Centro Singular de Investigaci贸n de Galicia accreditati贸n 2016-2019 under Grant ED431G/01.Xunta de Galicia; ED431C 2017/04Xunta de Galicia; ED431G/01Generalitat de Catalunya; 2014-SGR-105

    Towards an auto-tuned and task-based SpMV (LASs Library)

    Get PDF
    We present a novel approach to parallelize the SpMV kernel included in LASs (Linear Algebra routines on OmpSs) library, after a deep review and analysis of several well-known approaches. LASs is based on OmpSs, a task-based runtime that extends OpenMP directives, providing more flexibility to apply new strategies. Based on tasking and nesting, with the aim of improving the workload imbalance inherent to the SpMV operation, we present a strategy especially useful for highly imbalanced input matrices. In this approach, the number of created tasks is dynamically decided in order to maximize the use of the resources of the platform. Throughout this paper, SpMV behavior depending on the selected strategy (state of the art and proposed strategies) is deeply analyzed, setting in this way the base for a future auto-tunable code that is able to select the most suitable approach depending on the input matrix. The experiments of this work were carried out for a set of 12 matrices from the Suite Sparse Matrix Collection, all of them with different characteristics regarding their sparsity. The experiments of this work were performed on a node of Marenostrum 4 supercomputer (with two sockets Intel Xeon, 24 cores each) and on a node of Dibona cluster (using one ARM ThunderX2 socket with 32 cores). Our tests show that, for Intel Xeon, the best parallelization strategy reduces the execution time of the reference MKL multi-threaded version up to 67%. On ARM ThunderX2, the reduction is up to 56% with respect to the OmpSs parallel reference.This project has received funding from the Spanish Ministry of Economy and Competitiveness under the project Computaci贸n de Altas Prestaciones VII (TIN2015- 65316-P), the Departament d鈥橧nnovaci贸, Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Programaci贸 i Entorns d鈥橢xecuci贸 Parallels (2014-SGR-1051), and the Juan de la Cierva Grant Agreement No IJCI-2017- 33511, and the Spanish Ministry of Science and Innovation under the project Heterogeneidad y especializaci贸n en la era post-Moore (RTI2018-093684-BI00). We also acknowledge the funding provided by Fujitsu under the BSC-Fujitsu joint project: Math Libraries Migration and OptimizationPeer ReviewedPostprint (author's final draft

    Towards enhancing coding productivity for GPU programming using static graphs

    Get PDF
    The main contribution of this work is to increase the coding productivity of GPU programming by using the concept of Static Graphs. GPU capabilities have been increasing significantly in terms of performance and memory capacity. However, there are still some problems in terms of scalability and limitations to the amount of work that a GPU can perform at a time. To minimize the overhead associated with the launch of GPU kernels, as well as to maximize the use of GPU capacity, we have combined the new CUDA Graph API with the CUDA programming model (including CUDA math libraries) and the OpenACC programming model. We use as test cases two different, well-known and widely used problems in HPC and AI: the Conjugate Gradient method and the Particle Swarm Optimization. In the first test case (Conjugate Gradient) we focus on the integration of Static Graphs with CUDA. In this case, we are able to significantly outperform the NVIDIA reference code, reaching an acceleration of up to 11× thanks to a better implementation, which can benefit from the new CUDA Graph capabilities. In the second test case (Particle Swarm Optimization), we complement the OpenACC functionality with the use of CUDA Graph, achieving again accelerations of up to one order of magnitude, with average speedups ranging from 2× to 4×, and performance very close to a reference and optimized CUDA code. Our main target is to achieve a higher coding productivity model for GPU programming by using Static Graphs, which provides, in a very transparent way, a better exploitation of the GPU capacity. The combination of using Static Graphs with two of the current most important GPU programming models (CUDA and OpenACC) is able to reduce considerably the execution time w.r.t. the use of CUDA and OpenACC only, achieving accelerations of up to more than one order of magnitude. Finally, we propose an interface to incorporate the concept of Static Graphs into the OpenACC Specifications.his research was funded by EPEEC project from the European Union鈥檚 Horizon 2020 Research and Innovation program under grant agreement No. 801051. This manuscript has been authored by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan, accessed on 13 April 2022).Peer ReviewedPostprint (published version

    Techniques To Facilitate the Understanding of Inter-process Communication Traces

    Get PDF
    High Performance Computing (HPC) systems play an important role in today鈥檚 heavily digitized world, which is in a constant demand for higher speed of calculation and performance. HPC applications are used in multiple domains such as telecommunication, health, scientific research, and more. With the emergence of multi-core and cloud computing platforms, the HPC paradigm is quickly becoming the design of choice of many service providers. HPC systems are also known to be complex to debug and analyze due to the large number of processes they involve and the way these processes communicate with each other to perform specific tasks. As a result, software engineers must spend extensive amount of time understanding the complex interactions among a system鈥檚 processes. This is usually done through the analysis of execution traces generated from running the system at hand. Traces, however, are very difficult to work with due to the overwhelming size of typical traces. The objective of this research is to present a set of techniques that facilitates the understanding of the behaviour of HPC applications through the analysis of system traces. The first technique consists of building an exchange format called MTF (MPI Trace Format) for representing and exchanging traces generated from HPC applications based on the MPI (Message Passing Interface) standard, which is a de facto standard for inter-process communication for high performance computing systems. The design of MTF is validated against well-known requirements for a standard exchange format. The second technique aims to facilitate the understanding of large traces of inter-process communication by automatically extracting communication patterns that characterize their main behaviour. Two algorithms are presented. The first one permits the recognition of repeating patterns in traces of MPI (Message Passing Interaction) applications whereas the second algorithm searches if a given communication pattern occurs in a trace. Both algorithms are based on the n-gram extraction technique used in natural language processing. Finally, we developed a technique to abstract MPI traces by detecting the different execution phases in a program based on concepts from information theory. Using this approach, software engineers can examine the trace as a sequence of high-level computational phases instead of a mere flow of low-level events. The techniques presented in this thesis have been tested on traces generated from real HPC programs. The results from several case studies demonstrate the usefulness and effectiveness of our techniques

    A Fast Solver for Large Tridiagonal Systems on Multi-Core Processors (Lass Library)

    No full text

    MS FT-2-2 7 Orthogonal polynomials and quadrature: Theory, computation, and applications

    Get PDF
    Quadrature rules find many applications in science and engineering. Their analysis is a classical area of applied mathematics and continues to attract considerable attention. This seminar brings together speakers with expertise in a large variety of quadrature rules. It is the aim of the seminar to provide an overview of recent developments in the analysis of quadrature rules. The computation of error estimates and novel applications also are described

    Generalized averaged Gaussian quadrature and applications

    Get PDF
    A simple numerical method for constructing the optimal generalized averaged Gaussian quadrature formulas will be presented. These formulas exist in many cases in which real positive GaussKronrod formulas do not exist, and can be used as an adequate alternative in order to estimate the error of a Gaussian rule. We also investigate the conditions under which the optimal averaged Gaussian quadrature formulas and their truncated variants are internal
    corecore