
    Automatic Tuning to Performance Modelling of Matrix Polynomials on Multicore and Multi-GPU Systems

    Automatic tuning methodologies have been used in the design of routines in recent years. The goal of these methodologies is to develop routines that automatically adapt to the conditions of the underlying computational system, so that efficient executions are obtained independently of the end-user's experience. This paper explores programming routines that can automatically adapt to the conditions of the computational system thanks to these automatic tuning methodologies. In particular, we have worked on the evaluation of matrix polynomials on multicore and multi-GPU systems as a target application. This application is very useful for the computation of matrix functions such as the sine or cosine but, at the same time, it is very time consuming, since the basic computational kernel, the matrix multiplication, is carried out many times. The use of all available resources within a node in an easy and efficient way is crucial for the end user.

    This work has been partially supported by Generalitat Valenciana under Grant PROMETEOII/2014/003, and by the Spanish MINECO, as well as European Commission FEDER funds, under Grants TEC2015-67387-C4-1-R and TIN2015-66972-C5-3-R, and by network CAPAP-H. We have also worked in cooperation with the EU-COST Programme Action IC1305, "Network for Sustainable Ultrascale Computing (NESUS)".

    Boratto, M.; Alonso-Jordá, P.; Gimenez, D.; Lastovetsky, A. (2017). Automatic Tuning to Performance Modelling of Matrix Polynomials on Multicore and Multi-GPU Systems. The Journal of Supercomputing 73(1):227-239. https://doi.org/10.1007/s11227-016-1694-y
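    The cost structure the abstract describes, one matrix multiplication per polynomial degree, can be made concrete with Horner's rule. This is a generic NumPy sketch for illustration, not the paper's tuned multicore/multi-GPU routine:

```python
import numpy as np

def matrix_polynomial(coeffs, A):
    """Evaluate p(A) = c0*I + c1*A + ... + cd*A^d via Horner's scheme.

    Each loop iteration performs one matrix multiplication, which is why
    GEMM dominates the runtime of matrix polynomial evaluation.
    """
    n = A.shape[0]
    I = np.eye(n)
    P = coeffs[-1] * I            # start from the leading coefficient
    for c in reversed(coeffs[:-1]):
        P = P @ A + c * I         # one GEMM per remaining coefficient
    return P

# Example: p(A) = I + A + A^2/2, a truncated matrix exponential.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
P = matrix_polynomial([1.0, 1.0, 0.5], A)
```

    For this nilpotent A (where A^2 = 0), the result reduces to I + A, which makes the scheme easy to verify by hand.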

    Contributions to the efficient use of general purpose coprocessors: kernel density estimation as case study

    The high-performance computing landscape is shifting from assemblies of homogeneous nodes towards heterogeneous systems, in which nodes consist of a combination of traditional out-of-order execution cores and accelerator devices. Accelerators provide greater theoretical performance than traditional multi-core CPUs, but exploiting their computing power remains a challenging task. This dissertation discusses the issues that arise when trying to use general-purpose accelerators efficiently. As a contribution to this task, we present a thorough survey of performance modelling techniques and tools for general-purpose coprocessors. We then use the statistical technique of Kernel Density Estimation (KDE) as a case study. KDE is a memory-bound application that poses several challenges for its adaptation to the accelerator-based model. We present a novel algorithm for the computation of KDE, called S-KDE, that considerably reduces its computational complexity. Furthermore, we have carried out two parallel implementations of S-KDE: one for multi- and many-core processors, and another for accelerators. The latter has been implemented in OpenCL in order to make it portable across a wide range of devices. We have evaluated the performance of each implementation of S-KDE on a variety of architectures, trying to highlight the bottlenecks and the limits that the code reaches on each device. Finally, we present an application of our S-KDE algorithm in the field of climatology: a novel methodology for the evaluation of environmental models.
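    To see why direct KDE is expensive and memory bound, consider the textbook formulation, in which every evaluation point must touch every sample. This is the standard O(n*m) direct estimator with a Gaussian kernel, shown only to make the computation's structure concrete; it is not the authors' reduced-complexity S-KDE:

```python
import numpy as np

def gaussian_kde(samples, xs, h):
    """Direct kernel density estimate with a Gaussian kernel:
        f(x) = 1/(n*h) * sum_i K((x - x_i) / h)
    Every evaluation point reads every sample, so the cost is O(n*m)
    and the traffic through memory dominates on large inputs.
    """
    samples = np.asarray(samples, dtype=float)[None, :]   # shape (1, n)
    xs = np.asarray(xs, dtype=float)[:, None]             # shape (m, 1)
    u = (xs - samples) / h
    kernel = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)   # Gaussian kernel
    return kernel.sum(axis=1) / (samples.size * h)

# Density at x = 1 estimated from three samples with bandwidth h = 1.
density = gaussian_kde([0.0, 1.0, 2.0], xs=[1.0], h=1.0)
```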

    Parallel Models for Solving Agricultural Engineering Problems

    This work falls within the field of parallel computing and, more specifically, the development and use of computational models on heterogeneous parallel architectures for solving applied problems. The thesis tackles a series of problems related to the application of technology to farming, comprising: terrain representation, the handling of climate information such as temperature, and the management of water resources. The study and solution of these problems in the area where they arise have a broad economic and environmental impact. The problems are formulated on mathematical models whose solution is computationally expensive, and sometimes even infeasible. The thesis implements fast and efficient parallel algorithms that solve the associated mathematical problems on multicore and multi-GPU nodes. It also studies, proposes, and applies techniques that allow the designed routines to adapt automatically to the characteristics of the parallel system on which they are installed and executed, in order to obtain, at low cost, a version as close as possible to the optimal one. The goal is to provide users with software that is portable and, at the same time, capable of running efficiently on the computer at hand, independently of the characteristics of the architecture and of any knowledge the user may have of that architecture.

    Do Carmo Boratto, M. (2015). Modelos Paralelos para la Resolución de Problemas de Ingeniería Agrícola [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/48529

    A Framework for the Efficient Execution of Applications on GPU and CPU+GPU

    Technological limitations faced by semiconductor manufacturers in the early 2000s restricted the increase in performance of sequential computation units. Nowadays, the trend is to increase the number of processor cores per socket and to progressively use GPU cards for highly parallel computations. The complexity of recent architectures makes it difficult to statically predict the performance of a program. We describe a reliable and accurate method for predicting the execution time of parallel loop nests on GPUs, based on three stages: static code generation, offline profiling, and online prediction. In addition, we present two techniques to fully exploit the computing resources available on a system. The first technique consists in jointly using the CPU and GPU to execute a code. In order to achieve higher performance, it is mandatory to consider load balance, in particular by predicting execution times: the runtime uses the profiling results, and the scheduler computes execution times and adjusts the load distributed to the processors. The second technique puts the CPU and GPU in competition: instances of the considered code are executed simultaneously on the CPU and GPU, and the winner of the competition notifies the other instance of its completion, implying the termination of the latter.
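    The competition scheme is essentially a race between two executors in which the loser is told to stop. A minimal plain-Python sketch of the idea, not the framework's actual runtime, might look as follows (the toy tasks and chunk sizes are invented for illustration):

```python
import threading

def race(task_cpu, task_gpu):
    """Run the same work on two workers; the first to finish wins and
    the other instance is notified so it can terminate early."""
    done = threading.Event()
    lock = threading.Lock()
    result = {}

    def run(name, task):
        value = task(done)              # the task polls `done` to stop early
        with lock:
            if value is not None and not done.is_set():
                done.set()              # notify the losing instance
                result["winner"], result["value"] = name, value

    threads = [threading.Thread(target=run, args=(n, t))
               for n, t in (("cpu", task_cpu), ("gpu", task_gpu))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result

def make_task(n_chunks):
    """Toy task that works in chunks and abandons the race once it has lost."""
    def task(done):
        acc = 0
        for i in range(n_chunks):
            if done.is_set():
                return None             # lost the race: terminate early
            acc += i
        return acc
    return task

outcome = race(make_task(10), make_task(200000))
```

    A real implementation would cancel the losing kernel rather than rely on cooperative polling, but the polling version keeps the notification protocol visible.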

    Performance Portable Solid Mechanics via Matrix-Free p-Multigrid

    Finite element analysis of solid mechanics is a foundational tool of modern engineering, with low-order finite element methods and assembled sparse matrices representing the industry standard for implicit analysis. We use performance models and numerical experiments to demonstrate that high-order methods greatly reduce the cost of reaching engineering tolerances while enabling effective use of GPUs. We demonstrate the reliability, efficiency, and scalability of matrix-free p-multigrid methods with algebraic multigrid coarse solvers through large-deformation hyperelastic simulations of multiscale structures. We investigate accuracy, cost, and execution time on multi-node CPU and GPU systems for moderate to large models using AMD MI250X (OLCF Crusher), NVIDIA A100 (NERSC Perlmutter), and V100 (LLNL Lassen and OLCF Summit), resulting in order-of-magnitude efficiency improvements over a broad range of model properties and scales. We discuss efficient matrix-free representation of Jacobians and demonstrate how automatic differentiation enables rapid development of nonlinear material models without impacting debuggability or workflows targeting GPUs.
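    The core "matrix-free" idea, applying an operator through a function instead of an assembled sparse matrix, can be illustrated in miniature. This toy uses a 1-D Laplacian stencil and a hand-rolled conjugate gradient solver; the paper's operators are high-order hyperelastic Jacobians on GPUs, so this shows only the interface, not the method itself:

```python
import numpy as np

def apply_laplacian(u):
    """Apply the 1-D Laplacian (Dirichlet boundaries) as a stencil.
    No matrix is ever assembled; the operator exists only as code."""
    v = 2.0 * u
    v[:-1] -= u[1:]
    v[1:] -= u[:-1]
    return v

def cg(apply_A, b, tol=1e-10, maxiter=1000):
    """Conjugate gradient that needs only the operator's action A @ p."""
    x = np.zeros_like(b)
    r = b - apply_A(x)
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = apply_A(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

u = cg(apply_laplacian, np.ones(50))
```

    For high-order elements the operator's action can be computed from element-level tensor contractions, which is what makes matrix-free methods cheaper than assembled sparse matrices as the polynomial degree grows.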

    Computational Methods and Graphical Processing Units for Real-time Control of Tomographic Adaptive Optics on Extremely Large Telescopes.

    Ground-based optical telescopes suffer from limited imaging resolution as a result of the effects of atmospheric turbulence on the incoming light. Adaptive optics technology has so far been very successful in correcting these effects, providing nearly diffraction-limited images. Extremely Large Telescopes will require more complex adaptive optics configurations that introduce the need for new mathematical models and optimal solvers. In addition, the amount of data to be processed in real time is greatly increased, making the use of conventional computational methods and hardware inefficient, which motivates the study of advanced computational algorithms and their implementation on parallel processors. Graphical Processing Units (GPUs) are massively parallel processors that have so far demonstrated very large speed-ups compared to CPUs and other devices, and they have a high potential to meet the real-time constraints of adaptive optics systems. This thesis focuses on the study and evaluation of existing computational algorithms with respect to computational performance, and on their implementation on GPUs. Two basic methods, one direct and one iterative, are implemented and tested, and the results presented provide an evaluation of the basic concept upon which other algorithms are based and demonstrate the benefits of using GPUs for adaptive optics.

    Enabling the use of embedded and mobile technologies for high-performance computing

    In the late 1990s, powerful economic forces led to the adoption of commodity desktop processors in High-Performance Computing (HPC). This transformation has been so effective that the November 2016 TOP500 list is still dominated by the x86 architecture. In 2016, the largest commodity market in computing is not PCs or servers, but mobile computing, comprising smartphones and tablets, most of which are built with ARM-based Systems on Chips (SoCs). This suggests that once mobile SoCs deliver sufficient performance, they can help reduce the cost of HPC. This thesis addresses this question in detail. We analyze the trend in mobile SoC performance, comparing it with the similar trend of the 1990s. Through the development of real system prototypes and their performance analysis, we assess the feasibility of building an HPC system based on mobile SoCs. Through simulation of future mobile SoCs, we identify the missing features and suggest improvements that would enable the use of future mobile SoCs in an HPC environment. Thus, we present design guidelines for future generations of mobile SoCs, and for HPC systems built around them, enabling a new class of cheap supercomputers.

    Energy Efficiency Models for Scientific Applications on Supercomputers


    Hybrid algorithms for efficient Cholesky decomposition and matrix inverse using multicore CPUs with GPU accelerators

    The use of linear algebra routines is fundamental to many areas of computational science, yet their implementation in software still forms the main computational bottleneck in many widely used algorithms. In machine learning and computational statistics, for example, the use of Gaussian distributions is ubiquitous, and routines for calculating the Cholesky decomposition, matrix inverse, and matrix determinant must often be called many thousands of times by common algorithms such as Markov chain Monte Carlo. These linear algebra routines consume most of the total computational time of a wide range of statistical methods, and any improvement in this area will therefore greatly increase the overall efficiency of algorithms used in many scientific application areas. The importance of linear algebra algorithms is clear from the substantial effort that has been invested over the last 25 years in producing low-level software libraries such as LAPACK, which generally optimise these routines by breaking a large problem into smaller problems that may be computed independently. The performance of such libraries is, however, strongly dependent on the specific hardware available. LAPACK was originally developed for single-core processors with a memory hierarchy, whereas modern computers often consist of mixed architectures, with large numbers of parallel cores and graphics processing units (GPUs) being used alongside traditional CPUs. The challenge lies in making optimal use of these different types of computing units, which generally have very different processor speeds and types of memory.

    In this thesis we develop novel low-level algorithms that may be generally employed in blocked linear algebra routines and that automatically optimise themselves to take full advantage of whatever heterogeneous architectures are available. We present a comparison of our methods with MAGMA, the state-of-the-art open-source implementation of LAPACK designed specifically for hybrid architectures, and demonstrate speed-ups of up to 400% from our novel algorithms, specifically when running the commonly used Cholesky decomposition, matrix inverse, and matrix determinant routines.
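    The blocked structure the abstract refers to can be sketched with the textbook right-looking blocked Cholesky: factor a diagonal block, solve a triangular panel, then update the trailing submatrix with a GEMM-like rank-k operation (the step hybrid CPU/GPU codes offload). This is the generic scheme, not the thesis's auto-tuned hybrid routine:

```python
import numpy as np

def blocked_cholesky(A, nb=64):
    """Right-looking blocked Cholesky returning the lower factor L.
    Only the lower triangle of A is read, as in LAPACK's dpotrf."""
    L = np.tril(A.astype(float).copy())
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        # 1. Factor the diagonal block with an unblocked routine.
        L[k:e, k:e] = np.linalg.cholesky(L[k:e, k:e])
        if e < n:
            # 2. Panel solve: L21 = A21 * L11^{-T} (triangular solve).
            L[e:, k:e] = np.linalg.solve(L[k:e, k:e], L[e:, k:e].T).T
            # 3. Trailing update: A22 -= L21 * L21^T (GEMM-heavy step).
            L[e:, e:] -= np.tril(L[e:, k:e] @ L[e:, k:e].T)
    return L

# Build a random symmetric positive definite matrix and factor it.
rng = np.random.default_rng(0)
M = rng.standard_normal((12, 12))
A = M @ M.T + 12.0 * np.eye(12)
L = blocked_cholesky(A, nb=5)
```

    The trailing update dominates the flop count, which is why mapping it well onto the available CPUs and GPUs determines overall performance.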