117 research outputs found

    Improved Accuracy and Parallelism for MRRR-based Eigensolvers -- A Mixed Precision Approach

    The real symmetric tridiagonal eigenproblem is of outstanding importance in numerical computations; it arises frequently as part of eigensolvers for standard and generalized dense Hermitian eigenproblems that are based on a reduction to tridiagonal form. For its solution, the algorithm of Multiple Relatively Robust Representations (MRRR) is among the fastest methods. Although fast, the solvers based on MRRR do not deliver the same accuracy as competing methods like Divide & Conquer or the QR algorithm. In this paper, we demonstrate that the use of mixed precisions leads to improved accuracy of MRRR-based eigensolvers with limited or no performance penalty. As a result, we obtain eigensolvers that are not only as accurate as or more accurate than the best available methods, but also, in most circumstances, faster and more scalable than the competition.

    Evaluating Asymmetric Multicore Systems-on-Chip using Iso-Metrics

    The end of Dennard scaling has made power consumption a first-order concern for current systems, on par with performance. As a result, near-threshold voltage computing (NTVC) has been proposed as a potential means to tackle the limited cooling capacity of CMOS technology. Hardware operating in NTV consumes significantly less power, at the cost of lower frequency, and thus reduced performance, as well as increased error rates. In this paper, we investigate whether low-power systems-on-chip based on ARM's asymmetric big.LITTLE technology can be an alternative to conventional high performance multicore processors in terms of power/energy in an unreliable scenario. For our study, we use the Conjugate Gradient solver, an algorithm representative of the computations performed by a large range of scientific and engineering codes. Comment: Presented at HiPEAC EEHCO '15, 6 pages

    Modeling power consumption of 3D MPDATA and the CG method on ARM and Intel multicore architectures

    We propose an approach to estimate the power consumption of algorithms, as a function of the frequency and number of cores, using only a very reduced set of real power measurements. In addition, we also provide the formulation of a method to select the voltage–frequency scaling–concurrency throttling configurations that should be tested in order to obtain accurate estimations of the power dissipation. The power models and selection methodology are verified using two real scientific applications: the stencil-based 3D MPDATA algorithm and the conjugate gradient (CG) method for sparse linear systems. MPDATA is a crucial component of the EULAG model, which is widely used in weather forecast simulations. The CG algorithm is the keystone for iterative solution of sparse symmetric positive definite linear systems via Krylov subspace methods. The reliability of the method is confirmed for a variety of ARM and Intel architectures, where the estimated results correspond to the real measured values with the average error being slightly below 5% in all cases.
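For readers unfamiliar with the CG method mentioned in the abstract above, the following is a minimal sketch of the textbook unpreconditioned algorithm for a symmetric positive definite system. It is an illustration only, not the implementation evaluated in the paper: the function name `conjugate_gradient`, the dense list-of-lists matrix, and the default tolerance are assumptions made here for brevity (the paper targets sparse systems).

```python
def conjugate_gradient(a, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A (dense list of lists).

    Textbook CG starting from x = 0; stops when the residual 2-norm
    drops below `tol` or after `max_iter` iterations.
    """
    n = len(b)
    x = [0.0] * n
    r = [float(v) for v in b]      # residual b - A x, with x = 0
    p = list(r)                    # first search direction
    rs_old = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        # Matrix-vector product A p (the dominant cost per iteration)
        ap = [sum(a[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        # New direction, conjugate to the previous ones with respect to A
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x


if __name__ == "__main__":
    solution = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
    print(solution)
```

Each iteration is dominated by one matrix-vector product, which is why CG is commonly treated as a memory-bound kernel in performance and power studies.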

    Using graphics processors to accelerate the computation of the matrix inverse

    We study the use of massively parallel architectures for computing a matrix inverse. Two different algorithms are reviewed: the traditional approach based on Gaussian elimination and the Gauss-Jordan elimination alternative. Several high performance implementations are presented and evaluated. The target architecture is a current general-purpose multi-core processor (CPU) connected to a graphics processor (GPU). Numerical experiments show the efficiency attained by the proposed implementations and how the computation of large-scale inverses, which only a few years ago would have required a distributed-memory cluster, takes only a few minutes on a hybrid architecture formed by a multi-core CPU and a GPU.
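As a small illustration of the Gauss-Jordan alternative named in the abstract above, the sketch below inverts a dense matrix by reducing the augmented matrix [A | I] to [I | A⁻¹] with partial pivoting. It is a plain, sequential, unblocked version written for clarity; the function name `gauss_jordan_inverse` and the singularity threshold are assumptions made here, and nothing in it reflects the blocked CPU-GPU implementations the paper evaluates.

```python
def gauss_jordan_inverse(a, eps=1e-12):
    """Return the inverse of a square matrix given as a list of lists.

    Works on the augmented matrix [A | I]; after elimination the right
    half holds the inverse. Raises ValueError if a pivot is below `eps`.
    """
    n = len(a)
    aug = [
        [float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(a)
    ]
    for col in range(n):
        # Partial pivoting: pick the largest entry in the current column
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < eps:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate the column in every other row (above and below)
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


if __name__ == "__main__":
    print(gauss_jordan_inverse([[4.0, 7.0], [2.0, 6.0]]))
```

Unlike LU-based inversion, every step updates all rows of the matrix rather than only those below the pivot, so the amount of work per step stays constant throughout the elimination.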

    Toward a modular precision ecosystem for high performance computing

    With the memory bandwidth of current computer architectures lagging significantly behind their (floating point) arithmetic performance, many scientific computations only leverage a fraction of the computational power in today's high-performance architectures. At the same time, memory operations are the primary energy consumer of modern architectures, heavily impacting the resource cost of large-scale applications and the battery life of mobile devices. This article tackles this mismatch between floating point arithmetic throughput and memory bandwidth by advocating a disruptive paradigm change with respect to how data are stored and processed in scientific applications. Concretely, the goal is to radically decouple the data storage format from the processing format and, ultimately, design a "modular precision ecosystem" that allows for more flexibility in terms of customized data access. For memory-bounded scientific applications, dynamically adapting the memory precision to the numerical requirements allows for attractive resource savings. In this article, we demonstrate the potential of employing a modular precision ecosystem for the block-Jacobi preconditioner and the PageRank algorithm, two applications that are popular in their communities and at the same time characteristic representatives of the fields of numerical linear algebra and data analytics, respectively.
    The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Impuls und Vernetzungsfond of the Helmholtz Association under grant VH-NG-1241. G Flegar and ES Quintana-Ortí were supported by project TIN2017-82972-R of the MINECO and FEDER and the H2020 EU FETHPC Project 732631 OPRECOMP.
    Anzt, H.; Flegar, G.; Gruetzmacher, T.; Quintana-Orti, ES. (2019). Toward a modular precision ecosystem for high performance computing. International Journal of High Performance Computing Applications. 33(6):1069-1078. https://doi.org/10.1177/1094342019846547
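The idea of decoupling the storage format from the arithmetic format, described in the abstract above, can be illustrated with a deliberately simple sketch: values are kept in memory in IEEE single precision, while all arithmetic is carried out in double precision. This is not the modular precision format proposed by the authors; the class name `ReducedPrecisionVector` and its methods are hypothetical and exist only for this example.

```python
import math
from array import array


class ReducedPrecisionVector:
    """Vector stored in single precision, processed in double precision."""

    def __init__(self, values):
        # Type code "f" stores each entry as a C float (typically 4 bytes);
        # values are rounded to single precision on the way in.
        self._data = array("f", values)

    def __len__(self):
        return len(self._data)

    def storage_bytes(self):
        """Memory used by the stored entries."""
        return self._data.itemsize * len(self._data)

    def dot(self, other):
        # Reading an entry converts it to a Python float (double), so the
        # products and the accumulation both happen in double precision.
        return math.fsum(x * y for x, y in zip(self._data, other._data))


if __name__ == "__main__":
    v = ReducedPrecisionVector([0.5, 0.25, 2.0])
    print(v.storage_bytes(), v.dot(v))
```

For a memory-bound kernel, halving the bytes read per entry can matter more than the precision of the arithmetic itself, which is the trade-off the abstract refers to; whether the rounding on storage is acceptable depends on the numerical requirements of the application.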

    On Nested Task Parallelism in the LU Factorization of Hierarchical Matrices (original title: Sobre el paralelismo anidado de tareas en la factorización LU de Matrices Jerárquicas)

    Paper presented at the XXX Jornadas de Paralelismo (JP2019) and the IV Jornadas de Computación Empotrada y Reconfigurable (JCER2019) / Jornadas Sociedad de Arquitectura y Tecnología de Computadores (SARTECO, September 18-19, 2019). This paper presents a parallel version of the LU factorization of hierarchical matrices (H-matrices) arising from boundary element methods (BEM). These matrices contain internal structures whose dimensions change while operations are performed on them, so the data structures must be decoupled from those used to represent the dependencies among the tasks on which the parallel implementation is based. We use the OmpSs-2 programming model and its runtime to determine the data flow intrinsic to the parallelism at execution time, and to exploit weak task dependencies and the early release of dependencies. These features make it possible to accelerate the parallel H-LU and improve its performance.

    Solving “Large-Scale” Matrix Problems on Multicore Processors and GPUs (original title: Solución de Problemas Matriciales de “Gran Escala” sobre Procesadores Multinúcleo y GPUs)

    Few realize that, for large matrices, many dense matrix computations achieve nearly the same performance when the matrices are stored on disk as when they are stored in a very large main memory. Similarly, few realize that, given the right programming abstractions, coding Out-of-Core (OOC) implementations of dense linear algebra operations (where data resides on disk and has to be explicitly moved in and out of main memory) is no more difficult than programming high-performance implementations for the case where the matrix is in memory. Finally, few realize that on a contemporary eight-core architecture or a platform equipped with a graphics processor (GPU) one can solve a 100,000 × 100,000 symmetric positive definite linear system in about one hour. Thus, for problems that used to be considered large, it is not necessary to utilize distributed-memory architectures with massive memories if one is willing to wait longer for the solution to be computed on a fast multithreaded architecture like a multi-core computer or a GPU. This paper provides evidence in support of these claims.

    Characterization of Multicore Architectures using Task-Parallel ILU-type Preconditioned CG Solvers

    Paper presented at the 2nd Workshop on Power-Aware Computing (PACO 2017), Ringberg Castle, Germany, July 5-8, 2017. We investigate the efficiency of state-of-the-art multicore processors using a multi-threaded task-parallel implementation of the Conjugate Gradient (CG) method, accelerated with an incomplete LU (ILU) preconditioner. Concretely, we analyze multicore architectures with distinct designs and market targets to compare their parallel performance and energy efficiency.

    Convolution Operators for Deep Learning Inference on the Fujitsu A64FX Processor

    Paper presented at the 2022 IEEE 34th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), held in Bordeaux, France. The convolution operator is a crucial kernel for many computer vision and signal processing applications that rely on deep learning (DL) technologies. As such, the efficient implementation of this operator has received considerable attention in the past few years for a fair range of processor architectures. In this paper, we follow the technology trend toward integrating long SIMD (single instruction, multiple data) arithmetic units into high performance multicore processors to analyse the benefits of this type of hardware acceleration for latency-constrained DL workloads. For this purpose, we implement and optimise, for the Fujitsu A64FX processor, three distinct methods for the calculation of the convolution, namely, the lowering approach, a blocked variant of the direct convolution algorithm, and the Winograd minimal filtering algorithm. Our experimental results include an extensive evaluation of the parallel scalability of these three methods and a comparison of their global performance using three popular DL models and a representative dataset.
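Of the three methods listed in the abstract above, the lowering approach is the easiest to show in a few lines: every input patch is copied into a row of a temporary matrix (often called im2col), which turns the convolution into a matrix product. The sketch below is an assumption-laden simplification: one channel, one filter, stride 1, no padding, and pure Python lists instead of SIMD kernels. The function names `im2col` and `conv2d_lowering` are chosen here, and, as is customary in DL, the "convolution" is computed without flipping the kernel.

```python
def im2col(image, kh, kw):
    """Copy every kh x kw patch of a 2D image into one row of a matrix."""
    h, w = len(image), len(image[0])
    out_h, out_w = h - kh + 1, w - kw + 1
    rows = [
        [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
        for i in range(out_h)
        for j in range(out_w)
    ]
    return rows, out_h, out_w


def conv2d_lowering(image, kernel):
    """Single-channel 2D convolution (stride 1, no padding) via lowering."""
    kh, kw = len(kernel), len(kernel[0])
    patches, out_h, out_w = im2col(image, kh, kw)
    flat_kernel = [v for row in kernel for v in row]
    # With a single filter the matrix product reduces to one dot product
    # per patch; with many filters this step becomes a full GEMM.
    flat_out = [sum(p * k for p, k in zip(patch, flat_kernel)) for patch in patches]
    return [flat_out[r * out_w:(r + 1) * out_w] for r in range(out_h)]


if __name__ == "__main__":
    img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(conv2d_lowering(img, [[1, 0], [0, -1]]))
```

The price of this formulation is the extra memory for the patch matrix, since each input element is replicated once per patch it belongs to; direct and Winograd-based methods avoid that copy.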