372 research outputs found

    Evaluating Asymmetric Multicore Systems-on-Chip using Iso-Metrics

    Get PDF
    The end of Dennard scaling has pushed power consumption into a first order concern for current systems, on par with performance. As a result, near-threshold voltage computing (NTVC) has been proposed as a potential means to tackle the limited cooling capacity of CMOS technology. Hardware operating in NTV consumes significantly less power, at the cost of lower frequency, and thus reduced performance, as well as increased error rates. In this paper, we investigate if a low-power systems-on-chip, consisting of ARM's asymmetric big.LITTLE technology, can be an alternative to conventional high performance multicore processors in terms of power/energy in an unreliable scenario. For our study, we use the Conjugate Gradient solver, an algorithm representative of the computations performed by a large range of scientific and engineering codes.Comment: Presented at HiPEAC EEHCO '15, 6 page

    Efficient Numerical Algorithms for Balanced Stochastic Truncation

    Get PDF
    We propose an efficient numerical algorithm for relative error model reduction based on balanced stochastic truncation. The method uses full-rank factors of the Gramians to be balanced versus each other and exploits the fact that for large-scale systems these Gramians are often of low numerical rank. We use the easy-to-parallelize sign function method as the major computational tool in determining these full-rank factors and demonstrate the numerical performance of the suggested implementation of balanced stochastic truncation model reduction

    Tall-and-skinny QR factorization with approximate Householder reflectors on graphics processors

    Full text link
    [EN] We present a novel method for the QR factorization of large tall-and-skinny matrices that introduces an approximation technique for computing the Householder vectors. This approach is very competitive on a hybrid platform equipped with a graphics processor, with a performance advantage over the conventional factorization due to the reduced amount of data transfers between the graphics accelerator and the main memory of the host. Our experiments show that, for tall¿skinny matrices, the new approach outperforms the code in MAGMA by a large margin, while it is very competitive for square matrices when the memory transfers and CPU computations are the bottleneck of the Householder QR factorizationThis research was supported by the Project TIN2017-82972-R from the MINECO (Spain) and the EU H2020 Project 732631 "OPRECOMP. Open Transprecision Computing".Tomás Domínguez, AE.; Quintana-Ortí, ES. (2020). Tall-and-skinny QR factorization with approximate Householder reflectors on graphics processors. The Journal of Supercomputing (Online). 76(11):8771-8786. https://doi.org/10.1007/s11227-020-03176-3S877187867611Abdelfattah A, Haidar A, Tomov S, Dongarra J (2018) Analysis and design techniques towards high-performance and energy-efficient dense linear solvers on GPUs. IEEE Trans Parallel Distrib Syst 29(12):2700–2712. https://doi.org/10.1109/TPDS.2018.2842785Ballard G, Demmel J, Grigori L, Jacquelin M, Knight N, Nguyen H (2015) Reconstructing Householder vectors from tall-skinny QR. J Parallel Distrib Comput 85:3–31. https://doi.org/10.1016/j.jpdc.2015.06.003Barrachina S, Castillo M, Igual FD, Mayo R, Quintana-Ortí ES (2008) Solving dense linear systems on graphics processors. In: Luque E, Margalef T, Benítez D (eds) Euro-Par 2008—parallel processing. Springer, Heidelberg, pp 739–748Benson AR, Gleich DF, Demmel J (2013) Direct QR factorizations for tall-and-skinny matrices in MapReduce architectures. In: 2013 IEEE International Conference on Big Data, pp 264–272. https://doi.org/10.1109/BigData.2013.6691583Businger P, Golub GH (1965) Linear least squares solutions by householder transformations. Numer Math 7(3):269–276. https://doi.org/10.1007/BF01436084Demmel J, Grigori L, Hoemmen M, Langou J (2012) Communication-optimal parallel and sequential QR and LU factorizations. SIAM J Sci Comput 34(1):206–239. https://doi.org/10.1137/080731992Dongarra J, Du Croz J, Hammarling S, Duff IS (1990) A set of level 3 basic linear algebra subprograms. ACM Trans Math Softw 16(1):1–17. https://doi.org/10.1145/77626.79170Drmač Z, Bujanović Z (2008) On the failure of rank-revealing qr factorization software—a case study. ACM Trans Math Softw 35(2):12:1–12:28. https://doi.org/10.1145/1377612.1377616Fukaya T, Nakatsukasa Y, Yanagisawa Y, Yamamoto Y (2014) CholeskyQR2: A simple and communication-avoiding algorithm for computing a tall-skinny QR factorization on a large-scale parallel system. In: 2014 5th workshop on latest advances in scalable algorithms for large-scale systems, pp 31–38. https://doi.org/10.1109/ScalA.2014.11Fukaya T, Kannan R, Nakatsukasa Y, Yamamoto Y, Yanagisawa Y (2018) Shifted CholeskyQR for computing the QR factorization of ill-conditioned matrices, arXiv:1809.11085Golub G, Van Loan C (2013) Matrix computations. Johns Hopkins studies in the mathematical sciences. Johns Hopkins University Press, BaltimoreGunter BC, van de Geijn RA (2005) Parallel out-of-core computation and updating the QR factorization. ACM Trans Math Softw 31(1):60–78. https://doi.org/10.1145/1055531.1055534Joffrain T, Low TM, Quintana-Ortí ES, Rvd Geijn, Zee FGV (2006) Accumulating householder transformations, revisited. ACM Trans Math Softw 32(2):169–179. https://doi.org/10.1145/1141885.1141886Puglisi C (1992) Modification of the householder method based on the compact WY representation. SIAM J Sci Stat Comput 13(3):723–726. https://doi.org/10.1137/0913042Saad Y (2003) Iterative methods for sparse linear systems, 3rd edn. Society for Industrial and Applied Mathematics, PhiladelphiaSchreiber R, Van Loan C (1989) A storage-efficient WY representation for products of householder transformations. SIAM J Sci Comput 10(1):53–57. https://doi.org/10.1137/0910005Stathopoulos A, Wu K (2001) A block orthogonalization procedure with constant synchronization requirements. SIAM J Sci Comput 23(6):2165–2182. https://doi.org/10.1137/S1064827500370883Strazdins P (1998) A comparison of lookahead and algorithmic blocking techniques for parallel matrix factorization. Tech. Rep. TR-CS-98-07, Department of Computer Science, The Australian National University, Canberra 0200 ACT, AustraliaTomás Dominguez AE, Quintana Orti ES (2018) Fast blocking of householder reflectors on graphics processors. In: 2018 26th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp 385–393. https://doi.org/10.1109/PDP2018.2018.00068Volkov V, Demmel JW (2008) LU, QR and Cholesky factorizations using vector capabilities of GPUs. Tech. Rep. 202, LAPACK Working Note. http://www.netlib.org/lapack/lawnspdf/lawn202.pdfYamamoto Y, Nakatsukasa Y, Yanagisawa Y, Fukaya T (2015) Roundoff error analysis of the Cholesky QR2 algorithm. Electron Trans Numer Anal 44:306–326Yamazaki I, Tomov S, Dongarra J (2015) Mixed-precision Cholesky QR factorization and its case studies on multicore CPU with multiple GPUs. SIAM J Sci Comput 37(3):C307–C330. https://doi.org/10.1137/14M097377

    Applying OOC Techniques in the Reduction to Condensed Form for Very Large Symmetric Eigenproblems on GPUs

    Get PDF
    In this paper we address the reduction of a dense matrix to tridiagonal form for the solution of symmetric eigenvalue problems on a graphics processor (GPU) when the data is too large to fit into the accelerator memory. We apply out-of-core techniques to a three-stage algorithm, carefully redesigning the first stage to reduce the number of data transfers between the CPU and GPU memory spaces, maintain the memory requirements on the GPU within limits, and ensure high performance by featuring a high ratio between computation and communication

    Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors

    Get PDF
    Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest for low-power high performance computing, this type of architectures is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric--static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, expose that our cache-aware versions of gemm with asymmetric scheduling attain important gains in performance with respect to its architecture-oblivious counterparts while exploiting all the resources of the AMP to deliver considerable energy efficiency

    A Review of Lightweight Thread Approaches for High Performance Computing

    Get PDF
    High-level, directive-based solutions are becoming the programming models (PMs) of the multi/many-core architectures. Several solutions relying on operating system (OS) threads perfectly work with a moderate number of cores. However, exascale systems will spawn hundreds of thousands of threads in order to exploit their massive parallel architectures and thus conventional OS threads are too heavy for that purpose. Several lightweight thread (LWT) libraries have recently appeared offering lighter mechanisms to tackle massive concurrency. In order to examine the suitability of LWTs in high-level runtimes, we develop a set of microbenchmarks consisting of commonly-found patterns in current parallel codes. Moreover, we study the semantics offered by some LWT libraries in order to expose the similarities between different LWT application programming interfaces. This study reveals that a reduced set of LWT functions can be sufficient to cover the common parallel code patterns andthat those LWT libraries perform better than OS threads-based solutions in cases where task and nested parallelism are becoming more popular with new architectures.The researchers from the Universitat Jaume I de Castelló were supported by project TIN2014-53495-R of the MINECO, the Generalitat Valenciana fellowship programme Vali+d 2015, and FEDER. This work was partially supported by the U.S. Dept. of Energy, Office of Science, Office of Advanced Scientific Computing Research (SC-21), under contract DEAC02-06CH11357. We gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory.Peer ReviewedPostprint (author's final draft

    Modeling power consumption of 3D MPDATA and the CG method on ARM and Intel multicore architectures

    Get PDF
    We propose an approach to estimate the power consumption of algorithms, as a function of the frequency and number of cores, using only a very reduced set of real power measures. In addition, we also provide the formulation of a method to select the voltage–frequency scaling–concurrency throttling configurations that should be tested in order to obtain accurate estimations of the power dissipation. The power models and selection methodology are verified using two real scientific application: the stencil-based 3D MPDATA algorithm and the conjugate gradient (CG) method for sparse linear systems. MPDATA is a crucial component of the EULAG model, which is widely used in weather forecast simulations. The CG algorithm is the keystone for iterative solution of sparse symmetric positive definite linear systems via Krylov subspace methods. The reliability of the method is confirmed for a variety of ARM and Intel architectures, where the estimated results correspond to the real measured values with the average error being slightly below 5% in all cases
    corecore