
    Multi-threaded dense linear algebra libraries for low-power asymmetric multicore processors

    [EN] Dense linear algebra libraries, such as BLAS and LAPACK, provide a relevant collection of numerical tools for many scientific and engineering applications. While there exist high-performance implementations of the BLAS (and LAPACK) functionality for many current multi-threaded architectures, the adaptation of these libraries for asymmetric multicore processors (AMPs) is still pending. In this paper we address this challenge by developing an asymmetry-aware implementation of the BLAS, based on the BLIS framework, and tailored for AMPs equipped with two types of cores: fast/power-hungry versus slow/energy-efficient. For this purpose, we integrate coarse-grain and fine-grain parallelization strategies into the library routines which, respectively, dynamically distribute the workload between the two core types and statically repartition this work among the cores of the same type. Our results on an ARM® big.LITTLE™ processor embedded in the Exynos 5422 SoC, using the asymmetry-aware version of the BLAS and a plain migration of the legacy version of LAPACK, experimentally assess the benefits, limitations, and potential of this approach from the perspectives of both throughput and energy efficiency. (C) 2016 Elsevier B.V. All rights reserved.

    The researchers from Universidad Jaume I were supported by projects CICYT TIN2011-23283 and TIN2014-53495-R of MINECO and FEDER, and the FPU program of MECD. The researcher from Universidad Complutense de Madrid was supported by project CICYT TIN2015-65277-R. The researcher from Universitat Politècnica de Catalunya was supported by projects TIN2015-65316-P from the Spanish Ministry of Education and 2014 SGR 1051 from the Generalitat de Catalunya, Dep. d'Innovació, Universitats i Empresa.

    Catalán, S.; Herrero, JR.; Igual Peña, FD.; Rodríguez-Sánchez, R.; Quintana Ortí, ES.; Adeniyi-Jones, C. (2018). Multi-threaded dense linear algebra libraries for low-power asymmetric multicore processors. Journal of Computational Science. 25:140-151. https://doi.org/10.1016/j.jocs.2016.10.020
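    The coarse-grain/fine-grain combination described in this abstract can be pictured with a short sketch. The code below is not BLIS code: it splits a simple vector kernel between the two clusters in proportion to an assumed fast-to-slow speed ratio (the paper distributes the coarse-grain share dynamically, whereas this sketch fixes it statically for brevity) and then partitions each cluster's share among its own cores. The core counts, the RATIO constant, and the axpy_range stand-in are all illustrative assumptions.

```c
#include <omp.h>

#define N_FAST 4    /* Cortex-A15 cluster size (Exynos 5422) */
#define N_SLOW 4    /* Cortex-A7 cluster size */
#define RATIO  3.0  /* assumed fast-to-slow speed ratio */

/* y := y + alpha*x on rows [lo, hi) -- a stand-in for a real BLAS kernel */
static void axpy_range(int lo, int hi, double alpha,
                       const double *x, double *y) {
    for (int i = lo; i < hi; ++i)
        y[i] += alpha * x[i];
}

void asym_axpy(int n, double alpha, const double *x, double *y) {
    /* Coarse grain: split the n iterations between the two core types,
     * proportionally to their aggregate speed. */
    double fast_share = (N_FAST * RATIO) / (N_FAST * RATIO + N_SLOW);
    int split = (int)(n * fast_share);

    /* Assumes threads 0..3 are pinned to the fast cluster and threads
     * 4..7 to the slow one (e.g., via OMP_PLACES). */
    #pragma omp parallel num_threads(N_FAST + N_SLOW)
    {
        int t = omp_get_thread_num();
        if (t < N_FAST) {
            /* Fine grain: static partition of the fast share. */
            int chunk = (split + N_FAST - 1) / N_FAST;
            int lo = t * chunk;
            int hi = lo + chunk < split ? lo + chunk : split;
            axpy_range(lo, hi, alpha, x, y);
        } else {
            /* Fine grain: static partition of the slow share. */
            int s = t - N_FAST, rest = n - split;
            int chunk = (rest + N_SLOW - 1) / N_SLOW;
            int lo = split + s * chunk;
            int hi = lo + chunk < n ? lo + chunk : n;
            axpy_range(lo, hi, alpha, x, y);
        }
    }
}
```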

    Static scheduling of the LU factorization with look-ahead on asymmetric multicore processors

    [EN] We analyze the benefits of look-ahead in the parallel execution of the LU factorization with partial pivoting (LUpp) in two distinct "asymmetric" multicore scenarios. The first one corresponds to an actual hardware-asymmetric architecture such as the Samsung Exynos 5422 system-on-chip (SoC), equipped with an ARM big.LITTLE processor consisting of a quad-core Cortex-A15 cluster plus a quad-core Cortex-A7 cluster. For this scenario, we propose a careful mapping of the different types of tasks appearing in LUpp to the computational resources, in order to produce an efficient architecture-aware exploitation of the resources integrated in this SoC. The second asymmetric configuration appears in a hardware-symmetric multicore architecture where the cores can individually operate at different frequency levels. In this scenario, we show how to employ the frequency slack to accelerate the tasks in the critical path of LUpp, in order to produce a faster global execution as well as a lower energy consumption. (C) 2018 Elsevier B.V. All rights reserved.

    The researchers from Universidad Jaume I were supported by projects TIN2014-53495-R and TIN2017-82972-R of MINECO and FEDER, and the FPU program of MECD. The researcher from Universitat Politècnica de Catalunya was supported by projects TIN2015-65316-P of MINECO and FEDER and 2017-SGR-1414 from the Generalitat de Catalunya.

    Catalán, S.; Herrero, JR.; Quintana Ortí, ES.; Rodríguez-Sánchez, R. (2018). Static scheduling of the LU factorization with look-ahead on asymmetric multicore processors. Parallel Computing. 76:18-27. https://doi.org/10.1016/j.parco.2018.04.006
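    The static schedule for one iteration can be sketched as follows. This is an illustration, not the paper's code: the critical-path panel task of step k+1 is assigned to the fast cluster, while the remaining threads share the trailing update of step k cyclically. The kernels are empty stand-ins, and the sketch assumes threads are pinned so that thread 0 runs on the Cortex-A15 cluster.

```c
#include <omp.h>

/* Stand-ins for the real LUpp kernels; bodies elided. */
static void factor_panel(double *A, int k)        { (void)A; (void)k; }
static void update_block(double *A, int k, int j) { (void)A; (void)k; (void)j; }

/* One iteration of blocked LU with partial pivoting and look-ahead. */
void lupp_step(double *A, int k, int nblocks) {
    /* 8 threads; thread 0 is assumed to be pinned to the fast cluster. */
    #pragma omp parallel num_threads(8)
    {
        int t = omp_get_thread_num();
        if (t == 0) {
            /* Critical path: update the column block of panel k+1, then
             * factorize it ahead of time (look-ahead) on a fast core. In
             * the paper the whole fast cluster cooperates on this task. */
            update_block(A, k, k + 1);
            factor_panel(A, k + 1);
        } else {
            /* Remaining threads share the rest of the trailing update. */
            for (int j = k + 2 + (t - 1); j < nblocks; j += 7)
                update_block(A, k, j);
        }
    } /* implicit barrier: step k complete and panel k+1 factorized */
}
```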

    Programming parallel dense matrix factorizations with look-ahead and OpenMP

    [EN] We investigate a parallelization strategy for dense matrix factorization (DMF) algorithms, using OpenMP, that departs from the legacy (or conventional) solution, which simply extracts concurrency from a multi-threaded version of the basic linear algebra subroutines (BLAS). The proposed approach is also different from the more sophisticated runtime-based implementations, which decompose the operation into tasks and identify dependencies via directives and runtime support. Instead, our strategy attains high performance by explicitly embedding a static look-ahead technique into the DMF code, in order to overcome the performance bottleneck of the panel factorization, and by realizing the trailing update via a cache-aware multi-threaded implementation of the BLAS. Although the parallel algorithms are specified at a high level of abstraction, the actual implementation can be easily derived from them, paving the way to a high-performance implementation of a considerable fraction of the linear algebra package (LAPACK) functionality on any multicore platform with an OpenMP-like runtime.

    The researchers from Universidad Jaume I were supported by the CICYT projects TIN2014-53495-R and TIN2017-82972-R of MINECO and FEDER, and by the H2020 EU FETHPC project 671602 "INTERTWinE". The researchers from Universidad Complutense de Madrid were supported by the CICYT project TIN2015-65277-R of MINECO and FEDER. Sandra Catalán was supported during part of this time by the FPU program of the Ministerio de Educación, Cultura y Deporte. Adrián Castelló was supported by the ValI+D 2015 FPI program of the Generalitat Valenciana.

    Catalán, S.; Castelló, A.; Igual, FD.; Rodríguez-Sánchez, R.; Quintana Ortí, ES. (2020). Programming parallel dense matrix factorizations with look-ahead and OpenMP. Cluster Computing. 23(1):359-375. https://doi.org/10.1007/s10586-019-02927-z
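    The explicit two-team structure can be sketched with nested OpenMP parallelism, again as an assumption-laden illustration rather than the paper's code: one outer thread runs the look-ahead panel task while the other drives the trailing update through a multi-threaded (in the paper, cache-aware) BLAS call. Kernel names and team sizes are invented for the sketch.

```c
#include <omp.h>

/* Stand-ins for the panel factorization and the cache-aware,
 * multi-threaded trailing update (a BLAS call in the paper). */
static void panel_fact(double *A, int k)               { (void)A; (void)k; }
static void trailing_update(double *A, int k, int nth) { (void)A; (void)k; (void)nth; }

/* One iteration with a static look-ahead embedded in the code: team 0
 * factorizes panel k+1 while team 1 updates the trailing submatrix. */
void dmf_lookahead_step(double *A, int k) {
    omp_set_max_active_levels(2);   /* allow nested parallel regions */
    #pragma omp parallel num_threads(2)
    {
        if (omp_get_thread_num() == 0)
            panel_fact(A, k + 1);   /* look-ahead branch */
        else
            /* The update team uses all remaining cores internally. */
            trailing_update(A, k, omp_get_max_threads() - 1);
    }
}
```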

    A case for malleable thread-level linear algebra libraries: The LU factorization with partial pivoting

    (c) 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works for resale or redistribution to servers or lists, or reuse of any copyrighted components of this work in other works.

    [EN] We propose two novel techniques for overcoming the load imbalance encountered when implementing so-called look-ahead mechanisms in relevant dense matrix factorizations for the solution of linear systems. Both techniques target the scenario where two thread teams are created/activated during the factorization, with each team in charge of performing an independent task/branch of execution. The first technique promotes worker sharing (WS) between the two tasks, allowing the threads of the task that completes first to be reallocated for use by the costlier task. The second technique allows a fast task to alert the slower task of its completion, enforcing the early termination (ET) of the second task and a smooth transition of the factorization procedure into the next iteration. The two mechanisms are instantiated via a new malleable thread-level implementation of the basic linear algebra subprograms (BLAS), and their benefits are illustrated via an implementation of the LU factorization with partial pivoting enhanced with look-ahead. Concretely, our experimental results on an Intel Xeon system with 12 cores show the benefits of combining WS+ET, reporting competitive performance in comparison with a task-parallel runtime-based solution.

    This work was supported in part by the Spanish Ministerio de Economía y Competitividad under Project TIN2014-53495-R, Project TIN2015-65316-P, and Project TIN2017-82972-R; in part by the H2020 EU FETHPC project "INTERTWinE" under Project 671602; in part by the Generalitat de Catalunya under Project 2017-SGR-1414; and in part by the NSF under Grant ACI-1550493.

    Catalán, S.; Herrero, JR.; Quintana Ortí, ES.; Rodríguez-Sánchez, R.; Van De Geijn, R. (2019). A case for malleable thread-level linear algebra libraries: The LU factorization with partial pivoting. IEEE Access. 7:17617-17633. https://doi.org/10.1109/ACCESS.2019.2895541
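    The early-termination (ET) mechanism can be pictured with a shared flag; the sketch below is a stand-alone illustration and does not reproduce the malleable BLAS in which the paper embeds both mechanisms. The fast task raises the flag on completion, and the slower task checks it between blocks, stopping at the next safe point so that the deferred work folds into the following iteration. All names are hypothetical.

```c
#include <stdatomic.h>

static atomic_int panel_done = 0;

/* Stand-in for the look-ahead panel factorization (the "fast" task). */
static void panel_task(double *A, int k) {
    (void)A; (void)k;
    /* ... factorize panel k+1 ... */
    atomic_store(&panel_done, 1);   /* alert the slower task (ET) */
}

/* Stand-in for the trailing update (potentially the "slow" task). */
static void update_task(double *A, int k, int nblocks) {
    for (int j = k + 1; j < nblocks; ++j) {
        (void)A;
        /* ... update block j ... */
        if (atomic_load(&panel_done))
            break;  /* ET: stop at a safe point; remaining blocks are
                       merged into the trailing update of iteration k+1 */
    }
}

int main(void) {
    double A[1] = {0.0};
    /* Run the two branches on two concurrent teams. */
    #pragma omp parallel sections num_threads(2)
    {
        #pragma omp section
        panel_task(A, 0);
        #pragma omp section
        update_task(A, 0, 64);
    }
    return 0;
}
```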

    Two-sided orthogonal reductions to condensed forms on asymmetric multicore processors

    [EN] We investigate how to leverage the heterogeneous resources of an Asymmetric Multicore Processor (AMP) in order to deliver high performance in the reduction to condensed forms for the solution of dense eigenvalue and singular-value problems. The routines that realize this type of two-sided orthogonal reductions (TSOR) in LAPACK are especially challenging, since a significant fraction of their floating-point operations are cast in terms of memory-bound kernels while the remaining part corresponds to efficient compute-bound kernels. To deal with this scenario: (1) we leverage implementations of memory-bound and compute-bound kernels specifically tuned for AMPs; (2) we select the algorithmic block size for the TSOR routines via a practical model; and (3) we adjust the type and number of cores to use at each step of the reduction. Our experiments validate the model and assess the performance of our asymmetry-aware TSOR routines, using an ARMv7 big.LITTLE AMP, for three key operations: the reduction to tridiagonal form for symmetric eigenvalue problems, the reduction to Hessenberg form for non-symmetric eigenvalue problems, and the reduction to bidiagonal form for singular-value problems.

    The researchers from Universidad Jaume I were supported by project TIN2014-53495-R of MINECO and FEDER, and the FPU program of MECD. The researcher from Universitat Politècnica de València was supported by project PROMETEOII/2014/003 of the Generalitat Valenciana. The researcher from Universitat Politècnica de Catalunya was supported by projects TIN2015-65316-P from the Spanish Ministry of Education and 2014 SGR 1051 from the Generalitat de Catalunya, Dep. d'Innovació, Universitats i Empresa.

    Alonso-Jordá, P.; Catalán, S.; Herrero, JR.; Quintana-Ortí, ES.; Rodríguez-Sánchez, R. (2018). Two-sided orthogonal reductions to condensed forms on asymmetric multicore processors. Parallel Computing. 78:85-100. https://doi.org/10.1016/j.parco.2018.03.005
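    Point (3) above, adjusting the core configuration as the reduction proceeds, can be conveyed with a toy decision rule. The thresholds and configurations below are invented for illustration and are not the paper's tuned model; the idea is only that a shrinking, increasingly memory-bound trailing matrix no longer amortizes the slow cluster, or eventually any parallelism at all.

```c
/* Core configuration for one step of the reduction. */
typedef struct { int n_fast, n_slow; } core_config;

/* Hypothetical rule: pick the cluster mix from the trailing-matrix size. */
core_config pick_cores(int trailing_n) {
    core_config c;
    if (trailing_n >= 2048) {
        c.n_fast = 4; c.n_slow = 4;   /* large: use both clusters */
    } else if (trailing_n >= 512) {
        c.n_fast = 4; c.n_slow = 0;   /* medium: fast cluster only */
    } else {
        c.n_fast = 1; c.n_slow = 0;   /* small: parallel overhead dominates */
    }
    return c;
}
```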

    Programming parallel dense matrix factorizations and inversion for new-generation NUMA architectures

    We propose a methodology to address the programmability issues derived from the emergence of new-generation shared-memory NUMA architectures. For this purpose, we employ dense matrix factorizations and matrix inversion (DMFI) as a use case, and we target two modern architectures (AMD Rome and Huawei Kunpeng 920) that exhibit configurable NUMA topologies. Our methodology pursues performance portability across different NUMA configurations by proposing multi-domain implementations for DMFI plus a hybrid task- and loop-level parallelization that configures multi-threaded executions to fix core-to-data binding, exploiting locality at the expense of minor code modifications. In addition, we introduce a generalization of the multi-domain implementations for DMFI that offers support for virtually any NUMA topology in present and future architectures. Our experimentation on the two target architectures for three representative dense linear algebra operations validates the proposal, reveals insights on the necessity of adapting both the codes and their execution to improve data access locality, and reports performance across architectures and inter- and intra-socket NUMA configurations competitive with state-of-the-art message-passing implementations, maintaining the ease of development usually associated with shared-memory programming.

    This research was sponsored by project PID2019-107255GB of Ministerio de Ciencia, Innovación y Universidades; project S2018/TCS-4423 of Comunidad de Madrid; project 2017-SGR-1414 of the Generalitat de Catalunya; and the Madrid Government under the Multiannual Agreement with UCM in the line Program to Stimulate Research for Young Doctors in the context of the V PRICIT, project PR65/19-22445. This project has also received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 955558. The JU receives support from the European Union's Horizon 2020 research and innovation programme, and Spain, Germany, France, Italy, Poland, Switzerland, and Norway. The work is also supported by grants PID2020-113656RB-C22 and PID2021-126576NB-I00 of MCIN/AEI/10.13039/501100011033 and by ERDF A way of making Europe.

    Peer reviewed. Postprint (published version).
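    The core-to-data binding that the methodology relies on can be sketched with OpenMP thread pinning plus first-touch page placement; the domain count, layout, and function name below are illustrative assumptions, not the paper's configuration.

```c
#include <omp.h>

/* Each NUMA domain initializes -- and therefore owns, under the usual
 * first-touch policy -- the block of the array it will later update. */
void first_touch_init(double *A, long n, int ndomains) {
    #pragma omp parallel num_threads(ndomains) proc_bind(spread)
    {
        int d = omp_get_thread_num();
        long chunk = (n + ndomains - 1) / ndomains;
        long lo = d * chunk;
        long hi = lo + chunk < n ? lo + chunk : n;
        for (long i = lo; i < hi; ++i)
            A[i] = 0.0;   /* pages land in domain d's local memory */
    }
}

/* Run with threads spread across domains, e.g.:
 *   OMP_PLACES=sockets OMP_PROC_BIND=spread ./app
 * (or OMP_PLACES=numa_domains with an OpenMP 5.1 runtime). */
```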

    Graphs, Matrices, and the GraphBLAS: Seven Good Reasons

    The analysis of graphs has become increasingly important to a wide range of applications. Graph analysis presents a number of unique challenges in the areas of (1) software complexity, (2) data complexity, (3) security, (4) mathematical complexity, (5) theoretical analysis, (6) serial performance, and (7) parallel performance. Implementing graph algorithms using matrix-based approaches provides a number of promising solutions to these challenges. The GraphBLAS standard (istc-bigdata.org/GraphBlas) is being developed to bring the potential of matrix-based graph algorithms to the broadest possible audience. The GraphBLAS mathematically defines a core set of matrix-based graph operations that can be used to implement a wide class of graph algorithms in a wide range of programming environments. This paper provides an introduction to the GraphBLAS and describes how the GraphBLAS can be used to address many of the challenges associated with analysis of graphs.

    Comment: 10 pages; International Conference on Computational Science workshop on the Applications of Matrix Computational Methods in the Analysis of Modern Data
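    The matrix view of graph algorithms that the GraphBLAS standardizes can be made concrete with a toy example: one BFS level expansion is a matrix-vector product over the Boolean (OR, AND) semiring. The snippet below uses a dense adjacency matrix and plain C rather than the GraphBLAS API, so it illustrates the idea only.

```c
#include <stdio.h>

#define N 4

/* frontier_next = A^T * frontier over the (OR, AND) semiring:
 * vertex j is reached if some frontier vertex i has an edge i -> j. */
void bfs_level(const int A[N][N], const int frontier[N], int next[N]) {
    for (int j = 0; j < N; ++j) {
        next[j] = 0;
        for (int i = 0; i < N; ++i)
            next[j] |= frontier[i] & A[i][j];
    }
}

int main(void) {
    /* Edges: 0->1, 0->2, 1->3, 2->3 */
    int A[N][N] = { {0,1,1,0}, {0,0,0,1}, {0,0,0,1}, {0,0,0,0} };
    int f[N] = {1,0,0,0}, nf[N];   /* start BFS at vertex 0 */
    bfs_level(A, f, nf);
    for (int j = 0; j < N; ++j)
        printf("%d ", nf[j]);      /* prints: 0 1 1 0 */
    printf("\n");
    return 0;
}
```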