51 research outputs found

    Empirical Installation of Linear Algebra Shared-Memory Subroutines for Auto-Tuning

    The introduction of auto-tuning techniques in linear algebra shared-memory routines is analyzed. Information obtained during the installation of the routines is used at run time to make decisions that reduce the total execution time. The study is carried out with routines at different levels (matrix multiplication, LU and Cholesky factorizations, and symmetric or general linear system routines) and with calls to routines in the LAPACK and PLASMA libraries with multithreaded implementations. Medium NUMA and large cc-NUMA systems are used in the experiments. This variety of routines, libraries and systems allows us to draw general conclusions about the methodology to use for auto-tuning linear algebra shared-memory routines. Satisfactory execution times are obtained with the proposed methodology.
    The final publication is available at Springer via http://dx.doi.org/10.1007/s10766-013-0249-6
    Partially supported by Fundacion Seneca, Consejeria de Educacion de la Region de Murcia, 08763/PI/08, PROMETEO/2009/013 from Generalitat Valenciana, the Spanish Ministry of Education and Science through TIN2012-38341-C04-03, and the High-Performance Computing Network on Parallel Heterogeneous Architectures (CAPAP-H). The authors gratefully acknowledge the computer resources and assistance provided by the Supercomputing Centre of the Scientific Park Foundation of Murcia and by the Centre de Supercomputacio de Catalunya.
    Cámara, J.; Cuenca, J.; Giménez, D.; García, LP.; Vidal Maciá, AM. (2014). Empirical Installation of Linear Algebra Shared-Memory Subroutines for Auto-Tuning. International Journal of Parallel Programming. 42(3):408-434. https://doi.org/10.1007/s10766-013-0249-6

    Hard and Soft Error Resilience for One-sided Dense Linear Algebra Algorithms

    Dense matrix factorizations, such as LU, Cholesky and QR, are widely used by scientific applications that require solving systems of linear equations, eigenvalue problems and linear least squares problems. Such computations are normally carried out on supercomputers, whose ever-growing scale induces a fast decline of the Mean Time To Failure (MTTF). This dissertation develops fault tolerance algorithms for one-sided dense matrix factorizations that handle both hard and soft errors. For hard errors, we propose methods based on diskless checkpointing and Algorithm Based Fault Tolerance (ABFT) to provide full matrix protection, including the left and right factors that are normally seen in dense matrix factorizations. A horizontal parallel diskless checkpointing scheme is devised to maintain the checkpoint data with scalable performance and low space overhead, while the ABFT checksum, which is generated before the factorization, is continually updated by the factorization operations to protect the right factor. In addition, for environments without a fault-tolerant MPI, we have integrated the Checkpoint-on-Failure (CoF) mechanism into one-sided dense linear operations such as QR factorization to recover the running stack of the failed MPI process. Soft errors are more challenging because silent data corruption leads to a large area of erroneous data through error propagation. Full matrix protection is developed in which the left factor is protected by column-wise local diskless checkpointing, and the right factor is protected by a combination of a floating-point weighted checksum scheme and a soft error modeling technique. To allow practical use on large-scale systems, we have also developed a complexity reduction scheme such that correct computing results can be recovered with low performance overhead. Experimental results on a large-scale cluster system and a multicore+GPGPU hybrid system have confirmed that our hard and soft error fault tolerance algorithms exhibit the expected error-correcting capability, low space and performance overhead, and compatibility with double-precision floating-point operation.
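
    The idea of an ABFT checksum that is created before the factorization and then updated by the factorization operations themselves can be illustrated with a minimal sketch. It is not the dissertation's algorithm: it assumes a small dense matrix, LU elimination without pivoting, and a plain (unweighted) row-sum checksum, and the names `lu_with_checksum` and `recover_entry` are hypothetical.

```python
def lu_with_checksum(a):
    """Gaussian elimination without pivoting on the augmented matrix [A | A*e].

    Returns the upper-triangular factor U and the updated checksum column.
    Assumption: all pivots are nonzero (no pivoting is performed).
    """
    n = len(a)
    # Append the row-sum checksum column before the factorization starts.
    w = [row[:] + [sum(row)] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            factor = w[i][k] / w[k][k]
            # The same row operation updates the data and the checksum column.
            for j in range(k, n + 1):
                w[i][j] -= factor * w[k][j]
    u = [row[:n] for row in w]
    checksum = [row[n] for row in w]
    return u, checksum


def recover_entry(u, checksum, i, j):
    """Rebuild a lost U[i][j] from the row checksum and the surviving entries."""
    return checksum[i] - sum(u[i][c] for c in range(len(u)) if c != j)


A = [[4.0, 3.0, 2.0], [2.0, 1.0, 3.0], [3.0, 4.0, 1.0]]
U, CHK = lu_with_checksum(A)
```

    After elimination, each checksum entry equals the sum of the corresponding row of U, so a single lost entry in a row can be rebuilt from the checksum and the remaining entries.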

    Performance Modeling and Prediction for Dense Linear Algebra

    This dissertation introduces measurement-based performance modeling and prediction techniques for dense linear algebra algorithms. As a core principle, these techniques avoid executions of such algorithms entirely, and instead predict their performance through runtime estimates for the underlying compute kernels. For a variety of operations, these predictions make it possible to quickly select the fastest algorithm configurations from the available alternatives. We consider two scenarios that cover a wide range of computations. To predict the performance of blocked algorithms, we design algorithm-independent performance models for kernel operations that are generated automatically once per platform. For various matrix operations, instantaneous predictions based on such models both accurately identify the fastest algorithm and select a near-optimal block size. For performance predictions of BLAS-based tensor contractions, we propose cache-aware micro-benchmarks that take advantage of the highly regular structure inherent to contraction algorithms. At merely a fraction of a contraction's runtime, predictions based on such micro-benchmarks identify the fastest combination of tensor traversal and compute kernel.

    Generating and auto-tuning parallel stencil codes

    In this thesis, we present a software framework, Patus, which generates high performance stencil codes for different types of hardware platforms, including current multicore CPU and graphics processing unit architectures. The ultimate goals of the framework are productivity, portability (of both the code and performance), and high performance on the target platform. A stencil computation updates every grid point in a structured grid based on the values of its neighboring points. This class of computations occurs frequently in scientific and general purpose computing (e.g., in partial differential equation solvers or in image processing), justifying the focus on this kind of computation. The proposed key ingredients to achieve the goals of productivity, portability, and performance are domain specific languages (DSLs) and the auto-tuning methodology. The Patus stencil specification DSL allows the programmer to express a stencil computation concisely and independently of hardware architecture-specific details. Thus, it increases programmer productivity by relieving her or him of low level programming model issues and of manually applying hardware platform-specific code optimization techniques. The use of domain specific languages also implies code reusability: once implemented, the same stencil specification can be reused on different hardware platforms, i.e., the specification code is portable across hardware architectures. Constructing the language to be geared towards a special purpose makes it amenable to more aggressive optimizations and therefore to potentially higher performance. Auto-tuning provides performance and performance portability by automated adaptation of implementation-specific parameters to the characteristics of the hardware on which the code will run. By automating the process of parameter tuning (which essentially amounts to solving an integer programming problem in which the objective function is the number representing the code's performance as a function of the parameter configuration), the system can also be used more productively than if the programmer had to fine-tune the code manually. We show performance results for a variety of stencils, for which Patus was used to generate the corresponding implementations. The selection includes stencils taken from two real-world applications: a simulation of the temperature within the human body during hyperthermia cancer treatment and a seismic application. These examples demonstrate the framework's flexibility and ability to produce high performance code.
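
    To make the two central notions concrete, the following sketch shows a 5-point stencil sweep traversed in tiles, and an exhaustive empirical search over an integer tile-size parameter whose objective is measured runtime. It is an illustration only and does not use Patus or its DSL; the functions `jacobi_step` and `autotune`, the grid contents, and the candidate tile sizes are all assumptions made for this example.

```python
import time


def jacobi_step(grid, tile):
    """One 5-point stencil sweep over interior points, visited in tile x tile blocks."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]  # boundary values are copied unchanged
    for bi in range(1, n - 1, tile):
        for bj in range(1, m - 1, tile):
            for i in range(bi, min(bi + tile, n - 1)):
                for j in range(bj, min(bj + tile, m - 1)):
                    # Each interior point becomes the average of its four neighbors.
                    out[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                        + grid[i][j - 1] + grid[i][j + 1])
    return out


def autotune(grid, candidates):
    """Exhaustive empirical search: time each tile size and keep the fastest."""
    timings = {}
    for tile in candidates:
        start = time.perf_counter()
        jacobi_step(grid, tile)
        timings[tile] = time.perf_counter() - start
    return min(timings, key=timings.get), timings


GRID = [[float(i * j % 7) for j in range(64)] for i in range(64)]
CANDIDATES = [4, 8, 16, 32]
BEST_TILE, TIMINGS = autotune(GRID, CANDIDATES)
```

    The tile size changes only the traversal order, not the result, which is what makes it a safe tuning parameter. A single timing per candidate is noisy; a more careful search would repeat each measurement, and larger parameter spaces usually call for a search heuristic rather than exhaustive enumeration.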

    A language and a system for program optimization

    Hardware complexity has increased over time, and as architectures evolve and new ones are adopted, programs must often be altered by numerous optimizations to attain maximum computing power on each target environment. As a result, the code becomes unrecognizable over time, hard to maintain, and challenging to modify. Furthermore, as the code evolves, it is hard to keep the optimizations up to date. The need to develop and maintain separate versions of the application for each target platform is an immense undertaking, especially for the large and long-lived applications commonly found in the high-performance computing (HPC) community. This dissertation presents Locus, a new system and language for optimizing complex, long-lived applications for different platforms. We describe the requirements that we believe are necessary for making automatic performance tuning widely adopted. We present the design and implementation of a system that fulfills these requirements. It includes a domain-specific language that can represent complex collections of transformations, an interface to integrate external modules, and a database to manage platform-specific efficient code. The database allows the system’s users to access optimized code without having to install the code generation toolset. The Locus language allows the definition of a search space combined with the programming of optimization sequences separated from the application’s reference code. Overall, we present an approach to performance portability. Our thesis is that we can ameliorate the difficulty of optimizing applications using a methodology based on optimization programming and automated empirical search. Our system automatically selects, generates, and executes candidate implementations to find the one with the best performance. We present examples to illustrate the power and simplicity of the language. The experimental evaluation shows that exploring the space of candidate implementations typically leads to better performing codes than those produced by conventional compiler optimizations that are based solely on heuristics. Locus was able to generate a matrix-matrix multiplication code that outperformed the IBM XLC internal hand-optimized version by 2× on the Power 9 processors. On Intel E5, Locus generates code with performance comparable to Intel MKL’s. We also improve performance by up to 4× relative to the reference implementation on stencil computations. Locus's ability to integrate complex search spaces with optimization sequences can result in very complicated optimization programs. The Locus compiler applies optimizations that remove unnecessary search statements from the optimization sequences, making the exploration for faster implementations more accessible. We optimize matrix transpose, matrix-matrix multiplication, fast Fourier transform, symmetric eigenproblem, and sparse matrix-vector multiplication through divide and conquer. We implement three strategies using the Locus language to create search spaces to find the best shapes of the base case and the best ways of subdividing the problem. The search space representation for the divide-and-conquer strategy uses a combination of recursion and OR blocks. The Locus compiler automatically expands the recursion and ensures that the search space is correctly represented. The results showed that the empirical search was important to improve performance by generating faster base cases and finding the best splitting. We also use Locus to optimize large, complex applications. We match the performance of hand-optimized kernels of the Kripke transport code for different input data layouts. The Plascom2 multi-physics application is optimized to find the best way to use a multi-core CPU and GPU. The use of Tangram, Hydra, and OpenMP provided an interesting search space that improved performance by approximately 4.3× on ZAXPY and ZXDOTY kernels. Lastly, in a similar fashion to how a compiler works, we applied a search space representing a collection of optimization sequences to 856 loops extracted from 16 benchmarks, which resulted in good performance improvements.

    Software for Exascale Computing - SPPEXA 2016-2019

    This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG) presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer’s series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA’s first funding phase, and provides an overview of SPPEXA’s contributions towards exascale computing in today's supercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest.

    Asynchronous and Multiprecision Linear Solvers - Scalable and Fault-Tolerant Numerics for Energy Efficient High Performance Computing

    Asynchronous methods minimize idle times by removing synchronization barriers, and therefore allow the efficient usage of computer systems. The implied high tolerance with respect to communication latencies improves the fault tolerance. As asynchronous methods also enable the usage of the power and energy saving mechanisms provided by the hardware, they are suitable candidates for the highly parallel and heterogeneous hardware platforms that are expected for the near future

    Vision 2040: A Roadmap for Integrated, Multiscale Modeling and Simulation of Materials and Systems

    Over the last few decades, advances in high-performance computing, new materials characterization methods, and, more recently, an emphasis on integrated computational materials engineering (ICME) and additive manufacturing have been a catalyst for multiscale modeling and simulation-based design of materials and structures in the aerospace industry. While these advances have driven significant progress in the development of aerospace components and systems, that progress has been limited by persistent technology and infrastructure challenges that must be overcome to realize the full potential of integrated materials and systems design and simulation modeling throughout the supply chain. As a result, NASA's Transformational Tools and Technology (TTT) Project sponsored a study (performed by a diverse team led by Pratt & Whitney) to define the potential 25-year future state required for integrated multiscale modeling of materials and systems (e.g., load-bearing structures) to accelerate the pace and reduce the expense of innovation in future aerospace and aeronautical systems. This report describes the findings of this 2040 Vision study (e.g., the 2040 vision state; the required interdependent core technical work areas, Key Element (KE); identified gaps and actions to close those gaps; and major recommendations). It constitutes a community consensus document, as it is the result of input from over 450 professionals obtained via: 1) four society workshops (AIAA, NAFEMS, and two TMS), 2) a community-wide survey, and 3) the establishment of 9 expert panels (one per KE), each consisting on average of 10 non-team members from academia, government and industry, to review and update content and to prioritize gaps and actions. The study envisions the development of a cyber-physical-social ecosystem comprised of experimentally verified and validated computational models, tools, and techniques, along with the associated digital tapestry, that impacts the entire supply chain to enable cost-effective, rapid, and revolutionary design of fit-for-purpose materials, components, and systems. Although the vision focused on aeronautics and space applications, it is believed that other engineering communities (e.g., automotive, biomedical, etc.) can benefit as well from the proposed framework with only minor modifications. Finally, it is TTT's hope and desire that this vision provides the strategic guidance to both public and private research and development decision makers to make the proposed 2040 vision state a reality and thereby provide a significant advancement in the United States' global competitiveness.

    A reference model for integrated energy and power management of HPC systems

    Optimizing a computer for highest performance dictates the efficient use of its limited resources. Computers as a whole are rather complex. Therefore, it is not sufficient to consider optimizing hardware and software components independently. Instead, a holistic view to manage the interactions of all components is essential to achieve system-wide efficiency. For High Performance Computing (HPC) systems, today, the major limiting resources are energy and power. The hardware mechanisms to measure and control energy and power are exposed to software. The software systems using these mechanisms range from firmware, operating system, and system software to tools and applications. Efforts to improve the energy and power efficiency of HPC systems and the infrastructure of HPC centers make continual advances. In isolation, these efforts are unable to cope with the rising energy and power demands of large scale systems. A systematic way to integrate multiple optimization strategies, which build on complementary, interacting hardware and software systems, is missing. This work provides a reference model for integrated energy and power management of HPC systems: the Open Integrated Energy and Power (OIEP) reference model. The goal is to enable the implementation, setup, and maintenance of modular system-wide energy and power management solutions. The proposed model goes beyond current practices, which focus on individual HPC centers or implementations, in that it allows any hierarchical energy and power management system with a multitude of requirements to be described universally. The model builds solid foundations to be understandable and verifiable, to guarantee stable interaction of hardware and software components, for a known and trusted chain of command. This work identifies the main building blocks of the OIEP reference model, describes their abstract setup, and shows concrete instances thereof. 
A principal aspect is how the individual components are connected and interface in a hierarchical manner, and thus can optimize for the global policy pursued as a computing center's operating strategy. In addition to the reference model itself, a method for applying the reference model is presented. This method is used to show the practicality of the reference model and its application. For future research in energy and power management of HPC systems, the OIEP reference model forms a cornerstone to realize (that is, plan, develop and integrate) innovative energy and power management solutions. For HPC systems themselves, it supports transparent management of current systems with their inherent complexity, it allows novel solutions to be integrated into existing setups, and it enables new systems to be designed from scratch. In fact, the OIEP reference model represents a basis for holistic efficient optimization.
    Optimizing computers for the highest possible computing performance requires maximizing the efficiency of all limiting resources. Computers are complex systems. It is therefore not sufficient to consider hardware and software in isolation. Instead, an overall view of the system is needed to organize the interactions of all individual components and to enable system-wide optimizations. For high-performance computing (HPC) systems, the limiting resource today is their power draw and the resulting total energy consumption. In current HPC systems, energy and power consumption can be read out programmatically and controlled both directly and indirectly. These mechanisms are used and continuously developed further in various software components, from firmware, operating system, and system software through to tools and applications. Because of the complexity of the interacting systems, systematic optimization of the overall system is difficult both to carry out and to trace. A methodical approach to integrating different optimization approaches that build on complementary, interacting hardware and software systems is missing. This work describes a reference model for integrated energy and power management of HPC systems, the Open Integrated Energy and Power (OIEP) reference model. The goal is a reference model that enables the development of modular, system-wide, energy- and power-optimizing software assemblies and describes them as a general hierarchical management system. This sets the model apart from previous approaches, which are limited to individual solutions, specific software, or the needs of individual computing centers. To this end, it describes the foundations for a plannable and verifiable overall system and allows traceable and secure delegation of energy and power management to subsystems while maintaining the chain of command. The work provides the foundations of the reference model. The individual components of the software assemblies are identified, and their abstract structure and concrete instantiations are shown. Particular attention is paid to the hierarchical structure and the resulting interactions of the components. The general description of the reference model allows the design of system architectures that can ultimately implement, in a holistic way, the maximization of energy efficiency with the given mechanisms. For this purpose, a procedure for the methodical application of the reference model is described, which enables the modeling of arbitrary energy and power management systems. For research in energy and power management for HPC, the OIEP reference model forms a cornerstone for planning, developing, and integrating innovative solutions. For the HPC systems themselves, it supports traceable management of the complex systems and offers the possibility of successfully integrating new procurements and developments. The OIEP reference model thus provides a foundation for holistic, efficient system optimization.