36 research outputs found

    Parallel accelerated cyclic reduction preconditioner for three-dimensional elliptic PDEs with variable coefficients

    Full text link
    We present a robust and scalable preconditioner for the solution of large-scale linear systems that arise from the discretization of elliptic PDEs amenable to rank compression. The preconditioner is based on hierarchical low-rank approximations and the cyclic reduction method. The setup and application phases of the preconditioner achieve log-linear complexity in memory footprint and number of operations, and numerical experiments exhibit good weak and strong scalability at large processor counts in a distributed memory environment. Numerical experiments with linear systems that feature symmetry and nonsymmetry, definiteness and indefiniteness, constant and variable coefficients demonstrate the preconditioner applicability and robustness. Furthermore, it is possible to control the number of iterations via the accuracy threshold of the hierarchical matrix approximations and their arithmetic operations, and the tuning of the admissibility condition parameter. Together, these parameters allow for optimization of the memory requirements and performance of the preconditioner.Comment: 24 pages, Elsevier Journal of Computational and Applied Mathematics, Dec 201

    Accelerating advanced preconditioning methods on hybrid architectures

    Get PDF
    Un gran número de problemas, en diversas áreas de la ciencia y la ingeniería, involucran la solución de sistemas dispersos de ecuaciones lineales de gran escala. En muchos de estos escenarios, son además un cuello de botella desde el punto de vista computacional, y por esa razón, su implementación eficiente ha motivado una cantidad enorme de trabajos científicos. Por muchos años, los métodos directos basados en el proceso de la Eliminación Gaussiana han sido la herramienta de referencia para resolver dichos sistemas, pero la dimensión de los problemas abordados actualmente impone serios desafíos a la mayoría de estos algoritmos, considerando sus requerimientos de memoria, su tiempo de cómputo y la complejidad de su implementación. Propulsados por los avances en las técnicas de precondicionado, los métodos iterativos se han vuelto más confiables, y por lo tanto emergen como alternativas a los métodos directos, ofreciendo soluciones de alta calidad a un menor costo computacional. Sin embargo, estos avances muchas veces son relativos a un problema específico, o dotan a los precondicionadores de una complejidad tal, que su aplicación en diversos problemas se vuelve poco práctica en términos de tiempo de ejecución y consumo de memoria. Como respuesta a esta situación, es común la utilización de estrategias de Computación de Alto Desempeño, ya que el desarrollo sostenido de las plataformas de hardware permite la ejecución simultánea de cada vez más operaciones. Un claro ejemplo de esta evolución son las plataformas compuestas por procesadores multi-núcleo y aceleradoras de hardware como las Unidades de Procesamiento Gráfico (GPU). Particularmente, las GPU se han convertido en poderosos procesadores paralelos, capaces de integrar miles de núcleos a precios y consumo energético razonables.Por estas razones, las GPU son ahora una plataforma de hardware de gran importancia para la ciencia y la ingeniería, y su uso eficiente es crucial para alcanzar un buen desempeño en la mayoría de las aplicaciones. Esta tesis se centra en el uso de GPUs para acelerar la solución de sistemas dispersos de ecuaciones lineales usando métodos iterativos precondicionados con técnicas modernas. En particular, se trabaja sobre ILUPACK, que ofrece implementaciones de los métodos iterativos más importantes, y presenta un interesante y moderno precondicionador de tipo ILU multinivel. En este trabajo, se desarrollan versiones del precondicionador y de los métodos incluidos en el paquete, capaces de explotar el paralelismo de datos mediante el uso de GPUs sin afectar las propiedades numéricas del precondicionador. Además, se habilita y analiza el uso de las GPU en versiones paralelas existentes, basadas en paralelismo de tareas para plataformas de memoria compartida y distribuida. Los resultados obtenidos muestran una sensible mejora en el tiempo de ejecución de los métodos abordados, así como la posibilidad de resolver problemas de gran escala de forma eficiente

    An efficient GPU version of the preconditioned GMRES method

    Full text link
    [EN] In a large number of scientific applications, the solution of sparse linear systems is the stage that concentrates most of the computational effort. This situation has motivated the study and development of several iterative solvers, among which preconditioned Krylov subspace methods occupy a place of privilege. In a previous effort, we developed a GPU-aware version of the GMRES method included in ILUPACK, a package of solvers distinguished by its inverse-based multilevel ILU preconditioner. In this work, we study the performance of our previous proposal and integrate several enhancements in order to mitigate its principal bottlenecks. The numerical evaluation shows that our novel proposal can reach important run-time reductions.Aliaga, JI.; Dufrechou, E.; Ezzatti, P.; Quintana-Orti, ES. (2019). An efficient GPU version of the preconditioned GMRES method. The Journal of Supercomputing. 75(3):1455-1469. https://doi.org/10.1007/s11227-018-2658-1S14551469753Aliaga JI, Badia RM, Barreda M, Bollhöfer M, Dufrechou E, Ezzatti P, Quintana-Ortí ES (2016) Exploiting task and data parallelism in ILUPACK’s preconditioned CG solver on NUMA architectures and many-core accelerators. Parallel Comput 54:97–107Aliaga JI, Bollhöfer M, Dufrechou E, Ezzatti P, Quintana-Ortí ES (2016) A data-parallel ILUPACK for sparse general and symmetric indefinite linear systems. In: Lecture Notes in Computer Science, 14th Int. Workshop on Algorithms, Models and Tools for Parallel Computing on Heterogeneous Platforms—HeteroPar’16. SpringerAliaga JI, Bollhöfer M, Martín AF, Quintana-Ortí ES (2011) Exploiting thread-level parallelism in the iterative solution of sparse linear systems. Parallel Comput 37(3):183–202Aliaga JI, Bollhöfer M, Martín AF, Quintana-Ortí ES (2012) Parallelization of multilevel ILU preconditioners on distributed-memory multiprocessors. Appl Parallel Sci Comput LNCS 7133:162–172Aliaga JI, Dufrechou E, Ezzatti P, Quintana-Ortí ES (2018) Accelerating a preconditioned GMRES method in massively parallel processors. In: CMMSE 2018: Proceedings of the 18th International Conference on Mathematical Methods in Science and Engineering (2018)Bollhöfer M, Grote MJ, Schenk O (2009) Algebraic multilevel preconditioner for the Helmholtz equation in heterogeneous media. SIAM J Sci Comput 31(5):3781–3805Bollhöfer M, Saad Y (2006) Multilevel preconditioners constructed from inverse-based ILUs. SIAM J Sci Comput 27(5):1627–1650Dufrechou E, Ezzatti P (2018) A new GPU algorithm to compute a level set-based analysis for the parallel solution of sparse triangular systems. In: 2018 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2018, Canada, 2018. IEEE Computer SocietyDufrechou E, Ezzatti P (2018) Solving sparse triangular linear systems in modern GPUs: a synchronization-free algorithm. In: 2018 26th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp 196–203. https://doi.org/10.1109/PDP2018.2018.00034Eijkhout V (1992) LAPACK working note 50: distributed sparse data structures for linear algebra operations. Tech. rep., Knoxville, TN, USAGolub GH, Van Loan CF (2013) Matrix computationsHe K, Tan SXD, Zhao H, Liu XX, Wang H, Shi G (2016) Parallel GMRES solver for fast analysis of large linear dynamic systems on GPU platforms. Integration 52:10–22 http://www.sciencedirect.com/science/article/pii/S016792601500084XLiu W, Li A, Hogg JD, Duff IS, Vinter B (2017) Fast synchronization-free algorithms for parallel sparse triangular solves with multiple right-hand sides. Concurr Comput 29(21)Saad Y (2003) Iterative methods for sparse linear systems, 2nd edn. SIAM, PhiladelphiaSchenk O, Wächter A, Weiser M (2008) Inertia revealing preconditioning for large-scale nonconvex constrained optimization. SIAM J Sci Comput 31(2):939–96

    New Sequential and Scalable Parallel Algorithms for Incomplete Factor Preconditioning

    Get PDF
    The solution of large, sparse, linear systems of equations Ax = b is an important kernel, and the dominant term with regard to execution time, in many applications in scientific computing. The large size of the systems of equations being solved currently (millions of unknowns and equations) requires iterative solvers on parallel computers. Preconditioning, which is the process of translating a linear system into a related system that is easier to solve, is widely used to reduce solution time and is sometimes required to ensure convergence. Level-based preconditioning (ILU(ℓ)) has long been used in serial contexts and is widely recognized as robust and effective for a wide range of problems. However, the method has long been regarded as an inherently sequential technique. Parallelism, it has been thought, can be achieved primarily at the expense of increased iterations. We dispute these claims. The first half of this dissertation takes an in-depth look at structurally based ILU(ℓ) symbolic factorization. There are two definitions of fill level, “sum” and “max,” that have been proposed. Hitherto, these definitions have been cast in terms of matrix terminology. We develop a sequence of lemmas and theorems that provide graph theoretic characterizations of both definitions; these characterizations are based on the static graph of a matrix, G(A). Our Incomplete Fill Path Theorem characterizes fill levels per the sum definition; this is the definition that is used in most library implementations of the “classic” ILU(ℓ) factorization algorithm. Our theorem leads to several new graph-search algorithms that compute factors identical, or nearly identical, to those computed by the “classic” algorithm. Our analyses shows that the new algorithms have lower run time complexity than that of the previously existing algorithms for certain classes of matrices that are commonly encountered in scientific applications. The second half of this dissertation presents a Parallel ILU algorithmic framework (PILU). This framework enables scalable parallel ILU preconditioning by combining concepts from domain decomposition and graph ordering. The framework can accommodate ILU(ℓ) factorization as well as threshold-based ILUT methods. A model implementation of the framework, the Euclid library, was developed as part of this dissertation. This library was used to obtain experimental results for Poisson\u27s equation, the Convection-Diffusion equation, and a nonlinear Radiative Transfer problem. The experiments, which were conducted on a variety of platforms with up to 400 CPUs, demonstrate that our approach is highly scalable for arbitrary ILU(ℓ) fill levels

    SCALABLE INTEGRATED CIRCUIT SIMULATION ALGORITHMS FOR ENERGY-EFFICIENT TERAFLOP HETEROGENEOUS PARALLEL COMPUTING PLATFORMS

    Get PDF
    Integrated circuit technology has gone through several decades of aggressive scaling.It is increasingly challenging to analyze growing design complexity. Post-layout SPICE simulation can be computationally prohibitive due to the huge amount of parasitic elements, which can easily boost the computation and memory cost. As the decrease in device size, the circuits become more vulnerable to process variations. Designers need to statistically simulate the probability that a circuit does not meet the performance metric, which requires millions times of simulations to capture rare failure events. Recent, multiprocessors with heterogeneous architecture have emerged as mainstream computing platforms. The heterogeneous computing platform can achieve highthroughput energy efficient computing. However, the application of such platform is not trivial and needs to reinvent existing algorithms to fully utilize the computing resources. This dissertation presents several new algorithms to address those aforementioned two significant and challenging issues on the heterogeneous platform. Harmonic Balance (HB) analysis is essential for efficient verification of large postlayout RF and microwave integrated circuits (ICs). However, existing methods either suffer from excessively long simulation time and prohibitively large memory consumption or exhibit poor stability. This dissertation introduces a novel transient-simulation guided graph sparsification technique, as well as an efficient runtime performance modeling approach tailored for heterogeneous manycore CPU-GPU computing system to build nearly-optimal subgraph preconditioners that can lead to minimum HB simulation runtime. Additionally, we propose a novel heterogeneous parallel sparse block matrix algorithm by taking advantages of the structure of HB Jacobian matrices as well as GPU’s streaming multiprocessors to achieve optimal workload balancing during the preconditioning phase of HB analysis. We also show how the proposed preconditioned iterative algorithm can efficiently adapt to heterogeneous computing systems with different CPU and GPU computing capabilities. Extensive experimental results show that our HB solver can achieve up to 20X speedups and 5X memory reduction when compared with the state-of-the-art direct solver highly optimized for twelve-core CPUs. In nowadays variation-aware IC designs, cell characterizations and SRAM memory yield analysis require many thousands or even millions of repeated SPICE simulations for relatively small nonlinear circuits. In this dissertation, for the first time, we present a massively parallel SPICE simulator on GPU, TinySPICE, for efficiently analyzing small nonlinear circuits. TinySPICE integrates a highly-optimized shared-memory based matrix solver and fast parametric three-dimensional (3D) LUTs based device evaluation method. A novel circuit clustering method is also proposed to improve the stability and efficiency of the matrix solver. Compared with CPU-based SPICE simulator, TinySPICE achieves up to 264X speedups for parametric SRAM yield analysis without loss of accuracy

    Performance and Energy Optimization of the Iterative Solution of Sparse Linear Systems on Multicore Processors

    Get PDF
    En esta tesis doctoral se aborda la solución de sistemas dispersos de ecuaciones lineales utilizando métodos iterativos precondicionados basados en subespacios de Krylov. En concreto, se centra en ILUPACK, una biblioteca que implementa precondicionadores de tipo ILU multinivel para la solución eficiente de sistemas lineales dispersos. El incremento en el número de ecuaciones, y la aparición de nuevas arquitecturas, motiva el desarrollo de una versión paralela de ILUPACK que optimice tanto el tiempo de ejecución como el consumo energético en arquitecturas multinúcleo actuales y en clusters de nodos construidos con esta tecnología. El objetivo principal de la tesis es el diseño, implementación y valuación de resolutores paralelos energéticamente eficientes para sistemas lineales dispersos orientados a procesadores multinúcleo así como aceleradores hardware como el Intel Xeon Phi. Para lograr este objetivo, se aprovecha el paralelismo de tareas mediante OmpSs y MPI, y se desarrolla un entorno automático para detectar ineficiencias energéticas.In this dissertation we target the solution of large sparse systems of linear equations using preconditioned iterative methods based on Krylov subspaces. Specifically, we focus on ILUPACK, a library that offers multi-level ILU preconditioners for the effective solution of sparse linear systems. The increase of the number of equations and the introduction of new HPC architectures motivates us to develop a parallel version of ILUPACK which optimizes both execution time and energy consumption on current multicore architectures and clusters of nodes built from this type of technology. Thus, the main goal of this thesis is the design, implementation and evaluation of parallel and energy-efficient iterative sparse linear system solvers for multicore processors as well as recent manycore accelerators such as the Intel Xeon Phi. To fulfill the general objective, we optimize ILUPACK exploiting task parallelism via OmpSs and MPI, and also develope an automatic framework to detect energy inefficiencies

    Assessing the role of mini-applications in predicting key performance characteristics of scientific and engineering applications

    Full text link
    Computational science and engineering application programs are typically large, complex, and dynamic, and are often constrained by distribution limitations. As a means of making tractable rapid explorations of scientific and engineering application programs in the context of new, emerging, and future computing architectures, a suite of "miniapps" has been created to serve as proxies for full scale applications. Each miniapp is designed to represent a key performance characteristic that does or is expected to significantly impact the runtime performance of an application program. In this paper we introduce a methodology for assessing the ability of these miniapps to effectively represent these performance issues. We applied this methodology to three miniapps, examining the linkage between them and an application they are intended to represent. Herein we evaluate the fidelity of that linkage. This work represents the initial steps required to begin to answer the question, "Under what conditions does a miniapp represent a key performance characteristic in a full app?

    Doctor of Philosophy

    Get PDF
    dissertationPartial differential equations (PDEs) are widely used in science and engineering to model phenomena such as sound, heat, and electrostatics. In many practical science and engineering applications, the solutions of PDEs require the tessellation of computational domains into unstructured meshes and entail computationally expensive and time-consuming processes. Therefore, efficient and fast PDE solving techniques on unstructured meshes are important in these applications. Relative to CPUs, the faster growth curves in the speed and greater power efficiency of the SIMD streaming processors, such as GPUs, have gained them an increasingly important role in the high-performance computing area. Combining suitable parallel algorithms and these streaming processors, we can develop very efficient numerical solvers of PDEs. The contributions of this dissertation are twofold: proposal of two general strategies to design efficient PDE solvers on GPUs and the specific applications of these strategies to solve different types of PDEs. Specifically, this dissertation consists of four parts. First, we describe the general strategies, the domain decomposition strategy and the hybrid gathering strategy. Next, we introduce a parallel algorithm for solving the eikonal equation on fully unstructured meshes efficiently. Third, we present the algorithms and data structures necessary to move the entire FEM pipeline to the GPU. Fourth, we propose a parallel algorithm for solving the levelset equation on fully unstructured 2D or 3D meshes or manifolds. This algorithm combines a narrowband scheme with domain decomposition for efficient levelset equation solving
    corecore