28 research outputs found

    Energy‐aware strategies for task‐parallel sparse linear system solvers

    Get PDF
    This is the pre-peer reviewed version of the following article: Energy‐aware strategies for task‐parallel sparse linear system solvers, which has been published in final form at https://doi.org/10.1002/cpe.4633. This article may be used for non-commercial purposes in accordance with Wiley Terms and Conditions for Use of Self-Archived Versions.We present several energy‐aware strategies to improve the energy efficiency of a task‐parallel preconditioned Conjugate Gradient (PCG) iterative solver on a Haswell‐EP Intel Xeon. These techniques leverage the power‐saving states of the processor, promoting the hardware into a more energy‐efficient C‐state and modifying the CPU frequency (P‐states of the processors) of some operations of the PCG. We demonstrate that the application of these strategies during the main operations of the iterative solver can reduce its energy consumption considerably, especially for memory‐bound computations

    Paralelización del calculo del numero pi utilizando librerías de precisión múltiple en un clúster de computadores

    Get PDF
    Ponència presentada en el XXXII Jornadas de Paralelismo (JP2022) y VI Jornadas de Computación Empotrada y Reconfigurable (JCER2022) SARTECO 2022La Computación de Altas Prestaciones (CAP), o High Performance Computing (HPC), es un campo de la Ingeniería Informática que tiene como principal objetivo extraer el mejor rendimiento de los recursos disponibles de un computador para la resolución de un problema informático. Alcanzar este objetivo requiere tener un conocimiento detallado de la arquitectura y del sistema operativo, así como de los algoritmos y los lenguajes de programación que permiten implementar códigos más eficientes. En este artículo se presenta la evaluación de dos librerías de precisión múltiple para CPU y cómo éstas se pueden utilizar en procesadores multinúcleo y en multicomputadores o clusters de procesadores multinúcleo para el cálculo de la constante numérica π

    Characterization of Multicore Architectures using Task-Parallel ILU-type Preconditioned CG Solvers

    Get PDF
    Ponència presentada al 2nd Workshop on Power-Aware Computing (PACO 2017) Ringberg Castle, Germany, July, 5-8 2017We investigate the eficiency of state-of-the-art multicore processors using a multi-threaded task-parallel implementation of the Conjugate Gradient (CG) method, accelerated with an incomplete LU (ILU) preconditioner. Concretely, we analyze multicore architectures with distinct designs and market targets to compare their parallel performance and energy eficiency

    Harvesting Energy in ILUPACK via Slack Elimination

    Get PDF
    Ponència presentada al 2nd Workshop on Power-Aware Computing (PACO 2017) Ringberg Castle, Germany, July, 5-8 2017We develop a new energy-aware methodology to improve the energy consumption of a task-parallel preconditioned Conjugate Gradient iter- ative solver on a Haswell-EP Intel Xeon. This technique leverages the power-saving modes of the processor and the frequency range of the userspace Linux governor, modifying the CPU frequency for some oper- ations. We demonstrate that its application during the main operations of the PCG solver can reduce its energy consumption

    iMODS: internal coordinates normal mode analysis server

    Get PDF
    Normal mode analysis (NMA) in internal (dihedral) coordinates naturally reproduces the collective functional motions of biological macromolecules. iMODS facilitates the exploration of such modes and generates feasible transition pathways between two homologous structures, even with large macromolecules. The distinctive internal coordinate formulation improves the efficiency of NMA and extends its applicability while implicitly maintaining stereochemistry. Vibrational analysis, motion animationś and morphing trajectories can be easily carried out at different resolution scales almost interactively. The server is versatile; non-specialists can rapidly characterize potential conformational changes, whereas advanced users can customize the model resolution with multiple coarse-grained atomic representations and elastic network potentials. iMODS supports advanced visualization capabilities for illustrating collective motions, including an improved affine-model-based arrow representation of domain dynamics. The generated all-heavy-atoms conformations can be used to introduce flexibility for more advanced modeling or sampling strategies.Human Frontier Science Program—RGP0039/2008, Ministerio de Economía y Competitividad—BFU2013-44306P and Comunidad de Madrid—CAM-S2010/BMD

    Iteration-fusing conjugate gradient for sparse linear systems with MPI + OmpSs

    Get PDF
    In this paper, we target the parallel solution of sparse linear systems via iterative Krylov subspace-based method enhanced with a block-Jacobi preconditioner on a cluster of multicore processors. In order to tackle large-scale problems, we develop task-parallel implementations of the preconditioned conjugate gradient method that improve the interoperability between the message-passing interface and OmpSs programming models. Specifically, we progressively integrate several communication-reduction and iteration-fusing strategies into the initial code, obtaining more efficient versions of the method. For all these implementations, we analyze the communication patterns and perform a comparative analysis of their performance and scalability on a cluster consisting of 32 nodes with 24 cores each. The experimental analysis shows that the techniques described in the paper outperform the classical method by a margin that varies between 6 and 48%, depending on the evaluation

    Dynamic spawning of MPI processes applied to malleability

    Get PDF
    Malleability allows computing facilities to adapt their workloads through resource management systems to maximize the throughput of the facility and the efficiency of the executed jobs. This technique is based on reconfiguring a job to a different resource amount during execution and then continuing with it. One of the stages of malleability is the dynamic spawning of processes in execution time, where different decisions in this stage will affect how the next stage of data redistribution is performed, which is the most time-consuming stage. This paper describes different methods and strategies, defining eight different alternatives to spawn processes dynamically and indicates which one should be used depending on whether a strong or weak scaling application is being used. In addition, it is described for both types of applications which strategies benefit most the application performance or the system productivity. The results show that reducing the number of spawning processes by reusing the older ones can reduce reconfiguration time compared to the classical method by up to 2.6 times for expanding and up to 36 times for shrinking. Furthermore, the asynchronous strategy requires analysing the impact of oversubscription on application performance.This work has been funded by the following projects: project PID2020-113656RB-C21 supported by MCIN/AEI/10.13039/501100011033 and project UJI-B2019-36 supported by UniversitatJaume I. Researcher S. Iserte was supported by the postdoctoralfellowship APOSTD/2020/026, and researcher I. Martín- Álvarez was supported by the predoctoral fellowship ACIF/2021/260, both from Valencian Region Government and European Social Funds.Peer ReviewedPostprint (author's final draft

    Sparse matrix-vector and matrix-multivector products for the truncated SVD on graphics processors

    Get PDF
    Many practical algorithms for numerical rank computations implement an iterative procedure that involves repeated multiplications of a vector, or a collection of vectors, with both a sparse matrix A and its transpose. Unfortunately, the realization of these sparse products on current high performance libraries often deliver much lower arithmetic throughput when the matrix involved in the product is transposed. In this work, we propose a hybrid sparse matrix layout, named CSRC, that combines the flexibility of some well-known sparse formats to offer a number of appealing properties: (1) CSRC can be obtained at low cost from the popular CSR (compressed sparse row) format; (2) CSRC has similar storage requirements as CSR; and especially, (3) the implementation of the sparse product kernels delivers high performance for both the direct product and its transposed variant on modern graphics accelerators thanks to a significant reduction of atomic operations compared to a conventional implementation based on CSR. This solution thus renders considerably higher performance when integrated into an iterative algorithm for the truncated singular value decomposition (SVD), such as the randomized SVD or, as demonstrated in the experimental results, the block Golub–Kahan–Lanczos algorithm

    Communication in task-parallel ILU-preconditioned CG solversusing MPI + OmpSs

    Get PDF
    We target the parallel solution of sparse linear systems via iterative Krylov subspace–based methods enhanced with incomplete LU (ILU)-type preconditioners on clusters of multicore processors. In order to tackle large-scale problems, we develop task-parallel implementations of the classical iteration for the CG method, accelerated via ILUPACK and ILU(0) preconditioners, using MPI + OmpSs. In addition, we integrate several communication-avoiding strategies into the codes, including the butterfly communication scheme and Eijkhout's formulation of the CG method. For all these implementations, we analyze the communication patterns and perform a comparative analysis of their performance and scalability on a cluster consisting of 16 nodes, with 16 cores each

    Dynamic spawning of MPI processes applied to malleability

    Get PDF
    Malleability allows computing facilities to adapt their workloads through resource management systems to maximize the throughput of the facility and the efficiency of the executed jobs. This technique is based on reconfiguring a job to a different resource amount during execution and then continuing with it. One of the stages of malleability is the dynamic spawning of processes in execution time, where different decisions in this stage will affect how the next stage of data redistribution is performed, which is the most time-consuming stage. This paper describes different methods and strategies, defining eight different alternatives to spawn processes dynamically and indicates which one should be used depending on whether a strong or weak scaling application is being used. In addition, it is described for both types of applications which strategies benefit most the application performance or the system productivity. The results show that reducing the number of spawning processes by reusing the older ones can reduce reconfiguration time compared to the classical method by up to 2.6 times for expanding and up to 36 times for shrinking. Furthermore, the asynchronous strategy requires analysing the impact of oversubscription on application performance
    corecore