11 research outputs found

    Communication reduction techniques in numerical methods and deep neural networks

    Get PDF
    Inter-node communication has turned out to be one of the determining factors of performance on modern HPC systems, and the situation only worsens with the ever-increasing number of cores involved. Hence, this thesis explores various techniques to reduce communication during the execution of a parallel program. It turns out that there is no one-size-fits-all approach to the challenge; nevertheless, the problems in each field, owing to their unique characteristics, offer distinct opportunities for communication reduction. The thesis first delves into numerical linear algebra and develops IFCG, an evolution of Pipelined CG. It eliminates the synchronizations that normally take place towards the end of each iteration in order to increase parallelism. Secondly, the thesis turns its attention to reducing the need to transfer parameters between the CPU host and the GPUs during neural network training. It develops two routines, ADT and AWP, which compress and decompress the weights using a reduced data representation format immediately before and after the data transfer takes place. The compression rate is adjusted according to the L2-norm of the weights of every layer. In the third contribution, the thesis reduces the communication incurred when model-parallelizing a deep neural network. Instead of splitting and distributing the neurons of every layer across the available processes on the system, this is now done only for every other layer, while the remaining layers are replicated. This cuts the communication by 50% at the cost of roughly 50% extra local floating-point computation.
    Postprint (published version)
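    The per-layer precision selection described for ADT/AWP can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the thesis's actual routines: the threshold value, the float32/float16 dtype pair, and the function names are all hypothetical stand-ins for "pick a reduced representation from the L2-norm of each layer's weights, compress before the host-to-GPU transfer, decompress right after".

```python
import numpy as np

def choose_dtype(weights, threshold=1.0):
    """Pick a reduced representation from the layer's L2-norm.
    Threshold and dtype choices are illustrative only."""
    norm = np.linalg.norm(weights)
    return np.float16 if norm < threshold else np.float32

def compress(weights):
    # Cast to the chosen dtype right before the transfer.
    return weights.astype(choose_dtype(weights))

def decompress(compressed):
    # Restore full precision right after the transfer.
    return compressed.astype(np.float32)

rng = np.random.default_rng(0)
layer = (rng.standard_normal((256, 256)) * 1e-3).astype(np.float32)
sent = compress(layer)       # small-norm layer: half the bytes on the wire
restored = decompress(sent)
```

    A lossy cast like this trades a bounded precision loss for halved transfer volume on layers whose small norm suggests they tolerate it.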

    Linompss – A Linear Algebra Library on OMPSs

    Get PDF
    Exascale performance will require massive parallelism and asynchronous execution (DARPA, DOE, EESI2). The former pertains to the design choice of increasing hardware performance through growing core counts; the latter ensures that the associated software scales well. As a result, traditional bulk-synchronous parallel programming models like OpenMP or MPI will likely give way to more amenable variants. At the time of writing, the general consensus leans towards task-based programming models that support dynamic scheduling and use graph-based models to record task dependencies. In this context, we present LINOMPSs, a parallel linear algebra library built on top of OMPSs. Currently, we offer most of the BLAS-3 functionality and the three main factorizations (QR, LU, Cholesky), with limited support for sparse matrices. For systems of equations, LINOMPSs provides implementations of both direct methods and iterative solvers. At the lowest level, LINOMPSs can be linked against most BLAS or LAPACK libraries (MKL, ESSL, Netlib, ATLAS, ...). These functions are used, in turn, to implement the OMPSs tasks of the parallel implementations of the BLAS, the Blocked BLAS. The Blocked BLAS are used to write blocked versions of the factorizations in OMPSs, which can then be combined to write solvers or full-fledged numerical simulations with finite element methods, multigrid methods, etc. These applications benefit from the asynchronous parallelism exposed by OMPSs at run time. Moreover, the layered design allows for the concurrent and out-of-order execution of tasks that stem from previously distant regions of computation. We consider LINOMPSs not only as an environment for the development of linear algebra applications and benchmarks for the Computer Science Department; in our own department, for example, we use LINOMPSs to experiment with mixed precision and incomplete/inaccurate computations.
    With the expertise of the CASE and Earth Science Departments at BSC, we envision LINOMPSs as a tool for cross-disciplinary collaboration and the development of industrial-strength engineering applications.
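    The way Blocked BLAS compose into a factorization can be sketched with a sequential blocked (right-looking) Cholesky. This is an illustrative NumPy sketch, not LINOMPSs code: in LINOMPSs each loop body below would become one OMPSs task (POTRF, TRSM, SYRK/GEMM), with dependencies inferred at run time from the blocks each task reads and writes.

```python
import numpy as np

def blocked_cholesky(A, bs):
    """Sequential sketch of the blocked right-looking Cholesky;
    each commented step corresponds to one task in a task-based run.
    Assumes A is symmetric positive definite and bs divides n."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, bs):
        kk = slice(k, k + bs)
        # POTRF task: factor the diagonal block in place.
        A[kk, kk] = np.linalg.cholesky(A[kk, kk])
        for i in range(k + bs, n, bs):
            ii = slice(i, i + bs)
            # TRSM task: solve the panel block against the diagonal factor.
            A[ii, kk] = np.linalg.solve(A[kk, kk], A[ii, kk].T).T
        for i in range(k + bs, n, bs):
            ii = slice(i, i + bs)
            for j in range(k + bs, i + bs, bs):
                jj = slice(j, j + bs)
                # SYRK/GEMM task: update the trailing submatrix.
                A[ii, jj] -= A[ii, kk] @ A[jj, kk].T
    return np.tril(A)
```

    In a task-based runtime, independent TRSM and update tasks from different `k` steps can execute concurrently and out of order, which is exactly the asynchronous parallelism the abstract describes.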

    Iteration-fusing conjugate gradient

    Get PDF
    This paper presents the Iteration-Fusing Conjugate Gradient (IFCG) approach, an evolution of the Conjugate Gradient method that consists in i) letting computations from different iterations overlap and ii) splitting linear algebra kernels into subkernels to increase concurrency and relax data dependencies. The paper presents two ways of applying the IFCG approach: the IFCG1 algorithm, which aims at hiding the cost of parallel reductions, and the IFCG2 algorithm, which aims at reducing idle time by starting computations as soon as possible. IFCG1 and IFCG2 are two complementary approaches for increasing parallel performance. Extensive numerical experiments are conducted to compare the numerical stability and performance of IFCG1 and IFCG2 against four state-of-the-art techniques. By considering a set of representative input matrices, the paper demonstrates that IFCG1 and IFCG2 provide parallel performance improvements of up to 42.9% and 41.5% respectively, and average improvements of 11.8% and 7.1%, with respect to the best state-of-the-art techniques while keeping similar numerical stability properties. This paper also evaluates the IFCG algorithms' sensitivity to system noise and demonstrates that they run 18.0% faster on average than the best state-of-the-art technique under realistic degrees of system noise. This work has been supported by the Spanish Government (Severo Ochoa grant SEV2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and by the IBM/BSC Deep Learning Center Initiative. Peer reviewed. Postprint (author's final draft)
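    To see what IFCG modifies, it helps to look at the baseline it starts from. The sketch below is the textbook Conjugate Gradient in NumPy, not the paper's IFCG1/IFCG2 algorithms: the two dot products per iteration are the global reductions whose synchronization cost IFCG1 hides, and the monolithic kernels are what IFCG2 splits into subkernels.

```python
import numpy as np

def cg(A, b, tol=1e-10, max_iter=200):
    """Textbook Conjugate Gradient for SPD A. The two dot products
    per iteration (for alpha and beta) are the parallel reductions
    that force a synchronization at the end of every iteration."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)     # first reduction: p . Ap
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r            # second reduction: r . r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

    In a distributed run each reduction is an all-reduce that stalls every process; overlapping work from adjacent iterations across that stall is the core idea of iteration fusing.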

    Improving The Robustness Of The Register File: a Register File Cache Architecture

    Get PDF
    This thesis exploits a multi-band, cache-like register file architecture to mitigate the potential damage caused by process variations and soft errors (single-event upsets). A quantitative analysis based on simulation results is conducted to measure the possible gains and losses of incorporating it.
