12 research outputs found

    Minimizing inner product data dependencies in conjugate gradient iteration

    The amount of concurrency available in conjugate gradient iteration is limited by the summations required in the inner product computations. The inner product of two vectors of length N requires time c·log(N) if N or more processors are available. This paper describes an algebraic restructuring of the conjugate gradient algorithm that minimizes data dependencies due to inner product calculations. After an initial start-up phase, the new algorithm can perform a conjugate gradient iteration in time c·log(log(N)).
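    For context, a minimal sketch of the textbook conjugate gradient iteration (not the restructured algorithm from this paper) is shown below; the two inner products per iteration are the reductions whose c·log(N) depth limits the available concurrency.

```python
import numpy as np

def cg(A, b, x0, tol=1e-8, max_iter=1000):
    # Textbook CG. The two vector inner products per iteration are the
    # global reductions (depth ~ log N on N processors) that serialize
    # each iteration; the paper restructures CG to weaken exactly these
    # data dependencies.
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r                       # inner product 1
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)        # inner product 2
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r                   # inner product 1 of the next step
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```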

    Iteration-fusing conjugate gradient

    This paper presents the Iteration-Fusing Conjugate Gradient (IFCG) approach, an evolution of the Conjugate Gradient method that consists of (i) allowing computations from different iterations to overlap and (ii) splitting linear algebra kernels into subkernels to increase concurrency and relax data dependencies. The paper presents two ways of applying the IFCG approach: the IFCG1 algorithm, which aims at hiding the cost of parallel reductions, and the IFCG2 algorithm, which aims at reducing idle time by starting computations as soon as possible. IFCG1 and IFCG2 are complementary approaches for increasing parallel performance. Extensive numerical experiments compare the numerical stability and performance of IFCG1 and IFCG2 against four state-of-the-art techniques. On a set of representative input matrices, the paper demonstrates that IFCG1 and IFCG2 provide parallel performance improvements of up to 42.9% and 41.5%, respectively, and average improvements of 11.8% and 7.1% with respect to the best state-of-the-art techniques, while keeping similar numerical stability properties. The paper also evaluates the IFCG algorithms' sensitivity to system noise and demonstrates that they run 18.0% faster on average than the best state-of-the-art technique under realistic degrees of system noise. This work has been supported by the Spanish Government (Severo Ochoa grant SEV2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316), by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the IBM/BSC Deep Learning Center Initiative. Peer reviewed. Postprint (author's final draft).
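    As an illustration of the subkernel idea (a sketch only; the actual IFCG1/IFCG2 algorithms are task-based and more involved), a dot product can be split into per-block partial sums so that only the tiny final sum is a synchronization point, letting block-level work overlap with other kernels or with a later iteration:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def blocked_dot(x, y, nblocks, pool):
    # Split the dot product into independent per-block partial sums;
    # only the cheap final accumulation synchronizes, so the partial
    # sums can overlap with other subkernels.
    blocks = np.array_split(np.arange(len(x)), nblocks)
    partial = [pool.submit(lambda idx=b: float(x[idx] @ y[idx])) for b in blocks]
    return sum(f.result() for f in partial)

if __name__ == "__main__":
    x = np.random.rand(1_000_000)
    y = np.random.rand(1_000_000)
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(blocked_dot(x, y, nblocks=8, pool=pool))
```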

    Cumulative reports and publications through 31 December 1983

    All reports for the calendar years 1975 through December 1983 are listed by author. Since ICASE reports are intended to be preprints of articles for journals and conference proceedings, the published reference is included when available. Thirteen older journal and conference proceedings references are included, as well as five additional reports by ICASE personnel. Major categories of research covered include: (1) numerical methods, with particular emphasis on the development and analysis of basic algorithms; (2) computational problems in engineering and the physical sciences, particularly fluid dynamics, acoustics, structural analysis, and chemistry; and (3) computer systems and software, especially vector and parallel computers, microcomputers, and data management.

    Predict-and-recompute conjugate gradient variants

    The standard implementation of the conjugate gradient algorithm suffers from communication bottlenecks on parallel architectures, due primarily to the two global reductions required every iteration. In this paper, we introduce several predict-and-recompute type conjugate gradient variants, which decrease the runtime per iteration by overlapping global synchronizations and, in the case of our pipelined variants, matrix-vector products. Through the use of a predict-and-recompute scheme, whereby recursively updated quantities are first used as a predictor for their true values and then recomputed exactly at a later point in the iteration, our variants are observed to have convergence properties nearly as good as those of the standard conjugate gradient implementation on every problem we tested. It is also verified experimentally that our variants do indeed reduce runtime per iteration in practice, and that they scale similarly to previously studied communication-hiding variants. Finally, because our variants achieve good convergence without the use of any additional input parameters, they have the potential to be used in place of the standard conjugate gradient implementation in a range of applications. Comment: This material is based upon work supported by the NSF GRFP. Code for reproducing all figures and tables in this paper can be found here: https://github.com/tchen01/new_cg_variant
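    The general pattern behind such variants can be illustrated as follows (a sketch of the predict-and-recompute idea under the standard CG recurrences, not the specific variants introduced in the paper): a recursively predicted value of ||r||^2 is used immediately, and the quantity is then recomputed exactly later in the iteration.

```python
import numpy as np

def predict_recompute_sketch(A, b, x0, tol=1e-8, max_iter=1000):
    # Sketch only. ||r_{k+1}||^2 is first *predicted* from the recurrence
    #   ||r_{k+1}||^2 = ||r_k||^2 - 2*alpha*(r_k' A p_k) + alpha^2*||A p_k||^2,
    # so quantities depending on it can proceed (and, in a parallel code,
    # the reductions can be fused or overlapped); it is then *recomputed*
    # exactly from the updated residual.
    x, r = x0.copy(), b - A @ x0
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        pAp, rAp, ApAp = p @ Ap, r @ Ap, Ap @ Ap           # one fused reduction
        alpha = rs / pAp
        rs_pred = rs - 2 * alpha * rAp + alpha**2 * ApAp   # predictor
        beta = rs_pred / rs                                # used right away
        x += alpha * p
        r -= alpha * Ap
        rs = r @ r                                         # exact recompute
        if np.sqrt(rs) < tol:
            break
        p = r + beta * p
    return x
```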

    Cumulative reports and publications through December 31, 1988

    This document contains a complete list of ICASE Reports. Since ICASE Reports are intended to be preprints of articles that will appear in journals or conference proceedings, the published reference is included when it is available.

    Communication reduction techniques in numerical methods and deep neural networks

    Inter-node communication has turned out to be one of the determining factors of performance on modern HPC systems, and the situation only worsens as the number of cores involved keeps growing. This thesis therefore explores techniques to reduce communication during the execution of a parallel program. There is no one-size-fits-all approach to this challenge; instead, the problems in each field, owing to their unique characteristics, offer distinct opportunities for communication reduction. The thesis first delves into numerical linear algebra and develops IFCG, an evolution of pipelined CG. IFCG eliminates the synchronizations that normally take place towards the end of each iteration in order to increase parallelism. Secondly, the thesis turns its attention to reducing the need to transfer parameters between the CPU host and GPUs during neural network training. It develops two routines, ADT and AWP, which compress and decompress the weights with a reduced data representation format immediately before and after the data transfer takes place. The compression rate is adjusted dynamically according to the L2 norm of the weights of every layer. In the third contribution, the thesis reduces the communication involved in model-parallelizing a deep neural network: instead of splitting and distributing the neurons of every layer across the available processes in the system, this is done only for every other layer. This halves the communication, at the cost of roughly 50% extra local floating-point computation, since the undistributed layers are replicated. Postprint (published version).
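    As a rough illustration of the norm-guided compression idea (hypothetical helper names and policy; the thesis only states that the rate is adjusted with respect to each layer's L2 norm, and the exact ADT/AWP rules are not given here), one could pick a per-layer bit width and quantize the weights before the host-GPU transfer:

```python
import numpy as np

def bits_for_layer(w, low_bits=8, high_bits=16, norm_threshold=1.0):
    # Hypothetical policy: layers with a larger L2 norm keep a finer
    # representation (more bits), others are compressed harder.
    return high_bits if np.linalg.norm(w) > norm_threshold else low_bits

def quantize(w, bits):
    # Uniform quantization to `bits` bits before the host<->GPU transfer.
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    scale = scale if scale > 0 else 1.0   # guard against an all-zero layer
    return np.round(w / scale).astype(np.int32), scale

def dequantize(q, scale):
    # Recover an approximation of the weights after the transfer.
    return q.astype(np.float32) * scale
```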
