12 research outputs found

    Minimizing inner product data dependencies in conjugate gradient iteration

    The amount of concurrency available in conjugate gradient iteration is limited by the summations required in the inner product computations. The inner product of two vectors of length N requires time c·log(N) if N or more processors are available. This paper describes an algebraic restructuring of the conjugate gradient algorithm that minimizes data dependencies due to inner product calculations. After an initial start-up phase, the new algorithm can perform a conjugate gradient iteration in time c·log(log(N)).
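    For context, a minimal sketch of the textbook conjugate gradient iteration (not the restructured algorithm from this paper) is shown below; the two inner products per iteration are the reductions whose c·log(N) depth limits the available concurrency.

```python
import numpy as np

def cg(A, b, x0, tol=1e-8, max_iter=1000):
    # Textbook CG. The two vector inner products per iteration are the
    # global reductions (depth ~ log N on N processors) that serialize
    # each iteration; the paper restructures CG to weaken exactly these
    # data dependencies.
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r                       # inner product 1
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)        # inner product 2
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r                   # inner product 1 of the next step
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```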

    Iteration-fusing conjugate gradient

    This paper presents the Iteration-Fusing Conjugate Gradient (IFCG) approach, an evolution of the Conjugate Gradient method that consists of (i) allowing computations from different iterations to overlap and (ii) splitting linear algebra kernels into subkernels to increase concurrency and relax data dependencies. The paper presents two ways of applying the IFCG approach: the IFCG1 algorithm, which aims at hiding the cost of parallel reductions, and the IFCG2 algorithm, which aims at reducing idle time by starting computations as soon as possible. IFCG1 and IFCG2 are complementary approaches for increasing parallel performance. Extensive numerical experiments compare the numerical stability and performance of IFCG1 and IFCG2 against four state-of-the-art techniques. On a set of representative input matrices, the paper demonstrates that IFCG1 and IFCG2 provide parallel performance improvements of up to 42.9% and 41.5%, respectively, and average improvements of 11.8% and 7.1% with respect to the best state-of-the-art techniques, while keeping similar numerical stability properties. The paper also evaluates the IFCG algorithms' sensitivity to system noise and demonstrates that they run 18.0% faster on average than the best state-of-the-art technique under realistic degrees of system noise. This work has been supported by the Spanish Government (Severo Ochoa grant SEV2015-0493), by the Spanish Ministry of Science and Innovation (contract TIN2015-65316), by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the IBM/BSC Deep Learning Center Initiative. Peer reviewed. Postprint (author's final draft).
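    As an illustration of the subkernel idea (a sketch only; the actual IFCG1/IFCG2 algorithms are task-based and more involved), a dot product can be split into per-block partial sums so that only the tiny final sum is a synchronization point, letting block-level work overlap with other kernels or with a later iteration:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def blocked_dot(x, y, nblocks, pool):
    # Split the dot product into independent per-block partial sums;
    # only the cheap final accumulation synchronizes, so the partial
    # sums can overlap with other subkernels.
    blocks = np.array_split(np.arange(len(x)), nblocks)
    partial = [pool.submit(lambda idx=b: float(x[idx] @ y[idx])) for b in blocks]
    return sum(f.result() for f in partial)

if __name__ == "__main__":
    x = np.random.rand(1_000_000)
    y = np.random.rand(1_000_000)
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(blocked_dot(x, y, nblocks=8, pool=pool))
```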

    Cumulative reports and publications through 31 December 1983

    All reports for the calendar years 1975 through December 1983 are listed by author. Since ICASE reports are intended to be preprints of articles for journals and conference proceedings, the published reference is included when available. Thirteen older journal and conference proceedings references are included, as well as five additional reports by ICASE personnel. Major categories of research covered include: (1) numerical methods, with particular emphasis on the development and analysis of basic algorithms; (2) computational problems in engineering and the physical sciences, particularly fluid dynamics, acoustics, structural analysis, and chemistry; and (3) computer systems and software, especially vector and parallel computers, microcomputers, and data management.

    Predict-and-recompute conjugate gradient variants

    The standard implementation of the conjugate gradient algorithm suffers from communication bottlenecks on parallel architectures, due primarily to the two global reductions required every iteration. In this paper, we introduce several predict-and-recompute type conjugate gradient variants, which decrease the runtime per iteration by overlapping global synchronizations and, in the case of our pipelined variants, matrix-vector products. Through the use of a predict-and-recompute scheme, whereby recursively updated quantities are first used as a predictor for their true values and then recomputed exactly at a later point in the iteration, our variants are observed to have convergence properties nearly as good as those of the standard conjugate gradient implementation on every problem we tested. It is also verified experimentally that our variants do indeed reduce runtime per iteration in practice, and that they scale similarly to previously studied communication-hiding variants. Finally, because our variants achieve good convergence without the use of any additional input parameters, they have the potential to be used in place of the standard conjugate gradient implementation in a range of applications. Comment: This material is based upon work supported by the NSF GRFP. Code for reproducing all figures and tables in this paper can be found here: https://github.com/tchen01/new_cg_variant
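    The general pattern behind such variants can be illustrated as follows (a sketch of the predict-and-recompute idea under the standard CG recurrences, not the specific variants introduced in the paper): a recursively predicted value of ||r||^2 is used immediately, and the quantity is then recomputed exactly later in the iteration.

```python
import numpy as np

def predict_recompute_sketch(A, b, x0, tol=1e-8, max_iter=1000):
    # Sketch only. ||r_{k+1}||^2 is first *predicted* from the recurrence
    #   ||r_{k+1}||^2 = ||r_k||^2 - 2*alpha*(r_k' A p_k) + alpha^2*||A p_k||^2,
    # so quantities depending on it can proceed (and, in a parallel code,
    # the reductions can be fused or overlapped); it is then *recomputed*
    # exactly from the updated residual.
    x, r = x0.copy(), b - A @ x0
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        pAp, rAp, ApAp = p @ Ap, r @ Ap, Ap @ Ap           # one fused reduction
        alpha = rs / pAp
        rs_pred = rs - 2 * alpha * rAp + alpha**2 * ApAp   # predictor
        beta = rs_pred / rs                                # used right away
        x += alpha * p
        r -= alpha * Ap
        rs = r @ r                                         # exact recompute
        if np.sqrt(rs) < tol:
            break
        p = r + beta * p
    return x
```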

    Cumulative reports and publications through December 31, 1988

    This document contains a complete list of ICASE Reports. Since ICASE Reports are intended to be preprints of articles that will appear in journals or conference proceedings, the published reference is included when it is available.

    Communication reduction techniques in numerical methods and deep neural networks

    Inter-node communication has turned out to be one of the determining factors of performance on modern HPC systems, and the situation only worsens as the number of cores involved keeps growing. This thesis therefore explores techniques to reduce communication during the execution of a parallel program. There is no one-size-fits-all approach to this challenge; instead, the problems in each field, owing to their unique characteristics, offer distinct opportunities for communication reduction. The thesis first delves into numerical linear algebra and develops IFCG, an evolution of pipelined CG. IFCG eliminates the synchronizations that normally take place towards the end of each iteration in order to increase parallelism. Secondly, the thesis turns its attention to reducing the need to transfer parameters between the CPU host and GPUs during neural network training. It develops two routines, ADT and AWP, which compress and decompress the weights with a reduced data representation format immediately before and after the data transfer takes place. The compression rate is adjusted dynamically according to the L2 norm of the weights of every layer. In the third contribution, the thesis reduces the communication involved in model-parallelizing a deep neural network: instead of splitting and distributing the neurons of every layer across the available processes in the system, this is done only for every other layer. This halves the communication, at the cost of roughly 50% extra local floating-point computation, since the undistributed layers are replicated. Postprint (published version).
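    As a rough illustration of the norm-guided compression idea (hypothetical helper names and policy; the thesis only states that the rate is adjusted with respect to each layer's L2 norm, and the exact ADT/AWP rules are not given here), one could pick a per-layer bit width and quantize the weights before the host-GPU transfer:

```python
import numpy as np

def bits_for_layer(w, low_bits=8, high_bits=16, norm_threshold=1.0):
    # Hypothetical policy: layers with a larger L2 norm keep a finer
    # representation (more bits), others are compressed harder.
    return high_bits if np.linalg.norm(w) > norm_threshold else low_bits

def quantize(w, bits):
    # Uniform quantization to `bits` bits before the host<->GPU transfer.
    scale = np.max(np.abs(w)) / (2 ** (bits - 1) - 1)
    scale = scale if scale > 0 else 1.0   # guard against an all-zero layer
    return np.round(w / scale).astype(np.int32), scale

def dequantize(q, scale):
    # Recover an approximation of the weights after the transfer.
    return q.astype(np.float32) * scale
```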
