1,298 research outputs found

    Algorithms for efficient vectorization of repeated sparse power system network computations

    Get PDF
    Cataloged from PDF version of article.Standard sparsity-based algorithms used in power system appllcations need to be restructured for efficient vectorization due to the extremely short vectors processed. Further, intrinsic architectural features of vector computers such as chaining and sectioning should also be exploited for utmost performance. This paper presents novel data storage schemes and vectorization alsorim that resolve the recurrence problem, exploit chaining and minimize the number of indirect element selections in the repeated solution of sparse linear system of equations widely encountered in various power system problems. The proposed schemes are also applied and experimented for the vectorization of power mismatch calculations arising in the solution phase of FDLF which involves typical repeated sparse power network computations. The relative performances of the proposed and existing vectorization schemes are evaluated, both theoretically and experimentally on IBM 3090ArF.Standard sparsity-based algorithms used in power system appllcations need to be restructured for efficient vectorization due to the extremely short vectors processed. Further, intrinsic architectural features of vector computers such as chaining and sectioning should also be exploited for utmost performance. This paper presents novel data storage schemes and vectorization alsorim that resolve the recurrence problem, exploit chaining and minimize the number of indirect element selections in the repeated solution of sparse linear system of equations widely encountered in various power system problems. The proposed schemes are also applied and experimented for the vectorization of power mismatch calculations arising in the solution phase of FDLF which involves typical repeated sparse power network computations. The relative performances of the proposed and existing vectorization schemes are evaluated, both theoretically and experimentally on IBM 3090ArF

    Solution of partial differential equations on vector and parallel computers

    Get PDF
    The present status of numerical methods for partial differential equations on vector and parallel computers was reviewed. The relevant aspects of these computers are discussed and a brief review of their development is included, with particular attention paid to those characteristics that influence algorithm selection. Both direct and iterative methods are given for elliptic equations as well as explicit and implicit methods for initial boundary value problems. The intent is to point out attractive methods as well as areas where this class of computer architecture cannot be fully utilized because of either hardware restrictions or the lack of adequate algorithms. Application areas utilizing these computers are briefly discussed

    Performance analysis and acceleration of nuclear physics application on high-performance computing platforms using GPGPUs and topology-aware mapping techniques

    Get PDF
    The number of nodes on current generation of high performance computing (HPC) platforms increases with a steady rate, and nodes of these computing platforms support multiple/many core hardware designs. As the number of cores per node increase, either CPU or accelerator based, we need to make use of all those cores. Thus, one has to use the accelerators as much as possible inside scientific applications. Furthermore, with the increase of the number of nodes, the communication time between nodes is likely to increase, which necessitates application specific network topology-aware mapping techniques for efficient utilization of these platforms. In addition, one also needs to construct network models in order to study the benefits of specific network mapping. The specific topology-aware mapping techniques will help to distribute the computational tasks so that the communication patterns make optimal use of the underlying network hardware. This research will mainly focus on the Many Fermion Dynamics nuclear (MFDn) application developed at Iowa State University, a computational tool for low-energy nuclear physics, which utilizes the so-called Lanczos algorithm (LA), an algorithm for diagonalization of sparse matrices that is widely used in the scientific parallel computing domain. We present techniques applied to this application which enhance its performance with the utilization of general purpose graphics processing units (GPGPUs). Additionally, we compare the performance of the sparse matrix vector multiplication (SpMVM), the main computationally intensive kernel in the LA, with other efficient approaches presented in the literature. We compare results for the total HPC platforms\u27 resources needed for different SpMVM implementations, present and analyze the implementation of communication and computation overlapping method, and extend a model for the analysis of network topology presented in the literature. Finally, we present network topology-aware mapping techniques, focused at the LA stage, for IBM Blue Gene/Q (BG/Q) supercomputers, which enhance the performance as compared to the default mapping, and validate the results of our test using the network model

    Parallel implementation of the finite element method on shared memory multiprocessors

    Get PDF
    PhD ThesisThe work presented in this thesis concerns parallel methods for finite element analysis. The research has been funded by British Gas and some of the presented material involves work on their software. Practical problems involving the finite element method can use a large amount of processing power and the execution times can be very large. It is consequently important to investigate the possibilities for the parallel implementation of the method. The research has been carried out on an Encore Multimax, a shared memory multiprocessor with 14 identical CPU's. We firstly experimented on autoparallelising a large British Gas finite element program (GASP4) using Encore's parallelising Fortran compiler (epf). The par- allel program generated by epj proved not to be efficient. The main reasons are the complexity of the code and small grain parallelism. Since the program is hard to analyse for the compiler at high levels, only small grain parallelism has been inserted automatically into the code. This involves a great deal of low level syn- chronisations which produce large overheads and cause inefficiency. A detailed analysis of the autoparallelised code has been made with a view to determining the reasons for the inefficiency. Suggestions have also been made about writing programs such that they are suitable for efficient autoparallelisation. The finite element method consists of the assembly of a stiffness matrix and the solution of a set of simultaneous linear equations. A sparse representation of the stiffness matrix has been used to allow experimentation on large problems. Parallel assembly techniques for the sparse representation have been developed. Some of these methods have proved to be very efficient giving speed ups that are near ideal. For the solution phase, we have used the preconditioned conjugate gradient method (PCG). An incomplete LU factorization ofthe stiffness matrix with no fill- in (ILU(O)) has been found to be an effective preconditioner. The factors can be obtained at a low cost. We have parallelised all the steps of the PCG method. The main bottleneck is the triangular solves (preconditioning operations) at each step. Two parallel methods of triangular solution have been implemented. One is based on level scheduling (row-oriented parallelism) and the other is a new approach called independent columns (column-oriented parallelism). The algorithms have been tested for row and red-black orderings of the nodal unknowns in the finite element meshes considered. The best speed ups obtained are 7.29 (on 12 processors) for level scheduling and 7.11 (on 12 processors) for independent columns. Red-black ordering gives rise to better parallel performance than row ordering in general. An analysis of methods for the improvement of the parallel efficiency has been made.British Ga

    A factored sparse approximate inverse preconditioned conjugate gradient solver on graphics processing units

    Get PDF
    Graphics Processing Units (GPUs) exhibit significantly higher peak performance than conventional CPUs. However, in general only highly parallel algorithms can exploit their potential. In this scenario, the iterative solution to sparse linear systems of equations could be carried out quite efficiently on a GPU as it requires only matrix-by-vector products, dot products, and vector updates. However, to be really effective, any iterative solver needs to be properly preconditioned and this represents a major bottleneck for a successful GPU implementation. Due to its inherent parallelism, the factored sparse approximate inverse (FSAI) preconditioner represents an optimal candidate for the conjugate gradient-like solution of sparse linear systems. However, its GPU implementation requires a nontrivial recasting of multiple computational steps. We present our GPU version of the FSAI preconditioner along with a set of results that show how a noticeable speedup with respect to a highly tuned CPU counterpart is obtained

    Communication reduction techniques in numerical methods and deep neural networks

    Get PDF
    Inter-node communication has turned out to be one of the determining factors of the performance on modern HPC systems. Furthermore, the situation only gets worse with the ever-incresing size of the cores involved. Hence, this thesis explore the various possible techniques to reduce the communication during the execution of a parallel program. It turned out that there is no one-size-fit-all approach to the challenge. Despite this, the problems in each field, due to their unique characteristics, dispose of distinct opportunities for the communication reduction. The thesis, first devles into numerical linear algebra, develops an evolution of the Pipelined CG called IFCG. It eliminates the synchronizations normally take place towards the end of each iteration to increase the parallelism. Secondly, the thesis draws its attention on reducing the necessity to transfer the parameters between the CPU host and GPUs during a neural network training. It develops two routines: ADT and AWP in order to compress and decompress the weights with a reduced data representation format prior and right after the data transfer takes place. The compress rate is adjusted vis-à-vis the L2-norm of the weights of every layer. In the third contribution, the thesis diminish the communication in model parallelizing a deep neural network. Instead of splitting and distributing the neurons of each layer to the available processes on the system, now it is done every other layers. This results in a 50% percent reduction of the communication whereas it introduces 50% of extra local FP computation.La comunicació entre els nodes de computació multi-core sorgeix com un dels factors principals que impacta el rendiment d’un sistema HPC d’avui en dia. I més, mentre més core es pusa, pitjor la situació. Per tant aquesta tesi explora les possibles tècniques per a reduir la comunicació en l’execució d’un programa paral·lel. Tot i això, resulta que no existeix una sola tècnica que pugui resoldre aquest obstacle. Tot i que els problemes en cada àmbit, com que té els seus propis caracristics, disposa variosos oportunitats per la reducció de comunicació. La tesi, en primer lloc, dins de l’àmbit de l’àlgebra lineal numèriques desenvolupa un algoritme IFCG que és una evolució de Pipelined CG. IFCG elimina les sincronitzacions normalment posa cap al final de cada iteració per augmentar el paral·lelisme. En la segona contribució, la tesi dirigeix l’atenció a reduir la necessitat de transferir els paràmetres entre el CPU i els GPUs durant l’entrenament d’una xarxa neuronal. Desenvolupa rutines ADT i AWP per comprimir i descomprimir els pesos amb una representació de dades reduïda abans i just desprès de la transferència. La representació es decideix dinàmicament segons el L2-norm dels pesos a cada capa. Al final la tesi disminueix la comunicació en paral·lelitzar el model duna xarxa neurona. En lloc de distribuir les neurones de cada capa als processos disponibles en el sistema, es fa cada dues capes. Així que corta com mitja de la comunicació. En canvi, com que distribueix només cada dues capes, les capes restes es repliquen, resulta que incorre en una augmenta de 50% de computació local.Postprint (published version

    Communication reduction techniques in numerical methods and deep neural networks

    Get PDF
    Inter-node communication has turned out to be one of the determining factors of the performance on modern HPC systems. Furthermore, the situation only gets worse with the ever-incresing size of the cores involved. Hence, this thesis explore the various possible techniques to reduce the communication during the execution of a parallel program. It turned out that there is no one-size-fit-all approach to the challenge. Despite this, the problems in each field, due to their unique characteristics, dispose of distinct opportunities for the communication reduction. The thesis, first devles into numerical linear algebra, develops an evolution of the Pipelined CG called IFCG. It eliminates the synchronizations normally take place towards the end of each iteration to increase the parallelism. Secondly, the thesis draws its attention on reducing the necessity to transfer the parameters between the CPU host and GPUs during a neural network training. It develops two routines: ADT and AWP in order to compress and decompress the weights with a reduced data representation format prior and right after the data transfer takes place. The compress rate is adjusted vis-à-vis the L2-norm of the weights of every layer. In the third contribution, the thesis diminish the communication in model parallelizing a deep neural network. Instead of splitting and distributing the neurons of each layer to the available processes on the system, now it is done every other layers. This results in a 50% percent reduction of the communication whereas it introduces 50% of extra local FP computation.La comunicació entre els nodes de computació multi-core sorgeix com un dels factors principals que impacta el rendiment d’un sistema HPC d’avui en dia. I més, mentre més core es pusa, pitjor la situació. Per tant aquesta tesi explora les possibles tècniques per a reduir la comunicació en l’execució d’un programa paral·lel. Tot i això, resulta que no existeix una sola tècnica que pugui resoldre aquest obstacle. Tot i que els problemes en cada àmbit, com que té els seus propis caracristics, disposa variosos oportunitats per la reducció de comunicació. La tesi, en primer lloc, dins de l’àmbit de l’àlgebra lineal numèriques desenvolupa un algoritme IFCG que és una evolució de Pipelined CG. IFCG elimina les sincronitzacions normalment posa cap al final de cada iteració per augmentar el paral·lelisme. En la segona contribució, la tesi dirigeix l’atenció a reduir la necessitat de transferir els paràmetres entre el CPU i els GPUs durant l’entrenament d’una xarxa neuronal. Desenvolupa rutines ADT i AWP per comprimir i descomprimir els pesos amb una representació de dades reduïda abans i just desprès de la transferència. La representació es decideix dinàmicament segons el L2-norm dels pesos a cada capa. Al final la tesi disminueix la comunicació en paral·lelitzar el model duna xarxa neurona. En lloc de distribuir les neurones de cada capa als processos disponibles en el sistema, es fa cada dues capes. Així que corta com mitja de la comunicació. En canvi, com que distribueix només cada dues capes, les capes restes es repliquen, resulta que incorre en una augmenta de 50% de computació local

    Heterogeneous parallel algorithms for computational fluid dynamics on unstructured meshes

    Get PDF
    Frontiers of computational fluid dynamics (CFD) are constantly expanding and eagerly demanding more computational resources. Currently, we are experiencing an rapid evolution in the high performance computing systems driven by power consumption constraints. New HPC nodes incorporate accelerators that are used as math co-processors for increasing the throughput and the FLOP per watt ratio. On the other hand, multi-core CPUs have turned into energy efficient system-on-chip architectures. By doing so, the main components of the node are fused and integrated into a single chip reducing the energy costs. Nowadays, several institutions and governments are investing in the research and development of different aspects of HPC that could lead to the next generations of supercomputers. This initiatives have entitled the problem as the exascale challenge. This goal can only be achieved by incorporating major changes in computer architecture, memory design and network interfaces. The CFD community faces an important challenge: keep the pace at the rapid changes in the HPC resources. The codes and formulations need to be re-design in other to exploit the different levels of parallelism and complex memory hierarchies of the new heterogeneous systems. The main characteristics demanded to the new CFD software are: memory awareness, extreme concurrency, modularity and portability. This thesis is devoted to the study of a CFD algorithm re-factoring for the adoption of new technologies. Our application context is the solution of incompressible flows (DNS or LES) on unstructured meshes. The first approach was using GPUs for accelerating the Poisson solver, that is the most computational intensive part of our application. The positive results obtained in this first step motivated us to port the complete time integration phase of our application. This requires a major redesign of the code. We propose a portable implementation model for CFD applications. The main idea was substituting stencil data structures and kernels by algebraic storage formats and operators. By doing so, the algorithm was restructured into a minimal set of algebraic operations. The implementation strategy consisted in the creation of a low-level algebraic layer for computations on CPUs and GPUs, and a high-level user-friendly discretization layer for CPUs that is fully localized at the preprocessing stage where performance does not play an important role. As a result, at the time-integration phase the code relies only on three algebraic kernels: sparse-matrix-vector product (SpMV), linear combination of two vectors (AXPY) and dot product (DOT). Such a simple set of basic linear algebra operations naturally provides the desired portability to any computing architecture. Special attention was paid at the development of data structures compatibles with the stream processing model. A detailed performance analysis was studied in both sequential and parallel execution engaging up to 128 GPUs in a hybrid CPU/GPU supercomputer. Moreover, we tested the portable implementation model of TermoFluids code in the Mont-Blanc mobile-based supercomputer. The re-design of the kernels exploits a heterogeneous execution model using both computing devices CPU and GPU of the ARM-based nodes. The load balancing between the two computing devices exploits a tabu search strategy that tunes the workload distribution during the preprocessing stage. A comparison of the Mont-Blanc prototypes with high-end supercomputers in terms of the achieved net performance and energy consumption provided some guidelines of the behavior of CFD applications in ARM-based architectures. Finally, we present a memory aware auto-tuned Poisson solver for problems with one Fourier diagonalizable direction. This work was developed and tested in the BlueGene/Q Vesta supercomputer, and aims at demonstrating the relevance of vectorization and memory awareness for fully exploiting the modern energy efficient CPUs.Las fronteras de la dinámica de fluidos computacional (CFD) están en constante expansión y demandan más y más recursos computacionales. Actualmente, estamos experimentando una evolución en los sistemas de computación de alto rendimiento (HPC) impulsado por restricciones de consumo de energía. Los nuevos nodos HPC incorporan aceleradores que se utilizan como co-procesadores para incrementar el rendimiento y la relación FLOP por vatio. Por otro lado, CPUs multi-core se han convertido en arquitecturas system-on-chip. Hoy en día, varias instituciones y gobiernos están invirtiendo en la investigación y desarrollo de los diferentes aspectos de HPC que podrían llevar a las próximas generaciones de superordenadores. Estas iniciativas han titulado el problema como el "exascale challenge". Este objetivo sólo puede lograrse mediante la incorporación de cambios importantes en: la arquitectura de ordenador, diseño de la memoria y las interfaces de red. La comunidad de CFD se enfrenta a un reto importante: mantener el ritmo a los rápidos cambios en las infraestructuras de HPC. Los códigos y formulaciones necesitan ser rediseñados para explotar los diferentes niveles de paralelismo y complejas jerarquías de memoria de los nuevos sistemas heterogéneos. Las principales características exigidas al nuevo software CFD son: estructuras de datos, la concurrencia extrema, modularidad y portabilidad. Esta tesis está dedicada al estudio de un modelo de implementation CFD para la adopción de nuevas tecnologías. Nuestro contexto de aplicación es la solución de los flujos incompresibles (DNS o LES) en mallas no estructuradas. El primer enfoque se basó en utilizar GPUs para acelerar el solver de Poisson. Los resultados positivos obtenidos en este primer paso nos motivaron a la portabilidad completa de la fase de integración temporal de nuestra aplicación. Esto requiere un importante rediseño del código. Proponemos un modelo de implementacion portable para aplicaciones de CFD. La idea principal es sustituir las estructuras de datos de los stencils y kernels por formatos de almacenamiento algebraicos y operadores. La estrategia de implementación consistió en la creación de una capa algebraica de bajo nivel para los cálculos de CPU y GPU, y una capa de discretización fácil de usar de alto nivel para las CPU. Como resultado, la fase de integración temporal del código se basa sólo en tres funciones algebraicas: producto de una matriz dispersa con un vector (SPMV), combinación lineal de dos vectores (AXPY) y producto escalar (DOT). Además, se prestó especial atención en el desarrollo de estructuras de datos compatibles con el modelo stream processing. Un análisis detallado de rendimiento se ha estudiado tanto en ejecución secuencial y paralela utilizando hasta 128 GPUs en un superordenador híbrido CPU / GPU. Por otra parte, hemos probado el nuevo modelo de TermoFluids en el superordenador Mont-Blanc basado en tecnología móvil. El rediseño de las funciones explota un modelo de ejecución heterogénea utilizando tanto la CPU y la GPU de los nodos basados en arquitectura ARM. El equilibrio de carga entre las dos unidades de cálculo aprovecha una estrategia de búsqueda tabú que sintoniza la distribución de carga de trabajo durante la etapa de preprocesamiento. Una comparación de los prototipos Mont-Blanc con superordenadores de alta gama en términos de rendimiento y consumo de energía nos proporcionó algunas pautas del comportamiento de las aplicaciones CFD en arquitecturas basadas en ARM. Por último, se presenta una estructura de datos auto-sintonizada para el solver de Poisson en problemas con una dirección diagonalizable mediante una descomposicion de Fourier. Este trabajo fue desarrollado y probado en la superordenador BlueGene / Q Vesta, y tiene por objeto demostrar la relevancia de vectorización y las estructuras de datos para aprovechar plenamente las CPUs de los superodenadores modernos

    Techniques for power system simulation using multiple processors

    Get PDF
    The thesis describes development work which was undertaken to improve the speed of a real-time power system simulator used for the development and testing of control schemes. The solution of large, highly sparse matrices was targeted because this is the most time-consuming part of the current simulator. Major improvements in the speed of the matrix ordering phase of the solution were achieved through the development of a new ordering strategy. This was thoroughly investigated, and is shown to provide important additional improvements compared to standard ordering methods, in reducing path length and minimising potential pipeline stalls. Alterations were made to the remainder of the solution process which provided more flexibility in scheduling calculations. This was used to dramatically ease the run-time generation of efficient code, dedicated to the solution of one matrix structure, and also to reduce memory requirements. A survey of the available microprocessors was performed, which concluded that a special-purpose design could best implement the code generated at run-time, and a design was produced using a microprogrammable floating-point processor, which matched the code produced by the earlier work. A method of splitting the matrix solution onto parallel processors was investigated, and two methods of producing network splits were developed and their results compared. The best results from each method were found to agree well, with a predicted three-fold speed-up for the matrix solution of the C.E.G.B. transmission system from the use of six processors. This gain will increase for the whole simulator. A parallel processing topology of the partitioned network and produce the necessary structures for the remainder of the solution process
    corecore