135 research outputs found

    Enabling Massive Deep Neural Networks with the GraphBLAS

    Deep Neural Networks (DNNs) have emerged as a core tool for machine learning. The computations performed during DNN training and inference are dominated by operations on the weight matrices describing the DNN. As DNNs incorporate more stages and more nodes per stage, these weight matrices may need to be sparse because of memory limitations. The GraphBLAS.org math library standard was developed to provide high-performance manipulation of sparse weight matrices and input/output vectors. For sufficiently sparse matrices, a sparse matrix library requires significantly less memory than the corresponding dense matrix implementation. This paper provides a brief description of the mathematics underlying the GraphBLAS. In addition, the equations of a typical DNN are rewritten in a form designed to use the GraphBLAS. An implementation of the DNN is given using a preliminary GraphBLAS C library. The performance of the GraphBLAS implementation is measured relative to a standard dense linear algebra library implementation. For various sizes of DNN weight matrices, the GraphBLAS sparse implementation is shown to outperform a BLAS dense implementation as the weight matrix becomes sparser. Comment: 10 pages, 7 figures; to appear in the 2017 IEEE High Performance Extreme Computing (HPEC) conference.
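    As a rough illustration of the sparse formulation this abstract describes, the sketch below evaluates one DNN layer as a sparse matrix-vector product followed by ReLU. It uses Python with scipy.sparse rather than the paper's GraphBLAS C library, and the layer size, density, and activation choice are example assumptions, not the paper's configuration.

```python
# Illustrative sketch only: a single sparse DNN layer forward pass,
# using scipy.sparse in place of the paper's GraphBLAS C library.
# Matrix size and sparsity are made-up example values.
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n = 1024                                   # neurons per layer (example value)
density = 0.01                             # fraction of non-zero weights

W = sp.random(n, n, density=density, format="csr", random_state=rng)  # sparse weights
b = np.zeros(n)                            # bias vector
x = rng.random(n)                          # input feature vector

y = W @ x + b                              # sparse matrix-vector product
y = np.maximum(y, 0.0)                     # ReLU nonlinearity
print(y.shape, np.count_nonzero(y))
```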

    Distributed Sparse Computing and Communication for Big Graph Analytics and Deep Learning

    Sparsity can be found in the underlying structure of many real-world, computationally expensive problems, including big graph analytics and large-scale sparse deep neural networks. In addition, when carefully investigated, many of these problems contain a broad substratum of parallelism suitable for parallel and distributed execution of sparse computation. Usually, however, dense computation is preferred to its sparse alternative, because sparse computation is not only hard to parallelize due to the irregular nature of sparse data, but also complicated to implement in terms of rewriting a dense algorithm into a sparse one. Hence, robust sparse computation requires customized data structures to encode the sparsity of the data and new algorithms to mask the complexity of the sparse computation. By carefully exploiting such sparse data structures and algorithms, sparse computation can reduce memory consumption, communication volume, and processing power, and thus push the scalability boundaries well beyond its dense equivalent. In this dissertation, I explain how to use parallel and distributed computing techniques in the presence of sparsity to solve large scientific problems, including graph analytics and deep learning. To this end, I leverage the duality between graph theory and sparse linear algebra primitives, and thus solve graph analytics and deep learning problems with sparse matrix operations. My contributions are fourfold: (1) the design and implementation of a new distributed compressed sparse matrix data structure that reduces both computation and communication volumes and is suitable for sparse matrix-vector and sparse matrix-matrix operations; (2) a new MPI*X parallelism model that treats threads as the basic units of computing and communication; (3) optimization of sparse matrix-matrix multiplication by employing different hashing techniques; and (4) a new data-then-model parallelism that mitigates the effect of stragglers in sparse deep learning by combining data and model parallelism. Altogether, these contributions provide a set of data structures and algorithms to accelerate and scale sparse computation and communication.
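    For readers unfamiliar with compressed sparse formats, here is a minimal single-node CSR SpMV sketch in Python. The dissertation's distributed compressed matrix structure and MPI*X model are considerably more elaborate; the arrays below are toy values chosen only to show what the format stores and how it is traversed.

```python
# Minimal single-node CSR SpMV sketch (not the dissertation's distributed structure).
import numpy as np

def csr_spmv(indptr, indices, data, x):
    """y = A @ x for A stored in CSR as (indptr, indices, data)."""
    y = np.zeros(len(indptr) - 1)
    for row in range(len(indptr) - 1):
        start, end = indptr[row], indptr[row + 1]
        # Only the non-zeros of this row are touched.
        y[row] = np.dot(data[start:end], x[indices[start:end]])
    return y

# Tiny example matrix: [[1, 0, 2], [0, 3, 0]]
indptr  = np.array([0, 2, 3])
indices = np.array([0, 2, 1])
data    = np.array([1.0, 2.0, 3.0])
print(csr_spmv(indptr, indices, data, np.array([1.0, 1.0, 1.0])))  # [3., 3.]
```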

    Towards Efficient Hardware Acceleration of Deep Neural Networks on FPGA

    Deep neural networks (DNNs) have achieved remarkable success in many applications because of their powerful capability for data processing. Their performance in computer vision has matched, and in some areas even surpassed, human capabilities. Deep neural networks can capture complex nonlinear features; however, this ability comes at the cost of high computational and memory requirements. State-of-the-art networks require billions of arithmetic operations and millions of parameters. The brute-force computing model of DNNs often requires extremely large hardware resources, raising severe concerns about its scalability on traditional von Neumann architectures. The well-known memory wall, together with the latency introduced by the long-range connectivity and communication of DNNs, severely constrains DNN computation efficiency. Acceleration techniques for DNNs, whether in software or hardware, often suffer from poor hardware execution efficiency of the simplified model (software) or from inevitable accuracy degradation and a limited set of supported algorithms (hardware). To preserve inference accuracy while making the hardware implementation more efficient, a close investigation into hardware/software co-design methodologies for DNNs is needed. The proposed work first presents an FPGA-based implementation framework for Recurrent Neural Network (RNN) acceleration. At the architectural level, we improve the parallelism of the RNN training scheme and reduce the computing resource requirement to enhance computation efficiency. The hardware implementation primarily targets reducing the data communication load. Secondly, we propose a data-locality-aware sparse matrix-vector multiplication (SpMV) kernel. At the software level, we reorganize a large sparse matrix into many modest-sized blocks by adopting hypergraph-based partitioning and clustering. Available hardware constraints are taken into consideration for memory allocation and data access regularization. Thirdly, we present a holistic acceleration of sparse convolutional neural networks (CNNs). During network training, data locality is regularized to ease hardware mapping. The distributed architecture enables high computation parallelism and data reuse. The proposed research results in a hardware/software co-design methodology for fast and accurate DNN acceleration, through innovations in algorithm optimization, hardware implementation, and the interactive design process across these two domains.
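    The locality idea behind the SpMV kernel can be sketched, under simplifying assumptions, as a blocked SpMV: the thesis reorders the matrix with hypergraph-based partitioning and clustering, whereas the Python sketch below only splits columns into fixed-width blocks so that each block reuses a small slice of the input vector (as on-chip memory would). The block width and matrix are example values.

```python
# Locality-oriented blocked SpMV sketch (simplified stand-in for the thesis's
# hypergraph-partitioned blocking). Each column block reuses a small slice of x.
import numpy as np
import scipy.sparse as sp

def blocked_spmv(A_csc, x, block_cols=256):
    y = np.zeros(A_csc.shape[0])
    for start in range(0, A_csc.shape[1], block_cols):
        end = min(start + block_cols, A_csc.shape[1])
        block = A_csc[:, start:end]          # column block of the sparse matrix
        y += block @ x[start:end]            # partial product reuses x[start:end]
    return y

A = sp.random(1000, 1000, density=0.01, format="csc", random_state=0)
x = np.ones(1000)
print(np.allclose(blocked_spmv(A, x), A @ x))   # True
```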

    Convolutional neural nets for estimating the run time and energy consumption of the sparse matrix-vector product

    Modeling the performance and energy consumption of the sparse matrix-vector product (SpMV) is essential to perform off-line analysis and, for example, choose a target computer architecture that delivers the best performance-energy consumption ratio. However, this task is especially complex given the memory-bound nature and irregular memory accesses of the SpMV, mainly dictated by the input sparse matrix. In this paper, we propose a Machine Learning (ML)-driven approach that leverages Convolutional Neural Networks (CNNs) to provide accurate estimations of the performance and energy consumption of the SpMV kernel. The proposed CNN-based models use a blockwise approach to make the CNN architecture independent of the matrix size. These models are trained to estimate execution time as well as total, package, and DRAM energy consumption at different processor frequencies. The experimental results reveal that the overall relative error ranges between 0.5% and 14%, while at the matrix level it does not exceed 10%. To demonstrate the applicability and accuracy of the SpMV CNN-based models, this study is complemented with an ad hoc time-energy model for the PageRank algorithm, a popular algorithm for web information retrieval used by search engines, which internally performs the SpMV kernel.
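    One plausible way to obtain the blockwise, size-independent CNN input the abstract mentions is to summarise a matrix's sparsity pattern as a fixed-size grid of per-block non-zero densities. The sketch below is an assumption-laden illustration of that idea; the paper's exact feature construction and grid size may differ.

```python
# Sketch: reduce any sparse matrix to a fixed-size (grid x grid) density map
# that could feed a CNN, independent of the matrix dimensions. Grid size is an
# example value, not necessarily the paper's choice.
import numpy as np
import scipy.sparse as sp

def density_image(A_coo, grid=32):
    img = np.zeros((grid, grid))
    rows = (A_coo.row * grid // A_coo.shape[0]).astype(int)
    cols = (A_coo.col * grid // A_coo.shape[1]).astype(int)
    np.add.at(img, (rows, cols), 1.0)        # count non-zeros per block
    return img / A_coo.nnz                   # normalise to a density map

A = sp.random(5000, 5000, density=0.002, format="coo", random_state=0)
print(density_image(A).shape)                # (32, 32), regardless of matrix size
```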

    Partitioning sparse deep neural networks for scalable training and inference

    State-of-the-art deep neural networks (DNNs) have significant computational and data management requirements. The sizes of both training data and models continue to increase. Sparsification and pruning methods have been shown to be effective at removing a large fraction of connections in DNNs. The resulting sparse networks present unique challenges for further improving the computational efficiency of training and inference in deep learning. Both the feedforward (inference) and backpropagation steps of the stochastic gradient descent (SGD) algorithm for training sparse DNNs involve consecutive sparse matrix-vector multiplications (SpMVs). We first introduce a distributed-memory parallel SpMV-based solution for the SGD algorithm to improve its scalability. The parallelization approach is based on row-wise partitioning of the weight matrices that represent neuron connections between consecutive layers. We then propose a novel hypergraph model for partitioning the weight matrices to reduce the total communication volume and ensure computational load balance among processors. Experiments performed on sparse DNNs demonstrate that the proposed solution is highly efficient and scalable. By utilizing the proposed matrix partitioning scheme, the performance of our solution is improved significantly further.
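    The feedforward step referred to here can be pictured as a chain of per-layer SpMVs; a minimal, single-process Python sketch follows. The row-wise partitioning and hypergraph-based placement that constitute the paper's contribution are not shown, and the layer sizes, density, and ReLU activation are illustrative assumptions.

```python
# Sketch of sparse-DNN feedforward as a chain of SpMVs, one per layer
# (single process; the paper's distributed partitioning is not shown).
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(1)
layer_sizes = [1024, 1024, 1024, 10]          # example layer widths
weights = [sp.random(m, n, density=0.05, format="csr", random_state=rng)
           for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]

def feedforward(weights, x):
    for W in weights:
        x = np.maximum(W @ x, 0.0)            # SpMV followed by ReLU
    return x

print(feedforward(weights, rng.random(layer_sizes[0])).shape)   # (10,)
```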

    Communication reduction techniques in numerical methods and deep neural networks

    Inter-node communication has turned out to be one of the determining factors of performance on modern HPC systems. Furthermore, the situation only gets worse with the ever-increasing number of cores involved. Hence, this thesis explores various techniques to reduce communication during the execution of a parallel program. It turns out that there is no one-size-fits-all approach to this challenge; rather, the problems in each field, owing to their unique characteristics, offer distinct opportunities for communication reduction. The thesis first delves into numerical linear algebra and develops an evolution of Pipelined CG called IFCG. It eliminates the synchronizations that normally take place towards the end of each iteration in order to increase parallelism. Secondly, the thesis turns its attention to reducing the need to transfer parameters between the CPU host and GPUs during neural network training. It develops two routines, ADT and AWP, that compress and decompress the weights with a reduced data representation format immediately before and after each data transfer. The compression rate is adjusted according to the L2-norm of the weights of every layer. In the third contribution, the thesis reduces the communication involved in model-parallelizing a deep neural network. Instead of splitting and distributing the neurons of every layer across the available processes on the system, this is done only for every other layer, with the remaining layers replicated on each process. This yields a 50% reduction in communication while introducing roughly 50% extra local floating-point computation.
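    A heavily hedged sketch of the per-layer compression idea follows: before a host-to-GPU transfer, a cheaper representation is chosen based on the layer's L2-norm. The actual ADT/AWP encodings and the threshold used here are assumptions for illustration, not the thesis's exact scheme.

```python
# Hedged sketch of norm-driven weight compression before a host<->GPU transfer.
# The real ADT/AWP routines and the threshold are NOT reproduced here.
import numpy as np

def compress_layer(w, norm_threshold=10.0):
    """Return (payload, dtype) to transfer; small-norm layers go as float16."""
    if np.linalg.norm(w) < norm_threshold:
        return w.astype(np.float16), np.float16   # reduced representation
    return w.astype(np.float32), np.float32       # keep full single precision

def decompress_layer(payload):
    return payload.astype(np.float32)             # restore compute dtype on arrival

w = np.random.default_rng(2).normal(scale=0.01, size=(256, 256))
payload, dtype = compress_layer(w)
print(dtype, decompress_layer(payload).dtype)
```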

    Reconfigurable acceleration of Recurrent Neural Networks

    Recurrent Neural Networks (RNNs) have been successful in a wide range of applications involving temporal sequences, such as natural language processing, speech recognition and video analysis. However, RNNs often require a significant amount of memory and computational resources. In addition, the recurrent nature and data dependencies in RNN computations can lead to system stalls, resulting in low throughput and high latency. This work describes novel parallel hardware architectures for accelerating RNN inference using Field-Programmable Gate Array (FPGA) technology, taking into account the data dependencies and high computational costs of RNNs. The first contribution of this thesis is a latency-hiding architecture that utilizes column-wise matrix-vector multiplication instead of the conventional row-wise operation to eliminate data dependencies and improve the throughput of RNN inference designs. This architecture is further enhanced by a configurable checkerboard tiling strategy which allows large weight matrix dimensions while supporting element-based and vector-based parallelism. The presented reconfigurable RNN designs show significant speedups over CPU, GPU, and other FPGA designs. The second contribution of this thesis is a weight reuse approach for large RNN models with weights stored in off-chip memory, running with a batch size of one. A novel blocking-batching strategy is proposed to optimize the throughput of large RNN designs on FPGAs by reusing the RNN weights. A performance analysis is also introduced to enable FPGA designs to achieve the best trade-off between area, power consumption and performance. Promising power-efficiency improvements are achieved in addition to speedups over CPU and GPU designs. The third contribution of this thesis is a low-latency design for RNNs based on a partially-folded hardware architecture. It also introduces a technique that balances the initiation interval of multi-layer RNN inference to increase hardware efficiency and throughput while reducing latency. The approach is evaluated on a variety of applications, including gravitational wave detection and Bayesian RNN-based ECG anomaly detection. To facilitate the use of this approach, we open-source an RNN template which enables the generation of low-latency FPGA designs with efficient resource utilization using high-level synthesis tools.
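    The column-wise scheme mentioned in the first contribution can be illustrated in a few lines: instead of computing one dot product per output row, each input element scales a whole column of the weight matrix and partial sums are accumulated. The NumPy sketch below only shows the access order; the latency-hiding benefit comes from the FPGA pipeline, not from this software form.

```python
# Column-wise matrix-vector multiplication: accumulate x[j] * W[:, j] per column.
# In hardware, accumulation can start as soon as each x[j] arrives.
import numpy as np

def mv_column_wise(W, x):
    y = np.zeros(W.shape[0])
    for j in range(W.shape[1]):
        y += W[:, j] * x[j]                 # one column's contribution
    return y

rng = np.random.default_rng(3)
W, x = rng.random((8, 8)), rng.random(8)
print(np.allclose(mv_column_wise(W, x), W @ x))   # True
```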

    Efficiently Accelerating Sparse Problems by Enabling Stream Accesses to Memory Using Hardware/Software Techniques

    The objective of this research is to improve the performance of sparse problems, which have a wide range of applications but still suffer from serious challenges when running on modern computers. In summary, these challenges include the underutilization of available memory bandwidth, caused by a lack of spatial locality, dependencies in the computation, or slow mechanisms for decompressing the sparse data, and the underutilization of concurrent compute engines, caused by the distribution of non-zero values in the sparse data. Our key insight for addressing these challenges is that, depending on the type of problem, we either use an intelligent reduction tree near memory to process data while gathering them from random locations in memory, transform the computations mathematically to extract more parallelism, modify the distribution of non-zero elements, or change the representation of the sparse data. By applying such techniques, the execution adapts more effectively to the given hardware resources. To this end, this research introduces hardware/software techniques that enable stream accesses to memory for accelerating four main categories of sparse problems: inference in recommendation systems, iterative solvers of partial differential equations (PDEs), deep neural networks (DNNs), and graph algorithms.
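    The gather-plus-reduce pattern behind the near-memory reduction tree (for example, embedding lookups in recommendation inference) can be sketched in software as follows. The dissertation performs the reduction in hardware while rows stream from memory; the table and batch sizes below are example values.

```python
# Software sketch of gather-then-reduce: rows are fetched from random locations
# of a large table and summed. The thesis does this reduction near memory in
# hardware as the rows stream in; sizes here are illustrative only.
import numpy as np

rng = np.random.default_rng(4)
table = rng.random((1_000_000, 64))                    # large embedding table
indices = rng.integers(0, table.shape[0], size=128)    # random gather locations

pooled = table[indices].sum(axis=0)                    # gather + reduction (sum pooling)
print(pooled.shape)                                    # (64,)
```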