Fast Training of Sparse Graph Neural Networks on Dense Hardware
Graph neural networks have become increasingly popular in recent years due to
their ability to naturally encode relational input data and their ability to
scale to large graphs by operating on a sparse representation of graph
adjacency matrices. As we look to scale up these models using custom hardware,
a natural assumption would be that we need hardware tailored to sparse
operations and/or dynamic control flow. In this work, we question this
assumption by scaling up sparse graph neural networks using a platform targeted
at dense computation on fixed-size data. Drawing inspiration from optimization
of numerical algorithms on sparse matrices, we develop techniques that enable
training the sparse graph neural network model from Allamanis et al. [2018] in
13 minutes using a 512-core TPUv2 Pod, whereas the original training takes
almost a day.
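The abstract does not spell out the techniques, but one common way to run sparse graph propagation on hardware built for dense, fixed-size computation (not necessarily the approach used in this work) is to pad the graph to a static node count and express message passing as a dense adjacency matmul. A minimal NumPy sketch of that idea, with the padding size and feature dimension chosen purely for illustration:

```python
import numpy as np

def dense_message_pass(edges, node_feats, num_nodes_max):
    """One propagation step on dense, fixed-size buffers.

    edges:         (E, 2) int array of (source, target) pairs
    node_feats:    (N, D) float array, with N <= num_nodes_max
    num_nodes_max: static padding size chosen ahead of time
    """
    n, d = node_feats.shape
    # Pad node features to the fixed size the hardware expects.
    h = np.zeros((num_nodes_max, d), dtype=node_feats.dtype)
    h[:n] = node_feats

    # Materialize the padded adjacency as a dense matrix so the
    # propagation step becomes a single dense matmul.
    adj = np.zeros((num_nodes_max, num_nodes_max), dtype=node_feats.dtype)
    adj[edges[:, 1], edges[:, 0]] = 1.0  # messages flow source -> target

    return adj @ h  # aggregated neighbour messages, still fixed-size

# Toy usage: a 3-node path graph padded to 8 nodes.
edges = np.array([[0, 1], [1, 2]])
feats = np.random.randn(3, 4).astype(np.float32)
out = dense_message_pass(edges, feats, num_nodes_max=8)
```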
Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications
The application of deep learning techniques has resulted in remarkable improvements in machine learning models. This paper provides detailed
characterizations of deep learning models used in many Facebook social network
services. We present computational characteristics of our models, describe high
performance optimizations targeting existing systems, point out their
limitations and make suggestions for the future general-purpose/accelerated
inference hardware. Also, we highlight the need for better co-design of
algorithms, numerics and computing platforms to address the challenges of
workloads often run in data centers.
To prune, or not to prune: exploring the efficacy of pruning for model compression
Model pruning seeks to induce sparsity in a deep neural network's various
connection matrices, thereby reducing the number of nonzero-valued parameters
in the model. Recent reports (Han et al., 2015; Narang et al., 2017) prune deep
networks at the cost of only a marginal loss in accuracy and achieve a sizable
reduction in model size. This hints at the possibility that the baseline models
in these experiments are perhaps severely over-parameterized at the outset and
a viable alternative for model compression might be to simply reduce the number
of hidden units while maintaining the model's dense connection structure,
exposing a similar trade-off in model size and accuracy. We investigate these
two distinct paths for model compression within the context of energy-efficient
inference in resource-constrained environments and propose a new gradual
pruning technique that is simple and straightforward to apply across a variety
of models/datasets with minimal tuning and can be seamlessly incorporated
within the training process. We compare the accuracy of large, but pruned
models (large-sparse) and their smaller, but dense (small-dense) counterparts
with identical memory footprint. Across a broad range of neural network
architectures (deep CNNs, stacked LSTM, and seq2seq LSTM models), we find
large-sparse models to consistently outperform small-dense models and achieve
up to 10x reduction in number of non-zero parameters with minimal loss in
accuracy.
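As a rough illustration of the gradual pruning idea, below is a sketch of magnitude pruning driven by a polynomial sparsity ramp; the step counts, target sparsity, and layer shape are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def target_sparsity(step, s_init=0.0, s_final=0.9, begin_step=0, end_step=10000):
    """Cubic ramp of the sparsity target from s_init to s_final."""
    t = np.clip((step - begin_step) / (end_step - begin_step), 0.0, 1.0)
    return s_final + (s_init - s_final) * (1.0 - t) ** 3

def magnitude_mask(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(sparsity * weights.size)
    if k == 0:
        return np.ones_like(weights)
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

# During training the mask would be recomputed every few hundred steps and
# applied after each weight update so pruned connections stay at zero.
w = np.random.randn(256, 256).astype(np.float32)
mask = magnitude_mask(w, target_sparsity(step=5000))
w_pruned = w * mask
```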
C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs
Recently, significant accuracy improvement has been achieved for acoustic
recognition systems by increasing the model size of Long Short-Term Memory
(LSTM) networks. Unfortunately, the ever-increasing size of LSTM models leads to inefficient designs on FPGAs due to the limited on-chip resources. Previous work proposes a pruning-based compression technique to reduce the model size and thus speed up inference on FPGAs. However, the random nature of
the pruning technique transforms the dense matrices of the model to highly
unstructured sparse ones, which leads to unbalanced computation and irregular
memory accesses and thus hurts the overall performance and energy efficiency.
In contrast, we propose to use a structured compression technique which could
not only reduce the LSTM model size but also eliminate the irregularities of
computation and memory accesses. This approach employs block-circulant instead
of sparse matrices to compress the weight matrices and reduces the storage requirement from O(k^2) to O(k). The Fast Fourier Transform algorithm is utilized to further accelerate the inference by reducing the computational complexity from O(k^2) to O(k log k). The datapath and activation functions are quantized to 16 bits to improve the resource utilization. More importantly, we
propose a comprehensive framework called C-LSTM to automatically optimize and
implement a wide range of LSTM variants on FPGAs. According to the experimental
results, C-LSTM achieves up to 18.8X and 33.5X gains for performance and energy
efficiency compared with the state-of-the-art LSTM implementation under the
same experimental setup, and the accuracy degradation is very small.
Comment: Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
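The complexity reduction rests on a standard fact about circulant matrices: multiplying by a circulant block is a circular convolution, which the FFT turns into an elementwise product. A minimal NumPy sketch for a single k x k circulant block (this illustrates the underlying math only, not the C-LSTM framework or its FPGA datapath):

```python
import numpy as np

def circulant_matvec_fft(c, x):
    """Multiply the circulant matrix defined by first column c with vector x.

    A direct matvec costs O(k^2); via the FFT it costs O(k log k), and only
    the k-element vector c needs to be stored (O(k) instead of O(k^2)).
    """
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

# Check against the explicit dense circulant matrix.
k = 8
c = np.random.randn(k)
x = np.random.randn(k)
dense = np.array([np.roll(c, j) for j in range(k)]).T  # column j is c shifted by j
assert np.allclose(dense @ x, circulant_matvec_fft(c, x))
```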
FutureMapping: The Computational Structure of Spatial AI Systems
We discuss and predict the evolution of Simultaneous Localisation and Mapping
(SLAM) into a general geometric and semantic `Spatial AI' perception capability
for intelligent embodied devices. A big gap remains between the visual
perception performance that devices such as augmented reality eyewear or
consumer robots will require and what is possible within the constraints
imposed by real products. Co-design of algorithms, processors and sensors will
be needed. We explore the computational structure of current and future Spatial
AI algorithms and consider this within the landscape of ongoing hardware
developments.
Batched Sparse Matrix Multiplication for Accelerating Graph Convolutional Networks
Graph Convolutional Networks (GCNs) have recently attracted much attention in
bioinformatics and chemoinformatics as a state-of-the-art machine learning
approach with high accuracy. GCNs perform convolutional operations over graph structures, and GPUs are used to process the enormous number of operations involved, including sparse-dense matrix multiplication (SpMM), when the graph structure is expressed as an adjacency matrix in a sparse matrix format. However, the SpMM operation on small graphs, where the number of nodes is in the tens or hundreds, hardly exploits the high parallelism or compute power of a GPU. Therefore, SpMM becomes a bottleneck of training and inference in GCN applications. In order to improve the performance of GCN applications, we propose a new SpMM algorithm tailored to small sparse matrices, together with Batched SpMM, which exploits the high parallelism of the GPU by processing multiple SpMM operations with a single CUDA kernel. To the best of our
knowledge, this is the first work on a batched approach for SpMM. We evaluated the performance of the GCN application on TSUBAME3.0, which is equipped with NVIDIA Tesla P100 GPUs, and our batched approach shows significant speedups of up to 1.59x and 1.37x in training and inference, respectively.
Comment: 10 pages, 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
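As a rough software-level illustration of the batching idea (not the paper's CUDA kernel), many small sparse-times-dense products can be fused into a single larger operation, for example by assembling the small adjacency matrices into one block-diagonal sparse matrix so that a single SpMM call covers the whole batch. A SciPy sketch with graph sizes chosen purely for illustration:

```python
import numpy as np
import scipy.sparse as sp

def batched_spmm(adjs, feats):
    """Fuse B small products A_b @ X_b into one SpMM call.

    adjs:  list of B sparse (n_b x n_b) adjacency matrices
    feats: list of B dense (n_b x d) node-feature matrices
    """
    big_adj = sp.block_diag(adjs, format="csr")  # one block-diagonal sparse matrix
    big_x = np.vstack(feats)                     # stacked node features
    big_out = big_adj @ big_x                    # a single SpMM for the whole batch
    sizes = [a.shape[0] for a in adjs]
    return np.split(big_out, np.cumsum(sizes)[:-1])  # per-graph outputs

# Toy batch of three small random graphs with 4-dimensional node features.
rng = np.random.default_rng(0)
adjs = [sp.random(n, n, density=0.3, random_state=rng, format="csr") for n in (10, 20, 15)]
feats = [rng.standard_normal((n, 4)) for n in (10, 20, 15)]
outs = batched_spmm(adjs, feats)
```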
Harnessing Intrinsic Noise in Memristor Hopfield Neural Networks for Combinatorial Optimization
We describe a hybrid analog-digital computing approach to solve important
combinatorial optimization problems that leverages memristors (two-terminal
nonvolatile memories). While previous memristor accelerators have had to
minimize analog noise effects, we show that our optimization solver harnesses
such noise as a computing resource. Here we describe a memristor-Hopfield
Neural Network (mem-HNN) with massively parallel operations performed in a
dense crossbar array. We provide experimental demonstrations solving NP-hard
max-cut problems directly in analog crossbar arrays, and supplement this with
experimentally-grounded simulations to explore scalability with problem size,
providing the success probabilities, time and energy to solution, and
interactions with intrinsic analog noise. Compared to fully digital approaches,
and present-day quantum and optical accelerators, we forecast the mem-HNN to
have over four orders of magnitude higher solution throughput per power
consumption. This suggests substantially improved performance and scalability
compared to current quantum annealing approaches, while operating at room
temperature and taking advantage of existing CMOS technology augmented with
emerging analog non-volatile memristors.
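To make the Hopfield formulation concrete: for a graph with symmetric adjacency matrix W, maximising the cut is equivalent to minimising s^T W s over spin states s in {-1, +1}^n, so the Hopfield couplings are J = -W, and injected noise plays the role that the crossbar's intrinsic analog noise plays in hardware. A software-only sketch (the noise scale and iteration budget are illustrative assumptions, not the mem-HNN's operating parameters):

```python
import numpy as np

def cut_value(W, s):
    """Weighted number of edges crossing the partition defined by spins s."""
    return 0.25 * float(np.sum(W) - s @ W @ s)

def noisy_hopfield_maxcut(W, steps=2000, noise_scale=0.5, seed=0):
    """Approximate max-cut with an asynchronous, noise-injected Hopfield update."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    s = rng.choice([-1.0, 1.0], size=n)
    best_s, best_cut = s.copy(), cut_value(W, s)
    for _ in range(steps):
        i = rng.integers(n)                                      # pick one neuron
        field = -W[i] @ s + noise_scale * rng.standard_normal()  # noisy local field
        s[i] = 1.0 if field >= 0 else -1.0
        c = cut_value(W, s)
        if c > best_cut:
            best_s, best_cut = s.copy(), c
    return best_s, best_cut

# Toy example: a 5-cycle, whose optimal cut has size 4.
W = np.zeros((5, 5))
for i in range(5):
    W[i, (i + 1) % 5] = W[(i + 1) % 5, i] = 1.0
print(noisy_hopfield_maxcut(W))
```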
DimmWitted: A Study of Main-Memory Statistical Analytics
We perform the first study of the tradeoff space of access methods and
replication to support statistical analytics using first-order methods executed
in the main memory of a Non-Uniform Memory Access (NUMA) machine. Statistical
analytics systems differ from conventional SQL-analytics in the amount and
types of memory incoherence they can tolerate. Our goal is to understand
tradeoffs in accessing the data in row- or column-order and at what granularity
one should share the model and data for a statistical task. We study this new
tradeoff space, and discover there are tradeoffs between hardware and
statistical efficiency. We argue that our tradeoff study may provide valuable
information for designers of analytics engines: for each system we consider,
our prototype engine can run at least one popular task at least 100x faster. We
conduct our study across five architectures using popular models including
SVMs, logistic regression, Gibbs sampling, and neural networks.
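To give a feel for the access-method axis the study varies (a rough illustration, not the DimmWitted engine): a row-order method such as SGD touches one example, i.e. one row of the data matrix, per update, while a column-order method such as stochastic coordinate descent touches one feature, i.e. one column, per update; which pattern is faster depends on how the data is laid out in memory. A NumPy sketch for least squares, with the step size and epoch count as illustrative assumptions:

```python
import numpy as np

def sgd_row_access(X, y, epochs=5, lr=0.01, seed=0):
    """Row-order access: each update reads one example (one row of X)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(X.shape[0]):
            grad = (X[i] @ w - y[i]) * X[i]  # touches row i only
            w -= lr * grad
    return w

def scd_column_access(X, y, epochs=5, seed=0):
    """Column-order access: each update reads one feature (one column of X)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    residual = X @ w - y
    col_norms = (X ** 2).sum(axis=0)
    for _ in range(epochs):
        for j in rng.permutation(X.shape[1]):
            delta = -(X[:, j] @ residual) / col_norms[j]  # touches column j only
            w[j] += delta
            residual += delta * X[:, j]
    return w

X = np.random.default_rng(1).standard_normal((1000, 20))
y = X @ np.ones(20)
w_row = sgd_row_access(X, y)     # favoured by row-major data layouts
w_col = scd_column_access(X, y)  # favoured by column-major data layouts
```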
Inferring Mesoscale Models of Neural Computation
Recent years have seen dramatic progress in the development of techniques for
measuring the activity and connectivity of large populations of neurons in the
brain. However, as these techniques grow ever more powerful---allowing us to
even contemplate measuring every neuron in an entire brain---a new problem arises:
how do we make sense of the mountains of data that these techniques produce?
Here, we argue that the time is ripe for building an intermediate or
"mesoscale" computational theory that can bridge between single-cell
(microscale) accounts of neural function and behavioral (macroscale) accounts
of animal cognition and environmental complexity. Just as digital accounts of
computation in conventional computers abstract away the non-essential dynamics
of the analog circuits that implement gates and registers, so too a
computational account of animal cognition can afford to abstract from the
non-essential dynamics of neurons. We argue that the geometry of neural
circuits is essential in explaining the computational limitations and
technological innovations inherent in biological information processing. We
propose a blueprint for how to employ tools from modern machine learning to
automatically infer a satisfying mesoscale account of neural computation that
combines functional and structural data, with an emphasis on learning and
exploiting regularity and repeating motifs in neuronal circuits. Rather than
suggest a specific theory, we present a new class of scientific instruments
that can enable neuroscientists to design, propose, implement and test
mesoscale theories of neural computation.
CRDN: Cascaded Residual Dense Networks for Dynamic MR Imaging with Edge-enhanced Loss Constraint
Dynamic magnetic resonance (MR) imaging has generated great research
interest, as it can provide both spatial and temporal information for clinical
diagnosis. However, slow imaging speed or long scanning time is still one of
the challenges for dynamic MR imaging. Most existing methods reconstruct
dynamic MR images from incomplete k-space data under the guidance of compressed
sensing (CS) or low rank theory, which suffer from long iterative
reconstruction time. Recently, deep learning has shown great potential in
accelerating dynamic MR. Our previous work proposed a dynamic MR imaging method
with both k-space and spatial prior knowledge integrated via multi-supervised
network training. Nevertheless, the reconstructed images still exhibited a certain degree of over-smoothing at high acceleration factors. In this work, we propose cascaded residual dense networks for dynamic MR imaging with an edge-enhanced loss constraint, dubbed CRDN. Specifically, the cascaded residual dense networks
fully exploit the hierarchical features from all the convolutional layers with
both local and global feature fusion. We further utilize the total variation
(TV) loss function, which has edge enhancement properties, for training the networks.
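As a rough sketch of the edge-related term, an anisotropic total variation penalty sums the absolute differences between neighbouring pixels of the reconstruction and is added to the data-fidelity losses during training; the weighting below is an illustrative assumption, not a value from the paper.

```python
import numpy as np

def tv_loss(img, weight=1e-4):
    """Anisotropic total variation of a 2D image of shape (H, W)."""
    dh = np.abs(np.diff(img, axis=0)).sum()  # vertical neighbour differences
    dw = np.abs(np.diff(img, axis=1)).sum()  # horizontal neighbour differences
    return weight * (dh + dw)

# Toy usage: add the TV term to a data-fidelity (e.g. mean-squared-error) loss.
recon = np.random.rand(128, 128)
target = np.random.rand(128, 128)
loss = np.mean((recon - target) ** 2) + tv_loss(recon)
```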