Fast Training of Sparse Graph Neural Networks on Dense Hardware
Graph neural networks have become increasingly popular in recent years due to
their ability to naturally encode relational input data and their ability to
scale to large graphs by operating on a sparse representation of graph
adjacency matrices. As we look to scale up these models using custom hardware,
a natural assumption would be that we need hardware tailored to sparse
operations and/or dynamic control flow. In this work, we question this
assumption by scaling up sparse graph neural networks using a platform targeted
at dense computation on fixed-size data. Drawing inspiration from optimization
of numerical algorithms on sparse matrices, we develop techniques that enable
training the sparse graph neural network model from Allamanis et al. [2018] in
13 minutes using a 512-core TPUv2 Pod, whereas the original training takes
almost a day.
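The abstract does not spell out the techniques, but one common way to run sparse graph propagation on hardware built for dense, fixed-size computation (not necessarily the approach used in this work) is to pad the graph to a static node count and express message passing as a dense adjacency matmul. A minimal NumPy sketch of that idea, with the padding size and feature dimension chosen purely for illustration:

```python
import numpy as np

def dense_message_pass(edges, node_feats, num_nodes_max):
    """One propagation step on dense, fixed-size buffers.

    edges:         (E, 2) int array of (source, target) pairs
    node_feats:    (N, D) float array, with N <= num_nodes_max
    num_nodes_max: static padding size chosen ahead of time
    """
    n, d = node_feats.shape
    # Pad node features to the fixed size the hardware expects.
    h = np.zeros((num_nodes_max, d), dtype=node_feats.dtype)
    h[:n] = node_feats

    # Materialize the padded adjacency as a dense matrix so the
    # propagation step becomes a single dense matmul.
    adj = np.zeros((num_nodes_max, num_nodes_max), dtype=node_feats.dtype)
    adj[edges[:, 1], edges[:, 0]] = 1.0  # messages flow source -> target

    return adj @ h  # aggregated neighbour messages, still fixed-size

# Toy usage: a 3-node path graph padded to 8 nodes.
edges = np.array([[0, 1], [1, 2]])
feats = np.random.randn(3, 4).astype(np.float32)
out = dense_message_pass(edges, feats, num_nodes_max=8)
```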
Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications
The application of deep learning techniques has resulted in remarkable improvements in machine learning models. This paper provides detailed
characterizations of deep learning models used in many Facebook social network
services. We present computational characteristics of our models, describe high
performance optimizations targeting existing systems, point out their
limitations and make suggestions for the future general-purpose/accelerated
inference hardware. Also, we highlight the need for better co-design of
algorithms, numerics and computing platforms to address the challenges of
workloads often run in data centers.
To prune, or not to prune: exploring the efficacy of pruning for model compression
Model pruning seeks to induce sparsity in a deep neural network's various
connection matrices, thereby reducing the number of nonzero-valued parameters
in the model. Recent reports (Han et al., 2015; Narang et al., 2017) prune deep
networks at the cost of only a marginal loss in accuracy and achieve a sizable
reduction in model size. This hints at the possibility that the baseline models
in these experiments are perhaps severely over-parameterized at the outset and
a viable alternative for model compression might be to simply reduce the number
of hidden units while maintaining the model's dense connection structure,
exposing a similar trade-off in model size and accuracy. We investigate these
two distinct paths for model compression within the context of energy-efficient
inference in resource-constrained environments and propose a new gradual
pruning technique that is simple and straightforward to apply across a variety
of models/datasets with minimal tuning and can be seamlessly incorporated
within the training process. We compare the accuracy of large, but pruned
models (large-sparse) and their smaller, but dense (small-dense) counterparts
with identical memory footprint. Across a broad range of neural network
architectures (deep CNNs, stacked LSTM, and seq2seq LSTM models), we find
large-sparse models to consistently outperform small-dense models and achieve
up to 10x reduction in number of non-zero parameters with minimal loss in
accuracy.
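As a rough illustration of the gradual pruning idea, below is a sketch of magnitude pruning driven by a polynomial sparsity ramp; the step counts, target sparsity, and layer shape are illustrative assumptions rather than values from the paper.

```python
import numpy as np

def target_sparsity(step, s_init=0.0, s_final=0.9, begin_step=0, end_step=10000):
    """Cubic ramp of the sparsity target from s_init to s_final."""
    t = np.clip((step - begin_step) / (end_step - begin_step), 0.0, 1.0)
    return s_final + (s_init - s_final) * (1.0 - t) ** 3

def magnitude_mask(weights, sparsity):
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(sparsity * weights.size)
    if k == 0:
        return np.ones_like(weights)
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return (np.abs(weights) > threshold).astype(weights.dtype)

# During training the mask would be recomputed every few hundred steps and
# applied after each weight update so pruned connections stay at zero.
w = np.random.randn(256, 256).astype(np.float32)
mask = magnitude_mask(w, target_sparsity(step=5000))
w_pruned = w * mask
```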
C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs
Recently, significant accuracy improvement has been achieved for acoustic
recognition systems by increasing the model size of Long Short-Term Memory
(LSTM) networks. Unfortunately, the ever-increasing size of LSTM models leads to inefficient designs on FPGAs due to the limited on-chip resources. Previous work proposes a pruning-based compression technique to reduce the model size and thus speed up inference on FPGAs. However, the random nature of
the pruning technique transforms the dense matrices of the model to highly
unstructured sparse ones, which leads to unbalanced computation and irregular
memory accesses and thus hurts the overall performance and energy efficiency.
In contrast, we propose to use a structured compression technique which could
not only reduce the LSTM model size but also eliminate the irregularities of
computation and memory accesses. This approach employs block-circulant instead
of sparse matrices to compress the weight matrices and reduces the storage requirement from O(k^2) to O(k). The Fast Fourier Transform algorithm is utilized to further accelerate the inference by reducing the computational complexity from O(k^2) to O(k log k). The datapath and activation functions are quantized to 16 bits to improve the resource utilization. More importantly, we
propose a comprehensive framework called C-LSTM to automatically optimize and
implement a wide range of LSTM variants on FPGAs. According to the experimental
results, C-LSTM achieves up to 18.8X and 33.5X gains for performance and energy
efficiency compared with the state-of-the-art LSTM implementation under the
same experimental setup, and the accuracy degradation is very small.
Comment: Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays
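The complexity reduction rests on a standard fact about circulant matrices: multiplying by a circulant block is a circular convolution, which the FFT turns into an elementwise product. A minimal NumPy sketch for a single k x k circulant block (this illustrates the underlying math only, not the C-LSTM framework or its FPGA datapath):

```python
import numpy as np

def circulant_matvec_fft(c, x):
    """Multiply the circulant matrix defined by first column c with vector x.

    A direct matvec costs O(k^2); via the FFT it costs O(k log k), and only
    the k-element vector c needs to be stored (O(k) instead of O(k^2)).
    """
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

# Check against the explicit dense circulant matrix.
k = 8
c = np.random.randn(k)
x = np.random.randn(k)
dense = np.array([np.roll(c, j) for j in range(k)]).T  # column j is c shifted by j
assert np.allclose(dense @ x, circulant_matvec_fft(c, x))
```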
FutureMapping: The Computational Structure of Spatial AI Systems
We discuss and predict the evolution of Simultaneous Localisation and Mapping
(SLAM) into a general geometric and semantic `Spatial AI' perception capability
for intelligent embodied devices. A big gap remains between the visual
perception performance that devices such as augmented reality eyewear or
consumer robots will require and what is possible within the constraints
imposed by real products. Co-design of algorithms, processors and sensors will
be needed. We explore the computational structure of current and future Spatial
AI algorithms and consider this within the landscape of ongoing hardware
developments.
Batched Sparse Matrix Multiplication for Accelerating Graph Convolutional Networks
Graph Convolutional Networks (GCNs) have recently attracted much attention in
bioinformatics and chemoinformatics as a state-of-the-art machine learning
approach with high accuracy. GCNs perform convolutional operations over graph structures, and GPUs are used to process the enormous number of operations involved, including sparse-dense matrix multiplication (SpMM), when the graph structure is expressed as an adjacency matrix in a sparse matrix format. However, the SpMM operation on small graphs, where the number of nodes is in the tens or hundreds, hardly exploits the high parallelism or compute power of a GPU. Therefore, SpMM becomes a bottleneck of training and inference in GCN applications. In order to improve the performance of GCN applications, we propose a new SpMM algorithm tailored to small sparse matrices, together with Batched SpMM, which exploits the high parallelism of the GPU by processing multiple SpMM operations with a single CUDA kernel. To the best of our
knowledge, this is the first work on a batched approach for SpMM. We evaluated the performance of the GCN application on TSUBAME3.0, which is equipped with NVIDIA Tesla P100 GPUs, and our batched approach shows significant speedups of up to 1.59x and 1.37x in training and inference, respectively.
Comment: 10 pages, 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
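As a rough software-level illustration of the batching idea (not the paper's CUDA kernel), many small sparse-times-dense products can be fused into a single larger operation, for example by assembling the small adjacency matrices into one block-diagonal sparse matrix so that a single SpMM call covers the whole batch. A SciPy sketch with graph sizes chosen purely for illustration:

```python
import numpy as np
import scipy.sparse as sp

def batched_spmm(adjs, feats):
    """Fuse B small products A_b @ X_b into one SpMM call.

    adjs:  list of B sparse (n_b x n_b) adjacency matrices
    feats: list of B dense (n_b x d) node-feature matrices
    """
    big_adj = sp.block_diag(adjs, format="csr")  # one block-diagonal sparse matrix
    big_x = np.vstack(feats)                     # stacked node features
    big_out = big_adj @ big_x                    # a single SpMM for the whole batch
    sizes = [a.shape[0] for a in adjs]
    return np.split(big_out, np.cumsum(sizes)[:-1])  # per-graph outputs

# Toy batch of three small random graphs with 4-dimensional node features.
rng = np.random.default_rng(0)
adjs = [sp.random(n, n, density=0.3, random_state=rng, format="csr") for n in (10, 20, 15)]
feats = [rng.standard_normal((n, 4)) for n in (10, 20, 15)]
outs = batched_spmm(adjs, feats)
```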
Harnessing Intrinsic Noise in Memristor Hopfield Neural Networks for Combinatorial Optimization
We describe a hybrid analog-digital computing approach to solve important
combinatorial optimization problems that leverages memristors (two-terminal
nonvolatile memories). While previous memristor accelerators have had to
minimize analog noise effects, we show that our optimization solver harnesses
such noise as a computing resource. Here we describe a memristor-Hopfield
Neural Network (mem-HNN) with massively parallel operations performed in a
dense crossbar array. We provide experimental demonstrations solving NP-hard
max-cut problems directly in analog crossbar arrays, and supplement this with
experimentally-grounded simulations to explore scalability with problem size,
providing the success probabilities, time and energy to solution, and
interactions with intrinsic analog noise. Compared to fully digital approaches,
and present-day quantum and optical accelerators, we forecast the mem-HNN to
have over four orders of magnitude higher solution throughput per power
consumption. This suggests substantially improved performance and scalability
compared to current quantum annealing approaches, while operating at room
temperature and taking advantage of existing CMOS technology augmented with
emerging analog non-volatile memristors.
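To make the Hopfield formulation concrete: for a graph with symmetric adjacency matrix W, maximising the cut is equivalent to minimising s^T W s over spin states s in {-1, +1}^n, so the Hopfield couplings are J = -W, and injected noise plays the role that the crossbar's intrinsic analog noise plays in hardware. A software-only sketch (the noise scale and iteration budget are illustrative assumptions, not the mem-HNN's operating parameters):

```python
import numpy as np

def cut_value(W, s):
    """Weighted number of edges crossing the partition defined by spins s."""
    return 0.25 * float(np.sum(W) - s @ W @ s)

def noisy_hopfield_maxcut(W, steps=2000, noise_scale=0.5, seed=0):
    """Approximate max-cut with an asynchronous, noise-injected Hopfield update."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    s = rng.choice([-1.0, 1.0], size=n)
    best_s, best_cut = s.copy(), cut_value(W, s)
    for _ in range(steps):
        i = rng.integers(n)                                      # pick one neuron
        field = -W[i] @ s + noise_scale * rng.standard_normal()  # noisy local field
        s[i] = 1.0 if field >= 0 else -1.0
        c = cut_value(W, s)
        if c > best_cut:
            best_s, best_cut = s.copy(), c
    return best_s, best_cut

# Toy example: a 5-cycle, whose optimal cut has size 4.
W = np.zeros((5, 5))
for i in range(5):
    W[i, (i + 1) % 5] = W[(i + 1) % 5, i] = 1.0
print(noisy_hopfield_maxcut(W))
```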
DimmWitted: A Study of Main-Memory Statistical Analytics
We perform the first study of the tradeoff space of access methods and
replication to support statistical analytics using first-order methods executed
in the main memory of a Non-Uniform Memory Access (NUMA) machine. Statistical
analytics systems differ from conventional SQL-analytics in the amount and
types of memory incoherence they can tolerate. Our goal is to understand
tradeoffs in accessing the data in row- or column-order and at what granularity
one should share the model and data for a statistical task. We study this new
tradeoff space, and discover there are tradeoffs between hardware and
statistical efficiency. We argue that our tradeoff study may provide valuable
information for designers of analytics engines: for each system we consider,
our prototype engine can run at least one popular task at least 100x faster. We
conduct our study across five architectures using popular models including
SVMs, logistic regression, Gibbs sampling, and neural networks.
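To give a feel for the access-method axis the study varies (a rough illustration, not the DimmWitted engine): a row-order method such as SGD touches one example, i.e. one row of the data matrix, per update, while a column-order method such as stochastic coordinate descent touches one feature, i.e. one column, per update; which pattern is faster depends on how the data is laid out in memory. A NumPy sketch for least squares, with the step size and epoch count as illustrative assumptions:

```python
import numpy as np

def sgd_row_access(X, y, epochs=5, lr=0.01, seed=0):
    """Row-order access: each update reads one example (one row of X)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(X.shape[0]):
            grad = (X[i] @ w - y[i]) * X[i]  # touches row i only
            w -= lr * grad
    return w

def scd_column_access(X, y, epochs=5, seed=0):
    """Column-order access: each update reads one feature (one column of X)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    residual = X @ w - y
    col_norms = (X ** 2).sum(axis=0)
    for _ in range(epochs):
        for j in rng.permutation(X.shape[1]):
            delta = -(X[:, j] @ residual) / col_norms[j]  # touches column j only
            w[j] += delta
            residual += delta * X[:, j]
    return w

X = np.random.default_rng(1).standard_normal((1000, 20))
y = X @ np.ones(20)
w_row = sgd_row_access(X, y)     # favoured by row-major data layouts
w_col = scd_column_access(X, y)  # favoured by column-major data layouts
```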
Inferring Mesoscale Models of Neural Computation
Recent years have seen dramatic progress in the development of techniques for
measuring the activity and connectivity of large populations of neurons in the
brain. However, as these techniques grow ever more powerful---allowing us to
even contemplate measuring every neuron in an entire brain---a new problem arises:
how do we make sense of the mountains of data that these techniques produce?
Here, we argue that the time is ripe for building an intermediate or
"mesoscale" computational theory that can bridge between single-cell
(microscale) accounts of neural function and behavioral (macroscale) accounts
of animal cognition and environmental complexity. Just as digital accounts of
computation in conventional computers abstract away the non-essential dynamics
of the analog circuits that implement gates and registers, so too a
computational account of animal cognition can afford to abstract from the
non-essential dynamics of neurons. We argue that the geometry of neural
circuits is essential in explaining the computational limitations and
technological innovations inherent in biological information processing. We
propose a blueprint for how to employ tools from modern machine learning to
automatically infer a satisfying mesoscale account of neural computation that
combines functional and structural data, with an emphasis on learning and
exploiting regularity and repeating motifs in neuronal circuits. Rather than
suggest a specific theory, we present a new class of scientific instruments
that can enable neuroscientists to design, propose, implement and test
mesoscale theories of neural computation.
CRDN: Cascaded Residual Dense Networks for Dynamic MR Imaging with Edge-enhanced Loss Constraint
Dynamic magnetic resonance (MR) imaging has generated great research
interest, as it can provide both spatial and temporal information for clinical
diagnosis. However, slow imaging speed or long scanning time is still one of
the challenges for dynamic MR imaging. Most existing methods reconstruct
dynamic MR images from incomplete k-space data under the guidance of compressed
sensing (CS) or low rank theory, which suffer from long iterative
reconstruction time. Recently, deep learning has shown great potential in
accelerating dynamic MR. Our previous work proposed a dynamic MR imaging method
with both k-space and spatial prior knowledge integrated via multi-supervised
network training. Nevertheless, the reconstructed images still exhibited a certain degree of over-smoothing at high acceleration factors. In this work, we propose cascaded residual dense networks for dynamic MR imaging with an edge-enhanced loss constraint, dubbed CRDN. Specifically, the cascaded residual dense networks
fully exploit the hierarchical features from all the convolutional layers with
both local and global feature fusion. We further utilize the total variation
(TV) loss function, which has edge enhancement properties, for training the networks.
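As a rough sketch of the edge-related term, an anisotropic total variation penalty sums the absolute differences between neighbouring pixels of the reconstruction and is added to the data-fidelity losses during training; the weighting below is an illustrative assumption, not a value from the paper.

```python
import numpy as np

def tv_loss(img, weight=1e-4):
    """Anisotropic total variation of a 2D image of shape (H, W)."""
    dh = np.abs(np.diff(img, axis=0)).sum()  # vertical neighbour differences
    dw = np.abs(np.diff(img, axis=1)).sum()  # horizontal neighbour differences
    return weight * (dh + dw)

# Toy usage: add the TV term to a data-fidelity (e.g. mean-squared-error) loss.
recon = np.random.rand(128, 128)
target = np.random.rand(128, 128)
loss = np.mean((recon - target) ** 2) + tv_loss(recon)
```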