Incremental Learning Using a Grow-and-Prune Paradigm with Efficient Neural Networks
Deep neural networks (DNNs) have become a widely deployed model for numerous
machine learning applications. However, their fixed architecture, substantial
training cost, and significant model redundancy make it difficult to
efficiently update them to accommodate previously unseen data. To solve these
problems, we propose an incremental learning framework based on a
grow-and-prune neural network synthesis paradigm. When new data arrive, the
neural network first grows new connections based on the gradients to increase
the network capacity to accommodate new data. Then, the framework iteratively
prunes away connections based on the magnitude of weights to enhance network
compactness, and hence recover efficiency. Finally, the model rests at a
lightweight DNN that is both ready for inference and suitable for future
grow-and-prune updates. The proposed framework improves accuracy, shrinks
network size, and significantly reduces the additional training cost for
incoming data compared to conventional approaches, such as training from
scratch and network fine-tuning. For the LeNet-300-100 and LeNet-5 neural
network architectures derived for the MNIST dataset, the framework reduces
training cost by up to 64% (63%) and 67% (63%) compared to training from
scratch (network fine-tuning), respectively. For the ResNet-18 architecture
derived for the ImageNet dataset and DeepSpeech2 for the AN4 dataset, the
corresponding training cost reductions against training from scratch (network
fine-tuning) are 64% (60%) and 67% (62%), respectively. Our derived models
contain fewer network parameters but achieve higher accuracy relative to
conventional baselines.
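The two mask updates at the heart of the paradigm can be sketched in a few lines. Below is a minimal NumPy illustration, assuming dense weight and gradient matrices, a 0/1 connection mask, and hypothetical grow/prune fractions; the paper's actual schedules and layer-wise details are not reproduced here.

```python
import numpy as np

def grow_connections(mask, grad, grow_frac=0.1):
    """Activate the dormant connections whose gradients have the largest
    magnitude (a sketch of gradient-based growth; names are illustrative)."""
    dormant = mask == 0
    n_grow = int(grow_frac * dormant.sum())
    if n_grow == 0:
        return mask
    scores = np.abs(grad) * dormant              # score only inactive weights
    idx = np.argsort(scores, axis=None)[-n_grow:]
    new_mask = mask.copy()
    new_mask.flat[idx] = 1
    return new_mask

def prune_connections(mask, weight, prune_frac=0.1):
    """Deactivate the active connections with the smallest weight
    magnitude (magnitude-based pruning)."""
    active = mask == 1
    n_prune = int(prune_frac * active.sum())
    if n_prune == 0:
        return mask
    scores = np.where(active, np.abs(weight), np.inf)  # ignore inactive ones
    idx = np.argsort(scores, axis=None)[:n_prune]
    new_mask = mask.copy()
    new_mask.flat[idx] = 0
    return new_mask
```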
SCANN: Synthesis of Compact and Accurate Neural Networks
Deep neural networks (DNNs) have become the driving force behind recent
artificial intelligence (AI) research. An important problem with implementing a
neural network is the design of its architecture. Typically, such an
architecture is obtained manually by exploring its hyperparameter space and
kept fixed during training. This approach is time-consuming and inefficient.
Another issue is that modern neural networks often contain millions of
parameters, whereas many applications and devices require small inference
models. However, efforts to migrate DNNs to such devices typically entail a
significant loss of classification accuracy. To address these challenges, we
propose a two-step neural network synthesis methodology, called DR+SCANN, that
combines two complementary approaches to design compact and accurate DNNs. At
the core of our framework is the SCANN methodology that uses three basic
architecture-changing operations, namely connection growth, neuron growth, and
connection pruning, to synthesize feed-forward architectures with arbitrary
structure. SCANN encapsulates three synthesis methodologies that apply a
repeated grow-and-prune paradigm to three architectural starting points.
DR+SCANN combines the SCANN methodology with dataset dimensionality reduction
to alleviate the curse of dimensionality. We demonstrate the efficacy of SCANN
and DR+SCANN on various image and non-image datasets. We evaluate SCANN on
MNIST and ImageNet benchmarks. In addition, we evaluate the efficacy of
using dimensionality reduction alongside SCANN (DR+SCANN) on nine small to
medium-size datasets. We also show that our synthesis methodology yields neural
networks that are much better at navigating the accuracy vs. energy efficiency
space. This would enable neural network-based inference even on
Internet-of-Things sensors.
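Of SCANN's three operations, connection growth and pruning resemble the mask updates sketched earlier; neuron growth can be illustrated for a single hidden layer as below. The weight shapes (hidden, inputs) and (outputs, hidden) and the small random initialization are assumptions of this sketch, not the paper's exact rule.

```python
import numpy as np

def grow_neuron(W_in, W_out, scale=0.01):
    """Add one hidden neuron: append a row of input weights and a column
    of output weights, both initialized near zero so the network function
    is barely perturbed (an illustrative initialization choice)."""
    new_row = scale * np.random.randn(1, W_in.shape[1])
    new_col = scale * np.random.randn(W_out.shape[0], 1)
    return np.vstack([W_in, new_row]), np.hstack([W_out, new_col])
```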
TransCODE: Co-design of Transformers and Accelerators for Efficient Training and Inference
Automated co-design of machine learning models and evaluation hardware is
critical for efficiently deploying such models at scale. Despite the
state-of-the-art performance of transformer models, they are not yet ready for
execution on resource-constrained hardware platforms. High memory requirements
and low parallelizability of the transformer architecture exacerbate this
problem. Recently proposed accelerators attempt to optimize the throughput and
energy consumption of transformer models. However, such works are either
limited to a one-sided search of the model architecture or a restricted set of
off-the-shelf devices. Furthermore, previous works only accelerate model
inference, not training, even though training requires substantially more
memory and compute resources, making the problem even more challenging. To address these
limitations, this work proposes a dynamic training framework, called DynaProp,
that speeds up the training process and reduces memory consumption. DynaProp is
a low-overhead pruning method that prunes activations and gradients at runtime.
To effectively execute this method on hardware for a diverse set of transformer
architectures, we propose ELECTOR, a framework that simulates transformer
inference and training on a design space of accelerators. We use this simulator
in conjunction with the proposed co-design technique, called TransCODE, to
obtain the best-performing models with high accuracy on the given task and
minimize latency, energy consumption, and chip area. The obtained
transformer-accelerator pair achieves 0.3% higher accuracy than the
state-of-the-art pair while incurring 5.2× lower latency and 3.0×
lower energy consumption.
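As a rough illustration of runtime activation/gradient pruning, the PyTorch sketch below zeroes all but the largest-magnitude entries of a tensor; the function name, the top-k criterion, and the hook-based usage are assumptions of this sketch rather than DynaProp's exact mechanism.

```python
import torch

def prune_small_entries(t, keep_frac=0.5):
    """Keep only the largest-magnitude keep_frac of entries in a tensor
    and zero out the rest (illustrative top-k criterion)."""
    k = max(1, int(keep_frac * t.numel()))
    # k-th largest magnitude == (numel - k + 1)-th smallest magnitude
    thresh = t.abs().flatten().kthvalue(t.numel() - k + 1).values
    return torch.where(t.abs() >= thresh, t, torch.zeros_like(t))

# Pruning gradients at runtime via a backward hook on an activation:
# x = some_layer(inputs)
# x.register_hook(lambda g: prune_small_entries(g, keep_frac=0.5))
```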
EdgeTran: Co-designing Transformers for Efficient Inference on Mobile Edge Platforms
Automated design of efficient transformer models has recently attracted
significant attention from industry and academia. However, most works only
focus on certain metrics while searching for the best-performing transformer
architecture. Furthermore, running traditional, complex, and large transformer
models on low-compute edge platforms is a challenging problem. In this work, we
propose a framework, called ProTran, to profile the hardware performance
measures for a design space of transformer architectures and a diverse set of
edge devices. We use this profiler in conjunction with the proposed co-design
technique to obtain the best-performing models that have high accuracy on the
given task and minimize latency, energy consumption, and peak power draw to
enable edge deployment. We refer to our framework for co-optimizing accuracy
and hardware performance measures as EdgeTran. It searches for the best
transformer model and edge device pair. Finally, we propose GPTran, a
multi-stage block-level grow-and-prune post-processing step that further
improves accuracy in a hardware-aware manner. The obtained transformer model is
2.8× smaller and has a 0.8% higher GLUE score than the baseline
(BERT-Base). Inference with it on the selected edge device enables 15.0% lower
latency, 10.0× lower energy, and 10.8× lower peak power draw
compared to an off-the-shelf GPU.
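Conceptually, the co-design step reduces to scoring profiled (model, device) pairs. A minimal sketch, assuming the profiler returns a dictionary of measured metrics and using a hypothetical weighted accuracy/latency objective (EdgeTran's actual objective also covers energy and peak power draw):

```python
def co_design(pairs, profile, acc_weight=1.0, lat_weight=0.1):
    """Exhaustively score profiled (model, device) pairs with a weighted
    accuracy/latency objective and return the best pair (hypothetical
    scoring; the real framework co-optimizes more hardware metrics)."""
    best_pair, best_score = None, float("-inf")
    for model, device in pairs:
        stats = profile(model, device)  # assumed: dict of measured metrics
        score = acc_weight * stats["accuracy"] - lat_weight * stats["latency"]
        if score > best_score:
            best_pair, best_score = (model, device), score
    return best_pair
```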
CTRL: Clustering Training Losses for Label Error Detection
In supervised machine learning, use of correct labels is extremely important
to ensure high accuracy. Unfortunately, most datasets contain corrupted labels.
Machine learning models trained on such datasets do not generalize well. Thus,
detecting label errors in the training data can significantly increase model
efficacy. We
propose a novel framework, called CTRL (Clustering TRaining Losses for label
error detection), to detect label errors in multi-class datasets. It detects
label errors in two steps based on the observation that models learn clean and
noisy labels in different ways. First, we train a neural network using the
noisy training dataset and obtain the loss curve for each sample. Then, we
apply clustering algorithms to the training losses to group samples into two
categories: cleanly-labeled and noisily-labeled. After label error detection,
we remove samples with noisy labels and retrain the model. Our experimental
results demonstrate state-of-the-art error detection accuracy on both image
(CIFAR-10 and CIFAR-100) and tabular datasets under simulated noise. We also
use a theoretical analysis to provide insights into why CTRL performs so well.
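A minimal sketch of the clustering step, assuming per-sample loss curves have already been collected during training and using scikit-learn's k-means (the paper's pipeline includes further refinements not shown here):

```python
import numpy as np
from sklearn.cluster import KMeans

def detect_label_errors(loss_curves):
    """Cluster per-sample training-loss curves into two groups and flag
    the higher-loss cluster as noisily labeled.

    loss_curves: array of shape (n_samples, n_epochs), one loss
    trajectory per training sample.
    """
    km = KMeans(n_clusters=2, n_init=10).fit(loss_curves)
    # Noisy labels are memorized late, so their curves stay higher longer.
    mean_loss = [loss_curves[km.labels_ == c].mean() for c in (0, 1)]
    noisy_cluster = int(np.argmax(mean_loss))
    return km.labels_ == noisy_cluster  # boolean mask of suspect samples
```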
SPRING: A Sparsity-Aware Reduced-Precision Monolithic 3D CNN Accelerator Architecture for Training and Inference
CNNs outperform traditional machine learning algorithms across a wide range
of applications. However, their computational complexity makes it necessary to
design efficient hardware accelerators. Most CNN accelerators focus on
exploring dataflow styles that exploit computational parallelism. However,
potential performance speedup from sparsity has not been adequately addressed.
The computation and memory footprint of CNNs can be significantly reduced if
sparsity is exploited in network evaluations. To take advantage of sparsity,
some accelerator designs explore sparsity encoding and evaluation on CNN
accelerators. However, sparsity encoding is typically applied only to
activations or weights, and only during inference. It has been shown that
activations and weights also
have high sparsity levels during training. Hence, sparsity-aware computation
should also be considered in training. To further improve performance and
energy efficiency, some accelerators evaluate CNNs with limited precision.
However, this is limited to inference, since reduced precision sacrifices
network accuracy if used naively in training. In addition, CNN evaluation is usually
memory-intensive, especially in training. In this paper, we propose SPRING, a
SParsity-aware Reduced-precision Monolithic 3D CNN accelerator for trainING and
inference. SPRING supports both CNN training and inference. It uses a binary
mask scheme to encode sparsity in activations and weights. It uses the
stochastic rounding algorithm to train CNNs with reduced precision without
accuracy loss. To alleviate the memory bottleneck in CNN evaluation, especially
in training, SPRING uses an efficient monolithic 3D NVM interface to increase
memory bandwidth. Compared to GTX 1080 Ti, SPRING achieves 15.6X, 4.2X and
66.0X improvements in performance, power reduction, and energy efficiency,
respectively, for CNN training, and 15.5X, 4.5X and 69.1X improvements for
inference.
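Stochastic rounding is the ingredient that lets SPRING train at reduced precision without accuracy loss. A minimal NumPy sketch of the standard algorithm, with a hypothetical quantization step size as its parameter:

```python
import numpy as np

def stochastic_round(x, step):
    """Round x to a multiple of `step`, rounding up with probability equal
    to the fractional remainder; the rounding is unbiased in expectation,
    which is why training tolerates the reduced precision."""
    scaled = np.asarray(x, dtype=np.float64) / step
    floor = np.floor(scaled)
    round_up = np.random.rand(*scaled.shape) < (scaled - floor)
    return step * (floor + round_up)
```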
Fast Design Space Exploration of Nonlinear Systems: Part II
Nonlinear system design is often a multi-objective optimization problem
involving search for a design that satisfies a number of predefined
constraints. The design space is typically very large since it includes all
possible system architectures with different combinations of components
composing each architecture. In this article, we address nonlinear system
design space exploration through a two-step approach encapsulated in a
framework called Fast Design Space Exploration of Nonlinear Systems (ASSENT).
In the first step, we use a genetic algorithm to search for system
architectures with discrete choices for component values, or for component
values alone when the architecture is fixed. This step yields a coarse design
since the system may or may not meet the target specifications. In the second
step, we use an inverse design to search over a continuous space and fine-tune
the component values with the goal of improving the value of the objective
function. We use a neural network to model the system response. The neural
network is converted into a mixed-integer linear program for active learning to
sample component values efficiently. We illustrate the efficacy of ASSENT on
problems ranging from nonlinear system design to design of electrical circuits.
Experimental results show that ASSENT achieves the same or better value of the
objective function compared to various other optimization techniques for
nonlinear system design by up to 54%. We improve sample efficiency by 6-10x
compared to reinforcement learning based synthesis of electrical circuits.
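The first, coarse search step can be illustrated with a generic genetic algorithm over binary-coded design choices. The operators below (elitist selection, uniform crossover, bit-flip mutation) are standard textbook choices, not necessarily ASSENT's exact ones, and `evaluate` stands in for the user's objective function:

```python
import numpy as np

def genetic_search(evaluate, n_genes, pop_size=50, n_gen=100, mut_rate=0.05):
    """Minimal genetic algorithm for a coarse design search: evolve
    binary-coded design choices toward a higher objective value
    (a generic GA sketch, not ASSENT's exact operators)."""
    pop = np.random.randint(0, 2, size=(pop_size, n_genes))
    for _ in range(n_gen):
        fitness = np.array([evaluate(ind) for ind in pop])
        # Elitist selection: keep the top half as parents.
        parents = pop[np.argsort(fitness)[::-1][: pop_size // 2]]
        # Uniform crossover between randomly chosen parent pairs.
        pairs = np.random.randint(0, len(parents), size=(pop_size, 2))
        keep = np.random.rand(pop_size, n_genes) < 0.5
        pop = np.where(keep, parents[pairs[:, 0]], parents[pairs[:, 1]])
        # Bit-flip mutation.
        flip = np.random.rand(pop_size, n_genes) < mut_rate
        pop = np.where(flip, 1 - pop, pop)
    fitness = np.array([evaluate(ind) for ind in pop])
    return pop[np.argmax(fitness)]
```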
Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers
Deployment of Transformer models on the edge is increasingly challenging due
to the exponentially growing model size and inference cost that scales
quadratically with the number of tokens in the input sequence. Token pruning is
an emerging solution to address this challenge due to its ease of deployment on
various Transformer backbones. However, most token pruning methods require a
computationally expensive fine-tuning process after or during pruning, which is
not desirable in many cases. Some recent works explore pruning of off-the-shelf
pre-trained Transformers without fine-tuning. However, they only take the
importance of tokens into consideration. In this work, we propose Zero-TPrune,
the first zero-shot method that considers both the importance and similarity of
tokens in performing token pruning. Zero-TPrune leverages the attention graph
of pre-trained Transformer models to produce an importance rank for tokens and
removes the less informative tokens. The attention matrix can be thought of as
an adjacency matrix of a directed graph, to which a graph shift operator can be
applied iteratively to obtain the importance score distribution. This
distribution guides the partition of tokens into two groups and measures
similarity between them. Due to the elimination of the fine-tuning overhead,
Zero-TPrune can easily prune large models and perform hyperparameter tuning
efficiently. We evaluate the performance of Zero-TPrune on vision tasks by
applying it to various vision Transformer backbones. Compared with
state-of-the-art pruning methods that require fine-tuning, Zero-TPrune not only
eliminates the need for fine-tuning after pruning, but does so with only around
0.3% accuracy loss. Compared with state-of-the-art fine-tuning-free pruning
methods, Zero-TPrune reduces accuracy loss by up to 45% on medium-sized models.
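The importance stage can be sketched as power iteration with the attention matrix acting as a graph shift operator. The sketch below assumes a single row-stochastic attention matrix (e.g., averaged over heads) and uniform initialization; Zero-TPrune's similarity-based stage is not shown:

```python
import torch

def token_importance(attn, n_iter=10):
    """Treat the (row-stochastic) attention matrix as the adjacency matrix
    of a directed graph and iteratively apply it as a graph shift operator
    to a uniform distribution to obtain token importance scores
    (a sketch of the importance stage only).

    attn: (n_tokens, n_tokens) attention matrix, e.g. averaged over heads
          (an assumption of this sketch).
    """
    n = attn.shape[0]
    s = torch.full((n,), 1.0 / n, dtype=attn.dtype)
    for _ in range(n_iter):
        s = s @ attn       # one graph-shift step
        s = s / s.sum()    # renormalize for numerical stability
    return s               # higher score = more informative token
```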