Co-Exploration of Neural Architectures and Heterogeneous ASIC Accelerator Designs Targeting Multiple Tasks
Neural Architecture Search (NAS) has demonstrated its power on various AI
accelerating platforms such as Field Programmable Gate Arrays (FPGAs) and
Graphics Processing Units (GPUs). However, how to integrate NAS with
Application-Specific Integrated Circuits (ASICs) remains an open problem,
even though ASICs are the most powerful AI accelerating platforms. The major bottleneck
comes from the large design freedom associated with ASIC designs. Moreover,
considering that multiple DNNs will run in parallel for different workloads
with diverse layer operations and sizes, integrating heterogeneous ASIC
sub-accelerators for distinct DNNs into one design can significantly boost
performance, but it also further complicates the design space. To address
these challenges, in this paper we build an ASIC template set based on
existing successful designs, each described by its unique dataflow, so that the
design space is significantly reduced. Based on the templates, we further
propose a framework, namely NASAIC, which can simultaneously identify multiple
DNN architectures and the associated heterogeneous ASIC accelerator design,
such that the design specifications (specs) are satisfied while accuracy is
maximized. Experimental results show that, compared with successive NAS and
ASIC design optimization, which leads to design-spec violations, NASAIC
guarantees that the results meet the design specs, with 17.77%, 2.49x, and
2.32x reductions in latency, energy, and area, respectively, and with only
0.76% accuracy loss. To the best of the authors' knowledge, this is the first
work on neural architecture and ASIC accelerator design co-exploration.
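To make the co-exploration concrete, below is a minimal, hypothetical Python sketch of a template-based co-search loop in the spirit of NASAIC: candidates pair DNN architectures with template-based heterogeneous sub-accelerators, and a reward favors accuracy only among candidates that satisfy the design specs. The template names, cost model, and spec numbers are illustrative assumptions, not the paper's actual dataflows or evaluator.

```python
# Hypothetical sketch of NASAIC-style co-exploration. All templates, cost
# numbers, and evaluation stubs are illustrative placeholders.
import random

# ASIC templates identified by their dataflow.
TEMPLATES = ["weight_stationary", "output_stationary", "row_stationary"]
SPECS = {"latency_ms": 10.0, "energy_mj": 5.0, "area_mm2": 20.0}

def sample_candidate():
    """Sample one DNN architecture per workload plus a heterogeneous
    accelerator: one template and resource allocation per sub-accelerator."""
    return {
        "dnn_depths": [random.choice([8, 14, 20]) for _ in range(2)],
        "sub_accels": [
            {"template": random.choice(TEMPLATES),
             "pes": random.choice([64, 128, 256])}
            for _ in range(2)
        ],
    }

def evaluate(cand):
    """Stub cost model: in the real framework these numbers would come from
    accuracy evaluation and a dataflow-aware hardware cost model."""
    acc = 0.90 + 0.001 * sum(cand["dnn_depths"]) - random.random() * 0.02
    lat = sum(2000.0 / s["pes"] for s in cand["sub_accels"])
    energy = sum(0.01 * s["pes"] for s in cand["sub_accels"])
    area = sum(0.05 * s["pes"] for s in cand["sub_accels"])
    return acc, {"latency_ms": lat, "energy_mj": energy, "area_mm2": area}

def reward(acc, costs):
    # Maximize accuracy, but only among candidates that satisfy the specs.
    if any(costs[k] > SPECS[k] for k in SPECS):
        return -1.0  # spec violation
    return acc

best = max((sample_candidate() for _ in range(500)),
           key=lambda c: reward(*evaluate(c)))
print(best)
```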
RHNAS: Realizable Hardware and Neural Architecture Search
The rapidly evolving field of Artificial Intelligence necessitates automated
approaches to co-design neural network architectures and neural accelerators to
maximize system efficiency and address productivity challenges. To enable joint
optimization of this vast space, there has been growing interest in
differentiable NN-HW co-design. Fully differentiable co-design has reduced the
resource requirements for discovering optimized NN-HW configurations, but fails
to adapt to general hardware accelerator search spaces. This is due to the
existence of non-synthesizable (invalid) designs in the search space of many
hardware accelerators. To enable efficient and realizable co-design of
configurable hardware accelerators with arbitrary neural network search spaces,
we introduce RHNAS, a method that combines reinforcement learning for
hardware optimization with differentiable neural architecture search. RHNAS
discovers realizable NN-HW designs with 1.84x lower latency and 1.86x lower
energy-delay product (EDP) on ImageNet and 2.81x lower latency and 3.30x lower
EDP on CIFAR-10 over the default hardware accelerator design.
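The following hypothetical Python sketch illustrates the core idea described above: an outer loop proposes hardware configurations (here a random stand-in for the RL policy), filters out non-synthesizable designs instead of failing on them, and scores valid designs by the EDP returned from a stubbed differentiable-NAS inner step. The search-space fields, validity rule, and cost numbers are invented for illustration.

```python
# Minimal, hypothetical sketch of the RHNAS idea: invalid designs are
# handled gracefully, valid ones are scored by energy-delay product (EDP).
import random

HW_SPACE = {"pe_rows": [4, 8, 16], "pe_cols": [4, 8, 16], "buf_kb": [32, 64, 128]}

def is_synthesizable(hw):
    # Placeholder validity rule; real accelerators have many such constraints.
    return hw["pe_rows"] * hw["pe_cols"] * hw["buf_kb"] <= 8192

def inner_nas_and_edp(hw):
    """Stand-in for the differentiable NAS inner loop: returns the EDP
    of the best architecture found on this hardware configuration."""
    latency = 1000.0 / (hw["pe_rows"] * hw["pe_cols"])
    energy = 0.002 * hw["buf_kb"] + 0.01 * hw["pe_rows"] * hw["pe_cols"]
    return latency * energy

def sample_hw():
    return {k: random.choice(v) for k, v in HW_SPACE.items()}

best_hw, best_reward = None, float("-inf")
for step in range(200):
    hw = sample_hw()                      # RL policy stub: random proposal
    if not is_synthesizable(hw):
        continue                          # penalized / skipped, not fatal
    r = -inner_nas_and_edp(hw)            # reward: lower EDP is better
    if r > best_reward:
        best_hw, best_reward = hw, r
print(best_hw, -best_reward)
```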
Standing on the Shoulders of Giants: Hardware and Neural Architecture Co-Search with Hot Start
Hardware and neural architecture co-search that automatically generates
Artificial Intelligence (AI) solutions from a given dataset is a promising way
to promote AI democratization; however, current co-search frameworks require
on the order of hundreds of GPU hours for a single target hardware platform,
which inhibits their use on commodity hardware. The root cause of the low
efficiency of existing co-search frameworks
is the fact that they start from a "cold" state (i.e., search from scratch). In
this paper, we propose a novel framework, namely HotNAS, that starts from a
"hot" state based on a set of existing pre-trained models (a.k.a. model zoo) to
avoid lengthy training time. As such, the search time can be reduced from 200
GPU hours to less than 3 GPU hours. In HotNAS, in addition to the hardware
design space and the neural architecture search space, we further integrate a
compression space to perform model compression during the co-search, which
creates new opportunities to reduce latency but also brings challenges. One of the key
challenges is that all of the above search spaces are coupled with each other,
e.g., compression may not work without hardware design support. To tackle this
issue, HotNAS builds a chain of tools to design hardware to support
compression, based on which a global optimizer is developed to automatically
co-search all the involved search spaces. Experiments on the ImageNet dataset
and a Xilinx FPGA show that, within a timing constraint of 5 ms, neural
architectures generated by HotNAS achieve up to 5.79% Top-1 and 3.97% Top-5
accuracy gains compared with existing ones.
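Below is a minimal, hypothetical sketch of the "hot start" idea, assuming a tiny model zoo, a compression space of pruning ratios and bit widths, and a stub latency model standing in for the hardware design tools: only candidates meeting the 5 ms budget are considered, and the accuracy penalty for compression is a crude stand-in for the fine-tuning HotNAS would perform. All zoo entries and numbers are illustrative.

```python
# Hypothetical HotNAS-style hot start: search over pre-trained models and
# a compression space instead of training architectures from scratch.
import itertools

MODEL_ZOO = [
    {"name": "zoo_net_a", "top1": 74.2, "gflops": 4.1},
    {"name": "zoo_net_b", "top1": 71.5, "gflops": 1.8},
]
PRUNE_RATIOS = [0.0, 0.3, 0.5]
BIT_WIDTHS = [16, 8]
LATENCY_BUDGET_MS = 5.0

def latency_ms(model, prune, bits):
    """Stub hardware model: compression only pays off if the (assumed)
    hardware supports it, reflecting the coupling the paper describes."""
    eff_gflops = model["gflops"] * (1.0 - prune) * (bits / 16.0)
    return eff_gflops * 2.0  # pretend 2 ms per effective GFLOP

def accuracy(model, prune, bits):
    # Crude penalty for compression; HotNAS would fine-tune instead.
    return model["top1"] - 4.0 * prune - (0.5 if bits == 8 else 0.0)

best = max(
    (c for c in itertools.product(MODEL_ZOO, PRUNE_RATIOS, BIT_WIDTHS)
     if latency_ms(*c) <= LATENCY_BUDGET_MS),
    key=lambda c: accuracy(*c),
    default=None,
)
print(best)
```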
Enabling On-Device CNN Training by Self-Supervised Instance Filtering and Error Map Pruning
This work aims to enable on-device training of convolutional neural networks
(CNNs) by reducing the computation cost at training time. CNN models are
usually trained on high-performance computers and only the trained models are
deployed to edge devices. But a statically trained model cannot adapt to a
dynamic real-world environment and may yield low accuracy on new inputs.
On-device training by learning from real-world data after
deployment can greatly improve accuracy. However, the high computation cost
makes training prohibitive for resource-constrained devices. To tackle this
problem, we explore the computational redundancies in training and reduce the
computation cost with two complementary approaches: self-supervised early
instance filtering at the data level and error map pruning at the algorithm level.
The early instance filter selects important instances from the input stream to
train the network and drops trivial ones. The error map pruning further prunes
out insignificant computations when training with the selected instances.
Extensive experiments show that the computation cost is substantially reduced
with no or only marginal accuracy loss. For example, when training
ResNet-110 on CIFAR-10, we achieve 68% computation saving while preserving full
accuracy and 75% computation saving with a marginal accuracy loss of 1.3%.
Aggressive computation saving of 96% is achieved with less than 0.1% accuracy
loss when quantization is integrated into the proposed approaches. Moreover,
when training LeNet on MNIST, we save 79% computation while boosting accuracy
by 0.2%.
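The two techniques can be sketched independently, as in the hypothetical Python example below: a cheap per-instance score filters the input stream before the expensive backward pass, and a top-k magnitude threshold sparsifies a backpropagated error map so the pruned gradient computations can be skipped. The loss stub, thresholds, and shapes are illustrative assumptions; the paper derives its filter via self-supervision.

```python
# Minimal sketch of early instance filtering and error map pruning.
import numpy as np

rng = np.random.default_rng(0)

def instance_loss(x):
    """Stand-in for a cheap loss estimate used to rank instances."""
    return float(np.mean(x ** 2))

def select_instances(batch, keep_ratio=0.25):
    """Keep only the highest-loss (most informative) instances."""
    losses = np.array([instance_loss(x) for x in batch])
    k = max(1, int(len(batch) * keep_ratio))
    return [batch[i] for i in np.argsort(losses)[-k:]]

def prune_error_map(err, keep_ratio=0.3):
    """Zero out all but the largest-magnitude entries of an error map,
    so the corresponding gradient computations can be skipped."""
    flat = np.abs(err).ravel()
    k = max(1, int(flat.size * keep_ratio))
    thresh = np.partition(flat, -k)[-k]
    return np.where(np.abs(err) >= thresh, err, 0.0)

batch = [rng.standard_normal((3, 32, 32)) for _ in range(16)]
selected = select_instances(batch)          # early instance filtering
err = rng.standard_normal((8, 16, 16))      # fake backprop error map
sparse_err = prune_error_map(err)           # error map pruning
print(len(selected), float(np.mean(sparse_err == 0.0)))
```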