Keynote: Small Neural Nets Are Beautiful: Enabling Embedded Systems with Small Deep-Neural-Network Architectures
Over the last five years, Deep Neural Nets have offered more accurate
solutions to many problems in speech recognition and computer vision, and
these solutions have surpassed a threshold of acceptability for many
applications. As a result, Deep Neural Networks have supplanted other
approaches to solving problems in these areas, and enabled many new
applications. While the design of Deep Neural Nets is still something of an art
form, in our work we have found basic principles of design space exploration
used to develop embedded microprocessor architectures to be highly applicable
to the design of Deep Neural Net architectures. In particular, we have used
these design principles to create a novel Deep Neural Net called SqueezeNet
that requires as little as 480KB of storage for its model parameters. We have
further integrated all these experiences to develop something of a playbook for
creating small Deep Neural Nets for embedded systems.
Comment: Keynote at Embedded Systems Week (ESWEEK) 2017
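The core building block of the SqueezeNet architecture mentioned above is the "Fire" module: a 1x1 "squeeze" convolution feeding two parallel "expand" convolutions (1x1 and 3x3) whose outputs are concatenated. A minimal PyTorch sketch, with channel counts mirroring an early Fire module from the SqueezeNet paper:

```python
import torch
import torch.nn as nn

class Fire(nn.Module):
    """SqueezeNet's Fire module: a 1x1 "squeeze" convolution feeding two
    parallel "expand" convolutions (1x1 and 3x3) whose outputs are
    concatenated along the channel dimension."""
    def __init__(self, in_ch, squeeze_ch, e1x1_ch, e3x3_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.e1x1 = nn.Conv2d(squeeze_ch, e1x1_ch, kernel_size=1)
        self.e3x3 = nn.Conv2d(squeeze_ch, e3x3_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        return torch.cat([self.relu(self.e1x1(x)),
                          self.relu(self.e3x3(x))], dim=1)

out = Fire(96, 16, 64, 64)(torch.randn(1, 96, 54, 54))  # -> (1, 128, 54, 54)
```

The squeeze layer's small channel count is what shrinks the parameter budget: the 3x3 expand filters only see `squeeze_ch` input channels instead of the full width.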
FATE: Fast and Accurate Timing Error Prediction Framework for Low Power DNN Accelerator Design
Deep neural networks (DNN) are increasingly being accelerated on
application-specific hardware such as the Google TPU designed especially for
deep learning. Timing speculation is a promising approach to further increase
the energy efficiency of DNN accelerators. Architectural exploration for timing
speculation requires detailed gate-level timing simulations that can be
time-consuming for large DNNs that execute millions of multiply-and-accumulate
(MAC) operations. In this paper we propose FATE, a new methodology for fast and
accurate timing simulations of DNN accelerators like the Google TPU. FATE
proposes two novel ideas: (i) DelayNet, a DNN based timing model for MAC units;
and (ii) a statistical sampling methodology that reduces the number of MAC
operations for which timing simulations are performed. We show that FATE
results in an 8x-58x speed-up in timing simulations, while
introducing less than 2% error in classification accuracy estimates. We
demonstrate the use of FATE by comparing a conventional DNN accelerator that
uses 2's complement (2C) arithmetic with an alternative implementation that
uses signed magnitude representations (SMR). We show that the SMR
implementation provides 18% greater energy savings than 2C for the same
classification accuracy, a result that might be of independent interest.
Comment: To appear at the IEEE/ACM International Conference on Computer-Aided Design (ICCAD) 2018
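The statistical sampling idea reduces to a simple pattern: run the expensive detailed simulation on a small random subset of MAC operations and extrapolate. A sketch under assumed names and a hypothetical 1% sample fraction (the paper's actual sampling methodology may differ):

```python
import random

def estimate_timing_error_rate(mac_ops, timing_sim, sample_frac=0.01, seed=0):
    """Run the expensive gate-level timing simulation on a random sample of
    MAC operations and extrapolate the error rate to the full workload.

    mac_ops: list of (operand_a, operand_b) pairs the accelerator executes.
    timing_sim: callable returning True if the pair triggers a timing error.
    """
    rng = random.Random(seed)
    n = max(1, int(len(mac_ops) * sample_frac))
    sample = rng.sample(mac_ops, n)
    errors = sum(timing_sim(a, b) for a, b in sample)
    return errors / n  # extrapolated error rate for the full workload
```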
Check-It: A Plugin for Detecting and Reducing the Spread of Fake News and Misinformation on the Web
Over the past few years, we have been witnessing the rise of misinformation
on the Web. People fall victim to fake news in their daily lives and
assist in its further propagation, knowingly or inadvertently. There have been
many initiatives trying to mitigate the damage caused by fake news,
focusing on signals from domain flag-lists, online social networks, or
artificial intelligence. In this work, we present Check-It, a system that
combines, in an intelligent way, a variety of signals into a pipeline for fake
news identification. Check-It is developed as a web browser plugin with the
objective of efficient and timely fake news detection, respecting the user's
privacy. Experimental results show that Check-It is able to outperform the
state-of-the-art methods. On a dataset consisting of 9 million articles
labeled as fake or real, Check-It obtains classification accuracies that
exceed 99%.
Comment: 8 pages, 6 figures
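The abstract does not detail how Check-It combines its signals; purely as an illustration of a signal-combining pipeline, a weighted score over a flag-list hit, a propagation signal, and a text-classifier probability might look like this (all weights, names, and the flag-list are hypothetical):

```python
def fake_news_score(domain, propagation_score, text_prob):
    """Combine three signals into one [0, 1] fakeness score: a domain
    flag-list hit, a social-propagation signal, and a text classifier's
    probability. Weights and the flag-list are purely illustrative."""
    FLAGGED_DOMAINS = {"example-fake-news.com"}  # hypothetical flag-list
    domain_signal = 1.0 if domain in FLAGGED_DOMAINS else 0.0
    return 0.4 * domain_signal + 0.2 * propagation_score + 0.4 * text_prob

# An article from a flagged domain with suspicious propagation and text:
print(fake_news_score("example-fake-news.com", 0.7, 0.9))  # -> 0.9
```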
SparCE: Sparsity aware General Purpose Core Extensions to Accelerate Deep Neural Networks
Deep Neural Networks (DNNs) have emerged as the method of choice for solving
a wide range of machine learning tasks. The enormous computational demands
posed by DNNs have most commonly been addressed through the design of custom
accelerators. However, these accelerators are prohibitive in many design
scenarios (e.g., wearable devices and IoT sensors), due to stringent area/cost
constraints. Accelerating DNNs on these low-power systems, which mainly
comprise general-purpose processor (GPP) cores, requires new approaches. We improve
the performance of DNNs on GPPs by exploiting a key attribute of DNNs, i.e.,
sparsity. We propose Sparsity-aware Core Extensions (SparCE), a set of
micro-architectural and ISA extensions that leverage sparsity and are minimally
intrusive and low-overhead. We dynamically detect zero operands and skip a set
of future instructions that use them. Our design ensures that the instructions to
be skipped are prevented from even being fetched, as squashing instructions
comes with a penalty. SparCE consists of two key micro-architectural
enhancements: a Sparsity Register File (SpRF) that tracks zero registers, and a
Sparsity-aware Skip Address (SASA) table that indicates instructions to be
skipped. When an instruction is fetched, SparCE dynamically pre-identifies
whether the following instruction(s) can be skipped and appropriately modifies
the program counter, thereby skipping the redundant instructions and improving
performance. We model SparCE using the gem5 architectural simulator, and
evaluate our approach on 6 image-recognition DNNs in the context of both
training and inference using the Caffe framework. On a scalar microprocessor,
SparCE achieves a 19%-31% reduction in application-level execution time. We
also evaluate SparCE on a 4-way SIMD ARMv8 processor using the OpenBLAS
library, and demonstrate that SparCE achieves an 8%-15% reduction in
application-level execution time.
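A behavioral sketch of the skip mechanism described above, with a Python set standing in for the SpRF and a dict for the SASA table; the real design operates in the fetch stage of a pipeline, which this toy ignores:

```python
def run(program, regs, sasa):
    """Toy model of SparCE's skip mechanism (behavioral, not the RTL).
    program: list of (op, dst, src1, src2) tuples; regs: register -> value;
    sasa: maps an instruction index to how many following instructions become
    redundant when one of its source registers holds zero (the SASA table)."""
    zero_regs = {r for r, v in regs.items() if v == 0}  # stands in for the SpRF
    pc = 0
    while pc < len(program):
        op, dst, a, b = program[pc]
        # The skipped sequence must be one whose net effect is a no-op when
        # the operand is zero (e.g., a multiply feeding an accumulate).
        if op == "mul" and (a in zero_regs or b in zero_regs) and pc in sasa:
            pc += 1 + sasa[pc]  # jump past the redundant instructions
            continue
        regs[dst] = regs[a] * regs[b] if op == "mul" else regs[a] + regs[b]
        if regs[dst] == 0:
            zero_regs.add(dst)
        else:
            zero_regs.discard(dst)
        pc += 1
    return regs

# r2 += r0 * r1; with r1 == 0 the mul and the dependent add are skipped.
prog = [("mul", "r3", "r0", "r1"), ("add", "r2", "r2", "r3")]
print(run(prog, {"r0": 5, "r1": 0, "r2": 7, "r3": 0}, sasa={0: 1}))
```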
Understanding Reuse, Performance, and Hardware Cost of DNN Dataflows: A Data-Centric Approach Using MAESTRO
The data partitioning and scheduling strategies used by DNN accelerators to
leverage reuse and perform staging are known as dataflow, and they directly
impact the performance and energy efficiency of DNN accelerator designs. An
accelerator microarchitecture dictates the dataflow(s) that can be employed to
execute a layer or network. Selecting an optimal dataflow for a layer shape can
have a large impact on utilization and energy efficiency, but there is a lack
of understanding of the choices and consequences of dataflows, and of tools and
methodologies to help architects explore the co-optimization design space. In
this work, we first introduce a set of data-centric directives to concisely
specify the space of DNN dataflows in a compiler-friendly form. We then show how
these directives can be analyzed to infer various forms of reuse and to exploit
them using hardware capabilities. We codify this analysis into an analytical
cost model, MAESTRO (Modeling Accelerator Efficiency via Spatio-Temporal Reuse
and Occupancy), that estimates various cost-benefit tradeoffs of a dataflow
including execution time and energy efficiency for a DNN model and hardware
configuration. We demonstrate the use of MAESTRO to drive a hardware design
space exploration (DSE) experiment, which searches across 480M designs to
identify 2.5M valid designs at an average rate of 0.17M designs per second,
including Pareto-optimal throughput- and energy-optimized design points.
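As a point of reference for what such a cost model computes, the sketch below gives only the ideal-utilization baseline for a single convolution layer; MAESTRO's directives additionally capture dataflow-dependent reuse, buffer occupancy, and interconnect costs, all of which this omits:

```python
def conv_layer_cost(K, C, Y, X, R, S, num_pes, macs_per_pe_per_cycle=1):
    """Back-of-the-envelope cost for one conv layer: K output channels,
    C input channels, Y x X output feature map, R x S filters. Returns the
    total MAC count and an ideal (100%-utilization) cycle estimate."""
    macs = K * C * Y * X * R * S
    cycles = macs / (num_pes * macs_per_pe_per_cycle)
    return macs, cycles

# A 3x3, 64->64-channel layer on a 56x56 map with 256 PEs:
macs, cycles = conv_layer_cost(K=64, C=64, Y=56, X=56, R=3, S=3, num_pes=256)
print(macs, cycles)  # ~115.6M MACs, ~451K ideal cycles
```

A dataflow-aware model like MAESTRO refines this baseline with utilization and data-movement terms that depend on the chosen computation order and tiling.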
Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Detection
We propose a novel method for Acoustic Event Detection (AED). In contrast to
speech, sounds coming from acoustic events may be produced by a wide variety of
sources. Furthermore, distinguishing them often requires analyzing an extended
time period due to the lack of a clear sub-word unit. In order to incorporate
the long-time frequency structure for AED, we introduce a convolutional neural
network (CNN) with a large input field. In contrast to previous works, this
enables training audio event detection end-to-end. Our architecture is inspired
by the success of VGGNet and uses small, 3x3 convolutions, but more depth than
previous methods in AED. In order to prevent over-fitting and to take full
advantage of the modeling capabilities of our network, we further propose a
novel data augmentation method to introduce data variation. Experimental
results show that our CNN significantly outperforms state-of-the-art methods,
including Bag of Audio Words (BoAW) and classical CNNs, achieving a 16%
absolute improvement.
Comment: Presented at INTERSPEECH 2016
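The described architecture, small 3x3 convolutions stacked deeper than prior AED models over a wide spectrogram input, might be sketched in PyTorch as follows (layer widths and the 10-class head are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn

def vgg_block(in_ch, out_ch):
    # Two stacked 3x3 convolutions followed by 2x2 pooling, VGG-style.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(2))

model = nn.Sequential(            # input: (batch, 1, mel_bins, frames)
    vgg_block(1, 32), vgg_block(32, 64), vgg_block(64, 128),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(128, 10))           # 10 hypothetical event classes

logits = model(torch.randn(8, 1, 64, 400))  # wide input field along time
```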
Heterogeneous Dataflow Accelerators for Multi-DNN Workloads
Emerging AI-enabled applications such as augmented/virtual reality (AR/VR)
leverage multiple deep neural network (DNN) models for sub-tasks such as object
detection, hand tracking, and so on. Because of the diversity of the sub-tasks,
the layers within and across the DNN models are highly heterogeneous in
operation and shape. Such layer heterogeneity is a challenge for a fixed
dataflow accelerator (FDA) that employs a fixed dataflow on a single
accelerator substrate since each layer prefers different dataflows (computation
order and parallelization) and tile sizes. Reconfigurable DNN accelerators
(RDAs) have been proposed to adapt their dataflows to diverse layers to address
the challenge. However, the dataflow flexibility in RDAs is enabled at the area
and energy costs of expensive hardware structures (switches, controller, etc.)
and per-layer reconfiguration.
Alternatively, this work proposes a new class of accelerators, heterogeneous
dataflow accelerators (HDAs), which deploys multiple sub-accelerators each
supporting a different dataflow. HDAs enable coarser-grained dataflow
flexibility than RDAs, with higher energy efficiency and an area cost
comparable to FDAs. To exploit these benefits, the hardware resource
partitioning across sub-accelerators and the layer execution schedule need to be carefully
optimized. Therefore, we also present Herald, which co-optimizes hardware
partitioning and layer execution schedule. Using Herald on a suite of AR/VR and
MLPerf workloads, we identify a promising HDA architecture, Maelstrom, which
demonstrates 65.3% lower latency and 5.0% lower energy than the best FDAs and
22.0% lower energy at the cost of 20.7% higher latency than a state-of-the-art
RDA. The results suggest that HDAs are an alternative class of Pareto-optimal
accelerators to RDAs, with particular strength in energy efficiency, and can be
the better choice depending on the use case.
Comment: This paper is accepted at HPCA 2021
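Herald's co-optimization is more involved than this, but the scheduling half can be caricatured as a greedy list scheduler: place each layer on the sub-accelerator that finishes it earliest under that accelerator's dataflow. Everything named below is a hypothetical stand-in:

```python
def schedule(layers, sub_accels, cost):
    """Greedy caricature of layer scheduling across heterogeneous
    sub-accelerators. Ignores inter-layer dependencies.
    cost(layer, accel) -> estimated latency of `layer` on `accel`."""
    finish = {a: 0.0 for a in sub_accels}  # running busy time per sub-accelerator
    assignment = {}
    for layer in layers:
        best = min(sub_accels, key=lambda a: finish[a] + cost(layer, a))
        assignment[layer] = best
        finish[best] += cost(layer, best)
    return assignment, max(finish.values())  # mapping and overall makespan

# Hypothetical example: two sub-accelerators with different dataflow strengths.
est = {("conv3x3", "output-stationary"): 1.0, ("conv3x3", "weight-stationary"): 2.0,
       ("depthwise", "output-stationary"): 3.0, ("depthwise", "weight-stationary"): 1.2}
mapping, makespan = schedule(["conv3x3", "depthwise"],
                             ["output-stationary", "weight-stationary"],
                             lambda l, a: est[(l, a)])
```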
Audio Super Resolution using Neural Networks
We introduce a new audio processing technique that increases the sampling
rate of signals such as speech or music using deep convolutional neural
networks. Our model is trained on pairs of low and high-quality audio examples;
at test-time, it predicts missing samples within a low-resolution signal in an
interpolation process similar to image super-resolution. Our method is simple
and does not involve specialized audio processing techniques; in our
experiments, it outperforms baselines on standard speech and music benchmarks
at upscaling ratios of 2x, 4x, and 6x. The method has practical applications in
telephony, compression, and text-to-speech generation; it demonstrates the
effectiveness of feed-forward convolutional architectures on an audio
generation task.
Comment: Presented at the 5th International Conference on Learning Representations (ICLR) 2017, Workshop Track, Toulon, France
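The interpolation framing suggests a simple baseline shape: upsample the low-resolution waveform, then let a convolutional stack predict the missing detail as a residual. A toy PyTorch sketch (the published model is deeper, with subpixel layers; all sizes here are illustrative):

```python
import torch
import torch.nn as nn

class Upsampler(nn.Module):
    """Toy 1-D super-resolution net: linearly upsample the waveform, then
    let stacked convolutions predict a residual correction carrying the
    missing high-frequency detail."""
    def __init__(self, ratio=4, hidden=64):
        super().__init__()
        self.ratio = ratio
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, 9, padding=4), nn.ReLU(),
            nn.Conv1d(hidden, hidden, 9, padding=4), nn.ReLU(),
            nn.Conv1d(hidden, 1, 9, padding=4))

    def forward(self, x):  # x: (batch, 1, samples) low-resolution audio
        up = nn.functional.interpolate(x, scale_factor=self.ratio,
                                       mode="linear", align_corners=False)
        return up + self.net(up)  # residual correction on the upsampled signal

hr = Upsampler(ratio=4)(torch.randn(1, 1, 2000))  # -> (1, 1, 8000)
```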
Bandwidth Extension on Raw Audio via Generative Adversarial Networks
Neural network-based methods have recently demonstrated state-of-the-art
results on image synthesis and super-resolution tasks, in particular by using
variants of generative adversarial networks (GANs) with supervised feature
losses. Nevertheless, previous feature loss formulations rely on the
availability of large auxiliary classifier networks, and labeled datasets that
enable such classifiers to be trained. Furthermore, there has been
comparatively little work exploring the applicability of GAN-based methods to
domains other than images and video. In this work we explore a GAN-based method
for audio processing, and develop a convolutional neural network architecture
to perform audio super-resolution. In addition to several new architectural
building blocks for audio processing, a key component of our approach is the
use of an autoencoder-based loss that enables training in the GAN framework,
with feature losses derived from unlabeled data. We explore the impact of our
architectural choices, and demonstrate significant improvements over previous
works in terms of both objective and perceptual quality.
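The key component named in the abstract, an autoencoder-based feature loss, can be sketched as follows; the distance metric, the detaching of targets, and the weighting are assumptions here, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def autoencoder_feature_loss(encoder, generated, real):
    """Feature loss computed with a pretrained *autoencoder's* encoder rather
    than a labeled-data classifier: penalize the distance between intermediate
    representations of generated and real audio. `encoder` is a hypothetical
    module returning a feature tensor, trained only on unlabeled audio."""
    with torch.no_grad():
        target = encoder(real)  # targets treated as fixed
    return F.l1_loss(encoder(generated), target)

# The generator objective would combine this with the adversarial term, e.g.:
# loss_G = adv_loss + lambda_feat * autoencoder_feature_loss(enc, fake, real)
```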
Leveraging Deep Learning to Improve the Performance Predictability of Cloud Microservices
Performance unpredictability is a major roadblock to cloud adoption, and
has performance, cost, and revenue ramifications. Predictable performance is
even more critical as cloud services transition from monolithic designs to
microservices. Detecting QoS violations after they occur in systems with
microservices results in long recovery times, as hotspots propagate and amplify
across dependent services. We present Seer, an online cloud performance
debugging system that leverages deep learning and the massive amount of tracing
data cloud systems collect to learn spatial and temporal patterns that
translate to QoS violations. Seer combines lightweight distributed RPC-level
tracing with detailed low-level hardware monitoring to signal an upcoming QoS
violation and diagnose the source of unpredictable performance. Once an
imminent QoS violation is detected, Seer notifies the cluster manager to take
action to avoid performance degradation altogether. We evaluate Seer both in
local clusters, and in large-scale deployments of end-to-end applications built
with microservices with hundreds of users. We show that Seer correctly
anticipates QoS violations 91% of the time, and avoids the QoS violation to
begin with in 84% of cases. Finally, we show that Seer can identify
application-level design bugs, and provide insights on how to better architect
microservices to achieve predictable performance.
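The abstract does not specify Seer's model or features; purely as a shape illustration, a recurrent network over sliding windows of per-service trace features, emitting a per-service violation probability, could look like this (all dimensions and feature choices are hypothetical):

```python
import torch
import torch.nn as nn

class QoSPredictor(nn.Module):
    """Illustrative Seer-style predictor: an LSTM reads a sliding window of
    per-microservice trace features (e.g., queue lengths, latencies,
    utilization) and outputs, per service, the probability that it will
    cause an upcoming QoS violation."""
    def __init__(self, n_features, n_services, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_services)

    def forward(self, window):  # window: (batch, time, n_features)
        out, _ = self.lstm(window)
        return torch.sigmoid(self.head(out[:, -1]))  # per-service probability

probs = QoSPredictor(n_features=64, n_services=20)(torch.randn(8, 100, 64))
```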