Profile-guided memory optimization for deep neural networks
Recent years have seen deep neural networks (DNNs) becoming wider and deeper
to achieve better performance in many applications of AI. Such DNNs however
require huge amounts of memory to store weights and intermediate results (e.g.,
activations, feature maps, etc.) during propagation. This requirement makes it
difficult to run the DNNs on devices with limited, hard-to-extend memory,
degrades the running time performance, and restricts the design of network
models. We address this challenge by developing a novel profile-guided memory
optimization to efficiently and quickly allocate memory blocks during the
propagation in DNNs. The optimization utilizes a simple and fast heuristic
algorithm based on the two-dimensional rectangle packing problem. Experimenting
with well-known neural network models, we confirm that our method not only
reduces memory consumption but also accelerates training and inference by up
to a factor of four, thanks to the rapidity of the memory allocation and the
ability to use larger mini-batch sizes.
Comment: 7 pages, 9 figures
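
As a rough sketch of the packing view (the interface, heuristic, and numbers
below are ours, not the paper's): each tensor becomes a rectangle whose
horizontal extent is its profiled lifetime and whose vertical extent is its
byte size, and a greedy first-fit pass assigns offsets so that
lifetime-overlapping tensors never share addresses.

# Sketch: greedy first-fit allocation viewed as 2-D rectangle packing.
# Each tensor is a rectangle: x-axis = lifetime [start, end), y-axis = bytes.
# Hypothetical simplification of the paper's profile-guided heuristic.

def allocate(tensors):
    """tensors: list of (name, start, end, size); returns {name: offset}."""
    placed = []   # (start, end, offset, size) already packed
    offsets = {}
    # Larger tensors first, a common packing heuristic.
    for name, start, end, size in sorted(tensors, key=lambda t: -t[3]):
        # Offset ranges occupied by lifetime-overlapping tensors.
        busy = sorted((o, o + s) for s0, e0, o, s in placed
                      if s0 < end and start < e0)
        offset = 0
        for lo, hi in busy:
            if offset + size <= lo:
                break          # gap found below this occupied range
            offset = max(offset, hi)
        offsets[name] = offset
        placed.append((start, end, offset, size))
    return offsets

if __name__ == "__main__":
    # Tensors as if recorded from a profiled pass (values made up here).
    demo = [("act1", 0, 3, 4), ("act2", 1, 4, 2), ("grad1", 3, 6, 4)]
    print(allocate(demo))  # act1 and grad1 share offset 0: disjoint lifetimes
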
A Blended Deep Learning Approach for Predicting User Intended Actions
User intended actions are widely seen in many areas. Forecasting these
actions and taking proactive measures to optimize business outcomes is a
crucial step towards sustaining steady business growth. In this work, we
focus on predicting attrition, which is a typical user intended action.
Conventional attrition predictive modeling strategies suffer from a few inherent
drawbacks. To overcome these limitations, we propose a novel end-to-end
learning scheme to keep track of the evolution of attrition patterns for the
predictive modeling. It integrates user activity logs with dynamic and static
user profiles via multi-path learning. It exploits historical user records
through a decaying multi-snapshot technique. Finally, it feeds precedent user
intentions into the subsequent learning procedure. As a result, it addresses
the drawbacks of conventional methods.
We evaluate our methodology on two public data repositories and one private
user usage dataset provided by Adobe Creative Cloud. The extensive experiments
demonstrate that it offers appealing performance in comparison with several
existing approaches, as rated by several popular metrics. Furthermore,
we introduce an advanced interpretation and visualization strategy to
effectively characterize the periodicity of user activity logs. It can help to
pinpoint important factors that are critical to user attrition and retention
and thus suggests actionable improvement targets for business practice. Our
work will provide useful insights into the prediction and elucidation of other
user intended actions as well.
Comment: 10 pages, International Conference on Data Mining 201
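
A minimal sketch of what a decaying multi-snapshot combination could look
like (the weighting scheme and names are ours; the paper learns the
combination end to end):

import numpy as np

# Older activity snapshots contribute to the user representation with
# exponentially decaying weights, so recent behavior dominates.

def blend_snapshots(snapshots, decay=0.8):
    """snapshots: array of shape (T, d), oldest first. Returns one d-vector."""
    T = snapshots.shape[0]
    # Weight w_t = decay**(T-1-t): the newest snapshot gets weight 1.
    weights = decay ** np.arange(T - 1, -1, -1)
    weights /= weights.sum()
    return weights @ snapshots

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weekly_activity = rng.random((6, 16))        # 6 weeks of 16 usage features
    user_vec = blend_snapshots(weekly_activity)  # input to the attrition model
    print(user_vec.shape)                        # (16,)
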
Hardware-Guided Symbiotic Training for Compact, Accurate, yet Execution-Efficient LSTM
Many long short-term memory (LSTM) applications need fast yet compact models.
Neural network compression approaches, such as the grow-and-prune paradigm,
have proved to be promising for cutting down network complexity by skipping
insignificant weights. However, current compression strategies are mostly
hardware-agnostic and network complexity reduction does not always translate
into execution efficiency. In this work, we propose a hardware-guided symbiotic
training methodology for compact, accurate, yet execution-efficient inference
models. It is based on our observation that hardware may introduce substantial
non-monotonic behavior, which we call the latency hysteresis effect, when
evaluating network size vs. inference latency. This observation raises questions
about the mainstream smaller-dimension-is-better compression strategy, which
often leads to a sub-optimal model architecture. By leveraging the
hardware-impacted hysteresis effect and sparsity, we are able to achieve the
symbiosis of model compactness and accuracy with execution efficiency, thus
reducing LSTM latency while increasing its accuracy. We have evaluated our
algorithms on language modeling and speech recognition applications. Relative
to the traditional stacked LSTM architecture obtained for the Penn Treebank
dataset, we reduce the number of parameters by 18.0x (30.5x) and measured
run-time latency by up to 2.4x (5.2x) on Nvidia GPUs (Intel Xeon CPUs) without
any accuracy degradation. For the DeepSpeech2 architecture obtained for the AN4
dataset, we reduce the number of parameters by 7.0x (19.4x), word error rate
from 12.9% to 9.9% (10.4%), and measured run-time latency by up to 1.7x (2.4x)
on Nvidia GPUs (Intel Xeon CPUs). Thus, our method yields compact, accurate,
yet execution-efficient inference models.
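
A minimal sketch of hardware-guided size selection under the latency
hysteresis observation (all numbers below are hypothetical, not measured):

# Latency vs. hidden size is non-monotonic on real hardware ("latency
# hysteresis"), so among candidate widths that meet the accuracy target we
# pick the one with the lowest *measured* latency, not the smallest one.

def pick_width(candidates, measured_latency_ms, accuracy_ok):
    """candidates: iterable of hidden sizes; the others: dicts keyed by size."""
    feasible = [w for w in candidates if accuracy_ok[w]]
    return min(feasible, key=lambda w: measured_latency_ms[w])

if __name__ == "__main__":
    widths = [192, 224, 256, 288]
    # Hypothetical profile: 256 hits an efficient kernel tile and is fastest.
    latency = {192: 1.9, 224: 2.1, 256: 1.4, 288: 2.3}
    accurate = {192: False, 224: True, 256: True, 288: True}
    print(pick_width(widths, latency, accurate))  # -> 256, not the smallest
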
ROPNN: Detection of ROP Payloads Using Deep Neural Networks
Return-oriented programming (ROP) is a code reuse attack that chains short
snippets of existing code to perform arbitrary operations on target machines.
Existing detection methods against ROP exhibit unsatisfactory detection
accuracy and/or have high runtime overhead.
In this paper, we present ROPNN, which innovatively combines address space
layout guided disassembly and deep neural networks to detect ROP payloads. The
disassembler treats application input data as code pointers and aims to find
any potential gadget chains, which are then classified by a deep neural network
as benign or malicious. Our experiments show that ROPNN has a high detection rate
(99.3%) and a very low false positive rate (0.01%). ROPNN successfully detects
all of the 100 real-world ROP exploits that are collected in-the-wild, created
manually or created by ROP exploit generation tools. Additionally, ROPNN
detects all 10 ROP exploits that can bypass Bin-CFI. ROPNN is non-intrusive and
does not incur any runtime overhead on the protected program.
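
A minimal sketch of the classifier half of the pipeline, assuming PyTorch;
the architecture details here are illustrative, not the paper's:

import torch
import torch.nn as nn

# Sketch only: a small 1-D CNN over raw bytes of a candidate gadget chain,
# in the spirit of ROPNN's benign/malicious classifier.

class GadgetChainCNN(nn.Module):
    def __init__(self, max_len=128):
        super().__init__()
        self.embed = nn.Embedding(256, 16)          # one embedding per byte value
        self.conv = nn.Sequential(
            nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.fc = nn.Linear(32, 2)                  # benign vs. malicious

    def forward(self, byte_ids):                    # (batch, max_len) int64
        x = self.embed(byte_ids).transpose(1, 2)    # (batch, 16, max_len)
        return self.fc(self.conv(x).squeeze(-1))

if __name__ == "__main__":
    model = GadgetChainCNN()
    fake_chains = torch.randint(0, 256, (4, 128))   # 4 candidate chains
    print(model(fake_chains).shape)                 # torch.Size([4, 2])
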
Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization
We formalize the problem of trading off DNN training time and memory
requirements as the tensor rematerialization optimization problem, a
generalization of prior checkpointing strategies. We introduce Checkmate, a
system that solves for optimal rematerialization schedules in reasonable times
(under an hour) using off-the-shelf MILP solvers or near-optimal schedules with
an approximation algorithm, then uses these schedules to accelerate millions of
training iterations. Our method scales to complex, realistic architectures and
is hardware-aware through the use of accelerator-specific, profile-based cost
models. In addition to reducing training cost, Checkmate enables real-world
networks to be trained with up to 5.1x larger input sizes. Checkmate is an
open-source project, available at https://github.com/parasj/checkmate.
Comment: In Proceedings of the 3rd Conference on Machine Learning and Systems
(MLSys 2020)
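
Checkmate itself solves an MILP over arbitrary graphs; the toy sketch below
only illustrates the underlying memory/compute trade-off on a linear chain of
layers, using the classic sqrt(n) checkpoint spacing:

import math

# Keep only every k-th activation ("checkpoints") and recompute the rest
# during the backward pass: less memory, more forward work.

def chain_schedule(n_layers):
    k = max(1, round(math.sqrt(n_layers)))     # classic sqrt(n) spacing
    checkpoints = set(range(0, n_layers, k))
    stored = len(checkpoints)                  # activations held in memory
    recomputed = n_layers - stored             # extra forward work in backward
    return checkpoints, stored, recomputed

if __name__ == "__main__":
    ckpts, mem, extra = chain_schedule(100)
    print(f"store {mem} activations, recompute {extra}")  # store 10, recompute 90
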
Learning-based Application-Agnostic 3D NoC Design for Heterogeneous Manycore Systems
The rising use of deep learning and other big-data algorithms has led to an
increasing demand for hardware platforms that are computationally powerful, yet
energy-efficient. Due to the amount of data parallelism in these algorithms,
high-performance 3D manycore platforms that incorporate both CPUs and GPUs
present a promising direction. However, as systems use heterogeneity (e.g., a
combination of CPUs, GPUs, and accelerators) to improve performance and
efficiency, it becomes more pertinent to address the distinct and likely
conflicting communication requirements (e.g., CPU memory access latency or GPU
network throughput) that arise from such heterogeneity. Unfortunately, it is
difficult to quickly explore the hardware design space and choose appropriate
tradeoffs between these heterogeneous requirements. To address these
challenges, we propose the design of a 3D Network-on-Chip (NoC) for
heterogeneous manycore platforms that considers the appropriate design
objectives for a 3D heterogeneous system and explores various tradeoffs using
an efficient ML-based multi-objective optimization technique. The proposed
design space exploration considers the various requirements of its
heterogeneous components and generates a set of 3D NoC architectures that
efficiently trade off these design objectives. Our findings show that by
jointly considering these requirements (latency, throughput, temperature, and
energy), we can achieve 9.6% better Energy-Delay Product on average at nearly
iso-temperature conditions when compared to a thermally-optimized design for 3D
heterogeneous NoCs. More importantly, our results suggest that our 3D NoCs
optimized for a few applications can be generalized for unknown applications as
well. Our results show that these generalized 3D NoCs only incur a 1.8%
(36-tile system) and 1.1% (64-tile system) average performance loss compared to
application-specific NoCs.
Comment: Published in IEEE Transactions on Computers
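
The backbone of any such multi-objective exploration is a non-dominated
(Pareto) filter over candidate designs; a minimal sketch with made-up
objective values follows (the paper's ML-based optimizer is far more
sophisticated):

# Objective names follow the paper (latency, throughput, temperature,
# energy); everything is minimized here, so throughput is negated first.

def dominates(a, b):
    """a dominates b if it is no worse in every objective and better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(designs):
    """designs: dict name -> tuple of objectives (all to be minimized)."""
    return {n: v for n, v in designs.items()
            if not any(dominates(w, v) for w in designs.values() if w != v)}

if __name__ == "__main__":
    # (latency, -throughput, temperature, energy) for three candidate 3D NoCs
    cands = {"A": (1.0, -5.0, 70, 2.0),
             "B": (1.2, -6.0, 68, 2.1),
             "C": (1.3, -4.5, 72, 2.5)}   # C is dominated by A
    print(sorted(pareto_front(cands)))     # ['A', 'B']
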
Bit-Flip Attack: Crushing Neural Network with Progressive Bit Search
Several important security issues of deep neural networks (DNNs) have been
raised recently in connection with different applications and components. The
most widely investigated security concern is a DNN's malicious input, a.k.a.
the adversarial example. Nevertheless, the security of a DNN's parameters has
not been well explored yet. In this work, we are the first to propose a novel
DNN weight attack methodology, called Bit-Flip Attack (BFA), which can crush a
neural network by maliciously flipping an extremely small number of bits
within its weight storage memory system (i.e., DRAM). The bit-flip operations
can be conducted through the well-known Row-Hammer attack, while our main
contribution is an algorithm that identifies the most vulnerable bits of the
DNN weight parameters (stored in memory as binary bits) so as to maximize the
accuracy degradation with a minimum number of bit-flips. Our proposed BFA
utilizes a
Progressive Bit Search (PBS) method which combines gradient ranking and
progressive search to identify the most vulnerable bit to be flipped. With the
aid of PBS, we can make a ResNet-18 fully malfunction (i.e., its top-1
accuracy degrades from 69.8% to 0.1%) with only 13 bit-flips out of 93
million bits, whereas randomly flipping 100 bits degrades the accuracy by
less than 1%.
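
A minimal sketch of the ranking-and-flip step on int8 weights (real PBS
progressively re-evaluates the loss layer by layer; this only shows one
gradient-guided flip):

import numpy as np

def flip_most_vulnerable_bit(weights_int8, grad):
    """Flip one bit of the weight whose gradient suggests the largest damage."""
    idx = int(np.argmax(np.abs(grad)))          # most loss-sensitive weight
    w = weights_int8[idx]
    # Flipping the sign/MSB bit (bit 7) of an int8 weight perturbs it the
    # most; XOR with 0x80 (-128 as int8) toggles exactly that bit.
    weights_int8[idx] = np.int8(w ^ np.int8(-128))
    return idx

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    w = rng.integers(-128, 128, size=8, dtype=np.int8)
    g = rng.normal(size=8)                      # stand-in for real gradients
    i = flip_most_vulnerable_bit(w, g)
    print(f"flipped MSB of weight {i}: now {w[i]}")
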
Attend and Predict: Understanding Gene Regulation by Selective Attention on Chromatin
The past decade has seen a revolution in genomic technologies that enable a
flood of genome-wide profiling of chromatin marks. Recent literature has tried to
understand gene regulation by predicting gene expression from large-scale
chromatin measurements. Two fundamental challenges exist for such learning
tasks: (1) genome-wide chromatin signals are spatially structured,
high-dimensional, and highly modular; and (2) the core aim is to understand
which factors are relevant and how they work together. Previous studies either
failed to model complex dependencies among input signals or relied on separate
feature analysis to explain the decisions. This paper presents an
attention-based deep learning approach, called AttentiveChrome, that uses a
unified architecture to model and to interpret dependencies among chromatin
factors for controlling gene regulation. AttentiveChrome uses a hierarchy of
multiple long short-term memory (LSTM) modules to encode the input signals and
to model how various chromatin marks cooperate automatically. AttentiveChrome
trains two levels of attention jointly with the target prediction, enabling it
to attend differentially to relevant marks and to locate important positions
per mark. We evaluate the model across 56 different cell types (tasks) in
humans. Not only is the proposed architecture more accurate, but its attention
scores also provide a better interpretation than state-of-the-art feature
visualization methods such as saliency maps.
Code and data are shared at www.deepchrome.org
Comment: 12 pages; at NIPS 201
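
A minimal sketch of the two-level attention idea, assuming PyTorch;
dimensions and layer choices are ours, not the paper's:

import torch
import torch.nn as nn
import torch.nn.functional as F

# A bin-level BiLSTM + attention summarizes each chromatin mark's signal
# track, then mark-level attention weighs the marks for the final prediction.

class TwoLevelAttention(nn.Module):
    def __init__(self, hidden=8):
        super().__init__()
        self.bin_lstm = nn.LSTM(1, hidden, bidirectional=True, batch_first=True)
        self.bin_attn = nn.Linear(2 * hidden, 1)
        self.mark_attn = nn.Linear(2 * hidden, 1)
        self.out = nn.Linear(2 * hidden, 1)          # gene expression (on/off)

    def attend(self, h, scorer):
        a = F.softmax(scorer(h), dim=1)              # weights over dim 1
        return (a * h).sum(dim=1), a

    def forward(self, x):                            # x: (batch, marks, bins)
        b, m, t = x.shape
        h, _ = self.bin_lstm(x.reshape(b * m, t, 1)) # encode each mark's track
        mark_vec, _ = self.attend(h, self.bin_attn)  # (b*m, 2*hidden)
        mark_vec = mark_vec.view(b, m, -1)
        gene_vec, mark_w = self.attend(mark_vec, self.mark_attn)
        return self.out(gene_vec), mark_w            # logits + mark importance

if __name__ == "__main__":
    model = TwoLevelAttention()
    signals = torch.rand(2, 5, 100)                  # 2 genes, 5 marks, 100 bins
    logits, weights = model(signals)
    print(logits.shape, weights.shape)               # (2, 1) (2, 5, 1)
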
Circumventing the Curse of Dimensionality in Magnetic Resonance Fingerprinting through a Deep Learning Approach
MR fingerprinting (MRF) is a rapidly growing approach to fast quantitative MRI.
A typical drawback of dictionary-based MRF is its explosion in size as a
function of the number of reconstructed parameters, owing to the curse of
dimensionality. Deep Neural Networks (NNs) have been proposed as a feasible
alternative, but these approaches are still in their infancy.
We tested different NN pipelines on simulated data: we studied optimal
training procedures by including different strategies of noise addition and
parameter space sampling, to achieve better accuracy and robustness to noise.
Four MRF sequences were considered, two of them designed to be more specific
for parameter encoding: IR-FISP, IR-FISP-B1, bSSFP, and IR-bSSFP-B1. A
comparison between the NN and dictionary approaches was
performed using a numerical brain phantom.
Results demonstrated that training with random sampling and different levels
of noise variance yielded the best performance. NN performance was greater
than or equal to that of the dictionary-based approach in reconstructing MR
parameter maps: the
difference in performance increased with the number of estimated parameters,
because the dictionary method suffers from the coarse resolution of the MR
parameter space sampling. The NN approach proved more efficient in terms of
memory and computational burden, and thus has great potential for large-scale
MRF problems.
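
A minimal sketch of the winning training recipe, random parameter sampling
plus randomly drawn noise levels, with a stand-in signal model (not an actual
MRF sequence simulation):

import numpy as np

def make_training_batch(n, sig_len=64, seed=0):
    rng = np.random.default_rng(seed)
    t1 = rng.uniform(0.2, 4.0, n)                   # seconds, random sampling
    t2 = rng.uniform(0.02, 0.6, n)
    t = np.linspace(0.01, 3.0, sig_len)
    # Toy fingerprint: mixed relaxation curves (placeholder physics).
    clean = np.exp(-t / t2[:, None]) * (1 - np.exp(-t / t1[:, None]))
    sigma = rng.uniform(0.0, 0.05, (n, 1))          # per-example noise level
    noisy = clean + sigma * rng.standard_normal((n, sig_len))
    return noisy.astype(np.float32), np.stack([t1, t2], 1).astype(np.float32)

if __name__ == "__main__":
    x, y = make_training_batch(256)
    print(x.shape, y.shape)                         # (256, 64) (256, 2)
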
Occlusion-guided compact template learning for ensemble deep network-based pose-invariant face recognition
Concatenation of the deep network representations extracted from different
facial patches helps to improve face recognition performance. However, the
concatenated facial template increases in size and contains redundant
information. Previous solutions aim to reduce the dimensionality of the facial
template without considering the occlusion pattern of the facial patches. In
this paper, we propose an occlusion-guided compact template learning (OGCTL)
approach that only uses the information from visible patches to construct the
compact template. The compact face representation is not sensitive to the
number of patches that are used to construct the facial template and is more
suitable for incorporating the information from different view angles for
image-set based face recognition. Instead of using occlusion masks in face
matching (e.g., DPRFS [38]), the proposed method uses occlusion masks in
template construction and achieves significantly better image-set based face
verification performance on a challenging database with a template size that is
an order-of-magnitude smaller than DPRFS.
Comment: Accepted by the International Conference on Biometrics (ICB 2019) as
an oral presentation
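
A minimal sketch of occlusion-guided template construction, using a masked
mean as a stand-in pooling operator (the paper learns the compact template;
the pooling choice here is ours):

import numpy as np

# Build the compact template only from patches flagged visible, so the
# template size does not depend on how many patches survive occlusion.

def compact_template(patch_embs, visible):
    """patch_embs: (num_patches, d); visible: boolean mask per patch."""
    vis = patch_embs[visible]
    if vis.size == 0:
        raise ValueError("no visible patches")
    tmpl = vis.mean(axis=0)
    return tmpl / np.linalg.norm(tmpl)              # unit norm for cosine match

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(9, 128))             # 9 facial patch embeddings
    mask = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1], bool)  # two occluded patches
    print(compact_template(patches, mask).shape)    # (128,)
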