Fast Spatio-Temporal Residual Network for Video Super-Resolution
Recently, deep learning based video super-resolution (SR) methods have
achieved promising performance. To simultaneously exploit the spatial and
temporal information of videos, employing 3-dimensional (3D) convolutions is a
natural approach. However, directly utilizing 3D convolutions may lead to
excessively high computational complexity, which restricts the depth of video
SR models and thus undermines performance. In this paper, we present a novel
fast spatio-temporal residual network (FSTRN) that adopts 3D convolutions for
the video SR task to enhance performance while maintaining a low computational
load. Specifically, we propose a fast spatio-temporal residual block (FRB)
that divides each 3D filter into the product of two 3D filters, which
have considerably lower dimensions. Furthermore, we design a cross-space
residual learning scheme that directly links the low-resolution space and the
high-resolution space, which can greatly relieve the computational burden on
the feature fusion and up-scaling parts. Extensive evaluations and comparisons
on benchmark datasets validate the strengths of the proposed approach and
demonstrate that the proposed network significantly outperforms the current
state-of-the-art methods.
Comment: To appear in CVPR 2019
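The main cost saving described above is the filter factorization inside the FRB. Below is a minimal sketch of that idea, assuming PyTorch: a k x k x k convolution is split into a 1 x k x k spatial convolution followed by a k x 1 x 1 temporal one. The module name, activation placement, and shapes are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a fast spatio-temporal residual block (FRB): one expensive
# k x k x k 3D convolution is factorized into two cheaper 3D filters.
import torch
import torch.nn as nn

class FastResidualBlock(nn.Module):
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        # Spatial filter: 1 x k x k over (frames, height, width).
        self.spatial = nn.Conv3d(channels, channels,
                                 kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2))
        # Temporal filter: k x 1 x 1.
        self.temporal = nn.Conv3d(channels, channels,
                                  kernel_size=(k, 1, 1),
                                  padding=(k // 2, 0, 0))
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the block cheap to stack deeply.
        return x + self.temporal(self.act(self.spatial(self.act(x))))

# A batch of 4 clips: 64 channels, 5 frames, 32x32 spatial patches.
frb = FastResidualBlock(64)
out = frb(torch.randn(4, 64, 5, 32, 32))  # output shape equals input shape
```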
The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent
This paper studies how neural network architecture affects the speed of
training. We introduce a simple concept called gradient confusion to help
formally analyze this. When gradient confusion is high, stochastic gradients
produced by different data samples may be negatively correlated, slowing down
convergence. But when gradient confusion is low, data samples interact
harmoniously, and training proceeds quickly. Through theoretical and
experimental results, we demonstrate how the neural network architecture
affects gradient confusion, and thus the efficiency of training. Our results
show that, for popular initialization techniques, increasing the width of
neural networks leads to lower gradient confusion, and thus faster model
training. On the other hand, increasing the depth of neural networks has the
opposite effect. Our results indicate that alternate initialization techniques
or networks using both batch normalization and skip connections help reduce the
training burden of very deep networks.
Comment: ICML 2020 camera-ready version
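As an illustration of the quantity itself, the sketch below estimates gradient confusion for a toy model as the most negative inner product among per-sample gradients; the model, data, and loss are arbitrary stand-ins, not the paper's experimental setup.

```python
# Toy measurement of gradient confusion: the most negative pairwise
# inner product between per-sample stochastic gradients.
import itertools
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
xs, ys = torch.randn(8, 10), torch.randn(8, 1)

def per_sample_grad(x, y):
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

grads = [per_sample_grad(xs[i:i + 1], ys[i:i + 1]) for i in range(len(xs))]
# High confusion: some pair of samples has strongly negatively correlated
# gradients, so an SGD step for one sample undoes progress on the other.
confusion = max(-torch.dot(gi, gj).item()
                for gi, gj in itertools.combinations(grads, 2))
print(f"gradient confusion estimate: {confusion:.4f}")
```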
Spectral Pruning: Compressing Deep Neural Networks via Spectral Analysis and its Generalization Error
Compression techniques for deep neural network models are becoming very
important for the efficient execution of high-performance deep learning systems
on edge-computing devices. The concept of model compression is also important
for analyzing the generalization error of deep learning, known as the
compression-based error bound. However, there is still a huge gap between a
practically effective compression method and its rigorous background in
statistical learning theory. To resolve this issue, we develop a new
theoretical framework for model compression and propose a new pruning method
called {\it spectral pruning} based on this framework. We define the ``degrees
of freedom'' to quantify the intrinsic dimensionality of a model by using the
eigenvalue distribution of the covariance matrix across the internal nodes and
show that the compression ability is essentially controlled by this quantity.
Moreover, we present a sharp generalization error bound of the compressed model
and characterize the bias--variance tradeoff induced by the compression
procedure. We apply our method to several datasets to justify our theoretical
analyses and show the superiority of the proposed method.
Comment: 17 pages, 4 figures. Accepted at IJCAI-PRICAI 2020. Proceedings of
the Twenty-Ninth International Joint Conference on Artificial Intelligence,
pages 2839--2846
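The "degrees of freedom" can be computed directly from a layer's activation covariance. A hedged sketch follows, assuming the standard form N(lambda) = sum_i mu_i / (mu_i + lambda) over the covariance eigenvalues mu_i; the activations and the regularization value are arbitrary illustrations.

```python
# Degrees of freedom of a hidden layer from the eigenvalue distribution
# of the covariance matrix across its internal nodes.
import numpy as np

def degrees_of_freedom(activations: np.ndarray, lam: float) -> float:
    # activations: (num_samples, num_nodes) outputs of an internal layer.
    cov = np.cov(activations, rowvar=False)
    mu = np.linalg.eigvalsh(cov)       # eigenvalue distribution
    mu = np.clip(mu, 0.0, None)        # guard against tiny negatives
    return float(np.sum(mu / (mu + lam)))

acts = np.random.randn(1000, 256)          # stand-in hidden activations
print(degrees_of_freedom(acts, lam=0.1))   # effective dimensionality
```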
Orthogonal Deep Neural Networks
In this paper, we introduce the algorithms of Orthogonal Deep Neural Networks
(OrthDNNs) to connect with the recent interest in spectrally regularized deep
learning methods. OrthDNNs are theoretically motivated by generalization
analysis of modern DNNs, with the aim to find solution properties of network
weights that guarantee better generalization. To this end, we first prove that
DNNs are locally isometric on data distributions of practical interest; by
using a new covering of the sample space and introducing the local isometry
property of DNNs into generalization analysis, we establish a new
generalization error bound that is both scale- and range-sensitive to the
singular value spectrum of each of the network's weight matrices. We prove
that the optimal bound w.r.t. the degree of isometry is attained when each
weight matrix has a spectrum of equal singular values, among which an
orthogonal weight matrix, or a non-square one with orthonormal rows or
columns, is the most straightforward choice, suggesting the algorithms of
OrthDNNs. We present algorithms for both strict and approximate OrthDNNs,
and for the latter we propose a simple yet
effective algorithm called Singular Value Bounding (SVB), which performs as
well as strict OrthDNNs, but at a much lower computational cost. We also
propose Bounded Batch Normalization (BBN) to make compatible use of batch
normalization with OrthDNNs. We conduct extensive comparative studies by using
modern architectures on benchmark image classification. Experiments show the
efficacy of OrthDNNs.
Comment: To appear in IEEE Transactions on Pattern Analysis and Machine
Intelligence
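A minimal sketch of the SVB idea as stated in the abstract: periodically project each weight matrix's singular values into a narrow band around one. The band width and the notion of calling this every few SGD iterations are illustrative choices, not the paper's tuned settings.

```python
# Singular Value Bounding (sketch): clamp a weight matrix's spectrum
# into [1/(1+eps), 1+eps], making it approximately orthogonal.
import torch

@torch.no_grad()
def singular_value_bounding(weight: torch.Tensor, eps: float = 0.05) -> None:
    u, s, vh = torch.linalg.svd(weight, full_matrices=False)
    s = s.clamp(min=1.0 / (1.0 + eps), max=1.0 + eps)  # bound the spectrum
    weight.copy_(u @ torch.diag(s) @ vh)               # write back in place

w = torch.randn(128, 64)
singular_value_bounding(w)
# All singular values now lie in the prescribed band around 1.
print(torch.linalg.svdvals(w).min().item(), torch.linalg.svdvals(w).max().item())
```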
LipschitzLR: Using theoretically computed adaptive learning rates for fast convergence
Optimizing deep neural networks is largely thought to be an empirical
process, requiring manual tuning of several hyper-parameters, such as learning
rate, weight decay, and dropout rate. Arguably, the learning rate is the most
important of these to tune, and this has gained more attention in recent works.
In this paper, we propose a novel method to compute the learning rate for
training deep neural networks with stochastic gradient descent. We first derive
a theoretical framework to compute learning rates dynamically based on the
Lipschitz constant of the loss function. We then extend this framework to other
commonly used optimization algorithms, such as gradient descent with momentum
and Adam. We run an extensive set of experiments that demonstrate the efficacy
of our approach on popular architectures and datasets, and show that commonly
used learning rates are an order of magnitude smaller than the ideal value.
Comment: v4; comparison studies added
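To make the idea concrete, the sketch below numerically probes the Lipschitz constant of a loss gradient and sets lr = 1/L. The paper derives closed-form bounds per loss function; this finite-difference probe is only an assumed stand-in.

```python
# Empirical (lower) estimate of the Lipschitz constant of a gradient
# map, used to choose a theoretically motivated learning rate.
import torch

def estimate_lipschitz(grad_fn, w: torch.Tensor, num_probes: int = 100) -> float:
    # L >= ||g(w1) - g(w2)|| / ||w1 - w2|| for any pair of points, so the
    # max over random pairs lower-bounds the true constant.
    est = 0.0
    for _ in range(num_probes):
        w1 = w + 0.1 * torch.randn_like(w)
        w2 = w + 0.1 * torch.randn_like(w)
        ratio = torch.norm(grad_fn(w1) - grad_fn(w2)) / torch.norm(w1 - w2)
        est = max(est, ratio.item())
    return est

# Quadratic toy loss 0.5 * w^T A w, whose gradient A @ w has Lipschitz
# constant ||A||_2, so the estimate can be checked against the truth.
A = torch.randn(20, 20)
A = A @ A.T / 20
L = estimate_lipschitz(lambda w: A @ w, torch.randn(20))
print(f"estimated L = {L:.3f}, suggested lr = {1.0 / L:.3f}")
```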
Privacy Risks of Securing Machine Learning Models against Adversarial Examples
The arms race between attacks and defenses for machine learning models has
come to a forefront in recent years, in both the security community and the
privacy community. However, one big limitation of previous research is that the
security domain and the privacy domain have typically been considered
separately. It is thus unclear whether the defense methods in one domain will
have any unexpected impact on the other domain.
In this paper, we take a step towards resolving this limitation by combining
the two domains. In particular, we measure the success of membership inference
attacks against six state-of-the-art defense methods that mitigate the risk of
adversarial examples (i.e., evasion attacks). Membership inference attacks
determine whether or not an individual data record has been part of a model's
training set. The accuracy of such attacks reflects the information leakage of
training algorithms about individual members of the training set. Defense
methods against adversarial examples influence the model's decision
boundaries such that model predictions remain unchanged within a small area
around each input. However, this objective is optimized on the training
data. Thus,
individual data records in the training set have a significant influence on
robust models. This makes the models more vulnerable to inference attacks.
To perform the membership inference attacks, we leverage the existing
inference methods that exploit model predictions. We also propose two new
inference methods that exploit structural properties of robust models on
adversarially perturbed data. Our experimental evaluation demonstrates that
compared with the natural training (undefended) approach, adversarial defense
methods can indeed increase the target model's risk against membership
inference attacks.
Comment: ACM CCS 2019; code is available at
https://github.com/inspire-group/privacy-vs-robustness
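For intuition, here is a sketch of the simplest confidence-thresholding membership inference baseline that attacks of this kind build on; the score distributions are synthetic stand-ins that only mimic the member/non-member confidence gap that robust models tend to widen.

```python
# Confidence-thresholding membership inference baseline: predict
# "training member" when the model's confidence in the record's true
# label exceeds a threshold.
import numpy as np

def infer_membership(true_label_conf: np.ndarray, threshold: float) -> np.ndarray:
    # True => the record is predicted to be in the training set.
    return true_label_conf >= threshold

rng = np.random.default_rng(0)
member_conf = rng.beta(8, 2, size=1000)     # stand-in: tightly fit members
nonmember_conf = rng.beta(5, 5, size=1000)  # stand-in: held-out records
scores = np.concatenate([member_conf, nonmember_conf])
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
preds = infer_membership(scores, threshold=0.7)
print("attack accuracy:", (preds == labels).mean())
```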
Model Similarity Mitigates Test Set Overuse
Excessive reuse of test data has become commonplace in today's machine
learning workflows. Popular benchmarks, competitions, and industrial-scale
tuning, among other applications, all involve test data reuse beyond the
guidance of statistical confidence bounds. Nonetheless, recent replication
studies give
evidence that popular benchmarks continue to support progress despite years of
extensive reuse. We proffer a new explanation for the apparent longevity of
test data: Many proposed models are similar in their predictions and we prove
that this similarity mitigates overfitting. Specifically, we show empirically
that models proposed for the ImageNet ILSVRC benchmark agree in their
predictions well beyond what we can conclude from their accuracy levels alone.
Likewise, models created by large scale hyperparameter search enjoy high levels
of similarity. Motivated by these empirical observations, we give a
non-asymptotic generalization bound that takes similarity into account, leading
to meaningful confidence bounds in practical settings.
Comment: 18 pages, 7 figures
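As a rough illustration of the agreement measurements, the sketch below compares two synthetic models' prediction agreement with the agreement expected if their errors were independent. The independence baseline formula assumes errors spread uniformly over the wrong classes; it is our simplification for illustration, not the paper's bound.

```python
# Pairwise prediction agreement versus an independent-errors baseline.
import numpy as np

def agreement(preds_a: np.ndarray, preds_b: np.ndarray) -> float:
    return float((preds_a == preds_b).mean())

def independence_baseline(acc_a: float, acc_b: float, k: int) -> float:
    # Agree when both are correct, or when both are wrong and happen to
    # pick the same one of the k - 1 incorrect classes.
    return acc_a * acc_b + (1 - acc_a) * (1 - acc_b) / (k - 1)

rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=5000)  # stand-in ground-truth labels
# Two stand-in models, each roughly 80% accurate with independent mistakes.
pa = np.where(rng.random(5000) < 0.8, y, rng.integers(0, 10, size=5000))
pb = np.where(rng.random(5000) < 0.8, y, rng.integers(0, 10, size=5000))
print("observed agreement:", agreement(pa, pb))
print("independence baseline:", independence_baseline(0.8, 0.8, 10))
```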
Semantic Adversarial Attacks: Parametric Transformations That Fool Deep Classifiers
Deep neural networks have been shown to exhibit an intriguing vulnerability
to adversarial input images corrupted with imperceptible perturbations.
However, the majority of adversarial attacks assume global, fine-grained
control over the image pixel space. In this paper, we consider a different
setting: what happens if the adversary could only alter specific attributes of
the input image? Such alterations would generate inputs that might be
perceptibly different, yet still natural-looking and sufficient to fool a
classifier. We
propose a novel approach to generate such `semantic' adversarial examples by
optimizing a particular adversarial loss over the range-space of a parametric
conditional generative model. We demonstrate implementations of our attacks on
binary classifiers trained on face images, and show that such natural-looking
semantic adversarial examples exist. We evaluate the effectiveness of our
attack on synthetic and real data, and present detailed comparisons with
existing attack methods. We supplement our empirical results with theoretical
bounds that demonstrate the existence of such parametric adversarial examples.
Comment: Accepted to the International Conference on Computer Vision (ICCV)
2019
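A schematic of the attack loop with toy stand-in networks: the latent code is frozen and only the conditioning attributes are optimized against the classifier. A real attack would use a pretrained attribute-conditioned generator and the actual target model; all module names and sizes here are hypothetical.

```python
# Semantic attack sketch: optimize the adversarial loss over the
# attribute space of a conditional generator, not the pixel space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyGenerator(nn.Module):
    def __init__(self, z_dim=16, attr_dim=4):
        super().__init__()
        self.fc = nn.Linear(z_dim + attr_dim, 3 * 8 * 8)

    def forward(self, z, attrs):
        img = torch.tanh(self.fc(torch.cat([z, attrs], dim=1)))
        return img.view(-1, 3, 8, 8)

class ToyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3 * 8 * 8, 2)

    def forward(self, img):
        return self.fc(img.flatten(1))

generator, classifier = ToyGenerator(), ToyClassifier()
z = torch.randn(1, 16)                          # fixed latent code
attrs = torch.zeros(1, 4, requires_grad=True)   # attributes to optimize
target = torch.tensor([1])                      # label the adversary wants
opt = torch.optim.Adam([attrs], lr=0.05)
for _ in range(200):
    loss = F.cross_entropy(classifier(generator(z, attrs)), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
print("classifier now predicts:",
      classifier(generator(z, attrs)).argmax(1).item())
```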
Improving Generalization Performance by Switching from Adam to SGD
Despite superior training outcomes, adaptive optimization methods such as
Adam, Adagrad, or RMSprop have been found to generalize poorly compared to
stochastic gradient descent (SGD). These methods tend to perform well in the
initial portion of training but are outperformed by SGD at later stages of
training. We investigate a hybrid strategy that begins training with an
adaptive method and switches to SGD when appropriate. Concretely, we propose
SWATS, a simple strategy which switches from Adam to SGD when a triggering
condition is satisfied. The condition we propose relates to the projection of
Adam steps on the gradient subspace. By design, the monitoring process for this
condition adds very little overhead and does not increase the number of
hyperparameters in the optimizer. We report experiments on several standard
benchmarks: ResNet, SENet, DenseNet, and PyramidNet on the CIFAR-10 and
CIFAR-100 data sets; ResNet on the tiny-ImageNet data set; and language
modeling with recurrent networks on the PTB and WT2 data sets. The results
show that our strategy is capable of closing the generalization gap between
SGD and Adam on a majority of the tasks.
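A simplified sketch of the trigger, assuming the projection rule gamma = (p . p) / (-p . g) for an Adam step p taken at gradient g, with a bias-corrected exponential average checked for stabilization; the toy objective, threshold, and burn-in guard are illustrative choices, not the paper's tuned settings.

```python
# SWATS-style trigger sketch: switch from Adam to SGD once the SGD
# learning rate implied by Adam's steps stabilizes.
import torch

w = torch.randn(10, requires_grad=True)
opt = torch.optim.Adam([w], lr=0.01)
beta, lam, eps = 0.999, 0.0, 1e-3
for k in range(1, 5001):
    opt.zero_grad()
    loss = (w ** 2).sum()            # toy objective
    loss.backward()
    prev = w.detach().clone()
    opt.step()
    p = w.detach() - prev            # the step Adam actually took
    # gamma_k: the SGD learning rate producing the same displacement
    # along the gradient direction, gamma = (p . p) / (-p . g).
    denom = -torch.dot(p, w.grad)
    if denom.abs() < 1e-12:
        continue
    gamma = (torch.dot(p, p) / denom).item()
    lam = beta * lam + (1 - beta) * gamma
    # Burn-in guard: at k = 1 the corrected average trivially equals gamma.
    if k > 1 and abs(lam / (1 - beta ** k) - gamma) < eps:
        print(f"switch to SGD at iteration {k}, lr ~ {gamma:.4f}")
        break
```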
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Training large deep neural networks on massive datasets is computationally
very challenging. There has been a recent surge of interest in using large
batch stochastic optimization methods to tackle this issue. The most prominent
algorithm in this line of research is LARS, which by employing layerwise
adaptive learning rates trains ResNet on ImageNet in a few minutes. However,
LARS performs poorly for attention models like BERT, indicating that its
performance gains are not consistent across tasks. In this paper, we first
study a principled layerwise adaptation strategy to accelerate training of deep
neural networks using large mini-batches. Using this strategy, we develop a new
layerwise adaptive large batch optimization technique called LAMB; we then
provide convergence analysis of LAMB as well as LARS, showing convergence to a
stationary point in general nonconvex settings. Our empirical results
demonstrate the superior performance of LAMB across various tasks such as BERT
and ResNet-50 training with very little hyperparameter tuning. In particular,
for BERT training, our optimizer enables the use of very large batch sizes of 32868
without any degradation of performance. By increasing the batch size to the
memory limit of a TPUv3 Pod, BERT training time can be reduced from 3 days to
just 76 minutes (Table 1). The LAMB implementation is available at
https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py
Comment: Published as a conference paper at ICLR 2020
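A hedged single-tensor sketch of the LAMB update: an Adam-style direction rescaled per layer by the trust ratio ||w|| / ||update||. Hyperparameters and the absence of trust-ratio clipping are simplifications; the TensorFlow Addons file linked above is the reference implementation.

```python
# One LAMB step for a single parameter tensor (one "layer").
import torch

def lamb_step(w, grad, m, v, step, lr=1e-3, b1=0.9, b2=0.999,
              eps=1e-6, weight_decay=0.01):
    m.mul_(b1).add_(grad, alpha=1 - b1)            # first-moment estimate
    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)  # second-moment estimate
    m_hat = m / (1 - b1 ** step)                   # bias correction
    v_hat = v / (1 - b2 ** step)
    update = m_hat / (v_hat.sqrt() + eps) + weight_decay * w
    # Layerwise adaptation: scale the step to the weight norm so every
    # layer moves a comparable relative distance, even with huge batches.
    trust_ratio = w.norm() / update.norm()
    w.sub_(lr * trust_ratio * update)

w = torch.randn(256, 128)
m, v = torch.zeros_like(w), torch.zeros_like(w)
for t in range(1, 11):
    lamb_step(w, torch.randn_like(w), m, v, t)  # stand-in gradients
```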