Fast Spatio-Temporal Residual Network for Video Super-Resolution
Recently, deep learning based video super-resolution (SR) methods have
achieved promising performance. To simultaneously exploit the spatial and
temporal information of videos, employing 3-dimensional (3D) convolutions is a
natural approach. However, directly utilizing 3D convolutions may lead to
excessively high computational complexity, which restricts the depth of video
SR models and thus undermines performance. In this paper, we present a novel
fast spatio-temporal residual network (FSTRN) that adopts 3D convolutions for
the video SR task to enhance performance while maintaining a low computational
load. Specifically, we propose a fast spatio-temporal residual block (FRB)
that divides each 3D filter into the product of two 3D filters, which
have considerably lower dimensions. Furthermore, we design a cross-space
residual learning scheme that directly links the low-resolution space and the
high-resolution space, which can greatly relieve the computational burden on
the feature fusion and up-scaling parts. Extensive evaluations and comparisons
on benchmark datasets validate the strengths of the proposed approach and
demonstrate that the proposed network significantly outperforms the current
state-of-the-art methods.
Comment: To appear in CVPR 2019
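The main cost saving described above is the filter factorization inside the FRB. Below is a minimal sketch of that idea, assuming PyTorch: a k x k x k convolution is split into a 1 x k x k spatial convolution followed by a k x 1 x 1 temporal one. The module name, activation placement, and shapes are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a fast spatio-temporal residual block (FRB): one expensive
# k x k x k 3D convolution is factorized into two cheaper 3D filters.
import torch
import torch.nn as nn

class FastResidualBlock(nn.Module):
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        # Spatial filter: 1 x k x k over (frames, height, width).
        self.spatial = nn.Conv3d(channels, channels,
                                 kernel_size=(1, k, k),
                                 padding=(0, k // 2, k // 2))
        # Temporal filter: k x 1 x 1.
        self.temporal = nn.Conv3d(channels, channels,
                                  kernel_size=(k, 1, 1),
                                  padding=(k // 2, 0, 0))
        self.act = nn.PReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the block cheap to stack deeply.
        return x + self.temporal(self.act(self.spatial(self.act(x))))

# A batch of 4 clips: 64 channels, 5 frames, 32x32 spatial patches.
frb = FastResidualBlock(64)
out = frb(torch.randn(4, 64, 5, 32, 32))  # output shape equals input shape
```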
The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent
This paper studies how neural network architecture affects the speed of
training. We introduce a simple concept called gradient confusion to help
formally analyze this. When gradient confusion is high, stochastic gradients
produced by different data samples may be negatively correlated, slowing down
convergence. But when gradient confusion is low, data samples interact
harmoniously, and training proceeds quickly. Through theoretical and
experimental results, we demonstrate how the neural network architecture
affects gradient confusion, and thus the efficiency of training. Our results
show that, for popular initialization techniques, increasing the width of
neural networks leads to lower gradient confusion, and thus faster model
training. On the other hand, increasing the depth of neural networks has the
opposite effect. Our results indicate that alternate initialization techniques
or networks using both batch normalization and skip connections help reduce the
training burden of very deep networks.
Comment: ICML 2020 camera-ready version
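As an illustration of the quantity itself, the sketch below estimates gradient confusion for a toy model as the most negative inner product among per-sample gradients; the model, data, and loss are arbitrary stand-ins, not the paper's experimental setup.

```python
# Toy measurement of gradient confusion: the most negative pairwise
# inner product between per-sample stochastic gradients.
import itertools
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
xs, ys = torch.randn(8, 10), torch.randn(8, 1)

def per_sample_grad(x, y):
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

grads = [per_sample_grad(xs[i:i + 1], ys[i:i + 1]) for i in range(len(xs))]
# High confusion: some pair of samples has strongly negatively correlated
# gradients, so an SGD step for one sample undoes progress on the other.
confusion = max(-torch.dot(gi, gj).item()
                for gi, gj in itertools.combinations(grads, 2))
print(f"gradient confusion estimate: {confusion:.4f}")
```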
Spectral Pruning: Compressing Deep Neural Networks via Spectral Analysis and its Generalization Error
Compression techniques for deep neural network models are becoming very
important for the efficient execution of high-performance deep learning systems
on edge-computing devices. The concept of model compression is also important
for analyzing the generalization error of deep learning, known as the
compression-based error bound. However, there is still a huge gap between a
practically effective compression method and its rigorous background in
statistical learning theory. To resolve this issue, we develop a new
theoretical framework for model compression and propose a new pruning method
called {\it spectral pruning} based on this framework. We define the ``degrees
of freedom'' to quantify the intrinsic dimensionality of a model by using the
eigenvalue distribution of the covariance matrix across the internal nodes and
show that the compression ability is essentially controlled by this quantity.
Moreover, we present a sharp generalization error bound of the compressed model
and characterize the bias--variance tradeoff induced by the compression
procedure. We apply our method to several datasets to justify our theoretical
analyses and show the superiority of the proposed method.
Comment: 17 pages, 4 figures. Accepted at IJCAI-PRICAI 2020. Proceedings of
the Twenty-Ninth International Joint Conference on Artificial Intelligence,
pages 2839--2846
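The "degrees of freedom" can be computed directly from a layer's activation covariance. A hedged sketch follows, assuming the standard form N(lambda) = sum_i mu_i / (mu_i + lambda) over the covariance eigenvalues mu_i; the activations and the regularization value are arbitrary illustrations.

```python
# Degrees of freedom of a hidden layer from the eigenvalue distribution
# of the covariance matrix across its internal nodes.
import numpy as np

def degrees_of_freedom(activations: np.ndarray, lam: float) -> float:
    # activations: (num_samples, num_nodes) outputs of an internal layer.
    cov = np.cov(activations, rowvar=False)
    mu = np.linalg.eigvalsh(cov)       # eigenvalue distribution
    mu = np.clip(mu, 0.0, None)        # guard against tiny negatives
    return float(np.sum(mu / (mu + lam)))

acts = np.random.randn(1000, 256)          # stand-in hidden activations
print(degrees_of_freedom(acts, lam=0.1))   # effective dimensionality
```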
Orthogonal Deep Neural Networks
In this paper, we introduce the algorithms of Orthogonal Deep Neural Networks
(OrthDNNs) to connect with the recent interest in spectrally regularized deep
learning methods. OrthDNNs are theoretically motivated by generalization
analysis of modern DNNs, with the aim to find solution properties of network
weights that guarantee better generalization. To this end, we first prove that
DNNs are locally isometric on data distributions of practical interest; by
using a new covering of the sample space and introducing the local isometry
property of DNNs into generalization analysis, we establish a new
generalization error bound that is both scale- and range-sensitive to the
singular value spectrum of each of the network's weight matrices. We prove
that the optimal bound w.r.t. the degree of isometry is attained when each
weight matrix has a spectrum of equal singular values, among which an
orthogonal weight matrix, or a non-square one with orthonormal rows or
columns, is the most straightforward choice, suggesting the algorithms of
OrthDNNs. We present algorithms for both strict and approximate OrthDNNs,
and for the latter we propose a simple yet
effective algorithm called Singular Value Bounding (SVB), which performs as
well as strict OrthDNNs, but at a much lower computational cost. We also
propose Bounded Batch Normalization (BBN) to make compatible use of batch
normalization with OrthDNNs. We conduct extensive comparative studies by using
modern architectures on benchmark image classification. Experiments show the
efficacy of OrthDNNs.
Comment: To appear in IEEE Transactions on Pattern Analysis and Machine
Intelligence
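A minimal sketch of the SVB idea as stated in the abstract: periodically project each weight matrix's singular values into a narrow band around one. The band width and the notion of calling this every few SGD iterations are illustrative choices, not the paper's tuned settings.

```python
# Singular Value Bounding (sketch): clamp a weight matrix's spectrum
# into [1/(1+eps), 1+eps], making it approximately orthogonal.
import torch

@torch.no_grad()
def singular_value_bounding(weight: torch.Tensor, eps: float = 0.05) -> None:
    u, s, vh = torch.linalg.svd(weight, full_matrices=False)
    s = s.clamp(min=1.0 / (1.0 + eps), max=1.0 + eps)  # bound the spectrum
    weight.copy_(u @ torch.diag(s) @ vh)               # write back in place

w = torch.randn(128, 64)
singular_value_bounding(w)
# All singular values now lie in the prescribed band around 1.
print(torch.linalg.svdvals(w).min().item(), torch.linalg.svdvals(w).max().item())
```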
LipschitzLR: Using theoretically computed adaptive learning rates for fast convergence
Optimizing deep neural networks is largely thought to be an empirical
process, requiring manual tuning of several hyper-parameters, such as learning
rate, weight decay, and dropout rate. Arguably, the learning rate is the most
important of these to tune, and this has gained more attention in recent works.
In this paper, we propose a novel method to compute the learning rate for
training deep neural networks with stochastic gradient descent. We first derive
a theoretical framework to compute learning rates dynamically based on the
Lipschitz constant of the loss function. We then extend this framework to other
commonly used optimization algorithms, such as gradient descent with momentum
and Adam. We run an extensive set of experiments that demonstrate the efficacy
of our approach on popular architectures and datasets, and show that commonly
used learning rates are an order of magnitude smaller than the ideal value.
Comment: v4; comparison studies added
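To make the idea concrete, the sketch below numerically probes the Lipschitz constant of a loss gradient and sets lr = 1/L. The paper derives closed-form bounds per loss function; this finite-difference probe is only an assumed stand-in.

```python
# Empirical (lower) estimate of the Lipschitz constant of a gradient
# map, used to choose a theoretically motivated learning rate.
import torch

def estimate_lipschitz(grad_fn, w: torch.Tensor, num_probes: int = 100) -> float:
    # L >= ||g(w1) - g(w2)|| / ||w1 - w2|| for any pair of points, so the
    # max over random pairs lower-bounds the true constant.
    est = 0.0
    for _ in range(num_probes):
        w1 = w + 0.1 * torch.randn_like(w)
        w2 = w + 0.1 * torch.randn_like(w)
        ratio = torch.norm(grad_fn(w1) - grad_fn(w2)) / torch.norm(w1 - w2)
        est = max(est, ratio.item())
    return est

# Quadratic toy loss 0.5 * w^T A w, whose gradient A @ w has Lipschitz
# constant ||A||_2, so the estimate can be checked against the truth.
A = torch.randn(20, 20)
A = A @ A.T / 20
L = estimate_lipschitz(lambda w: A @ w, torch.randn(20))
print(f"estimated L = {L:.3f}, suggested lr = {1.0 / L:.3f}")
```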
Privacy Risks of Securing Machine Learning Models against Adversarial Examples
The arms race between attacks and defenses for machine learning models has
come to a forefront in recent years, in both the security community and the
privacy community. However, one big limitation of previous research is that the
security domain and the privacy domain have typically been considered
separately. It is thus unclear whether the defense methods in one domain will
have any unexpected impact on the other domain.
In this paper, we take a step towards resolving this limitation by combining
the two domains. In particular, we measure the success of membership inference
attacks against six state-of-the-art defense methods that mitigate the risk of
adversarial examples (i.e., evasion attacks). Membership inference attacks
determine whether or not an individual data record has been part of a model's
training set. The accuracy of such attacks reflects the information leakage of
training algorithms about individual members of the training set. Defense
methods against adversarial examples influence the model's decision
boundaries such that model predictions remain unchanged within a small area
around each input. However, this objective is optimized on the training
data. Thus,
individual data records in the training set have a significant influence on
robust models. This makes the models more vulnerable to inference attacks.
To perform the membership inference attacks, we leverage the existing
inference methods that exploit model predictions. We also propose two new
inference methods that exploit structural properties of robust models on
adversarially perturbed data. Our experimental evaluation demonstrates that
compared with the natural training (undefended) approach, adversarial defense
methods can indeed increase the target model's risk against membership
inference attacks.
Comment: ACM CCS 2019; code is available at
https://github.com/inspire-group/privacy-vs-robustness
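For intuition, here is a sketch of the simplest confidence-thresholding membership inference baseline that attacks of this kind build on; the score distributions are synthetic stand-ins that only mimic the member/non-member confidence gap that robust models tend to widen.

```python
# Confidence-thresholding membership inference baseline: predict
# "training member" when the model's confidence in the record's true
# label exceeds a threshold.
import numpy as np

def infer_membership(true_label_conf: np.ndarray, threshold: float) -> np.ndarray:
    # True => the record is predicted to be in the training set.
    return true_label_conf >= threshold

rng = np.random.default_rng(0)
member_conf = rng.beta(8, 2, size=1000)     # stand-in: tightly fit members
nonmember_conf = rng.beta(5, 5, size=1000)  # stand-in: held-out records
scores = np.concatenate([member_conf, nonmember_conf])
labels = np.concatenate([np.ones(1000), np.zeros(1000)])
preds = infer_membership(scores, threshold=0.7)
print("attack accuracy:", (preds == labels).mean())
```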
Model Similarity Mitigates Test Set Overuse
Excessive reuse of test data has become commonplace in today's machine
learning workflows. Popular benchmarks, competitions, and industrial-scale
tuning, among other applications, all involve test data reuse beyond the
guidance of statistical confidence bounds. Nonetheless, recent replication
studies give
evidence that popular benchmarks continue to support progress despite years of
extensive reuse. We proffer a new explanation for the apparent longevity of
test data: Many proposed models are similar in their predictions and we prove
that this similarity mitigates overfitting. Specifically, we show empirically
that models proposed for the ImageNet ILSVRC benchmark agree in their
predictions well beyond what we can conclude from their accuracy levels alone.
Likewise, models created by large scale hyperparameter search enjoy high levels
of similarity. Motivated by these empirical observations, we give a
non-asymptotic generalization bound that takes similarity into account, leading
to meaningful confidence bounds in practical settings.
Comment: 18 pages, 7 figures
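As a rough illustration of the agreement measurements, the sketch below compares two synthetic models' prediction agreement with the agreement expected if their errors were independent. The independence baseline formula assumes errors spread uniformly over the wrong classes; it is our simplification for illustration, not the paper's bound.

```python
# Pairwise prediction agreement versus an independent-errors baseline.
import numpy as np

def agreement(preds_a: np.ndarray, preds_b: np.ndarray) -> float:
    return float((preds_a == preds_b).mean())

def independence_baseline(acc_a: float, acc_b: float, k: int) -> float:
    # Agree when both are correct, or when both are wrong and happen to
    # pick the same one of the k - 1 incorrect classes.
    return acc_a * acc_b + (1 - acc_a) * (1 - acc_b) / (k - 1)

rng = np.random.default_rng(0)
y = rng.integers(0, 10, size=5000)  # stand-in ground-truth labels
# Two stand-in models, each roughly 80% accurate with independent mistakes.
pa = np.where(rng.random(5000) < 0.8, y, rng.integers(0, 10, size=5000))
pb = np.where(rng.random(5000) < 0.8, y, rng.integers(0, 10, size=5000))
print("observed agreement:", agreement(pa, pb))
print("independence baseline:", independence_baseline(0.8, 0.8, 10))
```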
Semantic Adversarial Attacks: Parametric Transformations That Fool Deep Classifiers
Deep neural networks have been shown to exhibit an intriguing vulnerability
to adversarial input images corrupted with imperceptible perturbations.
However, the majority of adversarial attacks assume global, fine-grained
control over the image pixel space. In this paper, we consider a different
setting: what happens if the adversary could only alter specific attributes of
the input image? Such alterations would generate inputs that might be
perceptibly different, yet still natural-looking and sufficient to fool a
classifier. We
propose a novel approach to generate such `semantic' adversarial examples by
optimizing a particular adversarial loss over the range-space of a parametric
conditional generative model. We demonstrate implementations of our attacks on
binary classifiers trained on face images, and show that such natural-looking
semantic adversarial examples exist. We evaluate the effectiveness of our
attack on synthetic and real data, and present detailed comparisons with
existing attack methods. We supplement our empirical results with theoretical
bounds that demonstrate the existence of such parametric adversarial examples.
Comment: Accepted to the International Conference on Computer Vision (ICCV)
2019
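A schematic of the attack loop with toy stand-in networks: the latent code is frozen and only the conditioning attributes are optimized against the classifier. A real attack would use a pretrained attribute-conditioned generator and the actual target model; all module names and sizes here are hypothetical.

```python
# Semantic attack sketch: optimize the adversarial loss over the
# attribute space of a conditional generator, not the pixel space.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyGenerator(nn.Module):
    def __init__(self, z_dim=16, attr_dim=4):
        super().__init__()
        self.fc = nn.Linear(z_dim + attr_dim, 3 * 8 * 8)

    def forward(self, z, attrs):
        img = torch.tanh(self.fc(torch.cat([z, attrs], dim=1)))
        return img.view(-1, 3, 8, 8)

class ToyClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3 * 8 * 8, 2)

    def forward(self, img):
        return self.fc(img.flatten(1))

generator, classifier = ToyGenerator(), ToyClassifier()
z = torch.randn(1, 16)                          # fixed latent code
attrs = torch.zeros(1, 4, requires_grad=True)   # attributes to optimize
target = torch.tensor([1])                      # label the adversary wants
opt = torch.optim.Adam([attrs], lr=0.05)
for _ in range(200):
    loss = F.cross_entropy(classifier(generator(z, attrs)), target)
    opt.zero_grad()
    loss.backward()
    opt.step()
print("classifier now predicts:",
      classifier(generator(z, attrs)).argmax(1).item())
```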
Improving Generalization Performance by Switching from Adam to SGD
Despite superior training outcomes, adaptive optimization methods such as
Adam, Adagrad, or RMSprop have been found to generalize poorly compared to
stochastic gradient descent (SGD). These methods tend to perform well in the
initial portion of training but are outperformed by SGD at later stages of
training. We investigate a hybrid strategy that begins training with an
adaptive method and switches to SGD when appropriate. Concretely, we propose
SWATS, a simple strategy which switches from Adam to SGD when a triggering
condition is satisfied. The condition we propose relates to the projection of
Adam steps on the gradient subspace. By design, the monitoring process for this
condition adds very little overhead and does not increase the number of
hyperparameters in the optimizer. We report experiments on several standard
benchmarks: ResNet, SENet, DenseNet, and PyramidNet on the CIFAR-10 and
CIFAR-100 data sets; ResNet on the tiny-ImageNet data set; and language
modeling with recurrent networks on the PTB and WT2 data sets. The results
show that our strategy is capable of closing the generalization gap between
SGD and Adam on a majority of the tasks.
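A simplified sketch of the trigger, assuming the projection rule gamma = (p . p) / (-p . g) for an Adam step p taken at gradient g, with a bias-corrected exponential average checked for stabilization; the toy objective, threshold, and burn-in guard are illustrative choices, not the paper's tuned settings.

```python
# SWATS-style trigger sketch: switch from Adam to SGD once the SGD
# learning rate implied by Adam's steps stabilizes.
import torch

w = torch.randn(10, requires_grad=True)
opt = torch.optim.Adam([w], lr=0.01)
beta, lam, eps = 0.999, 0.0, 1e-3
for k in range(1, 5001):
    opt.zero_grad()
    loss = (w ** 2).sum()            # toy objective
    loss.backward()
    prev = w.detach().clone()
    opt.step()
    p = w.detach() - prev            # the step Adam actually took
    # gamma_k: the SGD learning rate producing the same displacement
    # along the gradient direction, gamma = (p . p) / (-p . g).
    denom = -torch.dot(p, w.grad)
    if denom.abs() < 1e-12:
        continue
    gamma = (torch.dot(p, p) / denom).item()
    lam = beta * lam + (1 - beta) * gamma
    # Burn-in guard: at k = 1 the corrected average trivially equals gamma.
    if k > 1 and abs(lam / (1 - beta ** k) - gamma) < eps:
        print(f"switch to SGD at iteration {k}, lr ~ {gamma:.4f}")
        break
```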
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Training large deep neural networks on massive datasets is computationally
very challenging. There has been a recent surge of interest in using large
batch stochastic optimization methods to tackle this issue. The most prominent
algorithm in this line of research is LARS, which by employing layerwise
adaptive learning rates trains ResNet on ImageNet in a few minutes. However,
LARS performs poorly for attention models like BERT, indicating that its
performance gains are not consistent across tasks. In this paper, we first
study a principled layerwise adaptation strategy to accelerate training of deep
neural networks using large mini-batches. Using this strategy, we develop a new
layerwise adaptive large batch optimization technique called LAMB; we then
provide convergence analysis of LAMB as well as LARS, showing convergence to a
stationary point in general nonconvex settings. Our empirical results
demonstrate the superior performance of LAMB across various tasks such as BERT
and ResNet-50 training with very little hyperparameter tuning. In particular,
for BERT training, our optimizer enables the use of very large batch sizes of 32868
without any degradation of performance. By increasing the batch size to the
memory limit of a TPUv3 Pod, BERT training time can be reduced from 3 days to
just 76 minutes (Table 1). The LAMB implementation is available at
https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py
Comment: Published as a conference paper at ICLR 2020
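A hedged single-tensor sketch of the LAMB update: an Adam-style direction rescaled per layer by the trust ratio ||w|| / ||update||. Hyperparameters and the absence of trust-ratio clipping are simplifications; the TensorFlow Addons file linked above is the reference implementation.

```python
# One LAMB step for a single parameter tensor (one "layer").
import torch

def lamb_step(w, grad, m, v, step, lr=1e-3, b1=0.9, b2=0.999,
              eps=1e-6, weight_decay=0.01):
    m.mul_(b1).add_(grad, alpha=1 - b1)            # first-moment estimate
    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)  # second-moment estimate
    m_hat = m / (1 - b1 ** step)                   # bias correction
    v_hat = v / (1 - b2 ** step)
    update = m_hat / (v_hat.sqrt() + eps) + weight_decay * w
    # Layerwise adaptation: scale the step to the weight norm so every
    # layer moves a comparable relative distance, even with huge batches.
    trust_ratio = w.norm() / update.norm()
    w.sub_(lr * trust_ratio * update)

w = torch.randn(256, 128)
m, v = torch.zeros_like(w), torch.zeros_like(w)
for t in range(1, 11):
    lamb_step(w, torch.randn_like(w), m, v, t)  # stand-in gradients
```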