3,194 research outputs found

    Learning Strict Identity Mappings in Deep Residual Networks

    Full text link
    A family of super deep networks, referred to as residual networks or ResNets, achieved record-beating performance in various visual tasks such as image recognition, object detection, and semantic segmentation. The ability to train very deep networks naturally pushed researchers to use enormous resources to achieve the best performance. Consequently, in many applications super deep residual networks were employed for just a marginal improvement in performance. In this paper, we propose epsilon-ResNet, which automatically discards redundant layers, namely those whose responses are smaller than a threshold epsilon, with marginal or no loss in performance. The epsilon-ResNet architecture can be obtained by adding a few rectified linear units to the original ResNet. Our method does not use any additional variables or numerous trials, unlike other hyper-parameter optimization techniques. Layer selection is achieved in a single training process, and the evaluation is performed on the CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets. In some instances, we achieve about an 80% reduction in the number of parameters. Comment: Make title consistent with the CVPR version
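
    A minimal sketch of the gating idea described above, written in PyTorch. The module name, the choice of residual branch, and the value of epsilon are illustrative assumptions; the paper builds an equivalent indicator out of a few extra rectified linear units rather than an explicit comparison.

        import torch.nn as nn

        class EpsilonResidualBlock(nn.Module):
            """If every response of the residual branch F(x) stays below the
            threshold epsilon, the branch is treated as redundant and the block
            reduces to the identity mapping. The gate is written here as a direct
            comparison for clarity; the paper realizes it with a few ReLUs."""
            def __init__(self, residual_branch: nn.Module, epsilon: float = 2.5):
                super().__init__()
                self.residual_branch = residual_branch  # e.g. conv-BN-ReLU-conv-BN
                self.epsilon = epsilon

            def forward(self, x):
                f = self.residual_branch(x)
                # 1.0 if any response exceeds epsilon in magnitude, else 0.0
                gate = (f.abs().amax(dim=(1, 2, 3), keepdim=True) >= self.epsilon).float()
                return x + gate * f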

    Competitive Inner-Imaging Squeeze and Excitation for Residual Network

    Full text link
    Residual networks, which use residual units to supplement the identity mappings, enable very deep convolutional architectures to perform well; however, the residual architecture has been shown to be diverse and redundant, which may lead to inefficient modeling. In this work, we propose a competitive squeeze-excitation (SE) mechanism for the residual network. In this structure, the re-scaling value for each channel is determined jointly by the residual and identity mappings, and this design allows us to expand the meaning of channel-relationship modeling in residual blocks: modeling the competition between residual and identity mappings causes the identity flow to control the complement of the residual feature maps for itself. Furthermore, we design a novel inner-imaging competitive SE block to reduce the consumption and re-image the global features of the intermediate network structure; with the inner-imaging mechanism, we can model channel-wise relations with convolution in the spatial domain. We carry out experiments on the CIFAR, SVHN, and ImageNet datasets, and the proposed method can challenge state-of-the-art results. Comment: Code is available at https://github.com/scut-aitcm/Competitive-Inner-Imaging-SENe
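
    A hedged sketch of a competitive squeeze-excitation gate in PyTorch: channel descriptors are pooled from both the identity and residual flows and jointly produce the rescaling of the residual feature maps. The layer sizes and the fusion by concatenation are illustrative choices, not the paper's exact inner-imaging design.

        import torch
        import torch.nn as nn

        class CompetitiveSE(nn.Module):
            def __init__(self, channels: int, reduction: int = 16):
                super().__init__()
                self.pool = nn.AdaptiveAvgPool2d(1)
                self.fc = nn.Sequential(
                    nn.Linear(2 * channels, channels // reduction),
                    nn.ReLU(inplace=True),
                    nn.Linear(channels // reduction, channels),
                    nn.Sigmoid(),
                )

            def forward(self, identity, residual):
                b, c, _, _ = residual.shape
                # squeeze both flows into channel descriptors and let them compete
                z = torch.cat([self.pool(identity).view(b, c),
                               self.pool(residual).view(b, c)], dim=1)
                scale = self.fc(z).view(b, c, 1, 1)
                return identity + scale * residual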

    Representation Learning on Large and Small Data

    Full text link
    Deep learning owes its success to three key factors: the scale of data, enhanced models to learn representations from data, and the scale of computation. This book chapter presented the importance of the data-driven approach for learning good representations from both big data and small data. In terms of big data, it has been widely accepted in the research community that more data is better for improving both representation and classification. The question is then how to learn representations from big data, and how to perform representation learning when data is scarce. We addressed the first question by presenting CNN model enhancements in the aspects of representation, optimization, and generalization. To address the small-data challenge, we showed transfer representation learning to be effective. Transfer representation learning transfers the learned representation from a source domain where abundant training data is available to a target domain where training data is scarce. Transfer representation learning gave the OM and melanoma diagnosis modules of our XPRIZE Tricorder device (which finished 2nd out of 310 competing teams) a significant boost in diagnosis accuracy. Comment: Book chapter
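
    A minimal sketch of transfer representation learning as described above, assuming a PyTorch/torchvision setup (torchvision >= 0.13 for the weights argument): the backbone, ImageNet as the source domain, and the two-class target head are illustrative stand-ins, not the chapter's actual diagnosis modules.

        import torch.nn as nn
        import torchvision.models as models

        # reuse a representation learned on a large source dataset and fine-tune
        # only a new classifier head on the scarce target-domain data
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        for p in backbone.parameters():
            p.requires_grad = False            # keep the transferred representation fixed
        backbone.fc = nn.Linear(backbone.fc.in_features, 2)  # hypothetical 2-class target task
        # train backbone.fc on the small target dataset with a standard optimizer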

    Orthogonal Deep Neural Networks

    Full text link
    In this paper, we introduce the algorithms of Orthogonal Deep Neural Networks (OrthDNNs) to connect with the recent interest in spectrally regularized deep learning methods. OrthDNNs are theoretically motivated by generalization analysis of modern DNNs, with the aim of finding solution properties of network weights that guarantee better generalization. To this end, we first prove that DNNs are of local isometry on data distributions of practical interest; by using a new covering of the sample space and introducing the local isometry property of DNNs into generalization analysis, we establish a new generalization error bound that is both scale- and range-sensitive to the singular value spectrum of each of the network's weight matrices. We prove that the optimal bound with respect to the degree of isometry is attained when each weight matrix has a spectrum of equal singular values, for which an orthogonal weight matrix, or a non-square one with orthonormal rows or columns, is the most straightforward choice, suggesting the algorithms of OrthDNNs. We present algorithms for both strict and approximate OrthDNNs, and for the latter we propose a simple yet effective algorithm called Singular Value Bounding (SVB), which performs as well as strict OrthDNNs but at a much lower computational cost. We also propose Bounded Batch Normalization (BBN) to make compatible use of batch normalization with OrthDNNs. We conduct extensive comparative studies using modern architectures on benchmark image classification. Experiments show the efficacy of OrthDNNs. Comment: To appear in IEEE Transactions on Pattern Analysis and Machine Intelligence
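
    A hedged sketch of the Singular Value Bounding step in PyTorch: every so many SGD iterations, each weight matrix is decomposed and its singular values are clipped into a narrow band around 1, keeping the layer close to an orthogonal map. The band width eps and the flattening of convolution kernels into 2-D matrices are assumptions of this sketch rather than the paper's exact settings.

        import torch

        def singular_value_bounding(model, eps: float = 0.05):
            with torch.no_grad():
                for w in model.parameters():
                    if w.dim() < 2:
                        continue                       # skip biases and BN parameters
                    mat = w.view(w.size(0), -1)        # flatten conv kernels to 2-D
                    u, s, vh = torch.linalg.svd(mat, full_matrices=False)
                    s = s.clamp(1.0 / (1.0 + eps), 1.0 + eps)
                    w.copy_((u @ torch.diag(s) @ vh).view_as(w))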

    How deep learning works -- The geometry of deep learning

    Full text link
    Why and how deep learning works well on different tasks remains a mystery from a theoretical perspective. In this paper we draw a geometric picture of the deep learning system by finding its analogies with two existing geometric structures: the geometry of quantum computations and the geometry of diffeomorphic template matching. In this framework, we give the geometric structures of different deep learning systems, including convolutional neural networks, residual networks, recursive neural networks, recurrent neural networks, and the equilibrium propagation framework. We can also analyze the relationship between the geometric structures and the performance of different networks at an algorithmic level, so that the geometric framework may guide the design of the structures and algorithms of deep learning systems. Comment: 16 pages, 13 figures

    Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior

    Full text link
    Traditional (unstructured) pruning methods for a Transformer model focus on regularizing the individual weights by penalizing them toward zero. In this work, we explore spectral-normalized identity priors (SNIP), a structured pruning approach that penalizes an entire residual module in a Transformer model toward an identity mapping. Our method identifies and discards unimportant non-linear mappings in the residual connections by applying a thresholding operator on the function norm. It is applicable to any structured module, including a single attention head, an entire attention block, or a feed-forward subnetwork. Furthermore, we introduce spectral normalization to stabilize the distribution of the post-activation values of the Transformer layers, further improving the pruning effectiveness of the proposed methodology. We conduct experiments with BERT on 5 GLUE benchmark tasks to demonstrate that SNIP achieves effective pruning results while maintaining comparable performance. Specifically, we improve over the state of the art by 0.5 to 1.0% on average at a 50% compression ratio. Comment: Findings of EMNLP 202
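
    A hedged sketch of the thresholding step, assuming the residual modules and a small calibration set are available as plain callables and batches: the norm estimate and the threshold tau are illustrative stand-ins for the regularized training and function-norm criterion described above.

        import torch

        def prune_residual_modules(modules, calibration_batches, tau: float = 0.1):
            # modules: dict mapping names to residual sub-layers f with y = x + f(x)
            keep = {}
            with torch.no_grad():
                for name, module in modules.items():
                    total, count = 0.0, 0
                    for x in calibration_batches:
                        f = module(x)                          # residual contribution only
                        total += f.norm(dim=-1).mean().item()
                        count += 1
                    keep[name] = (total / max(count, 1)) >= tau
            return keep  # modules mapped to False can be replaced by an identity mapping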

    Batch Normalization and the impact of batch structure on the behavior of deep convolution networks

    Full text link
    Batch normalization was introduced in 2015 to speed up training of deep convolution networks by normalizing the activations across the current batch to have zero mean and unit variance. The results presented here show an interesting aspect of batch normalization: controlling the shape of the training batches can influence what the network learns. If training batches are structured as balanced batches (one image per class), and inference is also carried out on balanced test batches using the batch's own means and variances, then the conditional results improve considerably. The network uses the strong information about easy images in a balanced batch and propagates it, through the shared means and variances, to help decide the identity of harder images in the same batch. Balancing the test batches requires the labels of the test images, which are not available in practice; however, further investigation can be done using batch structures that are less strict and might not require the test image labels. The conditional results show the error rate reduced almost to zero for nontrivial datasets with a small number of classes, such as CIFAR-10.
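
    A hedged sketch of the balanced-batch evaluation protocol described above, in PyTorch: leaving batch-norm layers in training mode makes each test batch be normalized by its own mean and variance, and the loader is assumed to yield class-balanced batches (which, as noted, requires the test labels).

        import torch

        def predict_with_balanced_batches(model, balanced_loader):
            model.train()                  # BN uses the current batch's statistics
                                           # (this also toggles dropout, if present)
            preds = []
            with torch.no_grad():
                for images, _labels in balanced_loader:
                    preds.append(model(images).argmax(dim=1))
            return torch.cat(preds)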

    Residual Networks Behave Like Ensembles of Relatively Shallow Networks

    Full text link
    In this work we propose a novel interpretation of residual networks, showing that they can be seen as a collection of many paths of differing length. Moreover, residual networks seem to enable very deep networks by leveraging only the short paths during training. To support this observation, we rewrite residual networks as an explicit collection of paths. Unlike traditional models, paths through residual networks vary in length. Further, a lesion study reveals that these paths show ensemble-like behavior in the sense that they do not strongly depend on each other. Finally, and most surprisingly, most paths are shorter than one might expect, and only the short paths are needed during training, as longer paths do not contribute any gradient. For example, most of the gradient in a residual network with 110 layers comes from paths that are only 10-34 layers deep. Our results reveal one of the key characteristics that seem to enable the training of very deep networks: residual networks avoid the vanishing gradient problem by introducing short paths which can carry gradient throughout the extent of very deep networks. Comment: NIPS 201
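
    A small numerical illustration of the path view: unrolling y = x + f(x) over n residual blocks yields 2^n paths, and the number of paths passing through exactly k residual branches is C(n, k), so the path-length distribution peaks well below the nominal depth. The block count below assumes the usual 54-block structure of a 110-layer CIFAR ResNet.

        from math import comb

        n = 54                                 # residual blocks in a 110-layer ResNet
        paths = [comb(n, k) for k in range(n + 1)]
        total = sum(paths)                     # equals 2**n
        expected_length = sum(k * p for k, p in enumerate(paths)) / total
        print(f"{total} paths, expected path length ~ {expected_length:.1f} blocks")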

    Learning Channel Inter-dependencies at Multiple Scales on Dense Networks for Face Recognition

    Full text link
    We propose a new deep network structure for unconstrained face recognition. The proposed network integrates several key components in order to characterize complex data distributions, such as those found in unconstrained face images. Inspired by recent progress in deep networks, we consider several important concepts, including multi-scale feature learning, dense connections of network layers, and weighting of different network flows, in building our deep network structure. The developed network is evaluated on unconstrained face matching, showing its capability to learn complex data distributions arising from face images of varying quality. Comment: 12 pages
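
    A hedged sketch combining the named components in PyTorch: multi-scale feature learning (parallel 3x3 and 5x5 branches), dense connections (the output is concatenated onto the input), and learned weighting of the different flows. The branch widths and the scalar flow weights are illustrative assumptions, not the paper's architecture.

        import torch
        import torch.nn as nn

        class MultiScaleDenseLayer(nn.Module):
            def __init__(self, in_channels: int, growth: int = 32):
                super().__init__()
                self.branch3 = nn.Conv2d(in_channels, growth, kernel_size=3, padding=1)
                self.branch5 = nn.Conv2d(in_channels, growth, kernel_size=5, padding=2)
                self.flow_weights = nn.Parameter(torch.ones(2))  # learned flow weighting

            def forward(self, x):
                w = torch.softmax(self.flow_weights, dim=0)
                y = torch.relu(w[0] * self.branch3(x) + w[1] * self.branch5(x))
                return torch.cat([x, y], dim=1)                  # dense connection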

    Estimating of the inertial manifold dimension for a chaotic attractor of complex Ginzburg-Landau equation using a neural network

    Full text link
    The dimension of an inertial manifold for a chaotic attractor of a spatially distributed system is estimated using an autoencoder neural network. The inertial manifold is a low-dimensional manifold in which the chaotic attractor is embedded. The autoencoder maps system state vectors onto themselves, letting them pass through an inner state of reduced dimension. The training process of the autoencoder is shown to depend dramatically on the reduced dimension: the learning curve saturates when the dimension is too small and decays when it is sufficient for lossless information transfer. The smallest sufficient value is taken as the dimension of the inertial manifold, and the autoencoder implements a mapping onto the inertial manifold and back. The correctness of the computed dimension is confirmed by its remarkable coincidence with the one obtained as the number of covariant Lyapunov vectors with vanishing pairwise angles. These vectors are called physical modes. Unlike the residual vectors, whose pairwise angles never vanish, the physical modes are known to span a tangent subspace of the inertial manifold. Comment: 10 pages and 9 figures
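
    A hedged sketch of the dimension scan in PyTorch: an autoencoder is trained for each candidate bottleneck width, and the smallest width at which the final reconstruction loss drops from a saturated plateau estimates the inertial manifold dimension. The layer widths, optimizer, and epoch count are illustrative assumptions.

        import torch
        import torch.nn as nn

        def estimate_manifold_dimension(states, candidate_dims, epochs=200):
            # states: float tensor of system state vectors, shape (samples, n)
            n = states.shape[1]
            losses = {}
            for d in candidate_dims:
                model = nn.Sequential(nn.Linear(n, 64), nn.Tanh(), nn.Linear(64, d),
                                      nn.Linear(d, 64), nn.Tanh(), nn.Linear(64, n))
                opt = torch.optim.Adam(model.parameters(), lr=1e-3)
                for _ in range(epochs):
                    opt.zero_grad()
                    loss = nn.functional.mse_loss(model(states), states)
                    loss.backward()
                    opt.step()
                losses[d] = loss.item()
            return losses  # look for the knee where the loss stops saturating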