
    BlockDrop: Dynamic Inference Paths in Residual Networks

    Very deep convolutional neural networks offer excellent recognition results, yet their computational expense limits their impact for many real-world applications. We introduce BlockDrop, an approach that learns to dynamically choose which layers of a deep network to execute during inference so as to best reduce total computation without degrading prediction accuracy. Exploiting the robustness of Residual Networks (ResNets) to layer dropping, our framework selects on-the-fly which residual blocks to evaluate for a given novel image. In particular, given a pretrained ResNet, we train a policy network in an associative reinforcement learning setting for the dual reward of utilizing a minimal number of blocks while preserving recognition accuracy. We conduct extensive experiments on CIFAR and ImageNet. The results provide strong quantitative and qualitative evidence that these learned policies not only accelerate inference but also encode meaningful visual information. Built upon a ResNet-101 model, our method achieves a speedup of 20% on average, going as high as 36% for some images, while maintaining the same 76.4% top-1 accuracy on ImageNet. Comment: CVPR 2018
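    The dual reward above can be made concrete with a small sketch. The snippet below is an illustration, not the authors' code: it assumes a binary per-block usage mask and a single scalar penalty for misclassification, and rewards correct predictions more the fewer blocks they execute.

        import torch

        def block_drop_reward(block_usage, correct, penalty=1.0):
            """block_usage: (B, K) binary mask of executed residual blocks.
            correct: (B,) bool tensor, whether the gated network classified the image correctly."""
            frac_used = block_usage.float().mean(dim=1)      # fraction of blocks kept per image
            reward = 1.0 - frac_used ** 2                    # fewer executed blocks -> larger reward
            return torch.where(correct, reward, torch.full_like(reward, -penalty))

        # Sampling a per-image execution policy from Bernoulli probabilities that a
        # (hypothetical) policy network would output, e.g. for the 33 residual blocks of ResNet-101:
        probs = torch.rand(4, 33)
        policy = torch.bernoulli(probs)                      # 1 = run the block, 0 = skip it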

    Dynamic Feature Pruning and Consolidation for Occluded Person Re-Identification

    Occluded person re-identification (ReID) is a challenging problem due to contamination from occluders. Existing approaches address the issue with prior knowledge cues, e.g., human body key points and semantic segmentation, which easily fail in the presence of heavy occlusion or when other humans act as occluders. In this paper, we propose a feature pruning and consolidation (FPC) framework that circumvents explicit human structure parsing and mainly consists of a sparse encoder, a global and local feature ranking module, and a feature consolidation decoder. Specifically, the sparse encoder drops less important image tokens (mostly related to background noise and occluders) solely according to correlation within the class token attention, instead of relying on prior human shape information. Subsequently, the ranking stage relies on the preserved tokens produced by the sparse encoder to identify k-nearest neighbors from a pre-trained gallery memory by measuring combined image- and patch-level similarity. Finally, we use the feature consolidation module to compensate for pruned features using the identified neighbors, recovering essential information while disregarding disturbance from noise and occlusion. Experimental results demonstrate the effectiveness of our proposed framework on occluded, partial, and holistic ReID datasets. In particular, our method outperforms state-of-the-art results by at least 8.6% mAP and 6.0% Rank-1 accuracy on the challenging Occluded-Duke dataset. Comment: 12 pages, 9 figures
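    The token-dropping step of the sparse encoder can be sketched as follows. This is a simplified illustration under assumed tensor shapes, not the released implementation: image tokens are ranked by the class token's attention to them, and only the top fraction is kept.

        import torch

        def prune_by_cls_attention(tokens, attn, keep_ratio=0.7):
            """tokens: (B, 1+N, D), class token at index 0.
            attn: (B, H, 1+N, 1+N) attention weights from the same transformer layer."""
            cls_to_patch = attn[:, :, 0, 1:].mean(dim=1)             # (B, N): CLS -> patch attention
            k = max(1, int(keep_ratio * cls_to_patch.size(1)))
            keep_idx = cls_to_patch.topk(k, dim=1).indices           # most informative patches
            keep_idx, _ = keep_idx.sort(dim=1)                       # preserve spatial order
            patches = torch.gather(tokens[:, 1:], 1,
                                   keep_idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
            return torch.cat([tokens[:, :1], patches], dim=1)        # class token + kept patches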

    Representational Redundancy Reduction Strategies for Efficient Neural Network Architectures for Visual and Language Tasks

    Deep neural networks have transformed a wide variety of domains including natural language processing, image and video processing, and robotics. However, the computational cost of training and inference with these models is high, and the rise of unsupervised pretraining has allowed ever larger networks to be used to further improve performance. Running these large neural networks in compute-constrained environments such as edge devices is infeasible, and the alternative of doing inference using cloud compute can be exceedingly expensive, with the largest language models needing to be distributed across multiple GPUs. Because of these constraints, reducing model size and improving inference speed have been a main focus of neural network research. A wide variety of techniques have been proposed to improve the efficiency of existing neural networks, including pruning, quantization, and knowledge distillation. In addition, there has been extensive effort on creating more efficient networks through hand design or an automated process called neural architecture search. However, there remain key domains where there is significant room for improvement, which we demonstrate in this thesis.

    In this thesis we aim to improve the efficiency of deep neural networks in terms of inference latency, model size, and latent representation size. We take an alternative approach to previous research and instead investigate redundant representations in neural networks. Across three domains of text classification, image classification, and generative models, we hypothesize that current neural networks contain representational redundancy and show that through the removal of this redundancy we can improve their efficiency.

    For image classification, we hypothesize that convolution kernels contain redundancy in terms of unnecessary channel-wise flexibility, and test this by introducing additional weight sharing into the network, preserving or even increasing classification performance while requiring fewer parameters. We show the benefits of this approach on convolution layers on the CIFAR and ImageNet datasets, on both standard models and models explicitly designed to be parameter-efficient.

    For generative models, we show it is possible to reduce the size of the latent representation of the model while preserving the quality of the generated images through the unsupervised disentanglement of shape and orientation. To do this we introduce the affine variational autoencoder, a novel training procedure, and demonstrate its effectiveness on the problem of generating two-dimensional images, as well as three-dimensional voxel representations of objects.

    Finally, looking at the transformer model, we note that there is a mismatch between the tasks used for pretraining and the downstream tasks models are fine-tuned on, such as text classification. We hypothesize that this results in a redundancy in terms of unnecessary spatial information, and remove it through the introduction of learned sequence-length bottlenecks. We aim to create task-specific networks given a dataset and performance requirements through the use of a neural architecture search method and learned downsampling. We show that these task-specific networks achieve a superior inference latency and accuracy trade-off compared to standard models, without requiring additional pretraining.
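    One way to read the convolutional weight-sharing idea is sketched below. The construction is illustrative rather than the thesis's exact layer: all input channels share a single spatial kernel, and cross-channel flexibility is limited to a cheap 1x1 mixing convolution, which cuts the parameter count of the spatial filtering.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class SharedKernelConv(nn.Module):
            """Conv layer with one shared k x k spatial kernel across all channels."""
            def __init__(self, in_ch, out_ch, k=3):
                super().__init__()
                self.shared = nn.Parameter(torch.randn(1, 1, k, k) * 0.1)   # single spatial kernel
                self.mix = nn.Conv2d(in_ch, out_ch, kernel_size=1)          # per-channel mixing
                self.pad = k // 2

            def forward(self, x):
                c = x.size(1)
                weight = self.shared.expand(c, 1, -1, -1).contiguous()      # same kernel for every channel
                x = F.conv2d(x, weight, padding=self.pad, groups=c)         # depthwise conv, shared weights
                return self.mix(x)

        layer = SharedKernelConv(64, 128)
        y = layer(torch.randn(2, 64, 32, 32))                               # -> (2, 128, 32, 32)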

    TORE: Token Reduction for Efficient Human Mesh Recovery with Transformer

    In this paper, we introduce a set of effective TOken REduction (TORE) strategies for Transformer-based Human Mesh Recovery from monocular images. Current SOTA performance is achieved by Transformer-based structures; however, they suffer from high model complexity and computation cost caused by redundant tokens. We propose token reduction strategies based on two important aspects, i.e., the 3D geometry structure and 2D image features: we hierarchically recover the mesh geometry with priors from the body structure and conduct token clustering to pass fewer but more discriminative image feature tokens to the Transformer. As a result, our method vastly reduces the number of tokens involved in high-complexity interactions in the Transformer, achieving competitive accuracy of shape recovery at a significantly reduced computational cost. We conduct extensive experiments across a wide range of benchmarks to validate the proposed method and further demonstrate its generalizability on hand mesh recovery. Our code will be publicly available once the paper is published.
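    The token clustering step can be approximated by a few k-means iterations over the image feature tokens, replacing each cluster by its mean. The sketch below assumes token shapes and uses plain k-means purely for illustration; it is not the authors' clustering module.

        import torch

        def cluster_tokens(tokens, num_clusters=49, iters=5):
            """tokens: (N, D) image feature tokens; returns (num_clusters, D) cluster tokens."""
            centroids = tokens[torch.randperm(tokens.size(0))[:num_clusters]].clone()
            for _ in range(iters):
                assign = torch.cdist(tokens, centroids).argmin(dim=1)        # nearest centroid per token
                for c in range(num_clusters):
                    members = tokens[assign == c]
                    if members.numel() > 0:
                        centroids[c] = members.mean(dim=0)
            return centroids

        reduced = cluster_tokens(torch.randn(196, 256), num_clusters=49)     # 196 tokens -> 49 tokens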

    Deep Neural Networks and Data for Automated Driving

    This open access book brings together the latest developments from industry and research on automated driving and artificial intelligence. Environment perception for highly automated driving heavily employs deep neural networks, facing many challenges. How much data do we need for training and testing? How to use synthetic data to save labeling costs for training? How do we increase robustness and decrease memory usage? For inevitably poor conditions: How do we know that the network is uncertain about its decisions? Can we understand a bit more about what actually happens inside neural networks? This leads to a very practical problem particularly for DNNs employed in automated driving: What are useful validation techniques and how about safety? This book unites the views from both academia and industry, where computer vision and machine learning meet environment perception for highly automated driving. Naturally, aspects of data, robustness, uncertainty quantification, and, last but not least, safety are at the core of it. This book is unique: In its first part, an extended survey of all the relevant aspects is provided. The second part contains the detailed technical elaboration of the various questions mentioned above

    Clutter Detection and Removal in 3D Scenes with View-Consistent Inpainting

    Removing clutter from scenes is essential in many applications, ranging from privacy-conscious content filtering to data augmentation. In this work, we present an automatic system that removes clutter from 3D scenes and inpaints the result with coherent geometry and texture. We propose techniques for its two key components: 3D segmentation from shared properties and 3D inpainting, both of which are important problems. The definition of 3D scene clutter (frequently moving objects) is not well captured by the object categories commonly studied in computer vision. To tackle the lack of well-defined clutter annotations, we group noisy fine-grained labels, leverage virtual rendering, and impose an instance-level area-sensitive loss. Once clutter is removed, we inpaint geometry and texture in the resulting holes by merging inpainted RGB-D images. This requires novel voting and pruning strategies that guarantee multi-view consistency across the individually inpainted images for mesh reconstruction. Experiments on the ScanNet and Matterport datasets show that our method outperforms baselines for clutter segmentation and 3D inpainting, both visually and quantitatively. Comment: 18 pages. ICCV 2023. Project page: https://weify627.github.io/clutter
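    The voting step that enforces multi-view consistency can be illustrated with a minimal sketch. The interface below is hypothetical: for one candidate surface point it takes the depth that each inpainted view predicts for that point and keeps the point only if enough views agree within a tolerance.

        import numpy as np

        def vote_keep(depth_obs, tol=0.02, min_votes=3):
            """depth_obs: per-view depth estimates (meters) for one candidate point,
            np.nan where the point is not visible in that view."""
            obs = depth_obs[~np.isnan(depth_obs)]
            if obs.size == 0:
                return False
            votes = np.abs(obs - np.median(obs)) < tol       # views consistent with the consensus depth
            return votes.sum() >= min_votes

        print(vote_keep(np.array([1.50, 1.51, 1.49, np.nan, 2.10])))   # True: three views agree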

    DNeRV: Modeling Inherent Dynamics via Difference Neural Representation for Videos

    Existing implicit neural representation (INR) methods do not fully exploit spatiotemporal redundancies in videos. Index-based INRs ignore content-specific spatial features, and hybrid INRs ignore the contextual dependency on adjacent frames, leading to poor modeling capability for scenes with large motion or dynamics. We analyze this limitation from the perspective of function fitting and reveal the importance of frame differences. To use explicit motion information, we propose the Difference Neural Representation for Videos (DNeRV), which consists of two streams for content and frame difference. We also introduce a collaborative content unit for effective feature fusion. We test DNeRV on video compression, inpainting, and interpolation. DNeRV achieves competitive results against state-of-the-art neural compression approaches and outperforms existing implicit methods on downstream inpainting and interpolation for 960×1920 videos.
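    A toy version of the two-stream design might look like the following. Names and sizes are assumed and the network is far smaller than the real model: a content stream decodes a tiny per-frame embedding, a difference stream encodes the frame difference as an explicit motion cue, and a small fusion unit merges them before reconstruction.

        import torch
        import torch.nn as nn

        class TwoStreamINR(nn.Module):
            def __init__(self, num_frames, dim=64, emb_ch=16):
                super().__init__()
                self.emb_ch = emb_ch
                self.embed = nn.Embedding(num_frames, emb_ch * 8 * 8)        # tiny learned content code per frame
                self.content = nn.Sequential(
                    nn.Conv2d(emb_ch, dim, 3, padding=1), nn.GELU(),
                    nn.Upsample(scale_factor=8, mode="nearest"))             # decode content to frame resolution
                self.diff = nn.Sequential(nn.Conv2d(3, dim, 3, padding=1), nn.GELU())
                self.fuse = nn.Conv2d(2 * dim, dim, 1)                       # simplified collaborative fusion
                self.head = nn.Conv2d(dim, 3, 3, padding=1)                  # reconstruct the RGB frame

            def forward(self, idx, frame_diff):
                c = self.content(self.embed(idx).view(-1, self.emb_ch, 8, 8))
                d = self.diff(frame_diff)                                    # explicit motion cue
                return self.head(self.fuse(torch.cat([c, d], dim=1)))

        model = TwoStreamINR(num_frames=100)
        frame = model(torch.tensor([0]), torch.randn(1, 3, 64, 64))          # -> (1, 3, 64, 64)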