Multi-task Pruning via Filter Index Sharing: A Many-Objective Optimization Approach
© The Author(s) 2021. State-of-the-art deep neural networks play an increasingly important role in artificial intelligence, but the huge number of parameters in these networks brings high memory cost and computational complexity. To solve this problem, filter pruning is widely used for neural network compression and acceleration. However, existing algorithms focus mainly on pruning a single model, and few results are available for multi-task pruning, which is capable of pruning multiple models and promoting learning performance. By utilizing the filter sharing technique, this paper aims to establish a multi-task pruning framework for simultaneously pruning and merging filters in multi-task networks. The optimization problem of selecting the important filters is solved by developing a many-objective optimization algorithm in which three criteria are adopted as objectives. To preserve the network structure, an index matrix is introduced to regulate the information sharing during multi-task training. The proposed multi-task pruning algorithm is flexible and can be performed with either adaptive or pre-specified pruning rates. Extensive experiments are performed to verify the applicability and superiority of the proposed method on both single-task and multi-task pruning. Funding: National Natural Science Foundation of China under Grants 61701238, 11431015, 61773209, 61873148 and 61933007; Natural Science Foundation of Jiangsu Province of China under Grant BK20190021; Six Talent Peaks Project in Jiangsu Province of China under Grant XYDXX-033; Royal Society of the U.K.; Alexander von Humboldt Foundation of Germany.
BlockDrop: Dynamic Inference Paths in Residual Networks
Very deep convolutional neural networks offer excellent recognition results,
yet their computational expense limits their impact for many real-world
applications. We introduce BlockDrop, an approach that learns to dynamically
choose which layers of a deep network to execute during inference so as to best
reduce total computation without degrading prediction accuracy. Exploiting the
robustness of Residual Networks (ResNets) to layer dropping, our framework
selects on-the-fly which residual blocks to evaluate for a given novel image.
In particular, given a pretrained ResNet, we train a policy network in an
associative reinforcement learning setting for the dual reward of utilizing a
minimal number of blocks while preserving recognition accuracy. We conduct
extensive experiments on CIFAR and ImageNet. The results provide strong
quantitative and qualitative evidence that these learned policies not only
accelerate inference but also encode meaningful visual information. Built upon
a ResNet-101 model, our method achieves a speedup of 20% on average, going as
high as 36% for some images, while maintaining the same 76.4% top-1 accuracy
on ImageNet. Comment: CVPR 201
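The dual reward described in the BlockDrop abstract (few executed blocks, preserved accuracy) can be illustrated with a minimal sketch. The function names, the 0/1 policy encoding, the quadratic usage term, and the `penalty` constant are assumptions made for illustration; the abstract does not give the exact reward.

```python
def block_usage_reward(policy, correct, penalty=1.0):
    """Reward for one image given a keep/drop decision per residual block.

    policy  : list of 0/1 flags, 1 = execute the block (hypothetical encoding)
    correct : whether the reduced network still predicted the right label
    penalty : assumed constant charged for a wrong prediction
    """
    if not policy:
        raise ValueError("policy must contain at least one block decision")
    used_fraction = sum(policy) / len(policy)
    if correct:
        # Fewer executed blocks push the reward towards 1.
        return 1.0 - used_fraction ** 2
    # A wrong prediction is penalized regardless of how many blocks ran.
    return -penalty


def apply_policy(x, blocks, policy):
    """Run a residual stack, skipping blocks whose flag is 0.

    A skipped block reduces to the identity shortcut, which is why
    residual networks tolerate dropped blocks.
    """
    for block, keep in zip(blocks, policy):
        if keep:
            x = x + block(x)
    return x
```

In this sketch a policy that keeps two of four blocks and still classifies correctly earns a higher reward than one that keeps all four, so the policy network is pushed towards shorter inference paths only while accuracy holds.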
Dynamic Feature Pruning and Consolidation for Occluded Person Re-Identification
Occluded person re-identification (ReID) is a challenging problem due to
contamination from occluders, and existing approaches address the issue with
prior knowledge cues, e.g., human body key points and semantic segmentations,
which easily fail in the presence of heavy occlusion and other humans as
occluders. In this paper, we propose a feature pruning and consolidation (FPC)
framework that circumvents explicit human structure parsing. It mainly consists
of a sparse encoder, a global and local feature ranking module, and a feature
consolidation decoder. Specifically, the sparse encoder drops less important
image tokens (mostly related to background noise and occluders) solely
according to correlation within the class token attention instead of relying on
prior human shape information. Subsequently, the ranking stage relies on the
preserved tokens produced by the sparse encoder to identify k-nearest neighbors
from a pre-trained gallery memory by measuring the image and patch-level
combined similarity. Finally, we use the feature consolidation module to
compensate pruned features using identified neighbors for recovering essential
information while disregarding disturbance from noise and occlusion.
Experimental results demonstrate the effectiveness of our proposed framework on
occluded, partial and holistic Re-ID datasets. In particular, our method
outperforms state-of-the-art results by at least 8.6% mAP and 6.0% Rank-1
accuracy on the challenging Occluded-Duke dataset. Comment: 12 pages, 9 figure
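The token-dropping step of the sparse encoder can be sketched as a top-k selection over class-token attention scores. This is a minimal illustration only: the function name, the `keep_ratio` parameter, and the plain-list representation of tokens are hypothetical and not taken from the paper.

```python
import math


def prune_tokens_by_cls_attention(tokens, cls_attention, keep_ratio=0.5):
    """Keep the image tokens that the class token attends to most.

    tokens        : list of patch tokens (any objects)
    cls_attention : one attention score per token, class token -> patch
    keep_ratio    : assumed fraction of tokens to preserve
    """
    if len(tokens) != len(cls_attention):
        raise ValueError("need exactly one attention score per token")
    if not 0 < keep_ratio <= 1:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, math.ceil(len(tokens) * keep_ratio))
    # Rank token indices by attention score, highest first.
    ranked = sorted(range(len(tokens)),
                    key=lambda i: cls_attention[i], reverse=True)
    # Restore the original spatial order among the surviving tokens.
    kept = sorted(ranked[:n_keep])
    return [tokens[i] for i in kept], kept
```

Returning the kept indices alongside the tokens lets a later stage, such as a consolidation decoder, know which positions were pruned.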
Representational Redundancy Reduction Strategies for Efficient Neural Network Architectures for Visual and Language Tasks
Deep neural networks have transformed a wide variety of domains including natural language processing, image and video processing, and robotics. However, the computational cost of training and inference with these models is high, and the rise of unsupervised pretraining has allowed ever larger networks to be used to further improve performance. Running these large neural networks in compute constrained environments such as on edge devices is infeasible, and the alternative of doing inference using cloud compute can be exceedingly expensive, with the largest language models needing to be distributed across multiple GPUs.
Because of these constraints, reducing model size and improving inference speed have been a main focus in neural network research. A wide variety of techniques have been proposed to improve the efficiency of existing neural networks, including pruning, quantization, and knowledge distillation. In addition, there is extensive effort on creating more efficient networks through hand design or an automated process called neural architecture search. However, there remain key domains where there is significant room for improvement, which we demonstrate in this thesis.
In this thesis we aim to improve the efficiency of deep neural networks in terms of inference latency, model size and latent representation size. We take an alternative approach to previous research and instead investigate redundant representations in neural networks. Across three domains of text classification, image classification and generative models we hypothesize that current neural networks contain representational redundancy and show that through the removal of this redundancy we can improve their efficiency.
For image classification we hypothesize that convolution kernels contain redundancy in the form of unnecessary channel-wise flexibility, and test this by introducing additional weight sharing into the network, preserving or even increasing classification performance while requiring fewer parameters. We show the benefits of this approach on convolution layers on the CIFAR and ImageNet datasets, on both standard models and models explicitly designed to be parameter efficient.
For generative models we show it is possible to reduce the size of the latent representation of the model while preserving the quality of the generated images through the unsupervised disentanglement of shape and orientation. To do this we introduce the affine variational autoencoder, a novel training procedure, and demonstrate its effectiveness on the problem of generating 2 dimensional images, as well as 3 dimensional voxel representations of objects.
Finally, looking at the transformer model, we note that there is a mismatch between the tasks used for pretraining and the downstream tasks that models are fine-tuned on, such as text classification. We hypothesize that this results in redundancy in the form of unnecessary spatial information, and remove it through the introduction of learned sequence length bottlenecks. We aim to create task-specific networks given a dataset and performance requirements through the use of a neural architecture search method and learned downsampling. We show that these task-specific networks achieve a better tradeoff between inference latency and accuracy than standard models, without requiring additional pretraining.
TORE: Token Reduction for Efficient Human Mesh Recovery with Transformer
In this paper, we introduce a set of effective TOken REduction (TORE)
strategies for Transformer-based Human Mesh Recovery from monocular images.
Current SOTA performance is achieved by Transformer-based structures. However,
they suffer from high model complexity and computation cost caused by redundant
tokens. We propose token reduction strategies based on two important aspects,
i.e., the 3D geometry structure and 2D image feature, where we hierarchically
recover the mesh geometry with priors from body structure and conduct token
clustering to pass fewer but more discriminative image feature tokens to the
Transformer. As a result, our method vastly reduces the number of tokens
involved in high-complexity interactions in the Transformer, achieving
competitive accuracy of shape recovery at a significantly reduced computational
cost. We conduct extensive experiments across a wide range of benchmarks to
validate the proposed method and further demonstrate the generalizability of
our method on hand mesh recovery. Our code will be publicly available once the
paper is published.
Deep Neural Networks and Data for Automated Driving
This open access book brings together the latest developments from industry and research on automated driving and artificial intelligence. Environment perception for highly automated driving heavily employs deep neural networks, facing many challenges. How much data do we need for training and testing? How to use synthetic data to save labeling costs for training? How do we increase robustness and decrease memory usage? For inevitably poor conditions: How do we know that the network is uncertain about its decisions? Can we understand a bit more about what actually happens inside neural networks? This leads to a very practical problem particularly for DNNs employed in automated driving: What are useful validation techniques and how about safety? This book unites the views from both academia and industry, where computer vision and machine learning meet environment perception for highly automated driving. Naturally, aspects of data, robustness, uncertainty quantification, and, last but not least, safety are at the core of it. This book is unique: In its first part, an extended survey of all the relevant aspects is provided. The second part contains the detailed technical elaboration of the various questions mentioned above
Clutter Detection and Removal in 3D Scenes with View-Consistent Inpainting
Removing clutter from scenes is essential in many applications, ranging from
privacy-concerned content filtering to data augmentation. In this work, we
present an automatic system that removes clutter from 3D scenes and inpaints
with coherent geometry and texture. We propose techniques for its two key
components: 3D segmentation from shared properties and 3D inpainting, both of
which are important problems. The definition of 3D scene clutter
(frequently-moving objects) is not well captured by commonly-studied object
categories in computer vision. To tackle the lack of well-defined clutter
annotations, we group noisy fine-grained labels, leverage virtual rendering,
and impose an instance-level area-sensitive loss. Once clutter is removed, we
inpaint geometry and texture in the resulting holes by merging inpainted RGB-D
images. This requires novel voting and pruning strategies that guarantee
multi-view consistency across individually inpainted images for mesh
reconstruction. Experiments on the ScanNet and Matterport datasets show that our
method outperforms baselines for clutter segmentation and 3D inpainting, both
visually and quantitatively. Comment: 18 pages. ICCV 2023. Project page:
https://weify627.github.io/clutter
DNeRV: Modeling Inherent Dynamics via Difference Neural Representation for Videos
Existing implicit neural representation (INR) methods do not fully exploit
spatiotemporal redundancies in videos. Index-based INRs ignore the
content-specific spatial features and hybrid INRs ignore the contextual
dependency on adjacent frames, leading to poor modeling capability for scenes
with large motion or dynamics. We analyze this limitation from the perspective
of function fitting and reveal the importance of frame difference. To use
explicit motion information, we propose Difference Neural Representation for
Videos (DNeRV), which consists of two streams for content and frame difference.
We also introduce a collaborative content unit for effective feature fusion. We
test DNeRV for video compression, inpainting, and interpolation. DNeRV achieves
competitive results against the state-of-the-art neural compression approaches
and outperforms existing implicit methods on downstream inpainting and
interpolation for videos.
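The frame-difference input that the DNeRV abstract emphasizes can be sketched as follows. This is an illustrative assumption about what a difference stream might consume, not the authors' implementation: frames are flat lists of pixel values, boundary frames are padded with zeros, and the function name is hypothetical.

```python
def frame_differences(frames):
    """Return (backward, forward) differences for every frame.

    frames : list of equally sized flat lists of pixel values
    For frame t, backward = frame[t] - frame[t-1] and
    forward = frame[t+1] - frame[t]. Missing neighbours at the
    sequence boundaries yield all-zero differences (an assumption).
    """
    if not frames:
        return []
    zero = [0.0] * len(frames[0])
    diffs = []
    for t, frame in enumerate(frames):
        if t > 0:
            backward = [cur - prev for cur, prev in zip(frame, frames[t - 1])]
        else:
            backward = list(zero)
        if t < len(frames) - 1:
            forward = [nxt - cur for cur, nxt in zip(frame, frames[t + 1])]
        else:
            forward = list(zero)
        diffs.append((backward, forward))
    return diffs
```

Large values in these differences mark regions with strong motion, which is the explicit signal a content-only representation does not see.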