Surrogate Lagrangian Relaxation: A Path To Retrain-free Deep Neural Network Pruning
Network pruning is a widely used technique to reduce computation cost and
model size for deep neural networks. However, the typical three-stage pipeline
significantly increases the overall training time. In this paper, we develop a
systematic weight-pruning optimization approach based on Surrogate Lagrangian
Relaxation (SLR), which is tailored to overcome difficulties caused by the discrete
nature of the weight-pruning problem. We prove that our method ensures fast
convergence on the model-compression problem, and that convergence of SLR is
further accelerated by quadratic penalties. Model parameters obtained by SLR
during the training phase are much closer to their optimal values than those
obtained by other state-of-the-art methods. We evaluate our method on
image classification tasks using CIFAR-10 and ImageNet with state-of-the-art
models, including MLP-Mixer, Swin Transformer, VGG-16, ResNet-18, ResNet-50,
ResNet-110, and MobileNetV2. We also evaluate object detection and segmentation
tasks on the COCO and KITTI benchmarks and the TuSimple lane detection dataset
using a variety of models.
Experimental results demonstrate that our SLR-based weight-pruning optimization
approach achieves a higher compression rate than state-of-the-art methods under
the same accuracy requirement and also can achieve higher accuracy under the
same compression rate requirement. On classification tasks, our SLR approach
converges to the desired accuracy faster on both datasets, and it likewise
converges faster on the object detection and segmentation tasks.
Further, our SLR achieves high model accuracy
even at the hard-pruning stage without retraining, which reduces the
traditional three-stage pruning into a two-stage process. Given a limited
budget of retraining epochs, our approach quickly recovers the model's
accuracy.
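To make the formulation above concrete, here is a minimal NumPy sketch of the ADMM-style variable splitting with a quadratic penalty that SLR-type pruning builds on. It is a simplification: the surrogate-subgradient stepsize rules that give SLR its convergence guarantees are omitted, and names such as `project_topk` and `slr_style_prune` are illustrative, not from the paper.

```python
import numpy as np

def project_topk(w, k):
    """Project w onto the set of matrices with at most k nonzeros
    (keep the k largest-magnitude entries, zero the rest)."""
    flat = np.abs(w).ravel()
    if k >= flat.size:
        return w.copy()
    thresh = np.partition(flat, -k)[-k]
    return np.where(np.abs(w) >= thresh, w, 0.0)

def slr_style_prune(loss_grad, w0, k, rho=1e-2, lr=1e-2, steps=200):
    """Illustrative ADMM/SLR-style splitting for weight pruning.

    minimize f(W) subject to W in S_k (at most k nonzeros), via an
    auxiliary variable Z, multipliers U, and a quadratic penalty
    rho/2 * ||W - Z||^2. The real SLR method adds surrogate-subgradient
    stepsize conditions on the multiplier updates; those are omitted here.
    """
    w = w0.copy()
    z = project_topk(w, k)
    u = np.zeros_like(w)
    for _ in range(steps):
        # primal update on W: loss gradient plus quadratic penalty term
        g = loss_grad(w) + rho * (w - z + u)
        w -= lr * g
        # Z update: Euclidean projection onto the sparsity constraint
        z = project_topk(w + u, k)
        # multiplier (dual) update
        u += w - z
    return project_topk(w, k)   # hard-prune to the constraint set

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    target = rng.normal(size=(8, 8))
    loss_grad = lambda w: w - target      # gradient of 0.5*||W - target||^2
    w_pruned = slr_style_prune(loss_grad, rng.normal(size=(8, 8)), k=16)
    print("nonzeros:", np.count_nonzero(w_pruned))
```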
Achieving on-Mobile Real-Time Super-Resolution with Neural Architecture and Pruning Search
Though recent years have witnessed remarkable progress in single-image
super-resolution (SISR) tasks with the rapid development of deep neural
networks (DNNs), deep learning methods face computation and memory consumption
issues in practice, especially on resource-limited platforms such as mobile
devices. To overcome this challenge and facilitate the
real-time deployment of SISR tasks on mobile, we combine neural architecture
search with pruning search and propose an automatic search framework that
derives sparse super-resolution (SR) models with high image quality while
satisfying the real-time inference requirement. To decrease the search cost, we
leverage the weight sharing strategy by introducing a supernet and decouple the
search problem into three stages, including supernet construction,
compiler-aware architecture and pruning search, and compiler-aware pruning
ratio search. With the proposed framework, we are the first to achieve
real-time SR inference (with only tens of milliseconds per frame) for
720p-resolution upscaling with competitive image quality (in terms of PSNR
and SSIM) on mobile platforms (Samsung Galaxy S20).
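As a rough illustration of the pruning-ratio-search stage, the Python sketch below randomly samples per-layer pruning ratios and keeps the best candidate that satisfies a latency budget. The functions `predict_latency` and `predict_quality` are placeholder models invented for illustration; the actual framework evaluates candidates with compiler-aware, on-device latency and supernet-based quality measurements.

```python
import random

# Candidate per-layer pruning ratios (hypothetical values for illustration).
RATIOS = [0.0, 0.25, 0.5, 0.75]

def predict_latency(ratios, base_latency_ms):
    """Placeholder: assume each layer's cost shrinks in proportion to the
    kept weights. A compiler-aware search would query measured latency."""
    return sum(base * (1.0 - r) for base, r in zip(base_latency_ms, ratios))

def predict_quality(ratios):
    """Placeholder quality proxy (e.g., validation PSNR of the pruned
    subnet under supernet weights). Here pruning is simply penalized,
    more heavily in early layers, to make the search meaningful."""
    return 30.0 - sum((1.5 - 0.1 * i) * r for i, r in enumerate(ratios))

def search_pruning_ratios(num_layers, base_latency_ms, budget_ms,
                          trials=2000, seed=0):
    """Random search over per-layer pruning ratios under a latency budget,
    mimicking the pruning-ratio-search stage at a very small scale."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        ratios = [rng.choice(RATIOS) for _ in range(num_layers)]
        if predict_latency(ratios, base_latency_ms) > budget_ms:
            continue                      # violates the real-time constraint
        q = predict_quality(ratios)
        if best is None or q > best[0]:
            best = (q, ratios)
    return best

if __name__ == "__main__":
    base = [8.0] * 10                     # per-layer latency of the dense model (ms)
    print(search_pruning_ratios(10, base, budget_ms=33.0))
```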
YOLObile: Real-Time Object Detection on Mobile Devices via Compression-Compilation Co-Design
The rapid development and wide adoption of object detection techniques
have drawn attention to both the accuracy and the speed of object detectors.
However, current state-of-the-art object detection works are either
accuracy-oriented, using a large model at the cost of high latency, or
speed-oriented, using a lightweight model at the cost of accuracy. In this
work, we propose the YOLObile framework, which enables real-time object
detection on mobile devices via compression-compilation co-design. A novel
block-punched pruning scheme, applicable to any kernel size, is proposed.
To improve computational efficiency on
mobile devices, a GPU-CPU collaborative scheme is adopted along with advanced
compiler-assisted optimizations. Experimental results indicate that our pruning
scheme achieves a 14× compression rate on YOLOv4 with 49.0 mAP. Under our
YOLObile framework, we achieve 17 FPS inference speed using the GPU on a
Samsung Galaxy S20. By incorporating our proposed GPU-CPU collaborative
scheme, the inference speed is increased to 19.1 FPS, outperforming the
original YOLOv4 by a 5× speedup. Source code is at:
\url{https://github.com/nightsnack/YOLObile}
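The block-punched pruning idea can be sketched as follows. This is a simplified reading, assuming the weight tensor is flattened into a filters-by-positions matrix and only row blocks are formed; the paper's actual scheme partitions the matrix into blocks in both dimensions and co-designs the pattern with the compiler. The name `block_punched_prune` and the magnitude-based importance score are illustrative.

```python
import numpy as np

def block_punched_prune(weight, num_row_blocks, prune_ratio):
    """Zero the same column positions across all rows inside each row block,
    so the surviving weights form a regular, compiler-friendly pattern.

    weight: 2-D array of shape (out_channels, in_channels * kh * kw)
    """
    out_dim, in_dim = weight.shape
    pruned = weight.copy()
    rows_per_block = out_dim // num_row_blocks
    cols_to_prune = int(in_dim * prune_ratio)
    for b in range(num_row_blocks):
        rows = slice(b * rows_per_block, (b + 1) * rows_per_block)
        block = pruned[rows, :]                    # view into the full matrix
        col_scores = np.abs(block).sum(axis=0)     # importance per position
        punched = np.argsort(col_scores)[:cols_to_prune]
        block[:, punched] = 0.0                    # punch out the same positions
    return pruned

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(64, 3 * 3 * 32))          # e.g., a 3x3 conv, 32 in, 64 out
    w_pruned = block_punched_prune(w, num_row_blocks=8, prune_ratio=0.75)
    print("sparsity:", 1.0 - np.count_nonzero(w_pruned) / w_pruned.size)
```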