SpVOS: Efficient Video Object Segmentation with Triple Sparse Convolution
Semi-supervised video object segmentation (Semi-VOS), which requires only
annotating the first frame of a video to segment future frames, has received
increased attention recently. Among existing pipelines, the
memory-matching-based one is becoming the main research stream, as it can fully
utilize the temporal sequence information to obtain high-quality segmentation
results. Even though this type of method has achieved promising performance,
the overall framework still suffers from heavy computation overhead, mainly
caused by the per-frame dense convolution operations between high-resolution
feature maps and each kernel filter. Therefore, we propose a sparse baseline of
VOS named SpVOS in this work, which develops a novel triple sparse convolution
to reduce the computation costs of the overall VOS framework. The designed
triple gate accounts for both spatial and temporal redundancy between adjacent
video frames and adaptively makes a triple decision on how to apply the sparse
convolution at each pixel, controlling the computation overhead of each layer
while maintaining sufficient discrimination capability to distinguish similar
objects and avoid error accumulation. A mixed sparse
training strategy, coupled with a designed objective considering the sparsity
constraint, is also developed to balance the VOS segmentation performance and
computation costs. Experiments are conducted on two mainstream VOS datasets,
including DAVIS and YouTube-VOS. Results show that the proposed SpVOS
outperforms other state-of-the-art sparse methods and even remains comparable
to the typical non-sparse VOS baseline: it achieves an overall score of 83.04%
on the DAVIS-2017 validation set and 79.29% on the YouTube-VOS validation set
(baseline: 82.88% and 80.36%, respectively) while saving up to 42% of FLOPs,
showing its application potential for resource-constrained scenarios.
Comment: 15 pages, 6 figures
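As a rough illustration of why evaluating a convolution at only some pixels lowers cost, the sketch below counts multiply-accumulates (MACs) for one layer under an assumed fraction of active pixels. The feature-map size, channel counts, and the `active_fraction` parameter are hypothetical values chosen for the arithmetic; they are not taken from SpVOS, and the cost of the gate itself is ignored.

```python
def conv_macs(height, width, c_in, c_out, kernel, active_fraction=1.0):
    """MACs for one kernel x kernel convolution evaluated only at active output pixels."""
    active_pixels = round(height * width * active_fraction)
    return active_pixels * c_in * c_out * kernel * kernel


# Hypothetical layer: 100 x 200 feature map, 256 -> 256 channels, 3 x 3 kernel.
dense = conv_macs(100, 200, 256, 256, 3)
# Assume the gate keeps 58% of pixels active (an illustrative value only).
sparse = conv_macs(100, 200, 256, 256, 3, active_fraction=0.58)
saving = 1 - sparse / dense

print(f"dense MACs:  {dense:,}")
print(f"sparse MACs: {sparse:,}")
print(f"saving:      {saving:.0%}")
```

Under these assumptions the saving tracks the fraction of skipped pixels directly, which is why a per-pixel gate can trade accuracy against computation layer by layer.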
InceptionNeXt: When Inception Meets ConvNeXt
Inspired by the long-range modeling ability of ViTs, large-kernel
convolutions have recently been widely studied and adopted to enlarge the
receptive field and improve model performance, as in the remarkable work
ConvNeXt, which employs 7x7 depthwise convolution. Although such a depthwise
operator consumes only a few FLOPs, it largely harms model efficiency on
powerful computing devices due to high memory access costs. For example,
ConvNeXt-T has FLOPs similar to ResNet-50 but achieves only 60% of its
throughput when trained on A100 GPUs with full precision. Although reducing
the kernel size of ConvNeXt
can improve speed, it results in significant performance degradation. It is
still unclear how to speed up large-kernel-based CNN models while preserving
their performance. To tackle this issue, inspired by Inceptions, we propose to
decompose large-kernel depthwise convolution into four parallel branches along
the channel dimension, i.e., a small square kernel, two orthogonal band kernels,
and an identity mapping. With this new Inception depthwise convolution, we build
a series of networks, namely InceptionNeXt, which not only enjoy high throughput
but also maintain competitive performance. For instance, InceptionNeXt-T
achieves 1.6x higher training throughput than ConvNeXt-T and attains a 0.2%
top-1 accuracy improvement on ImageNet-1K. We anticipate InceptionNeXt can
serve as an economical baseline for future architecture design to reduce carbon
footprint. Code is available at https://github.com/sail-sg/inceptionnext.
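The four-branch decomposition described in the InceptionNeXt abstract can be illustrated with a small weight count. This is a minimal sketch, not the authors' implementation: the 3x3 square kernel, the band length of 11, the 1/8 channel share per convolution branch, and the 96-channel input are all assumptions that the abstract does not state, and the helper names are hypothetical.

```python
def split_channels(dim, branch_ratio=0.125):
    """Split channels into identity, square, and two band branches (assumed ratio)."""
    gc = int(dim * branch_ratio)  # channels given to each convolution branch
    return {"identity": dim - 3 * gc, "square": gc, "band_w": gc, "band_h": gc}


def depthwise_weights(channels, kernel_h, kernel_w):
    """A depthwise convolution holds one kernel_h x kernel_w filter per channel."""
    return channels * kernel_h * kernel_w


def inception_dw_weights(dim, square=3, band=11, branch_ratio=0.125):
    """Weights across the branches; the identity branch has no weights."""
    groups = split_channels(dim, branch_ratio)
    return (
        depthwise_weights(groups["square"], square, square)
        + depthwise_weights(groups["band_w"], 1, band)
        + depthwise_weights(groups["band_h"], band, 1)
    )


dim = 96  # hypothetical channel count
groups = split_channels(dim)
baseline = depthwise_weights(dim, 7, 7)  # single 7x7 depthwise convolution
decomposed = inception_dw_weights(dim)

print(groups)
print("7x7 depthwise weights:", baseline)
print("decomposed weights:   ", decomposed)
```

Because most channels pass through the identity branch untouched, the decomposed operator reads and writes far less data per pixel than a full 7x7 depthwise convolution, which is consistent with the memory-access argument in the abstract.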