Kervolutional Neural Networks
Convolutional neural networks (CNNs) have enabled state-of-the-art
performance in many computer vision tasks. However, little effort has been
devoted to establishing convolution in non-linear space. Existing works mainly
rely on activation layers, which can only provide point-wise
non-linearity. To solve this problem, a new operation, kervolution (kernel
convolution), is introduced to approximate complex behaviors of human
perception systems via the kernel trick. It generalizes convolution,
enhances model capacity, and captures higher-order interactions of
features through patch-wise kernel functions, without introducing additional
parameters. Extensive experiments show that kervolutional neural networks (KNNs)
achieve higher accuracy and faster convergence than baseline CNNs.
Comment: Oral paper at CVPR 2019.
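To make the patch-wise kernel idea concrete, here is a minimal PyTorch sketch of a polynomial kervolution layer (the class name, initialization, and hyper-parameters are illustrative assumptions, not the authors' released code). The inner product of ordinary convolution is replaced patch-wise by the polynomial kernel (x·w + c)^d, which adds non-linearity without extra learnable parameters:

```python
import torch
import torch.nn.functional as F


class PolyKervolution2d(torch.nn.Module):
    """Sketch: convolution with its dot product replaced by (x.w + c)^d."""

    def __init__(self, in_ch, out_ch, ksize, c=1.0, d=2):
        super().__init__()
        # Same parameter count as a plain conv layer with this filter size.
        self.weight = torch.nn.Parameter(
            0.01 * torch.randn(out_ch, in_ch * ksize * ksize))
        self.ksize, self.c, self.d = ksize, c, d

    def forward(self, x):
        n, _, h, w = x.shape
        # Extract patches: (N, in_ch * k * k, L) with L spatial locations.
        patches = F.unfold(x, self.ksize, padding=self.ksize // 2)
        lin = self.weight @ patches        # linear response <x, w>, (N, out_ch, L)
        out = (lin + self.c) ** self.d     # patch-wise polynomial kernel
        return out.view(n, -1, h, w)


if __name__ == "__main__":
    layer = PolyKervolution2d(3, 8, 3)
    print(layer(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 8, 32, 32])
```

Setting d=1 and c=0 recovers ordinary convolution, which is the sense in which kervolution generalizes it.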
Kernel Cross-Correlator
The cross-correlator plays a significant role in many visual perception tasks,
such as object detection and tracking. Going beyond the linear cross-correlator, this
paper proposes a kernel cross-correlator (KCC) that breaks traditional
limitations. First, by introducing the kernel trick, the KCC extends linear
cross-correlation to non-linear space, which is more robust to signal noise
and distortions. Second, its connection to existing works shows that KCC
provides a unified solution for correlation filters. Third, KCC is applicable
to any kernel function and is not limited to a circulant structure on the training
data; it is therefore able to predict affine transformations with customized
properties. Last, by leveraging the fast Fourier transform (FFT), KCC
eliminates the direct calculation of kernel vectors, thereby achieving better
performance at a reasonable computational cost. Comprehensive
experiments on visual tracking and human activity recognition using wearable
devices demonstrate its robustness, flexibility, and efficiency. The source
code of both experiments is released at https://github.com/wang-chen/KCC.
Comment: The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18).
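As an illustration of how the kernel trick combines with the FFT, the sketch below implements a Gaussian-kernel correlation filter in the Fourier domain for 1-D signals. This is the classic KCF-style special case with circulant training data; the function names, normalization, and hyper-parameters are assumptions, not the paper's more general KCC formulation:

```python
import torch


def gaussian_correlation(x, z, sigma=0.5):
    """Gaussian-kernel correlation of 1-D signals over all cyclic shifts.
    The FFT evaluates every shift at once, avoiding explicit kernel vectors."""
    X, Z = torch.fft.rfft(x), torch.fft.rfft(z)
    xz = torch.fft.irfft(X * Z.conj(), n=x.numel())          # all-shift dot products
    d2 = (x @ x + z @ z - 2.0 * xz).clamp(min=0)             # squared distances
    return torch.exp(-d2 / (sigma ** 2 * x.numel()))


def train_filter(x, y, sigma=0.5, lam=1e-4):
    """Ridge-regression solution in the Fourier domain: alpha_f = y_f / (k_f + lam)."""
    kxx = gaussian_correlation(x, x, sigma)
    return torch.fft.rfft(y) / (torch.fft.rfft(kxx) + lam)


def detect(alpha_f, x, z, sigma=0.5):
    """Correlation response of a new signal z against the trained template x."""
    kxz = gaussian_correlation(x, z, sigma)
    return torch.fft.irfft(alpha_f * torch.fft.rfft(kxz), n=x.numel())
```

The peak of the response returned by detect indicates the most likely shift, which is what makes this family of filters useful for tracking.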
Non-iterative RGB-D-inertial Odometry
This paper presents a non-iterative solution to the RGB-D-inertial odometry
problem. Traditional odometry methods resort to iterative algorithms, which are
usually computationally expensive or require well-designed initialization. To
overcome this problem, this paper proposes to combine a non-iterative front-end
(odometry) with an iterative back-end (loop closure) for the RGB-D-inertial
SLAM system. The main contribution lies in the novel non-iterative front-end,
which leverages inertial fusion and kernel cross-correlators (KCC) to match
point clouds in the frequency domain. Dominated by the fast Fourier transform
(FFT), our method is only of complexity O(n log n), where n is
the number of points. Map fusion is conducted by element-wise operations, so
that both time and space complexity are further reduced. Extensive experiments
show that, owing to the lightweight front-end, the framework is able to run at
a much faster speed while maintaining accuracy comparable to the state of the art.
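A toy example of this kind of non-iterative, FFT-dominated matching is classic phase correlation, sketched below for recovering an integer translation between two 2-D grids in a single pass. This illustrates frequency-domain registration in general; it is not the paper's KCC-based front-end and omits inertial fusion entirely:

```python
import torch


def phase_correlation(a, b):
    """Estimate the integer shift taking grid a to grid b via the Fourier
    shift theorem; a single FFT round-trip, no iterative optimization."""
    A, B = torch.fft.fft2(a), torch.fft.fft2(b)
    R = B * A.conj()
    R = R / (R.abs() + 1e-8)                 # keep phase only
    corr = torch.fft.ifft2(R).real
    idx = torch.argmax(corr)                 # peak location encodes the shift
    dy, dx = idx // a.shape[1], idx % a.shape[1]
    # Wrap shifts larger than half the grid into negative offsets.
    dy = dy - a.shape[0] if dy > a.shape[0] // 2 else dy
    dx = dx - a.shape[1] if dx > a.shape[1] // 2 else dx
    return int(dy), int(dx)


if __name__ == "__main__":
    a = torch.zeros(64, 64)
    a[20:30, 20:30] = 1.0
    b = torch.roll(a, shifts=(5, -3), dims=(0, 1))
    print(phase_correlation(a, b))  # (5, -3)
```

The cost is dominated by the FFTs, hence the O(n log n) complexity the abstract refers to.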
DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions
As it is empirically observed that Vision Transformers (ViTs) are quite
insensitive to the order of input tokens, the need for an appropriate
self-supervised pretext task that enhances the location awareness of ViTs is
becoming evident. To address this, we present DropPos, a novel pretext task
designed to reconstruct Dropped Positions. The formulation of DropPos is
simple: we first drop a large random subset of positional embeddings and then
the model classifies the actual position for each non-overlapping patch among
all possible positions solely based on their visual appearance. To avoid
trivial solutions, we increase the difficulty of this task by keeping only a
subset of patches visible. Additionally, considering there may be different
patches with similar visual appearances, we propose position smoothing and
attentive reconstruction strategies to relax this classification problem, since
it is not necessary to reconstruct their exact positions in these cases.
Empirical evaluations of DropPos show strong capabilities. DropPos outperforms
supervised pre-training and achieves competitive results compared with
state-of-the-art self-supervised alternatives on a wide range of downstream
benchmarks. This suggests that explicitly encouraging spatial reasoning
abilities, as DropPos does, indeed contributes to the improved location
awareness of ViTs. The code is publicly available at
https://github.com/Haochen-Wang409/DropPos.
Comment: Accepted by NeurIPS 2023.
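The core of the pretext task can be sketched in a few lines: a position-classification head over patch tokens whose positional embeddings were dropped before encoding. This is a minimal sketch assuming a ViT-style encoder; all names and shapes are illustrative, and the position-smoothing and attentive-reconstruction strategies are omitted:

```python
import torch
import torch.nn as nn


class DropPosHead(nn.Module):
    """Sketch: classify each position-dropped patch token into one of N
    possible grid positions from visual appearance alone."""

    def __init__(self, dim, num_positions):
        super().__init__()
        self.classifier = nn.Linear(dim, num_positions)

    def forward(self, tokens, true_pos, dropped_mask):
        # tokens: (B, N, dim) encoder outputs; dropped_mask: (B, N) bool,
        # True where the positional embedding was removed before encoding.
        logits = self.classifier(tokens)                   # (B, N, num_positions)
        return nn.functional.cross_entropy(
            logits[dropped_mask], true_pos[dropped_mask])  # loss on dropped tokens only


if __name__ == "__main__":
    B, N, D = 2, 16, 64
    head = DropPosHead(D, N)
    tokens = torch.randn(B, N, D)                  # stand-in for encoder outputs
    true_pos = torch.arange(N).expand(B, N)        # ground-truth grid positions
    dropped = torch.rand(B, N) > 0.25              # drop ~75% of the positions
    print(head(tokens, true_pos, dropped))
```

Keeping only a subset of patches visible, as the abstract notes, is what prevents the model from solving this by trivially matching neighboring content.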
Using Unreliable Pseudo-Labels for Label-Efficient Semantic Segmentation
The crux of label-efficient semantic segmentation is to produce high-quality
pseudo-labels to leverage a large amount of unlabeled or weakly labeled data. A
common practice is to select highly confident predictions as the
pseudo-ground-truths for each pixel, but this leaves most pixels
unused due to their unreliability. However, we argue that every
pixel matters to model training, even unreliable and ambiguous
pixels. Intuitively, an unreliable prediction may get confused among the top
classes; however, it should be confident that the pixel does not belong to the
remaining classes. Hence, such a pixel can be convincingly treated as a
negative key for those most unlikely categories. Therefore, we develop an
effective pipeline to make sufficient use of unlabeled data. Concretely, we
separate reliable and unreliable pixels via the entropy of predictions, push
each unreliable pixel to a category-wise queue that consists of negative keys,
and manage to train the model with all candidate pixels. Considering the
training evolution, we adaptively adjust the threshold for the
reliable-unreliable partition. Experimental results on various benchmarks and
training settings demonstrate the superiority of our approach over the
state-of-the-art alternatives.
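A minimal sketch of the entropy-based partition described above follows. The quantile threshold, the number of "unlikely" classes, and all names are assumptions made for illustration; the adaptive threshold scheduling and the category-wise negative queues themselves are omitted:

```python
import torch


def partition_pseudo_labels(logits, gamma=0.7, k_unlikely=3):
    """Split pixels into reliable (positive pseudo-labels) and unreliable
    (negative candidates for their least-likely classes) by prediction entropy."""
    prob = logits.softmax(dim=1)                             # (B, C, H, W)
    ent = -(prob * prob.clamp_min(1e-8).log()).sum(dim=1)    # per-pixel entropy
    thresh = torch.quantile(ent.flatten(), gamma)            # entropy cut-off
    reliable = ent <= thresh                                 # (B, H, W) bool
    pseudo = prob.argmax(dim=1)                              # positive labels
    # For unreliable pixels, the k lowest-probability classes are treated as
    # confident negatives: these are what would be pushed into the
    # category-wise queues of negative keys.
    unlikely = prob.topk(k_unlikely, dim=1, largest=False).indices
    return pseudo, reliable, unlikely


if __name__ == "__main__":
    logits = torch.randn(2, 19, 32, 32)          # e.g. 19 Cityscapes classes
    pseudo, reliable, unlikely = partition_pseudo_labels(logits)
    print(pseudo.shape, reliable.float().mean().item(), unlikely.shape)
```

Adjusting gamma over training mirrors the adaptive reliable-unreliable partition the abstract describes.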