Dual-Resolution Correspondence Networks
We tackle the problem of establishing dense pixel-wise correspondences
between a pair of images. In this work, we introduce Dual-Resolution
Correspondence Networks (DRC-Net), to obtain pixel-wise correspondences in a
coarse-to-fine manner. DRC-Net extracts both coarse- and fine-resolution
feature maps. The coarse maps are used to produce a full but coarse 4D
correlation tensor, which is then refined by a learnable neighbourhood
consensus module. The fine-resolution feature maps are used to obtain the final
dense correspondences guided by the refined coarse 4D correlation tensor. The
selected coarse-resolution matching scores allow the fine-resolution features
to focus only on a limited number of possible matches with high confidence. In
this way, DRC-Net dramatically increases matching reliability and localisation
accuracy, while avoiding the application of expensive 4D convolution kernels to
fine-resolution feature maps. We comprehensively evaluate our method on
large-scale public benchmarks including HPatches, InLoc, and Aachen Day-Night.
It achieves state-of-the-art results on all of them.
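As a rough sketch of the coarse-to-fine scheme described above (not the DRC-Net implementation; the tensor names, the top-k rule, and the block scoring are illustrative assumptions, and the learnable neighbourhood consensus module is omitted), the following PyTorch snippet builds the full coarse 4D correlation tensor, keeps only the k most confident candidates per source cell, and scores fine-resolution features inside those candidate blocks only:

import torch

def coarse_to_fine_scores(fa_c, fb_c, fa_f, fb_f, k=4):
    # fa_c, fb_c: (C, Hc, Wc) coarse features of images A and B.
    # fa_f, fb_f: (C, Hf, Wf) fine features, with Hf = s*Hc, Wf = s*Wc.
    C, Hc, Wc = fa_c.shape
    s = fa_f.shape[1] // Hc                      # coarse-to-fine scale factor
    # Full but coarse 4D correlation tensor, flattened to (Hc*Wc, Hc*Wc).
    corr = torch.einsum('cp,cq->pq', fa_c.flatten(1), fb_c.flatten(1))
    # Keep only the k most confident coarse candidates per source cell, so
    # expensive fine-resolution scoring never touches the full 4D volume.
    cand = corr.topk(k, dim=1).indices           # (Hc*Wc, k)
    matches = {}
    for p in range(Hc * Wc):
        ya, xa = divmod(p, Wc)
        block_a = fa_f[:, ya*s:(ya+1)*s, xa*s:(xa+1)*s].flatten(1)  # (C, s*s)
        best_q, best_score = -1, float('-inf')
        for q in cand[p].tolist():
            yb, xb = divmod(q, Wc)
            block_b = fb_f[:, yb*s:(yb+1)*s, xb*s:(xb+1)*s].flatten(1)
            score = (block_a.t() @ block_b).max().item()  # best fine-pixel pair
            if score > best_score:
                best_q, best_score = q, score
        matches[p] = (best_q, best_score)
    return matches

Restricting the fine-level search to k coarse candidates is what keeps the cost linear in k rather than quadratic in the number of fine-resolution cells.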
3DG-STFM: 3D Geometric Guided Student-Teacher Feature Matching
We tackle the essential task of finding dense visual correspondences between
a pair of images. This is a challenging problem due to various factors such as
poor texture, repetitive patterns, illumination variation, and motion blur in
practical scenarios. In contrast to methods that use dense correspondence
ground-truths as direct supervision for local feature matching training, we
train 3DG-STFM, a multi-modal matching model (Teacher), to enforce depth
consistency under 3D dense correspondence supervision, and transfer the
knowledge to a 2D unimodal matching model (Student). Both teacher and student
models consist of two transformer-based matching modules that obtain dense
correspondences in a coarse-to-fine manner. The teacher model guides the
student model to learn RGB-induced depth information for matching purposes
on both the coarse and fine branches. We also evaluate 3DG-STFM on a model
compression task. To the best of our knowledge, 3DG-STFM is the first
student-teacher learning method for the local feature matching task. The
experiments show that our method outperforms state-of-the-art methods on indoor
and outdoor camera pose estimation, as well as on homography estimation. Code
is available at: https://github.com/Ryan-prime/3DG-STFM
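The teacher-to-student transfer described above can be pictured with a standard distillation loss. The sketch below is generic temperature-scaled KL distillation applied to matching logits, not the official 3DG-STFM objective; the function name and the hyper-parameters T and alpha are illustrative assumptions.

import torch.nn.functional as F

def distill_matching_loss(student_logits, teacher_logits, gt_index,
                          T=4.0, alpha=0.5):
    # Each row of *_logits scores one query keypoint against all candidate
    # matches; gt_index holds the ground-truth match index per row.
    # Hard supervision from ground-truth correspondences.
    ce = F.cross_entropy(student_logits, gt_index)
    # Soft supervision: the RGB student mimics the RGB-D teacher's matching
    # distribution (temperature-scaled KL divergence).
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction='batchmean') * (T * T)
    return alpha * ce + (1 - alpha) * kd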
Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation
This paper presents a novel cost aggregation network, called Volumetric
Aggregation with Transformers (VAT), for few-shot segmentation. The use of
transformers can benefit correlation map aggregation through self-attention
over a global receptive field. However, the tokenization of a correlation map
for transformer processing can be detrimental, because the discontinuity at
token boundaries reduces the local context available near the token edges and
decreases inductive bias. To address this problem, we propose a 4D
Convolutional Swin Transformer, where a high-dimensional Swin Transformer is
preceded by a series of small-kernel convolutions that impart local context to
all pixels and introduce convolutional inductive bias. We additionally boost
aggregation performance by applying transformers within a pyramidal structure,
where aggregation at a coarser level guides aggregation at a finer level. Noise
in the transformer output is then filtered in the subsequent decoder with the
help of the query's appearance embedding. With this model, a new
state-of-the-art is set for all the standard benchmarks in few-shot
segmentation. It is shown that VAT attains state-of-the-art performance for
semantic correspondence as well, where cost aggregation also plays a central
role.
Comment: Code and trained models are available at
https://seokju-cho.github.io/VAT/. This is the ECCV'22 camera-ready version,
revised from arXiv:2112.1168
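To make the convolutions-before-attention idea concrete, here is a 2D toy module in PyTorch: small-kernel convolutions impart local context to every pixel before window-partitioned self-attention. VAT's actual module is 4D and operates on correlation volumes; all layer sizes and the residual wiring here are illustrative assumptions, not the released code.

import torch
import torch.nn as nn

class ConvThenWindowAttention(nn.Module):
    def __init__(self, dim=32, window=4, heads=4):
        super().__init__()
        self.window = window
        # Small-kernel convolutions: local context + convolutional bias.
        self.convs = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, padding=1), nn.GELU())
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                  # x: (B, C, H, W), with H and W
        x = x + self.convs(x)              # divisible by the window size
        B, C, H, W = x.shape
        w = self.window
        # Partition into non-overlapping windows and attend within each.
        t = (x.reshape(B, C, H // w, w, W // w, w)
              .permute(0, 2, 4, 3, 5, 1)
              .reshape(B * (H // w) * (W // w), w * w, C))
        t = t + self.attn(t, t, t)[0]
        # Reverse the window partition.
        return (t.reshape(B, H // w, W // w, w, w, C)
                 .permute(0, 5, 1, 3, 2, 4)
                 .reshape(B, C, H, W))

Because the convolutions run before tokenization, pixels near window edges already carry information from neighbouring windows, which is exactly the boundary discontinuity the paper aims to soften.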
Learning Probabilistic Coordinate Fields for Robust Correspondences
We introduce Probabilistic Coordinate Fields (PCFs), a novel
geometric-invariant coordinate representation for image correspondence
problems. In contrast to standard Cartesian coordinates, PCFs encode
coordinates in correspondence-specific barycentric coordinate systems (BCS)
with affine invariance. To know when and where to trust the encoded
coordinates, we implement PCFs in a probabilistic network termed PCF-Net, which
parameterizes the distribution of coordinate fields as Gaussian mixture models.
By jointly optimizing coordinate fields and their confidence conditioned on
dense flows, PCF-Net can work with various feature descriptors while quantifying
the reliability of PCFs with confidence maps. An interesting observation of this
work is that the learned confidence map converges to geometrically coherent and
semantically consistent regions, which facilitates robust coordinate
representation. By delivering the confident coordinates to keypoint/feature
descriptors, we show that PCF-Net can be used as a plug-in to existing
correspondence-dependent approaches. Extensive experiments on both indoor and
outdoor datasets suggest that accurate, geometrically invariant coordinates help
achieve the state of the art in several correspondence problems, such as sparse
feature matching, dense image registration, camera pose estimation, and
consistency filtering. Further, the interpretable confidence map predicted by
PCF-Net can also be leveraged for other novel applications, from texture transfer
to multi-homography classification.
Comment: Accepted by IEEE Transactions on Pattern Analysis and Machine
Intelligence
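The geometric core of PCFs, affine-invariant barycentric coordinates, can be verified in a few lines. The sketch below (plain NumPy; all names hypothetical) computes barycentric coordinates with respect to a triangle and checks that an arbitrary affine map leaves them unchanged; PCF-Net's Gaussian-mixture confidence modelling is not shown.

import numpy as np

def barycentric_coords(p, a, b, c):
    # Solve p = u*a + v*b + w*c subject to u + v + w = 1.
    m = np.column_stack([b - a, c - a])   # 2x2 basis spanned by the triangle
    v, w = np.linalg.solve(m, p - a)
    return np.array([1.0 - v - w, v, w])

# Affine invariance: coordinates agree before and after the transform.
A = np.array([[1.3, 0.4], [-0.2, 0.9]])  # arbitrary invertible affine map
t = np.array([2.0, -1.0])
tri = [np.array([0., 0.]), np.array([1., 0.]), np.array([0., 1.])]
p = np.array([0.2, 0.3])
before = barycentric_coords(p, *tri)
after = barycentric_coords(A @ p + t, *[A @ v + t for v in tri])
assert np.allclose(before, after)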
AiATrack: Attention in Attention for Transformer Visual Tracking
Transformer trackers have achieved impressive advancements recently, where
the attention mechanism plays an important role. However, the independent
correlation computation in the attention mechanism could result in noisy and
ambiguous attention weights, which inhibits further performance improvement. To
address this issue, we propose an attention in attention (AiA) module, which
enhances appropriate correlations and suppresses erroneous ones by seeking
consensus among all correlation vectors. Our AiA module can be readily applied
to both self-attention blocks and cross-attention blocks to facilitate feature
aggregation and information propagation for visual tracking. Moreover, we
propose a streamlined Transformer tracking framework, dubbed AiATrack, by
introducing efficient feature reuse and target-background embeddings to make
full use of temporal references. Experiments show that our tracker achieves
state-of-the-art performance on six tracking benchmarks while running at a
real-time speed.
Comment: Accepted by ECCV 2022. Code and models are publicly available at
https://github.com/Little-Podi/AiATrack
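A toy rendering of the attention-in-attention idea in PyTorch: the raw query-key correlation map is itself refined by an inner attention in which correlation vectors seek consensus with one another, before the values are aggregated. The dimensions, LazyLinear projections, and residual wiring are illustrative assumptions, not the released AiATrack code.

import torch
import torch.nn as nn

class AttentionInAttention(nn.Module):
    def __init__(self, dim, inner_dim=64):
        super().__init__()
        self.scale = dim ** -0.5
        # Inner attention operates on correlation vectors (one per key).
        self.inner_q = nn.LazyLinear(inner_dim)
        self.inner_k = nn.LazyLinear(inner_dim)

    def forward(self, q, k, v):          # q: (B, Nq, C); k, v: (B, Nk, C)
        corr = q @ k.transpose(1, 2) * self.scale   # (B, Nq, Nk) correlations
        # Treat each key's correlation vector as a token and let the vectors
        # attend to one another: consistent correlations are reinforced,
        # erroneous ones suppressed.
        cv = corr.transpose(1, 2)                    # (B, Nk, Nq)
        iq, ik = self.inner_q(cv), self.inner_k(cv)  # (B, Nk, inner_dim)
        inner = torch.softmax(
            iq @ ik.transpose(1, 2) * ik.shape[-1] ** -0.5, dim=-1)
        corr = corr + (inner @ cv).transpose(1, 2)   # residual refinement
        return torch.softmax(corr, dim=-1) @ v       # (B, Nq, C)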