Learning Local Feature Descriptor with Motion Attribute for Vision-based Localization
In recent years, camera-based localization has been widely used for robotic
applications, and most proposed algorithms rely on local features extracted
from recorded images. For better performance, the features used for open-loop
localization are required to be short-term globally static, and the ones used
for re-localization or loop closure detection need to be long-term static.
Therefore, the motion attribute of a local feature point could be exploited to
improve localization performance, e.g., the feature points extracted from
moving persons or vehicles can be excluded from these systems because of their
instability. In this paper, we design a fully convolutional network (FCN),
named MD-Net, to perform motion attribute estimation and feature description
simultaneously. MD-Net has a shared backbone network to extract features from
the input image and two network branches to complete each sub-task. With
MD-Net, we can obtain the motion attribute with little additional computation.
Experimental results demonstrate that the proposed method can learn
distinctive local feature descriptors along with motion attributes using only
an FCN, outperforming competing methods by a wide margin. We also show that
the proposed algorithm can be integrated into a vision-based localization
algorithm to significantly improve estimation accuracy.
Comment: This paper will be presented at IROS
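To make the shared-backbone idea concrete, here is a minimal PyTorch-style sketch, not the authors' implementation: a small fully convolutional backbone feeding two heads, one for dense descriptors and one for a per-pixel motion attribute. The layer sizes, the class layout, and the name SharedBackboneFCN are illustrative assumptions.

```python
# Hypothetical sketch of an MD-Net-style layout (not the paper's code): a shared FCN
# backbone with a descriptor branch and a motion-attribute branch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedBackboneFCN(nn.Module):
    def __init__(self, desc_dim=128, num_motion_classes=2):
        super().__init__()
        # Shared backbone (placeholder depth and widths).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=1, padding=1), nn.ReLU(inplace=True),
        )
        # Branch 1: dense local descriptors, L2-normalized per pixel.
        self.desc_head = nn.Conv2d(128, desc_dim, 1)
        # Branch 2: per-pixel motion-attribute logits (e.g., static vs. moving).
        self.motion_head = nn.Conv2d(128, num_motion_classes, 1)

    def forward(self, image):
        feat = self.backbone(image)                      # computed once, shared by both heads
        desc = F.normalize(self.desc_head(feat), dim=1)  # descriptor map
        motion = self.motion_head(feat)                  # motion-attribute map
        return desc, motion

# Descriptors at locations predicted as "moving" could then be dropped before matching,
# e.g. keep = motion.argmax(dim=1) == 0, assuming class 0 denotes static points.
net = SharedBackboneFCN()
desc, motion = net(torch.randn(1, 3, 240, 320))
```

Because the backbone is evaluated once and both heads are single 1x1 convolutions, the motion attribute comes almost for free on top of description, which is the efficiency argument made in the abstract.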
Mixed Neural Voxels for Fast Multi-view Video Synthesis
Synthesizing high-fidelity videos from real-world multi-view input is
challenging because of the complexities of real-world environments and highly
dynamic motions. Previous works based on neural radiance fields have
demonstrated high-quality reconstructions of dynamic scenes. However, training
such models on real-world scenes is time-consuming, usually taking days or
weeks. In this paper, we present a novel method named MixVoxels to better
represent the dynamic scenes with fast training speed and competitive rendering
quality. The proposed MixVoxels represents the 4D dynamic scenes as a mixture
of static and dynamic voxels and processes them with different networks. In
this way, the computation of the required modalities for static voxels can be
processed by a lightweight model, which essentially reduces the amount of
computation, especially for many daily dynamic scenes dominated by the static
background. To separate the two kinds of voxels, we propose a novel variation
field to estimate the temporal variance of each voxel. For the dynamic voxels,
we design an inner-product time query method to efficiently query multiple time
steps, which is essential to recover the high-dynamic motions. As a result,
with 15 minutes of training on dynamic scenes with 300-frame video inputs,
MixVoxels achieves better PSNR than previous methods. Code and trained models
are available at https://github.com/fengres/mixvoxels
Comment: ICCV 2023 (Oral)
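As a rough illustration of the two ideas in the abstract, here is a small PyTorch-style sketch, not the released MixVoxels code: a variation-field threshold that splits voxels into static and dynamic sets, and an inner-product time query that evaluates many time steps with a single matrix product. The tensor shapes, the threshold value, and all variable names are assumptions for illustration.

```python
# Hypothetical sketch (not the MixVoxels implementation) of the static/dynamic voxel
# split and the inner-product time query described in the abstract.
import torch

num_voxels, feat_dim, num_steps = 10_000, 32, 300
voxel_features = torch.randn(num_voxels, feat_dim)   # per-voxel features (learnable in practice)
time_embeddings = torch.randn(num_steps, feat_dim)   # per-time-step codes (learnable in practice)
variation = torch.rand(num_voxels)                   # estimated temporal variation per voxel

# 1) Variation-field split: low-variation voxels go to a lightweight static branch.
dynamic_mask = variation > 0.1                       # threshold is an assumed hyper-parameter
static_features = voxel_features[~dynamic_mask]
dynamic_features = voxel_features[dynamic_mask]

# 2) Inner-product time query: one matmul yields values for all queried time steps at
#    once, instead of running the network separately per time step.
queried_steps = torch.tensor([0, 10, 50, 299])
dynamic_values = dynamic_features @ time_embeddings[queried_steps].T
print(dynamic_values.shape)                          # (num_dynamic_voxels, num_queried_steps)
```

The point of the split is that the static branch does work independent of time, so scenes dominated by a static background pay the time-dependent computation only on the (typically small) dynamic subset.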
Ultrafast Video Attention Prediction with Coupled Knowledge Distillation
Large convolutional neural network models have recently demonstrated
impressive performance on video attention prediction. Conventionally, these
models require intensive computation and large memory. To address these
issues, we design an extremely lightweight network with ultrafast speed, named
UVA-Net. The network is built on depth-wise convolutions and takes
low-resolution images as input. However, this straightforward acceleration
method decreases performance dramatically. To this end, we propose a
coupled knowledge distillation strategy to augment and train the network
effectively. With this strategy, the model can further automatically discover
and emphasize implicit useful cues contained in the data. Both the spatial and
temporal knowledge learned by the high-resolution, more complex teacher networks
can also be distilled and transferred into the proposed low-resolution, lightweight
spatiotemporal network. Experimental results show that the performance of our
model is comparable to that of ten state-of-the-art models in video attention
prediction, while it has a memory footprint of only 0.68 MB and runs at about
10,106 FPS on a GPU and 404 FPS on a CPU, which is 206 times faster than
previous models.
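A brief PyTorch-style sketch, not the authors' code, of the ingredients described above: a depthwise-separable low-resolution student and a loss that mixes the supervised term with distillation terms from spatial and temporal teacher predictions. The layer sizes, loss weights, and the assumption that teachers provide attention maps at the student's resolution are all illustrative.

```python
# Hypothetical sketch of a UVA-Net-style setup (not the paper's implementation):
# a depthwise-convolution student on low-resolution frames, trained with a coupled
# distillation loss from spatial and temporal teachers.
import torch
import torch.nn as nn
import torch.nn.functional as F

def depthwise_separable(in_ch, out_ch):
    # Depthwise convolution followed by a pointwise (1x1) convolution.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1), nn.ReLU(inplace=True),
    )

# Lightweight student that predicts a single-channel attention (saliency) map.
student = nn.Sequential(depthwise_separable(3, 16),
                        depthwise_separable(16, 32),
                        nn.Conv2d(32, 1, 1))

def coupled_distillation_loss(student_logits, teacher_spatial, teacher_temporal, gt,
                              w_spatial=0.5, w_temporal=0.5):
    # Supervised term plus two distillation terms transferring spatial and temporal
    # knowledge from teacher predictions at the student's resolution.
    # The weighting scheme is an assumption, not the paper's exact formulation.
    task = F.binary_cross_entropy_with_logits(student_logits, gt)
    spatial = F.mse_loss(torch.sigmoid(student_logits), teacher_spatial)
    temporal = F.mse_loss(torch.sigmoid(student_logits), teacher_temporal)
    return task + w_spatial * spatial + w_temporal * temporal

# Usage with low-resolution input frames (the acceleration lever named in the abstract).
frames = torch.randn(2, 3, 64, 64)
pred = student(frames)                                # (2, 1, 64, 64) attention logits
```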