Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning
As the most essential property in a video, motion information is critical to
a robust and generalized video representation. To inject motion dynamics,
recent works have adopted frame difference as the source of motion information
in video contrastive learning, considering the trade-off between quality and
cost. However, existing works align motion features only at the instance
level, which suffers from weak spatial and temporal alignment across
modalities. In
this paper, we present a \textbf{Fi}ne-grained \textbf{M}otion
\textbf{A}lignment (FIMA) framework, capable of introducing well-aligned and
significant motion information. Specifically, we first develop a dense
contrastive learning framework in the spatiotemporal domain to generate
pixel-level motion supervision. Then, we design a motion decoder and a
foreground sampling strategy to eliminate the weak alignments in terms of time
and space. Moreover, a frame-level motion contrastive loss is presented to
improve the temporal diversity of the motion features. Extensive experiments
demonstrate that the representations learned by FIMA possess great
motion-awareness capabilities and achieve state-of-the-art or competitive
results on downstream tasks across UCF101, HMDB51, and Diving48 datasets. Code
is available at \url{https://github.com/ZMHH-H/FIMA}.
Comment: ACM MM 2023 Camera Ready
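To make the frame-difference motion cue concrete, here is a minimal PyTorch sketch of extracting motion information from a clip by temporal differencing; the tensor layout and value range are our assumptions, not FIMA's exact pipeline.

```python
# Minimal sketch of frame difference as a cheap motion cue, in the spirit of
# the abstract above. The (B, C, T, H, W) layout and [0, 1] range are
# illustrative assumptions, not FIMA's exact pipeline.
import torch

def frame_difference(clip: torch.Tensor) -> torch.Tensor:
    """clip: (B, C, T, H, W) RGB clip in [0, 1].
    Returns (B, C, T-1, H, W) absolute frame differences as a motion proxy."""
    return (clip[:, :, 1:] - clip[:, :, :-1]).abs()

clip = torch.rand(2, 3, 8, 112, 112)   # hypothetical batch of 8-frame clips
motion = frame_difference(clip)
print(motion.shape)                     # torch.Size([2, 3, 7, 112, 112])
```

Compared with optical flow, this costs one subtraction per pixel pair, which is the quality-versus-cost trade-off the abstract mentions.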
TransPose: 6D Object Pose Estimation with Geometry-Aware Transformer
Estimating the 6D object pose is an essential task in many applications. Due
to the lack of depth information, existing RGB-based methods are sensitive to
occlusion and illumination changes. How to extract and utilize the geometric
features in depth information is crucial for achieving accurate predictions. To
this end, we propose TransPose, a novel 6D pose framework that exploits a
Transformer Encoder with a geometry-aware module to learn better point cloud
feature representations. Specifically, we first uniformly sample the point
cloud and extract local geometry features with a purpose-designed local feature
extractor based on a graph convolutional network. To improve robustness to
occlusion, we adopt a Transformer to exchange global information, so that each
local feature also carries global context. Finally, we introduce a
geometry-aware module into the Transformer Encoder, which forms an effective
constraint for point cloud feature learning and couples the global information
exchange more tightly with point cloud tasks. Extensive experiments indicate
the effectiveness of TransPose; our pose estimation pipeline achieves
competitive results on three benchmark datasets.
Comment: 10 pages, 5 figures, IEEE Journal
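As a rough illustration of the pipeline the abstract describes (local graph-based geometry features followed by Transformer-based global exchange), here is a hedged PyTorch sketch; the EdgeConv-style features, layer sizes, and neighbor count are our assumptions rather than the paper's exact design.

```python
# Sketch: kNN edge features (local geometry) -> Transformer encoder (global
# exchange). All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

def knn_edge_features(points: torch.Tensor, k: int = 16) -> torch.Tensor:
    """points: (B, N, 3) -> (B, N, k, 6) local features [x_i, x_j - x_i]."""
    dist = torch.cdist(points, points)                       # (B, N, N)
    idx = dist.topk(k + 1, largest=False).indices[:, :, 1:]  # drop self-match
    B, N, _ = points.shape
    nbrs = torch.gather(points.unsqueeze(1).expand(B, N, N, 3), 2,
                        idx.unsqueeze(-1).expand(B, N, k, 3))
    center = points.unsqueeze(2).expand(B, N, k, 3)
    return torch.cat([center, nbrs - center], dim=-1)

class TransPoseSketch(nn.Module):
    def __init__(self, d_model: int = 128, k: int = 16):
        super().__init__()
        self.k = k
        self.local = nn.Sequential(nn.Linear(6, d_model), nn.ReLU())
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        e = knn_edge_features(points, self.k)    # local geometry per point
        f = self.local(e).max(dim=2).values      # (B, N, d): pool over neighbors
        return self.encoder(f)                   # global exchange via attention

feats = TransPoseSketch()(torch.rand(2, 256, 3))
print(feats.shape)  # torch.Size([2, 256, 128])
```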
Ambient-Aware LiDAR Odometry in Variable Terrains
The flexibility of Simultaneous Localization and Mapping (SLAM) algorithms in
various environments has consistently been a significant challenge. To address
the issue of LiDAR odometry drift in high-noise settings, integrating
clustering methods to filter out unstable features has become an effective
module of SLAM frameworks. However, reducing the amount of point cloud data can
lead to potential loss of information and possible degeneration. As a result,
this research proposes a LiDAR odometry method that can dynamically assess the
point cloud's reliability. The algorithm aims to improve adaptability in
diverse settings by selecting important feature points that are sensitive to
the level of environmental degeneration. First, a fast adaptive Euclidean
clustering algorithm based on range images is proposed, which, combined with
depth clustering, extracts the primary structural points of the environment,
defined as ambient skeleton points. Then, the environmental degeneration level
is computed from the dense normal features of the skeleton points, and the
point cloud cleaning is dynamically adjusted accordingly. The algorithm is
validated on the KITTI benchmark and in real environments, demonstrating higher
accuracy and robustness across different environments.
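The range-image representation underpinning the proposed clustering can be sketched as follows in NumPy; the vertical field of view and image resolution are illustrative assumptions (typical of a 64-beam sensor), not the paper's settings.

```python
# Sketch: spherical projection of a LiDAR scan into a range image, the
# representation on which fast range-image clustering typically operates.
import numpy as np

def to_range_image(points, h=64, w=1024, fov_up=3.0, fov_down=-25.0):
    """points: (N, 3) LiDAR points. Returns an (h, w) range image in meters."""
    r = np.linalg.norm(points, axis=1)                      # range per point
    yaw = np.arctan2(points[:, 1], points[:, 0])            # azimuth angle
    pitch = np.arcsin(points[:, 2] / np.maximum(r, 1e-8))   # elevation angle
    fov_up_r, fov_down_r = np.radians(fov_up), np.radians(fov_down)
    u = ((yaw + np.pi) / (2 * np.pi) * w).astype(int) % w   # column index
    v = (fov_up_r - pitch) / (fov_up_r - fov_down_r) * h    # row index
    v = np.clip(v.astype(int), 0, h - 1)
    img = np.full((h, w), np.inf)
    np.minimum.at(img, (v, u), r)                           # keep nearest return
    return img

scan = np.random.randn(10000, 3) * 10   # hypothetical scan for illustration
print(to_range_image(scan).shape)        # (64, 1024)
```

Neighboring pixels in this image correspond to physically adjacent beams, which is what makes Euclidean clustering on it fast.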
Causality-based Cross-Modal Representation Learning for Vision-and-Language Navigation
Vision-and-Language Navigation (VLN) has gained significant research interest
in recent years due to its potential applications in real-world scenarios.
However, existing VLN methods struggle with the issue of spurious associations,
resulting in poor generalization with a significant performance gap between
seen and unseen environments. In this paper, we tackle this challenge by
proposing a unified framework, CausalVLN, based on the causal learning paradigm
to train a robust navigator capable of learning unbiased feature
representations. Specifically, we establish reasonable assumptions about
confounders for vision and language in VLN using the structured causal model
(SCM). Building upon this, we propose an iterative backdoor-based
representation learning (IBRL) method that allows for adaptive and effective
intervention on confounders. Furthermore, we introduce visual and
linguistic backdoor causal encoders to enable unbiased feature expression for
multi-modalities during training and validation, enhancing the agent's
capability to generalize across different environments. Experiments on three
VLN datasets (R2R, RxR, and REVERIE) showcase the superiority of our proposed
method over previous state-of-the-art approaches. Moreover, detailed
visualization analysis demonstrates the effectiveness of CausalVLN in
significantly narrowing down the performance gap between seen and unseen
environments, underscoring its strong generalization capability.
Comment: 16 pages
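For intuition on backdoor-based intervention of the kind IBRL performs, here is a hedged PyTorch sketch of a deconfounding layer that approximates P(Y|do(X)) = \sum_z P(Y|X,z)P(z) with a learned confounder dictionary; the dictionary size, priors, and fusion are our assumptions, not the paper's formulation.

```python
# Sketch: backdoor adjustment as an expectation over confounder prototypes.
# Dictionary size, learnable prior, and fusion layer are all assumptions.
import torch
import torch.nn as nn

class BackdoorAdjust(nn.Module):
    def __init__(self, d: int = 256, n_confounders: int = 64):
        super().__init__()
        # Confounder dictionary z_1..z_K (e.g., clustered scene/word features).
        self.confounders = nn.Parameter(torch.randn(n_confounders, d))
        self.prior = nn.Parameter(torch.zeros(n_confounders))  # learnable P(z)
        self.fuse = nn.Linear(2 * d, d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (B, d) biased feature -> (B, d) deconfounded feature, mixing the
        observation with a soft expectation over the confounder dictionary."""
        attn = torch.softmax(x @ self.confounders.t() + self.prior, dim=-1)
        z = attn @ self.confounders          # (B, d) expected confounder
        return self.fuse(torch.cat([x, z], dim=-1))

out = BackdoorAdjust()(torch.randn(4, 256))
print(out.shape)  # torch.Size([4, 256])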
Unbiased Directed Object Attention Graph for Object Navigation
Object navigation tasks require agents to locate specific objects in unknown
environments based on visual information. Previously, graph convolutions were
used to implicitly explore the relationships between objects. However, due to
differences in visibility among objects, biases in object attention easily
arise. Thus, in this paper, we propose a directed object attention
(DOA) graph to guide the agent in explicitly learning the attention
relationships between objects, thereby reducing the object attention bias. In
particular, we use the DOA graph to perform unbiased adaptive object attention
(UAOA) on the object features and unbiased adaptive image attention (UAIA) on
the raw images, respectively. To distinguish features in different branches, a
concise adaptive branch energy distribution (ABED) method is proposed. We
assess our methods on the AI2-Thor dataset. Compared with the state-of-the-art
(SOTA) method, our method achieves 7.4%, 8.1%, and 17.6% increases in success
rate (SR), success weighted by path length (SPL), and success weighted by
action efficiency (SAE), respectively.
Comment: 13 pages, ready for ACM Multimedia, under review
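A directed object-attention graph of the kind the abstract proposes can be sketched as attention with a learned, asymmetric edge bias, so the influence of object i on object j need not equal that of j on i; the object count and feature width below are illustrative assumptions.

```python
# Sketch: directed (asymmetric) attention over per-object features, so that
# learned edge weights, not raw visibility, drive object attention.
import torch
import torch.nn as nn

class DirectedObjectAttention(nn.Module):
    def __init__(self, n_objects: int = 22, d: int = 64):
        super().__init__()
        self.q = nn.Linear(d, d)
        self.k = nn.Linear(d, d)
        # Learned directed bias: edge i->j is independent of edge j->i.
        self.edge_bias = nn.Parameter(torch.zeros(n_objects, n_objects))

    def forward(self, obj: torch.Tensor) -> torch.Tensor:
        """obj: (B, N, d) per-object features -> (B, N, d) re-weighted features."""
        logits = self.q(obj) @ self.k(obj).transpose(1, 2)   # (B, N, N)
        logits = logits / obj.shape[-1] ** 0.5 + self.edge_bias
        attn = torch.softmax(logits, dim=-1)                 # row-wise, directed
        return attn @ obj

out = DirectedObjectAttention()(torch.randn(2, 22, 64))
print(out.shape)  # torch.Size([2, 22, 64])
```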
PASTS: Progress-Aware Spatio-Temporal Transformer Speaker For Vision-and-Language Navigation
Vision-and-language navigation (VLN) is a crucial but challenging cross-modal
navigation task. One powerful technique to enhance the generalization
performance in VLN is the use of an independent speaker model to provide pseudo
instructions for data augmentation. However, current speaker models based on
Long Short-Term Memory (LSTM) lack the ability to attend to features relevant
at different locations and time steps. To address this, we propose a novel
progress-aware spatio-temporal transformer speaker (PASTS) model that uses the
transformer as the core of the network. PASTS uses a spatio-temporal encoder to
fuse panoramic representations and encode intermediate connections through
steps. In addition, to avoid the misalignment problem that could result in
incorrect supervision, a speaker progress monitor (SPM) is proposed to enable
the model to estimate the progress of instruction generation and facilitate
more fine-grained caption results. Additionally, a multifeature dropout (MFD)
strategy is introduced to alleviate overfitting. The proposed PASTS can be
flexibly combined with existing VLN models. The experimental results demonstrate
that PASTS outperforms all existing speaker models and successfully improves
the performance of previous VLN models, achieving state-of-the-art performance
on the standard Room-to-Room (R2R) dataset.
Comment: 15 pages, 11 figures
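The multifeature dropout (MFD) idea, randomly silencing whole feature streams so the speaker cannot over-rely on any one input, can be sketched as below; the drop rate and per-batch-item granularity are our assumptions, not the paper's exact strategy.

```python
# Sketch: drop entire feature streams (e.g., vision, action, instruction
# tokens) independently per batch item during training.
import torch

def multifeature_dropout(streams, p: float = 0.3, training: bool = True):
    """streams: list of (B, T, d) feature tensors. Drops each stream i.i.d."""
    if not training:
        return streams
    out = []
    for s in streams:
        mask = (torch.rand(s.shape[0], 1, 1, device=s.device) > p).float()
        out.append(s * mask)  # zero the whole stream for sampled batch items
    return out

v, a = torch.rand(4, 10, 512), torch.rand(4, 10, 512)
v_drop, a_drop = multifeature_dropout([v, a])
```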
InstructDET: Diversifying Referring Object Detection with Generalized Instructions
We propose InstructDET, a data-centric method for referring object detection
(ROD) that localizes target objects based on user instructions. While derived
from referring expressions (REC), the instructions we leverage are greatly
diversified to encompass common user intentions related to object detection.
For one image, we produce a large number of instructions that refer to every
single object and to different combinations of multiple objects. Each
instruction and its corresponding object bounding boxes (bbxs) constitute one
training data pair. In order to encompass common detection expressions, we
employ an emerging vision-language model (VLM) and a large language model (LLM)
to generate instructions guided by text prompts and object bbxs, as the
generalization ability of foundation models is effective for producing
human-like expressions (e.g., describing object property, category, and
relationship). We name our constructed dataset InDET. It contains images,
bbxs, and generalized instructions produced by foundation models. Our InDET is
developed from existing REC datasets and object detection datasets, with the
expanding potential that any image with object bbxs can be incorporated by
using our InstructDET method. By using our InDET dataset, we show that a
conventional ROD model surpasses existing methods on standard REC datasets and
our InDET test set. Our data-centric method InstructDET, with automatic data
expansion by leveraging foundation models, points to a promising direction in
which ROD can be greatly diversified to execute common object detection
instructions.
Comment: 29 pages (including Appendix), published in ICLR
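The instruction-plus-boxes training pairs the abstract describes can be pictured with a small data-structure sketch; the record layout and prompt template are hypothetical illustrations, not the released InDET format.

```python
# Sketch: one ROD training pair = a generalized instruction plus its target
# bounding boxes. Field names and the prompt wording are hypothetical.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class RODExample:
    image_path: str
    instruction: str        # generalized user instruction, e.g. from a VLM/LLM
    boxes: List[Box]        # one or more target objects per instruction

def make_prompt(category: str, box: Box) -> str:
    """Hypothetical text prompt asking an LLM to diversify one referring expression."""
    return (f"Write a natural user instruction for detecting the {category} "
            f"located at {box} in the image; vary property, category, and relation.")

ex = RODExample("img_001.jpg", "find the mug left of the laptop",
                [(120.0, 80.0, 190.0, 160.0)])
print(make_prompt("mug", ex.boxes[0]))
```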
- …