38 research outputs found
Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer?
Vision Transformers (ViTs) have proven effective in solving 2D image
understanding tasks by training on large-scale image datasets, and, as a
largely separate track, in modeling the 3D visual world, such as voxels
or point clouds. However, with the growing hope that transformers can become
the "universal" modeling tool for heterogeneous data, ViTs for 2D and 3D tasks
have so far adopted vastly different architecture designs that are hardly
transferable. That invites an (over-)ambitious question: can we close the gap
between the 2D and 3D ViT architectures? As a pilot study, this paper
demonstrates the appealing promise of understanding the 3D visual world using a
standard 2D ViT architecture, with only minimal customization at the input and
output levels and without redesigning the pipeline. To build a 3D ViT from its 2D
sibling, we "inflate" the patch embedding and token sequence, accompanied by
new positional encoding mechanisms designed to match the 3D data geometry. The
resultant "minimalist" 3D ViT, named Simple3D-Former, performs surprisingly
robustly on popular 3D tasks such as object classification, point cloud
segmentation, and indoor scene detection, compared to highly customized
3D-specific designs. It can hence act as a strong baseline for new 3D ViTs.
Moreover, we note that pursuing a unified 2D-3D ViT design has practical
relevance beyond mere scientific curiosity. Specifically, we demonstrate that
Simple3D-Former naturally exploits the wealth of pre-trained weights from
large-scale realistic 2D images (e.g., ImageNet), which can be plugged in to
enhance 3D task performance "for free".
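The "inflation" idea can be sketched concretely. The snippet below is a minimal NumPy illustration, not the paper's implementation: it replicates a 2D patch-embedding kernel along a new depth axis and rescales it, so a voxel grid that is constant over depth yields the same embedding as the original 2D layer (the function names and the equal-weight replication scheme are assumptions for illustration).

```python
import numpy as np

def inflate_patch_embed(w2d: np.ndarray, depth: int) -> np.ndarray:
    """Inflate a 2D patch-embedding weight (C_out, C_in, kh, kw) into a
    3D one (C_out, C_in, depth, kh, kw) by replicating along the new
    depth axis and rescaling, so a depth-constant input produces the
    same embedding as the original 2D layer."""
    return np.repeat(w2d[:, :, None, :, :], depth, axis=2) / depth

def patch_embed_3d(vox: np.ndarray, w3d: np.ndarray) -> np.ndarray:
    """Split a voxel grid (C, D, H, W) into non-overlapping patches and
    project each to an embedding: a minimal stand-in for a Conv3d whose
    stride equals its kernel size."""
    c_out, c_in, kd, kh, kw = w3d.shape
    C, D, H, W = vox.shape
    tokens = []
    for z in range(0, D, kd):
        for y in range(0, H, kh):
            for x in range(0, W, kw):
                patch = vox[:, z:z+kd, y:y+kh, x:x+kw]
                tokens.append(np.einsum('cdhw,ocdhw->o', patch, w3d))
    return np.stack(tokens)  # (num_tokens, C_out) token sequence

# a 2D patch embed for 16x16 patches, inflated to 16x16x16 voxel patches
w2d = np.random.randn(768, 3, 16, 16)
w3d = inflate_patch_embed(w2d, depth=16)
vox = np.random.randn(3, 32, 32, 32)   # a tiny voxel grid
tokens = patch_embed_3d(vox, w3d)
print(tokens.shape)  # (8, 768)
```

Dividing by `depth` preserves the scale of activations, which matters when the inflated layer is initialized from 2D pre-trained weights.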
FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax
Text-to-video (T2V) generation is a rapidly growing research area that aims
to translate the scenes, objects, and actions within complex video text into a
sequence of coherent visual frames. We present FlowZero, a novel framework that
combines Large Language Models (LLMs) with image diffusion models to generate
temporally-coherent videos. FlowZero uses LLMs to understand complex
spatio-temporal dynamics from text, where LLMs can generate a comprehensive
dynamic scene syntax (DSS) containing scene descriptions, object layouts, and
background motion patterns. These elements in DSS are then used to guide the
image diffusion model for video generation with smooth object motions and
frame-to-frame coherence. Moreover, FlowZero incorporates an iterative
self-refinement process, enhancing the alignment between the spatio-temporal
layouts and the textual prompts for the videos. To enhance global coherence, we
propose enriching the initial noise of each frame with motion dynamics to
control the background movement and camera motion adaptively. By using
spatio-temporal syntaxes to guide the diffusion process, FlowZero achieves
improved zero-shot video synthesis, generating coherent videos with vivid
motion.
Comment: Project page: https://flowzero-video.github.i
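The idea of enriching initial noise with motion dynamics can be sketched as follows. This is a hedged illustration, not FlowZero's actual scheme: each frame's initial latent shares a translated copy of one base noise tensor, blended with fresh noise, so the diffusion process sees a consistent background pattern that drifts frame to frame (the function name, `shift`, and `blend` parameters are assumptions).

```python
import numpy as np

def motion_enriched_noise(shape, num_frames, shift=(0, 2), blend=0.8, seed=0):
    """Build per-frame initial latents that share a translated base noise,
    so the sampler sees a consistent, moving background pattern.
    `shift` is the per-frame (dy, dx) translation controlling background
    and camera motion; `blend` trades shared structure for fresh noise."""
    rng = np.random.default_rng(seed)
    base = rng.standard_normal(shape)  # (C, H, W) latent shared by all frames
    frames = []
    for t in range(num_frames):
        moved = np.roll(base, (t * shift[0], t * shift[1]), axis=(1, 2))
        fresh = rng.standard_normal(shape)
        mix = blend * moved + (1 - blend) * fresh
        # renormalize so the mixture stays (approximately) unit-variance noise
        frames.append(mix / np.sqrt(blend**2 + (1 - blend)**2))
    return np.stack(frames)  # (T, C, H, W)

latents = motion_enriched_noise((4, 64, 64), num_frames=8)
print(latents.shape)  # (8, 4, 64, 64)
```

Keeping the mixture at unit variance matters because diffusion samplers assume standard-normal initial noise.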
Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on Point Cloud Videos
We propose a unified point cloud video self-supervised learning framework for
object-centric and scene-centric data. Previous methods commonly conduct
representation learning at the clip or frame level and cannot well capture
fine-grained semantics. Instead of contrasting the representations of clips or
frames, in this paper, we propose a unified self-supervised framework by
conducting contrastive learning at the point level. Moreover, we introduce a
new pretext task that enforces semantic alignment of superpoints, which further
encourages the representations to capture semantic cues at multiple scales. In
addition, due to the high redundancy in the temporal dimension of dynamic point
clouds, directly conducting contrastive learning at the point level usually
leads to massive undesired negatives and insufficient modeling of positive
representations. To remedy this, we propose a selection strategy to retain
proper negatives and make use of high-similarity samples from other instances
as positive supplements. Extensive experiments show that our method outperforms
supervised counterparts on a wide range of downstream tasks and demonstrates
the superior transferability of the learned representations.
Comment: Accepted by ICCV 202
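The negative-selection idea can be sketched with a point-level InfoNCE loss. This is an illustrative reading of the abstract, not the paper's exact formulation: candidate features that are highly similar to the anchor are reassigned from the negative set to the positive set, and the rest are retained as negatives (the threshold and function names are assumptions).

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def point_contrastive_loss(anchor, positive, candidates,
                           drop_thresh=0.9, temperature=0.1):
    """InfoNCE at the point level. Candidates whose cosine similarity to
    the anchor exceeds `drop_thresh` are treated as likely same-semantics
    points: they are added as extra positives instead of serving as
    (undesired) negatives; the remaining candidates stay negatives."""
    a, p = l2norm(anchor), l2norm(positive)
    c = l2norm(candidates)
    sims = c @ a                          # cosine similarity to anchor
    keep = sims < drop_thresh             # retained negatives
    extra_pos = c[~keep]                  # high-similarity positive supplements
    pos_logits = [a @ p] + [a @ e for e in extra_pos]
    neg_logits = list(c[keep] @ a)
    logits = np.array(pos_logits + neg_logits) / temperature
    logp = logits - np.log(np.exp(logits).sum())
    return -logp[:len(pos_logits)].mean()  # average NLL over all positives

anchor = np.random.randn(32)
positive = anchor + 0.1 * np.random.randn(32)
candidates = np.random.randn(64, 32)
print(float(point_contrastive_loss(anchor, positive, candidates)))
```

The reassignment step is what prevents temporally redundant points of the same semantic region from being pushed apart as false negatives.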
Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering
The main challenge in video question answering (VideoQA) is to capture and
understand the complex spatial and temporal relations between objects based on
given questions. Existing graph-based methods for VideoQA usually ignore
keywords in questions and employ a simple graph to aggregate features without
considering relative relations between objects, which may lead to inferior
performance. In this paper, we propose a Keyword-aware Relative Spatio-Temporal
(KRST) graph network for VideoQA. First, to make question features aware of
keywords, we employ an attention mechanism to assign high weights to keywords
during question encoding. The keyword-aware question features are then used to
guide video graph construction. Second, because relations are relative, we
integrate the relative relation modeling to better capture the spatio-temporal
dynamics among object nodes. Moreover, we disentangle the spatio-temporal
reasoning into an object-level spatial graph and a frame-level temporal graph,
which reduces the impact of spatial and temporal relation reasoning on each
other. Extensive experiments on the TGIF-QA, MSVD-QA and MSRVTT-QA datasets
demonstrate the superiority of our KRST over multiple state-of-the-art methods.
Comment: under review
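The keyword-aware encoding step can be sketched as attention pooling with boosted keyword scores. This is a minimal stand-in, not the paper's encoder: keyword positions receive an additive boost before the softmax, so the pooled question feature that later guides graph construction is dominated by keyword tokens (the scoring rule and `boost` parameter are assumptions).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def keyword_aware_encoding(token_feats, keyword_mask, boost=2.0):
    """Pool question token features (T, D) with attention that up-weights
    tokens flagged by `keyword_mask` (T,), returning the pooled feature
    and the attention distribution over tokens."""
    scores = token_feats @ token_feats.mean(0)   # crude token relevance score
    scores = scores + boost * keyword_mask       # additive keyword boost
    attn = softmax(scores)
    return attn @ token_feats, attn

feats = np.random.randn(6, 16)                   # 6 question tokens
mask = np.array([0, 0, 1, 0, 1, 0.0])            # tokens 2 and 4 are keywords
pooled, attn = keyword_aware_encoding(feats, mask)
print(pooled.shape)  # (16,)
```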
Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos
Recently, the community has made tremendous progress in developing effective
methods for point cloud video understanding that learn from massive amounts of
labeled data. However, annotating point cloud videos is usually notoriously
expensive. Moreover, training via one or only a few traditional tasks (e.g.,
classification) may be insufficient to learn subtle details of the
spatio-temporal structure existing in point cloud videos. In this paper, we
propose a Masked Spatio-Temporal Structure Prediction (MaST-Pre) method to
capture the structure of point cloud videos without human annotations. MaST-Pre
is based on spatio-temporal point-tube masking and consists of two
self-supervised learning tasks. First, by reconstructing masked point tubes,
our method is able to capture the appearance information of point cloud videos.
Second, to learn motion, we propose a temporal cardinality difference
prediction task that estimates the change in the number of points within a
point tube. In this way, MaST-Pre is forced to model the spatial and temporal
structure in point cloud videos. Extensive experiments on MSRAction-3D,
NTU-RGBD, NvGesture, and SHREC'17 demonstrate the effectiveness of the proposed
method.
Comment: Accepted by ICCV 202
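The temporal cardinality difference target is simple to state in code. The sketch below is an illustrative reading of the abstract, not the paper's implementation: a point tube is taken as a fixed spatial neighborhood tracked across frames, and the self-supervised target is the change in how many points fall inside it.

```python
import numpy as np

def tube_cardinality_difference(points_t0, points_t1, center, radius):
    """Count how many points (N, 3) fall inside a point tube, modeled here
    as a ball of `radius` around `center`, at two timestamps, and return
    the change in cardinality, the quantity the motion task predicts."""
    def count(pts):
        return int((np.linalg.norm(pts - center, axis=1) < radius).sum())
    return count(points_t1) - count(points_t0)

t0 = np.array([[0.0, 0, 0], [0.1, 0, 0], [5.0, 5, 5]])
t1 = np.array([[0.0, 0, 0], [0, 0.1, 0], [0.1, 0.1, 0], [4.0, 4, 4]])
print(tube_cardinality_difference(t0, t1, center=np.zeros(3), radius=1.0))  # 1
```

Because the count changes only when points move into or out of the tube, predicting it forces the encoder to model motion rather than static appearance.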
A Study on Differentiable Logic and LLMs for EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition 2023
In this technical report, we present our findings from a study conducted on
the EPIC-KITCHENS-100 Unsupervised Domain Adaptation task for Action
Recognition. Our research focuses on the innovative application of a
differentiable logic loss in the training to leverage the co-occurrence
relations between verb and noun, as well as the pre-trained Large Language
Models (LLMs) to generate the logic rules for the adaptation to unseen action
labels. Specifically, the model's predictions are treated as the truth
assignment of a co-occurrence logic formula to compute the logic loss, which
measures the consistency between the predictions and the logic constraints. By
using the verb-noun co-occurrence matrix generated from the dataset, we observe
a moderate improvement in model performance compared to our baseline framework.
To further enhance the model's adaptability to novel action labels, we
experiment with rules generated using GPT-3.5, which leads to a slight decrease
in performance. These findings shed light on the potential and challenges of
incorporating differentiable logic and LLMs for knowledge extraction in
unsupervised domain adaptation for action recognition. Our final submission
(entitled `NS-LLM') achieved first place in terms of top-1 action recognition
accuracy.
Comment: Technical report submitted to CVPR 2023 EPIC-Kitchens challenge
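The co-occurrence logic loss admits a compact sketch. This is an illustrative construction consistent with the abstract, not the report's exact loss: the verb and noun probability vectors are treated as a soft truth assignment, and the loss penalizes probability mass placed on verb-noun pairs that the co-occurrence matrix rules out (the function name and the negative-log form are assumptions).

```python
import numpy as np

def cooccurrence_logic_loss(p_verb, p_noun, cooc):
    """Differentiable logic loss: `p_verb` (V,) and `p_noun` (N,) are
    predicted probabilities; `cooc[i, j] = 1` iff verb i and noun j
    co-occur in the data. The loss measures how much joint probability
    mass violates the co-occurrence constraints (0 when all mass is legal)."""
    joint = np.outer(p_verb, p_noun)        # soft joint truth assignment
    violation = (joint * (1 - cooc)).sum()  # mass on forbidden pairs
    return -np.log(1 - violation + 1e-8)

p_verb = np.array([0.7, 0.3])
p_noun = np.array([0.6, 0.4])
cooc = np.array([[1.0, 1.0], [0.0, 1.0]])   # verb 1 never occurs with noun 0
print(float(cooccurrence_logic_loss(p_verb, p_noun, cooc)))
```

Swapping the dataset-derived matrix for one whose rules are generated by an LLM is what the report evaluates for unseen action labels.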