FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax
Text-to-video (T2V) generation is a rapidly growing research area that aims
to translate the scenes, objects, and actions within complex video text into a
sequence of coherent visual frames. We present FlowZero, a novel framework that
combines Large Language Models (LLMs) with image diffusion models to generate
temporally-coherent videos. FlowZero uses LLMs to understand complex
spatio-temporal dynamics from text, where LLMs can generate a comprehensive
dynamic scene syntax (DSS) containing scene descriptions, object layouts, and
background motion patterns. These elements in DSS are then used to guide the
image diffusion model for video generation with smooth object motions and
frame-to-frame coherence. Moreover, FlowZero incorporates an iterative
self-refinement process, enhancing the alignment between the spatio-temporal
layouts and the textual prompts for the videos. To enhance global coherence, we
propose enriching the initial noise of each frame with motion dynamics to
control the background movement and camera motion adaptively. By using
spatio-temporal syntaxes to guide the diffusion process, FlowZero achieves improved zero-shot video synthesis, generating coherent videos with vivid motion.
Comment: Project page: https://flowzero-video.github.i
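The abstract's idea of "enriching the initial noise of each frame with motion dynamics" can be illustrated with a minimal sketch, which is not the authors' code: a shared base noise map is translated a little more each frame so that neighbouring frames share correlated noise and the background appears to move. The per-frame shift function stands in for motion information that an LLM-produced dynamic scene syntax (DSS) might specify; the blending weights are illustrative assumptions.

```python
import torch

def motion_enriched_noise(num_frames, channels, height, width, motion_per_frame, seed=0):
    """Start from one base noise map and shift it progressively per frame, so
    frames share correlated noise and the background drifts coherently."""
    g = torch.Generator().manual_seed(seed)
    base = torch.randn(channels, height, width, generator=g)
    frames = []
    for t in range(num_frames):
        dx, dy = motion_per_frame(t)                         # integer pixel shift for frame t
        shifted = torch.roll(base, shifts=(dy, dx), dims=(1, 2))
        # Blend with fresh noise so each frame's statistics stay close to N(0, 1):
        # 0.8^2 + 0.6^2 = 1 preserves unit variance.
        fresh = torch.randn(channels, height, width, generator=g)
        frames.append(0.8 * shifted + 0.6 * fresh)
    return torch.stack(frames)                               # (num_frames, C, H, W)

# Example: a hypothetical camera pan of 2 pixels per frame to the right.
noise = motion_enriched_noise(8, 4, 64, 64, lambda t: (2 * t, 0))
print(noise.shape)  # torch.Size([8, 4, 64, 64])
```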
Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer?
Vision Transformers (ViTs) have proven effective in solving 2D image
understanding tasks by training over large-scale image datasets and, meanwhile,
as a somewhat separate track, in modeling the 3D visual world, such as voxels
or point clouds. However, with the growing hope that transformers can become
the "universal" modeling tool for heterogeneous data, ViTs for 2D and 3D tasks
have so far adopted vastly different architecture designs that are hardly
transferable. That invites an (over-)ambitious question: can we close the gap
between the 2D and 3D ViT architectures? As a piloting study, this paper
demonstrates the appealing promise to understand the 3D visual world, using a
standard 2D ViT architecture, with only minimal customization at the input and
output levels without redesigning the pipeline. To build a 3D ViT from its 2D
sibling, we "inflate" the patch embedding and token sequence, accompanied with
new positional encoding mechanisms designed to match the 3D data geometry. The
resultant "minimalist" 3D ViT, named Simple3D-Former, performs surprisingly
robustly on popular 3D tasks such as object classification, point cloud
segmentation and indoor scene detection, compared to highly customized
3D-specific designs. It can hence act as a strong baseline for new 3D ViTs.
Moreover, we note that pursuing a unified 2D-3D ViT design has practical
relevance beyond mere scientific curiosity. Specifically, we demonstrate that
Simple3D-Former naturally enables exploiting the wealth of pre-trained weights
from large-scale realistic 2D images (e.g., ImageNet), which can be plugged in
to enhance 3D task performance "for free".
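A minimal sketch of the "inflation" idea, not the paper's released code: a 2D ViT patch embedding (a Conv2d over image patches) is copied into a 3D one (a Conv3d over voxel patches) so pre-trained 2D weights can be reused. The patch sizes and the averaging scheme below are illustrative assumptions.

```python
import torch
import torch.nn as nn

def inflate_patch_embed(conv2d: nn.Conv2d, depth_patch: int = 4) -> nn.Conv3d:
    """Copy a 2D patch-embedding kernel into a 3D kernel by repeating it along
    the new depth axis and dividing by the depth to preserve the output scale."""
    out_ch, in_ch, kh, kw = conv2d.weight.shape
    conv3d = nn.Conv3d(in_ch, out_ch,
                       kernel_size=(depth_patch, kh, kw),
                       stride=(depth_patch, kh, kw))
    with torch.no_grad():
        w3d = conv2d.weight.unsqueeze(2).repeat(1, 1, depth_patch, 1, 1) / depth_patch
        conv3d.weight.copy_(w3d)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Example: a ViT-style 2D patch embed, inflated and applied to a voxel grid.
patch2d = nn.Conv2d(3, 768, kernel_size=16, stride=16)
patch3d = inflate_patch_embed(patch2d, depth_patch=4)
voxels = torch.randn(1, 3, 32, 64, 64)               # (B, C, D, H, W)
tokens = patch3d(voxels).flatten(2).transpose(1, 2)   # token sequence for the ViT
print(tokens.shape)                                   # torch.Size([1, 128, 768])
```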
ProtChatGPT: Towards Understanding Proteins with Large Language Models
Protein research is crucial in various fundamental disciplines, but
understanding their intricate structure-function relationships remains
challenging. Recent Large Language Models (LLMs) have made significant strides
in comprehending task-specific knowledge, suggesting the potential for
ChatGPT-like systems specialized in proteins to facilitate basic research. In
this work, we introduce ProtChatGPT, which aims at learning and understanding
protein structures via natural languages. ProtChatGPT enables users to upload
proteins, ask questions, and engage in interactive conversations to produce
comprehensive answers. The system comprises protein encoders, a
Protein-Language Pretraining Transformer (PLP-former), a projection adapter, and
an LLM. A protein is first passed through the protein encoders and the
PLP-former to produce protein embeddings, which are then projected by the
adapter to align with the LLM's input space. The LLM finally combines the
user's questions with the projected embeddings to generate informative answers.
Experiments show that ProtChatGPT can produce promising responses to questions
about proteins. We hope that
ProtChatGPT could form the basis for further exploration and application in
protein research. Code and our pre-trained model will be publicly available.
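A minimal sketch of the projection-adapter stage described above, under assumptions (the dimensions, module structure, and token counts are hypothetical, not from the paper): the adapter maps PLP-former outputs into the LLM's embedding space, and the projected protein tokens are prepended to the embedded question before the LLM generates an answer.

```python
import torch
import torch.nn as nn

class ProjectionAdapter(nn.Module):
    """Maps protein embeddings (e.g. PLP-former outputs) to the LLM hidden size."""
    def __init__(self, protein_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(protein_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, protein_tokens: torch.Tensor) -> torch.Tensor:
        return self.proj(protein_tokens)              # (B, N_protein, llm_dim)

# Example: 32 protein query tokens of width 768 projected into a 4096-wide LLM,
# then concatenated with already-embedded question tokens.
adapter = ProjectionAdapter(protein_dim=768, llm_dim=4096)
protein_tokens = torch.randn(1, 32, 768)              # assumed PLP-former output shape
question_tokens = torch.randn(1, 20, 4096)            # LLM embeddings of the user question
llm_input = torch.cat([adapter(protein_tokens), question_tokens], dim=1)
print(llm_input.shape)                                # torch.Size([1, 52, 4096])
```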
Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on Point Cloud Videos
We propose a unified point cloud video self-supervised learning framework for
object-centric and scene-centric data. Previous methods commonly conduct
representation learning at the clip or frame level and thus cannot capture
fine-grained semantics well. Instead of contrasting the representations of clips or
frames, in this paper, we propose a unified self-supervised framework by
conducting contrastive learning at the point level. Moreover, we introduce a
new pretext task by achieving semantic alignment of superpoints, which further
facilitates the representations to capture semantic cues at multiple scales. In
addition, due to the high redundancy in the temporal dimension of dynamic point
clouds, directly conducting contrastive learning at the point level usually
leads to massive undesired negatives and insufficient modeling of positive
representations. To remedy this, we propose a selection strategy to retain
proper negatives and make use of high-similarity samples from other instances
as positive supplements. Extensive experiments show that our method outperforms
supervised counterparts on a wide range of downstream tasks and demonstrates
the superior transferability of the learned representations.
Comment: Accepted by ICCV 202
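A minimal sketch of point-level contrastive learning with negative selection and positive supplements, as the abstract describes; this is not the authors' implementation, and the similarity threshold, top-k count, and loss form are assumptions. Negatives that are too similar to the anchor are dropped, while the most similar features from other instances are treated as extra positives.

```python
import torch
import torch.nn.functional as F

def point_contrastive_loss(anchor, positive, bank, temp=0.1, drop_thresh=0.8, extra_pos=2):
    """anchor, positive: (N, D) point features from two views of the same points;
    bank: (M, D) features from other instances used as candidate negatives."""
    a = F.normalize(anchor, dim=-1)
    p = F.normalize(positive, dim=-1)
    b = F.normalize(bank, dim=-1)

    pos_logit = (a * p).sum(-1, keepdim=True) / temp          # (N, 1) true positive
    neg_sim = a @ b.t()                                        # (N, M) cosine similarities

    # Positive supplements: the top-k most similar bank features per anchor point.
    supp_logit, _ = (neg_sim / temp).topk(extra_pos, dim=-1)

    # Retain only "proper" negatives: drop near-duplicates of the anchor.
    neg_logit = (neg_sim / temp).masked_fill(neg_sim > drop_thresh, float("-inf"))

    logits = torch.cat([pos_logit, supp_logit, neg_logit], dim=-1)
    log_prob = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
    # Average the log-probability over the (1 + extra_pos) positive columns.
    return -log_prob[:, : 1 + extra_pos].mean()

loss = point_contrastive_loss(torch.randn(128, 256), torch.randn(128, 256), torch.randn(1024, 256))
print(loss.item())
```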
Keyword-Aware Relative Spatio-Temporal Graph Networks for Video Question Answering
The main challenge in video question answering (VideoQA) is to capture and
understand the complex spatial and temporal relations between objects based on
given questions. Existing graph-based methods for VideoQA usually ignore
keywords in questions and employ a simple graph to aggregate features without
considering relative relations between objects, which may lead to inferior
performance. In this paper, we propose a Keyword-aware Relative Spatio-Temporal
(KRST) graph network for VideoQA. First, to make question features aware of
keywords, we employ an attention mechanism to assign high weights to keywords
during question encoding. The keyword-aware question features are then used to
guide video graph construction. Second, because relations are relative, we
integrate relative relation modeling to better capture the spatio-temporal
dynamics among object nodes. Moreover, we disentangle the spatio-temporal
reasoning into an object-level spatial graph and a frame-level temporal graph,
which reduces the impact of spatial and temporal relation reasoning on each
other. Extensive experiments on the TGIF-QA, MSVD-QA and MSRVTT-QA datasets
demonstrate the superiority of our KRST over multiple state-of-the-art methods.
Comment: under review
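One way to realize the keyword-aware question encoding mentioned above is sketched below; the module, bias value, and keyword mask are assumptions for illustration rather than the paper's exact design. Attention pooling over question word features receives an additive bias at keyword positions so that keywords get higher weights in the pooled question feature that later guides graph construction.

```python
import torch
import torch.nn as nn

class KeywordAwarePooling(nn.Module):
    def __init__(self, dim: int, keyword_bias: float = 2.0):
        super().__init__()
        self.score = nn.Linear(dim, 1)
        self.keyword_bias = keyword_bias

    def forward(self, word_feats: torch.Tensor, keyword_mask: torch.Tensor) -> torch.Tensor:
        """word_feats: (B, L, D) token features; keyword_mask: (B, L), 1.0 at keywords."""
        logits = self.score(word_feats).squeeze(-1)            # (B, L) base attention scores
        logits = logits + self.keyword_bias * keyword_mask     # boost keyword tokens
        attn = logits.softmax(dim=-1)                          # (B, L) attention weights
        return torch.einsum("bl,bld->bd", attn, word_feats)    # keyword-aware question feature

# Example: a 7-token question where tokens 2 and 5 are keywords.
pool = KeywordAwarePooling(dim=512)
feats = torch.randn(1, 7, 512)
mask = torch.tensor([[0., 0., 1., 0., 0., 1., 0.]])
q = pool(feats, mask)
print(q.shape)  # torch.Size([1, 512])
```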
Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos
Recently, the community has made tremendous progress in developing effective
methods for point cloud video understanding that learn from massive amounts of
labeled data. However, annotating point cloud videos is usually notoriously
expensive. Moreover, training via one or only a few traditional tasks (e.g.,
classification) may be insufficient to learn subtle details of the
spatio-temporal structure existing in point cloud videos. In this paper, we
propose a Masked Spatio-Temporal Structure Prediction (MaST-Pre) method to
capture the structure of point cloud videos without human annotations. MaST-Pre
is based on spatio-temporal point-tube masking and consists of two
self-supervised learning tasks. First, by reconstructing masked point tubes,
our method is able to capture the appearance information of point cloud videos.
Second, to learn motion, we propose a temporal cardinality difference
prediction task that estimates the change in the number of points within a
point tube. In this way, MaST-Pre is forced to model the spatial and temporal
structure in point cloud videos. Extensive experiments on MSRAction-3D,
NTU-RGBD, NvGesture, and SHREC'17 demonstrate the effectiveness of the proposed
method.
Comment: Accepted by ICCV 202
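The temporal cardinality difference target can be illustrated with a minimal sketch; the radius, tube construction, and lack of normalisation here are assumptions rather than the paper's exact recipe. The target is the change in how many points fall inside a point tube from one frame to the next, which the masked model is trained to predict as a proxy for motion.

```python
import torch

def cardinality_difference(frames, centers, radius=0.2):
    """frames: list of (N_t, 3) point clouds for consecutive frames;
    centers: (K, 3) tube centres. Returns a (len(frames) - 1, K) target."""
    counts = []
    for pts in frames:
        dist = torch.cdist(centers, pts)                   # (K, N_t) centre-to-point distances
        counts.append((dist < radius).sum(dim=1).float())  # points inside each tube
    counts = torch.stack(counts)                           # (T, K) per-frame cardinalities
    return counts[1:] - counts[:-1]                        # per-tube change over time

# Example: 4 frames of 1024 points each, 16 tube centres.
frames = [torch.rand(1024, 3) for _ in range(4)]
centers = torch.rand(16, 3)
target = cardinality_difference(frames, centers)
print(target.shape)  # torch.Size([3, 16])
```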