Object Detection in Videos with Tubelet Proposal Networks
Object detection in videos has drawn increasing attention recently with the
introduction of the large-scale ImageNet VID dataset. Unlike object detection in
static images, object detection in videos can exploit temporal information, which
is vital for accuracy. To fully utilize this temporal information, state-of-the-art methods are
based on spatiotemporal tubelets, which are essentially sequences of associated
bounding boxes across time. However, existing methods have major limitations in
the quality and efficiency of tubelet generation. Motion-based methods can obtain
dense tubelets efficiently, but the resulting tubelets are generally only a few
frames long, which is suboptimal for incorporating long-term temporal information.
Appearance-based methods, which usually rely on generic object tracking, can
generate long tubelets but are computationally expensive. In this work, we propose a framework for
object detection in videos, which consists of a novel tubelet proposal network
to efficiently generate spatiotemporal proposals, and a Long Short-term Memory
(LSTM) network that incorporates temporal information from tubelet proposals
for achieving high object detection accuracy in videos. Experiments on the
large-scale ImageNet VID dataset demonstrate the effectiveness of the proposed
framework for object detection in videos.
Comment: CVPR 2017
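To make the second stage of this design concrete, below is a minimal sketch of an LSTM that reads per-frame features along a tubelet proposal and outputs temporally refined classification scores. The module, feature dimension, and class count (assumed here to be 30 classes plus background) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TubeletClassifier(nn.Module):
    """Refines per-frame classification scores along a tubelet with an LSTM."""
    def __init__(self, feat_dim=1024, hidden_dim=512, num_classes=31):
        super().__init__()
        # The LSTM aggregates temporal context along the tubelet.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Per-frame class scores informed by the temporal context.
        self.cls_head = nn.Linear(hidden_dim, num_classes)

    def forward(self, tubelet_feats):
        # tubelet_feats: (num_tubelets, num_frames, feat_dim)
        temporal_feats, _ = self.lstm(tubelet_feats)
        return self.cls_head(temporal_feats)  # (num_tubelets, num_frames, num_classes)

# Example: 8 tubelet proposals spanning 20 frames, 1024-d ROI features per frame.
model = TubeletClassifier()
scores = model(torch.randn(8, 20, 1024))
print(scores.shape)  # torch.Size([8, 20, 31])
```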
T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation
Despite the stunning ability of recent text-to-image models to generate
high-quality images, current approaches often struggle to effectively compose
objects with different attributes and relationships into a complex and coherent
scene. We propose T2I-CompBench, a comprehensive benchmark for open-world
compositional text-to-image generation, consisting of 6,000 compositional text
prompts from 3 categories (attribute binding, object relationships, and complex
compositions) and 6 sub-categories (color binding, shape binding, texture
binding, spatial relationships, non-spatial relationships, and complex
compositions). We further propose several evaluation metrics specifically
designed to evaluate compositional text-to-image generation. We introduce a new
approach, Generative mOdel fine-tuning with Reward-driven Sample selection
(GORS), to boost the compositional text-to-image generation abilities of
pretrained text-to-image models. Extensive experiments and evaluations are
conducted to benchmark previous methods on T2I-CompBench, and to validate the
effectiveness of our proposed evaluation metrics and GORS approach. Project
page is available at https://karine-h.github.io/T2I-CompBench/.
Comment: Project page: https://karine-h.github.io/T2I-CompBench/
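As a rough illustration of the reward-driven sample selection idea behind GORS, the sketch below generates images for compositional prompts, scores each with an alignment reward, keeps only the high-reward samples, and fine-tunes with a reward-weighted loss. The callables `generate`, `reward_fn`, and `finetune_step` and the 0.8 threshold are hypothetical stand-ins, not the released implementation.

```python
from typing import Callable, List, Tuple

def select_high_reward_samples(
    prompts: List[str],
    generate: Callable[[str], object],
    reward_fn: Callable[[object, str], float],
    threshold: float = 0.8,
) -> List[Tuple[object, str, float]]:
    """Keep (image, prompt, reward) triples whose reward clears the threshold."""
    selected = []
    for prompt in prompts:
        image = generate(prompt)
        reward = reward_fn(image, prompt)
        if reward >= threshold:
            selected.append((image, prompt, reward))
    return selected

def reward_weighted_finetune(selected, finetune_step: Callable[[object, str, float], None]):
    """Fine-tune on the selected samples, scaling each update by its reward."""
    for image, prompt, reward in selected:
        finetune_step(image, prompt, reward)  # e.g. loss = reward * diffusion_loss

# Toy usage with dummy stand-ins, just to show the control flow.
if __name__ == "__main__":
    prompts = ["a red book on a blue table", "a furry cat to the left of a metal robot"]
    dummy_generate = lambda p: f"<image for '{p}'>"
    dummy_reward = lambda img, p: 0.9  # pretend the alignment scorer liked both samples
    picked = select_high_reward_samples(prompts, dummy_generate, dummy_reward)
    print(len(picked), "samples kept for reward-weighted fine-tuning")
```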
SAM3D: Segment Anything in 3D Scenes
In this work, we propose SAM3D, a novel framework that predicts masks in 3D
point clouds by leveraging the Segment-Anything Model (SAM) on RGB images,
without further training or finetuning. For a point cloud of a 3D scene
with posed RGB images, we first predict segmentation masks of RGB images with
SAM, and then project the 2D masks onto the 3D points. We then merge the 3D
masks iteratively with a bottom-up approach: at each step, the point-cloud masks
of two adjacent frames are merged with a bidirectional merging scheme. In this
way, the 3D masks predicted from different frames are
gradually merged into the 3D masks of the whole 3D scene. Finally, we can
optionally ensemble the result from our SAM3D with the over-segmentation
results based on the geometric information of the 3D scenes. We evaluate our
approach on the ScanNet dataset, and qualitative results demonstrate that our
SAM3D achieves reasonable and fine-grained 3D segmentation results without any
training or finetuning of SAM.
Comment: Technical Report. The code is released at https://github.com/Pointcept/SegmentAnything3D
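Below is a minimal sketch of the projection-and-merge idea, assuming posed RGB frames with known intrinsics: each 3D point receives the SAM mask label it projects onto, and masks from adjacent frames are merged when they cover largely the same points. The camera conventions, the simplified one-directional merge, and the 0.5 overlap threshold are assumptions for illustration; the bidirectional merging in the released code is more involved.

```python
import numpy as np

def project_masks_to_points(points, mask, K, world_to_cam):
    """points: (N,3) world coords; mask: (H,W) int labels (0 = unlabeled).
    Returns an (N,) array of per-point mask labels (0 where unobserved)."""
    N = points.shape[0]
    pts_h = np.hstack([points, np.ones((N, 1))])          # homogeneous coords
    cam = (world_to_cam @ pts_h.T).T[:, :3]               # world -> camera frame
    valid = cam[:, 2] > 0                                  # keep points in front of the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)       # perspective divide
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    H, W = mask.shape
    inside = valid & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    labels = np.zeros(N, dtype=int)
    labels[inside] = mask[v[inside], u[inside]]
    return labels

def merge_frame_masks(labels_a, labels_b, overlap_thresh=0.5):
    """Merge per-point labels of two adjacent frames: if a mask from frame B
    mostly covers the same points as a mask from frame A, reuse A's label;
    otherwise start a new scene-level mask."""
    merged = labels_b.copy()
    next_id = int(max(labels_a.max(), labels_b.max())) + 1
    for b_id in np.unique(labels_b):
        if b_id == 0:
            continue
        b_points = labels_b == b_id
        a_ids, counts = np.unique(labels_a[b_points], return_counts=True)
        keep = a_ids != 0
        a_ids, counts = a_ids[keep], counts[keep]
        if len(a_ids) and counts.max() / b_points.sum() >= overlap_thresh:
            merged[b_points] = a_ids[counts.argmax()]      # reuse frame A's mask id
        else:
            merged[b_points] = next_id                     # new scene-level mask
            next_id += 1
    return merged
```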
HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting
Realistic 3D human generation from text prompts is a desirable yet
challenging task. Existing methods optimize 3D representations like mesh or
neural fields via score distillation sampling (SDS), which suffers from
inadequate fine details or excessive training time. In this paper, we propose
an efficient yet effective framework, HumanGaussian, that generates
high-quality 3D humans with fine-grained geometry and realistic appearance. Our
key insight is that 3D Gaussian Splatting is an efficient renderer with
periodic Gaussian shrinkage or growth, where such adaptive density control can
be naturally guided by intrinsic human structures. Specifically, 1) we first
propose a Structure-Aware SDS that simultaneously optimizes human appearance
and geometry. The multi-modal score function from both RGB and depth space is
leveraged to distill the Gaussian densification and pruning process. 2)
Moreover, we devise an Annealed Negative Prompt Guidance by decomposing SDS
into a noisier generative score and a cleaner classifier score, which effectively
addresses the over-saturation issue. The floating artifacts are further
eliminated based on Gaussian size in a prune-only phase to enhance generation
smoothness. Extensive experiments demonstrate the superior efficiency and
competitive quality of our framework, rendering vivid 3D humans under diverse
scenarios. Project Page: https://alvinliu0.github.io/projects/HumanGaussian
Comment: Accepted by CVPR 2024, camera-ready version. Project Page: https://alvinliu0.github.io/projects/HumanGaussian
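The sketch below illustrates the general shape of an SDS gradient split into a cleaner classifier direction and a noisier generative direction, with a negative-prompt term applied only at large (noisy) timesteps. The noise predictor `eps_model`, the guidance weights, and the anneal threshold are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def sds_grad_with_negative_prompt(
    eps_model,                      # callable: (x_t, t, text_emb) -> predicted noise
    x_t: torch.Tensor,
    t: float,
    noise: torch.Tensor,
    cond_emb: torch.Tensor,
    uncond_emb: torch.Tensor,
    neg_emb: torch.Tensor,
    guidance_scale: float = 7.5,
    neg_scale: float = 5.0,
    anneal_t: float = 0.5,
) -> torch.Tensor:
    eps_cond = eps_model(x_t, t, cond_emb)
    eps_uncond = eps_model(x_t, t, uncond_emb)
    # "Cleaner" classifier direction: pulls the render toward the text prompt.
    classifier_score = eps_cond - eps_uncond
    # "Noisier" generative direction: the usual residual against the injected noise.
    generative_score = eps_uncond - noise
    grad = generative_score + guidance_scale * classifier_score
    # Annealed negative prompt: push away from the negative prompt only while
    # the timestep is still large, i.e. the sample is very noisy.
    if t > anneal_t:
        eps_neg = eps_model(x_t, t, neg_emb)
        grad = grad - neg_scale * (eps_neg - eps_uncond)
    return grad

# Toy usage with a dummy noise predictor on 4-channel latents, just to show shapes.
if __name__ == "__main__":
    dummy_eps = lambda x, t, emb: x * 0.0 + emb.mean()
    x_t = torch.randn(1, 4, 8, 8)
    noise = torch.randn_like(x_t)
    g = sds_grad_with_negative_prompt(
        dummy_eps, x_t, t=0.8, noise=noise,
        cond_emb=torch.ones(1), uncond_emb=torch.zeros(1), neg_emb=-torch.ones(1),
    )
    print(g.shape)  # torch.Size([1, 4, 8, 8])
```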
Drag-A-Video: Non-rigid Video Editing with Point-based Interaction
Video editing is a challenging task that requires manipulating videos in both
the spatial and temporal dimensions. Existing methods for video editing mainly
focus on changing the appearance or style of the objects in the video, while
keeping their structures unchanged. However, no existing method allows users to
interactively ``drag'' arbitrary points of instances on the first frame so that
they precisely reach target points while the remaining frames are deformed
consistently. In this paper, we propose a new diffusion-based method for
interactive point-based video manipulation, called Drag-A-Video. Our method
allows users to click pairs of handle points and target points as well as masks
on the first frame of an input video. Then, our method transforms the inputs
into point sets and propagates these sets across frames. To precisely modify
the contents of the video, we employ a new video-level motion supervision to
update the features of the video and introduce latent offsets to achieve
this update at multiple denoising timesteps. We propose a temporally consistent
point tracking module to coordinate the movement of the points in the handle
point sets. We demonstrate the effectiveness and flexibility of our method on
various videos. The website of our work is available here:
https://drag-a-video.github.io/
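As a rough illustration of the point-tracking step common to drag-style editing, the sketch below re-localizes a handle point by nearest-neighbor matching of its original feature inside a small search window of the updated feature map. This is a simplified single-frame version for illustration only; the temporally consistent module described above additionally coordinates the points across frames and is not reproduced here.

```python
import torch

def track_handle_point(feat_map, ref_feat, point, radius=3):
    """feat_map: (C, H, W) feature map after the motion-supervision update;
    ref_feat: (C,) feature of the handle point before the update;
    point: (row, col). Returns the (row, col) inside the search window whose
    feature is closest to ref_feat."""
    C, H, W = feat_map.shape
    r0, c0 = point
    rows = range(max(0, r0 - radius), min(H, r0 + radius + 1))
    cols = range(max(0, c0 - radius), min(W, c0 + radius + 1))
    best, best_dist = point, float("inf")
    for r in rows:
        for c in cols:
            dist = torch.norm(feat_map[:, r, c] - ref_feat).item()
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best

# Toy usage: track a point in a random 64-channel feature map.
fm = torch.randn(64, 32, 32)
new_pt = track_handle_point(fm, fm[:, 10, 10].clone(), (10, 10))
print(new_pt)  # (10, 10), since the reference feature is taken from that location
```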