Object Detection in Videos with Tubelet Proposal Networks
Object detection in videos has drawn increasing attention recently with the
introduction of the large-scale ImageNet VID dataset. Unlike object detection in
static images, object detection in videos can exploit temporal information, which
is vital for accuracy. To fully utilize this temporal information, state-of-the-art methods are
based on spatiotemporal tubelets, which are essentially sequences of associated
bounding boxes across time. However, existing methods have major limitations in
the quality and efficiency of tubelet generation. Motion-based methods can obtain
dense tubelets efficiently, but the resulting tubelets are generally only a few
frames long, which is suboptimal for incorporating long-term temporal information.
Appearance-based methods, which usually rely on generic object tracking, can
generate long tubelets but are computationally expensive. In this work, we propose a framework for
object detection in videos, which consists of a novel tubelet proposal network
to efficiently generate spatiotemporal proposals, and a Long Short-term Memory
(LSTM) network that incorporates temporal information from tubelet proposals
for achieving high object detection accuracy in videos. Experiments on the
large-scale ImageNet VID dataset demonstrate the effectiveness of the proposed
framework for object detection in videos.
Comment: CVPR 2017
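To make the second stage of this design concrete, below is a minimal sketch of an LSTM that reads per-frame features along a tubelet proposal and outputs temporally refined classification scores. The module, feature dimension, and class count (assumed here to be 30 classes plus background) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TubeletClassifier(nn.Module):
    """Refines per-frame classification scores along a tubelet with an LSTM."""
    def __init__(self, feat_dim=1024, hidden_dim=512, num_classes=31):
        super().__init__()
        # The LSTM aggregates temporal context along the tubelet.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Per-frame class scores informed by the temporal context.
        self.cls_head = nn.Linear(hidden_dim, num_classes)

    def forward(self, tubelet_feats):
        # tubelet_feats: (num_tubelets, num_frames, feat_dim)
        temporal_feats, _ = self.lstm(tubelet_feats)
        return self.cls_head(temporal_feats)  # (num_tubelets, num_frames, num_classes)

# Example: 8 tubelet proposals spanning 20 frames, 1024-d ROI features per frame.
model = TubeletClassifier()
scores = model(torch.randn(8, 20, 1024))
print(scores.shape)  # torch.Size([8, 20, 31])
```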
T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation
Despite the stunning ability of recent text-to-image models to generate
high-quality images, current approaches often struggle to effectively compose
objects with different attributes and relationships into a complex and coherent
scene. We propose T2I-CompBench, a comprehensive benchmark for open-world
compositional text-to-image generation, consisting of 6,000 compositional text
prompts from 3 categories (attribute binding, object relationships, and complex
compositions) and 6 sub-categories (color binding, shape binding, texture
binding, spatial relationships, non-spatial relationships, and complex
compositions). We further propose several evaluation metrics specifically
designed to evaluate compositional text-to-image generation. We introduce a new
approach, Generative mOdel fine-tuning with Reward-driven Sample selection
(GORS), to boost the compositional text-to-image generation abilities of
pretrained text-to-image models. Extensive experiments and evaluations are
conducted to benchmark previous methods on T2I-CompBench, and to validate the
effectiveness of our proposed evaluation metrics and GORS approach. Project
page is available at https://karine-h.github.io/T2I-CompBench/.
Comment: Project page: https://karine-h.github.io/T2I-CompBench/
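As a rough illustration of the reward-driven sample selection idea behind GORS, the sketch below generates images for compositional prompts, scores each with an alignment reward, keeps only the high-reward samples, and fine-tunes with a reward-weighted loss. The callables `generate`, `reward_fn`, and `finetune_step` and the 0.8 threshold are hypothetical stand-ins, not the released implementation.

```python
from typing import Callable, List, Tuple

def select_high_reward_samples(
    prompts: List[str],
    generate: Callable[[str], object],
    reward_fn: Callable[[object, str], float],
    threshold: float = 0.8,
) -> List[Tuple[object, str, float]]:
    """Keep (image, prompt, reward) triples whose reward clears the threshold."""
    selected = []
    for prompt in prompts:
        image = generate(prompt)
        reward = reward_fn(image, prompt)
        if reward >= threshold:
            selected.append((image, prompt, reward))
    return selected

def reward_weighted_finetune(selected, finetune_step: Callable[[object, str, float], None]):
    """Fine-tune on the selected samples, scaling each update by its reward."""
    for image, prompt, reward in selected:
        finetune_step(image, prompt, reward)  # e.g. loss = reward * diffusion_loss

# Toy usage with dummy stand-ins, just to show the control flow.
if __name__ == "__main__":
    prompts = ["a red book on a blue table", "a furry cat to the left of a metal robot"]
    dummy_generate = lambda p: f"<image for '{p}'>"
    dummy_reward = lambda img, p: 0.9  # pretend the alignment scorer liked both samples
    picked = select_high_reward_samples(prompts, dummy_generate, dummy_reward)
    print(len(picked), "samples kept for reward-weighted fine-tuning")
```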
SAM3D: Segment Anything in 3D Scenes
In this work, we propose SAM3D, a novel framework that predicts masks in 3D
point clouds by leveraging the Segment-Anything Model (SAM) on RGB images,
without further training or finetuning. For a point cloud of a 3D scene
with posed RGB images, we first predict segmentation masks of RGB images with
SAM, and then project the 2D masks onto the 3D points. We then merge the 3D
masks iteratively with a bottom-up approach: at each step, the point-cloud masks
of two adjacent frames are merged with a bidirectional merging scheme. In this
way, the 3D masks predicted from different frames are
gradually merged into the 3D masks of the whole 3D scene. Finally, we can
optionally ensemble the result from our SAM3D with the over-segmentation
results based on the geometric information of the 3D scenes. We evaluate our
approach on the ScanNet dataset, and qualitative results demonstrate that our
SAM3D achieves reasonable and fine-grained 3D segmentation results without any
training or finetuning of SAM.
Comment: Technical Report. The code is released at https://github.com/Pointcept/SegmentAnything3D
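Below is a minimal sketch of the projection-and-merge idea, assuming posed RGB frames with known intrinsics: each 3D point receives the SAM mask label it projects onto, and masks from adjacent frames are merged when they cover largely the same points. The camera conventions, the simplified one-directional merge, and the 0.5 overlap threshold are assumptions for illustration; the bidirectional merging in the released code is more involved.

```python
import numpy as np

def project_masks_to_points(points, mask, K, world_to_cam):
    """points: (N,3) world coords; mask: (H,W) int labels (0 = unlabeled).
    Returns an (N,) array of per-point mask labels (0 where unobserved)."""
    N = points.shape[0]
    pts_h = np.hstack([points, np.ones((N, 1))])          # homogeneous coords
    cam = (world_to_cam @ pts_h.T).T[:, :3]               # world -> camera frame
    valid = cam[:, 2] > 0                                  # keep points in front of the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)       # perspective divide
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    H, W = mask.shape
    inside = valid & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    labels = np.zeros(N, dtype=int)
    labels[inside] = mask[v[inside], u[inside]]
    return labels

def merge_frame_masks(labels_a, labels_b, overlap_thresh=0.5):
    """Merge per-point labels of two adjacent frames: if a mask from frame B
    mostly covers the same points as a mask from frame A, reuse A's label;
    otherwise start a new scene-level mask."""
    merged = labels_b.copy()
    next_id = int(max(labels_a.max(), labels_b.max())) + 1
    for b_id in np.unique(labels_b):
        if b_id == 0:
            continue
        b_points = labels_b == b_id
        a_ids, counts = np.unique(labels_a[b_points], return_counts=True)
        keep = a_ids != 0
        a_ids, counts = a_ids[keep], counts[keep]
        if len(a_ids) and counts.max() / b_points.sum() >= overlap_thresh:
            merged[b_points] = a_ids[counts.argmax()]      # reuse frame A's mask id
        else:
            merged[b_points] = next_id                     # new scene-level mask
            next_id += 1
    return merged
```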
HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting
Realistic 3D human generation from text prompts is a desirable yet
challenging task. Existing methods optimize 3D representations like mesh or
neural fields via score distillation sampling (SDS), which suffers from
inadequate fine details or excessive training time. In this paper, we propose
an efficient yet effective framework, HumanGaussian, that generates
high-quality 3D humans with fine-grained geometry and realistic appearance. Our
key insight is that 3D Gaussian Splatting is an efficient renderer with
periodic Gaussian shrinkage or growth, where such adaptive density control can
be naturally guided by intrinsic human structures. Specifically, 1) we first
propose a Structure-Aware SDS that simultaneously optimizes human appearance
and geometry. The multi-modal score function from both RGB and depth space is
leveraged to distill the Gaussian densification and pruning process. 2)
Moreover, we devise an Annealed Negative Prompt Guidance by decomposing SDS
into a noisier generative score and a cleaner classifier score, which effectively
addresses the over-saturation issue. The floating artifacts are further
eliminated based on Gaussian size in a prune-only phase to enhance generation
smoothness. Extensive experiments demonstrate the superior efficiency and
competitive quality of our framework, rendering vivid 3D humans under diverse
scenarios. Project Page: https://alvinliu0.github.io/projects/HumanGaussian
Comment: Accepted by CVPR 2024, camera-ready version. Project Page: https://alvinliu0.github.io/projects/HumanGaussian
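The sketch below illustrates the general shape of an SDS gradient split into a cleaner classifier direction and a noisier generative direction, with a negative-prompt term applied only at large (noisy) timesteps. The noise predictor `eps_model`, the guidance weights, and the anneal threshold are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def sds_grad_with_negative_prompt(
    eps_model,                      # callable: (x_t, t, text_emb) -> predicted noise
    x_t: torch.Tensor,
    t: float,
    noise: torch.Tensor,
    cond_emb: torch.Tensor,
    uncond_emb: torch.Tensor,
    neg_emb: torch.Tensor,
    guidance_scale: float = 7.5,
    neg_scale: float = 5.0,
    anneal_t: float = 0.5,
) -> torch.Tensor:
    eps_cond = eps_model(x_t, t, cond_emb)
    eps_uncond = eps_model(x_t, t, uncond_emb)
    # "Cleaner" classifier direction: pulls the render toward the text prompt.
    classifier_score = eps_cond - eps_uncond
    # "Noisier" generative direction: the usual residual against the injected noise.
    generative_score = eps_uncond - noise
    grad = generative_score + guidance_scale * classifier_score
    # Annealed negative prompt: push away from the negative prompt only while
    # the timestep is still large, i.e. the sample is very noisy.
    if t > anneal_t:
        eps_neg = eps_model(x_t, t, neg_emb)
        grad = grad - neg_scale * (eps_neg - eps_uncond)
    return grad

# Toy usage with a dummy noise predictor on 4-channel latents, just to show shapes.
if __name__ == "__main__":
    dummy_eps = lambda x, t, emb: x * 0.0 + emb.mean()
    x_t = torch.randn(1, 4, 8, 8)
    noise = torch.randn_like(x_t)
    g = sds_grad_with_negative_prompt(
        dummy_eps, x_t, t=0.8, noise=noise,
        cond_emb=torch.ones(1), uncond_emb=torch.zeros(1), neg_emb=-torch.ones(1),
    )
    print(g.shape)  # torch.Size([1, 4, 8, 8])
```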
Drag-A-Video: Non-rigid Video Editing with Point-based Interaction
Video editing is a challenging task that requires manipulating videos in both
the spatial and temporal dimensions. Existing methods for video editing mainly
focus on changing the appearance or style of the objects in the video, while
keeping their structures unchanged. However, no existing method allows users to
interactively ``drag'' arbitrary points of instances on the first frame so that
they precisely reach target points while the remaining frames are deformed
consistently. In this paper, we propose a new diffusion-based method for
interactive point-based video manipulation, called Drag-A-Video. Our method
allows users to click pairs of handle points and target points as well as masks
on the first frame of an input video. Then, our method transforms the inputs
into point sets and propagates these sets across frames. To precisely modify
the contents of the video, we employ a new video-level motion supervision to
update the features of the video and introduce latent offsets to achieve
this update at multiple denoising timesteps. We propose a temporally consistent
point tracking module to coordinate the movement of the points in the handle
point sets. We demonstrate the effectiveness and flexibility of our method on
various videos. The website of our work is available here:
https://drag-a-video.github.io/
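As a rough illustration of the point-tracking step common to drag-style editing, the sketch below re-localizes a handle point by nearest-neighbor matching of its original feature inside a small search window of the updated feature map. This is a simplified single-frame version for illustration only; the temporally consistent module described above additionally coordinates the points across frames and is not reproduced here.

```python
import torch

def track_handle_point(feat_map, ref_feat, point, radius=3):
    """feat_map: (C, H, W) feature map after the motion-supervision update;
    ref_feat: (C,) feature of the handle point before the update;
    point: (row, col). Returns the (row, col) inside the search window whose
    feature is closest to ref_feat."""
    C, H, W = feat_map.shape
    r0, c0 = point
    rows = range(max(0, r0 - radius), min(H, r0 + radius + 1))
    cols = range(max(0, c0 - radius), min(W, c0 + radius + 1))
    best, best_dist = point, float("inf")
    for r in rows:
        for c in cols:
            dist = torch.norm(feat_map[:, r, c] - ref_feat).item()
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best

# Toy usage: track a point in a random 64-channel feature map.
fm = torch.randn(64, 32, 32)
new_pt = track_handle_point(fm, fm[:, 10, 10].clone(), (10, 10))
print(new_pt)  # (10, 10), since the reference feature is taken from that location
```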