Training-Free Semantic Video Composition via Pre-trained Diffusion Model
The video composition task aims to integrate specified foregrounds and
backgrounds from different videos into a harmonious composite. Current
approaches, predominantly trained on videos with adjusted foreground color and
lighting, struggle to address deep semantic disparities beyond superficial
adjustments, such as domain gaps. Therefore, we propose a training-free
pipeline employing a pre-trained diffusion model imbued with semantic prior
knowledge, which can process composite videos with broader semantic
disparities. Specifically, we process the video frames in a cascading manner
and handle each frame in two processes with the diffusion model. In the
inversion process, we propose Balanced Partial Inversion to obtain initial points
for generation that balance reversibility and modifiability. Then, in the
generation process, we further propose Inter-Frame Augmented attention to
augment foreground continuity across frames. Experimental results reveal that
our pipeline successfully ensures the visual harmony and inter-frame coherence
of the outputs, demonstrating efficacy in managing broader semantic
disparities.
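To make the inversion-then-generation idea concrete, the following is a minimal sketch of partial DDIM inversion: a composite frame is inverted only up to an intermediate timestep rather than to pure noise, which keeps the frame's layout recoverable while leaving room for the diffusion prior to harmonize it. This is a simplified illustration, not the paper's Balanced Partial Inversion itself; `eps_model`, `alphas_cumprod`, and the choice of `t_partial` are assumed placeholders.

import torch

@torch.no_grad()
def partial_ddim_invert(x0, eps_model, alphas_cumprod, t_partial):
    """Deterministic DDIM inversion from step 0 up to an intermediate step t_partial.

    Assumptions (not from the paper): eps_model(x, t) is a generic
    epsilon-prediction denoiser; alphas_cumprod is its cumulative noise schedule.
    """
    x = x0
    for t in range(t_partial):
        a_t, a_next = alphas_cumprod[t], alphas_cumprod[t + 1]
        eps = eps_model(x, t)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        # Move one step toward noise while staying on the deterministic DDIM trajectory.
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x  # partially noised latent: reversible, yet still modifiable

@torch.no_grad()
def generate_from_partial(x_t, eps_model, alphas_cumprod, t_partial):
    """Deterministic DDIM sampling back from t_partial to a clean, harmonized frame."""
    x = x_t
    for t in reversed(range(1, t_partial + 1)):
        a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        eps = eps_model(x, t)
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
    return x

A smaller `t_partial` favors reversibility (the output stays close to the composite); a larger one favors modifiability (the prior can correct deeper semantic disparities).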
MotionZero: Exploiting Motion Priors for Zero-shot Text-to-Video Generation
Zero-shot Text-to-Video synthesis generates videos from prompts without using any
video data. Without motion information from videos, the motion priors implied in
prompts are vital guidance. For example, the prompt "airplane landing on the
runway" indicates motion priors that the "airplane" moves downwards while the
"runway" stays static. Whereas the motion priors are not fully exploited in
previous approaches, thus leading to two nontrivial issues: 1) the motion
variation pattern remains unaltered and prompt-agnostic for disregarding motion
priors; 2) the motion control of different objects is inaccurate and entangled
without considering the independent motion priors of different objects. To
tackle the two issues, we propose a prompt-adaptive and disentangled motion
control strategy coined as MotionZero, which derives the motion priors of different
objects from prompts via Large Language Models and accordingly applies disentangled
motion control to the corresponding region of each object.
Furthermore, to facilitate videos with varying degrees of motion amplitude, we
propose a Motion-Aware Attention scheme which adjusts attention among frames by
motion amplitude. Extensive experiments demonstrate that our strategy correctly
controls the motion of different objects and supports versatile applications,
including zero-shot video editing.
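As a rough illustration of prompt-derived, per-object motion priors, the sketch below parses an LLM's answer into per-object motion directions and turns them into independently shifted regions frame by frame. The prompt format, the `query_llm` call, the direction vocabulary, and the box-shifting step size are all assumptions for illustration; the actual MotionZero region assignment and attention control differ.

import json

def parse_motion_priors(llm_answer: str) -> dict:
    """e.g. '{"airplane": "down", "runway": "static"}' -> {"airplane": (0, 1), ...}"""
    direction_to_shift = {"up": (0, -1), "down": (0, 1), "left": (-1, 0),
                          "right": (1, 0), "static": (0, 0)}
    return {obj: direction_to_shift.get(d, (0, 0))
            for obj, d in json.loads(llm_answer).items()}

def per_frame_boxes(init_boxes: dict, priors: dict, num_frames: int, step: int = 8):
    """Shift each object's box independently (disentangled control), frame by frame."""
    frames = []
    for f in range(num_frames):
        boxes = {}
        for obj, (x0, y0, x1, y1) in init_boxes.items():
            dx, dy = priors.get(obj, (0, 0))
            boxes[obj] = (x0 + dx * step * f, y0 + dy * step * f,
                          x1 + dx * step * f, y1 + dy * step * f)
        frames.append(boxes)
    return frames

# Usage (hypothetical LLM answer for "airplane landing on the runway"):
# priors = parse_motion_priors('{"airplane": "down", "runway": "static"}')
# boxes = per_frame_boxes({"airplane": (100, 20, 220, 80),
#                          "runway": (0, 200, 512, 320)}, priors, num_frames=16)

Because each object's region follows only its own prior, the moving "airplane" and the static "runway" are controlled separately rather than entangled.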
Make-A-Storyboard: A General Framework for Storyboard with Disentangled and Merged Control
Story Visualization aims to generate images aligned with story prompts,
reflecting the coherence of storybooks through visual consistency among
characters and scenes. However, current approaches concentrate exclusively on
characters and neglect the visual consistency among contextually correlated
scenes, resulting in independent character images without inter-image
coherence. To tackle this issue, we propose a new presentation form for Story
Visualization called Storyboard, inspired by film-making, as illustrated in
Fig. 1. Specifically, a Storyboard unfolds a story into visual representations
scene by scene. Within each scene in Storyboard, characters engage in
activities at the same location, necessitating both visually consistent scenes
and characters. For Storyboard, we design a general framework coined as
Make-A-Storyboard that applies disentangled control over the consistency of
contextually correlated characters and scenes and then merges them to form
harmonized images. Extensive experiments demonstrate: 1) Effectiveness: the
method is effective in story alignment, character consistency, and scene
correlation; 2) Generalization: our method can be seamlessly integrated into
mainstream Image Customization methods, empowering them with the capability of
story visualization.
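To illustrate the "disentangled then merged" control in its simplest form, the sketch below keeps one scene latent fixed across all panels of a storyboard scene and blends a separately controlled character latent into it with a mask. This is only an assumed simplification: in Make-A-Storyboard the merging happens within the diffusion process rather than as a single post-hoc blend, and the tensor names here are placeholders.

import torch

def merge_character_into_scene(scene_latent: torch.Tensor,
                               character_latent: torch.Tensor,
                               character_mask: torch.Tensor) -> torch.Tensor:
    """Blend latents so the shared scene stays consistent across panels while the
    character region follows its own (disentangled) control signal.

    All tensors are assumed to share the same shape, e.g. (C, H, W), with the
    mask in [0, 1].
    """
    return character_mask * character_latent + (1.0 - character_mask) * scene_latent

# Usage: reuse one scene latent for every panel of the same scene, and supply a
# fresh character latent and mask per panel.
# panel = merge_character_into_scene(scene, char, mask)

Reusing the same scene latent gives the inter-image scene consistency the abstract calls for, while the per-panel character latent preserves character-level control.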