Text-based Editing of Talking-head Video
Editing talking-head video to change the speech content or to remove filler words is challenging. We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e. no jump cuts). Our method automatically annotates an input talking-head video with phonemes, visemes, 3D face pose and geometry, reflectance, expression, and scene illumination per frame. To edit a video, the user only has to edit the transcript, and an optimization strategy then chooses segments of the input corpus as base material. The annotated parameters corresponding to the selected segments are seamlessly stitched together and used to produce an intermediate video representation in which the lower half of the face is rendered with a parametric face model. Finally, a recurrent video generation network transforms this representation into a photorealistic video that matches the edited transcript. We demonstrate a large variety of edits, such as the addition, removal, and alteration of words, as well as convincing language translation and full sentence synthesis.
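The segment-selection step described above can be illustrated with a toy sketch: given the phoneme sequence of an edited word, pick the annotated corpus segment whose phonemes match it most closely. This is only an illustration under simplifying assumptions (plain edit distance over phoneme labels), not the authors' actual viseme-aware optimization; `edit_distance` and `best_segment` are hypothetical names.

```python
def edit_distance(a: list, b: list) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def best_segment(target: list, segments: list) -> list:
    """Pick the annotated corpus segment whose phonemes best match the target."""
    return min(segments, key=lambda seg: edit_distance(target, seg))
```

For example, matching the target phonemes of "hello" against two candidate segments would select the one differing in a single phoneme over an unrelated word.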
Soundify: Matching Sound Effects to Video
In the art of video editing, sound helps add character to an object and
immerse the viewer within a space. Through formative interviews with
professional editors (N=10), we found that the task of adding sounds to video
can be challenging. This paper presents Soundify, a system that assists editors
in matching sounds to video. Given a video, Soundify identifies matching
sounds, synchronizes the sounds to the video, and dynamically adjusts panning
and volume to create spatial audio. In a human evaluation study (N=889), we
show that Soundify is capable of matching sounds to video out-of-the-box for a
diverse range of audio categories. In a within-subjects expert study (N=12), we
demonstrate the usefulness of Soundify in helping video editors match sounds to
video with lighter workload, reduced task completion time, and improved
usability.
Comment: Full paper in UIST 2023; Short paper in NeurIPS 2021 ML4CD Workshop;
Online demo: http://soundify.c
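The spatial-audio step described above (adjusting panning from a sound source's on-screen position) can be sketched with a standard constant-power pan law. This is a minimal illustration of the general technique, not Soundify's implementation; the function names and the linear position-to-pan mapping are assumptions.

```python
import math

def pan_from_position(x: float, frame_width: float) -> float:
    """Map a source's horizontal pixel position to a pan value in [-1, 1]."""
    return 2.0 * (x / frame_width) - 1.0

def constant_power_pan(sample: float, pan: float) -> tuple[float, float]:
    """Pan a mono sample to stereo with a constant-power law.

    pan ranges from -1.0 (full left) to 1.0 (full right); the squared
    channel gains always sum to 1, keeping perceived loudness steady.
    """
    theta = (pan + 1.0) * math.pi / 4  # map [-1, 1] -> [0, pi/2]
    return sample * math.cos(theta), sample * math.sin(theta)
```

A source centered in the frame yields pan 0.0, so both channels receive the sample scaled by cos(pi/4) ≈ 0.707 rather than a naive 0.5, which would dip the loudness at center.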
Generative Disco: Text-to-Video Generation for Music Visualization
Visuals are a core part of our experience of music, owing to the way they can
amplify the emotions and messages conveyed through the music. However, creating
music visualization is a complex, time-consuming, and resource-intensive
process. We introduce Generative Disco, a generative AI system that helps
generate music visualizations with large language models and text-to-image
models. Users select intervals of music to visualize and then parameterize that
visualization by defining start and end prompts. The system interpolates
between these prompts in time with the beat of the music to produce
audio-reactive video. We introduce design patterns for improving generated
videos:
"transitions", which express shifts in color, time, subject, or style, and
"holds", which encourage visual emphasis and consistency. A study with
professionals showed that the system was enjoyable, easy to explore, and highly
expressive. We conclude with use cases of Generative Disco for professionals and
how AI-generated content is changing the landscape of creative work.
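The beat-timed interpolation described above can be sketched as follows: derive beat times from the track's tempo, then blend the start- and end-prompt embeddings with a weight that advances across the interval. This is a simplified sketch under stated assumptions (a fixed BPM and plain linear blending of embedding vectors), not Generative Disco's pipeline; all function names are hypothetical.

```python
def beat_times(bpm: float, duration: float) -> list[float]:
    """Beat timestamps (seconds) for a track of the given tempo and length."""
    period = 60.0 / bpm
    times, t = [], 0.0
    while t < duration:
        times.append(t)
        t += period
    return times

def interp_weight(t: float, t0: float, t1: float) -> float:
    """Linear weight from the start prompt (0.0) to the end prompt (1.0)."""
    return min(max((t - t0) / (t1 - t0), 0.0), 1.0)

def blend_prompt_embeddings(e0: list[float], e1: list[float], w: float) -> list[float]:
    """Weighted blend of two prompt embedding vectors."""
    return [a * (1 - w) + b * w for a, b in zip(e0, e1)]
```

Generating one frame per beat at the blended embedding would produce a video whose visual content drifts from the start prompt to the end prompt in step with the music.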