In recent years, video generation has become a prominent generative tool and
has drawn significant attention. However, audio-to-video generation has received
little consideration, even though audio carries unique qualities such as temporal
semantics and magnitude. Hence, we propose The Power of Sound (TPoS) model, which
incorporates audio input that includes both time-varying semantics and
magnitude. To generate video frames, TPoS utilizes a latent stable diffusion
model with textual semantic information, which is then guided by the sequential
audio embedding from our pretrained Audio Encoder. As a result, this method
produces audio-reactive video content. We demonstrate the effectiveness of
TPoS across various tasks and compare its results with current state-of-the-art
techniques in the field of audio-to-video generation. More examples are
available at https://ku-vai.github.io/TPoS/
Comment: ICCV202