Deep Cross-Modal Audio-Visual Generation
Cross-modal audio-visual perception has been a long-standing topic in psychology and neurology, and numerous studies have found strong correlations in human perception of auditory and visual stimuli. Despite work in computational multimodal modeling, the problem of cross-modal audio-visual generation has not been systematically studied in the literature. In this paper, we make the first attempt to solve this cross-modal generation problem by leveraging the power of deep generative adversarial training. Specifically, we use conditional generative adversarial networks to achieve cross-modal audio-visual generation of musical performances. We explore different encoding methods for audio and visual signals, and work on two scenarios: instrument-oriented generation and pose-oriented generation. Being the first to explore this new problem, we compose two new datasets with pairs of images and sounds of musical performances of different instruments. Our experiments using both classification and human evaluation demonstrate that our model is able to generate one modality (audio or visual) from the other to a good extent. Our experiments on various design choices, along with the datasets, will facilitate future research in this new problem space.
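The abstract above describes generating one modality conditioned on the other with a conditional GAN. Below is a minimal PyTorch sketch of that general setup, assuming (hypothetically) that each modality has already been encoded as a fixed-size feature vector; the class names, layer sizes, and dimensions are illustrative assumptions, not the paper's architecture.

# Minimal conditional-GAN sketch for cross-modal generation (illustrative only).
import torch
import torch.nn as nn

class CrossModalGenerator(nn.Module):
    """Maps noise plus a source-modality encoding to a target-modality feature."""
    def __init__(self, noise_dim=100, cond_dim=128, out_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim + cond_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, out_dim), nn.Tanh(),
        )

    def forward(self, z, cond):
        return self.net(torch.cat([z, cond], dim=1))

class CrossModalDiscriminator(nn.Module):
    """Scores whether a target-modality sample matches the source-modality condition."""
    def __init__(self, in_dim=1024, cond_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim + cond_dim, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 1), nn.Sigmoid(),
        )

    def forward(self, x, cond):
        return self.net(torch.cat([x, cond], dim=1))

# One adversarial step with standard GAN losses (placeholder tensors stand in
# for real audio features and image encodings).
G, D = CrossModalGenerator(), CrossModalDiscriminator()
bce = nn.BCELoss()
real_audio = torch.randn(8, 1024)
image_cond = torch.randn(8, 128)
z = torch.randn(8, 100)
fake_audio = G(z, image_cond)
d_loss = bce(D(real_audio, image_cond), torch.ones(8, 1)) + \
         bce(D(fake_audio.detach(), image_cond), torch.zeros(8, 1))
g_loss = bce(D(fake_audio, image_cond), torch.ones(8, 1))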
VGM-RNN: Recurrent Neural Networks for Video Game Music Generation
The recent explosion of interest in deep neural networks has affected, and in some cases reinvigorated, work in fields as diverse as natural language processing, image recognition, and speech recognition. For sequence learning tasks, recurrent neural networks, and in particular LSTM-based networks, have shown promising results. Recently there has been interest (for example, in the research by Google's Magenta team) in applying so-called "language modeling" recurrent neural networks to musical tasks, including the automatic generation of original music. In this work we demonstrate our own LSTM-based recurrent network for music language modeling. We show that it is able to learn musical features from a MIDI dataset and generate output that is musically interesting while exhibiting features of melody, harmony, and rhythm. We source our dataset from VGMusic.com, a collection of user-submitted MIDI transcriptions of video game songs, and attempt to generate output that emulates this kind of music.
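A "music language model" of this kind treats a MIDI file as a sequence of discrete events and trains a recurrent network to predict the next event from the prefix. The sketch below, again in PyTorch, shows that general pattern under assumed values; the vocabulary size, layer widths, and tokenization are illustrative and not the VGM-RNN configuration.

# Minimal LSTM next-event model over tokenized MIDI (illustrative only).
import torch
import torch.nn as nn

class MusicLSTM(nn.Module):
    def __init__(self, vocab_size=388, embed_dim=256, hidden_dim=512, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        x = self.embed(tokens)                 # (batch, time, embed_dim)
        out, state = self.lstm(x, state)
        return self.head(out), state           # next-event logits at each step

# Training: predict each event from its prefix (teacher forcing).
model = MusicLSTM()
seq = torch.randint(0, 388, (4, 64))           # placeholder event sequences
logits, _ = model(seq[:, :-1])
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 388), seq[:, 1:].reshape(-1))

# Generation: sample one event at a time from the model's distribution.
token, state, generated = seq[:1, :1], None, []
for _ in range(32):
    logits, state = model(token, state)
    token = torch.multinomial(torch.softmax(logits[:, -1], dim=-1), 1)
    generated.append(token.item())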
Visual to Sound: Generating Natural Sound for Videos in the Wild
As two of the five traditional human senses (sight, hearing, taste, smell, and touch), vision and sound are basic sources through which humans understand the world. Often correlated during natural events, these two modalities combine to jointly affect human perception. In this paper, we pose the task of generating sound given visual input. Such capabilities could help enable applications in virtual reality (generating sound for virtual scenes automatically) or provide additional accessibility to images or videos for people with visual impairments. As a first step in this direction, we apply learning-based methods to generate raw waveform samples given input video frames. We evaluate our models on a dataset of videos containing a variety of sounds (such as ambient sounds and sounds from people/animals). Our experiments show that the generated sounds are fairly realistic and have good temporal synchronization with the visual inputs.
Project page: http://bvision11.cs.unc.edu/bigpen/yipin/visual2sound_webpage/visual2sound.htm
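The core idea, generating raw waveform samples from input video frames, can be sketched as a per-frame visual encoder feeding a recurrent model that regresses a block of audio samples per frame. The PyTorch sketch below illustrates that pipeline; the layer sizes, frame/sample rates, and class names are assumptions for illustration, not the paper's exact model.

# Minimal frames-to-waveform sketch (illustrative only).
import torch
import torch.nn as nn

class Video2Sound(nn.Module):
    def __init__(self, feat_dim=256, hidden_dim=512, samples_per_frame=735):
        super().__init__()
        # samples_per_frame ~ 22050 Hz audio / 30 fps video (an assumption)
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.to_wave = nn.Linear(hidden_dim, samples_per_frame)

    def forward(self, frames):
        # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.frame_encoder(frames.flatten(0, 1)).view(b, t, -1)
        hidden, _ = self.rnn(feats)
        return self.to_wave(hidden).flatten(1)   # (batch, time * samples_per_frame)

model = Video2Sound()
video = torch.randn(2, 16, 3, 64, 64)            # placeholder clip: 16 frames
waveform = model(video)                          # (2, 16 * 735) raw samples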