188 research outputs found
Neural Concatenative Singing Voice Conversion: Rethinking Concatenation-Based Approach for One-Shot Singing Voice Conversion
Any-to-any singing voice conversion (SVC) is confronted with the
``timbre leakage'' issue caused by inadequate disentanglement between the
content and the speaker timbre. To address this issue, this study introduces
NeuCoSVC, a novel neural concatenative SVC framework. It consists of a
self-supervised learning (SSL) representation extractor, a neural harmonic
signal generator, and a waveform synthesizer. The SSL extractor condenses audio
into fixed-dimensional SSL features, while the harmonic signal generator
leverages linear time-varying filters to produce both raw and filtered harmonic
signals for pitch information. The synthesizer reconstructs waveforms using SSL
features, harmonic signals, and loudness information. During inference, voice
conversion is performed by substituting source SSL features with their nearest
counterparts from a matching pool which comprises SSL features extracted from
the reference audio, while preserving raw harmonic signals and loudness from
the source audio. By directly utilizing SSL features from the reference audio,
the proposed framework effectively resolves the ``timbre leakage'' issue
that afflicts previous disentanglement-based approaches. Experimental results
demonstrate that the proposed NeuCoSVC system outperforms the
disentanglement-based speaker embedding approach in one-shot SVC across
intra-language, cross-language, and cross-domain evaluations.
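To make the substitution step concrete, below is a minimal sketch of replacing each source SSL frame with its nearest counterpart from the reference matching pool. The function name, the use of cosine similarity, and single-neighbour matching are illustrative assumptions; the paper's actual metric and pooling strategy may differ.

```python
import numpy as np

def knn_substitute(source_feats: np.ndarray, pool_feats: np.ndarray) -> np.ndarray:
    """Replace each source SSL frame with its nearest neighbour
    (by cosine similarity) from the reference matching pool.

    source_feats: (T_src, D) SSL features of the source utterance.
    pool_feats:   (T_ref, D) SSL features extracted from the reference audio.
    """
    # L2-normalise rows so the dot product equals cosine similarity.
    src = source_feats / np.linalg.norm(source_feats, axis=1, keepdims=True)
    pool = pool_feats / np.linalg.norm(pool_feats, axis=1, keepdims=True)
    # (T_src, T_ref) similarity matrix; argmax picks the closest pool frame.
    idx = (src @ pool.T).argmax(axis=1)
    return pool_feats[idx]
```

The substituted features would then be fed to the waveform synthesizer together with the raw harmonic signals and loudness preserved from the source audio.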
Visual to Sound: Generating Natural Sound for Videos in the Wild
As two of the five traditional human senses (sight, hearing, taste, smell,
and touch), vision and sound are basic sources through which humans understand
the world. Often correlated during natural events, these two modalities combine
to jointly affect human perception. In this paper, we pose the task of
generating sound given visual input. Such capabilities could help enable
applications in virtual reality (generating sound for virtual scenes
automatically) or provide additional accessibility to images or videos for
people with visual impairments. As a first step in this direction, we apply
learning-based methods to generate raw waveform samples given input video
frames. We evaluate our models on a dataset of videos containing a variety of
sounds (such as ambient sounds and sounds from people/animals). Our experiments
show that the generated sounds are fairly realistic and have good temporal
synchronization with the visual inputs.
Comment: Project page:
http://bvision11.cs.unc.edu/bigpen/yipin/visual2sound_webpage/visual2sound.htm
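As a rough illustration of frame-conditioned waveform generation, the sketch below encodes per-frame visual features with a recurrent network and maps each frame's context to a chunk of raw samples. All names and sizes are hypothetical, and the simple regression decoder stands in for the autoregressive sample-level models such papers typically use.

```python
import torch
import torch.nn as nn

class FramesToWaveform(nn.Module):
    """Hypothetical sketch: GRU over video-frame features, then a linear
    head that emits each frame's chunk of raw waveform samples."""
    def __init__(self, frame_dim=512, hidden=256, samples_per_frame=735):
        super().__init__()
        self.encoder = nn.GRU(frame_dim, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, samples_per_frame)

    def forward(self, frame_feats):             # (B, T_frames, frame_dim)
        ctx, _ = self.encoder(frame_feats)      # (B, T_frames, hidden)
        chunks = torch.tanh(self.decoder(ctx))  # samples in [-1, 1]
        return chunks.flatten(1)                # (B, T_frames * samples_per_frame)

# e.g. 16 kHz audio at ~21.8 fps video gives roughly 735 samples per frame.
model = FramesToWaveform()
wave = model(torch.randn(2, 30, 512))  # two clips, 30 frames each
```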
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 pdf figures
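Since the review singles out log-mel spectra as a dominant feature representation, here is a minimal sketch of computing them with librosa. The parameter values (n_fft, hop length, 80 mel bands) are typical defaults assumed for illustration, not taken from the article.

```python
import numpy as np
import librosa

# Load an example clip and compute its log-mel spectrogram.
y, sr = librosa.load(librosa.ex('trumpet'), sr=16000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel, ref=np.max)  # (80, T) matrix in dB
print(log_mel.shape)
```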
Singing voice resynthesis using concatenative-based techniques
Doctoral thesis. Informatics Engineering. Faculdade de Engenharia, Universidade do Porto. 201