Collapsed speech segment detection and suppression for WaveNet vocoder
In this paper, we propose a technique to alleviate the quality degradation
caused by collapsed speech segments sometimes generated by the WaveNet vocoder.
The effectiveness of the WaveNet vocoder for generating natural speech from
acoustic features has been demonstrated in recent works. However, it sometimes
generates very noisy speech with collapsed speech segments when only a limited
amount of training data is available or significant acoustic mismatches exist
between the training and testing data. Such limitations of the corpus and of
the model's ability arise easily in some speech generation
applications, such as voice conversion and speech enhancement. To address this
problem, we propose a technique to automatically detect collapsed speech
segments. Moreover, to refine the detected segments, we also propose a waveform
generation technique for WaveNet using a linear predictive coding constraint.
Verification and subjective tests are conducted to investigate the
effectiveness of the proposed techniques. The verification results indicate
that the detection technique can detect most collapsed segments. The subjective
evaluations of voice conversion demonstrate that the generation technique
significantly improves the speech quality while maintaining the same speaker
similarity.
Comment: 5 pages, 6 figures. Proc. Interspeech, 201
Visually Indicated Sounds
Objects make distinctive sounds when they are hit or scratched. These sounds
reveal aspects of an object's material properties, as well as the actions that
produced them. In this paper, we propose the task of predicting what sound an
object makes when struck as a way of studying physical interactions within a
visual scene. We present an algorithm that synthesizes sound from silent videos
of people hitting and scratching objects with a drumstick. This algorithm uses
a recurrent neural network to predict sound features from videos and then
produces a waveform from these features with an example-based synthesis
procedure. We show that the sounds predicted by our model are realistic enough
to fool participants in a "real or fake" psychophysical experiment, and that
they convey significant information about material properties and physical
interactions.
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 pdf figures
Scalable Approach to Uncertainty Quantification and Robust Design of Interconnected Dynamical Systems
The development of robust dynamical systems and networks, such as autonomous
aircraft systems capable of accomplishing complex missions, faces challenges due
to dynamically evolving uncertainties arising from model uncertainties,
the need to operate in a hostile, cluttered urban environment, and the
distributed and dynamic nature of the communication and computation resources.
Model-based robust design is difficult because of the complexity of the hybrid
dynamic models including continuous vehicle dynamics, the discrete models of
computations and communications, and the size of the problem. We give an overview
of recent advances in methodology and tools to model, analyze, and design robust
autonomous aerospace systems operating in uncertain environments, with emphasis on
efficient uncertainty quantification and robust design, using mission case studies
that include model-based target tracking and search, and trajectory
planning in an uncertain urban environment. To show that the methodology is
generally applicable to uncertain dynamical systems, we also show examples
of applying the new methods to efficient uncertainty quantification of
energy usage in buildings and to stability assessment of interconnected power
networks.
Visual to Sound: Generating Natural Sound for Videos in the Wild
As two of the five traditional human senses (sight, hearing, taste, smell,
and touch), vision and sound are basic sources through which humans understand
the world. Often correlated during natural events, these two modalities combine
to jointly affect human perception. In this paper, we pose the task of
generating sound given visual input. Such capabilities could help enable
applications in virtual reality (generating sound for virtual scenes
automatically) or provide additional accessibility to images or videos for
people with visual impairments. As a first step in this direction, we apply
learning-based methods to generate raw waveform samples given input video
frames. We evaluate our models on a dataset of videos containing a variety of
sounds (such as ambient sounds and sounds from people/animals). Our experiments
show that the generated sounds are fairly realistic and have good temporal
synchronization with the visual inputs.
Comment: Project page:
http://bvision11.cs.unc.edu/bigpen/yipin/visual2sound_webpage/visual2sound.htm