65,276 research outputs found
Can we steal your vocal identity from the Internet?: Initial investigation of cloning Obama's voice using GAN, WaveNet and low-quality found data
Thanks to the growing availability of spoofing databases and rapid advances
in using them, systems for detecting voice spoofing attacks are becoming more
and more capable, and error rates close to zero are being reached for the
ASVspoof2015 database. However, speech synthesis and voice conversion paradigms
that are not considered in the ASVspoof2015 database are appearing. Such
examples include direct waveform modelling and generative adversarial networks.
We also need to investigate the feasibility of training spoofing systems using
only low-quality found data. For that purpose, we developed a generative
adversarial network-based speech enhancement system that improves the quality
of speech data found in publicly available sources. Using the enhanced data, we
trained state-of-the-art text-to-speech and voice conversion models and
evaluated them in terms of perceptual speech quality and speaker similarity.
The results show that the enhancement models significantly improved the SNR of
low-quality degraded data found in publicly available sources and that they
significantly improved the perceptual cleanliness of the source speech without
significantly degrading the naturalness of the voice. However, the results also
show limitations when generating speech with the low-quality found data.
Comment: conference manuscript submitted to Speaker Odyssey 201
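The abstract reports SNR gains from the GAN-based enhancement front end but, being an abstract, ships no code. As a minimal sketch of the SNR metric itself (not the authors' pipeline; the variable names are hypothetical), in Python:

    import numpy as np

    def snr_db(reference: np.ndarray, estimate: np.ndarray) -> float:
        # Signal-to-noise ratio in dB, treating (estimate - reference) as noise.
        noise = estimate - reference
        return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-12))

    # Hypothetical usage with clean, degraded and GAN-enhanced waveforms of equal length:
    # snr_gain_db = snr_db(clean, enhanced) - snr_db(clean, degraded)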
Multi-parametric source-filter separation of speech and prosodic voice restoration
In this thesis, methods and models are developed and presented aiming at the estimation, restoration and transformation of the characteristics of human speech. During a first period of the thesis, a concept was developed that allows restoring prosodic voice features and reconstructing more natural-sounding speech from pathological voices using a multi-resolution approach. Inspired by observations made with this approach, the need emerged for a novel method to separate speech into voice source and articulation components in order to improve the perceptive quality of the restored speech signal. This separation method subsequently became the main part of this work and is therefore presented first in this thesis. The proposed method is evaluated on synthetic, physically modelled, healthy and pathological speech. A robust, separate representation of source and filter characteristics has applications that go far beyond the reconstruction of alaryngeal speech: it is potentially useful for efficient speech coding, voice biometrics, emotional speech synthesis, remote and/or non-invasive voice disorder diagnosis, etc.
A key aspect of the voice restoration method is the reliable separation of the speech signal into voice source and articulation, because it is mostly the voice source that requires replacement or enhancement in alaryngeal speech. Observations during the evaluation of the above method highlighted that this separation is insufficient with currently known methods. Therefore, the main part of this thesis is concerned with the modelling of voice and vocal tract and the estimation of the respective model parameters. Most methods for joint source-filter estimation known today represent a compromise between model complexity, estimation feasibility and estimation efficiency: typically, single-parametric source models are used for the sake of tractable optimization, or multi-parametric models are estimated using inefficient grid searches over the entire parameter space. The novel method presented in this work advances the efficient estimation and fitting of multi-parametric source and filter models to healthy and pathological speech signals, resulting in a more reliable estimation of voice source and especially vocal tract coefficients. In particular, the proposed method exhibits a substantially reduced bias in the estimated formant frequencies and bandwidths over a wide variety of experimental conditions such as environmental noise, glottal jitter, fundamental frequency, voice type and glottal noise. The method appears to be especially robust to environmental noise and improves the separation of deterministic voice source components from the articulation.
Alaryngeal speakers often have great difficulty producing intelligible, let alone prosodic, speech. Despite great efforts and advances in surgical and rehabilitative techniques, currently known methods, devices and modes of speech rehabilitation leave pathological speakers with little ability to control key aspects of their voice. The multi-resolution approach presented at the end of this thesis provides alaryngeal speakers with an intuitive way to reintroduce prosodic features into their speech by reconstructing a more intelligible, more natural and more prosodic voice. The proposed method is entirely non-invasive: key prosodic cues are reconstructed and enhanced at different temporal scales by inducing additional volatility estimated from other, still intact, speech features. The restored voice source is thus controllable in an intuitive way by the alaryngeal speaker.
Despite these advantages, the proposed joint source-filter estimation method also has a weak point: it is susceptible to modelling errors of the glottal source. On the other hand, the proposed estimation framework appears to be well suited for future research on exactly this topic. A logical continuation of this work is to leverage the efficiency and reliability of the proposed method for the development of new, more accurate glottal source models.
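The thesis abstract gives no code; as a rough baseline illustration only (classical single-parametric LPC inverse filtering, not the thesis's multi-parametric joint estimation; the file name and parameter values are assumptions), a Python sketch of source-filter separation:

    import librosa
    import scipy.signal

    y, sr = librosa.load("voiced_frame.wav", sr=16000)        # hypothetical input segment
    order = 2 + sr // 1000                                     # rule-of-thumb LPC order
    a = librosa.lpc(y, order=order)                            # all-pole vocal-tract (filter) estimate
    residual = scipy.signal.lfilter(a, [1.0], y)               # inverse filtering -> voice-source estimate
    reconstructed = scipy.signal.lfilter([1.0], a, residual)   # re-synthesis through the estimated filter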
Seeing Through Noise: Visually Driven Speaker Separation and Enhancement
Isolating the voice of a specific person while filtering out other voices or
background noises is challenging when video is shot in noisy environments. We
propose audio-visual methods to isolate the voice of a single speaker and
eliminate unrelated sounds. First, face motions captured in the video are used
to estimate the speaker's voice, by passing the silent video frames through a
video-to-speech neural network-based model. Then the speech predictions are
applied as a filter on the noisy input audio. This approach avoids using
mixtures of sounds in the learning process, as the number of such possible
mixtures is huge, and would inevitably bias the trained model. We evaluate our
method on two audio-visual datasets, GRID and TCD-TIMIT, and show that our
method attains significant SDR and PESQ improvements over the raw
video-to-speech predictions, and a well-known audio-only method.
Comment: Supplementary video: https://www.youtube.com/watch?v=qmsyj7vAzo
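One plausible, simplified reading of "the speech predictions are applied as a filter on the noisy input audio" is a soft spectral mask built from the predicted speech; the Python sketch below is an illustrative assumption, not the authors' exact implementation:

    import numpy as np
    import librosa

    def mask_with_prediction(noisy, predicted, n_fft=512, hop=128):
        # Build a soft ratio mask from the video-to-speech prediction and apply it
        # to the STFT of the noisy mixture (illustrative parameters).
        N = librosa.stft(noisy, n_fft=n_fft, hop_length=hop)
        P = librosa.stft(predicted, n_fft=n_fft, hop_length=hop)
        frames = min(N.shape[1], P.shape[1])
        N, P = N[:, :frames], P[:, :frames]
        mask = np.abs(P) / (np.abs(P) + np.abs(N) + 1e-8)
        return librosa.istft(mask * N, hop_length=hop)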
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 pdf figure
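The review singles out log-mel spectra as a dominant input representation; a minimal Python sketch of that feature extraction (parameter values are illustrative, using librosa's bundled example clip):

    import numpy as np
    import librosa

    y, sr = librosa.load(librosa.ex("trumpet"))               # bundled example audio
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=64)
    log_mel = librosa.power_to_db(mel, ref=np.max)            # (n_mels, frames) log-mel spectrogram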
Speaker-normalized sound representations in the human auditory cortex
The acoustic dimensions that distinguish speech sounds (like the vowel differences in "boot" and "boat") also differentiate speakers' voices. Therefore, listeners must normalize across speakers without losing linguistic information. Past behavioral work suggests an important role for auditory contrast enhancement in normalization: preceding context affects listeners' perception of subsequent speech sounds. Here, using intracranial electrocorticography in humans, we investigate whether and how such context effects arise in auditory cortex. Participants identified speech sounds that were preceded by phrases from two different speakers whose voices differed along the same acoustic dimension as target words (the lowest resonance of the vocal tract). In every participant, target vowels evoke a speaker-dependent neural response that is consistent with the listener's perception, and which follows from a contrast enhancement model. Auditory cortex processing thus displays a critical feature of normalization, allowing listeners to extract meaningful content from the voices of diverse speakers.
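A toy numerical reading of the contrast-enhancement account (purely illustrative; the linear shift rule and gain value are assumptions, not the paper's fitted model), in Python:

    def contrast_enhanced_f1(target_f1_hz, context_mean_f1_hz, gain=0.3):
        # Perceived/neural F1 of the target is pushed away from the mean F1
        # of the preceding speaker's context (contrast enhancement).
        return target_f1_hz + gain * (target_f1_hz - context_mean_f1_hz)

    # The same 600 Hz target reads as higher after a low-F1 context than a high-F1 one:
    print(contrast_enhanced_f1(600.0, 450.0))   # 645.0
    print(contrast_enhanced_f1(600.0, 750.0))   # 555.0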