
    Differentiable WORLD Synthesizer-based Neural Vocoder With Application To End-To-End Audio Style Transfer

    In this paper, we propose a differentiable WORLD synthesizer and demonstrate its use in end-to-end audio style transfer tasks such as (singing) voice conversion and the DDSP timbre transfer task. Accordingly, our baseline differentiable synthesizer has no model parameters, yet it yields adequate synthesis quality. We can extend the baseline synthesizer by appending lightweight black-box postnets which apply further processing to the baseline output in order to improve fidelity. An alternative differentiable approach considers extraction of the source excitation spectrum directly, which can improve naturalness albeit for a narrower class of style transfer applications. The acoustic feature parameterization used by our approaches has the added benefit that it naturally disentangles pitch and timbral information so that they can be modeled separately. Moreover, as there exists a robust means of estimating these acoustic features from monophonic audio sources, it allows for parameter loss terms to be added to an end-to-end objective function, which can help convergence and/or further stabilize (adversarial) training. Comment: A revised version of this work has been accepted to the 154th AES Convention; 12 pages, 4 figures.
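
    As an illustration of the last point, the sketch below (ours, not the authors' code) shows how acoustic parameter loss terms on pitch and spectral envelope could be added to an end-to-end reconstruction objective; `model`, `synth`, and the loss weights are hypothetical placeholders.

```python
# A minimal sketch, assuming a differentiable WORLD-style synthesizer
# synth(f0, sp, ap) and a feature-predicting model; names and weights are
# illustrative, not taken from the paper.
import torch.nn.functional as F

def style_transfer_loss(model, synth, x_src, f0_tgt, sp_tgt, ap_tgt,
                        w_wave=1.0, w_f0=0.1, w_sp=0.1):
    f0_hat, sp_hat, ap_hat = model(x_src)    # predicted acoustic features
    y_hat = synth(f0_hat, sp_hat, ap_hat)    # differentiable synthesis
    y_tgt = synth(f0_tgt, sp_tgt, ap_tgt)    # reference rendering

    loss_wave = F.l1_loss(y_hat, y_tgt)      # end-to-end signal loss
    loss_f0 = F.l1_loss(f0_hat, f0_tgt)      # parameter loss: pitch
    loss_sp = F.l1_loss(sp_hat, sp_tgt)      # parameter loss: spectral envelope
    return w_wave * loss_wave + w_f0 * loss_f0 + w_sp * loss_sp
```

    Because pitch and spectral envelope enter the synthesizer as separate inputs, each parameter loss constrains one factor without touching the other, which is the disentanglement benefit described above.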

    In search of the optimal acoustic features for statistical parametric speech synthesis

    In the Statistical Parametric Speech Synthesis (SPSS) paradigm, speech is generally represented as acoustic features and the waveform is generated by a vocoder. A comprehensive summary of state-of-the-art vocoding techniques is presented, highlighting their characteristics, advantages, and drawbacks, primarily when used in SPSS. We conclude that state-of-the-art vocoding methods are suboptimal and are a cause of significant loss of quality, even though numerous vocoders have been proposed in the last decade. In fact, it seems that the most complicated methods perform worse than simpler ones based on more robust analysis/synthesis algorithms. Typical methods, based on the source-filter or sinusoidal models, rely on excessive simplifying assumptions. They perform what we call an "extreme decomposition" of speech (e.g., source+filter or sinusoids+noise), which we believe to be a major drawback. Problems include: difficulties in the estimation of components; modelling of complex non-linear mechanisms; and a lack of ground truth. In addition, the statistical dependence that exists between the stochastic and deterministic components of speech is not modelled.

    We start by improving just the waveform generation stage of SPSS, using standard acoustic features. We propose a new method of waveform generation tailored for SPSS, based on neither source-filter separation nor sinusoidal modelling. The proposed waveform generator avoids unnecessary assumptions and decompositions as far as possible, and uses only the fundamental frequency and spectral envelope as acoustic features. A very small speech database is used as a source of base speech signals, which are subsequently "reshaped" to match the specifications output by the acoustic model in the SPSS framework. All of this is done without any decomposition, such as source+filter or harmonics+noise. A comprehensive description of the waveform generation process is presented, along with implementation issues. Two SPSS voices, a female and a male, were built to test the proposed method using a standard TTS toolkit, Merlin. In a subjective evaluation, listeners preferred the proposed waveform generator over a state-of-the-art vocoder, STRAIGHT. Even though the proposed "waveform reshaping" generator produces higher speech quality than STRAIGHT, the improvement is not large enough.

    Consequently, we propose a new acoustic representation, whose implementation involves feature extraction and waveform generation, i.e., a complete vocoder. The new representation encodes the complex spectrum derived from the Fourier transform in a way explicitly designed for SPSS, rather than for speech coding or copy-synthesis. The feature set comprises four feature streams describing the magnitude spectrum, phase spectrum, and fundamental frequency; all of these are represented by real numbers. It avoids heuristics or unstable methods for phase unwrapping. The new feature extraction does not attempt to decompose the speech structure, and thus the "phasiness" and "buzziness" found in a typical vocoder, such as STRAIGHT, are dramatically reduced. Our method works at a lower frame rate than a typical vocoder. To demonstrate the proposed method, two DNN-based voices, a male and a female, were built using the Merlin toolkit. Subjective comparisons were performed with a state-of-the-art baseline. The proposed vocoder substantially outperformed the baseline for both voices and under all configurations tested. Furthermore, several enhancements were made over the original design, which are beneficial for either sound quality or compatibility with other tools. In addition to its use in SPSS, the proposed vocoder is also demonstrated for join smoothing in unit selection-based systems, and can be used for voice conversion or automatic speech recognition.
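
    As a toy illustration of representing the complex spectrum with real-valued streams (our sketch, not the thesis implementation), one common option is to keep the log magnitude together with the cosine and sine of the phase, which sidesteps explicit phase unwrapping; the frame length and window below are arbitrary assumptions.

```python
# A minimal sketch: real-valued magnitude and phase streams for one analysis
# frame, with no claim that this matches the thesis feature extraction.
import numpy as np

def frame_features(frame, eps=1e-8):
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    log_mag = np.log(np.abs(spec) + eps)           # magnitude stream
    phase = np.angle(spec)
    return log_mag, np.cos(phase), np.sin(phase)   # wrap-free phase streams

# Example: one 32 ms frame at 16 kHz
frame = np.random.randn(512)
log_mag, cos_p, sin_p = frame_features(frame)
```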

    HMM-based speech synthesis using an acoustic glottal source model

    Parametric speech synthesis has received increased attention in recent years following the development of statistical HMM-based speech synthesis. However, the speech produced using this method still does not sound as natural as human speech, and there is limited parametric flexibility to replicate voice quality aspects, such as breathiness. The hypothesis of this thesis is that speech naturalness and voice quality can be more accurately replicated by an HMM-based speech synthesiser using an acoustic glottal source model, the Liljencrants-Fant (LF) model, to represent the source component of speech instead of the traditional impulse train.

    Two different analysis-synthesis methods were developed during this thesis in order to integrate the LF-model into a baseline HMM-based speech synthesiser, which is based on the popular HTS system and uses the STRAIGHT vocoder. The first method, called Glottal Post-Filtering (GPF), consists of passing a chosen LF-model signal through a glottal post-filter to obtain the source signal and then generating speech by passing this source signal through the spectral envelope filter. The system which uses the GPF method (HTS-GPF system) is similar to the baseline system, but it uses a different source signal instead of the impulse train used by STRAIGHT. The second method, called Glottal Spectral Separation (GSS), generates speech by passing the LF-model signal through the vocal tract filter. The major advantage of the synthesiser which incorporates the GSS method, named HTS-LF, is that the acoustic properties of the LF-model parameters are automatically learnt by the HMMs.

    In this thesis, an initial perceptual experiment was conducted to compare the LF-model to the impulse train. The results showed that the LF-model was significantly better, both in terms of speech naturalness and replication of two basic voice qualities (breathy and tense). In a second perceptual evaluation, the HTS-LF system was better than the baseline system, although the difference between the two had been expected to be more significant. A third experiment was conducted to evaluate the HTS-GPF system and an improved HTS-LF system in terms of speech naturalness, voice similarity and intelligibility. The results showed that the HTS-GPF system performed similarly to the baseline. However, the HTS-LF system was significantly outperformed by the baseline. Finally, acoustic measurements were performed on the synthetic speech to investigate the speech distortion in the HTS-LF system. The results indicated that a problem in replicating the rapid variations of the vocal tract filter parameters at transitions between voiced and unvoiced sounds is the most significant cause of speech distortion. This problem encourages future work to further improve the system.
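
    To make the source-filter idea behind GSS concrete, the sketch below (ours, heavily simplified) passes a crude LF-like glottal flow derivative pulse train through a toy all-pole vocal tract filter; the pulse shape, timing ratios, and single-resonance filter are illustrative stand-ins for the actual LF model and spectral envelope, not the HTS-LF implementation.

```python
# A minimal sketch of source-filter synthesis with an LF-like excitation:
# an exponentially growing sinusoid (open phase) followed by an exponential
# return phase, filtered by a toy one-resonance all-pole vocal tract filter.
import numpy as np
from scipy.signal import lfilter

def lf_like_pulse(T0, fs=16000, tp_ratio=0.45, te_ratio=0.6, ta_ratio=0.03):
    """Very rough approximation of the LF flow-derivative shape."""
    n = int(round(T0 * fs))
    te = int(round(te_ratio * n))                 # samples before main excitation
    t = np.arange(te) / fs
    tp = tp_ratio * T0                            # time of the flow peak
    open_phase = np.exp(4.0 * t / (te / fs)) * np.sin(np.pi * t / tp)
    ta = max(1.0, ta_ratio * n)                   # return-phase time constant
    ret = open_phase[-1] * np.exp(-np.arange(n - te) / ta)
    return np.concatenate([open_phase, ret])

fs = 16000
excitation = np.tile(lf_like_pulse(1.0 / 120.0, fs=fs), 50)          # ~120 Hz voicing
vocal_tract = [1.0, -1.8 * np.cos(2 * np.pi * 700.0 / fs), 0.81]     # pole pair near 700 Hz
speech = lfilter([1.0], vocal_tract, excitation)                     # source through filter
```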

    Object coding of music using expressive MIDI

    Structured audio uses a high-level representation of a signal to produce audio output. When it was first introduced in 1998, creating a structured audio representation from an audio signal was beyond the state-of-the-art. Inspired by object coding and structured audio, we present a system to reproduce audio using Expressive MIDI, with high-level parameters used to represent pitch expression from an audio signal. This allows a low bit-rate MIDI sketch of the original audio to be produced. We examine optimisation techniques which may be suitable for inferring Expressive MIDI parameters from estimated pitch trajectories, considering the effect of data codings on the difficulty of optimisation. We look at some less common Gray codes and examine their effect on algorithm performance on standard test problems. We build an Expressive MIDI system, estimating parameters from audio and synthesising output from those parameters. When the parameter estimation succeeds, we find that the system produces note pitch trajectories which match the source audio to within 10 pitch cents. We consider the quality of the system in terms of both parameter estimation and the final output, finding that improvements to core components (audio segmentation and pitch estimation, both active research fields) would produce a better system. We examine the current state-of-the-art in pitch estimation, and find that some estimators produce high-precision estimates but are prone to harmonic errors, whilst other estimators produce fewer harmonic errors but are less precise. Inspired by this, we produce a novel pitch estimator combining the output of existing estimators.
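
    The closing idea, combining a precise but harmonic-error-prone estimator with a coarser, more robust one, can be sketched as below; this is our illustration with made-up values, not the estimator proposed in the thesis, and it only handles octave-type harmonic errors. Errors are measured in cents, the unit used for the 10-cent figure above.

```python
# A minimal sketch: measure pitch error in cents and snap a precise estimate
# to the octave suggested by a robust estimate (hypothetical values throughout).
import numpy as np

def cents(f_est, f_ref):
    return 1200.0 * np.log2(f_est / f_ref)

def combine(f_precise, f_robust):
    """Correct octave-type harmonic errors in the precise estimate."""
    k = np.round(np.log2(f_robust / f_precise))   # whole-octave shift
    return f_precise * 2.0 ** k

f_ref = 220.0       # true pitch
f_precise = 440.3   # precise estimator, but an octave too high
f_robust = 223.0    # coarse estimator, right octave
print(cents(combine(f_precise, f_robust), f_ref))   # ~1.2 cents after correction
```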

    Speech Enhancement Using Speech Synthesis Techniques

    Traditional speech enhancement systems reduce noise by modifying the noisy signal to make it more like a clean signal, which suffers from two problems: under-suppression of noise and over-suppression of speech. These problems create distortions in the enhanced speech and hurt the quality of the enhanced signal. We propose to utilize speech synthesis techniques for a higher-quality speech enhancement system. Synthesizing clean speech based on the noisy signal could produce outputs that are both noise-free and high quality. We first show that we can replace the noisy speech with a clean resynthesis from a previously recorded clean speech dictionary from the same speaker (concatenative resynthesis). Next, we show that using a speech synthesizer (vocoder) we can create a clean resynthesis of the noisy speech for more than one speaker. We term this parametric resynthesis (PR). PR can generate better prosody from noisy speech than a TTS system which uses textual information only. Additionally, we can use the high-quality speech generation capability of neural vocoders for better-quality speech enhancement. When trained on data from enough speakers, these vocoders can generate speech from unseen speakers, both male and female, with similar quality to seen speakers in training. Finally, we show that using neural vocoders we can achieve better objective signal and overall quality than state-of-the-art speech enhancement systems, and better subjective quality than an oracle mask-based system.
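
    A rough sketch of the parametric resynthesis pipeline follows (ours, not the authors' code): a prediction network maps noisy acoustic features to clean ones, and a separately trained vocoder, represented here only by a commented-out placeholder call, renders the waveform. The network shape and the choice of log-mel features are assumptions.

```python
# A minimal sketch of parametric resynthesis: predict clean features from
# noisy ones, then hand them to a vocoder (placeholder here).
import torch
import torch.nn as nn

class PredictionNet(nn.Module):
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_mels)

    def forward(self, noisy_mel):                  # (batch, frames, n_mels)
        h, _ = self.rnn(noisy_mel)
        return self.out(h)                         # predicted clean features

net = PredictionNet()
noisy_mel = torch.randn(1, 200, 80)                # stand-in noisy features
clean_mel_hat = net(noisy_mel)
# waveform = neural_vocoder(clean_mel_hat)         # e.g. a pretrained vocoder
```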