114 research outputs found

    A modulation property of time-frequency derivatives of filtered phase and its application to aperiodicity and fo estimation

    We introduce a simple and linear SNR (strictly speaking, periodic-to-random power ratio) estimator, usable from 0 dB to 80 dB without additional calibration or linearization, for providing reliable descriptions of aperiodicity in speech corpora. The main idea of the method is to estimate the background random noise level without directly extracting the background noise itself. The proposed method is applicable to a wide variety of time windowing functions with very low sidelobe levels. The estimate combines the frequency derivative and the time-frequency derivative of the mapping from filter center frequency to the output instantaneous frequency. This procedure can replace the periodicity detection and aperiodicity estimation subsystems of the recently introduced open-source YANG vocoder. The source code of a MATLAB implementation of this method will also be open-sourced. Comment: 8 pages, 9 figures; submitted and accepted to Interspeech201
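The filtered-phase-derivative idea can be illustrated with a toy estimate of the output instantaneous frequency: band-pass the signal with a complex Gaussian-windowed filter and differentiate the unwrapped output phase. This is a minimal sketch, not the paper's estimator; the filter design and the parameters `fc` and `bw` are illustrative assumptions.

```python
import numpy as np

def instantaneous_frequency(x, fs, fc, bw):
    """Band-pass x around fc (Hz) with a complex Gaussian filter of
    bandwidth ~bw (Hz), then differentiate the unwrapped output phase.
    Toy stand-in for the filtered-phase derivatives used in the paper."""
    t = np.arange(-3 * fs // int(bw), 3 * fs // int(bw) + 1) / fs
    h = np.exp(-0.5 * (t * bw) ** 2) * np.exp(2j * np.pi * fc * t)
    y = np.convolve(x, h, mode="same")
    phase = np.unwrap(np.angle(y))
    return np.gradient(phase) * fs / (2.0 * np.pi)  # Hz

fs = 8000
n = np.arange(fs)
x = np.sin(2 * np.pi * 440 * n / fs)
fi = instantaneous_frequency(x, fs, fc=400.0, bw=100.0)
# For a periodic input, the output instantaneous frequency locks onto the
# component frequency (~440 Hz) even though the filter is centred at 400 Hz.
print(np.median(fi[fs // 4: 3 * fs // 4]))
```

This locking behaviour of the filter-centre-to-instantaneous-frequency mapping is what the derivatives in the abstract probe.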

    Glottal Spectral Separation for Speech Synthesis


    Reconstructing intelligible audio speech from visual speech features

    This work describes an investigation into the feasibility of producing intelligible audio speech from only visual speech features. The proposed method aims to estimate a spectral envelope from visual features, which is then combined with an artificial excitation signal and used within a model of speech production to reconstruct an audio signal. Different combinations of audio and visual features are considered, along with both a statistical method of estimation and a deep neural network. The intelligibility of the reconstructed audio speech is measured by human listeners, and then compared to the intelligibility of the video signal alone and when combined with the reconstructed audio.
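The source-filter reconstruction step described above can be sketched as an impulse-train excitation driven through an all-pole spectral envelope. A minimal sketch: the envelope here is a fixed toy filter, whereas the paper estimates it from visual features; all names and parameter values are illustrative.

```python
import numpy as np

def reconstruct(a, f0, fs, dur):
    """Drive an impulse-train excitation at f0 (Hz) through an all-pole
    (LPC-style) envelope a = [1, a1, ..., ap]; direct-form synthesis."""
    n = int(dur * fs)
    exc = np.zeros(n)
    exc[:: int(fs / f0)] = 1.0  # crude stand-in for a glottal pulse train
    y = np.zeros(n)
    for i in range(n):
        y[i] = exc[i] - sum(a[k] * y[i - k]
                            for k in range(1, len(a)) if i - k >= 0)
    return y

a = [1.0, -0.9]                 # toy envelope: one real pole at z = 0.9
y = reconstruct(a, f0=100, fs=8000, dur=0.1)
print(len(y), y[0], y[1])
```

Replacing the fixed `a` with envelope coefficients predicted from visual features gives the reconstruction pipeline the abstract outlines.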

    HMM-Based Speech Synthesis Utilizing Glottal Inverse Filtering


    Differentiable WORLD Synthesizer-based Neural Vocoder With Application To End-To-End Audio Style Transfer

    In this paper, we propose a differentiable WORLD synthesizer and demonstrate its use in end-to-end audio style transfer tasks such as (singing) voice conversion and the DDSP timbre transfer task. Accordingly, our baseline differentiable synthesizer has no model parameters, yet it yields adequate synthesis quality. We can extend the baseline synthesizer by appending lightweight black-box postnets, which apply further processing to the baseline output in order to improve fidelity. An alternative differentiable approach considers extraction of the source excitation spectrum directly, which can improve naturalness, albeit for a narrower class of style transfer applications. The acoustic feature parameterization used by our approaches has the added benefit that it naturally disentangles pitch and timbral information so that they can be modeled separately. Moreover, as there exists a robust means of estimating these acoustic features from monophonic audio sources, parameter loss terms can be added to an end-to-end objective function, which can help convergence and/or further stabilize (adversarial) training. Comment: A revised version of this work has been accepted to the 154th AES Convention; 12 pages, 4 figures
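The parameter-loss idea, adding acoustic-feature terms to the end-to-end objective, might look like the following sketch. The choice of squared-error and L1 terms, the weight `lam`, and all names are assumptions for illustration, not the paper's exact objective.

```python
import numpy as np

def total_loss(y_hat, y, f0_hat, f0, env_hat, env, lam=1.0):
    """Reconstruction loss plus L1 parameter-loss terms on the
    synthesizer's acoustic features (F0 contour, spectral envelope)."""
    rec = np.mean((y_hat - y) ** 2)
    par = np.mean(np.abs(f0_hat - f0)) + np.mean(np.abs(env_hat - env))
    return rec + lam * par

# Toy check: unit waveform error, 10 Hz mean F0 error, exact envelope.
L = total_loss(np.ones(10), np.zeros(10),
               np.full(5, 110.0), np.full(5, 100.0),
               np.zeros((5, 4)), np.zeros((5, 4)), lam=0.1)
print(L)  # 1.0 + 0.1 * 10.0 = 2.0
```

Because the WORLD features can be estimated robustly from monophonic audio, the targets `f0` and `env` are available for any training utterance, which is what makes such auxiliary terms practical.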

    A Comparison Between STRAIGHT, Glottal, and Sinusoidal Vocoding in Statistical Parametric Speech Synthesis

    Speech is a fundamental method of human communication that allows conveying information between people. Even though the linguistic content is commonly regarded as the main information in speech, the signal contains a richness of other information, such as prosodic cues that shape the intended meaning of a sentence. This information is largely generated by the quasi-periodic glottal excitation, which is the acoustic speech excitation airflow originating from the lungs that makes the vocal folds oscillate in the production of voiced speech. By regulating the sub-glottal pressure and the tension of the vocal folds, humans learn to affect the characteristics of the glottal excitation in order to signal, for example, the emotional state of the speaker. Glottal inverse filtering (GIF) is an estimation method for the glottal excitation of a recorded speech signal. Various cues about the speech signal, such as the mode of phonation, can be detected and analyzed from an estimate of the glottal flow, both instantaneously and as a function of time. Aside from its use in fundamental speech research, such as phonetics, the recent advances in GIF and machine learning enable a wider variety of GIF applications, such as emotional speech synthesis and the detection of paralinguistic information. However, GIF is a difficult inverse problem where the target algorithm output is generally unattainable with direct measurements. Thus, the algorithms and their evaluation need to rely on prior assumptions about the properties of the speech signal. A common thread in most of the studies in this thesis is the estimation of the vocal tract transfer function (the key problem in GIF) by temporally weighting the optimization criterion in GIF so that the effect of the main excitation peak is attenuated.
This thesis studies GIF from various perspectives, including the development of two new GIF methods that improve performance over the state-of-the-art methods, and furthers basic research in the automated estimation of the glottal excitation. The estimation of the GIF-based vocal tract transfer function for formant tracking and perceptually weighted speech envelope estimation is also studied. The central speech technology application of GIF addressed in the thesis is the use of GIF-based spectral envelope models and glottal excitation waveforms as target training data for the generative neural network models used in statistical parametric speech synthesis. The obtained results show that even though the presented studies provide improvements to the previous methodology for all voice types, GIF-based speech processing continues to mainly benefit male voices in speech synthesis applications.
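The temporally weighted optimization criterion described above can be sketched as weighted linear prediction: down-weighting the error around the main glottal excitation peak reduces its influence on the vocal tract (all-pole) estimate. This is a minimal sketch under simplifying assumptions, not any of the thesis's specific GIF methods; the weighting scheme and all names are illustrative.

```python
import numpy as np

def weighted_lp(x, order, w):
    """Minimise sum_n w[n] * (x[n] - sum_k a_k x[n-k])**2 and return the
    inverse-filter coefficients [1, -a1, ..., -ap]. In GIF-style use, w
    would be small around each main excitation peak to attenuate its
    effect on the vocal-tract estimate."""
    N = len(x)
    X = np.array([[x[n - k] if n - k >= 0 else 0.0
                   for k in range(1, order + 1)] for n in range(N)])
    W = np.diag(w)
    a = np.linalg.solve(X.T @ W @ X, X.T @ W @ x)
    return np.concatenate(([1.0], -a))

# Sanity check on a toy AR(1) signal x[n] = 0.8 x[n-1] + noise:
rng = np.random.default_rng(0)
x = np.zeros(2000)
for n in range(1, 2000):
    x[n] = 0.8 * x[n - 1] + rng.standard_normal()
lp = weighted_lp(x, order=1, w=np.ones(2000))
print(lp)  # approximately [1.0, -0.8]
```

With uniform weights this reduces to ordinary linear prediction; the GIF-oriented behaviour comes entirely from shaping `w` over time.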