
    Learning Latent Representations for Speech Generation and Transformation

    An ability to model a generative process and learn a latent representation for speech in an unsupervised fashion will be crucial to process vast quantities of unlabelled speech data. Recently, deep probabilistic generative models such as Variational Autoencoders (VAEs) have achieved tremendous success in modeling natural images. In this paper, we apply a convolutional VAE to model the generative process of natural speech. We derive latent space arithmetic operations to disentangle learned latent representations. We demonstrate the capability of our model to modify the phonetic content or the speaker identity for speech segments using the derived operations, without the need for parallel supervisory data.
    Comment: Accepted to Interspeech 201
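The latent space arithmetic mentioned in this abstract can be illustrated with a minimal sketch. It assumes that encoded segments are available as rows of NumPy arrays and that an attribute (a speaker or a phone) is changed by adding the difference between the mean latent vectors of the target and source groups. The function names, the 16-dimensional latent size and the random stand-in data are hypothetical; the paper's encoder, decoder and exact operations are not reproduced here.

```python
import numpy as np

def attribute_shift(z_source, z_target):
    """Difference between the mean latent vectors of two groups of segments."""
    return z_target.mean(axis=0) - z_source.mean(axis=0)

def apply_shift(z, shift):
    """Move latent vectors from the source attribute towards the target one."""
    return z + shift

# Stand-in latent codes (hypothetical): a real system would obtain these by
# running a trained VAE encoder on speech segments of each speaker.
rng = np.random.default_rng(0)
z_speaker_a = rng.normal(loc=-1.0, size=(200, 16))
z_speaker_b = rng.normal(loc=+1.0, size=(200, 16))

shift = attribute_shift(z_speaker_a, z_speaker_b)
z_converted = apply_shift(z_speaker_a, shift)  # would then be passed to the decoder
```

Because the shift is computed from unpaired groups of segments, no parallel recordings of the two speakers are needed, which matches the claim in the abstract.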

    Modern Methods of Time-Frequency Warping of Sound Signals

    This thesis deals with the representation of non-stationary harmonic signals with time-varying components. It focuses on the Harmonic Transform and its variant with subquadratic computational complexity, the Fast Harmonic Transform. Two algorithms using the Fast Harmonic Transform are presented. The first estimates the change in fundamental frequency with the gathered log-spectrum; the second uses an analysis-by-synthesis approach. Both algorithms are applied to a speech segment to compare their outputs. Finally, the analysis-by-synthesis algorithm is applied to several real sound signals to measure how much the Harmonic Transform improves the representation of real frequency-modulated signals.
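A rough sketch of the idea behind a fast harmonic-style transform: resample the frame on a grid that is uniform in the phase of the fundamental, then take an ordinary FFT. This is an assumption about how such a transform can be realised, not the thesis's algorithm. In particular, the fundamental frequency track `f0` is taken as known here, whereas the thesis estimates its change, and the function name `warped_spectrum` and the test signal are hypothetical.

```python
import numpy as np

def warped_spectrum(x, f0, fs):
    """Magnitude spectrum of x after warping time so that f0 becomes constant.

    x  : real signal frame
    f0 : fundamental frequency in Hz at every sample of x (assumed known, > 0)
    fs : sampling rate in Hz
    """
    x = np.asarray(x, dtype=float)
    f0 = np.asarray(f0, dtype=float)
    n = len(x)
    samples = np.arange(n)
    # Accumulated phase of the fundamental (in cycles), scaled to run from 0 to 1.
    phase = np.concatenate(([0.0], np.cumsum(f0[:-1]))) / fs
    unit_phase = phase / phase[-1]
    # Fractional sample positions at which the phase advances in equal steps.
    warped_positions = np.interp(np.linspace(0.0, 1.0, n), unit_phase, samples)
    # Linear-interpolation resampling; a real implementation might use a better kernel.
    x_warped = np.interp(warped_positions, samples, x)
    return np.abs(np.fft.rfft(x_warped * np.hanning(n)))

# Example: a tone gliding from 200 Hz to 400 Hz.
fs, n = 8000, 4096
t = np.arange(n) / fs
f0 = 200.0 + 200.0 * t / t[-1]
x = np.sin(2.0 * np.pi * np.cumsum(f0) / fs)

plain = np.abs(np.fft.rfft(x * np.hanning(n)))
warped = warped_spectrum(x, f0, fs)
```

In the plain spectrum the glide is smeared across many bins, while after warping most of the energy should fall into a few bins. The cost is dominated by the FFT, which is where the subquadratic complexity comes from in this sketch.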

    A Comparison of Front-Ends for Bitstream-Based ASR over IP

    Automatic speech recognition (ASR) is expected to play an important role in providing spoken interfaces for IP-based applications. However, because the speech signal travels over these networks, ASR systems face two new challenges: degraded speech quality due to the compression needed to fit the channel capacity, and the inevitable occurrence of packet losses. In this framework, bitstream-based approaches that obtain the ASR feature vectors directly from the coded bitstream, avoiding the speech decoding process, have been proposed ([S.H. Choi, H.K. Kim, H.S. Lee, Speech recognition using quantized LSP parameters and their transformations in digital communications, Speech Commun. 30 (4) (2000) 223–233; A. Gallardo-Antolín, C. Peláez-Moreno, F. Díaz-de-María, Recognizing GSM digital speech, IEEE Trans. Speech Audio Process., to appear; H.K. Kim, R.V. Cox, R.C. Rose, Performance improvement of a bitstream-based front-end for wireless speech recognition in adverse environments, IEEE Trans. Speech Audio Process. 10 (8) (2002) 591–604; C. Peláez-Moreno, A. Gallardo-Antolín, F. Díaz-de-María, Recognizing voice over IP networks: a robust front-end for speech recognition on the WWW, IEEE Trans. Multimedia 3 (2) (2001) 209–218], among others) to improve the robustness of ASR systems. LSP (Line Spectral Pairs) are the preferred set of parameters for describing the speech spectral envelope in most modern speech coders. Nevertheless, LSP have proved to be unsuitable for ASR, and they must be transformed into cepstrum-type parameters. In this paper we comparatively evaluate the robustness of the most significant LSP-to-cepstrum transformations in a simulated VoIP (voice over IP) environment which includes two of the most popular codecs used in that network (G.723.1 and G.729) and several network conditions. In particular, we compare ‘pseudocepstrum’ [H.K. Kim, S.H. Choi, H.S. Lee, On approximating Line Spectral Frequencies to LPC cepstral coefficients, IEEE Trans. Speech Audio Process. 8 (2) (2000) 195–199], an approximate but straightforward transformation of LSP into LP cepstral coefficients, with a more computationally demanding but exact one. Our results show that pseudocepstrum is preferable when network conditions are good or computational resources are low, while the exact procedure is recommended when network conditions become more adverse.
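The two kinds of LSP-to-cepstrum transformation compared in this abstract can be sketched as follows. The exact route rebuilds the LP polynomial A(z) = 1 + a_1 z^-1 + ... + a_p z^-p from the line spectral frequencies and then applies the standard LP-to-cepstrum recursion. The approximate route uses the closed form usually given for the pseudocepstrum, c_n ≈ (1/n) [ (1 + (-1)^n)/2 + Σ_i cos(n ω_i) ]. The abstract does not state either formula, so both are assumptions here, as are the function names, the even model order and the example frequencies.

```python
import numpy as np

def lsf_to_lpc(lsf):
    """LSFs (radians, ascending, even order p) -> a_1..a_p of A(z)."""
    p_poly = np.array([1.0, 1.0])    # symmetric polynomial, root at z = -1
    q_poly = np.array([1.0, -1.0])   # antisymmetric polynomial, root at z = +1
    for i, w in enumerate(np.asarray(lsf, dtype=float)):
        factor = np.array([1.0, -2.0 * np.cos(w), 1.0])
        if i % 2 == 0:
            p_poly = np.convolve(p_poly, factor)
        else:
            q_poly = np.convolve(q_poly, factor)
    a_full = 0.5 * (p_poly + q_poly)  # first entry is 1, last entry cancels to 0
    return a_full[1:-1]

def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c_1..c_n of 1/A(z) via the standard recursion."""
    a = np.asarray(a, dtype=float)
    p = len(a)
    c = np.zeros(n_ceps + 1)  # c[0] is unused
    for n in range(1, n_ceps + 1):
        acc = 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += k * c[k] * a[n - k - 1]
        a_n = a[n - 1] if n <= p else 0.0
        c[n] = -a_n - acc / n
    return c[1:]

def pseudocepstrum(lsf, n_ceps):
    """Approximate cepstrum computed directly from the LSFs."""
    lsf = np.asarray(lsf, dtype=float)
    n = np.arange(1, n_ceps + 1)
    parity = 0.5 * (1.0 + (-1.0) ** n)  # 1 for even n, 0 for odd n
    return (parity + np.cos(np.outer(n, lsf)).sum(axis=1)) / n

lsf = np.array([0.3, 0.6, 1.0, 1.5, 2.0, 2.6])    # hypothetical frame, p = 6
exact = lpc_to_cepstrum(lsf_to_lpc(lsf), 10)
approx = pseudocepstrum(lsf, 10)
```

The approximate route needs only cosines of the decoded LSFs, while the exact route adds polynomial reconstruction and a recursion that is quadratic in the number of coefficients, which illustrates the trade-off between cost and accuracy described in the abstract.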

    Wavelet-based voice morphing

    This paper presents a new multi-scale voice morphing algorithm. This algorithm enables a user to transform one person's speech pattern into another person's pattern with distinct characteristics, giving it a new identity, while preserving the original content. The voice morphing algorithm performs the morphing at different subbands by using the theory of wavelets and models the spectral conversion using the theory of Radial Basis Function Neural Networks. The results obtained on the TIMIT speech database demonstrate effective transformation of the speaker identity.
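The two ingredients named in this abstract, a wavelet subband split and a Radial Basis Function network that maps source features to target features, can be sketched in a few lines. Everything specific below is an assumption made for illustration: the Haar wavelet, the single decomposition level, Gaussian basis functions with fixed centers, and output weights fitted by least squares. The paper's actual wavelet, features and training procedure are not given in the abstract.

```python
import numpy as np

def haar_split(x):
    """One-level Haar analysis: even-length signal -> (approximation, detail)."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def haar_merge(approx, detail):
    """Inverse of haar_split (perfect reconstruction)."""
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2.0)
    x[1::2] = (approx - detail) / np.sqrt(2.0)
    return x

def rbf_features(X, centers, sigma):
    """Gaussian activations of each row of X with respect to each center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_rbf(X, Y, centers, sigma):
    """Least-squares output weights mapping source features X to target features Y."""
    W, *_ = np.linalg.lstsq(rbf_features(X, centers, sigma), Y, rcond=None)
    return W

def predict_rbf(X, W, centers, sigma):
    """Apply a fitted RBF mapping to new source features."""
    return rbf_features(X, centers, sigma) @ W
```

In a subband scheme of this kind, one mapping would be fitted per subband on features from aligned source and target frames, the converted subbands would then be recombined with `haar_merge`, and deeper decompositions would repeat `haar_split` on the approximation band.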