Search CORE

1,284 research outputs found

Voice morphing using the generative topographic mapping

Author: Moroz I. M.
Orphanidou C.
Roberts S. J.
Publication venue
Publication date: 01/01/2003
Field of study

In this paper we address the problem of Voice Morphing. We attempt to transform the spectral characteristics of a source speaker's speech signal so that the listener would believe that the speech was uttered by a target speaker. The voice morphing system transforms the spectral envelope as represented by a Linear Prediction model. The transformation is achieved by codebook mapping using the Generative Topographic Mapping, a non-linear, latent variable, parametrically constrained, Gaussian Mixture Model

Oxford University Research Archive

Determination of Formant Features in Czech and Slovak for GMM Emotional Speech Classifier

Author: Pribil J.
Pribilova A.
Publication venue: Společnost pro radioelektronické inženýrství
Publication date: 01/04/2013
Field of study

The paper is aimed at determination of formant features (FF) which describe vocal tract characteristics. It comprises analysis of the first three formant positions together with their bandwidths and the formant tilts. Subsequently, the statistical evaluation and comparison of the FF was performed. This experiment was realized with the speech material in the form of sentences of male and female speakers expressing four emotional states (joy, sadness, anger, and a neutral state) in Czech and Slovak languages. The statistical distribution of the analyzed formant frequencies and formant tilts shows good differentiation between neutral and emotional styles for both voices. Contrary to it, the values of the formant 3-dB bandwidths have no correlation with the type of the speaking style or the type of the voice. These spectral parameters together with the values of the other speech characteristics were used in the feature vector for Gaussian mixture models (GMM) emotional speech style classifier that is currently developed. The overall mean classification error rate achieves about 18 %, and the best obtained error rate is 5 % for the sadness style of the female voice. These values are acceptable in this first stage of development of the GMM classifier that should be used for evaluation of the synthetic speech quality after applied voice conversion and emotional speech style transformation

Directory of Open Access Journals

Digital library of Brno University of Technology

Reconstruction of Phonated Speech from Whispers Using Formant-Derived Plausible Pitch Modulation

Author: Beigi Homayoon
Hamid Reza Sharifzadeh
Ian V. Mcloughlin
Jingjie Li
Joliveau Elodie
McLoughlin Ian Vince
Netsell Ronald
Rothenberg Martin
Sharifzadeh Hamid Reza
Sharifzadeh Hamid Reza
Su Lim Tan
Sundberg Johan
Toda Tomoki
Yan Song
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 11/05/2015
Field of study

Whispering is a natural, unphonated, secondary aspect of speech communications for most people. However, it is the primary mechanism of communications for some speakers who have impaired voice production mechanisms, such as partial laryngectomees, as well as for those prescribed voice rest, which often follows surgery or damage to the larynx. Unlike most people, who choose when to whisper and when not to, these speakers may have little choice but to rely on whispers for much of their daily vocal interaction. Even though most speakers will whisper at times, and some speakers can only whisper, the majority of today’s computational speech technology systems assume or require phonated speech. This article considers conversion of whispers into natural-sounding phonated speech as a noninvasive prosthetic aid for people with voice impairments who can only whisper. As a by-product, the technique is also useful for unimpaired speakers who choose to whisper. Speech reconstruction systems can be classified into those requiring training and those that do not. Among the latter, a recent parametric reconstruction framework is explored and then enhanced through a refined estimation of plausible pitch from weighted formant differences. The improved reconstruction framework, with proposed formant-derived artificial pitch modulation, is validated through subjective and objective comparison tests alongside state-of-the-art alternatives

Crossref

Kent Academic Repository

Wavelet-based voice morphing

Author: Moroz I. M.
Orphanidou C.
Roberts S. J.
Publication venue
Publication date: 01/01/2004
Field of study

This paper presents a new multi-scale voice morphing algorithm. This algorithm enables a user to transform one person's speech pattern into another person's pattern with distinct characteristics, giving it a new identity, while preserving the original content. The voice morphing algorithm performs the morphing at different subbands by using the theory of wavelets and models the spectral conversion using the theory of Radial Basis Function Neural Networks. The results obtained on the TIMIT speech database demonstrate effective transformation of the speaker identity

Oxford University Research Archive

Kalman tracking of linear predictor and harmonic noise models for noisy speech enhancement

Author: Ben Milner
Boll
Chen
Deller
Ephraim
Ephraim
Ephraim
Ephraim
Esfandiar Zavarehei
Friedman
Griffin
Hansen
Ioannis Andrianakis
Jonathan Darch
Kalman
Lim
Lim
Paul White
Qin Yan
Rentzos
Saeed Vaseghi
Sameti
Secrest
Seltzer
Stylianou
Stylianou
Tucker
Turunen
Vaseghi
Weber
Yan
Publication venue: 'Elsevier BV'
Publication date: 01/01/2008
Field of study

This paper presents a speech enhancement method based on the tracking and denoising of the formants of a linear prediction (LP) model of the spectral envelope of speech and the parameters of a harmonic noise model (HNM) of its excitation. The main advantages of tracking and denoising the prominent energy contours of speech are the efficient use of the spectral and temporal structures of successive speech frames and a mitigation of processing artefact known as the ‘musical noise’ or ‘musical tones’.The formant-tracking linear prediction (FTLP) model estimation consists of three stages: (a) speech pre-cleaning based on a spectral amplitude estimation, (b) formant-tracking across successive speech frames using the Viterbi method, and (c) Kalman filtering of the formant trajectories across successive speech frames.The HNM parameters for the excitation signal comprise; voiced/unvoiced decision, the fundamental frequency, the harmonics’ amplitudes and the variance of the noise component of excitation. A frequency-domain pitch extraction method is proposed that searches for the peak signal to noise ratios (SNRs) at the harmonics. For each speech frame several pitch candidates are calculated. An estimate of the pitch trajectory across successive frames is obtained using a Viterbi decoder. The trajectories of the noisy excitation harmonics across successive speech frames are modeled and denoised using Kalman filters.The proposed method is used to deconstruct noisy speech, de-noise its model parameters and then reconstitute speech from its cleaned parts. Experimental evaluations show the performance gains of the formant tracking, pitch extraction and noise reduction stages

Crossref

Southampton (e-Prints Soton)

University of East Anglia digital repository

Learning Latent Representations for Speech Generation and Transformation

Author: Glass James
Hsu Wei-Ning
Zhang Yu
Publication venue
Publication date: 22/09/2017
Field of study

An ability to model a generative process and learn a latent representation for speech in an unsupervised fashion will be crucial to process vast quantities of unlabelled speech data. Recently, deep probabilistic generative models such as Variational Autoencoders (VAEs) have achieved tremendous success in modeling natural images. In this paper, we apply a convolutional VAE to model the generative process of natural speech. We derive latent space arithmetic operations to disentangle learned latent representations. We demonstrate the capability of our model to modify the phonetic content or the speaker identity for speech segments using the derived operations, without the need for parallel supervisory data.Comment: Accepted to Interspeech 201

arXiv.org e-Print Archive

Crossref

EVALUATION OF INTELLIGIBILITY AND SPEAKER SIMILARITY OF VOICE TRANSFORMATION

Author: Raghunathan Anusha
Publication venue: UKnowledge
Publication date: 01/01/2011
Field of study

Voice transformation refers to a class of techniques that modify the voice characteristics either to conceal the identity or to mimic the voice characteristics of another speaker. Its applications include automatic dialogue replacement and voice generation for people with voice disorders. The diversity in applications makes evaluation of voice transformation a challenging task. The objective of this research is to propose a framework to evaluate intentional voice transformation techniques. Our proposed framework is based on two fundamental qualities: intelligibility and speaker similarity. Intelligibility refers to the clarity of the speech content after voice transformation and speaker similarity measures how well the modified output disguises the source speaker. We measure intelligibility with word error rates and speaker similarity with likelihood of identifying the correct speaker. The novelty of our approach is, we consider whether similarly transformed training data are available to the recognizer. We have demonstrated that this factor plays a significant role in intelligibility and speaker similarity for both human testers and automated recognizers. We thoroughly test two classes of voice transformation techniques: pitch distortion and voice conversion, using our proposed framework. We apply our results for patients with voice hypertension using video self-modeling and preliminary results are presented

University of Kentucky

Simulating Vocal Imitation in Infants, using a Growth Articulatory Model and Speech Robotics

Author: Bessiere P
Schwartz J-L
Serkhane J
Publication venue
Publication date: 01/01/2003
Field of study

In order to shed lights on the cognitive representations likely to underlie early vocal imitation, we tried to simulate Kuhl and Meltzoff's experiment (1996), using Bayesian robotics and a statistical model of the vocal tract that had been fitted to pre-babblers' actual vocalizations. It was shown that audition is compulsory to account for infants' early vocal imitation performance, inasmuch as the simulation of purely visual imitation failed to reproduce infants' score and pattern of imitation. Further, a small number of vocalizations (less than 100!) appeared to be enough for a learning process to provide scores at least as high as those of pre-babblers. Thus, early vocal imitation lies in the reach of a baby robot, with only a few assumptions about learning and imitation

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

CogPrints Cognitive Sciences Eprint Archive