Search CORE

61,653 research outputs found

Combining vocal tract length normalization with hierarchial linear transformations

Author: Dines J.
Garner P.N.
Saheer L.
Yamagishi J.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2012
Field of study

Recent research has demonstrated the effectiveness of vocal tract length normalization (VTLN) as a rapid adaptation technique for statistical parametric speech synthesis. VTLN produces speech with naturalness preferable to that of MLLR-based adaptation techniques, being much closer in quality to that generated by the original av-erage voice model. However with only a single parameter, VTLN captures very few speaker specific characteristics when compared to linear transform based adaptation techniques. This paper pro-poses that the merits of VTLN can be combined with those of linear transform based adaptation in a hierarchial Bayesian frame-work, where VTLN is used as the prior information. A novel tech-nique for propagating the gender information from the VTLN prior through constrained structural maximum a posteriori linear regres-sion (CSMAPLR) adaptation is presented. Experiments show that the resulting transformation has improved speech quality with better naturalness, intelligibility and improved speaker similarity. Index Terms — Statistical parametric speech synthesis, hidden Markov models, speaker adaptation, vocal tract length normaliza-tion, constrained structural maximum a posteriori linear regression 1

Infoscience - École polytechnique fédérale de Lausanne

CiteSeerX

Crossref

Edinburgh Research Explorer

Combining Vocal Tract Length Normalization with Linear Transformations in a Bayesian Framework

Author: Dines John
Garner Philip N.
Saheer Lakshmi
Yamagishi Junichi
Publication venue: Idiap
Publication date: 19/12/2013
Field of study

Recent research has demonstrated the effectiveness of vocal tract length normalization (VTLN) as a rapid adaptation technique for statistical parametric speech synthesis. VTLN produces speech with naturalness preferable to that of MLLR- based adaptation techniques, being much closer in quality to that generated by the original average voice model. By contrast, with just a single parameter, VTLN captures very few speaker specific characteristics when compared to the available linear transform based adaptation techniques. This paper proposes that the merits of VTLN can be combined with those of linear transform based adaptation technique in a Bayesian framework, where VTLN is used as the prior information. A novel technique of propa- gating the gender information from the VTLN prior through constrained structural maximum a posteriori linear regression (CSMAPLR) adaptation is presented. Experiments show that the resulting transformation has improved speech quality with better naturalness, intelligibility and improved speaker similarity

Infoscience - École polytechnique fédérale de Lausanne

Bayesian Speaker Adaptation Based on a New Hierarchical Probabilistic Model

Author: Johnson Michael T
Li Bi-Cheng
Qu Dan
Zhang Wei-Qiang
Zhang Wen-Lin
Publication venue: e-Publications@Marquette
Publication date: 01/09/2012
Field of study

In this paper, a new hierarchical Bayesian speaker adaptation method called HMAP is proposed that combines the advantages of three conventional algorithms, maximum a posteriori (MAP), maximum-likelihood linear regression (MLLR), and eigenvoice, resulting in excellent performance across a wide range of adaptation conditions. The new method efficiently utilizes intra-speaker and inter-speaker correlation information through modeling phone and speaker subspaces in a consistent hierarchical Bayesian way. The phone variations for a specific speaker are assumed to be located in a low-dimensional subspace. The phone coordinate, which is shared among different speakers, implicitly contains the intra-speaker correlation information. For a specific speaker, the phone variation, represented by speaker-dependent eigenphones, are concatenated into a supervector. The eigenphone supervector space is also a low dimensional speaker subspace, which contains inter-speaker correlation information. Using principal component analysis (PCA), a new hierarchical probabilistic model for the generation of the speech observations is obtained. Speaker adaptation based on the new hierarchical model is derived using the maximum a posteriori criterion in a top-down manner. Both batch adaptation and online adaptation schemes are proposed. With tuned parameters, the new method can handle varying amounts of adaptation data automatically and efficiently. Experimental results on a Mandarin Chinese continuous speech recognition task show good performance under all testing conditions

epublications@Marquette

Time and information in perceptual adaptation to speech

Author: Choi Ja Young
Perrachione Tyler
Publication venue
Publication date: 27/11/2018
Field of study

Presubmission manuscript and supplementary files (stimuli, stimulus presentation code, data, data analysis code).Perceptual adaptation to a talker enables listeners to efficiently resolve the many-to-many mapping between variable speech acoustics and abstract linguistic representations. However, models of speech perception have not delved into the variety or the quantity of information necessary for successful adaptation, nor how adaptation unfolds over time. In three experiments using speeded classification of spoken words, we explored how the quantity (duration), quality (phonetic detail), and temporal continuity of talker-specific context contribute to facilitating perceptual adaptation to speech. In single- and mixed-talker conditions, listeners identified phonetically-confusable target words in isolation or preceded by carrier phrases of varying lengths and phonetic content, spoken by the same talker as the target word. Word identification was always slower in mixed-talker conditions than single-talker ones. However, interference from talker variability decreased as the duration of preceding speech increased but was not affected by the amount of preceding talker-specific phonetic information. Furthermore, efficiency gains from adaptation depended on temporal continuity between preceding speech and the target word. These results suggest that perceptual adaptation to speech may be understood via models of auditory streaming, where perceptual continuity of an auditory object (e.g., a talker) facilitates allocation of attentional resources, resulting in more efficient perceptual processing.NIH NIDCD (R03DC014045

Boston University Institutional Repository (OpenBU)

Parallel Reference Speaker Weighting for Kinematic-Independent Acoustic-to-Articulatory Inversion

Author: Berry Jeffrey J.
Ji An
Johnson Michael T.
Publication venue: e-Publications@Marquette
Publication date: 01/10/2016
Field of study

Acoustic-to-articulatory inversion, the estimation of articulatory kinematics from an acoustic waveform, is a challenging but important problem. Accurate estimation of articulatory movements has the potential for significant impact on our understanding of speech production, on our capacity to assess and treat pathologies in a clinical setting, and on speech technologies such as computer aided pronunciation assessment and audio-video synthesis. However, because of the complex and speaker-specific relationship between articulation and acoustics, existing approaches for inversion do not generalize well across speakers. As acquiring speaker-specific kinematic data for training is not feasible in many practical applications, this remains an important and open problem. This paper proposes a novel approach to acoustic-to-articulatory inversion, Parallel Reference Speaker Weighting (PRSW), which requires no kinematic data for the target speaker and a small amount of acoustic adaptation data. PRSW hypothesizes that acoustic and kinematic similarities are correlated and uses speaker-adapted articulatory models derived from acoustically derived weights. The system was assessed using a 20-speaker data set of synchronous acoustic and Electromagnetic Articulography (EMA) kinematic data. Results demonstrate that by restricting the reference group to a subset consisting of speakers with strong individual speaker-dependent inversion performance, the PRSW method is able to attain kinematic-independent acoustic-to-articulatory inversion performance nearly matching that of the speaker-dependent model, with an average correlation of 0.62 versus 0.63. This indicates that given a sufficiently complete and appropriately selected reference speaker set for adaptation, it is possible to create effective articulatory models without kinematic training data

epublications@Marquette

Recommended from our members

Unsupervised intralingual and cross-lingual speaker adaptation for HMM-based speech synthesis using two-pass decision tree construction

Author: Byrne William
Gibson Matthew
Publication venue: IEEE Transactions on Audio, Speech, and Language Processing
Publication date: 01/01/2010
Field of study

Hidden Markov model (HMM)-based speech synthesis systems possess several advantages over concatenative synthesis systems. One such advantage is the relative ease with which HMM-based systems are adapted to speakers not present in the training dataset. Speaker adaptation methods used in the field of HMM-based automatic speech recognition (ASR) are adopted for this task. In the case of unsupervised speaker adaptation, previous work has used a supplementary set of acoustic models to estimate the transcription of the adaptation data. This paper firstly presents an approach to the unsupervised speaker adaptation task for HMM-based speech synthesis models which avoids the need for such supplementary acoustic models. This is achieved by defining a mapping between HMM-based synthesis models and ASR-style models, via a two-pass decision tree construction process. Secondly, it is shown that this mapping also enables unsupervised adaptation of HMM-based speech synthesis models without the need to perform linguistic analysis of the estimated transcription of the adaptation data. Thirdly, this paper demonstrates how this technique lends itself to the task of unsupervised cross-lingual adaptation of HMM-based speech synthesis models, and explains the advantages of such an approach. Finally, listener evaluations reveal that the proposed unsupervised adaptation methods deliver performance approaching that of supervised adaptation

Apollo (Cambridge)

How do you say ‘hello’? Personality impressions from brief novel voices

Author: A Todorov
A Todorov
AC Little
AC Little
AC Little
Alexander Todorov
AW Young
B Yao
B Yao
BC Jones
C Ferdenzi
C Nass
CA Klofstad
CA Sutherland
CC Tigue
CD Aronovitch
Charles R. Larson
CL Apicella
CL Apicella
CP Said
CT Ferrand
CY Olivola
D Rendall
DA Kenny
DA Leopold
DA Puts
DC Funder
DR Feinberg
DR Feinberg
DS Berry
DS Berry
DS Berry
DS Berry
E Vannoni
FT Passini
G Rhodes
G Rhodes
GW Allport
IR Titze
IS Penton-Voak
J Kreiman
J Willis
JH Langlois
JH Langlois
JJ Horton
JJ Ohala
JM Montepare
JS Wiggins
JW Lewis
K Grammer
K Miyake
KR Scherer
L Bruckert
L Bruckert
L Germine
LA Zebrowitz
LA Zebrowitz
LA Zebrowitz
LA Zebrowitz
LA Zebrowitz-McArthur
LZ McArthur
M Latinus
M Latinus
M Latinus
M Shevlin
M Zuckerman
M Zuckerman
M Zuckerman
NJ Lass
NJ Lass
NN Oosterhof
O Baumann
P Belin
P Belin
P Boersma
P Borkenau
Pascal Belin
PE Bestelmeyer
PEG Bestelmeyer
Phil McAleer
R Hassin
RA Page
RM Krauss
RR Mccrae
RS Kramer
S Evans
S Patel
S Rosenberg
SA Collins
SC Verosky
SJ Ko
SM Hughes
SM Hughes
SM Hughes
ST Fiske
TK Perrachione
V Bruce
V Pivonkova
VX Luevano
WA van Dommelen
WA van Dommelen
WT Fitch
WT Fitch
WT Norman
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2014
Field of study

On hearing a novel voice, listeners readily form personality impressions of that speaker. Accurate or not, these impressions are known to affect subsequent interactions; yet the underlying psychological and acoustical bases remain poorly understood. Furthermore, hitherto studies have focussed on extended speech as opposed to analysing the instantaneous impressions we obtain from first experience. In this paper, through a mass online rating experiment, 320 participants rated 64 sub-second vocal utterances of the word ‘hello’ on one of 10 personality traits. We show that: (1) personality judgements of brief utterances from unfamiliar speakers are consistent across listeners; (2) a two-dimensional ‘social voice space’ with axes mapping Valence (Trust, Likeability) and Dominance, each driven by differing combinations of vocal acoustics, adequately summarises ratings in both male and female voices; and (3) a positive combination of Valence and Dominance results in increased perceived male vocal Attractiveness, whereas perceived female vocal Attractiveness is largely controlled by increasing Valence. Results are discussed in relation to the rapid evaluation of personality and, in turn, the intent of others, as being driven by survival mechanisms via approach or avoidance behaviours. These findings provide empirical bases for predicting personality impressions from acoustical analyses of short utterances and for generating desired personality impressions in artificial voices

Crossref

HAL AMU

Directory of Open Access Journals

PubMed Central

Enlighten

FigShare