Relying on critical articulators to estimate vocal tract spectra in an articulatory-acoustic database
We present a new phone-dependent feature weighting scheme that can be used to map articulatory configurations (e.g., EMA) onto vocal tract spectra (e.g., MFCC) through table lookup. The approach consists of assigning feature weights according to a feature's ability to predict the acoustic distance between frames. Since an articulator's predictive accuracy is phone-dependent (e.g., lip location is a better predictor for bilabial sounds than for palatal sounds), a unique weight vector is found for each phone. Inspection of the weights reveals a correspondence with the expected critical articulators for many phones. The proposed method reduces overall cepstral error by 6% when compared to a uniform weighting scheme. Vowels show the greatest benefit, though improvements occur for 80% of the tested phones.
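The weighted table lookup described above can be sketched as a nearest-neighbour search under a feature-weighted distance. The following is a minimal illustration, not the authors' implementation; all array shapes, the data, and the weight values are hypothetical.

```python
import numpy as np

def weighted_lookup(query, table_art, table_mfcc, phone_weights):
    # Weighted nearest-neighbour table lookup: phone_weights scales each
    # articulatory feature's contribution to the frame distance, so the
    # phone's critical articulators dominate the match.
    d2 = ((table_art - query) ** 2 * phone_weights).sum(axis=1)
    return table_mfcc[np.argmin(d2)]

# Illustrative shapes only: a 1000-frame table of 12-D EMA features paired
# with 13-D MFCC frames, and a weight vector that (hypothetically) favours
# the lip channels for a bilabial phone.
rng = np.random.default_rng(0)
table_art, table_mfcc = rng.normal(size=(1000, 12)), rng.normal(size=(1000, 13))
w = np.ones(12); w[:2] = 5.0
spectrum = weighted_lookup(rng.normal(size=12), table_art, table_mfcc, w)
```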
Articulatory Controllable Speech Modification Based on Statistical Inversion and Production Mappings
In this paper, we present an innovative way of utilizing the natural relationship between speech sounds and articulatory movements by developing an articulatory controllable speech modification system. Specifically, we employ statistical acoustic-to-articulatory inversion mapping and articulatory-to-acoustic production mapping based on a Gaussian mixture model (GMM), allowing flexible modification of the model parameters and making the system independent of text input features. An input speech signal is modified through manipulation of the unobserved articulatory movements, which are estimated and re-synthesized via a sequence of inversion and production mappings. To ensure the naturalness of articulatory movement trajectories, we introduce a method for manipulating articulatory parameters that considers their intercorrelation. Moreover, to generate high-quality modified speech sounds, we avoid vocoder-based excitation generation by presenting several implementations of direct waveform modification capable of directly filtering an input speech signal using the differences in spectral parameters. The experimental results demonstrate that: 1) higher accuracy in the estimation of spectral parameters is achieved by sequential inversion and production mappings than by conventional production mapping using measured articulatory parameters; 2) the method for manipulating articulatory parameters by considering their intercorrelation generates more natural trajectories of modified articulatory movements; 3) the implementations of the direct waveform modification method significantly improve the quality of modified speech sounds, even under varying speaking conditions; and 4) the controllability of the system is ensured by its capability of producing modified vowel sounds through the manipulation of appropriate articulatory configurations.
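A common realisation of GMM-based mapping, usable for both the inversion and the production direction, is the minimum mean-square-error estimate E[y|x] under a joint GMM over stacked input/output vectors. The sketch below assumes such a joint model is already trained; it is illustrative only, and the paper's formulation differs in detail (e.g., trajectory modeling with dynamic features).

```python
import numpy as np

def gmm_mmse_map(x, weights, means, covs):
    # E[y | x] under a joint GMM over z = [x; y]: each component contributes
    # its conditional-Gaussian prediction, weighted by its responsibility
    # for x. Chaining two such mappings (acoustic -> articulatory, then
    # articulatory -> acoustic) gives a sequential inversion/production pipeline.
    dx = x.shape[0]
    posts, preds = [], []
    for w, mu, S in zip(weights, means, covs):
        mx, my = mu[:dx], mu[dx:]
        Sxx, Syx = S[:dx, :dx], S[dx:, :dx]
        diff = x - mx
        # Responsibility of this component for x (shared constants cancel
        # when the responsibilities are normalised below).
        p = w * np.exp(-0.5 * diff @ np.linalg.solve(Sxx, diff)) \
            / np.sqrt(np.linalg.det(Sxx))
        posts.append(p)
        preds.append(my + Syx @ np.linalg.solve(Sxx, diff))
    posts = np.asarray(posts) / sum(posts)
    return sum(p * y for p, y in zip(posts, preds))
```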
The weight of phonetic substance in the structure of sound inventories
In the research field initiated by Liljencrants & Lindblom in 1972, we illustrate the possibility of giving substance to phonology, predicting the structure of phonological systems with non-phonological principles, be they listener-oriented (perceptual contrast and stability) or speaker-oriented (articulatory contrast and economy). For vowel systems, we proposed the Dispersion-Focalisation Theory (DFT; Schwartz et al., 1997b). With the DFT, we can predict vowel systems using two competing perceptual constraints weighted by two parameters, λ and α respectively. The first aims at increasing auditory distances between vowel spectra (dispersion); the second aims at increasing the perceptual salience of each spectrum through formant proximities (focalisation). We also introduced new variants based on research in physics, namely the phase space (λ, α) and the polymorphism of a given phase, or superstructures in phonological organisations (Vallée et al., 1999), which allow us to generate 85.6% of the 342 UPSID systems with 3 to 7 vowel qualities. No similar theory seems to exist yet for consonants. We therefore present a detailed typology of consonants, and then suggest ways to explain the predominance of plosives over fricatives and of voiceless over voiced consonants by i) comparing them with language acquisition data at the babbling stage and examining the capacity to acquire rather different linguistic systems in relation to the main degrees of freedom of the articulators; and ii) showing that the places “preferred” for each manner are at least partly conditioned by the morphological constraints that facilitate or complicate, make possible or impossible, the needed articulatory gestures, e.g. the complexity of the articulatory control for voicing and the aerodynamics of fricatives. A rather strict coordination between the glottis and the oral constriction is needed to produce acceptable voiced fricatives (Mawass et al., 2000): we determine that the region where the combinations of Ag (glottal area) and Ac (constriction area) values result in a balance between the voice and noise components is indeed very narrow. We thus demonstrate that some of the main tendencies in the phonological vowel and consonant structures of the world's languages can be explained, at least in part, by sensorimotor constraints, and argue that phonology can indeed take part in a theory of Perception-for-Action-Control.
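The structure of the DFT cost, two competing terms traded off by λ and α, can be caricatured in a few lines. This is only a rough sketch of the idea; the published formulation differs in detail (e.g., the Bark-scaled effective second formant F2' and the exact focalisation terms), and the parameter values here are illustrative.

```python
import numpy as np
from itertools import combinations

def dft_energy(vowels, lam=0.3, alpha=0.3):
    # vowels: list of (F1, F2') pairs in Bark; lower energy = better system.
    # Dispersion term: penalise pairs of vowels that are perceptually close,
    # with lam weighting the F2' dimension of the distance.
    dispersion = sum(
        1.0 / ((a[0] - b[0]) ** 2 + lam * (a[1] - b[1]) ** 2)
        for a, b in combinations(vowels, 2))
    # Focalisation term: reward vowels whose formants are close together
    # (more perceptually salient spectra), weighted against dispersion by alpha.
    focalisation = -sum(1.0 / (f2 - f1) ** 2 for f1, f2 in vowels)
    return dispersion + alpha * focalisation

# A schematic /i a u/ triangle (made-up Bark values) scores lower (better)
# than a crowded system would.
print(dft_energy([(2.5, 13.0), (7.0, 9.0), (3.0, 6.0)]))
```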
Statistics in Phonetics
Phonetics is the scientific field concerned with the study of how speech is produced, heard, and perceived. It abounds with data, such as acoustic speech recordings, neuroimaging data, or articulatory data. In this paper, we provide an introduction to different areas of phonetics (acoustic phonetics, sociophonetics, speech perception, articulatory phonetics, speech inversion, sound change, and speech technology), an overview of the statistical methods for analyzing their data, and an introduction to the signal processing methods commonly applied to speech recordings. A major transition in the statistical modeling of phonetic data has been the shift from fixed-effects to random-effects regression models, the modeling of curve data (for instance via GAMMs or FDA methods), and the use of Bayesian methods. This shift has been driven in part by the increased focus on large speech corpora in phonetics, which in turn has been driven by machine learning methods such as forced alignment. We conclude by identifying opportunities for future research.
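To make the fixed-to-random-effects shift mentioned above concrete, here is a minimal mixed-effects regression in Python using statsmodels; the data file and column names are hypothetical.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical dataset: one row per vowel token, with repeated measures per
# speaker (the file and column names are illustrative).
df = pd.read_csv("vowel_durations.csv")  # columns: duration, condition, speaker

# Mixed-effects regression: a fixed effect of condition plus a random
# intercept per speaker, rather than pooling all tokens as fixed effects.
model = smf.mixedlm("duration ~ condition", df, groups=df["speaker"])
print(model.fit().summary())
```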
A speech production model including the nasal cavity: A novel approach to articulatory analysis of speech signals.
Speaker Independent Acoustic-to-Articulatory Inversion
Acoustic-to-articulatory inversion, the determination of articulatory parameters from acoustic signals, is a difficult but important problem for many speech processing applications, such as automatic speech recognition (ASR) and computer-aided pronunciation training (CAPT). In recent years, several approaches have been successfully implemented for speaker-dependent models with parallel acoustic and kinematic training data. However, in many practical applications, inversion is needed for new speakers for whom no articulatory data are available. To address this problem, this dissertation introduces a novel speaker adaptation approach called Parallel Reference Speaker Weighting (PRSW), based on parallel acoustic and articulatory Hidden Markov Models (HMMs). This approach uses a robust normalized articulatory space and palate-referenced articulatory features, combined with speaker-weighted adaptation, to form an inversion mapping for new speakers that can accurately estimate articulatory trajectories. The proposed PRSW method is evaluated on the newly collected Marquette electromagnetic articulography - Mandarin Accented English (EMA-MAE) corpus using 20 native English speakers. Cross-speaker inversion results show that, given a good selection of reference speakers with consistent acoustic and articulatory patterns, the PRSW approach gives good speaker-independent inversion performance even without kinematic training data.
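The core idea of reference speaker weighting, forming a new speaker's inversion mapping as a weighted combination of reference-speaker models, can be sketched as follows. The model interface and weighting are assumptions for illustration: the actual PRSW method estimates weights from the references' acoustic fit to the new speaker and combines HMM parameters rather than raw trajectories.

```python
import numpy as np

def prsw_inversion(acoustics, reference_models, ref_weights):
    # Each reference model maps the new speaker's acoustics to articulatory
    # trajectories; the estimate is their weighted combination.
    preds = np.stack([m(acoustics) for m in reference_models])  # (N, T, D)
    w = np.asarray(ref_weights, dtype=float)
    w /= w.sum()
    return np.tensordot(w, preds, axes=1)                       # (T, D)
```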
Speech Communication
Contains reports on four research projects. Sponsors: C.J. LeBel Fellowship; Kurzweil Applied Intelligence; National Institutes of Health (Grant 5 T32 NS07040); National Institutes of Health (Grant 5 RO1 NS04332); National Science Foundation (Grant BNS84-18733); Systems Development Foundation; U.S. Navy - Office of Naval Research (Contract N00014-82-K-0727).
Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab
Articulatory copy synthesis (ACS), a subarea of speech inversion, refers to the reproduction of natural utterances and involves both the physiological articulatory processes and their corresponding acoustic results. This thesis proposes two novel methods for the ACS of human speech using the articulatory speech synthesizer VocalTractLab (VTL) to address or mitigate the existing problems of speech inversion, such as non-unique mapping, acoustic variation among different speakers, and the time-consuming nature of the process.
The first method involved finding appropriate VTL gestural scores for given natural utterances using a genetic algorithm. It consisted of two steps: gestural score initialization and optimization. In the first step, gestural scores were initialized using the given acoustic signals with speech recognition, grapheme-to-phoneme (G2P), and a VTL rule-based method for converting phoneme sequences to gestural scores. In the second step, the initial gestural scores were optimized by a genetic algorithm via an analysis-by-synthesis (ABS) procedure that sought to minimize the cosine distance between the acoustic features of the synthetic and natural utterances. The articulatory parameters were also regularized during the optimization process to restrict them to reasonable values.
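The fitness function driving the analysis-by-synthesis loop can be illustrated as below. This is a sketch under stated assumptions: `synthesize` stands in for a VTL synthesis plus feature-extraction pipeline, the features are assumed time-aligned, and the regularization term is omitted.

```python
import numpy as np

def fitness(gestural_score, natural_feats, synthesize):
    # Synthesize acoustic features from a candidate gestural score and score
    # the candidate by the (negated) mean cosine distance to the natural
    # utterance's features; higher fitness = closer acoustic match.
    synth_feats = synthesize(gestural_score)                 # (T, D)
    num = (synth_feats * natural_feats).sum(axis=1)
    den = (np.linalg.norm(synth_feats, axis=1) *
           np.linalg.norm(natural_feats, axis=1) + 1e-9)
    cos_dist = 1.0 - num / den
    return -cos_dist.mean()
```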
The second method was based on long short-term memory (LSTM) and convolutional neural networks, which were responsible for capturing the temporal dependence and the spatial structure of the acoustic features, respectively. Neural network regression models were trained that took acoustic features as inputs and produced articulatory trajectories as outputs. In addition, to cover as much of the articulatory and acoustic space as possible, the training samples were augmented by manipulating the phonation type, speaking effort, and vocal tract length of the synthetic utterances. Furthermore, two regularization methods were proposed: one based on a smoothness loss over articulatory trajectories and another based on an acoustic loss between original and predicted acoustic features.
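A smoothness regularizer of the kind described is commonly a penalty on the second temporal difference of the predicted trajectories. Here is a minimal PyTorch sketch; the exact loss and its weighting in the thesis may differ.

```python
import torch

def smoothness_loss(traj):
    # traj: (batch, time, articulatory_channels). Penalise the squared second
    # temporal difference (acceleration), discouraging jittery trajectories.
    accel = traj[:, 2:, :] - 2.0 * traj[:, 1:-1, :] + traj[:, :-2, :]
    return (accel ** 2).mean()

# Added to the regression loss with an assumed weighting factor, e.g.:
# loss = mse_loss + 0.1 * smoothness_loss(predicted_trajectories)
```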
The best-performing genetic algorithm and convolutional LSTM systems (evaluated in terms of the difference between the estimated and reference VTL articulatory parameters) obtained average correlation coefficients of 0.985 and 0.983 for speaker-dependent utterances, respectively, and their reproduced speech achieved recognition accuracies of 86.25% and 64.69% for speaker-independent utterances of German words, respectively. When applied to German sentence utterances, as well as English and Mandarin Chinese word utterances, the neural-network-based ACS systems achieved recognition accuracies of 73.88%, 52.92%, and 52.41%, respectively. The results showed that both methods reproduced not only the articulatory processes but also the acoustic signals of the reference utterances. Moreover, the regularization methods led to more physiologically plausible articulatory processes and yielded estimated articulatory trajectories closer to those preferred by VTL, thus reproducing more natural and intelligible speech. This study also found that the convolutional layers, when used in conjunction with batch normalization layers, automatically learned more distinctive features from log power spectrograms. Furthermore, the neural-network-based ACS systems trained on German data could be generalized to utterances of other languages.
Models and analysis of vocal emissions for biomedical applications: 5th International Workshop: December 13-15, 2007, Firenze, Italy
The MAVEBA Workshop, held every two years, collects in its proceedings the scientific papers presented as oral and poster contributions during the conference. The main subjects are the development of theoretical and mechanical models as an aid to the study of the main phonatory dysfunctions, as well as biomedical engineering methods for the analysis of voice signals and images in support of clinical diagnosis and the classification of vocal pathologies. The Workshop has the sponsorship of: Ente Cassa Risparmio di Firenze, COST Action 2103, the Biomedical Signal Processing and Control journal (Elsevier), and the IEEE Biomedical Engineering Society. Special issues of international journals have been, and will be, published collecting selected papers from the conference.