287 research outputs found
Audio-Visual Speaker Conversion using Prosody Features
The article presents a joint audio-video approach to speaker identity conversion, based on statistical methods originally introduced for voice conversion. Using experimental data from the 3D BIWI Audiovisual corpus of Affective Communication, mapping functions are built between each pair of speakers in order to convert speaker-specific features: the speech signal and 3D facial expressions. The results obtained by combining audio and visual features are compared to corresponding results from earlier approaches, while outlining the improvements brought by introducing dynamic features and exploiting prosodic features.
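The speaker-to-speaker mapping functions mentioned above are typically implemented as joint-density GMM regressions, as in classical voice conversion. The following is a minimal sketch under that assumption, with illustrative feature dimensions and without the dynamic or prosodic extensions discussed in the paper:

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def train_joint_gmm(src_feats, tgt_feats, n_components=16):
    """Fit a GMM on time-aligned, concatenated [source; target] frames."""
    joint = np.concatenate([src_feats, tgt_feats], axis=1)  # shape (T, Dx + Dy)
    gmm = GaussianMixture(n_components=n_components, covariance_type="full")
    gmm.fit(joint)
    return gmm

def convert_frame(gmm, x, dx):
    """Map one source frame x (dimension dx) to the expected target frame."""
    # Posterior p(m | x) under the marginal GMM over the source dimensions
    post = np.array([
        gmm.weights_[m] * multivariate_normal.pdf(
            x, mean=gmm.means_[m, :dx], cov=gmm.covariances_[m][:dx, :dx])
        for m in range(gmm.n_components)
    ])
    post /= post.sum()
    y = np.zeros(gmm.means_.shape[1] - dx)
    for m in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[m, :dx], gmm.means_[m, dx:]
        S = gmm.covariances_[m]
        Sxx, Syx = S[:dx, :dx], S[dx:, :dx]
        # Conditional expectation E[y | x, m] for component m
        y += post[m] * (mu_y + Syx @ np.linalg.solve(Sxx, x - mu_x))
    return y
```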
Speaker adaptation of an acoustic-to-articulatory inversion model using cascaded Gaussian mixture regressions
The article presents a method for adapting a GMM-based acoustic-articulatory inversion model trained on a reference speaker to another speaker. The goal is to estimate the articulatory trajectories in the geometrical space of a reference speaker from the speech audio signal of another speaker. This method is developed in the context of a system of visual biofeedback, aimed at pronunciation training. This system provides a speaker with visual information about his/her own articulation, via a 3D orofacial clone. In previous work, we proposed to use GMM-based voice conversion for speaker adaptation. Acoustic-articulatory mapping was achieved in two consecutive steps: 1) converting the spectral trajectories of the target speaker (i.e. the system user) into spectral trajectories of the reference speaker (voice conversion), and 2) estimating the most likely articulatory trajectories of the reference speaker from the converted spectral features (acoustic-articulatory inversion). In this work, we propose to combine these two steps within the same statistical mapping framework, by fusing multiple regressions based on a trajectory GMM and a maximum likelihood criterion (MLE). The proposed technique is compared to two standard speaker adaptation techniques based respectively on MAP and MLLR.
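The two-step baseline described above can be pictured as a simple cascade of frame-wise regressions; the fused trajectory-GMM/MLE variant proposed in the paper is not reproduced here. `voice_convert` and `invert` are placeholder regression functions (for instance, two GMM mappings as in the previous sketch):

```python
import numpy as np

def cascaded_inversion(voice_convert, invert, user_spectral_frames):
    """Two-step baseline: convert the user's spectral frames towards the
    reference speaker, then invert them to articulatory trajectories in the
    reference speaker's geometrical space."""
    converted = np.array([voice_convert(x) for x in user_spectral_frames])
    return np.array([invert(y) for y in converted])
```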
What the Future Brings: Investigating the Impact of Lookahead for Incremental Neural TTS
In incremental text to speech synthesis (iTTS), the synthesizer produces an
audio output before it has access to the entire input sentence. In this paper,
we study the behavior of a neural sequence-to-sequence TTS system when used in
an incremental mode, i.e. when generating speech output for token n, the system
has access to n + k tokens from the text sequence. We first analyze the impact
of this incremental policy on the evolution of the encoder representations of
token n for different values of k (the lookahead parameter). The results show
that, on average, tokens travel 88% of the way to their full context
representation with a one-word lookahead and 94% after 2 words. We then
investigate which text features are the most influential on the evolution
towards the final representation using a random forest analysis. The results
show that the most salient factors are related to token length. We finally
evaluate the effects of lookahead k at the decoder level, using a MUSHRA
listening test. This test shows results that contrast with the above high
figures: the speech synthesis quality obtained with a 2-word lookahead is
significantly lower than that obtained with the full sentence.
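A toy illustration of the incremental policy studied here: the output for token n is produced from the first n + k tokens only. `synthesize_prefix` stands in for any sequence-to-sequence TTS model; its name and interface are illustrative, not from the paper.

```python
def incremental_tts(tokens, k, synthesize_prefix):
    """Yield an audio chunk for each token n, computed from tokens[:n + k + 1],
    i.e. with a lookahead of k tokens beyond the current one."""
    for n in range(len(tokens)):
        visible = tokens[:min(n + k + 1, len(tokens))]
        yield synthesize_prefix(visible, target_index=n)
```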
Notes on the use of variational autoencoders for speech and audio spectrogram modeling
Variational autoencoders (VAEs) are powerful (deep) generative artificial neural networks. They have been recently used in several papers for speech and audio processing, in particular for the modeling of speech/audio spectrograms. In these papers, very poor theoretical support is given to justify the chosen data representation and decoder likelihood function, or the corresponding cost function used for training the VAE. Yet, a nice theoretical statistical framework exists and has been extensively presented and discussed in papers dealing with nonnegative matrix factorization (NMF) of audio spectrograms and its application to audio source separation. In the present paper, we show how this statistical framework applies to VAE-based speech/audio spectrogram modeling. This provides the latter with insights into the choice and interpretability of the data representation and model parameterization.
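One concrete consequence of that framework, as developed in the VAE speech-modelling literature: modelling each spectrogram frame with a zero-mean complex Gaussian whose variance is the decoder output yields an Itakura-Saito reconstruction term, the same divergence underlying IS-NMF. A hedged PyTorch sketch, with `encode` and `decoder_variance` as placeholder networks:

```python
import torch

def is_divergence(power_spec, var):
    """Itakura-Saito divergence between an observed power spectrogram and the
    model variance (both strictly positive tensors of the same shape)."""
    ratio = power_spec / var
    return (ratio - torch.log(ratio) - 1.0).sum()

def vae_loss(power_spec, encode, decoder_variance):
    """Negative ELBO with an IS reconstruction term and a standard Gaussian prior."""
    mu, logvar = encode(power_spec)                           # q(z | s)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation trick
    var = decoder_variance(z)                                 # per-bin variance of p(s | z)
    kl = -0.5 * (1.0 + logvar - mu.pow(2) - logvar.exp()).sum()
    return is_divergence(power_spec, var) + kl
```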
Adaptive Latency for Part-of-Speech Tagging in Incremental Text-to-Speech Synthesis
Incremental text-to-speech systems aim at synthesizing a text 'on-the-fly', while the user is typing a sentence. In this context, this article addresses the problem of part-of-speech tagging (POS, i.e. lexical category), which is a critical step for accurate grapheme-to-phoneme conversion and prosody estimation. Here, the main challenge is to estimate the POS of a given word without knowing its 'right context' (i.e. the following words, which are not available yet). To address this issue, we propose a method based on a set of decision trees estimating online whether a given POS tag is likely to be modified when more right-contextual information becomes available. In such a case, the synthesis is delayed until POS stability is guaranteed. This results in delivering the synthetic voice in word chunks of variable length. Objective evaluation on French shows that the proposed method is able to estimate POS tags with more than 92% accuracy (compared to a non-incremental system) while minimizing the synthesis latency (between 1 and 4 words). Perceptual evaluation (ranking test) is then carried out in the context of HMM-based speech synthesis. Experimental results show that the word grouping resulting from the proposed method is rated more acceptable than word-by-word incremental synthesis.
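A simplified sketch of the adaptive-latency idea: synthesis of a word is delayed until a stability predictor judges that its POS tag is unlikely to change when more right context arrives. `pos_tag_prefix` and `is_stable` (for example, a per-tag decision tree) are placeholders, not the paper's exact models.

```python
def adaptive_chunks(words, pos_tag_prefix, is_stable, max_delay=4):
    """Group an incoming word stream into variable-length chunks that are only
    released once all their POS tags are predicted to be stable."""
    pending = []
    for i, word in enumerate(words):
        pending.append(word)
        tags = pos_tag_prefix(words[:i + 1])          # tags using left context only
        if all(is_stable(t) for t in tags[-len(pending):]) or len(pending) >= max_delay:
            yield pending                             # flush a chunk for synthesis
            pending = []
    if pending:
        yield pending                                 # flush the sentence-final chunk
```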
Is it really always only the others who are to blame? GP’s view on medical overuse. A questionnaire study
Background
Medical overuse is a common problem in health care. Preventing unnecessary medicine is one of the main tasks of General Practice, so-called quaternary prevention. We aimed to capture the current opinion of German General Practitioners (GPs) on medical overuse.
Methods
A quantitative online study was conducted. The questionnaire was developed based on a qualitative study and a literature search. GPs were asked to estimate the prevalence of medical overuse and to evaluate its drivers and possible solutions. GPs in Bavaria were recruited via email (750 addresses). A descriptive data analysis was performed. Additionally, the association between doctors' attitudes and (1) demographic variables and (2) interest in campaigns against medical overuse was assessed.
Results
The response rate was 18%. The mean age was 54 years, 79% were male, and 68% had worked as GPs for more than 15 years. Around 38% of medical services were considered medical overuse, and nearly half of the GPs (47%) judged medical overuse to be a more important problem than medical underuse. The main drivers were seen in “patients' expectations” (76%), “lack of a primary care system” (61%) and “defensive medicine” (53%), whereas “disregard of evidence/guidelines” (15%) and “economic pressure on the side of the doctor” (13%) were not weighted as important causes. Demographic variables did not have an important impact on GPs' response patterns. GPs interested in campaigns like “Choosing Wisely” showed a higher awareness of medical overuse, although these campaigns were known to only 50% of the respondents.
Discussion
Medical overuse is an important issue for GPs. However, the main drivers were located outside their own sphere of responsibility. Campaigns such as “Choosing Wisely” seem to have a positive effect on GPs' attitudes, but awareness of them is still limited.
Fill in the Gap! Combining Self-supervised Representation Learning with Neural Audio Synthesis for Speech Inpainting
Most speech self-supervised learning (SSL) models are trained with a pretext
task which consists in predicting missing parts of the input signal, either
future segments (causal prediction) or segments masked anywhere within the
input (non-causal prediction). Learned speech representations can then be
efficiently transferred to downstream tasks (e.g., automatic speech or speaker
recognition). In the present study, we investigate the use of a speech SSL
model for speech inpainting, that is, reconstructing a missing portion of a
speech signal from its surrounding context, i.e., fulfilling a downstream task
that is very similar to the pretext task. To that purpose, we combine an SSL
encoder, namely HuBERT, with a neural vocoder, namely HiFiGAN, playing the role
of a decoder. In particular, we propose two solutions to match the HuBERT
output with the HiFiGAN input, by freezing one and fine-tuning the other, and
vice versa. Performance of both approaches was assessed in single- and
multi-speaker settings, for both informed and blind inpainting configurations
(i.e., the position of the mask is known or unknown, respectively), with
different objective metrics and a perceptual evaluation. Results show that
while both solutions can correctly reconstruct signal portions of up to 200 ms
(and even 400 ms in some cases), fine-tuning the SSL encoder provides more
accurate signal reconstruction in the single-speaker setting, while freezing
it (and training the neural vocoder instead) is a better strategy when dealing
with multi-speaker data.
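A rough sketch of the informed-inpainting pipeline described above: mask a portion of the waveform, extract SSL features over the whole masked signal, and resynthesize with a neural vocoder. `ssl_encoder` (e.g. HuBERT) and `vocoder` (e.g. HiFiGAN) are placeholder modules; either one may be fine-tuned while the other stays frozen, as in the two configurations compared in the paper. The vocoder output is assumed to be time-aligned with the input.

```python
import torch

def inpaint(waveform, mask_start, mask_len, ssl_encoder, vocoder):
    """waveform: (1, T) tensor; mask position is known (informed setting)."""
    masked = waveform.clone()
    masked[:, mask_start:mask_start + mask_len] = 0.0      # hide the missing portion
    with torch.no_grad():
        feats = ssl_encoder(masked)                        # frame-level SSL features
        resynth = vocoder(feats)                           # resynthesized waveform
    # Keep the original signal outside the gap, use the reconstruction inside it
    out = waveform.clone()
    out[:, mask_start:mask_start + mask_len] = resynth[:, mask_start:mask_start + mask_len]
    return out
```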
Repeat after me: Self-supervised learning of acoustic-to-articulatory mapping by vocal imitation
We propose a computational model of speech production combining a pre-trained
neural articulatory synthesizer able to reproduce complex speech stimuli from a
limited set of interpretable articulatory parameters, a DNN-based internal
forward model predicting the sensory consequences of articulatory commands, and
an internal inverse model based on a recurrent neural network recovering
articulatory commands from the acoustic speech input. Both forward and inverse
models are jointly trained in a self-supervised way from raw acoustic-only
speech data from different speakers. The imitation simulations are evaluated
objectively and subjectively and show quite encouraging performance.
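A loose sketch of one imitation training step under the architecture described above: the inverse model maps acoustics to articulatory commands, the frozen pre-trained synthesizer produces the corresponding speech, and the forward model serves as a differentiable proxy for it. Module names and the MSE losses are illustrative assumptions, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def imitation_step(acoustic_feats, inverse_model, synthesizer, forward_model, optimizer):
    """One self-supervised update on raw acoustic features (illustrative)."""
    commands = inverse_model(acoustic_feats)          # articulatory commands from audio
    with torch.no_grad():
        produced = synthesizer(commands)              # frozen articulatory synthesizer
    predicted = forward_model(commands)               # internal forward-model prediction
    # Forward model mimics the synthesizer; inverse model imitates the input speech
    loss = F.mse_loss(predicted, produced) + F.mse_loss(predicted, acoustic_feats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```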
- …