The MirrorNet: Learning Audio Synthesizer Controls Inspired by Sensorimotor Interaction
Experiments to understand sensorimotor neural interactions in the human cortical speech system support the existence of a bidirectional flow of interactions between the auditory and motor regions. Their key function is to enable the brain to 'learn' how to control the vocal tract for speech production. This idea is the impetus for the recently proposed "MirrorNet", a constrained autoencoder architecture. In this paper, the MirrorNet is applied to learn, in an unsupervised manner, the controls of a specific audio synthesizer (DIVA) so as to produce melodies solely from their auditory spectrograms. The results demonstrate that the MirrorNet discovers synthesizer parameters that generate melodies closely resembling the originals, does the same for unseen melodies, and even determines the best parameter settings to approximate renditions of complex piano melodies generated by a different synthesizer. This generalizability illustrates the MirrorNet's potential to discover, from sensory data, the controls of arbitrary motor plants.
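The paper's architecture is not reproduced here, but the underlying idea (discovering a synthesizer's controls purely by matching auditory spectrograms, with no ground-truth parameters) can be illustrated with a deliberately tiny sketch: a two-parameter sine "synthesizer" and a random search over its controls. All names and numbers below are invented for illustration; the MirrorNet itself uses a constrained autoencoder, not random search.

```python
import numpy as np

SR = 8000          # sample rate (Hz)
N = 1024           # analysis window length (samples)

def synth(freq, amp, n=N, sr=SR):
    """Toy 'synthesizer': a single sine tone controlled by two parameters."""
    t = np.arange(n) / sr
    return amp * np.sin(2 * np.pi * freq * t)

def spectrogram(x):
    """Magnitude spectrum, a crude stand-in for an auditory spectrogram."""
    return np.abs(np.fft.rfft(x))

def discover_controls(target_audio, trials=2000, seed=0):
    """Find synthesizer controls whose output matches the target spectrogram,
    without ever seeing the true control values (unsupervised in that sense)."""
    rng = np.random.default_rng(seed)
    target_spec = spectrogram(target_audio)
    best, best_err = None, np.inf
    for _ in range(trials):
        freq = rng.uniform(100.0, 1000.0)   # candidate frequency (Hz)
        amp = rng.uniform(0.1, 1.0)         # candidate amplitude
        err = np.mean((spectrogram(synth(freq, amp)) - target_spec) ** 2)
        if err < best_err:
            best, best_err = (freq, amp), err
    return best

# A "melody note" produced by the same plant, controls unknown to the search:
true_freq, true_amp = 440.0, 0.8
est_freq, est_amp = discover_controls(synth(true_freq, true_amp))
```

The search recovers controls close to the true ones purely from spectrogram distance, which is the same signal the MirrorNet's learning is driven by.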
Discriminative Multimodal Learning via Conditional Priors in Generative Models
Deep generative models with latent variables have lately been used to learn joint representations and generative processes from multi-modal data. These two learning mechanisms can, however, conflict with each other, and the representations can fail to embed information about the data modalities. This research studies the realistic scenario in which all modalities and class labels are available for model training, but some modalities and labels required for downstream tasks are missing. We show that, in this scenario, the variational lower bound limits the mutual information between joint representations and missing modalities. To counteract these problems, we introduce a novel conditional multi-modal discriminative model that uses an informative prior distribution and optimizes a likelihood-free objective function maximizing the mutual information between joint representations and missing modalities. Extensive experimentation demonstrates the benefits of the proposed model: empirical results show that it achieves state-of-the-art performance on representative problems such as downstream classification, acoustic inversion, and image and annotation generation.
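The role of an informative prior can be sketched in the fully Gaussian case, a simplification of the above (the paper's objective is likelihood-free, and the numbers below are invented): the KL term of a variational bound compares the posterior over the joint representation with either a standard-normal prior or a label-conditioned prior.

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians."""
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# Posterior inferred from the observed modalities (toy values):
mu_q, var_q = np.array([0.9, -0.2]), np.array([0.1, 0.1])

# Uninformative standard-normal prior vs. a hypothetical label-conditioned
# prior centred near where class-y examples live in latent space:
kl_standard = kl_diag_gauss(mu_q, var_q, np.zeros(2), np.ones(2))
kl_conditional = kl_diag_gauss(mu_q, var_q, np.array([1.0, 0.0]),
                               np.array([0.2, 0.2]))
```

With the conditional prior the KL penalty on a class-informative posterior is much smaller, so class-relevant information is not squeezed out of the joint representation; this is the intuition, not the paper's exact mechanism.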
Perceptual-gestural (mis)mapping in serial short-term memory: The impact of talker variability
The mechanisms underlying the poorer serial recall of talker-variable lists (e.g., alternating female–male voices) as compared with single-voice lists were examined. We tested the novel hypothesis that this talker variability effect arises from the tendency for perceptual organization to partition the list into streams based on voice such that the representation of order maps poorly onto the formation of a gestural sequence-output plan assembled in support of the reproduction of the true temporal order of the items. In line with the hypothesis, (a) the presence of a spoken lead-in designed to further promote by-voice perceptual partitioning accentuates the effect (Experiments 1 and 2); (b) the impairment is larger the greater the acoustic coherence is between nonadjacent items: Alternating-voice lists are more poorly recalled than four-voice lists (Experiment 3); and (c) talker variability combines nonadditively with phonological similarity, consistent with the view that both variables disrupt sequence output planning (Experiment 4). The results support the view that serial short-term memory performance reflects the action of sequencing processes embodied within general-purpose perceptual input-processing and gestural output-planning systems.
The IMS Toucan System for the Blizzard Challenge 2023
For our contribution to the Blizzard Challenge 2023, we improved on the system we submitted to the Blizzard Challenge 2021. Our approach entails a rule-based text-to-phoneme processing system that includes rule-based disambiguation of homographs in the French language. It then transforms the phonemes into spectrograms as intermediate representations using a fast and efficient non-autoregressive synthesis architecture based on Conformer and Glow. A GAN-based neural vocoder that combines recent state-of-the-art approaches converts the spectrogram into the final waveform. We carefully designed the data processing, training, and inference procedures for the challenge data. Our system identifier is G. Open-source code and a demo are available.
Comment: Published at the Blizzard Challenge Workshop 2023, co-located with the Speech Synthesis Workshop 2023, a satellite event of the Interspeech 202
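Rule-based homograph disambiguation of the kind mentioned above can be sketched in a few lines. The lexicon, the rule, and the example homograph "est" (verb 'is', /ɛ/, vs. noun 'east', /ɛst/) are invented for illustration and are not taken from the IMS Toucan system; real front-ends also handle elision (e.g. "l'est"), morphology, and many more homographs.

```python
# Hypothetical sketch of rule-based French homograph disambiguation,
# in the spirit of (but not identical to) the system described above.

LEXICON = {
    "chat": "ʃa",
    "le": "lə",
    "il": "il",
    "vent": "vɑ̃",
}

def phonemize_est(prev_word):
    """Homograph rule: 'est' is the noun 'east' (/ɛst/) after a determiner,
    otherwise the verb 'is' (/ɛ/)."""
    return "ɛst" if prev_word in {"le", "l'"} else "ɛ"

def phonemize(sentence):
    """Look each word up in the lexicon, routing homographs through
    context-sensitive rules; unknown words fall back to their spelling."""
    words = sentence.lower().split()
    out = []
    for i, w in enumerate(words):
        prev = words[i - 1] if i > 0 else ""
        out.append(phonemize_est(prev) if w == "est" else LEXICON.get(w, w))
    return " ".join(out)
```

For example, `phonemize("il est")` applies the verb reading while `phonemize("le est")` (a contrived determiner context) applies the noun reading.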
The Effect of Speaking Rate on Vowel Variability Based on the Uncontrolled Manifold Approach and Flow-Based Invertible Neural Network Modeling
Variability is intrinsic to human speech production. One approach to understanding variability in speech is to decompose it into task-irrelevant (“good”) and task-relevant (“bad”) parts with respect to speech tasks. Based on the uncontrolled manifold (UCM) approach, this dissertation investigates how vowel token-to-token variability in articulation and acoustics can be decomposed into “good” and “bad” parts, and how speaking rate changes the pattern of the two, using the Haskins IEEE rate comparison database. Furthermore, it examines whether the “good” part of variability, or flexibility, can be modeled directly from speech data using a flow-based invertible neural network (FlowINN) framework. The application of the UCM analysis and the FlowINN modeling method is discussed, particularly focusing on how the “good” part of variability in speech can be useful rather than being disregarded as noise.
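The UCM decomposition itself is compact: given a Jacobian J of the task map (e.g., articulator positions to acoustics), token-to-token deviations are projected onto the null space of J (task-irrelevant, "good") and its orthogonal complement (task-relevant, "bad"), and variance is compared per degree of freedom. The linear map and data below are made up for illustration; the dissertation's maps and data come from the Haskins database.

```python
import numpy as np

def ucm_decompose(deviations, J):
    """Split token-to-token deviations into task-irrelevant ('good', null
    space of J) and task-relevant ('bad') variance per degree of freedom."""
    # Orthonormal basis of the null space of J via SVD.
    _, s, Vt = np.linalg.svd(J)
    rank = int(np.sum(s > 1e-10))
    null_basis = Vt[rank:].T                # columns span the UCM
    P_ucm = null_basis @ null_basis.T       # projector onto the manifold
    d_good = deviations @ P_ucm             # component leaving the task unchanged
    d_bad = deviations - d_good             # component that perturbs the task
    n_dof = J.shape[1]
    v_good = np.sum(d_good ** 2) / (len(deviations) * max(n_dof - rank, 1))
    v_bad = np.sum(d_bad ** 2) / (len(deviations) * max(rank, 1))
    return v_good, v_bad

# Hypothetical 1-task, 3-articulator linear map and noisy vowel tokens:
rng = np.random.default_rng(0)
J = np.array([[1.0, 0.5, -0.5]])
tokens = rng.normal(size=(200, 3))
deviations = tokens - tokens.mean(axis=0)
v_good, v_bad = ucm_decompose(deviations, J)
```

With isotropic noise the two per-dof variances come out roughly equal; a UCM-organized system would show v_good clearly exceeding v_bad.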
EMG-to-Speech: Direct Generation of Speech from Facial Electromyographic Signals
The general objective of this work is the design, implementation, improvement, and evaluation of a system that uses surface electromyographic (EMG) signals and directly synthesizes an audible speech output: EMG-to-speech.
Basque and Spanish Multilingual TTS Model for Speech-to-Speech Translation
[EN] Recently, multiple Text-to-Speech models using deep neural networks have emerged to synthesize audio from text. In this work, a state-of-the-art multilingual, multi-speaker Text-to-Speech model has been trained in Basque, Spanish, Catalan, and Galician. The research consisted of gathering the datasets, pre-processing their audio and text data, training the model on the languages in several steps, and evaluating the results at each point. For the training step, a transfer learning approach was used, starting from a model already trained in three languages: English, Portuguese, and French. The final model created here therefore supports a total of seven languages. Moreover, these models also support zero-shot voice conversion, using an input audio file as a reference. Finally, a prototype application has been created for Speech-to-Speech Translation, combining the models trained here with other models from the community. Along the way, some Deep Speech Speech-to-Text models have been
generated for Basque and Galician.
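One common way to extend a pretrained multilingual model to new languages (an assumption for illustration; the abstract does not describe this mechanism) is to grow the language-embedding table, initializing each new language's row from a related pretrained language before fine-tuning. The embeddings and the relatedness map below are invented toy values.

```python
import numpy as np

# Pretrained language-embedding table: rows for en, pt, fr (toy values).
pretrained = {
    "en": np.array([0.2, -0.1, 0.4]),
    "pt": np.array([0.5, 0.3, -0.2]),
    "fr": np.array([0.4, 0.1, 0.0]),
}

# Hypothetical initialization map: each new language starts from a related
# pretrained one (a modeling assumption, not a claim about the thesis system).
INIT_FROM = {"es": "pt", "ca": "fr", "gl": "pt", "eu": "fr"}

def extend_embeddings(pretrained, init_from, noise=0.01, seed=0):
    """Copy a related language's embedding (plus small symmetry-breaking
    noise) for each new language, so fine-tuning starts near a sensible point."""
    rng = np.random.default_rng(seed)
    table = dict(pretrained)
    for new_lang, src in init_from.items():
        table[new_lang] = pretrained[src] + rng.normal(0.0, noise, size=3)
    return table

table = extend_embeddings(pretrained, INIT_FROM)
```

The extended table covers all seven languages, matching the transfer-learning setup described above in spirit.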
Efficient, end-to-end and self-supervised methods for speech processing and generation
Deep learning has affected the speech processing and generation fields in many directions. First, end-to-end architectures allow the direct injection and synthesis of waveform samples. Second, the exploration of efficient solutions allows these systems to be implemented in computationally restricted environments, such as smartphones. Finally, the latest trends exploit audio-visual data with minimal supervision. This thesis explores these three directions.
First, we propose the use of recent pseudo-recurrent structures, such as self-attention models and quasi-recurrent networks, to build acoustic models for text-to-speech. The proposed system, QLAD, synthesizes faster on CPU and GPU than its recurrent counterpart while preserving a level of synthesis quality competitive with state-of-the-art vocoder-based models.
Then, a generative adversarial network named SEGAN is proposed for speech enhancement. This model works as a speech-to-speech conversion system in the time domain, where a single inference operation through a fully convolutional structure processes all samples at once. This implies an increase in modeling efficiency with respect to other existing models, which are auto-regressive and also work in the time domain. SEGAN achieves prominent results in noise suppression and in preserving speech naturalness and intelligibility when compared with classic and deep regression-based systems. We also show that SEGAN is efficient in transferring its operations to new languages and noises: a SEGAN trained for English performs similarly on Catalan and Korean with only 24 seconds of adaptation data. Finally, we unveil the generative capacity of the model to recover signals from several distortions, and hence propose the concept of generalized speech enhancement. First, the model proves effective at recovering voiced speech from whispered speech. Then the model is scaled up to address other distortions that require recomposing damaged parts of the signal, such as extending the bandwidth or recovering lost temporal sections. The model improves when additional acoustic losses are included in a multi-task setup that imposes a relevant perceptual weighting on the generated result. Moreover, a two-step training schedule is proposed to stabilize the adversarial training after the addition of such losses, and both components boost SEGAN's performance across distortions.

Finally, we propose a problem-agnostic speech encoder, named PASE, together with the framework to train it. PASE is a fully convolutional network that yields compact representations from speech waveforms. These representations contain abstract information such as speaker identity, prosodic features, and spoken contents.
A self-supervised framework is also proposed to train this encoder, which constitutes a new step towards unsupervised learning for speech processing. Once the encoder is trained, it can be exported to solve different tasks that require speech as input. We first explore the performance of PASE codes for speaker recognition, emotion recognition, and speech recognition. PASE performs competitively compared with well-designed classic features on these tasks, especially after some supervised adaptation. Finally, PASE also provides good descriptors of identity for multi-speaker modeling in text-to-speech, which is advantageous for modeling novel identities without retraining the model.
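The efficiency argument for SEGAN-style models (one feed-forward pass over the whole waveform, instead of producing samples one at a time) can be sketched with a single 1-D convolution. The box-filter "enhancer" and the signals below are toy stand-ins for the full convolutional encoder-decoder and real noisy speech.

```python
import numpy as np

def conv_enhance(noisy, kernel):
    """One feed-forward pass over the entire waveform: every output sample
    is produced at once (toy stand-in for a convolutional encoder-decoder)."""
    return np.convolve(noisy, kernel, mode="same")

def autoregressive_enhance(noisy, kernel):
    """Sample-by-sample loop producing the same result, to contrast the
    inference pattern of auto-regressive time-domain models."""
    k = len(kernel)
    pad = np.pad(noisy, (k // 2, k // 2))
    return np.array([np.dot(pad[i:i + k], kernel[::-1])
                     for i in range(len(noisy))])

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 5 * np.linspace(0.0, 1.0, 400))
noisy = clean + 0.3 * rng.normal(size=400)
smooth = np.ones(9) / 9.0                   # toy low-pass 'enhancer'
out_conv = conv_enhance(noisy, smooth)
out_ar = autoregressive_enhance(noisy, smooth)
```

Both routes compute the same output, but the convolutional one is a single vectorized operation; this is the structural property, not SEGAN's learned filter.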