9 research outputs found

    Visual Speech Enhancement

    Full text link
    When video is shot in a noisy environment, the voice of a speaker visible in the video can be enhanced using the visible mouth movements, reducing background noise. While most existing methods use audio-only inputs, improved performance is obtained with our visual speech enhancement, which is based on an audio-visual neural network. We include in the training data videos to which we added the voice of the target speaker as background noise. Since the audio input alone is not sufficient to separate the voice of a speaker from his own voice, the trained model better exploits the visual input and generalizes well to different noise types. The proposed model outperforms prior audio-visual methods on two public lipreading datasets. It is also the first to be demonstrated on a dataset not designed for lipreading, such as the weekly addresses of Barack Obama.
    Comment: Accepted to Interspeech 2018. Supplementary video: https://www.youtube.com/watch?v=nyYarDGpcY
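The training trick described above, adding the target speaker's own voice as background noise, amounts to mixing two waveforms at a chosen signal-to-noise ratio. A minimal sketch of such mixing in plain NumPy (the function name and the use of a shifted copy of the utterance as the interfering signal are illustrative assumptions, not the authors' code):

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise signal into a clean signal at a target SNR (in dB)."""
    # Tile or trim the noise to match the clean signal's length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[:len(clean)]
    # Scale the noise so that 10 * log10(P_clean / P_noise) == snr_db.
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Using a shifted copy of the same utterance as "noise" mimics the paper's
# idea of adding the target speaker's own voice as an interfering signal.
rng = np.random.default_rng(0)
utterance = rng.standard_normal(16000)    # stand-in for 1 s of speech at 16 kHz
same_speaker = np.roll(utterance, 4000)   # shifted copy of the same voice
noisy = mix_at_snr(utterance, same_speaker, snr_db=0.0)
```

At 0 dB SNR the target and the interfering copy carry equal power, which is exactly the regime where an audio-only model cannot tell the two apart and the visual stream becomes informative.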

    SEGAN: Speech Enhancement Generative Adversarial Network

    Full text link
    Current speech enhancement techniques operate on the spectral domain and/or exploit some higher-level features. The majority of them tackle a limited number of noise conditions and rely on first-order statistics. To circumvent these issues, deep networks are increasingly being used, thanks to their ability to learn complex functions from large example sets. In this work, we propose the use of generative adversarial networks for speech enhancement. In contrast to current techniques, we operate at the waveform level, train the model end-to-end, and incorporate 28 speakers and 40 different noise conditions into the same model, such that model parameters are shared across them. We evaluate the proposed model on an independent, unseen test set with two speakers and 20 alternative noise conditions. The enhanced samples confirm the viability of the proposed model, and both objective and subjective evaluations confirm its effectiveness. With this, we open the exploration of generative architectures for speech enhancement, which may progressively incorporate further speech-centric design choices to improve their performance.
    Comment: 5 pages, 4 figures, accepted in INTERSPEECH 201
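The adversarial objective behind this kind of waveform-level enhancer can be summarized by its two loss functions. The sketch below assumes a least-squares GAN formulation with an auxiliary L1 term pulling the enhanced waveform toward the clean reference; the function names and the weight value are illustrative assumptions, not taken verbatim from the paper:

```python
import numpy as np

def d_loss(d_real: np.ndarray, d_fake: np.ndarray) -> float:
    # Least-squares discriminator loss: push D(real) toward 1, D(fake) toward 0.
    return 0.5 * np.mean((d_real - 1.0) ** 2) + 0.5 * np.mean(d_fake ** 2)

def g_loss(d_fake: np.ndarray, enhanced: np.ndarray, clean: np.ndarray,
           l1_weight: float = 100.0) -> float:
    # Generator loss: fool the discriminator (push D(fake) toward 1),
    # plus an L1 term keeping the enhanced waveform close to the clean one.
    adv = 0.5 * np.mean((d_fake - 1.0) ** 2)
    l1 = np.mean(np.abs(enhanced - clean))
    return adv + l1_weight * l1

# Toy scores: the discriminator rates clean speech high, generated speech low.
d_real = np.array([0.9, 1.1])
d_fake = np.array([0.2, 0.1])
print(d_loss(d_real, d_fake), g_loss(d_fake, np.zeros(4), np.zeros(4)))
```

The L1 term matters in practice: the adversarial loss alone only asks for "plausible speech", while the L1 term ties the output to the specific clean utterance behind the noisy input.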

    Studies on noise robust automatic speech recognition

    Get PDF
    Noise in everyday acoustic environments such as cars, traffic environments, and cafeterias remains one of the main challenges in automatic speech recognition (ASR). As a research theme, it has received wide attention in conferences and scientific journals focused on speech technology. This article collection reviews both classic and novel approaches suggested for noise-robust ASR. The articles are literature reviews written for the spring 2009 seminar course on noise robust automatic speech recognition (course code T-61.6060) held at TKK.

    Improved sequential and batch learning in neural networks using the tangent plane algorithm

    Get PDF
    The principal aim of this research is to investigate and develop improved sequential and batch learning algorithms based upon the tangent plane algorithm for artificial neural networks. A secondary aim is to apply the newly developed algorithms to multi-category cancer classification problems in the bioinformatics area, which involves the study of DNA or protein sequences, macro-molecular structures, and gene expressions.

    Efficient, end-to-end and self-supervised methods for speech processing and generation

    Get PDF
    Deep learning has affected the speech processing and generation fields in many directions. First, end-to-end architectures allow the direct injection and synthesis of waveform samples. Secondly, the exploration of efficient solutions allows these systems to be implemented in computationally restricted environments, like smartphones. Finally, the latest trends exploit audio-visual data with minimal supervision. In this thesis these three directions are explored.
First, we propose the use of recent pseudo-recurrent structures, like self-attention models and quasi-recurrent networks, to build acoustic models for text-to-speech. The proposed system, QLAD, synthesizes faster on both CPU and GPU than its recurrent counterpart while preserving synthesis quality, which is competitive with state-of-the-art vocoder-based models.
Then, a generative adversarial network for speech enhancement, named SEGAN, is proposed. This model works as a speech-to-speech conversion system in the time domain, where a single inference operation through a fully convolutional structure processes all samples. This implies a gain in modeling efficiency with respect to other existing time-domain models, which are auto-regressive. SEGAN achieves prominent results in noise suppression and in the preservation of speech naturalness and intelligibility when compared to classic and deep regression-based systems. We also show that SEGAN transfers efficiently to new languages and noises: a SEGAN trained on English performs similarly on Catalan and Korean with only 24 seconds of adaptation data. Finally, we unveil the generative capacity of the model to recover signals from several distortions, and hence propose the concept of generalized speech enhancement. The model first proves effective at recovering voiced speech from whispered speech. It is then scaled up to solve other distortions that require recomposing damaged parts of the signal, like extending the bandwidth or recovering lost temporal sections, among others. The model improves when additional acoustic losses are included in a multi-task setup to impose a relevant perceptual weighting on the generated result. Moreover, a two-step training schedule is proposed to stabilize the adversarial training after the addition of such losses, and both components boost SEGAN's performance across distortions.
Finally, we propose a problem-agnostic speech encoder, named PASE, together with the framework to train it. PASE is a fully convolutional network that yields compact representations from speech waveforms. These representations contain abstract information like the speaker identity, the prosodic features, or the spoken contents. A self-supervised framework is also proposed to train this encoder, which constitutes a new step towards unsupervised learning for speech processing. Once the encoder is trained, it can be exported to solve different tasks that require speech as input. We first explore the performance of PASE codes for speaker recognition, emotion recognition, and speech recognition. PASE performs competitively compared to well-designed classic features in these tasks, especially after some supervised adaptation. Finally, PASE also provides good descriptors of identity for multi-speaker modeling in text-to-speech, which is advantageous for modeling novel identities without retraining the model.
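A fully convolutional encoder of the kind described for PASE maps a raw waveform to a sequence of compact frame-level codes via strided convolutions. A toy single-layer sketch of that downsampling in NumPy (kernel size, stride, and dimensionality are illustrative assumptions, not the actual PASE architecture):

```python
import numpy as np

def conv1d(x: np.ndarray, w: np.ndarray, stride: int) -> np.ndarray:
    """Valid 1-D convolution with ReLU: x (T,), w (kernel, out) -> (frames, out)."""
    k = w.shape[0]
    n = (len(x) - k) // stride + 1
    # Slice the waveform into strided frames, then project each frame.
    frames = np.stack([x[i * stride : i * stride + k] for i in range(n)])
    return np.maximum(frames @ w, 0.0)

rng = np.random.default_rng(0)
wav = rng.standard_normal(16000)            # 1 s of "speech" at 16 kHz
w = rng.standard_normal((160, 64)) * 0.01   # 10 ms kernel, 64 output channels
codes = conv1d(wav, w, stride=160)          # (100, 64): one 64-dim code per 10 ms
```

Stacking several such layers multiplies the strides, which is how a waveform at 16000 samples per second is reduced to a compact code sequence at a much lower frame rate.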

    Metadata structures for better use of information in animal feeding

    Get PDF
    For efficient use of feeds it is necessary to know in depth both the needs of the animals and the characteristics of the feeds. Regarding the latter, data on the chemical composition and nutritive value of feeds have been obtained systematically in animal nutrition laboratories over the last 200 years (Gizzi and Givens, 2004). However, most of these data are used for a single purpose, be it quality control or the production of scientific results, overlooking their residual value when analyzed jointly. Since the beginning of the 20th century, part of this information has been collected in tables, but these present some limitations, such as their reduced size and static nature. Feed databases arose to overcome these limitations. The Feed Information Service (SIA) of the University of Córdoba has been working for years on the construction of this type of database (Gómez Cabrera et al., 2003), but it has encountered difficulties related to the management and analysis of the accumulated information. The search for solutions to these problems is the starting point of this doctoral thesis.
2. Research content. The data accumulated in the SIA lacked the accessory information necessary for proper interpretation and use. To solve this, a metadata structure adapted to the needs of daily information recording in laboratories has been designed. In addition, systematic naming schemes and controlled vocabularies have been designed for the metadata in order to avoid heterogeneity in the descriptors. Regarding the information analysis phase, the importance of pre-processing had been detected. This doctoral thesis has studied the behavior, with respect to the most common outputs of feed databases, of different techniques for the integration of diverse data, the search for duplicate records, the detection of outliers, and the management of missing data. Both uni- and multivariate algorithms have been studied, as well as global and local approaches to these aspects.
3. Conclusion. It is concluded that feed databases built on metadata structures are a great option for sharing research results (data sharing) and for controlling the heterogeneity typical of animal feed data. Pre-processing of the information, especially outlier detection and the handling of missing data, proves to be an essential step, with the most suitable algorithms in each case depending on the characteristics of the database and the type of analysis to be carried out. Moreover, although both aspects are usually seen as a problem, their study yields very valuable qualitative and quantitative information.
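The two pre-processing steps highlighted in the conclusion, outlier detection and missing-data handling, can be illustrated with a univariate sketch in NumPy. Tukey's IQR fences and mean imputation are generic textbook choices here, not necessarily the algorithms studied in the thesis, and the crude-protein values are hypothetical:

```python
import numpy as np

def iqr_outliers(col: np.ndarray) -> np.ndarray:
    """Flag values outside Tukey's fences (1.5 * IQR beyond the quartiles)."""
    finite = col[~np.isnan(col)]
    q1, q3 = np.percentile(finite, [25, 75])
    lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    return (col < lo) | (col > hi)   # NaN compares False, so gaps are not flagged

def mean_impute(col: np.ndarray) -> np.ndarray:
    """Fill missing entries with the column mean (a simple global approach)."""
    out = col.copy()
    out[np.isnan(out)] = np.nanmean(col)
    return out

# Hypothetical crude-protein column (% dry matter) with one gap and one outlier.
cp = np.array([17.2, 16.8, np.nan, 17.5, 16.9, 55.0, 17.1])
flags = iqr_outliers(cp)
clean = mean_impute(cp)
```

Note that the order of the two steps matters: imputing before removing the 55.0 outlier drags the column mean upward, which is one concrete way the "most suitable algorithm depends on the database" point plays out.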