7 research outputs found

    Investigating Generative Adversarial Networks based Speech Dereverberation for Robust Speech Recognition

    We investigate the use of generative adversarial networks (GANs) for speech dereverberation in robust speech recognition. GANs have recently been studied for speech enhancement to remove additive noise, but their ability to perform speech dereverberation has not yet been examined, and the advantages of using GANs for this task have not been established. In this paper, we provide an in-depth investigation of GAN-based dereverberation front-ends for ASR. First, we study the effectiveness of different dereverberation networks (the generator in the GAN) and find that an LSTM yields a significant improvement over a feed-forward DNN and a CNN on our dataset. Second, adding residual connections to the deep LSTMs boosts performance further. Finally, we find that, for the GAN to succeed, it is important to update the generator and the discriminator using the same mini-batch of data during training. Moreover, using the reverberant spectrogram as a condition for the discriminator, as suggested in previous studies, may degrade performance. In summary, our GAN-based dereverberation front-end achieves a 14%-19% relative CER reduction compared to the baseline DNN dereverberation network when tested with a strong multi-condition trained acoustic model.
    Comment: Interspeech 201
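As a rough illustration of the training recipe this abstract describes (an LSTM generator with residual connections, and generator/discriminator updates computed from the same mini-batch), here is a minimal PyTorch sketch. Layer sizes, the L1/adversarial loss mix, and the unconditioned discriminator are assumptions for illustration, not the authors' exact setup.

```python
import torch
import torch.nn as nn

class ResLSTMGenerator(nn.Module):
    """Deep LSTM dereverberation front-end with residual connections."""
    def __init__(self, feat_dim=257, hidden=512, layers=3):
        super().__init__()
        self.inp = nn.Linear(feat_dim, hidden)
        self.lstms = nn.ModuleList(
            [nn.LSTM(hidden, hidden, batch_first=True) for _ in range(layers)]
        )
        self.out = nn.Linear(hidden, feat_dim)

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        h = self.inp(x)
        for lstm in self.lstms:
            y, _ = lstm(h)
            h = h + y                      # residual connection around each LSTM
        return self.out(h)

class Discriminator(nn.Module):
    """Scores enhanced vs. clean spectrograms. No reverberant-spectrogram
    condition, since the paper reports that conditioning may hurt."""
    def __init__(self, feat_dim=257, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, spec):
        h, _ = self.rnn(spec)
        return self.head(h.mean(dim=1))    # utterance-level real/fake logit

G, D = ResLSTMGenerator(), Discriminator()
g_opt = torch.optim.Adam(G.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(D.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(reverb, clean):
    """One step: D and G are both updated from the *same* (reverb, clean) batch."""
    # --- discriminator update ---
    fake = G(reverb).detach()
    d_loss = bce(D(clean), torch.ones(clean.size(0), 1)) + \
             bce(D(fake), torch.zeros(fake.size(0), 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    # --- generator update on the same mini-batch ---
    enh = G(reverb)
    g_loss = bce(D(enh), torch.ones(enh.size(0), 1)) + \
             100.0 * nn.functional.l1_loss(enh, clean)  # assumed loss weight
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()

# usage with dummy log-magnitude spectrogram batches of shape (8, 100, 257):
# d_l, g_l = train_step(torch.randn(8, 100, 257), torch.randn(8, 100, 257))
```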

    A combined evaluation of established and new approaches for speech recognition in varied reverberation conditions

    Robustness to reverberation is a key concern for distant-microphone ASR. Various approaches have been proposed, including single-channel or multichannel dereverberation, robust feature extraction, alternative acoustic models, and acoustic model adaptation. However, to the best of our knowledge, a detailed study of these techniques under varied reverberation conditions is still missing from the literature. In this paper, we conduct a series of experiments to assess the impact of various dereverberation and acoustic model adaptation approaches on ASR performance across the range of reverberation conditions found in real domestic environments. We consider both established approaches such as weighted prediction error (WPE) dereverberation and newer approaches such as learning hidden unit contributions (LHUC) adaptation, whose performance has not previously been reported in this context, and we employ them in combination. Our results indicate that performing WPE dereverberation on the reverberated test utterance and decoding with a deep neural network (DNN) acoustic model trained on multi-condition reverberated speech with feature-space maximum likelihood linear regression (fMLLR) transformed features outperforms more recent approaches and significantly reduces the word error rate (WER).
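For readers unfamiliar with WPE, the sketch below shows the core of single-channel weighted prediction error dereverberation in plain NumPy: iteratively re-estimate the target signal's power, then solve a variance-weighted linear prediction problem per frequency bin and subtract the predicted late reverberation. It is a bare-bones illustration; the tap count, delay, and iteration count are typical values rather than the paper's settings, and practical systems would use a multichannel implementation such as the nara_wpe package.

```python
import numpy as np

def wpe_single_channel(Y, taps=10, delay=3, iterations=3, eps=1e-10):
    """Single-channel WPE dereverberation of an STFT Y with shape
    (freq_bins, frames). Returns the dereverberated STFT."""
    F, T = Y.shape
    X = Y.copy()
    for _ in range(iterations):
        for f in range(F):
            y = Y[f]                                  # (T,) complex frames
            lam = np.maximum(np.abs(X[f]) ** 2, eps)  # power of current estimate
            # Stacked delayed frames: Ytilde[k, t] = y[t - delay - k]
            Ytilde = np.zeros((taps, T), dtype=complex)
            for k in range(taps):
                shift = delay + k
                Ytilde[k, shift:] = y[: T - shift]
            # Variance-weighted correlation statistics
            R = (Ytilde / lam) @ Ytilde.conj().T      # (taps, taps)
            r = (Ytilde / lam) @ y.conj()             # (taps,)
            g = np.linalg.solve(R + eps * np.eye(taps), r)
            # Subtract the linear prediction of the late reverberation
            X[f] = y - g.conj() @ Ytilde
    return X
```

In the pipeline the paper favors, the dereverberated STFT would be resynthesized (or used directly for feature extraction), fMLLR-transformed features computed, and the result decoded with the multi-condition-trained DNN acoustic model.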

    Speech synthesis based on hidden Markov models and deep learning algorithms

    This thesis addresses the problem of improving the results of statistical parametric speech synthesis using deep learning algorithms. The subject has grown in importance in recent times due to the increasing presence of artificial voices in devices and applications, where there is a need to refine the results so that the sound of a synthetic voice approaches the naturalness and expressiveness of human speech. HMM-based speech synthesis became a hot topic after the second half of the 2000s thanks to its proven ability to generate speech from small amounts of data and its greater flexibility compared with other techniques; for this reason, the interest of the world's leading research groups in this area turned to refining its results. In this work, three proposals are made to improve those results: the first uses post-filters based on long short-term memory (LSTM) deep neural networks, the second combines them with Wiener filters, and the third takes a new discriminative approach. Unlike the preliminary proposals found in the literature, these are based on collections of various architectures, such as autoencoders and auto-associative memories, which are trained and applied to subsets of the speech parameters. The results achieved in this way surpass previous attempts, which considered a single model focused mainly on the spectral components of the voices. In addition, two applications are presented in which HMM-based speech synthesis and post-filter systems based on deep learning algorithms show good results. The first is accent conversion, a little-explored area for the variants of Castilian Spanish. The second is noise reduction in signals degraded with both natural and artificial noise. Both the post-filter systems for speech synthesis and the additional applications combine deep learning algorithms with classical speech signal enhancement techniques. The work presented here opens new lines of research in speech synthesis and in the enhancement of speech signals in the presence of noise.
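To make the post-filter idea concrete, here is a minimal PyTorch sketch of one such network: an LSTM that maps a subset of HMM-generated speech parameters (e.g., mel-cepstral coefficients) toward the values extracted from natural speech, trained with an MSE objective. The dimensions, the residual formulation, and the single-subset setup are illustrative assumptions; the thesis trains collections of models per parameter subset.

```python
import torch
import torch.nn as nn

class LSTMPostFilter(nn.Module):
    """Maps HMM-synthesized speech parameters to enhanced, more natural ones.
    In the thesis's scheme, one model would be trained per parameter subset
    (e.g., the mel-cepstrum)."""
    def __init__(self, n_params=40, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_params, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, n_params)

    def forward(self, synth):              # (batch, frames, n_params)
        h, _ = self.lstm(synth)
        return synth + self.proj(h)        # predict a residual correction

model = LSTMPostFilter()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# Training pairs: parameters generated by the HMM synthesizer vs. parameters
# extracted from the corresponding natural recording of the same utterance.
def train_step(synth_params, natural_params):
    opt.zero_grad()
    loss = loss_fn(model(synth_params), natural_params)
    loss.backward()
    opt.step()
    return loss.item()
```

Predicting a residual correction rather than the parameters themselves is a common design choice for post-filters, since the synthetic parameters are already close to the target and the network only needs to learn the over-smoothing error.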

    Proceedings of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017)
