5 research outputs found

    Investigating Generative Adversarial Networks based Speech Dereverberation for Robust Speech Recognition

    We investigate the use of generative adversarial networks (GANs) in speech dereverberation for robust speech recognition. GANs have recently been studied for speech enhancement to remove additive noise, but their ability to dereverberate speech has not yet been examined, and the advantages of using GANs have not been fully established. In this paper, we investigate in depth the use of a GAN-based dereverberation front-end in ASR. First, we study the effectiveness of different dereverberation networks (the generator in the GAN) and find that an LSTM yields a significant improvement over feed-forward DNN and CNN on our dataset. Second, adding residual connections to the deep LSTMs boosts performance further. Finally, we find that, for the GAN to succeed, it is important to update the generator and the discriminator using the same mini-batch data during training. Moreover, using the reverberant spectrogram as a condition to the discriminator, as suggested in previous studies, may degrade performance. In summary, our GAN-based dereverberation front-end achieves a 14%-19% relative CER reduction compared with the baseline DNN dereverberation network when tested on a strong multi-condition training acoustic model. Comment: Interspeech 201
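The "relative CER reduction" quoted in this abstract is the drop in character error rate expressed as a fraction of the baseline error rate. A minimal sketch of that arithmetic follows; the function name and the two CER values are hypothetical and are not taken from the paper.

```python
def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Relative error reduction of new_cer with respect to baseline_cer."""
    if baseline_cer <= 0:
        raise ValueError("baseline CER must be positive")
    return (baseline_cer - new_cer) / baseline_cer


# Hypothetical CERs in percent, chosen only to illustrate the arithmetic.
baseline = 20.0        # assumed baseline DNN front-end
gan_front_end = 16.6   # assumed GAN front-end

print(f"{relative_reduction(baseline, gan_front_end):.0%} relative CER reduction")
```

With these assumed values, an absolute drop of 3.4 points corresponds to a 17% relative reduction, which falls inside the 14%-19% range the abstract reports.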

    Single-Microphone Early and Late Reverberation Suppression in Noisy Speech

    Single-Microphone Speech Dereverberation based on Multiple-Step Linear Predictive Inverse Filtering and Spectral Subtraction

    Single-channel speech dereverberation is a challenging deconvolution problem: the reverberation produced by the room impulse response must be removed from the speech signal when only one observation of the reverberant signal (one microphone) is available. Although mild reverberation helps in perceiving speech (or any audio signal), strong reverberation can both degrade the performance of automatic recognition systems and make speech less intelligible to humans. Single-microphone speech dereverberation is more challenging than multi-microphone dereverberation, since it does not allow spatial processing of different observations of the signal. A review of recent single-channel dereverberation techniques reveals that those based on LP-residual enhancement are the most promising. Spectral subtraction has also been used effectively for dereverberation, particularly when long reflections are involved. A new dereverberation technique is proposed that uses LP residuals and spectral subtraction as its two tools. The first stage consists of pre-whitening followed by delayed long-term LP filtering, in which the kurtosis or skewness of the LP residuals is maximized to control the weight updates of the inverse filter. The second stage consists of nonlinear spectral subtraction. The two-stage scheme leads to two separate algorithms, depending on whether kurtosis or skewness maximization is used to establish the feedback function for the weight updates of the adaptive inverse filter. It is shown that the proposed algorithms have several advantages over the existing major single-microphone methods, including a reduction in both early and late reverberation, speech enhancement even at very high reverberation times, robustness to additive background noise, and the introduction of only a few minor artifacts.
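To illustrate why kurtosis can serve as a feedback measure for the inverse filter, the sketch below compares the kurtosis of a sparse, peaky toy "residual" with that of the same signal smeared by a short decaying tail. The pulse train, the three-tap room response, and the helper names are assumptions made for illustration; the sketch does not implement the adaptive filter itself.

```python
def kurtosis(x):
    """Normalized fourth central moment (equals 3 for a Gaussian)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / (m2 * m2)


def convolve(x, h):
    """Direct-form linear convolution of two sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xv in enumerate(x):
        for j, hv in enumerate(h):
            y[i + j] += xv * hv
    return y


# Toy LP residual: a sparse pulse train (assumed stand-in for glottal pulses).
clean_residual = [1.0 if i % 8 == 0 else 0.0 for i in range(160)]
# Toy room response: a short, exponentially decaying tail (assumption).
room = [1.0, 0.6, 0.36]
reverberant_residual = convolve(clean_residual, room)

print("clean:", kurtosis(clean_residual))
print("reverberant:", kurtosis(reverberant_residual))
```

In this toy case the smeared residual is less peaky and has the lower kurtosis, so a filter update that raises kurtosis moves the output back toward the clean, impulsive residual.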
Room impulse responses equalized by the proposed algorithms have shorter reverberation times, which means that inverse filtering by the proposed algorithms is more successful in dereverberating the speech signal. For short, medium and long reverberation times, the signal-to-reverberation ratio of the proposed technique is significantly higher than that of the existing major algorithms. The waveforms and spectrograms of the inverse-filtered and fully processed signals indicate the superiority of the proposed algorithms. Assessment of the overall quality of the processed speech by automatic speech recognition and the perceptual evaluation of speech quality (PESQ) test also confirms that in most cases the proposed technique yields higher scores; in the cases where it does not, the difference is less significant than in the other aspects of the performance evaluation. Finally, the robustness of the proposed algorithms to background noise is investigated and compared with that of the benchmark algorithms, showing that the proposed algorithms maintain rather stable performance for contaminated speech signals at SNR levels as low as 0 dB.
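The signal-to-reverberation ratio mentioned above can be sketched as follows. This uses one common definition (clean-signal energy over the energy of the deviation from the clean signal, in dB), which is not necessarily the exact variant used in the thesis; the function name and the toy signals are assumptions.

```python
import math


def srr_db(clean, processed):
    """SRR in dB: clean-signal energy over residual-error energy."""
    signal_energy = sum(s * s for s in clean)
    error_energy = sum((s - p) ** 2 for s, p in zip(clean, processed))
    if error_energy == 0.0:
        return math.inf  # processed signal matches the clean one exactly
    return 10.0 * math.log10(signal_energy / error_energy)


# Toy signals (assumed): a single pulse, its reverberant version,
# and a partially dereverberated version with a weaker tail.
clean = [1.0, 0.0, 0.0, 0.0]
reverberant = [1.0, 0.5, 0.25, 0.125]
dereverberated = [1.0, 0.1, 0.05, 0.0]

print("reverberant SRR (dB):", srr_db(clean, reverberant))
print("dereverberated SRR (dB):", srr_db(clean, dereverberated))
```

A higher value means less residual reverberation energy relative to the direct signal, which is the sense in which the abstract compares algorithms.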

    Égalisation adaptative et non invasive de la réponse temps-fréquence d'une petite salle (Adaptive and Non-Invasive Equalization of the Time-Frequency Response of a Small Room)

    This research is concerned with sound, the environment in which it propagates, the interaction between the sound wave and its transmission channel, and the changes induced by the components of an audio chain. The specific context studied is listening to music over loudspeakers. For the medium in which the sound wave propagates, as for any transmission channel, there are mathematical functions that characterize the changes the channel induces on a signal passing through it. An electrical signal serves as the excitation for this channel, which here consists of an amplifier, a loudspeaker, and the room where the listening takes place; depending on its characteristics, the channel returns an altered sound wave at the listening position. Frequency response, impulse response, transfer function: the mathematics used is no different from that commonly used to characterize a transmission channel or to express the functions relating the outputs of a linear system to its inputs. Naturally, this modeling exercise has a purpose: obtaining the response of the amplifier/loudspeaker/room chain makes its correction possible. In many listening contexts, a filter is inserted into the audio chain between the source (e.g., a CD player) and the loudspeaker that converts the electrical signal into an acoustic signal propagated in the room. This filter, called an “equalizer”, is intended to compensate, in frequency, for the effect of the audio-chain components and the room on the transmitted sound signal. Its properties derive from those of the amplifier, the loudspeaker, and the room.
Although analytically rigorous, the physical approach, which centers on physical modeling of the loudspeaker and on the propagation equation of the acoustic wave, is ill-suited to rooms whose geometry is complex or changes over time. The second approach, experimental modeling, is the one addressed in this work; it disregards physical properties. The amplifier/loudspeaker/room chain is instead seen as a “black box” with inputs and outputs. The problem studied is the characterization of an electro-acoustic system whose single input is a signal emitted through a loudspeaker in a room and whose single output is the signal picked up by a microphone at the listening position. The originality of this work lies not only in the technique developed to arrive at this characterization, but above all in the constraints imposed on how it is achieved. Most techniques documented to date rely on excitation signals dedicated to the measurement, signals whose characteristics simplify the calculation of the resulting impulse response. Known signals are played through a loudspeaker, and the response to this excitation is captured with a microphone at the listening position. The measurement exercise itself is problematic, notably when an audience is present in the room. In addition, the response of the room may change between the time of measurement and the time of listening if the room is reconfigured, for example if a curtain is drawn or a stage is moved. In a performance hall, the loudspeaker used may vary with the context. A survey was made of work suggesting solutions to this problem. The main objective is to develop an innovative method that captures the impulse response of the audio chain without the audience's knowledge. To do this, no signal dedicated to the measurement may be used.
The developed method captures the electro-acoustic impulse response using only the program material: music signals in a concert hall, or the movie soundtrack in a movie theater. The result is an algorithm that models the response of a room dynamically and continuously. A digital finite impulse response equalizing filter must also be designed, itself able to adapt dynamically to the behavior of the room, even when that behavior varies over time. Since familiarization with more advanced concepts of object-oriented C++ programming was called for, a technique that exploits music signals to obtain the impulse response and frequency response of the system was implemented as a VST module and tested experimentally. The excitation is provided by the music signals played through the loudspeakers during listening. A weighted moving average statistically reconstructs, over time, the response of the room over the entire audible frequency range. To quantify performance, the frequency response obtained is compared with that obtained by a standard method serving as a reference. The mean square error is used as the error metric and shows that the longer the music plays, the closer the frequency response obtained comes to the reference for the same listening position. A variable spectral resolution approach is used to build, by frequency band, the filter response arising from the inversion of the room/loudspeaker frequency response. The resulting dynamically adapting filter has properties similar to those of the human ear: fine spectral resolution at low frequencies and high time resolution at high frequencies. The frequency response of the system corrected by the equalizing filter is flatter than that of the initial system, and the corrected response tends toward a pure pulse.
Techniques explored in this research led to the publication of one article in a peer-reviewed journal and one conference paper, in which similar methods were applied to mining engineering.
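The idea of rebuilding the room response from music with a weighted moving average can be sketched as below. This is a minimal illustration under stated assumptions, not the thesis's VST implementation (which is in C++): it assumes frames are already in the frequency domain, uses an exponentially weighted average of auto- and cross-spectra, and updates a bin only when the music excites it. The 4-bin response `H_TRUE`, the forgetting factor `alpha`, and all function names are hypothetical.

```python
def update_spectra(sxx, sxy, X, Y, alpha=0.9, floor=1e-9):
    """Exponentially weighted update of auto- and cross-spectra, per bin."""
    for k, (x, y) in enumerate(zip(X, Y)):
        power = abs(x) ** 2
        if power < floor:
            continue  # bin not excited in this frame: keep the old estimate
        sxx[k] = alpha * sxx[k] + (1.0 - alpha) * power
        sxy[k] = alpha * sxy[k] + (1.0 - alpha) * x.conjugate() * y


def estimate_response(sxx, sxy):
    """Per-bin response estimate H = Sxy / Sxx (0 where never excited)."""
    return [c / p if p > 0.0 else 0j for p, c in zip(sxx, sxy)]


def mse(estimate, reference):
    """Mean square error between two complex frequency responses."""
    return sum(abs(e - r) ** 2 for e, r in zip(estimate, reference)) / len(reference)


# Hypothetical 4-bin room response used as the reference.
H_TRUE = [1.0 + 0.0j, 0.5 - 0.5j, 0.2 + 0.1j, -0.3 + 0.4j]


def run(num_frames):
    sxx = [0.0] * len(H_TRUE)
    sxy = [0j] * len(H_TRUE)
    errors = []
    for i in range(num_frames):
        # Each toy "music" frame excites only two of the four bins.
        X = [0j] * len(H_TRUE)
        X[i % 4] = complex(1.0 + 0.1 * i, 0.5)
        X[(i + 1) % 4] = complex(0.3, -0.2 * (i + 1))
        Y = [h * x for h, x in zip(H_TRUE, X)]  # noiseless room output
        update_spectra(sxx, sxy, X, Y)
        errors.append(mse(estimate_response(sxx, sxy), H_TRUE))
    return errors


errors = run(8)
print("MSE after first frame:", errors[0])
print("MSE after last frame:", errors[-1])
```

Because each frame covers only part of the spectrum, the error against the reference falls as more frames are processed, which mirrors the abstract's observation that the estimate approaches the reference as the music plays on.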