60 research outputs found

    Impact of single-microphone dereverberation on DNN-based meeting transcription systems

    Over the past few decades, a range of front-end techniques have been proposed to improve the robustness of automatic speech recognition systems against environmental distortion. While these techniques are effective for small tasks consisting of carefully designed data sets, especially when used with a classical acoustic model, there has been limited evidence that they are useful for a state-of-the-art system with large-scale realistic data. This paper focuses on reverberation as a type of distortion and investigates the degree to which dereverberation processing can improve the performance of various forms of acoustic models based on deep neural networks (DNNs) in a challenging meeting transcription task using a single distant microphone. Experimental results show that dereverberation improves the recognition performance regardless of the acoustic model structure and the type of feature vectors input to the neural networks, providing additional relative improvements of 4.7% and 4.1% to our best-configured speaker-independent and speaker-adaptive DNN-based systems, respectively. Xie Chen was funded by Toshiba Research Europe Ltd, Cambridge Research Lab. This is the accepted manuscript of a paper published in the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4-9 May 2014, written by T. Yoshioka, Xie Chen, and M. J. F. Gales.
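The improvements quoted above are relative word error rate (WER) reductions, not absolute ones. As a reminder of the convention, a minimal sketch (the function name and the numbers in the comment are illustrative, not taken from the paper):

```python
def relative_wer_reduction(wer_base, wer_new):
    """Relative WER reduction in percent: 100 * (base - new) / base."""
    return 100.0 * (wer_base - wer_new) / wer_base

# e.g. a drop from 50.0% to 47.65% WER is a 4.7% relative improvement,
# even though the absolute improvement is only 2.35 percentage points
```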

    Speech Enhancement Exploiting the Source-Filter Model

    Imagining everyday life without mobile telephony is nowadays hardly possible. Calls are being made in every thinkable situation and environment. Hence, the microphone will not only pick up the user’s speech but also sound from the surroundings, which is likely to impede the understanding of the conversational partner. Modern speech enhancement systems are able to mitigate such effects, and most users are not even aware of their existence. In this thesis the development of a modern single-channel speech enhancement approach is presented, which uses the divide-and-conquer principle to combat environmental noise in microphone signals. Though initially motivated by mobile telephony applications, this approach can be applied whenever speech is to be retrieved from a corrupted signal. The approach uses the so-called source-filter model to divide the problem into two subproblems which are then conquered by enhancing the source (the excitation signal) and the filter (the spectral envelope) separately. Both enhanced signals are then used to denoise the corrupted signal. The estimation of spectral envelopes has a long history, and some approaches already exist for speech enhancement. However, they typically neglect the excitation signal, which means the spectral fine structure cannot be enhanced properly. Both individual enhancement approaches exploit benefits of the cepstral domain, which offers, e.g., advantageous mathematical properties and a straightforward synthesis of excitation-like signals. In this thesis we investigate traditional model-based schemes such as Gaussian mixture models (GMMs), classical signal-processing-based approaches, and modern deep neural network (DNN)-based approaches. The enhanced signals are not used directly to enhance the corrupted signal (e.g., to synthesize a clean speech signal) but as a so-called a priori signal-to-noise ratio (SNR) estimate in a traditional statistical speech enhancement system.
Such a traditional system consists of a noise power estimator, an a priori SNR estimator, and a spectral weighting rule that is usually driven by the results of the aforementioned estimators and subsequently employed to retrieve the clean speech estimate from the noisy observation. As a result, the new approach attains significantly higher noise attenuation than current state-of-the-art systems while maintaining comparable speech component quality and speech intelligibility. Consequently, the overall quality of the enhanced speech signal is superior to that of state-of-the-art speech enhancement approaches.
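A sketch of the statistical back end described above, assuming the textbook decision-directed a priori SNR estimator and the Wiener weighting rule; the thesis may use different estimators, and all names and default values here are ours:

```python
import numpy as np

def wiener_enhance_frame(noisy_mag2, noise_psd, prev_clean_mag2,
                         alpha=0.98, xi_min=10 ** (-25 / 10)):
    """One frame of a priori SNR driven spectral weighting.

    noisy_mag2:      |Y|^2, noisy periodogram for this frame
    noise_psd:       estimated noise power spectral density
    prev_clean_mag2: clean speech power estimate from the previous frame
    """
    gamma = noisy_mag2 / noise_psd                       # a posteriori SNR
    # decision-directed a priori SNR estimate
    xi = alpha * prev_clean_mag2 / noise_psd \
        + (1.0 - alpha) * np.maximum(gamma - 1.0, 0.0)
    xi = np.maximum(xi, xi_min)                          # lower-bound the SNR
    gain = xi / (1.0 + xi)                               # Wiener weighting rule
    clean_mag2 = gain ** 2 * noisy_mag2                  # clean power estimate
    return gain, clean_mag2
```

The gain stays between 0 and 1 by construction; `xi_min` limits the maximum attenuation, which trades residual noise against speech distortion.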

    Noise-Robust Voice Conversion

    A persistent challenge in speech processing is the presence of noise that reduces the quality of speech signals. Whether natural speech is used as input or speech is the desired output to be synthesized, noise degrades the performance of these systems and causes the output speech to sound unnatural. Speech enhancement deals with this problem, typically seeking to improve the input speech or to post-process the (re)synthesized speech. An intriguing complement to post-processing speech signals is voice conversion, in which speech by one person (the source speaker) is made to sound as if spoken by a different person (the target speaker). Traditionally, the majority of speech enhancement and voice conversion methods rely on parametric modeling of speech. A promising alternative to parametric models is an inventory-based approach, which is the focus of this work. In inventory-based speech systems, one records an inventory of clean speech signals as a reference. Noisy speech (in the case of enhancement) or target speech (in the case of conversion) can then be replaced by the best-matching clean speech in the inventory, which is found via a correlation search method. Such an approach has the potential to alleviate the intelligibility and unnaturalness issues often encountered by parametric speech processing systems. This work investigates and compares inventory-based speech enhancement methods with conventional ones. In addition, the inventory search method is applied to estimate source speaker characteristics for voice conversion in noisy environments. Two noisy-environment voice conversion systems were constructed for a comparative study: a direct voice conversion system and an inventory-based voice conversion system, both with limited noise filtering at the front end. Results from this work suggest that the inventory method offers encouraging improvements over the direct conversion method.
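The correlation search at the heart of the inventory approach can be sketched as a normalized cross-correlation over candidate clean segments. This is a simplification: the actual search in the thesis may operate on features rather than raw waveforms, and the function names are ours:

```python
import numpy as np

def best_inventory_match(noisy_seg, inventory):
    """Return (index, score) of the inventory segment most correlated
    with noisy_seg, using zero-mean normalized cross-correlation."""
    x = noisy_seg - noisy_seg.mean()
    best_i, best_r = -1, -np.inf
    for i, seg in enumerate(inventory):
        y = seg - seg.mean()
        denom = np.linalg.norm(x) * np.linalg.norm(y)
        r = float(x @ y) / denom if denom > 0 else 0.0
        if r > best_r:
            best_i, best_r = i, r
    return best_i, best_r
```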

    Exploration and Optimization of Noise Reduction Algorithms for Speech Recognition in Embedded Devices

    Environmental noise present in real-life applications substantially degrades the performance of speech recognition systems. An example is an in-car scenario, where a speech recognition system has to support the man-machine interface. Several sources of noise, coming from the engine, wipers, wheels, etc., interact with the speech. A particular challenge is the open-window scenario, where traffic noise, parking noise, etc., must also be taken into account. The main goal of this thesis is to improve the performance of a speech recognition system based on a state-of-the-art hidden Markov model (HMM) using noise reduction methods. The performance is measured with respect to word error rate and with a mutual-information-based method. The noise reduction methods are based on weighting rules. Least-squares weighting rules in the frequency domain have been developed to enable a continuous development based on the existing system and to guarantee its low complexity and small footprint for applications in embedded devices. The weighting rule parameters are optimized via a multidimensional optimization procedure: a Monte Carlo method followed by a compass search. Root compression and cepstral smoothing methods have also been implemented to boost the recognition performance. The additional complexity and memory requirements of the proposed system are minimal. The performance of the proposed system was compared to the European Telecommunications Standards Institute (ETSI) standardized system. The proposed system outperforms the ETSI system by up to 8.6 % relative increase in word accuracy and achieves up to 35.1 % relative increase in word accuracy compared to the existing baseline system on the ETSI Aurora 3 German task. A relative increase of up to 18 % in word accuracy over the existing baseline system is also obtained with the proposed weighting rules on large-vocabulary databases.
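The compass search mentioned above is a simple derivative-free method: probe a step along each coordinate direction, keep any improvement, and halve the step when no direction improves. A generic sketch (not the thesis implementation; parameter defaults are ours):

```python
import numpy as np

def compass_search(f, x0, step=0.5, tol=1e-3, max_iter=1000):
    """Minimize f by probing +/- step along each coordinate axis."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    it = 0
    while step > tol and it < max_iter:
        improved = False
        for i in range(len(x)):
            for d in (step, -step):
                y = x.copy()
                y[i] += d
                fy = f(y)
                if fy < fx:          # accept any improving probe
                    x, fx, improved = y, fy, True
        if not improved:
            step *= 0.5              # contract when stuck
        it += 1
    return x, fx
```

In the pipeline described above, a Monte Carlo phase would first scatter candidate start points; the compass search then refines the best one locally.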
An entropy-based feature vector analysis method has also been developed to assess the quality of feature vectors. The entropy estimation is based on the histogram approach. The method has the advantage of objectively assessing the feature vector quality regardless of the acoustic modeling assumptions used in the speech recognition system.
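A histogram-based entropy estimate of the kind described can be sketched in a few lines; the bin count is an illustrative choice, not the thesis setting:

```python
import numpy as np

def histogram_entropy(x, bins=32):
    """Estimate the entropy (in bits) of samples x via a histogram."""
    counts, _ = np.histogram(x, bins=bins)
    p = counts / counts.sum()        # empirical probabilities
    p = p[p > 0]                     # drop empty bins (0 * log 0 := 0)
    return float(-np.sum(p * np.log2(p)))
```

Higher entropy indicates feature components that spread over many histogram bins; a degenerate (constant) feature yields zero entropy.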

    Mask-based enhancement of very noisy speech

    When speech is contaminated by high levels of additive noise, both its perceptual quality and its intelligibility are reduced. Studies show that conventional approaches to speech enhancement are able to improve quality but not intelligibility. However, in recent years, algorithms that estimate a time-frequency mask from noisy speech using a supervised machine learning approach and then apply this mask to the noisy speech have been shown to be capable of improving intelligibility. The most direct way of measuring intelligibility is to carry out listening tests with human test subjects. However, in situations where listening tests are impractical and where some additional uncertainty in the results is permissible, for example during the development phase of a speech enhancer, intrusive intelligibility metrics can provide an alternative to listening tests. This thesis begins by outlining a new intrusive intelligibility metric, WSTOI, that is a development of the existing STOI metric. WSTOI improves STOI by weighting the intelligibility contributions of different time-frequency regions with an estimate of their intelligibility content. The prediction accuracies of WSTOI and STOI are compared for a range of noises and noise suppression algorithms and it is found that WSTOI outperforms STOI in all tested conditions. The thesis then investigates the best choice of mask-estimation algorithm, target mask, and method of applying the estimated mask. A new target mask, the HSWOBM, is proposed that optimises a modified version of WSTOI with a higher frequency resolution. The HSWOBM is optimised for a stochastic noise signal to encourage a mask estimator trained on the HSWOBM to generalise better to unseen noise conditions. A high frequency resolution version of WSTOI is optimised as this gives improvements in predicted quality compared with optimising WSTOI. 
Of the tested approaches to target mask estimation, the best-performing approach uses a feed-forward neural network with a loss function based on WSTOI. The best-performing feature set is based on the gains produced by a classical speech enhancer and an estimate of the local voiced-speech-plus-noise to noise ratio in different time-frequency regions, which is obtained with the aid of a pitch estimator. When the estimated target mask is applied in the conventional way, by multiplying the speech by the mask in the time-frequency domain, it can result in speech with very poor perceptual quality. The final chapter of this thesis therefore investigates alternative approaches to applying the estimated mask to the noisy speech, in order to improve both intelligibility and quality. An approach is developed that uses the mask to supply prior information about the speech presence probability to a classical speech enhancer that minimises the expected squared error in the log spectral amplitudes. The proposed end-to-end enhancer outperforms existing algorithms in terms of predicted quality and intelligibility for most noise types.
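The "conventional way" of applying a mask referred to above is a simple element-wise gain in the time-frequency domain; a gain floor is commonly added to limit musical-noise artifacts (the floor value here is illustrative, not from the thesis):

```python
import numpy as np

def apply_tf_mask(noisy_stft, mask, floor=0.1):
    """Conventional mask application: element-wise gain on the noisy STFT."""
    gain = np.clip(mask, floor, 1.0)   # floor the gain to limit artifacts
    return gain * noisy_stft
```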

    Binaural scene analysis : localization, detection and recognition of speakers in complex acoustic scenes

    The human auditory system has the striking ability to robustly localize and recognize a specific target source in complex acoustic environments while ignoring interfering sources. Surprisingly, this remarkable capability, which is referred to as auditory scene analysis, is achieved by only analyzing the waveforms reaching the two ears. Computers, however, are presently not able to compete with the performance achieved by the human auditory system, even in the restricted paradigm of confronting a computer algorithm based on binaural signals with a highly constrained version of auditory scene analysis, such as localizing a sound source in a reverberant environment or recognizing a speaker in the presence of interfering noise. In particular, the problem of focusing on an individual speech source in the presence of competing speakers, termed the cocktail party problem, has been proven to be extremely challenging for computer algorithms. The primary objective of this thesis is the development of a binaural scene analyzer that is able to jointly localize, detect and recognize multiple speech sources in the presence of reverberation and interfering noise. The processing of the proposed system is divided into three main stages: localization stage, detection of speech sources, and recognition of speaker identities. The only information that is assumed to be known a priori is the number of target speech sources that are present in the acoustic mixture. Furthermore, the aim of this work is to reduce the performance gap between humans and machines by improving the performance of the individual building blocks of the binaural scene analyzer. First, a binaural front-end inspired by auditory processing is designed to robustly determine the azimuth of multiple, simultaneously active sound sources in the presence of reverberation. The localization model builds on the supervised learning of azimuth-dependent binaural cues, namely interaural time and level differences.
Multi-conditional training is performed to incorporate the uncertainty of these binaural cues resulting from reverberation and the presence of competing sound sources. Second, a speech detection module that exploits the distinct spectral characteristics of speech and noise signals is developed to automatically select azimuthal positions that are likely to correspond to speech sources. Due to the established link between the localization stage and the recognition stage, which is realized by the speech detection module, the proposed binaural scene analyzer is able to selectively focus on a predefined number of speech sources that are positioned at unknown spatial locations, while ignoring interfering noise sources emerging from other spatial directions. Third, the speaker identities of all detected speech sources are recognized in the final stage of the model. To reduce the impact of environmental noise on the speaker recognition performance, a missing-data classifier is combined with the adaptation of speaker models using a universal background model. This combination is particularly beneficial in nonstationary background noise.
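The interaural cues named above can be computed from a pair of ear signals in a few lines. This broadband sketch ignores the per-frequency-band processing and the multi-conditional training the model actually uses, and the function name is ours:

```python
import numpy as np

def binaural_cues(left, right, fs):
    """Broadband interaural level difference (dB) and time difference (s)."""
    eps = 1e-12
    # interaural level difference: energy ratio between the two ears
    ild = 10.0 * np.log10((np.sum(left ** 2) + eps) /
                          (np.sum(right ** 2) + eps))
    # interaural time difference: peak of the cross-correlation
    corr = np.correlate(left, right, mode="full")
    lag = int(np.argmax(corr)) - (len(right) - 1)
    itd = -lag / fs   # positive when the sound reaches the left ear first
    return ild, itd
```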

    Robust speaker recognition using both vocal source and vocal tract features estimated from noisy input utterances.

    Wang, Ning. Thesis (M.Phil.)--Chinese University of Hong Kong, 2007. Includes bibliographical references (leaves 106-115). Abstracts in English and Chinese.
Contents:
Chapter 1 Introduction --- p.1
1.1 Introduction to Speech and Speaker Recognition --- p.1
1.2 Difficulties and Challenges of Speaker Authentication --- p.6
1.3 Objectives and Thesis Outline --- p.7
Chapter 2 Speaker Recognition System --- p.10
2.1 Baseline Speaker Recognition System Overview --- p.10
2.1.1 Feature Extraction --- p.12
2.1.2 Pattern Generation and Classification --- p.24
2.2 Performance Evaluation Metric for Different Speaker Recognition Tasks --- p.30
2.3 Robustness of Speaker Recognition System --- p.30
2.3.1 Speech Corpus: CU2C --- p.30
2.3.2 Noise Database: NOISEX-92 --- p.34
2.3.3 Mismatched Training and Testing Conditions --- p.35
2.4 Summary --- p.37
Chapter 3 Speaker Recognition System using both Vocal Tract and Vocal Source Features --- p.38
3.1 Speech Production Mechanism --- p.39
3.1.1 Speech Production: An Overview --- p.39
3.1.2 Acoustic Properties of Human Speech --- p.40
3.2 Source-filter Model and Linear Predictive Analysis --- p.44
3.2.1 Source-filter Speech Model --- p.44
3.2.2 Linear Predictive Analysis for Speech Signal --- p.46
3.3 Vocal Tract Features --- p.51
3.4 Vocal Source Features --- p.52
3.4.1 Source Related Features: An Overview --- p.52
3.4.2 Source Related Features: Technical Viewpoints --- p.54
3.5 Effects of Noises on Speech Properties --- p.55
3.6 Summary --- p.61
Chapter 4 Estimation of Robust Acoustic Features for Speaker Discrimination --- p.62
4.1 Robust Speech Techniques --- p.63
4.1.1 Noise Resilience --- p.64
4.1.2 Speech Enhancement --- p.64
4.2 Spectral Subtractive-Type Preprocessing --- p.65
4.2.1 Noise Estimation --- p.66
4.2.2 Spectral Subtraction Algorithm --- p.66
4.3 LP Analysis of Noisy Speech --- p.67
4.3.1 LP Inverse Filtering: Whitening Process --- p.68
4.3.2 Magnitude Response of All-pole Filter in Noisy Condition --- p.70
4.3.3 Noise Spectral Reshaping --- p.72
4.4 Distinctive Vocal Tract and Vocal Source Feature Extraction --- p.73
4.4.1 Vocal Tract Feature Extraction --- p.73
4.4.2 Source Feature Generation Procedure --- p.75
4.4.3 Subband-specific Parameterization Method --- p.79
4.5 Summary --- p.87
Chapter 5 Speaker Recognition Tasks & Performance Evaluation --- p.88
5.1 Speaker Recognition Experimental Setup --- p.89
5.1.1 Task Description --- p.89
5.1.2 Baseline Experiments --- p.90
5.1.3 Identification and Verification Results --- p.91
5.2 Speaker Recognition using Source-tract Features --- p.92
5.2.1 Source Feature Selection --- p.92
5.2.2 Source-tract Feature Fusion --- p.94
5.2.3 Identification and Verification Results --- p.95
5.3 Performance Analysis --- p.98
Chapter 6 Conclusion --- p.102
6.1 Discussion and Conclusion --- p.102
6.2 Suggestion of Future Work --- p.10
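The spectral subtraction algorithm listed in Chapter 4.2 of the thesis above can be sketched in its basic magnitude-domain form, with a spectral floor as a common safeguard against negative magnitudes; the parameter values are illustrative, not the thesis settings:

```python
import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, over_sub=1.0, floor=0.02):
    """Magnitude spectral subtraction with over-subtraction and a floor."""
    sub = noisy_mag - over_sub * noise_mag   # subtract the noise estimate
    return np.maximum(sub, floor * noisy_mag)  # floor avoids negative bins
```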

    Model-Based Speech Enhancement

    A method of speech enhancement is developed that reconstructs clean speech from a set of acoustic features using a harmonic-plus-noise model of speech. This is a significant departure from traditional filtering-based methods of speech enhancement. A major challenge with this approach is to estimate accurately the acoustic features (voicing, fundamental frequency, spectral envelope and phase) from noisy speech. This is achieved using maximum a posteriori (MAP) estimation methods that operate on the noisy speech. In each case a prior model of the relationship between the noisy speech features and the estimated acoustic feature is required. These models are approximated using speaker-independent GMMs of the clean speech features that are adapted to speaker-dependent models using MAP adaptation and to the noise using the unscented transform. Objective results are presented to optimise the proposed system, and a set of subjective tests compares the approach with traditional enhancement methods. Three-way listening tests examining signal quality, background noise intrusiveness and overall quality show the proposed system to be highly robust to noise, performing significantly better than conventional methods of enhancement in terms of background noise intrusiveness. However, the proposed method is shown to reduce signal quality, with overall quality measured to be roughly equivalent to that of the Wiener filter.
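The harmonic part of a harmonic-plus-noise model resynthesizes voiced speech as a sum of sinusoids at multiples of the fundamental frequency, with amplitudes taken from the spectral envelope. A minimal sketch (phase handling omitted; the function name and signature are ours, not from the paper):

```python
import numpy as np

def synth_harmonic(f0, amps, fs, n_samples):
    """Synthesize the harmonic (voiced) part: a sum of harmonics of f0.

    f0:    fundamental frequency in Hz
    amps:  amplitude of each harmonic, sampled from the spectral envelope
    """
    t = np.arange(n_samples) / fs
    x = np.zeros(n_samples)
    for k, a in enumerate(amps, start=1):      # k-th harmonic at k * f0
        x += a * np.cos(2 * np.pi * k * f0 * t)
    return x
```

A full reconstruction would add a spectrally shaped noise component for the unvoiced part and cross-fade frames, which is omitted here.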
