    Exploration and Optimization of Noise Reduction Algorithms for Speech Recognition in Embedded Devices

    Environmental noise present in real-life applications substantially degrades the performance of speech recognition systems. An example is the in-car scenario, where a speech recognition system has to support the man-machine interface and several noise sources (engine, wipers, wheels, etc.) interact with the speech. An open-window scenario poses a particular challenge, since traffic noise, parking noise, and similar sounds must also be handled. The main goal of this thesis is to improve the performance of a speech recognition system based on a state-of-the-art hidden Markov model (HMM) using noise reduction methods. Performance is measured in terms of word error rate and by means of mutual information. The noise reduction methods are based on weighting rules. Least-squares weighting rules in the frequency domain have been developed to enable continued development based on the existing system and to guarantee the low complexity and small footprint required for applications in embedded devices. The weighting-rule parameters are tuned by posing a multidimensional optimization task, solved with Monte Carlo sampling followed by a compass search. Root compression and cepstral smoothing methods have also been implemented to boost recognition performance. The additional complexity and memory requirements of the proposed system are minimal. The performance of the proposed system was compared to the European Telecommunications Standards Institute (ETSI) standardized system. On the ETSI Aurora 3 German task, the proposed system outperforms the ETSI system by up to 8.6 % relative increase in word accuracy and achieves up to 35.1 % relative increase in word accuracy compared to the existing baseline system. The proposed weighting rules also yield a relative increase of up to 18 % in word accuracy over the existing baseline system on large-vocabulary databases. In addition, an entropy-based feature vector analysis method has been developed to assess the quality of feature vectors; the entropy estimation is based on a histogram approach. The method has the advantage of objectively assessing feature vector quality regardless of the acoustic modeling assumptions used in the speech recognition system.
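
    The parameter optimisation described above (random Monte Carlo sampling to seed a derivative-free compass search) can be sketched in a few lines. The following Python sketch only illustrates that generic scheme; the objective, box bounds, and step sizes are placeholders rather than values from the thesis, where the objective would be something like the recogniser's word error rate on a development set as a function of the weighting-rule parameters.

        import numpy as np

        def monte_carlo_then_compass(objective, lower, upper, n_samples=200,
                                     step=0.1, tol=1e-4, seed=None):
            """Minimise a black-box objective over a box: Monte Carlo
            sampling for a starting point, then compass search with
            step halving until the step falls below tol."""
            rng = np.random.default_rng(seed)
            lower, upper = np.asarray(lower, float), np.asarray(upper, float)
            # Monte Carlo phase: keep the best of n_samples random draws.
            samples = rng.uniform(lower, upper, size=(n_samples, len(lower)))
            best = min(samples, key=objective)
            best_val = objective(best)
            # Compass search phase: probe +/- step along each coordinate.
            while step > tol:
                improved = False
                for i in range(len(lower)):
                    for delta in (step, -step):
                        cand = best.copy()
                        cand[i] = np.clip(cand[i] + delta, lower[i], upper[i])
                        val = objective(cand)
                        if val < best_val:
                            best, best_val, improved = cand, val, True
                if not improved:
                    step *= 0.5  # no coordinate direction helped: refine the step
            return best, best_val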

    Studies on noise robust automatic speech recognition

    Noise in everyday acoustic environments such as cars, traffic, and cafeterias remains one of the main challenges in automatic speech recognition (ASR). As a research theme, it has received wide attention in conferences and scientific journals focused on speech technology. This article collection reviews both classic and novel approaches suggested for noise-robust ASR. The articles are literature reviews written for the spring 2009 seminar course on noise-robust automatic speech recognition (course code T-61.6060) held at TKK.

    Automatic Speech Recognition Using LP-DCTC/DCS Analysis Followed by Morphological Filtering

    Front-end feature extraction techniques have long been a critical component in Automatic Speech Recognition (ASR). Nonlinear filtering techniques are becoming increasingly important in this application and are often better than linear filters at removing noise without distorting speech features. However, nonlinear filters are more difficult to design and analyse than linear filters. Mathematical morphology, which creates filters based on shape and size characteristics, provides a design structure for nonlinear filters. These filters are limited to minimum and maximum operations, which introduce a deterministic bias into filtered signals. This work develops filtering structures based on mathematical morphology that exploit this bias while emphasizing spectral peaks. The combination of peak emphasis via LP analysis with morphological filtering results in more noise-robust speech recognition rates. To help understand the behavior of these pre-processing techniques, the deterministic and statistical properties of the morphological filters are compared to the properties of feature extraction techniques that do not employ such algorithms. The robust behavior of these algorithms for automatic speech recognition of rapidly fluctuating speech signals in the presence of additive and convolutional noise is illustrated. Examples of these nonlinear feature extraction techniques are given using the Aurora 2.0 and Aurora 3.0 databases. Features are computed using LP analysis alone to emphasize peaks, morphological filtering alone, or a combination of the two approaches. Although the absolute best results are normally obtained using a combination of the two methods, morphological filtering alone is nearly as effective and much more computationally efficient.
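
    As a rough illustration of the filter class discussed above, a grey-scale morphological opening followed by a closing can be applied to a log-magnitude spectrogram using running minima and maxima. This is a minimal sketch of generic morphological filtering, not the specific filter design of the paper; the structuring-element size is a placeholder.

        import numpy as np
        from scipy.ndimage import grey_opening, grey_closing

        def morphological_smooth(log_spectrogram, size=(3, 3)):
            """Nonlinear smoothing of a (freq x time) log-magnitude
            spectrogram built purely from min/max operations.
            Opening (erosion then dilation) removes thin positive
            spikes; closing (dilation then erosion) fills narrow
            valleys. Both introduce the deterministic bias noted
            in the abstract."""
            opened = grey_opening(log_spectrogram, size=size)
            return grey_closing(opened, size=size)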

    Hidden Markov model-based speech enhancement

    This work proposes a method of model-based speech enhancement that uses a network of HMMs to first decode noisy speech and then synthesise a set of features that enables a speech production model to reconstruct clean speech. The motivation is to remove the distortion and the residual and musical noises that are associated with conventional filtering-based methods of speech enhancement. STRAIGHT forms the speech production model for speech reconstruction and requires a time-frequency spectral surface, aperiodicity, and a fundamental frequency contour. The technique of HMM-based synthesis is used to create the estimates of the time-frequency surface and aperiodicity once the model and state sequence have been obtained from HMM decoding of the noisy input speech. The fundamental frequency was found to be best estimated using the PEFAC method rather than synthesised from the HMMs. For robust HMM decoding in noisy conditions, the HMMs must model noisy speech; noise adaptation is therefore investigated to achieve this, and its effect on the reconstructed speech is measured. Even with such noise adaptation to match the HMMs to the noisy conditions, decoding errors arise, both incorrect decoding and time alignment errors. Confidence measures are developed to identify such errors, and compensation methods are then developed to conceal them in the enhanced speech signal. Speech quality and intelligibility are first analysed in terms of PESQ and NCM, showing the superiority of the proposed method over conventional methods at low SNRs. A three-way subjective MOS listening test then shows that the proposed method overwhelmingly surpasses the conventional methods across all noise conditions, and a subjective word recognition test shows an intelligibility advantage of the proposed method over the conventional methods at low SNRs.
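
    The decoding stage referred to above is, at its core, Viterbi decoding of noisy feature frames against HMM states. A generic log-domain sketch follows; the state inventory, transition scores, and observation scores are placeholders, whereas the actual system decodes against a network of noise-adapted speech HMMs.

        import numpy as np

        def viterbi(log_obs, log_trans, log_init):
            """Most likely HMM state sequence for a frame sequence.
            log_obs:   (T, S) frame log-likelihoods under each state
            log_trans: (S, S) log transition probabilities
            log_init:  (S,)   log initial state probabilities
            """
            T, S = log_obs.shape
            delta = log_init + log_obs[0]          # best score ending in each state
            back = np.zeros((T, S), dtype=int)     # backpointers
            for t in range(1, T):
                scores = delta[:, None] + log_trans    # scores[i, j]: i -> j
                back[t] = np.argmax(scores, axis=0)
                delta = scores[back[t], np.arange(S)] + log_obs[t]
            path = np.empty(T, dtype=int)
            path[-1] = np.argmax(delta)
            for t in range(T - 2, -1, -1):         # trace back the best path
                path[t] = back[t + 1, path[t + 1]]
            return path

    The decoded model and state sequence then drive the HMM-based synthesis of the time-frequency surface and aperiodicity for STRAIGHT.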

    In search of the optimal acoustic features for statistical parametric speech synthesis

    In the Statistical Parametric Speech Synthesis (SPSS) paradigm, speech is generally represented as acoustic features and the waveform is generated by a vocoder. A comprehensive summary of state-of-the-art vocoding techniques is presented, highlighting their characteristics, advantages, and drawbacks, primarily when used in SPSS. We conclude that state-of-the-art vocoding methods are suboptimal and are a cause of significant loss of quality, even though numerous vocoders have been proposed in the last decade. In fact, it seems that the most complicated methods perform worse than simpler ones based on more robust analysis/synthesis algorithms. Typical methods, based on the source-filter or sinusoidal models, rely on excessive simplifying assumptions. They perform what we call an "extreme decomposition" of speech (e.g., source + filter or sinusoids + noise), which we believe to be a major drawback. Problems include difficulties in the estimation of components, the modelling of complex non-linear mechanisms, and a lack of ground truth. In addition, the statistical dependence that exists between the stochastic and deterministic components of speech is not modelled. We start by improving just the waveform generation stage of SPSS, using standard acoustic features. We propose a new method of waveform generation tailored for SPSS, based on neither source-filter separation nor sinusoidal modelling. The proposed waveform generator avoids unnecessary assumptions and decompositions as far as possible, and uses only the fundamental frequency and spectral envelope as acoustic features. A very small speech database is used as a source of base speech signals, which are subsequently "reshaped" to match the specifications output by the acoustic model in the SPSS framework. All of this is done without any decomposition, such as source + filter or harmonics + noise. A comprehensive description of the waveform generation process is presented, along with implementation issues. Two SPSS voices, a female and a male, were built to test the proposed method using a standard TTS toolkit, Merlin. In a subjective evaluation, listeners preferred the proposed waveform generator over a state-of-the-art vocoder, STRAIGHT. Even though the proposed "waveform reshaping" generator produces higher speech quality than STRAIGHT, the improvement is not large enough. Consequently, we propose a new acoustic representation, whose implementation involves feature extraction and waveform generation, i.e., a complete vocoder. The new representation encodes the complex spectrum derived from the Fourier transform in a way explicitly designed for SPSS, rather than for speech coding or copy-synthesis. The feature set comprises four feature streams describing the magnitude spectrum, phase spectrum, and fundamental frequency, all represented by real numbers, and it avoids heuristics and unstable methods for phase unwrapping. Because the new feature extraction does not attempt to decompose the speech structure, the "phasiness" and "buzziness" found in a typical vocoder, such as STRAIGHT, are dramatically reduced. Our method also works at a lower frame rate than a typical vocoder. To demonstrate the proposed method, two DNN-based voices, a male and a female, were built using the Merlin toolkit. Subjective comparisons were performed with a state-of-the-art baseline. The proposed vocoder substantially outperformed the baseline for both voices under all configurations tested.
    Furthermore, several enhancements were made over the original design, which are beneficial either for sound quality or for compatibility with other tools. In addition to its use in SPSS, the proposed vocoder is also demonstrated for join smoothing in unit selection-based systems, and it can be used for voice conversion or automatic speech recognition.
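
    One simple way to encode the complex spectrum as real-valued streams without phase unwrapping, in the spirit of the representation described above (though not necessarily the exact encoding used in the thesis), is to keep the log-magnitude together with the cosine and sine of the phase:

        import numpy as np

        def encode_frame(frame, n_fft=1024):
            """Real-valued streams from a complex spectrum: log-magnitude
            plus cos/sin of the phase, so no unwrapping is required."""
            spec = np.fft.rfft(frame, n_fft)
            log_mag = np.log(np.abs(spec) + 1e-12)
            phase = np.angle(spec)
            return log_mag, np.cos(phase), np.sin(phase)

        def decode_frame(log_mag, cos_p, sin_p, n_fft=1024):
            """Rebuild the complex spectrum and the time-domain frame."""
            spec = np.exp(log_mag) * (cos_p + 1j * sin_p)
            return np.fft.irfft(spec, n_fft)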

    Speech Enhancement Exploiting the Source-Filter Model

    Imagining everyday life without mobile telephony is nowadays hardly possible. Calls are made in every conceivable situation and environment, so the microphone picks up not only the user’s speech but also sound from the surroundings, which is likely to impede the understanding of the conversational partner. Modern speech enhancement systems are able to mitigate such effects, and most users are not even aware of their existence. This thesis presents the development of a modern single-channel speech enhancement approach, which uses the divide-and-conquer principle to combat environmental noise in microphone signals. Though initially motivated by mobile telephony applications, the approach can be applied whenever speech is to be retrieved from a corrupted signal. It uses the so-called source-filter model to divide the problem into two subproblems, which are then conquered by enhancing the source (the excitation signal) and the filter (the spectral envelope) separately. Both enhanced signals are then used to denoise the corrupted signal. The estimation of spectral envelopes has a long history, and some approaches already exist for speech enhancement; however, they typically neglect the excitation signal, which leaves them unable to enhance the spectral fine structure properly. Both individual enhancement approaches exploit benefits of the cepstral domain, which offers, e.g., advantageous mathematical properties and straightforward synthesis of excitation-like signals. This thesis investigates traditional model-based schemes such as Gaussian mixture models (GMMs), classical signal-processing-based approaches, and modern deep neural network (DNN)-based approaches. The enhanced signals are not used directly to enhance the corrupted signal (e.g., to synthesize a clean speech signal) but serve as a so-called a priori signal-to-noise ratio (SNR) estimate in a traditional statistical speech enhancement system. Such a system consists of a noise power estimator, an a priori SNR estimator, and a spectral weighting rule that is driven by the results of the two estimators and subsequently employed to retrieve the clean speech estimate from the noisy observation. As a result, the new approach achieves significantly higher noise attenuation than current state-of-the-art systems while maintaining comparable speech component quality and speech intelligibility. In consequence, the overall quality of the enhanced speech signal is superior to that of state-of-the-art speech enhancement approaches.
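
    The traditional statistical system mentioned at the end typically combines a decision-directed a priori SNR estimate with a spectral weighting rule such as the Wiener gain. A minimal per-frame sketch is given below; the noise PSD is assumed to come from a separate noise power estimator, and the smoothing constant 0.98 is the usual textbook value, not necessarily the one used in the thesis.

        import numpy as np

        def spectral_gain(noisy_power, noise_power, prev_clean_power,
                          alpha=0.98, snr_floor=1e-3):
            """Decision-directed a priori SNR estimate followed by a
            Wiener weighting rule, per frequency bin.
            noisy_power:      |Y(k)|^2 of the current frame
            noise_power:      estimated noise PSD
            prev_clean_power: |S_hat(k)|^2 of the previous frame
            """
            snr_post = noisy_power / noise_power              # a posteriori SNR
            snr_prio = (alpha * prev_clean_power / noise_power
                        + (1 - alpha) * np.maximum(snr_post - 1.0, 0.0))
            snr_prio = np.maximum(snr_prio, snr_floor)        # floor limits musical noise
            return snr_prio / (1.0 + snr_prio)                # Wiener gain, applied to Y(k)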

    Robust Phase-based Speech Signal Processing: From Source-Filter Separation to Model-Based Robust ASR

    Fourier analysis plays a key role in speech signal processing. As a complex quantity, the Fourier transform can be expressed in polar form using the magnitude and phase spectra. The magnitude spectrum is widely used in almost every corner of speech processing. The phase spectrum, however, is not an obviously appealing starting point for processing the speech signal: in contrast to the magnitude spectrum, whose fine and coarse structures have a clear relation to speech perception, the phase spectrum is difficult to interpret and manipulate, exhibiting no meaningful trends or extrema that might facilitate modelling. Nonetheless, the speech phase spectrum has recently gained renewed attention, and an expanding body of work shows that it can be usefully employed in a multitude of speech processing applications. Now that the potential of phase-based speech processing has been established, there is a need for a fundamental model to help understand the way in which phase encodes speech information. In this thesis a novel phase-domain source-filter model is proposed that allows for deconvolution of the speech vocal tract (filter) and excitation (source) components through phase processing. This model utilises the Hilbert transform, shows how the excitation and vocal tract elements mix in the phase domain, and provides a framework for efficiently segregating the source and filter components through phase manipulation. To investigate the efficacy of the suggested approach, a set of features is extracted from the filter part of the phase for automatic speech recognition (ASR), and the source part of the phase is utilised for fundamental frequency estimation. Accuracy and robustness in both cases are illustrated and discussed. In addition, the proposed approach is improved by replacing the log with the generalised logarithmic function in the Hilbert transform and by computing the group delay via a regression filter. Furthermore, the statistical distribution of the phase spectrum and its representations along the feature extraction pipeline is studied, and it is shown that the phase spectrum has a bell-shaped distribution. Statistical normalisation methods such as mean-variance normalisation, Laplacianisation, Gaussianisation, and histogram equalisation are successfully applied to the phase-based features and lead to a significant robustness improvement. The robustness gain achieved through statistical normalisation and the generalised logarithmic function encouraged the use of more advanced model-based statistical techniques such as Vector Taylor Series (VTS). VTS in its original formulation assumes the use of the log function for compression. In order to take advantage of both VTS and the generalised logarithmic function simultaneously, a new formulation is first developed to merge the two into a unified framework called generalised VTS (gVTS). In addition, to leverage the gVTS framework, a novel channel noise estimation method is developed. Extensions of the gVTS framework and of the proposed channel estimation to the group delay domain are then explored: the problems they present are analysed and discussed, some solutions are proposed, and the corresponding formulae are derived. Moreover, the effects of additive noise and channel distortion in the phase and group delay domains are scrutinised and the results are used in deriving the gVTS equations.
    Experimental results on the Aurora-4 ASR task, in an HMM/GMM setup along with a DNN-based bottleneck system, in both clean and multi-style training modes, confirm the efficacy of the proposed approach in dealing with both additive and channel noise.
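
    The phase-domain source-filter separation rests on the Hilbert-transform relation between the log-magnitude and the phase of a minimum-phase system. The cepstrum-based sketch below splits a frame's measured phase into a minimum-phase (vocal-tract-like) part and a residual all-pass (excitation-like) part; it is a minimal illustration of that standard decomposition, not the thesis's full model (windowing, the generalised logarithm, and group delay processing are omitted).

        import numpy as np

        def phase_decomposition(frame, n_fft=512):
            """Split the phase of a frame into minimum-phase and
            all-pass parts via the real cepstrum (n_fft even)."""
            spec = np.fft.fft(frame, n_fft)
            log_mag = np.log(np.abs(spec) + 1e-12)
            ceps = np.fft.ifft(log_mag).real        # real cepstrum
            folded = np.zeros_like(ceps)            # fold onto the causal part
            folded[0] = ceps[0]
            folded[1:n_fft // 2] = 2.0 * ceps[1:n_fft // 2]
            folded[n_fft // 2] = ceps[n_fft // 2]
            min_phase = np.fft.fft(folded).imag     # phase of the min-phase part
            residual = np.angle(np.exp(1j * (np.angle(spec) - min_phase)))
            return min_phase, residual              # filter-like, source-like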