6 research outputs found

    Speech Enhancement Exploiting the Source-Filter Model

    Everyday life without mobile telephony is hardly imaginable nowadays. Calls are made in every conceivable situation and environment, so the microphone picks up not only the user’s speech but also sound from the surroundings, which is likely to impede the conversational partner’s understanding. Modern speech enhancement systems are able to mitigate such effects, and most users are not even aware of their existence. This thesis presents the development of a modern single-channel speech enhancement approach that uses the divide-and-conquer principle to combat environmental noise in microphone signals. Though initially motivated by mobile telephony applications, the approach can be applied whenever speech is to be retrieved from a corrupted signal. It uses the so-called source-filter model to divide the problem into two subproblems, which are then solved by enhancing the source (the excitation signal) and the filter (the spectral envelope) separately. Both enhanced signals are then used to denoise the corrupted signal. The estimation of spectral envelopes has a long history, and some approaches already exist for speech enhancement. However, they typically neglect the excitation signal, which means the spectral fine structure cannot be enhanced properly. Both individual enhancement approaches exploit benefits of the cepstral domain, which offers, e.g., advantageous mathematical properties and straightforward synthesis of excitation-like signals. In this thesis we investigate traditional model-based schemes such as Gaussian mixture models (GMMs), classical signal-processing-based methods, and modern deep neural network (DNN)-based approaches. The enhanced signals are not used directly to enhance the corrupted signal (e.g., to synthesize a clean speech signal) but serve as a so-called a priori signal-to-noise ratio (SNR) estimate in a traditional statistical speech enhancement system.
Such a traditional system consists of a noise power estimator, an a priori SNR estimator, and a spectral weighting rule that is usually driven by the outputs of the aforementioned estimators and subsequently employed to retrieve the clean speech estimate from the noisy observation. As a result, the new approach attains significantly higher noise attenuation than current state-of-the-art systems while maintaining comparable speech component quality and speech intelligibility. Consequently, the overall quality of the enhanced speech signal turns out to be superior to that of state-of-the-art speech enhancement approaches.
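The estimator chain described above (noise power estimate → a priori SNR estimate → spectral weighting rule) can be illustrated with a minimal sketch. The decision-directed a priori SNR estimator and the Wiener gain rule used here are standard textbook choices, assumed for illustration; they are not necessarily the exact components of the thesis, whose a priori SNR estimate is instead derived from the separately enhanced excitation and envelope signals.

```python
import numpy as np

def wiener_enhance(noisy_spec, noise_psd, alpha=0.98, xi_min=1e-3):
    """Frame-by-frame spectral weighting driven by a decision-directed
    a priori SNR estimate. noisy_spec: complex STFT frames, shape
    (n_frames, n_bins); noise_psd: estimated noise power per bin."""
    n_frames, n_bins = noisy_spec.shape
    enhanced = np.empty_like(noisy_spec)
    prev_clean_power = np.zeros(n_bins)
    for t in range(n_frames):
        gamma = np.abs(noisy_spec[t]) ** 2 / noise_psd          # a posteriori SNR
        # decision-directed a priori SNR: blend of the previous frame's clean
        # estimate with the instantaneous ML estimate max(gamma - 1, 0)
        xi = alpha * prev_clean_power / noise_psd \
             + (1 - alpha) * np.maximum(gamma - 1.0, 0.0)
        xi = np.maximum(xi, xi_min)
        gain = xi / (1.0 + xi)                                  # Wiener weighting rule
        enhanced[t] = gain * noisy_spec[t]
        prev_clean_power = np.abs(enhanced[t]) ** 2
    return enhanced
```

Here `noise_psd` would come from the noise power estimator; since the gain is always below one, the enhanced magnitudes never exceed the noisy ones.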

    A Study into Speech Enhancement Techniques in Adverse Environment

    This dissertation developed speech enhancement techniques that improve speech quality in applications such as mobile communications, teleconferencing, and smart loudspeakers. For these applications it is necessary to suppress both noise and reverberation. Thus the contribution of this dissertation is twofold: a single-channel speech enhancement system that exploits the temporal and spectral diversity of the received microphone signal for noise suppression, and a multi-channel speech enhancement method that employs spatial diversity to reduce reverberation.

    Noise-Robust Voice Conversion

    A persistent challenge in speech processing is the presence of noise that reduces the quality of speech signals. Whether natural speech is used as input or speech is the desired output to be synthesized, noise degrades the performance of these systems and causes the output speech to be unnatural. Speech enhancement deals with this problem, typically seeking to improve the input speech or to post-process the (re)synthesized speech. An intriguing complement to post-processing speech signals is voice conversion, in which speech by one person (the source speaker) is made to sound as if spoken by a different person (the target speaker). Traditionally, the majority of speech enhancement and voice conversion methods rely on parametric modeling of speech. A promising complement to parametric models is an inventory-based approach, which is the focus of this work. In inventory-based speech systems, one records an inventory of clean speech signals as a reference. Noisy speech (in the case of enhancement) or target speech (in the case of conversion) can then be replaced by the best-matching clean speech in the inventory, which is found via a correlation search method. Such an approach has the potential to alleviate the intelligibility and unnaturalness issues often encountered by parametric speech processing systems. This work investigates and compares inventory-based speech enhancement methods with conventional ones. In addition, the inventory search method is applied to estimate source speaker characteristics for voice conversion in noisy environments. Two noisy-environment voice conversion systems were constructed for a comparative study: a direct voice conversion system and an inventory-based voice conversion system, both with limited noise filtering at the front end. Results from this work suggest that the inventory method offers encouraging improvements over the direct conversion method.
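The correlation search at the heart of the inventory approach can be sketched as follows. The zero-lag normalized correlation and the fixed frame layout are illustrative assumptions, not the exact matching criterion of this work:

```python
import numpy as np

def best_inventory_match(frame, inventory):
    """Return the index and score of the clean inventory entry that
    correlates best with the input frame. 'inventory' is an array of
    clean speech frames, shape (n_entries, frame_len)."""
    # zero-mean, unit-norm both the query and every candidate, so the
    # dot product is a normalized zero-lag cross-correlation score
    f = frame - frame.mean()
    f = f / (np.linalg.norm(f) + 1e-12)
    inv = inventory - inventory.mean(axis=1, keepdims=True)
    inv = inv / (np.linalg.norm(inv, axis=1, keepdims=True) + 1e-12)
    scores = inv @ f                      # one correlation score per entry
    best = int(np.argmax(scores))
    return best, scores[best]
```

In an enhancement setting, the matched clean frame would replace (or guide the reconstruction of) the noisy frame; a real system would also search over time lags and impose continuity across frames.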

    Speech enhancement in binaural hearing protection devices

    The ability of people to operate safely and effectively under extreme noise conditions depends on their access to adequate voice communication while using hearing protection. This thesis develops speech enhancement algorithms that can be implemented in binaural hearing protection devices to improve communication and situational awareness in the workplace. The developed algorithms, which emphasize low computational complexity, are capable of suppressing noise while enhancing speech.

    Voice inactivity ranking for enhancement of speech on microphone arrays

    Motivated by the problem of improving the performance of speech enhancement algorithms in non-stationary acoustic environments with low SNR, a framework is proposed for identifying signal frames of noisy speech that are unlikely to contain voice activity. Such voice-inactive frames can then be incorporated into an adaptation strategy to improve the performance of existing speech enhancement algorithms. This adaptive approach is applicable to single-channel as well as multi-channel algorithms for noisy speech. In both cases, the adaptive versions of the enhancement algorithms are observed to improve SNR levels by 20 dB, as indicated by PESQ and WER criteria. In advanced speech enhancement algorithms, it is often of interest to identify regions of the signal that have a high likelihood of being noise-only, i.e., containing no speech. This is in contrast to advanced speech recognition, speaker recognition, and pitch tracking algorithms, in which we are interested in identifying all regions that have a high likelihood of containing speech, as well as regions that have a high likelihood of not containing speech; in other words, minimizing the false negative and false positive rates, respectively. In the context of speech enhancement, the identification of some speech-absent regions calls for minimizing false positives while setting an acceptable tolerance on false negatives, as determined by the performance of the enhancement algorithm. Typically, Voice Activity Detectors (VADs) are used to identify speech-absent regions for speech enhancement. In recent years, a myriad of Deep Neural Network (DNN)-based approaches have been proposed to improve the performance of VADs at low SNR levels by training on combinations of speech and noise. Training on such an exhaustive dataset is combinatorially explosive.
For this dissertation, we propose a voice inactivity ranking framework, in which the identification of voice-inactive frames is performed by a machine learning (ML) approach that uses only clean speech utterances for training and is robust to high levels of noise. In the proposed framework, input frames of noisy speech are ranked by ‘voice inactivity score’ to acquire definitely speech inactive (DSI) frame sequences. These DSI regions serve as a noise estimate and are used adaptively by the underlying speech enhancement algorithm to recover speech from the noisy mixture. The proposed voice-inactivity ranking framework was used to perform speech enhancement in single-channel and multi-channel systems. In the context of microphone arrays, the framework was used to determine parameters for spatial filtering with adaptive beamformers. We achieved an average Word Error Rate (WER) improvement of 50% at SNR levels below 0 dB compared to the noisy signal, which is 7 ± 2.5% more than a framework in which a state-of-the-art VAD decision was used for spatial filtering. For monaural signals, we propose a multi-frame multiband spectral-subtraction (MF-MBSS) speech enhancement system that utilizes the voice inactivity framework to compute and update the noise statistics on overlapping frequency bands. In non-stationary acoustic environments, the proposed MF-MBSS not only achieved an average PESQ improvement of 16% (with a maximum improvement of 56%) compared to state-of-the-art spectral subtraction, but also achieved a 5 ± 1.5% improvement in the WER of the spatially filtered output signal.
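As a minimal illustration of how speech-inactive (DSI) frames can drive spectral subtraction, the single-band sketch below averages the noise spectrum over flagged frames; the multi-frame multiband machinery of the proposed MF-MBSS (overlapping bands, adaptive updates) is not reproduced, and the oversubtraction factor and spectral floor are illustrative assumptions:

```python
import numpy as np

def spectral_subtract(noisy_spec, inactive_mask, oversub=2.0, floor=0.05):
    """Basic magnitude spectral subtraction. The noise magnitude spectrum
    is averaged over the frames flagged as speech-inactive (inactive_mask),
    mimicking the use of DSI frames as a noise estimate."""
    noise_mag = np.abs(noisy_spec[inactive_mask]).mean(axis=0)
    mag = np.abs(noisy_spec)
    phase = np.angle(noisy_spec)
    # subtract an (over-)estimate of the noise magnitude, then apply a
    # spectral floor to avoid negative magnitudes and musical-noise holes
    clean_mag = np.maximum(mag - oversub * noise_mag, floor * mag)
    return clean_mag * np.exp(1j * phase)       # keep the noisy phase
```

In the proposed system, `inactive_mask` would come from the voice inactivity ranking rather than a hard VAD decision, and the noise statistics would be updated per band as new DSI frames arrive.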

    About voice activity detection

    Advisors: Romis Ribeiro de Faissol Attux, Everton Zaccaria Nadalin. Dissertation (Master's) - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação. Abstract: This work aims to study and evaluate voice activity detection (VAD) techniques applied to digital audio files, and proposes a new solution methodology. To this end, the fundamental concepts of digital speech processing were studied, in particular some classic approaches to the problem of distinguishing voice from non-voice. We started from the pioneering techniques, which use energy analysis and the zero-crossing rate of the speech signal, and proceeded to more recent approaches such as those exploiting spectral entropy, long-term variability, and the periodicity of the voice signal. Following the history of methodologies for detecting the presence of speech, we focused on VAD classifiers based on statistical models and finally examined recent applications of pattern recognition and machine learning techniques to the studied problem. This scenario reveals a wide range of representative voice features that can be exploited for detecting the presence of speech, as well as methods for extracting these attributes. Thus, the selection of these features and the classification techniques to be used are two complementary aspects that form the core of this study. At a high signal-to-noise ratio, voice activity detection can be performed satisfactorily by applying an energy threshold. At a low signal-to-noise ratio, however, it can be quite difficult to correctly detect the signal of interest, especially when it is corrupted by acoustically complex signals such as those from urban roads and food courts. In order to evaluate the features and classification techniques used in the literature under different noise types and levels, several voice activity detection algorithms had their performance assessed with the aid of an extensive noise database, QUT-NOISE-TIMIT. We also present a new proposal that exploits the quasi-periodic nature of the voice to detect the voiced part of speech, since it is more robust to noise, while the unvoiced part of speech can be approximated with smoothing techniques. The investigation of this proposal was made possible by developing VAD algorithms that apply cross-correlation between the spectra of consecutive frames to extract a feature that can be exploited by different classification strategies. We discuss the performance of the proposal in comparison with features commonly used in the literature combined with different classification techniques. Good results were obtained when using the proposed feature in different classification approaches, especially in environments with babble noise. (Mestrado, Engenharia de Computação; Mestre em Engenharia Elétrica; CAPE)
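The pioneering energy and zero-crossing-rate detectors mentioned in the abstract can be sketched in a few lines. The thresholds and frame length here are illustrative assumptions and would need tuning on real data:

```python
import numpy as np

def simple_vad(signal, frame_len=256, energy_thresh=0.01, zcr_thresh=0.25):
    """Classical frame-wise VAD: a frame is declared voice-active when its
    short-term energy is high and its zero-crossing rate is low (voiced
    speech concentrates its energy at low frequencies, so it crosses zero
    rarely compared with broadband noise). Returns one bool per frame."""
    n_frames = len(signal) // frame_len
    decisions = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = np.mean(frame ** 2)                         # short-term energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2   # crossings per sample
        decisions.append(energy > energy_thresh and zcr < zcr_thresh)
    return decisions
```

As the abstract notes, such an energy threshold works well at high SNR but fails in complex noise, which is what motivates the spectral cross-correlation feature proposed in the work.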