53 research outputs found

    New Approaches for Speech Enhancement in the Short-Time Fourier Transform Domain

    Get PDF
    Speech enhancement aims at improving speech quality by means of various algorithms. A speech enhancement technique can be implemented as either a time-domain or a transform-domain method. In transform-domain speech enhancement, the spectrum of the clean speech signal is estimated by modifying the noisy speech spectrum and is then used to obtain the enhanced speech signal in the time domain. Among the existing transform-domain methods in the literature, short-time Fourier transform (STFT) processing has served as the basis for most frequency-domain methods. In general, speech enhancement methods in the STFT domain can be categorized into estimators of the complex discrete Fourier transform (DFT) coefficients and estimators of the real-valued short-time spectral amplitude (STSA). Due to the computational efficiency of the STSA estimation method and its superior performance in most cases compared to estimators of the complex DFT coefficients, we focus mostly on the estimation of the speech STSA throughout this work and aim at developing algorithms for noise reduction and reverberation suppression.

    First, we tackle the problem of additive noise reduction using the single-channel Bayesian STSA estimation method. In this respect, we present new schemes for the selection of the Bayesian cost function parameters of a parametric STSA estimator, namely the Wβ-SA estimator, based on an initial estimate of the speech and on the properties of the human auditory system. We further use the latter information to design an efficient flooring scheme for the gain function of the STSA estimator. Next, we apply the generalized Gaussian distribution (GGD) to the Wβ-SA estimator as the speech STSA prior and propose to choose its parameters according to the noise spectral variance and the a priori signal-to-noise ratio (SNR). The suggested STSA estimation schemes provide further noise reduction as well as less speech distortion compared to previous methods, and quality and noise reduction evaluations indicate the superiority of the proposed speech STSA estimation over previous estimators.

    Regarding the multi-channel counterpart of the STSA estimation method, we first generalize the proposed single-channel Wβ-SA estimator to the multi-channel case for spatially uncorrelated noise. It is shown that, under the Bayesian framework, a straightforward extension from the single-channel to the multi-channel case can be performed by generalizing the STSA estimator parameters, i.e. β and α. Next, we develop Bayesian STSA estimators that take advantage of the speech spectral phase rather than relying only on the spectral amplitude of the observations, in contrast to conventional methods; this contribution is presented for the multi-channel scenario, with the single-channel case as a special case. We then develop multi-channel STSA estimation under spatially correlated noise and derive a generic structure for the extension of a single-channel estimator to its multi-channel counterpart. The derived multi-channel extension requires a proper estimate of the spatial correlation matrix of the noise. Subsequently, we focus on the estimation of the noise correlation matrix, which is not only important in the multi-channel STSA estimation scheme but also highly useful in various beamforming methods.
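    Both the single- and multi-channel estimators above ultimately amount to applying a real-valued gain to the noisy STFT coefficients. The sketch below shows this pipeline with a plain Wiener gain standing in for the Bayesian Wβ-SA gain (whose closed form is not reproduced here); the fixed floor g_min and the maximum-likelihood a priori SNR estimate are illustrative assumptions rather than the thesis's actual flooring and parameter-selection schemes.

    import numpy as np

    def apply_floored_gain(noisy_stft, noise_psd, g_min=0.1):
        """Apply a floored spectral gain to noisy STFT coefficients.

        A Wiener gain stands in for the Bayesian W-beta-SA gain; only
        the gain-plus-flooring structure is illustrated.
        """
        noisy_power = np.abs(noisy_stft) ** 2
        snr_post = noisy_power / np.maximum(noise_psd, 1e-12)  # a posteriori SNR
        snr_prio = np.maximum(snr_post - 1.0, 0.0)             # ML a priori SNR
        gain = snr_prio / (1.0 + snr_prio)                     # Wiener gain
        gain = np.maximum(gain, g_min)                         # spectral floor
        return gain * noisy_stft                               # noisy phase kept

    Flooring the gain limits musical noise at the cost of some residual noise, which is why the thesis ties the floor to properties of the auditory system rather than a fixed constant.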
    Next, we aim at speech reverberation suppression in the STFT domain using the weighted prediction error (WPE) method. The original WPE method requires an estimate of the desired speech spectral variance along with the reverberation prediction weights, leading to a sub-optimal strategy that alternately estimates each of these two quantities. Also, as in most other STFT-based speech enhancement methods, the desired speech coefficients are assumed to be temporally independent, although this assumption is inaccurate. Taking these points into account, we first employ a suitable estimator for the speech spectral variance and integrate it into the estimation of the reverberation prediction weights. In addition to its performance advantage over previous versions of the WPE method, the presented approach considerably reduces implementation complexity. Next, we take into account the temporal correlation present in the STFT of the desired speech, namely the inter-frame correlation (IFC), and consider an approximate model in which only the frames within each segment of speech are treated as correlated. Furthermore, an efficient method for the estimation of the underlying IFC matrix is developed, based on an extension of the speech variance estimator proposed previously. The performance results reveal lower residual reverberation and higher overall quality for the proposed method. Finally, we focus on the problem of late reverberation suppression using the classic speech spectral enhancement method originally developed for additive noise reduction. As our main contribution, we propose a novel late reverberant spectral variance (LRSV) estimator which replaces the noise spectral variance in order to modify the gain function for reverberation suppression. The suggested approach employs a modified version of the WPE method in a model-based smoothing scheme for the estimation of the LRSV. According to the experiments, the proposed LRSV estimator outperforms the previous major methods considerably and comes closest to the theoretically true LRSV estimator. In particular, in the case of changing room impulse responses (RIRs), where other methods cannot follow the true LRSV accurately, the suggested estimator is able to track the true LRSV and yields a smaller tracking error. We also address a few other aspects of the spectral enhancement method for reverberation suppression that were previously explored only for noise reduction, including the estimation of the signal-to-reverberant ratio (SRR) and the development of new schemes for the speech presence probability (SPP) and spectral gain flooring in the context of late reverberation suppression.
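    A minimal single-channel rendition of the alternating WPE strategy described above is sketched next; the delay, prediction order, and iteration count are arbitrary illustrative defaults, and the per-bin weighted least-squares solve is a simplification of the full derivation.

    import numpy as np

    def wpe_single_channel(Y, delay=3, order=10, iters=3, eps=1e-8):
        """Alternating single-channel WPE (illustrative sketch).

        Y: complex STFT of shape (frames, freqs). Per frequency bin,
        alternates between estimating the desired-speech variance and
        the reverberation prediction weights.
        """
        T, _ = Y.shape
        D = Y.copy()
        for f in range(Y.shape[1]):
            y = Y[:, f]
            # Delayed linear-prediction matrix: column k holds y[t - delay - k]
            X = np.zeros((T, order), dtype=complex)
            for k in range(order):
                s = delay + k
                X[s:, k] = y[:T - s]
            d = y.copy()
            for _ in range(iters):
                lam = np.maximum(np.abs(d) ** 2, eps)    # speech variance estimate
                Xw = X / lam[:, None]                    # variance-weighted rows
                g = np.linalg.solve(Xw.conj().T @ X + eps * np.eye(order),
                                    Xw.conj().T @ y)     # prediction weights
                d = y - X @ g                            # dereverberated bin
            D[:, f] = d
        return D

    The variance update from |d|² is exactly the sub-optimal alternating step noted above; the thesis instead integrates a dedicated speech variance estimator into the weight estimation.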

    Pre-processing of Speech Signals for Robust Parameter Estimation

    Get PDF

    Adaptive Hidden Markov Noise Modelling for Speech Enhancement

    Get PDF
    A robust and reliable noise estimation algorithm is required in many speech enhancement systems. The aim of this thesis is to propose and evaluate a robust noise estimation algorithm for highly non-stationary noisy environments. In this work, we model the non-stationary noise using a set of discrete states, with each state representing a distinct noise power spectrum. In this approach, the state sequence over time is conveniently represented by a Hidden Markov Model (HMM). We first present an online HMM re-estimation framework that models time-varying noise and tracks changes in noise characteristics through a sequential model update procedure applied during the absence of speech. In addition, the algorithm creates new model states when necessary to represent novel noise spectra, and merges existing states that have similar characteristics. We then extend this work to robust noise estimation during speech activity by incorporating a speech model into the existing noise model. The noise characteristics within each state are updated according to a speech presence probability derived from a modified minima-controlled recursive averaging (MCRA) method. We demonstrate the effectiveness of our noise HMM in tracking both stationary and highly non-stationary noise, and show that it gives improved performance over conventional noise estimation methods when incorporated into a standard speech enhancement algorithm.
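    As a rough sketch of the kind of per-frame update such a noise HMM performs (not the exact algorithm of this thesis), the following keeps a bank of per-state noise spectra, propagates the state posterior through the transition matrix, and adapts each state's spectrum in proportion to both its posterior and the estimated speech absence; all parameter values are illustrative.

    import numpy as np

    def hmm_noise_update(noise_psds, trans, state_probs, frame_power,
                         speech_prob, alpha=0.9):
        """One frame of HMM-based noise tracking (illustrative).

        noise_psds:  (S, F) per-state noise power spectra
        trans:       (S, S) state transition matrix
        state_probs: (S,)   previous state posterior
        frame_power: (F,)   observed noisy power spectrum
        speech_prob: scalar speech presence probability for this frame
        """
        prior = trans.T @ state_probs                    # predicted state prior
        # Per-state log-likelihood of the observed power spectrum
        loglik = -np.sum(np.log(noise_psds) + frame_power / noise_psds, axis=1)
        post = prior * np.exp(loglik - loglik.max())
        post /= post.sum()
        # Update each state's spectrum only in proportion to speech absence
        rate = (1.0 - alpha) * (1.0 - speech_prob) * post[:, None]
        noise_psds = noise_psds + rate * (frame_power - noise_psds)
        return noise_psds, post

    State creation and merging, as described above, would sit on top of this loop: spawn a state when no existing state explains the frame well, and merge states whose spectra become similar.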

    Single-channel source separation using non-negative matrix factorization

    Get PDF

    Speech enhancement in binaural hearing protection devices

    Get PDF
    People's ability to operate safely and effectively under extreme noise conditions depends on their access to adequate voice communication while using hearing protection. This thesis develops speech enhancement algorithms that can be implemented in binaural hearing protection devices to improve communication and situational awareness in the workplace. The developed algorithms, which emphasize low computational complexity, are able to suppress noise while enhancing speech.

    Model-based speech enhancement for hearing aids

    Get PDF

    Trennung und Schätzung der Anzahl von Audiosignalquellen mit Zeit- und Frequenzüberlappung (Separation and Count Estimation of Audio Sources Overlapping in Time and Frequency)

    Get PDF
    Everyday audio recordings involve mixture signals: music contains a mixture of instruments; in a meeting or conference, there is a mixture of human voices. For such mixtures, automatically separating the sources or estimating their number is a challenging task. A common assumption when processing mixtures in the time-frequency domain is that the sources do not fully overlap. In this work, however, we consider cases where the overlap is severe, for instance when instruments play the same note (unison) or when many people speak concurrently (the "cocktail party" problem), which calls for new representations and more powerful models. To address the problems of source separation and count estimation, we use conventional signal processing techniques as well as deep neural networks (DNNs). We first address source separation for unison instrument mixtures, studying the distinct spectro-temporal modulations caused by vibrato. To exploit these modulations, we developed a method based on time warping, informed by an estimate of the fundamental frequency. For cases where such estimates are not available, we present an unsupervised model inspired by the way humans group time-varying sources (common fate). This contribution comes with a novel representation that improves separation of overlapped and modulated sources in unison mixtures, and also improves vocal and accompaniment separation when used as the input to a DNN model. We then focus on estimating the number of sources in a mixture, which is important for real-world scenarios. Our work on count estimation was motivated by a study of how humans address this task, which led us to conduct listening experiments confirming that humans can correctly estimate the number of sources only up to four. To answer the question of whether machines can perform similarly, we present a DNN architecture trained to estimate the number of concurrent speakers. Our results show improvements over other methods, and the model even outperformed humans on the same task. In both the source separation and count estimation tasks, the key contribution of this thesis is the concept of "modulation", which is important for computationally mimicking human performance. Our proposed Common Fate Transform is an adequate representation for disentangling overlapping signals for separation, and an inspection of our DNN count estimation model revealed that it learns modulation-like intermediate features.
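    A simplified, non-overlapping version of the Common Fate Transform can be sketched as a 2-D FFT over spectrogram patches; the published transform uses overlapping, windowed patches, so the tiling and patch sizes below are illustrative only.

    import numpy as np

    def common_fate_transform(spec, patch_f=16, patch_t=16):
        """2-D FFTs over spectrogram patches (simplified sketch).

        spec: STFT of shape (freqs, frames). Returns an array of shape
        (freqs // patch_f, frames // patch_t, patch_f, patch_t) whose
        entries capture local spectro-temporal modulation patterns.
        """
        F, T = spec.shape
        nf, nt = F // patch_f, T // patch_t
        spec = spec[:nf * patch_f, :nt * patch_t]        # trim to full patches
        patches = (spec.reshape(nf, patch_f, nt, patch_t)
                       .transpose(0, 2, 1, 3))           # (nf, nt, pf, pt)
        return np.fft.fft2(patches)                      # FFT over last two axes

    Sources that share a time-frequency region but modulate differently (e.g. two unison instruments with different vibrato) become distinguishable in this modulation domain even though they overlap in the plain spectrogram.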

    Deep neural network techniques for monaural speech enhancement: state of the art analysis

    Full text link
    Deep neural network (DNN) techniques have become pervasive in domains such as natural language processing and computer vision, where they have achieved great success in tasks such as machine translation and image generation. Owing to this success, these data-driven techniques have also been applied in the audio domain; more specifically, DNN models have been used in monaural speech enhancement to achieve denoising, dereverberation and multi-speaker separation. In this paper, we review the dominant DNN techniques employed to achieve speech separation. The review covers the whole speech enhancement pipeline: feature extraction, how DNN-based tools model both global and local features of speech, and model training (supervised and unsupervised). We also review the use of pre-trained speech enhancement models to boost the enhancement process. The review is geared towards covering the dominant trends in the application of DNNs to the enhancement of single-microphone (monaural) speech.
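    As an illustration of the mask-based mapping the review discusses (not any specific architecture it surveys), a minimal recurrent mask estimator might look as follows, assuming 257 frequency bins from a 512-point STFT; all layer sizes are arbitrary.

    import torch
    import torch.nn as nn

    class MaskEstimator(nn.Module):
        """Minimal recurrent time-frequency mask estimator (illustrative)."""

        def __init__(self, n_freq=257, hidden=256):
            super().__init__()
            self.rnn = nn.GRU(n_freq, hidden, num_layers=2, batch_first=True)
            self.out = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

        def forward(self, log_mag):        # (batch, frames, n_freq)
            h, _ = self.rnn(log_mag)
            return self.out(h)             # mask in [0, 1] per T-F bin

    In the supervised setting the review covers, such a network is typically trained against an ideal-ratio-mask target, and the predicted mask is applied to the noisy magnitude before resynthesis with the noisy phase.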

    An Indirect Speech Enhancement Framework Through Intermediate Noisy Speech Targets

    Get PDF
    Noise presents a severe challenge in speech communication and processing systems. Speech enhancement aims at removing the interference and restoring speech quality. It is an essential step in the speech processing pipeline of many modern electronic devices, such as mobile phones and smart speakers. Traditionally, speech engineers have relied on signal processing techniques such as spectral subtraction or Wiener filtering. Since the advent of deep learning, data-driven methods have offered an alternative solution, and researchers and engineers have proposed various neural network architectures to map noisy speech features into clean ones. In this thesis, we refer to this class of mapping-based data-driven techniques collectively as direct methods in speech enhancement. The output of direct mapping methods usually contains residual noise and unpleasant distortion when the speech power is low relative to the noise power or when the background noise is very complex. The former adverse condition corresponds to a low signal-to-noise ratio (SNR); the latter implies difficult noise types. To overcome such difficulty, researchers have proposed improving the SNR of the speech signal incrementally during enhancement, known as SNR-progressive speech enhancement, which breaks the direct mapping problem down into manageable sub-tasks. Inspired by this work, we propose a multi-stage indirect approach to speech enhancement in challenging noise conditions. Unlike SNR-progressive enhancement, we gradually transform speech recorded in difficult background noise into speech in simpler noise types. The thesis covers the characterization of background noise, speech transformation techniques, and the integration of an indirect speech enhancement system.
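    For reference, the classical spectral subtraction baseline named above can be sketched in a few lines; the flooring constant beta and the noise-magnitude input are illustrative simplifications of the many published variants.

    import numpy as np

    def spectral_subtraction(noisy_stft, noise_mag, beta=0.002):
        """Magnitude spectral subtraction (classical baseline sketch).

        noise_mag: estimate of the noise magnitude spectrum, e.g.
        averaged over speech-free frames.
        """
        mag = np.abs(noisy_stft)
        sub = mag - noise_mag                            # subtract noise estimate
        sub = np.maximum(sub, beta * mag)                # floor negative values
        return sub * np.exp(1j * np.angle(noisy_stft))   # keep noisy phase

    The residual noise and distortion such rules leave behind in low-SNR or complex-noise conditions are precisely the failure modes that motivate the indirect, multi-stage approach proposed in this thesis.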