Multichannel Speech Separation and Enhancement Using the Convolutive Transfer Function
This paper addresses the problem of speech separation and enhancement from
multichannel convolutive and noisy mixtures, assuming known mixing
filters. We propose to perform the speech separation and enhancement task in
the short-time Fourier transform domain, using the convolutive transfer
function (CTF) approximation. Compared to time-domain filters, the CTF has far
fewer taps; consequently, it has fewer near-common zeros among channels and lower
computational complexity. The work proposes three speech-source recovery
methods, namely: i) the multichannel inverse filtering method, i.e. the
multiple input/output inverse theorem (MINT), is exploited in the CTF domain,
and for the multi-source case, ii) a beamforming-like multichannel inverse
filtering method applying single source MINT and using power minimization,
which is suitable whenever the source CTFs are not all known, and iii) a
constrained Lasso method, where the sources are recovered by minimizing the
$\ell_1$-norm to impose their spectral sparsity, with the constraint that the
$\ell_2$-norm fitting cost, between the microphone signals and the mixing model
involving the unknown source signals, is less than a tolerance. The noise can
be reduced by setting a tolerance onto the noise power. Experiments under
various acoustic conditions are carried out to evaluate the three proposed
methods. The comparison between them as well as with the baseline methods is
presented. Comment: Submitted to IEEE/ACM Transactions on Audio, Speech and Language
Processing
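As a reading aid, the constrained Lasso of method iii) can be written in the generic form below. The symbols are assumptions introduced here and not the paper's own notation: $\mathbf{x}$ stacks the microphone STFT coefficients, $\mathbf{s}$ stacks the unknown source coefficients, $\mathbf{H}$ is the linear operator that applies the known CTF mixing filters, and $\epsilon$ is the tolerance tied to the noise power.

```latex
\hat{\mathbf{s}} = \arg\min_{\mathbf{s}} \; \|\mathbf{s}\|_1
\quad \text{subject to} \quad
\|\mathbf{x} - \mathbf{H}\mathbf{s}\|_2 \le \epsilon
```

A larger $\epsilon$ lets more of the observed signal be explained as noise, which is how the tolerance controls noise reduction in this formulation.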
Deep neural network techniques for monaural speech enhancement: state of the art analysis
Deep neural network (DNN) techniques have become pervasive in domains such
as natural language processing and computer vision. They have achieved great
success in these domains in tasks such as machine translation and image
generation. Due to their success, these data-driven techniques have been
applied in the audio domain. More specifically, DNN models have been applied in
the speech enhancement domain to achieve denoising, dereverberation and
multi-speaker separation in monaural speech enhancement. In this paper, we
review some dominant DNN techniques being employed to achieve speech
separation. The review looks at the whole pipeline of speech enhancement, from
feature extraction, through how DNN-based tools model both global and local
features of speech, to model training (supervised and unsupervised). We also
review the use of pre-trained speech enhancement models to boost the speech
enhancement process. The review is geared towards covering the dominant trends
with regard to DNN application in speech enhancement in speech obtained via a
single speaker. Comment: conference
Efficient and Robust Methods for Audio and Video Signal Analysis
This thesis presents my research concerning audio and video signal processing and machine learning. Specifically, the topics of my research include computationally efficient classifier compounds, automatic speech recognition (ASR), music dereverberation, video cut point detection and video classification.
The computational efficiency of information retrieval based on multiple measurement modalities has been considered in this thesis. Specifically, a cascade processing framework, including a training algorithm to set its parameters, has been developed for combining multiple detectors or binary classifiers in a computationally efficient way. The developed cascade processing framework has been applied to the video information retrieval tasks of video cut point detection and video classification. The results in video classification, compared to others found in the literature, indicate that the developed framework is capable of both accurate and computationally efficient classification. The idea of cascade processing has additionally been adapted for the ASR task. A procedure for combining multiple speech state likelihood estimation methods within an ASR framework in a cascaded manner has been developed. The results obtained clearly show that the computational load of ASR can be reduced using the cascaded speech state likelihood estimation process without impairing the transcription accuracy.
Additionally, this thesis presents my work on noise robustness of ASR using a nonnegative matrix factorization (NMF)-based approach. Specifically, methods for transforming sparse NMF features into speech state likelihoods have been explored. The results reveal that learned transformations from NMF activations to speech state likelihoods provide better ASR transcription accuracy than dictionary-label-based transformations. The results, compared to others in a noisy speech recognition challenge, show that NMF-based processing is an efficient strategy for noise robustness in ASR.
The thesis also presents my work on audio signal enhancement, specifically, on removing the detrimental effect of reverberation from music audio. In the work, a linear-prediction-based dereverberation algorithm, which was originally developed for speech signal enhancement, was applied to music. The results obtained show that the algorithm performs well with music signals and indicate that dynamic compression of music does not impair the dereverberation performance.
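The cascade idea described in this abstract can be illustrated with a minimal sketch. It is not the thesis's framework or training algorithm: the stage format, the thresholds and the name `cascade_classify` are hypothetical, and the thresholds are fixed by hand here rather than learned.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical stage format: (score function, reject threshold, accept threshold).
Stage = Tuple[Callable[[float], float], float, float]


def cascade_classify(x: float, stages: Sequence[Stage]) -> Tuple[bool, int]:
    """Run detectors from cheapest to costliest and stop at the first confident one.

    Returns (decision, number of stages evaluated).
    """
    if not stages:
        raise ValueError("at least one stage is required")
    for used, (score_fn, reject_below, accept_above) in enumerate(stages, start=1):
        score = score_fn(x)
        if score < reject_below:      # confidently negative: exit early
            return False, used
        if score > accept_above:      # confidently positive: exit early
            return True, used
    # No stage was confident: decide with the last stage's score against
    # the midpoint of its two thresholds.
    return score >= 0.5 * (reject_below + accept_above), used


# Illustrative stages only: a cheap detector with a wide "unsure" band,
# followed by a costlier detector that always decides.
EXAMPLE_STAGES = [
    (lambda x: x, 0.2, 0.8),
    (lambda x: x + 0.1, 0.5, 0.5),
]
```

The computational saving comes from the early exits: inputs that the first detector handles confidently never reach the costlier stage.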
New Approaches for Speech Enhancement in the Short-Time Fourier Transform Domain
Speech enhancement aims at the improvement of speech quality by using various algorithms. A speech enhancement technique can be implemented as either a time domain or a transform domain method. In the transform domain speech enhancement, the spectrum of clean speech signal is estimated through the modification of noisy speech spectrum and then it is used to obtain the enhanced speech signal in the time domain. Among the existing transform domain methods in the literature, the short-time Fourier transform (STFT) processing has particularly served as the basis to implement most of the frequency domain methods. In general, speech enhancement methods in the STFT domain can be categorized into the estimators of complex discrete Fourier transform (DFT) coefficients and the estimators of real-valued short-time spectral amplitude (STSA). Due to the computational efficiency of the STSA estimation method and also its superior performance in most cases, as compared to the estimators of complex DFT coefficients, we focus mostly on the estimation of speech STSA throughout this work and aim at developing algorithms for noise reduction and reverberation suppression.
First, we tackle the problem of additive noise reduction using the single-channel Bayesian STSA estimation method. In this respect, we present new schemes for the selection of Bayesian cost function parameters for a parametric STSA estimator, namely the Wβ-SA estimator, based on an initial estimate of the speech and also the properties of human auditory system. We further use the latter information to design an efficient flooring scheme for the gain function of the STSA estimator. Next, we apply the generalized Gaussian distribution (GGD) to the Wβ-SA estimator as the speech STSA prior and propose to choose its parameters according to noise spectral variance and a priori signal to noise ratio (SNR). The suggested STSA estimation schemes are able to
provide further noise reduction as well as less speech distortion, as compared to the previous methods. Quality and noise reduction performance evaluations indicated the superiority of the
proposed speech STSA estimation with respect to the previous estimators.
Regarding the multi-channel counterpart of the STSA estimation method, first we generalize the proposed single-channel Wβ-SA estimator to the multi-channel case for spatially uncorrelated
noise. It is shown that under the Bayesian framework, a straightforward extension from the single-channel to the multi-channel case can be performed by generalizing the STSA estimator parameters, i.e. α and β. Next, we develop Bayesian STSA estimators by taking advantage of speech spectral phase rather than only relying on the spectral amplitude of observations, in
contrast to conventional methods. This contribution is presented for the multi-channel scenario with single-channel as a special case.
Next, we aim at developing multi-channel STSA estimation
under spatially correlated noise and derive a generic structure for the extension of a single-channel estimator to its multi-channel counterpart. It is shown that the derived multi-channel extension
requires a proper estimate of the spatial correlation matrix of noise. Subsequently, we focus on the estimation of noise correlation matrix, that is not only important in the multi-channel STSA estimation scheme but also highly useful in different beamforming methods. Next, we aim at speech reverberation suppression in the STFT domain using the weighted prediction
error (WPE) method. The original WPE method requires an estimate of the desired speech spectral variance along with reverberation prediction weights, leading to a sub-optimal strategy
that alternately estimates each of these two quantities. Also, as in most other STFT-based speech enhancement methods, the desired speech coefficients are assumed to be temporally independent, although this assumption is inaccurate. Taking these into account, first, we employ a suitable estimator for the speech spectral variance and integrate it into the estimation of the reverberation prediction weights. In addition to the performance advantage with respect to the previous versions of the WPE method, the presented approach provides a good reduction in implementation complexity. Next, we take into account the temporal correlation present in the STFT of the desired
speech, namely the inter-frame correlation (IFC), and consider an approximate model where only the frames within each segment of speech are considered as correlated. Furthermore, an efficient
method for the estimation of the underlying IFC matrix is developed based on the extension of the speech variance estimator proposed previously. The performance results reveal lower residual reverberation and higher overall quality provided by the proposed method.
Finally, we focus on the problem of late reverberation suppression using the classic speech spectral enhancement method originally developed for additive noise reduction. As our main contribution,
we propose a novel late reverberant spectral variance (LRSV) estimator which replaces the noise spectral variance in order to modify the gain function for reverberation suppression. The suggested approach employs a modified version of the WPE method in a model based smoothing scheme used for the estimation of the LRSV. According to the experiments, the proposed LRSV estimator outperforms the previous major methods considerably and scores the closest results to
the theoretically true LRSV estimator. Particularly, in case of changing room impulse responses (RIRs) where other methods cannot follow the true LRSV estimator accurately, the suggested
estimator is able to track true LRSV values and results in a smaller tracking error. We also target a few other aspects of the spectral enhancement method for reverberation suppression, which were
explored before only for the purpose of noise reduction. These contributions include the estimation of the signal-to-reverberant ratio (SRR) and the development of new schemes for the speech presence probability (SPP) and spectral gain flooring in the context of late reverberation suppression.
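The role of gain flooring in STFT-domain spectral enhancement can be sketched as follows. This is an illustration under stated assumptions, not the thesis's estimator: a plain Wiener gain stands in for the parametric STSA gain (whose form the abstract does not give), the a priori SNR uses the simple maximum-likelihood estimate, and the name `floored_wiener_gain` and the default floor of 0.1 are hypothetical.

```python
from typing import List, Sequence


def floored_wiener_gain(
    noisy_power: Sequence[float],
    noise_power: Sequence[float],
    gain_floor: float = 0.1,
) -> List[float]:
    """Per-bin spectral gain with a lower bound (floor).

    noisy_power: |Y|^2 for each frequency bin of one STFT frame.
    noise_power: noise spectral variance for each bin (must be > 0).
    """
    gains = []
    for y, n in zip(noisy_power, noise_power):
        # Maximum-likelihood a priori SNR estimate: a posteriori SNR minus one,
        # clipped at zero.
        xi = max(y / n - 1.0, 0.0)
        # Wiener gain (a stand-in, assumed here), bounded below by the floor
        # to limit musical noise and speech distortion.
        gains.append(max(xi / (1.0 + xi), gain_floor))
    return gains
```

The enhanced amplitude in each bin would then be the gain multiplied by the noisy amplitude; for reverberation suppression, the abstract's approach replaces the noise spectral variance with a late reverberant spectral variance estimate.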
Transient and steady-state component separation for audio signals
In this work, the problem of transient and steady-state component separation of an audio signal was addressed. In particular, a recently proposed method for separation of transient and steady-state components based on the median filter was investigated. For a better understanding of the processes involved, a modification of the filtering stage of the algorithm was proposed. This modification was evaluated subjectively by listening tests and objectively by an application-based comparison. Some extensions to the model were also presented, in conjunction with different possible applications of the transient and steady-state decomposition in the area of audio editing and processing.
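A minimal sketch of median-filter separation on a magnitude spectrogram is given below. It assumes the investigated method resembles the widely used scheme in which a median filter along time keeps steady-state (horizontal) structure and a median filter along frequency keeps transient (vertical) structure; the abstract does not confirm this, and the function names, the binary masking rule and the window width are hypothetical.

```python
from statistics import median
from typing import List, Tuple

Spectrogram = List[List[float]]  # spec[f][t]: magnitude at frequency bin f, frame t


def _median_filter_1d(values: List[float], width: int) -> List[float]:
    """Sliding median with the window truncated at the edges."""
    half = width // 2
    n = len(values)
    return [median(values[max(0, i - half): min(n, i + half + 1)]) for i in range(n)]


def separate(spec: Spectrogram, width: int = 3) -> Tuple[Spectrogram, Spectrogram]:
    """Split a magnitude spectrogram into (transient, steady_state) parts."""
    n_freq, n_time = len(spec), len(spec[0])
    # Median along time, per frequency bin: suppresses short events.
    steady_est = [_median_filter_1d(row, width) for row in spec]
    # Median along frequency, per frame: suppresses narrowband tones.
    cols = [
        _median_filter_1d([spec[f][t] for f in range(n_freq)], width)
        for t in range(n_time)
    ]
    # Binary mask: a bin is transient where the frequency-direction median wins.
    transient = [
        [spec[f][t] if cols[t][f] > steady_est[f][t] else 0.0 for t in range(n_time)]
        for f in range(n_freq)
    ]
    steady = [
        [spec[f][t] - transient[f][t] for t in range(n_time)]
        for f in range(n_freq)
    ]
    return transient, steady
```

For a spectrogram containing one horizontal line (a sustained tone) and one vertical line (a click), the tone ends up in the steady-state part and the click in the transient part, except at their crossing bin, which this tie-breaking rule assigns to the steady-state part.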
Towards End-to-End Acoustic Localization using Deep Learning: from Audio Signal to Source Position Coordinates
This paper presents a novel approach for indoor acoustic source localization
using microphone arrays and based on a Convolutional Neural Network (CNN). The
proposed solution is, to the best of our knowledge, the first published work in
which the CNN is designed to directly estimate the three dimensional position
of an acoustic source, using the raw audio signal as the input information and
avoiding the use of hand-crafted audio features. Given the limited amount of
available localization data, we propose in this paper a training strategy based
on two steps. We first train our network using semi-synthetic data, generated
from close-talk speech recordings, in which we simulate the time delays and
distortion suffered by the signal as it propagates from the source to the array
of microphones. We then fine-tune this network using a small amount of real
data. Our experimental results show that this strategy is able to produce
networks that significantly improve on existing localization methods based on
SRP-PHAT strategies. In addition, our experiments show that our CNN
method is more robust to variations in speaker gender and to
different window sizes than the other methods. Comment: 18 pages, 3 figures, 8 tables
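The time delays simulated for the semi-synthetic data follow from simple geometry, which the sketch below illustrates. It is not the paper's data generation pipeline: it assumes free-field propagation with no reverberation or attenuation, a speed of sound of 343 m/s, and hypothetical function names.

```python
import math
from typing import List, Sequence

SPEED_OF_SOUND = 343.0  # m/s, an assumed value for air at room temperature

Point = Sequence[float]  # (x, y, z) coordinates in metres


def propagation_delays(source: Point, mics: Sequence[Point], fs: float) -> List[float]:
    """Delay, in (fractional) samples, from the source to each microphone."""
    return [math.dist(source, mic) / SPEED_OF_SOUND * fs for mic in mics]


def tdoa(source: Point, mic_a: Point, mic_b: Point) -> float:
    """Time difference of arrival (seconds) between two microphones.

    Positive when the sound reaches mic_b first.
    """
    return (math.dist(source, mic_a) - math.dist(source, mic_b)) / SPEED_OF_SOUND
```

Delaying a close-talk recording by these per-microphone amounts gives a crude multichannel signal for a chosen source position; the same time differences are what steered-response-power methods such as SRP-PHAT search over.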
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation of speech signals and the methods for speech-feature extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes.