491 research outputs found
A Subband-Based SVM Front-End for Robust ASR
This work proposes a novel support vector machine (SVM) based robust
automatic speech recognition (ASR) front-end that operates on an ensemble of
the subband components of high-dimensional acoustic waveforms. The key issues
of selecting the appropriate SVM kernels for classification in frequency
subbands and the combination of individual subband classifiers using ensemble
methods are addressed. The proposed front-end is compared with state-of-the-art
ASR front-ends in terms of robustness to additive noise and linear filtering.
Experiments performed on the TIMIT phoneme classification task demonstrate the
benefits of the proposed subband based SVM front-end: it outperforms the
standard cepstral front-end in the presence of noise and linear filtering for
signal-to-noise ratio (SNR) below 12-dB. A combination of the proposed
front-end with a conventional front-end such as MFCC yields further
improvements over the individual front ends across the full range of noise
levels
Block-Online Multi-Channel Speech Enhancement Using DNN-Supported Relative Transfer Function Estimates
This work addresses the problem of block-online processing for multi-channel
speech enhancement. Such processing is vital in scenarios with moving speakers
and/or when very short utterances are processed, e.g., in voice assistant
scenarios. We consider several variants of a system that performs beamforming
supported by DNN-based voice activity detection (VAD) followed by
post-filtering. The speaker is targeted through estimating relative transfer
functions between microphones. Each block of the input signals is processed
independently in order to make the method applicable in highly dynamic
environments. Owing to the short length of the processed block, the statistics
required by the beamformer are estimated less precisely. The influence of this
inaccuracy is studied and compared to the processing regime when recordings are
treated as one block (batch processing). The experimental evaluation of the
proposed method is performed on large datasets of CHiME-4 and on another
dataset featuring moving target speaker. The experiments are evaluated in terms
of objective and perceptual criteria (such as signal-to-interference ratio
(SIR) or perceptual evaluation of speech quality (PESQ), respectively).
Moreover, word error rate (WER) achieved by a baseline automatic speech
recognition system is evaluated, for which the enhancement method serves as a
front-end solution. The results indicate that the proposed method is robust
with respect to short length of the processed block. Significant improvements
in terms of the criteria and WER are observed even for the block length of 250
ms.Comment: 10 pages, 8 figures, 4 tables. Modified version of the article
accepted for publication in IET Signal Processing journal. Original results
unchanged, additional experiments presented, refined discussion and
conclusion
Euclidean Distance Matrices: Essential Theory, Algorithms and Applications
Euclidean distance matrices (EDM) are matrices of squared distances between
points. The definition is deceivingly simple: thanks to their many useful
properties they have found applications in psychometrics, crystallography,
machine learning, wireless sensor networks, acoustics, and more. Despite the
usefulness of EDMs, they seem to be insufficiently known in the signal
processing community. Our goal is to rectify this mishap in a concise tutorial.
We review the fundamental properties of EDMs, such as rank or
(non)definiteness. We show how various EDM properties can be used to design
algorithms for completing and denoising distance data. Along the way, we
demonstrate applications to microphone position calibration, ultrasound
tomography, room reconstruction from echoes and phase retrieval. By spelling
out the essential algorithms, we hope to fast-track the readers in applying
EDMs to their own problems. Matlab code for all the described algorithms, and
to generate the figures in the paper, is available online. Finally, we suggest
directions for further research.Comment: - 17 pages, 12 figures, to appear in IEEE Signal Processing Magazine
- change of title in the last revisio
Two-Channel Speech Enhancement and Implementation Considerations: Noise Reduction and Speech Quality
Instantaneous PSD Estimation for Speech Enhancement based on Generalized Principal Components
Power spectral density (PSD) estimates of various microphone signal
components are essential to many speech enhancement procedures. As speech is
highly non-nonstationary, performance improvements may be gained by maintaining
time-variations in PSD estimates. In this paper, we propose an instantaneous
PSD estimation approach based on generalized principal components. Similarly to
other eigenspace-based PSD estimation approaches, we rely on recursive
averaging in order to obtain a microphone signal correlation matrix estimate to
be decomposed. However, instead of estimating the PSDs directly from the
temporally smooth generalized eigenvalues of this matrix, yielding temporally
smooth PSD estimates, we propose to estimate the PSDs from newly defined
instantaneous generalized eigenvalues, yielding instantaneous PSD estimates.
The instantaneous generalized eigenvalues are defined from the generalized
principal components, i.e. a generalized eigenvector-based transform of the
microphone signals. We further show that the smooth generalized eigenvalues can
be understood as a recursive average of the instantaneous generalized
eigenvalues. Simulation results comparing the multi-channel Wiener filter (MWF)
with smooth and instantaneous PSD estimates indicate better speech enhancement
performance for the latter. A MATLAB implementation is available online
Direct and Residual Subspace Decomposition of Spatial Room Impulse Responses
Psychoacoustic experiments have shown that directional properties of the direct sound, salient reflections, and the late reverberation of an acoustic room response can have a distinct influence on the auditory perception of a given room. Spatial room impulse responses (SRIRs) capture those properties and thus are used for direction-dependent room acoustic analysis and virtual acoustic rendering. This work proposes a subspace method that decomposes SRIRs into a direct part, which comprises the direct sound and the salient reflections, and a residual, to facilitate enhanced analysis and rendering methods by providing individual access to these components. The proposed method is based on the generalized singular value decomposition and interprets the residual as noise that is to be separated from the other components of the reverberation. Large generalized singular values are attributed to the direct part, which is then obtained as a low-rank approximation of the SRIR. By advancing from the end of the SRIR toward the beginning while iteratively updating the residual estimate, the method adapts to spatio-temporal variations of the residual. The method is evaluated using a spatio-spectral error measure and simulated SRIRs of different rooms, microphone arrays, and ratios of direct sound to residual energy. The proposed method creates lower errors than existing approaches in all tested scenarios, including a scenario with two simultaneous reflections. A case study with measured SRIRs shows the applicability of the method under real-world acoustic conditions. A reference implementation is provided
- …