76 research outputs found
Low-rank and Sparse Soft Targets to Learn Better DNN Acoustic Models
Conventional deep neural networks (DNN) for speech acoustic modeling rely on
Gaussian mixture models (GMM) and hidden Markov model (HMM) to obtain binary
class labels as the targets for DNN training. Subword classes in speech
recognition systems correspond to context-dependent tied states or senones. The
present work addresses some limitations of GMM-HMM senone alignments for DNN
training. We hypothesize that the senone probabilities obtained from a DNN
trained with binary labels can provide more accurate targets to learn better
acoustic models. However, DNN outputs bear inaccuracies which are exhibited as
high dimensional unstructured noise, whereas the informative components are
structured and low-dimensional. We exploit principal component analysis (PCA)
and sparse coding to characterize the senone subspaces. Enhanced probabilities
obtained from low-rank and sparse reconstructions are used as soft-targets for
DNN acoustic modeling, which also enables training with untranscribed data.
Experiments conducted on the AMI corpus show a 4.6% relative reduction in word error
rate
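The low-rank reconstruction idea in the abstract above can be illustrated with a minimal sketch. The abstract does not give the exact pipeline, so several choices here are assumptions: the function name low_rank_soft_targets is hypothetical, PCA is applied to log-posteriors of frames that share one senone label, and the reconstruction is renormalized with a softmax. The sparse coding stage is not shown.

```python
import numpy as np

def low_rank_soft_targets(posteriors, rank):
    """Project posteriors onto a low-dimensional PCA subspace.

    posteriors: array of shape (frames, classes), each row sums to 1.
    rank: number of principal components to keep (assumed hyperparameter).
    """
    # Work in the log domain (an assumption, not stated in the abstract).
    logp = np.log(posteriors + 1e-10)
    mean = logp.mean(axis=0, keepdims=True)
    centered = logp - mean

    # PCA via SVD: rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:rank]

    # Low-rank reconstruction, then back to the probability simplex.
    recon = centered @ basis.T @ basis + mean
    recon -= recon.max(axis=1, keepdims=True)  # numerical stability
    out = np.exp(recon)
    return out / out.sum(axis=1, keepdims=True)

# Toy usage with synthetic posteriors (not real DNN outputs).
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(8), size=50)
soft = low_rank_soft_targets(P, rank=3)
```

Keeping only a few components discards the directions with the least variance, which is where the abstract locates the unstructured noise.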
A New Identity for the Least-square Solution of Overdetermined Set of Linear Equations
In this paper, we prove a new identity for the least-square solution of an
over-determined set of linear equations $Ax=b$, where $A$ is an $m \times n$
full-rank matrix, $b$ is a column vector of dimension $m$, and $m$ (the number
of equations) is larger than or equal to $n$ (the dimension of the unknown
vector $x$). Generally, the equations are inconsistent and there is no feasible
solution for $x$ unless $b$ belongs to the column span of $A$. In the
least-square approach, a candidate solution is found as the unique $x$ that
minimizes the error function $\|Ax-b\|_2$.
We propose a more general approach that consists in considering all the
consistent subsets of the equations, finding their solutions, and taking a
weighted average of them to build a candidate solution. In particular, we show
that by weighting the solutions with the squared determinants of their
coefficient matrices, the resulting candidate solution coincides with the
least-square solution
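The identity stated in this abstract can be checked numerically. The sketch below assumes that the "consistent subsets" are the square subsystems formed by choosing as many rows as there are unknowns; the function name determinant_weighted_solution and the random test problem are illustrative only.

```python
import itertools
import numpy as np

def determinant_weighted_solution(A, b):
    """Average the solutions of all square row-subsystems of A x = b,
    weighting each by the squared determinant of its coefficient matrix."""
    m, n = A.shape
    numerator = np.zeros(n)
    denominator = 0.0
    for rows in itertools.combinations(range(m), n):
        idx = list(rows)
        A_sub = A[idx]
        det = np.linalg.det(A_sub)
        if abs(det) < 1e-12:
            continue  # singular subsystems carry zero weight
        weight = det ** 2
        numerator += weight * np.linalg.solve(A_sub, b[idx])
        denominator += weight
    return numerator / denominator

# Random over-determined system: 6 equations, 3 unknowns.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)

x_avg = determinant_weighted_solution(A, b)
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_avg, x_ls))
```

The loop visits every combination of rows, so its cost grows combinatorially with the number of equations; it is meant as a check of the identity, not as a practical solver.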
Exploiting Low-dimensional Structures to Enhance DNN Based Acoustic Modeling in Speech Recognition
We propose to model the acoustic space of deep neural network (DNN)
class-conditional posterior probabilities as a union of low-dimensional
subspaces. To that end, the training posteriors are used for dictionary
learning and sparse coding. Sparse representation of the test posteriors using
this dictionary enables projection to the space of training data. Relying on
the fact that the intrinsic dimensions of the posterior subspaces are indeed
very small and the matrix of all posteriors belonging to a class has a very low
rank, we demonstrate how low-dimensional structures enable further enhancement
of the posteriors and rectify the spurious errors due to mismatched conditions.
The enhanced acoustic modeling method leads to improvements in a continuous
speech recognition task using the hybrid DNN-HMM (hidden Markov model) framework in
both clean and noisy conditions, where up to 15.4% relative reduction in word
error rate (WER) is achieved
Structured Sparsity Models for Multiparty Speech Recovery from Reverberant Recordings
We tackle the multi-party speech recovery problem by modeling the
acoustics of reverberant chambers. Our approach exploits structured sparsity
models to perform room modeling and speech recovery. We propose a scheme for
characterizing the room acoustics from the unknown competing speech sources
relying on localization of the early images of the speakers by sparse
approximation of the spatial spectra of the virtual sources in a free-space
model. The images are then clustered exploiting the low-rank structure of the
spectro-temporal components belonging to each source. This enables us to
identify the early support of the room impulse response function and its unique
map to the room geometry. To further tackle the ambiguity of the reflection
ratios, we propose a novel formulation of the reverberation model and estimate
the absorption coefficients through a convex optimization exploiting joint
sparsity model formulated upon spatio-spectral sparsity of concurrent speech
representation. The acoustic parameters are then incorporated for separating
individual speech signals through either structured sparse recovery or inverse
filtering the acoustic channels. The experiments conducted on real data
recordings demonstrate the effectiveness of the proposed approach for
multi-party speech recovery and recognition. Comment: 31 page
Ad Hoc Microphone Array Calibration: Euclidean Distance Matrix Completion Algorithm and Theoretical Guarantees
This paper addresses the problem of ad hoc microphone array calibration where
only partial information about the distances between microphones is available.
We construct a matrix consisting of the pairwise distances and propose to
estimate the missing entries based on a novel Euclidean distance matrix
completion algorithm by alternating between low-rank matrix completion and projection
onto the Euclidean distance space. This approach confines the recovered matrix
to the EDM cone at each iteration of the matrix completion algorithm. The
theoretical guarantees of the calibration performance are obtained considering
the random and locally structured missing entries as well as the measurement
noise on the known distances. This study elucidates the links between the
calibration error and the number of microphones along with the noise level and
the ratio of missing distances. Thorough experiments on real data recordings
and simulated setups are conducted to demonstrate these theoretical insights. A
significant improvement is achieved by the proposed Euclidean distance matrix
completion algorithm over the state-of-the-art techniques for ad hoc microphone
array calibration. Comment: In Press, available online, August 1, 2014.
http://www.sciencedirect.com/science/article/pii/S0165168414003508, Signal
Processing, 201
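The alternating scheme described in this abstract can be sketched in simplified form. This is not the paper's algorithm: the sketch merges the low-rank step and the Euclidean distance projection into a single step based on the Gram matrix (as in classical multidimensional scaling), and it works with squared distances. The function names, the iteration count, and the mean-value initialization are all assumptions.

```python
import numpy as np

def project_to_edm(D, dim):
    """Map a symmetric matrix of squared distances to a valid Euclidean
    distance matrix with embedding dimension at most `dim`."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    G = -0.5 * J @ D @ J                     # Gram matrix of centered points
    G = 0.5 * (G + G.T)
    vals, vecs = np.linalg.eigh(G)
    top = np.argsort(vals)[-dim:]            # indices of largest eigenvalues
    vals_top = np.clip(vals[top], 0.0, None) # drop negative eigenvalues
    G_low = (vecs[:, top] * vals_top) @ vecs[:, top].T
    g = np.diag(G_low)
    return g[:, None] + g[None, :] - 2.0 * G_low

def complete_edm(D_obs, mask, dim, n_iter=200):
    """Fill missing squared distances (mask == False) by alternating
    between the projection above and re-imposing the known entries."""
    D = np.where(mask, D_obs, D_obs[mask].mean())
    np.fill_diagonal(D, 0.0)
    for _ in range(n_iter):
        D = project_to_edm(D, dim)
        D = np.where(mask, D_obs, D)
    return project_to_edm(D, dim)

# Toy setup: 8 microphones in 3-D with two missing pairwise distances.
rng = np.random.default_rng(0)
points = rng.standard_normal((8, 3))
diff = points[:, None, :] - points[None, :, :]
D_true = (diff ** 2).sum(axis=-1)
mask = np.ones((8, 8), dtype=bool)
for i, j in [(0, 5), (2, 7)]:
    mask[i, j] = mask[j, i] = False
D_hat = complete_edm(np.where(mask, D_true, 0.0), mask, dim=3)
```

Ending on a projection step guarantees that the returned matrix is a valid squared-distance matrix, at the price of not matching the known entries exactly when the iteration has not converged.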
Convexity in source separation: Models, geometry, and algorithms
Source separation or demixing is the process of extracting multiple
components entangled within a signal. Contemporary signal processing presents a
host of difficult source separation problems, from interference cancellation to
background subtraction, blind deconvolution, and even dictionary learning.
Despite the recent progress in each of these applications, advances in
high-throughput sensor technology place demixing algorithms under pressure to
accommodate extremely high-dimensional signals, separate an ever larger number
of sources, and cope with more sophisticated signal and mixing models. These
difficulties are exacerbated by the need for real-time action in automated
decision-making systems.
Recent advances in convex optimization provide a simple framework for
efficiently solving numerous difficult demixing problems. This article provides
an overview of the emerging field, explains the theory that governs the
underlying procedures, and surveys algorithms that solve them efficiently. We
aim to equip practitioners with a toolkit for constructing their own demixing
algorithms that work, as well as concrete intuition for why they work
Speaker Direction Finding for Practical Systems: A Comparison of Different Approaches
Speaker direction finding techniques have attracted interest because they make it possible to capture high-quality distant signals. Comparing such techniques is instructive when the goal is to obtain high-quality signals at reasonable computational complexity. With this aim in mind, this paper presents a critical comparison between two traditional techniques: Time-Difference of Arrival (TDOA) estimation by Generalized Cross Correlation (GCC) and space scanning by the Steered Response Power (SRP) of a beamformer. Each is analyzed under diverse conditions of noise and reverberation. Simulation results and experiments based on real data show that SRP, with short data segments and owing to its averaging over the spatial dimension, yields better accuracy than GCC. These results have motivated a new method for estimating the source direction from a set of TDOAs based on spatial curvature collision. This paper discusses how this procedure reduces the computational cost by more than 50 times compared to the conventional method of Root Mean Square (RMS) error minimization over the candidate locations
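As a rough illustration of TDOA estimation by generalized cross correlation, the sketch below uses the PHAT weighting, which is one common choice; the abstract does not say which weighting the paper uses. The function name gcc_phat, the sampling rate, and the synthetic two-channel signal are assumptions for the example. SRP scanning is not shown.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the delay of `sig` relative to `ref`, in seconds,
    using generalized cross correlation with PHAT weighting."""
    n = sig.size + ref.size                  # zero-pad to avoid wrap-around
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12           # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)

    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Reorder so that lags run from -max_shift to +max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(cc)) - max_shift
    return shift / fs

# Synthetic example: the second channel lags the first by 5 samples.
fs = 16000
rng = np.random.default_rng(0)
ref = rng.standard_normal(1024)
sig = np.concatenate((np.zeros(5), ref[:-5]))
tau = gcc_phat(sig, ref, fs)
```

A positive result means `sig` arrives later than `ref`. With a known microphone spacing, such a delay can be converted to a direction estimate.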
Redundant Hash Addressing for Large-Scale Query by Example Spoken Query Detection
State-of-the-art query-by-example spoken term detection (QbE-STD) systems rely on representing speech as sequences of class-conditional posterior probabilities estimated by a deep neural network (DNN). The posteriors are often used for pattern matching or dynamic time warping (DTW). Using posterior probabilities as the speech representation offers several advantages in a classification system. One key property of the posterior representations is that they admit a highly effective hashing strategy that enables indexing the large archive in divisions to reduce the search complexity. Moreover, posterior indexing leads to a compressed representation and enables pronunciation dewarping and partial detection with no need for DTW. We exploit these characteristics of the posterior space in the context of redundant hash addressing for QbE-STD. We evaluate the QbE-STD system on the AMI corpus and demonstrate that a tremendous speedup and superior accuracy are achieved compared to state-of-the-art pattern matching and DTW solutions. The system has great potential to enable massively large-scale query detection