9,207 research outputs found
Sub-Banded Reconstructed Phase Spaces for Speech Recognition
A novel method combining filter banks and reconstructed phase spaces is proposed for the modeling and classification of speech. Reconstructed phase spaces, which are based on dynamical systems theory, have advantages over spectral-based analysis methods in that they can capture nonlinear or higher-order statistics. Recent work has shown that the natural measure of a reconstructed phase space can be used for modeling and classification of phonemes. In this work, sub-banding of speech, which has been examined for recognition of noise-corrupted speech, is studied in combination with phase space reconstruction. This sub-banding, which is motivated by empirical psychoacoustical studies, is shown to dramatically improve the phoneme classification accuracy of reconstructed phase space-based approaches. Experiments that examine the performance of fused sub-banded reconstructed phase spaces for phoneme classification are presented. Comparisons against a cepstral-based classifier show that the proposed approach is competitive with state-of-the-art methods for modeling and classification of phonemes. Combination of cepstral-based features and the sub-band RPS features shows improvement over a cepstral-only baseline
Speech enhancement with frequency domain auto-regressive modeling
Speech applications in far-field real world settings often deal with signals
that are corrupted by reverberation. The task of dereverberation constitutes an
important step to improve the audible quality and to reduce the error rates in
applications like automatic speech recognition (ASR). We propose a unified
framework of speech dereverberation for improving the speech quality and the
ASR performance using the approach of envelope-carrier decomposition provided
by an autoregressive (AR) model. The AR model is applied in the frequency
domain of the sub-band speech signals to separate the envelope and carrier
parts. A novel neural architecture based on dual path long short term memory
(DPLSTM) model is proposed, which jointly enhances the sub-band envelope and
carrier components. The dereverberated envelope-carrier signals are modulated
and the sub-band signals are synthesized to reconstruct the audio signal back.
The DPLSTM model for dereverberation of envelope and carrier components also
allows the joint learning of the network weights for the down stream ASR task.
In the ASR tasks on the REVERB challenge dataset as well as on the VOiCES
dataset, we illustrate that the joint learning of speech dereverberation
network and the E2E ASR model yields significant performance improvements over
the baseline ASR system trained on log-mel spectrogram as well as other
benchmarks for dereverberation (average relative improvements of 10-24% over
the baseline system). The speech quality improvements, evaluated using
subjective listening tests, further highlight the improved quality of the
reconstructed audio.Comment: 10 page
Multi-Classifiers And Decision Fusion For Robust Statistical Pattern Recognition With Applications To Hyperspectral Classification
In this dissertation, a multi-classifier, decision fusion framework is proposed for robust classification of high dimensional data in small-sample-size conditions. Such datasets present two key challenges. (1) The high dimensional feature spaces compromise the classifiers’ generalization ability in that the classifier tends to overit decision boundaries to the training data. This phenomenon is commonly known as the Hughes phenomenon in the pattern classification community. (2) The small-sample-size of the training data results in ill-conditioned estimates of its statistics. Most classifiers rely on accurate estimation of these statistics for modeling training data and labeling test data, and hence ill-conditioned statistical estimates result in poorer classification performance. This dissertation tests the efficacy of the proposed algorithms to classify primarily remotely sensed hyperspectral data and secondarily diagnostic digital mammograms, since these applications naturally result in very high dimensional feature spaces and often do not have sufficiently large training datasets to support the dimensionality of the feature space. Conventional approaches, such as Stepwise LDA (S-LDA) are sub-optimal, in that they utilize a small subset of the rich spectral information provided by hyperspectral data for classification. In contrast, the approach proposed in this dissertation utilizes the entire high dimensional feature space for classification by identifying a suitable partition of this space, employing a bank-of-classifiers to perform “local” classification over this partition, and then merging these local decisions using an appropriate decision fusion mechanism. Adaptive classifier weight assignment and nonlinear pre-processing (in kernel induced spaces) are also proposed within this framework to improve its robustness over a wide range of fidelity conditions. Experimental results demonstrate that the proposed framework results in significant improvements in classification accuracies (as high as a 12% increase) over conventional approaches
Multichannel Speech Separation and Enhancement Using the Convolutive Transfer Function
This paper addresses the problem of speech separation and enhancement from
multichannel convolutive and noisy mixtures, \emph{assuming known mixing
filters}. We propose to perform the speech separation and enhancement task in
the short-time Fourier transform domain, using the convolutive transfer
function (CTF) approximation. Compared to time-domain filters, CTF has much
less taps, consequently it has less near-common zeros among channels and less
computational complexity. The work proposes three speech-source recovery
methods, namely: i) the multichannel inverse filtering method, i.e. the
multiple input/output inverse theorem (MINT), is exploited in the CTF domain,
and for the multi-source case, ii) a beamforming-like multichannel inverse
filtering method applying single source MINT and using power minimization,
which is suitable whenever the source CTFs are not all known, and iii) a
constrained Lasso method, where the sources are recovered by minimizing the
-norm to impose their spectral sparsity, with the constraint that the
-norm fitting cost, between the microphone signals and the mixing model
involving the unknown source signals, is less than a tolerance. The noise can
be reduced by setting a tolerance onto the noise power. Experiments under
various acoustic conditions are carried out to evaluate the three proposed
methods. The comparison between them as well as with the baseline methods is
presented.Comment: Submitted to IEEE/ACM Transactions on Audio, Speech and Language
Processin
Multi-Classifiers And Decision Fusion For Robust Statistical Pattern Recognition With Applications To Hyperspectral Classification
In this dissertation, a multi-classifier, decision fusion framework is proposed for robust classification of high dimensional data in small-sample-size conditions. Such datasets present two key challenges. (1) The high dimensional feature spaces compromise the classifiers’ generalization ability in that the classifier tends to overit decision boundaries to the training data. This phenomenon is commonly known as the Hughes phenomenon in the pattern classification community. (2) The small-sample-size of the training data results in ill-conditioned estimates of its statistics. Most classifiers rely on accurate estimation of these statistics for modeling training data and labeling test data, and hence ill-conditioned statistical estimates result in poorer classification performance. This dissertation tests the efficacy of the proposed algorithms to classify primarily remotely sensed hyperspectral data and secondarily diagnostic digital mammograms, since these applications naturally result in very high dimensional feature spaces and often do not have sufficiently large training datasets to support the dimensionality of the feature space. Conventional approaches, such as Stepwise LDA (S-LDA) are sub-optimal, in that they utilize a small subset of the rich spectral information provided by hyperspectral data for classification. In contrast, the approach proposed in this dissertation utilizes the entire high dimensional feature space for classification by identifying a suitable partition of this space, employing a bank-of-classifiers to perform “local” classification over this partition, and then merging these local decisions using an appropriate decision fusion mechanism. Adaptive classifier weight assignment and nonlinear pre-processing (in kernel induced spaces) are also proposed within this framework to improve its robustness over a wide range of fidelity conditions. Experimental results demonstrate that the proposed framework results in significant improvements in classification accuracies (as high as a 12% increase) over conventional approaches
A Fusion Framework for Camouflaged Moving Foreground Detection in the Wavelet Domain
Detecting camouflaged moving foreground objects has been known to be
difficult due to the similarity between the foreground objects and the
background. Conventional methods cannot distinguish the foreground from
background due to the small differences between them and thus suffer from
under-detection of the camouflaged foreground objects. In this paper, we
present a fusion framework to address this problem in the wavelet domain. We
first show that the small differences in the image domain can be highlighted in
certain wavelet bands. Then the likelihood of each wavelet coefficient being
foreground is estimated by formulating foreground and background models for
each wavelet band. The proposed framework effectively aggregates the
likelihoods from different wavelet bands based on the characteristics of the
wavelet transform. Experimental results demonstrated that the proposed method
significantly outperformed existing methods in detecting camouflaged foreground
objects. Specifically, the average F-measure for the proposed algorithm was
0.87, compared to 0.71 to 0.8 for the other state-of-the-art methods.Comment: 13 pages, accepted by IEEE TI
- …