Beyond Domain Adaptation: Unseen Domain Encapsulation via Universal Non-volume Preserving Models
Recognition across domains has recently become an active topic in the
research community. However, the problem of recognition in new, unseen
domains has been largely overlooked. Under this condition, the deployed deep
network models cannot be updated, adapted or fine-tuned. Therefore,
recent deep learning techniques, such as domain adaptation, feature
transferring, and fine-tuning, cannot be applied. This paper presents a novel
Universal Non-volume Preserving approach to the problem of domain
generalization in the context of deep learning. The proposed method can be
easily incorporated with any other ConvNet framework within an end-to-end deep
network design to improve the performance. On digit recognition, we benchmark
on four popular digit recognition databases, i.e. MNIST, USPS, SVHN and
MNIST-M. The proposed method is also evaluated on face recognition using the
Extended Yale-B, CMU-PIE and CMU-MPIE databases and compared against other
state-of-the-art methods. On the problem of pedestrian detection, we
empirically observe that the proposed method learns models that improve
performance across a priori unknown data distributions.
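The abstract does not detail the architecture, but "non-volume preserving" refers to affine coupling transforms of the real-NVP family, whose Jacobian determinant is generally non-unit. Below is a generic sketch of such a coupling layer, not the paper's exact model; the linear maps `W_s` and `W_t` are stand-ins for the learned scale and translation networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy scale/translation maps: fixed random linear layers stand in for
# the small neural networks used in a real flow.
W_s = rng.normal(size=(2, 2)) * 0.1
W_t = rng.normal(size=(2, 2)) * 0.1

def coupling_forward(x):
    """Affine (non-volume-preserving) coupling: split the features and
    transform the second half conditioned on the first half."""
    x1, x2 = x[:2], x[2:]
    s = np.tanh(x1 @ W_s)          # log-scale, bounded for stability
    t = x1 @ W_t                   # translation
    y2 = x2 * np.exp(s) + t
    log_det = s.sum()              # volume change = sum of log-scales
    return np.concatenate([x1, y2]), log_det

def coupling_inverse(y):
    """Exact inverse: undo translation, then scaling."""
    y1, y2 = y[:2], y[2:]
    s = np.tanh(y1 @ W_s)
    t = y1 @ W_t
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2])

x = rng.normal(size=4)
y, log_det = coupling_forward(x)
x_rec = coupling_inverse(y)
print(np.allclose(x, x_rec))  # True: the transform is exactly invertible
```

Because the forward pass is invertible in closed form with a cheap log-determinant, such layers can be stacked inside any ConvNet pipeline and trained end-to-end, which is the property the paper exploits.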
SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech
Zero-shot multi-speaker text-to-speech (ZSM-TTS) models aim to generate a
speech sample with the voice characteristic of an unseen speaker. The main
challenge of ZSM-TTS is to increase the overall speaker similarity for unseen
speakers. One of the most successful speaker conditioning methods for
flow-based multi-speaker text-to-speech (TTS) models is to utilize the
functions which predict the scale and bias parameters of the affine coupling
layers according to the given speaker embedding vector. In this letter, we
improve on the previous speaker conditioning method by introducing a
speaker-normalized affine coupling (SNAC) layer which allows for unseen speaker
speech synthesis in a zero-shot manner leveraging a normalization-based
conditioning technique. The newly designed coupling layer explicitly normalizes
the input by parameters predicted from a speaker embedding vector during
training, enabling an inverse denormalization with a new speaker
embedding at inference. The proposed conditioning scheme yields
state-of-the-art performance in terms of speech quality and speaker
similarity in a ZSM-TTS setting.
Comment: Accepted to IEEE Signal Processing Letters
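The normalize-then-denormalize mechanics described above can be sketched as follows. The abstract does not specify the predictor that maps a speaker embedding to the mean and scale, so fixed stand-in values are used here:

```python
import numpy as np

def snac_normalize(x, spk_mean, spk_scale, eps=1e-5):
    """Training direction: normalize the coupling-layer input by
    statistics predicted from the speaker embedding."""
    return (x - spk_mean) / (spk_scale + eps)

def snac_denormalize(h, spk_mean, spk_scale, eps=1e-5):
    """Inference direction: denormalize with the (possibly unseen)
    target speaker's predicted statistics."""
    return h * (spk_scale + eps) + spk_mean

x = np.array([0.5, -1.2, 2.0])
# Stand-ins for the outputs of the speaker-conditioned predictor network:
mean_a, scale_a = 0.3, 1.4
h = snac_normalize(x, mean_a, scale_a)
x_rec = snac_denormalize(h, mean_a, scale_a)
print(np.allclose(x, x_rec))  # True: normalization is exactly invertible
```

Because the operation is a speaker-conditioned affine map, it slots into a flow-based TTS model without breaking invertibility; swapping in a new speaker's predicted statistics at inference is what enables zero-shot synthesis.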
Noise invariant frame selection: a simple method to address the background noise problem for text-independent speaker verification
The performance of speaker-related systems usually degrades heavily in practical applications, largely due to background noise. To improve the robustness of such systems in unknown noisy environments, this paper proposes a simple pre-processing method called Noise Invariant Frame Selection (NIFS). Based on several noise-related constraints, it selects noise-invariant frames from utterances to represent speakers. Experiments conducted on the TIMIT database showed that NIFS can significantly improve the performance of Vector Quantization (VQ), Gaussian Mixture Model-Universal Background Model (GMM-UBM) and i-vector-based speaker verification systems in different unknown noisy environments with different SNRs, in comparison to their baselines. Meanwhile, the proposed NIFS-based speaker verification systems achieve similar performance when we change the constraints (hyper-parameters) or features, which indicates that the method is easy to reproduce. Since NIFS is designed as a general algorithm, it could be further applied to other similar tasks.
Likelihood-Maximizing-Based Multiband Spectral Subtraction for Robust Speech Recognition
Automatic speech recognition performance degrades significantly when speech is affected by environmental noise. Nowadays, the major challenge is to achieve good robustness in adverse noisy conditions so that automatic speech recognizers can be used in real situations. Spectral subtraction (SS) is a well-known and effective approach; it was originally designed for improving the quality of speech signals as judged by human listeners. SS techniques usually improve the quality and intelligibility of the speech signal, while speech recognition systems need compensation techniques to reduce the mismatch between noisy speech features and the clean-trained acoustic model. Nevertheless, a correlation can be expected between speech quality improvement and the increase in recognition accuracy. This paper proposes a novel approach to this problem by considering SS and the speech recognizer not as two independent entities cascaded together, but rather as two interconnected components of a single system, sharing the common goal of improved speech recognition accuracy. This incorporates important information from the statistical models of the recognition engine as feedback for tuning the SS parameters. By using this architecture, we overcome the drawbacks of previously proposed methods and achieve better recognition accuracy. Experimental evaluations show that the proposed method achieves significant improvements in recognition rates across a wide range of signal-to-noise ratios.
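A minimal sketch of the multiband spectral subtraction step the paper builds on. The per-band over-subtraction factors `alphas` are the kind of parameters the paper's likelihood-maximizing feedback loop would tune; the band layout and flooring rule here are generic textbook assumptions, not the paper's settings:

```python
import numpy as np

def multiband_spectral_subtraction(noisy_power, noise_power, band_edges,
                                   alphas, beta=0.01):
    """Subtract an over-estimated noise power per frequency band, then
    floor the result to avoid negative (musical-noise-prone) values.

    noisy_power, noise_power : power spectra, shape (n_bins,)
    band_edges : bin indices delimiting the bands, e.g. [0, 32, 64, 128]
    alphas     : per-band over-subtraction factors, len = n_bands
    beta       : spectral floor as a fraction of the noisy power
    """
    clean = np.empty_like(noisy_power)
    bands = zip(band_edges[:-1], band_edges[1:])
    for (lo, hi), alpha in zip(bands, alphas):
        sub = noisy_power[lo:hi] - alpha * noise_power[lo:hi]
        clean[lo:hi] = np.maximum(sub, beta * noisy_power[lo:hi])
    return clean

rng = np.random.default_rng(1)
noisy = np.abs(rng.normal(size=128)) ** 2 + 1.0   # toy noisy power spectrum
noise = np.full(128, 0.5)                          # toy noise estimate
est = multiband_spectral_subtraction(noisy, noise, [0, 32, 64, 128],
                                     alphas=[2.0, 1.5, 1.0])
print(est.min() > 0)   # the floor keeps the estimate strictly positive
```

In the paper's architecture, the recognizer's likelihood scores would replace a fixed choice of `alphas`, closing the loop between enhancement and recognition.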
Structural uncertainty identification using mode shape information
This thesis is concerned with efficient uncertainty identification (UI) – namely the nonlinear inverse problem of establishing specific statistical properties of an uncertain structure from a practically-limited supply of low-frequency dynamic response information. An established UI approach (published in 2005) which uses Maximum Likelihood Estimation (MLE) and the Perturbation Method of uncertainty propagation is adopted for the study using (for the first time) mode shape information rather than just natural or resonant frequencies. The thesis develops a method based on the use of selected coefficients in a generalized displacement model i.e. a weighted series of spatially-continuous multiply-differentiable base functions to approximate the structural free-vibration response of an uncertain structure. The focus is placed on the estimation (from relatively small data sets) of the statistical properties of the location of an attached point-mass with normally-distributed position.
Simulated data for uncertain point-mass-loaded linear beam and plate structures is initially used to test the method, making use of as much exact or closed-form differentiable information as possible to obtain frequencies and mode shapes. In the case of plate structures, extensive use is made of the Rayleigh-Ritz method to generate the required response coefficients. This is shown to have significant advantages over alternatives such as the Finite Element method. The approach developed for use with free-vibration information is then tested on measured experimental data obtained from an acoustically-forced clamped plate. Structural displacement measurements are taken from the plate using the Vibromap 1000, a commercially-available ESPI-based holomodal measurement system capable of wide-field vibration response observation in real time, or quantitative displacement response measurement.
The thesis shows that the developed uncertainty identification method works well for beams and plates using simulated free-vibration data.
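The Rayleigh-Ritz step mentioned above can be illustrated on the simplest relevant case: a simply supported Euler-Bernoulli beam carrying a point mass, with sine base functions. All material properties below are illustrative unit values, not the thesis's data:

```python
import numpy as np

# Rayleigh-Ritz estimate of the fundamental frequency of a simply
# supported Euler-Bernoulli beam with an attached point mass.
# Base functions phi_n(x) = sin(n*pi*x/L) are the exact bare-beam modes.
L, EI, rhoA = 1.0, 1.0, 1.0     # illustrative unit properties
mp, xp = 0.2, 0.5 * L           # point mass and its position
N = 6                           # number of base functions

n = np.arange(1, N + 1)
# Stiffness: EI * integral(phi_m'' * phi_n'') is diagonal for sine bases
K = np.diag(EI * (n * np.pi / L) ** 4 * L / 2)
# Mass: distributed beam mass (diagonal) plus the point-mass contribution
phi_p = np.sin(n * np.pi * xp / L)
M = np.diag(np.full(N, rhoA * L / 2)) + mp * np.outer(phi_p, phi_p)

# Generalized eigenproblem K q = w^2 M q
w2 = np.sort(np.real(np.linalg.eigvals(np.linalg.solve(M, K))))
w1 = np.sqrt(w2[0])

w1_bare = (np.pi / L) ** 2 * np.sqrt(EI / rhoA)   # exact bare-beam value
print(w1 < w1_bare)   # True: the attached mass lowers the fundamental frequency
```

Because `K` and the distributed part of `M` stay diagonal in this basis, only the rank-one point-mass term couples the coefficients, which is why the method scales cheaply compared with a Finite Element discretization when the uncertain parameter is the mass position.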
A Nonlinear Mixture Autoregressive Model For Speaker Verification
In this work, we apply a nonlinear mixture autoregressive (MixAR) model to supplant the Gaussian mixture model for speaker verification. MixAR is a statistical model that is a probabilistically weighted combination of components, each of which is an autoregressive filter in addition to a mean. The probabilistic mixing and the data-dependent weights are responsible for the nonlinear nature of the model. Our experiments with synthetic as well as real speech data from standard speech corpora show that the MixAR model outperforms the GMM, especially under unseen noisy conditions. Moreover, MixAR did not require delta features and used 2.5x fewer parameters to achieve performance comparable to or better than that of the GMM using static as well as delta features. Also, MixAR suffered less from overfitting issues than the GMM when training data was sparse. However, MixAR performance deteriorated more quickly than that of the GMM when the evaluation data duration was reduced. This could pose limitations on the minimum amount of evaluation data required when using the MixAR model for speaker verification.
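The MixAR density described above can be sketched as follows. The abstract does not give the form of the data-dependent weights, so a softmax gate on the recent samples is an assumption here, as are all parameter values:

```python
import numpy as np

def gaussian_logpdf(x, mean, std):
    """Log-density of a univariate Gaussian."""
    return -0.5 * np.log(2 * np.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def mixar_loglik(x, order, means, ar_coefs, stds, gate_w, gate_b):
    """Log-likelihood of a sequence under a K-component mixture
    autoregressive model: each component is an AR filter plus a mean,
    and the mixing weights depend on the recent history (softmax gate)."""
    K = len(means)
    ll = 0.0
    for t in range(order, len(x)):
        hist = x[t - order:t][::-1]            # most recent sample first
        logits = gate_w @ hist + gate_b        # data-dependent gating
        w = np.exp(logits - logits.max())
        w /= w.sum()
        comp = np.array([gaussian_logpdf(x[t], means[k] + ar_coefs[k] @ hist,
                                         stds[k])
                         for k in range(K)])
        ll += np.log(np.sum(w * np.exp(comp)))
    return ll

rng = np.random.default_rng(2)
x = rng.normal(size=50)                        # toy feature trajectory
ll = mixar_loglik(x, order=2,
                  means=np.array([0.0, 0.5]),
                  ar_coefs=np.array([[0.4, -0.2], [0.1, 0.3]]),
                  stds=np.array([1.0, 0.8]),
                  gate_w=rng.normal(size=(2, 2)) * 0.1,
                  gate_b=np.zeros(2))
print(np.isfinite(ll))
```

Setting all `ar_coefs` to zero and the gate weights to constants recovers an ordinary GMM, which makes the comparison in the abstract a direct one: the AR terms and data-dependent gating are exactly what the GMM lacks.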
Acoustic Adaptation to Dynamic Background Conditions with Asynchronous Transformations
This paper proposes a framework for performing adaptation to complex and non-stationary background conditions in Automatic Speech Recognition (ASR) by means of asynchronous Constrained Maximum Likelihood Linear Regression (aCMLLR) transforms and asynchronous Noise Adaptive Training (aNAT). The proposed method aims to apply the feature transform that best compensates the background for every input frame. The implementation is done with a new Hidden Markov Model (HMM) topology that expands the usual left-to-right HMM into parallel branches adapted to different background conditions and permits transitions among them. Using this, the proposed adaptation does not require ground truth or prior knowledge about the background in each frame, as it aims to maximise the overall log-likelihood of the decoded utterance. The proposed aCMLLR transforms can be further improved by retraining models in an aNAT fashion and by using speaker-based MLLR transforms in cascade for efficient modelling of background and speaker effects. An initial evaluation on a modified version of the WSJCAM0 corpus incorporating 7 different background conditions provides a benchmark in which to evaluate the use of aCMLLR transforms. A relative reduction of 40.5% in Word Error Rate (WER) was achieved by the combined use of aCMLLR and MLLR in cascade. Finally, this selection of techniques was applied to the transcription of multi-genre media broadcasts, where the use of aNAT training, aCMLLR transforms and MLLR transforms provided a relative improvement of 2–3%.