114 research outputs found
End-to-End Language Identification Using High-Order Utterance Representation with Bilinear Pooling
A key problem in spoken language identification (LID) is how to design effective representations which are specific to language information. Recent advances in deep neural networks have led to significant improvements in results, with deep end-to-end methods proving effective. This paper proposes a novel network which aims to model an effective representation for high (first and second)-order statistics of LID-senones, defined as being LID analogues of senones in speech recognition. The high-order information extracted through bilinear pooling is robust to speakers, channels and background noise.
Evaluation with NIST LRE 2009 shows improved performance compared to current state-of-the-art DBF/i-vector systems, achieving over 33% and 20% relative equal error rate (EER) improvement for 3s and 10s utterances and over 40% relative Cavg improvement for all durations
Prosodic-Enhanced Siamese Convolutional Neural Networks for Cross-Device Text-Independent Speaker Verification
In this paper a novel cross-device text-independent speaker verification
architecture is proposed. Majority of the state-of-the-art deep architectures
that are used for speaker verification tasks consider Mel-frequency cepstral
coefficients. In contrast, our proposed Siamese convolutional neural network
architecture uses Mel-frequency spectrogram coefficients to benefit from the
dependency of the adjacent spectro-temporal features. Moreover, although
spectro-temporal features have proved to be highly reliable in speaker
verification models, they only represent some aspects of short-term acoustic
level traits of the speaker's voice. However, the human voice consists of
several linguistic levels such as acoustic, lexicon, prosody, and phonetics,
that can be utilized in speaker verification models. To compensate for these
inherited shortcomings in spectro-temporal features, we propose to enhance the
proposed Siamese convolutional neural network architecture by deploying a
multilayer perceptron network to incorporate the prosodic, jitter, and shimmer
features. The proposed end-to-end verification architecture performs feature
extraction and verification simultaneously. This proposed architecture displays
significant improvement over classical signal processing approaches and deep
algorithms for forensic cross-device speaker verification.Comment: Accepted in 9th IEEE International Conference on Biometrics: Theory,
Applications, and Systems (BTAS 2018
End-to-end DNN-CNN Classification for Language Identification
A defining problem in spoken language identification (LID) is how to design effective representations which allow features to be extracted that are specific to language information.
Recent advances in deep neural networks for feature extraction have led to significant improvements in results, with deep end-to-end methods proving effective.
In this paper, a novel network is proposed and explored that models an effective representation using first and second-order statistics of features extracted from a well-trained phoneme-related DNN bottleneck network followed by a stack of CNN convolutional layers.
The high-order statistics extracted through second order pooling at the output of the CNN are robust to speaker and channel variability, and background noise.
Evaluation with NIST LRE 2009 shows improved performance compared to current state-of-the-art systems, achieving over 33% and 20% relative equal error rate (EER) improvement for 3s and 10s utterances
LID-senones and their statistics for language identification
Recent research on end-to-end training structures for language identification has raised the possibility that intermediate language-sensitive feature units exist which are analogous to phonetically-sensitive senones in automatic speech recognition systems. Termed LID (language identification)-senones, the statistics derived from these feature units have been shown to be beneficial in discriminating between languages, particularly for short utterances. This paper examines the evidence for the existence of LID-senones before designing and evaluating LID systems based on low and high level statistics of LID-senones with both generative and discriminative models. For the standard NIST LRE 2009 task on 23 languages, LID-senone based systems are shown to outperform state-of-the art DNN/i-vector methods both when LID-senones are used directly for classification and when LID-senone statistics are used for i-vector formation
- …