Language Identification with Deep Bottleneck Features
In this paper we propose an end-to-end short-utterance speech language identification (SLD) approach based on a Long Short-Term Memory (LSTM) neural network, which is especially suitable for SLD applications in intelligent vehicles. The features used for LSTM learning are generated by a transfer learning method: bottleneck features of a deep neural network (DNN) trained for Mandarin acoustic-phonetic classification are used for LSTM training. To improve the SLD accuracy on short utterances, a phase-vocoder-based time-scale modification (TSM) method is used to reduce and increase the speech rate of the test utterance. By splicing the normal, rate-reduced, and rate-increased utterances, we extend the length of the test utterances and thereby improve the performance of the SLD system. Experimental results on the AP17-OLR database show that the proposed methods improve the performance of SLD, especially on short utterances of 1 s and 3 s duration.
Comment: Preliminary work report
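The splicing scheme described above lends itself to a compact sketch. The following NumPy code uses a naive overlap-add time-scale modification as a stand-in for the paper's phase-vocoder TSM; all function names and parameter values are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def ola_time_stretch(x, rate, frame=1024, hop=256):
    """Naive overlap-add time-scale modification (a stand-in for the
    phase-vocoder TSM in the paper): rate > 1 speeds up, rate < 1 slows down."""
    win = np.hanning(frame)
    hop_a = int(round(hop * rate))                       # analysis hop
    n_frames = max(1, (len(x) - frame) // hop_a + 1)
    out_len = (n_frames - 1) * hop + frame
    out = np.zeros(out_len)
    norm = np.zeros(out_len)
    for i in range(n_frames):
        seg = x[i * hop_a:i * hop_a + frame]
        if len(seg) < frame:
            seg = np.pad(seg, (0, frame - len(seg)))
        out[i * hop:i * hop + frame] += win * seg        # overlap-add at fixed hop
        norm[i * hop:i * hop + frame] += win
    return out / np.maximum(norm, 1e-8)

def splice_extended(x, slow=0.8, fast=1.2):
    """Splice normal, rate-reduced, and rate-increased copies to extend
    a short test utterance, as the abstract describes."""
    return np.concatenate([x, ola_time_stretch(x, slow), ola_time_stretch(x, fast)])
```

The spliced signal is roughly three times the original duration, giving the LSTM more acoustic evidence per test utterance.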
Leveraging Native Language Speech for Accent Identification using Deep Siamese Networks
The problem of automatic accent identification is important for several
applications like speaker profiling and recognition as well as for improving
speech recognition systems. The accented nature of speech can be primarily
attributed to the influence of the speaker's native language on the given
speech recording. In this paper, we propose a novel accent identification
system whose training exploits speech in native languages along with the
accented speech. Specifically, we develop a deep Siamese network-based model
which learns the association between accented speech recordings and the native
language speech recordings. The Siamese networks are trained with i-vector
features extracted from the speech recordings using either an unsupervised
Gaussian mixture model (GMM) or a supervised deep neural network (DNN) model.
We perform several accent identification experiments using the CSLU Foreign
Accented English (FAE) corpus. In these experiments, our proposed approach
using deep Siamese networks yields a significant relative performance improvement of 15.4% on a 10-class accent identification task over a baseline DNN-based classification system that uses GMM i-vectors. Furthermore, we present a detailed error analysis of the proposed accent identification system.
Comment: Published in ASRU 201
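A minimal sketch of the Siamese objective may clarify the setup: both branches share one projection, and a contrastive loss pulls same-class accented/native pairs together while pushing different-class pairs apart. The dimensions and margin below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def embed(ivectors, W):
    """Shared projection applied identically to both Siamese branches."""
    h = ivectors @ W
    return h / np.linalg.norm(h, axis=1, keepdims=True)   # length-normalize

def contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """same=1 pulls a pair together; same=0 pushes it apart up to the margin."""
    d = np.linalg.norm(emb_a - emb_b, axis=1)
    return float(np.mean(same * d**2
                         + (1 - same) * np.maximum(margin - d, 0.0)**2))
```

In training, `emb_a` would come from accented recordings and `emb_b` from native-language recordings, with `same` indicating whether they share the native language.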
A Unified Deep Neural Network for Speaker and Language Recognition
Learned feature representations and sub-phoneme posteriors from Deep Neural
Networks (DNNs) have been used separately to produce significant performance
gains for speaker and language recognition tasks. In this work we show how
these gains are possible using a single DNN for both speaker and language
recognition. The unified DNN approach is shown to yield substantial performance
improvements on the 2013 Domain Adaptation Challenge speaker recognition
task (55% reduction in EER for the out-of-domain condition) and on the NIST
2011 Language Recognition Evaluation (48% reduction in EER for the 30s test
condition).
Novel Cascaded Gaussian Mixture Model-Deep Neural Network Classifier for Speaker Identification in Emotional Talking Environments
This research presents an effective approach to enhancing text-independent speaker identification performance in emotional talking environments, based on a novel classifier: the cascaded Gaussian Mixture Model-Deep Neural Network (GMM-DNN). We propose, implement, and evaluate this cascaded classifier for speaker identification in emotional talking environments. The results show that the cascaded GMM-DNN classifier improves speaker identification performance across various emotions
using two distinct speech databases: Emirati speech database (Arabic United
Arab Emirates dataset) and Speech Under Simulated and Actual Stress (SUSAS)
English dataset. The proposed classifier outperforms classical classifiers such
as Multilayer Perceptron (MLP) and Support Vector Machine (SVM) in each
dataset. Speaker identification performance that has been attained based on the
cascaded GMM-DNN is similar to that acquired from subjective assessment by
human listeners.
Comment: 15 pages
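To make the cascade concrete, here is a small illustrative sketch in which per-speaker GMM log-likelihoods form the feature vector passed to a tiny DNN; the architecture, sizes, and names are hypothetical, not the paper's:

```python
import numpy as np

def diag_gmm_loglik(X, means, var, weights):
    """Average frame log-likelihood of features X under a diagonal-covariance GMM."""
    # X: (T, D); means/var: (K, D); weights: (K,)
    diff = X[:, None, :] - means[None, :, :]                       # (T, K, D)
    comp = (-0.5 * np.sum(diff**2 / var + np.log(2 * np.pi * var), axis=2)
            + np.log(weights))                                     # (T, K)
    m = comp.max(axis=1, keepdims=True)                            # log-sum-exp
    return float(np.mean(m[:, 0] + np.log(np.exp(comp - m).sum(axis=1))))

def gmm_dnn_scores(X, gmms, W1, b1, W2, b2):
    """Cascade: per-speaker GMM log-likelihoods become the feature vector
    fed to a small DNN that outputs speaker posteriors."""
    feats = np.array([diag_gmm_loglik(X, *g) for g in gmms])
    h = np.maximum(W1 @ feats + b1, 0.0)                           # ReLU hidden layer
    z = W2 @ h + b2
    return np.exp(z - z.max()) / np.exp(z - z.max()).sum()         # softmax
```

The key idea is only the cascading: the first stage's generative scores become the second stage's discriminative input.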
Hierarchical Classification for Spoken Arabic Dialect Identification using Prosody: Case of Algerian Dialects
In daily communication, Arabs use local dialects that are hard to identify automatically using conventional classification methods. The dialect identification task becomes even more challenging when dealing with under-resourced dialects belonging to the same country or region. In this paper, we first statistically analyze Algerian dialects in order to capture their prosodic specificities, which are extracted at the utterance level after a coarse-grained consonant/vowel segmentation. Based on these findings, we propose a Hierarchical classification approach for spoken Arabic Algerian Dialect IDentification (HADID). It takes advantage of the fact that dialects are naturally structured into a hierarchy. Within HADID, a top-down hierarchical classification is applied, in which Deep Neural Networks (DNNs) are used to build a local classifier for every parent node in the dialect hierarchy. Our framework is implemented and evaluated on an Algerian Arabic dialect corpus, with the dialect hierarchy deduced from historical and linguistic knowledge. The results reveal that within HADID, DNNs are the best local classifiers, outperforming Support Vector Machines. In addition, compared with a baseline flat classification system, HADID yields an improvement of 63.5% in terms of precision. Furthermore, the overall results demonstrate the suitability of our prosody-based HADID for speaker-independent dialect identification while requiring test utterances of less than 6 s.
Comment: 33 pages, 7 figures
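The top-down routing that the abstract describes can be sketched as follows; the hierarchy, dialect names, and stub classifiers here are purely illustrative, not the actual HADID structure:

```python
import numpy as np

# Hypothetical two-level dialect hierarchy (illustrative names only)
hierarchy = {"root": ["group_A", "group_B"],
             "group_A": ["dialect_1", "dialect_2"],
             "group_B": ["dialect_3", "dialect_4"]}

def classify_top_down(x, hierarchy, local_classifiers):
    """Walk from the root: each parent node's local classifier (a DNN in the
    paper) selects one child until a leaf dialect is reached."""
    node = "root"
    while node in hierarchy:                  # internal node: delegate downward
        children = hierarchy[node]
        scores = local_classifiers[node](x)   # one score per child
        node = children[int(np.argmax(scores))]
    return node                               # leaf = predicted dialect
```

One local classifier per parent node keeps each decision small, which is the advantage hierarchical schemes claim over flat classification.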
Neural Predictive Coding using Convolutional Neural Networks towards Unsupervised Learning of Speaker Characteristics
Learning speaker-specific features is vital in many applications like speaker
recognition, diarization, and speech recognition. This paper presents a novel approach, which we term Neural Predictive Coding (NPC), to learn speaker-specific
characteristics in a completely unsupervised manner from large amounts of
unlabeled training data that even contain many non-speech events and
multi-speaker audio streams. The NPC framework exploits the proposed short-term
active-speaker stationarity hypothesis which assumes two temporally-close short
speech segments belong to the same speaker, and thus a common representation
that can encode the commonalities of both the segments, should capture the
vocal characteristics of that speaker. We train a deep convolutional Siamese network to produce "speaker embeddings" by learning to separate `same' vs. `different' speaker pairs generated from unlabeled audio streams. Two sets of experiments are conducted in different scenarios to evaluate
the strength of NPC embeddings and compare with state-of-the-art in-domain
supervised methods. First, two speaker identification experiments with
different context lengths are performed in a scenario with comparatively
limited within-speaker channel variability. NPC embeddings are found to perform best in the short-duration experiment, and they provide complementary
information to i-vectors for full utterance experiments. Second, a large scale
speaker verification task having a wide range of within-speaker channel
variability is adopted as an upper-bound experiment where comparisons are drawn
with in-domain supervised methods.
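The pair-generation step implied by the stationarity hypothesis can be sketched as follows; segment lengths, maximum gap, and the sampling scheme are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def npc_pairs(streams, seg, max_gap, n_pairs, rng):
    """Sample training pairs under the short-term active-speaker stationarity
    hypothesis: temporally close segments from one stream are 'same' (label 1),
    segments from two different streams are 'different' (label 0)."""
    pairs, labels = [], []
    for _ in range(n_pairs):
        if rng.random() < 0.5:
            # 'same': two temporally close segments of one unlabeled stream
            s = streams[rng.integers(len(streams))]
            a = rng.integers(0, len(s) - seg - max_gap)
            b = a + rng.integers(1, max_gap)
            pairs.append((s[a:a + seg], s[b:b + seg]))
            labels.append(1)
        else:
            # 'different': segments drawn from two distinct streams
            i, j = rng.choice(len(streams), size=2, replace=False)
            a = rng.integers(0, len(streams[i]) - seg)
            b = rng.integers(0, len(streams[j]) - seg)
            pairs.append((streams[i][a:a + seg], streams[j][b:b + seg]))
            labels.append(0)
    return pairs, np.array(labels)
```

No speaker labels are needed at any point, which is what makes the NPC setup fully unsupervised.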
UTD-CRSS Submission for MGB-3 Arabic Dialect Identification: Front-end and Back-end Advancements on Broadcast Speech
This study presents systems submitted by the University of Texas at Dallas,
Center for Robust Speech Systems (UTD-CRSS) to the MGB-3 Arabic Dialect
Identification (ADI) subtask. This task is defined to discriminate between five
dialects of Arabic, including Egyptian, Gulf, Levantine, North African, and
Modern Standard Arabic. We develop multiple single systems with different
front-end representations and back-end classifiers. At the front-end level,
feature extraction methods such as Mel-frequency cepstral coefficients (MFCCs)
and two types of bottleneck features (BNF) are studied for an i-Vector
framework. At the back-end level, Gaussian back-end (GB) and Generative Adversarial Network (GAN) classifiers are applied alternatively. Our best (contrastive) submission achieves an accuracy of 76.94% on the ADI subtask, obtained by augmenting the training data with a randomly chosen part of the development dataset. Further, with a post-evaluation correction to the submitted system, the final accuracy increases to 79.76%, which represents the best performance achieved so far for the challenge on the test dataset.
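A Gaussian back-end of the kind mentioned above is simple enough to sketch: one Gaussian per dialect over i-vectors, with a shared within-class covariance. The implementation below is a generic illustration, not the submitted UTD-CRSS system:

```python
import numpy as np

def gaussian_backend(train_ivecs, train_labels, test_ivecs, n_classes):
    """Gaussian back-end: class means with a pooled within-class covariance;
    returns per-class log-likelihood scores (up to a shared constant)."""
    d = train_ivecs.shape[1]
    means = np.stack([train_ivecs[train_labels == c].mean(axis=0)
                      for c in range(n_classes)])
    centered = train_ivecs - means[train_labels]           # pooled covariance
    cov = centered.T @ centered / len(train_ivecs) + 1e-6 * np.eye(d)
    icov = np.linalg.inv(cov)
    scores = np.stack(
        [-0.5 * np.sum((test_ivecs - m) @ icov * (test_ivecs - m), axis=1)
         for m in means], axis=1)
    return scores                                          # (n_test, n_classes)
```

Because the covariance is shared, the decision boundaries are linear in i-vector space, which is why this back-end is a common lightweight baseline.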
MIT-QCRI Arabic Dialect Identification System for the 2017 Multi-Genre Broadcast Challenge
In order to successfully annotate the Arabic speech content found in
open-domain media broadcasts, it is essential to be able to process a diverse
set of Arabic dialects. For the 2017 Multi-Genre Broadcast challenge (MGB-3)
there were two possible tasks: Arabic speech recognition, and Arabic Dialect
Identification (ADI). In this paper, we describe our efforts to create an ADI
system for the MGB-3 challenge, with the goal of distinguishing amongst four
major Arabic dialects, as well as Modern Standard Arabic. Our research focused on dialect variability and domain mismatches between the training and test domains. In order to achieve a robust ADI system, we explored both Siamese neural network models, to learn similarities and dissimilarities among Arabic dialects, and i-vector post-processing, to adapt to domain mismatches. Both acoustic and linguistic features were used for the final MGB-3 submissions,
with the best primary system achieving 75% accuracy on the official 10hr test
set.
Comment: Submitted to the 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2017)
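One common form of i-vector post-processing for domain mismatch is centering and whitening on in-domain data followed by length normalization; the sketch below illustrates that generic recipe (the paper's exact post-processing may differ):

```python
import numpy as np

def adapt_ivectors(ivecs, indomain_ivecs):
    """Center on the in-domain mean, whiten with the in-domain covariance,
    then length-normalize: a common recipe for reducing train/test
    domain mismatch in i-vector systems."""
    mu = indomain_ivecs.mean(axis=0)
    cov = (np.cov(indomain_ivecs - mu, rowvar=False)
           + 1e-6 * np.eye(ivecs.shape[1]))
    eigval, eigvec = np.linalg.eigh(cov)
    whiten = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T   # ZCA whitening
    out = (ivecs - mu) @ whiten
    return out / np.linalg.norm(out, axis=1, keepdims=True)
```

Estimating `mu` and `cov` on unlabeled in-domain data makes the step usable even when the test genre has no dialect labels.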
Audio-replay attack detection countermeasures
This paper presents the Speech Technology Center (STC) replay attack
detection systems proposed for Automatic Speaker Verification Spoofing and
Countermeasures Challenge 2017. In this study we focus on a comparison of different spoofing detection approaches: GMM-based methods, high-level feature extraction with a simple classifier, and deep learning frameworks. Experiments performed on the development and evaluation parts of the challenge dataset demonstrate the stable efficiency of deep learning approaches under changing acoustic conditions. At the same time, an SVM classifier with high-level features contributed substantially to the performance of the resulting fused STC systems.
Comment: 11 pages, 3 figures, accepted for Specom 201
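Score-level fusion of the kind used to combine such subsystems can be illustrated generically; the weights here are placeholders that would normally be tuned on development data, and this is not the STC fusion itself:

```python
import numpy as np

def fuse_scores(subsystem_scores, weights):
    """Weighted linear score-level fusion of spoofing-detection subsystems
    (e.g. GMM, high-level-feature SVM, deep networks)."""
    S = np.asarray(subsystem_scores, dtype=float)   # (n_systems, n_trials)
    w = np.asarray(weights, dtype=float)
    return w @ S / w.sum()                          # (n_trials,) fused scores
```

Linear fusion rewards subsystems whose errors are uncorrelated, which is why a weaker SVM subsystem can still improve the fused result.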
Statistical feature embedding for heart sound classification
Cardiovascular disease (CVD) is one of the principal causes of death worldwide. In recent years, researchers have investigated heart-sound patterns for disease diagnosis. In this study, an approach is proposed for normal/abnormal heart sound classification on the PhysioNet Challenge 2016 dataset. For the first time, a fixed-length feature vector, called an i-vector, is extracted from each heart sound using Mel-Frequency Cepstral Coefficient (MFCC) features. Afterwards, a Principal Component Analysis (PCA) transform and a Variational Autoencoder (VAE) are applied to the i-vectors for dimensionality reduction. Eventually, the reduced vectors are fed to Gaussian Mixture Models (GMMs) and a Support Vector Machine (SVM) for classification. Experimental results demonstrate that the proposed method achieves a performance improvement of 16% in Modified Accuracy (MAcc) over the baseline system on the PhysioNet dataset.
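The PCA step of the pipeline can be sketched directly (the VAE alternative is omitted); this is a generic SVD-based implementation, not the authors' code:

```python
import numpy as np

def pca_reduce(X, n_components):
    """PCA via SVD of the centered data matrix: returns the projected data
    and the principal axes used for dimensionality reduction."""
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    axes = Vt[:n_components]          # top principal directions
    return (X - mu) @ axes.T, axes
```

Reducing each i-vector this way before the GMM/SVM stage trades some variance for a lower-dimensional, better-conditioned classifier input.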