High level speaker specific features modeling in automatic speaker recognition system
Spoken words convey several levels of information. At the primary level, speech conveys the words or spoken message, but at the secondary level it also reveals information about the speaker. This work focuses on high-level speaker-specific features and the statistical speaker modeling techniques that capture the characteristic sound of the human voice. Using Hidden Markov model (HMM), Gaussian mixture model (GMM), and Linear Discriminant Analysis (LDA) models, we build computationally inexpensive Automatic Speaker Recognition (ASR) systems that can recognize speakers regardless of what is said. The performance of the ASR system is evaluated on speech ranging from clean to a wide range of qualities using the standard TIMIT speech corpus. The ASR accuracies of the HMM-, GMM-, and LDA-based modeling techniques are 98.8%, 99.1%, and 98.6%, with Equal Error Rates (EER) of 4.5%, 4.4%, and 4.55%, respectively. The EER improvement of the GMM-based ASR system over the HMM- and LDA-based systems is 4.25% and 8.51%, respectively.
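As a rough illustration of the GMM modeling the abstract refers to, the sketch below fits one Gaussian mixture per speaker on acoustic frames and identifies a test utterance by maximum average log-likelihood. Feature extraction and TIMIT handling are assumed to happen elsewhere; the function names and mixture size are hypothetical, not the authors' configuration.

```python
# Minimal sketch of GMM-based text-independent speaker identification.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(features_by_speaker, n_components=32):
    """Fit one GMM per speaker on that speaker's feature frames."""
    models = {}
    for speaker, frames in features_by_speaker.items():  # frames: (n_frames, n_mfcc)
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit(frames)
        models[speaker] = gmm
    return models

def identify(models, test_frames):
    """Return the speaker whose GMM gives the highest average log-likelihood."""
    scores = {spk: gmm.score(test_frames) for spk, gmm in models.items()}
    return max(scores, key=scores.get)
```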
Acoustic Approaches to Gender and Accent Identification
There has been considerable research on the problems of speaker and language recognition
from samples of speech. A less researched problem is that of accent recognition. Although this
is a similar problem to language identification, different accents of a language exhibit more
fine-grained differences between classes than languages do. This presents a tougher problem
for traditional classification techniques. In this thesis, we propose and evaluate a number of
techniques for gender and accent classification. These techniques are novel modifications and
extensions to state of the art algorithms, and they result in enhanced performance on gender
and accent recognition.
The first part of the thesis focuses on the problem of gender identification, and presents a
technique that gives improved performance in situations where training and test conditions are
mismatched.
The bulk of this thesis is concerned with the application of the i-Vector technique to accent
identification, which is the most successful approach to acoustic classification to have emerged
in recent years. We show that it is possible to achieve high accuracy accent identification without
reliance on transcriptions and without utilising phoneme recognition algorithms. The thesis
describes various stages in the development of i-Vector based accent classification that improve
the standard approaches usually applied for speaker or language identification, which are
insufficient. We demonstrate that very good accent identification performance is possible with
acoustic methods by considering different i-Vector projections, frontend parameters, i-Vector
configuration parameters, and an optimised fusion of the i-Vector classifiers that can be
obtained from the same data.
We claim to have achieved the best accent identification performance on the test corpus
for acoustic methods, with up to 90% identification rate. This performance is even better than
previously reported acoustic-phonotactic based systems on the same corpus, and is very close
to performance obtained via transcription based accent identification. Finally, we demonstrate
that the utilization of our techniques for speech recognition purposes leads to considerably
lower word error rates.
Keywords: Accent Identification, Gender Identification, Speaker Identification, Gaussian
Mixture Model, Support Vector Machine, i-Vector, Factor Analysis, Feature Extraction, British
English, Prosody, Speech Recognition
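To make the i-Vector classification stage concrete, here is a minimal sketch of one common back-end the thesis builds on: an LDA projection of pre-extracted i-vectors followed by cosine scoring against per-accent class means. It assumes i-vectors are already extracted, and it does not reproduce the thesis's specific projections or fusion.

```python
# Sketch: accent classification from pre-extracted i-vectors
# via LDA projection + cosine scoring against class means.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def train_accent_scorer(ivectors, labels):
    labels = np.asarray(labels)
    lda = LinearDiscriminantAnalysis()
    projected = lda.fit_transform(ivectors, labels)
    # One mean vector per accent class in the projected space.
    means = {c: projected[labels == c].mean(axis=0) for c in np.unique(labels)}
    return lda, means

def classify(lda, means, ivector):
    v = lda.transform(ivector[None, :])[0]
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    return max(means, key=lambda c: cos(v, means[c]))
```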
Data-driven Attention and Data-independent DCT based Global Context Modeling for Text-independent Speaker Recognition
Learning an effective speaker representation is crucial for achieving
reliable performance in speaker verification tasks. Speech signals are
high-dimensional, long, and variable-length sequences that entail a complex
hierarchical structure. Signals may contain diverse information at each
time-frequency (TF) location. For example, it may be more beneficial to focus
on high-energy parts for phoneme classes such as fricatives. The standard
convolutional layer that operates on neighboring local regions cannot capture
the complex TF global context information. In this study, a general global
time-frequency context modeling framework is proposed to leverage the context
information specifically for speaker representation modeling. First, a
data-driven attention-based context model is introduced to capture the
long-range and non-local relationship across different time-frequency
locations. Second, a data-independent 2D-DCT based context model is proposed to
improve model interpretability. A multi-DCT attention mechanism is presented to
improve modeling power with alternate DCT base forms. Finally, the global
context information is used to recalibrate salient time-frequency locations by
computing the similarity between the global context and local features. The
proposed lightweight blocks can be easily incorporated into a speaker model
with little additional computational costs and effectively improves the speaker
verification performance compared to the standard ResNet model and
Squeeze-and-Excitation block by a large margin. Detailed ablation studies are also
performed to analyze various factors that may impact the performance of the
proposed individual modules. Results from experiments show that the proposed
global context modeling framework can efficiently improve the learned speaker
representations by achieving channel-wise and time-frequency feature
recalibration.
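The following PyTorch sketch illustrates the general pattern the abstract describes: pool a global time-frequency context vector with attention, then use it to recalibrate local features channel-wise. All names and layer sizes here are hypothetical; the paper's data-independent 2D-DCT variant and multi-DCT attention are not reproduced.

```python
# Illustrative global time-frequency context block (attention-pooled
# context vector gating channels, Squeeze-and-Excitation style).
import torch
import torch.nn as nn

class GlobalTFContextBlock(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        # 1x1 conv produces per-location logits for context pooling.
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)
        self.transform = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                              # x: (B, C, F, T)
        b, c, f, t = x.shape
        w = self.attn(x).flatten(2).softmax(-1)        # (B, 1, F*T) attention
        ctx = (x.flatten(2) * w).sum(-1)               # (B, C) global context
        gate = self.transform(ctx).view(b, c, 1, 1)    # channel-wise gates
        return x * gate                                # recalibrated features
```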
Disentanglement in a GAN for Unconditional Speech Synthesis
Can we develop a model that can synthesize realistic speech directly from a
latent space, without explicit conditioning? Despite several efforts over the
last decade, previous adversarial and diffusion-based approaches still struggle
to achieve this, even on small-vocabulary datasets. To address this, we propose
AudioStyleGAN (ASGAN) -- a generative adversarial network for unconditional
speech synthesis tailored to learn a disentangled latent space. Building upon
the StyleGAN family of image synthesis models, ASGAN maps sampled noise to a
disentangled latent vector which is then mapped to a sequence of audio features
so that signal aliasing is suppressed at every layer. To successfully train
ASGAN, we introduce a number of new techniques, including a modification to
adaptive discriminator augmentation which probabilistically skips discriminator
updates. We apply it on the small-vocabulary Google Speech Commands digits
dataset, where it achieves state-of-the-art results in unconditional speech
synthesis. It is also substantially faster than existing top-performing
diffusion models. We confirm that ASGAN's latent space is disentangled: we
demonstrate how simple linear operations in the space can be used to perform
several tasks unseen during training. Specifically, we perform evaluations in
voice conversion, speech enhancement, speaker verification, and keyword
classification. Our work indicates that GANs are still highly competitive in
the unconditional speech synthesis landscape, and that disentangled latent
spaces can be used to aid generalization to unseen tasks. Code, models,
samples: https://github.com/RF5/simple-asgan/
Comment: 12 pages, 5 tables, 4 figures. Submitted to IEEE TASLP. arXiv admin note: substantial text overlap with arXiv:2210.0527
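The training modification the abstract highlights, probabilistically skipping discriminator updates, can be sketched as below. This is a hedged illustration with a hinge-style adversarial loss and a hypothetical skip probability p_skip; the actual ASGAN logic and its adaptive schedule live in the linked repository.

```python
# Sketch: one adversarial training step where the discriminator
# update is skipped with probability p_skip.
import torch

def training_step(G, D, opt_g, opt_d, z, real, p_skip=0.5):
    fake = G(z)
    # Generator update (always performed).
    opt_g.zero_grad()
    g_loss = -D(fake).mean()
    g_loss.backward()
    opt_g.step()
    # Discriminator update, probabilistically skipped.
    if torch.rand(()) >= p_skip:
        opt_d.zero_grad()
        d_loss = (torch.relu(1 - D(real)) +
                  torch.relu(1 + D(fake.detach()))).mean()
        d_loss.backward()
        opt_d.step()
```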
Subspace and graph methods to leverage auxiliary data for limited target data multi-class classification, applied to speaker verification
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 127-130).

Multi-class classification can be adversely affected by the absence of sufficient target (in-class) instances for training. Such cases arise in face recognition, speaker verification, and document classification, among others. Auxiliary data-sets, which contain a diverse sampling of non-target instances, are leveraged in this thesis using subspace and graph methods to improve classification where target data is limited. The auxiliary data is used to define a compact representation that maps instances into a vector space where inner products quantify class similarity. Within this space, an estimate of the subspace that constitutes within-class variability (e.g. the recording channel in speaker verification or the illumination conditions in face recognition) can be obtained using class-labeled auxiliary data. This thesis proposes a way to incorporate this estimate into the SVM framework to perform nuisance compensation, thus improving classification performance. Another contribution is a framework that combines mapping and compensation into a single linear comparison, which motivates computationally inexpensive and accurate comparison functions.

A key aspect of the work takes advantage of efficient pairwise comparisons between the training, test, and auxiliary instances to characterize their interaction within the vector space, and exploits it for improved classification in three ways. The first uses the local variability around the train and test instances to reduce false alarms. The second assumes the instances lie on a low-dimensional manifold and uses distances along the manifold. The third extracts relational features from a similarity graph whose nodes correspond to the training, test, and auxiliary instances. To quantify the merit of the proposed techniques, results of experiments in speaker verification are presented where only a single target recording is provided to train the classifier. Experiments are performed on standard NIST corpora and methods are compared using standard evaluation metrics: detection error trade-off curves, minimum decision costs, and equal error rates.

by Zahi Nadim Karam. Ph.D.
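A minimal sketch of the nuisance compensation idea, in the spirit of the thesis: estimate the within-class (nuisance) subspace from class-labeled auxiliary data and project it out before comparison, a NAP-style projection. The thesis's SVM integration and graph methods are omitted, and the rank k is a tunable assumption.

```python
# Sketch: within-class variability estimation and projection removal.
import numpy as np

def nuisance_projection(aux_vectors, aux_labels, k=10):
    """Return P = I - U U^T, with U the top-k within-class directions."""
    X = np.asarray(aux_vectors, dtype=float)
    y = np.asarray(aux_labels)
    # Remove each class mean so only within-class variability remains.
    centered = np.vstack([X[y == c] - X[y == c].mean(0) for c in np.unique(y)])
    U, _, _ = np.linalg.svd(centered.T, full_matrices=False)
    U = U[:, :k]
    return np.eye(X.shape[1]) - U @ U.T

# Compensated comparison score between instances a and b: a @ P @ b.
```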
IITG-Indigo System for NIST 2016 SRE Challenge
This paper describes the speaker verification (SV) system submitted to the NIST 2016 speaker recognition evaluation (SRE) challenge by the Indian Institute of Technology Guwahati (IITG) under the fixed training condition task. Various SV systems were developed following idea-level collaboration with two other Indian institutions. Unlike previous SREs, this time the focus was on developing SV systems using non-target-language speech data and a small amount of unlabeled data from the target language/dialects. To address these novel challenges, we explored the fusion of systems created using different features, data conditioning, and classifiers. On the NIST 2016 SRE evaluation data, the presented fused system achieved an actual detection cost function (actDCF) and equal error rate (EER) of 0.81 and 12.91%, respectively. Post-evaluation, we explored a recently proposed pairwise support vector machine classifier and applied adaptive S-norm to the decision scores before fusion. With these changes, the final system achieves an actDCF and EER of 0.67 and 11.63%, respectively.
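For reference, adaptive S-norm as commonly defined symmetrically normalizes a raw trial score against enrollment- and test-side cohort scores, keeping only the top-scoring cohort entries. The sketch below follows that standard definition; the cohort size is a tunable, and this is not the IITG system's code.

```python
# Sketch: adaptive symmetric score normalization (adaptive S-norm).
import numpy as np

def adaptive_s_norm(raw, enroll_cohort, test_cohort, top_k=200):
    """Normalize a raw trial score with two cohort score sets.

    enroll_cohort: scores of the enrollment model against cohort utterances.
    test_cohort:   scores of the test utterance against cohort models.
    Only the top_k highest cohort scores are used (the 'adaptive' part).
    """
    e = np.sort(enroll_cohort)[-top_k:]
    t = np.sort(test_cohort)[-top_k:]
    z = (raw - e.mean()) / (e.std() + 1e-8)   # enrollment-side normalization
    s = (raw - t.mean()) / (t.std() + 1e-8)   # test-side normalization
    return 0.5 * (z + s)
```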
Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification
Forensic audio analysis for speaker verification offers unique challenges due
to location/scenario uncertainty and diversity mismatch between reference and
naturalistic field recordings. The lack of real naturalistic forensic audio
corpora with ground-truth speaker identity represents a major challenge in this
field. It is also difficult to directly employ small-scale domain-specific data
to train complex neural network architectures due to domain mismatch and loss
in performance. Alternatively, cross-domain speaker verification for multiple
acoustic environments is a challenging task which could advance research in
audio forensics. In this study, we introduce a CRSS-Forensics audio dataset
collected in multiple acoustic environments. We pre-train a CNN-based network
using the VoxCeleb data, followed by an approach which fine-tunes part of the
high-level network layers with clean speech from CRSS-Forensics. Based on this
fine-tuned model, we align domain-specific distributions in the embedding space
with the discrepancy loss and maximum mean discrepancy (MMD). This maintains
effective performance on the clean set while simultaneously generalizing the
model to other acoustic domains. From the results, we demonstrate that diverse
acoustic environments affect the speaker verification performance, and that our
proposed approach of cross-domain adaptation can significantly improve the
results in this scenario.
Comment: To appear in INTERSPEECH 202
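The maximum mean discrepancy term used to align embedding distributions can be sketched as below, here with an RBF kernel. The kernel bandwidth and loss weighting are assumptions for illustration, not the authors' values.

```python
# Sketch: squared MMD between two embedding batches with an RBF kernel.
import torch

def mmd_rbf(x, y, sigma=1.0):
    """x: (n, d) source embeddings, y: (m, d) target embeddings."""
    def kernel(a, b):
        d2 = torch.cdist(a, b).pow(2)          # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

# Training objective sketch:
# total_loss = speaker_loss + lambda_mmd * mmd_rbf(src_emb, tgt_emb)
```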