15 research outputs found
Robust text independent closed set speaker identification systems and their evaluation
PhD Thesis. This thesis focuses upon text-independent closed-set speaker
identification. The contributions relate to evaluation studies in the
presence of various types of noise and handset effects. Extensive
evaluations are performed on four databases.
The first contribution is in the context of the use of the Gaussian
Mixture Model-Universal Background Model (GMM-UBM) with
original speech recordings from only the TIMIT database. Four main
simulations for Speaker Identification Accuracy (SIA) are presented,
covering different fusion strategies: late fusion (score based), early
fusion (feature based), early-late fusion (a combination of feature- and
score-based fusion), late fusion using concatenated static and dynamic
features (features with temporal derivatives such as first-order delta
and second-order delta-delta, i.e. acceleration, features), and finally
fusion of statistically independent normalized scores.
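A minimal sketch of this static-plus-dynamic feature construction is given
below; it assumes the librosa library, and the file path, sampling rate and
number of coefficients are illustrative only, not the thesis's configuration.

    import numpy as np
    import librosa

    def static_dynamic_features(wav_path, n_mfcc=13):
        """Return static MFCCs concatenated with delta and delta-delta features."""
        signal, sr = librosa.load(wav_path, sr=16000)
        mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # static features
        delta = librosa.feature.delta(mfcc, order=1)   # first-order derivative (delta)
        delta2 = librosa.feature.delta(mfcc, order=2)  # second-order (acceleration)
        return np.vstack([mfcc, delta, delta2])        # shape: (3 * n_mfcc, frames)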
The second contribution is again based on the GMM-UBM
approach. Comprehensive evaluations of the effect of Additive White
Gaussian Noise (AWGN) and Non-Stationary Noise (NSN) (with and
without a G.712 type handset) upon identification performance are
undertaken. In particular, three NSN types with varying Signal to
Noise Ratios (SNRs) were tested, corresponding to street traffic, a bus
interior and a crowded talking environment. The performance
evaluation also considered the effect of late fusion techniques based on
score fusion, namely mean, maximum, and linear weighted sum fusion.
The databases employed were TIMIT, SITW, and NIST 2008; 120
speakers were selected from each database to yield 3,600 speech
utterances.
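The score-level (late) fusion rules referred to above can be illustrated with
the following sketch; the variable names and the fixed weight are assumptions
for illustration rather than the thesis's implementation.

    import numpy as np

    def fuse_scores(scores_a, scores_b, rule="mean", w=0.5):
        """Fuse two per-speaker score vectors from two classifiers into one."""
        scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
        if rule == "mean":
            return (scores_a + scores_b) / 2.0
        if rule == "max":
            return np.maximum(scores_a, scores_b)
        if rule == "weighted_sum":
            return w * scores_a + (1.0 - w) * scores_b
        raise ValueError(f"unknown fusion rule: {rule}")

    # The identified speaker is the index of the highest fused score, e.g.:
    # predicted = int(np.argmax(fuse_scores(scores_1, scores_2, rule="mean")))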
The third contribution is based on the use of the I-vector: four
combinations of I-vectors with 100 and 200 dimensions were employed.
Various fusion techniques using maximum, mean, weighted sum
and cumulative fusion with the same I-vector dimension were then used to
improve the SIA. Similarly, both interleaved and concatenated I-vector
fusion were exploited to produce 200- and 400-dimensional I-vectors. The
system was evaluated with four different databases using 120 speakers
from each database. The TIMIT, SITW and NIST 2008 databases were
evaluated for various types of NSN, namely street-traffic NSN,
bus-interior NSN and crowd-talking NSN; the G.712 type handset
at 16 kHz was also applied.
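The two I-vector combination schemes named above (concatenation, which stacks
two equal-length I-vectors, and interleaving, which alternates their elements)
can be sketched as follows; the function names and dimensions are illustrative
only.

    import numpy as np

    def concat_fusion(ivec_a, ivec_b):
        """E.g. two 200-dimensional I-vectors -> one 400-dimensional I-vector."""
        return np.concatenate([ivec_a, ivec_b])

    def interleave_fusion(ivec_a, ivec_b):
        """Alternate the elements of two equal-length I-vectors."""
        out = np.empty(ivec_a.size + ivec_b.size, dtype=ivec_a.dtype)
        out[0::2] = ivec_a
        out[1::2] = ivec_b
        return out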
As recommendations from the study: in terms of the GMM-UBM approach,
mean fusion is found to yield the overall best performance for the SIA
with noisy speech, whereas linear weighted sum fusion is overall best for
original database recordings. In the I-vector approach, however, the best
SIA was obtained from weighted sum and concatenated fusion.
The Ministry of Higher Education and Scientific Research (MoHESR), the
Iraqi Cultural Attaché, and the Al-Mustansiriya University College of
Engineering in Iraq supported this PhD scholarship.
Design Guidelines for Inclusive Speaker Verification Evaluation Datasets
Speaker verification (SV) provides billions of voice-enabled devices with
access control, and ensures the security of voice-driven technologies. As a
type of biometrics, it is necessary that SV is unbiased, with consistent and
reliable performance across speakers irrespective of their demographic, social
and economic attributes. Current SV evaluation practices are insufficient for
evaluating bias: they are over-simplified, they aggregate users, they are not
representative of real-life usage scenarios, and they do not account for the
consequences of errors. This paper proposes design guidelines for constructing
SV evaluation datasets that address these shortcomings. We propose a schema for
grading the difficulty of utterance pairs, and present an algorithm for
generating inclusive SV datasets. We empirically validate our proposed method
in a set of experiments on the VoxCeleb1 dataset. Our results confirm that the
count of utterance pairs per speaker and the difficulty grading of utterance
pairs have a significant effect on evaluation performance and variability. Our
work contributes to the development of SV evaluation practices that are
inclusive and fair.
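As a purely hypothetical illustration (not the paper's algorithm), the sketch
below builds same-speaker and different-speaker trial pairs with a fixed count
per speaker, so that every speaker contributes equally to the evaluation set;
the difficulty grading proposed in the paper would further constrain which
pairs are admitted.

    import itertools
    import random

    def build_trials(utts_by_speaker, pairs_per_speaker=10, seed=0):
        """utts_by_speaker: dict mapping speaker id -> list of utterance ids."""
        rng = random.Random(seed)
        speakers = list(utts_by_speaker)
        trials = []  # tuples of (utterance_a, utterance_b, is_same_speaker)
        for spk in speakers:
            same = list(itertools.combinations(utts_by_speaker[spk], 2))
            rng.shuffle(same)
            trials += [(a, b, True) for a, b in same[:pairs_per_speaker]]
            for _ in range(pairs_per_speaker):
                other = rng.choice([s for s in speakers if s != spk])
                trials.append((rng.choice(utts_by_speaker[spk]),
                               rng.choice(utts_by_speaker[other]), False))
        return trials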
UNSUPERVISED DOMAIN ADAPTATION FOR SPEAKER VERIFICATION IN THE WILD
Performance of automatic speaker verification (ASV) systems is very sensitive
to mismatch between training (source) and testing (target) domains. The
best way to address domain mismatch is to perform matched condition training
– gather sufficient labeled samples from the target domain and use them in
training. However, in many cases this is too expensive or impractical. Usually,
gaining access to unlabeled target domain data, e.g., from open source online
media, and labeled data from other domains is more feasible. This work focuses
on making ASV systems robust to uncontrolled (‘wild’) conditions, with
the help of some unlabeled data acquired from such conditions.
Given acoustic features from both domains, we propose learning a mapping
function – a deep convolutional neural network (CNN) with an encoder-decoder
architecture – between features of both the domains. We explore training the
network in two different scenarios: training on paired speech samples from
both domains and training on unpaired data. In the former case, where the
paired data is usually obtained via simulation, the CNN is treated as a
nonlinear regression function and is trained to minimize L2 loss between
original and predicted features from the target domain. We provide empirical
evidence that
this approach introduces distortions that affect verification performance. To
address this, we explore training the CNN using adversarial loss (along with
L2), which makes the predicted features indistinguishable from the original
ones, and thus improves verification performance.
The above framework using simulated paired data, though effective, cannot
be used to train the network on unpaired data obtained by independently
sampling speech from both domains. In this case, we first train a CNN using
adversarial loss to map features from target to source. We then map the
predicted features back to the target domain using an auxiliary network and
minimize a cycle-consistency loss between the original and reconstructed target
features.
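A minimal sketch of this cycle-consistency objective, assuming PyTorch, is
shown below; the mapping networks are placeholders, and the use of an L2
reconstruction penalty is an assumption made for illustration.

    import torch.nn as nn

    mse = nn.MSELoss()

    def cycle_consistency_loss(map_t2s, map_s2t, target_feats):
        """target_feats: a batch of acoustic feature tensors from the target domain."""
        mapped_to_source = map_t2s(target_feats)   # target -> source (adversarially trained CNN)
        reconstructed = map_s2t(mapped_to_source)  # source -> target (auxiliary network)
        return mse(reconstructed, target_feats)    # penalise the reconstruction error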
Our unsupervised adaptation approach complements its supervised counterpart,
where adaptation is done using labeled data from both domains. We
focus on three domain mismatch scenarios: (1) sampling frequency mismatch
between the domains, (2) channel mismatch, and (3) robustness to far-field and
noisy speech acquired from wild conditions.
MAVD: The First Open Large-Scale Mandarin Audio-Visual Dataset with Depth Information
Audio-visual speech recognition (AVSR) gains increasing attention from
researchers as an important part of human-computer interaction. However, the
existing Mandarin audio-visual datasets are limited and lack depth
information. To address this issue, this work establishes MAVD, a new
large-scale Mandarin multimodal corpus comprising 12,484 utterances spoken by
64 native Chinese speakers. To ensure the dataset covers diverse real-world
scenarios, a pipeline for cleaning and filtering the raw text material has been
developed to create well-balanced reading material. In particular, Microsoft's
latest data acquisition device, the Azure Kinect, is used to capture depth
information in addition to the traditional audio signals and RGB images during
data acquisition. We also provide a baseline experiment, which could be used to
evaluate the effectiveness of the dataset. The dataset and code will be
released at https://github.com/SpringHuo/MAVD.
X-VECTORS: ROBUST NEURAL EMBEDDINGS FOR SPEAKER RECOGNITION
Speaker recognition is the task of identifying speakers based on their speech signal. Typically, this involves comparing speech from a known speaker, with recordings from unknown speakers, and making same-or-different speaker decisions. If the lexical contents of the recordings are fixed to some phrase, the task is considered text-dependent, otherwise it is text-independent. This dissertation is primarily concerned with this second, less constrained problem. Since speech data lives in a complex, high-dimensional space, it is difficult to directly compare speakers. Comparisons are facilitated by embeddings: mappings from complex input patterns to low-dimensional Euclidean spaces where notions of distance or similarity are defined in natural ways. For almost ten years, systems based on i-vectors--a type of embedding extracted from a traditional generative model--have been the dominant paradigm in this field. However, in other areas of applied machine learning, such as text or vision, embeddings extracted from discriminatively trained neural networks are the state-of-the-art. Recently, this line of research has become very active in speaker recognition as well. Neural networks are a natural choice for this purpose, as they are capable of learning extremely complex mappings, and when training data resources are abundant, tend to outperform traditional methods. In this dissertation, we develop a next-generation neural embedding--denoted by x-vector--for speaker recognition. These neural embeddings are demonstrated to substantially improve upon the state-of-the-art on a number of benchmark datasets
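The embedding-comparison step described above can be sketched as follows: two
utterances are mapped to fixed-dimensional embeddings (such as x-vectors), and
a same-or-different speaker decision is made by thresholding their cosine
similarity. The threshold value is a placeholder, not the dissertation's
scoring back-end.

    import numpy as np

    def cosine_similarity(emb_a, emb_b):
        return float(np.dot(emb_a, emb_b) /
                     (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

    def same_speaker(emb_a, emb_b, threshold=0.5):
        """Accept the trial as a same-speaker pair if similarity exceeds the threshold."""
        return cosine_similarity(emb_a, emb_b) >= threshold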
Face mask recognition from audio: the MASC database and an overview on the mask challenge
The sudden outbreak of COVID-19 has resulted in tough challenges for the field of biometrics due to the spread of the virus via physical contact and regulations on wearing face masks. Given these constraints, voice biometrics can offer a suitable contact-less biometric solution; they can benefit from models that classify whether a speaker is wearing a mask or not. This article reviews the Mask Sub-Challenge (MSC) of the INTERSPEECH 2020 COMputational PARalinguistics challengE (ComParE), which focused on the following classification task: Given an audio chunk of a speaker, classify whether the speaker is wearing a mask or not. First, we report the collection of the Mask Augsburg Speech Corpus (MASC) and the baseline approaches used to solve the problem, achieving a performance of [Formula: see text] Unweighted Average Recall (UAR). We then summarise the methodologies explored in the submitted and accepted papers that mainly used two common patterns: (i) phonetic-based audio features, or (ii) spectrogram representations of audio combined with Convolutional Neural Networks (CNNs) typically used in image processing. Most approaches enhance their models by adapting ensembles of different models and attempting to increase the size of the training data using various techniques. We review and discuss the results of the participants of this sub-challenge, where the winner scored a UAR of [Formula: see text]. Moreover, we present the results of fusing the approaches, leading to a UAR of [Formula: see text]. Finally, we present a smartphone app that can be used as a proof of concept demonstration to detect in real-time whether users are wearing a face mask; we also benchmark the run-time of the best models.