Acoustic model adaptation from raw waveforms with SincNet
Raw waveform acoustic modelling has recently gained interest due to neural
networks' ability to learn feature extraction, and the potential for finding
better representations for a given scenario than hand-crafted features. SincNet
has been proposed to reduce the number of parameters required in raw-waveform
modelling, by restricting the filter functions, rather than having to learn
every tap of each filter. We study the adaptation of the SincNet filter
parameters from adults' to children's speech, and show that the
parameterisation of the SincNet layer is well suited for adaptation in
practice: we can efficiently adapt with a very small number of parameters,
producing error rates comparable to techniques using orders of magnitude more
parameters.
Comment: Accepted to IEEE ASRU 2019.
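The key idea in SincNet is that each filter is described by just two learnable values, its low and high cut-off frequencies, rather than by hundreds of individual taps. Below is a minimal PyTorch sketch of such a sinc-parameterised convolution layer; the class name SincFilters, the initialisation scheme, and all sizes are illustrative assumptions, not the authors' implementation.

```python
import torch

class SincFilters(torch.nn.Module):
    """Sketch of a SincNet-style layer: each band-pass filter is defined by
    two learnable cut-off frequencies instead of every filter tap."""
    def __init__(self, n_filters=80, kernel_size=251, sample_rate=16000):
        super().__init__()
        # learnable low cut-offs and bandwidths in Hz (illustrative init)
        low = torch.linspace(30, sample_rate / 2 - 200, n_filters)
        self.low_hz = torch.nn.Parameter(low)
        self.band_hz = torch.nn.Parameter(torch.full((n_filters,), 100.0))
        n = torch.arange(kernel_size) - (kernel_size - 1) / 2
        self.register_buffer("n", n / sample_rate)           # time axis in seconds
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):                                    # x: (batch, 1, time)
        f1 = torch.abs(self.low_hz)                          # keep cut-offs positive
        f2 = f1 + torch.abs(self.band_hz)
        # difference of two low-pass sinc filters gives a band-pass filter
        filters = (2 * f2.unsqueeze(1) * torch.sinc(2 * f2.unsqueeze(1) * self.n)
                   - 2 * f1.unsqueeze(1) * torch.sinc(2 * f1.unsqueeze(1) * self.n))
        filters = filters * self.window                      # smooth the band edges
        return torch.nn.functional.conv1d(x, filters.unsqueeze(1))

features = SincFilters()(torch.randn(4, 1, 16000))           # (4, 80, 15750)
```

Only 2 * n_filters scalars parameterise the whole layer, which is what makes the adaptation study above feasible with so few parameters.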
Robust learning of acoustic representations from diverse speech data
Automatic speech recognition is increasingly applied to new domains. A key challenge is
to robustly learn, update and maintain representations to cope with transient acoustic
conditions. A typical example is broadcast media, for which speakers and environments
may change rapidly, and available supervision may be poor. This thesis is
concerned with building and investigating methods for acoustic modelling that
are robust to the characteristics and transient conditions embodied by such media.
The first contribution of the thesis is a technique to make use of inaccurate transcriptions as supervision for acoustic model training. There is an abundance of audio
with approximate labels, but training methods can be sensitive to label errors, and their
use is therefore not trivial. State-of-the-art semi-supervised training makes effective
use of a lattice of supervision, inherently encoding uncertainty in the labels to avoid
overfitting to poor supervision, but does not make use of the transcriptions. Existing
approaches that do aim to make use of the transcriptions typically employ an algorithm
to filter or combine the transcriptions with the recognition output from a seed model,
but the final result does not encode uncertainty. We propose a method to combine the
lattice output from a biased recognition pass with the transcripts, crucially preserving
uncertainty in the lattice where appropriate. This substantially reduces the word error
rate on a broadcast task.
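As a rough illustration of preserving uncertainty, the toy Python sketch below merges a transcript into per-word lattice alternatives instead of forcing a single hypothesis. The slot-based alignment, the merge_supervision function, and the boost weight are hypothetical simplifications for exposition; the thesis operates on full lattices, not aligned word slots.

```python
# Toy sketch: merge a reference transcript into per-word lattice alternatives
# while keeping a distribution over alternatives, i.e. preserving uncertainty.
def merge_supervision(lattice_slots, transcript, boost=0.5):
    """lattice_slots: list of {word: posterior} dicts from a biased decoding
    pass, assumed here to align one-to-one with the transcript words."""
    merged = []
    for slot, ref_word in zip(lattice_slots, transcript):
        alts = dict(slot)
        # boost (or introduce) the transcript word rather than forcing it,
        # then renormalise so the slot remains a proper distribution
        alts[ref_word] = alts.get(ref_word, 0.0) + boost
        total = sum(alts.values())
        merged.append({w: p / total for w, p in alts.items()})
    return merged

# where lattice and transcript agree the slot sharpens; where they disagree,
# both alternatives survive with non-zero weight
slots = [{"the": 0.9, "a": 0.1}, {"cat": 0.5, "cap": 0.5}]
print(merge_supervision(slots, ["the", "hat"]))
```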
The second contribution is a method to factorise representations for speakers and
environments so that they may be combined in novel combinations. In realistic scenarios,
the speaker or environment transform at test time might be unknown, or there may be
insufficient data to learn a joint transform. We show that in such cases, factorised, or
independent, representations are required to avoid deteriorating performance. Using
i-vectors, we factorise speaker or environment information using multi-condition training
with neural networks. Specifically, we extract bottleneck features from networks trained
to classify either speakers or environments. The resulting factorised representations
prove beneficial when one factor is missing at test time, or when all factors are seen,
but not in the desired combination.
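A minimal sketch of the bottleneck idea follows: train a network to classify one factor, then use the narrow hidden layer as that factor's representation. The class BottleneckClassifier and all layer sizes are illustrative assumptions, not the thesis architecture.

```python
import torch

class BottleneckClassifier(torch.nn.Module):
    """Sketch: a classifier with a narrow bottleneck layer. After training to
    classify one factor (speakers OR environments), the bottleneck activations
    serve as a factorised representation of that factor."""
    def __init__(self, in_dim=40, bottleneck=64, n_classes=100):
        super().__init__()
        self.encoder = torch.nn.Sequential(
            torch.nn.Linear(in_dim, 512), torch.nn.ReLU(),
            torch.nn.Linear(512, bottleneck), torch.nn.ReLU(),  # bottleneck
        )
        self.classifier = torch.nn.Linear(bottleneck, n_classes)

    def forward(self, x):
        return self.classifier(self.encoder(x))

    def extract(self, x):
        return self.encoder(x)                  # factorised bottleneck features

# two independently trained networks give independent factors, which can then
# be concatenated in combinations never observed jointly during training
spk_net = BottleneckClassifier(n_classes=100)   # trained to classify speakers
env_net = BottleneckClassifier(n_classes=8)     # trained to classify environments
feats = torch.randn(16, 40)
combined = torch.cat([spk_net.extract(feats), env_net.extract(feats)], dim=-1)
```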
The third contribution is an investigation of model adaptation in a longitudinal
setting. In this scenario, we repeatedly adapt a model to new data, with the constraint
that previous data becomes unavailable. We first demonstrate the effect of such a
constraint, and show that using a cyclical learning rate may help. We then observe
that these successive models lend themselves well to ensembling. Finally, we show
that the impact of this constraint in an active learning setting may be detrimental to
performance, and suggest combining active learning with semi-supervised training to
avoid biasing the model.
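The sketch below illustrates the longitudinal constraint on synthetic data: each adaptation round sees only its own data, uses a cyclical learning rate (here PyTorch's CyclicLR scheduler), and snapshots the model so successive rounds can be ensembled. All hyperparameters and the prediction-averaging scheme are assumptions for illustration.

```python
import torch

# Toy longitudinal adaptation: each round trains only on new data (previous
# data is discarded), with a cyclical learning rate and a model snapshot kept
# per round for later ensembling.
model = torch.nn.Linear(40, 10)
snapshots = []
for round_idx in range(3):                       # three adaptation rounds
    data = [(torch.randn(32, 40), torch.randint(0, 10, (32,)))
            for _ in range(100)]                 # synthetic stand-in for new data
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    sched = torch.optim.lr_scheduler.CyclicLR(
        opt, base_lr=1e-4, max_lr=0.1, step_size_up=50, cycle_momentum=False)
    for x, y in data:                            # older rounds' data is unavailable
        opt.zero_grad()
        torch.nn.functional.cross_entropy(model(x), y).backward()
        opt.step()
        sched.step()
    snapshots.append({k: v.clone() for k, v in model.state_dict().items()})

def ensemble_logits(x):
    """Average predictions over the successive snapshots."""
    outs = []
    for state in snapshots:
        model.load_state_dict(state)
        outs.append(model(x))
    return torch.stack(outs).mean(0)
```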
The fourth contribution is a method to adapt low-level features in a parameter-efficient and interpretable manner. We propose to adapt the filters in a neural feature
extractor, known as SincNet. In contrast to traditional techniques that warp the
filterbank frequencies in standard feature extraction, adapting SincNet parameters is
more flexible and more readily optimised, whilst maintaining interpretability. On a task
adapting from adult to child speech, we show that this layer is well suited for adaptation
and is highly effective given the small number of adapted parameters.
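In practice the recipe amounts to freezing the source-trained model and updating only the sinc layer's frequency parameters. A minimal sketch, reusing the hypothetical SincFilters layer from the first sketch above and a stand-in backbone for the rest of the acoustic model:

```python
import torch

# Sketch: keep the adult-trained acoustic model frozen and fine-tune only the
# sinc layer's cut-off/bandwidth parameters on the child-speech data.
sinc = SincFilters(n_filters=80)
backbone = torch.nn.Sequential(torch.nn.AdaptiveAvgPool1d(1),
                               torch.nn.Flatten(),
                               torch.nn.Linear(80, 500))    # stand-in backbone
for p in backbone.parameters():
    p.requires_grad = False                 # source-domain weights stay fixed

# only 2 * 80 = 160 scalars are adapted (one low cut-off and one bandwidth
# per filter), orders of magnitude fewer than whole-network fine-tuning
optimiser = torch.optim.Adam([sinc.low_hz, sinc.band_hz], lr=1e-3)

x = torch.randn(8, 1, 16000)                # one second of raw audio per example
logits = backbone(sinc(x))
```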
Bridging the Spoof Gap: A Unified Parallel Aggregation Network for Voice Presentation Attacks
Automatic Speaker Verification (ASV) systems are increasingly used in voice
biometrics for user authentication but are susceptible to logical and physical
spoofing attacks, posing security risks. Existing research mainly tackles
logical or physical attacks separately, leading to a gap in unified spoofing
detection. Moreover, when existing systems attempt to handle both types of
attacks, they often exhibit significant disparities in the Equal Error Rate
(EER). To bridge this gap, we present a Parallel Stacked Aggregation Network
that processes raw audio. Our approach employs a split-transform-aggregation
technique, dividing utterances into convolved representations, applying
transformations, and aggregating the results to identify logical (LA) and
physical (PA) spoofing attacks. Evaluation on the ASVspoof 2019 and VSDC
datasets shows the effectiveness of the proposed system. It outperforms
state-of-the-art solutions, displaying reduced EER disparities and superior
performance in detecting spoofing attacks. This highlights the proposed
method's generalizability and superiority. In a world increasingly reliant on
voice-based security, our unified spoofing detection system provides a robust
defense against a spectrum of voice spoofing attacks, safeguarding ASVs and
user data effectively.
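The split-transform-aggregation pattern named here is commonly realised (as in ResNeXt) with grouped convolutions: the channels are split into groups, each group is transformed independently, and the branch outputs are aggregated. A minimal PyTorch sketch on 1-D audio features follows; the block name and all sizes are assumptions, not the paper's exact architecture.

```python
import torch

class SplitTransformAggregate(torch.nn.Module):
    """Sketch of a split-transform-aggregate residual block for 1-D features:
    a grouped convolution transforms each channel group independently, and a
    1x1 convolution plus the residual connection aggregate the branches."""
    def __init__(self, channels=64, groups=8):
        super().__init__()
        self.branches = torch.nn.Sequential(
            torch.nn.Conv1d(channels, channels, 1),                            # split
            torch.nn.Conv1d(channels, channels, 3, padding=1, groups=groups),  # transform
            torch.nn.Conv1d(channels, channels, 1),                            # aggregate
            torch.nn.BatchNorm1d(channels),
        )
        self.act = torch.nn.ReLU()

    def forward(self, x):                       # x: (batch, channels, time)
        return self.act(x + self.branches(x))

out = SplitTransformAggregate()(torch.randn(4, 64, 160))
```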
Deep learning methods in speaker recognition: a review
This paper summarizes the applied deep learning practices in the field of
speaker recognition, both verification and identification. Speaker recognition
has long been a widely studied topic in speech technology. Much research has
been carried out, yet relatively little progress was achieved in the past 5-6
years. However, as deep learning techniques advance across most machine
learning fields, the former state-of-the-art methods are being replaced in
speaker recognition too. DL now appears to be the state-of-the-art solution
for both speaker verification and identification. Standard x-vectors, in
addition to i-vectors, serve as baselines in most novel works. The increasing
amount of gathered data opens up the territory to DL, where it is most
effective.
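For context, the x-vector baseline mentioned above maps a variable-length utterance to a fixed-length embedding via frame-level layers followed by statistics pooling. A minimal sketch with illustrative sizes, not a faithful reproduction of the original TDNN:

```python
import torch

class XVectorNet(torch.nn.Module):
    """Sketch of the x-vector idea: frame-level layers, then statistics
    pooling (mean and std over time) yield a fixed-length utterance embedding
    that is trained with a speaker-classification objective."""
    def __init__(self, feat_dim=24, emb_dim=512, n_speakers=1000):
        super().__init__()
        self.frame = torch.nn.Sequential(
            torch.nn.Conv1d(feat_dim, 512, 5, padding=2), torch.nn.ReLU(),
            torch.nn.Conv1d(512, 512, 3, padding=1), torch.nn.ReLU(),
        )
        self.embed = torch.nn.Linear(2 * 512, emb_dim)   # mean and std concatenated
        self.classify = torch.nn.Linear(emb_dim, n_speakers)

    def forward(self, x):                                # x: (batch, feat_dim, frames)
        h = self.frame(x)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # statistics pooling
        emb = self.embed(stats)                          # the "x-vector"
        return self.classify(emb), emb

logits, xvec = XVectorNet()(torch.randn(4, 24, 200))
```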