41 research outputs found

    Acoustic model adaptation from raw waveforms with SincNet

    Raw waveform acoustic modelling has recently gained interest due to neural networks' ability to learn feature extraction, and the potential for finding better representations for a given scenario than hand-crafted features. SincNet has been proposed to reduce the number of parameters required in raw-waveform modelling by restricting the filter functions, rather than learning every tap of each filter. We study the adaptation of the SincNet filter parameters from adults' to children's speech, and show that the parameterisation of the SincNet layer is well suited for adaptation in practice: we can efficiently adapt with a very small number of parameters, producing error rates comparable to techniques using orders of magnitude more parameters. (Accepted to IEEE ASRU 2019.)
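    The efficiency claim above comes from SincNet's parameterisation: each band-pass filter is described by just two learnable cutoff frequencies rather than by hundreds of individual taps, so adapting the layer means updating only a couple of scalars per filter. A minimal NumPy sketch of that parameterisation, with illustrative (assumed) kernel size and sample rate:

```python
import numpy as np

def sinc(x):
    # Normalised sinc; nudge x = 0 to avoid 0/0 (sinc(0) = 1).
    x = np.where(x == 0, 1e-20, x)
    return np.sin(np.pi * x) / (np.pi * x)

def sincnet_filter(f_low, f_high, kernel_size=251, sample_rate=16000):
    """One SincNet filter: a band-pass built as the difference of two
    low-pass sinc filters with learnable cutoffs f_low < f_high (Hz),
    tapered by a Hamming window. Only the two cutoffs are trained,
    instead of all kernel_size taps of a free-form filter."""
    n = np.arange(kernel_size) - kernel_size // 2   # symmetric tap indices
    low, high = f_low / sample_rate, f_high / sample_rate
    band = 2 * high * sinc(2 * high * n) - 2 * low * sinc(2 * low * n)
    return band * np.hamming(kernel_size)
```

    Adapting a SincNet layer of, say, 80 such filters to children's speech then touches only about 160 parameters, which is the sense in which the abstract's "very small number of parameters" should be read.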

    On the Robustness and Training Dynamics of Raw Waveform Models

    A Deep 2D Convolutional Network for Waveform-Based Speech Recognition

    Robust learning of acoustic representations from diverse speech data

    Automatic speech recognition is increasingly applied to new domains. A key challenge is to robustly learn, update and maintain representations to cope with transient acoustic conditions. A typical example is broadcast media, in which speakers and environments may change rapidly, and available supervision may be poor. The concern of this thesis is to build and investigate methods for acoustic modelling that are robust to the characteristics and transient conditions embodied by such media.

    The first contribution of the thesis is a technique to make use of inaccurate transcriptions as supervision for acoustic model training. There is an abundance of audio with approximate labels, but training methods can be sensitive to label errors, and their use is therefore not trivial. State-of-the-art semi-supervised training makes effective use of a lattice of supervision, inherently encoding uncertainty in the labels to avoid overfitting to poor supervision, but does not make use of the transcriptions. Existing approaches that do aim to make use of the transcriptions typically employ an algorithm to filter or combine the transcriptions with the recognition output from a seed model, but the final result does not encode uncertainty. We propose a method to combine the lattice output from a biased recognition pass with the transcripts, crucially preserving uncertainty in the lattice where appropriate. This substantially reduces the word error rate on a broadcast task.

    The second contribution is a method to factorise representations for speakers and environments so that they may be combined in novel combinations. In realistic scenarios, the speaker or environment transform at test time might be unknown, or there may be insufficient data to learn a joint transform. We show that in such cases factorised, or independent, representations are required to avoid deteriorating performance. Using i-vectors, we factorise speaker and environment information through multi-condition training with neural networks: specifically, we extract bottleneck features from networks trained to classify either speakers or environments. The resulting factorised representations prove beneficial when one factor is missing at test time, or when all factors are seen, but not in the desired combination.

    The third contribution is an investigation of model adaptation in a longitudinal setting, in which we repeatedly adapt a model to new data under the constraint that previous data becomes unavailable. We first demonstrate the effect of such a constraint, and show that using a cyclical learning rate may help (a minimal sketch of such a schedule follows this abstract). We then observe that these successive models lend themselves well to ensembling. Finally, we show that the impact of this constraint in an active learning setting may be detrimental to performance, and suggest combining active learning with semi-supervised training to avoid biasing the model.

    The fourth contribution is a method to adapt low-level features in a parameter-efficient and interpretable manner. We propose to adapt the filters in a neural feature extractor known as SincNet. In contrast to traditional techniques that warp the filterbank frequencies in standard feature extraction, adapting SincNet parameters is more flexible and more readily optimised, whilst maintaining interpretability. On a task adapting from adult to child speech, we show that this layer is well suited for adaptation and is very effective relative to the small number of adapted parameters.
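    The cyclical learning rate referred to in the third contribution is, in its common triangular form (Smith, 2017), straightforward to implement. The sketch below is a generic version with assumed hyperparameters, not the thesis's exact schedule:

```python
def cyclical_lr(step, base_lr=1e-5, max_lr=1e-3, cycle_len=1000):
    """Triangular cyclical learning rate: within each cycle of
    cycle_len steps, the rate rises linearly from base_lr to max_lr
    and then falls back, periodically re-raising the rate so the
    model stays plastic across successive adaptation rounds."""
    pos = (step % cycle_len) / cycle_len   # position within the cycle, in [0, 1)
    tri = 1.0 - abs(2.0 * pos - 1.0)       # triangle wave peaking at pos = 0.5
    return base_lr + (max_lr - base_lr) * tri
```

    In PyTorch the same triangular schedule is also available off the shelf as torch.optim.lr_scheduler.CyclicLR.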

    Bridging the Spoof Gap: A Unified Parallel Aggregation Network for Voice Presentation Attacks

    Automatic Speaker Verification (ASV) systems are increasingly used in voice biometrics for user authentication, but are susceptible to logical and physical spoofing attacks, posing security risks. Existing research mainly tackles logical or physical attacks separately, leaving a gap in unified spoofing detection. Moreover, when existing systems attempt to handle both types of attack, they often exhibit significant disparities in Equal Error Rate (EER). To bridge this gap, we present a Parallel Stacked Aggregation Network that processes raw audio. Our approach employs a split-transform-aggregation technique: utterances are divided into convolved representations, transformations are applied, and the results are aggregated to identify logical-access (LA) and physical-access (PA) spoofing attacks. Evaluation on the ASVspoof 2019 and VSDC datasets shows the effectiveness of the proposed system: it outperforms state-of-the-art solutions, with reduced EER disparity and superior performance in detecting spoofing attacks, highlighting the method's generalizability. In a world increasingly reliant on voice-based security, our unified spoofing detection system provides a robust defense against a spectrum of voice spoofing attacks, safeguarding ASV systems and user data effectively.
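    The split-transform-aggregation technique named here follows the ResNeXt pattern: split the input into parallel branches, transform each with its own small convolution stack, and aggregate by summation. Below is a minimal PyTorch sketch of such a block for 1-D (raw-audio) feature maps; the branch count and channel widths are assumptions for illustration, not the paper's configuration:

```python
import torch.nn as nn

class SplitTransformAggregate1d(nn.Module):
    """ResNeXt-style split-transform-aggregate block: the input is fed
    to several parallel branches, each applying its own narrow
    convolutional transform, and the branch outputs are summed."""
    def __init__(self, channels=64, branches=8):
        super().__init__()
        width = channels // branches
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(channels, width, kernel_size=1),
                nn.ReLU(),
                nn.Conv1d(width, width, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.Conv1d(width, channels, kernel_size=1),
            )
            for _ in range(branches)
        )

    def forward(self, x):
        # Aggregate the parallel transforms, with a residual connection.
        return x + sum(branch(x) for branch in self.branches)
```

    Stacking several such blocks after an initial strided convolution over the raw waveform would give the kind of parallel stacked aggregation structure the abstract describes.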

    Deep learning methods in speaker recognition: a review

    This paper summarizes applied deep learning practices in the field of speaker recognition, covering both verification and identification. Speaker recognition has long been a widely studied topic in speech technology; many research works have been carried out, yet little progress was achieved in the past 5-6 years. However, as deep learning techniques advance across most machine learning fields, they are replacing the former state-of-the-art methods in speaker recognition too, and deep learning now appears to be the state-of-the-art solution for both speaker verification and identification. Standard x-vectors, in addition to i-vectors, are used as baselines in most novel works. The increasing amount of gathered data opens up the territory to deep learning, where it is most effective.
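    The x-vector baseline mentioned above maps a variable-length utterance to a fixed-dimensional speaker embedding; its characteristic component is a statistics-pooling layer over frame-level features. A minimal PyTorch sketch (the tensor layout is an assumption):

```python
import torch
import torch.nn as nn

class StatsPooling(nn.Module):
    """Statistics pooling as in x-vector systems: collapse the time
    axis of frame-level features (batch, channels, frames) into a
    fixed-size vector by concatenating the per-channel mean and
    standard deviation across frames."""
    def forward(self, x):
        mean = x.mean(dim=2)
        std = x.std(dim=2)
        return torch.cat([mean, std], dim=1)   # (batch, 2 * channels)
```

    Downstream, an embedding taken from a layer after pooling is compared against an enrolled speaker's embedding, typically with cosine scoring or PLDA, to accept or reject a verification claim.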