
    Anti-spoofing Methods for Automatic Speaker Verification System

    Growing interest in automatic speaker verification (ASV) systems has led to significant quality improvement of spoofing attacks on them. Many research works confirm that despite the low equal error rate (EER), ASV systems are still vulnerable to spoofing attacks. In this work we overview different acoustic feature spaces and classifiers to determine reliable and robust countermeasures against spoofing attacks. We compared several spoofing detection systems, presented so far, on the development and evaluation datasets of the Automatic Speaker Verification Spoofing and Countermeasures (ASVspoof) Challenge 2015. Experimental results presented in this paper demonstrate that combining magnitude and phase information contributes substantially to the efficiency of spoofing detection systems. Wavelet-based features also show impressive results in terms of equal error rate. In our overview we compare spoofing detection performance for systems based on different classifiers. Comparison results demonstrate that the linear SVM classifier outperforms the conventional GMM approach. However, many researchers, inspired by the great success of deep neural network (DNN) approaches in automatic speech recognition, have applied DNNs to the spoofing detection task and obtained quite low EER for known and unknown types of spoofing attacks.
    Comment: 12 pages, 0 figures, published in Springer Communications in Computer and Information Science (CCIS) vol. 66
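    Since EER is the headline metric throughout this overview, the following is a minimal sketch of how it can be computed from detector scores with plain NumPy. The score arrays and the simple threshold sweep are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def compute_eer(genuine_scores, spoof_scores):
    """Equal error rate: the operating point where the false-acceptance
    rate (spoofed trials scored above threshold) equals the
    false-rejection rate (genuine trials scored below it)."""
    thresholds = np.sort(np.concatenate([genuine_scores, spoof_scores]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        far = np.mean(spoof_scores >= t)   # spoof accepted as genuine
        frr = np.mean(genuine_scores < t)  # genuine rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Perfectly separated score distributions give an EER of 0.
print(compute_eer(np.array([0.9, 0.8, 0.7]),
                  np.array([0.1, 0.2, 0.3])))  # → 0.0
```

    In practice the EER is read off a full DET/ROC curve rather than a discrete sweep, but the crossing point of FAR and FRR is the same idea.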

    Use of principal component analysis with linear predictive features in developing a blind SNR estimation system

    Signal-to-noise ratio is an important concept in electrical communications, as it is a measurable ratio between a given transmitted signal and the inherent background noise of a transmission channel. Currently, signal-to-noise ratio testing is primarily performed using an intrusive method: comparing a corrupted signal to the original signal and giving it a score based on the comparison. However, this technique is inefficient and often impossible for practical use because it requires the original signal for comparison. A speech signal's characteristics and properties could be used to develop a non-intrusive method for determining SNR, i.e. a method that does not require the presence of the original clean signal. In this thesis, several extracted features were investigated to determine whether a neural network trained with data from corrupt speech signals could accurately estimate the SNR of a speech signal. A MultiLayer Perceptron (MLP) was trained on extracted features for each decibel level from 0 dB to 30 dB, in an attempt to create 'expert classifiers' for each SNR level. This type of architecture would then have 31 independent classifiers operating together to accurately estimate the signal-to-noise ratio of an unknown speech signal. Principal component analysis was also implemented to reduce dimensionality and increase class discrimination. The performance of several neural network classifier structures is examined, and the overall results are analyzed to determine the optimal feature for estimating the signal-to-noise ratio of an unknown speech signal. Decision-level fusion was the final procedure, which combined the outputs of several classifier systems in an effort to reduce the estimation error.
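    As a sketch of the PCA step described above (dimensionality reduction applied to the extracted features before classification), here is a minimal eigendecomposition-based implementation in NumPy. The toy feature matrix is hypothetical and stands in for the thesis's linear predictive features:

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project feature vectors onto the top principal components.

    features: (n_samples, n_features) matrix, e.g. per-frame linear
    predictive features; n_components: the reduced dimensionality.
    """
    centered = features - features.mean(axis=0)
    cov = centered.T @ centered / (len(features) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return centered @ top

# Hypothetical 100-sample, 10-dimensional feature set reduced to 3 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
print(pca_reduce(X, 3).shape)  # → (100, 3)
```

    The reduced vectors would then feed each per-decibel MLP, with the retained components chosen to preserve most of the feature variance.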

    Comparative Analysis of MLP, CNN, and RNN Models in Automatic Speech Recognition: Dissecting Performance Metrics

    This study conducts a comparative analysis of three prominent machine learning models: Multi-Layer Perceptrons (MLP), Convolutional Neural Networks (CNN), and Recurrent Neural Networks (RNN) with Long Short-Term Memory (LSTM) in the field of automatic speech recognition (ASR). This research is distinct in its use of the LibriSpeech 'test-clean' dataset, selected for its diversity in speaker accents and varied recording conditions, establishing it as a robust benchmark for ASR performance evaluation. Our approach involved preprocessing the audio data to ensure consistency and extracting Mel-Frequency Cepstral Coefficients (MFCCs) as the primary features, crucial for capturing the nuances of human speech. The models were meticulously configured with specific architectural details and hyperparameters. The MLP and CNN models were designed to maximize their pattern recognition capabilities, while the RNN (LSTM) was optimized for processing temporal data. To assess their performance, we employed metrics such as precision, recall, and F1-score. The MLP and CNN models demonstrated exceptional accuracy, with scores of 0.98 across these metrics, indicating their effectiveness in feature extraction and pattern recognition. In contrast, the LSTM variant of RNN showed lower efficacy, with scores below 0.60, highlighting the challenges in handling sequential speech data. The results of this study shed light on the differing capabilities of these models in ASR. While the high accuracy of MLP and CNN suggests potential overfitting, the underperformance of LSTM underscores the necessity for further refinement in sequential data processing. This research contributes to the understanding of various machine learning approaches in ASR and paves the way for future investigations. We propose exploring hybrid model architectures and enhancing feature extraction methods to develop more sophisticated, real-world ASR systems. Additionally, our findings underscore the importance of considering model-specific strengths and limitations in ASR applications, guiding the direction of future research in this rapidly evolving field.
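    The precision, recall, and F1 metrics used in this comparison can be computed directly from prediction counts; the following is a minimal binary-case sketch in NumPy, with made-up label arrays for illustration:

```python
import numpy as np

def precision_recall_f1(y_true, y_pred, positive=1):
    """Binary precision, recall, and F1 from true and predicted labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))  # true positives
    fp = np.sum((y_pred == positive) & (y_true != positive))  # false positives
    fn = np.sum((y_pred != positive) & (y_true == positive))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical labels: 3 true positives, 1 false positive, 1 false negative.
p, r, f = precision_recall_f1([1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 1, 0])
print(round(p, 2), round(r, 2), round(f, 2))  # → 0.75 0.75 0.75
```

    For the multi-class phoneme or character labels an ASR comparison actually uses, these counts would be tallied per class and then macro- or micro-averaged.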