
    Synthetic speech detection and audio steganography in VoIP scenarios

    The distinction between synthetic and human voices relies on the techniques of current biometric voice recognition systems, which prevent a person's voice, whether used with good or bad intentions, from being confused with someone else's. Steganography makes it possible to hide a message inside an apparently innocuous file (usually an audio, video or image file) in such a way that it does not arouse suspicion in any external observer. This article proposes two methods, applicable in a hypothetical VoIP scenario, which allow us to distinguish synthetic speech from a human voice and to insert a text message within the Comfort Noise generated in the pauses of a voice conversation. The first method builds on existing studies of Modulation Features related to the temporal analysis of speech signals, while the second proposes a technique derived from Direct Sequence Spread Spectrum, which consists in distributing the energy of the signal to be hidden over a wider transmission band. Due to space limits, this paper is only an extended abstract; the full version will contain further details on our research.
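The Direct Sequence Spread Spectrum idea mentioned in the abstract can be illustrated with a minimal sketch (hypothetical parameters, not the authors' implementation): a bit is mapped to ±1, multiplied by a pseudo-random chip sequence, added at low amplitude to a noise-like carrier, and recovered by correlating with the same sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

def spread_bit(bit, chips):
    # Map bit {0, 1} to {-1, +1} and spread it over the chip sequence.
    return (2 * bit - 1) * chips

def despread(received, chips):
    # Correlate with the shared chip sequence; the sign recovers the bit.
    return int(np.dot(received, chips) > 0)

chips = rng.choice([-1.0, 1.0], size=128)     # pseudo-random spreading code
carrier = rng.normal(scale=0.5, size=128)     # comfort-noise-like carrier
stego = carrier + 0.3 * spread_bit(1, chips)  # low-amplitude embedded bit
```

Because the hidden signal's energy is spread over many chips, its per-sample amplitude stays well below the carrier, yet the correlation gain (here 128 chips) makes the bit recoverable.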

    Synthetic speech detection using phase information

    Taking advantage of the fact that most speech processing techniques neglect phase information, we seek to detect phase perturbations in order to prevent synthetic impostors from attacking Speaker Verification systems. Two Synthetic Speech Detection (SSD) systems that use spectral phase related information are reviewed and evaluated in this work: one based on the Modified Group Delay (MGD) and the other on the Relative Phase Shift (RPS). A classical magnitude-based MFCC system is also used as a baseline. Different training strategies are proposed and evaluated using both real spoofing samples and signals copy-synthesized from the natural ones, aiming to alleviate the difficulty of obtaining real data to train the systems. The recently published ASVspoof 2015 database is used for training and evaluation. Performance with completely unrelated data is also checked using synthetic speech from the Blizzard Challenge as evaluation material. The results prove that phase information can be successfully used for the SSD task even with unknown attacks. This work has been partially supported by the Basque Government (ElkarOla project, KK-2015/00098) and the Spanish Ministry of Economy and Competitiveness (Restore project, TEC2015-67163-C2-1-R).
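As a concrete illustration of the phase-derived features discussed here, the (unmodified) group delay of a frame can be computed from two FFTs; the MGD variant adds cepstral smoothing and compression parameters on top of this. A minimal NumPy sketch, not the paper's implementation:

```python
import numpy as np

def group_delay(x, n_fft=512):
    """Group delay tau(w) = (X_r*Y_r + X_i*Y_i) / |X|^2, with Y = FFT(n * x[n])."""
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(n * x, n_fft)
    # Small floor avoids division by zero at spectral nulls.
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-12)
```

A useful sanity check: for a pure delay of d samples, the group delay is flat and equal to d at every frequency.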

    All-for-One and One-For-All: Deep learning-based feature fusion for Synthetic Speech Detection

    Recent advances in deep learning and computer vision have made the synthesis and counterfeiting of multimedia content more accessible than ever, leading to possible threats from malicious users. In the audio field, we are witnessing the growth of speech deepfake generation techniques, which calls for the development of synthetic speech detection algorithms to counter possible malicious uses such as fraud or identity theft. In this paper, we consider three different feature sets proposed in the literature for the synthetic speech detection task and present a model that fuses them, achieving overall better performance than state-of-the-art solutions. The system was tested on different scenarios and datasets to prove its robustness to anti-forensic attacks and its generalization capabilities. Comment: Accepted at the ECML-PKDD 2023 Workshop "Deep Learning and Multimedia Forensics. Combating fake media and misinformation"
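One common way to fuse heterogeneous feature sets, sketched here with hypothetical embedding sizes rather than the paper's actual architecture, is to L2-normalise each front-end's embedding and concatenate them before the final classifier:

```python
import numpy as np

def fuse(feature_sets):
    # L2-normalise each embedding so no single front-end dominates the
    # fused vector, then concatenate them into one feature vector.
    return np.concatenate(
        [f / (np.linalg.norm(f) + 1e-12) for f in feature_sets]
    )

# Hypothetical per-utterance embeddings from three different front-ends.
fused = fuse([np.ones(40), np.ones(60), np.ones(128)])
```

Normalising before concatenation is a design choice: without it, the front-end with the largest numeric range would dominate whatever classifier consumes the fused vector.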

    Use of the harmonic phase in synthetic speech detection

    Special Session paper: recent PhD thesis description. This PhD dissertation was written by Jon Sanchez and supervised by Inma Hernáez and Ibon Saratxaga. It was defended at the University of the Basque Country on 5 February 2016. The committee members were Dr. Alfonso Ortega Giménez (UniZar), Dr. Daniel Erro Eslava (UPV/EHU) and Dr. Enric Monte Moreno (UPC). The dissertation was awarded a "sobresaliente cum laude" qualification. This work has been partially funded by the Spanish Ministry of Economy and Competitiveness with FEDER support (RESTORE project, TEC2015-67163-C2-1-R) and the Basque Government (ELKAROLA project, KK-2015/00098).

    Towards generalisable and calibrated synthetic speech detection with self-supervised representations

    Generalisation -- the ability of a model to perform well on unseen data -- is crucial for building reliable deepfake detectors. However, recent studies have shown that current audio deepfake models fall short of this desideratum. In this paper we show that pretrained self-supervised representations followed by a simple logistic regression classifier achieve strong generalisation capabilities, reducing the equal error rate from 30% to 8% on the newly introduced In-the-Wild dataset. Importantly, this approach also produces considerably better calibrated models than previous approaches. This means that we can trust our model's predictions more and use them for downstream tasks, such as uncertainty estimation. In particular, we show that the entropy of the estimated probabilities provides a reliable way of rejecting uncertain samples and further improving the accuracy. Comment: Submitted to ICASSP 202
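The entropy-based rejection described in the abstract can be sketched as follows (the threshold and probabilities are illustrative, not values from the paper): compute the binary entropy of each predicted probability and keep only predictions whose entropy falls below a threshold.

```python
import numpy as np

def binary_entropy(p):
    # Entropy in bits of a Bernoulli prediction: 1.0 at p = 0.5,
    # approaching 0 as p nears 0 or 1 (i.e., as the model gets confident).
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def accept_mask(probs, max_entropy=0.5):
    # Keep only samples whose predictive entropy is below the threshold.
    return binary_entropy(np.asarray(probs)) < max_entropy

probs = [0.02, 0.97, 0.55, 0.48]  # illustrative classifier outputs
mask = accept_mask(probs)         # only the confident 0.02 and 0.97 pass
```

Rejected samples can then be routed to a human reviewer or a secondary model, trading coverage for accuracy on the samples the detector does answer.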

    Synthetic Speech Detection Using Deep Neural Networks

    With the advancements in deep learning and other techniques, synthetic speech is getting closer to a natural-sounding voice. Some state-of-the-art technologies achieve such a high level of naturalness that even humans have difficulty distinguishing real speech from computer-generated speech. Moreover, these technologies allow a person to train a speech synthesizer on a target voice, creating a model that is able to reproduce someone's voice with high fidelity. In this research, we thoroughly analyze how synthetic speech is generated and propose deep learning methodologies to detect such synthesized utterances. We first collected a significant amount of real and synthetic utterances to create the Fake or Real (FoR) dataset. Then, we analyzed the performance of the latest deep learning models in the classification of such utterances. Our proposed model achieves 99.86% accuracy in synthetic speech detection, a significant improvement over human performance (65.7%).