Multi-perspective Information Fusion Res2Net with RandomSpecmix for Fake Speech Detection
In this paper, we propose the multi-perspective information fusion (MPIF)
Res2Net with random Specmix for fake speech detection (FSD). The main purpose
of this system is to improve the model's ability to learn precise forgery
information for the FSD task in low-quality scenarios. The task of random Specmix,
a data augmentation, is to improve the generalization ability of the model and
enhance the model's ability to locate discriminative information. Specmix cuts
and pastes the frequency dimension information of the spectrogram in the same
batch of samples without introducing other data, which helps the model
locate the truly useful information. At the same time, we randomly select
samples for augmentation to reduce the impact of data augmentation directly
changing all the data. Beyond helping the model locate information, it is
also important to reduce unnecessary
information. The role of MPIF-Res2Net is to reduce redundant interference
information. Deceptive information from a single perspective is always similar,
so a model that learns this similar information produces redundant spoofing
clues that interfere with truly discriminative information. The proposed
MPIF-Res2Net fuses information from different perspectives, making the
information learned by the model more diverse, thereby reducing the redundancy
caused by similar information and avoiding interference with the learning of
discriminative information. The results on the ASVspoof 2021 LA dataset
demonstrate the effectiveness of our proposed method, achieving EER and
min-tDCF of 3.29% and 0.2557, respectively.
Comment: Accepted by DADA202
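The within-batch cut-and-paste augmentation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function name and the `mix_prob` and `max_band` parameters are hypothetical choices introduced here:

```python
import numpy as np

def random_specmix(specs, mix_prob=0.5, max_band=20, rng=None):
    """Sketch of random Specmix (illustrative, not the paper's exact code).

    For a randomly selected subset of spectrograms in a batch, replace a
    contiguous frequency band with the corresponding band from another
    sample in the same batch, so no external data is introduced.

    specs: array of shape (batch, freq, time).
    """
    rng = np.random.default_rng(rng)
    out = specs.copy()
    batch, n_freq, _ = specs.shape
    perm = rng.permutation(batch)            # partner sample for each item
    for i in range(batch):
        if rng.random() > mix_prob:          # only a random subset is augmented
            continue
        width = int(rng.integers(1, max_band + 1))
        f0 = int(rng.integers(0, n_freq - width + 1))
        # cut-and-paste the frequency band from the partner sample
        out[i, f0:f0 + width, :] = specs[perm[i], f0:f0 + width, :]
    return out
```

Because the pasted band always comes from another spectrogram in the same batch, the augmented data stays on the original data manifold, which is the property the abstract emphasizes.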
Learning to Behave Like Clean Speech: Dual-Branch Knowledge Distillation for Noise-Robust Fake Audio Detection
Most research in fake audio detection (FAD) focuses on improving performance
on standard noise-free datasets. However, in real-world conditions there is
usually noise interference, which causes significant performance degradation
in FAD systems. To improve noise robustness, we propose a
dual-branch knowledge distillation fake audio detection (DKDFAD) method.
Specifically, a parallel data flow of the clean teacher branch and the noisy
student branch is designed, and interactive fusion and response-based
teacher-student paradigms are proposed to guide the training of noisy data from
the data distribution and decision-making perspectives. In the noisy branch,
speech enhancement is first introduced for denoising, which reduces the
interference of strong noise. The proposed interactive fusion combines
denoising features and noise features to reduce the impact of speech distortion
and seek consistency with the data distribution of the clean branch. The
teacher-student paradigm maps the student's decision space to the teacher's
decision space, making noisy speech behave like clean speech. In addition, a joint
training method is used to optimize the two branches to achieve global
optimality. Experimental results based on multiple datasets show that the
proposed method performs well in noisy environments and maintains performance
in cross-dataset experiments.
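A response-based teacher-student term of the kind described above can be sketched as a KL divergence between softened class posteriors, pulling the noisy student's decision space toward the clean teacher's. The loss form and temperature below are generic distillation conventions, not the paper's exact objective:

```python
import numpy as np

def softmax(x, T=1.0):
    """Numerically stable softmax over the last axis, with temperature T."""
    z = x / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def response_distillation_loss(student_logits, teacher_logits, T=2.0):
    """Sketch of a response-based distillation term (illustrative).

    KL(teacher || student) over temperature-softened posteriors; the T^2
    factor is the usual scaling that keeps gradient magnitudes comparable
    across temperatures.
    """
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    return float((T * T) * kl.mean())
```

In the joint training the abstract mentions, a term like this would be added to the standard classification loss on the student branch.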
DGSD: Dynamical Graph Self-Distillation for EEG-Based Auditory Spatial Attention Detection
Auditory Attention Detection (AAD) aims to detect the target speaker from brain
signals in a multi-speaker environment. Although EEG-based AAD methods have
shown promising results in recent years, current approaches primarily rely on
traditional convolutional neural networks designed for processing Euclidean data
like images. This makes it challenging to handle EEG signals, which possess
non-Euclidean characteristics. In order to address this problem, this paper
proposes a dynamical graph self-distillation (DGSD) approach for AAD, which
does not require speech stimuli as input. Specifically, to effectively
represent the non-Euclidean properties of EEG signals, dynamical graph
convolutional networks are applied to represent the graph structure of EEG
signals, which can also extract crucial features related to auditory spatial
attention in EEG signals. In addition, to further improve AAD
performance, self-distillation, consisting of feature distillation and
hierarchical distillation strategies at each layer, is integrated. These
strategies leverage features and classification results from the deepest
network layers to guide the learning of shallow layers. Our experiments are
conducted on two publicly available datasets, KUL and DTU. Under a 1-second
time window, we achieve accuracies of 90.0% and 79.6% on KUL and DTU,
respectively. We compare our DGSD method with competitive baselines, and the
experimental results indicate that our proposed DGSD method not only achieves
detection performance superior to the best reproducible baseline but also
reduces the number of trainable parameters by approximately 100 times.
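A single graph-convolution step over EEG channels, of the kind the abstract describes, can be sketched with a symmetrically normalized adjacency. This is a generic illustration of graph convolution; in the paper the adjacency is learned dynamically rather than fixed as it is here:

```python
import numpy as np

def dynamical_graph_conv(x, adjacency, weight):
    """One graph-convolution step over EEG channels (illustrative sketch).

    x:         (channels, features)  - per-electrode feature vectors
    adjacency: (channels, channels)  - stands in for the dynamically
                                       learned graph structure
    weight:    (features, out_features)

    Computes D^{-1/2} (A + I) D^{-1/2} X W: add self-loops, normalize by
    the degree matrix, then propagate features and project.
    """
    a = adjacency + np.eye(adjacency.shape[0])   # self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a.sum(axis=1)))
    a_norm = d_inv_sqrt @ a @ d_inv_sqrt
    return a_norm @ x @ weight
```

The self-distillation described in the abstract would then be losses that push each shallow layer's features and predictions toward those of the deepest layer, on top of a stack of such convolutions.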
Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features
Recently, pioneer research works have proposed a large number of acoustic
features (log power spectrogram, linear frequency cepstral coefficients,
constant Q cepstral coefficients, etc.) for audio deepfake detection, obtaining
good performance, and showing that different subbands have different
contributions to audio deepfake detection. However, these works lack an
explanation of the specific information in each subband, and the features also
discard information such as phase. In speech synthesis, fundamental frequency
(F0) information is used to improve the quality of synthetic speech, yet the
F0 of synthetic speech remains overly smoothed, differing significantly from
that of real speech. It is expected that F0 can be
used as important information to discriminate between bonafide and fake speech,
although this information cannot be used directly due to the irregular
distribution of F0. Instead, the frequency band containing most of the F0 is
selected as the input feature. Meanwhile, to make full use of the phase and
full-band information, we also propose to use real and imaginary spectrogram
features as complementary input features and model the disjoint subbands
separately. Finally, the results of F0, real and imaginary spectrogram features
are fused. Experimental results on the ASVspoof 2019 LA dataset show that our
proposed system is very effective for the audio deepfake detection task,
achieving an equal error rate (EER) of 0.43%, which surpasses almost all
systems.
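The input features described above can be sketched as follows: frame the waveform, take an FFT per frame, split the complex spectrogram into real and imaginary parts, and keep a low-frequency band as a stand-in for the band containing most of the F0. The function name, frame parameters, and bin count are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def real_imag_f0_features(wave, n_fft=512, hop=256, f0_bins=48):
    """Sketch of real/imaginary spectrogram + F0-band features (illustrative).

    wave: 1-D waveform. Returns (real, imag, f0_band), each shaped
    (n_frames, bins): the real and imaginary parts preserve phase and
    full-band information; the lowest `f0_bins` bins stand in for the
    frequency band containing most of the F0 energy.
    """
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack([wave[i * hop: i * hop + n_fft] for i in range(n_frames)])
    spec = np.fft.rfft(frames * np.hanning(n_fft), axis=-1)  # (frames, bins)
    real, imag = spec.real, spec.imag        # full-band, phase-preserving
    f0_band = np.abs(spec[:, :f0_bins])      # low band assumed to hold most F0
    return real, imag, f0_band
```

In the system described above, each of these three feature streams would feed its own model branch, with the scores fused at the end.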