With the proliferation of Audio Language Model (ALM) based deepfake audio,
there is an urgent need for generalized detection methods. ALM-based deepfake
audio is currently widespread, highly deceptive, and versatile in type,
posing a significant challenge to current audio deepfake detection (ADD) models
trained solely on vocoded data. To effectively detect ALM-based deepfake audio,
we focus on the mechanism underlying ALM-based audio generation: the
conversion from neural codec to waveform. We first construct the Codecfake
dataset, an open-source, large-scale dataset covering 2 languages, over 1M
audio samples, and various test conditions, focused on ALM-based audio detection.
As a countermeasure, to achieve universal detection of deepfake audio and tackle
the domain ascent bias issue of the original SAM, we propose the CSAM strategy to
learn a domain-balanced and generalized minimum. In our experiments, we first
demonstrate that ADD models trained with the Codecfake dataset can effectively
detect ALM-based audio. Furthermore, our proposed generalization
countermeasure yields the lowest average Equal Error Rate (EER) of 0.616%
across all test conditions compared to baseline models. The dataset and
associated code are available online.
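
For readers unfamiliar with the metric, the EER is the operating point at which the false acceptance rate (spoofed audio accepted as bona fide) equals the false rejection rate (bona fide audio rejected). The following is a minimal sketch of one common way to estimate it from detector scores. It is not the evaluation script used for the reported results: the function name `compute_eer` is hypothetical, and the convention that a higher score means "more likely bona fide" is an assumption.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the Equal Error Rate from two sets of detector scores.

    Assumption: a higher score means the sample is more likely bona fide.
    Returns (eer, threshold), with eer as a fraction in [0, 1].
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Candidate thresholds: every distinct score that was observed.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))

    # False rejection rate: bona fide samples scored below the threshold.
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # False acceptance rate: spoofed samples scored at or above the threshold.
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # With finite data the two curves rarely cross exactly, so take the
    # threshold where they are closest and average the two rates there.
    idx = int(np.argmin(np.abs(far - frr)))
    eer = (far[idx] + frr[idx]) / 2.0
    return float(eer), float(thresholds[idx])


if __name__ == "__main__":
    # Toy scores for illustration only; these are not data from the paper.
    eer, thr = compute_eer([0.9, 0.8, 0.4, 0.7], [0.1, 0.2, 0.3, 0.5])
    print(f"EER = {100 * eer:.3f}% at threshold {thr:.2f}")
```

Multiplying the returned fraction by 100 gives a percentage comparable in form to the 0.616% figure above. An average over test conditions would then be the mean of the per-condition EERs, assuming each condition is scored separately.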