The malicious use of deep speech synthesis models may pose a significant threat
to society. Therefore, many studies have emerged to detect the so-called
``deepfake audio''. However, these studies focus on the binary detection of real
and fake audio. For some realistic application scenarios, it is necessary to
know what tool or model generated the deepfake audio. This raises a question:
Can we recognize the system fingerprints of deepfake audio? To answer it, in this
paper, we propose a deepfake audio dataset for system fingerprint recognition
(SFR) and conduct an initial investigation. The dataset, which includes both
clean and compressed sets, was collected from five speech synthesis systems
built on the latest state-of-the-art deep learning technologies. In addition, to
facilitate the further development of system fingerprint recognition methods,
we provide benchmarks for comparison, together with our research findings.
The dataset will be publicly available.

Comment: 12 pages, 3 figures. arXiv admin note: text overlap with
arXiv:2208.0964