SAMO: Speaker Attractor Multi-Center One-Class Learning for Voice Anti-Spoofing
Voice anti-spoofing systems are crucial auxiliaries for automatic speaker
verification (ASV) systems. A major challenge arises from unseen attacks
enabled by advanced speech synthesis technologies. Our previous research on
one-class learning improved generalization to unseen attacks by compacting
bona fide speech in the embedding space. However, such compactness does not
account for the diversity of speakers. In this work, we propose speaker
attractor multi-center one-class learning (SAMO), which clusters bona fide
speech around a number of speaker attractors and pushes spoofing attacks away
from all attractors in a high-dimensional embedding space.
For training, we propose an algorithm for the co-optimization of bona fide
speech clustering and bona fide/spoof classification. For inference, we propose
strategies to enable anti-spoofing for speakers without enrollment. Our
proposed system outperforms existing state-of-the-art single systems with a
38% relative improvement in equal error rate (EER) on the ASVspoof2019 LA
evaluation set.
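To make the objective concrete, here is a minimal sketch of a multi-center one-class loss in the spirit of SAMO, written in PyTorch. The margins, scale factor, and all names are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def samo_style_loss(emb, is_bonafide, speaker_ids, attractors,
                    m_real=0.9, m_fake=0.2, alpha=20.0):
    """emb: (B, D) utterance embeddings; is_bonafide: (B,) bool labels;
    speaker_ids: (B,) speaker index per utterance;
    attractors: (S, D) one attractor per enrolled speaker."""
    emb = F.normalize(emb, dim=-1)
    attractors = F.normalize(attractors, dim=-1)
    sims = emb @ attractors.T                            # (B, S) cosine similarities
    own = sims[torch.arange(emb.size(0)), speaker_ids]   # similarity to own attractor
    nearest, _ = sims.max(dim=-1)                        # similarity to closest attractor
    # Bona fide: pull toward the speaker's own attractor (want own > m_real).
    # Spoof: push away from every attractor (want nearest < m_fake).
    margin = torch.where(is_bonafide, m_real - own, nearest - m_fake)
    return F.softplus(alpha * margin).mean() / alpha
```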
Phase perturbation improves channel robustness for speech spoofing countermeasures
In this paper, we aim to address the problem of channel robustness in speech
countermeasure (CM) systems, which are used to distinguish synthetic speech
from natural human speech. On the basis of two hypotheses, we propose an
approach that perturbs phase information during the training of time-domain CM
systems. First, communication networks often employ lossy compression codecs
that encode only magnitude information, thereby heavily altering phase
information. Second, state-of-the-art CM systems rely on phase information to
identify spoofed speech. We therefore believe that the information loss in the
phase domain induced by lossy compression codecs degrades CM performance on
unseen channels. We first establish the dependence of time-domain CM systems
on phase information by perturbing phase during evaluation, which shows strong
degradation. We then demonstrate that perturbing phase during training leads
to a significant performance improvement, whereas perturbing magnitude leads
to further degradation.
Comment: 5 pages; Accepted to INTERSPEECH 202
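As an illustration of the idea, the following is a hedged sketch of phase perturbation as a training-time augmentation, using an STFT/ISTFT round trip in PyTorch. The perturbation range and all names are assumptions for illustration, not the paper's exact recipe.

```python
import torch

def perturb_phase(wav, n_fft=512, hop=128, max_shift=torch.pi):
    """Randomize the phase of a waveform while keeping its magnitude spectrogram."""
    window = torch.hann_window(n_fft, device=wav.device)
    spec = torch.stft(wav, n_fft, hop_length=hop, window=window,
                      return_complex=True)
    mag, phase = spec.abs(), spec.angle()
    # Add uniform random offsets to the phase; the magnitude is untouched,
    # mimicking codecs that preserve magnitude but corrupt phase.
    offsets = (torch.rand_like(phase) * 2 - 1) * max_shift
    perturbed = torch.polar(mag, phase + offsets)
    return torch.istft(perturbed, n_fft, hop_length=hop, window=window,
                       length=wav.shape[-1])
```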
SingFake: Singing Voice Deepfake Detection
The rise of singing voice synthesis presents critical challenges to artists
and industry stakeholders regarding unauthorized voice usage. Unlike synthesized
speech, synthesized singing voices are typically released in songs containing
strong background music that may hide synthesis artifacts. Additionally,
singing voices present different acoustic and linguistic characteristics from
speech utterances. These unique properties make singing voice deepfake
detection a relevant but significantly different problem from synthetic speech
detection. In this work, we propose the singing voice deepfake detection task.
We first present SingFake, the first curated in-the-wild dataset consisting of
28.93 hours of bona fide and 29.40 hours of deepfake song clips in five
languages from 40 singers. We provide a train/val/test split where the test
sets include various scenarios. We then use SingFake to evaluate four
state-of-the-art speech countermeasure systems trained on speech utterances.
We find that their performance on singing voices lags significantly behind
their performance on speech test data. When trained on SingFake, using either
separated vocal tracks or song
mixtures, these systems show substantial improvement. However, our evaluations
also identify challenges associated with unseen singers, communication codecs,
languages, and musical contexts, calling for dedicated research into singing
voice deepfake detection. The SingFake dataset and related resources are
available online.
Comment: Submitted to ICASSP 202
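Countermeasure evaluations such as these are conventionally scored with equal error rate (EER), the metric reported for the SAMO system above. Below is a minimal sketch of EER computation from system scores; it assumes higher scores indicate bona fide audio, and all names are illustrative.

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """EER: the operating point where false acceptance equals false rejection."""
    scores = np.concatenate([bonafide_scores, spoof_scores])
    labels = np.concatenate([np.ones(len(bonafide_scores)),
                             np.zeros(len(spoof_scores))])
    order = np.argsort(scores)          # sweep the threshold from low to high
    labels = labels[order]
    frr = np.cumsum(labels) / labels.sum()                   # bona fide rejected so far
    far = 1.0 - np.cumsum(1 - labels) / (1 - labels).sum()   # spoof still accepted
    idx = np.argmin(np.abs(frr - far))
    return (frr[idx] + far[idx]) / 2.0
```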
RL-Duet: Online Music Accompaniment Generation Using Deep Reinforcement Learning
This paper presents a deep reinforcement learning algorithm for online
accompaniment generation, with potential for real-time interactive
human-machine duet improvisation. Different from offline music generation and
harmonization, online music accompaniment requires the algorithm to respond to
human input and generate the machine counterpart sequentially. We cast
this as a reinforcement learning problem, where the generation agent learns a
policy to generate a musical note (action) based on previously generated
context (state). The key to this algorithm is a well-functioning reward
model. Instead of defining it using music composition rules, we learn this
model from monophonic and polyphonic training data. This model considers the
compatibility of the machine-generated note with both the machine-generated
context and the human-generated context. Experiments show that this algorithm
is able to respond to the human part and generate a melodic, harmonic, and
diverse machine part. Subjective evaluations of preference show that the
proposed algorithm generates music pieces of higher quality than the baseline
method.
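To make the training loop concrete, here is a hedged REINFORCE-style sketch in PyTorch of the note-by-note generation described above, where a learned reward model scores each sampled note against both contexts. The module interfaces, names, and shapes are assumptions for illustration, not RL-Duet's actual architecture.

```python
import torch

def reinforce_step(policy, reward_model, human_ctx, machine_ctx, optimizer):
    """Sample the next note (action) given both contexts (state), score it
    with the learned reward model, and take one policy-gradient step."""
    logits = policy(human_ctx, machine_ctx)          # (B, num_pitches)
    dist = torch.distributions.Categorical(logits=logits)
    note = dist.sample()                             # action: the next note
    with torch.no_grad():                            # reward model is fixed here
        reward = reward_model(human_ctx, machine_ctx, note)  # (B,)
    loss = -(reward * dist.log_prob(note)).mean()    # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return note, reward
```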