Timbre-reserved Adversarial Attack in Speaker Identification
As a type of biometric identification, a speaker identification (SID) system
is confronted with various kinds of attacks. Spoofing attacks typically
imitate the timbre of the target speaker, while adversarial attacks confuse
the SID system by adding a well-designed adversarial perturbation to
arbitrary speech. Although a spoofing attack mimics the victim's timbre, it
does not exploit the vulnerability of the SID model and may not drive the SID
system to the attacker's desired decision. An adversarial attack, in
contrast, can lead the SID system to a designated decision, but it cannot
meet the text or speaker-timbre requirements of specific attack scenarios. In
this study, to make the attack on SID both leverage the vulnerability of the
SID model and preserve the timbre of the target speaker, we propose a
timbre-reserved adversarial attack on speaker identification. We generate
timbre-reserved adversarial audio by adding an adversarial constraint during
the different training stages of the voice conversion (VC) model.
Specifically, the adversarial constraint uses the target speaker label to
optimize the adversarial perturbation added to the VC model representations
and is implemented by a speaker classifier that joins the VC model training.
The adversarial constraint helps steer the VC model to generate
speaker-specific audio. Eventually, the inference output of the VC model is
the desired adversarial fake audio, which preserves the target timbre and can
fool the SID system.

Comment: 11 pages, 8 figures
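As a rough illustration of such an adversarial constraint, the sketch below optimizes a perturbation with iterative sign-gradient steps so that a toy linear SID scorer's posterior for a chosen target speaker rises. The linear scorer, the embedding, and the step sizes are illustrative assumptions, not the paper's actual VC-integrated setup.

```python
import numpy as np

rng = np.random.default_rng(0)
num_spk, dim = 4, 16
W = rng.normal(size=(num_spk, dim))   # toy linear SID scorer (assumed)
x = rng.normal(size=dim)              # embedding of one utterance (assumed)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def targeted_perturbation(x, W, target, eps=0.05, steps=40):
    """Sign-gradient steps that raise the target speaker's posterior,
    standing in for the paper's adversarial constraint."""
    delta = np.zeros_like(x)
    onehot = np.eye(W.shape[0])[target]
    for _ in range(steps):
        p = softmax(W @ (x + delta))
        grad = W.T @ (p - onehot)      # d(-log p[target]) / d(input)
        delta -= eps * np.sign(grad)   # descend the targeted loss
    return delta

target = 2
delta = targeted_perturbation(x, W, target)
p_before = softmax(W @ x)[target]
p_after = softmax(W @ (x + delta))[target]
```

The sign steps keep the perturbation inside an L-infinity budget of `eps * steps`, so the fake audio stays close to the VC output while the target posterior grows.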
Pseudo-Siamese Network based Timbre-reserved Black-box Adversarial Attack in Speaker Identification
In this study, we propose a timbre-reserved adversarial attack approach for
speaker identification (SID) to not only exploit the weakness of the SID model
but also preserve the timbre of the target speaker in a black-box attack
setting. Particularly, we generate timbre-reserved fake audio by adding an
adversarial constraint during the training of the voice conversion model. Then,
we leverage a pseudo-Siamese network architecture to learn from the black-box
SID model while constraining both intrinsic similarity and structural
similarity. The intrinsic similarity loss learns an intrinsic invariance,
while the structural similarity loss ensures that the substitute SID model
shares a decision boundary similar to that of the fixed black-box SID model.
The substitute model can then be used as a proxy to generate timbre-reserved
fake audio for the attack. Experimental results on the Audio Deepfake
Detection (ADD) challenge dataset indicate that the attack success rate of
our proposed approach reaches up to 60.58% and 55.38% in the white-box and
black-box scenarios, respectively, and the generated audio can deceive both
humans and machines.

Comment: 5 pages
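A minimal sketch of substitute-model training against a black-box scorer, under the simplifying assumption that both models are linear: a cross-entropy term on the black box's hard decisions stands in for the structural-similarity loss, and a consistency term on noisy copies of each input stands in for the intrinsic-similarity loss. All names and hyperparameters here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, num_spk, n = 8, 3, 200
W_bb = rng.normal(size=(num_spk, dim))  # fixed black-box SID model (assumed linear)
X = rng.normal(size=(n, dim))           # query utterance embeddings (assumed)

def softmax(Z):
    E = np.exp(Z - Z.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

labels = np.argmax(X @ W_bb.T, axis=1)  # hard decisions from the black box

W_sub = np.zeros((num_spk, dim))        # substitute SID model
lr, lam = 0.5, 0.1
for _ in range(300):
    P = softmax(X @ W_sub.T)
    # structural term: cross-entropy against the black-box decisions
    G = (P - np.eye(num_spk)[labels]).T @ X / n
    # intrinsic term: keep representations of noisy copies close
    X_aug = X + 0.01 * rng.normal(size=X.shape)
    D = X - X_aug
    G += lam * W_sub @ (D.T @ D) / n
    W_sub -= lr * G

# fraction of queries where the substitute mimics the black-box decision
agreement = float(np.mean(np.argmax(X @ W_sub.T, axis=1) == labels))
```

Once the decision boundaries agree, perturbations crafted against `W_sub` in a white-box fashion can be transferred to attack the black box.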
TESSP: Text-Enhanced Self-Supervised Speech Pre-training
Self-supervised speech pre-training empowers the model with the contextual
structure inherent in the speech signal while self-supervised text pre-training
empowers the model with linguistic information. Both of them are beneficial for
downstream speech tasks such as ASR. However, the distinct pre-training
objectives make it challenging to jointly optimize the speech and text
representation in the same model. To solve this problem, we propose
Text-Enhanced Self-Supervised Speech Pre-training (TESSP), aiming to
incorporate the linguistic information into speech pre-training. Our model
consists of three parts, i.e., a speech encoder, a text encoder and a shared
encoder. The model takes unlabeled speech and text data as input and applies
the standard HuBERT and masked language modeling (MLM) losses, respectively.
We also propose
phoneme up-sampling and representation swapping to enable joint modeling of the
speech and text information. Specifically, to fix the length-mismatch problem
between speech and text data, we phonemize the text sequence and
up-sample the phonemes with the alignment information extracted from a small
set of supervised data. Moreover, to close the gap between the learned speech
and text representations, we swap the text representation with the speech
representation extracted by the respective private encoders according to the
alignment information. Experiments on the Librispeech dataset show that the
proposed TESSP model achieves more than 10% improvement compared with WavLM
on the test-clean and test-other sets. We also evaluate our model on the
SUPERB benchmark, where it outperforms WavLM on Phoneme Recognition,
Automatic Speech Recognition, and Speech Translation.

Comment: 9 pages, 4 figures
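The phoneme up-sampling and representation-swapping steps can be sketched with toy token sequences; the phonemes, durations, and frame labels below are invented for illustration, and real representations would be vectors rather than strings.

```python
def upsample_phonemes(phonemes, durations):
    """Repeat each phoneme by its aligned frame count so the text
    stream matches the speech frame rate."""
    frames = []
    for ph, d in zip(phonemes, durations):
        frames.extend([ph] * d)
    return frames

def swap_representations(speech_frames, text_frames, swap_idx):
    """Replace selected speech-frame entries with the aligned text
    entries (a toy stand-in for representation swapping)."""
    mixed = list(speech_frames)
    for i in swap_idx:
        mixed[i] = text_frames[i]
    return mixed

# durations come from the alignment on a small supervised set
text = upsample_phonemes(["HH", "AH", "L"], [2, 3, 1])
# text == ["HH", "HH", "AH", "AH", "AH", "L"]
speech = [f"s{i}" for i in range(6)]
mixed = swap_representations(speech, text, [1, 4])
# mixed == ["s0", "HH", "s2", "s3", "AH", "s5"]
```

Because the up-sampled text stream has one entry per speech frame, swapping leaves sequence lengths unchanged, which is what lets the shared encoder consume the mixed stream.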
Distinguishable Speaker Anonymization based on Formant and Fundamental Frequency Scaling
Speech data on the Internet are proliferating exponentially because of the
emergence of social media, and the sharing of such personal data raises obvious
security and privacy concerns. One solution to mitigate these concerns involves
concealing speaker identities before sharing speech data, also referred to as
speaker anonymization. In our previous work, we have developed an automatic
speaker verification (ASV)-model-free anonymization framework to protect
speaker privacy while preserving speech intelligibility. Although the
framework ranked first in the VoicePrivacy 2022 Challenge, the anonymization
was imperfect, since the speaker distinguishability of the anonymized speech
deteriorated. To address this issue, in this paper we directly model the
formant distribution and fundamental frequency (F0) to represent speaker
identity and anonymize the source speech by uniformly scaling the formants
and F0. Because the formants and F0 are scaled directly, no other speakers
are introduced, and the resulting degradation of speaker distinguishability
is avoided. The experimental results demonstrate that our proposed framework
improves speaker distinguishability and significantly outperforms our
previous framework in voice distinctiveness. Furthermore, the proposed method
can also trade off privacy against utility by using different scaling
factors.

Comment: Submitted to ICASSP 202
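A minimal sketch of the uniform scaling idea, assuming per-frame F0 values and formant frequencies have already been extracted; the scale factors below are illustrative, not the paper's tuned values.

```python
import numpy as np

def anonymize(f0, formants, f0_scale=1.2, formant_scale=0.9):
    """Uniformly scale F0 and formant frequencies; larger deviations
    of the scale factors from 1.0 trade utility for privacy."""
    f0_anon = np.where(f0 > 0, f0 * f0_scale, 0.0)  # keep unvoiced frames at 0
    return f0_anon, formants * formant_scale

f0 = np.array([0.0, 120.0, 125.0, 0.0, 130.0])      # Hz per frame, 0 = unvoiced
formants = np.array([[500.0, 1500.0, 2500.0]] * 5)  # F1-F3 per frame in Hz
f0_anon, formants_anon = anonymize(f0, formants)
```

Since the mapping only rescales the source speaker's own contours, no other speaker's characteristics are mixed in, which is why distinguishability between anonymized speakers is preserved.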