Timbre-reserved Adversarial Attack in Speaker Identification
As a type of biometric identification, a speaker identification (SID) system
is confronted with various kinds of attacks. The spoofing attacks typically
imitate the timbre of the target speakers, while the adversarial attacks
confuse the SID system by adding a well-designed adversarial perturbation to
arbitrary speech. Although a spoofing attack imitates the victim's timbre, it
does not exploit the vulnerability of the SID model and may not make the SID
system give the attacker's desired decision. As for the adversarial attack,
although the SID system can be led to a designated decision, it cannot meet the
specified text or speaker-timbre requirements of specific attack scenarios. In
this study, to make the attack on SID not only leverage the vulnerability of
the SID model but also preserve the timbre of the target speaker, we propose a
timbre-reserved adversarial attack on speaker identification. We generate
timbre-reserved adversarial audio by adding an adversarial constraint during
the different training stages of the voice conversion (VC) model. Specifically,
the adversarial constraint uses the target speaker label to optimize the
adversarial perturbation added to the VC model representations and is
implemented by a speaker classifier that joins in the VC model training. The
adversarial constraint helps control the VC model to generate speaker-specific
audio. Eventually, the output of the VC model is the desired adversarial fake
audio, which preserves the target speaker's timbre and can fool the SID system.
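To make the adversarial-constraint idea concrete, here is a minimal PyTorch sketch, assuming a `speaker_classifier` module that maps a VC representation to speaker logits; the function, shapes, and hyperparameters are hypothetical, not the authors' code:

```python
import torch
import torch.nn.functional as F

def optimize_perturbation(vc_repr, speaker_classifier, target_label,
                          steps=100, lr=1e-3, eps=0.05):
    """vc_repr: (B, T, D) VC representation; classifier returns (B, n_spk) logits."""
    delta = torch.zeros_like(vc_repr, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    target = torch.full((vc_repr.size(0),), target_label, dtype=torch.long)
    for _ in range(steps):
        logits = speaker_classifier(vc_repr + delta)
        loss = F.cross_entropy(logits, target)   # push decision toward target speaker
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)              # keep the perturbation small
    return (vc_repr + delta).detach()
```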
Pseudo-Siamese Network based Timbre-reserved Black-box Adversarial Attack in Speaker Identification
In this study, we propose a timbre-reserved adversarial attack approach for
speaker identification (SID) to not only exploit the weakness of the SID model
but also preserve the timbre of the target speaker in a black-box attack
setting. Particularly, we generate timbre-reserved fake audio by adding an
adversarial constraint during the training of the voice conversion model. Then,
we leverage a pseudo-Siamese network architecture to learn from the black-box
SID model, constraining both intrinsic similarity and structural similarity
simultaneously. The intrinsic similarity loss encourages the substitute model
to learn an intrinsic invariance, while the structural similarity loss ensures
that the substitute SID model shares a similar decision boundary with the fixed
black-box SID model. The substitute model can then be used as a proxy to
generate timbre-reserved fake audio for attacking. Experimental results on the
Audio Deepfake Detection (ADD) challenge dataset indicate that the attack
success rate of our proposed approach reaches 60.58% and 55.38% in the
white-box and black-box scenarios, respectively, and that the generated audio
can deceive both humans and machines.
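A hedged sketch of how the two losses might look in PyTorch; the abstract does not give the exact loss forms, so the cosine-invariance and KL terms below are illustrative stand-ins, and `substitute` is a hypothetical module returning an embedding and logits:

```python
import torch
import torch.nn.functional as F

def pseudo_siamese_losses(substitute, blackbox_probs, audio):
    """blackbox_probs: (B, C) outputs queried from the fixed black-box model."""
    emb, logits = substitute(audio)
    # Intrinsic similarity: embeddings should be invariant to a slightly
    # perturbed view of the same audio.
    emb_aug, _ = substitute(audio + 0.001 * torch.randn_like(audio))
    l_intrinsic = 1.0 - F.cosine_similarity(emb, emb_aug).mean()
    # Structural similarity: match the black-box decision boundary.
    l_structural = F.kl_div(F.log_softmax(logits, dim=-1),
                            blackbox_probs, reduction="batchmean")
    return l_intrinsic, l_structural
```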
Expressive-VC: Highly Expressive Voice Conversion with Attention Fusion of Bottleneck and Perturbation Features
Voice conversion for highly expressive speech is challenging. Current
approaches struggle to balance speaker similarity, intelligibility, and
expressiveness. To address this problem, we propose Expressive-VC, a novel
end-to-end voice conversion framework that leverages the advantages of both the
neural bottleneck feature (BNF) approach and the information perturbation
approach. Specifically, we use a BNF encoder and a Perturbed-Wav
encoder to form a content extractor to learn linguistic and para-linguistic
features respectively, where BNFs come from a robust pre-trained ASR model and
the perturbed wave becomes speaker-irrelevant after signal perturbation. We
further fuse the linguistic and para-linguistic features through an attention
mechanism, where speaker-dependent prosody features serve as the attention
query; these features are produced by a prosody encoder that takes the target
speaker embedding and the normalized pitch and energy of the source speech as
input. Finally, the decoder consumes the integrated features and the
speaker-dependent prosody features to generate the converted speech.
Experiments demonstrate that Expressive-VC is superior to several
state-of-the-art systems, achieving both high expressiveness captured from the
source speech and high speaker similarity to the target speaker, while
intelligibility is well maintained.
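The attention fusion step might be sketched as follows, with the prosody features as the query and the concatenated linguistic/para-linguistic features as key and value; dimensions and module wiring beyond those roles are assumptions:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, prosody, bnf_feats, perturbed_feats):
        # prosody: (B, T, D) speaker-dependent prosody features -> query.
        # Linguistic (BNF) and para-linguistic (perturbed-wav) features are
        # concatenated along time and attended over as key/value.
        kv = torch.cat([bnf_feats, perturbed_feats], dim=1)   # (B, T1+T2, D)
        fused, _ = self.attn(prosody, kv, kv)
        return fused                                          # (B, T, D)
```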
Preserving background sound in noise-robust voice conversion via multi-task learning
Background sound is an informative part of the audio scene and helps provide a
more immersive experience in real-application voice conversion (VC) scenarios.
However, prior VC research, mainly focused on clean voices, has paid little
attention to VC with background sound. The critical problems for preserving
background sound in VC are the inevitable speech distortion introduced by the
neural separation model and the cascade mismatch between the source separation model and the VC
model. In this paper, we propose an end-to-end framework via multi-task
learning which sequentially cascades a source separation (SS) module, a
bottleneck feature extraction module and a VC module. Specifically, the source
separation task explicitly considers critical phase information and confines
the distortion caused by the imperfect separation process. The source
separation task, the typical VC task, and the unified task share a uniform
reconstruction loss constrained by joint training to reduce the mismatch
between the SS and VC modules. Experimental results demonstrate that our
proposed framework significantly outperforms the baseline systems while
achieving quality and speaker similarity comparable to VC models trained with
clean data.
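The shared multi-task objective could look roughly like the sketch below; the L1 reconstruction terms and weights are assumptions, since the abstract only states that the three tasks share a uniform reconstruction loss under joint training:

```python
import torch.nn.functional as F

def joint_loss(ss_out, clean_speech, vc_out, vc_target, unified_out, mix_target,
               w_ss=1.0, w_vc=1.0, w_uni=1.0):
    l_ss  = F.l1_loss(ss_out, clean_speech)      # separation: confine distortion
    l_vc  = F.l1_loss(vc_out, vc_target)         # typical VC reconstruction
    l_uni = F.l1_loss(unified_out, mix_target)   # converted voice + background
    return w_ss * l_ss + w_vc * l_vc + w_uni * l_uni
```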
Distinguishable Speaker Anonymization based on Formant and Fundamental Frequency Scaling
Speech data on the Internet are proliferating exponentially because of the
emergence of social media, and the sharing of such personal data raises obvious
security and privacy concerns. One solution to mitigate these concerns involves
concealing speaker identities before sharing speech data, also referred to as
speaker anonymization. In our previous work, we developed an automatic speaker
verification (ASV)-model-free anonymization framework to protect speaker
privacy while preserving speech intelligibility. Although the framework ranked
first in the VoicePrivacy 2022 Challenge, the anonymization was imperfect,
since the speaker distinguishability of the anonymized speech was degraded. To
address this issue, in this paper we directly model the formant distribution
and fundamental frequency (F0) to represent speaker identity and anonymize the
source speech by uniformly scaling the formants and F0. Directly scaling the
formants and F0 avoids the degradation in speaker distinguishability caused by
introducing other speakers' characteristics into the anonymized speech. The
experimental results demonstrate that our proposed framework improves speaker
distinguishability and significantly outperforms our previous framework in
voice distinctiveness. Furthermore, our proposed method can also trade off
privacy and utility by using different scaling factors.
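As an illustration of uniform formant and F0 scaling, here is a sketch built on the WORLD vocoder via the pyworld package; the paper's actual analysis/synthesis pipeline and the scaling factors are not specified, so these are assumptions:

```python
import numpy as np
import pyworld as pw

def anonymize(wav, fs, alpha_formant=1.1, alpha_f0=1.2):
    x = wav.astype(np.float64)
    f0, sp, ap = pw.wav2world(x, fs)          # pitch, envelope, aperiodicity
    # Warp the spectral envelope along frequency to shift formants uniformly:
    # the new envelope at bin k takes the old value at k / alpha_formant.
    n_bins = sp.shape[1]
    src = np.arange(n_bins) / alpha_formant
    sp_warp = np.stack([np.interp(src, np.arange(n_bins), frame)
                        for frame in sp])
    return pw.synthesize(f0 * alpha_f0, sp_warp, ap, fs)
```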
DualVC 2: Dynamic Masked Convolution for Unified Streaming and Non-Streaming Voice Conversion
Voice conversion is becoming increasingly popular, and a growing number of
application scenarios require models with streaming inference capabilities. The
recently proposed DualVC attempts to achieve this objective through streaming
model architecture design and intra-model knowledge distillation along with
hybrid predictive coding to compensate for the lack of future information.
However, DualVC encounters several problems that limit its performance. First,
the autoregressive decoder suffers from error accumulation by nature and also
limits inference speed. Second, causal convolution enables streaming capability
but cannot fully exploit future information within chunks. Third, the model is
unable to effectively suppress the noise in unvoiced segments,
lowering the sound quality. In this paper, we propose DualVC 2 to address these
issues. Specifically, the model backbone is migrated to a Conformer-based
architecture, empowering parallel inference. Causal convolution is replaced by
non-causal convolution with a dynamic chunk mask to make better use of
within-chunk future information. Also, quiet attention is introduced to enhance
the model's noise robustness. Experiments show that DualVC 2 outperforms DualVC
and other baseline systems in both subjective and objective metrics, with only
186.4 ms latency. Our audio samples are made publicly available.
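A dynamic chunk mask of the kind described can be sketched as below; the chunk-size sampling strategy is an assumption, but the key property holds: each frame sees all past frames plus the future frames inside its own chunk, and the chunk size is resampled during training:

```python
import torch

def dynamic_chunk_mask(seq_len, max_chunk=32):
    chunk = int(torch.randint(1, max_chunk + 1, (1,)))   # resampled per batch
    idx = torch.arange(seq_len)
    same_chunk = (idx[:, None] // chunk) == (idx[None, :] // chunk)
    past = idx[None, :] <= idx[:, None]
    return same_chunk | past      # (T, T) boolean: True = visible
```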
PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts
Style voice conversion aims to transform the style of source speech to a
desired style according to real-world application demands. However, current
style voice conversion approaches rely on pre-defined labels or reference
speech to control the conversion process, which limits style diversity or falls
short in the intuitiveness and interpretability of the style representation. In
this study, we propose PromptVC, a novel style voice
style representation. In this study, we propose PromptVC, a novel style voice
conversion approach that employs a latent diffusion model to generate a style
vector driven by natural language prompts. Specifically, the style vector is
extracted by a style encoder during training, and then the latent diffusion
model is trained independently to sample the style vector from noise, with this
process being conditioned on natural language prompts. To improve style
expressiveness, we leverage HuBERT to extract discrete tokens and replace them
with their K-means center embeddings to serve as the linguistic content, which
minimizes residual style information. Additionally, we deduplicate consecutive
identical discrete tokens and employ a differentiable duration predictor to
re-predict the duration of each token, which adapts the duration of the same linguistic
content to different styles. The subjective and objective evaluation results
demonstrate the effectiveness of our proposed system.
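The content pipeline (token deduplication plus K-means center embeddings) might be sketched as follows; variable names are hypothetical:

```python
import torch

def content_units(tokens, kmeans_centers):
    """tokens: (T,) discrete HuBERT units; kmeans_centers: (K, D)."""
    keep = torch.ones_like(tokens, dtype=torch.bool)
    keep[1:] = tokens[1:] != tokens[:-1]    # collapse consecutive duplicates
    dedup = tokens[keep]                    # durations are re-predicted later
    return kmeans_centers[dedup]            # (T', D) centroid embeddings
```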
UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis
Text-to-speech (TTS) and singing voice synthesis (SVS) aim to generate high-quality speaking and singing voices from textual input and music scores, respectively. Unifying TTS and SVS into a single system is crucial for applications that require both. Existing methods usually suffer from limitations: they rely either on both singing and speaking data from the same person or on cascaded models of multiple tasks. To address these problems, this paper proposes UniSyn, a simple and elegant unified framework for TTS and SVS. It is an end-to-end model that can make a voice both speak and sing given only singing or speaking data from that person. Specifically, UniSyn introduces a multi-conditional variational autoencoder (MC-VAE), which constructs two independent latent sub-spaces with speaker- and style-related (i.e., speaking or singing) conditions for flexible control. Moreover, supervised guided-VAE and timbre perturbation with a Wasserstein distance constraint are leveraged to further disentangle speaker timbre and style. Experiments conducted on two speakers and two singers demonstrate that UniSyn can generate natural speaking and singing voices without corresponding training data. The proposed approach outperforms state-of-the-art end-to-end voice generation work, which proves the effectiveness and advantages of UniSyn.
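A toy sketch of the MC-VAE idea, with two independent latent sub-spaces conditioned on speaker and style respectively; all architectural details here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MCVAE(nn.Module):
    def __init__(self, d_in=80, d_z=16, n_speakers=4, n_styles=2):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, d_z)
        self.sty_emb = nn.Embedding(n_styles, d_z)       # style: speak or sing
        self.enc_spk = nn.Linear(d_in + d_z, 2 * d_z)    # speaker sub-space
        self.enc_sty = nn.Linear(d_in + d_z, 2 * d_z)    # style sub-space
        self.dec = nn.Linear(2 * d_z, d_in)

    @staticmethod
    def sample(stats):
        mu, logvar = stats.chunk(2, dim=-1)              # reparameterization
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def forward(self, x, speaker, style):
        z_s = self.sample(self.enc_spk(torch.cat([x, self.spk_emb(speaker)], -1)))
        z_t = self.sample(self.enc_sty(torch.cat([x, self.sty_emb(style)], -1)))
        return self.dec(torch.cat([z_s, z_t], -1))       # reconstruction
```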