Timbre-reserved Adversarial Attack in Speaker Identification
As a type of biometric identification, a speaker identification (SID) system
is confronted with various kinds of attacks. The spoofing attacks typically
imitate the timbre of the target speakers, while the adversarial attacks
confuse the SID system by adding a well-designed adversarial perturbation to
arbitrary speech. Although a spoofing attack imitates the victim's timbre, it
does not exploit the vulnerability of the SID model and may not make the SID
system give the attacker's desired decision. As for the adversarial attack,
although the SID system can be led to a designated decision, it cannot meet the
specified text or speaker-timbre requirements of specific attack scenarios. In
this study, to make the attack on SID not only leverage the vulnerability of
the SID model but also preserve the timbre of the target speaker, we propose a
timbre-reserved adversarial attack on speaker identification. We generate
timbre-reserved adversarial audio by adding an adversarial constraint during
the different training stages of the voice conversion (VC) model. Specifically,
the adversarial constraint uses the target speaker label to optimize the
adversarial perturbation added to the VC model representations and is
implemented by a speaker classifier that joins in the VC model training. The
adversarial constraint helps control the VC model to generate speaker-specific
audio. Eventually, the output of the VC model is the desired adversarial fake
audio, which preserves the target speaker's timbre and can fool the SID system.
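To make the adversarial-constraint idea concrete, here is a minimal PyTorch sketch, assuming a `speaker_classifier` module that maps a VC representation to speaker logits; the function, shapes, and hyperparameters are hypothetical, not the authors' code:

```python
import torch
import torch.nn.functional as F

def optimize_perturbation(vc_repr, speaker_classifier, target_label,
                          steps=100, lr=1e-3, eps=0.05):
    """vc_repr: (B, T, D) VC representation; classifier returns (B, n_spk) logits."""
    delta = torch.zeros_like(vc_repr, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    target = torch.full((vc_repr.size(0),), target_label, dtype=torch.long)
    for _ in range(steps):
        logits = speaker_classifier(vc_repr + delta)
        loss = F.cross_entropy(logits, target)   # push decision toward target speaker
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)              # keep the perturbation small
    return (vc_repr + delta).detach()
```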
Pseudo-Siamese Network based Timbre-reserved Black-box Adversarial Attack in Speaker Identification
In this study, we propose a timbre-reserved adversarial attack approach for
speaker identification (SID) to not only exploit the weakness of the SID model
but also preserve the timbre of the target speaker in a black-box attack
setting. Particularly, we generate timbre-reserved fake audio by adding an
adversarial constraint during the training of the voice conversion model. Then,
we leverage a pseudo-Siamese network architecture to learn from the black-box
SID model, constraining both intrinsic similarity and structural similarity
simultaneously. The intrinsic similarity loss encourages the substitute model
to learn an intrinsic invariance, while the structural similarity loss ensures
that the substitute SID model shares a similar decision boundary with the fixed
black-box SID model. The substitute model can then be used as a proxy to
generate timbre-reserved fake audio for attacking. Experimental results on the
Audio Deepfake Detection (ADD) challenge dataset indicate that the attack
success rate of our proposed approach reaches 60.58% and 55.38% in the
white-box and black-box scenarios, respectively, and that the generated audio
can deceive both humans and machines.
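A hedged sketch of how the two losses might look in PyTorch; the abstract does not give the exact loss forms, so the cosine-invariance and KL terms below are illustrative stand-ins, and `substitute` is a hypothetical module returning an embedding and logits:

```python
import torch
import torch.nn.functional as F

def pseudo_siamese_losses(substitute, blackbox_probs, audio):
    """blackbox_probs: (B, C) outputs queried from the fixed black-box model."""
    emb, logits = substitute(audio)
    # Intrinsic similarity: embeddings should be invariant to a slightly
    # perturbed view of the same audio.
    emb_aug, _ = substitute(audio + 0.001 * torch.randn_like(audio))
    l_intrinsic = 1.0 - F.cosine_similarity(emb, emb_aug).mean()
    # Structural similarity: match the black-box decision boundary.
    l_structural = F.kl_div(F.log_softmax(logits, dim=-1),
                            blackbox_probs, reduction="batchmean")
    return l_intrinsic, l_structural
```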
Expressive-VC: Highly Expressive Voice Conversion with Attention Fusion of Bottleneck and Perturbation Features
Voice conversion for highly expressive speech is challenging. Current
approaches struggle to balance speaker similarity, intelligibility, and
expressiveness. To address this problem, we propose Expressive-VC, a novel
end-to-end voice conversion framework that leverages the advantages of both the
neural bottleneck feature (BNF) approach and the information perturbation
approach. Specifically, we use a BNF encoder and a Perturbed-Wav
encoder to form a content extractor to learn linguistic and para-linguistic
features respectively, where BNFs come from a robust pre-trained ASR model and
the perturbed wave becomes speaker-irrelevant after signal perturbation. We
further fuse the linguistic and para-linguistic features through an attention
mechanism, where speaker-dependent prosody features serve as the attention
query; these features are produced by a prosody encoder that takes the target
speaker embedding and the normalized pitch and energy of the source speech as
input. Finally, the decoder consumes the integrated features and the
speaker-dependent prosody features to generate the converted speech.
Experiments demonstrate that Expressive-VC is superior to several
state-of-the-art systems, achieving both high expressiveness captured from the
source speech and high speaker similarity to the target speaker, while
intelligibility is well maintained.
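The attention fusion step might be sketched as follows, with the prosody features as the query and the concatenated linguistic/para-linguistic features as key and value; dimensions and module wiring beyond those roles are assumptions:

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, prosody, bnf_feats, perturbed_feats):
        # prosody: (B, T, D) speaker-dependent prosody features -> query.
        # Linguistic (BNF) and para-linguistic (perturbed-wav) features are
        # concatenated along time and attended over as key/value.
        kv = torch.cat([bnf_feats, perturbed_feats], dim=1)   # (B, T1+T2, D)
        fused, _ = self.attn(prosody, kv, kv)
        return fused                                          # (B, T, D)
```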
Preserving background sound in noise-robust voice conversion via multi-task learning
Background sound is an informative part of the audio scene and helps provide a
more immersive experience in real-application voice conversion (VC) scenarios.
However, prior VC research, mainly focused on clean voices, has paid little
attention to VC with background sound. The critical problems for preserving
background sound in VC are the inevitable speech distortion introduced by the
neural separation model and the cascade mismatch between the source separation model and the VC
model. In this paper, we propose an end-to-end framework via multi-task
learning which sequentially cascades a source separation (SS) module, a
bottleneck feature extraction module and a VC module. Specifically, the source
separation task explicitly considers critical phase information and confines
the distortion caused by the imperfect separation process. The source
separation task, the typical VC task, and the unified task share a uniform
reconstruction loss constrained by joint training to reduce the mismatch
between the SS and VC modules. Experimental results demonstrate that our
proposed framework significantly outperforms the baseline systems while
achieving quality and speaker similarity comparable to VC models trained with
clean data.
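The shared multi-task objective could look roughly like the sketch below; the L1 reconstruction terms and weights are assumptions, since the abstract only states that the three tasks share a uniform reconstruction loss under joint training:

```python
import torch.nn.functional as F

def joint_loss(ss_out, clean_speech, vc_out, vc_target, unified_out, mix_target,
               w_ss=1.0, w_vc=1.0, w_uni=1.0):
    l_ss  = F.l1_loss(ss_out, clean_speech)      # separation: confine distortion
    l_vc  = F.l1_loss(vc_out, vc_target)         # typical VC reconstruction
    l_uni = F.l1_loss(unified_out, mix_target)   # converted voice + background
    return w_ss * l_ss + w_vc * l_vc + w_uni * l_uni
```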
Distinguishable Speaker Anonymization based on Formant and Fundamental Frequency Scaling
Speech data on the Internet are proliferating exponentially because of the
emergence of social media, and the sharing of such personal data raises obvious
security and privacy concerns. One solution to mitigate these concerns involves
concealing speaker identities before sharing speech data, also referred to as
speaker anonymization. In our previous work, we developed an automatic speaker
verification (ASV)-model-free anonymization framework to protect speaker
privacy while preserving speech intelligibility. Although the framework ranked
first in the VoicePrivacy 2022 Challenge, the anonymization was imperfect,
since the speaker distinguishability of the anonymized speech was degraded. To
address this issue, in this paper we directly model the formant distribution
and fundamental frequency (F0) to represent speaker identity and anonymize the
source speech by uniformly scaling the formants and F0. Directly scaling the
formants and F0 avoids the degradation in speaker distinguishability caused by
introducing other speakers' characteristics into the anonymized speech. The
experimental results demonstrate that our proposed framework improves speaker
distinguishability and significantly outperforms our previous framework in
voice distinctiveness. Furthermore, our proposed method can also trade off
privacy and utility by using different scaling factors.
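As an illustration of uniform formant and F0 scaling, here is a sketch built on the WORLD vocoder via the pyworld package; the paper's actual analysis/synthesis pipeline and the scaling factors are not specified, so these are assumptions:

```python
import numpy as np
import pyworld as pw

def anonymize(wav, fs, alpha_formant=1.1, alpha_f0=1.2):
    x = wav.astype(np.float64)
    f0, sp, ap = pw.wav2world(x, fs)          # pitch, envelope, aperiodicity
    # Warp the spectral envelope along frequency to shift formants uniformly:
    # the new envelope at bin k takes the old value at k / alpha_formant.
    n_bins = sp.shape[1]
    src = np.arange(n_bins) / alpha_formant
    sp_warp = np.stack([np.interp(src, np.arange(n_bins), frame)
                        for frame in sp])
    return pw.synthesize(f0 * alpha_f0, sp_warp, ap, fs)
```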
DualVC 2: Dynamic Masked Convolution for Unified Streaming and Non-Streaming Voice Conversion
Voice conversion is becoming increasingly popular, and a growing number of
application scenarios require models with streaming inference capabilities. The
recently proposed DualVC attempts to achieve this objective through streaming
model architecture design and intra-model knowledge distillation along with
hybrid predictive coding to compensate for the lack of future information.
However, DualVC encounters several problems that limit its performance. First,
the autoregressive decoder suffers from error accumulation by nature and also
limits inference speed. Second, causal convolution enables streaming capability
but cannot fully exploit future information within chunks. Third, the model is
unable to effectively suppress the noise in unvoiced segments,
lowering the sound quality. In this paper, we propose DualVC 2 to address these
issues. Specifically, the model backbone is migrated to a Conformer-based
architecture, empowering parallel inference. Causal convolution is replaced by
non-causal convolution with a dynamic chunk mask to make better use of
within-chunk future information. Also, quiet attention is introduced to enhance
the model's noise robustness. Experiments show that DualVC 2 outperforms DualVC
and other baseline systems in both subjective and objective metrics, with only
186.4 ms latency. Our audio samples are made publicly available.
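A dynamic chunk mask of the kind described can be sketched as below; the chunk-size sampling strategy is an assumption, but the key property holds: each frame sees all past frames plus the future frames inside its own chunk, and the chunk size is resampled during training:

```python
import torch

def dynamic_chunk_mask(seq_len, max_chunk=32):
    chunk = int(torch.randint(1, max_chunk + 1, (1,)))   # resampled per batch
    idx = torch.arange(seq_len)
    same_chunk = (idx[:, None] // chunk) == (idx[None, :] // chunk)
    past = idx[None, :] <= idx[:, None]
    return same_chunk | past      # (T, T) boolean: True = visible
```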
PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts
Style voice conversion aims to transform the style of source speech to a
desired style according to real-world application demands. However, current
style voice conversion approaches rely on pre-defined labels or reference
speech to control the conversion process, which limits style diversity or falls
short in the intuitiveness and interpretability of the style representation. In
this study, we propose PromptVC, a novel style voice
style representation. In this study, we propose PromptVC, a novel style voice
conversion approach that employs a latent diffusion model to generate a style
vector driven by natural language prompts. Specifically, the style vector is
extracted by a style encoder during training, and then the latent diffusion
model is trained independently to sample the style vector from noise, with this
process being conditioned on natural language prompts. To improve style
expressiveness, we leverage HuBERT to extract discrete tokens and replace them
with their K-means center embeddings to serve as the linguistic content, which
minimizes residual style information. Additionally, we deduplicate consecutive
identical discrete tokens and employ a differentiable duration predictor to
re-predict the duration of each token, which adapts the duration of the same linguistic
content to different styles. The subjective and objective evaluation results
demonstrate the effectiveness of our proposed system.
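The content pipeline (token deduplication plus K-means center embeddings) might be sketched as follows; variable names are hypothetical:

```python
import torch

def content_units(tokens, kmeans_centers):
    """tokens: (T,) discrete HuBERT units; kmeans_centers: (K, D)."""
    keep = torch.ones_like(tokens, dtype=torch.bool)
    keep[1:] = tokens[1:] != tokens[:-1]    # collapse consecutive duplicates
    dedup = tokens[keep]                    # durations are re-predicted later
    return kmeans_centers[dedup]            # (T', D) centroid embeddings
```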
UniSyn: An End-to-End Unified Model for Text-to-Speech and Singing Voice Synthesis
Text-to-speech (TTS) and singing voice synthesis (SVS) aim to generate high-quality speaking and singing voices from textual input and music scores, respectively. Unifying TTS and SVS into a single system is crucial for applications that require both. Existing methods usually suffer from limitations: they rely either on both singing and speaking data from the same person or on cascaded models of multiple tasks. To address these problems, this paper proposes UniSyn, a simple and elegant unified framework for TTS and SVS. It is an end-to-end model that can make a voice both speak and sing given only singing or speaking data from that person. Specifically, UniSyn introduces a multi-conditional variational autoencoder (MC-VAE), which constructs two independent latent sub-spaces with speaker- and style-related (i.e., speaking or singing) conditions for flexible control. Moreover, supervised guided-VAE and timbre perturbation with a Wasserstein distance constraint are leveraged to further disentangle speaker timbre and style. Experiments conducted on two speakers and two singers demonstrate that UniSyn can generate natural speaking and singing voices without corresponding training data. The proposed approach outperforms state-of-the-art end-to-end voice generation work, which proves the effectiveness and advantages of UniSyn.
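A toy sketch of the MC-VAE idea, with two independent latent sub-spaces conditioned on speaker and style respectively; all architectural details here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MCVAE(nn.Module):
    def __init__(self, d_in=80, d_z=16, n_speakers=4, n_styles=2):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, d_z)
        self.sty_emb = nn.Embedding(n_styles, d_z)       # style: speak or sing
        self.enc_spk = nn.Linear(d_in + d_z, 2 * d_z)    # speaker sub-space
        self.enc_sty = nn.Linear(d_in + d_z, 2 * d_z)    # style sub-space
        self.dec = nn.Linear(2 * d_z, d_in)

    @staticmethod
    def sample(stats):
        mu, logvar = stats.chunk(2, dim=-1)              # reparameterization
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def forward(self, x, speaker, style):
        z_s = self.sample(self.enc_spk(torch.cat([x, self.spk_emb(speaker)], -1)))
        z_t = self.sample(self.enc_sty(torch.cat([x, self.sty_emb(style)], -1)))
        return self.dec(torch.cat([z_s, z_t], -1))       # reconstruction
```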