Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder
Generative Adversarial Network (GAN) based vocoders offer superior inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator to strengthen GAN-based vocoders. Most existing time-frequency-representation-based discriminators are rooted in the Short-Time Fourier Transform (STFT), whose time-frequency resolution is fixed across the spectrogram, making it ill-suited to signals such as singing voices that require flexible attention across frequency bands. Motivated by this, our study adopts the Constant-Q Transform (CQT), whose frequency-dependent resolution yields better modeling of pitch accuracy and harmonic tracking. Specifically, we propose a Multi-Scale Sub-Band CQT (MS-SB-CQT) Discriminator, which operates on the CQT spectrogram at multiple scales and performs sub-band processing by octave. Experiments conducted on both speech and singing voices confirm the effectiveness of the proposed method. Moreover, we verify that CQT-based and STFT-based discriminators are complementary under joint training: enhanced by the proposed MS-SB-CQT and the existing MS-STFT Discriminators, the MOS of HiFi-GAN improves from 3.27 to 3.87 for seen singers and from 3.40 to 3.78 for unseen singers.
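The constant-Q property the abstract relies on means each filter's ratio of center frequency to bandwidth is fixed, so bins are spaced geometrically and low frequencies get finer frequency resolution than in an STFT. A minimal numpy sketch of the filter-bank parameters (the defaults `f_min=32.70` Hz and 12 bins per octave are illustrative choices, not values from the paper):

```python
import numpy as np

def cqt_filter_params(f_min=32.70, n_bins=84, bins_per_octave=12):
    """Center frequencies and bandwidths of a constant-Q filter bank.

    Bins are spaced geometrically, so the quality factor
    Q = f_k / bw_k is identical for every bin -- unlike the STFT,
    whose absolute bandwidth is the same at all frequencies.
    """
    k = np.arange(n_bins)
    f = f_min * 2.0 ** (k / bins_per_octave)          # geometric spacing
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)  # constant Q factor
    bw = f / q                                        # bandwidth grows with frequency
    return f, bw

f, bw = cqt_filter_params()
```

Because the bins line up with octaves (every `bins_per_octave` bins doubles the frequency), sub-band processing "according to different octaves" amounts to slicing the bin axis into consecutive groups of `bins_per_octave` bins.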
Youla-Kucera parameterized adaptive tracking control for optical data storage systems
In next-generation optical data storage systems, the tolerance on the tracking error becomes even smaller under various unknown working conditions. However, unknown external disturbances caused by vibrations make it difficult to maintain the desired tracking precision during normal disk operation. This paper proposes an adaptive regulation approach to keep the tracking error below its desired value despite these unknown disturbances. The regulator is designed by augmenting a base controller into a Youla-Kucera (Q) parameterized set of stabilizing controllers, so that both deterministic and random disturbances can be dealt with properly. An adaptive algorithm searches for the Q parameter that satisfies the Internal Model Principle, so that exact regulation against the unknown deterministic disturbance is achieved. The performance of the proposed approach is evaluated with experimental results that illustrate the capability of the adaptive regulator to attenuate unknown disturbances and achieve the desired tracking precision.
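The Youla-Kucera construction behind this design can be summarized, for the SISO case, as a textbook sketch (generic plant, not the paper's specific servo model). With a coprime factorization of the plant and a Bezout identity over stable transfer functions,

```latex
P = \frac{N}{M}, \qquad X M + Y N = 1,
\qquad K(Q) = \frac{Y + M Q}{X - N Q},
\qquad S(Q) = \frac{1}{1 + P K(Q)} = M \,(X - N Q),
```

every stable Q yields a stabilizing controller, and the sensitivity function S(Q) is affine in Q. Adaptation then reduces to searching over Q so that S(Q) has zeros at the (unknown) disturbance frequencies, which is exactly the Internal Model Principle condition for rejecting them.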
PIAVE: A Pose-Invariant Audio-Visual Speaker Extraction Network
In everyday spoken communication, we commonly look at a talker's turning head while listening to his/her voice. Humans watch the talker to listen better, and so can machines. However, previous studies on audio-visual speaker extraction have not effectively handled a varying talking face. This paper studies how to take full advantage of the varying talking face. We propose a Pose-Invariant Audio-Visual Speaker Extraction Network (PIAVE) that incorporates an additional pose-invariant view to improve audio-visual speaker extraction. Specifically, we generate the pose-invariant view from each original pose orientation, which provides the model with a consistent frontal view of the talker regardless of his/her head pose, thereby forming a multi-view visual input for the speaker. Experiments on the multi-view MEAD and in-the-wild LRS3 datasets demonstrate that PIAVE outperforms the state-of-the-art and is more robust to pose variations. Comment: Interspeech 202
Voice conversion versus speaker verification: an overview
A speaker verification system automatically accepts or rejects a claimed speaker identity based on a speech sample. Recently, major progress in speaker verification has led to mass-market adoption, such as in smartphones and in online commerce for user authentication. A major concern when deploying speaker verification technology is whether a system is robust against spoofing attacks. Speaker verification studies have provided good insight into speaker characterization, which has contributed to the progress of voice conversion technology. Unfortunately, voice conversion has become one of the most easily accessible techniques for carrying out spoofing attacks, and it therefore presents a threat to speaker verification systems. In this paper, we briefly introduce the fundamentals of voice conversion and speaker verification technologies. We then give an overview of recent spoofing attack studies under different conditions, with a focus on voice conversion spoofing attacks. We also discuss anti-spoofing measures for speaker verification. Published version
Investigating gated recurrent neural networks for speech synthesis
Recently, recurrent neural networks (RNNs) as powerful sequence models have
re-emerged as a potential acoustic model for statistical parametric speech
synthesis (SPSS). The long short-term memory (LSTM) architecture is
particularly attractive because it addresses the vanishing gradient problem in
standard RNNs, making such networks easier to train. Although recent studies have
demonstrated that LSTMs can achieve significantly better performance on SPSS
than deep feed-forward neural networks, little is known about why. Here we
attempt to answer two questions: a) why do LSTMs work well as a sequence model
for SPSS; b) which component (e.g., input gate, output gate, forget gate) is
most important. We present a visual analysis alongside a series of experiments, leading us to propose a simplified architecture. The simplified architecture has significantly fewer parameters than an LSTM, reducing generation complexity considerably without degrading quality. Comment: Accepted by ICASSP 201
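For reference, the gates this abstract analyzes appear in a standard LSTM cell as follows (a generic numpy sketch with toy dimensions, not the paper's simplified architecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell.

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias,
    packed as [input gate, forget gate, cell candidate, output gate].
    """
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0:H])          # input gate: how much new content to write
    f = sigmoid(z[H:2 * H])      # forget gate: how much old state to keep
    g = np.tanh(z[2 * H:3 * H])  # candidate cell content
    o = sigmoid(z[3 * H:4 * H])  # output gate: how much state to expose
    c_new = f * c + i * g        # additive path that eases gradient flow
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# Toy usage with random weights.
rng = np.random.default_rng(0)
D, H = 3, 4
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c,
                 rng.normal(size=(4 * H, D)),
                 rng.normal(size=(4 * H, H)),
                 np.zeros(4 * H))
```

Questions a) and b) above amount to asking which of `i`, `f`, and `o` can be removed or fixed to a constant without hurting SPSS quality; the additive update of `c_new` is what mitigates the vanishing-gradient problem mentioned earlier.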