Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder
Generative Adversarial Network (GAN) based vocoders offer superior inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator to strengthen GAN-based vocoders. Most existing time-frequency-representation-based discriminators are rooted in the Short-Time Fourier Transform (STFT), whose time-frequency resolution is fixed across the spectrogram, making it ill-suited to signals such as singing voices that require flexible attention across frequency bands. Motivated by this, our study adopts the Constant-Q Transform (CQT), whose frequency-dependent resolution yields better modeling of pitch accuracy and harmonic tracking. Specifically, we propose a Multi-Scale Sub-Band CQT (MS-SB-CQT) Discriminator, which operates on the CQT spectrogram at multiple scales and performs sub-band processing by octave. Experiments conducted on both speech and singing voices confirm the effectiveness of the proposed method. Moreover, we verify that CQT-based and STFT-based discriminators are complementary under joint training: enhanced by the proposed MS-SB-CQT and the existing MS-STFT Discriminators, the MOS of HiFi-GAN improves from 3.27 to 3.87 for seen singers and from 3.40 to 3.78 for unseen singers.
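The constant-Q property the abstract relies on means each filter's ratio of center frequency to bandwidth is fixed, so bins are spaced geometrically and low frequencies get finer frequency resolution than in an STFT. A minimal numpy sketch of the filter-bank parameters (the defaults `f_min=32.70` Hz and 12 bins per octave are illustrative choices, not values from the paper):

```python
import numpy as np

def cqt_filter_params(f_min=32.70, n_bins=84, bins_per_octave=12):
    """Center frequencies and bandwidths of a constant-Q filter bank.

    Bins are spaced geometrically, so the quality factor
    Q = f_k / bw_k is identical for every bin -- unlike the STFT,
    whose absolute bandwidth is the same at all frequencies.
    """
    k = np.arange(n_bins)
    f = f_min * 2.0 ** (k / bins_per_octave)          # geometric spacing
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)  # constant Q factor
    bw = f / q                                        # bandwidth grows with frequency
    return f, bw

f, bw = cqt_filter_params()
```

Because the bins line up with octaves (every `bins_per_octave` bins doubles the frequency), sub-band processing "according to different octaves" amounts to slicing the bin axis into consecutive groups of `bins_per_octave` bins.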
Youla-Kucera parameterized adaptive tracking control for optical data storage systems
In next-generation optical data storage systems, the tolerance on the tracking error becomes even smaller under various unknown working conditions. However, unknown external disturbances caused by vibrations make it difficult to maintain the desired tracking precision during normal disk operation. This paper proposes an adaptive regulation approach to keep the tracking error below its desired value despite these unknown disturbances. The regulator is designed by augmenting a base controller into a Youla-Kucera (Q) parameterized set of stabilizing controllers, so that both deterministic and random disturbances can be dealt with properly. An adaptive algorithm searches for the Q parameter that satisfies the Internal Model Principle, so that exact regulation against the unknown deterministic disturbance is achieved. The performance of the proposed approach is evaluated with experimental results that illustrate the capability of the adaptive regulator to attenuate unknown disturbances and achieve the desired tracking precision.
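The Youla-Kucera construction behind this design can be summarized, for the SISO case, as a textbook sketch (generic plant, not the paper's specific servo model). With a coprime factorization of the plant and a Bezout identity over stable transfer functions,

```latex
P = \frac{N}{M}, \qquad X M + Y N = 1,
\qquad K(Q) = \frac{Y + M Q}{X - N Q},
\qquad S(Q) = \frac{1}{1 + P K(Q)} = M \,(X - N Q),
```

every stable Q yields a stabilizing controller, and the sensitivity function S(Q) is affine in Q. Adaptation then reduces to searching over Q so that S(Q) has zeros at the (unknown) disturbance frequencies, which is exactly the Internal Model Principle condition for rejecting them.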
PIAVE: A Pose-Invariant Audio-Visual Speaker Extraction Network
In everyday spoken communication, we commonly look at a talker's turning head while listening to his/her voice. Humans watch the talker to listen better, and so can machines. However, previous studies on audio-visual speaker extraction have not effectively handled a varying talking face. This paper studies how to take full advantage of the varying talking face. We propose a Pose-Invariant Audio-Visual Speaker Extraction Network (PIAVE) that incorporates an additional pose-invariant view to improve audio-visual speaker extraction. Specifically, we generate the pose-invariant view from each original pose orientation, which provides the model with a consistent frontal view of the talker regardless of his/her head pose, thereby forming a multi-view visual input for the speaker. Experiments on the multi-view MEAD and in-the-wild LRS3 datasets demonstrate that PIAVE outperforms the state-of-the-art and is more robust to pose variations. Comment: Interspeech 202
Voice conversion versus speaker verification: an overview
A speaker verification system automatically accepts or rejects a claimed speaker identity based on a speech sample. Recently, major progress in speaker verification has led to mass-market adoption, such as in smartphones and in online commerce for user authentication. A major concern when deploying speaker verification technology is whether a system is robust against spoofing attacks. Speaker verification studies have provided good insight into speaker characterization, which has contributed to the progress of voice conversion technology. Unfortunately, voice conversion has become one of the most easily accessible techniques for carrying out spoofing attacks, and it therefore presents a threat to speaker verification systems. In this paper, we briefly introduce the fundamentals of voice conversion and speaker verification technologies. We then give an overview of recent spoofing attack studies under different conditions, with a focus on voice conversion spoofing attacks. We also discuss anti-spoofing measures for speaker verification. Published version
Investigating gated recurrent neural networks for speech synthesis
Recently, recurrent neural networks (RNNs) as powerful sequence models have
re-emerged as a potential acoustic model for statistical parametric speech
synthesis (SPSS). The long short-term memory (LSTM) architecture is
particularly attractive because it addresses the vanishing gradient problem in
standard RNNs, making such networks easier to train. Although recent studies have
demonstrated that LSTMs can achieve significantly better performance on SPSS
than deep feed-forward neural networks, little is known about why. Here we
attempt to answer two questions: a) why do LSTMs work well as a sequence model
for SPSS; b) which component (e.g., input gate, output gate, forget gate) is
most important. We present a visual analysis alongside a series of experiments, leading us to propose a simplified architecture. The simplified architecture has significantly fewer parameters than an LSTM, reducing generation complexity considerably without degrading quality. Comment: Accepted by ICASSP 201
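For reference, the gates this abstract analyzes appear in a standard LSTM cell as follows (a generic numpy sketch with toy dimensions, not the paper's simplified architecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell.

    W: (4H, D) input weights, U: (4H, H) recurrent weights, b: (4H,) bias,
    packed as [input gate, forget gate, cell candidate, output gate].
    """
    H = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[0:H])          # input gate: how much new content to write
    f = sigmoid(z[H:2 * H])      # forget gate: how much old state to keep
    g = np.tanh(z[2 * H:3 * H])  # candidate cell content
    o = sigmoid(z[3 * H:4 * H])  # output gate: how much state to expose
    c_new = f * c + i * g        # additive path that eases gradient flow
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# Toy usage with random weights.
rng = np.random.default_rng(0)
D, H = 3, 4
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.normal(size=D), h, c,
                 rng.normal(size=(4 * H, D)),
                 rng.normal(size=(4 * H, H)),
                 np.zeros(4 * H))
```

Questions a) and b) above amount to asking which of `i`, `f`, and `o` can be removed or fixed to a constant without hurting SPSS quality; the additive update of `c_new` is what mitigates the vanishing-gradient problem mentioned earlier.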