Synthetic speech detection using phase information
Taking advantage of the fact that most speech processing techniques neglect phase information, we seek to detect phase perturbations in order to prevent synthetic impostors from attacking Speaker Verification systems. Two Synthetic Speech Detection (SSD) systems that use spectral-phase-related information are reviewed and evaluated in this work: one based on the Modified Group Delay (MGD) and the other on the Relative Phase Shift (RPS). A classical module-based MFCC system is also used as the baseline. Different training strategies are proposed and evaluated using both real spoofing samples and signals copy-synthesized from natural speech, aiming to alleviate the difficulty of obtaining real data to train the systems. The recently published ASVspoof 2015 database is used for training and evaluation. Performance on completely unrelated data is also checked using synthetic speech from the Blizzard Challenge as evaluation material. The results show that phase information can be successfully used for the SSD task, even against unknown attacks. This work has been partially supported by the Basque Government (ElkarOla project, KK-2015/00098) and the Spanish Ministry of Economy and Competitiveness (Restore project, TEC2015-67163-C2-1-R).
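As a concrete illustration of the kind of phase feature involved, the following is a minimal sketch of a modified group delay computation on one windowed speech frame; the parameter values (alpha, gamma) and the floor-clipped magnitude spectrum used in place of a cepstrally smoothed envelope are simplifying assumptions, not the exact configuration of the paper.

```python
# Minimal sketch of a Modified Group Delay (MGD) feature extractor.
import numpy as np

def mgd_spectrum(frame, alpha=0.4, gamma=0.9, n_fft=512):
    """Modified group delay spectrum of one windowed speech frame."""
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)        # DFT of x[n]
    Y = np.fft.rfft(n * frame, n_fft)    # DFT of n * x[n]
    S = np.maximum(np.abs(X), 1e-8)      # stand-in for a smoothed spectrum
    tau = (X.real * Y.real + X.imag * Y.imag) / S ** (2 * gamma)
    return np.sign(tau) * np.abs(tau) ** alpha
```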
Improving trajectory modelling for DNN-based speech synthesis by using stacked bottleneck features and minimum generation error training
We propose two novel techniques, stacked bottleneck features and a minimum generation error (MGE) training criterion, to improve the performance of deep neural network (DNN)-based speech synthesis. The techniques address two related weaknesses of current typical DNN-based synthesis frameworks: frame-by-frame independence, and disregard for the relationship between static and dynamic features. Stacked bottleneck features, which are an acoustically informed linguistic representation, provide an efficient way to include more detailed linguistic context at the input. The MGE training criterion minimises the overall output trajectory error across an utterance, rather than minimising the error of each frame independently, and thus takes into account the interaction between static and dynamic features. The two techniques can easily be combined to further improve performance. We present both objective and subjective results that demonstrate the effectiveness of the proposed techniques. The subjective results show that combining the two techniques leads to significantly more natural synthetic speech than conventional DNN or long short-term memory recurrent neural network systems.
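To make the trajectory-level criterion concrete, here is a minimal numpy sketch of the MGE idea: recover the smooth trajectory implied by predicted static-plus-delta features (speech parameter generation) and measure the error there rather than per frame. The simple delta window and the unit covariances are assumptions made for brevity.

```python
# Minimal sketch of a minimum generation error (MGE) style loss.
import numpy as np

def delta_matrix(T):
    """Stack identity (static) and first-difference (delta) operators."""
    I = np.eye(T)
    D = np.zeros((T, T))
    for t in range(1, T - 1):
        D[t, t - 1], D[t, t + 1] = -0.5, 0.5
    return np.vstack([I, D])                 # W: (2T, T)

def mge_loss(pred_static_delta, natural_static):
    """Error between the MLPG trajectory and the natural static features.

    pred_static_delta: (2T, D) predicted static+delta parameters;
    natural_static:    (T, D) natural static parameters.
    """
    T = natural_static.shape[0]
    W = delta_matrix(T)
    # Parameter generation with unit covariance: c = (W'W)^-1 W' o
    c = np.linalg.solve(W.T @ W, W.T @ pred_static_delta)
    return np.mean((c - natural_static) ** 2)
```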
Sentence-level control vectors for deep neural network speech synthesis
This paper describes the use of a low-dimensional vector representation of sentence acoustics to control the output of a feed-forward deep neural network text-to-speech system on a sentence-by-sentence basis. Vector representations for the sentences in the training corpus are learned during network training along with the other parameters of the model. Although the network is trained on a frame-by-frame basis, the standard frame-level inputs representing linguistic features are supplemented by features from a projection layer, which outputs a learned representation of sentence-level acoustic characteristics. The projection layer contains dedicated parameters for each sentence in the training data, which are optimised jointly with the standard network weights. Sentence-specific parameters are optimised on all frames of the relevant sentence; these parameters therefore allow the network to account for sentence-level variation in the data which is not predictable from the standard linguistic inputs. Results show that the global prosodic characteristics of synthetic speech can be controlled simply and robustly at run time by supplementing the basic linguistic features with sentence-level control vectors which are novel but designed to be consistent with those observed in the training corpus.
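A minimal PyTorch sketch of the idea follows: an embedding table holds one learned vector per training sentence, and that vector is concatenated with the frame-level linguistic features at the network input. All layer sizes and the two-hidden-layer body are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of sentence-level control vectors in a feed-forward TTS DNN.
import torch
import torch.nn as nn

class ControlVectorTTS(nn.Module):
    def __init__(self, n_sentences, ling_dim=300, ctrl_dim=4,
                 hidden=1024, out_dim=187):
        super().__init__()
        # one dedicated control vector per training sentence,
        # optimised jointly with the ordinary network weights
        self.ctrl = nn.Embedding(n_sentences, ctrl_dim)
        self.body = nn.Sequential(
            nn.Linear(ling_dim + ctrl_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, out_dim))

    def forward(self, ling_feats, sentence_id):
        # ling_feats: (T, ling_dim); sentence_id: LongTensor of shape (1,)
        # the sentence vector is broadcast over all frames of the sentence
        c = self.ctrl(sentence_id).expand(ling_feats.size(0), -1)
        return self.body(torch.cat([ling_feats, c], dim=-1))
```

At run time, a novel control vector (chosen to lie within the range seen in training) can be fed in place of the learned embedding to steer the global prosody of the output.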
Towards minimum perceptual error training for DNN-based speech synthesis
We propose to use a perceptually oriented domain to improve the quality of speech generated by text-to-speech deep neural networks (DNNs). We train a DNN that predicts the parameters required for speech reconstruction, but whose cost function is calculated in another domain. In this paper, to represent this perceptual domain, we extract an approximated version of the Spectro-Temporal Excitation Pattern, originally proposed as part of a model of hearing speech in noise. We train DNNs that predict band aperiodicity, fundamental frequency and Mel cepstral coefficients, and compare the generated speech when the spectral cost function is defined in the Mel cepstral, warped log spectrum or perceptual domain. Objective results indicate that the perceptual-domain system achieves the highest quality.
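The following sketch illustrates the general mechanism of predicting in one domain while computing the cost in another: the network predicts cepstral coefficients, but the loss is taken after a fixed, differentiable linear map into an approximate log-spectral domain. The plain cosine-basis cepstrum-to-log-spectrum matrix here is a stand-in for the paper's perceptual-domain transform, not a reproduction of it.

```python
# Sketch of a spectral cost computed in a different domain from the
# one the network predicts in.
import math
import torch

def cep_to_logspec_matrix(order=60, n_freq=257):
    """Fixed linear map from cepstral coefficients to a log spectrum."""
    omega = torch.linspace(0, math.pi, n_freq)
    m = torch.arange(order, dtype=torch.float32)
    return torch.cos(torch.outer(omega, m))      # (n_freq, order)

M = cep_to_logspec_matrix()

def spectral_domain_loss(pred_cep, target_cep):
    # loss is measured after the fixed transform; gradients flow back
    # through M to the network that produced pred_cep
    return torch.mean((pred_cep @ M.T - target_cep @ M.T) ** 2)
```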
Multi-Scale Sub-Band Constant-Q Transform Discriminator for High-Fidelity Vocoder
Generative Adversarial Network (GAN) based vocoders are superior in inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator to promote GAN-based vocoders. Most existing time-frequency-representation-based discriminators are rooted in the Short-Time Fourier Transform (STFT), whose time-frequency resolution in a spectrogram is fixed, making it incompatible with signals like singing voices that require flexible attention across frequency bands. Motivated by this, our study utilizes the Constant-Q Transform (CQT), which has dynamic resolution across frequencies, contributing to better modeling of pitch accuracy and harmonic tracking. Specifically, we propose a Multi-Scale Sub-Band CQT (MS-SB-CQT) Discriminator, which operates on the CQT spectrogram at multiple scales and performs sub-band processing according to different octaves. Experiments conducted on both speech and singing voices confirm the effectiveness of the proposed method. Moreover, we verify that the CQT-based and STFT-based discriminators are complementary under joint training. Specifically, enhanced by the proposed MS-SB-CQT and the existing MS-STFT discriminators, the MOS of HiFi-GAN can be boosted from 3.27 to 3.87 for seen singers and from 3.40 to 3.78 for unseen singers.
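As a rough illustration of the sub-band design, the sketch below splits a precomputed CQT spectrogram into octave sub-bands of bins_per_octave rows each and gives each band its own small convolutional stack. Only a single scale is shown, and the channel counts and kernel sizes are illustrative assumptions rather than the paper's configuration.

```python
# Minimal sketch of a sub-band CQT discriminator (single scale).
import torch
import torch.nn as nn

class SubBandCQTDiscriminator(nn.Module):
    def __init__(self, n_octaves=7, bins_per_octave=12, ch=32):
        super().__init__()
        self.bpo = bins_per_octave
        # one small conv stack per octave sub-band
        self.bands = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(2, ch, (3, 9), padding=(1, 4)),
                nn.LeakyReLU(0.2),
                nn.Conv2d(ch, ch, (3, 9), stride=(1, 2), padding=(1, 4)),
                nn.LeakyReLU(0.2))
            for _ in range(n_octaves))
        self.post = nn.Conv2d(ch * n_octaves, 1, (3, 3), padding=1)

    def forward(self, cqt):
        # cqt: (B, 2, F, T) real/imag CQT with F = n_octaves * bins_per_octave
        outs = [band(cqt[:, :, i * self.bpo:(i + 1) * self.bpo])
                for i, band in enumerate(self.bands)]
        return self.post(torch.cat(outs, dim=1))   # patch-level logits
```

A multi-scale version would run several such discriminators on CQTs computed with different hop lengths and sum their adversarial losses.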
PIAVE: A Pose-Invariant Audio-Visual Speaker Extraction Network
In everyday spoken communication, it is common to look at the turning head of a talker while listening to his or her voice. Humans watch the talker to listen better, and so do machines. However, previous studies on audio-visual speaker extraction have not effectively handled a varying talking face. This paper studies how to take full advantage of the varying talking face. We propose a Pose-Invariant Audio-Visual Speaker Extraction Network (PIAVE) that incorporates an additional pose-invariant view to improve audio-visual speaker extraction. Specifically, we generate the pose-invariant view from each original pose orientation, which enables the model to receive a consistent frontal view of the talker regardless of his or her head pose, thereby forming a multi-view visual input for the speaker. Experiments on the multi-view MEAD and in-the-wild LRS3 datasets demonstrate that PIAVE outperforms the state of the art and is more robust to pose variations.
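A minimal sketch of the multi-view conditioning idea: per-frame visual embeddings from the original view and from a generated pose-invariant (frontal) view are fused and used to mask the mixture audio features. All module sizes and the simple concatenation fusion are assumptions; the frontalisation network itself is treated as a black box that supplies the second view.

```python
# Minimal sketch of multi-view visual conditioning for speaker extraction.
import torch
import torch.nn as nn

class MultiViewExtractor(nn.Module):
    def __init__(self, a_dim=256, v_dim=512):
        super().__init__()
        self.fuse = nn.Linear(2 * v_dim, a_dim)   # fuse the two views
        self.mask = nn.Sequential(
            nn.Linear(2 * a_dim, a_dim), nn.ReLU(),
            nn.Linear(a_dim, a_dim), nn.Sigmoid())

    def forward(self, mix_emb, v_orig, v_frontal):
        # mix_emb: (B, T, a_dim) mixture-audio features;
        # v_orig / v_frontal: (B, T, v_dim) per-frame visual embeddings
        v = self.fuse(torch.cat([v_orig, v_frontal], dim=-1))
        m = self.mask(torch.cat([mix_emb, v], dim=-1))
        return m * mix_emb                # masked target-speaker features
```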
Spoofing and Anti-Spoofing: A Shared View of Speaker Verification, Speech Synthesis and Voice Conversion
Automatic speaker verification (ASV) offers a low-cost and flexible biometric solution to person authentication. While the reliability of ASV systems is now considered sufficient to support mass-market adoption, there are concerns that the technology is vulnerable to spoofing, also referred to as presentation attacks. Spoofing refers to an attack whereby a fraudster attempts to manipulate a biometric system by masquerading as another, enrolled person. Speaker-adaptive speech synthesis and voice conversion techniques can mimic a target speaker's voice automatically, and hence present a genuine threat to ASV systems. The research community has responded to speech synthesis and voice conversion spoofing attacks with dedicated countermeasures which aim to detect and deflect such attacks. Even though the literature shows that these can be effective, the problem is far from solved; ASV systems remain vulnerable to spoofing, and a deeper understanding of speaker verification, speech synthesis and voice conversion will be fundamental to the pursuit of spoofing-robust speaker verification. Although the vulnerabilities of ASV to spoofing are now well acknowledged and the level of interest is growing, the effort to develop spoofing countermeasures for ASV lags behind that for other biometric modalities. A tutorial on spoofing and anti-spoofing from the combined perspective of speaker verification, speech synthesis and voice conversion is therefore much needed. The tutorial will attract not only members of the growing anti-spoofing research community, but also the broader community of practitioners in speaker verification, speech synthesis and voice conversion. The speakers have led the research community in anti-spoofing for ASV since 2013, have jointly authored a growing number of conference papers and book chapters as well as the latest survey paper published in Speech Communication in 2015, and between them have organised two special sessions and one evaluation/challenge (http://www.spoofingchallenge.org/) on the same topic. The experience gained through these activities is the foundation of this tutorial proposal for APSIPA ASC 2015.
Youla-Kucera parameterized adaptive tracking control for optical data storage systems
In next-generation optical data storage systems, the tolerance on the tracking error will become even smaller under various unknown working conditions. However, unknown external disturbances caused by vibrations make it difficult to maintain the desired tracking precision during normal disk operation. This paper proposes an adaptive regulation approach to keep the tracking error below its desired value despite these unknown disturbances. The regulator is designed by augmenting a base controller into a Youla-Kucera (Q) parameterized set of stabilizing controllers, so that both deterministic and random disturbances can be dealt with properly. An adaptive algorithm is developed to search for the Q parameter that satisfies the Internal Model Principle, so that exact regulation against the unknown deterministic disturbance can be achieved. The performance of the proposed control approach is evaluated with experimental results that illustrate the capability of the adaptive regulator to attenuate the unknown disturbances and achieve the desired tracking precision.
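The sketch below illustrates the adaptive Youla-Kucera idea on a toy first-order plant: a fixed base controller is augmented with an FIR Q filter acting on the tracking error, and the Q taps are adapted with an LMS-style gradient step so that the loop builds an internal model of an unknown sinusoidal disturbance. The plant, the proportional base controller and the update rule are all illustrative assumptions, not the servo model or the exact algorithm of the paper.

```python
# Toy demonstration of adaptive Q-parameterized disturbance rejection.
import numpy as np

def adaptive_yk_demo(n_steps=5000, n_taps=16, mu=5e-4):
    a, b = 0.9, 1.0                   # toy stable plant: y+ = a*y + b*u + d
    theta = np.zeros(n_taps)          # adaptive FIR Q parameter
    phi = np.zeros(n_taps)            # regressor of recent tracking errors
    y, errors = 0.0, []
    for k in range(n_steps):
        d = 0.5 * np.sin(2 * np.pi * 0.01 * k)  # unknown periodic disturbance
        e = 0.0 - y                             # regulate the output to zero
        phi = np.roll(phi, 1)
        phi[0] = e
        u = 0.5 * e + theta @ phi               # base action + Q correction
        theta += mu * e * phi                   # gradient step on squared error
        y = a * y + b * u + d
        errors.append(e)
    return np.array(errors)

# after adaptation, the residual error should be much smaller than the
# disturbance amplitude
err = adaptive_yk_demo()
print(np.abs(err[:200]).mean(), np.abs(err[-200:]).mean())
```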
