SMART-In English: Learn English Using Speech Recognition
English is an international language and important to learn. For learners of English, pronunciation in particular is often a difficulty. SMART-In is therefore a prototype Android app that uses speech recognition technology, utilizing services from the Cloud Speech API (Application Programming Interface). SMART-In English can be used as an alternative way to learn English, especially the pronunciation of words. Using speech recognition, the app can record the user's spoken pronunciation, display a score for it, show the pronunciation level of the word, and display the correct pronunciation.
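A minimal sketch of the scoring step described above, assuming the Cloud Speech transcript for the recorded audio is already available; `pronunciation_score`, `pronunciation_level`, and the level thresholds are hypothetical illustrations, not the app's actual logic.

```python
# Hypothetical pronunciation-scoring helper. The real app sends recorded audio
# to the Cloud Speech API; here the recognized transcript is passed in directly.
from difflib import SequenceMatcher

def pronunciation_score(target_word: str, recognized_word: str) -> int:
    """Return a 0-100 similarity score between the target word and
    what the speech recognizer heard."""
    ratio = SequenceMatcher(None, target_word.lower(),
                            recognized_word.lower()).ratio()
    return round(ratio * 100)

def pronunciation_level(score: int) -> str:
    """Map a score to a coarse level, as the app displays (thresholds assumed)."""
    if score >= 90:
        return "excellent"
    if score >= 70:
        return "good"
    return "try again"
```

A mispronunciation that the recognizer hears as a different word then yields a score below 100 and a correspondingly lower level.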
VAW-GAN for Singing Voice Conversion with Non-parallel Training Data
Singing voice conversion aims to convert a singer's voice from source to target
without changing the singing content. Parallel training data is typically
required to train a singing voice conversion system, which is, however, not
practical in real-life applications. Recent encoder-decoder structures, such as
variational autoencoding Wasserstein generative adversarial network (VAW-GAN),
provide an effective way to learn a mapping through non-parallel training data.
In this paper, we propose a singing voice conversion framework that is based on
VAW-GAN. We train an encoder to disentangle singer identity and singing prosody
(F0 contour) from phonetic content. By conditioning on singer identity and F0,
the decoder generates output spectral features with unseen target singer
identity, and improves the F0 rendering. Experimental results show that the
proposed framework achieves better performance than the baseline frameworks.
Comment: Accepted to APSIPA ASC 202
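The conditioning step above can be illustrated with a small sketch (not the paper's architecture): per-frame content latents from the encoder are concatenated with a one-hot singer code and the F0 contour before decoding, so swapping the singer code at conversion time changes the output identity.

```python
# Illustrative sketch of decoder conditioning; dimensions and names are assumed.
import numpy as np

def build_decoder_input(z_content, singer_id, n_singers, f0):
    """Concatenate per-frame content latents with the conditioning variables."""
    n_frames = z_content.shape[0]
    one_hot = np.zeros((n_frames, n_singers))
    one_hot[:, singer_id] = 1.0          # broadcast the singer code to all frames
    f0 = f0.reshape(n_frames, 1)         # one F0 value per frame
    return np.concatenate([z_content, one_hot, f0], axis=1)

z = np.random.randn(100, 16)        # 100 frames, 16-dim content latent
f0 = np.abs(np.random.randn(100))   # toy F0 contour
x = build_decoder_input(z, singer_id=2, n_singers=4, f0=f0)
print(x.shape)  # (100, 21) = 16 + 4 + 1
```

Because the content latent carries no identity information after disentanglement, choosing an unseen target `singer_id` at this step is what realizes the conversion.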
Non-Parallel Voice Conversion with Cyclic Variational Autoencoder
In this paper, we present a novel technique for a non-parallel voice
conversion (VC) with the use of cyclic variational autoencoder (CycleVAE)-based
spectral modeling. In a variational autoencoder (VAE) framework, a latent space,
usually with a Gaussian prior, is used to encode a set of input features. In a
VAE-based VC, the encoded latent features are fed into a decoder, along with
speaker-coding features, to generate estimated spectra with either the original
speaker identity (reconstructed) or another speaker identity (converted). Due
to the non-parallel modeling condition, the converted spectra cannot be
directly optimized, which heavily degrades the performance of a VAE-based VC.
In this work, to overcome this problem, we propose a CycleVAE-based
spectral model that indirectly optimizes the conversion flow by recycling the
converted features back into the system to obtain corresponding cyclic
reconstructed spectra that can be directly optimized. The cyclic flow can be
continued by using the cyclic reconstructed features as input for the next
cycle. The experimental results demonstrate the effectiveness of the proposed
CycleVAE-based VC, which yields higher accuracy of converted spectra, generates
latent features with higher correlation degree, and significantly improves the
quality and conversion accuracy of the converted speech.
Comment: Accepted to INTERSPEECH 201
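A toy sketch of the cyclic flow described above, with linear stand-ins for the real encoder and decoder networks that the paper trains; `enc`, `dec`, the speaker codes, and all dimensions here are hypothetical.

```python
# Toy CycleVAE conversion cycle: convert, then recycle the converted features
# back through the system so a directly optimizable reconstruction loss exists.
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.standard_normal((24, 8))       # spectra (24-dim) -> latent (8-dim)
W_dec = rng.standard_normal((8 + 2, 24))   # latent + 2-dim speaker code -> spectra

def enc(x):
    return x @ W_enc

def dec(z, spk_code):
    cond = np.concatenate([z, np.tile(spk_code, (z.shape[0], 1))], axis=1)
    return cond @ W_dec

src_code, trg_code = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_src = rng.standard_normal((50, 24))   # source-speaker spectra (50 frames)

x_conv = dec(enc(x_src), trg_code)      # converted: no parallel target to compare to
x_cyc = dec(enc(x_conv), src_code)      # cyclic reconstruction: back to source identity
cyclic_loss = np.mean((x_cyc - x_src) ** 2)  # directly optimizable against the input
```

The cycle can be continued by feeding `x_cyc` back in as the input for the next cycle, as the abstract notes.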
Low-Latency Real-Time Non-Parallel Voice Conversion based on Cyclic Variational Autoencoder and Multiband WaveRNN with Data-Driven Linear Prediction
This paper presents a low-latency real-time (LLRT) non-parallel voice
conversion (VC) framework based on cyclic variational autoencoder (CycleVAE)
and multiband WaveRNN with data-driven linear prediction (MWDLP). CycleVAE is a
robust non-parallel multispeaker spectral model, which utilizes a
speaker-independent latent space and a speaker-dependent code to generate
reconstructed/converted spectral features given the spectral features of an
input speaker. MWDLP, on the other hand, is an efficient, high-quality
neural vocoder that can handle multispeaker data and generate speech waveforms
for LLRT applications on a CPU. To accommodate the LLRT constraint on a CPU, we
propose a novel CycleVAE framework that utilizes mel-spectrogram as spectral
features and is built with a sparse network architecture. Further, to improve
the modeling performance, we also propose a novel fine-tuning procedure that
refines the frame-rate CycleVAE network by utilizing the waveform loss from the
MWDLP network. The experimental results demonstrate that the proposed framework
achieves high-performance VC while allowing for LLRT usage with a single core
of a -- GHz CPU at a real-time factor of --, including input/output and
feature extraction, with a frame shift of -- ms, a window length of -- ms, and
-- lookup frames.
Comment: Accepted for SSW1
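The paper's concrete numbers are elided above, but the real-time factor (RTF) relation it reports is standard: processing time divided by the duration of audio processed, with RTF < 1 meaning faster than real time. The helper below is a generic illustration, and `frame_latency_ms` is a simplified, hypothetical estimate (it ignores the window-length contribution).

```python
# Real-time factor as used to assess low-latency real-time (LLRT) feasibility.
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means the system runs faster than real time."""
    return processing_seconds / audio_seconds

def frame_latency_ms(frame_shift_ms: float, lookup_frames: int) -> float:
    """Simplified algorithmic latency: the current frame plus any future
    (lookup) frames the model must wait for before emitting output."""
    return frame_shift_ms * (1 + lookup_frames)

print(real_time_factor(0.5, 1.0))   # 0.5 -> twice as fast as real time
```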
An explainability study of the constant Q cepstral coefficient spoofing countermeasure for automatic speaker verification
Anti-spoofing for automatic speaker verification is now a well established
area of research, with three competitive challenges having been held in the
last 6 years. A great deal of research effort over this time has been invested
into the development of front-end representations tailored to the spoofing
detection task. One such approach, known as constant Q cepstral coefficients
(CQCCs), has been shown to be especially effective in detecting attacks
implemented with a unit selection based speech synthesis algorithm. Despite
their success, they largely fail in detecting other forms of spoofing attack
where more traditional front-end representations give substantially better
results. Similar differences were also observed in the most recent, 2019
edition of the ASVspoof challenge series. This paper reports our attempts to
help explain these observations. The explanation is shown to lie in the level
of attention paid by each front-end to different sub-band components of the
spectrum. Thus far, surprisingly little has been learned about what artefacts
are being detected by spoofing countermeasures. Our work hence aims to shed
light upon signal or spectrum level artefacts that serve to distinguish
different forms of spoofing attack from genuine, bona fide speech. With a
better understanding of these artefacts we will be better positioned to design
more reliable countermeasures.
Comment: Accepted to Speaker Odyssey (The Speaker and Language Recognition
Workshop), 2020, 8 pages
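The sub-band emphasis discussed above stems from the constant Q front-end: its centre frequencies are geometrically spaced (a constant ratio of centre frequency to bandwidth), unlike the linear spacing of a conventional DFT. A short sketch of that spacing, with assumed parameter values:

```python
# Geometrically spaced constant-Q centre frequencies: f_k = f_min * 2^(k / B),
# giving dense resolution at low frequencies and coarse at high ones.
def cqt_center_frequencies(f_min: float, n_bins: int, bins_per_octave: int):
    """Each octave receives the same number of bins (bins_per_octave)."""
    return [f_min * 2 ** (k / bins_per_octave) for k in range(n_bins)]

# Example (values assumed): 13 bins spanning one octave from 32.7 Hz.
freqs = cqt_center_frequencies(f_min=32.7, n_bins=13, bins_per_octave=12)
# the 13th bin lands exactly one octave up, at 2 * f_min = 65.4 Hz
```

This non-uniform frequency resolution is one plausible reason a CQCC front-end attends to different sub-band components of the spectrum than a linearly spaced front-end does.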