5 research outputs found

    SMART-In English: Learn English Using Speech Recognition

    English is an international language and important to learn, yet learners often find it difficult, especially in pronunciation. SMART-In English is a prototype Android app that uses speech recognition technology via the Cloud Speech API (Application Programming Interface). It can serve as an alternative way to learn English, particularly the pronunciation of individual words: using speech recognition, the app scores the user's recorded pronunciation, indicates a level for the pronunciation of the word, and displays the correct pronunciation.
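    The abstract does not describe how the score is computed; as a minimal sketch, one plausible approach is to compare the transcript returned by the recognizer against the target word. The `score_pronunciation` helper and the level thresholds below are hypothetical illustrations, not taken from the paper:

    ```python
    from difflib import SequenceMatcher

    def score_pronunciation(recognized: str, target: str) -> int:
        """Score 0-100: string similarity between what the recognizer
        heard and the word the learner was asked to say."""
        ratio = SequenceMatcher(None, recognized.lower(), target.lower()).ratio()
        return round(ratio * 100)

    def level(score: int) -> str:
        """Map a numeric score to a coarse level (hypothetical thresholds)."""
        if score >= 90:
            return "good"
        if score >= 70:
            return "fair"
        return "needs practice"

    # A mispronunciation that the recognizer hears as a different word
    # yields a reduced score.
    print(score_pronunciation("ship", "ship"), level(score_pronunciation("ship", "ship")))
    ```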

    VAW-GAN for Singing Voice Conversion with Non-parallel Training Data

    Singing voice conversion aims to convert a singer's voice from a source to a target without changing the singing content. Parallel training data is typically required to train a singing voice conversion system, which is however not practical in real-life applications. Recent encoder-decoder structures, such as the variational autoencoding Wasserstein generative adversarial network (VAW-GAN), provide an effective way to learn a mapping from non-parallel training data. In this paper, we propose a singing voice conversion framework based on VAW-GAN. We train an encoder to disentangle singer identity and singing prosody (F0 contour) from phonetic content. By conditioning on singer identity and F0, the decoder generates output spectral features with an unseen target singer identity and improves the F0 rendering. Experimental results show that the proposed framework achieves better performance than the baseline frameworks.
    Comment: Accepted to APSIPA ASC 202
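    The conditioning described here can be sketched as concatenating, frame by frame, the phonetic latent code with a singer-identity code and the F0 contour before decoding. This is only an illustration of the conditioning mechanism with hypothetical shapes, not the paper's actual network:

    ```python
    import numpy as np

    def decoder_input(z, singer_id, f0, n_singers):
        """Build the conditioned decoder input: latent phonetic code z (T, D),
        a one-hot singer code broadcast to every frame, and the F0 contour (T,)."""
        T = z.shape[0]
        one_hot = np.zeros((T, n_singers))
        one_hot[:, singer_id] = 1.0          # singer identity, repeated per frame
        return np.concatenate([z, one_hot, f0.reshape(T, 1)], axis=1)

    z = np.random.randn(100, 64)             # 100 frames of 64-dim latent features
    f0 = np.abs(np.random.randn(100)) * 200  # hypothetical F0 contour in Hz
    x = decoder_input(z, singer_id=2, n_singers=4, f0=f0)
    print(x.shape)  # (100, 69): 64 latent + 4 singer + 1 F0 dims
    ```

    Swapping the one-hot singer code at conversion time is what lets the decoder generate spectra with a target identity the encoder never saw paired with this content.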

    Non-Parallel Voice Conversion with Cyclic Variational Autoencoder

    In this paper, we present a novel technique for non-parallel voice conversion (VC) using cyclic variational autoencoder (CycleVAE)-based spectral modeling. In a variational autoencoder (VAE) framework, a latent space, usually with a Gaussian prior, is used to encode a set of input features. In VAE-based VC, the encoded latent features are fed into a decoder, along with speaker-coding features, to generate estimated spectra with either the original speaker identity (reconstructed) or another speaker identity (converted). Due to the non-parallel modeling condition, the converted spectra cannot be directly optimized, which heavily degrades the performance of VAE-based VC. In this work, to overcome this problem, we propose a CycleVAE-based spectral model that indirectly optimizes the conversion flow by recycling the converted features back into the system to obtain corresponding cyclic reconstructed spectra that can be directly optimized. The cyclic flow can be continued by using the cyclic reconstructed features as input for the next cycle. The experimental results demonstrate the effectiveness of the proposed CycleVAE-based VC, which yields higher accuracy of converted spectra, generates latent features with a higher degree of correlation, and significantly improves the quality and conversion accuracy of the converted speech.
    Comment: Accepted to INTERSPEECH 201
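    The cyclic flow above can be sketched with toy stand-ins for the networks: convert with a target speaker code, re-encode the converted features, decode back with the source code, and compare the cyclic reconstruction to the input. The linear maps below are purely illustrative; the real model uses neural encoder/decoder networks:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    W_enc = rng.standard_normal((8, 16))        # toy encoder: spectra (16) -> latent (8)
    W_dec = rng.standard_normal((8 + 2, 16))    # toy decoder: latent + 2-dim speaker code

    def encode(x):
        return x @ W_enc.T

    def decode(z, spk_code):
        cond = np.concatenate([z, np.tile(spk_code, (z.shape[0], 1))], axis=1)
        return cond @ W_dec

    src, tgt = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    x = rng.standard_normal((50, 16))            # source-speaker spectral features

    z = encode(x)
    converted = decode(z, tgt)                   # no parallel target -> no direct loss here
    z_cyc = encode(converted)                    # recycle converted features into the system
    cyclic_recon = decode(z_cyc, src)            # cyclic reconstruction of the source
    cyclic_loss = np.mean((cyclic_recon - x) ** 2)   # directly optimizable against x
    ```

    The key point is that `cyclic_loss` has a ground truth (`x`) even though the converted spectra do not, which is how the conversion flow gets optimized indirectly.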

    Low-Latency Real-Time Non-Parallel Voice Conversion based on Cyclic Variational Autoencoder and Multiband WaveRNN with Data-Driven Linear Prediction

    This paper presents a low-latency real-time (LLRT) non-parallel voice conversion (VC) framework based on a cyclic variational autoencoder (CycleVAE) and a multiband WaveRNN with data-driven linear prediction (MWDLP). CycleVAE is a robust non-parallel multispeaker spectral model that utilizes a speaker-independent latent space and a speaker-dependent code to generate reconstructed/converted spectral features given the spectral features of an input speaker. MWDLP, on the other hand, is an efficient, high-quality neural vocoder that can handle multispeaker data and generate speech waveforms for LLRT applications on a CPU. To accommodate the LLRT constraint on a CPU, we propose a novel CycleVAE framework that utilizes mel-spectrograms as spectral features and is built with a sparse network architecture. Further, to improve the modeling performance, we also propose a novel fine-tuning procedure that refines the frame-rate CycleVAE network by utilizing the waveform loss from the MWDLP network. The experimental results demonstrate that the proposed framework achieves high-performance VC while allowing for LLRT usage with a single core of a 2.1--2.7 GHz CPU at a real-time factor of 0.87--0.95, including input/output and feature extraction, with a frame shift of 10 ms, a window length of 27.5 ms, and 2 lookup frames.
    Comment: Accepted for SSW1
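    The real-time figures quoted above can be made concrete with a little arithmetic. The real-time factor (RTF) is processing time divided by audio duration, and the windowing/lookahead settings bound the algorithmic latency. The latency accounting below is a rough, hypothetical illustration; the paper's exact latency breakdown may differ:

    ```python
    def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
        """RTF < 1 means audio is processed faster than it plays back."""
        return processing_seconds / audio_seconds

    # Analysis settings from the abstract: 10 ms frame shift, 27.5 ms window,
    # 2 lookup (lookahead) frames.
    frame_shift_ms = 10.0
    window_ms = 27.5
    lookup_frames = 2

    # Rough algorithmic latency from windowing and lookahead alone:
    algorithmic_latency_ms = window_ms + lookup_frames * frame_shift_ms
    print(algorithmic_latency_ms)   # 47.5 ms

    # At the reported worst-case RTF of 0.95, processing one 10 ms frame takes:
    per_frame_processing_ms = 0.95 * frame_shift_ms
    print(per_frame_processing_ms)  # 9.5 ms, leaving headroom within each frame
    ```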

    An explainability study of the constant Q cepstral coefficient spoofing countermeasure for automatic speaker verification

    Anti-spoofing for automatic speaker verification is now a well-established area of research, with three competitive challenges having been held in the last 6 years. A great deal of research effort over this time has been invested in the development of front-end representations tailored to the spoofing detection task. One such approach, known as constant Q cepstral coefficients (CQCCs), has been shown to be especially effective in detecting attacks implemented with a unit-selection-based speech synthesis algorithm. Despite their success, CQCCs largely fail in detecting other forms of spoofing attack, where more traditional front-end representations give substantially better results. Similar differences were also observed in the most recent, 2019 edition of the ASVspoof challenge series. This paper reports our attempts to help explain these observations. The explanation is shown to lie in the level of attention paid by each front-end to different sub-band components of the spectrum. Thus far, surprisingly little has been learned about what artefacts are being detected by spoofing countermeasures. Our work hence aims to shed light upon signal- or spectrum-level artefacts that serve to distinguish different forms of spoofing attack from genuine, bona fide speech. With a better understanding of these artefacts, we will be better positioned to design more reliable countermeasures.
    Comment: Accepted to Speaker Odyssey (The Speaker and Language Recognition Workshop), 2020, 8 page
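    The sub-band behaviour mentioned here stems from the constant-Q transform underlying CQCCs: its bin centre frequencies are geometrically spaced, so low frequencies get fine resolution and high frequencies coarse resolution, unlike the linear spacing of a DFT. A small sketch of that spacing (the parameter values are illustrative, not the paper's exact CQCC configuration):

    ```python
    import numpy as np

    def constant_q_center_freqs(f_min: float, f_max: float, bins_per_octave: int):
        """Geometrically spaced centre frequencies f_k = f_min * 2**(k / B),
        the spacing underlying a constant-Q transform."""
        n_octaves = np.log2(f_max / f_min)
        k = np.arange(int(np.floor(n_octaves * bins_per_octave)) + 1)
        return f_min * 2.0 ** (k / bins_per_octave)

    freqs = constant_q_center_freqs(f_min=15.625, f_max=8000.0, bins_per_octave=12)
    # The ratio between neighbouring bins is constant (the "constant Q"), so the
    # front-end's attention is distributed across sub-bands very differently
    # from a linearly spaced DFT front-end.
    print(freqs[1] / freqs[0], freqs[-1] / freqs[-2])  # both ratios equal 2**(1/12)
    ```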