3,001 research outputs found

A modulation property of time-frequency derivatives of filtered phase and its application to aperiodicity and fo estimation

Author: Banno Hideki
Kawahara Hideki
Morise Masanori
Sakakibara Ken-Ichi
Toda Tomoki
Publication venue: 'International Speech Communication Association'
Publication date: 09/06/2017

We introduce a simple and linear SNR (strictly speaking, periodic-to-random power ratio) estimator (0 dB to 80 dB without additional calibration or linearization) for providing reliable descriptions of aperiodicity in speech corpora. The main idea of this method is to estimate the background random noise level without directly extracting the background noise. The proposed method is applicable to a wide variety of time windowing functions with very low sidelobe levels. The estimate combines the frequency derivative and the time-frequency derivative of the mapping from filter center frequency to the output instantaneous frequency. This procedure can replace the periodicity detection and aperiodicity estimation subsystems of the recently introduced open-source vocoder, YANG vocoder. Source code of a MATLAB implementation of this method will also be open sourced. Comment: 8 pages 9 figures, Submitted and accepted in Interspeech201
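As a rough illustration of the quantity the abstract calls the periodic-to-random power ratio, the sketch below computes that ratio in dB when both components are known separately. This is only the definition of the target quantity, not the estimator proposed in the paper (which works without extracting the noise). The function names, the 100 Hz sine, and the noise level are all assumptions chosen for the example.

```python
import math
import random

def mean_power(samples):
    """Average power (mean of squared samples) of a sequence."""
    return sum(s * s for s in samples) / len(samples)

def periodic_to_random_ratio_db(periodic, random_part):
    """Ratio of periodic power to random power, expressed in dB."""
    return 10.0 * math.log10(mean_power(periodic) / mean_power(random_part))

# Hypothetical test signal: a 100 Hz sine sampled at 8 kHz for one second,
# plus Gaussian noise. Both choices are illustrative assumptions.
sample_rate = 8000
periodic = [math.sin(2.0 * math.pi * 100.0 * n / sample_rate)
            for n in range(sample_rate)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 0.01) for _ in range(sample_rate)]

ratio_db = periodic_to_random_ratio_db(periodic, noise)
print(f"periodic-to-random power ratio: {ratio_db:.1f} dB")
```

Scaling the noise amplitude by a factor of 0.1 raises this ratio by 20 dB, which is the sense in which an estimator can be called linear on a dB scale.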

arXiv.org e-Print Archive

Crossref

Collapsed speech segment detection and suppression for WaveNet vocoder

Author: Hayashi Tomoki
Kobayashi Kazuhiro
Tobing Patrick Lumban
Toda Tomoki
Wu Yi-Chiao
Publication date: 09/08/2018

In this paper, we propose a technique to alleviate the quality degradation caused by collapsed speech segments sometimes generated by the WaveNet vocoder. The effectiveness of the WaveNet vocoder for generating natural speech from acoustic features has been proved in recent works. However, it sometimes generates very noisy speech with collapsed speech segments when only a limited amount of training data is available or significant acoustic mismatches exist between the training and testing data. Such a limitation on the corpus and limited ability of the model can easily occur in some speech generation applications, such as voice conversion and speech enhancement. To address this problem, we propose a technique to automatically detect collapsed speech segments. Moreover, to refine the detected segments, we also propose a waveform generation technique for WaveNet using a linear predictive coding constraint. Verification and subjective tests are conducted to investigate the effectiveness of the proposed techniques. The verification results indicate that the detection technique can detect most collapsed segments. The subjective evaluations of voice conversion demonstrate that the generation technique significantly improves the speech quality while maintaining the same speaker similarity. Comment: 5 pages, 6 figures. Proc. Interspeech, 201

arXiv.org e-Print Archive

Crossref

An introduction to statistical parametric speech synthesis

Author: King Simon
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/10/2011

Edinburgh Research Explorer

Perceptually smooth timbral guides by state-space analysis of phase-vocoder parameters

Author: Bristow-Johnson R.
David Cooper
Moorer J. A.
Nick Bailey
Pollard H. F.
Publication venue: 'MIT Press - Journals'
Publication date: 01/01/2000

Sculptor is a phase-vocoder-based package of programs that allows users to explore timbral manipulation of sound in real time. It is the product of a research program seeking ultimately to perform gestural capture by analysis of the sound a performer makes using a conventional instrument. Since the phase-vocoder output is of high dimensionality (typically more than 1,000 channels per analysis frame), mapping phase-vocoder output to appropriate input parameters for a synthesizer is only feasible in theory

CiteSeerX

Crossref

White Rose Research Online

You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation

Author: Andrusenko Andrei
Korostik Roman
Laptev Aleksandr
Medennikov Ivan
Rybin Sergey
Svischev Aleksey
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 30/07/2020

Data augmentation is one of the most effective ways to make end-to-end automatic speech recognition (ASR) perform close to the conventional hybrid approach, especially when dealing with low-resource tasks. Using recent advances in speech synthesis (text-to-speech, or TTS), we build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model. We argue that, when the amount of training data is relatively low, this approach can allow an end-to-end model to reach the quality of hybrid systems. For an artificial low-to-medium-resource setup, we compare the proposed augmentation with the semi-supervised learning technique. We also investigate the influence of vocoder usage on final ASR performance by comparing the Griffin-Lim algorithm with our modified LPCNet. When applied with an external language model, our approach outperforms a semi-supervised setup for LibriSpeech test-clean and is only 33% worse than a comparable supervised setup. Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with WER 4.3% for test-clean and 13.5% for test-other

arXiv.org e-Print Archive

Crossref

Effects of noise suppression and envelope dynamic range compression on the intelligibility of vocoded sentences for a tonal language

Author: Chen F.
Dingchang Zheng
Fei Chen
Kamath S.
Scalart P.
Watson C. S.
Yu Tsao
Publication venue: 'Acoustical Society of America (ASA)'
Publication date: 01/09/2017

Vocoder simulation studies have suggested that the carrier signal type employed affects the intelligibility of vocoded speech. The present work further assessed how carrier signal type interacts with additional signal processing, namely, single-channel noise suppression and envelope dynamic range compression, in determining the intelligibility of vocoder simulations. In Experiment 1, Mandarin sentences that had been corrupted by speech spectrum-shaped noise (SSN) or two-talker babble (2TB) were processed by one of four single-channel noise-suppression algorithms before undergoing tone-vocoded (TV) or noise-vocoded (NV) processing. In Experiment 2, dynamic ranges of multiband envelope waveforms were compressed by scaling of the mean-removed envelope waveforms with a compression factor before undergoing TV or NV processing. TV Mandarin sentences yielded higher intelligibility scores with normal-hearing (NH) listeners than did noise-vocoded sentences. The intelligibility advantage of noise-suppressed vocoded speech depended on the masker type (SSN vs 2TB). NV speech was more negatively influenced by envelope dynamic range compression than was TV speech. These findings suggest that an interactional effect exists between the carrier signal type employed in the vocoding process and envelope distortion caused by signal processing
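The envelope compression step described for Experiment 2 (scaling the mean-removed envelope by a compression factor) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `compress_envelope`, the sample values, and the choice to add the mean back after scaling are all hypothetical, since the abstract does not spell out those details.

```python
from statistics import fmean

def compress_envelope(envelope, factor):
    """Scale the mean-removed envelope by `factor`, then restore the mean.

    factor = 1.0 leaves the envelope unchanged, 0 < factor < 1 narrows its
    dynamic range, and factor = 0.0 flattens it to its mean.
    Restoring the mean afterwards is an assumption made for this sketch.
    """
    mean = fmean(envelope)
    return [mean + factor * (value - mean) for value in envelope]

# Hypothetical envelope samples for one frequency band (arbitrary linear units).
band_envelope = [0.2, 0.8, 0.5, 0.1, 0.9]
compressed = compress_envelope(band_envelope, 0.5)

print("original range:  ", max(band_envelope) - min(band_envelope))
print("compressed range:", max(compressed) - min(compressed))
```

In a multiband vocoder simulation, a function like this would be applied to each band's envelope before the envelope modulates its tone or noise carrier.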

Crossref

Anglia Ruskin Research

Coventry University Pure Portal