A modulation property of time-frequency derivatives of filtered phase and its application to aperiodicity and fo estimation
We introduce a simple and linear SNR (strictly speaking, periodic to random
power ratio) estimator (0dB to 80dB without additional
calibration/linearization) for providing reliable descriptions of aperiodicity
in speech corpora. The main idea of this method is to estimate the background
random noise level without directly extracting the background noise. The
proposed method is applicable to a wide variety of time windowing functions
with very low sidelobe levels. The estimate combines the frequency derivative
and the time-frequency derivative of the mapping from filter center frequency
to the output instantaneous frequency. This procedure can replace the
periodicity detection and aperiodicity estimation subsystems of the recently
introduced open-source vocoder, YANG vocoder. Source code of a MATLAB
implementation of this method will also be open sourced. Comment: 8 pages, 9 figures. Submitted and accepted in Interspeech201
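To make the quantity being estimated concrete, the following minimal sketch converts a periodic-to-random power ratio to decibels and back to the fraction of total power that is random. It is not the estimator described in the abstract, which avoids extracting the noise directly; the function names are hypothetical and the powers are assumed to be already known.

```python
import math


def power_ratio_db(periodic_power, random_power):
    """Return the periodic-to-random power ratio in dB: 10*log10(P/R)."""
    if periodic_power <= 0 or random_power <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(periodic_power / random_power)


def aperiodic_fraction(ratio_db):
    """Return the share of total power that is random for a ratio given in dB."""
    ratio = 10.0 ** (ratio_db / 10.0)  # back to a linear power ratio
    return 1.0 / (1.0 + ratio)
```

At 0 dB the periodic and random components carry equal power, so half of the total power is aperiodic; at the upper end of the stated 0 dB to 80 dB range the random share is vanishingly small.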
Collapsed speech segment detection and suppression for WaveNet vocoder
In this paper, we propose a technique to alleviate the quality degradation
caused by collapsed speech segments sometimes generated by the WaveNet vocoder.
The effectiveness of the WaveNet vocoder for generating natural speech from
acoustic features has been proved in recent works. However, it sometimes
generates very noisy speech with collapsed speech segments when only a limited
amount of training data is available or significant acoustic mismatches exist
between the training and testing data. Such limitations of the corpus and of
the model's ability can easily arise in some speech generation
applications, such as voice conversion and speech enhancement. To address this
problem, we propose a technique to automatically detect collapsed speech
segments. Moreover, to refine the detected segments, we also propose a waveform
generation technique for WaveNet using a linear predictive coding constraint.
Verification and subjective tests are conducted to investigate the
effectiveness of the proposed techniques. The verification results indicate
that the detection technique can detect most collapsed segments. The subjective
evaluations of voice conversion demonstrate that the generation technique
significantly improves the speech quality while maintaining the same speaker
similarity. Comment: 5 pages, 6 figures. Proc. Interspeech, 201
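The abstract does not say how the linear predictive coding constraint is applied inside WaveNet, so the sketch below covers only the standard building block it relies on: estimating LPC coefficients from a frame with the autocorrelation method and the Levinson-Durbin recursion. The function names are hypothetical, and nothing here is taken from the authors' implementation.

```python
def autocorrelation(x, max_lag):
    """Return the autocorrelation of x for lags 0..max_lag."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc_coefficients(x, order):
    """Levinson-Durbin recursion.

    Returns (a, err), where a[0] == 1 and a[1:] are the coefficients of the
    prediction-error filter A(z) = 1 + a[1] z^-1 + ... + a[order] z^-order,
    and err is the residual prediction-error energy.
    """
    r = autocorrelation(x, order)
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # silent or perfectly predictable frame; stop early
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this order
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k
    return a, err
```

With this sign convention the predicted sample is the negated weighted sum of past samples, so a signal that halves at every step yields a first coefficient close to -0.5.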
Perceptually smooth timbral guides by state-space analysis of phase-vocoder parameters
Sculptor is a phase-vocoder-based package of programs
that allows users to explore timbral manipulation
of sound in real time. It is the product
of a research program seeking ultimately to perform
gestural capture by analysis of the sound a
performer makes using a conventional instrument.
Since the phase-vocoder output is of high dimensionality
(typically more than 1,000 channels per
analysis frame), mapping phase-vocoder output to
appropriate input parameters for a synthesizer is
only feasible in theory
You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation
Data augmentation is one of the most effective ways to make end-to-end
automatic speech recognition (ASR) perform close to the conventional hybrid
approach, especially when dealing with low-resource tasks. Using recent
advances in speech synthesis (text-to-speech, or TTS), we build our TTS system
on an ASR training database and then extend the data with synthesized speech to
train a recognition model. We argue that, when the training data amount is
relatively low, this approach can allow an end-to-end model to reach hybrid
systems' quality. For an artificial low-to-medium-resource setup, we compare
the proposed augmentation with the semi-supervised learning technique. We also
investigate the influence of vocoder usage on final ASR performance by
comparing the Griffin-Lim algorithm with our modified LPCNet. When applied with an
external language model, our approach outperforms a semi-supervised setup for
LibriSpeech test-clean and is only 33% worse than a comparable supervised setup.
Our system establishes a competitive result for end-to-end ASR trained on
the LibriSpeech train-clean-100 set, with a WER of 4.3% for test-clean and 13.5% for
test-other
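For readers unfamiliar with the metric, word error rate (WER) is the word-level edit distance between a reference transcript and a recognizer hypothesis, divided by the number of reference words. The sketch below computes it with a standard dynamic program. The helper `relative_degradation` assumes that "33% worse" refers to a relative WER increase, which the abstract does not state explicitly; both function names are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_degradation(wer, baseline_wer):
    """Relative WER increase over a baseline (assumed reading of 'N% worse')."""
    return (wer - baseline_wer) / baseline_wer
```

Dropping one word from a six-word reference gives a WER of one sixth; note that WER can exceed 1.0 when the hypothesis contains many insertions.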
Effects of noise suppression and envelope dynamic range compression on the intelligibility of vocoded sentences for a tonal language
Vocoder simulation studies have suggested that the carrier signal type employed affects the intelligibility of vocoded speech. The present work further assessed how carrier signal type interacts with additional signal processing, namely, single-channel noise suppression and envelope dynamic range compression, in determining the intelligibility of vocoder simulations. In Experiment 1, Mandarin sentences that had been corrupted by speech spectrum-shaped noise (SSN) or two-talker babble (2TB) were processed by one of four single-channel noise-suppression algorithms before undergoing tone-vocoded (TV) or noise-vocoded (NV) processing. In Experiment 2, dynamic ranges of multiband envelope waveforms were compressed by scaling of the mean-removed envelope waveforms with a compression factor before undergoing TV or NV processing. TV Mandarin sentences yielded higher intelligibility scores with normal-hearing (NH) listeners than did noise-vocoded sentences. The intelligibility advantage of noise-suppressed vocoded speech depended on the masker type (SSN vs 2TB). NV speech was more negatively influenced by envelope dynamic range compression than was TV speech. These findings suggest that an interactional effect exists between the carrier signal type employed in the vocoding process and envelope distortion caused by signal processing
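The envelope manipulation in Experiment 2 can be sketched as follows, under the assumption that the mean is removed, the remainder is scaled by the compression factor, and the mean is then added back. The abstract states only the scaling of the mean-removed envelope, so restoring the mean, the function name, and the list-based representation are all assumptions made for illustration.

```python
def compress_envelope(envelope, factor):
    """Scale the mean-removed envelope by `factor`, then restore the mean.

    A factor of 1.0 leaves the envelope unchanged; smaller positive factors
    narrow the fluctuation around the mean (a smaller dynamic range).
    """
    if not envelope:
        raise ValueError("envelope must not be empty")
    mean = sum(envelope) / len(envelope)
    return [mean + factor * (sample - mean) for sample in envelope]
```

For example, `compress_envelope([0.0, 2.0, 4.0], 0.5)` halves the excursion around the mean of 2.0. In a multiband vocoder simulation this would be applied to each band's envelope separately before the tone or noise carrier is modulated.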