Singing voice correction using canonical time warping
Expressive singing voice correction is an appealing but challenging problem.
A robust time-warping algorithm which synchronizes two singing recordings can
provide a promising solution. We thereby propose to address the problem by
canonical time warping (CTW) which aligns amateur singing recordings to
professional ones. A new pitch contour is generated given the alignment
information, and a pitch-corrected singing is synthesized back through the
vocoder. The objective evaluation shows that CTW is robust against
pitch-shifting and time-stretching effects, and the subjective test
demonstrates that CTW outperforms the other methods, including DTW and
commercial auto-tuning software. Finally, we demonstrate the applicability of
the proposed method in a practical, real-world scenario.
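The alignment step can be illustrated with a plain DTW baseline, the method CTW is compared against. This is a minimal sketch, not the authors' CTW implementation (CTW additionally learns feature projections into a shared space before warping); function names and the frame-wise pitch-copy rule are illustrative.

```python
import numpy as np

def dtw_path(x, y):
    """Plain dynamic time warping between two 1-D sequences.
    Returns the accumulated cost and the optimal alignment path."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from (n, m) to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return D[n, m], path

def warp_pitch(amateur_f0, pro_f0):
    """Generate a corrected pitch contour: each amateur frame takes the
    professional pitch value it is aligned to."""
    _, path = dtw_path(amateur_f0, pro_f0)
    corrected = np.copy(amateur_f0)
    for i, j in path:
        corrected[i] = pro_f0[j]
    return corrected
```

The corrected contour keeps the amateur recording's timing while adopting the professional pitch values, which is what the vocoder then resynthesizes.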
A flexible speech feature converter based on an enhanced architecture of U-net
In order to analyze speech or audio, many methods are applied to transform time domain signals into various features, such as mel spectral features and WORLD vocoder features. Both types of features can be extracted from speech or used to synthesize speech. On the other hand, certain applications call for conversion between different types of features. To convert mel spectral features to WORLD vocoder features, one possible method is to first synthesize a time domain signal from the mel spectrogram and then extract features with the WORLD vocoder. The goal of this project is to develop a direct way to achieve this transformation, i.e., to convert the mel spectrogram output of a text-to-speech (TTS) system to WORLD vocoder features. In this project, a feature converter is designed to accomplish this aim. The converter has an enhanced neural network architecture based on the U-net. In our design, in addition to the basic U-net architecture, a Res Path composed of residual blocks and linear transformations is added on the skip connections. Our flexible system can perform feature conversion directly at the feature level without processing in the time domain. In addition to converting mel spectrograms to WORLD features, the reverse transformation from WORLD features to mel spectrograms is also attainable with a few adjustments. The transformed features achieve good performance in objective metrics, and the converter generalizes well to different speakers, so it can be applied to produce high quality speech via vocoder resynthesis.
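The Res Path idea on the skip connection can be sketched in a few lines. This is a minimal numpy illustration under stated assumptions, not the authors' exact architecture: each residual block is taken to be a linear transform with a ReLU, added back to its input via an identity shortcut; the weight shapes and block count are illustrative.

```python
import numpy as np

def res_path(skip, weights, biases):
    """Illustrative Res Path on a U-net skip connection: a chain of
    residual blocks, each adding ReLU(h @ W + b) back onto its input.
    weights[k] is a (d, d) matrix, biases[k] a (d,) vector."""
    h = skip
    for W, b in zip(weights, biases):
        h = h + np.maximum(0.0, h @ W + b)  # residual block with identity shortcut
    return h
```

In a U-net, the output of `res_path` would be concatenated with the corresponding decoder features instead of passing the raw encoder features straight across, which is the refinement the abstract describes.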
Automatic assignment of reference points for voice morphing
An automatic reference point placement method for voice morphing is reported in this paper. Voice morphing is a fundamental voice editing method that blends the feature vector sequences of two voices based on corresponding reference points. Reference points are usually assigned by hand, and the quality of the voice morphing output depends on them. Moreover, assigning reference points is a time-consuming task. The proposed method automatically assigns reference points on the spectrogram in both the time and frequency domains, based on temporal decomposition (TD) and line spectral frequencies (LSF). In a two-speaker voice morphing experiment, the proposed method worked well using only the voices and their transcriptions as inputs.
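The time-axis half of this task can be illustrated with a simplified stand-in: picking candidate reference frames where the spectrum changes abruptly. Note this is not the paper's TD/LSF method, just a hedged sketch of the underlying idea; the spectral-flux criterion and the threshold are assumptions for illustration.

```python
import numpy as np

def spectral_flux_anchors(spec, threshold=1.0):
    """Pick candidate time-axis reference frames as local maxima of
    spectral flux (frame-to-frame spectral change). spec is a
    (frames, bins) magnitude spectrogram. Simplified stand-in for the
    TD event functions used in the paper."""
    diff = np.diff(spec, axis=0)
    flux = np.sum(np.maximum(diff, 0.0), axis=1)  # half-wave rectified flux
    anchors = []
    for t in range(1, len(flux) - 1):
        if flux[t] > flux[t - 1] and flux[t] >= flux[t + 1] and flux[t] > threshold:
            anchors.append(t + 1)  # +1: np.diff shifts indices by one frame
    return anchors
```

Matching such anchors between two speakers' spectrograms would then yield the reference point pairs that manual annotation normally provides.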
Speech de-identification with deep neural networks
Cloud-based speech services are powerful practical tools, but exposing speech to the Internet raises important legal concerns about the speakers' privacy. We propose a deep neural network solution that removes personal characteristics from human speech by converting it to the voice of a Text-to-Speech (TTS) system before sending the utterance to the cloud. The network learns to transcode sequences of vocoder parameters, delta and delta-delta features of human speech to those of the TTS engine. We evaluated several TTS systems, vocoders and audio alignment techniques. We measured the performance of our method by (i) comparing the result of speech recognition on the de-identified utterances with the original texts, (ii) computing the Mel-Cepstral Distortion of the aligned TTS and the transcoded sequences, and (iii) questioning human participants in A-not-B, 2AFC and 6AFC tasks. Our approach achieves the de-identification level required by diverse applications.
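The delta and delta-delta features mentioned above are a standard dynamic-feature computation. A minimal sketch of the usual HTK-style regression formula follows; the window width is illustrative and the paper may use different settings.

```python
import numpy as np

def delta(features, width=2):
    """HTK-style delta regression over a window of +/- width frames.
    features is (frames, dims); edges are padded by repetition.
    delta[t] = sum_k k * (f[t+k] - f[t-k]) / (2 * sum_k k^2)."""
    pad = np.pad(features, ((width, width), (0, 0)), mode="edge")
    num = np.zeros_like(features, dtype=float)
    for k in range(1, width + 1):
        num += k * (pad[width + k : width + k + len(features)]
                    - pad[width - k : width - k + len(features)])
    denom = 2.0 * sum(k * k for k in range(1, width + 1))
    return num / denom

feats = np.arange(10.0).reshape(-1, 1)  # a linearly increasing parameter track
d = delta(feats)    # interior slope is 1 per frame
dd = delta(d)       # delta-delta is 0 in the interior for a linear track
```

Stacking `features`, `d`, and `dd` along the feature axis gives the network input described in the abstract.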
Collapsed speech segment detection and suppression for WaveNet vocoder
In this paper, we propose a technique to alleviate the quality degradation
caused by collapsed speech segments sometimes generated by the WaveNet vocoder.
The effectiveness of the WaveNet vocoder for generating natural speech from
acoustic features has been proved in recent works. However, it sometimes
generates very noisy speech with collapsed speech segments when only a limited
amount of training data is available or significant acoustic mismatches exist
between the training and testing data. Such a limitation on the corpus and
limited ability of the model can easily occur in some speech generation
applications, such as voice conversion and speech enhancement. To address this
problem, we propose a technique to automatically detect collapsed speech
segments. Moreover, to refine the detected segments, we also propose a waveform
generation technique for WaveNet using a linear predictive coding constraint.
Verification and subjective tests are conducted to investigate the
effectiveness of the proposed techniques. The verification results indicate
that the detection technique can detect most collapsed segments. The subjective
evaluations of voice conversion demonstrate that the generation technique
significantly improves the speech quality while maintaining the same speaker
similarity.
Comment: 5 pages, 6 figures. Proc. Interspeech, 201
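The detection step can be illustrated with a much-simplified energy-based stand-in: flag frames where the generated waveform's energy collapses relative to a reference waveform. This is a hedged sketch, not the paper's detector; the frame length and the energy-ratio threshold are assumptions for illustration.

```python
import numpy as np

def detect_collapsed_frames(generated, reference, frame_len=256, ratio=0.1):
    """Flag frames where the generated waveform's energy drops to a small
    fraction of the reference waveform's energy in the same frame.
    Simplified stand-in for the collapsed-segment detector."""
    n = min(len(generated), len(reference)) // frame_len
    flags = []
    for t in range(n):
        s = slice(t * frame_len, (t + 1) * frame_len)
        e_gen = float(np.mean(generated[s] ** 2))
        e_ref = float(np.mean(reference[s] ** 2))
        flags.append(e_ref > 0 and e_gen < ratio * e_ref)
    return flags
```

Detected frames would then be regenerated under an additional constraint (linear predictive coding in the paper) rather than kept as-is.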