
    Singing voice correction using canonical time warping

    Expressive singing voice correction is an appealing but challenging problem. A robust time-warping algorithm that synchronizes two singing recordings can provide a promising solution. We therefore propose to address the problem with canonical time warping (CTW), which aligns amateur singing recordings to professional ones. A new pitch contour is generated from the alignment information, and the pitch-corrected singing is synthesized back through a vocoder. The objective evaluation shows that CTW is robust against pitch-shifting and time-stretching effects, and the subjective test demonstrates that CTW prevails over the other methods, including DTW and commercial auto-tuning software. Finally, we demonstrate the applicability of the proposed method in a practical, real-world scenario.
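    As a rough illustration of this align-then-correct pipeline, the sketch below uses plain DTW standing in for CTW (CTW would additionally learn a shared projection of the two feature sequences before warping). The file names, hop size, and feature choices are illustrative assumptions, not the paper's setup.

```python
# Alignment-then-pitch-correction sketch; DTW stands in for CTW here.
import numpy as np
import librosa

sr, hop = 22050, 256
amateur, _ = librosa.load("amateur.wav", sr=sr)        # hypothetical inputs
reference, _ = librosa.load("professional.wav", sr=sr)

# Frame-level features for alignment.
X = librosa.feature.mfcc(y=amateur, sr=sr, hop_length=hop)
Y = librosa.feature.mfcc(y=reference, sr=sr, hop_length=hop)

# DTW returns an accumulated cost matrix and a warping path of index pairs.
_, wp = librosa.sequence.dtw(X=X, Y=Y)
wp = wp[::-1]  # the path is returned end-to-start

# Reference pitch contour (pyin yields f0 per frame, NaN where unvoiced).
f0_ref, _, _ = librosa.pyin(reference, fmin=80, fmax=800,
                            sr=sr, hop_length=hop)

# Build a corrected contour on the amateur timeline: each amateur frame
# takes the pitch of its aligned professional frame.
f0_new = np.full(X.shape[1], np.nan)
for i, j in wp:
    f0_new[i] = f0_ref[min(j, len(f0_ref) - 1)]
# f0_new would then drive a vocoder (e.g. WORLD) to resynthesize the
# pitch-corrected singing from the amateur recording's spectral envelope.
```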

    A flexible speech feature converter based on an enhanced architecture of U-net

    To analyze speech or audio, many methods transform time-domain signals into features such as mel spectral features and WORLD vocoder features. Both types of features can be extracted from speech or used to synthesize speech. On the other hand, certain applications call for conversion between different types of features. To convert mel spectral features to WORLD vocoder features, one possible method is to first synthesize a time-domain signal from the mel spectrogram and then extract features with the WORLD vocoder. The goal of this project is to develop a direct way to achieve this transformation, i.e., to convert the mel spectrogram output of a text-to-speech (TTS) system to WORLD vocoder features. A feature converter is designed to accomplish this aim. The converter has an enhanced neural network architecture based on the U-net: in addition to the basic U-net architecture, a Res Path composed of residual blocks and linear transformations is added on the skip connections. The system completes feature conversion directly at the feature level, without processing in the time domain. Besides converting mel spectrograms to WORLD features, the reverse transformation from WORLD features to mel spectrograms is attainable with a few adjustments. The transformed features achieve good performance on objective metrics, and the converter generalizes well to different speakers, so it can be applied to produce high-quality speech via vocoder resynthesis.
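    A minimal sketch of the described architecture follows: a 1-D U-net whose skip connection passes through a Res Path of residual blocks instead of a raw copy. The channel counts (80 mel bins in, 64 stacked WORLD feature dimensions out) and network depth are illustrative assumptions, not the thesis's exact configuration.

```python
# 1-D U-net with a Res Path on the skip connection (PyTorch sketch).
import torch
import torch.nn as nn

class ResPath(nn.Module):
    """Residual blocks replacing the plain copy on a U-net skip connection."""
    def __init__(self, ch, n_blocks=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv1d(ch, ch, 3, padding=1), nn.ReLU(),
                          nn.Conv1d(ch, ch, 3, padding=1))
            for _ in range(n_blocks))
    def forward(self, x):
        for b in self.blocks:
            x = torch.relu(x + b(x))   # conv path + identity
        return x

class UNetConverter(nn.Module):
    def __init__(self, in_ch=80, out_ch=64, hid=128):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv1d(in_ch, hid, 3, padding=1), nn.ReLU())
        self.down = nn.Conv1d(hid, hid, 4, stride=2, padding=1)   # halve time axis
        self.bottleneck = nn.Sequential(nn.Conv1d(hid, hid, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose1d(hid, hid, 4, stride=2, padding=1)
        self.respath = ResPath(hid)
        self.dec1 = nn.Sequential(nn.Conv1d(2 * hid, hid, 3, padding=1), nn.ReLU(),
                                  nn.Conv1d(hid, out_ch, 3, padding=1))
    def forward(self, mel):               # mel: (batch, 80, frames)
        e1 = self.enc1(mel)
        z = self.bottleneck(self.down(e1))
        u = self.up(z)
        skip = self.respath(e1)           # refined skip instead of a raw copy
        return self.dec1(torch.cat([u, skip], dim=1))

world = UNetConverter()(torch.randn(2, 80, 200))   # -> (2, 64, 200)
```

    Swapping `in_ch` and `out_ch` (and retraining) gives the reverse WORLD-to-mel direction the abstract mentions.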

    Automatic reference point placement for voice morphing

    An automatic reference point placement method for voice morphing is reported in this paper. Voice morphing is a fundamental voice editing method that blends the feature vector sequences of two voices based on corresponding reference points. Reference points are normally assigned by hand, the quality of the voice morphing output depends on them, and assigning them is a time-consuming task. The proposed method automatically assigns reference points on the spectrogram, in both the time and frequency domains, based on temporal decomposition (TD) and line spectral frequencies (LSF). In two-speaker voice morphing experiments, the proposed method worked well using a voice recording and its transcription as inputs.
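    The frequency-domain anchors rest on LSFs, which can be computed from a frame's LPC polynomial via the classic sum/difference polynomial construction, as the sketch below shows. The file name, frame position, frame length, and LPC order are illustrative assumptions.

```python
# LPC -> LSF conversion for one analysis frame.
import numpy as np
import librosa

def lpc_to_lsf(a):
    """a: LPC polynomial [1, a1, ..., ap]. Returns sorted LSFs in (0, pi)."""
    # P(z) = A(z) + z^-(p+1) A(1/z),  Q(z) = A(z) - z^-(p+1) A(1/z)
    p_poly = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    q_poly = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    roots = np.concatenate([np.roots(p_poly), np.roots(q_poly)])
    ang = np.angle(roots)
    # Keep one angle per conjugate pair; drop the trivial roots at z = +/-1.
    return np.sort(ang[(ang > 1e-6) & (ang < np.pi - 1e-6)])

y, sr = librosa.load("voice.wav", sr=16000)       # hypothetical input
frame = y[8000:8000 + 400] * np.hanning(400)      # one windowed 25 ms frame
a = librosa.lpc(frame, order=16)
print(lpc_to_lsf(a) * sr / (2 * np.pi))           # LSFs in Hz
```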

    Speech de-identification with deep neural networks

    Cloud-based speech services are powerful practical tools, but exposing speech to the Internet raises important legal concerns about speaker privacy. We propose a deep neural network solution that removes personal characteristics from human speech by converting it to the voice of a Text-to-Speech (TTS) system before sending the utterance to the cloud. The network learns to transcode sequences of vocoder parameters, together with their delta and delta-delta features, from human speech to those of the TTS engine. We evaluated several TTS systems, vocoders, and audio alignment techniques. We measured the performance of our method by (i) comparing the result of speech recognition on the de-identified utterances with the original texts, (ii) computing the Mel-Cepstral Distortion between the aligned TTS and transcoded sequences, and (iii) questioning human participants in A-not-B, 2AFC, and 6AFC tasks. Our approach achieves the level required by diverse applications.
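    A hedged sketch of the transcoding step follows: vocoder parameters of the human voice, stacked with their delta and delta-delta features, are mapped frame-by-frame toward the TTS voice by a recurrent network. The feature dimensionality and network size are assumptions for illustration, not the paper's configuration.

```python
# Vocoder-parameter transcoding sketch (human voice -> TTS voice).
import numpy as np
import librosa
import torch
import torch.nn as nn

def with_deltas(feats):
    """feats: (dims, frames) vocoder parameters -> static + delta + delta-delta."""
    d1 = librosa.feature.delta(feats, order=1)
    d2 = librosa.feature.delta(feats, order=2)
    return np.concatenate([feats, d1, d2], axis=0)

class Transcoder(nn.Module):
    """Maps human vocoder-parameter sequences to TTS-voice parameters."""
    def __init__(self, in_dim=3 * 60, out_dim=60, hid=256):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hid, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hid, out_dim)
    def forward(self, x):                 # x: (batch, frames, in_dim)
        h, _ = self.rnn(x)
        return self.out(h)                # predicted TTS parameters per frame

human = np.random.randn(60, 300)          # placeholder 60-dim vocoder params
x = torch.tensor(with_deltas(human).T[None], dtype=torch.float32)
tts_params = Transcoder()(x)              # (1, 300, 60), fed to the vocoder
```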

    Collapsed speech segment detection and suppression for WaveNet vocoder

    In this paper, we propose a technique to alleviate the quality degradation caused by the collapsed speech segments sometimes generated by the WaveNet vocoder. The effectiveness of the WaveNet vocoder for generating natural speech from acoustic features has been proved in recent works. However, it sometimes generates very noisy speech with collapsed segments when only a limited amount of training data is available or when significant acoustic mismatches exist between the training and testing data. Such corpus limitations and limited model capacity easily arise in speech generation applications such as voice conversion and speech enhancement. To address this problem, we propose a technique to automatically detect collapsed speech segments. Moreover, to refine the detected segments, we also propose a waveform generation technique for WaveNet using a linear predictive coding constraint. Verification and subjective tests are conducted to investigate the effectiveness of the proposed techniques. The verification results indicate that the detection technique can detect most collapsed segments. The subjective evaluations of voice conversion demonstrate that the generation technique significantly improves speech quality while maintaining the same speaker similarity.
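    One simple detector in the spirit of the abstract compares the frame power of the generated waveform against the power implied by the conditioning features and flags frames that deviate strongly. This is a sketch under assumptions (power-domain mel conditioning, the frame settings, and the threshold are all illustrative), not the paper's exact criterion.

```python
# Collapsed-segment detection sketch: flag frames whose generated power
# deviates strongly from the power of the conditioning features.
import numpy as np

def frame_power_db(wav, frame=400, hop=200):
    n = 1 + (len(wav) - frame) // hop
    frames = np.stack([wav[i * hop:i * hop + frame] for i in range(n)])
    return 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-10)

def detect_collapsed(generated, mel, threshold_db=15.0):
    """Boolean mask of frames where generated power diverges from the
    power of the conditioning mel features (assumed power-domain)."""
    gen_db = frame_power_db(generated)
    cond_db = 10 * np.log10(np.sum(mel, axis=0) + 1e-10)  # mel: (bins, frames)
    n = min(len(gen_db), len(cond_db))
    return np.abs(cond_db[:n] - gen_db[:n]) > threshold_db

# Flagged segments would then be regenerated under the proposed
# linear-predictive-coding constraint instead of free-running sampling.
```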