Singing voice correction using canonical time warping
Expressive singing voice correction is an appealing but challenging problem.
A robust time-warping algorithm which synchronizes two singing recordings can
provide a promising solution. We thereby propose to address the problem by
canonical time warping (CTW) which aligns amateur singing recordings to
professional ones. A new pitch contour is generated given the alignment
information, and a pitch-corrected singing is synthesized back through the
vocoder. The objective evaluation shows that CTW is robust against
pitch-shifting and time-stretching effects, and the subjective test
demonstrates that CTW outperforms the other methods, including DTW and
commercial auto-tuning software. Finally, we demonstrate the applicability of
the proposed method in a practical, real-world scenario.
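The alignment step can be illustrated with a plain DTW baseline, the method CTW is compared against. This is a minimal sketch, not the authors' CTW implementation (CTW additionally learns feature projections into a shared space before warping); function names and the frame-wise pitch-copy rule are illustrative.

```python
import numpy as np

def dtw_path(x, y):
    """Plain dynamic time warping between two 1-D sequences.
    Returns the accumulated cost and the optimal alignment path."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # Backtrack from (n, m) to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return D[n, m], path

def warp_pitch(amateur_f0, pro_f0):
    """Generate a corrected pitch contour: each amateur frame takes the
    professional pitch value it is aligned to."""
    _, path = dtw_path(amateur_f0, pro_f0)
    corrected = np.copy(amateur_f0)
    for i, j in path:
        corrected[i] = pro_f0[j]
    return corrected
```

The corrected contour keeps the amateur recording's timing while adopting the professional pitch values, which is what the vocoder then resynthesizes.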
A flexible speech feature converter based on an enhanced architecture of U-net
In order to analyze speech or audio, many methods are applied to transform time domain signals into various features, such as mel spectral features and WORLD vocoder features. Both types of features can be extracted from speech or used to synthesize speech. On the other hand, certain applications call for conversion between different types of features. To convert mel spectral features to WORLD vocoder features, one possible method is to first synthesize a time domain signal from the mel spectrogram and then extract features with the WORLD vocoder. The goal of this project is to develop a direct way to achieve this transformation, i.e., to convert the mel spectrogram output of a text-to-speech (TTS) system to WORLD vocoder features. In this project, a feature converter is designed to accomplish this aim. The converter has an enhanced neural network architecture based on the U-net. In our design, in addition to the basic U-net architecture, a Res Path composed of residual blocks and linear transformations is added on the skip connections. Our flexible system can perform feature conversion directly at the feature level without processing in the time domain. In addition to converting mel spectrograms to WORLD features, the reverse transformation from WORLD features to mel spectrograms is also attainable with a few adjustments. The transformed features achieve good performance in objective metrics, and the converter generalizes well to different speakers, so it can be applied to produce high quality speech via vocoder resynthesis.
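The Res Path idea on the skip connection can be sketched in a few lines. This is a minimal numpy illustration under stated assumptions, not the authors' exact architecture: each residual block is taken to be a linear transform with a ReLU, added back to its input via an identity shortcut; the weight shapes and block count are illustrative.

```python
import numpy as np

def res_path(skip, weights, biases):
    """Illustrative Res Path on a U-net skip connection: a chain of
    residual blocks, each adding ReLU(h @ W + b) back onto its input.
    weights[k] is a (d, d) matrix, biases[k] a (d,) vector."""
    h = skip
    for W, b in zip(weights, biases):
        h = h + np.maximum(0.0, h @ W + b)  # residual block with identity shortcut
    return h
```

In a U-net, the output of `res_path` would be concatenated with the corresponding decoder features instead of passing the raw encoder features straight across, which is the refinement the abstract describes.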
Automatic assignment of reference points for voice morphing
An automatic reference point placement method for voice morphing is reported in this paper. Voice morphing is a fundamental voice editing method that blends the feature vector sequences of two voices based on corresponding reference points. Reference points are usually assigned by hand, and the quality of the voice morphing output depends on them. Moreover, assigning reference points is a time-consuming task. The proposed method automatically assigns reference points on the spectrogram in both the time and frequency domains, based on temporal decomposition (TD) and line spectral frequencies (LSF). In a two-speaker voice morphing experiment, the proposed method worked well using only the voices and their transcriptions as inputs.
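The time-axis half of this task can be illustrated with a simplified stand-in: picking candidate reference frames where the spectrum changes abruptly. Note this is not the paper's TD/LSF method, just a hedged sketch of the underlying idea; the spectral-flux criterion and the threshold are assumptions for illustration.

```python
import numpy as np

def spectral_flux_anchors(spec, threshold=1.0):
    """Pick candidate time-axis reference frames as local maxima of
    spectral flux (frame-to-frame spectral change). spec is a
    (frames, bins) magnitude spectrogram. Simplified stand-in for the
    TD event functions used in the paper."""
    diff = np.diff(spec, axis=0)
    flux = np.sum(np.maximum(diff, 0.0), axis=1)  # half-wave rectified flux
    anchors = []
    for t in range(1, len(flux) - 1):
        if flux[t] > flux[t - 1] and flux[t] >= flux[t + 1] and flux[t] > threshold:
            anchors.append(t + 1)  # +1: np.diff shifts indices by one frame
    return anchors
```

Matching such anchors between two speakers' spectrograms would then yield the reference point pairs that manual annotation normally provides.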
Speech de-identification with deep neural networks
Cloud-based speech services are powerful practical tools, but exposing speech to the Internet raises important legal concerns about the speakers' privacy. We propose a deep neural network solution that removes personal characteristics from human speech by converting it to the voice of a Text-to-Speech (TTS) system before sending the utterance to the cloud. The network learns to transcode sequences of vocoder parameters, delta and delta-delta features of human speech to those of the TTS engine. We evaluated several TTS systems, vocoders and audio alignment techniques. We measured the performance of our method by (i) comparing the result of speech recognition on the de-identified utterances with the original texts, (ii) computing the Mel-Cepstral Distortion of the aligned TTS and the transcoded sequences, and (iii) questioning human participants in A-not-B, 2AFC and 6AFC tasks. Our approach achieves the de-identification level required by diverse applications.
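The delta and delta-delta features mentioned above are a standard dynamic-feature computation. A minimal sketch of the usual HTK-style regression formula follows; the window width is illustrative and the paper may use different settings.

```python
import numpy as np

def delta(features, width=2):
    """HTK-style delta regression over a window of +/- width frames.
    features is (frames, dims); edges are padded by repetition.
    delta[t] = sum_k k * (f[t+k] - f[t-k]) / (2 * sum_k k^2)."""
    pad = np.pad(features, ((width, width), (0, 0)), mode="edge")
    num = np.zeros_like(features, dtype=float)
    for k in range(1, width + 1):
        num += k * (pad[width + k : width + k + len(features)]
                    - pad[width - k : width - k + len(features)])
    denom = 2.0 * sum(k * k for k in range(1, width + 1))
    return num / denom

feats = np.arange(10.0).reshape(-1, 1)  # a linearly increasing parameter track
d = delta(feats)    # interior slope is 1 per frame
dd = delta(d)       # delta-delta is 0 in the interior for a linear track
```

Stacking `features`, `d`, and `dd` along the feature axis gives the network input described in the abstract.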
Collapsed speech segment detection and suppression for WaveNet vocoder
In this paper, we propose a technique to alleviate the quality degradation
caused by collapsed speech segments sometimes generated by the WaveNet vocoder.
The effectiveness of the WaveNet vocoder for generating natural speech from
acoustic features has been proved in recent works. However, it sometimes
generates very noisy speech with collapsed speech segments when only a limited
amount of training data is available or significant acoustic mismatches exist
between the training and testing data. Such a limitation on the corpus and
limited ability of the model can easily occur in some speech generation
applications, such as voice conversion and speech enhancement. To address this
problem, we propose a technique to automatically detect collapsed speech
segments. Moreover, to refine the detected segments, we also propose a waveform
generation technique for WaveNet using a linear predictive coding constraint.
Verification and subjective tests are conducted to investigate the
effectiveness of the proposed techniques. The verification results indicate
that the detection technique can detect most collapsed segments. The subjective
evaluations of voice conversion demonstrate that the generation technique
significantly improves the speech quality while maintaining the same speaker
similarity.
Comment: 5 pages, 6 figures. Proc. Interspeech, 201
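The detection step can be illustrated with a much-simplified energy-based stand-in: flag frames where the generated waveform's energy collapses relative to a reference waveform. This is a hedged sketch, not the paper's detector; the frame length and the energy-ratio threshold are assumptions for illustration.

```python
import numpy as np

def detect_collapsed_frames(generated, reference, frame_len=256, ratio=0.1):
    """Flag frames where the generated waveform's energy drops to a small
    fraction of the reference waveform's energy in the same frame.
    Simplified stand-in for the collapsed-segment detector."""
    n = min(len(generated), len(reference)) // frame_len
    flags = []
    for t in range(n):
        s = slice(t * frame_len, (t + 1) * frame_len)
        e_gen = float(np.mean(generated[s] ** 2))
        e_ref = float(np.mean(reference[s] ** 2))
        flags.append(e_ref > 0 and e_gen < ratio * e_ref)
    return flags
```

Detected frames would then be regenerated under an additional constraint (linear predictive coding in the paper) rather than kept as-is.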