Article thumbnail

Predicting F0 and voicing from NAM-captured whispered speech

By Viet-Anh Tran, Gérard Bailly, Hélène Loevenbruck and Tomoki Toda

Abstract

International audienceThe NAM-to-speech conversion proposed by Toda and colleagues which converts Non-Audible Murmur (NAM) to audible speech by statistical mapping trained using aligned corpora is a very promising technique, but its performance is still insufficient, mainly due to the difficulty in estimating F0 of the transformed voice from unvoiced speech. In this paper, we propose a method to improve F0 estimation and voicing decision in a NAM-to-speech conversion system based on Gaussian Mixture Models (GMM) applied to whispered speech. Instead of combining voicing decision and F0 estimation in a single GMM, a simple feed-forward neural network is used to detect voiced segments in the whisper while a GMM estimates a continuous melodic contour based on training voiced segments. The error rate for the voiced/unvoiced decision of the network is 6.8% compared to 9.2% with the original system. Our proposal benefits also to F0 estimation error

Topics: [SPI.SIGNAL]Engineering Sciences [physics]/Signal and Image processing, [INFO.INFO-TS]Computer Science [cs]/Signal and Image Processing
Publisher: HAL CCSD
Year: 2008
OAI identifier: oai:HAL:hal-00333290v1
Download PDF:
Sorry, we are unable to provide the full text but you may find it at the following location(s):
  • https://hal.archives-ouvertes.... (external link)
  • https://hal.archives-ouvertes.... (external link)
  • https://hal.archives-ouvertes.... (external link)
  • https://hal.archives-ouvertes.... (external link)
  • https://hal.archives-ouvertes.... (external link)
  • Suggested articles


    To submit an update or takedown request for this paper, please submit an Update/Correction/Removal Request.