884 research outputs found

    Voice Conversion Based on Cross-Domain Features Using Variational Auto Encoders

    An effective approach to non-parallel voice conversion (VC) is to utilize deep neural networks (DNNs), specifically variational auto encoders (VAEs), to model the latent structure of speech in an unsupervised manner. A previous study has confirmed the effectiveness of VAE using the STRAIGHT spectra for VC. However, VAEs using other types of spectral features such as mel-cepstral coefficients (MCCs), which are related to human perception and have been widely used in VC, have not been properly investigated. Instead of using one specific type of spectral feature, it is expected that VAE may benefit from using multiple types of spectral features simultaneously, thereby improving the capability of VAE for VC. To this end, we propose a novel VAE framework (called cross-domain VAE, CDVAE) for VC. Specifically, the proposed framework utilizes both STRAIGHT spectra and MCCs by explicitly regularizing multiple objectives in order to constrain the behavior of the learned encoder and decoder. Experimental results demonstrate that the proposed CDVAE framework outperforms the conventional VAE framework in terms of subjective tests. Comment: Accepted to ISCSLP 201

    Developing Sparse Representations for Anchor-Based Voice Conversion

    Voice conversion is the task of transforming speech from one speaker to sound as if it were produced by another speaker, changing the identity while retaining the linguistic content. There are many methods for performing voice conversion, but these methods often have onerous training requirements or fail in instances where one speaker has a nonnative accent. To address these issues, this dissertation presents and evaluates a novel “anchor-based” representation of speech that separates speaker content from speaker identity by modeling how speakers form English phonemes. We call the proposed method Sparse, Anchor-Based Representation of Speech (SABR), and explore methods for optimizing the parameters of this model in native-to-native and native-to-nonnative voice conversion contexts. We begin the dissertation by demonstrating how sparse coding in combination with a compact, phoneme-based dictionary can be used to separate speaker identity from content in objective and subjective tests. The formulation of the representation then presents several research questions. First, we propose a method for improving the synthesis quality by using the sparse coding residual in combination with a frequency warping algorithm to convert the residual from the source to the target speaker’s space, and add it to the target speaker’s estimated spectrum. Experimentally, we find that synthesis quality is significantly improved via this transform. Second, we propose and evaluate two methods for selecting and optimizing SABR anchors in native-to-native and native-to-nonnative voice conversion. We find that synthesis quality is significantly improved by the proposed methods over baseline algorithms, especially in native-to-nonnative voice conversion. In a detailed analysis of the algorithms, we find they focus on phonemes that are difficult for nonnative speakers of English or naturally have multiple acoustic states.
Following this, we examine methods for adding temporal constraints to SABR via the Fused Lasso. The proposed method significantly reduces the inter-frame variance in the sparse codes over other methods that incorporate temporal features into sparse coding representations. Finally, in a case study, we examine the use of the SABR methods and optimizations in the context of a computer-aided pronunciation training system for building “Golden Speakers”, or ideal models for nonnative speakers of a second language to learn correct pronunciation. Under the hypothesis that the optimal “Golden Speaker” was the learner’s voice, synthesized with a native accent, we used SABR to build voice models for nonnative speakers and evaluated the resulting synthesis in terms of quality, identity, and accentedness. We found that even when deployed in the field, the SABR method generated synthesis with low accentedness and similar acoustic identity to the target speaker, validating the use of the method for building “Golden Speakers”.

    Customization of IBM Intu’s Voice by Connecting Text-to-Speech Services and a Voice Conversion Network

    IBM has recently launched Project Intu, which extends the existing web-based cognitive service Watson with the Internet of Things to provide an intelligent personal assistant service. We propose a voice customization service that allows a user to directly customize the voice of Intu. The method for voice customization is based on IBM Watson’s text-to-speech service and a voice conversion model. A user can train the voice conversion model by providing a minimum of approximately 100 speech samples in the preferred voice (target voice). The output voice of Intu (source voice) is then converted into the target voice. Furthermore, the user does not need to offer parallel data for the target voice since the transcriptions of the source speech and target speech are the same. We also suggest methods to maximize the efficiency of voice conversion and determine the proper amount of target speech based on several experiments. When we measured the elapsed time for each process, we observed that feature extraction accounts for 59.7% of voice conversion time, which implies that fixing inefficiencies in feature extraction should be prioritized. We used the mel-cepstral distortion between the target speech and reconstructed speech as an index for conversion accuracy and found that, when the number of target speech samples for training is fewer than 100, the general performance of the model degrades.
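
The mel-cepstral distortion mentioned in this abstract is commonly computed per frame as (10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2) and then averaged over frames. The following is a minimal sketch of that common definition, not the authors' implementation. It assumes the two sequences are already time-aligned, that each frame is a list of mel-cepstral coefficients, and that coefficient 0 (energy) is excluded; the function name `mel_cepstral_distortion` is hypothetical.

```python
import math


def mel_cepstral_distortion(target, converted):
    """Mean mel-cepstral distortion in dB over time-aligned frames.

    Assumptions (not taken from the abstract): frames are already aligned,
    and coefficient 0 (energy) is skipped, as is common practice.
    """
    if not target or len(target) != len(converted):
        raise ValueError("sequences must be non-empty and of equal length")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for t_frame, c_frame in zip(target, converted):
        if len(t_frame) != len(c_frame):
            raise ValueError("frames must have the same number of coefficients")
        # Squared Euclidean distance over coefficients 1..D (index 0 skipped).
        sq_dist = sum((a - b) ** 2 for a, b in zip(t_frame[1:], c_frame[1:]))
        total += scale * math.sqrt(2.0 * sq_dist)
    return total / len(target)
```

A lower value means the converted cepstra are closer to the target; identical sequences give 0 dB.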

    Robust speech recognition with spectrogram factorisation

    Communication by speech is intrinsic for humans. Since the breakthrough of mobile devices and wireless communication, digital transmission of speech has become ubiquitous. Similarly, the distribution and storage of audio and video data have increased rapidly. However, despite being technically capable of recording and processing audio signals, only a fraction of digital systems and services are actually able to work with spoken input, that is, to operate on the lexical content of speech. One persistent obstacle for practical deployment of automatic speech recognition systems is inadequate robustness against noise and other interferences, which regularly corrupt signals recorded in real-world environments. Speech and diverse noises are both complex signals, which are not trivially separable. Despite decades of research and a multitude of different approaches, the problem has not been solved to a sufficient extent. In particular, the mathematically ill-posed problem of separating multiple sources from a single-channel input requires advanced models and algorithms to be solvable. One promising path is using a composite model of long-context atoms to represent a mixture of non-stationary sources based on their spectro-temporal behaviour. Algorithms derived from the family of non-negative matrix factorisations have been applied to such problems to separate and recognise individual sources like speech. This thesis describes a set of tools developed for non-negative modelling of audio spectrograms, especially involving speech and real-world noise sources. An overview of the complete framework is provided, starting from model and feature definitions, advancing to factorisation algorithms, and finally describing different routes for separation, enhancement, and recognition tasks. Current issues and their potential solutions are discussed both theoretically and from a practical point of view.
The included publications describe factorisation-based recognition systems, which have been evaluated on publicly available speech corpora in order to determine the efficiency of various separation and recognition algorithms. Several variants and system combinations that have been proposed in the literature are also discussed. The work covers a broad span of factorisation-based system components, which together aim at providing a practically viable solution to robust processing and recognition of speech in everyday situations.
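
As an illustration of the non-negative matrix factorisation family this abstract refers to, here is a minimal sketch of plain NMF with the standard multiplicative updates for the Euclidean cost. It is not the thesis's system: it uses no long-context atoms, and the Euclidean cost, the pure-Python helpers (`matmul`, `transpose`, `nmf`, `frobenius_error`) and all parameter defaults are assumptions chosen for the example. A real system would normally use an array library instead of nested lists.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, rank, iters=200, eps=1e-9, seed=0):
    """Approximate non-negative v (frequency x time) as w @ h.

    Uses multiplicative updates for the Euclidean cost; every factor
    entry stays non-negative because updates only multiply by
    non-negative ratios.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(iters):
        # Update activations: h <- h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_cols)]
             for k in range(rank)]
        # Update atoms: w <- w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n_rows)]
    return w, h


def frobenius_error(v, w, h):
    """Frobenius norm of the reconstruction error v - w @ h."""
    approx = matmul(w, h)
    return sum((a - b) ** 2 for ra, rb in zip(v, approx) for a, b in zip(ra, rb)) ** 0.5
```

In a separation setting, the columns of `w` would play the role of spectral atoms for speech and noise, and the rows of `h` their activations over time; grouping atoms by source then gives per-source spectrogram estimates.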