87 research outputs found

    Multiple-average-voice-based speech synthesis


    Analysis of Speaker Adaptation Algorithms for HMM-based Speech Synthesis and a Constrained SMAPLR Adaptation Algorithm

    In this paper we analyze the effects of several factors and configuration choices encountered during training and model construction when we want to obtain better and more stable adaptation in HMM-based speech synthesis. We then propose a new adaptation algorithm called constrained structural maximum a posteriori linear regression (CSMAPLR), whose derivation is based on the knowledge obtained in this analysis and on the results of comparing several conventional adaptation algorithms. Here we investigate six major aspects of speaker adaptation: initial models; transform functions; estimation criteria; and the sensitivity of several linear regression adaptation algorithms. Analyzing the effect of the initial model, we compare speaker-dependent models, gender-independent models, and the simultaneous use of gender-dependent models against the single use of gender-dependent models. Analyzing the effect of the transform functions, we compare the transform function for only mean vectors with that for both mean vectors and covariance matrices. Analyzing the effect of the estimation criteria, we compare the ML criterion with a robust estimation criterion called structural MAP. We evaluate the sensitivity of several thresholds for the piecewise linear regression algorithms and take up methods combining MAP adaptation with the linear regression algorithms. We incorporate these adaptation algorithms into our speech synthesis system and present several subjective and objective evaluation results showing the utility and effectiveness of these algorithms in speaker adaptation for HMM-based speech synthesis.
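The constrained transforms compared above update both the mean vectors and the covariance matrices of the model's Gaussians with a single shared affine transform, which is equivalent to transforming the observed features themselves. A minimal sketch of that equivalence, assuming a single Gaussian (illustrative code only, not the paper's implementation; all variable names are assumptions):

```python
import numpy as np

# Average-voice Gaussian: mean vector mu and SPD covariance sigma.
rng = np.random.default_rng(0)
d = 3
mu = rng.normal(size=d)
L = rng.normal(size=(d, d))
sigma = L @ L.T + d * np.eye(d)

# Constrained affine transform W = [A, b]: the same A and b adapt
# both the mean and the covariance.
A = np.eye(d) + 0.1 * rng.normal(size=(d, d))
b = rng.normal(size=d)

# Model-space view: adapted mean and covariance share the transform.
mu_adapted = A @ mu + b
sigma_adapted = A @ sigma @ A.T

# Feature-space view: mapping an observation x back through the inverse
# transform and scoring it under the original Gaussian gives the same
# Mahalanobis distance (the views differ only by the Jacobian |det A|).
x = rng.normal(size=d)
x_back = np.linalg.solve(A, x - b)

def mahalanobis(v, m, S):
    diff = v - m
    return float(diff @ np.linalg.solve(S, diff))

assert np.isclose(mahalanobis(x, mu_adapted, sigma_adapted),
                  mahalanobis(x_back, mu, sigma))
```

This shared-transform constraint is what makes such adaptation estimable from very little target-speaker data: one affine transform serves many Gaussians at once.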

    Improved Average-Voice-based Speech Synthesis Using Gender-Mixed Modeling and a Parameter Generation Algorithm Considering GV

    For constructing a speech synthesis system which can achieve diverse voices, we have been developing a speaker-independent approach to HMM-based speech synthesis in which statistical average voice models are adapted to a target speaker using a small amount of speech data. In this paper, we incorporate a high-quality speech vocoding method, STRAIGHT, and a parameter generation algorithm that considers global variance into the system to improve the quality of synthetic speech. Furthermore, we introduce a feature-space speaker adaptive training algorithm and a gender-mixed modeling technique for further normalization of the average voice model. We build an English text-to-speech system using these techniques and show the performance of the system.
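The global variance (GV) consideration mentioned above counteracts the over-smoothing of statistically generated parameter trajectories, whose per-utterance variance tends to be smaller than that of natural speech. A crude variance-scaling postfilter, shown here only to illustrate the effect (a simplification of full GV-based parameter generation; the function name and target value are assumptions):

```python
import numpy as np

# Over-smoothed generated trajectory (e.g. one mel-cepstral coefficient
# over time); its variance is too small compared with natural speech.
t = np.linspace(0.0, 1.0, 100)
generated = 0.3 * np.sin(2 * np.pi * 3 * t)

def gv_postfilter(traj, gv_target):
    """Rescale deviations from the mean so the trajectory's variance
    matches the target global variance."""
    mean = traj.mean()
    scale = np.sqrt(gv_target / traj.var())
    return mean + scale * (traj - mean)

enhanced = gv_postfilter(generated, gv_target=0.18)
# enhanced.var() now equals the target GV by construction.
```

Full GV-based generation instead maximizes a likelihood that includes a GV model during trajectory generation, but the rescaling above captures the intuition: restore the dynamic range the statistical averaging removed.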

    Adaptation of a Hidden Markov Model-Based Text-to-Speech System Using Semi-Spontaneous Hungarian Speech

    Many automatic text-to-speech methods exist today, but in recent years the greatest attention has been given to statistical parametric speech synthesis, and within it to Hidden Markov Model (HMM) based text-to-speech. The quality of HMM-based text-to-speech approaches that of unit-selection synthesis, currently considered the best, and it offers several further advantages: its database takes up little space, new voices can be created without separate recordings, emotions can be expressed, and the voice character of a given speaker can be reproduced from recordings of only a few sentences. In this paper we present the fundamentals of HMM-based speech synthesis, the possibilities for speaker adaptation, the speaker-independent HMM database built for Hungarian, and the speaker adaptation process for semi-spontaneous Hungarian speech. To evaluate the results, we conducted listening tests on the adaptation of four different voices, which we also describe in this paper.

    Feature extraction and event detection for automatic speech recognition


    Robust speech recognition with spectrogram factorisation

    Communication by speech is intrinsic for humans. Since the breakthrough of mobile devices and wireless communication, digital transmission of speech has become ubiquitous, and the distribution and storage of audio and video data have likewise increased rapidly. However, despite being technically capable of recording and processing audio signals, only a fraction of digital systems and services are actually able to work with spoken input, that is, to operate on the lexical content of speech. One persistent obstacle to the practical deployment of automatic speech recognition systems is inadequate robustness against noise and other interference, which regularly corrupts signals recorded in real-world environments. Speech and diverse noises are both complex signals that are not trivially separable. Despite decades of research and a multitude of different approaches, the problem has not been solved to a sufficient extent. In particular, the mathematically ill-posed problem of separating multiple sources from a single-channel input requires advanced models and algorithms to be solvable. One promising path is using a composite model of long-context atoms to represent a mixture of non-stationary sources based on their spectro-temporal behaviour. Algorithms derived from the family of non-negative matrix factorisations have been applied to such problems to separate and recognise individual sources like speech. This thesis describes a set of tools developed for non-negative modelling of audio spectrograms, especially involving speech and real-world noise sources. An overview is provided of the complete framework, starting from model and feature definitions, advancing to factorisation algorithms, and finally describing different routes for separation, enhancement, and recognition tasks. Current issues and their potential solutions are discussed both theoretically and from a practical point of view.
The included publications describe factorisation-based recognition systems, which have been evaluated on publicly available speech corpora in order to determine the efficiency of various separation and recognition algorithms. Several variants and system combinations that have been proposed in the literature are also discussed. The work covers a broad span of factorisation-based system components, which together aim at providing a practically viable solution to robust processing and recognition of speech in everyday situations.
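The family of non-negative matrix factorisations the thesis builds on can be sketched with the classic multiplicative-update rules for the Euclidean cost: a non-negative spectrogram V is approximated as the product of spectral atoms W and time-varying activations H. A minimal illustration, assuming a toy low-rank matrix in place of a real spectrogram (names and sizes are assumptions, not the thesis's implementation):

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorise non-negative V (freq x frames) as W @ H using
    Lee-Seung-style multiplicative updates for the Euclidean cost.
    Positive initialisation plus multiplicative updates keep W and H
    non-negative throughout."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps   # spectral atoms
    H = rng.random((rank, V.shape[1])) + eps   # activations over time
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # update activations
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # update atoms
    return W, H

# Toy "spectrogram" that is exactly rank 2, so the model can fit it well.
rng = np.random.default_rng(1)
V = rng.random((20, 2)) @ rng.random((2, 30))
W, H = nmf(V, rank=2)
relative_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

In separation settings, atoms in W are typically pre-trained per source (e.g. speech vs. noise), and at test time only H is updated; each source is then reconstructed from its own subset of atoms and activations.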

    Design of reservoir computing systems for the recognition of noise corrupted speech and handwriting
