339 research outputs found

    A Hierarchical Encoder-Decoder Model for Statistical Parametric Speech Synthesis

    A Cybernetics Update for Competitive Deep Learning System

    A number of recent reports in the peer-reviewed literature have discussed the irreproducibility of results in biomedical research. Some of these articles suggest that the inability of independent research laboratories to replicate published results has a negative impact on the development of, and confidence in, the biomedical research enterprise. To obtain more resilient data and more reproducible results, we present an adaptive, learning system reference architecture for a smart learning system interface. For deeper inspiration, we focus our attention on mammalian brain neurophysiology. From a neurophysiological point of view, the neuroscientist LeDoux identifies two preferential amygdala pathways in the brain of the laboratory mouse. The low road is a pathway that transmits a signal from a stimulus to the thalamus and then to the amygdala, which activates a fast response in the body. The high road is activated simultaneously. It is a slower road that also includes the cortical parts of the brain, thus creating a conscious impression of what the stimulus is (for instance, to develop a rational defense mechanism). To mimic this biological reality, our main idea is to use a new input node able to bind known information coherently to unknown information. Unknown "environmental noise" and/or local "signal input" information can then be aggregated with known "system internal control status" information to provide a landscape of attractor points, from which either a fast or a slower, deeper system response can be computed. In this way, ideal cybernetics system interaction levels can be matched exactly to practical system modeling interaction styles, with no paradigmatic operational ambiguity and minimal information loss. The present paper is a relevant contribution to updating classic cybernetics towards a new General Theory of Systems, a post-Bertalanffy Systemics.

    Parallel and cascaded deep neural networks for text-to-speech synthesis

    A study of speaker adaptation for DNN-based speech synthesis

    A major advantage of statistical parametric speech synthesis (SPSS) over unit-selection speech synthesis is its adaptability and controllability in changing speaker characteristics and speaking style. Recently, several studies using deep neural networks (DNNs) as acoustic models for SPSS have shown promising results. However, the adaptability of DNNs in SPSS has not been systematically studied. In this paper, we conduct an experimental analysis of speaker adaptation for DNN-based speech synthesis at different levels. In particular, we augment a low-dimensional speaker-specific vector with linguistic features as input to represent speaker identity, perform model adaptation to scale the hidden activation weights, and perform a feature space transformation at the output layer to modify generated acoustic features. We systematically analyse the performance of each individual adaptation technique and that of their combinations. Experimental results confirm the adaptability of the DNN, and listening tests demonstrate that the DNN can achieve significantly better adaptation performance than the hidden Markov model (HMM) baseline in terms of naturalness and speaker similarity. Index Terms: Speech synthesis, acoustic model, deep neural network, speaker adaptation
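
The three adaptation levels named in the abstract above (input, hidden layer, output) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the element-wise form of the hidden scaling, and the affine form of the output transformation are all assumptions made for illustration, and no network training is shown.

```python
# Illustrative sketch only: all names and functional forms are assumptions,
# not taken from the paper.

def augment_input(linguistic, speaker_vec):
    """Input level: append a speaker-specific vector to the linguistic features."""
    return list(linguistic) + list(speaker_vec)


def scale_hidden(hidden, scales):
    """Model level: rescale each hidden activation with a per-speaker factor
    (assumed here to be an element-wise multiplication)."""
    return [h * s for h, s in zip(hidden, scales)]


def transform_output(acoustic, matrix, bias):
    """Output level: apply a feature-space transformation to the generated
    acoustic features (assumed here to be affine: matrix * acoustic + bias)."""
    return [
        sum(w * x for w, x in zip(row, acoustic)) + b
        for row, b in zip(matrix, bias)
    ]


if __name__ == "__main__":
    x = augment_input([1.0, 0.0], [0.5])        # linguistic features + speaker code
    h = scale_hidden([2.0, 4.0], [0.5, 1.5])    # speaker-dependent scaling
    y = transform_output([1.0, 2.0], [[1.0, 0.0], [0.0, 2.0]], [0.5, -1.0])
    print(x, h, y)
```

Because each function touches a different point of the network, the three can be applied separately or combined, which mirrors the abstract's comparison of individual techniques and their combinations.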

    A dynamic deep learning approach for intonation modeling

    Intonation plays a crucial role in making synthetic speech sound more natural. However, intonation modeling largely remains an open question. In my thesis, the interpolated F0 is parameterized dynamically by means of sign values, encoding the direction of pitch change, and corresponding quantized magnitude values, encoding the amount of pitch change in that direction. The sign and magnitude values are used to train a dedicated neural network. The proposed methodology is evaluated and compared to a state-of-the-art DNN-based TTS system. To this end, a segmental synthesizer was implemented to normalize the effect of the spectrum. The synthesizer uses the F0 and linguistic features to predict the spectrum, aperiodicity, and voicing information. The proposed methodology performs as well as the reference system, and we observe a trend for native speakers to prefer the proposed intonation model.
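
The sign and magnitude parameterization described in the abstract above can be sketched as follows. This is a hedged illustration, not the thesis's method: the frame-to-frame differencing, the uniform quantization step `step`, the cap `max_level`, and both function names are assumptions, and the F0 values are made-up example numbers in Hz.

```python
# Illustrative sketch only: quantization scheme and parameter values are assumed.

def encode_f0_deltas(f0, step=5.0, max_level=4):
    """Encode each frame-to-frame F0 change as a sign (-1, 0, +1)
    and a quantized magnitude level (0..max_level)."""
    signs, levels = [], []
    for prev, curr in zip(f0, f0[1:]):
        delta = curr - prev
        signs.append((delta > 0) - (delta < 0))  # direction of pitch change
        levels.append(min(int(round(abs(delta) / step)), max_level))  # amount
    return signs, levels


def decode_f0_deltas(start, signs, levels, step=5.0):
    """Rebuild an approximate F0 contour from a start value and the encoded changes."""
    f0 = [start]
    for sign, level in zip(signs, levels):
        f0.append(f0[-1] + sign * level * step)
    return f0


if __name__ == "__main__":
    contour = [100.0, 110.0, 104.0, 104.0]  # hypothetical interpolated F0 in Hz
    signs, levels = encode_f0_deltas(contour)
    print(signs, levels)
    print(decode_f0_deltas(contour[0], signs, levels))
```

Quantization makes the encoding lossy: the decoded contour follows the original direction of change exactly but only approximates its size, to within the chosen step.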