7 research outputs found

    Automatic synthesis of neutral and expressive speech

    In intelligent applications that use speech technologies, synthesized speech must sound natural and expressive. The article describes a speech synthesis technology that voices arbitrary orthographic texts in Ukrainian in both neutral and expressive styles while preserving the individual characteristics of the speaker's voice and pronunciation. The main focus is on the prosodic intonation model used to synthesize speech with neutral and expressive intonation.

    Using linguistic features for the automatic detection of intonationally prominent words in Russian text

    The article presents a method of detecting prosodically prominent words, i.e. words that carry most of the information in the utterance. The method relies on lexical, grammatical and syntactic markers of prominence, and can be used in text-to-speech synthesis systems to make synthesized speech sound more natural. Three different classification methods were used: Naive Bayes, Maximum Entropy and Conditional Random Fields models. The results of the experiments show that the discriminative models provide more balanced values of the performance metrics, while the generative model is potentially more useful for detecting prominent words in the speech signal. The results of the study are comparable with those of similar systems developed for other languages, and in some cases surpass them.
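
    A minimal sketch of the classification setup described above, using scikit-learn: a generative Naive Bayes model and a maximum-entropy (logistic regression) model trained on per-word categorical features. The feature names and toy data are invented for illustration, and the Conditional Random Fields component is omitted; the paper's actual lexical, grammatical and syntactic feature set is richer.

    # Toy prominence classifier sketch; features and data are hypothetical.
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression  # maximum entropy model
    from sklearn.naive_bayes import BernoulliNB          # generative model

    # Each word is a dict of categorical markers; label 1 = prominent.
    words = [
        {"pos": "NOUN", "content_word": True,  "clause_final": True},
        {"pos": "PREP", "content_word": False, "clause_final": False},
        {"pos": "VERB", "content_word": True,  "clause_final": False},
        {"pos": "ADJ",  "content_word": True,  "clause_final": True},
    ]
    labels = [1, 0, 0, 1]

    vec = DictVectorizer()
    X = vec.fit_transform(words)  # one-hot encodes the string-valued features

    for model in (BernoulliNB(), LogisticRegression(max_iter=1000)):
        model.fit(X, labels)
        print(type(model).__name__, model.predict(X))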

    The CSTR/Cereproc Blizzard Entry 2008: The Inconvenient Data

    In a commercial system, data used for unit selection is collected with a heavy emphasis on homogeneous neutral data that has sufficient coverage of the units that will be used in the system. In this year's Blizzard entry, CSTR and CereProc present a joint submission whose emphasis has been on exploring techniques to deal with data that is not homogeneous (the English entry) and that did not have appropriate coverage for a diphone-based system (the Mandarin entry, where tone/phone combinations were treated as distinct phone categories). In addition, two further problems were addressed: 1) making use of non-homogeneous data to create a voice that can realise both expressive and neutral speaking styles (the English entry); 2) building a unit selection system with no native understanding of the language, depending instead on external native evaluation (the Mandarin entry).
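
    As a toy illustration of the Mandarin labelling idea, the sketch below treats each tone/phone combination as a distinct unit category, so that diphone coverage is computed over tone-specific units. The per-character phone split and the example syllables are gross simplifications invented for illustration.

    # Hypothetical tone-specific unit labelling; not the entry's actual front end.
    from itertools import pairwise  # Python 3.10+

    def tone_phones(syllable: str, tone: int) -> list[str]:
        """Attach the tone to each (toy) phone of a romanised syllable."""
        return [f"{ch}{tone}" for ch in syllable]

    # The same segmental syllable with different tones yields distinct units.
    utterance = [("ma", 3), ("ma", 1)]
    phones = [p for syl, tone in utterance for p in tone_phones(syl, tone)]
    diphones = [f"{a}-{b}" for a, b in pairwise(phones)]
    print(phones)    # ['m3', 'a3', 'm1', 'a1']
    print(diphones)  # ['m3-a3', 'a3-m1', 'm1-a1']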

    Probabilistic Amplitude Demodulation features in Speech Synthesis for Improving Prosody

    Amplitude demodulation (AM) is a signal decomposition technique by which a signal can be decomposed into a product of two signals: a quickly varying carrier and a slowly varying modulator. In this work, probabilistic amplitude demodulation (PAD) features are used to improve prosody in speech synthesis. PAD is applied iteratively, in a cascade manner, to generate syllable and stress amplitude modulations. The PAD features are used as a secondary input scheme alongside the standard text-based input features in statistical parametric speech synthesis. Specifically, deep neural network (DNN)-based speech synthesis is used to evaluate the importance of these features. Objective evaluation has shown that the proposed system using the PAD features mainly improves prosody modelling; it outperforms the baseline system by approximately 5% in terms of relative reduction in the root mean square error (RMSE) of the fundamental frequency (F0). The significance of this improvement is validated by a subjective evaluation of overall speech quality: in an ABX test, the proposed system achieved a 38.6% preference score against 19.5% for the baseline system.
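
    The sketch below illustrates the basic carrier/modulator factorisation on a synthetic signal. A Hilbert envelope with low-pass smoothing is used only as a crude, non-probabilistic stand-in for PAD, and the cascade over syllable and stress rates is omitted.

    # Simplified amplitude demodulation; PAD itself is a probabilistic method.
    import numpy as np
    from scipy.signal import butter, filtfilt, hilbert

    fs = 16000                                       # sample rate in Hz
    t = np.arange(fs) / fs                           # one second of time
    # A 200 Hz carrier modulated at a syllable-like 4 Hz rate.
    x = (1.0 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 200 * t)

    envelope = np.abs(hilbert(x))                    # instantaneous amplitude
    b, a = butter(4, 20 / (fs / 2))                  # keep modulations below 20 Hz
    modulator = filtfilt(b, a, envelope)             # slowly varying modulator
    carrier = x / np.maximum(modulator, 1e-8)        # quickly varying carrier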

    Altering speech synthesis prosody through real time natural gestural control

    A significant amount of research has been, and continues to be, undertaken into generating expressive prosody within speech synthesis. Separately, recent developments in HMM-based synthesis (specifically pHTS, developed at the University of Mons) provide a platform for reactive speech synthesis, able to react in real time to surroundings or user interaction. Considering both of these elements, this project explores whether it is possible to generate superior prosody in a speech synthesis system, using natural gestural controls, in real time. Building on a previous piece of work undertaken at The University of Edinburgh, a system is constructed in which a user may apply a variety of prosodic effects in real time through natural gestures, recognised by a Microsoft Kinect sensor. Gestures are recognised and prosodic adjustments made through a series of hand-crafted rules (based on data gathered from preliminary experiments), though machine learning techniques are also considered within this project and recommended for future iterations of the work. Two sets of formal experiments are implemented, both of which suggest that, with further development, the system may work successfully in a real-world environment. Firstly, user tests show that subjects can learn to control the device successfully, adding prosodic effects to the intended words in the majority of cases with practice. Results are likely to improve further as buffering issues are resolved. Secondly, listening tests show that the prosodic effects currently implemented significantly increase perceived naturalness, and in some cases are able to alter the semantic perception of a sentence in an intended way. Alongside this paper, a demonstration video of the project may be found on the accompanying CD, or online at http://tinyurl.com/msc-synthesis. The reader is advised to view this demonstration as a way of understanding how the system functions and sounds in action.
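
    The kind of hand-crafted gesture-to-prosody rule described above might look like the sketch below. The thresholds, field names and scaling factors are invented; the actual system maps Kinect skeleton data onto pHTS prosodic parameters in real time.

    # Hypothetical rule mapping a tracked hand state to prosodic scaling factors.
    from dataclasses import dataclass

    @dataclass
    class HandState:
        y: float      # normalised hand height, 0 (low) to 1 (high)
        speed: float  # normalised hand speed

    def prosody_rule(hand: HandState) -> dict:
        """Map a gesture to pitch and duration scaling factors."""
        adjust = {"f0_scale": 1.0, "duration_scale": 1.0}
        if hand.y > 0.7:            # raised hand -> raise pitch proportionally
            adjust["f0_scale"] = 1.0 + 0.5 * (hand.y - 0.7)
        if hand.speed > 0.5:        # fast gesture -> speed up (shorter durations)
            adjust["duration_scale"] = 0.8
        return adjust

    print(prosody_rule(HandState(y=0.9, speed=0.2)))  # f0_scale ~ 1.1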

    Modelling prominence and emphasis improves unit-selection synthesis

    We describe the results of large-scale perception experiments showing improvements in synthesising two distinct kinds of prominence: standard pitch accents and strong emphatic accents. Previously, prominence assignment has mainly been evaluated by computing accuracy on a prominence-labelled test set. By contrast, we integrated an automatic pitch-accent classifier into the unit selection target cost and showed that listeners preferred the sentences synthesised this way. We also describe an improved recording script for collecting emphatic accents, and show that generating emphatic accents leads to further improvements in the fiction genre over incorporating pitch accents only. Finally, we show differences in the effects of prominence between child-directed speech and the news and fiction genres.
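
    A minimal sketch of how a prominence label can enter a unit-selection target cost, under the assumption (not spelled out above) that the cost is a weighted sum of feature mismatches; the weights and feature set are invented for illustration.

    # Hypothetical target cost with a prominence-mismatch sub-cost.
    def target_cost(target: dict, candidate: dict,
                    w_phone: float = 1.0, w_prominence: float = 0.5) -> float:
        cost = 0.0
        if target["phone"] != candidate["phone"]:
            cost += w_phone
        # Penalise candidates whose recorded prominence does not match the
        # prominence the classifier predicted for the target word.
        if target["prominent"] != candidate["prominent"]:
            cost += w_prominence
        return cost

    print(target_cost({"phone": "a", "prominent": True},
                      {"phone": "a", "prominent": False}))  # 0.5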

    Intonation Modelling for Speech Synthesis and Emphasis Preservation

    Speech-to-speech translation is a framework which recognises speech in an input language, translates it to a target language and synthesises speech in this target language. In such a system, variations in the speech signal which are inherent to natural human speech are lost as the information goes through the different building blocks of the translation process. The work presented in this thesis addresses aspects of speech synthesis which are lost in traditional speech-to-speech translation approaches. The main research axis of this thesis is the study of prosody for speech synthesis and emphasis preservation. A first investigation of regional accents of spoken French is carried out to understand the sensitivity of native listeners to accented speech synthesis. Listening tests show that standard adaptation methods for speech synthesis are not sufficient for listeners to perceive accentedness. On the other hand, combining adaptation with the original prosody allows perception of accents. Addressing the need for a more suitable prosody model, a physiologically plausible intonation model is proposed. Inspired by the command-response model, it has basic components which can be related to muscle responses to nerve impulses. These components are assumed to be a representation of muscle control of the vocal folds. A motivation for such a model is its theoretical language independence, based on the fact that humans share the same vocal apparatus. An automatic parameter extraction method which integrates a perceptually relevant measure is proposed with the model. This approach is evaluated and compared with the standard command-response model. Two corpora including sentences with emphasised words are presented, in the context of the SIWIS project. The first is a multilingual corpus with speech from multiple speakers; the second is a high-quality, speech-synthesis-oriented corpus from a professional speaker. Two broad uses of the model are evaluated. The first shows that it is difficult to predict model parameters; however, the second shows that parameters can be transferred in the context of emphasis synthesis. A relation between model parameters and linguistic features such as stress and accent is demonstrated. Similar observations are made between the parameters and emphasis. We then investigate the extraction of atoms in emphasised speech and their transfer to neutral speech, which turns out to elicit emphasis perception. Using clustering methods, this is extended to the emphasis of other words, using linguistic context. This approach is validated by listening tests in the case of English.
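
    The standard command-response model that the proposed intonation model takes as its starting point can be sketched as follows: log-F0 is a baseline plus critically damped second-order responses to phrase impulse commands and accent step commands. The command timings, amplitudes and time constants below are invented for illustration.

    # Classic command-response (Fujisaki-style) F0 contour; all values hypothetical.
    import numpy as np

    def phrase_response(t, alpha=2.0):
        """Impulse response of the phrase control mechanism."""
        return np.where(t >= 0, alpha**2 * t * np.exp(-alpha * t), 0.0)

    def accent_response(t, beta=20.0):
        """Step response of the accent control mechanism."""
        return np.where(t >= 0, 1 - (1 + beta * t) * np.exp(-beta * t), 0.0)

    t = np.linspace(0, 2, 400)                     # two seconds of contour
    log_f0 = np.log(120.0)                         # baseline frequency Fb = 120 Hz
    log_f0 = log_f0 + 0.8 * phrase_response(t)     # phrase command at t = 0
    log_f0 += 0.4 * (accent_response(t - 0.5)      # accent command on at 0.5 s
                     - accent_response(t - 0.9))   # ... and off at 0.9 s
    f0 = np.exp(log_f0)                            # F0 contour in Hz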