115 research outputs found

    Unit selection in a concatenative speech synthesis system using a large speech database

    One approach to generating natural-sounding synthesized speech waveforms is to select and concatenate units from a large speech database. Units (in the current work, phonemes) are selected to produce a natural realisation of a target phoneme sequence predicted from text annotated with prosodic and phonetic context information. We propose that the units in a synthesis database can be considered as a state transition network in which the state occupancy cost is the distance between a database unit and a target, and the transition cost is an estimate of the quality of concatenation of two consecutive units. This framework has many similarities to HMM-based speech recognition. A pruned Viterbi search is used to select the best units for synthesis from the database. This approach to waveform synthesis permits training from natural speech: two methods for training from speech are presented which provide weights that produce more natural speech than can be obtained by hand-tuning.
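The state-transition-network view of unit selection can be sketched as a small dynamic program. This is an illustrative sketch only, not the paper's implementation: `target_cost` and `concat_cost` stand in for the weighted distances the authors train, and a real system would prune each column rather than scan all predecessors.

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Viterbi search over candidate database units.

    targets: list of target specifications (one per phoneme)
    candidates: candidates[t] is the list of database units for target t
    target_cost(unit, target): state-occupancy cost (unit-to-target distance)
    concat_cost(prev_unit, unit): transition cost (estimated join quality)
    """
    T = len(targets)
    # best[t][j] = (cumulative cost, backpointer) for candidate j at step t
    best = [[(target_cost(u, targets[0]), -1) for u in candidates[0]]]
    for t in range(1, T):
        row = []
        for u in candidates[t]:
            tc = target_cost(u, targets[t])
            # pick the cheapest predecessor (a pruned search would keep
            # only the k best entries of the previous column)
            cost, arg = min(
                (best[t - 1][i][0] + concat_cost(p, u) + tc, i)
                for i, p in enumerate(candidates[t - 1])
            )
            row.append((cost, arg))
        best.append(row)
    # backtrace the lowest-cost path
    j = min(range(len(best[-1])), key=lambda i: best[-1][i][0])
    path = []
    for t in range(T - 1, -1, -1):
        path.append(candidates[t][j])
        j = best[t][j][1]
    return path[::-1]
```

With units represented as plain numbers and absolute difference as both costs, the search picks the candidate closest to each target while penalising large jumps between consecutive units.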

    Synthesis using speaker adaptation from speech recognition DB

    This paper deals with the creation of multiple voices from a Hidden Markov Model based speech synthesis system (HTS). More than 150 Catalan synthetic voices were built using Hidden Markov Models (HMM) and speaker adaptation techniques. Training data for building a Speaker-Independent (SI) model were selected from both a general-purpose speech synthesis database (FestCat) and a database designed for training Automatic Speech Recognition (ASR) systems (the Catalan SpeeCon database). The SpeeCon database was also used to adapt the SI model to different speakers. Using an ASR-designed database for TTS purposes provided many different amateur voices, with only a few minutes of recordings not made under studio conditions. This paper shows how speaker adaptation techniques provide the right tools to generate multiple voices with very little adaptation data. A subjective evaluation was carried out to assess the intelligibility and naturalness of the generated voices as well as the similarity of the adapted voices to both the original speaker and the average voice from the SI model.
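Speaker adaptation in HMM-based synthesis is commonly done with linear transforms of the average-voice Gaussian parameters (e.g. MLLR-family methods). The toy sketch below illustrates only the core idea of such a transform; the matrix, bias, and model sizes are made up and this is not the paper's HTS recipe, in which the transforms are estimated from the adaptation speaker's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# average-voice (speaker-independent) Gaussian means, one row per HMM state
si_means = rng.normal(size=(10, 3))   # 10 states, 3-dim features (toy sizes)

# an MLLR-style adaptation applies a shared affine transform mu' = A @ mu + b;
# here A and b are arbitrary stand-ins for estimated transform parameters
A = np.eye(3) * 1.05
b = np.array([0.2, -0.1, 0.0])

adapted_means = si_means @ A.T + b    # transform every state mean at once
```

Because one small transform reshapes every state in the model, a few minutes of adaptation speech can suffice, which matches the paper's observation that many voices can be built from short, non-studio recordings.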

    Using same-language machine translation to create alternative target sequences for text-to-speech synthesis

    Modern speech synthesis systems attempt to produce speech utterances from an open domain of words. In some situations, the synthesiser will not have the appropriate units to pronounce some words or phrases accurately, but it still must attempt to pronounce them. This paper presents a hybrid machine translation and unit selection speech synthesis system. The machine translation system was trained with English as both the source and target language. Rather than the synthesiser only saying the input text, as would happen in conventional synthesis systems, the synthesiser may say an alternative utterance with the same meaning. This method allows the synthesiser to overcome the problem of insufficient units at runtime.
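The selection step of such a hybrid system reduces to scoring each meaning-preserving alternative by how well the synthesiser can render it. The sketch below is an assumption-laden illustration: the paraphrases would come from the same-language MT system, and `synthesis_cost` stands in for the unit-selection cost, approximated here as a count of out-of-database words.

```python
def choose_utterance(paraphrases, synthesis_cost):
    """Pick the candidate target sequence the synthesiser can render best.

    paraphrases: meaning-preserving alternatives (including the input
        itself) proposed by a same-language MT system
    synthesis_cost: estimated unit-selection cost for a sentence; high
        when the database lacks good units for its words
    """
    return min(paraphrases, key=synthesis_cost)

# toy stand-in cost: number of words with no units in the database
database_words = {"please", "close", "the", "door"}
cost = lambda s: sum(w not in database_words for w in s.split())

best = choose_utterance(["please shut the door",
                         "please close the door"], cost)
```

Here "shut" has no database coverage, so the system prefers the paraphrase "please close the door", which it can synthesise entirely from good units.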

    Can older people remember medication reminders presented using synthetic speech?

    Reminders are often part of interventions to help older people adhere to complicated medication regimes. Computer-generated (synthetic) speech is ideal for tailoring reminders to different medication regimes. Since synthetic speech may be less intelligible than human speech, in particular under difficult listening conditions, we assessed how well older people can recall synthetic speech reminders for medications. 44 participants aged 50-80 with no cognitive impairment recalled reminders for one or four medications after a short distraction. We varied background noise, speech quality, and message design. Reminders were presented using a human voice and two synthetic voices. Data were analyzed using generalized linear mixed models. Reminder recall was satisfactory if reminders were restricted to one familiar medication, regardless of the voice used. Repeating medication names supported recall of lists of medications. We conclude that spoken reminders should build on familiar information and be integrated with other adherence support measures. © The Author 2014. Published by Oxford University Press on behalf of the American Medical Informatics Association.

    Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

    This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of 4.53, comparable to a MOS of 4.58 for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and F0 features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture. Accepted to ICASSP 2018.
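The compact intermediate representation at the heart of this pipeline, the mel spectrogram, is just STFT magnitudes passed through a mel filterbank and log-compressed. The following is a minimal NumPy sketch with toy parameter values (frame size, hop, and mel count are assumptions, not the paper's settings), not the feature extraction used by Tacotron 2.

```python
import numpy as np

def mel_spectrogram(wav, sr=16000, n_fft=512, hop=128, n_mels=40):
    """Toy mel-scale spectrogram: STFT magnitudes through a mel filterbank."""
    # frame the signal, window each frame, take magnitude spectra
    frames = [wav[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(wav) - n_fft, hop)]
    mag = np.abs(np.fft.rfft(np.array(frames), axis=1))   # (T, n_fft//2+1)

    # triangular mel filterbank (HTK-style mel scale)
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv(np.linspace(mel(0.0), mel(sr / 2.0), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    return np.log(mag @ fb.T + 1e-6)   # (T, n_mels), log-compressed
```

The appeal of this representation for the vocoder is visible in the shapes: a second of 16 kHz audio collapses to roughly a hundred frames of 40 coefficients, far easier to condition WaveNet on than raw linguistic features.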

    Phonetic minimization of a Belarusian-language text corpus for training a speech synthesis system

    Most modern speech synthesis systems are based on the corpus method. Unlike the previously popular compilation method, the corpus method uses a natural speech database that does not consist of separate, specially selected compilation elements but represents a corpus of phonograms of natural speech. Achieving high-quality synthesized speech with this approach requires large amounts of text and corresponding audio, which is a significant challenge for so-called under-resourced languages, including Belarusian. A common remedy is phonetic minimization: a special selection of texts in which the size of the text corpus is reduced as much as possible while its phonetic fullness is preserved. The article reviews the nature and operation of the corpus method of sound signal generation in speech synthesis systems and provides a detailed overview of approaches to building the text and speech corpora required for corpus-based speech generation. The second half of the work describes the elaborated algorithm for phonetic minimization of a Belarusian-language text corpus, as well as the technical and linguistic resources used to implement it. A description of the developed software prototype and of a series of phonetic minimization experiments is given to demonstrate the efficiency of the algorithm.
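Phonetic minimization is naturally framed as a set-cover problem, commonly attacked greedily: keep taking the sentence that contributes the most not-yet-covered phonetic units. This is a generic sketch of that idea, not the article's algorithm; the "units" here are adjacent character pairs standing in for real phonetic units such as diphones.

```python
def minimize_corpus(sentences, units_of):
    """Greedy set-cover text selection: repeatedly take the sentence that
    adds the most uncovered phonetic units, stopping when coverage stops
    growing. Shrinks the corpus while preserving phonetic fullness."""
    remaining = list(sentences)
    covered, selected = set(), []
    while remaining:
        best = max(remaining, key=lambda s: len(units_of(s) - covered))
        gain = units_of(best) - covered
        if not gain:
            break          # every reachable unit is already covered
        selected.append(best)
        covered |= gain
        remaining.remove(best)
    return selected

# toy "phonetic units": adjacent character pairs as a stand-in for diphones
diphones = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
corpus = ["abc", "abcd", "cde", "ab"]
picked = minimize_corpus(corpus, diphones)
```

On the toy corpus the greedy pass keeps only two of the four sentences yet still covers every unit, which is the trade-off the minimization aims for.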

    Developing a Text to Speech System for Dzongkha

    Text to Speech plays a vital role in imparting information to the general population who have difficulty reading text but can understand spoken language. In Bhutan, many people fall into this category with respect to the national language, Dzongkha, and a system of this kind will benefit the community. It will also advance the language's digital evolution and help narrow the digital gap, and it is even more important in helping people with visual impairment. Text to speech systems are widely used in everything from talking bots to news readers and announcement systems. This paper presents an attempt at developing a working model of a Text to Speech system for the Dzongkha language. It also presents the development of a transcription (grapheme) table for phonetic transcription from Dzongkha text to its equivalent phone set. The transcription tables for both consonants and vowels have been prepared in such a way that they facilitate better compatibility in computing. A total of 3000 sentences have been manually transcribed and recorded with a single male voice. The speech synthesis is based on a statistical method with concatenative speech generation on the FESTIVAL platform. The model is generated using the two variants CLUSTERGEN and CLUNITS of the FESTIVAL speech tools FESTVOX. The system prototype is the first of its kind for the Dzongkha language. Keywords: Natural Language Processing (NLP), Dzongkha, Text to Speech (TTS) system, Statistical speech synthesis, phoneme, corpus, transcription. DOI: 10.7176/CEIS/12-1-04 Publication date: January 31st 202
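A grapheme table of the kind described drives transcription by longest-match lookup over the text. The sketch below illustrates that mechanism only; the table entries are hypothetical placeholders, since the paper's actual Dzongkha consonant and vowel tables are not reproduced here.

```python
def transcribe(text, table):
    """Longest-match-first phonetic transcription using a grapheme table."""
    phones, i = [], 0
    max_len = max(map(len, table))          # longest grapheme in the table
    while i < len(text):
        # try the longest grapheme first so "kh" wins over "k" + "h"
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in table:
                phones.append(table[text[i:i + n]])
                i += n
                break
        else:
            i += 1                          # skip characters with no entry
    return phones

# hypothetical entries, for illustration only
table = {"kh": "kʰ", "k": "k", "a": "a", "ng": "ŋ"}
phones = transcribe("khang", table)
```

Trying multi-character graphemes before single characters is what lets digraphs like an aspirated consonant map to one phone instead of being split apart.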