
    Sequence to Sequence Neural Speech Synthesis with Prosody Modification Capabilities

    Modern sequence-to-sequence neural TTS systems provide close-to-natural speech quality. Such systems usually comprise a network converting a linguistic/phonetic feature sequence to an acoustic feature sequence, cascaded with a neural vocoder. The generated speech prosody (i.e., phoneme durations, pitch, and loudness) is implicitly present in the acoustic features, mixed with spectral information. Although the speech sounds natural, its prosodic realization is chosen arbitrarily by the model and cannot be easily altered. Prosody control becomes an even more difficult task if no prosodic labeling is present in the training data. Recently, much progress has been achieved in unsupervised speaking-style learning and generation; however, human inspection is still required after training to discover and interpret the speaking styles learned by the system. In this work we introduce a fully automatic method that makes the system aware of prosody and enables sentence-wise control of speaking pace and expressiveness on a continuous scale. While useful on its own in many applications, the proposed prosody control can also improve the overall quality and expressiveness of the synthesized speech, as demonstrated by subjective listening evaluations. We also propose a novel augmented attention mechanism that facilitates better pace-control sensitivity and faster attention convergence.
    Comment: published at the 10th ISCA Speech Synthesis Workshop (SSW-10), 2019.
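    To make the sentence-wise control concrete, here is a minimal, hypothetical sketch of one way such conditioning could be wired into a seq2seq encoder: two scalar controls (pace and expressiveness) are broadcast over the encoder timesteps and projected back into the encoder state. The class name, dimensions, and conditioning scheme are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch (not the paper's code): sentence-level prosody controls
# appended to every encoder timestep of a seq2seq TTS front end.
import torch
import torch.nn as nn

class ProsodyConditionedEncoder(nn.Module):
    """Phoneme encoder whose outputs are augmented with two sentence-wise
    prosody scalars (pace, expressiveness) broadcast over all timesteps."""

    def __init__(self, num_phonemes: int, emb_dim: int = 256, hidden: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(num_phonemes, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden, batch_first=True, bidirectional=True)
        # Project [encoder state ; pace ; expressiveness] back to the state size.
        self.proj = nn.Linear(2 * hidden + 2, 2 * hidden)

    def forward(self, phonemes: torch.Tensor, pace: torch.Tensor,
                expressiveness: torch.Tensor) -> torch.Tensor:
        # phonemes: (B, T) int ids; pace / expressiveness: (B,) floats.
        x, _ = self.rnn(self.embedding(phonemes))            # (B, T, 2H)
        controls = torch.stack([pace, expressiveness], -1)   # (B, 2)
        controls = controls.unsqueeze(1).expand(-1, x.size(1), -1)
        return self.proj(torch.cat([x, controls], dim=-1))   # (B, T, 2H)

enc = ProsodyConditionedEncoder(num_phonemes=80)
out = enc(torch.randint(0, 80, (2, 12)),
          pace=torch.tensor([0.8, 1.2]),
          expressiveness=torch.tensor([0.5, 1.5]))
print(out.shape)  # torch.Size([2, 12, 512])
```

    At synthesis time, sweeping the two scalars over a continuous range would then shift the decoder toward faster/slower or flatter/more expressive renditions, which is the kind of sentence-wise control the abstract describes.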

    Analysis of the Latent Prosody Space and Control of Speaking Styles in Finnish End-to-End Speech Synthesis

    In recent years, advances in deep learning have made it possible to develop neural speech synthesizers that not only generate near-natural speech but also enable control of its acoustic features. This means it is possible to synthesize expressive speech with different speaking styles that fit a given context. One way to achieve this control is by adding a reference encoder to the synthesizer that works as a bottleneck modeling a prosody-related latent space. The aim of this study was to analyze how the latent space of a reference encoder models diverse and realistic speaking styles, and how the acoustic features of encoded utterances correlate with their latent-space representations. Another aim was to analyze how the synthesizer output could be controlled in terms of speaking styles. The model used in the study was a Tacotron 2 speech synthesizer with a reference encoder, trained on read speech uttered in various styles by one female speaker. The latent space was analyzed with principal component analysis on the reference encoder outputs for all of the utterances, in order to extract the salient features that differentiate the styles. Based on the assumption that speaking styles have acoustic correlates, a possible connection between the principal components and measured acoustic features of the encoded utterances was investigated. For the synthesizer output, two evaluations were conducted: an objective evaluation assessing acoustic features and a subjective evaluation assessing the appropriateness of synthesized speech with regard to the uttered sentence. The results showed that the reference encoder modeled stylistic differences well, but the styles were complex, with major internal variation within each style. The principal component analysis disentangled the acoustic features somewhat, and a statistical analysis showed a correlation between the latent space and prosodic features. The objective evaluation suggested that the synthesizer did not reproduce all of the acoustic features of the styles, but the subjective evaluation showed that it did enough to affect judgments of appropriateness; i.e., speech synthesized in an informal style was judged more appropriate than formal-style speech for informal-style sentences, and vice versa.
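    The latent-space analysis lends itself to a compact illustration. Below is a minimal sketch, under the assumption that each utterance yields one fixed-size reference-encoder embedding: PCA extracts the dominant style dimensions, and each component is tested for a linear relation with a measured acoustic feature (mean F0 here). The arrays are synthetic stand-ins, not the thesis data.

```python
# Illustrative sketch, not the thesis code: PCA over reference-encoder
# embeddings, then per-component correlation with an acoustic measure.
import numpy as np
from scipy.stats import pearsonr
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
ref_embeddings = rng.normal(size=(500, 128))            # one vector per utterance
mean_f0 = rng.normal(loc=180.0, scale=25.0, size=500)   # Hz, per utterance

pca = PCA(n_components=8)
scores = pca.fit_transform(ref_embeddings)              # (500, 8) PC scores
print("explained variance:", pca.explained_variance_ratio_.round(3))

# Test each principal component for a linear relation with the acoustic feature.
for k in range(scores.shape[1]):
    r, p = pearsonr(scores[:, k], mean_f0)
    print(f"PC{k + 1}: r = {r:+.2f}, p = {p:.3f}")
```

    With real embeddings, components showing strong, significant correlations with F0, energy, or duration measures would be the natural candidates for style-control axes at synthesis time.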

    Synthesizing Dysarthric Speech Using Multi-Speaker TTS for Dysarthric Speech Recognition

    Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility due to slow, uncoordinated control of the speech production muscles. Automatic speech recognition (ASR) systems may help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available from dysarthric talkers. In this dissertation, we investigate dysarthric speech augmentation and synthesis methods. To better understand differences in the prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels, a comparative study between typical and dysarthric speech was conducted; these characteristics are important components for dysarthric speech modeling, synthesis, and augmentation. For augmentation, prosodic transformation and time-feature masking are proposed. For dysarthric speech synthesis, this dissertation introduces a modified neural multi-talker TTS with a dysarthria severity-level coefficient and a pause insertion model, so that dysarthric speech can be synthesized at varying severity levels. In addition, we extend this work with a label propagation technique that creates more meaningful control variables, such as a continuous Respiration, Laryngeal and Tongue (RLT) parameter, even for datasets that only provide discrete dysarthria severity labels. This increases the controllability of the system, enabling generation of dysarthric speech over a broader range. To evaluate the effectiveness of the synthesized speech as training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a 12.2% WER improvement over the baseline, and that adding the severity-level and pause-insertion controls decreases WER by 6.5%, demonstrating the effectiveness of these parameters. Overall, results on the TORGO database demonstrate that using synthetic dysarthric speech to increase the amount of dysarthric-patterned training speech has a significant impact on dysarthric ASR systems.
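    To illustrate one of the two controls, here is a minimal, hypothetical sketch of a severity-scaled pause-insertion step: pauses are inserted between phonemes with a probability that grows with a severity coefficient in [0, 1]. The dissertation describes a learned pause insertion model; the Bernoulli heuristic, function name, and rates below are illustrative assumptions only.

```python
# Hypothetical sketch (not the dissertation's model): severity-scaled
# pause insertion over a phoneme sequence before TTS synthesis.
import random

PAUSE = "<pause>"

def insert_pauses(phonemes, severity, base_rate=0.02, seed=0):
    """Insert pause tokens between phonemes with a probability that scales
    with dysarthria severity in [0, 1]; a learned model would replace this."""
    rng = random.Random(seed)
    out = []
    for ph in phonemes:
        out.append(ph)
        if rng.random() < base_rate * (1.0 + 9.0 * severity):
            out.append(PAUSE)
    return out

phones = ["HH", "AH", "L", "OW", "W", "ER", "L", "D"]
print(insert_pauses(phones, severity=0.1))  # mild: few inserted pauses
print(insert_pauses(phones, severity=0.9))  # severe: many inserted pauses
```

    In the full system, the same severity coefficient would also condition the acoustic model, so pause placement and voice quality degrade together as severity increases.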

    Prosody beyond pitch and emotion in speech and music: evidence from right hemisphere brain damage and congenital amusia

    This dissertation examines the relationship of prosodic processing in language and music from a new perspective, considering acoustic features that have not previously been studied within the parallel study of language and music. These features are argued to contribute to the effect of ‘expressiveness’, which is defined here as the combination of acoustic features (variation in duration, pitch, loudness, and articulation) that results in aesthetic appreciation of the linguistic and musical acoustic stream, and which is distinct from pitch, emotional, and pragmatic prosody, as well as from syntactic structure. The present investigation took a neuropsychological approach, comparing the performance of a right temporo-parietal stroke patient, IB; a congenitally amusic individual, BZ; and 24 control participants with and without musical training. Apart from the main focus on the perception of ‘expressiveness’, additional aspects of language and music perception were studied. A new battery was designed consisting of eight tasks: ‘speech prosody detection’, ‘expressive speech prosody’, ‘expressive music prosody’, ‘emotional speech prosody’, ‘emotional music prosody’, ‘speech pitch’, ‘speech rate’, and ‘music tempo’. These tasks addressed both theoretical and methodological issues in this comparative cognitive framework. IB’s performance on the expressive speech prosody task revealed a severe perceptual impairment, whereas his performance on the analogous music task examining ‘expressiveness’ was unimpaired. BZ also performed successfully on the same music task despite having been characterised as congenitally amusic by an earlier study. Musically untrained controls also performed successfully. The data from IB suggest that speech and music stimuli encompassing similar features are not necessarily processed by the same mechanisms. These results have further implications for approaches to the relationship of language and music within the study of cognitive deficits.