363 research outputs found

    Speech vocoding for laboratory phonology

    Get PDF
    Using phonological speech vocoding, we propose a platform for exploring relations between phonology and speech processing, and in broader terms, for exploring relations between the abstract and physical structures of a speech signal. Our goal is to make a step towards bridging phonology and speech processing and to contribute to the program of Laboratory Phonology. We show three application examples for laboratory phonology: compositional phonological speech modelling, a comparison of phonological systems and an experimental phonological parametric text-to-speech (TTS) system. The featural representations of the following three phonological systems are considered in this work: (i) Government Phonology (GP), (ii) the Sound Pattern of English (SPE), and (iii) the extended SPE (eSPE). Comparing GP- and eSPE-based vocoded speech, we conclude that the latter achieves slightly better results than the former. However, GP - the most compact phonological speech representation - performs comparably to the systems with a higher number of phonological features. The parametric TTS based on phonological speech representation, and trained from an unlabelled audiobook in an unsupervised manner, achieves intelligibility of 85% of the state-of-the-art parametric speech synthesis. We envision that the presented approach paves the way for researchers in both fields to form meaningful hypotheses that are explicitly testable using the concepts developed and exemplified in this paper. On the one hand, laboratory phonologists might test the applied concepts of their theoretical models, and on the other hand, the speech processing community may utilize the concepts developed for the theoretical phonological models for improvements of the current state-of-the-art applications

    Normal-to-Lombard Adaptation of Speech Synthesis Using Long Short-Term Memory Recurrent Neural Networks

    Get PDF
    In this article, three adaptation methods are compared based on how well they change the speaking style of a neural network based text-to-speech (TTS) voice. The speaking style conversion adopted here is from normal to Lombard speech. The selected adaptation methods are: auxiliary features (AF), learning hidden unit contribution (LHUC), and fine-tuning (FT). Furthermore, four state-of-the-art TTS vocoders are compared in the same context. The evaluated vocoders are: GlottHMM, GlottDNN, STRAIGHT, and pulse model in log-domain (PML). Objective and subjective evaluations were conducted to study the performance of both the adaptation methods and the vocoders. In the subjective evaluations, speaking style similarity and speech intelligibility were assessed. In addition to acoustic model adaptation, phoneme durations were also adapted from normal to Lombard with the FT adaptation method. In objective evaluations and speaking style similarity tests, we found that the FT method outperformed the other two adaptation methods. In speech intelligibility tests, we found that there were no significant differences between vocoders although the PML vocoder showed slightly better performance compared to the three other vocoders.Peer reviewe

    A Comparison Between STRAIGHT, Glottal, an Sinusoidal Vocoding in Statistical Parametric Speech Synthesis

    Get PDF
    Speech is a fundamental method of human communication that allows conveying information between people. Even though the linguistic content is commonly regarded as the main information in speech, the signal contains a richness of other information, such as prosodic cues that shape the intended meaning of a sentence. This information is largely generated by quasi-periodic glottal excitation, which is the acoustic speech excitation airflow originating from the lungs that makes the vocal folds oscillate in the production of voiced speech. By regulating the sub-glottal pressure and the tension of the vocal folds, humans learn to affect the characteristics of the glottal excitation in order to signal the emotional state of the speaker for example. Glottal inverse filtering (GIF) is an estimation method for the glottal excitation of a recorded speech signal. Various cues about the speech signal, such as the mode of phonation, can be detected and analyzed from an estimate of the glottal flow, both instantaneously and as a function of time. Aside from its use in fundamental speech research, such as phonetics, the recent advances in GIF and machine learning enable a wider variety of GIF applications, such as emotional speech synthesis and the detection of paralinguistic information. However, GIF is a difficult inverse problem where the target algorithm output is generally unattainable with direct measurements. Thus the algorithms and their evaluation need to rely on some prior assumptions about the properties of the speech signal. A common thread utilized in most of the studies in this thesis is the estimation of the vocal tract transfer function (the key problem in GIF) by temporally weighting the optimization criterion in GIF so that the effect of the main excitation peak is attenuated. This thesis studies GIF from various perspectives---including the development of two new GIF methods that improve GIF performance over the state-of-the-art methods---and furthers basic research in the automated estimation of glottal excitation. The estimation of the GIF-based vocal tract transfer function for formant tracking and perceptually weighted speech envelope estimation is also studied. The central speech technology application of GIF addressed in the thesis is the use of GIF-based spectral envelope models and glottal excitation waveforms as target training data for the generative neural network models used in statistical parametric speech synthesis. The obtained results show that even though the presented studies provide improvements to the previous methodology for all voice types, GIF-based speech processing continues to mainly benefit male voices in speech synthesis applications.Puhe on olennainen osa ihmistenvälistä informaation siirtoa. Vaikka kielellistä sisältöä pidetään yleisesti puheen tärkeimpänä ominaisuutena, puhesignaali sisältää myös runsaasti muuta informaatiota kuten prosodisia vihjeitä, jotka muokkaavat siirrettävän informaation merkitystä. Tämä informaatio tuotetaan suurilta osin näennäisjaksollisella glottisherätteellä, joka on puheen herätteenä toimiva akustinen virtaussignaali. Säätämällä äänihuulten alapuolista painetta ja äänihuulten kireyttä ihmiset muuttavat glottisherätteen ominaisuuksia viestittääkseen esimerkiksi tunnetilaa. Glottaalinen käänteissuodatus (GKS) on laskennallinen menetelmä glottisherätteen estimointiin nauhoitetusta puhesignaalista. Glottisherätteen perusteella puheen laadusta voidaan tunnistaa useita piirteitä kuten ääntötapa, sekä hetkellisesti että ajan funktiona. Puheen perustutkimuksen, kuten fonetiikan, lisäksi viimeaikaiset edistykset GKS:ssä ja koneoppimisessa ovat avaamassa mahdollisuuksia laajempaan GKS:n soveltamiseen puheteknologiassa, kuten puhesynteesissä ja puheen biopiirteistämisessä paralingvistisiä sovelluksia varten. Haasteena on kuitenkin se, että GKS on vaikea käänteisongelma, jossa todellista puhetta vastaavan glottisherätteen suora mittaus on mahdotonta. Tästä johtuen GKS:ssä käytettävien algoritmien kehitystyö ja arviointi perustuu etukäteisoletuksiin puhesignaalin ominaisuuksista. Tässä väitöskirjassa esitetyissä menetelmissä on yhteisenä oletuksena se, että ääntöväylän siirtofunktio voidaan arvioida (joka on GKS:n pääongelma) aikapainottamalla GKS:n optimointikriteeriä niin, että glottisherätteen pääeksitaatiopiikkin vaikutus vaimenee. Tässä väitöskirjassa GKS:ta tutkitaan useasta eri näkökulmasta, jotka sisältävät kaksi uutta GKS-menetelmää, jotka parantavat arviointituloksia aikaisempiin menetelmiin verrattuna, sekä perustutkimusta käänteissuodatusprosessin automatisointiin liittyen. Lisäksi GKS-pohjaista ääntöväylän siirtofunktiota käytetään formanttiestimoinnissa sekä kuulohavaintopainotettuna versiona puheen spektrin verhokäyrän arvioinnissa. Tämän väitöskirjan keskeisin puheteknologiasovellus on GKS-pohjaisten puheen spektrin verhokäyrämallien sekä glottisheräteaaltomuotojen käyttö kohdedatana neuroverkkomalleille tilastollisessa parametrisessa puhesynteesissä. Saatujen tulosten perusteella kehitetyt menetelmät parantavat GKS-pohjaisten menetelmien laatua kaikilla äänityypeillä, mutta puhesynteesisovelluksissa GKS-pohjaiset ratkaisut hyödyttävät edelleen lähinnä matalia miesääniä
    corecore