196 research outputs found

    HMM-Based Speech Synthesis Utilizing Glottal Inverse Filtering


    Aspiration noise during phonation: synthesis, analysis, and pitch-scale modification

    Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006. Includes bibliographical references (p. 139-145).
    The current study investigates the synthesis and analysis of aspiration noise in synthesized and spoken vowels. Based on the linear source-filter model of speech production, we implement a vowel synthesizer in which the aspiration noise source is temporally modulated by the periodic source waveform. Modulations in the noise source waveform, and their synchrony with the periodic source, are shown to be salient for natural-sounding vowel synthesis. After developing the synthesis framework, we review past approaches to separating the two additive components of the model. A challenge for analysis based on this model is accurate estimation of the aspiration noise component, which contains energy across the frequency spectrum and temporal characteristics due to modulations in the noise source. Spectral harmonic/noise component analysis of spoken vowels shows evidence of noise modulations, with peaks in the estimated noise source component synchronous both with the open phase of the periodic source and with time instants of glottal closure. Inspired by this observation of natural modulations in the aspiration noise source, we develop an alternate approach to accurate pitch-scale modification. The proposed strategy takes a dual processing approach in which the periodic and noise components of the speech signal are separately analyzed, modified, and re-synthesized. The periodic component is modified using our implementation of time-domain pitch-synchronous overlap-add, and the noise component is handled by modifying characteristics of its source waveform. Since we have modeled an inherent coupling between the original periodic and aspiration noise sources, the modification algorithm is designed to preserve the synchrony between temporal modulations of the two sources. The reconstructed modified signal is perceived to be natural-sounding and generally reduces artifacts that are typically heard in current modification techniques.
    by Daryush Mehta. S.M.
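The modulated-noise source-filter scheme this abstract describes can be sketched in a few lines. The raised-cosine modulation shape, the mixing gain, and the formant/bandwidth values below are illustrative assumptions (textbook values for the vowel /a/), not the thesis's actual implementation:

```python
import numpy as np
from scipy.signal import lfilter

FS = 16000   # sample rate (Hz)
F0 = 120.0   # fundamental frequency (Hz)
DUR = 0.5    # duration (s)

t = np.arange(int(FS * DUR)) / FS

# Periodic source: an impulse train marking glottal closure instants.
period = int(FS / F0)
periodic = np.zeros_like(t)
periodic[::period] = 1.0

# Aspiration noise source: white noise amplitude-modulated in synchrony
# with the glottal cycle (the synchrony is the salient property here).
modulation = 0.5 * (1.0 + np.cos(2.0 * np.pi * F0 * t))
noise = 0.02 * modulation * np.random.randn(len(t))

source = periodic + noise  # coupled periodic + noise excitation

def formant_filter(x, formants, bandwidths, fs):
    """Cascade of second-order resonators approximating the vocal tract."""
    y = x
    for f, bw in zip(formants, bandwidths):
        r = np.exp(-np.pi * bw / fs)          # pole radius from bandwidth
        theta = 2.0 * np.pi * f / fs          # pole angle from formant
        y = lfilter([1.0 - r], [1.0, -2.0 * r * np.cos(theta), r * r], y)
    return y

vowel = formant_filter(source, [700, 1220, 2600], [130, 70, 160], FS)
```

The key design point mirrored from the abstract is that the noise envelope is driven by the same F0 as the impulse train, so the two sources stay synchronous by construction.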

    Software Tools and Analysis Methods for the Use of Electromagnetic Articulography Data in Speech Research

    Recent work with Electromagnetic Articulography (EMA) has shown it to be an excellent tool for characterizing speech kinematics. By tracking the position and orientation of sensors placed on the jaw, lips, teeth, and tongue as they move in an electromagnetic field, information about the movement and coordination of the articulators can be obtained with high temporal resolution. This technique has far-reaching applications for advancing fields related to speech articulation, including recognition, synthesis, motor learning, and clinical assessment. As more EMA data become widely available, a growing need exists for software that performs basic processing and analysis functions. The objective of this work is to create and demonstrate the use of new software tools that make full use of the information provided in EMA datasets, with the goal of maximizing the impact of EMA research. A new method for biteplate-correcting orientation data is presented, allowing orientation data to be used for articulatory analysis. Two example applications using orientation data are presented: a tool for jaw-angle measurement using a single EMA sensor, and a tongue interpolation tool based on three EMA sensors attached to the tongue. The results demonstrate that combined position and orientation data give a more complete picture of articulation than position data alone, and that orientation data should be incorporated in future work with EMA. A new standalone, GUI-based software tool is also presented for visualization of EMA data. It includes simultaneous real-time playback of kinematic and acoustic data, as well as basic analysis capabilities for both types of data. A comparison of the visualization tool to existing EMA software shows that it provides superior visualization and comparable analysis features. The tool will be included with the Marquette University EMA-MAE database to aid researchers working with this dataset.
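To make the jaw-angle idea concrete: once a sensor's orientation data are biteplate-corrected into a common reference frame, a jaw angle can be read off as the angle between the sensor's direction vector and a reference direction (e.g. the occlusal plane). The function below is an illustrative geometric sketch under that assumption, not the actual API of the software described above:

```python
import numpy as np

def jaw_angle_deg(orientation, reference):
    """Angle (degrees) between a jaw sensor's orientation vector and a
    reference direction defining 0 degrees.

    orientation: (N, 3) array of per-frame direction vectors.
    reference:   (3,) direction vector (e.g. occlusal-plane normal).
    Illustrative sketch only; real EMA pipelines must first apply
    head-movement and biteplate corrections.
    """
    orientation = np.asarray(orientation, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Normalise defensively so arccos stays within its domain.
    o = orientation / np.linalg.norm(orientation, axis=1, keepdims=True)
    r = reference / np.linalg.norm(reference)
    cosang = np.clip(o @ r, -1.0, 1.0)
    return np.degrees(np.arccos(cosang))

# Example: a sensor rotating from aligned to 30 degrees off the reference.
frames = np.array([[0.0, 0.0, 1.0],
                   [0.0, np.sin(np.radians(30)), np.cos(np.radians(30))]])
angles = jaw_angle_deg(frames, [0.0, 0.0, 1.0])
```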

    Parameterization of a computational physical model for glottal flow using inverse filtering and high-speed videoendoscopy

    High-speed videoendoscopy, glottal inverse filtering, and physical modeling can be used to obtain complementary information about speech production. In this study, the three methodologies are combined to pursue a better understanding of the relationship between the glottal air flow and the glottal area. Simultaneously acquired high-speed video and glottal inverse filtering data from three male and three female speakers were used. Significant correlations were found between the quasi-open and quasi-speed quotients of the glottal area (extracted from the high-speed videos) and of the glottal flow (estimated using glottal inverse filtering), but only the quasi-open quotient relationship could be represented as a linear model. A simple physical glottal flow model with three different glottal geometries was optimized to match the data. The results indicate that glottal flow skewing can be modeled using an inertial vocal/subglottal tract load, and that the estimated inertia within the glottis is sensitive to the quality of the data. Parameter optimisation also appears to favour combining the simplest glottal geometry with viscous losses, and the more complex glottal geometries with entrance/exit effects in the glottis.
    Peer reviewed
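The quasi-open quotient compared across modalities above can be computed per glottal cycle from either the area or the flow waveform. The 50% peak-to-peak threshold below is a common working definition; exact conventions vary across studies, and this sketch is not the paper's implementation:

```python
import numpy as np

def quasi_open_quotient(cycle, level=0.5):
    """Quasi-open quotient of one glottal cycle: the fraction of the
    period during which the waveform exceeds `level` times its
    peak-to-peak amplitude above the cycle minimum."""
    cycle = np.asarray(cycle, dtype=float)
    threshold = cycle.min() + level * (cycle.max() - cycle.min())
    return float(np.mean(cycle > threshold))

# Example: a half-rectified sinusoid as a crude glottal-area cycle.
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
area = np.maximum(0.0, np.sin(2.0 * np.pi * t))
qoq = quasi_open_quotient(area)  # sin exceeds 0.5 for one third of the cycle
```

The same function applies unchanged to an inverse-filtered flow cycle, which is what makes the area/flow quotient comparison in the study straightforward.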

    Deep throat as a source of information

    Heldner M, Wagner P, Włodarczak M. Deep throat as a source of information. In: Abelin Å, Nagano-Madsen Y, eds. Proceedings FONETIK 2018. Göteborg: University of Gothenburg, Department of Languages and Literatures, Department of Philosophy, Linguistics and Theory of Science; 2018.

    A Comparison Between STRAIGHT, Glottal, and Sinusoidal Vocoding in Statistical Parametric Speech Synthesis

    Speech is a fundamental method of human communication that allows conveying information between people. Even though the linguistic content is commonly regarded as the main information in speech, the signal contains a richness of other information, such as prosodic cues that shape the intended meaning of a sentence. This information is largely generated by the quasi-periodic glottal excitation, the acoustic excitation of voiced speech produced when airflow from the lungs sets the vocal folds into oscillation. By regulating the subglottal pressure and the tension of the vocal folds, humans learn to shape the characteristics of the glottal excitation, for example to signal their emotional state. Glottal inverse filtering (GIF) is an estimation method for the glottal excitation of a recorded speech signal. Various cues about the speech signal, such as the mode of phonation, can be detected and analyzed from an estimate of the glottal flow, both instantaneously and as a function of time. Aside from its use in fundamental speech research, such as phonetics, recent advances in GIF and machine learning enable a wider variety of GIF applications, such as emotional speech synthesis and the detection of paralinguistic information. However, GIF is a difficult inverse problem in which the target algorithm output is generally unattainable with direct measurements, so the algorithms and their evaluation must rely on prior assumptions about the properties of the speech signal. A common thread in most of the studies in this thesis is the estimation of the vocal tract transfer function (the key problem in GIF) by temporally weighting the optimization criterion in GIF so that the effect of the main excitation peak is attenuated.
This thesis studies GIF from various perspectives, including the development of two new GIF methods that improve performance over state-of-the-art methods, and furthers basic research in the automated estimation of the glottal excitation. The estimation of the GIF-based vocal tract transfer function for formant tracking and for perceptually weighted speech envelope estimation is also studied. The central speech technology application of GIF addressed in the thesis is the use of GIF-based spectral envelope models and glottal excitation waveforms as target training data for the generative neural network models used in statistical parametric speech synthesis. The obtained results show that even though the presented studies improve on previous methodology for all voice types, GIF-based speech processing continues to mainly benefit male voices in speech synthesis applications.
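As a rough illustration of the inverse problem described above, a minimal LPC-based inverse filter can be sketched in a few lines. Established GIF methods (e.g. IAIF, or the temporally weighted optimization this thesis studies) refine every step of this pipeline; the pre-emphasis coefficient, model order rule, and integrator pole below are illustrative assumptions:

```python
import numpy as np
from scipy.signal import lfilter

def lpc(x, order):
    """Linear-prediction coefficients via the autocorrelation method
    (Levinson-Durbin recursion).  Returns [1, a1, ..., a_order]."""
    r = np.correlate(x, x, mode="full")[len(x) - 1 : len(x) + order]
    a = np.array([1.0])
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:], r[i - 1 : 0 : -1])
        k = -acc / err                       # reflection coefficient
        a_ext = np.concatenate([a, [0.0]])
        a = a_ext + k * a_ext[::-1]          # Levinson order update
        err *= 1.0 - k * k
    return a

def inverse_filter(speech, fs, order=None):
    """Crude glottal inverse filtering: fit an all-pole vocal tract
    model by LPC on pre-emphasised, windowed speech, cancel it from the
    signal, then integrate to undo lip radiation."""
    if order is None:
        order = int(fs / 1000) + 2                   # rule-of-thumb order
    pre = lfilter([1.0, -0.97], [1.0], speech)       # pre-emphasis
    a = lpc(pre * np.hanning(len(pre)), order)       # vocal tract estimate
    residual = lfilter(a, [1.0], speech)             # remove vocal tract
    return lfilter([1.0], [1.0, -0.99], residual)    # leaky integrator

rng = np.random.default_rng(0)
flow = inverse_filter(rng.standard_normal(800), 8000)
```

The temporal-weighting idea in the thesis addresses a weakness visible even in this sketch: plain LPC is dominated by the large prediction errors around the main excitation peaks, which biases the vocal tract estimate.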

    Vocal Imitation in Sensorimotor Learning Models: a Comparative Review

    Sensorimotor learning represents a challenging problem for natural and artificial systems. Several computational models have been proposed to explain the neural and cognitive mechanisms at play in the brain. In general, these models can be decomposed into three common components: a sensory system, a motor control device, and a learning framework. The latter includes the architecture, the learning rule or optimisation method, and the exploration strategy used to guide learning. In this review, we focus on imitative vocal learning, which is exemplified by song learning in birds and speech acquisition in humans. We aim to synthesise, analyse, and compare the various models of vocal learning that have been proposed, highlighting their common points and differences. We first introduce the biological context, including the behavioural and physiological hallmarks of vocal learning, and sketch the neural circuits involved. Then, we detail the different components of a vocal learning model and how they are implemented in the reviewed models.
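The three-component decomposition above can be made concrete with a toy imitation loop. Every class, feature extractor, and update rule here is an illustrative placeholder standing in for the far richer mechanisms the review compares, not any specific reviewed model:

```python
import numpy as np

class SensorySystem:
    """Placeholder auditory front end: log band energies of a sound."""
    def perceive(self, sound):
        spectrum = np.abs(np.fft.rfft(sound))
        bands = np.array_split(spectrum, 8)
        return np.log1p(np.array([b.sum() for b in bands]))

class MotorControl:
    """Placeholder vocal apparatus: motor parameters weight sinusoids."""
    def produce(self, params, n=256):
        t = np.arange(n)
        freqs = np.linspace(0.01, 0.2, len(params))
        return (params[:, None] * np.sin(2 * np.pi * freqs[:, None] * t)).sum(0)

class ImitationLearner:
    """Random-exploration learner: keep a motor perturbation whenever it
    brings the produced sound's features closer to the tutor's."""
    def __init__(self, dim=8, step=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.params = self.rng.normal(size=dim)
        self.step = step

    def learn(self, tutor_features, sensory, motor, iters=200):
        best = np.inf
        for _ in range(iters):
            trial = self.params + self.step * self.rng.normal(size=self.params.shape)
            err = float(np.linalg.norm(sensory.perceive(motor.produce(trial))
                                       - tutor_features))
            if err < best:                 # greedy acceptance = exploration strategy
                best, self.params = err, trial
        return best

sensory, motor = SensorySystem(), MotorControl()
tutor_features = sensory.perceive(motor.produce(np.ones(8)))
final_error = ImitationLearner().learn(tutor_features, sensory, motor)
```

The point of the sketch is structural: swapping any one class (a cochlear model for `SensorySystem`, an articulatory synthesizer for `MotorControl`, reinforcement learning for the greedy search) changes the model along exactly the axes the review uses for comparison.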