594 research outputs found

    Automatic Detection of Laryngeal Pathology on Sustained Vowels Using Short-Term Cepstral Parameters: Analysis of Performance and Theoretical Justification

    The majority of speech signal analysis procedures for automatic detection of laryngeal pathologies rely mainly on parameters extracted in the time domain. Moreover, calculating these parameters often requires prior estimation of the pitch period, so their validity depends heavily on the robustness of pitch detection. This paper presents an alternative approach based on cepstral-domain processing, which has the advantage of not requiring pitch estimation and therefore gains in both simplicity and robustness. While the proposed scheme is similar to solutions based on Mel-frequency cepstral parameters already reported in the literature, it has a more straightforward physical interpretation while achieving similar performance.
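
    As a rough illustration of the idea, the sketch below computes short-term real cepstra of a sustained vowel and summarises the dominant rahmonic peak across frames; the peak is located simply by a maximum over a plausible pitch-period quefrency band, so no explicit pitch estimate is needed. The frame length, quefrency range and summary statistics are illustrative assumptions, not the parameters used in the paper.

        # Short-term real cepstra of a sustained vowel and a simple rahmonic-peak
        # summary; frame length, quefrency band and statistics are assumptions.
        import numpy as np

        def short_term_cepstra(x, fs, frame_len=0.04, hop=0.02):
            """Real cepstrum of overlapping Hamming-windowed frames (one per row)."""
            x = np.asarray(x, float)
            n, h = int(frame_len * fs), int(hop * fs)
            win = np.hamming(n)
            frames = [x[i:i + n] * win for i in range(0, len(x) - n, h)]
            log_mag = np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-12)
            return np.fft.irfft(log_mag, axis=1)

        def rahmonic_peak_stats(ceps, fs, f0_range=(60.0, 400.0)):
            """Mean and spread of the dominant rahmonic peak across frames.

            The peak is located by a maximum over the quefrency band of plausible
            pitch periods, so no prior pitch estimate is needed.
            """
            q_lo, q_hi = int(fs / f0_range[1]), int(fs / f0_range[0])
            peaks = ceps[:, q_lo:q_hi].max(axis=1)
            return peaks.mean(), peaks.std()

        # Usage (x: samples of a sustained vowel, fs: sampling rate in Hz):
        # mean_peak, sd_peak = rahmonic_peak_stats(short_term_cepstra(x, fs), fs)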

    Spectral estimation and significance of glottal-pulse parameters

    Perceptual aspects of voice-source parameters

    xii + 114 pages; 24 cm

    Cepstral peak prominence: a comprehensive analysis

    An analytical study of cepstral peak prominence (CPP) is presented, intended to provide insight into its meaning and its relation to voice perturbation parameters. To carry out this analysis, a parametric approach is adopted in which voice production is modelled with the traditional source-filter model and the first cepstral peak is assumed to have a Gaussian shape. It is concluded that the meaning of CPP is very similar to that of the first rahmonic, and some insights are provided on its dependence on fundamental frequency and vocal tract resonances. It is further shown that CPP integrates measures of voice waveform and periodicity perturbations, whether of amplitude, frequency or noise.
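
    For reference, a minimal sketch of a common way to compute CPP for a single frame is given below: a dB cepstrum is obtained from a Hamming-windowed frame, the first rahmonic peak is located within a plausible pitch-period range, and its height above a linear regression line fitted over quefrency is returned. The window, search range and regression span are assumptions in the spirit of the usual Hillenbrand-style definition, not the exact formulation analysed in the paper.

        # One frame of cepstral peak prominence (CPP); window, search range and
        # regression span are assumptions in the spirit of the usual definition.
        import numpy as np

        def cpp(frame, fs, f0_range=(60.0, 400.0)):
            frame = np.asarray(frame, float) * np.hamming(len(frame))
            spectrum_db = 20.0 * np.log10(np.abs(np.fft.rfft(frame)) + 1e-12)
            ceps = np.fft.irfft(spectrum_db)                  # "dB cepstrum"

            # Locate the first rahmonic peak in the band of plausible pitch periods.
            q_lo, q_hi = int(fs / f0_range[1]), int(fs / f0_range[0])
            peak_idx = q_lo + int(np.argmax(ceps[q_lo:q_hi]))

            # Regression line of cepstrum vs. quefrency over the same band
            # (a simplification; implementations often fit over a wider range).
            q = np.arange(len(ceps))
            slope, intercept = np.polyfit(q[q_lo:q_hi], ceps[q_lo:q_hi], 1)
            return ceps[peak_idx] - (slope * peak_idx + intercept)   # CPP (dB-like)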

    Relevance of the glottal pulse and the vocal tract in gender detection

    Gender detection is an important step for improving efficiency in tasks such as speech or speaker recognition, among others. Traditionally, gender detection has focused on the fundamental frequency (f0) and on cepstral features derived from voiced segments of speech. The methodology presented here consists of obtaining uncorrelated glottal and vocal tract components, which are parameterized as mel-frequency coefficients. K-fold cross-validation using QDA and GMM classifiers showed that better detection rates are reached when glottal-source and vocal-tract parameters are used, on a gender-balanced database of running speech from 340 speakers.
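
    The sketch below illustrates only the classification stage, assuming a feature matrix X that already holds mel-cepstral coefficients computed separately from the glottal source and the vocal tract (their separation, e.g. by glottal inverse filtering, is not shown) and a label vector y with the gender classes; it runs stratified k-fold cross-validation of a QDA classifier with scikit-learn. A per-class GMM classifier could be evaluated with the same splits.

        # Classification stage only: stratified k-fold cross-validation of a QDA
        # gender classifier. X and y are assumed inputs (glottal + vocal tract
        # mel-cepstral features and gender labels); their extraction is not shown.
        from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
        from sklearn.model_selection import StratifiedKFold, cross_val_score

        def gender_detection_cv(X, y, n_splits=5):
            """Return per-fold detection accuracy for a QDA gender classifier."""
            cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
            return cross_val_score(QuadraticDiscriminantAnalysis(), X, y, cv=cv)

        # Usage:
        # scores = gender_detection_cv(X, y)
        # print(f"mean accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")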

    Is nonlinear propagation responsible for the brassiness of elephant trumpet calls?

    African elephants (Loxodonta africana) produce a broad diversity of sounds ranging from infrasonic rumbles to much higher frequency trumpets. Trumpet calls are very loud voiced signals given by highly aroused elephants, and appear to be produced by a forceful expulsion of air through the trunk. Some trumpet calls have a very distinctive quality that is unique in the animal kingdom and resembles the "brassy" sounds that can be produced with brass musical instruments such as trumpets or trombones. Brassy musical sounds are characterised by a flat spectral slope caused by the nonlinear propagation of the source wave as it travels through the long bore of the instrument. The extent of this phenomenon, which normally occurs at high intensity levels (e.g. fortissimo), depends on the fundamental frequency (F0) of the source as well as on the length of the resonating tube. Interestingly, the length of the vocal tract of the elephant (as measured from the vocal folds to the end of the trunk) approximates the critical length for shockwave formation, given the fundamental frequency and intensity of trumpet calls. We suggest that this phenomenon could explain the unique, distinctive brassy quality of elephant trumpet calls.
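
    The underlying argument can be made concrete with the standard weakly nonlinear estimate of the shock formation distance for an initially sinusoidal plane wave, x_s = rho * c^3 / (beta * omega * p0), with beta = (gamma + 1)/2 for air. The numbers in the sketch below (fundamental frequency, internal pressure amplitude, tract-plus-trunk length) are illustrative assumptions rather than values reported in the paper.

        # Back-of-the-envelope estimate of the shock formation distance for an
        # initially sinusoidal plane wave; all numerical values are assumptions.
        import math

        rho, c, gamma = 1.2, 350.0, 1.4     # air density (kg/m^3), sound speed (m/s), heat-capacity ratio
        beta = (gamma + 1.0) / 2.0          # coefficient of nonlinearity for air (~1.2)

        f0 = 400.0                          # assumed trumpet-call fundamental (Hz)
        p0 = 5.0e3                          # assumed pressure amplitude inside the tract (Pa)
        tract_length = 2.0                  # assumed vocal-folds-to-trunk-tip length (m)

        x_s = rho * c**3 / (beta * 2.0 * math.pi * f0 * p0)
        print(f"shock formation distance ~ {x_s:.1f} m vs. tract length {tract_length} m")
        # If x_s is comparable to the tract length, the wave steepens appreciably
        # before it radiates, flattening the spectral slope ("brassy" timbre).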

    Models and Analysis of Vocal Emissions for Biomedical Applications

    The International Workshop on Models and Analysis of Vocal Emissions for Biomedical Applications (MAVEBA) was established in 1999 out of a strongly felt need to share know-how, objectives and results between areas that until then had seemed quite distinct, such as bioengineering, medicine and singing. MAVEBA deals with all aspects of the study of the human voice, with applications ranging from the neonate to the adult and the elderly. Over the years the initial topics have grown and spread into other areas of research, such as occupational voice disorders, neurology, rehabilitation, and image and video analysis. MAVEBA takes place every two years in Firenze, Italy.

    A Comparison Between STRAIGHT, Glottal, and Sinusoidal Vocoding in Statistical Parametric Speech Synthesis

    Speech is a fundamental method of human communication for conveying information between people. Even though the linguistic content is commonly regarded as the main information in speech, the signal contains a wealth of other information, such as prosodic cues that shape the intended meaning of a sentence. This information is largely generated by the quasi-periodic glottal excitation, the acoustic excitation airflow originating from the lungs that makes the vocal folds oscillate in the production of voiced speech. By regulating the sub-glottal pressure and the tension of the vocal folds, humans learn to affect the characteristics of the glottal excitation in order to signal, for example, the emotional state of the speaker. Glottal inverse filtering (GIF) is an estimation method for the glottal excitation of a recorded speech signal. Various cues about the speech signal, such as the mode of phonation, can be detected and analyzed from an estimate of the glottal flow, both instantaneously and as a function of time. Aside from its use in fundamental speech research, such as phonetics, recent advances in GIF and machine learning enable a wider variety of GIF applications, such as emotional speech synthesis and the detection of paralinguistic information.

    However, GIF is a difficult inverse problem in which the target algorithm output is generally unattainable with direct measurements, so the algorithms and their evaluation need to rely on prior assumptions about the properties of the speech signal. A common thread in most of the studies in this thesis is the estimation of the vocal tract transfer function (the key problem in GIF) by temporally weighting the optimization criterion in GIF so that the effect of the main excitation peak is attenuated.

    This thesis studies GIF from various perspectives, including the development of two new GIF methods that improve performance over state-of-the-art methods, and furthers basic research in the automated estimation of the glottal excitation. The estimation of the GIF-based vocal tract transfer function for formant tracking and for perceptually weighted speech envelope estimation is also studied. The central speech technology application of GIF addressed in the thesis is the use of GIF-based spectral envelope models and glottal excitation waveforms as target training data for the generative neural network models used in statistical parametric speech synthesis. The obtained results show that even though the presented studies improve on the previous methodology for all voice types, GIF-based speech processing continues to benefit mainly male voices in speech synthesis applications.
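
    A minimal sketch of the general idea, under a simple assumed weighting function and ignoring details such as pre-emphasis and lip-radiation compensation used by actual GIF methods: the linear prediction error criterion is temporally weighted so that samples around the main glottal excitation peaks contribute less to the vocal tract estimate, and the speech is then inverse-filtered with the resulting prediction-error filter.

        # Minimal sketch of temporally weighted linear prediction followed by
        # inverse filtering (names and weighting scheme are illustrative
        # assumptions; actual GIF methods also handle pre-emphasis and
        # lip-radiation compensation, which are omitted here).
        import numpy as np

        def weighted_lpc(s, order, w):
            """LP coefficients a (a[0] = 1) minimizing sum_n w[n] * e[n]**2."""
            s, w = np.asarray(s, float), np.asarray(w, float)
            n = len(s)
            # Past-sample matrix: column k holds the signal delayed by k + 1 samples.
            D = np.column_stack([np.concatenate((np.zeros(k + 1), s[:n - k - 1]))
                                 for k in range(order)])
            R = D.T @ (w[:, None] * D)            # weighted covariance of past samples
            r = D.T @ (w * s)                     # weighted cross-correlation with s[n]
            a = np.linalg.solve(R + 1e-9 * np.eye(order), -r)
            return np.concatenate(([1.0], a))     # prediction-error filter A(z)

        def inverse_filter(s, w, order=20):
            """Rough glottal-excitation estimate: speech filtered by A(z)."""
            a = weighted_lpc(s, order, w)
            return np.convolve(s, a)[:len(s)]

        # The weight w would typically stay near 1 and dip to a small value in a
        # short region after each detected glottal closure instant, e.g.:
        # w = np.ones(len(s)); w[gci_region] = 0.05   # gci_region: hypothetical indices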