    Meta Learning Approach to Phone Duration Modeling

    One of the essential prerequisites for natural-sounding synthesized speech is the automatic prediction of phone duration, because segmental duration is highly important in speech perception. In this paper we present a new phone duration prediction model for the Serbian language that uses a meta-learning approach. Based on data obtained from the analysis of a large speech database, we used a feature set of 21 parameters describing phones and their contexts. These include attributes related to segmental identity and manner of articulation (for consonants); attributes related to phonological context, such as segment types and voicing values of neighboring phones and the presence or absence of lexical stress; morphological attributes, such as part of speech; and prosodic attributes, such as phonological word length, the position of the segment in the syllable, the position of the syllable in the word, the position of the word in the phrase, and phrase break level. The phone duration model obtained using the meta-learning algorithm outperformed the best individual model by approximately 2.0% and 1.7% in terms of the relative reduction of the root-mean-squared error and the mean absolute error, respectively.

    Relative Salience of Speech Rhythm and Speech Rate on Perceived Foreign Accent in a Second Language

    We investigated the independent contributions of speech rate and speech rhythm to perceived foreign accent. To address this issue we used a resynthesis technique that neutralizes segmental and tonal idiosyncrasies between identical sentences produced by French learners of English at different proficiency levels while maintaining the idiosyncrasies pertaining to prosodic timing patterns. We created stimuli that (1) preserved the idiosyncrasies in speech rhythm while controlling for the differences in speech rate between the utterances; (2) preserved the idiosyncrasies in speech rate while controlling for the differences in speech rhythm between the utterances; and (3) preserved the idiosyncrasies in both speech rate and speech rhythm. All the stimuli were created in intoned (with an imposed intonational contour) and flat (with monotonized, constant F0) conditions. The original and the resynthesized sentences were rated by native speakers of English for degree of foreign accent. We found that both speech rate and speech rhythm influence the degree of perceived foreign accent, but the effect of speech rhythm is larger than that of speech rate. We also found that intonation enhances the perception of fine differences in rhythmic patterns but reduces the perceptual salience of fine differences in speech rate.

    Konsonandikeskne vältesüsteem eesti ja inarisaami keeles [Consonant-centred quantity system in Estonian and Inari Saami]

    The electronic version of the dissertation does not include the publications. Quantity systems with three length categories for consonants can be found in a small number of languages, all of which belong to the Finno-Ugric languages: Estonian, Livonian, Inari Saami, and some other Saami languages. The focus of this dissertation is on two of them, Estonian and Inari Saami, the former belonging to the Finnic and the latter to the Saamic branch. Estonian exhibits a complex quantity system forming ternary length categories with vowels, consonants, or combinations of both. In Inari Saami, a ternary length distinction is found for consonants, while vocalic quantity shows binary oppositions. This thesis comprises experimental phonetic studies answering two main questions: how is ternary consonantal quantity in Estonian and Inari Saami realized phonetically, and how does quantity interact with segmental context?
The results showed that, in both languages, the three-way consonantal quantity is manifested in consonant durations that are longer in higher quantity degrees. While in Estonian the first and second quantity are further apart from each other, in Inari Saami the second and third quantity are more distinct. Cross-linguistic differences also appear in the relations between intervocalic consonants and neighboring vowels. In Estonian, the vowel following the consonant is shorter after a long and overlong consonant than after a short one. In Inari Saami, both the preceding and the following vowel become shorter as consonantal quantity increases. Fundamental frequency contours in Inari Saami are roughly the same in words with different structures. Intensity measures, however, show greater differences between the vowels surrounding the consonant when the quantity of the consonant increases. The intensity of the sonorant consonant itself does not change across quantities. In Estonian, fundamental frequency is important for distinguishing the second and third quantity, but for stops, where pitch movement cannot be tracked, durational cues have also been found to suffice for the perception of quantity. The results of the articulatory study of this thesis show variation in quantity manifestations in Estonian geminate consonants due to varied segmental context. Some articulatory movements exhibit three-way patterns associated with quantity categories (in the duration of the lip closing gesture for the consonant and of the tongue transition gesture from the preceding vowel to the following vowel); for others, the first and second quantity are opposed to the third quantity, or the first quantity degree is opposed to the second and third ones. Similar patterns were found in the acoustic data from spontaneous speech. The durational properties of ternary quantity are realized differently for different intervocalic consonants, and variation is also caused by coarticulatory effects of the surrounding vowels.
https://www.ester.ee/record=b524109

    Prediction of accent commands for the Fujisaki intonation model

    This paper presents a model to predict the accent commands (henceforth ACs) of the Fujisaki Model for the F0 contour, given that the phrase commands (henceforth FCs) are known. Accent commands are associated with syllables. For each syllable, an artificial neural network (ANN) decides, with an accuracy of 89.4%, whether or not there will be an associated AC. For syllables with an associated AC, the amplitude, Aa, the onset time anticipation, T1a, and the offset time anticipation, T2a, are predicted by additional ANNs, with resulting linear correlation coefficients of 0.602, 0.743 and 0.650, respectively. The features used for each ANN are presented and discussed. Finally, a comparison between target and predicted F0 contours is presented.

    Automatic Pronunciation Assessment -- A Review

    Pronunciation assessment and its application in computer-aided pronunciation training (CAPT) have seen impressive progress in recent years. With the rapid growth in language processing and deep learning over the past few years, there is a need for an updated review. In this paper, we review methods employed in pronunciation assessment for both phonemic and prosodic aspects. We categorize the main challenges observed in prominent research trends and highlight existing limitations and available resources. This is followed by a discussion of the remaining challenges and possible directions for future work. Comment: 9 pages, accepted to EMNLP Findings.

    The Phonetics and Phonology of the Polish Calling Melodies

    Two calling melodies of Polish were investigated, the routine call, used to call someone for an everyday reason, and the urgent call, which conveys disapproval of the addressee's actions. A Discourse Completion Task was used to elicit the two melodies from speakers of Polish using twelve names from one to four syllables long; there were three names per syllable count, and speakers produced three tokens of each name with each melody. The results, based on eleven speakers, show that the routine calling melody consists of a low F0 stretch followed by a rise-fall-rise; the urgent calling melody, on the other hand, is a simple rise-fall. Systematic differences were found in the scaling and alignment of tonal targets: the routine call showed late alignment of the accentual pitch peak and in most instances lower scaling of targets. The accented vowel was also affected, being overall louder in the urgent call. Based on the data and comparisons with other Polish melodies, we analyse the routine call as LH* !H-H% and the urgent call as H* L-L%. We discuss the results and our analysis in light of recent findings on calling melodies in other languages, and explore their repercussions for intonational phonology and the modelling of intonation.

    Fast Speech in Unit Selection Speech Synthesis

    Moers-Prinz D. Fast Speech in Unit Selection Speech Synthesis. Bielefeld: Universität Bielefeld; 2020. Speech synthesis is part of the everyday life of many people with severe visual disabilities. For those who rely on assistive speech technology, the ability to choose a fast speaking rate is reported to be essential. Expressive speech synthesis and other spoken language interfaces may also require the integration of fast speech. Architectures like formant or diphone synthesis are able to produce synthetic speech at fast speech rates, but the generated speech does not sound very natural. Unit selection synthesis systems, however, are capable of delivering more natural output. Nevertheless, fast speech has not been adequately implemented in such systems to date. Thus, the goal of the work presented here was to determine an optimal strategy for modeling fast speech in unit selection speech synthesis, to provide potential users with a more natural-sounding alternative for fast speech output.

    Explaining the PENTA model: a reply to Arvaniti and Ladd

    This paper presents an overview of the Parallel Encoding and Target Approximation (PENTA) model of speech prosody, in response to an extensive critique by Arvaniti & Ladd (2009). PENTA is a framework for conceptually and computationally linking communicative meanings to fine-grained prosodic details, based on an articulatory-functional view of speech. Target Approximation simulates the articulatory realisation of underlying pitch targets, the prosodic primitives in the framework. Parallel Encoding provides an operational scheme that enables simultaneous encoding of multiple communicative functions. We also outline how PENTA can be computationally tested with a set of software tools. With the help of one of the tools, we offer a PENTA-based hypothetical account of the Greek intonational patterns reported by Arvaniti & Ladd, showing how it is possible to predict the prosodic shapes of an utterance based on the lexical and postlexical meanings it conveys.

    Prosody in text-to-speech synthesis using fuzzy logic

    For over a thousand years, inventors, scientists and researchers have tried to reproduce human speech. Today, the quality of synthesized speech is still not equivalent to the quality of real speech. Most research on speech synthesis focuses on improving the quality of the speech produced by Text-to-Speech (TTS) systems. The best TTS systems use unit selection-based concatenation to synthesize speech. However, this method is time-consuming and requires a very large speech database. Diphone-concatenated synthesized speech requires less memory, but sounds robotic. This thesis explores the use of fuzzy logic to make diphone-concatenated speech sound more natural. A TTS system is built using both neural networks and fuzzy logic. Text is converted into phonemes using neural networks. Fuzzy logic is used to control the fundamental frequency for three types of sentences. In conclusion, the fuzzy system produces F0 contours that make the diphone-concatenated speech sound more natural.