
    Simulating vocal learning of spoken language: Beyond imitation

    Computational approaches have an important role to play in understanding the complex process of speech acquisition in general, and have recently been popular in studies of vocal learning in particular. In this article we suggest that two significant problems associated with imitative vocal learning of spoken language, the speaker normalisation and phonological correspondence problems, can be addressed by linguistically grounded auditory perception. In particular, we show how the articulation of consonant-vowel syllables may be learnt from auditory percepts that can represent either individual utterances by speakers with different vocal tract characteristics or ideal phonetic realisations. The result is an optimisation-based implementation of vocal exploration – incorporating semantic, auditory, and articulatory signals – that can serve as a basis for simulating vocal learning beyond imitation.
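    The general shape of optimisation-based vocal exploration can be illustrated with a toy sketch. Everything below is an assumption for illustration only: `auditory_map` is a hypothetical stand-in for an articulatory synthesiser, and the hill-climbing loop is a deliberately simple proxy for whatever optimiser the authors actually use.

    ```python
    import random

    def auditory_map(params):
        """Toy stand-in for an articulatory synthesiser: maps two
        articulatory parameters in [0, 1] to a pair of formant-like
        auditory values (entirely hypothetical numbers)."""
        jaw, tongue = params
        return (500 + 700 * jaw, 1000 + 1200 * tongue)

    def percept_distance(a, b):
        """Euclidean distance between two auditory percepts."""
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def explore(target, steps=2000, sigma=0.05, seed=0):
        """Random-walk vocal exploration: perturb the articulatory
        parameters and keep any change that brings the produced
        percept closer to the target percept."""
        rng = random.Random(seed)
        params = [rng.random(), rng.random()]
        best = percept_distance(auditory_map(params), target)
        for _ in range(steps):
            cand = [max(0.0, min(1.0, p + rng.gauss(0, sigma))) for p in params]
            d = percept_distance(auditory_map(cand), target)
            if d < best:
                params, best = cand, d
        return params, best
    ```

    The target percept plays the role of the "ideal phonetic realisation" mentioned in the abstract; in the real system the mapping is a full vocal tract model rather than an affine toy.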

    Segmental alignment of English syllables with singleton and cluster onsets

    Recent research has shown fresh evidence that consonant and vowel are synchronised at the syllable onset, as predicted by a number of theoretical models. The finding was made by using a minimal contrast paradigm to determine segment onset in Mandarin CV syllables, which differed from the conventional method of detecting gesture onset with a velocity threshold [1]. It has remained unclear, however, whether CV co-onset also occurs between the nucleus vowel and a consonant cluster, as predicted by the articulatory syllable model [2]. This study applied the minimal contrast paradigm to British English in both CV and clusterV (CLV) syllables, and analysed the spectral patterns with signal chopping in conjunction with recurrent neural networks (RNN) with long short-term memory (LSTM) [3]. Results show that vowel onset is synchronised with the onset of the first consonant in a cluster, thus supporting the articulatory syllable model.
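    One plausible reading of "signal chopping" is producing progressively truncated copies of a signal so that a fixed-input classifier (here an LSTM over frames) can be scored on each version. The helper below is an illustrative sketch under that assumption, not the authors' implementation; the padding convention is also assumed.

    ```python
    def chop_signal(signal, chunk):
        """Produce progressively shorter copies of `signal` by removing
        `chunk` samples from the end at each step; each copy is
        zero-padded back to the original length so that a classifier
        expecting fixed-length input can score every version."""
        n = len(signal)
        chopped = []
        for cut in range(0, n, chunk):
            kept = signal[: n - cut]
            chopped.append(kept + [0.0] * (n - len(kept)))
        return chopped
    ```

    Comparing classifier confidence across the chopped versions then indicates how early in the syllable the vowel identity becomes recoverable.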

    Relating Objective and Subjective Performance Measures for AAM-based Visual Speech Synthesizers

    We compare two approaches for synthesizing visual speech using Active Appearance Models (AAMs): one that utilizes acoustic features as input, and one that utilizes a phonetic transcription as input. Both synthesizers are trained using the same data and their performance is measured using both objective and subjective testing. We investigate the impact of likely sources of error in the synthesized visual speech by introducing typical errors into real visual speech sequences and subjectively measuring the perceived degradation. When only a small region (e.g. a single syllable) of ground-truth visual speech is incorrect, we find that the subjective score for the entire sequence is lower than for sequences generated by our synthesizers. This observation motivates further consideration of an often ignored issue: to what extent do subjective measures correlate with objective measures of performance? Significantly, we find that the most commonly used objective measures of performance are not necessarily the best indicator of viewer perception of quality. We empirically evaluate alternatives and show that the cost of a dynamic time warp of synthesized visual speech parameters to the respective ground-truth parameters is a better indicator of subjective quality.
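    The dynamic-time-warp cost used as an objective measure can be sketched minimally. The function below is a standard textbook DTW over 1-D parameter trajectories, shown for illustration; the paper's actual implementation operates on multidimensional AAM parameter vectors.

    ```python
    def dtw_cost(a, b, dist=lambda x, y: abs(x - y)):
        """Dynamic time warp cost between two 1-D parameter trajectories.
        A lower cost means the synthesized trajectory tracks the
        ground truth more closely, while tolerating timing differences."""
        inf = float("inf")
        n, m = len(a), len(b)
        # D[i][j] = minimal cumulative cost aligning a[:i] with b[:j]
        D = [[inf] * (m + 1) for _ in range(n + 1)]
        D[0][0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = dist(a[i - 1], b[j - 1])
                D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
        return D[n][m]
    ```

    Because the warp absorbs pure timing offsets, a trajectory that is merely delayed scores much better under this cost than under a frame-by-frame error, which is one reason it can align better with viewer judgements.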

    A syllable-based investigation of coarticulation

    Coarticulation has long been investigated in Speech Sciences and Linguistics (Kühnert & Nolan, 1999). This thesis explores coarticulation through a syllable-based model (Y. Xu, 2020). First, it is hypothesised that consonant and vowel are synchronised at the syllable onset for the sake of reducing temporal degrees of freedom, and that such synchronisation is the essence of coarticulation. Previous examinations of CV alignment mainly report onset asynchrony (Gao, 2009; Shaw & Chen, 2019). The first study of this thesis tested the synchrony hypothesis using articulatory and acoustic data in Mandarin. Departing from conventional approaches, a minimal triplet paradigm was applied, in which the CV onsets were determined through consonant and vowel minimal pairs, respectively. Both articulatory and acoustic results showed that CV articulation started in close temporal proximity, supporting the synchrony hypothesis. The second study extended the research to English and to syllables with cluster onsets. Using acoustic data in conjunction with deep learning, supporting evidence was found for co-onset, in contrast to the widely reported c-center effect (Byrd, 1995). Second, the thesis investigated the mechanism that can maximise synchrony – Dimension Specific Sequential Target Approximation (DSSTA) – which is highly relevant to what is commonly known as coarticulation resistance (Recasens & Espinosa, 2009). Evidence from the first two studies shows that, when conflicts arise due to articulation requirements between consonant and vowel, the CV gestures can be fulfilled by the same articulator on separate dimensions simultaneously. Last but not least, the final study tested the hypothesis that resyllabification is the result of coarticulation asymmetry between onset and coda consonants. It was found that neural-network-based models could infer the syllable affiliation of consonants, and that the inferred resyllabified codas had a coarticulatory structure similar to that of canonical onset consonants. In conclusion, this thesis found that many coarticulation-related phenomena, including local vowel-to-vowel anticipatory coarticulation, coarticulation resistance, and resyllabification, stem from the articulatory mechanism of the syllable.

    Modelling English diphthongs with dynamic articulatory targets

    The nature of English diphthongs has been much disputed. To date, the most influential account argues that diphthongs are phoneme entities rather than vowel combinations. However, mixed results have been reported regarding whether the rate of formant transition is the most reliable attribute in the perception and production of diphthongs. Here, we used computational modelling to explore the underlying forms of diphthongs. We tested the assumption that diphthongs have dynamic articulatory targets by training an articulatory synthesiser with a three-dimensional (3D) vocal tract model to learn English words. An automatic phoneme recogniser was constructed to guide the learning of the diphthongs. Listening experiments with native listeners indicated that the model succeeded in learning highly intelligible diphthongs, providing support for the dynamic target assumption. The modelling approach paves the way for validating hypotheses of speech perception and production.
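    The idea of a dynamic articulatory target can be sketched with a simple first-order system: the articulator state decays toward a target that itself moves over the syllable. This is an illustrative toy under assumed dynamics (a single first-order lag with a linearly moving target), not the synthesiser or target model used in the study.

    ```python
    def approach_dynamic_target(start, target_a, target_b, rate, n=100):
        """Track an articulator state that exponentially approaches a
        target moving linearly from `target_a` to `target_b` over the
        syllable – a 'dynamic target', as hypothesised for diphthongs.
        `rate` controls how fast the state closes on the target."""
        state = start
        track = []
        for i in range(n):
            t = i / (n - 1)
            target = target_a + (target_b - target_a) * t  # moving target
            state += rate * (target - state) / n           # exponential approach
            track.append(state)
        return track
    ```

    With a moving target, the realised trajectory lags slightly behind it, so the observed formant transition rate reflects both the target's slope and the articulatory time constant – which is one way to reconcile the mixed findings on transition rate.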

    Influence of syllable-coda voicing on the acoustic properties of syllable-onset /l/ in English

    Properties of syllable onset /l/ that depend on the voicing of the syllable coda were measured for four speakers, representing different nonrhotic British English accents that differ in their phonetic realization of onset /l/ and in their system of phonological contrast involving onset /l/ and /r/. Onset /l/ was longer before voiced than voiceless codas for all four speakers, and darker for two of them as measured by lower F2 frequency, and for these two and one other as measured by spectral center of gravity (COG). There were no coda-dependent differences in f0 in the /l/, and F1 frequency differed only for the fourth speaker. The vowel was also longer for all four speakers when the coda was voiced (as expected), while F1 was lower and F2 normally higher. One speaker provided data with fricative or affricate onsets: fricated segments were longer before voiced codas, but no coda-dependent COG differences were found. At least when the onset includes /l/, phonological voicing of the coda seems to be reflected in complex acoustic-phonetic properties distributed across the whole syllable, some properties being localized, others not. We describe these properties as variations in a bright-somber dimension. In most accents, when the coda is voiceless, the syllable is relatively bright: small proportions of periodic energy which is relatively high frequency at the syllable edges, and a high proportion of silence or aperiodic energy. When the coda is voiced, the syllable is relatively somber: a high proportion of periodic energy which is relatively low frequency at the syllable edges, and relatively small amounts of silence and aperiodic energy. Other accents use other combinations, dependent on the phonetic and phonological properties of liquids in the particular accent. The association of onset darkness and coda voicing does not seem to be ascribable to anticipatory coarticulation of features essential to voicing itself; this observation provides support for nonsegmental models of speech perception in which fine phonetic detail is mapped directly to linguistic structure without reference to phoneme-sized segments.

    Towards an Integrative Information Society: Studies on Individuality in Speech and Sign

    The flow of information within modern information society has increased rapidly over the last decade. The major part of this information flow relies on the individual’s abilities to handle text or speech input. For the majority of us this presents no problems, but there are some individuals who would benefit from other means of conveying information, e.g. signed information flow. Over the last decades, new results from various disciplines have pointed towards a common background and processing for sign and speech, and this was one of the key issues that I wanted to investigate further in this thesis. The basis of this thesis is firmly within speech research, and that is why I wanted to design analogues of widely used speech perception test batteries for signers – to find out whether the results for signers would be the same as in speakers’ perception tests. One of the key findings within biology – and more precisely its effects on speech and communication research – is the mirror neuron system. That finding has enabled us to form new theories about the evolution of communication, and it all seems to converge on the hypothesis that all communication has a common core within humans. In this thesis speech and sign are discussed as equal and analogical counterparts of communication, and all research methods used in speech are modified for sign. Both speech and sign are thus investigated using similar test batteries. Furthermore, both production and perception of speech and sign are studied separately. An additional framework for studying production is given by gesture research using cry sounds. Results of cry sound research are then compared to results from children acquiring sign language. These results show that individuality manifests itself from very early on in human development. Articulation in adults, both in speech and sign, is studied from two perspectives: normal production and re-learning production when the apparatus has been changed. 
    Normal production is studied both in speech and sign, and the effects of changed articulation are studied with regard to speech. Both these studies are done using carrier sentences. Furthermore, sign production is studied by giving the informants the possibility of spontaneous production. The production data from the signing informants is also used as the basis for input in the sign synthesis stimuli used in the sign perception test battery. Speech and sign perception were studied using the informants’ answers to questions using forced choice in identification and discrimination tasks. These answers were then compared across language modalities. Three different informant groups participated in the sign perception tests: native signers, sign language interpreters, and Finnish adults with no knowledge of any signed language. This gave a chance to investigate which of the characteristics found in the results were due to the language per se and which were due to the change in modality itself. As the analogous test batteries yielded similar results over different informant groups, some common threads of results could be observed. Starting from very early on in acquiring speech and sign, the results were highly individual. However, the results were the same within one individual when the same test was repeated. This individuality of results manifested along the same patterns across different language modalities and, on some occasions, across language groups. As both modalities yield similar answers to analogous study questions, this has led us to provide methods for basic input for sign language applications, i.e. signing avatars. This has also given us answers to questions on the precision of the animation and intelligibility for the users – what are the parameters that govern the intelligibility of synthesised speech or sign, and how precise must the animation or synthetic speech be in order for it to be intelligible. 
    The results also give additional support to the well-known fact that intelligibility is not the same as naturalness. In some cases, as shown within the sign perception test battery design, naturalness decreases intelligibility. This also has to be taken into consideration when designing applications. All in all, the results from each of the test batteries, be they for signers or speakers, yield strikingly similar patterns, which provides further support for a common core for all human communication. Thus, we can modify and deepen the phonetic framework models for human communication based on the knowledge obtained from the results of the test batteries within this thesis.

    Effects of Tonal Coarticulation and Prosodic Positions on Tonal Contours of Low Rising Tones: In the Case of Xiamen Dialect

    Few studies have examined the effects of tonal coarticulation and prosodic positions on the low rising tone in Xiamen Dialect. This study addressed that issue. To do so, a new method, the Tonal Contour Analysis in the Tonal Triangle, was proposed to measure the subtle curvature of the tonal contour. Findings are as follows: (1) The low rising tone in Xiamen Dialect has a tendency towards a falling-rising tone, which is significantly affected by tonal coarticulation and prosodic position. (2) The low rising tone presents as a falling-rising tone when preceded by a tone with a high offset, and as a low rising tone when preceded by a tone that ends low. (3) The curvature of the low rising tone is greatest in the sentence-initial position, and is positively correlated with its own duration.
    Comment: To be published in InterSpeech 202
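    One simple way to quantify the curvature of a tonal contour – offered here only as a plausible sketch, since the paper's Tonal Triangle method is not specified in the abstract – is the largest perpendicular distance from the f0 track to the chord joining its onset and offset:

    ```python
    def contour_curvature(f0):
        """Rough curvature index for a tonal contour: the maximum
        perpendicular distance from any point of the f0 track to the
        straight line (chord) joining onset and offset. A flat or
        purely rising contour scores ~0; a falling-rising contour
        scores higher."""
        n = len(f0)
        x0, y0, x1, y1 = 0.0, f0[0], float(n - 1), f0[-1]
        # line through (x0, y0) and (x1, y1): a*x + b*y + c = 0
        a, b = y1 - y0, x0 - x1
        c = -(a * x0 + b * y0)
        norm = (a * a + b * b) ** 0.5
        return max(abs(a * i + b * y + c) / norm for i, y in enumerate(f0))
    ```

    Under this index, a straight rise scores zero while a contour that dips before rising scores in proportion to the depth of the dip, matching the falling-rising vs. low-rising distinction the study draws.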

    A silent speech system based on permanent magnet articulography and direct synthesis

    In this paper we present a silent speech interface (SSI) system aimed at restoring speech communication for individuals who have lost their voice due to laryngectomy or diseases affecting the vocal folds. In the proposed system, articulatory data captured from the lips and tongue using permanent magnet articulography (PMA) are converted into audible speech using a speaker-dependent transformation learned from simultaneous recordings of PMA and audio signals acquired before laryngectomy. The transformation is represented using a mixture of factor analysers, a generative model that allows us to efficiently model non-linear behaviour and perform dimensionality reduction at the same time. The learned transformation is then deployed during normal usage of the SSI to restore the acoustic speech signal associated with the captured PMA data. The proposed system is evaluated using objective quality measures and listening tests on two databases containing PMA and audio recordings for normal speakers. Results show that it is possible to reconstruct speech from articulator movements captured by an unobtrusive technique without an intermediate recognition step. The SSI is capable of producing speech of sufficient intelligibility and naturalness that the speaker is clearly identifiable, but problems remain in scaling up the process to function consistently for phonetically rich vocabularies.
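    The flavour of mixture-based articulatory-to-acoustic regression can be conveyed with a heavily simplified sketch: each mixture component contributes a local linear map, weighted by its responsibility for the input. This is a toy in one dimension, far simpler than a mixture of factor analysers, and every name and parameter below is assumed for illustration.

    ```python
    import math

    def mixture_regress(x, components):
        """Minimal mixture-of-linear-experts regression. Each component
        is a tuple (weight, mean_x, var_x, mean_y, slope); the output
        is the responsibility-weighted sum of the per-component maps
        E[y | x] = mean_y + slope * (x - mean_x)."""
        resp = []
        for w, mx, vx, my, s in components:
            # unnormalised Gaussian responsibility of this component for x
            p = w * math.exp(-0.5 * (x - mx) ** 2 / vx) / math.sqrt(2 * math.pi * vx)
            resp.append(p)
        total = sum(resp) or 1.0
        return sum(r / total * (my + s * (x - mx))
                   for r, (_, mx, _, my, s) in zip(resp, components))
    ```

    A mixture of factor analysers extends this idea to high-dimensional PMA and acoustic vectors, with each component additionally performing a low-rank (factor) decomposition of the local covariance.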

    Assessment of naturalness in the ProSynth speech synthesis project
