
    Speech synthesis, speech simulation and speech science

    Speech synthesis research has been transformed in recent years through the exploitation of speech corpora - both for statistical modelling and as a source of signals for concatenative synthesis. This revolution in methodology, and the new techniques it brings, calls into question the received wisdom that better computer voice output will come from a better understanding of how humans produce speech. This paper discusses the relationship between this new technology of simulated speech and the traditional aims of speech science. The paper suggests that the goal of speech simulation frees engineers from inadequate linguistic and physiological descriptions of speech. At the same time, it leaves speech scientists free to return to their proper goal of building a computational model of human speech production.

    THE CHILD AND THE WORLD: How Children Acquire Language

    Over the last few decades, research into child language acquisition has been revolutionized by the use of ingenious new techniques which allow one to investigate what infants (that is, children not yet able to speak) can in fact perceive when exposed to a stream of speech sound, and the discriminations they can make between different speech sounds, different speech sound sequences and different words. However, on the central features of the mystery, the extraordinarily rapid acquisition of lexicon and complex syntactic structures, little solid progress has been made. The questions being researched are how infants acquire and produce the speech sounds (phonemes) of the community language; how infants find words in the stream of speech; and how they link words to perceived objects or actions, that is, discover meanings. In a recent general review in Nature of children's language acquisition, Patricia Kuhl also asked why we do not learn new languages as easily at 50 as at 5 and why computers have not cracked the human linguistic code. The motor theory of language function and origin makes possible a plausible account of child language acquisition generally, from which answers can also be derived to these further questions. Why computers have so far been unable to 'crack' the language problem becomes apparent in the light of the motor theory account: computers can have no natural relation between words and their meanings; they have no conceptual store to which the network of words is linked, nor do they have the innate aspects of language functioning - represented by function words; computers have no direct links between speech sounds and movement patterns, and they do not have the instantly integrated neural patterning underlying thought - they necessarily operate serially and hierarchically. Adults find the acquisition of a new language much more difficult than children do because they are already neurally committed to the link between the words of their first language and the elements in their conceptual store. A second language being acquired by an adult is in direct competition for neural space with the network structures established for the first language.

    Sound-Action Symbolism

    Recent evidence has shown linkages between actions and segmental elements of speech. For instance, close-front vowels are sound-symbolically associated with the precision grip, and front vowels are associated with forward-directed limb movements. The current review article presents a variety of such sound-action effects and proposes that they compose a category of sound symbolism that is based on grounding conceptual knowledge of a referent in articulatory and manual action representations. In addition, the article proposes that even some widely known sound symbolism phenomena, such as sound-magnitude symbolism, can be partially based on similar sensorimotor grounding. The article also discusses how the meaning of suprasegmental speech elements is in many instances similarly grounded in body actions. Sound symbolism, prosody, and body gestures might originate from the same embodied mechanisms that enable a vivid and iconic expression of the meaning of a referent to the recipient.

    Põhiemotsioonid eestikeelses etteloetud kõnes: akustiline analüüs ja modelleerimine (Basic emotions in read Estonian speech: acoustic analysis and modelling)

    The electronic version of the dissertation does not include the publications. This doctoral dissertation had two major purposes: (a) to find out and describe the acoustic expression of three basic emotions – joy, sadness and anger – in read Estonian speech, and (b) to create, based on the resulting description, acoustic models of emotional speech designed to help parametric synthesis of Estonian speech recognizably express these emotions. Since synthetic speech has many applications in different fields, such as human-machine interaction, multimedia, or aids for the disabled, it is vital that it sound natural, that is, as human-like as possible. One way towards naturalness is to add emotions to the synthetic speech by means of models that feed the synthesiser with the combinations of acoustic parameter values necessary for emotional expression. In order to create such models of emotional speech, it is first necessary to have a detailed knowledge of the vocal expression of emotions in human speech. For that purpose I investigated to what extent, if any, and in what direction emotions influence the values of speech acoustic parameters (e.g., fundamental frequency, intensity and speech rate), and which parameters enable emotions to be discriminated from each other and from neutral speech. The results provided material for creating acoustic models of emotions*, which were presented to evaluators, who were asked to decide which of the models helped to produce synthetic speech with recognisable emotions. The experiment showed that, with models based on the acoustic results, an Estonian speech synthesiser can satisfactorily express sadness and anger, while joy was not as well recognised by listeners. This dissertation describes one possible way in which joy, sadness and anger are expressed vocally in Estonian speech and presents models that enable these emotions to be added to Estonian synthetic speech. The study serves as a starting point for the future development of acoustic models for Estonian emotional synthetic speech. * Recorded examples of emotional speech synthesised using the test models can be accessed at https://www.eki.ee/heli/index.php?option=com_content&view=article&id=7&Itemid=494
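    The acoustic parameters examined here (fundamental frequency, intensity and speech rate) can be estimated with standard speech tooling. The sketch below is a minimal illustration only, assuming librosa is available and using onset density as a crude stand-in for speech rate; the file names are hypothetical placeholders, not the dissertation's data or models.

```python
# Minimal sketch: compare mean F0, intensity and a crude speech-rate proxy
# between a neutral and an emotional recording. Assumes librosa is installed;
# the file names are hypothetical placeholders.
import numpy as np
import librosa

def acoustic_profile(path, sr=16000):
    y, sr = librosa.load(path, sr=sr)
    # Fundamental frequency (F0) via probabilistic YIN; unvoiced frames are NaN.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                            fmax=librosa.note_to_hz("C6"), sr=sr)
    # Intensity proxy: RMS energy converted to dB.
    rms = librosa.feature.rms(y=y)[0]
    # Crude speech-rate proxy: onset events per second (not a true syllable rate).
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    duration = len(y) / sr
    return {
        "mean_f0_hz": float(np.nanmean(f0)),
        "mean_intensity_db": float(np.mean(librosa.amplitude_to_db(rms))),
        "onsets_per_sec": len(onsets) / duration,
    }

if __name__ == "__main__":
    for label in ("neutral", "anger"):  # hypothetical file names
        print(label, acoustic_profile(f"{label}.wav"))
```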

    Integrating user-centred design in the development of a silent speech interface based on permanent magnetic articulography

    A new wearable silent speech interface (SSI) based on Permanent Magnetic Articulography (PMA) was developed with the involvement of end users in the design process. Hence, desirable features such as appearance, portability, ease of use and light weight were integrated into the prototype. The aim of this paper is to address the challenges faced and the design considerations addressed during the development. Evaluations of both the hardware and the speech recognition performance are presented. The new prototype shows comparable performance with its predecessor in terms of speech recognition accuracy (~95% word accuracy and ~75% sequence accuracy), but significantly improved appearance, portability and hardware features in terms of miniaturization and cost.
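    The two figures quoted above are standard recognition metrics. Below is a minimal sketch of how they are commonly computed, assuming word accuracy is taken as 1 minus the word error rate and sequence accuracy as the fraction of utterances recognised exactly; the paper itself may define them slightly differently.

```python
# Sketch of the two metrics under the common definitions:
# word accuracy = 1 - WER (word-level edit distance / reference length),
# sequence accuracy = fraction of utterances recognised exactly.
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[len(hyp)]

def word_and_sequence_accuracy(references, hypotheses):
    """references/hypotheses: parallel lists of transcript strings."""
    errors = total_words = exact = 0
    for ref_s, hyp_s in zip(references, hypotheses):
        ref, hyp = ref_s.split(), hyp_s.split()
        errors += edit_distance(ref, hyp)
        total_words += len(ref)
        exact += (ref == hyp)
    return 1.0 - errors / total_words, exact / len(references)

# Hypothetical example: one utterance exact, one with a missing word.
refs = ["open the door", "switch the light on"]
hyps = ["open the door", "switch the light"]
print(word_and_sequence_accuracy(refs, hyps))  # ~0.857 word acc, 0.5 sequence acc
```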

    Speech emotion recognition based on bi-directional acoustic–articulatory conversion

    Acoustic and articulatory signals are naturally coupled and complementary. The challenge of acquiring articulatory data and the nonlinear ill-posedness of acoustic–articulatory conversions have resulted in previous studies on speech emotion recognition (SER) primarily relying on unidirectional acoustic–articulatory conversions. However, these studies have ignored the potential benefits of bi-directional acoustic–articulatory conversion. Addressing the problem of nonlinear ill-posedness and effectively extracting and utilizing these two modal features in SER remain open research questions. To bridge this gap, this study proposes a Bi-A2CEmo framework that simultaneously addresses the bi-directional acoustic-articulatory conversion for SER. This framework comprises three components: a Bi-MGAN that addresses the nonlinear ill-posedness problem, KCLNet that enhances the emotional attributes of the mapped features, and ResTCN-FDA that fully exploits the emotional attributes of the features. Another challenge is the absence of a parallel acoustic-articulatory emotion database. To overcome this issue, this study utilizes electromagnetic articulography (EMA) to create a multi-modal acoustic-articulatory emotion database for Mandarin Chinese called STEM-E²VA. A comparative analysis is then conducted between the proposed method and state-of-the-art models to evaluate the effectiveness of the framework. Bi-A2CEmo achieves an accuracy of 89.04% in SER, an improvement of 5.27% over using the actual acoustic and articulatory features recorded by the EMA. The results for the STEM-E²VA dataset show that Bi-MGAN achieves higher accuracy in mapping and inversion than conventional conversion networks. Visualization of the mapped features before and after enhancement reveals that KCLNet reduces the intra-class spacing while increasing the inter-class spacing of the features. ResTCN-FDA demonstrates high recognition accuracy on three publicly available datasets. The experimental results show that the proposed bi-directional acoustic-articulatory conversion framework can significantly improve SER performance.
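    To make the bidirectional-conversion idea concrete, the sketch below trains two plain feed-forward networks, one per mapping direction, with supervised and cycle-consistency losses. It is only a minimal illustration of bidirectional acoustic-articulatory mapping; it is not the paper's Bi-MGAN, KCLNet or ResTCN-FDA, and the feature dimensions and data are invented placeholders.

```python
# Minimal sketch of bidirectional acoustic <-> articulatory mapping with two MLPs.
# Assumes PyTorch; dimensions and the random "corpus" are placeholders only.
import torch
import torch.nn as nn

ACOUSTIC_DIM, ARTIC_DIM = 40, 12  # assumed sizes, e.g. mel features vs. EMA coordinates

def mlp(d_in, d_out, hidden=256):
    return nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, d_out))

forward_net = mlp(ACOUSTIC_DIM, ARTIC_DIM)   # acoustic -> articulatory (mapping)
inverse_net = mlp(ARTIC_DIM, ACOUSTIC_DIM)   # articulatory -> acoustic (inversion)
opt = torch.optim.Adam(list(forward_net.parameters()) +
                       list(inverse_net.parameters()), lr=1e-3)
l1 = nn.L1Loss()

# Dummy parallel frames standing in for a parallel acoustic-articulatory corpus.
acoustic = torch.randn(512, ACOUSTIC_DIM)
artic = torch.randn(512, ARTIC_DIM)

for step in range(200):
    opt.zero_grad()
    artic_hat = forward_net(acoustic)
    acoustic_hat = inverse_net(artic)
    # Supervised mapping losses in both directions ...
    loss = l1(artic_hat, artic) + l1(acoustic_hat, acoustic)
    # ... plus cycle-consistency terms tying the two directions together.
    loss = loss + l1(inverse_net(artic_hat), acoustic) + l1(forward_net(acoustic_hat), artic)
    loss.backward()
    opt.step()
```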

    Systematic Influence of Perceived Grasp Shape on Speech Production

    Previous research has shown that precision and power grip performance is consistently influenced by simultaneous articulation. For example, power grip responses are performed relatively fast with the open-back vowel [a], whereas precision grip responses are performed relatively fast with the close-front vowel [i]. In the present study, the participants were presented with a picture of a hand shaped for the precision or power grip. They were required to pronounce speech sounds according to the front/above perspective of the hand. The results showed that not only is grip performance affected by simultaneously pronouncing a speech sound, but the production of a speech sound can also be affected by viewing an image of a grip. The precision grip stimulus triggered relatively rapid production of the close-front vowel [i]. In contrast, the effect related to the power grip stimulus was mostly linked to the vertical dimension of the pronounced vowel, since this stimulus triggered relatively rapid production of the open-back vowel [a] and the mid-open back vowel [o], while production of the close-back vowel [u] was not influenced by it. The fact that production of the dorsal consonant [k] or the coronal consonant [t] was not influenced by these stimuli suggests that the effect was not associated with the relative front-back tongue shape of the articulation in the absence of changes in any vertical articulatory component. These findings provide evidence for an intimate interaction between certain articulatory gestures and grip types, suggesting that an overlapping visuomotor network operates for planning articulatory gestures and grasp actions.

    Sound symbolism, speech expressivity and crossmodality

    The direct links between sound and meaning that characterize sound symbolism can be thought of as mainly related to two kinds of phenomena: sound iconicity and sound metaphors. The former refers to the mirror relations established between sound and meaning effects (Nobile, 2011), while the latter, a term coined by Fonagy (1983), refers to relationships based on analogies between meaning and the characteristics of speech sound production. Four codes relevant to the study of sound symbolism phenomena have been mentioned in the phonetic literature: the frequency code (Ohala, 1994), the respiratory code, the effort code (Gussenhoven, 2002) and the sirenic code (Gussenhoven, 2016). In the present work, sound symbolism is taken to be the basis of speech expressivity, because the meaning effects attributed to the spoken mode by listeners are thought to be based on the acoustic features of sounds deriving from the various articulatory maneuvers yielding breath, voice, noise, resonance and silence. Based on the impression caused by these acoustic features, listeners attribute physiological, physical, psychological and social characteristics to speakers. In this way, speech can be considered both expressive and impressive, because it is used to convey meaning effects but it also impresses listeners. Both segmental and prosodic elements are used to express meaning effects in speech. Among the prosodic elements, voice quality settings have received less attention regarding their expressive uses. We argue that the investigation of the expressive uses of voice quality settings can be better approached if these settings are grouped according to their shared acoustic output properties and vocal tract configurations. Results of experiments relating symbolic uses of vocal qualities to semantic, acoustic and visual features by means of multidimensional analysis are reported, and the expressive and impressive roles of voice quality settings in spoken communication are discussed in relation to motivated links between sound forms and meaning effects. Keywords: sound and meaning; sound symbolism; speech expressivity; voice quality; acoustic analysis; perceptual analysis
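    One way to carry out the kind of multidimensional analysis mentioned above is classical multidimensional scaling over a dissimilarity matrix of voice quality settings. The sketch below assumes scikit-learn and uses invented dissimilarities purely as placeholders; the study's actual data and procedure may differ.

```python
# Minimal sketch: multidimensional scaling (MDS) placing voice quality settings
# in a 2-D space from pairwise dissimilarities. The dissimilarity values here are
# random placeholders; real values would come from perceptual or acoustic comparisons.
import numpy as np
from sklearn.manifold import MDS

settings = ["modal", "breathy", "whispery", "creaky", "harsh"]
rng = np.random.default_rng(0)
d = rng.uniform(0.2, 1.0, size=(5, 5))
dissim = (d + d.T) / 2          # symmetrize
np.fill_diagonal(dissim, 0.0)   # zero self-dissimilarity

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissim)
for name, (x, y) in zip(settings, coords):
    print(f"{name:10s} {x:+.2f} {y:+.2f}")
```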

    Auditory perceptual assessment of voices: Examining perceptual ratings as a function of voice experience

    Understanding voice usage is vital to our understanding of human interaction. What is known about the auditory perceptual evaluation of voices comes mainly from studies of voice professionals, who evaluate operatic/lyrical singing in specific contexts. This is surprising, as recordings of singing voices from different musical styles are an omnipresent phenomenon, evoking reactions in listeners with various levels of expertise. Understanding how untrained listeners perceive and describe voices will open up new research possibilities and enhance vocal communication between listeners. Here, three studies with a mixed-methods approach aimed at (1) evaluating the ability of untrained listeners to describe voices, and (2) determining which auditory features were most salient in participants' discrimination of voices. In an interview study (N = 20) and a questionnaire study (N = 48), free voice descriptions by untrained listeners of 23 singing voices, primarily from popular music, were compared with terms used by voice professionals, revealing that participants were able to describe voices using vocal characteristics from essential categories indicating sound quality, pitch changes, articulation, and variability in expression. Nine items were derived from these descriptions and used in an online survey for the evaluation of six voices by trained and untrained listeners in a German (N = 216) and an English (N = 50) sample, revealing that neither language nor expertise affected the assessment of the singers. A discriminant analysis showed that roughness and tension were important features for voice discrimination. The measure of vocal expression created in the current study will be informative for studying voice perception and evaluation more generally.
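    The discriminant analysis referred to above can be illustrated with a small sketch: linear discriminant analysis over listener ratings on the nine derived items, with coefficient magnitudes giving a rough view of which items (e.g. roughness, tension) separate the voices. The data below are random placeholders, not the study's ratings.

```python
# Minimal sketch: LDA over listener ratings to see which items discriminate voices.
# Ratings and labels are random placeholders standing in for the survey data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
n_ratings, n_items, n_voices = 300, 9, 6
X = rng.normal(size=(n_ratings, n_items))       # ratings on the nine items (placeholder)
y = rng.integers(0, n_voices, size=n_ratings)   # which voice each rating describes

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)
print("classification accuracy on the fitted data:", lda.score(X, y))
# Mean absolute coefficient per item: a rough indicator of each item's weight
# in discriminating the voices.
print("item weights:", np.abs(lda.coef_).mean(axis=0))
```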
