Making music through real-time voice timbre analysis: machine learning and timbral control
PhD thesis. People can achieve rich musical expression through vocal sound; see for example human beatboxing, which achieves a wide timbral variety through a range of extended techniques. Yet the vocal modality is under-exploited as a controller for music systems. If we can analyse a vocal performance suitably in real time, then this information could be used to create voice-based interfaces with the potential for intuitive and fulfilling levels of expressive control.
Conversely, many modern techniques for music synthesis do not imply any
particular interface. Should a given parameter be controlled via a MIDI keyboard,
or a slider/fader, or a rotary dial? Automatic vocal analysis could provide
a fruitful basis for expressive interfaces to such electronic musical instruments.
The principal questions in applying vocal-based control are how to extract
musically meaningful information from the voice signal in real time, and how
to convert that information suitably into control data. In this thesis we address
these questions, with a focus on timbral control, and in particular we
develop approaches that can be used with a wide variety of musical instruments
by applying machine learning techniques to automatically derive the mappings
between expressive audio input and control output. The vocal audio signal is
construed to include a broad range of expression, in particular encompassing
the extended techniques used in human beatboxing.
The central contribution of this work is the application of supervised and
unsupervised machine learning techniques to automatically map vocal timbre
to synthesiser timbre and controls. Component contributions include a delayed
decision-making strategy for low-latency sound classification, a regression-tree method to learn associations between regions of two unlabelled datasets, a fast estimator of multidimensional differential entropy, and a qualitative method for evaluating musical interfaces based on discourse analysis.
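The core idea of mapping vocal timbre to synthesiser controls can be sketched in a few lines. The following is a minimal nearest-neighbour stand-in, not the thesis's actual regression-tree method, and the feature vectors and control parameters are hypothetical toy data:

```python
import numpy as np

def train_timbre_map(vocal_feats, synth_params):
    """Store paired examples of vocal timbre features and synth parameters."""
    return np.asarray(vocal_feats, float), np.asarray(synth_params, float)

def map_timbre(model, query):
    """Map an incoming vocal timbre frame to synth parameters by
    nearest-neighbour lookup over the stored training examples."""
    feats, params = model
    dists = np.linalg.norm(feats - np.asarray(query, float), axis=1)
    return params[np.argmin(dists)]
```

In a real-time system, `query` would be a per-frame timbre descriptor (e.g. spectral features) and the returned row would drive the synthesiser's controls; the thesis replaces this lookup with learned supervised and unsupervised mappings.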
Subsidia: Tools and Resources for Speech Sciences
This book, the result of collaboration between researchers who are experts in their respective fields, aims to serve the scientific community by compiling and describing a series of highly useful materials for continuing to advance research.
Articulatory Copy Synthesis Based on the Speech Synthesizer VocalTractLab
Articulatory copy synthesis (ACS), a subarea of speech inversion, refers to the reproduction of natural utterances and involves both the physiological articulatory processes and their corresponding acoustic results. This thesis proposes two novel methods for the ACS of human speech using the articulatory speech synthesizer VocalTractLab (VTL) to address or mitigate the existing problems of speech inversion, such as non-unique mapping, acoustic variation among different speakers, and the time-consuming nature of the process.
The first method involved finding appropriate VTL gestural scores for given natural utterances using a genetic algorithm. It consisted of two steps: gestural score initialization and optimization. In the first step, gestural scores were initialized using the given acoustic signals with speech recognition, grapheme-to-phoneme (G2P), and a VTL rule-based method for converting phoneme sequences to gestural scores. In the second step, the initial gestural scores were optimized by a genetic algorithm via an analysis-by-synthesis (ABS) procedure that sought to minimize the cosine distance between the acoustic features of the synthetic and natural utterances. The articulatory parameters were also regularized during the optimization process to restrict them to reasonable values.
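The optimization step above can be caricatured as follows. Everything here is a hypothetical stand-in: `synthesize` represents VTL rendering a gestural score to acoustic features, the parameter vector stands in for a gestural score, and the search is a simplified mutation-only variant of a genetic algorithm that minimizes the cosine distance described in the text:

```python
import numpy as np

def cosine_distance(a, b):
    """1 minus cosine similarity between two feature vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def optimize_gestures(synthesize, target_feats, init_params, rng,
                      pop_size=20, generations=50, sigma=0.1):
    """Toy analysis-by-synthesis loop: mutate candidate parameter vectors
    and keep the one whose synthesized features are closest (in cosine
    distance) to the target utterance's features."""
    best = np.asarray(init_params, float)
    best_cost = cosine_distance(synthesize(best), target_feats)
    for _ in range(generations):
        # Gaussian mutations around the current best candidate
        pop = best + sigma * rng.standard_normal((pop_size, best.size))
        costs = [cosine_distance(synthesize(p), target_feats) for p in pop]
        i = int(np.argmin(costs))
        if costs[i] < best_cost:
            best, best_cost = pop[i], costs[i]
    return best, best_cost
```

The actual method additionally regularizes the articulatory parameters during optimization to keep them in physiologically reasonable ranges.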
The second method was based on long short-term memory (LSTM) and convolutional neural networks, which were responsible for capturing the temporal dependence and the spatial structure of the acoustic features, respectively. The neural network regression models were trained, which used acoustic features as inputs and produced articulatory trajectories as outputs. In addition, to cover as much of the articulatory and acoustic space as possible, the training samples were augmented by manipulating the phonation type, speaking effort, and the vocal tract length of the synthetic utterances. Furthermore, two regularization methods were proposed: one based on the smoothness loss of articulatory trajectories and another based on the acoustic loss between original and predicted acoustic features.
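The smoothness regularization mentioned above can be sketched directly; this is one plausible form of the penalty (mean squared first difference over time), with the loss weight and the trajectory arrays being hypothetical:

```python
import numpy as np

def smoothness_loss(traj):
    """Mean squared frame-to-frame difference of an articulatory
    trajectory (time along axis 0): penalizes jerky movements."""
    d = np.diff(traj, axis=0)
    return float(np.mean(d ** 2))

def total_loss(pred_traj, true_traj, lam=0.1):
    """Regression loss on articulatory parameters plus a weighted
    smoothness penalty; lam is a hypothetical trade-off weight."""
    mse = float(np.mean((pred_traj - true_traj) ** 2))
    return mse + lam * smoothness_loss(pred_traj)
```

The second proposed regularizer replaces the smoothness term with an acoustic loss, comparing the original acoustic features against those predicted from the estimated trajectories.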
The best-performing genetic algorithm and convolutional LSTM systems (evaluated in terms of the difference between the estimated and reference VTL articulatory parameters) obtained average correlation coefficients of 0.985 and 0.983 for speaker-dependent utterances, respectively, and their reproduced speech achieved recognition accuracies of 86.25% and 64.69% for speaker-independent utterances of German words, respectively. When applied to German sentence utterances, as well as English and Mandarin Chinese word utterances, the neural-network-based ACS systems achieved recognition accuracies of 73.88%, 52.92%, and 52.41%, respectively. The results showed that both methods reproduced not only the articulatory processes but also the acoustic signals of the reference utterances. Moreover, the regularization methods led to more physiologically plausible articulatory processes and made the estimated articulatory trajectories better suited to VTL's articulatory preferences, thus reproducing more natural and intelligible speech. This study also found that the convolutional layers, when used in conjunction with batch normalization layers, automatically learned more distinctive features from log power spectrograms. Furthermore, the neural-network-based ACS systems trained using German data could be generalized to the utterances of other languages.
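The parameter-level evaluation described above amounts to averaging a Pearson correlation over articulatory parameters; a minimal sketch, with hypothetical estimated and reference trajectory arrays (time along rows, parameters along columns):

```python
import numpy as np

def mean_param_correlation(est, ref):
    """Average Pearson correlation between estimated and reference
    articulatory trajectories, computed per parameter (column)."""
    cors = [np.corrcoef(est[:, j], ref[:, j])[0, 1]
            for j in range(ref.shape[1])]
    return float(np.mean(cors))
```

A value near 1.0, as reported for the speaker-dependent systems, indicates that the estimated trajectories track the reference VTL parameters almost perfectly up to scale and offset.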
Aspects of the characterisation of phonological speech disorders vs. delayed language acquisition in children speaking Jordanian Arabic
Bader S'da SI. Issues in the characterisation of phonological speech impairment vs. delayed acquisition in Jordanian Arabic-speaking children. Bielefeld (Germany): Bielefeld University; 2010. A study of the acquisition of Jordanian Arabic by young native speakers: children speaking or acquiring Jordanian Arabic with or without phonological impairments.
Behavioural and neural insights into the recognition and motivational salience of familiar voice identities
The majority of voices encountered in everyday life belong to people we know, such as close friends, relatives, or romantic partners. However, research to date has overlooked this type of familiarity when investigating voice identity perception. This thesis aimed to address this gap in the literature, through a detailed investigation of voice perception across different types of familiarity: personally familiar voices, famous voices, and lab-trained voices. The experimental chapters of the thesis cover two broad research topics: 1) Measuring the recognition and representation of personally familiar voice identities in comparison with lab-trained identities, and 2) Investigating motivation and reward in relation to hearing personally valued voices compared with unfamiliar voice identities. In the first of these, an exploration of the extent of human voice recognition capabilities was undertaken using personally familiar voices of romantic partners. The perceptual benefits of personal familiarity for voice and speech perception were examined, as well as an investigation into how voice identity representations are formed through exposure to new voice identities. Evidence for highly robust voice representations for personally familiar voices was found in the face of perceptual challenges, which greatly exceeded those found for lab-trained voices of varying levels of familiarity. Conclusions are drawn about the relevance of the amount and type of exposure on speaker recognition, the expertise we have with certain voices, and the framing of familiarity as a continuum rather than a binary categorisation. The second topic utilised voices of famous singers and their “super-fans” as listeners to probe reward and motivational responses to hearing these valued voices, using behavioural and neuroimaging experiments.
Listeners were found to work harder, as evidenced by faster reaction times, to hear their musical idol compared to less valued voices in an effort-based decision-making task, and the neural correlates of these effects are reported and examined.
Individual Differences in Speech Production and Perception
Inter-individual variation in speech is a topic of increasing interest both in human sciences and speech technology. It can yield important insights into biological, cognitive, communicative, and social aspects of language. Written by specialists in psycholinguistics, phonetics, speech development, speech perception and speech technology, this volume presents experimental and modeling studies that provide the reader with a deep understanding of interspeaker variability and its role in speech processing, speech development, and interspeaker interactions. It discusses how theoretical models take into account individual behavior, explains why interspeaker variability enriches speech communication, and summarizes the limitations of the use of speaker information in forensics.
The impact of head and body postures on the acoustic speech signal
This dissertation is aimed at investigating the impact of postural changes within speakers on the acoustic speech signal to complement research on articulatory changes under the same conditions. The research is therefore relevant for forensic phonetics, where quantifying within-speaker variation is vital for the accuracy of speaker comparison.
To this end, two acoustic studies were carried out to quantify the influence of five head positions and three body orientations on the acoustic speech signal. Results show that there is a consistent change in the third formant, a change which was most evident in the body orientation measurements, and to a lesser extent in the head position data. Analysis of the results with respect to compensation strategies indicates that speakers employ different strategies to compensate for these perturbations to their vocal tract. Some speakers did not exhibit large differences in their speech signal, while others appeared to compensate much less. Across all speakers, the effect was much stronger in what were deemed ‘less natural’ postures. That is, speakers were apparently less able to predict and compensate for the impact of prone body orientation on their speech than for that of the more natural supine orientation.
In addition to the acoustic studies, a perception experiment assessed whether listeners could make use of acoustic cues to determine the posture of the speaker. Stimuli were chosen with, by design, stronger or weaker acoustic cues to posture, in order to elicit a possible difference in identification performance. Listeners were nevertheless not able to identify above chance whether a speaker was sitting or lying in prone body orientation even when hearing the set with stronger cues.
Further combined articulatory and acoustic research will have to be carried out to disentangle which articulatory behaviours correlate with the acoustic changes presented, in order to draw a more comprehensive picture of the effects of postural variation on speech. This work was supported by the Arts and Humanities Research Council.
Sociolinguistic competence and the bilingual's adoption of phonetic variants: auditory and instrumental data from English-Arabic bilinguals
This study is an auditory and acoustic investigation of the speech production patterns developed by English-Arabic bilingual children. The subjects are three Lebanese children aged five, seven and ten, all born and raised in Yorkshire, England. Monolingual friends of the same age were chosen as controls, and the parents of all bilingual and monolingual children were also taped to obtain a detailed assessment of the sound patterns available in the subjects' environment. The study addresses the question of interaction between the bilingual's phonological systems by calling for a refinement of the notion of a 'phonological system' using insights from recent phonetic and sociolinguistic work on variability in speech (e.g. Docherty, Foulkes, Tillotson, & Watt, 2002; Docherty & Foulkes, 2000; Local, 1983; Pisoni, 1997; Roberts, 1997; Scobbie, 2002). The variables under study include /l/, /r/, and VOT production. These were chosen due to the existence of different patterns in their production in English and Arabic that vary according to contextual and dialectal factors. Data were collected using a variety of picture-naming, story-telling, and free-play activities for the children, and reading lists, story-telling, and interviews for the adults. To control for language mode (Grosjean, 1998), the bilinguals were recorded in different language sessions with different interviewers.
Results for the monolingual children and adults in this study underline the importance of including controls in any study of bilingual speech development for a better interpretation of the bilinguals' patterns. Input from the adults proved highly variable and at times conflicted with published patterns normally found in the literature for the variables under study. Results for the bilinguals show that they have developed separate, sociolinguistically appropriate production patterns for each of their languages that are on the whole similar to those of monolinguals but that also reflect the bilinguals' rich socio-phonetic repertoire. The interaction between the bilinguals' languages is mainly restricted to the bilingual mode and is a sign of their developing sociolinguistic competence.
Physical mechanisms may be as important as brain mechanisms in evolution of speech [Commentary on Ackermann, Hage, & Ziegler, Brain mechanisms of acoustic communication in humans and nonhuman primates: an evolutionary perspective]
We present two arguments why physical adaptations for vocalization may be as important as neural adaptations. First, fine control over vocalization is not easy for physical reasons, and modern humans may be exceptional. Second, we present an example of a gorilla that shows rudimentary voluntary control over vocalization, indicating that some neural control is already shared with great apes.