5,556 research outputs found

    Speaker Normalization Using Cortical Strip Maps: A Neural Model for Steady State Vowel Categorization

    Full text link
    Auditory signals of speech are speaker-dependent, but representations of language meaning are speaker-independent. The transformation from speaker-dependent to speaker-independent language representations enables speech to be learned and understood from different speakers. A neural model is presented that performs speaker normalization to generate a pitch-independent representation of speech sounds, while also preserving information about speaker identity. This speaker-invariant representation is categorized into unitized speech items, which input to sequential working memories whose distributed patterns can be categorized, or chunked, into syllable and word representations. The proposed model fits into an emerging model of auditory streaming and speech categorization. The auditory streaming and speaker normalization parts of the model both use multiple strip representations and asymmetric competitive circuits, thereby suggesting that these two circuits arose from similar neural designs. The normalized speech items are rapidly categorized and stably remembered by Adaptive Resonance Theory circuits. Simulations use synthesized steady-state vowels from the Peterson and Barney [J. Acoust. Soc. Am. 24, 175-184 (1952)] vowel database and achieve accuracy rates similar to those achieved by human listeners. These results are compared to behavioral data and other speaker normalization models.
    National Science Foundation (SBE-0354378); Office of Naval Research (N00014-01-1-0624)

    Systems control theory applied to natural and synthetic musical sounds

    Get PDF
    Systems control theory is a well-developed field that supports the study of stability, estimation, and control of dynamical systems. The physical behaviour of musical instruments, once described by dynamical systems, can then be controlled and numerically simulated for many purposes. The aim of this paper is twofold: first, to provide the theoretical background on linear system theory, both in continuous and discrete time, mainly in the case of a finite number of degrees of freedom; second, to give illustrative examples on wind instruments, such as the vocal tract represented as a waveguide, and a sliding flute.
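    As a minimal illustration of the discrete-time, finite-degrees-of-freedom setting the paper covers, the sketch below simulates one resonant mode of an instrument as a damped harmonic oscillator. The frequency, damping, and sample rate are illustrative values, not taken from the paper.

```python
import math

# One resonant mode of an instrument body or air column, modeled as a
# damped harmonic oscillator (a single degree of freedom):
#     x'' + 2*zeta*w0*x' + w0**2 * x = u(t)
w0 = 2 * math.pi * 440.0   # natural frequency in rad/s (A4)
zeta = 0.01                # damping ratio
dt = 1.0 / 44100.0         # sample period at audio rate

# Discrete-time free response (u = 0) via semi-implicit Euler, a simple
# scheme that stays stable here because dt*w0 is well below 2.
x, v = 1.0, 0.0            # initial displacement and velocity
trajectory = []
for _ in range(1000):
    trajectory.append(x)
    v += dt * (-(w0 ** 2) * x - 2.0 * zeta * w0 * v)
    x += dt * v
```

    The stored displacement oscillates near 440 Hz while its envelope decays; a waveguide or multi-mode instrument model stacks many such modes into one larger state-space system.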

    Computational and Robotic Models of Early Language Development: A Review

    Get PDF
    We review computational and robotics models of early language learning and development. We first explain why and how these models are used to understand better how children learn language. We argue that they provide concrete theories of language learning as a complex dynamic system, complementing traditional methods in psychology and linguistics. We review different modeling formalisms, grounded in techniques from machine learning and artificial intelligence such as Bayesian and neural network approaches. We then discuss their role in understanding several key mechanisms of language development: cross-situational statistical learning, embodiment, situated social interaction, intrinsically motivated learning, and cultural evolution. We conclude by discussing future challenges for research, including modeling of large-scale empirical data about language acquisition in real-world environments.
    Keywords: early language learning, computational and robotic models, machine learning, development, embodiment, social interaction, intrinsic motivation, self-organization, dynamical systems, complexity.
    Comment: to appear in International Handbook on Language Development, ed. J. Horst and J. von Koss Torkildsen, Routledge
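    The cross-situational statistical learning mechanism mentioned above can be sketched with a toy co-occurrence counter; the scenes, word forms, and referents below are invented for illustration and do not come from any of the reviewed models.

```python
from collections import defaultdict

# Toy cross-situational learner: each scene is ambiguous on its own (every
# word is paired with every object present), but counting co-occurrences
# across scenes lets the correct word-object mapping emerge.
scenes = [
    ({"ba", "da"}, {"BALL", "DOG"}),   # (words heard, objects in view)
    ({"ba", "ku"}, {"BALL", "CUP"}),
    ({"da", "ku"}, {"DOG", "CUP"}),
]

counts = defaultdict(lambda: defaultdict(int))
for words, objects in scenes:
    for word in words:
        for obj in objects:
            counts[word][obj] += 1

# Map each word to the referent it co-occurred with most often.
lexicon = {word: max(refs, key=refs.get) for word, refs in counts.items()}
```

    After three scenes, each word has co-occurred twice with its true referent and only once with each distractor, so the ambiguity resolves.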

    Speaker Normalization Using Cortical Strip Maps: A Neural Model for Steady State Vowel Identification

    Full text link
    Auditory signals of speech are speaker-dependent, but representations of language meaning are speaker-independent. Such a transformation enables speech to be understood from different speakers. A neural model is presented that performs speaker normalization to generate a pitch-independent representation of speech sounds, while also preserving information about speaker identity. This speaker-invariant representation is categorized into unitized speech items, which input to sequential working memories whose distributed patterns can be categorized, or chunked, into syllable and word representations. The proposed model fits into an emerging model of auditory streaming and speech categorization. The auditory streaming and speaker normalization parts of the model both use multiple strip representations and asymmetric competitive circuits, thereby suggesting that these two circuits arose from similar neural designs. The normalized speech items are rapidly categorized and stably remembered by Adaptive Resonance Theory circuits. Simulations use synthesized steady-state vowels from the Peterson and Barney [J. Acoust. Soc. Am. 24, 175-184 (1952)] vowel database and achieve accuracy rates similar to those achieved by human listeners. These results are compared to behavioral data and other speaker normalization models.
    National Science Foundation (SBE-0354378); Office of Naval Research (N00014-01-1-0624)

    Power balanced time-varying lumped parameter model of a vocal tract: modelling and simulation

    Get PDF
    Voice and speech production greatly relies on the ability of the vocal tract to articulate a wide variety of sounds. This ability is related to the accurate control of the geometry (and its variations in space and time) in order to generate vowels (including diphthongs) and consonants. Some well-known vibro-acoustic models of the vocal tract rely on a discretized geometry, such as concatenated cylinders, the radius of which varies in time to account for the articulation (see e.g. Maeda, Speech Comm. 1:199-229, 1982). We here propose a lumped parameter model of waves in the vocal tract considering the motion of the boundaries. Particular attention is paid to passivity and the well-posedness of the power balance in the context of time-varying geometrical parameters. To this end, the proposed model is recast in the theoretical framework of port-Hamiltonian systems, which ensures the power balance. The modularity of this framework is also well-suited to interconnecting this model with that of deformable walls (in a power-balanced way). We show the capacities of the model in two time-domain numerical experiments: first for a static configuration (time-invariant geometry), then a dynamic one (time-varying geometries) of a two-cylinder vocal tract.
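    For readers unfamiliar with the framework, a generic port-Hamiltonian system with state x, stored energy (Hamiltonian) H, skew-symmetric interconnection matrix J, positive semi-definite dissipation matrix R, and port matrix G takes the form

    \dot{x} = \big( J(x) - R(x) \big)\, \nabla H(x) + G(x)\, u, \qquad y = G(x)^{\top}\, \nabla H(x),

    so that the power balance follows by the chain rule:

    \frac{\mathrm{d}H}{\mathrm{d}t} = -\nabla H^{\top} R\, \nabla H + y^{\top} u \;\le\; y^{\top} u.

    This is only the generic form guaranteeing passivity; the paper's specific choice of state variables and energy function for the time-varying vocal tract is not reproduced here.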

    Cepstral peak prominence: a comprehensive analysis

    Full text link
    An analytical study of cepstral peak prominence (CPP) is presented, intended to provide insight into its meaning and its relation to voice perturbation parameters. To carry out this analysis, a parametric approach is adopted in which voice production is modelled using the traditional source-filter model and the first cepstral peak is assumed to have Gaussian shape. It is concluded that the meaning of CPP is very similar to that of the first rahmonic, and some insights are provided on its dependence on fundamental frequency and vocal tract resonances. It is further shown that CPP integrates measures of voice waveform and periodicity perturbations, be they amplitude, frequency, or noise perturbations.
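    A minimal sketch of how a CPP measure of this kind is typically computed: take the log-magnitude spectrum of a windowed frame, compute the real cepstrum, and measure the cepstral peak's height above a linear regression line. The windowing, scaling, and search-range choices below are simplifying assumptions, not the paper's.

```python
import numpy as np

def cepstral_peak_prominence(frame, fs, f0_range=(60.0, 400.0)):
    """Cepstral peak height above a linear trend fitted to the cepstrum,
    searched over quefrencies of plausible fundamental frequencies."""
    n = len(frame)
    spectrum = np.fft.rfft(frame * np.hanning(n))
    log_mag = 20.0 * np.log10(np.abs(spectrum) + 1e-12)
    cepstrum = np.fft.irfft(log_mag)            # real cepstrum (length n)

    q = np.arange(len(cepstrum)) / fs           # quefrency axis in seconds
    lo, hi = int(fs / f0_range[1]), int(fs / f0_range[0])
    peak_idx = lo + np.argmax(cepstrum[lo:hi])

    # Linear regression of the cepstrum over the search range.
    slope, intercept = np.polyfit(q[lo:hi], cepstrum[lo:hi], 1)
    return cepstrum[peak_idx] - (slope * q[peak_idx] + intercept)

# A strongly periodic (voiced-like) signal yields a higher CPP than noise.
fs = 16000
t = np.arange(4096) / fs
periodic = np.sign(np.sin(2 * np.pi * 120.0 * t))   # harmonically rich source
noise = np.random.default_rng(0).standard_normal(4096)
```

    The regression baseline is what distinguishes CPP from a raw cepstral peak: it normalizes the peak against the overall cepstral trend, which is why CPP tracks periodicity rather than absolute level.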

    The Fricative Sound Source Spectrum Derived From a Vocal Tract Analog.

    Get PDF
    The applications of speech synthesis for computer voice response and speech analysis present the need for highly intelligible and natural synthesized speech. In order to improve the synthesis of fricative and related sounds, the use of simple models for the source spectrum of fricative sounds is investigated. The investigation is based on the use of a vocal tract analog and experimental measurements. Measurements of the sound pressure spectra of fricative consonants are made. Simple sound pressure measurements and measurements based on the technique for measuring intensity are utilized. The fricatives studied are /f/, /th/, /s/, /sh/, and /h/. Fricative sound source spectra are determined by applying an inverse filter to the measured fricative sound pressure spectra. The inverse filtering function is derived from a vocal tract analog. The resulting fricative source spectra are fit to a truncated Fourier series. The results show that structure is evident in all the source spectra except /f/. The presence of structure was related to turbulent flows. The structure of turbulent flows is relevant since fricative sound production is induced by turbulence. The structure of turbulent flows with Reynolds number near the critical Reynolds number is dependent on the initial conditions, the boundary conditions, and the nonlinearity of the Navier-Stokes equations. These three factors are tied together by bifurcation theory, which is used to explain the structure present in the fricative source spectra. The possibility that the structure is a by-product of the vocal tract analog is also allowed. In any case, the structure evident in the source spectra indicates that the use of simple models for the source spectra of fricative sounds is in error or that the vocal tract analog requires revision. The fricative source spectra determined in this study can be used in future speech synthesizers. Also, the same procedure employed in this study can be used for speech analysis of speech-impaired subjects.
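    The inverse-filtering step described above can be sketched numerically: divide the measured magnitude spectrum by the tract's transfer magnitude to recover the source spectrum. The single-resonance "tract" and spectra below are invented for illustration and are not the analog used in the study.

```python
import numpy as np

fs = 16000
freqs = np.fft.rfftfreq(1024, d=1.0 / fs)   # frequency axis, 0 to fs/2

# Hypothetical vocal-tract magnitude response: one resonance near 4 kHz
# (standard second-order resonator magnitude, quality factor 5).
fc, q_factor = 4000.0, 5.0
tract_mag = 1.0 / np.sqrt((1.0 - (freqs / fc) ** 2) ** 2
                          + (freqs / (fc * q_factor)) ** 2)

# Assumed "true" source spectrum: broadband with a gentle downward tilt.
source_mag = 1.0 / (1.0 + freqs / 8000.0)

# Measured spectrum at the lips = source shaped by the tract.
measured_mag = source_mag * tract_mag

# Inverse filter: divide out the tract response to recover the source.
recovered_source = measured_mag / tract_mag
```

    In practice the division is done with a measured spectrum and a transfer function derived from the analog, so the recovery is only as good as the analog itself, which is exactly the caveat the abstract raises.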

    Neurally driven synthesis of learned, complex vocalizations

    Get PDF
    Brain machine interfaces (BMIs) hold promise to restore impaired motor function and serve as powerful tools to study learned motor skill. While limb-based motor prosthetic systems have leveraged nonhuman primates as an important animal model,1–4 speech prostheses lack a similar animal model and are more limited in terms of neural interface technology, brain coverage, and behavioral study design.5–7 Songbirds are an attractive model for learned complex vocal behavior. Birdsong shares a number of unique similarities with human speech,8–10 and its study has yielded general insight into multiple mechanisms and circuits behind learning, execution, and maintenance of vocal motor skill.11–18 In addition, the biomechanics of song production bear similarity to those of humans and some nonhuman primates.19–23 Here, we demonstrate a vocal synthesizer for birdsong, realized by mapping neural population activity recorded from electrode arrays implanted in the premotor nucleus HVC onto low-dimensional compressed representations of song, using simple computational methods that are implementable in real time. Using a generative biomechanical model of the vocal organ (syrinx) as the low-dimensional target for these mappings allows for the synthesis of vocalizations that match the bird's own song. These results provide proof of concept that high-dimensional, complex natural behaviors can be directly synthesized from ongoing neural activity. This may inspire similar approaches to prosthetics in other species by exploiting knowledge of the peripheral systems and the temporal structure of their output.
    Affiliations: Arneodo, Ezequiel Matías (University of California, United States; Instituto de Física La Plata, CONICET - Universidad Nacional de La Plata, Argentina). Chen, Shukai (University of California, United States). Brown, Daril E. (University of California, United States). Gilja, Vikash (University of California, United States). Gentner, Timothy Q. (The Kavli Institute for Brain and Mind, United States; University of California, United States).
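    The decoding idea, mapping high-dimensional neural activity onto a low-dimensional song representation with simple, real-time-capable methods, can be sketched with a linear least-squares readout on synthetic data. All dimensions and data below are invented; the study's actual low-dimensional target was a biomechanical syrinx model.

```python
import numpy as np

# Synthetic stand-in for the decoding problem: many neural channels, a few
# song parameters, and a linear readout fitted by ordinary least squares.
rng = np.random.default_rng(0)
n_samples, n_channels, n_params = 500, 32, 3

true_readout = rng.standard_normal((n_channels, n_params))
neural = rng.standard_normal((n_samples, n_channels))   # e.g. binned firing rates
song_params = (neural @ true_readout
               + 0.01 * rng.standard_normal((n_samples, n_params)))

# Fit the readout on the first half of the data, decode the second half.
# At run time, decoding each sample is one small matrix-vector product,
# cheap enough for real-time synthesis.
readout, *_ = np.linalg.lstsq(neural[:250], song_params[:250], rcond=None)
decoded = neural[250:] @ readout
```

    The decoded low-dimensional parameters would then drive a generative model of the periphery to produce sound, rather than being converted to audio directly.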

    Probabilistic generative modeling of speech

    Get PDF
    Speech processing refers to a set of tasks that involve speech analysis and synthesis. Most speech processing algorithms model a subset of speech parameters of interest and blur the rest using signal processing techniques and feature extraction. However, evidence shows that many speech parameters can be more accurately estimated if they are modeled jointly; speech synthesis also benefits from joint modeling. This thesis proposes a probabilistic generative model for speech called the Probabilistic Acoustic Tube (PAT). The highlights of the model are threefold. First, it is among the very first works to build a complete probabilistic model for speech. Second, it has a well-designed model for the phase spectrum of speech, which has been hard to model and often neglected. Third, it models the AM-FM effects in speech, which are perceptually significant but often ignored in frame-based speech processing algorithms. Experiments show that the proposed model has good potential for a number of speech processing tasks.