The Self-Organization of Speech Sounds
The speech code is a vehicle of language: it defines
a set of forms used by a community to carry information.
Such a code is necessary to support the linguistic
interactions that allow humans to communicate.
How then may a speech code be formed prior to the
existence of linguistic interactions?
Moreover, the human speech code is discrete and compositional,
shared by all the individuals of a community but different
across communities, and phoneme inventories are characterized by
statistical regularities. How can a speech code with these properties form?
We try to approach these questions in the paper,
using the "methodology of the artificial". We
build a society of artificial agents, and detail a mechanism that
shows the formation of a discrete speech code without pre-supposing
the existence of linguistic capacities or of coordinated interactions.
The mechanism is based on a low-level model of
sensory-motor interactions. We show that the integration of certain very
simple and non language-specific neural devices
leads to the formation of a speech code that
has properties similar to the human speech code.
This result relies on the self-organizing properties of a generic
coupling between perception and production
within agents, and on the interactions between agents.
The artificial system helps us to develop better intuitions on how speech
might have appeared, by showing how self-organization
might have helped natural selection to find speech
From Holistic to Discrete Speech Sounds: The Blind Snow-Flake Maker Hypothesis
Sound is a medium used by humans to carry information.
The existence of this kind of
medium is a pre-requisite for language. It is organized
into a code, called speech, which
provides a repertoire of forms that is shared in each
language community. This code is necessary to support the linguistic
interactions that allow humans to communicate.
How then may a speech code be formed prior to the
existence of linguistic interactions?
Moreover, the human speech code is characterized by several
properties: speech is digital and compositional (vocalizations
are made of units re-used systematically in other syllables);
phoneme inventories have precise regularities as well as
great diversity in human languages; all the speakers of a
language community categorize sounds in the same manner,
but each language has its own system of categorization,
possibly very different from every other.
How can a speech code with these properties form?
These are the questions we will approach in the paper. We will
study them using the method of the artificial. We will
build a society of artificial agents, and study what mechanisms
may provide answers. This will not prove directly what mechanisms
were used for humans, but rather give ideas about what kind
of mechanism may have been used. This allows us to shape the
search space of possible answers, in particular by showing
what is sufficient and what is not necessary.
The mechanism we present is based on a low-level model of
sensory-motor interactions. We show that the integration of certain very
simple and non language-specific neural devices
allows a population of agents to build a speech code that
has the properties mentioned above. The originality is
that it pre-supposes neither a functional pressure for
communication, nor the ability to have coordinated
social interactions (they do not play language or imitation
games). It relies on the self-organizing properties of a generic
coupling between perception and production both
within agents, and on the interactions between agents
From Analogue to Digital Vocalizations
Sound is a medium used by humans to carry information.
The existence of this kind of
medium is a pre-requisite for language. It is organized
into a code, called speech, which
provides a repertoire of forms that is shared in each
language community. This code is necessary to support the linguistic
interactions that allow humans to communicate.
How then may a speech code be formed prior to the
existence of linguistic interactions?
Moreover, the human speech code is characterized by several
properties: speech is digital and compositional (vocalizations
are made of units re-used systematically in other syllables);
phoneme inventories have precise regularities as well as
great diversity in human languages; all the speakers of a
language community categorize sounds in the same manner,
but each language has its own system of categorization,
possibly very different from every other.
How can a speech code with these properties form?
These are the questions we will approach in the paper. We will
study them using the method of the artificial. We will
build a society of artificial agents, and study what mechanisms
may provide answers. This will not prove directly what mechanisms
were used for humans, but rather give ideas about what kind
of mechanism may have been used. This allows us to shape the
search space of possible answers, in particular by showing
what is sufficient and what is not necessary.
The mechanism we present is based on a low-level model of
sensory-motor interactions. We show that the integration of certain very
simple and non language-specific neural devices
allows a population of agents to build a speech code that
has the properties mentioned above. The originality is
that it pre-supposes neither a functional pressure for
communication, nor the ability to have coordinated
social interactions (they do not play language or imitation
games). It relies on the self-organizing properties of a generic
coupling between perception and production both
within agents, and on the interactions between agents
Estimating underlying articulatory targets of Thai vowels by using deep learning based on generating synthetic samples from a 3D vocal tract model and data augmentation
Representation learning is one of the fundamental issues in modeling articulatory-based speech synthesis using target-driven models. This paper proposes a computational strategy for learning underlying articulatory targets from a 3D articulatory speech synthesis model using a bi-directional long short-term memory recurrent neural network based on a small set of representative seed samples. From a seeding set, a larger training set was generated that provided richer contextual variations for the model to learn. The deep learning model for acoustic-to-target mapping was then trained to model the inverse relation of the articulation process. This method allows the trained model to map the given acoustic data onto the articulatory target parameters which can then be used to identify the distribution based on linguistic contexts. The model was evaluated based on its effectiveness in mapping acoustics to articulation, and the perceptual accuracy of speech reproduced from the estimated articulation. The results indicate that the model can accurately imitate speech with a high degree of phonemic precision
Self-Supervised Solution to the Control Problem of Articulatory Synthesis
Given an articulatory-to-acoustic forward model, it is a priori
unknown how its motor control must be operated to achieve a
desired acoustic result. This control problem is a fundamental
issue of articulatory speech synthesis and the cradle of acoustic-to-articulatory inversion, a discipline which attempts to address
the issue by the means of various methods. This work presents
an end-to-end solution to the articulatory control problem, in
which synthetic motor trajectories of Monte-Carlo-generated
artificial speech are linked to input modalities (such as natural speech recordings or phoneme sequence input) via speaker-independent latent representations of a vector-quantized variational autoencoder. The proposed method is self-supervised and
thus, in principle, synthesizer and speaker model independent
Artificial Vocal Learning guided by Phoneme Recognition and Visual Information
This paper introduces a paradigm shift regarding vocal learning simulations, in which the communicative function of speech acquisition determines the learning process and intelligibility is considered the primary measure of learning success. Thereby, a novel approach for artificial vocal learning is presented that utilizes deep neural network-based phoneme recognition in order to calculate the speech acquisition objective function. This function guides a learning framework that involves the state-of-the-art articulatory speech synthesizer VocalTractLab as the motor-to-acoustic forward model. In this way, an extensive set of German phonemes, including most of the consonants and all stressed vowels, was produced successfully. The synthetic phonemes were rated as highly intelligible by human listeners. Furthermore, it is shown that visual speech information, such as lip and jaw movements, can be extracted from video recordings and be incorporated into the learning framework as an additional loss component during the optimization process. It was observed that this visual loss did not increase the overall intelligibility of phonemes. Instead, the visual loss acted as a regularization mechanism that facilitated the finding of more biologically plausible solutions in the articulatory domain
Simulating vocal learning of spoken language: Beyond imitation
Computational approaches have an important role to play in understanding the complex process of speech acquisition, in general, and have recently been popular in studies of vocal learning in particular. In this article we suggest that two significant problems associated with imitative vocal learning of spoken language, the speaker normalisation and phonological correspondence problems, can be addressed by linguistically grounded auditory perception. In particular, we show how the articulation of consonant-vowel syllables may be learnt from auditory percepts that can represent either individual utterances by speakers with different vocal tract characteristics or ideal phonetic realisations. The result is an optimisation-based implementation of vocal exploration – incorporating semantic, auditory, and articulatory signals – that can serve as a basis for simulating vocal learning beyond imitation
Repeat after me: Self-supervised learning of acoustic-to-articulatory mapping by vocal imitation
We propose a computational model of speech production combining a pre-trained
neural articulatory synthesizer able to reproduce complex speech stimuli from a
limited set of interpretable articulatory parameters, a DNN-based internal
forward model predicting the sensory consequences of articulatory commands, and
an internal inverse model based on a recurrent neural network recovering
articulatory commands from the acoustic speech input. Both forward and inverse
models are jointly trained in a self-supervised way from raw acoustic-only
speech data from different speakers. The imitation simulations are evaluated
objectively and subjectively and display quite encouraging performances
Exploration strategies for articulatory synthesis of complex syllable onsets
High-quality articulatory speech synthesis has many potential applications in speech science and technology. However, developing appropriate mappings from linguistic specification to articulatory gestures is difficult and time consuming. In this paper we construct an optimisation-based framework as a first step towards learning these mappings without manual intervention. We demonstrate the production of CCV syllables and discuss the quality of the articulatory gestures with reference to coarticulation
Early Human Vocalization Development: A Collection of Studies Utilizing Automated Analysis of Naturalistic Recordings and Neural Network Modeling
Understanding early human vocalization development is a key part of understanding the origins of human communication. What are the characteristics of early human vocalizations and how do they change over time? What mechanisms underlie these changes? This dissertation is a collection of three papers that take a computational approach to addressing these questions, using neural network simulation and automated analysis of naturalistic data. The first paper uses a self-organizing neural network to automatically derive holistic acoustic features characteristic of prelinguistic vocalizations. A supervised neural network is used to classify vocalizations into human-judged categories and to predict the age of the child vocalizing. The study represents a first step toward taking a data-driven approach to describing infant vocalizations. Its performance in classification represents progress toward developing automated analysis tools for coding infant vocalization types. The second paper is a computational model of early vocal motor learning. It adapts a popular type of neural network, the self-organizing map, in order to control a vocal tract simulator and in order to have learning be dependent on whether the model's actions are reinforced. The model learns both to control production of sound at the larynx (phonation), an early-developing skill that is a prerequisite for speech, and to produce vowels that gravitate toward the vowels in a target language (either English or Korean) for which it is reinforced. The model provides a computationally-specified explanation for how neuromotor representations might be acquired in infancy through the combination of exploration, reinforcement, and self-organized learning. The third paper utilizes automated analysis to uncover patterns of vocal interaction between child and caregiver that unfold over the course of day-long, totally naturalistic recordings. The participants include 16- to 48-month-old children with and without autism.
Results are consistent with the idea that there is a social feedback loop wherein children produce speech-related vocalizations, these are preferentially responded to by adults, and this contingency of adult response shapes future child vocalizations. Differences in components of this feedback loop are observed in autism, as well as with different maternal education levels