Computational Models for Sensorimotor Integration and Word Learning
This dissertation investigates computational models for sensorimotor integration and word learning in pre-linguistic development. In particular, computational models are investigated for three problems: (1) acoustic-to-articulatory mapping, or speech inversion, (2) speech motor skill acquisition and speech production, and (3) cross-situational noun learning. For the first problem, we show that the simpler general regression neural network model performs on par with, if not better than, the state-of-the-art deep belief network in experiments with the MOCHA-TIMIT and MNGU0 databases. For the second problem, we propose a developmental agent with perception (audio), action (vocalization), and learning capabilities, set in the predictive coding framework. We show that, when exposed to an environment of linguistic sounds (the Harvard-Haskins database of regularly-timed speech) without any external reinforcement signal, the agent learns to generate speech-like sounds (acoustic babbling followed by proto-syllables and vowels) as well as the timing for motor command execution. Random goal exploration leads to the self-organization of developmental stages of vocal sequences in the agent, driven by the increasing complexity of its vocalizations. For the third problem, we investigate reinforcement learning models for early word learning. Both cross-situational learning and social pragmatic theory are taken into account. As social cues, joint attention and prosodic cues in the caregiver's speech are considered. We show that, when a reinforcement learning model is exposed to a group of speakers, it comes to understand an initial set of vocabulary items belonging to the language used by the group. In standard experiments with the CHILDES dataset, the attentional-prosodic deep Q-network model outperforms existing word learning models.
A predictive coding framework for a developmental agent: Speech motor skill acquisition and speech production
Predictive coding has been hypothesized as a universal principle guiding the operation of different brain areas. In this paper, a predictive coding framework for a developmental agent with perception (audio), action (vocalization), and learning capabilities is proposed. The agent concurrently learns to plan optimally and to associate sensory and motor parameters by minimizing the sensory prediction error in an unsupervised manner. The proposed agent is driven solely by sensory prediction error and does not require reinforcement. It learns initially by self-exploration and later by imitation of the ambient environment. Our goal is to investigate the process of speech motor skill acquisition and speech production in such an agent. Standard vocal exploration experiments show that it learns to generate speech-like sounds (acoustic babbling followed by proto-syllables and vowels) as well as the timing for motor command execution. Random goal exploration leads to the self-organization of developmental stages of vocal sequences in the agent, driven by the increasing complexity of its vocalizations. The self-organization is invariant to certain acoustic feature representations. Self-exploration allows the agent to learn to imitate environmental sounds quickly. It learns to vocalize differently in different environments.
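As a rough illustration of the central mechanism described above, the sketch below shows a forward (sensory prediction) model being learned purely from the sensory prediction error during random motor babbling. The synthesizer stand-in, feature dimensions, and learning rate are illustrative assumptions, not details taken from the paper.

    # Minimal sketch of prediction-error-driven forward-model learning during
    # random vocal babbling. The synthesizer, dimensions, and learning rate are
    # illustrative assumptions, not the paper's actual configuration.
    import numpy as np

    rng = np.random.default_rng(0)
    MOTOR_DIM, ACOUSTIC_DIM, LR = 8, 12, 0.01

    # Stand-in for a vocal tract synthesizer mapping motor commands to acoustic
    # features. In the actual model this would be an articulatory synthesizer.
    TRUE_MAP = rng.normal(size=(ACOUSTIC_DIM, MOTOR_DIM))
    def vocalize(motor):
        return TRUE_MAP @ motor + 0.05 * rng.normal(size=ACOUSTIC_DIM)

    # Forward model the agent learns: predicts the sensory (acoustic) outcome
    # of a motor command, updated only from the sensory prediction error.
    W = np.zeros((ACOUSTIC_DIM, MOTOR_DIM))
    for step in range(5000):
        motor = rng.uniform(-1.0, 1.0, size=MOTOR_DIM)   # random babbling
        predicted = W @ motor                             # sensory prediction
        observed = vocalize(motor)                        # actual sensation
        error = observed - predicted                      # prediction error
        W += LR * np.outer(error, motor)                  # error-driven update

    print("remaining model error:", np.linalg.norm(TRUE_MAP - W))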
Speech recognition using cepstral articulatory features
Though speech recognition has been widely investigated over the past decades, the role of articulation in recognition has received scant attention. Recognition accuracy increases when recognizers are trained with acoustic features in conjunction with articulatory ones. Traditionally, acoustic features are represented by mel-frequency cepstral coefficients (MFCCs), while articulatory features are represented by the locations or trajectories of the articulators. We propose articulatory cepstral coefficients (ACCs), the cepstral coefficients of the time-location articulatory signal, as features. We show that ACCs yield state-of-the-art results in phoneme classification and recognition on benchmark datasets over a wide range of experiments. The similarity of MFCCs and ACCs, and their superior performance both in isolation and in conjunction, indicates that common algorithms can be used effectively for acoustic and articulatory signals.
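The abstract defines ACCs only as cepstral coefficients of the time-location articulatory signal, so the sketch below applies a standard cepstral pipeline (windowed frames, log magnitude spectrum, DCT) to an articulator trajectory; the frame length, hop, and number of coefficients are assumptions made for illustration.

    # Sketch of cepstral coefficients computed from an articulator trajectory
    # (e.g., one EMA channel). Framing, window, and coefficient count are
    # illustrative assumptions, not the paper's exact ACC recipe.
    import numpy as np
    from scipy.fftpack import dct

    def cepstral_coefficients(signal, frame_len=200, hop=80, n_coeffs=13):
        frames = []
        for start in range(0, len(signal) - frame_len + 1, hop):
            frame = signal[start:start + frame_len] * np.hamming(frame_len)
            log_spectrum = np.log(np.abs(np.fft.rfft(frame)) + 1e-8)
            # Cepstrum: DCT of the log magnitude spectrum, keep low quefrencies.
            frames.append(dct(log_spectrum, norm='ortho')[:n_coeffs])
        return np.array(frames)  # shape: (n_frames, n_coeffs)

    # Example: a synthetic articulator trajectory sampled at 500 Hz.
    t = np.arange(0, 2.0, 1.0 / 500.0)
    trajectory = np.sin(2 * np.pi * 3 * t) + 0.1 * np.random.randn(len(t))
    print(cepstral_coefficients(trajectory).shape)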
Improved speech inversion using general regression neural network
The problem of nonlinear acoustic-to-articulatory inversion mapping is investigated in the feature space using two models: the deep belief network (DBN), which is the state of the art, and the general regression neural network (GRNN). The task is to estimate a set of articulatory features for improved speech recognition. Experiments with the MOCHA-TIMIT and MNGU0 databases reveal that, for speech inversion, the GRNN yields a lower root-mean-square error and a higher correlation than the DBN. It is also shown that the conjunction of acoustic and GRNN-estimated articulatory features yields state-of-the-art accuracy in broad-class phonetic classification and phoneme recognition while using less computational power.
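A GRNN in the Specht sense is a non-iterative kernel regressor, which is one reason it can be cheaper to train and apply than a DBN; the minimal sketch below shows that structure used as an acoustic-to-articulatory mapping. The feature dimensions and the smoothing parameter sigma are illustrative assumptions, not values from the paper.

    # Minimal sketch of a general regression neural network (GRNN): a
    # kernel-weighted average of stored training targets.
    import numpy as np

    class GRNN:
        def __init__(self, sigma=1.0):
            self.sigma = sigma

        def fit(self, X, Y):
            # GRNN is non-iterative: it simply stores the training pairs.
            self.X, self.Y = np.asarray(X), np.asarray(Y)
            return self

        def predict(self, X_new):
            preds = []
            for x in np.asarray(X_new):
                d2 = np.sum((self.X - x) ** 2, axis=1)
                w = np.exp(-d2 / (2.0 * self.sigma ** 2))
                preds.append(w @ self.Y / (w.sum() + 1e-12))
            return np.array(preds)

    # Toy usage: map 13-dim "acoustic" frames to 14-dim "articulatory" frames.
    rng = np.random.default_rng(1)
    X_train, Y_train = rng.normal(size=(500, 13)), rng.normal(size=(500, 14))
    model = GRNN(sigma=1.0).fit(X_train, Y_train)
    print(model.predict(rng.normal(size=(3, 13))).shape)  # (3, 14)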
Emergence of vocal developmental sequences in a predictive coding model of speech acquisition
Learning temporal patterns among primitive speech sequences, and being able to control the motor apparatus for effective production of the learned patterns, are imperative for speech acquisition in infants. In this paper, we develop a predictive coding model whose objective is to minimize the sensory (auditory) and proprioceptive prediction errors. Temporal patterns are learned by minimizing the former, while control is learned by minimizing the latter. The model is trained on a set of synthetically generated syllables, as in other contemporary models. We show that the proposed model outperforms existing ones in learning vocalization classes. It also computes the control (muscle activation), which is useful for determining the ease of a vocalization.
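A minimal rendering of the two-term objective described above is sketched below: the total prediction error is the sum of an auditory term and a proprioceptive term. The quadratic form and equal default weighting are assumptions made for illustration, not the paper's exact formulation.

    # Sketch of the two-term prediction-error objective: auditory error drives
    # learning of temporal patterns, proprioceptive error drives control.
    import numpy as np

    def prediction_error(auditory_obs, auditory_pred, proprio_obs, proprio_pred,
                         w_auditory=1.0, w_proprio=1.0):
        """Combined sensory prediction error minimized by the model (assumed form)."""
        e_aud = np.sum((auditory_obs - auditory_pred) ** 2)   # temporal patterns
        e_pro = np.sum((proprio_obs - proprio_pred) ** 2)     # motor control
        return w_auditory * e_aud + w_proprio * e_pro

    # Toy call with random 12-dim auditory and 6-dim proprioceptive frames.
    rng = np.random.default_rng(2)
    print(prediction_error(rng.normal(size=12), rng.normal(size=12),
                           rng.normal(size=6), rng.normal(size=6)))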
Pragmatically Framed Cross-Situational Noun Learning Using Computational Reinforcement Models
Cross-situational learning and social pragmatic theories are prominent mechanisms for learning word meanings (i.e., word-object pairs). In this paper, the role of reinforcement is investigated for early word learning by an artificial agent. When exposed to a group of speakers, the agent comes to understand an initial set of vocabulary items belonging to the language used by the group. Both cross-situational learning and social pragmatic theory are taken into account. As social cues, joint attention and prosodic cues in the caregiver's speech are considered. During agent-caregiver interaction, the agent selects a word from the caregiver's utterance and learns the relations between that word and the objects in its visual environment. The "novel words to novel objects" language-specific constraint is assumed for computing rewards. The models are learned by maximizing the expected reward using reinforcement learning algorithms [i.e., table-based algorithms: Q-learning, SARSA, SARSA-λ, and neural network-based algorithms: Q-learning for neural network (Q-NN), neural-fitted Q-network (NFQ), and deep Q-network (DQN)]. Neural network-based reinforcement learning models are chosen over table-based models for better generalization and quicker convergence. Simulations are carried out using the mother-infant interaction CHILDES dataset for learning word-object pairings. Reinforcement is modeled in two cross-situational learning cases: (1) with joint attention (Attentional models), and (2) with joint attention and prosodic cues (Attentional-prosodic models). Attentional-prosodic models show superior performance to Attentional ones on the word-learning task. The Attentional-prosodic DQN outperforms existing word-learning models for the same task.
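The sketch below illustrates the reinforcement formulation in its simplest table-based form: tabular Q-learning over word-object choices, with a reward of 1 when the selected object is the one singled out by the social cue. The state/action encoding, scene generation, and reward scheme are simplified assumptions, not the paper's exact setup.

    # Minimal tabular Q-learning sketch for cross-situational word-object
    # learning. Reward = 1 when the chosen object is the one under joint
    # attention, 0 otherwise (simplified assumption).
    import random

    N_WORDS, N_OBJECTS = 10, 10
    ALPHA, GAMMA, EPSILON = 0.1, 0.0, 0.2   # bandit-like setting: gamma = 0
    Q = [[0.0] * N_OBJECTS for _ in range(N_WORDS)]
    true_referent = list(range(N_OBJECTS))   # hidden ground-truth pairing

    for episode in range(20000):
        word = random.randrange(N_WORDS)                    # word heard in utterance
        scene = random.sample(range(N_OBJECTS), 4)          # objects in view
        if true_referent[word] not in scene:
            scene[0] = true_referent[word]                  # referent is present
        if random.random() < EPSILON:                       # epsilon-greedy choice
            obj = random.choice(scene)
        else:
            obj = max(scene, key=lambda o: Q[word][o])
        reward = 1.0 if obj == true_referent[word] else 0.0 # joint-attention cue
        Q[word][obj] += ALPHA * (reward - Q[word][obj])     # value update

    learned = [max(range(N_OBJECTS), key=lambda o: Q[w][o]) for w in range(N_WORDS)]
    print("recovered pairs:", sum(l == t for l, t in zip(learned, true_referent)))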
Identifying hearing deficiencies from statistically learned speech features for personalized tuning of cochlear implants
Cochlear implants (CIs) are an effective intervention for individuals with severe-to-profound sensorineural hearing loss. Currently, no tuning procedure exists that can fully exploit the technology. We propose online unsupervised algorithms that learn features from the speech of a severely-to-profoundly hearing-impaired patient round-the-clock and compare those features to the ones learned from the normal-hearing population, using a set of neurophysiological metrics. Experimental results are presented. The information from this comparison can be exploited to modify the signal processing in a patient's CI to enhance the patient's audibility of speech.
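Because the abstract does not name the specific unsupervised algorithm or the neurophysiological metrics, the sketch below uses online k-means over speech frames as a stand-in for the online feature learner and a simple nearest-centroid distance as a stand-in comparison score; both are illustrative assumptions.

    # Sketch of the overall idea: learn speech features online without labels,
    # then compare them to a reference set from the normal-hearing population.
    # Online k-means and the Euclidean mismatch score are illustrative stand-ins.
    import numpy as np

    class OnlineKMeans:
        def __init__(self, n_clusters, dim, seed=0):
            rng = np.random.default_rng(seed)
            self.centroids = rng.normal(size=(n_clusters, dim))
            self.counts = np.ones(n_clusters)

        def update(self, frame):
            # Assign the incoming frame to the nearest centroid and nudge it.
            k = np.argmin(np.sum((self.centroids - frame) ** 2, axis=1))
            self.counts[k] += 1
            self.centroids[k] += (frame - self.centroids[k]) / self.counts[k]

    def feature_mismatch(patient_centroids, reference_centroids):
        # Mean distance from each patient centroid to its nearest reference
        # centroid; larger values suggest poorly covered regions of feature space.
        d = np.linalg.norm(patient_centroids[:, None, :] -
                           reference_centroids[None, :, :], axis=2)
        return d.min(axis=1).mean()

    # Toy run on random 13-dim "speech frames".
    rng = np.random.default_rng(3)
    patient, reference = OnlineKMeans(8, 13), OnlineKMeans(8, 13, seed=1)
    for _ in range(5000):
        patient.update(rng.normal(size=13))
        reference.update(rng.normal(size=13) + 0.5)
    print("mismatch score:", feature_mismatch(patient.centroids, reference.centroids))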