Recognizing Speech in a Novel Accent: The Motor Theory of Speech Perception Reframed
The motor theory of speech perception holds that we perceive the speech of
another in terms of a motor representation of that speech. However, when we
have learned to recognize a foreign accent, it seems plausible that
recognition of a word rarely involves reconstructing the speech gestures of
the speaker, as opposed to those of the listener. To better assess the motor
theory and this
observation, we proceed in three stages. Part 1 places the motor theory of
speech perception in a larger framework based on our earlier models of the
adaptive formation of mirror neurons for grasping, and for viewing extensions
of that mirror system as part of a larger system for neuro-linguistic
processing, augmented by the present consideration of recognizing speech in a
novel accent. Part 2 then offers a novel computational model of how a listener
comes to understand the speech of someone speaking the listener's native
language with a foreign accent. The core tenet of the model is that the
listener uses hypotheses about the word the speaker is currently uttering to
update probabilities linking the sound produced by the speaker to phonemes in
the native language repertoire of the listener. This, on average, improves the
recognition of later words. This model is neutral regarding the nature of the
representations it uses (motor vs. auditory). It serves as a reference point
for the discussion in Part 3, which proposes a dual-stream neuro-linguistic
architecture to revisit claims for and against the motor theory of speech
perception and the relevance of mirror neurons, and extracts some implications
for the reframing of the motor theory.
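To make the core tenet of the Part 2 model concrete, here is a minimal sketch of hypothesis-driven probability updating. All names, the phoneme-string encoding of sounds, and the additive-smoothing scheme are illustrative assumptions of ours, not details of the paper's model:

```python
from collections import defaultdict

class AccentAdapter:
    """Sketch: a listener adapts sound -> native-phoneme probabilities."""

    def __init__(self, lexicon, phonemes, smoothing=1.0):
        self.lexicon = lexicon      # word -> native phoneme sequence
        self.phonemes = phonemes    # the listener's native phoneme inventory
        # Evidence counts for (heard sound, native phoneme) pairs;
        # additive smoothing keeps every pairing possible a priori.
        self.counts = defaultdict(lambda: smoothing)

    def prob(self, sound, phoneme):
        # Current estimate of P(phoneme | sound).
        total = sum(self.counts[(sound, p)] for p in self.phonemes)
        return self.counts[(sound, phoneme)] / total

    def likelihood(self, sounds, word):
        # P(sound sequence | word), assuming one sound per phoneme.
        phones = self.lexicon[word]
        if len(phones) != len(sounds):
            return 0.0
        result = 1.0
        for s, ph in zip(sounds, phones):
            result *= self.prob(s, ph)
        return result

    def recognize_and_update(self, sounds):
        # Hypothesize the word being uttered, then strengthen the
        # sound -> phoneme links that hypothesis implies; on average
        # this improves recognition of later words.
        word = max(self.lexicon, key=lambda w: self.likelihood(sounds, w))
        for s, ph in zip(sounds, self.lexicon[word]):
            self.counts[(s, ph)] += 1.0
        return word
```

Note that nothing in the sketch commits to motor or auditory representations: the "sounds" could equally index either repertoire, mirroring the model's neutrality on that question.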
Symbol Emergence in Robotics: A Survey
Humans can learn the use of language through physical interaction with their
environment and semiotic communication with other people. It is very important
to obtain a computational understanding of how humans can form a symbol system
and obtain semiotic skills through their autonomous mental development.
Recently, many studies have been conducted on the construction of robotic
systems and machine-learning methods that can learn the use of language through
embodied multimodal interaction with their environment and other systems.
To understand human social interactions, and to develop a robot that can
smoothly communicate with human users in the long term, it is crucially
important to understand the dynamics of symbol systems. The
embodied cognition and social interaction of participants gradually change a
symbol system in a constructive manner. In this paper, we introduce a field of
research called symbol emergence in robotics (SER). SER is a constructive
approach towards an emergent symbol system. The emergent symbol system is
socially self-organized through both semiotic communications and physical
interactions with autonomous cognitive developmental agents, i.e., humans and
developmental robots. Specifically, we describe some state-of-the-art research
topics concerning SER, e.g., multimodal categorization, word discovery, and
double articulation analysis, that enable a robot to obtain words and their
embodied meanings from raw sensory-motor information, including visual
information, haptic information, auditory information, and acoustic speech
signals, in a totally unsupervised manner. Finally, we suggest future
directions of research in SER.
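As a toy illustration of the multimodal categorization idea, features from several modalities can be fused and clustered with no labels, so that object categories emerge from co-occurrence statistics alone. This sketch uses a Gaussian mixture for brevity; the surveyed work typically relies on richer Bayesian models such as multimodal latent Dirichlet allocation, and all data here are synthetic:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n_objects = 60
# Synthetic visual, haptic, and auditory features for two latent
# object kinds the robot has never been told about.
kind = rng.integers(0, 2, n_objects)
visual   = rng.normal(kind[:, None] * 2.0, 0.5, (n_objects, 4))
haptic   = rng.normal(kind[:, None] * 1.5, 0.5, (n_objects, 3))
auditory = rng.normal(kind[:, None] * 1.0, 0.5, (n_objects, 2))

# Fuse the modalities and cluster in a totally unsupervised manner.
features = np.hstack([visual, haptic, auditory])
categories = GaussianMixture(n_components=2, random_state=0).fit_predict(features)
print(categories)   # emergent object categories, learned without labels
```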
Prediction of intent in robotics and multi-agent systems
Moving beyond the stimulus contained in observable agent behaviour, i.e. understanding the underlying intent of the observed agent, is of immense interest in a variety of domains that involve collaborative and competitive scenarios, for example assistive robotics, computer games, robot-human interaction, decision support and intelligent tutoring. This review paper examines approaches for performing action recognition and prediction of intent from a multi-disciplinary perspective, in both single-robot and multi-agent scenarios, and analyses the underlying challenges, focusing mainly on generative approaches.
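As a minimal sketch of the generative approach, intent prediction can be cast as Bayesian inference over goals, with the posterior P(goal | actions) sharpening as actions are observed. The goal names and likelihood values below are toy assumptions:

```python
import numpy as np

goals = ["fetch_cup", "open_door"]
posterior = np.array([0.5, 0.5])          # uniform prior over intents
# Toy generative model: P(action | goal) for each observable action.
likelihood = {
    "move_to_table": np.array([0.8, 0.2]),
    "reach_forward": np.array([0.7, 0.3]),
}

for action in ["move_to_table", "reach_forward"]:
    posterior = posterior * likelihood[action]   # accumulate evidence
    posterior = posterior / posterior.sum()      # renormalize

for goal, p in zip(goals, posterior):
    print(f"P({goal} | observed actions) = {p:.2f}")
```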
Telephone conversation impairs sustained visual attention via a central bottleneck
Recent research has shown that holding telephone conversations disrupts one's driving ability. We asked whether this effect could be attributed to a visual attention impairment. In Experiment 1, participants conversed on a telephone or listened to a narrative while engaged in multiple object tracking (MOT), a task requiring sustained visual attention. We found that MOT was disrupted in the telephone conversation condition, relative to single-task MOT performance, but that listening to a narrative had no effect. In Experiment 2, we asked which component of conversation might be interfering with MOT performance. We replicated the conversation and single-task conditions of Experiment 1 and added two conditions in which participants heard a sequence of words over a telephone. In the shadowing condition, participants simply repeated each word in the sequence. In the generation condition, participants were asked to generate a new word based on each word in the sequence. Word generation interfered with MOT performance, but shadowing did not. The data indicate that telephone conversation disrupts attention at a central stage, the act of generating verbal stimuli, rather than at a peripheral stage, such as listening or speaking.
A convolutional neural-network model of human cochlear mechanics and filter tuning for real-time applications
Auditory models are commonly used as feature extractors for automatic
speech-recognition systems or as front-ends for robotics, machine-hearing and
hearing-aid applications. Although auditory models can capture the biophysical
and nonlinear properties of human hearing in great detail, these biophysical
models are computationally expensive and cannot be used in real-time
applications. We present a hybrid approach where convolutional neural networks
are combined with computational neuroscience to yield a real-time end-to-end
model for human cochlear mechanics, including level-dependent filter tuning
(CoNNear). The CoNNear model was trained on acoustic speech material and its
performance and applicability were evaluated using (unseen) sound stimuli
commonly employed in cochlear mechanics research. The CoNNear model accurately
simulates human cochlear frequency selectivity and its dependence on sound
intensity, an essential quality for robust speech intelligibility at negative
speech-to-background-noise ratios. The CoNNear architecture is based on
parallel and differentiable computations and has the power to achieve real-time
human performance. These unique CoNNear features will enable the next
generation of human-like machine-hearing applications.
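The following is a minimal sketch of that hybrid idea: a fully convolutional, differentiable encoder-decoder that maps an audio waveform to simulated responses across cochlear channels. Layer counts, kernel sizes, channel numbers, and the tanh nonlinearities are illustrative assumptions, not the published CoNNear design:

```python
import torch
import torch.nn as nn

class CochleaCNN(nn.Module):
    """Waveform in, per-channel cochlear response out (a sketch)."""

    def __init__(self, n_cochlear_channels=201):
        super().__init__()
        # Strided convolutions downsample the waveform; transposed
        # convolutions restore the original time resolution.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=15, stride=2, padding=7),
            nn.Tanh(),
            nn.Conv1d(64, 128, kernel_size=15, stride=2, padding=7),
            nn.Tanh(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(128, 64, kernel_size=16, stride=2, padding=7),
            nn.Tanh(),
            nn.ConvTranspose1d(64, n_cochlear_channels,
                               kernel_size=16, stride=2, padding=7),
            nn.Tanh(),
        )

    def forward(self, waveform):
        # waveform: (batch, 1, time) -> (batch, n_cochlear_channels, time)
        return self.decoder(self.encoder(waveform))

out = CochleaCNN()(torch.randn(1, 1, 2048))
print(out.shape)   # torch.Size([1, 201, 2048])
```

Because every operation here is a parallel, differentiable tensor computation, such a network can be trained to approximate a slow biophysical model and then run fast enough for real-time use, which is the core of the paper's approach.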
The role of HG in the analysis of temporal iteration and interaural correlation
Towards a complete multiple-mechanism account of predictive language processing [Commentary on Pickering & Garrod]
Although we agree with Pickering & Garrod (P&G) that prediction-by-simulation and prediction-by-association are important mechanisms of anticipatory language processing, this commentary suggests that they (1) overlook other potential mechanisms that might underlie prediction in language processing, (2) overestimate the importance of prediction-by-association in early childhood, and (3) underestimate the complexity and significance of several factors that might mediate prediction during language processing.
An integrated theory of language production and comprehension
Currently, production and comprehension are regarded as quite distinct in accounts of language processing. In rejecting this dichotomy, we instead assert that producing and understanding are interwoven, and that this interweaving is what enables people to predict themselves and each other. We start by noting that production and comprehension are forms of action and action perception. We then consider the evidence for interweaving in action, action perception, and joint action, and explain such evidence in terms of prediction. Specifically, we assume that actors construct forward models of their actions before they execute those actions, and that perceivers of others' actions covertly imitate those actions, then construct forward models of those actions. We use these accounts of action, action perception, and joint action to develop accounts of production, comprehension, and interactive language. Importantly, they incorporate well-defined levels of linguistic representation (such as semantics, syntax, and phonology). We show (a) how speakers and comprehenders use covert imitation and forward modeling to make predictions at these levels of representation, (b) how they interweave production and comprehension processes, and (c) how they use these predictions to monitor the upcoming utterances. We show how these accounts explain a range of behavioral and neuroscientific data on language processing and discuss some of the implications of our proposal.
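As a schematic sketch of the monitoring idea (our illustration, not the authors' formalism): a forward model maps a production plan to its predicted percept at each linguistic level before execution, and any mismatch with the actual percept flags where repair is needed. The level names and string encodings are toy assumptions:

```python
# Toy production plan with the levels of representation named above.
plan = {"semantics": "CAT", "syntax": "NOUN", "phonology": "k-ae-t"}

def forward_model(plan):
    # Hypothetical efference copy: predict the percept of the upcoming
    # utterance at each linguistic level (identity in this toy example).
    return dict(plan)

def monitor(plan, percept):
    # Compare predictions with the comprehended percept level by level;
    # any mismatching level is flagged for repair.
    predicted = forward_model(plan)
    return [level for level in predicted if predicted[level] != percept[level]]

print(monitor(plan, {"semantics": "CAT", "syntax": "NOUN",
                     "phonology": "k-ae-t"}))   # [] -> no repair needed
print(monitor(plan, {"semantics": "CAT", "syntax": "NOUN",
                     "phonology": "k-ae-p"}))   # ['phonology']
```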
Being first matters: topographical representational similarity analysis of ERP signals reveals separate networks for audiovisual temporal binding depending on the leading sense
In multisensory integration, processing in one sensory modality is enhanced by complementary information from other modalities. Inter-sensory timing is crucial in this process, as only inputs reaching the brain within a restricted temporal window are perceptually bound. Previous research in the audiovisual field has investigated various features of the temporal binding window (TBW), revealing asymmetries in its size and plasticity depending on the leading input (auditory-visual, AV; visual-auditory, VA). We here tested whether separate neuronal mechanisms underlie this AV-VA dichotomy in humans. We recorded high-density EEG while participants performed an audiovisual simultaneity judgment task including various AV/VA asynchronies and unisensory control conditions (visual-only, auditory-only) and tested whether AV and VA processing generate different patterns of brain activity. After isolating the multisensory components of AV/VA event-related potentials (ERPs) from the sum of their unisensory constituents, we ran a time-resolved topographical representational similarity analysis (tRSA) comparing AV and VA ERP maps. Spatial cross-correlation matrices were built from real data to index the similarity between AV- and VA-maps at each time point (500 ms window post-stimulus) and then correlated with two alternative similarity model matrices: AVmaps = VAmaps vs. AVmaps ≠ VAmaps. The tRSA results favored the AVmaps ≠ VAmaps model across all time points, suggesting that audiovisual temporal binding (indexed by synchrony perception) engages different neural pathways depending on the leading sense. The existence of such a dual route supports recent theoretical accounts proposing that multiple binding mechanisms are implemented in the brain to accommodate different information parsing strategies in auditory and visual sensory systems.
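A hedged sketch of the time-resolved comparison: at each time point, compute the spatial correlation between the AV and VA topographies; the AVmaps = VAmaps model predicts values near 1 across the window, while AVmaps ≠ VAmaps predicts values near 0. The data below are random surrogates, and the published analysis involves further steps (cross-correlation matrices and model-matrix correlations) that we omit:

```python
import numpy as np

rng = np.random.default_rng(0)
n_timepoints, n_electrodes = 250, 128   # e.g. 500 ms window, 128-channel EEG
av_maps = rng.standard_normal((n_timepoints, n_electrodes))
va_maps = rng.standard_normal((n_timepoints, n_electrodes))

def spatial_corr(map_a, map_b):
    # Pearson correlation between two scalp topographies.
    a = map_a - map_a.mean()
    b = map_b - map_b.mean()
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Time-resolved similarity between AV and VA ERP maps.
r_t = np.array([spatial_corr(av_maps[t], va_maps[t])
                for t in range(n_timepoints)])
print(f"mean AV-VA map similarity: {r_t.mean():.3f}")  # near 0 for surrogates
```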