Visually grounded learning of keyword prediction from untranscribed speech
During language acquisition, infants have the benefit of visual cues to
ground spoken language. Robots similarly have access to audio and visual
sensors. Recent work has shown that images and spoken captions can be mapped
into a meaningful common space, allowing images to be retrieved using speech
and vice versa. In this setting of images paired with untranscribed spoken
captions, we consider whether computer vision systems can be used to obtain
textual labels for the speech. Concretely, we use an image-to-words multi-label
visual classifier to tag images with soft textual labels, and then train a
neural network to map from the speech to these soft targets. We show that the
resulting speech system is able to predict which words occur in an
utterance---acting as a spoken bag-of-words classifier---without seeing any
parallel speech and text. We find that the model often confuses semantically
related words, e.g. "man" and "person", making it even more effective as a
semantic keyword spotter.
Comment: 5 pages, 3 figures, 5 tables; small updates, added link to code; accepted to Interspeech 201
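The training signal described in this abstract can be sketched in a few lines. This is an illustrative reconstruction, not the paper's code: the vocabulary size, batch size, random stand-in logits, and the 0.5 decision threshold are all assumptions. A frozen visual tagger assigns each image a soft score per vocabulary word, and the speech network is trained with binary cross-entropy against those soft scores, using only the spoken caption as input.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size = 5   # tiny illustrative vocabulary
batch = 3

# Soft textual labels produced by the (frozen) image tagger, in [0, 1].
soft_targets = rng.uniform(size=(batch, vocab_size))

# Stand-in logits from the speech network for the matching spoken captions.
speech_logits = rng.normal(size=(batch, vocab_size))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_bce(logits, targets):
    """Binary cross-entropy against soft targets, summed over the vocabulary."""
    p = sigmoid(logits)
    eps = 1e-9
    return -np.mean(np.sum(targets * np.log(p + eps)
                           + (1 - targets) * np.log(1 - p + eps), axis=1))

loss = soft_bce(speech_logits, soft_targets)

# At test time, thresholding the sigmoid outputs turns the speech network
# into a spoken bag-of-words classifier, with no parallel speech/text used.
predicted_words = sigmoid(speech_logits) > 0.5
```

Minimizing this loss pushes the speech network's per-word probabilities toward the tagger's soft scores, which is what lets semantically related words ("man", "person") receive similar predictions.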
Training ASR models by Generation of Contextual Information
Supervised ASR models have reached unprecedented levels of accuracy, thanks
in part to ever-increasing amounts of labelled training data. However, in many
applications and locales, only moderate amounts of data are available, which
has led to a surge in semi- and weakly-supervised learning research. In this
paper, we conduct a large-scale study evaluating the effectiveness of
weakly-supervised learning for speech recognition by using loosely related
contextual information as a surrogate for ground-truth labels. For weakly
supervised training, we use 50k hours of public English social media videos
along with their respective titles and post text to train an encoder-decoder
transformer model. Our best encoder-decoder models achieve an average 20.8% WER reduction over a 1,000-hour supervised baseline, and an average 13.4% WER reduction when using only the weakly supervised encoder for CTC fine-tuning. Our results show that our weak-supervision setup improves both the encoder's acoustic representations and the decoder's language-generation abilities.
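Percentage WER reductions like those quoted above are conventionally relative to the baseline's WER; assuming that convention (the abstract does not state it explicitly), the figure is computed as follows, with purely illustrative numbers that are not from the paper:

```python
def relative_wer_reduction(baseline_wer, model_wer):
    """Relative WER reduction: the fraction of the baseline's errors removed."""
    return (baseline_wer - model_wer) / baseline_wer

# Hypothetical example: a baseline WER of 20.0% dropping to 15.84%
# corresponds to a 20.8% relative reduction.
reduction = round(relative_wer_reduction(20.0, 15.84), 3)  # -> 0.208
```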
Simultaneous Speech Segmentation and Cross-Situational Statistical Learning in Monolinguals, Bilinguals, and Multilinguals
Statistical learning (SL) is a mechanism that learners use to segment words from continuous speech and map them to their correct referents through the computation of co-occurrence probabilities (Saffran et al., 1996a, 1996b; Yu & Smith, 2007). So far, most SL studies have investigated segmentation and mapping using separate tasks (Graf-Estes et al., 2007). Therefore, very little is known about how these two processes interact, which is crucial for understanding real-world language learning. Moreover, nothing is known about how knowledge of an additional language influences this interaction. In this study, we exposed monolinguals, bilinguals, trilinguals, and quadrilinguals to a joint SL task allowing the tracking of both syllable co-occurrences and word-object co-occurrences at the same time. Two familiarization phases were used, during which participants simultaneously heard two phrases of an artificial language and saw two unique objects on the screen across multiple trials. In the test phase, participants were tested on their speech segmentation and mapping abilities. Our results indicated that although all groups showed different learning trajectories and strategies, they all succeeded at learning label words and mapping them to their correct referents after a second exposure. We also found that quadrilinguals outperformed both bilinguals and monolinguals in segmenting label words. Our preliminary findings suggest that, regardless of language experience, learners are capable of overcoming the cognitive load and the complexity of computing simultaneously available segmentation and mapping statistics, provided they are given sufficient exposure. In addition, knowledge of four languages could enhance the ability to detect word boundaries using statistical information.
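The co-occurrence statistic underlying segmentation in such tasks can be illustrated with a short sketch. The syllable stream and "words" below are invented for illustration, not the study's materials: transitional probabilities P(next syllable | current syllable) are high within words and drop at word boundaries, which is the cue learners are thought to track.

```python
from collections import Counter

def transitional_probabilities(stream):
    """P(next | current) for adjacent syllables in a continuous stream."""
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])
    return {(a, b): c / first_counts[a] for (a, b), c in pair_counts.items()}

# Hypothetical artificial language with two words, "bida" and "kupo",
# concatenated into an unsegmented syllable stream.
stream = ["bi", "da", "ku", "po", "bi", "da", "bi", "da", "ku", "po"]
tps = transitional_probabilities(stream)

within_word = tps[("bi", "da")]   # within-word transition: TP = 1.0
across_words = tps[("da", "ku")]  # boundary transition: lower TP
```

In Saffran-style designs, learners exploit exactly this contrast: syllable pairs with high TP are grouped into candidate words, and TP dips mark word boundaries.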
Oscillatory activity and EEG phase synchrony of concurrent word segmentation and meaning-mapping in 9-year-old children
When learning a new language, one must segment words from continuous speech and associate them with meanings. These complex processes can be boosted by attentional mechanisms triggered by multi-sensory information. Previous electrophysiological studies suggest that brain oscillations are sensitive to different hierarchical complexity levels of the input, making them a plausible neural substrate for speech parsing. Here, we investigated the functional role of brain oscillations during concurrent speech segmentation and meaning acquisition in sixty 9-year-old children. We collected EEG data during an audio-visual statistical learning task in which children were exposed to a learning condition with consistent word-picture associations and a random condition with inconsistent word-picture associations before being tested on their ability to recall words and word-picture associations. We capitalized on the brain's tendency to align neural activity to the rate of an external rhythmic stimulus to explore modulations of neural synchronization and of phase synchronization between electrodes during multi-sensory word learning. Results showed enhanced power at both the word and syllable rates and increased EEG phase synchronization between frontal and occipital regions in the learning condition compared to the random condition. These findings suggest that multi-sensory cueing and attentional mechanisms play an essential role in children's successful word learning.
Learning Across Senses: Cross-Modal Effects in Multisensory Statistical Learning
It is currently unknown whether statistical learning is supported by modality-general or modality-specific mechanisms. One issue within this debate concerns the independence of learning in one modality from learning in other modalities. In the present study, the authors examined the extent to which statistical learning across modalities is independent by simultaneously presenting learners with auditory and visual streams. After establishing baseline rates of learning for each stream independently, they systematically varied the amount of audiovisual correspondence across 3 experiments. They found that learners were able to segment both streams successfully only when the boundaries of the audio and visual triplets were in alignment. This pattern of results suggests that learners are able to extract multiple statistical regularities across modalities provided that there is some degree of cross-modal coherence. They discuss the implications of their results in light of recent claims that multisensory statistical learning is guided by modality-independent mechanisms.
How may the basal ganglia contribute to auditory categorization and speech perception?
Listeners must accomplish two complementary perceptual feats in extracting a message from speech. They must discriminate linguistically-relevant acoustic variability and generalize across irrelevant variability. Said another way, they must categorize speech. Since the mapping of acoustic variability is language-specific, these categories must be learned from experience. Thus, understanding how, in general, the auditory system acquires and represents categories can inform us about the toolbox of mechanisms available to speech perception. This perspective invites consideration of findings from cognitive neuroscience literatures outside of the speech domain as a means of constraining models of speech perception. Although neurobiological models of speech perception have mainly focused on cerebral cortex, research outside the speech domain is consistent with the possibility of significant subcortical contributions in category learning. Here, we review the functional role of one such structure, the basal ganglia. We examine research from animal electrophysiology, human neuroimaging, and behavior to consider characteristics of basal ganglia processing that may be advantageous for speech category learning. We also present emerging evidence for a direct role for basal ganglia in learning auditory categories in a complex, naturalistic task intended to model the incidental manner in which speech categories are acquired. To conclude, we highlight new research questions that arise in incorporating the broader neuroscience research literature in modeling speech perception, and suggest how understanding contributions of the basal ganglia can inform attempts to optimize training protocols for learning non-native speech categories in adulthood.
Divided attention does not affect the acquisition and consolidation of transitional probabilities
Statistical learning facilitates the efficient processing and prediction of environmental events and contributes to the acquisition of automatic behaviors. Whereas a minimal level of attention seems to be required for learning to occur, it is still unclear how the acquisition and consolidation of statistical knowledge are affected when attention is divided during learning. To test the effect of divided attention on statistical learning and consolidation, ninety-six healthy young adults performed the Alternating Serial Reaction Time task, in which they incidentally acquired second-order transitional probabilities. Half of the participants completed the task with a concurrent secondary intentional sequence-learning task applied to the same stimulus stream. The other half performed the task without any attention manipulation. Performance was retested after a 12-h post-learning offline period. Half of each group slept during the delay, while the other half had normal daily activity, enabling us to test the effect of delay activity (sleep vs. wake) on the consolidation of statistical knowledge. Divided attention had no effect on statistical learning: the acquisition of second-order transitional probabilities was comparable with and without the secondary task. Nor was consolidation affected by divided attention: statistical knowledge was similarly retained over the 12-h delay, irrespective of the delay activity. Our findings contribute to a better understanding of the role of attentional processes in, and the robustness of, visuomotor statistical learning and consolidation.
Brain Signatures of Embodied Semantics and Language: A Consensus Paper
According to embodied theories (including embodied, embedded, extended, enacted, situated, and grounded approaches to cognition), language representation is intrinsically linked to our interactions with the world around us, which is reflected in specific brain signatures during language processing and learning. Moving on from the original rivalry of embodied vs. amodal theories, this consensus paper addresses a series of carefully selected questions that aim at determining when and how rather than whether motor and perceptual processes are involved in language processes. We cover a wide range of research areas, from the neurophysiological signatures of embodied semantics, e.g., event-related potentials and fields as well as neural oscillations, to semantic processing and semantic priming effects on concrete and abstract words, to first and second language learning and, finally, the use of virtual reality for examining embodied semantics. Our common aim is to better understand the role of motor and perceptual processes in language representation as indexed by language comprehension and learning. We come to the consensus that, based on seminal research conducted in the field, future directions now call for enhancing the external validity of findings by acknowledging the multimodality, multidimensionality, flexibility and idiosyncrasy of embodied and situated language and semantic processes.