14 research outputs found

    Visually grounded learning of keyword prediction from untranscribed speech

    During language acquisition, infants have the benefit of visual cues to ground spoken language. Robots similarly have access to audio and visual sensors. Recent work has shown that images and spoken captions can be mapped into a meaningful common space, allowing images to be retrieved using speech and vice versa. In this setting of images paired with untranscribed spoken captions, we consider whether computer vision systems can be used to obtain textual labels for the speech. Concretely, we use an image-to-words multi-label visual classifier to tag images with soft textual labels, and then train a neural network to map from the speech to these soft targets. We show that the resulting speech system is able to predict which words occur in an utterance, acting as a spoken bag-of-words classifier, without seeing any parallel speech and text. We find that the model often confuses semantically related words, e.g. "man" and "person", making it even more effective as a semantic keyword spotter. Comment: 5 pages, 3 figures, 5 tables; small updates, added link to code; accepted to Interspeech 201
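
    To make the training recipe concrete, here is a minimal sketch assuming a frozen image tagger that outputs soft word probabilities and log-mel speech features; the class name, vocabulary size, and layer choices below are illustrative assumptions, not the system described in the abstract.

```python
# Minimal sketch of the training idea described above: a frozen visual tagger
# supplies soft word probabilities for each image, and a speech network is
# trained to predict those soft targets from the paired spoken caption.
# All names, sizes, and layers here are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB_SIZE = 1000   # size of the visual tagger's word vocabulary (assumed)
N_MELS = 40         # log-mel features per speech frame (assumed)

class SpeechBoWClassifier(nn.Module):
    """Maps a spoken caption to bag-of-words scores over the vocabulary."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(N_MELS, 128, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=9, padding=4), nn.ReLU(),
        )
        self.head = nn.Linear(128, VOCAB_SIZE)

    def forward(self, feats):          # feats: (batch, N_MELS, time)
        h = self.encoder(feats)        # (batch, 128, time)
        h = h.mean(dim=2)              # pool over time
        return self.head(h)            # word logits: (batch, VOCAB_SIZE)

def training_step(model, optimizer, speech_feats, image_soft_labels):
    """One update: fit speech predictions to the image tagger's soft labels."""
    logits = model(speech_feats)
    # Soft targets in [0, 1]; BCE-with-logits accepts non-binary targets.
    loss = nn.functional.binary_cross_entropy_with_logits(logits, image_soft_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

    At test time, ranking or thresholding the sigmoid of these word logits yields the predicted keywords for an utterance, i.e. the spoken bag-of-words behaviour described above.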

    Training ASR models by Generation of Contextual Information

    Supervised ASR models have reached unprecedented levels of accuracy, thanks in part to ever-increasing amounts of labelled training data. However, in many applications and locales, only moderate amounts of data are available, which has led to a surge in semi- and weakly-supervised learning research. In this paper, we conduct a large-scale study evaluating the effectiveness of weakly-supervised learning for speech recognition by using loosely related contextual information as a surrogate for ground-truth labels. For weakly supervised training, we use 50k hours of public English social media videos along with their respective titles and post text to train an encoder-decoder transformer model. Our best encoder-decoder models achieve an average of 20.8% WER reduction over a 1,000-hour supervised baseline, and an average of 13.4% WER reduction when using only the weakly supervised encoder for CTC fine-tuning. Our results show that our setup for weak supervision improved both the encoder's acoustic representations and the decoder's language-generation abilities.
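
    As a concrete illustration of the second result, the sketch below shows how a pretrained encoder (here a generic stand-in for the weakly supervised one) could be reused for CTC fine-tuning; the architecture, dimensions, vocabulary, and hyperparameters are assumptions for the example, not the authors' model.

```python
# Sketch (not the authors' code) of reusing a weakly supervised encoder for
# CTC fine-tuning on a smaller supervised set. The encoder stand-in,
# dimensions, vocabulary size, and hyperparameters are assumptions.
import torch
import torch.nn as nn

BLANK_ID = 0
VOCAB = 32        # e.g. characters plus the CTC blank (assumed)
D_MODEL = 256     # encoder width (assumed)

# Stand-in for an encoder whose weights would be loaded from weakly
# supervised pretraining before fine-tuning.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True),
    num_layers=6,
)
ctc_head = nn.Linear(D_MODEL, VOCAB)      # fresh output layer for CTC targets
ctc_loss = nn.CTCLoss(blank=BLANK_ID, zero_infinity=True)
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(ctc_head.parameters()), lr=1e-4
)

def fine_tune_step(feats, feat_lens, targets, target_lens):
    """feats: (batch, time, D_MODEL) acoustic frames projected to model width."""
    hidden = encoder(feats)                          # (batch, time, D_MODEL)
    log_probs = ctc_head(hidden).log_softmax(-1)     # (batch, time, VOCAB)
    # nn.CTCLoss expects (time, batch, vocab) log-probabilities.
    loss = ctc_loss(log_probs.transpose(0, 1), targets, feat_lens, target_lens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

    The weakly supervised pretraining stage itself, which uses video titles and post text as sequence targets for an encoder-decoder model, is not shown; only the encoder-reuse step is sketched.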

    Simultaneous Speech Segmentation and Cross-Situational Statistical Learning in Monolinguals, Bilinguals, and Multilinguals

    Statistical learning (SL) is a mechanism that learners use to segment words from continuous speech and map them to their correct referents through the computation of co-occurrence probabilities (Saffran et al., 1996a, 1996b; Yu & Smith, 2007). So far, most SL studies have investigated segmentation and mapping using separate tasks (Graf-Estes et al., 2007). Therefore, very little is known about how these two processes interact, which is crucial for understanding real-world language learning. Moreover, nothing is known about how knowledge of an additional language influences this interaction. In this study, we exposed monolinguals, bilinguals, trilinguals, and quadrilinguals to a joint SL task allowing the tracking of both syllable co-occurrences and word-object co-occurrences at the same time. Two familiarization phases were used, during which participants simultaneously heard two phrases of an artificial language and saw two unique objects on the screen across multiple trials. In the test phase, participants were tested on their speech segmentation and mapping abilities. Our results indicated that although all groups showed different learning trajectories and strategies, they all succeeded at learning label words and mapping them to their correct referents after a second exposure. We also found that quadrilinguals outperformed both bilinguals and monolinguals in segmenting label words. Our preliminary findings suggest that, regardless of language experience, learners are capable of overcoming the cognitive load and the complexity of computing simultaneously available segmentation and mapping statistics, provided they are given sufficient exposure. In addition, knowledge of four languages could enhance the ability to detect word boundaries using statistical information.
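
    The two statistics such a joint task asks learners to track can be illustrated with a toy computation; the artificial words, objects, and trials below are invented for the example and are not the study's stimuli.

```python
# Toy illustration (invented mini-corpus, not the study's stimuli) of the two
# statistics a learner could track simultaneously: syllable transitional
# probabilities for segmentation and word-object co-occurrences for mapping.
from collections import Counter, defaultdict

# Each trial: the label words heard on that trial and the objects displayed.
trials = [
    (["bi du ka", "go la tu"], {"OBJ1", "OBJ2"}),
    (["go la tu", "bi du ka"], {"OBJ2", "OBJ1"}),
    (["bi du ka", "pe mi fo"], {"OBJ1", "OBJ3"}),
]

pair_counts, syll_counts = Counter(), Counter()
word_object = defaultdict(Counter)

for trial_words, objects in trials:
    syllables = " ".join(trial_words).split()
    for a, b in zip(syllables, syllables[1:]):   # syllable co-occurrences
        pair_counts[(a, b)] += 1
        syll_counts[a] += 1
    for word in trial_words:                     # word-object co-occurrences
        for obj in objects:
            word_object[word][obj] += 1

def transitional_probability(a, b):
    """P(b | a): high within a word, lower at a word boundary."""
    return pair_counts[(a, b)] / syll_counts[a]

print(transitional_probability("bi", "du"))    # within word  -> 1.0
print(transitional_probability("ka", "go"))    # across words -> 0.5
print(word_object["bi du ka"].most_common(1))  # [('OBJ1', 3)]
```

    A learner exploiting both statistics can thus segment "bi du ka" out of the stream and map it to OBJ1, its most frequent co-occurring object, from the same exposure.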

    Oscillatory activity and EEG phase synchrony of concurrent word segmentation and meaning-mapping in 9-year-old children

    When learning a new language, one must segment words from continuous speech and associate them with meanings. These complex processes can be boosted by attentional mechanisms triggered by multi-sensory information. Previous electrophysiological studies suggest that brain oscillations are sensitive to different hierarchical complexity levels of the input, making them a plausible neural substrate for speech parsing. Here, we investigated the functional role of brain oscillations during concurrent speech segmentation and meaning acquisition in sixty 9-year-old children. We collected EEG data during an audio-visual statistical learning task in which children were exposed to a learning condition with consistent word-picture associations and a random condition with inconsistent word-picture associations before being tested on their ability to recall words and word-picture associations. We capitalized on the brain's tendency to align its activity to the rate of an external rhythmic stimulus to explore modulations of neural synchronization and of phase synchronization between electrodes during multi-sensory word learning. Results showed enhanced power at both the word and the syllable rate and increased EEG phase synchronization between frontal and occipital regions in the learning condition compared to the random condition. These findings suggest that multi-sensory cueing and attentional mechanisms play an essential role in children's successful word learning.
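
    The two neural measures referred to here, power at the word and syllable presentation rates and inter-electrode phase synchronization, can be sketched as follows; the sampling rate, stimulation rates, and synthetic channel signals are assumptions for illustration only.

```python
# Illustrative sketch (not the study's analysis pipeline) of the two measures
# mentioned above: spectral power at the word and syllable presentation rates
# and phase-locking between a frontal and an occipital channel. Sampling rate,
# stimulation rates, and the synthetic signals are assumptions.
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

FS = 500.0        # EEG sampling rate in Hz (assumed)
WORD_RATE = 1.1   # words per second in the stream (assumed)
SYLL_RATE = 3.3   # syllables per second (assumed)

def band_power(signal, freq, fs=FS, half_width=0.2):
    """Mean FFT power in a narrow band around the target frequency."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal)) ** 2
    band = (freqs > freq - half_width) & (freqs < freq + half_width)
    return power[band].mean()

def plv(x, y, freq, fs=FS, half_width=0.5):
    """Phase-locking value between two channels in a band around `freq`."""
    sos = butter(4, [freq - half_width, freq + half_width],
                 btype="band", fs=fs, output="sos")
    phase_x = np.angle(hilbert(sosfiltfilt(sos, x)))
    phase_y = np.angle(hilbert(sosfiltfilt(sos, y)))
    return np.abs(np.mean(np.exp(1j * (phase_x - phase_y))))

# Synthetic stand-ins for a frontal and an occipital channel.
t = np.arange(0, 60, 1 / FS)
rng = np.random.default_rng(0)
frontal = np.sin(2 * np.pi * WORD_RATE * t) + 0.5 * rng.standard_normal(len(t))
occipital = np.sin(2 * np.pi * WORD_RATE * t + 0.3) + 0.5 * rng.standard_normal(len(t))
print(band_power(frontal, WORD_RATE), band_power(frontal, SYLL_RATE))
print(plv(frontal, occipital, WORD_RATE))   # close to 1 for these signals
```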

    Learning Across Senses: Cross-Modal Effects in Multisensory Statistical Learning

    It is currently unknown whether statistical learning is supported by modality-general or modality-specific mechanisms. One issue within this debate concerns the independence of learning in one modality from learning in other modalities. In the present study, the authors examined the extent to which statistical learning across modalities is independent by simultaneously presenting learners with auditory and visual streams. After establishing baseline rates of learning for each stream independently, they systematically varied the amount of audiovisual correspondence across 3 experiments. They found that learners were able to segment both streams successfully only when the boundaries of the audio and visual triplets were in alignment. This pattern of results suggests that learners are able to extract multiple statistical regularities across modalities provided that there is some degree of cross-modal coherence. They discuss the implications of their results in light of recent claims that multisensory statistical learning is guided by modality-independent mechanisms.
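
    The key manipulation, whether the boundaries of the auditory and visual triplets coincide, can be made concrete with a toy stream construction; the triplets and presentation order below are invented.

```python
# Toy sketch (invented stimuli) of the manipulation described above: an
# auditory and a visual stream built from triplets whose boundaries either
# coincide (aligned) or are offset by one element (misaligned).
AUDIO_TRIPLETS = [["pa", "bi", "ku"], ["ti", "du", "ro"], ["go", "la", "tu"]]
VISUAL_TRIPLETS = [["A1", "A2", "A3"], ["B1", "B2", "B3"], ["C1", "C2", "C3"]]

def build_stream(triplets, order):
    """Concatenate triplets in the given presentation order."""
    return [item for idx in order for item in triplets[idx]]

order = [0, 1, 2, 0, 2, 1]                    # presentation order (assumed)
audio = build_stream(AUDIO_TRIPLETS, order)
visual_aligned = build_stream(VISUAL_TRIPLETS, order)
# Shifting the visual stream by one element breaks the boundary alignment.
visual_misaligned = visual_aligned[1:] + visual_aligned[:1]

for a, v_al, v_mis in list(zip(audio, visual_aligned, visual_misaligned))[:6]:
    print(f"{a:>3}  aligned:{v_al}  misaligned:{v_mis}")
```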

    How may the basal ganglia contribute to auditory categorization and speech perception?

    Listeners must accomplish two complementary perceptual feats in extracting a message from speech. They must discriminate linguistically-relevant acoustic variability and generalize across irrelevant variability. Said another way, they must categorize speech. Since the mapping of acoustic variability is language-specific, these categories must be learned from experience. Thus, understanding how, in general, the auditory system acquires and represents categories can inform us about the toolbox of mechanisms available to speech perception. This perspective invites consideration of findings from cognitive neuroscience literatures outside of the speech domain as a means of constraining models of speech perception. Although neurobiological models of speech perception have mainly focused on cerebral cortex, research outside the speech domain is consistent with the possibility of significant subcortical contributions in category learning. Here, we review the functional role of one such structure, the basal ganglia. We examine research from animal electrophysiology, human neuroimaging, and behavior to consider characteristics of basal ganglia processing that may be advantageous for speech category learning. We also present emerging evidence for a direct role for basal ganglia in learning auditory categories in a complex, naturalistic task intended to model the incidental manner in which speech categories are acquired. To conclude, we highlight new research questions that arise in incorporating the broader neuroscience research literature in modeling speech perception, and suggest how understanding contributions of the basal ganglia can inform attempts to optimize training protocols for learning non-native speech categories in adulthood.

    Brain Signatures of Embodied Semantics and Language: A Consensus Paper

    According to embodied theories (including embodied, embedded, extended, enacted, situated, and grounded approaches to cognition), language representation is intrinsically linked to our interactions with the world around us, which is reflected in specific brain signatures during language processing and learning. Moving on from the original rivalry of embodied vs. amodal theories, this consensus paper addresses a series of carefully selected questions that aim at determining when and how rather than whether motor and perceptual processes are involved in language processes. We cover a wide range of research areas, from the neurophysiological signatures of embodied semantics, e.g., event-related potentials and fields as well as neural oscillations, to semantic processing and semantic priming effects on concrete and abstract words, to first and second language learning and, finally, the use of virtual reality for examining embodied semantics. Our common aim is to better understand the role of motor and perceptual processes in language representation as indexed by language comprehension and learning. We come to the consensus that, based on seminal research conducted in the field, future directions now call for enhancing the external validity of findings by acknowledging the multimodality, multidimensionality, flexibility and idiosyncrasy of embodied and situated language and semantic processes.
