14 research outputs found

    Visually grounded learning of keyword prediction from untranscribed speech

    During language acquisition, infants have the benefit of visual cues to ground spoken language. Robots similarly have access to audio and visual sensors. Recent work has shown that images and spoken captions can be mapped into a meaningful common space, allowing images to be retrieved using speech and vice versa. In this setting of images paired with untranscribed spoken captions, we consider whether computer vision systems can be used to obtain textual labels for the speech. Concretely, we use an image-to-words multi-label visual classifier to tag images with soft textual labels, and then train a neural network to map from the speech to these soft targets. We show that the resulting speech system is able to predict which words occur in an utterance, acting as a spoken bag-of-words classifier, without seeing any parallel speech and text. We find that the model often confuses semantically related words, e.g. "man" and "person", making it even more effective as a semantic keyword spotter. Comment: 5 pages, 3 figures, 5 tables; small updates, added link to code; accepted to Interspeech 201
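
    To make the training recipe concrete, here is a minimal sketch assuming a frozen image tagger that outputs soft word probabilities and log-mel speech features; the class name, vocabulary size, and layer choices below are illustrative assumptions, not the system described in the abstract.

```python
# Minimal sketch of the training idea described above: a frozen visual tagger
# supplies soft word probabilities for each image, and a speech network is
# trained to predict those soft targets from the paired spoken caption.
# All names, sizes, and layers here are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB_SIZE = 1000   # size of the visual tagger's word vocabulary (assumed)
N_MELS = 40         # log-mel features per speech frame (assumed)

class SpeechBoWClassifier(nn.Module):
    """Maps a spoken caption to bag-of-words scores over the vocabulary."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(N_MELS, 128, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(128, 128, kernel_size=9, padding=4), nn.ReLU(),
        )
        self.head = nn.Linear(128, VOCAB_SIZE)

    def forward(self, feats):          # feats: (batch, N_MELS, time)
        h = self.encoder(feats)        # (batch, 128, time)
        h = h.mean(dim=2)              # pool over time
        return self.head(h)            # word logits: (batch, VOCAB_SIZE)

def training_step(model, optimizer, speech_feats, image_soft_labels):
    """One update: fit speech predictions to the image tagger's soft labels."""
    logits = model(speech_feats)
    # Soft targets in [0, 1]; BCE-with-logits accepts non-binary targets.
    loss = nn.functional.binary_cross_entropy_with_logits(logits, image_soft_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

    At test time, ranking or thresholding the sigmoid of these word logits yields the predicted keywords for an utterance, i.e. the spoken bag-of-words behaviour described above.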

    Training ASR models by Generation of Contextual Information

    Supervised ASR models have reached unprecedented levels of accuracy, thanks in part to ever-increasing amounts of labelled training data. However, in many applications and locales, only moderate amounts of data are available, which has led to a surge in semi- and weakly-supervised learning research. In this paper, we conduct a large-scale study evaluating the effectiveness of weakly-supervised learning for speech recognition by using loosely related contextual information as a surrogate for ground-truth labels. For weakly supervised training, we use 50k hours of public English social media videos along with their respective titles and post text to train an encoder-decoder transformer model. Our best encoder-decoder models achieve an average of 20.8% WER reduction over a 1,000-hour supervised baseline, and an average of 13.4% WER reduction when using only the weakly supervised encoder for CTC fine-tuning. Our results show that our setup for weak supervision improved both the encoder's acoustic representations and the decoder's language-generation abilities.
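
    As a concrete illustration of the second result, the sketch below shows how a pretrained encoder (here a generic stand-in for the weakly supervised one) could be reused for CTC fine-tuning; the architecture, dimensions, vocabulary, and hyperparameters are assumptions for the example, not the authors' model.

```python
# Sketch (not the authors' code) of reusing a weakly supervised encoder for
# CTC fine-tuning on a smaller supervised set. The encoder stand-in,
# dimensions, vocabulary size, and hyperparameters are assumptions.
import torch
import torch.nn as nn

BLANK_ID = 0
VOCAB = 32        # e.g. characters plus the CTC blank (assumed)
D_MODEL = 256     # encoder width (assumed)

# Stand-in for an encoder whose weights would be loaded from weakly
# supervised pretraining before fine-tuning.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True),
    num_layers=6,
)
ctc_head = nn.Linear(D_MODEL, VOCAB)      # fresh output layer for CTC targets
ctc_loss = nn.CTCLoss(blank=BLANK_ID, zero_infinity=True)
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(ctc_head.parameters()), lr=1e-4
)

def fine_tune_step(feats, feat_lens, targets, target_lens):
    """feats: (batch, time, D_MODEL) acoustic frames projected to model width."""
    hidden = encoder(feats)                          # (batch, time, D_MODEL)
    log_probs = ctc_head(hidden).log_softmax(-1)     # (batch, time, VOCAB)
    # nn.CTCLoss expects (time, batch, vocab) log-probabilities.
    loss = ctc_loss(log_probs.transpose(0, 1), targets, feat_lens, target_lens)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

    The weakly supervised pretraining stage itself, which uses video titles and post text as sequence targets for an encoder-decoder model, is not shown; only the encoder-reuse step is sketched.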

    Simultaneous Speech Segmentation and Cross-Situational Statistical Learning in Monolinguals, Bilinguals, and Multilinguals

    Statistical learning (SL) is a mechanism that learners use to segment words from continuous speech and map them to their correct referents through the computation of co-occurrence probabilities (Saffran et al., 1996a, 1996b; Yu & Smith, 2007). So far, most SL studies have investigated segmentation and mapping using separate tasks (Graf-Estes et al., 2007). Therefore, very little is known about how these two processes interact, which is crucial for understanding real-world language learning. Moreover, nothing is known about how knowledge of an additional language influences this interaction. In this study, we exposed monolinguals, bilinguals, trilinguals, and quadrilinguals to a joint SL task allowing the tracking of both syllable co-occurrences and word-object co-occurrences at the same time. Two familiarization phases were used, during which participants simultaneously heard two phrases of an artificial language and saw two unique objects on the screen across multiple trials. In the test phase, participants were tested on their speech segmentation and mapping abilities. Our results indicated that although all groups showed different learning trajectories and strategies, they all succeeded at learning label words and mapping them to their correct referents after a second exposure. We also found that quadrilinguals outperformed both bilinguals and monolinguals in segmenting label words. Our preliminary findings suggest that, regardless of language experience, learners are capable of overcoming the cognitive load and the complexity of computing simultaneously available segmentation and mapping statistics, provided they are given sufficient exposure. In addition, knowledge of four languages could enhance the ability to detect word boundaries using statistical information.
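
    The two statistics such a joint task asks learners to track can be illustrated with a toy computation; the artificial words, objects, and trials below are invented for the example and are not the study's stimuli.

```python
# Toy illustration (invented mini-corpus, not the study's stimuli) of the two
# statistics a learner could track simultaneously: syllable transitional
# probabilities for segmentation and word-object co-occurrences for mapping.
from collections import Counter, defaultdict

# Each trial: the label words heard on that trial and the objects displayed.
trials = [
    (["bi du ka", "go la tu"], {"OBJ1", "OBJ2"}),
    (["go la tu", "bi du ka"], {"OBJ2", "OBJ1"}),
    (["bi du ka", "pe mi fo"], {"OBJ1", "OBJ3"}),
]

pair_counts, syll_counts = Counter(), Counter()
word_object = defaultdict(Counter)

for trial_words, objects in trials:
    syllables = " ".join(trial_words).split()
    for a, b in zip(syllables, syllables[1:]):   # syllable co-occurrences
        pair_counts[(a, b)] += 1
        syll_counts[a] += 1
    for word in trial_words:                     # word-object co-occurrences
        for obj in objects:
            word_object[word][obj] += 1

def transitional_probability(a, b):
    """P(b | a): high within a word, lower at a word boundary."""
    return pair_counts[(a, b)] / syll_counts[a]

print(transitional_probability("bi", "du"))    # within word  -> 1.0
print(transitional_probability("ka", "go"))    # across words -> 0.5
print(word_object["bi du ka"].most_common(1))  # [('OBJ1', 3)]
```

    A learner exploiting both statistics can thus segment "bi du ka" out of the stream and map it to OBJ1, its most frequent co-occurring object, from the same exposure.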

    Oscillatory activity and EEG phase synchrony of concurrent word segmentation and meaning-mapping in 9-year-old children

    When learning a new language, one must segment words from continuous speech and associate them with meanings. These complex processes can be boosted by attentional mechanisms triggered by multi-sensory information. Previous electrophysiological studies suggest that brain oscillations are sensitive to different hierarchical complexity levels of the input, making them a plausible neural substrate for speech parsing. Here, we investigated the functional role of brain oscillations during concurrent speech segmentation and meaning acquisition in sixty 9-year-old children. We collected EEG data during an audio-visual statistical learning task in which children were exposed to a learning condition with consistent word-picture associations and a random condition with inconsistent word-picture associations before being tested on their ability to recall words and word-picture associations. We capitalized on the brain's tendency to align its activity to the rate of an external rhythmic stimulus to explore modulations of neural synchronization and of phase synchronization between electrodes during multi-sensory word learning. Results showed enhanced power at both the word and the syllable rate and increased EEG phase synchronization between frontal and occipital regions in the learning condition compared to the random condition. These findings suggest that multi-sensory cueing and attentional mechanisms play an essential role in children's successful word learning.
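
    The two neural measures referred to here, power at the word and syllable presentation rates and inter-electrode phase synchronization, can be sketched as follows; the sampling rate, stimulation rates, and synthetic channel signals are assumptions for illustration only.

```python
# Illustrative sketch (not the study's analysis pipeline) of the two measures
# mentioned above: spectral power at the word and syllable presentation rates
# and phase-locking between a frontal and an occipital channel. Sampling rate,
# stimulation rates, and the synthetic signals are assumptions.
import numpy as np
from scipy.signal import hilbert, butter, sosfiltfilt

FS = 500.0        # EEG sampling rate in Hz (assumed)
WORD_RATE = 1.1   # words per second in the stream (assumed)
SYLL_RATE = 3.3   # syllables per second (assumed)

def band_power(signal, freq, fs=FS, half_width=0.2):
    """Mean FFT power in a narrow band around the target frequency."""
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    power = np.abs(np.fft.rfft(signal)) ** 2
    band = (freqs > freq - half_width) & (freqs < freq + half_width)
    return power[band].mean()

def plv(x, y, freq, fs=FS, half_width=0.5):
    """Phase-locking value between two channels in a band around `freq`."""
    sos = butter(4, [freq - half_width, freq + half_width],
                 btype="band", fs=fs, output="sos")
    phase_x = np.angle(hilbert(sosfiltfilt(sos, x)))
    phase_y = np.angle(hilbert(sosfiltfilt(sos, y)))
    return np.abs(np.mean(np.exp(1j * (phase_x - phase_y))))

# Synthetic stand-ins for a frontal and an occipital channel.
t = np.arange(0, 60, 1 / FS)
rng = np.random.default_rng(0)
frontal = np.sin(2 * np.pi * WORD_RATE * t) + 0.5 * rng.standard_normal(len(t))
occipital = np.sin(2 * np.pi * WORD_RATE * t + 0.3) + 0.5 * rng.standard_normal(len(t))
print(band_power(frontal, WORD_RATE), band_power(frontal, SYLL_RATE))
print(plv(frontal, occipital, WORD_RATE))   # close to 1 for these signals
```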

    Learning Across Senses: Cross-Modal Effects in Multisensory Statistical Learning

    It is currently unknown whether statistical learning is supported by modality-general or modality-specific mechanisms. One issue within this debate concerns the independence of learning in one modality from learning in other modalities. In the present study, the authors examined the extent to which statistical learning across modalities is independent by simultaneously presenting learners with auditory and visual streams. After establishing baseline rates of learning for each stream independently, they systematically varied the amount of audiovisual correspondence across 3 experiments. They found that learners were able to segment both streams successfully only when the boundaries of the audio and visual triplets were in alignment. This pattern of results suggests that learners are able to extract multiple statistical regularities across modalities provided that there is some degree of cross-modal coherence. They discuss the implications of their results in light of recent claims that multisensory statistical learning is guided by modality-independent mechanisms.
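
    The key manipulation, whether the boundaries of the auditory and visual triplets coincide, can be made concrete with a toy stream construction; the triplets and presentation order below are invented.

```python
# Toy sketch (invented stimuli) of the manipulation described above: an
# auditory and a visual stream built from triplets whose boundaries either
# coincide (aligned) or are offset by one element (misaligned).
AUDIO_TRIPLETS = [["pa", "bi", "ku"], ["ti", "du", "ro"], ["go", "la", "tu"]]
VISUAL_TRIPLETS = [["A1", "A2", "A3"], ["B1", "B2", "B3"], ["C1", "C2", "C3"]]

def build_stream(triplets, order):
    """Concatenate triplets in the given presentation order."""
    return [item for idx in order for item in triplets[idx]]

order = [0, 1, 2, 0, 2, 1]                    # presentation order (assumed)
audio = build_stream(AUDIO_TRIPLETS, order)
visual_aligned = build_stream(VISUAL_TRIPLETS, order)
# Shifting the visual stream by one element breaks the boundary alignment.
visual_misaligned = visual_aligned[1:] + visual_aligned[:1]

for a, v_al, v_mis in list(zip(audio, visual_aligned, visual_misaligned))[:6]:
    print(f"{a:>3}  aligned:{v_al}  misaligned:{v_mis}")
```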

    How may the basal ganglia contribute to auditory categorization and speech perception?

    Listeners must accomplish two complementary perceptual feats in extracting a message from speech. They must discriminate linguistically-relevant acoustic variability and generalize across irrelevant variability. Said another way, they must categorize speech. Since the mapping of acoustic variability is language-specific, these categories must be learned from experience. Thus, understanding how, in general, the auditory system acquires and represents categories can inform us about the toolbox of mechanisms available to speech perception. This perspective invites consideration of findings from cognitive neuroscience literatures outside of the speech domain as a means of constraining models of speech perception. Although neurobiological models of speech perception have mainly focused on cerebral cortex, research outside the speech domain is consistent with the possibility of significant subcortical contributions in category learning. Here, we review the functional role of one such structure, the basal ganglia. We examine research from animal electrophysiology, human neuroimaging, and behavior to consider characteristics of basal ganglia processing that may be advantageous for speech category learning. We also present emerging evidence for a direct role for basal ganglia in learning auditory categories in a complex, naturalistic task intended to model the incidental manner in which speech categories are acquired. To conclude, we highlight new research questions that arise in incorporating the broader neuroscience research literature in modeling speech perception, and suggest how understanding contributions of the basal ganglia can inform attempts to optimize training protocols for learning non-native speech categories in adulthood.

    Brain Signatures of Embodied Semantics and Language: A Consensus Paper

    According to embodied theories (including embodied, embedded, extended, enacted, situated, and grounded approaches to cognition), language representation is intrinsically linked to our interactions with the world around us, which is reflected in specific brain signatures during language processing and learning. Moving on from the original rivalry of embodied vs. amodal theories, this consensus paper addresses a series of carefully selected questions that aim at determining when and how rather than whether motor and perceptual processes are involved in language processes. We cover a wide range of research areas, from the neurophysiological signatures of embodied semantics, e.g., event-related potentials and fields as well as neural oscillations, to semantic processing and semantic priming effects on concrete and abstract words, to first and second language learning and, finally, the use of virtual reality for examining embodied semantics. Our common aim is to better understand the role of motor and perceptual processes in language representation as indexed by language comprehension and learning. We come to the consensus that, based on seminal research conducted in the field, future directions now call for enhancing the external validity of findings by acknowledging the multimodality, multidimensionality, flexibility and idiosyncrasy of embodied and situated language and semantic processes.
