
    Relating EEG to continuous speech using deep neural networks: a review

    Objective. When a person listens to continuous speech, a corresponding response is elicited in the brain and can be recorded using electroencephalography (EEG). Linear models are presently used to relate the EEG recording to the corresponding speech signal. The ability of linear models to find a mapping between these two signals is used as a measure of neural tracking of speech. Such models are limited as they assume linearity in the EEG-speech relationship, which omits the nonlinear dynamics of the brain. As an alternative, deep learning models have recently been used to relate EEG to continuous speech, especially in auditory attention decoding (AAD) and single-speech-source paradigms. Approach. This paper reviews and comments on deep-learning-based studies that relate EEG to continuous speech in AAD and single-speech-source paradigms. We point out recurrent methodological pitfalls and the need for a standard benchmark of model analysis. Main results. We gathered 29 studies. The main methodological issues we found are biased cross-validation, data leakage leading to overfitted models, and data sizes that are disproportionately small relative to model complexity. In addition, we address requirements for a standard benchmark of model analysis, such as public datasets, common evaluation metrics, and good practices for the match-mismatch task. Significance. We are the first to present a review paper summarizing the main deep-learning-based studies that relate EEG to speech while addressing methodological pitfalls and important considerations for this newly expanding field. Our study is particularly relevant given the growing application of deep learning in EEG-speech decoding.
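    As a concrete reference point for the linear-model baseline discussed above, the sketch below shows a ridge-regression backward (stimulus-reconstruction) model in which the speech envelope is reconstructed from time-lagged EEG and the reconstruction correlation serves as the neural-tracking measure. It is a minimal illustration only; the sampling rate, lag range, regularization, and variable names are assumptions rather than settings taken from any reviewed study.

        # Minimal sketch of a linear backward (stimulus-reconstruction) model.
        # All shapes, rates, and hyperparameters below are illustrative assumptions.
        import numpy as np

        def lagged_design(eeg, n_lags):
            """Stack post-stimulus time lags (0..n_lags-1 samples) of each EEG channel."""
            cols = [np.roll(eeg, -lag, axis=0) for lag in range(n_lags)]
            X = np.concatenate(cols, axis=1)
            X[-n_lags:] = 0.0  # drop samples contaminated by the circular shift
            return X

        def fit_backward_model(eeg, envelope, n_lags=32, ridge=1e3):
            """Ridge regression from lagged EEG to the speech envelope."""
            X = lagged_design(eeg, n_lags)
            XtX = X.T @ X + ridge * np.eye(X.shape[1])
            return np.linalg.solve(XtX, X.T @ envelope)

        def neural_tracking(eeg, envelope, weights, n_lags=32):
            """Pearson correlation between reconstructed and actual envelope."""
            recon = lagged_design(eeg, n_lags) @ weights
            return np.corrcoef(recon, envelope)[0, 1]

        # Toy example with random data standing in for a training and a test trial
        rng = np.random.default_rng(0)
        eeg = rng.standard_normal((6400, 64))    # e.g. 100 s at 64 Hz, 64 channels (assumed)
        envelope = rng.standard_normal(6400)     # speech envelope at the same rate
        w = fit_backward_model(eeg[:4800], envelope[:4800])
        print(neural_tracking(eeg[4800:], envelope[4800:], w))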

    Temporal Coding of Speech in Human Auditory Cortex

    Human listeners can reliably recognize speech in complex listening environments. The underlying neural mechanisms, however, remain unclear and cannot yet be emulated by any artificial system. In this dissertation, we study how speech is represented in the human auditory cortex and how the neural representation contributes to reliable speech recognition. Cortical activity from normal-hearing human subjects is noninvasively recorded using magnetoencephalography during natural speech listening. It is first demonstrated that neural activity from auditory cortex is precisely synchronized to the slow temporal modulations of speech when the speech signal is presented in a quiet listening environment. How this neural representation is affected by acoustic interference is then investigated. Acoustic interference degrades speech perception via two mechanisms, informational masking and energetic masking, which are addressed respectively by using a competing speech stream and stationary noise as the interfering sound. When two speech streams are presented simultaneously, cortical activity is predominantly synchronized to the speech stream the listener attends to, even if the unattended, competing speech stream is 8 dB more intense. When speech is presented together with spectrally matched stationary noise, cortical activity remains precisely synchronized to the temporal modulations of speech until the noise is 9 dB more intense. Critically, the accuracy of neural synchronization to speech predicts how well individual listeners can understand speech in noise. Further analysis reveals that two neural sources contribute to speech-synchronized cortical activity, one with a shorter response latency of about 50 ms and the other with a longer response latency of about 100 ms. The longer-latency component, but not the shorter-latency component, shows selectivity to the attended speech and invariance to background noise, indicating a transition within auditory cortex from encoding the acoustic scene to encoding the behaviorally important auditory object. Taken together, we have demonstrated that during natural speech comprehension, neural activity in the human auditory cortex is precisely synchronized to the slow temporal modulations of speech. This neural synchronization is robust to acoustic interference, whether speech or noise, and therefore provides a strong candidate for the neural basis of acoustic-background-invariant speech recognition.
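    As an illustration of the kind of analysis described here, the sketch below extracts the slow temporal envelope of a speech signal (Hilbert transform followed by a low-pass filter) and correlates it with a neural time series as a simple synchronization index. The sampling rates, the 8 Hz cutoff, and the use of plain correlation are assumptions for illustration; the dissertation's MEG analysis is considerably more detailed.

        # Minimal sketch: quantify synchronization of neural activity to the slow
        # temporal modulations of speech. Rates and cutoff are assumed values.
        import numpy as np
        from scipy.signal import hilbert, butter, sosfiltfilt

        def slow_envelope(audio, fs_audio, fs_neural, cutoff_hz=8.0):
            """Broadband envelope via the Hilbert transform, low-passed and downsampled."""
            env = np.abs(hilbert(audio))
            sos = butter(4, cutoff_hz, btype="low", fs=fs_audio, output="sos")
            env = sosfiltfilt(sos, env)
            return env[::int(fs_audio // fs_neural)]

        def synchronization(neural, envelope):
            """Pearson correlation as a simple index of neural synchronization."""
            n = min(len(neural), len(envelope))
            return np.corrcoef(neural[:n], envelope[:n])[0, 1]

        # Toy data standing in for one auditory-cortex channel and a 10 s stimulus
        rng = np.random.default_rng(1)
        audio = rng.standard_normal(16000 * 10)   # 10 s of audio at 16 kHz (assumed)
        meg = rng.standard_normal(100 * 10)       # one MEG channel at 100 Hz (assumed)
        print(synchronization(meg, slow_envelope(audio, 16000, 100)))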

    Auditory Attention Decoding with Task-Related Multi-View Contrastive Learning

    The human brain can easily focus on one speaker and suppress others in scenarios such as a cocktail party. Recently, researchers found that auditory attention can be decoded from electroencephalogram (EEG) data. However, most existing deep learning methods have difficulty exploiting prior knowledge about the different views (namely, that attended speech and EEG are task-related views) and consequently extract unsatisfactory representations. Inspired by Broadbent's filter model, we decode auditory attention in a multi-view paradigm and extract the most relevant and important information by utilizing the missing view. Specifically, we propose an auditory attention decoding (AAD) method based on a multi-view VAE with task-related multi-view contrastive (TMC) learning. Employing TMC learning in the multi-view VAE uses the missing view to accumulate prior knowledge of the different views into the fused representation and to extract an approximately task-related representation. We evaluate our method on two popular AAD datasets and demonstrate its superiority by comparing it to state-of-the-art methods.
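    To make the contrastive idea concrete, the sketch below pairs an EEG segment with its attended speech segment and trains two toy encoders with a symmetric InfoNCE loss, so that matched EEG/speech pairs are pulled together and mismatched pairs in the batch are pushed apart. The encoders, dimensions, and temperature are assumptions for illustration; the paper's multi-view VAE with TMC learning is a richer objective than this.

        # Minimal sketch of contrastive alignment between EEG and attended speech.
        # This is a generic InfoNCE objective, not the paper's multi-view VAE/TMC loss.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class Encoder(nn.Module):
            """Toy encoder mapping a flattened segment into a shared embedding space."""
            def __init__(self, in_dim, emb_dim=64):
                super().__init__()
                self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))

            def forward(self, x):
                return F.normalize(self.net(x), dim=-1)

        def info_nce(eeg_emb, speech_emb, temperature=0.1):
            """Symmetric InfoNCE loss over a batch of paired EEG/speech embeddings."""
            logits = eeg_emb @ speech_emb.t() / temperature
            targets = torch.arange(len(logits), device=logits.device)
            return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

        # Batch of 16 paired segments: 64-channel EEG x 128 samples, 128-sample envelope
        eeg_enc, speech_enc = Encoder(64 * 128), Encoder(128)
        eeg = torch.randn(16, 64 * 128)
        speech = torch.randn(16, 128)
        loss = info_nce(eeg_enc(eeg), speech_enc(speech))
        loss.backward()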

    MAD-EEG: an EEG dataset for decoding auditory attention to a target instrument in polyphonic music

    We present MAD-EEG, a new, freely available dataset for studying EEG-based auditory attention decoding in the challenging case of subjects attending to a target instrument in polyphonic music. The dataset is the first music-related EEG dataset of its kind, enabling, in particular, studies on single-trial EEG-based attention decoding, while also opening the path for research on other EEG-based music analysis tasks. MAD-EEG so far comprises 20-channel EEG signals recorded from 8 subjects listening to solo, duo and trio music excerpts and attending to one pre-specified instrument. The proposed experimental setting differs from those previously considered in that the stimuli are polyphonic and are played to the subject through loudspeakers rather than headphones. The stimuli were designed with variations in the number and type of instruments in the mixture, spatial rendering, music genre and the melody being played. Preliminary results obtained with a state-of-the-art stimulus reconstruction algorithm commonly used for speech stimuli show that the audio representation reconstructed from the EEG response is more correlated with that of the attended source than with that of the unattended source, demonstrating that the dataset is suitable for such studies.
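    For context, the sketch below shows the correlation comparison that typically sits on top of such a stimulus reconstruction algorithm: given a representation reconstructed from EEG, a trial is decoded in favour of whichever source representation the reconstruction correlates with more strongly. The decoder itself and the audio representations are assumed to exist already; names and data are placeholders.

        # Minimal sketch of the correlation comparison used to score attention
        # decoding once a reconstruction from EEG is available. Data are toy values.
        import numpy as np

        def decode_attention(reconstruction, attended_repr, unattended_repr):
            """Return True if the reconstruction correlates more with the attended source."""
            r_att = np.corrcoef(reconstruction, attended_repr)[0, 1]
            r_unatt = np.corrcoef(reconstruction, unattended_repr)[0, 1]
            return r_att > r_unatt, r_att, r_unatt

        # Toy trial: the reconstruction is a noisy copy of the attended representation
        rng = np.random.default_rng(2)
        attended = rng.standard_normal(2000)
        unattended = rng.standard_normal(2000)
        reconstruction = attended + 0.8 * rng.standard_normal(2000)
        print(decode_attention(reconstruction, attended, unattended))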

    Efficient Solutions to High-Dimensional and Nonlinear Neural Inverse Problems

    Development of various data acquisition techniques has enabled researchers to study the brain as a complex system and gain insight into the high-level functions performed by different regions of the brain. These data are typically high-dimensional, as they pertain to hundreds of sensors and span hours of recording. In many experiments involving sensory or cognitive tasks, the underlying cortical activity admits sparse and structured representations in the temporal, spatial, or spectral domains, or combinations thereof. However, current neural data analysis approaches do not exploit this sparsity to cope with the high dimensionality of the data. Also, many existing approaches suffer from high bias due to the heavy use of linear models and estimation techniques, given that cortical activity is known to exhibit various degrees of non-linearity. Finally, the majority of current methods in computational neuroscience are tailored for static estimation in batch-mode and offline settings; with the advancement of brain-computer interface technologies, these methods need to be extended to capture neural dynamics in real time. The objective of this dissertation is to devise novel algorithms for real-time estimation settings and to incorporate the sparsity and non-linear properties of brain activity in order to provide efficient solutions to neural inverse problems involving high-dimensional data. Along the same line, our goal is to provide efficient representations of these high-dimensional data that are easy to interpret and assess statistically. First, we consider the problem of spectral estimation from binary neuronal spiking data. Due to the non-linearities involved in spiking dynamics, classical spectral representation methods fail to capture the spectral properties of these data. To address this challenge, we integrate point process theory, sparse estimation, and non-linear signal processing methods to propose a spectral representation modeling and estimation framework for spiking data. Our model takes into account the sparse spectral structure of spiking data, which is crucial in the analysis of electrophysiology data in conditions such as sleep and anesthesia. We validate the performance of our spectral estimation framework using simulated spiking data as well as multi-unit spike recordings from human subjects under general anesthesia. Next, we tackle the problem of real-time auditory attention decoding from electroencephalography (EEG) or magnetoencephalography (MEG) data in a competing-speaker environment. Most existing algorithms for this purpose operate offline and require access to multiple trials for reliable performance; hence, they are not suitable for real-time applications. To address these shortcomings, we integrate techniques from state-space modeling, Bayesian filtering, and sparse estimation to propose a real-time algorithm for attention decoding that provides robust, statistically interpretable, and dynamic measures of the attentional state of the listener. We validate the performance of our proposed algorithm using simulated and experimentally recorded M/EEG data. Our analysis reveals that our algorithms perform comparably to state-of-the-art offline attention decoding techniques, while providing significant computational savings. Finally, we study the problem of dynamic estimation of Temporal Response Functions (TRFs) for analyzing the neural response to auditory stimuli. A TRF can be viewed as the impulse response of the brain in a linear stimulus-response model. Over the past few years, TRF analysis has provided researchers with great insight into auditory processing, especially in competing-speaker environments. However, most existing results correspond to static TRF estimates and do not examine TRF dynamics, especially in multi-speaker environments with attentional modulation. Using state-space models, we provide a framework for robust and comprehensive dynamic analysis of TRFs using single-trial data. TRF components at specific lags may exhibit peaks that arise, persist, and disappear over time according to the attentional state of the listener. To account for this behavior in our model, we consider a state-space model with Gaussian mixture process noise and devise an algorithm to efficiently estimate the process noise parameters from the recorded M/EEG data. Application to simulated and recorded MEG data shows that the proposed state-space modeling and inference framework can reliably capture the dynamic changes in the TRF, which can in turn improve our access to the attentional state in competing-speaker environments.
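    As a simplified illustration of the dynamic-TRF idea, the sketch below treats the TRF as a random-walk state and updates it sample by sample with a standard Kalman filter, using a lagged-stimulus regression as the observation model. The Gaussian mixture process noise and the parameter-estimation algorithm described in the dissertation are not reproduced; the noise levels, lag count, and data are assumptions.

        # Minimal sketch of dynamic TRF tracking with a linear-Gaussian state-space
        # model and a Kalman filter. This is a simplification of the dissertation's
        # Gaussian-mixture-noise model; all parameters below are assumed values.
        import numpy as np

        def kalman_trf(stimulus, response, n_lags=20, q=1e-4, r=1.0):
            """Track a time-varying TRF w_t from observations y_t = x_t @ w_t + noise."""
            w = np.zeros(n_lags)                  # TRF estimate
            P = np.eye(n_lags)                    # state covariance
            Q = q * np.eye(n_lags)                # random-walk process noise
            trajectory = np.zeros((len(response), n_lags))
            for t in range(n_lags, len(response)):
                x = stimulus[t - n_lags + 1:t + 1][::-1]   # most recent lag first
                P = P + Q                                  # predict
                k = P @ x / (x @ P @ x + r)                # Kalman gain
                w = w + k * (response[t] - x @ w)          # update with prediction error
                P = P - np.outer(k, x) @ P
                trajectory[t] = w
            return trajectory

        # Toy data: the response is generated by a fixed exponential TRF plus noise
        rng = np.random.default_rng(3)
        stim = rng.standard_normal(5000)
        true_trf = np.exp(-np.arange(20) / 5.0)
        resp = np.convolve(stim, true_trf)[:5000] + 0.5 * rng.standard_normal(5000)
        print(np.round(kalman_trf(stim, resp)[-1][:5], 2))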