    NeuroHeed+: Improving Neuro-steered Speaker Extraction with Joint Auditory Attention Detection

    Neuro-steered speaker extraction aims to extract the listener's brain-attended speech signal from a multi-talker speech signal, where the attention is derived from cortical activity. This activity is usually recorded using electroencephalography (EEG) devices. Though promising, current methods often have a high speaker confusion error, where the interfering speaker is extracted instead of the attended speaker, degrading the listening experience. In this work, we aim to reduce the speaker confusion error in the neuro-steered speaker extraction model through a jointly fine-tuned auxiliary auditory attention detection model. The latter reinforces the consistency between the extracted target speech signal and the EEG representation, and also improves the EEG representation. Experimental results show that the proposed network significantly outperforms the baseline in terms of speaker confusion and overall signal quality in two-talker scenarios.

    A DenseNet-based method for decoding auditory spatial attention with EEG

    Auditory spatial attention detection (ASAD) aims to decode the attended spatial location from EEG in a multiple-speaker setting. ASAD methods are inspired by the lateralization of cortical neural responses during the processing of auditory spatial attention, and show promising performance for the task of auditory attention decoding (AAD) with neural recordings. Previous ASAD methods do not fully exploit the spatial distribution of EEG electrodes, which may limit their performance. In the present work, the original EEG channels are mapped onto a two-dimensional (2D) spatial topological map, so that the EEG data takes a three-dimensional (3D) arrangement containing spatial-temporal information. A 3D deep convolutional neural network (DenseNet-3D) is then used to extract temporal and spatial features of the neural representation for the attended locations. The results show that the proposed method achieves higher decoding accuracy than the state-of-the-art (SOTA) method (94.4% compared to XANet's 90.6%) with a 1-second decision window on the widely used KULeuven (KUL) dataset. The code to implement our work is available on GitHub: https://github.com/xuxiran/ASAD_DenseNe
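The channel-to-map transformation described above can be illustrated with a minimal sketch. The 6-channel montage, the 3x3 grid positions, and the 128 Hz sampling rate below are assumptions chosen for illustration; they are not taken from the paper, whose actual electrode layout and grid size are not stated in the abstract.

```python
import numpy as np

# Hypothetical (row, col) grid positions for a toy 6-channel montage.
# A real montage would need its own lookup table.
CHANNEL_TO_GRID = {
    "Fp1": (0, 0), "Fp2": (0, 2),
    "C3": (1, 0), "Cz": (1, 1), "C4": (1, 2),
    "Oz": (2, 1),
}
GRID_SHAPE = (3, 3)


def eeg_to_topomap_sequence(eeg, channel_names):
    """Rearrange EEG of shape (channels, time) into (time, height, width).

    Grid cells without an electrode are left at zero.
    """
    n_channels, n_times = eeg.shape
    maps = np.zeros((n_times, *GRID_SHAPE), dtype=eeg.dtype)
    for ch_idx, name in enumerate(channel_names):
        row, col = CHANNEL_TO_GRID[name]
        maps[:, row, col] = eeg[ch_idx]
    return maps


rng = np.random.default_rng(0)
names = list(CHANNEL_TO_GRID)
# One decision window: assumed 1 second at an assumed 128 Hz sampling rate.
window = rng.standard_normal((len(names), 128))
volume = eeg_to_topomap_sequence(window, names)
print(volume.shape)  # (128, 3, 3)
```

The resulting array has one 2D map per time sample, which is the kind of 3D input a 3D convolutional network can consume; interpolating the empty grid cells is a design choice left out of this sketch.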

    Noise processing in the auditory system with applications in speech enhancement

    Abstract: The auditory system is extremely efficient in extracting auditory information in the presence of background noise. However, speech enhancement algorithms, aimed at removing the background noise from a degraded speech signal, are not achieving results that are near the efficacy of the auditory system. The purpose of this study is thus to first investigate how noise affects the spiking activity of neurons in the auditory system and then use the brain activity in the presence of noise to design better speech enhancement algorithms. In order to investigate how noise affects the spiking activity of neurons, we first design a generalized linear model that relates the spiking activity of neurons to intrinsic and extrinsic covariates that can affect their activity, such as noise. From this model, we extract two metrics, one that shows the effects of noise on the spiking activity and another the relative effects of vocalization compared to noise. We use these metrics to analyze neural data, recorded from a structure of the auditory system named the inferior colliculus (IC), while presenting noisy vocalizations. We studied the effect of different kinds of noises (non-stationary, white and natural stationary), different vocalizations, different input sound levels and signal-to-noise ratios (SNR). We found that the presence of non-stationary noise increases the spiking activity of neurons, regardless of the SNR, input level or vocalization type. The presence of white or natural stationary noises however causes a great diversity of responses where the activity of sites could increase, decrease or remain unchanged. This shows that the noise invariance previously reported in the IC depends on the noisy conditions, which had not been observed before. We then address the problem of speech enhancement using information from the brain's processing in the presence of noise. 
It has been shown before that the brain waves of a listener correlate strongly with the speaker to whom the listener attends. Given this, we design two speech enhancement algorithms with a denoising autoencoder structure, namely the Brain Enhanced Speech Denoiser (BESD) and the U-shaped Brain Enhanced Speech Denoiser (U-BESD). These algorithms take advantage of the attended auditory information present in the brain activity of the listener to denoise a multi-talker speech signal. The U-BESD is built upon the BESD with the addition of skip connections and dilated convolutions. Compared to previously proposed approaches, BESD and U-BESD are each trained as a single neural architecture, lowering the complexity of the algorithm. We investigate two experimental settings. In the first one, the attended speaker is known, referred to as the speaker-specific setting, and in the second one no prior information is available about the attended speaker, referred to as the speaker-independent setting. In the speaker-specific setting, we show that both the BESD and U-BESD algorithms surpass a similar denoising autoencoder. Moreover, we also show that in the speaker-independent setting, U-BESD surpasses the performance of the only known approach that also uses the brain's activity.
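As a rough illustration of a generalized linear model that relates spiking activity to covariates such as noise, the sketch below fits a Poisson GLM with a log link to simulated spike counts. The Poisson/log-link form, the two binary covariates, the coefficient values, and the function name `fit_poisson_glm` are all assumptions made for this example; the abstract does not specify the model form or the two metrics derived from it.

```python
import numpy as np


def fit_poisson_glm(X, y, n_iter=25):
    """Fit log(E[y]) = X @ beta by Newton's method.

    X: (trials, covariates) design matrix whose first column is all ones.
    y: (trials,) spike counts.
    """
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())  # start the intercept at the mean count
    for _ in range(n_iter):
        rate = np.exp(X @ beta)
        grad = X.T @ (y - rate)            # gradient of the log-likelihood
        hess = X.T @ (X * rate[:, None])   # negative Hessian
        beta = beta + np.linalg.solve(hess, grad)
    return beta


# Simulated data (hypothetical): each trial has noise and/or a vocalization.
rng = np.random.default_rng(0)
n_trials = 5000
noise_on = rng.integers(0, 2, n_trials)
vocal_on = rng.integers(0, 2, n_trials)
X = np.column_stack([np.ones(n_trials), noise_on, vocal_on])
true_beta = np.array([1.0, 0.5, 0.8])  # assumed ground-truth coefficients
y = rng.poisson(np.exp(X @ true_beta))

beta_hat = fit_poisson_glm(X, y)
print(np.round(beta_hat, 2))
```

Under this assumed model, `exp(beta_hat[1])` is the multiplicative change in expected spike count when noise is present, and comparing `beta_hat[2]` with `beta_hat[1]` gives one simple way to contrast the vocalization effect with the noise effect.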

    Decoding auditory attention and neural language processing in adverse conditions and different listener groups

    This thesis investigated subjective, behavioural and neurophysiological (EEG) measures of speech processing in various adverse conditions and with different listener groups. In particular, this thesis focused on different neural processing stages and their relationship with auditory attention, effort, and measures of speech intelligibility. Study 1 set the groundwork by establishing a toolbox of various neural measures to investigate online speech processing, from the frequency following response (FFR) and cortical measures of speech processing, to the N400, a measure of lexico-semantic processing. Results showed that peripheral processing is heavily influenced by stimulus characteristics such as degradation, whereas central processing units are more closely linked to higher-order phenomena such as speech intelligibility. In Study 2, a similar experimental paradigm was used to investigate differences in neural processing between a hearing-impaired and a normal-hearing group. Subjects were presented with short stories in different levels of multi-talker babble noise, and with different settings on their hearing aids. Findings indicate that, particularly at lower noise levels, the hearing-impaired group showed much higher cortical entrainment than the normal-hearing group, despite similar levels of speech recognition. Intersubject correlation, another global neural measure of auditory attention, however, was similarly affected by noise levels in both the hearing-impaired and the normal-hearing group. This finding indicates extra processing in the hearing-impaired group only on the level of the auditory cortex. Study 3, in contrast to Studies 1 and 2 (which both investigated the effects of bottom-up factors on neural processing), examined the links between entrainment and top-down factors, specifically motivation, as well as reasons for the higher entrainment found in hearing-impaired subjects in Study 2.
Results indicated that, while behaviourally there was no difference between incentive and non-incentive conditions, neurophysiological measures of attention such as intersubject correlation were affected by the presence of an incentive to perform better. Moreover, using a specific degradation type resulted in subjects’ increased cortical entrainment under degraded conditions. These findings support the hypothesis that top-down factors such as motivation influence neurophysiological measures, and that higher entrainment to degraded speech might be triggered specifically by the reduced availability of spectral detail contained in speech.

    A Comparison of Regularization Methods in Forward and Backward Models for Auditory Attention Decoding

    The decoding of selective auditory attention from noninvasive electroencephalogram (EEG) data is of interest in brain computer interface and auditory perception research. The current state-of-the-art approaches for decoding the attentional selection of listeners are based on linear mappings between features of sound streams and EEG responses (forward model), or vice versa (backward model). It has been shown that when the envelope of attended speech and EEG responses are used to derive such mapping functions, the model estimates can be used to discriminate between attended and unattended talkers. However, the predictive/reconstructive performance of the models depends on how the model parameters are estimated. A number of model estimation methods have been published, along with a variety of datasets. It is currently unclear whether any of these methods performs better than the others, as they have not yet been compared side by side on a single standardized dataset in a controlled fashion. Here, we present a comparative study of the ability of different estimation methods to classify attended speakers from multi-channel EEG data. The performance of the model estimation methods is evaluated using different performance metrics on a set of labeled EEG data from 18 subjects listening to mixtures of two speech streams. We find that when forward models predict the EEG from the attended audio, regularized models do not improve regression or classification accuracies. When backward models decode the attended speech from the EEG, regularization provides higher regression and classification accuracies.
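A backward model with regularization can be sketched as follows: a ridge-regularized linear decoder reconstructs the attended envelope from time-lagged EEG, and the attended talker is chosen as the one whose envelope correlates best with the reconstruction. Ridge (Tikhonov) regression is used here only as one common regularizer; the simulated signals, the response kernel, the lag count, and the value of `lam` are all assumptions for illustration and do not reproduce the methods or data compared in the paper.

```python
import numpy as np


def lagged(eeg, n_lags):
    """Stack EEG (time, channels) with its next n_lags - 1 samples.

    Row t holds eeg[t], eeg[t + 1], ..., because the neural response
    follows the stimulus in time. The tail is zero-padded.
    """
    n_times, n_ch = eeg.shape
    out = np.zeros((n_times, n_ch * n_lags))
    for lag in range(n_lags):
        out[: n_times - lag, lag * n_ch:(lag + 1) * n_ch] = eeg[lag:]
    return out


def fit_ridge_decoder(X, s, lam):
    """Solve (X^T X + lam * I) w = X^T s for the decoder weights w."""
    XtX = X.T @ X
    return np.linalg.solve(XtX + lam * np.eye(XtX.shape[0]), X.T @ s)


# Toy simulation (hypothetical): EEG is driven by the attended stream only.
rng = np.random.default_rng(0)
n_times, n_ch, n_lags = 4000, 8, 5
attended = rng.standard_normal(n_times)
unattended = rng.standard_normal(n_times)
kernel = np.array([0.0, 0.5, 1.0, 0.5])          # assumed response kernel
response = np.convolve(attended, kernel)[:n_times]
gains = rng.uniform(0.5, 1.5, n_ch)              # assumed channel gains
eeg = response[:, None] * gains[None, :] + rng.standard_normal((n_times, n_ch))

# Train on the first 3000 samples, evaluate on the remaining 1000.
split = 3000
w = fit_ridge_decoder(lagged(eeg[:split], n_lags), attended[:split], lam=100.0)
reconstruction = lagged(eeg[split:], n_lags) @ w

r_attended = np.corrcoef(reconstruction, attended[split:])[0, 1]
r_unattended = np.corrcoef(reconstruction, unattended[split:])[0, 1]
decoded = "attended" if r_attended > r_unattended else "unattended"
print(decoded, round(r_attended, 2), round(r_unattended, 2))
```

Setting `lam=0.0` gives the unregularized least-squares decoder, so the same sketch can be used to compare regularized and unregularized fits on held-out data.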

