Brain-informed speech separation (BISS) for enhancement of target speaker in multitalker speech perception
Hearing-impaired people often struggle to follow the speech stream of an individual talker in noisy environments. Recent studies show that the brain tracks attended speech and that the attended talker can be decoded from neural data on a single-trial level. This raises the possibility of "neuro-steered" hearing devices, in which the brain-decoded intention of a hearing-impaired listener is used to enhance the voice of the attended speaker from a speech separation front-end. So far, methods that use this paradigm have focused on optimizing the brain decoding and the acoustic speech separation independently. In this work, we propose a novel framework called brain-informed speech separation (BISS), in which the information about the attended speech, as decoded from the subject's brain, is directly used to perform speech separation in the front-end. We present a deep learning model that uses neural data to extract the clean audio signal that a listener is attending to from a multi-talker speech mixture. We show that the framework can be applied successfully to the decoded output from either invasive intracranial electroencephalography (iEEG) or non-invasive electroencephalography (EEG) recordings from hearing-impaired subjects. It also results in improved speech separation, even in scenes with background noise. The generalization capability of the system makes it a strong candidate for neuro-steered hearing-assistive devices.
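A minimal sketch of the brain-informed conditioning idea described above: a mask-estimation network receives the mixture spectrogram plus a brain-decoded envelope of the attended speech and uses it to steer the separation. All layer sizes, names, and the injection point are illustrative assumptions, not the authors' BISS architecture.

```python
# Sketch: speech separation conditioned on a brain-decoded attention envelope.
# Layer sizes and names are hypothetical, not the published BISS model.
import torch
import torch.nn as nn

class BrainInformedSeparator(nn.Module):
    def __init__(self, n_freq=257, env_dim=1, hidden=256):
        super().__init__()
        # Encode the mixture spectrogram frames.
        self.mix_rnn = nn.LSTM(n_freq, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        # Project the decoded attention envelope into the same feature space.
        self.env_proj = nn.Linear(env_dim, 2 * hidden)
        # Predict a time-frequency mask for the attended speaker.
        self.mask_head = nn.Sequential(nn.Linear(2 * hidden, n_freq),
                                       nn.Sigmoid())

    def forward(self, mix_mag, envelope):
        # mix_mag:  (batch, time, n_freq) magnitude spectrogram of the mixture
        # envelope: (batch, time, 1) envelope decoded from iEEG/EEG
        h, _ = self.mix_rnn(mix_mag)
        h = h + self.env_proj(envelope)   # inject the brain-decoded cue
        mask = self.mask_head(h)          # mask for the attended source
        return mask * mix_mag             # enhanced magnitude estimate

mix = torch.randn(2, 100, 257).abs()
env = torch.randn(2, 100, 1)
print(BrainInformedSeparator()(mix, env).shape)  # torch.Size([2, 100, 257])
```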
Computational modelling of neural mechanisms underlying natural speech perception
Humans are highly skilled at the analysis of complex auditory scenes. In particular, the human auditory system is characterized by remarkable robustness to noise and can nearly effortlessly isolate the voice of a specific talker from even the busiest of mixtures. However, the neural mechanisms underlying these remarkable properties remain poorly understood, mainly because of the inherent complexity of speech signals and the intricate, multi-stage processing performed in the human auditory system. Understanding the neural mechanisms underlying speech perception is of interest for clinical practice, brain-computer interfacing, and automatic speech processing systems.
In this thesis, we developed computational models characterizing neural speech processing across different stages of the human auditory pathways. In particular, we studied the active role of slow cortical oscillations in speech-in-noise comprehension through a spiking neural network model for encoding spoken sentences. The neural dynamics of the model during noisy speech encoding reflected the speech comprehension of young, normal-hearing adults. The proposed theoretical model was validated by predicting the effects of non-invasive brain stimulation on speech comprehension in an experimental study involving a cohort of volunteers. Moreover, we developed a modelling framework for detecting the early, high-frequency neural response to continuous speech in non-invasive neural recordings. We applied the method to investigate top-down modulation of this response by the listener's selective attention and by the linguistic properties of different words from a spoken narrative. We found that in both cases the detected responses, of predominantly subcortical origin, were significantly modulated, which supports a functional role for feedback between higher and lower stages of the auditory pathways in speech perception.
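As a concrete illustration of the general approach to detecting a neural response to continuous speech, the sketch below fits a temporal response function (TRF) by ridge regression from a speech feature to a recording channel. This is a standard method in the field, shown here under synthetic data; it is an assumption for illustration, not the thesis's exact detection framework.

```python
# Sketch: ridge-regression estimate of a temporal response function (TRF)
# mapping a continuous speech feature (e.g. the envelope) to a neural signal.
import numpy as np

def lagged_design(stimulus, n_lags):
    """Stack time-lagged copies of the stimulus into a design matrix."""
    T = len(stimulus)
    X = np.zeros((T, n_lags))
    for k in range(n_lags):
        X[k:, k] = stimulus[:T - k]
    return X

def fit_trf(stimulus, response, n_lags=32, ridge=1.0):
    """Closed-form ridge solution w = (X'X + lambda*I)^-1 X'y."""
    X = lagged_design(stimulus, n_lags)
    return np.linalg.solve(X.T @ X + ridge * np.eye(n_lags), X.T @ response)

rng = np.random.default_rng(0)
stim = rng.standard_normal(5000)          # stand-in for a speech envelope
true_trf = np.exp(-np.arange(32) / 8.0)   # synthetic ground-truth kernel
resp = lagged_design(stim, 32) @ true_trf + 0.5 * rng.standard_normal(5000)
w = fit_trf(stim, resp)
print(np.corrcoef(w, true_trf)[0, 1])     # close to 1 on this toy data
```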
The proposed computational models shed light on some of the poorly understood neural mechanisms underlying speech perception. The developed methods can be readily employed in future studies involving a range of experimental paradigms beyond those considered in this thesis.
Noise processing in the auditory system with applications in speech enhancement
Abstract: The auditory system is extremely efficient at extracting auditory information in the presence of background noise. However, speech enhancement algorithms, which aim to remove background noise from a degraded speech signal, do not achieve results that approach the efficacy of the auditory system. The purpose of this study is thus to first investigate how noise affects the spiking activity of neurons in the auditory system, and then to use brain activity in the presence of noise to design better speech enhancement algorithms. To investigate how noise affects the spiking activity of neurons, we first design a generalized linear model that relates the spiking activity of neurons to intrinsic and extrinsic covariates that can affect their activity, such as noise. From this model, we extract two metrics: one that shows the effects of noise on spiking activity, and another that shows the relative effects of vocalization compared to noise. We use these metrics to analyze neural data recorded from a structure of the auditory system called the inferior colliculus (IC) during the presentation of noisy vocalizations. We studied the effect of different kinds of noise (non-stationary, white, and natural stationary), different vocalizations, different input sound levels, and different signal-to-noise ratios (SNRs). We found that the presence of non-stationary noise increases the spiking activity of neurons regardless of the SNR, input level, or vocalization type. The presence of white or natural stationary noise, however, causes a great diversity of responses, in which the activity of recording sites can increase, decrease, or remain unchanged. This shows that the noise invariance previously reported in the IC depends on the noise conditions, which had not been observed before. We then address the problem of speech enhancement using information from the brain's processing in the presence of noise. It has been shown that the brain waves of a listener correlate strongly with the speaker to whom the listener attends. Given this, we design two speech enhancement algorithms with a denoising autoencoder structure, namely the Brain Enhanced Speech Denoiser (BESD) and the U-shaped Brain Enhanced Speech Denoiser (U-BESD). These algorithms take advantage of the attended auditory information present in the brain activity of the listener to denoise multi-talker speech. U-BESD builds upon BESD with the addition of skip connections and dilated convolutions. Compared to previously proposed approaches, BESD and U-BESD are each trained as a single neural architecture, lowering the complexity of the algorithm. We investigate two experimental settings. In the first, the attended speaker is known (the speaker-specific setting); in the second, no prior information about the attended speaker is available (the speaker-independent setting). In the speaker-specific setting, we show that both BESD and U-BESD surpass a similar denoising autoencoder. Moreover, we show that in the speaker-independent setting, U-BESD surpasses the performance of the only known approach that also uses brain activity.
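The abstract describes a generalized linear model relating spike counts to covariates such as noise and vocalization. A minimal sketch of that kind of model is below, using a Poisson GLM with a log link (the standard choice for spike-count data) fitted by Newton-Raphson; the covariate names and fitting details are illustrative assumptions, not the study's exact model.

```python
# Sketch: Poisson GLM relating binned spike counts to stimulus covariates
# (e.g. vocalization and noise envelopes). Covariates are hypothetical.
import numpy as np

def fit_poisson_glm(X, y, n_iter=50):
    """Newton-Raphson fit of a log-linear Poisson model: y ~ Poisson(exp(Xw))."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ w)               # conditional firing intensity per bin
        grad = X.T @ (y - mu)            # gradient of the log-likelihood
        hess = X.T @ (X * mu[:, None])   # observed information matrix
        w += np.linalg.solve(hess + 1e-6 * np.eye(len(w)), grad)
    return w

rng = np.random.default_rng(1)
T = 2000
X = np.column_stack([np.ones(T),              # baseline firing rate
                     rng.standard_normal(T),  # vocalization covariate
                     rng.standard_normal(T)]) # noise covariate
true_w = np.array([0.5, 0.8, 0.4])
y = rng.poisson(np.exp(X @ true_w))           # simulated spike counts
print(fit_poisson_glm(X, y))                  # approximately [0.5, 0.8, 0.4]
```

The fitted coefficients play the role of the metrics mentioned above: the weight on the noise covariate quantifies the effect of noise on spiking, and comparing it with the vocalization weight gives their relative effect.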
Single Channel auditory source separation with neural network
Although distinguishing different sounds in a noisy environment is a relatively easy task for humans, source separation has long been extremely difficult in audio signal processing. The problem is challenging for three reasons: the large variety of sound types, the abundance of mixing conditions, and the unclear mechanism for distinguishing sources, especially similar sounds.
In recent years, neural network based methods have achieved impressive successes in various problems, including speech enhancement, where the task is to separate clean speech out of a noisy mixture. However, current deep learning based source separators do not perform well on real recorded noisy speech and, more importantly, are not applicable in more general source separation scenarios such as overlapped speech.
In this thesis, we first propose extensions to the current mask learning network for the problem of speech enhancement, to fix the scale mismatch problem that often occurs in real recorded audio. We solve this problem by adding two restoration layers to the existing mask learning network. We also propose a residual learning architecture for speech enhancement, further improving the network's generalization under different recording conditions. We evaluate the proposed speech enhancement models on CHiME-3 data. Without retraining the acoustic model, the best bidirectional LSTM with residual connections yields a 25.13% relative WER reduction on real data and 34.03% on simulated data.
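A minimal sketch of a mask-learning enhancer in the spirit described above: stacked bidirectional LSTMs with additive residual connections predict a time-frequency mask, followed by a linear "restoration" layer for the output scale. Layer sizes and the exact placement of the restoration step are assumptions, not the thesis's network.

```python
# Sketch: BLSTM mask estimation with residual connections and a restoration
# layer. Architecture details are illustrative, not the thesis's exact model.
import torch
import torch.nn as nn

class ResidualBLSTMEnhancer(nn.Module):
    def __init__(self, n_freq=257, hidden=256, n_layers=3):
        super().__init__()
        self.inp = nn.Linear(n_freq, 2 * hidden)
        # Stack of BLSTM layers with additive residual connections.
        self.layers = nn.ModuleList(
            [nn.LSTM(2 * hidden, hidden, batch_first=True, bidirectional=True)
             for _ in range(n_layers)])
        self.mask = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.Sigmoid())
        # Restoration layer rescaling the masked output, addressing the
        # train/real-recording scale mismatch (assumed placement).
        self.restore = nn.Linear(n_freq, n_freq)

    def forward(self, noisy_mag):
        # noisy_mag: (batch, frames, n_freq) noisy magnitude spectrogram
        h = self.inp(noisy_mag)
        for lstm in self.layers:
            out, _ = lstm(h)
            h = h + out                       # residual connection
        enhanced = self.mask(h) * noisy_mag   # masked spectrogram
        return self.restore(enhanced)         # restore the target scale

x = torch.rand(4, 120, 257)
print(ResidualBLSTMEnhancer()(x).shape)       # torch.Size([4, 120, 257])
```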
Then we propose a novel neural network based model called "deep clustering" for more general source separation tasks. We train a deep network to assign contrastive embedding vectors to each time-frequency region of the spectrogram, in order to implicitly predict the segmentation labels of the target spectrogram from the input mixture. This yields a deep network-based analogue to spectral clustering, in that the embeddings form a low-rank pairwise affinity matrix that approximates the ideal affinity matrix, while enabling much faster performance. At test time, the clustering step "decodes" the segmentation implicit in the embeddings by optimizing K-means with respect to the unknown assignments. Experiments on single-channel mixtures from multiple speakers show that a speaker-independent model trained on two- and three-speaker mixtures can improve signal quality for mixtures of held-out speakers by an average of over 10 dB.
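The deep clustering objective compares the embedding affinity matrix VV' with the ideal, label-derived affinity YY'. Because ||VV' - YY'||_F^2 expands to ||V'V||_F^2 - 2||V'Y||_F^2 + ||Y'Y||_F^2, the loss never needs the full TF-by-TF matrices. The sketch below shows that computation on random tensors; shapes and names are illustrative.

```python
# Sketch: the deep clustering loss in its low-rank expanded form.
import torch

def deep_clustering_loss(V, Y):
    """||VV' - YY'||_F^2 = ||V'V||_F^2 - 2 ||V'Y||_F^2 + ||Y'Y||_F^2.

    V: (batch, TF, D) embeddings per time-frequency bin
    Y: (batch, TF, S) one-hot source assignments (ideal masks)
    """
    VtV = torch.bmm(V.transpose(1, 2), V)   # (batch, D, D)
    VtY = torch.bmm(V.transpose(1, 2), Y)   # (batch, D, S)
    YtY = torch.bmm(Y.transpose(1, 2), Y)   # (batch, S, S)
    return VtV.pow(2).sum() - 2 * VtY.pow(2).sum() + YtY.pow(2).sum()

B, TF, D, S = 2, 100 * 129, 20, 2
V = torch.nn.functional.normalize(torch.randn(B, TF, D), dim=-1)
Y = torch.nn.functional.one_hot(torch.randint(0, S, (B, TF)), S).float()
print(deep_clustering_loss(V, Y))
```

At test time, K-means is run on the rows of V and the resulting cluster assignments serve as binary separation masks.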
We then propose an extension of deep clustering named the "deep attractor" network, which allows the system to perform efficient end-to-end training. In the proposed model, attractor points for each source are first created from the acoustic signals: the time-frequency bins corresponding to each source are pulled together by finding the centroids of the sources in the embedding space, and these centroids are subsequently used to determine the similarity of each bin in the mixture to each source. The network is then trained to minimize the reconstruction error of each source by optimizing the embeddings. We show that this framework achieves even better results.
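A minimal sketch of the attractor step described above: attractors are the centroids of each source's embeddings (using oracle assignments, as during training), and soft masks come from the similarity of every bin to each attractor. The softmax over similarities is an illustrative choice.

```python
# Sketch: deep-attractor-style mask computation from per-bin embeddings.
import torch

def attractor_masks(V, Y, eps=1e-8):
    # V: (batch, TF, D) embeddings; Y: (batch, TF, S) oracle source assignments
    # Attractor = centroid of the embeddings assigned to each source.
    attractors = torch.bmm(Y.transpose(1, 2), V) / (
        Y.sum(dim=1, keepdim=True).transpose(1, 2) + eps)  # (batch, S, D)
    # Similarity of every T-F bin to every attractor -> soft masks.
    logits = torch.bmm(V, attractors.transpose(1, 2))      # (batch, TF, S)
    return torch.softmax(logits, dim=-1)

V = torch.randn(2, 500, 20)
Y = torch.nn.functional.one_hot(torch.randint(0, 2, (2, 500)), 2).float()
masks = attractor_masks(V, Y)
print(masks.shape, masks.sum(-1).allclose(torch.ones(2, 500)))
```

Training then applies these masks to the mixture spectrogram and minimizes the reconstruction error of each source, so gradients flow end-to-end through the embeddings.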
Lastly, we introduce two applications of the proposed models: singing voice separation and a smart hearing aid device. For the former, a multi-task architecture is proposed that combines the deep clustering network with a classification-based network, achieving a new state-of-the-art separation result in which the signal-to-noise ratio was improved by 11.1 dB on music and 7.9 dB on singing voice. In the smart hearing aid application, we combine neural decoding with the separation network. The system first decodes the user's attention, which is then used to guide the separator toward the target source. Both objective and subjective studies show that the proposed system can accurately decode attention and significantly improve the user experience.