
    UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures

    In reverberant conditions with multiple concurrent speakers, each microphone acquires a mixture signal of multiple speakers at a different location. In over-determined conditions, where the microphones outnumber the speakers, we can narrow down the solutions to speaker images and realize unsupervised speech separation by leveraging each mixture signal as a constraint (i.e., the estimated speaker images at a microphone should add up to the mixture). Equipped with this insight, we propose UNSSOR, an algorithm for unsupervised neural speech separation by leveraging over-determined training mixtures. At each training step, we feed an input mixture to a deep neural network (DNN) to produce an intermediate estimate for each speaker, linearly filter the estimates, and optimize a loss so that, at each microphone, the filtered estimates of all the speakers add up to the mixture, satisfying the above constraint. We show that this loss can promote unsupervised separation of speakers. The linear filters are computed in each sub-band, based on the mixture and the DNN estimates, through the forward convolutive prediction (FCP) algorithm. To address the frequency permutation problem incurred by using sub-band FCP, we propose a loss term based on minimizing intra-source magnitude scattering. Although UNSSOR requires over-determined training mixtures, the trained DNNs can achieve under-determined separation (e.g., unsupervised monaural speech separation). Evaluation results on two-speaker separation in reverberant conditions show the effectiveness and potential of UNSSOR. Comment: in submission.
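    To make the mixture constraint concrete, here is a minimal NumPy sketch (illustrative, not the authors' code: the function names, tap count, and least-squares solver are assumptions) that computes per-frequency FCP filters and penalizes the gap between each microphone's mixture and the sum of the filtered speaker estimates. A real UNSSOR setup would evaluate this loss on STFTs inside a differentiable DNN training loop and add the intra-source magnitude-scattering term.

        import numpy as np

        def fcp_filter(est_stft, mix_stft, taps=4):
            # Per-frequency forward convolutive prediction (FCP): find a short
            # linear filter mapping the estimated source to its image at this
            # microphone, via least squares over delayed copies of the estimate.
            T, F = mix_stft.shape
            filtered = np.zeros_like(mix_stft)
            for f in range(F):
                A = np.stack([np.roll(est_stft[:, f], d) for d in range(taps)], axis=1)
                for d in range(taps):
                    A[:d, d] = 0.0  # discard samples wrapped around by np.roll
                g, *_ = np.linalg.lstsq(A, mix_stft[:, f], rcond=None)
                filtered[:, f] = A @ g
            return filtered

        def unssor_style_loss(estimates, mixtures, taps=4):
            # At every microphone, the FCP-filtered speaker estimates should
            # add up to the observed mixture (the over-determined constraint).
            loss = 0.0
            for mix in mixtures:  # one complex STFT of shape (T, F) per mic
                resynth = sum(fcp_filter(est, mix, taps) for est in estimates)
                loss += np.mean(np.abs(mix - resynth) ** 2)
            return loss / len(mixtures)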

    Attentional modulation of neural sound tracking in children with and without dyslexia

    Auditory selective attention forms an important foundation of children's learning by enabling the prioritisation and encoding of relevant stimuli. It may also influence reading development, which relies on metalinguistic skills including awareness of the sound structure of spoken language. Reports of attentional impairments and speech perception difficulties in noisy environments in dyslexic readers also suggest a putative contribution of auditory attention to reading development. To date, it is unclear whether non-speech selective attention and its underlying neural mechanisms are impaired in children with dyslexia, and to what extent these deficits relate to individual reading and speech perception abilities in suboptimal listening conditions. In this EEG study, we assessed non-speech sustained auditory selective attention in 106 7-to-12-year-old children with and without dyslexia. Children attended to one of two tone streams, detecting occasional sequence repeats in the attended stream, and performed a speech-in-speech perception task. Results show that when children directed their attention to one stream, inter-trial phase coherence at the attended rate increased at fronto-central sites; this, in turn, was associated with better target detection. Behavioural and neural indices of attention did not systematically differ as a function of dyslexia diagnosis. However, behavioural indices of attention did explain individual differences in reading fluency and speech-in-speech perception abilities: both these skills were impaired in dyslexic readers. Taken together, our results show that children with dyslexia do not show group-level auditory attention deficits, but individual attentional difficulties may represent a risk for developing reading impairments and problems with speech perception in complex acoustic environments. Research Highlights: (1) Non-speech sustained auditory selective attention modulates EEG phase coherence in children with and without dyslexia. (2) Children with dyslexia show difficulties in speech-in-speech perception. (3) Attention relates to dyslexic readers' speech-in-speech perception and reading skills. (4) Dyslexia diagnosis is not linked to behavioural/EEG indices of auditory attention.
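    The neural index used above, inter-trial phase coherence (ITPC), is the length of the mean unit phasor across trials at a given frequency; values near 1 mean the EEG phase at the attended tone rate is consistent from trial to trial. Below is a generic sketch of that measure (assumed signature and bin selection, not the authors' pipeline).

        import numpy as np

        def inter_trial_phase_coherence(trials, fs, freq):
            # trials: array of shape (n_trials, n_samples) for one channel.
            n = trials.shape[1]
            spectrum = np.fft.rfft(trials, axis=1)
            bin_idx = int(round(freq * n / fs))     # FFT bin nearest `freq`
            phases = np.angle(spectrum[:, bin_idx])
            return np.abs(np.mean(np.exp(1j * phases)))  # 0 = random, 1 = locked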

    An Experimental Review of Speaker Diarization Methods with Application to Two-Speaker Conversational Telephone Speech Recordings

    We performed an experimental review of current diarization systems for the conversational telephone speech (CTS) domain. In detail, we considered a total of eight different algorithms belonging to the clustering-based, end-to-end neural diarization (EEND), and speech separation guided diarization (SSGD) paradigms. We studied the inference-time computational requirements and diarization accuracy on four CTS datasets with different characteristics and languages. We found that, among all methods considered, EEND-vector clustering (EEND-VC) offers the best trade-off between computing requirements and performance. More generally, EEND models were found to be lighter and faster at inference than clustering-based methods. However, they also require a large amount of diarization-oriented annotated data. In particular, EEND-VC performance in our experiments degraded when the dataset size was reduced, whereas self-attentive EEND (SA-EEND) was less affected. We also found that SA-EEND gives less consistent results across datasets than EEND-VC, with its performance degrading on long conversations with high speech sparsity. Clustering-based diarization systems, and VBx in particular, instead show more consistent performance than SA-EEND but are outperformed by EEND-VC. The gap with respect to the latter narrows when overlap-aware clustering methods are considered. SSGD is the most computationally demanding method, but it can be convenient when speech recognition also has to be performed. Its performance is close to SA-EEND but degrades significantly when the training and inference data characteristics are mismatched. Comment: 52 pages, 10 figures.
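    Diarization accuracy in such comparisons is typically reported as the diarization error rate (DER): missed speech, false-alarm speech, and speaker-confusion time divided by total speech time. The sketch below is a deliberately simplified frame-level version (it assumes single-speaker frames and an already-resolved speaker mapping); real scoring tools additionally handle overlapped speech, forgiveness collars, and optimal label mapping.

        import numpy as np

        def frame_level_der(ref, hyp):
            # ref, hyp: one integer speaker label per frame, -1 meaning silence.
            speech = ref != -1
            miss = np.sum(speech & (hyp == -1))
            false_alarm = np.sum(~speech & (hyp != -1))
            confusion = np.sum(speech & (hyp != -1) & (ref != hyp))
            return (miss + false_alarm + confusion) / max(np.sum(speech), 1)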

    Evaluation and Modulation of Executive Functions in Neuroergonomics: Cognitive and Experimental Continua

    Studies in neuroergonomics have shown that aircraft pilots can make errors because of a transient inability to exercise mental flexibility. Certain factors, such as a high mental workload, strong time pressure, an excessive level of stress, the occurrence of conflicts, or a loss of situation awareness, can temporarily impair the efficiency of the executive functions underpinning this flexibility. Since my initial work, in which I examined the conditions that lead to auditory neglect, I have sought to develop a scientific approach aimed at quantifying and limiting the deleterious effects of these factors. This was done by studying executive functions in humans along both a cognitive continuum (from the injured brain to the fully functioning brain) and an experimental continuum (from the computer to the real world). The fundamental approach to studying executive functions in neuroscience, combined with a graded neuroergonomic approach involving pilots and brain-injured patients, has led to a better understanding of how these functions are engaged and impaired. This knowledge subsequently contributed to solutions for preserving their effectiveness in complex situations. After recounting my academic background, I present in this manuscript a selection of work organized around three research themes. The first concerns the study of the executive functions involved in attention, and in particular the way perceptual load and mental load can impair these functions. The second corresponds to a more applied aspect of this work: the assessment of the pilot's state, analysed either through the flying activity itself or through the management and supervision of a particular system. The third and final theme concerns the search for predictive markers of cognitive performance and the design of cognitive training programmes to limit dysexecutive disorders, whether of contextual or lesional origin. This work has contributed to a better understanding of transient and chronic cognitive disorders, but it has also raised questions that I now wish to answer. To illustrate this line of thought, the last part of this document presents my research project, which aims to develop a multifactorial approach to cognitive efficiency, grounded in ethics and open science.

    Cortical tracking of lexical speech units in a multi-talker background is immature in school-aged children

    Children have more difficulty perceiving speech in noise than adults. Whether this difficulty relates to immature processing of prosodic or linguistic elements of the attended speech is still unclear. To address the impact of noise on linguistic processing per se, we assessed how babble noise impacts the cortical tracking of intelligible speech devoid of prosody in school-aged children and adults. Twenty adults and twenty children (7-9 years) listened to synthesized French monosyllabic words presented at 2.5 Hz, either randomly or in 4-word hierarchical structures wherein 2 words formed a phrase at 1.25 Hz and 2 phrases formed a sentence at 0.625 Hz, with or without babble noise. Neuromagnetic responses to words, phrases and sentences were identified and source-localized. Children and adults displayed significant cortical tracking of words in all conditions, and of phrases and sentences only when words formed meaningful sentences. In children compared with adults, cortical tracking was lower for all linguistic units in conditions without noise. In the presence of noise, cortical tracking was similarly reduced for sentence units in both groups, but remained stable for phrase units. Critically, in noise, adults increased the cortical tracking of monosyllabic words in the inferior frontal gyri and supratemporal auditory cortices, but children did not. This study demonstrates that the difficulties of school-aged children in understanding speech in a multi-talker background might be partly due to immature tracking of lexical, but not supra-lexical, linguistic units.
    Acknowledgements: Maxime Niesen and Marc Vander Ghinst were supported by the Fonds Erasme (Brussels, Belgium). Mathieu Bourguignon and Julie Bertels have been supported by the program Attract of Innoviris (grants 2015-BB2B-10 and 2019-BFB-110). Julie Bertels has been supported by a research grant from the Fonds de Soutien Marguerite-Marie Delacroix (Brussels, Belgium). Xavier De Tiège is Clinical Researcher at the Fonds de la Recherche Scientifique (FRS-FNRS, Brussels, Belgium). We warmly thank Mélina Houinsou Hans for her statistical support during the review process.
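    In such frequency-tagging designs, cortical tracking is typically read out as the response amplitude at the stimulation rate relative to neighbouring frequency bins. The sketch below is a generic illustration of that readout (assumed signature and neighbour count, not the authors' MEG pipeline), applicable at the word, phrase, and sentence rates of 2.5, 1.25, and 0.625 Hz.

        import numpy as np

        def tagged_response_snr(signal, fs, target_hz, n_neighbors=10):
            # Amplitude at the tagged rate divided by the mean amplitude of
            # surrounding bins; values well above 1 indicate tracking.
            n = len(signal)
            amp = np.abs(np.fft.rfft(signal)) / n
            target = int(round(target_hz * n / fs))
            neighbors = [target + k for k in range(-n_neighbors, n_neighbors + 1)
                         if k != 0 and 0 <= target + k < len(amp)]
            return amp[target] / np.mean(amp[neighbors])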

    Iceberg: a loudspeaker-based room auralization method for auditory research

    Depending on the acoustic scenario, people with hearing loss face far greater challenges than normal-hearing people in comprehending sound, especially speech. This happens particularly during social interactions within a group, which often occur in environments with low signal-to-noise ratios. This communication disruption can create a barrier to acquiring and developing communication skills as a child, or to interacting with society as an adult. Hearing loss compensation aims to restore the auditory part of socialization. Technological and academic efforts have progressed toward a better understanding of the human hearing system. Through new algorithms, miniaturization, and new materials, constantly improving hardware with high-end software is being developed, offering new features and solutions to broad and specific auditory challenges. The effort to deliver innovative solutions to the complex phenomena of hearing loss encompasses tests, verifications, and validations in various forms. As newer devices achieve their purpose, tests need greater sensitivity, requiring conditions that effectively assess the improvements. Regarding realism, many levels are required in hearing research, from pure-tone assessment in small soundproof booths to hundreds of loudspeakers combined with visual stimuli through projectors or head-mounted displays, with light and movement control. Hearing aid research commonly relies on loudspeaker setups to reproduce sound sources. In addition, auditory research can use well-known auralization techniques to generate sound signals. These signals can be encoded to carry more than sound pressure level information, adding spatial information about the environment where the sound event happened or was simulated. This work reviews physical acoustics, virtualization, and auralization concepts and their uses in listening-effort research. This knowledge, combined with the experiments executed during the studies, aimed to provide a hybrid auralization method to be virtualized in four-loudspeaker setups. Auralization methods are techniques used to encode spatial information into sounds. The main methods were discussed and derived, observing their spatial sound characteristics and trade-offs for use in auditory tests with one or two participants. Two well-known auralization techniques (Ambisonics and Vector-Based Amplitude Panning) were selected and compared through a calibrated virtualization setup with regard to spatial distortions in the binaural cues. The choice of techniques was based on the need for loudspeakers, albeit a small number of them. Furthermore, the spatial cues were examined after adding a second listener to the virtualized sound field. The outcome reinforced the literature on spatial localization with these techniques, showing Ambisonics to be less spatially accurate but more immersive than Vector-Based Amplitude Panning. A combined study was defined to observe changes in listening effort due to different signal-to-noise ratios and reverberation in a virtualized setup. This experiment aimed to produce the correct sound field via a virtualized setup and assess listening effort via subjective impressions on a questionnaire, an objective physiological outcome from EEG, and behavioral performance on word recognition.
    Nine levels of degradation were imposed on speech signals over speech maskers, separated in the virtualized space through the first-order Ambisonics technique in a setup with 24 loudspeakers. A high correlation was observed between participants' performance and their questionnaire responses. The results showed that increased virtualized reverberation time negatively impacts speech intelligibility and listening effort. A new hybrid auralization method was proposed, merging the investigated techniques, which presented complementary spatial sound features. The method was derived from room acoustics concepts and a specific objective parameter of the room impulse response called Center Time. The verification of the binaural cues was carried out with three different (simulated) rooms. As validation with test subjects was not possible due to the COVID-19 pandemic, a psychoacoustic model was implemented to estimate the spatial accuracy of the method within a four-loudspeaker setup. The same verification and model estimation were also performed with the introduction of hearing aids. The results showed that the hybrid method with four loudspeakers can be considered for audiological tests, subject to some limitations: the setup can provide binaural cues up to a maximum ambiguity angle of 30 degrees in the horizontal plane for a centered listener.
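    As an illustration of the panning side of such a setup, the sketch below encodes a horizontal source into first-order Ambisonics B-format and decodes it to a square four-loudspeaker layout with a basic sampling decoder. The channel normalization and decoder weights here are assumptions (conventions such as SN3D/N3D and decoder designs vary), not the specific hybrid method of the work above.

        import numpy as np

        def encode_foa_2d(azimuth):
            # Horizontal first-order Ambisonics encoding gains (W, X, Y)
            # for a plane wave arriving from `azimuth` in radians.
            return np.array([1.0, np.cos(azimuth), np.sin(azimuth)])

        def decode_foa_square(b_format):
            # Basic sampling decoder for loudspeakers at +/-45 and +/-135
            # degrees; the factor 2 re-weights the first-order components for
            # a 2D layout (exact weights depend on the chosen convention).
            speaker_az = np.radians([45.0, 135.0, -135.0, -45.0])
            w, x, y = b_format
            return (w + 2.0 * (x * np.cos(speaker_az) + y * np.sin(speaker_az))) / 4.0

        # Example: pan a source to 30 degrees and inspect the four gains.
        gains = decode_foa_square(encode_foa_2d(np.radians(30.0)))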
