Decoding auditory attention from neural representations of glimpsed and masked speech
Humans hold the remarkable capacity to attend to a single person’s voice when many people are talking. Nevertheless, individuals with hearing loss may struggle to tune into a single voice in these types of complex acoustic situations. Current hearing aids can remove background noise but are unable to selectively amplify a single person’s voice without first knowing to whom the listener aims to attend. Studies of multitalker speech perception have demonstrated an enhanced representation of attended speech in the neural responses of a listener, giving rise to the prospect of a brain-controlled hearing aid that uses Auditory Attention Decoding (AAD) algorithms to selectively amplify the target of the listener’s attention as decoded from their neural signals.
In this dissertation, I describe experiments using non-invasive and invasive electrophysiology that investigate the encoding and decoding of speech representations that inform our understanding of the influence of attention on speech perception and advance our progress toward brain-controlled hearing devices.
First, I explore the efficacy of AAD in improving speech intelligibility when switching attention between different talkers with data recorded non-invasively from listeners with hearing loss. I show that AAD can be effective at improving intelligibility for listeners with hearing loss, but current methods for AAD with non-invasive data are unable to detect changes in attention with sufficient accuracy or speed to improve intelligibility generally.
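As context for the decoding approach evaluated here, the following is a minimal sketch of the standard correlation-based AAD pipeline, not the dissertation's exact implementation: a linear backward model reconstructs the attended-speech envelope from EEG, and the talker whose envelope best correlates with the reconstruction over a decision window is decoded as attended. The lag handling, ridge parameter, and variable names are illustrative assumptions.

```python
# Illustrative correlation-based AAD sketch (assumptions, not the dissertation's code).
import numpy as np

def lag_matrix(eeg, n_lags):
    """Stack EEG samples from t .. t+n_lags-1 as predictors for the envelope at t."""
    n_t, n_ch = eeg.shape
    X = np.zeros((n_t, n_ch * n_lags))
    for i in range(n_lags):
        shifted = np.roll(eeg, -i, axis=0)
        if i > 0:
            shifted[-i:] = 0.0          # zero the samples wrapped around by np.roll
        X[:, i * n_ch:(i + 1) * n_ch] = shifted
    return X

def train_decoder(eeg, attended_env, n_lags=32, ridge=1e3):
    """Ridge-regularized linear backward model: lagged EEG -> attended-speech envelope."""
    X = lag_matrix(eeg, n_lags)
    XtX = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(XtX, X.T @ attended_env)

def decode_attention(eeg, env_a, env_b, decoder, n_lags=32):
    """Return 'A' if talker A's envelope better matches the EEG reconstruction, else 'B'."""
    recon = lag_matrix(eeg, n_lags) @ decoder
    r_a = np.corrcoef(recon, env_a)[0, 1]
    r_b = np.corrcoef(recon, env_b)[0, 1]
    return "A" if r_a > r_b else "B"
```

In this scheme the decision-window length trades decoding accuracy against the speed of detecting attention switches, which is the limitation discussed above.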
Next, I analyze invasive neural recordings to more clearly establish the boundary between the neural encoding of target and non-target speech during multitalker speech perception. In particular, I investigate whether speech perception can be achieved through glimpses, i.e. spectrotemporal regions where a talker has more energy than the background, or if the recovery of masked regions is also necessary. I find that glimpsed speech is encoded for both target and non-target talkers, while masked speech is encoded for only the target talker, with a greater response latency and distinct anatomical organization compared to glimpsed speech. These findings suggest that glimpsed and masked speech utilize separate encoding mechanisms and that attention enables the recovery of masked speech to support higher-order speech perception.
Finally, I leverage my theory of the neural encoding of glimpsed and masked speech to design a novel framework for AAD. I show that differentially classifying event-related potentials to glimpsed and masked acoustic events is more effective than current models that ignore the dynamic overlap between a talker and the background. In particular, this framework enables more accurate and stable decoding that is quicker at identifying changes in attention and capable of detecting atypical uses of attention, such as divided attention or inattention. Together, this dissertation identifies key problems in the neural decoding of a listener’s attention, expands our understanding of the influence of attention on the neural encoding of speech, and leverages this understanding to design new methods for AAD that move us closer to the development of effective and intuitive brain-controlled hearing-assistive devices.
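A toy sketch of the event-locked classification idea described here, under assumed variable names and not the dissertation's implementation: epochs are cut around glimpsed and masked acoustic-event onsets for each talker, separate linear classifiers score how target-like the responses to each event type look, and the talker with the highest average score is decoded as attended.

```python
# Toy sketch of ERP classification for glimpsed/masked events (illustrative assumptions).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def extract_epochs(eeg, event_samples, pre=10, post=50):
    """Slice fixed-length epochs (events x time x channels) around event onsets."""
    epochs = [eeg[s - pre:s + post] for s in event_samples
              if s - pre >= 0 and s + post <= len(eeg)]
    return np.stack(epochs)

def talker_score(clf_glimpsed, clf_masked, eeg, glimpsed_events, masked_events):
    """Average probability that this talker's event-locked responses look attended,
    using separate classifiers for glimpsed and masked events."""
    scores = []
    for clf, events in ((clf_glimpsed, glimpsed_events), (clf_masked, masked_events)):
        epochs = extract_epochs(eeg, events)
        X = epochs.reshape(len(epochs), -1)          # flatten time x channels
        scores.append(clf.predict_proba(X)[:, 1].mean())
    return float(np.mean(scores))

# Example usage (hypothetical training data and talker objects):
# clf_g = LinearDiscriminantAnalysis().fit(glimpsed_train_epochs, labels)
# clf_m = LinearDiscriminantAnalysis().fit(masked_train_epochs, labels)
# attended = max(talkers, key=lambda t: talker_score(clf_g, clf_m, eeg, t.glimpsed, t.masked))
```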
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multispeaker VCTK dataset as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single- and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/
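The released implementation is at the URL above; purely as an illustration of the SLM-discriminator idea, the sketch below freezes a pre-trained WavLM encoder and trains a small adversarial head on its features. The HuggingFace checkpoint name, the head architecture, and the least-squares GAN losses are assumptions for illustration, not details taken from the paper.

```python
# Illustrative SLM-discriminator sketch (NOT the StyleTTS 2 implementation).
import torch
import torch.nn as nn
from transformers import WavLMModel

class SLMDiscriminator(nn.Module):
    def __init__(self, checkpoint="microsoft/wavlm-base-plus"):   # assumed checkpoint
        super().__init__()
        self.wavlm = WavLMModel.from_pretrained(checkpoint)
        self.wavlm.requires_grad_(False)                # keep the SLM frozen
        self.head = nn.Sequential(                      # small trainable head
            nn.Linear(self.wavlm.config.hidden_size, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, wav_16k):                         # wav_16k: (batch, samples) at 16 kHz
        feats = self.wavlm(wav_16k).last_hidden_state   # (batch, frames, hidden)
        return self.head(feats).squeeze(-1)             # per-frame real/fake logits

# Least-squares GAN losses over the SLM features (one common choice, assumed here):
def d_loss(disc, real_wav, fake_wav):
    return ((disc(real_wav) - 1) ** 2).mean() + (disc(fake_wav.detach()) ** 2).mean()

def g_loss(disc, fake_wav):
    return ((disc(fake_wav) - 1) ** 2).mean()
```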
Brain-informed speech separation (BISS) for enhancement of target speaker in multitalker speech perception
Hearing-impaired people often struggle to follow the speech stream of an individual talker in noisy environments. Recent studies show that the brain tracks attended speech and that the attended talker can be decoded from neural data on a single-trial level. This raises the possibility of “neuro-steered” hearing devices in which the brain-decoded intention of a hearing-impaired listener is used to enhance the voice of the attended speaker from a speech separation front-end. So far, methods that use this paradigm have focused on optimizing the brain decoding and the acoustic speech separation independently. In this work, we propose a novel framework called brain-informed speech separation (BISS) in which the information about the attended speech, as decoded from the subject’s brain, is directly used to perform speech separation in the front-end. We present a deep learning model that uses neural data to extract the clean audio signal that a listener is attending to from a multi-talker speech mixture. We show that the framework can be applied successfully to the decoded output from either invasive intracranial electroencephalography (iEEG) or non-invasive electroencephalography (EEG) recordings from hearing-impaired subjects. It also results in improved speech separation, even in scenes with background noise. The generalization capability of the system renders it a perfect candidate for neuro-steered hearing-assistive devices.
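To make the front-end conditioning concrete, here is an assumption-laden toy sketch, not the authors' architecture: the brain-decoded attended-speech envelope is embedded and added to a representation of the mixture spectrogram, and the network predicts a time-frequency mask that extracts the attended talker. All shapes and layer choices are placeholders.

```python
# Toy brain-informed separation sketch (illustrative only, not the BISS model).
import torch
import torch.nn as nn

class ToyBrainInformedSeparator(nn.Module):
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.env_proj = nn.Linear(1, hidden)           # embed the decoded attention envelope
        self.mix_proj = nn.Linear(n_freq, hidden)      # embed mixture magnitude frames
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, mix_mag, envelope):
        # mix_mag: (batch, frames, n_freq); envelope: (batch, frames) decoded from EEG/iEEG
        cond = self.env_proj(envelope.unsqueeze(-1))   # (batch, frames, hidden)
        x = self.mix_proj(mix_mag) + cond              # fuse acoustic and neural cues
        h, _ = self.rnn(x)
        return self.mask(h) * mix_mag                  # masked estimate of the attended talker
```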
Reducing the environmental impact of surgery on a global scale: systematic review and co-prioritization with healthcare workers in 132 countries
Abstract
Background
Healthcare cannot achieve net-zero carbon without addressing operating theatres. The aim of this study was to prioritize feasible interventions to reduce the environmental impact of operating theatres.
Methods
This study adopted a four-phase Delphi consensus co-prioritization methodology. In phase 1, a systematic review of published interventions and global consultation of perioperative healthcare professionals were used to longlist interventions. In phase 2, iterative thematic analysis consolidated comparable interventions into a shortlist. In phase 3, the shortlist was co-prioritized based on patient and clinician views on acceptability, feasibility, and safety. In phase 4, ranked lists of interventions were presented by their relevance to high-income countries and low–middle-income countries.
Results
In phase 1, 43 interventions were identified, which had low uptake in practice according to 3042 professionals globally. In phase 2, a shortlist of 15 intervention domains was generated. In phase 3, interventions were deemed acceptable for more than 90 per cent of patients except for reducing general anaesthesia (84 per cent) and re-sterilization of ‘single-use’ consumables (86 per cent). In phase 4, the top three shortlisted interventions for high-income countries were: introducing recycling; reducing use of anaesthetic gases; and appropriate clinical waste processing. In phase 4, the top three shortlisted interventions for low–middle-income countries were: introducing reusable surgical devices; reducing use of consumables; and reducing the use of general anaesthesia.
Conclusion
This is a step toward environmentally sustainable operating environments, with actionable interventions applicable to both high-income and low–middle-income countries.
Distinct neural encoding of glimpsed and masked speech in multitalker situations
Humans can easily tune in to one talker in a multitalker environment while still picking up bits of background speech; however, it remains unclear how we perceive speech that is masked and to what degree non-target speech is processed. Some models suggest that perception can be achieved through glimpses, which are spectrotemporal regions where a talker has more energy than the background. Other models, however, require the recovery of the masked regions. To clarify this issue, we directly recorded from primary and non-primary auditory cortex (AC) in neurosurgical patients as they attended to one talker in multitalker speech and trained temporal response function models to predict high-gamma neural activity from glimpsed and masked stimulus features. We found that glimpsed speech is encoded at the level of phonetic features for target and non-target talkers, with enhanced encoding of target speech in non-primary AC. In contrast, encoding of masked phonetic features was found only for the target, with a greater response latency and distinct anatomical organization compared to glimpsed phonetic features. These findings suggest separate mechanisms for encoding glimpsed and masked speech and provide neural evidence for the glimpsing model of speech perception. When humans tune in to one talker in a "cocktail party" scenario, what do we do with the non-target speech? This human intracranial study reveals new insights into the distinct mechanisms by which listeners process target and non-target speech in a crowded environment.
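A minimal sketch of the forward temporal response function (TRF) modeling described above: ridge regression maps time-lagged stimulus features (e.g., glimpsed or masked phonetic features) to high-gamma activity at one electrode. The lag range, regularization, and feature layout are illustrative placeholders, not the study's parameters.

```python
# Minimal forward TRF sketch: lagged stimulus features -> high-gamma at one electrode.
import numpy as np
from sklearn.linear_model import Ridge

def lagged_design(features, max_lag):
    """features: (time x n_features); returns (time x n_features*(max_lag+1))."""
    cols = []
    for lag in range(max_lag + 1):
        shifted = np.roll(features, lag, axis=0)
        shifted[:lag] = 0.0                      # zero samples wrapped around by np.roll
        cols.append(shifted)
    return np.hstack(cols)

def fit_trf(features, high_gamma, max_lag=40, alpha=1.0):
    """Fit one electrode's TRF; returns weights reshaped to (lags x features) and the model."""
    X = lagged_design(features, max_lag)
    model = Ridge(alpha=alpha).fit(X, high_gamma)
    weights = model.coef_.reshape(max_lag + 1, features.shape[1])
    return weights, model

# Comparing held-out prediction accuracy (e.g., correlation) between models trained on
# glimpsed versus masked features is one way to quantify how strongly each is encoded.
```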
Not Available
The Soil and Water Assessment Tool (SWAT) was used to assess water yield and evapotranspiration for the Gomti River basin, India, over a period of 25 years (1985–2010). Streamflow calibration and validation showed satisfactory model performance (NSE: 0.68–0.51; RSR: 0.56–0.68; |PBIAS|: 2.5–24.3). Water yield was higher in the midstream sub-basins than in the upstream and downstream sub-basins, whereas evapotranspiration per unit area decreased from upstream to downstream. Both evapotranspiration and water yield in the upstream and midstream sub-basins increased from 1985 to 2010, whereas water yield downstream decreased over the same period. We found that the spatial and temporal patterns of evapotranspiration and water yield were closely linked to climatic conditions and irrigation in the basin. The long-term trends in water yield point to a drying tendency of the downstream sub-basin, covering the districts of Jaunpur and Varanasi.
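For reference, the calibration statistics quoted above are standard streamflow goodness-of-fit measures; the sketch below shows how NSE, RSR, and PBIAS are conventionally computed from observed and simulated discharge series. It is illustrative only and not the study's code.

```python
# Standard streamflow goodness-of-fit metrics (NSE, RSR, PBIAS), illustrative only.
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def rsr(obs, sim):
    """RMSE normalized by the standard deviation of observations (lower is better)."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return np.sqrt(np.sum((obs - sim) ** 2)) / np.sqrt(np.sum((obs - obs.mean()) ** 2))

def pbias(obs, sim):
    """Percent bias: positive values indicate model underestimation."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 100.0 * np.sum(obs - sim) / np.sum(obs)
```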