8 research outputs found

    Deep speech inpainting of time-frequency masks

    Full text link
    Transient loud intrusions, often occurring in noisy environments, can completely overpower the speech signal and lead to an inevitable loss of information. While existing algorithms for noise suppression can yield impressive results, their efficacy remains limited for very low signal-to-noise ratios or when parts of the signal are missing. To address these limitations, here we propose an end-to-end framework for speech inpainting: the context-based retrieval of missing or severely distorted parts of a time-frequency representation of speech. The framework is based on a convolutional U-Net trained via deep feature losses, obtained using speechVGG, a deep speech feature extractor pre-trained on an auxiliary word classification task. Our evaluation results demonstrate that the proposed framework can recover large portions of missing or distorted time-frequency representations of speech, up to 400 ms and 3.2 kHz in bandwidth. In particular, our approach provided a substantial increase in the STOI and PESQ objective metrics of the initially corrupted speech samples. Notably, using deep feature losses to train the framework led to the best results compared to conventional approaches. Comment: Accepted to InterSpeech 2020.
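    To make the deep feature loss concrete, here is a minimal PyTorch sketch of the idea: a frozen, pre-trained extractor (standing in for speechVGG) maps both the inpainted and the clean spectrogram into feature space, and the loss is the distance between the two activations. The FeatureExtractor interface and the L1 distance are illustrative assumptions, not the paper's exact configuration.

        # Hedged sketch of a deep feature (perceptual) loss for spectrogram
        # inpainting. `feature_extractor` stands in for a pre-trained network
        # such as speechVGG; architecture and distance are assumptions.
        import torch
        import torch.nn as nn

        class DeepFeatureLoss(nn.Module):
            def __init__(self, feature_extractor: nn.Module):
                super().__init__()
                self.features = feature_extractor.eval()
                for p in self.features.parameters():
                    p.requires_grad = False  # extractor stays frozen during training

            def forward(self, inpainted: torch.Tensor, clean: torch.Tensor) -> torch.Tensor:
                # Compare activations rather than raw spectrogram bins, so the
                # loss emphasizes speech content over exact pixel values.
                return nn.functional.l1_loss(self.features(inpainted), self.features(clean))

        # usage: loss = DeepFeatureLoss(speechvgg)(unet(masked_spec), clean_spec)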

    Computational modelling of neural mechanisms underlying natural speech perception

    Get PDF
    Humans are highly skilled at the analysis of complex auditory scenes. In particular, the human auditory system is characterized by incredible robustness to noise and can nearly effortlessly isolate the voice of a specific talker from even the busiest of mixtures. However, the neural mechanisms underlying these remarkable properties remain poorly understood. This is mainly due to the inherent complexity of speech signals and the multi-stage, intricate processing performed in the human auditory system. Understanding the neural mechanisms underlying speech perception is of interest for clinical practice, brain-computer interfacing and automatic speech processing systems. In this thesis, we developed computational models characterizing neural speech processing across different stages of the human auditory pathways. In particular, we studied the active role of slow cortical oscillations in speech-in-noise comprehension through a spiking neural network model for encoding spoken sentences. The neural dynamics of the model during noisy speech encoding reflected the speech comprehension of young, normal-hearing adults. The proposed theoretical model was validated by predicting the effects of non-invasive brain stimulation on speech comprehension in an experimental study involving a cohort of volunteers. Moreover, we developed a modelling framework for detecting the early, high-frequency neural response to uninterrupted speech in non-invasive neural recordings. We applied the method to investigate top-down modulation of this response by the listener's selective attention and by linguistic properties of different words in a spoken narrative. We found that in both cases the detected responses, of predominantly subcortical origin, were significantly modulated, which supports a functional role of feedback between higher and lower stages of the auditory pathways in speech perception. The proposed computational models shed light on some of the poorly understood neural mechanisms underlying speech perception. The developed methods can be readily employed in future studies involving a range of experimental paradigms beyond those considered in this thesis. Open Access.
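    As a rough illustration of the detection idea (not the thesis code), one can band-pass both the neural recording and the speech waveform into a shared high-frequency range and take the peak cross-correlation within plausible subcortical latencies. The sampling rate, band edges and latency window below are assumptions.

        # Illustrative sketch: detecting a high-frequency response to
        # continuous speech via cross-correlation at candidate latencies.
        # Both signals are assumed resampled to a common rate fs.
        import numpy as np
        from scipy.signal import butter, filtfilt

        fs = 1000  # common sampling rate in Hz (assumed)

        def bandpass(x, lo, hi, fs, order=4):
            b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
            return filtfilt(b, a, x)

        def response_strength(eeg, speech, max_lag_ms=20):
            """Peak normalized cross-correlation within physiological latencies."""
            eeg_hf = bandpass(eeg, 70, 300, fs)    # high-frequency neural band (assumed)
            drive = bandpass(speech, 70, 300, fs)  # speech fundamental-frequency range
            lags = np.arange(int(max_lag_ms * fs / 1000))
            r = [np.corrcoef(eeg_hf[l:], drive[: len(drive) - l])[0, 1] for l in lags]
            return max(r), int(np.argmax(r))  # strength and latency (in samples)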

    Self-Supervised Learning for Speech Enhancement through Synthesis

    Full text link
    Modern speech enhancement (SE) networks typically implement noise suppression through time-frequency masking, latent representation masking, or discriminative signal prediction. In contrast, some recent works explore SE via generative speech synthesis, where the system's output is synthesized by a neural vocoder after an inherently lossy feature-denoising step. In this paper, we propose a denoising vocoder (DeVo) approach, where a vocoder accepts noisy representations and learns to directly synthesize clean speech. We leverage rich representations from self-supervised learning (SSL) speech models to discover relevant features. We conduct a candidate search across 15 potential SSL front-ends and subsequently train our vocoder adversarially with the best SSL configuration. Additionally, we demonstrate a causal version capable of running on streaming audio with 10 ms latency and minimal performance degradation. Finally, we conduct both objective evaluations and subjective listening studies to show that our system improves objective metrics and subjectively outperforms an existing state-of-the-art SE model.
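    A conceptual PyTorch sketch of the DeVo arrangement follows: a frozen SSL front-end encodes the noisy waveform, and the vocoder synthesizes clean speech directly from those noisy features. Both modules are placeholders; the paper's actual front-end comes from a search over 15 SSL candidates, and its vocoder is trained adversarially.

        # Conceptual sketch of the denoising-vocoder idea; `ssl_encoder` and
        # `vocoder` are placeholder modules, not the paper's implementation.
        import torch
        import torch.nn as nn

        class DenoisingVocoder(nn.Module):
            def __init__(self, ssl_encoder: nn.Module, vocoder: nn.Module):
                super().__init__()
                self.encoder = ssl_encoder.eval()  # frozen SSL front-end
                self.vocoder = vocoder             # trained (e.g. adversarially)

            def forward(self, noisy_wav: torch.Tensor) -> torch.Tensor:
                with torch.no_grad():
                    feats = self.encoder(noisy_wav)  # noisy SSL representations
                return self.vocoder(feats)           # synthesize the clean waveform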

    BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping

    Full text link
    Methods for extracting audio and speech features have been studied since pioneering work on spectrum analysis decades ago. Recent efforts are guided by the ambition to develop general-purpose audio representations. For example, deep neural networks can extract optimal embeddings if they are trained on large audio datasets. This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the effects of using different pre-training datasets. Lastly, we present a novel training framework to produce a hybrid audio representation that combines handcrafted and data-driven learned audio features. All the proposed representations were evaluated within the HEAR NeurIPS 2021 challenge for auditory scene classification and timestamp detection tasks. Our results indicate that the hybrid model with a convolutional transformer as the encoder yields superior performance in most HEAR challenge tasks. Comment: Submitted to HEAR-PMLR 2022.
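    The hybrid representation can be pictured as a simple concatenation of handcrafted and learned features. The sketch below uses log-mel statistics for the handcrafted half and a placeholder learned_encoder for the bootstrapped model, so names and dimensions are illustrative only.

        # Sketch of a hybrid audio embedding: handcrafted log-mel statistics
        # concatenated with a learned embedding. `learned_encoder` is a
        # placeholder for a bootstrapped (BYOL-style) model.
        import numpy as np
        import librosa

        def hybrid_embedding(wav: np.ndarray, sr: int, learned_encoder) -> np.ndarray:
            mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=64)
            logmel = librosa.power_to_db(mel)
            handcrafted = np.concatenate([logmel.mean(axis=1), logmel.std(axis=1)])
            learned = learned_encoder(wav)  # data-driven embedding
            return np.concatenate([handcrafted, learned])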

    Transcranial alternating current stimulation in the theta band but not in the delta band modulates the comprehension of naturalistic speech in noise

    Get PDF
    © 2020 Published by Elsevier Inc. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Auditory cortical activity entrains to speech rhythms and has been proposed as a mechanism for online speech processing. In particular, neural activity in the theta frequency band (4–8 Hz) tracks the onset of syllables, which may aid the parsing of a speech stream. Similarly, cortical activity in the delta band (1–4 Hz) entrains to the onset of words in natural speech and has been found to encode both syntactic and semantic information. Such neural entrainment to speech rhythms is not merely an epiphenomenon of other neural processes but plays a functional role in speech processing: modulating the neural entrainment through transcranial alternating current stimulation influences speech-related neural activity and modulates the comprehension of degraded speech. However, the distinct functional contributions of delta- and theta-band entrainment to the modulation of speech comprehension have not yet been investigated. Here we use transcranial alternating current stimulation with waveforms derived from the speech envelope and filtered in the delta and theta frequency bands to alter cortical entrainment in both bands separately. We find that transcranial alternating current stimulation in the theta band, but not in the delta band, impacts speech comprehension. Moreover, we find that transcranial alternating current stimulation with the theta-band portion of the speech envelope can improve speech-in-noise comprehension beyond sham stimulation. Our results show a distinct contribution of theta-band but not delta-band stimulation to the modulation of speech comprehension. In addition, our findings open up a potential avenue for enhancing the comprehension of speech in noise. Peer reviewed.
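    As an illustration of how such stimulation waveforms can be derived, the sketch below extracts the broadband speech envelope via the Hilbert transform, downsamples it, and band-pass filters it into the delta and theta ranges. The filter order, envelope rate and audio sampling rate are our assumptions, not the study's exact settings.

        # Hedged sketch: derive delta- and theta-band stimulation waveforms
        # from the speech envelope. Parameters are illustrative assumptions.
        import numpy as np
        from scipy.signal import butter, filtfilt, hilbert, resample_poly

        def envelope_band(speech, fs_audio, lo, hi, fs_env=100):
            env = np.abs(hilbert(speech))               # broadband speech envelope
            env = resample_poly(env, fs_env, fs_audio)  # downsample for low-freq filtering
            b, a = butter(3, [lo / (fs_env / 2), hi / (fs_env / 2)], btype="band")
            return filtfilt(b, a, env)

        # theta_wave = envelope_band(speech, 16000, 4.0, 8.0)  # theta (4-8 Hz)
        # delta_wave = envelope_band(speech, 16000, 1.0, 4.0)  # delta (1-4 Hz)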

    Machine learning profiles of cardiovascular risk in patients with diabetes mellitus: the Silesia Diabetes-Heart Project

    Get PDF
    Aims: As cardiovascular disease (CVD) is a leading cause of death for patients with diabetes mellitus (DM), we aimed to find important factors that predict cardiovascular (CV) risk using a machine learning (ML) approach.

    Methods and results: We performed a single-center, observational study in a cohort of 238 DM patients (mean age ± SD 52.15 ± 17.27 years, 54% female) as part of the Silesia Diabetes-Heart Project. Having gathered patients' medical history, demographic data, laboratory test results, results from the Michigan Neuropathy Screening Instrument (assessing diabetic peripheral neuropathy) and Ewing's battery examination (determining the presence of cardiovascular autonomic neuropathy), we used a ML approach to predict the occurrence of overt CVD on the basis of the five most discriminative predictors, with an area under the receiver operating characteristic curve of 0.86 (95% CI 0.80-0.91). Those features included the presence of past or current foot ulceration, age, and treatment with a beta-blocker (BB) and an angiotensin-converting enzyme inhibitor (ACEi). On the basis of the aforementioned parameters, unsupervised clustering identified different CV risk groups. The highest CV risk was determined for the eldest patients, treated to a large extent with ACEi but not BB and having current foot ulceration, and for slightly younger individuals treated extensively with both of the above-mentioned drugs, with a relatively small percentage of diabetic ulceration.

    Conclusions: Using a ML approach in a prospective cohort of patients with DM, we identified important factors that predicted CV risk. If a patient was treated with ACEi or BB, is older, and has or had a foot ulcer, this strongly predicts that he/she is at high risk of overt CVD.
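    A hedged sketch of such an analysis pipeline follows: cross-validated classification with ROC AUC, selection of the five most discriminative features by importance, then unsupervised clustering on those features. The model choice (random forest), the cluster count, and the variable names are our illustrative assumptions, not the study's configuration.

        # Illustrative pipeline: AUC via cross-validation, top-5 feature
        # selection, and unsupervised clustering on the selected features.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_predict
        from sklearn.metrics import roc_auc_score
        from sklearn.cluster import KMeans

        def cv_risk_pipeline(X: np.ndarray, y: np.ndarray, feature_names: list):
            clf = RandomForestClassifier(n_estimators=500, random_state=0)
            proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
            auc = roc_auc_score(y, proba)  # the paper reports ~0.86
            clf.fit(X, y)
            top5 = np.argsort(clf.feature_importances_)[::-1][:5]
            clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[:, top5])
            return auc, [feature_names[i] for i in top5], clusters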

    Dataset for: Hearing aids do not alter cortical entrainment to speech at audible levels in mild-to-moderately hearing-impaired subjects

    No full text
    Dataset supports: Vanheusden, Frederique Jos et al. (2019). Hearing aids do not alter cortical entrainment to speech at audible levels in mild-to-moderately hearing-impaired subjects. Frontiers in Human Neuroscience.