37 research outputs found

    Non-Intrusive Speech Intelligibility Prediction

    Model-based speech enhancement for hearing aids

    Data-driven Speech Intelligibility Enhancement and Prediction for Hearing Aids

    Hearing impairment is a widespread problem around the world. It is estimated that one in six people lives with some degree of hearing loss. Moderate and severe hearing impairment has been recognised as one of the major causes of disability and is associated with declines in quality of life, mental illness and dementia. However, investigations show that only 10-20% of older people with significant hearing impairment wear hearing aids. One of the main factors behind this low uptake is that current devices struggle to help hearing aid users understand speech in noisy environments. To compensate for the elevated hearing thresholds and the impaired source separation caused by the damaged auditory system, amplification and denoising have been the major focuses of current hearing aid research aimed at improving the intelligibility of speech in noise. It is also important to derive a metric that can reliably predict speech intelligibility, to support the development of hearing aid techniques. This thesis aims to enhance speech intelligibility for hearing impaired listeners. Motivated by the success of data-driven approaches in many speech processing applications, this work proposes the differentiable hearing aid speech processing (DHASP) framework to optimise both the amplification and denoising modules within a hearing aid processor. This is accomplished by setting an intelligibility-based optimisation objective and taking advantage of large-scale speech databases to train the hearing aid processor to maximise intelligibility for the listeners. The first set of experiments is conducted on both clean and noisy speech databases, and the results from objective evaluation suggest that the amplification fittings optimised within the DHASP framework can outperform a widely used and well-recognised fitting. The second set of experiments is conducted on a large-scale database with simulated domestic noisy scenes. The results from both objective and subjective evaluations show that the DHASP-optimised hearing aid processor incorporating a deep neural network-based denoising module can achieve competitive performance in terms of intelligibility enhancement. A precise intelligibility predictor can provide reliable evaluation results and save the cost of expensive and time-consuming subjective evaluation. Inspired by findings that automatic speech recognition (ASR) models show recognition results similar to those of humans in some experiments, this work exploits ASR models for intelligibility prediction. An intrusive approach using ASR hidden representations and a non-intrusive approach using ASR uncertainty are proposed and explained in the third and fourth experimental chapters. Experiments are conducted on two databases: one with monaural speech in speech-spectrum-shaped noise presented to normal hearing listeners, and the other with processed binaural speech in domestic noise presented to hearing impaired listeners. Results suggest that both the intrusive and non-intrusive approaches achieve top performance and outperform a number of widely used intelligibility prediction approaches. In conclusion, this thesis covers both the enhancement and prediction of speech intelligibility for hearing aids. The hearing aid processor optimised within the proposed DHASP framework can significantly improve the intelligibility of speech in noise for hearing impaired listeners. It is also shown that the proposed ASR-based intelligibility prediction approaches achieve state-of-the-art performance compared with a number of widely used intelligibility predictors.
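    As a rough illustration of what an intelligibility-based optimisation objective for a differentiable processor can look like, the sketch below fits per-band amplification gains by gradient descent against a simple envelope-correlation proxy. Everything in it (the synthetic envelopes, the crude hearing-threshold model, the proxy metric and the optimiser settings) is an assumption for illustration only; it is not the DHASP framework, auditory model or fitting procedure from the thesis.
```python
# Minimal sketch (not the thesis code): gradient-based fitting of per-band gains
# against a differentiable intelligibility proxy. All data and models are toy placeholders.
import torch

torch.manual_seed(0)

n_bands, n_frames = 8, 200
clean = torch.rand(n_bands, n_frames)                     # clean-speech band envelopes (toy data)
noisy = clean + 0.3 * torch.rand(n_bands, n_frames)       # noisy, unamplified envelopes (toy data)
thresholds = torch.linspace(0.2, 1.0, n_bands).unsqueeze(1)  # assumed elevated hearing thresholds

log_gains = torch.zeros(n_bands, 1, requires_grad=True)   # learnable per-band amplification
optimiser = torch.optim.Adam([log_gains], lr=0.05)

def perceived(x):
    """Crude hearing-loss model: only the part of the signal above threshold is audible."""
    return torch.relu(x - thresholds)

def intelligibility_proxy(ref, proc):
    """Mean per-band correlation between reference and processed envelopes (STOI-like stand-in)."""
    ref = ref - ref.mean(dim=1, keepdim=True)
    proc = proc - proc.mean(dim=1, keepdim=True)
    corr = (ref * proc).sum(dim=1) / (ref.norm(dim=1) * proc.norm(dim=1) + 1e-8)
    return corr.mean()

for step in range(200):
    optimiser.zero_grad()
    aided = torch.exp(log_gains) * noisy                  # differentiable amplification stage
    loss = -intelligibility_proxy(clean, perceived(aided))  # maximise the intelligibility proxy
    loss.backward()
    optimiser.step()

print("fitted per-band gains:", torch.exp(log_gains).squeeze().tolist())
```
    In this toy setup the gains grow most in the bands with the highest thresholds, mirroring the intuition that amplification should restore audibility where hearing loss is greatest.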

    Data-Driven Speech Intelligibility Prediction

    Speech Intelligibility Prediction for Hearing Aid Systems

    Joint estimation of reverberation time and early-to-late reverberation ratio from single-channel speech signals

    The reverberation time (RT) and the early-to-late reverberation ratio (ELR) are two key parameters commonly used to characterize acoustic room environments. In contrast to conventional blind estimation methods that process the two parameters separately, we propose a model that jointly predicts the RT and the ELR from single-channel speech signals, using either full-band or sub-band frequency data, referred to as the joint room parameter estimator (jROPE). An artificial neural network is employed to learn the mapping from acoustic observations to the RT and ELR classes. Auditory-inspired acoustic features, obtained by temporal modulation filtering of the speech time-frequency representations, are used as input to the neural network. Based on an in-depth analysis of the dependency between the RT and the ELR, a two-dimensional (RT, ELR) distribution with constrained boundaries is derived, which is then exploited to evaluate four different configurations for jROPE. Experimental results show that, in comparison to the single-task ROPE system which estimates the RT or the ELR individually, jROPE provides improved results for both tasks in various reverberant and (diffuse) noisy environments. Among the four proposed joint types, the one incorporating multi-task learning with shared input and hidden layers yields the best estimation accuracies on average. When encountering extreme reverberant conditions with RTs and ELRs lying beyond the derived (RT, ELR) distribution, the type that treats the RT and the ELR as a single joint parameter is particularly robust. Compared with the state-of-the-art algorithms tested in the Acoustic Characterization of Environments (ACE) challenge, jROPE achieves results comparable to the best for all individual tasks (RT and ELR estimation from full-band and sub-band signals).
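    To make the multi-task configuration concrete, the sketch below shows one plausible way to share input and hidden layers between an RT classification head and an ELR classification head, trained with a summed cross-entropy loss. The feature dimensionality, layer sizes, class counts and the random training batch are placeholder assumptions, not the architecture or data used in the paper.
```python
# Hypothetical sketch of a multi-task RT/ELR classifier with a shared trunk.
import torch
import torch.nn as nn

class JointRoomParameterNet(nn.Module):
    def __init__(self, n_features=60, n_rt_classes=10, n_elr_classes=10):
        super().__init__()
        # Shared input and hidden layers, as in the best-performing multi-task configuration.
        self.shared = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
        )
        self.rt_head = nn.Linear(128, n_rt_classes)    # RT class logits
        self.elr_head = nn.Linear(128, n_elr_classes)  # ELR class logits

    def forward(self, x):
        h = self.shared(x)
        return self.rt_head(h), self.elr_head(h)

# Joint training minimises the sum of the two classification losses.
model = JointRoomParameterNet()
features = torch.randn(32, 60)            # batch of modulation-filtered features (placeholder)
rt_labels = torch.randint(0, 10, (32,))
elr_labels = torch.randint(0, 10, (32,))
rt_logits, elr_logits = model(features)
loss = nn.functional.cross_entropy(rt_logits, rt_labels) + \
       nn.functional.cross_entropy(elr_logits, elr_labels)
loss.backward()
```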

    Probabilistic Models of Speech Quality

    Tokyo Denki University 202

    Evaluating cognitive load of text-to-speech synthesis

    This thesis addresses the vital topic of evaluating synthetic speech and its impact on the end-user, taking into consideration potential negative implications for cognitive load. While conventional methods like transcription tests and Mean Opinion Score (MOS) tests offer a valuable overall understanding of system performance, they fail to provide deeper insights into the reasons behind that performance. As text-to-speech (TTS) systems are increasingly used in real-world applications, it becomes crucial to explore whether synthetic speech imposes a greater cognitive load on listeners than human speech, as excessive cognitive effort could lead to fatigue over time. The study focuses on assessing the cognitive load of synthetic speech by presenting two methodologies: the dual-task paradigm and pupillometry. The dual-task paradigm initially seemed promising but was eventually deemed unreliable and unsuitable due to uncertainties in the experimental setups, which require further investigation. Pupillometry, however, emerged as a viable approach, demonstrating its efficacy in detecting differences in cognitive load among various speech synthesizers. Notably, the research confirmed that accurate measurement of listening difficulty requires imposing sufficient cognitive load on listeners. To achieve this, the most viable experimental setup involved measuring the pupil response while listening to speech in the presence of noise. These experiments revealed intriguing contrasts between human and synthetic speech. Human speech consistently demanded the least cognitive load. State-of-the-art TTS systems, on the other hand, showed promising results, indicating a significant improvement in cognitive load compared to rule-based synthesizers of the past. Pupillometry offers a deeper understanding of the factors contributing to increased cognitive load in synthetic speech processing. In particular, one experiment highlighted that the separate modeling of spectral feature prediction and duration in TTS systems led to heightened cognitive load. Encouragingly, however, many modern end-to-end TTS systems have addressed these issues by predicting acoustic features within a unified framework, thus effectively reducing the overall cognitive load imposed by synthetic speech. As the gap between human and synthetic speech diminishes with advancements in TTS technology, continuous evaluation using pupillometry remains essential for optimizing TTS systems for low cognitive load. Although pupillometry demands advanced analysis techniques and is time-consuming, the meaningful insights it provides into the cognitive load of synthetic speech contribute to an enhanced user experience and better TTS system development. Overall, this work establishes pupillometry as a viable and effective method for measuring the cognitive load of synthetic speech, propelling synthetic speech evaluation beyond traditional metrics. By gaining a deeper understanding of how synthetic speech interacts with the human cognitive processing system, researchers and developers can work towards creating TTS systems that offer improved user experiences with reduced cognitive load, ultimately enhancing the overall usability and acceptance of such technologies. Note: there was a two-year break in the work reported in this thesis; an initial pilot was performed in early 2020 and was then suspended due to the COVID-19 pandemic. Experiments were therefore rerun in 2022/23 with the most recent state-of-the-art models to determine whether the increased cognitive load result still applies. The thesis thus concludes by answering whether the cognitive load methods developed here remain useful, practical and/or relevant for current state-of-the-art text-to-speech systems.
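    As a rough sketch of the kind of analysis a pupillometry study involves, the code below baseline-corrects simulated pupil traces recorded while "listening to speech in noise" and compares mean task-evoked dilation across conditions. The sampling rate, window boundaries, condition names and simulated dilation values are invented placeholders for illustration; they are not the thesis's experimental data or analysis pipeline.
```python
# Hypothetical sketch of a baseline-corrected pupil-dilation comparison across conditions.
import numpy as np

rng = np.random.default_rng(0)
fs = 60                          # assumed eye-tracker sampling rate in Hz
baseline_s, listen_s = 1.0, 3.0  # assumed baseline and listening-window durations
t = np.arange(int((baseline_s + listen_s) * fs)) / fs

def simulate_trial(peak_dilation):
    """Toy pupil trace: flat baseline, then a slow dilation during listening."""
    trace = np.full(t.shape, 3.0)                         # baseline pupil diameter (mm)
    listening = t >= baseline_s
    trace[listening] += peak_dilation * (1 - np.exp(-(t[listening] - baseline_s)))
    return trace + rng.normal(0, 0.02, t.shape)           # measurement noise

def mean_dilation(trace):
    """Baseline-correct the trace, then average over the listening window."""
    baseline = trace[t < baseline_s].mean()
    window = (t >= baseline_s + 0.5) & (t < baseline_s + listen_s)
    return (trace - baseline)[window].mean()

# Invented peak dilations, only to mimic the qualitative ordering described above.
conditions = {"human speech": 0.10, "neural TTS": 0.14, "rule-based TTS": 0.25}
for name, peak in conditions.items():
    dilations = [mean_dilation(simulate_trial(peak)) for _ in range(20)]
    print(f"{name}: mean task-evoked dilation = {np.mean(dilations):.3f} mm")
```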