1,865 research outputs found

    The listening talker: A review of human and algorithmic context-induced modifications of speech

    Speech output technology is finding widespread application, including in scenarios where intelligibility might be compromised, at least for some listeners, by adverse conditions. Unlike most current algorithms, talkers continually adapt their speech patterns in response to the immediate context of spoken communication, where the type of interlocutor and the environment are the dominant situational factors influencing speech production. Observations of talker behaviour can motivate the design of more robust speech output algorithms. Starting with a listener-oriented categorisation of possible goals for speech modification, this review article summarises the extensive set of behavioural findings related to human speech modification, identifies which factors appear to be beneficial, and goes on to examine previous computational attempts to improve intelligibility in noise. The review concludes by tabulating 46 speech modifications, many of which have yet to be perceptually or algorithmically evaluated. Consequently, the review provides a roadmap for future work in improving the robustness of speech output.
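
    A minimal sketch of one frequently studied modification from this literature: reallocating energy toward higher frequencies (flattening spectral tilt), which loosely imitates Lombard speech. The filter and parameters below are illustrative assumptions, not a specific algorithm from the review.

```python
# Sketch: high-frequency emphasis as a simple intelligibility-oriented
# speech modification (illustrative only; not taken from the review).
import numpy as np
from scipy.signal import lfilter

def flatten_spectral_tilt(speech, alpha=0.95):
    """First-order pre-emphasis: boosts high frequencies relative to low,
    roughly imitating the flattened spectral tilt of Lombard speech."""
    emphasized = lfilter([1.0, -alpha], [1.0], speech)
    # Rescale so the modified signal keeps the original RMS energy,
    # a common constraint when comparing modifications fairly.
    rms_in = np.sqrt(np.mean(speech ** 2))
    rms_out = np.sqrt(np.mean(emphasized ** 2)) + 1e-12
    return emphasized * (rms_in / rms_out)

if __name__ == "__main__":
    fs = 16000
    t = np.arange(fs) / fs
    demo = np.sin(2 * np.pi * 220 * t)  # stand-in for a speech signal
    print(flatten_spectral_tilt(demo).shape)
```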

    Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments

    Eliminating the negative effect of non-stationary environmental noise is a long-standing research topic for automatic speech recognition that still remains an important challenge. Data-driven supervised approaches, including ones based on deep neural networks, have recently emerged as potential alternatives to traditional unsupervised approaches and, with sufficient training, can alleviate the shortcomings of the unsupervised methods in various real-life acoustic environments. In this light, we review recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those involved in the development of environmentally robust speech recognition systems. We separately discuss single- and multi-channel techniques developed for the front-end and back-end of speech recognition systems, as well as joint front-end and back-end training frameworks.
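
    A minimal sketch of one representative single-channel front-end technique of the kind such overviews survey: a recurrent network that estimates a ratio mask applied to the noisy magnitude spectrogram. The architecture, sizes, and names here are illustrative assumptions, not the paper's specific model.

```python
# Sketch: mask-based single-channel front-end denoising
# (architecture and dimensions are illustrative assumptions).
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    def __init__(self, n_freq=257, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_freq, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, noisy_mag):          # (batch, frames, freq)
        h, _ = self.rnn(noisy_mag)
        return torch.sigmoid(self.out(h))  # ratio mask in [0, 1]

# Training target: an ideal ratio mask computed from parallel clean/noisy
# spectrograms; at test time the mask multiplies the noisy magnitude.
model = MaskEstimator()
noisy = torch.rand(4, 100, 257)            # dummy magnitude spectrogram
enhanced = model(noisy) * noisy
print(enhanced.shape)                      # torch.Size([4, 100, 257])
```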

    Array Configuration-Agnostic Personal Voice Activity Detection Based on Spatial Coherence

    Personal voice activity detection (PVAD) has received increased attention due to the growing popularity of personal mobile devices and smart speakers. PVAD is often an integral element of speech enhancement and recognition for these applications, in which lightweight signal processing is enabled only for the target user. However, in real-world scenarios, detection performance may degrade because of competing speakers, background noise, and reverberation. To address this problem, we propose using equivalent rectangular bandwidth (ERB)-scaled spatial coherence as the input feature to train an array configuration-agnostic PVAD (ARCA-PVAD) network. Although the network model requires only 112k parameters, it exhibits excellent detection performance and robustness in adverse acoustic conditions. Notably, the proposed ARCA-PVAD system scales across array configurations. Experimental results demonstrate the superior performance of the proposed ARCA-PVAD system over a baseline in terms of the area under the receiver operating characteristic curve and the equal error rate. (Comment: Accepted by INTER-NOISE 2023. arXiv admin note: text overlap with arXiv:2211.0874)
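
    A minimal sketch of how an ERB-scaled spatial coherence feature could be computed for a microphone pair: magnitude-squared coherence per frequency bin, pooled into bands on the standard Glasberg & Moore ERB-number scale. The paper's exact feature definition and band count may differ; the function names here are hypothetical.

```python
# Sketch: pooling inter-microphone magnitude-squared coherence into
# ERB-scaled bands (the paper's exact feature definition may differ).
import numpy as np
from scipy.signal import coherence

def erb_number(f_hz):
    """Glasberg & Moore ERB-number scale."""
    return 21.4 * np.log10(1.0 + 0.00437 * f_hz)

def erb_scaled_coherence(mic_a, mic_b, fs=16000, n_bands=32, nperseg=512):
    f, msc = coherence(mic_a, mic_b, fs=fs, nperseg=nperseg)
    # Assign each FFT bin to an ERB band and average coherence per band.
    edges = np.linspace(erb_number(f[1]), erb_number(fs / 2), n_bands + 1)
    band_idx = np.digitize(erb_number(np.maximum(f, f[1])), edges) - 1
    band_idx = np.clip(band_idx, 0, n_bands - 1)
    return np.array([msc[band_idx == b].mean() if np.any(band_idx == b)
                     else 0.0 for b in range(n_bands)])

x = np.random.randn(16000)
y = x + 0.1 * np.random.randn(16000)     # correlated second channel
print(erb_scaled_coherence(x, y).shape)  # (32,) -- one value per ERB band
```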

    Objective Assessment of Machine Learning Algorithms for Speech Enhancement in Hearing Aids

    Speech enhancement in assistive hearing devices has been an area of research for many decades. Noise reduction is particularly challenging because of the wide variety of noise sources and the non-stationarity of speech and noise. The digital signal processing (DSP) algorithms deployed in modern hearing aids for noise reduction rely on certain assumptions about the statistical properties of undesired signals. These assumptions can hinder accurate estimation of different noise types, leading to suboptimal noise reduction. In this research, a relatively unexplored technique based on deep learning, namely a Recurrent Neural Network (RNN), is used to perform noise reduction and dereverberation for assisting hearing-impaired listeners. For noise reduction, the performance of the deep learning model was evaluated objectively and compared with that of open Master Hearing Aid (openMHA), a conventional signal-processing framework, and with a Deep Neural Network (DNN) based model. The RNN model was found to suppress noise and improve speech understanding better than both the conventional hearing aid noise reduction algorithm and the DNN model. With proper training, the same RNN model was also shown to reduce reverberation components. A real-time implementation of the deep learning model is also discussed.
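
    A minimal sketch of the kind of objective evaluation described: scoring an enhanced signal against a clean reference with standard intelligibility and quality metrics. The abstract does not name the exact metrics, so STOI (via the pystoi package) and PESQ (via the pesq package) are assumed here as representative choices.

```python
# Sketch: objective scoring of an enhanced signal against a clean
# reference. STOI/PESQ are assumed representative metrics; the thesis's
# exact metric set is not stated in the abstract.
import numpy as np
from pystoi import stoi   # pip install pystoi
from pesq import pesq     # pip install pesq

def score(clean, enhanced, fs=16000):
    """Objective intelligibility/quality scores (higher is better)."""
    return {
        "STOI": stoi(clean, enhanced, fs, extended=False),  # 0..1
        "PESQ": pesq(fs, clean, enhanced, "wb"),            # ~1..4.5
    }

# Demo with synthetic signals; PESQ expects real speech content, so only
# STOI is computed here. Replace clean/enhanced with actual recordings.
fs = 16000
clean = np.random.randn(3 * fs)
enhanced = clean + 0.05 * np.random.randn(3 * fs)  # stand-in for RNN output
print("STOI:", stoi(clean, enhanced, fs, extended=False))
```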

    Decoding auditory attention and neural language processing in adverse conditions and different listener groups

    This thesis investigated subjective, behavioural and neurophysiological (EEG) measures of speech processing in various adverse conditions and with different listener groups. In particular, it focused on different neural processing stages and their relationship with auditory attention, effort, and measures of speech intelligibility. Study 1 set the groundwork by establishing a toolbox of neural measures for investigating online speech processing, from the frequency following response (FFR) and cortical measures of speech processing to the N400, a measure of lexico-semantic processing. Results showed that peripheral processing is heavily influenced by stimulus characteristics such as degradation, whereas central processing stages are more closely linked to higher-order phenomena such as speech intelligibility. In Study 2, a similar experimental paradigm was used to investigate differences in neural processing between a hearing-impaired and a normal-hearing group. Subjects were presented with short stories in different levels of multi-talker babble noise and with different settings on their hearing aids. Findings indicate that, particularly at lower noise levels, the hearing-impaired group showed much higher cortical entrainment than the normal-hearing group, despite similar levels of speech recognition. Intersubject correlation, another global neural measure of auditory attention, was however similarly affected by noise levels in both the hearing-impaired and the normal-hearing group. This finding indicates extra processing in the hearing-impaired group only at the level of the auditory cortex. Study 3, in contrast to Studies 1 and 2 (which both investigated the effects of bottom-up factors on neural processing), examined the links between entrainment and top-down factors, specifically motivation, as well as reasons for the higher entrainment found in hearing-impaired subjects in Study 2. Results indicated that, while there was no behavioural difference between incentive and non-incentive conditions, neurophysiological measures of attention such as intersubject correlation were affected by the presence of an incentive to perform better. Moreover, using a specific degradation type resulted in subjects' increased cortical entrainment under degraded conditions. These findings support the hypothesis that top-down factors such as motivation influence neurophysiological measures, and that higher entrainment to degraded speech might be triggered specifically by the reduced availability of spectral detail contained in speech.
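
    A minimal sketch of one common way to quantify the cortical entrainment discussed above: stimulus reconstruction, in which a ridge regression (a backward temporal response function) maps time-lagged EEG to the speech envelope, and the envelope/reconstruction correlation serves as the entrainment index. This is a standard approach in the field; the thesis's exact pipeline may differ, and all data and dimensions below are dummies.

```python
# Sketch: entrainment via stimulus reconstruction (backward TRF).
# Standard approach in the literature; not necessarily the thesis's exact
# pipeline. All signals below are synthetic placeholders.
import numpy as np
from sklearn.linear_model import Ridge

def lagged(eeg, n_lags):
    """Stack time-lagged copies of each EEG channel as features."""
    n_t, n_ch = eeg.shape
    X = np.zeros((n_t, n_ch * n_lags))
    for lag in range(n_lags):
        X[lag:, lag * n_ch:(lag + 1) * n_ch] = eeg[:n_t - lag]
    return X

rng = np.random.default_rng(0)
envelope = rng.standard_normal(1000)        # speech envelope (dummy)
eeg = rng.standard_normal((1000, 64))       # 64-channel EEG (dummy)
X = lagged(eeg, n_lags=16)                  # ~250 ms of lags at 64 Hz

model = Ridge(alpha=1.0).fit(X[:800], envelope[:800])  # train split
recon = model.predict(X[800:])                          # test split
r = np.corrcoef(recon, envelope[800:])[0, 1]            # entrainment index
print(f"reconstruction accuracy r = {r:.3f}")
```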

    Early abductive reasoning for blind signal separation

    We demonstrate that explicit and systematic incorporation of abductive reasoning capabilities into algorithms for blind signal separation can yield significant performance improvements. Our mechanisms operate on the output data of signal processing modules in order to conjecture the structure of time-frequency interactions between the signal components that are to be separated. The conjectured interactions then drive subsequent signal separation processes that are, as a result, less blind to the interacting signal components and therefore more effective. We refer to this type of process as early abductive reasoning (EAR); "early" refers to the fact that, in contrast to classical Artificial Intelligence paradigms, the reasoning process is applied before the signal processing transformations are completed. We have used the EAR approach to formulate a practical algorithm that is more effective in realistically noisy conditions than reference algorithms representative of the current state of the art in two-speaker pitch tracking. Our algorithm uses the Blackboard architecture from Artificial Intelligence to control the EAR and signal processing modules. It has been implemented in MATLAB and successfully tested on a database of 570 mixture signals representing simultaneous speakers in a variety of real-world, noisy environments. With 0 dB Target-to-Masking Ratio (TMR) and no noise, the Gross Error Rate (GER) of our algorithm is 5%, compared with the best GER of 11% among the reference algorithms. In diffuse noisy environments (such as street or restaurant environments), our algorithm outperforms the best reference algorithm by 9.4% on average; with directional noise, it outperforms the best reference algorithm by 29%. The pitch tracks extracted by our algorithm were also used for comb filtering to separate the harmonics of the two speakers from each other and from the other sound sources in the environment. The separated signals were judged by a panel of 20 listeners to be of reasonable quality.
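
    A minimal sketch of the comb-filtering step described at the end of the abstract: given a pitch estimate for one speaker, averaging the signal with a copy delayed by one pitch period reinforces that speaker's harmonics and attenuates other components. This is a textbook FIR comb; the paper's actual filter design may differ.

```python
# Sketch: comb filtering to emphasize one speaker's harmonics given a
# pitch estimate (textbook FIR comb; the paper's filter may differ).
import numpy as np

def comb_enhance(x, fs, f0):
    """Average the signal with a one-pitch-period delayed copy:
    harmonics of f0 add in phase, other components are attenuated."""
    period = int(round(fs / f0))         # pitch period in samples
    y = np.copy(x)
    y[period:] = 0.5 * (x[period:] + x[:-period])
    return y

fs = 16000
t = np.arange(fs) / fs
target = np.sin(2 * np.pi * 200 * t)     # voiced "speaker" at 200 Hz
masker = np.sin(2 * np.pi * 310 * t)     # competing source
mix = target + masker
out = comb_enhance(mix, fs, f0=200.0)
# Residual error vs. the target shrinks after filtering: the 200 Hz
# harmonic is preserved while the 310 Hz component is partially cancelled.
print(np.std(out - target), "<", np.std(mix - target))
```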