Multi-objective Non-intrusive Hearing-aid Speech Assessment Model
Non-intrusive speech assessment methods, which require no clean reference, have attracted considerable attention for objective evaluation. While deep learning models have been used to develop such methods with promising results, there is limited research on hearing-impaired subjects. This study proposes a multi-objective non-intrusive hearing-aid speech assessment model, called HASA-Net Large, which predicts speech quality and intelligibility scores from input speech signals and specified hearing-loss patterns. Our experiments showed that using pre-trained self-supervised learning (SSL) models leads to a significant boost in speech quality and intelligibility predictions compared to using spectrograms as input. Additionally, we examined three distinct fine-tuning approaches that resulted in further performance improvements, and we demonstrated that incorporating SSL models improved transferability to an out-of-domain (OOD) dataset. In summary, HASA-Net Large uses raw waveforms and hearing-loss patterns to accurately predict speech quality and intelligibility levels for individuals with normal and impaired hearing, and it demonstrates superior prediction performance and transferability.
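As an illustration of the kind of architecture this abstract describes, here is a minimal sketch of a multi-objective assessor with two regression heads conditioned on a hearing-loss pattern. All layer sizes, the 8-band audiogram encoding, and the pooling scheme are illustrative assumptions, not the paper's configuration; SSL frame embeddings are assumed to be precomputed.

```python
# Sketch in the spirit of HASA-Net Large: two heads (quality, intelligibility)
# on top of SSL features conditioned on a hearing-loss pattern. All sizes and
# design details are illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class MultiObjectiveAssessor(nn.Module):
    def __init__(self, ssl_dim=768, hl_dim=8, hidden=128):
        super().__init__()
        # Project the hearing-loss pattern (e.g., an 8-band audiogram) so it
        # can be concatenated with every SSL frame.
        self.hl_proj = nn.Linear(hl_dim, 32)
        self.rnn = nn.LSTM(ssl_dim + 32, hidden, batch_first=True,
                           bidirectional=True)
        # One regression head per objective.
        self.quality_head = nn.Linear(2 * hidden, 1)
        self.intell_head = nn.Linear(2 * hidden, 1)

    def forward(self, ssl_feats, hl_pattern):
        # ssl_feats: [B, T, ssl_dim]; hl_pattern: [B, hl_dim]
        B, T, _ = ssl_feats.shape
        hl = self.hl_proj(hl_pattern).unsqueeze(1).expand(B, T, -1)
        h, _ = self.rnn(torch.cat([ssl_feats, hl], dim=-1))
        pooled = h.mean(dim=1)                  # utterance-level pooling
        return self.quality_head(pooled), self.intell_head(pooled)

feats = torch.randn(4, 200, 768)    # dummy precomputed SSL embeddings
audiogram = torch.randn(4, 8)       # dummy hearing-loss pattern
q, i = MultiObjectiveAssessor()(feats, audiogram)
print(q.shape, i.shape)             # torch.Size([4, 1]) each
```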
Objective Assessment of Machine Learning Algorithms for Speech Enhancement in Hearing Aids
Speech enhancement in assistive hearing devices has been an area of research for many decades. Noise reduction is particularly challenging because of the wide variety of noise sources and the non-stationarity of speech and noise. Digital signal processing (DSP) algorithms deployed in modern hearing aids for noise reduction rely on certain assumptions about the statistical properties of undesired signals. This can be disadvantageous for accurate estimation of different noise types, which subsequently leads to suboptimal noise reduction. In this research, a relatively unexplored technique based on deep learning, a recurrent neural network (RNN), is used to perform noise reduction and dereverberation for assisting hearing-impaired listeners. For noise reduction, the performance of the deep learning model was evaluated objectively and compared with that of open Master Hearing Aid (openMHA), a conventional signal-processing-based framework, and a deep neural network (DNN) based model. It was found that the RNN model can suppress noise and improve speech understanding better than the conventional hearing aid noise reduction algorithm and the DNN model. The same RNN model was shown to reduce reverberation components given proper training. A real-time implementation of the deep learning model is also discussed.
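The kind of objective comparison this abstract describes can be sketched with standard reference-based metrics. Below is a minimal example using the third-party `pesq` and `pystoi` packages; the file names and the 16 kHz wideband setting are assumptions for illustration, and these generic metrics are not necessarily the exact measures used in the study.

```python
# Sketch of an objective evaluation comparing enhanced speech against a clean
# reference, using PESQ and STOI from the third-party `pesq` and `pystoi`
# packages (pip install pesq pystoi soundfile). File paths and the 16 kHz
# sampling rate are assumptions for illustration.
import soundfile as sf
from pesq import pesq
from pystoi import stoi

clean, fs = sf.read("clean.wav")        # reference signal
noisy, _ = sf.read("noisy.wav")         # unprocessed input
enhanced, _ = sf.read("enhanced.wav")   # output of the enhancement model

for name, sig in [("noisy", noisy), ("enhanced", enhanced)]:
    p = pesq(fs, clean, sig, "wb")            # wideband PESQ (fs = 16 kHz)
    s = stoi(clean, sig, fs, extended=False)  # short-time objective intelligibility
    print(f"{name}: PESQ={p:.2f} STOI={s:.3f}")
```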
Comparison of effects on subjective intelligibility and quality of speech in babble for two algorithms: A deep recurrent neural network and spectral subtraction.
The effects on speech intelligibility and sound quality of two noise-reduction algorithms were compared: a deep recurrent neural network (RNN) and spectral subtraction (SS). The RNN was trained using sentences spoken by a large number of talkers with a variety of accents, presented in babble. Different talkers were used for testing. Participants with mild-to-moderate hearing loss were tested. Stimuli were given frequency-dependent linear amplification to compensate for the individual hearing losses. A paired-comparison procedure was used to compare all possible combinations of three conditions. The conditions were: speech in babble with no processing (NP) or processed using the RNN or SS. In each trial, the same sentence was played twice using two different conditions. The participants indicated which one was better, and by how much, in terms of speech intelligibility and (in separate blocks) sound quality. Processing using the RNN was significantly preferred over NP and over SS processing for both subjective intelligibility and sound quality, although the magnitude of the preferences was small. SS processing was not significantly preferred over NP for either subjective intelligibility or sound quality. Objective computational measures of speech intelligibility predicted better intelligibility for the RNN than for SS or NP.
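One plausible way to analyse such paired-comparison ratings is sketched below: collect the signed preference magnitude for each trial of a given condition pair and test whether its median differs from zero. This is an illustration only, not necessarily the authors' exact statistical procedure, and the ratings are fabricated placeholders.

```python
# Sketch of a paired-comparison analysis: for each trial comparing two
# conditions (e.g., RNN vs. NP), a signed preference magnitude is recorded
# (positive = first condition preferred). A Wilcoxon signed-rank test checks
# whether the median preference differs from zero. The data are fabricated
# placeholders purely to make the example runnable.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Placeholder ratings on a -10..10 scale for 30 trials of one pairing.
rnn_vs_np = rng.integers(-2, 8, size=30)

stat, p = wilcoxon(rnn_vs_np)   # zeros are dropped by the default method
print(f"median preference = {np.median(rnn_vs_np):.1f}, p = {p:.4f}")
```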
CCATMos: Convolutional Context-aware Transformer Network for Non-intrusive Speech Quality Assessment
Speech quality assessment has been a critical component in many voice-communication-related applications such as telephony and online conferencing. Traditional intrusive speech quality assessment requires a clean reference of the degraded utterance to provide an accurate quality measurement, a requirement that limits the usability of these methods in real-world scenarios. On the other hand, non-intrusive subjective measurement is the "gold standard" in evaluating speech quality, as human listeners can intrinsically evaluate the quality of any degraded speech with ease. In this paper, we propose a novel end-to-end model structure called the Convolutional Context-Aware Transformer (CCAT) network to predict the mean opinion score (MOS) of human raters. We evaluate our model on three MOS-annotated datasets spanning multiple languages and distortion types and submit our results to the ConferencingSpeech 2022 Challenge. Our experiments show that CCAT provides promising MOS predictions compared to current state-of-the-art non-intrusive speech assessment models, with the average Pearson correlation coefficient (PCC) increasing from 0.530 to 0.697 and the average RMSE decreasing from 0.768 to 0.570 relative to the baseline model on the challenge evaluation test set.
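The two evaluation metrics quoted above are straightforward to compute from predicted and ground-truth MOS values; a minimal sketch with placeholder data follows.

```python
# Computing the two metrics reported above: Pearson correlation coefficient
# (PCC) and root-mean-square error (RMSE) between predicted and true MOS.
# The arrays are placeholder values for illustration only.
import numpy as np

mos_true = np.array([3.2, 4.1, 2.5, 3.8, 4.6])
mos_pred = np.array([3.0, 4.3, 2.9, 3.5, 4.4])

pcc = np.corrcoef(mos_true, mos_pred)[0, 1]
rmse = np.sqrt(np.mean((mos_true - mos_pred) ** 2))
print(f"PCC={pcc:.3f} RMSE={rmse:.3f}")
```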
Objective and Subjective Evaluation of Wideband Speech Quality
Traditional landline and cellular communications use a bandwidth of 300-3400 Hz for transmitting speech. This narrow bandwidth impacts the quality, intelligibility, and naturalness of transmitted speech. There is an impending change within the telecommunication industry towards using wider-bandwidth speech, but the enlarged bandwidth also introduces a few challenges in speech processing. Echo and noise are two challenging issues in wideband telephony, due to users' increased perceptual sensitivity.
Subjective and/or objective measurements of speech quality are important in benchmarking speech processing algorithms and evaluating the effects of parameters like noise, echo, and delay in wideband telephony. Subjective measures include ratings of speech quality by listeners, whereas objective measures compute a metric from the reference and degraded speech samples. While subjective quality ratings are the "gold standard", they are also time- and resource-consuming. An objective metric that correlates highly with subjective data is attractive, as it can act as a substitute for subjective quality scores in gauging the performance of different algorithms and devices.
This thesis reports results from a series of experiments on subjective and objective speech quality evaluation for wideband telephony applications. First, a custom wideband noise reduction database was created that contained speech samples corrupted by different background noises at different signal-to-noise ratios (SNRs) and processed by six different noise reduction algorithms. Comprehensive subjective evaluation of this database revealed an interaction between algorithm performance, noise type, and SNR. Several auditory-based objective metrics, such as the Loudness Pattern Distortion (LPD) measure based on the Moore-Glasberg auditory model, were evaluated in predicting the subjective scores. In addition, the performance of Bayesian Multivariate Regression Splines (BMLS) was evaluated in mapping the scores calculated by the objective metrics to the true quality scores. The combination of LPD and BMLS resulted in high correlation with the subjective scores and was used as a substitute for subjective ratings in fine-tuning the noise reduction algorithms.
Second, the effects of echo and delay on wideband speech were evaluated in both listening and conversational contexts, through both subjective and objective measures. A database containing speech samples corrupted by echo with different delay and frequency response characteristics was created and later used to collect subjective quality ratings. The LPD-BMLS objective metric was then validated using the subjective scores.
Third, to evaluate the effects of echo and delay in a conversational context, a real-time simulator was developed. Pairs of subjects conversed over the simulated system and rated the quality of their conversations, which were degraded by different amounts of echo and delay. The quality scores were analysed, and the LPD-BMLS combination was found to be effective in predicting subjective impressions of quality for condition-averaged data.
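The mapping step described here, from an objective metric such as LPD to subjective quality scores, can be sketched with any flexible regressor. BMLS itself is not available in common libraries, so a simple polynomial fit stands in below, with fabricated placeholder data.

```python
# Sketch of mapping objective metric values to subjective MOS. The thesis
# uses BMLS; since that model is not in common libraries, a third-order
# polynomial fit stands in here, and the data points are placeholders.
import numpy as np

lpd = np.array([0.9, 0.7, 0.55, 0.4, 0.3, 0.2])   # objective distortion scores
mos = np.array([1.8, 2.4, 3.0, 3.6, 4.1, 4.5])    # subjective quality ratings

coeffs = np.polyfit(lpd, mos, deg=3)               # fit the mapping
predicted = np.polyval(coeffs, lpd)

pcc = np.corrcoef(mos, predicted)[0, 1]
print(f"correlation after mapping: {pcc:.3f}")
```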
Understanding hearing aid sound quality for music-listening
To improve speech intelligibility for individuals with hearing loss, hearing aids amplify speech using gains derived from evidence-based prescriptive methods, in addition to other advanced signal processing mechanisms. While the evidence supports the use of hearing aid signal processing for speech intelligibility, these signal processing adjustments can also be detrimental to hearing aid sound quality, with poor hearing aid sound quality cited as a barrier to device adoption. Poor sound quality is also of concern for music-listening, in which intelligibility is likely not a consideration. A series of electroacoustic and behavioural studies were conducted to study sound quality issues in hearing aids, with a focus on music. An objective sound quality metric was validated for real hearing aid fittings, enabling researchers to predict sound quality impacts of signal processing adjustments. Qualitative interviews with hearing aid user musicians revealed that users’ primary concern was understanding the conductor’s speech during rehearsals, with hearing aid music sound quality issues a secondary concern. However, reported sound quality issues were consistent with music-listening sound quality complaints in the literature. Therefore, follow-up experiments focused on sound quality issues. An examination of different manufacturers’ hearing aids revealed significant music sound quality preferences for some devices over others. Electroacoustic measurements on these devices revealed that bass content varied more between devices than levels in other spectral ranges or nonlinearity, and increased bass levels were most associated with improved sound quality ratings. In a sound quality optimization study, listeners increased the bass and reduced the treble relative to typically prescribed gains, for both speech and music. However, adjustments were smaller in magnitude for speech compared to music because they were also associated with a decline in speech intelligibility. These findings encourage the increase of bass and reduction of treble to improve hearing aid music sound quality, but only to the degree that speech intelligibility is not compromised. Future research is needed on the prediction of hearing aid music quality, the provision of low-frequency gain in open-fit hearing aids, genre-specific adjustments, hearing aid compression and music, and direct-to-consumer technology.
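The preferred adjustment direction found in this work, more bass and less treble relative to prescribed gains, can be illustrated with standard shelving filters. The sketch below uses biquad coefficients from the widely used RBJ Audio EQ Cookbook; the corner frequencies and gain values are arbitrary examples, not settings from the study.

```python
# Illustration of a bass boost plus treble cut using RBJ Audio EQ Cookbook
# shelving biquads. Corner frequencies and gains are arbitrary examples.
import numpy as np
from scipy.signal import lfilter

def shelf(x, fs, f0, gain_db, kind):
    A = 10 ** (gain_db / 40)
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / 2 * np.sqrt(2)    # shelf slope S = 1
    cos_w0, sqA = np.cos(w0), np.sqrt(A)
    if kind == "low":                      # low shelf
        b = [A * ((A + 1) - (A - 1) * cos_w0 + 2 * sqA * alpha),
             2 * A * ((A - 1) - (A + 1) * cos_w0),
             A * ((A + 1) - (A - 1) * cos_w0 - 2 * sqA * alpha)]
        a = [(A + 1) + (A - 1) * cos_w0 + 2 * sqA * alpha,
             -2 * ((A - 1) + (A + 1) * cos_w0),
             (A + 1) + (A - 1) * cos_w0 - 2 * sqA * alpha]
    else:                                  # high shelf
        b = [A * ((A + 1) + (A - 1) * cos_w0 + 2 * sqA * alpha),
             -2 * A * ((A - 1) + (A + 1) * cos_w0),
             A * ((A + 1) + (A - 1) * cos_w0 - 2 * sqA * alpha)]
        a = [(A + 1) - (A - 1) * cos_w0 + 2 * sqA * alpha,
             2 * ((A - 1) - (A + 1) * cos_w0),
             (A + 1) - (A - 1) * cos_w0 - 2 * sqA * alpha]
    b, a = np.array(b) / a[0], np.array(a) / a[0]
    return lfilter(b, a, x)

fs = 16000
x = np.random.randn(fs)                             # placeholder 1 s signal
y = shelf(x, fs, f0=250, gain_db=+6, kind="low")    # boost bass
y = shelf(y, fs, f0=4000, gain_db=-4, kind="high")  # cut treble
```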
An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation
Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance.
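A minimal sketch of the feature-level fusion this survey discusses: audio and visual embeddings, assumed to be precomputed at different frame rates, are time-aligned by upsampling the visual stream and concatenated before a recurrent mask estimator. All dimensions, frame rates, and the nearest-neighbour upsampling are illustrative assumptions, not a specific system from the literature.

```python
# Minimal sketch of audio-visual feature fusion for speech enhancement:
# per-frame audio features (e.g., STFT magnitudes) and slower visual
# features (e.g., lip-region embeddings) are time-aligned and concatenated,
# then a recurrent network predicts a spectral mask. All dimensions and the
# upsampling scheme are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVFusionEnhancer(nn.Module):
    def __init__(self, audio_dim=257, visual_dim=512, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(audio_dim + visual_dim, hidden,
                          batch_first=True, bidirectional=True)
        self.mask = nn.Linear(2 * hidden, audio_dim)

    def forward(self, audio, visual):
        # audio: [B, Ta, audio_dim] at ~100 fps; visual: [B, Tv, visual_dim]
        # at ~25 fps. Upsample visual frames to the audio frame rate.
        visual = F.interpolate(visual.transpose(1, 2), size=audio.shape[1],
                               mode="nearest").transpose(1, 2)
        h, _ = self.rnn(torch.cat([audio, visual], dim=-1))
        return torch.sigmoid(self.mask(h)) * audio   # masked magnitudes

mag = torch.rand(2, 400, 257)     # dummy STFT magnitudes (4 s at 100 fps)
lips = torch.randn(2, 100, 512)   # dummy visual embeddings (25 fps)
enhanced = AVFusionEnhancer()(mag, lips)
print(enhanced.shape)             # torch.Size([2, 400, 257])
```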