Joint Far- and Near-End Speech Intelligibility Enhancement based on the Approximated Speech Intelligibility Index
This paper considers speech enhancement of signals picked up in one noisy
environment which must be presented to a listener in another noisy environment.
Recently, it has been shown that an optimal solution to this problem requires
the consideration of the noise sources in both environments jointly. However,
the existing optimal mutual information based method requires a complicated
system model that includes natural speech variations, and relies on
approximations and assumptions of the underlying signal distributions. In this
paper, we propose to use a simpler signal model and optimize speech
intelligibility based on the Approximated Speech Intelligibility Index (ASII).
We derive a closed-form solution to the joint far- and near-end speech
enhancement problem that is independent of the marginal distribution of signal
coefficients, and that achieves similar performance to existing work. In
addition, we do not need to model or optimize for natural speech variations.
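As a rough illustration of the kind of quantity being optimized, the sketch below computes an ASII-style score from per-band signal-to-noise ratios. It assumes the commonly cited form in which each band contributes xi / (xi + 1), weighted by a band-importance value; the abstract itself does not state the formula, and the function name `asii` and the example weights are hypothetical.

```python
import numpy as np

def asii(snr_linear, band_importance):
    """ASII-style score: importance-weighted sum of xi / (xi + 1) over bands.

    snr_linear      -- per-band SNR on a linear (not dB) scale, all >= 0
    band_importance -- per-band importance weights (hypothetical values here)
    """
    snr = np.asarray(snr_linear, dtype=float)
    weights = np.asarray(band_importance, dtype=float)
    weights = weights / weights.sum()   # normalize so the score lies in [0, 1]
    audibility = snr / (snr + 1.0)      # maps SNR in [0, inf) to [0, 1)
    return float(np.sum(weights * audibility))

# Hypothetical example: three equally important bands.
score = asii([0.0, 1.0, 3.0], [1.0, 1.0, 1.0])
```

Because each band term saturates as the SNR grows, adding power to an already clean band yields little gain, which is one intuition for why a closed-form power allocation can exist.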
An evaluation of intrusive instrumental intelligibility metrics
Instrumental intelligibility metrics are commonly used as an alternative to
listening tests. This paper evaluates 12 monaural intrusive intelligibility
metrics, including SII, HEGP, CSII, HASPI, NCM, QSTI, STOI, ESTOI, MIKNN, SIMI,
and SIIB. In addition, this paper investigates the ability of
intelligibility metrics to generalize to new types of distortions and analyzes
why the top performing metrics have high performance. The intelligibility data
were obtained from 11 listening tests described in the literature. The stimuli
included Dutch, Danish, and English speech that was distorted by additive
noise, reverberation, competing talkers, pre-processing enhancement, and
post-processing enhancement. SIIB and HASPI achieved the highest average
correlations with listening test scores. The high performance of SIIB may, in
part, be the result of SIIB's developers having access to all the
intelligibility data
considered in the evaluation. The results show that intelligibility metrics
tend to perform poorly on data sets that were not used during their
development. By modifying the original implementations of SIIB and STOI, the
advantage of reducing statistical dependencies between input features is
demonstrated. Additionally, the paper presents a new version of SIIB, which
has similar performance to SIIB and HASPI but takes less time to compute by
two orders of magnitude.
Comment: Published in IEEE/ACM Transactions on Audio, Speech, and Language
Processing, 201
A Weighted STOI Intelligibility Metric Based On Mutual Information
It is known that the information required for the intelligibility of a speech signal is distributed non-uniformly in time. In this paper, we propose WSTOI, a modified version of the STOI speech intelligibility metric. With WSTOI, the contribution of each time-frequency cell is weighted by an estimate of its intelligibility content. This estimate is equal to the mutual information between two hypothetical signals at either end of a simplified model of human communication. Listening tests show that the modification improves the prediction accuracy of STOI at all performance levels on both long and short utterances. An improvement was observed across all tested noise types and suppression algorithms.
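The weighting idea described above can be sketched as a weighted average over time-frequency cells. This is only an illustration under stated assumptions: the per-cell scores and weights are taken as given arrays, the mutual-information estimate used by WSTOI is not reproduced, and the function name `weighted_score` is hypothetical.

```python
import numpy as np

def weighted_score(cell_scores, cell_weights):
    """Weighted mean of per-cell scores over a (bands x frames) grid.

    cell_scores  -- intermediate intelligibility score per time-frequency cell
    cell_weights -- non-negative weight per cell (assumed given; in WSTOI this
                    would come from a mutual-information estimate)
    """
    scores = np.asarray(cell_scores, dtype=float)
    weights = np.asarray(cell_weights, dtype=float)
    if scores.shape != weights.shape:
        raise ValueError("scores and weights must have the same shape")
    total = weights.sum()
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return float((weights * scores).sum() / total)

# Hypothetical 2-band x 2-frame example.
scores = [[0.9, 0.1], [0.8, 0.2]]
uniform = weighted_score(scores, np.ones((2, 2)))        # plain average
emphasis = weighted_score(scores, [[1.0, 0.0], [1.0, 0.0]])  # first frame only
```

With uniform weights the result reduces to the plain average over cells, so the weighting only changes the outcome when some cells are judged to carry more intelligibility content than others.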
Contributions of local speech encoding and functional connectivity to audio-visual speech perception
Seeing a speaker's face enhances speech intelligibility in adverse environments. We investigated the underlying network mechanisms by quantifying local speech representations and directed connectivity in MEG data obtained while human participants listened to speech of varying acoustic SNR and visual context. During high acoustic SNR, speech encoding by temporally entrained brain activity was strong in temporal and inferior frontal cortex, while during low SNR, strong entrainment emerged in premotor and superior frontal cortex. These changes in local encoding were accompanied by changes in directed connectivity along the ventral stream and the auditory-premotor axis. Importantly, the behavioral benefit arising from seeing the speaker's face was not predicted by changes in local encoding but rather by enhanced functional connectivity between temporal and inferior frontal cortex. Our results demonstrate a role of auditory-frontal interactions in visual speech representations and suggest that functional connectivity along the ventral pathway facilitates speech comprehension in multisensory environments.
DNN-Based Source Enhancement to Increase Objective Sound Quality Assessment Score
We propose a training method for deep neural network (DNN)-based source enhancement to increase objective sound quality assessment (OSQA) scores such as the perceptual evaluation of speech quality (PESQ). In many conventional studies, DNNs have been used as a mapping function to estimate time-frequency masks and trained to minimize an analytically tractable objective function such as the mean squared error (MSE). Since OSQA scores have been used widely for sound quality evaluation, constructing DNNs to increase OSQA scores would be better than using the minimum-MSE criterion to create high-quality output signals. However, since most OSQA scores are not analytically tractable, i.e., they are black boxes, the gradient of the objective function cannot be calculated by simply applying back-propagation. To calculate the gradient of the OSQA-based objective function, we formulated a DNN optimization scheme on the basis of black-box optimization, which is used for training a computer that plays a game. For the black-box optimization scheme, we adopt the policy gradient method, which calculates the gradient on the basis of a sampling algorithm. To simulate output signals using the sampling algorithm, DNNs are used to estimate the probability density function of the output signals that maximize OSQA scores. The OSQA scores are calculated from the simulated output signals, and the DNNs are trained to increase the probability of generating the simulated output signals that achieve high OSQA scores. Through several experiments, we found that OSQA scores significantly increased by applying the proposed method, even though the MSE was not minimized.
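The policy gradient idea described above can be illustrated with a toy score-function (REINFORCE-style) estimator. This is a minimal sketch, not the paper's method: a free mean vector `mu` stands in for the DNN output, a simple quadratic function stands in for a black-box score such as PESQ, and all names and hyperparameters are hypothetical.

```python
import numpy as np

def black_box_score(x, target):
    # Stand-in for a score we can evaluate but not differentiate (hypothetical).
    return -np.sum((x - target) ** 2, axis=-1)

def train(target, steps=300, batch=64, sigma=0.1, lr=0.05, seed=0):
    """Maximize the expected score of samples drawn from N(mu, sigma^2 I)."""
    rng = np.random.default_rng(seed)
    mu = np.zeros_like(target)
    for _ in range(steps):
        # Sample candidate outputs from the current Gaussian policy.
        samples = mu + sigma * rng.standard_normal((batch, target.size))
        scores = black_box_score(samples, target)
        # Subtract the batch mean as a baseline to reduce estimator variance.
        advantage = scores - scores.mean()
        # The gradient of log N(x; mu, sigma^2 I) w.r.t. mu is (x - mu) / sigma^2.
        grad = (advantage[:, None] * (samples - mu)).mean(axis=0) / sigma**2
        mu = mu + lr * grad  # ascend the estimated gradient
    return mu

target = np.array([0.2, 0.8, 0.5, 0.1])
mu = train(target)
```

Only score evaluations are used in the update, never the derivative of `black_box_score`, which is the property that makes this family of estimators applicable to non-differentiable quality metrics.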