Search CORE

Joint Far- and Near-End Speech Intelligibility Enhancement based on the Approximated Speech Intelligibility Index

Author: Bertelsen Lars Søndergaard
Fuglsig Andreas Jonas
Jensen Jesper
Mariager Peter
Tan Zheng-Hua
Østergaard Jan
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 15/11/2021

This paper considers speech enhancement of signals picked up in one noisy environment that must be presented to a listener in another noisy environment. Recently, it has been shown that an optimal solution to this problem requires jointly considering the noise sources in both environments. However, the existing optimal method, based on mutual information, requires a complicated system model that includes natural speech variations, and it relies on approximations and assumptions about the underlying signal distributions. In this paper, we propose to use a simpler signal model and to optimize speech intelligibility based on the Approximated Speech Intelligibility Index (ASII). We derive a closed-form solution to the joint far- and near-end speech enhancement problem. This solution is independent of the marginal distribution of the signal coefficients and performs similarly to existing work. In addition, we do not need to model or optimize for natural speech variations.
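
A minimal Python sketch can make the ASII objective concrete. It assumes the ASII is a band-importance-weighted sum of ξ_j/(1+ξ_j), where ξ_j is the band SNR and the weights γ_j sum to one. This smooth form appears in earlier near-end enhancement work, but the paper's exact definition should be checked. The `allocate_power` helper is hypothetical and not from the paper. It solves only the simpler near-end problem: redistribute a fixed speech power over bands to maximize the ASII against known near-end noise. Because each term is concave, the KKT conditions give p_j = max(0, sqrt(γ_j n_j / λ) - n_j), and λ is found by bisection. The joint far- and near-end solution derived in the paper is not reproduced here.

```python
import numpy as np

def asii(gamma, speech_power, noise_power):
    """Assumed ASII form: sum_j gamma_j * xi_j / (1 + xi_j), xi_j = band SNR.
    gamma holds band-importance weights that sum to one."""
    gamma = np.asarray(gamma, dtype=float)
    xi = np.asarray(speech_power, dtype=float) / np.asarray(noise_power, dtype=float)
    return float(np.sum(gamma * xi / (1.0 + xi)))

def allocate_power(gamma, noise_power, total_power, iters=200):
    """Hypothetical near-end-only helper: maximize asii() over band speech
    powers p_j >= 0 with sum(p) == total_power and fixed noise powers n_j > 0.
    Each term is concave in p_j, so the KKT conditions give
    p_j = max(0, sqrt(gamma_j * n_j / lam) - n_j) for a multiplier lam > 0."""
    gamma = np.asarray(gamma, dtype=float)
    n = np.asarray(noise_power, dtype=float)

    def powers(lam):
        # Total allocated power decreases monotonically as lam grows.
        return np.maximum(0.0, np.sqrt(gamma * n / lam) - n)

    lo = 1e-12                     # assumed small enough to exceed any realistic budget
    hi = float(np.max(gamma / n))  # at this lam every band gets zero power
    for _ in range(iters):
        mid = np.sqrt(lo * hi)     # geometric bisection, since lam spans many decades
        if powers(mid).sum() > total_power:
            lo = mid               # over budget: raise lam
        else:
            hi = mid               # within budget: lower lam
    return powers(hi)
```

Consider `gamma = [0.2, 0.3, 0.5]`, `noise = [1.0, 2.0, 0.5]` and a budget of 3.0. The sketch assigns roughly [0.69, 0.92, 1.39], giving an ASII of about 0.544 versus about 0.533 for a uniform split. Bands with γ_j/n_j ≤ λ receive no power, which is the water-filling-like behaviour of this kind of closed form.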

arXiv.org e-Print Archive

VBN

An evaluation of intrusive instrumental intelligibility metrics

Author: Hendriks Richard C.
Kleijn W. Bastiaan
Van Kuyk Steven
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2018

Instrumental intelligibility metrics are commonly used as an alternative to listening tests. This paper evaluates 12 monaural intrusive intelligibility metrics: SII, HEGP, CSII, HASPI, NCM, QSTI, STOI, ESTOI, MIKNN, SIMI, SIIB, and $\text{sEPSM}^\text{corr}$. In addition, this paper investigates how well intelligibility metrics generalize to new types of distortions and analyzes why the top-performing metrics perform well. The intelligibility data were obtained from 11 listening tests described in the literature. The stimuli included Dutch, Danish, and English speech distorted by additive noise, reverberation, competing talkers, pre-processing enhancement, and post-processing enhancement. SIIB and HASPI performed best, achieving average correlations with listening-test scores of $\rho = 0.92$ and $\rho = 0.89$, respectively. The high performance of SIIB may in part result from SIIB's developers having had access to all the intelligibility data considered in the evaluation. The results show that intelligibility metrics tend to perform poorly on data sets that were not used during their development. By modifying the original implementations of SIIB and STOI, the paper demonstrates the advantage of reducing statistical dependencies between input features. Additionally, the paper presents a new version of SIIB, called $\text{SIIB}^\text{Gauss}$, which performs similarly to SIIB and HASPI but is two orders of magnitude faster to compute.

Comment: Published in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 201
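
The headline numbers above are correlations between a metric's predictions and measured listening-test scores. The sketch below shows that bookkeeping in minimal form. It assumes a plain Pearson correlation per data set followed by an unweighted mean across data sets. The paper's exact protocol, for example any mapping applied before correlating, is not stated in this abstract.

```python
import numpy as np

def pearson_rho(metric_scores, listener_scores):
    """Pearson correlation between a metric's predictions and listening-test scores."""
    x = np.asarray(metric_scores, dtype=float)
    y = np.asarray(listener_scores, dtype=float)
    return float(np.corrcoef(x, y)[0, 1])

def mean_rho(datasets):
    """Unweighted mean of per-data-set correlations (an assumed aggregation).
    datasets: list of (metric_scores, listener_scores) pairs."""
    rhos = [pearson_rho(metric, obs) for metric, obs in datasets]
    return float(np.mean(rhos)), rhos
```

Pearson's ρ measures only linear agreement. A metric that orders conditions correctly but nonlinearly can therefore still score below 1, which is why rank correlations are often reported alongside it.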

arXiv.org e-Print Archive

A Weighted STOI Intelligibility Metric Based On Mutual Information

Author: Brookes D
Lightburn L
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 21/12/2015

It is known that the information required for the intelligibility of a speech signal is distributed non-uniformly in time. In this paper we propose WSTOI, a modified version of the short-time objective intelligibility (STOI) metric. With WSTOI, the contribution of each time-frequency cell is weighted by an estimate of its intelligibility content. This estimate equals the mutual information between two hypothetical signals at either end of a simplified model of human communication. Listening tests show that the modification improves the prediction accuracy of STOI at all performance levels, for both long and short utterances. An improvement was observed across all tested noise types and suppression algorithms.
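
In STOI, the per-cell intermediate scores are correlations between short-time envelope segments of the clean and processed signals, and the final score is their plain average. WSTOI replaces that average with a weighted one. The minimal Python sketch below covers only this averaging step. It assumes envelope matrices of shape (bands, frames) are already computed and uses N = 30-frame segments as in STOI. It omits STOI's normalization and clipping of the processed envelope. The weight array `w` is a hypothetical input standing in for the paper's mutual-information estimates, which are not reproduced here.

```python
import numpy as np

def segment_correlations(X, Y, N=30, eps=1e-12):
    """Per-band correlation between clean (X) and processed (Y) envelopes over
    sliding N-frame segments. X, Y: arrays of shape (bands, frames)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    J, M = X.shape
    d = np.empty((J, M - N + 1))
    for m in range(M - N + 1):
        xc = X[:, m:m + N] - X[:, m:m + N].mean(axis=1, keepdims=True)  # remove segment mean
        yc = Y[:, m:m + N] - Y[:, m:m + N].mean(axis=1, keepdims=True)
        num = np.sum(xc * yc, axis=1)
        den = np.linalg.norm(xc, axis=1) * np.linalg.norm(yc, axis=1) + eps
        d[:, m] = num / den
    return d

def weighted_intelligibility(d, w=None):
    """Plain mean (STOI-style) if w is None, otherwise a weighted mean with
    per-cell weights w (hypothetical stand-in for WSTOI's MI estimates)."""
    if w is None:
        return float(np.mean(d))
    w = np.asarray(w, dtype=float)
    return float(np.sum(w * d) / np.sum(w))
```

Uniform weights reproduce the unweighted average. Concentrating weight on a few cells lets those cells dominate the score, which is the effect the weighting is meant to have for information-rich regions.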

Spiral - Imperial College Digital Repository

Contributions of local speech encoding and functional connectivity to audio-visual speech perception

Author: Abrams
Alexandrou
Alho
Arnal
Ashburner
Auksztulewicz
Beauchamp
Belitski
Bernstein
Besle
Besserve
Binder
Bornkessel-Schlesewsky
Bourguignon
Brainard
Callan
Canolty
Chandrasekaran
Chennu
Chu
Clos
Crosse
Ding
Du
Evans
Ferstl
Fonteneau
Freedman
Ghazanfar
Giraud
Gow
Grant
Greenberg
Gross
Guediche
Hasson
Heim
Hickok
Hipp
Horowitz-Kraus
Ince
Kandylaki
Kayser
Keitel
Krieger-Redwood
Kriegeskorte
Lakatos
Lee
Maris
Massey
McGettigan
Meister
Mesgarani
Morillon
Morís Fernández
Nath
Ng
Ohshiro
Oostenveld
Osnes
Panzeri
Park
Peelle
Pickering
Poeppel
Pola
Pouget
Price
Rauschecker
Riedel
Ross
Schepers
Schneidman
Schreiber
Schroeder
Schwartz
Skipper
Sohoglu
Sumby
Tavano
Thorne
van Atteveldt
van Wassenhove
Vetter
Vicente
Wibral
Wild
Wilson
Winkler
Wright
Yarkoni
Zion Golumbic
Publication venue: eLife Sciences Publications
Publication date: 01/01/2017

Seeing a speaker’s face enhances speech intelligibility in adverse environments. We investigated the underlying network mechanisms by quantifying local speech representations and directed connectivity in MEG data. The data were obtained while human participants listened to speech of varying acoustic SNR and visual context. At high acoustic SNR, speech encoding by temporally entrained brain activity was strong in temporal and inferior frontal cortex, whereas at low SNR, strong entrainment emerged in premotor and superior frontal cortex. These changes in local encoding were accompanied by changes in directed connectivity along the ventral stream and the auditory-premotor axis. Importantly, the behavioral benefit of seeing the speaker’s face was predicted not by changes in local encoding but by enhanced functional connectivity between temporal and inferior frontal cortex. Our results demonstrate a role for auditory-frontal interactions in visual speech representations. They also suggest that functional connectivity along the ventral pathway facilitates speech comprehension in multisensory environments.

Crossref

HAL AMU

ZENODO

Dryad Digital Repository (Duke University)

Publications at Bielefeld University

Electronic Archiving System

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Enlighten

FigShare

DNN-Based Source Enhancement to Increase Objective Sound Quality Assessment Score

Author: Kazunori Kobayashi
Kenta Niwa
Yoichi Haneda
Yuma Koizumi
Yusuke Hioka
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/10/2018

We propose a training method for deep neural network (DNN)-based source enhancement that increases objective sound quality assessment (OSQA) scores such as the perceptual evaluation of speech quality (PESQ). In many conventional studies, DNNs have been used as a mapping function to estimate time-frequency masks. These DNNs were trained to minimize an analytically tractable objective function such as the mean squared error (MSE). Since OSQA scores are widely used for sound-quality evaluation, training DNNs to increase OSQA scores directly should be a better way to obtain high-quality output signals than minimizing the MSE. However, most OSQA scores are not analytically tractable, i.e., they are black boxes. The gradient of the objective function therefore cannot be calculated simply by applying back-propagation. To calculate the gradient of the OSQA-based objective function, we formulated a DNN optimization scheme based on black-box optimization, of the kind used to train computers to play games. Within this scheme, we adopt the policy gradient method, which calculates the gradient by means of a sampling algorithm. To simulate output signals with the sampling algorithm, DNNs are used to estimate the probability density function of the output signals that maximize OSQA scores. The OSQA scores are calculated from the simulated output signals. The DNNs are then trained to increase the probability of generating simulated output signals that achieve high OSQA scores. Through several experiments, we found that applying the proposed method significantly increased OSQA scores, even though the MSE was not minimized.
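
The policy-gradient idea can be illustrated without a DNN. In the minimal Python sketch below, the "policy" is a Gaussian with a learnable mean over a three-bin mask. `black_box_score` is a hypothetical stand-in for an OSQA score such as PESQ: the optimizer only evaluates it and never differentiates it. The update is the basic REINFORCE estimator with a mean-score baseline. This is a swapped-in simplification, not the authors' exact training scheme.

```python
import numpy as np

def black_box_score(mask):
    # Hypothetical stand-in for an OSQA score such as PESQ: it can be
    # evaluated but not differentiated. It peaks at a fixed "ideal" mask.
    target = np.array([0.2, 0.8, 0.5])
    return -float(np.sum((mask - target) ** 2))

def reinforce(score_fn, dim, steps=300, batch=64, sigma=0.1, lr=0.05, seed=0):
    """REINFORCE with a Gaussian policy N(mu, sigma^2 I) over mask vectors."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(dim)                       # learnable policy mean
    for _ in range(steps):
        eps = rng.normal(size=(batch, dim))
        samples = mu + sigma * eps           # sampled candidate outputs
        scores = np.array([score_fn(s) for s in samples])
        adv = scores - scores.mean()         # mean baseline reduces variance
        # grad of log N(s; mu, sigma^2 I) w.r.t. mu is (s - mu) / sigma^2 = eps / sigma
        grad = (adv[:, None] * eps / sigma).mean(axis=0)
        mu += lr * grad                      # ascend the expected score
    return mu
```

The score enters only through sampled evaluations. Replacing `black_box_score` with any other callable therefore leaves the update rule unchanged. The price is gradient variance, which the baseline and the batch size help control.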

arXiv.org e-Print Archive

Creative Repository of Electro-Communications