Search CORE

266 research outputs found

Binaural Speech Enhancement Using STOI-Optimal Masks

Author: Brookes Mike
Naylor Patrick A.
Tokala Vikas
Publication venue
Publication date: 30/09/2022
Field of study

STOI-optimal masking has been previously proposed and developed for single-channel speech enhancement. In this paper, we consider the extension to the task of binaural speech enhancement in which spatial information is known to be important to speech understanding and therefore should be preserved by the enhancement processing. Masks are estimated for each of the binaural channels individually and a `better-ear listening' mask is computed by choosing the maximum of the two masks. The estimated mask is used to supply probability information about the speech presence in each time-frequency bin to an Optimally-modified Log Spectral Amplitude (OM-LSA) enhancer. We show that using the proposed method for binaural signals with a directional noise not only improves the SNR of the noisy signal but also preserves the binaural cues and intelligibility.Comment: Accepted at IWAENC 202

arXiv.org e-Print Archive

Measuring audio-visual speech intelligibility under dynamic listening conditions using virtual reality

Author: Brookes DM
Green T
Moore AH
Naylor PA
Publication venue: Audio Engineering Society (AES)
Publication date: 22/07/2022
Field of study

The ELOSPHERES project is a collaboration between researchers at Imperial College London and University College London which aims to improve the efficacy of hearing aids. The benefit obtained from hearing aids varies significantly between listeners and listening environments. The noisy, reverberant environments which most people find challenging bear little resemblance to the clinics in which consultations occur. In order to make progress in speech enhancement, algorithms need to be evaluated under realistic listening conditions. A key aim of ELOSPHERES is to create a virtual reality-based test environment in which alternative speech enhancement algorithms can be evaluated using a listener-in-the-loop paradigm. In this paper we present the sap-elospheres-audiovisual-test (SEAT) platform and report the results of an initial experiment in which it was used to measure the benefit of visual cues in a speech intelligibility in spatial noise task

UCL Discovery

Spiral - Imperial College Digital Repository

Speech Separation Using Partially Asynchronous Microphone Arrays Without Resampling

Author: Corey Ryan M.
Singer Andrew C.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 31/07/2018
Field of study

We consider the problem of separating speech sources captured by multiple spatially separated devices, each of which has multiple microphones and samples its signals at a slightly different rate. Most asynchronous array processing methods rely on sample rate offset estimation and resampling, but these offsets can be difficult to estimate if the sources or microphones are moving. We propose a source separation method that does not require offset estimation or signal resampling. Instead, we divide the distributed array into several synchronous subarrays. All arrays are used jointly to estimate the time-varying signal statistics, and those statistics are used to design separate time-varying spatial filters in each array. We demonstrate the method for speech mixtures recorded on both stationary and moving microphone arrays.Comment: To appear at the International Workshop on Acoustic Signal Enhancement (IWAENC 2018

arXiv.org e-Print Archive

Crossref

Engineering data compendium. Human perception and performance. User's guide

Author: Boff Kenneth R.
Lincoln Janet E.
Publication venue
Publication date
Field of study

The concept underlying the Engineering Data Compendium was the product of a research and development program (Integrated Perceptual Information for Designers project) aimed at facilitating the application of basic research findings in human performance to the design and military crew systems. The principal objective was to develop a workable strategy for: (1) identifying and distilling information of potential value to system design from the existing research literature, and (2) presenting this technical information in a way that would aid its accessibility, interpretability, and applicability by systems designers. The present four volumes of the Engineering Data Compendium represent the first implementation of this strategy. This is the first volume, the User's Guide, containing a description of the program and instructions for its use

NASA Technical Reports Server

Informed algorithms for sound source separation in enclosed reverberant environments

Author: Muhammad Salman Khan (7202543)
Publication venue
Publication date: 01/01/2013
Field of study

While humans can separate a sound of interest amidst a cacophony of contending sounds in an echoic environment, machine-based methods lag behind in solving this task. This thesis thus aims at improving performance of audio separation algorithms when they are informed i.e. have access to source location information. These locations are assumed to be known a priori in this work, for example by video processing. Initially, a multi-microphone array based method combined with binary time-frequency masking is proposed. A robust least squares frequency invariant data independent beamformer designed with the location information is utilized to estimate the sources. To further enhance the estimated sources, binary time-frequency masking based post-processing is used but cepstral domain smoothing is required to mitigate musical noise. To tackle the under-determined case and further improve separation performance at higher reverberation times, a two-microphone based method which is inspired by human auditory processing and generates soft time-frequency masks is described. In this approach interaural level difference, interaural phase difference and mixing vectors are probabilistically modeled in the time-frequency domain and the model parameters are learned through the expectation-maximization (EM) algorithm. A direction vector is estimated for each source, using the location information, which is used as the mean parameter of the mixing vector model. Soft time-frequency masks are used to reconstruct the sources. A spatial covariance model is then integrated into the probabilistic model framework that encodes the spatial characteristics of the enclosure and further improves the separation performance in challenging scenarios i.e. when sources are in close proximity and when the level of reverberation is high. Finally, new dereverberation based pre-processing is proposed based on the cascade of three dereverberation stages where each enhances the twomicrophone reverberant mixture. The dereverberation stages are based on amplitude spectral subtraction, where the late reverberation is estimated and suppressed. The combination of such dereverberation based pre-processing and use of soft mask separation yields the best separation performance. All methods are evaluated with real and synthetic mixtures formed for example from speech signals from the TIMIT database and measured room impulse responses

Loughborough University Institutional Repository

An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation

Author: Jensen Jesper
Michelsanti Daniel
Tan Zheng-Hua
Xu Yong
Yu Dong
Yu Meng
Zhang Shi-Xiong
Publication venue
Publication date: 01/01/2021
Field of study

Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance

arXiv.org e-Print Archive

VBN

Audio source separation into the wild

Author: Aichner
Anguera Miro
Araki
Araki
Arberet
Arberet
Arberet
Attias
Avargel
Avargel
Badeau
Benaroya
Benesty
Bertrand
Bertrand
Bishop
Bustamante
Cardoso
Cemgil
Chazan
Chazan
Cherkassky
Cook
Cox
Crochiere
Dempster
DiBiase
Dillon
Doclo
Doclo
Drude
Duong
Duong
Dvorkind
Evers
Evers
Fallon
Feng
Févotte
Févotte
Gannot
Gannot
Gannot
Gilloire
Girgis
Girin
Habets
Hadad
Hershey
Higuchi
Higuchi
Higuchi
Hild
Hori
Ikram
Kamkar-Parsi
Kleijn
Kounades-Bastian
Kounades-Bastian
Kounades-Bastian
Kounades-Bastian
Kounades-Bastian
Koutras
Kowalski
Kuttruff
Laufer
Lee
Leglaive
Leglaive
Leglaive
Li
Li
Li
Li
Liutkus
Loesch
Loizou
Luo
Lyon
Löllmann
Ma
Malik
Mandel
Markovich
Markovich-Golan
Markovich-Golan
Markovich-Golan
Markovich-Golan
Marquardt
Mitianoudis
Mukai
Nakadai
Nakadai
Narayanan
Nesta
Nugraha
O'Connor
O'Grady
Ozerov
Ozerov
Ozerov
Parra
Parra
Parsons
Pedersen
Pertilä
Plumbley
Prieto
Roman
Roman
Sawada
Sawada
Schmid
Schmidt
Schwartz
Schwartz
Schwartz
Simon
Smaragdis
Sturmel
Talmon
Talmon
Thiergart
Thiergart
Valin
Van Trees
Vijayasenan
Vincent
Vincent
Wang
Wang
Wang
Wang
Warsitz
Wehr
Weinstein
Widrow
Winter
Yilmaz
Yoshioka
Zeng
Zhang
Publication venue: 'Elsevier BV'
Publication date: 16/11/2018
Field of study

International audienceThis review chapter is dedicated to multichannel audio source separation in real-life environment. We explore some of the major achievements in the field and discuss some of the remaining challenges. We will explore several important practical scenarios, e.g. moving sources and/or microphones, varying number of sources and sensors, high reverberation levels, spatially diffuse sources, and synchronization problems. Several applications such as smart assistants, cellular phones, hearing aids and robots, will be discussed. Our perspectives on the future of the field will be given as concluding remarks of this chapter

Crossref

Hal - Université Grenoble Alpes

INRIA a CCSD electronic archive server

Hal-Diderot

Effects of hearing-aid dynamic range compression on spatial perception in a reverberant environment

Author: Alan Wiinberg
Allen J. B.
Hartmann W. M.
Henrik Gert Hassager
IEC
IEC
Torsten Dau
Publication venue: 'Acoustical Society of America (ASA)'
Publication date: 01/01/2017
Field of study

Crossref

Online Research Database In Technology

A compact noise covariance matrix model for MVDR beamforming

Author: A. Naylor P
Brookes M
H. Moore A
Hafezi S
R. Vos R
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 22/05/2022
Field of study

Acoustic beamforming is routinely used to improve the SNR of the received signal in applications such as hearing aids, robot audition, augmented reality, teleconferencing, source localisation and source tracking. The beamformer can be made adaptive by using an estimate of the time-varying noise covariance matrix in the spectral domain to determine an optimised beam pattern in each frequency bin that is specific to the acoustic environment and that can respond to temporal changes in it. However, robust estimation of the noise covariance matrix remains a challenging task especially in non-stationary acoustic environments. This paper presents a compact model of the signal covariance matrix that is defined by a small number of parameters whose values can be reliably estimated. The model leads to a robust estimate of the noise covariance matrix which can, in turn, be used to construct a beamformer. The performance of beamformers designed using this approach is evaluated for a spherical microphone array under a range of conditions using both simulated and measured room impulse responses. The proposed approach demonstrates consistent gains in intelligibility and perceptual quality metrics compared to the static and adaptive beamformers used as baselines

Spiral - Imperial College Digital Repository

Physiology, Psychoacoustics and Cognition in Normal and Impaired Hearing

Author
Publication venue: Springer International Publishing
Publication date: 01/01/2016
Field of study

Dissertations of the University of Groningen