
    Estimation of room acoustic parameters: the ACE challenge

    Reverberation Time (T60) and Direct-to-Reverberant Ratio (DRR) are important parameters which together can characterize sound captured by microphones in non-anechoic rooms. These parameters are important in speech processing applications such as speech recognition and dereverberation. The values of T60 and DRR can be estimated directly from the Acoustic Impulse Response (AIR) of the room. In practice, the AIR is not normally available, in which case these parameters must be estimated blindly from the observed speech in the microphone signal. The Acoustic Characterization of Environments (ACE) Challenge aimed to determine the state of the art in blind acoustic parameter estimation and to stimulate research in this area. A summary of the ACE Challenge and of the corpus used in it is presented, together with an analysis of the results. Existing algorithms were submitted alongside novel contributions, and the comparative results are presented in this paper. The challenge showed that T60 estimation is a mature field in which analytical approaches dominate, whilst DRR estimation is a less mature field in which machine learning approaches are currently more successful.
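
    Since the abstract notes that both parameters can be computed directly from the AIR, a minimal NumPy sketch of the two textbook definitions may be useful: T60 from the slope of the Schroeder backward-integrated energy decay curve, and DRR as the ratio of the energy in a short window around the direct-path peak to the remaining energy. The fit range (-5 to -25 dB) and the 2.5 ms direct-path window are conventional choices assumed here, not values taken from the ACE paper.

```python
import numpy as np

def schroeder_edc_db(air):
    """Energy decay curve in dB via Schroeder backward integration."""
    energy = np.cumsum(air[::-1] ** 2)[::-1]
    return 10.0 * np.log10(energy / energy[0] + 1e-12)

def estimate_t60(air, fs):
    """T60 from a line fit to the EDC between -5 and -25 dB,
    extrapolated to -60 dB (a 'T20'-style estimate)."""
    edc = schroeder_edc_db(air)
    i_lo = int(np.argmax(edc <= -5.0))   # first sample below -5 dB
    i_hi = int(np.argmax(edc <= -25.0))  # first sample below -25 dB
    t = np.arange(len(air)) / fs
    slope, _ = np.polyfit(t[i_lo:i_hi], edc[i_lo:i_hi], 1)  # dB/s, negative
    return -60.0 / slope

def estimate_drr_db(air, fs, direct_window_ms=2.5):
    """DRR: direct-path energy (window around the main peak) over the rest."""
    peak = int(np.argmax(np.abs(air)))
    half = int(direct_window_ms * 1e-3 * fs)
    direct = np.sum(air[max(0, peak - half):peak + half + 1] ** 2)
    reverberant = np.sum(air ** 2) - direct
    return 10.0 * np.log10(direct / reverberant)
```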

    Gaussian process deconvolution

    Let us consider the deconvolution problem, that is, to recover a latent source $x(\cdot)$ from observations $\mathbf{y} = [y_1, \ldots, y_N]$ of a convolution process $y = x \star h + \eta$, where $\eta$ is additive noise, the observations in $\mathbf{y}$ may have missing parts with respect to $y$, and the filter $h$ may be unknown. We propose a novel strategy to address this task when $x$ is a continuous-time signal: we adopt a Gaussian process (GP) prior on the source $x$, which allows for closed-form Bayesian nonparametric deconvolution. We first analyse the direct model to establish the conditions under which the model is well defined. Then, we turn to the inverse problem, where we study i) some necessary conditions under which Bayesian deconvolution is feasible, and ii) to what extent the filter $h$ can be learnt from data or approximated in the blind deconvolution case. The proposed approach, termed Gaussian process deconvolution (GPDC), is compared to other deconvolution methods conceptually, via illustrative examples, and using real-world datasets.
    Comment: Accepted at Proceedings of the Royal Society.
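
    The closed-form update at the heart of GPDC is easiest to see in a discretised form with a known filter: placing a Gaussian prior $x \sim \mathcal{N}(0, K)$ on the source and writing the observations as $\mathbf{y} = Hx + \eta$ with noise variance $\sigma^2$, the posterior mean of $x$ is $K H^\top (H K H^\top + \sigma^2 I)^{-1} \mathbf{y}$. The sketch below is an illustrative discrete analogue of this idea, not the authors' continuous-time implementation; the RBF kernel and the 'same'-mode convolution matrix are assumptions made for the example.

```python
import numpy as np

def rbf_kernel(t, lengthscale=0.5, variance=1.0):
    """Squared-exponential covariance over the time grid t."""
    d = t[:, None] - t[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_deconvolve(y, h, t, noise_var=1e-2):
    """Posterior mean of x given y = (x * h) + noise under a GP prior on x.

    Builds the 'same'-mode convolution matrix H of the known filter h,
    then applies the conjugate Gaussian formula K H^T (H K H^T + s2 I)^-1 y.
    """
    n = len(t)
    H = np.zeros((n, n))
    center = len(h) // 2
    for i in range(n):
        for k, hk in enumerate(h):
            j = i + center - k
            if 0 <= j < n:
                H[i, j] = hk
    K = rbf_kernel(t)
    S = H @ K @ H.T + noise_var * np.eye(n)
    return K @ H.T @ np.linalg.solve(S, y)

# Toy usage: blur a smooth signal with a short known filter, then recover it.
t = np.linspace(0.0, 5.0, 200)
x_true = np.sin(2.0 * np.pi * 0.5 * t)
h = np.ones(9) / 9.0                       # known moving-average filter
y = np.convolve(x_true, h, mode="same") + 0.05 * np.random.randn(len(t))
x_hat = gp_deconvolve(y, h, t)
```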

    GibbsDDRM: A Partially Collapsed Gibbs Sampler for Solving Blind Inverse Problems with Denoising Diffusion Restoration

    Pre-trained diffusion models have been successfully used as priors in a variety of linear inverse problems, where the goal is to reconstruct a signal from noisy linear measurements. However, existing approaches require knowledge of the linear operator. In this paper, we propose GibbsDDRM, an extension of Denoising Diffusion Restoration Models (DDRM) to a blind setting in which the linear measurement operator is unknown. GibbsDDRM constructs a joint distribution of the data, measurements, and linear operator by using a pre-trained diffusion model for the data prior, and it solves the problem by posterior sampling with an efficient variant of a Gibbs sampler. The proposed method is problem-agnostic, meaning that a pre-trained diffusion model can be applied to various inverse problems without fine-tuning. In experiments, it achieved high performance on both blind image deblurring and vocal dereverberation tasks, despite using simple, generic priors for the underlying linear operators.
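
    The alternating structure of such a sampler can be sketched abstractly: the data x is updated given the current operator estimate (the diffusion-based, partially collapsed step), and the operator parameters phi are then refined given x and the measurements. The skeleton below shows only this control flow under stated assumptions; denoise_step and update_operator are hypothetical placeholders standing in for the paper's DDRM-style posterior sampling and its operator updates, not real APIs.

```python
import numpy as np

def blind_gibbs_restore(y, denoise_step, update_operator, phi_init,
                        n_outer=50, n_phi_steps=5):
    """Schematic partially collapsed Gibbs loop for y ~ H(phi) x + noise.

    denoise_step(x, y, phi):     one reverse-diffusion update of x | y, phi
    update_operator(phi, x, y):  one update of the operator params phi | x, y
    Both callables are placeholders; only the alternation is illustrated.
    """
    phi = phi_init
    x = np.random.randn(*y.shape)          # diffusion chains start from noise
    for _ in range(n_outer):
        x = denoise_step(x, y, phi)        # data step, with phi held fixed
        for _ in range(n_phi_steps):       # several operator updates per x step
            phi = update_operator(phi, x, y)
    return x, phi
```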

    Real-Time Estimation of Reverberation Time for Selection of Suitable Binaural Room Impulse Responses

    The aim of auditory augmented reality is to create a highly immersive and plausible auditory illusion that combines virtual audio objects and scenarios with the real acoustic surroundings. For this use case it is necessary to estimate the acoustics of the current room: a mismatch between real and simulated acoustics is easily detected by the listener and will likely lead to in-head localization or an unrealistic acoustic envelopment of the virtual sound sources. This publication investigates state-of-the-art algorithms for blind reverberation time estimation, which are commonly used in speech enhancement and speech dereverberation, and applies them to binaural ear signals. The outcome of these algorithms can be used, for example, to select the most appropriate room from a room database. Such a database could contain pre-measured or simulated binaural room impulse responses which could be used directly for binaural reproduction. First experiments show promising results at low computational cost. Strategies for enhancing the method are proposed in order to achieve more precise reverberation time estimation.
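
    The selection step described here reduces to a nearest-neighbour lookup once a reverberation time estimate is available. A minimal sketch, assuming a hypothetical database mapping room names to (measured T60, BRIR) pairs; the publication does not specify such an interface:

```python
def select_brir(estimated_t60, brir_database):
    """Return the (room_name, (t60, brir)) entry whose measured T60
    is closest to the blind estimate from the ear signals."""
    return min(brir_database.items(),
               key=lambda item: abs(item[1][0] - estimated_t60))

# Hypothetical usage with two pre-measured rooms:
# room, (t60, brir) = select_brir(0.48, {
#     "office":       (0.45, office_brir),
#     "lecture_hall": (1.10, hall_brir),
# })
```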

    Adaptive time-frequency analysis for cognitive source separation

    Get PDF
    This thesis introduces a framework for separating two speech sources in non-ideal, reverberant environments. The source separation architecture tries to mimic the extraordinary abilities of the human auditory system. A movable human dummy head residing in a normal office room is used to model the conditions humans experience when listening to complex auditory scenes.

    The thesis first investigates how the orthogonality of speech sources in the time-frequency domain drops with increasing reverberation time of the environment, and shows that separation schemes based on ideal binary time-frequency masks are suitable for source separation even under such reverberant, head-related conditions. Prior to separating the sources, the movable dummy head analyzes the auditory scene and estimates the positions of the sources and their fundamental frequency tracks. The source localization is implemented using an iterative approach based on the interaural time differences between the two ears and achieves a localization blur of less than three degrees in the azimuth plane.

    The source separation architecture implemented in this thesis extracts the orthogonal time-frequency points of the speech mixtures. It combines the positive features of the STFT with those of the cochleagram representation. The overall goal of the source separation is to find the ideal STFT mask; the core separation process, however, is based on the analysis of the corresponding region in an additionally computed cochleagram, which yields more reliable Interaural Time Difference (ITD) estimates that are used for separation. Several algorithms based on the ITD and the fundamental frequency of the target source are evaluated for their source separation capabilities. To enhance the separation beyond what the single algorithms achieve, their results are combined into a final estimate. In this way, SIR gains of approximately 30 dB are achieved for two-source scenarios, and SIR gains of up to 16 dB for three-source scenarios. Compared to standard binaural signal processing approaches such as DUET and fixed beamforming, the presented approach achieves up to 29 dB SIR gain.
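
    The ideal binary mask that anchors the thesis's first investigation is straightforward to state: a time-frequency point is assigned to the target whenever the target's magnitude exceeds the interferer's there. A minimal SciPy sketch of this oracle mask follows (it requires the clean sources, so it serves as an upper bound rather than a blind method; the window length is an assumption):

```python
import numpy as np
from scipy.signal import stft, istft

def ideal_binary_mask(target, interferer, fs, nperseg=1024):
    """1 where the target dominates the time-frequency point, else 0."""
    _, _, T = stft(target, fs, nperseg=nperseg)
    _, _, I = stft(interferer, fs, nperseg=nperseg)
    return (np.abs(T) > np.abs(I)).astype(float)

def apply_mask(mixture, mask, fs, nperseg=1024):
    """Mask the mixture's STFT and resynthesize the estimated source."""
    _, _, M = stft(mixture, fs, nperseg=nperseg)
    _, x_hat = istft(M * mask, fs, nperseg=nperseg)
    return x_hat
```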