
    Unifying Robustness and Fidelity: A Comprehensive Study of Pretrained Generative Methods for Speech Enhancement in Adverse Conditions

    Enhancing speech signal quality in adverse acoustic environments is a persistent challenge in speech processing. Existing deep-learning-based enhancement methods often struggle to effectively remove background noise and reverberation in real-world scenarios, hampering the listening experience. To address these challenges, we propose a novel approach that uses pre-trained generative methods to resynthesize clean, anechoic speech from degraded inputs. This study leverages pre-trained vocoder or codec models to synthesize high-quality speech while enhancing robustness in challenging scenarios. Generative methods effectively handle information loss in speech signals, resulting in regenerated speech with improved fidelity and reduced artifacts. By harnessing the capabilities of pre-trained models, we achieve faithful reproduction of the original speech in adverse conditions. Experimental evaluations on both simulated datasets and realistic samples demonstrate the effectiveness and robustness of our proposed methods. In particular, the codec-based approach achieves superior subjective scores for both simulated and realistic recordings, and the generated speech exhibits improved audio quality with reduced background noise and reverberation. Our findings highlight the potential of pre-trained generative techniques in speech processing, particularly in scenarios where traditional methods falter. Demos are available at https://whmrtm.github.io/SoundResynthesis.
    Comment: Paper in submission
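
    To make the resynthesis idea concrete, here is a minimal sketch with toy PyTorch modules standing in for the actual models (the abstract does not describe an interface, so every class and name below is illustrative): a frozen decoder plays the role of the pretrained vocoder/codec, and an enhancement encoder maps the degraded waveform to an estimate of the clean latent representation, which the decoder then resynthesizes.

import torch
import torch.nn as nn

# Toy stand-ins, NOT the authors' models: a frozen "pretrained" decoder that
# maps a compact latent sequence back to a waveform, and an enhancement
# encoder trained to predict clean latents from degraded input.
class ToyCodecDecoder(nn.Module):
    def __init__(self, latent_dim=64, hop=256):
        super().__init__()
        self.net = nn.ConvTranspose1d(latent_dim, 1, kernel_size=hop * 2,
                                      stride=hop, padding=hop // 2)

    def forward(self, latents):          # (B, latent_dim, frames) -> (B, 1, samples)
        return torch.tanh(self.net(latents))

class ToyEnhancementEncoder(nn.Module):
    def __init__(self, latent_dim=64, hop=256):
        super().__init__()
        self.net = nn.Conv1d(1, latent_dim, kernel_size=hop * 2,
                             stride=hop, padding=hop // 2)

    def forward(self, degraded):         # (B, 1, samples) -> (B, latent_dim, frames)
        return self.net(degraded)

decoder = ToyCodecDecoder().eval()       # would be the frozen pretrained codec/vocoder
encoder = ToyEnhancementEncoder()        # would be trained on (degraded, clean) pairs

degraded = torch.randn(1, 1, 24_000)     # 1 s of "noisy, reverberant" audio at 24 kHz
with torch.no_grad():
    clean_latents = encoder(degraded)    # estimate the clean representation
    enhanced = decoder(clean_latents)    # resynthesize speech from it
print(enhanced.shape)

    The design point the abstract emphasizes is that the decoder is pretrained and kept fixed, so the regenerated speech inherits the generative model's fidelity even when the input is heavily degraded.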

    Flexible binaural resynthesis of room impulse responses for augmented reality research

    A basic building block in audio for Augmented Reality (AR) is the use of virtual sound sources layered on top of any real sources present in an environment. In order to perceive these virtual sources as belonging to the natural scene, it is important to match their acoustic parameters to those of a real source with the same characteristics, i.e. radiation properties, sound propagation, and head-related impulse response (HRIR). However, it is still unclear to what extent these parameters need to be matched in order to generate plausible scenes in which virtual sound sources blend seamlessly with real sound sources. This contribution presents an auralization framework that allows prototyping of augmented reality scenarios from measured multichannel room impulse responses, to get a better understanding of the relevance of individual acoustic parameters.

    A well-established approach for binaural measurement and reproduction of sound scenes is based on capturing binaural room impulse responses (BRIRs) using a head and torso simulator (HATS) and convolving these BRIRs dynamically with audio content according to the listener's head orientation. However, such measurements are laborious and time consuming, requiring measuring the scene with the HATS in multiple orientations. Additionally, the HATS HRIR is inherently encoded in the BRIRs, making them unsuitable for personalization to different listeners.

    The approach presented here consists of the resynthesis and dynamic binaural reproduction of multichannel room impulse responses (RIRs) using an arbitrary HRIR dataset. Using a compact microphone array, we obtained a pressure RIR and a set of auxiliary RIRs, and we applied the Spatial Decomposition Method (SDM) to estimate the direction of arrival (DOA) of the different sound events in the RIR. The DOA information was used to map sound pressure to different locations by means of an HRIR dataset, generating a binaural room impulse response (BRIR) for a specific orientation. By either rotating the DOA or the HRIR dataset, BRIRs for any direction may be obtained.

    Auralizations using SDM are known to whiten the spectrum of late reverberation. Available alternatives such as time-frequency equalization were not feasible in this case, as a different time-frequency filter would be necessary for each direction, resulting in a non-homogeneous equalization of the BRIRs. Instead, the resynthesized BRIRs were decomposed into sub-bands and the decay slope of each sub-band was modified independently to match the reverberation time of the original pressure RIR (sketched below, after the listening-test summary). In this way we could apply the same reverberation correction factor to all BRIRs. In addition, we used a direction-independent equalization to correct for timbral effects of equipment, HRIR, and signal processing.

    Real-time reproduction was achieved by means of a custom Max/MSP patch, in which the direct sound, early reflections, and late reverberation were convolved separately to allow real-time changes in the time-energy properties of the BRIRs. The mixing time of the reproduced BRIRs is configurable, and a single direction-independent reverberation tail is used.

    To evaluate the quality of the resynthesis method in a real room, we conducted both objective and perceptual comparisons for a variety of source positions. The objective analysis was performed by comparing real measurements of a KEMAR mannequin with the resynthesis at the same receiver location using a simulated KEMAR HRIR.
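
    A minimal sketch of the SDM-based mapping stage described above, under simplifying assumptions: the per-sample DOA estimates (azimuth/elevation, from the SDM analysis of the auxiliary RIRs) and an HRIR dataset indexed by measurement direction are taken as given, and the nearest-direction lookup ignores angular wrap-around. The function names and the placeholder data are illustrative, not part of the published framework.

import numpy as np

def nearest_hrir(hrir_dirs, hrir_set, doa):
    """Return the HRIR pair measured closest to the requested direction."""
    # Simplification: Euclidean distance over (azimuth, elevation) in degrees.
    idx = np.argmin(np.linalg.norm(hrir_dirs - doa, axis=1))
    return hrir_set[idx]                                    # shape (2, L)

def sdm_to_brir(pressure_rir, doa_per_sample, hrir_dirs, hrir_set, head_azimuth_deg=0.0):
    """Accumulate a two-channel BRIR by placing each pressure sample at its estimated DOA."""
    L = hrir_set.shape[-1]
    brir = np.zeros((2, len(pressure_rir) + L - 1))
    for n, (p, doa) in enumerate(zip(pressure_rir, doa_per_sample)):
        # Rotating the DOAs (or, equivalently, the HRIR grid) yields BRIRs for
        # any head orientation, enabling dynamic binaural reproduction.
        rotated = doa - np.array([head_azimuth_deg, 0.0])
        brir[:, n:n + L] += p * nearest_hrir(hrir_dirs, hrir_set, rotated)
    return brir

# Tiny usage example with random placeholder data.
rng = np.random.default_rng(0)
hrir_dirs = np.stack(np.meshgrid(np.arange(-180, 180, 30),
                                 np.arange(-40, 50, 30)), -1).reshape(-1, 2)
hrir_set = rng.standard_normal((len(hrir_dirs), 2, 128))
rir = rng.standard_normal(4800) * np.exp(-np.arange(4800) / 800.0)
doas = rng.uniform([-180.0, -40.0], [180.0, 40.0], size=(4800, 2))
brir = sdm_to_brir(rir, doas, hrir_dirs, hrir_set, head_azimuth_deg=30.0)
print(brir.shape)                                           # (2, 4927)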
    Typical room acoustic parameters of both the real and the resynthesized acoustics were found to be in good agreement. The perceptual validation consisted of a comparison of a loudspeaker and its resynthesized counterpart. Non-occluding headphones with individual equalization were used to ensure that listeners were able to simultaneously listen to the real and the virtual samples. Subjects were allowed to listen to the sounds for as long as they desired and to freely switch between the real and virtual stimuli in real time. The integration of an OptiTrack motion tracking system allowed us to present world-locked audio, accounting for head rotations.

    We present here the results of this listening test (N = 14) in three sections: discrimination, identification, and qualitative ratings. Preliminary analysis revealed that, in these conditions, listeners were generally able to discriminate between real and virtual sources and were able to consistently identify which of the presented sources was real and which was virtual. The qualitative analysis revealed that timbral differences are the most prominent cues for discrimination and identification, while spatial cues are well preserved. All the listeners reported good externalization of the binaural audio.

    Future work includes extending the presented validation to more environments, as well as implementing tools to arbitrarily modify BRIRs in the spatial, temporal, and frequency domains in order to study the perceptual requirements of room acoustics reproduction in AR.
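
    Returning to the sub-band decay correction mentioned above: a minimal sketch, assuming the per-band reverberation times of the resynthesized BRIR (t60_synth) and of the original pressure RIR (t60_target) have already been estimated (e.g. from Schroeder backward integration, not shown), and using a Butterworth band-pass bank as a stand-in for the authors' sub-band decomposition.

import numpy as np
from scipy.signal import butter, sosfilt

def correct_decay(brir, fs, band_edges, t60_synth, t60_target):
    """Re-shape each sub-band's decay slope so its T60 matches the original pressure RIR."""
    corrected = np.zeros_like(brir)
    t = np.arange(brir.shape[-1]) / fs
    for (lo, hi), ts, tt in zip(band_edges, t60_synth, t60_target):
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        band = sosfilt(sos, brir, axis=-1)
        # Amplitude envelope ~ 10^(-3 t / T60), so this gain converts a decay
        # with T60 = ts into one with T60 = tt.
        gain = 10.0 ** (-3.0 * t * (1.0 / tt - 1.0 / ts))
        corrected += band * gain
    return corrected

    Because the gain depends only on time and frequency band, the same correction can be applied to every BRIR and to both ears, which is the direction-independence property the abstract highlights.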

    Differentiable Modelling of Percussive Audio with Transient and Spectral Synthesis

    Differentiable digital signal processing (DDSP) techniques, including methods for audio synthesis, have gained attention in recent years and lend themselves to interpretability in the parameter space. However, current differentiable synthesis methods have not explicitly sought to model the transient portion of signals, which is important for percussive sounds. In this work, we present a unified synthesis framework aiming to address transient generation and percussive synthesis within a DDSP framework. To this end, we propose a model for percussive synthesis that builds on sinusoidal modeling synthesis and incorporates a modulated temporal convolutional network for transient generation. We use a modified sinusoidal peak-picking algorithm to generate time-varying non-harmonic sinusoids and pair it with differentiable noise and transient encoders that are jointly trained to reconstruct drumset sounds. Reconstruction metrics computed on a large dataset of acoustic and electronic percussion samples show that our method leads to improved onset signal reconstruction for membranophone percussion instruments.
    Comment: To be published in The Proceedings of Forum Acusticum, Sep 2023, Turin, Italy
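
    As a rough illustration of the spectral branch of such a model, the sketch below renders time-varying non-harmonic sinusoids plus a broadband noise component in a fully differentiable way; in practice the per-frame frequencies, amplitudes, and noise gain would be predicted by the trained encoders, and the transient branch (the modulated temporal convolutional network) is omitted entirely. All names here are illustrative, not taken from the paper.

import math
import torch

def sinusoidal_plus_noise(freqs_hz, amps, noise_gain, sr=44_100, hop=256):
    """freqs_hz, amps: (batch, n_partials, n_frames); noise_gain: (batch, n_frames)."""
    # Upsample frame-rate controls to audio rate (simple hold; smoother
    # interpolation is typically used in DDSP models).
    freqs = torch.repeat_interleave(freqs_hz, hop, dim=-1)
    amps = torch.repeat_interleave(amps, hop, dim=-1)
    gain = torch.repeat_interleave(noise_gain, hop, dim=-1)
    # Integrate instantaneous frequency to obtain phase; everything stays differentiable.
    phase = 2.0 * math.pi * torch.cumsum(freqs / sr, dim=-1)
    sinusoids = (amps * torch.sin(phase)).sum(dim=1)        # sum over non-harmonic partials
    noise = gain * (2.0 * torch.rand_like(gain) - 1.0)      # broadband noise component
    return sinusoids + noise

# Example: two inharmonic partials over 100 frames; gradients flow to the amplitudes.
f = torch.tensor([[[180.0] * 100, [311.0] * 100]])          # (1, 2, 100)
a = torch.full((1, 2, 100), 0.3, requires_grad=True)
g = torch.full((1, 100), 0.05)
y = sinusoidal_plus_noise(f, a, g)
print(y.shape)                                              # torch.Size([1, 25600])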
