
    PROJET - Spatial Audio Separation Using Projections

    We propose a projection-based method for the unmixing of multi-channel audio signals into their different constituent spatial objects. Here, spatial objects are modelled using a unified framework which handles both point sources and diffuse sources. We then propose a novel methodology to estimate and take advantage of the spatial dependencies of an object. Where previous research has processed the original multichannel mixtures directly and has been principally focused on the use of inter-channel covariance structures, here we instead process projections of the multichannel signal on many different spatial directions. These linear combinations consist of observations where some spatial objects are cancelled or enhanced. We then propose an algorithm which takes these projections as the observations, discarding dependencies between them. Since each one contains global information regarding all channels of the original multichannel mixture, this provides an effective means of learning the parameters of the original audio, while avoiding the need for joint processing of all the channels. We further show how to recover the separated spatial objects and demonstrate the use of the technique on stereophonic music signals.
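
    The core idea can be illustrated with a short sketch (an assumption-level illustration, not the authors' implementation): a stereo mixture is projected onto several spatial directions, so that a point source panned at a given angle is cancelled in some projections and enhanced in others. The function name, the number of directions, and the choice of angles below are illustrative.

        import numpy as np

        def project_stereo(x, n_directions=6):
            """x: stereo mixture of shape (2, n_samples)."""
            angles = np.linspace(0.0, np.pi / 2, n_directions)
            # Each row is a unit vector orthogonal to one candidate pan direction,
            # so a point source panned at that angle is cancelled in the
            # corresponding projection.
            proj = np.stack([np.array([np.sin(a), -np.cos(a)]) for a in angles])
            return proj @ x  # shape (n_directions, n_samples)

    Each projected signal can then be treated as a separate observation from which the parameters of the spatial objects are learned, which is the property the abstract describes.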

    A Recurrent Encoder-Decoder Approach with Skip-filtering Connections for Monaural Singing Voice Separation

    The objective of deep learning methods based on encoder-decoder architectures for music source separation is to approximate either ideal time-frequency masks or spectral representations of the target music source(s); the spectral representations are then used to derive time-frequency masks. In this work, we introduce a method that directly learns time-frequency masks from an observed mixture magnitude spectrum. We employ recurrent neural networks and train them using prior knowledge only of the magnitude spectrum of the target source. To assess the performance of the proposed method, we focus on the task of singing voice separation. The results of an objective evaluation show that the proposed method provides results comparable to those of deep learning methods that operate on complicated signal representations. Compared to previous methods that approximate time-frequency masks, our method improves the signal-to-distortion ratio by an average of 3.8 dB.
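
    A skip-filtering connection can be sketched as follows (a minimal PyTorch illustration under the assumption that the mask is applied to the input inside the network; layer sizes and names are placeholders, not the paper's exact architecture): the recurrent network predicts a mask and multiplies it with the observed mixture magnitude, so the training loss compares the masked mixture to the target source magnitude directly, rather than to an ideal mask.

        import torch
        import torch.nn as nn

        class SkipFilteringRNN(nn.Module):
            def __init__(self, n_freq=1025, hidden=512):
                super().__init__()
                self.rnn = nn.GRU(n_freq, hidden, batch_first=True, bidirectional=True)
                self.mask = nn.Sequential(nn.Linear(2 * hidden, n_freq), nn.ReLU())

            def forward(self, mix_mag):          # (batch, frames, n_freq)
                h, _ = self.rnn(mix_mag)
                mask = self.mask(h)
                return mask * mix_mag            # skip-filtering: mask applied to the input

    Training would then minimize, for example, an L1 distance between the model output and the target source magnitude, with no explicit mask target required.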

    Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation

    Audio-visual sound separation typically assumes that sound sources are visible in the video, which excludes invisible sounds originating beyond the camera's view; current methods struggle with such sounds because they lack visual cues. This paper introduces a novel "Audio-Visual Scene-Aware Separation" (AVSA-Sep) framework, which includes a semantic parser for visible and invisible sounds and a separator for scene-informed separation. AVSA-Sep successfully separates both sound types, with joint training and cross-modal alignment enhancing its effectiveness. (Accepted at the ICCV 2023 AV4D workshop.)
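
    The two-module structure described above can be sketched roughly as follows; every class, argument, and attribute name here is a placeholder assumption rather than the AVSA-Sep API.

        import torch.nn as nn

        class SceneAwareSeparator(nn.Module):
            """Semantic parsing of the scene followed by scene-informed separation."""
            def __init__(self, parser: nn.Module, separator: nn.Module):
                super().__init__()
                self.parser = parser          # predicts labels/embeddings for visible and invisible sounds
                self.separator = separator    # separates the mixture conditioned on the parsed scene

            def forward(self, mixture_spec, video_frames):
                scene_embedding = self.parser(video_frames)
                return self.separator(mixture_spec, scene_embedding)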

    Complex Neural Networks for Audio

    Audio is represented in two mathematically equivalent ways: the real-valued time domain (i.e., waveform) and the complex-valued frequency domain (i.e., spectrum). There are advantages to the frequency-domain representation; e.g., the human auditory system is known to process sound in the frequency domain. Furthermore, linear time-invariant systems are convolved with sources in the time domain, whereas they may be factorized in the frequency domain. Neural networks have become rather useful when applied to audio tasks such as machine listening and audio synthesis, which are related by their dependence on high-quality acoustic models. They ideally encapsulate fine-scale temporal structure, such as that encoded in the phase of frequency-domain audio, yet there are no authoritative deep learning methods for complex-valued audio. This manuscript is dedicated to addressing this shortcoming. Chapter 2 motivates complex networks by their affinity with complex-domain audio, while Chapter 3 contributes methods for building and optimizing complex networks. We show that the naive implementation of Adam optimization is incorrect for complex random variables and that the selection of input and output representations has a significant impact on the performance of a complex network. Experimental results with novel complex neural architectures are provided in the second half of this manuscript. Chapter 4 introduces a complex model for binaural audio source localization. We show that, like humans, the complex model can generalize to different anatomical filters, which is important in the context of machine listening. The complex model's performance is better than that of the real-valued models, as well as real- and complex-valued baselines. Chapter 5 proposes a two-stage method for speech enhancement. In the first stage, a complex-valued stochastic autoencoder projects complex vectors to a discrete space. In the second stage, long-term temporal dependencies are modeled in the discrete space. The autoencoder raises the performance ceiling for state-of-the-art speech enhancement, but the dynamic enhancement model does not outperform other baselines. We discuss areas for improvement and note that the complex Adam optimizer improves training convergence over the naive implementation.
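
    The remark about Adam and complex random variables can be made concrete with a small sketch: one common way to adapt Adam to complex parameters is to accumulate the second moment from the squared modulus of the complex gradient, so that the real and imaginary parts share a single scaling, rather than being normalized independently as a naive real-valued implementation would do. This illustrates the issue the abstract raises and is not necessarily the exact correction derived in the manuscript.

        import numpy as np

        def complex_adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
            """w, g, m: complex arrays; v: real array tracking the second moment of |g|; t >= 1."""
            m = b1 * m + (1 - b1) * g
            v = b2 * v + (1 - b2) * np.abs(g) ** 2   # shared, modulus-based second moment
            m_hat = m / (1 - b1 ** t)                # bias correction
            v_hat = v / (1 - b2 ** t)
            w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
            return w, m, v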