17 research outputs found
Oracle estimators for the benchmarking of source separation algorithms
International audienceSource separation is a difficult problem for which many algorithms have been proposed. In this article, we define oracle estimators which compute the best performance achievable by different classes of algorithms on a given mixture, in a theoretical evaluation framework where the reference sources are available. We describe explicit oracle estimators for three particular classes of algorithms: multichannel time-invariant filtering, single-channel time-frequency masking and multichannel time-frequency masking. We evaluate their performance on various audio mixtures and study their robustness. We draw several conclusions for their three typical applications, namely providing performance bounds for existing and future blind algorithms, selecting the best class of algorithms for a given mixture and assessing the separation difficulty. In particular, we show that it is worth developing blind time-frequency masking algorithms relaxing the common assumption of a single active source per time-frequency point
Real Time Blind Source Separation in Reverberant Environments
An online convolutive blind source separation solution has been developed for use in reverberant environments with stationary sources. Results are presented for simulation and real world data. The system achieves a separation SINR of 16.8 dB when operating on a two source mixture, with a total acoustic delay was 270 ms. This is on par with, and in many respects outperforms various published algorithms [1],[2]. A number of instantaneous blind source separation algorithms have been developed, including a block wise and recursive ICA algorithm, and a clustering based algorithm, able to obtain up to 110 dB SIR performance. The system has been realised in both Matlab and C, and is modular, allowing for easy update of the ICA algorithm that is the core of the unmixing process
Performance measurement in blind audio source separation
International audienceIn this article, we discuss the evaluation of Blind Audio Source Separation (BASS) algorithms. Depending on the exact application, different distortions can be allowed between an estimated source and the wanted true source. We consider four different sets of such allowed distortions, from time-invariant gains to time-varying filters. In each case we decompose the estimated source into a true source part plus error terms corresponding to interferences, additive noise and algorithmic artifacts. Then we derive a global performance measure using an energy ratio, plus a separate performance measure for each error term. These measures are computed and discussed on the results of several BASS problems with various difficulty levels
Recommended from our members
End-to-end Speech Separation with Neural Networks
Speech separation has long been an active research topic in the signal processing community with its importance in a wide range of applications such as hearable devices and telecommunication systems. It not only serves as a fundamental problem for all higher-level speech processing tasks such as automatic speech recognition, natural language understanding, and smart personal assistants, but also plays an important role in smart earphones and augmented and virtual reality devices.
With the recent progress in deep neural networks, the separation performance has been significantly advanced by various new problem definitions and model architectures. The most widely-used approach in the past years performs separation in time-frequency domain, where a spectrogram or a time-frequency representation is first calculated from the mixture signal and multiple time-frequency masks are then estimated for the target sources. The masks are applied on the mixture's time-frequency representation to extract the target representations, and then operations such as inverse short-time Fourier transform is utilized to convert them back to waveforms. However, such frequency-domain methods may have difficulties in modeling the phase spectrogram as the conventional time-frequency masks often only consider the magnitude spectrogram. Moreover, the training objectives for the frequency-domain methods are typically also in frequency-domain, which may not be inline with widely-used time-domain evaluation metrics such as signal-to-noise ratio and signal-to-distortion ratio.
The problem formulation of time-domain, end-to-end speech separation naturally arises to tackle the disadvantages in the frequency-domain systems. The end-to-end speech separation networks take the mixture waveform as input and directly estimate the waveforms of the target sources. Following the general pipeline of conventional frequency-domain systems which contains a waveform encoder, a separator, and a waveform decoder, time-domain systems can be design in a similar way while significantly improves the separation performance.
In this dissertation, I focus on multiple aspects in the general problem formulation of end-to-end separation networks including the system designs, model architectures, and training objectives. I start with a single-channel pipeline, which we refer to as the time-domain audio separation network (TasNet), to validate the advantage of end-to-end separation comparing with the conventional time-frequency domain pipelines. I then move to the multi-channel scenario and introduce the filter-and-sum network (FaSNet) for both fixed-geometry and ad-hoc geometry microphone arrays.
Next I introduce methods for lightweight network architecture design that allows the models to maintain the separation performance while using only as small as 2.5% model size and 17.6% model complexity. After that, I look into the training objective functions for end-to-end speech separation and describe two training objectives for separating varying numbers of sources and improving the robustness under reverberant environments, respectively. Finally I take a step back and revisit several problem formulations in end-to-end separation pipeline and raise more questions in this framework to be further analyzed and investigated in future works
Blind Source Separation for the Processing of Contact-Less Biosignals
(Spatio-temporale) Blind Source Separation (BSS) eignet sich für die Verarbeitung von Multikanal-Messungen im Bereich der kontaktlosen Biosignalerfassung. Ziel der BSS ist dabei die Trennung von (z.B. kardialen) Nutzsignalen und Störsignalen typisch für die kontaktlosen Messtechniken. Das Potential der BSS kann praktisch nur ausgeschöpft werden, wenn (1) ein geeignetes BSS-Modell verwendet wird, welches der Komplexität der Multikanal-Messung gerecht wird und (2) die unbestimmte Permutation unter den BSS-Ausgangssignalen gelöst wird, d.h. das Nutzsignal praktisch automatisiert identifiziert werden kann. Die vorliegende Arbeit entwirft ein Framework, mit dessen Hilfe die Effizienz von BSS-Algorithmen im Kontext des kamera-basierten Photoplethysmogramms bewertet werden kann. Empfehlungen zur Auswahl bestimmter Algorithmen im Zusammenhang mit spezifischen Signal-Charakteristiken werden abgeleitet. Außerdem werden im Rahmen der Arbeit Konzepte für die automatisierte Kanalauswahl nach BSS im Bereich der kontaktlosen Messung des Elektrokardiogramms entwickelt und bewertet. Neuartige Algorithmen basierend auf Sparse Coding erwiesen sich dabei als besonders effizient im Vergleich zu Standard-Methoden.(Spatio-temporal) Blind Source Separation (BSS) provides a large potential to process distorted multichannel biosignal measurements in the context of novel contact-less recording techniques for separating distortions from the cardiac signal of interest. This potential can only be practically utilized (1) if a BSS model is applied that matches the complexity of the measurement, i.e. the signal mixture and (2) if permutation indeterminacy is solved among the BSS output components, i.e the component of interest can be practically selected. The present work, first, designs a framework to assess the efficacy of BSS algorithms in the context of the camera-based photoplethysmogram (cbPPG) and characterizes multiple BSS algorithms, accordingly. Algorithm selection recommendations for certain mixture characteristics are derived. Second, the present work develops and evaluates concepts to solve permutation indeterminacy for BSS outputs of contact-less electrocardiogram (ECG) recordings. The novel approach based on sparse coding is shown to outperform the existing concepts of higher order moments and frequency-domain features
Recommended from our members
Automatic Speech Separation for Brain-Controlled Hearing Technologies
Speech perception in crowded acoustic environments is particularly challenging for hearing impaired listeners. While assistive hearing devices can suppress background noises distinct from speech, they struggle to lower interfering speakers without knowing the speaker on which the listener is focusing. The human brain has a remarkable ability to pick out individual voices in a noisy environment like a crowded restaurant or a busy city street. This inspires the brain-controlled hearing technologies. A brain-controlled hearing aid acts as an intelligent filter, reading wearers’ brainwaves and enhancing the voice they want to focus on.
Two essential elements form the core of brain-controlled hearing aids: automatic speech separation (SS), which isolates individual speakers from mixed audio in an acoustic scene, and auditory attention decoding (AAD) in which the brainwaves of listeners are compared with separated speakers to determine the attended one, which can then be amplified to facilitate hearing. This dissertation focuses on speech separation and its integration with AAD, aiming to propel the evolution of brain-controlled hearing technologies. The goal is to help users to engage in conversations with people around them seamlessly and efficiently.
This dissertation is structured into two parts. The first part focuses on automatic speech separation models, beginning with the introduction of a real-time monaural speech separation model, followed by more advanced real-time binaural speech separation models. The binaural models use both spectral and spatial features to separate speakers and are more robust to noise and reverberation. Beyond performing speech separation, the binaural models preserve the interaural cues of separated sound sources, which is a significant step towards immersive augmented hearing. Additionally, the first part explores using speaker identifications to improve the performance and robustness of models in long-form speech separation. This part also delves into unsupervised learning methods for multi-channel speech separation, aiming to improve the models' ability to generalize to real-world audio.
The second part of the dissertation integrates speech separation introduced in the first part with auditory attention decoding (SS-AAD) to develop brain-controlled augmented hearing systems. It is demonstrated that auditory attention decoding with automatically separated speakers is as accurate and fast as using clean speech sounds. Furthermore, to better align the experimental environment of SS-AAD systems with real-life scenarios, the second part introduces a new AAD task that closely simulates real-world complex acoustic settings. The results show that the SS-AAD system is capable of improving speech intelligibility and facilitating tracking of the attended speaker in realistic acoustic environments. Finally, this part presents employing self-supervised learned speech representation in the SS-AAD systems to enhance the neural decoding of attentional selection