
    Video-aided model-based source separation in real reverberant rooms

    Source separation algorithms that utilize only audio data can perform poorly when multiple sources or reverberation are present. In this paper we therefore propose a video-aided model-based source separation algorithm for two-channel reverberant recordings in which the sources are assumed static. By exploiting cues from video, we first localize the individual speech sources in the enclosure and then estimate their directions. The interaural spatial cues, the interaural phase difference and the interaural level difference, as well as the mixing vectors, are probabilistically modeled. The models make use of the source direction information and are evaluated at discrete time-frequency points, and the model parameters are refined with the well-known expectation-maximization (EM) algorithm. The algorithm outputs time-frequency masks that are used to reconstruct the individual sources. Simulation results show that by utilizing the visual modality the proposed algorithm produces better time-frequency masks and thereby improved source estimates. We provide experimental results for the proposed algorithm in different scenarios, with comparisons against other audio-only and audio-visual algorithms, and achieve improved performance on both synthetic and real data. We also include dereverberation-based pre-processing in our algorithm to suppress the late reverberant components of the observed stereo mixture and further enhance the overall output. This makes our algorithm a suitable candidate for under-determined, highly reverberant settings where the performance of other audio-only and audio-visual methods is limited.
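As a rough illustration of the cues this model operates on, the sketch below computes interaural phase and level differences from a synthetic two-channel "STFT" and converts per-source likelihoods into soft time-frequency masks. The Gaussian IPD model, the two expected IPD values, and all dimensions are assumptions for the example; in the paper the expected values would come from the video-derived source directions and the model parameters would be refined by EM.

```python
import numpy as np

rng = np.random.default_rng(0)
F, T = 64, 100                                   # frequency bins, time frames
L = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
R = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))

ratio = L / (R + 1e-12)
ipd = np.angle(ratio)                            # interaural phase difference, radians
ild = 20 * np.log10(np.abs(ratio) + 1e-12)       # interaural level difference, dB

# Hypothetical expected IPDs for two sources (in the paper these derive from
# the video-estimated directions); only the IPD cue is used in this sketch.
mu = np.array([-0.5, 0.8])
sigma = 0.7
lik = np.exp(-0.5 * ((ipd[None] - mu[:, None, None]) / sigma) ** 2)

masks = lik / lik.sum(axis=0, keepdims=True)     # soft masks, summing to 1 per T-F point
src0 = masks[0] * L                              # masked reconstruction of source 0 (left channel)
```

A full implementation would iterate between mask estimation and parameter updates (the E and M steps) rather than use fixed Gaussian parameters as here.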

    A new cascaded spectral subtraction approach for binaural speech dereverberation and its application in source separation

    In this work we propose a new binaural spectral subtraction method for the suppression of late reverberation. The proposed approach is a cascade of three stages. The first two stages exploit distinct observations to model and suppress the late reverberation by deriving a gain function. The musical noise artifacts generated by the processing at each stage are compensated by smoothing the spectral magnitudes of the weighting gains. The third stage linearly combines the gains obtained from the first two stages and further enhances the binaural signals. The binaural gains, obtained by independently processing the left- and right-channel signals, are combined using a new method. Experiments on real data are performed in two contexts: dereverberation only, and joint dereverberation and source separation. Objective results verify the suitability of the proposed cascaded approach in both contexts.
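A minimal sketch of the spectral-subtraction idea, under loud assumptions: a crude late-reverberation estimate (a scaled, delayed copy of the observed power, standing in for a proper statistical model), a gain floor, frequency smoothing against musical noise, and a plain average as the binaural gain combination. All parameters are illustrative, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(1)
F, T = 128, 200
P_l = rng.gamma(2.0, 1.0, (F, T))                # |STFT|^2 of left channel (synthetic)
P_r = rng.gamma(2.0, 1.0, (F, T))                # |STFT|^2 of right channel (synthetic)

def late_reverb_gain(P, delay=8, alpha=0.4, floor=0.1):
    # Crude late-reverberation estimate: a scaled, delayed copy of the
    # observed power (a stand-in for a statistical reverberation model).
    P_late = np.zeros_like(P)
    P_late[:, delay:] = alpha * P[:, :-delay]
    G = np.maximum(1.0 - P_late / (P + 1e-12), floor)
    # Smooth gains along frequency to limit musical-noise artifacts.
    kernel = np.ones(5) / 5
    return np.apply_along_axis(lambda g: np.convolve(g, kernel, "same"), 0, G)

G_l, G_r = late_reverb_gain(P_l), late_reverb_gain(P_r)
G_bin = 0.5 * (G_l + G_r)                        # simple binaural gain combination
```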

    On enhancing model-based expectation maximization source separation in dynamic reverberant conditions using automatic Clifton effect

    Source separation algorithms based on spatial cues generally face two major problems: their performance degrades in reverberant environments, and they cannot differentiate closely located sources because of the similarity of their spatial cues. The latter problem is amplified in highly reverberant environments, as reverberation distorts the spatial cues. In this paper we propose a separation algorithm in which the distortions that reverberation introduces into a spatial-cue-based source separation algorithm, namely model-based expectation-maximization source separation and localization (MESSL), are minimized by using the precedence effect. The precedence effect acts as a gatekeeper that restricts the reverberation entering the separation system, improving its separation performance, and it is automatically transformed into the Clifton effect to deal with dynamic acoustic conditions. Our proposed algorithm outperforms MESSL in all kinds of reverberant conditions, including with closely located sources. On average, a 22.55% improvement in SDR (signal-to-distortion ratio) and a 15% improvement in PESQ (perceptual evaluation of speech quality) is observed when the Clifton effect is used to tackle dynamic reverberant conditions. This project is funded by the Higher Education Commission (HEC), Pakistan, under project no. 6330/KPK/NRPU/R&D/HEC/2016. Gul, S.; Khan, MS.; Shah, SW.; Lloret, J. (2020). On enhancing model-based expectation maximization source separation in dynamic reverberant conditions using automatic Clifton effect. International Journal of Communication Systems. 33(3):1-18. https://doi.org/10.1002/dac.4210
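The gatekeeper role of the precedence effect can be caricatured as an onset test: keep only time-frequency points whose energy clearly exceeds a running average of the recent past (likely direct sound) and discard the rest (likely reverberation-dominated). The recursion constant and threshold below are assumptions for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
F, T = 64, 120
P = rng.gamma(2.0, 1.0, (F, T))                  # observed T-F power (synthetic)

# Running average of past energy per frequency bin (recursion constant assumed).
avg = np.copy(P)
lam = 0.9
for t in range(1, T):
    avg[:, t] = lam * avg[:, t - 1] + (1 - lam) * P[:, t]

# Onset test: keep points whose energy clearly exceeds the recent past
# (likely direct sound); discard the rest (likely reverberation).
gate = P > 2.0 * np.roll(avg, 1, axis=1)
gate[:, 0] = True                                # no past to compare against in the first frame
frac_kept = gate.mean()
```

The gated points would then be the only ones fed to the MESSL-style localization and separation stage.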

    Robust indoor speaker recognition in a network of audio and video sensors

    Situational awareness is achieved naturally by the human senses of sight and hearing in combination. Automatic scene understanding aims to replicate this human ability using microphones and cameras in cooperation. In this paper, audio and video signals are fused and integrated at different levels of semantic abstraction. We detect and track a speaker who is relatively unconstrained, i.e., free to move indoors within an area larger than in comparable reported work, which is usually limited to round-table meetings. The system is relatively simple, consisting of just four microphone pairs and a single camera. Results show that the overall multimodal tracker is more reliable than single-modality systems, tolerating large occlusions and cross-talk. System evaluation is performed on both single- and multi-modality tracking. The performance improvement given by the audio–video integration and fusion is quantified in terms of tracking precision and accuracy, as well as speaker diarisation error rate and precision–recall (recognition). Improvements over the closest works are: 56% in sound source localisation computational cost over an audio-only system, 8% in speaker diarisation error rate over an audio-only speaker recognition unit, and 36% on the precision–recall metric over an audio–video dominant speaker recognition method.

    Informed algorithms for sound source separation in enclosed reverberant environments

    While humans can separate a sound of interest amidst a cacophony of contending sounds in an echoic environment, machine-based methods lag behind in solving this task. This thesis thus aims at improving the performance of audio separation algorithms when they are informed, i.e., have access to source location information. These locations are assumed to be known a priori in this work, for example from video processing. Initially, a multi-microphone array-based method combined with binary time-frequency masking is proposed. A robust least-squares frequency-invariant data-independent beamformer designed with the location information is utilized to estimate the sources. To further enhance the estimated sources, binary time-frequency masking-based post-processing is used, with cepstral-domain smoothing to mitigate the resulting musical noise. To tackle the under-determined case and further improve separation performance at higher reverberation times, a two-microphone method, inspired by human auditory processing and generating soft time-frequency masks, is described. In this approach the interaural level difference, the interaural phase difference, and the mixing vectors are probabilistically modeled in the time-frequency domain, and the model parameters are learned through the expectation-maximization (EM) algorithm. A direction vector is estimated for each source, using the location information, and used as the mean parameter of the mixing vector model. Soft time-frequency masks are used to reconstruct the sources. A spatial covariance model is then integrated into the probabilistic framework; it encodes the spatial characteristics of the enclosure and further improves the separation performance in challenging scenarios, i.e., when sources are in close proximity and when the level of reverberation is high.
Finally, new dereverberation-based pre-processing is proposed, based on a cascade of three dereverberation stages, each of which enhances the two-microphone reverberant mixture. The dereverberation stages are based on amplitude spectral subtraction, where the late reverberation is estimated and suppressed. The combination of such dereverberation-based pre-processing and soft-mask separation yields the best separation performance. All methods are evaluated with real and synthetic mixtures formed, for example, from speech signals from the TIMIT database and measured room impulse responses.
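The cepstral-domain smoothing used above to mitigate musical noise can be sketched as low-pass filtering the log of a hard binary mask in the cepstral (DCT) domain, which softens the abrupt time-frequency transitions that cause the artifacts. The cutoff quefrency and log-floor constant below are assumed for illustration.

```python
import numpy as np
from scipy.fft import dct, idct

rng = np.random.default_rng(5)
F, T = 64, 50
mask = (rng.random((F, T)) > 0.5).astype(float)  # hard binary T-F mask

eps = 1e-3                                       # floor so the log is defined (assumed)
cep = dct(np.log(mask + eps), axis=0, norm="ortho")
cep[20:, :] = 0.0                                # discard high quefrencies (cutoff assumed)
smoothed = np.clip(np.exp(idct(cep, axis=0, norm="ortho")), 0.0, 1.0)
```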

    An unsupervised acoustic fall detection system using source separation for sound interference suppression

    We present a novel unsupervised fall detection system that employs the collected acoustic signals (footstep sounds) from an elderly person's normal activities to construct a data description model that distinguishes falls from non-falls. The measured acoustic signals are initially processed with a source separation (SS) technique to remove possible interference from other background sound sources. Mel-frequency cepstral coefficient (MFCC) features are next extracted from the processed signals and used to construct a data description model based on a one-class support vector machine (OCSVM) method, which is finally applied to distinguish fall from non-fall sounds. Experiments on a recorded dataset confirm that our proposed fall detection system achieves better performance than existing single-microphone methods, especially under high levels of interference from other sound sources.
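The one-class classification step can be sketched with scikit-learn's OneClassSVM: train on features from normal activity only, then flag outliers as candidate falls. Random vectors stand in for the MFCC features of the separated audio, and the SVM parameters are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(3)
# Stand-ins for 13-dimensional MFCC vectors: "normal" footstep-like sounds
# cluster near the origin, fall-like events lie far away (hypothetical data).
normal = rng.normal(0.0, 1.0, (200, 13))
fall_like = rng.normal(6.0, 1.0, (10, 13))

# Train on normal activity only; nu bounds the fraction of training
# outliers (value assumed for illustration).
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(normal)
pred = ocsvm.predict(fall_like)                  # -1 = outlier (candidate fall), +1 = normal
```

Because the model never sees fall examples, no labeled fall data are needed, which matches the unsupervised setting described in the abstract.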

    Exploiting CNNs for Improving Acoustic Source Localization in Noisy and Reverberant Conditions

    This paper discusses the application of convolutional neural networks (CNNs) to minimum variance distortionless response localization schemes. We investigate direction-of-arrival estimation problems in noisy and reverberant conditions using a uniform linear array (ULA). CNNs are used to process the multichannel data from the ULA and to improve the data fusion scheme performed in the steered response power computation. The CNNs improve the incoherent frequency fusion of the narrowband response power by weighting the components, reducing the deleterious effects of those components affected by artifacts due to noise and reverberation. Using CNNs avoids having to previously encode the multichannel data into selected acoustic cues, with the advantage of exploiting their ability to recognize geometrical pattern similarity. Experiments with both simulated and real acoustic data demonstrate the superior localization performance of the proposed SRP beamformer with respect to other state-of-the-art techniques.
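A minimal steered-response-power example, with uniform per-frequency weights standing in for the CNN-predicted ones: two channels related by a known pure delay, and a weighted incoherent frequency fusion that peaks at that delay. The signal model and all parameters are assumptions for illustration.

```python
import numpy as np

fs, n = 8000, 1024
true_delay = 3                                   # inter-channel delay, samples
rng = np.random.default_rng(4)
x = rng.standard_normal(n)
x1, x2 = x, np.roll(x, true_delay)               # channel 2 is a pure (circular) delay of channel 1

X1, X2 = np.fft.rfft(x1), np.fft.rfft(x2)
cross = X1 * np.conj(X2)                         # cross-power spectrum
freqs = np.fft.rfftfreq(n, d=1 / fs)
w = np.ones_like(freqs)                          # per-frequency weights (CNN-predicted in the paper)

# Weighted incoherent fusion of the narrowband response power over a grid
# of candidate delays; the SRP peaks at the true inter-channel delay.
delays = np.arange(-10, 11)
srp = np.array([
    np.sum(w * np.real(cross * np.exp(-2j * np.pi * freqs * d / fs)))
    for d in delays
])
est = delays[np.argmax(srp)]                     # estimated delay
```

In the paper, the weights `w` down-weight frequency components corrupted by noise and reverberation instead of being uniform as here.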