
    Structured Sparsity Models for Multiparty Speech Recovery from Reverberant Recordings

    We tackle the multi-party speech recovery problem by modeling the acoustics of the reverberant chambers. Our approach exploits structured sparsity models to perform room modeling and speech recovery. We propose a scheme for characterizing the room acoustics from the unknown competing speech sources. It relies on localizing the early images of the speakers by sparse approximation of the spatial spectra of the virtual sources in a free-space model. The images are then clustered by exploiting the low-rank structure of the spectro-temporal components belonging to each source. This enables us to identify the early support of the room impulse response function and its unique map to the room geometry. To further tackle the ambiguity of the reflection ratios, we propose a novel formulation of the reverberation model and estimate the absorption coefficients through a convex optimization that exploits a joint sparsity model built on the spatio-spectral sparsity of the concurrent speech representation. The acoustic parameters are then used to separate the individual speech signals through either structured sparse recovery or inverse filtering of the acoustic channels. Experiments conducted on real data recordings demonstrate the effectiveness of the proposed approach for multi-party speech recovery and recognition. Comment: 31 page

    Separating Reflection and Transmission Images in the Wild

    The reflections caused by common semi-reflectors, such as glass windows, can impact the performance of computer vision algorithms. State-of-the-art methods can remove reflections on synthetic data and in controlled scenarios. However, they are based on strong assumptions and do not generalize well to real-world images. Contrary to a common misconception, real-world images are challenging even when polarization information is used. We present a deep learning approach to separate the reflected and the transmitted components of the recorded irradiance, which explicitly uses the polarization properties of light. To train it, we introduce an accurate synthetic data generation pipeline, which simulates realistic reflections, including those generated by curved and non-ideal surfaces, non-static scenes, and high-dynamic-range scenes. Comment: accepted at ECCV 201

    Multichannel Speech Separation and Enhancement Using the Convolutive Transfer Function

    This paper addresses the problem of speech separation and enhancement from multichannel convolutive and noisy mixtures, assuming known mixing filters. We propose to perform the speech separation and enhancement task in the short-time Fourier transform domain, using the convolutive transfer function (CTF) approximation. Compared to time-domain filters, the CTF has far fewer taps; consequently it has fewer near-common zeros among channels and lower computational complexity. The work proposes three speech-source recovery methods, namely: i) the multichannel inverse filtering method, i.e. the multiple input/output inverse theorem (MINT), is exploited in the CTF domain, and for the multi-source case, ii) a beamforming-like multichannel inverse filtering method applying single source MINT and using power minimization, which is suitable whenever the source CTFs are not all known, and iii) a constrained Lasso method, where the sources are recovered by minimizing the $\ell_1$-norm to impose their spectral sparsity, with the constraint that the $\ell_2$-norm fitting cost, between the microphone signals and the mixing model involving the unknown source signals, is less than a tolerance. The noise can be reduced by setting the tolerance according to the noise power. Experiments under various acoustic conditions are carried out to evaluate the three proposed methods. Comparisons among them, as well as with the baseline methods, are presented. Comment: Submitted to IEEE/ACM Transactions on Audio, Speech and Language Processin
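As a rough illustration of the CTF approximation and of method iii), the model and the constrained Lasso problem can be written as below. The notation is an assumption made for this sketch and does not come from the abstract: $x_i(p,k)$ is the STFT coefficient of microphone $i$ at frame $p$ and frequency $k$, $s_j(p,k)$ is that of source $j$, $a_{ij}(p',k)$ are the CTF taps, $\mathbf{H}_k$ is the matrix that stacks the frame-wise convolutions at frequency $k$, and $\epsilon$ is the tolerance.

```latex
% CTF approximation: per-frequency convolution along the frame index p
% (notation assumed for illustration)
\[
  x_i(p,k) \approx \sum_{j} \sum_{p'} a_{ij}(p',k)\, s_j(p-p',k)
\]

% Constrained Lasso at frequency k, with x_k and s_k the stacked
% microphone and source coefficients
\[
  \hat{\mathbf{s}}_k = \arg\min_{\mathbf{s}_k} \|\mathbf{s}_k\|_1
  \quad \text{subject to} \quad
  \|\mathbf{x}_k - \mathbf{H}_k \mathbf{s}_k\|_2 \le \epsilon
\]
```

Raising $\epsilon$ lets more of the residual be attributed to noise, which is how the tolerance trades fitting accuracy against noise reduction.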

    Acoustic Echo and Noise Cancellation System for Hand-Free Telecommunication using Variable Step Size Algorithms

    In this paper, an acoustic echo cancellation system with double-talk detection is implemented in Matlab for a hands-free telecommunication system. An adaptive noise canceller with blind source separation (ANC-BSS) is proposed to remove both the background noise and the far-end speaker's echo signal in the presence of double-talk. In the absence of double-talk, the far-end speaker's echo signal is cancelled by an adaptive echo canceller. Both the adaptive noise canceller and the adaptive echo canceller are implemented using the LMS, NLMS, VSLMS and VSNLMS algorithms. The normalized cross-correlation method is used for double-talk detection. VSNLMS outperforms the other algorithms both with and without double-talk. In the absence of double-talk, it yields higher ERLE and lower misalignment. In the presence of double-talk, it improves the SNR of the near-end speaker signal.
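A minimal sketch of an adaptive echo canceller may help make the terms above concrete. It uses plain NLMS with a fixed step size, not the variable-step-size variants or the double-talk detector described in the abstract, and it is written in Python rather than Matlab. The function names, the tap count and the step size `mu` are all hypothetical choices for illustration.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, num_taps=32, mu=0.5, eps=1e-8):
    """Fixed-step NLMS sketch (assumed parameters, not the paper's setup).

    far_end: reference signal played by the loudspeaker
    mic:     microphone signal containing the echo of far_end
    Returns the echo-cancelled error signal and the final filter weights.
    """
    w = np.zeros(num_taps)      # estimate of the echo path
    buf = np.zeros(num_taps)    # most recent far-end samples, newest first
    err = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)   # shift the delay line by one sample
        buf[0] = far_end[n]
        echo_hat = w @ buf      # predicted echo
        e = mic[n] - echo_hat   # residual after cancellation
        # Normalized update: step size scaled by the input energy
        w += mu * e * buf / (buf @ buf + eps)
        err[n] = e
    return err, w

def erle_db(mic, err, eps=1e-20):
    """Echo return loss enhancement in dB: mic power over residual power."""
    return 10.0 * np.log10((np.sum(mic**2) + eps) / (np.sum(err**2) + eps))
```

With a noise-free synthetic echo (the far-end signal convolved with a short impulse response), the weights should approach that impulse response and the ERLE measured over the tail of the signal should be high. Real recordings with near-end speech would need the double-talk handling that the abstract describes.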

    Video-aided model-based source separation in real reverberant rooms

    Source separation algorithms that utilize only audio data can perform poorly if multiple sources or reverberation are present. In this paper we therefore propose a video-aided model-based source separation algorithm for a two-channel reverberant recording in which the sources are assumed static. By exploiting cues from video, we first localize individual speech sources in the enclosure and then estimate their directions. The interaural spatial cues, the interaural phase difference and the interaural level difference, as well as the mixing vectors are probabilistically modeled. The models make use of the source direction information and are evaluated at discrete time-frequency points. The model parameters are refined with the well-known expectation-maximization (EM) algorithm. The algorithm outputs time-frequency masks that are used to reconstruct the individual sources. Simulation results show that by utilizing the visual modality the proposed algorithm can produce better time-frequency masks, thereby giving improved source estimates. We provide experimental results that test the proposed algorithm in different scenarios, compare it with other audio-only and audio-visual algorithms, and show improved performance on both synthetic and real data. We also include dereverberation-based pre-processing in our algorithm in order to suppress the late reverberant components from the observed stereo mixture and further enhance the overall output of the algorithm. This makes our algorithm a suitable candidate for use in under-determined, highly reverberant settings where the performance of other audio-only and audio-visual methods is limited.
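The two interaural cues named above can be computed per time-frequency point from the STFTs of the two channels. The sketch below shows only that step, under assumed conventions (left over right for the level ratio, left minus right for the phase); the probabilistic models, the EM refinement and the video-derived directions from the abstract are not reproduced. The function name `interaural_cues` is hypothetical.

```python
import numpy as np

def interaural_cues(stft_left, stft_right, eps=1e-12):
    """Return (ILD in dB, IPD in radians) for two complex STFT arrays.

    Both inputs have shape (num_freqs, num_frames). The sign conventions
    (left relative to right) are an assumption of this sketch.
    """
    # Level difference: ratio of magnitudes on a dB scale
    ild_db = 20.0 * np.log10((np.abs(stft_left) + eps) /
                             (np.abs(stft_right) + eps))
    # Phase difference, wrapped to (-pi, pi]
    ipd = np.angle(stft_left * np.conj(stft_right))
    return ild_db, ipd
```

If the left channel equals the right channel scaled by 2 and advanced in phase by 0.3 rad, the function returns an ILD of about 6.02 dB and an IPD of 0.3 rad at every point.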