9 research outputs found

    A study of lip movements during spontaneous dialog and its application to voice activity detection

    This paper presents a quantitative and comprehensive study of the lip movements of a given speaker in different speech/nonspeech contexts, with a particular focus on silences (i.e., when no sound is produced by the speaker). The aim is to characterize the relationship between "lip activity" and "speech activity" and then to use visual speech information as a voice activity detector (VAD). To this aim, an original audiovisual corpus was recorded with two speakers involved in a face-to-face spontaneous dialog, although being in separate rooms. Each speaker communicated with the other using a microphone, a camera, a screen, and headphones. This system was used to capture separate audio stimuli for each speaker and to synchronously monitor the speaker's lip movements. A comprehensive analysis was carried out on the lip shapes and lip movements in either silence or nonsilence (i.e., speech+nonspeech audible events). A single visual parameter, defined to characterize the lip movements, was shown to be efficient for the detection of silence sections. This results in a visual VAD that can be used in any kind of environment noise, including intricate and highly nonstationary noises, e.g., multiple and/or moving noise sources or competing speech signals
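As a rough illustration of how a single lip-movement parameter could drive a visual VAD, here is a minimal sketch. The abstract does not define the parameter or the decision rule, so everything here is an assumption: `lip_param` is a hypothetical per-frame lip measurement (for example, lip opening), movement is taken as its frame-to-frame change, and a stretch of frames counts as silence only if the lips stay still for at least `min_silence_frames`.

```python
def visual_vad(lip_param, threshold, min_silence_frames):
    """Return one flag per video frame: True = lip activity, False = silence.

    Hypothetical rule (not the paper's): a frame is "still" when the
    frame-to-frame change of the lip parameter is below `threshold`, and
    only runs of still frames lasting at least `min_silence_frames`
    are labelled as silence.
    """
    n = len(lip_param)
    if n == 0:
        return []
    # Crude movement measure: absolute change between consecutive frames.
    motion = [0.0] + [abs(lip_param[i] - lip_param[i - 1]) for i in range(1, n)]
    still = [m < threshold for m in motion]

    active = [True] * n
    run_start = None
    # Iterate one step past the end so a trailing still run is closed too.
    for i in range(n + 1):
        if i < n and still[i]:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_silence_frames:
                for j in range(run_start, i):
                    active[j] = False
            run_start = None
    return active
```

The minimum-duration constraint is a design choice of this sketch: it prevents brief pauses in lip movement during speech from being labelled as silence.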

    Video-aided model-based source separation in real reverberant rooms

    Source separation algorithms that utilize only audio data can perform poorly if multiple sources or reverberation are present. In this paper we therefore propose a video-aided model-based source separation algorithm for a two-channel reverberant recording in which the sources are assumed static. By exploiting cues from video, we first localize individual speech sources in the enclosure and then estimate their directions. The interaural spatial cues, the interaural phase difference and the interaural level difference, as well as the mixing vectors are probabilistically modeled. The models make use of the source direction information and are evaluated at discrete time-frequency points. The model parameters are refined with the well-known expectation-maximization (EM) algorithm. The algorithm outputs time-frequency masks that are used to reconstruct the individual sources. Simulation results show that by utilizing the visual modality the proposed algorithm can produce better time-frequency masks, thereby giving improved source estimates. We provide experimental results to test the proposed algorithm in different scenarios, compare it with other audio-only and audio-visual algorithms, and achieve improved performance on both synthetic and real data. We also include dereverberation-based pre-processing in our algorithm in order to suppress the late reverberant components from the observed stereo mixture and further enhance the overall output of the algorithm. This advantage makes our algorithm a suitable candidate for use in under-determined highly reverberant settings where the performance of other audio-only and audio-visual methods is limited
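The step from probabilistic models to time-frequency masks can be sketched as follows. This is only an illustration of the generic EM-style responsibility computation, not the paper's models: the per-source likelihood values are taken as given inputs (in the paper they would come from the interaural and mixing-vector models), and the names `likelihoods` and `priors` are hypothetical.

```python
def soft_masks(likelihoods, priors):
    """Compute soft time-frequency masks as posterior responsibilities.

    likelihoods[s][t][f] is the (assumed, precomputed) likelihood of the
    observation at frame t and frequency bin f under the model of source s;
    priors[s] is that source's mixing weight. The mask of source s at (t, f)
    is its normalised weighted likelihood, so masks sum to 1 over sources.
    """
    n_src = len(likelihoods)
    n_frames = len(likelihoods[0])
    n_bins = len(likelihoods[0][0])
    masks = [[[0.0] * n_bins for _ in range(n_frames)] for _ in range(n_src)]
    for t in range(n_frames):
        for f in range(n_bins):
            weighted = [priors[s] * likelihoods[s][t][f] for s in range(n_src)]
            total = sum(weighted)
            for s in range(n_src):
                # Fall back to a uniform split if no model explains the point.
                masks[s][t][f] = weighted[s] / total if total > 0 else 1.0 / n_src
    return masks
```

Multiplying a mask element-wise with the mixture spectrogram then gives an estimate of the corresponding source's spectrogram.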

    Audio-Visual Object Extraction using Graph Cuts

    We propose a novel method to automatically extract the audio-visual objects that are present in a scene. First, the synchrony between related events in audio and video channels is exploited to identify the possible locations of the sound sources. Video regions presenting a high coherence with the soundtrack are automatically labelled as being part of the audio-visual object. Next, a graph cut segmentation procedure is used to extract the entire object. The proposed segmentation approach includes a novel term that keeps together pixels in regions with high audio-visual synchrony. When longer sequences are analyzed, video signals are divided into groups of frames which are processed sequentially and propagate the information about the source characteristics forward in time. Results show that our method is able to discriminate between audio-visual sources and distracting moving objects and to adapt within a short time delay when sources pass from active to inactive and vice versa

    Mixing Audiovisual Speech Processing and Blind Source Separation for the Extraction of Speech Signals From Convolutive Mixtures

    Looking at the speaker's face can be useful to better hear a speech signal in a noisy environment and extract it from competing sources before identification. This suggests that the visual signals of speech (movements of visible articulators) could be used in speech enhancement or extraction systems. In this paper, we present a novel algorithm that plugs the audiovisual coherence of speech signals, estimated by statistical tools, into audio blind source separation (BSS) techniques. This algorithm is applied to the difficult and realistic case of convolutive mixtures. The algorithm mainly works in the frequency (transform) domain, where the convolutive mixture becomes an additive mixture for each frequency channel. Frequency-by-frequency separation is performed by an audio BSS algorithm. The audio and visual information is modeled by a newly proposed statistical model. This model is then used to solve the standard source permutation and scale factor ambiguities encountered for each frequency after the audio blind separation stage. The proposed method is shown to be efficient in the case of 2 × 2 convolutive mixtures and offers promising perspectives for extracting a particular speech source of interest from complex mixtures
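To make the permutation ambiguity concrete, the sketch below aligns the two outputs of each frequency bin against a reference activity signal. It deliberately swaps in a much simpler criterion than the paper's statistical audiovisual model: plain Pearson correlation between each output's magnitude envelope and a hypothetical `reference` (for example, an activity track derived from lip movements). Scale ambiguity is not addressed here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def align_permutations(envelopes, reference):
    """Resolve the per-frequency output order for a 2 x 2 separation.

    envelopes[f] is a pair (env_a, env_b) of magnitude envelopes over time
    for the two separated outputs at frequency bin f. For each bin, the
    output that correlates better with `reference` is placed first.
    Returns (aligned, swapped), where swapped[f] tells whether bin f was flipped.
    """
    aligned, swapped = [], []
    for env_a, env_b in envelopes:
        swap = pearson(env_b, reference) > pearson(env_a, reference)
        aligned.append((env_b, env_a) if swap else (env_a, env_b))
        swapped.append(swap)
    return aligned, swapped
```

After alignment, the first output of every bin is taken to belong to the source of interest, so its bins can be recombined into one full-band signal.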

    Enhanced independent vector analysis for speech separation in room environments

    PhD Thesis. The human brain has the ability to focus on a desired sound source in the presence of several active sound sources. Machine-based methods lag behind in mimicking this particular skill of human beings. In the domain of digital signal processing this problem is termed the cocktail party problem. This thesis thus aims to further the field of acoustic source separation in the frequency domain based on exploiting source independence. The main challenge in such frequency domain algorithms is the permutation problem. Independent vector analysis (IVA) is a frequency domain blind source separation algorithm which can theoretically obviate the permutation problem by preserving the dependency structure within each source vector whilst eliminating the dependency between the frequency bins of different source vectors. This thesis in particular focuses on improving the separation performance of IVA algorithms which are used for frequency domain acoustic source separation in real room environments. The source prior is crucial to the separation performance of the IVA algorithm as it is used to model the nonlinear dependency structure within the source vectors. An alternative multivariate Student's t distribution source prior is proposed for the IVA algorithm as it is known to be well suited for modelling certain speech signals due to its heavy tail nature. Therefore the nonlinear score function that is derived from the proposed Student's t source prior can better model the dependency structure within the frequency bins and thereby enhance the separation performance and the convergence speed of the IVA and the Fast version of the IVA (FastIVA) algorithms. A novel energy driven mixed Student's t and the original super Gaussian source prior is also proposed for the IVA algorithms. 
As speech signals can be composed of many high and low amplitude data points, the Student's t distribution in the mixed source prior can account for the high amplitude data points whereas the original super Gaussian distribution can cater for the other information in the speech signals. Furthermore, the weight of both distributions in the mixed source prior can be adjusted according to the energy of the observed mixtures. Therefore the mixed source prior adapts to the measured signals and further enhances the performance of the IVA algorithm. A common approach within the IVA algorithm is to model different speech sources with an identical source prior; however, this does not account for the unique characteristics of each speech signal. Therefore dependency modelling for different speech sources can be improved by modelling different speech sources with different source priors. Hence, the Student's t mixture model (SMM) is introduced as a source prior for the IVA algorithm. This new source prior can adapt according to the nature of different speech signals and the parameters for the proposed SMM source prior are estimated by deriving an efficient expectation maximization (EM) algorithm. As a result of this study, a novel EM framework for the IVA algorithm with the SMM as a source prior is proposed which is capable of separating the sources in an efficient manner. The proposed algorithms are tested in various realistic reverberant room environments with real speech signals. All the experiments and evaluation demonstrate the robustness and enhanced separation performance of the proposed algorithms
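The idea of a score function derived from a Student's t source prior can be illustrated with a small sketch. It assumes a real-valued, zero-mean multivariate Student's t density with identity scale matrix, p(s) ∝ (1 + ||s||²/ν)^(−(ν+K)/2), for which the score −∂ log p/∂s_k equals (ν + K)·s_k / (ν + ||s||²). The thesis's exact parameterisation (for instance for complex-valued frequency bins) may differ, and the function name is hypothetical.

```python
def student_t_score(source_vector, nu):
    """Score function of a multivariate Student's t prior (assumed form).

    source_vector holds one source's values across K frequency bins and
    nu > 0 is the degrees-of-freedom parameter. Every component is scaled
    by the same factor, which depends on the energy of the whole vector;
    this shared factor is what couples the frequency bins of one source.
    """
    k = len(source_vector)
    energy = sum(s * s for s in source_vector)
    scale = (nu + k) / (nu + energy)
    return [scale * s for s in source_vector]
```

Because the denominator grows with the vector energy, very large amplitudes receive a small score, which reflects the heavy-tailed nature of the prior.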

    Audio-Visual Fusion: New Methods and Applications

    The perception that we have about the world is influenced by elements of diverse nature. Indeed humans tend to integrate information coming from different sensory modalities to better understand their environment. Following this observation, scientists have been trying to combine different research domains. In particular, in joint audio-visual signal processing the information recorded with one or more video-cameras and one or more microphones is combined in order to extract more knowledge about a given scene than when analyzing each modality separately. In this thesis we attempt the fusion of audio and video modalities when considering one video-camera and one microphone. This is the most common configuration in electronic devices such as laptops and cellphones, and it does not require controlled environments such as previously prepared meeting rooms. Even though numerous approaches have been proposed in the last decade, the fusion of audio and video modalities is still an open problem. All the methods in this domain are based on an assumption of synchrony between related events in audio and video channels, i.e. the appearance of a sound is approximately synchronous with the movement of the image structure that has generated it. However, most approaches do not exploit the spatio-temporal consistency that characterizes video signals and, as a result, they assess the synchrony between single pixels and the soundtrack. The results that they obtain are thus sensitive to noise and the coherence between neighboring pixels is not ensured. This thesis presents two novel audio-visual fusion methods which follow completely different strategies to evaluate the synchrony between moving image structures and sounds. Each fusion method is successfully demonstrated on a different application in this domain. Our first audio-visual fusion approach is focused on the modeling of audio and video signals. 
We propose to decompose each modality into a small set of functions representing the structures that are inherent in the signals. The audio signal is decomposed into a set of atoms representing concentrations of energy in the spectrogram (sounds) and the video signal is concisely represented by a set of image structures evolving through time, i.e. changing their location, size or orientation. As a result, meaningful features can be easily defined for each modality, such as the presence of a sound and the movement of a salient image structure. Finally, the fusion step simply evaluates the co-occurrence of these relevant events. This approach is applied to the blind detection and separation of the audio-visual sources that are present in a scene. In contrast, the second method that we propose uses basic features and is more focused on the fusion strategy that combines them. This approach is based on a nonlinear diffusion procedure that progressively erodes a video sequence and converts it into an audio-visual video sequence, where only the information that is required in applications in the joint audio-visual domain is kept. For this purpose we define a diffusion coefficient that depends on the synchrony between video motion and audio energy and preserves regions moving coherently with the presence of sounds. Thus, the regions that are least diffused are likely to be part of the video modality of the audio-visual source, and the application of this fusion method to the unsupervised extraction of audio-visual objects is straightforward. Unlike many methods in this domain which are specific to speakers, the fusion methods that we present in this thesis are completely general and can be applied to all kinds of audio-visual sources. Furthermore, our analysis is not limited to one source at a time, i.e. all applications can deal with multiple simultaneous sources. 
Finally, this thesis tackles the audio-visual fusion problem from a novel perspective, by proposing creative fusion methods and techniques borrowed from other domains such as blind source separation, nonlinear diffusion based on partial differential equations (PDE) and graph cut segmentation

    Towards An Intelligent Fuzzy Based Multimodal Two Stage Speech Enhancement System

    This thesis presents a novel two stage multimodal speech enhancement system, making use of both visual and audio information to filter speech, and explores the extension of this system with the use of fuzzy logic to demonstrate proof of concept for an envisaged autonomous, adaptive, and context aware multimodal system. The design of the proposed cognitively inspired framework is scalable, meaning that it is possible for the techniques used in individual parts of the system to be upgraded and there is scope for the initial framework presented here to be expanded. In the proposed system, the concept of single modality two stage filtering is extended to include the visual modality. Noisy speech information received by a microphone array is first pre-processed by visually derived Wiener filtering employing the novel use of the Gaussian Mixture Regression (GMR) technique, making use of associated visual speech information, extracted using a state of the art Semi Adaptive Appearance Models (SAAM) based lip tracking approach. This pre-processed speech is then enhanced further by audio only beamforming using a state of the art Transfer Function Generalised Sidelobe Canceller (TFGSC) approach. This results in a system which is designed to function in challenging noisy speech environments (using speech sentences with different speakers from the GRID corpus and a range of noise recordings). Both objective and subjective test results (employing the widely used Perceptual Evaluation of Speech Quality (PESQ) measure, a composite objective measure, and subjective listening tests) show that this initial system is capable of delivering very encouraging results with regard to filtering speech mixtures in difficult reverberant speech environments. Some limitations of this initial framework are identified, and the extension of this multimodal system is explored, with the development of a fuzzy logic based framework and a proof of concept demonstration implemented. 
Results show that this proposed autonomous, adaptive, and context aware multimodal framework is capable of delivering very positive results in difficult noisy speech environments, with cognitively inspired use of audio and visual information, depending on environmental conditions. Finally, some concluding remarks are made along with proposals for future work
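The Wiener filtering stage mentioned in this abstract can be illustrated with the textbook per-bin gain, G = S / (S + N), where S and N are speech and noise power estimates. This is a minimal sketch, not the thesis's implementation: in the thesis the speech estimate is derived from visual features via GMR, whereas here both power estimates are simply taken as inputs, and the function names and the `floor` parameter are hypothetical.

```python
def wiener_gains(speech_power, noise_power, floor=1e-12):
    """Per-frequency-bin Wiener gains G = S / (S + N).

    speech_power and noise_power are equal-length lists of non-negative
    power estimates. `floor` only guards against division by zero when
    both estimates are zero in a bin.
    """
    return [s / max(s + n, floor) for s, n in zip(speech_power, noise_power)]


def apply_gains(noisy_spectrum, gains):
    """Scale each bin of a noisy spectrum (real or complex) by its gain."""
    return [g * x for g, x in zip(gains, noisy_spectrum)]
```

Bins dominated by speech keep a gain close to 1, while bins dominated by noise are attenuated towards 0. The output of such a stage would then be passed to the beamforming stage.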