    Online Localization and Tracking of Multiple Moving Speakers in Reverberant Environments

    We address the problem of online localization and tracking of multiple moving speakers in reverberant environments. The paper makes the following contributions. We use the direct-path relative transfer function (DP-RTF), an inter-channel feature that encodes acoustic information robust against reverberation, and we propose an online algorithm well suited for estimating DP-RTFs associated with moving audio sources. Another crucial ingredient of the proposed method is its ability to properly assign DP-RTFs to audio-source directions. Towards this goal, we adopt a maximum-likelihood formulation and we propose an exponentiated gradient (EG) method to efficiently update source-direction estimates starting from their currently available values. The problem of multiple-speaker tracking is computationally intractable because the number of possible associations between observed source directions and physical speakers grows exponentially with time. We adopt a Bayesian framework and propose a variational approximation of the posterior filtering distribution associated with multiple-speaker tracking, as well as an efficient variational expectation-maximization (VEM) solver. The proposed online localization and tracking method is thoroughly evaluated using two datasets that contain recordings performed in real environments. Comment: IEEE Journal of Selected Topics in Signal Processing, 201
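
    The EG update mentioned in the abstract can be pictured as a multiplicative step on the simplex of direction weights. The sketch below is not the paper's exact formulation; it is a minimal, generic exponentiated-gradient ascent for the weights of a likelihood mixture, where the likelihood matrix `L`, the step size `eta`, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def eg_step(w, L, eta=0.05):
    """One exponentiated-gradient ascent step for mixture weights w (on the simplex).

    w   : (D,)  current weight per candidate direction
    L   : (T,D) likelihood of each observed feature under each direction
    eta : step size (illustrative value)
    """
    denom = L @ w                          # (T,) per-frame mixture likelihood
    grad = (L / denom[:, None]).sum(0)     # gradient of sum_t log(L[t] @ w)
    w_new = w * np.exp(eta * grad)         # multiplicative (EG) update
    return w_new / w_new.sum()             # renormalise onto the simplex

# Toy run: 3 candidate directions, 200 synthetic per-frame likelihoods.
rng = np.random.default_rng(0)
L = rng.uniform(0.1, 1.0, size=(200, 3))
w = np.full(3, 1 / 3)
for _ in range(100):
    w = eg_step(w, L)
```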

    Suivi Multi-Locuteurs avec des Informations Audio-Visuelles pour la Perception des Robots (Multiple-Speaker Tracking with Audio-Visual Information for Robot Perception)

    Robot perception plays a crucial role in human-robot interaction (HRI). The perception system provides the robot with information about its surroundings and enables the robot to give feedback. In a conversational scenario, a group of people may chat in front of the robot and move freely. In such situations, robots are expected to understand where the people are, who is speaking, and what they are talking about. This thesis concentrates on answering the first two questions, namely speaker tracking and diarization. We use different modalities of the robot's perception system to achieve this goal. Like seeing and hearing for a human being, audio and visual information are the critical cues for a robot in a conversational scenario. The advancement of computer vision and audio processing over the last decade has revolutionized robot perception abilities. This thesis makes the following contributions. We first develop a variational Bayesian framework for tracking multiple objects; the framework gives closed-form, tractable solutions, which makes the tracking process efficient. It is first applied to visual multiple-person tracking, with birth and death processes built jointly with the framework to deal with the varying number of people in the scene. Furthermore, we exploit the complementarity of vision and robot motor information: on the one hand, the robot's active motion can be integrated into the visual tracking system to stabilize the tracking; on the other hand, visual information can be used to perform motor servoing. Audio and visual information are then combined in the variational framework to estimate smooth trajectories of speaking people and to infer the acoustic status of a person: speaking or silent. In addition, we employ the model for acoustic-only speaker localization and tracking, where online dereverberation techniques are applied first, followed by the tracking system. Finally, a variant of the acoustic speaker tracking model based on the von Mises distribution is proposed, which is specifically adapted to directional data. All the proposed methods are validated on datasets appropriate to each application.
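
    To make the closed-form flavour of such a variational framework concrete, the sketch below shows one ingredient that typically appears in it: soft assignment of per-frame detections to tracked persons plus a clutter/birth component. It is not the thesis's model; the Gaussian observation likelihoods, the clutter density, and the function names are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn

def soft_assign(obs, means, covs, clutter_density=1e-4):
    """Responsibilities of frame observations w.r.t. tracked persons.

    obs   : (M, d) array of detections (e.g. 2D image positions)
    means : list of N predicted track means; covs : list of N covariances
    Returns an (M, N+1) matrix; the last column is a clutter/birth component.
    """
    M, N = len(obs), len(means)
    r = np.zeros((M, N + 1))
    for n in range(N):
        r[:, n] = mvn.pdf(obs, mean=means[n], cov=covs[n])
    r[:, N] = clutter_density                 # flat likelihood for clutter
    return r / r.sum(axis=1, keepdims=True)   # normalise per observation

# Two tracks, three detections in one frame (toy numbers):
obs = np.array([[1.0, 1.1], [5.0, 4.8], [9.0, 0.2]])
means = [np.array([1.0, 1.0]), np.array([5.0, 5.0])]
covs = [np.eye(2) * 0.5, np.eye(2) * 0.5]
print(soft_assign(obs, means, covs).round(3))  # third detection falls to clutter
```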

    A multi-modal perception based assistive robotic system for the elderly

    In this paper, we present a multi-modal perception based framework to realize a non-intrusive domestic assistive robotic system. It is non-intrusive in that it only starts interaction with a user when it detects the user's intention to do so. All the robot's actions are based on multi-modal perception, which includes user detection based on RGB-D data, detection of the user's intention for interaction from RGB-D and audio data, and communication via user-distance-mediated speech recognition. The utilization of multi-modal cues in different parts of the robotic activity paves the way to successful robotic runs (94% success rate). Each presented perceptual component is systematically evaluated using appropriate datasets and evaluation metrics. Finally, the complete system is fully integrated on the PR2 robotic platform and validated through system sanity-check runs and user studies with the help of 17 volunteer elderly participants.
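
    As a rough illustration of the non-intrusive behaviour described above, the following hypothetical gating logic only engages the user once an intention-for-interaction score passes a threshold, then selects a speech-recognition mode from the estimated user distance. The thresholds, field names, and mode labels are invented for this sketch and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Percepts:
    user_detected: bool        # from RGB-D person detection
    intention_score: float     # from an RGB-D + audio intention classifier
    user_distance_m: float     # estimated distance to the user

def decide_action(p: Percepts, intent_thresh=0.7, near_field_m=2.0):
    """Illustrative gating: interact only when intention is detected,
    and pick the speech-recognition mode from the user's distance."""
    if not p.user_detected or p.intention_score < intent_thresh:
        return "idle"                       # non-intrusive: do nothing
    if p.user_distance_m <= near_field_m:
        return "engage_near_field_asr"      # hypothetical mode names
    return "approach_user_then_engage"

print(decide_action(Percepts(True, 0.9, 1.2)))   # -> engage_near_field_asr
print(decide_action(Percepts(True, 0.3, 1.2)))   # -> idle
```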

    Structure Inference for Bayesian Multisensory Scene Understanding

    We investigate a solution to the problem of multi-sensor scene understanding by formulating it in the framework of Bayesian model selection and structure inference. Humans robustly associate multimodal data as appropriate, but previous modelling work has focused largely on optimal fusion, leaving segregation unaccounted for and unexploited by machine perception systems. We illustrate a unifying Bayesian solution to multi-sensor perception and tracking which accounts for both integration and segregation by explicit probabilistic reasoning about data association in a temporal context. Such explicit inference of multimodal data association is also of intrinsic interest for higher-level understanding of multisensory data. We illustrate this using a probabilistic implementation of data association in a multi-party audio-visual scenario, where unsupervised learning and structure inference are used to automatically segment, associate and track individual subjects in audio-visual sequences. Indeed, the structure-inference-based framework introduced in this work provides the theoretical foundation needed to satisfactorily explain many confounding results in human psychophysics experiments involving multimodal cue integration and association.
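
    The integration-versus-segregation question has a compact Gaussian instance that the structure inference described above generalizes: compare the marginal likelihood of a common-cause model against that of independent causes for a pair of cues. The snippet below is a minimal sketch of that comparison, with all variances, the prior, and the example cue values chosen purely for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal as mvn, norm

def p_common(x_a, x_v, sig_a=1.0, sig_v=1.0, sig_p=10.0, prior_common=0.5):
    """Posterior probability that an audio cue and a visual cue share one cause.

    Everything is Gaussian, so both marginal likelihoods are closed form:
    under a common cause the two cues are correlated through the shared source.
    """
    cov_common = np.array([[sig_a**2 + sig_p**2, sig_p**2],
                           [sig_p**2, sig_v**2 + sig_p**2]])
    lik_common = mvn.pdf([x_a, x_v], mean=[0, 0], cov=cov_common)
    lik_separate = (norm.pdf(x_a, 0, np.hypot(sig_a, sig_p)) *
                    norm.pdf(x_v, 0, np.hypot(sig_v, sig_p)))
    num = prior_common * lik_common
    return num / (num + (1 - prior_common) * lik_separate)

print(p_common(2.0, 2.3))   # nearby cues -> association (fusion) very likely
print(p_common(2.0, 9.0))   # distant cues -> segregation favoured
```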

    Ultra-high-speed imaging of bubbles interacting with cells and tissue

    Ultrasound contrast microbubbles are exploited in molecular imaging, where bubbles are directed to target cells and where their high scattering cross-section to ultrasound allows for the detection of pathologies at a molecular level. In therapeutic applications, vibrating bubbles close to cells may alter the permeability of cell membranes, and these systems are therefore highly interesting for drug and gene delivery applications using ultrasound. In a more extreme regime, bubbles are driven through shock waves to sonoporate or kill cells through intense stresses or jets following inertial bubble collapse. Here, we elucidate some of the underlying mechanisms using the 25-Mfps camera Brandaris128, resolving the bubble dynamics and their interactions with cells. We quantify acoustic microstreaming around oscillating bubbles close to rigid walls and evaluate the shear stresses on non-adherent cells. In a study on the fluid-dynamical interaction of cavitation bubbles with adherent cells, we find that the nonspherical collapse of bubbles is responsible for cell detachment. We also visualized the dynamics of vibrating microbubbles in contact with endothelial cells, followed by fluorescence imaging of the transport of propidium iodide, used as a membrane-integrity probe, into these cells, showing a direct correlation between cell deformation and cell membrane permeability.

    Towards An Intelligent Fuzzy Based Multimodal Two Stage Speech Enhancement System

    This thesis presents a novel two-stage multimodal speech enhancement system, making use of both visual and audio information to filter speech, and explores the extension of this system with fuzzy logic to demonstrate proof of concept for an envisaged autonomous, adaptive, and context-aware multimodal system. The design of the proposed cognitively inspired framework is scalable, meaning that the techniques used in individual parts of the system can be upgraded and that the initial framework presented here can be expanded. In the proposed system, the concept of single-modality two-stage filtering is extended to include the visual modality. Noisy speech received by a microphone array is first pre-processed by visually derived Wiener filtering, employing the novel use of the Gaussian Mixture Regression (GMR) technique and making use of associated visual speech information extracted with a state-of-the-art Semi Adaptive Appearance Models (SAAM) based lip-tracking approach. This pre-processed speech is then enhanced further by audio-only beamforming using a state-of-the-art Transfer Function Generalised Sidelobe Canceller (TFGSC) approach. The resulting system is designed to function in challenging noisy speech environments, evaluated using speech sentences with different speakers from the GRID corpus and a range of noise recordings. Both objective and subjective test results, employing the widely used Perceptual Evaluation of Speech Quality (PESQ) measure, a composite objective measure, and subjective listening tests, show that this initial system delivers very encouraging results for filtering speech mixtures in difficult reverberant speech environments. Some limitations of this initial framework are identified, and the extension of this multimodal system is explored with the development of a fuzzy logic based framework and a proof-of-concept demonstration. Results show that the proposed autonomous, adaptive, and context-aware multimodal framework is capable of delivering very positive results in difficult noisy speech environments, with cognitively inspired use of audio and visual information depending on environmental conditions. Finally, some concluding remarks are made along with proposals for future work.
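
    To give a feel for the first (visually driven) stage, the sketch below applies a per-frequency Wiener gain to a noisy power spectrum given an estimate of the clean-speech spectrum; in the thesis that estimate comes from visually derived regression (GMR), whereas here it is just a placeholder array. The 257-bin frame, the floor `eps`, and the 0.6 scaling are illustrative assumptions, and the second-stage TF-GSC beamformer is not shown.

```python
import numpy as np

def wiener_gain(noisy_power, est_speech_power, eps=1e-10):
    """Per-frequency Wiener gain G = S / (S + N), with N = noisy - S (floored).

    est_speech_power stands in for a spectrum predicted from visual features
    (e.g. by a regression such as GMR); here it is simply an input array.
    """
    est_noise_power = np.maximum(noisy_power - est_speech_power, eps)
    return est_speech_power / (est_speech_power + est_noise_power)

# Apply to one STFT frame (illustrative arrays):
rng = np.random.default_rng(1)
noisy = rng.uniform(0.5, 2.0, 257)          # |X(f)|^2 for 257 frequency bins
speech_hat = 0.6 * noisy                    # placeholder "visual" estimate
enhanced_mag = wiener_gain(noisy, speech_hat) * np.sqrt(noisy)  # filtered magnitude
```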