
    Audio-Visual Speech-Turn Detection and Tracking

    Speaker diarization is an important component of multi-party dialog systems, as it assigns speech-signal segments to the participants. Diarization may well be viewed as the problem of detecting and tracking speech turns. We propose to address this problem by modeling the spatial coincidence of visual and auditory observations and by combining this coincidence model with a dynamic Bayesian formulation that tracks the identity of the active speaker. Speech-turn tracking is formulated as a latent-variable temporal graphical model, and an exact inference algorithm is proposed. We describe in detail an audiovisual discriminative observation model as well as a state-transition model. We also describe an implementation of a full system composed of multi-person visual tracking, sound-source localization, and the proposed online diarization technique. Finally, we show that the proposed method yields promising results on two challenging scenarios that were carefully recorded and annotated.
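    As a rough illustration of the filtering view taken here, the sketch below runs exact forward filtering over a discrete latent variable encoding the active speaker (with an extra "silence" state). The transition matrix and the audio-visual likelihood vectors are placeholders, not the paper's learned models.

```python
import numpy as np

def make_transition_matrix(n_states, p_stay=0.9):
    """Self-biased speech-turn dynamics: the current state tends to persist,
    with the remaining probability spread uniformly over the other states."""
    A = np.full((n_states, n_states), (1.0 - p_stay) / (n_states - 1))
    np.fill_diagonal(A, p_stay)
    return A

def filter_step(belief, likelihood, A):
    """One step of exact forward filtering over the active-speaker variable."""
    predicted = A.T @ belief              # predict: p(s_t | obs_1..t-1)
    posterior = predicted * likelihood    # update with audio-visual likelihood
    return posterior / posterior.sum()

# Toy usage: state 0 = nobody speaking, states 1..3 = three tracked persons.
A = make_transition_matrix(4)
belief = np.full(4, 0.25)
for lik in ([0.1, 0.7, 0.1, 0.1], [0.2, 0.6, 0.1, 0.1], [0.3, 0.2, 0.4, 0.1]):
    belief = filter_step(belief, np.array(lik), A)
    print("most likely active speaker:", belief.argmax(), belief.round(2))
```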

    Tracking the Active Speaker Based on a Joint Audio-Visual Observation Model

    Any multi-party conversation system benefits from speaker diarization, that is, the assignment of speech signals among the participants. We cast the diarization problem into a tracking formulation whereby the active speaker is detected and tracked over time. A probabilistic tracker exploits the on-image (spatial) coincidence of visual and auditory observations and infers a single latent variable that represents the identity of the active speaker. Both visual and auditory observations are explained by a recently proposed weighted-data mixture model, while several options for the speaking-turn dynamics are handled by a multi-case transition model. The modules that translate raw audio and visual data into on-image observations are also described in detail. The performance of the proposed tracker is tested on challenging datasets available from recent contributions, which are used as baselines for comparison.
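    The sketch below illustrates, under simplifying assumptions, how an on-image coincidence score could be formed: sound-source localization yields an approximate image column, and each tracked person contributes a Gaussian likelihood centred on their location. The Gaussian form, pixel scale, and "nobody speaking" floor are assumptions for illustration, not the paper's observation model.

```python
import numpy as np

def coincidence_likelihoods(sound_x, person_centers, sigma=40.0, floor=1e-3):
    """Likelihood that a sound localised at image column `sound_x` (pixels)
    was emitted by each visible person; index 0 is a 'no visible speaker'
    hypothesis kept at a small constant floor."""
    d = np.asarray(person_centers, dtype=float) - sound_x
    lik = np.exp(-0.5 * (d / sigma) ** 2)
    return np.concatenate(([floor], lik))

# Toy usage: sound localised near column 310, persons at columns 120, 300, 520.
print(coincidence_likelihoods(310.0, [120.0, 300.0, 520.0]).round(3))
```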

    Audio-Visual Speaker Localization via Weighted Clustering

    In this paper we address the problem of detecting and locating speakers using audiovisual data, cast in the framework of clustering. We propose a novel weighted clustering method based on a finite mixture model that explores the idea of non-uniform weighting of observations. Weighted-data clustering techniques have already been proposed, but not in a generative setting as presented here. We introduce a weighted-data mixture model and formally derive the associated EM procedure. The clustering algorithm is applied to the problem of detecting and localizing a speaker over time using both visual and auditory observations gathered with a single camera and two microphones. Audiovisual fusion is enforced by introducing a cross-modal weighting scheme. We test the robustness of the method with experiments in two challenging scenarios: disambiguating between an active and a non-active speaker, and associating a speech signal with a person.
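    For intuition only, the sketch below runs EM for an isotropic Gaussian mixture in which each observation carries a weight that scales its contribution to the M-step. This simplified, non-generative variant differs from the weighted-data mixture model derived in the paper, where the weights themselves are modelled; it is only meant to show how observation weights enter the updates.

```python
import numpy as np

def weighted_gmm_em(X, w, K, n_iter=50, seed=0):
    """EM for an isotropic Gaussian mixture with per-observation weights `w`."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, K, replace=False)]   # initialise means from the data
    var = np.full(K, X.var())                 # isotropic variances
    pi = np.full(K, 1.0 / K)                  # mixing proportions
    for _ in range(n_iter):
        # E-step: responsibilities under isotropic Gaussians (log-domain)
        d2 = ((X[:, None, :] - mu[None]) ** 2).sum(-1)          # (n, K)
        log_r = np.log(pi) - 0.5 * d2 / var - 0.5 * d * np.log(var)
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: every sufficient statistic is scaled by the data weights
        rw = r * w[:, None]
        Nk = rw.sum(axis=0) + 1e-12
        mu = (rw.T @ X) / Nk[:, None]
        var = (rw * ((X[:, None, :] - mu[None]) ** 2).sum(-1)).sum(0) / (d * Nk)
        pi = Nk / Nk.sum()
    return mu, var, pi

# Toy usage: two clusters; the second half of the points is trusted less.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
w = np.concatenate([np.full(50, 1.0), np.full(50, 0.5)])
print(weighted_gmm_em(X, w, K=2)[0].round(2))
```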

    A Distributed Architecture for Interacting with NAO

    One of the main applications of the humanoid robot NAO – a small robot companion – is human-robot interaction (HRI). NAO is particularly well suited for HRI applications because of its design, hardware specifications, programming capabilities, and affordable cost. Indeed, NAO can stand up, walk, wander, dance, play soccer, sit down, recognize and grasp simple objects, detect and identify people, localize sounds, understand some spoken words, engage in simple and goal-directed dialogs, and synthesize speech. This is made possible by the robot's 24-degree-of-freedom articulated structure (body, legs, feet, arms, hands, head, etc.), motors, cameras, microphones, etc., as well as by its on-board computing hardware and embedded software, e.g., robot motion control. Nevertheless, the current NAO configuration has two drawbacks that restrict the complexity of interactive behaviors that could potentially be implemented. Firstly, the on-board computing resources are inherently limited, which makes it difficult to implement the sophisticated computer vision and audio signal analysis algorithms required by advanced interactive tasks. Secondly, programming new robot functionalities currently implies developing embedded software, which is a difficult task in its own right and necessitates specialized knowledge. The vast majority of HRI practitioners may not have this kind of expertise and hence cannot easily and quickly implement their ideas, carry out thorough experimental validations, and design proof-of-concept demonstrators. We have developed a distributed software architecture that attempts to overcome these two limitations. Broadly speaking, NAO's on-board computing resources are augmented with external computing resources. The latter is a computer platform with its CPUs, GPUs, memory, operating system, libraries, software packages, internet access, etc. This configuration enables easy and fast development in Matlab, C, C++, or Python.
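    The sketch below illustrates the general off-loading pattern such an architecture relies on, not the project's actual middleware: the robot streams sensor frames to the external workstation over a length-prefixed socket connection and receives processing results back. The host name, port, and framing are assumptions made for this example.

```python
import socket
import struct

WORKSTATION_HOST = "workstation.local"   # hypothetical external machine
PORT = 9000                              # arbitrary port for this sketch

def send_msg(sock, payload: bytes):
    """Length-prefixed send so the receiver knows where each message ends."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_msg(sock) -> bytes:
    """Read exactly one length-prefixed message."""
    def recv_exact(n):
        buf = b""
        while len(buf) < n:
            chunk = sock.recv(n - len(buf))
            if not chunk:
                raise ConnectionError("socket closed")
            buf += chunk
        return buf
    (length,) = struct.unpack("!I", recv_exact(4))
    return recv_exact(length)

def robot_loop(grab_frame, handle_result):
    """Conceptually runs on the robot: off-load each frame, act on the reply."""
    with socket.create_connection((WORKSTATION_HOST, PORT)) as sock:
        while True:
            send_msg(sock, grab_frame())       # e.g. a JPEG-encoded image
            handle_result(recv_msg(sock))      # e.g. detected faces or words
```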

    Audio-Visual Analysis In the Framework of Humans Interacting with Robots

    In recent years, there has been a growing interest in human-robot interaction (HRI), with the aim of enabling robots to naturally interact and communicate with humans. Natural interaction implies that robots not only need to understand speech and non-verbal communication cues such as body gesture, gaze, or facial expressions, but they also need to understand the dynamics of the social interplay, e.g., find people in the environment, distinguish between different people, track them through the physical space, parse their actions and activity, estimate their engagement, identify who is speaking, who speaks to whom, etc. All of this requires the robots to have multimodal perception skills to meaningfully detect and integrate information from their multiple sensory channels. In this thesis, we focus on the robot's audio-visual sensory inputs consisting of (multiple) microphones and video cameras. Among the different addressable perception tasks, in this thesis we explore three, namely: (P1) multiple-speaker localization, (P2) multiple-person location tracking, and (P3) speaker diarization. The majority of existing works in signal processing and computer vision address these problems using either audio signals alone or visual information alone. In this thesis, however, we address them via the fusion of the audio and visual information gathered by two microphones and one video camera. Our goal is to exploit the complementary nature of the audio and visual modalities in the hope of attaining significant improvements in robustness and performance over systems that use a single modality. Moreover, the three problems are addressed considering challenging HRI scenarios, e.g., a robot engaged in a multi-party interaction with a varying number of participants, who may speak at the same time, move around the scene, and turn their heads/faces towards the other participants rather than facing the robot.
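    As a geometric illustration of how two microphones and one camera can be fused, the sketch below maps a time difference of arrival to an azimuth (far-field assumption) and projects that azimuth onto an image column (pinhole camera). The microphone spacing, field of view, and alignment are placeholder values, not the thesis' calibration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0               # m/s
MIC_SPACING = 0.12                   # m between the two microphones (assumed)
IMAGE_WIDTH = 640                    # px
HORIZONTAL_FOV = np.deg2rad(60.0)    # camera horizontal field of view (assumed)

def tdoa_to_azimuth(tdoa):
    """Far-field model: time difference of arrival (s) -> source azimuth (rad)."""
    return np.arcsin(np.clip(SPEED_OF_SOUND * tdoa / MIC_SPACING, -1.0, 1.0))

def azimuth_to_column(azimuth):
    """Project an azimuth onto an image column with a pinhole camera model."""
    focal_px = (IMAGE_WIDTH / 2) / np.tan(HORIZONTAL_FOV / 2)
    return IMAGE_WIDTH / 2 + focal_px * np.tan(azimuth)

# Toy usage: a 0.1 ms inter-microphone delay maps to roughly this image column.
print(azimuth_to_column(tdoa_to_azimuth(1e-4)))
```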

    Audio-Visual Analysis in the Context of Human-Robot Interaction

    In recent years, there has been a growing interest in human-robot interaction (HRI), with the aim of enabling robots to naturally interact and communicate with humans. Natural interaction implies that robots not only need to understand speech and non-verbal communication cues such as body gesture, gaze, or facial expressions, but they also need to understand the dynamics of the social interplay, e.g., find people in the environment, distinguish between different people, track them through the physical space, parse their actions and activities, estimate their engagement, identify who is speaking, who speaks to whom, etc. All of these tasks require the robots to have multimodal perception skills to meaningfully detect and integrate information from their multiple sensory channels. In this thesis, we focus on the robot's audio-visual sensory inputs consisting of microphones and video cameras. Among the different addressable perception tasks, in this thesis we explore three, namely: (1) multiple-speaker localization, (2) multiple-person location tracking, and (3) speaker diarization. The majority of existing works in signal processing and computer vision address these problems using either audio signals or visual information. In this thesis, however, we address them via the fusion of the audio and visual information gathered by two microphones and one video camera. Our goal is to exploit the complementary nature of the audio and visual modalities in the hope of attaining significant improvements in robustness and performance over systems that use a single modality. Moreover, the three problems are addressed considering challenging HRI scenarios such as a robot engaged in a multi-party interaction with a varying number of participants, who may speak at the same time, move around the scene, and turn their heads/faces towards the other participants rather than facing the robot.
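    For task (2), multiple-person location tracking, the sketch below shows only the simplest possible data-association step: matching current detections to existing tracks by nearest neighbour with a gating threshold. It is an illustration of the task, not the thesis' tracker, and the gate value is an assumption.

```python
import numpy as np

def associate(tracks, detections, gate=80.0):
    """tracks: dict id -> last (x, y) position; detections: list of (x, y).
    Greedy nearest-neighbour association with a gating threshold (pixels)."""
    next_id = max(tracks, default=-1) + 1
    for det in detections:
        det = np.asarray(det, dtype=float)
        if tracks:
            ids = list(tracks)
            dists = [np.linalg.norm(det - np.asarray(tracks[i])) for i in ids]
            best = int(np.argmin(dists))
            if dists[best] < gate:
                tracks[ids[best]] = det     # update the matched track
                continue
        tracks[next_id] = det               # too far from every track: new track
        next_id += 1
    return tracks

# Toy usage: the first detection updates track 0, the second opens a new track.
print(associate({0: (100, 200), 1: (400, 210)}, [(105, 198), (620, 220)]))
```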