34 research outputs found

    Smart HMI for an autonomous vehicle

    This work presents the architecture designed to implement an HMI (Human Machine Interface) in an autonomous vehicle developed at the University of Alcalá. The system uses the ROS (Robot Operating System) ecosystem for communication between the different modules developed on the vehicle. In addition, a tool for capturing driver gaze-focalization data from a camera is presented, built on OpenFace, an open-source tool for face analysis. Two estimation methods are proposed: one based on a linear model and the other on the NARMAX algorithm. Several tests have been carried out to assess the accuracy of both methods, and both have been evaluated on the challenging DADA2000 dataset, which is composed of traffic accident videos. Máster Universitario en Ingeniería Industrial (M141
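    As an illustration of how gaze data could flow through a ROS-based HMI like the one described, here is a minimal sketch of a node that applies a pre-calibrated linear mapping (one of the two approaches mentioned above) to gaze angles and republishes a focalization point. The topic names, message types and calibration values are assumptions, not the system's actual interfaces.

```python
# Hypothetical sketch: a ROS node that maps OpenFace-style gaze angles to a
# point in the scene with a pre-calibrated linear model and republishes it for
# the HMI. Topic names, message types and calibration values are assumptions.
import numpy as np
import rospy
from geometry_msgs.msg import PointStamped, Vector3Stamped

# Linear calibration matrix (2x3): maps [gaze_angle_x, gaze_angle_y, 1]
# to normalised scene coordinates; obtained offline by least squares.
A = np.array([[1.9, 0.0, 0.5],
              [0.0, 1.4, 0.5]])

def on_gaze(msg):
    # Gaze angles (radians) are assumed to arrive in msg.vector.x / msg.vector.y.
    g = np.array([msg.vector.x, msg.vector.y, 1.0])
    u, v = A @ g                      # linear gaze-to-scene mapping
    out = PointStamped()
    out.header = msg.header
    out.point.x, out.point.y = float(u), float(v)
    pub.publish(out)

if __name__ == "__main__":
    rospy.init_node("gaze_focalization")
    pub = rospy.Publisher("/hmi/gaze_point", PointStamped, queue_size=10)
    rospy.Subscriber("/openface/gaze_angles", Vector3Stamped, on_gaze)
    rospy.spin()
```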

    Compréhension de contenus visuels par analyse conjointe du contenu et des usages

    This thesis focuses on the problem of understanding visual contents, which can be images, videos or 3D contents. Understanding means that we aim at inferring semantic information about the visual content. The goal of our work is to study methods that combine two types of approaches: 1) automatic content analysis and 2) an analysis of how humans interact with the content (in other words, usage analysis). We start by reviewing the state of the art from both the Computer Vision and Multimedia communities. Twenty years ago, the main approach aimed at a fully automatic understanding of images. This approach has today given way to different forms of human intervention, whether through the constitution of annotated datasets, by solving problems interactively (e.g. detection or segmentation), or through the implicit collection of information gathered from content usage. These different types of human intervention are at the heart of modern research questions: how to motivate human contributors? How to design interactive scenarios whose interactions contribute to content understanding? How to check or ensure the quality of human contributions? How to aggregate human contributions? How to fuse inputs obtained from usage analysis with traditional outputs from content analysis? Our literature review addresses these questions and allows us to position the contributions of this thesis. In our first set of contributions we revisit the detection of important (or salient) regions through implicit feedback from users who either consume or produce visual contents. In 2D, we develop several interactive video interfaces (in particular zoomable video) in order to coordinate content analysis with usage analysis. We then generalize these results to 3D by introducing a new detector of salient regions that builds upon simultaneous video recordings of the same public artistic performance (dance or singing shows, etc.) by multiple users. The second contribution of our work aims at a semantic understanding of still images. With this goal in mind, we use data gathered through a game, Ask’nSeek, that we created. Elementary interactions (such as clicks) together with textual input from players are, as before, combined with automatic image analysis. In particular, we show the usefulness of interactions that help reveal spatial relations between different objects in a scene. After studying the problem of detecting objects in a scene, we also address the more ambitious problem of segmentation.
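    The general idea of fusing usage traces with content analysis can be illustrated with a small sketch: clicks collected from users (e.g. players of an Ask’nSeek-like game) are accumulated into a heat map and blended with a content-based saliency map. The function names, kernel width and equal-blend rule below are assumptions for illustration, not the thesis's actual method.

```python
# Minimal sketch of fusing usage traces with content analysis: user clicks are
# accumulated into a heat map, smoothed, and blended with a content-based
# saliency map. Weights, kernel size and the blend rule are assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter

def usage_saliency(clicks, shape, sigma=15.0):
    """clicks: iterable of (x, y) pixel coordinates collected from users."""
    heat = np.zeros(shape, dtype=float)
    for x, y in clicks:
        if 0 <= int(y) < shape[0] and 0 <= int(x) < shape[1]:
            heat[int(y), int(x)] += 1.0
    heat = gaussian_filter(heat, sigma=sigma)      # spatial smoothing
    return heat / heat.max() if heat.max() > 0 else heat

def fuse(content_saliency, usage_map, alpha=0.5):
    """Late fusion of the two cues; alpha balances content vs. usage."""
    return alpha * content_saliency + (1.0 - alpha) * usage_map
```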

    Face Image and Video Analysis in Biometrics and Health Applications

    Computer Vision (CV) enables computers and systems to derive meaningful information from acquired visual inputs, such as images and videos, and to make decisions based on the extracted information. Its goal is to acquire, process, analyze, and understand this information by developing theoretical and algorithmic models. Biometrics are distinctive and measurable human characteristics used to label or describe individuals, combining computer vision with knowledge of human physiology (e.g., face, iris, fingerprint) and behavior (e.g., gait, gaze, voice). The face is one of the most informative biometric traits, and many studies have investigated it from the perspectives of disciplines ranging from computer vision and deep learning to neuroscience and biometrics. In this work, we analyze facial characteristics in digital images and videos in the areas of morphing attack and defense, and autism diagnosis. For face morphing attack generation, we propose a transformer-based generative adversarial network that produces more visually realistic morphing attacks by combining several losses: face matching distance, a facial-landmark-based loss, a perceptual loss, and pixel-wise mean squared error. For face morphing attack detection, we design a fusion-based few-shot learning (FSL) method that learns discriminative features from face images for few-shot morphing attack detection (FS-MAD), and extend the current binary detection task to multiclass classification, namely few-shot morphing attack fingerprinting (FS-MAF). In the autism diagnosis study, we develop a discriminative few-shot learning method to analyze hour-long video data and explore the fusion of facial dynamics for facial trait classification of autism spectrum disorder (ASD) at three severity levels. The results show outstanding performance of the proposed fusion-based few-shot framework on the dataset. In addition, we explore facial micro-expression spotting and feature analysis on autism video data to classify ASD and control groups; the results indicate that subtle facial expression changes are effective cues for autism diagnosis.
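    The abstract names four loss terms for the morph generator. A hedged PyTorch-style sketch of how such terms might be weighted and combined is shown below; the feature extractors (face_embedder, landmark_net, vgg_features) and the weights are hypothetical placeholders, not the paper's actual configuration.

```python
# Hedged sketch of combining the loss terms named in the abstract for the
# morph generator; extractors and weights are hypothetical placeholders.
import torch
import torch.nn.functional as F

def generator_loss(morph, target, face_embedder, landmark_net, vgg_features,
                   w_id=1.0, w_lmk=0.5, w_perc=0.1, w_pix=10.0):
    # Face matching distance: embeddings of the morph should match the target.
    id_loss = 1.0 - F.cosine_similarity(face_embedder(morph),
                                        face_embedder(target)).mean()
    # Facial-landmark-based loss: predicted landmark coordinates should agree.
    lmk_loss = F.l1_loss(landmark_net(morph), landmark_net(target))
    # Perceptual loss: distance in a deep feature space.
    perc_loss = F.l1_loss(vgg_features(morph), vgg_features(target))
    # Pixel-wise mean squared error.
    pix_loss = F.mse_loss(morph, target)
    return w_id * id_loss + w_lmk * lmk_loss + w_perc * perc_loss + w_pix * pix_loss
```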

    Combining content analysis with usage analysis to better understand visual contents

    This thesis focuses on the problem of understanding visual contents, which can be images, videos or 3D contents. Understanding means that we aim at inferring semantic information about the visual content. The goal of our work is to study methods that combine two types of approaches: 1) automatic content analysis and 2) an analysis of how humans interact with the content (in other words, usage analysis). We start by reviewing the state of the art from both the Computer Vision and Multimedia communities. Twenty years ago, the main approach aimed at a fully automatic understanding of images. This approach has today given way to different forms of human intervention, whether through the constitution of annotated datasets, by solving problems interactively (e.g. detection or segmentation), or through the implicit collection of information gathered from content usage. These different types of human intervention are at the heart of modern research questions: how to motivate human contributors? How to design interactive scenarios whose interactions contribute to content understanding? How to check or ensure the quality of human contributions? How to aggregate human contributions? How to fuse inputs obtained from usage analysis with traditional outputs from content analysis? Our literature review addresses these questions and allows us to position the contributions of this thesis. In our first set of contributions we revisit the detection of important (or salient) regions through implicit feedback from users who either consume or produce visual contents. In 2D, we develop several interactive video interfaces (e.g. zoomable video) in order to coordinate content analysis with usage analysis. We also generalize these results to 3D by introducing a new detector of salient regions that builds upon simultaneous video recordings of the same public artistic performance (dance or singing shows, etc.) by multiple users. The second contribution of our work aims at a semantic understanding of still images. With this goal in mind, we use data gathered through a game, Ask’nSeek, that we created. Elementary interactions (such as clicks) together with textual input from players are, as before, combined with automatic image analysis. In particular, we show the usefulness of interactions that help reveal spatial relations between different objects in a scene. After studying the problem of detecting objects in a scene, we also address the more ambitious problem of segmentation.

    A Study on Hand Manipulation Analysis from First-Person Videos (一人称視点映像からの手操作解析に関する研究)

    Degree type: Doctorate by coursework (課程博士). Dissertation committee: (chair) Prof. Shin'ichi Satoh (National Institute of Informatics); Prof. Yoichi Sato (The University of Tokyo); Prof. Kiyoharu Aizawa (The University of Tokyo); Assoc. Prof. Toshihiko Yamasaki (The University of Tokyo); Assoc. Prof. Takeshi Oishi (The University of Tokyo). University of Tokyo (東京大学)

    Dual Eye-Tracking Methods for the Study of Remote Collaborative Problem Solving

    Applied eye tracking has been extensively used for the study of psychological processes. More recently, some researchers have used this technique to study the interaction between people by tracking and analyzing the eye movements of two persons synchronously. However, this is generally accomplished by observing people in simple, controlled settings. In this thesis, we use a similar methodology, dual eye-tracking, to study people in more natural, semantically richer tasks, with the aim of identifying dual eye-movement patterns that reflect collaborative processes. Eye tracking, and more broadly eye-movement analysis, is a complex domain involving several methodological issues that have not yet been satisfactorily solved. This work is also an attempt to offer solutions to several of these issues. The first part of this thesis is dedicated to improving the methodology of eye-tracking data analysis. We present several developments pertaining to the general methodology of eye tracking; more specifically, we identify problems and offer solutions for the following aspects: fixation identification, correction of systematic position errors, and hit detection. We also tackle questions specific to the dual eye-tracking method: we present issues that arise in dual eye-tracking data collection and analysis and propose solutions. On the technical side, we deal with the synchronous recording of two streams of eye movements, and on the analytical side, we extend gaze cross-recurrence, a measure of eye-movement coupling, to complex, realistic collaborative tasks. The second part is devoted to experimental studies of collaboration using dual eye-tracking methods. We first present four exploratory studies that allowed us to set the stage by identifying interesting phenomena and experimental difficulties. The main results of these experiments revolve around the relationship between gaze and speech. In this respect, we extend some results found in the literature to more natural settings and develop a computational model to make predictions about dialogue and visual references. Finally, the main study of this thesis concerns computer program understanding. This study is composed of two experiments: a solo programming experiment, from which we identified gaze patterns of a single programmer, and a pair-programming task in which we explored how gaze patterns during program comprehension are affected by collaboration.
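    Fixation identification, one of the methodological issues mentioned above, is commonly handled with a dispersion-threshold algorithm. The sketch below shows a standard I-DT variant for illustration; the thresholds are illustrative defaults, not the values used in the thesis.

```python
# Illustrative dispersion-threshold (I-DT) fixation identification; a standard
# approach to the fixation-identification problem, with illustrative thresholds.
import numpy as np

def idt_fixations(x, y, t, max_dispersion=1.0, min_duration=0.1):
    """x, y: gaze coordinates; t: timestamps in seconds.
    Returns a list of (start_index, end_index) fixation windows."""
    fixations, i, n = [], 0, len(t)
    while i < n:
        j = i
        # Grow the window until it covers at least min_duration.
        while j < n and t[j] - t[i] < min_duration:
            j += 1
        if j >= n:
            break
        disp = np.ptp(x[i:j + 1]) + np.ptp(y[i:j + 1])
        if disp <= max_dispersion:
            # Extend the window while dispersion stays below the threshold.
            while j + 1 < n and np.ptp(x[i:j + 2]) + np.ptp(y[i:j + 2]) <= max_dispersion:
                j += 1
            fixations.append((i, j))
            i = j + 1
        else:
            i += 1
    return fixations
```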

    Scene understanding by robotic interactive perception

    This thesis presents a novel and generic visual architecture for scene understanding by robotic interactive perception. The proposed visual architecture is fully integrated into autonomous systems performing object perception and manipulation tasks, and it uses interaction with the scene to improve scene understanding substantially over non-interactive models. Specifically, this thesis presents two experimental validations of an autonomous system interacting with the scene: firstly, an autonomous gaze control model is investigated, where the vision sensor directs its gaze to satisfy a scene exploration task; secondly, autonomous interactive perception is investigated, where objects in the scene are repositioned by robotic manipulation. The proposed visual architecture has four components: 1) a reliable vision system, 2) camera and hand-eye calibration to integrate the vision system into an autonomous robot's kinematic frame chain, 3) a visual model performing perception tasks and providing the knowledge required to interact with the scene, and 4) a manipulation model which, using knowledge received from the perception model, chooses an appropriate action (from a set of simple actions) to satisfy a manipulation task. This thesis presents contributions for each of these components. Firstly, a portable active binocular robot vision architecture that integrates a number of visual behaviours is presented. This active vision architecture is able to verge, localise, recognise and simultaneously identify multiple target object instances. The portability and functional accuracy of the proposed vision architecture are demonstrated through qualitative and comparative analyses using different robot hardware configurations, feature extraction techniques and scene perspectives. Secondly, a camera and hand-eye calibration methodology for integrating an active binocular robot head within a dual-arm robot is described. For this purpose, the forward kinematic model of the active robot head is derived and the methodology for calibrating and integrating the robot head is described in detail. A rigid calibration methodology has been implemented to provide a closed-form hand-to-eye calibration chain, and this has been extended with a mechanism that allows the camera extrinsic parameters to be updated dynamically for optimal 3D reconstruction, meeting the requirements of robotic tasks such as grasping and manipulating rigid and deformable objects. Experimental results show that the robot head achieves an overall accuracy of less than 0.3 millimetres while recovering the 3D structure of a scene. In addition, a comparative study between current RGB-D cameras and our active stereo head within two dual-arm robotic test-beds demonstrates the accuracy and portability of the proposed methodology. Thirdly, this thesis proposes a visual perception model for category-wise object sorting, based on Gaussian Process (GP) classification, that is capable of recognising object categories from point cloud data. In this approach, Fast Point Feature Histogram (FPFH) features are extracted from point clouds to describe the local 3D shape of objects, and a Bag-of-Words coding method is used to obtain an object-level vocabulary representation. Multi-class Gaussian Process classification is employed to provide a probability estimate of the identity of the object and plays the key role of modelling perception confidence in the interactive perception cycle. The interaction stage is responsible for invoking the appropriate action skills as required to confirm the identity of an observed object with high confidence as a result of executing multiple perception-action cycles. The recognition accuracy of the proposed perception model has been validated on simulated input data using both Support Vector Machine (SVM) and GP-based multi-class classifiers. Results obtained during this investigation demonstrate that a GP-based classifier can achieve true positive classification rates of up to 80%. Experimental validation of the above semi-autonomous object sorting system shows that the proposed GP-based interactive sorting approach outperforms random sorting by up to 30% when applied to scenes comprising configurations of household objects. Finally, a fully autonomous visual architecture is presented that accommodates manipulation skills so that an autonomous system can interact with the scene through object manipulation. This visual architecture consists of two stages: 1) a perception stage, which is a modified version of the aforementioned visual interaction model, and 2) an interaction stage, which performs a set of ad-hoc actions based on the information received from the perception stage. More specifically, the interaction stage reasons over the information (class label and associated probabilistic confidence score) received from the perception stage to choose one of two actions: 1) if an object class has been identified with high confidence, the object is removed from the scene and placed in the basket/bin designated for that class; 2) if an object class has been identified with lower confidence, then, inspired by the human behaviour of inspecting doubtful objects, an action is chosen to investigate that object further and confirm its identity by capturing more images of it from different views in isolation. The perception stage then processes these views, so that multiple perception-action/interaction cycles take place. From an application perspective, the task of autonomous category-based object sorting is performed and the experimental design for the task is described in detail.
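    The confidence-driven perception-interaction cycle described above can be summarised with a short sketch. The helper callables (segment_objects, classify_gp, place_in_bin, isolate_and_capture_views), the 0.8 threshold and the cycle bound are hypothetical, intended only to illustrate the control flow.

```python
# Hedged sketch of the confidence-driven perception-interaction cycle; helper
# callables, the confidence threshold and the cycle bound are hypothetical.
def sort_scene(scene_cloud, segment_objects, classify_gp, place_in_bin,
               isolate_and_capture_views, confidence_threshold=0.8, max_cycles=5):
    for obj_cloud in segment_objects(scene_cloud):
        views = [obj_cloud]
        for _ in range(max_cycles):
            # Perception stage: FPFH + Bag-of-Words features fed to the
            # multi-class GP classifier, returning a label and a probability.
            label, confidence = classify_gp(views)
            if confidence >= confidence_threshold:
                # Action 1: identity confirmed, sort into the class bin.
                place_in_bin(obj_cloud, label)
                break
            # Action 2: identity uncertain, move the object into isolation and
            # capture additional views for the next perception cycle.
            views += isolate_and_capture_views(obj_cloud)
```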

    Gaze-Based Human-Robot Interaction by the Brunswick Model

    We present a new paradigm for human-robot interaction based on social signal processing, and in particular on the Brunswick model. Originally, the Brunswick model deals with face-to-face dyadic interaction, assuming that the interactants communicate through a continuous exchange of non-verbal social signals in addition to the spoken messages. Social signals have to be interpreted through a proper recognition phase that considers visual and audio information. The Brunswick model makes it possible to quantitatively evaluate the quality of the interaction using statistical tools that measure how effective the recognition phase is. In this paper we cast this theory in the setting where one of the interactants is a robot; in this case, the recognition phases performed by the robot and by the human have to be revised with respect to the original model. The model is applied to Berrick, a recent open-source, low-cost robotic head platform, where gaze is the social signal considered.
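    A minimal sketch of one way the effectiveness of the recognition phase could be summarised statistically, assuming the robot's decoded gaze targets are compared against the targets the human actually attended to; accuracy and Cohen's kappa are illustrative measures, not necessarily those used in the paper.

```python
# Illustrative measures of recognition effectiveness for gaze-based interaction:
# compare the robot's recognised gaze targets with ground-truth targets.
from sklearn.metrics import accuracy_score, cohen_kappa_score

def recognition_effectiveness(true_targets, recognized_targets):
    """Both arguments are sequences of discrete gaze-target labels."""
    return {
        "accuracy": accuracy_score(true_targets, recognized_targets),
        "kappa": cohen_kappa_score(true_targets, recognized_targets),
    }
```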