252 research outputs found

    Computationally efficient deformable 3D object tracking with a monocular RGB camera

    Get PDF
    182 p.Monocular RGB cameras are present in most scopes and devices, including embedded environments like robots, cars and home automation. Most of these environments have in common a significant presence of human operators with whom the system has to interact. This context provides the motivation to use the captured monocular images to improve the understanding of the operator and the surrounding scene for more accurate results and applications.However, monocular images do not have depth information, which is a crucial element in understanding the 3D scene correctly. Estimating the three-dimensional information of an object in the scene using a single two-dimensional image is already a challenge. The challenge grows if the object is deformable (e.g., a human body or a human face) and there is a need to track its movements and interactions in the scene.Several methods attempt to solve this task, including modern regression methods based on Deep NeuralNetworks. However, despite the great results, most are computationally demanding and therefore unsuitable for several environments. Computational efficiency is a critical feature for computationally constrained setups like embedded or onboard systems present in robotics and automotive applications, among others.This study proposes computationally efficient methodologies to reconstruct and track three-dimensional deformable objects, such as human faces and human bodies, using a single monocular RGB camera. To model the deformability of faces and bodies, it considers two types of deformations: non-rigid deformations for face tracking, and rigid multi-body deformations for body pose tracking. Furthermore, it studies their performance on computationally restricted devices like smartphones and onboard systems used in the automotive industry. The information extracted from such devices gives valuable insight into human behaviour a crucial element in improving human-machine interaction.We tested the proposed approaches in different challenging application fields like onboard driver monitoring systems, human behaviour analysis from monocular videos, and human face tracking on embedded devices

    Objects extraction and recognition for camera-based interaction : heuristic and statistical approaches

    Get PDF
    In this thesis, heuristic and probabilistic methods are applied to a number of problems for camera-based interactions. The goal is to provide solutions for a vision based system that is able to extract and analyze interested objects in camera images and to use that information for various interactions for mobile usage. New methods and new attempts of combination of existing methods are developed for different applications, including text extraction from complex scene images, bar code reading performed by camera phones, and face/facial feature detection and facial expression manipulation. The application-driven problems of camera-based interaction can not be modeled by a uniform and straightforward model that has very strong simplifications of reality. The solutions we learned to be efficient were to apply heuristic but easy of implementation approaches at first to reduce the complexity of the problems and search for possible means, then use developed statistical learning approaches to deal with the remaining difficult but well-defined problems and get much better accuracy. The process can be evolved in some or all of the stages, and the combination of the approaches is problem-dependent. Contribution of this thesis resides in two aspects: firstly, new features and approaches are proposed either as heuristics or statistical means for concrete applications; secondly engineering design combining seveal methods for system optimization is studied. Geometrical characteristics and the alignment of text, texture features of bar codes, and structures of faces can all be extracted as heuristics for object extraction and further recognition. The boosting algorithm is one of the proper choices to perform probabilistic learning and to achieve desired accuracy. New feature selection techniques are proposed for constructing the weak learner and applying the boosting output in concrete applications. Subspace methods such as manifold learning algorithms are introduced and tailored for facial expression analysis and synthesis. A modified generalized learning vector quantization method is proposed to deal with the blurring of bar code images. Efficient implementations that combine the approaches in a rational joint point are presented and the results are illustrated.reviewe

    Modeling small objects under uncertainties : novel algorithms and applications.

    Get PDF
    Active Shape Models (ASM), Active Appearance Models (AAM) and Active Tensor Models (ATM) are common approaches to model elastic (deformable) objects. These models require an ensemble of shapes and textures, annotated by human experts, in order identify the model order and parameters. A candidate object may be represented by a weighted sum of basis generated by an optimization process. These methods have been very effective for modeling deformable objects in biomedical imaging, biometrics, computer vision and graphics. They have been tried mainly on objects with known features that are amenable to manual (expert) annotation. They have not been examined on objects with severe ambiguities to be uniquely characterized by experts. This dissertation presents a unified approach for modeling, detecting, segmenting and categorizing small objects under uncertainty, with focus on lung nodules that may appear in low dose CT (LDCT) scans of the human chest. The AAM, ASM and the ATM approaches are used for the first time on this application. A new formulation to object detection by template matching, as an energy optimization, is introduced. Nine similarity measures of matching have been quantitatively evaluated for detecting nodules less than 1 em in diameter. Statistical methods that combine intensity, shape and spatial interaction are examined for segmentation of small size objects. Extensions of the intensity model using the linear combination of Gaussians (LCG) approach are introduced, in order to estimate the number of modes in the LCG equation. The classical maximum a posteriori (MAP) segmentation approach has been adapted to handle segmentation of small size lung nodules that are randomly located in the lung tissue. A novel empirical approach has been devised to simultaneously detect and segment the lung nodules in LDCT scans. The level sets methods approach was also applied for lung nodule segmentation. A new formulation for the energy function controlling the level set propagation has been introduced taking into account the specific properties of the nodules. Finally, a novel approach for classification of the segmented nodules into categories has been introduced. Geometric object descriptors such as the SIFT, AS 1FT, SURF and LBP have been used for feature extraction and matching of small size lung nodules; the LBP has been found to be the most robust. Categorization implies classification of detected and segmented objects into classes or types. The object descriptors have been deployed in the detection step for false positive reduction, and in the categorization stage to assign a class and type for the nodules. The AAMI ASMI A TM models have been used for the categorization stage. The front-end processes of lung nodule modeling, detection, segmentation and classification/categorization are model-based and data-driven. This dissertation is the first attempt in the literature at creating an entirely model-based approach for lung nodule analysis

    The Discriminative Generalized Hough Transform for Localization of Highly Variable Objects and its Application for Surveillance Recordings

    Get PDF
    This work is about the localization of arbitrary objects in 2D images in general and the localization of persons in video surveillance recordings in particular. More precisely, it is about localizing specific landmarks. Thereby the possibilities and limitations of localization approaches based on the Generalized Hough Transform (GHT), especially of the Discriminative Generalized Hough Transform (DGHT) will be evaluated. GHT-based approaches determine the number of matching model and feature points and the most likely target point position is given by the highest number of matching model and feature points. Additionally, the DGHT comprises a statistical learning approach to generate optimal DGHT-models achieving good results on medical images. This work will show that the DGHT is not restricted to medical tasks but has issues with large target object variabilities, which are frequent in video surveillance tasks. As all GHT-based approaches also the DGHT only considers the number of matching model-feature-point-combinations, which means that all model points are treated independently. This work will show that model points are not independent of each other and considering them independently will result in high error rates. This drawback is analyzed and a universal solution, which is not only applicable for the DGHT but all GHT-based approaches, is presented. This solution is based on an additional classifier that takes the whole set of matching model-feature-point-combinations into account to estimate a confidence score. On all tested databases, this approach could reduce the error rates drastically by up to 94.9%. Furthermore, this work presents a general approach for combining multiple GHT-models into a deeper model. This can be used to combine the localization results of different object landmarks such as mouth, nose, and eyes. Similar to Convolutional Neural Networks (CNNs) this will split the target object variability into multiple and smaller variabilities. A comparison of GHT-based approaches with CNNs and a description of the advantages, disadvantages, and potential application of both approaches will conclude this work.Diese Arbeit beschäftigt sich im Allgemeinen mit der Lokalisierung von Objekten in 2D Bilddaten und im Speziellen mit der Lokalisierung von Personen in Videoüberwachungsaufnahmen. Genauer gesagt handelt es sich hierbei um die Lokalisierung spezieller Landmarken. Dabei werden die Möglichkeiten und Limiterungen von Lokalisierungsverfahren basierend auf der Generalisierten Hough Transformation (GHT) untersucht, insbesondere die der Diskriminativen Generalisierten Hough Transformation (DGHT). Bei GHT-basierten Ansätze wird die Anzahl an übereinstimmenden Modelpunkten und Merkmalspunkten ermittelt und die wahrscheinlicheste Objekt-Position ergibt sich aus der höchsten Anzahl an übereinstimmenden Model- und Merkmalspunkte. Die DGHT umfasst darüber hinaus noch ein statistisches Lernverfahren, um optimale DGHT-Modele zu erzeugen und erzielte damit auf medizinischen Bilder und Anwendungen sehr gute Erfolge. Wie sich in dieser Arbeit zeigen wird, ist die DGHT nicht auf medizinische Anwendungen beschränkt, hat allerdings Schwierigkeiten große Variabilität der Ziel-Objekte abzudecken, wie sie in Überwachungsszenarien zu erwarten sind. Genau wie alle GHT-basierten Ansätze leidet auch die DGHT unter dem Problem, dass lediglich die Anzahl an übereinstimmenden Model- und Merkmalspunkten ermittelt wird, was bedeutet, dass alle Modelpunkte unabhängig voneinander betrachtet werden. Dass Modelpunkte nicht unabhängig voneinander sind, wird im Laufe dieser Arbeit gezeigt werden, und die unabhängige Betrachtung führt gerade bei sehr variablen Zielobjekten zu einer hohen Fehlerrate. Dieses Problem wird in dieser Arbeit grundlegend untersucht und ein allgemeiner Lösungsansatz vorgestellt, welcher nicht nur für die DGHT sondern grundsätzlich für alle GHT-basierten Verfahren Anwendung finden kann. Die Lösung basiert auf der Integration eines zusätzlichen Klassifikators, welcher die gesamte Menge an übereinstimmenden Model- und Merkmalspunkten betrachtet und anhand dessen ein zusätzliches Konfidenzmaß vergibt. Dadurch konnte auf allen getesteten Datenbanken eine deutliche Reduktion der Fehlerrate erzielt werden von bis zu 94.9%. Darüber hinaus umfasst die Arbeit einen generellen Ansatz zur Kombination mehrere GHT-Model in einem tieferen Model. Dies kann dazu verwendet werden, um die Lokalisierungsergebnisse verschiedener Objekt-Landmarken zu kombinieren, z. B. die von Mund, Nase und Augen. Ähnlich wie auch bei Convolutional Neural Networks (CNNs) ist es damit möglich über mehrere Ebenen unterschiedliche Bereiche zu lokalisieren und somit die Variabilität des Zielobjektes in mehrere, leichter zu handhabenden Variabilitäten aufzuspalten. Abgeschlossen wird die Arbeit durch einen Vergleich von GHT-basierten Ansätzen mit CNNs und einer Beschreibung der Vor- und Nachteile und mögliche Einsatzfelder beider Verfahren

    3D Gaze Estimation from Remote RGB-D Sensors

    Get PDF
    The development of systems able to retrieve and characterise the state of humans is important for many applications and fields of study. In particular, as a display of attention and interest, gaze is a fundamental cue in understanding people activities, behaviors, intentions, state of mind and personality. Moreover, gaze plays a major role in the communication process, like for showing attention to the speaker, indicating who is addressed or averting gaze to keep the floor. Therefore, many applications within the fields of human-human, human-robot and human-computer interaction could benefit from gaze sensing. However, despite significant advances during more than three decades of research, current gaze estimation technologies can not address the conditions often required within these fields, such as remote sensing, unconstrained user movements and minimum user calibration. Furthermore, to reduce cost, it is preferable to rely on consumer sensors, but this usually leads to low resolution and low contrast images that current techniques can hardly cope with. In this thesis we investigate the problem of automatic gaze estimation under head pose variations, low resolution sensing and different levels of user calibration, including the uncalibrated case. We propose to build a non-intrusive gaze estimation system based on remote consumer RGB-D sensors. In this context, we propose algorithmic solutions which overcome many of the limitations of previous systems. We thus address the main aspects of this problem: 3D head pose tracking, 3D gaze estimation, and gaze based application modeling. First, we develop an accurate model-based 3D head pose tracking system which adapts to the participant without requiring explicit actions. Second, to achieve a head pose invariant gaze estimation, we propose a method to correct the eye image appearance variations due to head pose. We then investigate on two different methodologies to infer the 3D gaze direction. The first one builds upon machine learning regression techniques. In this context, we propose strategies to improve their generalization, in particular, to handle different people. The second methodology is a new paradigm we propose and call geometric generative gaze estimation. This novel approach combines the benefits of geometric eye modeling (normally restricted to high resolution images due to the difficulty of feature extraction) with a stochastic segmentation process (adapted to low-resolution) within a Bayesian model allowing the decoupling of user specific geometry and session specific appearance parameters, along with the introduction of priors, which are appropriate for adaptation relying on small amounts of data. The aforementioned gaze estimation methods are validated through extensive experiments in a comprehensive database which we collected and made publicly available. Finally, we study the problem of automatic gaze coding in natural dyadic and group human interactions. The system builds upon the thesis contributions to handle unconstrained head movements and the lack of user calibration. It further exploits the 3D tracking of participants and their gaze to conduct a 3D geometric analysis within a multi-camera setup. Experiments on real and natural interactions demonstrate the system is highly accuracy. Overall, the methods developed in this dissertation are suitable for many applications, involving large diversity in terms of setup configuration, user calibration and mobility

    Computationally efficient deformable 3D object tracking with a monocular RGB camera

    Get PDF
    182 p.Monocular RGB cameras are present in most scopes and devices, including embedded environments like robots, cars and home automation. Most of these environments have in common a significant presence of human operators with whom the system has to interact. This context provides the motivation to use the captured monocular images to improve the understanding of the operator and the surrounding scene for more accurate results and applications.However, monocular images do not have depth information, which is a crucial element in understanding the 3D scene correctly. Estimating the three-dimensional information of an object in the scene using a single two-dimensional image is already a challenge. The challenge grows if the object is deformable (e.g., a human body or a human face) and there is a need to track its movements and interactions in the scene.Several methods attempt to solve this task, including modern regression methods based on Deep NeuralNetworks. However, despite the great results, most are computationally demanding and therefore unsuitable for several environments. Computational efficiency is a critical feature for computationally constrained setups like embedded or onboard systems present in robotics and automotive applications, among others.This study proposes computationally efficient methodologies to reconstruct and track three-dimensional deformable objects, such as human faces and human bodies, using a single monocular RGB camera. To model the deformability of faces and bodies, it considers two types of deformations: non-rigid deformations for face tracking, and rigid multi-body deformations for body pose tracking. Furthermore, it studies their performance on computationally restricted devices like smartphones and onboard systems used in the automotive industry. The information extracted from such devices gives valuable insight into human behaviour a crucial element in improving human-machine interaction.We tested the proposed approaches in different challenging application fields like onboard driver monitoring systems, human behaviour analysis from monocular videos, and human face tracking on embedded devices

    An Analysis of Facial Expression Recognition Techniques

    Get PDF
    In present era of technology , we need applications which could be easy to use and are user-friendly , that even people with specific disabilities use them easily. Facial Expression Recognition has vital role and challenges in communities of computer vision, pattern recognition which provide much more attention due to potential application in many areas such as human machine interaction, surveillance , robotics , driver safety, non- verbal communication, entertainment, health- care and psychology study. Facial Expression Recognition has major importance ration in face recognition for significant image applications understanding and analysis. There are many algorithms have been implemented on different static (uniform background, identical poses, similar illuminations ) and dynamic (position variation, partial occlusion orientation, varying lighting )conditions. In general way face expression recognition consist of three main steps first is face detection then feature Extraction and at last classification. In this survey paper we discussed different types of facial expression recognition techniques and various methods which is used by them and their performance measures

    飛行ロボットにおける人間・ロボットインタラクションの実現に向けて : ユーザー同伴モデルとセンシングインターフェース

    Get PDF
    学位の種別: 課程博士審査委員会委員 : (主査)東京大学准教授 矢入 健久, 東京大学教授 堀 浩一, 東京大学教授 岩崎 晃, 東京大学教授 土屋 武司, 東京理科大学教授 溝口 博University of Tokyo(東京大学
    corecore