
    Recovering 6D Object Pose: A Review and Multi-modal Analysis

    A large number of studies analyse object detection and pose estimation at the visual level in 2D, discussing the effects of challenges such as occlusion, clutter and texture on the performance of methods that work in the RGB modality. Interpreting the depth data as well, the study in this paper presents thorough multi-modal analyses. It discusses the above-mentioned challenges for full 6D object pose estimation in RGB-D images, comparing the performance of several 6D detectors in order to answer the following questions: What is the current position of the computer vision community for maintaining "automation" in robotic manipulation? What next steps should the community take for improving "autonomy" in robotics while handling objects? Our findings include: (i) reasonably accurate results are obtained on textured objects at varying viewpoints with cluttered backgrounds; (ii) a heavy presence of occlusion and clutter severely affects the detectors, and similar-looking distractors are the biggest challenge in recovering instances' 6D poses; (iii) template-based methods and random-forest-based learning algorithms underlie object detection and 6D pose estimation, while the recent paradigm is to learn deep discriminative feature representations and to adopt CNNs taking RGB images as input; (iv) given the availability of large-scale 6D-annotated depth datasets, feature representations can be learnt on these datasets, and the learnt representations can then be customized for the 6D problem.
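
    For readers unfamiliar with how such 6D detectors are scored, a common accuracy measure in this literature is the ADD error (the mean distance between object model points under the estimated and ground-truth poses), usually thresholded at 10% of the object diameter. Below is a minimal numpy sketch of that measure; it is background illustration only, not code from the review, and the function names are ours.

```python
import numpy as np

def add_error(R_est, t_est, R_gt, t_gt, model_points):
    """Average distance (ADD) between the object's model points transformed
    by the estimated and by the ground-truth 6D poses."""
    p_est = model_points @ R_est.T + t_est   # (N, 3) points under estimated pose
    p_gt = model_points @ R_gt.T + t_gt      # (N, 3) points under ground-truth pose
    return np.linalg.norm(p_est - p_gt, axis=1).mean()

def pose_is_correct(R_est, t_est, R_gt, t_gt, model_points, diameter, k=0.1):
    # A pose is commonly accepted when the ADD error is below k * object diameter.
    return add_error(R_est, t_est, R_gt, t_gt, model_points) < k * diameter
```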

    Understanding egocentric human actions with temporal decision forests

    Understanding human actions is a fundamental task in computer vision with a wide range of applications including pervasive health-care, robotics and game control. This thesis focuses on the problem of egocentric action recognition from RGB-D data, wherein the world is viewed through the eyes of the actor whose hands describe the actions. The main contributions of this work are its findings regarding egocentric actions as described by hands in two application scenarios and a proposal of a new technique based on temporal decision forests. The thesis first introduces a novel framework to recognise fingertip writing in mid-air in the context of human-computer interaction. This framework detects whether the user is writing and tracks the fingertip over time to generate spatio-temporal trajectories that are recognised by using a Hough forest variant that encourages temporal consistency in prediction. A problem with using such a forest approach for action recognition is that the learning of temporal dynamics is limited to hand-crafted temporal features and temporal regression, which may break the temporal continuity and lead to inconsistent predictions. To overcome this limitation, the thesis proposes transition forests. Besides any temporal information that is encoded in the feature space, the forest automatically learns the temporal dynamics during training, and these are exploited at inference time in an online and efficient manner, achieving state-of-the-art results. The last contribution of this thesis is its introduction of the first RGB-D benchmark to allow for the study of egocentric hand-object actions with both hand and object pose annotations. This study conducts an extensive evaluation of different baselines, state-of-the-art approaches and temporal decision forest models using colour, depth and hand pose features. Furthermore, it extends the transition forest model to incorporate data from different modalities and demonstrates the benefit of using hand pose features to recognise egocentric human actions. The thesis concludes by discussing and analysing the contributions and proposing a few ideas for future work. Open Access
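
    As a rough intuition for why learning temporal dynamics helps over purely frame-by-frame forest predictions, the sketch below filters per-frame class probabilities with a transition matrix in an online, HMM-like fashion. It is a deliberately simplified stand-in, not the transition forest model proposed in the thesis, and all names (frame_probs, transition, prior) are hypothetical.

```python
import numpy as np

def online_temporal_filter(frame_probs, transition, prior):
    """Fuse per-frame action probabilities (e.g. from a decision forest)
    with a class-transition matrix, one frame at a time (online)."""
    belief = prior.astype(float)
    smoothed = []
    for p in frame_probs:               # p: (num_classes,) per-frame probabilities
        belief = transition.T @ belief  # propagate temporal dynamics
        belief = belief * p             # fuse the current frame's evidence
        belief = belief / belief.sum()  # renormalise
        smoothed.append(belief.copy())
    return np.array(smoothed)
```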

    Multi-layer random forests with auto-context for object detection and pose estimation

    Deep neural networks are multi-layer architectures of neural networks used in machine learning. In recent years, training such systems through Deep Learning techniques has shown remarkable improvements over more traditional techniques in the domains of image-based classification and object category recognition. However, a well-known downside of Deep Learning is that, generally, a very large number of training samples is required to train these very-high-dimensional systems. It is a common understanding that the multi-level structure of data processing that is learned and applied may be a critical factor in the success of deep neural networks. Highly task-optimized representations may be gradually built through a sequence of transformations that can abstract and hence generalize from the specific data instances. In this Master's thesis, the aspect of deep multi-layered data transformation is transferred from neural networks onto a very different learning scheme, the Random Forest. The latter is a highly competitive and general-purpose method for classification and regression. Even large Random Forests usually do not have the large number of parameters to optimize during training, and the number of hyper-parameters and architectural varieties is much lower than for deep neural networks. Training with a much smaller number of samples is hence possible. Several variants of a layered architecture of Random Forests are investigated here, and different styles of training each forest are tried. The specific problem domain considered here is object detection and pose estimation from RGB-D images. Hence, the forest output is used by the pose hypothesis scoring function, and the poses are optimized using a RANSAC-based scheme. Experiments on the pose-annotated Hinterstoisser and T-LESS datasets demonstrate the performance of the multi-layer random forest architecture.
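
    As a minimal sketch of the auto-context idea behind a multi-layer forest, each layer below receives the raw features concatenated with the previous layer's class-probability output. This is a generic, classification-only illustration using scikit-learn; the thesis's actual layers operate on RGB-D patches and feed a RANSAC-based pose optimiser, and the function names here are ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_autocontext_forests(X, y, num_layers=3, **forest_kwargs):
    """Stack random forests: layer k sees the raw features plus the
    class-probability output of layer k-1 (the auto-context signal)."""
    layers, feats = [], X
    for _ in range(num_layers):
        rf = RandomForestClassifier(**forest_kwargs).fit(feats, y)
        layers.append(rf)
        feats = np.hstack([X, rf.predict_proba(feats)])
    return layers

def predict_autocontext(layers, X):
    feats = X
    for rf in layers[:-1]:
        feats = np.hstack([X, rf.predict_proba(feats)])
    return layers[-1].predict_proba(feats)
```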

    Learning to Predict Dense Correspondences for 6D Pose Estimation

    Object pose estimation is an important problem in computer vision with applications in robotics, augmented reality and many other areas. An established strategy for object pose estimation consists of, firstly, finding correspondences between the image and the object's reference frame, and, secondly, estimating the pose from outlier-free correspondences using Random Sample Consensus (RANSAC). The first step, namely finding correspondences, is difficult because object appearance varies depending on perspective, lighting and many other factors. Traditionally, correspondences have been established using handcrafted methods like sparse feature pipelines. In this thesis, we introduce a dense correspondence representation for objects, called object coordinates, which can be learned. By learning object coordinates, our pose estimation pipeline adapts to various aspects of the task at hand. It works well for diverse object types, from small objects to entire rooms, varying object attributes, like textured or texture-less objects, and different input modalities, like RGB-D or RGB images. The concept of object coordinates allows us to easily model and exploit uncertainty as part of the pipeline such that even repeating structures or areas with little texture can contribute to a good solution. Although we can train object coordinate predictors independent of the full pipeline and achieve good results, training the pipeline in an end-to-end fashion is desirable. It enables the object coordinate predictor to adapt its output to the specificities of following steps in the pose estimation pipeline. Unfortunately, the RANSAC component of the pipeline is non-differentiable which prohibits end-to-end training. Adopting techniques from reinforcement learning, we introduce Differentiable Sample Consensus (DSAC), a formulation of RANSAC which allows us to train the pose estimation pipeline in an end-to-end fashion by minimizing the expectation of the final pose error.
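
    The central trick behind DSAC can be illustrated in a few lines of numpy: instead of RANSAC's hard, non-differentiable argmax over pose hypotheses, hypotheses are selected with softmax probabilities derived from their scores, and the expected pose error under that distribution is what gets minimised end-to-end. The sketch below is illustrative only; the scores, losses and names are hypothetical.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def expected_pose_loss(hypothesis_scores, hypothesis_losses):
    """DSAC-style objective: probabilistic hypothesis selection makes the
    expectation of the final pose error differentiable w.r.t. the scores
    (and, through them, the learned object coordinate predictor)."""
    p = softmax(hypothesis_scores)           # hypothesis selection probabilities
    return float(np.sum(p * hypothesis_losses))

# Example with three pose hypotheses: scores from an inlier-counting-like
# function and pose errors w.r.t. ground truth (hypothetical numbers).
print(expected_pose_loss(np.array([2.0, 0.5, -1.0]), np.array([0.1, 0.4, 0.9])))
```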

    Characterizing Objects in Images using Human Context

    Humans have an unmatched capability of interpreting detailed information about the objects present in an image just by looking at it. In particular, they can effortlessly perform the following tasks: 1) localizing various objects in the image and 2) assigning functionalities to the parts of localized objects. This dissertation addresses the problem of aiding vision systems to accomplish these two goals. The first part of the dissertation concerns object detection in a Hough-based framework. To this end, the independence assumption between features is addressed by grouping them in a local neighborhood. We study the complementary nature of individual and grouped features and combine them to achieve improved performance. Further, we consider the challenging case of detecting small and medium-sized household objects under human-object interactions. We first evaluate appearance-based star and tree models. While the tree model is slightly better, appearance-based methods continue to suffer from deficiencies caused by human interactions. To this end, we successfully incorporate automatically extracted human pose as a form of context for object detection. The second part of the dissertation addresses the tedious process of manually annotating objects to train fully supervised detectors. We observe that videos of human-object interactions with activity labels can serve as weakly annotated examples of household objects. Since such objects cannot be localized only through appearance or motion, we propose a framework that includes human-centric functionality to retrieve the common object. Designed to maximize data utility by detecting multiple instances of an object per video, the framework achieves performance comparable to its fully supervised counterpart. The final part of the dissertation concerns localizing functional regions, or affordances, within objects by casting the problem as one of semantic image segmentation. To this end, we introduce a dataset involving human-object interactions with strong (i.e., pixel-level) and weak (i.e., click-point and image-level) affordance annotations. We propose a framework that utilizes both forms of weak labels and demonstrate that efforts for weak annotation can be further optimized using human context.
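
    To make the Hough-based framework mentioned above concrete, here is a generic sketch of voting-based object localisation: each local feature casts a weighted vote for the object centre at its position plus a learned offset, and the accumulator peak gives the detection. The feature grouping and human-pose context contributed by the dissertation are not reproduced here, and all names are illustrative.

```python
import numpy as np

def hough_vote_object_center(feature_positions, offsets, weights, image_shape, bin_size=4):
    """Cast weighted votes from local features into an accumulator over
    candidate object centres; the strongest bin is the detection hypothesis."""
    acc = np.zeros((image_shape[0] // bin_size + 1, image_shape[1] // bin_size + 1))
    for (y, x), (dy, dx), w in zip(feature_positions, offsets, weights):
        cy, cx = int((y + dy) // bin_size), int((x + dx) // bin_size)
        if 0 <= cy < acc.shape[0] and 0 <= cx < acc.shape[1]:
            acc[cy, cx] += w
    peak = np.unravel_index(acc.argmax(), acc.shape)
    return (peak[0] * bin_size, peak[1] * bin_size), acc
```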

    Utilizing Synthetic Data for 3D Hand Pose Estimation

    Doctoral dissertation, Seoul National University Graduate School of Convergence Science and Technology, August 2021. 3D hand pose estimation (HPE) based on RGB images has been studied for a long time. Relevant methods have focused mainly on optimization of neural frameworks for graphically connected finger joints. RGB-based HPE models have not been easy to train because of the scarcity of RGB hand pose datasets; unlike human body pose datasets, the finger joints that span hand postures are structured delicately and exquisitely. Such structure makes accurately annotating each joint with unique 3D world coordinates difficult, which is why many conventional methods rely on synthetic data samples to cover large variations of hand postures. A synthetic dataset provides very precise ground-truth annotations and further allows control over the variety of data samples, allowing a learning model to be trained over a large pose space. Most studies, however, have performed frame-by-frame estimation based on independent static images. Synthetic visual data can provide practically infinite diversity and rich labels, while avoiding ethical issues with privacy and bias. However, for many tasks, current models trained on synthetic data generalize poorly to real data. The task of 3D human hand pose estimation is a particularly interesting example of this synthetic-to-real problem, because learning-based approaches perform reasonably well given real training data, yet labeled 3D poses are extremely difficult to obtain in the wild, limiting scalability. In this dissertation, we attempt not only to consider the appearance of a hand but also to incorporate the temporal movement information of a hand in motion into the learning framework for better 3D hand pose estimation performance, which leads to the necessity of a large-scale dataset with sequential RGB hand images. We propose a novel method that generates a synthetic dataset that mimics natural human hand movements by re-engineering the annotations of an extant static hand pose dataset into pose-flows. With the generated dataset, we train a newly proposed recurrent framework, exploiting visuo-temporal features from sequential images of synthetic hands in motion and emphasizing temporal smoothness of estimations with a temporal consistency constraint. Our novel training strategy of detaching the recurrent layer of the framework during domain finetuning from synthetic to real allows preservation of the visuo-temporal features learned from sequential synthetic hand images. Hand poses that are sequentially estimated consequently produce natural and smooth hand movements, which leads to more robust estimations. We show that utilizing temporal information for 3D hand pose estimation significantly enhances general pose estimation by outperforming state-of-the-art methods in experiments on hand pose estimation benchmarks. Since a fixed dataset provides a finite distribution of data samples, the generalization of a learned pose estimation network is limited in terms of pose, RGB and viewpoint spaces. We further propose to augment the data automatically such that the augmented pose sampling is performed in favor of the pose estimator's generalization performance. Such auto-augmentation of poses is performed within a learned feature space in order to avoid the computational burden of generating synthetic samples at every update iteration.
The proposed effort can be considered as generating and utilizing synthetic samples for network training in the feature space. This allows training efficiency by requiring fewer real data samples, enhanced generalization across multiple dataset domains, and better estimation performance owing to efficient augmentation. Korean abstract: Research on recognising the shape and pose of a human hand in 2D images aims to detect the 3D position of each finger joint. A hand pose is made up of the finger joints, i.e., the anatomical elements of the human hand from the wrist joint through the MCP, PIP and DIP joints. Hand pose information can be exploited in many areas and serves as a highly effective input feature for hand gesture recognition. Applying hand pose estimation in real systems requires high accuracy, real-time operation and models light enough to run on a variety of devices, and training such neural network models requires a large amount of data. However, devices for measuring hand pose are fairly unstable, and images captured while wearing such devices look very different from bare hands, making them unsuitable for training. This dissertation therefore re-engineers and augments synthetically generated data for training in order to obtain better results. Synthetic hand images resemble real skin colour but differ considerably in fine texture, so models trained on synthetic data perform markedly worse on real hand data. To reduce this domain gap, we first teach the network the structure of hand movement by re-engineering hand motions, then finetune on real hand images while preserving the learned temporal information, which proves highly effective; for this we propose a method for mimicking natural human hand motion. Second, we align the two domains in the network's feature space, and instead of augmenting synthetic poses from a fixed set we model pose sampling probabilistically so that poses the network has rarely seen are generated. In this way the dissertation proposes methods that use synthetic data more effectively, without the labour of collecting hard-to-annotate real data, and that improve pose estimation performance by exploiting spatial and temporal features, together with an automatic data augmentation scheme in which the network itself finds the data it needs to learn from. Combining these methods improves hand pose estimation performance. Contents: 1. Introduction; 2. Related Works; 3. Preliminaries: 3D Hand Mesh Model; 4. SeqHAND: RGB-sequence-based 3D Hand Pose and Shape Estimation; 5. Hand Pose Auto-Augment; 6. Conclusion.
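
    A term of the kind used as a temporal consistency constraint can be sketched in a few lines of PyTorch: penalise frame-to-frame jumps in the sequentially estimated 3D joints so that predictions stay smooth over a motion sequence. The exact formulation and weighting in the dissertation may differ; the names below are ours.

```python
import torch

def temporal_consistency_loss(joint_seq):
    """joint_seq: (T, J, 3) tensor of per-frame 3D hand joint predictions.
    Penalises squared frame-to-frame displacement, encouraging smooth,
    natural hand motion across the sequence."""
    diffs = joint_seq[1:] - joint_seq[:-1]      # (T-1, J, 3)
    return (diffs ** 2).sum(dim=-1).mean()

# Usage sketch: total_loss = pose_loss + lambda_temporal * temporal_consistency_loss(pred)
```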

    Robust and real-time hand detection and tracking in monocular video

    In recent years, personal computing devices such as laptops, tablets and smartphones have become ubiquitous. Moreover, intelligent sensors are being integrated into many consumer devices such as eyeglasses, wristwatches and smart televisions. With the advent of touchscreen technology, a new human-computer interaction (HCI) paradigm arose that allows users to interface with their device in an intuitive manner. Using simple gestures, such as swipe or pinch movements, a touchscreen can be used to directly interact with a virtual environment. Nevertheless, touchscreens still form a physical barrier between the virtual interface and the real world. An increasingly popular field of research that tries to overcome this limitation is video-based gesture recognition, hand detection and hand tracking. Gesture-based interaction allows the user to directly interact with the computer in a natural manner by exploring a virtual reality using nothing but their own body language. In this dissertation, we investigate how robust hand detection and tracking can be accomplished under real-time constraints. In the context of human-computer interaction, real-time is defined as both low latency and low complexity, such that a complete video frame can be processed before the next one becomes available. Furthermore, for practical applications, the algorithms should be robust to illumination changes, camera motion, and cluttered backgrounds in the scene. Finally, the system should be able to initialize automatically, and to detect and recover from tracking failure. We study a wide variety of existing algorithms, and propose significant improvements and novel methods to build a complete detection and tracking system that meets these requirements. Hand detection, hand tracking and hand segmentation are related yet technically different challenges. Whereas detection deals with finding an object in a static image, tracking considers temporal information and is used to track the position of an object over time, throughout a video sequence. Hand segmentation is the task of estimating the hand contour, thereby separating the object from its background. Detection of hands in individual video frames allows us to automatically initialize our tracking algorithm, and to detect and recover from tracking failure. Human hands are highly articulated objects, consisting of finger parts that are connected with joints. As a result, the appearance of a hand can vary greatly, depending on the assumed hand pose. Traditional detection algorithms often assume that the appearance of the object of interest can be described using a rigid model and therefore cannot be used to robustly detect human hands. Therefore, we developed an algorithm that detects hands by exploiting their articulated nature. Instead of resorting to a template-based approach, we probabilistically model the spatial relations between different hand parts and the centroid of the hand. Detecting hand parts, such as fingertips, is much easier than detecting a complete hand. Based on our model of the spatial configuration of hand parts, the detected parts can be used to obtain an estimate of the complete hand's position. To comply with the real-time constraints, we developed techniques to speed up the process by efficiently discarding unimportant information in the image. Experimental results show that our method is competitive with the state-of-the-art in object detection while providing a reduction in computational complexity by a factor of 1,000.
Furthermore, we showed that our algorithm can also be used to detect other articulated objects such as persons or animals and is therefore not restricted to the task of hand detection. Once a hand has been detected, a tracking algorithm can be used to continuously track its position in time. We developed a probabilistic tracking method that can cope with uncertainty caused by image noise, incorrect detections, changing illumination, and camera motion. Furthermore, our tracking system automatically determines the number of hands in the scene, and can cope with hands entering or leaving the video canvas. We introduced several novel techniques that greatly increase tracking robustness, and that can also be applied in domains other than hand tracking. To achieve real-time processing, we investigated several techniques to reduce the search space of the problem, and deliberately employ methods that are easily parallelized on modern hardware. Experimental results indicate that our methods outperform the state-of-the-art in hand tracking, while providing a much lower computational complexity. One of the methods used by our probabilistic tracking algorithm is optical flow estimation. Optical flow is defined as a 2D vector field describing the apparent velocities of objects in a 3D scene, projected onto the image plane. Optical flow is known to be used by many insects and birds to visually track objects and to estimate their ego-motion. However, most optical flow estimation methods described in the literature are either too slow to be used in real-time applications, or are not robust to illumination changes and fast motion. We therefore developed an optical flow algorithm that can cope with large displacements, and that is illumination-independent. Furthermore, we introduce a regularization technique that ensures a smooth flow-field. This regularization scheme effectively reduces the number of noisy and incorrect flow-vector estimates, while maintaining the ability to handle motion discontinuities caused by object boundaries in the scene. The above methods are combined into a hand tracking framework which can be used for interactive applications in unconstrained environments. To demonstrate the possibilities of gesture-based human-computer interaction, we developed a new type of computer display. This display is completely transparent, allowing multiple users to perform collaborative tasks while maintaining eye contact. Furthermore, our display produces an image that seems to float in thin air, such that users can touch the virtual image with their hands. This floating imaging display has been showcased at several national and international events and trade shows. The research that is described in this dissertation has been evaluated thoroughly by comparing detection and tracking results with those obtained by state-of-the-art algorithms. These comparisons show that the proposed methods outperform most algorithms in terms of accuracy, while achieving a much lower computational complexity, resulting in a real-time implementation. Results are discussed in depth at the end of each chapter. This research further resulted in an international journal publication; a second journal paper that has been submitted and is under review at the time of writing this dissertation; nine international conference publications; a national conference publication; a commercial license agreement concerning the research results; two hardware prototypes of a new type of computer display; and a software demonstrator.
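
    As one concrete building block for the kind of tracking pipeline described above, the sketch below tracks hand feature points between consecutive frames with pyramidal Lucas-Kanade optical flow from OpenCV. The dissertation's own flow method additionally handles large displacements, illumination changes and flow-field regularisation, which this generic sketch does not; the function name is ours.

```python
import cv2
import numpy as np

def track_hand_points(prev_gray, next_gray, prev_points):
    """Track hand points between two grayscale frames with pyramidal
    Lucas-Kanade optical flow. prev_points: float32 array of shape (N, 1, 2)."""
    next_points, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, prev_points, None,
        winSize=(21, 21), maxLevel=3)
    good = status.ravel() == 1
    return next_points[good], prev_points[good]

# Usage sketch: seed prev_points inside the detected hand region, e.g. with
# cv2.goodFeaturesToTrack(prev_gray, maxCorners=100, qualityLevel=0.01, minDistance=5),
# and re-run the detector to re-initialise whenever too few points survive.
```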