1,415 research outputs found

    Robust and real-time hand detection and tracking in monocular video

    Get PDF
    In recent years, personal computing devices such as laptops, tablets and smartphones have become ubiquitous. Moreover, intelligent sensors are being integrated into many consumer devices such as eyeglasses, wristwatches and smart televisions. With the advent of touchscreen technology, a new human-computer interaction (HCI) paradigm arose that allows users to interface with their device in an intuitive manner. Using simple gestures, such as swipe or pinch movements, a touchscreen can be used to directly interact with a virtual environment. Nevertheless, touchscreens still form a physical barrier between the virtual interface and the real world. An increasingly popular field of research that tries to overcome this limitation, is video based gesture recognition, hand detection and hand tracking. Gesture based interaction allows the user to directly interact with the computer in a natural manner by exploring a virtual reality using nothing but his own body language. In this dissertation, we investigate how robust hand detection and tracking can be accomplished under real-time constraints. In the context of human-computer interaction, real-time is defined as both low latency and low complexity, such that a complete video frame can be processed before the next one becomes available. Furthermore, for practical applications, the algorithms should be robust to illumination changes, camera motion, and cluttered backgrounds in the scene. Finally, the system should be able to initialize automatically, and to detect and recover from tracking failure. We study a wide variety of existing algorithms, and propose significant improvements and novel methods to build a complete detection and tracking system that meets these requirements. Hand detection, hand tracking and hand segmentation are related yet technically different challenges. Whereas detection deals with finding an object in a static image, tracking considers temporal information and is used to track the position of an object over time, throughout a video sequence. Hand segmentation is the task of estimating the hand contour, thereby separating the object from its background. Detection of hands in individual video frames allows us to automatically initialize our tracking algorithm, and to detect and recover from tracking failure. Human hands are highly articulated objects, consisting of finger parts that are connected with joints. As a result, the appearance of a hand can vary greatly, depending on the assumed hand pose. Traditional detection algorithms often assume that the appearance of the object of interest can be described using a rigid model and therefore can not be used to robustly detect human hands. Therefore, we developed an algorithm that detects hands by exploiting their articulated nature. Instead of resorting to a template based approach, we probabilistically model the spatial relations between different hand parts, and the centroid of the hand. Detecting hand parts, such as fingertips, is much easier than detecting a complete hand. Based on our model of the spatial configuration of hand parts, the detected parts can be used to obtain an estimate of the complete hand's position. To comply with the real-time constraints, we developed techniques to speed-up the process by efficiently discarding unimportant information in the image. Experimental results show that our method is competitive with the state-of-the-art in object detection while providing a reduction in computational complexity with a factor 1 000. Furthermore, we showed that our algorithm can also be used to detect other articulated objects such as persons or animals and is therefore not restricted to the task of hand detection. Once a hand has been detected, a tracking algorithm can be used to continuously track its position in time. We developed a probabilistic tracking method that can cope with uncertainty caused by image noise, incorrect detections, changing illumination, and camera motion. Furthermore, our tracking system automatically determines the number of hands in the scene, and can cope with hands entering or leaving the video canvas. We introduced several novel techniques that greatly increase tracking robustness, and that can also be applied in other domains than hand tracking. To achieve real-time processing, we investigated several techniques to reduce the search space of the problem, and deliberately employ methods that are easily parallelized on modern hardware. Experimental results indicate that our methods outperform the state-of-the-art in hand tracking, while providing a much lower computational complexity. One of the methods used by our probabilistic tracking algorithm, is optical flow estimation. Optical flow is defined as a 2D vector field describing the apparent velocities of objects in a 3D scene, projected onto the image plane. Optical flow is known to be used by many insects and birds to visually track objects and to estimate their ego-motion. However, most optical flow estimation methods described in literature are either too slow to be used in real-time applications, or are not robust to illumination changes and fast motion. We therefore developed an optical flow algorithm that can cope with large displacements, and that is illumination independent. Furthermore, we introduce a regularization technique that ensures a smooth flow-field. This regularization scheme effectively reduces the number of noisy and incorrect flow-vector estimates, while maintaining the ability to handle motion discontinuities caused by object boundaries in the scene. The above methods are combined into a hand tracking framework which can be used for interactive applications in unconstrained environments. To demonstrate the possibilities of gesture based human-computer interaction, we developed a new type of computer display. This display is completely transparent, allowing multiple users to perform collaborative tasks while maintaining eye contact. Furthermore, our display produces an image that seems to float in thin air, such that users can touch the virtual image with their hands. This floating imaging display has been showcased on several national and international events and tradeshows. The research that is described in this dissertation has been evaluated thoroughly by comparing detection and tracking results with those obtained by state-of-the-art algorithms. These comparisons show that the proposed methods outperform most algorithms in terms of accuracy, while achieving a much lower computational complexity, resulting in a real-time implementation. Results are discussed in depth at the end of each chapter. This research further resulted in an international journal publication; a second journal paper that has been submitted and is under review at the time of writing this dissertation; nine international conference publications; a national conference publication; a commercial license agreement concerning the research results; two hardware prototypes of a new type of computer display; and a software demonstrator

    Recognition of human interactions using limb-level feature points

    Get PDF
    Human activity recognition is an emerging area of research in computer vision with applications in video surveillance, human-computer interaction, robotics, and video annotation. Despite a number of recent advances, there are still many opportunities for new developments, especially in the area of person-person and person-object interaction. Many proposed algorithms focus on recognizing solely single person, person-person or person-object activities. An algorithm which can recognize all three types would be a significant step toward the real-world application of this technology. This thesis investigates the design and implementation of such an algorithm. It utilizes background subtraction to extract the subjects in the scene, and pixel clustering to segment their image into body parts. A location-based feature identification algorithm extracts feature points from these segments and feeds them to a classifier which identifies videos as activities. Together these techniques comprise an algorithm that can recognize single person, person-person and person-object interactions. This algorithm\u27s performance was evaluated based on interactions in a new video dataset, demonstrating the effectiveness of using limb-level feature points as a method of identifying human interactions

    Adaptive 3D facial action intensity estimation and emotion recognition

    Get PDF
    Automatic recognition of facial emotion has been widely studied for various computer vision tasks (e.g. health monitoring, driver state surveillance and personalized learning). Most existing facial emotion recognition systems, however, either have not fully considered subject-independent dynamic features or were limited to 2D models, thus are not robust enough for real-life recognition tasks with subject variation, head movement and illumination change. Moreover, there is also lack of systematic research on effective newly arrived novel emotion class detection. To address these challenges, we present a real-time 3D facial Action Unit (AU) intensity estimation and emotion recognition system. It automatically selects 16 motion-based facial feature sets using minimal-redundancy–maximal-relevance criterion based optimization and estimates the intensities of 16 diagnostic AUs using feedforward Neural Networks and Support Vector Regressors. We also propose a set of six novel adaptive ensemble classifiers for robust classification of the six basic emotions and the detection of newly arrived unseen novel emotion classes (emotions that are not included in the training set). A distance-based clustering and uncertainty measures of the base classifiers within each ensemble model are used to inform the novel class detection. Evaluated with the Bosphorus 3D database, the system has achieved the best performance of 0.071 overall Mean Squared Error (MSE) for AU intensity estimation using Support Vector Regressors, and 92.2% average accuracy for the recognition of the six basic emotions using the proposed ensemble classifiers. In comparison with other related work, our research outperforms other state-of-the-art research on 3D facial emotion recognition for the Bosphorus database. Moreover, in on-line real-time evaluation with real human subjects, the proposed system also shows superior real-time performance with 84% recognition accuracy and great flexibility and adaptation for newly arrived novel (e.g. ‘contempt’ which is not included in the six basic emotions) emotion detection

    Texture and Colour in Image Analysis

    Get PDF
    Research in colour and texture has experienced major changes in the last few years. This book presents some recent advances in the field, specifically in the theory and applications of colour texture analysis. This volume also features benchmarks, comparative evaluations and reviews

    Advances in video motion analysis research for mature and emerging application areas

    Get PDF

    Automatic Sign Language Recognition from Image Data

    Get PDF
    Tato práce se zabývá problematikou automatického rozpoznávání znakového jazyka z obrazových dat. Práce představuje pět hlavních přínosů v oblasti tvorby systému pro rozpoznávání, tvorby korpusů, extrakci příznaků z rukou a obličeje s využitím metod pro sledování pozice a pohybu rukou (tracking) a modelování znaků s využitím menších fonetických jednotek (sub-units). Metody využité v rozpoznávacím systému byly využity i k tvorbě vyhledávacího nástroje "search by example", který dokáže vyhledávat ve videozáznamech podle obrázku ruky. Navržený systém pro automatické rozpoznávání znakového jazyka je založen na statistickém přístupu s využitím skrytých Markovových modelů, obsahuje moduly pro analýzu video dat, modelování znaků a dekódování. Systém je schopen rozpoznávat jak izolované, tak spojité promluvy. Veškeré experimenty a vyhodnocení byly provedeny s vlastními korpusy UWB-06-SLR-A a UWB-07-SLR-P, první z nich obsahuje 25 znaků, druhý 378. Základní extrakce příznaků z video dat byla provedena na nízkoúrovňových popisech obrazu. Lepších výsledků bylo dosaženo s příznaky získaných z popisů vyšší úrovně porozumění obsahu v obraze, které využívají sledování pozice rukou a metodu pro segmentaci rukou v době překryvu s obličejem. Navíc, využitá metoda dokáže interpolovat obrazy s obličejem v době překryvu a umožňuje tak využít metody pro extrakci příznaků z obličeje, které by během překryvu nefungovaly, jako např. metoda active appearance models (AAM). Bylo porovnáno několik různých metod pro extrakci příznaků z rukou, jako např. local binary patterns (LBP), histogram of oriented gradients (HOG), vysokoúrovnové lingvistické příznaky a nové navržená metoda hand shape radial distance function (hRDF). Bylo také zkoumáno využití menších fonetických jednotek, než jsou celé znaky, tzv. sub-units. Pro první krok tvorby těchto jednotek byl navržen iterativní algoritmus, který tyto jednotky automaticky vytváří analýzou existujících dat. Bylo ukázáno, že tento koncept je vhodný pro modelování a rozpoznávání znaků. Kromě systému pro rozpoznávání je v práci navržen a představen systém "search by example", který funguje jako vyhledávací systém pro videa se záznamy znakového jazyka a může být využit například v online slovnících znakového jazyka, kde je v současné době složité či nemožné v takovýchto datech vyhledávat. Tento nástroj využívá metody, které byly použity v rozpoznávacím systému. Výstupem tohoto vyhledávacího nástroje je seřazený seznam videí, které obsahují stejný nebo podobný tvar ruky, které zadal uživatel, např. přes webkameru.Katedra kybernetikyObhájenoThis thesis addresses several issues of automatic sign language recognition, namely the creation of vision based sign language recognition framework, sign language corpora creation, feature extraction, making use of novel hand tracking with face occlusion handling, data-driven creation of sub-units and "search by example" tool for searching in sign language corpora using hand images as a search query. The proposed sign language recognition framework, based on statistical approach incorporating hidden Markov models (HMM), consists of video analysis, sign modeling and decoding modules. The framework is able to recognize both isolated signs and continuous utterances from video data. All experiments and evaluations were performed on two own corpora, UWB-06-SLR-A and UWB-07-SLR-P, the first containing 25 signs and second 378. As a baseline feature descriptors, low level image features are used. It is shown that better performance is gained by higher level features that employ hand tracking, which resolve occlusions of hands and face. As a side effect, the occlusion handling method interpolates face area in the frames during the occlusion and allows to use face feature descriptors that fail in such a case, for instance features extracted from active appearance models (AAM) tracker. Several state-of-the-art appearance-based feature descriptors were compared for tracked hands, such as local binary patterns (LBP), histogram of oriented gradients (HOG), high-level linguistic features or newly proposed hand shape radial distance function (denoted as hRDF) that enhances the feature description of hand-shape like concave regions. The concept of sub-units, that uses HMM models based on linguistic units smaller than whole sign and covers inner structures of the signs, was investigated in the proposed iterative method that is a first required step for data-driven construction of sub-units, and shows that such a concept is suitable for sign modeling and recognition tasks. Except of experiments in the sign language recognition, additional tool \textit{search by example} was created and evaluated. This tool is a search engine for sign language videos. Such a system can be incorporated into an online sign language dictionary where it is difficult to search in the sign language data. This proposed tool employs several methods which were examined in the sign language recognition task and allows to search in the video corpora based on an user-given query that consists of one or multiple images of hands. As a result, an ordered list of videos that contain the same or similar hand configurations is returned

    Irish Machine Vision and Image Processing Conference Proceedings 2017

    Get PDF

    Pattern Recognition

    Get PDF
    Pattern recognition is a very wide research field. It involves factors as diverse as sensors, feature extraction, pattern classification, decision fusion, applications and others. The signals processed are commonly one, two or three dimensional, the processing is done in real- time or takes hours and days, some systems look for one narrow object class, others search huge databases for entries with at least a small amount of similarity. No single person can claim expertise across the whole field, which develops rapidly, updates its paradigms and comprehends several philosophical approaches. This book reflects this diversity by presenting a selection of recent developments within the area of pattern recognition and related fields. It covers theoretical advances in classification and feature extraction as well as application-oriented works. Authors of these 25 works present and advocate recent achievements of their research related to the field of pattern recognition

    Human-Centric Machine Vision

    Get PDF
    Recently, the algorithms for the processing of the visual information have greatly evolved, providing efficient and effective solutions to cope with the variability and the complexity of real-world environments. These achievements yield to the development of Machine Vision systems that overcome the typical industrial applications, where the environments are controlled and the tasks are very specific, towards the use of innovative solutions to face with everyday needs of people. The Human-Centric Machine Vision can help to solve the problems raised by the needs of our society, e.g. security and safety, health care, medical imaging, and human machine interface. In such applications it is necessary to handle changing, unpredictable and complex situations, and to take care of the presence of humans
    corecore