784 research outputs found

    Visual Concept Detection in Images and Videos

    Get PDF
    The rapidly increasing proliferation of digital images and videos leads to a situation where content-based search in multimedia databases becomes more and more important. A prerequisite for effective image and video search is to analyze and index media content automatically. Current approaches in the field of image and video retrieval focus on semantic concepts serving as an intermediate description to bridge the “semantic gap” between the data representation and the human interpretation. Due to the large complexity and variability in the appearance of visual concepts, the detection of arbitrary concepts represents a very challenging task. In this thesis, the following aspects of visual concept detection systems are addressed: First, enhanced local descriptors for mid-level feature coding are presented. Based on the observation that scale-invariant feature transform (SIFT) descriptors with different spatial extents yield large performance differences, a novel concept detection system is proposed that combines feature representations for different spatial extents using multiple kernel learning (MKL). A multi-modal video concept detection system is presented that relies on Bag-of-Words representations for visual and in particular for audio features. Furthermore, a method for the SIFT-based integration of color information, called color moment SIFT, is introduced. Comparative experimental results demonstrate the superior performance of the proposed systems on the Mediamill and on the VOC Challenge. Second, an approach is presented that systematically utilizes results of object detectors. Novel object-based features are generated based on object detection results using different pooling strategies. For videos, detection results are assembled to object sequences and a shot-based confidence score as well as further features, such as position, frame coverage or movement, are computed for each object class. These features are used as additional input for the support vector machine (SVM)-based concept classifiers. Thus, other related concepts can also profit from object-based features. Extensive experiments on the Mediamill, VOC and TRECVid Challenge show significant improvements in terms of retrieval performance not only for the object classes, but also in particular for a large number of indirectly related concepts. Moreover, it has been demonstrated that a few object-based features are beneficial for a large number of concept classes. On the VOC Challenge, the additional use of object-based features led to a superior performance for the image classification task of 63.8% mean average precision (AP). Furthermore, the generalization capabilities of concept models are investigated. It is shown that different source and target domains lead to a severe loss in concept detection performance. In these cross-domain settings, object-based features achieve a significant performance improvement. Since it is inefficient to run a large number of single-class object detectors, it is additionally demonstrated how a concurrent multi-class object detection system can be constructed to speed up the detection of many object classes in images. Third, a novel, purely web-supervised learning approach for modeling heterogeneous concept classes in images is proposed. Tags and annotations of multimedia data in the WWW are rich sources of information that can be employed for learning visual concepts. The presented approach is aimed at continuous long-term learning of appearance models and improving these models periodically. For this purpose, several components have been developed: a crawling component, a multi-modal clustering component for spam detection and subclass identification, a novel learning component, called “random savanna”, a validation component, an updating component, and a scalability manager. Only a single word describing the visual concept is required to initiate the learning process. Experimental results demonstrate the capabilities of the individual components. Finally, a generic concept detection system is applied to support interdisciplinary research efforts in the field of psychology and media science. The psychological research question addressed in the field of behavioral sciences is, whether and how playing violent content in computer games may induce aggression. Therefore, novel semantic concepts most notably “violence” are detected in computer game videos to gain insights into the interrelationship of violent game events and the brain activity of a player. Experimental results demonstrate the excellent performance of the proposed automatic concept detection approach for such interdisciplinary research

    Robust and real-time hand detection and tracking in monocular video

    Get PDF
    In recent years, personal computing devices such as laptops, tablets and smartphones have become ubiquitous. Moreover, intelligent sensors are being integrated into many consumer devices such as eyeglasses, wristwatches and smart televisions. With the advent of touchscreen technology, a new human-computer interaction (HCI) paradigm arose that allows users to interface with their device in an intuitive manner. Using simple gestures, such as swipe or pinch movements, a touchscreen can be used to directly interact with a virtual environment. Nevertheless, touchscreens still form a physical barrier between the virtual interface and the real world. An increasingly popular field of research that tries to overcome this limitation, is video based gesture recognition, hand detection and hand tracking. Gesture based interaction allows the user to directly interact with the computer in a natural manner by exploring a virtual reality using nothing but his own body language. In this dissertation, we investigate how robust hand detection and tracking can be accomplished under real-time constraints. In the context of human-computer interaction, real-time is defined as both low latency and low complexity, such that a complete video frame can be processed before the next one becomes available. Furthermore, for practical applications, the algorithms should be robust to illumination changes, camera motion, and cluttered backgrounds in the scene. Finally, the system should be able to initialize automatically, and to detect and recover from tracking failure. We study a wide variety of existing algorithms, and propose significant improvements and novel methods to build a complete detection and tracking system that meets these requirements. Hand detection, hand tracking and hand segmentation are related yet technically different challenges. Whereas detection deals with finding an object in a static image, tracking considers temporal information and is used to track the position of an object over time, throughout a video sequence. Hand segmentation is the task of estimating the hand contour, thereby separating the object from its background. Detection of hands in individual video frames allows us to automatically initialize our tracking algorithm, and to detect and recover from tracking failure. Human hands are highly articulated objects, consisting of finger parts that are connected with joints. As a result, the appearance of a hand can vary greatly, depending on the assumed hand pose. Traditional detection algorithms often assume that the appearance of the object of interest can be described using a rigid model and therefore can not be used to robustly detect human hands. Therefore, we developed an algorithm that detects hands by exploiting their articulated nature. Instead of resorting to a template based approach, we probabilistically model the spatial relations between different hand parts, and the centroid of the hand. Detecting hand parts, such as fingertips, is much easier than detecting a complete hand. Based on our model of the spatial configuration of hand parts, the detected parts can be used to obtain an estimate of the complete hand's position. To comply with the real-time constraints, we developed techniques to speed-up the process by efficiently discarding unimportant information in the image. Experimental results show that our method is competitive with the state-of-the-art in object detection while providing a reduction in computational complexity with a factor 1 000. Furthermore, we showed that our algorithm can also be used to detect other articulated objects such as persons or animals and is therefore not restricted to the task of hand detection. Once a hand has been detected, a tracking algorithm can be used to continuously track its position in time. We developed a probabilistic tracking method that can cope with uncertainty caused by image noise, incorrect detections, changing illumination, and camera motion. Furthermore, our tracking system automatically determines the number of hands in the scene, and can cope with hands entering or leaving the video canvas. We introduced several novel techniques that greatly increase tracking robustness, and that can also be applied in other domains than hand tracking. To achieve real-time processing, we investigated several techniques to reduce the search space of the problem, and deliberately employ methods that are easily parallelized on modern hardware. Experimental results indicate that our methods outperform the state-of-the-art in hand tracking, while providing a much lower computational complexity. One of the methods used by our probabilistic tracking algorithm, is optical flow estimation. Optical flow is defined as a 2D vector field describing the apparent velocities of objects in a 3D scene, projected onto the image plane. Optical flow is known to be used by many insects and birds to visually track objects and to estimate their ego-motion. However, most optical flow estimation methods described in literature are either too slow to be used in real-time applications, or are not robust to illumination changes and fast motion. We therefore developed an optical flow algorithm that can cope with large displacements, and that is illumination independent. Furthermore, we introduce a regularization technique that ensures a smooth flow-field. This regularization scheme effectively reduces the number of noisy and incorrect flow-vector estimates, while maintaining the ability to handle motion discontinuities caused by object boundaries in the scene. The above methods are combined into a hand tracking framework which can be used for interactive applications in unconstrained environments. To demonstrate the possibilities of gesture based human-computer interaction, we developed a new type of computer display. This display is completely transparent, allowing multiple users to perform collaborative tasks while maintaining eye contact. Furthermore, our display produces an image that seems to float in thin air, such that users can touch the virtual image with their hands. This floating imaging display has been showcased on several national and international events and tradeshows. The research that is described in this dissertation has been evaluated thoroughly by comparing detection and tracking results with those obtained by state-of-the-art algorithms. These comparisons show that the proposed methods outperform most algorithms in terms of accuracy, while achieving a much lower computational complexity, resulting in a real-time implementation. Results are discussed in depth at the end of each chapter. This research further resulted in an international journal publication; a second journal paper that has been submitted and is under review at the time of writing this dissertation; nine international conference publications; a national conference publication; a commercial license agreement concerning the research results; two hardware prototypes of a new type of computer display; and a software demonstrator

    Coupled Action Recognition and Pose Estimation from Multiple Views

    Get PDF
    Action recognition and pose estimation are two closely related topics in understanding human body movements; information from one task can be leveraged to assist the other, yet the two are often treated separately. We present here a framework for coupled action recognition and pose estimation by formulating pose estimation as an optimization over a set of action-specific manifolds. The framework allows for integration of a 2D appearance-based action recognition system as a prior for 3D pose estimation and for refinement of the action labels using relational pose features based on the extracted 3D poses. Our experiments show that our pose estimation system is able to estimate body poses with high degrees of freedom using very few particles and can achieve state-of-the-art results on the HumanEva-II benchmark. We also thoroughly investigate the impact of pose estimation and action recognition accuracy on each other on the challenging TUM kitchen dataset. We demonstrate not only the feasibility of using extracted 3D poses for action recognition, but also improved performance in comparison to action recognition using low-level appearance feature

    Active recognition and pose estimation of rigid and deformable objects in 3D space

    Get PDF
    Object recognition and pose estimation is a fundamental problem in computer vision and of utmost importance in robotic applications. Object recognition refers to the problem of recognizing certain object instances, or categorizing objects into specific classes. Pose estimation deals with estimating the exact position of the object in 3D space, usually expressed in Euler angles. There are generally two types of objects that require special care when designing solutions to the aforementioned problems: rigid and deformable. Dealing with deformable objects has been a much harder problem, and usually solutions that apply to rigid objects, fail when used for deformable objects due to the inherent assumptions made during the design. In this thesis we deal with object categorization, instance recognition and pose estimation of both rigid and deformable objects. In particular, we are interested in a special type of deformable objects, clothes. We tackle the problem of autonomously recognizing and unfolding articles of clothing using a dual manipulator. This problem consists of grasping an article from a random point, recognizing it and then bringing it into an unfolded state by a dual arm robot. We propose a data-driven method for clothes recognition from depth images using Random Decision Forests. We also propose a method for unfolding an article of clothing after estimating and grasping two key-points, using Hough Forests. Both methods are implemented into a POMDP framework allowing the robot to interact optimally with the garments, taking into account uncertainty in the recognition and point estimation process. This active recognition and unfolding makes our system very robust to noisy observations. Our methods were tested on regular-sized clothes using a dual-arm manipulator. Our systems perform better in both accuracy and speed compared to state-of-the-art approaches. In order to take advantage of the robotic manipulator and increase the accuracy of our system, we developed a novel approach to address generic active vision problems, called Active Random Forests. While state of the art focuses on best viewing parameters selection based on single view classifiers, we propose a multi-view classifier where the decision mechanism of optimally changing viewing parameters is inherent to the classification process. This has many advantages: a) the classifier exploits the entire set of captured images and does not simply aggregate probabilistically per view hypotheses; b) actions are based on learnt disambiguating features from all views and are optimally selected using the powerful voting scheme of Random Forests and c) the classifier can take into account the costs of actions. The proposed framework was applied to the same task of autonomously unfolding clothes by a robot, addressing the problem of best viewpoint selection in classification, grasp point and pose estimation of garments. We show great performance improvement compared to state of the art methods and our previous POMDP formulation. Moving from deformable to rigid objects while keeping our interest to domestic robotic applications, we focus on object instance recognition and 3D pose estimation of household objects. We are particularly interested in realistic scenes that are very crowded and objects can be perceived under severe occlusions. Single shot-based 6D pose estimators with manually designed features are still unable to tackle such difficult scenarios for a variety of objects, motivating the research towards unsupervised feature learning and next-best-view estimation. We present a complete framework for both single shot-based 6D object pose estimation and next-best-view prediction based on Hough Forests, the state of the art object pose estimator that performs classification and regression jointly. Rather than using manually designed features we propose an unsupervised feature learnt from depth-invariant patches using a Sparse Autoencoder. Furthermore, taking advantage of the clustering performed in the leaf nodes of Hough Forests, we learn to estimate the reduction of uncertainty in other views, formulating the problem of selecting the next-best-view. To further improve 6D object pose estimation, we propose an improved joint registration and hypotheses verification module as a final refinement step to reject false detections. We provide two additional challenging datasets inspired from realistic scenarios to extensively evaluate the state of the art and our framework. One is related to domestic environments and the other depicts a bin-picking scenario mostly found in industrial settings. We show that our framework significantly outperforms state of the art both on public and on our datasets. Unsupervised feature learning, although efficient, might produce sub-optimal features for our particular tast. Therefore in our last work, we leverage the power of Convolutional Neural Networks to tackled the problem of estimating the pose of rigid objects by an end-to-end deep regression network. To improve the moderate performance of the standard regression objective function, we introduce the Siamese Regression Network. For a given image pair, we enforce a similarity measure between the representation of the sample images in the feature and pose space respectively, that is shown to boost regression performance. Furthermore, we argue that our pose-guided feature learning using our Siamese Regression Network generates more discriminative features that outperform the state of the art. Last, our feature learning formulation provides the ability of learning features that can perform under severe occlusions, demonstrating high performance on our novel hand-object dataset. Concluding, this work is a research on the area of object detection and pose estimation in 3D space, on a variety of object types. Furthermore we investigate how accuracy can be further improved by applying active vision techniques to optimally move the camera view to minimize the detection error.Open Acces

    RGB-D And Thermal Sensor Fusion: A Systematic Literature Review

    Full text link
    In the last decade, the computer vision field has seen significant progress in multimodal data fusion and learning, where multiple sensors, including depth, infrared, and visual, are used to capture the environment across diverse spectral ranges. Despite these advancements, there has been no systematic and comprehensive evaluation of fusing RGB-D and thermal modalities to date. While autonomous driving using LiDAR, radar, RGB, and other sensors has garnered substantial research interest, along with the fusion of RGB and depth modalities, the integration of thermal cameras and, specifically, the fusion of RGB-D and thermal data, has received comparatively less attention. This might be partly due to the limited number of publicly available datasets for such applications. This paper provides a comprehensive review of both, state-of-the-art and traditional methods used in fusing RGB-D and thermal camera data for various applications, such as site inspection, human tracking, fault detection, and others. The reviewed literature has been categorised into technical areas, such as 3D reconstruction, segmentation, object detection, available datasets, and other related topics. Following a brief introduction and an overview of the methodology, the study delves into calibration and registration techniques, then examines thermal visualisation and 3D reconstruction, before discussing the application of classic feature-based techniques as well as modern deep learning approaches. The paper concludes with a discourse on current limitations and potential future research directions. It is hoped that this survey will serve as a valuable reference for researchers looking to familiarise themselves with the latest advancements and contribute to the RGB-DT research field.Comment: 33 pages, 20 figure
    corecore