
    Efficient Human Activity Recognition in Large Image and Video Databases

    Vision-based human action recognition has attracted considerable interest in recent research for its applications to video surveillance, content-based search, healthcare, and interactive games. Most existing research deals with building informative feature descriptors, designing efficient and robust algorithms, proposing versatile and challenging datasets, and fusing multiple modalities. Often, these approaches build on certain conventions such as the use of motion cues to determine video descriptors, application of off-the-shelf classifiers, and single-factor classification of videos. In this thesis, we deal with important but overlooked issues such as efficiency, simplicity, and scalability of human activity recognition in different application scenarios: controlled video environments (e.g. indoor surveillance), unconstrained videos (e.g. YouTube), depth or skeletal data (e.g. captured by Kinect), and person images (e.g. Flickr). In particular, we are interested in answering questions like (a) is it possible to efficiently recognize human actions in controlled videos without temporal cues? (b) given that large-scale unconstrained video data are often of a high-dimension, low-sample-size (HDLSS) nature, how can human actions be recognized efficiently in such data? (c) considering the rich 3D motion information available from depth or motion capture sensors, is it possible to recognize both the actions and the actors using only the motion dynamics of the underlying activities? and (d) can motion information from monocular videos be used for automatically determining saliency regions for recognizing actions in still images?

    Deep Multi Temporal Scale Networks for Human Motion Analysis

    The movement of human beings appears to respond to a complex motor system that contains signals at different hierarchical levels. For example, an action such as "grasping a glass on a table" represents a high-level action, but to perform this task, the body needs several motor inputs that include the activation of different joints of the body (shoulder, arm, hand, fingers, etc.). Each of these joints and muscles has a different size, responsiveness, and precision, with a complex, non-linearly stratified temporal dimension in which every muscle has its own temporal scale. Parts such as the fingers respond much faster to brain input than more voluminous body parts such as the shoulder. The cooperation among these parts when we perform an action produces smooth, effective, and expressive movement in a complex cognitive task spanning multiple temporal scales. Following this layered structure, the human body can be described as a kinematic tree consisting of connected joints. Although it is nowadays well known that human movement and its perception are characterised by multiple temporal scales, very few works in the literature focus on studying this particular property. In this thesis, we will focus on the analysis of human movement using data-driven techniques. In particular, we will focus on the non-verbal aspects of human movement, with an emphasis on full-body movements. Data-driven methods can interpret the information in the data by searching for rules, associations or patterns that can represent the relationships between input (e.g. the human action acquired with sensors) and output (e.g. the type of action performed). Furthermore, these models may represent a new research frontier as they can analyse large masses of data and focus on aspects that even an expert user might miss. The literature on data-driven models proposes two families of methods that can process time series and human movement.
The first family, called shallow models, extracts features from the time series that can help the learning algorithm find associations in the data. These features are identified and designed by domain experts who can identify the best ones for the problem at hand. The second family, on the other hand, avoids this phase of extraction by the human expert, since the models themselves can identify the best set of features to optimise the learning of the model. In this thesis, we will provide a method that can apply the multi-temporal-scale property of the human motion domain to deep learning models, the only data-driven models that can be extended to handle this property. We will ask ourselves two questions: what happens if we apply knowledge about how human movements are performed to deep learning models? Can this knowledge improve current automatic recognition standards? In order to prove the validity of our study, we collected data and tested our hypothesis in specially designed experiments. Results support both the proposal and the need for deep multi-scale models as a tool to better understand human movement and its multiple time-scale nature.

    Discriminative Learning and Recognition Using Dictionaries

    In recent years, the theory of sparse representation has emerged as a powerful tool for efficient processing of data in non-traditional ways. This is mainly due to the fact that most signals and images of interest tend to be sparse or compressible in some dictionary. In other words, they can be well approximated by a linear combination of a few elements (also known as atoms) of a dictionary. This dictionary can either be an analytic dictionary composed of wavelets or a Fourier basis, or it can be directly trained from data. It has been observed that dictionaries learned directly from data provide better representation and hence can improve the performance of many practical applications such as restoration and classification. In this dissertation, we study dictionary learning and recognition under supervised, unsupervised, and semi-supervised settings. In the supervised case, we propose an approach to recognize humans in unconstrained videos, where the main challenge is exploiting the identity information in multiple frames and the accompanying dynamic signature. These identity cues include face, body, and motion. Our approach is based on video-dictionaries for face and body. We design video-dictionaries to implicitly encode temporal, pose, and illumination information. Next, we propose a novel multivariate sparse representation method that jointly represents all the video data by a sparse linear combination of training data. To increase the ability of our algorithm to learn nonlinearities, we apply kernel methods to learn the dictionaries. Next, we address the problem of matching faces across changes in pose in unconstrained videos. Our approach consists of two methods based on 3D rotation and sparse representation that compensate for changes in pose. We demonstrate the superior performance of our approach over several state-of-the-art algorithms through extensive experiments on unconstrained video datasets.
In the unsupervised case, we present an approach that simultaneously clusters images and learns dictionaries from the clusters. The method learns dictionaries in the Radon transform domain. The main feature of the proposed approach is that it provides in-plane rotation and scale invariant clustering, which is useful in many applications such as Content Based Image Retrieval (CBIR). We demonstrate through experiments that the proposed rotation and scale invariant clustering provides not only good retrieval performance but also substantial improvements and robustness compared to traditional Gabor-based and several state-of-the-art shape-based methods. We then extend the dictionary learning problem to a generalized semi-supervised formulation, where each training sample is provided with a set of possible labels and only one label among them is the true one. Such applications can be found in image and video collections where one often has only partially labeled data. For instance, given an image with multiple faces and a caption specifying the names, we can be sure that each of the faces belongs to one of the names specified, while the exact identity of each face is not known. Labeling involves a significant amount of human effort and is expensive. This has motivated researchers to develop algorithms that learn from partially labeled training data. In this work, we develop dictionary learning algorithms that utilize such partially labeled data. The proposed method aims to solve the problem of ambiguously labeled multiclass classification using an iterative algorithm. The dictionaries are updated using either soft (EM-based) or hard decision rules. Extensive evaluations on existing datasets demonstrate that the proposed method performs significantly better than state-of-the-art approaches for learning from ambiguously labeled data.
As sparsity plays a major role in our research, we further present a sparse representation-based approach to find the salient views of 3D objects. The salient views are categorized into two groups. The first are boundary representative views that have several visible sides and object surfaces that may be attractive to humans. The second are side representative views that best represent side views of the approximating convex shape. The side representative views are class-specific views and possess the most representative power compared to other within-class views. Using the concept of characteristic view class, we first present a sparse representation-based approach for estimating the boundary representative views. With the estimated boundaries, we determine the side representative views based on a minimum reconstruction error criterion. Furthermore, to evaluate our method, we introduce the notion of geometric dictionaries built from salient views for applications in 3D object recognition, retrieval and sparse-to-full reconstruction. Through a series of experiments on four publicly available 3D object datasets, we demonstrate the effectiveness of our approach over state-of-the-art algorithms and baseline methods.
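
The abstract above rests on the idea that a signal is well approximated by a linear combination of a few dictionary atoms. One common way to write that idea is the sparse coding problem below. The notation is assumed here for illustration and is not taken from the dissertation, which may use a different formulation (for example an l1-relaxed or kernelized one): y is the signal, D is a dictionary whose K columns are the atoms, x is the coefficient vector, and T is the maximum number of atoms allowed.

```latex
\[
  \hat{x} \;=\; \arg\min_{x \in \mathbb{R}^{K}} \; \lVert y - D x \rVert_2^2
  \quad \text{subject to} \quad \lVert x \rVert_0 \le T ,
  \qquad y \in \mathbb{R}^{n}, \; D \in \mathbb{R}^{n \times K} .
\]
```

Here the l0 "norm" counts the nonzero entries of x, so the constraint says that at most T atoms take part in the approximation. Dictionary learning then treats D as an unknown as well and fits it to a set of training signals.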

    Cognitive Robots for Social Interactions

    One of my goals is to work towards developing Cognitive Robots, especially with regard to improving the functionalities that facilitate interaction with human beings and their surrounding objects. Any cognitive system designed to serve human beings must be capable of processing social signals and eventually enable efficient prediction and planning of appropriate responses. My main focus during my PhD study is to bridge the gap between the motoric space and the visual space. The discovery of mirror neurons ([RC04]) shows that the visual perception of human motion (visual space) is directly associated with the motor control of the human body (motor space). This discovery poses a large number of challenges in different fields such as computer vision, robotics and neuroscience. One of the fundamental challenges is understanding the mapping between 2D visual space and 3D motoric control, and further developing building blocks (primitives) of human motion in the visual space as well as in the motor space. First, I present my study on the visual-motoric mapping of human actions. This study aims at mapping human actions in 2D videos to a 3D skeletal representation. Second, I present an automatic algorithm to decompose motion capture (MoCap) sequences into synergies along with the times at which they are executed (or "activated") for each joint. Third, I propose to use Granger causality as a tool to study coordinated actions performed by at least two units. Recent scientific studies suggest that the above "action mirroring circuit" might be tuned to action coordination rather than single action mirroring. Fourth, I present the extraction of key poses in visual space. These key poses facilitate the further study of the "action mirroring circuit". I conclude the dissertation by describing the future of cognitive robotics study.

    A Novel Multi-View Table Tennis Umpiring Framework

    This research investigates the development of a low-cost multi-view umpiring framework, as an alternative to the current expensive systems that are almost exclusively restricted to elite professional sports. Table tennis has been selected as the testbed because, while automating the process is challenging, it has many different complex match elements including the service, return and rallies, which are governed by a strict set of regulations. The focus is mainly on the rally element rather than the whole match. Ball detection and tracking in video frames are undertaken to determine reliably the ball position relative to key reference objects like the table surface and net, and the ball’s flight path is used to determine the rally’s status. While a low-cost option has benefits, it is technically challenging due to the limited number of cameras and generally low video resolution used. This thesis presents a portable multi-view umpiring framework that identifies each state change in a rally. It makes three significant contributions to knowledge: i) a reliable ball detection strategy that accurately detects the location of the ball in low-resolution sequences; ii) a novel framework for ball tracking using a multi-view system, and iii) a new state-machine based evaluation system for analysing table tennis rallies. In a series of ten different test scenarios, the system achieved an average of 94% system detection rate and 100% accurate decisions. A test sequence of duration 1 s can be processed in 8 s, leading to a delay of only 7 s, which is considered acceptable for practical purposes. This solution has the potential to reform the way matches are umpired, providing objectivity in resolving disputed decisions. It affords an economic technology for amateur players, while the multi-view facility is extendible to other relevant ball-based sports. Finally, the ball flight path analysis mechanism can be a valuable training tool for skills development
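
The abstract above mentions a state-machine based evaluation of rallies driven by the detected ball positions. A minimal sketch of that kind of state machine follows. The state names, the event names (`bounce_A`, `hit_B`, `out`, and so on) and the simplified rules are assumptions made for illustration; they are not the thesis's actual design, and service rules, lets and net contacts are left out.

```python
from enum import Enum, auto


class State(Enum):
    FLIGHT_TO_A = auto()  # ball struck by B, travelling towards A's half
    FLIGHT_TO_B = auto()  # ball struck by A, travelling towards B's half
    BOUNCED_A = auto()    # ball has bounced once on A's half; A must return it
    BOUNCED_B = auto()    # ball has bounced once on B's half; B must return it
    POINT_A = auto()      # rally over, point to A
    POINT_B = auto()      # rally over, point to B


# Legal transitions (hypothetical event names) plus the volley fault.
TRANSITIONS = {
    (State.FLIGHT_TO_B, "bounce_B"): State.BOUNCED_B,
    (State.FLIGHT_TO_A, "bounce_A"): State.BOUNCED_A,
    (State.BOUNCED_B, "hit_B"): State.FLIGHT_TO_A,
    (State.BOUNCED_A, "hit_A"): State.FLIGHT_TO_B,
    # Striking the ball before it has bounced is treated as a fault here.
    (State.FLIGHT_TO_B, "hit_B"): State.POINT_A,
    (State.FLIGHT_TO_A, "hit_A"): State.POINT_B,
}

# Any other event in a live state ends the rally against the player at fault.
FAULT_RESULT = {
    State.FLIGHT_TO_B: State.POINT_B,  # A's shot did not land on B's half
    State.FLIGHT_TO_A: State.POINT_A,  # B's shot did not land on A's half
    State.BOUNCED_B: State.POINT_A,    # B failed to return the ball
    State.BOUNCED_A: State.POINT_B,    # A failed to return the ball
}

TERMINAL = {State.POINT_A, State.POINT_B}


def run_rally(start, events):
    """Feed a sequence of events to the machine and return every state visited."""
    state = start
    trace = [state]
    for event in events:
        if state in TERMINAL:
            break  # ignore anything detected after the rally has ended
        state = TRANSITIONS.get((state, event), FAULT_RESULT[state])
        trace.append(state)
    return trace


if __name__ == "__main__":
    rally = ["bounce_B", "hit_B", "bounce_A", "hit_A", "out"]
    for s in run_rally(State.FLIGHT_TO_B, rally):
        print(s.name)
```

In this sketch each detected ball event causes exactly one state change, so the returned trace is a list of the state changes in the rally, which is the kind of output the abstract describes the framework as identifying.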

    Gaze-Based Human-Robot Interaction by the Brunswick Model

    We present a new paradigm for human-robot interaction based on social signal processing, and in particular on the Brunswick model. Originally, the Brunswick model copes with face-to-face dyadic interaction, assuming that the interactants are communicating through a continuous exchange of nonverbal social signals, in addition to the spoken messages. Social signals have to be interpreted through a proper recognition phase that considers visual and audio information. The Brunswick model allows the quality of the interaction to be evaluated quantitatively, using statistical tools which measure how effective the recognition phase is. In this paper we cast this theory to the case where one of the interactants is a robot; in this case, the recognition phases performed by the robot and the human have to be revised w.r.t. the original model. The model is applied to Berrick, a recent open-source low-cost robotic head platform, where gazing is the social signal to be considered.

    Two-Handed Gestures for Human-Computer Interaction


    Surrogate Models Coupled with Machine Learning to Approximate Complex Physical Phenomena Involving Aerodynamic and Aerothermal Simulations

    Numerical simulations are a key element of the aircraft design process, complementing physical tests and flight tests. They could take advantage of innovative methods, such as the artificial intelligence technologies spreading in aviation. Simulating the full flight mission for various disciplines poses important problems due to significant computational cost coupled with varying operating conditions. Moreover, complex physical phenomena can occur. For instance, the aerodynamic field on the wing takes different shapes and can encounter shocks, while aerothermal simulations around the nacelle and pylon are sensitive to the interaction between engine flows and external flows. Surrogate models can be used to replace expensive high-fidelity simulations with mathematical and statistical approximations in order to reduce overall computation cost and to provide a data-driven approach. In this thesis, we propose two developments: (i) machine learning-based surrogate models capable of approximating aerodynamic experiments and (ii) the integration of more classical surrogate models into an industrial aerothermal process. The first approach mitigates aerodynamic issues by separating solutions with very different shapes into several subsets using machine learning algorithms. Moreover, a resampling technique takes advantage of the subdomain decomposition by adding extra information in relevant regions. The second development focuses on pylon sizing by building surrogate models that substitute for aerothermal simulations. The two approaches are applied to aircraft configurations in order to bridge the gap between academic methods and real-world applications. Significant improvements are highlighted in terms of accuracy and cost gain.