107 research outputs found

    Going Deeper into Action Recognition: A Survey

    Full text link
    Understanding human actions in visual data is tied to advances in complementary research areas including object recognition, human dynamics, domain adaptation and semantic segmentation. Over the last decade, human action analysis evolved from earlier schemes that are often limited to controlled environments to nowadays advanced solutions that can learn from millions of videos and apply to almost all daily activities. Given the broad range of applications from video surveillance to human-computer interaction, scientific milestones in action recognition are achieved more rapidly, eventually leading to the demise of what used to be good in a short time. This motivated us to provide a comprehensive review of the notable steps taken towards recognizing human actions. To this end, we start our discussion with the pioneering methods that use handcrafted representations, and then, navigate into the realm of deep learning based approaches. We aim to remain objective throughout this survey, touching upon encouraging improvements as well as inevitable fallbacks, in the hope of raising fresh questions and motivating new research directions for the reader

    Efficient duration and hierarchical modeling for human activity recognition

    Get PDF
    A challenge in building pervasive and smart spaces is to learn and recognize human activities of daily living (ADLs). In this paper, we address this problem and argue that in dealing with ADLs, it is beneficial to exploit both their typical duration patterns and inherent hierarchical structures. We exploit efficient duration modeling using the novel Coxian distribution to form the Coxian hidden semi-Markov model (CxHSMM) and apply it to the problem of learning and recognizing ADLs with complex temporal dependencies.The Coxian duration model has several advantages over existing duration parameterization using multinomial or exponential family distributions, including its denseness in the space of non negative distributions, low number of parameters, computational efficiency and the existence of closed-form estimation solutions. Further we combine both hierarchical and duration extensions of the hidden Markov model (HMM) to form the novel switching hidden semi-Markov model (SHSMM), and empirically compare its performance with existing models. The model can learn what an occupant normally does during the day from unsegmented training data and then perform online activity classification, segmentation and abnormality detection. Experimental results show that Coxian modeling outperforms a range of baseline models for the task of activity segmentation. We also achieve arecognition accuracy competitive to the current state-of-the-art multinomial duration model, while gaining a significant reduction in computation. Furthermore, cross-validation model selection on the number of phases K in the Coxian indicates that only a small Kis required to achieve the optimal performance. Finally, our models are further tested in a more challenging setting in which the tracking is often lost and the activities considerably overlap. With a small amount of labels supplied during training in a partially supervised learning mode, our models are again able to deliver reliable performance, again with a small number of phases, making our proposed framework an attractive choice for activity modeling

    Object Tracking and Mensuration in Surveillance Videos

    Get PDF
    This thesis focuses on tracking and mensuration in surveillance videos. The first part of the thesis discusses several object tracking approaches based on the different properties of tracking targets. For airborne videos, where the targets are usually small and with low resolutions, an approach of building motion models for foreground/background proposed in which the foreground target is simplified as a rigid object. For relatively high resolution targets, the non-rigid models are applied. An active contour-based algorithm has been introduced. The algorithm is based on decomposing the tracking into three parts: estimate the affine transform parameters between successive frames using particle filters; detect the contour deformation using a probabilistic deformation map, and regulate the deformation by projecting the updated model onto a trained shape subspace. The active appearance Markov chain (AAMC). It integrates a statistical model of shape, appearance and motion. In the AAMC model, a Markov chain represents the switching of motion phases (poses), and several pairwise active appearance model (P-AAM) components characterize the shape, appearance and motion information for different motion phases. The second part of the thesis covers video mensuration, in which we have proposed a heightmeasuring algorithm with less human supervision, more flexibility and improved robustness. From videos acquired by an uncalibrated stationary camera, we first recover the vanishing line and the vertical point of the scene. We then apply a single view mensuration algorithm to each of the frames to obtain height measurements. Finally, using the LMedS as the cost function and the Robbins-Monro stochastic approximation (RMSA) technique to obtain the optimal estimate

    Suivi Multi-Locuteurs avec des Informations Audio-Visuelles pour la Perception des Robots

    Get PDF
    Robot perception plays a crucial role in human-robot interaction (HRI). Perception system provides the robot information of the surroundings and enables the robot to give feedbacks. In a conversational scenario, a group of people may chat in front of the robot and move freely. In such situations, robots are expected to understand where are the people, who are speaking, or what are they talking about. This thesis concentrates on answering the first two questions, namely speaker tracking and diarization. We use different modalities of the robot’s perception system to achieve the goal. Like seeing and hearing for a human-being, audio and visual information are the critical cues for a robot in a conversational scenario. The advancement of computer vision and audio processing of the last decade has revolutionized the robot perception abilities. In this thesis, we have the following contributions: we first develop a variational Bayesian framework for tracking multiple objects. The variational Bayesian framework gives closed-form tractable problem solutions, which makes the tracking process efficient. The framework is first applied to visual multiple-person tracking. Birth and death process are built jointly with the framework to deal with the varying number of the people in the scene. Furthermore, we exploit the complementarity of vision and robot motorinformation. On the one hand, the robot’s active motion can be integrated into the visual tracking system to stabilize the tracking. On the other hand, visual information can be used to perform motor servoing. Moreover, audio and visual information are then combined in the variational framework, to estimate the smooth trajectories of speaking people, and to infer the acoustic status of a person- speaking or silent. In addition, we employ the model to acoustic-only speaker localization and tracking. Online dereverberation techniques are first applied then followed by the tracking system. Finally, a variant of the acoustic speaker tracking model based on von-Mises distribution is proposed, which is specifically adapted to directional data. All the proposed methods are validated on datasets according to applications.La perception des robots joue un rôle crucial dans l’interaction homme-robot (HRI). Le système de perception fournit les informations au robot sur l’environnement, ce qui permet au robot de réagir en consequence. Dans un scénario de conversation, un groupe de personnes peut discuter devant le robot et se déplacer librement. Dans de telles situations, les robots sont censés comprendre où sont les gens, ceux qui parlent et de quoi ils parlent. Cette thèse se concentre sur les deux premières questions, à savoir le suivi et la diarisation des locuteurs. Nous utilisons différentes modalités du système de perception du robot pour remplir cet objectif. Comme pour l’humain, l’ouie et la vue sont essentielles pour un robot dans un scénario de conversation. Les progrès de la vision par ordinateur et du traitement audio de la dernière décennie ont révolutionné les capacités de perception des robots. Dans cette thèse, nous développons les contributions suivantes : nous développons d’abord un cadre variationnel bayésien pour suivre plusieurs objets. Le cadre bayésien variationnel fournit des solutions explicites, rendant le processus de suivi très efficace. Cette approche est d’abord appliqué au suivi visuel de plusieurs personnes. Les processus de créations et de destructions sont en adéquation avecle modèle probabiliste proposé pour traiter un nombre variable de personnes. De plus, nous exploitons la complémentarité de la vision et des informations du moteur du robot : d’une part, le mouvement actif du robot peut être intégré au système de suivi visuel pour le stabiliser ; d’autre part, les informations visuelles peuvent être utilisées pour effectuer l’asservissement du moteur. Par la suite, les informations audio et visuelles sont combinées dans le modèle variationnel, pour lisser les trajectoires et déduire le statut acoustique d’une personne : parlant ou silencieux. Pour experimenter un scenario où l’informationvisuelle est absente, nous essayons le modèle pour la localisation et le suivi des locuteurs basé sur l’information acoustique uniquement. Les techniques de déréverbération sont d’abord appliquées, dont le résultat est fourni au système de suivi. Enfin, une variante du modèle de suivi des locuteurs basée sur la distribution de von-Mises est proposée, celle-ci étant plus adaptée aux données directionnelles. Toutes les méthodes proposées sont validées sur des bases de données specifiques à chaque application

    Efficient duration modelling in the hierarchical hidden semi-Markov models and their applications

    Get PDF
    Modeling patterns in temporal data has arisen as an important problem in engineering and science. This has led to the popularity of several dynamic models, in particular the renowned hidden Markov model (HMM) [Rabiner, 1989]. Despite its widespread success in many cases, the standard HMM often fails to model more complex data whose elements are correlated hierarchically or over a long period. Such problems are, however, frequently encountered in practice. Existing efforts to overcome this weakness often address either one of these two aspects separately, mainly due to computational intractability. Motivated by this modeling challenge in many real world problems, in particular, for video surveillance and segmentation, this thesis aims to develop tractable probabilistic models that can jointly model duration and hierarchical information in a unified framework. We believe that jointly exploiting statistical strength from both properties will lead to more accurate and robust models for the needed task. To tackle the modeling aspect, we base our work on an intersection between dynamic graphical models and statistics of lifetime modeling. Realizing that the key bottleneck found in the existing works lies in the choice of the distribution for a state, we have successfully integrated the discrete Coxian distribution [Cox, 1955], a special class of phase-type distributions, into the HMM to form a novel and powerful stochastic model termed as the Coxian Hidden Semi-Markov Model (CxHSMM). We show that this model can still be expressed as a dynamic Bayesian network, and inference and learning can be derived analytically.Most importantly, it has four superior features over existing semi-Markov modelling: the parameter space is compact, computation is fast (almost the same as the HMM), close-formed estimation can be derived, and the Coxian is flexible enough to approximate a large class of distributions. Next, we exploit hierarchical decomposition in the data by borrowing analogy from the hierarchical hidden Markov model in [Fine et al., 1998, Bui et al., 2004] and introduce a new type of shallow structured graphical model that combines both duration and hierarchical modelling into a unified framework, termed the Coxian Switching Hidden Semi-Markov Models (CxSHSMM). The top layer is a Markov sequence of switching variables, while the bottom layer is a sequence of concatenated CxHSMMs whose parameters are determined by the switching variable at the top. Again, we provide a thorough analysis along with inference and learning machinery. We also show that semi-Markov models with arbitrary depth structure can easily be developed. In all cases we further address two practical issues: missing observations to unstable tracking and the use of partially labelled data to improve training accuracy. Motivated by real-world problems, our application contribution is a framework to recognize complex activities of daily livings (ADLs) and detect anomalies to provide better intelligent caring services for the elderly.Coarser activities with self duration distributions are represented using the CxHSMM. Complex activities are made of a sequence of coarser activities and represented at the top level in the CxSHSMM. Intensive experiments are conducted to evaluate our solutions against existing methods. In many cases, the superiority of the joint modeling and the Coxian parameterization over traditional methods is confirmed. The robustness of our proposed models is further demonstrated in a series of more challenging experiments, in which the tracking is often lost and activities considerably overlap. Our final contribution is an application of the switching Coxian model to segment education-oriented videos into coherent topical units. Our results again demonstrate such segmentation processes can benefit greatly from the joint modeling of duration and hierarchy

    Speech Synthesis Based on Hidden Markov Models

    Get PDF

    Proceedings, MSVSCC 2018

    Get PDF
    Proceedings of the 12th Annual Modeling, Simulation & Visualization Student Capstone Conference held on April 19, 2018 at VMASC in Suffolk, Virginia. 155 pp
    corecore