14 research outputs found

    Human action recognition based on estimated weak poses

    Get PDF

    Spatiotemporal visual analysis of human actions

    No full text
    In this dissertation we propose four methods for the recognition of human activities. In all four of them, the representation of the activities is based on spatiotemporal features that are automatically detected at areas where there is a significant amount of independent motion, that is, motion that is due to ongoing activities in the scene. We propose the use of spatiotemporal salient points as features throughout this dissertation. The algorithms presented, however, can be used with any kind of features, as long as the latter are well localized and have a well-defined area of support in space and time. We introduce the utilized spatiotemporal salient points in the first method presented in this dissertation. By extending previous work on spatial saliency, we measure the variations in the information content of pixel neighborhoods both in space and time, and detect the points at the locations and scales for which this information content is locally maximized. In this way, an activity is represented as a collection of spatiotemporal salient points. We propose an iterative linear space-time warping technique in order to align the representations in space and time and propose to use Relevance Vector Machines (RVM) in order to classify each example into an action category. In the second method proposed in this dissertation we propose to enhance the acquired representations of the first method. More specifically, we propose to track each detected point in time, and create representations based on sets of trajectories, where each trajectory expresses how the information engulfed by each salient point evolves over time. In order to deal with imperfect localization of the detected points, we augment the observation model of the tracker with background information, acquired using a fully automatic background estimation algorithm. In this way, the tracker favors solutions that contain a large number of foreground pixels. In addition, we perform experiments where the tracked templates are localized on specific parts of the body, like the hands and the head, and we further augment the tracker’s observation model using a human skin color model. Finally, we use a variant of the Longest Common Subsequence algorithm (LCSS) in order to acquire a similarity measure between the resulting trajectory representations, and RVMs for classification. In the third method that we propose, we assume that neighboring salient points follow a similar motion. This is in contrast to the previous method, where each salient point was tracked independently of its neighbors. More specifically, we propose to extract a novel set of visual descriptors that are based on geometrical properties of three-dimensional piece-wise polynomials. The latter are fitted on the spatiotemporal locations of salient points that fall within local spatiotemporal neighborhoods, and are assumed to follow a similar motion. The extracted descriptors are invariant in translation and scaling in space-time. Coupling the neighborhood dimensions to the scale at which the corresponding spatiotemporal salient points are detected ensures the latter. The descriptors that are extracted across the whole dataset are subsequently clustered in order to create a codebook, which is used in order to represent the overall motion of the subjects within small temporal windows.Finally,we use boosting in order to select the most discriminative of these windows for each class, and RVMs for classification. The fourth and last method addresses the joint problem of localization and recognition of human activities depicted in unsegmented image sequences. Its main contribution is the use of an implicit representation of the spatiotemporal shape of the activity, which relies on the spatiotemporal localization of characteristic ensembles of spatiotemporal features. The latter are localized around automatically detected salient points. Evidence for the spatiotemporal localization of the activity is accumulated in a probabilistic spatiotemporal voting scheme. During training, we use boosting in order to create codebooks of characteristic feature ensembles for each class. Subsequently, we construct class-specific spatiotemporal models, which encode where in space and time each codeword ensemble appears in the training set. During testing, each activated codeword ensemble casts probabilistic votes concerning the spatiotemporal localization of the activity, according to the information stored during training. We use a Mean Shift Mode estimation algorithm in order to extract the most probable hypotheses from each resulting voting space. Each hypothesis corresponds to a spatiotemporal volume which potentially engulfs the activity, and is verified by performing action category classification with an RVM classifier

    Searching for complex human activities with no visual examples

    Get PDF
    We describe a method of representing human activities that allows a collection of motions to be queried without examples, using a simple and effective query language. Our approach is based on units of activity at segments of the body, that can be composed across space and across the body to produce complex queries. The presence of search units is inferred automatically by tracking the body, lifting the tracks to 3D and comparing to models trained using motion capture data. Our models of short time scale limb behaviour are built using labelled motion capture set. We show results for a large range of queries applied to a collection of complex motion and activity. We compare with discriminative methods applied to tracker data; our method offers significantly improved performance. We show experimental evidence that our method is robust to view direction and is unaffected by some important changes of clothing. © 2008 Springer Science+Business Media, LLC

    Chord-Length Shape Features for Human Activity Recognition

    Get PDF

    Human action recognition based on estimated weak poses

    Get PDF
    Altres ajuts: Avanza I+D ViCoMo (TSI-020400-2009-133) and DiCoMa (TSI-020400-2011-55)We present a novel method for human action recognition (HAR) based on estimated poses from image sequences. We use 3D human pose data as additional information and propose a compact human pose representation, called a weak pose, in a low-dimensional space while still keeping the most discriminative information for a given pose. With predicted poses from image features, we map the problem from image feature space to pose space, where a Bag of Poses (BOP) model is learned for the final goal of HAR. The BOP model is a modified version of the classical bag of words pipeline by building the vocabulary based on the most representative weak poses for a given action. Compared with the standard k-means clustering, our vocabulary selection criteria is proven to be more efficient and robust against the inherent challenges of action recognition. Moreover, since for action recognition the ordering of the poses is discriminative, the BOP model incorporates temporal information: in essence, groups of consecutive poses are considered together when computing the vocabulary and assignment. We tested our method on two well-known datasets: HumanEva and IXMAS, to demonstrate that weak poses aid to improve action recognition accuracies. The proposed method is scene-independent and is comparable with the state-of-art method

    Study Of Human Activity In Video Data With An Emphasis On View-invariance

    Get PDF
    The perception and understanding of human motion and action is an important area of research in computer vision that plays a crucial role in various applications such as surveillance, HCI, ergonomics, etc. In this thesis, we focus on the recognition of actions in the case of varying viewpoints and different and unknown camera intrinsic parameters. The challenges to be addressed include perspective distortions, differences in viewpoints, anthropometric variations, and the large degrees of freedom of articulated bodies. In addition, we are interested in methods that require little or no training. The current solutions to action recognition usually assume that there is a huge dataset of actions available so that a classifier can be trained. However, this means that in order to define a new action, the user has to record a number of videos from different viewpoints with varying camera intrinsic parameters and then retrain the classifier, which is not very practical from a development point of view. We propose algorithms that overcome these challenges and require just a few instances of the action from any viewpoint with any intrinsic camera parameters. Our first algorithm is based on the rank constraint on the family of planar homographies associated with triplets of body points. We represent action as a sequence of poses, and decompose the pose into triplets. Therefore, the pose transition is broken down into a set of movement of body point planes. In this way, we transform the non-rigid motion of the body points into a rigid motion of body point iii planes. We use the fact that the family of homographies associated with two identical poses would have rank 4 to gauge similarity of the pose between two subjects, observed by different perspective cameras and from different viewpoints. This method requires only one instance of the action. We then show that it is possible to extend the concept of triplets to line segments. In particular, we establish that if we look at the movement of line segments instead of triplets, we have more redundancy in data thus leading to better results. We demonstrate this concept on “fundamental ratios.” We decompose a human body pose into line segments instead of triplets and look at set of movement of line segments. This method needs only three instances of the action. If a larger dataset is available, we can also apply weighting on line segments for better accuracy. The last method is based on the concept of “Projective Depth”. Given a plane, we can find the relative depth of a point relative to the given plane. We propose three different ways of using “projective depth:” (i) Triplets - the three points of a triplet along with the epipole defines the plane and the movement of points relative to these body planes can be used to recognize actions; (ii) Ground plane - if we are able to extract the ground plane, we can find the “projective depth” of the body points with respect to it. Therefore, the problem of action recognition would translate to curve matching; and (iii) Mirror person - We can use the mirror view of the person to extract mirror symmetric planes. This method also needs only one instance of the action. Extensive experiments are reported on testing view invariance, robustness to noisy localization and occlusions of body points, and action recognition. The experimental results are very promising and demonstrate the efficiency of our proposed invariants. i

    The Understanding of Human Activities by Computer Vision Techniques

    Get PDF
    Esta tesis propone nuevas metodologías para el aprendizaje de actividades humanas y su clasificación en categorías. Aunque este tema ha sido ampliamente estudiado por la comunidad investigadora en visión por computador, aún encontramos importantes dificultades por resolver. En primer lugar hemos encontrado que la literatura sobre técnicas de visión por computador para el aprendizaje de actividades humanas empleando pocas secuencias de entrenamiento es escasa y además presenta resultados pobres [1] [2]. Sin embargo, este aprendizaje es una herramienta crucial en varios escenarios. Por ejemplo, un sistema de reconocimiento recién desplegado necesita mucho tiempo para adquirir nuevas secuencias de entrenamiento así que el entrenamiento con pocos ejemplos puede acelerar la puesta en funcionamiento. También la detección de comportamientos anómalos, ejemplos de los cuales son difíciles de obtener, puede beneficiarse de estas técnicas. Existen soluciones mediante técnicas de cruce dominios o empleando características invariantes, sin embargo estas soluciones omiten información del escenario objetivo la cual reduce el ruido en el sistema mejorando los resultados cuando se tiene en cuenta y ejemplos de actividades anómalas siguen siendo difíciles de obtener. Estos sistemas entrenados con poca información se enfrentan a dos problemas principales: por una parte el sistema de entrenamiento puede sufrir de inestabilidades numéricas en la estimación de los parámetros del modelo, por otra, existe una falta de información representativa proveniente de actividades diversas. Nos hemos enfrentado a estos problemas proponiendo novedosos métodos para el aprendizaje de actividades humanas usando tan solo un ejemplo, lo que se denomina one-shot learning. Nuestras propuestas se basan en sistemas generativos, derivadas de los Modelos Ocultos de Markov[3][4], puesto que cada clase de actividad debe ser aprendida con tan solo un ejemplo. Además, hemos ampliado la diversidad de información en los modelos aplicado una transferencia de información desde fuentes externas al escenario[5]. En esta tesis se explican varias propuestas y se muestra como con ellas hemos conseguidos resultados en el estado del arte en tres bases de datos públicas [6][7][8]. La segunda dificultad a la que nos hemos enfrentado es el reconocimiento de actividades sin restricciones en el escenario. En este caso no tiene por qué coincidir el escenario de entrenamiento y el de evaluación por lo que la reducción de ruido anteriormente expuesta no es aplicable. Esto supone que se pueda emplear cualquier ejemplo etiquetado para entrenamiento independientemente del escenario de origen. Esta libertad nos permite extraer vídeos desde cualquier fuente evitando la restricción en el número de ejemplos de entrenamiento. Teniendo suficientes ejemplos de entrenamiento tanto métodos generativos como discriminativos pueden ser empleados. En el momento de realización de esta tesis encontramos que el estado del arte obtiene los mejores resultados empleando métodos discriminativos, sin embargo, la mayoría de propuestas no suelen considerar la información temporal a largo plazo de las actividades[9]. Esta información puede ser crucial para distinguir entre actividades donde el orden de sub-acciones es determinante, y puede ser una ayuda en otras situaciones[10]. Para ello hemos diseñado un sistema que incluye dicha información en una Máquina de Vectores de Soporte. Además, el sistema permite cierta flexibilidad en la alineación de las secuencias a comparar, característica muy útil si la segmentación de las actividades no es perfecta. Utilizando este sistema hemos obtenido resultados en el estado del arte para cuatro bases de datos complejas sin restricciones en los escenarios[11][12][13][14]. Los trabajos realizados en esta tesis han servido para realizar tres artículos en revistas del primer cuartil [15][16][17], dos ya publicados y otro enviado. Además, se han publicado 8 artículos en congresos internacionales y uno nacional [18][19][20][21][22][23][24][25][26]. [1]Seo, H. J. and Milanfar, P. (2011). Action recognition from one example. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):867–882.(2011) [2]Yang, Y., Saleemi, I., and Shah, M. Discovering motion primitives for unsupervised grouping and one-shot learning of human actions, gestures, and expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(7):1635–1648. (2013) [3]Rabiner, L. R. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286. (1989) [4]Bishop, C. M. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA. (2006) [5]Cook, D., Feuz, K., and Krishnan, N. Transfer learning for activity recognition: a survey. Knowledge and Information Systems, pages 1–20. (2013) [6]Schuldt, C., Laptev, I., and Caputo, B. Recognizing human actions: a local svm approach. In International Conference on Pattern Recognition (ICPR). (2004) [7]Weinland, D., Ronfard, R., and Boyer, E. Free viewpoint action recognition using motion history volumes. Computer Vision and Image Understanding, 104(2-3):249–257. (2006) [8]Gorelick, L., Blank, M., Shechtman, E., Irani, M., and Basri, R. Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(12):2247–2253. (2007) [9]Wang, H. and Schmid, C. Action recognition with improved trajectories. In IEEE International Conference on Computer Vision (ICCV). (2013) [10]Choi, J., Wang, Z., Lee, S.-C., and Jeon, W. J. A spatio-temporal pyramid matching for video retrieval. Computer Vision and Image Understanding, 117(6):660 – 669. (2013) [11]Oh, S., Hoogs, A., Perera, A., Cuntoor, N., Chen, C.-C., Lee, J. T., Mukherjee, S., Aggarwal, J. K., Lee, H., Davis, L., Swears, E., Wang, X., Ji, Q., Reddy, K., Shah, M., Vondrick, C., Pirsiavash, H., Ramanan, D., Yuen, J., Torralba, A., Song, B., Fong, A., Roy-Chowdhury, A., and Desai, M. A large-scale benchmark dataset for event recognition in surveillance video. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3153–3160. (2011) [12] Niebles, J. C., Chen, C.-W., and Fei-Fei, L. Modeling temporal structure of decomposable motion segments for activity classification. In European Conference on Computer Vision (ECCV), pages 392–405.(2010) [13]Reddy, K. K. and Shah, M. Recognizing 50 human action categories of web videos. Machine Vision and Applications, 24(5):971–981. (2013) [14]Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., and Serre, T. HMDB: a large video database for human motion recognition. In IEEE International Conference on Computer Vision (ICCV). (2011) [15]Rodriguez, M., Orrite, C., Medrano, C., and Makris, D. One-shot learning of human activity with an map adapted gmm and simplex-hmm. IEEE Transactions on Cybernetics, PP(99):1–12. (2016) [16]Rodriguez, M., Orrite, C., Medrano, C., and Makris, D. A time flexible kernel framework for video-based activity recognition. Image and Vision Computing 48-49:26 – 36. (2016) [17]Rodriguez, M., Orrite, C., Medrano, C., and Makris, D. Extended Study for One-shot Learning of Human Activity by a Simplex-HMM. IEEE Transactions on Cybernetics (Enviado) [18]Orrite, C., Rodriguez, M., Medrano, C. One-shot learning of temporal sequences using a distance dependent Chinese Restaurant Process. In Proceedings of the 23nd International Conference Pattern Recognition ICPR (December 2016) [19]Rodriguez, M., Medrano, C., Herrero, E., and Orrite, C. Spectral Clustering Using Friendship Path Similarity Proceedings of the 7th Iberian Conference, IbPRIA (June 2015) [20]Orrite, C., Soler, J., Rodriguez, M., Herrero, E., and Casas, R. Image-based location recognition and scenario modelling. In Proceedings of the 10th International Conference on Computer Vision Theory and Applications, VISAPP (March 2015) [21]Castán, D., Rodríguez, M., Ortega, A., Orrite, C., and Lleida, E. Vivolab and cvlab - mediaeval 2014: Violent scenes detection affect task. In Working Notes Proceedings of the MediaEval (October 2014) [22]Orrite, C., Rodriguez, M., Herrero, E., Rogez, G., and Velastin, S. A. Automatic segmentation and recognition of human actions in monocular sequences In Proceedings of the 22nd International Conference Pattern Recognition ICPR (August 2014) [23]Rodriguez, M., Medrano, C., Herrero, E., and Orrite, C. Transfer learning of human poses for action recognition. In 4th International Workshop of Human Behavior Unterstanding (HBU). (October 2013) [24]Rodriguez, M., Orrite, C., and Medrano, C. Human action recognition with limited labelled data. In Actas del III Workshop de Reconocimiento de Formas y Analisis de Imagenes, WSRFAI. (September 2013) [25]Orrite, C., Monforte, P., Rodriguez, M., and Herrero, E. Human Action Recognition under Partial Occlusions . Proceedings of the 6th Iberian Conference, IbPRIA (June 2013) [26]Orrite, C., Rodriguez, M., and Montañes, M. One sequence learning of human actions. In 2nd International Workshop of Human Behavior Unterstanding (HBU). (November 2011)This thesis provides some novel frameworks for learning human activities and for further classifying them into categories. This field of research has been largely studied by the computer vision community however there are still many drawbacks to solve. First, we have found few proposals in the literature for learning human activities from limited number of sequences. However, this learning is critical in several scenarios. For instance, in the initial stage after a system installation the capture of activity examples is time expensive and therefore, the learning with limited examples may accelerate the operational launch of the system. Moreover, examples for training abnormal behaviour are hardly obtainable and their learning may benefit from the same techniques. This problem is solved by some approaches, such as cross domain implementations or the use of invariant features, but they do not consider the specific scenario information which is useful for reducing the clutter and improving the results. Systems trained with scarce information face two main problems: on the one hand, the training process may suffer from numerical instabilities while estimating the model parameters; on the other hand, the model lacks of representative information coming from a diverse set of activity classes. We have dealt with these problems providing some novel approaches for learning human activities from one example, what is called a one-shot learning method. To do so, we have proposed generative approaches based on Hidden Markov Models as we need to learn each activity class from only one example. In addition, we have transferred information from external sources in order to introduce diverse information into the model. This thesis explains our proposals and shows how these methods achieve state-of-the-art results in three public datasets. Second, we have studied the recognition of human activities in unconstrained scenarios. In this case, the scenario may or may not be repeated in training and evaluation and therefore the clutter reduction previously mentioned does not happen. On the other hand, we can use any labelled video for training the system independently of the target scenario. This freedom allows the extraction of videos from the Internet dismissing the implicit constrains when training with limited examples. Having plenty of training examples both, generative and discriminative, methods can be used and by the time this thesis has been made the state-of-the-art has been achieved by discriminative ones. However, most of the methods usually fail when taking into consideration long-term information of the activities. This information is critical when comparing activities where the order of sub-actions is important, and may be useful in other comparisons as well. Thus, we have designed a framework that incorporates this information in a discriminative classifier. In addition, this method introduces some flexibility for sequence alignment, useful feature when the activity segmentation is not exact. Using this framework we have obtained state-of-the-art results in four challenging public datasets with unconstrained scenarios

    Understanding human motion : recognition and retrieval of human activities

    Get PDF
    Ankara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2008.Thesis (Ph.D.) -- Bilkent University, 2008.Includes bibliographical references leaves 111-121.Within the ever-growing video archives is a vast amount of interesting information regarding human action/activities. In this thesis, we approach the problem of extracting this information and understanding human motion from a computer vision perspective. We propose solutions for two distinct scenarios, ordered from simple to complex. In the first scenario, we deal with the problem of single action recognition in relatively simple settings. We believe that human pose encapsulates many useful clues for recognizing the ongoing action, and we can represent this shape information for 2D single actions in very compact forms, before going into details of complex modeling. We show that high-accuracy single human action recognition is possible 1) using spatial oriented histograms of rectangular regions when the silhouette is extractable, 2) using the distribution of boundary-fitted lines when the silhouette information is missing. We demonstrate that, inside videos, we can further improve recognition accuracy by means of adding local and global motion information. We also show that within a discriminative framework, shape information is quite useful even in the case of human action recognition in still images. Our second scenario involves recognition and retrieval of complex human activities within more complicated settings, like the presence of changing background and viewpoints. We describe a method of representing human activities in 3D that allows a collection of motions to be queried without examples, using a simple and effective query language. Our approach is based on units of activity at segments of the body, that can be composed across time and across the body to produce complex queries. The presence of search units is inferred automatically by tracking the body, lifting the tracks to 3D and comparing to models trained using motion capture data. Our models of short time scale limb behaviour are built using labelled motion capture set. Our query language makes use of finite state automata and requires simple text encoding and no visual examples. We show results for a large range of queries applied to a collection of complex motion and activity. We compare with discriminative methods applied to tracker data; our method offers significantly improved performance. We show experimental evidence that our method is robust to view direction and is unaffected by some important changes of clothing.İkizler, NazlıPh.D

    A Study on Human Motion Acquisition and Recognition Employing Structured Motion Database

    Get PDF
    九州工業大学博士学位論文 学位記番号:工博甲第332号 学位授与年月日:平成24年3月23日1 Introduction||2 Human Motion Representation||3 Human Motion Recognition||4 Automatic Human Motion Acquisition||5 Human Motion Recognition Employing Structured Motion Database||6 Analysis on the Constraints in Human Motion Recognition||7 Multiple Persons’ Action Recognition||8 Discussion and ConclusionsHuman motion analysis is an emerging research field for the video-based applications capable of acquiring and recognizing human motions or actions. The automaticity of such a system with these capabilities has vital importance in real-life scenarios. With the increasing number of applications, the demand for a human motion acquisition system is gaining importance day-by-day. We develop such kind of acquisition system based on body-parts modeling strategy. The system is able to acquire the motion by positioning body joints and interpreting those joints by the inter-parts inclination. Besides the development of the acquisition system, there is increasing need for a reliable human motion recognition system in recent years. There are a number of researches on motion recognition is performed in last two decades. At the same time, an enormous amount of bulk motion datasets are becoming available. Therefore, it becomes an indispensable task to develop a motion database that can deal with large variability of motions efficiently. We have developed such a system based on the structured motion database concept. In order to gain a perspective on this issue, we have analyzed various aspects of the motion database with a view to establishing a standard recognition scheme. The conventional structured database is subjected to improvement by considering three aspects: directional organization, nearest neighbor searching problem resolution, and prior direction estimation. In order to investigate and analyze comprehensively the effect of those aspects on motion recognition, we have adopted two forms of motion representation, eigenspace-based motion compression, and B-Tree structured database. Moreover, we have also analyzed the two important constraints in motion recognition: missing information and clutter outdoor motions. Two separate systems based on these constraints are also developed that shows the suitable adoption of the constraints. However, several people occupy a scene in practical cases. We have proposed a detection-tracking-recognition integrated action recognition system to deal with multiple people case. The system shows decent performance in outdoor scenarios. The experimental results empirically illustrate the suitability and compatibility of various factors of the motion recognition
    corecore