13 research outputs found

    Combining image regions and human activity for indirect object recognition in indoor wide-angle views

    Get PDF
    Traditional methods of object recognition rely on shape and so are very difficult to apply in cluttered, wide-angle and low-detail views such as surveillance scenes. To address this, a method of indirect object recognition is proposed, where human activity is used to infer both the location and identity of objects. No shape analysis is necessary. The concept is dubbed 'interaction signatures', since the premise is that a human will interact with objects in ways characteristic of the function of that object - for example, a person sits in a chair and drinks from a cup. The human-centred approach means that recognition is possible in low-detail views and is largely invariant to the shape of objects within the same functional class. This paper implements a Bayesian network for classifying region patches with object labels, building upon our previous work in automatically segmenting and recognising a human's interactions with the objects. Experiments show that interaction signatures can successfully find and label objects in low-detail views and are equally effective at recognising test objects that differ markedly in appearance from the training objects.
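    As a rough illustration of the interaction-signature idea (not the paper's actual Bayesian network), the sketch below scores candidate object labels for a region patch from the interactions observed there, using a naive-Bayes factorisation; the interaction vocabulary, priors and likelihoods are hypothetical.

```python
# Hedged sketch: label a region patch from observed human interactions,
# assuming a naive-Bayes factorisation P(object) * prod P(interaction | object).
# The vocabulary and probabilities below are illustrative, not from the paper.
from math import log

PRIOR = {"chair": 0.3, "cup": 0.2, "door": 0.3, "keyboard": 0.2}

LIKELIHOOD = {  # P(interaction observed | object), toy values
    "chair":    {"sit": 0.8, "drink": 0.05, "push": 0.1, "type": 0.05},
    "cup":      {"sit": 0.01, "drink": 0.9, "push": 0.04, "type": 0.05},
    "door":     {"sit": 0.02, "drink": 0.03, "push": 0.9, "type": 0.05},
    "keyboard": {"sit": 0.02, "drink": 0.03, "push": 0.05, "type": 0.9},
}

def label_region(observed_interactions):
    """Return the most probable object label for a region patch."""
    scores = {}
    for obj, prior in PRIOR.items():
        score = log(prior)
        for interaction in observed_interactions:
            score += log(LIKELIHOOD[obj].get(interaction, 1e-6))
        scores[obj] = score
    return max(scores, key=scores.get)

print(label_region(["sit"]))           # -> chair
print(label_region(["drink", "sit"]))  # mixed evidence, decided by the scores
```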

    Learning Action Maps of Large Environments via First-Person Vision

    Full text link
    When people observe and interact with physical spaces, they are able to associate functionality with regions in the environment. Our goal is to automate dense functional understanding of large spaces by leveraging sparse activity demonstrations recorded from an ego-centric viewpoint. The method we describe enables functionality estimation both in large scenes where people have behaved and in novel scenes where no behaviors are observed. Our method learns and predicts "Action Maps", which encode the ability of a user to perform activities at various locations. By using an egocentric camera to observe human activities, our method scales with the size of the scene without the need for mounting multiple static surveillance cameras and is well-suited to the task of observing activities up close. We demonstrate that by capturing appearance-based attributes of the environment and associating these attributes with activity demonstrations, our proposed mathematical framework allows for the prediction of Action Maps in new environments. Additionally, we offer a preliminary glance at the applicability of Action Maps by demonstrating a proof-of-concept application in which they are used in concert with activity detections to perform localization. Comment: To appear at CVPR 201
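    The sketch below is a minimal, assumption-laden illustration of the Action Map idea: sparse demonstrations at a few cells are used to fit a simple ridge regressor from appearance descriptors to an affordance score, which is then evaluated densely over a new scene. The feature extractor, data shapes and regressor are placeholders, not the paper's framework.

```python
# Hedged sketch of the Action Map idea: score each scene location for an
# activity from appearance features, training on a few sparse demonstrations.
# This is an illustrative ridge regression; descriptors are random stand-ins.
import numpy as np

def fit_affordance_model(features, labels, reg=1e-2):
    """features: (n_demo_cells, d) appearance descriptors at demonstrated cells.
    labels: (n_demo_cells,) 1.0 if the activity (e.g. 'sit') occurred there."""
    d = features.shape[1]
    A = features.T @ features + reg * np.eye(d)
    return np.linalg.solve(A, features.T @ labels)   # ridge weights

def predict_action_map(weights, grid_features):
    """grid_features: (H, W, d) descriptors for every cell of a new scene.
    Returns an (H, W) map of predicted affordance scores squashed to (0, 1)."""
    H, W, d = grid_features.shape
    scores = grid_features.reshape(-1, d) @ weights
    return (1.0 / (1.0 + np.exp(-scores))).reshape(H, W)

# toy usage with random stand-in descriptors
rng = np.random.default_rng(0)
demo_feats = rng.normal(size=(40, 16))
demo_labels = rng.integers(0, 2, size=40).astype(float)
w = fit_affordance_model(demo_feats, demo_labels)
action_map = predict_action_map(w, rng.normal(size=(8, 8, 16)))
print(action_map.shape)  # (8, 8) dense map for the new scene
```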

    Learning Dynamical Representations of Tools for Tool-Use Recognition

    No full text

    Action recognition system: use for monitoring hemiplegic patients who are victims of stroke (Sistema de reconhecimento de ações: uso para monitoramento de pacientes hemiplégicos vítimas de acidente vascular cerebral)

    Get PDF
    Undergraduate monograph, Universidade de Brasília, Faculdade UnB Gama, Software Engineering, 2013. The Action Recognition System developed in this work went through three important phases before being built. The first phase was a bibliographic investigation of existing action recognition systems, looking at which problems they solve, what their limitations are and which techniques they use. The second phase was the definition of how the system would be built, based on the techniques studied earlier, as well as the choice of the system's target users, in this case hemiplegic patients. The third phase was the definition of the problem to be solved by the system, in this case the gait difficulties presented by hemiplegic users. The development applied Software Engineering techniques (requirements elicitation, definition of an architecture and execution of tests) to validate the algorithm and the system. The development methodology, the coding technique and the tools used throughout development are described in this work. Finally, the limitations of the system, the opportunities for improvement, the results obtained and suggestions for future work are listed.

    View-invariant action recognition using generalized 4D motion features (일반화된 4차원 동작 특징을 이용한 시선각에 무관한 행동 인식)

    Get PDF
    Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, August 2014 (advisor: 최진영). In this thesis, we propose a method to recognize human actions and their orientation independently of viewpoint using generalized 4D [x,y,z,t] motion features. Conventional action recognition methods assume that the camera view is fixed and that people stand facing the camera. In real-life scenarios, however, cameras are installed at various positions and the orientation of people is arbitrary, so images can be taken from various views depending on the position of the camera and the orientation of the person. To recognize human actions and their orientation under this difficult scenario, we focus on a view-invariant action recognition method which can recognize test videos from any arbitrary view. For this purpose, we develop 4D space-time interest points (4D-STIPs, [x,y,z,t]) from 3D space (3D-S, [x,y,z]) volumes reconstructed from images of a finite number of different views. Since the 3D-S volumes and the 4D-STIPs are constructed using volumetric information, the features for an arbitrary 2D space (2D-S, [x,y]) viewpoint can be generated by projecting the 3D-S volumes and 4D-STIPs onto the corresponding test image planes. With these projected features, we construct motion history images (MHIs) and non-motion history images (NMHIs), which encode the moving and non-moving parts of an action respectively. Since MHIs cannot guarantee good performance when the moving parts of an object show similar patterns, we propose NMHIs and combine them with MHIs to add information from the stationary parts of an object to the description of the particular action class. To reduce the dimension of MHIs and NMHIs, we apply class-augmented principal component analysis (CA-PCA), which uses class information for dimension reduction. Since we use the action label for reducing the feature dimension, we obtain a principal axis which separates each action well. After reducing the feature dimension, the final features are trained by the support vector data description (SVDD) method and tested by support vector domain density description (SVDDD). For the recognition of action orientation, the feature dimension is reduced using the orientation label; the reduced features are likewise trained by SVDD and tested by SVDDD. The proposed 4D-STIPs can be applied to view-invariant recognition of actions and their orientation, and experiments verify that they represent the properties of each action compactly. To simulate an arbitrary test view as in real applications, we build a new test dataset which is entirely different from the training dataset. We verify our algorithm by training action models on the multi-view IXMAS dataset and testing on the SNU dataset. Experimental results show that the proposed method generalizes better and outperforms state-of-the-art methods, especially when the classifier is trained with insufficient information about the test views. For the recognition of action orientation, we experiment with the SNU dataset taken from 5 different orientations to verify recognition performance. Recognizing action orientation can help in analyzing a video by providing information about interactions between people.
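    As a small illustration of the MHI/NMHI features named above, the sketch below applies the standard motion-history recurrence to a sequence of binary silhouettes and builds one plausible complementary non-motion image; the thesis derives these from projections of reconstructed 3D volumes, which is not reproduced here.

```python
# Hedged sketch: motion history image (MHI) and a complementary non-motion
# history image (NMHI) from a sequence of binary silhouettes. This follows the
# standard MHI recurrence; the NMHI definition here is one plausible
# interpretation, not necessarily the thesis' exact formulation.
import numpy as np

def motion_history(silhouettes, tau=None):
    """silhouettes: sequence of (H, W) binary masks, one per frame."""
    frames = np.asarray(silhouettes, dtype=bool)
    T, H, W = frames.shape
    tau = float(tau) if tau is not None else float(T)
    mhi = np.zeros((H, W), dtype=float)
    nmhi = np.zeros((H, W), dtype=float)
    for t in range(1, T):
        moving = frames[t] ^ frames[t - 1]   # silhouette pixels that changed
        static = frames[t] & frames[t - 1]   # silhouette pixels that did not
        mhi = np.where(moving, tau, np.maximum(mhi - 1.0, 0.0))
        nmhi = np.where(static, tau, np.maximum(nmhi - 1.0, 0.0))
    return mhi / tau, nmhi / tau             # normalised to [0, 1]

# toy usage: a blob that shifts right while its lower half stays put
seq = [np.zeros((32, 32), dtype=bool) for _ in range(10)]
for t, frame in enumerate(seq):
    frame[16:24, 8:16] = True          # static part
    frame[8:16, 8 + t:16 + t] = True   # moving part
mhi, nmhi = motion_history(seq)
```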

    Characterizing Objects in Images using Human Context

    Get PDF
    Humans have an unmatched capability of interpreting detailed information about existent objects by just looking at an image. In particular, they can effortlessly perform the following tasks: 1) localizing various objects in the image and 2) assigning functionalities to the parts of localized objects. This dissertation addresses the problem of helping vision systems accomplish these two goals. The first part of the dissertation concerns object detection in a Hough-based framework. To this end, the independence assumption between features is addressed by grouping them in a local neighborhood. We study the complementary nature of individual and grouped features and combine them to achieve improved performance. Further, we consider the challenging case of detecting small and medium sized household objects under human-object interactions. We first evaluate appearance-based star and tree models. While the tree model is slightly better, appearance-based methods continue to suffer from deficiencies caused by human interactions. To this end, we successfully incorporate automatically extracted human pose as a form of context for object detection. The second part of the dissertation addresses the tedious process of manually annotating objects to train fully supervised detectors. We observe that videos of human-object interactions with activity labels can serve as weakly annotated examples of household objects. Since such objects cannot be localized through appearance or motion alone, we propose a framework that includes human-centric functionality to retrieve the common object. Designed to maximize data utility by detecting multiple instances of an object per video, the framework achieves performance comparable to its fully supervised counterpart. The final part of the dissertation concerns localizing functional regions or affordances within objects by casting the problem as one of semantic image segmentation. To this end, we introduce a dataset involving human-object interactions with strong (i.e. pixel-level) and weak (i.e. click-point and image-level) affordance annotations. We propose a framework that utilizes both forms of weak labels and demonstrate that the effort spent on weak annotation can be further optimized using human context.
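    As an illustration of the Hough-based detection framework mentioned above (not the dissertation's actual models), the sketch below lets each local feature, matched to a codebook entry, cast weighted votes for possible object centres; peaks in the accumulator become detection hypotheses. The codebook offsets and weights are toy values.

```python
# Hedged sketch of Hough-style voting for object detection: every local feature
# votes for candidate object centres via its codebook entry, and the strongest
# peak in the accumulator is returned as a detection hypothesis.
import numpy as np

# codebook: feature id -> list of (dx, dy, weight) votes for the object centre
CODEBOOK = {
    0: [(-12, 0, 0.7), (10, 4, 0.3)],
    1: [(0, -15, 0.9)],
    2: [(6, 6, 0.5), (-4, 8, 0.5)],
}

def hough_votes(features, image_shape):
    """features: list of (x, y, feature_id). Returns the vote accumulator."""
    acc = np.zeros(image_shape, dtype=float)
    for x, y, fid in features:
        for dx, dy, w in CODEBOOK.get(fid, []):
            cx, cy = x + dx, y + dy
            if 0 <= cy < image_shape[0] and 0 <= cx < image_shape[1]:
                acc[cy, cx] += w
    return acc

acc = hough_votes([(30, 40, 0), (42, 55, 1), (24, 34, 2)], (100, 100))
cy, cx = np.unravel_index(np.argmax(acc), acc.shape)
print("strongest object-centre hypothesis at", (cx, cy))
```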

    Human-robot interaction and computer-vision-based services for autonomous robots

    Get PDF
    Imitation Learning (IL), or robot Programming by Demonstration (PbD), covers methods by which a robot learns new skills through human guidance and imitation. PbD takes its inspiration from the way humans learn new skills by imitation in order to develop methods by which new tasks can be transferred to robots. This thesis is motivated by the generic question of "what to imitate?", which concerns the problem of how to extract the essential features of a task. To this end, we adopt an Action Recognition (AR) perspective in order to allow the robot to decide what has to be imitated or inferred when interacting with a human. The proposed approach is based on a well-known method from natural language processing, namely Bag of Words (BoW). This method is applied to large databases in order to obtain a trained model. Although BoW is a machine learning technique used in various fields of research, in action classification for robot learning it is far from accurate. Moreover, it focuses on the classification of objects and gestures rather than actions. Thus, in this thesis we show that the method is suitable in action classification scenarios for merging information from different sources or different trials. This thesis makes three contributions: (1) it proposes a general method for dealing with action recognition and thus contributes to imitation learning; (2) the methodology can be applied to large databases which include different modes of action capture; and (3) the method is applied in a real international innovation project called Vinbot.
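    As a concrete illustration of the Bag of Words pipeline described above, the sketch below quantises per-clip local descriptors against a k-means codebook, turns each clip into a histogram of visual words and trains a linear classifier; descriptor extraction is stubbed with random data and all names here are placeholders rather than the thesis implementation.

```python
# Hedged sketch of a Bag-of-Words pipeline for action classification: local
# descriptors from each video clip are quantised against a learned codebook and
# each clip becomes a histogram of visual words fed to a linear classifier.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def extract_descriptors(clip):
    """Placeholder for real spatio-temporal descriptors of one video clip."""
    return rng.normal(size=(int(rng.integers(20, 40)), 32))

def bow_histogram(descriptors, codebook):
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / hist.sum()

clips = [object()] * 30                        # stand-in video clips
labels = rng.integers(0, 3, size=len(clips))   # stand-in action labels

all_desc = [extract_descriptors(c) for c in clips]
codebook = KMeans(n_clusters=50, n_init=10, random_state=0).fit(np.vstack(all_desc))
X = np.array([bow_histogram(d, codebook) for d in all_desc])

clf = LinearSVC().fit(X, labels)
print("predicted action for first clip:", clf.predict(X[:1])[0])
```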