2,019 research outputs found

    Determining Interacting Objects in Human-Centric Activities via Qualitative Spatio-Temporal Reasoning

    Full text link
    Abstract. Understanding the activities taking place in a video is a challenging problem in Artificial Intelligence. Complex video sequences contain many activities and involve a multitude of interacting objects. Determining which objects are relevant to a particular activity is the first step in understanding the activity. Indeed, many objects in the scene are irrelevant to the main activity taking place. In this work, we consider human-centric activities and aim to identify which objects in the scene are involved in the activity. We take an activity-agnostic approach and rank every moving object in the scene by how likely it is to be involved in the activity. We use a comprehensive spatio-temporal representation that captures the joint movement between humans and each object. We then use supervised machine learning techniques to recognize relevant objects based on these features. Our approach is tested on the challenging Mind’s Eye dataset.
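    A minimal sketch (Python, scikit-learn) of the activity-agnostic ranking idea described above: each moving object is represented by joint human-object spatio-temporal features, a supervised classifier scores its likelihood of involvement, and objects are ranked by score. The feature columns and the random-forest classifier are illustrative assumptions, not the paper's exact representation.

```python
# Hypothetical sketch: rank moving objects by likelihood of involvement.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy training set: one row per (human, object) track pair.
# Columns (assumed): mean distance, distance variance, velocity correlation.
X_train = rng.normal(size=(200, 3))
y_train = (X_train[:, 0] < 0).astype(int)      # 1 = object involved in the activity

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# At test time, score every moving object in the scene and rank them.
X_scene = rng.normal(size=(5, 3))              # 5 candidate objects
scores = clf.predict_proba(X_scene)[:, 1]
ranking = np.argsort(-scores)                  # most likely involved first
print("object ranking:", ranking, "scores:", np.round(scores, 2))
```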

    Interaction Visual Transformer for Egocentric Action Anticipation

    Full text link
    Human-object interaction is one of the most important visual cues for egocentric action anticipation, yet it has not been explored for this task. We propose a novel Transformer variant that models interactions by computing the change in the appearance of objects and human hands caused by the execution of actions, and uses those changes to refine the video representation. Specifically, we model interactions between hands and objects using Spatial Cross-Attention (SCA) and further infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens. Using these tokens, we construct an interaction-centric video representation for action anticipation. We term our model InAViT, which achieves state-of-the-art action anticipation performance on the large-scale egocentric datasets EPIC-KITCHENS-100 (EK100) and EGTEA Gaze+. InAViT outperforms other visual-transformer-based methods, including those based on object-centric video representations. On the EK100 evaluation server, InAViT is the top-performing method on the public leaderboard (at the time of submission), where it outperforms the second-best model by 3.3% on mean top-5 recall.
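    Below is a hedged PyTorch sketch of the spatial cross-attention idea: hand tokens attend to object tokens so that the hand representation is refined by the objects being interacted with. The token dimensions, the single attention layer and the residual design are illustrative assumptions, not the actual InAViT architecture.

```python
# Hedged sketch: hand tokens query object tokens via cross-attention.
import torch
import torch.nn as nn

class SpatialCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hand_tokens: torch.Tensor, object_tokens: torch.Tensor) -> torch.Tensor:
        # Queries come from hands, keys/values from objects.
        refined, _ = self.attn(hand_tokens, object_tokens, object_tokens)
        return self.norm(hand_tokens + refined)   # residual connection + layer norm

# Example: 2 hand tokens attend to 4 object tokens for a batch of 8 clips.
sca = SpatialCrossAttention()
hands = torch.randn(8, 2, 256)
objects = torch.randn(8, 4, 256)
print(sca(hands, objects).shape)   # torch.Size([8, 2, 256])
```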

    Looking into Actors, Objects and Their Interactions for Video Understanding

    Get PDF
    Automatic video understanding is critical for enabling new applications in video surveillance, augmented reality, and beyond. Powered by deep networks that learn holistic representations of video clips and by large-scale annotated datasets, modern systems can accurately recognize hundreds of human activity classes. However, their performance degrades significantly as the number of actors in the scene or the complexity of the activities increases. Therefore, most research thus far has focused on videos that are short and/or contain a few activities performed only by adults. Furthermore, most current systems require expensive spatio-temporal annotations for training. These limitations prevent the deployment of such systems in real-life applications, such as detecting the activities of people and vehicles in extended surveillance videos. To address these limitations, this thesis focuses on developing data-driven, compositional, region-based video understanding models, motivated by the observation that actors, objects and their spatio-temporal interactions are the building blocks of activities and the main content of video descriptions provided by humans. This thesis makes three main contributions. First, we propose a novel Graph Neural Network for representation learning on heterogeneous graphs that encode spatio-temporal interactions between actor and object regions in videos. This model can learn context-aware representations for detected actors and objects, which we leverage for detecting complex activities. Second, we propose an attention-based deep conditional generative model of sentences whose latent variables correspond to alignments between words in textual descriptions of videos and object regions. Building upon the framework of Conditional Variational Autoencoders, we train this model using only textual descriptions, without bounding box annotations, and leverage its latent variables for localizing the actors and objects that are mentioned in generated or ground-truth descriptions of videos. Finally, we propose an actor-centric framework for real-time activity detection in videos that are extended both in space and time. Our framework leverages object detection and tracking to generate actor-centric tubelets, capturing all relevant spatio-temporal context for a single actor, and detects activities per tubelet based on contextual region embeddings. The proposed models demonstrably improve the ability to temporally detect activities and to ground words in visual inputs.
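    The first contribution can be pictured with a minimal message-passing sketch: actor and object region embeddings sit on a heterogeneous graph, and each actor aggregates messages from the objects it is connected to, yielding context-aware actor representations. The single round of mean aggregation and the layer sizes below are assumptions for illustration only.

```python
# Illustrative sketch: one message-passing round over actor-object edges.
import torch
import torch.nn as nn

class ActorObjectGNNLayer(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.message = nn.Linear(dim, dim)        # transform object features into messages
        self.update = nn.Linear(2 * dim, dim)     # fuse actor state with aggregated messages

    def forward(self, actors, objects, adj):
        # actors: (A, dim), objects: (O, dim), adj: (A, O) binary actor-object edges
        msgs = self.message(objects)                                        # (O, dim)
        agg = adj @ msgs / adj.sum(dim=1, keepdim=True).clamp(min=1)        # mean over neighbours
        return torch.relu(self.update(torch.cat([actors, agg], dim=-1)))    # context-aware actors

layer = ActorObjectGNNLayer()
actors, objects = torch.randn(3, 128), torch.randn(5, 128)
adj = (torch.rand(3, 5) > 0.5).float()
print(layer(actors, objects, adj).shape)   # torch.Size([3, 128])
```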

    MECCANO: A Multimodal Egocentric Dataset for Humans Behavior Understanding in the Industrial-like Domain

    Full text link
    Wearable cameras make it possible to acquire images and videos from the user's perspective, and these data can be processed to understand human behavior. Although human behavior analysis has been thoroughly investigated in third-person vision, it is still understudied in egocentric settings, and in particular in industrial scenarios. To encourage research in this field, we present MECCANO, a multimodal dataset of egocentric videos for studying human behavior understanding in industrial-like settings. The multimodality is characterized by the presence of gaze signals, depth maps and RGB videos acquired simultaneously with a custom headset. The dataset has been explicitly labeled for fundamental tasks in the context of human behavior understanding from a first-person view, such as recognizing and anticipating human-object interactions. With the MECCANO dataset, we explored five different tasks: 1) Action Recognition, 2) Active Objects Detection and Recognition, 3) Egocentric Human-Objects Interaction Detection, 4) Action Anticipation and 5) Next-Active Objects Detection. We propose a benchmark aimed at studying human behavior in the considered industrial-like scenario, which demonstrates that the investigated tasks and the considered scenario are challenging for state-of-the-art algorithms. To support research in this field, we publicly release the dataset at https://iplab.dmi.unict.it/MECCANO/.

    Object-agnostic Affordance Categorization via Unsupervised Learning of Graph Embeddings

    Get PDF
    Acquiring knowledge about object interactions and affordances can facilitate scene understanding and human-robot collaboration tasks. As humans tend to use objects in many different ways depending on the scene and the objects’ availability, learning object affordances in everyday-life scenarios is a challenging task, particularly in the presence of an open set of interactions and objects. We address the problem of affordance categorization for class-agnostic objects with an open set of interactions; we achieve this by learning similarities between object interactions in an unsupervised way, thus inducing clusters of object affordances. A novel depth-informed qualitative spatial representation is proposed for the construction of Activity Graphs (AGs), which abstract from the continuous representation of spatio-temporal interactions in RGB-D videos. These AGs are clustered to obtain groups of objects with similar affordances. Our experiments in a real-world scenario demonstrate that our method learns to create object affordance clusters with a high V-measure even in cluttered scenes. The proposed approach handles object occlusions by effectively capturing possible interactions, without imposing any object or scene constraints.
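    An illustrative sketch of the clustering-and-evaluation setup described above: interaction embeddings (random placeholders standing in for the learned Activity Graph embeddings) are clustered without supervision and the resulting clusters are scored against ground-truth affordance labels with the V-measure. The embedding source and the choice of KMeans are assumptions; the paper instead clusters learned graph embeddings.

```python
# Illustrative sketch: unsupervised clustering of embeddings + V-measure evaluation.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

rng = np.random.default_rng(0)
n_affordances = 4

# Placeholder embeddings: 4 affordance groups, 20 objects each, 32-d vectors.
true_labels = np.repeat(np.arange(n_affordances), 20)
embeddings = rng.normal(size=(80, 32)) + true_labels[:, None] * 3.0

clusters = KMeans(n_clusters=n_affordances, n_init=10, random_state=0).fit_predict(embeddings)
print("V-measure:", round(v_measure_score(true_labels, clusters), 3))
```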

    Time-slice analysis of dyadic human activity

    Get PDF
    Recognizing human activities from video data is routinely leveraged for surveillance and human-computer interaction applications. The main focus has been classifying videos into one of k action classes from fully observed videos. However, intelligent systems must make decisions under uncertainty and from incomplete information. This need motivates us to introduce the problem of analysing the uncertainty associated with human activities and to move to a new level of generality in the action analysis problem. We also present the problem of time-slice activity recognition, which aims to explore human activity at a small temporal granularity. Time-slice recognition is able to infer human behaviours from a short temporal window.
It has been shown that temporal slice analysis is helpful for motion characterization and for video content representation in general. These studies motivate us to consider time-slices for analysing the uncertainty associated with human activities. We report to what degree of certainty each activity is occurring throughout the video, from definitely not occurring to definitely occurring. In this research, we propose three frameworks for time-slice analysis of dyadic human activity under uncertainty. i) We present a new family of spatio-temporal descriptors which are optimized for early prediction with time-slice action annotations. Our predictive spatio-temporal interest point (Predict-STIP) representation is based on the intuition of temporal contingency between time-slices. ii) We exploit state-of-the-art techniques to extract interest points in order to represent time-slices. We also present an accumulative uncertainty measure to depict the uncertainty associated with partially observed videos for the task of early activity recognition. iii) We use Convolutional Neural Network-based unary and pairwise relations between human body joints in each time-slice. The unary term captures the local appearance of the joints while the pairwise term captures the local contextual relations between the parts. We extract these features from each frame in a time-slice and examine different temporal aggregations to generate a descriptor for the whole time-slice. Furthermore, we create a novel dataset which is annotated at multiple short temporal windows, allowing the modelling of the inherent uncertainty in time-slice activity recognition. All three methods have been evaluated on the TAP dataset. Experimental results demonstrate the effectiveness of our framework in the analysis of dyadic activities under uncertainty.
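    A small sketch of the temporal aggregation step mentioned in (iii): per-frame joint features inside a time-slice are pooled into one descriptor for the whole slice. The feature dimensionality and the mean/max pooling variants are illustrative assumptions, not the thesis's exact choices.

```python
# Illustrative sketch: pool per-frame features of a time-slice into one descriptor.
import numpy as np

def timeslice_descriptor(frame_features: np.ndarray, mode: str = "mean") -> np.ndarray:
    """Aggregate (T, D) per-frame features into a single (D,) slice descriptor."""
    if mode == "mean":
        return frame_features.mean(axis=0)
    if mode == "max":
        return frame_features.max(axis=0)
    raise ValueError(f"unknown aggregation mode: {mode}")

# Example: a 15-frame time-slice with 64-d unary+pairwise features per frame.
frames = np.random.default_rng(0).normal(size=(15, 64))
print(timeslice_descriptor(frames, "mean").shape, timeslice_descriptor(frames, "max").shape)
```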

    Audio-Visual Egocentric Action Recognition

    Get PDF

    Context-based scene recognition from visual data in smart homes: an Information Fusion approach

    Get PDF
    Ambient Intelligence (AmI) aims at the development of computational systems that process data acquired by sensors embedded in the environment to support users in everyday tasks. Visual sensors, however, have been scarcely used in this kind of applications, even though they provide very valuable information about scene objects: position, speed, color, texture, etc. In this paper, we propose a cognitive framework for the implementation of AmI applications based on visual sensor networks. The framework, inspired by the Information Fusion paradigm, combines a priori context knowledge represented with ontologies with real-time single-camera data to support logic-based high-level local interpretation of the current situation. In addition, the system is able to automatically generate feedback recommendations to adjust data acquisition procedures. Information about recognized situations is eventually collected by a central node to obtain an overall description of the scene and consequently trigger AmI services. We show the extensible and adaptable nature of the approach with a prototype system in a smart home scenario. This research activity is supported in part by Projects CICYT TIN2008-06742-C02-02/TSI, CICYT TEC2008-06732-C02-02/TEC, CAM CONTEXTS (S2009/TIC-1485) and DPS2008-07029-C02-02.
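    As a loose illustration of the logic-based interpretation step, the sketch below combines a priori context knowledge with single-camera detections to infer the current situation. The paper relies on ontologies and formal rules; the hypothetical dictionary-based rules and object names here are assumptions used only to convey the idea.

```python
# Hypothetical sketch: combine context knowledge with detections to name a situation.
CONTEXT = {"kitchen": {"expected_objects": {"kettle", "cup"}}}

RULES = [
    # (required room, required detections, inferred situation)
    ("kitchen", {"person", "kettle"}, "preparing a hot drink"),
    ("kitchen", {"person"}, "person present in kitchen"),
]

def interpret(room, detections):
    """Return the first situation whose room and required detections match."""
    for rule_room, required, situation in RULES:
        if room == rule_room and required <= detections:
            return situation
    return "unknown situation"

print(interpret("kitchen", {"person", "kettle", "cup"}))  # preparing a hot drink
```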
    • 
