11 research outputs found

    Egocentric Vision-based Action Recognition: A survey

    Get PDF
    [EN] The egocentric action recognition (EAR) field has recently grown in popularity thanks to the affordable and lightweight wearable cameras now available, such as GoPro and similar devices. As a result, the amount of egocentric data generated has increased, triggering interest in the understanding of egocentric videos. More specifically, the recognition of actions in egocentric videos has gained popularity due to the challenge it poses: the wild movement of the camera and the lack of context make it hard to recognise actions with a performance similar to that of third-person vision solutions. This has ignited research interest in the field and, nowadays, many public datasets and competitions can be found in both the machine learning and the computer vision communities. In this survey, we aim to analyse the literature on egocentric vision methods and algorithms. For that, we propose a taxonomy that divides the literature into categories and subcategories, contributing a more fine-grained classification of the available methods. We also provide a review of the zero-shot approaches used by the EAR community, a methodology that could help transfer EAR algorithms to real-world applications. Finally, we summarise the datasets used by researchers in the literature. We gratefully acknowledge the support of the Basque Government's Department of Education for the predoctoral funding of the first author. This work has been supported by the Spanish Government under the FuturAAL-Context project (RTI2018-101045-B-C21) and by the Basque Government under the Deustek project (IT-1078-16-D).
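    Zero-shot EAR approaches like those the survey reviews are often framed as matching a clip-level video embedding against semantic embeddings of the candidate action labels, so that unseen actions can be named without retraining. The following is a minimal sketch of that framing under stated assumptions; the function name, embedding sizes, and placeholder features are illustrative, not code from the survey.

    ```python
    import numpy as np

    def zero_shot_classify(video_embedding, label_embeddings, label_names):
        """Name a (possibly unseen) action by nearest semantic label embedding."""
        # L2-normalise so the dot product equals cosine similarity
        v = video_embedding / np.linalg.norm(video_embedding)
        L = label_embeddings / np.linalg.norm(label_embeddings, axis=1, keepdims=True)
        scores = L @ v
        return label_names[int(np.argmax(scores))], scores

    # Placeholder embeddings stand in for real video/text encoders (hypothetical).
    rng = np.random.default_rng(0)
    video = rng.normal(size=512)
    labels = rng.normal(size=(3, 512))
    print(zero_shot_classify(video, labels, ["cut onion", "pour water", "open fridge"]))
    ```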

    Multi-Task Learning of Object State Changes from Uncurated Videos

    Full text link
    We aim to learn to temporally localize object state changes and the corresponding state-modifying actions by observing people interacting with objects in long uncurated web videos. We introduce three principal contributions. First, we explore alternative multi-task network architectures and identify a model that enables efficient joint learning of multiple object states and actions, such as pouring water and pouring coffee. Second, we design a multi-task self-supervised learning procedure that exploits different types of constraints between objects and state-modifying actions, enabling end-to-end training of a model for temporal localization of object states and actions in videos from only noisy video-level supervision. Third, we report results on the large-scale ChangeIt and COIN datasets containing tens of thousands of long (un)curated web videos depicting various interactions such as hole drilling, cream whisking, or paper plane folding. We show that our multi-task model achieves a relative improvement of 40% over the prior single-task methods and significantly outperforms both image-based and video-based zero-shot models for this problem. We also test our method on long egocentric videos of the EPIC-KITCHENS and Ego4D datasets in a zero-shot setup, demonstrating the robustness of our learned model.
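    The ordering constraint the abstract alludes to (an initial object state, then the state-modifying action, then the end state) can be made concrete with a small dynamic-programming step over per-frame scores. The sketch below is a hedged illustration under that assumption; the score arrays and function are hypothetical and are not the authors' released model.

    ```python
    import numpy as np

    def localize_state_change(s_initial, s_action, s_final):
        """Pick frames (i, j, k) with i < j < k maximising the summed scores,
        enforcing the ordering 'initial state -> action -> end state'.

        s_initial, s_action, s_final: (T,) per-frame scores from hypothetical
        per-category heads of a multi-task model.
        """
        T = len(s_action)
        assert T >= 3, "need at least three frames to bracket an action"
        # best initial-state score strictly before each frame
        best_init_before = np.full(T, -np.inf)
        for j in range(1, T):
            best_init_before[j] = max(best_init_before[j - 1], s_initial[j - 1])
        # best end-state score strictly after each frame
        best_final_after = np.full(T, -np.inf)
        for j in range(T - 2, -1, -1):
            best_final_after[j] = max(best_final_after[j + 1], s_final[j + 1])
        # combine: each candidate action frame j is bracketed by the best states
        totals = best_init_before + s_action + best_final_after
        j = int(np.argmax(totals[1:T - 1])) + 1
        i = int(np.argmax(s_initial[:j]))
        k = j + 1 + int(np.argmax(s_final[j + 1:]))
        return i, j, k
    ```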

    EgoTaskQA: Understanding Human Tasks in Egocentric Videos

    Full text link
    Understanding human tasks through video observations is an essential capability of intelligent agents. The challenges of such capability lie in the difficulty of generating a detailed understanding of situated actions, their effects on object states (i.e., state changes), and their causal dependencies. These challenges are further aggravated by the natural parallelism from multi-tasking and partial observations in multi-agent collaboration. Most prior works leverage action localization or future prediction as an indirect metric for evaluating such task understanding from videos. To make a direct evaluation, we introduce the EgoTaskQA benchmark, which provides a single home for the crucial dimensions of task understanding through question answering on real-world egocentric videos. We meticulously design questions that target the understanding of (1) action dependencies and effects, (2) intents and goals, and (3) agents' beliefs about others. These questions are divided into four types, including descriptive (what status?), predictive (what will?), explanatory (what caused?), and counterfactual (what if?), to provide diagnostic analyses of spatial, temporal, and causal understanding of goal-oriented tasks. We evaluate state-of-the-art video reasoning models on our benchmark and show the significant gap between them and humans in understanding complex goal-oriented egocentric videos. We hope this effort will drive the vision community to move onward with goal-oriented video understanding and reasoning. Comment: Published at the NeurIPS Track on Datasets and Benchmarks, 2022.
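    The four question types lend themselves to per-type diagnostic analysis: a model's accuracy can be broken down by descriptive, predictive, explanatory, and counterfactual questions. Below is a small illustrative helper under assumed field names; it is not the benchmark's actual schema or evaluation code.

    ```python
    from dataclasses import dataclass
    from collections import defaultdict

    # Question categories named in the abstract; field names are illustrative.
    QUESTION_TYPES = ("descriptive", "predictive", "explanatory", "counterfactual")

    @dataclass
    class QAItem:
        video_id: str
        question: str
        answer: str
        qtype: str  # one of QUESTION_TYPES

    def per_type_accuracy(items, predictions):
        """Break model accuracy down by question type for diagnostic analysis."""
        correct, total = defaultdict(int), defaultdict(int)
        for item, pred in zip(items, predictions):
            total[item.qtype] += 1
            correct[item.qtype] += int(pred == item.answer)
        return {t: correct[t] / total[t] for t in QUESTION_TYPES if total[t]}
    ```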

    Qualitative Spatial and Temporal Reasoning based on And/Or Linear Programming: An approach to partially grounded qualitative spatial reasoning

    Get PDF
    Acting intelligently in dynamic environments involves anticipating surrounding processes, for example to foresee a dangerous situation or acceptable social behavior. Knowledge about spatial configurations and how they develop over time enables intelligent robots to safely navigate by reasoning about possible actions. The seamless connection of high-level deliberative processes to perception and action selection remains a challenge, though. Moreover, an integration should allow the robot to build awareness of these processes, as in reality there will be misunderstandings that a robot should be able to respond to. My aim is to verify that actions selected by the robot do not violate navigation or safety regulations and thereby endanger the robot or others. Navigation rules specified qualitatively allow an autonomous agent to consistently combine all rules applicable in a context. Within this thesis, I develop a formal, symbolic representation of right-of-way rules based on a qualitative spatial representation. This cumulative dissertation consists of 5 peer-reviewed papers and 1 manuscript under review. The contribution of this thesis is an approach to represent navigation patterns based on a qualitative spatio-temporal representation and the development of corresponding effective, sound reasoning techniques. The approach is based on a spatial logic in the sense of Aiello, Pratt-Hartmann, and van Benthem. This logic has clear spatial and temporal semantics, and I demonstrate how it allows various navigation rules and social conventions to be represented. I demonstrate the applicability of the developed method in three different areas: an autonomous robotic system in an industrial setting, an autonomous sailing boat, and a robot that should act politely by adhering to social conventions. In all three settings, the navigation behavior is specified by logic formulas. Temporal reasoning is performed via model checking. An important aspect is that a logic symbol, such as "turn left", comprises a family of movement behaviors rather than a single pre-specified movement command. This makes it possible to incorporate the current spatial context, the possibly changing kinematics of the robotic system, and so on, without changing a single formula. Additionally, I show that the developed approach can be integrated into various robotic software architectures. Further, answers to three long-standing questions in the field of qualitative spatial reasoning are presented. Using generalized linear programming as a unifying basis for reasoning, one can jointly reason about relations from different qualitative calculi. Also, concrete entities (fixed points, regions fixed in shape and/or position, etc.) can be mixed with free variables. In addition, a realization of a qualitative spatial description can be calculated, i.e., a specific instance or example. All three features are important for applications but cannot be handled by other techniques. I advocate the use of And/Or trees to facilitate efficient reasoning, and I show the feasibility of my approach. Last but not least, I investigate a fourth question: how to integrate And/Or trees with linear temporal logic to enable spatio-temporal reasoning.
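    The use of linear programming as a unifying basis for qualitative reasoning can be illustrated with a toy, partially grounded network: two qualitative "left of" relations plus one entity fixed at a concrete coordinate are translated into linear constraints, and a solver either returns a realization or reports inconsistency. The encoding, the epsilon margin for strict inequalities, and the use of scipy are my assumptions for illustration, not the thesis's And/Or-tree implementation.

    ```python
    import numpy as np
    from scipy.optimize import linprog

    # Hypothetical encoding: three point-like entities A, B, C on one axis,
    # qualitative relations "A left of B" and "B left of C", with A grounded
    # at a concrete coordinate (a partially grounded description).
    EPS = 1e-3  # margin that stands in for the strict inequality of "left of"

    # variable order: x = [x_A, x_B, x_C]
    A_ub = np.array([[1.0, -1.0,  0.0],   # x_A - x_B <= -EPS  (A left of B)
                     [0.0,  1.0, -1.0]])  # x_B - x_C <= -EPS  (B left of C)
    b_ub = np.array([-EPS, -EPS])
    A_eq = np.array([[1.0, 0.0, 0.0]])    # x_A = 2.0 (grounded, concrete entity)
    b_eq = np.array([2.0])

    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(None, None)] * 3, method="highs")
    if res.success:
        print("consistent; one realization:", dict(zip("ABC", res.x)))
    else:
        print("qualitative description is inconsistent")
    ```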

    TME Volume 7, Numbers 2 and 3

    Get PDF