
    Goal-oriented top-down probabilistic visual attention model for recognition of manipulated objects in egocentric videos

    We propose a new top-down probabilistic saliency model for egocentric video content. It predicts top-down visual attention maps focused on manipulated objects, which are then used for psycho-visual weighting of features in the problem of manipulated-object recognition. The model is probabilistically defined using both global and local appearance features extracted from automatically segmented arm areas and objects. A psycho-visual experiment conducted in a guided framework compares our proposal and other popular state-of-the-art models against human gaze fixations. The results show that our approach outperforms several popular bottom-up saliency approaches on a well-known egocentric dataset. Furthermore, an additional task-driven assessment of object recognition in egocentric video reveals that the proposed method improves the performance of several state-of-the-art object-detection techniques.
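    A minimal sketch of the psycho-visual weighting step described above, assuming a precomputed attention map with values in [0, 1] and dense local descriptors; the array shapes, the multiplicative weighting and the sum pooling are illustrative assumptions, not the paper's exact formulation:

        import numpy as np

        def weight_features_by_saliency(descriptors, positions, saliency_map):
            """Weight local descriptors by the saliency value at their location.

            descriptors  : (N, D) array of local features (e.g., patch descriptors)
            positions    : (N, 2) array of (row, col) keypoint coordinates
            saliency_map : (H, W) attention map with values in [0, 1]
            """
            rows = positions[:, 0].astype(int)
            cols = positions[:, 1].astype(int)
            weights = saliency_map[rows, cols]         # saliency at each keypoint
            weighted = descriptors * weights[:, None]  # down-weight background features
            # Saliency-weighted sum pooling into a single global representation.
            return weighted.sum(axis=0) / (weights.sum() + 1e-8)

    Features falling on the manipulated object then dominate the pooled representation, while background clutter is suppressed.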

    Recognition of Activities of Daily Living with Egocentric Vision: A Review

    Video-based recognition of activities of daily living (ADLs) is being used in ambient assisted living systems to support the independent living of older people. However, current systems based on cameras located in the environment present a number of problems, such as occlusions and a limited field of view. Recently, wearable cameras have begun to be exploited. This paper presents a review of the state of the art of egocentric vision systems for the recognition of ADLs, following a hierarchical structure of motion, action and activity levels, where each level provides higher semantic information and involves a longer time frame. The current egocentric vision literature suggests that ADL recognition is mainly driven by the objects present in the scene, especially those associated with specific tasks. However, although object-based approaches have proven popular, object recognition remains a challenge due to the intra-class variations found in unconstrained scenarios. As a consequence, the performance of current systems is far from satisfactory.

    Perceptual Recognition of Objects of Interest: Application to the Interpretation of Instrumental Activities of Daily Living for Dementia Studies

    The rationale and motivation of this PhD thesis lie in the diagnosis, assessment, maintenance and promotion of the self-independence of people with dementia in their Instrumental Activities of Daily Living (IADLs). In this context, a strong focus is placed on the task of automatically recognizing IADLs. Egocentric video analysis (with cameras worn by a person) has recently gained much interest regarding this goal. Indeed, recent studies have demonstrated how crucial the recognition of active objects (manipulated or observed by the person wearing the camera) is for the activity recognition task, and egocentric videos have the advantage of a strong differentiation between active and passive objects (associated with the background). One recent approach towards finding active elements in a scene is the incorporation of visual saliency in object recognition paradigms. Modeling the selective process of human perception of visual scenes is an efficient way to drive scene analysis towards particular areas considered of interest, or salient, which, in egocentric videos, strongly correspond to the locations of objects of interest. The objective of this thesis is to design an object recognition system that relies on visual saliency maps to provide more precise object representations that are robust against background clutter and thereby improve the recognition of active objects for the IADL recognition task. This PhD thesis is conducted in the framework of the Dem@care European project.
    Regarding the vast field of visual saliency modeling, we investigate and propose contributions in both the bottom-up (gaze driven by stimuli) and top-down (gaze driven by semantics) areas, aiming to enhance the particular task of active object recognition in egocentric video content. Our first contribution, on bottom-up models, originates from the fact that observers are attracted by a central stimulus (the center of an image). This biological phenomenon is known as central bias. In egocentric videos, however, this hypothesis does not always hold. We therefore study saliency models with non-central-bias geometrical cues. The proposed visual saliency models are trained on recorded eye fixations of observers and incorporated into spatio-temporal saliency models. When compared to state-of-the-art visual saliency models, the ones we present show promising results, as they highlight the necessity of a non-centered geometric saliency cue. For our top-down contribution, we present a probabilistic visual attention model for manipulated-object recognition in egocentric video content. Although arms often occlude objects and are usually considered a burden for many vision systems, they become an asset in our approach: we extract both global and local features describing their geometric layout and pose, as well as the objects being manipulated. We integrate this information in a probabilistic generative model, provide update equations that automatically compute the model parameters optimizing the likelihood of the data, and design a method to generate maps of visual attention that are later used in an object-recognition framework. A task-driven assessment reveals that the proposed method outperforms the state of the art in object recognition for egocentric video content. [...]
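    As a hedged illustration of the likelihood-optimizing step mentioned above (the thesis model conditions on global and local arm features; this sketch only fits a single 2-D Gaussian to recorded fixations by maximum likelihood and renders it as a dense attention map):

        import numpy as np

        def fit_gaussian_attention(fixations, height, width):
            """Maximum-likelihood 2-D Gaussian fit to gaze fixations,
            rendered as a dense visual attention map.

            fixations : (N, 2) array of (row, col) gaze points
            """
            mu = fixations.mean(axis=0)                      # ML estimate of the mean
            cov = np.cov(fixations, rowvar=False)            # ML estimate of the covariance
            inv_cov = np.linalg.inv(cov + 1e-6 * np.eye(2))  # regularized for stability
            rr, cc = np.mgrid[0:height, 0:width]
            d = np.stack([rr - mu[0], cc - mu[1]], axis=-1)  # offsets from the mean
            mahal = np.einsum('...i,ij,...j->...', d, inv_cov, d)
            attention = np.exp(-0.5 * mahal)
            return attention / attention.max()               # normalized attention map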

    Left/Right Hand Segmentation in Egocentric Videos

    Wearable cameras allow people to record their daily activities from a user-centered (First Person Vision) perspective. Due to their favorable location, wearable cameras frequently capture the hands of the user, and may thus represent a promising user-machine interaction tool for different applications. Existing First Person Vision methods handle hand segmentation as a background-foreground problem, ignoring two important facts: i) hands are not a single "skin-like" moving element but a pair of interacting cooperative entities; ii) close hand interactions may lead to hand-to-hand occlusions and, as a consequence, create a single hand-like segment. These facts complicate a proper understanding of hand movements and interactions. Our approach extends traditional background-foreground strategies by including a hand-identification step (left/right) based on a Maxwell distribution of angle and position. Hand-to-hand occlusions are addressed by exploiting temporal superpixels. The experimental results show that, in addition to a reliable left/right hand segmentation, our approach considerably improves the traditional background-foreground hand segmentation.
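    A hedged sketch of the left/right identification step, using SciPy's Maxwell distribution over a segment's border angle and normalized horizontal position; the (loc, scale) parameter dictionaries and the independence assumption between the two cues are illustrative, not the paper's exact model:

        from scipy.stats import maxwell

        def identify_hand(angle, x_norm, left_params, right_params):
            """Assign a hand-like segment to the left or right hand by comparing
            log-likelihoods under per-hand Maxwell models of angle and position.

            angle        : segment orientation where it enters the frame, in radians
            x_norm       : normalized horizontal entry position in [0, 1]
            left_params, right_params : dicts {'angle': (loc, scale),
                           'pos': (loc, scale)} fitted from labelled training
                           segments, e.g. with maxwell.fit()
            """
            def log_lik(params):
                loc_a, scale_a = params['angle']
                loc_p, scale_p = params['pos']
                return (maxwell.logpdf(angle, loc=loc_a, scale=scale_a)
                        + maxwell.logpdf(x_norm, loc=loc_p, scale=scale_p))

            return 'left' if log_lik(left_params) > log_lik(right_params) else 'right'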

    Learning to Predict Gaze in Egocentric Video

    We present a model for gaze prediction in egocentric video by leveraging the implicit cues that exist in the camera wearer's behaviors. Specifically, we compute the camera wearer's head motion and hand location from the video and combine them to estimate where the eyes look. We further model the dynamic behavior of the gaze, in particular fixations, as latent variables to improve the gaze prediction. Our gaze prediction results outperform the state-of-the-art algorithms by a large margin on publicly available egocentric vision datasets. In addition, we demonstrate a significant performance boost in recognizing daily actions and segmenting foreground objects by plugging our gaze predictions into state-of-the-art methods.
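    A minimal sketch of combining the two cues, assuming per-frame head-motion vectors (e.g., mean optical flow) and hand centroids as inputs; the paper additionally models fixations as latent variables, which this least-squares baseline omits:

        import numpy as np

        def fit_gaze_predictor(head_motion, hand_pos, gaze):
            """Least-squares linear map from (head motion, hand position) to gaze.

            head_motion : (N, 2) per-frame head-motion proxy (e.g., mean optical flow)
            hand_pos    : (N, 2) per-frame hand centroid
            gaze        : (N, 2) ground-truth gaze points
            Returns W such that [head_motion, hand_pos, 1] @ W approximates gaze.
            """
            X = np.hstack([head_motion, hand_pos, np.ones((len(gaze), 1))])
            W, *_ = np.linalg.lstsq(X, gaze, rcond=None)
            return W

        def predict_gaze(W, head_motion, hand_pos):
            """Predict a gaze point for one frame from its two cues."""
            x = np.concatenate([head_motion, hand_pos, [1.0]])
            return x @ W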

    An Outlook into the Future of Egocentric Vision

    What will the future be? We wonder! In this survey, we explore the gap between current research in egocentric vision and the ever-anticipated future, where wearable computing, with outward-facing cameras and digital overlays, is expected to be integrated into our everyday lives. To understand this gap, the article starts by envisaging the future through character-based stories, showcasing through examples the limitations of current technology. We then provide a mapping between this future and previously defined research tasks. For each task, we survey its seminal works, current state-of-the-art methodologies and available datasets, then reflect on the shortcomings that limit its applicability to future research. Note that this survey focuses on software models for egocentric vision, independent of any specific hardware. The paper concludes with recommendations for areas of immediate exploration so as to unlock our path to the future of always-on, personalised and life-enhancing egocentric vision.
    Comment: We invite comments, suggestions and corrections here: https://openreview.net/forum?id=V3974SUk1

    Affordance Diffusion: Synthesizing Hand-Object Interactions

    Recent successes in image synthesis are powered by large-scale diffusion models. However, most methods are currently limited to either text- or image-conditioned generation for synthesizing an entire image, transferring texture, or inserting objects into a user-specified region. In contrast, in this work we focus on synthesizing complex interactions (i.e., an articulated hand) with a given object. Given an RGB image of an object, we aim to hallucinate plausible images of a human hand interacting with it. We propose a two-step generative approach: a LayoutNet that samples an articulation-agnostic hand-object-interaction layout, and a ContentNet that synthesizes images of a hand grasping the object given the predicted layout. Both are built on top of a large-scale pretrained diffusion model to make use of its latent representation. Compared to baselines, the proposed method is shown to generalize better to novel objects and to perform surprisingly well on out-of-distribution, in-the-wild scenes of portable-sized objects. The resulting system allows us to predict descriptive affordance information, such as hand articulation and approaching orientation. Project page: https://judyye.github.io/affordiffusion-ww
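    A structural sketch of the two-step generation described above; LayoutNet and ContentNet are placeholder interfaces here (sample() and generate() are hypothetical method names, and the HOILayout fields are assumptions), since the diffusion-model conditioning itself is not reproduced:

        from dataclasses import dataclass

        @dataclass
        class HOILayout:
            """Articulation-agnostic hand-object-interaction layout
            (hypothetical fields: palm location, size, approach direction)."""
            palm_xy: tuple
            scale: float
            approach_deg: float

        def synthesize_hoi(object_image, layout_net, content_net, n_samples=4):
            """Two-step generation: sample a layout, then synthesize the hand
            conditioned on it. Both networks are assumed to wrap a pretrained
            diffusion model; this function only fixes the control flow."""
            results = []
            for _ in range(n_samples):
                layout = layout_net.sample(object_image)                 # step 1: where/how to grasp
                hand_image = content_net.generate(object_image, layout)  # step 2: pixels
                results.append((layout, hand_image))
            return results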