
    Gaze-Informed Egocentric Action Recognition for Memory Aid Systems

    Egocentric action recognition has been intensively studied in the fields of computer vision and clinical science, with applications in pervasive healthcare. Most existing egocentric action recognition techniques use features extracted from either the entire content of a video frame or from computed regions of interest as the inputs to action classifiers. The former can suffer from the moving backgrounds and irrelevant foregrounds typical of egocentric action videos, while the latter can be impaired by mismatches between the computed and ground-truth regions of interest. This paper proposes a new gaze-informed feature extraction approach in which features are extracted from the regions around the gaze points, and thus represent the genuine regions of interest from a first-person point of view. Activities of daily living can then be classified using only the gaze-informed features extracted from the identified regions. The proposed approach is further applied to a memory support system for people with impaired memory, such as those with amnesia or dementia, and for their carers. The experimental results demonstrate the efficacy of the proposed approach in egocentric action recognition and hence the potential of the memory support tool in healthcare.
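A minimal sketch of the core idea, assuming the gaze point is given in pixel coordinates and the frame is a NumPy array; the function name `crop_gaze_region` and the fixed 224-pixel window are illustrative choices, not details taken from the paper:

```python
import numpy as np

def crop_gaze_region(frame, gaze_xy, size=224):
    """Crop a square patch centred on the gaze point.

    frame:   H x W x 3 image array
    gaze_xy: (x, y) gaze point in pixel coordinates
    size:    side length of the region of interest
    """
    h, w = frame.shape[:2]
    half = size // 2
    # Clamp the crop window so it stays inside the frame.
    x0 = int(np.clip(gaze_xy[0] - half, 0, max(w - size, 0)))
    y0 = int(np.clip(gaze_xy[1] - half, 0, max(h - size, 0)))
    return frame[y0:y0 + size, x0:x0 + size]

# Features would then be extracted from the returned patch only,
# e.g. patch_features = descriptor(crop_gaze_region(frame, gaze)).
```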

    Enhanced Gradient-Based Local Feature Descriptors by Saliency Map for Egocentric Action Recognition

    Egocentric video analysis is an important tool in healthcare, serving a variety of purposes such as memory aid systems and physical rehabilitation, and feature extraction is an indispensable step in such analysis. Local feature descriptors have been widely applied due to their simple implementation, reasonable efficiency, and good performance in applications. This paper proposes an enhanced spatial and temporal local feature descriptor extraction method to boost the performance of action classification. The approach allows local feature descriptors to take advantage of saliency maps, which provide insights into visual attention. The effectiveness of the proposed method was validated and evaluated in a comparative study, whose results demonstrated an accuracy improvement of around 2%.
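As a hedged illustration of how a saliency map can inform local descriptors (the paper does not publish this exact scheme), one simple option is to scale each gradient-based descriptor by the saliency value under its keypoint; `saliency_weighted_descriptors` and its arguments are hypothetical names:

```python
import numpy as np

def saliency_weighted_descriptors(descriptors, keypoints, saliency):
    """Scale each local descriptor by the saliency at its keypoint.

    descriptors: N x D array of gradient-based descriptors (e.g. HOG/HOF)
    keypoints:   N x 2 array of (x, y) keypoint locations
    saliency:    H x W saliency map with values in [0, 1]
    """
    xs = keypoints[:, 0].astype(int)
    ys = keypoints[:, 1].astype(int)
    weights = saliency[ys, xs]           # saliency value under each keypoint
    return descriptors * weights[:, None]
```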

    Egocentric Vision-based Action Recognition: A Survey

    The egocentric action recognition (EAR) field has recently increased in popularity due to the affordable and lightweight wearable cameras available nowadays, such as GoPro and similar devices. As a result, the amount of egocentric data generated has increased, triggering interest in the understanding of egocentric videos. More specifically, the recognition of actions in egocentric videos has gained popularity due to the challenge it poses: the wild movement of the camera and the lack of context make it hard to recognise actions with a performance similar to that of third-person vision solutions. This has ignited research interest in the field and, nowadays, many public datasets and competitions can be found in both the machine learning and computer vision communities. In this survey, we aim to analyse the literature on egocentric vision methods and algorithms. To that end, we propose a taxonomy that divides the literature into various categories with subcategories, contributing a more fine-grained classification of the available methods. We also provide a review of the zero-shot approaches used by the EAR community, a methodology that could help to transfer EAR algorithms to real-world applications. Finally, we summarise the datasets used by researchers in the literature. We gratefully acknowledge the support of the Basque Government's Department of Education for the predoctoral funding of the first author. This work has been supported by the Spanish Government under the FuturAAL-Context project (RTI2018-101045-B-C21) and by the Basque Government under the Deustek project (IT-1078-16-D).

    Eyewear Computing – Augmenting the Human with Head-Mounted Wearable Assistants

    The seminar was composed of workshops and tutorials on head-mounted eye tracking, egocentric vision, optics, and head-mounted displays. The seminar welcomed 30 academic and industry researchers from Europe, the US, and Asia with a diverse background, including wearable and ubiquitous computing, computer vision, developmental psychology, optics, and human-computer interaction. In contrast to several previous Dagstuhl seminars, we used an ignite talk format to reduce the time of talks to one half-day and to leave the rest of the week for hands-on sessions, group work, general discussions, and socialising. The key results of this seminar are 1) the identification of key research challenges and summaries of breakout groups on multimodal eyewear computing, egocentric vision, security and privacy issues, skill augmentation and task guidance, eyewear computing for gaming, as well as prototyping of VR applications, 2) a list of datasets and research tools for eyewear computing, 3) three small-scale datasets recorded during the seminar, 4) an article in ACM Interactions entitled “Eyewear Computers for Human-Computer Interaction”, as well as 5) two follow-up workshops on “Egocentric Perception, Interaction, and Computing” at the European Conference on Computer Vision (ECCV) and “Eyewear Computing” at the ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp).

    Mutual Context Network for Jointly Estimating Egocentric Gaze and Actions

    In this work, we address the two coupled tasks of gaze prediction and action recognition in egocentric videos by exploring their mutual context. Our assumption is that, while performing a manipulation task, what a person is doing determines where the person is looking, and the gaze point reveals gaze and non-gaze regions which contain important and complementary information about the ongoing action. We propose a novel mutual context network (MCN) that jointly learns action-dependent gaze prediction and gaze-guided action recognition in an end-to-end manner. Experiments on public egocentric video datasets demonstrate that our MCN achieves state-of-the-art performance on both gaze prediction and action recognition.
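A toy PyTorch sketch of the mutual-context idea, assuming a single-frame input and illustrative layer sizes; `MutualContextSketch` is not the published MCN architecture, only a minimal illustration of pooling features inside and outside a predicted gaze region:

```python
import torch
import torch.nn as nn

class MutualContextSketch(nn.Module):
    """Toy two-branch model: shared features drive gaze prediction,
    and the predicted gaze map re-weights features for action
    recognition. Layer sizes are illustrative, not the MCN's."""

    def __init__(self, channels=256, num_actions=10):
        super().__init__()
        self.backbone = nn.Conv2d(3, channels, 3, padding=1)  # stand-in encoder
        self.gaze_head = nn.Conv2d(channels, 1, 1)            # gaze heat map
        self.action_head = nn.Linear(2 * channels, num_actions)

    def forward(self, frames):
        feats = self.backbone(frames)                 # B x C x H x W
        gaze = torch.sigmoid(self.gaze_head(feats))   # B x 1 x H x W
        # Pool features inside (gaze) and outside (non-gaze) the attended region.
        gaze_feat = (feats * gaze).mean(dim=(2, 3))
        ctx_feat = (feats * (1 - gaze)).mean(dim=(2, 3))
        logits = self.action_head(torch.cat([gaze_feat, ctx_feat], dim=1))
        return gaze, logits
```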

    Saliency-Informed Spatio-Temporal Vector of Locally Aggregated Descriptors and Fisher Vector for Visual Action Recognition

    Feature encoding has been extensively studied for the task of visual action recognition (VAR). Recently proposed super-vector-based encoding methods, such as the Vector of Locally Aggregated Descriptors (VLAD) and the Fisher Vector (FV), have significantly improved recognition performance. Despite this success, they still struggle with the superfluous information present during the training stage, which makes them computationally expensive when applied to a large number of extracted features. To address this challenge, this paper proposes a Saliency-Informed Spatio-Temporal VLAD (SST-VLAD) approach, which selects the extracted features corresponding to a small number of videos in the dataset by considering both spatial and temporal video-wise saliency scores; the same extension principle is also applied to the FV approach. The experimental results indicate that the proposed feature encoding scheme consistently outperforms existing ones at a significantly lower computational cost.
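For illustration, a minimal NumPy sketch of standard VLAD encoding together with a stand-in for the video-wise saliency selection step; the function names, the top-fraction parameter, and the selection rule are assumptions, not the paper's exact procedure:

```python
import numpy as np

def vlad_encode(descriptors, centers):
    """Standard VLAD: sum of residuals to the nearest codebook centre,
    followed by power and L2 normalisation."""
    k, d = centers.shape
    # Nearest centre for each descriptor.
    dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = dists.argmin(axis=1)
    vlad = np.zeros((k, d))
    for i in range(k):
        members = descriptors[assign == i]
        if len(members):
            vlad[i] = (members - centers[i]).sum(axis=0)
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))  # power normalisation
    return (vlad / (np.linalg.norm(vlad) + 1e-12)).ravel()

def select_salient_videos(video_ids, saliency_scores, top_fraction=0.1):
    """Keep features only from the most salient videos (illustrative
    stand-in for the paper's video-wise saliency selection)."""
    n_keep = max(1, int(len(saliency_scores) * top_fraction))
    keep = np.argsort(saliency_scores)[-n_keep:]
    return [video_ids[i] for i in keep]
```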

    Age-related effects on spatial memory across viewpoint changes relative to different reference frames

    Remembering object positions across different views is a fundamental competence for acting and moving appropriately in large-scale space. Behavioural and neurological changes in elderly subjects suggest that spatial representations of the environment may decline relative to those of young adults. However, no data are available on the use of different reference frames within topographical space in aging. Here we investigated the use of allocentric and egocentric frames in aging by asking young and older participants to encode the location of a target in a virtual room relative either to stable features of the room (allocentric environment-based frame), to an unstable set of objects (allocentric object-based frame), or to the viewer's viewpoint (egocentric frame). After a viewpoint change of 0° (absent), 45° (small) or 135° (large), participants judged whether the target was in the same spatial position as before relative to one of the three frames. Results revealed a different susceptibility to viewpoint changes in older than in young participants. Importantly, we detected a worse performance, in terms of reaction times, for older than for young participants in the allocentric frames. The deficit was more marked for the environment-based frame, which showed lower sensitivity as well as worse performance even when no viewpoint change occurred. Our data provide new evidence of a greater vulnerability of allocentric, in particular environment-based, spatial coding with aging, in line with the retrogenesis theory, according to which cognitive changes in aging reverse the sequence of acquisition in mental development.

    Sensing, interpreting, and anticipating human social behaviour in the real world

    Low-level nonverbal social signals like glances, utterances, facial expressions and body language are central to human communicative situations and have been shown to be connected to important high-level constructs, such as emotions, turn-taking, rapport, or leadership. A prerequisite for the creation of social machines that are able to support humans in e.g. education, psychotherapy, or human resources is the ability to automatically sense, interpret, and anticipate human nonverbal behaviour. While promising results have been shown in controlled settings, automatically analysing unconstrained situations, e.g. in daily-life settings, remains challenging. Furthermore, anticipation of nonverbal behaviour in social situations is still largely unexplored. The goal of this thesis is to move closer to the vision of social machines in the real world. It makes fundamental contributions along the three dimensions of sensing, interpreting and anticipating nonverbal behaviour in social interactions.

    First, robust recognition of low-level nonverbal behaviour lays the groundwork for all further analysis steps. Advancing human visual behaviour sensing is especially relevant as the current state of the art is still not satisfactory in many daily-life situations. While many social interactions take place in groups, current methods for unsupervised eye contact detection can only handle dyadic interactions. We propose a novel unsupervised method for multi-person eye contact detection by exploiting the connection between gaze and speaking turns. Furthermore, we make use of mobile device engagement to address the problem of calibration drift that occurs in daily-life usage of mobile eye trackers.

    Second, we improve the interpretation of social signals in terms of higher-level social behaviours. In particular, we propose the first dataset and method for emotion recognition from bodily expressions of freely moving, unaugmented dyads. Furthermore, we are the first to study low rapport detection in group interactions, as well as to investigate a cross-dataset evaluation setting for the emergent leadership detection task.

    Third, human visual behaviour is special because it functions as a social signal and also determines what a person is seeing at a given moment in time. Being able to anticipate human gaze opens up the possibility for machines to more seamlessly share attention with humans, or to intervene in a timely manner if humans are about to overlook important aspects of the environment. We are the first to propose methods for the anticipation of eye contact in dyadic conversations, as well as in the context of mobile device interactions during daily life, thereby paving the way for interfaces that are able to proactively intervene and support interacting humans.
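As a hedged illustration of the unsupervised eye contact idea described above (clustering a listener's gaze directions and treating the cluster that co-occurs most with a target person's speaking turns as eye contact), the following sketch uses hypothetical names, and scikit-learn k-means is only one possible clustering choice, not necessarily the thesis's method:

```python
import numpy as np
from sklearn.cluster import KMeans  # hypothetical dependency choice

def eye_contact_frames(gaze_yaw_pitch, speaking, n_clusters=5):
    """Illustrative take on the unsupervised idea: among clusters of a
    listener's gaze directions, the one most active while a given person
    speaks is taken as 'looking at that person'.

    gaze_yaw_pitch: T x 2 array of the listener's gaze angles over time
    speaking:       length-T boolean array, True while the target speaks
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(gaze_yaw_pitch)
    # Score each gaze cluster by how strongly it co-occurs with speaking turns.
    scores = [speaking[labels == c].mean() if (labels == c).any() else 0.0
              for c in range(n_clusters)]
    target_cluster = int(np.argmax(scores))
    return labels == target_cluster  # frames with inferred eye contact
```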