Digging Deeper into Egocentric Gaze Prediction

Author: Borji Ali
Kannala Juho
Rahtu Esa
Tavakoli Hamed R.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 12/04/2019

This paper digs deeper into factors that influence egocentric gaze. Instead of training deep models for this purpose in a blind manner, we propose to inspect factors that contribute to gaze guidance during daily tasks. Bottom-up saliency and optical flow are assessed versus strong spatial prior baselines. Task-specific cues such as vanishing point, manipulation point, and hand regions are analyzed as representatives of top-down information. We also look into the contribution of these factors by investigating a simple recurrent neural model for egocentric gaze prediction. First, deep features are extracted for all input video frames. Then, a gated recurrent unit is employed to integrate information over time and to predict the next fixation. We also propose an integrated model that combines the recurrent model with several top-down and bottom-up cues. Extensive experiments over multiple datasets reveal that (1) spatial biases are strong in egocentric videos, (2) bottom-up saliency models perform poorly in predicting gaze and underperform spatial biases, (3) deep features perform better compared to traditional features, (4) as opposed to hand regions, the manipulation point is a strong influential cue for gaze prediction, (5) combining the proposed recurrent model with bottom-up cues, vanishing points and, in particular, manipulation point results in the best gaze prediction accuracy over egocentric videos, (6) the knowledge transfer works best for cases where the tasks or sequences are similar, and (7) task and activity recognition can benefit from gaze prediction. Our findings suggest that (1) there should be more emphasis on hand-object interaction and (2) the egocentric vision community should consider larger datasets including diverse stimuli and more subjects. Comment: presented at WACV 201

arXiv.org e-Print Archive

Crossref
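
The abstract above outlines a recurrent pipeline: per-frame deep features are integrated over time by a gated recurrent unit, which then predicts the next fixation. The following is only a minimal sketch of that idea and not the authors' model. The class name `TinyGRU`, the random untrained weights, the omission of bias terms, and the sigmoid read-out to normalized (x, y) coordinates are all assumptions made for illustration; the per-frame features are assumed to be plain lists of floats produced elsewhere.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector (list of floats).
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class TinyGRU:
    """Hypothetical minimal GRU cell with a linear read-out to a 2-D gaze point.

    Weights are random and untrained, and bias terms are omitted for brevity.
    """

    def __init__(self, feat_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        d = feat_dim + hidden_dim
        self.hidden_dim = hidden_dim
        self.Wz = rand_matrix(hidden_dim, d, rng)  # update gate
        self.Wr = rand_matrix(hidden_dim, d, rng)  # reset gate
        self.Wh = rand_matrix(hidden_dim, d, rng)  # candidate state
        self.Wo = rand_matrix(2, hidden_dim, rng)  # read-out to (x, y)

    def step(self, x, h):
        xh = x + h  # concatenate frame feature and previous hidden state
        z = [sigmoid(a) for a in matvec(self.Wz, xh)]
        r = [sigmoid(a) for a in matvec(self.Wr, xh)]
        xrh = x + [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(a) for a in matvec(self.Wh, xrh)]
        # Blend old state and candidate state with the update gate.
        return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def predict_next_fixation(self, frame_features):
        h = [0.0] * self.hidden_dim
        for x in frame_features:  # integrate information over time
            h = self.step(x, h)
        # Squash the read-out to normalized image coordinates in (0, 1).
        return tuple(sigmoid(a) for a in matvec(self.Wo, h))


if __name__ == "__main__":
    rng = random.Random(1)
    features = [[rng.random() for _ in range(8)] for _ in range(5)]  # 5 toy frames
    print(TinyGRU(feat_dim=8, hidden_dim=4).predict_next_fixation(features))
```

In a trained system the weights would be learned from recorded fixations, and the read-out could equally be a heatmap rather than a single point; the abstract does not specify which.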

Mutual Context Network for Jointly Estimating Egocentric Gaze and Actions

Author: Cai Minjie
Huang Yifei
Li Zhenqiang
Sato Yoichi
Publication date: 29/06/2020

In this work, we address two coupled tasks of gaze prediction and action recognition in egocentric videos by exploring their mutual context. Our assumption is that, while performing a manipulation task, what a person is doing determines where the person is looking, and the gaze point reveals gaze and non-gaze regions which contain important and complementary information about the ongoing action. We propose a novel mutual context network (MCN) that jointly learns action-dependent gaze prediction and gaze-guided action recognition in an end-to-end manner. Experiments on public egocentric video datasets demonstrate that our MCN achieves state-of-the-art performance on both gaze prediction and action recognition.

arXiv.org e-Print Archive

Egocentric Auditory Attention Localization in Conversations

Author: Ithapu Vamsi Krishna
Jiang Hao
Rehg James M.
Ryan Fiona
Shukla Abhinav
Publication date: 28/03/2023

In a noisy conversation environment such as a dinner party, people often exhibit selective auditory attention, or the ability to focus on a particular speaker while tuning out others. Recognizing who somebody is listening to in a conversation is essential for developing technologies that can understand social behavior and devices that can augment human hearing by amplifying particular sound sources. The computer vision and audio research communities have made great strides towards recognizing sound sources and speakers in scenes. In this work, we take a step further by focusing on the problem of localizing auditory attention targets in egocentric video, or detecting who in a camera wearer's field of view they are listening to. To tackle the new and challenging Selective Auditory Attention Localization problem, we propose an end-to-end deep learning approach that uses egocentric video and multichannel audio to predict the heatmap of the camera wearer's auditory attention. Our approach leverages spatiotemporal audiovisual features and holistic reasoning about the scene to make predictions, and outperforms a set of baselines on a challenging multi-speaker conversation dataset. Project page: https://fkryan.github.io/saa

arXiv.org e-Print Archive

Multi-Dataset, Multitask Learning of Egocentric Vision Tasks

Author: Kapidis Georgios
Poppe Ronald
Veltkamp Remco
Publication date: 01/06/2023

For egocentric vision tasks such as action recognition, there is a relative scarcity of labeled data. This increases the risk of overfitting during training. In this paper, we address this issue by introducing a multitask learning scheme that employs related tasks as well as related datasets in the training process. Related tasks are indicative of the performed action, such as the presence of objects and the position of the hands. By including related tasks as additional outputs to be optimized, action recognition performance typically increases because the network focuses on relevant aspects in the video. Still, the training data is limited to a single dataset because the set of action labels usually differs across datasets. To mitigate this issue, we extend the multitask paradigm to include datasets with different label sets. During training, we effectively mix batches with samples from multiple datasets. Our experiments on egocentric action recognition in the EPIC-Kitchens, EGTEA Gaze+, ADL and Charades-EGO datasets demonstrate the improvements of our approach over single-dataset baselines. On EGTEA we surpass the current state-of-the-art by 2.47 percent. We further illustrate the cross-dataset task correlations that emerge automatically with our novel training scheme

Utrecht University Repository
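
The abstract above describes mixing batches with samples from multiple datasets whose label sets differ. One common way to realize this, sketched below, is to give each dataset its own classification head and to compute each sample's loss only through the head of the dataset it came from. This is an illustrative assumption, not the authors' published architecture: the function names `mixed_batches` and `batch_loss`, the dataset keys, and the toy heads are hypothetical.

```python
import math
import random


def softmax_cross_entropy(logits, target):
    # Numerically stable cross-entropy for one sample.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[target]


def mixed_batches(datasets, batch_size, seed=0):
    """Yield shuffled batches that mix samples from all datasets.

    `datasets` maps a dataset name to a list of (features, label) pairs.
    """
    rng = random.Random(seed)
    pool = [(name, sample) for name, samples in datasets.items() for sample in samples]
    rng.shuffle(pool)
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]


def batch_loss(batch, heads):
    """Average loss over a mixed batch.

    Each sample contributes only through the head of its own dataset,
    so differing label sets never have to be merged.
    """
    total = 0.0
    for name, (features, label) in batch:
        logits = heads[name](features)
        total += softmax_cross_entropy(logits, label)
    return total / len(batch)


if __name__ == "__main__":
    # Toy data: dataset_a has 3 action classes, dataset_b has 5.
    datasets = {
        "dataset_a": [([0.1], 0), ([0.9], 2)],
        "dataset_b": [([0.5], 4)],
    }
    # Placeholder heads that return uniform logits; a real head would be
    # a learned layer on top of a shared video backbone.
    heads = {
        "dataset_a": lambda f: [0.0, 0.0, 0.0],
        "dataset_b": lambda f: [0.0] * 5,
    }
    for batch in mixed_batches(datasets, batch_size=2):
        print(len(batch), batch_loss(batch, heads))
```

The same per-head routing extends to the auxiliary tasks the abstract mentions (object presence, hand position) by adding further heads and summing their losses.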

Non-acted multi-view audio-visual dyadic interactions. Project master thesis: multitask learning for facial attributes analysis

Author: Masdeu Ninot Andreu
Publication date: 02/09/2019

Final projects of the Master's in Foundations of Data Science, Faculty of Mathematics, Universitat de Barcelona, Year: 2019, Advisors: Sergio Escalera Guerrero, Cristina Palmero and Julio C. S. Jacques Junior. In this thesis we explore the use of Multitask Learning for improving performance in facial attribute tasks such as gender, age and ethnicity prediction. These tasks, along with emotion recognition, will be part of a new dyadic interaction dataset which was recorded during the development of this thesis. This work includes the implementation of two state-of-the-art multitask deep learning models and the discussion of the results obtained from these methods on a preliminary dataset, as well as a first evaluation on a sample of the dyadic interaction dataset. This will serve as a baseline for a future implementation of Multitask Learning methods on the fully annotated dyadic interaction dataset.

Diposit Digital de la Universitat de Barcelona

Cone of Vision as a Behavioural Cue for VR Collaboration

Author: Alebri Muna
Bovo Riccardo
Costanza Enrico
Giunchi Daniele
Heinis Thomas
Steed Anthony
Publication venue: ACM Press
Publication date: 01/11/2022

UCL Discovery

Driver Attention based on Deep Learning for a Smart Vehicle to Driver (V2D) Interaction

Author: Araluce Ruiz Javier
Publication date: 01/01/2023

Driver attention is a topic of interest in the field of intelligent vehicles, with applications ranging from driver monitoring to autonomous driving. This thesis addresses the topic using deep learning algorithms to achieve intelligent interaction between the vehicle and the driver. Driver monitoring requires an accurate estimate of the driver's gaze in a 3D environment in order to determine their state of attention. This thesis tackles the problem with a single camera, so that the approach can be used in real applications at low cost and without disturbing the driver. The resulting tool was evaluated on a public dataset (DADA2000), obtaining results similar to those of an expensive eye tracker that cannot be used in a real vehicle. It was also used in an application that evaluates driver attention during a simulated transition from autonomous to manual mode, proposing a novel metric for assessing the state of the driver's situation based on their attention to the different objects in the scene. In addition, a driver attention estimation algorithm is proposed that uses recent deep learning techniques such as conditional Generative Adversarial Networks (cGANs) and Multi-Head Self-Attention. This allows certain areas of the scene to be emphasized, as a human would do. The model was trained and validated on two public datasets (BDD-A and DADA2000), outperforming other state-of-the-art proposals and achieving inference times that allow its use in real applications.
Finally, a model was developed that leverages our driver attention algorithm to understand a traffic scene, producing the decision taken by the vehicle and its explanation from images captured by a camera mounted at the front of the vehicle. It was trained on a public dataset (BDD-OIA), proposing a model that understands the temporal sequence of events using a Transformer Encoder and outperforms other state-of-the-art proposals. Besides being validated on the dataset, it was implemented in an application that interacts with the driver, advising on the decisions to take and their explanations across different use cases in a simulated environment. This thesis explores and demonstrates the benefits of driver attention for the field of intelligent vehicles, achieving vehicle-driver interaction through the latest deep learning techniques.

e_Buah - Biblioteca Digital de la Universidad de Alcalá

The role of time in video understanding

Author: Price Will
Publication date: 25/01/2022

Explore Bristol Research