DVGaze: Dual-View Gaze Estimation
Gaze estimation methods typically estimate gaze from facial appearance captured with a single camera. However, due to the limited view of a single camera, the captured facial appearance cannot provide complete facial information, which complicates the gaze estimation problem. Camera hardware has advanced rapidly in recent years: dual cameras are affordable for users and have been integrated into many devices. This development suggests that we can further improve gaze estimation performance with dual-view gaze estimation. In this paper, we propose a
dual-view gaze estimation network (DV-Gaze). DV-Gaze estimates dual-view gaze
directions from a pair of images. We first propose a dual-view interactive
convolution (DIC) block in DV-Gaze. DIC blocks exchange dual-view information during convolution at multiple feature scales: each block fuses dual-view features along epipolar lines and augments the original features with the fused ones. We further propose a dual-view transformer to estimate gaze from
dual-view features, in which camera poses are encoded to provide positional information. We also consider the geometric relation between
dual-view gaze directions and propose a dual-view gaze consistency loss for
DV-Gaze. DV-Gaze achieves state-of-the-art performance on the ETH-XGaze and EVE datasets, and our experiments further demonstrate the potential of dual-view gaze estimation. Code is available at https://github.com/yihuacheng/DVGaze.
Comment: ICCV 2023
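The abstract does not spell out the consistency loss; as a rough illustration of the geometric relation it refers to, the following sketch (hypothetical names, NumPy) penalizes the angle between the two views' gaze predictions after mapping one into the other camera's coordinate frame, assuming the relative camera rotation is known:

```python
import numpy as np

def dual_view_consistency_loss(g1, g2, R_1to2):
    """Angular disagreement between gaze directions predicted from two views
    (illustrative sketch, not the paper's exact formulation).

    g1, g2 : (3,) unit gaze vectors, each in its own camera's coordinate frame.
    R_1to2 : (3, 3) rotation taking camera-1 coordinates to camera-2 coordinates.
    Returns the angle in radians between the view-1 gaze expressed in the
    view-2 frame and the view-2 prediction; zero means perfect consistency.
    """
    g1_in_2 = R_1to2 @ g1                            # express view-1 gaze in view-2 frame
    cos = np.clip(np.dot(g1_in_2, g2), -1.0, 1.0)    # clamp against float error
    return np.arccos(cos)
```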
High-Fidelity Eye Animatable Neural Radiance Fields for Human Face
Face rendering using neural radiance fields (NeRF) is a rapidly developing
research area in computer vision. While recent methods primarily focus on
controlling facial attributes such as identity and expression, they often overlook the crucial aspect of modeling eyeball rotation, which is important for various downstream tasks. In this paper, we aim to learn a face
NeRF model that is sensitive to eye movements from multi-view images. We
address two key challenges in eye-aware face NeRF learning: how to effectively
capture eyeball rotation for training and how to construct a manifold for
representing eyeball rotation. To accomplish this, we first fit FLAME, a
well-established parametric face model, to the multi-view images considering
multi-view consistency. Subsequently, we introduce a new Dynamic Eye-aware NeRF
(DeNeRF). DeNeRF transforms 3D points from different views into a canonical
space to learn a unified face NeRF model. We design an eye deformation field
for this transformation that comprises both a rigid component, e.g., eyeball rotation, and a non-rigid component. Through experiments conducted on the ETH-XGaze
dataset, we demonstrate that our model is capable of generating high-fidelity
images with accurate eyeball rotation and non-rigid periocular deformation,
even under novel viewing angles. Furthermore, we show that utilizing the
rendered images can effectively enhance gaze estimation performance.
Comment: Under review
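As a loose illustration of the rigid part of such an eye deformation field, the sketch below (all names hypothetical; DeNeRF additionally includes a learned non-rigid component) undoes an eyeball rotation about its center to map observed points back to a canonical, gaze-neutral pose:

```python
import numpy as np

def rigid_eye_deform(points, eye_center, R_gaze):
    """Map 3D sample points near the eyeball back to a canonical pose by
    undoing a rigid eyeball rotation about its center (sketch only).

    points     : (N, 3) points in the observed (deformed) space.
    eye_center : (3,) eyeball rotation center.
    R_gaze     : (3, 3) rotation giving the current eyeball orientation.
    """
    # Right-multiplying row vectors by R_gaze applies R_gaze^T = R_gaze^{-1},
    # i.e., the inverse rotation about the eyeball center.
    return (points - eye_center) @ R_gaze + eye_center
```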
A Coarse-to-Fine Adaptive Network for Appearance-Based Gaze Estimation
Human gaze is essential for a variety of applications. Aiming at more accurate gaze estimation, a series of recent works propose to utilize face and eye images simultaneously. Nevertheless, face and eye images serve only as independent or parallel feature sources in those works; the intrinsic correlation between their features is overlooked. In this paper we make the
following contributions: 1) We propose a coarse-to-fine strategy which
estimates a basic gaze direction from face image and refines it with
corresponding residual predicted from eye images. 2) Guided by the proposed
strategy, we design a framework which introduces a bi-gram model to bridge gaze
residual and basic gaze direction, and an attention component to adaptively
acquire suitable fine-grained features. 3) Integrating the above innovations, we construct a coarse-to-fine adaptive network named CA-Net and achieve state-of-the-art performance on MPIIGaze and EyeDiap.
Comment: 9 pages, 7 figures, AAAI 2020
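The coarse-to-fine composition itself is simple to state; a minimal sketch under assumed interfaces (base_net and residual_net stand in for the paper's learned sub-networks, including its bi-gram bridging and attention components) might look like:

```python
def coarse_to_fine_gaze(face_image, eye_images, base_net, residual_net):
    """Sketch of the coarse-to-fine strategy: a basic gaze direction is
    estimated from the face image, then refined by a residual predicted
    from the eye images conditioned on that coarse estimate.
    Gaze values are assumed to be NumPy arrays, e.g., (pitch, yaw)."""
    g_basic = base_net(face_image)                   # coarse gaze from the face
    g_residual = residual_net(eye_images, g_basic)   # fine-grained correction
    return g_basic + g_residual                      # refined gaze direction
```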
Eloquent: A More Robust Transmission Scheme for LLM Token Streaming
To render each generated token for users in real time, a Large Language Model (LLM) server generates tokens one by one and streams each token (or group of a few tokens) through the network to the user right after generation, which we refer to as LLM token streaming. However, under unstable network conditions, the LLM token streaming experience can suffer greatly from stalls, since a single packet loss can block the rendering of later tokens even if the packets containing them arrive on time. With a measurement study, we show that current applications suffer from increased stalls under unstable networks. For this emerging token streaming problem in LLM chatbots, which differs from previous multimedia and text applications, we propose a novel transmission scheme, called Eloquent, which puts newly generated tokens as well as currently unacknowledged tokens in the next outgoing packet. This ensures that each packet contains some new tokens and, at the same time, can be rendered independently when received, avoiding the aforementioned stalls caused by missing packets. Through simulation under various networks, we show that Eloquent reduces the stall ratio (the proportion of time spent waiting for token rendering) by 71.0% compared with the retransmission method commonly used by real chatbot applications and by 31.6% compared with the baseline packet-duplication scheme. By tailoring Eloquent to the token-by-token generation of LLMs, we enable chatbots to respond like an eloquent speaker, letting users better enjoy pervasive AI.
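The core packetization idea is easy to sketch. The toy sender below (hypothetical interface; the actual scheme must also bound payload growth and handle token grouping) resends all unacknowledged tokens with every new token, so that any single received packet is independently renderable:

```python
class EloquentSender:
    """Minimal sketch of the Eloquent idea: every outgoing packet carries all
    tokens the receiver has not yet acknowledged, so each packet can be
    rendered on its own and one packet loss cannot stall later tokens.
    The transport (send, ACK delivery) is stubbed out and assumed."""

    def __init__(self, transport):
        self.transport = transport
        self.unacked = []          # (index, token) pairs awaiting acknowledgment

    def on_new_token(self, index, token):
        # Bundle the new token with everything still unacknowledged.
        self.unacked.append((index, token))
        self.transport.send(list(self.unacked))

    def on_ack(self, acked_index):
        # Receiver has rendered everything up to acked_index; drop those tokens.
        self.unacked = [(i, t) for (i, t) in self.unacked if i > acked_index]
```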
NL2Contact: Natural Language Guided 3D Hand-Object Contact Modeling with Diffusion Model
Modeling the physical contacts between the hand and object is standard for refining inaccurate hand poses and generating novel human grasps in 3D hand-object reconstruction. However, existing methods rely on geometric constraints that cannot be specified or controlled. This paper introduces a novel task of controllable 3D hand-object contact modeling with natural language descriptions. Challenges include i) the complexity of cross-modal modeling from language to contact, and ii) a lack of descriptive text for contact patterns. To address these issues, we propose NL2Contact, a model that generates controllable contacts by leveraging staged diffusion models. Provided with a language description of the hand and contact, NL2Contact generates realistic and faithful 3D hand-object contacts. To train the model, we build ContactDescribe, the first dataset with hand-centered contact descriptions. It contains multi-level and diverse descriptions generated by large language models, based on carefully designed prompts (e.g., grasp action, grasp type, contact location, free-finger status). We show applications of our model to grasp pose optimization and novel human grasp generation, both based on a textual contact description.
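The staged diffusion pipeline itself is not detailed in this abstract; as a generic illustration of text-conditioned diffusion sampling of a contact representation, a toy DDPM-style sampler (hypothetical denoiser interface and shapes, not the paper's implementation) could look like:

```python
import numpy as np

def sample_contact(denoiser, text_emb, shape=(1024,), T=100, seed=0):
    """Toy DDPM ancestral sampler conditioned on a language embedding,
    sketching how a text-guided diffusion model could produce a contact map
    (e.g., per-point contact values). `denoiser(x, t, text_emb)` is assumed
    to predict the noise added at step t; all names here are hypothetical."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, T)       # linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(shape)           # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps = denoiser(x, t, text_emb)       # language-conditioned noise estimate
        # Standard DDPM posterior mean for x_{t-1}.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)  # stochastic term
    return x
```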