AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model
Visual Speech Recognition (VSR) is the task of predicting spoken words from silent lip movements. VSR is regarded as challenging because lip movements alone carry insufficient information. In this paper, we propose an Audio Knowledge empowered Visual Speech Recognition framework (AKVSR) that complements the insufficient speech information of the visual modality with the audio modality. Different from previous methods, the proposed AKVSR 1) utilizes rich audio knowledge encoded by a large-scale pretrained audio model, 2) stores the linguistic information of this audio knowledge in a compact audio memory, discarding non-linguistic information through quantization, and 3) includes an Audio Bridging Module that finds the best-matched audio features in the compact audio memory, making training possible without audio inputs once the memory has been composed. We validate the effectiveness of the proposed method through extensive experiments and achieve new state-of-the-art performance on the widely used LRS2 and LRS3 datasets.
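To make the memory-lookup idea concrete, the following is a minimal sketch of retrieving audio features from a compact memory given only visual features; the module names, dimensions, and attention-style lookup are illustrative assumptions, not the paper's actual architecture.

# Minimal sketch of a compact audio memory with a bridging lookup, loosely
# following the abstract's description; names and sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompactAudioMemory(nn.Module):
    def __init__(self, num_slots=256, dim=512):
        super().__init__()
        # Quantized audio memory: each slot stores a linguistic audio feature.
        self.memory = nn.Parameter(torch.randn(num_slots, dim))
        self.query_proj = nn.Linear(dim, dim)  # maps visual features to queries

    def forward(self, visual_feats):            # (batch, time, dim)
        q = self.query_proj(visual_feats)
        # Bridging: attend over memory slots to retrieve best-matched features.
        attn = F.softmax(q @ self.memory.t() / q.shape[-1] ** 0.5, dim=-1)
        return attn @ self.memory               # (batch, time, dim)

Once the memory is composed from quantized audio features and frozen, the lookup needs only visual queries, which matches the abstract's claim of training without audio inputs.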
TexPose: Neural Texture Learning for Self-Supervised 6D Object Pose Estimation
In this paper, we introduce neural texture learning for 6D object pose estimation from synthetic data and a few unlabelled real images. Our major contribution is a novel learning scheme which removes the drawbacks of previous works, namely the strong dependency on co-modalities or additional refinement, previously necessary to provide training signals for convergence. We formulate the scheme as two sub-optimisation problems, on texture learning and on pose learning: we separately learn to predict realistic object texture from real image collections and learn pose estimation from pixel-perfect synthetic data. Combining these two capabilities then allows us to synthesise photorealistic novel views that supervise the pose estimator with accurate geometry. To alleviate the pose noise and segmentation imperfection present during the texture learning phase, we propose a surfel-based adversarial training loss together with texture regularisation from synthetic data. We demonstrate that the proposed approach significantly outperforms recent state-of-the-art methods without ground-truth pose annotations and generalises substantially better to unseen scenes. Remarkably, our scheme substantially improves the adopted pose estimators even when they are initialised with much inferior performance.
Radar and RGB-depth sensors for fall detection: a review
This paper reviews recent works on systems based on radar and RGB-Depth (RGB-D) sensors for fall detection, and discusses outstanding research challenges and trends in this field. Systems that reliably detect fall events and promptly alert carers and first responders have attracted significant interest in recent years, addressing the societal issue of a growing number of elderly people living alone, with the associated risk of falls and their consequences in terms of health treatment, reduced well-being, and costs. The interest in radar and RGB-D sensors stems from their capability to enable contactless and non-intrusive monitoring, an advantage for practical deployment and for users' acceptance and compliance compared with other sensor technologies such as video cameras or wearables. Furthermore, the possibility of combining and fusing information from these heterogeneous types of sensors is expected to improve the overall performance of practical fall detection systems. Researchers from different fields can benefit from the multidisciplinary knowledge and awareness of the latest developments in radar and RGB-D sensors that this paper provides.
Relative Pose Estimation Algorithm with Gyroscope Sensor
This paper proposes a novel vision and inertial fusion algorithm, S2fM (Simplified Structure from Motion), for camera relative pose estimation. Different from existing algorithms, our algorithm estimates the rotation and translation parameters separately. S2fM employs gyroscopes to estimate the camera rotation, which is then fused with image data to estimate the camera translation. Our contributions are two-fold. (1) Since no inertial sensor can estimate the translation parameter accurately enough, we propose a translation estimation algorithm that fuses gyroscope and image data. (2) Our S2fM algorithm is efficient and suitable for smart devices. Experimental results validate the efficiency of the proposed S2fM algorithm.
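A minimal sketch of the decoupling idea (not the paper's implementation): once the rotation R is known, e.g. from gyroscope integration, the epipolar constraint x2^T [t]x (R x1) = 0 for normalized correspondences x1 <-> x2 becomes linear in the translation t, which can then be recovered up to scale as a null vector.

# Minimal sketch: translation direction from known rotation via the epipolar
# constraint; the rotation is assumed to come from gyroscope integration.
import numpy as np

def translation_from_known_rotation(R, x1, x2):
    """R: (3,3) rotation; x1, x2: (N,3) normalized homogeneous correspondences."""
    Rx1 = x1 @ R.T                      # rotate first-view rays into second view
    # x2^T [t]x (R x1) = t . ((R x1) x x2) = 0, so each row is cross(Rx1, x2)
    A = np.cross(Rx1, x2)
    _, _, Vt = np.linalg.svd(A)         # t is the null vector of A
    t = Vt[-1]
    return t / np.linalg.norm(t)        # translation direction (scale unknown)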
Emotional Theory of Mind: Bridging Fast Visual Processing with Slow Linguistic Reasoning
The emotional theory of mind problem in images is an emotion recognition
task, specifically asking "How does the person in the bounding box feel?"
Facial expressions, body pose, contextual information and implicit commonsense
knowledge all contribute to the difficulty of the task, making it currently
one of the hardest problems in affective computing. The goal of this
work is to evaluate the emotional commonsense knowledge embedded in recent
large vision language models (CLIP, LLaVA) and large language models (GPT-3.5)
on the Emotions in Context (EMOTIC) dataset. In order to evaluate a purely
text-based language model on images, we construct "narrative captions" relevant
to emotion perception, using a set of 872 physical social signal descriptions
related to 26 emotional categories, along with 224 labels for emotionally
salient environmental contexts, sourced from writer's guides for character
expressions and settings. We evaluate the use of the resulting captions in an
image-to-language-to-emotion task. Experiments using zero-shot vision-language
models on EMOTIC show that combining "fast" and "slow" reasoning is a promising
way forward to improve emotion recognition systems. Nevertheless, a gap remains
in the zero-shot emotional theory of mind task compared to prior work trained
on the EMOTIC dataset.
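A minimal zero-shot sketch in the spirit of the evaluation above: scoring emotion-category prompts against a person crop with CLIP. The model name, prompt template, categories, and file path below are illustrative assumptions, not the paper's exact protocol.

# Zero-shot CLIP emotion scoring on a cropped person box (illustrative only).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

emotions = ["happiness", "sadness", "anger", "fear", "surprise"]
prompts = [f"a photo of a person feeling {e}" for e in emotions]

image = Image.open("person_crop.jpg")   # hypothetical crop of the bounding box
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)
print(dict(zip(emotions, probs[0].tolist())))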
SmileNet: Registration-Free Smiling Face Detection in the Wild
We present a novel smiling face detection framework called SmileNet for detecting faces and recognising smiles in the wild. SmileNet uses a Fully Convolutional Neural Network (FCNN) to detect multiple smiling faces in a given image of varying resolution. Our contributions are three-fold: 1) SmileNet is the first smiling face detection network that does not require pre-processing, such as face detection and registration, to generate a normalised (cropped and aligned) input image in advance; 2) the proposed SmileNet is a simple, single FCNN architecture that simultaneously performs face detection and smile recognition, conventionally treated as separate consecutive pipelines; and 3) SmileNet ensures real-time processing speed (21.15 FPS) even when detecting multiple smiling faces in a given 300x300 image. Experimental results show that SmileNet delivers state-of-the-art performance (95.76%), even under occlusion and variations in pose, scale, and illumination. This work is supported by the Technology Strategy Board / Innovate UK project Sensing Feeling (project no. 102547).
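As a rough illustration of the single-network idea, here is a minimal fully convolutional sketch with one backbone and two per-location output maps (face objectness and smile score); the architecture is an assumption for illustration, not SmileNet's.

# Joint face-detection + smile-recognition FCNN sketch (illustrative only).
import torch
import torch.nn as nn

class JointFaceSmileFCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.face_head = nn.Conv2d(128, 1, 1)   # face objectness per location
        self.smile_head = nn.Conv2d(128, 1, 1)  # smile score per location

    def forward(self, x):                        # x: (B, 3, H, W), any H, W
        f = self.backbone(x)
        return torch.sigmoid(self.face_head(f)), torch.sigmoid(self.smile_head(f))

Because the network is fully convolutional, the same weights apply to images of any resolution, which is what removes the need for cropped and aligned inputs.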
FULL 3D RECONSTRUCTION OF DYNAMIC NON-RIGID SCENES: ACQUISITION AND ENHANCEMENT
Recent advances in commodity depth or 3D sensing technologies have brought us closer to the goal of accurately sensing and modeling 3D representations of complex dynamic scenes. Indeed, in domains such as virtual reality, security, surveillance and e-health, there is now a greater demand for affordable and flexible vision systems capable of acquiring high-quality 3D reconstructions. Available commodity RGB-D cameras, though easily accessible, have a limited field of view and acquire noisy, low-resolution measurements, which restricts their direct usage in building such vision systems. This thesis targets these limitations and builds approaches around commodity 3D sensing technologies to acquire noise-free, feature-preserving full 3D reconstructions of dynamic scenes containing static or moving, rigid or non-rigid objects. A mono-view system based on a single RGB-D camera is incapable of instantaneously acquiring a full 360-degree 3D reconstruction of a dynamic scene; for this purpose, a multi-view system composed of several RGB-D cameras covering the whole scene is used. The first part of this thesis explores how to correctly align the information acquired from RGB-D cameras in a multi-view system so as to provide full, textured 3D reconstructions of dynamic scenes instantaneously. This is achieved by solving the extrinsic calibration problem. The thesis proposes an extrinsic calibration framework which uses the 2D photometric and 3D geometric information acquired with RGB-D cameras, weighted according to their relative (in)accuracies under noise, in a single weighted bi-objective optimization. An iterative scheme is also proposed which estimates the parameters of the noise model affecting both 2D and 3D measurements and solves the extrinsic calibration problem simultaneously. Results show improved calibration accuracy compared to state-of-the-art methods.
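To make the shape of such a cost concrete, the following is a minimal sketch of a weighted bi-objective calibration objective combining a 2D reprojection term and a 3D alignment term; the residuals, weighting, and parameterisation are illustrative assumptions rather than the thesis's actual formulation.

# Weighted bi-objective extrinsic-calibration cost (illustrative sketch).
import numpy as np

def calibration_cost(R, t, K, P3d, p2d, Q3d, w2d, w3d):
    """R, t: extrinsics; K: intrinsics; P3d <-> p2d: 3D-2D matches;
    P3d <-> Q3d: 3D-3D matches; w2d, w3d: noise-derived weights."""
    P_cam = P3d @ R.T + t                 # transform points into camera frame
    proj = P_cam @ K.T
    proj = proj[:, :2] / proj[:, 2:3]     # pinhole projection to pixels
    e2d = np.sum((proj - p2d) ** 2)       # 2D photometric/reprojection residual
    e3d = np.sum((P_cam - Q3d) ** 2)      # 3D geometric alignment residual
    return w2d * e2d + w3d * e3d          # single weighted bi-objective cost

The weights w2d and w3d would come from the estimated noise model, so the more reliable modality dominates the optimization.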
In the second part of this thesis, the enhancement of noisy, low-resolution 3D data acquired with commodity RGB-D cameras in both mono-view and multi-view systems is explored. The thesis extends the state of the art in mono-view, template-free, recursive 3D data enhancement, which targets dynamic scenes containing rigid objects and thus requires tracking only the global motions of those objects for view-dependent surface representation and filtering. This thesis instead targets dynamic scenes containing non-rigid objects, which introduces the complex requirements of tracking relatively large local motions and maintaining data organization for view-dependent surface representation. The proposed method is shown to be effective in handling non-rigid objects of changing topologies. Building upon this work, the thesis then removes the data-organization requirement by proposing an approach based on a view-independent surface representation. View-independence decreases the complexity of the algorithm and gives it the flexibility to simultaneously process and enhance noisy data acquired with multiple cameras in a multi-view system. Qualitative and quantitative experimental analysis shows this method to be more accurate in removing noise and producing enhanced 3D reconstructions of non-rigid objects. Although extending this method to a multi-view system allows instantaneous, enhanced, full 360-degree 3D reconstructions of non-rigid objects, it still lacks the ability to explicitly handle low-resolution data. Therefore, the thesis finally proposes a novel recursive dynamic multi-frame 3D super-resolution algorithm, together with a novel 3D bilateral total variation regularization, to filter out noise, recover details, and enhance the resolution of data acquired from commodity cameras in a multi-view system. Results show that this method builds accurate, smooth, and feature-preserving full 360-degree 3D reconstructions of dynamic scenes containing non-rigid objects.
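For the flavour of the regularization, below is a minimal sketch of the classic bilateral total variation prior on a 2D grid, as used in multi-frame image super-resolution; the thesis's 3D formulation for point data differs, and the decay factor and shift window here are assumptions.

# Classic 2D bilateral total variation (BTV) regulariser (illustrative sketch).
import numpy as np

def btv(x, window=2, alpha=0.7):
    """Sum of exponentially decayed L1 differences between x and its shifts."""
    total = 0.0
    for l in range(-window, window + 1):
        for m in range(-window, window + 1):
            if l == 0 and m == 0:
                continue
            shifted = np.roll(np.roll(x, l, axis=0), m, axis=1)
            total += alpha ** (abs(l) + abs(m)) * np.abs(x - shifted).sum()
    return total

Penalising differences over a window of shifts, rather than only adjacent samples, is what lets such a prior smooth noise while preserving sharp features.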