Deep Thermal Imaging: Proximate Material Type Recognition in the Wild through Deep Learning of Spatial Surface Temperature Patterns
We introduce Deep Thermal Imaging, a new approach for close-range automatic
recognition of materials that enhances the understanding people and ubiquitous
technologies have of their proximal environment. Our approach uses a low-cost mobile
thermal camera integrated into a smartphone to capture thermal textures. A deep
neural network classifies these textures into material types. This approach
works effectively without the need for ambient light sources or direct contact
with materials. Furthermore, the use of a deep learning network removes the
need to handcraft the set of features for different materials. We evaluated the
performance of the system by training it to recognise 32 material types in both
indoor and outdoor environments. Our approach produced recognition accuracies
above 98% in 14,860 images of 15 indoor materials and above 89% in 26,584
images of 17 outdoor materials. We conclude by discussing its potential for
real-time use in HCI applications and future directions.
Comment: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems
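To make the pipeline concrete, here is a minimal sketch of a thermal-texture material classifier in PyTorch; the patch size, layer sizes, and architecture are illustrative assumptions, not the paper's exact network.

```python
# Minimal sketch of a thermal-texture material classifier (illustrative,
# not the paper's exact architecture). Assumes single-channel thermal
# patches of size 64x64 and the 32 material classes from the evaluation.
import torch
import torch.nn as nn

class ThermalTextureNet(nn.Module):
    def __init__(self, num_materials: int = 32):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 64x64 -> 32x32
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                      # 32x32 -> 16x16
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),              # global pooling
        )
        self.classifier = nn.Linear(128, num_materials)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, 64, 64) normalized spatial surface temperatures
        h = self.features(x).flatten(1)
        return self.classifier(h)                 # material-type logits

model = ThermalTextureNet()
patch = torch.randn(1, 1, 64, 64)                 # one thermal texture patch
material_id = model(patch).argmax(dim=1)          # predicted material index
```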
Personalizing Human-Robot Dialogue Interactions using Face and Name Recognition
Task-oriented dialogue systems are computer systems that aim to provide an interaction
indistinguishable from ordinary human conversation with the goal of completing user-
defined tasks. They achieve this by analyzing the intents of users and choosing
appropriate responses. Recent studies show that by personalizing the conversations with
these systems one can positively affect their perception and long-term acceptance.
Personalised social robots have been widely applied in different fields to provide assistance.
In this thesis we work on the development of a scientific conference assistant. The goal
of this assistant is to provide the conference participants with conference information and
to inform them about activities for their spare time during the conference. Moreover, to
increase engagement with the robot, our team has worked on personalizing the human-robot
interaction by means of face and name recognition.
To achieve this personalisation, first the name recognition ability of the available physical
robot was improved; next, with the consent of the participants, their pictures were taken
and used to memorize returning users. As acquiring consent for personal data
storage is not an optimal solution, an alternative method for participant recognition
using QR codes on their badges was developed and compared to the pre-trained model in
terms of speed. Lastly, personal details of each participant, such as university and country of
origin, were acquired prior to the conference or during the conversation and used in dialogues.
The developed robot, called DAGFINN, was displayed at two conferences held this
year in Stavanger, where the first installment did not include the personalization feature.
Hence, we conclude this thesis by discussing the influence of personalisation on dialogues
with the robot and participants' satisfaction with the developed social robot.
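As a rough illustration of the QR-badge alternative, the following Python sketch uses OpenCV's QRCodeDetector; the participant records and helper names are hypothetical stand-ins, not the thesis code.

```python
# Sketch of the QR-badge alternative to face recognition (hypothetical
# records and helper names; not the thesis' actual pipeline). Assumes
# each badge encodes a participant ID that maps to stored details.
import cv2

PARTICIPANTS = {  # illustrative records, e.g. acquired at registration
    "P-042": {"name": "Kari", "university": "UiS", "country": "Norway"},
}

detector = cv2.QRCodeDetector()

def recognize_badge(frame):
    """Decode a badge QR code from a camera frame and look up the
    participant, returning None when no known code is visible."""
    data, points, _ = detector.detectAndDecode(frame)
    return PARTICIPANTS.get(data) if data else None

cap = cv2.VideoCapture(0)                 # the robot's camera
ok, frame = cap.read()
if ok:
    person = recognize_badge(frame)
    if person:
        print(f"Welcome back, {person['name']} from {person['university']}!")
cap.release()
```

Decoding a QR string is a cheap lookup, which is why it is natural to compare this path against the face-recognition model purely in terms of speed, as the thesis does.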
STPrivacy: Spatio-Temporal Privacy-Preserving Action Recognition
Existing methods of privacy-preserving action recognition (PPAR) mainly focus
on frame-level (spatial) privacy removal through 2D CNNs. Unfortunately, they
have two major drawbacks. First, they may compromise temporal dynamics in input
videos, which are critical for accurate action recognition. Second, they are
vulnerable to practical attack scenarios where attackers probe for privacy
from an entire video rather than individual frames. To address these issues, we
propose a novel framework STPrivacy to perform video-level PPAR. For the first
time, we introduce vision Transformers into PPAR by treating a video as a
tubelet sequence, and accordingly design two complementary mechanisms, i.e.,
sparsification and anonymization, to remove privacy from a spatio-temporal
perspective. Specifically, our privacy sparsification mechanism applies adaptive
token selection to abandon action-irrelevant tubelets. Then, our anonymization
mechanism implicitly manipulates the remaining action-tubelets to erase privacy
in the embedding space through adversarial learning. These mechanisms provide
significant advantages in terms of privacy preservation for human eyes and
action-privacy trade-off adjustment during deployment. We additionally
contribute the first two large-scale PPAR benchmarks, VP-HMDB51 and VP-UCF101,
to the community. Extensive evaluations on them, as well as two other tasks,
validate the effectiveness and generalization capability of our framework.
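One way to picture the sparsification mechanism is as learned top-k token selection over tubelets; the sketch below makes this concrete under assumed shapes and a stand-in scoring head, and is not the authors' released implementation.

```python
# Sketch of privacy sparsification as adaptive tubelet-token selection
# (hard top-k selection here is a stand-in for the paper's mechanism;
# shapes and the scoring head are assumptions, not the released code).
import torch
import torch.nn as nn

class TubeletSparsifier(nn.Module):
    def __init__(self, dim: int = 768, keep_ratio: float = 0.5):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)   # per-token action-relevance score
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tubelets, dim) from a video tokenizer
        scores = self.scorer(tokens).squeeze(-1)          # (B, N)
        k = max(1, int(tokens.shape[1] * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices               # keep top-k tubelets
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        return tokens.gather(1, idx)      # action-relevant tubelets only

sparsifier = TubeletSparsifier()
video_tokens = torch.randn(2, 196, 768)   # e.g. 196 tubelets per clip
kept = sparsifier(video_tokens)           # (2, 98, 768) after sparsification
```

The kept tokens would then pass to the anonymization stage, which in the paper is trained adversarially to erase privacy in the embedding space.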
SKATE: A Natural Language Interface for Encoding Structured Knowledge
In Natural Language (NL) applications, there is often a mismatch between what
the NL interface is capable of interpreting and what a lay user knows how to
express. This work describes a novel natural language interface that reduces
this mismatch by refining natural language input through successive,
automatically generated semi-structured templates. In this paper we describe
how our approach, called SKATE, uses a neural semantic parser to parse NL input
and suggest semi-structured templates, which are recursively filled to produce
fully structured interpretations. We also show how SKATE integrates with a
neural rule-generation model to interactively suggest and acquire commonsense
knowledge. We provide a preliminary coverage analysis of SKATE for the task of
story understanding, and then describe a current business use-case of the tool
in a specific domain: COVID-19 policy design.
Comment: Accepted at IAAI-2
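A toy sketch of the recursive template-filling idea follows; the Template structure, the slot names, and the suggest() stub standing in for the neural semantic parser are all hypothetical, not SKATE's actual interface.

```python
# Toy sketch of recursive semi-structured template filling (the slot
# grammar and the suggest() stub are hypothetical, not SKATE's API).
from dataclasses import dataclass, field

@dataclass
class Template:
    predicate: str
    slots: dict = field(default_factory=dict)  # slot -> str | Template | None

    def is_complete(self) -> bool:
        return all(
            v.is_complete() if isinstance(v, Template) else v is not None
            for v in self.slots.values()
        )

def suggest(slot_name: str, nl_input: str):
    """Stand-in for the neural semantic parser: propose a filler
    (a value or a nested sub-template) for one open slot."""
    proposals = {
        "agent": "the nurse",
        "action": Template("wear", {"object": None}),
        "object": "a mask",
    }
    return proposals.get(slot_name)

def fill(template: Template, nl_input: str) -> Template:
    # Recursively fill open slots until the interpretation is structured.
    for name, value in template.slots.items():
        if value is None:
            template.slots[name] = suggest(name, nl_input)
        if isinstance(template.slots[name], Template):
            fill(template.slots[name], nl_input)
    return template

policy = Template("obligation", {"agent": None, "action": None})
print(fill(policy, "Nurses must wear masks."))
```

In the real system a user would pick among parser-suggested templates at each step, so each recursive fill is interactive rather than automatic as in this stub.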
Multimodal Emotion Recognition via Convolutional Neural Networks: Comparison of different strategies on two multimodal datasets
The aim of this paper is to investigate emotion recognition using a multimodal approach that exploits convolutional neural networks (CNNs) with multiple inputs. Multimodal approaches allow different modalities to cooperate in order to achieve generally better performance, because different features are extracted from different pieces of information. In this work, the facial frames, the optical flow computed from consecutive facial frames, and the Mel spectrograms (from the word melody) are extracted from videos and combined together in different ways to understand which combination of modalities works best. Several experiments are run on the models, first considering one modality at a time, so that good accuracy results are found for each modality. Afterward, the models are concatenated to create a final model that allows multiple inputs. The experiments use the BAUM-1 (Bahçeşehir University Multimodal Affective Database - 1) and RAVDESS (Ryerson Audio–Visual Database of Emotional Speech and Song) datasets, both of which collect two distinct sets of videos based on the intensity of the expression, that is, acted/strong or spontaneous/normal, providing representations of the following emotional states taken into consideration: angry, disgust, fearful, happy, and sad. The performance of the proposed models is shown through accuracy results and confusion matrices, demonstrating better accuracy than comparable proposals in the literature. The best accuracy achieved on the BAUM-1 dataset is about 95%, while on RAVDESS it is about 95.5%.
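To illustrate the concatenation-based fusion the abstract describes, here is a minimal PyTorch sketch with three branches (facial frames, optical flow, Mel spectrograms); the branch designs and input shapes are illustrative assumptions, not the paper's exact models.

```python
# Minimal sketch of multimodal fusion by feature concatenation
# (the five emotion classes follow the text; branch designs and
# shapes are illustrative, not the paper's exact models).
import torch
import torch.nn as nn

def conv_branch(in_ch: int) -> nn.Module:
    # Shared 2D-CNN recipe reused for faces, optical flow, spectrograms.
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),      # -> (batch, 64)
    )

class MultimodalEmotionNet(nn.Module):
    def __init__(self, num_emotions: int = 5):  # angry, disgust, fearful, happy, sad
        super().__init__()
        self.face = conv_branch(3)      # RGB facial frame
        self.flow = conv_branch(2)      # optical flow (dx, dy)
        self.mel = conv_branch(1)       # Mel spectrogram
        self.head = nn.Linear(64 * 3, num_emotions)

    def forward(self, face, flow, mel):
        fused = torch.cat([self.face(face), self.flow(flow), self.mel(mel)], dim=1)
        return self.head(fused)         # emotion logits

model = MultimodalEmotionNet()
logits = model(torch.randn(1, 3, 64, 64),   # facial frame
               torch.randn(1, 2, 64, 64),   # optical flow
               torch.randn(1, 1, 64, 64))   # Mel spectrogram
```

Training each branch alone first, then concatenating, mirrors the paper's strategy of validating every modality before building the fused model.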