First impressions: A survey on vision-based apparent personality trait analysis
© 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Personality analysis has been widely studied in psychology, neuropsychology, and signal processing, among other fields. Over the past few years, it has also become an attractive research area in visual computing. From the computational point of view, speech and text have by far been the most considered cues for analyzing personality. Recently, however, there has been increasing interest from the computer vision community in analyzing personality from visual data. Recent computer vision approaches can accurately analyze human faces, body postures, and behaviors, and use this information to infer apparent personality traits. Because of the overwhelming research interest in this topic, and of the potential impact such methods could have on society, we present in this paper an up-to-date review of existing vision-based approaches for apparent personality trait recognition. We describe seminal and cutting-edge works on the subject, discussing and comparing their distinctive features and limitations. Future avenues of research in the field are identified and discussed. Furthermore, we review aspects of subjectivity in data labeling/evaluation, as well as current datasets and challenges organized to push research in the field.
Audio-Visual Target Speaker Enhancement on Multi-Talker Environment using Event-Driven Cameras
We propose a method to address audio-visual target speaker enhancement in
multi-talker environments using event-driven cameras. State-of-the-art
audio-visual speech separation methods show that the movement of the facial
landmarks related to speech production carries crucial information. However,
all approaches proposed so far work offline on frame-based video input, making
it difficult to process an audio-visual signal with low latency for online
applications. To overcome this limitation, we propose the use of event-driven
cameras and exploit their inherent compression, high temporal resolution, and
low latency for low-cost, low-latency motion feature extraction, moving towards
online embedded audio-visual speech processing. We use the event-driven optical
flow estimation of the facial landmarks as input to a stacked Bidirectional
LSTM trained to predict an Ideal Amplitude Mask, which is then used to filter
the noisy audio and obtain the audio signal of the target speaker. The
presented approach performs almost on par with the frame-based approach, at
very low latency and computational cost.
Comment: Accepted at ISCAS 202
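The masking stage described above can be sketched concretely. The Ideal Amplitude Mask (IAM) is commonly defined per time-frequency bin as the ratio of the target speaker's magnitude to the noisy mixture's magnitude, and the estimated target is obtained by multiplying the (predicted) mask with the mixture. The following is a minimal illustrative sketch, not the paper's implementation: spectrograms are nested lists of magnitudes, the LSTM predictor is replaced by the oracle mask, and the epsilon and clipping values are assumptions for numerical stability.

```python
def ideal_amplitude_mask(clean_mag, mix_mag, eps=1e-8, clip=1.0):
    """Oracle IAM per time-frequency bin: |S| / |X|, clipped for stability.

    In the described system this mask is what the BiLSTM is trained to
    predict; here we compute it directly from the clean reference."""
    return [[min(c / (m + eps), clip) for c, m in zip(cf, mf)]
            for cf, mf in zip(clean_mag, mix_mag)]

def apply_mask(mask, mix_mag):
    """Estimate the target magnitude by element-wise masking of the mixture."""
    return [[w * m for w, m in zip(wf, mf)] for wf, mf in zip(mask, mix_mag)]

# Toy example: one frame, two frequency bins.
clean = [[1.0, 0.0]]   # target speaker dominates bin 0, absent in bin 1
mix = [[2.0, 1.0]]     # mixture contains interfering energy in both bins
mask = ideal_amplitude_mask(clean, mix)
estimate = apply_mask(mask, mix)
```

The estimate recovers the target's magnitudes from the mixture; the masked spectrogram would then be combined with the mixture phase and inverted back to a waveform.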
The analysis of breathing and rhythm in speech
Speech rhythm can be described as the temporal patterning by which speech events, such as vocalic onsets, occur. Despite efforts to quantify and model speech rhythm across languages, it remains a scientifically enigmatic aspect of prosody. For instance, one challenge lies in determining how to best quantify and analyse speech rhythm. Techniques range from manual phonetic annotation to the automatic extraction of acoustic features. It is currently unclear how closely these differing approaches correspond to one another. Moreover, the primary means of speech rhythm research has been the analysis of the acoustic signal only. Investigations of speech rhythm may instead benefit from a range of complementary measures, including physiological recordings, such as of respiratory effort. This thesis therefore combines acoustic recording with inductive plethysmography (breath belts) to capture temporal characteristics of speech and speech breathing rhythms. The first part examines the performance of existing phonetic and algorithmic techniques for acoustic prosodic analysis in a new corpus of rhythmically diverse English and Mandarin speech. The second part addresses the need for an automatic speech breathing annotation technique by developing a novel function that is robust to the noisy plethysmography typical of spontaneous, naturalistic speech production. These methods are then applied in the following section to the analysis of English speech and speech breathing in a second, larger corpus. Finally, behavioural experiments were conducted to investigate listeners' perception of speech breathing using a novel gap detection task. The thesis establishes the feasibility, as well as limits, of automatic methods in comparison to manual annotation. In the speech breathing corpus analysis, they help show that speakers maintain a normative, yet contextually adaptive breathing style during speech. 
The perception experiments in turn demonstrate that listeners are sensitive to the violation of these speech breathing norms, even if unconsciously so. The thesis concludes by underscoring breathing as a necessary, yet often overlooked, component in speech rhythm planning and production
Audio-Visual Target Speaker Extraction on Multi-Talker Environment using Event-Driven Cameras
In this work, we propose a new method to address audio-visual target speaker extraction in multi-talker environments using event-driven cameras. Existing audio-visual speech separation approaches extract visual features from frame-based video, and frame-based cameras usually work at 30 frames per second. This limitation makes it difficult to process an audio-visual signal with low latency. To overcome this limitation, we propose using event-driven cameras for their high temporal resolution and low latency. Recent work showed that landmark motion features are very important for achieving good results on audio-visual speech separation. Thus, we use event-driven vision sensors, from which motion can be extracted at lower latency and computational cost. A stacked Bidirectional LSTM is trained to predict an Ideal Amplitude Mask, which is post-processed to obtain a clean audio signal. The performance of our model is close to that of frame-based approaches.
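The latency argument above comes from the asynchronous nature of event streams: instead of waiting ~33 ms for the next frame, events can be binned into arbitrarily short time slices. The exact feature extraction is not specified here, so the following is a hypothetical sketch in which events (timestamp in microseconds, pixel coordinates, polarity) are accumulated into 5 ms slices, giving a 200 Hz motion-feature rate versus 30 Hz frames; a real pipeline would compute optical flow at the tracked landmarks instead of raw signed counts.

```python
from collections import defaultdict

def slice_events(events, slice_us=5000):
    """Bin asynchronous (t_us, x, y, polarity) events into fixed time slices.

    Returns {slice_index: {(x, y): signed event count}}, a crude stand-in
    for the per-landmark motion features a real pipeline would compute.
    ON events (polarity truthy) count +1, OFF events count -1."""
    slices = defaultdict(lambda: defaultdict(int))
    for t, x, y, p in events:
        slices[t // slice_us][(x, y)] += 1 if p else -1
    return {k: dict(v) for k, v in slices.items()}

# Three events at pixel (1, 1): two ON events in the first 5 ms slice,
# one OFF event in the second slice.
features = slice_events([(0, 1, 1, 1), (100, 1, 1, 1), (6000, 1, 1, 0)])
```

Because each slice closes after 5 ms, a downstream predictor can react roughly six times sooner than one waiting on 30 fps frames.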
Automatic Detection of Online Jihadist Hate Speech
We have developed a system that automatically detects online jihadist hate
speech with over 80% accuracy, using techniques from Natural Language
Processing and Machine Learning. The system is trained on a corpus of 45,000
subversive Twitter messages collected from October 2014 to December 2016. We
present a qualitative and quantitative analysis of the jihadist rhetoric in the
corpus, examine the network of Twitter users, outline the technical procedure
used to train the system, and discuss examples of use.
Comment: 31 page
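The abstract does not state which classifier is used, so as a generic illustration of supervised text classification of this kind, here is a minimal bag-of-words multinomial Naive Bayes trained on a toy, invented corpus. The labels, documents, and tokenizer are all assumptions; a real system would use a far larger feature set and a tuned model.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train_nb(docs, labels):
    """Multinomial Naive Bayes: per-label word counts plus label priors."""
    counts = defaultdict(Counter)        # label -> word frequency
    priors = Counter(labels)             # label -> document count
    for doc, lab in zip(docs, labels):
        counts[lab].update(tokenize(doc))
    vocab = {w for c in counts.values() for w in c}
    return counts, priors, vocab

def predict_nb(model, text):
    """Pick the label maximizing log P(label) + sum log P(word | label),
    with add-one (Laplace) smoothing over the shared vocabulary."""
    counts, priors, vocab = model
    total = sum(priors.values())
    best, best_lp = None, float("-inf")
    for lab in priors:
        lp = math.log(priors[lab] / total)
        denom = sum(counts[lab].values()) + len(vocab)
        for w in tokenize(text):
            if w in vocab:
                lp += math.log((counts[lab][w] + 1) / denom)
        if lp > best_lp:
            best, best_lp = lab, lp
    return best

# Toy training data (invented for illustration only).
model = train_nb(
    ["attack the enemy now", "join the fight", "nice weather today", "lovely cats"],
    ["subversive", "subversive", "benign", "benign"],
)
```

Reported accuracy figures like the 80% above would come from held-out evaluation on labeled data, not from training-set fit.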
VISION-BASED URBAN NAVIGATION PROCEDURES FOR VERBALLY INSTRUCTED ROBOTS
The work presented in this thesis is part of a project in instruction-based learning (IBL) for mobile
robots, where a robot is designed that can be instructed by its users through unconstrained natural
language. The robot uses vision guidance to follow route instructions in a miniature town model.
The aim of the work presented here was to determine the functional vocabulary of the robot in the
form of "primitive procedures". In contrast to previous work in the field of instructable robots, this
was done following a "user-centred" approach, where the main concern was to create primitive
procedures that can be directly associated with natural language instructions. To achieve this, a corpus
of human-to-human natural language instructions was collected and analysed. A set of primitive
actions was found with which the collected corpus could be represented. These primitive actions were
then implemented as robot-executable procedures.
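The association between instruction phrases and primitive procedures can be pictured as a small dispatch table. The phrases and procedure names below are hypothetical placeholders, not the thesis's actual vocabulary; the point is only the shape of the mapping from corpus phrases to executable primitives.

```python
# Hypothetical phrase -> primitive-procedure mapping.  Each procedure
# appends a command to the robot's action queue.
PRIMITIVES = {
    "turn left": lambda robot: robot.append("TURN_LEFT"),
    "turn right": lambda robot: robot.append("TURN_RIGHT"),
    "go straight": lambda robot: robot.append("GO_STRAIGHT"),
}

def execute_route(instruction, robot=None):
    """Match known phrases in a route instruction, in their order of
    appearance, and invoke the associated primitive for each match."""
    robot = [] if robot is None else robot
    lowered = instruction.lower()
    hits = sorted((lowered.find(p), p) for p in PRIMITIVES if p in lowered)
    for _, phrase in hits:
        PRIMITIVES[phrase](robot)
    return robot
```

A user-centred vocabulary means these keys come from how people actually phrase directions in the collected corpus, rather than from the robot's motor API.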
Natural language instructions are under-specified when destined to be executed by a robot. This is
because instructors omit information that they consider "commonsense" and rely on the listener's
sensory-motor capabilities to determine the details of the task execution. In this thesis the under-specification
problem is solved by determining the missing information, either during the learning of
new routes or during their execution by the robot. During learning, the missing information is
determined by imitating the commonsense approach human listeners take to achieve the same
purpose. During execution, missing information, such as the location of road layout features
mentioned in route instructions, is determined from the robot's view by using image template
matching. The original contribution of this thesis, in both these methods, lies in the fact that they are
driven by the natural language examples found in the corpus collected for the IBL project.
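Template matching, as used above to locate road layout features in the robot's view, slides a small reference patch over the image and scores each position. The thesis does not specify the matching criterion, so this sketch uses sum of absolute differences (SAD) on toy grayscale grids as one common, assumed choice; normalized cross-correlation is another standard option.

```python
def match_template(image, template):
    """Exhaustive template matching by sum of absolute differences (SAD).

    image and template are 2-D lists of grayscale values; returns the
    (row, col) of the top-left corner of the best-matching window."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best, best_pos = float("inf"), (0, 0)
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            sad = sum(abs(image[r + i][c + j] - template[i][j])
                      for i in range(th) for j in range(tw))
            if sad < best:
                best, best_pos = sad, (r, c)
    return best_pos

# A bright feature at (1, 1) in an otherwise dark image.
pos = match_template([[0, 0, 0], [0, 9, 0], [0, 0, 0]], [[9]])
```

Given a template of, say, a junction's appearance, the best-scoring position grounds the instruction's "at the junction" in the robot's current view.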
During the testing phase, a high success rate for primitive calls, when these were considered individually,
showed that the under-specification problem had overall been solved. A novel method for testing the
primitive procedures as part of complete route descriptions is also proposed in this thesis. This was
done by comparing the performance of human subjects when driving the robot, following route
descriptions, with the performance of the robot when executing the same route descriptions. The
results obtained from this comparison clearly indicated where errors occur, from the time when a
human speaker gives a route description to the time when the task is executed by a human listener or
by the robot.
Finally, a software speed controller is proposed in this thesis to control the wheel speeds of
the robot used in this project. The controller employs PI (Proportional-Integral) and PID
(Proportional-Integral-Derivative) control and provides a good alternative to expensive hardware
- …
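The PI/PID wheel-speed control mentioned above follows the standard textbook form: the output is a weighted sum of the current error, its accumulated integral, and its rate of change. The gains, time step, and class shape below are illustrative assumptions, not the thesis's tuned controller; setting kd to zero yields the PI variant.

```python
class PID:
    """Discrete PID speed controller (set kd=0.0 for plain PI control)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0      # accumulated error * dt
        self.prev_error = 0.0    # for the finite-difference derivative

    def update(self, setpoint, measured):
        """One control step: return the drive command for this wheel."""
        error = setpoint - measured
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)

# One wheel commanded to 10 units/s while currently measuring 8.
controller = PID(kp=0.5, ki=0.1, kd=0.0, dt=1.0)
command = controller.update(setpoint=10.0, measured=8.0)
```

In software, such a loop runs once per sensor sample per wheel; the integral term removes steady-state speed error, which is why a PI form is often sufficient for wheel-speed regulation.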