Recognition of nonmanual markers in American Sign Language (ASL) using non-parametric adaptive 2D-3D face tracking
This paper addresses the problem of automatically recognizing linguistically significant nonmanual expressions in American Sign Language from video. We develop a fully automatic system that is able to track facial expressions and head movements, and detect and recognize facial events continuously from video. The main contributions of the proposed framework are the following: (1) We have built a stochastic and adaptive ensemble of face trackers to address factors resulting in lost face track; (2) We combine 2D and 3D deformable face models to warp input frames, thus correcting for any variation in facial appearance resulting from changes in 3D head pose; (3) We use a combination of geometric features and texture features extracted from a canonical frontal representation. The proposed new framework makes it possible to detect grammatically significant nonmanual expressions from continuous signing and to differentiate successfully among linguistically significant expressions that involve subtle differences in appearance. We present results that are based on the use of a dataset containing 330 sentences from videos that were collected and linguistically annotated at Boston University.
SALSA: A Novel Dataset for Multimodal Group Behavior Analysis
Studying free-standing conversational groups (FCGs) in unstructured social
settings (e.g., a cocktail party) is gratifying due to the wealth of information
available at the group (mining social networks) and individual (recognizing
native behavioral and personality traits) levels. However, analyzing social
scenes involving FCGs is also highly challenging, because crowdedness and
extreme occlusions make it difficult to extract behavioral cues such as
target locations, speaking activity and head/body pose. To
this end, we propose SALSA, a novel dataset facilitating multimodal and
Synergetic sociAL Scene Analysis, and make two main contributions to research
on automated social interaction analysis: (1) SALSA records social interactions
among 18 participants in a natural, indoor environment for over 60 minutes,
under the poster presentation and cocktail party contexts presenting
difficulties in the form of low-resolution images, lighting variations,
numerous occlusions, reverberations and interfering sound sources; (2) To
alleviate these problems we facilitate multimodal analysis by recording the
social interplay using four static surveillance cameras and sociometric badges
worn by each participant, comprising microphone, accelerometer, Bluetooth
and infrared sensors. In addition to raw data, we also provide annotations
concerning individuals' personality as well as their position, head, body
orientation and F-formation information over the entire event duration. Through
extensive experiments with state-of-the-art approaches, we show (a) the
limitations of current methods and (b) how the recorded multiple cues
synergetically aid automatic analysis of social interactions. SALSA is
available at http://tev.fbk.eu/salsa.
Comment: 14 pages, 11 figures
Facial Landmark Detection Evaluation on MOBIO Database
MOBIO is a bi-modal database that was captured almost exclusively on mobile
phones. It aims to improve research into deploying biometric techniques to
mobile devices. Research has shown that face and speaker recognition can
be performed in a mobile environment. Facial landmark localization aims at
finding the coordinates of a set of pre-defined key points in 2D face images.
A facial landmark usually has specific semantic meaning, e.g. nose tip or eye
centre, which provides rich geometric information for other face analysis tasks
such as face recognition, emotion estimation and 3D face reconstruction. Most
facial landmark detection methods are evaluated on still-face databases, such as
300W, AFW, AFLW, or COFW, and seldom on mobile data. Our
work is the first to evaluate facial landmark detection on mobile
still data, i.e., face images from the MOBIO database. About 20,600 face images
have been extracted from this audio-visual database and manually labeled with
22 landmarks as the ground truth. Several state-of-the-art facial landmark
detection methods are evaluated on these data. The
results show that the data from the MOBIO database are challenging, making
the database a new, demanding benchmark for facial landmark detection evaluation.
Comment: 13 pages, 10 figures
Multimedia information technology and the annotation of video
The state of the art in multimedia information technology has not progressed to the point where a single solution is available to meet all reasonable needs of documentalists and users of video archives. In general, we do not have an optimistic view of the usability of new technology in this domain, but digitization and digital power can be expected to cause a small revolution in the area of video archiving. The volume of data leads to two views of the future: on the pessimistic side, overload of data will cause lack of annotation capacity, and on the optimistic side, there will be enough data from which to learn selected concepts that can be deployed to support automatic annotation. At the threshold of this interesting era, we make an attempt to describe the state of the art in technology. We sample the progress in text, sound, and image processing, as well as in machine learning.
Recognizing complex faces and gaits via novel probabilistic models
In the field of computer vision, developing automated systems to recognize people
under unconstrained scenarios is a partially solved problem. In unconstrained
scenarios, a number of common variations and complexities such as occlusion,
illumination, cluttered background and so on impose vast uncertainty on the
recognition process. Among the various biometrics that have been emerging recently,
this dissertation focuses on two of them, namely face and gait recognition.
Firstly, we address the problem of recognizing faces with major occlusions amidst
other variations such as pose, scale, expression and illumination using a novel
PRObabilistic Component based Interpretation Model (PROCIM) inspired by key
psychophysical principles that are closely related to reasoning under uncertainty.
The model employs Bayesian Networks to establish, learn, interpret and
exploit intrinsic similarity mappings from the face domain. Then, by incorporating
efficient inference strategies, robust decisions are made for successfully recognizing
faces under uncertainty. PROCIM reports improved recognition rates over recent
approaches.
Secondly, we address the emerging gait recognition problem and show that
PROCIM can be easily adapted to the gait domain as well. We scientifically
define and formulate sub-gaits and propose a novel modular training scheme to
efficiently learn subtle sub-gait characteristics from the gait domain. Our results
show that the proposed model is robust to several uncertainties and yields
significant recognition performance. Finally, apart from PROCIM, we show how
simple component-based gait reasoning can be coherently modeled using the
recently prominent Markov Logic Networks (MLNs) by intuitively fusing imaging,
logic and graphs.
We have discovered that face and gait domains exhibit interesting similarity
mappings between object entities and their components. We have proposed intuitive
probabilistic methods to model these mappings to perform recognition under
various uncertainty elements. Extensive experimental validations justify the
robustness of the proposed methods over the state-of-the-art techniques.
Challenges of Deep Learning-based Text Detection in the Wild
The reported accuracy of recent state-of-the-art text detection methods, mostly deep learning approaches, is on the order of 80% to 90% on standard benchmark datasets. These methods have relaxed some of the restrictions on structured text and environment that classical OCR usually requires to function properly, allowing them to operate "in the wild". Even with this relaxation, there are still circumstances where these state-of-the-art methods fail. Several remaining challenges in wild images, such as in-plane rotation, illumination reflection, partial occlusion, complex font styles, and perspective distortion, cause existing methods to perform poorly. In order to evaluate current approaches in a formal way, we standardize the datasets and metrics used for comparison, since inconsistencies in these have made comparing methods difficult in the past. We use three benchmark datasets for our evaluations: ICDAR13, ICDAR15, and COCO-Text V2.0. The objective of the paper is to quantify the current shortcomings and to identify the challenges for future text detection research.