Single camera pose estimation using Bayesian filtering and Kinect motion priors
Traditional approaches to upper body pose estimation using monocular vision
rely on complex body models and a large variety of geometric constraints. We
argue that this is not ideal and somewhat inelegant as it results in large
processing burdens, and instead attempt to incorporate these constraints
through priors obtained directly from training data. A prior distribution
covering the probability of a human pose occurring is used to incorporate
likely human poses. This distribution is obtained offline, by fitting a
Gaussian mixture model to a large dataset of recorded human body poses, tracked
using a Kinect sensor. We combine this prior information with a random walk
transition model to obtain an upper body model, suitable for use within a
recursive Bayesian filtering framework. Our model can be viewed as a mixture of
discrete Ornstein-Uhlenbeck processes, in that states behave as random walks,
but drift towards a set of typically observed poses. This model is combined
with measurements of the human head and hand positions, using recursive
Bayesian estimation to incorporate temporal information. Measurements are
obtained using face detection and a simple skin colour hand detector, trained
using the detected face. The suggested model is designed with analytical
tractability in mind and we show that the pose tracking can be
Rao-Blackwellised using the mixture Kalman filter, allowing for computational
efficiency while still incorporating bio-mechanical properties of the upper
body. In addition, the use of the proposed upper body model allows reliable
three-dimensional pose estimates to be obtained indirectly for a number of
joints that are often difficult to detect using traditional object recognition
strategies. Comparisons with Kinect sensor results and the state of the art in
2D pose estimation highlight the efficacy of the proposed approach.
Comment: 25 pages, Technical report, related to Burke and Lasenby, AMDO 2014
conference paper. Code sample: https://github.com/mgb45/SignerBodyPose Video:
https://www.youtube.com/watch?v=dJMTSo7-uF
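The "random walk that drifts towards typically observed poses" idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the scalar state, the `drift` and `noise_std` parameters, and the choice to sample one mixture component per trajectory are all assumptions made for illustration. In the report, the component means would come from a Gaussian mixture model fitted to Kinect pose data.

```python
import random

def ou_step(x, mean, drift, noise_std, rng):
    """One discrete Ornstein-Uhlenbeck-style update for a scalar state.

    The state takes a random-walk step (Gaussian noise) and is also
    pulled a fraction `drift` of the way towards `mean`.
    """
    return x + drift * (mean - x) + rng.gauss(0.0, noise_std)

def simulate(x0, means, weights, drift=0.1, noise_std=0.05, steps=100, seed=0):
    """Simulate one coordinate of a pose under a mixture of OU-style processes.

    Assumption: a single mixture component (a "typical pose") is sampled
    once according to `weights`, and the state then drifts towards it.
    """
    rng = random.Random(seed)
    target = rng.choices(means, weights=weights, k=1)[0]
    x = x0
    path = [x]
    for _ in range(steps):
        x = ou_step(x, target, drift, noise_std, rng)
        path.append(x)
    return path

if __name__ == "__main__":
    # Two hypothetical typical values for one joint coordinate.
    path = simulate(0.0, means=[-1.0, 1.0], weights=[0.3, 0.7])
    print(path[-1])
```

With `noise_std=0` the update is deterministic and the state converges geometrically to the chosen mean, which makes the drift term easy to inspect in isolation.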
Facial Landmark Detection Evaluation on MOBIO Database
MOBIO is a bi-modal database that was captured almost exclusively on mobile
phones. It aims to improve research into deploying biometric techniques to
mobile devices. Research has shown that face and speaker recognition can
be performed in a mobile environment. Facial landmark localization aims at
finding the coordinates of a set of pre-defined key points for 2D face images.
A facial landmark usually has specific semantic meaning, e.g. nose tip or eye
centre, which provides rich geometric information for other face analysis tasks
such as face recognition, emotion estimation and 3D face reconstruction. Most
facial landmark detection methods adopt still face databases, such as
300W, AFW, AFLW, or COFW, for evaluation, but seldom use mobile data. Our
work is the first to perform facial landmark detection evaluation on mobile
still data, i.e., face images from the MOBIO database. About 20,600 face images
have been extracted from this audio-visual database and manually labeled with
22 landmarks as the groundtruth. Several state-of-the-art facial landmark
detection methods are evaluated on these data. The results show that the data
from the MOBIO database are challenging, making it a demanding new benchmark
for facial landmark detection evaluation.
Comment: 13 pages, 10 figures
ROBUST REPRESENTATIONS FOR UNCONSTRAINED FACE RECOGNITION AND ITS APPLICATIONS
Face identification and verification are important problems in computer vision and have been actively researched for over two decades. There are several applications, including mobile authentication, visual surveillance, social network analysis, and video content analysis. Many algorithms have been shown to work well on images collected in controlled settings. However, the performance of these algorithms often degrades significantly on images that have large variations in pose, illumination and expression, as well as variations due to aging, cosmetics, and occlusion. Extracting robust and discriminative feature representations from face images and videos is therefore an important problem for achieving good performance in uncontrolled settings.
In this dissertation, we present several approaches to extract robust feature representations from a set of images or video frames for face identification and verification problems. We first present a dictionary approach with dense facial landmark features. Each face video is first segmented into K partitions, and multi-scale features are extracted from patches centered at detected facial landmarks. Compact and representative dictionaries are then learned from the dense features for each partition of a video and concatenated into a video dictionary representation. Experiments show that the representation is effective for the unconstrained video-based face identification task. Secondly, we present a landmark-based Fisher vector approach for video-based face verification problems. This approach encodes over-complete local features into a high-dimensional feature representation, followed by a learned joint Bayesian metric that projects the feature vector into a low-dimensional space and computes the similarity score. We then present an automated system for face verification which exploits features from deep convolutional neural networks (DCNN) trained using the CASIA-WebFace dataset. Our experimental results show that the DCNN model is able to characterize the face variations from the large-scale source face dataset and generalizes well to another, smaller one. Finally, we also demonstrate that the model pre-trained for face identification and verification tasks encodes rich face information which benefits other face-related tasks with scarce annotated training data. We use apparent age estimation as an example and develop a cascade convolutional neural network framework which consists of age group classification and age regression, in which a deep network is fine-tuned using the target data.
AFFECT-PRESERVING VISUAL PRIVACY PROTECTION
The prevalence of wireless networks and the convenience of mobile cameras enable many new video applications other than security and entertainment. From behavioral diagnosis to wellness monitoring, cameras are increasingly used for observations in various educational and medical settings. Videos collected for such applications are considered protected health information under privacy laws in many countries. Visual privacy protection techniques, such as blurring or object removal, can be used to mitigate privacy concerns, but they also obliterate important visual cues of affect and social behaviors that are crucial for the target applications. In this dissertation, we propose to balance the privacy protection and the utility of the data by preserving the privacy-insensitive information, such as pose and expression, which is useful in many applications involving visual understanding.
The Intellectual Merits of the dissertation include a novel framework for visual privacy protection by manipulating the facial image and body shape of individuals, which: (1) is able to conceal the identity of individuals; (2) provides a way to preserve the utility of the data, such as expression and pose information; (3) balances the utility of the data and the capacity of the privacy protection.
The Broader Impacts of the dissertation focus on the significance of privacy protection on visual data, and the inadequacy of current privacy enhancing technologies in preserving affect and behavioral attributes of the visual content, which are highly useful for behavior observation in educational and medical settings. The work in this dissertation represents one of the first attempts at achieving both goals simultaneously.
Video content analysis for intelligent forensics
The networks of surveillance cameras installed in public places and private territories continuously record video data with the aim of detecting and preventing unlawful activities. This enhances the importance of video content analysis applications, either for real-time (i.e. analytic) or post-event (i.e. forensic) analysis. In this thesis, the primary focus is on four key aspects of video content analysis, namely: (1) moving object detection and recognition; (2) correction of colours in the video frames and recognition of colours of moving objects; (3) make and model recognition of vehicles and identification of their type; and (4) detection and recognition of text information in outdoor scenes.
To address the first issue, a framework is presented in the first part of the thesis that efficiently detects and recognizes moving objects in videos. The framework targets the problem of object detection in the presence of complex backgrounds. The object detection part of the framework relies on a background modelling technique and a novel post-processing step in which the contours of the foreground regions (i.e. moving objects) are refined by classifying edge segments as belonging either to the background or to the foreground region. Further, a novel feature descriptor is devised for the classification of moving objects into humans, vehicles and background. The proposed feature descriptor captures the texture information present in the silhouette of foreground objects.
To address the second issue, a framework for the correction and recognition of true colours of objects in videos is presented with novel noise reduction, colour enhancement and colour recognition stages. The colour recognition stage makes use of temporal information to reliably recognize the true colours of moving objects in multiple frames. The proposed framework is specifically designed to perform robustly on videos that have poor quality because of surrounding illumination, camera sensor imperfection and artefacts due to high compression.
In the third part of the thesis, a framework for vehicle make and model recognition and type identification is presented. As a part of this work, a novel feature representation technique for distinctive representation of vehicle images has emerged. The feature representation technique uses dense feature description and a mid-level feature encoding scheme to capture the texture in the frontal view of the vehicles. The proposed method is insensitive to minor in-plane rotation and skew within the image. The proposed framework can be extended to any number of vehicle classes without re-training. Another important contribution of this work is the publication of a comprehensive, up-to-date dataset of vehicle images to support future research in this domain.
The problem of text detection and recognition in images is addressed in the last part of the thesis. A novel technique is proposed that exploits the colour information in the image for the identification of text regions. Apart from detection, the colour information is also used to segment characters from the words. The recognition of identified characters is performed using shape features and supervised learning. Finally, a lexicon based alignment procedure is adopted to finalize the recognition of strings present in word images.
Extensive experiments have been conducted on benchmark datasets to analyse the performance of the proposed algorithms. The results show that the proposed moving object detection and recognition technique outperformed well-known baseline techniques. The proposed framework for the correction and recognition of object colours in video frames achieved all the aforementioned goals. The performance analysis of the vehicle make and model recognition framework on multiple datasets has shown the strength and reliability of the technique when used within various scenarios. Finally, the experimental results for the text detection and recognition framework on benchmark datasets have revealed the potential of the proposed scheme for accurate detection and recognition of text in the wild.
A query language for exploratory analysis of video-based tracking data in padel matches
Recent advances in sensor technologies, in particular video-based human detection, object tracking and pose estimation, have opened new possibilities for the automatic or semi-automatic per-frame annotation of sport videos. In the case of racket sports such as tennis and padel, state-of-the-art deep learning methods allow the robust detection and tracking of the players from a single video, which can be combined with ball tracking and shot recognition techniques to obtain a precise description of the play state at every frame. These data, which might include the court-space position of the players, their speeds, accelerations, shots and ball trajectories, can be exported in tabular format for further analysis. Unfortunately, the limitations of traditional table-based methods for analyzing such sport data are twofold. On the one hand, these methods cannot represent complex spatio-temporal queries in a compact, readable way, usable by sport analysts. On the other hand, traditional data visualization tools often fail to convey all the information available in the video (such as the precise body motion before, during and after the execution of a shot), and the resulting plots only show a small portion of the available data. In this paper we address these two limitations by focusing on the analysis of video-based tracking data of padel matches. In particular, we propose a domain-specific query language that helps coaches and sport analysts write queries in a very compact form. Additionally, we enrich the data visualization plots by linking each data item to a specific segment of the video so that analysts have full access to all the details related to the query. We demonstrate the flexibility of our system by collecting and converting into readable queries multiple tips and hypotheses on padel strategies extracted from the literature.
Postprint (published version)
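To make the idea of compact queries over per-frame tracking data more concrete, here is a small sketch of composable predicates with a link back to the video. It does not reproduce the paper's query language, whose syntax is not given in the abstract. The `Frame` fields, the predicate helpers, the coordinate convention (net at `y = 0`, distances in metres) and the frame rate are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Frame:
    index: int     # frame number in the video
    player: str    # hypothetical player identifier
    x: float       # court-space position, metres (assumed convention)
    y: float       # signed distance from the net, metres (assumed)
    shot: str      # shot label on this frame, "" if none

def shot_is(name):
    """Predicate: the frame carries the given shot label."""
    return lambda f: f.shot == name

def near_net(max_dist):
    """Predicate: the player is within max_dist metres of the net."""
    return lambda f: abs(f.y) <= max_dist

def query(frames, *predicates):
    """Return the frames that satisfy every predicate."""
    return [f for f in frames if all(p(f) for p in predicates)]

def video_segment(frame, fps=25.0, margin_s=1.0):
    """Map a matching frame to a (start, end) time window in seconds."""
    t = frame.index / fps
    return (max(0.0, t - margin_s), t + margin_s)

if __name__ == "__main__":
    frames = [
        Frame(10, "A", 2.0, 8.5, "serve"),
        Frame(50, "A", 1.0, 2.5, "volley"),
        Frame(90, "B", -3.0, -6.0, "volley"),
    ]
    for f in query(frames, shot_is("volley"), near_net(3.0)):
        print(f.player, video_segment(f))
```

Representing each condition as a small predicate keeps individual queries short and lets them be combined freely, while `video_segment` shows one way a result row could be tied to a clip for inspection.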