Deep Multimodal Speaker Naming
Automatic speaker naming is the problem of localizing as well as identifying
each speaking character in a TV/movie/live show video. This is a challenging
problem, mainly because of its multimodal nature: the face cue alone is
insufficient to achieve good performance. Previous multimodal approaches to
this problem usually process the data of different modalities individually and
merge them using handcrafted heuristics. Such approaches work well for simple
scenes, but fail to achieve high performance for speakers with large appearance
variations. In this paper, we propose a novel convolutional neural network
(CNN) based learning framework to automatically learn the fusion function of
both face and audio cues. We show that without using face tracking, facial
landmark localization or subtitle/transcript, our system with robust multimodal
feature extraction is able to achieve state-of-the-art speaker naming
performance evaluated on two diverse TV series. The dataset and implementation
of our algorithm are publicly available online.
Synergy between face alignment and tracking via Discriminative Global Consensus Optimization
An open question in facial landmark localization in video is whether one should perform tracking or tracking-by-detection (i.e. face alignment). Tracking produces high-accuracy fittings but is prone to drifting. Tracking-by-detection is drift-free but results in low-accuracy fittings. To address this problem, we describe what is, to the best of our knowledge, the very first synergistic approach between detection (face alignment) and tracking, one which completely eliminates drifting from face tracking and does not merely perform tracking-by-detection. Our first main contribution is to show that one can achieve this synergy between detection and tracking using a principled optimization framework based on the theory of Global Variable Consensus Optimization using ADMM. Our second contribution is to show how the proposed analytic framework can be integrated within state-of-the-art discriminative methods for face alignment and tracking based on cascaded regression and deeply learned features. Overall, we call our method the Discriminative Global Consensus Model (DGCM). Our third contribution is to show that DGCM achieves a large performance improvement over the currently best performing face tracking methods on the most challenging category of the 300-VW dataset.
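For readers unfamiliar with the optimization machinery named in this abstract, the following is a minimal sketch of global variable consensus optimization with ADMM in its scaled form. It is not the DGCM objective: the toy local costs f_i(x) = 0.5 * (x - a_i)^2, the function name `consensus_admm`, and the parameters `rho` and `iterations` are all assumptions chosen for illustration. Each local estimate is pulled toward its own target while being forced to agree with a shared variable z.

```python
def consensus_admm(targets, rho=1.0, iterations=100):
    """Scaled-form consensus ADMM for toy costs f_i(x) = 0.5 * (x - a_i)^2.

    Hypothetical illustration only; the costs and names are assumptions,
    not the objective used by DGCM.
    """
    n = len(targets)
    x = [0.0] * n  # local estimates, one per cost term
    u = [0.0] * n  # scaled dual variables, one per consensus constraint
    z = 0.0        # global consensus variable
    for _ in range(iterations):
        # x-update: closed-form minimizer of f_i(x) + (rho/2) * (x - z + u_i)^2
        x = [(a + rho * (z - ui)) / (1.0 + rho) for a, ui in zip(targets, u)]
        # z-update: average of the local estimates plus scaled duals
        z = sum(xi + ui for xi, ui in zip(x, u)) / n
        # dual update: accumulate the remaining disagreement x_i - z
        u = [ui + xi - z for xi, ui in zip(x, u)]
    return z, x


z, x = consensus_admm([1.0, 2.0, 6.0])
print(z, x)
```

For these quadratic costs the consensus value converges to the mean of the targets, and every local estimate converges to the same value.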
Adaptive User Perspective Rendering for Handheld Augmented Reality
Handheld Augmented Reality commonly implements some variant of magic lens
rendering, which turns only a fraction of the user's real environment into AR
while the rest of the environment remains unaffected. Since handheld AR devices
are commonly equipped with video see-through capabilities, AR magic lens
applications often suffer from spatial distortions, because the AR environment
is presented from the perspective of the camera of the mobile device. Recent
approaches counteract this distortion based on estimations of the user's head
position, rendering the scene from the user's perspective. To this end,
approaches usually apply face-tracking algorithms on the front camera of the
mobile device. However, this demands high computational resources and therefore
commonly affects the performance of the application beyond the already high
computational load of AR applications. In this paper, we present a method to
reduce the computational demands for user perspective rendering by applying
lightweight optical flow tracking and an estimation of the user's motion before
head tracking is started. We demonstrate the suitability of our approach for
computationally limited mobile devices and we compare it to device perspective
rendering, to head tracked user perspective rendering, as well as to fixed
point of view user perspective rendering.
Audio-Video Event Recognition System For Public Transport Security
This paper presents an audio-video system for automatic surveillance in public transport vehicles. The system comprises six modules, including three novel ones: (i) Face Detection and Tracking, (ii) Audio Event Detection and (iii) Audio-Video Scenario Recognition. The Face Detection and Tracking module is responsible for detecting and tracking the faces of people in front of the cameras. The Audio Event Detection module detects abnormal audio events, which are precursors for detecting scenarios that have been predefined by end-users. The Audio-Video Scenario Recognition module performs high-level interpretation of the observed objects by combining audio and video events based on spatio-temporal reasoning. The performance of the system is evaluated for a series of predefined audio, video and audio-video events specified using an audio-video event ontology.
The first Facial Landmark Tracking in-the-Wild Challenge: benchmark and results
Detection and tracking of faces in image sequences is among the most well studied problems in the intersection of statistical machine learning and computer vision. Often, tracking and detection methodologies use a rigid representation to describe the facial region, hence they can neither capture nor exploit the non-rigid facial deformations, which are crucial for countless applications (e.g., facial expression analysis, facial motion capture, high-performance face recognition etc.). Usually, the non-rigid deformations are captured by locating and tracking the position of a set of fiducial facial landmarks (e.g., eyes, nose, mouth etc.). Recently, we witnessed a burst of research in automatic facial landmark localisation in static imagery. This is partly attributed to the availability of a large amount of annotated data, much of which has been provided by the first facial landmark localisation challenge (also known as the 300-W challenge). Even though well established benchmarks now exist for facial landmark localisation in static imagery, to the best of our knowledge, there is no established benchmark for assessing the performance of facial landmark tracking methodologies that contains an adequate number of annotated face videos. In conjunction with ICCV 2015, we ran the first competition/challenge on facial landmark tracking in long-term videos. In this paper, we present the first benchmark for long-term facial landmark tracking, currently containing over 110 annotated videos, and we summarise the results of the competition.
Eye Movements, Perceptions, and Performance
Due to the ever-growing amount of information, individuals often face high cognitive loads when making decisions. Thus, understanding user reactions to high cognitive loads can help to improve a user's ability to make good quality decisions under such loads. Prior decision making and user experience research suggests that eye tracking may provide a more complete picture of user reactions under high cognitive loads. Thus, through an exploratory study, we investigate the relationship between fixation, perceptions of cognitive load and performance. Our analysis shows that fixation can predict both perception of the load and performance in a cognitively demanding online game.
Long Range Automated Persistent Surveillance
This dissertation addresses long range automated persistent surveillance with focus on three topics: sensor planning, size preserving tracking, and high magnification imaging.
field of view should be reserved so that camera handoff can be executed successfully before the object of interest becomes unidentifiable or untraceable. We design a sensor planning algorithm that not only maximizes coverage but also ensures uniform and sufficient overlap between cameras' fields of view for an optimal handoff success rate. This algorithm works for environments with multiple dynamic targets using different types of cameras. Significantly improved handoff success rates are illustrated via experiments using floor plans of various scales.
Size preserving tracking automatically adjusts the camera's zoom for a consistent view of the object of interest. Target scale estimation is carried out based on the paraperspective projection model which compensates for the center offset and considers system latency and tracking errors. A computationally efficient foreground segmentation strategy, 3D affine shapes, is proposed. The 3D affine shapes feature direct and real-time implementation and improved flexibility in accommodating the target's 3D motion, including off-plane rotations. The effectiveness of the scale estimation and foreground segmentation algorithms is validated via both offline and real-time tracking of pedestrians at various resolution levels.
Face image quality assessment and enhancement compensate for the degradation in face recognition rates caused by high system magnifications and long observation distances. A class of adaptive sharpness measures is proposed to evaluate and predict this degradation. A wavelet-based enhancement algorithm with automated frame selection is developed and proves effective, yielding a considerably higher face recognition rate for severely blurred long range face images.