Deep Multimodal Speaker Naming

Author: Dai Jingwen
Hu Yongtao
Ren Jimmy
Wang Wenping
Xu Li
Yuan Chang
Publication venue: Association for Computing Machinery (ACM)
Publication date: 17/07/2015

Automatic speaker naming is the problem of localizing as well as identifying each speaking character in a TV/movie/live show video. This is a challenging problem mainly because of its multimodal nature: the face cue alone is insufficient to achieve good performance. Previous multimodal approaches to this problem usually process the data of different modalities individually and merge them using handcrafted heuristics. Such approaches work well for simple scenes, but fail to achieve high performance for speakers with large appearance variations. In this paper, we propose a novel convolutional neural network (CNN) based learning framework to automatically learn the fusion function of both face and audio cues. We show that without using face tracking, facial landmark localization or subtitle/transcript, our system with robust multimodal feature extraction is able to achieve state-of-the-art speaker naming performance evaluated on two diverse TV series. The dataset and implementation of our algorithm are publicly available online.

arXiv.org e-Print Archive

Crossref

Synergy between face alignment and tracking via Discriminative Global Consensus Optimization

Author: Khan Muhammad Haris
McDonagh John
Tzimiropoulos Georgios
Publication date: 26/10/2017

An open question in facial landmark localization in video is whether one should perform tracking or tracking-by-detection (i.e. face alignment). Tracking produces fittings of high accuracy but is prone to drifting. Tracking-by-detection is drift-free but results in low-accuracy fittings. To provide a solution to this problem, we describe the very first, to the best of our knowledge, synergistic approach between detection (face alignment) and tracking which completely eliminates drifting from face tracking, and does not merely perform tracking-by-detection. Our first main contribution is to show that one can achieve this synergy between detection and tracking using a principled optimization framework based on the theory of Global Variable Consensus Optimization using ADMM. Our second contribution is to show how the proposed analytic framework can be integrated within state-of-the-art discriminative methods for face alignment and tracking based on cascaded regression and deeply learned features. Overall, we call our method Discriminative Global Consensus Model (DGCM). Our third contribution is to show that DGCM achieves a large performance improvement over the currently best performing face tracking methods on the most challenging category of the 300-VW dataset.
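
The abstract above refers to global variable consensus optimization with ADMM. As a rough illustration of that general technique only, the sketch below runs consensus ADMM on a toy scalar problem in which each local objective is assumed to be f_i(x) = 0.5 * (x - a_i)^2, so the consensus value should approach the mean of the a_i. The function name `consensus_admm`, the quadratic objectives, and the parameter defaults are assumptions made for this example; it is not the DGCM method and involves no face alignment or tracking.

```python
def consensus_admm(targets, rho=1.0, iters=100):
    """Toy global variable consensus ADMM (illustrative sketch only).

    Assumed local objectives: f_i(x) = 0.5 * (x - targets[i]) ** 2.
    Returns the consensus variable z after `iters` iterations.
    """
    n = len(targets)
    y = [0.0] * n   # dual variables, one per local problem
    z = 0.0         # global consensus variable
    for _ in range(iters):
        # Local step: minimise f_i(x) + y_i * (x - z) + (rho / 2) * (x - z) ** 2,
        # which has a closed form for the assumed quadratic f_i.
        x = [(a - y_i + rho * z) / (1.0 + rho) for a, y_i in zip(targets, y)]
        # Consensus step: average the local estimates plus scaled duals.
        z = sum(x_i + y_i / rho for x_i, y_i in zip(x, y)) / n
        # Dual step: penalise remaining disagreement with the consensus.
        y = [y_i + rho * (x_i - z) for x_i, y_i in zip(x, y)]
    return z


if __name__ == "__main__":
    # For these toy objectives the result should be close to the mean of the inputs.
    print(consensus_admm([1.0, 2.0, 6.0]))
```

In this toy setting each local variable could stand in for one estimator's opinion and z for the value they are forced to agree on; how the paper maps detection and tracking onto such terms is not specified in the abstract.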

Nottingham ePrints

Nottingham eTheses

Crossref

Adaptive User Perspective Rendering for Handheld Augmented Reality

Author: Grubert Jens
Kalkofen Denis
Mohr Peter
Schmalstieg Dieter
Tatzgern Markus
Publication date: 01/01/2017

Handheld Augmented Reality commonly implements some variant of magic lens rendering, which turns only a fraction of the user's real environment into AR while the rest of the environment remains unaffected. Since handheld AR devices are commonly equipped with video see-through capabilities, AR magic lens applications often suffer from spatial distortions, because the AR environment is presented from the perspective of the camera of the mobile device. Recent approaches counteract this distortion by estimating the user's head position and rendering the scene from the user's perspective. To this end, approaches usually apply face-tracking algorithms on the front camera of the mobile device. However, this demands high computational resources and therefore commonly affects the performance of the application beyond the already high computational load of AR applications. In this paper, we present a method to reduce the computational demands for user perspective rendering by applying lightweight optical flow tracking and an estimation of the user's motion before head tracking is started. We demonstrate the suitability of our approach for computationally limited mobile devices and we compare it to device perspective rendering, to head tracked user perspective rendering, as well as to fixed point of view user perspective rendering.

arXiv.org e-Print Archive

Crossref

Audio-Video Event Recognition System For Public Transport Security

Author: Allezard Nicolas
Ambellouis Sébastien
Brémond François
Davini Gabriele
Flancquart Amaury
Pham Quoc-Cuong
Rouas Jean-Luc
Sayd Patrick
Thonnat Monique
Vu Van-Thinh
Publication venue: HAL CCSD
Publication date: 01/01/2006

This paper presents an audio-video surveillance system for automatic surveillance in public transport vehicles. The system comprises six modules including in particular three novel ones: (i) Face Detection and Tracking, (ii) Audio Event Detection and (iii) Audio-Video Scenario Recognition. The Face Detection and Tracking module is responsible for detecting and tracking faces of people in front of cameras. The Audio Event Detection module detects abnormal audio events which are precursors for detecting scenarios which have been predefined by end-users. The Audio-Video Scenario Recognition module performs high level interpretation of the observed objects by combining audio and video events based on spatio-temporal reasoning. The performance of the system is evaluated for a series of pre-defined audio, video and audio-video events specified using an audio-video event ontology.

CiteSeerX

Crossref

Scientific Publications of the University of Toulouse II Le Mirail

INRIA a CCSD electronic archive server

HAL-CEA

Hal-Diderot

The first Facial Landmark Tracking in-the-Wild Challenge: benchmark and results

Author: Chrysos Grigorios G.
Kossaifi Jean
Pantic Maja
Shen Jie
Tzimiropoulos Georgios
Zafeiriou Stefanos
Publication date: 01/12/2015

Detection and tracking of faces in image sequences is among the most well studied problems in the intersection of statistical machine learning and computer vision. Often, tracking and detection methodologies use a rigid representation to describe the facial region, hence they can neither capture nor exploit the non-rigid facial deformations, which are crucial for countless applications (e.g., facial expression analysis, facial motion capture, high-performance face recognition etc.). Usually, the non-rigid deformations are captured by locating and tracking the position of a set of fiducial facial landmarks (e.g., eyes, nose, mouth etc.). Recently, we witnessed a burst of research in automatic facial landmark localisation in static imagery. This is partly attributed to the availability of large amounts of annotated data, much of which has been provided by the first facial landmark localisation challenge (also known as 300-W challenge). Even though well established benchmarks now exist for facial landmark localisation in static imagery, to the best of our knowledge, there is no established benchmark for assessing the performance of facial landmark tracking methodologies, containing an adequate number of annotated face videos. In conjunction with ICCV’2015 we ran the first competition/challenge on facial landmark tracking in long-term videos. In this paper, we present the first benchmark for long-term facial landmark tracking, containing currently over 110 annotated videos, and we summarise the results of the competition.

Nottingham eTheses

Eye Movements, Perceptions, and Performance

Author: Djamasbi Soussan
Mehta Dhiren
Samani Ami
Publication venue: AIS Electronic Library (AISeL)
Publication date: 30/07/2012

Due to the ever-growing amount of information, individuals often face high cognitive loads when making decisions. Thus, understanding user reactions to high cognitive loads can help to improve a user’s ability to make good quality decisions under high cognitive load. Prior decision making and user experience research suggests that eye tracking may provide a more complete picture of user reactions under high cognitive loads. Thus, through an exploratory study we investigate the relationship between fixation, perceptions of cognitive load and performance. Our analysis shows that fixation can predict both perception of the load and performance in a cognitively demanding online game.

Digital WPI

AIS Electronic Library (AISeL)

Long Range Automated Persistent Surveillance

Author: Yao Yi
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/01/2008

This dissertation addresses long range automated persistent surveillance with focus on three topics: sensor planning, size preserving tracking, and high magnification imaging. field of view should be reserved so that camera handoff can be executed successfully before the object of interest becomes unidentifiable or untraceable. We design a sensor planning algorithm that not only maximizes coverage but also ensures uniform and sufficient overlap between the cameras’ fields of view for an optimal handoff success rate. This algorithm works for environments with multiple dynamic targets using different types of cameras. Significantly improved handoff success rates are illustrated via experiments using floor plans of various scales. Size preserving tracking automatically adjusts the camera’s zoom for a consistent view of the object of interest. Target scale estimation is carried out based on the paraperspective projection model which compensates for the center offset and considers system latency and tracking errors. A computationally efficient foreground segmentation strategy, 3D affine shapes, is proposed. The 3D affine shapes feature direct and real-time implementation and improved flexibility in accommodating the target’s 3D motion, including off-plane rotations. The effectiveness of the scale estimation and foreground segmentation algorithms is validated via both offline and real-time tracking of pedestrians at various resolution levels. Face image quality assessment and enhancement compensate for the performance degradations in face recognition rates caused by high system magnifications and long observation distances. A class of adaptive sharpness measures is proposed to evaluate and predict this degradation. A wavelet based enhancement algorithm with automated frame selection is developed and proves efficient, yielding a considerably elevated face recognition rate for severely blurred long range face images.

University of Tennessee, Knoxville: Trace

CiteSeerX