Search CORE

77,076 research outputs found

Taking the bite out of automated naming of characters in TV video

Author: Everingham M.
Sivic J.
Zisserman A.
Publication venue: 'Elsevier BV'
Publication date: 01/01/2009
Field of study

We investigate the problem of automatically labelling appearances of characters in TV or film material with their names. This is tremendously challenging due to the huge variation in imaged appearance of each character and the weakness and ambiguity of available annotation. However, we demonstrate that high precision can be achieved by combining multiple sources of information, both visual and textual. The principal novelties that we introduce are: (i) automatic generation of time stamped character annotation by aligning subtitles and transcripts; (ii) strengthening the supervisory information by identifying when characters are speaking. In addition, we incorporate complementary cues of face matching and clothing matching to propose common annotations for face tracks, and consider choices of classifier which can potentially correct errors made in the automatic extraction of training data from the weak textual annotation. Results are presented on episodes of the TV series ‘‘Buffy the Vampire Slayer”

Oxford University Research Archive

White Rose Research Online

Video-based driver identification using local appearance face recognition

Author: Ekenel Hazım Kemal
Ekenel Hazim Kemal
Ercil Aytul
Erdogan Hakan
Erdoğan Hakan
Erçil Aytül
Stallkamp Johannes
Stiefelhagen Rainer
Publication venue: Sabancı University
Publication date: 01/06/2007
Field of study

In this paper, we present a person identification system for vehicular environments. The proposed system uses face images of the driver and utilizes local appearance-based face recognition over the video sequence. To perform local appearance-based face recognition, the input face image is decomposed into non-overlapping blocks and on each local block discrete cosine transform is applied to extract the local features. The extracted local features are then combined to construct the overall feature vector. This process is repeated for each video frame. The distribution of the feature vectors over the video are modelled using a Gaussian distribution function at the training stage. During testing, the feature vector extracted from each frame is compared to each person’s distribution, and individual likelihood scores are generated. Finally, the person is identified as the one who has maximum joint-likelihood score. To assess the performance of the developed system, extensive experiments are conducted on different identification scenarios, such as closed set identification, open set identification and verification. For the experiments a subset of the CIAIR-HCC database, an in-vehicle data corpus that is collected at the Nagoya University, Japan is used. We show that, despite varying environment and illumination conditions, that commonly exist in vehicular environments, it is possible to identify individuals robustly from their face images. Index Terms — Local appearance face recognition, vehicle environment, discrete cosine transform, fusion. 1

CiteSeerX

Sabanci University Research Database

Look, Listen and Learn - A Multimodal LSTM for Speaker Identification

Author: Hu Yongtao
Ren Jimmy
Sun Wenxiu
Tai Yu-Wing
Wang Chuan
Xu Li
Yan Qiong
Publication venue
Publication date: 13/02/2016
Field of study

Speaker identification refers to the task of localizing the face of a person who has the same identity as the ongoing voice in a video. This task not only requires collective perception over both visual and auditory signals, the robustness to handle severe quality degradations and unconstrained content variations are also indispensable. In this paper, we describe a novel multimodal Long Short-Term Memory (LSTM) architecture which seamlessly unifies both visual and auditory modalities from the beginning of each sequence input. The key idea is to extend the conventional LSTM by not only sharing weights across time steps, but also sharing weights across modalities. We show that modeling the temporal dependency across face and voice can significantly improve the robustness to content quality degradations and variations. We also found that our multimodal LSTM is robustness to distractors, namely the non-speaking identities. We applied our multimodal LSTM to The Big Bang Theory dataset and showed that our system outperforms the state-of-the-art systems in speaker identification with lower false alarm rate and higher recognition accuracy.Comment: The 30th AAAI Conference on Artificial Intelligence (AAAI-16

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Deep Multimodal Speaker Naming

Author: Dai Jingwen
Hu Yongtao
Ren Jimmy
Wang Wenping
Xu Li
Yuan Chang
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 17/07/2015
Field of study

Automatic speaker naming is the problem of localizing as well as identifying each speaking character in a TV/movie/live show video. This is a challenging problem mainly attributes to its multimodal nature, namely face cue alone is insufficient to achieve good performance. Previous multimodal approaches to this problem usually process the data of different modalities individually and merge them using handcrafted heuristics. Such approaches work well for simple scenes, but fail to achieve high performance for speakers with large appearance variations. In this paper, we propose a novel convolutional neural networks (CNN) based learning framework to automatically learn the fusion function of both face and audio cues. We show that without using face tracking, facial landmark localization or subtitle/transcript, our system with robust multimodal feature extraction is able to achieve state-of-the-art speaker naming performance evaluated on two diverse TV series. The dataset and implementation of our algorithm are publicly available online

arXiv.org e-Print Archive

Crossref

Emergent Leadership Detection Across Datasets

Author: Bulling Andreas
Müller Philipp
Publication venue
Publication date: 01/01/2019
Field of study

Automatic detection of emergent leaders in small groups from nonverbal behaviour is a growing research topic in social signal processing but existing methods were evaluated on single datasets -- an unrealistic assumption for real-world applications in which systems are required to also work in settings unseen at training time. It therefore remains unclear whether current methods for emergent leadership detection generalise to similar but new settings and to which extent. To overcome this limitation, we are the first to study a cross-dataset evaluation setting for the emergent leadership detection task. We provide evaluations for within- and cross-dataset prediction using two current datasets (PAVIS and MPIIGroupInteraction), as well as an investigation on the robustness of commonly used feature channels (visual focus of attention, body pose, facial action units, speaking activity) and online prediction in the cross-dataset setting. Our evaluations show that using pose and eye contact based features, cross-dataset prediction is possible with an accuracy of 0.68, as such providing another important piece of the puzzle towards emergent leadership detection in the real world.Comment: 5 pages, 3 figure

arXiv.org e-Print Archive

Crossref

MPG.PuRe