
    Taking the bite out of automated naming of characters in TV video

    We investigate the problem of automatically labelling appearances of characters in TV or film material with their names. This is tremendously challenging due to the huge variation in imaged appearance of each character and the weakness and ambiguity of available annotation. However, we demonstrate that high precision can be achieved by combining multiple sources of information, both visual and textual. The principal novelties that we introduce are: (i) automatic generation of time-stamped character annotation by aligning subtitles and transcripts; (ii) strengthening the supervisory information by identifying when characters are speaking. In addition, we incorporate complementary cues of face matching and clothing matching to propose common annotations for face tracks, and consider choices of classifier which can potentially correct errors made in the automatic extraction of training data from the weak textual annotation. Results are presented on episodes of the TV series "Buffy the Vampire Slayer".
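
    As a rough illustration of novelty (i), the sketch below transfers speaker names from an untimed transcript onto timed subtitles via text similarity. The data layout, the greedy matching, and the 0.8 threshold are illustrative assumptions, not the paper's exact alignment procedure.

```python
# Minimal sketch: align timed-but-unattributed subtitles with an
# attributed-but-untimed transcript to get time-stamped speaker labels.
from difflib import SequenceMatcher

subtitles = [   # timed, but no speaker labels
    {"start": 12.3, "end": 14.1, "text": "We have to stop him tonight."},
    {"start": 14.5, "end": 16.0, "text": "Not without a plan, we don't."},
]
transcript = [  # speaker labels, but no timing
    {"speaker": "BUFFY", "text": "We have to stop him tonight."},
    {"speaker": "GILES", "text": "Not without a plan we don't."},
]

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def align(subtitles, transcript, threshold=0.8):
    """Attach a speaker name and time stamps to each confident match."""
    annotations = []
    for sub in subtitles:
        best = max(transcript, key=lambda t: similarity(sub["text"], t["text"]))
        if similarity(sub["text"], best["text"]) >= threshold:
            annotations.append({"start": sub["start"], "end": sub["end"],
                                "speaker": best["speaker"]})
    return annotations

print(align(subtitles, transcript))
# e.g. [{'start': 12.3, 'end': 14.1, 'speaker': 'BUFFY'}, ...]
```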

    Video-based driver identification using local appearance face recognition

    In this paper, we present a person identification system for vehicular environments. The proposed system uses face images of the driver and performs local appearance-based face recognition over the video sequence. To perform local appearance-based face recognition, the input face image is decomposed into non-overlapping blocks, and the discrete cosine transform is applied to each block to extract local features. The extracted local features are then combined to construct the overall feature vector. This process is repeated for each video frame. At the training stage, the distribution of the feature vectors over the video is modelled with a Gaussian distribution. During testing, the feature vector extracted from each frame is compared to each person's distribution, and individual likelihood scores are generated. Finally, the person is identified as the one with the maximum joint-likelihood score. To assess the performance of the developed system, extensive experiments are conducted on different identification scenarios: closed-set identification, open-set identification, and verification. The experiments use a subset of the CIAIR-HCC database, an in-vehicle data corpus collected at Nagoya University, Japan. We show that, despite the varying environmental and illumination conditions that commonly exist in vehicular environments, it is possible to identify individuals robustly from their face images. Index Terms — Local appearance face recognition, vehicle environment, discrete cosine transform, fusion.
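
    A compact sketch of the pipeline described above: block-wise DCT features per frame, a per-person Gaussian fitted at training time, and joint (summed log-) likelihood scoring at test time. The 8x8 block size, the five low-order coefficients per block, and the diagonal-covariance Gaussian are our assumptions for illustration, not the paper's exact parameters.

```python
import numpy as np
from scipy.fft import dctn

def frame_features(face, block=8, n_coeffs=5):
    """Concatenate low-order DCT coefficients of non-overlapping blocks."""
    h, w = face.shape  # grayscale face crop
    feats = []
    for r in range(0, h - h % block, block):
        for c in range(0, w - w % block, block):
            coeffs = dctn(face[r:r + block, c:c + block], norm="ortho")
            feats.append(coeffs.flatten()[:n_coeffs])  # low-frequency terms
    return np.concatenate(feats)

def fit_gaussian(frames):
    """Model one person's feature distribution over a training video."""
    X = np.stack([frame_features(f) for f in frames])
    return X.mean(axis=0), X.var(axis=0) + 1e-6  # diagonal covariance

def joint_log_likelihood(frames, mean, var):
    """Sum per-frame Gaussian log-likelihoods over the test video."""
    X = np.stack([frame_features(f) for f in frames])
    ll = -0.5 * (np.log(2 * np.pi * var) + (X - mean) ** 2 / var)
    return ll.sum()

# Identification: pick the enrolled person with the highest joint score.
# models = {name: fit_gaussian(train_frames[name]) for name in people}
# identity = max(models, key=lambda n: joint_log_likelihood(test, *models[n]))
```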

    Look, Listen and Learn - A Multimodal LSTM for Speaker Identification

    Speaker identification refers to the task of localizing the face of the person in a video whose identity matches the ongoing voice. This task not only requires collective perception over both visual and auditory signals; robustness to severe quality degradations and unconstrained content variations is also indispensable. In this paper, we describe a novel multimodal Long Short-Term Memory (LSTM) architecture that seamlessly unifies the visual and auditory modalities from the beginning of each sequence input. The key idea is to extend the conventional LSTM by sharing weights not only across time steps but also across modalities. We show that modeling the temporal dependency across face and voice can significantly improve robustness to content quality degradations and variations. We also found that our multimodal LSTM is robust to distractors, namely the non-speaking identities. We applied our multimodal LSTM to The Big Bang Theory dataset and showed that our system outperforms state-of-the-art speaker identification systems with a lower false alarm rate and higher recognition accuracy. Comment: The 30th AAAI Conference on Artificial Intelligence (AAAI-16).
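
    The core weight-sharing idea can be sketched in a few lines of PyTorch: a single LSTM is reused for both the face and voice feature sequences, so its weights are shared across modalities as well as across time steps. The projection layers, the dimensions, and the matching head below are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SharedLSTMSpeakerID(nn.Module):
    def __init__(self, face_dim=512, voice_dim=128, hidden=256):
        super().__init__()
        # Per-modality projections into a common feature space.
        self.face_proj = nn.Linear(face_dim, hidden)
        self.voice_proj = nn.Linear(voice_dim, hidden)
        # One LSTM reused for both modalities: cross-modal weight sharing.
        self.shared_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.score = nn.Linear(2 * hidden, 1)  # face/voice match score

    def forward(self, face_seq, voice_seq):
        # face_seq: (B, T, face_dim); voice_seq: (B, T, voice_dim)
        _, (hf, _) = self.shared_lstm(self.face_proj(face_seq))
        _, (hv, _) = self.shared_lstm(self.voice_proj(voice_seq))
        joint = torch.cat([hf[-1], hv[-1]], dim=-1)
        return torch.sigmoid(self.score(joint))  # P(face matches voice)

model = SharedLSTMSpeakerID()
p = model(torch.randn(2, 10, 512), torch.randn(2, 10, 128))  # (2, 1)
```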

    Deep Multimodal Speaker Naming

    Automatic speaker naming is the problem of localizing as well as identifying each speaking character in a TV, movie, or live-show video. This is a challenging problem, mainly owing to its multimodal nature: the face cue alone is insufficient to achieve good performance. Previous multimodal approaches to this problem usually process the data of the different modalities individually and merge them using handcrafted heuristics. Such approaches work well for simple scenes but fail to achieve high performance for speakers with large appearance variations. In this paper, we propose a novel convolutional neural network (CNN) based learning framework that automatically learns the fusion function of the face and audio cues. We show that without using face tracking, facial landmark localization, or subtitles/transcripts, our system with robust multimodal feature extraction is able to achieve state-of-the-art speaker naming performance, evaluated on two diverse TV series. The dataset and implementation of our algorithm are publicly available online.
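
    As a hedged sketch of the learned-fusion idea (not the paper's published architecture), the PyTorch model below extracts a face embedding with a small CNN and an audio embedding from MFCC features, then lets a joint classifier learn the fusion end-to-end instead of relying on handcrafted heuristics. All layer sizes, the input resolution, and the MFCC front end are our assumptions.

```python
import torch
import torch.nn as nn

class FaceAudioNamer(nn.Module):
    def __init__(self, n_speakers, mfcc_dim=39):
        super().__init__()
        self.face_cnn = nn.Sequential(        # face branch: 1x64x64 crops
            nn.Conv2d(1, 16, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(32 * 16 * 16, 128), nn.ReLU(),
        )
        self.audio_mlp = nn.Sequential(       # audio branch: MFCC vector
            nn.Linear(mfcc_dim, 128), nn.ReLU(),
        )
        # Fusion is learned end-to-end from the concatenated embeddings.
        self.classifier = nn.Linear(128 + 128, n_speakers)

    def forward(self, face, mfcc):
        return self.classifier(
            torch.cat([self.face_cnn(face), self.audio_mlp(mfcc)], dim=-1)
        )

model = FaceAudioNamer(n_speakers=6)
logits = model(torch.randn(2, 1, 64, 64), torch.randn(2, 39))  # (2, 6)
```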

    Emergent Leadership Detection Across Datasets

    Automatic detection of emergent leaders in small groups from nonverbal behaviour is a growing research topic in social signal processing, but existing methods have been evaluated only on single datasets -- an unrealistic assumption for real-world applications, in which systems must also work in settings unseen at training time. It therefore remains unclear whether, and to what extent, current methods for emergent leadership detection generalise to similar but new settings. To overcome this limitation, we are the first to study a cross-dataset evaluation setting for the emergent leadership detection task. We provide evaluations for within- and cross-dataset prediction using two current datasets (PAVIS and MPIIGroupInteraction), as well as an investigation of the robustness of commonly used feature channels (visual focus of attention, body pose, facial action units, speaking activity) and of online prediction in the cross-dataset setting. Our evaluations show that, using pose and eye-contact based features, cross-dataset prediction is possible with an accuracy of 0.68, providing another important piece of the puzzle towards emergent leadership detection in the real world. Comment: 5 pages, 3 figures.
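
    The cross-dataset protocol itself is simple to sketch: fit a classifier on one dataset's per-participant features and score it on the other with no adaptation. The random stand-in features, the binary leader labels, and the SVM below are illustrative assumptions only; the paper's actual feature channels and models differ.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-ins for per-participant pose / eye-contact feature matrices.
X_pavis, y_pavis = rng.normal(size=(120, 32)), rng.integers(0, 2, 120)
X_mpii, y_mpii = rng.normal(size=(80, 32)), rng.integers(0, 2, 80)

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
clf.fit(X_pavis, y_pavis)        # train on PAVIS...
pred = clf.predict(X_mpii)       # ...evaluate on MPIIGroupInteraction
print("cross-dataset accuracy:", accuracy_score(y_mpii, pred))
```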