Important Person Detection from Multiple Videos

Abstract

Given a crowd-sourced set of videos of a crowded public event, this thesis addresses the problem of detecting and grouping the appearances of every person in the scenes. The persons are ranked according to the frequency of their appearances, and the rank of a person is taken as a measure of his/her importance. Grouping the appearances of every individual in such videos is a very challenging task, owing to the unavailability of prior information or training data, large changes in illumination, wide variations in camera viewpoint, severe occlusions, and the fact that the videos come from different photographers. These problems are made tractable by exploiting a variety of visual and contextual cues – appearance, sensor data, and the co-occurrence of people. This thesis provides a unified framework that integrates these cues to establish an efficient person-matching process across videos of the same event. The presence of a person is detected by a multi-view face detector, followed by efficient person tracking that follows the detected persons through the remaining video frames. Tracking is made more robust by combining two independent trackers, one for the face and one for the clothes, where the clothes region is detected as a bounding box below the face. Person matching is performed using facial appearance (biometric) and the colors of clothes (non-biometric). Unlike traditional matching algorithms that use only low-level facial features for face identification, high-level attribute classifiers (e.g., gender, ethnicity, and hair color) are also utilized to enhance identification performance. Hierarchical Agglomerative Clustering (HAC) is used to group the individuals within a video and also across videos. The performance of HAC is improved by using contextual constraints, such as the fact that a person cannot appear twice in the same frame. These constraints are enforced directly by altering the HAC algorithm.
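The constrained HAC step described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: it assumes average linkage, a generic pairwise distance function, and cannot-link constraints given as pairs of track indices (e.g., two faces detected in the same frame); all names are hypothetical.

```python
def constrained_hac(items, dist, cannot_link, threshold):
    """Bottom-up agglomerative clustering with cannot-link constraints.

    items: list of feature vectors, one per detected person track.
    dist: pairwise distance function on items.
    cannot_link: set of frozensets {i, j} of track indices that may never
        share a cluster (e.g., two persons seen in the same frame).
    Repeatedly merges the closest pair of clusters (average linkage)
    whose union violates no constraint, until no admissible pair is
    closer than `threshold`.
    """
    clusters = [{i} for i in range(len(items))]

    def linkage(a, b):  # average-linkage distance between two clusters
        return sum(dist(items[i], items[j]) for i in a for j in b) / (len(a) * len(b))

    def violates(a, b):  # is there any cannot-link edge between the clusters?
        return any(frozenset((i, j)) in cannot_link for i in a for j in b)

    while True:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                if violates(clusters[x], clusters[y]):
                    continue
                d = linkage(clusters[x], clusters[y])
                if d < threshold and (best is None or d < best[0]):
                    best = (d, x, y)
        if best is None:
            return clusters
        _, x, y = best
        clusters[x] |= clusters[y]
        del clusters[y]

# Toy example: tracks 0 and 1 are visually close but co-occur in a frame,
# so the cannot-link constraint keeps them in separate clusters.
result = constrained_hac(
    [0.0, 0.1, 5.0, 5.1],
    lambda a, b: abs(a - b),
    {frozenset((0, 1))},
    1.0,
)
print(result)  # [{0}, {1}, {2, 3}]
```

Enforcing the constraint inside the merge loop, rather than post-processing the clusters, mirrors the abstract's point that the constraints are imposed by altering the HAC algorithm itself.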
Finally, the detected individuals are ranked according to the number of videos in which they appear, and the N top-ranked individuals are taken as the important persons. The performance of the proposed algorithm is validated on two novel, challenging datasets. The contribution of this thesis is twofold. First, a unified framework is proposed that requires no prior information or training data about the individuals; the framework is completely automatic and requires no human interaction. Second, we demonstrate how multiple visual modalities and contextual cues can be exploited to enhance the performance of person matching under real-life conditions. Experimental results show the effectiveness of the framework and confirm that the proposed system provides results competitive with state-of-the-art algorithms.
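The final ranking step can be expressed compactly. The sketch below is illustrative only: it assumes the grouping stage has already produced, for each video, the set of person identities appearing in it, and the function name is hypothetical.

```python
from collections import Counter

def rank_important_persons(appearances, n):
    """Rank person identities by the number of distinct videos they appear in.

    appearances: dict mapping video_id -> set of person identities
        detected in that video (output of the cross-video grouping stage).
    Returns the n top-ranked identities, i.e., the important persons.
    """
    counts = Counter()
    for persons in appearances.values():
        counts.update(set(persons))  # count each person at most once per video
    return [pid for pid, _ in counts.most_common(n)]

# Toy example with three videos and four grouped identities.
appearances = {
    "vid1": {"A", "B", "C"},
    "vid2": {"A", "B"},
    "vid3": {"A", "D"},
}
print(rank_important_persons(appearances, 2))  # ['A', 'B']
```

Counting each identity at most once per video makes the score match the abstract's criterion (number of videos a person appears in) rather than raw detection counts.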
