In this paper, we present algorithms for unsupervised mining of structures in video using multiscale statistical models. Video structure are repetitive segments in a video stream with consistent statistical characteristics. Such structures can often be interpreted in relation to distinctive semantics, particularly in structured domains like sports. While much work in the literature explores the link between the observations and the semantics using supervised learning, we propose unsupervised structure mining algorithms that aim at alleviating the burden of labelling and training, as well as providing a scalable solution for generalizing video indexing techniques to heterogeneous content collections such as surveillance and consumer videos. Existing unsupervised video structuring works primarily use clustering techniques, while the rich statistical characteristics in the temporal dimension at different granularity remain unexplored. Authomatically identifying structures from an unknown domain poses significant challenges when domain knowledge is not explicitly present to assist algorithm design, model selection, and feature selection. In this work, we model multi-level statistical structures with hierarchical hidden Markov models based on a multi-level Markov dependency assumption. The parameters of the model are efficiently estimated using the EM algorithm, we have also developed a model structure learning algorithm that uses stochastic sampling techniques to find the optimal model structure, and a feature selection algorithm that automatically finds compact relevant feature sets using hybrid wrappeer-filter methods. When tested on sports videos, the unsupervised learning scheme achieves very promising results: (1) The automatically selected feature set for soccer and baseball vides matches the ones that are manually selected with domain knowledge, (2) The system automatically discovers high-level structures that matches the semantic events in the video, (3) The system achieves even slightly better accuracy in detecting semantic events in unlabelled soccer videos than a competing supervised approach designed and trained with domain knowledge
To submit an update or takedown request for this paper, please submit an Update/Correction/Removal Request.