Multimodal Video Analysis and Modeling

Abstract

From recalling long-forgotten experiences based on a familiar scent or a piece of music, to lip-reading-aided conversation in noisy environments or travel sickness caused by a mismatch between the signals from vision and the vestibular system, human perception manifests countless examples of subtle and effortless joint use of the multiple senses provided to us by evolution. Emulating such multisensory (or multimodal, i.e., comprising multiple types of input modes or modalities) processing computationally offers tools for more effective, efficient, or robust accomplishment of many multimedia tasks using evidence from the multiple input modalities. Information from the modalities can also be analyzed for patterns and connections across them, opening up interesting applications not feasible with a single modality, such as predicting some aspects of one modality based on another.

In this dissertation, multimodal analysis techniques are applied to selected video tasks with accompanying modalities. More specifically, all the tasks involve some type of analysis of videos recorded by non-professional videographers using mobile devices.

Fusion of information from multiple modalities is applied to recording environment classification from video and audio, as well as to sport type classification from a set of multi-device videos, the corresponding audio, and recording device motion sensor data. The environment classification combines support vector machine (SVM) classifiers trained on various global low-level visual features with audio event histogram-based environment classification using k-nearest neighbors (k-NN). Rule-based fusion schemes with genetic algorithm (GA)-optimized modality weights are compared to training an SVM classifier to perform the multimodal fusion. A comprehensive selection of fusion strategies is compared for the task of classifying the sport type of a set of recordings from a common event. These include fusion prior to, simultaneously with, and after classification; various approaches to using modality quality estimates; and fusing soft confidence scores as well as crisp single-class predictions. Additionally, different strategies are examined for aggregating the decisions of single videos into a collective prediction for the set of videos recorded concurrently with multiple devices. In both tasks, multimodal analysis shows a clear advantage over separate classification of the modalities.

Another part of the work investigates cross-modal pattern analysis and audio-based video editing. This study examines the feasibility of automatically timing the shot cuts of multi-camera concert recordings according to music-related cutting patterns learnt from professional concert videos. Cut timing is a crucial part of the automated creation of multi-camera mashups, where shots from multiple recording devices at a common event are alternated with the aim of mimicking a professionally produced video. In the framework, separate statistical models are formed for typical patterns of beat-quantized cuts in short segments, differences in beats between consecutive cuts, and the relative deviation of cuts from exact beat times. Based on music meter and audio change point analysis of a new recording, the models can be used to synthesize cut times. In a user study, the proposed framework clearly outperforms a baseline automatic method with comparably advanced audio analysis and wins 48.2% of comparisons against hand-edited videos.
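To make the fusion step concrete, the following is a minimal sketch of rule-based late fusion in the spirit of the environment classification task: an SVM on visual features and a k-NN on audio event histograms each produce class confidences, which are combined with per-modality weights before the final decision. The feature dimensions, class count, synthetic data, scikit-learn classifiers, and the fixed weights are illustrative assumptions only; in the dissertation the modality weights are optimized with a genetic algorithm and the features are extracted from the actual recordings.

    """Minimal sketch of rule-based (weighted-sum) late fusion of two modalities."""
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    n_train, n_classes = 200, 4

    # Synthetic stand-ins for the real features: global visual descriptors and
    # audio event histograms, with a shared environment-class label per clip.
    y = rng.integers(0, n_classes, n_train)
    X_visual = rng.normal(size=(n_train, 64)) + y[:, None]
    X_audio = rng.normal(size=(n_train, 20)) + 0.5 * y[:, None]

    # One classifier per modality: SVM on visual features, k-NN on audio histograms.
    visual_clf = SVC(probability=True).fit(X_visual, y)
    audio_clf = KNeighborsClassifier(n_neighbors=5).fit(X_audio, y)

    def fuse_predict(x_vis, x_aud, w_vis=0.6, w_aud=0.4):
        """Weighted sum of per-modality class confidences, then argmax.

        The weights here are fixed by hand; a GA would tune them on validation data.
        """
        p = w_vis * visual_clf.predict_proba(x_vis) + w_aud * audio_clf.predict_proba(x_aud)
        return p.argmax(axis=1)

    # Fuse the two modalities for a handful of synthetic test clips.
    X_vis_test = rng.normal(size=(5, 64)) + rng.integers(0, n_classes, 5)[:, None]
    X_aud_test = rng.normal(size=(5, 20))
    print(fuse_predict(X_vis_test, X_aud_test))

The same weighted combination of confidence scores generalizes to the sport type task by adding a third term for the motion sensor modality, whereas an SVM-based fusion variant would instead be trained on the concatenated per-modality confidence vectors.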