701 research outputs found

    A COMPUTATION METHOD/FRAMEWORK FOR HIGH LEVEL VIDEO CONTENT ANALYSIS AND SEGMENTATION USING AFFECTIVE LEVEL INFORMATION

    Video segmentation facilitates efficient video indexing and navigation in large digital video archives. It is an important process in a content-based video indexing and retrieval (CBVIR) system. Many automated solutions performed segmentation by utilizing information about the "facts" of the video. These "facts" come in the form of labels that describe the objects which are captured by the camera. Solutions of this type achieved good and consistent results for some video genres such as news programs and informational presentations. The content format of such videos is generally quite standard, and automated solutions were designed to follow these format rules. For example, in [1] the presence of news anchor persons was used as a cue to determine the start and end of a meaningful news segment. The same cannot be said for video genres such as movies and feature films. This is because makers of such videos utilize different filming techniques to design their videos in order to elicit certain affective responses from their targeted audience. Humans usually perform manual video segmentation by trying to relate changes in time and locale to discontinuities in meaning [2]. As a result, viewers usually have doubts about the boundary locations of a meaningful video segment due to their different affective responses. This thesis presents an entirely new view of the problem of high level video segmentation. We developed a novel probabilistic method for affective level video content analysis and segmentation. Our method has two stages. In the first stage, affective content labels were assigned to video shots by means of a dynamic Bayesian network (DBN). A novel hierarchical-coupled dynamic Bayesian network (HCDBN) topology was proposed for this stage. The topology was based on the pleasure-arousal-dominance (P-A-D) model of affect representation [3]. In principle, this model can represent a large number of emotions.
In the second stage, the visual, audio and affective information of the video was used to compute a statistical feature vector to represent the content of each shot. Affective level video segmentation was achieved by applying spectral clustering to the feature vectors. We evaluated the first stage of our proposal by comparing its emotion detection ability with that of existing works in the field of affective video content analysis. To evaluate the second stage, we used the time adaptive clustering (TAC) algorithm as our performance benchmark. The TAC algorithm was the best high level video segmentation method [2]. However, it is a very computationally intensive algorithm. To accelerate its computation, we developed a modified TAC (modTAC) algorithm which was designed to be mapped easily onto a field programmable gate array (FPGA) device. Both the TAC and modTAC algorithms were used as performance benchmarks for our proposed method. Since affective video content is a perceptual concept, the segmentation performance and human agreement rates were used as our evaluation criteria. To obtain our ground truth data and viewer agreement rates, a pilot panel study based on the work of Gross et al. [4] was conducted. Experimental results show the feasibility of our proposed method. For the first stage of our proposal, an average improvement of as high as 38% was achieved over previous works. For the second stage, an improvement of as high as 37% was achieved over the TAC algorithm.

    Multimodal Video Analysis and Modeling

    From recalling long forgotten experiences based on a familiar scent or on a piece of music, to lip reading aided conversation in noisy environments or travel sickness caused by mismatch of the signals from vision and the vestibular system, human perception manifests countless examples of subtle and effortless joint adoption of the multiple senses provided to us by evolution. Emulating such multisensory (or multimodal, i.e., comprising multiple types of input modes or modalities) processing computationally offers tools for more effective, efficient, or robust accomplishment of many multimedia tasks using evidence from the multiple input modalities. Information from the modalities can also be analyzed for patterns and connections across them, opening up interesting applications not feasible with a single modality, such as prediction of some aspects of one modality based on another. In this dissertation, multimodal analysis techniques are applied to selected video tasks with accompanying modalities. More specifically, all the tasks involve some type of analysis of videos recorded by non-professional videographers using mobile devices. Fusion of information from multiple modalities is applied to recording environment classification from video and audio as well as to sport type classification from a set of multi-device videos, corresponding audio, and recording device motion sensor data. The environment classification combines support vector machine (SVM) classifiers trained on various global visual low-level features with audio event histogram based environment classification using k nearest neighbors (k-NN). Rule-based fusion schemes with genetic algorithm (GA)-optimized modality weights are compared to training an SVM classifier to perform the multimodal fusion. A comprehensive selection of fusion strategies is compared for the task of classifying the sport type of a set of recordings from a common event. 
These include fusion prior to, simultaneously with, and after classification; various approaches for using modality quality estimates; and fusing soft confidence scores as well as crisp single-class predictions. Additionally, different strategies are examined for aggregating the decisions of single videos to a collective prediction from the set of videos recorded concurrently with multiple devices. In both tasks multimodal analysis shows a clear advantage over separate classification of the modalities. Another part of the work investigates cross-modal pattern analysis and audio-based video editing. This study examines the feasibility of automatically timing shot cuts of multi-camera concert recordings according to music-related cutting patterns learnt from professional concert videos. Cut timing is a crucial part of automated creation of multicamera mashups, where shots from multiple recording devices from a common event are alternated with the aim of mimicking a professionally produced video. In the framework, separate statistical models are formed for typical patterns of beat-quantized cuts in short segments, differences in beats between consecutive cuts, and relative deviation of cuts from exact beat times. Based on music meter and audio change point analysis of a new recording, the models can be used for synthesizing cut times. In a user study the proposed framework clearly outperforms a baseline automatic method with comparably advanced audio analysis and wins 48.2 % of comparisons against hand-edited videos.
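
As a rough illustration of the rule-based fusion of soft confidence scores mentioned in this abstract, the sketch below combines per-class scores from several modalities by a weighted sum and then aggregates the fused scores of several videos from one event. It is not the dissertation's implementation: the function names, the class labels, and the weight values are hypothetical, and the weights are picked by hand here rather than optimized by a genetic algorithm.

```python
from typing import Dict, List


def fuse_scores(modality_scores: Dict[str, Dict[str, float]],
                weights: Dict[str, float]) -> Dict[str, float]:
    """Weighted-sum (late) fusion of per-class confidence scores.

    modality_scores maps a modality name to its {class label: score} dict.
    weights maps a modality name to a non-negative weight (hypothetical values).
    """
    total = sum(weights[m] for m in modality_scores)
    if total <= 0:
        raise ValueError("modality weights must sum to a positive value")
    fused: Dict[str, float] = {}
    for modality, scores in modality_scores.items():
        w = weights[modality] / total  # normalize so the weights sum to 1
        for label, score in scores.items():
            fused[label] = fused.get(label, 0.0) + w * score
    return fused


def predict(fused: Dict[str, float]) -> str:
    """Return the class label with the highest fused score."""
    return max(fused, key=fused.get)


def aggregate_videos(per_video: List[Dict[str, float]]) -> str:
    """Sum fused scores over videos of the same event and pick the best class."""
    totals: Dict[str, float] = {}
    for fused in per_video:
        for label, score in fused.items():
            totals[label] = totals.get(label, 0.0) + score
    return max(totals, key=totals.get)


# Hypothetical scores for one video: the visual classifier favours "soccer",
# the audio classifier favours "tennis".
scores = {
    "video": {"soccer": 0.6, "tennis": 0.4},
    "audio": {"soccer": 0.3, "tennis": 0.7},
}
fused = fuse_scores(scores, {"video": 0.7, "audio": 0.3})
label = predict(fused)
```

Summing soft scores before taking the maximum is only one of the aggregation strategies the abstract alludes to; a majority vote over crisp per-video predictions would be another.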

    Balancing automation and user control in a home video editing system

    The context of this PhD project is the area of multimedia content management, in particular interaction with home videos. Nowadays, more and more home videos are produced, shared and edited. Home videos are captured by amateur users, mainly to document their lives. People frequently edit home videos, to select and keep the best parts of their visual memories and to add to them a touch of personal creativity. However, most users find the current products for video editing time-consuming and sometimes too technical and difficult. One reason for the large amount of time required for editing is the slow accessibility caused by the temporal dimension of videos: a video needs to be played back in order to be watched or edited. Another limitation of current video editing tools is that they are modelled too closely on professional video editing systems, including technical details like frame-by-frame browsing. This thesis aims at making home video editing more efficient and easier for the non-technical, amateur user. To accomplish this goal, an approach was taken characterized by two main guidelines. We designed a semi-automatic tool, and we adopted a user-centered approach. To gain insights on user behaviours and needs related to home video editing, we designed an Internet-based survey, which was answered by 180 home video users. The results of the survey revealed that video editing is done frequently and is seen as a very time-consuming activity. We also found that users with little experience with PCs often consider video editing programs too complex. Although nearly all commercial editing tools are designed for a PC, many of our respondents said they were interested in doing video editing on a TV. We created a novel concept, Edit While Watching, designed to be user-friendly. It requires only a TV set and a remote control, instead of a PC. The video that the user inputs to the system is automatically analyzed and structured in small video segments. 
The editing operations happen on the basis of these video segments: the user is no longer aware of the single video frames. After the input video has been analyzed and structured, a first edited version is automatically prepared. Subsequently, Edit While Watching allows the user to modify and enrich the automatically edited video while watching it. When the user is satisfied, the video can be saved to a DVD or to another storage medium. We performed two iterations of system implementation and use testing to refine our concept. After the first iteration, we discovered that two requirements were insufficiently addressed: to have an overview of the video and to precisely control which video content to keep or to discard. The second version of Edit While Watching was designed to address these points. It allows the user to visualize the video at three levels of detail: the different chapters (or scenes) of the video, the shots inside one chapter, and the timeline representation of a single shot. Also, the second version allows the users to edit the video at different levels of automation. For example, the user can choose an event in the video (e.g. a child playing with a toy) and just ask the system to automatically include more content related to it. Alternatively, if the user wants more control, he or she can precisely select which content to add to the video. We evaluated the second version of our tool by inviting nine users to edit their own home videos with it. The users judged Edit While Watching to be an easy-to-use and fast application. However, some of them missed the possibility of enriching the video with transitions, music, text and pictures. Our test showed that the requirements of overview of the video and control in the selection of the edited material are better addressed than in the first version. Moreover, the participants were able to select which video portions to keep or to discard in a time close to the playback time of the video. 
The second version of Edit While Watching exploits different levels of automation. In some editing functions the user only gives an indication about editing a clip, and the system automatically decides the start and end points of the part of the video to be cut. However, there are also editing functions in which the user has complete control over the start and end points of a cut. We wanted to investigate how to balance automation and user control to optimize the perceived ease of use, the perceived control, the objective editing efficiency and the mental effort. To this end, we implemented three types of editing functions, each type representing a different balance between automation and user control. To compare these three levels, we invited 25 users to perform pre-defined tasks with the three function types. The results showed that the type of functions with the highest level of automation performed worse than the two other types, according to both subjective and objective measurements. The other two types of functions were equally liked. However, some users clearly preferred the functions that allowed faster editing while others preferred the functions that gave full control and a more complete overview. In conclusion, on the basis of this research some design guidelines can be offered for building an easy and efficient video editing application. Such an application should automatically structure the video, eliminate the detail about single frames, support a scalable video overview, implement a rich set of editing functionalities, and should preferably be TV-based.

    A Probabilistic Multimedia Retrieval Model and its Evaluation

    We present a probabilistic model for the retrieval of multimodal documents. The model is based on Bayesian decision theory and combines models for text-based search with models for visual search. The textual model is based on the language modelling approach to text retrieval, and the visual information is modelled as a mixture of Gaussian densities. Both models have proved successful on various standard retrieval tasks. We evaluate the multimodal model on the search task of TREC's video track. We found that the disclosure of video material based on visual information only is still too difficult. Even with purely visual information needs, text-based retrieval still outperforms visual approaches. The probabilistic model is useful for text, visual, and multimedia retrieval. Unfortunately, simplifying assumptions that reduce its computational complexity degrade retrieval effectiveness. Regarding the question whether the model can effectively combine information from different modalities, we conclude that whenever both modalities yield reasonable scores, a combined run outperforms the individual runs.
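
To make the combination of a language-model text score with a Gaussian-mixture visual score more concrete, here is a minimal sketch. It rests on several assumptions that are not stated in the abstract: linear (Jelinek-Mercer style) smoothing with a hypothetical parameter `lam`, one-dimensional visual features, scoring query samples under a per-document mixture, and independence between the textual and visual parts so that their log-likelihoods can simply be added. All function names are illustrative.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def text_log_likelihood(query: Sequence[str], doc_terms: Sequence[str],
                        collection_terms: Sequence[str], lam: float = 0.5) -> float:
    """Query log-likelihood under a smoothed unigram document language model."""
    doc, coll = Counter(doc_terms), Counter(collection_terms)
    doc_len, coll_len = len(doc_terms), len(collection_terms)
    log_likelihood = 0.0
    for term in query:
        p_doc = doc[term] / doc_len if doc_len else 0.0
        p_coll = coll[term] / coll_len if coll_len else 0.0
        # Interpolate document and collection estimates (assumed smoothing).
        p = lam * p_doc + (1.0 - lam) * p_coll
        if p <= 0.0:
            return float("-inf")  # term unseen even in the collection
        log_likelihood += math.log(p)
    return log_likelihood


def gmm_log_likelihood(samples: Sequence[float],
                       components: List[Tuple[float, float, float]]) -> float:
    """Log-likelihood of 1-D samples under a mixture of Gaussians.

    components is a list of (weight, mean, variance) triples.
    """
    log_likelihood = 0.0
    for x in samples:
        p = sum(w * math.exp(-((x - mu) ** 2) / (2.0 * var))
                / math.sqrt(2.0 * math.pi * var)
                for w, mu, var in components)
        log_likelihood += math.log(p)
    return log_likelihood


def combined_score(text_ll: float, visual_ll: float) -> float:
    """Add the two log-likelihoods, assuming the modalities are independent."""
    return text_ll + visual_ll


# Hypothetical toy collection: document A mentions "fire", document B does not.
collection = ["fire", "smoke", "car", "road", "car", "tree"]
text_a = text_log_likelihood(["fire"], ["fire", "smoke"], collection)
text_b = text_log_likelihood(["fire"], ["car", "road"], collection)
```

Documents would then be ranked by `combined_score`, which lets a strong text match compensate for a weak visual match and vice versa.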

    Audio-coupled video content understanding of unconstrained video sequences

    Unconstrained video understanding is a difficult task. The main aim of this thesis is to recognise the nature of objects, activities and environment in a given video clip using both audio and video information. Traditionally, audio and video information has not been applied together for solving such complex tasks, and for the first time we propose, develop, implement and test a new framework of multi-modal (audio and video) data analysis for context understanding and labelling of unconstrained videos. The framework relies on feature selection techniques and introduces a novel algorithm (PCFS) that is faster than the well-established SFFS algorithm. We use the framework for studying the benefits of combining audio and video information in a number of different problems. We begin by developing two independent content recognition modules. The first one is based on image sequence analysis alone, and uses a range of colour, shape, texture and statistical features from image regions with a trained classifier to recognise the identity of objects, activities and environment present. The second module uses audio information only, and recognises activities and environment. Both of these approaches are preceded by detailed pre-processing to ensure that correct video segments containing both audio and video content are present, and that the developed system can be made robust to changes in camera movement, illumination, random object behaviour etc. For both audio and video analysis, we use a hierarchical approach of multi-stage classification such that difficult classification tasks can be decomposed into simpler and smaller tasks. When combining both modalities, we compare fusion techniques at different levels of integration and propose a novel algorithm that combines advantages of both feature and decision-level fusion. The analysis is evaluated on a large amount of test data comprising unconstrained videos collected for this work. 
Finally, we propose a decision correction algorithm which shows that further steps towards combining multi-modal classification information effectively with semantic knowledge generate the best possible results.

    Highly efficient low-level feature extraction for video representation and retrieval.

    Witnessing the omnipresence of digital video media, the research community has raised the question of its meaningful use and management. Stored in immense multimedia databases, digital videos need to be retrieved and structured in an intelligent way, relying on the content and the rich semantics involved. Current Content Based Video Indexing and Retrieval systems face the problem of the semantic gap between the simplicity of the available visual features and the richness of user semantics. This work focuses on the issues of efficiency and scalability in video indexing and retrieval to facilitate a video representation model capable of semantic annotation. A highly efficient algorithm for temporal analysis and key-frame extraction is developed. It is based on the prediction information extracted directly from the compressed domain features and the robust scalable analysis in the temporal domain. Furthermore, a hierarchical quantisation of the colour features in the descriptor space is presented. Derived from the extracted set of low-level features, a video representation model that enables semantic annotation and contextual genre classification is designed. Results demonstrate the efficiency and robustness of the temporal analysis algorithm, which runs in real time while maintaining the high precision and recall of the detection task. Adaptive key-frame extraction and summarisation achieve a good overview of the visual content, while the colour quantisation algorithm efficiently creates a hierarchical set of descriptors. Finally, the video representation model, supported by the genre classification algorithm, achieves excellent results in an automatic annotation system by linking the video clips with a limited lexicon of related keywords.

    The Murray Ledger and Times, December 6-7, 2014
