
    Video indexing and summarization using motion activity

    Get PDF
    In this dissertation, video-indexing techniques using low-level motion activity characteristics and their application to video summarization are presented. The MPEG-7 motion activity feature is defined as the subjective level of activity or motion in a video segment. First, a novel psychophysical and analytical framework for automatic measurement of motion activity in compliance with its subjective perception is developed. A psychophysically sound subjective ground truth for motion activity and a test set of video clips are constructed for this purpose. A number of low-level, compressed-domain, motion-vector-based descriptors, both known and novel, are then described. It is shown that these descriptors successfully estimate the subjective level of motion activity of video clips. Furthermore, the individual strengths and limitations of the proposed descriptors are determined using a novel pairwise comparison framework. It is verified that the intensity of motion activity descriptor of the MPEG-7 standard is one of the best performers, while a novel descriptor proposed in this dissertation performs comparably or better. A new descriptor for the spatial distribution of motion activity in a scene is proposed. This descriptor is supplementary to the intensity of motion activity descriptor, and is shown to have query retrieval performance comparable to the current spatial distribution of motion activity descriptor of the MPEG-7 standard. The insights obtained from the motion activity investigation are applied to video summarization. A novel approach to summarizing and skimming through video using motion activity is presented. The approach is based on allocating playback time to video segments in proportion to the motion activity of the segments. Low-activity segments are played faster than high-activity segments in such a way that a constant level of activity is maintained throughout the video. Since motion activity is a low-complexity descriptor, the proposed summarization techniques are extremely fast. The summarization techniques are successfully applied to surveillance video. The proposed techniques can also be used as a preprocessing stage for more complex summarization and content analysis techniques, thus providing significant cost gains.
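    A minimal sketch (in Python) of the constant-activity playback idea described above: playback time is granted to each segment in proportion to its accumulated motion activity, and the corresponding speed-up factor is derived from that allocation. The Segment fields, the duration-times-activity weighting and the example numbers are illustrative assumptions, not the dissertation's exact formulation.

        from dataclasses import dataclass

        @dataclass
        class Segment:
            duration: float   # original segment length in seconds
            activity: float   # motion activity score (e.g. MPEG-7 intensity), > 0

        def allocate_playback(segments, target_duration):
            """Grant each segment playback time proportional to its accumulated
            motion activity, so the perceived activity level stays roughly
            constant across the summary (low-activity segments play faster)."""
            weights = [s.duration * s.activity for s in segments]
            total = sum(weights)
            plan = []
            for s, w in zip(segments, weights):
                playback = target_duration * w / total   # seconds granted to this segment
                speedup = s.duration / playback          # > 1 means play faster than real time
                plan.append((playback, speedup))
            return plan

        # Example: compress three segments into a 60-second summary.
        segments = [Segment(120, 0.2), Segment(30, 0.9), Segment(60, 0.5)]
        for playback, speedup in allocate_playback(segments, 60.0):
            print(f"play for {playback:.1f}s at {speedup:.2f}x speed")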

    A COMPUTATION METHOD/FRAMEWORK FOR HIGH LEVEL VIDEO CONTENT ANALYSIS AND SEGMENTATION USING AFFECTIVE LEVEL INFORMATION

    No full text
    Video segmentation facilitates efficient video indexing and navigation in large digital video archives. It is an important process in a content-based video indexing and retrieval (CBVIR) system. Many automated solutions perform segmentation by utilizing information about the "facts" of the video. These "facts" come in the form of labels that describe the objects captured by the camera. This type of solution has achieved good and consistent results for some video genres such as news programs and informational presentations. The content format of such videos is generally quite standard, and automated solutions were designed to follow these format rules. For example, in [1] the presence of news anchor persons was used as a cue to determine the start and end of a meaningful news segment. The same cannot be said for video genres such as movies and feature films, because the makers of such videos use different filming techniques to design their videos in order to elicit certain affective responses from their targeted audience. Humans usually perform manual video segmentation by trying to relate changes in time and locale to discontinuities in meaning [2]. As a result, viewers often disagree about the boundary locations of a meaningful video segment because of their differing affective responses. This thesis presents an entirely new view of the problem of high-level video segmentation. We developed a novel probabilistic method for affective-level video content analysis and segmentation. Our method has two stages. In the first stage, affective content labels are assigned to video shots by means of a dynamic Bayesian network (DBN). A novel hierarchical-coupled dynamic Bayesian network (HCDBN) topology is proposed for this stage. The topology is based on the pleasure-arousal-dominance (P-A-D) model of affect representation [3]. In principle, this model can represent a large number of emotions. In the second stage, the visual, audio and affective information of the video is used to compute a statistical feature vector representing the content of each shot. Affective-level video segmentation is achieved by applying spectral clustering to the feature vectors. We evaluated the first stage of our proposal by comparing its emotion detection ability with existing works in the field of affective video content analysis. To evaluate the second stage, we used the time adaptive clustering (TAC) algorithm as our performance benchmark. The TAC algorithm was the best high-level video segmentation method [2]; however, it is very computationally intensive. To accelerate its computation, we developed a modified TAC (modTAC) algorithm designed to map easily onto a field programmable gate array (FPGA) device. Both the TAC and modTAC algorithms were used as performance benchmarks for our proposed method. Since affective video content is a perceptual concept, segmentation performance and human agreement rates were used as our evaluation criteria. To obtain our ground truth data and viewer agreement rates, a pilot panel study based on the work of Gross et al. [4] was conducted. Experimental results show the feasibility of our proposed method: for the first stage, an average improvement of as high as 38% was achieved over previous works, and for the second stage, an improvement of as high as 37% was achieved over the TAC algorithm.
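    To make the second stage more concrete, the sketch below shows one plausible way to apply spectral clustering to per-shot feature vectors and to read segment boundaries off the cluster labels, using scikit-learn. The feature dimensionality, the number of clusters, the nearest-neighbour affinity and the placeholder data are assumptions for illustration; the abstract does not fix these details.

        import numpy as np
        from sklearn.cluster import SpectralClustering

        # Assume `shot_features` is an (n_shots, d) array of per-shot statistical
        # feature vectors built from visual, audio and affective (P-A-D) cues.
        rng = np.random.default_rng(0)
        shot_features = rng.normal(size=(40, 8))      # placeholder data

        n_segments = 5                                # assumed number of high-level segments
        clustering = SpectralClustering(
            n_clusters=n_segments,
            affinity="nearest_neighbors",             # similarity graph over shots
            n_neighbors=10,
            assign_labels="kmeans",
            random_state=0,
        )
        labels = clustering.fit_predict(shot_features)

        # Segment boundaries fall where consecutive shots land in different clusters.
        boundaries = [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]
        print("cluster labels:", labels)
        print("boundary shot indices:", boundaries)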

    Resource Allocation for Personalized Video Summarization

    Full text link

    Audio-visual football video analysis, from structure detection to attention analysis

    Get PDF
    Sport video is an important video genre. Content-based sports video analysis attracts great interest from both industry and academia. A sports video is characterised by repetitive temporal structures, relatively plain content, and strong spatio-temporal variations, such as quick camera switches and swift local motions. It is necessary to develop specific techniques for content-based sports video analysis that exploit these characteristics. For an efficient and effective sports video analysis system, there are three fundamental questions: (1) what are the key stories in sports videos; (2) what incurs viewers' interest; and (3) how to identify game highlights. This thesis is developed around these questions. We approached them from two different perspectives, and three research contributions are presented in turn, namely replay detection, attack temporal structure decomposition, and attention-based highlight identification. Replay segments convey the most important content in sports videos, so detecting them is an efficient way to collect game highlights. However, replay is an artefact of editing, which evolves with advances in video editing tools. The composition of a replay is complex, including logo transitions, slow motions, viewpoint switches and normal-speed video clips. Since logo transition clips are pervasive in game collections of FIFA World Cup 2002, FIFA World Cup 2006 and UEFA Championship 2006, we take logo transition detection as an effective replacement for replay detection. A two-pass system was developed, comprising a five-layer AdaBoost classifier and logo template matching throughout an entire video. The five-layer AdaBoost classifier uses shot duration, average game pitch ratio, average motion, sequential colour histogram and shot frequency between two neighbouring logo transitions to filter out logo transition candidates. Subsequently, a logo template is constructed and employed to find all logo transition sequences. The precision and recall of this system for replay detection are 100% on a five-game evaluation collection. An attack structure is a team competition for a score; hence, this structure is a conceptually fundamental unit of a football video as well as of other sports videos. We review the literature on content-based temporal structures, such as the play-break structure, and develop a three-step system for automatic attack structure decomposition. Four content-based shot classes, namely play, focus, replay and break, are identified from low-level visual features. A four-state hidden Markov model is trained to model the transition processes among these shot classes. Since attack structures are the longest repetitive temporal unit in a sports video, a suffix tree is proposed to find the longest repetitive substring in the label sequence of shot-class transitions. The occurrences of this substring are regarded as the kernel of an attack hidden Markov process, so the decomposition of the attack structure becomes a boundary likelihood comparison between two Markov chains. Highlights are what attract notice, and attention is a psychological measurement of "notice". A brief survey of the psychological background of attention, visual and auditory attention estimation, and multi-modality attention fusion is presented. We propose two attention models for sports video analysis, namely the role-based attention model and the multiresolution autoregressive framework. The role-based attention model is based on the perception structure involved in watching video; it removes reflection bias among modality salient signals and combines these signals by reflectors. The multiresolution autoregressive (MAR) framework treats salient signals as a group of smooth random processes that follow a similar trend but are corrupted by noise, and estimates a noise-free signal from these coarse, noisy observations by multiple-resolution analysis. Related algorithms are developed, such as event segmentation on a MAR tree and real-time event detection. Experiments show that these attention-based approaches can find goal events with high precision. Moreover, the results of MAR-based highlight detection on the final games of FIFA 2002 and 2006 are highly similar to the highlights professionally labelled by BBC and FIFA.
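    The attack-structure step above hinges on finding the longest repetitive substring in the sequence of shot-class labels. The sketch below illustrates the idea with a simple sorted-suffix search in Python; the thesis proposes a suffix tree, which computes the same result more efficiently, and the label alphabet and example sequence here are assumptions for illustration.

        def longest_repeated_substring(labels: str) -> str:
            """Return the longest substring occurring at least twice in a
            shot-class label sequence (e.g. 'p' = play, 'f' = focus,
            'r' = replay, 'b' = break)."""
            suffixes = sorted(labels[i:] for i in range(len(labels)))
            best = ""
            for a, b in zip(suffixes, suffixes[1:]):
                # longest common prefix of two adjacent sorted suffixes
                k = 0
                while k < min(len(a), len(b)) and a[k] == b[k]:
                    k += 1
                if k > len(best):
                    best = a[:k]
            return best

        # Example label sequence, e.g. decoded from the four-state HMM (illustrative).
        sequence = "bprfbpfrbpfrbprfb"
        print(longest_repeated_substring(sequence))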

    Video Abstracting at a Semantical Level

    Get PDF
    One of the most common forms of video abstract is the movie trailer. Contemporary movie trailers share a common structure across genres, which allows for automatic generation and also reflects the corresponding movie's composition. In this thesis a system for the automatic generation of trailers is presented. In addition to action trailers, the system is able to deal with further genres such as horror and comedy trailers, which were first manually analyzed in order to identify their basic structures. To simplify the modeling of trailers and the abstract generation itself, a new video abstracting application was developed. This application is capable of performing all steps of the abstract generation automatically and allows for previews and manual optimizations. Based on this system, new abstracting models for horror and comedy trailers were created, and the corresponding trailers were automatically generated using these models. In an evaluation, the automatically generated trailers were compared to the original trailers and showed a similar structure. However, the automatically generated trailers do not yet exhibit the full polish of the Hollywood originals, as they lack intentional storylines across shots.

    Automatic indexing of video content via the detection of semantic events

    Get PDF
    The number and size of digital video databases are continuously growing. Unfortunately, most, if not all, of the video content in these databases is stored without any sort of indexing or analysis and without any associated metadata. If any of the videos do have metadata, it is usually the result of a manual annotation process rather than automatic indexing. Thus, locating clips and browsing content is difficult, time consuming and generally inefficient. The task of automatically indexing movies is particularly difficult given their innovative creation process and the individual style of many filmmakers. However, there are a number of underlying film grammar conventions that are universally followed, from a Hollywood blockbuster to an underground movie with a limited budget. These conventions dictate many elements of film making such as camera placement and editing. By examining the use of these conventions it is possible to extract information about the events in a movie. This research aims to provide an approach that creates an indexed version of a movie to facilitate ease of browsing and efficient retrieval. In order to achieve this aim, all of the relevant events contained within a movie are detected and classified into a predefined index. The event detection process involves examining the underlying structure of a movie and utilising audiovisual analysis techniques, supported by machine learning algorithms, to extract information based on this structure. The result is an indexed movie that can be presented to users for browsing and retrieval of relevant events, as well as supporting user-specified searching. Extensive evaluation of the indexing approach is carried out. This evaluation indicates efficient performance of the event detection and retrieval system, and also highlights the subjective nature of video content.

    Interactively skimming recorded speech

    Get PDF
    Thesis (Ph.D.) -- Massachusetts Institute of Technology, Program in Media Arts & Sciences, 1994. Includes bibliographical references (p. 143-156). Barry Michael Arons. Ph.D.

    Multi-modal surrogates for retrieving and making sense of videos: is synchronization between the multiple modalities optimal?

    Get PDF
    Video surrogates can help people quickly make sense of the content of a video before downloading it or seeking more detailed information. Visual and audio features of a video are primary information carriers and can become important components of video retrieval and video sense-making. In past decades, most research and development efforts on video surrogates have focused on visual features of the video, and comparatively little work has been done on audio surrogates and their pros and cons in aiding users' retrieval and sense-making of digital videos. Even less work has been done on multi-modal surrogates, in which more than one modality is employed for consuming the surrogate, for example the audio and visual modalities. This research examined the effectiveness of a number of multi-modal surrogates and investigated whether synchronization between the audio and visual channels is optimal. A user study was conducted to evaluate six different surrogates on a set of six recognition and inference tasks, to answer two main research questions: (1) How do automatically-generated multi-modal surrogates compare to manually-generated ones in video retrieval and video sense-making? and (2) Does synchronization between multiple surrogate channels enhance or inhibit video retrieval and video sense-making? Forty-eight participants took part in the study, in which the surrogates were measured on the time participants spent experiencing the surrogates, the time participants spent on the tasks, participants' accuracy on the tasks, participants' confidence in their task responses, and participants' subjective ratings of the surrogates. On average, the uncoordinated surrogates were more helpful than the coordinated ones, while the manually-generated surrogates were more helpful than the automatically-generated ones only in terms of task completion time. Participants' subjective ratings were most favorable for the coordinated surrogate C2 (Magic A + V) and the uncoordinated surrogate U1 (Magic A + Storyboard V) with respect to usefulness, usability, enjoyment, and engagement. The post-session questionnaire comments demonstrated participants' preference for the coordinated surrogates, but the comments also revealed the value of having uncoordinated sensory channels.