A COMPUTATION METHOD/FRAMEWORK FOR HIGH LEVEL VIDEO CONTENT ANALYSIS AND SEGMENTATION USING AFFECTIVE LEVEL INFORMATION
Video segmentation facilitates efficient video indexing and navigation in large
digital video archives. It is an important process in a content-based video
indexing and retrieval (CBVIR) system. Many automated solutions performed
segmentation by utilizing information about the "facts" of the video. These "facts"
come in the form of labels that describe the objects captured by the camera.
Solutions of this type achieved good and consistent results for some
video genres, such as news programs and informational presentations. The content
format of such videos is generally quite standard, and automated solutions
were designed to follow these format rules. For example, in [1] the presence of news
anchor persons was used as a cue to determine the start and end of a meaningful
news segment.
The same cannot be said for video genres such as movies and feature films.
This is because the makers of such videos use different filming techniques to
design their videos in order to elicit particular affective responses from their target
audience. Humans usually perform manual video segmentation by trying to relate
changes in time and locale to discontinuities in meaning [2]. As a result, viewers
usually have doubts about the boundary locations of a meaningful video segment
because their affective responses differ.
This thesis presents an entirely new view of the problem of high-level video
segmentation. We developed a novel probabilistic method for affective-level video
content analysis and segmentation. Our method has two stages. In the first stage,
affective content labels are assigned to video shots by means of a dynamic Bayesian
network (DBN). A novel hierarchical-coupled dynamic Bayesian network (HCDBN)
topology was proposed for this stage. The topology is based on the
pleasure-arousal-dominance (P-A-D) model of affect representation [3]. In principle, this
model can represent a large number of emotions. In the second stage, the visual,
audio and affective information of the video is used to compute a statistical feature
vector that represents the content of each shot. Affective-level video segmentation is
achieved by applying spectral clustering to the feature vectors.
We evaluated the first stage of our proposal by comparing its emotion detection
ability with that of all existing works related to the field of affective video
content analysis. To evaluate the second stage, we used the time adaptive clustering
(TAC) algorithm as our performance benchmark. The TAC algorithm was the best
high-level video segmentation method [2]. However, it is very computationally
intensive. To accelerate its computation, we developed a modified
TAC (modTAC) algorithm designed to be mapped easily onto a field
programmable gate array (FPGA) device. Both the TAC and modTAC algorithms
were used as performance benchmarks for our proposed method.
Since affective video content is a perceptual concept, segmentation
performance and human agreement rates were used as our evaluation criteria. To obtain
our ground truth data and viewer agreement rates, a pilot panel study based on
the work of Gross et al. [4] was conducted. The experimental results show
the feasibility of our proposed method. For the first stage of our proposal, the
results show an average improvement of up to 38%
over previous works. For the second stage, an improvement of up to
37% was achieved over the TAC algorithm.
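The second stage described above (spectral clustering of per-shot feature vectors) can be illustrated with a minimal sketch. It is not the thesis's implementation: the Gaussian affinity, the `sigma` parameter, the two-way split, and the function names `gaussian_affinity` and `spectral_bipartition` are all assumptions made for illustration, and the feature vectors are toy values rather than real visual, audio or affective features.

```python
import math

def gaussian_affinity(features, sigma=1.0):
    """Pairwise Gaussian similarity between shot feature vectors (assumed kernel)."""
    n = len(features)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(features[i], features[j]))
            W[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return W

def spectral_bipartition(features, sigma=1.0, iters=200):
    """Split shots into two groups using the second eigenvector of D^-1 W.

    The eigenvector is approximated by power iteration; the trivial constant
    eigenvector is removed at every step by a degree-weighted projection.
    """
    W = gaussian_affinity(features, sigma)
    n = len(W)
    deg = [sum(row) for row in W]
    total = sum(deg)
    pi = [d / total for d in deg]          # degree weights, sum to 1
    v = [float(i) for i in range(n)]       # deterministic, non-constant start
    for _ in range(iters):
        mean = sum(p * x for p, x in zip(pi, v))
        v = [x - mean for x in v]          # remove the constant component
        v = [sum(W[i][j] * v[j] for j in range(n)) / deg[i] for i in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    mean = sum(p * x for p, x in zip(pi, v))
    v = [x - mean for x in v]
    # The sign of each entry decides the cluster of the corresponding shot.
    return [0 if x < 0 else 1 for x in v]

# Toy feature vectors: three shots near the origin, three shots far away.
shots = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
         [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]
labels = spectral_bipartition(shots)
print(labels)
```

A full segmentation would normally use more than two clusters and respect the temporal order of shots; this sketch only shows the clustering idea on feature vectors.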
Content-based music classification, summarization and retrieval
Ph.D., Doctor of Philosophy
Automatic summarization of narrative video
The amount of digital video content available to users is rapidly increasing. Developments in computer, digital network, and storage technologies all contribute to broadening the offer of digital video. Only users' attention and time remain scarce resources. Users face the problem of choosing the right content to watch among hundreds of potentially interesting offers. Video and audio have a dynamic nature: they cannot be properly perceived without considering their temporal dimension. This property makes it difficult to get a good idea of what a video item is about without watching it. Video previews aim to solve this issue by providing compact representations of video items that can help users make choices in massive content collections. This thesis is concerned with the problem of automatically creating video previews. To allow fast and convenient content selection, a video preview should take into consideration more than thirty requirements that we have collected by analyzing related literature on video summarization and film production. The list has been completed with additional requirements elicited by interviewing end-users, experts and practitioners in the field of video editing and multimedia. This list represents our collection of user needs with respect to video previews. The requirements, presented from the point of view of the end-users, can be divided into seven categories: duration, continuity, priority, uniqueness, exclusion, structural, and temporal order. Duration requirements deal with the durations of the preview and its subparts. Continuity requirements ask for video previews to be as continuous as possible. Priority requirements indicate which content should be included in the preview to convey as much information as possible in the shortest time. Uniqueness requirements aim at maximizing the efficiency of the preview by minimizing redundancy. Exclusion requirements indicate which content should not be included in the preview. 
Structural requirements are concerned with the structural properties of video, while temporal order requirements set the order of the sequences included in the preview. Based on these requirements, we have introduced a formal model of video summarization specialized for the generation of video previews. The basic idea is to translate the requirements into score functions. Each score function is defined to have a non-positive value if a requirement is not met, and to increase with the degree of fulfillment of the requirement. A global objective function is then defined that combines all the score functions, and the problem of generating a preview is translated into the problem of finding the parts of the initial content that maximize the objective function. Our solution approach is based on two main steps: preparation and selection. In the preparation step, the raw audiovisual data is analyzed and segmented into basic elements that are suitable for inclusion in a preview. The segmentation of the raw data is based on a shot-cut detection algorithm. In the selection step, various content analysis algorithms are used to perform scene segmentation and advertisement detection and to extract numerical descriptors of the content that, introduced into the objective function, make it possible to estimate the quality of a video preview. The core part of the selection step is the optimization step, which consists of searching the space of all possible previews for the set of segments that maximizes the objective function. Instead of solving the optimization problem exactly, an approximate solution is found by means of a local search algorithm using simulated annealing. We have performed a numerical evaluation of the quality of the solutions generated by our algorithm with respect to previews generated randomly or by selecting segments uniformly in time. The results on thirty content items have shown that the local search approach outperforms the other methods. 
However, based on this evaluation, we cannot conclude that the degree of fulfillment of the requirements achieved by our method satisfies the end-user needs completely. To validate our approach and assess end-user satisfaction, we conducted a user evaluation study in which we compared six aspects of previews generated using our algorithm to human-made previews and to previews generated by subsampling. The results have shown that previews generated using our optimization-based approach are not as good as manually made previews, but have higher quality than previews created by subsampling. The differences between the previews are statistically significant.
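The optimization idea in this abstract (score functions combined into a global objective, maximized approximately by local search with simulated annealing) can be sketched as follows. Everything concrete here is an assumption for illustration: the three toy score functions, the segment fields `duration` and `importance`, the single-flip neighbourhood, and the geometric cooling schedule are not taken from the thesis, which collects more than thirty requirements.

```python
import math
import random

def objective(selection, segments, max_duration):
    """Hypothetical global objective: the sum of three toy score functions."""
    chosen = [segments[i] for i in sorted(selection)]
    duration = sum(s["duration"] for s in chosen)
    # Duration requirement: non-positive when the duration budget is exceeded.
    duration_score = 1.0 if duration <= max_duration else -(duration - max_duration)
    # Priority requirement: reward the importance of the chosen segments.
    priority_score = sum(s["importance"] for s in chosen)
    # Continuity requirement: reward pairs of adjacent segments.
    continuity_score = sum(0.5 for i in selection if i + 1 in selection)
    return duration_score + priority_score + continuity_score

def anneal(segments, max_duration, steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Local search with simulated annealing over subsets of segments."""
    rng = random.Random(seed)
    current = set()
    current_score = objective(current, segments, max_duration)
    best, best_score = set(current), current_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / (steps - 1))
        # Neighbour: flip one randomly chosen segment in or out of the preview.
        candidate = current ^ {rng.randrange(len(segments))}
        cand_score = objective(candidate, segments, max_duration)
        delta = cand_score - current_score
        # Always accept improvements; accept worse moves with probability exp(delta / t).
        if delta >= 0 or rng.random() < math.exp(delta / t):
            current, current_score = candidate, cand_score
            if current_score > best_score:
                best, best_score = set(current), current_score
    return sorted(best), best_score

# Toy segments, e.g. as produced by a shot-cut detector (values are made up).
segments = [
    {"duration": 10, "importance": 0.2},
    {"duration": 5, "importance": 0.9},
    {"duration": 15, "importance": 0.4},
    {"duration": 10, "importance": 0.7},
    {"duration": 5, "importance": 0.1},
    {"duration": 10, "importance": 0.6},
]
preview, score = anneal(segments, max_duration=30)
print(preview, round(score, 2))
```

Tracking the best selection seen so far means the returned preview is never worse than the starting (empty) preview, even though individual annealing moves may temporarily lower the objective.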
User-centred video abstraction
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University London. The rapid growth of digital video content in recent years has created a need for technologies capable of producing condensed but semantically rich versions of an input video stream in an effective manner. Consequently, the topic of video summarisation is becoming increasingly popular in the multimedia community, and numerous video abstraction approaches have been proposed. These techniques can be divided into two major categories, automatic and semi-automatic, according to the level of human intervention required in the summarisation process. The fully automated methods mainly use low-level visual, aural and textual features, together with mathematical and statistical algorithms, to extract the most significant segments of the original video. However, the effectiveness of this type of technique is restricted by a number of factors, such as domain dependency, computational expense and the inability to understand the semantics of videos from low-level features. The second category of techniques, however, attempts to improve the quality of summaries by involving humans in the abstraction process to bridge the semantic gap. Nonetheless, a single user's subjectivity and other external contributing factors, such as distraction, can degrade the performance of this group of approaches. Accordingly, in this thesis we have focused on the development of three effective user-centred video summarisation techniques that can be applied to different video categories and generate satisfactory results. In our first proposed approach, a novel mechanism for user-centred video summarisation is presented for scenarios in which multiple actors are employed in the video summarisation process in order to minimise the negative effects of relying on a single user. 
Based on our proposed algorithm, the video frames were initially scored by a group of video annotators 'on the fly'. These assigned scores were then averaged to generate a single saliency score for each video frame and, finally, the highest-scoring video frames, together with the corresponding audio and textual content, were extracted for inclusion in the final summary. The effectiveness of our approach has been assessed by comparing the video summaries it generates against the results obtained from three existing automatic summarisation tools that adopt different modalities for abstraction purposes. The experimental results indicated that our proposed method is capable of delivering remarkable outcomes in terms of Overall Satisfaction and Precision with an acceptable Recall rate, indicating the usefulness of involving user input in the video summarisation process. In an attempt to provide a better user experience, we have proposed a personalised video summarisation method with the ability to customise the generated summaries in accordance with the viewers' preferences. Accordingly, the end-user's priority levels towards different video scenes were captured and used to update the average scores previously assigned by the video annotators. Finally, our earlier proposed summarisation method was adopted to extract the most significant audio-visual content of the video. Experimental results indicated the capability of this approach to deliver superior outcomes compared with our previously proposed method and the three other automatic summarisation tools. Finally, we have attempted to reduce the level of audience involvement required for personalisation by proposing a new method for producing personalised video summaries. Accordingly, SIFT visual features were adopted to identify the video scenes' semantic categories. 
By fusing this retrieved data with pre-built user profiles, personalised video abstracts can be created. Experimental results showed the effectiveness of this method in delivering superior outcomes compared with our previously proposed algorithm and the three other automatic summarisation techniques.
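The scoring step described in this abstract (several annotators score each frame, the scores are averaged into one saliency score, and the highest-scoring frames are kept) can be sketched in a few lines. The function name `summarise`, the optional `priorities` multipliers standing in for a viewer's priority levels, and the sample scores are hypothetical; the abstract does not specify how priorities update the averages, so multiplication is an assumption.

```python
def summarise(frame_scores, top_k, priorities=None):
    """Average annotator scores per frame and keep the top_k frames.

    frame_scores: one list of per-frame scores for each annotator.
    priorities:   optional per-frame multipliers (assumed form of a
                  viewer's priority levels).
    Returns the selected frame indices in temporal order and the final scores.
    """
    n_frames = len(frame_scores[0])
    averages = [
        sum(annotator[f] for annotator in frame_scores) / len(frame_scores)
        for f in range(n_frames)
    ]
    if priorities is not None:
        averages = [a * p for a, p in zip(averages, priorities)]
    ranked = sorted(range(n_frames), key=lambda f: averages[f], reverse=True)
    return sorted(ranked[:top_k]), averages

# Three annotators scoring four frames (made-up values).
scores = [
    [1, 5, 3, 2],
    [2, 4, 5, 1],
    [1, 5, 4, 2],
]
generic, _ = summarise(scores, top_k=2)
personal, _ = summarise(scores, top_k=2, priorities=[3.0, 1.0, 0.5, 1.0])
print(generic, personal)
```

Averaging over several annotators is what dampens the effect of any one person's subjectivity or distraction, which is the motivation the abstract gives for using multiple actors.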
Semantic analysis of field sports video using a petri-net of audio-visual concepts
The most common approach to automatic summarisation and highlight detection in sports video is to train an automatic classifier to detect semantic highlights based on occurrences of low-level features such as action replays, excited commentators or changes in a scoreboard. We propose an alternative approach based on the detection of perception concepts (PCs) and the construction of Petri-Nets which can be used for both semantic description and event detection within sports videos. Low-level algorithms for the detection of perception concepts using visual, aural and motion characteristics are proposed, and a series of Petri-Nets composed of perception concepts is formally defined to describe video content. We call this a Perception Concept Network-Petri Net (PCN-PN) model. Using PCN-PNs, personalized high-level semantic descriptions of video highlights can be facilitated and queries on high-level semantics can be achieved. A particular strength of this framework is that we can easily build semantic detectors based on PCN-PNs to search within sports videos and locate interesting events. Experimental results based on recorded sports
video data across three types of sports (soccer, basketball and rugby), each from multiple broadcasters, are used to illustrate the potential of this framework.
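The idea of composing detected perception concepts into a Petri net that signals an event can be illustrated with a minimal sketch. The `PetriNet` class, the place names `crowd_cheer`, `replay` and `goal_event`, and the transition `detect_goal` are hypothetical; the paper's formal PCN-PN definition is not reproduced here.

```python
class PetriNet:
    """Minimal place/transition net: places hold tokens, transitions move them."""

    def __init__(self, places):
        # Marking: number of tokens currently in each place.
        self.marking = {p: 0 for p in places}
        self.transitions = {}

    def add_transition(self, name, inputs, outputs):
        """Register a transition with distinct input and output places."""
        self.transitions[name] = (list(inputs), list(outputs))

    def add_token(self, place):
        """Called when a low-level detector reports a perception concept."""
        self.marking[place] += 1

    def enabled(self, name):
        inputs, _ = self.transitions[name]
        return all(self.marking[p] >= 1 for p in inputs)

    def fire(self, name):
        """Fire the transition if enabled; return whether it fired."""
        if not self.enabled(name):
            return False
        inputs, outputs = self.transitions[name]
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.marking[p] += 1
        return True

# Hypothetical net: a goal is signalled once both a crowd cheer and an
# action replay have been detected.
net = PetriNet(["crowd_cheer", "replay", "goal_event"])
net.add_transition("detect_goal", ["crowd_cheer", "replay"], ["goal_event"])

net.add_token("crowd_cheer")
print(net.fire("detect_goal"))   # only one of the two concepts seen so far
net.add_token("replay")
print(net.fire("detect_goal"), net.marking)
```

A semantic detector in this style is just a net whose output place receives a token when the required combination of concepts has occurred.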
Mining Social Media for Newsgathering: A Review
Social media is becoming an increasingly important data source for learning
about breaking news and for following the latest developments of ongoing news.
This is in part possible thanks to the existence of mobile devices, which
allow anyone with access to the Internet to post updates from anywhere,
leading in turn to a growing presence of citizen journalism. Consequently,
social media has become a go-to resource for journalists during the process of
newsgathering. Use of social media for newsgathering is however challenging,
and suitable tools are needed in order to facilitate access to useful
information for reporting. In this paper, we provide an overview of research in
data mining and natural language processing for mining social media for
newsgathering. We discuss five different areas that researchers have worked on
to mitigate the challenges inherent to social media newsgathering: news
discovery, curation of news, validation and verification of content,
newsgathering dashboards, and other tasks. We outline the progress made so far
in the field, summarise the current challenges as well as discuss future
directions in the use of computational journalism to assist with social media
newsgathering. This review is relevant to computer scientists researching news
in social media as well as to interdisciplinary researchers interested in the
intersection of computer science and journalism.
Comment: Accepted for publication in Online Social Networks and Media
Unsupervised video indexing on audiovisual characterization of persons
This thesis proposes a method for the unsupervised characterization of persons in audiovisual documents, exploiting data related to their physical appearance and their voice. In general, automatic identification methods, whether in video or in audio, require a large amount of a priori knowledge about the content. In this work, the goal is to study the two modalities in a correlated way and to exploit their respective properties in a collaborative and robust manner, in order to produce a reliable result that is as independent as possible of any a priori knowledge. 
More particularly, we studied the characteristics of the audio stream and proposed several methods for speaker segmentation and clustering, which we evaluated in a French evaluation campaign. We then carried out an in-depth study of visual descriptors (face, clothing), which helped us to propose novel approaches for detecting, tracking, and clustering people within the document. Finally, the work focused on audiovisual fusion, proposing an approach based on computing a co-occurrence matrix that allowed us to establish an association between the audio and video indexes and to correct them. This enables us to produce a dynamic audiovisual model of the speakers.
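The fusion step mentioned above (a co-occurrence matrix linking the audio index to the video index) can be sketched as follows. The per-time-slot label lists, the function names `cooccurrence` and `associate`, and the rule of assigning each speaker to the most frequently co-occurring face cluster are assumptions for illustration; the thesis's actual association and correction procedure is not described in the abstract.

```python
def cooccurrence(audio_labels, video_labels):
    """Count how often each (speaker, face) pair occurs in the same time slot."""
    counts = {}
    for speaker, face in zip(audio_labels, video_labels):
        counts[(speaker, face)] = counts.get((speaker, face), 0) + 1
    return counts

def associate(audio_labels, video_labels):
    """Map each speaker cluster to the face cluster it co-occurs with most."""
    counts = cooccurrence(audio_labels, video_labels)
    best = {}
    for (speaker, face), c in counts.items():
        if speaker not in best or c > counts[(speaker, best[speaker])]:
            best[speaker] = face
    return best

# Hypothetical cluster labels per time slot from the two separate indexes.
audio = ["spk1", "spk1", "spk2", "spk2", "spk1", "spk2"]
video = ["faceA", "faceA", "faceB", "faceB", "faceB", "faceB"]
print(associate(audio, video))
```

Slots where the two indexes disagree with the dominant association (here the fifth slot) are natural candidates for the correction step the abstract mentions.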