766 research outputs found
Goal Detection in Soccer Video: Role-Based Events Detection Approach
Processing and analysing soccer video to find critical events, such as goals, has been an active research topic in recent years. In this paper, a new role-based framework is proposed for goal event detection that uses the semantic structure of the soccer game. Usually after a goal scene, the sound intensity of the audience and the reporters increases, the ball is sent back to the center, and the camera may zoom in on a player, show the audience celebrating, replay the goal scene, or display a combination of these. Thus, the occurrence of a goal event can be detected by analysing sequences of these roles. The proposed framework consists of four main procedures: (1) detection of the game's critical events using the audio channel; (2) shot boundary detection and shot classification; (3) selection of candidate events according to the type of shot and the presence of the goalmouth in the shot; and (4) detection of the game restarting from the center of the field. A new method for shot classification is also presented in this framework. Finally, applying the proposed method showed that goal event detection achieves good accuracy with a very low detection failure rate. DOI: http://dx.doi.org/10.11591/ijece.v4i6.637
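The role sequence described in this abstract can be illustrated with a minimal sketch. It assumes that shots have already been classified and that audio peaks have already been located; the shot labels (`goalmouth`, `close_up`, `audience`, `replay`, `center_restart`) and the function `detect_goals` are hypothetical names chosen for illustration, not the paper's implementation.

```python
# Hypothetical shot labels; the paper's own shot classes may differ.
FOLLOW_UP = {"close_up", "audience", "replay"}


def detect_goals(shots, audio_peaks):
    """Return indices of goalmouth shots that look like goals.

    shots: list of shot labels in temporal order.
    audio_peaks: set of shot indices where audio intensity is high.
    """
    goals = []
    for i, label in enumerate(shots):
        # A candidate needs a goalmouth shot that coincides with an audio peak.
        if label != "goalmouth" or i not in audio_peaks:
            continue
        # Skip over the follow-up shots (close-up, audience, replay).
        j = i + 1
        while j < len(shots) and shots[j] in FOLLOW_UP:
            j += 1
        # Require at least one follow-up shot and then a restart from the center.
        if j > i + 1 and j < len(shots) and shots[j] == "center_restart":
            goals.append(i)
    return goals


example_shots = ["long", "goalmouth", "close_up", "replay",
                 "center_restart", "long", "goalmouth", "long"]
example_goals = detect_goals(example_shots, audio_peaks={1, 6})
```

In this example only the first goalmouth shot is reported, because the second one is not followed by any follow-up shot or by a restart from the center.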
Content-based video indexing for sports applications using integrated multi-modal approach
This thesis presents research based on an integrated multi-modal approach for sports video indexing and retrieval. By combining specific features extractable from multiple (audio-visual) modalities, generic structure and specific events can be detected and classified. During browsing and retrieval, users will benefit from the integration of high-level semantics and some descriptive mid-level features such as whistles and close-up views of player(s). The main objective is to contribute to the three major components of sports video indexing systems. The first component is a set of powerful techniques to extract audio-visual features and semantic contents automatically. The main purposes are to reduce manual annotations and to summarize the lengthy contents into a compact, meaningful and more enjoyable presentation. The second component is an expressive and flexible indexing technique that supports gradual index construction. An indexing scheme is essential to determine the methods by which users can access a video database. The third and last component is a query language that can generate dynamic video summaries for smart browsing and support user-oriented retrievals.
Verbs in Action: Improving verb understanding in video-language models
Understanding verbs is crucial to modelling how people and objects interact
with each other and the environment through space and time. Recently,
state-of-the-art video-language models based on CLIP have been shown to have
limited verb understanding and to rely extensively on nouns, restricting their
performance in real-world video applications that require action and temporal
understanding. In this work, we improve verb understanding for CLIP-based
video-language models by proposing a new Verb-Focused Contrastive (VFC)
framework. This consists of two main components: (1) leveraging pretrained
large language models (LLMs) to create hard negatives for cross-modal
contrastive learning, together with a calibration strategy to balance the
occurrence of concepts in positive and negative pairs; and (2) enforcing a
fine-grained, verb phrase alignment loss. Our method achieves state-of-the-art
results for zero-shot performance on three downstream tasks that focus on verb
understanding: video-text matching, video question-answering and video
classification. To the best of our knowledge, this is the first work which
proposes a method to alleviate the verb understanding problem, and does not
simply highlight it
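To illustrate why hard negatives matter in cross-modal contrastive learning, here is a minimal sketch of an InfoNCE-style loss for a single video embedding. It is not the VFC implementation: the function name `contrastive_loss`, the toy 2-D embeddings and the temperature of 0.07 are all assumptions made for this example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalise(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def contrastive_loss(video, positive, negatives, temperature=0.07):
    """InfoNCE-style loss for one video against one positive caption
    and a list of negative captions (all given as embedding vectors)."""
    v = normalise(video)
    # Cosine similarities scaled by the temperature; the positive is first.
    sims = [dot(v, normalise(t)) / temperature for t in [positive] + negatives]
    # Numerically stable log-sum-exp over all candidate captions.
    m = max(sims)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in sims))
    return log_denominator - sims[0]


# An easy negative is unrelated to the video; a hard negative is close to
# the positive (for example, the same caption with only the verb changed).
easy_loss = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
hard_loss = contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.9, 0.1]])
```

The hard negative yields a much larger loss than the easy one, so it contributes a stronger training signal for telling apart captions that differ only slightly.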
Feature based dynamic intra-video indexing
A thesis submitted in partial fulfillment for the degree of Doctor of Philosophy. With the advent of digital imagery and its widespread application in all vistas of life, it has become an important component in the world of communication. Video content ranging from broadcast news, sports, personal videos, surveillance, movies and entertainment and similar domains is increasing exponentially in quantity and it is becoming a challenge to retrieve content of interest from the corpora. This has led to an increased interest amongst researchers in investigating concepts of video structure analysis, feature extraction, content annotation, tagging, video indexing, querying and retrieval to fulfil the requirements. However, most of the previous work is confined to a specific domain and constrained by quality, processing and storage capabilities. This thesis presents a novel framework agglomerating the established approaches from feature extraction to browsing in one system of content-based video retrieval. The proposed framework significantly fills the gap identified while satisfying the imposed constraints of processing, storage, quality and retrieval times. The output entails a framework, methodology and prototype application to allow the user to efficiently and effectively retrieve content of interest such as age, gender and activity by specifying the relevant query. Experiments have shown plausible results with an average precision and recall of 0.91 and 0.92 respectively for face detection using a Haar wavelet-based approach. Precision for age ranges from 0.82 to 0.91 and recall from 0.78 to 0.84. The recognition of gender gives better precision for males (0.89) compared to females, while recall is higher for females (0.92). Activity of the subject has been detected using the Hough transform and classified using a Hidden Markov Model. A comprehensive dataset to support similar studies has also been developed as part of the research process.
A Graphical User Interface (GUI) providing a friendly and intuitive interface has been integrated into the developed system to facilitate the retrieval process. The comparison results of the intraclass correlation coefficient (ICC) show that the performance of the system closely resembles that of the human annotator. The performance has been optimised for time and error rate.
Knowledge assisted data management and retrieval in multimedia database systems
With the proliferation of multimedia data and ever-growing requests for multimedia applications, there is an increasing need for efficient and effective indexing, storage and retrieval of multimedia data, such as graphics, images, animation, video, audio and text. Due to the special characteristics of the multimedia data, the Multimedia Database Management Systems (MMDBMSs) have emerged and attracted great research attention in recent years. Though much research effort has been devoted to this area, it is still far from maturity and there exist many open issues. In this dissertation, with the focus of addressing three of the essential challenges in developing the MMDBMS, namely, semantic gap, perception subjectivity and data organization, a systematic and integrated framework is proposed with video database and image database serving as the testbed. In particular, the framework addresses these challenges separately yet coherently from three main aspects of a MMDBMS: multimedia data representation, indexing and retrieval. In terms of multimedia data representation, the key to address the semantic gap issue is to intelligently and automatically model the mid-level representation and/or semi-semantic descriptors besides the extraction of the low-level media features. The data organization challenge is mainly addressed by the aspect of media indexing where various levels of indexing are required to support the diverse query requirements. In particular, the focus of this study is to facilitate the high-level video indexing by proposing a multimodal event mining framework associated with temporal knowledge discovery approaches. With respect to the perception subjectivity issue, advanced techniques are proposed to support users' interaction and to effectively model users' perception from the feedback at both the image-level and object-level.
Integrated analysis of audiovisual signals and external information sources for event detection in team sports video
Ph.D. (Doctor of Philosophy)
User-centred video abstraction
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University London. The rapid growth of digital video content in recent years has imposed the need for the development of technologies with the capability to produce condensed but semantically rich versions of the input video stream in an effective manner. Consequently, the topic of Video Summarisation is becoming increasingly popular in the multimedia community and numerous video abstraction approaches have been proposed accordingly. These techniques can be divided into two major categories, automatic and semi-automatic, in accordance with the required level of human intervention in the summarisation process. The fully-automated methods mainly adopt low-level visual, aural and textual features alongside mathematical and statistical algorithms in order to extract the most significant segments of the original video. However, the effectiveness of this type of technique is restricted by a number of factors such as domain-dependency, computational expense and the inability to understand the semantics of videos from low-level features. The second category of techniques, however, attempts to improve the quality of summaries by involving humans in the abstraction process to bridge the semantic gap. Nonetheless, a single user's subjectivity and other external contributing factors such as distraction will potentially deteriorate the performance of this group of approaches. Accordingly, in this thesis we have focused on the development of three user-centred effective video summarisation techniques that could be applied to different video categories and generate satisfactory results. According to our first proposed approach, a novel mechanism for a user-centred video summarisation has been presented for the scenarios in which multiple actors are employed in the video summarisation process in order to minimise the negative effects of sole user adoption.
Based on our recommended algorithm, the video frames were initially scored by a group of video annotators "on the fly". This was followed by averaging these assigned scores in order to generate a singular saliency score for each video frame and, finally, the highest scored video frames alongside the corresponding audio and textual contents were extracted to be included into the final summary. The effectiveness of our approach has been assessed by comparing the video summaries generated based on our approach against the results obtained from three existing automatic summarisation tools that adopt different modalities for abstraction purposes. The experimental results indicated that our proposed method is capable of delivering remarkable outcomes in terms of Overall Satisfaction and Precision with an acceptable Recall rate, indicating the usefulness of involving user input in the video summarisation process. In an attempt to provide a better user experience, we have proposed our personalised video summarisation method with an ability to customise the generated summaries in accordance with the viewers' preferences. Accordingly, the end-user's priority levels towards different video scenes were captured and utilised for updating the average scores previously assigned by the video annotators. Finally, our earlier proposed summarisation method was adopted to extract the most significant audio-visual content of the video. Experimental results indicated the capability of this approach to deliver superior outcomes compared with our previously proposed method and the three other automatic summarisation tools. Finally, we have attempted to reduce the required level of audience involvement for personalisation purposes by proposing a new method for producing personalised video summaries. Accordingly, SIFT visual features were adopted to identify the video scenes' semantic categories.
By fusing this retrieved data with pre-built users' profiles, personalised video abstracts can be created. Experimental results showed the effectiveness of this method in delivering superior outcomes compared with our previously recommended algorithm and the three other automatic summarisation techniques.
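The first approach in this abstract (averaging annotator scores into one saliency score per frame, then keeping the highest-scored frames) can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `summarise`, the `top_k` parameter and the toy scores are hypothetical, and the thesis may select frames differently, for example by a score threshold.

```python
from statistics import mean


def summarise(frame_scores, top_k):
    """Select the top_k frames by mean annotator score.

    frame_scores: dict mapping frame id -> list of scores, one per annotator.
    Returns the selected frame ids in temporal (ascending id) order.
    """
    # One saliency score per frame: the average over all annotators.
    saliency = {frame: mean(scores) for frame, scores in frame_scores.items()}
    # Rank by descending saliency; break ties by the earlier frame.
    ranked = sorted(saliency, key=lambda frame: (-saliency[frame], frame))
    # Restore temporal order so the summary plays back chronologically.
    return sorted(ranked[:top_k])


# Toy data: four frames, three annotators each.
example_scores = {0: [1, 2, 1], 1: [5, 4, 5], 2: [3, 3, 3], 3: [4, 4, 5]}
example_summary = summarise(example_scores, top_k=2)
```

Averaging across several annotators is what dampens the effect of any single user's subjectivity, which is the motivation the abstract gives for employing multiple actors.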
Video Abstracting at a Semantical Level
One of the most common forms of video abstract is the movie trailer. Contemporary movie trailers share a common structure across genres, which allows for automatic generation and also reflects the corresponding movie's composition. In this thesis a system for the automatic generation of trailers is presented. In addition to action trailers, the system is able to deal with further genres such as horror and comedy trailers, which were first manually analyzed in order to identify their basic structures. To simplify the modeling of trailers and the abstract generation itself, a new video abstracting application was developed. This application is capable of performing all steps of the abstract generation automatically and allows for previews and manual optimizations. Based on this system, new abstracting models for horror and comedy trailers were created and the corresponding trailers have been automatically generated using the new abstracting models. In an evaluation the automatic trailers were compared to the original trailers and showed a similar structure. However, the automatically generated trailers still do not exhibit the full perfection of the Hollywood originals, as they lack intentional storylines across shots.
Analysis and Visualization of Index Words from Audio Transcripts of Instructional Videos
We introduce new techniques for extracting, analyzing, and visualizing
textual contents from instructional videos of low production quality. Using
Automatic Speech Recognition, approximate transcripts (H75% Word Error Rate)
are obtained from the originally highly compressed videos of university
courses, each comprising between 10 and 30 lectures. Text material in the form
of books or papers that accompany the course are then used to filter meaningful
phrases from the seemingly incoherent transcripts. The resulting index into the
transcripts is tied together and visualized in 3 experimental graphs that help
in understanding the overall course structure and provide a tool for localizing
certain topics for indexing. We specifically discuss a Transcript Index Map,
which graphically lays out key phrases for a course, a Textbook Chapter to
Transcript Match, and finally a Lecture Transcript Similarity graph, which
clusters semantically similar lectures. We test our methods and tools on 7 full
courses with 230 hours of video and 273 transcripts. We are able to extract up
to 98 unique key terms for a given transcript and up to 347 unique key terms
for an entire course. The accuracy of the Textbook Chapter to Transcript Match
exceeds 70% on average. The methods used can be applied to genres of video in
which there are recurrent thematic words (news, sports, meetings, ...).
Comment: 2004 IEEE International Workshop on Multimedia Content-based Analysis
and Retrieval; 20 pages, 8 figures, 7 tables
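The idea of using accompanying text material to filter meaningful phrases from a noisy transcript can be sketched as below. This is only an illustration under stated assumptions: it matches word bigrams exactly, and the names `tokens` and `index_terms` and the sample strings are hypothetical, not the authors' method.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase a text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(words, n):
    """Yield consecutive n-word tuples from a list of words."""
    return zip(*(words[i:] for i in range(n)))


def index_terms(transcript, textbook, n=2):
    """Count transcript n-grams that also occur in the textbook."""
    textbook_phrases = set(ngrams(tokens(textbook), n))
    matches = (g for g in ngrams(tokens(transcript), n) if g in textbook_phrases)
    return Counter(matches)


example_textbook = ("A hash table maps keys to values. "
                    "Binary search trees keep keys ordered.")
example_transcript = ("so um the hash table uh thing and then "
                      "binary search is like the hash table")
example_terms = index_terms(example_transcript, example_textbook)
```

Only phrases shared with the textbook survive, so filler words and recognition errors in the transcript are dropped from the resulting index.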