2,204 research outputs found
Automatic parsing of sports videos with grammars
Motivated by the analogies between languages and sports videos, we introduce a novel
approach for video parsing with grammars. It utilizes compiler techniques for integrating both semantic
annotation and syntactic analysis to generate a semantic index of events and a table of content for a given
sports video. The video sequence is first segmented and annotated by event detection with domain
knowledge. A grammar-based parser is then used to identify the structure of the video content.
Meanwhile, facilities for error handling are introduced which are particularly useful when the results of
automatic parsing need to be adjusted. As a case study, we have developed a system for video parsing in
the particular domain of TV diving programs. Experimental results indicate the proposed approach is
effective.
Video browsing interfaces and applications: a review
We present a comprehensive review of the state of the art in video browsing and retrieval systems, with special emphasis on interfaces and applications. There has been a significant increase in activity (e.g., storage, retrieval, and sharing) employing video data in the past decade, both for personal and professional use. The ever-growing amount of video content available for human consumption and the inherent characteristics of video data (which, if presented in its raw format, is rather unwieldy and costly) have become driving forces for the development of more effective solutions to present video content and allow rich user interaction. As a result, there are many contemporary research efforts toward developing better video browsing solutions, which we summarize. We review more than 40 different video browsing and retrieval interfaces and classify them into three groups: applications that use video-player-like interaction, video retrieval applications, and browsing solutions based on video surrogates. For each category, we present a summary of existing work, highlight the technical aspects of each solution, and compare them against each other.
Automatic Summarization of Soccer Highlights Using Audio-visual Descriptors
Automatic summary generation for sports video content has been an object of
great interest for many years. Although semantic description techniques have
been proposed, many approaches still rely on low-level video descriptors
that yield quite limited results due to the complexity of the problem and
the low capability of the descriptors to represent semantic content. In this
paper, a new approach to automatic highlight summarization of
soccer videos using audio-visual descriptors is presented. The approach is
based on segmenting the video sequence into shots that are then
analyzed to determine their relevance and interest. Of special interest in the
approach is the use of audio information, which provides additional
robustness to the overall performance of the summarization system. For every
video shot, a set of low- and mid-level audio-visual descriptors is computed and
later combined to obtain different relevance measures
based on empirical knowledge rules. The final summary is generated by selecting
the shots with the highest interest according to the specifications of the user
and the results of the relevance measures. A variety of results are presented with
real soccer video sequences that prove the validity of the approach.
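The final selection step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the descriptor names, the weighted-sum combination rule, and the duration budget are all hypothetical stand-ins for the "empirical knowledge rules" and "specifications of the user" that the abstract mentions without detail.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    start: float       # shot start time in seconds
    end: float         # shot end time in seconds
    descriptors: dict  # hypothetical descriptor name -> score in [0, 1]


def relevance(shot, weights):
    """Combine descriptor scores with a weighted sum (an assumed rule)."""
    return sum(weights.get(name, 0.0) * value
               for name, value in shot.descriptors.items())


def summarize(shots, weights, max_duration):
    """Greedily keep the most relevant shots that fit the duration budget."""
    ranked = sorted(shots, key=lambda s: relevance(s, weights), reverse=True)
    chosen, total = [], 0.0
    for shot in ranked:
        duration = shot.end - shot.start
        if total + duration <= max_duration:
            chosen.append(shot)
            total += duration
    # Return the selected shots in their original temporal order.
    return sorted(chosen, key=lambda s: s.start)
```

A greedy pick under a duration budget is only one possible reading of "selecting the shots with the highest interest"; a fixed shot count or a relevance threshold would fit the description equally well.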
Automatic analysis of sport events in video sequences
Recent years have seen an increasing need for automatic sports analysis. Since soccer is one of the most watched sports in the world, action spotting in soccer videos has become one of the main fields of study in computer vision. In this project, we carried out in-depth research and analysis of the state of the art in human recognition and action recognition in soccer videos. Apart from this more theoretical part, we also ran and compared two pioneering models in sports action spotting today: CALF[3] and NetVLAD++[4]. In addition, we modified one hyperparameter, the window size, to evaluate its effect on performance. Finally, we conclude that the NetVLAD++[4] model achieves the highest mAP and that modifying the window size worsens the overall performance of both models, although individual classes can benefit from it, since performance improves in some classes for different window sizes.
Identification, indexing, and retrieval of cardio-pulmonary resuscitation (CPR) video scenes of simulated medical crisis.
Medical simulations, where uncommon clinical situations can be replicated, have been shown to provide more comprehensive training. Simulations involve the use of patient simulators, which are lifelike mannequins. After each session, the physician must manually review and annotate the recordings and then debrief the trainees. This process can be tedious and retrieval of specific video segments should be automated. In this dissertation, we propose a machine learning based approach to detect and classify scenes that involve rhythmic activities such as Cardio-Pulmonary Resuscitation (CPR) from training video sessions simulating medical crises. This application requires different preprocessing techniques from other video applications. In particular, most processing steps require the integration of multiple features such as motion, color, and spatial and temporal constraints. The first step of our approach consists of segmenting the video into shots. This is achieved by extracting color and motion information from each frame and identifying locations where consecutive frames have different features. We propose two different methods to identify shot boundaries. The first one is based on simple thresholding while the second one uses unsupervised learning techniques. The second step of our approach consists of selecting one key frame from each shot and segmenting it into homogeneous regions. Then a few regions of interest are identified for further processing. These regions are selected based on the type of motion of their pixels and their likelihood to be skin-like regions. The regions of interest are tracked and a sequence of observations that encode their motion throughout the shot is extracted. The next step of our approach uses an HMM classifier to discriminate between regions that involve CPR actions and other regions. We experiment with both continuous and discrete HMM.
Finally, to improve the accuracy of our system, we also detect faces in each key frame, track them throughout the shot, and fuse their HMM confidence with the region's confidence. To allow the user to view and analyze the video training session much more efficiently, we have also developed a graphical user interface (GUI) for CPR video scene retrieval and analysis with several desirable features. To validate our proposed approach to detect CPR scenes, we use one video simulation session recorded by the SPARC group to train the HMM classifiers and learn the system's parameters. Then, we analyze the proposed system on other video recordings. We show that our approach can identify most CPR scenes with few false alarms.
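The "simple thresholding" variant of shot boundary detection mentioned in this abstract can be sketched in Python as follows. This is a simplified illustration, not the dissertation's method: it compares only grayscale intensity histograms of consecutive frames (the abstract also uses color and motion features), and the bin count, the L1 distance, and the threshold value are assumptions chosen for the example. Frames are assumed to be non-empty lists of integer pixel values in the range 0 to 255.

```python
def histogram(frame, bins=8, max_value=256):
    """Normalized intensity histogram of one frame (a flat list of pixels)."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    total = len(frame)
    return [count / total for count in counts]


def shot_boundaries(frames, threshold=0.5, bins=8):
    """Return indices of frames whose histogram differs sharply from the
    previous frame's histogram, i.e. candidate first frames of a new shot."""
    boundaries = []
    previous = None
    for index, frame in enumerate(frames):
        current = histogram(frame, bins)
        if previous is not None:
            # L1 distance between consecutive normalized histograms (0 to 2).
            distance = sum(abs(a - b) for a, b in zip(current, previous))
            if distance > threshold:
                boundaries.append(index)
        previous = current
    return boundaries
```

For example, two dark frames followed by two bright frames give a single boundary at the first bright frame, while small intensity changes within the same histogram bin are ignored.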
A Graph-Based Method for Soccer Action Spotting Using Unsupervised Player Classification
Action spotting in soccer videos is the task of identifying the specific time
when a certain key action of the game occurs. Lately, it has received a large
amount of attention and powerful methods have been introduced. Action spotting
involves understanding the dynamics of the game, the complexity of events, and
the variation of video sequences. Most approaches have focused on the latter,
given that their models exploit the global visual features of the sequences. In
this work, we focus on the former by (a) identifying and representing the
players, referees, and goalkeepers as nodes in a graph, and by (b) modeling
their temporal interactions as sequences of graphs. For the player
identification, or player classification task, we obtain an accuracy of 97.72%
in our annotated benchmark. For the action spotting task, our method obtains an
overall performance of 57.83% average-mAP by combining it with other
audiovisual modalities. This performance surpasses similar graph-based methods
and is competitive with computationally heavy methods. Code and data are
available at https://github.com/IPCV/soccer_action_spotting. Comment: Accepted at the 5th International ACM Workshop on Multimedia Content
Analysis in Sports (MMSports 2022).
Deep Learning for Semantic Video Understanding
The field of computer vision has long strived to extract understanding from image and video sequences. The recent flood of video data along with massive increases in computing power have provided the perfect environment to generate advanced research to extract intelligence from video data. Video data is ubiquitous, occurring in numerous everyday activities such as surveillance, traffic, movies, sports, etc. This massive amount of video needs to be analyzed and processed efficiently to extract semantic features towards video understanding. Such capabilities could benefit surveillance, video analytics and visually challenged people. While watching a long video, humans have the uncanny ability to bypass unnecessary information and concentrate on the important events. These key events can be used as a higher-level description or summary of a long video. Inspired by the human visual cortex, this research affords such abilities in computers using neural networks. Useful or interesting events are first extracted from a video and then deep learning methodologies are used to extract natural language summaries for each video sequence. Previous approaches to video description either have been domain specific or use a template-based approach to fill detected objects such as verbs or actions to constitute a grammatically correct sentence. This work involves exploiting temporal contextual information for sentence generation while working on wide domain datasets. Current state-of-the-art video description methodologies are well suited for small video clips whereas this research can also be applied to long sequences of video.
This work proposes methods to generate visual summaries of long videos, and in addition proposes techniques to annotate and generate textual summaries of the videos using recurrent networks. End-to-end video summarization depends heavily on abstractive summarization of video descriptions. State-of-the-art neural language & attention joint models have been used to generate textual summaries. Interesting segments of long video are extracted based on image quality as well as cinematographic and consumer preference. This novel approach will be a stepping stone for a variety of innovative applications such as video retrieval, automatic summarization for visually impaired persons, automatic movie review generation, video question and answering systems.