Semantic analysis of field sports video using a Petri-net of audio-visual concepts
The most common approach to automatic summarisation and highlight detection in sports video is to train a classifier to detect semantic highlights based on occurrences of low-level features such as action replays, excited commentary or changes in a scoreboard. We propose an alternative approach based on the detection of perception concepts (PCs) and the construction of Petri-Nets, which can be used both for semantic description and for event detection within sports videos. Low-level algorithms for the detection of perception concepts from visual, aural and motion characteristics are proposed, and a series of Petri-Nets composed of perception concepts is formally defined to describe video content. We call this a Perception Concept Network-Petri Net (PCN-PN) model. Using PCN-PNs, personalized high-level semantic descriptions of video highlights can be generated and queries on high-level semantics can be answered. A particular strength of this framework is that semantic detectors built on PCN-PNs can easily be used to search sports videos and locate interesting events. Experimental results on recorded video of three types of sports game (soccer, basketball and rugby), each from multiple broadcasters, illustrate the potential of this framework.
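To make the event-detection idea concrete, here is a minimal Python sketch of a Petri net over detected perception concepts: transitions fire once all their input places hold tokens. The concept names and the "goal" pattern are hypothetical illustrations, not the paper's PCN-PN definitions, and the real formalism is considerably richer than this toy.

```python
# Minimal Petri-net sketch: places hold tokens deposited by low-level
# concept detectors; a transition fires when all its input places are
# marked, producing a token for a higher-level event place.

class PetriNet:
    def __init__(self):
        self.tokens = {}       # place name -> token count
        self.transitions = []  # (input places, output places)

    def add_transition(self, inputs, outputs):
        self.transitions.append((tuple(inputs), tuple(outputs)))

    def mark(self, place):
        """Deposit a token when a detector fires for a perception concept."""
        self.tokens[place] = self.tokens.get(place, 0) + 1

    def step(self):
        """Fire every currently enabled transition once."""
        for inputs, outputs in self.transitions:
            if all(self.tokens.get(p, 0) > 0 for p in inputs):
                for p in inputs:
                    self.tokens[p] -= 1
                for p in outputs:
                    self.tokens[p] = self.tokens.get(p, 0) + 1

# Hypothetical soccer "goal" pattern: cheer + replay + scoreboard change.
net = PetriNet()
net.add_transition(["crowd_cheer", "action_replay", "score_change"],
                   ["goal_event"])
for concept in ("crowd_cheer", "action_replay", "score_change"):
    net.mark(concept)                    # pretend the detectors fired
net.step()
print(net.tokens.get("goal_event", 0))   # -> 1
```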
Localization and recognition of the scoreboard in sports video based on SIFT point matching
In broadcast sports video, the scoreboard is attached at a fixed location and generally appears in every frame so that viewers can quickly follow the match's progression. Based on these observations, we present a new method for localizing and recognizing scoreboard text in sports video. The method first matches Scale Invariant Feature Transform (SIFT) points between two frames extracted from a video clip using a modified matching technique, and then localizes the scoreboard by computing a robust estimate of the matched point cloud in a two-stage non-scoreboard filtering process based on domain rules. Next, enhancement operations are applied to the localized scoreboard and a Multi-frame Voting Decision is used; both aim to increase the OCR rate. Experimental results demonstrate the effectiveness and efficiency of the proposed method.
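A rough sketch of the core idea follows, assuming OpenCV 4.4+ with SIFT in the main module: keypoints that match between two distant frames at nearly the same image position likely belong to a static overlay such as the scoreboard. The thresholds and the bounding-box step are simplified stand-ins for the paper's modified matching and two-stage domain-rule filter.

```python
# SIFT matches whose keypoints barely move between two frames indicate
# static overlay content; their bounding box approximates the scoreboard.

import cv2
import numpy as np

def static_overlay_points(frame_a, frame_b, max_shift=2.0, ratio=0.75):
    sift = cv2.SIFT_create()
    kp_a, des_a = sift.detectAndCompute(frame_a, None)
    kp_b, des_b = sift.detectAndCompute(frame_b, None)
    matches = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    pts = []
    for pair in matches:
        if len(pair) < 2:
            continue
        m, n = pair
        if m.distance < ratio * n.distance:           # Lowe's ratio test
            pa = np.array(kp_a[m.queryIdx].pt)
            pb = np.array(kp_b[m.trainIdx].pt)
            if np.linalg.norm(pa - pb) < max_shift:   # static across frames
                pts.append(pa)
    return np.array(pts)

# Hypothetical frame files; any two temporally distant frames will do.
f1 = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)
f2 = cv2.imread("frame_0500.png", cv2.IMREAD_GRAYSCALE)
pts = static_overlay_points(f1, f2)
if len(pts):
    x0, y0 = pts.min(axis=0).astype(int)
    x1, y1 = pts.max(axis=0).astype(int)
    print("scoreboard bbox:", (x0, y0, x1, y1))
```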
Video Fill In the Blank using LR/RL LSTMs with Spatial-Temporal Attentions
Given a video and a description sentence with one missing word (the "source sentence"), the Video Fill-In-the-Blank (VFIB) problem is to find the missing word automatically. Both the contextual information of the sentence and visual cues from the video are important for inferring the missing word accurately. Since the source sentence is broken into two fragments, the left fragment (before the blank) and the right fragment (after the blank), traditional Recurrent Neural Networks cannot encode this structure accurately, because the missing word varies widely in both position and type: it may be the first word of the sentence or sit in the middle, and it may be a verb or an adjective. In this paper, we propose a framework to tackle the textual encoding: two separate LSTMs (the LR and RL LSTMs) encode the left and right sentence fragments, and a novel structure combines each fragment with an "external memory" corresponding to the opposite fragment. For the visual encoding, end-to-end spatial and temporal attention models select discriminative visual representations with which to find the missing word. In the experiments, we demonstrate the superior performance of the proposed method on the challenging VFIB problem. Furthermore, we introduce an extended and more general version of VFIB that is not limited to a single blank. Our experiments indicate that our method generalizes to these more realistic scenarios.
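A minimal PyTorch sketch of the textual-encoding idea is below: one LSTM reads the left fragment left-to-right, another reads the right fragment right-to-left, and their final states are fused to score candidate words. The dimensions are arbitrary, and the paper's external-memory combination and visual attention branches are omitted.

```python
# Two-directional fragment encoding for fill-in-the-blank: the RL LSTM
# consumes the right fragment reversed so both LSTMs read toward the blank.

import torch
import torch.nn as nn

class LRRLEncoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lr_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.rl_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, left_ids, right_ids):
        _, (h_lr, _) = self.lr_lstm(self.embed(left_ids))
        _, (h_rl, _) = self.rl_lstm(self.embed(right_ids.flip(dims=[1])))
        fused = torch.cat([h_lr[-1], h_rl[-1]], dim=-1)
        return self.classifier(fused)    # logits over candidate words

model = LRRLEncoder(vocab_size=10000)
left = torch.randint(0, 10000, (4, 7))   # batch of left-fragment token ids
right = torch.randint(0, 10000, (4, 5))  # batch of right-fragment token ids
print(model(left, right).shape)           # torch.Size([4, 10000])
```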
Learning to Detect Violent Videos using Convolutional Long Short-Term Memory
Developing a technique for the automatic analysis of surveillance videos in order to identify the presence of violence is of broad interest. In this work, we propose a deep neural network for recognizing violent videos. A convolutional neural network is used to extract frame-level features from a video. The frame-level features are then aggregated using a variant of the long short-term memory that uses convolutional gates. The convolutional neural network together with the convolutional long short-term memory can capture localized spatio-temporal features, enabling the analysis of local motion taking place in the video. We also propose using adjacent frame differences as the input to the model, thereby forcing it to encode the changes occurring in the video. The performance of the proposed feature-extraction pipeline is evaluated on three standard benchmark datasets in terms of recognition accuracy. Comparison with state-of-the-art techniques reveals the promising capability of the proposed method in recognizing violent videos.
Comment: Accepted at the International Conference on Advanced Video and Signal Based Surveillance (AVSS 2017).
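The shape of that pipeline can be sketched compactly, as below, under assumed sizes: frame differences feed a small per-frame CNN (a toy stand-in for the paper's pretrained backbone), a ConvLSTM cell whose gates are convolutions aggregates the feature maps while preserving spatial layout, and a linear head classifies the final hidden map.

```python
# Frame differences -> per-frame CNN features -> ConvLSTM aggregation
# -> violence / non-violence logits. All layer sizes are illustrative.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        # One convolution produces all four gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h, c):
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

class ViolenceNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.cnn = nn.Sequential(               # toy stand-in for a backbone
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.convlstm = ConvLSTMCell(64, 64)
        self.head = nn.Linear(64, 2)

    def forward(self, frames):                  # frames: (B, T, 3, H, W)
        diffs = frames[:, 1:] - frames[:, :-1]  # adjacent-frame differences
        B, T, C, H, W = diffs.shape
        h = c = torch.zeros(B, 64, H // 4, W // 4)
        for t in range(T):
            h, c = self.convlstm(self.cnn(diffs[:, t]), h, c)
        return self.head(h.mean(dim=(2, 3)))    # pool the final hidden map

logits = ViolenceNet()(torch.randn(2, 8, 3, 64, 64))
print(logits.shape)                             # torch.Size([2, 2])
```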