Audio-visual football video analysis, from structure detection to attention analysis
Sports video is an important video genre. Content-based sports video analysis attracts great interest from both industry and academia. A sports video is characterised by repetitive temporal structures, relatively plain content, and strong spatio-temporal variations, such as quick camera switches and swift local motions. Specific techniques for content-based sports video analysis are needed to exploit these characteristics.
For an efficient and effective sports video analysis system, there are three fundamental questions: (1) what are the key stories in sports videos; (2) what arouses viewers’ interest; and (3) how to identify game highlights. This thesis is developed around these questions. We approach them from two different perspectives and present three research contributions: replay detection, attack temporal structure decomposition, and attention-based highlight identification.
Replay segments convey the most important content in sports videos, so detecting them is an efficient way to collect game highlights. However, replay is an artefact of editing, which improves with advances in video editing tools. The composition of a replay is complex and includes logo transitions, slow motion, viewpoint switches and normal-speed video clips. Since logo transition clips are pervasive in the game collections of FIFA World Cup 2002, FIFA World Cup 2006 and UEFA Championship 2006, we take logo transition detection as an effective replacement for replay detection. A two-pass system was developed, comprising a five-layer AdaBoost classifier and logo template matching throughout an entire video. The five-layer AdaBoost classifier uses shot duration, average game pitch ratio, average motion, sequential colour histogram and shot frequency between two neighbouring logo transitions to filter logo transition candidates. Subsequently, a logo template is constructed and employed to find all logo transition sequences. The precision and recall of this system in replay detection are both 100% on a five-game evaluation collection.
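The second pass described above (build a logo template, then scan the whole video for it) can be illustrated with a minimal sketch. The abstract does not say how the template is built or compared, so the pixel-wise mean template, the normalised cross-correlation score, the `threshold` value and the representation of frames as flat lists of grey levels are all assumptions made for illustration only.

```python
from math import sqrt

def mean_template(frames):
    """Average candidate logo frames pixel-wise into one template (assumed construction)."""
    n = len(frames)
    return [sum(pixels) / n for pixels in zip(*frames)]

def ncc(a, b):
    """Normalised cross-correlation between two equal-length pixel lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    # A completely flat frame has zero variance; treat it as "no match".
    return 0.0 if denom == 0 else sum(x * y for x, y in zip(da, db)) / denom

def find_logo_frames(video, template, threshold=0.9):
    """Return indices of frames whose similarity to the template reaches the threshold."""
    return [i for i, frame in enumerate(video) if ncc(frame, template) >= threshold]

# Toy 4-pixel "frames"; real frames would be full images.
candidates = [[0, 255, 0, 255], [10, 245, 5, 250]]   # output of the first pass (hypothetical)
video = [[100, 100, 100, 100], [0, 250, 10, 255], [255, 0, 255, 0], [3, 240, 0, 250]]
template = mean_template(candidates)
print(find_logo_frames(video, template))  # [1, 3]
```

Correlation is used here because it is insensitive to overall brightness changes; any other similarity measure could be substituted without changing the two-pass structure.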
An attack structure is a team’s attempt to score. Hence, this structure is a conceptually fundamental unit of a football video as well as other sports videos. We review the literature on content-based temporal structures, such as the play-break structure, and develop a three-step system for automatic attack structure decomposition. Four content-based shot classes, namely, play, focus, replay and break, were identified from low-level visual features. A four-state hidden Markov model was trained to simulate the transition processes among these shot classes. Since attack structures are the longest repetitive temporal unit in a sports video, a suffix tree is proposed to find the longest repetitive substring in the label sequence of shot class transitions. The occurrences of this substring are regarded as the kernel of an attack hidden Markov process. Therefore, the decomposition of attack structure becomes a boundary likelihood comparison between two Markov chains.
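The search for the longest repetitive substring in a shot-class label sequence can be sketched as follows. The thesis proposes a suffix tree; this sketch swaps in a naive sorted-suffix (suffix array) search, which gives the same answer on short sequences but is slower. The single-letter encoding (P = play, F = focus, R = replay, B = break) and the example sequence are hypothetical.

```python
def longest_repeated_substring(labels: str) -> str:
    """Return the longest substring occurring at least twice (occurrences may overlap)."""
    n = len(labels)
    # Naive suffix array: suffix start positions sorted by the suffix text.
    suffixes = sorted(range(n), key=lambda i: labels[i:])
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        # Length of the common prefix of two lexicographically adjacent suffixes.
        k = 0
        while a + k < n and b + k < n and labels[a + k] == labels[b + k]:
            k += 1
        if k > len(best):
            best = labels[a:a + k]
    return best

def occurrences(labels: str, pattern: str) -> list:
    """Return every start index at which the pattern occurs in the label sequence."""
    return [i for i in range(len(labels) - len(pattern) + 1)
            if labels.startswith(pattern, i)]

# Hypothetical label sequence: P = play, F = focus, R = replay, B = break.
sequence = "PFRBPBPFRBB"
kernel = longest_repeated_substring(sequence)
print(kernel, occurrences(sequence, kernel))  # PFRB [0, 6]
```

The start indices returned by `occurrences` are the positions that would serve as kernels of candidate attack structures in the label sequence.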
Highlights are what attract notice, and attention is a psychological measure of “notice”. We present a brief survey of the psychological background of attention, attention estimation from visual and auditory signals, and multi-modal attention fusion. We propose two attention models for sports video analysis, namely, the role-based attention model and the multiresolution autoregressive (MAR) framework. The role-based attention model is based on the structure of perception while watching video. This model removes reflection bias among modality salient signals and combines these signals by reflectors. The MAR framework treats salient signals as a group of smooth random processes, which follow a similar trend but are filled with noise. It estimates a noise-free signal from these coarse noisy observations by multiresolution analysis. Related algorithms are developed, such as event segmentation on a MAR tree and real-time event detection. The experiment shows that these attention-based approaches can find goal events with high precision. Moreover, the results of MAR-based highlight detection on the final games of FIFA World Cup 2002 and 2006 are highly similar to the highlights professionally labelled by the BBC and FIFA.
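The idea of viewing noisy salient signals at several resolutions can be illustrated with a toy sketch. This is not the MAR estimator from the thesis: it simply averages two hypothetical modality signals and builds coarser versions by pairwise averaging, to show how a shared trend stands out at a coarse scale. The function names and signal values are invented for illustration.

```python
def fuse(signals):
    """Average several modality signals sample by sample (assumed fusion rule)."""
    return [sum(values) / len(values) for values in zip(*signals)]

def coarsen(signal):
    """Halve the resolution by averaging neighbouring pairs; a trailing odd sample is kept."""
    out = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal) - 1, 2)]
    if len(signal) % 2:
        out.append(signal[-1])
    return out

def pyramid(signal, levels):
    """Return the signal at successive resolutions, finest first."""
    scales = [list(signal)]
    for _ in range(levels):
        scales.append(coarsen(scales[-1]))
    return scales

# Hypothetical audio and visual saliency curves sharing one peak.
audio = [0, 0, 1, 1, 8, 10, 1, 0]
visual = [0, 2, 1, 1, 10, 8, 1, 2]
scales = pyramid(fuse([audio, visual]), levels=2)
print(scales[1])  # [0.5, 1.0, 9.0, 1.0]
print(scales[2])  # [0.75, 5.0]
```

At the middle level the peak at index 2 corresponds to original samples 4 and 5, where both modalities agree, while the small disagreements elsewhere are averaged away.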
View-invariant gait person re-identification with spatial and temporal attention
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London.
Person re-identification at a distance across multiple non-overlapping cameras has
been an active research area for years. In the past ten years, short-term person Re-ID
techniques have made great strides in terms of accuracy using only appearance features
in limited environments. However, massive intra-class variations and inter-class
confusion limit their ability to be used in practical applications. Moreover, appearance
consistency can only be assumed in a short time span from one camera to the other.
Since holistic appearance changes drastically over days and weeks, the techniques
mentioned above become ineffective. Practical applications usually require a
long-term solution in which the subject appearance and clothing might have changed
after a significant period has elapsed. To address these problems, soft biometric features
such as gait have been proposed in the past. Nevertheless, even gait can vary with
illness, ageing and changes in emotional state, walking surface, shoe
type, clothing, objects carried by the subject and even clutter in the scene. Therefore,
gait is considered a temporal cue that could provide biometric motion information.
On the other hand, the shape of the human body could be viewed as a spatial signal
which can produce valuable information. So, extracting discriminative features from
both spatial and temporal domains would be very beneficial to this research. Therefore,
this thesis focuses on finding the best and most robust method to tackle the gait-based person re-identification problem and solve it for practical applications.
In real-world
surveillance scenarios, the human gait cycle is mostly abnormal. These abnormalities
include, but are not limited to, changes in temporal and spatial characteristics such as
walking speed, broken gait phase and, most importantly, varied camera angles. Our
work includes an extensive literature study of spatial and temporal gait feature extraction
methods with a focus on deep learning. Next, we conducted a comparative
study and proposed a spatial-temporal approach for gait feature extraction using the
fusion of multiple modalities, including optical-flow, raw silhouettes and RGB images.
This approach was tested on two of the most challenging publicly available datasets for
gait recognition, TUM-GAID and CASIA-B, with excellent results presented in Chapter 3.
Furthermore, a modern spatial-temporal attention mechanism that learns salient features
independent of the gait cycle and view variations was proposed and tested on the
CASIA-B and OULP datasets. The spatial attention layer in the proposed method
extracts spatial feature maps using a two-layered architecture, and these maps are combined by
late fusion. Using these feature maps, the layer attends discriminatively to the
identity-related salient regions in silhouette sequences. The temporal attention layer
consists of an LSTM that encodes the temporal motion for silhouette sequences. It
uses the encoded output vectors in the temporal attention architecture to focus on the
most critical timesteps in the gait cycle and discard the rest. Furthermore, we improved
the performance of our method by mapping our extracted spatial-temporal gait
features to a discriminative null space for use in our Siamese architecture for cross-matching.
We also conducted an ablation experiment, removing each segment of our
spatial-temporal attention network in turn, to gain insight into each component’s contribution to performance. Our method showed outstanding robustness against abnormal
gait cycles as well as viewpoint variations on both benchmark datasets.
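The temporal attention step described above, which weights the encoded per-timestep vectors so that the most critical timesteps dominate, can be sketched as soft attention pooling. This is a minimal illustration and not the architecture from the thesis: the encoded vectors stand in for LSTM outputs, the scoring vector `w` stands in for learned parameters, and soft attention down-weights unimportant timesteps rather than discarding them outright.

```python
from math import exp

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention(encoded, w):
    """Pool per-timestep vectors into one vector using attention weights.

    encoded: list of equal-length vectors, one per timestep (assumed LSTM outputs).
    w: scoring vector of the same length (hypothetical learned parameters).
    """
    # One relevance score per timestep: dot product with the scoring vector.
    scores = [sum(h_d * w_d for h_d, w_d in zip(h, w)) for h in encoded]
    weights = softmax(scores)
    dim = len(encoded[0])
    # Weighted sum over timesteps, computed independently for each dimension.
    pooled = [sum(a * h[d] for a, h in zip(weights, encoded)) for d in range(dim)]
    return weights, pooled

# Three timesteps with 2-dimensional encodings (toy values).
encoded = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
weights, pooled = temporal_attention(encoded, w=[5.0, 0.0])
print(weights)
print(pooled)
```

With this scoring vector, the first and third timesteps receive nearly all of the weight, so the pooled vector is dominated by their encoding.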