10,087 research outputs found
Audio-based event detection for sports video
In this paper, we present an audio-based event detection approach shown to be effective when applied to the Sports broadcast data. The main benefit of this approach is the ability to recognise patterns that indicate high levels of crowd response which can be correlated to key events. By applying Hidden Markov Model-based classifiers, where the predefined content classes are parameterised using Mel-Frequency Cepstral Coefficients, we were able to eliminate the need for defining a heuristic set of rules to determine event detection, thus avoiding a two-class approach shown not to be suitable for this problem. Experimentation indicated that this is an effective method for classifying crowd response in Soccer matches, thus providing a basis for automatic indexing and summarisation
Recommended from our members
Temporal hybridity: Mixing live video footage with instant replay in real time
Copyright @ 2010 ACMIn this paper we explore the production of streaming media that involves live and recorded content. To examine this, we report on how the production practices and process are conducted through an empirical study of the production of live television, involving the use of live and non-live media under highly time critical conditions. In explaining how this process is managed both as an individual and collective activity, we develop the concept of temporal hybridity to
explain the properties of these kinds of production system and show how temporally separated media are used, understood and coordinated. Our analysis is examined in
the light of recent developments in computing technology and we present some design implications to support amateur video production.The research was partly made possible by a grant from the Swedish Governmental Agency for Innovation Systems to the Mobile Life VinnExcellence Center, in partnership with
SonyEricsson, Ericsson, Microsoft Research, Nokia Research, TeliaSonera and the City of Stockholm
Combining Residual Networks with LSTMs for Lipreading
We propose an end-to-end deep learning architecture for word-level visual
speech recognition. The system is a combination of spatiotemporal
convolutional, residual and bidirectional Long Short-Term Memory networks. We
train and evaluate it on the Lipreading In-The-Wild benchmark, a challenging
database of 500-size target-words consisting of 1.28sec video excerpts from BBC
TV broadcasts. The proposed network attains word accuracy equal to 83.0,
yielding 6.8 absolute improvement over the current state-of-the-art, without
using information about word boundaries during training or testing.Comment: Submitted to Interspeech 201
An audio-based sports video segmentation and event detection algorithm
In this paper, we present an audio-based event detection algorithm shown to be effective when applied to Soccer video. The main benefit of this approach is the ability to recognise patterns that display high levels of crowd response correlated to key events. The soundtrack from a Soccer sequence is first parameterised using Mel-frequency Cepstral coefficients. It is then segmented into homogenous components using a windowing algorithm with a decision process based on Bayesian model selection. This decision process eliminated the need for defining a heuristic set of rules for segmentation. Each audio segment is then labelled using a series of Hidden Markov model (HMM) classifiers, each a representation of one of 6 predefined semantic content classes found in Soccer video. Exciting events are identified as those segments belonging to a crowd cheering class. Experimentation indicated that the algorithm was more effective for classifying crowd response when compared to traditional model-based segmentation and classification techniques
Face detection and clustering for video indexing applications
This paper describes a method for automatically detecting human faces in generic video sequences. We employ an iterative algorithm in order to give a confidence measure for the presence or absence of faces within video shots. Skin colour filtering is carried out on a selected number of frames per video shot, followed by the application of shape and size heuristics. Finally, the remaining candidate regions are normalized and projected into an eigenspace, the reconstruction error being the measure of confidence for presence/absence of face. Following this, the confidence score for the entire video shot is calculated. In order to cluster extracted faces into a set of face classes, we employ an incremental procedure using a PCA-based dissimilarity measure in con-junction with spatio-temporal correlation. Experiments were carried out on a representative broadcast news test corpus
Indexing, browsing and searching of digital video
Video is a communications medium that normally brings together moving pictures with a synchronised audio track into a discrete piece or pieces of information. The size of a âpiece â of video can variously be referred to as a frame, a shot, a scene, a clip, a programme or an episode, and these are distinguished by their lengths and by their composition. We shall return to the definition of each of these in section 4 this chapter. In modern society, video is ver
A new framework for sign language recognition based on 3D handshape identification and linguistic modeling
Current approaches to sign recognition by computer generally have at least some of the following limitations: they rely on laboratory
conditions for sign production, are limited to a small vocabulary, rely on 2D modeling (and therefore cannot deal with occlusions
and off-plane rotations), and/or achieve limited success. Here we propose a new framework that (1) provides a new tracking method
less dependent than others on laboratory conditions and able to deal with variations in background and skin regions (such as the
face, forearms, or other hands); (2) allows for identification of 3D hand configurations that are linguistically important in American
Sign Language (ASL); and (3) incorporates statistical information reflecting linguistic constraints in sign production. For purposes of
large-scale computer-based sign language recognition from video, the ability to distinguish hand configurations accurately is critical.
Our current method estimates the 3D hand configuration to distinguish among 77 hand configurations linguistically relevant for
ASL. Constraining the problem in this way makes recognition of 3D hand configuration more tractable and provides the information
specifically needed for sign recognition. Further improvements are obtained by incorporation of statistical information about linguistic
dependencies among handshapes within a sign derived from an annotated corpus of almost 10,000 sign tokens
Content-based Video Retrieval by Integrating Spatio-Temporal and Stochastic Recognition of Events
As amounts of publicly available video data grow the need to query this data efficiently becomes significant. Consequently content-based retrieval of video data turns out to be a challenging and important problem. We address the specific aspect of inferring semantics automatically from raw video data. In particular, we introduce a new video data model that supports the integrated use of two different approaches for mapping low-level features to high-level concepts. Firstly, the model is extended with a rule-based approach that supports spatio-temporal formalization of high-level concepts, and then with a stochastic approach. Furthermore, results on real tennis video data are presented, demonstrating the validity of both approaches, as well us advantages of their integrated us
- âŠ