Temporal Extension of Scale Pyramid and Spatial Pyramid Matching for Action Recognition
Historically, researchers in the field have spent a great deal of effort to
create image representations that have scale invariance and retain spatial
location information. This paper proposes to encode equivalent temporal
characteristics in video representations for action recognition. To achieve
temporal scale invariance, we develop a method called temporal scale pyramid
(TSP). To encode temporal information, we present and compare two methods
called temporal extension descriptor (TED) and temporal division pyramid
(TDP). Our purpose is to suggest solutions for matching complex actions that have
large variation in velocity and appearance, which is missing from most current
action representations. Experimental results on four benchmark datasets,
UCF50, HMDB51, Hollywood2 and Olympic Sports, support our approach, which
significantly outperforms state-of-the-art methods. Most notably, we achieve
65.0% mean accuracy and 68.2% mean average precision on the challenging HMDB51
and Hollywood2 datasets, absolute improvements over the state of the art of
7.8% and 3.9%, respectively.
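The temporal division pyramid (TDP) idea of pooling descriptors over ever-finer temporal partitions can be illustrated with a minimal sketch. The function name, the average-pooling choice, and the pyramid levels below are my own simplifications, not the paper's exact encoding:

```python
def temporal_division_pyramid(frame_features, levels=(1, 2, 4)):
    """Pool per-frame feature vectors over a pyramid of temporal divisions.

    frame_features: list of equal-length feature vectors, one per frame
    (assumes at least as many frames as the finest pyramid level).
    Returns the concatenation of the average-pooled vector of every cell,
    so coarse levels capture the whole clip and fine levels capture order.
    """
    n = len(frame_features)
    dim = len(frame_features[0])
    pooled = []
    for cells in levels:
        for c in range(cells):
            # Split the timeline into `cells` roughly equal segments.
            start = c * n // cells
            end = (c + 1) * n // cells
            cell = frame_features[start:end]
            # Average-pool each feature dimension within the segment.
            pooled.extend(
                sum(v[d] for v in cell) / len(cell) for d in range(dim)
            )
    return pooled
```

With levels (1, 2, 4) and d-dimensional frame features, the pyramid representation has 7d dimensions; matching two videos then compares corresponding cells, which preserves coarse temporal layout.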
Action Recognition in Videos: from Motion Capture Labs to the Web
This paper presents a survey of human action recognition approaches based on
visual data recorded from a single video camera. We propose an organizing
framework which puts in evidence the evolution of the area, with techniques
moving from heavily constrained motion capture scenarios towards more
challenging, realistic, "in the wild" videos. The proposed organization is
based on the representation used as input for the recognition task, emphasizing
the hypotheses assumed and, thus, the constraints imposed on the type of video
that each technique is able to address. Making these hypotheses and
constraints explicit renders the framework particularly useful for selecting a method, given
an application. Another advantage of the proposed organization is that it
allows categorizing the newest approaches seamlessly alongside traditional ones, while
providing an insightful perspective of the evolution of the action recognition
task up to now. That perspective is the basis for the discussion at the end of
the paper, where we also present the main open issues in the area.
Comment: Preprint submitted to CVIU, survey paper, 46 pages, 2 figures, 4 tables
TNO at TRECVID 2013 : multimedia event detection and instance search
We describe the TNO system and the evaluation results for the TRECVID 2013 Multimedia Event Detection (MED) and instance search (INS) tasks. The MED system consists of a bag-of-words (BOW) approach with spatial tiling that uses low-level static and dynamic visual features, an audio feature, and high-level concepts. Automatic speech recognition (ASR) and optical character recognition (OCR) are not used in the system. In the MED case with 100 example training videos, support vector machines (SVMs) are trained and fused to detect an event in the test set. In the case with 0 example videos, positive and negative concepts are extracted as keywords from the textual event description, and events are detected with the high-level concepts. The MED results show that the SIFT keypoint descriptor contributes most to the results, that fusion of multiple low-level features helps to improve performance, and that the textual event-description chain currently performs poorly. The TNO INS system presents a baseline open-source approach using standard SIFT keypoint detection and exhaustive matching. In order to speed up search times for queries, a basic map-reduce scheme is presented for use on a multi-node cluster. Our INS results show above-median results with acceptable search times.

This research for the MED submission was performed in the GOOSE project, which is jointly funded by the enabling technology program Adaptive Multi Sensor Networks (AMSN) and the MIST research program of the Dutch Ministry of Defense. The INS submission was partly supported by the MIME project of the creative industries knowledge and innovation network CLICKNL. Peer-reviewed.
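The fusion step described above, combining per-feature SVM event scores into one detection score, can be sketched as a simple weighted average. The function name and uniform-weight default are illustrative assumptions; the actual TNO fusion scheme is not specified at this level of detail:

```python
def fuse_event_scores(score_lists, weights=None):
    """Late-fuse per-feature detector scores by weighted averaging.

    score_lists: dict mapping a feature name (e.g. a SIFT-based or
    audio-based SVM) to a list of decision scores, one per test video.
    weights: optional dict of fusion weights; uniform if omitted.
    Returns one fused score per test video.
    """
    names = sorted(score_lists)
    if weights is None:
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    n_videos = len(score_lists[names[0]])
    return [
        sum(weights[name] * score_lists[name][i] for name in names) / total
        for i in range(n_videos)
    ]
```

In practice such fusion weights would be tuned on held-out training events; the sketch only shows why combining complementary low-level features can lift performance over any single descriptor.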
Towards Segment-level Video Understanding: Detecting Activities from Untrimmed Videos
We generate massive amounts of video data every day. While most real-world videos are long and untrimmed, with sparsely localized segments of interest, existing AI systems that can interpret videos today often rely on static image analysis or can only process temporal information in a short video snippet. To automatically understand the content of long video streams, this thesis describes efforts to design accurate, efficient, and intelligent deep learning algorithms for temporal activity detection in untrimmed videos. Detecting segments of interest from untrimmed videos is a key step towards segment-level video understanding. Depending on the purpose of the task being performed, we address three different activity detection tasks: detecting activities of interest from videos without specific purposes (i.e., temporal activity detection); detecting the temporal segment that best corresponds to a language query (i.e., natural language moment retrieval); and detecting activities given less supervision (i.e., weakly-supervised or few-shot activity detection).

In temporal activity detection, we first propose a highly unified single-shot temporal activity detector based on fully 3D convolutional networks, eliminating explicit temporal proposal and classification stages. Evaluations show that it achieves state-of-the-art performance on temporal activity detection while being efficient enough to operate at 1271 FPS. We then investigate how to effectively apply a multi-scale architecture to model activities with various temporal lengths and frequencies. We propose three novel architecture designs: (1) dynamic temporal sampling; (2) a two-branch feature hierarchy; and (3) multi-scale contextual feature fusion. Combining all these components into a uniform network, we achieve state-of-the-art performance on a much larger temporal activity detection benchmark.

In natural language moment retrieval, we aim to localize the segment that best corresponds to a given language query. We present a language-guided temporal attention module and an iterative graph adjustment network to handle the semantic and structural misalignment between video and language. The proposed model demonstrates a superior capability to handle temporal relations and thus improves the state of the art by a large margin.

Finally, we study the problem of weakly-supervised and few-shot temporal activity detection, to mitigate the huge amounts of supervision needed to train a temporal detection model. Namely, we ask whether we can learn a temporal activity detector under weak supervision that is able to localize unseen activity classes. We accordingly propose a novel meta-learning based detection method that adopts the few-shot learning technique of Relation Network. Results show that our method achieves performance superior or comparable to state-of-the-art approaches that use stronger supervision.

In summary, we propose a suite of algorithms and solutions to automatically detect segments of interest in long untrimmed videos. We hope our studies provide insights for researchers exploring new deep learning paradigms in future computer vision research, especially on video-related topics.
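The core of language-guided temporal attention can be illustrated with a minimal sketch: score each video segment against a query embedding, turn the scores into a softmax distribution, and pool the segments by those weights. The function name, dot-product scoring, and fixed embeddings are my own assumptions; the thesis's actual module is learned end-to-end with an iterative graph adjustment network on top:

```python
import math

def language_guided_attention(segment_features, query_embedding):
    """Weight video segments by their similarity to a language query.

    segment_features: list of per-segment feature vectors.
    query_embedding: a vector representing the language query.
    Returns (attention_weights, attended_feature): a softmax over the
    dot products of each segment with the query gives the weights, and
    the attended feature is the weighted sum of segment features.
    """
    scores = [sum(s * q for s, q in zip(seg, query_embedding))
              for seg in segment_features]
    # Numerically stable softmax over the segment scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(segment_features[0])
    attended = [sum(w * seg[d] for w, seg in zip(weights, segment_features))
                for d in range(dim)]
    return weights, attended
```

The attention weights themselves localize the query: the segment whose features best match the query embedding receives the largest weight, which is the behavior a moment-retrieval head is trained to sharpen.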
Discriminatively Trained Latent Ordinal Model for Video Classification
We study the problem of video classification for facial analysis and human
action recognition. We propose a novel weakly supervised learning method that
models the video as a sequence of automatically mined, discriminative
sub-events (e.g., onset and offset phases for "smile", running and jumping for
"high jump"). The proposed model is inspired by recent work on Multiple
Instance Learning and latent SVM/HCRF -- it extends such frameworks to
approximately model the ordinal aspect of the videos. We obtain consistent
improvements over relevant competitive baselines on four challenging and
publicly available video based facial analysis datasets for prediction of
expression, clinical pain and intent in dyadic conversations and on three
challenging human action datasets. We also validate the method with qualitative
results and show that they largely support the intuitions behind the method.
Comment: Paper accepted in IEEE TPAMI. arXiv admin note: substantial text
overlap with arXiv:1604.0150
Detecting events and key actors in multi-person videos
Multi-person event recognition is a challenging task, often with many people
active in the scene but only a small subset contributing to an actual event. In
this paper, we propose a model which learns to detect events in such videos
while automatically "attending" to the people responsible for the event. Our
model does not use explicit annotations regarding who or where those people are
during training and testing. In particular, we track people in videos and use a
recurrent neural network (RNN) to represent the track features. We learn
time-varying attention weights to combine these features at each time-instant.
The attended features are then processed using another RNN for event
detection/classification. Since most video datasets with multiple people are
restricted to a small number of videos, we also collected a new basketball
dataset comprising 257 basketball games with 14K event annotations
corresponding to 11 event classes. Our model outperforms state-of-the-art
methods for both event classification and detection on this new dataset.
Additionally, we show that the attention mechanism is able to consistently
localize the relevant players.
Comment: Accepted for publication in CVPR'1
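The paper's per-time-step attention over person tracks can be sketched as follows. This is a simplified stand-in: the scoring vector here is fixed, whereas in the paper the scores come from learned, time-varying parameters, and the attended features then feed a second RNN for event classification:

```python
import math

def attend_over_tracks(track_features, score_vector):
    """Combine per-player track features into one vector per time step.

    track_features: nested lists indexed as [time][player][dim].
    score_vector: a vector that scores each player's feature (a learned
    quantity in the real model; fixed here for illustration). A softmax
    over the per-player scores gives the attention weights at each step.
    Returns one attended feature vector per time step.
    """
    attended_sequence = []
    for players in track_features:
        scores = [sum(f * s for f, s in zip(feat, score_vector))
                  for feat in players]
        # Softmax over players at this time instant.
        m = max(scores)
        exps = [math.exp(x - m) for x in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        dim = len(players[0])
        attended_sequence.append(
            [sum(w * feat[d] for w, feat in zip(weights, players))
             for d in range(dim)]
        )
    return attended_sequence
```

Because the weights are recomputed at every time instant, attention can shift between players as an event unfolds, which is what lets the model "attend" to the shooter without any person-level annotation.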