End-to-End Temporal Action Detection using Bag of Discriminant Snippets (BoDS)

Abstract

Detecting human actions in long untrimmed videos is a challenging problem. Existing temporal action detection methods have difficulty finding the precise starting and ending times of actions in untrimmed videos. In this letter, we propose a temporal action detection framework based on a Bag of Discriminant Snippets (BoDS) that can detect multiple actions in an end-to-end manner. BoDS is based on the observation that multiple action classes and the background class share similar snippets, which cause incorrect classification of action regions and imprecise boundaries. We solve this issue by finding the key snippets from the training data of each class and computing their discriminative power, which is used in the BoDS encoding. During testing of an untrimmed video, we compute the BoDS representation for multiple candidate proposals and assign their class labels based on a majority voting scheme. We test BoDS on the Thumos14 and ActivityNet datasets and obtain state-of-the-art results. For the sports subset of the ActivityNet dataset, we obtain a mean Average Precision (mAP) of 29% at a 0.7 temporal intersection over union (tIoU) threshold. For the Thumos14 dataset, we obtain a significant gain in mAP, improving from 20.8% to 31.6% at tIoU=0.7.

This work was supported by the ASR&TD, University of Engineering and Technology (UET) Taxila, Pakistan. The work of S. A. Velastin was supported by the Universidad Carlos III de Madrid, the European Union's Seventh Framework Programme for research, technological development, and demonstration under Grant 600371, el Ministerio de Economía y Competitividad (COFUND2013-51509), and Banco Santander.
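The majority-voting step described above can be sketched minimally: if each snippet of a candidate proposal has been assigned a class label (e.g., by matching it against each class's discriminant key snippets), the proposal's label is the most-voted class. The function name and data shapes below are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def classify_proposal(snippet_labels):
    """Majority vote over per-snippet labels (hypothetical sketch).

    snippet_labels: list of class-label strings, one per snippet
    returns: (majority label, its fraction of the votes)
    """
    votes = Counter(snippet_labels)
    label, count = votes.most_common(1)[0]
    return label, count / len(snippet_labels)

# Usage: a proposal whose snippets were labelled by a snippet-level classifier.
label, share = classify_proposal(["diving", "diving", "background", "diving"])
# label == "diving", share == 0.75
```

In practice the per-snippet labels would come from the BoDS encoding itself; the voting rule only decides the final proposal label once those labels exist.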
