Learning to Discriminate Information for Online Action Detection
Online action detection aims to identify actions in the present from a streaming
video. For this task, previous methods use recurrent networks to model the
temporal sequence of current action frames. However, these methods overlook the
fact that an input image sequence includes background frames and irrelevant
actions as well as the action of interest. In this paper, we propose a novel
recurrent unit for online action detection that explicitly discriminates the
information relevant to an ongoing action from the rest. Our unit, named the
Information Discrimination Unit (IDU), decides whether to accumulate input
information based on its relevance to the current action. This enables a
recurrent network built with IDUs to learn a more discriminative representation for
identifying ongoing actions. In experiments on two benchmark datasets, TVSeries
and THUMOS-14, the proposed method outperforms state-of-the-art methods by a
significant margin. Moreover, we demonstrate the effectiveness of our recurrent
unit through comprehensive ablation studies.
Comment: To appear in CVPR 2020
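To make the gating idea concrete, here is a minimal sketch of a GRU-like recurrent cell with an added relevance gate that scales how much of each incoming frame feature is accumulated into the hidden state. The class name, the form of the gate, and the feature dimensions are illustrative assumptions, not the authors' exact IDU formulation.

```python
import torch
import torch.nn as nn

class InformationDiscriminationUnit(nn.Module):
    """Sketch of a recurrent cell with a relevance gate (assumed form).

    The gate compares the incoming frame feature with the hidden state,
    used here as a proxy for the ongoing action's context, and scales
    how much of the input is accumulated.
    """

    def __init__(self, input_dim: int, hidden_dim: int):
        super().__init__()
        self.gru_cell = nn.GRUCell(input_dim, hidden_dim)
        # Relevance gate: emits a scalar in (0, 1) per frame.
        self.relevance = nn.Sequential(
            nn.Linear(input_dim + hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # Decide how relevant this frame is to the ongoing action.
        r = self.relevance(torch.cat([x, h], dim=-1))
        # Accumulate only the relevant portion of the input.
        return self.gru_cell(r * x, h)

# Usage: run the unit over a stream of per-frame features.
unit = InformationDiscriminationUnit(input_dim=2048, hidden_dim=512)
h = torch.zeros(1, 512)
for _ in range(10):
    frame_feature = torch.randn(1, 2048)  # e.g. a CNN feature per frame
    h = unit(frame_feature, h)
```

Because the gate is trained end to end with the detector, frames belonging to background or unrelated actions can be suppressed before they contaminate the accumulated representation.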
Continual Transformers: Redundancy-Free Attention for Online Inference
Transformers in their common form are inherently limited to operating on whole
token sequences rather than on one token at a time. Consequently, their use
during online inference on time-series data entails considerable redundancy due
to the overlap in successive token sequences. In this work, we propose novel
formulations of the Scaled Dot-Product Attention, which enable Transformers to
perform efficient online token-by-token inference on a continual input stream.
Importantly, our modifications are purely to the order of computations, while
the outputs and learned weights are identical to those of the original
Transformer Encoder. We validate our Continual Transformer Encoder with
experiments on the THUMOS14, TVSeries and GTZAN datasets with remarkable
results: Our Continual one- and two-block architectures reduce the
floating-point operations per prediction by up to 63x and 2.6x, respectively,
while retaining predictive performance.
Comment: 15 pages, 6 figures, 7 tables
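To make the redundancy argument concrete, below is a minimal sketch of single-output scaled dot-product attention over a sliding window, computed token by token: the key and value projections of earlier tokens are cached and reused, so each step computes only the newest query's attention row instead of recomputing attention for the whole window. The class name, API, and weight shapes are assumptions for illustration; the paper's formulation also covers the multi-block case.

```python
import torch
from collections import deque

class ContinualSingleOutputAttention:
    """Sketch of redundancy-free attention over a sliding window."""

    def __init__(self, w_q, w_k, w_v, window: int):
        self.w_q, self.w_k, self.w_v = w_q, w_k, w_v
        self.keys = deque(maxlen=window)    # cached key projections
        self.values = deque(maxlen=window)  # cached value projections
        self.scale = w_k.shape[1] ** 0.5

    def step(self, x: torch.Tensor) -> torch.Tensor:
        # Project the new token once; earlier projections are reused.
        self.keys.append(x @ self.w_k)
        self.values.append(x @ self.w_v)
        q = x @ self.w_q
        K = torch.stack(list(self.keys))
        V = torch.stack(list(self.values))
        # One attention row per step instead of a full window recompute.
        attn = torch.softmax(K @ q / self.scale, dim=0)
        return attn @ V

# Usage on a toy stream of 8-dimensional token features.
torch.manual_seed(0)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
att = ContinualSingleOutputAttention(w_q, w_k, w_v, window=4)
for _ in range(6):
    out = att.step(torch.randn(8))  # one prediction per incoming token
```

In this sketch the output matches what full attention over the window would produce for its newest token; the savings come purely from caching and reordering computations, in line with the paper's claim that only the order of computations changes while outputs and weights stay identical.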