To balance the annotation labor and the granularity of supervision,
single-frame annotation has been introduced in temporal action localization. It
provides a rough temporal location for an action but implicitly overstates the
supervision from the annotated frame during training, leading to confusion
between actions and backgrounds, i.e., action incompleteness and background
false positives. To tackle these two challenges, we present the
Snippet Classification model and the Dilation-Erosion module. In the
Dilation-Erosion module, we expand the potential action segments with a loose
criterion to alleviate the problem of action incompleteness and then remove the
background from the potential action segments to alleviate the problem of
background false positives. Relying on the single-frame annotation and the output of
the Snippet Classification model, the Dilation-Erosion module mines pseudo
snippet-level ground truth, hard backgrounds, and evident backgrounds, which in
turn are used to further train the Snippet Classification model, forming a cyclic
dependency. Furthermore, we propose a new embedding loss to aggregate the
features of action instances with the same label and separate the features of
actions from backgrounds. Experiments on THUMOS14 and ActivityNet 1.2 validate
the effectiveness of the proposed method. Code has been made publicly available
(https://github.com/LingJun123/single-frame-TAL).
Comment: 28 pages, 8 figures
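
The dilate-then-erode idea described above can be illustrated with a minimal sketch. This is not the released implementation: the function name `dilate_erode`, the two thresholds `loose_thr` and `strict_thr`, and the integer label codes are assumptions made only for illustration. Given per-snippet action scores from a classifier and the indices of annotated frames, the sketch expands a segment around each annotated frame under a loose threshold (dilation), then marks low-scoring snippets inside the segment as hard background (erosion); everything outside the expanded segments is treated as evident background.

```python
import numpy as np

# Hypothetical label codes, chosen for this sketch only.
ACTION, EVIDENT_BG, HARD_BG = 1, 0, -1


def dilate_erode(scores, annotated, loose_thr=0.3, strict_thr=0.6):
    """Mine pseudo snippet-level labels from scores and annotated frames.

    scores:     1-D sequence of per-snippet action scores in [0, 1].
    annotated:  indices of single-frame annotations (assumed in range).
    loose_thr:  assumed threshold used when expanding segments (dilation).
    strict_thr: assumed threshold used when pruning segments (erosion).
    """
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    # Snippets never reached by dilation stay evident background.
    labels = np.full(n, EVIDENT_BG, dtype=int)

    for t in annotated:
        # Dilation: grow left and right while the loose criterion holds.
        lo = t
        while lo - 1 >= 0 and scores[lo - 1] >= loose_thr:
            lo -= 1
        hi = t
        while hi + 1 < n and scores[hi + 1] >= loose_thr:
            hi += 1

        # Erosion: inside the expanded segment, keep confident snippets
        # as action and mark the rest as hard background.
        for i in range(lo, hi + 1):
            labels[i] = ACTION if scores[i] >= strict_thr else HARD_BG

    # Annotated frames are always action, whatever their score.
    labels[list(annotated)] = ACTION
    return labels


example = dilate_erode([0.1, 0.4, 0.7, 0.9, 0.5, 0.2, 0.1], annotated=[3])
print(example.tolist())  # [0, -1, 1, 1, -1, 0, 0]
```

In a cyclic training scheme of the kind the abstract describes, labels like these would be fed back as supervision for the snippet classifier, whose updated scores would then be mined again; how the actual method sets its criteria is specified in the paper and code, not here.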