16,563 research outputs found
Deep attentive video summarization with distribution consistency learning
This article studies supervised video summarization by formulating it into a sequence-to-sequence learning framework, in which the input and output are sequences of original video frames and their predicted importance scores, respectively. Two critical issues are addressed in this article: short-term contextual attention insufficiency and distribution inconsistency. The former lies in the insufficiency of capturing the short-term contextual attention information within the video sequence itself since the existing approaches focus a lot on the long-term encoder-decoder attention. The latter refers to the distributions of predicted importance score sequence and the ground-truth sequence is inconsistent, which may lead to a suboptimal solution. To better mitigate the first issue, we incorporate a self-attention mechanism in the encoder to highlight the important keyframes in a short-term context. The proposed approach alongside the encoder-decoder attention constitutes our deep attentive models for video summarization. For the second one, we propose a distribution consistency learning method by employing a simple yet effective regularization loss term, which seeks a consistent distribution for the two sequences. Our final approach is dubbed as Attentive and Distribution consistent video Summarization (ADSum). Extensive experiments on benchmark data sets demonstrate the superiority of the proposed ADSum approach against state-of-the-art approaches
Detecting the Unexpected via Image Resynthesis
Classical semantic segmentation methods, including the recent deep learning
ones, assume that all classes observed at test time have been seen during
training. In this paper, we tackle the more realistic scenario where unexpected
objects of unknown classes can appear at test time. The main trends in this
area either leverage the notion of prediction uncertainty to flag the regions
with low confidence as unknown, or rely on autoencoders and highlight
poorly-decoded regions. Having observed that, in both cases, the detected
regions typically do not correspond to unexpected objects, in this paper, we
introduce a drastically different strategy: It relies on the intuition that the
network will produce spurious labels in regions depicting unexpected objects.
Therefore, resynthesizing the image from the resulting semantic map will yield
significant appearance differences with respect to the input image. In other
words, we translate the problem of detecting unknown classes to one of
identifying poorly-resynthesized image regions. We show that this outperforms
both uncertainty- and autoencoder-based methods
- …