Spatio-Temporal Learnable Proposals for End-to-End Video Object Detection
This paper presents the novel idea of generating object proposals by
leveraging temporal information for video object detection. The feature
aggregation in modern region-based video object detectors heavily relies on
learned proposals generated from a single-frame RPN. This inevitably introduces
additional components like NMS and produces unreliable proposals on low-quality
frames. To tackle these restrictions, we present SparseVOD, a novel video
object detection pipeline that employs Sparse R-CNN to exploit temporal
information. In particular, we introduce two modules in the dynamic head of
Sparse R-CNN. First, the Temporal Feature Extraction module based on the
Temporal RoI Align operation is added to extract the RoI proposal features.
Second, motivated by sequence-level semantic aggregation, we incorporate the
attention-guided Semantic Proposal Feature Aggregation module to enhance object
feature representation before detection. The proposed SparseVOD effectively
alleviates the overhead of complicated post-processing methods and makes the
overall pipeline end-to-end trainable. Extensive experiments show that our
method significantly improves the single-frame Sparse R-CNN by 8%-9% mAP.
Furthermore, besides achieving a state-of-the-art 80.3% mAP on the ImageNet VID
dataset with a ResNet-50 backbone, our SparseVOD outperforms existing
proposal-based methods by a significant margin at higher IoU thresholds
(IoU > 0.5).
Comment: BMVC 202
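
To make the attention-guided aggregation idea concrete, here is a minimal PyTorch sketch that fuses RoI features of the same proposal across frames with similarity-weighted attention. The module name, dimensions, and similarity measure are illustrative assumptions, not the SparseVOD implementation.

    # Hypothetical sketch (not the authors' code): RoI features of each proposal
    # are fused across T support frames with similarity-based attention before
    # being passed to the detection head.
    import torch
    import torch.nn as nn

    class SemanticProposalAggregation(nn.Module):
        def __init__(self, dim=256):
            super().__init__()
            self.proj = nn.Linear(dim, dim)  # shared projection (assumed)

        def forward(self, key_feats, support_feats):
            # key_feats: (N, dim) RoI features from the key frame
            # support_feats: (T, N, dim) RoI features from T support frames
            q = self.proj(key_feats)                    # (N, dim)
            k = self.proj(support_feats)                # (T, N, dim)
            # scaled similarity between key and support proposal features
            sim = torch.einsum('nd,tnd->tn', q, k) / q.size(-1) ** 0.5
            w = sim.softmax(dim=0).unsqueeze(-1)        # (T, N, 1) attention weights
            return (w * support_feats).sum(dim=0)       # (N, dim) enhanced features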
Learning to Incorporate Texture Saliency Adaptive Attention to Image Cartoonization
Image cartoonization has recently been dominated by generative adversarial networks
(GANs) from the perspective of unsupervised image-to-image translation, in
which an inherent challenge is to precisely capture and sufficiently transfer
characteristic cartoon styles (e.g., clear edges, smooth color shading,
abstract fine structures, etc.). Existing advanced models try to enhance
cartoonization effect by learning to promote edges adversarially, introducing
style transfer loss, or learning to align style across multiple representation
spaces. This paper demonstrates that a more distinct and vivid cartoonization
effect can be achieved with only the basic adversarial loss. Observing
that cartoon style is more evident in cartoon-texture-salient local image
regions, we build a region-level adversarial learning branch in parallel with
the normal image-level one, which constrains adversarial learning on
cartoon-texture-salient local patches for better perceiving and transferring
cartoon texture features. To this end, a novel cartoon-texture-saliency-sampler
(CTSS) module is proposed to dynamically sample cartoon-texture-salient patches
from training data. With extensive experiments, we demonstrate that texture
saliency adaptive attention in adversarial learning, as a missing ingredient of
related methods in image cartoonization, is of significant importance in
facilitating and enhancing image cartoon stylization, especially for
high-resolution input pictures.
Comment: Proceedings of the 39th International Conference on Machine Learning,
PMLR 162:7183-7207, 202
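
As a rough illustration of the region-level branch, the sketch below scores non-overlapping patches with a crude texture proxy (gradient magnitude) and keeps the most salient ones for a patch-level discriminator. The scoring heuristic, patch size, and function name are assumptions, not the paper's CTSS module.

    # Minimal sketch, assuming a gradient-magnitude saliency proxy; the actual
    # CTSS module may score and sample patches differently.
    import torch
    import torch.nn.functional as F

    def sample_texture_salient_patches(img, patch=64, k=4):
        # img: (B, 3, H, W) in [0, 1]; assumes H and W divisible by `patch`
        gray = img.mean(dim=1, keepdim=True)
        gx = gray[..., :, 1:] - gray[..., :, :-1]       # horizontal gradients
        gy = gray[..., 1:, :] - gray[..., :-1, :]       # vertical gradients
        sal = F.pad(gx.abs(), (0, 1)) + F.pad(gy.abs(), (0, 0, 0, 1))
        score = F.avg_pool2d(sal, patch)                # mean saliency per patch
        B, _, gh, gw = score.shape
        idx = score.flatten(1).topk(k, dim=1).indices   # top-k patches per image
        ys = torch.div(idx, gw, rounding_mode='floor')
        xs = idx % gw
        patches = [img[b:b + 1, :, y * patch:(y + 1) * patch,
                       x * patch:(x + 1) * patch]
                   for b in range(B)
                   for y, x in zip(ys[b].tolist(), xs[b].tolist())]
        return torch.cat(patches, dim=0)                # (B*k, 3, patch, patch)

The sampled patches would then feed a second, patch-level discriminator trained with the same basic adversarial loss as the image-level branch.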
TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers
Detection Transformer (DETR) and Deformable DETR have been proposed to
eliminate the need for many hand-designed components in object detection while
demonstrating performance comparable to that of previous complex hand-crafted detectors.
However, their performance on Video Object Detection (VOD) has not been well
explored. In this paper, we present TransVOD, the first end-to-end video object
detection system based on spatial-temporal Transformer architectures. The first
goal of this paper is to streamline the pipeline of VOD, effectively removing
the need for many hand-crafted components for feature aggregation, e.g.,
optical flow models and relation networks. Moreover, benefiting from the object query
design in DETR, our method does not need complicated post-processing methods
such as Seq-NMS. In particular, we present a temporal Transformer to aggregate
both the spatial object queries and the feature memories of each frame. Our
temporal transformer consists of two components: Temporal Query Encoder (TQE)
to fuse object queries, and Temporal Deformable Transformer Decoder (TDTD) to
obtain current frame detection results. These designs boost the strong baseline
deformable DETR by a significant margin (3%-4% mAP) on the ImageNet VID
dataset. Then, we present two improved versions of TransVOD including
TransVOD++ and TransVOD Lite. The former fuses object-level information into
object query via dynamic convolution, while the latter models the entire video
clip as the output to speed up inference. We give a detailed analysis
of all three models in the experiments section. In particular, our proposed
TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet
VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed and
accuracy trade-off with 83.7% mAP while running at around 30 FPS on a single
V100 GPU.
Comment: Accepted to IEEE Transactions on Pattern Analysis and Machine
Intelligence (IEEE TPAMI); extended version of arXiv:2105.1092
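
To illustrate how object queries might be fused across frames, here is a hedged PyTorch sketch of a temporal query encoder; the layer layout and shapes are expository assumptions, not the TransVOD code.

    # Illustrative temporal query fusion (assumed design, not TransVOD's TQE):
    # object queries gathered from T frames attend to each other, and the
    # refined queries are handed to a decoder for current-frame detection.
    import torch
    import torch.nn as nn

    class TemporalQueryEncoder(nn.Module):
        def __init__(self, dim=256, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, queries):
            # queries: (B, T*Q, dim), i.e., Q object queries from each of T frames
            fused, _ = self.attn(queries, queries, queries)
            return self.norm(queries + fused)  # refined queries for the decoder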
Exploiting Spatio-Temporal Coherence for Video Object Detection in Robotics
This paper proposes a method to enhance video object detection for indoor environments in robotics. Concretely, it exploits knowledge about the camera motion between frames to propagate previously detected objects to successive frames. The proposal is rooted in the concepts of planar homography, to propose regions of interest in which to search for objects, and recursive Bayesian filtering, to integrate observations over time. The proposal is evaluated on six virtual indoor environments, covering the detection of nine object classes over a total of ∼7k frames. Results show that our proposal improves recall and F1-score by factors of 1.41 and 1.27, respectively, and achieves a significant reduction (58.8%) of the object categorization entropy when compared to a two-stage video object detection baseline, at the cost of a small time overhead (120 ms) and a slight precision drop (to 0.92).
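
The two ingredients named in the abstract, homography-based region proposals and recursive Bayesian filtering, can be sketched as follows. This is a minimal illustration assuming the inter-frame homography H is available (e.g., from known camera motion); it is not the paper's implementation.

    # Hedged sketch (OpenCV/NumPy): propagate a detected box through a planar
    # homography and fuse per-class probabilities with a Bayesian update.
    import numpy as np
    import cv2

    def propagate_box(box, H):
        # box: (x1, y1, x2, y2) in the previous frame; H: 3x3 homography
        x1, y1, x2, y2 = box
        corners = np.float32([[x1, y1], [x2, y1],
                              [x2, y2], [x1, y2]]).reshape(-1, 1, 2)
        warped = cv2.perspectiveTransform(corners, H).reshape(-1, 2)
        xs, ys = warped[:, 0], warped[:, 1]
        # axis-aligned region of interest in the new frame
        return xs.min(), ys.min(), xs.max(), ys.max()

    def bayes_update(prior, likelihood):
        # prior, likelihood: per-class probability vectors; returns the posterior
        post = prior * likelihood
        return post / post.sum()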