Relation Networks for Object Detection
Although it has long been believed that modeling relations between objects
would help object recognition, there has been no evidence that the idea works
in the deep learning era. All state-of-the-art object detection systems still
rely on recognizing object instances individually, without exploiting their
relations during learning.
This work proposes an object relation module. It processes a set of objects
simultaneously through interaction between their appearance features and
geometry, thus allowing modeling of their relations. It is lightweight and
in-place: it requires no additional supervision and is easy to embed in
existing networks. It is shown to be effective at improving the object
recognition and duplicate removal steps in the modern object detection
pipeline, verifying the efficacy of modeling object relations in CNN-based
detection, and it gives rise to the first fully end-to-end object detector.
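The relation module described above can be pictured as attention over a set of object proposals, where appearance similarity is modulated by a pairwise geometric weight and the result is added back to each object's feature (keeping the module "in-place"). The sketch below is a minimal illustration of that idea, not the paper's exact formulation: the projection matrices are random stand-ins for learned weights, and `geometry_weight` is assumed to be a precomputed (N, N) matrix of pairwise geometric affinities.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relation_module(appearance, geometry_weight, d_k=64, rng=None):
    """Minimal sketch of an object relation module.

    Each object's feature is updated by attending over all objects,
    with appearance attention gated by a geometric weight.
    appearance: (N, d) per-object appearance features.
    geometry_weight: (N, N) pairwise geometric affinities (>= 0).
    """
    rng = rng or np.random.default_rng(0)
    N, d = appearance.shape
    # Hypothetical learned projections; random here for illustration.
    Wq = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    q, k, v = appearance @ Wq, appearance @ Wk, appearance @ Wv
    appearance_logits = q @ k.T / np.sqrt(d_k)          # (N, N)
    # Combine appearance and geometry as attention-with-bias:
    # adding log-geometry gates the appearance attention.
    logits = appearance_logits + np.log(np.maximum(geometry_weight, 1e-6))
    attn = softmax(logits, axis=-1)
    # Residual update: output has the same shape as the input,
    # so the module can be dropped into an existing network.
    return appearance + attn @ v
```

Because the output keeps the input's shape, such a module can in principle be inserted between existing layers without changing the rest of the detection head.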
Object Detection in 20 Years: A Survey
Object detection, as one of the most fundamental and challenging problems in
computer vision, has received great attention in recent years. Its development
in the past two decades can be regarded as an epitome of computer vision
history. If we think of today's object detection as a technical aesthetic
under the power of deep learning, then turning back the clock 20 years we would
witness the wisdom of the cold-weapon era. This paper extensively reviews 400+
papers of object detection in the light of its technical evolution, spanning
over a quarter-century's time (from the 1990s to 2019). A number of topics have
been covered in this paper, including the milestone detectors in history,
detection datasets, metrics, fundamental building blocks of the detection
system, speed up techniques, and the recent state of the art detection methods.
This paper also reviews some important detection applications, such as
pedestrian detection, face detection, text detection, etc., and makes an
in-depth analysis of their challenges as well as technical improvements in
recent years.
Comment: This work has been submitted to the IEEE TPAMI for possible
publication.
MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation
This paper addresses the issue of modifying the visual appearance of videos
while preserving their motion. A novel framework, named MagicProp, is proposed,
which disentangles the video editing process into two stages: appearance
editing and motion-aware appearance propagation. In the first stage, MagicProp
selects a single frame from the input video and applies image-editing
techniques to modify the content and/or style of the frame. The flexibility of
these techniques enables the editing of arbitrary regions within the frame. In
the second stage, MagicProp employs the edited frame as an appearance reference
and generates the remaining frames using an autoregressive rendering approach.
To achieve this, a diffusion-based conditional generation model, called
PropDPM, is developed, which synthesizes the target frame by conditioning on
the reference appearance, the target motion, and its previous appearance. The
autoregressive editing approach ensures temporal consistency in the resulting
videos. Overall, MagicProp combines the flexibility of image-editing techniques
with the temporal consistency of autoregressive modeling, enabling editing of
object types and aesthetic styles in arbitrary regions of input videos while
keeping the output consistent across frames. Extensive experiments in various
video editing scenarios demonstrate the effectiveness of MagicProp.
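The two-stage pipeline described above can be sketched as a short loop: edit one frame, then render the remaining frames autoregressively, each conditioned on the edited reference, the target motion, and the previous output. The interfaces below (`edit_fn`, `propdpm`) are hypothetical placeholders for an image editor and the conditional diffusion model, not the paper's actual API.

```python
def magicprop_edit(frames, edit_fn, propdpm):
    """Sketch of the MagicProp two-stage pipeline (assumed interfaces).

    frames:  the input video, one entry per frame.
    edit_fn: any image-editing function applied to a single frame
             (stage 1: appearance editing).
    propdpm: stand-in for a conditional generation model that renders
             a frame from (reference appearance, target motion,
             previous appearance) (stage 2: propagation).
    """
    # Stage 1: pick one frame and edit its content and/or style.
    reference = edit_fn(frames[0])
    edited = [reference]
    # Stage 2: autoregressive rendering of the remaining frames.
    for t in range(1, len(frames)):
        nxt = propdpm(reference=reference,   # edited appearance
                      motion=frames[t],      # target motion source
                      previous=edited[-1])   # previous output frame
        edited.append(nxt)
    return edited
```

Conditioning each frame on the previous output is what gives the autoregressive scheme its temporal consistency: appearance drift cannot accumulate silently between independent per-frame edits.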
Separable Self and Mixed Attention Transformers for Efficient Object Tracking
The deployment of transformers for visual object tracking has shown
state-of-the-art results on several benchmarks. However, the transformer-based
models are under-utilized for Siamese lightweight tracking due to the
computational complexity of their attention blocks. This paper proposes an
efficient self and mixed attention transformer-based architecture for
lightweight tracking. The proposed backbone utilizes the separable mixed
attention transformers to fuse the template and search regions during feature
extraction to generate superior feature encoding. Our prediction head performs
global contextual modeling of the encoded features by leveraging efficient
self-attention blocks for robust target state estimation. With these
contributions, the proposed lightweight tracker deploys a transformer-based
backbone and head module concurrently for the first time. Our ablation study
testifies to the effectiveness of the proposed combination of backbone and head
modules. Simulations show that our Separable Self and Mixed Attention-based
Tracker, SMAT, surpasses the performance of related lightweight trackers on
GOT10k, TrackingNet, LaSOT, NfS30, UAV123, and AVisT datasets, while running at
37 fps on CPU, 158 fps on GPU, and having 3.8M parameters. For example, it
significantly surpasses the closely related trackers E.T.Track and
MixFormerV2-S on GOT10k-test by a margin of 7.9% and 5.8%, respectively, in the
AO metric. The tracker code and model are available at
https://github.com/goutamyg/SMAT
Comment: Accepted by WACV2024.
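The backbone described above fuses template and search-region features during extraction via mixed attention. The sketch below illustrates the generic idea of mixed attention (joint attention over the concatenated template and search tokens), not SMAT's specific separable formulation; the projection matrices are random stand-ins for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mixed_attention(template, search, d_k=32, rng=None):
    """Illustrative mixed-attention block (assumed, simplified form).

    Template and search-region tokens are concatenated and attended
    to jointly, so the two streams are fused during feature
    extraction rather than by a separate correlation step.
    template: (Nt, d) tokens; search: (Ns, d) tokens.
    """
    rng = rng or np.random.default_rng(0)
    x = np.concatenate([template, search], axis=0)   # (Nt + Ns, d)
    d = x.shape[1]
    # Hypothetical learned projections; random here for illustration.
    Wq = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d_k)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    attn = softmax((x @ Wq) @ (x @ Wk).T / np.sqrt(d_k), axis=-1)
    fused = x + attn @ (x @ Wv)                      # residual update
    # Split back into the template and search streams.
    return fused[:len(template)], fused[len(template):]
```

Joint attention like this is quadratic in the total token count, which is why lightweight trackers such as SMAT replace it with cheaper separable variants.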