TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers
Detection Transformer (DETR) and Deformable DETR have been proposed to
eliminate the need for many hand-designed components in object detection while
demonstrating performance comparable to previous complex hand-crafted detectors.
However, their performance on Video Object Detection (VOD) has not been well
explored. In this paper, we present TransVOD, the first end-to-end video object
detection system based on spatial-temporal Transformer architectures. The first
goal of this paper is to streamline the pipeline of VOD, effectively removing
the need for many hand-crafted components for feature aggregation, e.g.,
optical flow models and relation networks. Moreover, benefiting from the object
query design in DETR, our method does not need complicated post-processing methods
such as Seq-NMS. In particular, we present a temporal Transformer to aggregate
both the spatial object queries and the feature memories of each frame. Our
temporal transformer consists of two components: Temporal Query Encoder (TQE)
to fuse object queries, and Temporal Deformable Transformer Decoder (TDTD) to
obtain current frame detection results. These designs boost the strong baseline
deformable DETR by a significant margin (3%-4% mAP) on the ImageNet VID
dataset. Then, we present two improved versions of TransVOD including
TransVOD++ and TransVOD Lite. The former fuses object-level information into the
object queries via dynamic convolution, while the latter models the whole video
clip as its output to speed up inference. We give a detailed analysis of all
three models in the experiments section. In particular, our proposed
TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet
VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed and
accuracy trade-off with 83.7% mAP while running at around 30 FPS on a single
V100 GPU device.
Comment: Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE TPAMI); extended version of arXiv:2105.1092
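
To make the query-aggregation idea concrete, here is a minimal PyTorch sketch of a temporal query encoder that fuses object queries across frames with self-attention. All names, dimensions, and layer counts are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TemporalQueryEncoder(nn.Module):
    """Hypothetical sketch: fuse per-frame object queries with
    self-attention so the current frame borrows temporal context."""

    def __init__(self, d_model=256, nhead=8, num_layers=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, frame_queries):
        # frame_queries: (batch, num_frames * num_queries, d_model),
        # per-frame object queries concatenated along the token axis.
        return self.encoder(frame_queries)

# Toy usage: 2 clips, 4 frames, 100 object queries of width 256 per frame.
queries = torch.randn(2, 4 * 100, 256)
fused = TemporalQueryEncoder()(queries)  # same shape, temporally mixed
```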
MSF3DDETR: Multi-Sensor Fusion 3D Detection Transformer for Autonomous Driving
3D object detection is a significant task for autonomous driving. Recently, with
the progress of vision transformers, the 2D object detection problem has been
addressed using a set-to-set loss. Inspired by these approaches to 2D object
detection and by DETR3D, an approach for multi-view 3D object detection, we
propose MSF3DDETR, a Multi-Sensor Fusion 3D Detection Transformer architecture
that fuses image and LiDAR features to improve detection accuracy. Our end-to-end,
single-stage, anchor-free and NMS-free network takes in multi-view images and
LiDAR point clouds and predicts 3D bounding boxes. Firstly, we link the object
queries learnt from data to the image and LiDAR features using a novel
MSF3DDETR cross-attention block. Secondly, the object queries interact with
each other in a multi-head self-attention block. Finally, the MSF3DDETR block is
repeated a number of times to refine the object queries. The MSF3DDETR
network is trained end-to-end on the nuScenes dataset using Hungarian algorithm
based bipartite matching and a set-to-set loss inspired by DETR. We present both
quantitative and qualitative results that are competitive with state-of-the-art approaches.
Comment: Accepted at the ICPR 2022 Workshop DLVDR202
Focused Decoding Enables 3D Anatomical Detection by Transformers
Detection Transformers represent end-to-end object detection approaches based
on a Transformer encoder-decoder architecture, exploiting the attention
mechanism for global relation modeling. Although Detection Transformers deliver
results on par with or even superior to their highly optimized CNN-based
counterparts operating on 2D natural images, their success is closely coupled
to access to a vast amount of training data. This, however, restricts the
feasibility of employing Detection Transformers in the medical domain, as
access to annotated data is typically limited. To tackle this issue and
facilitate the advent of medical Detection Transformers, we propose a novel
Detection Transformer for 3D anatomical structure detection, dubbed Focused
Decoder. Focused Decoder leverages information from an anatomical region atlas
to simultaneously deploy query anchors and restrict the cross-attention's field
of view to regions of interest, which allows for a precise focus on relevant
anatomical structures. We evaluate our proposed approach on two publicly
available CT datasets and demonstrate that Focused Decoder not only provides
strong detection results and thus alleviates the need for a vast amount of
annotated data but also exhibits exceptional and highly intuitive
explainability of results via attention weights. Our code is available at
https://github.com/bwittmann/transoar.
Comment: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2023:00
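
The key mechanism, restricting the cross-attention field of view to an atlas-derived region of interest, can be sketched with a key-padding mask in PyTorch. The example below is 2D and fully hypothetical (the paper works on 3D CT volumes); `roi_attention_mask` and all sizes are made up for illustration.

```python
import torch
import torch.nn as nn

def roi_attention_mask(feat_hw, roi_box):
    """Build a boolean mask that blocks attention to feature positions
    outside a region of interest (True entries mean 'ignore this key')."""
    h, w = feat_hw
    x0, y0, x1, y1 = roi_box
    keep = torch.zeros(h, w, dtype=torch.bool)
    keep[y0:y1, x0:x1] = True
    return ~keep.flatten()

# Query anchors cross-attend only to tokens inside the ROI.
attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
queries = torch.randn(1, 27, 256)      # query anchors for one structure
feats = torch.randn(1, 32 * 32, 256)   # flattened feature-map tokens
mask = roi_attention_mask((32, 32), (8, 8, 24, 24)).unsqueeze(0)
out, _ = attn(queries, feats, feats, key_padding_mask=mask)
```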
YOLOPose V2: Understanding and Improving Transformer-based 6D Pose Estimation
6D object pose estimation is a crucial prerequisite for autonomous robot
manipulation applications. The state-of-the-art models for pose estimation are
convolutional neural network (CNN)-based. Lately, Transformers, an architecture
originally proposed for natural language processing, are achieving
state-of-the-art results in many computer vision tasks as well. Equipped with
the multi-head self-attention mechanism, Transformers enable simple
single-stage end-to-end architectures for learning object detection and 6D
object pose estimation jointly. In this work, we propose YOLOPose (short for
You Only Look Once Pose estimation), a Transformer-based multi-object 6D pose
estimation method based on keypoint regression, together with an improved
variant of the YOLOPose model. In contrast to the standard heatmap representation for predicting
keypoints in an image, we directly regress the keypoints. Additionally, we
employ a learnable orientation estimation module to predict the orientation
from the keypoints. Along with a separate translation estimation module, our
model is end-to-end differentiable. Our method is suitable for real-time
applications and achieves results comparable to state-of-the-art methods. We
analyze the role of object queries in our architecture and reveal that the
object queries specialize in detecting objects in specific image regions.
Furthermore, we quantify the accuracy trade-off of using datasets of smaller
sizes to train our model.
Comment: Robotics and Autonomous Systems Journal, Elsevier, to appear 2023. arXiv admin note: substantial text overlap with arXiv:2205.0253
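
A minimal sketch of the two heads described above: direct keypoint regression from a query embedding, followed by a learnable orientation module operating on the keypoints. The MLP sizes and the 6D rotation parameterization are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class KeypointPoseHead(nn.Module):
    """Hypothetical sketch: regress 2D keypoints directly (no heatmaps),
    then estimate orientation from the keypoints with a small MLP."""

    def __init__(self, d_model=256, num_keypoints=8):
        super().__init__()
        self.kpt_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, num_keypoints * 2),  # (x, y) per keypoint
        )
        self.rot_head = nn.Sequential(
            nn.Linear(num_keypoints * 2, 128), nn.ReLU(),
            nn.Linear(128, 6),                      # 6D rotation representation
        )

    def forward(self, query_embed):
        kpts = self.kpt_head(query_embed)   # (batch, num_keypoints * 2)
        rot6d = self.rot_head(kpts)         # orientation from the keypoints
        return kpts, rot6d

kpts, rot6d = KeypointPoseHead()(torch.randn(4, 256))
```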
NMS Strikes Back
Detection Transformer (DETR) directly transforms queries to unique objects by
using one-to-one bipartite matching during training and enables end-to-end
object detection. Recently, these models have surpassed traditional detectors
on COCO with undeniable elegance. However, they differ from traditional
detectors in multiple designs, including model architecture and training
schedules, and thus the effectiveness of one-to-one matching is not fully
understood. In this work, we conduct a strict comparison between the one-to-one
Hungarian matching in DETRs and the one-to-many label assignments in
traditional detectors with non-maximum suppression (NMS). Surprisingly, we
observe one-to-many assignments with NMS consistently outperform standard
one-to-one matching under the same setting, with a significant gain of up to
2.5 mAP. Our detector, which trains Deformable-DETR with traditional IoU-based
label assignment, achieves 50.2 COCO mAP within 12 epochs (1x schedule) with a
ResNet50 backbone, outperforming all existing traditional and transformer-based
detectors in this setting. On multiple datasets, schedules, and architectures,
we consistently show bipartite matching is unnecessary for performant detection
transformers. Furthermore, we attribute the success of detection transformers
to their expressive transformer architecture. Code is available at
https://github.com/jozhang97/DETA.
Comment: Code is available at https://github.com/jozhang97/DET
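
The contrast the paper draws can be illustrated with a small PyTorch/torchvision sketch: a traditional IoU-based one-to-many assignment for training, with NMS pruning duplicates at inference. The thresholds are conventional placeholder values, not the paper's.

```python
import torch
from torchvision.ops import box_iou, nms

def assign_one_to_many(pred_boxes, gt_boxes, iou_thresh=0.5):
    """Each prediction whose IoU with some ground truth exceeds the
    threshold is assigned to its best-matching ground truth; several
    predictions may share one target (one-to-many). -1 marks background."""
    iou = box_iou(pred_boxes, gt_boxes)  # (num_preds, num_gts)
    best_iou, best_gt = iou.max(dim=1)
    return torch.where(best_iou >= iou_thresh, best_gt,
                       torch.full_like(best_gt, -1))

preds = torch.tensor([[0., 0., 10., 10.], [1., 1., 11., 11.], [50., 50., 60., 60.]])
gts = torch.tensor([[0., 0., 10., 10.]])
print(assign_one_to_many(preds, gts))  # tensor([ 0,  0, -1])

# At inference, duplicates assigned to the same object are pruned by NMS.
scores = torch.tensor([0.9, 0.8, 0.7])
keep = nms(preds, scores, iou_threshold=0.7)
```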
Vision Transformer with Quadrangle Attention
Window-based attention has become a popular choice in vision transformers due
to its superior performance, lower computational complexity, and smaller memory
footprint. However, the design of hand-crafted windows, which is data-agnostic,
constrains the flexibility of transformers to adapt to objects of varying
sizes, shapes, and orientations. To address this issue, we propose a novel
quadrangle attention (QA) method that extends the window-based attention to a
general quadrangle formulation. Our method employs an end-to-end learnable
quadrangle regression module that predicts a transformation matrix to transform
default windows into target quadrangles for token sampling and attention
calculation, enabling the network to model various targets with different
shapes and orientations and capture rich context information. We integrate QA
into plain and hierarchical vision transformers to create a new architecture
named QFormer, which requires only minor code modifications and adds negligible extra
computational cost. Extensive experiments on public benchmarks demonstrate that
QFormer outperforms existing representative vision transformers on various
vision tasks, including classification, object detection, semantic
segmentation, and pose estimation. The code will be made publicly available at
\href{https://github.com/ViTAE-Transformer/QFormer}{QFormer}.
Comment: 15 pages, an extension of the ECCV 2022 paper (VSA: Learning Varied-Size Window Attention in Vision Transformers)
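
A hedged sketch of the core mechanism: predict a transformation matrix from features and resample tokens from the resulting region with `grid_sample`. For brevity this example predicts one affine transform per image rather than a per-window projective transform, so it only illustrates the sampling idea, not QFormer's actual module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuadrangleSampler(nn.Module):
    """Hypothetical sketch: learn a transform that maps a default window
    to a target region, then sample tokens from it for attention."""

    def __init__(self, channels=96, window=7):
        super().__init__()
        self.window = window
        self.to_theta = nn.Linear(channels, 6)  # 2x3 affine matrix entries
        # Initialize to the identity transform (the default window).
        nn.init.zeros_(self.to_theta.weight)
        self.to_theta.bias.data = torch.tensor([1., 0., 0., 0., 1., 0.])

    def forward(self, feat):
        # feat: (batch, channels, H, W)
        b, c, h, w = feat.shape
        theta = self.to_theta(feat.mean(dim=(2, 3))).view(b, 2, 3)
        grid = F.affine_grid(theta, (b, c, self.window, self.window),
                             align_corners=False)
        return F.grid_sample(feat, grid, align_corners=False)  # sampled tokens

tokens = QuadrangleSampler()(torch.randn(2, 96, 14, 14))  # (2, 96, 7, 7)
```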