The Right (Angled) Perspective: Improving the Understanding of Road Scenes Using Boosted Inverse Perspective Mapping
Many tasks performed by autonomous vehicles such as road marking detection,
object tracking, and path planning are simpler in bird's-eye view. Hence,
Inverse Perspective Mapping (IPM) is often applied to remove the perspective
effect from a vehicle's front-facing camera and to remap its images into a 2D
domain, resulting in a top-down view. Unfortunately, however, this leads to
unnatural blurring and stretching of objects at further distance, due to the
resolution of the camera, limiting applicability. In this paper, we present an
adversarial learning approach for generating a significantly improved IPM from
a single camera image in real time. The generated bird's-eye-view images
contain sharper features (e.g. road markings) and a more homogeneous
illumination, while (dynamic) objects are automatically removed from the scene,
thus revealing the underlying road layout in an improved fashion. We
demonstrate our framework using real-world data from the Oxford RobotCar
Dataset and show that scene understanding tasks directly benefit from our
boosted IPM approach.
Comment: equal contribution of first two authors, 8 full pages, 6 figures, accepted at IV 201
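Classical IPM, which the paper's adversarial approach improves upon, amounts to warping image pixels through a 3x3 homography that maps the road plane to a top-down view. A minimal sketch of that warp, using a toy homography matrix rather than one derived from real camera calibration:

```python
import numpy as np

def warp_points(H, pts):
    """Apply a 3x3 homography H to Nx2 pixel coordinates (the classical IPM step)."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # lift to homogeneous coords
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]             # divide out the projective scale

# Toy homography: scales x by 2 and shifts y by 5, standing in for a
# calibrated ground-plane-to-bird's-eye matrix.
H = np.array([[2.0, 0.0, 0.0],
              [0.0, 1.0, 5.0],
              [0.0, 0.0, 1.0]])
bev = warp_points(H, np.array([[10.0, 20.0]]))  # -> [[20., 25.]]
```

Because the warp only redistributes existing pixels, distant road regions cover few source pixels and get stretched, which is exactly the blurring artifact the paper's learned approach addresses.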
D3Former: Debiased Dual Distilled Transformer for Incremental Learning
In class incremental learning (CIL) setting, groups of classes are introduced
to a model in each learning phase. The goal is to learn a unified model
performant on all the classes observed so far. Given the recent popularity of
Vision Transformers (ViTs) in conventional classification settings, an
interesting question is to study their continual learning behaviour. In this
work, we develop a Debiased Dual Distilled Transformer for CIL, dubbed
D3Former. The proposed model leverages a hybrid nested ViT
design to ensure data efficiency and scalability to small as well as large
datasets. In contrast to a recent ViT-based CIL approach, our D3Former
does not dynamically expand its architecture when
new tasks are learned and remains suitable for a large number of incremental
tasks. The improved CIL behaviour of D3Former stems from two
fundamental changes to the ViT design. First, we treat the incremental learning
as a long-tail classification problem where the majority samples from new
classes vastly outnumber the limited exemplars available for old classes. To
avoid the bias against the minority old classes, we propose to dynamically
adjust logits to emphasize retaining the representations relevant to old
tasks. Second, we propose to preserve the configuration of spatial attention
maps as the learning progresses across tasks. This helps in reducing
catastrophic forgetting by constraining the model to retain the attention on
the most discriminative regions. D3Former obtains
favorable results on incremental versions of CIFAR-100, MNIST, SVHN, and
ImageNet datasets. Code is available at https://tinyurl.com/d3former
Comment: Accepted to CLVision at CVPR 202
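The first of the two changes, treating incremental learning as a long-tail problem, can be illustrated with a generic log-prior logit adjustment. This is a standard long-tail debiasing trick, not necessarily D3Former's exact rule; the temperature `tau` and the counts below are illustrative:

```python
import numpy as np

def adjust_logits(logits, class_counts, tau=1.0):
    """Debias logits toward minority (old) classes by subtracting a
    log-prior term: over-represented classes get their logits pushed down."""
    prior = class_counts / class_counts.sum()
    return logits - tau * np.log(prior)

logits = np.array([2.0, 2.0])      # model is indifferent between the classes
counts = np.array([900.0, 100.0])  # new class 0 vastly outnumbers old class 1
adj = adjust_logits(logits, counts)
```

After adjustment the minority (old) class receives the larger logit, counteracting the bias induced by the imbalanced exemplar memory.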
FedET: A Communication-Efficient Federated Class-Incremental Learning Framework Based on Enhanced Transformer
Federated Learning (FL) has attracted wide attention because it enables
decentralized learning while preserving data privacy. However, most existing
methods unrealistically assume that the classes encountered by local clients
are fixed over time. Under this assumption, the model suffers severe
catastrophic forgetting of old classes after learning new ones. Moreover,
communication-cost constraints make it challenging to deploy large-scale
models in FL, which limits prediction accuracy. To address
these challenges, we propose a novel framework, Federated Enhanced Transformer
(FedET), which simultaneously achieves high accuracy and low communication
cost. Specifically, FedET uses Enhancer, a tiny module, to absorb and
communicate new knowledge, and applies pre-trained Transformers combined with
different Enhancers to ensure high precision on various tasks. To address local
forgetting caused by new classes of new tasks and global forgetting brought by
non-i.i.d (non-independent and identically distributed) class imbalance across
different local clients, we propose an Enhancer distillation method to
rebalance old and new knowledge and mitigate the non-i.i.d. problem.
Experimental results demonstrate that FedET's average accuracy on
representative benchmark datasets is 14.1% higher than the state-of-the-art
method, while FedET saves 90% of the communication cost compared to the
previous method.
Comment: Accepted by 2023 International Joint Conference on Artificial Intelligence (IJCAI2023)
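The abstract does not spell out the Enhancer distillation loss, but distillation between an old and a new module is commonly a temperature-scaled KL divergence. A minimal sketch of that generic form (the temperature and logits are illustrative assumptions, not FedET's published settings):

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable temperature-scaled softmax."""
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) with temperature T, scaled by T^2 as is
    conventional so gradients keep a comparable magnitude."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q))) * T * T)

teacher = np.array([3.0, 1.0, 0.5])   # e.g. frozen old Enhancer's outputs
student = np.array([1.0, 1.0, 1.0])   # new Enhancer being trained
loss = distill_loss(student, teacher)
```

Minimizing such a loss keeps the new module's predictions close to the old one's on previously seen classes, which is how distillation counteracts forgetting.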
SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers
This paper investigates the capability of plain Vision Transformers (ViTs)
for semantic segmentation using the encoder-decoder framework and introduces
SegViTv2. In this study, we introduce a novel Attention-to-Mask (ATM)
module to design a lightweight decoder effective for plain ViTs. The proposed
ATM converts the global attention map into semantic masks for high-quality
segmentation results. Our decoder outperforms the popular UPerNet decoder
across various ViT backbones while consuming only a fraction of the
computational cost. For the encoder, we address the relatively high
computational cost of ViT-based encoders and propose a Shrunk++
structure that incorporates edge-aware query-based down-sampling (EQD) and
query-based up-sampling (QU) modules. The Shrunk++ structure substantially
reduces the computational cost of the encoder while maintaining competitive
performance. Furthermore, we propose to adapt SegViT for continual semantic
segmentation, demonstrating nearly zero forgetting of previously learned
knowledge. Experiments show that our proposed SegViTv2 surpasses recent
segmentation methods on three popular benchmarks including ADE20k,
COCO-Stuff-10k and PASCAL-Context datasets. The code is available through the
following link: https://github.com/zbwxp/SegVit
Comment: IJCV 2023 accepted, 21 pages, 8 figures, 12 tables
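The core ATM idea, reading segmentation masks directly off attention scores between class queries and image tokens, can be sketched in a few lines. The shapes and the sigmoid choice here are illustrative assumptions, not the exact SegViTv2 module:

```python
import numpy as np

def attention_to_mask(queries, tokens):
    """Turn query-token attention logits into per-class soft masks.
    queries: (C, d) one learnable query per class; tokens: (N, d) image tokens.
    Returns (C, N) mask values in (0, 1), one spatial mask per class."""
    scores = queries @ tokens.T / np.sqrt(queries.shape[1])  # scaled dot-product attention logits
    return 1.0 / (1.0 + np.exp(-scores))                     # sigmoid -> soft masks

rng = np.random.default_rng(0)
queries = rng.normal(size=(2, 4))   # 2 hypothetical classes
tokens = rng.normal(size=(3, 4))    # 3 image tokens
masks = attention_to_mask(queries, tokens)
```

The appeal of this design is that the attention map, which a plain ViT computes anyway, is reused as the segmentation output, keeping the decoder lightweight.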
Towards Fully Decoupled End-to-End Person Search
End-to-end person search aims to jointly detect and re-identify a target
person in raw scene images with a unified model. The detection task unifies all
persons while the re-id task discriminates between identities, resulting in
conflicting optimization objectives. Existing works propose decoupling
end-to-end person search to alleviate this conflict. Yet these methods remain
sub-optimal on one or two of the sub-tasks due to their partially decoupled
models, which limits the overall person search performance. In this paper, we
propose to fully decouple person search for optimal performance. A
task-incremental person search network is proposed to incrementally construct
an end-to-end model for the detection and re-id sub-task, which decouples the
model architecture for the two sub-tasks. The proposed task-incremental network
allows task-incremental training for the two conflicting tasks. This enables
independent learning for the different objectives, thus fully decoupling the
model for person search. Comprehensive experimental evaluations demonstrate the
effectiveness of the proposed fully decoupled models for end-to-end person
search.
Comment: DICTA 202
Revisiting a kNN-based Image Classification System with High-capacity Storage
In existing image classification systems that use deep neural networks, the
knowledge needed for image classification is implicitly stored in model
parameters. If users want to update this knowledge, then they need to fine-tune
the model parameters. Moreover, users cannot verify the validity of inference
results or evaluate the contribution of knowledge to the results. In this
paper, we investigate a system that stores knowledge for image classification,
such as image feature maps, labels, and original images, not in model
parameters but in external high-capacity storage. Our system refers to the
storage like a database when classifying input images. To increase knowledge,
our system updates the database instead of fine-tuning model parameters, which
avoids catastrophic forgetting in incremental learning scenarios. We revisit a
kNN (k-Nearest Neighbor) classifier and employ it in our system. By analyzing
the neighborhood samples retrieved by the kNN algorithm, we can interpret how
knowledge learned in the past is used for inference results. Our system
achieves 79.8% top-1 accuracy on the ImageNet dataset without fine-tuning model
parameters after pretraining, and 90.8% accuracy on the Split CIFAR-100 dataset
in the task-incremental learning setting.
Comment: 16 pages, 7 figures, 6 tables
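The system's central mechanism, classifying by nearest neighbors over an external feature store and "learning" by appending to that store rather than updating weights, is simple enough to sketch directly. The class below is our minimal illustration of the idea, not the paper's implementation:

```python
import numpy as np

class KNNStore:
    """Classify by majority vote over the k nearest stored feature vectors.
    Adding knowledge means appending to the store; no gradients, so old
    entries are never overwritten and incremental forgetting is avoided."""

    def __init__(self):
        self.feats, self.labels = [], []

    def add(self, feat, label):
        """'Update the database' instead of fine-tuning model parameters."""
        self.feats.append(feat)
        self.labels.append(label)

    def predict(self, feat, k=3):
        X = np.stack(self.feats)
        idx = np.argsort(np.linalg.norm(X - feat, axis=1))[:k]  # k nearest by L2 distance
        votes = [self.labels[i] for i in idx]
        return max(set(votes), key=votes.count)                 # majority vote

store = KNNStore()
store.add(np.array([0.0, 0.0]), "cat")
store.add(np.array([0.1, 0.1]), "cat")
store.add(np.array([5.0, 5.0]), "dog")
store.add(np.array([5.1, 5.0]), "dog")
pred = store.predict(np.array([0.2, 0.0]), k=3)  # -> "cat"
```

The retrieved neighbors also double as an explanation: the `idx` list names exactly which stored samples produced the prediction, which is the interpretability property the abstract highlights.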
CoMFormer: Continual Learning in Semantic and Panoptic Segmentation
Continual learning for segmentation has recently seen increasing interest.
However, previous works focus solely on semantic segmentation and disregard
panoptic segmentation, an important task with real-world impact. In this
paper, we present the first continual learning model capable of operating on
both semantic and panoptic segmentation. Inspired by recent transformer
approaches that consider segmentation as a mask-classification problem, we
design CoMFormer. Our method carefully exploits the properties of transformer
architectures to learn new classes over time. Specifically, we propose a novel
adaptive distillation loss along with a mask-based pseudo-labeling technique to
effectively prevent forgetting. To evaluate our approach, we introduce a novel
continual panoptic segmentation benchmark on the challenging ADE20K dataset.
Our CoMFormer outperforms all existing baselines, both forgetting old
classes less and learning new classes more effectively. In addition, we
report an extensive evaluation in the large-scale continual semantic
segmentation scenario showing that CoMFormer also significantly outperforms
state-of-the-art methods.
Comment: Under submission
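Pseudo-labeling for continual segmentation generally fills pixels that are unlabeled at the current step with the old model's confident predictions, so old classes keep supervising the new model. A minimal sketch of that generic scheme (the confidence threshold, ignore index, and flat pixel layout are our assumptions, not CoMFormer's exact mask-based formulation):

```python
import numpy as np

def pseudo_label(old_probs, new_gt, threshold=0.7, ignore=255):
    """Fill unlabeled pixels with the frozen old model's confident predictions.
    old_probs: (C, N) per-class probabilities over N pixels from the old model.
    new_gt:    (N,) ground truth for the current step; `ignore` marks pixels
               with no label (typically old-class regions)."""
    labels = new_gt.copy()
    conf = old_probs.max(axis=0)                      # old model's confidence per pixel
    pred = old_probs.argmax(axis=0)                   # old model's predicted class
    fill = (labels == ignore) & (conf >= threshold)   # only confident, unlabeled pixels
    labels[fill] = pred[fill]
    return labels

old_probs = np.array([[0.9, 0.4, 0.8],
                      [0.1, 0.6, 0.2]])
new_gt = np.array([255, 255, 1])          # pixels 0 and 1 are unlabeled this step
labels = pseudo_label(old_probs, new_gt)  # -> [0, 255, 1]
```

Pixel 0 is confidently recovered as an old class, pixel 1 stays ignored because the old model is unsure, and pixel 2 keeps its current-step label; combined with a distillation loss, this is how such methods keep old classes from being absorbed into the background.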