AFPN: Asymptotic Feature Pyramid Network for Object Detection
Multi-scale features are of great importance in encoding objects with scale
variance in object detection tasks. A common strategy for multi-scale feature
extraction is adopting the classic top-down and bottom-up feature pyramid
networks. However, these approaches suffer from the loss or degradation of
feature information, impairing the fusion effect of non-adjacent levels. This
paper proposes an asymptotic feature pyramid network (AFPN) to support direct
interaction at non-adjacent levels. AFPN is initiated by fusing two adjacent
low-level features and asymptotically incorporates higher-level features into
the fusion process. In this way, the larger semantic gap between non-adjacent
levels can be avoided. Given the potential for multi-object information
conflicts to arise during feature fusion at each spatial location, an adaptive
spatial fusion operation is further used to mitigate these inconsistencies.
We incorporate the proposed AFPN into both two-stage and one-stage object
detection frameworks and evaluate it on the MS-COCO 2017 validation and test
datasets. Experimental evaluation shows that our method achieves more
competitive results than other state-of-the-art feature pyramid networks. The
code is available at
https://github.com/gyyang23/AFPN
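The adaptive spatial fusion step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes that the feature maps from all levels have already been resized to one resolution and that a per-location logit is available for each level; the function names and the softmax weighting are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_spatial_fusion(features, weight_logits):
    """Fuse same-sized feature maps with per-location weights.

    features:      list of L maps, each an H x W nested list (hypothetical input,
                   assumed to be resized to a common resolution beforehand).
    weight_logits: list of L maps of the same shape; in a real network these
                   would be predicted by a small learned layer (assumption).
    """
    num_levels = len(features)
    height, width = len(features[0]), len(features[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            # Weights at each location sum to 1 across levels.
            weights = softmax([weight_logits[l][i][j] for l in range(num_levels)])
            fused[i][j] = sum(
                weights[l] * features[l][i][j] for l in range(num_levels)
            )
    return fused
```

Because the weights are computed separately at every spatial location, one level can dominate where its features are reliable while another dominates elsewhere, which is the intuition behind resolving conflicting information during fusion.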
Progressive Feature Self-reinforcement for Weakly Supervised Semantic Segmentation
Compared to conventional semantic segmentation with pixel-level supervision,
Weakly Supervised Semantic Segmentation (WSSS) with image-level labels poses
the challenge that the model always focuses on the most discriminative regions,
resulting in a gap relative to fully supervised conditions. A typical
manifestation is diminished precision at object boundaries, which degrades
WSSS accuracy. To alleviate this issue, we propose to
adaptively partition the image content into deterministic regions (e.g.,
confident foreground and background) and uncertain regions (e.g., object
boundaries and misclassified categories) for separate processing. For uncertain
cues, we employ an activation-based masking strategy and seek to recover the
local information with self-distilled knowledge. We further assume that the
unmasked confident regions should be robust enough to preserve the global
semantics. Building upon this, we introduce a complementary self-enhancement
method that constrains the semantic consistency between these confident regions
and an augmented image with the same class labels. Extensive experiments
conducted on PASCAL VOC 2012 and MS COCO 2014 demonstrate that our proposed
single-stage approach for WSSS not only outperforms state-of-the-art benchmarks
remarkably but also surpasses multi-stage methodologies that trade complexity
for accuracy. The code can be found at
https://github.com/Jessie459/feature-self-reinforcement
Comment: Accepted by AAAI 202
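The partition into confident and uncertain regions described in the abstract above can be sketched as follows. This is a minimal illustration that assumes a normalized class activation map and two fixed thresholds; the threshold values, labels, and function names are hypothetical, and the paper's partition is adaptive rather than fixed.

```python
def partition_by_activation(cam, low=0.3, high=0.7):
    """Label each location of a normalized activation map.

    cam:  H x W nested list with values in [0, 1] (assumed input).
    low / high: illustrative thresholds, not values from the paper.
    Returns an H x W nested list of "fg", "bg", or "uncertain".
    """
    labels = []
    for row in cam:
        labels.append(
            ["fg" if a >= high else "bg" if a <= low else "uncertain" for a in row]
        )
    return labels


def mask_uncertain(values, labels, fill=0.0):
    """Replace values at uncertain locations with a fill value."""
    return [
        [fill if lab == "uncertain" else v for v, lab in zip(v_row, l_row)]
        for v_row, l_row in zip(values, labels)
    ]
```

For example, `partition_by_activation([[0.9, 0.5], [0.1, 0.7]])` labels the top-right location as uncertain, and `mask_uncertain` then blanks only that location while leaving the confident regions untouched.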
ViT-Calibrator: Decision Stream Calibration for Vision Transformer
A surge of interest has emerged in utilizing Transformers in diverse vision
tasks owing to their formidable performance. However, existing approaches
primarily focus on optimizing internal model architecture designs that often
entail significant trial and error with high burdens. In this work, we propose
a new paradigm dubbed Decision Stream Calibration that boosts the performance
of general Vision Transformers. To achieve this, we shed light on the
information propagation mechanism in the learning procedure by exploring the
correlation between different tokens and the relevance coefficient of multiple
dimensions. Our analysis reveals that 1) the final decision is associated with
tokens of foreground targets: token features of foreground targets are
transmitted to the next layer as much as possible, while useless token features
of background areas are gradually eliminated during forward propagation; and
2) each category is associated solely with specific sparse dimensions in the
tokens. Based on these findings, we design a two-stage calibration scheme,
namely ViT-Calibrator, comprising a token propagation calibration stage and a
dimension propagation calibration stage.
Extensive experiments on commonly used datasets show that the proposed approach
can achieve promising results. The source code is provided in the supplementary
material.
Comment: 14 pages, 12 figures
Propheter: Prophetic Teacher Guided Long-Tailed Distribution Learning
The problem of deep long-tailed learning, a prevalent challenge in the realm
of generic visual recognition, persists in a multitude of real-world
applications. To tackle the heavily-skewed dataset issue in long-tailed
classification, prior efforts have sought to augment existing deep models with
elaborate class-balancing strategies, such as class rebalancing, data
augmentation, and module improvement. Despite the encouraging performance, the
limited class knowledge of the tailed classes in the training dataset still
bottlenecks the performance of the existing deep models. In this paper, we
propose an innovative long-tailed learning paradigm that breaks the bottleneck
by guiding the learning of deep networks with external prior knowledge. This is
specifically achieved by devising an elaborate "prophetic" teacher, termed
"Propheter", that aims to learn the potential class distributions. The
target long-tailed prediction model is then optimized under the instruction of
the well-trained "Propheter", such that the distributions of different
classes are as distinguishable as possible from each other. Experiments on
eight long-tailed benchmarks across three architectures demonstrate that the
proposed prophetic paradigm acts as a promising solution to the challenge of
limited class knowledge in long-tailed datasets. Our code and model can be
found in the supplementary material.
Interaction Pattern Disentangling for Multi-Agent Reinforcement Learning
Deep cooperative multi-agent reinforcement learning has demonstrated
remarkable success across a wide spectrum of complex control tasks. However,
recent advances in multi-agent learning mainly focus on value decomposition
while leaving entity interactions still intertwined, which easily leads to
over-fitting on noisy interactions between entities. In this work, we introduce
a novel interactiOn Pattern disenTangling (OPT) method, to disentangle not only
the joint value function into agent-wise value functions for decentralized
execution, but also the entity interactions into interaction prototypes, each
of which represents an underlying interaction pattern within a subgroup of the
entities. OPT facilitates filtering the noisy interactions between irrelevant
entities and thus significantly improves generalizability as well as
interpretability. Specifically, OPT introduces a sparse disagreement mechanism
to encourage sparsity and diversity among discovered interaction prototypes.
Then the model selectively restructures these prototypes into a compact
interaction pattern by an aggregator with learnable weights. To alleviate the
training instability issue caused by partial observability, we propose to
maximize the mutual information between the aggregation weights and the
historical behaviors of each agent. Experiments on both single-task and multi-task
benchmarks demonstrate that the proposed method yields results superior to the
state-of-the-art counterparts. Our code is available at
https://github.com/liushunyu/OPT
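The link between value decomposition and decentralized execution mentioned in the abstract above can be illustrated with a small sketch. It assumes the simplest additive decomposition, in which the joint value is the sum of agent-wise values; this is a swapped-in simplification for illustration, not the mixing scheme used by OPT, and all names are hypothetical.

```python
from itertools import product


def greedy_decentralized(agent_values):
    """Each agent independently picks the action with its own highest value."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in agent_values)


def greedy_joint(agent_values):
    """Brute-force search for the joint action maximizing the summed value.

    Assumes the joint value is the plain sum of agent-wise values
    (an additive decomposition chosen for illustration only).
    """
    joint_actions = product(*(range(len(q)) for q in agent_values))
    return max(
        joint_actions,
        key=lambda actions: sum(q[a] for q, a in zip(agent_values, actions)),
    )
```

Under this additive assumption the two functions agree whenever each agent has a unique best action, so agents can act on their own value functions at execution time without searching the joint action space.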