83 research outputs found
Scene Graph Lossless Compression with Adaptive Prediction for Objects and Relations
The scene graph is a new data structure describing objects and their pairwise
relationships within image scenes. As scene graphs in vision applications grow
in size, storing such data on disk or transmitting it over the network
losslessly and efficiently becomes an unavoidable problem. However, the
compression of scene graphs has seldom been studied because of their complicated
data structures and distributions. Existing solutions usually involve
general-purpose compressors or graph structure compression methods, which are
weak at reducing redundancy in scene graph data. This paper introduces a new
lossless compression framework with adaptive predictors for joint compression
of objects and relations in scene graph data. The proposed framework consists
of a unified prior extractor and specialized element predictors that adapt to
different data elements. Furthermore, to exploit the context information within
and between graph elements, Graph Context Convolution is proposed to support
different graph context modeling schemes for different graph elements. Finally,
a learned distribution model is devised to predict numerical data under
complicated conditional constraints. Experiments conducted on labeled or
generated scene graphs prove the effectiveness of the proposed framework in
the scene graph lossless compression task.
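The link between prediction quality and compressed size can be illustrated with a minimal sketch. It is not the paper's framework: it only assumes that an ideal entropy coder spends about -log2(p) bits on a symbol that the predictor assigned probability p. The function name `ideal_code_length_bits` and the example probabilities are hypothetical.

```python
import math


def ideal_code_length_bits(probabilities):
    """Sum of -log2(p) over the probabilities a predictor assigned
    to the symbols that actually occurred (ideal entropy-coder cost)."""
    if any(p <= 0.0 or p > 1.0 for p in probabilities):
        raise ValueError("each probability must lie in (0, 1]")
    return sum(-math.log2(p) for p in probabilities)


# Hypothetical example: three relation labels from a 4-symbol alphabet.
# A context-free uniform model gives each label probability 1/4.
uniform_cost = ideal_code_length_bits([0.25, 0.25, 0.25])
# A sharper (assumed) context-aware predictor gives each label probability 1/2.
adaptive_cost = ideal_code_length_bits([0.5, 0.5, 0.5])

print(uniform_cost, adaptive_cost)
```

In this toy case the sharper predictor halves the ideal bit cost, which is the sense in which better predictors reduce redundancy.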
NCGNN: Node-level Capsule Graph Neural Network
Message passing has evolved as an effective tool for designing Graph Neural
Networks (GNNs). However, most existing works naively sum or average all the
neighboring features to update node representations, which leads to the
following limitations: (1) a lack of interpretability for identifying the node
features crucial to a GNN's prediction; (2) an over-smoothing issue, where repeated
averaging aggregates excessive noise, making features of nodes in different
classes over-mixed and thus indistinguishable. In this paper, we propose the
Node-level Capsule Graph Neural Network (NCGNN) to address these issues with an
improved message passing scheme. Specifically, NCGNN represents nodes as groups
of capsules, in which each capsule extracts distinctive features of its
corresponding node. For each node-level capsule, a novel dynamic routing
procedure is developed to adaptively select appropriate capsules for
aggregation from a subgraph identified by the designed graph filter.
Consequently, as only the advantageous capsules are aggregated and harmful
noise is restrained, the over-mixing of features of interacting nodes in different
classes tends to be avoided, which relieves the over-smoothing issue. Furthermore,
since the graph filter and the dynamic routing identify a subgraph and a subset
of node features that are most influential for the prediction of the model,
NCGNN is inherently interpretable and exempt from complex post-hoc
explanations. Extensive experiments on six node classification benchmarks
demonstrate that NCGNN addresses the over-smoothing issue well and
outperforms the state of the art by producing better node embeddings for
classification.
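The over-smoothing limitation described above can be reproduced on a toy graph. The sketch below is not NCGNN: it shows only the plain mean aggregation that the abstract criticizes, with no capsules or dynamic routing. The path graph, the scalar features, and the helper names `mean_aggregate` and `spread` are hypothetical.

```python
def mean_aggregate(features, neighbors):
    """One round of plain mean aggregation: each node averages its own
    feature with those of its neighbors (no learned weights, no routing)."""
    updated = []
    for node, value in enumerate(features):
        pool = [value] + [features[n] for n in neighbors[node]]
        updated.append(sum(pool) / len(pool))
    return updated


def spread(features):
    """Gap between the largest and smallest node feature."""
    return max(features) - min(features)


# Hypothetical toy graph: a path 0-1-2-3 whose two halves
# stand for two different classes (feature 1.0 vs. 0.0).
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
features = [1.0, 1.0, 0.0, 0.0]

initial_spread = spread(features)
for _ in range(50):
    features = mean_aggregate(features, neighbors)
final_spread = spread(features)

print(initial_spread, final_spread)
```

After repeated averaging the gap between the two classes shrinks toward zero, so the nodes become hard to tell apart. This is the behavior that selective aggregation is meant to avoid.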
Frequency-Aware Transformer for Learned Image Compression
Learned image compression (LIC) has gained traction as an effective solution
for image storage and transmission in recent years. However, existing LIC
methods produce redundant latent representations due to limitations in capturing
anisotropic frequency components and preserving directional details. To
overcome these challenges, we propose a novel frequency-aware transformer (FAT)
block that, for the first time, achieves multiscale directional analysis for
LIC. The FAT block comprises frequency-decomposition window attention (FDWA)
modules to capture multiscale and directional frequency components of natural
images. Additionally, we introduce a frequency-modulation feed-forward network
(FMFFN) to adaptively modulate different frequency components, improving
rate-distortion performance. Furthermore, we present a transformer-based
channel-wise autoregressive (T-CA) model that effectively exploits channel
dependencies. Experiments show that our method achieves state-of-the-art
rate-distortion performance compared to existing LIC methods, and clearly
outperforms the latest standardized codec VTM-12.1 by 14.5%, 15.1%, and 13.0% in
BD-rate on the Kodak, Tecnick, and CLIC datasets, respectively.
Hybrid Distillation: Connecting Masked Autoencoders with Contrastive Learners
Representation learning has been evolving from traditional supervised
training to Contrastive Learning (CL) and Masked Image Modeling (MIM). Previous
works have demonstrated their pros and cons in specific scenarios, i.e., CL and
supervised pre-training excel at capturing longer-range global patterns and
enabling better feature discrimination, while MIM can introduce more local and
diverse attention across all transformer layers. In this paper, we explore how
to obtain a model that combines their strengths. We start by examining previous
feature distillation and mask feature reconstruction methods and identify their
limitations. We find that their increasing diversity mainly derives from the
asymmetric designs, but these designs may in turn compromise the discrimination
ability. In order to better obtain both discrimination and diversity, we
propose a simple but effective Hybrid Distillation strategy, which utilizes
both the supervised/CL teacher and the MIM teacher to jointly guide the student
model. Hybrid Distill imitates the token relations of the MIM teacher to
alleviate attention collapse, and distills the feature maps of the
supervised/CL teacher to enable discrimination. Furthermore, a progressive
redundant token masking strategy is utilized to reduce the distillation
costs and avoid falling into local optima. Experimental results show that Hybrid
Distill achieves superior performance on different benchmarks.
ActionPrompt: Action-Guided 3D Human Pose Estimation With Text and Pose Prompting
Recent 2D-to-3D human pose estimation (HPE) utilizes temporal consistency
across sequences to alleviate the depth ambiguity problem but ignores the
action-related prior knowledge hidden in the pose sequence. In this paper, we propose
a plug-and-play module named Action Prompt Module (APM) that effectively mines
different kinds of action clues for 3D HPE. The highlight is that the mining
scheme of APM can be widely adapted to different frameworks and brings
consistent benefits. Specifically, we first present a novel Action-related Text
Prompt module (ATP) that directly embeds action labels and transfers the rich
language information in the label to the pose sequence. In addition, we
introduce an Action-specific Pose Prompt module (APP) to mine the position-aware
pose pattern of each action, and exploit the correlation between the mined
patterns and input pose sequence for further pose refinement. Experiments show
that APM can improve the performance of most video-based 2D-to-3D HPE
frameworks by a large margin.
Comment: 6 pages, 4 figures, 2023ICM
AiluRus: A Scalable ViT Framework for Dense Prediction
Vision transformers (ViTs) have emerged as a prevalent architecture for
vision tasks owing to their impressive performance. However, when it comes to
handling long token sequences, especially in dense prediction tasks that
require high-resolution input, the complexity of ViTs increases significantly.
Notably, dense prediction tasks, such as semantic segmentation or object
detection, place more emphasis on the contours or shapes of objects, while the
texture inside objects is less informative. Motivated by this observation, we
propose to apply adaptive resolution for different regions in the image
according to their importance. Specifically, at the intermediate layer of the
ViT, we utilize a spatial-aware density-based clustering algorithm to select
representative tokens from the token sequence. Once the representative tokens
are determined, we proceed to merge other tokens into their closest
representative token. Consequently, semantically similar tokens are merged together
to form low-resolution regions, while semantically irrelevant tokens are preserved
independently as high-resolution regions. This strategy effectively reduces the
number of tokens, allowing subsequent layers to handle a reduced token sequence
and achieve acceleration. We evaluate our proposed method on three different
datasets and observe promising performance. For example, the "Segmenter ViT-L"
model can be accelerated by 48% FPS without fine-tuning while maintaining its
performance. Additionally, our method can be applied to accelerate fine-tuning
as well. Experimental results demonstrate that we can save 52% of training time
while accelerating FPS by 2.46 times with only a 0.09% performance drop. The code
is available at https://github.com/caddyless/ailurus/tree/main.
Comment: Accepted by NeurIPS 202
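The merge step described in this abstract (fold each token into its closest representative) can be sketched as follows. This is an illustration under stated assumptions, not the released code. The representative indices are passed in directly, whereas the paper selects them with a spatial-aware density-based clustering algorithm that is not reproduced here. Averaging each group is an assumed merge rule. The names `merge_tokens` and `squared_distance` are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def merge_tokens(tokens, representative_ids):
    """Assign every token to its closest representative and average each group.

    Returns (merged_tokens, assignment), where assignment[i] is the index of
    the merged token that original token i was folded into.
    """
    if not representative_ids:
        raise ValueError("need at least one representative token")
    reps = [tokens[i] for i in representative_ids]
    assignment = [
        min(range(len(reps)), key=lambda r: squared_distance(tok, reps[r]))
        for tok in tokens
    ]
    merged = []
    for r in range(len(reps)):
        # Fall back to the representative itself if no token chose it
        # (possible only when two representatives are identical).
        group = [tokens[i] for i, a in enumerate(assignment) if a == r] or [reps[r]]
        dim = len(group[0])
        merged.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return merged, assignment


# Hypothetical 2-D tokens: two tight clusters, with tokens 0 and 2
# assumed to be the selected representatives.
tokens = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2]]
merged, assignment = merge_tokens(tokens, [0, 2])
print(len(tokens), "->", len(merged), assignment)
```

Here the four input tokens collapse to two merged tokens, so any later layer would process a sequence half as long. The `assignment` list is kept so that per-token outputs could be scattered back to the original positions.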