22 research outputs found
FMViT: A multiple-frequency mixing Vision Transformer
The transformer model has recently gained widespread adoption in computer vision tasks. However, because the time and memory complexity of self-attention grows quadratically with the number of input tokens, most existing Vision Transformers (ViTs) struggle to run efficiently in practical industrial deployment scenarios, such as TensorRT and CoreML, where traditional CNNs excel. Although some recent attempts have been made to design CNN-Transformer hybrid architectures to address this problem, their overall performance has not met expectations. To tackle these challenges,
we propose an efficient hybrid ViT architecture named FMViT. This approach
enhances the model's expressive power by mixing high-frequency and low-frequency features at varying frequencies, enabling it to capture both
local and global information effectively. Additionally, we introduce
deploy-friendly mechanisms such as Convolutional Multigroup Reparameterization
(gMLP), Lightweight Multi-head Self-Attention (RLMHSA), and Convolutional
Fusion Block (CFB) to further improve the model's performance and reduce
computational overhead. Our experiments demonstrate that FMViT surpasses existing CNNs, ViTs, and CNN-Transformer hybrid architectures in latency/accuracy trade-offs across various vision tasks. On the TensorRT platform, FMViT outperforms ResNet101 by 2.5% in top-1 accuracy on the ImageNet dataset (83.3% vs. 80.8%) while maintaining similar inference latency. Moreover, FMViT achieves performance comparable to EfficientNet-B5 with a 43% improvement in inference speed. On CoreML, FMViT outperforms MobileOne by 2.6% in top-1 accuracy on the ImageNet dataset (78.5% vs. 75.9%), with inference latency comparable to MobileOne's. Our code can be found at
https://github.com/tany0699/FMViT
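The deploy-friendly reparameterization idea can be illustrated with a minimal sketch: train with parallel convolution branches for expressiveness, then fold them into a single convolution for inference. The following is a generic illustration of structural reparameterization (in the spirit of RepVGG-style branch merging), not FMViT's exact gMLP block; all names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReparamConvBlock(nn.Module):
    """Train-time block with parallel 3x3 and 1x1 branches that folds
    into a single 3x3 convolution for deployment."""
    def __init__(self, channels):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=True)
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=True)
        self.deployed = None  # set by fuse()

    def forward(self, x):
        if self.deployed is not None:
            return self.deployed(x)  # single-branch inference path
        return self.conv3(x) + self.conv1(x)  # multi-branch training path

    @torch.no_grad()
    def fuse(self):
        # A 1x1 conv equals a 3x3 conv whose kernel is zero-padded to 3x3,
        # so the two branches collapse into one equivalent convolution.
        w = self.conv3.weight + F.pad(self.conv1.weight, [1, 1, 1, 1])
        b = self.conv3.bias + self.conv1.bias
        fused = nn.Conv2d(w.shape[1], w.shape[0], 3, padding=1, bias=True)
        fused.weight.copy_(w)
        fused.bias.copy_(b)
        self.deployed = fused
```

After fuse(), the block produces identical outputs with a single convolution, which is what makes such designs fast under TensorRT and CoreML.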
Hypergraph Transformer for Skeleton-based Action Recognition
Skeleton-based action recognition aims to predict human actions given human
joint coordinates with skeletal interconnections. To model such off-grid data
points and their co-occurrences, Transformer-based formulations would be a
natural choice. However, Transformers still lag behind state-of-the-art methods
using graph convolutional networks (GCNs). Transformers assume that the input
is permutation-invariant and homogeneous (partially alleviated by positional
encoding), which ignores an important characteristic of skeleton data, i.e.,
bone connectivity. Furthermore, each type of body joint has a clear physical
meaning in human motion, i.e., motion retains an intrinsic relationship
regardless of the joint coordinates, which is not explored in Transformers. In
fact, certain recurring groups of body joints are often involved in specific
actions, such as the subconscious hand movement for keeping balance. Vanilla
attention is incapable of describing such underlying relations that are
persistent and beyond pair-wise. In this work, we aim to exploit these unique
aspects of skeleton data to close the performance gap between Transformers and
GCNs. Specifically, we propose a new self-attention (SA) extension, named
Hypergraph Self-Attention (HyperSA), to incorporate inherently higher-order
relations into the model. The K-hop relative positional embeddings are also
employed to take bone connectivity into account. We name the resulting model
Hyperformer, and it achieves accuracy and efficiency comparable to or better than state-of-the-art GCN architectures on the NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA datasets. On the largest dataset, NTU RGB+D 120, the significant improvement achieved by our Hyperformer demonstrates the underestimated potential of Transformer models in this field.
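To make the higher-order idea concrete, here is a minimal single-head sketch in which attention scores receive an additive bias for joints that share a hyperedge (e.g., a body-part group). This illustrates hypergraph-biased attention under assumed shapes; it is not the paper's exact HyperSA formulation, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def hypergraph_self_attention(x, w_qkv, incidence, edge_bias):
    """Toy single-head self-attention with a hypergraph bias.

    x:         (N, D)   joint features for one frame
    w_qkv:     (D, 3*D) projection weights
    incidence: (N, E)   binary joint-to-hyperedge incidence matrix
    edge_bias: (E,)     learned importance per hyperedge

    Joints that share a hyperedge receive an additive attention bias,
    injecting higher-order structure that plain pairwise attention
    cannot express.
    """
    n, d = x.shape
    q, k, v = (x @ w_qkv).chunk(3, dim=-1)
    scores = q @ k.T / d ** 0.5                       # (N, N) pairwise term
    # Higher-order term: bias(i, j) sums over hyperedges containing both i, j
    bias = incidence @ torch.diag(edge_bias) @ incidence.T
    attn = F.softmax(scores + bias, dim=-1)
    return attn @ v

# Usage with random shapes: 25 joints, 64-dim features, 5 hyperedges
x = torch.randn(25, 64)
w = torch.randn(64, 192) * 0.02
H = (torch.rand(25, 5) > 0.7).float()
out = hypergraph_self_attention(x, w, H, torch.randn(5))
```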
Refined Temporal Pyramidal Compression-and-Amplification Transformer for 3D Human Pose Estimation
Accurately estimating the 3D pose of humans in video sequences demands both precision and a well-structured architecture. Building on the success of transformers,
we introduce the Refined Temporal Pyramidal Compression-and-Amplification
(RTPCA) transformer. Exploiting the temporal dimension, RTPCA extends
intra-block temporal modeling via its Temporal Pyramidal
Compression-and-Amplification (TPCA) structure and refines inter-block feature
interaction with a Cross-Layer Refinement (XLR) module. In particular, the TPCA block exploits a temporal pyramid paradigm, reinforcing key and value
representation capabilities and seamlessly extracting spatial semantics from
motion sequences. We stitch these TPCA blocks with XLR that promotes rich
semantic representation through continuous interaction of queries, keys, and
values. This strategy blends early-stage information into the current flow, addressing the deficits in detail and stability seen in other
transformer-based methods. We demonstrate the effectiveness of RTPCA by
achieving state-of-the-art results on Human3.6M, HumanEva-I, and MPI-INF-3DHP
benchmarks with minimal computational overhead. The source code is available at
https://github.com/hbing-l/RTPCA.
Comment: 11 pages, 5 figures
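The temporal compression idea can be sketched as attention over keys and values that are average-pooled along time at several strides, so each query sees a multi-resolution temporal pyramid. This is a generic illustration of that compression step under assumed shapes, not RTPCA's exact TPCA block.

```python
import torch
import torch.nn.functional as F

def pyramid_attention(q, k, v, scales=(1, 2, 4)):
    """Toy attention with temporally compressed keys/values.

    q, k, v: (T, D) per-frame features. Keys and values are average-pooled
    along time at several strides and concatenated, so queries attend to a
    multi-resolution temporal pyramid instead of raw frames only.
    """
    t, d = k.shape
    ks, vs = [], []
    for s in scales:
        # (T, D) -> (1, D, T) -> pooled to length T//s -> back to (T//s, D)
        ks.append(F.avg_pool1d(k.T[None], s, stride=s)[0].T)
        vs.append(F.avg_pool1d(v.T[None], s, stride=s)[0].T)
    k_pyr, v_pyr = torch.cat(ks), torch.cat(vs)
    attn = F.softmax(q @ k_pyr.T / d ** 0.5, dim=-1)
    return attn @ v_pyr

q = k = v = torch.randn(16, 32)
out = pyramid_attention(q, k, v)  # (16, 32)
```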
DAMO-StreamNet: Optimizing Streaming Perception in Autonomous Driving
Real-time perception, or streaming perception, is a crucial aspect of
autonomous driving that has yet to be thoroughly explored in existing research.
To address this gap, we present DAMO-StreamNet, an optimized framework that
combines recent advances from the YOLO series with a comprehensive analysis of
spatial and temporal perception mechanisms, delivering a cutting-edge solution.
The key innovations of DAMO-StreamNet are: (1) A robust neck structure
incorporating deformable convolution, enhancing the receptive field and feature
alignment capabilities. (2) A dual-branch structure that integrates short-path
semantic features and long-path temporal features, improving motion state
prediction accuracy. (3) Logits-level distillation for efficient optimization,
aligning the logits of teacher and student networks in semantic space. (4) A
real-time forecasting mechanism that updates support frame features with the
current frame, ensuring seamless streaming perception during inference. Our
experiments demonstrate that DAMO-StreamNet surpasses existing state-of-the-art
methods, achieving 37.8% sAP at the normal input size (600, 960) and 43.3% sAP at the large input size (1200, 1920) without using extra data. This work not only sets a new benchmark
for real-time perception but also provides valuable insights for future
research. Additionally, DAMO-StreamNet can be applied to various autonomous
systems, such as drones and robots, paving the way for broader real-time perception applications.
The code is available at https://github.com/zhiqic/DAMO-StreamNet
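Of these innovations, logits-level distillation is the most self-contained to sketch: soften teacher and student logits with a temperature and minimize their KL divergence. The snippet below shows this standard technique; DAMO-StreamNet's exact loss weighting and semantic-space alignment details may differ.

```python
import torch
import torch.nn.functional as F

def logits_distillation_loss(student_logits, teacher_logits, tau=2.0):
    """Standard logits-level distillation: soften both distributions with a
    temperature and minimize their KL divergence."""
    s = F.log_softmax(student_logits / tau, dim=-1)
    t = F.softmax(teacher_logits / tau, dim=-1)
    # tau**2 keeps gradient magnitudes comparable across temperatures
    return F.kl_div(s, t, reduction="batchmean") * tau ** 2

student = torch.randn(8, 80, requires_grad=True)  # e.g. 80 detection classes
teacher = torch.randn(8, 80)
loss = logits_distillation_loss(student, teacher)
loss.backward()
```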
PGformer: Proxy-Bridged Game Transformer for Multi-Person Extremely Interactive Motion Prediction
Multi-person motion prediction is a challenging task, especially for
real-world scenarios of densely interacted persons. Most previous works have
been devoted to studying the case of weak interactions (e.g., hand-shaking),
which typically forecast each human pose in isolation. In this paper, we focus
on motion prediction for multiple persons with extreme collaborations and
attempt to explore the relationships between the highly interactive persons'
motion trajectories. Specifically, a novel cross-query attention (XQA) module
is proposed to bilaterally learn the cross-dependencies between the two pose
sequences tailored to this situation. Additionally, we introduce a
proxy entity to bridge the involved persons, which cooperates with our proposed
XQA module and subtly controls the bidirectional information flows, acting as a
motion intermediary. We then adapt these designs to a Transformer-based
architecture and devise a simple yet effective end-to-end framework called
proxy-bridged game Transformer (PGformer) for multi-person interactive motion
prediction. The effectiveness of our method has been evaluated on the
challenging ExPI dataset, which involves highly interactive actions. We show
that our PGformer consistently outperforms the state-of-the-art methods in both
short- and long-term predictions by a large margin. Besides, our approach is also compatible with the weakly interactive CMU-Mocap and MuPoTS-3D datasets, where it achieves encouraging results. Our code will become publicly available upon acceptance.
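The bilateral cross-dependency idea can be sketched as two attention passes in which each person's queries attend over the other person's keys and values. This is a generic illustration under assumed shapes, not the paper's exact XQA module; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def cross_query_attention(xa, xb, w_q, w_k, w_v):
    """Toy bilateral cross-attention between two motion sequences.

    xa, xb: (T, D) pose features of person A and person B. Each person's
    queries attend over the other person's keys/values, so the predicted
    motion of one partner is conditioned on the other's trajectory.
    """
    d = xa.shape[-1]
    def attend(q_src, kv_src):
        q, k, v = q_src @ w_q, kv_src @ w_k, kv_src @ w_v
        return F.softmax(q @ k.T / d ** 0.5, dim=-1) @ v
    ya = attend(xa, xb)  # A queries B
    yb = attend(xb, xa)  # B queries A
    return ya, yb

xa, xb = torch.randn(50, 64), torch.randn(50, 64)
w = [torch.randn(64, 64) * 0.02 for _ in range(3)]
ya, yb = cross_query_attention(xa, xb, *w)
```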
Towards Deeply Unified Depth-aware Panoptic Segmentation with Bi-directional Guidance Learning
Depth-aware panoptic segmentation is an emerging topic in computer vision
which combines semantic and geometric understanding for more robust scene
interpretation. Recent works pursue unified frameworks to tackle this challenge
but mostly still treat it as two individual learning tasks, which limits their
potential for exploring cross-domain information. We propose a deeply unified
framework for depth-aware panoptic segmentation, which performs joint
segmentation and depth estimation both in a per-segment manner with identical
object queries. To narrow the gap between the two tasks, we further design a
geometric query enhancement method, which is able to integrate scene geometry
into object queries using latent representations. In addition, we propose a
bi-directional guidance learning approach to facilitate cross-task feature
learning by taking advantage of their mutual relations. Our method sets the new
state of the art for depth-aware panoptic segmentation on both Cityscapes-DVPS
and SemKITTI-DVPS datasets. Moreover, our guidance learning approach is shown
to deliver performance improvements even with incomplete supervision labels.
Comment: to be published in ICCV 2023
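The unified per-segment design can be pictured as a single set of object queries feeding two lightweight heads, one producing mask logits and one producing depth. A minimal sketch under assumed shapes follows; module names are illustrative, not the paper's actual components.

```python
import torch
import torch.nn as nn

class JointSegDepthHead(nn.Module):
    """Toy per-segment head: the same object queries drive both a mask
    prediction and a depth prediction, so segmentation and depth share
    one representation."""
    def __init__(self, dim=256):
        super().__init__()
        self.mask_embed = nn.Linear(dim, dim)   # query -> mask kernel
        self.depth_embed = nn.Linear(dim, dim)  # query -> depth kernel
        self.depth_scale = nn.Linear(dim, 1)    # per-segment depth offset

    def forward(self, queries, pixel_feats):
        # queries: (Q, dim), pixel_feats: (dim, H, W)
        q_mask = self.mask_embed(queries)
        q_depth = self.depth_embed(queries)
        masks = torch.einsum("qd,dhw->qhw", q_mask, pixel_feats)
        depth = torch.einsum("qd,dhw->qhw", q_depth, pixel_feats)
        depth = depth + self.depth_scale(queries)[..., None]
        return masks, depth  # per-query mask logits and depth maps

head = JointSegDepthHead()
masks, depth = head(torch.randn(100, 256), torch.randn(256, 64, 128))
```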
PoSynDA: Multi-Hypothesis Pose Synthesis Domain Adaptation for Robust 3D Human Pose Estimation
Existing 3D human pose estimators face challenges in adapting to new datasets
due to the lack of 2D-3D pose pairs in training sets. To overcome this issue,
we propose the Multi-Hypothesis Pose Synthesis Domain Adaptation (PoSynDA) framework to bridge this data disparity gap in the target domain. Specifically, PoSynDA uses a
diffusion-inspired structure to simulate 3D pose distribution in the target
domain. By incorporating a multi-hypothesis network, PoSynDA generates diverse
pose hypotheses and aligns them with the target domain. To do this, it first
utilizes target-specific source augmentation to obtain the target domain
distribution data from the source domain by decoupling the scale and position
parameters. The process is then further refined through the teacher-student
paradigm and low-rank adaptation. With extensive comparison of benchmarks such
as Human3.6M and MPI-INF-3DHP, PoSynDA demonstrates competitive performance,
even comparable to the target-trained MixSTE model (Zhang et al., 2022). This
work paves the way for the practical application of 3D human pose estimation in
unseen domains. The code is available at https://github.com/hbing-l/PoSynDA.
Comment: Accepted to ACM Multimedia 2023; 10 pages, 4 figures, 8 tables; the code is at https://github.com/hbing-l/PoSynDA
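The scale-and-position decoupling step can be sketched as follows: separate the global scale and root translation from pose shape, then resample both so source poses mimic the target-domain distribution. Ranges and the parameterization are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def decoupled_source_augmentation(poses_3d, scale_range=(0.8, 1.2),
                                  trans_std=0.1):
    """Toy target-specific source augmentation for root-relative 3D poses.

    poses_3d: (B, J, 3) joint positions. A per-sample scale factor models
    body size and a per-sample translation models root position, each
    resampled independently of the pose shape itself.
    """
    b = poses_3d.shape[0]
    lo, hi = scale_range
    scale = torch.empty(b, 1, 1).uniform_(lo, hi)  # body-size factor
    trans = torch.randn(b, 1, 3) * trans_std       # new root position
    return poses_3d * scale + trans

aug = decoupled_source_augmentation(torch.randn(32, 17, 3))
```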
KeyPosS: Plug-and-Play Facial Landmark Detection through GPS-Inspired True-Range Multilateration
In the realm of facial analysis, accurate landmark detection is crucial for
various applications, ranging from face recognition and expression analysis to
animation. Conventional heatmap or coordinate regression-based techniques,
however, often face challenges in terms of computational burden and
quantization errors. To address these issues, we present the KeyPoint
Positioning System (KeyPosS) - a groundbreaking facial landmark detection
framework that stands out from existing methods. The framework utilizes a fully
convolutional network to predict a distance map, which computes the distance
between a Point of Interest (POI) and multiple anchor points. These anchor
points are then harnessed to localize the POI's position through the
True-range Multilateration algorithm. Notably, the plug-and-play nature of
KeyPosS enables seamless integration into any decoding stage, ensuring a
versatile and adaptable solution. We conducted a thorough evaluation of
KeyPosS's performance by benchmarking it against state-of-the-art models on
four different datasets. The results show that KeyPosS substantially
outperforms leading methods in low-resolution settings while requiring a
minimal time overhead. The code is available at
https://github.com/zhiqic/KeyPosS.
Comment: Accepted to ACM Multimedia 2023; 10 pages, 7 figures, 6 tables; the code is at https://github.com/zhiqic/KeyPosS
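The decoding step rests on classic true-range multilateration: given predicted distances from a point to several known anchors, subtracting one range equation from the rest linearizes the system, which is then solved by least squares. A minimal 2D sketch of that classic algorithm follows; it is not KeyPosS's actual decoding code.

```python
import numpy as np

def multilaterate(anchors, distances):
    """Least-squares true-range multilateration in 2D.

    anchors:   (N, 2) known anchor coordinates
    distances: (N,)   predicted distances from the unknown point

    Subtracting the first range equation from the others turns
    |p - a_i|^2 = d_i^2 into a linear system A p = b.
    """
    a0, d0 = anchors[0], distances[0]
    A = 2.0 * (anchors[1:] - a0)
    b = (np.sum(anchors[1:] ** 2, axis=1) - np.sum(a0 ** 2)
         - distances[1:] ** 2 + d0 ** 2)
    p, *_ = np.linalg.lstsq(A, b, rcond=None)
    return p

# Recover a known point from noisy distances to four anchors
anchors = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]])
true_p = np.array([3.0, 7.0])
dists = np.linalg.norm(anchors - true_p, axis=1) + np.random.normal(0, 0.01, 4)
print(multilaterate(anchors, dists))  # approx. [3.0, 7.0]
```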
WordArt Designer: User-Driven Artistic Typography Synthesis using Large Language Models
This paper introduces WordArt Designer, a user-driven framework for artistic
typography synthesis, relying on the Large Language Model (LLM). The system
incorporates four key modules: the LLM Engine, SemTypo, StyTypo, and TexTypo
modules. 1) The LLM Engine, empowered by the LLM (e.g., GPT-3.5), interprets
user inputs and generates actionable prompts for the other modules, thereby
transforming abstract concepts into tangible designs. 2) The SemTypo module
optimizes font designs using semantic concepts, striking a balance between
artistic transformation and readability. 3) Building on the semantic layout
provided by the SemTypo module, the StyTypo module creates smooth, refined
images. 4) The TexTypo module further enhances the design's aesthetics through
texture rendering, enabling the generation of inventive textured fonts.
Notably, WordArt Designer highlights the fusion of generative AI with artistic
typography. Experience its capabilities on ModelScope:
https://www.modelscope.cn/studios/WordArt/WordArt.
Comment: Accepted by EMNLP 2023, 10 pages, 11 figures, 1 table, the system is at https://www.modelscope.cn/studios/WordArt/WordArt
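The LLM-Engine pattern, one model call turning a free-form request into structured prompts for the downstream modules, can be sketched as below. The prompt format, module split, and all names are illustrative assumptions, not WordArt Designer's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class TypographyPlan:
    semantic_prompt: str  # consumed by a semantic-layout module
    style_prompt: str     # consumed by a stylization module
    texture_prompt: str   # consumed by a texture-rendering module

def plan_from_user_input(user_text: str, llm) -> TypographyPlan:
    """Turn a free-form request into structured prompts via one LLM call.

    `llm` is any callable returning three lines of text, one per module.
    """
    instruction = (
        "Rewrite the request as three lines: a semantic concept for the font "
        "layout, a visual style, and a texture description.\n"
        f"Request: {user_text}"
    )
    sem, sty, tex = llm(instruction).strip().split("\n")[:3]
    return TypographyPlan(sem, sty, tex)

# Usage with a stub LLM standing in for e.g. GPT-3.5
fake_llm = lambda _: "blooming lotus\nink-wash painting\nembossed gold foil"
print(plan_from_user_input("a festive lotus-themed logo", fake_llm))
```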
Diversity Transfer Network for Few-Shot Learning
Few-shot learning is a challenging task that aims at training a classifier
for unseen classes with only a few training examples. The main difficulty of
few-shot learning lies in the lack of intra-class diversity within insufficient
training samples. To alleviate this problem, we propose a novel generative
framework, Diversity Transfer Network (DTN), that learns to transfer latent
diversities from known categories and combine them with support features to
generate diverse samples for novel categories in feature space. The learning
problem of the sample generation (i.e., diversity transfer) is solved via
minimizing an effective meta-classification loss in a single-stage network,
instead of the generative loss in previous works.
Besides, an organized auxiliary task co-training over known categories is
proposed to stabilize the meta-training process of DTN. We perform extensive
experiments and ablation studies on three datasets, i.e., miniImageNet,
CIFAR100 and CUB. The results show that DTN, with single-stage training and
faster convergence speed, obtains the state-of-the-art results among the
feature generation based few-shot learning methods. Code and supplementary
material are available at: https://github.com/Yuxin-CV/DTN
Comment: 9 pages, 3 figures, AAAI 2020
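The core mechanism, transferring intra-class variation from a known class onto a novel-class support feature, admits a compact sketch. Shapes and the generator network are illustrative assumptions; this conveys the idea, not DTN's exact architecture.

```python
import torch

def transfer_diversity(support_feat, ref_a, ref_b, generator):
    """Toy diversity transfer in feature space: the offset between two
    same-class reference samples from a known category is re-encoded and
    added to a novel-class support feature to synthesize a diverse sample.

    `generator` is any small network mapping an offset to a residual.
    """
    diversity = ref_a - ref_b        # intra-class variation of a known class
    residual = generator(diversity)  # re-encode the variation
    return support_feat + residual   # diverse sample for the novel class

gen = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, 64))
support = torch.randn(64)
ref_a, ref_b = torch.randn(64), torch.randn(64)
fake_sample = transfer_diversity(support, ref_a, ref_b, gen)
```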