17 research outputs found
DynaST: Dynamic Sparse Transformer for Exemplar-Guided Image Generation
One key challenge of exemplar-guided image generation lies in establishing
fine-grained correspondences between input and guided images. Prior approaches,
despite promising results, have relied either on estimating dense attention
to compute per-point matching, which is limited to coarse scales due to
its quadratic memory cost, or on fixing the number of correspondences to achieve
linear complexity, which lacks flexibility. In this paper, we propose a dynamic
sparse attention based Transformer model, termed Dynamic Sparse Transformer
(DynaST), to achieve fine-level matching with favorable efficiency. The heart
of our approach is a novel dynamic-attention unit, dedicated to handling the
variation in the optimal number of tokens each position should attend to.
Specifically, DynaST leverages the multi-layer nature of the Transformer structure
and applies the dynamic attention scheme in a cascaded manner to refine
matching results and synthesize visually-pleasing outputs. In addition, we
introduce a unified training objective for DynaST, making it a versatile
reference-based image translation framework for both supervised and
unsupervised scenarios. Extensive experiments on three applications,
pose-guided person image generation, edge-based face synthesis, and undistorted
image style transfer, demonstrate that DynaST achieves superior performance in
local details, outperforming the state of the art while reducing the
computational cost significantly. Our code is available at
https://github.com/Huage001/DynaST
Comment: ECCV 2022
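The abstract does not spell out the attention mechanism itself, but the core idea of matching each position against only a small, dynamically chosen set of exemplar tokens can be illustrated with a top-k sparse attention step. The PyTorch sketch below is a minimal illustration under that assumption: topk is a hypothetical fixed budget, whereas DynaST varies the number of correspondences and refines them across cascaded layers, and an efficient implementation would avoid building the dense score matrix in the first place.

    import torch
    import torch.nn.functional as F

    def topk_sparse_attention(q, k, v, topk=8):
        """Attend each query position to only its top-k keys.

        q, k, v: (batch, tokens, dim). topk is a hypothetical fixed budget;
        DynaST instead varies the number of matches and refines them layer
        by layer.
        """
        scale = q.shape[-1] ** -0.5
        scores = torch.einsum("bqd,bkd->bqk", q, k) * scale  # pairwise matching scores
        top_vals, top_idx = scores.topk(topk, dim=-1)        # best k correspondences per query
        sparse = torch.full_like(scores, float("-inf"))
        sparse.scatter_(-1, top_idx, top_vals)               # keep only the selected matches
        attn = F.softmax(sparse, dim=-1)                     # zero weight everywhere else
        return torch.einsum("bqk,bkd->bqd", attn, v)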
Shunted Self-Attention via Multi-Scale Token Aggregation
Recent Vision Transformer (ViT) models have demonstrated encouraging results
across various computer vision tasks, thanks to their competence in modeling
long-range dependencies of image patches or tokens via self-attention. These
models, however, usually designate similar receptive fields for each token
feature within each layer. Such a constraint inevitably limits the ability of
each self-attention layer to capture multi-scale features, thereby leading to
performance degradation in handling images with multiple objects of different
scales. To address this issue, we propose a novel and generic strategy, termed
shunted self-attention (SSA), which allows ViTs to model attention at
hybrid scales per attention layer. The key idea of SSA is to inject
heterogeneous receptive field sizes into tokens: before computing the
self-attention matrix, it selectively merges tokens to represent larger object
features while keeping certain tokens to preserve fine-grained features. This
novel merging scheme enables the self-attention to learn relationships between
objects of different sizes and simultaneously reduces the number of tokens and
the computational cost. Extensive experiments across various tasks demonstrate
the superiority of SSA. Specifically, the SSA-based transformer achieves 84.0%
Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on
ImageNet with only half of the model size and computation cost, and surpasses
Focal Transformer by 1.3 mAP on COCO and 2.9 mIoU on ADE20K under similar
parameter and computation cost. Code has been released at
https://github.com/OliverRensu/Shunted-Transformer
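As a rough illustration of the merging idea, the sketch below splits the attention heads into groups and lets each group attend to keys and values average-pooled at a different rate, so coarse heads see merged tokens while fine heads keep full resolution. The class name, pooling choice, and hyperparameters are assumptions for illustration only; the released Shunted Transformer has its own token-aggregation and projection details.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ShuntedAttentionSketch(nn.Module):
        """Illustrative multi-scale attention: each head group attends to
        keys/values pooled at a different rate (coarse vs. fine tokens)."""

        def __init__(self, dim, heads=4, rates=(1, 4)):
            super().__init__()
            assert dim % heads == 0 and heads % len(rates) == 0
            self.heads, self.rates = heads, rates
            self.head_dim = dim // heads
            self.q = nn.Linear(dim, dim)
            self.kv = nn.Linear(dim, dim * 2)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x, h, w):
            b, n, c = x.shape                        # tokens on an h x w grid, n == h * w
            group = self.heads // len(self.rates)
            q = self.q(x).reshape(b, n, self.heads, self.head_dim).transpose(1, 2)
            outs = []
            for i, r in enumerate(self.rates):
                tokens = x
                if r > 1:                            # merge tokens by average-pooling the grid
                    grid = x.transpose(1, 2).reshape(b, c, h, w)
                    tokens = F.avg_pool2d(grid, r).flatten(2).transpose(1, 2)
                k, v = self.kv(tokens).chunk(2, dim=-1)
                k = k.reshape(b, -1, self.heads, self.head_dim).transpose(1, 2)
                v = v.reshape(b, -1, self.heads, self.head_dim).transpose(1, 2)
                sl = slice(i * group, (i + 1) * group)   # this rate's head group
                attn = F.softmax(q[:, sl] @ k[:, sl].transpose(-2, -1)
                                 * self.head_dim ** -0.5, dim=-1)
                outs.append(attn @ v[:, sl])
            out = torch.cat(outs, dim=1).transpose(1, 2).reshape(b, n, c)
            return self.proj(out)

For a 14 x 14 token grid with dim 64, ShuntedAttentionSketch(64, heads=4, rates=(1, 2))(x, 14, 14) returns tokens of the same shape, with half of the heads attending over a 7 x 7 merged grid.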
Fine-grained Domain Adaptive Crowd Counting via Point-derived Segmentation
Due to domain shift, a large performance drop is usually observed when a
trained crowd counting model is deployed in the wild. While existing
domain-adaptive crowd counting methods achieve promising results, they
typically regard each crowd image as a whole and reduce domain discrepancies in
a holistic manner, thus limiting further improvement of domain adaptation
performance. To this end, we propose to untangle domain-invariant crowd
and domain-specific background from crowd images and design a
fine-grained domain adaptation method for crowd counting. Specifically, to
disentangle crowd from background, we propose to learn crowd segmentation from
point-level crowd counting annotations in a weakly-supervised manner. Based on
the derived segmentation, we design a crowd-aware domain adaptation mechanism
consisting of two crowd-aware adaptation modules, i.e., Crowd Region Transfer
(CRT) and Crowd Density Alignment (CDA). The CRT module is designed to guide
the transfer of crowd features across domains beyond background distractions. The CDA
module is dedicated to regularising target-domain crowd density generation using its
own crowd density distribution. Our method outperforms previous approaches
consistently in widely-used adaptation scenarios.
Comment: 10 pages, 5 figures, and 9 tables
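The abstract does not say how segmentation is obtained from the point labels, so the sketch below only illustrates one plausible starting point: turning point-level head annotations into a pseudo crowd/background mask by placing a Gaussian at each annotated point and thresholding the result. The function name and the sigma and thresh values are assumptions; the paper learns the segmentation from these weak labels and uses it to drive the CRT and CDA modules rather than hard-coding a mask.

    import torch

    def point_to_crowd_mask(points, height, width, sigma=8.0, thresh=0.2):
        """Turn point-level head annotations into a pseudo crowd/background mask.

        points: (P, 2) tensor of (x, y) coordinates. sigma and thresh are
        illustrative; the paper learns segmentation from the weak labels
        instead of hard-coding it like this.
        """
        ys = torch.arange(height, dtype=torch.float32).view(-1, 1)
        xs = torch.arange(width, dtype=torch.float32).view(1, -1)
        density = torch.zeros(height, width)
        for x, y in points.tolist():
            density += torch.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
        return (density > thresh).float()  # 1 = crowd region, 0 = background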
The Modality Focusing Hypothesis: On the Blink of Multimodal Knowledge Distillation
Multimodal knowledge distillation (KD) extends traditional knowledge
distillation to the area of multimodal learning. One common practice is to
adopt a well-performed multimodal network as the teacher in the hope that it
can transfer its full knowledge to a unimodal student for performance
improvement. In this paper, we investigate the efficacy of multimodal KD. We
begin by providing two failure cases and demonstrate that KD is not a
universal cure in multimodal knowledge transfer. We then present the modality Venn
diagram, which helps characterize modality relationships, and the modality focusing
hypothesis, which reveals the decisive factor in the efficacy of multimodal KD.
Experimental results on 6 multimodal datasets help justify our hypothesis,
diagnose failure cases, and point to directions for improving distillation
performance.
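For context, the common practice the abstract refers to, using a multimodal network as the teacher for a unimodal student, typically reduces to the standard distillation objective sketched below. This is a generic sketch rather than the paper's method or experimental setup; temperature and alpha are illustrative hyperparameters.

    import torch.nn.functional as F

    def multimodal_kd_loss(student_logits, teacher_logits, labels,
                           temperature=4.0, alpha=0.7):
        """Generic distillation objective: a unimodal student matches the
        softened predictions of a multimodal teacher, plus the usual
        supervised cross-entropy term."""
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        kd = F.kl_div(log_soft_student, soft_teacher,
                      reduction="batchmean") * temperature ** 2
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kd + (1 - alpha) * ce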
Learning from the master: Distilling cross-modal advanced knowledge for lip reading
National Science Foundation; CCF-Tencent; China Postdoctoral Science Foundation; Fundamental Research Funds for the Central Universities