Frequency-Aware Transformer for Learned Image Compression
Learned image compression (LIC) has gained traction as an effective solution
for image storage and transmission in recent years. However, existing LIC
methods produce redundant latent representations due to limitations in
capturing anisotropic frequency components and preserving directional details.
To overcome these challenges, we propose a novel frequency-aware transformer
(FAT) block that, for the first time, achieves multiscale directional analysis
for LIC. The FAT block comprises frequency-decomposition window attention (FDWA)
modules to capture multiscale and directional frequency components of natural
images. Additionally, we introduce a frequency-modulation feed-forward network
(FMFFN) to adaptively modulate different frequency components, improving
rate-distortion performance. Furthermore, we present a transformer-based
channel-wise autoregressive (T-CA) model that effectively exploits channel
dependencies. Experiments show that our method achieves state-of-the-art
rate-distortion performance compared to existing LIC methods, and clearly
outperforms the latest standardized codec VTM-12.1 by 14.5%, 15.1%, and 13.0%
in BD-rate on the Kodak, Tecnick, and CLIC datasets, respectively.
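The channel-wise autoregressive idea behind the T-CA entropy model can be sketched as follows: the latent channels are split into slices, and the entropy parameters of each slice are predicted only from slices already decoded. This is a minimal NumPy illustration under assumed shapes; `toy_predict` is a hypothetical stand-in for the paper's transformer predictor, not its actual architecture.

```python
import numpy as np

def channelwise_autoregressive(y, n_slices, predict):
    # y: (C, H, W) latent tensor. Channels are split into slices; the entropy
    # parameters (mu, sigma) of slice i are predicted from slices < i only,
    # so slices can be decoded sequentially while staying parallel over H, W.
    slices = np.array_split(np.arange(y.shape[0]), n_slices)
    decoded, params = [], []
    for idx in slices:
        ctx = (np.concatenate(decoded, axis=0) if decoded
               else np.zeros((0, *y.shape[1:])))
        mu, sigma = predict(ctx, len(idx))
        params.append((mu, sigma))
        decoded.append(y[idx])  # at decode time this comes from the bitstream
    return params

def toy_predict(ctx, n_out):
    # hypothetical placeholder for the transformer predictor in T-CA:
    # broadcast the mean of the decoded context as mu, unit sigma
    base = ctx.mean(axis=0) if ctx.shape[0] else np.zeros(ctx.shape[1:])
    mu = np.stack([base] * n_out)
    return mu, np.ones_like(mu)
```

At decode time the loop runs identically, except each slice is entropy-decoded from the bitstream using its predicted (mu, sigma) before serving as context for the next slice.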
Understanding Video Transformers for Segmentation: A Survey of Application and Interpretability
Video segmentation encompasses a wide range of problem formulations, e.g.,
object, scene, actor-action, and multimodal video segmentation, each
delineating task-specific scene components with pixel-level masks. Recently,
approaches in this research area have shifted from ConvNet-based to
transformer-based models. In addition, various
interpretability approaches have appeared for transformer models and video
temporal dynamics, motivated by the growing interest in basic scientific
understanding, model diagnostics and societal implications of real-world
deployment. Previous surveys mainly covered ConvNet models on a subset of
video segmentation tasks or transformers for classification, and a
component-wise discussion of transformer-based video segmentation models has
not yet received due focus. Likewise, earlier reviews of interpretability
methods concentrated on transformers for classification, while the
temporal-dynamics modelling capabilities of video models received less
attention. In this survey, we address these gaps with a thorough discussion of
various categories of video segmentation, a component-wise discussion of the
state-of-the-art transformer-based models, and a review of related
interpretability methods. We first present an introduction to the different
video segmentation task categories, their objectives, specific challenges and
benchmark datasets. Next, we provide a component-wise review of recent
transformer-based models and document the state of the art on different video
segmentation tasks. Subsequently, we discuss post-hoc and ante-hoc
interpretability methods for transformer models and interpretability methods
for understanding the role of the temporal dimension in video models. Finally,
we conclude our discussion with future research directions.
PiCANet: Learning Pixel-wise Contextual Attention for Saliency Detection
Contexts play an important role in the saliency detection task. However,
given a context region, not all contextual information is helpful for the final
task. In this paper, we propose a novel pixel-wise contextual attention
network, i.e., the PiCANet, to learn to selectively attend to informative
context locations for each pixel. Specifically, for each pixel, it can generate
an attention map in which each attention weight corresponds to the contextual
relevance at each context location. An attended contextual feature can then be
constructed by selectively aggregating the contextual information. We formulate
the proposed PiCANet in both global and local forms to attend to global and
local contexts, respectively. Both models are fully differentiable and can be
embedded into CNNs for joint training. We also incorporate the proposed models
with the U-Net architecture to detect salient objects. Extensive experiments
show that the proposed PiCANets can consistently improve saliency detection
performance. The global and local PiCANets facilitate learning global contrast
and homogeneousness, respectively. As a result, our saliency model can detect
salient objects more accurately and uniformly, thus performing favorably
against the state-of-the-art methods.
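The per-pixel attend-and-aggregate step of the global PiCANet can be sketched in NumPy. The real model predicts attention maps with learned subnetworks (and a local variant restricts the context window), so the bilinear score matrix `w` here is purely a hypothetical placeholder for illustration.

```python
import numpy as np

def global_pixel_attention(feat, w):
    # feat: (H, W, C) feature map; w: (C, C) hypothetical scoring weights.
    # Each pixel gets a softmax attention map over every context location,
    # and its output is the attention-weighted sum of all features.
    H, W, C = feat.shape
    flat = feat.reshape(H * W, C)
    logits = (flat @ w) @ flat.T                  # (HW, HW): one map per pixel
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn /= attn.sum(axis=-1, keepdims=True)      # per-pixel weights over context
    attended = attn @ flat                        # selective aggregation
    return attended.reshape(H, W, C), attn
```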
Detect Any Deepfakes: Segment Anything Meets Face Forgery Detection and Localization
The rapid advancements in computer vision have stimulated remarkable progress
in face forgery techniques, capturing the dedicated attention of researchers
committed to detecting forgeries and precisely localizing manipulated areas.
Nonetheless, with limited fine-grained pixel-wise supervision labels, deepfake
detection models perform unsatisfactorily on precise forgery detection and
localization. To address this challenge, we introduce the well-trained vision
segmentation foundation model, i.e., Segment Anything Model (SAM) in face
forgery detection and localization. Based on SAM, we propose the Detect Any
Deepfakes (DADF) framework with the Multiscale Adapter, which can capture
short- and long-range forgery contexts for efficient fine-tuning. Moreover, to
better identify forged traces and augment the model's sensitivity towards
forgery regions, a Reconstruction Guided Attention (RGA) module is proposed. The
proposed framework seamlessly integrates end-to-end forgery localization and
detection optimization. Extensive experiments on three benchmark datasets
demonstrate the superiority of our approach for both forgery detection and
localization. The code will be released soon at
https://github.com/laiyingxin2/DADF
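One plausible reading of the Reconstruction Guided Attention idea, sketched in NumPy under assumptions not stated in the abstract: a reconstruction branch rebuilds the features, and the reconstruction residual, large where forgery disturbs the face statistics, is turned into a gate that re-weights the features.

```python
import numpy as np

def reconstruction_guided_attention(feat, reconstruct):
    # feat: (H, W, C). reconstruct() tries to rebuild pristine-face features;
    # large residuals hint at forged regions, so the residual is mapped to an
    # attention gate that re-weights the features toward suspicious areas.
    recon = reconstruct(feat)
    residual = np.abs(feat - recon).mean(axis=-1, keepdims=True)  # (H, W, 1)
    gate = 1.0 / (1.0 + np.exp(-residual))                        # sigmoid gate
    return feat * gate
```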
Expediting Building Footprint Segmentation from High-resolution Remote Sensing Images via progressive lenient supervision
The efficacy of building footprint segmentation from remotely sensed images
has been hindered by the limited effectiveness of model transfer. Many existing building
segmentation methods were developed upon the encoder-decoder architecture of
U-Net, in which the encoder is finetuned from the newly developed backbone
networks that are pre-trained on ImageNet. However, the heavy computational
burden of the existing decoder designs hampers the successful transfer of these
modern encoder networks to remote sensing tasks. Even the widely-adopted deep
supervision strategy fails to mitigate these challenges due to its invalid loss
in hybrid regions where foreground and background pixels are intermixed. In
this paper, we conduct a comprehensive evaluation of existing decoder network
designs for building footprint segmentation and propose an efficient framework
denoted as BFSeg to enhance learning efficiency and effectiveness.
Specifically, a densely-connected coarse-to-fine feature fusion decoder network
that facilitates easy and fast feature fusion across scales is proposed.
Moreover, considering the invalidity of hybrid regions in the down-sampled
ground truth during the deep supervision process, we present a lenient deep
supervision and distillation strategy that enables the network to learn proper
knowledge from deep supervision. Building upon these advancements, we have
developed a new family of building segmentation networks, which consistently
surpass prior works with outstanding performance and efficiency across a wide
range of newly developed encoder networks. The code will be released on
https://github.com/HaonanGuo/BFSeg-Efficient-Building-Footprint-Segmentation-Framework.
Comment: 13 pages, 8 figures. Submitted to IEEE Transactions on Neural Networks
and Learning Systems
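One way to read the "invalid hybrid regions" problem: when a binary ground-truth mask is average-pooled for deep supervision, cells that mix foreground and background get fractional values that no hard label explains. A minimal NumPy sketch of a lenient variant (an illustration, not the paper's exact strategy) marks such cells as ignore:

```python
import numpy as np

def lenient_downsample_gt(gt, factor):
    # gt: (H, W) binary mask, H and W divisible by factor. Average-pool;
    # cells whose pooled value is neither 0 nor 1 straddle a boundary
    # (hybrid regions), so they are marked ignore (-1) and excluded from
    # the deep-supervision loss instead of receiving an invalid hard label.
    H, W = gt.shape
    pooled = gt.reshape(H // factor, factor, W // factor, factor).mean(axis=(1, 3))
    out = np.full_like(pooled, -1.0)
    out[pooled == 0.0] = 0.0
    out[pooled == 1.0] = 1.0
    return out
```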
CINFormer: Transformer network with multi-stage CNN feature injection for surface defect segmentation
Surface defect inspection is of great importance for industrial manufacture
and production. Though defect inspection methods based on deep learning have
made significant progress, there are still some challenges for these methods,
such as indistinguishable weak defects and defect-like interference in the
background. To address these issues, we propose a transformer network with
multi-stage CNN (Convolutional Neural Network) feature injection for surface
defect segmentation, a UNet-like structure named CINFormer. CINFormer
presents a simple yet effective feature integration mechanism that injects the
multi-level CNN features of the input image into different stages of the
transformer network in the encoder. This preserves the merit of CNNs in
capturing detailed features and that of transformers in suppressing background
noise, which facilitates accurate defect detection. In addition, CINFormer
presents a Top-K self-attention module to focus on tokens with more important
information about the defects, so as to further reduce the impact of the
redundant background. Extensive experiments conducted on the surface defect
datasets DAGM 2007, Magnetic tile, and NEU show that the proposed CINFormer
achieves state-of-the-art performance in defect detection.
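The Top-K self-attention module can be sketched as ordinary scaled dot-product attention in which each query keeps only its k strongest keys before the softmax; this NumPy version is a simplified single-head illustration, not the paper's exact implementation.

```python
import numpy as np

def topk_self_attention(x, k, wq, wk, wv):
    # x: (n_tokens, d). Standard scaled dot-product attention, except each
    # query keeps only its k highest-scoring keys; the rest are masked out
    # before the softmax, suppressing redundant background tokens.
    q, key, v = x @ wq, x @ wk, x @ wv
    scores = q @ key.T / np.sqrt(q.shape[-1])
    drop = np.argsort(scores, axis=-1)[:, :-k]   # indices of the n-k smallest
    mask = np.zeros_like(scores, dtype=bool)
    np.put_along_axis(mask, drop, True, axis=-1)
    scores[mask] = -np.inf
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)     # softmax over the k kept keys
    return attn @ v, attn
```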