Scalable Transformer for PDE Surrogate Modeling
Transformers have shown state-of-the-art performance on various applications
and have recently emerged as a promising tool for surrogate modeling of partial
differential equations (PDEs). Despite the introduction of linear-complexity
variants, applying attention to a large number of grid points can result in
instability and is still expensive to compute. In this work, we propose the
Factorized Transformer (FactFormer), which is based on an axial factorized
kernel integral. Concretely, we introduce a learnable projection operator that
decomposes the input function into multiple sub-functions with one-dimensional
domains. These sub-functions are then evaluated and used to compute the
instance-based kernel with an axial factorized scheme. We showcase that the
proposed model is able to simulate 2D Kolmogorov flow on a 256 by 256 grid and
3D smoke buoyancy on a 64 by 64 by 64 grid with good accuracy and efficiency.
In addition, we find that with the factorization scheme, the attention
matrices enjoy a more compact spectrum than full softmax-free attention
matrices.
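As a rough illustration of the axial factorization idea, the sketch below applies standard multi-head attention along each grid axis separately, so the cost on an H x W grid scales with H*W*(H+W) rather than (H*W)^2. This is a minimal sketch of per-axis attention only, not the paper's projection-based kernel integral, and all module and parameter names are illustrative assumptions.

```python
# Minimal sketch of axial (factorized) attention over a 2D grid.
# Illustration of the general idea only, not the authors' FactFormer code:
# attention is applied along each axis separately, so the cost scales with
# H*W*(H+W) instead of (H*W)^2 for full attention over all grid points.
import torch
import torch.nn as nn


class AxialAttention2D(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        # One multi-head attention module per axis (hypothetical configuration).
        self.attn_h = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_w = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, H, W, dim) features on a regular grid.
        b, h, w, d = x.shape

        # Attend along the height axis: each column is a sequence of length H.
        cols = x.permute(0, 2, 1, 3).reshape(b * w, h, d)
        cols, _ = self.attn_h(cols, cols, cols)
        x = cols.reshape(b, w, h, d).permute(0, 2, 1, 3)

        # Attend along the width axis: each row is a sequence of length W.
        rows = x.reshape(b * h, w, d)
        rows, _ = self.attn_w(rows, rows, rows)
        return rows.reshape(b, h, w, d)


if __name__ == "__main__":
    layer = AxialAttention2D(dim=32)
    out = layer(torch.randn(2, 64, 64, 32))  # e.g. a 64x64 grid of PDE states
    print(out.shape)  # torch.Size([2, 64, 64, 32])
```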
Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers
Most recent semantic segmentation methods adopt a fully-convolutional network
(FCN) with an encoder-decoder architecture. The encoder progressively reduces
the spatial resolution and learns more abstract/semantic visual concepts with
larger receptive fields. Since context modeling is critical for segmentation,
the latest efforts have been focused on increasing the receptive field, through
either dilated/atrous convolutions or inserting attention modules. However, the
encoder-decoder based FCN architecture remains unchanged. In this paper, we aim
to provide an alternative perspective by treating semantic segmentation as a
sequence-to-sequence prediction task. Specifically, we deploy a pure
transformer (i.e., without convolution and resolution reduction) to encode an
image as a sequence of patches. With the global context modeled in every layer
of the transformer, this encoder can be combined with a simple decoder to
provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
Extensive experiments show that SETR achieves new state of the art on ADE20K
(50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on
Cityscapes. Particularly, we achieve the first position in the highly
competitive ADE20K test server leaderboard on the day of submission.
Comment: CVPR 2021. Project page at https://fudan-zvg.github.io/SETR
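The sequence-to-sequence view can be summarized with a minimal sketch (not the official SETR code): patches are linearly embedded, processed by a pure transformer encoder, reshaped back to a 2D grid, and upsampled by a very simple decoder to per-pixel logits. The class name, sizes, and hyperparameters below are illustrative assumptions.

```python
# Minimal sketch of sequence-to-sequence segmentation with a pure transformer
# encoder and a simple decoder. Not the official SETR implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinySETR(nn.Module):
    def __init__(self, img_size=256, patch=16, dim=256, depth=4, heads=8, n_classes=19):
        super().__init__()
        self.grid = img_size // patch
        # Patch embedding implemented as a strided convolution (a common trick).
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.grid * self.grid, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Conv2d(dim, n_classes, kernel_size=1)

    def forward(self, img):
        # img: (B, 3, H, W) -> patch tokens: (B, N, dim)
        x = self.embed(img).flatten(2).transpose(1, 2) + self.pos
        x = self.encoder(x)  # global context modeled in every layer
        x = x.transpose(1, 2).reshape(img.size(0), -1, self.grid, self.grid)
        logits = self.head(x)  # (B, n_classes, grid, grid)
        return F.interpolate(logits, size=img.shape[-2:], mode="bilinear",
                             align_corners=False)


if __name__ == "__main__":
    model = TinySETR()
    print(model(torch.randn(1, 3, 256, 256)).shape)  # torch.Size([1, 19, 256, 256])
```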
Mitigation of Spatial Nonstationarity with Vision Transformers
Spatial nonstationarity, the location variance of features' statistical
distributions, is ubiquitous in many natural settings. For example, in
geological reservoirs rock matrix porosity varies vertically due to
geomechanical compaction trends, in mineral deposits grades vary due to
sedimentation and concentration processes, in hydrology rainfall varies due to
the atmosphere and topography interactions, and in metallurgy crystalline
structures vary due to differential cooling. Conventional geostatistical
modeling workflows rely on the assumption of stationarity to be able to model
spatial features for geostatistical inference. Nevertheless, this is often
not a realistic assumption when dealing with nonstationary spatial data and
this has motivated a variety of nonstationary spatial modeling workflows such
as trend and residual decomposition, cosimulation with secondary features, and
spatial segmentation and independent modeling over stationary subdomains. The
advent of deep learning technologies has enabled new workflows for modeling
spatial relationships. However, there is a paucity of demonstrated best
practice and general guidance on mitigation of spatial nonstationarity with
deep learning in the geospatial context. We demonstrate the impact of two
common types of geostatistical spatial nonstationarity on deep learning model
prediction performance and propose the mitigation of such impacts using
self-attention (vision transformer) models. We demonstrate the utility of
vision transformers for the mitigation of nonstationarity with relative errors
as low as 10%, exceeding the performance of alternative deep learning methods
such as convolutional neural networks. We establish best practice by
demonstrating the ability of self-attention networks for modeling large-scale
spatial relationships in the presence of commonly observed geospatial
nonstationarity.
UniHead: Unifying Multi-Perception for Detection Heads
The detection head constitutes a pivotal component within object detectors,
tasked with executing both classification and localization functions.
Regrettably, the commonly used parallel head often lacks omni-perceptual
capabilities, such as deformation perception, global perception and cross-task
perception. Although numerous methods attempt to enhance these abilities from a
single aspect, achieving a comprehensive and unified solution remains a
significant challenge. In response to this challenge, we have developed an
innovative detection head, termed UniHead, to unify three perceptual abilities
simultaneously. More precisely, our approach (1) introduces deformation
perception, enabling the model to adaptively sample object features; (2)
proposes a Dual-axial Aggregation Transformer (DAT) to adeptly model long-range
dependencies, thereby achieving global perception; and (3) devises a Cross-task
Interaction Transformer (CIT) that facilitates interaction between the
classification and localization branches, thus aligning the two tasks. As a
plug-and-play method, the proposed UniHead can be conveniently integrated with
existing detectors. Extensive experiments on the COCO dataset demonstrate that
our UniHead can bring significant improvements to many detectors. For instance,
the UniHead can obtain +2.7 AP gains in RetinaNet, +2.9 AP gains in FreeAnchor,
and +2.1 AP gains in GFL. The code will be publicly available. Code URL:
https://github.com/zht8506/UniHead.
Comment: 10 pages, 5 figures
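A minimal sketch of the cross-task interaction idea (in the spirit of the CIT described above, not the UniHead implementation) is given below: the classification branch cross-attends to the localization branch and vice versa, with residual connections. All names and shapes are assumed for illustration.

```python
# Minimal sketch of cross-task interaction between a classification branch and a
# localization branch via cross-attention. Illustrative only; not the UniHead code.
import torch
import torch.nn as nn


class CrossTaskInteraction(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        # Each branch attends to the other branch's tokens.
        self.cls_from_reg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.reg_from_cls = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cls_feat: torch.Tensor, reg_feat: torch.Tensor):
        # cls_feat, reg_feat: (B, N, dim) tokens from the two head branches,
        # e.g. flattened spatial locations of one feature-pyramid level.
        cls_out, _ = self.cls_from_reg(cls_feat, reg_feat, reg_feat)
        reg_out, _ = self.reg_from_cls(reg_feat, cls_feat, cls_feat)
        # Residual connections keep each branch's own information.
        return cls_feat + cls_out, reg_feat + reg_out


if __name__ == "__main__":
    cit = CrossTaskInteraction()
    c, r = cit(torch.randn(2, 100, 256), torch.randn(2, 100, 256))
    print(c.shape, r.shape)  # torch.Size([2, 100, 256]) torch.Size([2, 100, 256])
```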
Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers
Autoregressive transformers have shown remarkable success in video
generation. However, transformers cannot directly learn long-term dependencies
in videos due to the quadratic complexity of self-attention, and they
inherently suffer from slow inference and error propagation due to the
autoregressive process. In this paper, we propose
Memory-efficient Bidirectional Transformer (MeBT) for end-to-end learning of
long-term dependency in videos and fast inference. Based on recent advances in
bidirectional transformers, our method learns to decode the entire
spatio-temporal volume of a video in parallel from partially observed patches.
The proposed transformer achieves a linear time complexity in both encoding and
decoding, by projecting observable context tokens into a fixed number of latent
tokens and conditioning them to decode the masked tokens through
cross-attention. Empowered by linear complexity and bidirectional modeling, our
method demonstrates significant improvement over autoregressive transformers
for generating moderately long videos in both quality and speed.
Videos and code are available at https://sites.google.com/view/mebt-cvpr2023
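The linear-complexity decoding described above can be illustrated with a minimal sketch (not the MeBT implementation): a fixed number of learnable latent tokens cross-attend to the observed context tokens, and masked-token queries then cross-attend to the latents, so both steps cost O(N x M) rather than O(N^2). Module and parameter names are assumptions.

```python
# Minimal sketch of a latent-bottleneck decoder: N context tokens are summarized
# into M latent tokens, and masked tokens are decoded by cross-attending to those
# latents, so both steps cost O(N * M) instead of O(N^2). Not the MeBT code.
import torch
import torch.nn as nn


class LatentBottleneckDecoder(nn.Module):
    def __init__(self, dim: int = 256, n_latents: int = 64, heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(1, n_latents, dim) * 0.02)
        self.encode = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.decode = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, context: torch.Tensor, masked_queries: torch.Tensor):
        # context: (B, N, dim) tokens from observed patches.
        # masked_queries: (B, K, dim) positional queries for the tokens to predict.
        lat = self.latents.expand(context.size(0), -1, -1)
        lat, _ = self.encode(lat, context, context)     # latents read the context
        out, _ = self.decode(masked_queries, lat, lat)  # masked tokens read latents
        return out


if __name__ == "__main__":
    dec = LatentBottleneckDecoder()
    out = dec(torch.randn(2, 4096, 256), torch.randn(2, 512, 256))
    print(out.shape)  # torch.Size([2, 512, 256])
```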
Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective
Visual representation learning is the key to solving various vision problems.
Relying on grid-structure priors, convolutional neural networks (CNNs) have
been the de facto standard architectures of most deep vision
models. For instance, classical semantic segmentation methods often adopt a
fully-convolutional network (FCN) with an encoder-decoder architecture. The
encoder progressively reduces the spatial resolution and learns more abstract
visual concepts with larger receptive fields. Since context modeling is
critical for segmentation, the latest efforts have been focused on increasing
the receptive field, through either dilated (i.e., atrous) convolutions or
inserting attention modules. However, the FCN-based architecture remains
unchanged. In this paper, we aim to provide an alternative perspective by
treating visual representation learning generally as a sequence-to-sequence
prediction task. Specifically, we deploy a pure Transformer to encode an image
as a sequence of patches, without local convolution and resolution reduction.
With the global context modeled in every layer of the Transformer, stronger
visual representation can be learned for better tackling vision tasks. In
particular, our segmentation model, termed SEgmentation TRansformer (SETR),
excels on ADE20K (50.28% mIoU, the first position in the test leaderboard on
the day of submission), Pascal Context (55.83% mIoU) and reaches competitive
results on Cityscapes. Further, we formulate a family of Hierarchical
Local-Global (HLG) Transformers characterized by local attention within windows
and global attention across windows in a hierarchical and pyramidal
architecture. Extensive experiments show that our method achieves appealing
performance on a variety of visual recognition tasks (e.g., image
classification, object detection, instance segmentation, and semantic
segmentation).
Comment: Extended version of CVPR 2021 paper arXiv:2012.1584
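A minimal sketch of the local/global windowed attention idea (not the HLG implementation) follows: attention runs within non-overlapping windows, and a pooled summary of each window then attends across windows. The names, window size, and mean-pooling choice are illustrative assumptions.

```python
# Minimal sketch of local attention within non-overlapping windows followed by
# global attention across window summaries. Illustrative only; not the HLG code.
import torch
import torch.nn as nn


class LocalGlobalBlock(nn.Module):
    def __init__(self, dim: int = 96, window: int = 8, heads: int = 4):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, dim) with H and W divisible by the window size.
        b, h, w, d = x.shape
        ws = self.window
        # Local: attention inside each (ws x ws) window.
        win = (x.reshape(b, h // ws, ws, w // ws, ws, d)
                 .permute(0, 1, 3, 2, 4, 5)
                 .reshape(-1, ws * ws, d))
        win, _ = self.local_attn(win, win, win)
        # Global: each window summary (mean token) attends to all other windows.
        n_win = (h // ws) * (w // ws)
        summaries = win.reshape(b, n_win, ws * ws, d).mean(dim=2)
        summaries, _ = self.global_attn(summaries, summaries, summaries)
        # Broadcast the global summary back into its window tokens.
        win = win.reshape(b, n_win, ws * ws, d) + summaries.unsqueeze(2)
        return (win.reshape(b, h // ws, w // ws, ws, ws, d)
                   .permute(0, 1, 3, 2, 4, 5)
                   .reshape(b, h, w, d))


if __name__ == "__main__":
    block = LocalGlobalBlock()
    print(block(torch.randn(2, 32, 32, 96)).shape)  # torch.Size([2, 32, 32, 96])
```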
P2AT: Pyramid Pooling Axial Transformer for Real-time Semantic Segmentation
Recently, Transformer-based models have achieved promising results in various
vision tasks, due to their ability to model long-range dependencies. However,
transformers are computationally expensive, which limits their applications in
real-time tasks such as autonomous driving. In addition, efficient local and
global feature selection and fusion is vital for accurate dense prediction,
especially in driving-scene understanding tasks. In this paper, we propose a
real-time semantic segmentation architecture named Pyramid Pooling Axial
Transformer (P2AT). The proposed P2AT takes a coarse feature from the CNN
encoder to produce scale-aware contextual features, which are then combined
with the multi-level feature aggregation scheme to produce enhanced contextual
features. Specifically, we introduce a pyramid pooling axial transformer to
capture intricate spatial and channel dependencies, leading to improved
performance on semantic segmentation. Then, we design a Bidirectional Fusion
module (BiF) to combine semantic information at different levels. Meanwhile, a
Global Context Enhancer is introduced to compensate for the inadequacy of
concatenating different semantic levels. Finally, a decoder block is proposed
to help maintain a larger receptive field. We evaluate P2AT variants on three
challenging scene-understanding datasets. In particular, our P2AT variants
achieve state-of-the-art results on the CamVid dataset: 80.5%, 81.0%, and 81.1%
for P2AT-S, P2AT-M, and P2AT-L, respectively. Furthermore, our experiments on
Cityscapes and Pascal VOC 2012 demonstrate the efficiency of the proposed
architecture, with P2AT-M achieving 78.7% on Cityscapes.
The source code will be made available.
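A minimal sketch of how scale-aware contextual features can be produced by pyramid pooling (in the spirit of the context module described above, not the authors' P2AT code) is shown below: the coarse CNN feature is pooled to several scales, projected, upsampled, and concatenated. Bin sizes and channel widths are illustrative assumptions.

```python
# Minimal sketch of a pyramid pooling context module producing scale-aware
# contextual features from a coarse CNN feature map. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidPoolingContext(nn.Module):
    def __init__(self, in_ch: int = 512, out_ch: int = 128, bins=(1, 2, 3, 6)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
                          nn.BatchNorm2d(out_ch),
                          nn.ReLU(inplace=True))
            for b in bins])
        self.project = nn.Conv2d(in_ch + out_ch * len(bins), out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, in_ch, H, W) coarse feature from the CNN encoder.
        size = x.shape[-2:]
        pooled = [F.interpolate(branch(x), size=size, mode="bilinear",
                                align_corners=False)
                  for branch in self.branches]
        return self.project(torch.cat([x] + pooled, dim=1))


if __name__ == "__main__":
    ppm = PyramidPoolingContext()
    print(ppm(torch.randn(2, 512, 32, 32)).shape)  # torch.Size([2, 128, 32, 32])
```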