8 research outputs found

    Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective

    Full text link
    Visual representation learning is the key to solving various vision problems. Relying on grid-structure priors, convolutional neural networks (CNNs) have been the de facto standard architecture of most deep vision models. For instance, classical semantic segmentation methods often adopt a fully convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have focused on enlarging the receptive field, either through dilated (i.e., atrous) convolutions or by inserting attention modules. However, the FCN-based architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating visual representation learning generally as a sequence-to-sequence prediction task. Specifically, we deploy a pure Transformer to encode an image as a sequence of patches, without local convolution or resolution reduction. With global context modeled in every layer of the Transformer, stronger visual representations can be learned to better tackle vision tasks. In particular, our segmentation model, termed SEgmentation TRansformer (SETR), excels on ADE20K (50.28% mIoU, first position on the test leaderboard on the day of submission) and Pascal Context (55.83% mIoU), and reaches competitive results on Cityscapes. Further, we formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a hierarchical, pyramidal architecture. Extensive experiments show that our method achieves appealing performance on a variety of visual recognition tasks (e.g., image classification, object detection, instance segmentation, and semantic segmentation). Comment: Extended version of CVPR 2021 paper arXiv:2012.1584
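    The core idea above, treating an image as a flat sequence of patch tokens processed by a standard Transformer with no convolutional downsampling, can be illustrated with a short PyTorch sketch. This is not the authors' SETR code; the patch size, embedding width, and depth below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PatchTransformerEncoder(nn.Module):
    """Encode an image as a sequence of patch tokens with a pure Transformer.

    Illustrative sketch: no convolutional downsampling, and every layer
    attends over all patches (global context at full receptive field).
    """
    def __init__(self, img_size=512, patch_size=16, in_chans=3,
                 embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Linear projection of flattened patches (done here with a strided conv).
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                            # x: (B, 3, H, W)
        tokens = self.patch_embed(x)                 # (B, D, H/16, W/16)
        tokens = tokens.flatten(2).transpose(1, 2)   # (B, N, D) patch sequence
        tokens = tokens + self.pos_embed
        return self.encoder(tokens)                  # (B, N, D), no resolution loss


feats = PatchTransformerEncoder()(torch.randn(1, 3, 512, 512))
print(feats.shape)  # torch.Size([1, 1024, 768])
```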

    Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

    Full text link
    Most recent semantic segmentation methods adopt a fully convolutional network (FCN) with an encoder-decoder architecture. The encoder progressively reduces the spatial resolution and learns more abstract/semantic visual concepts with larger receptive fields. Since context modeling is critical for segmentation, the latest efforts have focused on increasing the receptive field, either through dilated/atrous convolutions or by inserting attention modules. However, the encoder-decoder-based FCN architecture remains unchanged. In this paper, we aim to provide an alternative perspective by treating semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer (i.e., without convolution and resolution reduction) to encode an image as a sequence of patches. With global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR). Extensive experiments show that SETR achieves a new state of the art on ADE20K (50.28% mIoU) and Pascal Context (55.83% mIoU), and competitive results on Cityscapes. Particularly, we achieve the first position on the highly competitive ADE20K test server leaderboard on the day of submission. Comment: CVPR 2021. Project page at https://fudan-zvg.github.io/SETR
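    To make the "simple decoder" concrete, the following hedged sketch shows one naive way to turn the patch-token sequence produced by such an encoder back into per-pixel class logits: reshape the tokens to a 2D feature map, classify with a 1x1 convolution, and bilinearly upsample. The class count and shapes are illustrative; this is not the SETR decoder implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleSegDecoder(nn.Module):
    """Reshape the patch-token sequence back to a 2D map and upsample it to
    per-pixel class logits. Illustrative only; it mirrors the idea of pairing
    a Transformer encoder with a lightweight segmentation decoder."""
    def __init__(self, embed_dim=768, num_classes=150, patch_size=16):
        super().__init__()
        self.patch_size = patch_size
        self.classifier = nn.Conv2d(embed_dim, num_classes, kernel_size=1)

    def forward(self, tokens, img_hw):               # tokens: (B, N, D)
        h, w = img_hw
        gh, gw = h // self.patch_size, w // self.patch_size
        x = tokens.transpose(1, 2).reshape(tokens.size(0), -1, gh, gw)
        x = self.classifier(x)                       # (B, num_classes, gh, gw)
        return F.interpolate(x, size=(h, w), mode="bilinear",
                             align_corners=False)    # full-resolution logits


logits = SimpleSegDecoder()(torch.randn(1, 1024, 768), (512, 512))
print(logits.shape)  # torch.Size([1, 150, 512, 512])
```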

    Assessment of multi-source observation merged 1 km-grid precipitation product during the disastrous rainstorms in Guangdong

    Get PDF
    This paper aims to assess the latest 1 km-grid Analysis Real Time (ART_1 km) precipitation product developed by the National Meteorological Information Center of the China Meteorological Administration (CMA), which can provide great support for disaster weather monitoring and warning, intelligent grid forecasting, and weather services. Observed precipitation data from independent stations (including non-uploaded regional meteorological stations and hydrometric stations) that were not integrated into the ART_1 km precipitation product, together with precipitation classification inspection, are used to assess the quality of this product during twenty disastrous rainstorm cases from May to August of 2019-2022 in Guangdong. The results show that the ART_1 km precipitation product successfully reproduces the precipitation location, strength, and trends in these cases, with the best performance in the Pearl River Delta, the east of eastern Guangdong, and the north of northern Guangdong. The stronger the precipitation, the greater the correlation as well as the root mean square error (RMSE) and mean error (ME) between the ART_1 km precipitation and the observed precipitation. When the hourly precipitation is not classified, about 60% of these independent stations present a correlation coefficient ≥ 0.8, more than 90% of the stations present an RMSE within the range of [1.0, 5.0) mm, and more than 60% of the stations present an ME within ±0.1 mm. When the hourly precipitation is < 5 mm, most of the stations have a correlation coefficient < 0.5, an RMSE within the range of [1.0, 5.0) mm, and an ME within [0.0, 0.5] mm. When the hourly precipitation is ≥ 20 mm, 42%-56% of the stations have a correlation coefficient ≥ 0.5, and most of the stations have an RMSE ≥ 10 mm and an ME < 0 mm; when the hourly precipitation is ≥ 50 mm, most of the stations even have an ME < -10 mm. Overall, the ART_1 km precipitation is usually underestimated at the independent stations, and integrating observations from more sites into producing the ART_1 km precipitation would help improve the quality of the product.
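    The assessment relies on standard station-wise scores: correlation coefficient, RMSE, and ME (product minus observation, so a negative ME indicates underestimation), optionally computed per hourly-intensity class. A minimal NumPy sketch of such a computation is given below; the intensity thresholds are illustrative assumptions, not the exact classification used in the paper.

```python
import numpy as np

def station_scores(product, observed):
    """Hourly gridded product vs. gauge observations at one independent station.

    Returns the correlation coefficient, root mean square error (RMSE) and
    mean error (ME, product minus observation); a negative ME means the
    gridded product underestimates the gauge."""
    product, observed = np.asarray(product, float), np.asarray(observed, float)
    corr = np.corrcoef(product, observed)[0, 1]
    rmse = np.sqrt(np.mean((product - observed) ** 2))
    me = np.mean(product - observed)
    return corr, rmse, me

def classified_scores(product, observed, thresholds=(0.1, 5.0, 20.0, 50.0)):
    """Group hours by observed intensity class (mm/h) before scoring,
    illustrating a precipitation-classification inspection.
    The threshold values here are assumptions for illustration."""
    product, observed = np.asarray(product, float), np.asarray(observed, float)
    upper = list(thresholds[1:]) + [np.inf]
    out = {}
    for lo, hi in zip(thresholds, upper):
        mask = (observed >= lo) & (observed < hi)
        if mask.sum() > 1:
            out[f"[{lo}, {hi}) mm/h"] = station_scores(product[mask], observed[mask])
    return out
```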

    Orchid

    No full text

    A tactical Internet traffic prediction model incorporating multi-source dynamic spatio-temporal characteristics

    No full text
    Tactical Internet traffic has highly dynamic spatio-temporal features and is closely related to external features such as weather and elevation, which existing network traffic prediction models cannot capture well. A tactical Internet traffic prediction model that fuses multi-source dynamic spatio-temporal features is therefore proposed. Firstly, external features are fused with traffic features as multi-source features; then the spatial features of network traffic at the current moment are extracted and the convolution weights over time are iteratively updated to obtain spatial feature information under different time slices; finally, the spatial information of the current and historical moments is aggregated by a temporal convolution layer to predict the multi-source dynamic spatio-temporal traffic at the next moment. Compared with single base models, the proposed method performs better on all three evaluation metrics: mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (R²).
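    A minimal PyTorch sketch of the described pipeline (fusing external features with traffic, extracting spatial features per time slice, and aggregating the history with a temporal convolution) is shown below. The layer choices, channel counts, and history length are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SpatioTemporalTrafficPredictor(nn.Module):
    """Illustrative sketch: concatenate external features (e.g. weather,
    elevation) with traffic as multi-source input, extract spatial features
    per time slice, then aggregate the history with a temporal convolution
    to predict the next step. Not the paper's model."""
    def __init__(self, traffic_ch=1, external_ch=2, hidden=32, history=12):
        super().__init__()
        in_ch = traffic_ch + external_ch
        self.spatial = nn.Sequential(                # applied to each time slice
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU())
        self.temporal = nn.Conv1d(hidden, hidden, kernel_size=history)
        self.head = nn.Conv2d(hidden, traffic_ch, kernel_size=1)

    def forward(self, traffic, external):            # each: (B, T, C, H, W)
        x = torch.cat([traffic, external], dim=2)    # fuse multi-source features
        b, t, c, h, w = x.shape
        s = self.spatial(x.reshape(b * t, c, h, w)).reshape(b, t, -1, h, w)
        # fold space into the batch and convolve over the T history steps
        s = s.permute(0, 3, 4, 2, 1).reshape(b * h * w, -1, t)
        agg = self.temporal(s).reshape(b, h, w, -1).permute(0, 3, 1, 2)
        return self.head(agg)                        # (B, traffic_ch, H, W)


pred = SpatioTemporalTrafficPredictor()(torch.randn(2, 12, 1, 8, 8),
                                        torch.randn(2, 12, 2, 8, 8))
print(pred.shape)  # torch.Size([2, 1, 8, 8])
```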

    Vision transformers: from semantic segmentation to dense prediction

    No full text
    The emergence of vision transformers (ViTs) in image classification has shifted the methodologies for visual representation learning. In particular, ViTs learn visual representations at full receptive field per layer across all the image patches, in contrast to the increasing receptive fields of CNNs across layers and other alternatives (e.g., large kernels and atrous convolution). In this work, we explore for the first time the global context learning potential of ViTs for dense visual prediction (e.g., semantic segmentation). Our motivation is that, by learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information, which is critical for dense prediction tasks. We first demonstrate that, by encoding an image as a sequence of patches, a vanilla ViT without local convolution and resolution reduction can yield stronger visual representations for semantic segmentation. For example, our model, termed SEgmentation TRansformer (SETR), excels on ADE20K (50.28% mIoU, first position on the test leaderboard on the day of submission) and performs competitively on Cityscapes. However, the basic ViT architecture falls short in broader dense prediction applications, such as object detection and instance segmentation, due to its lack of a pyramidal structure, high computational demand, and insufficient local context. To tackle general dense visual prediction tasks in a cost-effective manner, we further formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global attention across windows in a pyramidal architecture. Extensive experiments show that our methods achieve appealing performance on a variety of dense prediction tasks (e.g., object detection, instance segmentation, and semantic segmentation) as well as image classification. Our code and models are available at https://github.com/fudan-zvg/SETR
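    The local/global attention split that characterizes the HLG design can be sketched roughly as follows: tokens attend within fixed-size windows, then a per-window summary attends across windows and its output is broadcast back to the tokens. This toy PyTorch sketch only illustrates the idea; the window size, dimensions, and pooling/broadcast scheme are assumptions, not the HLG implementation.

```python
import torch
import torch.nn as nn

class LocalGlobalAttention(nn.Module):
    """Toy local/global attention: attention inside fixed-size windows (local),
    then attention among mean-pooled window summaries (global), whose output
    is broadcast back to the tokens. Illustrative assumptions throughout."""
    def __init__(self, dim=96, heads=3, window=16):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                            # x: (B, N, D), N % window == 0
        b, n, d = x.shape
        w = x.reshape(b * n // self.window, self.window, d)
        w, _ = self.local_attn(w, w, w)              # attention inside each window
        x = w.reshape(b, n, d)
        summary = x.reshape(b, n // self.window, self.window, d).mean(dim=2)
        g, _ = self.global_attn(summary, summary, summary)  # across windows
        # broadcast the globally mixed window context back to its tokens
        return x + g.repeat_interleave(self.window, dim=1)


out = LocalGlobalAttention()(torch.randn(2, 64, 96))
print(out.shape)  # torch.Size([2, 64, 96])
```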

    HunYuan_tvr for Text-Video Retrieval

    Full text link
    Text-Video Retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while ignoring fine-grained cross-modal relationships, e.g., between short clips and phrases or single frames and words. In this paper, we propose a novel method, named HunYuan_tvr, to explore hierarchical cross-modal interactions by simultaneously exploring video-sentence, clip-phrase, and frame-word relationships. Considering intrinsic semantic relations between frames, HunYuan_tvr first performs self-attention to explore frame-wise correlations and adaptively clusters correlated frames into clip-level representations. Then, the clip-wise correlation is explored to aggregate clip representations into a compact one that describes the video globally. In this way, we construct hierarchical video representations at frame-clip-video granularities, and similarly explore word-wise correlations to form word-phrase-sentence embeddings for the text modality. Finally, hierarchical contrastive learning is designed to explore cross-modal relationships, i.e., frame-word, clip-phrase, and video-sentence, which enables HunYuan_tvr to achieve comprehensive multi-modal understanding. Further boosted by adaptive label denoising and marginal sample enhancement, HunYuan_tvr obtains new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 57.8%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDemo, and ActivityNet, respectively.
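    The hierarchical contrastive objective can be sketched as a sum of symmetric InfoNCE losses at the three granularities (video-sentence, clip-phrase, frame-word). The PyTorch sketch below is a simplification that assumes each granularity has already been pooled to a single embedding per sample; it is not the HunYuan_tvr training code.

```python
import torch
import torch.nn.functional as F

def symmetric_info_nce(vis, txt, temperature=0.05):
    """Symmetric contrastive (InfoNCE) loss over a batch of paired
    visual/text embeddings of shape (B, D), matched along the batch axis."""
    vis, txt = F.normalize(vis, dim=-1), F.normalize(txt, dim=-1)
    logits = vis @ txt.t() / temperature
    labels = torch.arange(vis.size(0), device=vis.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def hierarchical_loss(video, sent, clip, phrase, frame, word):
    """Toy hierarchical contrastive learning: sum the contrastive loss at the
    video-sentence, clip-phrase, and frame-word granularities. Pooling each
    granularity to one vector per sample is a simplifying assumption."""
    return (symmetric_info_nce(video, sent) +
            symmetric_info_nce(clip, phrase) +
            symmetric_info_nce(frame, word))


B, D = 8, 256
loss = hierarchical_loss(*(torch.randn(B, D) for _ in range(6)))
print(loss.item())
```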

    Helicobacter pylori CagA-mediated ether lipid biosynthesis promotes ferroptosis susceptibility in gastric cancer

    No full text
    Helicobacter pylori, particularly cytotoxin-associated gene A (CagA)-positive strains, plays a key role in the progression of gastric cancer (GC). Ferroptosis, which is associated with lethal lipid peroxidation, has emerged as an important factor in malignant and infectious diseases, but the role of CagA in ferroptosis in cancer cells has not been determined. Here, we report that CagA confers sensitivity to ferroptosis on GC cells both in vitro and in vivo. Mechanistically, CagA promotes the synthesis of polyunsaturated ether phospholipids (PUFA-ePLs), which is mediated by increased expression of alkylglycerone phosphate synthase (AGPS) and 1-acylglycerol-3-phosphate O-acyltransferase 3 (AGPAT3), leading to susceptibility to ferroptosis. This susceptibility is mediated by activation of the MEK/ERK/SRF pathway. SRF is a crucial transcription factor that increases AGPS transcription by binding to the AGPS promoter region. Moreover, the results demonstrate that CagA-positive cells are more sensitive to apatinib than CagA-negative cells, suggesting that detecting the H. pylori CagA status may aid patient stratification for treatment with apatinib.