RSGPT: A Remote Sensing Vision Language Model and Benchmark
The emergence of large language models, with GPT-4 as a prominent
example, has significantly propelled the advancement of artificial
general intelligence and sparked the revolution of Artificial Intelligence 2.0.
In the realm of remote sensing (RS), there is a growing interest in developing
large vision language models (VLMs) specifically tailored for data analysis in
this domain. However, current research predominantly revolves around visual
recognition tasks, and the field lacks comprehensive, large-scale, aligned
image-text datasets suitable for training large VLMs, which poses a
significant obstacle to effectively training such models for RS applications. In computer
vision, recent research has demonstrated that fine-tuning large vision language
models on small-scale, high-quality datasets can yield impressive performance
in visual and language understanding. These results are comparable to
state-of-the-art VLMs trained from scratch on massive amounts of data, such as
GPT-4. Inspired by this captivating idea, in this work, we build a high-quality
Remote Sensing Image Captioning dataset (RSICap) that facilitates the
development of large VLMs in the RS field. Unlike previous RS datasets that
either employ model-generated captions or short descriptions, RSICap comprises
2,585 human-annotated captions with rich and high-quality information. This
dataset offers detailed descriptions for each image, encompassing scene
descriptions (e.g., residential area, airport, or farmland) as well as object
information (e.g., color, shape, quantity, absolute position, etc). To
facilitate the evaluation of VLMs in the field of RS, we also provide a
benchmark evaluation dataset called RSIEval. This dataset consists of
human-annotated captions and visual question-answer pairs, allowing for a
comprehensive assessment of VLMs in the context of RS.
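As a rough illustration of what such annotations contain, the record layout below is a hypothetical sketch; the field names `image_id`, `caption`, and `qa_pairs` are assumptions for illustration, not taken from the released datasets.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical record layouts for RSICap / RSIEval entries; field names are
# illustrative assumptions, not the datasets' actual schema.

@dataclass
class RSICapEntry:
    image_id: str
    caption: str  # human-annotated: scene type plus object color/shape/count/position

@dataclass
class RSIEvalEntry:
    image_id: str
    caption: str
    qa_pairs: List[Tuple[str, str]]  # (question, answer) pairs for VQA evaluation

entry = RSIEvalEntry(
    image_id="rs_0001",
    caption="A residential area with rows of red-roofed houses.",
    qa_pairs=[("How many houses are visible?", "about twenty")],
)
```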
Multi Receptive Field Network for Semantic Segmentation
Semantic segmentation is one of the key tasks in computer vision, which is to
assign a category label to each pixel in an image. Despite significant progress
achieved recently, most existing methods still suffer from two challenging
issues: 1) the size of objects and stuff in an image can be very diverse,
demanding the incorporation of multi-scale features into fully convolutional
networks (FCNs); 2) the pixels close to or at the boundaries of object/stuff
are hard to classify due to the intrinsic weakness of convolutional networks.
To address the first issue, we propose a new Multi-Receptive Field Module
(MRFM), explicitly taking multi-scale features into account. For the second
issue, we design an edge-aware loss which is effective in distinguishing the
boundaries of object/stuff. With these two designs, our Multi Receptive Field
Network achieves new state-of-the-art results on two widely-used semantic
segmentation benchmark datasets. Specifically, we achieve a mean IoU of 83.0 on
the Cityscapes dataset and 88.4 mean IoU on the Pascal VOC2012 dataset. Comment: Accepted by WACV 202
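The mean IoU metric reported above can be sketched in a few lines of Python. This is a minimal per-class version; real benchmark tooling also handles ignore labels and accumulates a confusion matrix over the whole dataset.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union across classes, the metric reported above."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.array([[0, 0], [1, 2]])
gt = np.array([[0, 1], [1, 2]])
score = mean_iou(pred, gt, num_classes=3)  # per-class IoUs 0.5, 0.5, 1.0
```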
UniNeXt: Exploring A Unified Architecture for Vision Recognition
Vision Transformers have shown great potential in computer vision tasks. Most
recent works have focused on elaborating the spatial token mixer for
performance gains. However, we observe that a well-designed general
architecture can significantly improve the performance of the entire backbone,
regardless of which spatial token mixer is equipped. In this paper, we propose
UniNeXt, an improved general architecture for the vision backbone. To verify
its effectiveness, we instantiate the spatial token mixer with various typical
and modern designs, including both convolution and attention modules. Compared
with the architectures in which they were first proposed, our UniNeXt
architecture can steadily boost the performance of all the spatial token
mixers, and narrows the performance gap among them. Surprisingly, our UniNeXt
equipped with naive local window attention even outperforms the previous
state-of-the-art. Interestingly, the ranking of these spatial token mixers also
changes under our UniNeXt, suggesting that an excellent spatial token mixer may
be stifled due to a suboptimal general architecture, which further shows the
importance of studying the general architecture of vision backbones. All
models and code will be made publicly available.
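The core architectural idea, a fixed block design with a swappable spatial token mixer, can be sketched as a toy NumPy illustration. The mixers and the norm/MLP stand-ins below are illustrative assumptions, not the paper's actual modules.

```python
import numpy as np

def avgpool_mixer(x):
    # convolution-like local mixing: average each token with its two neighbours
    return (x + np.roll(x, 1, axis=0) + np.roll(x, -1, axis=0)) / 3.0

def identity_mixer(x):
    # trivial mixer: no spatial mixing at all
    return x

def block(x, mixer):
    """One backbone block: centering (norm stand-in) -> pluggable mixer -> residual MLP."""
    h = x - x.mean(axis=-1, keepdims=True)  # stand-in for LayerNorm
    x = x + mixer(h)                        # spatial token mixing with a residual
    return x + np.tanh(x)                   # stand-in for a channel MLP, also residual

tokens = np.ones((4, 8))                    # (num_tokens, channels)
out_a = block(tokens, avgpool_mixer)        # same block, different mixers
out_b = block(tokens, identity_mixer)
```

The point of the sketch is that `block` never changes; only the `mixer` argument does, mirroring how UniNeXt holds the surrounding architecture fixed while swapping token mixers.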
Learning Profitable NFT Image Diffusions via Multiple Visual-Policy Guided Reinforcement Learning
We study the task of generating profitable Non-Fungible Token (NFT) images
from user-input texts. Recent advances in diffusion models have shown great
potential for image generation. However, existing works can fall short in
generating visually-pleasing and highly-profitable NFT images, mainly due to
the lack of 1) plentiful and fine-grained visual attribute prompts for an NFT
image, and 2) effective optimization metrics for generating high-quality NFT
images. To solve these challenges, we propose a Diffusion-based generation
framework with Multiple Visual-Policies as rewards (i.e., Diffusion-MVP) for
NFT images. The proposed framework consists of a large language model (LLM), a
diffusion-based image generator, and a series of visual rewards by design.
First, the LLM enhances a basic human input (such as "panda") by generating
more comprehensive NFT-style prompts that include specific visual attributes,
such as "panda with Ninja style and green background." Second, the
diffusion-based image generator is fine-tuned using a large-scale NFT dataset
to capture fine-grained image styles and accessory compositions of popular NFT
elements. Third, we further propose to utilize multiple visual-policies as
optimization goals, including visual rarity levels, visual aesthetic scores,
and CLIP-based text-image relevances. This design ensures that our proposed
Diffusion-MVP is capable of minting NFT images with high visual quality and
market value. To facilitate this research, we have collected the largest
publicly available NFT image dataset to date, consisting of 1.5 million
high-quality images with corresponding texts and market values. Extensive
experiments including objective evaluations and user studies demonstrate that
our framework can generate NFT images showing more visually engaging elements
and higher market value, compared with SOTA approaches
MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
We propose the first joint audio-video generation framework that brings
engaging watching and listening experiences simultaneously, towards
high-quality realistic videos. To generate joint audio-video pairs, we propose
a novel Multi-Modal Diffusion model (i.e., MM-Diffusion) with two coupled
denoising autoencoders. In contrast to existing single-modal diffusion models,
MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising
process by design. Two subnets for audio and video learn to gradually generate
aligned audio-video pairs from Gaussian noises. To ensure semantic consistency
across modalities, we propose a novel random-shift based attention block
bridging over the two subnets, which enables efficient cross-modal alignment
and thus mutually reinforces audio-video fidelity. Extensive
experiments show superior results in unconditional audio-video generation, and
zero-shot conditional tasks (e.g., video-to-audio). In particular, we achieve
the best FVD and FAD on Landscape and AIST++ dancing datasets. Turing tests of
10k votes further demonstrate dominant preferences for our model. The code and
pre-trained models can be downloaded at
https://github.com/researchmm/MM-Diffusion. Comment: Accepted by CVPR 202
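The random-shift idea can be sketched as follows: rather than attending over every audio-video token pair, each video step attends only to a small audio window at a randomly shifted offset. This toy NumPy version is an assumption-laden illustration of that efficiency idea, not the paper's actual block; the window size and shift range are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_shift_attend(video, audio, window=2):
    """Each video step attends to a small, randomly shifted audio window."""
    shift = int(rng.integers(0, window))        # one random temporal offset
    out = np.empty_like(video)
    for t in range(video.shape[0]):
        idx = [(t + shift + k) % audio.shape[0] for k in range(window)]
        keys = audio[idx]                       # small shifted window, not all tokens
        att = np.exp(keys @ video[t])           # unnormalised attention scores
        att /= att.sum()
        out[t] = att @ keys                     # attention-weighted audio summary
    return out

video = rng.standard_normal((6, 4))             # (time, channels)
audio = rng.standard_normal((6, 4))
fused = random_shift_attend(video, audio)
```

Because each query sees only `window` keys instead of the full sequence, the cost grows linearly with sequence length rather than quadratically, which is the efficiency argument behind the block.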
MTR4 drives liver tumorigenesis by promoting cancer metabolic switch through alternative splicing.
The metabolic switch from oxidative phosphorylation to glycolysis is required for tumorigenesis in order to provide cancer cells with energy and biosynthetic substrates. It is therefore important to elucidate the mechanisms controlling the cancer metabolic switch. MTR4 is an RNA helicase associated with the nuclear exosome that plays key roles in RNA processing and surveillance. We demonstrate that MTR4 is frequently overexpressed in hepatocellular carcinoma (HCC) and is an independent diagnostic marker predicting poor prognosis of HCC patients. MTR4 drives cancer metabolism by ensuring correct alternative splicing of the pre-mRNAs of critical glycolytic genes such as GLUT1 and PKM2. c-Myc binds to the promoter of the MTR4 gene and is important for MTR4 expression in HCC cells, indicating that MTR4 is a mediator of the functions of c-Myc in cancer metabolism. These findings reveal important roles of MTR4 in the cancer metabolic switch and present MTR4 as a promising therapeutic target for treating HCC.
Metallogenic Dynamics Background of Ga’erqiong Cu-Au Deposit in Tibet, China
The Ga’erqiong Cu-Au deposit, located on the north side of the Coqên-Xainza magmatite belt, is a large-scale skarn-type deposit whose ore body formed in the skarn zone at the contact between quartz diorite and marble of the Duoai Formation, or in fractures within the quartz diorite. Its mineralization is closely related to the quartz diorite, and a granite-porphyry-related molybdenum ore body exists at depth. The metallogenic dynamic setting of this deposit is currently disputed. Building on previous studies, this paper presents zircon LA-ICPMS U-Pb dating and a petrogeochemical study of the quartz diorite of the Ga’erqiong Cu-Au deposit. The results indicate that the quartz diorite and granite porphyry were formed at 88±2 Ma and 83±1 Ma, respectively, belonging to magmatic activity of the early Late Cretaceous; both have geochemical characteristics similar to those of island-arc rocks of a subduction zone and geochemical indexes similar to adakite. Combined with the regional tectonic evolution, we conclude that the quartz diorite and granite porphyry were formed in the extensional environment following the collision of the Lhasa and Qiangtang blocks. The quartz diorite results from the mixing of basic and acid melts triggered by asthenospheric upwelling caused by delamination of the lower crust; the granite porphyry may result from partial melting of the delaminated lower crust. Therefore, the Ga’erqiong skarn-type Cu-Au deposit is the metallogenic response to collisional orogeny during the closure of the Meso-Tethys.
Propolis Reduces Phosphatidylcholine-Specific Phospholipase C Activity and Increases Annexin a7 Level in Oxidized-LDL-Stimulated Human Umbilical Vein Endothelial Cells
To understand the mechanisms underlying the dyslipidemia-regulating action of Chinese propolis and Brazilian green propolis, we investigated their effects on phosphatidylcholine-specific phospholipase C (PC-PLC) activity and annexin A7 (ANXA7) level, which play crucial roles in controlling the progression of atherosclerosis. Furthermore, reactive oxygen species (ROS) levels, nuclear factor-kappaB p65 (NF-κB p65), and mitochondrial membrane potential (MMP) were also investigated in oxidized-LDL- (ox-LDL-) stimulated human umbilical vein endothelial cells (HUVECs). Our data indicated that treatment with both types of propolis at 12.5 μg/mL significantly increased cell viability, attenuated the apoptosis rate, increased ANXA7 level, and decreased PC-PLC activity. Both types of propolis also inhibited ROS generation, as well as the subsequent MMP collapse and NF-κB p65 activation induced by ox-LDL in HUVECs. Our results also indicated that Chinese propolis and Brazilian green propolis had similar biological activities and prevented ox-LDL-induced cellular dysfunction in HUVECs.
The Research Value of Biphasic Registration Quantitative Computed Tomography Emphysema Index in the Evaluation of Mild to Moderate COPD
Objective: To find the optimal quantitative index of emphysema by comparing and analyzing quantitative emphysema indexes in patients with mild to moderate chronic obstructive pulmonary disease (COPD) via registered biphasic quantitative computed tomography (QCT). Methods: We retrospectively collected 55 healthy controls, 21 Global Initiative for Chronic Obstructive Lung Disease (GOLD) 1 cases, and 31 GOLD 2 cases in our hospital. We imported the raw CT DICOM data into the "Digital Lung" analysis platform and measured LAA%-950 at the end of deep inspiration and LAA%-910 at the end of deep expiration. The expiratory and inspiratory CT images were registered, and the percentage of emphysema area (PRMEmph%), the percentage of functional small airway disease area (PRMfSAD%), and the percentage of normal area (PRMNormal%) were calculated according to the threshold method. Pulmonary function indicators included FVC, FEV1%, and FEV1/FVC. Differences in general data, quantitative CT indexes, and pulmonary function between groups were assessed using the independent-sample t-test, Mann–Whitney U test, or chi-square test, and correlations were analyzed using Spearman correlation. Receiver operating characteristic (ROC) curves were drawn to analyze the diagnostic performance of the quantitative CT parameters for emphysema in patients with mild to moderate COPD. Results: There were significant differences in sex, smoking index, FEV1%, FEV1/FVC, inspiratory-phase LAA%-950, expiratory-phase LAA%-910, PRMEmph%, PRMfSAD%, and PRMNormal% between the mild to moderate COPD patients and the normal control group. Inspiratory-phase LAA%-950 was negatively correlated with FEV1/FVC; expiratory-phase LAA%-910 and PRMEmph% were negatively correlated with FVC, FEV1%, and FEV1/FVC. ROC curve analysis showed that the areas under the curve for inspiratory-phase LAA%-950, expiratory-phase LAA%-910, and PRMEmph% were 0.742, 0.861, and 0.876, respectively.
Among them, the area under the curve for the PRMEmph% index was the largest, with a corresponding critical value of 9.84%, a sensitivity of 76.90%, and a specificity of 94.50%. Conclusion: The quantitative CT emphysema indexes LAA%-950 in the inspiratory phase, LAA%-910 in the expiratory phase, and biphasic PRMEmph% can objectively evaluate emphysema in patients with mild to moderate COPD, among which PRMEmph% is the best evaluation index.
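The threshold method for PRM can be sketched as a minimal voxel classifier. This illustration uses the LAA cutoffs quoted above (-950 HU inspiratory, -910 HU expiratory) as the classification thresholds; the study's exact PRM rule may differ.

```python
import numpy as np

# Minimal sketch of threshold-based PRM classification on registered biphasic
# CT: each voxel carries an inspiratory and an expiratory HU value. Cutoffs
# mirror the LAA%-950 / LAA%-910 thresholds; the study's rule may differ.

def prm_percentages(insp_hu, exp_hu):
    insp_hu, exp_hu = np.asarray(insp_hu), np.asarray(exp_hu)
    emph = (insp_hu < -950) & (exp_hu < -910)   # PRM-Emph: low HU in both phases
    fsad = (insp_hu >= -950) & (exp_hu < -910)  # PRM-fSAD: expiratory air trapping only
    norm = ~(emph | fsad)                       # PRM-Normal: everything else
    n = insp_hu.size
    return 100 * emph.sum() / n, 100 * fsad.sum() / n, 100 * norm.sum() / n

# Four toy voxels: emphysema, fSAD, normal, and low-inspiratory-only (normal)
insp = [-980, -900, -800, -960]
expi = [-930, -920, -700, -850]
result = prm_percentages(insp, expi)
```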