
    RSGPT: A Remote Sensing Vision Language Model and Benchmark

    The emergence of large language models (LLMs), with GPT-4 as a prominent example, has significantly propelled the rapid advancement of artificial general intelligence and sparked the revolution of Artificial Intelligence 2.0. In the realm of remote sensing (RS), there is growing interest in developing large vision language models (VLMs) specifically tailored for data analysis in this domain. However, current research predominantly revolves around visual recognition tasks, and the field lacks comprehensive, large-scale, aligned image-text datasets suitable for training large VLMs, which poses significant challenges to effectively training such models for RS applications. In computer vision, recent research has demonstrated that fine-tuning large VLMs on small-scale, high-quality datasets can yield impressive performance in visual and language understanding, comparable to state-of-the-art VLMs trained from scratch on massive amounts of data, such as GPT-4. Inspired by this idea, in this work we build a high-quality Remote Sensing Image Captioning dataset (RSICap) that facilitates the development of large VLMs in the RS field. Unlike previous RS datasets that employ either model-generated captions or short descriptions, RSICap comprises 2,585 human-annotated captions with rich and high-quality information. Each image receives a detailed description covering the scene (e.g., residential area, airport, or farmland) as well as object information (e.g., color, shape, quantity, absolute position). To facilitate the evaluation of VLMs in the field of RS, we also provide a benchmark evaluation dataset called RSIEval, which consists of human-annotated captions and visual question-answer pairs, allowing for a comprehensive assessment of VLMs in the context of RS.
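    To make the data format concrete, the sketch below shows how caption records and caption-plus-VQA records of this kind might be represented; the field names are illustrative placeholders, not the actual RSICap/RSIEval schema.

```python
# Hypothetical record layouts for a captioning dataset (RSICap-style)
# and an evaluation dataset (RSIEval-style). Field names are
# illustrative placeholders, not the released schema.
rsicap_record = {
    "image_id": "scene_0001.png",
    "caption": (
        "A residential area with rows of red-roofed houses; "
        "about a dozen white cars are parked along the road "
        "in the lower-left part of the image."
    ),
}

rsieval_record = {
    "image_id": "scene_0002.png",
    "caption": "An airport with two parallel runways and a terminal.",
    "qa_pairs": [
        {"question": "How many runways are visible?", "answer": "Two"},
        {"question": "What type of scene is this?", "answer": "Airport"},
    ],
}
```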

    Multi Receptive Field Network for Semantic Segmentation

    Semantic segmentation is one of the key tasks in computer vision: assigning a category label to each pixel in an image. Despite significant recent progress, most existing methods still suffer from two challenging issues: 1) the sizes of objects and stuff in an image can be very diverse, demanding that fully convolutional networks (FCNs) incorporate multi-scale features; 2) pixels close to or at the boundaries of object/stuff are hard to classify due to the intrinsic weakness of convolutional networks. To address the first issue, we propose a new Multi-Receptive Field Module (MRFM), explicitly taking multi-scale features into account. For the second issue, we design an edge-aware loss which is effective in distinguishing the boundaries of object/stuff. With these two designs, our Multi Receptive Field Network achieves new state-of-the-art results on two widely used semantic segmentation benchmarks: a mean IoU of 83.0 on the Cityscapes dataset and 88.4 on the Pascal VOC 2012 dataset. (Comment: Accepted by WACV 2020.)
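    To illustrate the edge-aware idea, here is a minimal PyTorch sketch that upweights the cross-entropy of pixels whose 4-neighbourhood contains a label change; the paper's actual loss formulation and weighting scheme may differ.

```python
import torch.nn.functional as F

def edge_aware_ce_loss(logits, labels, edge_weight=2.0, ignore_index=255):
    """Cross-entropy that upweights pixels near segmentation boundaries.

    A minimal sketch of an edge-aware loss, not the paper's exact
    formulation. logits: (N, C, H, W) float; labels: (N, H, W) int64.
    """
    per_pixel = F.cross_entropy(logits, labels, reduction="none",
                                ignore_index=ignore_index)
    # Mark a pixel as "edge" if any 4-neighbour carries a different label.
    lab = labels.unsqueeze(1).float()                  # (N, 1, H, W)
    pad = F.pad(lab, (1, 1, 1, 1), mode="replicate")   # (N, 1, H+2, W+2)
    edge = ((pad[:, :, 1:-1, :-2] != lab) | (pad[:, :, 1:-1, 2:] != lab) |
            (pad[:, :, :-2, 1:-1] != lab) | (pad[:, :, 2:, 1:-1] != lab))
    # Boundary pixels get weight `edge_weight`, all others weight 1.
    weight = 1.0 + (edge_weight - 1.0) * edge.squeeze(1).float()
    return (weight * per_pixel).mean()
```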

    UniNeXt: Exploring A Unified Architecture for Vision Recognition

    Vision Transformers have shown great potential in computer vision tasks. Most recent works have focused on elaborating the spatial token mixer for performance gains. However, we observe that a well-designed general architecture can significantly improve the performance of the entire backbone, regardless of which spatial token mixer is equipped. In this paper, we propose UniNeXt, an improved general architecture for the vision backbone. To verify its effectiveness, we instantiate the spatial token mixer with various typical and modern designs, including both convolution and attention modules. Compared with the architectures in which these mixers were first proposed, our UniNeXt architecture steadily boosts the performance of all the spatial token mixers and narrows the performance gap among them. Surprisingly, UniNeXt equipped with naive local window attention even outperforms the previous state-of-the-art. Interestingly, the ranking of these spatial token mixers also changes under UniNeXt, suggesting that an excellent spatial token mixer may be stifled by a suboptimal general architecture, which further underscores the importance of studying the general architecture of vision backbones. All models and code will be made publicly available.
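    To illustrate what "the same general architecture with different spatial token mixers" means in practice, here is a minimal sketch of a backbone block whose mixer is a plug-in module; the generic pre-norm layout below is an assumption for illustration, not UniNeXt's actual block definition.

```python
import torch.nn as nn

class Block(nn.Module):
    """Backbone block with a pluggable spatial token mixer.

    Sketch only: the surrounding architecture (norms, MLP, residuals)
    stays fixed while `mixer` can be swapped for any spatial token
    mixer (convolutional mixer, window attention, ...). Not the
    actual UniNeXt block.
    """
    def __init__(self, dim, mixer: nn.Module, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = mixer                      # the swappable component
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                       # x: (B, N, C) token sequence
        x = x + self.mixer(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x
```

    Under this framing, comparing token mixers means constructing `Block(dim, mixer=...)` with different mixer modules while everything else is held constant.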

    Learning Profitable NFT Image Diffusions via Multiple Visual-Policy Guided Reinforcement Learning

    We study the task of generating profitable Non-Fungible Token (NFT) images from user-input texts. Recent advances in diffusion models have shown great potential for image generation. However, existing works can fall short in generating visually pleasing and highly profitable NFT images, mainly due to the lack of 1) plentiful and fine-grained visual attribute prompts for an NFT image, and 2) effective optimization metrics for generating high-quality NFT images. To solve these challenges, we propose a Diffusion-based generation framework with Multiple Visual-Policies as rewards (i.e., Diffusion-MVP) for NFT images. The proposed framework consists of a large language model (LLM), a diffusion-based image generator, and a series of visual rewards by design. First, the LLM enhances a basic human input (such as "panda") by generating a more comprehensive NFT-style prompt that includes specific visual attributes, such as "panda with Ninja style and green background." Second, the diffusion-based image generator is fine-tuned on a large-scale NFT dataset to capture fine-grained image styles and accessory compositions of popular NFT elements. Third, we propose to utilize multiple visual policies as optimization goals, including visual rarity levels, visual aesthetic scores, and CLIP-based text-image relevance. This design ensures that Diffusion-MVP is capable of minting NFT images with high visual quality and market value. To facilitate this research, we have collected the largest publicly available NFT image dataset to date, consisting of 1.5 million high-quality images with corresponding texts and market values. Extensive experiments, including objective evaluations and user studies, demonstrate that our framework can generate NFT images with more visually engaging elements and higher market value than SOTA approaches.
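    A minimal sketch of how several visual policies could be folded into one scalar reward for fine-tuning the generator; the three scoring callables and the weighted sum below are illustrative placeholders, not the paper's actual reward design.

```python
def combined_reward(image, prompt,
                    rarity_model, aesthetic_model, clip_score,
                    w_rarity=1.0, w_aesthetic=1.0, w_clip=1.0):
    """Fold multiple visual policies into one scalar reward.

    Sketch only: `rarity_model`, `aesthetic_model`, and `clip_score`
    are placeholder callables standing in for the rarity predictor,
    aesthetic scorer, and CLIP text-image relevance used as rewards.
    """
    r_rarity = rarity_model(image)        # predicted visual rarity level
    r_aesthetic = aesthetic_model(image)  # visual aesthetic score
    r_clip = clip_score(image, prompt)    # text-image relevance
    return (w_rarity * r_rarity
            + w_aesthetic * r_aesthetic
            + w_clip * r_clip)
```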

    MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation

    We propose the first joint audio-video generation framework, aimed at delivering engaging watching and listening experiences simultaneously through high-quality realistic videos. To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion) with two coupled denoising autoencoders. In contrast to existing single-modal diffusion models, MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process by design. Two subnets, for audio and video, learn to gradually generate aligned audio-video pairs from Gaussian noise. To ensure semantic consistency across modalities, we propose a novel random-shift-based attention block bridging the two subnets, which enables efficient cross-modal alignment and thus mutually reinforces audio and video fidelity. Extensive experiments show superior results in unconditional audio-video generation and zero-shot conditional tasks (e.g., video-to-audio). In particular, we achieve the best FVD and FAD on the Landscape and AIST++ dancing datasets. Turing tests with 10k votes further demonstrate dominant preferences for our model. The code and pre-trained models can be downloaded at https://github.com/researchmm/MM-Diffusion. (Comment: Accepted by CVPR 2023.)
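    One way to read the random-shift idea: rather than attending over the full video sequence, each call attends to a small, randomly shifted window of video tokens, so full cross-modal coverage is obtained in expectation over training steps. The sketch below is this interpretation under stated assumptions; the window size and the cross-attention callable are placeholders, and the actual MM-Diffusion block differs.

```python
import random

def random_shift_cross_attention(audio_tokens, video_tokens, attn, window=4):
    """Attend from audio tokens to a randomly shifted window of video tokens.

    Sketch of random-shift attention: sampling a different shift each
    call keeps per-step cost linear in the window size while covering
    all video positions in expectation. `attn` is any cross-attention
    callable taking (query, key, value) tensors of shape (B, T, C).
    """
    T = video_tokens.shape[1]
    window = min(window, T)                     # guard for short clips
    shift = random.randrange(T)                 # fresh shift per call
    idx = [(shift + i) % T for i in range(window)]
    window_tokens = video_tokens[:, idx, :]     # (B, window, C)
    return attn(audio_tokens, window_tokens, window_tokens)
```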

    Metallogenic Dynamics Background of Ga’erqiong Cu-Au Deposit in Tibet, China

    The Ga’erqiong Cu-Au deposit, which sits on the north side of the Coqên-Xainza magmatite belt, is a large-scale skarn-type deposit. Its ore body formed in the skarn zone at the contact between quartz diorite and marble of the Duoai Formation, or in cracks in the quartz diorite, and its mineralization is closely related to the quartz diorite. A granite porphyry-related molybdenum ore body also exists at depth. The metallogenic dynamics background of this deposit is currently disputed. Building on previous studies, this paper carried out zircon LA-ICPMS U-Pb dating and a petrogeochemical study of the quartz diorite of the Ga’erqiong Cu-Au deposit. The results indicate that the quartz diorite and granite porphyry were formed at 88 ± 2 Ma and 83 ± 1 Ma, respectively, belonging to magmatic activity of the early Late Cretaceous, and that both have geochemical characteristics similar to those of island-arc rocks of a subduction zone and geochemical indexes similar to adakite. Combined with the regional tectonic evolution, we conclude that the quartz diorite and granite porphyry were both formed in the extensional environment that followed the collision of the Lhasa and Qiangtang blocks. The quartz diorite results from the mixing of basic and acid melts evoked by asthenospheric upwelling caused by lower-crustal delamination, while the granite porphyry may result from partial melting of crust-mantle material due to the delaminated lower crust. The Ga’erqiong skarn-type Cu-Au deposit therefore represents the metallogenic response to collisional orogeny during the closing of the Meso-Tethys.

    Propolis Reduces Phosphatidylcholine-Specific Phospholipase C Activity and Increases Annexin a7 Level in Oxidized-LDL-Stimulated Human Umbilical Vein Endothelial Cells

    To understand the mechanisms underlying the dyslipidemia-regulating action of Chinese propolis and Brazilian green propolis, we investigated their effects on phosphatidylcholine-specific phospholipase C (PC-PLC) activity and annexin a7 (ANXA7) level, which play crucial roles in controlling the progress of atherosclerosis. Furthermore, reactive oxygen species (ROS) levels, nuclear factor-kappaB p65 (NF-κB p65), and mitochondrial membrane potential (MMP) were also investigated in oxidized-LDL- (ox-LDL-) stimulated human umbilical vein endothelial cells (HUVECs). Our data indicated that treatment with both types of propolis at 12.5 μg/mL significantly increased cell viability and attenuated the apoptosis rate, increased ANXA7 level, and decreased PC-PLC activity. Both types of propolis also inhibited the ROS generation induced by ox-LDL in HUVECs, as well as the subsequent MMP collapse and NF-κB p65 activation. Our results also indicated that Chinese propolis and Brazilian green propolis had similar biological activities and prevented ox-LDL-induced cellular dysfunction in HUVECs.

    The Research Value of Biphasic Registration Quantitative Computed Tomography Emphysema Index in the Evaluation of Mild to Moderate COPD

    Objective: To find the optimal quantitative index of emphysema by comparing and analyzing quantitative emphysema indexes in patients with mild to moderate chronic obstructive pulmonary disease (COPD) via registered biphasic quantitative computed tomography (QCT). Methods: We retrospectively collected 55 healthy controls, 21 Global Initiative for Chronic Obstructive Lung Disease (GOLD) grade 1 cases, and 31 GOLD grade 2 cases at our hospital. We imported the raw CT DICOM data into the "Digital Lung" analysis platform and measured LAA-950% at the end of deep inspiration and LAA-910% at the end of deep expiration. The expiratory and inspiratory CT images were registered; then the percentage of emphysema area (PRMEmph%), the percentage of functional small airway disease area (PRMfSAD%), and the percentage of normal area (PRMNormal%) were calculated according to the threshold method. Pulmonary function indicators included FVC, FEV1%, and FEV1/FVC. Differences in general data, CT quantitative indexes, and pulmonary function between groups were assessed using the independent-sample t-test, Mann–Whitney U test, or chi-square test, and correlations were analyzed using Spearman correlation. Receiver operating characteristic (ROC) curves were drawn to analyze the diagnostic performance of the CT quantitative parameters for emphysema in patients with mild to moderate COPD. Results: There were significant differences in sex, smoking index, FEV1%, FEV1/FVC, inspiratory-phase LAA-950%, expiratory-phase LAA-910%, PRMEmph%, PRMfSAD%, and PRMNormal% between the mild to moderate COPD patients and the normal control group. Inspiratory-phase LAA-950% was negatively correlated with FEV1/FVC; expiratory-phase LAA-910% and PRMEmph% were negatively correlated with FVC, FEV1%, and FEV1/FVC. ROC curve analysis showed that the areas under the curve of inspiratory-phase LAA-950%, expiratory-phase LAA-910%, and PRMEmph% were 0.742, 0.861, and 0.876, respectively. Among them, PRMEmph% had the largest area under the curve, with a corresponding critical value of 9.84%, a sensitivity of 76.90%, and a specificity of 94.50%. Conclusion: The quantitative CT emphysema indexes LAA-950% in the inspiratory phase, LAA-910% in the expiratory phase, and biphasic-registration PRMEmph% can objectively evaluate emphysema in patients with mild to moderate COPD, among which PRMEmph% is the best evaluation index.
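    For reference, the threshold-based indexes described above reduce to simple voxel counting on registered volumes. The NumPy sketch below computes an LAA% and the three PRM percentages; the -950/-856 HU pair used as defaults is the commonly cited PRM threshold convention, while this study pairs -950 HU (inspiratory) with -910 HU (expiratory) for LAA, so the thresholds should be set to match the protocol in use.

```python
import numpy as np

def emphysema_indices(insp_hu, exp_hu, lung_mask,
                      insp_thr=-950.0, exp_thr=-856.0):
    """Compute LAA% and PRM percentages from registered CT volumes.

    A minimal sketch, assuming `insp_hu` and `exp_hu` are co-registered
    inspiratory/expiratory volumes in Hounsfield units and `lung_mask`
    is a boolean array selecting lung voxels. Threshold values are
    protocol-dependent and passed in as parameters.
    """
    insp = insp_hu[lung_mask]
    exp_ = exp_hu[lung_mask]
    n = insp.size

    emph = (insp < insp_thr) & (exp_ < exp_thr)      # PRM-Emph voxels
    fsad = (insp >= insp_thr) & (exp_ < exp_thr)     # PRM-fSAD voxels
    normal = (insp >= insp_thr) & (exp_ >= exp_thr)  # PRM-Normal voxels
    return {
        "LAA_insp%": 100.0 * np.count_nonzero(insp < insp_thr) / n,
        "PRM_Emph%": 100.0 * np.count_nonzero(emph) / n,
        "PRM_fSAD%": 100.0 * np.count_nonzero(fsad) / n,
        "PRM_Normal%": 100.0 * np.count_nonzero(normal) / n,
    }
```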