The impacts of air pollution on human and natural capital in China: A look from a provincial perspective
Air quality has a significant impact on human health and natural systems worldwide. China, as one of the largest developing countries, faces severe air pollution that requires much attention. While the influences of air pollution on humans or nature have been extensively investigated, few scholars have considered the two effects of air pollution on human health and nature simultaneously within the same framework. Indeed, humans and nature coexist in the same biosphere on which they depend for their development, and the impacts of air pollution on human health and nature occur at the same time, with different and synergistic effects. Only by considering both impacts can we develop a more comprehensive understanding of air pollution impacts, in particular those of SO2, NO2, CO, PM10 and PM2.5. Impacts can be looked at from the point of view of the damage caused and of its repair (health recovery, replacement cost). Therefore, considering the different pollutants and sectors, the influences of air pollution on human health and nature are accounted for in this study by applying the Emergy Accounting and Life Cycle Assessment Eco-indicator 99 methods under a unified framework, with the 31 provinces of China taken as a case study. While LCA provides an accurate assessment of the direct consequences of pollution on human and natural capital (human health and biodiversity losses), the Emergy Accounting approach quantifies the biosphere work associated with repairing or replacing such losses over time. Furthermore, the spatial agglomeration characteristics of emissions and of human and natural capital losses are analyzed by means of Moran's I index. Results show that: (1) Concerning human capital losses, the emissions of PM10 and PM2.5 account for only 10% of the total, compared to SO2, NO2, and CO emissions, but in some provinces they cause more than 70% of human capital losses; more than 80% of the PM2.5 and PM10 causing human capital losses come from the industrial and civil sectors. (2) As far as natural capital losses are concerned, the losses caused by NO2, compared with SO2, account for 80% in most provinces; the power, industrial and transportation sectors are the three major sources of the NO2 causing natural capital losses. (3) The spatial agglomeration characteristics, such as high-high, high-low, low-low and low-high clusters, differ across air pollution emissions, human capital losses and natural capital losses. A comprehensive and detailed understanding of the impacts of air pollution is crucial for policy makers to make informed decisions.
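The spatial analysis above relies on Moran's I. As a rough, self-contained illustration of that statistic (not code from the study; the "provinces", values and neighbour matrix below are hypothetical placeholders), a global Moran's I can be computed as follows:

```python
# Illustrative sketch: global Moran's I for a provincial indicator
# (e.g., a per-province pollution-related loss) given a spatial weight matrix.
import numpy as np

def morans_i(x: np.ndarray, w: np.ndarray) -> float:
    """Global Moran's I for values x and spatial weight matrix w (n x n)."""
    x = np.asarray(x, dtype=float)
    z = x - x.mean()                      # deviations from the mean
    s0 = w.sum()                          # sum of all spatial weights
    num = (w * np.outer(z, z)).sum()      # cross-products weighted by contiguity
    den = (z ** 2).sum()
    return len(x) / s0 * num / den

# Toy example: 4 hypothetical "provinces" on a line, each neighbouring the next.
values = np.array([2.0, 2.5, 8.0, 9.0])          # hypothetical loss indicator
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(f"Moran's I = {morans_i(values, W):.3f}")   # > 0 suggests spatial clustering
```

Values near +1 indicate that similar losses cluster spatially (the high-high and low-low clusters mentioned above), while values near -1 indicate spatial dispersion.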
Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding
To bridge the gap between supervised semantic segmentation and real-world applications that require one model to recognize arbitrary new concepts, recent zero-shot segmentation has attracted a lot of attention by exploring the relationships between unseen and seen object categories, yet it still requires large amounts of densely annotated data with diverse base classes. In this paper, we propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without any dense annotation effort, by purely exploiting the image-caption data that naturally exist on the Internet. Our method, Vision-language-driven Semantic Segmentation (ViL-Seg), employs an image and a text encoder to generate visual and text embeddings for the image-caption data, with two core components that endow its segmentation ability: First, the image encoder is jointly trained with a vision-based and a cross-modal contrastive objective, which encourage the visual embeddings to preserve both the fine-grained semantics and the high-level category information that are crucial for the segmentation task. Furthermore, an online clustering head is devised over the image encoder, which dynamically segments the visual embeddings into distinct semantic groups so that they can be classified by comparison with various text embeddings to complete our segmentation pipeline. Experiments show that, without using any data with dense annotations, our method can directly segment objects of arbitrary categories, outperforming zero-shot segmentation methods that require data labeling on three benchmark datasets. Comment: Accepted to ECCV 2022
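As a minimal sketch of the cross-modal contrastive signal described above (my own illustration, not the ViL-Seg implementation; the encoders are replaced by random embeddings and the temperature value is an assumption):

```python
# Symmetric image-text InfoNCE: matched image/caption pairs are pulled together,
# mismatched pairs within the batch are pushed apart.
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(img_emb: torch.Tensor,
                                 txt_emb: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img.size(0))           # matched pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage with random embeddings standing in for the two encoders.
img_emb = torch.randn(8, 512)
txt_emb = torch.randn(8, 512)
print(cross_modal_contrastive_loss(img_emb, txt_emb))
```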
Boosting Text-to-Image Diffusion Models with Fine-Grained Semantic Rewards
Recent advances in text-to-image diffusion models have achieved remarkable
success in generating high-quality, realistic images from given text prompts.
However, previous methods fail to achieve accurate modality alignment between text concepts and generated images due to the lack of fine-grained semantic guidance that successfully diagnoses the modality discrepancy. In this paper,
we propose FineRewards to improve the alignment between text and images in
text-to-image diffusion models by introducing two new fine-grained semantic
rewards: the caption reward and the Semantic Segment Anything (SAM) reward.
From the global semantic view, the caption reward generates a corresponding
detailed caption that depicts all important contents in the synthetic image via
a BLIP-2 model and then calculates the reward score by measuring the similarity
between the generated caption and the given prompt. From the local semantic
view, the SAM reward segments the generated images into local parts with
category labels, and scores the segmented parts by measuring the likelihood of
each category appearing in the prompted scene via a large language model, i.e.,
Vicuna-7B. Additionally, we adopt an assemble reward-ranked learning strategy to integrate multiple reward functions that jointly guide the model training. Adaptation results of text-to-image models on the MS-COCO benchmark show that the proposed semantic reward outperforms other baseline reward functions by a considerable margin in both visual quality and semantic similarity to the input prompt. Moreover, by adopting the assemble reward-ranked learning strategy, we further demonstrate that model performance improves when the proposed semantic reward is unified with the current image rewards.
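A rough sketch of the caption-reward idea (a paraphrase of the description above, not the FineRewards code; the captioner and the bag-of-words similarity below are simple stand-ins for BLIP-2 and a learned text encoder):

```python
# Caption reward: describe the synthetic image with a captioner, then score the
# similarity between that caption and the original prompt.
from collections import Counter
import math
from typing import Callable

def bow_cosine(a: str, b: str) -> float:
    """Placeholder similarity: cosine over bag-of-words counts."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def caption_reward(image, prompt: str,
                   captioner: Callable[[object], str],
                   similarity: Callable[[str, str], float] = bow_cosine) -> float:
    """Reward = similarity(generated caption, prompt)."""
    caption = captioner(image)              # e.g., a BLIP-2 style captioner
    return similarity(caption, prompt)

# Toy usage with a dummy captioner standing in for BLIP-2.
dummy_captioner = lambda img: "a brown dog running on green grass"
print(caption_reward(None, "a dog running on the grass", dummy_captioner))
```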
Effective Adaptation in Multi-Task Co-Training for Unified Autonomous Driving
To achieve a holistic understanding of multiple downstream tasks simultaneously, there is a need to extract features with better transferability. Though many recent self-supervised pre-training methods have
achieved impressive performance on various vision tasks under the prevailing
pretrain-finetune paradigm, their generalization capacity to multi-task
learning scenarios is yet to be explored. In this paper, we extensively
investigate the transfer performance of various types of self-supervised
methods, e.g., MoCo and SimCLR, on three downstream tasks, including semantic
segmentation, drivable area segmentation, and traffic object detection, on the
large-scale driving dataset BDD100K. We surprisingly find that their
performances are sub-optimal or even lag far behind the single-task baseline, which may be due to the differences in training objectives and architectural design inherent in the pretrain-finetune paradigm. To overcome this dilemma as well
as avoid redesigning the resource-intensive pre-training stage, we propose a
simple yet effective pretrain-adapt-finetune paradigm for general multi-task
training, where the off-the-shelf pretrained models can be effectively adapted
without increasing the training overhead. During the adapt stage, we utilize
learnable multi-scale adapters to dynamically adjust the pretrained model
weights supervised by multi-task objectives while leaving the pretrained
knowledge untouched. Furthermore, we regard the vision-language pre-training
model CLIP as a strong complement to the pretrain-adapt-finetune paradigm and
propose a novel adapter named LV-Adapter, which incorporates language priors in
the multi-task model via task-specific prompting and alignment between visual
and textual features. Comment: Accepted at NeurIPS 2022
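As a generic sketch of the adapt stage described above (not the paper's LV-Adapter; the residual bottleneck design and dimensions are illustrative assumptions), a learnable adapter over frozen pretrained features might look like this:

```python
# A small trainable bottleneck on top of frozen backbone features: only the
# adapter (and task heads) receive gradients from the multi-task objectives.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight residual bottleneck applied to frozen backbone features."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return feats + self.up(self.act(self.down(feats)))  # residual update

# Usage: freeze the backbone, train only the adapters.
backbone = nn.Linear(256, 256)            # stand-in for one pretrained encoder stage
for p in backbone.parameters():
    p.requires_grad_(False)
adapter = Adapter(dim=256)

x = torch.randn(4, 196, 256)              # (batch, tokens, channels)
adapted = adapter(backbone(x))
print(adapted.shape)                       # torch.Size([4, 196, 256])
```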
DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment
This paper presents DetCLIPv2, an efficient and scalable training framework
that incorporates large-scale image-text pairs to achieve open-vocabulary
object detection (OVD). Unlike previous OVD frameworks that typically rely on a
pre-trained vision-language model (e.g., CLIP) or exploit image-text pairs via
a pseudo labeling process, DetCLIPv2 directly learns the fine-grained
word-region alignment from massive image-text pairs in an end-to-end manner. To
accomplish this, we employ a maximum word-region similarity between region
proposals and textual words to guide the contrastive objective. To enable the
model to gain localization capability while learning broad concepts, DetCLIPv2
is trained with a hybrid supervision from detection, grounding and image-text
pair data under a unified data formulation. By jointly training with an
alternating scheme and adopting low-resolution input for image-text pairs,
DetCLIPv2 exploits image-text pair data efficiently and effectively: DetCLIPv2
utilizes 13X more image-text pairs than DetCLIP with a similar training time
and improves performance. With 13M image-text pairs for pre-training, DetCLIPv2
demonstrates superior open-vocabulary detection performance, e.g., DetCLIPv2
with Swin-T backbone achieves 40.4% zero-shot AP on the LVIS benchmark, which
outperforms previous works GLIP/GLIPv2/DetCLIP by 14.4/11.4/4.5% AP,
respectively, and even beats its fully-supervised counterpart by a large
margin. Comment: Accepted to CVPR 2023
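As an illustrative sketch of the maximum word-region similarity mentioned above (my reading of the description, not the official DetCLIPv2 code; shapes and the random features are placeholders):

```python
# Alignment score for one image-text pair: each caption word is matched to its
# best region proposal, and the per-word maxima are averaged.
import torch
import torch.nn.functional as F

def max_word_region_similarity(regions: torch.Tensor,
                               words: torch.Tensor) -> torch.Tensor:
    """regions: (R, D) proposal embeddings; words: (W, D) word embeddings."""
    regions = F.normalize(regions, dim=-1)
    words = F.normalize(words, dim=-1)
    sim = words @ regions.t()               # (W, R) word-to-region cosine similarities
    return sim.max(dim=1).values.mean()     # best region per word, averaged

# Toy usage with random embeddings standing in for the two encoders.
region_feats = torch.randn(100, 256)        # 100 region proposals
word_feats = torch.randn(12, 256)           # 12 caption words
print(max_word_region_similarity(region_feats, word_feats))
```

Such per-pair scores can then be plugged into a standard batch-level contrastive loss, which is how the abstract describes the objective being guided.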
DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability
Recently, large-scale diffusion models, e.g., Stable Diffusion and DALL-E 2,
have shown remarkable results on image synthesis. On the other hand,
large-scale cross-modal pre-trained models (e.g., CLIP, ALIGN, and FILIP) are
competent for various downstream tasks by learning to align vision and language
embeddings. In this paper, we explore the possibility of jointly modeling
generation and discrimination. Specifically, we propose DiffDis to unify the
cross-modal generative and discriminative pretraining into one single framework
under the diffusion process. DiffDis first formulates the image-text
discriminative problem as a generative diffusion process of the text embedding
from the text encoder conditioned on the image. Then, we propose a novel
dual-stream network architecture, which fuses the noisy text embedding with the
knowledge of latent images from different scales for image-text discriminative
learning. Moreover, the generative and discriminative tasks can efficiently
share the image-branch network structure in the multi-modality model.
Benefiting from diffusion-based unified training, DiffDis achieves both better
generation ability and cross-modal semantic alignment in one architecture.
Experimental results show that DiffDis outperforms single-task models on both
the image generation and the image-text discriminative tasks, e.g., a 1.65% improvement in the average accuracy of zero-shot classification over 12 datasets and a 2.42 improvement in the FID of zero-shot image synthesis. Comment: ICCV 2023
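As a very loose sketch of the core idea of diffusing a text embedding conditioned on the image (a generic DDPM-style training step, not the DiffDis dual-stream architecture; the schedule, shapes and the small MLP are assumptions):

```python
# The clean caption embedding is noised according to a schedule, and a
# conditional network is trained to predict that noise given the image features.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

class NoisePredictor(nn.Module):
    """Predicts the noise added to a text embedding, conditioned on the image."""
    def __init__(self, txt_dim: int = 512, img_dim: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(txt_dim + img_dim + 1, 1024), nn.SiLU(),
            nn.Linear(1024, txt_dim))

    def forward(self, noisy_txt, img_feat, t):
        t_emb = t.float().unsqueeze(-1) / T            # crude timestep encoding
        return self.net(torch.cat([noisy_txt, img_feat, t_emb], dim=-1))

model = NoisePredictor()
txt0 = torch.randn(8, 512)                  # clean caption embeddings
img = torch.randn(8, 512)                   # paired image features
t = torch.randint(0, T, (8,))
eps = torch.randn_like(txt0)
a = alpha_bars[t].unsqueeze(-1)
noisy_txt = a.sqrt() * txt0 + (1 - a).sqrt() * eps     # forward diffusion step
loss = nn.functional.mse_loss(model(noisy_txt, img, t), eps)
print(loss)
```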
POVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection
Inspired by the success of vision-language models (VLMs) in zero-shot classification, recent works attempt to extend this line of work to object detection by leveraging the localization ability of pre-trained VLMs and generating pseudo labels for unseen classes in a self-training manner. However, since current VLMs are usually pre-trained by aligning sentence embeddings with global image embeddings, directly using them lacks the fine-grained alignment for object instances that is at the core of detection. In this paper,
we propose a simple but effective Pretrain-adaPt-Pseudo labeling paradigm for
Open-Vocabulary Detection (POVD) that introduces a fine-grained visual-text
prompt adapting stage to enhance the current self-training paradigm with a more
powerful fine-grained alignment. During the adapting stage, we enable VLM to
obtain fine-grained alignment by using learnable text prompts to resolve an
auxiliary dense pixel-wise prediction task. Furthermore, we propose a visual prompt module to provide prior task information (i.e., the categories to be predicted) to the vision branch, so that the pretrained VLM is better adapted to the downstream tasks. Experiments show that our method achieves state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
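As a rough sketch of combining learnable text prompts with a dense pixel-wise prediction head (an illustration of the idea above, not the POVD implementation; the stand-in text encoder and all dimensions are assumptions):

```python
# Learnable context tokens are shared across classes, each prompted class is
# turned into an embedding, and dense image features are classified per pixel
# against these class embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedPixelClassifier(nn.Module):
    def __init__(self, num_classes: int, ctx_len: int = 4, dim: int = 512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(ctx_len, dim) * 0.02)        # learnable prompt
        self.class_tokens = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
        self.text_proj = nn.Linear(dim, dim)     # stand-in for a frozen text encoder

    def forward(self, pixel_feats: torch.Tensor) -> torch.Tensor:
        # pixel_feats: (B, H, W, dim) dense visual features from the image encoder.
        prompted = self.ctx.mean(dim=0, keepdim=True) + self.class_tokens
        class_emb = F.normalize(self.text_proj(prompted), dim=-1)        # (C, dim)
        pix = F.normalize(pixel_feats, dim=-1)
        return pix @ class_emb.t()               # (B, H, W, C) per-pixel logits

model = PromptedPixelClassifier(num_classes=20)
logits = model(torch.randn(2, 32, 32, 512))
print(logits.shape)                              # torch.Size([2, 32, 32, 20])
```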