
    Block-wise LoRA: Revisiting Fine-grained LoRA for Effective Personalization and Stylization in Text-to-Image Generation

    The objective of personalization and stylization in text-to-image generation is to instruct a pre-trained diffusion model to analyze new concepts introduced by users and incorporate them into expected styles. Recently, parameter-efficient fine-tuning (PEFT) approaches have been widely adopted for this task and have greatly propelled the development of the field. Despite their popularity, existing efficient fine-tuning methods still struggle to achieve effective personalization and stylization in text-to-image (T2I) generation. To address this issue, we propose block-wise Low-Rank Adaptation (LoRA), which performs fine-grained fine-tuning on different blocks of Stable Diffusion (SD) and can generate images that are faithful to the input prompt and target identity while also exhibiting the desired style. Extensive experiments demonstrate the effectiveness of the proposed method.
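
    As a rough illustration of the core idea (a sketch, not the authors' implementation), the snippet below wraps the linear layers of a U-Net-like model with standard LoRA adapters whose rank differs per block; the block names and rank schedule are hypothetical assumptions:

```python
# Minimal block-wise LoRA sketch: block names and per-block ranks are
# illustrative assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # keep pre-trained weights frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Hypothetical rank schedule: different granularity for different blocks.
BLOCK_RANKS = {"down_blocks": 4, "mid_block": 8, "up_blocks": 16}

def add_blockwise_lora(unet: nn.Module, block_ranks=BLOCK_RANKS):
    """Replace each nn.Linear with a LoRA adapter ranked by its block name."""
    targets = []
    for name, module in unet.named_modules():
        for child_name, child in module.named_children():
            if isinstance(child, nn.Linear):
                for block_key, rank in block_ranks.items():
                    if block_key in name:
                        targets.append((module, child_name, child, rank))
                        break
    for module, child_name, child, rank in targets:  # mutate after the walk
        setattr(module, child_name, LoRALinear(child, rank))
```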

    Missing Modality meets Meta Sampling (M3S): An Efficient Universal Approach for Multimodal Sentiment Analysis with Missing Modality

    Multimodal sentiment analysis (MSA) is an important way of observing mental activities with the help of data captured from multiple modalities. However, due to recording or transmission errors, some modalities may contain incomplete data. Most existing works that address missing modalities assume a particular modality is completely missing and seldom consider a mixture of missing data across multiple modalities. In this paper, we propose a simple yet effective meta-sampling approach for multimodal sentiment analysis with missing modalities, namely Missing Modality-based Meta Sampling (M3S). Specifically, M3S formulates a missing-modality sampling strategy within the model-agnostic meta-learning (MAML) framework. M3S can be treated as an efficient add-on training component for existing models and significantly improves their performance on multimodal data with a mixture of missing modalities. We conduct experiments on the IEMOCAP, SIMS and CMU-MOSI datasets and achieve superior performance compared with recent state-of-the-art methods.
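
    A minimal first-order MAML-style sketch of the meta-sampling idea follows; the mask distribution, the `loss_fn`, and all hyperparameters are illustrative placeholders rather than the paper's exact recipe:

```python
# Sketch: sample missing-modality masks per task inside a first-order
# MAML loop. The model, loss_fn, and data layout are assumed placeholders.
import copy
import random
import torch

MODALITIES = ("text", "audio", "video")

def sample_missing_mask(p_missing=0.3):
    """Randomly drop a mixture of modalities, keeping at least one."""
    mask = {m: random.random() > p_missing for m in MODALITIES}
    if not any(mask.values()):
        mask[random.choice(MODALITIES)] = True
    return mask

def apply_mask(batch, mask):
    """Zero out the feature tensors of missing modalities."""
    return {m: x if mask[m] else torch.zeros_like(x) for m, x in batch.items()}

def m3s_step(model, loss_fn, support, query, inner_lr=1e-3):
    """One meta step: adapt on one missing pattern, evaluate on another."""
    fast = copy.deepcopy(model)
    inner_loss = loss_fn(fast, apply_mask(support, sample_missing_mask()))
    grads = torch.autograd.grad(inner_loss, list(fast.parameters()))
    with torch.no_grad():                      # inner-loop SGD adaptation
        for p, g in zip(fast.parameters(), grads):
            p -= inner_lr * g
    outer_loss = loss_fn(fast, apply_mask(query, sample_missing_mask()))
    outer_grads = torch.autograd.grad(outer_loss, list(fast.parameters()))
    for p, g in zip(model.parameters(), outer_grads):  # first-order update
        p.grad = g.clone() if p.grad is None else p.grad + g
    return outer_loss.item()
```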

    Grounded Image Text Matching with Mismatched Relation Reasoning

    This paper introduces Grounded Image Text Matching with Mismatched Relation (GITM-MR), a novel visual-linguistic joint task that evaluates the relation understanding capabilities of transformer-based pre-trained models. GITM-MR requires a model to first determine whether an expression describes an image, and then either localize the referred objects or ground the mismatched parts of the text. We provide a benchmark for evaluating pre-trained models on this task, with a focus on the challenging settings of limited data and out-of-distribution sentence lengths. Our evaluation demonstrates that pre-trained models lack data efficiency and length generalization ability. To address this, we propose the Relation-sensitive Correspondence Reasoning Network (RCRN), which incorporates relation-aware reasoning via bi-directional message propagation guided by language structure. RCRN can be interpreted as a modular program and delivers strong performance in both length generalization and data efficiency.
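
    As a loose illustration only, the sketch below shows one way bi-directional message propagation guided by language structure could be organized: a bottom-up then a top-down pass over a parse tree with GRU updates. The tree encoding, dimensions, and update rule are assumptions, not RCRN itself:

```python
# Hypothetical bi-directional propagation over a dependency/parse tree.
import torch
import torch.nn as nn

class BiDirectionalTreePropagation(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.up = nn.GRUCell(dim, dim)    # child -> parent messages
        self.down = nn.GRUCell(dim, dim)  # parent -> child messages

    def forward(self, feats, children):
        """feats: (num_nodes, dim) phrase features; children[i] lists the
        child indices of node i. Nodes are assumed topologically ordered,
        with the root at index 0 and parents before their children."""
        h = list(feats)  # per-node hidden states
        # Bottom-up pass: fold child states into each parent.
        for i in reversed(range(len(children))):
            if children[i]:
                msg = torch.stack([h[c] for c in children[i]]).mean(0)
                h[i] = self.up(msg.unsqueeze(0), h[i].unsqueeze(0)).squeeze(0)
        # Top-down pass: push each parent's state back to its children.
        for i in range(len(children)):
            for c in children[i]:
                h[c] = self.down(h[i].unsqueeze(0), h[c].unsqueeze(0)).squeeze(0)
        return torch.stack(h)

# Tiny usage example on a made-up 5-node tree rooted at node 0.
prop = BiDirectionalTreePropagation(dim=64)
out = prop(torch.randn(5, 64), [[1, 2], [3, 4], [], [], []])
```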

    The consumption-based black carbon emissions of China's megacities

    A growing body of literature discusses the CO2 emissions of cities, but little is known about black carbon (BC), a short-lived warming agent. Identifying the drivers of urban BC emissions is crucial for targeting cleanup efforts. A consumption-based approach allocates all emissions along the production chain to the product and place of final consumption, whereas a production-based approach attributes emissions to the place where goods and services are produced. In this study, we calculate the production-based and consumption-based BC emissions in 2012 for four Chinese megacities: Beijing, Shanghai, Tianjin and Chongqing. The results show that capital formation is the largest contributor, accounting for 37%–69% of consumption-based emissions. Approximately 44% of the BC emissions related to goods consumed in Chongqing, and more than 60% for Beijing, Shanghai and Tianjin, occur outside the city boundary. The large gap between consumption-based and production-based emissions can be attributed to the great difference in embodied emission intensities. Therefore, collaborative efforts to reduce emission intensity can be effective in mitigating climate change for megacities as either producers or consumers.
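
    The split between the two accounting views is standard environmentally extended input-output analysis. The toy two-sector example below, with invented numbers, shows how the same total BC emissions are allocated differently to producers and consumers via the Leontief inverse:

```python
# Toy input-output accounting: all coefficients below are invented.
import numpy as np

A = np.array([[0.2, 0.3],      # inter-industry requirements per unit output
              [0.1, 0.4]])
y = np.array([100.0, 50.0])    # final demand of the city's consumers
f = np.array([0.5, 0.1])       # direct BC emissions per unit of output

L = np.linalg.inv(np.eye(2) - A)   # Leontief inverse
x = L @ y                          # total output required by final demand

production_based = f * x           # attributed to where production occurs
consumption_based = (f @ L) * y    # attributed to final consumers

# Both views allocate the same total, just to different places.
assert np.isclose(production_based.sum(), consumption_based.sum())
```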

    Multi-objective analysis of the co-mitigation of CO2 and PM2.5 pollution by China's iron and steel industry

    China has experienced serious fine particulate matter (PM2.5) pollution in recent years, and carbon dioxide (CO2) emissions must be controlled so that China can keep its pledge to reduce CO2 emissions by 2030. The iron and steel industry is energy intensive and contributes significantly to PM2.5 pollution in China. Simultaneously reducing CO2 emissions and PM2.5 pollution while minimizing total mitigation costs remains a crucial unresolved issue. Using a multi-objective analysis, we compared potential technology combinations under various policy preferences and targets. Our results showed that policies designed to mitigate PM2.5 pollution have substantial co-benefits for CO2 emission reductions. However, policies focused solely on reducing CO2 emissions fail to effectively reduce PM2.5. Furthermore, CO2 emission reductions entail large financial costs, whereas PM2.5 pollution reductions are less expensive. Our results suggest that under limited budgets, decision makers should prioritize PM2.5 reductions, because CO2 reductions may be achieved simultaneously. Achieving large decreases in CO2 emissions will require further technological innovations to reduce the cost threshold. Thus, China should focus on reducing PM2.5 pollution in the short term and prepare for the expected challenges associated with CO2 reductions in the future.
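
    A toy sketch of this kind of multi-objective screening is shown below: enumerate technology portfolios, score CO2 cuts, PM2.5 cuts, and cost, and keep only the Pareto-efficient trade-offs. All technology names and numbers are invented for illustration:

```python
# Pareto screening over hypothetical abatement technologies.
from itertools import combinations

# (name, CO2 reduction, PM2.5 reduction, cost) -- invented values
TECHS = [("coke_dry_quenching",  8.0, 1.0, 30.0),
         ("bag_filter",          0.5, 6.0,  5.0),
         ("waste_heat_recovery", 5.0, 0.5, 20.0),
         ("sintering_esp",       1.0, 4.0,  8.0)]

def portfolios():
    """Yield every non-empty technology combination with summed scores."""
    for r in range(1, len(TECHS) + 1):
        for combo in combinations(TECHS, r):
            names = [t[0] for t in combo]
            co2 = sum(t[1] for t in combo)
            pm = sum(t[2] for t in combo)
            cost = sum(t[3] for t in combo)
            yield names, co2, pm, cost

def dominates(a, b):
    """a dominates b: no worse on every objective, strictly better on one."""
    return (a[1] >= b[1] and a[2] >= b[2] and a[3] <= b[3]
            and (a[1] > b[1] or a[2] > b[2] or a[3] < b[3]))

solutions = list(portfolios())
pareto = [s for s in solutions if not any(dominates(o, s) for o in solutions)]
for names, co2, pm, cost in sorted(pareto, key=lambda s: s[3]):
    print(f"cost={cost:5.1f}  CO2 cut={co2:4.1f}  PM2.5 cut={pm:4.1f}  {names}")
```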

    Towards General Visual-Linguistic Face Forgery Detection

    Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust. Existing methods mostly treat this task as binary classification, using digital labels or mask signals to train the detection model. We argue that such supervision lacks semantic information and interpretability. To address these issues, in this paper we propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation. Since text annotations are not available in current deepfake datasets, VLFFD first generates mixed forgery images with corresponding fine-grained prompts via the Prompt Forgery Image Generator (PFIG). Then, the fine-grained mixed data and the coarse-grained original data are jointly trained with the Coarse-and-Fine Co-training framework (C2F), enabling the model to gain greater generalization and interpretability. Experiments show that the proposed method improves existing detection models on several challenging benchmarks. Furthermore, we have integrated our method with multimodal large models, achieving noteworthy results that demonstrate the potential of our approach. This integration not only enhances the performance of our VLFFD paradigm but also underscores the versatility and adaptability of our method when combined with advanced multimodal technologies, highlighting its potential in tackling the evolving challenges of deepfake detection.
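
    A minimal sketch of what a coarse-and-fine co-training objective could look like follows; the CLIP-style contrastive term, the loss weighting, and all tensor shapes are assumptions rather than the paper's exact C2F formulation:

```python
# Hypothetical C2F-style joint objective: coarse real/fake classification
# plus fine-grained image-prompt contrastive alignment.
import torch
import torch.nn.functional as F

def c2f_loss(img_emb, txt_emb, logits, labels, temperature=0.07, w_fine=0.5):
    """img_emb, txt_emb: (B, D) paired embeddings of forgery images and
    their sentence-level prompts; logits: (B, 2) from a coarse head."""
    coarse = F.cross_entropy(logits, labels)        # real vs. fake supervision
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.T / temperature                 # (B, B) similarity matrix
    targets = torch.arange(sim.size(0), device=sim.device)
    fine = (F.cross_entropy(sim, targets) +         # image -> prompt matching
            F.cross_entropy(sim.T, targets)) / 2    # prompt -> image matching
    return coarse + w_fine * fine
```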