
    Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models

    Accurate video moment retrieval (VMR) requires universal visual-textual correlations that can handle unknown vocabulary and unseen scenes. However, the learned correlations are likely to be either biased, when derived from a limited amount of moment-text data that is hard to scale up because of the prohibitive annotation cost (fully-supervised), or unreliable, when only video-text pairwise relationships are available without fine-grained temporal annotations (weakly-supervised). Recently, vision-language models (VLMs) have demonstrated a new transfer-learning paradigm that benefits different vision tasks through universal visual-textual correlations derived from large-scale vision-language pairwise web data, and they have also been shown to benefit VMR when fine-tuned on the target domains. In this work, we propose a zero-shot method for adapting generalisable visual-textual priors from an arbitrary VLM to facilitate moment-text alignment, without needing access to VMR data. To this end, we devise a conditional feature refinement module that generates boundary-aware visual features conditioned on text queries, enabling better understanding of moment boundaries. Additionally, we design a bottom-up proposal generation strategy that mitigates the impact of domain discrepancies and breaks complex-query retrieval tasks down into individual action retrievals, thereby maximizing the benefits of the VLM. Extensive experiments on three VMR benchmark datasets demonstrate notable performance advantages of our zero-shot algorithm, especially in the novel-word and novel-location out-of-distribution setups. Comment: Accepted by WACV 202
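    The bottom-up idea of building moments from frame-level evidence can be illustrated with a minimal sketch. It assumes per-frame similarities between a query and each frame have already been computed with some frozen VLM (for example, cosine similarity of image and text embeddings); the function names, the fixed threshold, and the mean-score ranking are hypothetical choices and not the paper's proposal strategy. For a complex query split into action phrases, each sub-query's similarity list could be run through the same function.

```python
from typing import List, Tuple

Proposal = Tuple[int, int, float]  # (start frame, end frame inclusive, mean score)


def bottom_up_proposals(sims: List[float], threshold: float) -> List[Proposal]:
    """Merge consecutive frames whose query similarity exceeds `threshold`
    into candidate moments, ranked by mean similarity (best first).
    Hypothetical heuristic for illustration only."""
    proposals: List[Proposal] = []
    start = None
    # A -inf sentinel closes any run that reaches the last frame.
    for i, s in enumerate(sims + [float("-inf")]):
        if s > threshold and start is None:
            start = i
        elif s <= threshold and start is not None:
            run = sims[start:i]
            proposals.append((start, i - 1, sum(run) / len(run)))
            start = None
    return sorted(proposals, key=lambda p: p[2], reverse=True)


def temporal_iou(a: Tuple[int, int], b: Tuple[int, int]) -> float:
    """Temporal IoU of two inclusive frame intervals (the overlap measure
    behind the usual "R@1, IoU=m" VMR metrics)."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


if __name__ == "__main__":
    # Made-up similarities between one query and seven frames.
    sims = [0.1, 0.8, 0.9, 0.2, 0.7, 0.75, 0.1]
    for start, end, score in bottom_up_proposals(sims, threshold=0.5):
        print(start, end, round(score, 3))
```

    A fixed threshold keeps the sketch short; in practice the cut-off would have to adapt to the similarity scale of the chosen VLM.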

    Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training

    The correlation between vision and text is essential for video moment retrieval (VMR); however, existing methods rely heavily on separately pre-trained feature extractors for visual and textual understanding. Without sufficient temporal boundary annotations, it is non-trivial to learn universal video-text alignments. In this work, we explore multi-modal correlations derived from large-scale image-text data to facilitate generalisable VMR. To address the limitations of image-text pre-training models in capturing video changes, we propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments. Whilst existing VMR methods focus on building temporal-aware video features, awareness of text descriptions of temporal changes is also critical, yet it is overlooked in pre-training that matches static images with sentences. Therefore, we extract visual context and spatial dynamic information from video frames and explicitly enforce their alignment with the phrases describing video changes (e.g. verbs). By doing so, the potentially relevant visual and motion patterns in videos are encoded (injected) into the corresponding text embeddings so as to enable more accurate video-text alignments. We conduct extensive experiments on two VMR benchmark datasets (Charades-STA and ActivityNet-Captions) and achieve state-of-the-art performance. In particular, VDI yields notable advantages when tested on the out-of-distribution splits, where the test samples involve novel scenes and vocabulary. Comment: CVPR202
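    The core mechanism, aligning phrases that describe change (such as verbs) with visual dynamics, can be sketched as a contrastive objective. This is a minimal sketch under stated assumptions: "dynamics" are approximated by averaging differences between consecutive frame embeddings, and the alignment is a standard InfoNCE loss with a hypothetical temperature `tau`. Both are stand-ins chosen for illustration, not the VDI formulation.

```python
import math
from typing import List

Vector = List[float]


def normalize(v: Vector) -> Vector:
    """L2-normalise a vector (a zero vector is returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def dynamic_feature(frames: List[Vector]) -> Vector:
    """Average of consecutive frame-embedding differences: a crude
    'what changed' cue (assumed stand-in). Needs at least two frames."""
    diffs = [[b - a for a, b in zip(f0, f1)] for f0, f1 in zip(frames, frames[1:])]
    return normalize([sum(col) / len(diffs) for col in zip(*diffs)])


def info_nce(phrases: List[Vector], dynamics: List[Vector], tau: float = 0.07) -> float:
    """Mean InfoNCE loss: phrases[i] (e.g. a verb embedding) should match
    dynamics[i]; every other clip's dynamics acts as a negative."""
    dyn = [normalize(d) for d in dynamics]
    total = 0.0
    for i, p in enumerate(phrases):
        p = normalize(p)
        logits = [sum(a * b for a, b in zip(p, d)) / tau for d in dyn]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_z - logits[i]
    return total / len(phrases)


if __name__ == "__main__":
    clip_a = dynamic_feature([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])  # moves along x
    clip_b = dynamic_feature([[0.0, 0.0], [0.0, 1.0], [0.0, 2.0]])  # moves along y
    print(info_nce([[1.0, 0.0], [0.0, 1.0]], [clip_a, clip_b]))  # matched phrases
    print(info_nce([[0.0, 1.0], [1.0, 0.0]], [clip_a, clip_b]))  # swapped phrases
```

    Minimising this loss with respect to the text encoder is one way motion cues could be "injected" into phrase embeddings; the matched pairing yields a far lower loss than the swapped one.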

    Global projection of flood risk with a bivariate framework under 1.5–3.0°C warming levels

    Global warming increases the atmospheric water-holding capacity, consequently altering the frequency and intensity of extreme hydrological events. River floods characterized by large peak flow or prolonged duration can amplify the risk of social disruption and affect ecosystem stability. However, previous studies have mostly focused on univariate flood magnitude characteristics, such as flood peak or volume, and there is still limited understanding of how joint flood characteristics (i.e., magnitude and duration) might co-evolve under different warming levels. Here, we develop a systematic bivariate framework to project future flood risk in 11,528 catchments across the globe. By constructing the joint distribution of flood peak and duration with copulas, we examine global flood risk under varying levels of global warming (i.e., 1.5–3.0°C above pre-industrial levels). The flood projections are produced by driving five calibrated lumped hydrological models (HMs) with bias-adjusted simulations from five global climate models (GCMs) under three shared socioeconomic pathways (SSP126, SSP370, and SSP585). On average, warming from 1.5 to 3.0°C tends to amplify flood peaks and lengthen flood duration across almost all continents, but the changes are not unidirectional and vary regionally. The joint return period (JRP) of the historical (1985–2014) 50-year flood event is projected to decrease to a median of 36 years under a medium emission pathway at the 1.5°C warming level. Finally, we evaluate the drivers of these JRP changes in the future climate and quantify the uncertainty arising from the different GCMs, SSPs, and HMs. Our findings highlight the importance of limiting greenhouse gas emissions to slow global warming and of developing climate adaptation strategies to address future flood hazards.
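    To make the copula-based joint return period concrete, here is a minimal sketch. The abstract does not name the copula family or the JRP definition used, so the Gumbel copula and the two common definitions below ("AND": both peak and duration exceeded; "OR": either exceeded) are assumptions. Here `u` and `v` are the marginal non-exceedance probabilities of the peak and duration thresholds, and `mu` is the mean inter-arrival time of events in years.

```python
import math


def gumbel_copula(u: float, v: float, theta: float) -> float:
    """Gumbel(-Hougaard) copula C(u, v); theta = 1 gives independence (C = u * v)."""
    if not (0.0 < u < 1.0 and 0.0 < v < 1.0):
        raise ValueError("u and v must lie strictly between 0 and 1")
    if theta < 1.0:
        raise ValueError("the Gumbel copula requires theta >= 1")
    s = (-math.log(u)) ** theta + (-math.log(v)) ** theta
    return math.exp(-(s ** (1.0 / theta)))


def jrp_and(u: float, v: float, theta: float, mu: float = 1.0) -> float:
    """Return period (years) of peak > x AND duration > y."""
    return mu / (1.0 - u - v + gumbel_copula(u, v, theta))


def jrp_or(u: float, v: float, theta: float, mu: float = 1.0) -> float:
    """Return period (years) of peak > x OR duration > y."""
    return mu / (1.0 - gumbel_copula(u, v, theta))


if __name__ == "__main__":
    # Both thresholds set at their own 50-year marginal level (annual maxima, mu = 1).
    u = v = 1.0 - 1.0 / 50.0
    for theta in (1.0, 2.0, 5.0):
        print(theta, round(jrp_or(u, v, theta), 1), round(jrp_and(u, v, theta), 1))
```

    The bracket T_OR ≤ 50 ≤ T_AND always holds here, and stronger peak-duration dependence (larger `theta`) pulls both toward 50 years, which is why a bivariate view can differ markedly from a univariate 50-year estimate.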

    mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration

    Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction-following abilities across various open-ended tasks. However, previous methods primarily focus on enhancing multi-modal capabilities. In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks. mPLUG-Owl2 uses a modularized network design, with the language decoder acting as a universal interface for managing different modalities. Specifically, mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments show that mPLUG-Owl2 generalizes to both text and multi-modal tasks and achieves state-of-the-art performance with a single generic model. Notably, mPLUG-Owl2 is the first MLLM to demonstrate the modality collaboration phenomenon in both pure-text and multi-modal scenarios, setting a pioneering path for the development of future multi-modal foundation models.
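    The split between shared modules and a modality-adaptive module can be pictured with a toy layer. The abstract gives no internals, so this is a hypothetical structure under stated assumptions: each token carries a modality tag, normalisation parameters are kept per modality (preserving modality-specific features), and a single projection is shared by all modalities (the collaborative part). It is not the mPLUG-Owl2 implementation, which operates inside a full transformer decoder.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Token = Tuple[str, Vector]  # (modality tag, hidden state)


def layer_norm(v: Vector, gain: Vector, bias: Vector, eps: float = 1e-5) -> Vector:
    """Standard layer normalisation over one hidden vector."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) * g + b for x, g, b in zip(v, gain, bias)]


def matvec(w: List[Vector], v: Vector) -> Vector:
    """Dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in w]


class ModalityAdaptiveLayer:
    """Toy layer (hypothetical): modality-specific normalisation followed by
    one projection shared across all modalities."""

    def __init__(self, norms: Dict[str, Tuple[Vector, Vector]], shared_w: List[Vector]):
        self.norms = norms        # modality -> (gain, bias), kept separate per modality
        self.shared_w = shared_w  # one weight matrix used by every modality

    def __call__(self, tokens: List[Token]) -> List[Token]:
        out = []
        for modality, h in tokens:
            gain, bias = self.norms[modality]
            out.append((modality, matvec(self.shared_w, layer_norm(h, gain, bias))))
        return out


if __name__ == "__main__":
    layer = ModalityAdaptiveLayer(
        norms={"text": ([1.0, 1.0], [0.0, 0.0]), "image": ([2.0, 2.0], [0.0, 0.0])},
        shared_w=[[1.0, 0.0], [0.0, 1.0]],
    )
    # Identical hidden states diverge only through the modality-specific parameters.
    print(layer([("text", [1.0, 3.0]), ("image", [1.0, 3.0])]))
```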

    Electric-field-induced selective catalysis of single-molecule reaction

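    With the rapid development of single-molecule electrical detection techniques, research in molecular electronics is no longer limited to building molecular electronic devices and measuring their electrical properties; it has expanded to exploring chemical reactions at the single-molecule scale. However, related studies have so far remained largely theoretical, and monitoring and controlling the activity and selectivity of chemical reactions in real time at the single-molecule scale is a long-standing goal and challenge in chemistry. To address this challenge, Professor Hong Wenjing's group, in collaboration with Professor Cheng Jun's group, developed in-house precision scientific instrumentation that connects a single organic molecule, in a defined orientation, between two electrodes with atomic-scale tips, thereby solving the problem of controlling molecular orientation in chemical reactions. Theoretical calculations confirmed that an oriented electric field can effectively stabilize the transition state of the reaction and thus lower the reaction energy barrier. The work was carried out under the joint supervision of Professor Hong Wenjing and Professor Cheng Jun of the College of Chemistry and Chemical Engineering and Associate Researcher Liu Junyang of the Collaborative Innovation Center of Chemistry for Energy Materials (iChEM). Master's student Huang Xiaoyan, iChEM PhD student Tang Chun, PhD student Li Jieqiong, and Dr. Chen Lichuan of Lanzhou University are co-first authors. Associate Professor Shi Jia, Senior Engineer Chen Zhaobin, Professor Xia Haiping, and Professor Tian Zhongqun of the College of Chemistry and Chemical Engineering, Associate Professor Yang Yang of the Pen-Tung Sah Institute of Micro-Nano Science and Technology, Professor Bai Mindong of the College of the Environment and Ecology, and Professor Zhang Haoli of Lanzhou University took part in discussions and provided guidance. Postdoctoral researcher Le Jiabo and PhD students Zheng Jueting, Zhang Pei (graduated), Li Ruihao, and Li Xiaohui also participated in the work. Oriented external electric fields (OEEFs) offer a unique opportunity to tune catalytic selectivity by aligning the electric field along the axis of the activated bond for a specific chemical reaction; however, realizing this experimentally remains a key challenge. Here, we experimentally and theoretically investigated OEEF-induced selective catalysis in a two-step cascade reaction: a Diels-Alder addition followed by aromatization. As characterized by the mechanically controllable break junction (MCBJ) technique in the nanogap and confirmed in bulk by nuclear magnetic resonance (NMR), OEEFs selectively accelerate the aromatization reaction by one order of magnitude owing to the alignment of the electric field with the reaction axis. Meanwhile, the Diels-Alder reaction remained unaffected, since its reaction axis is orthogonal to the electric field. This orientation-selective catalytic effect of OEEFs shows that chemical reactions can be selectively manipulated through careful alignment between the electric field and the reaction axis. This work was supported by the National Key R&D Program of China (2017YFA0204902), the National Natural Science Foundation of China (21722305, 21703188, 21673195, 21621091, 51733004, 51525303, and 91745103), the China Postdoctoral Science Foundation (2017M622060), and the Young Thousand Talents Project of China, with additional support from the State Key Laboratory of Physical Chemistry of Solid Surfaces, the National Engineering Laboratory for Green Chemical Productions of Alcohols, Ethers and Esters, and iChEM.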
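    Linking the reported tenfold acceleration to the statement that the field lowers the reaction barrier takes one step of arithmetic. As a rough worked example under stated assumptions (room temperature, 298.15 K, and an Arrhenius/Eyring-type rate law whose pre-exponential factor the field leaves unchanged, neither of which the abstract specifies), the rate ratio fixes the barrier reduction through k_field / k_0 = exp(ΔΔG‡ / RT).

```python
import math

R = 8.314462618  # molar gas constant, J/(mol*K)


def barrier_lowering(rate_ratio: float, temperature: float = 298.15) -> float:
    """Drop in activation free energy (J/mol) implied by a rate enhancement,
    assuming k = A * exp(-dG / RT) with an unchanged prefactor A."""
    return R * temperature * math.log(rate_ratio)


def rate_enhancement(ddg: float, temperature: float = 298.15) -> float:
    """Inverse relation: rate ratio produced by lowering the barrier by ddg (J/mol)."""
    return math.exp(ddg / (R * temperature))


if __name__ == "__main__":
    ddg = barrier_lowering(10.0)  # "one order of magnitude"
    print(f"{ddg / 1000:.2f} kJ/mol = {ddg / 4184:.2f} kcal/mol")
```

    Under these assumptions, an order-of-magnitude speed-up corresponds to a barrier reduction of only about 5.7 kJ/mol (roughly 1.4 kcal/mol), which shows how modest a transition-state stabilization is enough to produce the observed selectivity.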

    Deep Clustering by Semantic Contrastive Learning

    Whilst contrastive learning has recently brought notable benefits to deep clustering of unlabelled images by learning sample-specific discriminative visual features, its potential for explicitly inferring class decision boundaries is less well understood. This is because its instance discrimination strategy is not class-sensitive; therefore, the clusters derived from the resulting sample-specific feature space are not optimised to correspond to meaningful class decision boundaries. In this work, we solve this problem by introducing Semantic Contrastive Learning (SCL). SCL imposes explicit distance-based cluster structures on unlabelled training data by formulating a semantic (cluster-aware) contrastive learning objective. Moreover, we introduce a clustering consistency condition that must be satisfied jointly by instance visual similarities and cluster decision boundaries, and we optimise both concurrently so that hypotheses about the (unknown, unlabelled) semantic ground-truth classes are inferred on the fly from their consensus. This semantic contrastive learning approach to discovering unknown class decision boundaries has considerable advantages for unsupervised learning of object recognition tasks. Extensive experiments show that SCL outperforms state-of-the-art contrastive learning and deep clustering methods on six object recognition benchmarks, especially on the more challenging finer-grained and larger datasets.
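    The difference between instance-level and cluster-aware contrast, and the idea of a consensus check, can be sketched briefly. This is a swapped-in technique, not SCL itself: the loss below is a supervised-contrastive-style objective that treats cluster ids as pseudo-labels, and the "consistency" check simply asks whether a sample's nearest neighbour falls in the same cluster. The function names and the temperature `tau` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def consistency_mask(embs: List[Vector], labels: List[int]) -> List[bool]:
    """True where a sample's nearest neighbour (by cosine) is in the same cluster,
    i.e. instance similarity and cluster assignment agree."""
    mask = []
    for i, e in enumerate(embs):
        nn = max((k for k in range(len(embs)) if k != i), key=lambda k: cosine(e, embs[k]))
        mask.append(labels[nn] == labels[i])
    return mask


def cluster_contrastive_loss(embs: List[Vector], labels: List[int], tau: float = 0.1) -> float:
    """Contrastive loss with cluster ids as pseudo-labels: same-cluster samples
    are positives, all other samples are negatives."""
    n, total, anchors = len(embs), 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # singleton cluster: no positive pair to pull together
        logits = {j: cosine(embs[i], embs[j]) / tau for j in range(n) if j != i}
        m = max(logits.values())  # stable log-sum-exp
        log_z = m + math.log(sum(math.exp(z - m) for z in logits.values()))
        total += sum(log_z - logits[j] for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


if __name__ == "__main__":
    embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    for labels in ([0, 0, 1, 1], [0, 1, 0, 1]):
        print(labels, consistency_mask(embs, labels), cluster_contrastive_loss(embs, labels))
```

    A cluster assignment that agrees with the feature geometry passes the consistency check and gives a low loss, while a mismatched one fails both, which is the intuition behind reasoning from the consensus of the two signals.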

    High speed national secret SM4 optical fiber communication system scheme based on FPGA

    With the increasing use of optical fiber communication technology in the Industrial Internet of Things, cryptographic algorithms play a crucial role in securing data transmission in embedded device environments. The SM4 block cipher algorithm, developed independently in China, is widely applied to wireless LAN and Internet of Things data encryption. However, software-based encryption and decryption are relatively slow, which hampers their use in scenarios requiring high real-time performance, especially on embedded devices. To address this issue, a high-performance, secure optical fiber communication system was designed based on an FPGA platform and the SM4 algorithm. The FPGA was used to implement SM4 encryption and decryption and the MAC layer interface for data transmission. In addition, an optimized hardware implementation architecture for the SM4 algorithm was proposed. The critical path was shortened by pipelining, thereby raising the system clock frequency. Parallel processing of the S-box transformation was used to speed up data substitution. To reduce data reading delays, a dual-cache processing method was adopted, which simplifies the handling of cached data and significantly reduces the packet loss rate. Together, these measures greatly increased system data throughput. Experimental results show that, compared with similar designs, the throughput of the SM4 encryption and decryption module in this scheme reaches up to 25.6 Gbit/s with little difference in resource consumption. Because of the limit imposed by the 10-gigabit SFP+ optical module, the throughput of the entire optical fiber communication system reaches 9.4 Gbit/s. For 128-bit data, the average encryption speed is 0.47 μs/bit and the average decryption speed is 0.28 μs/bit, which makes the scheme applicable to a variety of secure communication scenarios in the Internet of Things.
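    Why pipelining raises throughput, and why the full system is capped by the optical module, follows from simple arithmetic. The sketch below uses only SM4's fixed parameters (128-bit block, 32 rounds). The clock frequency is not given in the abstract: the 200 MHz figure is merely what 25.6 Gbit/s would imply if the pipelined core completed one block per clock cycle, and the iterative one-round-per-cycle comparison is a hypothetical baseline.

```python
SM4_BLOCK_BITS = 128  # SM4 block size in bits
SM4_ROUNDS = 32       # SM4 round count


def throughput_gbps(f_clk_mhz: float, cycles_per_block: float,
                    block_bits: int = SM4_BLOCK_BITS) -> float:
    """Steady-state throughput (Gbit/s) when one block completes every
    `cycles_per_block` clock cycles."""
    return block_bits * f_clk_mhz / cycles_per_block / 1000.0


def clock_needed_mhz(target_gbps: float, cycles_per_block: float,
                     block_bits: int = SM4_BLOCK_BITS) -> float:
    """Clock frequency (MHz) implied by a throughput target."""
    return target_gbps * 1000.0 * cycles_per_block / block_bits


def end_to_end_gbps(cipher_gbps: float, link_gbps: float) -> float:
    """The slower stage bounds the whole link."""
    return min(cipher_gbps, link_gbps)


if __name__ == "__main__":
    f = clock_needed_mhz(25.6, cycles_per_block=1)  # assumed: fully pipelined, 1 block/cycle
    print("implied clock:", f, "MHz")
    print("iterative core, same clock:", throughput_gbps(f, SM4_ROUNDS), "Gbit/s")
    print("system:", end_to_end_gbps(25.6, 9.4), "Gbit/s")
```

    Under the same assumed clock, an iterative core that spends 32 cycles per block would reach only 0.8 Gbit/s, which illustrates how much of the reported gain pipelining could account for. Either way, the cipher core is not the bottleneck once the 10-gigabit link is in place.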

    Thermal Performance Optimization of Integrated Microchannel Cooling Plate for IGBT Power Module

    In highly integrated electronic components, the insulated-gate bipolar transistor (IGBT) power module operates at a high temperature, which calls for sound thermal analysis and cooling design to improve the module's reliability. This paper investigates heat dissipation by an integrated microchannel cooling plate in a silicon carbide IGBT power module and reports the impact of a BL series micropump on the efficiency of the cooling plate. The IGBT power module was first simplified, through an equivalence method, into an equivalent-mass block with a mass of 62.64 g, a volume of 15.27 cm³, a density of 4.10 g/cm³, and a specific heat capacity of 512.53 J/(kg·K). Then, the thermal performance of a microchannel cooling plate with a main channel and a secondary channel was analyzed, and the design of experiments (DOE) method was used to set up orthogonal simulation experiments with three factors at three levels. The three factors were microchannel width, number of secondary inlets, and inlet diameter. The results show that the microchannel cooling plate significantly reduces the temperature of the IGBT chips and that the chip junction temperature gradually decreases as the microchannel width, number of secondary inlets, and inlet diameter increase. The optimal cooling plate structure has a microchannel width of 0.58 mm, 13 secondary inlets, and an inlet diameter of 3.8 mm, and it lowers the chip junction temperature from 677 °C to 77.7 °C. In addition, the BL series micropump was connected to the inlet of the cooling plate, and the thermal performance of the microchannel cooling plate with the micropump was analyzed. The micropump increases the frictional resistance to fluid flow, raising the chip junction temperature to 110 °C. This work demonstrates the impact of micropumps on the heat dissipation of cooling plates and provides a foundation for the design of cooling plates for IGBT power modules.
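    A three-factor, three-level orthogonal experiment is typically laid out with the standard Taguchi L9 array and evaluated by range analysis. The abstract does not name the array or the analysis method, so both are assumptions here. Only the optimal levels (0.58 mm, 13 inlets, 3.8 mm) are reported, so levels are coded 0 to 2, and the response used in the example is synthetic, not data from the paper.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence, Tuple

# First three columns of the standard Taguchi L9(3^4) array, levels coded 0, 1, 2.
# Columns: microchannel width, number of secondary inlets, inlet diameter.
L9: List[Tuple[int, int, int]] = [
    (0, 0, 0), (0, 1, 1), (0, 2, 2),
    (1, 0, 1), (1, 1, 2), (1, 2, 0),
    (2, 0, 2), (2, 1, 0), (2, 2, 1),
]


def is_orthogonal(design: Sequence[Sequence[int]], levels: int = 3) -> bool:
    """Every pair of columns must contain each (level, level) combination equally often."""
    for a, b in combinations(range(len(design[0])), 2):
        counts = Counter((row[a], row[b]) for row in design)
        if len(counts) != levels * levels or len(set(counts.values())) != 1:
            return False
    return True


def range_analysis(design: Sequence[Sequence[int]], responses: Sequence[float],
                   levels: int = 3) -> List[Tuple[int, float]]:
    """Per factor: the level with the lowest mean response (smaller-is-better,
    e.g. junction temperature) and the range R = max(mean) - min(mean),
    which ranks how strongly each factor matters."""
    results = []
    for f in range(len(design[0])):
        means = []
        for level in range(levels):
            ys = [y for row, y in zip(design, responses) if row[f] == level]
            means.append(sum(ys) / len(ys))
        best = min(range(levels), key=lambda k: means[k])
        results.append((best, max(means) - min(means)))
    return results


if __name__ == "__main__":
    # Synthetic junction temperatures (deg C) for the nine runs, NOT from the paper.
    temps = [100 - 10 * w - 5 * n - 2 * d for w, n, d in L9]
    print(is_orthogonal(L9), range_analysis(L9, temps))
```

    Because each level of one factor meets every level of the others exactly once, nine simulations are enough to estimate each factor's main effect, instead of the 27 needed for a full factorial design.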

    Feature-Distribution Perturbation and Calibration for Generalized Person ReID

    Person re-identification (ReID) has advanced remarkably over the last 10 years along with the rapid development of deep learning for visual recognition. However, the i.i.d. (independent and identically distributed) assumption commonly held by most deep learning models does not fully hold for ReID, whose objective is to identify images of the same pedestrian across cameras at different locations; these often have variable and independent domain characteristics and are also subject to view-biased data distributions. In this work, we propose a Feature-Distribution Perturbation and Calibration (PECA) method to derive generic feature representations for person ReID that are not only discriminative across cameras but also domain-agnostic and deployable to arbitrary unseen target domains. Specifically, we perform per-domain feature-distribution perturbation to keep the model from overfitting to the domain-biased distribution of each source (seen) domain, by enforcing feature invariance to the distribution shifts caused by the perturbation. Furthermore, we design a global calibration mechanism that aligns feature distributions across all source domains, improving the model's generalization capacity by eliminating domain bias. The local perturbation and global calibration are conducted simultaneously and share the same principle: avoiding overfitting through regularization on the perturbed and the original distributions, respectively. Extensive experiments on eight person ReID datasets show that the proposed PECA model outperforms state-of-the-art competitors by significant margins.
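    The contrast between a local perturbation of each domain's feature distribution and a global calibration across domains can be sketched using per-channel feature statistics. This is a generic illustration under stated assumptions, not PECA: distributions are summarised by channel mean and standard deviation, perturbation adds random noise to those statistics (the hypothetical `strength` parameter), and calibration re-maps every domain onto the pooled statistics. The invariance loss between original and perturbed features is omitted.

```python
import math
import random
from typing import List, Tuple

Features = List[List[float]]  # rows = samples, columns = channels


def channel_stats(x: Features, eps: float = 1e-6) -> Tuple[List[float], List[float]]:
    """Per-channel mean and standard deviation (eps keeps the std positive)."""
    cols = list(zip(*x))
    mu = [sum(c) / len(c) for c in cols]
    sd = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c) + eps) for c, m in zip(cols, mu)]
    return mu, sd


def restyle(x: Features, mu, sd, new_mu, new_sd) -> Features:
    """Standardise each channel with (mu, sd), then rescale it to (new_mu, new_sd)."""
    return [[(v - m) / s * ns + nm for v, m, s, nm, ns in zip(row, mu, sd, new_mu, new_sd)]
            for row in x]


def perturb(x: Features, strength: float = 0.1, rng=random) -> Features:
    """Local step (sketch): randomly jitter one source domain's channel statistics."""
    mu, sd = channel_stats(x)
    new_mu = [m + rng.gauss(0.0, strength) for m in mu]
    new_sd = [s * math.exp(rng.gauss(0.0, strength)) for s in sd]  # multiplicative, stays > 0
    return restyle(x, mu, sd, new_mu, new_sd)


def calibrate(domains: List[Features]) -> List[Features]:
    """Global step (sketch): map every source domain onto the pooled statistics."""
    g_mu, g_sd = channel_stats([row for d in domains for row in d])
    calibrated = []
    for d in domains:
        mu, sd = channel_stats(d)
        calibrated.append(restyle(d, mu, sd, g_mu, g_sd))
    return calibrated


if __name__ == "__main__":
    d1 = [[0.0, 10.0], [2.0, 14.0]]  # toy "camera domain" 1
    d2 = [[10.0, 0.0], [14.0, 2.0]]  # toy "camera domain" 2
    print(perturb(d1, strength=0.5, rng=random.Random(0)))
    print(calibrate([d1, d2]))
```

    In a training loop, the model would be asked to produce similar identity features for the original and perturbed statistics, while calibration removes the per-domain offsets that the model could otherwise overfit to.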