28 research outputs found

    Exploiting Prompt Caption for Video Grounding

    Video grounding aims to locate a moment of interest matching the given query sentence from an untrimmed video. Previous works ignore the sparsity dilemma in video annotations, which fail to provide the context information between potential events and query sentences in the dataset. In this paper, we contend that exploiting easily available captions which describe general actions, i.e., prompt captions (PC) as defined in our paper, will significantly boost the performance. To this end, we propose a Prompt Caption Network (PCNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions and then obtain prompt captions by Non-Prompt Caption Suppression (NPCS). To capture the potential information in prompt captions, we propose Caption Guided Attention (CGA) to project the semantic relations between prompt captions and query sentences into temporal space and fuse them into visual representations. Considering the gap between prompt captions and ground truth, we propose Asymmetric Cross-modal Contrastive Learning (ACCL) for constructing more negative pairs to maximize cross-modal mutual information. Without bells and whistles, extensive experiments on three public datasets (i.e., ActivityNet Captions, TACoS and ActivityNet-CG) demonstrate that our method significantly outperforms state-of-the-art methods.

    G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory

    Recent video grounding works attempt to introduce vanilla contrastive learning into video grounding. However, we claim that this naive solution is suboptimal. Contrastive learning requires two key properties: (1) alignment of features of similar samples, and (2) uniformity of the induced distribution of the normalized features on the hypersphere. Due to two annoying issues in video grounding, (1) the co-existence of some visual entities in both ground truth and other moments, i.e., semantic overlapping, and (2) only a few moments in the video being annotated, i.e., the sparse annotation dilemma, vanilla contrastive learning is unable to model the correlations between temporally distant moments and learns inconsistent video representations. Both characteristics make vanilla contrastive learning unsuitable for video grounding. In this paper, we introduce Geodesic and Game Localization (G2L), a semantically aligned and uniform video grounding framework via geodesic and game theory. We quantify the correlations among moments leveraging the geodesic distance that guides the model to learn the correct cross-modal representations. Furthermore, from the novel perspective of game theory, we propose semantic Shapley interaction based on geodesic distance sampling to learn fine-grained semantic alignment in similar moments. Experiments on three benchmarks demonstrate the effectiveness of our method. Comment: ICCV202
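
    The two properties named in this abstract, alignment and uniformity on the hypersphere, can be measured directly on a set of normalized features. The sketch below is a minimal pure-Python illustration of one widely used formulation (mean positive-pair distance for alignment, log mean Gaussian potential over all pairs for uniformity). It is an assumption for illustration only and is not taken from the G2L paper; the function names and the parameters `alpha` and `t` are hypothetical choices.

```python
import math
from itertools import combinations

def normalize(v):
    """Scale a non-zero vector to unit length (a point on the hypersphere)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def alignment(xs, ys, alpha=2.0):
    """Mean of ||x_i - y_i||^alpha over positive pairs; lower means better aligned."""
    return sum(sq_dist(x, y) ** (alpha / 2) for x, y in zip(xs, ys)) / len(xs)

def uniformity(xs, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs; lower means more uniform."""
    pairs = list(combinations(xs, 2))
    mean_potential = sum(math.exp(-t * sq_dist(u, v)) for u, v in pairs) / len(pairs)
    return math.log(mean_potential)

# Four features spread evenly on the unit circle vs. four collapsed onto one point.
spread = [normalize(v) for v in [(1, 0), (0, 1), (-1, 0), (0, -1)]]
collapsed = [normalize((1, 0))] * 4

print(alignment(spread, spread))   # identical positive pairs give 0.0
print(uniformity(collapsed))       # fully collapsed features give 0.0
print(uniformity(spread) < 0)      # spread-out features score lower (better)
```

    A collapsed representation scores perfectly on alignment but worst on uniformity, which is why both quantities are usually reported together.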

    ML-LMCL: Mutual Learning and Large-Margin Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding

    Spoken language understanding (SLU) is a fundamental task in task-oriented dialogue systems. However, the inevitable errors from automatic speech recognition (ASR) usually impair the understanding performance and lead to error propagation. Although there are some attempts to address this problem through contrastive learning, they (1) treat clean manual transcripts and ASR transcripts equally without discrimination in fine-tuning; (2) neglect the fact that the semantically similar pairs are still pushed away when applying contrastive learning; (3) suffer from the problem of Kullback-Leibler (KL) vanishing. In this paper, we propose Mutual Learning and Large-Margin Contrastive Learning (ML-LMCL), a novel framework for improving ASR robustness in SLU. Specifically, in fine-tuning, we apply mutual learning and train two SLU models on the manual transcripts and the ASR transcripts, respectively, aiming to iteratively share knowledge between these two models. We also introduce a distance polarization regularizer to avoid pushing away the intra-cluster pairs as much as possible. Moreover, we use a cyclical annealing schedule to mitigate the KL vanishing issue. Experiments on three datasets show that ML-LMCL outperforms existing models and achieves new state-of-the-art performance.
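
    The cyclical annealing schedule mentioned in this abstract can be sketched as a weight on the KL term that repeatedly ramps from 0 to 1 and then holds at 1 until the next cycle begins. The version below is a minimal sketch assuming a linear ramp; the function name `cyclical_beta` and the parameters `n_cycles` and `ramp_ratio` are hypothetical, and the abstract does not state which variant or settings ML-LMCL uses.

```python
def cyclical_beta(step, total_steps, n_cycles=4, ramp_ratio=0.5):
    """Return the KL weight in [0, 1] for a given training step.

    Training is split into n_cycles equal cycles. Within each cycle the
    weight rises linearly from 0 to 1 over the first ramp_ratio fraction
    of the cycle, then stays at 1 for the remainder.
    """
    cycle_len = total_steps / n_cycles
    position = (step % cycle_len) / cycle_len  # position within the cycle, in [0, 1)
    return min(1.0, position / ramp_ratio)

# Example: 1000 steps, 4 cycles of 250 steps, ramping over the first half of each.
for step in (0, 50, 125, 200, 250):
    print(step, cyclical_beta(step, total_steps=1000))
```

    Resetting the weight to 0 at the start of each cycle gives the model repeated phases in which the KL term is not penalized, which is the usual motivation for using such a schedule against KL vanishing.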

    Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation

    Automatic radiology report generation has attracted enormous research interest due to its practical value in reducing the workload of radiologists. However, simultaneously establishing global correspondences between the image (e.g., Chest X-ray) and its related report and local alignments between image patches and keywords remains challenging. To this end, we propose a Unify, Align and then Refine (UAR) approach to learn multi-level cross-modal alignments and introduce three novel modules: Latent Space Unifier (LSU), Cross-modal Representation Aligner (CRA) and Text-to-Image Refiner (TIR). Specifically, LSU unifies multimodal data into discrete tokens, making it flexible to learn common knowledge among modalities with a shared network. The modality-agnostic CRA first learns discriminative features via a set of orthonormal bases and a dual-gate mechanism and then globally aligns visual and textual representations under a triplet contrastive loss. TIR boosts token-level local alignment by calibrating text-to-image attention with a learnable mask. Additionally, we design a two-stage training procedure to make UAR gradually grasp cross-modal alignments at different levels, which imitates radiologists' workflow: writing sentence by sentence first and then checking word by word. Extensive experiments and analyses on IU-Xray and MIMIC-CXR benchmark datasets demonstrate the superiority of our UAR against varied state-of-the-art methods. Comment: 8 pages, 6 figures, 4 table

    Processing of nanostructured polymers and advanced polymeric based nanocomposites

    Transition-Layer Implantation for Improving Magnetoelectric Response in Co-fired Laminated Composite

    Magnetoelectric (ME) laminated composites with strong ME coupling are becoming increasingly prevalent in the electronic device field. In this paper, an enhancement of the ME coupling effect via transition-layer implantation for the co-fired lead-free laminated composite (80Bi0.5Na0.5TiO3-20Bi0.5K0.5TiO3)/(Ni0.8Zn0.2)Fe2O4 (BNKT/NZFO) was demonstrated. A transition layer composed of the particulate ME composite 0.5BNKT-0.5NZFO was introduced between the BNKT piezoelectric layer and the NZFO magnetostrictive layer, effectively connecting the two-phase interface and strengthening interface stress transfer. In particular, an optimal ME voltage coefficient (αME) of 144 mV/(cm·Oe) at 1 kHz and 1.05 V/(cm·Oe) at the resonant frequency was achieved in the composite, with a layer thickness ratio (BNKT:0.5BNKT-0.5NZFO:NZFO) of 3:1:6. The static elastic model was used to determine strong interface coupling. A large magnetodielectric (MD) response of 3.95% was found under a magnetic field excitation of 4 kOe. These results demonstrate that transition-layer implantation provides a new path to enhance the ME response in co-fired laminated composites, which can play an important role in developing magnetic field-tuned electronic devices.

    Self‐biased magnetoelectric composite for energy harvesting

    The wireless sensor network energy supply technology for the Internet of things has progressed substantially, but attempts to provide sustainable and environmentally friendly energy for sensor networks remain limited and considerably cumbersome for practical application. Energy harvesting devices based on the magnetoelectric (ME) coupling effect have promising prospects in the field of self‐powered devices due to their advantages of small size, fast response, and low power consumption. Driven by application requirements, the development of composites with a self‐biased magnetoelectric (SME) coupling effect provides effective strategies for the miniaturized and high‐precision design of energy harvesting devices. This review summarizes the working mechanism, research status, characteristics, and structures of SME composites, with emphasis on the application and development of SME devices for vibration and magnetic energy harvesting. The main challenges and future development directions for the design and implementation of energy harvesting devices based on the SME effect are presented.