196 research outputs found

    Exploiting Prompt Caption for Video Grounding

    Video grounding aims to locate a moment of interest matching a given query sentence in an untrimmed video. Previous works ignore the "sparsity dilemma" in video annotations: sparse annotations fail to provide contextual information between potential events and query sentences in the dataset. In this paper, we contend that exploiting easily available captions that describe general actions, i.e., prompt captions (PC) as defined in our paper, significantly boosts performance. To this end, we propose a Prompt Caption Network (PCNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions and then obtain prompt captions via Non-Prompt Caption Suppression (NPCS). To capture the potential information in prompt captions, we propose Caption Guided Attention (CGA) to project the semantic relations between prompt captions and query sentences into temporal space and fuse them into visual representations. Considering the gap between prompt captions and ground truth, we propose Asymmetric Cross-modal Contrastive Learning (ACCL), which constructs more negative pairs to maximize cross-modal mutual information. Without bells and whistles, extensive experiments on three public datasets (i.e., ActivityNet Captions, TACoS and ActivityNet-CG) demonstrate that our method significantly outperforms state-of-the-art methods.
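    The abstract does not spell out how NPCS works, but the name suggests a non-maximum-suppression-style filter over the dense captions: keep high-scoring captions and drop temporally overlapping, lower-scoring ones. Below is a minimal sketch under that assumption; the Caption fields, the scoring scheme and the IoU threshold are all illustrative, not the authors' implementation.

    # Hypothetical sketch of Non-Prompt Caption Suppression (NPCS), assuming
    # it behaves like NMS over dense captions: retain high-scoring captions
    # and suppress temporally overlapping, lower-scoring ones.
    from dataclasses import dataclass

    @dataclass
    class Caption:
        text: str
        start: float   # segment start time (seconds)
        end: float     # segment end time (seconds)
        score: float   # relevance/confidence from the dense captioner

    def temporal_iou(a: Caption, b: Caption) -> float:
        """Intersection-over-union of two temporal segments."""
        inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
        union = max(a.end, b.end) - min(a.start, b.start)
        return inter / union if union > 0 else 0.0

    def npcs(captions: list[Caption], iou_threshold: float = 0.5) -> list[Caption]:
        """Greedily keep the best-scoring captions as prompt captions."""
        remaining = sorted(captions, key=lambda c: c.score, reverse=True)
        kept: list[Caption] = []
        for cand in remaining:
            # keep a candidate only if it does not overlap a kept caption
            if all(temporal_iou(cand, k) < iou_threshold for k in kept):
                kept.append(cand)
        return kept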

    G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory

    Recent works attempt to introduce vanilla contrastive learning into video grounding. However, we claim that this naive solution is suboptimal. Contrastive learning requires two key properties: (1) alignment of the features of similar samples, and (2) uniformity of the induced distribution of the normalized features on the hypersphere. Video grounding raises two awkward issues: (1) some visual entities co-exist in both the ground truth and other moments, i.e., semantic overlap; and (2) only a few moments in a video are annotated, i.e., the sparse annotation dilemma. As a result, vanilla contrastive learning cannot model the correlations between temporally distant moments and learns inconsistent video representations, making it unsuitable for video grounding. In this paper, we introduce Geodesic and Game Localization (G2L), a semantically aligned and uniform video grounding framework based on geodesic distance and game theory. We quantify the correlations among moments using the geodesic distance, which guides the model to learn correct cross-modal representations. Furthermore, from the novel perspective of game theory, we propose semantic Shapley interaction based on geodesic-distance sampling to learn fine-grained semantic alignment for similar moments. Experiments on three benchmarks demonstrate the effectiveness of our method.
    Comment: ICCV 2023
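    For reference, the two properties invoked above have standard formalizations (Wang and Isola, 2020); the sketch below states them for an L2-normalized encoder $f$ and positive-pair distribution $p_{\mathrm{pos}}$, together with the hypersphere geodesic distance that plausibly underlies the paper's moment correlations. The exact objectives in G2L may differ.

    \[
    \mathcal{L}_{\mathrm{align}} = \mathbb{E}_{(x,y)\sim p_{\mathrm{pos}}}\,\lVert f(x) - f(y)\rVert_2^2,
    \qquad
    \mathcal{L}_{\mathrm{uniform}} = \log \mathbb{E}_{x,y\,\overset{\mathrm{iid}}{\sim}\,p_{\mathrm{data}}}\, e^{-2\lVert f(x) - f(y)\rVert_2^2},
    \]
    \[
    d_{\mathrm{geo}}(u, v) = \arccos\bigl(\langle u, v\rangle\bigr), \qquad \lVert u\rVert = \lVert v\rVert = 1.
    \]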

    Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation

    Automatic radiology report generation has attracted enormous research interest due to its practical value in reducing the workload of radiologists. However, simultaneously establishing global correspondences between an image (e.g., a chest X-ray) and its report, and local alignments between image patches and keywords, remains challenging. To this end, we propose an Unify, Align and then Refine (UAR) approach to learn multi-level cross-modal alignments, with three novel modules: Latent Space Unifier (LSU), Cross-modal Representation Aligner (CRA) and Text-to-Image Refiner (TIR). Specifically, LSU unifies multimodal data into discrete tokens, making it flexible to learn common knowledge among modalities with a shared network. The modality-agnostic CRA first learns discriminative features via a set of orthonormal bases and a dual-gate mechanism, and then globally aligns visual and textual representations under a triplet contrastive loss. TIR boosts token-level local alignment by calibrating text-to-image attention with a learnable mask. Additionally, we design a two-stage training procedure so that UAR gradually grasps cross-modal alignments at different levels, imitating radiologists' workflow: writing sentence by sentence first and then checking word by word. Extensive experiments and analyses on the IU-Xray and MIMIC-CXR benchmark datasets demonstrate the superiority of our UAR over varied state-of-the-art methods.
    Comment: 8 pages, 6 figures, 4 tables
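    The triplet contrastive loss mentioned for CRA's global alignment has a standard form. The sketch below shows that form over pooled image/report embeddings; the margin value and the hardest-negative mining strategy are assumptions for illustration, not UAR's exact recipe.

    # Standard triplet margin loss for global image-report alignment, shown
    # only to illustrate the kind of objective CRA reportedly optimizes; the
    # margin and negative-mining strategy here are assumptions.
    import torch
    import torch.nn.functional as F

    def triplet_contrastive_loss(img: torch.Tensor,
                                 txt: torch.Tensor,
                                 margin: float = 0.2) -> torch.Tensor:
        """img, txt: (B, D) L2-normalized global embeddings; row i of img
        is paired with row i of txt."""
        sim = img @ txt.t()                      # (B, B) cosine similarities
        pos = sim.diag()                         # similarities of matched pairs
        # mask out the diagonal, then take the hardest non-matching report
        # for each image (rows) and the hardest image for each report (cols)
        mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        neg_i2t = sim.masked_fill(mask, float('-inf')).max(dim=1).values
        neg_t2i = sim.masked_fill(mask, float('-inf')).max(dim=0).values
        # hinge: push matched pairs above the hardest negatives by `margin`
        loss = (F.relu(margin - pos + neg_i2t) + F.relu(margin - pos + neg_t2i)).mean()
        return loss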

    ML-LMCL: Mutual Learning and Large-Margin Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding

    Spoken language understanding (SLU) is a fundamental task in task-oriented dialogue systems. However, inevitable errors from automatic speech recognition (ASR) usually impair understanding performance and lead to error propagation. Although some attempts address this problem through contrastive learning, they (1) treat clean manual transcripts and ASR transcripts equally, without discrimination, during fine-tuning; (2) neglect the fact that semantically similar pairs are still pushed apart when applying contrastive learning; and (3) suffer from Kullback-Leibler (KL) vanishing. In this paper, we propose Mutual Learning and Large-Margin Contrastive Learning (ML-LMCL), a novel framework for improving ASR robustness in SLU. Specifically, during fine-tuning we apply mutual learning, training two SLU models on the manual transcripts and the ASR transcripts respectively so that the two models iteratively share knowledge. We also introduce a distance polarization regularizer to avoid pushing intra-cluster pairs apart as much as possible. Moreover, we use a cyclical annealing schedule to mitigate the KL vanishing issue. Experiments on three datasets show that ML-LMCL outperforms existing models and achieves new state-of-the-art performance.
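    In its common form (Fu et al., 2019), the cyclical annealing schedule used against KL vanishing is a sawtooth that ramps the KL weight from 0 to 1 within each cycle and then resets. A minimal sketch, with the cycle count and ramp ratio as assumed hyperparameters rather than ML-LMCL's settings:

    # Minimal sketch of a cyclical annealing schedule for the KL-term weight,
    # in the common form introduced by Fu et al. (2019); n_cycles and
    # ramp_ratio are assumed hyperparameters.
    def cyclical_kl_weight(step: int, total_steps: int,
                           n_cycles: int = 4, ramp_ratio: float = 0.5) -> float:
        """Weight ramps linearly 0 -> 1 during the first `ramp_ratio` of each
        cycle, stays at 1 for the remainder, then resets at the next cycle."""
        cycle_len = total_steps / n_cycles
        pos_in_cycle = (step % cycle_len) / cycle_len   # position in [0, 1)
        return min(1.0, pos_in_cycle / ramp_ratio)

    # e.g. with 10,000 steps and 4 cycles, the weight resets to 0 at steps
    # 0, 2500, 5000 and 7500, reaching 1.0 halfway through each cycle.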

    Edge intelligence-enabled cyber-physical systems

    With the advent of the Internet-of-Everything era, people's demand for intelligent Internet of Things (IoT) devices is steadily increasing. More intelligent cyber-physical systems (CPS) are needed to meet users' diverse business requirements, such as ultra-reliable low-latency communication, high quality of service (QoS) and high quality of experience (QoE). Edge intelligence (EI) is recognized by academia and industry as one of the key emerging technologies for CPS: it provides the ability to analyze data at the edge rather than sending it to the cloud for analysis, and it will be a key enabler of a world of a trillion hyperconnected smart sensing devices.

    As a distributed intelligent computing paradigm in which computation is largely or completely performed at distributed nodes, EI draws on the rapid development of artificial intelligence (AI) and edge computing resources to support real-time insight and analysis for CPS applications. It brings memory, computing power and processing ability closer to where they are needed, reducing the volume of data that must be moved, the consequent traffic, and the distance the data must travel. As an emerging intelligent computing paradigm, EI can accelerate content delivery and improve the QoS of applications, and it is attracting increasing research attention from academia and industry because of its advantages in throughput, delay, network scalability and intelligence in CPS.

    The guest editors would like to thank all the authors and reviewers for their hard work and contributions in organizing this special issue. They would also like to express their heartfelt gratitude to the Editor-in-Chief, Prof. David W. Walker, for this great opportunity, and to the Editorial Staff for their support throughout the process.

    “What should be computed” for supporting post-pandemic recovery policymaking? A life-oriented perspective

    The COVID-19 pandemic has had various impacts on people’s lives, while changes in people’s lives have shown mixed effects on mitigating the spread of the SARS-CoV-2 virus. Understanding how to capture such two-way interactions is crucial, not only for controlling the pandemic but also for supporting post-pandemic urban recovery policies. As suggested by the life-oriented approach, these interactions span a variety of life domains, which together form a complex behavior system. Through a review of the literature, this paper first points out inconsistent evidence about behavioral factors affecting the spread of COVID-19, and then argues that existing studies on the impacts of COVID-19 on people’s lives have ignored behavioral co-changes across multiple life domains. Selected uncertain trends in people’s lives relevant to post-pandemic recovery are then described. Finally, the paper concludes with a summary of “what should be computed” in Computational Urban Science: how to catch up with delays in the SDGs caused by the COVID-19 pandemic, how to address digital divides and the dilemmas of e-society, how to capture behavioral co-changes during the post-pandemic recovery process, and how to better manage post-pandemic recovery policymaking processes.
