Exploiting Prompt Caption for Video Grounding
Video grounding aims to locate a moment of interest matching a given query
sentence in an untrimmed video. Previous works ignore the sparsity dilemma in
video annotations: the dataset provides no contextual information between
potential events and query sentences. In this paper, we contend that exploiting
easily available captions describing general actions, i.e., prompt captions (PC)
as defined in our paper, significantly boosts performance. To this end, we
propose a Prompt Caption Network (PCNet) for video grounding. Specifically, we
first apply dense video captioning to generate dense captions and then obtain
prompt captions via Non-Prompt Caption Suppression (NPCS). To capture the
potential information in prompt captions, we propose Caption Guided Attention
(CGA), which projects the semantic relations between prompt captions and query
sentences into the temporal space and fuses them into visual representations.
Considering the gap between prompt captions and ground truth, we propose
Asymmetric Cross-modal Contrastive Learning (ACCL), which constructs more
negative pairs to maximize cross-modal mutual information. Without bells and
whistles, extensive experiments on three public datasets (i.e., ActivityNet
Captions, TACoS and ActivityNet-CG) demonstrate that our method significantly
outperforms state-of-the-art methods.
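For intuition, an asymmetric cross-modal contrastive objective of this kind might look roughly like the InfoNCE-style sketch below, where the video-to-text direction gains extra negatives drawn from prompt captions while the text-to-video direction contrasts only within the batch. The function name accl_loss, the tensor names, and the temperature are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of an asymmetric cross-modal contrastive loss (assumed form).
import torch
import torch.nn.functional as F

def accl_loss(video_emb, query_emb, prompt_emb, tau=0.07):
    """video_emb, query_emb: (B, D) matched pairs; prompt_emb: (M, D) prompt captions."""
    v = F.normalize(video_emb, dim=-1)
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(prompt_emb, dim=-1)
    targets = torch.arange(v.size(0), device=v.device)

    # Video -> text: positives on the diagonal, negatives = other queries
    # plus the prompt captions (the extra, asymmetric negatives).
    logits_v2t = torch.cat([v @ q.t(), v @ p.t()], dim=1) / tau  # (B, B+M)
    loss_v2t = F.cross_entropy(logits_v2t, targets)

    # Text -> video: plain in-batch InfoNCE, no prompt-caption negatives.
    logits_t2v = (q @ v.t()) / tau                               # (B, B)
    loss_t2v = F.cross_entropy(logits_t2v, targets)
    return 0.5 * (loss_v2t + loss_t2v)
```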
G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory
Recent video grounding works attempt to introduce vanilla contrastive learning
into video grounding. However, we claim that this naive solution is suboptimal.
Contrastive learning requires two key properties: (1) alignment of the features
of similar samples, and (2) uniformity of the induced distribution of the
normalized features on the hypersphere. Video grounding suffers from two
troublesome issues: (1) some visual entities co-exist in both the ground truth
and other moments, i.e., semantic overlapping; and (2) only a few moments in
the video are annotated, i.e., the sparse annotation dilemma. Consequently,
vanilla contrastive learning cannot model the correlations between temporally
distant moments and learns inconsistent video representations. Both
characteristics make vanilla contrastive learning unsuitable for video
grounding. In this paper, we introduce Geodesic and Game Localization (G2L), a
semantically aligned and uniform video grounding framework built on geodesic
distance and game theory. We quantify the correlations among moments using the
geodesic distance, which guides the model to learn correct cross-modal
representations. Furthermore, from the novel perspective of game theory, we
propose semantic Shapley interaction based on geodesic-distance sampling to
learn fine-grained semantic alignment between similar moments. Experiments on
three benchmarks demonstrate the effectiveness of our method.
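The geodesic idea can be sketched concretely: instead of comparing moments by raw feature distance, relate them through shortest paths on a nearest-neighbour graph of moment features, so temporally distant but semantically linked moments stay close. The function name, the choice of k, and the Euclidean edge weights below are illustrative assumptions rather than the paper's exact construction.

```python
# Minimal sketch: geodesic (graph shortest-path) distances between moments.
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def geodesic_distances(moment_feats: np.ndarray, k: int = 5) -> np.ndarray:
    """moment_feats: (N, D) moment embeddings -> (N, N) geodesic distance matrix."""
    # Sparse k-NN graph whose edge weights are Euclidean distances.
    knn = kneighbors_graph(moment_feats, n_neighbors=k, mode="distance")
    # Geodesic distance = length of the shortest path through the graph (Dijkstra).
    return shortest_path(knn, method="D", directed=False)
```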
Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation
Automatic radiology report generation has attracted enormous research interest
due to its practical value in reducing the workload of radiologists. However,
simultaneously establishing global correspondences between an image (e.g., a
chest X-ray) and its report, and local alignments between image patches and
keywords, remains challenging. To this end, we propose a Unify, Align and then
Refine (UAR) approach to learn multi-level cross-modal alignments, introducing
three novel modules: a Latent Space Unifier (LSU), a Cross-modal Representation
Aligner (CRA) and a Text-to-Image Refiner (TIR). Specifically, the LSU unifies
multimodal data into discrete tokens, making it flexible to learn common
knowledge among modalities with a shared network. The modality-agnostic CRA
first learns discriminative features via a set of orthonormal bases and a
dual-gate mechanism, and then globally aligns visual and textual
representations under a triplet contrastive loss. The TIR boosts token-level
local alignment by calibrating text-to-image attention with a learnable mask.
Additionally, we design a two-stage training procedure that lets UAR gradually
grasp cross-modal alignments at different levels, imitating radiologists'
workflow: writing sentence by sentence first and then checking word by word.
Extensive experiments and analyses on the IU-Xray and MIMIC-CXR benchmark
datasets demonstrate the superiority of our UAR over varied state-of-the-art
methods.
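A triplet contrastive loss for global image-report alignment of the kind CRA describes might look like the sketch below: each image embedding is pulled toward its paired report and pushed away from the hardest in-batch negative, in both directions. The margin value and the hard-negative mining strategy are assumptions for illustration, not the paper's exact loss.

```python
# Minimal sketch: bidirectional triplet loss with in-batch hard negatives.
import torch
import torch.nn.functional as F

def triplet_alignment_loss(img_emb, txt_emb, margin=0.2):
    """img_emb, txt_emb: (B, D); row i of each is a matched image-report pair."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim = img @ txt.t()                          # (B, B) cosine similarities
    pos = sim.diag()                             # similarity of matched pairs
    # Mask positives, then take the hardest negative per image and per report.
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg_i2t = sim.masked_fill(mask, float("-inf")).max(dim=1).values
    neg_t2i = sim.masked_fill(mask, float("-inf")).max(dim=0).values
    loss = F.relu(margin + neg_i2t - pos) + F.relu(margin + neg_t2i - pos)
    return loss.mean()
```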
ML-LMCL: Mutual Learning and Large-Margin Contrastive Learning for Improving ASR Robustness in Spoken Language Understanding
Spoken language understanding (SLU) is a fundamental task in task-oriented
dialogue systems. However, inevitable errors from automatic speech recognition
(ASR) usually impair understanding performance and lead to error propagation.
Although there have been attempts to address this problem through contrastive
learning, they (1) treat clean manual transcripts and ASR transcripts equally,
without discrimination, in fine-tuning; (2) neglect the fact that semantically
similar pairs are still pushed apart when applying contrastive learning; and
(3) suffer from Kullback-Leibler (KL) vanishing. In this paper, we propose
Mutual Learning and Large-Margin Contrastive Learning (ML-LMCL), a novel
framework for improving ASR robustness in SLU. Specifically, in fine-tuning, we
apply mutual learning and train two SLU models, one on the manual transcripts
and one on the ASR transcripts, so that the two models iteratively share
knowledge. We also introduce a distance polarization regularizer to avoid
pushing intra-cluster pairs apart. Moreover, we use a cyclical annealing
schedule to mitigate the KL vanishing issue. Experiments on three datasets show
that ML-LMCL outperforms existing models and achieves new state-of-the-art
performance.
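A cyclical annealing schedule of the kind used to mitigate KL vanishing typically ramps the KL weight from 0 to 1 over the first part of each cycle and holds it at 1 for the rest, repeating over training. The sketch below uses the common defaults (four cycles, a 0.5 ramp ratio); these values and the function name are assumptions, not taken from the paper.

```python
# Minimal sketch: cyclical annealing of the KL weight beta.
def cyclical_beta(step: int, total_steps: int,
                  n_cycles: int = 4, ramp: float = 0.5) -> float:
    """Return the KL weight in [0, 1] for the given training step."""
    cycle_len = total_steps / n_cycles
    pos = (step % cycle_len) / cycle_len   # position within the current cycle
    return min(pos / ramp, 1.0)            # linear ramp, then plateau at 1

# Usage: loss = task_loss + cyclical_beta(step, total_steps) * kl_loss
```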
Edge intelligence-enabled cyber-physical systems
With the advent of the Internet-of-everything era, people's demand for intelligent Internet of Things (IoT) devices is steadily increasing. A more intelligent cyber-physical system (CPS) is needed to meet users' diverse business requirements, such as ultra-reliable low-latency communication, high quality of service (QoS), and quality of experience (QoE). Edge intelligence (EI) is recognized by academia and industry as one of the key emerging technologies for the CPS: it provides the ability to analyze data at the edge rather than sending it to the cloud for analysis, and it will be a key enabler of a world of a trillion hyperconnected smart sensing devices.

As a distributed intelligent computing paradigm in which computation is largely or completely performed at distributed nodes, EI brings together artificial intelligence (AI) and edge computing resources to support real-time insight and analysis for applications in the CPS. It moves memory, computing power and processing ability closer to where they are needed, reducing the volume of data that must be moved, the consequent traffic, and the distance the data must travel. As an emerging intelligent computing paradigm, EI can accelerate content delivery and improve the QoS of applications, and it is attracting growing research attention from academia and industry because of its advantages in throughput, delay, network scalability and intelligence in the CPS.

The guest editors would like to thank all the authors and reviewers for their hard work and contributions in helping to organize this special issue. They would also like to express their heartfelt gratitude to the Editor-in-Chief, Prof. David W. Walker, for giving us this great opportunity, and to the members of the editorial staff for their support during the process.
“What should be computed” for supporting post-pandemic recovery policymaking?: A life-oriented perspective
The COVID-19 pandemic has caused various impacts on people’s lives, while changes in people’s lives have shown mixed effects on mitigating the spread of the SARS-CoV-2 virus. Understanding how to capture such two-way interactions is crucial, not only to control the pandemic but also to support post-pandemic urban recovery policies. As suggested by the life-oriented approach, these interactions exist across a variety of life domains, which form a complex behavior system. Through a review of the literature, this paper first points out inconsistent evidence about behavioral factors affecting the spread of COVID-19, and then argues that existing studies on the impacts of COVID-19 on people’s lives have ignored behavioral co-changes in multiple life domains. Furthermore, selected uncertain trends in people’s lives during the post-pandemic recovery are described. Finally, this paper concludes with a summary of “what should be computed?” in Computational Urban Science with respect to how to catch up with delays in the SDGs caused by the COVID-19 pandemic, how to address digital divides and dilemmas of the e-society, how to capture behavioral co-changes during the post-pandemic recovery process, and how to better manage post-pandemic recovery policymaking processes.