D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation
Temporal sentence grounding (TSG) aims to locate a specific moment in an
untrimmed video given a natural language query. Weakly supervised methods
still show a large performance gap compared to fully supervised ones, while
the latter require laborious timestamp annotations. In this study, we aim to
reduce the annotation cost while keeping performance on the TSG task
competitive with fully supervised methods. To achieve this goal, we
investigate the recently proposed glance-supervised temporal sentence
grounding task, which requires only a single-frame annotation (referred to as
a glance annotation) for each query. Under this setup, we propose a Dynamic
Gaussian prior based Grounding framework with Glance annotation (D3G), which
consists of a Semantic Alignment Group Contrastive Learning module (SA-GCL) and
a Dynamic Gaussian prior Adjustment module (DGA). Specifically, SA-GCL samples
reliable positive moments from a 2D temporal map by jointly leveraging a
Gaussian prior and semantic consistency, which helps align positive
sentence-moment pairs in the joint embedding space. Moreover, to
alleviate the annotation bias resulting from glance annotation and model
complex queries consisting of multiple events, we propose the DGA module, which
dynamically adjusts the prior distribution to approximate the ground-truth
target moments. Extensive experiments on three challenging benchmarks verify the
effectiveness of the proposed D3G. It outperforms the state-of-the-art weakly
supervised methods by a large margin and narrows the performance gap compared
to fully supervised methods. Code is available at
https://github.com/solicucu/D3G.
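The SA-GCL module's use of a Gaussian prior over a 2D temporal map is easiest to see in a toy sketch. Below is a minimal, illustrative implementation (ours, not the authors' code; `gaussian_prior_map` and all parameter values are hypothetical) that scores each candidate moment (start, end) by how close its center lies to the glance-annotated frame:

```python
import numpy as np

def gaussian_prior_map(num_clips, glance_idx, sigma=0.2):
    """Toy sketch (not the authors' code): score each candidate moment
    (start, end) on a 2D temporal map by how close its center lies to
    the glance-annotated clip, under a Gaussian prior."""
    centers = np.arange(num_clips) / num_clips  # normalized clip centers
    glance = glance_idx / num_clips
    prior = np.zeros((num_clips, num_clips))    # prior[s, e] for moments s <= e
    for s in range(num_clips):
        for e in range(s, num_clips):
            mid = (centers[s] + centers[e]) / 2.0
            prior[s, e] = np.exp(-((mid - glance) ** 2) / (2 * sigma ** 2))
    return prior

# Moments whose centers fall near the glance get weight close to 1; D3G
# combines such a prior with semantic consistency to select positives.
prior = gaussian_prior_map(num_clips=16, glance_idx=5)
print(prior.max(), prior[5, 5])
```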
Unified and Dynamic Graph for Temporal Character Grouping in Long Videos
Video temporal character grouping locates the appearing moments of major
characters within a video according to their identities. To this end, recent
works have evolved from unsupervised clustering to graph-based supervised
clustering. However, graph-based methods are built on fixed affinity graphs,
which introduce many inexact connections. Besides, they extract multi-modal
features with several separate models, which is unfriendly to deployment. In this
paper, we present a unified and dynamic graph (UniDG) framework for temporal
character grouping. This is accomplished, first, by a unified representation
network that learns representations of multiple modalities within the same
space while still preserving each modality's uniqueness. Second, we present
dynamic graph clustering, in which a varying number of neighbors is
dynamically constructed for each node via a cyclic matching strategy,
leading to a more reliable affinity graph. Third, a progressive
association method is introduced to exploit spatial and temporal contexts among
different modalities, allowing multi-modal clustering results to be well fused.
As current datasets only provide pre-extracted features, we evaluate our UniDG
method on a collected dataset named MTCG, which contains each character's
appearing clips of face and body as well as speaking voice tracks. We also
evaluate our key components on existing clustering and retrieval datasets to
verify their generalization ability. Experimental results show that our
method achieves promising performance and outperforms several
state-of-the-art approaches.
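One plausible reading of the cyclic matching strategy is mutual (cycle-closing) nearest-neighbor matching. The sketch below is our assumption, not the paper's code (`dynamic_neighbors` is a hypothetical name): it keeps an edge only when two nodes appear in each other's k-nearest-neighbor lists, so each node ends up with a varying number of neighbors rather than a fixed-k affinity graph:

```python
import numpy as np

def dynamic_neighbors(feats, k=5):
    """Hedged sketch of dynamic neighbor construction: keep edge (i, j)
    only when i and j appear in each other's k-nearest-neighbor lists
    (a 2-cycle of matches), so each node gets a varying, more reliable
    number of neighbors."""
    sim = feats @ feats.T
    np.fill_diagonal(sim, -np.inf)              # exclude self-matches
    knn = np.argsort(-sim, axis=1)[:, :k]       # top-k neighbors per node
    knn_sets = [set(row) for row in knn]
    return [[j for j in knn[i] if i in knn_sets[j]] for i in range(len(feats))]

feats = np.random.randn(10, 32)
feats /= np.linalg.norm(feats, axis=1, keepdims=True)  # cosine similarity
print([len(n) for n in dynamic_neighbors(feats)])      # varying neighbor counts
```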
Cost-effectiveness of zinc interventions in China: a cohort-based Markov model
Zinc acts as an important cofactor in the body and is essential for normal functions. Several zinc interventions have been implemented worldwide to improve the public's zinc status, but few studies have assessed their cost-effectiveness. To help inform decision-making on zinc interventions to maximize benefits within a fixed budget, we took China as an example and evaluated the cost-effectiveness of three interventions, namely supplementation, food fortification, and biofortification. As a key group at high risk of zinc deficiency, children aged 5-14 years, who account for 10% of the Chinese population, were selected as the target group in this study. We constructed a decision-analytic Markov model to determine the cost-effectiveness of the interventions in China under different scenarios. In our model, biofortification through conventional breeding was shown to be the most cost-effective approach in most scenarios. Compared with the other interventions, zinc supplementation gained fewer quality-adjusted life years at a higher net cost, suggesting that this common approach may not be optimal for large-scale, long-term implementation at the national level. While the robustness of the results was confirmed by sensitivity analysis, more research is needed to assess the cost-effectiveness of addressing zinc deficiency with other interventions. Further clinical trials are also needed to evaluate the effectiveness of zinc interventions in reducing pneumonia cases.
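To make the cohort-based Markov model concrete, here is a minimal sketch with entirely hypothetical transition probabilities, utilities, and state names (none of these numbers come from the paper): a cohort moves through yearly cycles among three health states, and discounted QALYs are accumulated per strategy:

```python
import numpy as np

# Illustrative cohort Markov model (all numbers hypothetical, not from the
# paper): states = [adequate zinc, zinc deficient, dead], annual cycles.
P_no_intervention = np.array([
    [0.90, 0.08, 0.02],   # adequate -> adequate / deficient / dead
    [0.15, 0.82, 0.03],
    [0.00, 0.00, 1.00],
])
P_fortification = np.array([
    [0.95, 0.04, 0.01],
    [0.30, 0.68, 0.02],
    [0.00, 0.00, 1.00],
])
utility = np.array([1.00, 0.85, 0.00])  # QALY weight per state per year

def run_cohort(P, cohort, years=10, discount=0.03):
    """Propagate the cohort through the transition matrix each cycle and
    sum discounted QALYs."""
    qalys = 0.0
    for t in range(years):
        qalys += (cohort @ utility) / (1 + discount) ** t
        cohort = cohort @ P
    return qalys

start = np.array([0.7, 0.3, 0.0])  # 70% adequate, 30% deficient at baseline
gain = run_cohort(P_fortification, start) - run_cohort(P_no_intervention, start)
print(f"Incremental QALYs per person over 10 years: {gain:.3f}")
# Dividing the incremental cost by this QALY gain yields the ICER used to
# rank interventions for cost-effectiveness.
```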
Open-Vocabulary Multi-Label Classification via Multi-modal Knowledge Transfer
Real-world recognition systems often encounter plenty of unseen labels in
practice. To identify such unseen labels, multi-label zero-shot learning
(ML-ZSL) focuses on transferring knowledge via a pre-trained textual label
embedding (e.g., GloVe). However, such methods exploit only single-modal
knowledge from a language model, ignoring the rich semantic information
inherent in image-text pairs. In contrast, recently developed open-vocabulary
(OV) methods succeed in exploiting such image-text pair information for
object detection and achieve impressive performance. Inspired by the success
of OV-based methods, we propose a novel open-vocabulary framework, named
multimodal knowledge transfer (MKT), for multi-label classification.
Specifically, our method exploits multi-modal knowledge of image-text pairs
based on a vision-and-language pre-training (VLP) model. To facilitate
transferring the image-text matching ability of the VLP model, knowledge
distillation is used to guarantee the consistency of image and label
embeddings, along with prompt tuning to further update the label embeddings. To
further recognize multiple objects, a simple but effective two-stream module is
developed to capture both local and global features. Extensive experimental
results show that our method significantly outperforms state-of-the-art
methods on public benchmark datasets. Code will be available at
https://github.com/seanhe97/MKT.
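A minimal sketch of the distillation idea, under our own assumptions (the function names and the cosine-distance objective are illustrative, not the authors' exact formulation): the student's image embedding is pulled toward the frozen VLP teacher's embedding, and labels, seen or unseen, are then ranked by cosine similarity against the (prompt-tuned) label embeddings:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_img_emb, teacher_img_emb):
    """Hedged sketch of the distillation step (names are ours): pull the
    student's image embedding toward the frozen VLP teacher's embedding
    so the image-text matching ability transfers."""
    s = F.normalize(student_img_emb, dim=-1)
    t = F.normalize(teacher_img_emb, dim=-1)
    return 1.0 - (s * t).sum(dim=-1).mean()    # cosine-distance objective

def rank_labels(img_emb, label_embs):
    """Score seen and unseen labels alike by cosine similarity between
    the image embedding and the label embeddings."""
    img = F.normalize(img_emb, dim=-1)
    labels = F.normalize(label_embs, dim=-1)
    return img @ labels.T                      # higher = more likely label

img = torch.randn(4, 512)       # batch of student image embeddings
teacher = torch.randn(4, 512)   # frozen VLP teacher embeddings
labels = torch.randn(80, 512)   # label (class-name) embeddings
print(distillation_loss(img, teacher).item(), rank_labels(img, labels).shape)
```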
AIM 2019 Challenge on Image Extreme Super-Resolution: Methods and Results
This paper reviews the AIM 2019 challenge on extreme image super-resolution, the problem of restoring rich details in a low-resolution image. Compared to previous challenges, this one focuses on an extreme upscaling factor, x16, and employs the novel DIVerse 8K resolution (DIV8K) dataset. This report focuses on the proposed solutions and final results. The challenge had two tracks. The goal in Track 1 was to generate a super-resolution result with high fidelity, using the conventional PSNR as the primary metric to evaluate the different methods. Track 2 instead focused on generating visually more pleasing super-resolution results, evaluated using subjective opinions. The two tracks had 71 and 52 registered participants, respectively, and 9 teams competed in the final testing phase. This report establishes the experimental protocol and baselines for the extreme image super-resolution task.
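For reference, Track 1's primary fidelity metric, PSNR, can be computed as follows (a standard definition, not challenge-specific code):

```python
import numpy as np

def psnr(reference, restored, peak=255.0):
    """PSNR: log-scaled inverse of the mean squared error between the
    restored image and the ground truth; higher means closer to GT."""
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(peak ** 2 / mse)

ref = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
out = np.clip(ref + np.random.randint(-5, 6, ref.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(ref, out):.2f} dB")
```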