8 research outputs found

    D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation

    Full text link
    Temporal sentence grounding (TSG) aims to locate a specific moment in an untrimmed video given a natural language query. Weakly supervised methods still show a large performance gap compared to fully supervised ones, while the latter require laborious timestamp annotations. In this study, we aim to reduce the annotation cost while keeping performance on the TSG task competitive with fully supervised methods. To achieve this goal, we investigate the recently proposed glance-supervised temporal sentence grounding task, which requires only a single-frame annotation (referred to as a glance annotation) for each query. Under this setup, we propose a Dynamic Gaussian prior based Grounding framework with Glance annotation (D3G), which consists of a Semantic Alignment Group Contrastive Learning module (SA-GCL) and a Dynamic Gaussian prior Adjustment module (DGA). Specifically, SA-GCL samples reliable positive moments from a 2D temporal map by jointly leveraging the Gaussian prior and semantic consistency, which contributes to aligning positive sentence-moment pairs in the joint embedding space. Moreover, to alleviate the annotation bias arising from glance annotation and to model complex queries consisting of multiple events, we propose the DGA module, which dynamically adjusts the distribution to approximate the ground truth of target moments. Extensive experiments on three challenging benchmarks verify the effectiveness of the proposed D3G: it outperforms state-of-the-art weakly supervised methods by a large margin and narrows the performance gap to fully supervised methods. Code is available at https://github.com/solicucu/D3G.
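    For intuition, here is a minimal Python sketch of the Gaussian-prior sampling idea: candidate moments on a 2D temporal map are scored by a Gaussian centred on the glance timestamp, and the top-scoring moments become candidate positives. The function name, the sigma and top_k parameters, and the centre parametrisation are illustrative assumptions, not the paper's actual implementation (which additionally filters positives by semantic consistency with the query).

```python
# Minimal sketch of Gaussian-prior positive-moment sampling, assuming a 2D
# temporal map whose entry (i, j) is the candidate moment spanning clips i..j.
# Names and hyper-parameters are illustrative, not from the paper's code.
import numpy as np

def sample_positive_moments(glance_t, num_clips, sigma=0.2, top_k=8):
    """Score candidate moments by a Gaussian prior centred on the glance frame.

    glance_t:  glance annotation as a normalised timestamp in [0, 1]
    num_clips: number of clips the video is divided into
    sigma:     width of the Gaussian prior (hyper-parameter)
    top_k:     how many highest-scoring moments to treat as positives
    """
    starts, ends = np.meshgrid(np.arange(num_clips), np.arange(num_clips),
                               indexing="ij")
    valid = ends >= starts                            # upper triangle = valid moments
    centres = (starts + ends + 1) / 2.0 / num_clips   # normalised moment centres
    # Gaussian prior: moments centred near the glance frame get higher weight.
    prior = np.exp(-((centres - glance_t) ** 2) / (2 * sigma ** 2))
    prior[~valid] = -np.inf
    # Pick the top-k moments under the prior as candidate positives.
    flat = np.argsort(prior, axis=None)[::-1][:top_k]
    return [tuple(np.unravel_index(idx, prior.shape)) for idx in flat]

positives = sample_positive_moments(glance_t=0.4, num_clips=16)
print(positives)  # e.g. [(6, 6), (5, 6), (6, 7), ...]
```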

    Unified and Dynamic Graph for Temporal Character Grouping in Long Videos

    Full text link
    Video temporal character grouping locates the appearing moments of major characters within a video according to their identities. To this end, recent works have evolved from unsupervised clustering to graph-based supervised clustering. However, graph methods are built on the premise of a fixed affinity graph, which introduces many inexact connections. Besides, they extract multi-modal features with several separate models, which is unfriendly to deployment. In this paper, we present a unified and dynamic graph (UniDG) framework for temporal character grouping. This is accomplished, first, by a unified representation network that learns representations of multiple modalities within the same space while still preserving each modality's uniqueness. Second, we present a dynamic graph clustering method in which a varying number of neighbors is constructed for each node via a cyclic matching strategy, leading to a more reliable affinity graph. Third, a progressive association method is introduced to exploit spatial and temporal contexts among different modalities, allowing multi-modal clustering results to be fused well. As current datasets only provide pre-extracted features, we evaluate our UniDG method on a collected dataset named MTCG, which contains face and body clips and speaking voice tracks for each character. We also evaluate our key components on existing clustering and retrieval datasets to verify their generalization ability. Experimental results show that our method achieves promising results and outperforms several state-of-the-art approaches.
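    As a rough illustration of the dynamic-neighbor idea, the sketch below builds a variable-size neighbor list per node using mutual matching, a two-step special case of cyclic matching; UniDG's actual strategy may differ, and the function name and choice of k are assumptions.

```python
# Illustrative sketch: a dynamic affinity graph via mutual matching, a
# two-step special case of the cyclic matching idea described above.
import numpy as np

def dynamic_neighbors(features, k=10):
    """Return a variable-size neighbor list per node.

    An edge (i, j) is kept only when the match is reciprocal: j is among i's
    k nearest neighbors AND i is among j's, so each node can end up with a
    different number of neighbors instead of a fixed k.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)              # no self-loops
    knn = np.argsort(-sim, axis=1)[:, :k]       # each node's top-k candidates
    knn_sets = [set(row) for row in knn]
    return [[j for j in knn[i] if i in knn_sets[j]] for i in range(len(f))]

feats = np.random.randn(100, 64)
neighbors = dynamic_neighbors(feats, k=10)
print([len(n) for n in neighbors[:5]])          # per-node neighbor counts vary
```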

    Cost-effectiveness of zinc interventions in China: a cohort-based Markov model

    No full text
    Zinc acts as an important cofactor in the body and is essential for normal function. Several zinc interventions have been implemented worldwide to improve the public's zinc status, but few studies have assessed their cost-effectiveness. To help inform decision-making on zinc interventions that maximize benefits within a fixed budget, we took China as an example and evaluated the cost-effectiveness of three interventions: supplementation, food fortification, and biofortification. As an essential group at high risk of zinc deficiency, children aged 5-14 years, who account for 10% of the Chinese population, were selected as the target group of this study. We constructed a decision-analytic Markov model to determine the cost-effectiveness of the interventions in China under different scenarios. In our model, biofortification through conventional breeding was the most cost-effective approach in most scenarios. Compared with the other interventions, zinc supplementation gained fewer quality-adjusted life years at a higher net cost, suggesting that this common approach may not be optimal for large-scale, long-term implementation at the national level. While the robustness of the results was confirmed by sensitivity analysis, more research is needed to assess the cost-effectiveness of addressing zinc deficiency with other interventions. Further clinical trials are also needed to evaluate the effectiveness of zinc interventions in reducing pneumonia cases.
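    To make the methodology concrete, here is a toy cohort-based Markov model in Python: a cohort is propagated through annual cycles across health states, discounted costs and quality-adjusted life years (QALYs) are accumulated, and an incremental cost-effectiveness ratio (ICER) compares an intervention against baseline. All states, transition probabilities, costs, and utilities below are invented placeholders, not the study's inputs.

```python
# Toy cohort-based Markov model in the spirit of the study's analysis.
# Every number below is an invented placeholder for illustration.
import numpy as np

def run_markov(trans, costs, utilities, cycles=10, discount=0.03):
    """Propagate a cohort through annual cycles; return discounted cost, QALYs."""
    cohort = np.array([1.0, 0.0, 0.0])          # everyone starts zinc-deficient
    total_cost = total_qaly = 0.0
    for t in range(cycles):
        d = 1.0 / (1.0 + discount) ** t         # discount factor for cycle t
        total_cost += d * cohort @ costs
        total_qaly += d * cohort @ utilities
        cohort = cohort @ trans                 # one Markov transition
    return total_cost, total_qaly

# States: [deficient, adequate, dead]
baseline = np.array([[0.80, 0.15, 0.05],
                     [0.10, 0.85, 0.05],
                     [0.00, 0.00, 1.00]])
fortified = np.array([[0.55, 0.41, 0.04],       # intervention shifts more of
                      [0.05, 0.91, 0.04],       # the cohort to "adequate"
                      [0.00, 0.00, 1.00]])
costs_base = np.array([60.0, 40.0, 0.0])        # annual cost per state
costs_fort = costs_base + np.array([30.0, 30.0, 0.0])  # intervention add-on cost
utils = np.array([0.75, 0.95, 0.0])             # utility (QALY weight) per state

c0, q0 = run_markov(baseline, costs_base, utils)
c1, q1 = run_markov(fortified, costs_fort, utils)
print(f"ICER: {(c1 - c0) / (q1 - q0):.1f} per QALY gained")
```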

    Open-Vocabulary Multi-Label Classification via Multi-modal Knowledge Transfer

    Full text link
    Real-world recognition systems often encounter plenty of unseen labels in practice. To identify such unseen labels, multi-label zero-shot learning (ML-ZSL) focuses on transferring knowledge through a pre-trained textual label embedding (e.g., GloVe). However, such methods only exploit single-modal knowledge from a language model, ignoring the rich semantic information inherent in image-text pairs. In contrast, recently developed open-vocabulary (OV) based methods succeed in exploiting this information from image-text pairs in object detection and achieve impressive performance. Inspired by the success of OV-based methods, we propose a novel open-vocabulary framework, named multi-modal knowledge transfer (MKT), for multi-label classification. Specifically, our method exploits the multi-modal knowledge of image-text pairs based on a vision-and-language pre-training (VLP) model. To facilitate transferring the image-text matching ability of the VLP model, knowledge distillation is used to guarantee the consistency of image and label embeddings, along with prompt tuning to further update the label embeddings. To better recognize multiple objects, a simple but effective two-stream module is developed to capture both local and global features. Extensive experimental results show that our method significantly outperforms state-of-the-art methods on public benchmark datasets. Code will be available at https://github.com/seanhe97/MKT.
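    A hedged sketch of the embedding-consistency idea follows: the student's image embedding is distilled toward a frozen VLP image embedding, while labels are scored by temperature-scaled cosine similarity against (prompt-tuned) label embeddings and trained with a multi-label loss. The names, temperature, and loss weighting are illustrative assumptions rather than MKT's actual code.

```python
# Hedged sketch of distillation plus embedding-based multi-label scoring.
# Shapes, temperature, and alpha are illustrative assumptions, not MKT's code.
import torch
import torch.nn.functional as F

def mkt_style_losses(student_img, teacher_img, label_emb, targets, alpha=1.0):
    """student_img: (B, D) student image embeddings
    teacher_img: (B, D) frozen VLP image embeddings (distillation target)
    label_emb:   (L, D) label embeddings from the VLP text encoder
    targets:     (B, L) multi-hot ground-truth labels
    """
    s = F.normalize(student_img, dim=-1)
    t = F.normalize(teacher_img, dim=-1)
    lbl = F.normalize(label_emb, dim=-1)
    # Distillation: keep student image embeddings consistent with the teacher.
    distill = (1.0 - (s * t).sum(dim=-1)).mean()
    # Recognition: cosine scores against every label, trained multi-label.
    logits = s @ lbl.T / 0.07                   # temperature-scaled similarity
    cls = F.binary_cross_entropy_with_logits(logits, targets.float())
    return cls + alpha * distill

B, L, D = 4, 80, 512
loss = mkt_style_losses(torch.randn(B, D, requires_grad=True),
                        torch.randn(B, D), torch.randn(L, D),
                        torch.randint(0, 2, (B, L)))
loss.backward()
```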

    AIM 2019 Challenge on Image Extreme Super-Resolution: Methods and Results

    No full text
    This paper reviews the AIM 2019 challenge on extreme image super-resolution, the problem of restoring rich details in a low-resolution image. Compared to previous challenges, this one focuses on an extreme upscaling factor, x16, and employs the novel DIVerse 8K resolution (DIV8K) dataset. This report focuses on the proposed solutions and final results. The challenge had two tracks. The goal in Track 1 was to generate a super-resolution result with high fidelity, using the conventional PSNR as the primary metric to evaluate the different methods. Track 2 instead focused on generating visually more pleasing super-resolution results, evaluated using subjective opinions. The two tracks had 71 and 52 registered participants, respectively, and 9 teams competed in the final testing phase. This report establishes the experimental protocol and baselines for the extreme image super-resolution task.
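    For reference, here is a minimal sketch of the Track 1 fidelity metric, PSNR, computed on 8-bit images; the helper name and test data are illustrative.

```python
# Minimal sketch of PSNR between a restored image and its ground truth,
# computed here on 8-bit RGB arrays; higher means closer to ground truth.
import numpy as np

def psnr(pred, gt, max_val=255.0):
    """Peak signal-to-noise ratio in dB."""
    mse = np.mean((pred.astype(np.float64) - gt.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                     # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

gt = np.random.randint(0, 256, (128, 128, 3), dtype=np.uint8)
noisy = np.clip(gt + np.random.normal(0, 5, gt.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(noisy, gt):.2f} dB")
```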