
    Negative Frames Matter in Egocentric Visual Query 2D Localization

    The recently released Ego4D dataset and benchmark significantly scale and diversify first-person visual perception data. In Ego4D, the Visual Queries 2D Localization (VQ2D) task aims to retrieve objects that appeared in the past from a first-person recording. The task requires a system to spatially and temporally localize the most recent appearance of a given object query, where the query is registered by a single tight visual crop of the object in a different scene. Our study is based on the three-stage baseline introduced in the Episodic Memory benchmark. The baseline solves the problem by detection and tracking: it detects similar objects in all frames, then runs a tracker from the most confident detection. In the VQ2D challenge, we identified two limitations of this baseline. (1) The training configuration has redundant computation. Although the training set contains millions of instances, most are repetitive and the number of unique objects is only around 14.6k; repeated gradient computation on the same objects leads to inefficient training. (2) The false positive rate is high on background frames, due to the distribution gap between training and evaluation: during training the model sees only clean, stable, labeled frames, whereas egocentric videos also contain noisy, blurry, or unlabeled background frames. To this end, we developed a more efficient and effective solution. Concretely, we bring the training time from ~15 days down to less than 24 hours, and we achieve 0.17% spatial-temporal AP, which is 31% higher than the baseline. Our solution ranked first on the public leaderboard. Our code is publicly available at https://github.com/facebookresearch/vq2d_cvpr.
    Comment: First-place winning solution for the VQ2D task in the CVPR 2022 Ego4D Challenge. Our code is publicly available at https://github.com/facebookresearch/vq2d_cvpr
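    The detect-then-track baseline described in this abstract can be summarized in a few lines of Python. The sketch below only captures the control flow (pick the most recent, most confident per-frame detection, then hand it to a tracker); the `Detection` record, function names, and the 0.5 threshold are illustrative assumptions, not the actual Ego4D/vq2d_cvpr API.

```python
# Minimal sketch of a detect-then-track VQ2D-style baseline.
# All names here are illustrative placeholders, not the real vq2d_cvpr code.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Detection:
    frame_idx: int   # index of the frame in the egocentric video
    box: tuple       # (x1, y1, x2, y2) in pixels
    score: float     # similarity of this box to the visual query crop


def find_track_seed(frame_detections: List[List[Detection]],
                    score_threshold: float = 0.5) -> Optional[Detection]:
    """Return the most confident detection from the most recent frame
    that contains any confident detection; a real system would then run
    a visual tracker forwards/backwards from this seed to recover the
    full spatio-temporal response track."""
    # Scan from the most recent frame backwards, since VQ2D asks for the
    # *most recent* appearance of the query object.
    for dets in reversed(frame_detections):
        confident = [d for d in dets if d.score >= score_threshold]
        if confident:
            return max(confident, key=lambda d: d.score)
    return None  # query object never detected, e.g. only background frames
```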

    Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization

    This paper deals with the problem of localizing objects in image and video datasets from visual exemplars. In particular, we focus on the challenging problem of egocentric visual query localization. We first identify grave implicit biases in current query-conditioned model designs and visual query datasets. We then directly tackle these biases at both the frame and the object-set level. Concretely, our method addresses them by expanding limited annotations and dynamically dropping object proposals during training. Additionally, we propose a novel transformer-based module that allows object-proposal set context to be considered while incorporating query information. We name our module the Conditioned Contextual Transformer, or CocoFormer. Our experiments show that the proposed adaptations improve egocentric query detection, leading to a better visual query localization system in both 2D and 3D configurations. We thus improve frame-level detection performance from 26.28% to 31.26% AP, which in turn improves the VQ2D and VQ3D localization scores by significant margins. Our improved context-aware query object detector ranked first and second in the VQ2D and VQ3D tasks of the 2nd Ego4D challenge. In addition, we showcase the relevance of the proposed model on the Few-Shot Detection (FSD) task, where we also achieve state-of-the-art results. Our code is available at https://github.com/facebookresearch/vq2d_cvpr.
    Comment: We ranked first and second in the VQ2D and VQ3D tasks in the 2nd Ego4D challenge
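    To make the idea of a set-level, query-conditioned module more concrete, below is a minimal PyTorch sketch in the spirit of CocoFormer: proposal features attend to each other through a small transformer encoder after being conditioned on the visual query embedding. The layer sizes and the additive way the query is injected are assumptions for illustration, not the authors' exact architecture.

```python
# Hedged sketch of a query-conditioned proposal-set encoder (CocoFormer-like).
import torch
import torch.nn as nn


class ConditionedProposalSetEncoder(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8, layers: int = 2):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)   # project the visual query embedding
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.set_encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.score_head = nn.Linear(dim, 1)     # per-proposal match score

    def forward(self, proposals: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # proposals: (B, N, dim) features of N object proposals in a frame
        # query:     (B, dim)    feature of the visual query crop
        conditioned = proposals + self.query_proj(query).unsqueeze(1)
        context = self.set_encoder(conditioned)       # proposals attend to each other
        return self.score_head(context).squeeze(-1)   # (B, N) match logits


# Usage: score 100 proposals per frame against a single query-crop embedding.
model = ConditionedProposalSetEncoder()
scores = model(torch.randn(2, 100, 256), torch.randn(2, 256))
```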

    Multi-Modal Few-Shot Temporal Action Detection via Vision-Language Meta-Adaptation

    Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection (TAD) to new classes. The former adapts a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter requires no training examples and instead exploits a semantic description of the new class. In this work, we introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered a marriage of FS-TAD and ZS-TAD in that it leverages few-shot support videos and new class names jointly. To tackle this problem, we further introduce a novel MUlti-modality PromPt mETa-learning (MUPPET) method. This is enabled by efficiently bridging pretrained vision and language models whilst maximally reusing already learned capacity. Concretely, we construct multi-modal prompts by mapping support videos into the textual token space of a vision-language model using a meta-learned, adapter-equipped visual semantics tokenizer. To tackle large intra-class variation, we further design a query feature regulation scheme. Extensive experiments on ActivityNet v1.3 and THUMOS14 demonstrate that MUPPET outperforms state-of-the-art alternatives, often by a large margin. We also show that MUPPET can easily be extended to the few-shot object detection problem, where it again achieves state-of-the-art performance on the MS-COCO dataset. The code will be available at https://github.com/sauradip/MUPPET
    Comment: Technical Report
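    The prompt-construction idea can be illustrated with a short PyTorch sketch: a small adapter maps a pooled support-video feature into a few pseudo text tokens, which are concatenated with class-name token embeddings to form the multi-modal prompt. All module names, dimensions, and the prompt layout here are assumptions for illustration, not the published MUPPET implementation.

```python
# Hedged sketch of mapping support videos into a VLM's textual token space.
import torch
import torch.nn as nn


class VisualSemanticsTokenizer(nn.Module):
    """Maps a pooled support-video feature into a few pseudo text tokens."""
    def __init__(self, video_dim: int = 768, token_dim: int = 512, n_tokens: int = 4):
        super().__init__()
        self.n_tokens = n_tokens
        self.token_dim = token_dim
        # Small adapter; in the paper this component is meta-learned.
        self.adapter = nn.Sequential(
            nn.Linear(video_dim, token_dim * n_tokens),
            nn.GELU(),
        )

    def forward(self, support_video_feat: torch.Tensor) -> torch.Tensor:
        # support_video_feat: (B, video_dim) pooled feature of a support video
        tokens = self.adapter(support_video_feat)
        return tokens.view(-1, self.n_tokens, self.token_dim)  # (B, n_tokens, token_dim)


def build_multimodal_prompt(class_name_tokens: torch.Tensor,
                            visual_tokens: torch.Tensor) -> torch.Tensor:
    # Concatenate class-name text embeddings with the pseudo visual tokens,
    # forming the prompt fed to the language side of the vision-language model.
    return torch.cat([class_name_tokens, visual_tokens], dim=1)


tokenizer = VisualSemanticsTokenizer()
visual_tokens = tokenizer(torch.randn(1, 768))                        # (1, 4, 512)
prompt = build_multimodal_prompt(torch.randn(1, 6, 512), visual_tokens)  # (1, 10, 512)
```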

    Skin infiltrating NK cells in cutaneous T-cell lymphoma are increased in number and display phenotypic alterations partially driven by the tumor

    Cutaneous T-cell lymphomas (CTCL) are characterized by focal infiltration of malignant T-cell clones in solitary skin lesions. Many CTCL patients experience an indolent disease, but some progress to advanced disease with high fatality. We hypothesized that natural killer (NK) cells participate in local control of tumor growth in CTCL skin. Immunohistochemistry and flow cytometry analyses of the density, localization, phenotype and function of NK cells in twenty-nine fresh or formalin-fixed skin biopsies from twenty-four CTCL patients and twenty-three biopsies from twenty healthy controls revealed higher numbers of CD56+CD3- NK cells in CTCL skin. A reduced fraction of CTCL skin NK cells expressed the maturation marker CD57, the cytotoxic protein granzyme B and the activation marker CD69, indicating reduced tumor-killing ability of the NK cells. Retained expression of immune checkpoint or inhibitory proteins, including PD1, TIM3, LAG3, CD73 and NKG2A, and of the activating receptors CD16 and NKp46 indicated maintained effector functions. Indeed, the capacity of NK cells to produce anti-tumor-acting IFNγ upon PMA + ionomycin stimulation was similar in cells from CTCL and healthy skin. Co-cultures of primary human NK cells or the NK cell line NKL with CTCL cells resulted in reduced levels of granzyme B and CD69, indicating that close cellular interactions with CTCL cells induce the impaired functional NK cell phenotype. In conclusion, the increased numbers of NK cells in CTCL skin exhibit a partially impaired phenotype in terms of activity. Enhancing NK cell activity with NK cell-activating cytokines such as IL-15 or with immune checkpoint blockade therefore represents a potential immunotherapeutic approach in CTCL.