6 research outputs found

    MCF-VC: Mitigate Catastrophic Forgetting in Class-Incremental Learning for Multimodal Video Captioning

    To address catastrophic forgetting caused by the invisibility of old categories in sequential input, existing work has made progress on relatively simple classification tasks. Video captioning, in contrast, is a more complex multimodal task that has not yet been explored in incremental learning. After identifying this stability-plasticity problem when analyzing video with sequential input, we propose a method to Mitigate Catastrophic Forgetting in class-incremental learning for multimodal Video Captioning (MCF-VC). To effectively maintain good performance on old tasks at the macro level, we design Fine-grained Sensitivity Selection (FgSS), based on the Mask of Linear's Parameters and Fisher Sensitivity, to pick useful knowledge from old tasks. To further constrain the knowledge of old and new tasks at the feature level, we introduce Two-stage Knowledge Distillation (TsKD), which learns the new task well while weighing it against the old tasks. Specifically, we design two distillation losses that constrain the cross-modal semantic information of the semantic attention feature map and the textual information of the final outputs, respectively, so that the inter-model and intra-model stylized knowledge of the old classes is retained while the new classes are learned. To illustrate the model's ability to resist forgetting, we design a metric, CIDER_t, that measures the stage forgetting rate. Experiments on the public MSR-VTT dataset show that the proposed method significantly resists forgetting of previous tasks without replaying old samples, while performing well on the new task.
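    A minimal sketch of how such a two-stage distillation objective could look in PyTorch, assuming one feature-level term on the semantic attention maps and one output-level term on the caption token logits; the function name, temperature, and loss weights are illustrative assumptions, not the authors' implementation.

    import torch.nn.functional as F

    def tskd_style_loss(feat_new, feat_old, logits_new, logits_old,
                        temperature=2.0, w_feat=1.0, w_text=1.0):
        """feat_*: semantic attention feature maps, shape (B, T, D);
        logits_*: caption token logits from the decoder, shape (B, L, V)."""
        # Stage 1: keep the cross-modal semantic features close to the old model's.
        loss_feat = F.mse_loss(feat_new, feat_old.detach())
        # Stage 2: keep the textual outputs close via softened token distributions.
        p_old = F.softmax(logits_old.detach() / temperature, dim=-1)
        log_p_new = F.log_softmax(logits_new / temperature, dim=-1)
        loss_text = F.kl_div(log_p_new, p_old, reduction="batchmean") * temperature ** 2
        return w_feat * loss_feat + w_text * loss_text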

    Visual and Textual Prior Guided Mask Assemble for Few-Shot Segmentation and Beyond

    Few-shot segmentation (FSS) aims to segment novel classes with only a few annotated images. Because CLIP aligns visual and textual information, integrating CLIP can enhance the generalization ability of FSS models. However, even with CLIP, existing CLIP-based FSS methods still suffer from predictions biased toward base classes, which is caused by class-specific feature-level interactions. To solve this issue, we propose a visual and textual Prior Guided Mask Assemble Network (PGMA-Net). It employs a class-agnostic mask assembly process to alleviate the bias and formulates diverse tasks in a unified manner by assembling the prior through affinity. Specifically, class-relevant textual and visual features are first transformed into a class-agnostic prior in the form of a probability map. Then, a Prior-Guided Mask Assemble Module (PGMAM) consisting of multiple General Assemble Units (GAUs) is introduced; it supports diverse plug-and-play interactions, such as visual-textual, inter- and intra-image, training-free, and high-order ones. Lastly, to ensure class-agnostic ability, a Hierarchical Decoder with Channel-Drop Mechanism (HDCDM) is proposed to flexibly exploit the assembled masks and low-level features without relying on any class-specific information. PGMA-Net achieves new state-of-the-art results in the FSS task, with an mIoU of 77.6 on PASCAL-5^i and 59.4 on COCO-20^i in the 1-shot scenario. Beyond this, we show that, without extra re-training, the proposed PGMA-Net can also solve bbox-level FSS, cross-domain FSS, co-segmentation, and zero-shot segmentation (ZSS), leading to an any-shot segmentation framework.
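    The class-agnostic prior described above can be pictured as an affinity map between query pixel features and a class prototype (a text embedding or a masked-average support feature), normalized into a probability map. The sketch below is an illustrative PyTorch interpretation under assumed shapes, not the released PGMA-Net code.

    import torch.nn.functional as F

    def prior_probability_map(query_feat, prototype, eps=1e-6):
        """query_feat: (B, C, H, W) visual features; prototype: (B, C)."""
        q = F.normalize(query_feat, dim=1)
        p = F.normalize(prototype, dim=1)[:, :, None, None]      # (B, C, 1, 1)
        affinity = (q * p).sum(dim=1, keepdim=True)              # cosine affinity, (B, 1, H, W)
        # Min-max normalize each map to [0, 1] so it reads as a class-agnostic prior.
        a_min = affinity.amin(dim=(2, 3), keepdim=True)
        a_max = affinity.amax(dim=(2, 3), keepdim=True)
        return (affinity - a_min) / (a_max - a_min + eps)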

    Slightly Shift New Classes to Remember Old Classes for Video Class-Incremental Learning

    Recent video class-incremental learning methods tend to excessively pursue accuracy on the newly seen classes and rely on memory sets to mitigate catastrophic forgetting of the old classes. However, limited storage only allows a few representative videos to be kept. We therefore propose SNRO, which slightly shifts the features of new classes to remember old classes. SNRO consists of Examples Sparse (ES) and Early Break (EB). ES samples frames at a lower rate when building memory sets and later uses interpolation to align those sparse frames. In this way, SNRO stores more examples under the same memory consumption and forces the model to focus on low-semantic features, which are harder to forget. EB terminates training at an early epoch, preventing the model from over-stretching into the high-semantic space of the current task. Experiments on the UCF101, HMDB51, and UESTC-MMEA-CL datasets show that SNRO outperforms other approaches under the same memory consumption.
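    One way to read Examples Sparse: keep only a few uniformly spaced frames per memory video, then interpolate back to the training frame count when the example is replayed. The sketch below shows this idea in PyTorch; the frame counts and the interpolation mode are assumptions, not the SNRO implementation.

    import torch
    import torch.nn.functional as F

    def sparsify(video, keep=4):
        """video: (T, C, H, W) -> keep a few uniformly spaced frames for the memory set."""
        idx = torch.linspace(0, video.shape[0] - 1, steps=keep).long()
        return video[idx]

    def densify(sparse_video, target_t=8):
        """Interpolate stored sparse frames back to target_t frames along time."""
        v = sparse_video.permute(1, 0, 2, 3).unsqueeze(0)          # (1, C, T, H, W)
        v = F.interpolate(v, size=(target_t, v.shape[-2], v.shape[-1]),
                          mode="trilinear", align_corners=False)
        return v.squeeze(0).permute(1, 0, 2, 3)                    # (target_t, C, H, W)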

    A2RMNet: Adaptively Aspect Ratio Multi-Scale Network for Object Detection in Remote Sensing Images

    Object detection is a significant and challenging problem in remote sensing and image analysis. However, most existing methods tend to miss or incorrectly localize objects because of the wide variation in object sizes and aspect ratios. In this paper, we propose a novel end-to-end Adaptively Aspect Ratio Multi-Scale Network (A2RMNet) to solve this problem. On the one hand, we design a multi-scale feature gate fusion network, composed of gate fusion modules, refine blocks, and region proposal networks, to adaptively integrate the multi-scale features of objects. On the other hand, an aspect ratio attention network is leveraged to preserve the aspect ratios of objects, which alleviates the excessive shape distortions caused by aspect ratio changes during training. Experiments show that the proposed A2RMNet significantly outperforms previous state-of-the-art methods on the DOTA, NWPU VHR-10, RSOD, and UCAS-AOD datasets by 5.73%, 7.06%, 3.27%, and 2.24%, respectively.
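    As a toy illustration of the gate fusion idea mentioned above, the module below mixes a fine-scale and a coarse-scale feature map with a learned per-pixel gate; the channel sizes and gating form are assumptions rather than the A2RMNet definition.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GateFusion(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.gate = nn.Conv2d(2 * channels, channels, kernel_size=1)

        def forward(self, fine, coarse):
            # Upsample the coarser map to the finer resolution, then learn a
            # per-pixel gate deciding how much of each scale to keep.
            coarse_up = F.interpolate(coarse, size=fine.shape[-2:],
                                      mode="bilinear", align_corners=False)
            g = torch.sigmoid(self.gate(torch.cat([fine, coarse_up], dim=1)))
            return g * fine + (1 - g) * coarse_up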

    A Survey of Vision and Language Related Multi-Modal Task

    With significant breakthroughs in single-modal deep learning tasks, more and more work has begun to focus on multi-modal tasks. Multi-modal tasks usually involve more than one modality, where a modality represents a type of behavior or state. Common multi-modal information includes vision, hearing, language, touch, and smell. Vision and language are two of the most common modalities in daily human life, and many typical multi-modal tasks focus on these two, such as visual captioning and visual grounding. In this paper, we survey typical vision-and-language tasks from the perspectives of generation, analysis, and reasoning. First, we analyze and summarize the typical tasks and representative classical methods, organized by their different algorithmic concerns, and further discuss the frequently used datasets and metrics. Then, we briefly summarize variant and cutting-edge tasks to build a more comprehensive framework of vision-and-language multi-modal tasks. Finally, we discuss the development of pre-training related research and offer an outlook on future research. We hope this survey helps researchers understand the latest progress, open problems, and promising directions of vision-and-language multi-modal tasks, and provides guidance for future research.