MCF-VC: Mitigate Catastrophic Forgetting in Class-Incremental Learning for Multimodal Video Captioning
Existing work on catastrophic forgetting, caused by old categories becoming unavailable under sequential input, has made progress mainly on relatively simple classification tasks. Video captioning, by contrast, is a more complex multimodal task that has not yet been explored in incremental learning. Having identified this stability-plasticity problem when analyzing video under sequential input, we propose a method to Mitigate Catastrophic Forgetting in class-incremental learning for multimodal Video Captioning (MCF-VC). To maintain good performance on old tasks at the macro level, we design Fine-grained Sensitivity Selection (FgSS), which uses a mask over linear-layer parameters together with Fisher sensitivity to select useful knowledge from old tasks. To further constrain the knowledge of old and new tasks at the feature level, we introduce Two-stage Knowledge Distillation (TsKD), which learns the new task well while balancing the old one. Specifically, we design two distillation losses that constrain the cross-modal semantic information of the semantic attention feature map and the textual information of the final outputs, respectively, so that the inter-model and intra-model stylized knowledge of old classes is retained while new classes are learned. To measure the model's resistance to forgetting, we also design a metric, CIDER_t, that tracks the stage-wise forgetting rate. Experiments on the public MSR-VTT dataset show that the proposed method significantly resists forgetting of previous tasks without replaying old samples, and performs well on the new task.
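The FgSS step can be pictured with a short sketch. Below is a minimal, hypothetical PyTorch illustration of Fisher-sensitivity-based selection of linear-layer weights; the function names, keep ratio, and masking rule are assumptions for illustration, not the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

def fisher_sensitivity(model: nn.Module, loader, device="cpu"):
    """Diagonal Fisher approximation: mean squared gradient of the old-task loss."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()
              if p.requires_grad and "weight" in n}
    model.eval()
    for inputs, targets in loader:
        model.zero_grad()
        loss = F.cross_entropy(model(inputs.to(device)), targets.to(device))
        loss.backward()
        for n, p in model.named_parameters():
            if n in fisher and p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / max(len(loader), 1) for n, f in fisher.items()}

def build_masks(fisher, keep_ratio=0.2):
    """Mask = 1 for the top `keep_ratio` most sensitive weights in each layer."""
    masks = {}
    for n, f in fisher.items():
        k = max(1, int(keep_ratio * f.numel()))
        thresh = f.flatten().topk(k).values.min()
        masks[n] = (f >= thresh).float()
    return masks

def restore_selected(model, masks, old_params):
    """After each optimizer step, reset masked (old-task-critical) weights."""
    with torch.no_grad():
        for n, p in model.named_parameters():
            if n in masks:
                p.copy_(masks[n] * old_params[n] + (1 - masks[n]) * p)

In this reading, weights with high Fisher sensitivity on old tasks are held at their old values after each update, while the remaining weights stay plastic for the new task.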
Visual and Textual Prior Guided Mask Assemble for Few-Shot Segmentation and Beyond
Few-shot segmentation (FSS) aims to segment novel classes given only a few annotated images. Because CLIP aligns visual and textual information, integrating CLIP can enhance the generalization ability of FSS models. However, even with CLIP, existing CLIP-based FSS methods still suffer from predictions biased toward base classes, which is caused by class-specific feature-level interactions. To solve this issue, we propose a visual and textual Prior Guided Mask Assemble Network (PGMA-Net). It employs a class-agnostic mask assembly process to alleviate the bias and formulates diverse tasks in a unified manner by assembling the prior through affinity. Specifically, class-relevant textual and visual features are first transformed into a class-agnostic prior in the form of a probability map. Then a Prior-Guided Mask Assemble Module (PGMAM) containing multiple General Assemble Units (GAUs) is introduced; it supports diverse, plug-and-play interactions, such as visual-textual, inter- and intra-image, training-free, and high-order ones. Lastly, to ensure class-agnostic ability, a Hierarchical Decoder with Channel-Drop Mechanism (HDCDM) is proposed to flexibly exploit the assembled masks and low-level features without relying on any class-specific information. PGMA-Net achieves new state-of-the-art mIoU results on the FSS task in the 1-shot scenario. Beyond this, we show that, without extra re-training, the proposed PGMA-Net can also solve bbox-level and cross-domain FSS, co-segmentation, and zero-shot segmentation (ZSS), leading to an any-shot segmentation framework.
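As an illustration only (not the paper's implementation), the two core ideas, turning aligned features into a class-agnostic probability-map prior and assembling it through affinity, can be sketched as follows; the tensor shapes, softmax temperature, and function names are assumptions.

import torch
import torch.nn.functional as F

def prior_map(query_feat, text_emb):
    """query_feat: (D, H, W) visual features; text_emb: (D,) class text embedding.
    Returns an (H, W) class-agnostic probability map via cosine similarity."""
    D, H, W = query_feat.shape
    q = F.normalize(query_feat.reshape(D, H * W), dim=0)   # (D, HW)
    t = F.normalize(text_emb, dim=0)                        # (D,)
    sim = (t @ q).reshape(H, W)                             # cosine in [-1, 1]
    return (sim + 1) / 2                                    # map to [0, 1]

def assemble_by_affinity(query_feat, support_feat, support_prior):
    """Training-free assembly: propagate the support prior onto the query
    through a pixel-wise affinity (softmax over support locations)."""
    D, H, W = query_feat.shape
    q = F.normalize(query_feat.reshape(D, -1), dim=0)       # (D, HWq)
    s = F.normalize(support_feat.reshape(D, -1), dim=0)     # (D, HWs)
    affinity = torch.softmax(q.t() @ s / 0.1, dim=-1)       # (HWq, HWs)
    assembled = affinity @ support_prior.reshape(-1)        # (HWq,)
    return assembled.reshape(H, W)

Because such an assembly step works only on affinities between features and a probability map, it carries no class-specific parameters, which matches the class-agnostic property the abstract emphasizes.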
Slightly Shift New Classes to Remember Old Classes for Video Class-Incremental Learning
Recent video class-incremental learning methods often over-pursue accuracy on newly seen classes and rely on memory sets to mitigate catastrophic forgetting of old classes. However, limited storage allows only a few representative videos to be kept. We therefore propose SNRO, which slightly shifts the features of new classes to remember old classes. Specifically, SNRO consists of Examples Sparse (ES) and Early Break (EB). ES decimates exemplar videos at a lower sample rate when building memory sets and later uses interpolation to align those sparse frames. In this way, SNRO stores more exemplars under the same memory consumption and forces the model to focus on low-semantic features, which are harder to forget. EB terminates training at an early epoch, preventing the model from overstretching into the high-semantic space of the current task. Experiments on the UCF101, HMDB51, and UESTC-MMEA-CL datasets show that SNRO performs better than other approaches while consuming the same amount of memory.
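A rough sketch of the Examples Sparse idea (store frames at a lower rate, interpolate back at training time) is given below; the frame stride, trilinear interpolation, and helper names are illustrative assumptions rather than the authors' code.

import torch
import torch.nn.functional as F

def decimate(frames: torch.Tensor, keep_every: int = 4) -> torch.Tensor:
    """frames: (T, C, H, W). Keep every `keep_every`-th frame for the memory set,
    so the same budget stores roughly `keep_every` times more exemplar videos."""
    return frames[::keep_every].clone()

def align_by_interpolation(sparse: torch.Tensor, target_len: int) -> torch.Tensor:
    """Temporally interpolate sparse frames (T', C, H, W) back to target_len frames."""
    t, c, h, w = sparse.shape
    x = sparse.permute(1, 0, 2, 3).unsqueeze(0)             # (1, C, T', H, W)
    x = F.interpolate(x, size=(target_len, h, w), mode="trilinear",
                      align_corners=False)
    return x.squeeze(0).permute(1, 0, 2, 3)                 # (target_len, C, H, W)

Under this reading, Early Break then amounts to simply capping the number of training epochs on each incremental task.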
A2RMNet: Adaptively Aspect Ratio Multi-Scale Network for Object Detection in Remote Sensing Images
Object detection is a significant and challenging problem in remote sensing and image analysis. However, most existing methods tend to miss or mislocate objects because of the wide variation in object sizes and aspect ratios. In this paper, we propose a novel end-to-end Adaptively Aspect Ratio Multi-Scale Network (A2RMNet) to address this problem. On the one hand, we design a multi-scale feature gate fusion network to adaptively integrate the multi-scale features of objects; this network is composed of gate fusion modules, refine blocks, and region proposal networks. On the other hand, an aspect ratio attention network is leveraged to preserve the aspect ratios of objects, which alleviates the excessive shape distortion caused by aspect ratio changes during training. Experiments show that the proposed A2RMNet significantly outperforms previous state-of-the-art methods on the DOTA, NWPU VHR-10, RSOD, and UCAS-AOD datasets by 5.73%, 7.06%, 3.27%, and 2.24%, respectively.
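To make the gate-fusion idea concrete, here is a minimal, hypothetical PyTorch module that adaptively blends two feature scales with a learned per-pixel gate; the 1x1-convolution gate and channel layout are assumptions, not the paper's exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GateFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # A 1x1 conv predicts a per-pixel gate from the concatenated scales.
        self.gate = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, fine: torch.Tensor, coarse: torch.Tensor) -> torch.Tensor:
        """fine: (N, C, H, W); coarse: (N, C, H/2, W/2) from a deeper stage."""
        coarse_up = F.interpolate(coarse, size=fine.shape[-2:], mode="bilinear",
                                  align_corners=False)
        g = torch.sigmoid(self.gate(torch.cat([fine, coarse_up], dim=1)))
        return g * fine + (1 - g) * coarse_up               # adaptive blend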
A Survey of Vision and Language Related Multi-Modal Task
With significant breakthroughs in single-modal deep learning tasks, more and more work has begun to focus on multi-modal tasks. Multi-modal tasks usually involve more than one modality, where a modality represents a type of behavior or state. Common multi-modal information includes vision, hearing, language, touch, and smell. Vision and language are two of the most common modalities in daily human life, and many typical multi-modal tasks focus on these two, such as visual captioning and visual grounding. In this paper, we conduct an in-depth study of typical vision-and-language tasks from the perspectives of generation, analysis, and reasoning. First, the typical tasks and some classical methods are analyzed and summarized, organized around their different algorithmic concerns, and the frequently used datasets and metrics are further discussed. Then, other variant and cutting-edge tasks are briefly summarized to build a more comprehensive framework of vision-and-language multi-modal tasks. Finally, we discuss the development of pre-training-related research and offer an outlook on future research. We hope this survey helps researchers understand the latest progress, existing problems, and exploration directions of vision-and-language multi-modal tasks, and provides guidance for future research.