    An energy stable and maximum bound principle preserving scheme for the dynamic Ginzburg-Landau equations under the temporal gauge

    This paper proposes a decoupled numerical scheme for the time-dependent Ginzburg-Landau equations under the temporal gauge. For the magnetic potential and the order parameter, the scheme adopts the second-type Nédélec element and the linear element for spatial discretization, respectively, and a linearized backward Euler method and a first-order exponential time differencing method for temporal discretization, respectively. The maximum bound principle (MBP) of the order parameter and the energy dissipation law are proved in the discrete sense. The discrete energy stability and MBP preservation guarantee the stability and validity of the numerical simulations and further facilitate an adaptive time-stepping strategy, which often plays an important role in long-time simulations of vortex dynamics, especially when the applied magnetic field is strong. An optimal error estimate for the proposed scheme is also given. Numerical examples verify the theoretical results and demonstrate the vortex motion of superconductors in an external magnetic field.
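
    For reference, the energy dissipation law and maximum bound principle mentioned above refer to a functional of roughly the following standard nondimensionalized form (an assumption here; the paper's exact gauge and scaling may differ):

        G(\psi, \mathbf{A}) = \int_\Omega \Big( \big| \big( \tfrac{i}{\kappa}\nabla + \mathbf{A} \big)\psi \big|^2 + \tfrac{1}{2}\big( |\psi|^2 - 1 \big)^2 + | \nabla \times \mathbf{A} - \mathbf{H} |^2 \Big) \, \mathrm{d}x,

    where \psi is the order parameter, \mathbf{A} the magnetic potential, \kappa the Ginzburg-Landau parameter, and \mathbf{H} the applied field; the maximum bound principle asserts that |\psi| \le 1 pointwise whenever the initial data satisfy it.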

    Gongsun Longzi's "form": Minimal word meaning

    Inspired by Gongsun Longzi's "form-naming" idea about word meaning, this paper argues that 1) the internal lexicon contains only the list of word-meaning pairs, with no additional information either as part of word meaning or as a structural level above it; 2) the meaning of a word is a minimal C-Form, the identifying conceptual meaning that individuates a concept; 3) C-Form is the interface between word meaning and concept meaning; and 4) a sentence has a minimal semantic content, consisting of the minimal meanings of the words composing it, which is propositional and truth-evaluable, and contextual elements contribute nothing to the meaning of language expressions. This paper adheres to semantic minimalism, while holding that meaning holism helps in semantic inquiry, since reflection on language meaning differs from language meaning itself.

    MGMAE: Motion Guided Masking for Video Masked Autoencoding

    Masked autoencoding has shown excellent performance on self-supervised video representation learning. Temporal redundancy has led to a high masking ratio and a customized masking strategy in VideoMAE. In this paper, we aim to further improve the performance of video masked autoencoding by introducing a motion guided masking strategy. Our key insight is that motion is a general and unique prior in video, which should be taken into account during masked pre-training. Our motion guided masking explicitly incorporates motion information to build a temporally consistent masking volume. Based on this masking volume, we can track the unmasked tokens in time and sample a set of temporally consistent cubes from videos. These temporally aligned unmasked tokens further relieve the information leakage issue in time and encourage MGMAE to learn more useful structural information. We implement MGMAE with an efficient online optical flow estimator and a backward masking map warping strategy. We perform experiments on the Something-Something V2 and Kinetics-400 datasets, demonstrating the superior performance of MGMAE over the original VideoMAE. In addition, we provide a visualization analysis to illustrate that MGMAE can sample temporally consistent cubes in a motion-adaptive manner for more effective video pre-training. Comment: ICCV 2023 camera-ready version.
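
    The backward masking-map warping mentioned above can be sketched roughly as follows. This is a minimal illustration assuming PyTorch-style tensors; warp_mask_with_flow and motion_guided_masks are hypothetical helpers, not the authors' released MGMAE code:

        # Propagate a base masking map through time by backward-warping it with
        # per-step optical flow, so masked regions follow the motion (sketch only).
        import torch
        import torch.nn.functional as F

        def warp_mask_with_flow(mask, flow):
            """mask: (B, 1, H, W) in [0, 1]; flow: (B, 2, H, W) pixel offsets (dx, dy)."""
            B, _, H, W = mask.shape
            ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
            base = torch.stack((xs, ys), dim=-1).float().to(mask.device)      # (H, W, 2)
            coords = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)             # (B, H, W, 2)
            # grid_sample expects sampling coordinates normalized to [-1, 1].
            coords[..., 0] = 2.0 * coords[..., 0] / (W - 1) - 1.0
            coords[..., 1] = 2.0 * coords[..., 1] / (H - 1) - 1.0
            return F.grid_sample(mask, coords, align_corners=True)

        def motion_guided_masks(init_mask, flows):
            """init_mask: (B, 1, H, W); flows: list of per-step flows, each (B, 2, H, W)."""
            masks = [init_mask]
            for flow in flows:
                masks.append(warp_mask_with_flow(masks[-1], flow))
            return torch.stack(masks, dim=1)                                   # (B, T, 1, H, W)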

    Unmasked Teacher: Towards Training-Efficient Video Foundation Models

    Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity. Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain. Although VideoMAE has trained a robust ViT from limited data, its low-level reconstruction poses convergence difficulties and conflicts with high-level cross-modal alignment. This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods. To increase data efficiency, we mask out most of the low-semantics video tokens, but selectively align the unmasked tokens with the IFM, which serves as the UnMasked Teacher (UMT). By providing semantic guidance, our method enables faster convergence and multimodal friendliness. With a progressive pre-training framework, our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding. Using only public sources for pre-training, in 6 days on 32 A100 GPUs our scratch-built ViT-L/16 achieves state-of-the-art performance on various video tasks. The code and models will be released at https://github.com/OpenGVLab/unmasked_teacher. Comment: 16 pages, 5 figures, 28 tables.
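
    A minimal sketch of the unmasked-token alignment described above, assuming PyTorch; student, teacher, and umt_alignment_loss are illustrative placeholders rather than the released UMT implementation:

        # Align only the unmasked video tokens with features from a frozen image
        # foundation model that serves as the teacher (sketch only).
        import torch
        import torch.nn.functional as F

        def umt_alignment_loss(student, teacher, tokens, keep_idx):
            """tokens: (B, N, C) video tokens; keep_idx: (B, K) long indices of unmasked tokens."""
            with torch.no_grad():
                target = teacher(tokens)                                   # (B, N, D), frozen
            gather = lambda x: torch.gather(
                x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
            pred = student(gather(tokens))                                 # (B, K, D)
            # Alignment loss on L2-normalized features of the kept tokens only.
            return F.mse_loss(F.normalize(pred, dim=-1),
                              F.normalize(gather(target), dim=-1))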

    Harvest Video Foundation Models via Efficient Post-Pretraining

    Building video-language foundation models is costly and difficult due to the redundant nature of video data and the lack of high-quality video-language datasets. In this paper, we propose an efficient framework to harvest video foundation models from image ones. Our method is intuitively simple: we randomly drop input video patches and mask out input text during the post-pretraining procedure. Patch dropping boosts training efficiency significantly, and text masking enforces the learning of cross-modal fusion. We conduct extensive experiments to validate the effectiveness of our method on a wide range of video-language downstream tasks, including various zero-shot tasks, video question answering, and video-text retrieval. Despite its simplicity, our method achieves state-of-the-art performance comparable to some heavily pretrained video foundation models. Our method is extremely efficient and can be trained in less than one day on 8 GPUs, requiring only WebVid-10M as pretraining data. We hope our method can serve as a simple yet strong counterpart to prevalent video foundation models, provide useful insights when building them, and make large pretrained models more accessible and sustainable. This is part of the InternVideo project: https://github.com/OpenGVLab/InternVideo.
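
    The two operations named above, patch dropping and text masking, can be sketched as below (PyTorch assumed; the function names, ratios, and tensor shapes are illustrative, not the project's code):

        import torch

        def drop_video_patches(patch_tokens, keep_ratio=0.5):
            """patch_tokens: (B, N, C). Keep a random subset of patch tokens per sample."""
            B, N, _ = patch_tokens.shape
            n_keep = max(1, int(N * keep_ratio))
            idx = torch.rand(B, N, device=patch_tokens.device).argsort(dim=1)[:, :n_keep]
            return torch.gather(
                patch_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, patch_tokens.size(-1)))

        def mask_text_tokens(token_ids, mask_id, mask_prob=0.15):
            """token_ids: (B, L) integer ids. Replace a random subset with the mask id."""
            mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
            masked_ids = torch.where(mask, torch.full_like(token_ids, mask_id), token_ids)
            return masked_ids, mask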

    VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking

    Scale is the primary factor for building a powerful foundation model that can generalize well to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that the video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on a subset of video tokens and a decoder processing another subset of video tokens. Although VideoMAE is already very efficient due to the high masking ratio in the encoder, masking the decoder can still further reduce the overall computational cost. This enables the efficient pre-training of billion-parameter models on video. We also use a progressive training paradigm that involves an initial pre-training on a diverse multi-sourced unlabeled dataset, followed by post-pre-training on a mixed labeled dataset. Finally, we successfully train a video ViT model with a billion parameters, which achieves new state-of-the-art performance on the Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2) datasets. In addition, we extensively verify the pre-trained video ViT models on a variety of downstream tasks, demonstrating their effectiveness as general video representation learners. Comment: CVPR 2023 camera-ready version.
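
    The dual masking idea, an encoder that sees only a small visible subset and a decoder that reconstructs only part of the remaining tokens, can be sketched as follows (PyTorch assumed; the random decoder sampling and the ratios here are illustrative placeholders, not the paper's actual masking design):

        import torch

        def dual_masking_indices(num_tokens, encoder_keep=0.1, decoder_keep=0.5, device="cpu"):
            """Split token indices into an encoder-visible set and a decoder-target subset."""
            perm = torch.randperm(num_tokens, device=device)
            n_enc = int(num_tokens * encoder_keep)
            enc_idx = perm[:n_enc]                          # tokens the encoder processes
            hidden = perm[n_enc:]                           # tokens hidden from the encoder
            n_dec = int(hidden.numel() * decoder_keep)
            dec_idx = hidden[torch.randperm(hidden.numel(), device=device)[:n_dec]]
            return enc_idx, dec_idx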