75 research outputs found
An energy stable and maximum bound principle preserving scheme for the dynamic Ginzburg-Landau equations under the temporal gauge
This paper proposes a decoupled numerical scheme of the time-dependent
Ginzburg-Landau equations under the temporal gauge. For the magnetic potential
and the order parameter, the discrete scheme adopts the second-type Nédélec element and the linear element for spatial discretization,
respectively, and a linearized backward Euler method and a first-order
exponential time differencing method for time discretization, respectively. The
maximum bound principle (MBP) of the order parameter and the energy dissipation
law in the discrete sense are proved. The discrete energy stability and
MBP-preservation can guarantee the stability and validity of the numerical
simulations, and further facilitate the adoption of an adaptive time-stepping
strategy, which often plays an important role in long-time simulations of
vortex dynamics, especially when the applied magnetic field is strong. An
optimal error estimate of the proposed scheme is also given. Numerical examples
verify the theoretical results of the proposed scheme and demonstrate the
vortex motions of superconductors in an external magnetic field.
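For context, a commonly used nondimensionalized form of the TDGL system under the temporal gauge, together with the maximum bound principle and the free energy whose decay the scheme preserves, is sketched below. The scaling, boundary conditions, and domain assumptions vary across papers, so this is an assumed standard form rather than the paper's own formulation.

```latex
% A common nondimensionalized form of the TDGL equations under the temporal gauge
% (this normalization is an assumption; the paper's exact formulation may differ).
\begin{align}
  \frac{\partial \psi}{\partial t}
    + \Big(\frac{\mathrm{i}}{\kappa}\nabla + \mathbf{A}\Big)^{2}\psi
    + \big(|\psi|^{2} - 1\big)\psi &= 0, \\
  \frac{\partial \mathbf{A}}{\partial t}
    + \nabla\times\nabla\times\mathbf{A}
    + \mathrm{Re}\Big[\overline{\psi}\,\Big(\frac{\mathrm{i}}{\kappa}\nabla + \mathbf{A}\Big)\psi\Big]
    &= \nabla\times\mathbf{H},
\end{align}
% $\psi$: order parameter, $\mathbf{A}$: magnetic potential,
% $\kappa$: Ginzburg-Landau parameter, $\mathbf{H}$: applied field.
% The MBP states that $|\psi(\cdot,t)| \le 1$ for all $t > 0$ whenever
% $|\psi(\cdot,0)| \le 1$; the energy dissipation law refers to the
% nonincrease in time of the Ginzburg-Landau free energy
\begin{equation}
  \mathcal{G}(\psi,\mathbf{A}) = \int_{\Omega}
      \Big|\Big(\frac{\mathrm{i}}{\kappa}\nabla + \mathbf{A}\Big)\psi\Big|^{2}
    + \frac{1}{2}\big(|\psi|^{2}-1\big)^{2}
    + \big|\nabla\times\mathbf{A} - \mathbf{H}\big|^{2}\,\mathrm{d}x .
\end{equation}
```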
Gongsun Longzi's “form”: Minimal word meaning
Inspired by Gongsun Longzi's “form-naming” idea about word meaning, this paper argues that 1) the internal lexicon contains only the list of word-meaning pairs, with no additional information either as part of word meaning or as a structural level above it; 2) the meaning of a word is a minimal C-Form, the identifying conceptual meaning that individuates a concept; 3) C-Form is the interface between word meaning and concept meaning; and 4) a sentence has a minimal semantic content, consisting of the minimal meanings of the words composing it, which is propositional and truth-evaluable, and contextual elements contribute nothing to the meaning of language expressions. This paper adheres to semantic minimalism, while holding that meaning holism helps in semantic inquiry, since reflection on language meaning differs from language meaning itself.
MGMAE: Motion Guided Masking for Video Masked Autoencoding
Masked autoencoding has shown excellent performance on self-supervised video
representation learning. Temporal redundancy has led to a high masking ratio
and customized masking strategy in VideoMAE. In this paper, we aim to further
improve the performance of video masked autoencoding by introducing a motion
guided masking strategy. Our key insight is that motion is a general and unique
prior in video, which should be taken into account during masked pre-training.
Our motion guided masking explicitly incorporates motion information to build
a temporally consistent masking volume. Based on this masking volume, we can track
the unmasked tokens in time and sample a set of temporally consistent cubes from
videos. These temporally aligned unmasked tokens further alleviate temporal
information leakage and encourage MGMAE to learn more useful
structural information. We implement MGMAE with an efficient online optical
flow estimator and a backward masking-map warping strategy. We perform
experiments on the Something-Something V2 and Kinetics-400 datasets,
demonstrating the superior performance of our MGMAE over the original VideoMAE.
In addition, we provide a visualization analysis to illustrate that MGMAE
can sample temporally consistent cubes in a motion-adaptive manner for more
effective video pre-training.
Comment: ICCV 2023 camera-ready version
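As a rough illustration of the backward masking-map warping idea, the minimal sketch below propagates a token-level mask along (assumed) backward optical flow so that unmasked positions follow motion. The function name, token-grid size, and random stand-in flows are illustrative assumptions, and the real MGMAE additionally resamples tokens to keep the masking ratio fixed per frame.

```python
# Hedged sketch of motion-guided mask propagation (not the authors' code).
# Assumes a token-grid backward flow per transition t -> t+1: for each token in
# frame t+1, flow gives the (dy, dx) displacement back to its source in frame t.
import numpy as np

def propagate_mask(init_mask, flows):
    """Warp a binary token mask through time so unmasked tokens follow motion.

    init_mask: (H, W) bool array for the anchor frame (True = masked).
    flows:     list of (H, W, 2) backward flows, one per transition.
    Returns a (T, H, W) bool masking volume.
    """
    H, W = init_mask.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    volume = [init_mask]
    for flow in flows:
        prev = volume[-1]
        # Backward warping: each token in the new frame copies the mask value
        # of the (rounded, clipped) source token in the previous frame.
        src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, H - 1)
        src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, W - 1)
        volume.append(prev[src_y, src_x])
    return np.stack(volume)

# Example: an 8x14x14 token grid with ~90% masking at the anchor frame.
rng = np.random.default_rng(0)
anchor = rng.random((14, 14)) > 0.10                       # True = masked
flows = [rng.normal(0, 1, (14, 14, 2)) for _ in range(7)]  # stand-in for real flow
mask_volume = propagate_mask(anchor, flows)
print(mask_volume.shape, mask_volume.mean())               # (8, 14, 14), ~0.9
```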
Unmasked Teacher: Towards Training-Efficient Video Foundation Models
Video Foundation Models (VFMs) have received limited exploration due to high
computational costs and data scarcity. Previous VFMs rely on Image Foundation
Models (IFMs), which face challenges in transferring to the video domain.
Although VideoMAE has trained a robust ViT from limited data, its low-level
reconstruction poses convergence difficulties and conflicts with high-level
cross-modal alignment. This paper proposes a training-efficient method for
temporal-sensitive VFMs that integrates the benefits of existing methods. To
increase data efficiency, we mask out most of the low-semantics video tokens
but selectively align the unmasked tokens with the IFM, which serves as the
UnMasked Teacher (UMT). By providing semantic guidance, our method enables
faster convergence and multimodal friendliness. With a progressive pre-training
framework, our model can handle various tasks including scene-related,
temporal-related, and complex video-language understanding. Using only public
sources for pre-training in 6 days on 32 A100 GPUs, our scratch-built ViT-L/16
achieves state-of-the-art performances on various video tasks. The code and
models will be released at https://github.com/OpenGVLab/unmasked_teacher.
Comment: 16 pages, 5 figures, 28 tables
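A minimal sketch of the unmasked-token alignment idea follows, assuming a frozen image teacher that provides per-token features; the function name, projection head, and cosine objective are illustrative choices, not the UMT implementation.

```python
# Hedged sketch of aligning the student's unmasked tokens to a frozen image teacher.
# `alignment_loss` and the linear projection head are assumptions for illustration.
import torch
import torch.nn.functional as F

def alignment_loss(student_tokens, teacher_tokens, keep_idx, project):
    """Align the student's unmasked (kept) tokens to the teacher's features.

    student_tokens: (B, N_keep, C_s) features of the tokens the student actually saw.
    teacher_tokens: (B, N, C_t) features of all tokens from the frozen image teacher.
    keep_idx:       (B, N_keep) long indices of the unmasked token positions.
    project:        head mapping the student dim C_s to the teacher dim C_t.
    """
    # Gather the teacher features at the same (unmasked) token positions.
    target = torch.gather(
        teacher_tokens, 1,
        keep_idx.unsqueeze(-1).expand(-1, -1, teacher_tokens.size(-1)),
    ).detach()  # teacher is frozen; no gradient flows into it
    pred = project(student_tokens)
    # Negative cosine similarity as a simple alignment objective.
    return 1 - F.cosine_similarity(pred, target, dim=-1).mean()

# Toy usage with random tensors and illustrative sizes.
B, N, keep, Cs, Ct = 2, 1568, 160, 768, 1024
idx = torch.stack([torch.randperm(N)[:keep] for _ in range(B)])
loss = alignment_loss(torch.randn(B, keep, Cs), torch.randn(B, N, Ct),
                      idx, torch.nn.Linear(Cs, Ct))
```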
Harvest Video Foundation Models via Efficient Post-Pretraining
Building video-language foundation models is costly and difficult due to the
redundant nature of video data and the lack of high-quality video-language
datasets. In this paper, we propose an efficient framework to harvest video
foundation models from image ones. Our method is intuitively simple: we randomly
drop input video patches and mask out input text during the
post-pretraining procedure. Patch dropping significantly boosts training
efficiency, and text masking enforces the learning of cross-modal fusion. We
conduct extensive experiments to validate the effectiveness of our method on a
wide range of video-language downstream tasks including various zero-shot
tasks, video question answering, and video-text retrieval. Despite its
simplicity, our method achieves state-of-the-art performance comparable to
that of some heavily pretrained video foundation models. Our method is
extremely efficient and can be trained in less than one day on 8 GPUs,
requiring only WebVid-10M as pretraining data. We hope our method can serve as
a simple yet strong counterpart for prevalent video foundation models, provide
useful insights when building them, and make large pretrained models more
accessible and sustainable. This is part of the InternVideo project:
https://github.com/OpenGVLab/InternVideo
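A minimal sketch of the two ingredients described above, random video-patch dropping and text-token masking; the ratios, the mask-token id, and tensor layouts are illustrative assumptions rather than the paper's settings.

```python
# Hedged sketch of random patch dropping and text-token masking (not the paper's code).
import torch

def drop_video_patches(patches, keep_ratio=0.5):
    """Randomly keep a subset of video patch tokens to cut post-pretraining cost.

    patches: (B, N, C) patch embeddings; returns (B, N_keep, C) and the kept indices.
    """
    B, N, _ = patches.shape
    n_keep = max(1, int(N * keep_ratio))
    idx = torch.rand(B, N).argsort(dim=1)[:, :n_keep]      # random subset per clip
    kept = torch.gather(
        patches, 1, idx.unsqueeze(-1).expand(-1, -1, patches.size(-1)))
    return kept, idx

def mask_text_tokens(token_ids, mask_id, mask_prob=0.15, pad_id=0):
    """Replace a fraction of non-padding text tokens with [MASK] to force cross-modal fusion."""
    masked = token_ids.clone()
    candidates = token_ids != pad_id
    chosen = candidates & (torch.rand_like(token_ids, dtype=torch.float) < mask_prob)
    masked[chosen] = mask_id
    return masked, chosen
```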
VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Scale is the primary factor for building a powerful foundation model that
could well generalize to a variety of downstream tasks. However, it is still
challenging to train video foundation models with billions of parameters. This
paper shows that video masked autoencoder (VideoMAE) is a scalable and general
self-supervised pre-trainer for building video foundation models. We scale the
VideoMAE in both model and data with a core design. Specifically, we present a
dual masking strategy for efficient pre-training, with an encoder operating on
a subset of video tokens and a decoder processing another subset of video
tokens. Although VideoMAE is already very efficient due to the high masking ratio
in the encoder, masking the decoder can further reduce the overall computational
cost. This enables the efficient pre-training of billion-parameter models on video.
We also use a progressive training paradigm that involves an initial
pre-training on a diverse multi-sourced unlabeled dataset, followed by a
post-pre-training on a mixed labeled dataset. Finally, we successfully train a
video ViT model with a billion parameters, which achieves a new
state-of-the-art performance on the datasets of Kinetics (90.0% on K400 and
89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2). In
addition, we extensively verify the pre-trained video ViT models on a variety
of downstream tasks, demonstrating their effectiveness as general video
representation learners.
Comment: CVPR 2023 camera-ready version
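A minimal sketch of the dual-masking bookkeeping, assuming random subsets on both sides; VideoMAE V2's actual encoder and decoder masks use specific strategies and ratios, so this only illustrates how masking the decoder shrinks the reconstruction workload.

```python
# Hedged sketch of dual masking: the encoder sees one small subset of tokens and the
# decoder reconstructs only a subset of the remaining masked tokens. Ratios and the
# random selection are illustrative assumptions, not the paper's exact design.
import torch

def dual_masking(num_tokens, enc_keep_ratio=0.1, dec_keep_ratio=0.5, generator=None):
    """Split token indices into encoder-visible tokens and decoder reconstruction targets."""
    perm = torch.randperm(num_tokens, generator=generator)
    n_enc = max(1, int(num_tokens * enc_keep_ratio))
    enc_idx = perm[:n_enc]                    # tokens fed to the encoder
    masked = perm[n_enc:]                     # tokens hidden from the encoder
    n_dec = max(1, int(masked.numel() * dec_keep_ratio))
    dec_idx = masked[torch.randperm(masked.numel(), generator=generator)[:n_dec]]
    return enc_idx, dec_idx

enc_idx, dec_idx = dual_masking(8 * 14 * 14)  # e.g. an 8x14x14 token grid
print(enc_idx.numel(), dec_idx.numel())       # decoder handles only part of the masked set
```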
- …