179 research outputs found
A-JEPA: Joint-Embedding Predictive Architecture Can Listen
This paper shows that the masked-modeling principle driving the success of large vision foundation models can be effectively applied to audio by making predictions in a latent space. We introduce the Audio-based Joint-Embedding Predictive Architecture (A-JEPA), a simple extension for self-supervised learning from audio spectrograms. Following the design of I-JEPA, our A-JEPA encodes visible audio spectrogram patches with a curriculum masking strategy via a context encoder, and predicts the representations of regions sampled at well-designed locations. The target representations of those regions are extracted by an exponential moving average of the context encoder, \emph{i.e.}, the target encoder, applied to the whole spectrogram. We find it beneficial to transition from random block masking to time-frequency-aware masking in a curriculum manner, given that audio spectrograms are highly correlated in local time and frequency. To enhance contextual semantic understanding and robustness, we fine-tune the encoder with regularized masking on target datasets, instead of dropping inputs or zeroing them out. Empirically, when built on the Vision Transformer structure, A-JEPA is highly scalable and sets new state-of-the-art performance on multiple audio and speech classification tasks, outperforming other recent models that use externally supervised pre-training.
Comment: arXiv admin note: text overlap with arXiv:2207.06405 by other authors
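As a concrete illustration of the latent-prediction scheme described in this abstract, the following is a minimal sketch of a JEPA-style step: a context encoder processes visible patches, an exponential-moving-average target encoder produces latent targets on the full spectrogram, and a predictor regresses the masked-region representations. The module sizes, pooling, and loss are illustrative assumptions, not the authors' A-JEPA code.

```python
# Minimal JEPA-style latent prediction sketch (illustrative, not A-JEPA's release).
import copy
import torch
import torch.nn as nn

class ToyJEPA(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Context encoder sees only the visible (unmasked) spectrogram patches.
        self.context_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Target encoder is an EMA copy of the context encoder; no gradients.
        self.target_encoder = copy.deepcopy(self.context_encoder)
        for p in self.target_encoder.parameters():
            p.requires_grad = False
        # Predictor maps context features into the target representation space.
        self.predictor = nn.Linear(dim, dim)

    @torch.no_grad()
    def ema_update(self, momentum=0.996):
        # target <- momentum * target + (1 - momentum) * context
        for t, c in zip(self.target_encoder.parameters(),
                        self.context_encoder.parameters()):
            t.mul_(momentum).add_(c.detach(), alpha=1.0 - momentum)

    def loss(self, visible_patches, full_patches, target_idx):
        # Predict latent targets at masked locations; regress in latent space.
        ctx = self.context_encoder(visible_patches)           # (B, Nv, D)
        pred = self.predictor(ctx.mean(dim=1, keepdim=True))  # (B, 1, D)
        with torch.no_grad():
            tgt = self.target_encoder(full_patches)           # (B, N, D)
            tgt = tgt[:, target_idx].mean(dim=1, keepdim=True)
        return nn.functional.smooth_l1_loss(pred, tgt)

model = ToyJEPA()
vis = torch.randn(2, 40, 256)   # visible spectrogram patch embeddings
full = torch.randn(2, 64, 256)  # full spectrogram patch embeddings
loss = model.loss(vis, full, target_idx=torch.arange(40, 64))
loss.backward()
model.ema_update()
```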
Music Consistency Models
Consistency models have exhibited remarkable capabilities in facilitating
efficient image/video generation, enabling synthesis with minimal sampling
steps. It has proven to be advantageous in mitigating the computational burdens
associated with diffusion models. Nevertheless, the application of consistency
models in music generation remains largely unexplored. To address this gap, we
present Music Consistency Models (\texttt{MusicCM}), which leverages the
concept of consistency models to efficiently synthesize mel-spectrograms for
music clips, maintaining high quality while minimizing the number of sampling
steps. Building upon existing text-to-music diffusion models, the
\texttt{MusicCM} model incorporates consistency distillation and adversarial
discriminator training. Moreover, we find it beneficial to generate extended
coherent music by incorporating multiple diffusion processes with shared
constraints. Experimental results reveal the effectiveness of our model in
terms of computational efficiency, fidelity, and naturalness. Notably, \texttt{MusicCM} achieves seamless music synthesis with a mere four sampling steps, e.g., only one second per minute of the music clip, showcasing the potential for real-time applications.
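The few-step generation claim rests on consistency sampling, where a learned consistency function maps a noisy sample directly back to the data manifold at each of a handful of noise levels. The sketch below shows that sampling loop in the abstract's four-step spirit; the noise schedule and the stand-in consistency function are assumptions, not the \texttt{MusicCM} implementation.

```python
# Illustrative few-step consistency sampling for a mel-spectrogram, assuming a
# trained consistency function f(x_sigma, sigma) that maps noisy input to a
# clean estimate. Names and the 4-step schedule are assumptions.
import torch

def consistency_sample(f, shape, sigmas=(80.0, 24.0, 5.0, 0.5), seed=0):
    """Denoise from the largest noise level, then re-noise to each
    intermediate level and denoise again."""
    g = torch.Generator().manual_seed(seed)
    x = torch.randn(shape, generator=g) * sigmas[0]
    sample = f(x, torch.tensor(sigmas[0]))           # one-step estimate
    for sigma in sigmas[1:]:
        noise = torch.randn(shape, generator=g)
        x = sample + sigma * noise                   # re-noise to level sigma
        sample = f(x, torch.tensor(sigma))           # map back toward the data
    return sample

# Stand-in consistency function just to show the call pattern.
dummy_f = lambda x, sigma: x / (1.0 + sigma)
mel = consistency_sample(dummy_f, shape=(1, 80, 256))  # (batch, mel bins, frames)
print(mel.shape)
```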
Scalable Diffusion Models with State Space Backbone
This paper presents a new exploration into a category of diffusion models built upon a state space architecture. We endeavor to train diffusion models for
image data, wherein the traditional U-Net backbone is supplanted by a state
space backbone, functioning on raw patches or latent space. Given its notable
efficacy in accommodating long-range dependencies, Diffusion State Space Models
(DiS) are distinguished by treating all inputs including time, condition, and
noisy image patches as tokens. Our assessment of DiS encompasses both
unconditional and class-conditional image generation scenarios, revealing that
DiS exhibits comparable, if not superior, performance to CNN-based or
Transformer-based U-Net architectures of commensurate size. Furthermore, we
analyze the scalability of DiS, gauged by the forward pass complexity
quantified in Gflops. DiS models with higher Gflops, achieved by increasing depth/width or the number of input tokens, consistently
demonstrate lower FID. In addition to demonstrating commendable scalability
characteristics, DiS-H/2 models in latent space achieve performance levels akin
to prior diffusion models on class-conditional ImageNet benchmarks at resolutions of 256×256 and 512×512, while significantly reducing
the computational burden. The code and models are available at:
https://github.com/feizc/DiS
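The abstract's key design point is that time, condition, and noisy image patches are all treated as tokens in one sequence. Below is a minimal sketch of that input assembly with assumed embedding modules and dimensions; the state-space backbone that would consume the sequence is omitted.

```python
# Illustrative "everything is a token" input assembly in the spirit of DiS.
import torch
import torch.nn as nn

class DiSInputTokens(nn.Module):
    def __init__(self, patch_dim=3 * 2 * 2, dim=384, num_classes=1000):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, dim)        # noisy patches -> tokens
        self.time_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.class_embed = nn.Embedding(num_classes, dim)  # condition -> token

    def forward(self, noisy_patches, t, y):
        # noisy_patches: (B, N, patch_dim), t: (B,), y: (B,)
        patch_tok = self.patch_proj(noisy_patches)                         # (B, N, D)
        time_tok = self.time_embed(t.float().unsqueeze(-1)).unsqueeze(1)   # (B, 1, D)
        cond_tok = self.class_embed(y).unsqueeze(1)                        # (B, 1, D)
        # One flat token sequence for the (omitted) state-space backbone.
        return torch.cat([time_tok, cond_tok, patch_tok], dim=1)           # (B, N+2, D)

tok = DiSInputTokens()
seq = tok(torch.randn(2, 256, 12), torch.randint(0, 1000, (2,)), torch.randint(0, 1000, (2,)))
print(seq.shape)  # torch.Size([2, 258, 384])
```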
Efficient Modeling of Future Context for Image Captioning
Existing approaches to image captioning usually generate the sentence word by word from left to right, constrained to condition only on local context, namely the given image and previously generated words. Many studies have targeted making use of global information during decoding, e.g., iterative refinement. However, how to effectively and efficiently incorporate future context remains under-explored. To address this issue, inspired by the fact that Non-Autoregressive Image Captioning (NAIC) can leverage two-sided relations with a modified mask operation, we aim to graft this advance onto the conventional Autoregressive Image Captioning (AIC) model while maintaining inference efficiency without extra time cost. Specifically, the AIC and NAIC models are first trained jointly with a shared visual encoder, forcing the visual encoder to contain sufficient and valid future context; then the AIC model is encouraged to capture the causal dynamics of cross-layer interchanging from the NAIC model on its unconfident words, following a teacher-student paradigm optimized with a distribution calibration training objective. Empirical evidence demonstrates that our proposed approach clearly surpasses state-of-the-art baselines in both automatic metrics and human evaluations on the MS COCO benchmark. The source code is available at: https://github.com/feizc/Future-Caption.
Comment: ACM Multimedia 202
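A plausible reading of the teacher-student calibration step is a selective divergence objective: on tokens where the autoregressive student is unconfident, its output distribution is pulled toward the non-autoregressive teacher. The sketch below illustrates that idea; the confidence threshold and tensor shapes are assumptions rather than the paper's exact objective.

```python
# Selective teacher-student calibration sketch (illustrative assumptions).
import torch
import torch.nn.functional as F

def calibration_loss(student_logits, teacher_logits, confidence_threshold=0.5):
    # student_logits, teacher_logits: (B, T, V)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_probs = F.softmax(teacher_logits, dim=-1).detach()
    # A token is "unconfident" if the student's max probability is low.
    confidence = student_log_probs.exp().max(dim=-1).values          # (B, T)
    mask = (confidence < confidence_threshold).float()               # (B, T)
    kl = F.kl_div(student_log_probs, teacher_probs, reduction="none").sum(-1)  # (B, T)
    return (kl * mask).sum() / mask.sum().clamp(min=1.0)

loss = calibration_loss(torch.randn(2, 16, 100), torch.randn(2, 16, 100))
print(loss.item())
```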
Divide and Adapt: Active Domain Adaptation via Customized Learning
Active domain adaptation (ADA) aims to improve the model adaptation
performance by incorporating active learning (AL) techniques to label a
maximally-informative subset of target samples. Conventional AL methods do not
consider the existence of domain shift, and hence, fail to identify the truly
valuable samples in the context of domain adaptation. To accommodate active
learning and domain adaptation, the two naturally different tasks, in a
collaborative framework, we advocate that a customized learning strategy for
the target data is the key to the success of ADA solutions. We present
Divide-and-Adapt (DiaNA), a new ADA framework that partitions the target
instances into four categories with stratified transferable properties. With a
novel data subdivision protocol based on uncertainty and domainness, DiaNA can
accurately recognize the most gainful samples. While sending the informative
instances for annotation, DiaNA employs tailored learning strategies for the
remaining categories. Furthermore, we propose an informativeness score that
unifies the data partitioning criteria. This enables the use of a Gaussian
mixture model (GMM) to automatically categorize unlabeled data into the proposed four categories. Thanks to the "divide-and-adapt" spirit, DiaNA can handle data
with large variations of domain gap. In addition, we show that DiaNA can
generalize to different domain adaptation settings, such as unsupervised domain
adaptation (UDA), semi-supervised domain adaptation (SSDA), source-free domain
adaptation (SFDA), etc.
Comment: CVPR 2023, Highlight paper
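The GMM-based partitioning can be illustrated with a short sketch: fit a four-component Gaussian mixture on a scalar informativeness score and route each unlabeled target sample to a category, e.g., sending one component's samples for annotation. The score below is a stand-in; DiaNA derives it from uncertainty and domainness.

```python
# GMM-based four-way partitioning sketch (the score is a stand-in).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-in informativeness scores for 1,000 unlabeled target samples.
scores = np.concatenate([
    rng.normal(loc, 0.1, 250) for loc in (0.1, 0.4, 0.7, 1.0)
]).reshape(-1, 1)

gmm = GaussianMixture(n_components=4, random_state=0).fit(scores)
category = gmm.predict(scores)   # 0..3, one of four transferable-property groups

# E.g., the component with the highest mean could be the group sent for annotation.
annotate_component = int(np.argmax(gmm.means_.ravel()))
to_annotate = np.where(category == annotate_component)[0]
print(len(to_annotate), "samples selected for labeling")
```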
Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models
Transformers have catalyzed advancements in computer vision and natural
language processing (NLP) fields. However, substantial computational complexity
poses limitations for their application in long-context tasks, such as
high-resolution image generation. This paper introduces a series of
architectures adapted from the RWKV model used in NLP, with requisite modifications tailored to diffusion models applied to image generation tasks, referred to as Diffusion-RWKV. Similar to diffusion models built on Transformers, our model is designed to efficiently handle patchified inputs in a sequence with extra conditions, while also scaling up effectively, accommodating both large-scale parameters and extensive datasets. Its distinctive advantage manifests in its reduced spatial aggregation complexity, rendering it exceptionally adept at processing high-resolution images and thereby eliminating the necessity for windowing or group cached operations. Experimental results on both conditional and unconditional image generation tasks demonstrate that Diffusion-RWKV achieves performance on par with or surpassing existing CNN- or Transformer-based diffusion models in FID and IS metrics, while significantly reducing total computational FLOP usage.
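Because the backbone consumes one flat token sequence without windowing, the only image-specific preprocessing is patchification. A minimal sketch of that step, with an assumed patch size, shows how sequence length scales with resolution.

```python
# Patchification sketch: cut an image into non-overlapping patches and flatten
# them into one long token sequence (patch size and dims are assumptions).
import torch

def patchify(images, patch=4):
    # images: (B, C, H, W) -> tokens: (B, (H//patch)*(W//patch), C*patch*patch)
    b, c, h, w = images.shape
    assert h % patch == 0 and w % patch == 0
    x = images.reshape(b, c, h // patch, patch, w // patch, patch)
    x = x.permute(0, 2, 4, 1, 3, 5)                    # (B, Hp, Wp, C, p, p)
    return x.reshape(b, (h // patch) * (w // patch), c * patch * patch)

tokens = patchify(torch.randn(1, 3, 512, 512), patch=4)
print(tokens.shape)  # torch.Size([1, 16384, 48]); sequence length grows with resolution
```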
Progressive Denoising Model for Fine-Grained Text-to-Image Generation
Recently, vector quantized autoregressive (VQ-AR) models have shown
remarkable results in text-to-image synthesis by uniformly predicting discrete image tokens from the top left to the bottom right in the latent space. Although
the simple generative process surprisingly works well, is this the best way to
generate the image? For instance, human creation is more inclined to proceed from the outline to the fine details of an image, while VQ-AR models themselves do not consider any relative importance of each component. In this paper, we present a progressive denoising model for high-fidelity text-to-image generation. The proposed
method takes effect by creating new image tokens from coarse to fine based on
the existing context in a parallel manner and this procedure is recursively
applied until an image sequence is completed. The resulting coarse-to-fine
hierarchy makes the image generation process intuitive and interpretable.
Extensive experiments demonstrate that the progressive model produces
significantly better results when compared with the previous VQ-AR method in
FID score across a wide variety of categories and aspects. Moreover, the
text-to-image generation time of traditional AR increases linearly with the
output image resolution and hence is quite time-consuming even for normal-size
images. In contrast, our approach achieves a better trade-off between generation quality and speed.
Comment: Technical report. arXiv admin note: text overlap with arXiv:2206.10789 by other authors
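One way to picture the coarse-to-fine, parallel, recursive procedure is a masked-token decoding loop that commits a growing fraction of the most confident predictions each round until the token grid is complete. The toy model and schedule below are assumptions meant only to illustrate the control flow, not the paper's sampler.

```python
# Coarse-to-fine parallel token decoding sketch (toy model, assumed schedule).
import torch

def progressive_decode(predict_fn, seq_len=64, steps=4, mask_id=-1):
    tokens = torch.full((seq_len,), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = predict_fn(tokens)                       # (seq_len, vocab)
        probs, candidates = logits.softmax(-1).max(-1)    # confidence + argmax
        missing = tokens == mask_id
        # Commit the most confident missing positions this round (coarse -> fine).
        keep = int(missing.sum().item() * (step + 1) / steps)
        conf = torch.where(missing, probs, torch.full_like(probs, -1.0))
        chosen = conf.topk(max(keep, 1)).indices
        tokens[chosen] = candidates[chosen]
    return tokens

toy_model = lambda toks: torch.randn(toks.shape[0], 512)  # stand-in predictor
print(progressive_decode(toy_model)[:10])
```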
Towards Efficient Sparse Coding for Scalable Image Annotation
DOI: 10.1145/2502081.2502127. MM 2013 - Proceedings of the 2013 ACM Multimedia Conference, pp. 947-95
Enriching Phrases with Coupled Pixel and Object Contexts for Panoptic Narrative Grounding
Panoptic narrative grounding (PNG) aims to segment things and stuff objects
in an image described by noun phrases of a narrative caption. As a multimodal
task, an essential aspect of PNG is the visual-linguistic interaction between
image and caption. The previous two-stage method aggregates visual contexts
from offline-generated mask proposals to phrase features, which tend to be
noisy and fragmentary. The recent one-stage method aggregates only pixel
contexts from image features to phrase features, which may incur semantic
misalignment due to the lack of object priors. To realize more comprehensive
visual-linguistic interaction, we propose to enrich phrases with coupled pixel
and object contexts by designing a Phrase-Pixel-Object Transformer Decoder
(PPO-TD), where both fine-grained part details and coarse-grained entity clues
are aggregated to phrase features. In addition, we also propose a Phrase-Object
Contrastive Loss (POCL) to pull closer the matched phrase-object pairs and push
away unmatched ones for aggregating more precise object contexts from more
phrase-relevant object tokens. Extensive experiments on the PNG benchmark show
our method achieves new state-of-the-art performance by large margins.
Comment: Accepted by IJCAI 202
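The POCL objective, as described, pulls matched phrase-object pairs together and pushes unmatched pairs apart; a symmetric InfoNCE-style loss is a natural sketch of that behavior. The embedding sizes and temperature below are assumptions, not the paper's exact formulation.

```python
# Phrase-object contrastive loss sketch in the spirit of POCL (assumed details).
import torch
import torch.nn.functional as F

def phrase_object_contrastive(phrase_emb, object_emb, temperature=0.07):
    # phrase_emb, object_emb: (N, D); row i of each is a matched pair.
    p = F.normalize(phrase_emb, dim=-1)
    o = F.normalize(object_emb, dim=-1)
    logits = p @ o.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(p.size(0))         # diagonal entries are the matches
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = phrase_object_contrastive(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```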