920 research outputs found
Contrastive Learning of Medical Visual Representations from Paired Images and Text
Learning visual representations of medical images is core to medical image
understanding but its progress has been held back by the small size of
hand-labeled datasets. Existing work commonly relies on transferring weights
from ImageNet pretraining, which is suboptimal due to drastically different
image characteristics, or rule-based label extraction from the textual report
data paired with medical images, which is inaccurate and hard to generalize. We
propose an alternative unsupervised strategy to learn medical visual
representations directly from the naturally occurring pairing of images and
textual data. Our method of pretraining medical image encoders with the paired
text data via a bidirectional contrastive objective between the two modalities
is domain-agnostic, and requires no additional expert input. We test our method
by transferring our pretrained weights to 4 medical image classification tasks
and 2 zero-shot retrieval tasks, and show that our method leads to image
representations that considerably outperform strong baselines in most settings.
Notably, in all 4 classification tasks, our method requires only 10% as much
labeled training data as an ImageNet initialized counterpart to achieve better
or comparable performance, demonstrating superior data efficiency.
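As a rough illustration of the bidirectional contrastive objective the abstract refers to, the following PyTorch sketch computes a symmetric InfoNCE loss over a batch of paired image and report embeddings; the function name, temperature value, and toy inputs are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(image_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    image_emb, text_emb: (N, D) tensors from the two encoders; row i of each
    tensor comes from the same image-report pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature       # (N, N) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> matching report
    loss_t2i = F.cross_entropy(logits.t(), targets)        # report -> matching image
    return 0.5 * (loss_i2t + loss_t2i)

# toy usage with random features standing in for encoder outputs
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(bidirectional_contrastive_loss(img, txt).item())
```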
When and why vision-language models behave like bags-of-words, and what to do about it?
Despite the success of large vision and language models (VLMs) in many
downstream applications, it is unclear how well they encode compositional
information. Here, we create the Attribution, Relation, and Order (ARO)
benchmark to systematically evaluate the ability of VLMs to understand
different types of relationships, attributes, and order. ARO consists of Visual
Genome Attribution, to test the understanding of objects' properties; Visual
Genome Relation, to test for relational understanding; and COCO &
Flickr30k-Order, to test for order sensitivity. ARO is orders of magnitude
larger than previous benchmarks of compositionality, with more than 50,000 test
cases. We show where state-of-the-art VLMs have poor relational understanding,
can blunder when linking objects to their attributes, and demonstrate a severe
lack of order sensitivity. VLMs are predominantly trained and evaluated on
large datasets with rich compositional structure in the images and captions.
Yet, training on these datasets has not been enough to address the lack of
compositional understanding, and evaluating on these datasets has failed to
surface this deficiency. To understand why these limitations emerge and are not
represented in the standard tests, we zoom into the evaluation and training
procedures. We demonstrate that it is possible to perform well on retrieval
over existing datasets without using the composition and order information.
Given that contrastive pretraining optimizes for retrieval on datasets with
similar shortcuts, we hypothesize that this can explain why the models do not
need to learn to represent compositional information. This finding suggests a
natural solution: composition-aware hard negative mining. We show that a
simple-to-implement modification of contrastive learning significantly improves
the performance on tasks requiring understanding of order and compositionality.Comment: ICLR 2023 Oral (notable-top-5%
GeoLM: Empowering Language Models for Geospatially Grounded Language Understanding
Humans subconsciously engage in geospatial reasoning when reading articles.
We recognize place names and their spatial relations in text and mentally
associate them with their physical locations on Earth. Although pretrained
language models can mimic this cognitive process using linguistic context, they
do not utilize valuable geospatial information in large, widely available
geographical databases, e.g., OpenStreetMap. This paper introduces GeoLM, a
geospatially grounded language model that enhances the understanding of
geo-entities in natural language. GeoLM leverages geo-entity mentions as
anchors to connect linguistic information in text corpora with geospatial
information extracted from geographical databases. GeoLM connects the two types
of context through contrastive learning and masked language modeling. It also
incorporates a spatial coordinate embedding mechanism to encode distance and
direction relations to capture geospatial context. In the experiment, we
demonstrate that GeoLM exhibits promising capabilities in supporting toponym
recognition, toponym linking, relation extraction, and geo-entity typing, which
bridge the gap between natural language processing and geospatial sciences. The
code is publicly available at https://github.com/knowledge-computing/geolm.
Comment: Accepted to EMNLP23 main conference
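A spatial coordinate embedding of the general kind mentioned in the abstract could, for example, expand longitude and latitude with sinusoids at several frequencies so that nearby places receive similar vectors. The sketch below is a hypothetical construction; GeoLM's actual mechanism, dimensions, and frequency choices may differ.

```python
import math
import torch

def coordinate_embedding(lon_lat: torch.Tensor, dim: int = 64) -> torch.Tensor:
    """Map (longitude, latitude) in degrees to a fixed-size sinusoidal embedding.

    lon_lat: (N, 2) tensor. Each coordinate is expanded with sine/cosine at
    several frequencies so nearby places get similar embeddings.
    """
    n_freqs = dim // 4                     # 2 coords * 2 (sin, cos) per frequency
    freqs = torch.tensor([1.0 / (10000 ** (i / n_freqs)) for i in range(n_freqs)])
    angles = lon_lat.unsqueeze(-1) * math.pi / 180.0 * freqs        # (N, 2, n_freqs)
    emb = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (N, 2, 2*n_freqs)
    return emb.flatten(start_dim=1)        # (N, dim)

coords = torch.tensor([[-118.24, 34.05], [2.35, 48.86]])  # Los Angeles, Paris
print(coordinate_embedding(coords).shape)  # torch.Size([2, 64])
```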
Learning a Unified Vision-Language Representation Space with a Single-Tower CLIP
Thesis (Master's) -- Seoul National University, Graduate School of Convergence Science and Technology, Department of Intelligence and Information Convergence, February 2023. Advisor: Nojun Kwak.
Contrastive learning is widely adopted in self-supervised representation learning (SSL) to learn common attributes from similar sample pairs. In this paper, we boldly hypothesize that an image and its caption can simply be regarded as two different views of an underlying semantic, and aim to build a unified vision-language representation space by inducing a one-tower transformer that can encode both types of data samples in a modality-agnostic manner. We show that naively applying typical SSL frameworks to vision-language pretraining (VLP) fails to train a generic one-tower model due to a severe modality gap, and propose One Representation (OneR) to mitigate the disparity. We explore emerging properties of OneR that distinguish it from prior works with modality-specific representation spaces, such as zero-shot object localization, text-guided visual reasoning, and multi-modal retrieval, and analyze our novel multi-modal representation learning. Comprehensive evaluations demonstrate the potential of a modality-agnostic VLP framework with a unified representation space.
1 Introduction
2 Related Works
2.1 Self-Supervised Learning
2.2 Vision-Language Pretraining
2.3 Unified Vision-Language Framework
3 Overcoming Modality Gap
3.1 Cross-Modal Mixup
4 Modality-agnostic Representations
4.1 Contextual Modality Invariance
4.2 Contextual Mixup Contrast
4.3 Theoretical Explanation of CMC
4.4 One Representation
5 Experiment
5.1 Experimental Setup
5.1.1 Datasets
5.1.2 Implementation Details
5.2 Qualitative Results
5.2.1 Zero-shot Localization
5.2.2 Text-guided Visual Reasoning
5.2.3 Multi-modal Retrieval
5.3 Visual Reasoning Analysis
5.3.1 Robustness
5.3.2 Multi-level Vision-Language Connection
5.4 Quantitative Results
5.4.1 Image-text Retrieval
5.4.2 Cross-modal Knowledge Transfer
5.5 Ablation Study
5.5.1 Proposed Loss Ablation
5.5.2 Masked Modeling Ablation
6 Discussion
7 Conclusion
Abstract (In Korean)
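As a rough sketch of the one-tower idea described in the thesis abstract above, the PyTorch snippet below shares a single transformer encoder between text tokens and image patches, with only the input embeddings being modality-specific, and adds a simple cross-modal mixup of embedded sequences; the class and function names, dimensions, and mixup form are illustrative assumptions, not the thesis's exact design.

```python
import torch
import torch.nn as nn

class OneTowerEncoder(nn.Module):
    """Single transformer that encodes either image patches or text tokens.

    Modality-specific layers only embed the input; all attention layers are
    shared, so both modalities land in one representation space.
    """
    def __init__(self, vocab_size=30522, patch_dim=768, dim=512, depth=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.patch_embed = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x, modality):
        tokens = self.text_embed(x) if modality == "text" else self.patch_embed(x)
        return self.shared_encoder(tokens).mean(dim=1)   # pooled representation

def cross_modal_mixup(img_tokens, txt_tokens, alpha=0.5):
    """Interpolate embedded image and text sequences to bridge the modality gap."""
    length = min(img_tokens.size(1), txt_tokens.size(1))
    return alpha * img_tokens[:, :length] + (1 - alpha) * txt_tokens[:, :length]

model = OneTowerEncoder()
text_repr = model(torch.randint(0, 30522, (2, 16)), modality="text")
image_repr = model(torch.randn(2, 49, 768), modality="image")
mixed = cross_modal_mixup(model.patch_embed(torch.randn(2, 49, 768)),
                          model.text_embed(torch.randint(0, 30522, (2, 16))))
print(text_repr.shape, image_repr.shape, mixed.shape)
```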
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
This paper presents a simple yet effective framework MaskCLIP, which
incorporates a newly proposed masked self-distillation into contrastive
language-image pretraining. The core idea of masked self-distillation is to
distill representation from a full image to the representation predicted from a
masked image. Such incorporation enjoys two vital benefits. First, masked
self-distillation targets local patch representation learning, which is
complementary to the vision-language contrastive objective that focuses on
text-related representation. Second, masked self-distillation is also
consistent with the vision-language contrastive objective, as both utilize the
visual encoder for feature alignment, and it is thus able to learn local
semantics with indirect supervision from the language. We provide
specially designed experiments with a comprehensive analysis to validate the
two benefits. Symmetrically, we also introduce the local semantic supervision
into the text branch, which further improves the pretraining performance. With
extensive experiments, we show that MaskCLIP, when applied to various
challenging downstream tasks, achieves superior results in linear probing,
finetuning, and zero-shot performance with the guidance of the language
encoder. Code will be released at \url{https://github.com/LightDXY/MaskCLIP}.
Comment: CVPR 2023, code is available at https://github.com/LightDXY/MaskCLIP
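Masked self-distillation as described in the abstract can be sketched as a student encoder that sees a masked image and is trained to match the representation an EMA teacher produces from the full image. In the toy version below, the element-wise mask stands in for patch masking, and the loss, momentum value, and tiny encoders are assumptions for illustration only.

```python
import copy
import torch
import torch.nn.functional as F

def masked_self_distillation_step(student, teacher, images, mask_ratio=0.5):
    """One masked self-distillation step: the student sees a masked image,
    the teacher sees the full image, and the student matches the teacher."""
    with torch.no_grad():
        target = teacher(images)                      # full-image representation
    mask = (torch.rand_like(images) > mask_ratio).float()
    prediction = student(images * mask)               # masked-image representation
    return F.smooth_l1_loss(prediction, target)

def update_teacher(student, teacher, momentum=0.996):
    """EMA update keeping the teacher a slow-moving copy of the student."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.data.mul_(momentum).add_(p_s.data, alpha=1 - momentum)

# toy encoders standing in for the visual encoder
student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
teacher = copy.deepcopy(student)
images = torch.randn(4, 3, 32, 32)
loss = masked_self_distillation_step(student, teacher, images)
loss.backward()
update_teacher(student, teacher)
print(loss.item())
```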
- …