
    Contrastive Learning of Medical Visual Representations from Paired Images and Text

    Learning visual representations of medical images is core to medical image understanding, but its progress has been held back by the small size of hand-labeled datasets. Existing work commonly relies on transferring weights from ImageNet pretraining, which is suboptimal due to drastically different image characteristics, or on rule-based label extraction from the textual report data paired with medical images, which is inaccurate and hard to generalize. We propose an alternative unsupervised strategy to learn medical visual representations directly from the naturally occurring pairing of images and textual data. Our method of pretraining medical image encoders with the paired text data via a bidirectional contrastive objective between the two modalities is domain-agnostic and requires no additional expert input. We test our method by transferring our pretrained weights to 4 medical image classification tasks and 2 zero-shot retrieval tasks, and show that our method leads to image representations that considerably outperform strong baselines in most settings. Notably, in all 4 classification tasks, our method requires only 10% as much labeled training data as an ImageNet-initialized counterpart to achieve better or comparable performance, demonstrating superior data efficiency.
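    For concreteness, here is a minimal sketch of a bidirectional image-text contrastive (InfoNCE) objective of the kind described above, assuming paired image and text embeddings from separate encoders; the tensor shapes and temperature value are illustrative assumptions, not the paper's configuration.

        # Minimal sketch of a bidirectional image-text contrastive (InfoNCE) loss.
        # Shapes and the temperature value are illustrative assumptions.
        import torch
        import torch.nn.functional as F

        def bidirectional_contrastive_loss(img_emb, txt_emb, temperature=0.1):
            # img_emb, txt_emb: (batch, dim) projections of paired images and reports
            img_emb = F.normalize(img_emb, dim=-1)
            txt_emb = F.normalize(txt_emb, dim=-1)
            logits = img_emb @ txt_emb.t() / temperature      # pairwise similarities
            targets = torch.arange(img_emb.size(0), device=img_emb.device)
            loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
            loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
            return 0.5 * (loss_i2t + loss_t2i)

        # usage with random features standing in for encoder outputs
        loss = bidirectional_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))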

    When and why vision-language models behave like bags-of-words, and what to do about it?

    Despite the success of large vision and language models (VLMs) in many downstream applications, it is unclear how well they encode compositional information. Here, we create the Attribution, Relation, and Order (ARO) benchmark to systematically evaluate the ability of VLMs to understand different types of relationships, attributes, and order. ARO consists of Visual Genome Attribution, to test the understanding of objects' properties; Visual Genome Relation, to test for relational understanding; and COCO & Flickr30k-Order, to test for order sensitivity. ARO is orders of magnitude larger than previous benchmarks of compositionality, with more than 50,000 test cases. We show where state-of-the-art VLMs have poor relational understanding, can blunder when linking objects to their attributes, and demonstrate a severe lack of order sensitivity. VLMs are predominantly trained and evaluated on large datasets with rich compositional structure in the images and captions. Yet, training on these datasets has not been enough to address the lack of compositional understanding, and evaluating on these datasets has failed to surface this deficiency. To understand why these limitations emerge and are not represented in the standard tests, we zoom into the evaluation and training procedures. We demonstrate that it is possible to perform well on retrieval over existing datasets without using the composition and order information. Given that contrastive pretraining optimizes for retrieval on datasets with similar shortcuts, we hypothesize that this can explain why the models do not need to learn to represent compositional information. This finding suggests a natural solution: composition-aware hard negative mining. We show that a simple-to-implement modification of contrastive learning significantly improves the performance on tasks requiring understanding of order and compositionality. Comment: ICLR 2023 Oral (notable-top-5%).
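    As a rough illustration of composition-aware hard negative mining, the sketch below builds an order-perturbed caption that could be added to a contrastive batch as an extra negative; the helper name and the simple word-swap strategy are illustrative assumptions, not the paper's exact procedure.

        # Illustrative sketch of a composition-aware hard negative: perturb the word
        # order of a caption so it no longer matches the image, then treat the
        # perturbed caption as an additional negative during contrastive training.
        # The two-word swap used here is a simplification for illustration.
        import random

        def order_perturbed_negative(caption: str, rng: random.Random) -> str:
            words = caption.split()
            if len(words) < 3:
                return caption  # too short to perturb meaningfully
            i, j = rng.sample(range(len(words)), 2)
            words[i], words[j] = words[j], words[i]  # swap two words to break composition
            return " ".join(words)

        rng = random.Random(0)
        print(order_perturbed_negative("the horse is eating the grass", rng))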

    GeoLM: Empowering Language Models for Geospatially Grounded Language Understanding

    Humans subconsciously engage in geospatial reasoning when reading articles. We recognize place names and their spatial relations in text and mentally associate them with their physical locations on Earth. Although pretrained language models can mimic this cognitive process using linguistic context, they do not utilize valuable geospatial information in large, widely available geographical databases, e.g., OpenStreetMap. This paper introduces GeoLM, a geospatially grounded language model that enhances the understanding of geo-entities in natural language. GeoLM leverages geo-entity mentions as anchors to connect linguistic information in text corpora with geospatial information extracted from geographical databases. GeoLM connects the two types of context through contrastive learning and masked language modeling. It also incorporates a spatial coordinate embedding mechanism to encode distance and direction relations to capture geospatial context. In the experiment, we demonstrate that GeoLM exhibits promising capabilities in supporting toponym recognition, toponym linking, relation extraction, and geo-entity typing, bridging the gap between natural language processing and geospatial sciences. The code is publicly available at https://github.com/knowledge-computing/geolm. Comment: Accepted to EMNLP23 main conference.
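    As one possible illustration of a spatial coordinate embedding, the sketch below maps (longitude, latitude) pairs to sinusoidal features so that distance and direction information is exposed to a transformer; the frequency schedule and embedding size are assumptions, not necessarily GeoLM's actual mechanism.

        # Sketch of a sinusoidal coordinate embedding for (longitude, latitude) pairs,
        # one common way to expose distance/direction information to a transformer.
        # Frequencies and dimensions are assumptions, not GeoLM's exact design.
        import torch

        def coordinate_embedding(coords: torch.Tensor, dim: int = 64) -> torch.Tensor:
            # coords: (batch, 2) with longitude and latitude in degrees
            n_freqs = dim // 4                      # sin + cos for each of the 2 axes
            freqs = 2.0 ** torch.arange(n_freqs, dtype=torch.float32)  # geometric frequencies
            scaled = coords.unsqueeze(-1) * freqs   # (batch, 2, n_freqs)
            emb = torch.cat([torch.sin(scaled), torch.cos(scaled)], dim=-1)
            return emb.flatten(start_dim=1)         # (batch, dim)

        print(coordinate_embedding(torch.tensor([[126.98, 37.57]])).shape)  # torch.Size([1, 64])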

    Exploring a Unified Vision-Language Representation Space with a One-Tower CLIP

    Thesis (Master's) -- Seoul National University, Graduate School of Convergence Science and Technology, Department of Intelligence and Information, February 2023. Advisor: Nojun Kwak. Contrastive learning is widely adopted in self-supervised representation learning (SSL) to learn common attributes from similar sample pairs. In this paper, we boldly hypothesize that an image and its caption can simply be regarded as two different views of an underlying semantic, and aim to build a unified vision-language representation space by inducing a one-tower transformer that can encode both types of data samples in a modality-agnostic manner. We show that naively applying typical SSL frameworks to vision-language pretraining (VLP) fails to train a generic one-tower model due to a severe modality gap, and propose One Representation (OneR) to mitigate the disparity. We explore emerging properties of OneR that distinguish it from prior works with modality-specific representation spaces, such as zero-shot object localization, text-guided visual reasoning, and multi-modal retrieval, and analyze our novel multi-modal representation learning. Comprehensive evaluations demonstrate the potential of a modality-agnostic VLP framework with a unified representation space.
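    As a rough illustration of the one-tower idea described above, the sketch below projects image patches and text tokens into a shared token space and runs both through the same transformer encoder; the module names, sizes, and pooling are assumptions for illustration, not OneR's actual implementation.

        # Schematic of a one-tower encoder: image patches and text tokens are projected
        # into a shared token space and passed through the *same* transformer.
        # Sizes and names are illustrative, not OneR's actual implementation.
        import torch
        import torch.nn as nn

        class OneTowerEncoder(nn.Module):
            def __init__(self, vocab_size=30522, patch_dim=768, d_model=512, n_layers=4):
                super().__init__()
                self.text_embed = nn.Embedding(vocab_size, d_model)   # text -> shared space
                self.patch_proj = nn.Linear(patch_dim, d_model)       # image patches -> shared space
                layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
                self.shared_encoder = nn.TransformerEncoder(layer, n_layers)

            def forward(self, tokens=None, patches=None):
                # Either modality goes through the same weights; only the input
                # projection is modality-specific.
                x = self.text_embed(tokens) if tokens is not None else self.patch_proj(patches)
                return self.shared_encoder(x).mean(dim=1)             # pooled representation

        model = OneTowerEncoder()
        text_repr = model(tokens=torch.randint(0, 30522, (2, 16)))
        image_repr = model(patches=torch.randn(2, 49, 768))
        print(text_repr.shape, image_repr.shape)   # both torch.Size([2, 512])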

    MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining

    This paper presents a simple yet effective framework, MaskCLIP, which incorporates a newly proposed masked self-distillation into contrastive language-image pretraining. The core idea of masked self-distillation is to distill the representation of a full image into the representation predicted from a masked image. Such incorporation enjoys two vital benefits. First, masked self-distillation targets local patch representation learning, which is complementary to vision-language contrastive learning focusing on text-related representation. Second, masked self-distillation is also consistent with vision-language contrastive learning from the perspective of the training objective, as both utilize the visual encoder for feature aligning, and thus is able to learn local semantics with indirect supervision from the language. We provide specially designed experiments with a comprehensive analysis to validate the two benefits. Symmetrically, we also introduce local semantic supervision into the text branch, which further improves the pretraining performance. With extensive experiments, we show that MaskCLIP, when applied to various challenging downstream tasks, achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder. Code will be released at https://github.com/LightDXY/MaskCLIP. Comment: CVPR 2023, code is available at https://github.com/LightDXY/MaskCLIP.
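    To make the masked self-distillation idea concrete, the sketch below has a teacher encode the full set of patch embeddings while a student predicts the teacher's features from a masked view; the mask ratio, the cosine loss, and the stand-in linear encoders are illustrative choices, not MaskCLIP's exact recipe.

        # Sketch of a masked self-distillation step: a "teacher" encodes the full
        # image patches, the student encodes a masked view, and the student is
        # trained to predict the teacher's features at the masked positions.
        import torch
        import torch.nn.functional as F

        def masked_self_distillation_loss(student, teacher, patches, mask_ratio=0.6):
            # patches: (batch, n_patches, dim) patch embeddings of the full image
            with torch.no_grad():
                target = teacher(patches)                    # full-image teacher features
            mask = torch.rand(patches.shape[:2], device=patches.device) < mask_ratio
            masked = patches.masked_fill(mask.unsqueeze(-1), 0.0)  # zero-out masked patches
            pred = student(masked)                           # predict from the masked view
            # distill only on the masked positions, with a cosine similarity loss
            cos = F.cosine_similarity(pred[mask], target[mask], dim=-1)
            return (1.0 - cos).mean()

        # toy run with linear layers standing in for the student/teacher encoders
        student = torch.nn.Linear(768, 768)
        teacher = torch.nn.Linear(768, 768)
        print(masked_self_distillation_loss(student, teacher, torch.randn(2, 49, 768)))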