920 research outputs found
Contrastive Learning of Medical Visual Representations from Paired Images and Text
Learning visual representations of medical images is core to medical image
understanding but its progress has been held back by the small size of
hand-labeled datasets. Existing work commonly relies on transferring weights
from ImageNet pretraining, which is suboptimal due to drastically different
image characteristics, or rule-based label extraction from the textual report
data paired with medical images, which is inaccurate and hard to generalize. We
propose an alternative unsupervised strategy to learn medical visual
representations directly from the naturally occurring pairing of images and
textual data. Our method of pretraining medical image encoders with the paired
text data via a bidirectional contrastive objective between the two modalities
is domain-agnostic, and requires no additional expert input. We test our method
by transferring our pretrained weights to 4 medical image classification tasks
and 2 zero-shot retrieval tasks, and show that our method leads to image
representations that considerably outperform strong baselines in most settings.
Notably, in all 4 classification tasks, our method requires only 10% as much
labeled training data as an ImageNet initialized counterpart to achieve better
or comparable performance, demonstrating superior data efficiency.
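As a rough illustration of the bidirectional contrastive objective the abstract refers to, the following PyTorch sketch computes a symmetric InfoNCE loss over a batch of paired image and report embeddings; the function name, temperature value, and toy inputs are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(image_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE over a batch of paired image/text embeddings.

    image_emb, text_emb: (N, D) tensors from the two encoders; row i of each
    tensor comes from the same image-report pair.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature       # (N, N) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)            # image -> matching report
    loss_t2i = F.cross_entropy(logits.t(), targets)        # report -> matching image
    return 0.5 * (loss_i2t + loss_t2i)

# toy usage with random features standing in for encoder outputs
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(bidirectional_contrastive_loss(img, txt).item())
```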
When and why vision-language models behave like bags-of-words, and what to do about it?
Despite the success of large vision and language models (VLMs) in many
downstream applications, it is unclear how well they encode compositional
information. Here, we create the Attribution, Relation, and Order (ARO)
benchmark to systematically evaluate the ability of VLMs to understand
different types of relationships, attributes, and order. ARO consists of Visual
Genome Attribution, to test the understanding of objects' properties; Visual
Genome Relation, to test for relational understanding; and COCO &
Flickr30k-Order, to test for order sensitivity. ARO is orders of magnitude
larger than previous benchmarks of compositionality, with more than 50,000 test
cases. We show where state-of-the-art VLMs have poor relational understanding,
can blunder when linking objects to their attributes, and demonstrate a severe
lack of order sensitivity. VLMs are predominantly trained and evaluated on
large datasets with rich compositional structure in the images and captions.
Yet, training on these datasets has not been enough to address the lack of
compositional understanding, and evaluating on these datasets has failed to
surface this deficiency. To understand why these limitations emerge and are not
represented in the standard tests, we zoom into the evaluation and training
procedures. We demonstrate that it is possible to perform well on retrieval
over existing datasets without using the composition and order information.
Given that contrastive pretraining optimizes for retrieval on datasets with
similar shortcuts, we hypothesize that this can explain why the models do not
need to learn to represent compositional information. This finding suggests a
natural solution: composition-aware hard negative mining. We show that a
simple-to-implement modification of contrastive learning significantly improves
the performance on tasks requiring understanding of order and compositionality.Comment: ICLR 2023 Oral (notable-top-5%
GeoLM: Empowering Language Models for Geospatially Grounded Language Understanding
Humans subconsciously engage in geospatial reasoning when reading articles.
We recognize place names and their spatial relations in text and mentally
associate them with their physical locations on Earth. Although pretrained
language models can mimic this cognitive process using linguistic context, they
do not utilize valuable geospatial information in large, widely available
geographical databases, e.g., OpenStreetMap. This paper introduces GeoLM, a
geospatially grounded language model that enhances the understanding of
geo-entities in natural language. GeoLM leverages geo-entity mentions as
anchors to connect linguistic information in text corpora with geospatial
information extracted from geographical databases. GeoLM connects the two types
of context through contrastive learning and masked language modeling. It also
incorporates a spatial coordinate embedding mechanism to encode distance and
direction relations to capture geospatial context. In the experiment, we
demonstrate that GeoLM exhibits promising capabilities in supporting toponym
recognition, toponym linking, relation extraction, and geo-entity typing, which
bridge the gap between natural language processing and geospatial sciences. The
code is publicly available at https://github.com/knowledge-computing/geolm.
Comment: Accepted to EMNLP23 main conference
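A spatial coordinate embedding of the general kind mentioned in the abstract could, for example, expand longitude and latitude with sinusoids at several frequencies so that nearby places receive similar vectors. The sketch below is a hypothetical construction; GeoLM's actual mechanism, dimensions, and frequency choices may differ.

```python
import math
import torch

def coordinate_embedding(lon_lat: torch.Tensor, dim: int = 64) -> torch.Tensor:
    """Map (longitude, latitude) in degrees to a fixed-size sinusoidal embedding.

    lon_lat: (N, 2) tensor. Each coordinate is expanded with sine/cosine at
    several frequencies so nearby places get similar embeddings.
    """
    n_freqs = dim // 4                     # 2 coords * 2 (sin, cos) per frequency
    freqs = torch.tensor([1.0 / (10000 ** (i / n_freqs)) for i in range(n_freqs)])
    angles = lon_lat.unsqueeze(-1) * math.pi / 180.0 * freqs        # (N, 2, n_freqs)
    emb = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # (N, 2, 2*n_freqs)
    return emb.flatten(start_dim=1)        # (N, dim)

coords = torch.tensor([[-118.24, 34.05], [2.35, 48.86]])  # Los Angeles, Paris
print(coordinate_embedding(coords).shape)  # torch.Size([2, 64])
```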
Learning a Unified Vision-Language Representation Space with a Single-Tower CLIP
Thesis (Master's) -- Seoul National University, Graduate School of Convergence Science and Technology, Department of Intelligence and Information Convergence, February 2023. Advisor: Nojun Kwak.
Contrastive learning is widely adopted in self-supervised representation learning (SSL) to learn common attributes from similar sample pairs. In this paper, we boldly hypothesize that an image and its caption can simply be regarded as two different views of an underlying semantic, and aim to build a unified vision-language representation space by inducing a one-tower transformer that can encode both types of data samples in a modality-agnostic manner. We show that naively applying typical SSL frameworks to vision-language pretraining (VLP) fails to train a generic one-tower model due to a severe modality gap, and propose One Representation (OneR) to mitigate the disparity. We explore emerging properties of OneR that distinguish it from prior works with modality-specific representation spaces, such as zero-shot object localization, text-guided visual reasoning, and multi-modal retrieval, and analyze our novel multi-modal representation learning. Comprehensive evaluations demonstrate the potential of a modality-agnostic VLP framework with a unified representation space.
1 Introduction
2 Related Works
2.1 Self-Supervised Learning
2.2 Vision-Language Pretraining
2.3 Unified Vision-Language Framework
3 Overcoming Modality Gap
3.1 Cross-Modal Mixup
4 Modality-agnostic Representations
4.1 Contextual Modality Invariance
4.2 Contextual Mixup Contrast
4.3 Theoretical Explanation of CMC
4.4 One Representation
5 Experiment
5.1 Experimental Setup
5.1.1 Datasets
5.1.2 Implementation Details
5.2 Qualitative Results
5.2.1 Zero-shot Localization
5.2.2 Text-guided Visual Reasoning
5.2.3 Multi-modal Retrieval
5.3 Visual Reasoning Analysis
5.3.1 Robustness
5.3.2 Multi-level Vision-Language Connection
5.4 Quantitative Results
5.4.1 Image-text Retrieval
5.4.2 Cross-modal Knowledge Transfer
5.5 Ablation Study
5.5.1 Proposed Loss Ablation
5.5.2 Masked Modeling Ablation
6 Discussion
7 Conclusion
Abstract (In Korean)
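As a rough sketch of the one-tower idea described in the thesis abstract above, the PyTorch snippet below shares a single transformer encoder between text tokens and image patches, with only the input embeddings being modality-specific, and adds a simple cross-modal mixup of embedded sequences; the class and function names, dimensions, and mixup form are illustrative assumptions, not the thesis's exact design.

```python
import torch
import torch.nn as nn

class OneTowerEncoder(nn.Module):
    """Single transformer that encodes either image patches or text tokens.

    Modality-specific layers only embed the input; all attention layers are
    shared, so both modalities land in one representation space.
    """
    def __init__(self, vocab_size=30522, patch_dim=768, dim=512, depth=4):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        self.patch_embed = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x, modality):
        tokens = self.text_embed(x) if modality == "text" else self.patch_embed(x)
        return self.shared_encoder(tokens).mean(dim=1)   # pooled representation

def cross_modal_mixup(img_tokens, txt_tokens, alpha=0.5):
    """Interpolate embedded image and text sequences to bridge the modality gap."""
    length = min(img_tokens.size(1), txt_tokens.size(1))
    return alpha * img_tokens[:, :length] + (1 - alpha) * txt_tokens[:, :length]

model = OneTowerEncoder()
text_repr = model(torch.randint(0, 30522, (2, 16)), modality="text")
image_repr = model(torch.randn(2, 49, 768), modality="image")
mixed = cross_modal_mixup(model.patch_embed(torch.randn(2, 49, 768)),
                          model.text_embed(torch.randint(0, 30522, (2, 16))))
print(text_repr.shape, image_repr.shape, mixed.shape)
```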
MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining
This paper presents a simple yet effective framework MaskCLIP, which
incorporates a newly proposed masked self-distillation into contrastive
language-image pretraining. The core idea of masked self-distillation is to
distill representation from a full image to the representation predicted from a
masked image. Such incorporation enjoys two vital benefits. First, masked
self-distillation targets local patch representation learning, which is
complementary to the vision-language contrastive objective that focuses on
text-related representation. Second, masked self-distillation is also
consistent with the vision-language contrastive objective, as both utilize the
visual encoder for feature alignment, and it is thus able to learn local
semantics with indirect supervision from the language. We provide
specially designed experiments with a comprehensive analysis to validate the
two benefits. Symmetrically, we also introduce the local semantic supervision
into the text branch, which further improves the pretraining performance. With
extensive experiments, we show that MaskCLIP, when applied to various
challenging downstream tasks, achieves superior results in linear probing,
finetuning, and zero-shot performance with the guidance of the language
encoder. Code will be released at \url{https://github.com/LightDXY/MaskCLIP}.
Comment: CVPR 2023, code is available at https://github.com/LightDXY/MaskCLIP
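Masked self-distillation as described in the abstract can be sketched as a student encoder that sees a masked image and is trained to match the representation an EMA teacher produces from the full image. In the toy version below, the element-wise mask stands in for patch masking, and the loss, momentum value, and tiny encoders are assumptions for illustration only.

```python
import copy
import torch
import torch.nn.functional as F

def masked_self_distillation_step(student, teacher, images, mask_ratio=0.5):
    """One masked self-distillation step: the student sees a masked image,
    the teacher sees the full image, and the student matches the teacher."""
    with torch.no_grad():
        target = teacher(images)                      # full-image representation
    mask = (torch.rand_like(images) > mask_ratio).float()
    prediction = student(images * mask)               # masked-image representation
    return F.smooth_l1_loss(prediction, target)

def update_teacher(student, teacher, momentum=0.996):
    """EMA update keeping the teacher a slow-moving copy of the student."""
    for p_s, p_t in zip(student.parameters(), teacher.parameters()):
        p_t.data.mul_(momentum).add_(p_s.data, alpha=1 - momentum)

# toy encoders standing in for the visual encoder
student = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 128))
teacher = copy.deepcopy(student)
images = torch.randn(4, 3, 32, 32)
loss = masked_self_distillation_step(student, teacher, images)
loss.backward()
update_teacher(student, teacher)
print(loss.item())
```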
- …