In this paper, we consider enhancing medical visual-language pre-training
(VLP) with domain-specific knowledge, by exploiting the paired image-text
reports from the radiological daily practice. In particular, we make the
following contributions: First, unlike existing works that directly process the
raw reports, we adopt a novel triplet extraction module to extract the
medical-related information, avoiding unnecessary complexity from language
grammar and enhancing the supervision signals; Second, we propose a novel
triplet encoding module with entity translation by querying a knowledge base,
to exploit the rich domain knowledge in medical field, and implicitly build
relationships between medical entities in the language embedding space; Third,
we propose to use a Transformer-based fusion model for spatially aligning the
entity description with visual signals at the image patch level, enabling the
ability for medical diagnosis; Fourth, we conduct thorough experiments to
validate the effectiveness of our architecture, and benchmark on numerous
public benchmarks, e.g., ChestX-ray14, RSNA Pneumonia, SIIM-ACR Pneumothorax,
COVIDx CXR-2, COVID Rural, and EdemaSeverity. In both zero-shot and fine-tuning
settings, our model has demonstrated strong performance compared with the
former methods on disease classification and grounding