179 research outputs found
What a Whole Slide Image Can Tell? Subtype-guided Masked Transformer for Pathological Image Captioning
Pathological captioning of Whole Slide Images (WSIs), though essential in
computer-aided pathological diagnosis, has rarely been studied due to
limitations in datasets and model training efficacy. In this paper, we propose
a new paradigm, the Subtype-guided Masked Transformer (SGMT), for pathological
captioning based on Transformers, which treats a WSI as a sequence of sparse
patches and generates an overall caption sentence from the sequence. An
accompanying subtype prediction is introduced into SGMT to guide the training
process and enhance the captioning accuracy. We also present an Asymmetric
Masked Mechanism to tackle the large-size constraint of pathological image
captioning, in which SGMT samples different numbers of patches in the training
and inference phases. Experiments on
the PatchGastricADC22 dataset demonstrate that our approach effectively adapts
to the task with a transformer-based model and outperforms traditional
RNN-based methods. Our code will be made available for further research and
development.
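As a rough illustration of the asymmetric sampling idea described above, the following PyTorch sketch (with made-up patch counts and feature sizes, not the authors' released code) keeps a random subset of WSI patch embeddings during training and the full sequence at inference.

```python
import torch

def sample_patches(patch_feats: torch.Tensor, n_train: int, training: bool) -> torch.Tensor:
    """Asymmetric patch sampling: keep a random subset of WSI patch embeddings
    during training, but the full sequence at inference.

    patch_feats: (num_patches, dim) embeddings of sparse WSI patches.
    n_train:     patches kept per WSI during training (hypothetical value).
    """
    if not training or patch_feats.size(0) <= n_train:
        return patch_feats                                # inference: use every available patch
    idx = torch.randperm(patch_feats.size(0))[:n_train]   # uniform subset keeps memory bounded
    return patch_feats[idx]

# toy usage with made-up sizes: 5000 patches of dimension 512 from one WSI
feats = torch.randn(5000, 512)
train_seq = sample_patches(feats, n_train=300, training=True)    # -> (300, 512)
infer_seq = sample_patches(feats, n_train=300, training=False)   # -> (5000, 512)
```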
Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation
Automatic radiology report generation has attracted enormous research
interest due to its practical value in reducing the workload of radiologists.
However, simultaneously establishing global correspondences between the image
(e.g., Chest X-ray) and its related report and local alignments between image
patches and keywords remains challenging. To this end, we propose an Unify,
Align and then Refine (UAR) approach to learn multi-level cross-modal
alignments and introduce three novel modules: Latent Space Unifier (LSU),
Cross-modal Representation Aligner (CRA) and Text-to-Image Refiner (TIR).
Specifically, LSU unifies multimodal data into discrete tokens, making it
flexible to learn common knowledge among modalities with a shared network. The
modality-agnostic CRA first learns discriminative features via a set of
orthonormal bases and a dual-gate mechanism, and then globally aligns visual and
textual representations under a triplet contrastive loss. TIR boosts
token-level local alignment by calibrating text-to-image attention with a
learnable mask. Additionally, we design a two-stage training procedure to make
UAR gradually grasp cross-modal alignments at different levels, which imitates
radiologists' workflow: writing sentence by sentence first and then checking
word by word. Extensive experiments and analyses on IU-Xray and MIMIC-CXR
benchmark datasets demonstrate the superiority of our UAR against varied
state-of-the-art methods.
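For readers unfamiliar with the global-alignment component, the snippet below is a minimal, hypothetical PyTorch sketch of a triplet contrastive loss between pooled image and report embeddings, using mismatched in-batch reports as negatives; the actual CRA module and its dual-gate mechanism are not reproduced here.

```python
import torch
import torch.nn.functional as F

def global_triplet_alignment_loss(img_emb: torch.Tensor,
                                  txt_emb: torch.Tensor,
                                  margin: float = 0.3) -> torch.Tensor:
    """Global image-report alignment with a triplet loss.

    img_emb, txt_emb: (batch, dim) pooled visual / textual representations,
    where row i of txt_emb is the report paired with image i (the positive);
    the report of another study in the batch serves as the negative.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    neg = txt.roll(shifts=1, dims=0)          # mismatched reports act as in-batch negatives
    return F.triplet_margin_loss(img, txt, neg, margin=margin)

# toy usage with random 256-d embeddings for a batch of 8 image-report pairs
loss = global_triplet_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```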
Bidirectional Captioning for Clinically Accurate and Interpretable Models
Vision-language pretraining has been shown to produce high-quality visual
encoders which transfer efficiently to downstream computer vision tasks. While
generative language models have gained widespread attention, image captioning
has thus far been mostly overlooked as a form of cross-modal pretraining in
favor of contrastive learning, especially in medical image analysis. In this
paper, we experiment with bidirectional captioning of radiology reports as a
form of pretraining and compare the quality and utility of learned embeddings
with those from contrastive pretraining methods. We optimize a CNN-encoder,
transformer-decoder architecture named RadTex for the radiology domain. Results
show that not only does captioning pretraining yield visual encoders that are
competitive with contrastive pretraining (CheXpert competition multi-label AUC
of 89.4%), but also that our transformer decoder is capable of generating
clinically relevant reports (captioning macro-F1 score of 0.349 using CheXpert
labeler) and responding to prompts with targeted, interactive outputs.
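The following sketch illustrates the general CNN-encoder / transformer-decoder captioning setup the abstract describes; the sizes, the ResNet-18 backbone, and the naive reversed-report pass standing in for bidirectional captioning are illustrative assumptions, not the RadTex implementation.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class CaptionPretrainer(nn.Module):
    """Minimal CNN-encoder / transformer-decoder captioner; the reversed-report
    pass below is a crude stand-in for bidirectional captioning pretraining."""
    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        cnn = resnet18(weights=None)                              # assumed backbone
        self.encoder = nn.Sequential(*list(cnn.children())[:-2])  # keep the conv feature map
        self.proj = nn.Linear(512, dim)
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, images, tokens):
        feats = self.encoder(images)                              # (B, 512, H', W')
        memory = self.proj(feats.flatten(2).transpose(1, 2))      # (B, H'*W', dim) visual memory
        tgt = self.embed(tokens)                                  # (B, T, dim)
        t = tokens.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(out)                                  # (B, T, vocab)

# toy "bidirectional" step: average the left-to-right and right-to-left losses
model = CaptionPretrainer(vocab_size=1000)
images = torch.randn(2, 3, 224, 224)
tokens = torch.randint(0, 1000, (2, 16))
loss_fn = nn.CrossEntropyLoss()
losses = []
for seq in (tokens, tokens.flip(1)):                              # forward and reversed reports
    logits = model(images, seq[:, :-1])
    losses.append(loss_fn(logits.reshape(-1, 1000), seq[:, 1:].reshape(-1)))
loss = torch.stack(losses).mean()
```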
METransformer: Radiology Report Generation by Transformer with Multiple Learnable Expert Tokens
In clinical scenarios, multi-specialist consultation could significantly
benefit the diagnosis, especially for intricate cases. This inspires us to
explore a "multi-expert joint diagnosis" mechanism to upgrade the existing
"single expert" framework commonly seen in the current literature. To this end,
we propose METransformer, a method to realize this idea with a
transformer-based backbone. The key design of our method is the introduction of
multiple learnable "expert" tokens into both the transformer encoder and
decoder. In the encoder, each expert token interacts with both vision tokens
and other expert tokens to learn to attend to different image regions for image
representation. These expert tokens are encouraged to capture complementary
information by an orthogonal loss that minimizes their overlap. In the decoder,
each attended expert token guides the cross-attention between input words and
visual tokens, thus influencing the generated report. A metrics-based expert
voting strategy is further developed to generate the final report. Through the
multi-expert concept, our model enjoys the merits of an ensemble-based
approach, but in a manner that is computationally more efficient and
supports more sophisticated interactions among experts. Experimental results
demonstrate the promising performance of our proposed model on two widely used
benchmarks. Last but not least, the framework-level innovation makes our work
ready to incorporate advances in existing "single-expert" models to further
improve its performance.
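To make the expert-token idea concrete, here is a hypothetical PyTorch sketch in which learnable expert tokens are prepended to the vision tokens and an orthogonality penalty discourages overlap between them; the layer sizes and the exact form of the loss are assumptions for illustration, not the METransformer implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertTokenEncoder(nn.Module):
    """Prepends M learnable 'expert' tokens to the vision-token sequence so each
    expert can attend to patches and to the other experts; an orthogonality
    penalty discourages the experts from encoding the same information."""
    def __init__(self, dim: int = 256, num_experts: int = 4, num_layers: int = 2):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(num_experts, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, vision_tokens: torch.Tensor):
        b = vision_tokens.size(0)
        experts = self.experts.unsqueeze(0).expand(b, -1, -1)     # (B, M, D)
        x = self.encoder(torch.cat([experts, vision_tokens], dim=1))
        expert_out = x[:, : self.experts.size(0)]                 # encoded expert tokens

        # orthogonal loss: penalise pairwise cosine overlap between expert tokens
        e = F.normalize(expert_out, dim=-1)
        gram = e @ e.transpose(1, 2)                               # (B, M, M)
        ortho_loss = (gram - torch.eye(gram.size(-1))).pow(2).mean()
        return expert_out, ortho_loss

# toy usage: 4 expert tokens over 49 patch tokens of dimension 256
experts, ortho = ExpertTokenEncoder()(torch.randn(2, 49, 256))
```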
Self adaptive global-local feature enhancement for radiology report generation
Automated radiology report generation aims at automatically generating a
detailed description of medical images, which can greatly alleviate the
workload of radiologists and provide better medical services to remote areas.
Most existing works pay attention to the holistic impression of medical images,
failing to utilize important anatomy information. However, in actual clinical
practice, radiologists usually locate important anatomical structures, and then
look for signs of abnormalities in certain structures and reason the underlying
disease. In this paper, we propose a novel framework, AGFNet, to dynamically
fuse global and anatomical region features and generate multi-grained radiology
reports. First, we extract important anatomy region features and global
features of the input Chest X-ray (CXR). Then, with the region features and the
global features as input, our proposed self-adaptive fusion gate module
dynamically fuses multi-granularity information. Finally, the captioning
generator produces the radiology report from the multi-granularity features.
Experimental results show that our model achieves state-of-the-art
performance on two benchmark datasets, IU X-Ray and MIMIC-CXR.
Further analyses also show that our model leverages
multi-grained information from radiology images and texts to help
generate more accurate reports.
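Below is a minimal sketch of a gated global/region fusion, assuming mean-pooled region features and a per-dimension sigmoid gate; it illustrates the general mechanism, not the AGFNet module itself.

```python
import torch
import torch.nn as nn

class AdaptiveFusionGate(nn.Module):
    """Gated fusion of a global image feature with pooled anatomy-region features:
    a learned sigmoid gate decides, per dimension, how much of each source to keep."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, global_feat: torch.Tensor, region_feats: torch.Tensor) -> torch.Tensor:
        # global_feat: (B, D); region_feats: (B, R, D) features of detected anatomical regions
        region_summary = region_feats.mean(dim=1)           # simple pooling over regions
        g = self.gate(torch.cat([global_feat, region_summary], dim=-1))
        return g * global_feat + (1 - g) * region_summary   # fused multi-granularity feature

# toy usage: one CXR global feature plus 6 region features, all 512-d
fused = AdaptiveFusionGate()(torch.randn(2, 512), torch.randn(2, 6, 512))
```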
Egocentric Image Captioning for Privacy-Preserved Passive Dietary Intake Monitoring
Camera-based passive dietary intake monitoring is able to continuously
capture the eating episodes of a subject, recording rich visual information,
such as the type and volume of food being consumed, as well as the eating
behaviours of the subject. However, there currently is no method that is able
to incorporate these visual clues and provide a comprehensive context of
dietary intake from passive recording (e.g., is the subject sharing food with
others, what food the subject is eating, and how much food is left in the
bowl). On the other hand, privacy is a major concern when egocentric wearable
cameras are used for capturing. In this paper, we propose a privacy-preserved
secure solution (i.e., egocentric image captioning) for dietary assessment with
passive monitoring, which unifies food recognition, volume estimation, and
scene understanding. By converting images into rich text descriptions,
nutritionists can assess individual dietary intake based on the captions
instead of the original images, reducing the risk of privacy leakage from
images. To this end, an egocentric dietary image captioning dataset has been
built, which consists of in-the-wild images captured by head-worn and
chest-worn cameras in field studies in Ghana. A novel transformer-based
architecture is designed to caption egocentric dietary images. Comprehensive
experiments have been conducted to evaluate the effectiveness and to justify
the design of the proposed architecture for egocentric dietary image
captioning. To the best of our knowledge, this is the first work that applies
image captioning to dietary intake assessment in real-life settings.
Competence-based Multimodal Curriculum Learning for Medical Report Generation
The medical report generation task, which aims to produce long and coherent
descriptions of medical images, has attracted growing research interest
recently. Different from general image captioning tasks, medical report
generation is more challenging for data-driven neural models. This is mainly
due to 1) the serious data bias and 2) the limited medical data. To alleviate
the data bias and make the best use of available data, we propose a
Competence-based Multimodal Curriculum Learning framework (CMCL). Specifically,
CMCL simulates the learning process of radiologists and optimizes the model in
a step-by-step manner. First, CMCL estimates the difficulty of each training
instance and evaluates the competence of the current model; second, CMCL selects
the most suitable batch of training instances according to the current model's
competence. By iterating these two steps, CMCL can gradually improve the
model's performance. The experiments on the public IU-Xray and MIMIC-CXR
datasets show that CMCL can be incorporated into existing models to improve
their performance.
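The snippet below sketches one plausible form of competence-based instance selection: instances whose estimated difficulty exceeds the model's current competence are excluded from sampling. The difficulty scores, the competence schedule, and the fallback rule are hypothetical, not the CMCL formulation.

```python
import torch

def select_curriculum_batch(difficulty: torch.Tensor,
                            competence: float,
                            batch_size: int) -> torch.Tensor:
    """Competence-based batch selection: only instances whose difficulty does not
    exceed the model's current competence (a value that grows from 0 to 1 over
    training) are eligible, and a batch is drawn uniformly (with replacement)
    from that pool.

    difficulty: (N,) per-instance difficulty scores normalised to [0, 1].
    Returns indices of the selected training instances.
    """
    eligible = (difficulty <= competence).nonzero(as_tuple=True)[0]
    if eligible.numel() == 0:                  # early on, fall back to the easiest instances
        eligible = difficulty.argsort()[:batch_size]
    pick = torch.randint(eligible.numel(), (min(batch_size, eligible.numel()),))
    return eligible[pick]

# toy usage: competence grows with the training step (hypothetical schedule)
difficulty = torch.rand(1000)
for step in range(1, 4):
    competence = min(1.0, 0.2 * step)
    batch_idx = select_curriculum_batch(difficulty, competence, batch_size=16)
```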
- …