Deep learning for histopathology has been successfully used for disease
classification, image segmentation and more. However, combining image and text
modalities using current state-of-the-art methods has been a challenge due to
the high resolution of histopathology images. Automatic report generation for
histopathology images is one such challenge. In this work, we show that an
existing pre-trained Vision Transformer (ViT) can be used in a two-step
process: it first encodes 4096x4096 patches of the Whole Slide Image (WSI), and
it then serves as the encoder, paired with an LSTM decoder, for report
generation. This yields a fairly performant and portable report generation
mechanism that takes into account the whole of the high resolution image,
instead of just the patches. We
are also able to use representations from an existing powerful pre-trained
hierarchical vision transformer and show its usefulness not just in zero-shot
classification but also in report generation.

Comment: Extended Abstract presented at the Machine Learning for Health (ML4H)
symposium 2023, December 10th, 2023, New Orleans, United States, 9 pages
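
As an illustration of the first step described above, the following minimal sketch enumerates the 4096x4096 regions that cover a slide, so that each region could be passed to a pre-trained encoder and the resulting sequence fed to a decoder. It is not the authors' implementation: the function names, the row-major ordering, and the clipping of edge regions to the slide bounds are assumptions made for this example only.

```python
from typing import Iterator, Tuple

# Region edge length in pixels, as stated in the abstract.
REGION_SIZE = 4096

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def iter_region_boxes(width: int, height: int, size: int = REGION_SIZE) -> Iterator[Box]:
    """Yield boxes that tile a width x height slide in row-major order.

    Assumption: regions on the right and bottom edges are clipped to the
    slide bounds rather than padded or discarded.
    """
    if width <= 0 or height <= 0 or size <= 0:
        raise ValueError("width, height and size must be positive")
    for top in range(0, height, size):
        for left in range(0, width, size):
            yield (left, top, min(left + size, width), min(top + size, height))


def region_grid_shape(width: int, height: int, size: int = REGION_SIZE) -> Tuple[int, int]:
    """Return (rows, cols) of the region grid, using ceiling division."""
    if width <= 0 or height <= 0 or size <= 0:
        raise ValueError("width, height and size must be positive")
    cols = -(-width // size)
    rows = -(-height // size)
    return rows, cols
```

For a hypothetical 10000x5000 slide, `region_grid_shape(10000, 5000)` gives a grid of 2 rows by 3 columns, and `iter_region_boxes` yields the six corresponding boxes. In a full pipeline, each box would be read from the slide, encoded, and the per-region embeddings would form the input sequence for report generation.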