3 research outputs found
CPIA Dataset: A Comprehensive Pathological Image Analysis Dataset for Self-supervised Learning Pre-training
Pathological image analysis is a crucial field in computer-aided diagnosis,
where deep learning is widely applied. Transfer learning from models
pre-trained on natural images has improved performance on downstream
pathological tasks. However, the lack of domain-specific pre-training on
pathological images limits the potential of these models. Self-supervised learning
(SSL) enables pre-training without sample-level labels, which has great
potential to overcome the challenge of expensive annotations. Thus, studies
focusing on pathological SSL pre-training call for a comprehensive and
standardized dataset, similar to ImageNet in computer vision. This paper
presents the comprehensive pathological image analysis (CPIA) dataset, a
large-scale SSL pre-training dataset combining 103 open-source datasets with
extensive standardization. The CPIA dataset contains 21,427,877 standardized
images covering over 48 organs/tissues and about 100 kinds of diseases, in
two main data types: whole slide images (WSIs) and characteristic
regions of interest (ROIs). A four-scale WSI standardization process is
proposed based on a uniform resolution in microns per pixel (MPP), while the
ROIs are manually divided into three scales. This multi-scale dataset is
built to follow diagnostic habits, under the supervision of experienced senior
pathologists. The CPIA dataset facilitates a comprehensive pathological
understanding and enables the exploration of pattern discovery. Additionally, to
accompany the CPIA dataset, several state-of-the-art (SOTA) baselines for SSL
pre-training and downstream evaluation are established. The CPIA
dataset along with baselines is available at
https://github.com/zhanglab2021/CPIA_Dataset
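
The MPP-based standardization described above can be illustrated with a small sketch. The abstract does not state the four target resolutions or the patch size, so the values below (HYPOTHETICAL_TARGET_MPPS, patch_px=224) and all function names are assumptions chosen for illustration, not the CPIA pipeline itself. The sketch only shows the arithmetic of mapping a slide scanned at one MPP onto a set of uniform target MPPs.

```python
# Illustrative sketch only: the target MPP values and patch size are
# assumptions, not values taken from the CPIA dataset description.
HYPOTHETICAL_TARGET_MPPS = (0.5, 1.0, 2.0, 4.0)


def rescale_factor(source_mpp: float, target_mpp: float) -> float:
    """Factor applied to pixel dimensions to move from source_mpp to target_mpp.

    A larger MPP means each pixel covers more tissue, so moving to a larger
    target MPP shrinks the image (factor below 1).
    """
    if source_mpp <= 0 or target_mpp <= 0:
        raise ValueError("MPP values must be positive")
    return source_mpp / target_mpp


def source_window(patch_px: int, source_mpp: float, target_mpp: float) -> int:
    """Side length (in source pixels) of the region that becomes a
    patch_px x patch_px patch once resampled to target_mpp."""
    if source_mpp <= 0 or target_mpp <= 0:
        raise ValueError("MPP values must be positive")
    return round(patch_px * target_mpp / source_mpp)


def plan_scales(source_mpp, patch_px=224, target_mpps=HYPOTHETICAL_TARGET_MPPS):
    """List the resize factor and source crop size for each target scale."""
    return [
        {
            "target_mpp": t,
            "scale": rescale_factor(source_mpp, t),
            "window_px": source_window(patch_px, source_mpp, t),
        }
        for t in target_mpps
    ]


if __name__ == "__main__":
    # Example: a slide scanned at 0.25 microns per pixel.
    for row in plan_scales(0.25):
        print(row)
```

Cropping a window of window_px source pixels and resizing it to patch_px gives patches whose physical extent is comparable across slides scanned at different native resolutions, which is the point of standardizing on MPP rather than on raw magnification.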
One Model is All You Need: Multi-Task Learning Enables Simultaneous Histology Image Segmentation and Classification
The recent surge in performance for image analysis of digitised pathology
slides can largely be attributed to the advance of deep learning. Deep models
can be used to initially localise various structures in the tissue and hence
facilitate the extraction of interpretable features for biomarker discovery.
However, these models are typically trained for a single task and therefore
scale poorly as we wish to adapt the model for an increasing number of
different tasks. Also, supervised deep learning models are data-hungry and
therefore rely on large amounts of training data to perform well. In this paper,
we present a multi-task learning approach for segmentation and classification
of nuclei, glands, lumen and different tissue regions that leverages data from
multiple independent data sources. While ensuring that our tasks are aligned by
the same tissue type and resolution, we enable simultaneous prediction with a
single network. As a result of feature sharing, we also show that the learned
representation can be used to improve downstream tasks, including nuclear
classification and signet ring cell detection. As part of this work, we use a
large dataset consisting of over 600K objects for segmentation and 440K patches
for classification and make the data publicly available. We use our approach to
process the colorectal subset of TCGA, consisting of 599 whole-slide images, to
localise 377 million, 900K and 2.1 million nuclei, glands and lumen
respectively. We make this resource available to remove a major barrier in the
development of explainable models for computational pathology.
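
One practical consequence of training a single network on multiple independent data sources is that a given sample usually carries labels for only some of the tasks. A minimal sketch of one common way to handle this, assuming a weighted sum of per-task losses that skips unlabeled tasks, is shown below. The abstract does not specify the loss used in the paper, so the function name, the use of None to mark a missing annotation, and the weighting scheme are all assumptions.

```python
from typing import Dict, Optional


def multitask_loss(
    per_task_losses: Dict[str, Optional[float]],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Combine per-task losses for one sample (or batch) into a single value.

    per_task_losses maps a task name to its loss, or to None when the data
    source provides no annotation for that task. Tasks without labels are
    skipped, so they contribute nothing to the total. Tasks missing from
    `weights` default to a weight of 1.0.
    """
    weights = weights or {}
    total = 0.0
    for task, loss in per_task_losses.items():
        if loss is None:
            continue  # no annotation for this task in this data source
        total += weights.get(task, 1.0) * loss
    return total


if __name__ == "__main__":
    # A sample from a hypothetical source annotated for nuclei and tissue
    # type, but not for glands or lumen.
    sample = {"nuclei": 0.5, "glands": None, "lumen": None, "tissue": 0.25}
    print(multitask_loss(sample))
    print(multitask_loss(sample, weights={"tissue": 2.0}))
```

In a real training loop the per-task values would come from the task-specific output heads of the shared network; plain floats are used here only to keep the sketch self-contained.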