Know your sensORs -- A Modality Study For Surgical Action Classification
The surgical operating room (OR) presents many opportunities for automation
and optimization. Videos from various sources in the OR are becoming
increasingly available. The medical community seeks to leverage this wealth of
data to develop automated methods to advance interventional care, lower costs,
and improve overall patient outcomes. Existing datasets from OR cameras
are thus far limited in size or modalities acquired, leaving it unclear which
sensor modalities are best suited for tasks such as recognizing surgical action
from videos. This study demonstrates that surgical action recognition
performance can vary depending on the image modalities used. We perform a
methodical analysis on several commonly available sensor modalities, presenting
two fusion approaches that improve classification performance. The analyses are
carried out on a set of multi-view RGB-D video recordings of 18 laparoscopic
procedures.
Comment: 14 pages, presented at MICCAI 2022 AE-CA
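The abstract does not spell out the two fusion approaches, so the following is only a minimal late-fusion sketch in PyTorch: two 3D-CNN backbones, one per modality, whose clip-level features are concatenated before a linear classifier. The backbone choice (r3d_18), feature size, and the handling of depth as a replicated 3-channel input are assumptions for illustration, not the paper's architecture.

import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class LateFusionActionClassifier(nn.Module):
    """Late fusion of RGB and depth video streams for action classification (sketch)."""
    def __init__(self, num_actions: int):
        super().__init__()
        # Separate 3D-CNN backbones per modality (hypothetical choice).
        self.rgb_backbone = r3d_18(weights=None)
        self.rgb_backbone.fc = nn.Identity()
        self.depth_backbone = r3d_18(weights=None)
        self.depth_backbone.fc = nn.Identity()
        # Concatenate the two 512-d clip embeddings and classify.
        self.classifier = nn.Linear(512 * 2, num_actions)

    def forward(self, rgb_clip, depth_clip):
        # rgb_clip, depth_clip: (B, 3, T, H, W); depth replicated to 3 channels.
        f_rgb = self.rgb_backbone(rgb_clip)
        f_depth = self.depth_backbone(depth_clip)
        return self.classifier(torch.cat([f_rgb, f_depth], dim=1))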
SA-Med2D-20M Dataset: Segment Anything in 2D Medical Imaging with 20 Million masks
Segment Anything Model (SAM) has achieved impressive results for natural
image segmentation with input prompts such as points and bounding boxes. Its
success owes largely to massive labeled training data. However, directly
applying SAM to medical image segmentation cannot perform well because SAM
lacks medical knowledge -- it does not use medical images for training. To
incorporate medical knowledge into SAM, we introduce SA-Med2D-20M, a
large-scale segmentation dataset of 2D medical images built upon numerous
public and private datasets. It consists of 4.6 million 2D medical images and
19.7 million corresponding masks, covering almost the whole body and showing
significant diversity. This paper describes all the datasets collected in
SA-Med2D-20M and details how to process these datasets. Furthermore,
comprehensive statistics of SA-Med2D-20M are presented to facilitate better
use of our dataset, which can help researchers build medical vision
foundation models or apply their models to downstream medical applications. We
hope that the large scale and diversity of SA-Med2D-20M can be leveraged to
develop medical artificial intelligence for enhancing diagnosis, medical image
analysis, knowledge sharing, and education. The data with the redistribution
license is publicly available at https://github.com/OpenGVLab/SAM-Med2D
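The abstract does not describe the released file layout, so the loader below is only a minimal PyTorch Dataset sketch that assumes paired PNG image and mask files sharing a filename; the directory structure, file format, and joint transform are hypothetical.

from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class Seg2DDataset(Dataset):
    """Minimal loader for paired 2D image/mask files (hypothetical layout)."""
    def __init__(self, image_dir: str, mask_dir: str, transform=None):
        self.image_paths = sorted(Path(image_dir).glob("*.png"))
        self.mask_dir = Path(mask_dir)
        self.transform = transform  # expected to transform image and mask jointly

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img_path = self.image_paths[idx]
        image = Image.open(img_path).convert("RGB")
        # Assumes the mask shares the image filename (an assumption, not the
        # dataset's documented structure).
        mask = Image.open(self.mask_dir / img_path.name).convert("L")
        if self.transform is not None:
            image, mask = self.transform(image, mask)
        return image, mask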
Ladder Fine-tuning approach for SAM integrating complementary network
Recently, foundation models have been introduced that address various tasks
in the field of computer vision. These models, such as the Segment Anything Model
(SAM), are generalized models trained on huge datasets. Currently, ongoing
research focuses on exploring the effective utilization of these generalized
models for specific domains, such as medical imaging. However, in medical
imaging, the lack of training samples due to privacy concerns and other factors
presents a major challenge for applying these generalized models to medical
image segmentation tasks. To address this issue, effective fine-tuning of
these models is crucial to ensure their optimal utilization. In this study, we
propose to combine a complementary Convolutional Neural Network (CNN) along
with the standard SAM network for medical image segmentation. To reduce the
burden of fine-tuning the large foundation model and to implement a cost-efficient
training scheme, we fine-tune only the additional CNN network and the SAM decoder.
This strategy significantly reduces training time and
achieves competitive results on a publicly available dataset. The code is
available at https://github.com/11yxk/SAM-LST
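As a rough illustration of the cost-saving idea, the sketch below freezes the SAM image and prompt encoders and leaves only the mask decoder plus a small complementary CNN branch trainable. The checkpoint path and the CNN layers are placeholders; the paper's actual ladder-style integration (see the repository above) is more elaborate.

import torch
import torch.nn as nn
from segment_anything import sam_model_registry

# Load a SAM backbone (checkpoint path is a placeholder).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")

# Freeze the heavy encoders; only the lightweight decoder stays trainable.
for p in sam.image_encoder.parameters():
    p.requires_grad = False
for p in sam.prompt_encoder.parameters():
    p.requires_grad = False

# Small complementary CNN branch trained from scratch (hypothetical layers).
cnn_branch = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
)

# Optimize only the SAM mask decoder and the CNN branch.
trainable = list(sam.mask_decoder.parameters()) + list(cnn_branch.parameters())
optimizer = torch.optim.Adam(trainable, lr=1e-4)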
Ariadne's Thread: Using Text Prompts to Improve Segmentation of Infected Areas from Chest X-ray images
Segmentation of the infected areas of the lung is essential for quantifying
the severity of lung disease like pulmonary infections. Existing medical image
segmentation methods are almost exclusively uni-modal, image-based methods. However,
these image-only methods tend to produce inaccurate results unless trained with
large amounts of annotated data. To overcome this challenge, we propose a
language-driven segmentation method that uses text prompts to improve the
segmentation result. Experiments on the QaTa-COV19 dataset indicate that our
method improves the Dice score by at least 6.09% compared to the uni-modal
methods. Besides, our extended study reveals the flexibility of multi-modal
methods in terms of the information granularity of text and demonstrates that
multi-modal methods have a significant advantage over image-only methods in
terms of the size of training data required.
Comment: Provisional Acceptance by MICCAI 202
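The abstract does not detail how the text prompt enters the network, so the following is only a minimal sketch of one common fusion pattern: encode the prompt with CLIP and add the projected text embedding to the segmentation network's feature map before decoding. The use of CLIP, the feature sizes, and the additive fusion are all assumptions for illustration.

import torch
import torch.nn as nn
import clip  # OpenAI CLIP; an assumption about the text encoder used

class TextGuidedSegHead(nn.Module):
    """Injects a text embedding into image features before decoding (sketch)."""
    def __init__(self, img_channels=256, text_dim=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, img_channels)
        self.decoder = nn.Sequential(
            nn.Conv2d(img_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1),  # single-channel infection-mask logits
        )

    def forward(self, img_feats, text_emb):
        # img_feats: (B, C, H, W), text_emb: (B, text_dim)
        t = self.text_proj(text_emb)[:, :, None, None]
        return self.decoder(img_feats + t)  # additive fusion (one simple choice)

model_clip, _ = clip.load("ViT-B/32", device="cpu")
tokens = clip.tokenize(["bilateral pulmonary infection, lower lobes"])
with torch.no_grad():
    text_emb = model_clip.encode_text(tokens).float()  # (1, 512)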
CLIP Model for Images to Textual Prompts Based on Top-k Neighbors
Text-to-image synthesis, a subfield of multimodal generation, has gained
significant attention in recent years. We propose a cost-effective approach for
image-to-prompt generation that leverages generative models to generate textual
prompts without the need for large amounts of annotated data. We divide our
method into two stages, an offline stage and an online stage, using a combination of
the CLIP model and the K-nearest neighbors (KNN) algorithm. Our method achieves
the highest score of 0.612 among these models, which is 0.013, 0.055, and 0.011
higher than CLIP and CLIP + KNN (top 10), respectively.
Comment: CLIP model, KNN, image-to-prompt
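Since the abstract only names the components, the sketch below shows one plausible arrangement: an offline stage that embeds a bank of candidate prompts with CLIP, and an online stage that embeds the query image and returns its top-k nearest prompts by cosine similarity. The prompt bank, CLIP variant, and k are assumptions.

import torch
import clip
from PIL import Image

# Offline stage (hypothetical): embed a bank of known prompts once.
model, preprocess = clip.load("ViT-B/32", device="cpu")
prompt_bank = ["a photo of a cat", "an oil painting of a castle", "a city at night"]
with torch.no_grad():
    bank_emb = model.encode_text(clip.tokenize(prompt_bank)).float()
    bank_emb /= bank_emb.norm(dim=-1, keepdim=True)

# Online stage: embed the query image and retrieve its nearest prompts.
def top_k_prompts(image_path: str, k: int = 2):
    image = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        img_emb = model.encode_image(image).float()
        img_emb /= img_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ bank_emb.T).squeeze(0)   # cosine similarity to every prompt
    idx = sims.topk(k).indices.tolist()
    return [prompt_bank[i] for i in idx]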
Improving Performance of Private Federated Models in Medical Image Analysis
Federated learning (FL) is a distributed machine learning (ML) approach that
allows data to be trained without being centralized. This approach is
particularly beneficial for medical applications because it addresses some key
challenges associated with medical data, such as privacy, security, and data
ownership. On top of that, FL can improve the quality of ML models used in
medical applications. Medical data is often diverse and can vary significantly
depending on the patient population, making it challenging to develop ML models
that are accurate and generalizable. FL allows medical data to be used from
multiple sources, which can help to improve the quality and generalizability of
ML models. Differential privacy (DP) is a go-to algorithmic tool to make this
process secure and private. In this work, we show that the model performance
can be further improved by employing local steps, a popular approach to
improving the communication efficiency of FL, and tuning the number of
communication rounds. Concretely, given the privacy budget, we show that there is
an optimal number of local steps and communication rounds. We provide theoretical
motivations, further corroborated with experimental evaluations on real-world
medical imaging tasks.
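A much-simplified sketch of the underlying mechanics, not the paper's algorithm: in each round, every client runs a fixed number of local SGD steps, its model delta is clipped and perturbed with Gaussian noise, and the server averages the noisy deltas. Clipping at the update level (rather than per example), the noise scaling, the classification loss, and all hyperparameters are simplifying assumptions.

import copy
import torch
import torch.nn.functional as F

def dp_fedavg_round(global_model, client_loaders, local_steps, lr,
                    clip_norm, noise_mult):
    """One FedAvg round with local steps and noisy, clipped client updates (sketch)."""
    global_params = {n: p.detach().clone() for n, p in global_model.named_parameters()}
    client_updates = []
    for loader in client_loaders:
        local = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local.parameters(), lr=lr)
        it = iter(loader)
        for _ in range(local_steps):              # more local steps -> fewer rounds
            x, y = next(it)                       # assumes enough batches per client
            opt.zero_grad()
            F.cross_entropy(local(x), y).backward()
            opt.step()
        # Client update = local weights minus global weights.
        update = {n: p.detach() - global_params[n] for n, p in local.named_parameters()}
        # Clip the whole update and add Gaussian noise (simplified DP-style step).
        norm = torch.sqrt(sum(u.pow(2).sum() for u in update.values()))
        scale = min(1.0, (clip_norm / (norm + 1e-12)).item())
        update = {n: u * scale + noise_mult * clip_norm * torch.randn_like(u)
                  for n, u in update.items()}
        client_updates.append(update)
    with torch.no_grad():                         # apply the averaged noisy update
        for n, p in global_model.named_parameters():
            p.add_(torch.stack([u[n] for u in client_updates]).mean(0))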
LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching
Obtaining large pre-trained models that can be fine-tuned to new tasks with
limited annotated samples has remained an open challenge for medical imaging
data. While pre-trained deep networks on ImageNet and vision-language
foundation models trained on web-scale data are prevailing approaches, their
effectiveness on medical tasks is limited due to the significant domain shift
between natural and medical images. To bridge this gap, we introduce LVM-Med,
the first family of deep networks trained on large-scale medical datasets. We
have collected approximately 1.3 million medical images from 55 publicly
available datasets, covering a large number of organs and modalities such as
CT, MRI, X-ray, and Ultrasound. We benchmark several state-of-the-art
self-supervised algorithms on this dataset and propose a novel self-supervised
contrastive learning algorithm using a graph-matching formulation. The proposed
approach makes three contributions: (i) it integrates prior pair-wise image
similarity metrics based on local and global information; (ii) it captures the
structural constraints of feature embeddings through a loss function
constructed via a combinatorial graph-matching objective; and (iii) it can be
trained efficiently end-to-end using modern gradient-estimation techniques for
black-box solvers. We thoroughly evaluate the proposed LVM-Med on 15 downstream
medical tasks ranging from segmentation and classification to object detection,
in both in-distribution and out-of-distribution settings. LVM-Med empirically
outperforms a number of state-of-the-art supervised, self-supervised, and
foundation models. For challenging tasks such as Brain Tumor Classification or
Diabetic Retinopathy Grading, LVM-Med improves previous vision-language models
trained on 1 billion masks by 6-7% while using only a ResNet-50.
Comment: Update Appendix
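As a rough illustration of the graph-matching flavour of the objective (heavily simplified: the paper's combinatorial objective, local/global similarity priors, and black-box gradient estimation are not reproduced here), the sketch below matches embeddings of two augmented views with a linear-assignment solver and applies an InfoNCE-style loss to the matched pairs.

import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def matched_contrastive_loss(emb_a, emb_b, temperature=0.1):
    """Hard-match two views' embeddings, then contrast the matched pairs (sketch)."""
    a = F.normalize(emb_a, dim=1)                # (N, D)
    b = F.normalize(emb_b, dim=1)                # (N, D)
    sim = a @ b.t()                              # pairwise cosine similarity
    # Solve the assignment on the detached similarity matrix (matching itself
    # is non-differentiable; the paper handles this with gradient estimation).
    row, col = linear_sum_assignment(-sim.detach().cpu().numpy())
    b_matched = b[torch.as_tensor(col, device=b.device)]
    logits = (a @ b_matched.t()) / temperature   # diagonal entries = matched pairs
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)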
Comparative validation of machine learning algorithms for surgical workflow and skill analysis with the HeiChole benchmark
Purpose: Surgical workflow and skill analysis are key technologies for the next generation of cognitive surgical assistance systems. These systems could increase the safety of the operation through context-sensitive warnings and semi-autonomous robotic assistance or improve training of surgeons via data-driven feedback. In surgical workflow analysis up to 91% average precision has been reported for phase recognition on an open data single-center video dataset. In this work we investigated the generalizability of phase recognition algorithms in a multicenter setting including more difficult recognition tasks such as surgical action and surgical skill.
Methods: To achieve this goal, a dataset with 33 laparoscopic cholecystectomy videos from three surgical centers with a total operation time of 22 h was created. Labels included framewise annotation of seven surgical phases with 250 phase transitions, 5514 occurrences of four surgical actions, 6980 occurrences of 21 surgical instruments from seven instrument categories and 495 skill classifications in five skill dimensions. The dataset was used in the 2019 international Endoscopic Vision challenge, sub-challenge for surgical workflow and skill analysis. Here, 12 research teams trained and submitted their machine learning algorithms for recognition of phase, action, instrument and/or skill assessment.
Results: F1-scores ranged from 23.9% to 67.7% for phase recognition (n = 9 teams) and from 38.5% to 63.8% for instrument presence detection (n = 8 teams), but only from 21.8% to 23.3% for action recognition (n = 5 teams). The average absolute error for skill assessment was 0.78 (n = 1 team).
Conclusion: Surgical workflow and skill analysis are promising technologies to support the surgical team, but there is still room for improvement, as shown by our comparison of machine learning algorithms. This novel HeiChole benchmark can be used for comparable evaluation and validation of future work. In future studies, it is of utmost importance to create more open, high-quality datasets in order to allow the development of artificial intelligence and cognitive robotics in surgery.
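For context on the reported numbers, a frame-wise F1-score for phase recognition can be computed as below; the labels are toy values, and whether the challenge used macro averaging (assumed here) or another aggregation is not stated in the abstract.

from sklearn.metrics import f1_score

# Frame-wise phase labels: ground truth vs. prediction for one short video
# (toy values; the real benchmark has seven phase classes, 0..6).
y_true = [0, 0, 1, 1, 1, 2, 2, 3, 3, 3]
y_pred = [0, 1, 1, 1, 2, 2, 2, 3, 3, 0]

# Macro-averaged F1 weights every phase equally regardless of its duration.
print(f1_score(y_true, y_pred, average="macro"))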
ModelScope Text-to-Video Technical Report
This paper introduces ModelScopeT2V, a text-to-video synthesis model that
evolves from a text-to-image synthesis model (i.e., Stable Diffusion).
ModelScopeT2V incorporates spatio-temporal blocks to ensure consistent frame
generation and smooth movement transitions. The model could adapt to varying
frame numbers during training and inference, rendering it suitable for both
image-text and video-text datasets. ModelScopeT2V brings together three
components (i.e., a VQGAN, a text encoder, and a denoising UNet), comprising
1.7 billion parameters in total, of which 0.5 billion parameters are
dedicated to temporal capabilities. The model demonstrates superior performance
over state-of-the-art methods across three evaluation metrics. The code and an
online demo are available at
https://modelscope.cn/models/damo/text-to-video-synthesis/summary.
Comment: Technical report. Project page: https://modelscope.cn/models/damo/text-to-video-synthesis/summary
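The abstract attributes temporal consistency to spatio-temporal blocks; the sketch below shows a generic factorized spatial-then-temporal convolution block over video latents, which conveys the idea but is not ModelScopeT2V's actual block design. Channel count, normalization, and kernel sizes are illustrative choices.

import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Factorized spatial + temporal convolutions over video latents (generic sketch)."""
    def __init__(self, channels: int):
        super().__init__()
        self.norm1 = nn.GroupNorm(8, channels)
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3),
                                 padding=(0, 1, 1))   # per-frame spatial mixing
        self.norm2 = nn.GroupNorm(8, channels)
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                                  padding=(1, 0, 0))  # mixing across frames
        self.act = nn.SiLU()

    def forward(self, x):
        # x: (B, C, T, H, W) latent video tensor
        x = x + self.act(self.spatial(self.norm1(x)))
        x = x + self.act(self.temporal(self.norm2(x)))
        return x

block = SpatioTemporalBlock(64)
out = block(torch.randn(2, 64, 8, 16, 16))  # 8-frame latent clip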