21 research outputs found

    Whether and When does Endoscopy Domain Pretraining Make Sense?

    Full text link
    Automated endoscopy video analysis is a challenging task in medical computer vision, with the primary objective of assisting surgeons during procedures. The difficulty arises from the complexity of surgical scenes and the lack of a sufficient amount of annotated data. In recent years, large-scale pretraining has shown great success in natural language processing and computer vision communities. These approaches reduce the need for annotated data, which is always a concern in the medical domain. However, most works on endoscopic video understanding use models pretrained on natural images, creating a domain gap between pretraining and finetuning. In this work, we investigate the need for endoscopy domain-specific pretraining based on downstream objectives. To this end, we first collect Endo700k, the largest publicly available corpus of endoscopic images, extracted from nine public Minimally Invasive Surgery (MIS) datasets. Endo700k comprises more than 700,000 unannotated raw images. Next, we introduce EndoViT, an endoscopy pretrained Vision Transformer (ViT). Through ablations, we demonstrate that domain-specific pretraining is particularly beneficial for more complex downstream tasks, such as Action Triplet Detection, and less effective and even unnecessary for simpler tasks, such as Surgical Phase Recognition. We will release both our code and pretrained models upon acceptance to facilitate further research in this direction

    FaceAtt: Enhancing Image Captioning with Facial Attributes for Portrait Images

    Full text link
    Automated image caption generation is a critical area of research that enhances accessibility and understanding of visual content for diverse audiences. In this study, we propose the FaceAtt model, a novel approach to attribute-focused image captioning that emphasizes the accurate depiction of facial attributes within images. FaceAtt automatically detects and describes a wide range of attributes, including emotions, expressions, pointed noses, fair skin tones, hair textures, attractiveness, and approximate age ranges. Leveraging deep learning techniques, we explore the impact of different image feature extraction methods on caption quality and evaluate our model's performance using metrics such as BLEU and METEOR. Our FaceAtt model leverages annotated attributes of portraits as supplementary prior knowledge for our portrait images before captioning. This innovative addition yields a subtle yet discernible enhancement in the resulting scores, exemplifying the potency of incorporating additional attribute vectors during training. Furthermore, our research contributes to the broader discourse on ethical considerations in automated captioning. This study sets the stage for future research in refining attribute-focused captioning techniques, with a focus on enhancing linguistic coherence, addressing biases, and accommodating diverse user needs

    ArSDM: Colonoscopy Images Synthesis with Adaptive Refinement Semantic Diffusion Models

    Full text link
    Colonoscopy analysis, particularly automatic polyp segmentation and detection, is essential for assisting clinical diagnosis and treatment. However, as medical image annotation is labour- and resource-intensive, the scarcity of annotated data limits the effectiveness and generalization of existing methods. Although recent research has focused on data generation and augmentation to address this issue, the quality of the generated data remains a challenge, which limits the contribution to the performance of subsequent tasks. Inspired by the superiority of diffusion models in fitting data distributions and generating high-quality data, in this paper, we propose an Adaptive Refinement Semantic Diffusion Model (ArSDM) to generate colonoscopy images that benefit the downstream tasks. Specifically, ArSDM utilizes the ground-truth segmentation mask as a prior condition during training and adjusts the diffusion loss for each input according to the polyp/background size ratio. Furthermore, ArSDM incorporates a pre-trained segmentation model to refine the training process by reducing the difference between the ground-truth mask and the prediction mask. Extensive experiments on segmentation and detection tasks demonstrate the generated data by ArSDM could significantly boost the performance of baseline methods.Comment: Accepted by MICCAI-202

    S2^2ME: Spatial-Spectral Mutual Teaching and Ensemble Learning for Scribble-supervised Polyp Segmentation

    Full text link
    Fully-supervised polyp segmentation has accomplished significant triumphs over the years in advancing the early diagnosis of colorectal cancer. However, label-efficient solutions from weak supervision like scribbles are rarely explored yet primarily meaningful and demanding in medical practice due to the expensiveness and scarcity of densely-annotated polyp data. Besides, various deployment issues, including data shifts and corruption, put forward further requests for model generalization and robustness. To address these concerns, we design a framework of Spatial-Spectral Dual-branch Mutual Teaching and Entropy-guided Pseudo Label Ensemble Learning (S2^2ME). Concretely, for the first time in weakly-supervised medical image segmentation, we promote the dual-branch co-teaching framework by leveraging the intrinsic complementarity of features extracted from the spatial and spectral domains and encouraging cross-space consistency through collaborative optimization. Furthermore, to produce reliable mixed pseudo labels, which enhance the effectiveness of ensemble learning, we introduce a novel adaptive pixel-wise fusion technique based on the entropy guidance from the spatial and spectral branches. Our strategy efficiently mitigates the deleterious effects of uncertainty and noise present in pseudo labels and surpasses previous alternatives in terms of efficacy. Ultimately, we formulate a holistic optimization objective to learn from the hybrid supervision of scribbles and pseudo labels. Extensive experiments and evaluation on four public datasets demonstrate the superiority of our method regarding in-distribution accuracy, out-of-distribution generalization, and robustness, highlighting its promising clinical significance. Our code is available at https://github.com/lofrienger/S2ME.Comment: MICCAI 2023 Early Acceptanc

    Mask-conditioned latent diffusion for generating gastrointestinal polyp images

    Full text link
    In order to take advantage of AI solutions in endoscopy diagnostics, we must overcome the issue of limited annotations. These limitations are caused by the high privacy concerns in the medical field and the requirement of getting aid from experts for the time-consuming and costly medical data annotation process. In computer vision, image synthesis has made a significant contribution in recent years as a result of the progress of generative adversarial networks (GANs) and diffusion probabilistic models (DPM). Novel DPMs have outperformed GANs in text, image, and video generation tasks. Therefore, this study proposes a conditional DPM framework to generate synthetic GI polyp images conditioned on given generated segmentation masks. Our experimental results show that our system can generate an unlimited number of high-fidelity synthetic polyp images with the corresponding ground truth masks of polyps. To test the usefulness of the generated data, we trained binary image segmentation models to study the effect of using synthetic data. Results show that the best micro-imagewise IOU of 0.7751 was achieved from DeepLabv3+ when the training data consists of both real data and synthetic data. However, the results reflect that achieving good segmentation performance with synthetic data heavily depends on model architectures

    DA-TransUNet: Integrating Spatial and Channel Dual Attention with Transformer U-Net for Medical Image Segmentation

    Full text link
    Accurate medical image segmentation is critical for disease quantification and treatment evaluation. While traditional Unet architectures and their transformer-integrated variants excel in automated segmentation tasks. However, they lack the ability to harness the intrinsic position and channel features of image. Existing models also struggle with parameter efficiency and computational complexity, often due to the extensive use of Transformers. To address these issues, this study proposes a novel deep medical image segmentation framework, called DA-TransUNet, aiming to integrate the Transformer and dual attention block(DA-Block) into the traditional U-shaped architecture. Unlike earlier transformer-based U-net models, DA-TransUNet utilizes Transformers and DA-Block to integrate not only global and local features, but also image-specific positional and channel features, improving the performance of medical image segmentation. By incorporating a DA-Block at the embedding layer and within each skip connection layer, we substantially enhance feature extraction capabilities and improve the efficiency of the encoder-decoder structure. DA-TransUNet demonstrates superior performance in medical image segmentation tasks, consistently outperforming state-of-the-art techniques across multiple datasets. In summary, DA-TransUNet offers a significant advancement in medical image segmentation, providing an effective and powerful alternative to existing techniques. Our architecture stands out for its ability to improve segmentation accuracy, thereby advancing the field of automated medical image diagnostics. The codes and parameters of our model will be publicly available at https://github.com/SUN-1024/DA-TransUnet

    Multi-level feature fusion network combining attention mechanisms for polyp segmentation

    Full text link
    Clinically, automated polyp segmentation techniques have the potential to significantly improve the efficiency and accuracy of medical diagnosis, thereby reducing the risk of colorectal cancer in patients. Unfortunately, existing methods suffer from two significant weaknesses that can impact the accuracy of segmentation. Firstly, features extracted by encoders are not adequately filtered and utilized. Secondly, semantic conflicts and information redundancy caused by feature fusion are not attended to. To overcome these limitations, we propose a novel approach for polyp segmentation, named MLFF-Net, which leverages multi-level feature fusion and attention mechanisms. Specifically, MLFF-Net comprises three modules: Multi-scale Attention Module (MAM), High-level Feature Enhancement Module (HFEM), and Global Attention Module (GAM). Among these, MAM is used to extract multi-scale information and polyp details from the shallow output of the encoder. In HFEM, the deep features of the encoders complement each other by aggregation. Meanwhile, the attention mechanism redistributes the weight of the aggregated features, weakening the conflicting redundant parts and highlighting the information useful to the task. GAM combines features from the encoder and decoder features, as well as computes global dependencies to prevent receptive field locality. Experimental results on five public datasets show that the proposed method not only can segment multiple types of polyps but also has advantages over current state-of-the-art methods in both accuracy and generalization ability

    Ambient Assisted Living: Scoping Review of Artificial Intelligence Models, Domains, Technology, and Concerns

    Get PDF
    Background: Ambient assisted living (AAL) is a common name for various artificial intelligence (AI)—infused applications and platforms that support their users in need in multiple activities, from health to daily living. These systems use different approaches to learn about their users and make automated decisions, known as AI models, for personalizing their services and increasing outcomes. Given the numerous systems developed and deployed for people with different needs, health conditions, and dispositions toward the technology, it is critical to obtain clear and comprehensive insights concerning AI models used, along with their domains, technology, and concerns, to identify promising directions for future work. Objective: This study aimed to provide a scoping review of the literature on AI models in AAL. In particular, we analyzed specific AI models used in AАL systems, the target domains of the models, the technology using the models, and the major concerns from the end-user perspective. Our goal was to consolidate research on this topic and inform end users, health care professionals and providers, researchers, and practitioners in developing, deploying, and evaluating future intelligent AAL systems. Methods: This study was conducted as a scoping review to identify, analyze, and extract the relevant literature. It used a natural language processing toolkit to retrieve the article corpus for an efficient and comprehensive automated literature search. Relevant articles were then extracted from the corpus and analyzed manually. This review included 5 digital libraries: IEEE, PubMed, Springer, Elsevier, and MDPI. Results: We included a total of 108 articles. The annual distribution of relevant articles showed a growing trend for all categories from January 2010 to July 2022. The AI models mainly used unsupervised and semisupervised approaches. The leading models are deep learning, natural language processing, instance-based learning, and clustering. Activity assistance and recognition were the most common target domains of the models. Ambient sensing, mobile technology, and robotic devices mainly implemented the models. Older adults were the primary beneficiaries, followed by patients and frail persons of various ages. Availability was a top beneficiary concern. Conclusions: This study presents the analytical evidence of AI models in AAL and their domains, technologies, beneficiaries, and concerns. Future research on intelligent AAL should involve health care professionals and caregivers as designers and users, comply with health-related regulations, improve transparency and privacy, integrate with health care technological infrastructure, explain their decisions to the users, and establish evaluation metrics and design guidelines. Trial Registration: PROSPERO (International Prospective Register of Systematic Reviews) CRD42022347590; https://www.crd.york.ac.uk/prospero/display_record.php?ID=CRD42022347590This work was part of and supported by GoodBrother, COST Action 19121—Network on Privacy-Aware Audio- and Video-Based Applications for Active and Assisted Living
    corecore