FIRST - Flexible interactive retrieval SysTem for visual lifelog exploration at LSC 2020
Lifelogs can provide useful insights into our daily activities. It is essential to provide a flexible way for users to retrieve events or moments of interest corresponding to a wide variety of query types. This motivates us to develop FIRST, a Flexible Interactive Retrieval SysTem, which helps users combine and integrate various query components in a flexible manner to handle different query scenarios, such as clustering visual data based on color histograms, visual similarity, GPS location, or scene attributes. We also employ personalized concept detection and image captioning to enhance image understanding of visual lifelog data, and we develop an autoencoder-like approach for mapping between query text and image features. Furthermore, we refine the user interface of the retrieval system to better assist users in query expansion and in verifying sequential events at a flexible temporal resolution that controls the navigation speed through sequences of images.
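The abstract does not detail the autoencoder-like text-to-image feature mapping, so the following is a minimal sketch of one plausible formulation: an encoder projects a text query embedding into the image feature space, while a decoder reconstructs the text embedding so the mapping stays information-preserving. The dimensions, the cosine matching term, and the name TextToImageMapper are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TextToImageMapper(nn.Module):
    """Autoencoder-style mapper: project a text query embedding into the
    image feature space so that text queries and lifelog image features can
    be compared directly with cosine similarity."""

    def __init__(self, text_dim=768, image_dim=512, hidden_dim=1024):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(text_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, image_dim),   # code lives in image-feature space
        )
        self.decoder = nn.Sequential(
            nn.Linear(image_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, text_dim),    # reconstruct the text embedding
        )

    def forward(self, text_emb):
        code = self.encoder(text_emb)
        recon = self.decoder(code)
        return code, recon

def training_step(model, text_emb, paired_image_feat, alpha=0.5):
    """Pull the code towards the paired image feature (matching term) while
    also reconstructing the original text embedding (autoencoder term)."""
    code, recon = model(text_emb)
    match_loss = 1.0 - nn.functional.cosine_similarity(code, paired_image_feat).mean()
    recon_loss = nn.functional.mse_loss(recon, text_emb)
    return match_loss + alpha * recon_loss
```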
Whether and When does Endoscopy Domain Pretraining Make Sense?
Automated endoscopy video analysis is a challenging task in medical computer
vision, with the primary objective of assisting surgeons during procedures. The
difficulty arises from the complexity of surgical scenes and the lack of a
sufficient amount of annotated data. In recent years, large-scale pretraining
has shown great success in natural language processing and computer vision
communities. These approaches reduce the need for annotated data, which is
always a concern in the medical domain. However, most works on endoscopic video
understanding use models pretrained on natural images, creating a domain gap
between pretraining and finetuning. In this work, we investigate the need for
endoscopy domain-specific pretraining based on downstream objectives. To this
end, we first collect Endo700k, the largest publicly available corpus of
endoscopic images, extracted from nine public Minimally Invasive Surgery (MIS)
datasets. Endo700k comprises more than 700,000 unannotated raw images. Next, we
introduce EndoViT, an endoscopy pretrained Vision Transformer (ViT). Through
ablations, we demonstrate that domain-specific pretraining is particularly
beneficial for more complex downstream tasks, such as Action Triplet Detection,
and less effective, or even unnecessary, for simpler tasks, such as Surgical
Phase Recognition. We will release both our code and pretrained models upon
acceptance to facilitate further research in this direction.
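As a concrete illustration of the pretraining choice the paper studies, the sketch below builds a ViT-B/16 backbone for a downstream task and initialises it either from ImageNet weights or from a domain-specific (EndoViT-style) checkpoint before fine-tuning. The timm model name, the checkpoint format, and the helper itself are assumptions for illustration; the authors' released code and weights should be preferred once available.

```python
import timm
import torch

def build_downstream_backbone(weights="endoscopy", checkpoint_path=None):
    """Return a ViT-B/16 feature extractor initialised either from ImageNet
    weights or from a domain-specific endoscopy checkpoint (assumed to be a
    plain state_dict saved with torch.save)."""
    if weights == "imagenet":
        model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
    else:
        model = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=0)
        if checkpoint_path is not None:
            state = torch.load(checkpoint_path, map_location="cpu")
            # strict=False tolerates heads/keys that differ between pretraining and fine-tuning
            missing, unexpected = model.load_state_dict(state, strict=False)
            print(f"loaded domain checkpoint: {len(missing)} missing, {len(unexpected)} unexpected keys")
    return model
```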
A VR interface for browsing visual spaces at VBS2021
The Video Browser Showdown (VBS) is an annual competition in which each participant prepares an interactive video retrieval system and partakes in a live comparative evaluation at the annual MMM conference. In this paper, we introduce Eolas, a prototype video/image retrieval system incorporating a novel virtual reality (VR) interface. For VBS’21, Eolas represented each keyframe of the collection by an embedded feature in a latent vector space, into which a query would also be projected to facilitate retrieval within a VR environment. A user could then explore the space and perform one of a number of filter operations to traverse the space and locate the correct result.
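The abstract describes projecting a query into the same latent space as the keyframe embeddings; a minimal sketch of the retrieval step under that assumption is shown below. How Eolas actually computes its embeddings and any VR-specific ranking are not specified, so this only illustrates cosine-similarity lookup over precomputed vectors.

```python
import numpy as np

def nearest_keyframes(query_vec, keyframe_matrix, k=20):
    """Rank keyframes by cosine similarity to a query projected into the
    shared latent space.

    query_vec:       (d,)   query embedding
    keyframe_matrix: (n, d) precomputed keyframe embeddings
    """
    q = query_vec / (np.linalg.norm(query_vec) + 1e-8)
    kf = keyframe_matrix / (np.linalg.norm(keyframe_matrix, axis=1, keepdims=True) + 1e-8)
    scores = kf @ q                     # cosine similarity per keyframe
    top = np.argsort(-scores)[:k]       # indices of the k best matches
    return top, scores[top]
```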
FaceAtt: Enhancing Image Captioning with Facial Attributes for Portrait Images
Automated image caption generation is a critical area of research that
enhances accessibility and understanding of visual content for diverse
audiences. In this study, we propose the FaceAtt model, a novel approach to
attribute-focused image captioning that emphasizes the accurate depiction of
facial attributes within images. FaceAtt automatically detects and describes a
wide range of attributes, including emotions, expressions, pointed noses, fair
skin tones, hair textures, attractiveness, and approximate age ranges.
Leveraging deep learning techniques, we explore the impact of different image
feature extraction methods on caption quality and evaluate our model's
performance using metrics such as BLEU and METEOR. FaceAtt leverages annotated portrait attributes as supplementary prior knowledge before captioning. This addition yields a subtle yet discernible improvement in the resulting scores, demonstrating the benefit of incorporating additional attribute vectors during training. Furthermore, our research contributes to the broader discourse on ethical considerations in automated captioning. This study sets the stage for future research in refining attribute-focused captioning techniques, with a focus on enhancing linguistic coherence, addressing biases, and accommodating diverse user needs.
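The abstract does not specify how the attribute vectors are injected, so the toy decoder below shows one common pattern: fuse the global image feature with the attribute vector and use the result as the initial state of a recurrent caption decoder. The dimensions, the LSTM decoder, and the class name are illustrative assumptions, not the FaceAtt implementation.

```python
import torch
import torch.nn as nn

class AttributeConditionedCaptioner(nn.Module):
    """Toy captioning decoder conditioned on a global image feature plus a
    binary facial-attribute vector (e.g. expression, hair texture, age band)."""

    def __init__(self, image_dim=2048, attr_dim=40, embed_dim=512, vocab_size=10000):
        super().__init__()
        self.fuse = nn.Linear(image_dim + attr_dim, embed_dim)   # joint image+attribute prior
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.head = nn.Linear(embed_dim, vocab_size)

    def forward(self, image_feat, attr_vec, tokens):
        prior = torch.tanh(self.fuse(torch.cat([image_feat, attr_vec], dim=-1)))
        x = self.embed(tokens)
        # inject the fused prior as the decoder's initial hidden state
        h0 = prior.unsqueeze(0)
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(x, (h0, c0))
        return self.head(out)            # (batch, seq_len, vocab_size) logits
```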
ArSDM: Colonoscopy Images Synthesis with Adaptive Refinement Semantic Diffusion Models
Colonoscopy analysis, particularly automatic polyp segmentation and
detection, is essential for assisting clinical diagnosis and treatment.
However, as medical image annotation is labour- and resource-intensive, the
scarcity of annotated data limits the effectiveness and generalization of
existing methods. Although recent research has focused on data generation and
augmentation to address this issue, the quality of the generated data remains a
challenge, limiting its contribution to the performance of downstream
tasks. Inspired by the superiority of diffusion models in fitting data
distributions and generating high-quality data, in this paper, we propose an
Adaptive Refinement Semantic Diffusion Model (ArSDM) to generate colonoscopy
images that benefit the downstream tasks. Specifically, ArSDM utilizes the
ground-truth segmentation mask as a prior condition during training and adjusts
the diffusion loss for each input according to the polyp/background size ratio.
Furthermore, ArSDM incorporates a pre-trained segmentation model to refine the
training process by reducing the difference between the ground-truth mask and
the prediction mask. Extensive experiments on segmentation and detection tasks demonstrate that the data generated by ArSDM can significantly boost the performance of baseline methods.
Comment: Accepted by MICCAI 2023
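The abstract states that the diffusion loss is adjusted per input according to the polyp/background size ratio, but does not give the exact weighting. The snippet below is a minimal sketch of one such size-adaptive loss; the inverse-area weighting and the function name are assumptions, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def adaptive_diffusion_loss(noise_pred, noise, mask):
    """Size-adaptive denoising loss: pixels inside small polyps receive a
    larger weight so the model does not under-fit rare foreground regions.

    noise_pred, noise: (B, C, H, W) predicted / true noise
    mask:              (B, 1, H, W) binary polyp mask used as the prior condition
    """
    per_pixel = F.mse_loss(noise_pred, noise, reduction="none").mean(dim=1, keepdim=True)
    # polyp/background size ratio per image (clamped to avoid division by zero)
    fg_ratio = mask.float().mean(dim=(1, 2, 3), keepdim=True).clamp(min=1e-4)
    # up-weight foreground pixels inversely to their area; background keeps weight 1
    weights = torch.where(mask > 0.5, 1.0 / fg_ratio, torch.ones_like(per_pixel))
    return (weights * per_pixel).mean()
```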
SME: Spatial-Spectral Mutual Teaching and Ensemble Learning for Scribble-supervised Polyp Segmentation
Fully-supervised polyp segmentation has accomplished significant triumphs
over the years in advancing the early diagnosis of colorectal cancer. However, label-efficient solutions based on weak supervision such as scribbles remain rarely explored, despite being highly meaningful and in demand in medical practice given the expense and scarcity of densely annotated polyp data. In addition, various deployment issues, including data shifts and corruption, place further demands on model generalization and robustness. To address these concerns, we
design a framework of Spatial-Spectral Dual-branch Mutual Teaching and
Entropy-guided Pseudo Label Ensemble Learning (SME). Concretely, for the
first time in weakly-supervised medical image segmentation, we promote the
dual-branch co-teaching framework by leveraging the intrinsic complementarity
of features extracted from the spatial and spectral domains and encouraging
cross-space consistency through collaborative optimization. Furthermore, to
produce reliable mixed pseudo labels, which enhance the effectiveness of
ensemble learning, we introduce a novel adaptive pixel-wise fusion technique
based on the entropy guidance from the spatial and spectral branches. Our
strategy efficiently mitigates the deleterious effects of uncertainty and noise
present in pseudo labels and surpasses previous alternatives in terms of
efficacy. Ultimately, we formulate a holistic optimization objective to learn
from the hybrid supervision of scribbles and pseudo labels. Extensive
experiments and evaluation on four public datasets demonstrate the superiority
of our method regarding in-distribution accuracy, out-of-distribution
generalization, and robustness, highlighting its promising clinical
significance. Our code is available at https://github.com/lofrienger/S2ME.
Comment: MICCAI 2023 Early Acceptance
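The entropy-guided pixel-wise fusion of pseudo labels from the spatial and spectral branches can be sketched as below. The exact fusion rule is in the released code; this weighting, where each branch's confidence is taken as the other branch's entropy share, is only an illustrative assumption.

```python
import torch

def entropy_guided_fusion(prob_spatial, prob_spectral, eps=1e-8):
    """Fuse pseudo labels from the spatial and spectral branches pixel-wise,
    trusting whichever branch is more confident (lower predictive entropy).

    prob_spatial, prob_spectral: (B, C, H, W) softmax probabilities
    returns the fused probabilities and the hard pseudo-label map
    """
    def entropy(p):
        return -(p * (p + eps).log()).sum(dim=1, keepdim=True)   # (B, 1, H, W)

    e_spa, e_spe = entropy(prob_spatial), entropy(prob_spectral)
    # lower entropy -> higher weight; per-pixel weights sum to 1
    w_spa = e_spe / (e_spa + e_spe + eps)
    fused = w_spa * prob_spatial + (1.0 - w_spa) * prob_spectral
    pseudo_label = fused.argmax(dim=1)
    return fused, pseudo_label
```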
Mask-conditioned latent diffusion for generating gastrointestinal polyp images
In order to take advantage of AI solutions in endoscopy diagnostics, we must
overcome the issue of limited annotations. This limitation stems from the
strong privacy concerns in the medical field and from the need for expert
involvement in the time-consuming and costly medical data annotation process.
In computer vision, image synthesis has made a significant contribution in
recent years as a result of the progress of generative adversarial networks
(GANs) and diffusion probabilistic models (DPMs). Novel DPMs have outperformed
GANs in text, image, and video generation tasks. Therefore, this study proposes
a conditional DPM framework to generate synthetic GI polyp images conditioned
on given generated segmentation masks. Our experimental results show that our
system can generate an unlimited number of high-fidelity synthetic polyp images
with the corresponding ground truth masks of polyps. To test the usefulness of
the generated data, we trained binary image segmentation models to study the
effect of using synthetic data. Results show that the best micro image-wise IoU of 0.7751 was achieved by DeepLabv3+ when the training data consisted of both real and synthetic data. However, the results also indicate that achieving good segmentation performance with synthetic data depends heavily on the model architecture.
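The framework conditions the denoising model on a segmentation mask. One common way to do this, shown in the hedged sketch below, is to concatenate the (resized) mask to the noisy latent as an extra input channel; the wrapper class, the latent channel count, and the denoiser call signature are assumptions, and the paper's actual conditioning mechanism may differ.

```python
import torch
import torch.nn as nn

class MaskConditionedDenoiser(nn.Module):
    """Minimal denoiser wrapper: the polyp mask is concatenated to the noisy
    latent as an extra channel so every denoising step sees the condition."""

    def __init__(self, denoiser, latent_channels=4):
        super().__init__()
        # 1x1 conv adapts latent + 1 mask channel back to the denoiser's input width
        self.in_proj = nn.Conv2d(latent_channels + 1, latent_channels, kernel_size=1)
        self.denoiser = denoiser   # any noise-prediction network taking (x, timestep)

    def forward(self, noisy_latent, mask, timestep):
        mask = nn.functional.interpolate(mask.float(), size=noisy_latent.shape[-2:])
        x = torch.cat([noisy_latent, mask], dim=1)
        return self.denoiser(self.in_proj(x), timestep)
```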
DA-TransUNet: Integrating Spatial and Channel Dual Attention with Transformer U-Net for Medical Image Segmentation
Accurate medical image segmentation is critical for disease quantification
and treatment evaluation. While traditional U-Net architectures and their transformer-integrated variants excel at automated segmentation tasks, they lack the ability to harness an image's intrinsic positional and channel features. Existing models also struggle with parameter efficiency and computational complexity, often due to the extensive use of Transformers. To address these issues, this study proposes a novel deep medical image segmentation framework, called DA-TransUNet, which integrates the Transformer and a dual attention block (DA-Block) into the traditional U-shaped architecture. Unlike earlier transformer-based U-Net models, DA-TransUNet
utilizes Transformers and DA-Block to integrate not only global and local
features, but also image-specific positional and channel features, improving
the performance of medical image segmentation. By incorporating a DA-Block at
the embedding layer and within each skip connection layer, we substantially
enhance feature extraction capabilities and improve the efficiency of the
encoder-decoder structure. DA-TransUNet demonstrates superior performance in
medical image segmentation tasks, consistently outperforming state-of-the-art
techniques across multiple datasets. In summary, DA-TransUNet offers a
significant advancement in medical image segmentation, providing an effective
and powerful alternative to existing techniques. Our architecture stands out
for its ability to improve segmentation accuracy, thereby advancing the field
of automated medical image diagnostics. The codes and parameters of our model
will be publicly available at https://github.com/SUN-1024/DA-TransUnet.
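The abstract does not give the internals of the DA-Block. The toy module below illustrates the general idea of applying channel attention and spatial (positional) attention to a feature map before it enters a skip connection; it uses lightweight squeeze-and-excite / CBAM-style attention as a stand-in, whereas the actual DA-Block may rely on self-attention-based position and channel modules, so treat this purely as an illustration.

```python
import torch
import torch.nn as nn

class DABlockSketch(nn.Module):
    """Toy dual attention block: channel attention followed by spatial
    attention on a (B, C, H, W) feature map."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel_gate(x)                       # re-weight channels
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        x = x * self.spatial_gate(torch.cat([avg, mx], dim=1))  # re-weight positions
        return x
```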
Multi-level feature fusion network combining attention mechanisms for polyp segmentation
Clinically, automated polyp segmentation techniques have the potential to
significantly improve the efficiency and accuracy of medical diagnosis, thereby
reducing the risk of colorectal cancer in patients. Unfortunately, existing
methods suffer from two significant weaknesses that can impact the accuracy of
segmentation. Firstly, features extracted by encoders are not adequately
filtered and utilized. Secondly, semantic conflicts and information redundancy
caused by feature fusion are not attended to. To overcome these limitations, we
propose a novel approach for polyp segmentation, named MLFF-Net, which
leverages multi-level feature fusion and attention mechanisms. Specifically,
MLFF-Net comprises three modules: Multi-scale Attention Module (MAM),
High-level Feature Enhancement Module (HFEM), and Global Attention Module
(GAM). Among these, MAM is used to extract multi-scale information and polyp
details from the shallow output of the encoder. In HFEM, the deep features of
the encoders complement each other by aggregation. Meanwhile, the attention
mechanism redistributes the weight of the aggregated features, weakening the
conflicting redundant parts and highlighting the information useful to the
task. GAM combines encoder and decoder features and computes global dependencies to mitigate the locality of the receptive field. Experimental results on five public datasets show that the proposed method can not only segment multiple types of polyps but also outperforms current state-of-the-art methods in both accuracy and generalization ability.
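As an illustration of what a global attention module over encoder and decoder features might look like, a minimal cross-attention sketch is given below. The abstract does not describe GAM's internals, so the multi-head cross-attention, the requirement that both feature maps share the same channel width, and the class name are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class GlobalAttentionFusion(nn.Module):
    """Toy global attention: decoder features attend over encoder features so
    the fused representation captures global dependencies rather than being
    limited to a local receptive field."""

    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, dec_feat, enc_feat):
        b, c, h, w = dec_feat.shape
        q = dec_feat.flatten(2).transpose(1, 2)    # (B, H*W, C) queries from decoder
        kv = enc_feat.flatten(2).transpose(1, 2)   # (B, H*W, C) keys/values from encoder
        fused, _ = self.attn(q, kv, kv)
        fused = self.norm(fused + q)               # residual connection + normalisation
        return fused.transpose(1, 2).reshape(b, c, h, w)
```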