ST(OR)2: Spatio-Temporal Object Level Reasoning for Activity Recognition in the Operating Room
Surgical robotics holds much promise for improving patient safety and
clinician experience in the Operating Room (OR). However, it also comes with
new challenges, requiring strong team coordination and effective OR management.
Automatic detection of surgical activities is a key requirement for developing
AI-based intelligent tools to tackle these challenges. The current
state-of-the-art surgical activity recognition methods however operate on
image-based representations and depend on large-scale labeled datasets whose
collection is time-consuming and resource-expensive. This work proposes a new
sample-efficient and object-based approach for surgical activity recognition in
the OR. Our method focuses on the geometric arrangements between clinicians and
surgical devices, thus utilizing the significant object interaction dynamics in
the OR. We conduct experiments in a low-data regime study for long video
activity recognition. We also benchmark our method against other object-centric
approaches on clip-level action classification and show superior performance.
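The object-based idea above can be illustrated with a minimal sketch: describing a frame by the pairwise distances between the centroids of detected objects (clinicians and surgical devices), which captures their geometric arrangement in a translation- and rotation-invariant way. This is an illustrative assumption, not the paper's actual feature pipeline; the function name and the toy coordinates are hypothetical.

```python
import numpy as np

def geometric_features(centroids):
    """Build a simple geometric descriptor from 2D object centroids.

    centroids: (N, 2) array of detected object positions in one frame
    (e.g. clinicians and surgical devices). Returns the pairwise
    distances between all object pairs, a rotation-invariant summary
    of their spatial arrangement.
    """
    centroids = np.asarray(centroids, dtype=float)
    diffs = centroids[:, None, :] - centroids[None, :, :]  # (N, N, 2)
    dists = np.linalg.norm(diffs, axis=-1)                 # (N, N)
    iu = np.triu_indices(len(centroids), k=1)              # unique pairs only
    return dists[iu]                                       # (N*(N-1)/2,)

# Example: three objects in a frame -> three pairwise distances.
feats = geometric_features([[0, 0], [3, 4], [6, 0]])
print(feats)  # [5. 6. 5.]
```

A sequence of such per-frame descriptors could then be fed to a temporal model for activity recognition.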
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures
Recent advancements in surgical computer vision applications have been driven
by fully-supervised methods, primarily using only visual data. These methods
rely on manually annotated surgical videos to predict a fixed set of object
categories, limiting their generalizability to unseen surgical procedures and
downstream tasks. In this work, we put forward the idea that the surgical video
lectures available through open surgical e-learning platforms can provide
effective supervisory signals for multi-modal representation learning without
relying on manual annotations. We address the surgery-specific linguistic
challenges present in surgical video lectures by employing multiple
complementary automatic speech recognition systems to generate text
transcriptions. We then present a novel method, SurgVLP - Surgical Vision
Language Pre-training, for multi-modal representation learning. SurgVLP
constructs a new contrastive learning objective to align video clip embeddings
with the corresponding multiple text embeddings by bringing them together
within a joint latent space. To effectively show the representation capability
of the learned joint latent space, we introduce several vision-and-language
tasks for surgery, such as text-based video retrieval, temporal activity
grounding, and video captioning, as benchmarks for evaluation. We further
demonstrate that without using any labeled ground truth, our approach can be
employed for traditional vision-only surgical downstream tasks, such as
surgical tool, phase, and triplet recognition. The code will be made available
at https://github.com/CAMMA-public/SurgVL
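The contrastive objective described above can be sketched with a generic symmetric InfoNCE-style loss that pulls matched video/text embedding pairs together in a joint latent space while pushing mismatched pairs apart. This is a standard formulation used for illustration, not the exact SurgVLP objective; the function name and embedding shapes are assumptions.

```python
import numpy as np

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss (illustrative sketch).

    video_emb, text_emb: (B, D) L2-normalized embeddings where row i
    of each matrix is a matched video/text pair.
    """
    logits = video_emb @ text_emb.T / temperature  # (B, B) similarity matrix
    labels = np.arange(len(logits))                # diagonal entries = positives

    def xent(l):
        l = l - l.max(axis=1, keepdims=True)       # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()   # NLL of the true pairings

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (xent(logits) + xent(logits.T))

# Toy usage: noisy "text" views of the same clips should align well.
rng = np.random.default_rng(0)
v = rng.normal(size=(4, 8))
t = v + 0.1 * rng.normal(size=(4, 8))
v /= np.linalg.norm(v, axis=1, keepdims=True)
t /= np.linalg.norm(t, axis=1, keepdims=True)
print(info_nce(v, t))
```

Minimizing this loss brings each video clip embedding close to its corresponding text embedding within the joint latent space.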
Unsupervised Domain Adaptation Approaches for Person Localization in the Operating Room
The fine-grained localization of clinicians in the operating room (OR) is a key component in designing new OR support systems. However, the task is challenging not only because OR images exhibit significant visual domain differences compared to traditional vision datasets, but also because data and annotations are hard to collect and generate in the OR due to privacy concerns. This thesis explores Unsupervised Domain Adaptation (UDA) methods to enable visual learning for the target domain, the OR, by working in two complementary directions. First, we study how low-resolution images with a downsampling factor of up to 12x can be used for fine-grained clinician localization to address privacy concerns. Second, we propose several self-supervised methods that transfer learned information from a labeled source domain to an unlabeled target domain to address the visual domain shift and the lack of annotations. These methods employ self-supervised predictions to allow the model to learn and adapt to the unlabeled target domain. To demonstrate the effectiveness of our proposed approaches, we release the first public dataset, called the multi-view operating room (MVOR) dataset, generated from recordings of real clinical interventions. We obtain state-of-the-art results on the MVOR dataset, in particular on privacy-preserving low-resolution OR images. We hope our proposed UDA approaches can help scale up and deploy novel AI assistance applications for OR environments.
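The privacy-preserving low-resolution setting above can be illustrated with a minimal sketch of 12x spatial downsampling by block averaging. This is an assumption for illustration; a real pipeline would typically use proper anti-aliased resizing, and the frame size here is a mock value.

```python
import numpy as np

def downsample(image, factor=12):
    """Produce a privacy-preserving low-resolution image by averaging
    non-overlapping factor x factor blocks.

    image: (H, W) array with H and W divisible by `factor`.
    """
    h, w = image.shape
    blocks = image.reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))  # one mean value per block

# Mock OR camera frame: 480x720 -> 40x60 at a 12x downsampling factor.
frame = np.random.rand(480, 720)
low_res = downsample(frame, factor=12)
print(low_res.shape)  # (40, 60)
```

At this resolution individual faces are no longer recognizable, while coarse body positions remain learnable for clinician localization.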