Generative AI in general, and synthetic visual data generation in particular, hold much promise for surgical training by bringing photorealism to
simulation environments. Current training methods primarily rely on reading
materials and observing live surgeries, which can be time-consuming and
impractical. In this work, we take a significant step towards improving the
training process. Specifically, we use diffusion models in combination with a
zero-shot video diffusion method to interactively generate realistic
laparoscopic images and videos, specifying the surgical action through a text prompt and guiding tool placement through segmentation masks. We
demonstrate our approach on the publicly available Cholec dataset family, evaluating the fidelity and factual correctness of the generated images with a surgical action recognition model and measuring the spatial control of tool generation with the pixel-wise F1-score. We achieve an
FID of 38.097 and an F1-score of 0.71.
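
As a rough illustration of the text- and mask-guided generation described above, the following sketch uses a segmentation-conditioned ControlNet pipeline from the diffusers library. It is a minimal sketch, not the authors' implementation: the checkpoints, prompt, and mask path are illustrative assumptions, and the paper's models would be fine-tuned on laparoscopic data rather than used off the shelf.

```python
# Minimal sketch (assumed setup, not the authors' code): generate a
# laparoscopic frame whose surgical action is given by a text prompt and
# whose tool positions are constrained by a segmentation mask.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Segmentation-conditioned ControlNet; checkpoint names are assumptions.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-seg", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Tool-position segmentation mask (hypothetical file) that spatially
# guides where instruments appear in the generated image.
tool_mask = Image.open("tool_mask.png").convert("RGB")

# The surgical action is specified through the text prompt.
result = pipe(
    prompt="laparoscopic view of gallbladder dissection with a grasper",
    image=tool_mask,
    num_inference_steps=30,
    guidance_scale=7.5,
)
result.images[0].save("generated_frame.png")
```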