Early Turn-taking Prediction with Spiking Neural Networks for Human Robot Collaboration
Turn-taking is essential to the structure of human teamwork. Humans are
typically aware of team members' intention to keep or relinquish their turn
before a turn switch, where the responsibility of working on a shared task is
shifted. Future co-robots are expected to exhibit the same competence. To that
end, this paper proposes the Cognitive Turn-taking Model (CTTM), which
leverages a cognitive model (a spiking neural network) to achieve early
turn-taking prediction. The CTTM framework processes multimodal human
communication cues (both implicit and explicit) and predicts human turn-taking
intentions at an early stage. The proposed framework is tested on a simulated
surgical procedure, where a robotic scrub nurse predicts the surgeon's
turn-taking intention. The proposed CTTM framework outperforms
state-of-the-art turn-taking prediction algorithms by a large margin. It also
outperforms humans when presented with only partial observations of
communication cues (i.e., less than 40% of the full action). This early prediction
capability enables robots to initiate turn-taking actions at an early stage,
which facilitates collaboration and increases overall efficiency.
Comment: Submitted to IEEE International Conference on Robotics and Automation (ICRA) 201
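As a concrete illustration of the idea, the sketch below runs fake multimodal
cue spikes through a single leaky integrate-and-fire (LIF) layer and commits to
a turn-switch decision as soon as accumulated evidence crosses a bound, i.e.
before the full cue sequence has been observed. This is a minimal sketch in the
spirit of the CTTM, not the authors' model: the layer size, random weights, and
evidence threshold are all illustrative assumptions.

```python
# Minimal early-prediction sketch with a leaky integrate-and-fire (LIF) layer.
# All constants and the evidence readout are illustrative assumptions.
import numpy as np

def lif_layer(spikes, tau=20.0, v_thresh=1.0, hidden=16, seed=0):
    """Run input spike trains (T x D) through one LIF layer; return (T x hidden)."""
    T, D = spikes.shape
    w = np.random.default_rng(seed).normal(0.0, 0.5, size=(D, hidden))
    v = np.zeros(hidden)                  # membrane potentials
    out = np.zeros((T, hidden))
    for t in range(T):
        v += spikes[t] @ w - v / tau      # integrate weighted input, leak toward zero
        fired = v >= v_thresh
        out[t, fired] = 1.0
        v[fired] = 0.0                    # reset neurons that fired
    return out

# Early prediction: decide as soon as accumulated output spikes cross a
# (hypothetical) confidence bound, well before the action completes.
T, D = 100, 8
cues = (np.random.default_rng(1).random((T, D)) < 0.1).astype(float)  # fake cue spikes
evidence = lif_layer(cues).sum(axis=1).cumsum()
crossed = evidence > 25.0
decision_step = int(np.argmax(crossed)) if crossed.any() else T
print(f"predicted turn switch at step {decision_step}/{T} ({decision_step / T:.0%} observed)")
```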
Gesture Recognition in Robotic Surgery with Multimodal Attention
Automatically recognising surgical gestures from surgical data is an important building block of automated activity recognition and analytics, technical skill assessment, intra-operative assistance and, eventually, robotic automation. The complexity of articulated instrument trajectories and the inherent variability due to surgical style and patient anatomy make analysis and fine-grained segmentation of surgical motion patterns from robot kinematics alone very difficult. Surgical video provides crucial information from the surgical site, giving context to the kinematic data and to the interaction between the instruments and tissue. Yet sensor fusion between the robot data and the surgical video stream is non-trivial, because the two modalities differ in frequency, dimensionality and discriminative capability. In this paper, we integrate multimodal attention mechanisms into a two-stream temporal convolutional network to compute relevance scores and weight kinematic and visual feature representations dynamically in time, aiming to aid multimodal network training and achieve effective sensor fusion. We report the results of our system on the JIGSAWS benchmark dataset and on a new in vivo dataset of suturing segments from robotic prostatectomy procedures. Our results are promising: the multimodal network produces prediction sequences with higher accuracy and better temporal structure than the corresponding unimodal solutions. Visualising the attention scores also gives physically interpretable insight into how the network weighs the strengths and weaknesses of each sensor.
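For readers who want a concrete picture of per-timestep attention fusion, the
sketch below weights a kinematic and a visual feature stream with learned
relevance scores at every frame. It is a minimal PyTorch sketch under assumed
dimensions (e.g. a 76-dimensional kinematic input), not the paper's
architecture.

```python
# Minimal sketch of per-timestep attention over two feature streams.
# Dimensions, layer choices and names are illustrative assumptions.
import torch
import torch.nn as nn

class TwoStreamAttentionFusion(nn.Module):
    def __init__(self, kin_dim=76, vis_dim=128, hid=64):
        super().__init__()
        self.kin_proj = nn.Conv1d(kin_dim, hid, kernel_size=3, padding=1)  # temporal conv stream
        self.vis_proj = nn.Conv1d(vis_dim, hid, kernel_size=3, padding=1)
        self.score = nn.Linear(hid, 1)  # scalar relevance per modality per timestep

    def forward(self, kin, vis):
        # kin: (B, T, kin_dim), vis: (B, T, vis_dim)
        k = self.kin_proj(kin.transpose(1, 2)).transpose(1, 2)  # (B, T, hid)
        v = self.vis_proj(vis.transpose(1, 2)).transpose(1, 2)  # (B, T, hid)
        feats = torch.stack([k, v], dim=2)                      # (B, T, 2, hid)
        attn = torch.softmax(self.score(feats), dim=2)          # weights over the 2 modalities
        fused = (attn * feats).sum(dim=2)                       # (B, T, hid)
        return fused, attn.squeeze(-1)                          # scores stay inspectable

fused, scores = TwoStreamAttentionFusion()(torch.randn(2, 100, 76), torch.randn(2, 100, 128))
print(fused.shape, scores.shape)  # (2, 100, 64), (2, 100, 2)
```

Returning the attention weights alongside the fused features is what makes the
per-sensor relevance visualisable over time, as the abstract describes.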
Multi-Task Recurrent Neural Network for Surgical Gesture Recognition and Progress Prediction
Surgical gesture recognition is important for surgical data science and
computer-aided intervention. Even with robotic kinematic information,
automatically segmenting surgical steps presents numerous challenges because
surgical demonstrations are characterized by high variability in style,
duration and order of actions. In order to extract discriminative features from
the kinematic signals and boost recognition accuracy, we propose a multi-task
recurrent neural network for simultaneous recognition of surgical gestures and
estimation of a novel formulation of surgical task progress. To show the
effectiveness of the presented approach, we evaluate its application on the
JIGSAWS dataset, which is currently the only publicly available dataset for
surgical gesture recognition featuring robot kinematic data. We demonstrate
that recognition performance improves in multi-task frameworks with progress
estimation, without any additional manual labelling or training.
Comment: Accepted to ICRA 202
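A minimal sketch of such a multi-task setup is shown below: one LSTM trunk
with a per-frame gesture head and a progress head, where progress labels come
for free as the fraction of the trial elapsed. The sizes, the sigmoid readout
and the unweighted loss sum are illustrative assumptions, not the paper's
exact formulation.

```python
# Minimal multi-task RNN sketch: gesture classification + progress regression.
# Shapes and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class MultiTaskGestureRNN(nn.Module):
    def __init__(self, in_dim=76, hid=128, n_gestures=10):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hid, batch_first=True)
        self.gesture_head = nn.Linear(hid, n_gestures)  # per-frame gesture logits
        self.progress_head = nn.Linear(hid, 1)          # per-frame progress in [0, 1]

    def forward(self, x):                               # x: (B, T, in_dim) kinematics
        h, _ = self.rnn(x)
        return self.gesture_head(h), torch.sigmoid(self.progress_head(h)).squeeze(-1)

model = MultiTaskGestureRNN()
x = torch.randn(4, 200, 76)
gestures = torch.randint(0, 10, (4, 200))
progress = torch.linspace(0, 1, 200).expand(4, -1)      # free labels: elapsed fraction
logits, pred_progress = model(x)
loss = nn.functional.cross_entropy(logits.reshape(-1, 10), gestures.reshape(-1)) \
     + nn.functional.mse_loss(pred_progress, progress)  # joint objective
loss.backward()
```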
SAR-RARP50: Segmentation of surgical instrumentation and Action Recognition on Robot-Assisted Radical Prostatectomy Challenge
Surgical tool segmentation and action recognition are fundamental building
blocks in many computer-assisted intervention applications, ranging from
surgical skills assessment to decision support systems. Nowadays,
learning-based action recognition and segmentation approaches outperform
classical methods, relying, however, on large, annotated datasets. Furthermore,
action recognition and tool segmentation algorithms are often trained and make
predictions in isolation from each other, without exploiting potential
cross-task relationships. With the EndoVis 2022 SAR-RARP50 challenge, we
release the first multimodal, publicly available, in vivo dataset for surgical
action recognition and semantic instrumentation segmentation, containing 50
suturing video segments of Robot-Assisted Radical Prostatectomy (RARP). The
aim of the challenge is twofold. First, to enable researchers to leverage the
scale of the provided dataset and develop robust and highly accurate
single-task action recognition and tool segmentation approaches in the surgical
domain. Second, to further explore the potential of multitask-based learning
approaches and determine their comparative advantage against their single-task
counterparts. A total of 12 teams participated in the challenge, contributing 7
action recognition methods, 9 instrument segmentation techniques, and 4
multitask approaches that integrated both action recognition and instrument
segmentation. The complete SAR-RARP50 dataset is available at:
https://rdr.ucl.ac.uk/projects/SARRARP50_Segmentation_of_surgical_instrumentation_and_Action_Recognition_on_Robot-Assisted_Radical_Prostatectomy_Challenge/19109
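To make the multitask angle concrete, the sketch below joins a per-pixel
segmentation head and a per-frame action head on one shared encoder and trains
them with a weighted joint loss. Everything in it (the stand-in encoder, class
counts, loss weighting) is an illustrative assumption, not a reference
implementation for SAR-RARP50.

```python
# Minimal joint segmentation + action recognition sketch with a shared encoder.
# The tiny conv "backbone" and all shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointSegAction(nn.Module):
    def __init__(self, n_seg_classes=10, n_actions=8):
        super().__init__()
        self.backbone = nn.Conv2d(3, 32, 3, padding=1)   # stand-in shared encoder
        self.seg_head = nn.Conv2d(32, n_seg_classes, 1)  # per-pixel logits
        self.act_head = nn.Linear(32, n_actions)         # per-frame logits

    def forward(self, frames):                           # frames: (B, 3, H, W)
        f = F.relu(self.backbone(frames))
        seg = self.seg_head(f)                           # (B, C_seg, H, W)
        act = self.act_head(f.mean(dim=(2, 3)))          # global pool -> (B, C_act)
        return seg, act

model = JointSegAction()
frames = torch.randn(2, 3, 64, 64)
seg_gt = torch.randint(0, 10, (2, 64, 64))
act_gt = torch.randint(0, 8, (2,))
seg, act = model(frames)
loss = F.cross_entropy(seg, seg_gt) + 0.5 * F.cross_entropy(act, act_gt)  # weighted joint loss
loss.backward()
```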
DASZL: Dynamic Action Signatures for Zero-shot Learning
There are many realistic applications of activity recognition where the set
of potential activity descriptions is combinatorially large. This makes
end-to-end supervised training of a recognition system impractical, as no
training set can practically encompass the entire label set. In this
paper, we present an approach to fine-grained recognition that models
activities as compositions of dynamic action signatures. This compositional
approach allows us to reframe fine-grained recognition as zero-shot activity
recognition, where a detector is composed "on the fly" from simple
first-principles state machines supported by deep-learned components. We
evaluate our method on the Olympic Sports and UCF101 datasets, where our model
establishes a new state of the art under multiple experimental paradigms. We
also extend this method to form a unique framework for zero-shot joint
segmentation and classification of activities in video and demonstrate the
first results in zero-shot decoding of complex action sequences on a
widely-used surgical dataset. Lastly, we show that we can use off-the-shelf
object detectors to recognize activities in completely de novo settings with no
additional training.
Comment: 10 pages, 4 figures, 3 tables, AAAI 2021 submission
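The compositional idea can be made concrete in a few lines of Python: below,
an activity detector is composed "on the fly" as a state machine that fires
when per-frame attribute signals occur in a prescribed order. The attribute
names and the "needle pass" signature are hypothetical, and in the paper the
per-frame signals would come from deep-learned detectors rather than
hand-written booleans.

```python
# Minimal sketch of composing a zero-shot activity detector from a
# first-principles state machine over per-frame attribute signals.
from typing import Dict, List

def make_sequence_detector(signature: List[str]):
    """Return a detector that fires once the attributes occur in order."""
    def detect(frames: Dict[str, List[bool]]) -> int:
        state = 0                              # index of the next required attribute
        for t in range(len(next(iter(frames.values())))):
            if frames[signature[state]][t]:    # required attribute active this frame
                state += 1
                if state == len(signature):
                    return t                   # activity completed at frame t
        return -1                              # signature never completed
    return detect

# Compose a never-trained detector for a hypothetical "needle pass" activity.
detector = make_sequence_detector(["grasp_needle", "pierce_tissue", "release_needle"])
frames = {
    "grasp_needle":   [True, False, False, False, False],
    "pierce_tissue":  [False, False, True, False, False],
    "release_needle": [False, False, False, False, True],
}
print(detector(frames))  # -> 4: recognized with zero activity-level training
```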