Prompt-Guided Zero-Shot Anomaly Action Recognition using Pretrained Deep Skeleton Features
This study investigates unsupervised anomaly action recognition, which
identifies video-level abnormal-human-behavior events in an unsupervised manner
without abnormal samples, and simultaneously addresses three limitations of
conventional skeleton-based approaches: target domain-dependent DNN training,
robustness against skeleton errors, and a lack of normal samples. We present a
unified, user prompt-guided zero-shot learning framework using a target
domain-independent skeleton feature extractor, which is pretrained on a
large-scale action recognition dataset. In particular, during the training phase
using normal samples, the method models the distribution of skeleton features
of the normal actions while freezing the weights of the DNNs and estimates the
anomaly score using this distribution in the inference phase. Additionally, to
increase robustness against skeleton errors, we introduce a DNN architecture
inspired by a point cloud deep learning paradigm, which sparsely propagates the
features between joints. Furthermore, to prevent unobserved normal actions from
being misidentified as abnormal, we incorporate into the anomaly score a
similarity score between the user prompt embeddings and the skeleton features
aligned in a common space, which indirectly supplements the normal actions. On
two publicly available datasets, we conduct experiments to test the
effectiveness of the proposed method with respect to the abovementioned
limitations. Comment: CVPR 2023
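As a rough illustration of the scoring scheme described above, the sketch below fits a Gaussian to frozen skeleton features of normal actions and combines the Mahalanobis distance with a prompt-similarity term; the function names, feature dimensions, and the weighting factor alpha are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch: model the distribution of normal-action skeleton features
# (extracted by a frozen, pretrained DNN) and score anomalies at inference,
# discounted by similarity to user-prompt embeddings in a shared space.
import numpy as np

def fit_normal_distribution(normal_feats: np.ndarray):
    """Fit a Gaussian to the D-dim skeleton features of normal actions."""
    mu = normal_feats.mean(axis=0)
    cov = np.cov(normal_feats, rowvar=False) + 1e-6 * np.eye(normal_feats.shape[1])
    return mu, np.linalg.inv(cov)

def anomaly_score(feat, mu, cov_inv, prompt_embs, alpha=1.0):
    """Distance to the normal distribution minus a prompt-similarity bonus."""
    diff = feat - mu
    mahalanobis = float(np.sqrt(diff @ cov_inv @ diff))
    # cosine similarity between the skeleton feature and each prompt embedding
    sims = prompt_embs @ feat / (
        np.linalg.norm(prompt_embs, axis=1) * np.linalg.norm(feat) + 1e-8)
    # high similarity to a user prompt describing a normal action lowers the score
    return mahalanobis - alpha * float(sims.max())

rng = np.random.default_rng(0)
normal_feats = rng.normal(size=(500, 64))   # frozen skeleton features of normal clips
mu, cov_inv = fit_normal_distribution(normal_feats)
test_feat = rng.normal(size=64)
prompts = rng.normal(size=(5, 64))          # embeddings of user prompts
print(anomaly_score(test_feat, mu, cov_inv, prompts))
```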
Unified Keypoint-based Action Recognition Framework via Structured Keypoint Pooling
This paper simultaneously addresses three limitations associated with
conventional skeleton-based action recognition: skeleton detection and tracking
errors, poor variety of the targeted actions, and the lack of person-wise and
frame-wise action recognition. A point cloud deep-learning paradigm is
introduced to action recognition, and a unified framework along with a
novel deep neural network architecture called Structured Keypoint Pooling is
proposed. The proposed method sparsely aggregates keypoint features in a
cascaded manner based on prior knowledge of the data structure (which is
inherent in skeletons), such as the instances and frames to which each keypoint
belongs, and achieves robustness against input errors. Its less constrained and
tracking-free architecture enables time-series keypoints consisting of human
skeletons and nonhuman object contours to be efficiently treated as an input 3D
point cloud and extends the variety of the targeted actions. Furthermore, we
propose a Pooling-Switching Trick inspired by Structured Keypoint Pooling. This
trick switches the pooling kernels between the training and inference phases to
detect person-wise and frame-wise actions in a weakly supervised manner using
only video-level action labels. This trick enables our training scheme to
naturally introduce novel data augmentation, which mixes multiple point clouds
extracted from different videos. In the experiments, we comprehensively verify
the effectiveness of the proposed method with respect to these limitations, and the
method outperforms state-of-the-art skeleton-based action recognition and
spatio-temporal action localization methods. Comment: CVPR 2023
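The cascaded, permutation-invariant aggregation described above can be pictured with the following minimal sketch, which max-pools keypoint features first within each person instance and then within each frame using only membership indices; the tensor shapes, group layout, and helper name are illustrative assumptions, not the authors' Structured Keypoint Pooling code.

```python
# Sketch of sparse, cascaded keypoint aggregation guided by data-structure priors
# (which keypoint belongs to which instance, and which instance to which frame).
import torch

def grouped_max_pool(feats: torch.Tensor, group_ids: torch.Tensor) -> torch.Tensor:
    """Max-pool rows of `feats` (N, C) that share the same id in `group_ids` (N,)."""
    num_groups = int(group_ids.max().item()) + 1
    out = feats.new_full((num_groups, feats.shape[1]), float("-inf"))
    out.scatter_reduce_(0, group_ids.unsqueeze(1).expand_as(feats), feats, reduce="amax")
    return out

# toy input: 6 keypoints with 8-dim features, belonging to 2 instances over 2 frames
kp_feats = torch.randn(6, 8)
instance_ids = torch.tensor([0, 0, 0, 1, 1, 1])   # which instance each keypoint belongs to
frame_of_instance = torch.tensor([0, 1])          # which frame each instance belongs to

inst_feats = grouped_max_pool(kp_feats, instance_ids)          # keypoints -> instances
frame_feats = grouped_max_pool(inst_feats, frame_of_instance)  # instances -> frames
video_feat = frame_feats.max(dim=0).values                     # frames -> video
print(video_feat.shape)  # torch.Size([8])
```

Because the pooling is a max over set members, a detection error corrupts only the keypoints it touches rather than an entire tracked sequence, which is the robustness property the abstract refers to.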
Deep Selection: A Fully Supervised Camera Selection Network for Surgery Recordings
Recording surgery in operating rooms is an essential task for education and
evaluation of medical treatment. However, recording the desired targets, such
as the surgery field, surgical tools, or doctor's hands, is difficult because
the targets are heavily occluded during surgery. We use a recording system in
which multiple cameras are embedded in the surgical lamp, and we assume that at
least one camera is recording the target without occlusion at any given time.
As the embedded cameras obtain multiple video sequences, we address the task of
selecting the camera with the best view of the surgery. Unlike the conventional
method, which selects the camera based on the area size of the surgery field,
we propose a deep neural network that predicts the camera selection probability
from multiple video sequences by learning from expert annotations. We created a
dataset in which six different types of plastic surgery are recorded, and we
provided annotations of camera switching. Our
experiments show that our approach successfully switched between cameras and
outperformed three baseline methods. Comment: MICCAI 2020
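A minimal sketch of the kind of selection network described above is given below, assuming a shared ResNet-18 backbone per view and a softmax over per-camera scores; the backbone choice, feature size, and number of cameras are illustrative assumptions rather than the paper's exact architecture.

```python
# Sketch: predict a camera-selection probability from multiple synchronized views.
import torch
import torch.nn as nn
import torchvision

class CameraSelector(nn.Module):
    def __init__(self, num_cameras: int = 5):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Identity()          # reuse ResNet-18 as a 512-d feature extractor
        self.backbone = backbone
        self.head = nn.Linear(512, 1)        # one score per camera view

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (batch, num_cameras, 3, H, W)
        b, n, c, h, w = views.shape
        feats = self.backbone(views.reshape(b * n, c, h, w)).reshape(b, n, -1)
        scores = self.head(feats).squeeze(-1)  # (batch, num_cameras)
        return scores.softmax(dim=-1)          # camera-selection probabilities

probs = CameraSelector()(torch.randn(2, 5, 3, 224, 224))
print(probs.shape, probs.sum(dim=-1))  # (2, 5), each row sums to 1
```

Training against the expert annotation would then reduce to a cross-entropy loss between these probabilities and the annotated camera index at each time step.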
Deep learning in diabetic foot ulcers detection: A comprehensive evaluation
There has been a substantial amount of research involving computer methods and technology for the detection and recognition of diabetic foot ulcers (DFUs), but there is a lack of systematic comparisons of state-of-the-art deep learning object detection frameworks applied to this problem. DFUC2020 provided participants with a comprehensive dataset consisting of 2,000 images for training and 2,000 images for testing. This paper summarizes the results of DFUC2020 by comparing the deep learning-based algorithms proposed by the winning teams: Faster R-CNN, three variants of Faster R-CNN and an ensemble method; YOLOv3; YOLOv5; EfficientDet; and a new Cascade Attention Network. For each deep learning method, we provide a detailed description of model architecture, parameter settings for training and additional stages including pre-processing, data augmentation and post-processing. We provide a comprehensive evaluation for each method. All the methods required a data augmentation stage to increase the number of images available for training and a post-processing stage to remove false positives. The best performance was obtained from Deformable Convolution, a variant of Faster R-CNN, with a mean average precision (mAP) of 0.6940 and an F1-Score of 0.7434. Finally, we demonstrate that the ensemble method based on different deep learning methods can enhance the F1-Score but not the mAP.
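As a generic illustration of the shared post-processing steps mentioned above (confidence thresholding to remove false positives, followed by non-maximum suppression over pooled detections), here is a simple sketch; it is not the winning teams' ensembling code, and the thresholds are arbitrary.

```python
# Generic sketch: pool boxes from several detectors, drop low-confidence ones,
# then apply greedy non-maximum suppression.
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def ensemble_nms(model_outputs, score_thr=0.5, iou_thr=0.5):
    """model_outputs: list of (boxes Nx4, scores N) from different detectors."""
    boxes = np.concatenate([b for b, _ in model_outputs])
    scores = np.concatenate([s for _, s in model_outputs])
    keep_mask = scores >= score_thr            # remove likely false positives
    boxes, scores = boxes[keep_mask], scores[keep_mask]
    order = scores.argsort()[::-1]
    kept = []
    while order.size:
        i = order[0]
        kept.append(i)
        rest = order[1:]
        mask = np.array([iou(boxes[i], boxes[j]) < iou_thr for j in rest], dtype=bool)
        order = rest[mask]
    return boxes[kept], scores[kept]

preds_a = (np.array([[10, 10, 50, 50], [12, 11, 51, 49]], float), np.array([0.9, 0.8]))
preds_b = (np.array([[10, 12, 49, 52], [200, 200, 230, 240]], float), np.array([0.85, 0.3]))
boxes, scores = ensemble_nms([preds_a, preds_b])
print(boxes, scores)   # the overlapping boxes collapse to one; the 0.3 box is filtered out
```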
Dynamics-Regulated Kinematic Policy for Egocentric Pose Estimation
We propose a method for object-aware 3D egocentric pose estimation that
tightly integrates kinematics modeling, dynamics modeling, and scene object
information. Unlike prior kinematics or dynamics-based approaches where the two
components are used disjointly, we synergize the two approaches via
dynamics-regulated training. At each timestep, a kinematic model is used to
provide a target pose using video evidence and simulation state. Then, a
prelearned dynamics model attempts to mimic the kinematic pose in a physics
simulator. By comparing the pose instructed by the kinematic model against the
pose generated by the dynamics model, we can use their misalignment to further
improve the kinematic model. By factoring in the 6DoF pose of objects (e.g.,
chairs, boxes) in the scene, we demonstrate, for the first time, the ability to
estimate physically-plausible 3D human-object interactions using a single
wearable camera. We evaluate our egocentric pose estimation method in both
controlled laboratory settings and real-world scenarios. Comment: NeurIPS 2021. Project page:
https://zhengyiluo.github.io/projects/kin_poly
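A highly simplified sketch of the dynamics-regulated training idea might look as follows, with a stub linear network standing in for the pre-learned dynamics model and physics simulator, and arbitrary observation/pose dimensions; it only illustrates how the misalignment between the proposed and simulated poses can drive updates to the kinematic model.

```python
# Conceptual sketch of dynamics-regulated training (stub models, placeholder sizes).
import torch
import torch.nn as nn

obs_dim, pose_dim = 128, 24
kinematic_model = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, pose_dim))
dynamics_model = nn.Linear(pose_dim, pose_dim)   # stand-in for the pre-learned dynamics policy
for p in dynamics_model.parameters():
    p.requires_grad_(False)                      # the dynamics model stays frozen here

opt = torch.optim.Adam(kinematic_model.parameters(), lr=1e-4)
for step in range(100):
    obs = torch.randn(32, obs_dim)                    # video evidence + simulation state
    target_pose = kinematic_model(obs)                # kinematic model proposes a target pose
    with torch.no_grad():
        simulated_pose = dynamics_model(target_pose)  # dynamics model mimics it "in simulation"
    # the misalignment between proposed and simulated poses improves the kinematic model
    loss = (target_pose - simulated_pose).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```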
Surgical Tool Detection in Open Surgery Videos
Detecting surgical tools is an essential task for analyzing and evaluating surgical videos. However, most studies focus on minimally invasive surgery (MIS) and cataract surgery. Research on open surgery has been limited so far, mainly because of the lack of a large, diverse, and well-annotated dataset. Open surgery video analysis is challenging because of its properties: a varied number and roles of people (e.g., main surgeon, assistant surgeons, and nurses), complex interactions of tools and hands, and various operative environments and lighting conditions. In this paper, to handle these limitations and difficulties, we introduce an egocentric open surgery dataset that includes 15 open surgeries recorded with a head-mounted camera. More than 67k bounding boxes are labeled across 19k images covering 31 surgical tool categories. Finally, we present a surgical tool detection baseline model based on recent advances in object detection. The results on our new dataset show that it provides interesting challenges for future methods and can serve as a strong benchmark for the study of tool detection in open surgery.
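A baseline of the kind mentioned above could be set up roughly as follows with torchvision's Faster R-CNN, replacing the box head for 31 tool categories plus background; the image size, boxes, and labels are dummy placeholders, and this is a sketch rather than the paper's released baseline.

```python
# Sketch: adapt a torchvision Faster R-CNN to 31 surgical-tool categories.
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

num_classes = 31 + 1   # 31 surgical tool categories + background
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# one dummy training step; real training iterates over the egocentric dataset
model.train()
images = [torch.rand(3, 480, 640)]
targets = [{"boxes": torch.tensor([[50.0, 60.0, 200.0, 220.0]]),
            "labels": torch.tensor([3])}]
losses = model(images, targets)      # dict of detection losses in training mode
sum(losses.values()).backward()
```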
Hand Motion-Aware Surgical Tool Localization and Classification from an Egocentric Camera
Detecting surgical tools is an essential task for the analysis and evaluation of surgical videos. However, in open surgery such as plastic surgery, it is difficult to detect them because there are surgical tools with similar shapes, such as scissors and needle holders. Unlike endoscopic surgery, the tips of the tools are often hidden in the operating field and are not captured clearly due to low camera resolution, whereas the movements of the tools and hands can be captured. Since the different uses of each tool require different hand movements, it is possible to use hand-movement data to classify the two types of tools. We combined three modules, for localization, selection, and classification, to detect the two tools. In the localization module, we employed Faster R-CNN to detect surgical tools and target hands, and in the classification module, we extracted hand-movement information by combining ResNet-18 and an LSTM to classify the two tools. We created a dataset in which seven different types of open surgery were recorded, and we provided annotations for surgical tool detection. Our experiments show that our approach successfully detected the two different tools and outperformed the two baseline methods.
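The classification module described above (per-frame ResNet-18 features fed to an LSTM that separates the two tools) could be sketched as below; the clip length, crop size, and hidden dimension are assumptions for illustration, not the paper's exact settings.

```python
# Sketch: classify scissors vs. needle holder from a short clip of hand-region crops.
import torch
import torch.nn as nn
import torchvision

class HandMotionToolClassifier(nn.Module):
    def __init__(self, num_tools: int = 2, hidden: int = 256):
        super().__init__()
        cnn = torchvision.models.resnet18(weights=None)
        cnn.fc = nn.Identity()                     # 512-d per-frame feature
        self.cnn = cnn
        self.lstm = nn.LSTM(512, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_tools)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, time, 3, H, W) crops around the detected hand/tool
        b, t, c, h, w = clips.shape
        feats = self.cnn(clips.reshape(b * t, c, h, w)).reshape(b, t, -1)
        _, (h_n, _) = self.lstm(feats)             # temporal (hand-motion) aggregation
        return self.fc(h_n[-1])                    # logits over the two tools

logits = HandMotionToolClassifier()(torch.randn(2, 8, 3, 112, 112))
print(logits.shape)  # torch.Size([2, 2])
```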
Multi-Camera Multi-Person Tracking and Re-Identification in an Operating Room
Multi-camera multi-person (MCMP) tracking and re-identification (ReID) are essential tasks for safety, pedestrian analysis, and related applications; however, most research focuses on outdoor scenarios, and it is much more complicated to deal with occlusions and misidentification in a crowded room with obstacles. Moreover, it is challenging to complete the two tasks in one framework. We present a trajectory-based method that integrates the tracking and ReID tasks. First, the poses of all surgical members captured by each camera are detected frame by frame; then, the detected poses are exploited to track the trajectories of all members for each camera; finally, these trajectories from different cameras are clustered to re-identify the members in the operating room across all cameras. Compared to other MCMP tracking and ReID methods, the proposed one mainly exploits trajectories, using texture features, which are less distinguishable in the operating room scenario, only as auxiliary cues. We also integrate temporal information during ReID, which is more reliable than the state-of-the-art framework in which ReID is conducted frame by frame. In addition, our framework requires no training before deployment in new scenarios. We also created an annotated MCMP dataset with actual operating room videos. Our experiments prove the effectiveness of the proposed trajectory-based ReID algorithm. The proposed framework achieves 85.44% accuracy in the ReID task, outperforming the state-of-the-art framework on our operating room dataset.
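The final clustering step described above can be pictured with a rough sketch that compares trajectories on a common ground plane and groups them by agglomerative clustering; the trajectory format, distance function, and threshold are assumptions for illustration, not the paper's algorithm.

```python
# Sketch: cluster per-camera trajectories so that each cluster is one person.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def trajectory_distance(tr_a: dict, tr_b: dict) -> float:
    """Mean ground-plane distance over frames observed by both cameras."""
    common = sorted(set(tr_a) & set(tr_b))
    if not common:
        return 1e3                              # no temporal overlap -> large distance
    return float(np.mean([np.linalg.norm(np.array(tr_a[f]) - np.array(tr_b[f]))
                          for f in common]))

# toy trajectories: {frame_index: (x, y) on the ground plane}, one per camera detection
trajs = [
    {0: (0.0, 0.0), 1: (0.1, 0.0)},    # camera 1, person A
    {0: (0.05, 0.0), 1: (0.12, 0.0)},  # camera 2, person A (same person, small offset)
    {0: (3.0, 2.0), 1: (3.1, 2.1)},    # camera 1, person B
]
n = len(trajs)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = trajectory_distance(trajs[i], trajs[j])

labels = fcluster(linkage(squareform(dist), method="average"), t=0.5, criterion="distance")
print(labels)   # same label -> same person across cameras, e.g. [1 1 2]
```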