    Reducing prediction volatility in the surgical workflow recognition of endoscopic pituitary surgery

    PURPOSE: Workflow recognition can aid surgeons before an operation when used as a training tool, during an operation by increasing operating room efficiency, and after an operation in the completion of operation notes. Although several methods have been applied to this task, they have been tested on few surgical datasets, so their generalisability is not well established, particularly for surgical approaches that use smaller working spaces, are susceptible to occlusion, and necessitate frequent withdrawal of the endoscope. These conditions lead to rapidly changing predictions, which reduces clinical confidence in the methods and hence limits their suitability for clinical translation. METHODS: First, the optimal neural network is identified using established methods, with endoscopic pituitary surgery as an exemplar. Prediction volatility is then formally defined as a new evaluation metric serving as a proxy for uncertainty, and two temporal smoothing functions are created. The first (modal, Mn) takes the mode over the previous n predictions, and the second (threshold, Tn) only changes class after it has been continuously predicted for n predictions. Both functions are independently applied to the predictions of the optimal network. RESULTS: The methods are evaluated on a 50-video dataset using fivefold cross-validation, with weighted F1 score as the optimised evaluation metric. The optimal model is ResNet-50+LSTM, achieving 0.84 in 3-phase classification and 0.74 in 7-step classification. Applying threshold smoothing further improves these results to 0.86 in 3-phase classification and 0.75 in 7-step classification, while also drastically reducing prediction volatility. CONCLUSION: The results confirm that the established methods generalise to endoscopic pituitary surgery, and show that simple temporal smoothing not only reduces prediction volatility but actively improves performance.
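
    The two smoothing functions are simple enough to sketch directly. Below is a minimal Python sketch of modal smoothing (Mn) and threshold smoothing (Tn) as defined in the abstract; function and variable names are our own, and the authors' implementation may differ.

        from collections import Counter, deque

        def modal_smoothing(predictions, n):
            """M_n: replace each prediction with the mode of the last n
            predictions (including the current one)."""
            window, smoothed = deque(maxlen=n), []
            for p in predictions:
                window.append(p)
                smoothed.append(Counter(window).most_common(1)[0][0])
            return smoothed

        def threshold_smoothing(predictions, n):
            """T_n: only switch to a new class once it has been predicted
            for n consecutive frames; otherwise keep the current class."""
            smoothed, current, candidate, run = [], None, None, 0
            for p in predictions:
                if current is None:
                    current = p  # initialise with the first prediction
                run = run + 1 if p == candidate else 1
                candidate = p
                if p != current and run >= n:
                    current = p  # accept the new class after n repeats
                smoothed.append(current)
            return smoothed

    For example, threshold_smoothing([0, 0, 1, 0, 1, 1, 1], n=3) keeps outputting class 0 until class 1 has been seen three times in a row, suppressing the rapid prediction changes the abstract describes.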

    3D Textured Surface Reconstruction from Endoscopic Video

    Endoscopy enables high-resolution visualization of tissue texture and is a critical step in many clinical workflows, including the diagnosis of infections, tumors or diseases and treatment planning for cancers. These include my target problems of radiation treatment planning in the nasopharynx and precancerous polyp screening and treatment in colonoscopy. However, an endoscopic video does not present its information in 3D space, making it difficult to use for tumor localization, and it is inefficient to review. In addition, because camera observations of the organ surface may be incomplete, full surface coverage cannot be guaranteed in an endoscopic procedure, and unsurveyed regions are hard to notice from a continuous first-person perspective. This dissertation introduces a new imaging approach that we call endoscopography: an endoscopic video is reconstructed into a full 3D textured surface, which we call an endoscopogram. In this dissertation, I present two endoscopography techniques. The first is a combination of a frame-by-frame algorithmic 3D reconstruction method and a groupwise deformable surface registration method. My contribution is the innovative combination of the two methods, which improves the temporal consistency of the frame-by-frame 3D reconstruction algorithm and eliminates the manual intervention that was needed in the deformable surface registration method. The combined method reconstructs an endoscopogram offline, and the information contained in the tissue texture of the endoscopogram can be transferred to a 3D image such as CT through surface-to-surface registration. Then, through an interactive tool, the physician can draw directly on the endoscopogram surface to specify a tumor, which can then be automatically transferred to CT slices to aid tumor localization. The second method is a novel deep-learning-driven dense SLAM (simultaneous localization and mapping) system, called RNN-SLAM, that can produce an endoscopogram in real time with display of the unsurveyed regions. In particular, my contribution is the deep learning system in RNN-SLAM, called RNN-DP. RNN-DP is a novel multi-view dense depth map and odometry estimation method that uses recurrent neural networks (RNNs) and trains with multi-view image reprojection and forward-backward flow-consistency losses.
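
    The forward-backward flow-consistency loss mentioned for RNN-DP can be illustrated with a short PyTorch sketch: the backward flow is warped into the first frame via the forward flow, and the sum of the two is penalised, since it should vanish for consistent flows. This is a generic formulation under our own assumptions, not the thesis implementation.

        import torch
        import torch.nn.functional as F

        def warp(tensor, flow):
            """Backward-warp `tensor` (B,C,H,W) with a dense flow field
            (B,2,H,W) given in pixel units, using bilinear sampling."""
            b, _, h, w = flow.shape
            ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w),
                                    indexing="ij")
            base = torch.stack((xs, ys), dim=0).float().to(flow.device)
            coords = base.unsqueeze(0) + flow            # (B,2,H,W)
            # normalise pixel coordinates to [-1, 1] for grid_sample
            gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
            gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
            grid = torch.stack((gx, gy), dim=-1)         # (B,H,W,2)
            return F.grid_sample(tensor, grid, align_corners=True)

        def fb_consistency_loss(flow_fwd, flow_bwd):
            """Penalise the residual of composing forward and backward
            flows, which is zero for perfectly consistent flow fields."""
            return (flow_fwd + warp(flow_bwd, flow_fwd)).abs().mean()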

    Automatic Esophageal Abnormality Detection and Classification

    Esophageal cancer is counted as one of the deadliest cancers worldwide, ranking sixth among all types of cancer. Early esophageal cancer typically causes no symptoms and mainly arises from overlooked or untreated premalignant abnormalities in the esophagus tube. Endoscopy is the main tool used for the detection of abnormalities, and the cell deformation stage is confirmed by taking biopsy samples. The process of detection and classification is considered challenging for several reasons: different types of abnormalities (including early cancer stages) can be located randomly throughout the esophagus tube; abnormal regions can have various sizes and appearances, which makes them difficult to capture; and the columnar mucosa can be hard to discriminate from the metaplastic epithelium. Although many studies have been conducted, it remains a challenging task, and improving the accuracy of automatically classifying and detecting different esophageal abnormalities is an ongoing field of research. This thesis aims to develop novel automated methods for the detection and classification of abnormal esophageal regions (precancerous and cancerous) from endoscopic images and videos. In this thesis, firstly, the abnormality stage of the esophageal cell deformation is classified from confocal laser endomicroscopy (CLE) images. The CLE is an endoscopic tool that provides a digital pathology view of the esophagus cells. The classification is achieved by enhancing the internal features of the CLE image, using a novel enhancement filter that utilizes fractional integration and differentiation. Different imaging features, including multi-scale pyramid rotation LBP (MP-RLBP), gray level co-occurrence matrices (GLCM), fractal analysis, fuzzy LBP and maximally stable extremal regions (MSER), are calculated from the enhanced image to assure a robust classification result. The support vector machine (SVM) and random forest (RF) classifiers are employed to classify each image into its pathology stage. Secondly, we propose an automatic detection method to locate abnormal regions in high-definition white light endoscopy (HD-WLE) images. We first investigate the performance of different deep learning detection methods on our dataset. Then we propose an approach that combines hand-designed Gabor features with extracted convolutional neural network features that are used by the Faster R-CNN to detect abnormal regions. Moreover, to further improve the detection performance, we propose a novel two-input network named GFD-Faster RCNN. The proposed method generates a Gabor fractal image from the original endoscopic image using Gabor filters. Features are then learned separately from the endoscopic image and the generated Gabor fractal image using a densely connected convolutional network to detect abnormal esophageal regions. Thirdly, we present a novel model to detect abnormal regions from endoscopic videos. We design a 3D Sequential DenseConvLstm network to extract spatiotemporal features from the input videos, which are utilized by a region proposal network and an ROI pooling layer to detect abnormal regions in each frame throughout the video. Additionally, we suggest an FS-CRF post-processing method that incorporates a Conditional Random Field (CRF) on a frame-based level to recover missed abnormal regions in neighborhood frames within the same clip.
    The methods are evaluated on four datasets: (1) the CLE dataset, used for the classification model; (2) the publicly available Kvasir dataset; (3) the MICCAI’15 EndoVis challenge dataset, with datasets (2) and (3) used to evaluate the detection model on endoscopic images; and (4) the Gastrointestinal Atlas dataset, used to evaluate the video detection model. The experimental results of the different models are promising and outperform state-of-the-art methods.
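
    As an illustration of the Gabor-based input stream, the sketch below builds a single response image from a bank of oriented Gabor filters using OpenCV. The kernel parameters and the max-pooling combination are illustrative assumptions; the thesis' actual Gabor fractal construction may differ.

        import cv2
        import numpy as np

        def gabor_response_image(gray, n_orientations=8, ksize=21):
            """Combine a bank of oriented Gabor filter responses into one
            image by taking the per-pixel maximum across orientations."""
            responses = []
            for i in range(n_orientations):
                theta = i * np.pi / n_orientations
                kernel = cv2.getGaborKernel((ksize, ksize), sigma=4.0,
                                            theta=theta, lambd=10.0,
                                            gamma=0.5, psi=0.0)
                responses.append(
                    cv2.filter2D(gray.astype(np.float32), -1, kernel))
            return np.max(np.stack(responses), axis=0)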

    Automated colonoscopy withdrawal phase duration estimation using cecum detection and surgical tasks classification

    Colorectal cancer is the third most common type of cancer, with almost two million new cases worldwide. It develops from neoplastic polyps, most commonly adenomas, which can be removed during colonoscopy to prevent colorectal cancer from occurring. Unfortunately, up to a quarter of polyps are missed during colonoscopies. Studies have shown that polyp detection during a procedure correlates with the time spent searching for polyps, called the withdrawal time. The different phases of the procedure (cleaning, therapeutic, and exploration phases) make it difficult to precisely measure the withdrawal time, which should only include the exploration phase. Separating this from the other phases requires manual time measurement during the procedure, which is rarely performed. In this study, we propose a method to automatically detect the cecum, which marks the start of the withdrawal phase, and to classify the different phases of the colonoscopy, which allows precise estimation of the final withdrawal time. This is achieved using a ResNet for both detection and classification, trained with two public datasets and a private dataset composed of 96 full procedures. Out of 19 testing procedures, 18 have their withdrawal time correctly estimated, with a mean error of 5.52 seconds per minute per procedure.
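
    Once per-frame phase labels and a cecum detection are available, the withdrawal-time estimate itself reduces to a frame count, as in the hypothetical sketch below; the phase label names and the fps interface are our assumptions, not the paper's API.

        def withdrawal_time_seconds(frame_phases, cecum_frame, fps):
            """Sum the exploration-phase frames from the detected cecum
            frame onwards and convert the count to seconds."""
            exploration_frames = sum(
                1 for i, phase in enumerate(frame_phases)
                if i >= cecum_frame and phase == "exploration"
            )
            return exploration_frames / fps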

    Image-set, Temporal and Spatiotemporal Representations of Videos for Recognizing, Localizing and Quantifying Actions

    This dissertation addresses the problem of learning video representations, which is defined here as transforming the video so that its essential structure is made more visible or accessible for action recognition and quantification. In the literature, a video can be represented by a set of images, by modeling motion or temporal dynamics, and by a 3D graph with pixels as nodes. This dissertation contributes a set of models to localize, track, segment, recognize and assess actions: (1) image-set models via aggregating subset features given by regularizing normalized CNNs; (2) image-set models via inter-frame principal recovery and sparse coding of residual actions; (3) temporally local models with spatially global motion estimated by robust feature matching and local motion estimated by action detection with a motion model added; (4) spatiotemporal models, a 3D graph and a 3D CNN, to model time as a space dimension; and (5) supervised hashing by jointly learning embedding and quantization. State-of-the-art performance is achieved for tasks such as quantifying facial pain and human diving. The primary conclusions of this dissertation are as follows: (i) image sets can capture facial actions as a collective representation; (ii) sparse and low-rank representations can untangle expression, identity and pose cues, and can be learned via an image-set model and also a linear model; (iii) norm is related to recognizability, and similarity metrics and loss functions matter; (iv) combining the MIL-based boosting tracker with the particle filter motion model induces a good trade-off between appearance similarity and motion consistency; (v) segmenting an object locally makes it amenable to shape priors, and it is feasible to learn knowledge such as shape priors online from Web data with weak supervision; (vi) representing videos as 3D graphs works locally in both space and time, and 3D CNNs work effectively when given temporally meaningful clips; (vii) richly labeled images or videos help to learn better hash functions, after learning binary embedding codes, than random projections. In addition, the models proposed for videos can be adapted to other sequential images, such as volumetric medical images, which are not included in this dissertation.
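
    Conclusion (vi), that 3D CNNs work effectively when given temporally meaningful clips, can be illustrated with a minimal PyTorch stub that treats time as a third spatial dimension; the shapes and layer sizes here are arbitrary choices for illustration, not the dissertation's architecture.

        import torch
        import torch.nn as nn

        # A batch of two 16-frame RGB clips, shape (B, C, T, H, W).
        clip = torch.randn(2, 3, 16, 112, 112)
        net = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),  # time convolved
            nn.ReLU(),                                   # like space
            nn.MaxPool3d(2),
        )
        features = net(clip)  # -> (2, 32, 8, 56, 56)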

    Spatiotemporal Event Graphs for Dynamic Scene Understanding

    Dynamic scene understanding is the ability of a computer system to interpret and make sense of the visual information present in a video of a real-world scene. In this thesis, we present a series of frameworks for dynamic scene understanding, starting from road event detection from an autonomous driving perspective and moving to complex video activity detection, followed by continual learning approaches for the life-long learning of the models. Firstly, we introduce the ROad event Awareness Dataset (ROAD) for Autonomous Driving, to our knowledge the first of its kind. ROAD is designed to test an autonomous vehicle’s ability to detect road events, defined as triplets composed of an active agent, the action(s) it performs and the corresponding scene locations. Due to the lack of datasets equipped with formally specified logical requirements, we also introduce the ROad event Awareness Dataset with logical Requirements (ROAD-R), the first publicly available dataset for autonomous driving with requirements expressed as logical constraints, as a tool for driving neurosymbolic research in the area. Next, we extend event detection to holistic scene understanding by proposing two complex activity detection methods. In the first method, we present a deformable, spatiotemporal scene-graph approach consisting of three main building blocks: action tube detection; a 3D deformable RoI pooling layer designed for learning the flexible, deformable geometry of the constituent action tubes; and a scene graph constructed by considering all parts as nodes and connecting them based on different semantics. In the second approach, evolving from the first, we propose a hybrid graph neural network that combines attention applied to a graph encoding of the local (short-term) dynamic scene with a temporal graph modelling the overall long-duration activity. Our contribution is threefold: i) a feature extraction technique; ii) a method for constructing a local scene graph followed by graph attention; and iii) a graph for temporally connecting all the local dynamic scene graphs. Finally, the last part of the thesis presents a new continual semi-supervised learning (CSSL) paradigm, proposed to the attention of the machine learning community. We also propose to formulate the continual semi-supervised learning problem as a latent-variable model.
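
    The graph-attention step over a local dynamic scene graph (contribution ii above) can be sketched generically in PyTorch: node features attend to one another and are pooled into a single vector that a temporal graph could then consume. The dimensions and pooling choice are our assumptions, not the thesis architecture.

        import torch
        import torch.nn as nn

        # 12 scene-graph nodes (agents, action tubes, locations) with
        # 256-d features; self-attention plays the role of graph attention
        # on a fully connected local scene graph.
        nodes = torch.randn(1, 12, 256)  # (batch, nodes, feature dim)
        attn = nn.MultiheadAttention(embed_dim=256, num_heads=4,
                                     batch_first=True)
        ctx, _ = attn(nodes, nodes, nodes)   # graph self-attention
        local_scene_vec = ctx.mean(dim=1)    # one vector per local graph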

    Bayesian image restoration and bacteria detection in optical endomicroscopy

    Optical microscopy systems can be used to obtain high-resolution microscopic images of tissue cultures and ex vivo tissue samples. This imaging technique can be translated to in vivo, in situ applications by using optical fibres and miniature optics. Fibred optical endomicroscopy (OEM) can enable optical biopsy in organs inaccessible to any other imaging system, and hence can provide rapid and accurate diagnosis in a short time. The raw data the system produces is difficult to interpret, as it is modulated by a fibre bundle pattern, producing what is called the “honeycomb effect”. Moreover, the data is further degraded by the fibre core cross-coupling problem. In addition, there is an unmet clinical need for automatic tools that can help clinicians detect fluorescently labelled bacteria in distal lung images. The aim of this thesis is to develop advanced image processing algorithms that can address the above-mentioned problems. First, we provide a statistical model for the fibre core cross-coupling problem and the sparse sampling by imaging fibre bundles (honeycomb artefact), which are formulated here as a restoration problem for the first time in the literature. We then introduce a non-linear interpolation method, based on Gaussian process regression, in order to recover an interpretable scene from the deconvolved data. Second, we develop two bacteria detection algorithms, each with different characteristics. The first approach considers a joint formulation of the sparse coding and anomaly detection problems. The anomalies here are treated as candidate bacteria, which are annotated with the help of a trained clinician. Although this approach provides good detection performance and outperforms existing methods in the literature, the user has to carefully tune some crucial model parameters. Hence, we propose a more adaptive approach, for which a Bayesian framework is adopted. This approach not only outperforms the proposed supervised approach and existing methods in the literature, but also provides computation time that competes with optimization-based methods.
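
    The Gaussian-process interpolation step can be illustrated at small scale with scikit-learn: intensities measured at sparse fibre-core locations are regressed onto a dense pixel grid. The kernel choice and scales are our assumptions, and exact GP regression scales cubically with the number of cores, so this is a toy sketch rather than the thesis pipeline.

        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor
        from sklearn.gaussian_process.kernels import RBF, WhiteKernel

        def interpolate_cores(core_xy, core_intensity, out_shape):
            """Fit a GP on (x, y) fibre-core coordinates and their
            intensities, then predict on a dense pixel grid to remove
            the honeycomb sampling pattern."""
            gp = GaussianProcessRegressor(
                kernel=RBF(length_scale=3.0) + WhiteKernel(noise_level=1e-2))
            gp.fit(core_xy, core_intensity)
            h, w = out_shape
            ys, xs = np.mgrid[0:h, 0:w]
            grid = np.column_stack((xs.ravel(), ys.ravel()))
            return gp.predict(grid).reshape(h, w)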