While object recognition with deep neural networks (DNNs)
has shown remarkable success on natural images, endoscopic
images cannot yet be fully analysed with DNNs, since their
analysis must account for occlusion, light reflection and
image blur. UNet-based deep convolutional neural networks
offer great potential for extracting high-level spatial features,
thanks to their hierarchical structure with multiple levels of
abstraction, which is especially useful for multimodal
endoscopic images combining white light and fluoroscopy
in the diagnosis of esophageal disease.
However, currently reported inference times for DNNs exceed
200 ms, which is unsuitable for integration into robotic
control loops. This work addresses real-time object detection
and semantic segmentation in endoscopic devices. We show
that endoscopic assistive diagnosis can achieve satisfactory
detection rates with a fast inference time.
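The real-time constraint can be made concrete with a simple budget check: a control loop running at a given rate allows only one loop period per frame, and inference must fit inside it. A minimal sketch, assuming a hypothetical 25 Hz endoscopic video stream (the rate is an illustrative assumption, not from the paper):

```python
# Illustrative sketch: why ~200 ms inference is unsuitable for a
# robotic control loop. The 25 Hz loop rate is an assumed example.

def frame_budget_ms(rate_hz: float) -> float:
    """Time available per frame, in milliseconds, at a given loop rate."""
    return 1000.0 / rate_hz

def meets_realtime(inference_ms: float, rate_hz: float) -> bool:
    """True if one inference fits inside one control-loop period."""
    return inference_ms <= frame_budget_ms(rate_hz)

# At 25 Hz the budget is 40 ms per frame, so 200 ms misses it badly,
# while a 30 ms inference would fit.
print(frame_budget_ms(25.0))          # 40.0
print(meets_realtime(200.0, 25.0))    # False
print(meets_realtime(30.0, 25.0))     # True
```

In practice the inference time must also leave headroom for image acquisition and actuation within the same period, so the usable budget is even tighter than the raw loop period.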