353 research outputs found

    ์ค‘๋ณต ์—ฐ์‚ฐ ์ƒ๋žต์„ ํ†ตํ•œ ํšจ์œจ์ ์ธ ์˜์ƒ ๋ฐ ๋™์˜์ƒ ๋ถ„ํ•  ๋ชจ๋ธ

    Thesis (Ph.D.) -- Seoul National University Graduate School: Graduate School of Convergence Science and Technology, Department of Convergence Science (Intelligent Convergence Systems major), 2021.8. Nojun Kwak.
Segmentation has seen remarkable performance advances through deep convolutional neural networks, like other fields of computer vision. The technology is essential because it allows various visual applications such as AR/VR, autonomous driving, and surveillance systems to understand the surrounding scene and recognize the shapes of objects. However, most previous methods cannot be applied directly to real-world systems because of their tremendous computational cost. To reduce model complexity, this dissertation focuses on image semantic segmentation and semi-supervised video object segmentation among the various sub-fields of segmentation. We point out redundant operations in conventional frameworks and propose solutions from three perspectives.
First, we discuss the spatial redundancy issue in the decoder. The decoder performs upsampling to recover small-resolution feature maps to the original input resolution so that a sharp mask can be generated, and it classifies each pixel to find its semantic category. However, neighboring pixels share information and are highly likely to belong to the same semantic category, so independent pixel-wise computation in the decoder is unnecessary. To resolve this problem, we propose a superpixel-based sampling architecture that eliminates the decoder process by reducing spatial redundancy. The proposed network is trained and tested with only 0.37% of the total pixels, together with a learning-rate re-adjustment scheme driven by statistical process control (SPC) of the gradients in each layer. Experiments on the Pascal Context and SUN-RGBD datasets show that our network achieves better or comparable accuracy with far less computation than various conventional methods.
Second, we point out the dilated convolution in the encoder. Dilated convolution is widely used in encoders to obtain a large receptive field and improve performance. One practical way to reduce computation for execution on mobile devices is to apply a depth-wise separable convolution strategy to the dilated convolution. However, the simple combination of these two methods incurs severe performance degradation because the over-simplified operation loses information in the feature map. To resolve this problem, we propose a new convolutional block, called Concentrated-Comprehensive Convolution (C3), that compensates for the information loss. We apply the C3-block to various segmentation frameworks (DRN, ERFNet, ENet, and DeepLab V3) and demonstrate its benefits experimentally on the Cityscapes and Pascal VOC datasets. Another issue with dilated convolution is that its latency depends on the dilation rate. In theory, dilated convolution should have similar latency regardless of the dilation rate, but we observe that on real devices the latency differs by up to 2 times. To mitigate this issue, we devise another convolutional block called the spatial squeeze (S2) block. The S2-block squeezes spatial information with average pooling to capture long-range information while greatly reducing computation. We provide qualitative and quantitative analysis of the proposed S2-block-based network against other lightweight segmentation models, compare its performance with the C3-block on the Cityscapes dataset, and show that the proposed model runs successfully on an actual mobile device.
Third, we tackle the temporal redundancy problem in video segmentation. One of the critical techniques in computer vision is handling video data efficiently. Semi-supervised Video Object Segmentation (semi-VOS) propagates information from previous frames to generate a segmentation mask for the current frame. However, previous works treat every frame as equally important and run the full network path for each one. This yields high-quality segmentation in challenging scenarios such as shape changes and occlusion, but it also leads to unnecessary computation for stationary or slow-moving objects whose appearance changes little across frames. We exploit this observation by using temporal information to quickly identify frames with little change and skip the heavyweight mask-generation step. To realize this, we propose a novel dynamic network that estimates the change between frames and, depending on the expected similarity, decides which path to take: computing the full network or reusing the previous frame's result. Experimental results show that our approach significantly improves inference speed without much accuracy degradation on challenging semi-VOS datasets -- DAVIS 16, DAVIS 17, and YouTube-VOS. Furthermore, our approach can be applied to multiple semi-VOS methods, demonstrating its generality.
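As a rough illustration of the dynamic inference idea in the third contribution, here is a hypothetical PyTorch sketch (the wrapper name and the mean-absolute-difference change score are my assumptions; the dissertation itself quantifies movement with template matching and learns the path decision with a gate probability loss): a cheap change estimate decides whether to run the full mask-generation network or to reuse the previous frame's result.

```python
# Hedged sketch of skipping heavy mask generation when little changes between frames;
# the change metric and threshold below are placeholders, not the dissertation's design.
import torch
import torch.nn as nn

class DynamicVOSWrapper(nn.Module):
    def __init__(self, full_model: nn.Module, change_threshold: float = 0.05):
        super().__init__()
        self.full_model = full_model          # assumed: (1, C, H, W) frame -> (1, H, W) mask probabilities
        self.change_threshold = change_threshold
        self.prev_frame = None
        self.prev_mask = None

    @torch.no_grad()
    def estimate_change(self, frame: torch.Tensor) -> float:
        # Cheap motion proxy: mean absolute difference inside the previous target region.
        if self.prev_frame is None or self.prev_mask is None:
            return float("inf")               # first frame: always take the full path
        region = self.prev_mask > 0.5
        if not bool(region.any()):
            return float("inf")
        diff = (frame - self.prev_frame).abs().mean(dim=0)   # (H, W)
        return float(diff[region].mean())

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        # frame: (C, H, W)
        if self.estimate_change(frame) < self.change_threshold:
            mask = self.prev_mask             # little change: skip the heavy path, reuse the result
        else:
            mask = self.full_model(frame.unsqueeze(0)).squeeze(0)
        self.prev_frame, self.prev_mask = frame, mask
        return mask
```

In the dissertation, the skipped path reuses the previous frame's feature map rather than the raw mask; this sketch simplifies that detail.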
Table of Contents:
1 Introduction
  1.1 Challenging Problem
    1.1.1 Semantic Segmentation
    1.1.2 Semi-supervised Video Object Segmentation
  1.2 Contribution
    1.2.1 Reducing Spatial Redundancy in Decoder
    1.2.2 Beyond Dilated Convolution
    1.2.3 Reducing Temporal Redundancy in Semi-supervised Video Object Segmentation
  1.3 Outline
2 Related Work
  2.1 Decoder for Segmentation
  2.2 Feature Extraction for Segmentation Encoder
  2.3 Tracking Target for Video Object Segmentation
    2.3.1 Mask Propagation
    2.3.2 Online-learning
    2.3.3 Template Matching
  2.4 Reducing Computation for Deep Learning Networks
    2.4.1 Convolution Factorization
    2.4.2 Dynamic Network
  2.5 Datasets and Measurements
    2.5.1 Image Semantic Segmentation
    2.5.2 Video Object Segmentation
    2.5.3 Measurement
3 Reducing Spatial Redundancy in Decoder via Sampling based on Superpixel
  3.1 Related Work
  3.2 Sampling Method Based on Superpixel for Train and Test
  3.3 Details of Remapping Feature Map
  3.4 Re-adjusting Learning Rates
  3.5 Experiments
    3.5.1 Implementation details
    3.5.2 Pascal Context Benchmark Experiments
    3.5.3 Analysis of the Number of Superpixel
    3.5.4 SUN-RGBD Benchmark Experiments
4 Beyond Dilated Convolution for Better Lightweight Encoder
  4.1 Related Work
  4.2 Rethinking about Property of Dilated Convolutions
  4.3 Concentrated-Comprehensive Convolution
  4.4 Experiments of C3
    4.4.1 Ablation Study on C3 based on ESPNet
    4.4.2 Evaluation on Cityscapes with Other Models
    4.4.3 Evaluation on PASCAL VOC with Other Models
  4.5 Rethinking about Speed of Dilated Convolutions and Multi-branches Structures
  4.6 Spatial Squeeze Block
    4.6.1 Overall Structure
  4.7 Experiments of S2
    4.7.1 Evaluation Results on the EG1800 Dataset
    4.7.2 Ablation Study
  4.8 Comparison between C3 and S2
    4.8.1 Evaluation Results on the Cityscapes Dataset
5 Reducing Temporal Redundancy in Semi-supervised Video Object Segmentation via Dynamic Inference Framework
  5.1 Related Work
  5.2 Online-learning for Semi-supervised Video Object Segmentation
    5.2.1 Brief Explanation of Baseline Architecture
    5.2.2 Our Dynamic Inference Framework
  5.3 Quantifying Movement for Recognizing Temporal Redundancy
    5.3.1 Details of Template Matching
  5.4 Reusing Previous Feature Map
  5.5 Extend to General Semi-supervised Video Object Segmentation
  5.6 Gate Probability Loss
  5.7 Experiment
    5.7.1 DAVIS Benchmark Result
    5.7.2 Ablation Study
    5.7.3 YouTube-VOS Result
    5.7.4 Qualitative Examples
6 Conclusion
  6.1 Summary
  6.2 Limitations
  6.3 Future Works
Abstract (In Korean)
Acknowledgements (In Korean)

    EMC2A-Net: An Efficient Multibranch Cross-channel Attention Network for SAR Target Classification

    In recent years, convolutional neural networks (CNNs) have shown great potential in synthetic aperture radar (SAR) target recognition. SAR images have a strong sense of granularity and contain texture features at different scales, such as speckle noise, target dominant scatterers, and target contours, which are rarely considered in traditional CNN models. This paper proposed two residual blocks, namely EMC2A blocks with multiscale receptive fields (RFs), based on a multibranch structure, and then designed an efficient isotopic-architecture deep CNN (DCNN), EMC2A-Net. EMC2A blocks utilize parallel dilated convolutions with different dilation rates, which can effectively capture multiscale context features without significantly increasing the computational burden. To further improve the efficiency of multiscale feature fusion, this paper proposed a multiscale feature cross-channel attention module, namely the EMC2A module, which adopts a local multiscale feature interaction strategy without dimensionality reduction. This strategy adaptively adjusts the weight of each channel through an efficient one-dimensional (1D) circular convolution and a sigmoid function to guide attention at the global channel-wise level. Comparative results on the MSTAR dataset show that EMC2A-Net outperforms existing models of the same type and has a relatively lightweight network structure. Ablation experiment results show that the EMC2A module significantly improves model performance using only a few parameters and appropriate cross-channel interactions.
    Comment: 15 pages, 9 figures, Submitted to IEEE Transactions on Geoscience and Remote Sensing, 202
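The channel-attention mechanism described in the abstract can be pictured with a short, hedged PyTorch sketch (my own simplified reading, not the authors' EMC2A implementation; the class name CircularChannelAttention and the single kernel size are assumptions): channel weights come from a 1D convolution with circular padding applied to globally pooled channel descriptors, followed by a sigmoid, with no dimensionality reduction.

```python
# Hedged sketch of ECA-style cross-channel attention with a circular 1D convolution,
# loosely following the abstract's description; not the paper's exact EMC2A module.
import torch
import torch.nn as nn

class CircularChannelAttention(nn.Module):
    def __init__(self, kernel_size: int = 3):
        super().__init__()
        # Circular padding lets the 1D interaction wrap around the channel axis,
        # so every channel sees the same number of neighbors.
        self.conv = nn.Conv1d(1, 1, kernel_size,
                              padding=kernel_size // 2,
                              padding_mode="circular",
                              bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, C, H, W)
        desc = x.mean(dim=(2, 3))                                        # (N, C) global average pooling
        weights = self.sigmoid(self.conv(desc.unsqueeze(1))).squeeze(1)  # (N, C), no dimensionality reduction
        return x * weights.unsqueeze(-1).unsqueeze(-1)                   # re-weight each channel
```

A multiscale variant in the spirit of the abstract could run several such kernel sizes in parallel and fuse the resulting channel weights, but that detail is not specified here.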
    • โ€ฆ
    corecore