2,125 research outputs found

    ShadowTutor: Distributed Partial Distillation for Mobile Video DNN Inference

    Full text link
    Following the recent success of deep neural networks (DNN) on video computer vision tasks, performing DNN inferences on videos that originate from mobile devices has gained practical significance. As such, previous approaches developed methods to offload DNN inference computations for images to cloud servers to manage the resource constraints of mobile devices. However, when it comes to video data, communicating information of every frame consumes excessive network bandwidth and renders the entire system susceptible to adverse network conditions such as congestion. Thus, in this work, we seek to exploit the temporal coherence between nearby frames of a video stream to mitigate network pressure. That is, we propose ShadowTutor, a distributed video DNN inference framework that reduces the number of network transmissions through intermittent knowledge distillation to a student model. Moreover, we update only a subset of the student's parameters, which we call partial distillation, to reduce the data size of each network transmission. Specifically, the server runs a large and general teacher model, and the mobile device only runs an extremely small but specialized student model. On sparsely selected key frames, the server partially trains the student model by targeting the teacher's response and sends the updated part to the mobile device. We investigate the effectiveness of ShadowTutor with HD video semantic segmentation. Evaluations show that network data transfer is reduced by 95% on average. Moreover, the throughput of the system is improved by over three times and shows robustness to changes in network bandwidth. Comment: Accepted at ICPP 202
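    The partial-distillation idea summarized in this abstract -- on sparsely chosen key frames, fit only a small slice of the student's parameters to the teacher's response so that each transmitted update stays small -- could be sketched roughly as follows. This is a minimal PyTorch-style sketch under assumptions: the toy teacher/student networks, the choice of "last layer" as the trainable subset, and the KL-divergence loss are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the paper's models (illustrative, not the actual networks):
teacher = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(64, 21, 1))   # "large, general" teacher on the server
student = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                        nn.Conv2d(8, 21, 1))    # "extremely small" student on the device

# Partial distillation: freeze everything except an assumed small subset of student
# parameters, so the update shipped back to the mobile device stays small.
for p in student.parameters():
    p.requires_grad_(False)
last_layer = student[-1]                         # assumed trainable subset: last layer only
for p in last_layer.parameters():
    p.requires_grad_(True)
optimizer = torch.optim.SGD(last_layer.parameters(), lr=1e-3)

def distill_on_key_frame(frame):
    """Server-side step on a sparsely selected key frame."""
    with torch.no_grad():
        target = teacher(frame)                  # teacher's soft response
    loss = F.kl_div(F.log_softmax(student(frame), dim=1),
                    F.softmax(target, dim=1), reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Only the updated subset would be transmitted back to the device.
    return {k: v.detach().cpu() for k, v in last_layer.state_dict().items()}

update = distill_on_key_frame(torch.randn(1, 3, 64, 64))
```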

    Evolution of A Common Vector Space Approach to Multi-Modal Problems

    Get PDF
    A set of methods to address computer vision problems has been developed. Video understanding has been an active area of research in recent years. If one can accurately identify salient objects in a video sequence, these components can be used in information retrieval and scene analysis. This research started with the development of a coarse-to-fine framework to extract salient objects in video sequences. Previous work on image and video frame background modeling ranged from simple and efficient methods to accurate but computationally complex ones. It is shown in this research that the novel approach to object extraction is efficient and effective, outperforming existing state-of-the-art methods. However, the drawback of this method is its inability to deal with non-rigid motion. With the rapid development of artificial neural networks, deep learning approaches are explored as a solution to computer vision problems in general. Focusing on image and text, image (or video frame) understanding can be achieved using a common vector space (CVS). With this concept, modality generation and other relevant applications, such as automatic image description and text paraphrasing, can be explored. Specifically, video sequences can be modeled by Recurrent Neural Networks (RNN); greater depth of the RNN leads to smaller error, but it also makes the gradient in the network unstable during training. To overcome this problem, a Batch-Normalized Recurrent Highway Network (BNRHN) was developed and tested on the image captioning (image-to-text) task. In BNRHN, the highway layers incorporate batch normalization, which mitigates the gradient vanishing and exploding problem. In addition, a sentence-to-vector encoding framework suitable for advanced natural language processing is developed. This semantic text embedding makes use of an encoder-decoder model trained on sentence paraphrase pairs (text-to-text). With this scheme, the latent representation of the text is shown to encode sentences with common semantic information into similar vector representations. In addition to image-to-text and text-to-text, an image generation model is developed to generate an image from text (text-to-image) or from another image (image-to-image) based on the semantics of the content. The developed model, referred to as the Multi-Modal Vector Representation (MMVR), builds and encodes different modalities into a common vector space, achieving the goal of preserving semantics and making conversion between text and image bidirectional. The concept of CVS is introduced in this research to deal with multi-modal conversion problems. In theory, this method works not only on text and image but can also be generalized to other modalities, such as video and audio. The characteristics and performance are supported by both theoretical analysis and experimental results. Interestingly, the MMVR model is one of many possible ways to build a CVS. In the final stages of this research, a simple and straightforward framework to build a CVS, considered an alternative to the MMVR model, is presented
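    The BNRHN idea mentioned above -- highway layers with batch normalization inserted on the transform path to stabilize gradients -- could be sketched along these lines. This is a minimal PyTorch sketch under assumed layer shapes and placement of the normalization; it is not the thesis implementation.

```python
import torch
import torch.nn as nn

class BNHighwayLayer(nn.Module):
    """One highway layer with batch normalization on the transform path (illustrative)."""
    def __init__(self, dim):
        super().__init__()
        self.transform = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)
        self.bn = nn.BatchNorm1d(dim)   # normalization intended to curb vanishing/exploding gradients

    def forward(self, x):
        h = torch.tanh(self.bn(self.transform(x)))  # candidate update, batch-normalized
        t = torch.sigmoid(self.gate(x))             # transform gate
        return t * h + (1.0 - t) * x                # highway mix: carry the input where the gate is closed

# Example: stack a few layers and apply them to a batch of 256-d hidden states.
stack = nn.Sequential(*[BNHighwayLayer(256) for _ in range(3)])
out = stack(torch.randn(8, 256))
```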

    Image Processing Using FPGAs

    Get PDF
    This book presents a selection of papers representing current research on using field programmable gate arrays (FPGAs) for realising image processing algorithms. These papers are reprints of papers selected for a Special Issue of the Journal of Imaging on image processing using FPGAs. A diverse range of topics is covered, including parallel soft processors, memory management, image filters, segmentation, clustering, image analysis, and image compression. Applications include traffic sign recognition for autonomous driving, cell detection for histopathology, and video compression. Collectively, they represent the current state-of-the-art on image processing using FPGAs

    SAM-DiffSR: Structure-Modulated Diffusion Model for Image Super-Resolution

    Full text link
    Diffusion-based super-resolution (SR) models have recently garnered significant attention due to their potent restoration capabilities. However, conventional diffusion models perform noise sampling from a single distribution, constraining their ability to handle real-world scenes and complex textures across semantic regions. With the success of the Segment Anything Model (SAM), generating sufficiently fine-grained region masks can enhance the detail recovery of diffusion-based SR models. However, directly integrating SAM into SR models results in much higher computational cost. In this paper, we propose the SAM-DiffSR model, which can utilize the fine-grained structure information from SAM in the process of sampling noise to improve the image quality without additional computational cost during inference. During training, we encode structural position information into the segmentation mask from SAM. Then the encoded mask is integrated into the forward diffusion process by using it to modulate the sampled noise. This adjustment allows us to independently adapt the noise mean within each corresponding segmentation area. The diffusion model is trained to estimate this modulated noise. Crucially, our proposed framework does NOT change the reverse diffusion process and does NOT require SAM at inference. Experimental results demonstrate the effectiveness of our proposed method, showcasing superior performance in suppressing artifacts and surpassing existing diffusion-based methods by up to 0.74 dB in PSNR on the DIV2K dataset. The code and dataset are available at https://github.com/lose4578/SAM-DiffSR
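    The core mechanism described in this abstract -- shifting the mean of the forward-process noise independently within each SAM segmentation region during training, while leaving the reverse process untouched -- might look roughly like the sketch below. The mask encoding, region ids, and mean values are illustrative assumptions, not the paper's actual position encoding.

```python
import torch

def structure_modulated_noise(x0_shape, region_mask, region_means):
    """
    Sample Gaussian noise whose mean is shifted per segmentation region.
    region_mask:  (H, W) integer mask, one id per SAM region (assumed encoding)
    region_means: dict mapping region id -> scalar mean shift (assumed values)
    """
    noise = torch.randn(x0_shape)                  # standard forward-diffusion noise
    mean_map = torch.zeros(x0_shape[-2:])
    for rid, mu in region_means.items():
        mean_map[region_mask == rid] = mu          # per-region mean offset
    # The diffusion model would be trained to estimate this modulated noise.
    return noise + mean_map

# Example usage with a toy 2-region mask on a 1x3x8x8 image tensor.
mask = torch.zeros(8, 8, dtype=torch.long)
mask[:, 4:] = 1
eps = structure_modulated_noise((1, 3, 8, 8), mask, {0: 0.0, 1: 0.2})
```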

    System Abstractions for Scalable Application Development at the Edge

    Get PDF
    Recent years have witnessed an explosive growth of Internet of Things (IoT) devices, which collect or generate huge amounts of data. Given diverse device capabilities and application requirements, data processing takes place across a range of settings, from on-device to a nearby edge server/cloud and remote cloud. Consequently, edge-cloud coordination has been studied extensively from the perspectives of job placement, scheduling, and joint optimization. Typical approaches focus on performance optimization for individual applications. This not only requires domain knowledge of the applications but also leads to application-specific solutions. Application development and deployment over diverse scenarios thus incur repetitive manual effort. There are two overarching challenges in providing system-level support for application development at the edge. First, there is inherent heterogeneity at the device hardware level. The execution settings may range from a small cluster serving as an edge cloud to on-device inference on embedded devices, differing in hardware capability and programming environment. Further, application performance requirements vary significantly, making it even more difficult to map different applications to already heterogeneous hardware. Second, there are trends towards incorporating both edge and cloud as well as multi-modal data. Together, these add further dimensions to the design space and increase the complexity significantly. In this thesis, we propose a novel framework to simplify application development and deployment over a continuum from edge to cloud. Our framework provides key connections between different dimensions of design considerations, corresponding to the application abstraction, the data abstraction, and the resource management abstraction, respectively. First, our framework masks hardware heterogeneity with abstract resource types through containerization and abstracts application processing pipelines into generic flow graphs. It further supports a notion of degradable computing for application scenarios at the edge that are driven by multimodal sensory input. Next, as video analytics is the killer app of edge computing, we include a generic data management service between video query systems and a video store to organize video data at the edge. We propose a video data unit abstraction based on a notion of distance between objects in the video, quantifying the semantic similarity among video data. Last, considering concurrent application execution, our framework supports multi-application offloading with device-centric control, using a userspace scheduler service that wraps over the operating system scheduler.
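    As a rough illustration of the kind of application abstraction described here -- pipelines expressed as generic flow graphs over abstract, containerized resource types -- one might imagine an interface like the following. This is purely a hypothetical sketch: the class names, the `resource` strings, and the local `run` loop are assumptions and are not taken from the thesis.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Stage:
    """One node of a generic processing flow graph (hypothetical abstraction)."""
    name: str
    fn: Callable                   # the processing step, independent of where it runs
    resource: str = "cpu.small"    # abstract resource type, resolved to a container at deploy time

@dataclass
class FlowGraph:
    stages: List[Stage] = field(default_factory=list)

    def then(self, name, fn, resource="cpu.small"):
        self.stages.append(Stage(name, fn, resource))
        return self

    def run(self, item):
        # A scheduler could place each stage on device, edge, or cloud;
        # here the pipeline simply executes locally for illustration.
        for stage in self.stages:
            item = stage.fn(item)
        return item

# Example: a toy video-analytics pipeline defined once, independent of deployment target.
pipeline = (FlowGraph()
            .then("decode", lambda frame: frame, "cpu.small")
            .then("detect", lambda frame: {"frame": frame, "objects": []}, "gpu.small"))
result = pipeline.run("frame-0")
```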

    ์ค‘๋ณต ์—ฐ์‚ฐ ์ƒ๋žต์„ ํ†ตํ•œ ํšจ์œจ์ ์ธ ์˜์ƒ ๋ฐ ๋™์˜์ƒ ๋ถ„ํ•  ๋ชจ๋ธ

    Get PDF
    Ph.D. dissertation -- Graduate School of Convergence Science and Technology, Department of Convergence Science (Intelligent Convergence Systems major), Seoul National University, August 2021. Advisor: Nojun Kwak. Segmentation has seen remarkable performance advances through the use of deep convolutional neural networks, like other fields of computer vision. This technology is essential because it lets us understand surrounding scenes and recognize the shapes of objects for various visual applications such as AR/VR, autonomous driving, and surveillance systems. However, most previously proposed methods cannot be directly applied to real-world systems due to their tremendous computation. This dissertation focuses on image semantic segmentation and semi-supervised video object segmentation, among the various sub-fields of segmentation, with the goal of reducing model complexity. We point out redundant operations in conventional frameworks and propose solutions from three different perspectives. First, we discuss the spatial redundancy issue in the decoder. The decoder performs upsampling to recover small-resolution feature maps to the original input resolution to generate a sharp mask, and classifies each pixel to find its semantic category. However, neighboring pixels share information and are very likely to have the same semantic category, so independent pixel-wise computation in the decoder is unnecessary. To resolve this problem, we propose a superpixel-based sampling architecture that eliminates the decoder process by reducing spatial redundancy. The proposed network is trained and tested with only 0.37% of the total pixels, using a learning-rate re-adjustment scheme based on statistical process control (SPC) of the gradients in each layer.
We show that our network achieves better or comparable accuracy with far less computation than various conventional methods on the Pascal Context and SUN-RGBD datasets. Second, we examine the dilated convolution in the encoder. It is widely used to give the encoder a large receptive field and improve performance. One practical choice to reduce computation for execution on mobile devices is to apply a depth-wise separable convolution strategy to the dilated convolution. However, the simple combination of these two methods incurs severe performance degradation due to the loss of information in the feature map caused by the over-simplified operation. To resolve this problem, we propose a new convolutional block, called Concentrated-Comprehensive Convolution (C3), that compensates for the information loss. We apply the C3-block to various segmentation frameworks (DRN, ERFnet, Enet, and Deeplab V3) to demonstrate its beneficial properties on the Cityscapes and Pascal VOC datasets. Another issue with the dilated convolution is that its latency depends on the dilation rate. In theory, the dilated convolution should have similar latency regardless of the dilation rate, but in practice we observe that the latency differs by up to 2x on devices. To mitigate this issue, we devise another convolutional block called the spatial squeeze (S2) block. The S2-block uses average pooling to squeeze spatial information, capturing long-range context while greatly reducing computation. We provide a qualitative and quantitative analysis of the proposed S2-block-based network against other lightweight segmentation models and compare its performance with the C3-block on the Cityscapes dataset. We also demonstrate that our method runs successfully on a mobile device. Third, we tackle the temporal redundancy problem in video segmentation. One of the critical techniques in computer vision is handling video data efficiently. Semi-supervised Video Object Segmentation (semi-VOS) propagates information from previous frames to generate a segmentation mask for the current frame. However, previous works treat every frame as equally important and use a full-network path for each. This produces high-quality segmentation across challenging scenarios such as shape changes and occlusion, but it also leads to unnecessary computation for stationary or slow-moving objects, where the change across frames is small. In this work, we exploit this observation by using temporal information to quickly identify frames with little change and skip the heavyweight mask generation step. To realize this efficiency, we propose a novel dynamic network that estimates the change across frames and decides which path -- computing the full network or reusing the previous frame's feature -- to take depending on the expected similarity. Experimental results show that our approach significantly improves inference speed without much accuracy degradation on challenging semi-VOS datasets -- DAVIS 16, DAVIS 17, and YouTube-VOS.
Furthermore, our approach can be applied to multiple semi-VOS methods, demonstrating its generality.
Table of Contents:
1 Introduction
  1.1 Challenging Problem
    1.1.1 Semantic Segmentation
    1.1.2 Semi-supervised Video Object Segmentation
  1.2 Contribution
    1.2.1 Reducing Spatial Redundancy in Decoder
    1.2.2 Beyond Dilated Convolution
    1.2.3 Reducing Temporal Redundancy in Semi-supervised Video Object Segmentation
  1.3 Outline
2 Related Work
  2.1 Decoder for Segmentation
  2.2 Feature Extraction for Segmentation Encoder
  2.3 Tracking Target for Video Object Segmentation
    2.3.1 Mask Propagation
    2.3.2 Online-learning
    2.3.3 Template Matching
  2.4 Reducing Computation for Deep Learning Networks
    2.4.1 Convolution Factorization
    2.4.2 Dynamic Network
  2.5 Datasets and Measurements
    2.5.1 Image Semantic Segmentation
    2.5.2 Video Object Segmentation
    2.5.3 Measurement
3 Reducing Spatial Redundancy in Decoder via Sampling based on Superpixel
  3.1 Related Work
  3.2 Sampling Method Based on Superpixel for Train and Test
  3.3 Details of Remapping Feature Map
  3.4 Re-adjusting Learning Rates
  3.5 Experiments
    3.5.1 Implementation Details
    3.5.2 Pascal Context Benchmark Experiments
    3.5.3 Analysis of the Number of Superpixels
    3.5.4 SUN-RGBD Benchmark Experiments
4 Beyond Dilated Convolution for Better Lightweight Encoder
  4.1 Related Work
  4.2 Rethinking about Property of Dilated Convolutions
  4.3 Concentrated-Comprehensive Convolution
  4.4 Experiments of C3
    4.4.1 Ablation Study on C3 based on ESPNet
    4.4.2 Evaluation on Cityscapes with Other Models
    4.4.3 Evaluation on PASCAL VOC with Other Models
  4.5 Rethinking about Speed of Dilated Convolutions and Multi-branches Structures
  4.6 Spatial Squeeze Block
    4.6.1 Overall Structure
  4.7 Experiments of S2
    4.7.1 Evaluation Results on the EG1800 Dataset
    4.7.2 Ablation Study
  4.8 Comparison between C3 and S2
    4.8.1 Evaluation Results on the Cityscapes Dataset
5 Reducing Temporal Redundancy in Semi-supervised Video Object Segmentation via Dynamic Inference Framework
  5.1 Related Work
  5.2 Online-learning for Semi-supervised Video Object Segmentation
    5.2.1 Brief Explanation of Baseline Architecture
    5.2.2 Our Dynamic Inference Framework
  5.3 Quantifying Movement for Recognizing Temporal Redundancy
    5.3.1 Details of Template Matching
  5.4 Reusing Previous Feature Map
  5.5 Extend to General Semi-supervised Video Object Segmentation
  5.6 Gate Probability Loss
  5.7 Experiment
    5.7.1 DAVIS Benchmark Result
    5.7.2 Ablation Study
    5.7.3 YouTube-VOS Result
    5.7.4 Qualitative Examples
6 Conclusion
  6.1 Summary
  6.2 Limitations
  6.3 Future Works
Abstract (In Korean)
Acknowledgments
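The dynamic-inference idea for semi-VOS described in the abstract above -- measure frame-to-frame change and skip the heavyweight mask-generation path when the change is small -- could be sketched as follows. This is an illustrative PyTorch-style sketch: the change metric, the threshold, and the stand-in "network" are assumptions, not the dissertation's implementation.

```python
import torch

CHANGE_THRESHOLD = 0.05   # assumed similarity threshold for skipping the full network

def frame_change(prev_frame, cur_frame):
    """Cheap estimate of how much the scene changed between two frames."""
    return (cur_frame - prev_frame).abs().mean().item()

def segment_video(frames, full_network):
    """Run the full network only on frames that changed enough; otherwise reuse the previous result."""
    masks, prev_frame, prev_mask = [], None, None
    for frame in frames:
        if prev_frame is not None and frame_change(prev_frame, frame) < CHANGE_THRESHOLD:
            mask = prev_mask                    # lightweight path: reuse previous frame's result
        else:
            mask = full_network(frame)          # heavyweight path: compute the full network
        masks.append(mask)
        prev_frame, prev_mask = frame, mask
    return masks

# Toy usage: a thresholded channel mean stands in for the segmentation network.
frames = [torch.rand(1, 3, 64, 64) for _ in range(5)]
masks = segment_video(frames, lambda f: (f.mean(dim=1, keepdim=True) > 0.5).float())
```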

    Edge Video Analytics: A Survey on Applications, Systems and Enabling Techniques

    Full text link
    Video, as a key driver in the global explosion of digital information, can create tremendous benefits for human society. Governments and enterprises are deploying innumerable cameras for a variety of applications, e.g., law enforcement, emergency management, traffic control, and security surveillance, all facilitated by video analytics (VA). This trend is spurred by the rapid advancement of deep learning (DL), which enables more precise models for object classification, detection, and tracking. Meanwhile, with the proliferation of Internet-connected devices, massive amounts of data are generated daily, overwhelming the cloud. Edge computing, an emerging paradigm that moves workloads and services from the network core to the network edge, has been widely recognized as a promising solution. The resulting new intersection, edge video analytics (EVA), begins to attract widespread attention. Nevertheless, only a few loosely-related surveys exist on this topic. The basic concepts of EVA (e.g., definition, architectures) were not fully elucidated due to the rapid development of this domain. To fill these gaps, we provide a comprehensive survey of the recent efforts on EVA. In this paper, we first review the fundamentals of edge computing, followed by an overview of VA. The EVA system and its enabling techniques are discussed next. In addition, we introduce prevalent frameworks and datasets to aid future researchers in the development of EVA systems. Finally, we discuss existing challenges and foresee future research directions. We believe this survey will help readers comprehend the relationship between VA and edge computing, and spark new ideas on EVA. Comment: 31 pages, 13 figures