6,649 research outputs found

    Attribute Value Reordering For Efficient Hybrid OLAP

    Get PDF
    The normalization of a data cube is the ordering of its attribute values. For large multidimensional arrays where dense and sparse chunks are stored differently, proper normalization can lead to improved storage efficiency. We show that computing an optimal normalization is NP-hard even for 1x3 chunks, although we give an exact algorithm for 1x2 chunks. When dimensions are nearly statistically independent, we show that dimension-wise attribute frequency sorting is an optimal normalization and takes time O(d n log(n)) for data cubes of size n^d. When dimensions are not independent, we propose and evaluate several heuristics. The hybrid OLAP (HOLAP) storage mechanism is already 19%-30% more efficient than ROLAP, but normalization can improve it further by 9%-13%, for a total gain of 29%-44% over ROLAP.
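    A minimal sketch of the dimension-wise attribute frequency sorting described above, assuming the cube is held as a dense NumPy array and that "density" means nonzero (allocated) cells; the function name and toy data are illustrative only:

```python
import numpy as np

def frequency_sort_normalization(cube):
    """Reorder each dimension's attribute values by how often they occur
    in allocated (nonzero) cells, most frequent first, so that dense cells
    cluster together and chunk-based HOLAP storage improves."""
    cube = np.asarray(cube)
    for axis in range(cube.ndim):
        other_axes = tuple(a for a in range(cube.ndim) if a != axis)
        freq = (cube != 0).sum(axis=other_axes)   # attribute-value frequencies
        order = np.argsort(-freq, kind="stable")  # one O(n log n) sort per dimension
        cube = np.take(cube, order, axis=axis)
    return cube

# Toy 2-d cube: frequently used attribute values drift toward one corner,
# packing the allocated cells into fewer dense chunks.
cube = np.array([[0, 1, 0, 0],
                 [2, 3, 0, 4],
                 [0, 0, 0, 5],
                 [6, 7, 8, 9]])
print(frequency_sort_normalization(cube))
```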

    Dissimilarity is used as evidence of category membership in multidimensional perceptual categorization: a test of the similarity-dissimilarity generalized context model

    Get PDF
    In exemplar models of categorization, the similarity between an exemplar and category members constitutes evidence that the exemplar belongs to the category. We test the possibility that dissimilarity to members of competing categories also contributes to this evidence. Data were collected from two 2-dimensional perceptual categorization experiments, one with lines varying in orientation and length and the other with coloured patches varying in saturation and brightness. Fits of the similarity-dissimilarity generalized context model were used to compare a similarity-only model with a model that uses both similarity and dissimilarity. For the majority of participants the similarity-dissimilarity model provided both a significantly better fit and better generalization, suggesting that people also use dissimilarity as evidence.
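    A toy sketch of the evidence rule being compared, assuming exponential similarity decay with Euclidean distance and a simplified dissimilarity term (one minus similarity to competing-category exemplars); the parameter c and the final choice rule follow the standard generalized context model form, everything else is illustrative rather than the paper's fitted model:

```python
import numpy as np

def gcm_choice_probs(x, exemplars, c=1.0, use_dissimilarity=False):
    """Category choice probabilities under a generalized-context-model-style
    rule. Similarity of probe x to a category's exemplars is evidence for it;
    the similarity-dissimilarity variant also counts dissimilarity to the
    competing categories' exemplars as evidence (simplified here as 1 - s)."""
    x = np.asarray(x, dtype=float)
    sims = {k: np.exp(-c * np.linalg.norm(np.asarray(v) - x, axis=1))
            for k, v in exemplars.items()}
    evidence = {}
    for k in exemplars:
        e = sims[k].sum()
        if use_dissimilarity:
            e += sum((1.0 - sims[j]).sum() for j in exemplars if j != k)
        evidence[k] = e
    total = sum(evidence.values())        # Luce choice rule
    return {k: e / total for k, e in evidence.items()}

# Stimuli (e.g. lines varying in orientation and length) as 2-d feature vectors.
cats = {"A": [(0.1, 0.2), (0.2, 0.1)], "B": [(0.8, 0.9), (0.9, 0.8)]}
print(gcm_choice_probs((0.15, 0.15), cats, use_dissimilarity=True))
```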

    Laplacian Mixture Modeling for Network Analysis and Unsupervised Learning on Graphs

    Full text link
    Laplacian mixture models identify overlapping regions of influence in unlabeled graph and network data in a scalable and computationally efficient way, yielding useful low-dimensional representations. By combining Laplacian eigenspace and finite mixture modeling methods, they provide probabilistic or fuzzy dimensionality reductions or domain decompositions for a variety of input data types, including mixture distributions, feature vectors, and graphs or networks. Provably optimal recovery by the algorithm is shown analytically for a nontrivial class of cluster graphs. Heuristic approximations for scalable high-performance implementations are described and empirically tested. Connections to PageRank and community detection in network analysis demonstrate the wide applicability of this approach. The origins of fuzzy spectral methods, beginning with generalized heat or diffusion equations in physics, are reviewed and summarized. Comparisons to other dimensionality reduction and clustering methods for challenging unsupervised machine learning problems are also discussed.
    Comment: 13 figures, 35 references
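    A generic stand-in for the combination the abstract describes, assuming a dense adjacency matrix, the normalized graph Laplacian, and an off-the-shelf Gaussian mixture fitted in the eigenspace; this is a sketch of the general recipe, not the paper's exact algorithm:

```python
import numpy as np
from scipy.sparse.csgraph import laplacian
from sklearn.mixture import GaussianMixture

def laplacian_mixture_memberships(adjacency, n_regions):
    """Fuzzy node-to-region memberships: embed each node with the smoothest
    eigenvectors of the normalized graph Laplacian, then fit a finite
    mixture model in that eigenspace."""
    L = laplacian(np.asarray(adjacency, dtype=float), normed=True)
    _, vecs = np.linalg.eigh(L)               # eigenvectors, ascending eigenvalues
    embedding = vecs[:, :n_regions]           # low-frequency eigenspace
    gmm = GaussianMixture(n_components=n_regions, n_init=5, random_state=0)
    gmm.fit(embedding)
    return gmm.predict_proba(embedding)       # rows: nodes, columns: regions

# Two triangles joined by one bridge edge: the bridge nodes receive graded
# (overlapping) membership rather than a hard assignment.
A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
print(laplacian_mixture_memberships(A, n_regions=2).round(2))
```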

    Methods for multi-spectral image fusion: identifying stable and repeatable information across the visible and infrared spectra

    Get PDF
    Fusion of images captured from different viewpoints is a well-known challenge in computer vision with many established approaches and applications; however, if the observations are captured by sensors also separated by wavelength, this challenge is compounded significantly. This dissertation presents an investigation into the fusion of visible and thermal image information from two front-facing sensors mounted side-by-side. The primary focus of this work is the development of methods that enable us to map and overlay multi-spectral information; the goal is to establish a combined image in which each pixel contains both colour and thermal information. Pixel-level fusion of these distinct modalities is approached using computational stereo methods; the focus is on the viewpoint alignment and correspondence search/matching stages of processing. Frequency domain analysis is performed using a method called phase congruency. An extensive investigation of this method is carried out with two major objectives: to identify predictable relationships between the elements extracted from each modality, and to establish a stable representation of the common information captured by both sensors. Phase congruency is shown to be a stable edge detector and repeatable spatial similarity measure for multi-spectral information; this result forms the basis for the methods developed in the subsequent chapters of this work. The feasibility of automatic alignment with sparse feature-correspondence methods is investigated. It is found that conventional methods fail to match inter-spectrum correspondences, motivating the development of an edge orientation histogram (EOH) descriptor that incorporates elements of the phase congruency process. A cost function, which incorporates the outputs of the phase congruency process and the mutual information similarity measure, is developed for computational stereo correspondence matching. An evaluation of the proposed cost function shows it to be an effective similarity measure for multi-spectral information.
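    A sketch of the mutual-information half of the matching cost described above, assuming a joint-histogram estimator over co-located patches; the phase congruency term and the weighting between the two terms are omitted:

```python
import numpy as np

def mutual_information(patch_a, patch_b, bins=32):
    """Mutual information between two aligned patches, a standard similarity
    measure for multi-modal (visible vs. thermal) matching: high when one
    modality's intensities predict the other's, without assuming any linear
    relationship between them."""
    hist, _, _ = np.histogram2d(patch_a.ravel(), patch_b.ravel(), bins=bins)
    pxy = hist / hist.sum()                   # joint intensity distribution
    px = pxy.sum(axis=1, keepdims=True)       # marginal distributions
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0                              # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px * py)[nz])).sum())

# A deterministic cross-modal relationship scores far higher than
# spatially unrelated content.
vis = np.random.default_rng(0).random((16, 16))
print(mutual_information(vis, 1.0 - vis), mutual_information(vis, np.roll(vis, 8)))
```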

    Parallel Architectures and Parallel Algorithms for Integrated Vision Systems

    Get PDF
    Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is a system that uses vision algorithms from all levels of processing to support a high-level application (e.g., object recognition). An IVS normally involves algorithms from low-level, intermediate-level, and high-level vision. Designing parallel architectures for vision systems is of tremendous interest to researchers. Several issues in parallel architectures and parallel algorithms for integrated vision systems are addressed.

    ์ค‘๋ณต ์—ฐ์‚ฐ ์ƒ๋žต์„ ํ†ตํ•œ ํšจ์œจ์ ์ธ ์˜์ƒ ๋ฐ ๋™์˜์ƒ ๋ถ„ํ•  ๋ชจ๋ธ

    Get PDF
    Doctoral dissertation (Ph.D.) -- Seoul National University, Graduate School of Convergence Science and Technology, Department of Transdisciplinary Studies (Intelligent Convergence Systems Major), August 2021. Advisor: Nojun Kwak.
    Segmentation, like other fields of computer vision, has seen remarkable performance advances through deep convolutional neural networks. This technology is essential because it lets us understand surrounding scenes and recognize object shapes in visual applications such as AR/VR, autonomous driving, and surveillance systems. However, most previous methods cannot be used directly in real-world systems because of their tremendous computational cost. This dissertation focuses on image semantic segmentation and semi-supervised video object segmentation, among the various sub-fields of segmentation, with the goal of reducing model complexity. We point out redundant operations in conventional frameworks and propose solutions from three perspectives. First, we discuss the spatial redundancy in the decoder. The decoder upsamples small-resolution feature maps back to the original input resolution to generate a sharp mask, classifying each pixel to find its semantic category. However, neighboring pixels share information and are very likely to carry the same semantic category, so independent pixel-wise computation in the decoder is unnecessary. To resolve this problem, we propose a superpixel-based sampling architecture that eliminates the decoder process by reducing spatial redundancy. The proposed network is trained and tested with only 0.37% of the total pixels, using a learning-rate re-adjustment scheme driven by statistical process control (SPC) of the gradients in each layer. We show experimentally that our network achieves better or comparable accuracy with far less computation than various conventional methods on the Pascal Context and SUN-RGBD datasets.
Second, we examine the dilated convolution widely used in encoders. Dilated convolution gives the encoder a large receptive field and is a common way to improve performance. One practical choice for reducing computation on mobile devices is to apply a depth-wise separable convolution strategy to the dilated convolution. However, the simple combination of these two methods incurs severe performance degradation, because the over-simplified operation loses information in the feature map. To compensate for this information loss, we propose a new convolutional block called Concentrated-Comprehensive Convolution (C3). We apply the C3-block to various segmentation frameworks (DRN, ERFnet, Enet, and Deeplab V3) to demonstrate its benefits on the Cityscapes and Pascal VOC datasets. Another issue with dilated convolution is that its latency varies with the dilation rate: in theory the latency should be similar regardless of the rate, but on real devices we observe differences of up to 2x. To mitigate this issue, we devise another convolutional block called the spatial squeeze (S2) block, which uses average pooling to squeeze spatial information, capturing long-range context while greatly reducing computation. We provide qualitative and quantitative analysis of a proposed network based on the S2-block against other lightweight segmentation models, compare its performance with the C3-block on the Cityscapes dataset, and demonstrate that our method runs successfully on a mobile device. Third, we tackle the temporal redundancy problem in video segmentation. A critical technique in computer vision is handling video data efficiently. Semi-supervised Video Object Segmentation (semi-VOS) propagates information from previous frames to generate a segmentation mask for the current frame. Previous works, however, treat every frame as equally important and run the full network path for each one. This yields high-quality segmentation in challenging scenarios such as shape change and occlusion, but it also wastes computation on stationary or slow-moving objects whose appearance barely changes across frames. We exploit this observation by using temporal information to quickly identify frames with little change and skip the heavyweight mask-generation step. To realize this efficiency, we propose a novel dynamic network that estimates the change across frames and, depending on the expected similarity, chooses between two paths: computing the full network or reusing the previous frame's features. Experimental results show that our approach significantly improves inference speed without much accuracy degradation on challenging semi-VOS datasets (DAVIS 16, DAVIS 17, and YouTube-VOS). Furthermore, our approach can be applied to multiple semi-VOS methods, demonstrating its generality.
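A minimal sketch of the dynamic gating idea from the third contribution, assuming a fixed frame-difference threshold in place of the learned gate and treating the semi-VOS model as a black-box callable; the real framework estimates change with template matching and reuses intermediate features, not just the output mask:

```python
import numpy as np

def segment_video(frames, full_model, change_threshold=0.02):
    """Skip the heavyweight mask-generation path for near-static frames:
    estimate inter-frame change cheaply and reuse the previous mask when
    the change is small, otherwise run the full network."""
    masks, prev_frame, prev_mask = [], None, None
    for frame in frames:
        if prev_frame is not None:
            change = np.abs(frame.astype(float) - prev_frame.astype(float)).mean() / 255.0
            if change < change_threshold:      # gate: reuse, skip the full path
                masks.append(prev_mask)
                prev_frame = frame
                continue
        prev_mask = full_model(frame)          # full path for changed frames
        masks.append(prev_mask)
        prev_frame = frame
    return masks
```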
Table of contents:
    1 Introduction
      1.1 Challenging Problem
        1.1.1 Semantic Segmentation
        1.1.2 Semi-supervised Video Object Segmentation
      1.2 Contribution
        1.2.1 Reducing Spatial Redundancy in Decoder
        1.2.2 Beyond Dilated Convolution
        1.2.3 Reducing Temporal Redundancy in Semi-supervised Video Object Segmentation
      1.3 Outline
    2 Related Work
      2.1 Decoder for Segmentation
      2.2 Feature Extraction for Segmentation Encoder
      2.3 Tracking Target for Video Object Segmentation
        2.3.1 Mask Propagation
        2.3.2 Online-learning
        2.3.3 Template Matching
      2.4 Reducing Computation for Deep Learning Networks
        2.4.1 Convolution Factorization
        2.4.2 Dynamic Network
      2.5 Datasets and Measurements
        2.5.1 Image Semantic Segmentation
        2.5.2 Video Object Segmentation
        2.5.3 Measurement
    3 Reducing Spatial Redundancy in Decoder via Sampling based on Superpixel
      3.1 Related Work
      3.2 Sampling Method Based on Superpixel for Train and Test
      3.3 Details of Remapping Feature Map
      3.4 Re-adjusting Learning Rates
      3.5 Experiments
        3.5.1 Implementation Details
        3.5.2 Pascal Context Benchmark Experiments
        3.5.3 Analysis of the Number of Superpixels
        3.5.4 SUN-RGBD Benchmark Experiments
    4 Beyond Dilated Convolution for Better Lightweight Encoder
      4.1 Related Work
      4.2 Rethinking the Properties of Dilated Convolutions
      4.3 Concentrated-Comprehensive Convolution
      4.4 Experiments of C3
        4.4.1 Ablation Study on C3 based on ESPNet
        4.4.2 Evaluation on Cityscapes with Other Models
        4.4.3 Evaluation on PASCAL VOC with Other Models
      4.5 Rethinking the Speed of Dilated Convolutions and Multi-branch Structures
      4.6 Spatial Squeeze Block
        4.6.1 Overall Structure
      4.7 Experiments of S2
        4.7.1 Evaluation Results on the EG1800 Dataset
        4.7.2 Ablation Study
      4.8 Comparison between C3 and S2
        4.8.1 Evaluation Results on the Cityscapes Dataset
    5 Reducing Temporal Redundancy in Semi-supervised Video Object Segmentation via Dynamic Inference Framework
      5.1 Related Work
      5.2 Online-learning for Semi-supervised Video Object Segmentation
        5.2.1 Brief Explanation of Baseline Architecture
        5.2.2 Our Dynamic Inference Framework
      5.3 Quantifying Movement for Recognizing Temporal Redundancy
        5.3.1 Details of Template Matching
      5.4 Reusing Previous Feature Map
      5.5 Extension to General Semi-supervised Video Object Segmentation
      5.6 Gate Probability Loss
      5.7 Experiment
        5.7.1 DAVIS Benchmark Result
        5.7.2 Ablation Study
        5.7.3 YouTube-VOS Result
        5.7.4 Qualitative Examples
    6 Conclusion
      6.1 Summary
      6.2 Limitations
      6.3 Future Work
    Abstract (In Korean)
    Acknowledgements
    • โ€ฆ
    corecore