
    Bayesian Neural Bandit Using Online SWAG

    Master's thesis -- Seoul National University Graduate School: Graduate School of Data Science, Department of Data Science, August 2022. Advisor: ์˜ค๋ฏผํ™˜.
    In this paper, we propose a Neural SWAG Bandit algorithm that combines a neural network-based bandit algorithm with Stochastic Weight Averaging Gaussian (SWAG), a Bayesian deep learning methodology. Neural Bandit is a bandit algorithm that uses the output of a neural network as the estimated reward. SWAG is a Bayesian deep learning method that samples parameters from a Gaussian posterior distribution and has been shown to achieve state-of-the-art performance and robustness compared to benchmark algorithms. By adapting SWAG to an online setting and combining it with Neural Bandit, we can leverage efficient sampling from deep neural networks while learning online. Our experimental results indicate that Neural SWAG Bandit benefits from Bayesian deep learning and exhibits superior performance compared to existing benchmark algorithms.
    Table of contents:
    1 Introduction: 1.1 Bandit Algorithm; 1.2 Neural Bandit; 1.3 Bayesian Deep Learning; 1.4 SWAG (Stochastic Weight Averaging Gaussian); 1.5 Contributions
    2 Background & Related Work: 2.1 Bandit Algorithm (2.1.1 Multi-armed Bandit; 2.1.2 Contextual Bandit; 2.1.3 Exploration & Exploitation Tradeoff; 2.1.4 Existing Bandit Algorithms); 2.2 Neural Bandit (2.2.1 Neural Bandit; 2.2.2 Existing Neural Bandit Algorithms); 2.3 Bayesian Deep Learning (2.3.1 Bayesian and Uncertainty; 2.3.2 Bayes Rule; 2.3.3 Bayesian Neural Network; 2.3.4 Existing Bayesian Deep Learning Methods); 2.4 SWAG Algorithm (2.4.1 SGD (Stochastic Gradient Descent); 2.4.2 SWA (Stochastic Weight Averaging); 2.4.3 SWAG-Diagonal; 2.4.4 SWAG)
    3 The NeuralSWAG Algorithm
    4 Evaluation Methodology: 4.1 Cumulative Regret
    5 Experiments: 5.1 Dataset (5.1.1 Simulation Dataset; 5.1.2 Real-world Dataset); 5.2 Model; 5.3 Experiment Settings (5.3.1 Setting for Simulation Datasets; 5.3.2 Setting for Real-world Datasets); 5.4 Compared Algorithms; 5.5 Experimental Results (5.5.1 Results for Simulation Datasets; 5.5.2 Results for Real-world Datasets)
    6 Conclusions
    Bibliography
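    The core mechanism described here, keeping running SWAG moments up to date while the reward network trains online and drawing a weight sample each round for Thompson-style arm selection, can be sketched roughly as follows. This is a minimal illustration assuming a generic PyTorch reward model; the class name, update cadence, and scaling factor are illustrative choices, not the thesis implementation.

```python
import torch

class OnlineSWAGDiag:
    """Track running first/second moments of the flattened network weights
    (SWAG-Diagonal) and draw Gaussian posterior samples for Thompson-style
    action selection. Illustrative sketch, not the thesis implementation."""

    def __init__(self, model):
        self.model = model
        w = torch.nn.utils.parameters_to_vector(model.parameters()).detach()
        self.mean = w.clone()
        self.sq_mean = w.clone() ** 2
        self.n = 1

    def collect(self):
        # Update running moments after each (or every k-th) online SGD step.
        w = torch.nn.utils.parameters_to_vector(self.model.parameters()).detach()
        self.n += 1
        self.mean += (w - self.mean) / self.n
        self.sq_mean += (w ** 2 - self.sq_mean) / self.n

    def sample_weights(self, scale=0.5):
        # One draw from N(mean, scale * diag(sq_mean - mean^2)).
        var = torch.clamp(self.sq_mean - self.mean ** 2, min=1e-12)
        return self.mean + (scale * var).sqrt() * torch.randn_like(self.mean)

    def predict_with_sample(self, contexts):
        # Load a sampled weight vector, score all arms, then restore weights.
        backup = torch.nn.utils.parameters_to_vector(self.model.parameters()).detach()
        torch.nn.utils.vector_to_parameters(self.sample_weights(), self.model.parameters())
        with torch.no_grad():
            rewards = self.model(contexts).squeeze(-1)
        torch.nn.utils.vector_to_parameters(backup, self.model.parameters())
        return rewards

# Per-round usage sketch: pick the arm whose sampled estimated reward is largest.
# chosen_arm = swag.predict_with_sample(arm_contexts).argmax().item()
```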

    Deep Ensembling with No Overhead for either Training or Testing: The All-Round Blessings of Dynamic Sparsity

    The success of deep ensembles in improving predictive performance, uncertainty estimation, and out-of-distribution robustness has been extensively studied in the machine learning literature. Despite the promising results, naively training multiple deep neural networks and combining their predictions at inference leads to prohibitive computational costs and memory requirements. Recently proposed efficient ensemble approaches reach the performance of traditional deep ensembles at significantly lower cost. However, the training resources required by these approaches are still at least the same as those for training a single dense model. In this work, we draw a unique connection between sparse neural network training and deep ensembles, yielding a novel efficient ensemble learning framework called FreeTickets. Instead of training multiple dense networks and averaging them, we directly train sparse subnetworks from scratch and extract diverse yet accurate subnetworks during this efficient, sparse-to-sparse training. Our framework, FreeTickets, is defined as the ensemble of these relatively cheap sparse subnetworks. Despite being an ensemble method, FreeTickets has even fewer parameters and training FLOPs than a single dense model. This seemingly counter-intuitive outcome is due to the extreme training/inference efficiency of dynamic sparse training. FreeTickets surpasses the dense baseline on all of the following criteria: prediction accuracy, uncertainty estimation, out-of-distribution (OoD) robustness, and efficiency for both training and inference. Impressively, FreeTickets outperforms the naive deep ensemble with ResNet-50 on ImageNet using only around 1/5 of the training FLOPs required by the latter. We have released our source code at https://github.com/VITA-Group/FreeTickets. Comment: Published at the International Conference on Learning Representations (ICLR 2022).
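    The two ingredients the abstract describes, dynamic sparse-to-sparse training and ensembling the cheap subnetworks it produces, can be sketched as below. This is a rough illustration assuming 2-D weight tensors and random regrowth; the drop fraction and regrowth criterion are placeholders rather than the paper's actual settings (the released code at the linked repository is authoritative).

```python
import torch

def prune_and_regrow(weight, mask, drop_frac=0.3):
    """One dynamic-sparse-training update on a 2-D weight tensor: drop the
    smallest-magnitude active weights, then regrow the same number of
    connections at previously inactive positions (random regrowth here is a
    placeholder for the paper's growth criterion)."""
    active = mask.nonzero(as_tuple=False)
    inactive = (mask == 0).nonzero(as_tuple=False)
    n_drop = min(int(drop_frac * active.shape[0]), inactive.shape[0])
    if n_drop == 0:
        return mask
    # Drop: zero out the n_drop smallest-magnitude active connections.
    mags = weight[mask.bool()].abs()
    drop_idx = active[mags.argsort()[:n_drop]]
    mask[drop_idx[:, 0], drop_idx[:, 1]] = 0
    # Regrow: activate n_drop random previously-inactive positions, so the
    # overall sparsity (and parameter budget) stays constant.
    grow_idx = inactive[torch.randperm(inactive.shape[0])[:n_drop]]
    mask[grow_idx[:, 0], grow_idx[:, 1]] = 1
    return mask

def ensemble_predict(models, x):
    """'Free tickets' inference: average the softmax outputs of the sparse
    subnetwork snapshots captured during a single sparse-to-sparse run."""
    with torch.no_grad():
        probs = torch.stack([torch.softmax(m(x), dim=-1) for m in models])
    return probs.mean(dim=0)
```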

    ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ์˜ ํ–ฅ์ƒ์„ ์œ„ํ•œ ๊นŠ์€ ์‹ ๊ฒฝ๋ง ์–‘์žํ™”

    Doctoral dissertation -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2020. Advisor: ์„ฑ์›์šฉ.
    Deep neural networks (DNNs) achieve state-of-the-art performance in various applications such as image recognition and speech synthesis across different fields. However, their implementation in embedded systems is difficult owing to the large number of associated parameters and high computational costs. Because DNNs mimic the operation of human neurons, which function well at low precision, they also have the potential to operate well with low-precision parameters; quantization of DNNs exploits this property. In many applications, word-lengths of 8 bits or more lead to DNN performance comparable to that of a full-precision model; however, shorter word-lengths such as 1 or 2 bits can result in significant performance degradation. To alleviate this problem, complex quantization methods implemented via asymmetric or adaptive quantizers have been employed in previous works. In contrast, in this study, we propose a different approach to the quantization of DNNs. In particular, we focus on improving the generalization capability of quantized DNNs (QDNNs) instead of employing complex quantizers. To this end, we first analyze the performance characteristics of QDNNs trained with a retraining algorithm; we employ layer-wise sensitivity analysis to investigate the quantization characteristics of each layer. In addition, we analyze the differences in QDNN performance for different quantized network widths and depths. Based on our analyses, two simple quantization training techniques, namely "adaptive step size retraining" and "gradual quantization", are proposed. Furthermore, a new training scheme for QDNNs is proposed, referred to as the high-low-high-low-precision (HLHLp) training scheme, which allows the network to reach flat minima on its loss surface with the aid of quantization noise. As the name suggests, the proposed training method employs high-low-high-low precision for network training in an alternating manner. Accordingly, the learning rate is also changed abruptly at each stage. Our analysis shows that the proposed training technique yields good performance improvements for QDNNs compared with previously reported fine-tuning-based quantization schemes. Moreover, the knowledge distillation (KD) technique, which utilizes a pre-trained teacher model for training a student network, is exploited for the optimization of QDNNs. We explore the effect of teacher network selection and investigate the effect of different hyperparameters on the quantization of DNNs using KD. In particular, we use several large floating-point and quantized models as teacher networks. Our experiments indicate that, for effective KD training, the softmax distribution produced by a teacher network is more important than its performance.
    Furthermore, because the softmax distribution of a teacher network can be controlled using KD hyperparameters, we analyze the interrelationship of the KD components for QDNN training. We show that even a small teacher model can achieve the same distillation performance as a larger teacher model. We also propose the gradual soft loss reducing (GSLR) technique for robust KD-based QDNN optimization, wherein the mixing ratio of the hard and soft losses is controlled during training. In addition, we present a new QDNN optimization approach, namely "stochastic quantized weight averaging" (SQWA), to design low-precision DNNs with good generalization capability using model averaging. The proposed approach includes (1) floating-point model training, (2) direct quantization of the weights, (3) capture of multiple low-precision models during retraining with a cyclical learning rate, (4) averaging of the captured models, and (5) re-quantization of the averaged model and its fine-tuning with a low learning rate. Additionally, we present a loss-visualization technique for the quantized weight domain to elucidate the behavior of the proposed method. Our visualization results indicate that a QDNN optimized using the proposed approach is located near the center of a flat minimum on the loss surface.
    Table of contents:
    1 Introduction: 1.1 Quantization of Deep Neural Networks; 1.2 Generalization Capability of DNNs; 1.3 Improved Generalization Capability of QDNNs; 1.4 Outline of the Dissertation
    2 Analysis of Fixed-Point Quantization of Deep Neural Networks: 2.1 Introduction; 2.2 Fixed-Point Performance Analysis of Deep Neural Networks (2.2.1 Model Design of Deep Neural Networks; 2.2.2 Retrain-Based Weight Quantization; 2.2.3 Quantization Sensitivity Analysis; 2.2.4 Empirical Analysis); 2.3 Step Size Adaptation and Gradual Quantization for Retraining of Deep Neural Networks (2.3.1 Step-Size Adaptation during Retraining; 2.3.2 Gradual Quantization Scheme; 2.3.3 Experimental Results); 2.4 Concluding Remarks
    3 HLHLp: Quantized Neural Network Training for Reaching Flat Minima in the Loss Surface: 3.1 Introduction; 3.2 Related Works (3.2.1 Quantization of Deep Neural Networks; 3.2.2 Flat Minima in Loss Surfaces); 3.3 Training QDNNs for Improved Generalization Capability (3.3.1 Analysis of Training with Quantized Weights; 3.3.2 High-Low-High-Low-Precision Training); 3.4 Experimental Results (3.4.1 Image Classification with CNNs; 3.4.2 Language Modeling on PTB and WikiText-2; 3.4.3 Speech Recognition on the WSJ Corpus; 3.4.4 Discussion); 3.5 Concluding Remarks
    4 Knowledge Distillation for Optimization of Quantized Deep Neural Networks: 4.1 Introduction; 4.2 Quantized Deep Neural Network Training Using Knowledge Distillation (4.2.1 Quantization of Deep Neural Networks and Knowledge Distillation; 4.2.2 Teacher Model Selection for KD; 4.2.3 Discussion on Hyperparameters of KD); 4.3 Experimental Results (4.3.1 Experimental Setup; 4.3.2 Results on CIFAR-10 and CIFAR-100; 4.3.3 Model Size and Temperature; 4.3.4 Gradual Soft Loss Reducing); 4.4 Concluding Remarks
    5 SQWA: Stochastic Quantized Weight Averaging for Improving the Generalization Capability of Low-Precision Deep Neural Networks: 5.1 Introduction; 5.2 Related Works (5.2.1 Quantization of Deep Neural Networks for Efficient Implementations; 5.2.2 Stochastic Weight Averaging and Loss-Surface Visualization); 5.3 Quantization of DNNs and Loss Surface Visualization (5.3.1 Quantization of Deep Neural Networks; 5.3.2 Loss Surface Visualization for QDNNs); 5.4 SQWA Algorithm; 5.5 Experimental Results (5.5.1 CIFAR-100; 5.5.2 ImageNet); 5.6 Concluding Remarks
    6 Conclusion
    Abstract (In Korean)
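    Of the techniques covered, the SQWA procedure is the most pipeline-like, so a skeleton of its steps is sketched below. The helpers train_one_step and quantize are hypothetical stand-ins for the thesis's optimizer step and quantizer, and the triangular cyclical learning-rate schedule is an illustrative choice; steps (1) and (2), floating-point training and direct quantization, are assumed to have already produced the starting model.

```python
import copy
import torch

def sqwa_retrain(model, train_one_step, quantize,
                 cycles=5, steps_per_cycle=1000, lr_max=1e-2, lr_min=5e-4):
    """Sketch of SQWA retraining (steps 3-5 of the procedure described above).
    `train_one_step(model, lr)` performs one quantization-aware update and
    `quantize(model)` returns a low-precision copy; both are assumed helpers."""
    captured = []
    # Step 3: retrain with a cyclical learning rate and capture a low-precision
    # model at the low-LR point of every cycle.
    for _ in range(cycles):
        for step in range(steps_per_cycle):
            lr = lr_max - (lr_max - lr_min) * step / (steps_per_cycle - 1)
            train_one_step(model, lr)
        captured.append(copy.deepcopy(quantize(model)))
    # Step 4: average the captured low-precision models in weight space.
    avg_state = {key: torch.stack([m.state_dict()[key].float()
                                   for m in captured]).mean(dim=0)
                 for key in captured[0].state_dict()}
    model.load_state_dict(avg_state)
    # Step 5: re-quantize the averaged model; a final fine-tuning pass with a
    # low learning rate (not shown) would follow.
    return quantize(model)
```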

    Coupled Ensembles of Neural Networks

    In this paper we investigate the architecture of deep convolutional networks. Building on existing state-of-the-art models, we propose a reconfiguration of the model parameters into several parallel branches at the global network level, with each branch being a standalone CNN. We show that this arrangement is an efficient way to significantly reduce the number of parameters without losing performance, or to significantly improve performance with the same number of parameters. The use of branches brings an additional form of regularization. In addition to the split into parallel branches, we propose a tighter coupling of these branches by placing the "fuse (averaging) layer" before the Log-Likelihood and SoftMax layers during training. This gives another significant performance improvement, the tighter coupling favouring the learning of better representations, even at the level of the individual branches. We refer to this branched architecture as "coupled ensembles". The approach is very generic and can be applied with almost any DCNN architecture. With coupled ensembles of DenseNet-BC and a parameter budget of 25M, we obtain error rates of 2.92%, 15.68%, and 1.50% on the CIFAR-10, CIFAR-100, and SVHN tasks, respectively. For the same budget, DenseNet-BC has error rates of 3.46%, 17.18%, and 1.8%, respectively. With ensembles of coupled ensembles of DenseNet-BC networks, with 50M total parameters, we obtain error rates of 2.72%, 15.13%, and 1.42%, respectively, on these tasks.
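    As one concrete reading of the "fuse (averaging) layer" placement, the sketch below averages the branch scores before a single LogSoftmax and NLL loss, so all branches are trained through one coupled objective. The branch constructor is left abstract, and averaging raw scores (rather than log-probabilities) is an assumption made for illustration; it is not necessarily the exact fusion variant used in the paper.

```python
import torch
import torch.nn as nn

class CoupledEnsemble(nn.Module):
    """Parallel branch CNNs trained jointly: a 'fuse' (averaging) layer combines
    the branch scores before LogSoftmax, so one NLL loss couples all branches.
    `make_branch` is any callable returning a standalone CNN that outputs
    per-class scores (left abstract here)."""

    def __init__(self, make_branch, num_branches=4):
        super().__init__()
        self.branches = nn.ModuleList([make_branch() for _ in range(num_branches)])

    def forward(self, x):
        scores = torch.stack([branch(x) for branch in self.branches])  # (E, N, C)
        fused = scores.mean(dim=0)               # fuse (averaging) layer
        return torch.log_softmax(fused, dim=-1)  # pair with nn.NLLLoss for training

# Training usage (illustrative): loss = nn.NLLLoss()(model(images), labels)
# At test time the same fused prediction is used; individual branches can also
# be evaluated on their own.
```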