Generalization Error in Deep Learning
Deep learning models have lately shown great performance in various fields
such as computer vision, speech recognition, speech translation, and natural
language processing. However, alongside their state-of-the-art performance,
the source of their generalization ability is still generally unclear.
Thus, an important question is what makes deep neural networks able to
generalize well from the training set to new data. In this article, we provide
an overview of the existing theory and bounds for the characterization of the
generalization error of deep neural networks, combining both classical and more
recent theoretical and empirical results.
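As a concrete example of the classical results such a survey covers, a finite hypothesis class admits the standard Hoeffding-plus-union bound (a generic textbook result, not one specific to this article): with probability at least 1 - δ over an i.i.d. sample of size n,

```latex
R(h) \;\le\; \widehat{R}_n(h) \;+\; \sqrt{\frac{\ln\lvert\mathcal{H}\rvert + \ln(1/\delta)}{2n}}
\qquad \text{for all } h \in \mathcal{H},
```

where R(h) is the true risk and R̂ₙ(h) the empirical risk; the gap shrinks as n grows and widens with the capacity of the hypothesis class.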
Ensemble of Single-Layered Complex-Valued Neural Networks for Classification Tasks
This paper presents ensemble approaches using single-layered complex-valued
neural networks (CVNNs) to solve real-valued classification problems. Each
component CVNN of an ensemble uses a recently proposed activation function
for its complex-valued neurons (CVNs). A gradient-descent-based learning
algorithm was used to train the component CVNNs. We applied two ensemble
methods, negative correlation learning and bagging, to create the ensembles.
Experimental results on a number of real-world benchmark problems showed a
substantial performance improvement of the ensembles over the individual
single-layered CVNN classifiers. Furthermore, the generalization performance
was nearly equivalent to that obtained by ensembles of real-valued
multilayer neural networks.
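A minimal sketch of one component classifier and the bagging combination might look as follows; the magnitude-based activation is an illustrative stand-in, since the abstract does not specify the recently proposed activation function it refers to:

```python
import numpy as np

def cvnn_forward(x, W, b):
    # Complex pre-activation, then an illustrative magnitude-squashing
    # activation mapping complex outputs to real scores in (0, 1).
    # NOTE: the paper's actual activation function is an assumption here.
    z = x @ W + b
    m = np.abs(z)
    return m / (1.0 + m)

def bagging_ensemble(models, x):
    # Bagging at prediction time: average the real-valued outputs
    # of the component CVNNs.
    return np.mean([m(x) for m in models], axis=0)

rng = np.random.default_rng(0)
x = rng.normal(size=4) + 1j * rng.normal(size=4)        # complex input
W = rng.normal(size=(4, 3)) + 1j * rng.normal(size=(4, 3))
b = np.zeros(3, dtype=complex)
scores = cvnn_forward(x, W, b)                           # 3 class scores
```

Each component would be trained separately (e.g., on bootstrap samples for bagging) before being averaged.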
Training Data Influence Analysis and Estimation: A Survey
Good models require good training data. For overparameterized deep models,
the causal relationship between training data and model predictions is
increasingly opaque and poorly understood. Influence analysis partially
demystifies training's underlying interactions by quantifying the amount each
training instance alters the final model. Measuring the training data's
influence exactly can be provably hard in the worst case; this has led to the
development and use of influence estimators, which only approximate the true
influence. This paper provides the first comprehensive survey of training data
influence analysis and estimation. We begin by formalizing the various, and in
places orthogonal, definitions of training data influence. We then organize
state-of-the-art influence analysis methods into a taxonomy; we describe each
of these methods in detail and compare their underlying assumptions, asymptotic
complexities, and overall strengths and weaknesses. Finally, we propose future
research directions to make influence analysis more useful in practice as well
as more theoretically and empirically sound. A curated, up-to-date list of
resources related to influence analysis is available at
https://github.com/ZaydH/influence_analysis_papers
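To make the notion concrete, here is a small sketch of exact leave-one-out influence on a ridge-regression model; the estimators the survey covers exist precisely because this brute-force refitting is infeasible for deep models (names and the test metric are illustrative):

```python
import numpy as np

def fit_ridge(X, y, lam=1e-2):
    # Closed-form ridge regression: w = (X'X + lam*I)^-1 X'y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def loo_influence(X, y, x_test, y_test, lam=1e-2):
    # Exact leave-one-out influence: how much removing each training
    # point changes the squared error on (x_test, y_test).
    # Negative influence => removing the point helps the test point.
    w_full = fit_ridge(X, y, lam)
    base = (x_test @ w_full - y_test) ** 2
    infl = np.empty(len(X))
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        w_i = fit_ridge(X[mask], y[mask], lam)
        infl[i] = (x_test @ w_i - y_test) ** 2 - base
    return infl
```

On a toy linear dataset with one mislabeled point, the outlier is the only training instance whose removal lowers the test error, so it stands out immediately.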
Modification of Learning Ratio and Drop-Out for Stochastic Gradient Descendant Algorithm
The stochastic gradient descendant algorithm is one of the most popular neural network training algorithms. Many authors have contributed to modifying or adapting its shape and parametrizations in order to improve its performance. In this paper, the authors propose two modifications to this algorithm that can result in better performance without significantly increasing the computational and time resources needed. The first is a dynamic learning ratio that depends on the network layer where it is applied, and the second is a dynamic drop-out that decreases through the epochs of training. These techniques have been tested against different benchmark functions to see their effect on the learning process. The obtained results show that applying these techniques improves the learning performance of the neural network, especially when they are used together. The current study has been sponsored by the Government of the Basque Country, ELKARTEK21/10 KK-2021/00014 ("Estudio de nuevas técnicas de inteligencia artificial basadas en Deep Learning dirigidas a la optimización de procesos industriales", i.e., "Study of new artificial intelligence techniques based on Deep Learning aimed at the optimization of industrial processes") research program.
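The two modifications can be sketched as simple schedules; the exact functional forms below (a linearly growing per-layer ratio and a linearly decaying drop-out) are illustrative assumptions, since the abstract does not give the authors' formulas:

```python
def layer_learning_rate(base_lr, layer_idx, n_layers):
    # Hypothetical layer-dependent learning ratio: deeper layers get a
    # larger step. The paper's exact per-layer schedule is not given
    # in the abstract, so this linear ramp is an assumption.
    return base_lr * (1.0 + layer_idx / max(n_layers - 1, 1))

def dropout_rate(p0, epoch, n_epochs):
    # Dynamic drop-out that decreases through the epochs of training,
    # as the abstract describes (linear decay assumed here).
    return p0 * (1.0 - epoch / n_epochs)
```

During training, each layer's update would use `layer_learning_rate(...)` for its step size, and the drop-out mask would be drawn with probability `dropout_rate(...)` for the current epoch.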
Deep Learning based Recommender System: A Survey and New Perspectives
With the ever-growing volume of online information, recommender systems have
been an effective strategy to overcome such information overload. The utility
of recommender systems cannot be overstated, given their widespread adoption in
many web applications, along with their potential to ameliorate many
problems related to over-choice. In recent years, deep learning has garnered
considerable interest in many research fields such as computer vision and
natural language processing, owing not only to stellar performance but also to
the attractive property of learning feature representations from scratch. The
influence of deep learning is also pervasive, recently demonstrating its
effectiveness when applied to information retrieval and recommender systems
research. Evidently, the field of deep learning in recommender systems is
flourishing. This article aims to provide a comprehensive review of recent
research efforts on deep learning based recommender systems. More concretely,
we provide and devise a taxonomy of deep learning based recommendation models,
along with a comprehensive summary of the state of the art. Finally,
we expand on current trends and provide new perspectives pertaining to this
exciting new development of the field. Comment: The paper has been accepted by ACM Computing Surveys.
https://doi.acm.org/10.1145/328502
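As a minimal illustration of the classical latent-factor model that the surveyed deep recommenders build on, here is a sketch of matrix factorization trained by SGD (hyperparameters and names are illustrative, not from the article):

```python
import numpy as np

def train_mf(ratings, n_users, n_items, k=4, lr=0.05, reg=0.01,
             epochs=300, seed=0):
    # Minimal latent-factor recommender: approximate each rating r_{ui}
    # by the dot product of user and item embeddings, fit by SGD.
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.normal(size=(n_users, k))   # user factors
    Q = 0.1 * rng.normal(size=(n_items, k))   # item factors
    for _ in range(epochs):
        for u, i, r in ratings:
            pu = P[u].copy()
            err = r - pu @ Q[i]
            P[u] += lr * (err * Q[i] - reg * pu)
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q
```

Deep learning based models replace the dot product and linear embeddings with learned nonlinear interaction functions, which is the axis along which the survey's taxonomy is organized.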
Exploring CNNs: an application study on nuclei recognition task in colon cancer histology images
In this work we explore recent advances in the field of Convolutional Neural Networks (CNNs), with particular interest in the task of image classification. Moreover, we explore a new neural network algorithm, called the ladder network, which enables a semi-supervised framework on pre-existing neural networks.
These techniques were applied to a task of nuclei classification in routine colon cancer histology images.
Specifically, starting from an existing CNN developed for this purpose, we improve its performance by using better data augmentation, a more efficient initialization of the network, and a batch normalization layer. These improvements were made to achieve a state-of-the-art architecture that could be compatible with the ladder network algorithm. A custom version of the ladder network algorithm was implemented in our CNN in order to use the unlabeled data provided with the database.
However, we observed a deterioration in performance when using the unlabeled examples of this database, probably due to a distribution bias in them compared to the labeled ones.
Even without the semi-supervised framework, the ladder algorithm yields a better representation in the CNN, which leads to a dramatic performance improvement over the starting CNN algorithm.
We reach this result with only a small increase in the complexity of the final model, working specifically on the training process of the algorithm.
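One of the improvements mentioned, the batch normalization layer, can be sketched in a few lines (training-mode statistics only; the learnable scale/shift and the running averages used at inference are simplified away):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Batch normalization, training mode: standardize each feature
    # over the mini-batch, then scale and shift.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```

With the default gamma and beta, every feature of the output batch has approximately zero mean and unit variance, which stabilizes training of the layers above it.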
Agree to Disagree: Diversity through Disagreement for Better Transferability
Gradient-based learning algorithms have an implicit simplicity bias which in
effect can limit the diversity of predictors being sampled by the learning
procedure. This behavior can hinder the transferability of trained models by
(i) favoring the learning of simpler but spurious features -- present in the
training data but absent from the test data -- and (ii) by only leveraging a
small subset of predictive features. Such an effect is especially magnified
when the test distribution does not exactly match the train distribution --
referred to as the Out of Distribution (OOD) generalization problem. However,
given only the training data, it is not always possible to assess a priori whether a
given feature is spurious or transferable. Instead, we advocate for learning an
ensemble of models which capture a diverse set of predictive features. Towards
this, we propose a new algorithm D-BAT (Diversity-By-disAgreement Training),
which enforces agreement among the models on the training data, but
disagreement on the OOD data. We show how D-BAT naturally emerges from the
notion of generalized discrepancy, as well as demonstrate in multiple
experiments how the proposed method can mitigate shortcut learning, enhance
uncertainty and OOD detection, as well as improve transferability. Comment: 23 pages, 17 figures
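A sketch of the kind of two-model objective the abstract describes: cross-entropy (agreement with labels) on training data plus a term rewarding disagreement on unlabeled OOD inputs. The weighting and the exact disagreement term below are simplified assumptions, not the paper's precise formulation:

```python
import numpy as np

def dbat_loss(p1, p2, y, p1_ood, p2_ood, alpha=1.0, eps=1e-9):
    # p1, p2: the two models' predicted P(class=1) on labeled training
    # data; p1_ood, p2_ood: their predictions on unlabeled OOD data.
    ce1 = -np.mean(y * np.log(p1 + eps) + (1 - y) * np.log(1 - p1 + eps))
    ce2 = -np.mean(y * np.log(p2 + eps) + (1 - y) * np.log(1 - p2 + eps))
    # Probability that both models predict the same class on OOD data;
    # penalizing it pushes the models toward disagreement there.
    agree = p1_ood * p2_ood + (1 - p1_ood) * (1 - p2_ood)
    return ce1 + ce2 + alpha * np.mean(-np.log(1 - agree + eps))
```

Holding the training terms fixed, the loss is lower when the two models disagree on the OOD inputs, which is exactly the pressure toward diverse predictive features the abstract argues for.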
Characterization and Optimization of Quantized Deep Neural Networks
Thesis (Ph.D.) -- Graduate School of Seoul National University, College of Engineering, Department of Electrical and Computer Engineering, August 2020. Advisor: Wonyong Sung.
Deep neural networks (DNNs) have achieved impressive performance on various machine learning tasks. However, performance improvements are usually accompanied by increased network complexity, incurring vast numbers of arithmetic operations and memory accesses. In addition, the recent increase in demand for deploying DNNs on resource-limited devices has led to a plethora of explorations in model compression and acceleration. Among them, network quantization is one of the most cost-efficient implementation methods for DNNs. Network quantization converts the precision of parameters and signals from 32-bit floating point to 8-, 4-, or 2-bit fixed-point precision. Weight quantization can directly compress DNNs by reducing the representation levels of the parameters. Activation outputs can also be quantized to reduce the computational costs and working-memory footprint. However, severe quantization degrades the performance of the network. Many previous studies focused on developing optimization methods for the quantization of given models without considering the effects of the quantization on DNNs. Therefore, extensive simulation is required to obtain a quantization precision that maintains performance on different models or datasets.
In this dissertation, we attempt to measure the per-parameter capacity of DNN models and interpret the results to obtain insights into the optimum quantization of parameters. Uniform random vectors are sampled and used for training generic forms of fully connected DNNs, convolutional neural networks (CNNs), and recurrent neural networks (RNNs). We conduct memorization and classification tests to study the effects of the number and precision of parameters on performance. The model and per-parameter capacities are assessed by measuring the mutual information between the input and the classified output. To gain insight into parameter quantization when performing real tasks, the training and test performances are compared.
In addition, we analyze and demonstrate that the quantization noise of weights and activations behaves differently at inference time. Synthesized data is designed to visualize the effects of weight and activation quantization. The results indicate that deeper models are more prone to activation quantization, while wider models improve the resiliency to both weight and activation quantization. Considering the characteristics of the quantization errors, we propose a holistic approach for the optimization of QDNNs, which contains QDNN training methods as well as quantization-friendly architecture design.
Based on the observation that activation quantization induces noisy predictions, we propose Stochastic Precision Ensemble training for QDNNs (SPEQ). SPEQ is a teacher-student learning scheme in which the teacher and the student share the model parameters. We obtain the teacher's soft labels by stochastically changing the bit precision of the activations at each layer of the forward-pass computation. The student model is trained with these soft labels to reduce the activation quantization noise. Instead of the KL divergence, a cosine-distance loss is employed for the KD training. Since the teacher model changes continuously through random bit-precision assignment, the method exploits the effect of stochastic ensemble KD. SPEQ outperforms existing quantization training methods on various tasks, such as image classification, question answering, and transfer learning, without requiring cumbersome teacher networks.
Korean abstract (translated): Deep neural networks (DNNs) have recently shown very impressive performance in various fields. However, as network complexity grows, increasingly large computation and memory-access costs arise. Quantization of neural networks is one of the effective methods for reducing the high cost of deep neural networks. In general, network weights and activation outputs have 32-bit floating-point precision. Fixed-point quantization represents them with lower precision, reducing the size and computation cost of the network. However, at very low precision such as 1 or 2 bits, quantized networks show a large performance drop compared with floating-point networks. Previous studies proposed optimization methods for a given dataset and model without analyzing the quantization error, so applying their results to other models and datasets requires numerous simulations to find the limit of quantization precision that maintains performance.
In this work, we analyze the characteristics of quantization in neural networks and identify the causes of the resulting performance degradation. Network quantization largely divides into weight quantization and activation quantization. First, to analyze the characteristics of weight quantization, we generate random training samples, train networks on these data, and quantify their memorization capacity. After training networks to make full use of their memorization capacity, we analyze the quantization precision at which performance drops. The analysis confirms that the precision at which weights begin to lose information is related to the number of parameters; moreover, the minimum precision that preserves the information stored in the parameters depends on the model architecture.
We also analyze the difference between the errors caused by activation quantization and weight quantization. We generate synthesized data and visualize the quantization errors after quantizing models trained on it. The results show that weight quantization reduces the capacity of the network, and increasing the number of parameters reduces the weight quantization error; in contrast, activation quantization induces noise during inference, and this error is amplified as the network becomes deeper. Based on the difference between the two quantization errors, we propose a holistic fixed-point optimization method that includes quantization-friendly architecture design and fixed-point training methods.
Furthermore, we propose the SPEQ training method to improve the resiliency of activation-quantized networks. The proposed method is a knowledge distillation (KD) based training scheme that exploits the information of a different-precision model at every training step. The teacher model shares its parameters with the student model, and the teacher's soft labels are generated by stochastically selecting the activation quantization precision. The teacher thus provides the student with knowledge that accounts for quantization noise. Since the student is trained at each step with knowledge reflecting a different kind of quantization noise, an ensemble training effect is obtained. The proposed SPEQ training method greatly improves the performance of quantized neural networks in various fields.
1 Introduction
1.1 Quantization of Deep Neural Networks
1.1.1 Weight and Activation Quantization on Deep Neural Networks
1.1.2 Analysis of Quantized Deep Neural Networks
1.2 Scope of the Dissertation
1.2.1 Characterization of Quantization Errors
1.2.2 Optimization of Quantized Deep Neural Networks
2 Memorization Capacity of Deep Neural Networks under Parameter Quantization
2.1 Introduction
2.2 Related Works and Backgrounds
2.2.1 Neural Network Capacity
2.2.2 Fixed-Point Deep Neural Networks
2.3 Network Capacity Measurements of DNNs
2.3.1 Capacity Measurements on a Memorization Task
2.3.2 Network Quantization Method
2.3.3 Network Quantization and Parameter Capacity
2.4 Experimental Results on Capacity of Floating-point DNNs
2.4.1 Capacity of FCDNNs
2.4.2 Capacity of CNNs
2.4.3 Capacity of RNNs
2.5 Experimental Results of Parameter Quantization
2.5.1 Capacity under Parameter Quantization
2.5.2 Quantization Experiments on CIFAR-10 Dataset
2.5.3 Quantization Experiments on Shuffled CIFAR-10 Dataset
2.6 Concluding Remarks
3 Characterization and Holistic Optimization of Quantized Deep Neural Networks
3.1 Introduction
3.2 Backgrounds
3.2.1 Related Works on Network Quantization
3.2.2 Revisit of QDNN Optimization
3.3 Visualization of Quantization Errors using Synthetic Dataset
3.3.1 Synthetic Dataset Generation
3.3.2 Results on Synthetic Dataset
3.4 QDNN Optimization with Architectural Transformation and Improved Training
3.4.1 Architecture Transformation for Improved Robustness to Quantization
3.4.2 Cyclical Learning Rate Scheduling for Improved Generalization
3.4.3 Regularization for Limiting the Activation Noise Amplification
3.5 Experimental Results
3.5.1 Visualizing the Effects of Quantization on the Segmentation Task
3.5.2 The Width and Depth Effects on QDNNs
3.5.3 QDNN Architecture Selection under the Parameter Constraint
3.5.4 Results of Training Methods on QDNNs
3.6 Concluding Remarks
4 Parameter Shared Stochastic Precision Knowledge Distillation for Quantized Deep Neural Networks
4.1 Introduction
4.2 Background and Related Works
4.2.1 Quantization of Deep Neural Networks
4.2.2 Knowledge Distillation for Quantization
4.3 Stochastic Precision Ensemble Training for QDNNs
4.3.1 Quantization Method
4.3.2 Stochastic Precision Self-Distillation with Model Sharing
4.3.3 Stochastic Ensemble Learning
4.3.4 Cosine Similarity Learning
4.4 Experimental Results
4.4.1 Experiment Setup
4.4.2 Results on CIFAR-10 and CIFAR-100 Datasets
4.4.3 Results on ImageNet Dataset
4.4.4 Results on Transfer Learning
4.5 Concluding Remarks
5 Conclusion
Abstract (In Korean)
Acknowledgements
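The weight quantization studied throughout the dissertation can be illustrated with a generic symmetric uniform quantizer (a common scheme used for fixed-point DNNs; the dissertation's exact method may differ):

```python
import numpy as np

def quantize_uniform(w, n_bits):
    # Symmetric uniform quantization of a weight tensor to n_bits
    # (n_bits >= 2): map to 2^(n_bits-1)-1 positive levels, zero, and
    # the mirrored negative levels, then scale back.
    n_levels = 2 ** (n_bits - 1) - 1
    scale = np.max(np.abs(w)) / n_levels
    return np.round(w / scale) * scale
```

Each weight lands within half a step of its original value, and lowering `n_bits` coarsens the grid, which is exactly the trade-off between compression and the quantization error the dissertation characterizes.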