14 research outputs found

    Excess risk bound for deep learning under weak dependence

    Full text link
    This paper considers deep neural networks for learning weakly dependent processes in a general framework that includes, for instance, regression estimation, time series prediction, and time series classification. The $\psi$-weak dependence structure considered is quite general and covers other conditions such as mixing, association, … Firstly, the approximation of smooth functions by deep neural networks with a broad class of activation functions is considered. We derive the required depth, width and sparsity of a deep neural network to approximate any Hölder smooth function defined on any compact set. Secondly, we establish a bound on the excess risk for the learning of weakly dependent observations by deep neural networks. When the target function is sufficiently smooth, this bound is close to the usual $\mathcal{O}(n^{-1/2})$ rate.
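    For reference, the quantity whose rate is compared with $n^{-1/2}$ is the excess risk, written here in standard generic notation (the loss $\ell$, risk $\mathcal{R}$ and estimator $\hat f_n$ are generic symbols, not the paper's own):

        \mathcal{R}(f) = \mathbb{E}\big[\ell(f(X), Y)\big],
        \qquad
        \mathcal{E}(\hat f_n) = \mathcal{R}(\hat f_n) - \inf_{f} \mathcal{R}(f).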

    Kernel-convoluted Deep Neural Networks with Data Augmentation

    Full text link
    The Mixup method (Zhang et al. 2018), which uses linearly interpolated data, has emerged as an effective data augmentation tool to improve generalization performance and robustness to adversarial examples. The motivation is to curtail undesirable oscillations through an implicit model constraint to behave linearly between observed data points, thereby promoting smoothness. In this work, we formally investigate this premise, propose a way to impose the smoothness constraint explicitly, and extend it to incorporate the implicit model constraint as well. First, we derive a new function class composed of kernel-convoluted models (KCM), in which the smoothness constraint is imposed directly by locally averaging the original functions with a kernel function. Second, we propose to incorporate the Mixup method into KCM to expand the domains of smoothness. For both the KCM and the KCM adapted with the Mixup, we provide a risk analysis under suitable conditions on the kernel. We show that the excess risk of the new class admits an upper bound that decays no more slowly than that of the original function class. The upper bound for the KCM with the Mixup remains dominated by that of the KCM if the Mixup perturbation vanishes faster than $O(n^{-1/2})$, where $n$ is the sample size. Using the CIFAR-10 and CIFAR-100 datasets, our experiments demonstrate that the KCM with the Mixup outperforms the Mixup method in terms of generalization and robustness to adversarial examples.
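    As a concrete reading of the two ingredients described above, the sketch below (a minimal illustration, not the authors' implementation) pairs Mixup interpolation with a Monte Carlo approximation of a kernel-convoluted prediction; the Gaussian kernel, the bandwidth h and the number of Monte Carlo samples are illustrative assumptions, and `model` is any callable mapping inputs to outputs.

        # Minimal sketch: Mixup interpolation and a Monte Carlo kernel-convoluted prediction.
        # The Gaussian kernel, bandwidth h and sample count are illustrative choices, not the
        # paper's exact construction.
        import numpy as np

        def mixup_batch(x, y, alpha=0.2, rng=None):
            """Linearly interpolate a batch with a Beta(alpha, alpha) weight (Zhang et al. 2018)."""
            if rng is None:
                rng = np.random.default_rng()
            lam = rng.beta(alpha, alpha)
            perm = rng.permutation(len(x))
            return lam * x + (1.0 - lam) * x[perm], lam * y + (1.0 - lam) * y[perm]

        def kernel_convoluted_predict(model, x, h=0.1, n_samples=32, rng=None):
            """Estimate the locally averaged (kernel-convoluted) model at inputs x."""
            if rng is None:
                rng = np.random.default_rng()
            preds = [model(x + h * rng.standard_normal(x.shape)) for _ in range(n_samples)]
            return np.mean(preds, axis=0)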

    Least-Squares Neural Network (LSNN) Method For Linear Advection-Reaction Equation: General Discontinuous Interface

    Full text link
    We studied the least-squares ReLU neural network (LSNN) method for solving the linear advection-reaction equation with discontinuous solution in [Cai, Zhiqiang, Jingshuang Chen, and Min Liu. "Least-squares ReLU neural network (LSNN) method for linear advection-reaction equation." Journal of Computational Physics 443 (2021), 110514]. The method is based on a least-squares formulation and uses a new class of approximating functions: ReLU neural network (NN) functions. A critical additional component of the LSNN method, which distinguishes it from other NN-based methods, is the introduction of a properly designed discrete differential operator. In this paper, we study the LSNN method for problems with arbitrary discontinuous interfaces. First, we show that ReLU NN functions with depth $\lceil \log_2(d+1)\rceil+1$ can approximate any $d$-dimensional step function on arbitrary discontinuous interfaces to any prescribed accuracy. By decomposing the solution into continuous and discontinuous parts, we prove theoretically that the discretization error of the LSNN method using ReLU NN functions with depth $\lceil \log_2(d+1)\rceil+1$ is mainly determined by the continuous part of the solution, provided that the solution jump is constant. Numerical results for both two- and three-dimensional problems with various discontinuous interfaces show that the LSNN method with enough layers is accurate and does not exhibit the common Gibbs phenomenon along the discontinuous interface.
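    Two pieces of the statement above can be made concrete in a few lines. The depth prescription $\lceil \log_2(d+1)\rceil+1$ is reproduced directly from the abstract; the one-dimensional two-ReLU ramp is only an illustrative example of how a step discontinuity is approximated by ReLU functions, not the paper's construction.

        # The quoted depth formula, plus a 1-D illustration: a two-ReLU ramp that equals a unit
        # step outside a transition band of width eps (illustrative, not the paper's proof).
        import math
        import numpy as np

        def lsnn_depth(d: int) -> int:
            """Depth ceil(log2(d+1)) + 1 quoted for approximating d-dimensional step functions."""
            return math.ceil(math.log2(d + 1)) + 1

        def relu(t):
            return np.maximum(t, 0.0)

        def relu_step(t, eps=1e-2):
            """0 for t <= 0, 1 for t >= eps, linear in between: (ReLU(t) - ReLU(t - eps)) / eps."""
            return (relu(t) - relu(t - eps)) / eps

        print(lsnn_depth(2), lsnn_depth(3))   # both equal 3 for the 2- and 3-dimensional examples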

    Smooth function approximation by deep neural networks with general activation functions

    Full text link
    There has been growing interest in the expressivity of deep neural networks. However, most of the existing work on this topic focuses only on specific activation functions such as ReLU or sigmoid. In this paper, we investigate the approximation ability of deep neural networks with a broad class of activation functions. This class includes most of the frequently used activation functions. We derive the required depth, width and sparsity of a deep neural network to approximate any Hölder smooth function up to a given approximation error for this large class of activation functions. Based on our approximation error analysis, we derive the minimax optimality of the deep neural network estimators with general activation functions in both regression and classification problems.
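    The sketch below spells out what "depth, width and sparsity" refer to for a fully connected network with a pluggable activation; the sizes, the random weights and the tanh activation are placeholders, since the actual requirements derived in the paper depend on the Hölder smoothness and the target approximation error.

        # Illustrative only: a fully connected network of a chosen depth and width with a
        # pluggable activation, and its sparsity counted as the number of nonzero parameters.
        import numpy as np

        def make_network(depth, width, d_in=1, d_out=1, rng=None):
            """Random weights and biases for `depth` hidden layers of size `width`."""
            if rng is None:
                rng = np.random.default_rng(0)
            sizes = [d_in] + [width] * depth + [d_out]
            return [(rng.standard_normal((m, n)), rng.standard_normal(n))
                    for m, n in zip(sizes[:-1], sizes[1:])]

        def forward(params, x, activation=np.tanh):
            """Forward pass; any activation from the broad class could be substituted here."""
            h = x
            for W, b in params[:-1]:
                h = activation(h @ W + b)
            W, b = params[-1]
            return h @ W + b

        def sparsity(params):
            """Number of nonzero weights and biases, the quantity called 'sparsity' above."""
            return sum(int(np.count_nonzero(W)) + int(np.count_nonzero(b)) for W, b in params)

        net = make_network(depth=4, width=16)
        print(forward(net, np.zeros((5, 1))).shape, sparsity(net))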

    Nonparametric logistic regression with deep learning

    Full text link
    Consider the nonparametric logistic regression problem. In logistic regression, we usually consider the maximum likelihood estimator, and the excess risk is the expectation of the Kullback-Leibler (KL) divergence between the true and estimated conditional class probabilities. However, in nonparametric logistic regression the KL divergence can easily diverge, and thus the convergence of the excess risk is difficult to prove or does not hold. Several existing studies show the convergence of the KL divergence under strong assumptions. In most cases, our goal is to estimate the true conditional class probabilities. Thus, instead of analyzing the excess risk itself, it suffices to show the consistency of the maximum likelihood estimator in some suitable metric. In this paper, using a simple unified approach for analyzing the nonparametric maximum likelihood estimator (NPMLE), we directly derive the convergence rates of the NPMLE in the Hellinger distance under mild assumptions. Although our results are similar to those in some existing studies, we provide simpler and more direct proofs. As an important application, we derive the convergence rates of the NPMLE with deep neural networks and show that the derived rate nearly achieves the minimax optimal rate.
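    The contrast drawn above (the KL divergence can blow up while the Hellinger distance stays bounded) can already be seen for two Bernoulli conditional class probabilities; the numbers below are only an illustration of that phenomenon, not part of the paper.

        # Pointwise KL divergence vs. squared Hellinger distance between Ber(p) and Ber(q):
        # KL explodes as the estimated probability q approaches 0 or 1, Hellinger stays <= 1.
        import numpy as np

        def kl_bernoulli(p, q):
            """KL( Ber(p) || Ber(q) ), with the 0*log(0) convention handled explicitly."""
            with np.errstate(divide="ignore", invalid="ignore"):
                t1 = np.where(p > 0, p * np.log(p / q), 0.0)
                t2 = np.where(p < 1, (1 - p) * np.log((1 - p) / (1 - q)), 0.0)
            return t1 + t2

        def hellinger_sq_bernoulli(p, q):
            """Squared Hellinger distance between Ber(p) and Ber(q); always between 0 and 1."""
            return 0.5 * ((np.sqrt(p) - np.sqrt(q)) ** 2 + (np.sqrt(1 - p) - np.sqrt(1 - q)) ** 2)

        p, q = np.array([0.5]), np.array([1e-12])
        print(kl_bernoulli(p, q), hellinger_sq_bernoulli(p, q))   # roughly 13.1 vs 0.29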

    A study on the asymptotic properties of machine learning algorithms

    Get PDF
    Ph.D. dissertation, Department of Statistics, College of Natural Sciences, Seoul National University, February 2020 (advisor: 김용대). In this thesis, we study the asymptotic properties of three machine learning algorithms: two supervised learning algorithms with deep neural networks and a Bayesian learning method for high-dimensional factor models. The first research problem involves learning deep neural network (DNN) classifiers. We derive fast convergence rates of a DNN classifier learned using the hinge loss. We consider various cases for the true probability model and show that the DNN classifier achieves fast convergence rates in all cases, provided its architecture is carefully selected. The second research topic is learning sparse DNNs. We propose a sparse learning algorithm that minimizes a penalized empirical risk with a novel sparsity-inducing penalty. We establish an oracle inequality for the excess risk of the proposed sparse DNN estimator and derive convergence rates for several learning tasks. In particular, the proposed sparse DNN estimator can adaptively attain minimax optimal convergence rates for nonparametric regression problems. The third part of the thesis is devoted to Bayesian nonparametric learning for high-dimensional factor models. We propose a prior distribution based on the two-parameter Indian buffet process, which is computationally tractable. We prove that the resulting posterior distribution concentrates on the true factor dimensionality and contracts to the true covariance matrix at a near-optimal rate.
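    The first two chapters revolve around a penalized empirical risk of the kind sketched below. The hinge loss is the one named in the abstract; the clipped-L1 form of the sparsity-inducing penalty is taken from the chapter title in the contents below and written here in one common parameterization, so the penalty form and the constants lam and tau are illustrative assumptions rather than the thesis's exact definition.

        # Minimal sketch: empirical hinge risk plus a clipped-L1-type sparsity penalty.
        # The penalty parameterization lam * min(|theta|/tau, 1) is one common form, used here
        # only for illustration.
        import numpy as np

        def hinge_loss(margins):
            """Hinge loss max(0, 1 - y*f(x)) evaluated on the margins y*f(x)."""
            return np.maximum(0.0, 1.0 - margins)

        def clipped_l1(theta, lam=1e-3, tau=1e-2):
            """Clipped-L1-type penalty lam * min(|theta|/tau, 1), summed over all parameters."""
            return lam * float(np.sum(np.minimum(np.abs(theta) / tau, 1.0)))

        def penalized_empirical_risk(margins, theta, lam=1e-3, tau=1e-2):
            """Empirical hinge risk plus the sparsity-inducing penalty."""
            return hinge_loss(margins).mean() + clipped_l1(theta, lam, tau)

        # toy check: margins of a perfectly separating classifier incur zero hinge loss
        print(penalized_empirical_risk(np.array([2.0, 3.0]), np.zeros(10)))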
    Contents:
    Introduction
        0.1 Motivation
        0.2 Outline and contributions
    1 Fast convergence rates of deep neural networks for classification
        1.1 Introduction
            1.1.1 Notation
        1.2 Estimation of the classifier with DNNs
            1.2.1 About the hinge loss
            1.2.2 Learning DNN with the hinge loss
        1.3 Fast convergence rates of DNN classifiers with the hinge loss
            1.3.1 Case 1: Smooth conditional class probability
            1.3.2 Case 2: Smooth boundary
            1.3.3 Case 3: Margin condition
        1.4 Adaptive estimation
        1.5 Use of the logistic loss
        1.6 Concluding remarks
        1.7 Proofs
            1.7.1 Complexity of a class of DNNs
            1.7.2 Convergence rate of the excess surrogate risk for general surrogate losses
            1.7.3 Generic convergence rate for the hinge loss
            1.7.4 Proof of Theorem 1.3.1
            1.7.5 Proof of Theorem 1.3.2
            1.7.6 Proof of Theorem 1.3.3
            1.7.7 Proof of Theorem 1.3.4
            1.7.8 Proof of Theorem 1.4.1
            1.7.9 Proof of Theorem 1.5.1
            1.7.10 Proof of Proposition 1.7.9
    2 Rate-optimal sparse learning for deep neural networks
        2.1 Introduction
            2.1.1 Notation
            2.1.2 Deep neural networks
            2.1.3 Empirical risk minimization algorithm with a sparsity constraint and its nonadaptiveness
            2.1.4 Outline
        2.2 Learning sparse deep neural networks with the clipped L1 penalty
        2.3 Main results
            2.3.1 Nonparametric regression
            2.3.2 Classification with strictly convex losses
        2.4 Implementation
        2.5 Numerical studies
            2.5.1 Regression with simulated data
            2.5.2 Classification with real data
        2.6 Conclusion
        2.7 Proofs
            2.7.1 Covering numbers of classes of DNNs
            2.7.2 Proofs of Theorem 2.3.1 and Theorem 2.3.3
            2.7.3 Proofs of Theorem 2.3.2 and Theorem 2.3.4
    3 Posterior consistency of the factor dimensionality in high-dimensional sparse factor models
        3.1 Introduction
            3.1.1 Notation
        3.2 Assumptions and prior distribution
            3.2.1 Assumptions
            3.2.2 Prior distribution and its properties
                3.2.2.1 Induced distribution of the factor dimensionality
                3.2.2.2 Induced distribution of the sparsity
                3.2.2.3 Prior concentration near the true loading matrix
        3.3 Asymptotic properties of the posterior distribution
            3.3.1 Posterior contraction rate for covariance matrix
            3.3.2 Posterior consistency of the factor dimensionality
        3.4 Numerical results
            3.4.1 MCMC algorithm
            3.4.2 Simulation study
        3.5 Discussions about adaptive priors
        3.6 Concluding remarks
        3.7 Proofs
            3.7.1 Proofs of lemmas and corollary in Section 3.2
            3.7.2 Proofs of theorems in Section 3.3
            3.7.3 Proof of Theorem 3.5.1
            3.7.4 Auxiliary lemmas
    Appendix A Smooth function approximation by deep neural networks with general activation functions
        A.1 Introduction
            A.1.1 Notation
        A.2 Deep neural networks
        A.3 Classes of activation functions
            A.3.1 Piecewise linear activation functions
            A.3.2 Locally quadratic activation functions
        A.4 Approximation of Hölder smooth functions by deep neural networks
        A.5 Application to statistical learning theory
            A.5.1 Application to regression
            A.5.2 Application to binary classification
        A.6 Proofs
            A.6.1 Proof of Theorem A.4.1 for piecewise linear activation functions
            A.6.2 Proof of Theorem A.4.1 for locally quadratic activation functions
            A.6.3 Proof of Proposition A.5.1
            A.6.4 Proof of Theorem A.5.2
            A.6.5 Proof of Theorem A.5.3
    Appendix B Poisson mixture of finite feature models
        B.1 Overview
            B.1.1 Equivalence classes
            B.1.2 Notation
        B.2 Equivalent representations
            B.2.1 Urn schemes
            B.2.2 Hierarchical representation
        B.3 Application to sparse Bayesian factor models
            B.3.1 Model and prior
            B.3.2 Assumptions on the true distribution
            B.3.3 Preliminary results
            B.3.4 Asymptotic properties
        B.4 Proofs
            B.4.1 Proofs of results in Section B.2
            B.4.2 Proofs of results in Section B.3.3
            B.4.3 Proof of Theorem B.3.5
    Bibliography