
    Attribute-Efficient PAC Learning of Low-Degree Polynomial Threshold Functions with Nasty Noise

    The concept class of low-degree polynomial threshold functions (PTFs) plays a fundamental role in machine learning. In this paper, we study PAC learning of $K$-sparse degree-$d$ PTFs on $\mathbb{R}^n$, where any such concept depends only on $K$ out of $n$ attributes of the input. Our main contribution is a new algorithm that runs in time $(nd/\epsilon)^{O(d)}$ and, under the Gaussian marginal distribution, PAC learns the class up to error rate $\epsilon$ with $O(\frac{K^{4d}}{\epsilon^{2d}} \cdot \log^{5d} n)$ samples, even when an $\eta \leq O(\epsilon^d)$ fraction of them are corrupted by the nasty noise of Bshouty et al. (2002), possibly the strongest corruption model. Prior to this work, attribute-efficient robust algorithms were established only for the special case of sparse homogeneous halfspaces. Our key ingredients are: 1) a structural result that translates the attribute sparsity to a sparsity pattern of the Chow vector under the basis of Hermite polynomials, and 2) a novel attribute-efficient robust Chow vector estimation algorithm which uses exclusively a restricted Frobenius norm to either certify a good approximation or to validate a sparsity-induced degree-$2d$ polynomial as a filter to detect corrupted samples.
    Comment: ICML 202
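    As a point of reference for the Chow-vector ingredient mentioned above, the sketch below computes only the plain empirical degree-$d$ Chow coefficients $\hat{\mathbb{E}}[y \cdot He_\alpha(x)]$ under a normalized probabilists' Hermite basis. It illustrates the quantity the paper's estimator targets, not the attribute-efficient robust algorithm itself; the toy PTF and all function names are hypothetical.
```python
# Illustrative sketch: plain empirical estimate of the degree-d Chow vector
# under the (normalized, probabilists') Hermite basis. This is NOT the
# paper's robust attribute-efficient estimator, only the quantity it is
# designed to approximate under Gaussian marginals.
import itertools
from collections import Counter
from math import factorial, sqrt

import numpy as np
from numpy.polynomial.hermite_e import hermeval


def hermite_feature(x, alpha):
    """Product over coordinates i of He_{alpha_i}(x_i) / sqrt(alpha_i!)."""
    val = 1.0
    for i, a in alpha.items():
        coeffs = np.zeros(a + 1)
        coeffs[a] = 1.0
        val *= hermeval(x[i], coeffs) / sqrt(factorial(a))
    return val


def empirical_chow_vector(X, y, d):
    """Empirical Chow coefficients E_hat[y * He_alpha(x)] for all
    multi-indices alpha with 1 <= |alpha| <= d, keyed by alpha."""
    m, n = X.shape
    chow = {}
    for t in range(1, d + 1):
        for combo in itertools.combinations_with_replacement(range(n), t):
            alpha = Counter(combo)
            key = tuple(sorted(alpha.items()))
            chow[key] = np.mean(
                [y[j] * hermite_feature(X[j], alpha) for j in range(m)]
            )
    return chow


# Toy usage: labels from a 2-sparse degree-2 PTF sign(x_0 * x_1) on R^5.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))
y = np.sign(X[:, 0] * X[:, 1])
chow = empirical_chow_vector(X, y, d=2)
# The coefficients supported on coordinates {0, 1} dominate, reflecting the
# sparsity pattern that the paper's structural result exploits.
```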

    SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics

    We investigate the time complexity of SGD learning on fully-connected neural networks with isotropic data. We put forward a complexity measure -- the leap -- which measures how "hierarchical" target functions are. For $d$-dimensional uniform Boolean or isotropic Gaussian data, our main conjecture states that the time complexity to learn a function $f$ with low-dimensional support is $\tilde\Theta(d^{\max(\mathrm{Leap}(f),2)})$. We prove a version of this conjecture for a class of functions on Gaussian isotropic data and 2-layer neural networks, under additional technical assumptions on how SGD is run. We show that training sequentially learns the function support with a saddle-to-saddle dynamic. Our result departs from [Abbe et al. 2022] by going beyond leap 1 (merged-staircase functions), and by going beyond the mean-field and gradient flow approximations that prohibit the full complexity control obtained here. Finally, we note that this gives an SGD complexity for the full training trajectory that matches that of Correlational Statistical Query (CSQ) lower bounds.
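    As a rough illustration of the setting (not the paper's exact assumptions on how SGD is run), the sketch below trains a 2-layer network by online SGD on isotropic Gaussian inputs against a hypothetical target with low-dimensional support; the width, activation, and step size are arbitrary choices.
```python
# Illustrative sketch: online SGD on a 2-layer network with d-dimensional
# isotropic Gaussian inputs and a target depending on only 3 coordinates.
# Hyperparameters and the target x_0 + x_0*x_1*x_2 are hypothetical examples.
import numpy as np

rng = np.random.default_rng(1)
d, width, lr, steps = 30, 256, 0.05, 50_000


def target(x):
    return x[0] + x[0] * x[1] * x[2]  # depends on 3 of the d coordinates


W = rng.standard_normal((width, d)) / np.sqrt(d)   # first-layer weights
a = rng.standard_normal(width) / np.sqrt(width)    # second-layer weights

for t in range(steps):
    x = rng.standard_normal(d)          # fresh isotropic Gaussian sample
    h = np.tanh(W @ x)                  # hidden activations
    err = a @ h - target(x)             # squared-loss residual
    grad_a = err * h
    grad_W = np.outer(err * a * (1.0 - h ** 2), x)  # tanh'(z) = 1 - tanh(z)^2
    a -= lr * grad_a                    # one online SGD step on both layers
    W -= lr * grad_W

# Tracking the per-coordinate weight mass np.abs(W).mean(axis=0) over training
# typically shows the support coordinates being picked up one after another,
# in line with the saddle-to-saddle picture described in the abstract.
```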

    Provably learning a multi-head attention layer

    The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models. Given a sequence length $k$, attention matrices $\mathbf{\Theta}_1,\ldots,\mathbf{\Theta}_m\in\mathbb{R}^{d\times d}$, and projection matrices $\mathbf{W}_1,\ldots,\mathbf{W}_m\in\mathbb{R}^{d\times d}$, the corresponding multi-head attention layer $F: \mathbb{R}^{k\times d}\to \mathbb{R}^{k\times d}$ transforms length-$k$ sequences of $d$-dimensional tokens $\mathbf{X}\in\mathbb{R}^{k\times d}$ via $F(\mathbf{X}) \triangleq \sum^m_{i=1} \mathrm{softmax}(\mathbf{X}\mathbf{\Theta}_i\mathbf{X}^\top)\mathbf{X}\mathbf{W}_i$. In this work, we initiate the study of provably learning a multi-head attention layer from random examples and give the first nontrivial upper and lower bounds for this problem:
    - Provided $\{\mathbf{W}_i, \mathbf{\Theta}_i\}$ satisfy certain non-degeneracy conditions, we give a $(dk)^{O(m^3)}$-time algorithm that learns $F$ to small error given random labeled examples drawn uniformly from $\{\pm 1\}^{k\times d}$.
    - We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
    We focus on Boolean $\mathbf{X}$ to mimic the discrete nature of tokens in large language models, though our techniques naturally extend to standard continuous settings, e.g. Gaussian. Our algorithm, which is centered around using examples to sculpt a convex body containing the unknown parameters, is a significant departure from existing provable algorithms for learning feedforward networks, which predominantly exploit algebraic and rotation invariance properties of the Gaussian distribution. In contrast, our analysis is more flexible as it primarily relies on various upper and lower tail bounds for the input distribution and "slices" thereof.
    Comment: 105 pages, comments welcome
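    The layer $F$ defined above admits a direct transcription into code. The sketch below evaluates it with NumPy, assuming the softmax is applied row-wise to the $k \times k$ score matrix; it only computes the target function and is not the paper's learning algorithm.
```python
# Sketch: evaluate F(X) = sum_i softmax(X Theta_i X^T) X W_i, with the
# softmax assumed to act row-wise on the k x k score matrix.
import numpy as np


def softmax_rows(A):
    """Row-wise softmax of a k x k score matrix."""
    A = A - A.max(axis=1, keepdims=True)  # for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)


def multi_head_attention(X, Thetas, Ws):
    """X: (k, d) token sequence; Thetas, Ws: lists of m matrices of shape (d, d)."""
    out = np.zeros_like(X, dtype=float)
    for Theta, W in zip(Thetas, Ws):
        scores = X @ Theta @ X.T          # (k, k) attention scores
        out += softmax_rows(scores) @ X @ W
    return out


# Toy usage on Boolean inputs, mirroring the {+-1}^{k x d} examples in the paper.
rng = np.random.default_rng(0)
k, d, m = 6, 8, 2
X = rng.choice([-1.0, 1.0], size=(k, d))
Thetas = [rng.standard_normal((d, d)) / d for _ in range(m)]
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(m)]
Y = multi_head_attention(X, Thetas, Ws)   # (k, d) output sequence
```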