Attribute-Efficient PAC Learning of Low-Degree Polynomial Threshold Functions with Nasty Noise
The concept class of low-degree polynomial threshold functions (PTFs) plays a
fundamental role in machine learning. In this paper, we study PAC learning of
$K$-sparse degree-$d$ PTFs on $\mathbb{R}^n$, where any such concept depends
only on $K$ out of the $n$ attributes of the input. Our main contribution is a new
algorithm that, under the Gaussian marginal distribution, PAC learns the class up to
error rate $\epsilon$ with an attribute-efficient sample complexity, scaling only
polylogarithmically with the ambient dimension $n$, even when a fraction of the
samples are corrupted by the nasty noise of
Bshouty et al. (2002), possibly the strongest corruption model. Prior to this
work, attribute-efficient robust algorithms had been established only for the
special case of sparse homogeneous halfspaces. Our key ingredients are: 1) a
structural result that translates the attribute sparsity to a sparsity pattern
of the Chow vector under the basis of Hermite polynomials, and 2) a novel
attribute-efficient robust Chow vector estimation algorithm which uses
exclusively a restricted Frobenius norm to either certify a good approximation
or to validate a sparsity-induced low-degree polynomial as a filter to detect
corrupted samples.
Comment: ICML 2023
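As a rough illustration of the first ingredient (a sketch only, not the paper's estimator): under the Gaussian marginal, the Chow vector of a degree-$d$ PTF collects its correlations with the orthonormal Hermite polynomials of degree at most $d$, and attribute sparsity of the concept makes this vector approximately sparse. The snippet below estimates these Hermite Chow coefficients empirically; the toy concept and the 0.05 reporting threshold are our own illustrative choices.

```python
import itertools
import math
import numpy as np

def hermite_he(x, k):
    """Probabilists' Hermite polynomial He_k(x) via the three-term recurrence."""
    if k == 0:
        return np.ones_like(x)
    h_prev, h = np.ones_like(x), x
    for j in range(1, k):
        h_prev, h = h, x * h - j * h_prev
    return h

def chow_vector(X, y, degree):
    """Empirical Chow coefficients E[y * h_alpha(x)] over all multi-indices
    |alpha| <= degree, where h_alpha is the orthonormal Hermite basis under N(0, I)."""
    n = X.shape[1]
    chow = {}
    for total in range(degree + 1):
        for alpha in itertools.combinations_with_replacement(range(n), total):
            counts = np.bincount(alpha, minlength=n) if total else np.zeros(n, dtype=int)
            basis, norm = np.ones(X.shape[0]), 1.0
            for i, k in enumerate(counts):
                if k:
                    basis *= hermite_he(X[:, i], k)
                    norm *= math.factorial(k)
            chow[tuple(int(c) for c in counts)] = float(np.mean(y * basis) / math.sqrt(norm))
    return chow

# Toy demo: a 2-sparse degree-2 PTF in 6 dimensions.  Coefficients touching the
# four irrelevant attributes should be close to zero, reflecting the sparsity pattern.
rng = np.random.default_rng(0)
X = rng.standard_normal((20000, 6))
y = np.sign(X[:, 0] * X[:, 1] - 0.5)
chow = chow_vector(X, y, degree=2)
print({a: round(c, 3) for a, c in chow.items() if abs(c) > 0.05})
```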
SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics
We investigate the time complexity of SGD learning on fully-connected neural
networks with isotropic data. We put forward a complexity measure -- the leap
-- which measures how "hierarchical" target functions are. For $d$-dimensional
uniform Boolean or isotropic Gaussian data, our main conjecture states that the
time complexity to learn a function with low-dimensional support is
$\tilde{\Theta}(d^{\max(\mathrm{Leap}(f),\,2)})$. We prove a version of this
conjecture for a class of functions on Gaussian isotropic data and 2-layer
neural networks, under additional technical assumptions on how SGD is run. We
show that the training sequentially learns the function support with a
saddle-to-saddle dynamic. Our result departs from [Abbe et al. 2022] by going
beyond leap 1 (merged-staircase functions), and by going beyond the mean-field
and gradient flow approximations that prohibit the full complexity control
obtained here. Finally, we note that this gives an SGD complexity for the full
training trajectory that matches that of Correlational Statistical Query (CSQ)
lower bounds.
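A concrete way to see the leap on sparse polynomial targets (our reading of the definition, sketched for illustration only): writing the target as a sum of monomials with given coordinate supports, the leap is the smallest, over orderings of those supports, of the largest number of new coordinates any single monomial introduces; leap 1 corresponds to the merged-staircase functions of Abbe et al. (2022) mentioned above.

```python
from itertools import permutations

def leap(supports):
    """Brute-force leap of a target whose nonzero monomials have the given
    coordinate supports: minimize, over orderings of the supports, the maximum
    number of *new* coordinates introduced at any step (exhaustive search,
    so only suitable for a handful of monomials)."""
    supports = [frozenset(s) for s in supports]
    best = float("inf")
    for order in permutations(supports):
        seen, worst = set(), 0
        for s in order:
            worst = max(worst, len(s - seen))
            seen |= s
        best = min(best, worst)
    return best

# The staircase z1 + z1*z2 + z1*z2*z3 adds one coordinate per monomial (leap 1),
# whereas the isolated parity z1*z2*z3 must be found in one jump (leap 3).
print(leap([{1}, {1, 2}, {1, 2, 3}]))  # 1
print(leap([{1, 2, 3}]))               # 3
```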
Provably learning a multi-head attention layer
The multi-head attention layer is one of the key components of the
transformer architecture that sets it apart from traditional feed-forward
models. Given a sequence length $k$, attention matrices
$\Theta_1, \ldots, \Theta_m \in \mathbb{R}^{d \times d}$, and
projection matrices $W_1, \ldots, W_m \in \mathbb{R}^{d \times d}$, the corresponding
multi-head attention layer $F: \mathbb{R}^{k \times d} \to \mathbb{R}^{k \times d}$
transforms length-$k$ sequences of $d$-dimensional tokens $X \in \mathbb{R}^{k \times d}$
via $F(X) = \sum_{i=1}^{m} \mathrm{softmax}(X \Theta_i X^{\top})\, X W_i$.
In this work, we initiate the study of provably learning a multi-head attention
layer from random examples and give the first nontrivial upper and lower bounds
for this problem:
- Provided $\{\Theta_i, W_i\}_{i=1}^{m}$ satisfy certain
non-degeneracy conditions, we give a $(dk)^{O(m^3)}$-time algorithm that learns
$F$ to small error given random labeled examples drawn uniformly from $\{\pm 1\}^{k \times d}$.
- We prove computational lower bounds showing that in the worst case,
exponential dependence on the number of heads $m$ is unavoidable.
We focus on Boolean $X$ to mimic the discrete nature of tokens in
large language models, though our techniques naturally extend to standard
continuous settings, e.g. Gaussian. Our algorithm, which is centered around
using examples to sculpt a convex body containing the unknown parameters, is a
significant departure from existing provable algorithms for learning
feedforward networks, which predominantly exploit algebraic and rotation
invariance properties of the Gaussian distribution. In contrast, our analysis
is more flexible as it primarily relies on various upper and lower tail bounds
for the input distribution and "slices" thereof.
Comment: 105 pages, comments welcome
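For concreteness, a minimal NumPy sketch of the attention map defined above, with the softmax taken row-wise (the shapes and the row-wise convention are our reading of the abstract, not code from the paper):

```python
import numpy as np

def row_softmax(A):
    """Row-wise softmax, shifted for numerical stability."""
    A = A - A.max(axis=-1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_attention(X, thetas, Ws):
    """F(X) = sum_i softmax(X @ Theta_i @ X.T) @ X @ W_i for a length-k
    sequence X of d-dimensional tokens (X has shape (k, d))."""
    out = np.zeros_like(X, dtype=float)
    for Theta, W in zip(thetas, Ws):
        attn = row_softmax(X @ Theta @ X.T)  # (k, k) attention pattern of this head
        out += attn @ X @ W                  # (k, d) contribution of this head
    return out

# Toy example with Boolean tokens, mimicking the {-1, +1}^{k x d} input model.
rng = np.random.default_rng(1)
k, d, m = 4, 3, 2
X = rng.choice([-1.0, 1.0], size=(k, d))
thetas = [rng.standard_normal((d, d)) for _ in range(m)]
Ws = [rng.standard_normal((d, d)) for _ in range(m)]
print(multi_head_attention(X, thetas, Ws).shape)  # (4, 3)
```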
Topics on Machine Learning under Imperfect Supervision
This dissertation comprises several studies addressing supervised learning problems where the supervision is imperfect.
Firstly, we investigate margin conditions in active learning. Active learning is characterized by a special mechanism in which the learner can sample freely over the feature space and make the most of a limited labeling budget by querying the most informative labels. Our primary focus is to identify the critical conditions under which certain active learning algorithms can outperform the optimal passive learning minimax rate. Within a non-parametric multi-class classification framework, our results reveal that the uniqueness of the Bayes labels across the feature space is the pivotal determinant of the superiority of active learning over passive learning.
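As a toy illustration of this querying mechanism (not the dissertation's algorithm): a pool-based active learner repeatedly fits a model on the labels gathered so far and spends the remaining budget on the points it is least certain about. The logistic model and the distance-to-1/2 uncertainty score below are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(X_pool, y_pool, budget, seed_size=10, rng=None):
    """Toy pool-based active learner: label a small random seed set, then
    repeatedly query the pool point whose predicted class probability is
    closest to 1/2, i.e. the currently most informative label."""
    rng = rng or np.random.default_rng(0)
    labeled = list(rng.choice(len(X_pool), size=seed_size, replace=False))
    while len(set(y_pool[labeled])) < 2:          # seed set must contain both classes
        labeled.append(int(rng.integers(len(X_pool))))
    while len(labeled) < budget:
        clf = LogisticRegression().fit(X_pool[labeled], y_pool[labeled])
        margins = np.abs(clf.predict_proba(X_pool)[:, 1] - 0.5)
        margins[labeled] = np.inf                 # never re-query a labeled point
        labeled.append(int(np.argmin(margins)))
    return LogisticRegression().fit(X_pool[labeled], y_pool[labeled])

# Synthetic pool whose Bayes label is determined by the first coordinate.
rng = np.random.default_rng(0)
X_pool = rng.standard_normal((2000, 5))
y_pool = (X_pool[:, 0] > 0).astype(int)
model = uncertainty_sampling(X_pool, y_pool, budget=60, rng=rng)
print(model.score(X_pool, y_pool))
```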
Secondly, we study estimation of the central mean subspace (CMS) and its application in transfer learning. We show that a fast parametric convergence rate is achievable by estimating the expected smoothed gradient outer product, for a general class of covariate distributions that includes the Gaussian and heavier-tailed distributions. When the link function is a polynomial of degree at most r and the covariates follow the standard Gaussian distribution, we show that the prefactor depends on the ambient dimension d as d^r. Furthermore, we show that under a transfer learning setting, an oracle rate of prediction error, as if the CMS were known, is achievable when the source training data is abundant.
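A crude sketch of the gradient-outer-product idea behind this CMS estimation (illustrative only; the smoothing scheme, bandwidth h, and subspace dimension are our own assumptions, not the dissertation's procedure): estimate the gradient of the regression function at each sample by local linear smoothing, average the outer products, and read the central mean subspace off the top eigenvectors.

```python
import numpy as np

def gradient_outer_product_cms(X, y, h=0.75, subspace_dim=1):
    """Estimate the central mean subspace from the averaged outer product of
    locally estimated gradients of E[Y | X].  At each sample point, fit a
    Gaussian-kernel-weighted linear model (bandwidth h) and use its slope as a
    gradient estimate; the top eigenvectors of the averaged outer product span
    the estimated CMS."""
    n, d = X.shape
    M = np.zeros((d, d))
    for i in range(n):
        diff = X - X[i]                               # local coordinates around X[i]
        w = np.exp(-np.sum(diff**2, axis=1) / (2 * h**2))
        Z = np.hstack([np.ones((n, 1)), diff])        # intercept + slope design
        WZ = Z * w[:, None]
        beta = np.linalg.lstsq(WZ.T @ Z, WZ.T @ y, rcond=None)[0]
        grad = beta[1:]                               # local slope ~ gradient at X[i]
        M += np.outer(grad, grad) / n
    _, eigvecs = np.linalg.eigh(M)
    return eigvecs[:, -subspace_dim:]                 # columns spanning the estimated CMS

# Single-index demo: Y depends on X only through the direction (1, 1, 0)/sqrt(2).
rng = np.random.default_rng(0)
X = rng.standard_normal((400, 3))
beta_true = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
y = (X @ beta_true) ** 2 + 0.1 * rng.standard_normal(400)
print(gradient_outer_product_cms(X, y).ravel())       # roughly +/- beta_true
```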
Finally, we present an application that uses weak (noisy) labels to address an Individual Tree Crown (ITC) segmentation challenge. Here, the objective is to delineate individual tree crowns within a 3D LiDAR scan of tropical forests, with only 2D noisy manual delineations of crowns on RGB images available as a source of weak supervision. We propose a refinement algorithm designed to enhance the performance of existing unsupervised learning methods for the ITC segmentation problem.