
    Attribute-Efficient PAC Learning of Low-Degree Polynomial Threshold Functions with Nasty Noise

    The concept class of low-degree polynomial threshold functions (PTFs) plays a fundamental role in machine learning. In this paper, we study PAC learning of $K$-sparse degree-$d$ PTFs on $\mathbb{R}^n$, where any such concept depends only on $K$ out of $n$ attributes of the input. Our main contribution is a new algorithm that runs in time $(nd/\epsilon)^{O(d)}$ and, under the Gaussian marginal distribution, PAC learns the class up to error rate $\epsilon$ with $O(\frac{K^{4d}}{\epsilon^{2d}} \cdot \log^{5d} n)$ samples, even when an $\eta \leq O(\epsilon^d)$ fraction of them are corrupted by the nasty noise of Bshouty et al. (2002), possibly the strongest corruption model. Prior to this work, attribute-efficient robust algorithms were established only for the special case of sparse homogeneous halfspaces. Our key ingredients are: 1) a structural result that translates the attribute sparsity to a sparsity pattern of the Chow vector under the basis of Hermite polynomials, and 2) a novel attribute-efficient robust Chow vector estimation algorithm which uses exclusively a restricted Frobenius norm to either certify a good approximation or to validate a sparsity-induced degree-$2d$ polynomial as a filter to detect corrupted samples.
    Comment: ICML 202
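    As a point of reference for the Chow-vector ingredient mentioned above, the sketch below computes only the plain empirical degree-$d$ Chow coefficients $\hat{\mathbb{E}}[y \cdot He_\alpha(x)]$ under a normalized probabilists' Hermite basis. It illustrates the quantity the paper's estimator targets, not the attribute-efficient robust algorithm itself; the toy PTF and all function names are hypothetical.
```python
# Illustrative sketch: plain empirical estimate of the degree-d Chow vector
# under the (normalized, probabilists') Hermite basis. This is NOT the
# paper's robust attribute-efficient estimator, only the quantity it is
# designed to approximate under Gaussian marginals.
import itertools
from collections import Counter
from math import factorial, sqrt

import numpy as np
from numpy.polynomial.hermite_e import hermeval


def hermite_feature(x, alpha):
    """Product over coordinates i of He_{alpha_i}(x_i) / sqrt(alpha_i!)."""
    val = 1.0
    for i, a in alpha.items():
        coeffs = np.zeros(a + 1)
        coeffs[a] = 1.0
        val *= hermeval(x[i], coeffs) / sqrt(factorial(a))
    return val


def empirical_chow_vector(X, y, d):
    """Empirical Chow coefficients E_hat[y * He_alpha(x)] for all
    multi-indices alpha with 1 <= |alpha| <= d, keyed by alpha."""
    m, n = X.shape
    chow = {}
    for t in range(1, d + 1):
        for combo in itertools.combinations_with_replacement(range(n), t):
            alpha = Counter(combo)
            key = tuple(sorted(alpha.items()))
            chow[key] = np.mean(
                [y[j] * hermite_feature(X[j], alpha) for j in range(m)]
            )
    return chow


# Toy usage: labels from a 2-sparse degree-2 PTF sign(x_0 * x_1) on R^5.
rng = np.random.default_rng(0)
X = rng.standard_normal((2000, 5))
y = np.sign(X[:, 0] * X[:, 1])
chow = empirical_chow_vector(X, y, d=2)
# The coefficients supported on coordinates {0, 1} dominate, reflecting the
# sparsity pattern that the paper's structural result exploits.
```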

    SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics

    We investigate the time complexity of SGD learning on fully-connected neural networks with isotropic data. We put forward a complexity measure -- the leap -- which measures how "hierarchical" target functions are. For $d$-dimensional uniform Boolean or isotropic Gaussian data, our main conjecture states that the time complexity to learn a function $f$ with low-dimensional support is $\tilde\Theta(d^{\max(\mathrm{Leap}(f),2)})$. We prove a version of this conjecture for a class of functions on Gaussian isotropic data and 2-layer neural networks, under additional technical assumptions on how SGD is run. We show that training sequentially learns the function support with a saddle-to-saddle dynamic. Our result departs from [Abbe et al. 2022] by going beyond leap 1 (merged-staircase functions), and by going beyond the mean-field and gradient flow approximations that prohibit the full complexity control obtained here. Finally, we note that this gives an SGD complexity for the full training trajectory that matches that of Correlational Statistical Query (CSQ) lower bounds.
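    As a rough illustration of the setting (not the paper's exact assumptions on how SGD is run), the sketch below trains a 2-layer network by online SGD on isotropic Gaussian inputs against a hypothetical target with low-dimensional support; the width, activation, and step size are arbitrary choices.
```python
# Illustrative sketch: online SGD on a 2-layer network with d-dimensional
# isotropic Gaussian inputs and a target depending on only 3 coordinates.
# Hyperparameters and the target x_0 + x_0*x_1*x_2 are hypothetical examples.
import numpy as np

rng = np.random.default_rng(1)
d, width, lr, steps = 30, 256, 0.05, 50_000


def target(x):
    return x[0] + x[0] * x[1] * x[2]  # depends on 3 of the d coordinates


W = rng.standard_normal((width, d)) / np.sqrt(d)   # first-layer weights
a = rng.standard_normal(width) / np.sqrt(width)    # second-layer weights

for t in range(steps):
    x = rng.standard_normal(d)          # fresh isotropic Gaussian sample
    h = np.tanh(W @ x)                  # hidden activations
    err = a @ h - target(x)             # squared-loss residual
    grad_a = err * h
    grad_W = np.outer(err * a * (1.0 - h ** 2), x)  # tanh'(z) = 1 - tanh(z)^2
    a -= lr * grad_a                    # one online SGD step on both layers
    W -= lr * grad_W

# Tracking the per-coordinate weight mass np.abs(W).mean(axis=0) over training
# typically shows the support coordinates being picked up one after another,
# in line with the saddle-to-saddle picture described in the abstract.
```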

    Provably learning a multi-head attention layer

    The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models. Given a sequence length $k$, attention matrices $\mathbf{\Theta}_1,\ldots,\mathbf{\Theta}_m\in\mathbb{R}^{d\times d}$, and projection matrices $\mathbf{W}_1,\ldots,\mathbf{W}_m\in\mathbb{R}^{d\times d}$, the corresponding multi-head attention layer $F: \mathbb{R}^{k\times d}\to \mathbb{R}^{k\times d}$ transforms length-$k$ sequences of $d$-dimensional tokens $\mathbf{X}\in\mathbb{R}^{k\times d}$ via $F(\mathbf{X}) \triangleq \sum^m_{i=1} \mathrm{softmax}(\mathbf{X}\mathbf{\Theta}_i\mathbf{X}^\top)\mathbf{X}\mathbf{W}_i$. In this work, we initiate the study of provably learning a multi-head attention layer from random examples and give the first nontrivial upper and lower bounds for this problem:
    - Provided $\{\mathbf{W}_i, \mathbf{\Theta}_i\}$ satisfy certain non-degeneracy conditions, we give a $(dk)^{O(m^3)}$-time algorithm that learns $F$ to small error given random labeled examples drawn uniformly from $\{\pm 1\}^{k\times d}$.
    - We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
    We focus on Boolean $\mathbf{X}$ to mimic the discrete nature of tokens in large language models, though our techniques naturally extend to standard continuous settings, e.g. Gaussian. Our algorithm, which is centered around using examples to sculpt a convex body containing the unknown parameters, is a significant departure from existing provable algorithms for learning feedforward networks, which predominantly exploit algebraic and rotation invariance properties of the Gaussian distribution. In contrast, our analysis is more flexible as it primarily relies on various upper and lower tail bounds for the input distribution and "slices" thereof.
    Comment: 105 pages, comments welcome
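    The layer $F$ defined above admits a direct transcription into code. The sketch below evaluates it with NumPy, assuming the softmax is applied row-wise to the $k \times k$ score matrix; it only computes the target function and is not the paper's learning algorithm.
```python
# Sketch: evaluate F(X) = sum_i softmax(X Theta_i X^T) X W_i, with the
# softmax assumed to act row-wise on the k x k score matrix.
import numpy as np


def softmax_rows(A):
    """Row-wise softmax of a k x k score matrix."""
    A = A - A.max(axis=1, keepdims=True)  # for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)


def multi_head_attention(X, Thetas, Ws):
    """X: (k, d) token sequence; Thetas, Ws: lists of m matrices of shape (d, d)."""
    out = np.zeros_like(X, dtype=float)
    for Theta, W in zip(Thetas, Ws):
        scores = X @ Theta @ X.T          # (k, k) attention scores
        out += softmax_rows(scores) @ X @ W
    return out


# Toy usage on Boolean inputs, mirroring the {+-1}^{k x d} examples in the paper.
rng = np.random.default_rng(0)
k, d, m = 6, 8, 2
X = rng.choice([-1.0, 1.0], size=(k, d))
Thetas = [rng.standard_normal((d, d)) / d for _ in range(m)]
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(m)]
Y = multi_head_attention(X, Thetas, Ws)   # (k, d) output sequence
```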