4 research outputs found

    On the hardness of learning sparse parities

    This work investigates the hardness of computing sparse solutions to systems of linear equations over F_2. Consider the k-EvenSet problem: given a homogeneous system of linear equations over F_2 on n variables, decide if there exists a nonzero solution of Hamming weight at most k (i.e., a k-sparse solution). While there is a simple O(n^{k/2})-time algorithm for it, establishing fixed-parameter intractability for k-EvenSet has been a notorious open problem. Towards this goal, we show that unless k-Clique can be solved in n^{o(k)} time, k-EvenSet has no poly(n)2^{o(sqrt{k})}-time algorithm and no polynomial-time algorithm when k = (log n)^{2+eta} for any eta > 0. Our work also shows that the non-homogeneous generalization of the problem, which we call k-VectorSum, is W[1]-hard on instances where the number of equations is O(k log n), improving on previous reductions which produced Omega(n) equations. We also show that for any constant eps > 0, given a system of O(exp(O(k)) log n) linear equations, it is W[1]-hard to decide if there is a k-sparse linear form satisfying all the equations or if every function on at most k variables (a k-junta) satisfies at most a (1/2 + eps)-fraction of the equations. In the setting of computational learning, this shows hardness of approximate non-proper learning of k-parities. In a similar vein, we use the hardness of k-EvenSet to show that for any constant d, unless k-Clique can be solved in n^{o(k)} time, there is no poly(m, n)2^{o(sqrt{k})}-time algorithm to decide whether a given set of m points in F_2^n satisfies: (i) there exists a non-trivial k-sparse homogeneous linear form evaluating to 0 on all the points, or (ii) any non-trivial degree-d polynomial P supported on at most k variables evaluates to zero on approximately a Pr_{F_2^n}[P(z) = 0] fraction of the points, i.e., P is fooled by the set of points.
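
To make the k-EvenSet definition above concrete, here is a minimal Python sketch of a plain brute-force search. It is not the O(n^{k/2}) algorithm mentioned in the abstract, and it is not taken from the paper. The function name `sparse_solution` and the bitmask encoding of matrix columns are assumptions made for illustration. A nonzero solution of Hamming weight at most k corresponds to a nonempty set of at most k columns whose XOR is zero.

```python
from functools import reduce
from itertools import combinations
from operator import xor


def sparse_solution(columns, k):
    """Return indices of a nonempty set of at most k columns that XOR to zero.

    columns[j] is column j of the system's matrix packed into an int
    (bit i holds the entry in row i). Returns None if no such set exists.
    This is exhaustive search over supports, roughly O(n^k) time.
    """
    n = len(columns)
    for weight in range(1, k + 1):
        for support in combinations(range(n), weight):
            # XOR of the chosen columns equals A·x over F_2 for the
            # indicator vector x of `support`.
            if reduce(xor, (columns[j] for j in support), 0) == 0:
                return support
    return None


# Hypothetical 3-equation, 4-variable system (columns as bitmasks).
cols = [0b011, 0b101, 0b110, 0b111]
print(sparse_solution(cols, 2))  # no solution of weight <= 2
print(sparse_solution(cols, 3))  # columns 0, 1, 2 sum to zero over F_2
```

The search tries supports in order of increasing weight, so the first hit is also a minimum-weight solution.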

    The computational hardness of feature selection in strict-pure synthetic genetic datasets

    A common task in knowledge discovery is finding a few features correlated with an outcome in a sea of mostly irrelevant data. This task is particularly formidable in genetic datasets containing thousands to millions of Single Nucleotide Polymorphisms (SNPs) for each individual; the goal here is to find a small subset of SNPs correlated with whether an individual is sick or healthy (labeled data). Although determining a correlation between any given SNP (genotype) and a disease label (phenotype) is relatively straightforward, detecting subsets of SNPs such that the correlation is only apparent when the whole subset is considered seems to be much harder. In this thesis, we study the computational hardness of this problem, in particular for a widely used method of generating synthetic SNP datasets. More specifically, we consider the feature selection problem in datasets generated by “pure and strict” models, such as ones produced by the popular GAMETES software. In these datasets, there is a high correlation between a predefined target set of features (SNPs) and a label; however, any subset of the target set appears uncorrelated with the outcome. Our main result is a (linear-time, parameter-preserving) reduction from the well-known Learning Parity with Noise (LPN) problem to feature selection in such pure and strict datasets. This gives us a host of consequences for the complexity of feature selection in this setting. First, not only is it NP-hard (even to approximate), it is also computationally hard on average under a standard cryptographic assumption on the hardness of learning parity with noise; moreover, in general it is as hard for the uniform distribution as for arbitrary distributions, and as hard for random noise as for adversarial noise.
For the worst-case complexity, we get a tighter parameterized lower bound: even in the non-noisy case, finding a parity of Hamming weight at most k is W[1]-hard when the number of samples is relatively small (logarithmic in the number of features). Finally, and most relevant to the development of feature selection heuristics, by the unconditional hardness of LPN in Kearns’ statistical query model, no heuristic that only computes statistics about the samples, rather than considering the samples themselves, can successfully perform feature selection in such pure and strict datasets. This eliminates a large class of common approaches to feature selection.
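
The property described above, where the full target set determines the label while each smaller subset looks uncorrelated, can be illustrated with a noiseless parity over all binary inputs. The sketch below is a toy under stated assumptions: it is not GAMETES output and not the thesis's reduction, and the helper `agreement` and the chosen target set `(0, 2)` are hypothetical.

```python
from itertools import product


def agreement(points, labels, subset):
    """Fraction of points whose parity over `subset` equals the label."""
    hits = sum(
        (sum(x[j] for j in subset) % 2) == y
        for x, y in zip(points, labels)
    )
    return hits / len(points)


n = 4            # number of binary features (assumed for the toy example)
target = (0, 2)  # hypothetical target set of features

# Enumerate every point of {0,1}^n, i.e. the uniform distribution exactly.
points = list(product((0, 1), repeat=n))
labels = [sum(x[j] for j in target) % 2 for x in points]

print(agreement(points, labels, (0,)))    # 0.5: feature 0 alone is uninformative
print(agreement(points, labels, (2,)))    # 0.5: feature 2 alone is uninformative
print(agreement(points, labels, target))  # 1.0: the full target set predicts the label
```

Each single feature agrees with the label on exactly half of the points, which is why per-feature correlation statistics give no signal here.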