4 research outputs found
On the hardness of learning sparse parities
This work investigates the hardness of computing sparse solutions to systems
of linear equations over F_2. Consider the k-EvenSet problem: given a
homogeneous system of linear equations over F_2 on n variables, decide if there
exists a nonzero solution of Hamming weight at most k (i.e. a k-sparse
solution). While there is a simple O(n^{k/2})-time algorithm for it,
establishing fixed parameter intractability for k-EvenSet has been a notorious
open problem. Towards this goal, we show that unless k-Clique can be solved in
n^{o(k)} time, k-EvenSet has no poly(n)2^{o(sqrt{k})} time algorithm and no
polynomial time algorithm when k = (log n)^{2+eta} for any eta > 0.
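The simple algorithm mentioned above can be realized as a meet-in-the-middle search: split the support of a weight-at-most-k solution into two halves of weight at most ceil(k/2) each; since their column sums must cancel, both halves share the same syndrome, so matching syndromes in a hash table decides the problem in roughly O(n^{ceil(k/2)}) time. A minimal sketch (the dense 0/1 matrix encoding and helper names are illustrative, not from the paper):

```python
from itertools import combinations

def k_evenset(A, k):
    """Decide whether Ax = 0 over F_2 has a nonzero solution of
    Hamming weight at most k, via meet-in-the-middle.

    A is a list of rows (lists of 0/1). Each column is packed into
    an int so XOR of column masks adds columns mod 2.
    """
    n = len(A[0])
    col = [0] * n
    for i, row in enumerate(A):
        for j, bit in enumerate(row):
            if bit:
                col[j] |= 1 << i

    def syndrome(S):
        s = 0
        for j in S:
            s ^= col[j]
        return s

    # Store supports of size <= ceil(k/2). Two distinct supports per
    # syndrome suffice: any later query then matches a different one.
    half = k // 2
    seen = {}
    for size in range(k - half + 1):          # sizes 0 .. ceil(k/2)
        for S in combinations(range(n), size):
            bucket = seen.setdefault(syndrome(S), [])
            if len(bucket) < 2:
                bucket.append(frozenset(S))

    # Query supports of size <= floor(k/2). If S and a stored T share
    # a syndrome and differ, their symmetric difference is a nonzero
    # solution of weight <= |S| + |T| <= k.
    for size in range(half + 1):              # sizes 0 .. floor(k/2)
        for S in combinations(range(n), size):
            Sf = frozenset(S)
            for T in seen.get(syndrome(S), ()):
                if T != Sf:
                    return True
    return False
```

Storing at most two supports per syndrome keeps the table size at O(n^{ceil(k/2)}) while preserving completeness: if any two distinct sets collide on a syndrome, at least one stored entry differs from every query.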
Our work also shows that the non-homogeneous generalization of the problem --
which we call k-VectorSum -- is W[1]-hard on instances where the number of
equations is O(k log n), improving on previous reductions which produced
Omega(n) equations. We also show that for any constant eps > 0, given a system
of O(exp(O(k)) log n) linear equations, it is W[1]-hard to decide if there is a
k-sparse linear form satisfying all the equations or if every function on at
most k variables (a k-junta) satisfies at most a (1/2 + eps)-fraction of the
equations. In the setting of computational learning, this shows hardness of
approximate non-proper learning of k-parities. In a similar vein, we use the
hardness of k-EvenSet to show that, for any constant d, unless k-Clique can
be solved in n^{o(k)} time, there is no poly(m, n)2^{o(sqrt{k})} time algorithm
to decide whether a given set of m points in F_2^n satisfies: (i) there exists
a non-trivial k-sparse homogeneous linear form evaluating to 0 on all the
points, or (ii) every non-trivial degree-d polynomial P supported on at most k
variables evaluates to zero on approximately a Pr_{z ~ F_2^n}[P(z) = 0] fraction
of the points, i.e., P is fooled by the set of points.
The computational hardness of feature selection in strict-pure synthetic genetic datasets
A common task in knowledge discovery is finding a few features correlated with an
outcome in a sea of mostly irrelevant data. This task is particularly formidable in
genetic datasets containing thousands to millions of Single Nucleotide Polymorphisms
(SNPs) for each individual; the goal here is to find a small subset of SNPs correlated
with whether an individual is sick or healthy (labeled data). Although determining
a correlation between any given SNP (genotype) and a disease label (phenotype) is
relatively straightforward, detecting subsets of SNPs such that the correlation is only
apparent when the whole subset is considered seems to be much harder. In this thesis,
we study the computational hardness of this problem, in particular for a widely used
method of generating synthetic SNP datasets.
More specifically, we consider the feature selection problem in datasets generated
by "pure and strict" models, such as ones produced by the popular GAMETES software.
In these datasets, there is a high correlation between a predefined target set of
features (SNPs) and a label; however, any subset of the target set appears uncorrelated
with the outcome.
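The canonical example of such a model is a parity: take the label to be the XOR of the target SNPs, so the full target set determines the label exactly while every proper subset is statistically independent of it. A toy sketch (binary features for brevity; real SNP data is ternary, and GAMETES models also involve noise and penetrance tables):

```python
import random

def parity_dataset(n_features, target, m, seed=0):
    """Toy 'pure and strict' dataset: uniform binary features,
    label = XOR of the features indexed by `target`."""
    rng = random.Random(seed)
    X = [[rng.randrange(2) for _ in range(n_features)] for _ in range(m)]
    y = [sum(x[j] for j in target) % 2 for x in X]
    return X, y

def label_agreement(X, y, subset):
    """Fraction of samples where the XOR of `subset` equals the
    label; 1/2 means the subset looks uncorrelated."""
    hits = sum((sum(x[j] for j in subset) % 2) == lab
               for x, lab in zip(X, y))
    return hits / len(X)
```

On the full target set the agreement is 1; on any strict subset it concentrates around 1/2, which is exactly what defeats greedy, one-SNP-at-a-time feature selection.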
Our main result is a (linear-time, parameter-preserving) reduction from the well-known
Learning Parity with Noise (LPN) problem to feature selection in such pure
and strict datasets. This gives us a host of consequences for the complexity of feature
selection in this setting. First, not only is it NP-hard (even to approximate), it is
computationally hard on average under a standard cryptographic assumption on the
hardness of learning parity with noise; moreover, in general it is as hard for the
uniform distribution as for arbitrary distributions, and as hard for random noise as
for adversarial noise. For the worst case complexity, we get a tighter parameterized
lower bound: even in the non-noisy case, finding a parity of Hamming weight at most
k is W[1]-hard when the number of samples is relatively small (logarithmic in the
number of features).
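For concreteness, an LPN instance consists of samples (a, <a, s> + e mod 2) for a hidden secret s, where each noise bit e is flipped independently with some rate. Under the reduction, each sample plays the role of one labeled individual and the k-sparse secret marks the target SNP set; the generator below is only an illustrative sketch of the LPN side, not the reduction itself:

```python
import random

def lpn_samples(secret, m, noise_rate, seed=0):
    """Generate m LPN samples (a, <a, s> + e mod 2) for the 0/1
    secret vector s, flipping each label independently with
    probability `noise_rate`."""
    rng = random.Random(seed)
    n = len(secret)
    samples = []
    for _ in range(m):
        a = [rng.randrange(2) for _ in range(n)]
        e = 1 if rng.random() < noise_rate else 0
        b = (sum(ai & si for ai, si in zip(a, secret)) + e) % 2
        samples.append((a, b))
    return samples
```

With noise_rate = 0 this is exactly the non-noisy parity setting of the W[1]-hardness result; with constant noise it is the average-case problem underlying the cryptographic assumption.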
Finally, most relevant to the development of feature selection heuristics, by the
unconditional hardness of LPN in Kearns’ statistical query model, no heuristic that
only computes statistics about the samples, rather than considering the samples
themselves, can successfully perform feature selection in such pure and strict datasets.
This eliminates a large class of common approaches to feature selection.