Knowledge Refinement via Rule Selection
In several different applications, including data transformation and entity
resolution, rules are used to capture aspects of knowledge about the
application at hand. Often, a large set of such rules is generated
automatically or semi-automatically, and the challenge is to refine the
encapsulated knowledge by selecting a subset of rules based on the expected
operational behavior of the rules on available data. In this paper, we carry
out a systematic complexity-theoretic investigation of the following rule
selection problem: given a set of rules specified by Horn formulas, and a pair
of an input database and an output database, find a subset of the rules that
minimizes the total error, that is, the number of false positive and false
negative errors arising from the selected rules. We first establish
computational hardness results for the decision problems underlying this
minimization problem, as well as upper and lower bounds for its
approximability. We then investigate a bi-objective optimization version of the
rule selection problem in which both the total error and the size of the
selected rules are taken into account. We show that testing for membership in
the Pareto front of this bi-objective optimization problem is DP-complete.
Finally, we show that a similar DP-completeness result holds for a bi-level
optimization version of the rule selection problem, where one minimizes first
the total error and then the size
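
As a rough illustration of the objective described in this abstract, the sketch below computes the total error (false positives plus false negatives) of a rule subset and finds a best subset by exhaustive search. It is a simplification and not the paper's method: each rule is modeled only as the set of facts it would derive from the input database, rule interaction is ignored, and the names `total_error`, `best_subset`, and `derived_by_rule` are hypothetical. Subsets are tried in order of increasing size, so ties in error are broken in favor of fewer rules, which mirrors the bi-level variant.

```python
from itertools import combinations


def total_error(selected, derived_by_rule, target):
    """Count false positives plus false negatives for a set of selected rules.

    Assumption: each rule is represented by the set of facts it derives,
    and rules do not interact with one another.
    """
    derived = set()
    for rule in selected:
        derived |= derived_by_rule[rule]
    false_positives = len(derived - target)   # derived but not in the output database
    false_negatives = len(target - derived)   # in the output database but not derived
    return false_positives + false_negatives


def best_subset(derived_by_rule, target):
    """Exhaustive search: minimize total error first, then subset size."""
    rules = sorted(derived_by_rule)
    best = None
    for size in range(len(rules) + 1):
        for subset in combinations(rules, size):
            err = total_error(subset, derived_by_rule, target)
            # Strict comparison keeps the smallest subset among equal errors.
            if best is None or err < best[0]:
                best = (err, subset)
    return best


# Hypothetical toy instance.
derived_by_rule = {
    "r1": {"a", "b"},
    "r2": {"c", "x"},
    "r3": {"b", "c"},
}
target = {"a", "b", "c"}
print(best_subset(derived_by_rule, target))
```

The search is exponential in the number of rules, which is consistent with the hardness results the abstract reports; it is only meant for tiny instances.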
Secluded Connectivity Problems
Consider a setting where possibly sensitive information sent over a path in a
network is visible to every neighbor of the path, i.e., every neighbor of
some node on the path, thus including the nodes on the path itself. The
exposure of a path can be measured as the number of nodes adjacent to it,
denoted by . A path is said to be secluded if its exposure is small. A
similar measure can be applied to other connected subgraphs, such as Steiner
trees connecting a given set of terminals. Such subgraphs may be relevant due
to considerations of privacy, security or revenue maximization. This paper
considers problems related to minimum exposure connectivity structures such as
paths and Steiner trees. It is shown that on unweighted undirected -node
graphs, the problem of finding the minimum exposure path connecting a given
pair of vertices is strongly inapproximable, i.e., hard to approximate within a
factor of for any (under an
appropriate complexity assumption), but is approximable with ratio
, where is the maximum degree in the graph. One of
our main results concerns the class of bounded-degree graphs, which is shown to
exhibit the following interesting dichotomy. On the one hand, the minimum
exposure path problem is NP-hard on node-weighted or directed bounded-degree
graphs (even when the maximum degree is 4). On the other hand, we present a
polynomial algorithm (based on a nontrivial dynamic program) for the problem on
unweighted undirected bounded-degree graphs. Likewise, the problem is shown to
be polynomial also for the class of (weighted or unweighted) bounded-treewidth
graphs
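
The exposure measure defined in this abstract can be sketched as follows, under the reading that the exposure of a path is the size of its closed neighborhood (the path nodes plus every node adjacent to one of them). The adjacency-dictionary representation and the function names `exposure` and `min_exposure_path` are assumptions made for illustration. The search enumerates all simple paths, so it is exponential and is not the dynamic program mentioned in the abstract.

```python
def exposure(graph, path):
    """Number of nodes that are on the path or adjacent to a node on it."""
    seen = set(path)
    for node in path:
        seen.update(graph[node])
    return len(seen)


def min_exposure_path(graph, source, target):
    """Brute-force search over simple paths for one of minimum exposure."""
    best = None

    def dfs(path, visited):
        nonlocal best
        node = path[-1]
        if node == target:
            value = exposure(graph, path)
            if best is None or value < best[0]:
                best = (value, list(path))
            return
        for neighbor in sorted(graph[node]):
            if neighbor not in visited:
                visited.add(neighbor)
                path.append(neighbor)
                dfs(path, visited)
                path.pop()
                visited.remove(neighbor)

    dfs([source], {source})
    return best


# Hypothetical toy graph: two s-t paths of equal length, one through a hub "h".
graph = {
    "s": {"a", "h"},
    "a": {"s", "t"},
    "h": {"s", "t", "x1", "x2", "x3"},
    "t": {"a", "h"},
    "x1": {"h"},
    "x2": {"h"},
    "x3": {"h"},
}
print(min_exposure_path(graph, "s", "t"))
```

Both s-t paths in the toy graph have two edges, yet the one through the hub exposes the message to three extra nodes, which shows that minimizing exposure differs from minimizing length.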
An Improved Algorithm for Learning to Perform Exception-Tolerant Abduction
Abstract
Inference from an observed or hypothesized condition to a plausible cause or explanation for this condition is known as abduction. For many tasks, the acquisition of the necessary knowledge by machine learning has been widely found to be highly effective. However, the semantics of learned knowledge are weaker than the usual classical semantics, and this necessitates new formulations of many tasks. We focus on a recently introduced formulation of the abductive inference task that is thus adapted to the semantics of machine learning. A key problem is that we cannot expect that our causes or explanations will be perfect, and they must tolerate some error due to the world being more complicated than our formalization allows. This is a version of the qualification problem, and in machine learning, this is known as agnostic learning. In the work by Juba that introduced the task of learning to make abductive inferences, an algorithm is given for producing k-DNF explanations that tolerates such exceptions: if the best possible k-DNF explanation fails to justify the condition with probability , then the algorithm is promised to find a k-DNF explanation that fails to justify the condition with probability at most , where n is the number of propositional attributes used to describe the domain. Here, we present an improved algorithm for this task. When the best k-DNF fails with probability , our algorithm finds a k-DNF that fails with probability at most (i.e., suppressing logarithmic factors in n and ). We examine the empirical advantage of this new algorithm over the previous algorithm in two test domains, one of explaining conditions generated by a “noisy” k-DNF rule, and another of explaining conditions that are actually generated by a linear threshold rule.
We also apply the algorithm to a real-world application, anomaly explanation. In this work, as opposed to anomaly detection, we are interested in finding possible descriptions of what may be causing anomalies in visual data. We use PCA to perform anomaly detection. The task is to attach semantics drawn from the image metadata to a portion of the anomalous images from some source such as a webcam. Such a partial description of the anomalous images in terms of the metadata is useful both because it may help to explain what causes the identified anomalies, and also because it may help to identify the truly unusual images that defy such simple categorization. We find that our approximation algorithm is a good match for this task. Our algorithm successfully finds plausible explanations of the anomalies. It yields a low error rate when the data set is large (>80,000 inputs) and also works well when the data set is not very large (<50,000 examples). It finds small 2-DNFs that are easy to interpret and capture a non-negligible
Generalized Matrix Factorizations as a Unifying Framework for Pattern Set Mining: Complexity Beyond Blocks
Abstract. Matrix factorizations are a popular tool to mine regularities from data. There are many ways to interpret the factorizations, but one particularly suited for data mining utilizes the fact that a matrix product can be interpreted as a sum of rank-1 matrices. Then the factorization of a matrix becomes the task of finding a small number of rank-1 matrices, the sum of which is a good representation of the original matrix. Seen this way, it becomes obvious that many problems in data mining can be expressed as matrix factorizations with correct definitions of what a rank-1 matrix and a sum of rank-1 matrices mean. This paper develops a unified theory, based on generalized outer product operators, that encompasses many pattern set mining tasks. The focus is on the computational aspects of the theory and on studying the computational complexity and approximability of many problems related to generalized matrix factorizations. The results immediately apply to a large number of data mining problems, and hopefully allow generalizing future results and algorithms as well.
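
The observation that a matrix product is a sum of rank-1 matrices can be sketched as below. The function `combine` and its `add`, `mul`, and `zero` parameters are hypothetical names, not the paper's operators; they only illustrate how swapping the meaning of "sum" and "outer product" changes the result. Ordinary arithmetic gives the standard product, while using logical OR and AND gives a Boolean product, chosen here as one familiar example of a non-standard definition.

```python
def outer(column, row, mul):
    """Rank-1 matrix formed from one column of B and one row of C."""
    return [[mul(a, b) for b in row] for a in column]


def combine(B, C, add, mul, zero):
    """Build the product of B and C as a sum of k rank-1 matrices."""
    n, k, m = len(B), len(C), len(C[0])
    result = [[zero] * m for _ in range(n)]
    for j in range(k):
        column = [B[i][j] for i in range(n)]
        block = outer(column, C[j], mul)
        result = [
            [add(result[i][l], block[i][l]) for l in range(m)]
            for i in range(n)
        ]
    return result


# Hypothetical 3x2 and 2x3 binary factor matrices.
B = [[1, 0], [1, 1], [0, 1]]
C = [[1, 1, 0], [0, 1, 1]]

standard = combine(B, C, lambda x, y: x + y, lambda x, y: x * y, 0)
boolean = combine(B, C, lambda x, y: x | y, lambda x, y: x & y, 0)
print(standard)
print(boolean)
```

The two results differ only where the rank-1 blocks overlap: the standard sum counts the overlap twice, while the Boolean sum saturates at 1.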
Review Selection Using Micro-Reviews
Singapore National Research Foundation under International Research Centre @ Singapore Funding Initiativ
A Birthday Repetition Theorem and Complexity of Approximating Dense CSPs
A -birthday repetition of a
two-prover game is a game in which the two provers are sent
random sets of questions from of sizes and respectively.
These two sets are sampled independently uniformly among all sets of questions
of those particular sizes. We prove the following birthday repetition theorem:
when satisfies some mild conditions, decreases exponentially in where is the total number of
questions. Our result positively resolves an open question posed by Aaronson,
Impagliazzo and Moshkovitz (CCC 2014).
As an application of our birthday repetition theorem, we obtain new
fine-grained hardness of approximation results for dense CSPs. Specifically, we
establish a tight trade-off between running time and approximation ratio for
dense CSPs by showing conditional lower bounds, integrality gaps and
approximation algorithms. In particular, for any sufficiently large and for
every , we show the following results:
- We exhibit an -approximation algorithm for dense Max -CSPs
with alphabet size via -level of Sherali-Adams relaxation.
- Through our birthday repetition theorem, we obtain an integrality gap of
for -level Lasserre relaxation for fully-dense Max
-CSP.
- Assuming that there is a constant such that Max 3SAT cannot
be approximated to within of the optimal in sub-exponential
time, our birthday repetition theorem implies that any algorithm that
approximates fully-dense Max -CSP to within a factor takes
time, almost tightly matching the algorithmic
result based on Sherali-Adams relaxation.