
    Knowledge Refinement via Rule Selection

    In several different applications, including data transformation and entity resolution, rules are used to capture aspects of knowledge about the application at hand. Often, a large set of such rules is generated automatically or semi-automatically, and the challenge is to refine the encapsulated knowledge by selecting a subset of the rules based on their expected operational behavior on available data. In this paper, we carry out a systematic complexity-theoretic investigation of the following rule selection problem: given a set of rules specified by Horn formulas and a pair of an input database and an output database, find a subset of the rules that minimizes the total error, that is, the number of false positive and false negative errors arising from the selected rules. We first establish computational hardness results for the decision problems underlying this minimization problem, as well as upper and lower bounds on its approximability. We then investigate a bi-objective optimization version of the rule selection problem in which both the total error and the size of the selected rules are taken into account. We show that testing membership in the Pareto front of this bi-objective optimization problem is DP-complete. Finally, we show that a similar DP-completeness result holds for a bi-level optimization version of the rule selection problem, in which one first minimizes the total error and then minimizes the size of the selected rules.
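
    To make the objective concrete, here is a minimal Python sketch of the total-error criterion described above, assuming a rule can be modelled simply as a function from an input database (a set of facts) to the set of facts it derives; the names and encoding are illustrative, not taken from the paper.

```python
from itertools import chain, combinations

# Illustrative encoding (not from the paper): a "rule" is a function mapping an
# input database (a set of facts) to the set of facts it derives.

def derived_facts(rules, input_db):
    """Union of the facts produced by the selected rules on the input database."""
    out = set()
    for rule in rules:
        out |= rule(input_db)
    return out

def total_error(rules, input_db, output_db):
    """False positives (derived but not in the output database) plus
    false negatives (in the output database but not derived)."""
    d = derived_facts(rules, input_db)
    return len(d - output_db) + len(output_db - d)

def best_subset_bruteforce(all_rules, input_db, output_db):
    """Exhaustive search over all rule subsets; exponential, in line with the
    hardness results above, and shown only to pin down the objective."""
    subsets = chain.from_iterable(
        combinations(all_rules, r) for r in range(len(all_rules) + 1))
    return min(subsets, key=lambda s: total_error(s, input_db, output_db))
```

    The exhaustive search is of course impractical; it is included only so the quantity being minimized is unambiguous.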

    Secluded Connectivity Problems

    Consider a setting where possibly sensitive information sent over a path in a network is visible to every neighbor of the path, i.e., every neighbor of some node on the path, thus including the nodes on the path itself. The exposure of a path $P$ can be measured as the number of nodes adjacent to it, denoted by $N[P]$. A path is said to be secluded if its exposure is small. A similar measure can be applied to other connected subgraphs, such as Steiner trees connecting a given set of terminals. Such subgraphs may be relevant due to considerations of privacy, security or revenue maximization. This paper considers problems related to minimum exposure connectivity structures such as paths and Steiner trees. It is shown that on unweighted undirected $n$-node graphs, the problem of finding the minimum exposure path connecting a given pair of vertices is strongly inapproximable, i.e., hard to approximate within a factor of $O(2^{\log^{1-\epsilon} n})$ for any $\epsilon > 0$ (under an appropriate complexity assumption), but is approximable with ratio $\sqrt{\Delta} + 3$, where $\Delta$ is the maximum degree in the graph. One of our main results concerns the class of bounded-degree graphs, which is shown to exhibit the following interesting dichotomy. On the one hand, the minimum exposure path problem is NP-hard on node-weighted or directed bounded-degree graphs (even when the maximum degree is 4). On the other hand, we present a polynomial algorithm (based on a nontrivial dynamic program) for the problem on unweighted undirected bounded-degree graphs. Likewise, the problem is shown to be polynomial also for the class of (weighted or unweighted) bounded-treewidth graphs.
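
    As a concrete illustration of the exposure measure $N[P]$, the following Python sketch (using networkx; the function names are ours, not from the paper) computes the exposure of a given path and finds a minimum-exposure path by exhaustive enumeration. The enumeration is exponential and only meant to make the objective precise, unlike the dynamic-programming and approximation algorithms discussed above.

```python
import networkx as nx

def exposure(G, path):
    """|N[P]|: the number of nodes that lie on the path or are adjacent to it."""
    closed_neighborhood = set(path)
    for v in path:
        closed_neighborhood.update(G.neighbors(v))
    return len(closed_neighborhood)

def min_exposure_path_bruteforce(G, s, t):
    """Enumerate all simple s-t paths and keep the least exposed one.
    Exponential in general; shown only to make the objective concrete."""
    return min(nx.all_simple_paths(G, s, t), key=lambda p: exposure(G, p))

G = nx.path_graph(5)                          # the path 0-1-2-3-4
print(exposure(G, [0, 1, 2]))                 # N[P] = {0, 1, 2, 3}, so this prints 4
print(min_exposure_path_bruteforce(G, 0, 4))  # [0, 1, 2, 3, 4]
```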

    An Improved Algorithm for Learning to Perform Exception-Tolerant Abduction

    Inference from an observed or hypothesized condition to a plausible cause or explanation for this condition is known as abduction. For many tasks, the acquisition of the necessary knowledge by machine learning has been found to be highly effective. However, the semantics of learned knowledge are weaker than the usual classical semantics, and this necessitates new formulations of many tasks. We focus on a recently introduced formulation of the abductive inference task that is thus adapted to the semantics of machine learning. A key problem is that we cannot expect our causes or explanations to be perfect; they must tolerate some error because the world is more complicated than our formalization allows. This is a version of the qualification problem, and in machine learning it is known as agnostic learning. In the work by Juba that introduced the task of learning to make abductive inferences, an algorithm is given for producing k-DNF explanations that tolerates such exceptions: if the best possible k-DNF explanation fails to justify the condition with probability , then the algorithm is promised to find a k-DNF explanation that fails to justify the condition with probability at most , where n is the number of propositional attributes used to describe the domain. Here, we present an improved algorithm for this task. When the best k-DNF fails with probability , our algorithm finds a k-DNF that fails with probability at most (i.e., suppressing logarithmic factors in n and ). We examine the empirical advantage of this new algorithm over the previous algorithm in two test domains: one of explaining conditions generated by a “noisy” k-DNF rule, and another of explaining conditions that are actually generated by a linear threshold rule. We also apply the algorithm to a real-world application, anomaly explanation. Here, as opposed to anomaly detection, we are interested in finding possible descriptions of what may be causing anomalies in visual data. We use PCA to perform anomaly detection. The task is to attach semantics drawn from the image meta-data to a portion of the anomalous images from some source such as a webcam. Such a partial description of the anomalous images in terms of the meta-data is useful both because it may help to explain what causes the identified anomalies, and because it may help to identify the truly unusual images that defy such simple categorization. We find that our approximation algorithm is a good match for this task: it successfully finds plausible explanations of the anomalies, yields a low error rate when the data set is large (>80,000 inputs), and also works well when the data set is not very large (<50,000 examples). It finds small 2-DNFs that are easy to interpret and capture a non-negligible fraction of the anomalous images.
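
    The following Python sketch shows one way to score a candidate k-DNF explanation on sampled data. The encoding of terms and the notion of "failure" used here (the explanation fires but the observed condition is false) are our own reading for illustration only; they are not the paper's formal definitions, and this is not its learning algorithm.

```python
import numpy as np

def dnf_holds(dnf, x):
    """True if some term is satisfied; a term is a tuple of (attribute, value) pairs."""
    return any(all(x[i] == v for i, v in term) for term in dnf)

def empirical_failure_rate(dnf, X, condition):
    """Fraction of examples on which the explanation fires but the condition is false."""
    fires = np.array([dnf_holds(dnf, x) for x in X])
    return float(np.mean(fires & ~condition))

# Toy data: the condition is x0 OR x1, and the candidate explanation matches it exactly.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 5))
condition = (X[:, 0] | X[:, 1]).astype(bool)
explanation = [((0, 1),), ((1, 1),)]    # the 1-DNF "x0 or x1"
print(empirical_failure_rate(explanation, X, condition))   # 0.0 on this toy data
```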

    Generalized Matrix Factorizations as a Unifying Framework for Pattern Set Mining: Complexity Beyond Blocks

    Matrix factorizations are a popular tool for mining regularities from data. There are many ways to interpret a factorization, but one particularly suited for data mining uses the fact that a matrix product can be interpreted as a sum of rank-1 matrices. The factorization of a matrix then becomes the task of finding a small number of rank-1 matrices whose sum is a good representation of the original matrix. Seen this way, it becomes clear that many problems in data mining can be expressed as matrix factorizations under appropriate definitions of what a rank-1 matrix and a sum of rank-1 matrices mean. This paper develops a unified theory, based on generalized outer product operators, that encompasses many pattern set mining tasks. The focus is on the computational aspects of the theory, studying the computational complexity and approximability of many problems related to generalized matrix factorizations. The results apply immediately to a large number of data mining problems and will hopefully allow future results and algorithms to be generalized as well.
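
    A small Python/NumPy sketch of the rank-1 view mentioned above: the ordinary matrix product written as a sum of outer products, together with a Boolean variant in which the outer product uses AND and the "sum" uses OR. This only illustrates the generalized-outer-product idea; it is not code or notation from the paper.

```python
import numpy as np

def product_as_rank1_sum(A, B):
    """The ordinary matrix product A @ B written as a sum of rank-1 outer products."""
    return sum(np.outer(A[:, r], B[r, :]) for r in range(A.shape[1]))

def boolean_product(A, B):
    """A Boolean variant: OR of rank-1 outer products taken with AND."""
    out = np.zeros((A.shape[0], B.shape[1]), dtype=bool)
    for r in range(A.shape[1]):
        out |= np.outer(A[:, r].astype(bool), B[r, :].astype(bool))
    return out

A = np.random.randint(0, 2, size=(4, 2))
B = np.random.randint(0, 2, size=(2, 3))
assert np.array_equal(product_as_rank1_sum(A, B), A @ B)
print(boolean_product(A, B).astype(int))
```

    Swapping the outer product and the aggregation operator is exactly the kind of redefinition under which a pattern mining task becomes a (generalized) matrix factorization.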

    Review Selection Using Micro-Reviews

    Singapore National Research Foundation under International Research Centre @ Singapore Funding Initiative

    A Birthday Repetition Theorem and Complexity of Approximating Dense CSPs

    A $(k \times l)$-birthday repetition $\mathcal{G}^{k \times l}$ of a two-prover game $\mathcal{G}$ is a game in which the two provers are sent random sets of questions from $\mathcal{G}$ of sizes $k$ and $l$ respectively. These two sets are sampled independently and uniformly among all sets of questions of those particular sizes. We prove the following birthday repetition theorem: when $\mathcal{G}$ satisfies some mild conditions, $\mathrm{val}(\mathcal{G}^{k \times l})$ decreases exponentially in $\Omega(kl/n)$, where $n$ is the total number of questions. Our result positively resolves an open question posed by Aaronson, Impagliazzo and Moshkovitz (CCC 2014). As an application of our birthday repetition theorem, we obtain new fine-grained hardness of approximation results for dense CSPs. Specifically, we establish a tight trade-off between running time and approximation ratio for dense CSPs by showing conditional lower bounds, integrality gaps and approximation algorithms. In particular, for any sufficiently large $i$ and for every $k \geq 2$, we show the following results:
    - We exhibit an $O(q^{1/i})$-approximation algorithm for dense Max $k$-CSPs with alphabet size $q$ via $O_k(i)$ levels of the Sherali-Adams relaxation.
    - Through our birthday repetition theorem, we obtain an integrality gap of $q^{1/i}$ for the $\tilde\Omega_k(i)$-level Lasserre relaxation for fully-dense Max $k$-CSP.
    - Assuming that there is a constant $\epsilon > 0$ such that Max 3SAT cannot be approximated to within $(1-\epsilon)$ of the optimal in sub-exponential time, our birthday repetition theorem implies that any algorithm that approximates fully-dense Max $k$-CSP to within a $q^{1/i}$ factor takes $(nq)^{\tilde\Omega_k(i)}$ time, almost tightly matching the algorithmic result based on the Sherali-Adams relaxation.
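
    As a back-of-the-envelope illustration (not part of the paper's proof), the $kl/n$ scale comes from a standard birthday-style calculation: the probability that the two provers' independently sampled question sets, of sizes $k$ and $l$ out of $n$ questions, share no question is at most

```latex
% Probability that a uniformly random k-set of questions misses a fixed l-set,
% out of n questions in total:
\[
  \Pr[\text{disjoint}]
    \;=\; \frac{\binom{n-l}{k}}{\binom{n}{k}}
    \;\le\; \Bigl(1 - \frac{l}{n}\Bigr)^{k}
    \;\le\; e^{-kl/n}.
\]
```

    So the two question sets overlap, and the provers' answers can be cross-checked against each other, except with probability exponentially small in $kl/n$; turning this intuition into the actual decay bound on $\mathrm{val}(\mathcal{G}^{k \times l})$ is the substance of the theorem.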