Knowledge Refinement via Rule Selection
In several different applications, including data transformation and entity
resolution, rules are used to capture aspects of knowledge about the
application at hand. Often, a large set of such rules is generated
automatically or semi-automatically, and the challenge is to refine the
encapsulated knowledge by selecting a subset of rules based on the expected
operational behavior of the rules on available data. In this paper, we carry
out a systematic complexity-theoretic investigation of the following rule
selection problem: given a set of rules specified by Horn formulas, and a pair
of an input database and an output database, find a subset of the rules that
minimizes the total error, that is, the number of false positive and false
negative errors arising from the selected rules. We first establish
computational hardness results for the decision problems underlying this
minimization problem, as well as upper and lower bounds for its
approximability. We then investigate a bi-objective optimization version of the
rule selection problem in which both the total error and the size of the
selected rules are taken into account. We show that testing for membership in
the Pareto front of this bi-objective optimization problem is DP-complete.
Finally, we show that a similar DP-completeness result holds for a bi-level
optimization version of the rule selection problem, in which one first
minimizes the total error and then the size of the selected rules.
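To make the rule selection objective concrete, here is a minimal Python sketch (illustrative only, not from the paper): rules are modeled as functions from an input database to a set of derived facts, and the total error of a rule subset is counted against a target output database. The exhaustive search is exponential and only meant to make the minimization problem tangible.

```python
from itertools import chain, combinations

def total_error(rules, input_db, output_db):
    """False positives + false negatives of the facts derived by `rules`."""
    derived = set()
    for rule in rules:
        derived |= rule(input_db)            # facts this rule produces
    false_pos = len(derived - output_db)     # derived but not in the target output
    false_neg = len(output_db - derived)     # in the target output but never derived
    return false_pos + false_neg

def best_subset(rules, input_db, output_db):
    """Exhaustively search all rule subsets (exponential; illustration only)."""
    subsets = chain.from_iterable(
        combinations(rules, r) for r in range(len(rules) + 1))
    return min(subsets, key=lambda s: total_error(s, input_db, output_db))

# Toy instance: each rule maps an input fact set to derived facts.
input_db = {"a", "b"}
output_db = {"p", "q"}
r1 = lambda db: {"p"} if "a" in db else set()   # good rule
r2 = lambda db: {"q"} if "b" in db else set()   # good rule
r3 = lambda db: {"x"} if "a" in db else set()   # causes a false positive
best = best_subset([r1, r2, r3], input_db, output_db)
print(total_error(best, input_db, output_db))   # 0: selecting r1 and r2 only
```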
Scalable Boolean Tensor Factorizations using Random Walks
Tensors are becoming increasingly common in data mining, and consequently,
tensor factorizations are becoming more and more important tools for data
miners. When the data is binary, it is natural to ask if we can factorize it
into binary factors while simultaneously making sure that the reconstructed
tensor is still binary. Such factorizations, called Boolean tensor
factorizations, can provide improved interpretability and find Boolean
structure that is hard to express using normal factorizations. Unfortunately,
the algorithms for computing Boolean tensor factorizations usually do not scale
well. In this paper, we present a novel algorithm for finding Boolean CP and
Tucker decompositions of large and sparse binary tensors. In our experimental
evaluation we show that our algorithm can handle large tensors and accurately
reconstructs the latent Boolean structure.
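A small NumPy sketch of the Boolean CP setting (not the paper's random-walk algorithm): it only reconstructs a binary tensor from given Boolean factor matrices and counts reconstruction errors, which is the quantity such factorization algorithms try to minimize.

```python
import numpy as np

def boolean_cp_reconstruct(A, B, C):
    """Boolean CP reconstruction: X[i,j,k] = OR_r (A[i,r] AND B[j,r] AND C[k,r])."""
    # einsum sums the rank-1 contributions; "> 0" turns the sum into a Boolean OR
    return (np.einsum('ir,jr,kr->ijk', A, B, C) > 0).astype(np.int8)

def reconstruction_error(X, A, B, C):
    """Number of cells where the Boolean reconstruction disagrees with X."""
    return int(np.sum(boolean_cp_reconstruct(A, B, C) != X))

# Rank-2 toy example: two overlapping Boolean blocks.
A = np.array([[1, 0], [1, 1], [0, 1]])
B = np.array([[1, 0], [0, 1]])
C = np.array([[1, 1], [1, 0]])
X = boolean_cp_reconstruct(A, B, C)
print(reconstruction_error(X, A, B, C))  # 0 by construction
```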
Binary Matrix Factorisation via Column Generation
Identifying discrete patterns in binary data is an important dimensionality
reduction tool in machine learning and data mining. In this paper, we consider
the problem of low-rank binary matrix factorisation (BMF) under Boolean
arithmetic. Due to the NP-hardness of this problem, most previous attempts rely
on heuristic techniques. We formulate the problem as a mixed integer linear
program and use the large-scale optimisation technique of column generation to
solve it without the need for heuristic pattern mining. Our approach focuses on
accuracy and on the provision of optimality guarantees. Experimental results on
real world datasets demonstrate that our proposed method is effective at
producing highly accurate factorisations and improves on the previously
available best known results for 15 out of 24 problem instances.
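To illustrate the Boolean arithmetic underlying BMF (this is not the paper's column-generation method): the Boolean matrix product replaces addition with OR, and even rank-1 factorisation is a combinatorial search, sketched here by brute force on a tiny matrix.

```python
import numpy as np
from itertools import product

def boolean_product(A, B):
    """(A o B)[i,j] = OR_r (A[i,r] AND B[r,j]) -- the Boolean matrix product."""
    return ((A.astype(int) @ B.astype(int)) > 0).astype(np.int8)

def bmf_rank1_exhaustive(X):
    """Best rank-1 Boolean factorisation a b^T by brute force (tiny inputs only)."""
    n, m = X.shape
    best = None
    for a in product([0, 1], repeat=n):
        for b in product([0, 1], repeat=m):
            err = int(np.sum(np.outer(a, b) != X))  # cells the rank-1 pattern gets wrong
            if best is None or err < best[0]:
                best = (err, np.array(a), np.array(b))
    return best

X = np.array([[1, 1, 0],
              [1, 1, 0],
              [0, 0, 1]])
err, a, b = bmf_rank1_exhaustive(X)
print(err)  # 1: the lone 1 in the corner cannot join the 2x2 block at rank 1
```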
Approximating Red-Blue Set Cover and Minimum Monotone Satisfying Assignment
We provide new approximation algorithms for the Red-Blue Set Cover and
Circuit Minimum Monotone Satisfying Assignment (MMSA) problems. Our algorithm
for Red-Blue Set Cover achieves -approximation improving on
the -approximation due to Elkin and Peleg (where is the
number of sets). Our approximation algorithm for MMSA (for circuits of
depth ) gives an approximation for , where is the number of gates and
variables. No non-trivial approximation algorithms for MMSA with
were previously known.
We complement these results with lower bounds for these problems: For
Red-Blue Set Cover, we provide a nearly approximation preserving reduction from
Min -Union that gives an hardness
under the Dense-vs-Random conjecture, while for MMSA we sketch a proof that an
SDP relaxation strengthened by Sherali--Adams has an integrality gap of
where as the circuit depth grows.
Comment: APPROX 202
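For readers unfamiliar with the problem, here is a brute-force sketch of Red-Blue Set Cover (illustrative only; all names are hypothetical): choose a subcollection of sets that covers every blue element while touching as few red elements as possible.

```python
from itertools import chain, combinations

def red_blue_set_cover(sets, blue, red):
    """Brute-force Red-Blue Set Cover: cover all blue elements while touching
    as few red elements as possible (exponential; tiny instances only)."""
    best = None
    all_subsets = chain.from_iterable(
        combinations(sets, r) for r in range(1, len(sets) + 1))
    for sub in all_subsets:
        covered = set().union(*sub)
        if blue <= covered:               # every blue element is covered
            cost = len(covered & red)     # red elements we were forced to touch
            if best is None or cost < best[0]:
                best = (cost, sub)
    return best

blue = {"b1", "b2"}
red = {"r1", "r2", "r3"}
sets = [{"b1", "r1"}, {"b2", "r1"}, {"b1", "b2", "r2", "r3"}]
cost, chosen = red_blue_set_cover(sets, blue, red)
print(cost)  # 1: the first two sets cover both blues while touching only r1
```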
An Improved Algorithm for Learning to Perform Exception-Tolerant Abduction
Inference from an observed or hypothesized condition to a plausible cause or explanation for this condition is known as abduction. For many tasks, the acquisition of the necessary knowledge by machine learning has been widely found to be highly effective. However, the semantics of learned knowledge are weaker than the usual classical semantics, and this necessitates new formulations of many tasks. We focus on a recently introduced formulation of the abductive inference task that is thus adapted to the semantics of machine learning. A key problem is that we cannot expect our causes or explanations to be perfect; they must tolerate some error due to the world being more complicated than our formalization allows. This is a version of the qualification problem, and in machine learning, this is known as agnostic learning. In the work by Juba that introduced the task of learning to make abductive inferences, an algorithm is given for producing k-DNF explanations that tolerates such exceptions: if the best possible k-DNF explanation fails to justify the condition with probability , then the algorithm is promised to find a k-DNF explanation that fails to justify the condition with probability at most , where n is the number of propositional attributes used to describe the domain. Here, we present an improved algorithm for this task. When the best k-DNF fails with probability , our algorithm finds a k-DNF that fails with probability at most (i.e., suppressing logarithmic factors in n and ). We examine the empirical advantage of this new algorithm over the previous algorithm in two test domains, one of explaining conditions generated by a “noisy” k-DNF rule, and another of explaining conditions that are actually generated by a linear threshold rule.
We also apply the algorithm to a real-world application: anomaly explanation. In this work, as opposed to anomaly detection, we are interested in finding possible descriptions of what may be causing anomalies in visual data. We use PCA to perform anomaly detection. The task is then to attach semantics drawn from the image meta-data to a portion of the anomalous images from some source such as web-cams. Such a partial description of the anomalous images in terms of the meta-data is useful both because it may help to explain what causes the identified anomalies, and because it may help to identify the truly unusual images that defy such simple categorization. Our approximation algorithm is a good match for this task and successfully finds plausible explanations of the anomalies. It yields a low error rate when the data set is large (>80,000 inputs) and also works well when the data set is not very large (<50,000 examples). It finds small 2-DNFs that are easy to interpret and capture a non-negligible
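To make the failure-probability objective concrete, here is a hedged sketch (not Juba's algorithm, and the term encoding is an assumption of this example): it merely evaluates a k-DNF, given as a list of terms over (index, value) literals, on binary examples and reports the fraction of observed examples it fails to cover.

```python
import numpy as np

def kdnf_eval(terms, X):
    """Evaluate a k-DNF on the rows of binary matrix X.
    Each term is a list of (index, value) literals, e.g. [(0, 1), (3, 0)]
    means x0 AND NOT x3; the DNF is the OR of its terms."""
    out = np.zeros(len(X), dtype=bool)
    for term in terms:
        sat = np.ones(len(X), dtype=bool)
        for i, v in term:
            sat &= (X[:, i] == v)     # rows satisfying this literal
        out |= sat                    # rows satisfying any term so far
    return out

def failure_rate(terms, X):
    """Fraction of observed (positive) examples the explanation fails to cover."""
    return 1.0 - kdnf_eval(terms, X).mean()

# Toy data: 2-DNF (x0 AND x1) OR (NOT x2) over 4 observed examples.
X = np.array([[1, 1, 1],
              [1, 1, 0],
              [0, 0, 0],
              [0, 1, 1]])
terms = [[(0, 1), (1, 1)], [(2, 0)]]
print(failure_rate(terms, X))  # 0.25: only the last example is unexplained
```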
Hybrid ASP-based Approach to Pattern Mining
Detecting small sets of relevant patterns from a given dataset is a central
challenge in data mining. The relevance of a pattern is based on user-provided
criteria; typically, all patterns that satisfy certain criteria are considered
relevant. Rule-based languages like Answer Set Programming (ASP) seem
well-suited for specifying such criteria in the form of constraints. Although
progress has been made, on the one hand, on solving individual mining problems
and, on the other hand, on developing generic mining systems, existing methods
focus either on scalability or on generality. In this paper we take steps
towards combining local (frequency, size, cost) and global (various condensed
representations like maximal, closed, skyline) constraints in a generic and
efficient way. We present a hybrid approach for itemset, sequence and graph
mining which exploits dedicated highly optimized mining systems to detect
frequent patterns and then filters the results using declarative ASP. To
further demonstrate the generic nature of our hybrid framework we apply it to a
problem of approximately tiling a database. Experiments on real-world datasets
show the effectiveness of the proposed method and computational gains for
itemset, sequence and graph mining, as well as approximate tiling.
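The mine-then-filter idea can be sketched in a few lines of Python (illustrative only: the paper couples optimized miners with declarative ASP, while here both stages are naive stand-ins). Frequent itemsets are enumerated first; then a condensed-representation constraint, closedness, is applied as a post-filter.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Enumerate all itemsets meeting min_support (naive; stands in for a
    dedicated, highly optimized miner in the hybrid pipeline)."""
    items = sorted({i for t in transactions for i in t})
    freq = {}
    for r in range(1, len(items) + 1):
        for cand in combinations(items, r):
            support = sum(1 for t in transactions if set(cand) <= t)
            if support >= min_support:
                freq[frozenset(cand)] = support
    return freq

def closed_only(freq):
    """Post-filter: keep itemsets with no proper superset of equal support
    (the kind of global constraint the declarative ASP stage would express)."""
    return {p: s for p, s in freq.items()
            if not any(p < q and s == freq[q] for q in freq)}

transactions = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b"}]
freq = frequent_itemsets(transactions, min_support=2)
closed = closed_only(freq)
print(sorted("".join(sorted(p)) for p in closed))  # ['ab', 'abc']
```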
Under consideration in Theory and Practice of Logic Programming (TPLP).
Comment: 29 pages, 7 figures, 5 tables