
    Knowledge Refinement via Rule Selection

    In several different applications, including data transformation and entity resolution, rules are used to capture aspects of knowledge about the application at hand. Often, a large set of such rules is generated automatically or semi-automatically, and the challenge is to refine the encapsulated knowledge by selecting a subset of rules based on the expected operational behavior of the rules on available data. In this paper, we carry out a systematic complexity-theoretic investigation of the following rule selection problem: given a set of rules specified by Horn formulas, and a pair consisting of an input database and an output database, find a subset of the rules that minimizes the total error, that is, the number of false positive and false negative errors arising from the selected rules. We first establish computational hardness results for the decision problems underlying this minimization problem, as well as upper and lower bounds for its approximability. We then investigate a bi-objective optimization version of the rule selection problem in which both the total error and the size of the selected rules are taken into account. We show that testing for membership in the Pareto front of this bi-objective optimization problem is DP-complete. Finally, we show that a similar DP-completeness result holds for a bi-level optimization version of the rule selection problem, where one minimizes first the total error and then the size of the selected subset of rules.
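
    To make the total-error objective concrete, here is a minimal Python sketch (an exponential brute force for illustration, not the paper's method), assuming each rule is summarized by the set of facts it derives on the fixed input database and that selected rules combine by union:

        from itertools import combinations

        # Hypothetical model: each rule is the set of output facts it derives
        # on the input database; 'target' is the given output database.
        def total_error(selected, target):
            derived = set().union(*selected) if selected else set()
            false_pos = len(derived - target)  # derived but not in the output
            false_neg = len(target - derived)  # in the output but not derived
            return false_pos + false_neg

        def best_subset(rules, target):
            best, best_err = (), float("inf")
            for r in range(len(rules) + 1):
                for subset in combinations(rules, r):
                    err = total_error(subset, target)
                    if err < best_err:
                        best, best_err = subset, err
            return best, best_err

        rules = [frozenset({"a", "b"}), frozenset({"b", "c"}), frozenset({"d"})]
        target = {"a", "b", "d"}
        print(best_subset(rules, target))  # selects the 1st and 3rd rules, error 0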

    Scalable Boolean Tensor Factorizations using Random Walks

    Tensors are becoming increasingly common in data mining, and consequently, tensor factorizations are becoming more and more important tools for data miners. When the data is binary, it is natural to ask if we can factorize it into binary factors while simultaneously making sure that the reconstructed tensor is still binary. Such factorizations, called Boolean tensor factorizations, can provide improved interpretability and find Boolean structure that is hard to express using normal factorizations. Unfortunately, the algorithms for computing Boolean tensor factorizations do not usually scale well. In this paper we present a novel algorithm for finding Boolean CP and Tucker decompositions of large and sparse binary tensors. In our experimental evaluation we show that our algorithm can handle large tensors and accurately reconstructs the latent Boolean structure.
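
    For reference, the Boolean CP model replaces sums with ORs: $x_{ijk} = \bigvee_r a_{ir} \wedge b_{jr} \wedge c_{kr}$ for binary factor matrices A, B, C. A minimal NumPy sketch of the reconstruction side of this model (illustrative only; the paper's contribution is the scalable factorization algorithm, which is not reproduced here):

        import numpy as np

        def boolean_cp(A, B, C):
            # count rank-1 contributions per cell; thresholding at >= 1 turns
            # the sum into a Boolean OR
            counts = np.einsum("ir,jr,kr->ijk", A, B, C)
            return (counts >= 1).astype(np.uint8)

        rng = np.random.default_rng(0)
        A = rng.integers(0, 2, size=(4, 2))   # 4 x 2 binary factor, rank 2
        B = rng.integers(0, 2, size=(5, 2))
        C = rng.integers(0, 2, size=(6, 2))
        X = boolean_cp(A, B, C)               # a Boolean rank-2 tensor by construction
        # factorization error = number of cells where a reconstruction disagrees
        print(int(np.sum(X != boolean_cp(A, B, C))))  # 0 for the exact factors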

    Binary Matrix Factorisation via Column Generation

    Identifying discrete patterns in binary data is an important dimensionality reduction tool in machine learning and data mining. In this paper, we consider the problem of low-rank binary matrix factorisation (BMF) under Boolean arithmetic. Due to the NP-hardness of this problem, most previous attempts rely on heuristic techniques. We formulate the problem as a mixed integer linear program and use the large-scale optimisation technique of column generation to solve it without the need for heuristic pattern mining. Our approach focuses on accuracy and on the provision of optimality guarantees. Experimental results on real world datasets demonstrate that our proposed method is effective at producing highly accurate factorisations and improves on the previously available best known results for 15 out of 24 problem instances.
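
    The objective being optimized is easy to state: under Boolean arithmetic, the rank-k product has $(A \circ B)_{ij} = \bigvee_l a_{il} \wedge b_{lj}$, and the factorisation error is the number of cells where it disagrees with the data. A minimal NumPy sketch of this objective (illustrative; the paper's MILP and column-generation machinery is not reproduced here):

        import numpy as np

        def boolean_product(A, B):
            # (A o B)[i, j] = OR_l (A[i, l] AND B[l, j])
            return (A @ B >= 1).astype(np.uint8)

        def bmf_error(X, A, B):
            # cells where the Boolean reconstruction disagrees with X
            return int(np.sum(X != boolean_product(A, B)))

        X = np.array([[1, 1, 0],
                      [1, 1, 1],
                      [0, 1, 1]], dtype=np.uint8)
        A = np.array([[1, 0], [1, 1], [0, 1]], dtype=np.uint8)  # 3 x k, k = 2
        B = np.array([[1, 1, 0], [0, 1, 1]], dtype=np.uint8)    # k x 3
        print(bmf_error(X, A, B))  # 0: this X is exactly Boolean rank 2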

    Approximating Red-Blue Set Cover and Minimum Monotone Satisfying Assignment

    We provide new approximation algorithms for the Red-Blue Set Cover and Circuit Minimum Monotone Satisfying Assignment (MMSA) problems. Our algorithm for Red-Blue Set Cover achieves an $\tilde O(m^{1/3})$-approximation, improving on the $\tilde O(m^{1/2})$-approximation due to Elkin and Peleg (where $m$ is the number of sets). Our approximation algorithm for MMSA$_t$ (for circuits of depth $t$) gives an $\tilde O(N^{1-\delta})$ approximation for $\delta = \frac{1}{3} \cdot 2^{3 - \lceil t/2 \rceil}$, where $N$ is the number of gates and variables. No non-trivial approximation algorithms for MMSA$_t$ with $t \geq 4$ were previously known. We complement these results with lower bounds for these problems: for Red-Blue Set Cover, we provide a nearly approximation-preserving reduction from Min $k$-Union that gives an $\tilde \Omega(m^{1/4 - \varepsilon})$ hardness under the Dense-vs-Random conjecture, while for MMSA we sketch a proof that an SDP relaxation strengthened by Sherali--Adams has an integrality gap of $N^{1-\varepsilon}$, where $\varepsilon \to 0$ as the circuit depth $t \to \infty$. Comment: APPROX 202
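
    As a reminder of the problem being approximated: given a family of sets over red and blue elements, Red-Blue Set Cover asks for a subfamily that covers every blue element while minimizing the number of red elements touched. A minimal brute-force Python sketch of this objective (for illustration only; it is exponential, unlike the paper's $\tilde O(m^{1/3})$-approximation):

        from itertools import combinations

        def red_blue_cover(sets, blue, red):
            best, best_cost = None, float("inf")
            for r in range(1, len(sets) + 1):
                for family in combinations(sets, r):
                    covered = set().union(*family)
                    cost = len(covered & red)
                    if blue <= covered and cost < best_cost:
                        best, best_cost = family, cost
            return best, best_cost

        blue = {"b1", "b2"}
        red = {"r1", "r2", "r3"}
        sets = [frozenset({"b1", "r1"}),
                frozenset({"b2", "r2", "r3"}),
                frozenset({"b1", "b2", "r2"})]
        print(red_blue_cover(sets, blue, red))  # the 3rd set alone: one red element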

    An Improved Algorithm for Learning to Perform Exception-Tolerant Abduction

    Inference from an observed or hypothesized condition to a plausible cause or explanation for this condition is known as abduction. For many tasks, the acquisition of the necessary knowledge by machine learning has been widely found to be highly effective. However, the semantics of learned knowledge are weaker than the usual classical semantics, and this necessitates new formulations of many tasks. We focus on a recently introduced formulation of the abductive inference task that is thus adapted to the semantics of machine learning. A key problem is that we cannot expect our causes or explanations to be perfect; they must tolerate some error due to the world being more complicated than our formalization allows. This is a version of the qualification problem, and in machine learning it is known as agnostic learning. In the work by Juba that introduced the task of learning to make abductive inferences, an algorithm is given for producing k-DNF explanations that tolerates such exceptions: if the best possible k-DNF explanation fails to justify the condition with probability $\epsilon$, then the algorithm is promised to find a k-DNF explanation that fails to justify the condition with probability at most $\tilde O(n^k \epsilon)$, where $n$ is the number of propositional attributes used to describe the domain. Here, we present an improved algorithm for this task. When the best k-DNF fails with probability $\epsilon$, our algorithm finds a k-DNF that fails with probability at most $\tilde O(n^{\lceil k/2 \rceil} \epsilon)$ (i.e., suppressing logarithmic factors in $n$ and $1/\epsilon$). We examine the empirical advantage of this new algorithm over the previous one in two test domains: one of explaining conditions generated by a “noisy” k-DNF rule, and another of explaining conditions that are actually generated by a linear threshold rule. We also apply the algorithm to a real-world application, anomaly explanation. Here, as opposed to anomaly detection, we are interested in finding possible descriptions of what may be causing anomalies in visual data. We use PCA to perform anomaly detection. The task is to attach semantics drawn from the image meta-data to a portion of the anomalous images from some source such as a webcam. Such a partial description of the anomalous images in terms of the meta-data is useful both because it may help to explain what causes the identified anomalies, and because it may help to identify the truly unusual images that defy such simple categorization. Our approximation algorithm is a good match for this task: it successfully finds plausible explanations of the anomalies, yields a low error rate when the data set is large (>80,000 inputs), and also works well when the data set is not very large (<50,000 examples). It finds small 2-DNFs that are easy to interpret and that capture a non-negligible fraction of the anomalies.
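
    To fix notation, a k-DNF explanation is an OR of terms, each an AND of at most k literals over the n attributes; its quality is how rarely it holds while the condition fails. A minimal Python sketch of evaluating such a candidate explanation on data (the search for explanations itself is not reproduced here; the names are illustrative):

        import numpy as np

        def eval_dnf(dnf, X):
            # dnf: list of terms; each term is a list of (attribute, polarity)
            # literals. X: (num_examples, n) 0/1 array. Returns truth values.
            sat = np.zeros(len(X), dtype=bool)
            for term in dnf:
                term_sat = np.ones(len(X), dtype=bool)
                for attr, positive in term:
                    lit = X[:, attr].astype(bool)
                    term_sat &= lit if positive else ~lit
                sat |= term_sat
            return sat

        def failure_rate(dnf, X, condition):
            # fraction of examples where the explanation holds but the
            # condition it is meant to justify does not
            h = eval_dnf(dnf, X)
            return float(np.mean(~condition[h])) if h.any() else 1.0

        rng = np.random.default_rng(1)
        X = rng.integers(0, 2, size=(1000, 8))
        condition = (X[:, 0] & X[:, 1]).astype(bool)  # condition to be explained
        dnf = [[(0, True), (1, True)]]                # a one-term 2-DNF
        print(failure_rate(dnf, X, condition))        # 0.0: the term entails it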

    Hybrid ASP-based Approach to Pattern Mining

    Detecting small sets of relevant patterns from a given dataset is a central challenge in data mining. The relevance of a pattern is based on user-provided criteria; typically, all patterns that satisfy certain criteria are considered relevant. Rule-based languages like Answer Set Programming (ASP) seem well-suited for specifying such criteria in the form of constraints. Although progress has been made, on the one hand, on solving individual mining problems and, on the other hand, on developing generic mining systems, the existing methods focus either on scalability or on generality. In this paper we take steps towards combining local (frequency, size, cost) and global (various condensed representations like maximal, closed, skyline) constraints in a generic and efficient way. We present a hybrid approach for itemset, sequence and graph mining which exploits dedicated highly optimized mining systems to detect frequent patterns and then filters the results using declarative ASP. To further demonstrate the generic nature of our hybrid framework we apply it to the problem of approximately tiling a database. Experiments on real-world datasets show the effectiveness of the proposed method and computational gains for itemset, sequence and graph mining, as well as approximate tiling. Under consideration in Theory and Practice of Logic Programming (TPLP). Comment: 29 pages, 7 figures, 5 tables
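
    The division of labor is: a dedicated miner enumerates frequent patterns under the local constraints, and a declarative program then filters them under the global ones. A minimal sketch of such a post-filter, with the ASP stage stood in for by plain Python purely for illustration (here the global constraint is maximality):

        def maximal_itemsets(frequent):
            # 'frequent' maps frozenset itemsets to supports, as reported by a
            # dedicated miner; keep those with no frequent proper superset
            return {s: supp for s, supp in frequent.items()
                    if not any(s < t for t in frequent)}

        frequent = {
            frozenset({"a"}): 5, frozenset({"b"}): 4,
            frozenset({"a", "b"}): 3, frozenset({"c"}): 2,
        }
        print(maximal_itemsets(frequent))  # {a, b} and {c} pass the filter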