Textual Membership Queries
Human labeling of data can be very time-consuming and expensive; yet in many
cases it is critical for the success of the learning process. In order to
minimize human labeling efforts, we propose a novel active learning solution
that does not rely on existing sources of unlabeled data. It uses a small
amount of labeled data as the core set for the synthesis of useful membership
queries (MQs) - unlabeled instances generated by an algorithm for human
labeling. Our solution uses modification operators, functions that modify
instances to some extent. We apply the operators on a small set of instances
(core set), creating a set of new membership queries. Using this framework, we
look at the instance space as a search space and apply search algorithms in
order to generate new examples highly relevant to the learner. We implement
this framework in the textual domain, test it on several text classification
tasks, and show improved classifier performance as more MQs are labeled and
incorporated into the training set. To the best of our knowledge, this is the
first work on membership queries in the textual domain.

Comment: Accepted to IJCAI 2020. Code is available at
github.com/jonzarecki/textual-mqs . Additional material is available at
tinyurl.com/sup-textualmqs . SOLE copyright holder is IJCAI (International
Joint Conferences on Artificial Intelligence), all rights reserved.
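The operator-based query synthesis described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation (their code is at github.com/jonzarecki/textual-mqs): the two modification operators and the core-set texts below are hypothetical stand-ins.

```python
import random

# Hypothetical modification operators: each maps one text instance to a variant.
def drop_word(tokens, i):
    """Remove the i-th token."""
    return tokens[:i] + tokens[i + 1:]

def swap_adjacent(tokens, i):
    """Swap tokens i and i+1."""
    t = list(tokens)
    t[i], t[i + 1] = t[i + 1], t[i]
    return t

def synthesize_queries(core_set, rng):
    """Apply modification operators to core-set instances, producing new,
    unlabeled membership queries to hand to a human annotator."""
    queries = []
    for text in core_set:
        tokens = text.split()
        queries.append(" ".join(drop_word(tokens, rng.randrange(len(tokens)))))
        if len(tokens) > 1:
            queries.append(" ".join(swap_adjacent(tokens, rng.randrange(len(tokens) - 1))))
    return queries

core = ["the movie was great", "terrible acting ruined it"]
for q in synthesize_queries(core, random.Random(0)):
    print(q)
```

In the paper's framing, such operators turn the instance space into a search space; a search algorithm would then keep only the synthesized instances most informative to the learner.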
Redescription Mining and Applications in Bioinformatics
Our ability to interrogate the cell and computationally assimilate its answers is improving at a dramatic pace. For instance, the study of even a focused aspect of cellular activity, such as gene action, now benefits from multiple high-throughput data acquisition technologies such as microarrays, genome-wide deletion screens, and RNAi assays. A critical need is the development of algorithms that can bridge, relate, and unify diverse categories of data descriptors. Redescription mining is such an approach. Given a set of biological objects (e.g., genes, proteins) and a collection of descriptors defined over this set, the goal of redescription mining is to use the given descriptors as a vocabulary and find subsets of data that afford multiple definitions. The premise of redescription mining is that subsets that afford multiple definitions are likely to exhibit concerted behavior and are, hence, interesting. We present algorithms for redescription mining based on formal concept analysis and applications of redescription mining to multiple biological datasets. We demonstrate how redescriptions identify conceptual clusters of data using mutually reinforcing features, without explicit training information.
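The core premise, that subsets affording multiple definitions are interesting, can be illustrated with a deliberately simplified sketch. The paper's algorithms are based on formal concept analysis; this toy version merely compares descriptor extents by Jaccard similarity, and all descriptor names and gene identifiers are made up.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 for two empty sets)."""
    return len(a & b) / len(a | b) if a | b else 1.0

def mine_redescriptions(descriptors, threshold=0.8):
    """Find pairs of descriptors whose extents (the object subsets they cover)
    nearly coincide. Each such pair is a candidate redescription: two different
    vocabularies defining (almost) the same subset of objects."""
    hits = []
    for (n1, s1), (n2, s2) in combinations(descriptors.items(), 2):
        sim = jaccard(s1, s2)
        if sim >= threshold:
            hits.append((n1, n2, sim))
    return hits

# Hypothetical descriptors over a set of genes g1..g5.
genes = {
    "upregulated_in_heat_shock": {"g1", "g2", "g3", "g4"},
    "has_TF_binding_site_X":     {"g1", "g2", "g3", "g4"},
    "essential_under_stress":    {"g1", "g5"},
}
print(mine_redescriptions(genes))
```

Here the first two descriptors cover exactly the same genes, so they redescribe one another; a subset definable both by an expression pattern and by a sequence feature is the kind of concerted behavior the abstract calls interesting.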
On the Complexity of Mining Itemsets from the Crowd Using Taxonomies
We study the problem of frequent itemset mining in domains where data is not
recorded in a conventional database but only exists in human knowledge. We
provide examples of such scenarios, and present a crowdsourcing model for them.
The model uses the crowd as an oracle to find out whether an itemset is
frequent or not, and relies on a known taxonomy of the item domain to guide the
search for frequent itemsets. In the spirit of data mining with oracles, we
analyze the complexity of this problem in terms of (i) crowd complexity, that
measures the number of crowd questions required to identify the frequent
itemsets; and (ii) computational complexity, that measures the computational
effort required to choose the questions. We provide lower and upper complexity
bounds in terms of the size and structure of the input taxonomy, as well as the
size of a concise description of the output itemsets. We also provide
constructive algorithms that achieve the upper bounds, and consider more
efficient variants for practical situations.

Comment: 18 pages, 2 figures. To be published in ICDT'13. Added missing
acknowledgement.
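The taxonomy-guided, oracle-driven search can be illustrated with a toy sketch. The pruning rule below (an infrequent category has no frequent descendants) is the standard monotonicity property the search relies on; the taxonomy, the in-memory stand-in for the crowd oracle, and all item names are hypothetical.

```python
from collections import deque

def find_frequent(taxonomy, root, is_frequent):
    """Top-down search over a taxonomy, using the crowd as an oracle.

    `taxonomy` maps each node to its children; `is_frequent` stands in for a
    crowd question ("do you often buy <item>?"). Monotonicity lets us prune:
    if a category is infrequent, so is everything beneath it, so no further
    questions are asked in that subtree. Returns (frequent nodes, #questions),
    the second component being the crowd complexity of this run.
    """
    frequent, questions = [], 0
    queue = deque([root])
    while queue:
        node = queue.popleft()
        questions += 1                     # one crowd question per node visited
        if is_frequent(node):
            frequent.append(node)
            queue.extend(taxonomy.get(node, ()))
    return frequent, questions

taxonomy = {"food": ["dairy", "produce"], "dairy": ["milk", "cheese"],
            "produce": ["kale"]}
freq = {"food", "dairy", "milk"}           # ground truth the "crowd" knows
print(find_frequent(taxonomy, "food", freq.__contains__))
```

Note that "kale" is never asked about: once "produce" is reported infrequent, its whole subtree is pruned, which is exactly how the taxonomy reduces crowd complexity.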
Predicate Generation for Learning-Based Quantifier-Free Loop Invariant Inference
PETITION FOR ORIGINAL WRIT OF MANDAMUS DIRECTED TO THE HONORABLE DAVID L. MOWER DISTRICT JUDGE OF SEVIER COUNTY, STATE OF UTAH
Conjunctions of Unate DNF Formulas: Learning and Structure
A central topic in query learning is to determine which classes of Boolean formulas are efficiently learnable with membership and equivalence queries. We consider the class R_k consisting of conjunctions of k unate DNF formulas. This class generalizes the class of k-clause CNF formulas and the class of unate DNF formulas, both of which are known to be learnable in polynomial time with membership and equivalence queries. We prove that R_2 can be properly learned with a polynomial number of polynomial-size membership and equivalence queries, but can be properly learned in polynomial time with such queries if and only if P = NP. Thus the barrier to properly learning R_2 with membership and equivalence queries is computational rather than informational. Few results of this type are known. In our proofs, we use recent results of Hellerstein et al. (1997, J. Assoc. Comput. Mach. 43(5), 840-862) characterizing the classes that are polynomial-query learnable, together with work of Bshouty on the monotone dimension of Boolean functions. We extend some of our results to R_k and pose open questions on learning DNF formulas of small monotone dimension. We also prove structural results for R_k. We construct, for any fixed k >= 2, a class of functions f that cannot be represented by any formula in R_k, but which cannot be "easily" shown to have this property. More precisely, for any function f on n variables in the class, the value of f on any polynomial-size set of points in its domain is not a witness that f cannot be represented by a formula in R_k. Our construction is based on BCH codes.
Efficiently Learning Monotone Decision Trees with ID3
Since the Probably Approximately Correct learning model was introduced in 1984, there has been much effort in designing computationally efficient algorithms for learning Boolean functions from random examples drawn from a uniform distribution. In this paper, I take the ID3 information-gain-first classification algorithm and apply it to the task of learning monotone Boolean functions from examples that are uniformly distributed over {0,1}^n. I limit my scope to the class of monotone Boolean functions that can be represented as read-2 width-2 disjunctive normal form expressions. I model these functions as graphs and examine each type of connected component contained in these models, i.e., path graphs and cycle graphs. I determine the influence of the variables in the pieces of these graph models in order to understand how ID3 behaves when learning these functions. My findings show that ID3 will produce an optimal decision tree for this class of Boolean functions.
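ID3's information-gain-first split choice, which the abstract analyzes on monotone read-2 width-2 DNF, can be sketched by exhaustively computing gains under the uniform distribution on {0,1}^n. The target function below is one small read-2 width-2 DNF chosen purely for illustration; it is not taken from the paper.

```python
from itertools import product
from math import log2

def entropy(labels):
    """Binary entropy of a list of 0/1 labels."""
    p = sum(labels) / len(labels)
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def best_split(f, n):
    """ID3's choice of root: the variable with highest information gain
    over the uniform distribution on {0,1}^n (enumerated exhaustively here,
    whereas ID3 proper estimates gains from a random sample)."""
    points = list(product([0, 1], repeat=n))
    labels = [f(x) for x in points]
    base = entropy(labels)
    gains = []
    for i in range(n):
        side0 = [l for x, l in zip(points, labels) if x[i] == 0]
        side1 = [l for x, l in zip(points, labels) if x[i] == 1]
        # Each side holds exactly half the points under the uniform distribution.
        gains.append(base - 0.5 * entropy(side0) - 0.5 * entropy(side1))
    return max(range(n), key=gains.__getitem__)

# Read-2 width-2 monotone DNF: f = x0 x1 + x1 x2, a length-3 "path" in the
# graph view (each variable appears at most twice, each term has width 2).
f = lambda x: int((x[0] and x[1]) or (x[1] and x[2]))
print(best_split(f, 3))   # x1 appears in both terms and has the most influence
```

The shared variable x1 has the highest information gain, so ID3 splits on it first, which is the kind of influence argument the abstract uses to show the resulting tree is optimal.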
Exact Learning Boolean Functions via the Monotone Theory
We study the learnability of Boolean functions from membership and equivalence queries. We develop the Monotone Theory, which proves that (1) any Boolean function is learnable in time polynomial in its minimal DNF size, its minimal CNF size, and the number of variables n; in particular, (2) decision trees are learnable. Our algorithms are in the model of exact learning with membership queries and unrestricted equivalence queries. The hypotheses for the equivalence queries and the output hypotheses are depth-3 formulas.
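The query model can be illustrated with a much-simplified sketch: exact learning of a monotone DNF from membership and equivalence queries, where each counterexample is walked down to a minterm of the target. This is a textbook monotone special case, not Bshouty's full Monotone Theory (CDNF) algorithm, and the equivalence oracle here is simulated by exhaustive search over {0,1}^n.

```python
from itertools import product

def walk_down(f, x):
    """Flip 1-bits to 0 while f stays true, reaching a minimal true point.
    Each trial flip is one membership query to f."""
    x = list(x)
    for i in range(len(x)):
        if x[i]:
            x[i] = 0
            if not f(x):          # membership query
                x[i] = 1
    return tuple(x)

def learn_monotone_dnf(f, n, equivalence_query):
    """Exactly learn a monotone target f as a monotone DNF.

    Each counterexample from the equivalence oracle is walked down to a
    minterm of f, which becomes one term of the hypothesis. Terminates only
    if f really is monotone."""
    terms = []                                  # each term: set of variable indices
    h = lambda x: any(all(x[i] for i in t) for t in terms)
    while True:
        cex = equivalence_query(h)
        if cex is None:
            return terms
        minterm = walk_down(f, cex)
        terms.append({i for i, b in enumerate(minterm) if b})

def make_eq(f, n):
    """Simulated equivalence oracle: return a counterexample, or None if h == f."""
    def eq(h):
        for x in product([0, 1], repeat=n):
            if f(x) != h(x):
                return x
        return None
    return eq

f = lambda x: int((x[0] and x[1]) or x[2])      # monotone target, for illustration
print(sorted(map(sorted, learn_monotone_dnf(f, 3, make_eq(f, 3)))))
```

The hypothesis built here is itself a monotone DNF; Bshouty's general algorithm instead maintains a conjunction of monotone DNFs (a depth-3 formula), one per "basis" point, which is what lets it handle arbitrary Boolean targets.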