331,536 research outputs found
Top-Down Induction of Decision Trees: Rigorous Guarantees and Inherent Limitations
Consider the following heuristic for building a decision tree for a function
. Place the most influential variable of
at the root, and recurse on the subfunctions and on the
left and right subtrees respectively; terminate once the tree is an
-approximation of . We analyze the quality of this heuristic,
obtaining near-matching upper and lower bounds:
Upper bound: For every with decision tree size and every
, this heuristic builds a decision tree of size
at most .
Lower bound: For every and , there is an with decision tree size such that
this heuristic builds a decision tree of size .
We also obtain upper and lower bounds for monotone functions:
and
respectively. The lower bound disproves conjectures of Fiat and Pechyony (2004)
and Lee (2009).
Our upper bounds yield new algorithms for properly learning decision trees
under the uniform distribution. We show that these algorithms---which are
motivated by widely employed and empirically successful top-down decision tree
learning heuristics such as ID3, C4.5, and CART---achieve provable guarantees
that compare favorably with those of the current fastest algorithm (Ehrenfeucht
and Haussler, 1989). Our lower bounds shed new light on the limitations of
these heuristics.
Finally, we revisit the classic work of Ehrenfeucht and Haussler. We extend
it to give the first uniform-distribution proper learning algorithm that
achieves polynomial sample and memory complexity, while matching its
state-of-the-art quasipolynomial runtime
Parallelizing Deadlock Resolution in Symbolic Synthesis of Distributed Programs
Previous work has shown that there are two major complexity barriers in the
synthesis of fault-tolerant distributed programs: (1) generation of fault-span,
the set of states reachable in the presence of faults, and (2) resolving
deadlock states, from where the program has no outgoing transitions. Of these,
the former closely resembles with model checking and, hence, techniques for
efficient verification are directly applicable to it. Hence, we focus on
expediting the latter with the use of multi-core technology.
We present two approaches for parallelization by considering different design
choices. The first approach is based on the computation of equivalence classes
of program transitions (called group computation) that are needed due to the
issue of distribution (i.e., inability of processes to atomically read and
write all program variables). We show that in most cases the speedup of this
approach is close to the ideal speedup and in some cases it is superlinear. The
second approach uses traditional technique of partitioning deadlock states
among multiple threads. However, our experiments show that the speedup for this
approach is small. Consequently, our analysis demonstrates that a simple
approach of parallelizing the group computation is likely to be the effective
method for using multi-core computing in the context of deadlock resolution
On the usage of the probability integral transform to reduce the complexity of multi-way fuzzy decision trees in Big Data classification problems
We present a new distributed fuzzy partitioning method to reduce the
complexity of multi-way fuzzy decision trees in Big Data classification
problems. The proposed algorithm builds a fixed number of fuzzy sets for all
variables and adjusts their shape and position to the real distribution of
training data. A two-step process is applied : 1) transformation of the
original distribution into a standard uniform distribution by means of the
probability integral transform. Since the original distribution is generally
unknown, the cumulative distribution function is approximated by computing the
q-quantiles of the training set; 2) construction of a Ruspini strong fuzzy
partition in the transformed attribute space using a fixed number of equally
distributed triangular membership functions. Despite the aforementioned
transformation, the definition of every fuzzy set in the original space can be
recovered by applying the inverse cumulative distribution function (also known
as quantile function). The experimental results reveal that the proposed
methodology allows the state-of-the-art multi-way fuzzy decision tree (FMDT)
induction algorithm to maintain classification accuracy with up to 6 million
fewer leaves.Comment: Appeared in 2018 IEEE International Congress on Big Data (BigData
Congress). arXiv admin note: text overlap with arXiv:1902.0935
Affine arithmetic-based methodology for energy hub operation-scheduling in the presence of data uncertainty
In this study, the role of self-validated computing for solving the energy hub-scheduling problem in the presence of multiple and heterogeneous sources of data uncertainties is explored and a new solution paradigm based on affine arithmetic is conceptualised. The benefits deriving from the application of this methodology are analysed in details, and several numerical results are presented and discussed
- …