Scalable Optimal Multiway-Split Decision Trees with Constraints
There has been a surge of interest in learning optimal decision trees using
mixed-integer programs (MIP) in recent years, as heuristic-based methods do not
guarantee optimality and find it challenging to incorporate constraints that
are critical for many practical applications. However, existing MIP methods
that build on an arc-based formulation do not scale well as the number of
binary variables is in the order of $\mathcal{O}(2^d N)$, where $d$ and $N$
refer to the depth of the tree and the size of the dataset. Moreover, they can
only handle sample-level constraints and linear metrics. In this paper, we
propose a novel path-based MIP formulation where the number of decision
variables is independent of $N$. We present a scalable column generation
framework to solve the MIP optimally. Our framework produces a multiway-split
tree which is more interpretable than the typical binary-split trees due to its
shorter rules. Our method can handle nonlinear metrics such as F1 score and
incorporate a broader class of constraints. We demonstrate its efficacy with
extensive experiments. We present results on datasets containing up to
1,008,372 samples while existing MIP-based decision tree models do not scale
well on data beyond a few thousand points. We report superior or competitive
results compared to the state-of-the-art MIP-based methods with up to a 24X
reduction in runtime.
Dynamic Ordered Sets with Exponential Search Trees
We introduce exponential search trees as a novel technique for converting
static polynomial space search structures for ordered sets into fully-dynamic
linear space data structures.
This leads to an optimal bound of O(sqrt(log n/loglog n)) for searching and
updating a dynamic set of n integer keys in linear space. Here searching an
integer y means finding the maximum key in the set which is smaller than or
equal to y. This problem is equivalent to the standard text book problem of
maintaining an ordered set (see, e.g., Cormen, Leiserson, Rivest, and Stein:
Introduction to Algorithms, 2nd ed., MIT Press, 2001).
The best previous deterministic linear space bound was O(log n/loglog n) due
to Fredman and Willard from STOC 1990. No better deterministic search bound was
known using polynomial space.
We also get the following worst-case linear space trade-offs between the
number n, the word length w, and the maximal key U < 2^w: O(min{loglog n+log
n/log w, (loglog n)(loglog U)/(logloglog U)}). These trade-offs are, however,
not likely to be optimal.
Our results are generalized to finger searching and string searching,
providing optimal results for both in terms of n.
Comment: Revision corrects some typos and states things better for
applications in a subsequent paper.
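The search operation defined in this abstract (find the maximum key that is smaller than or equal to y) can be illustrated with a minimal sketch. It uses plain binary search over a sorted Python list, not the exponential search tree itself, so it only reaches O(log n) per query and does not support efficient updates. The function name `predecessor` and the sample keys are assumptions made for illustration.

```python
from bisect import bisect_right


def predecessor(sorted_keys, y):
    """Return the largest key <= y, or None if every key is larger than y.

    Assumes sorted_keys is sorted in ascending order.
    """
    # bisect_right gives the index of the first key strictly greater than y,
    # so the key just before that index is the answer (if there is one).
    i = bisect_right(sorted_keys, y)
    return sorted_keys[i - 1] if i > 0 else None


keys = [3, 8, 15, 42]
print(predecessor(keys, 10))
print(predecessor(keys, 2))
```

A key that is present in the set is its own predecessor under this definition, which is why `bisect_right` rather than `bisect_left` is used.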
Unbiased split selection for classification trees based on the Gini Index
The Gini gain is one of the most common variable selection criteria in machine learning. We derive the exact distribution of the maximally selected Gini gain in the context of binary classification using continuous predictors by means of a combinatorial approach. This distribution provides formal support for the variable selection bias in favor of variables with many missing values when the Gini gain is used as the split selection criterion, and we suggest using the resulting p-value as an unbiased split selection criterion in recursive partitioning algorithms. We demonstrate the efficiency of our novel method in simulation and real-data studies from veterinary gynecology in the context of binary classification and continuous predictor variables with different numbers of missing values. Our method is extendible to categorical and ordinal predictor variables and to other split selection criteria such as the cross-entropy criterion.
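For readers unfamiliar with the quantity, the maximally selected Gini gain for a binary response and one continuous predictor can be sketched as follows. This is an illustrative computation using the standard Gini impurity 2p(1-p); it does not implement the exact distribution or the p-value criterion from the abstract, and it does not handle missing values. The function names `gini` and `max_gini_gain` are assumptions.

```python
def gini(labels):
    """Gini impurity 2p(1-p) of a list of 0/1 labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)


def max_gini_gain(x, y):
    """Largest Gini gain over all cutpoints of the continuous predictor x."""
    pairs = sorted(zip(x, y))
    xs = [value for value, _ in pairs]
    ys = [label for _, label in pairs]
    n = len(ys)
    parent = gini(ys)
    best = 0.0
    for i in range(1, n):
        # A cutpoint is only valid between two distinct predictor values.
        if xs[i] == xs[i - 1]:
            continue
        left, right = ys[:i], ys[i:]
        gain = parent - (i / n) * gini(left) - ((n - i) / n) * gini(right)
        best = max(best, gain)
    return best


print(max_gini_gain([1.0, 2.0, 3.0, 4.0], [0, 0, 1, 1]))
```

Because the maximum is taken over every admissible cutpoint, the number of candidate cutpoints affects how large the gain can become by chance, which is the kind of selection bias the abstract addresses.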
A Bulk-Parallel Priority Queue in External Memory with STXXL
We propose the design and an implementation of a bulk-parallel external
memory priority queue to take advantage of both shared-memory parallelism and
high external memory transfer speeds to parallel disks. To achieve higher
performance by decoupling item insertions and extractions, we offer two
parallelization interfaces: one using "bulk" sequences, the other by defining
"limit" items. In the design, we discuss how to parallelize insertions using
multiple heaps, and how to calculate a dynamic prediction sequence to prefetch
blocks and apply parallel multiway merge for extraction. Our experimental
results show that in the selected benchmarks the priority queue reaches 75% of
the full parallel I/O bandwidth of rotational disks and 65% of SSDs, or the
speed of sorting in external memory when bounded by computation.
Comment: extended version of SEA'15 conference paper
Network Sparsification for Steiner Problems on Planar and Bounded-Genus Graphs
We propose polynomial-time algorithms that sparsify planar and bounded-genus
graphs while preserving optimal or near-optimal solutions to Steiner problems.
Our main contribution is a polynomial-time algorithm that, given an unweighted
graph $G$ embedded on a surface of genus $g$ and a designated face $f$ bounded
by a simple cycle of length $k$, uncovers a set $F \subseteq E(G)$ of size
polynomial in $g$ and $k$ that contains an optimal Steiner tree for any set of
terminals that is a subset of the vertices of $f$.
We apply this general theorem to prove that:
* given an unweighted graph $G$ embedded on a surface of genus $g$ and a terminal set $S \subseteq V(G)$, one can in polynomial time find a set $F \subseteq E(G)$ that contains an optimal Steiner tree $T$ for $S$ and that has size polynomial in $g$ and $|E(T)|$;
* an analogous result holds for an optimal Steiner forest for a set $S$ of terminal pairs;
* given an unweighted planar graph $G$ and a terminal set $S \subseteq V(G)$, one can in polynomial time find a set $F \subseteq E(G)$ that contains an optimal (edge) multiway cut $C$ separating $S$ and that has size polynomial in $|C|$.
In the language of parameterized complexity, these results imply the first
polynomial kernels for Steiner Tree and Steiner Forest on planar and
bounded-genus graphs (parameterized by the size of the tree and forest,
respectively) and for (Edge) Multiway Cut on planar graphs (parameterized by
the size of the cutset). Additionally, we obtain a weighted variant of our main
contribution.
Half-integrality, LP-branching and FPT Algorithms
A recent trend in parameterized algorithms is the application of polytope
tools (specifically, LP-branching) to FPT algorithms (e.g., Cygan et al., 2011;
Narayanaswamy et al., 2012). However, although interesting results have been
achieved, the methods require the underlying polytope to have very restrictive
properties (half-integrality and persistence), which are known only for a few
problems (essentially Vertex Cover (Nemhauser and Trotter, 1975) and Node
Multiway Cut (Garg et al., 1994)). Taking a slightly different approach, we
view half-integrality as a discrete relaxation of a problem, e.g., a
relaxation of the search space from $\{0,1\}^V$ to $\{0,1/2,1\}^V$ such that
the new problem admits a polynomial-time exact solution. Using tools from CSP
(in particular Thapper and Živný, 2012) to study the existence of such
relaxations, we provide a much broader class of half-integral polytopes with
the required properties, unifying and extending previously known cases.
In addition to the insight into problems with half-integral relaxations, our
results yield a range of new and improved FPT algorithms, including an
$O^*(|\Sigma|^{2k})$-time algorithm for node-deletion Unique Label Cover with
label set $\Sigma$ and an $O^*(4^k)$-time algorithm for Group Feedback Vertex
Set, including the setting where the group is only given by oracle access. All
these significantly improve on previous results. The latter result also implies
the first single-exponential time FPT algorithm for Subset Feedback Vertex Set,
answering an open question of Cygan et al. (2012).
Additionally, we propose a network flow-based approach to solve some cases of
the relaxation problem. This gives the first linear-time FPT algorithm for
edge-deletion Unique Label Cover.
Comment: Added results on linear-time FPT algorithms (not present in SODA
paper)
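The idea of relaxing the search space from 0/1 values to the values 0, 1/2 and 1 can be made concrete on Vertex Cover, the classical half-integral case cited in this abstract. The sketch below is a brute-force enumeration for illustration only: it is exponential in the number of vertices and is not the polynomial-time method from the paper. The function name `min_cover_cost` and the triangle instance are assumptions.

```python
from fractions import Fraction
from itertools import product


def min_cover_cost(vertices, edges, values):
    """Minimum total weight of an assignment x: vertices -> values
    such that x[u] + x[v] >= 1 holds for every edge (u, v)."""
    best = None
    for assignment in product(values, repeat=len(vertices)):
        x = dict(zip(vertices, assignment))
        if all(x[u] + x[v] >= 1 for u, v in edges):
            cost = sum(assignment)
            if best is None or cost < best:
                best = cost
    return best


# Triangle graph: any integral vertex cover needs two vertices,
# while assigning 1/2 to every vertex is feasible with total weight 3/2.
triangle_vertices = ["a", "b", "c"]
triangle_edges = [("a", "b"), ("b", "c"), ("a", "c")]
half = Fraction(1, 2)

integral = min_cover_cost(triangle_vertices, triangle_edges, [0, 1])
relaxed = min_cover_cost(triangle_vertices, triangle_edges, [0, half, 1])
print(integral, relaxed)
```

The gap between the two optima on the triangle shows why the relaxed problem is a genuine relaxation: it can have a strictly cheaper optimum than the integral problem.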
- …