    Scalable Optimal Multiway-Split Decision Trees with Constraints

    There has been a surge of interest in recent years in learning optimal decision trees using mixed-integer programs (MIP), as heuristic-based methods do not guarantee optimality and struggle to incorporate constraints that are critical for many practical applications. However, existing MIP methods that build on an arc-based formulation do not scale well, as the number of binary variables is on the order of $\mathcal{O}(2^d N)$, where $d$ and $N$ refer to the depth of the tree and the size of the dataset, respectively. Moreover, they can only handle sample-level constraints and linear metrics. In this paper, we propose a novel path-based MIP formulation where the number of decision variables is independent of $N$. We present a scalable column generation framework to solve the MIP optimally. Our framework produces a multiway-split tree, which is more interpretable than the typical binary-split trees due to its shorter rules. Our method can handle nonlinear metrics such as the F1 score and incorporate a broader class of constraints. We demonstrate its efficacy with extensive experiments. We present results on datasets containing up to 1,008,372 samples, while existing MIP-based decision tree models do not scale well on data beyond a few thousand points. We report superior or competitive results compared to the state-of-the-art MIP-based methods, with up to a 24X reduction in runtime.
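
    The distinction between linear and nonlinear metrics mentioned above can be illustrated with a minimal sketch. Accuracy is a sum of per-sample terms, so the accuracy of two pooled datasets is the size-weighted average of their accuracies; the F1 score is a ratio of counts and has no such decomposition. The function names and the toy confusion counts below are assumptions made for illustration and are not taken from the paper's formulation.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of correctly classified samples (linear in per-sample terms)."""
    return (tp + tn) / (tp + fp + fn + tn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written as a ratio of counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Two hypothetical groups of 10 samples each: (tp, fp, fn, tn).
group_a = (1, 0, 0, 9)   # perfectly classified
group_b = (0, 5, 5, 0)   # entirely misclassified
pooled = tuple(a + b for a, b in zip(group_a, group_b))

# Accuracy of the pooled data equals the average of the group accuracies.
avg_accuracy = (accuracy(*group_a) + accuracy(*group_b)) / 2
pooled_accuracy = accuracy(*pooled)

# F1 of the pooled data differs from the average of the group F1 scores.
avg_f1 = (f1_score(*group_a[:3]) + f1_score(*group_b[:3])) / 2
pooled_f1 = f1_score(*pooled[:3])
```

    Here the pooled accuracy matches the group average, while the pooled F1 score does not, which is why F1 cannot be written as a linear objective over per-sample variables.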

    Dynamic Ordered Sets with Exponential Search Trees

    We introduce exponential search trees as a novel technique for converting static polynomial space search structures for ordered sets into fully-dynamic linear space data structures. This leads to an optimal bound of O(sqrt(log n/loglog n)) for searching and updating a dynamic set of n integer keys in linear space. Here searching an integer y means finding the maximum key in the set which is smaller than or equal to y. This problem is equivalent to the standard textbook problem of maintaining an ordered set (see, e.g., Cormen, Leiserson, Rivest, and Stein: Introduction to Algorithms, 2nd ed., MIT Press, 2001). The best previous deterministic linear space bound was O(log n/loglog n), due to Fredman and Willard from STOC 1990. No better deterministic search bound was known using polynomial space. We also get the following worst-case linear space trade-offs between the number n, the word length w, and the maximal key U < 2^w: O(min{loglog n+log n/log w, (loglog n)(loglog U)/(logloglog U)}). These trade-offs are, however, not likely to be optimal. Our results are generalized to finger searching and string searching, providing optimal results for both in terms of n. Comment: Revision corrects some typos and states things better for applications in a subsequent paper.
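
    The search operation defined above (find the maximum key that is smaller than or equal to y) can be sketched with a plain sorted list. This is only an illustration of the query semantics under the assumption of in-memory integer keys; it is not an exponential search tree, and it gives O(log n) search with O(n) insertion rather than the bounds stated in the abstract. The function names are hypothetical.

```python
import bisect


def predecessor(sorted_keys, y):
    """Return the largest key <= y, or None if every key is greater than y."""
    i = bisect.bisect_right(sorted_keys, y)
    return sorted_keys[i - 1] if i > 0 else None


def insert(sorted_keys, key):
    """Insert key into the sorted list, keeping it sorted and duplicate-free."""
    i = bisect.bisect_left(sorted_keys, key)
    if i == len(sorted_keys) or sorted_keys[i] != key:
        sorted_keys.insert(i, key)


keys = []
for k in (10, 3, 7, 3):
    insert(keys, k)
```

    With these keys, `predecessor(keys, 8)` returns 7, and a query below the smallest key returns `None`.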

    Unbiased split selection for classification trees based on the Gini Index

    The Gini gain is one of the most common variable selection criteria in machine learning. We derive the exact distribution of the maximally selected Gini gain in the context of binary classification using continuous predictors by means of a combinatorial approach. This distribution provides formal support for the variable selection bias in favor of variables with many missing values when the Gini gain is used as the split selection criterion, and we suggest using the resulting p-value as an unbiased split selection criterion in recursive partitioning algorithms. We demonstrate the efficiency of our novel method in simulation studies and real-data studies from veterinary gynecology in the context of binary classification and continuous predictor variables with different numbers of missing values. Our method is extendible to categorical and ordinal predictor variables and to other split selection criteria such as the cross-entropy criterion.
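
    For readers unfamiliar with the criterion, the maximally selected Gini gain for a binary response and one continuous predictor can be sketched as follows. This computes only the plain (biased) criterion discussed above, not the exact distribution or the p-value proposed in the paper; it assumes 0/1 labels and no missing values, and the function names are hypothetical.

```python
def gini_impurity(labels):
    """Gini impurity 2p(1-p) of a list of 0/1 labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def max_gini_gain(x, y):
    """Return (best_gain, best_cutpoint) over all cutpoints of predictor x."""
    pairs = sorted(zip(x, y))
    ys = [label for _, label in pairs]
    n = len(ys)
    parent = gini_impurity(ys)
    best_gain, best_cut = 0.0, None
    for i in range(1, n):
        # A cutpoint is only valid between two distinct predictor values.
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left, right = ys[:i], ys[i:]
        gain = (parent
                - (i / n) * gini_impurity(left)
                - ((n - i) / n) * gini_impurity(right))
        if gain > best_gain:
            best_gain = gain
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_gain, best_cut


gain, cut = max_gini_gain([1, 2, 3, 4], [0, 0, 1, 1])
```

    Because the maximum is taken over every admissible cutpoint, the number of candidate cutpoints influences the value of the criterion, which is the source of the selection bias the abstract refers to.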

    A Bulk-Parallel Priority Queue in External Memory with STXXL

    We propose the design and an implementation of a bulk-parallel external memory priority queue to take advantage of both shared-memory parallelism and high external memory transfer speeds to parallel disks. To achieve higher performance by decoupling item insertions and extractions, we offer two parallelization interfaces: one using "bulk" sequences, the other by defining "limit" items. In the design, we discuss how to parallelize insertions using multiple heaps, and how to calculate a dynamic prediction sequence to prefetch blocks and apply parallel multiway merge for extraction. Our experimental results show that in the selected benchmarks the priority queue reaches 75% of the full parallel I/O bandwidth of rotational disks and 65% of SSDs, or the speed of sorting in external memory when bounded by computation. Comment: extended version of SEA'15 conference paper.
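
    The multiway merge used for extraction can be sketched sequentially and in memory. The following is a minimal illustration of merging several sorted runs with a heap of run heads; it is not the STXXL interface, it does not model parallelism, prefetching, or external memory, and the function name is hypothetical.

```python
import heapq


def multiway_merge(runs):
    """Yield the items of several sorted runs in ascending order."""
    # The heap holds one entry per non-empty run: (head item, run index, position).
    heap = [(run[0], idx, 0) for idx, run in enumerate(runs) if run]
    heapq.heapify(heap)
    while heap:
        item, idx, pos = heapq.heappop(heap)
        yield item
        # Advance within the run the item came from, if it has more items.
        if pos + 1 < len(runs[idx]):
            heapq.heappush(heap, (runs[idx][pos + 1], idx, pos + 1))


merged = list(multiway_merge([[1, 4, 9], [2, 3, 10], [], [5]]))
```

    Each extraction touches only the heads of the runs, which is what makes it possible to predict which blocks will be needed next.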

    Network Sparsification for Steiner Problems on Planar and Bounded-Genus Graphs

    We propose polynomial-time algorithms that sparsify planar and bounded-genus graphs while preserving optimal or near-optimal solutions to Steiner problems. Our main contribution is a polynomial-time algorithm that, given an unweighted graph $G$ embedded on a surface of genus $g$ and a designated face $f$ bounded by a simple cycle of length $k$, uncovers a set $F \subseteq E(G)$ of size polynomial in $g$ and $k$ that contains an optimal Steiner tree for any set of terminals that is a subset of the vertices of $f$. We apply this general theorem to prove that:
    * given an unweighted graph $G$ embedded on a surface of genus $g$ and a terminal set $S \subseteq V(G)$, one can in polynomial time find a set $F \subseteq E(G)$ that contains an optimal Steiner tree $T$ for $S$ and that has size polynomial in $g$ and $|E(T)|$;
    * an analogous result holds for an optimal Steiner forest for a set $S$ of terminal pairs;
    * given an unweighted planar graph $G$ and a terminal set $S \subseteq V(G)$, one can in polynomial time find a set $F \subseteq E(G)$ that contains an optimal (edge) multiway cut $C$ separating $S$ and that has size polynomial in $|C|$.
    In the language of parameterized complexity, these results imply the first polynomial kernels for Steiner Tree and Steiner Forest on planar and bounded-genus graphs (parameterized by the size of the tree and forest, respectively) and for (Edge) Multiway Cut on planar graphs (parameterized by the size of the cutset). Additionally, we obtain a weighted variant of our main contribution.

    Half-integrality, LP-branching and FPT Algorithms

    A recent trend in parameterized algorithms is the application of polytope tools (specifically, LP-branching) to FPT algorithms (e.g., Cygan et al., 2011; Narayanaswamy et al., 2012). However, although interesting results have been achieved, the methods require the underlying polytope to have very restrictive properties (half-integrality and persistence), which are known only for a few problems (essentially Vertex Cover (Nemhauser and Trotter, 1975) and Node Multiway Cut (Garg et al., 1994)). Taking a slightly different approach, we view half-integrality as a discrete relaxation of a problem, e.g., a relaxation of the search space from $\{0,1\}^V$ to $\{0,1/2,1\}^V$ such that the new problem admits a polynomial-time exact solution. Using tools from CSP (in particular Thapper and Živný, 2012) to study the existence of such relaxations, we provide a much broader class of half-integral polytopes with the required properties, unifying and extending previously known cases. In addition to the insight into problems with half-integral relaxations, our results yield a range of new and improved FPT algorithms, including an $O^*(|\Sigma|^{2k})$-time algorithm for node-deletion Unique Label Cover with label set $\Sigma$ and an $O^*(4^k)$-time algorithm for Group Feedback Vertex Set, including the setting where the group is only given by oracle access. All these significantly improve on previous results. The latter result also implies the first single-exponential time FPT algorithm for Subset Feedback Vertex Set, answering an open question of Cygan et al. (2012). Additionally, we propose a network flow-based approach to solve some cases of the relaxation problem. This gives the first linear-time FPT algorithm for edge-deletion Unique Label Cover. Comment: Added results on linear-time FPT algorithms (not present in SODA paper).
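
    The idea of relaxing the search space from $\{0,1\}^V$ to $\{0,1/2,1\}^V$ can be illustrated on Vertex Cover, the classical half-integral case cited above. The sketch below enumerates all assignments by brute force, which takes exponential time and is meant only to show the two search spaces side by side; it is not the polynomial-time method the abstract refers to, and the function name and the triangle example are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product


def min_cover(vertices, edges, values):
    """Minimize sum(x) subject to x[u] + x[v] >= 1 on every edge,
    with each x[v] restricted to the given set of values (brute force)."""
    best_cost, best = None, None
    for assignment in product(values, repeat=len(vertices)):
        x = dict(zip(vertices, assignment))
        if all(x[u] + x[v] >= 1 for u, v in edges):
            cost = sum(assignment)
            if best_cost is None or cost < best_cost:
                best_cost, best = cost, x
    return best_cost, best


half = Fraction(1, 2)
triangle_vertices = ["a", "b", "c"]
triangle_edges = [("a", "b"), ("b", "c"), ("a", "c")]

# Original search space {0,1}^V: an integral vertex cover.
integral_cost, _ = min_cover(triangle_vertices, triangle_edges, (0, 1))

# Relaxed search space {0,1/2,1}^V: the half-integral relaxation.
relaxed_cost, relaxed_x = min_cover(triangle_vertices, triangle_edges, (0, half, 1))
```

    On the triangle, the integral optimum is 2, while the relaxed optimum is 3/2 with every vertex set to 1/2, so the relaxation gives a lower bound that a branching algorithm can exploit.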