295 research outputs found
Optimal Sparse Decision Trees
Decision tree algorithms have been among the most popular algorithms for
interpretable (transparent) machine learning since the early 1980s. The
problem that has plagued decision tree algorithms since their inception is
their lack of optimality, or lack of guarantees of closeness to optimality:
decision tree algorithms are often greedy or myopic, and sometimes produce
unquestionably suboptimal models. Hardness of decision tree optimization is
both a theoretical and practical obstacle, and even careful mathematical
programming approaches have not been able to solve these problems efficiently.
This work introduces the first practical algorithm for optimal decision trees
for binary variables. The algorithm is a co-design of analytical bounds that
reduce the search space and modern systems techniques, including data
structures and a custom bit-vector library. Our experiments highlight
advantages in scalability, speed, and proof of optimality.
Comment: 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada
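As a rough illustration of why bit vectors suit trees over binary variables, the sketch below stores each binary feature as a Python integer whose bit i marks whether sample i has that feature set. The samples captured by a leaf, and the errors the leaf makes, can then be counted with bitwise operations. This is not the authors' library: the helper names `to_bits` and `leaf_errors` and the toy data are hypothetical.

```python
def to_bits(column):
    """Pack a list of 0/1 values into an int; bit i stands for sample i."""
    bits = 0
    for i, value in enumerate(column):
        if value:
            bits |= 1 << i
    return bits


def leaf_errors(captured, labels):
    """Misclassified samples in a leaf that predicts its majority label."""
    total = bin(captured).count("1")
    positives = bin(captured & labels).count("1")
    return min(positives, total - positives)


# Toy data (hypothetical): 6 samples, 2 binary features, binary labels.
f0 = to_bits([1, 1, 0, 0, 1, 0])
f1 = to_bits([1, 0, 1, 0, 1, 1])
y = to_bits([1, 0, 0, 0, 1, 1])
all_samples = (1 << 6) - 1

# Leaf "f0 = 1 and f1 = 1": one bitwise AND selects its samples.
leaf_a = f0 & f1
# Leaf "f0 = 0": complement of f0, masked to the 6 valid samples.
leaf_b = all_samples & ~f0

errors_a = leaf_errors(leaf_a, y)
errors_b = leaf_errors(leaf_b, y)
```

Because every leaf is a conjunction of feature tests, evaluating a candidate tree reduces to a handful of AND and population-count operations per leaf, which is the kind of inner loop a dedicated bit-vector library would speed up.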
The Case for Browser Provenance
In our increasingly networked world, web browsers are important applications. Originally an interface tool for accessing distributed documents, browsers have become ubiquitous, incorporating a significant portion of user interaction. A modern browser now also reads email, plays media, edits documents, and runs applications. Consequently, browsers process large quantities of data, and must record metadata, such as history, to help users manage their data. Most of the metadata that modern browsers record is actually provenance: metadata that captures the causality and lineage of data obtained via the browser. We demonstrate that characterizing browser metadata as provenance and then applying techniques from the provenance research community enables new browser functionality. For example, provenance can improve both history and web search by indicating contextual and personal relationships between data items. Users can also answer complex questions about the origins of their data by querying provenance. Our initial results suggest these features are feasible to implement and could perform well in modern browsers.
Engineering and Applied Science
Distributed, Secure Load Balancing with Skew, Heterogeneity, and Churn
Numerous proposals exist for load balancing in peer-to-peer (p2p) networks. Some focus on namespace balancing, making the distance between nodes as uniform as possible. This technique works well under ideal conditions, but not under those found empirically. Instead, researchers have found heavy-tailed query distributions (skew), high rates of node join and leave (churn), and wide variation in node network and storage capacity (heterogeneity). Other approaches tackle these less-than-ideal conditions, but give up on important security properties. We propose an algorithm that both facilitates good performance and does not dilute security. Our algorithm, k-Choices, achieves load balance by greedily matching nodes' target workloads with actual applied workloads through limited sampling, and limits any fundamental decrease in security by basing each node's set of potential identifiers on a single certificate. Our algorithm compares favorably to four others in trace-driven simulations. We have implemented our algorithm and found that it improved aggregate throughput by 20% in a widely heterogeneous system in our experiments.
Engineering and Applied Science
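The idea described in this abstract can be sketched roughly as follows: a joining node derives a small, fixed set of candidate identifiers from its one certificate, estimates the workload it would inherit at each candidate position, and greedily keeps the candidate whose workload is closest to its target. This is a simplified illustration, not the published k-Choices algorithm. The ring size, the hash-based derivation, the function names, and the per-key load table are all assumptions, and the sketch scores only the joining node's own mismatch.

```python
import hashlib

RING = 2 ** 16  # small identifier space, chosen only for this sketch


def candidate_ids(certificate, k):
    """Derive k candidate IDs from one certificate (assumed scheme)."""
    ids = []
    for i in range(k):
        digest = hashlib.sha256(certificate + i.to_bytes(4, "big")).digest()
        ids.append(int.from_bytes(digest[:8], "big") % RING)
    return ids


def load_if_joined(new_id, existing_ids, key_load):
    """Work the new node would take over: keys in (predecessor, new_id]."""
    preds = [n for n in existing_ids if n < new_id]
    # With no smaller ID, the predecessor is the largest ID, wrapped around.
    pred = max(preds) if preds else max(existing_ids) - RING
    return sum(load for key, load in key_load.items()
               if pred < key <= new_id or pred < key - RING <= new_id)


def choose_id(certificate, k, target, existing_ids, key_load):
    """Greedy pick: the candidate whose inherited load is closest to target."""
    return min(candidate_ids(certificate, k),
               key=lambda c: abs(target - load_if_joined(c, existing_ids, key_load)))


# Hypothetical ring with two nodes and three keys of unequal (skewed) load.
existing = [100, 200]
loads = {50: 1.0, 150: 2.0, 250: 4.0}
mid_arc = load_if_joined(180, existing, loads)   # takes key 150
wrapped = load_if_joined(60, existing, loads)    # takes keys 250 and 50
chosen = choose_id(b"example-certificate", 8, 3.0, existing, loads)
```

Since the candidates are a deterministic function of the certificate, other nodes can recompute them and reject an identifier outside that set, which is how the choice stays limited to k positions.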
Performance Introspection of Graph Databases
The explosion of graph data in social and biological networks, recommendation systems, provenance databases, etc. makes graph storage and processing of paramount importance. We present a performance introspection framework for graph databases, PIG, which provides both a toolset and methodology for understanding graph database performance. PIG consists of a hierarchical collection of benchmarks that compose to produce performance models; the models provide a way to illuminate the strengths and weaknesses of a particular implementation. The suite has three layers of benchmarks: primitive operations, composite access patterns, and graph algorithms. While the framework could be used to compare different graph database systems, its primary goal is to help explain the observed performance of a particular system. Such introspection allows one to evaluate the degree to which systems exploit their knowledge of graph access patterns. We present both the PIG methodology and infrastructure and then demonstrate its efficacy by analyzing the popular Neo4j and DEX graph databases.
Engineering and Applied Science
Mining the Web for Medical Hypothesis: A Proof-of-Concept System
As the prevalence of blogs, discussion forums, and online news services continues to grow, so too does the portion of this Web content that relates to health and medicine. We propose that everyday, medically-oriented Web content is a valuable and viable data source for medical hypothesis generation and testing, despite its being noisy. In this paper, we present a proof-of-concept system supporting this notion. We construct a corpus comprising news articles relating to the drugs Vioxx, Naproxen, and Ibuprofen that were published between 1998 and 2002. Using this corpus, we show that there was a significant link between Vioxx and the concept "Myocardial Infarction" well before the drug was withdrawn from the market in 2004. Indeed, within the Vioxx-related content, the concept ranks amongst the top 3.3% in terms of importance. When compared with the Naproxen and Ibuprofen control literatures, the term occurs significantly more frequently in the Vioxx-related content.
Engineering and Applied Science
A General-Purpose Provenance Library
Most provenance capture takes place inside particular tools - a workflow engine, a database, an operating system, or an application. However, most users have an existing toolset - a collection of different tools that work well for their needs and with which they are comfortable. Currently, such users have limited ability to collect provenance without disrupting their work and changing environments, which most users are hesitant to do. Even users who are willing to adopt new tools may realize limited benefit from provenance in those tools if they do not integrate with their entire environment, which may include multiple languages and frameworks. We present the Core Provenance Library (CPL), a portable, multi-lingual library that application programmers can easily incorporate into a variety of tools to collect and integrate provenance. Although the manual instrumentation adds extra work for application programmers, we show that in most cases, the work is minimal, and the resulting system solves several problems that plague more constrained provenance collection systems.
Engineering and Applied Science
Provenance Map Orbiter: Interactive Exploration of Large Provenance Graphs
Provenance systems can produce enormous provenance graphs that can be used for a variety of tasks from determining the inputs to a particular process to debugging entire workflow executions or tracking difficult-to-find dependencies. Visualization can be a useful tool to support such tasks, but graphs of such scale (thousands to millions of nodes) are notoriously difficult to visualize. This paper presents the Provenance Map Orbiter, a tool for interactively exploring large provenance graphs using graph summarization and semantic zoom. It presents its users with a high-level abstracted view of the graph and the ability to incrementally drill down to the details.
Engineering and Applied Science
Hierarchical File Systems Are Dead
For over forty years, we have assumed hierarchical file system namespaces. These namespaces were a rudimentary attempt at simple organization. As users have begun to interact with increasing amounts of data and are increasingly demanding search capability, such a simple hierarchical model has outlasted its usefulness. For this reason, we should design file systems whose organizations map to the ways we access and manipulate data now. We present a new file system architecture in which we replace the hierarchical namespace with a tagged, search-based one.
Engineering and Applied Science
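To make the contrast with path-based naming concrete, here is a minimal sketch of a tagged, search-based namespace: objects carry a set of tags instead of living at one path, and lookup is an intersection over an inverted index. The class `TagStore`, its methods, and the example tags are hypothetical and do not reflect the architecture proposed in the paper.

```python
from collections import defaultdict


class TagStore:
    """Toy namespace: objects are found by tags, not by a directory path."""

    def __init__(self):
        self._by_tag = defaultdict(set)   # tag -> ids of objects with it
        self._tags = {}                   # object id -> its tags

    def add(self, obj_id, tags):
        """Register an object; this sketch assumes each id is added once."""
        self._tags[obj_id] = set(tags)
        for tag in tags:
            self._by_tag[tag].add(obj_id)

    def find(self, *tags):
        """Return the ids of objects that carry every requested tag."""
        if not tags:
            return set(self._tags)
        sets = [self._by_tag.get(tag, set()) for tag in tags]
        return set.intersection(*sets)


store = TagStore()
store.add(1, ["photo", "2009", "vacation"])
store.add(2, ["photo", "2009", "work"])
store.add(3, ["paper", "2009", "work"])

work_photos = store.find("photo", "work")
from_2009 = store.find("2009")
```

In a hierarchy, object 2 would have to live under either a photo directory or a work directory; here both queries reach it without choosing a canonical location.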