295 research outputs found

Optimal Sparse Decision Trees

Authors: Hu Xiyang; Rudin Cynthia; Seltzer Margo
Publication date: 17/09/2020

Decision tree algorithms have been among the most popular algorithms for interpretable (transparent) machine learning since the early 1980's. The problem that has plagued decision tree algorithms since their inception is their lack of optimality, or lack of guarantees of closeness to optimality: decision tree algorithms are often greedy or myopic, and sometimes produce unquestionably suboptimal models. Hardness of decision tree optimization is both a theoretical and practical obstacle, and even careful mathematical programming approaches have not been able to solve these problems efficiently. This work introduces the first practical algorithm for optimal decision trees for binary variables. The algorithm is a co-design of analytical bounds that reduce the search space and modern systems techniques, including data structures and a custom bit-vector library. Our experiments highlight advantages in scalability, speed, and proof of optimality.

Comment: 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada

arXiv.org e-Print Archive

The Case for Browser Provenance

Authors: Margo Daniel Wyatt; Seltzer Margo I.
Publication venue: USENIX Association
Publication date: 06/10/2011

In our increasingly networked world, web browsers are important applications. Originally an interface tool for accessing distributed documents, browsers have become ubiquitous, incorporating a significant portion of user interaction. A modern browser now also reads email, plays media, edits documents, and runs applications. Consequently, browsers process large quantities of data, and must record metadata, such as history, to help users manage their data. Most of the metadata that modern browsers record is actually provenance – metadata that captures the causality and lineage of data obtained via the browser. We demonstrate that characterizing browser metadata as provenance and then applying techniques from the provenance research community enables new browser functionality. For example, provenance can improve both history and web search by indicating contextual and personal relationships between data items. Users can also answer complex questions about the origins of their data by querying provenance. Our initial results suggest these features are feasible to implement and could perform well in modern browsers.

Engineering and Applied Science

Harvard University - DASH

Distributed, Secure Load Balancing with Skew, Heterogeneity, and Churn

Authors: Ledlie Jonathan; Seltzer Margo
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2005

Numerous proposals exist for load balancing in peer-to-peer (p2p) networks. Some focus on namespace balancing, making the distance between nodes as uniform as possible. This technique works well under ideal conditions, but not under those found empirically. Instead, researchers have found heavy-tailed query distributions (skew), high rates of node join and leave (churn), and wide variation in node network and storage capacity (heterogeneity). Other approaches tackle these less-than-ideal conditions, but give up on important security properties. We propose an algorithm that both facilitates good performance and does not dilute security. Our algorithm, k-Choices, achieves load balance by greedily matching nodes’ target workloads with actual applied workloads through limited sampling, and limits any fundamental decrease in security by basing each node’s set of potential identifiers on a single certificate. Our algorithm compares favorably to four others in trace-driven simulations. We have implemented our algorithm and found that it improved aggregate throughput by 20% in a widely heterogeneous system in our experiments.

Engineering and Applied Science

CiteSeerX

Harvard University - DASH

Performance Introspection of Graph Databases

Authors: Macko Peter; Margo Daniel Wyatt; Seltzer Margo I.
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2013

The explosion of graph data in social and biological networks, recommendation systems, provenance databases, etc. makes graph storage and processing of paramount importance. We present a performance introspection framework for graph databases, PIG, which provides both a toolset and methodology for understanding graph database performance. PIG consists of a hierarchical collection of benchmarks that compose to produce performance models; the models provide a way to illuminate the strengths and weaknesses of a particular implementation. The suite has three layers of benchmarks: primitive operations, composite access patterns, and graph algorithms. While the framework could be used to compare different graph database systems, its primary goal is to help explain the observed performance of a particular system. Such introspection allows one to evaluate the degree to which systems exploit their knowledge of graph access patterns. We present both the PIG methodology and infrastructure and then demonstrate its efficacy by analyzing the popular Neo4j and DEX graph databases.

Engineering and Applied Science

CiteSeerX

Crossref

Harvard University - DASH

Mining the Web for Medical Hypothesis: A Proof-of-Concept System

Authors: Maclean Diana; Seltzer Margo I.
Publication date: 14/05/2012

As the prevalence of blogs, discussion forums, and online news services continues to grow, so too does the portion of this Web content that relates to health and medicine. We propose that everyday, medically-oriented Web content is a valuable and viable data source for medical hypothesis generation and testing, despite its being noisy. In this paper, we present a proof-of-concept system supporting this notion. We construct a corpus comprising news articles relating to the drugs Vioxx, Naproxen and Ibuprofen that were published between 1998 and 2002. Using this corpus, we show that there was a significant link between Vioxx and the concept “Myocardial Infarction” well before the drug was withdrawn from the market in 2004. Indeed, within the Vioxx-related content, the concept ranks amongst the top 3.3% in terms of importance. When compared with the Naproxen and Ibuprofen control literatures, the term occurs significantly more frequently in the Vioxx-related content.

Engineering and Applied Science

Harvard University - DASH

A General-Purpose Provenance Library

Authors: Macko Peter; Seltzer Margo I.
Publication venue: USENIX Association
Publication date: 26/11/2012

Most provenance capture takes place inside particular tools - a workflow engine, a database, an operating system, or an application. However, most users have an existing toolset - a collection of different tools that work well for their needs and with which they are comfortable. Currently, such users have limited ability to collect provenance without disrupting their work and changing environments, which most users are hesitant to do. Even users who are willing to adopt new tools may realize limited benefit from provenance in those tools if they do not integrate with their entire environment, which may include multiple languages and frameworks. We present the Core Provenance Library (CPL), a portable, multi-lingual library that application programmers can easily incorporate into a variety of tools to collect and integrate provenance. Although the manual instrumentation adds extra work for application programmers, we show that in most cases, the work is minimal, and the resulting system solves several problems that plague more constrained provenance collection systems.

Engineering and Applied Science

Harvard University - DASH

Provenance Map Orbiter: Interactive Exploration of Large Provenance Graphs

Authors: Macko Peter; Seltzer Margo I.
Publication venue: USENIX Association
Publication date: 07/10/2011

Provenance systems can produce enormous provenance graphs that can be used for a variety of tasks from determining the inputs to a particular process to debugging entire workflow executions or tracking difficult-to-find dependencies. Visualization can be a useful tool to support such tasks, but graphs of such scale (thousands to millions of nodes) are notoriously difficult to visualize. This paper presents the Provenance Map Orbiter, a tool for interactively exploring large provenance graphs using graph summarization and semantic zoom. It presents its users with a high-level abstracted view of the graph and the ability to incrementally drill down to the details.

Engineering and Applied Science

Harvard University - DASH

Hierarchical File Systems Are Dead

Authors: Murphy Nicholas; Seltzer Margo I.
Publication venue: USENIX Association
Publication date: 21/09/2011

For over forty years, we have assumed hierarchical file system namespaces. These namespaces were a rudimentary attempt at simple organization. As users have begun to interact with increasing amounts of data and are increasingly demanding search capability, such a simple hierarchical model has outlasted its usefulness. For this reason, we should design file systems whose organizations map to the ways we access and manipulate data now. We present a new file system architecture in which we replace the hierarchical namespace with a tagged, search-based one.

Engineering and Applied Science

Harvard University - DASH