
    Inductive benchmarking for purely functional data structures

    Every designer of a new data structure wants to know how well it performs in comparison with others. But finding, coding and testing applications as benchmarks can be tedious and time-consuming. Besides, how a benchmark uses a data structure may considerably affect its apparent efficiency, so the choice of applications may bias the results. We address these problems by developing a tool for inductive benchmarking. This tool, Auburn, can generate benchmarks across a wide distribution of uses. We precisely define 'the use of a data structure', upon which we build the core algorithms of Auburn: how to generate a benchmark from a description of use, and how to extract a description of use from an application. We then apply inductive classification techniques to obtain decision trees for the choice between competing data structures. We test Auburn by benchmarking several implementations of three common data structures: queues, random-access lists and heaps. These and other results show Auburn to be a useful and accurate tool, but they also reveal some limitations of the approach.
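
    As a rough illustration of the central idea (not Auburn's actual API): a 'description of use' can be thought of as a distribution over a data structure's operations, and a benchmark as an operation trace drawn from that distribution. A minimal Haskell sketch with made-up names (Profile, genTrace), here for a FIFO queue:

        -- Illustrative sketch only: Profile and genTrace are invented names, not Auburn's API.
        data Op = Snoc Int | Tail | Head deriving Show

        -- A "description of use": relative frequencies of the queue operations
        -- (the head-query frequency is the remainder).
        data Profile = Profile { pSnoc, pTail :: Double }

        -- A tiny deterministic pseudo-random stream in [0,1), base library only.
        rands :: Int -> [Double]
        rands seed = map (\x -> fromIntegral x / 2147483648) (tail (iterate step seed))
          where step x = (1103515245 * x + 12345) `mod` 2147483648

        -- Generate a benchmark: a trace of n operations drawn from the profile.
        genTrace :: Profile -> Int -> Int -> [Op]
        genTrace prof seed n = take n (map pick (rands seed))
          where pick x | x < pSnoc prof              = Snoc (floor (x * 100))
                       | x < pSnoc prof + pTail prof = Tail
                       | otherwise                   = Head

        main :: IO ()
        main = mapM_ print (genTrace (Profile 0.5 0.3) 42 10)

    Running many such traces, generated from varied profiles, against competing implementations is what produces the timing data from which decision trees can then be induced.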

    Incremental Dead State Detection in Logarithmic Time

    Identifying live and dead states in an abstract transition system is a recurring problem in formal verification; for example, it arises in our recent work on efficiently deciding regex constraints in SMT. However, state-of-the-art graph algorithms for maintaining reachability information incrementally (that is, as states are visited and before the entire state space is explored) assume that new edges can be added from any state at any time, whereas in many applications, outgoing edges are added from each state as it is explored. To formalize the latter situation, we propose guided incremental digraphs (GIDs), incremental graphs which support labeling closed states (states which will not receive further outgoing edges). Our main result is that dead state detection in GIDs is solvable in $O(\log m)$ amortized time per edge for $m$ edges, improving upon $O(\sqrt{m})$ per edge due to Bender, Fineman, Gilbert, and Tarjan (BFGT) for general incremental directed graphs. We introduce two algorithms for GIDs: one establishing the logarithmic time bound, and a second algorithm to explore a lazy heuristics-based approach. To enable an apples-to-apples experimental comparison, we implemented both algorithms, two simpler baselines, and the state-of-the-art BFGT baseline using a common directed graph interface in Rust. Our evaluation shows 110-530x speedups over BFGT for the largest input graphs over a range of graph classes, random graphs, and graphs arising from regex benchmarks. Comment: 22 pages + references
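
    The abstract does not spell out the GID operations, so the Haskell sketch below assumes the natural interface: add an edge, close a state, and query deadness, with a state taken (as an assumption for illustration) to be dead once every state reachable from it, itself included, is closed. It is a naive recompute-by-search baseline, nothing like the paper's $O(\log m)$ algorithm or its Rust implementation:

        -- Naive baseline, assumed interface; "dead" here means: every reachable
        -- state (including the state itself) has been closed.
        import qualified Data.Map.Strict as M
        import qualified Data.Set as S

        data GID = GID { succs :: M.Map Int [Int], closed :: S.Set Int }

        empty :: GID
        empty = GID M.empty S.empty

        addEdge :: Int -> Int -> GID -> GID
        addEdge u v g = g { succs = M.insertWith (++) u [v] (succs g) }

        close :: Int -> GID -> GID
        close u g = g { closed = S.insert u (closed g) }

        -- All states reachable from s, by depth-first search.
        reachable :: GID -> Int -> S.Set Int
        reachable g s = go S.empty [s]
          where
            go seen []       = seen
            go seen (x : xs)
              | x `S.member` seen = go seen xs
              | otherwise         = go (S.insert x seen) (M.findWithDefault [] x (succs g) ++ xs)

        dead :: GID -> Int -> Bool
        dead g s = all (\x -> x `S.member` closed g) (S.toList (reachable g s))

        main :: IO ()
        main = do
          let g = close 2 (close 1 (addEdge 1 2 (addEdge 0 1 empty)))
          mapM_ (\s -> putStrLn (show s ++ " dead: " ++ show (dead g s))) [0, 1, 2]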

    Algorithms for Stable Matching and Clustering in a Grid

    We study a discrete version of a geometric stable marriage problem originally proposed in a continuous setting by Hoffman, Holroyd, and Peres, in which points in the plane are stably matched to cluster centers, as prioritized by their distances, so that each cluster center is apportioned a set of points of equal area. We show that, for a discretization of the problem to an $n \times n$ grid of pixels with $k$ centers, the problem can be solved in time $O(n^2 \log^5 n)$, and we experiment with two slower but more practical algorithms and a hybrid method that switches from one of these algorithms to the other to gain greater efficiency than either algorithm alone. We also show how to combine geometric stable matchings with a $k$-means clustering algorithm, so as to provide a geometric political-districting algorithm that views distance in economic terms, and we experiment with weighted versions of stable $k$-means in order to improve the connectivity of the resulting clusters. Comment: 23 pages, 12 figures. To appear (without the appendices) at the 18th International Workshop on Combinatorial Image Analysis, June 19-21, 2017, Plovdiv, Bulgaria.
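
    Because pixels and centers rank each other by the same distance, one simple (if slow) way to compute the stable matching is to scan all pixel-center pairs in increasing order of distance and assign greedily, subject to each center's equal-area quota. The Haskell sketch below takes this approach on a small grid; it is an illustrative baseline, not necessarily one of the paper's algorithms and certainly not the $O(n^2 \log^5 n)$ method:

        import Data.List (sortOn)
        import qualified Data.Map.Strict as M

        type Pt = (Int, Int)

        dist2 :: Pt -> Pt -> Int
        dist2 (x1, y1) (x2, y2) = (x1 - x2) ^ 2 + (y1 - y2) ^ 2

        -- Assign every pixel of an n-by-n grid to one of the centers, at most
        -- n*n/k pixels per center, scanning pairs in increasing distance order.
        stableMatch :: Int -> [Pt] -> M.Map Pt Int
        stableMatch n centers = go M.empty initialCap pairs
          where
            pixels     = [(x, y) | x <- [0 .. n - 1], y <- [0 .. n - 1]]
            quota      = (n * n) `div` length centers
            initialCap = M.fromList [(i, quota) | i <- [0 .. length centers - 1]]
            pairs      = sortOn (\(p, _, c) -> dist2 p c)
                           [(p, i, c) | p <- pixels, (i, c) <- zip [0 ..] centers]
            go assigned _   []                 = assigned
            go assigned cap ((p, i, _) : rest)
              | p `M.member` assigned || cap M.! i <= 0 = go assigned cap rest
              | otherwise = go (M.insert p i assigned) (M.adjust (subtract 1) i cap) rest

        main :: IO ()
        main = print (stableMatch 4 [(0, 0), (3, 3)])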

    A type-theory for higher-order amortized analysis

    Verification of worst-case bounds on the resource usage of programs is an important problem in computer science. The usefulness of such verification depends on the precision of the underlying analysis. For precision, it is sometimes useful to consider the average cost over a sequence of operations instead of considering the cost of each individual operation separately. This kind of analysis is often referred to as amortized resource analysis. Typically, programs that optimize their internal state to reduce the cost of future executions benefit from such approaches. Analyzing the resource usage of a standard functional (FIFO) queue implemented using two functional (LIFO) lists is a classic example of amortized analysis.
    In this thesis we present λamor, a type theory for amortized resource analysis of higher-order functional programs. A typical amortized analysis works by storing a ghost state, called the potential, with data structures. The key idea underlying amortized analysis is to show that the potential available to the program is sufficient to account for the resource usage of that program. Verification in λamor is based on internalizing this idea into a type theory. We achieve this by providing a general type-theoretic construct to represent potential at the level of types and then building an affine type theory around it. With λamor we show that type-theoretic amortized analysis can be performed using well-understood concepts from substructural and modal type theories. Yet it yields an extremely expressive framework which can be used for resource analysis of higher-order programs, in both a strict and a lazy setting. We show embeddings of two very different styles of type-theoretic resource analysis frameworks (one based on effects and the other on coeffects) into λamor. We show that λamor is sound (using a logical relations model) and complete for cost analysis of PCF programs (using one of the embeddings). Next, we apply ideas from λamor to develop another type theory (called λcg) for a very different domain: Information Flow Control (IFC). λcg uses a type-theoretic construct similar to the one λamor uses for potential in order to represent confidentiality labels (the ghost state for IFC). Finally, we abstract away from the specific ghost states (potential and confidentiality labels) and describe how to develop a type theory for a general ghost state with a monoidal structure.
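
    The two-list queue mentioned above is the standard concrete picture behind the potential method: enqueue pushes onto a rear list, dequeue pops from a front list, and only when the front list is empty is the rear list reversed. Charging one unit of potential per rear element pays for that occasional reversal, so every operation is O(1) amortized. A plain Haskell sketch (ordinary Haskell, not λamor syntax):

        -- Front list and rear list; the rear holds newest elements first.
        data Queue a = Queue [a] [a] deriving Show

        emptyQ :: Queue a
        emptyQ = Queue [] []

        -- Actual cost 1; raises the potential (the rear list's length) by 1.
        enqueue :: a -> Queue a -> Queue a
        enqueue x (Queue front rear) = Queue front (x : rear)

        -- O(1) except when the front is empty; the stored potential pays for the reverse.
        dequeue :: Queue a -> Maybe (a, Queue a)
        dequeue (Queue [] [])            = Nothing
        dequeue (Queue [] rear)          = dequeue (Queue (reverse rear) [])
        dequeue (Queue (x : front) rear) = Just (x, Queue front rear)

        main :: IO ()
        main = print (dequeue (enqueue 3 (enqueue 2 (enqueue 1 emptyQ))))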

    Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets

    This paper introduces new algorithms and data structures for quick counting for machine learning datasets. We focus on the counting task of constructing contingency tables, but our approach is also applicable to counting the number of records in a dataset that match conjunctive queries. Subject to certain assumptions, the costs of these operations can be shown to be independent of the number of records in the dataset and loglinear in the number of non-zero entries in the contingency table. We provide a very sparse data structure, the ADtree, to minimize memory use. We provide analytical worst-case bounds for this structure for several models of data distribution. We empirically demonstrate that tractably-sized data structures can be produced for large real-world datasets by (a) using a sparse tree structure that never allocates memory for counts of zero, (b) never allocating memory for counts that can be deduced from other counts, and (c) not bothering to expand the tree fully near its leaves. We show how the ADtree can be used to accelerate Bayes net structure finding algorithms, rule learning algorithms, and feature selection algorithms, and we provide a number of empirical results comparing ADtree methods against traditional direct counting approaches. We also discuss the possible uses of ADtrees in other machine learning methods, and discuss the merits of ADtrees in comparison with alternative representations such as kd-trees, R-trees and Frequent Sets. Comment: See http://www.jair.org/ for any accompanying files.
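
    To make the counting task concrete: a contingency table over some attributes is a map from value tuples to counts, and a conjunctive query counts the records matching a partial assignment. The Haskell sketch below computes both naively by scanning the records; it is the baseline an ADtree is meant to beat, not a sketch of the ADtree itself:

        import qualified Data.Map.Strict as M

        type Record = [Int]   -- one value per attribute, indexed by position

        -- Contingency table over the chosen attribute indices: value tuple -> count.
        contingency :: [Int] -> [Record] -> M.Map [Int] Int
        contingency attrs recs = M.fromListWith (+) [ (map (r !!) attrs, 1) | r <- recs ]

        -- Count the records matching a conjunctive query [(attribute, value), ...].
        countQuery :: [(Int, Int)] -> [Record] -> Int
        countQuery query recs = length [ r | r <- recs, all (\(a, v) -> r !! a == v) query ]

        main :: IO ()
        main = do
          let recs = [[0, 1, 1], [0, 0, 1], [1, 1, 0], [0, 1, 0]]
          print (contingency [0, 1] recs)           -- joint counts over attributes 0 and 1
          print (countQuery [(0, 0), (2, 1)] recs)  -- records with attribute 0 = 0 and attribute 2 = 1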