6 research outputs found

    AutoBayes: A System for Generating Data Analysis Programs from Statistical Models

    No full text
    Data analysis is an important scientific task which is required whenever information needs to be extracted from raw data. Statistical approaches to data analysis, which use methods from probability theory and numerical analysis, are well-founded but difficult to implement: the development of a statistical data analysis program for any given application is time-consuming and requires substantial knowledge and experience in several areas. In this paper, we describe AutoBayes, a program synthesis system for the generation of data analysis programs from statistical models. A statistical model specifies the properties for each problem variable (i.e., observation or parameter) and its dependencies in the form of a probability distribution. It is a fully declarative problem description, similar in spirit to a set of differential equations. From such a model, AutoBayes generates optimized and fully commented C/C++ code which can be linked dynamically into the Matlab and Octave environments. Code is produced by a schema-guided deductive synthesis process. A schema consists of a code template and applicability constraints which are checked against the model during synthesis using theorem proving technology. AutoBayes augments schema-guided synthesis by symbolic-algebraic computation and can thus derive closed-form solutions for many problems. It is well-suited for tasks like estimating best-fitting model parameters for the given data. Here, we describe AutoBayes's system architecture, in particular the schema-guided synthesis kernel. Its capabilities are illustrated by a number of advanced textbook examples and benchmarks

    Towards Intelligent Assistance for a Data Mining Process:-

    Get PDF
    A data mining (DM) process involves multiple stages. A simple, but typical, process might include preprocessing data, applying a data-mining algorithm, and postprocessing the mining results. There are many possible choices for each stage, and only some combinations are valid. Because of the large space and non-trivial interactions, both novices and data-mining specialists need assistance in composing and selecting DM processes. Extending notions developed for statistical expert systems we present a prototype Intelligent Discovery Assistant (IDA), which provides users with (i) systematic enumerations of valid DM processes, in order that important, potentially fruitful options are not overlooked, and (ii) effective rankings of these valid processes by different criteria, to facilitate the choice of DM processes to execute. We use the prototype to show that an IDA can indeed provide useful enumerations and effective rankings in the context of simple classification processes. We discuss how an IDA could be an important tool for knowledge sharing among a team of data miners. Finally, we illustrate the claims with a comprehensive demonstration of cost-sensitive classification using a more involved process and data from the 1998 KDDCUP competition.NYU, Stern School of Business, IOMS Department, Center for Digital Economy Researc

    Automation And Visualization Of Program Correctness For Automatically Generating Code

    Get PDF
    Program synthesis systems can be highly advantageous in that users can automatically generate code to fit a wide variety of applications from high-level specifications without needing any low-level programming skills or knowledge of which type of data structures and algorithms should be used. NASA has developed and uses two of these systems, AUTOFILTER and AUTOBAYES. Though much is gained in terms of time and cost efficiency in the use of these systems, they suffer from an issue that is inherent in all code generator systems, the verifiability of the correctness of the generated code against the input specifications. Many times, this verification process can take just as long, if not longer than manually developing and testing the code would have been. Because of this, much work has been done by NASA and others to develop methods for automatic certification that can be produced along with the program and are easy to use. However, there is still more work to be done in this area, especially in the area of automatic visual verification (e.g., by using UML diagrams to provide visual aid in the verification of the generated code). Work has been done by Grant et al. in collaboration with NASA to develop a rigorous approach to system correctness verification that uses domain-specific graphical meta-models of the expected input/output systems with identified constraints on the input/output and their relationships. Though this approach has been applied to AUTOFILTER, it has yet to be applied to other domains. In this work, Grant’s approach is extended to the data analysis domain by being applied to AUTOBAYES. A model of the input specification for AUTOBAYES was obtained for the case in which a normal distribution of data is assumed. This model, derived from the AUTOBAYES input files, the n-dimensional Gaussian equation, and allowed priors, is a UML class diagram (CD). Similarly, a UML CD model of the AUTOBAYES program output was derived. These CD\u27s were then used to develop 30 constraints on the input, the output, and the relationship between them. These constraints were then transformed into the OCL formal specification language and analyzed with the USE tool, along with the derived comprehensive CD (i.e., a combination of the input CD, output CD, and the relationships between each other). These models and constraints were used to successfully check that all of the developed constraints were satisfied with the model representing AUTOBAYES. Unfortunately, a configuration for a full validation with USE was not obtained, after several iterations, due to project time restrictions. However, the results obtained adequately demonstrate that this method can be extended to the domain of AUTOBAYES. This work was motivated both due to its relevance to NASA in the chosen case study of AUTOBAYES as well to show that Grant’s approach can be extended to other domains beyond AUTOFILTER

    Towards Intelligent Assistance for a Data Mining Process:-

    Get PDF
    A data mining (DM) process involves multiple stages. A simple, but typical, process might include preprocessing data, applying a data-mining algorithm, and postprocessing the mining results. There are many possible choices for each stage, and only some combinations are valid. Because of the large space and non-trivial interactions, both novices and data-mining specialists need assistance in composing and selecting DM processes. Extending notions developed for statistical expert systems we present a prototype Intelligent Discovery Assistant (IDA), which provides users with (i) systematic enumerations of valid DM processes, in order that important, potentially fruitful options are not overlooked, and (ii) effective rankings of these valid processes by different criteria, to facilitate the choice of DM processes to execute. We use the prototype to show that an IDA can indeed provide useful enumerations and effective rankings in the context of simple classification processes. We discuss how an IDA could be an important tool for knowledge sharing among a team of data miners. Finally, we illustrate the claims with a comprehensive demonstration of cost-sensitive classification using a more involved process and data from the 1998 KDDCUP competition.NYU, Stern School of Business, IOMS Department, Center for Digital Economy Researc

    Towards Automated Synthesis of Data Mining Programs

    No full text
    Code synthesis is routinely used in industry to generate GUIs, form filling applications, and database support code and is even used with COBOL. In this paper we consider the question of whether code synthesis could also be applied to the data mining phase of knowledge discovery. We view this as a rapid prototyping method. Rapid prototyping of statistical data analysis algorithms would allow experienced analysts to experiment with different statistical models before choosing one, but without requiring prohibitively expensive programming efforts. It would also smooth the steep learning curve often faced by novice users of data mining tools and libraries. Finally, it would accelerate dissemination of essential research results and the development of applications. In this paper, we present a framework and the basic software for the automated synthesis of data analysis programs. We use a specification language that generalizes Bayesian networks, a popular notation used in many communities..
    corecore