12 research outputs found

    Bayesian Clustering by Dynamics

    This paper introduces a Bayesian method for clustering dynamic processes. The method models dynamics as Markov chains and then applies an agglomerative clustering procedure to discover the most probable set of clusters capturing different dynamics. To increase efficiency, the method uses an entropy-based heuristic search strategy. A controlled experiment suggests that the method is very accurate when applied to artificial time series in a broad range of conditions and that, when applied to clustering sensor data from mobile robots, it produces clusters that are meaningful in the domain of application.
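As a rough illustration of the first step named in this abstract (modelling a discrete time series as a Markov chain), the sketch below estimates transition probabilities by simple frequency counts. It is not the paper's Bayesian procedure: the function name `transition_matrix`, the maximum-likelihood estimate, and the uniform fallback for unvisited states are all assumptions made for the example.

```python
from collections import Counter


def transition_matrix(series, states):
    """Estimate first-order Markov transition probabilities from one series.

    Hypothetical helper for illustration only; uses plain frequency counts
    rather than the Bayesian estimates described in the paper.
    """
    # Count every observed (current state, next state) pair.
    counts = Counter(zip(series, series[1:]))
    matrix = {}
    for s in states:
        total = sum(counts[(s, t)] for t in states)
        # Assumption: fall back to a uniform row if state s is never left.
        matrix[s] = {
            t: (counts[(s, t)] / total if total else 1.0 / len(states))
            for t in states
        }
    return matrix


if __name__ == "__main__":
    estimate = transition_matrix("aabab", states="ab")
    for state, row in estimate.items():
        print(state, row)
```

Series whose estimated matrices are similar would be candidates for the same cluster; how similarity is scored and how clusters are merged is left to the paper itself.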

    Maximum Entropy Discrimination

    We present a general framework for discriminative estimation based on the maximum entropy principle and its extensions. All calculations involve distributions over structures and/or parameters rather than specific settings and reduce to relative entropy projections. This holds even when the data is not separable within the chosen parametric class, in the context of anomaly detection rather than classification, or when the labels in the training set are uncertain or incomplete. Support vector machines are naturally subsumed under this class and we provide several extensions. We are also able to estimate exactly and efficiently discriminative distributions over tree structures of class-conditional models within this framework. Preliminary experimental results are indicative of the potential in these techniques.

    Three algorithms for causal learning

    The field of causal learning has grown in the past decade, establishing itself as a major focus in artificial intelligence research. Traditionally, approaches to causal learning are split into two areas. One area involves the learning of structures from observational data alone, and the second involves the methodologies of conducting and learning from experiments. In this dissertation, I investigate three different aspects of causal learning, all of which are based on the causal Bayesian network framework. Constraint-based structure search algorithms that learn partially directed acyclic graphs as causal models from observational data rely on the faithfulness assumption, which is often violated due to inaccurate statistical tests on finite datasets. My first contribution is a modification of the traditional approaches to achieve greater robustness in the light of these faults. Secondly, I present a new algorithm to infer the parent set of a variable when a specific type of experiment called a 'hard intervention' is performed. I also present an auxiliary result of this effort, a fast algorithm to estimate the Kullback-Leibler divergence of high-dimensional distributions from datasets. Thirdly, I introduce a fast heuristic algorithm to optimize the number and sequence of experiments required towards complete causal discovery for different classes of causal graphs and provide suggestions for implementing an interactive version. Finally, I provide numerical simulation results for each algorithm discussed and present some directions for future research.

    An exploratory analysis of large health cohort study using Bayesian networks

    Thesis (Ph.D.)--Harvard-MIT Division of Health Sciences and Technology, 2006, by Delin Shen. Includes bibliographical references (p. 91-98). Large health cohort studies are among the most effective ways of studying the causes, treatments and outcomes of diseases by systematically collecting a wide range of data over long periods. The wealth of data in such studies may yield important results in addition to the already numerous findings, especially when subjected to newer analytical methods. Bayesian Networks (BN) provide a relatively new method of representing uncertain relationships among variables, using the tools of probability and graph theory, and have been widely used in analyzing dependencies and the interplay between variables. We used BN to perform an exploratory analysis on a rich collection of data from one large health cohort study, the Nurses' Health Study (NHS), with the focus on breast cancer. We explored the data from the NHS using BN to look for breast cancer risk factors, including a group of Single Nucleotide Polymorphisms (SNP). We found no association between the SNPs and breast cancer, but found a dependency between clomid and breast cancer. We evaluated clomid as a potential risk factor after matching on age and number of children. Our results showed for clomid an increased risk of estrogen receptor positive breast cancer (odds ratio 1.52, 95% CI 1.11-2.09) and a decreased risk of estrogen receptor negative breast cancer (odds ratio 0.46, 95% CI 0.22-0.97). We developed breast cancer risk models using BN. We trained models on 75% of the data and evaluated them on the remaining 25%. Because of the clinical importance of predicting risks for Estrogen Receptor positive and Progesterone Receptor positive breast cancer, we focused on this specific type of breast cancer to predict two-year, four-year, and six-year risks. The concordance statistics of the prediction results on test sets are 0.70 (95% CI: 0.67-0.74), 0.68 (95% CI: 0.64-0.72), and 0.66 (95% CI: 0.62-0.69) for the two-, four-, and six-year models, respectively. We also evaluated the calibration performance of the models, and applied a filter to the output to improve the linear relationship between predicted and observed risks using Agglomerative Information Bottleneck clustering without sacrificing much discrimination performance.
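For readers unfamiliar with the concordance statistic reported in this abstract, the sketch below shows how it is commonly computed for a binary outcome: the fraction of (case, non-case) pairs in which the case received the higher predicted risk, with ties counted as one half. The function name `concordance` and the toy numbers are assumptions for illustration; the thesis's own computation and its confidence intervals are not reproduced here.

```python
def concordance(risks, outcomes):
    """Concordance (c) statistic for binary outcomes.

    Hypothetical helper for illustration: `risks` are predicted
    probabilities and `outcomes` are 1 (event) or 0 (no event).
    """
    cases = [r for r, y in zip(risks, outcomes) if y == 1]
    controls = [r for r, y in zip(risks, outcomes) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one event and one non-event")

    score = 0.0
    for case_risk in cases:
        for control_risk in controls:
            if case_risk > control_risk:
                score += 1.0  # concordant pair
            elif case_risk == control_risk:
                score += 0.5  # tied pair counts as half
    return score / (len(cases) * len(controls))


if __name__ == "__main__":
    # Toy data, not taken from the Nurses' Health Study.
    print(concordance([0.9, 0.4, 0.6, 0.2], [1, 0, 1, 0]))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, which gives some context for the reported values of 0.66 to 0.70.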

    Probabilistic modelling of oil rig drilling operations for business decision support: a real world application of Bayesian networks and computational intelligence.

    This work investigates the use of evolved Bayesian network learning algorithms based on computational intelligence meta-heuristic algorithms. These algorithms are applied to a new domain provided by the exclusive data available to this project from an industry partnership with ODS-Petrodata, a business intelligence company in Aberdeen, Scotland. This research proposes statistical models that serve as a foundation for building a novel operational tool for forecasting the performance of rig drilling operations. A prototype for a tool able to forecast the future performance of a drilling operation is created using the obtained data, the statistical model and the experts' domain knowledge. This work makes the following contributions: applying K2GA and Bayesian networks to a real-world industry problem; developing a well-performing and adaptive solution to forecast oil drilling rig performance; using the knowledge of industry experts to guide the creation of competitive models; creating models able to forecast oil drilling rig performance consistently with nearly 80% forecast accuracy, using either logistic regression or Bayesian network learning using genetic algorithms; introducing the node juxtaposition analysis graph, which allows the visualisation of the frequency of node links appearing in a set of orderings, thereby providing new insights when analysing node ordering landscapes; exploring the correlation factors between model score and model predictive accuracy, and showing that the model score does not correlate with the predictive accuracy of the model; exploring a method for feature selection using multiple algorithms and drastically reducing the modelling time by multiple factors; proposing new fixed structure Bayesian network learning algorithms for node ordering search-space exploration. Finally, this work proposes real-world applications for the models based on current industry needs, such as recommender systems, an oil drilling rig selection tool, user-ready rig performance forecasting software and rig scheduling tools.

    Gene Regulatory Networks: Modeling, Intervention and Context

    Biological systems are complex in many dimensions as endless transportation and communication networks all function simultaneously. Our ability to intervene within both healthy and diseased systems is tied directly to our ability to understand and model core functionality. The progress in increasingly accurate and thorough high-throughput measurement technologies has provided a deluge of data from which we may attempt to infer a representation of the true genetic regulatory system. A gene regulatory network model, if accurate enough, may allow us to perform hypothesis testing in the form of computational experiments. Of great importance to modeling accuracy is the acknowledgment of biological contexts within the models -- i.e. recognizing the heterogeneous nature of the true biological system and the data it generates. This marriage of engineering, mathematics and computer science with systems biology creates a cycle of progress between computer simulation and lab experimentation, rapidly translating interventions and treatments for patients from the bench to the bedside. This dissertation will first discuss the landscape for modeling the biological system, explore the identification of targets for intervention in Boolean network models of biological interactions, and explore context specificity both in new graphical depictions of models embodying context-specific genomic regulation and in novel analysis approaches designed to reveal embedded contextual information. Overall, the dissertation will explore a spectrum of biological modeling with a goal towards therapeutic intervention, with both formal and informal notions of biological context, in such a way that will enable future work to have an even greater impact in terms of direct patient benefit on an individualized level. Dissertation/Thesis. Ph.D. Computer Science 201

    A model driven approach to imbalanced data learning

    Ph.D. Doctor of Philosophy

    Autonomous Hypothesis Generation for Knowledge Discovery in Continuous Domains

    Advances in computational power, data collection and storage techniques are making new data available every day. This situation has given rise to hypothesis generation research, which complements conventional hypothesis testing research. Hypothesis generation research adopts techniques from machine learning and data mining to autonomously uncover causal relations among variables in the form of previously unknown hidden patterns and models from data. Those patterns and models can come in different forms (e.g. rules, classifiers, clusters, causal relations). In some situations, data are collected without prior supposition or imposition of a specific research goal or hypothesis. Sometimes domain knowledge for this type of problem is also limited. For example, in sensor networks, sensors constantly record data. In these data, not all forms of relationships can be described in advance. Moreover, the environment may change without prior knowledge. In a situation like this one, hypothesis generation techniques can potentially provide a paradigm to gain new insights about the data and the underlying system. This thesis proposes a general hypothesis generation framework, whereby assumptions about the observational data and the system are not predefined. The problem is decomposed into two interrelated sub-problems: (1) the associative hypothesis generation problem and (2) the causal hypothesis generation problem. The former defines a task of finding evidence of the potential causal relations in data. The latter defines a refined task of identifying causal relations. A novel association rule algorithm for continuous domains, called functional association rule mining, is proposed to address the first problem. An agent-based causal search algorithm is then designed for the second problem. It systematically tests the potential causal relations by querying the system to generate specific data, thus allowing for causality to be asserted. Empirical experiments show that the functional association rule mining algorithm can uncover associative relations from data. If the underlying relationships in the data overlap, the algorithm decomposes these relationships into their constituent non-overlapping parts. Experiments with the causal search algorithm show a relatively low error rate on the retrieved hidden causal structures. In summary, the contributions of this thesis are: (1) a general framework for hypothesis generation in continuous domains, which relaxes a number of conditions assumed in existing automatic causal modelling algorithms and defines a more general hypothesis generation problem; (2) a new functional association rule mining algorithm, which serves as a probing step to identify associative relations in a given dataset and provides a novel functional association rule definition and algorithms to the literature of association rule mining; (3) a new causal search algorithm, which identifies the hidden causal relations of an unknown system on the basis of functional association rule mining and relaxes a number of assumptions commonly used in automatic causal modelling.