12 research outputs found

Bayesian Clustering by Dynamics

Author: Cohen Paul
Ramoni Marco
Sebastiani Paola
Publication venue: ScholarWorks@UMass Amherst
Publication date: 01/01/2001

This paper introduces a Bayesian method for clustering dynamic processes. The method models dynamics as Markov chains and then applies an agglomerative clustering procedure to discover the most probable set of clusters capturing different dynamics. To increase efficiency, the method uses an entropy-based heuristic search strategy. A controlled experiment suggests that the method is very accurate when applied to artificial time series in a broad range of conditions and, when applied to clustering sensor data from mobile robots, it produces clusters that are meaningful in the domain of application.
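As a rough illustration of the first step described in this abstract (modelling a discrete time series as a Markov chain), the sketch below estimates first-order transition probabilities from transition counts. It is not the authors' Bayesian clustering procedure; the function name `transition_matrix`, the toy series, and the uniform fallback for unseen states are all assumptions made for the example.

```python
from collections import Counter


def transition_matrix(series, states):
    """Estimate first-order Markov transition probabilities from one discrete series."""
    # Count every observed (current state, next state) pair.
    counts = Counter(zip(series, series[1:]))
    matrix = {}
    for s in states:
        row_total = sum(counts[(s, t)] for t in states)
        # Assumption: a state with no observed outgoing transitions gets a uniform row.
        matrix[s] = {
            t: (counts[(s, t)] / row_total if row_total else 1.0 / len(states))
            for t in states
        }
    return matrix


# Hypothetical toy series over two states.
series = ["a", "a", "b", "a", "b", "b"]
P = transition_matrix(series, ["a", "b"])
```

Two series with similar estimated matrices would be candidates for the same cluster; how similarity is scored and how merges are chosen is where the paper's Bayesian and entropy-based machinery comes in, which this sketch does not attempt.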

CiteSeerX

ScholarWorks@UMass Amherst

Maximum Entropy Discrimination

Author: Jaakkola Tommi
Jebara Tony
Meila Marina
Publication date: 01/01/1999

We present a general framework for discriminative estimation based on the maximum entropy principle and its extensions. All calculations involve distributions over structures and/or parameters rather than specific settings and reduce to relative entropy projections. This holds even when the data is not separable within the chosen parametric class, in the context of anomaly detection rather than classification, or when the labels in the training set are uncertain or incomplete. Support vector machines are naturally subsumed under this class and we provide several extensions. We are also able to estimate exactly and efficiently discriminative distributions over tree structures of class-conditional models within this framework. Preliminary experimental results are indicative of the potential in these techniques.

CiteSeerX

DSpace@MIT

Three algorithms for causal learning

Author: Rammohan Roshan Ram
Publication venue: UNM Digital Repository
Publication date: 01/12/2010

The field of causal learning has grown in the past decade, establishing itself as a major focus in artificial intelligence research. Traditionally, approaches to causal learning are split into two areas. One involves learning structures from observational data alone; the second involves the methodologies of conducting and learning from experiments. In this dissertation, I investigate three different aspects of causal learning, all of which are based on the causal Bayesian network framework. Constraint-based structure search algorithms that learn partially directed acyclic graphs as causal models from observational data rely on the faithfulness assumption, which is often violated due to inaccurate statistical tests on finite datasets. My first contribution is a modification of the traditional approaches to achieve greater robustness in light of these faults. Secondly, I present a new algorithm to infer the parent set of a variable when a specific type of experiment called a 'hard intervention' is performed. I also present an auxiliary result of this effort, a fast algorithm to estimate the Kullback-Leibler divergence of high-dimensional distributions from datasets. Thirdly, I introduce a fast heuristic algorithm to optimize the number and sequence of experiments required for complete causal discovery for different classes of causal graphs, and provide suggestions for implementing an interactive version. Finally, I provide numerical simulation results for each algorithm discussed and present some directions for future research.

An exploratory analysis of large health cohort study using Bayesian networks

Author: Shen Delin
Publication venue: Massachusetts Institute of Technology
Publication date: 01/01/2006

Thesis (Ph.D.)--Harvard-MIT Division of Health Sciences and Technology, 2006, by Delin Shen. Includes bibliographical references (p. 91-98). Large health cohort studies are among the most effective ways of studying the causes, treatments and outcomes of diseases by systematically collecting a wide range of data over long periods. The wealth of data in such studies may yield important results in addition to the already numerous findings, especially when subjected to newer analytical methods. Bayesian Networks (BN) provide a relatively new method of representing uncertain relationships among variables, using the tools of probability and graph theory, and have been widely used in analyzing dependencies and the interplay between variables. We used BN to perform an exploratory analysis on a rich collection of data from one large health cohort study, the Nurses' Health Study (NHS), with a focus on breast cancer. We explored the data from the NHS using BN to look for breast cancer risk factors, including a group of Single Nucleotide Polymorphisms (SNPs). We found no association between the SNPs and breast cancer, but found a dependency between clomid and breast cancer. We evaluated clomid as a potential risk factor after matching on age and number of children. Our results showed for clomid an increased risk of estrogen receptor positive breast cancer (odds ratio 1.52, 95% CI 1.11-2.09) and a decreased risk of estrogen receptor negative breast cancer (odds ratio 0.46, 95% CI 0.22-0.97). We developed breast cancer risk models using BN. We trained models on 75% of the data and evaluated them on the remaining 25%. Because of the clinical importance of predicting risks for Estrogen Receptor positive and Progesterone Receptor positive breast cancer, we focused on this specific type of breast cancer to predict two-year, four-year, and six-year risks. The concordance statistics of the prediction results on test sets are 0.70 (95% CI: 0.67-0.74), 0.68 (95% CI: 0.64-0.72), and 0.66 (95% CI: 0.62-0.69) for the two-, four-, and six-year models, respectively. We also evaluated the calibration performance of the models, and applied a filter to the output to improve the linear relationship between predicted and observed risks using Agglomerative Information Bottleneck clustering without sacrificing much discrimination performance.
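For readers unfamiliar with the concordance statistic reported in this abstract, the sketch below computes it for a binary outcome as the fraction of (case, non-case) pairs in which the case received the higher predicted risk, with ties counted as half. The thesis may well have used a different or time-dependent variant; the function name `concordance` and the toy data are assumptions for illustration only.

```python
def concordance(risks, outcomes):
    """Concordance (c) statistic for predicted risks against binary outcomes (1 = case)."""
    cases = [r for r, y in zip(risks, outcomes) if y == 1]
    controls = [r for r, y in zip(risks, outcomes) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    score = 0.0
    for case_risk in cases:
        for control_risk in controls:
            if case_risk > control_risk:
                score += 1.0  # concordant pair
            elif case_risk == control_risk:
                score += 0.5  # tie counts as half (a common convention)
    return score / (len(cases) * len(controls))


# Hypothetical toy data: two cases and two non-cases.
risks = [0.9, 0.4, 0.3, 0.2]
outcomes = [1, 0, 1, 0]
c_stat = concordance(risks, outcomes)
```

In the toy data, three of the four case/non-case pairs are ordered correctly, so the statistic is 0.75. A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect ranking.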

DSpace@MIT

Probabilistic modelling of oil rig drilling operations for business decision support: a real world application of Bayesian networks and computational intelligence.

Author: Fournier François A.
Publication date: 31/03/2013

This work investigates the use of evolved Bayesian network learning algorithms based on computational intelligence meta-heuristic algorithms. These algorithms are applied to a new domain provided by exclusive data made available to this project through an industry partnership with ODS-Petrodata, a business intelligence company in Aberdeen, Scotland. This research proposes statistical models that serve as a foundation for building a novel operational tool for forecasting the performance of rig drilling operations. A prototype for a tool able to forecast the future performance of a drilling operation is created using the obtained data, the statistical model and the experts' domain knowledge. This work makes the following contributions: applying K2GA and Bayesian networks to a real-world industry problem; developing a well-performing and adaptive solution to forecast oil drilling rig performance; using the knowledge of industry experts to guide the creation of competitive models; creating models able to forecast oil drilling rig performance consistently with nearly 80% forecast accuracy, using either logistic regression or Bayesian network learning with genetic algorithms; introducing the node juxtaposition analysis graph, which allows the visualisation of the frequency of node links appearing in a set of orderings, thereby providing new insights when analysing node ordering landscapes; exploring the correlation factors between model score and model predictive accuracy, and showing that the model score does not correlate with the predictive accuracy of the model; exploring a method for feature selection using multiple algorithms and drastically reducing the modelling time by multiple factors; proposing new fixed structure Bayesian network learning algorithms for node ordering search-space exploration. Finally, this work proposes real-world applications for the models based on current industry needs, such as recommender systems, an oil drilling rig selection tool, user-ready rig performance forecasting software and rig scheduling tools.

Open Access Institutional Repository at Robert Gordon University

Effective techniques for handling incomplete data using decision trees

Author: Twala Bhekisipho E.T.H.
Publication date: 01/01/2005

Decision Trees (DTs) have been recognized as one of the most successful formalisms for knowledge representation and reasoning and are currently applied to a variety of data mining or knowledge discovery applications, particularly for classification problems. There are several efficient methods to learn a DT from data. However, these methods are often limited to the assumption that data are complete. This thesis presents contributions to machine learning and statistics that address the problem of extracting DTs for learning and classification tasks from incomplete databases. The methodology underlying the thesis blends together well-established statistical theories with the most advanced techniques for machine learning and automated reasoning with uncertainty. The first contribution is a set of extensive simulations that study the impact of missing data on the predictive accuracy of existing DTs that can cope with missing values, when missing values are in both the training and test sets or in only one of the two. All simulations are performed under missing completely at random, missing at random and informatively missing mechanisms and for different missing data patterns and proportions. The next contribution is a simple, novel, yet effective procedure for training and testing decision trees in the presence of missing data. Original and simple splitting criteria for attribute selection in tree building are put forward. The proposed technique is evaluated and validated in empirical tests over many real-world application domains. In this work, the proposed algorithm maintains (and sometimes exceeds) the outstanding accuracy of multiple imputation, especially on datasets containing mixed attributes and purely nominal attributes. The proposed algorithm also greatly improves accuracy for informatively missing (IM) data. Another major advantage of this method over multiple imputation is the important saving in computational resources due to its simplicity. The next contribution is the proposal of three versions of simple probabilistic techniques that could be used for classifying incomplete vectors using decision trees based on complete data. The proposed procedure is superficially similar to that of fractional cases but more effective. The experimental results demonstrate that these approaches can achieve quality comparable to sophisticated algorithms like multiple imputation and therefore are applicable to all kinds of datasets. Finally, two ensemble procedures for handling incomplete training and test data are proposed and discussed. The algorithms combine the two best approaches either with resampling (REMIMIA) or without resampling (EMIMIA) of the training data before growing the decision trees. Empirical tests are used to evaluate and validate the proposed ensemble methods against individual missing data techniques. EMIMIA attains the highest overall level of prediction accuracy.

Open Research Online (The Open University)

Gene Regulatory Networks: Modeling, Intervention and Context

Publication date: 01/01/2013

Biological systems are complex in many dimensions as endless transportation and communication networks all function simultaneously. Our ability to intervene within both healthy and diseased systems is tied directly to our ability to understand and model core functionality. The progress in increasingly accurate and thorough high-throughput measurement technologies has provided a deluge of data from which we may attempt to infer a representation of the true genetic regulatory system. A gene regulatory network model, if accurate enough, may allow us to perform hypothesis testing in the form of computational experiments. Of great importance to modeling accuracy is the acknowledgment of biological contexts within the models -- i.e. recognizing the heterogeneous nature of the true biological system and the data it generates. This marriage of engineering, mathematics and computer science with systems biology creates a cycle of progress between computer simulation and lab experimentation, rapidly translating interventions and treatments for patients from the bench to the bedside. This dissertation will first discuss the landscape for modeling the biological system, explore the identification of targets for intervention in Boolean network models of biological interactions, and explore context specificity both in new graphical depictions of models embodying context-specific genomic regulation and in novel analysis approaches designed to reveal embedded contextual information. Overall, the dissertation will explore a spectrum of biological modeling with a goal towards therapeutic intervention, with both formal and informal notions of biological context, in such a way that will enable future work to have an even greater impact in terms of direct patient benefit on an individualized level. Dissertation/Thesis. Ph.D. Computer Science 201

ASU Digital Repository

A model driven approach to imbalanced data learning

Author: YIN HONGLI
Publication date: 15/03/2011

Ph.D. (Doctor of Philosophy)

ScholarBank@NUS

Autonomous Hypothesis Generation for Knowledge Discovery in Continuous Domains

Author: Wang Bing
Publication venue: UNSW, Sydney
Publication date: 01/01/2014

Advances in computational power, data collection and storage techniques are making new data available every day. This situation has given rise to hypothesis generation research, which complements conventional hypothesis testing research. Hypothesis generation research adopts techniques from machine learning and data mining to autonomously uncover causal relations among variables in the form of previously unknown hidden patterns and models from data. Those patterns and models can come in different forms (e.g. rules, classifiers, clusters, causal relations). In some situations, data are collected without prior supposition or imposition of a specific research goal or hypothesis. Sometimes domain knowledge for this type of problem is also limited. For example, in sensor networks, sensors constantly record data. In these data, not all forms of relationships can be described in advance. Moreover, the environment may change without prior knowledge. In a situation like this one, hypothesis generation techniques can potentially provide a paradigm to gain new insights about the data and the underlying system. This thesis proposes a general hypothesis generation framework, whereby assumptions about the observational data and the system are not predefined. The problem is decomposed into two interrelated sub-problems: (1) the associative hypothesis generation problem and (2) the causal hypothesis generation problem. The former defines a task of finding evidence of the potential causal relations in data. The latter defines a refined task of identifying causal relations. A novel association rule algorithm for continuous domains, called functional association rule mining, is proposed to address the first problem. An agent-based causal search algorithm is then designed for the second problem. It systematically tests the potential causal relations by querying the system to generate specific data, thus allowing causality to be asserted. Empirical experiments show that the functional association rule mining algorithm can uncover associative relations from data. If the underlying relationships in the data overlap, the algorithm decomposes these relationships into their constituent non-overlapping parts. Experiments with the causal search algorithm show a relatively low error rate on the retrieved hidden causal structures. In summary, the contributions of this thesis are: (1) a general framework for hypothesis generation in continuous domains, which relaxes a number of conditions assumed in existing automatic causal modelling algorithms and defines a more general hypothesis generation problem; (2) a new functional association rule mining algorithm, which serves as a probing step to identify associative relations in a given dataset and provides a novel functional association rule definition and algorithms to the literature of association rule mining; (3) a new causal search algorithm, which identifies the hidden causal relations of an unknown system on the basis of functional association rule mining and relaxes a number of assumptions commonly used in automatic causal modelling.

UNSWorks