
    A Study of Accelerated Bayesian Additive Regression Trees

    Bayesian Additive Regression Trees (BART) is a non-parametric Bayesian model that often outperforms other popular predictive models in terms of out-of-sample error. This thesis studies a modified version of BART called Accelerated Bayesian Additive Regression Trees (XBART). The study consists of simulation and real-data experiments comparing XBART to other leading algorithms, including BART. The results show that XBART maintains BART’s predictive power while reducing its computation time. The thesis also describes the development of a Python package implementing XBART.
    Dissertation/Thesis (Masters Thesis), Statistics, 201
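    As background for the abstract above, BART and XBART share the same sum-of-trees form, f(x) = Σ_j g(x; T_j, M_j). The sketch below uses invented, hard-coded stumps and leaf values purely to illustrate how such an ensemble produces a prediction; real BART/XBART samples the trees and leaf parameters from a posterior.

```python
# Toy sum-of-trees prediction: f(x) = sum_j g(x; T_j, M_j).
# The stumps and leaf values below are made up for illustration.

def stump(feature, threshold, left_value, right_value):
    """Return a depth-1 regression tree as a function of x."""
    def g(x):
        return left_value if x[feature] <= threshold else right_value
    return g

trees = [
    stump(0, 0.5, -0.2, 0.3),
    stump(1, 1.0, 0.1, -0.1),
    stump(0, 2.0, 0.05, 0.4),
]

def predict(x):
    # The ensemble prediction is the sum of the individual tree outputs.
    return sum(g(x) for g in trees)

print(round(predict([0.3, 2.0]), 6))  # -0.2 + (-0.1) + 0.05 = -0.25
```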

    Junction trees constructions in Bayesian networks

    © Published under licence by IOP Publishing Ltd. Junction trees are used as graphical structures over which propagation is carried out, thanks to a very important property called the running intersection property. This paper examines an alternative method for constructing junction trees, which are essential for the efficient computation of probabilities in Bayesian networks. The proposed method converts a sequence of subsets of a Bayesian network into a junction tree, in other words, into a set of cliques that has the running intersection property. The obtained set of cliques and separators coincides with the junction tree obtained by the moralization and triangulation process, but it has the advantage of adapting to any computational task by adding links to the Bayesian network graph.
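    The running intersection property mentioned above can be checked directly on an ordered clique sequence. A minimal sketch follows; the function name and the example cliques are hypothetical, not taken from the paper:

```python
# A sequence of cliques C1..Cn has the running intersection property if,
# for every i > 1, the separator Ci ∩ (C1 ∪ ... ∪ C(i-1)) is contained
# in a single earlier clique.

def has_running_intersection(cliques):
    seen = set()
    for i, clique in enumerate(cliques):
        if i > 0:
            separator = clique & seen
            if not any(separator <= earlier for earlier in cliques[:i]):
                return False
        seen |= clique
    return True

# Cliques from a small junction tree over variables A..E
print(has_running_intersection([{"A", "B"}, {"B", "C"}, {"C", "D", "E"}]))  # True

# Reordering the same cliques can break the property
print(has_running_intersection([{"A", "B"}, {"C", "D", "E"}, {"B", "C"}]))  # False
```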

    Species relationship in the genus Silene L. Section Auriculatae (Caryophyllaceae) based on morphology and RAPD analyses

    Morphological and RAPD studies were performed for the first time on Silene species of sect. Auriculatae growing in Iran, using phenetic, parsimony and Bayesian analyses. The trees obtained differed in their species groupings, although they agreed in some parts. Parsimony and Bayesian analyses of morphological characters produced some clades that were not well supported by bootstrap and clade-credibility values, but the UPGMA tree showed a high cophenetic correlation. Grouping based on morphological characters partly supports the species affinities given in Flora Iranica. Of the 40 RAPD primers used, 15 primers produced reproducible polymorphic bands. In total, 347 bands were produced, of which 340 were polymorphic and 7 were monomorphic. Among the species studied, S. goniocaula showed the highest number of RAPD bands (184), while S. commelinifolia var. isophylla showed the lowest (123). Some of the species studied showed specific bands that may be used for species discrimination. NJ and Bayesian trees of the RAPD data partly agree with the morphological trees obtained.
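    As an illustration of the kind of distance matrix that underlies the UPGMA and NJ trees mentioned above, the sketch below computes Jaccard distances between binary band-presence profiles. The profiles are invented toy data, not the study's 347-band matrix:

```python
# Pairwise Jaccard distances from binary RAPD band profiles
# (1 = band present, 0 = band absent), as commonly computed
# before clustering with UPGMA or NJ. Toy data for illustration.

def jaccard_distance(a, b):
    """1 - (shared bands / bands present in either profile)."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 - shared / union if union else 0.0

profiles = {
    "sp1": [1, 1, 0, 1, 0],
    "sp2": [1, 0, 0, 1, 1],
    "sp3": [0, 1, 1, 0, 0],
}

for u in profiles:
    for v in profiles:
        if u < v:
            print(u, v, round(jaccard_distance(profiles[u], profiles[v]), 3))
```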

    Bayesian Decision Trees via Tractable Priors and Probabilistic Context-Free Grammars

    Decision Trees are some of the most popular machine learning models today due to their out-of-the-box performance and interpretability. Often, Decision Tree models are constructed greedily in a top-down fashion via heuristic search criteria, such as Gini impurity or entropy. However, trees constructed in this manner are sensitive to minor fluctuations in the training data and are prone to overfitting. In contrast, Bayesian approaches to tree construction formulate the selection process as a posterior inference problem; such approaches are more stable and provide greater theoretical guarantees. However, generating Bayesian Decision Trees usually requires sampling from complex, multimodal posterior distributions. Current Markov Chain Monte Carlo-based approaches for sampling Bayesian Decision Trees are prone to mode collapse and long mixing times, which makes them impractical. In this paper, we propose a new criterion for training Bayesian Decision Trees. Our criterion gives rise to BCART-PCFG, which can efficiently sample decision trees from a posterior distribution across trees given the data and find the maximum a posteriori (MAP) tree. Learning the posterior and training the sampler can be done in time that is polynomial in the dataset size. Once the posterior has been learned, trees can be sampled efficiently (linearly in the number of nodes). At the core of our method is a reduction of sampling the posterior to sampling a derivation from a probabilistic context-free grammar. We find that trees sampled via BCART-PCFG perform comparably to or better than greedily constructed Decision Trees in classification accuracy on several datasets. Additionally, the trees sampled via BCART-PCFG are significantly smaller -- sometimes by as much as 20x.
    Comment: 10 pages, 1 figure
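    The core reduction above, sampling a tree as a derivation from a probabilistic context-free grammar, can be illustrated with ancestral sampling from a toy grammar. The grammar, probabilities and names below are invented for illustration; BCART-PCFG learns its rule weights from the data and prior rather than fixing them by hand:

```python
import random

# Toy PCFG in which a "tree" nonterminal T either terminates as a leaf
# or splits into two subtrees. Because 0.4 * 2 < 1, the branching
# process is subcritical and sampling terminates with probability 1.
RULES = {
    # nonterminal: list of (probability, expansion)
    "T": [
        (0.6, ("leaf",)),            # T -> leaf
        (0.4, ("split", "T", "T")),  # T -> split(T, T)
    ],
}

def sample(symbol="T", rng=random):
    """Draw one derivation top-down, choosing each rule by its probability."""
    if symbol not in RULES:  # terminal symbol
        return symbol
    r, acc = rng.random(), 0.0
    for prob, expansion in RULES[symbol]:
        acc += prob
        if r <= acc:
            head, *children = expansion
            return (head, *[sample(c, rng) for c in children])

random.seed(0)
print(sample())  # a nested tuple such as ("split", ("leaf",), ("leaf",))
```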

    On PAC-Bayesian Bounds for Random Forests

    Existing guarantees in terms of rigorous upper bounds on the generalization error for the original random forest algorithm, one of the most frequently used machine learning methods, are unsatisfying. We discuss and evaluate various PAC-Bayesian approaches to derive such bounds. The bounds do not require additional hold-out data, because the out-of-bag samples from the bagging in the training process can be exploited. A random forest predicts by taking a majority vote of an ensemble of decision trees. The first approach is to bound the error of the vote by twice the error of the corresponding Gibbs classifier (classifying with a single member of the ensemble selected at random). However, this approach does not take into account the effect of averaging out errors of individual classifiers when taking the majority vote. This effect provides a significant boost in performance when the errors are independent or negatively correlated, but when the correlations are strong the advantage from taking the majority vote is small. The second approach, based on PAC-Bayesian C-bounds, takes dependencies between ensemble members into account, but it requires estimating correlations between the errors of the individual classifiers. When the correlations are high or the estimation is poor, the bounds degrade. In our experiments, we compute generalization bounds for random forests on various benchmark data sets. Because the individual decision trees already perform well, their predictions are highly correlated, and the C-bounds do not lead to satisfactory results. For the same reason, the bounds based on the analysis of Gibbs classifiers are typically superior and often reasonably tight. Bounds based on a validation set, which comes at the cost of a smaller training set, gave better performance guarantees but worse performance in most experiments.
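    The first relation above (majority-vote error at most twice the Gibbs error) can be checked on a toy ensemble. The vote matrix and labels below are invented; a real evaluation would use out-of-bag predictions from an actual forest:

```python
import statistics

# predictions[i][j]: vote of tree j on example i; labels[i]: true label.
# Invented toy data for illustration.
predictions = [
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
]
labels = [1, 0, 1, 1]

def gibbs_error(preds, labels):
    # The Gibbs classifier picks one member uniformly at random,
    # so its error is the average individual-tree error.
    errs = []
    for j in range(len(preds[0])):
        errs.append(statistics.mean(p[j] != y for p, y in zip(preds, labels)))
    return statistics.mean(errs)

def majority_vote_error(preds, labels):
    votes = [max(set(p), key=p.count) for p in preds]
    return statistics.mean(v != y for v, y in zip(votes, labels))

g = gibbs_error(predictions, labels)
mv = majority_vote_error(predictions, labels)
print(mv, "<=", 2 * g, ":", mv <= 2 * g)
```

    On this example the vote error is 0.25 while twice the Gibbs error is about 0.67, illustrating why the factor-of-two bound is valid but can be loose when averaging helps.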

    AWF Edwards and the origin of Bayesian phylogenetics

    In the early 1960s, Anthony Edwards and Luca Cavalli-Sforza made an effort to apply R.A. Fisher’s maximum likelihood (ML) method to estimate genealogical trees of human populations using gene frequency data. They used the Yule branching process to describe the probabilities of the trees and branching times, and the Brownian motion process to model the drift of gene frequencies (after a suitable transformation) over time along the branches. They experienced considerable difficulties, including “singularities” in the likelihood surface, mainly because a distinction between parameters and random variables was not clearly made. In the process they invented the distance (additive-tree) and parsimony (minimum-evolution) methods, both of which they viewed as heuristic approximations to ML. The statistical nature of the inference problem was not clarified until Edwards [1], which pointed out that the trees should be estimated from their conditional distribution given the genetic data, rather than from the “likelihood function”. In modern terminology, this is the Bayesian approach to phylogeny estimation: the Yule process specifies a prior on trees, while the conditional distribution of the trees given the data is the posterior. This article discusses the connections of the remarkable paper of Edwards [1] to modern Bayesian phylogenetics, and briefly comments on some modelling decisions Edwards made then that still concern us today. The reader I have in mind is familiar with modern phylogenetic methods but may not have read Edwards [1], which is published in a statistics journal.
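    The Yule (pure-birth) process that supplies the prior on trees above can be simulated directly: each extant lineage splits at a constant rate, so the waiting time between successive splits is exponential with rate proportional to the current number of lineages. A minimal sketch, assuming a per-lineage split rate of 1 (the function name and starting configuration are illustrative, not Edwards' actual implementation):

```python
import random

def yule_branching_times(n_tips, rng=random):
    """Times (from the root) at which the tree grows from 2 to n_tips lineages."""
    t, lineages, times = 0.0, 2, []
    while lineages < n_tips:
        # With k lineages each splitting at rate 1, the next split
        # arrives after an Exponential(k) waiting time.
        t += rng.expovariate(lineages)
        times.append(t)
        lineages += 1
    return times

random.seed(42)
print(yule_branching_times(5))  # three increasing split times
```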