A new BART prior for flexible modeling with categorical predictors
Default implementations of Bayesian Additive Regression Trees (BART)
represent categorical predictors using several binary indicators, one for each
level of each categorical predictor. Regression trees built with these
indicators partition the levels using a ``remove one at a time'' strategy.
Unfortunately, the vast majority of partitions of the levels cannot be built
with this strategy, severely limiting BART's ability to ``borrow strength''
across groups of levels. We overcome this limitation with a new class of
regression tree and a new decision rule prior that can assign multiple levels
to both the left and right child of a decision node. Motivated by spatial
applications with areal data, we introduce a further decision rule prior that
partitions the areas into spatially contiguous regions by deleting edges from
random spanning trees of a suitably defined network. We implemented our new
regression tree priors in the flexBART package, which, compared to existing
implementations, often yields improved out-of-sample predictive performance
without much additional computational burden. We demonstrate the efficacy of
flexBART using examples from baseball and the spatiotemporal modeling of crime.
Comment: Software available at https://github.com/skdeshpande91/flexBAR
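The spanning-tree decision rule described above can be sketched concretely. The Python below is an illustrative reconstruction, not the flexBART implementation: it draws a random spanning tree of an areal adjacency graph via randomized depth-first search, deletes edges from it, and reads off the connected components as spatially contiguous regions. The function names and the toy 2x3 grid of areas are invented for illustration.

```python
import random
from collections import defaultdict

def random_spanning_tree(adj, seed=0):
    """Spanning tree edges via randomized DFS (not exactly uniform).

    adj: dict mapping area -> list of adjacent areas (areal adjacency graph).
    """
    rng = random.Random(seed)
    start = next(iter(adj))
    visited = {start}
    tree_edges = []
    stack = [start]
    while stack:
        u = stack[-1]
        unvisited = [v for v in adj[u] if v not in visited]
        if not unvisited:
            stack.pop()
            continue
        v = rng.choice(unvisited)
        visited.add(v)
        tree_edges.append((u, v))
        stack.append(v)
    return tree_edges

def partition_by_edge_deletion(adj, n_parts=2, seed=0):
    """Partition areas into n_parts spatially contiguous regions by
    deleting n_parts - 1 edges from a random spanning tree."""
    rng = random.Random(seed)
    kept = random_spanning_tree(adj, seed)
    for _ in range(n_parts - 1):
        kept.remove(rng.choice(kept))
    # connected components of the pruned tree = contiguous regions
    forest = defaultdict(list)
    for u, v in kept:
        forest[u].append(v)
        forest[v].append(u)
    labels, label = {}, 0
    for node in adj:
        if node in labels:
            continue
        labels[node] = label
        stack = [node]
        while stack:
            u = stack.pop()
            for v in forest[u]:
                if v not in labels:
                    labels[v] = label
                    stack.append(v)
        label += 1
    return labels

# 2x3 grid of areas; each cell is adjacent to horizontal/vertical neighbours
adj = {
    0: [1, 3], 1: [0, 2, 4], 2: [1, 5],
    3: [0, 4], 4: [3, 1, 5], 5: [4, 2],
}
labels = partition_by_edge_deletion(adj, n_parts=2, seed=42)
print(labels)
```

Because every deleted edge splits exactly one tree component in two, the result is always exactly n_parts regions, each connected in the original adjacency graph.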
Clustering in multivariate data: visualization, case and variable reduction
Cluster analysis is a very common problem for multivariate data. It is receiving intense attention due to the current boom in data warehousing and mining driven by the growth in information technology. Technology allows us to collect massive data sets, in both cases and variables, and to develop sophisticated interactive and dynamic graphics. There are three current issues for cluster analysis: visualizing cluster structure, reducing the number of cases, and reducing the number of variables in very large data sets. This thesis addresses each of these issues. The lower-dimensional projection of the data found by projection pursuit, which preserves the cluster structure, helps clustering by eliminating the influence of nuisance variables. Initially partitioning the data into a set of small classifications improves the efficiency of hierarchical agglomerative clustering by saving the time and memory spent in the beginning stage of clustering; a minimal spanning tree is used for this partitioning
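The minimal-spanning-tree partitioning mentioned above can be sketched as follows. This is an illustrative reconstruction, not the thesis's method: a minimal spanning tree is built with Prim's algorithm, and the longest edges are cut to form k preliminary groups, which could then seed hierarchical agglomerative clustering.

```python
import math

def mst_edges(points):
    """Prim's algorithm: return edges (i, j, dist) of the minimal spanning tree."""
    n = len(points)
    dist = lambda a, b: math.dist(points[a], points[b])
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # cheapest edge leaving the current tree
        i, j = min(((i, j) for i in in_tree for j in range(n) if j not in in_tree),
                   key=lambda e: dist(*e))
        edges.append((i, j, dist(i, j)))
        in_tree.add(j)
    return edges

def mst_partition(points, k):
    """Cut the k - 1 longest MST edges to form k preliminary groups."""
    # keep only the n - k shortest edges
    kept = sorted(mst_edges(points), key=lambda e: e[2])[: len(points) - k]
    # union-find over the remaining edges to label components
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for i, j, _ in kept:
        parent[find(i)] = find(j)
    return [find(i) for i in range(len(points))]

# two well-separated blobs: cutting the single long MST edge recovers them
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = mst_partition(points, k=2)
print(labels)
```

Cutting long MST edges is a classical single-linkage-style heuristic; here it serves only to show how a spanning tree yields an initial case reduction.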
An efficient message passing algorithm for multi-target tracking
We propose a new approach for multi-sensor multi-target tracking by constructing statistical models on graphs with continuous-valued nodes for target states and discrete-valued nodes for data association hypotheses. These graphical representations lead to message-passing algorithms for the fusion of data across time, sensor, and target that are radically different than algorithms such as those found in state-of-the-art multiple hypothesis tracking (MHT) algorithms. Important differences include: (a) our message-passing algorithms explicitly compute different probabilities and estimates than MHT algorithms; (b) our algorithms propagate information from future data about past hypotheses via messages backward in time (rather than doing this via extending track hypothesis trees forward in time); and (c) the combinatorial complexity of the problem is manifested in a different way, one in which particle-like, approximated, messages are propagated forward and backward in time (rather than hypotheses being enumerated and truncated over time). A side benefit of this structure is that it automatically provides smoothed target trajectories using future data. A major advantage is the potential for low-order polynomial (and linear in some cases) dependency on the length of the tracking interval N, in contrast with the exponential complexity in N for so-called N-scan algorithms. We provide experimental results that support this potential. As a result, we can afford to use longer tracking intervals, allowing us to incorporate out-of-sequence data seamlessly and to conduct track-stitching when future data provide evidence that disambiguates tracks well into the past
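To see how backward messages let future data revise past beliefs (point (b) above), here is a deliberately simplified single-target, single-sensor sketch: plain forward-backward smoothing over a discrete state chain. The multi-target data-association nodes and particle-like approximate messages of the abstract are omitted; all names and the toy numbers are illustrative.

```python
def forward_backward(prior, trans, lik):
    """prior[i] = p(x_0 = i); trans[i][j] = p(x_{t+1} = j | x_t = i);
    lik[t][i] = p(y_t | x_t = i). Returns smoothed marginals p(x_t | y_{0:T})."""
    T, S = len(lik), len(prior)
    # forward messages: alpha_t(i) proportional to p(x_t = i, y_{0:t})
    alpha = [[prior[i] * lik[0][i] for i in range(S)]]
    for t in range(1, T):
        alpha.append([lik[t][j] * sum(alpha[-1][i] * trans[i][j] for i in range(S))
                      for j in range(S)])
    # backward messages: beta_t(i) proportional to p(y_{t+1:T} | x_t = i)
    beta = [[1.0] * S]
    for t in range(T - 2, -1, -1):
        beta.insert(0, [sum(trans[i][j] * lik[t + 1][j] * beta[0][j]
                            for j in range(S)) for i in range(S)])
    # smoothed posteriors combine past (alpha) and future (beta) evidence
    post = []
    for a, b in zip(alpha, beta):
        w = [x * y for x, y in zip(a, b)]
        z = sum(w)
        post.append([x / z for x in w])
    return post

# two states, sticky dynamics; only the last observation is informative,
# yet smoothing pulls the t = 0 belief toward state 0 via backward messages
prior = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
lik = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1]]
post = forward_backward(prior, trans, lik)
print(post[0])
```

A forward-only filter would leave the t = 0 belief at (0.5, 0.5); the backward pass is what propagates the disambiguating future evidence into the past, which is the mechanism the abstract exploits for track-stitching.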
Low-Complexity Nonparametric Bayesian Online Prediction with Universal Guarantees
We propose a novel nonparametric online predictor for discrete labels
conditioned on multivariate continuous features. The predictor is based on a
feature space discretization induced by a full-fledged k-d tree with randomly
picked directions and a recursive Bayesian distribution, which allows it to
automatically learn the most relevant feature scales characterizing the
conditional distribution. We prove its pointwise universality, i.e., it
achieves a normalized log loss performance asymptotically as good as the true
conditional entropy of the labels given the features. The time complexity to
process the n-th sample point is sublinear, in probability with respect to
the distribution generating the data points, whereas other exact nonparametric
methods must process all past observations. Experiments on challenging
datasets show the computational and statistical efficiency of our algorithm in
comparison to standard and state-of-the-art methods.
Comment: Camera-ready version published in NeurIPS 201
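A loose sketch of the ingredients this abstract describes — a k-d tree with randomly picked split directions inducing a feature-space discretization, with per-leaf label counts updated online — is given below. This is not the authors' predictor (in particular it uses plain add-one smoothing rather than their recursive Bayesian distribution), and the class name and parameters are invented.

```python
import random

class RandomKDPredictor:
    """Online discrete-label predictor over a feature-space discretization
    induced by a k-d tree with randomly picked split directions.
    Each leaf keeps smoothed label counts and splits once it has seen
    `cap` points, so processing a point costs O(tree depth)."""

    def __init__(self, n_labels, cap=8, seed=0):
        self.n_labels, self.cap = n_labels, cap
        self.rng = random.Random(seed)
        self.root = self._leaf()

    def _leaf(self):
        return {"counts": [0] * self.n_labels, "pts": [], "split": None}

    def _descend(self, x):
        node = self.root
        while node["split"] is not None:
            dim, thr = node["split"]
            node = node["left"] if x[dim] <= thr else node["right"]
        return node

    def predict_proba(self, x):
        c = self._descend(x)["counts"]
        tot = sum(c) + self.n_labels          # add-one (Laplace) smoothing
        return [(k + 1) / tot for k in c]

    def update(self, x, y):
        node = self._descend(x)
        node["counts"][y] += 1
        node["pts"].append((x, y))
        if len(node["pts"]) >= self.cap:      # split on a random direction
            dim = self.rng.randrange(len(x))
            thr = sorted(p[dim] for p, _ in node["pts"])[len(node["pts"]) // 2]
            left, right = self._leaf(), self._leaf()
            for p, lab in node["pts"]:
                child = left if p[dim] <= thr else right
                child["counts"][lab] += 1
                child["pts"].append((p, lab))
            if left["pts"] and right["pts"]:  # guard against degenerate splits
                node.update(split=(dim, thr), left=left, right=right, pts=[])

# stream of points whose label is determined by the first feature
rng = random.Random(1)
model = RandomKDPredictor(n_labels=2, seed=1)
for _ in range(500):
    x = (rng.random(), rng.random())
    model.update(x, int(x[0] > 0.5))
p = model.predict_proba((0.9, 0.5))
print(p)  # class probabilities for a point deep in the label-1 region
```

Each update only walks one root-to-leaf path, which is what makes the per-sample cost grow with tree depth rather than with the number of past observations.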
Tree models: a Bayesian perspective
Submitted in partial fulfilment of the requirements for the degree of Master of Philosophy at Queen Mary, University of London, November 2006.
Classical tree models represent an attempt to create nonparametric models which
have good predictive power as well as a simple structure readily comprehensible by
non-experts. Bayesian tree models have been created by a team consisting of Chipman,
George and McCulloch and a second team consisting of Denison, Mallick and Smith.
Both approaches employ Green's Reversible Jump Markov Chain Monte Carlo technique
to carry out a more effective search than the `greedy' methods used classically.
The aim of this work is to evaluate both types of Bayesian tree models from a
Bayesian perspective and compare them
Effective techniques for handling incomplete data using decision trees
Decision Trees (DTs) have been recognized as one of the most successful formalisms for knowledge representation and reasoning and are currently applied to a variety of data mining or knowledge discovery applications, particularly for classification problems. There are several efficient methods to learn a DT from data. However, these methods are often limited to the assumption that data are complete.
In this thesis, some contributions to the field of machine learning and statistics that solve the problem of extracting DTs for learning and classification tasks from incomplete databases are presented. The methodology underlying the thesis blends together well-established statistical theories with the most advanced techniques for machine learning and automated reasoning with uncertainty.
The first contribution is an extensive simulation study of the impact of missing data on the predictive accuracy of existing DTs that can cope with missing values, when missing values are in both the training and test sets or in either of the two sets. All simulations are performed under missing completely at random, missing at random and informatively missing (IM) mechanisms and for different missing data patterns and proportions.
The next contribution is a simple, novel, yet effective procedure for training and testing decision trees in the presence of missing data. Original and simple splitting criteria for attribute selection in tree building are put forward. The proposed technique is evaluated and validated in empirical tests over many real-world application domains. The proposed algorithm maintains (and sometimes exceeds) the outstanding accuracy of multiple imputation, especially on datasets containing mixed attributes and purely nominal attributes, and greatly improves accuracy on informatively missing data. Another major advantage of this method over multiple imputation is the important saving in computational resources due to its simplicity.
The next contribution is three versions of simple probabilistic techniques for classifying incomplete vectors using decision trees built from complete data. The proposed procedure is superficially similar to the fractional-cases approach but more effective. The experimental results demonstrate that these approaches achieve quality comparable to sophisticated algorithms like multiple imputation and are therefore applicable to all kinds of datasets.
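The fractional-cases idea that the proposed techniques are compared against can be illustrated as follows: a test vector missing a split attribute is sent down every branch, weighted by the fraction of training cases that took each branch, and the leaf class distributions are averaged. The tree, attribute names, and weights below are toy illustrations, not the thesis's algorithm.

```python
def classify_fractional(node, x):
    """node: either {'dist': {label: prob}} for a leaf, or
    {'attr': name, 'branches': {value: (weight, subtree)}} where the
    weights are training fractions summing to 1.
    x: dict of observed attribute values; some attributes may be missing."""
    if "dist" in node:
        return dict(node["dist"])
    attr = node["attr"]
    if attr in x:                       # attribute observed: follow one branch
        _, subtree = node["branches"][x[attr]]
        return classify_fractional(subtree, x)
    out = {}                            # attribute missing: weighted average
    for weight, subtree in node["branches"].values():
        for label, p in classify_fractional(subtree, x).items():
            out[label] = out.get(label, 0.0) + weight * p
    return out

# toy stump: 40% of training cases were sunny, 60% rainy
tree = {
    "attr": "outlook",
    "branches": {
        "sunny": (0.4, {"dist": {"yes": 0.2, "no": 0.8}}),
        "rain":  (0.6, {"dist": {"yes": 0.9, "no": 0.1}}),
    },
}
print(classify_fractional(tree, {"outlook": "sunny"}))  # {'yes': 0.2, 'no': 0.8}
print(classify_fractional(tree, {}))  # mixes both leaves: {'yes': 0.62, 'no': 0.38}
```

With "outlook" missing, the prediction is 0.4 x (0.2, 0.8) + 0.6 x (0.9, 0.1) = (0.62, 0.38), which is the fractional-cases behaviour the abstract's probabilistic techniques refine.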
Finally, two novel ensemble procedures for handling incomplete training and test data are proposed and discussed. The algorithms combine the two best approaches either with resampling (REMIMIA) or without resampling (EMIMIA) of the training data before growing the decision trees. Empirical tests are used to evaluate and validate the proposed ensemble methods against individual missing-data techniques. EMIMIA attains the highest overall level of prediction accuracy.