Model selection methods in the linear mixed model for longitudinal data
The increased use of repeated measures in longitudinal studies has created a need for more research on modeling this type of data. In this dissertation, we extend three candidate model selection methods from the univariate linear model to the linear mixed model and investigate their behavior. Mallows' Cp statistic was developed for the univariate linear model in 1964. Here we propose a Cp statistic for the linear mixed model and show that it can be a promising method for fixed-effects selection. Of all the methods investigated in this dissertation, the Cp statistic gave the most favorable results for fixed-effects selection and is the least computationally demanding of the candidate methods. The KIC statistic, a symmetric-divergence information criterion explored here, appears promising as a selection method for both fixed effects and covariance structure. In selecting the correct covariance structure, the KIC tended to hold middle ground between the AIC and the BIC. For fixed effects, the KIC appears to perform significantly better than either the AIC or the BIC when no interaction effect is present. The predicted sum of squares (PRESS) statistic has been developed for the linear mixed model and is available in the SAS statistical software, but its abilities as a model selection method had not been sufficiently evaluated. From our study, it appears that the PRESS statistic adds little as a fixed-effects selection method compared to the Cp or the KIC while being more computationally intensive. All three criteria are investigated using simulation studies and a large example dataset on health outcomes in the elderly to determine their reliability. As a by-product of this research, the reliability of the standard selection criteria in the linear mixed model, namely the AIC and BIC, is also evaluated.
Numerous areas of future research within the context of model selection methods in the linear mixed model are identified
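The criteria compared above each have a closed form in the ordinary (univariate) linear model. The sketch below is illustrative only: it uses synthetic data, assumes Gaussian errors, and takes Cavanaugh's symmetric-divergence form KIC = -2 log L + 3p as the KIC; the dissertation's mixed-model extensions are not reproduced here.

```python
import math
import random

random.seed(0)
n = 60
x = [i / n for i in range(n)]
# Synthetic data: the true model has a slope, so the slope model should win.
y = [2.0 + 3.0 * xi + random.gauss(0.0, 0.5) for xi in x]

def rss_mean_only(y):
    """Residual sum of squares for the intercept-only model."""
    m = sum(y) / len(y)
    return sum((yi - m) ** 2 for yi in y)

def rss_slope(x, y):
    """Residual sum of squares for simple linear regression (closed form)."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

def criteria(rss, p, n, sigma2_full):
    """Cp, AIC, BIC, and (assumed symmetric-divergence) KIC for one candidate."""
    # Gaussian log-likelihood at the MLE, up to the usual constants.
    loglik = -0.5 * n * (math.log(2 * math.pi * rss / n) + 1)
    cp = rss / sigma2_full - n + 2 * p          # Mallows' Cp
    aic = -2 * loglik + 2 * p
    bic = -2 * loglik + p * math.log(n)
    kic = -2 * loglik + 3 * p                   # Cavanaugh's KIC (assumption)
    return cp, aic, bic, kic

rss0, rss1 = rss_mean_only(y), rss_slope(x, y)
sigma2_full = rss1 / (n - 2)   # error variance estimated from the largest model
crit0 = criteria(rss0, 1, n, sigma2_full)   # intercept-only
crit1 = criteria(rss1, 2, n, sigma2_full)   # intercept + slope
```

Note that for the largest model, Cp reduces algebraically to p, which is a quick sanity check on the implementation.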
Predictive Inference Based on Markov Chain Monte Carlo Output
In Bayesian inference, predictive distributions are typically available only in the form of samples generated via Markov chain Monte Carlo or related algorithms. In this paper, we conduct a systematic analysis of how to make and evaluate probabilistic forecasts from such simulation output. Based on proper scoring rules, we develop a notion of consistency that allows us to assess the adequacy of methods for estimating the stationary distribution underlying the simulation output. We then provide asymptotic results that account for the salient features of Bayesian posterior simulators and derive conditions under which choices from the literature satisfy our notion of consistency. Importantly, these conditions depend on the scoring rule being used, so that the choices of approximation method and scoring rule are intertwined. While the logarithmic rule requires fairly stringent conditions, the continuous ranked probability score yields consistent approximations under minimal assumptions. These results are illustrated in a simulation study and an economic data example. Overall, mixture-of-parameters approximations that exploit the parametric structure of Bayesian models perform particularly well. Under the continuous ranked probability score, the empirical distribution function is a simple and appealing alternative option
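The empirical-distribution-function option under the continuous ranked probability score can be illustrated with the standard kernel form CRPS(F, y) = E|X - y| - (1/2) E|X - X'|, evaluated on simulated draws standing in for MCMC output. A minimal sketch (the draws and evaluation points are illustrative, not from the paper):

```python
import random

random.seed(1)
# Stand-in for MCMC output: i.i.d. draws from a (here known) predictive distribution.
draws = [random.gauss(0.0, 1.0) for _ in range(4000)]

def crps_ecdf(draws, y):
    """CRPS of the empirical distribution function of the draws at outcome y.

    Uses the kernel form CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|.
    """
    m = len(draws)
    term1 = sum(abs(x - y) for x in draws) / m
    # E|X - X'| over all pairs, computed in O(m log m) via the sorted-sample identity.
    s = sorted(draws)
    acc, running = 0.0, 0.0
    for i, xi in enumerate(s):
        acc += i * xi - running   # sum of (xi - xk) over k < i
        running += xi
    term2 = 2.0 * acc / (m * m)
    return term1 - 0.5 * term2

score_center = crps_ecdf(draws, 0.0)   # outcome near the predictive center
score_tail = crps_ecdf(draws, 3.0)     # outcome in the tail scores worse
```

A lower CRPS is better, so an outcome well inside the predictive distribution should score lower than one in its tail.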
Bayesian model averaging on hydraulic conductivity estimation and groundwater head prediction
Characterization of aquifer heterogeneity is inherently difficult because of the insufficiency of data and the inflexibility and non-uniqueness of parameterization methods. Groundwater predictions are greatly affected by multiple interpretations of aquifer properties and by the uncertainties of model parameters. This study introduces a Bayesian model averaging (BMA) method along with multiple generalized parameterization (GP) methods to identify hydraulic conductivity, and along with multiple simulation models to predict groundwater head and quantify the prediction uncertainty. Two major issues with BMA are discussed. The first concerns the use of Occam's window in typical BMA applications: Occam's window accepts models only in a very narrow range, tending to single out the best model and discard other good ones. A variance window is proposed to replace Occam's window to cope with this problem. The second concerns the use of the Kashyap information criterion (KIC) in the approximation of posterior model probabilities, which tends to prefer highly uncertain models because it incorporates the Fisher information matrix. The Bayesian information criterion (BIC) is recommended because it avoids controversial results and is computationally efficient. Numerical examples are designed to test the Bayesian model averaging method on hydraulic conductivity identification and groundwater head prediction. The proposed methodologies are then applied to hydraulic conductivity identification in the Alamitos Gap area, and to hydraulic conductivity estimation and groundwater head prediction for the "1,500-foot" sand in East Baton Rouge Parish, Louisiana. The results show that the GP method provides great flexibility in parameterization with small conditional variance. The use of the variance window is necessary to avoid a dominant model when many models perform equally well.
Compared to the KIC, the BIC is able to give an unbiased posterior model probability. It is also concluded that the stated uncertainty increases when multiple models are included under the BMA framework, but risk is reduced by avoiding overconfidence in the solution of any single model
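The BIC-based approximation to posterior model probabilities discussed above is commonly computed from BIC differences as w_k ∝ exp(-ΔBIC_k / 2), and the BMA prediction is the weight-averaged prediction across models. A minimal sketch with hypothetical BIC values and predictions (not from this study):

```python
import math

# Hypothetical BIC values for three competing conceptual models (illustration only).
bics = {"model_A": 1052.3, "model_B": 1049.8, "model_C": 1060.1}

def bma_weights(bics):
    """Approximate posterior model probabilities from BIC differences:
    w_k proportional to exp(-dBIC_k / 2), with dBIC_k = BIC_k - min BIC."""
    b_min = min(bics.values())
    raw = {k: math.exp(-(v - b_min) / 2.0) for k, v in bics.items()}
    z = sum(raw.values())
    return {k: r / z for k, r in raw.items()}

weights = bma_weights(bics)

# BMA prediction: each model's (hypothetical) head prediction, weight-averaged.
preds = {"model_A": 12.1, "model_B": 11.7, "model_C": 13.0}
bma_pred = sum(weights[k] * preds[k] for k in preds)
```

The weights sum to one, and the BMA prediction always lies within the range spanned by the individual models' predictions.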
Discriminative, generative, and imitative learning
Thesis (Ph.D.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2002. Includes bibliographical references (leaves 201-212).
I propose a common framework that combines three different paradigms in machine learning: generative, discriminative and imitative learning. A generative probabilistic distribution is a principled way to model many machine learning and machine perception problems. Therein, one provides domain-specific knowledge in terms of structure and parameter priors over the joint space of variables. Bayesian networks and Bayesian statistics provide a rich and flexible language for specifying this knowledge and subsequently refining it with data and observations. The final result is a distribution that is a good generator of novel exemplars. Conversely, discriminative algorithms adjust a possibly non-distributional model to data, optimizing for a specific task such as classification or prediction. This typically leads to superior performance yet compromises the flexibility of generative modeling. I present Maximum Entropy Discrimination (MED) as a framework to combine both discriminative estimation and generative probability densities. Calculations involve distributions over parameters, margins, and priors and are provably and uniquely solvable for the exponential family. Extensions include regression, feature selection, and transduction. SVMs are also naturally subsumed and can be augmented with, for example, feature selection to obtain substantial improvements. To extend to mixtures of exponential families, I derive a discriminative variant of the Expectation-Maximization (EM) algorithm for latent discriminative learning (or latent MED). While EM and Jensen lower-bound the log-likelihood, a dual upper bound is made possible via a novel reverse-Jensen inequality.
The variational upper bound on the latent log-likelihood has the same form as the EM bounds, is efficiently computable, and is globally guaranteed. It permits powerful discriminative learning with the wide range of contemporary probabilistic mixture models (mixtures of Gaussians, mixtures of multinomials, and hidden Markov models). We provide empirical results on standardized datasets that demonstrate the viability of the hybrid discriminative-generative approaches of MED and reverse-Jensen bounds over state-of-the-art discriminative techniques or generative approaches. Subsequently, imitative learning is presented as another variation on generative modeling which also learns from exemplars from an observed data source. However, the distinction is that the generative model is an agent interacting in a much more complex surrounding external world, where it is not efficient to model the aggregate space generatively. I demonstrate that imitative learning (under appropriate conditions) can be adequately addressed as a discriminative prediction task which outperforms the usual generative approach. This discriminative-imitative learning approach is applied with a generative perceptual system to synthesize a real-time agent that learns to engage in social interactive behavior.
by Tony Jebara, Ph.D
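The Jensen lower bound on the mixture log-likelihood that EM exploits can be checked numerically: for any responsibilities q, log Σ_k π_k f_k(x) ≥ Σ_k q_k log(π_k f_k(x) / q_k), with equality exactly when q equals the posterior responsibilities. A small sketch with an assumed two-component Gaussian mixture and made-up observations (the reverse-Jensen upper bound itself is not reproduced here):

```python
import math

def gauss_pdf(x, mu, sigma):
    """Density of a univariate Gaussian."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# An assumed two-component Gaussian mixture and a handful of observations.
pis, mus, sigma = [0.4, 0.6], [-1.0, 2.0], 1.0
data = [-1.2, 0.3, 1.8, 2.5]

def loglik(x):
    """Exact mixture log-likelihood of one observation."""
    return math.log(sum(p * gauss_pdf(x, m, sigma) for p, m in zip(pis, mus)))

def jensen_bound(x, q):
    """Jensen lower bound: sum_k q_k * log(pi_k f_k(x) / q_k) <= loglik(x)."""
    return sum(qk * math.log(p * gauss_pdf(x, m, sigma) / qk)
               for qk, p, m in zip(q, pis, mus) if qk > 0)

def responsibilities(x):
    """Posterior component responsibilities (the E-step quantities)."""
    w = [p * gauss_pdf(x, m, sigma) for p, m in zip(pis, mus)]
    z = sum(w)
    return [wi / z for wi in w]

total_ll = sum(loglik(x) for x in data)
bound_uniform = sum(jensen_bound(x, [0.5, 0.5]) for x in data)          # loose bound
bound_posterior = sum(jensen_bound(x, responsibilities(x)) for x in data)  # tight
```

With the posterior responsibilities the bound is tight, which is exactly why the EM E-step makes the subsequent M-step increase the true log-likelihood.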
Bayesian Saltwater Intrusion Prediction and Remediation Design under Uncertainty
Groundwater resources are vital for sustainable economic and demographic development. Reliable prediction of groundwater head and contaminant transport is necessary for sustainable management of groundwater resources. However, groundwater simulation models are subject to uncertainty in their predictions. The goals of this research are to: (1) quantify the uncertainty in groundwater model predictions and (2) investigate the impact of the quantified uncertainty on aquifer remediation designs. To pursue the first goal, this study generalizes the Bayesian model averaging (BMA) method, introducing the hierarchical Bayesian model averaging (HBMA) method, which segregates and prioritizes sources of uncertainty in a hierarchical structure and conducts BMA for saltwater intrusion prediction. A BMA tree of models is developed to understand the impact of individual sources of uncertainty and of uncertainty propagation on model predictions. The uncertainty analysis using HBMA leads to finding the best modeling proposition and to calculating the relative and absolute model weights. To pursue the second goal, chance-constrained (CC) programming is proposed to deal with uncertainty in the remediation design. Prior studies of CC programming for groundwater remediation design were limited to parameter estimation uncertainty. This study combines CC programming with the BMA and HBMA methods, proposing the BMA-CC and HBMA-CC frameworks, to also include model structure uncertainty in the CC programming. The results show that the prediction variances from parameter estimation uncertainty are much smaller than those from model structure uncertainty. Ignoring model structure uncertainty in the remediation design may lead to overestimating the design reliability, which can cause design failure
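The chance-constrained idea described above has a standard deterministic equivalent when the predicted head is treated as Gaussian: P(head ≤ h_max) ≥ α becomes μ(q) + z_α·σ ≤ h_max, so a larger prediction variance (as when model structure uncertainty is added to parameter uncertainty) forces a more conservative design. A sketch with made-up numbers, not the study's models; the linear head-versus-pumping-rate response is an assumption for illustration:

```python
import math

def z_quantile(alpha):
    """Standard-normal quantile via bisection on the erf-based CDF (stdlib only)."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def min_pumping_rate(mu0, dmu_dq, sigma, h_max, alpha):
    """Deterministic equivalent of P(head <= h_max) >= alpha for a Gaussian head
    prediction whose mean decreases linearly with pumping rate q:
        mu0 - dmu_dq * q + z_alpha * sigma <= h_max."""
    z = z_quantile(alpha)
    return max(0.0, (mu0 + z * sigma - h_max) / dmu_dq)

# Illustrative numbers only: head in meters, pumping rate q in arbitrary units.
# Small sigma ~ parameter uncertainty alone; larger sigma ~ model structure added.
q_param_only = min_pumping_rate(5.0, 0.1, sigma=0.3, h_max=4.0, alpha=0.95)
q_with_model = min_pumping_rate(5.0, 0.1, sigma=0.9, h_max=4.0, alpha=0.95)
```

The larger total variance requires more pumping to meet the same reliability target, mirroring the study's point that ignoring model structure uncertainty overstates design reliability.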
Extended Entropy Maximisation and Queueing Systems with Heavy-Tailed Distributions
Numerous studies of queueing systems, such as Internet traffic flows, have shown them to be bursty, self-similar and/or long-range dependent because of the heavy (long) tails of the various distributions of interest, including intermittent intervals and queue lengths. Other studies have addressed server vacations, taken when there are no customers in the system or when the server fails. These patterns are important for capacity planning, performance prediction, and optimization of networks, and they have a negative impact on effective network functioning. Heavy-tailed distributions have commonly been used by telecommunication engineers to create workloads for simulation studies, which, regrettably, may exhibit peculiar queueing characteristics. To cost-effectively examine the impacts of different network patterns on heavy-tailed queues, new and reliable analytical approaches need to be developed. This thesis establishes a new analytical framework based on optimizing entropy functionals, such as those of Shannon, Rényi, Tsallis, and others suggested within statistical physics and information theory, subject to suitable linear and non-linear system constraints. In both discrete and continuous time domains, new heavy-tailed analytic performance distributions are developed, with a focus on those exhibiting the power-law behaviour seen in many Internet scenarios.
Two major novel approaches are expounded: the unification of information geometry with classical queueing systems, and the unification of information-length theory with transient queueing systems. After the conclusions, open problems and limitations arising from this thesis are identified as future work
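The entropy-maximization principle underlying this framework can be checked in a toy case: with Shannon entropy and a fixed mean queue length, the maximizing distribution over queue lengths is geometric, p(n) = (1-ρ)ρⁿ, and any perturbation that preserves both the normalization and the mean lowers the entropy. A minimal sketch with a truncated support and an illustrative mean (the heavy-tailed, generalized-entropy cases of the thesis are not reproduced here):

```python
import math

def geometric_queue(mean_len, n_max=2000):
    """Max-entropy queue-length distribution for a fixed mean: p(n) = (1-rho) rho^n,
    with rho = mean / (1 + mean), truncated at n_max (tail mass is negligible)."""
    rho = mean_len / (1.0 + mean_len)
    return [(1.0 - rho) * rho ** n for n in range(n_max)]

def shannon_entropy(p):
    """Shannon entropy -sum p log p (natural log)."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

p = geometric_queue(mean_len=1.5)
h_geom = shannon_entropy(p)

# Perturb while preserving both normalization and the mean:
# adding (eps, -2*eps, eps) to states 0, 1, 2 changes the sum by 0
# and the mean by 0*eps - 2*eps + 2*eps = 0.
eps = 1e-2
q = list(p)
q[0] += eps
q[1] -= 2 * eps
q[2] += eps
h_pert = shannon_entropy(q)
```

The perturbed distribution satisfies the same constraints yet has strictly lower entropy, consistent with the geometric form being the constrained maximizer.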
MODELING HETEROTACHY IN PHYLOGENETICS
Heterotachy, the variation of substitution rates across sites and over time, has been shown to be a frequent phenomenon in real data. Failure to model heterotachy can potentially cause phylogenetic artefacts. Currently, several models handle heterotachy: the mixture branch length (MBL) model and several variant forms of the covarion model. In this project, our objective is to find a model that efficiently handles the heterotachous signals present in the data, and thereby improves phylogenetic inference.
In order to achieve our goal, two individual studies were conducted. In the first study, we make comparisons among the MBL, covarion and homotachous models using AIC, BIC and cross-validation. Based on our results, we conclude that the MBL model, in which sites have different branch lengths along the entire tree, is an over-parameterized model. Real data indicate that the heterotachous signals which interfere with phylogenetic inference are generally limited to a small area of the tree. In the second study, we relax the assumption of the homogeneity of the covarion parameters over sites, and develop a mixture covarion model using a Dirichlet process. In order to evaluate different heterogeneous models, we design several posterior predictive discrepancy tests to study different aspects of molecular evolution using stochastic mappings. The posterior predictive discrepancy tests demonstrate that the covarion mixture +Γ model is able to adequately model the substitution variation within and among sites.
Our research permits a detailed view of heterotachy in real datasets and gives directions for future heterotachous models. The posterior predictive discrepancy tests provide diagnostic tools to assess models in detail. Furthermore, both of our studies reveal the non-specificity of heterogeneous models and, consequently, the presence of interactions between different heterogeneous models. Our studies strongly suggest that the different heterogeneous features present in the data should be handled simultaneously in phylogenetic analyses
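A posterior predictive discrepancy test of the kind described can be sketched generically: choose a statistic sensitive to the heterogeneity of interest, simulate replicate datasets under the fitted homogeneous model, and see how often the replicates look as extreme as the data. The toy example below uses a variance-ratio statistic on Gaussian "rates" as a crude stand-in for the stochastic-mapping statistics in the thesis, and a plug-in point estimate rather than posterior draws:

```python
import random

random.seed(2)

def variance(xs):
    """Population variance of a sample."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def discrepancy(data):
    """Ratio of second-half to first-half variance: sensitive to a rate shift
    over 'time', a crude stand-in for a heterotachy-style statistic."""
    h = len(data) // 2
    return variance(data[h:]) / variance(data[:h])

# Observed data whose 'substitution rate' shifts mid-sequence (illustrative only).
observed = ([random.gauss(0.0, 1.0) for _ in range(50)] +
            [random.gauss(0.0, 3.0) for _ in range(50)])
d_obs = discrepancy(observed)

# Replicates simulated under a homogeneous model fitted to the pooled variance
# (a plug-in approximation; a full test would average over posterior draws).
sigma_hat = variance(observed) ** 0.5
reps = [discrepancy([random.gauss(0.0, sigma_hat) for _ in range(100)])
        for _ in range(500)]

# Posterior predictive p-value: fraction of replicates at least as extreme.
pval = sum(d >= d_obs for d in reps) / len(reps)
```

A small p-value flags the homogeneous model as failing to reproduce the heterogeneity the statistic measures, which is the logic of the discrepancy tests above.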