    BayesBinMix: an R Package for Model Based Clustering of Multivariate Binary Data

    The BayesBinMix package offers a Bayesian framework for clustering binary data with or without missing values by fitting mixtures of multivariate Bernoulli distributions with an unknown number of components. It allows the joint estimation of the number of clusters and the model parameters using Markov chain Monte Carlo sampling. Heated chains are run in parallel and accelerate convergence to the target posterior distribution. Identifiability issues are addressed by implementing label-switching algorithms. The package is demonstrated and benchmarked against the Expectation-Maximization algorithm using a simulation study as well as a real dataset. Comment: Accepted to the R Journal. The package is available on CRAN: https://CRAN.R-project.org/package=BayesBinMix
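
    For orientation, the following is a minimal R sketch of how the package might be driven on simulated data. The simulated inputs and argument values are illustrative only, and the argument names of coupledMetropolis, the package's main routine, should be checked against the CRAN documentation.

        # Hedged sketch: simulate binary data from a 3-component mixture of
        # multivariate Bernoulli distributions, then fit with the number of
        # clusters unknown (see ?coupledMetropolis for authoritative arguments).
        library(BayesBinMix)

        set.seed(1)
        n <- 200; d <- 10
        z <- sample(1:3, n, replace = TRUE)                  # true cluster labels
        theta <- matrix(runif(3 * d), nrow = 3)              # success probabilities
        x <- matrix(rbinom(n * d, 1, theta[z, ]), nrow = n)  # n x d binary matrix

        fit <- coupledMetropolis(
          Kmax         = 10,                      # upper bound on the number of clusters
          nChains      = 4,                       # heated chains run in parallel
          heats        = seq(1, 0.4, length = 4), # temperatures; 1 = target posterior
          binaryData   = x,
          outPrefix    = "bbm_output",
          ClusterPrior = "poisson",               # prior on the number of clusters
          m            = 1100,                    # MCMC iterations
          burn         = 100                      # burn-in iterations to discard
        )
        print(fit)  # summarises the posterior of the number of clusters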

    Variable selection via penalized regression and the genetic algorithm using information complexity, with applications for high-dimensional -omics data

    This dissertation is a collection of examples, algorithms, and techniques for researchers interested in selecting influential variables from statistical regression models. Chapters 1, 2, and 3 provide background information that is used throughout the remaining chapters, on topics including, but not limited to, information complexity, model selection, covariance estimation, stepwise variable selection, penalized regression, and especially the genetic algorithm (GA) approach to variable subsetting. In chapter 4, we fully develop the framework for performing GA subset selection in logistic regression models, and we present the advantages of this approach over stepwise selection and elastic-net regularized regression in selecting variables from a classical set of ICU data. We further compare these results to an entirely new procedure for variable selection developed explicitly for this dissertation, called the post hoc adjustment of measured effects (PHAME). In chapter 5, we reproduce many of the same results from chapter 4, for the first time, in a multinomial logistic regression setting; the utility and convenience of the PHAME procedure are demonstrated on a set of cancer genomic data. Chapter 6 marks a departure from supervised learning problems as we shift our focus to unsupervised problems involving mixture distributions of count data from epidemiologic fields. We start by reintroducing Minimum Hellinger Distance estimation, alongside model selection techniques, as a worthy alternative to the EM algorithm for fitting mixtures of Poisson distributions, and we also create, for the first time, a GA that derives mixtures of negative binomial distributions. The work from chapter 6 is incorporated into chapters 7 and 8, where we conclude the dissertation with a novel analysis of mixtures of count-data regression models. We provide algorithms based on single- and multi-target genetic algorithms that solve the problem of fitting mixtures of penalized count-data regression models, and we demonstrate the usefulness of this technique on HIV count data used in a previous study published by Gray, Massaro, et al. (2015), as well as on time-to-event data taken from the earlier cancer genomic data sets.
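
    As a concrete illustration of GA subset selection (chapter 4), here is a generic sketch in R built on the CRAN package GA. It is not the dissertation's exact procedure: subsets are scored with BIC rather than information complexity, and the simulated data are purely illustrative.

        # Hedged sketch: encode each candidate subset as a bit string and let a
        # genetic algorithm search for the subset minimising the BIC of the
        # induced logistic regression model.
        library(GA)

        set.seed(2)
        n <- 300; p <- 20
        X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
        eta <- X[, 1] - 1.5 * X[, 3] + 0.8 * X[, 7]   # only x1, x3, x7 matter
        y <- rbinom(n, 1, plogis(eta))

        # Fitness must be maximised, so return the negative BIC.
        fitness <- function(bits) {
          if (sum(bits) == 0) return(-Inf)            # rule out the empty model
          dat <- data.frame(y = y, X[, bits == 1, drop = FALSE])
          -BIC(glm(y ~ ., data = dat, family = binomial))
        }

        ga_fit <- ga(type = "binary", fitness = fitness, nBits = p,
                     popSize = 50, maxiter = 100, run = 25, monitor = FALSE)
        colnames(X)[ga_fit@solution[1, ] == 1]        # selected variables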

    On the estimation of mixtures of Poisson regression models with large number of components

    Modelling heterogeneity in large datasets of counts in the presence of covariates demands advanced clustering methods. To this end, a mixture of Poisson regressions is proposed. Conditionally on the covariates and a cluster, the multivariate distribution is a product of independent Poisson distributions. A variety of parameterizations is considered for the slope of the conditional log-means, including the case of partitioning the response variables into sets of replicates sharing the same conditional log-mean up to an additive constant. Model parameters are estimated via an Expectation-Maximization algorithm with Newton-Raphson steps. In particular, an efficient initialization is introduced in order to improve the inference: a splitting scheme is combined with a Small-EM strategy. Simulations and applications to two real high-throughput sequencing datasets highlight improved parameter estimation. The proposed methodology is implemented in the R package poisson.glm.mix, available on CRAN.
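
    The core EM loop for such a model can be sketched compactly. The following bare-bones R illustration on simulated one-covariate data is not the package's implementation, which adds Newton-Raphson M-steps, the replicate structure, and the split/Small-EM initialization described above.

        # Hedged sketch: EM for a 2-component mixture of Poisson regressions.
        set.seed(3)
        n <- 500
        x <- runif(n)
        z <- sample(1:2, n, replace = TRUE)
        b_true <- rbind(c(0.2, 1.5), c(1.0, -1.0))   # per-component (intercept, slope)
        y <- rpois(n, exp(b_true[z, 1] + b_true[z, 2] * x))

        K <- 2
        pi_k <- rep(1 / K, K)                        # mixing proportions
        beta <- rbind(c(0, 1), c(1, 0))              # crude starting values
        for (iter in 1:100) {
          # E-step: posterior component probabilities (responsibilities)
          logd <- sapply(1:K, function(k)
            log(pi_k[k]) + dpois(y, exp(beta[k, 1] + beta[k, 2] * x), log = TRUE))
          tau <- exp(logd - apply(logd, 1, max))     # stabilised exponentiation
          tau <- tau / rowSums(tau)
          # M-step: weighted Poisson GLMs and updated mixing proportions
          for (k in 1:K)
            beta[k, ] <- coef(glm(y ~ x, family = poisson, weights = tau[, k]))
          pi_k <- colMeans(tau)
        }
        round(cbind(pi = pi_k, beta), 2)             # compare with b_true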

    Machine Learning in Insurance

    Machine learning is a relatively new field, without a unanimous definition. In many ways, actuaries have long been machine learners: in pricing and reserving, and more recently in capital modelling, they have combined statistical methodology with a deep understanding of the problem at hand and of how any solution may affect the company and its customers. One aspect that has, perhaps, not been so well developed among actuaries is validation. Discussions of actuaries’ “preferred methods” have often proceeded without solid scientific arguments, including validation on the case at hand. Through this collection, we aim to promote good practice of machine learning in insurance by considering three key issues: (a) who is the client, sponsor, or otherwise interested real-life target of the study? (b) What is the reason for working with a particular data set, and what extra knowledge (which we also call prior knowledge) is available beyond the data set alone? (c) What is the mathematical-statistical argument for the validation procedure?