Strategies for online inference of model-based clustering in large and growing networks
In this paper we adapt online estimation strategies to perform model-based
clustering on large networks. Our work focuses on two algorithms, the first
based on the SAEM algorithm, and the second on variational methods. These two
strategies are compared with existing approaches on simulated and real data. We
use the method to decipher the connection structure of the political websphere
during the 2008 US political campaign. We show that our online EM-based
algorithms offer a good trade-off between precision and speed when estimating
parameters for mixture distributions in the context of random graphs.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS359 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
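As a toy illustration of the online estimation idea (not the paper's SAEM or variational algorithm for random graphs), the sketch below runs online EM for a two-component Gaussian mixture with known unit variance: running sufficient statistics are updated with a decaying step size after each observation, and parameters are re-derived from them. All parameter and step-size choices are illustrative assumptions.

```python
import math
import random

def online_em(stream, mu0=(0.0, 1.0), decay=0.6, delay=10):
    """Online EM for a two-component Gaussian mixture with unit variance."""
    mu = list(mu0)
    pi = 0.5
    s_w = [0.5, 0.5]                     # running estimate of E[z_k]
    s_x = [0.5 * m for m in mu]          # running estimate of E[z_k * x]
    for t, x in enumerate(stream, start=1):
        # E-step: responsibilities under the current parameters
        p = [w * math.exp(-0.5 * (x - m) ** 2)
             for w, m in zip((pi, 1.0 - pi), mu)]
        tot = sum(p)
        r = [pk / tot for pk in p]
        # stochastic approximation of the sufficient statistics
        gamma = (t + delay) ** -decay    # decaying step size
        for k in range(2):
            s_w[k] += gamma * (r[k] - s_w[k])
            s_x[k] += gamma * (r[k] * x - s_x[k])
        # M-step: parameters from the current statistics
        pi = s_w[0] / (s_w[0] + s_w[1])
        mu = [s_x[k] / s_w[k] for k in range(2)]
    return pi, mu

random.seed(0)
stream = [random.gauss(-2, 1) if random.random() < 0.4 else random.gauss(3, 1)
          for _ in range(5000)]
weight, means = online_em(stream)
```

Each observation is touched once, which is what makes such strategies attractive for large or growing data streams.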
DNA Microarray Data Analysis: A New Survey on Biclustering
There are subsets of genes that behave similarly under some subsets of conditions (we say that they coexpress) but behave independently under other subsets of conditions. Discovering such coexpressions can help uncover genomic knowledge such as gene networks or gene interactions. It is therefore of great importance to cluster genes and conditions simultaneously, in order to identify clusters of genes that are coexpressed under clusters of conditions. This type of clustering is called biclustering. Biclustering is an NP-hard problem; consequently, heuristic algorithms are typically used to find suboptimal solutions. In this paper, we present a new survey on biclustering of gene expression data, also called microarray data.
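As a minimal illustration of the heuristic approach such surveys cover, the sketch below performs greedy single-node deletion driven by the mean squared residue, in the spirit of Cheng and Church's delta-biclustering; the data matrix, the delta threshold, and the stopping sizes are made-up assumptions, not a specific published setup.

```python
import random

def msr(M, rows, cols):
    """Mean squared residue of the sub-matrix indexed by rows x cols."""
    n = len(rows) * len(cols)
    mean = sum(M[i][j] for i in rows for j in cols) / n
    rmean = {i: sum(M[i][j] for j in cols) / len(cols) for i in rows}
    cmean = {j: sum(M[i][j] for i in rows) / len(rows) for j in cols}
    return sum((M[i][j] - rmean[i] - cmean[j] + mean) ** 2
               for i in rows for j in cols) / n

def delta_bicluster(M, delta):
    """Greedily delete rows/columns until the residue drops below delta."""
    rows, cols = list(range(len(M))), list(range(len(M[0])))
    while len(rows) > 2 and len(cols) > 2 and msr(M, rows, cols) > delta:
        nr, nc = len(rows), len(cols)
        mean = sum(M[i][j] for i in rows for j in cols) / (nr * nc)
        rmean = {i: sum(M[i][j] for j in cols) / nc for i in rows}
        cmean = {j: sum(M[i][j] for i in rows) / nr for j in cols}
        res = {(i, j): M[i][j] - rmean[i] - cmean[j] + mean
               for i in rows for j in cols}
        # drop the row or column contributing most to the residue
        d_row = {i: sum(res[i, j] ** 2 for j in cols) / nc for i in rows}
        d_col = {j: sum(res[i, j] ** 2 for i in rows) / nr for j in cols}
        wr, wc = max(rows, key=d_row.get), max(cols, key=d_col.get)
        if d_row[wr] >= d_col[wc]:
            rows.remove(wr)
        else:
            cols.remove(wc)
    return rows, cols

random.seed(1)
M = [[random.uniform(0.0, 10.0) for _ in range(8)] for _ in range(8)]
for i in range(4):
    for j in range(4):
        M[i][j] = i + j      # planted additive (zero-residue) bicluster
rows, cols = delta_bicluster(M, delta=0.5)
```

This finds a single suboptimal bicluster; full heuristics iterate, mask discovered biclusters, and add nodes back, which is exactly the design space a survey on biclustering compares.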
Predictive modelling for health and health-care utilisation: an observational study for Australians aged 45 and up
The burden of chronic disease is growing at a fast pace, leading to poor quality of life and high healthcare expenditures in a large portion of the Australian population. Much of the burden is borne by hospitals, so there is ever-increasing interest in preventative interventions that can keep people out of hospital and healthier for longer. There is a wide range of potential interventions that may achieve this goal, and policy makers need to decide which ones should be funded and implemented. This task is difficult for two reasons: first, the short-term effectiveness of an intervention, and how it varies across specific sub-populations, is often unclear; second, the long-term intended and unintended consequences are also unclear. In this thesis I make contributions that address both difficulties. On the short-term side I focus on the use of physical activity to prevent the development of chronic disease and to reduce hospital costs. Increasing physical activity has long been heralded as a way to achieve these goals, but evidence of its effectiveness has been elusive. In this thesis I provide data-driven evidence to justify policies that encourage higher levels of physical activity (PA) in the middle-aged and older Australian population. I use data from the “45 and Up” study and the Social, Economic and Environmental Factors (SEEF) study, linked with the Admitted Patient Data Collection (APDC), to identify and study the cost and health trajectories of individuals with different levels of physical activity. The results show a clear, statistically significant association between PA and lower hospitalisation costs, as well as between PA and reduced risk of heart disease, diabetes and stroke.
On the long-term side of the analysis, I place this thesis in the context of a larger program of work at Western Sydney University that aims to build a microsimulation model for the analysis of health policy interventions. In this framework I study predictive models that use survey and/or administrative data to predict hospital costs and resource utilisation. I place particular emphasis on applying methods borrowed from Natural Language Processing to understand how to use the thousands of diagnosis and procedure codes found in administrative data as input to predictive models. The methods developed in this thesis go beyond this application to hospital data and can be used in any predictive model that relies on complex coding of healthcare information.
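As a hypothetical illustration of treating administrative codes as NLP-style tokens (not the thesis pipeline itself), the sketch below builds TF-IDF features from per-patient code lists that a downstream cost model could consume; the patient IDs and codes are made-up placeholders, not real ICD/ACHI codes.

```python
import math
from collections import Counter

# Each patient's admission history is a "document" whose tokens are
# diagnosis/procedure codes; TF-IDF down-weights ubiquitous codes.
patients = {
    "p1": ["E11", "I10", "I10", "Z72"],   # hypothetical code lists
    "p2": ["I10", "I21"],
    "p3": ["E11", "Z72", "Z72"],
}

def tfidf(corpus):
    n_docs = len(corpus)
    # document frequency: in how many patients each code appears
    df = Counter(code for codes in corpus.values() for code in set(codes))
    features = {}
    for pid, codes in corpus.items():
        tf = Counter(codes)
        features[pid] = {c: (cnt / len(codes)) * math.log(n_docs / df[c])
                         for c, cnt in tf.items()}
    return features

feats = tfidf(patients)
```

Richer representations (e.g. learned code embeddings) follow the same pattern: codes are tokens, and the representation replaces thousands of sparse indicator columns with a compact numeric input.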
STK/WST 795 Research Reports
These documents contain the honours research reports for each year from the Department of Statistics. Honours Research Reports - University of Pretoria, 20XX. Statistics. BSc (Hons) Mathematical Statistics, BCom (Hons) Statistics, BCom (Hons) Mathematical Statistics. Unrestricted
A bibliographic review of co-clustering through the latent block model
We present model-based co-clustering methods, with a focus on the latent block model (LBM). We introduce several specifications of the LBM (standard, sparse, Bayesian) and review some identifiability results. We show how the complex dependency structure prevents standard maximum likelihood estimation and present alternative and popular inference methods. Those estimation methods are based on a tractable approximation of the likelihood and rely on iterative procedures, which makes them difficult to analyze. We nevertheless present some asymptotic consistency results; these results are partial in that they rely on a reasonable but still unproved condition. Likewise, the available model selection tools for choosing the number of row and column groups are only valid up to a conjecture. We also briefly discuss co-clustering procedures that do not rely on a generative model, particularly those based on matrix factorization. Finally, we show how the LBM can be used for bipartite graph analysis, highlight throughout this review its connection to the Stochastic Block Model, and conclude with a case study illustrating the advantages of co-clustering over simple clustering.
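As an illustrative sketch of the inference machinery such reviews describe (not a reference implementation), the code below runs mean-field variational EM for a Bernoulli latent block model, alternating variational updates of row and column memberships with closed-form M-steps; the initialisation, iteration count, and simulated data are assumptions.

```python
import numpy as np

def lbm_vem(X, K, L, n_iter=50, seed=0):
    """Mean-field variational EM for a Bernoulli latent block model."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    tau = rng.dirichlet(np.ones(K), size=n)   # variational row memberships
    nu = rng.dirichlet(np.ones(L), size=m)    # variational column memberships
    for _ in range(n_iter):
        # M-step: group proportions and block connection probabilities
        alpha = np.clip(tau.mean(axis=0), 1e-9, None)
        beta = np.clip(nu.mean(axis=0), 1e-9, None)
        pi = (tau.T @ X @ nu) / (tau.sum(0)[:, None] * nu.sum(0)[None, :] + 1e-9)
        pi = np.clip(pi, 1e-6, 1 - 1e-6)
        lp, l1p = np.log(pi), np.log1p(-pi)
        # VE-step: update row memberships, then column memberships
        log_tau = np.log(alpha) + X @ nu @ lp.T + (1 - X) @ nu @ l1p.T
        tau = np.exp(log_tau - log_tau.max(axis=1, keepdims=True))
        tau /= tau.sum(axis=1, keepdims=True)
        log_nu = np.log(beta) + X.T @ tau @ lp + (1 - X.T) @ tau @ l1p
        nu = np.exp(log_nu - log_nu.max(axis=1, keepdims=True))
        nu /= nu.sum(axis=1, keepdims=True)
    return tau, nu, pi

# simulated data with two planted row groups and two planted column groups
rng = np.random.default_rng(1)
z = np.repeat([0, 1], 20)                 # true row groups
w = np.repeat([0, 1], 20)                 # true column groups
P = np.array([[0.9, 0.1], [0.1, 0.9]])    # block connection probabilities
X = (rng.random((40, 40)) < P[z][:, w]).astype(float)
tau, nu, pi = lbm_vem(X, K=2, L=2)
```

The coupling between `tau` and `nu` in the updates is exactly the complex dependency structure that rules out standard maximum likelihood and makes these iterative approximations the method of choice.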
Conformance Checking and Simulation-based Evolutionary Optimization for Deployment and Reconfiguration of Software in the Cloud
Many SaaS providers nowadays want to leverage the cloud's capabilities for their existing applications as well, for example to enable sound scalability and cost-effectiveness. This thesis presents CloudMIG, an approach that supports SaaS providers in migrating those applications to IaaS- and PaaS-based cloud environments. CloudMIG consists of a step-by-step process and focuses on two core components. (1) Restrictions imposed by specific cloud environments, so-called cloud environment constraints (CECs), such as limited file system access or forbidden method calls, can be validated by an automatic conformance checking approach. (2) A cloud deployment option (CDO) determines which cloud environment, cloud resource types, deployment architecture, and runtime reconfiguration rules for exploiting a cloud's elasticity should be used. The implied performance and costs can differ by orders of magnitude. CDOs can be optimized automatically with the help of our simulation-based genetic algorithm CDOXplorer. Extensive lab experiments and an experiment in an industrial context show CloudMIG's applicability and the excellent performance of its two core components.
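As a toy sketch of simulation-based genetic optimization over deployment options (not CDOXplorer itself), the code below evolves (VM type, maximum instance count) genomes against a made-up workload with an elastic scale-out rule; the VM catalogue, prices, workload, and SLA penalty are invented assumptions.

```python
import math
import random

VM_TYPES = {"small": (1, 0.05), "medium": (2, 0.12), "large": (4, 0.30)}
# (capacity in requests/hour, price in $/hour); WORKLOAD is a synthetic day
WORKLOAD = [3, 5, 9, 14, 9, 5]

def simulate_cost(cdo):
    """Fitness: simulated running cost plus a penalty for unmet load."""
    vm, max_n = cdo
    cap, price = VM_TYPES[vm]
    cost = 0.0
    for load in WORKLOAD:
        n = min(max_n, math.ceil(load / cap))   # elastic scale-out rule
        cost += n * price
        cost += 0.5 * max(0, load - n * cap)    # SLA penalty for unmet load
    return cost

def evolve(generations=40, pop_size=20, seed=2):
    rng = random.Random(seed)
    pop = [(rng.choice(list(VM_TYPES)), rng.randint(1, 8))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=simulate_cost)             # elitist selection
        survivors = pop[:pop_size // 2]
        children = []
        while len(children) < pop_size - len(survivors):
            a, b = rng.sample(survivors, 2)
            child = (a[0], b[1])                # one-point crossover
            if rng.random() < 0.2:              # mutate VM type
                child = (rng.choice(list(VM_TYPES)), child[1])
            if rng.random() < 0.2:              # mutate instance cap
                child = (child[0], max(1, child[1] + rng.choice([-1, 1])))
            children.append(child)
        pop = survivors + children
    return min(pop, key=simulate_cost)

best = evolve()
```

Real CDO spaces add deployment architecture and reconfiguration rules to the genome, which is why performance and costs of candidate options can differ by orders of magnitude and why simulation-based search pays off.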
- …