
    Strategies for online inference of model-based clustering in large and growing networks

    In this paper we adapt online estimation strategies to perform model-based clustering on large networks. Our work focuses on two algorithms, the first based on the SAEM algorithm and the second on variational methods. These two strategies are compared with existing approaches on simulated and real data. We use the method to decipher the connection structure of the political websphere during the 2008 US political campaign. We show that our online EM-based algorithms offer a good trade-off between precision and speed when estimating parameters for mixture distributions in the context of random graphs. Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/), http://dx.doi.org/10.1214/10-AOAS359, by the Institute of Mathematical Statistics (http://www.imstat.org).
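    The paper's algorithms target block models for networks; as a minimal sketch of the online EM idea they build on, the toy example below fits a two-component Gaussian mixture with known unit variances by updating running sufficient statistics one observation at a time. The step-size schedule and all names are our own illustrative choices, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 1-D Gaussian mixture with two components, unit variances.
X = np.concatenate([rng.normal(-2, 1, 5000), rng.normal(3, 1, 5000)])
rng.shuffle(X)

K = 2
w = np.full(K, 1.0 / K)          # mixture weights
mu = rng.normal(0, 1, K)         # component means
# Running sufficient statistics, updated with a decreasing step size.
s0, s1 = w.copy(), w * mu

for t, x in enumerate(X, start=1):
    # E-step for one observation: responsibilities under current params.
    log_r = np.log(w) - 0.5 * (x - mu) ** 2
    r = np.exp(log_r - log_r.max())
    r /= r.sum()
    # Stochastic approximation of the sufficient statistics.
    gamma = t ** -0.7            # step size, decaying in (1/2, 1]
    s0 = (1 - gamma) * s0 + gamma * r
    s1 = (1 - gamma) * s1 + gamma * r * x
    # M-step: parameters as functions of the running statistics.
    w, mu = s0, s1 / s0

print(w.round(3), mu.round(3))
```

    Each observation costs only O(K) work, which is what makes online variants attractive for large and growing networks.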

    DNA Microarray Data Analysis: A New Survey on Biclustering

    There are subsets of genes that show similar behavior under some subsets of conditions, meaning they coexpress, yet behave independently under other subsets of conditions. Discovering such coexpressions can help uncover genomic knowledge such as gene networks or gene interactions. That is why it is of utmost importance to cluster genes and conditions simultaneously, identifying clusters of genes that are coexpressed under clusters of conditions. This type of clustering is called biclustering. Biclustering is an NP-hard problem; consequently, heuristic algorithms are typically used to approximate it by finding suboptimal solutions. In this paper, we present a new survey on biclustering of gene expression data, also called microarray data.
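    As a runnable illustration of the task the survey covers (not any specific heuristic it reviews), scikit-learn's spectral co-clustering can recover planted biclusters in a synthetic expression-like matrix:

```python
import numpy as np
from sklearn.cluster import SpectralCoclustering
from sklearn.datasets import make_biclusters
from sklearn.metrics import consensus_score

# Synthetic expression-like matrix with 4 planted biclusters,
# shuffled so the block structure is hidden.
data, rows, cols = make_biclusters(
    shape=(300, 50), n_clusters=4, noise=5, shuffle=True, random_state=0
)

model = SpectralCoclustering(n_clusters=4, random_state=0)
model.fit(data)

# How well the recovered biclusters match the planted ones (1.0 = perfect).
score = consensus_score(model.biclusters_, (rows, cols))
print(f"consensus score: {score:.2f}")

# Joint row/column cluster assignments ("genes" and "conditions").
print(model.row_labels_[:10], model.column_labels_[:10])
```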

    Predictive modelling for health and health-care utilisation: an observational study for Australians aged 45 and up

    The burden of chronic disease is growing at a fast pace, leading to poor quality of life and high healthcare expenditures in a large portion of the Australian population. Much of the burden is borne by hospitals, and therefore there is an ever-increasing interest in preventative interventions that can keep people out of hospitals and healthier for longer. There is a wide range of potential interventions that may achieve this goal, and policy makers need to decide which ones should be funded and implemented. This task is difficult for two reasons: first, it is often not clear what the short-term effectiveness of an intervention is and how it varies across specific sub-populations; second, it is also not clear what the long-term intended and unintended consequences might be. In this thesis I make contributions to address both these difficulties.

    On the short-term side I focus on the use of physical activity to prevent the development of chronic disease and to reduce hospital costs. Increasing physical activity has long been heralded as a way to achieve these goals, but evidence of its effectiveness has been elusive. In this thesis I provide data-driven evidence to justify policies that encourage higher levels of physical activity (PA) in the middle-aged and older Australian population. I use data from the "45 and up" and the Social, Economic and Environmental Factors (SEEF) studies, linked with the Admitted Patient Data Collection (APDC), to identify and study the cost and health trajectories of individuals with different levels of physical activity. The results show a clear, statistically significant association between PA and lower hospitalisation costs, as well as between PA and reduced risk of heart disease, diabetes and stroke.

    On the long-term side of the analysis, I placed this thesis in the context of a larger program of work performed at Western Sydney University that aims to build a microsimulation model for the analysis of health policy interventions. In this framework I studied predictive models that use survey and/or administrative data to predict hospital costs and resource utilisation. I placed particular emphasis on applying methods borrowed from Natural Language Processing to understand how to use the thousands of diagnosis and procedure codes found in administrative data as input to predictive models. The methods developed in this thesis go beyond the application to hospital data and can be used in any predictive model that relies on complex coding of healthcare information.
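    As a sketch of the NLP-borrowed idea described above, under the assumption that each admission's diagnosis and procedure codes can be treated like words in a sentence, a word2vec-style model yields dense code embeddings usable as predictive-model inputs. The codes, episode lists, and hyperparameters below are illustrative, not the thesis's:

```python
from gensim.models import Word2Vec

# Each inner list is one hypothetical hospital episode: the ICD-style
# codes recorded for that admission, treated as a "sentence" so that
# codes sharing clinical context end up with nearby embeddings.
episodes = [
    ["E11.9", "I10", "N18.3"],   # diabetes, hypertension, kidney disease
    ["I10", "I25.1", "E78.0"],   # hypertension, ischaemic heart disease
    ["E11.9", "E78.0", "I25.1"],
    ["S72.0", "W19", "Z47.1"],   # hip fracture, fall, aftercare
]

model = Word2Vec(
    sentences=episodes, vector_size=32, window=5, min_count=1, sg=1, epochs=50
)

# Dense vector for a code, usable as input to a cost/utilisation model.
vec = model.wv["E11.9"]
print(vec.shape)                 # (32,)
print(model.wv.most_similar("I10", topn=2))
```

    The design choice here is that co-occurrence within an episode stands in for linguistic context, so the thousands of sparse codes collapse into a small dense feature space.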

    STK/WST 795 Research Reports

    These documents contain the honours research reports for each year for the Department of Statistics. Honours Research Reports, University of Pretoria, 20XX. Statistics. BSc (Hons) Mathematical Statistics, BCom (Hons) Statistics, BCom (Hons) Mathematical Statistics. Unrestricted.

    A bibliographic review of co-clustering through the latent block model

    We present model-based co-clustering methods, with a focus on the latent block model (LBM). We introduce several specifications of the LBM (standard, sparse, Bayesian) and review some identifiability results. We show how the complex dependency structure prevents standard maximum likelihood estimation and present alternative and popular inference methods. Those estimation methods are based on a tractable approximation of the likelihood and rely on iterative procedures, which makes them difficult to analyze. We nevertheless present some asymptotic results for consistency. The results are partial as they rely on a reasonable but still unproved condition. Likewise, available model selection tools for choosing the number of groups in rows and columns are only valid up to a conjecture. We also briefly discuss non-model-based co-clustering procedures, particularly those based on matrix factorization. We show how the LBM can be used for bipartite graph analysis and highlight throughout this review its connection to the Stochastic Block Model (SBM). Finally, we conclude with a case study that illustrates the advantages of co-clustering over simple clustering.
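    A minimal sketch of the kind of inference the review describes, assuming a Bernoulli LBM and the standard mean-field variational EM updates; the planted parameters, initialisation, and iteration count are our own choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Planted 2x2 Bernoulli LBM: 60 rows, 40 columns.
n, m, K, L = 60, 40, 2, 2
z = rng.integers(K, size=n)                  # true row classes
v = rng.integers(L, size=m)                  # true column classes
pi_true = np.array([[0.9, 0.1], [0.2, 0.8]])
X = (rng.random((n, m)) < pi_true[z][:, v]).astype(float)

# Variational membership probabilities, randomly initialised.
tau = rng.dirichlet(np.ones(K), size=n)      # rows    (n x K)
nu = rng.dirichlet(np.ones(L), size=m)       # columns (m x L)

def normalise_rows(logp):
    p = np.exp(logp - logp.max(axis=1, keepdims=True))
    return p / p.sum(axis=1, keepdims=True)

for _ in range(50):
    # M-step: block probabilities and class proportions.
    pi = (tau.T @ X @ nu) / (tau.sum(0)[:, None] * nu.sum(0)[None, :])
    pi = pi.clip(1e-6, 1 - 1e-6)
    rho, delta = tau.mean(0), nu.mean(0)
    a, b = np.log(pi / (1 - pi)), np.log(1 - pi)
    # VE-steps: alternate the two coupled mean-field updates.
    tau = normalise_rows(np.log(rho) + X @ nu @ a.T + nu.sum(0) @ b.T)
    nu = normalise_rows(np.log(delta) + X.T @ tau @ a + tau.sum(0) @ b)

# Up to label switching, pi should recover pi_true.
print("estimated block probabilities:\n", pi.round(2))
```

    The coupling between the two updates is exactly the complex dependency structure mentioned above: neither tau nor nu can be optimised without conditioning on the other, which is why the exact likelihood is intractable.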

    Conformance Checking and Simulation-based Evolutionary Optimization for Deployment and Reconfiguration of Software in the Cloud

    Many SaaS providers nowadays want to leverage the cloud's capabilities for their existing applications as well, for example, to enable sound scalability and cost-effectiveness. This thesis provides the approach CloudMIG, which supports SaaS providers in migrating those applications to IaaS- and PaaS-based cloud environments. CloudMIG consists of a step-by-step process and focuses on two core components. (1) Restrictions imposed by specific cloud environments, so-called cloud environment constraints (CECs), such as limited file system access or forbidden method calls, can be validated by an automatic conformance checking approach. (2) A cloud deployment option (CDO) determines which cloud environment, cloud resource types, deployment architecture, and runtime reconfiguration rules for exploiting a cloud's elasticity should be used. The implied performance and costs can differ by orders of magnitude. CDOs can be automatically optimized with the help of our simulation-based genetic algorithm CDOXplorer. Extensive lab experiments and an experiment in an industrial context show CloudMIG's applicability and the excellent performance of its two core components.
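    The following is only a hypothetical sketch of simulation-based genetic optimization of a CDO, not CDOXplorer itself: a toy fitness function stands in for the simulation run, and the VM types, prices, and workload numbers are invented for illustration:

```python
import random

random.seed(0)

# Hypothetical CDO search space: (VM type index, instance count,
# CPU-utilisation threshold that triggers a scale-out rule).
# (name, relative capacity, $ per hour) -- illustrative, not real prices.
VM_TYPES = [("small", 1.0, 0.05), ("medium", 2.2, 0.11), ("large", 4.8, 0.23)]

def random_cdo():
    return (random.randrange(len(VM_TYPES)),
            random.randint(1, 20),
            random.uniform(0.5, 0.95))

def fitness(cdo):
    """Toy stand-in for a simulation run: penalise cost and overload."""
    vm, count, threshold = cdo
    _, capacity, price = VM_TYPES[vm]
    demand = 18.0                        # hypothetical steady workload
    overload = max(0.0, demand - capacity * count * threshold)
    return -(price * count + 10.0 * overload)   # higher is better

def mutate(cdo):
    vm, count, threshold = cdo
    choice = random.randrange(3)
    if choice == 0:
        vm = random.randrange(len(VM_TYPES))
    elif choice == 1:
        count = max(1, count + random.choice([-2, -1, 1, 2]))
    else:
        threshold = min(0.95, max(0.5, threshold + random.uniform(-0.1, 0.1)))
    return (vm, count, threshold)

def crossover(a, b):
    # Uniform crossover: each gene comes from either parent.
    return tuple(random.choice(pair) for pair in zip(a, b))

pop = [random_cdo() for _ in range(30)]
for _ in range(100):
    pop.sort(key=fitness, reverse=True)
    elite = pop[:10]                     # keep the fittest CDOs
    pop = elite + [mutate(crossover(random.choice(elite), random.choice(elite)))
                   for _ in range(20)]

best = max(pop, key=fitness)
print("best CDO:", VM_TYPES[best[0]][0], best[1], round(best[2], 2))
```

    In the real approach each fitness evaluation would be a full simulation of the deployment under a workload profile, which is why a population-based search that tolerates expensive, black-box objectives is a natural fit.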