7 research outputs found

    Adaptive Seeding for Gaussian Mixture Models

    We present new initialization methods for the expectation-maximization (EM) algorithm for multivariate Gaussian mixture models. Our methods are adaptations of the well-known K-means++ initialization and the Gonzalez algorithm. We thereby aim to close the gap between simple random methods (e.g. uniform seeding) and complex methods that crucially depend on the right choice of hyperparameters. Our extensive experiments, on artificial as well as real-world data sets, indicate the usefulness of our methods compared to common techniques, e.g. those that apply the original K-means++ or Gonzalez algorithm directly.
    Comment: This is a preprint of a paper that has been accepted for publication in the Proceedings of the 20th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD) 2016. The final publication is available at link.springer.com (http://link.springer.com/chapter/10.1007/978-3-319-31750-2_24).
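    Since the abstract centres on seeding strategies, a brief sketch may help fix ideas. The snippet below shows generic D²-weighted (K-means++-style) seeding followed by a simple conversion of seeds into initial GMM parameters; it illustrates the general technique only, and the function names (kmeanspp_seeds, init_gmm_from_seeds) are ours, not the paper's. The paper's adaptive variants may differ.

```python
# A minimal sketch of K-means++-style seeding for GMM initialization.
# Illustrative only; not the paper's adaptive variants.
import numpy as np

def kmeanspp_seeds(X, k, rng=None):
    """Pick k seeds: each new seed is drawn with probability
    proportional to its squared distance to the nearest existing seed."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    seeds = [X[rng.integers(n)]]                      # first seed: uniform
    d2 = np.full(n, np.inf)
    for _ in range(k - 1):
        d2 = np.minimum(d2, ((X - seeds[-1]) ** 2).sum(axis=1))
        seeds.append(X[rng.choice(n, p=d2 / d2.sum())])  # D^2 sampling
    return np.stack(seeds)

def init_gmm_from_seeds(X, seeds):
    """Assign points to nearest seed, then use per-cluster
    weight/mean/covariance as the initial GMM parameters."""
    labels = ((X[:, None, :] - seeds[None]) ** 2).sum(-1).argmin(1)
    k, d = seeds.shape
    weights = np.bincount(labels, minlength=k) / len(X)
    means = np.stack([X[labels == j].mean(0) if (labels == j).any()
                      else seeds[j] for j in range(k)])
    covs = np.stack([np.cov(X[labels == j].T) + 1e-6 * np.eye(d)
                     if (labels == j).sum() > d else np.eye(d)
                     for j in range(k)])
    return weights, means, covs
```

    An EM run would then start from these weights, means and covariances instead of, say, uniform-random parameters.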

    Extending mixtures of factor models using the restricted multivariate skew-normal distribution

    The mixture of factor analyzers (MFA) model provides a powerful tool for analyzing high-dimensional data, as it can reduce the number of free parameters through its factor-analytic representation of the component covariance matrices. This paper extends the MFA model by adopting a restricted version of the multivariate skew-normal distribution for the latent component factors, called mixtures of skew-normal factor analyzers (MSNFA). The proposed MSNFA model relaxes the normality assumption on the latent factors in order to accommodate skewness in the observed data, and thus provides an approach to model-based density estimation and clustering of high-dimensional data exhibiting asymmetric characteristics. A computationally feasible Expectation Conditional Maximization (ECM) algorithm is developed for computing the maximum likelihood estimates of the model parameters. The potential of the proposed methodology is exemplified using both real and simulated data.
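    For intuition, here is a hedged generative sketch of the model class the abstract describes: observations arise from a mixture of factor analyzers whose latent factors follow a restricted skew-normal law, written via a common stochastic representation with a single half-normal skewing variable. The parameter names (pi, mu, B, lam, D) and the exact parameterization are illustrative assumptions, not the paper's notation, and the ECM estimation step is not shown.

```python
# Hedged generative sketch of an MSNFA-style model: component g generates
#   y = mu_g + B_g f + e,  with restricted skew-normal latent factors
#   f = lam_g * |w| + u,   w ~ N(0, 1),  u ~ N(0, I),  e ~ N(0, diag(D_g)).
# Parameterization is an illustrative assumption, not the paper's.
import numpy as np

def sample_msnfa(n, pi, mu, B, lam, D, rng=None):
    """Simulate n observations. Shapes: pi (G,), mu (G, p),
    B (G, p, q), lam (G, q), D (G, p) diagonal noise variances."""
    rng = np.random.default_rng(rng)
    G, p, q = B.shape
    z = rng.choice(G, size=n, p=pi)          # component labels
    y = np.empty((n, p))
    for i, g in enumerate(z):
        w = abs(rng.standard_normal())       # half-normal skewing variable
        f = lam[g] * w + rng.standard_normal(q)  # skew-normal factors
        e = rng.standard_normal(p) * np.sqrt(D[g])
        y[i] = mu[g] + B[g] @ f + e
    return y, z
```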

    Modelling route choice behaviour with incomplete data: an application to the London Underground

    This thesis develops a modelling framework for learning the route choice behaviour of travellers on an underground railway system, with a major emphasis on the use of smart-card data. The motivation is twofold. On the one hand, in a metropolis with an extensive underground network (e.g. London, Beijing or Paris), severe passenger congestion often occurs, especially during rush hours. To support public transport managers in taking more effective actions to smooth passenger flows, a better understanding of passengers’ routing behaviour on such networks is needed. On the other hand, a wealth of travel data is now readily obtainable, largely owing to the widespread implementation of automatic fare collection (AFC) systems and the popularity of smart cards on public transport. Nevertheless, a core limitation of such data is that the actual route-choice decisions taken by passengers may not be available, especially when their journeys involve alternative routes and/or within-station interchanges. In most cases, AFC systems (e.g. the Oyster system in London) record only passengers’ entries and exits, not their route choices. We are thus interested in whether the route-choice information can be inferred analytically from this ‘incomplete’ data.

    Within the scope of this thesis, passengers’ single journeys are investigated on a station basis, where sufficiently large samples of smart-card users’ travel records can be obtained. With the journey time data modelled by simple finite mixture distributions, Bayesian inference is applied to estimate, for each passenger, posterior probabilities for each route they might have chosen from all possible alternatives, conditional on an observation of the passenger’s journey time. These posterior probabilities are then updated for each passenger by taking into account additional information, including entry times and timetables. To understand passengers’ actual route choice behaviour, we then make use of an adapted discrete choice model, replacing the conventional dependent variable of observed route choices with the posterior choice probabilities of the different possible outcomes.

    The proposed methodology is illustrated with seven case studies in the central zone of the London Underground network, using Oyster smart-card data. Two standard mixture models, with Gaussian and log-normal components respectively, are tested, and the outcomes demonstrate a good performance of the mixture models. Moreover, relying on the updated choice probabilities in the estimation of a multinomial logit latent choice model, we show that meaningful relative sensitivities to the travel times of different journey segments can be estimated. This approach thus allows us to gain insight into passengers’ route choice preferences even in the absence of observations of their actual chosen routes.
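    The core inference step the abstract describes, posterior route probabilities given an observed journey time under a fitted finite mixture, is straightforward to express with Bayes’ rule. The sketch below assumes a Gaussian mixture with one component per route and uses illustrative parameter values, not figures from the thesis.

```python
# Minimal sketch of the route-inference step: with a Gaussian mixture
# fitted to journey times (one component per route), Bayes' rule gives
# P(route r | t) proportional to pi_r * Normal(t; mu_r, sigma_r).
import numpy as np
from scipy.stats import norm

def route_posteriors(t, weights, means, sds):
    """Posterior probability of each route given journey time t."""
    likes = weights * norm.pdf(t, loc=means, scale=sds)
    return likes / likes.sum()

# Illustrative example: two routes between an OD pair with mean
# travel times of 14 and 19 minutes (made-up numbers).
post = route_posteriors(t=16.0,
                        weights=np.array([0.6, 0.4]),
                        means=np.array([14.0, 19.0]),
                        sds=np.array([2.0, 2.5]))
```

    The thesis goes further by updating these posteriors with entry times and timetables, and by feeding them into a latent choice model in place of observed choices.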

    Novel Algorithm Development for ‘Next-Generation’ Sequencing Data Analysis

    In recent years, the decreasing cost of ‘next-generation’ sequencing (NGS) has spawned numerous applications for interrogating whole genomes and transcriptomes in research, diagnostic and forensic settings. While the innovations in sequencing have been explosive, the development of scalable and robust bioinformatics software and algorithms for analysing the new types of data generated by these technologies has struggled to keep up. As a result, large volumes of NGS data available in public repositories are severely underutilised, despite providing a rich resource for data mining applications. Indeed, the bottleneck in genome and transcriptome sequencing experiments has shifted from data generation to bioinformatics analysis and interpretation.

    This thesis focuses on the development of novel bioinformatics software to bridge the gap between data availability and interpretation. The work is split between two core topics: computational prioritisation/identification of disease gene variants, and identification of RNA N6-adenosine methylation from sequencing data.

    The first chapter briefly discusses the emergence and establishment of NGS technology as a core tool in biology, and its current applications and perspectives. Chapter 2 introduces the problem of variant prioritisation in the context of Mendelian disease, where tens of thousands of potential candidates are generated by a typical sequencing experiment. Novel software developed for candidate gene prioritisation is described that utilises data mining of tissue-specific gene expression profiles (Chapter 3). An alternative approach to candidate variant prioritisation, which leverages functional and phenotypic descriptions of genes and diseases from multiple biomedical domain ontologies, is then investigated (Chapter 4).

    Chapter 5 discusses N6-adenosine methylation, a recently re-discovered post-transcriptional modification of RNA. The core of the chapter describes novel software developed for transcriptome-wide detection of this epitranscriptomic mark from sequencing data. Chapter 6 presents a case study application of the software, reporting the previously uncharacterised RNA methylome of Kaposi’s Sarcoma Herpesvirus. The chapter further discusses a putative novel N6-methyladenosine RNA-binding protein and its possible roles in the progression of viral infection.
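    As a rough illustration of the transcriptome-wide m6A detection problem mentioned for Chapter 5, the sketch below flags genomic windows where immunoprecipitated (IP) read counts are enriched over input counts, using a per-window binomial test. This is a generic MeRIP-seq-style enrichment test for intuition only; the thesis’s algorithm, its test statistics, and any multiple-testing handling may well differ.

```python
# Hedged sketch of windowed IP-vs-input enrichment for m6A detection.
# Generic illustration, not the thesis's algorithm; no multiple-testing
# correction is applied here, for brevity.
import numpy as np
from scipy.stats import binomtest

def enriched_windows(ip_counts, input_counts, min_log2fc=1.0, alpha=0.05):
    """Return indices of windows where IP reads are significantly
    enriched over input, after library-size normalization."""
    ip = np.asarray(ip_counts)
    inp = np.asarray(input_counts)
    # Expected fraction of a window's reads in IP if unenriched.
    size_ratio = ip.sum() / (ip.sum() + inp.sum())
    hits = []
    for i, (a, b) in enumerate(zip(ip, inp)):
        total = int(a + b)
        if total == 0:
            continue
        # Library-size-normalized fold change, with pseudocounts.
        log2fc = np.log2((a + 1) / ip.sum() * inp.sum() / (b + 1))
        p = binomtest(int(a), total, size_ratio,
                      alternative="greater").pvalue
        if log2fc >= min_log2fc and p < alpha:
            hits.append(i)
    return hits
```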