1,902 research outputs found

    Methods for fast and reliable clustering

    Contributions to Collective Dynamical Clustering-Modeling of Discrete Time Series

    The analysis of sequential data is important in business, science, and engineering, for tasks such as signal processing, user behavior mining, and commercial transactions analysis. In this dissertation, we build upon the Collective Dynamical Modeling and Clustering (CDMC) framework for discrete time series modeling by making contributions to clustering initialization, dynamical modeling, and scaling. We first propose a modified Dynamic Time Warping (DTW) approach for clustering initialization within CDMC. The proposed approach provides DTW metrics that penalize deviations of the warping path from the path of constant slope. This reduces over-warping while retaining the efficiency advantages of global constraint approaches, and without relying on domain-dependent constraints. Second, we investigate the use of semi-Markov chains as dynamical models of temporal sequences in which state changes occur infrequently. Semi-Markov chains allow explicitly specifying the distribution of state visit durations. This makes them superior to traditional Markov chains, which implicitly assume an exponential state duration distribution. Third, we consider convergence properties of the CDMC framework. We establish convergence by viewing CDMC from an Expectation Maximization (EM) perspective. We investigate how our efficient DTW-based initialization technique and the selected dynamical models affect the time to convergence. We also explore the convergence implications of various stopping criteria. Fourth, we consider scaling up CDMC to process big data, using Storm, an open source distributed real-time computation system that supports batch and distributed data processing. We performed an experimental evaluation on human sleep data and on user web navigation data. Our results demonstrate the superiority of the strategies introduced in this dissertation over state-of-the-art techniques in terms of modeling quality and efficiency.
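
The idea of penalizing warping paths that stray from the path of constant slope can be sketched as follows. This is a minimal illustration, not the dissertation's actual metric: the additive penalty term, the function name `penalized_dtw`, and the `penalty` parameter are all assumptions made for the example.

```python
def penalized_dtw(a, b, penalty=0.0):
    """DTW distance with an additive penalty for straying from the diagonal.

    The penalty form (penalty * |i/n - j/m|) is an assumption chosen for
    illustration; the abstract does not specify the exact metric.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Distance of cell (i, j) from the constant-slope path, in [0, 1)
            deviation = abs(i / n - j / m)
            best_prev = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = local + penalty * deviation + best_prev
    return cost[n][m]
```

With `penalty=0.0` this reduces to plain DTW with an absolute-difference local cost. A positive `penalty` makes heavily warped alignments more expensive without imposing a hard band on the cost matrix.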

    Modeling HIV Drug Resistance

    Despite the development of antiviral drugs and the optimization of therapies, the emergence of drug resistance remains one of the most challenging issues for the successful treatment of HIV-infected patients. The availability of massive HIV drug resistance data provides us not only with exciting opportunities for HIV research, but also with the curse of high dimensionality. We provide several statistical learning methods in this thesis to analyze sequence data from different perspectives. We propose a hierarchical random graph approach to identify possible covariation among residue-specific mutations. Viral progression pathways have been inferred in the literature using an EM-like algorithm, and we present a normalization method to improve the accuracy of the parameter estimates. To predict drug resistance from genotypic data, we also build a novel regression model utilizing the information from progression pathways. Finally, we introduce a computational approach to determine viral fitness, for which our initial computational results closely agree with experimental results. Work on two other topics is presented in the Appendices. Latent class models find applications in several areas, including the social and biological sciences. Finding explicit maximum likelihood estimates has been elusive. We present a positive solution to a conjecture on a special latent class model proposed by Bernd Sturmfels from UC Berkeley. Monomial ideals provide ubiquitous links between combinatorics and commutative algebra. Irreducible decomposition of monomial ideals is a basic computational problem that finds applications in several areas. We present two algorithms for finding irreducible decompositions of monomial ideals.

    A Study on Variational Component Splitting approach for Mixture Models

    The increasing use of mobile devices and the introduction of cloud-based services have resulted in the generation of an enormous amount of data every day. This creates a need to group these data into proper categories. Various clustering techniques have been introduced over the years to learn the patterns in data that might better facilitate the classification process. The finite mixture model is one of the crucial methods used for this task. The basic idea of mixture models is to fit the data at hand to an appropriate distribution. The design of mixture models hence involves finding the appropriate parameters of the distribution and estimating the number of clusters in the data. We use a variational component splitting framework to do this, which can simultaneously learn the parameters of the model and estimate the number of components in the model. The variational algorithm helps to overcome the computational complexity of purely Bayesian approaches and the overfitting problems experienced with maximum likelihood approaches, while guaranteeing convergence. The choice of distribution remains the core concern of mixture models in recent research. The efficiency of the Dirichlet family of distributions for this purpose has been demonstrated in recent studies, especially for non-Gaussian data. This led us to study the impact of the variational component splitting approach on mixture models based on several distributions. Hence, our contribution is the application of the variational component splitting approach to design finite mixture models based on inverted Dirichlet, generalized inverted Dirichlet, and inverted Beta-Liouville distributions. In addition, we incorporate a simultaneous feature selection approach for the generalized inverted Dirichlet mixture model along with component splitting as another experimental contribution. We evaluate the performance of our models on various real-life applications such as object, scene, texture, speech, and video categorization.

    Methods for the acquisition and analysis of volume electron microscopy data

    Analysis of Clickstream Data

    This thesis is concerned with providing further statistical development in the area of web usage analysis to explore web browsing behaviour patterns. We received two data sources: web log files and operational data files for the websites, which contained information on online purchases. There are many research questions regarding web browsing behaviour. Specifically, we focused on the depth-of-visit metric and implemented an exploratory analysis of this feature using clickstream data. Due to the large volume of data available in this context, we chose to present effect size measures along with all statistical analyses of the data. We introduced two new robust measures of effect size for two-sample comparison studies in non-normal situations, specifically where the difference between the two populations is due to the shape parameter. The proposed effect sizes perform adequately for non-normal data, as well as when the two distributions differ in their shape parameters. We also focus on conversion analysis, to investigate the causal relationship between the general clickstream information and online purchasing using a logistic regression approach. The aim is to find a classifier by assigning a probability to the event of online shopping on an e-commerce website. We also develop the application of a mixture of hidden Markov models (MixHMM) to model web browsing behaviour using sequences of web pages viewed by users of an e-commerce website. The mixture of hidden Markov models is fitted in a Bayesian context using Gibbs sampling. We address the slow mixing problem of Gibbs sampling in high-dimensional models, and use over-relaxed Gibbs sampling, as well as the forward-backward EM algorithm, to obtain an adequate sample from the posterior distributions of the parameters. The MixHMM provides the advantage of clustering users based on their browsing behaviour, and also gives an automatic classification of web pages based on the probability of visitors viewing each web page on the website.
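
The clustering step of a mixture of hidden Markov models can be sketched as follows, assuming the parameters have already been estimated. The sketch covers only the assignment of one page sequence to a mixture component; it does not implement the Gibbs sampling used in the thesis. The function names, the `(start, trans, emit)` tuple layout, and the integer page coding are assumptions made for the example.

```python
def sequence_likelihood(obs, start, trans, emit):
    """Forward algorithm: probability of obs under one discrete-emission HMM.

    obs   : list of integer page codes (hypothetical encoding)
    start : start[s] = initial probability of hidden state s
    trans : trans[r][s] = probability of moving from state r to state s
    emit  : emit[s][k] = probability that state s shows page k
    """
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for page in obs[1:]:
        alpha = [
            emit[s][page] * sum(alpha[r] * trans[r][s] for r in range(n_states))
            for s in range(n_states)
        ]
    return sum(alpha)


def cluster_posterior(obs, weights, hmms):
    """Posterior probability of each mixture component given one sequence."""
    joint = [w * sequence_likelihood(obs, *hmm) for w, hmm in zip(weights, hmms)]
    total = sum(joint)
    return [p / total for p in joint]
```

A user would then be assigned to the component with the largest posterior. For long sequences the forward probabilities underflow, so a practical version would rescale `alpha` at each step or work in log space.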

    Switching Principal Component Analysis for Modeling Means and Covariance Changes Over Time

    Many psychological theories predict that cognitions, affect, action tendencies, and other variables change across time in mean level as well as in covariance structure. Often such changes are rather abrupt, because they are caused by sudden events. To capture such changes, one may repeatedly measure the variables under study for a single individual and examine whether the resulting multivariate time series contains a number of phases with different means and covariance structures. The latter task is challenging, however. First, in many cases, it is unknown how many phases there are and when new phases start. Second, a rather large number of variables is often involved, complicating the interpretation of the covariance pattern within each phase. To take up this challenge, we present switching principal component analysis (PCA). Switching PCA detects phases of consecutive observations or time points (in single-subject data) with similar means and/or covariance structures, and performs a PCA per phase to yield insight into its covariance structure. An algorithm for fitting switching PCA solutions and a model selection procedure are presented and evaluated in a simulation study. Finally, we analyze empirical data on cardiorespiratory recordings.