The EM Algorithm
The Expectation-Maximization (EM) algorithm is a broadly applicable approach to the iterative computation of maximum likelihood (ML) estimates, useful in a variety of incomplete-data problems. Maximum likelihood estimation and likelihood-based inference are of central importance in statistical theory and data analysis. Maximum likelihood estimation is a general-purpose method with attractive properties: it is the most often used estimation technique in the frequentist framework, and it is also relevant in the Bayesian framework (Chapter III.11). Bayesian solutions are often justified with the help of likelihoods and maximum likelihood estimates (MLEs), and they are similar to penalized likelihood estimates. Maximum likelihood estimation is a ubiquitous technique, used extensively in every area where statistical methods are applied.
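The alternating E-step/M-step structure described above can be sketched for a two-component one-dimensional Gaussian mixture, a standard incomplete-data problem. This is a minimal illustrative example, not taken from the text; the deterministic extreme-value initialization and fixed iteration count are simplifying assumptions.

```python
import numpy as np

def em_gmm_1d(x, n_iter=100):
    """EM for a two-component 1-D Gaussian mixture (illustrative sketch)."""
    # Simple deterministic initialization at the data extremes (an assumption;
    # real implementations typically use k-means or multiple random restarts)
    mu = np.array([x.min(), x.max()], dtype=float)
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point
        dens = pi[:, None] * np.exp(
            -0.5 * ((x[None, :] - mu[:, None]) / sigma[:, None]) ** 2
        ) / sigma[:, None]
        resp = dens / dens.sum(axis=0)
        # M-step: responsibility-weighted maximum likelihood updates
        nk = resp.sum(axis=1)
        mu = (resp * x[None, :]).sum(axis=1) / nk
        sigma = np.sqrt((resp * (x[None, :] - mu[:, None]) ** 2).sum(axis=1) / nk)
        pi = nk / x.size
    return mu, sigma, pi
```

Each iteration provably does not decrease the observed-data likelihood, which is the property that makes EM attractive for incomplete-data ML estimation.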
Block Mixture Model for the Biclustering of Microarray Data
An attractive way to perform biclustering of genes and conditions is to adopt a Block Mixture Model (BMM). Approaches based on a BMM operate via a Block Expectation Maximization (BEM) algorithm and/or a Block Classification Expectation Maximization (BCEM) one. The drawback of these approaches is the difficulty of choosing a good initialization strategy for the BEM and BCEM algorithms. This paper reviews existing biclustering approaches adopting a BMM and suggests a new fuzzy biclustering one. Our approach makes it possible to choose a good initialization strategy for the BEM and BCEM algorithms.
Clustering Patients with Tensor Decomposition
In this paper we present a method for the unsupervised clustering of
high-dimensional binary data, with a special focus on electronic healthcare
records. We present a robust and efficient heuristic to face this problem using
tensor decomposition. We present the reasons why this approach is preferable
for tasks such as clustering patient records, to more commonly used
distance-based methods. We run the algorithm on two datasets of healthcare
records, obtaining clinically meaningful results.
Comment: Presented at the 2017 Machine Learning for Healthcare Conference (MLHC 2017), Boston, MA.
Toeplitz Inverse Covariance-Based Clustering of Multivariate Time Series Data
Subsequence clustering of multivariate time series is a useful tool for
discovering repeated patterns in temporal data. Once these patterns have been
discovered, seemingly complicated datasets can be interpreted as a temporal
sequence of only a small number of states, or clusters. For example, raw sensor
data from a fitness-tracking application can be expressed as a timeline of a
select few actions (e.g., walking, sitting, running). However, discovering
these patterns is challenging because it requires simultaneous segmentation and
clustering of the time series. Furthermore, interpreting the resulting clusters
is difficult, especially when the data is high-dimensional. Here we propose a
new method of model-based clustering, which we call Toeplitz Inverse
Covariance-based Clustering (TICC). Each cluster in the TICC method is defined
by a correlation network, or Markov random field (MRF), characterizing the
interdependencies between different observations in a typical subsequence of
that cluster. Based on this graphical representation, TICC simultaneously
segments and clusters the time series data. We solve the TICC problem through
alternating minimization, using a variation of the expectation maximization
(EM) algorithm. We derive closed-form solutions to efficiently solve the two
resulting subproblems in a scalable way, through dynamic programming and the
alternating direction method of multipliers (ADMM), respectively. We validate
our approach by comparing TICC to several state-of-the-art baselines in a
series of synthetic experiments, and we then demonstrate on an automobile
sensor dataset how TICC can be used to learn interpretable clusters in
real-world scenarios.
Comment: This revised version fixes two small typos in the published version.
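The cluster-assignment step of the alternating minimization described above (assigning each time point to a cluster while penalizing switches between consecutive points) can be sketched as a dynamic program. The cost matrix, penalty `beta`, and function name below are illustrative assumptions, not the published TICC implementation:

```python
import numpy as np

def assign_clusters(neg_ll, beta):
    """Dynamic-programming assignment of time points to clusters (sketch).

    neg_ll[t, k]: negative log-likelihood of time point t under cluster k.
    beta: penalty added whenever consecutive points change cluster,
    encouraging temporally coherent segments.
    """
    T, K = neg_ll.shape
    cost = neg_ll[0].copy()          # best cost of ending at each cluster
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # trans[j, k]: cost of being in cluster j at t-1 and moving to k
        trans = cost[:, None] + beta * (1 - np.eye(K))
        back[t] = trans.argmin(axis=0)
        cost = neg_ll[t] + trans.min(axis=0)
    # Backtrack the minimum-cost path of cluster assignments
    path = np.empty(T, dtype=int)
    path[-1] = cost.argmin()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path
```

With `beta = 0` this reduces to pointwise maximum-likelihood assignment; larger values of `beta` trade per-point fit for longer, more interpretable segments.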
The Lazy Bootstrap: A Fast Resampling Method for Evaluating Latent Class Model Fit
The latent class model is a powerful unsupervised clustering algorithm for
categorical data. Many statistics exist to test the fit of the latent class
model. However, traditional methods to evaluate those fit statistics are not
always useful. Asymptotic distributions are not always known, and empirical
reference distributions can be very time consuming to obtain. In this paper we
propose a fast resampling scheme with which any type of model fit can be
assessed. We illustrate it here on the latent class model, but the methodology
can be applied in any situation.
The principle behind the lazy bootstrap method is to specify a statistic
which captures the characteristics of the data that a model should capture
correctly. If those characteristics in the observed data and in model-generated
data are very different we can assume that the model could not have produced
the observed data. With this method we achieve the flexibility of tests from
the Bayesian framework, while only needing maximum likelihood estimates. We
provide a step-wise algorithm with which the fit of a model can be assessed
based on the characteristics we as researchers find important. In a Monte
Carlo study we show that the method has very low Type I error rates for all illustrated
statistics. Power to reject a model depended largely on the type of statistic
that was used and on sample size. We applied the method to an empirical data
set on clinical subgroups at risk of myocardial infarction and compared the
results directly to the parametric bootstrap. The results of our method were
highly similar to those obtained by the parametric bootstrap, while the
required computations differed by three orders of magnitude in favour of our
method.
Comment: This is an adaptation of a chapter of a PhD dissertation available at
https://pure.uvt.nl/portal/files/19030880/Kollenburg_Computer_13_11_2017.pd
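The resampling principle described above, comparing a researcher-chosen statistic on the observed data against the same statistic on data generated from the fitted model, can be sketched generically. The names `fitted_sampler` and `statistic` are hypothetical placeholders, and this is a simplified two-sided comparison rather than the paper's exact procedure:

```python
import numpy as np

def resampling_pvalue(observed, fitted_sampler, statistic, B=200, seed=0):
    """Generic model-fit check via resampling (illustrative sketch).

    fitted_sampler(rng): draws one dataset from the model at its
    maximum likelihood estimates.
    statistic(data): a scalar capturing a data characteristic the
    model should reproduce (e.g. a marginal frequency).
    Returns the proportion of model-generated statistics at least as
    far from the reference mean as the observed statistic.
    """
    rng = np.random.default_rng(seed)
    t_obs = statistic(observed)
    t_ref = np.array([statistic(fitted_sampler(rng)) for _ in range(B)])
    center = t_ref.mean()
    return np.mean(np.abs(t_ref - center) >= np.abs(t_obs - center))
```

A small p-value indicates that the model could not plausibly have produced the observed value of the chosen statistic; because only the MLE fit is resampled (not refitted per replicate), the cost per replicate is a single simulation and statistic evaluation.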