194,648 research outputs found
A Topological Approach to Spectral Clustering
We propose two related unsupervised clustering algorithms which, for input,
take data assumed to be sampled from a uniform distribution supported on a
metric space , and output a clustering of the data based on the selection of
a topological model for the connected components of . Both algorithms work
by selecting a graph on the samples from a natural one-parameter family of
graphs, using a geometric criterion in the first case and an information
theoretic criterion in the second. The estimated connected components of
are identified with the kernel of the associated graph Laplacian, which allows
the algorithm to work without requiring the number of expected clusters or
other auxiliary data as input.Comment: 21 Page
Sequence Modelling For Analysing Student Interaction with Educational Systems
The analysis of log data generated by online educational systems is an
important task for improving the systems, and furthering our knowledge of how
students learn. This paper uses previously unseen log data from Edulab, the
largest provider of digital learning for mathematics in Denmark, to analyse the
sessions of its users, where 1.08 million student sessions are extracted from a
subset of their data. We propose to model students as a distribution of
different underlying student behaviours, where the sequence of actions from
each session belongs to an underlying student behaviour. We model student
behaviour as Markov chains, such that a student is modelled as a distribution
of Markov chains, which are estimated using a modified k-means clustering
algorithm. The resulting Markov chains are readily interpretable, and in a
qualitative analysis around 125,000 student sessions are identified as
exhibiting unproductive student behaviour. Based on our results this student
representation is promising, especially for educational systems offering many
different learning usages, and offers an alternative to common approaches like
modelling student behaviour as a single Markov chain often done in the
literature.Comment: The 10th International Conference on Educational Data Mining 201
Model-Based Co-clustering for Functional Data
International audienceIn order to provide a simplified representation of key performance indicators for an easier analysis by mobile network maintainers, a model-based co-clustering algorithm for functional data is proposed. Co-clustering aims to identify block patterns in a data set from a simultaneous clustering of rows and columns. The algorithm relies on the latent block model in which each curve is identified by its functional principal components that are modeled by a multivariate Gaussian distribution whose parameters are block-specific. These latter are estimated by a stochastic EM algorithm embedding a Gibbs sampling. In order to select the numbers of row-and column-clusters, an ICL-BIC criterion is introduced. In addition to be the first co-clustering algorithm for functional data, the advantage of the proposed model is its ability to extract the hidden double structure induced by the data and its ability to deal with missing values. The model has proven its efficiency on simulated data and on a real data application that helps to optimize the topology of 4G mobile networks
Fast high-dimensional Bayesian classification and clustering
We introduce a fast approach to classification and clustering applicable to high-dimensional continuous data, based on Bayesian mixture models for which explicit computations are available. This permits us to treat classification and clustering in a single framework, and allows calculation of unobserved class probability. The new classifier is robust to adding noise variables as a drawback of the built-in spike-and-slab structure of the proposed Bayesian model. The usefulness of classification using our method is shown on metabololomic example, and on the Iris data with and without noise variables. Agglomerative hierarchical clustering is used to construct a dendrogram based on the posterior probabilities of particular partitions, to provide a dendrogram with a probabilistic interpretation. An extension to variable selection is proposed which summarises the importance of variables for classification or clustering and has probabilistic interpretation. Having a simple model provides estimation of the model parameters using maximum likelihood and therefore yields a fully automatic algorithm. The new clustering method is applied to metabolomic, microarray, and image data and is studied using simulated data motivated by real datasets. The computational difficulties of the new approach are discussed, solutions for algorithm acceleration are proposed, and the written computer code is briefly analysed. Simulations shows that the quality of the estimated model parameters depends on the parametric distribution assumed for effects, but after fixing the model parameters to reasonable values, the distribution of the effects influences clustering very little. Simulations confirms that the clustering algorithm and the proposed variable selection method is reliable when the model assumptions are wrong. The new approach is compared with the popular Bayesian clustering alternative, MCLUST, fitted on the principal components using two loss functions in which our proposed approach is found to be more efficient in almost every situation
Model-Based Clustering of Multivariate Ordinal Data Relying on a Stochastic Binary Search Algorithm
International audienceWe design the first univariate probability distribution for ordinal data which strictly respects the ordinal nature of data. More precisely, it relies only on order comparisons between modalities. Contrariwise, most competitors either forget the order information or add a nonexistent distance information. The proposed distribution is obtained by modeling the data generating process which is assumed, from optimality arguments, to be a stochastic binary search algorithm in a sorted table. The resulting distribution is natively governed by two meaningful parameters (position and precision) and has very appealing properties: decrease around the mode, shape tuning from uniformity to a Dirac, identifiability. Moreover, it is easily estimated by an EM algorithm since the path in the stochastic binary search algorithm is missing. Using then the classical latent class assumption, the previous univariate ordinal model is straightforwardly extended to model-based clustering for multivariate ordinal data. Again, parameters of this mixture model are estimated by an EM algorithm. Both simulated and real data sets illustrate the great potential of this model by its ability to parsimoniously identify particularly relevant clusters which were unsuspected by some traditional competitors
- …