On class visualisation for high dimensional data: Exploring scientific datasets
Parametric Embedding (PE) has recently been proposed as a general-purpose
algorithm for class visualisation. It takes class posteriors produced by a
mixture-based clustering algorithm and projects them in 2D for visualisation.
However, although this fully modularised combination of objectives (clustering
and projection) is attractive for its conceptual simplicity, we show that, in
the case of high-dimensional data, a better combination of these objectives can
be achieved by integrating both into a single consistent probabilistic model. In
this way, the projection step fulfils a regularising role, guarding against the
curse of dimensionality. As a result, the
tradeoff between clustering and visualisation turns out to enhance the
predictive abilities of the overall model. We present results on both synthetic
data and two real-world high-dimensional data sets: observed spectra of
early-type galaxies and gene expression arrays.

Comment: to appear in Lecture Notes in Artificial Intelligence vol. 4265, the
(refereed) proceedings of the Ninth International Conference on Discovery
Science (DS-2006), October 2006, Barcelona, Spain. 12 pages, 8 figures.
Initialized and guided EM-clustering of sparse binary data with application to text-based documents
We investigate an alternative way of combining classification and clustering techniques for sparse binary data, in order to reduce the number of training samples required. Initializing EM from the available labels also reduces the algorithm's known dependency on initialization, which is more pronounced in the case of sparse data. In addition, the two-valued Poisson class model is proposed in this paper as a sparse variant of the usual Binomial assumption. Our method can be seen as a fusion of generalized logistic regression and parametric mixture modelling. Comparative simulation results on binary-coded subsets of the 20 Newsgroups text corpus and binary handwritten digits data demonstrate the potential usefulness of the suggested method. © 2000 IEEE
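A minimal sketch of the label-based initialization idea, assuming a standard Bernoulli mixture in place of the paper's two-valued Poisson class model (the function name and smoothing constant are illustrative). Responsibilities are seeded from the available labels instead of at random, which is the "initialized and guided" part:

```python
import numpy as np

def em_bernoulli_mixture(X, labels, n_iter=50, eps=1e-3):
    """EM for a Bernoulli mixture on binary data X (n_samples x n_features),
    with responsibilities initialized from available class labels
    (one mixture component per class)."""
    n, d = X.shape
    classes = np.unique(labels)
    K = len(classes)

    # Initialize responsibilities from the labels (hard assignment).
    R = np.zeros((n, K))
    for k, c in enumerate(classes):
        R[labels == c, k] = 1.0

    for _ in range(n_iter):
        # M-step: mixing weights and smoothed Bernoulli parameters.
        Nk = R.sum(axis=0) + eps
        pi = Nk / Nk.sum()
        theta = (R.T @ X + eps) / (Nk[:, None] + 2 * eps)

        # E-step: posterior responsibilities via per-component log-likelihoods.
        log_lik = (X @ np.log(theta).T
                   + (1 - X) @ np.log(1 - theta).T
                   + np.log(pi))
        log_lik -= log_lik.max(axis=1, keepdims=True)
        R = np.exp(log_lik)
        R /= R.sum(axis=1, keepdims=True)

    return pi, theta, R
```

In the semi-supervised setting of the abstract, only the labelled subset would seed R and the unlabelled points would start from uniform responsibilities; here all points are labelled purely to keep the sketch short.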
On an Equivalence between PLSI and LDA
Latent Dirichlet Allocation (LDA) is a fully generative approach to language modelling which overcomes the inconsistent generative semantics of Probabilistic Latent Semantic Indexing (PLSI). This paper shows that PLSI is a maximum a posteriori estimated LDA model under a uniform Dirichlet prior; the perceived shortcomings of PLSI can therefore be resolved and elucidated within the LDA framework.
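The claimed equivalence can be sketched as follows (notation is illustrative, writing the per-document topic proportions as theta):

```latex
% PLSI maximum-likelihood objective over word counts n(d,w):
\mathcal{L}_{\mathrm{PLSI}}
  = \sum_{d,w} n(d,w)\,\log \sum_{k} P(w \mid k)\, P(k \mid d)
% LDA MAP objective for \theta_{dk} = P(k \mid d) under \theta_d \sim \mathrm{Dir}(\alpha):
\mathcal{L}_{\mathrm{MAP}}
  = \mathcal{L}_{\mathrm{PLSI}} + \sum_{d,k} (\alpha - 1)\,\log \theta_{dk}
% With a uniform Dirichlet prior (\alpha = 1) the penalty term vanishes,
% so MAP-estimated LDA coincides with maximum-likelihood PLSI.
```

The uniform prior makes the Dirichlet density constant, which is why the two estimation problems share the same optimum.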
Simplicial mixtures of Markov chains: Distributed modelling of dynamic user profiles
To provide a compact generative representation of the sequential activity of a number of individuals within a group, there is a tradeoff between the definition of individual-specific and global models. This paper proposes a linear-time distributed model for finite-state symbolic sequences representing traces of individual user activity, based on the assumption that heterogeneous user behavior may be 'explained' by a relatively small number of common, structurally simple behavioral patterns which may interleave randomly in user-specific proportions. The results of an empirical study on three different sources of user traces indicate that this modelling approach provides an efficient representation scheme, reflected by improved prediction performance as well as low-complexity and intuitively interpretable representations.
Towards Large-Scale Continuous EDA: A Random Matrix Theory Perspective
Estimation of distribution algorithms (EDA) are a major branch of evolutionary algorithms (EA) with some unique advantages in principle. They are able to take advantage of correlation structure to drive the search more efficiently, and they are able to provide insights about the structure of the search space. However, model building in high dimensions is extremely challenging, and as a result existing EDAs may lose their strengths in large-scale problems. Large-scale continuous global optimisation is key to many modern-day real-world problems. Scaling up EAs to large-scale problems has become one of the biggest challenges of the field. This paper pins down some fundamental roots of the problem and makes a start at developing a new and generic framework to yield effective and efficient EDA-type algorithms for large-scale continuous global optimisation problems. Our concept is to introduce an ensemble of random projections to low dimensions of the set of fittest search points as a basis for developing a new and generic divide-and-conquer methodology. Our ideas are rooted in the theory of random projections developed in theoretical computer science, and in developing and analysing our framework we exploit some recent results in non-asymptotic random matrix theory.
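One plausible reading of the random-projection ensemble idea can be sketched as a single model-building/sampling step: project the fittest points to a few dimensions with several random Gaussian matrices, fit a full-covariance Gaussian in each subspace, sample from it, map the samples back, and average over the ensemble. This is an assumed scheme for illustration only, not the paper's algorithm; all names and constants are hypothetical:

```python
import numpy as np

def rp_eda_step(fittest, n_proj=10, k=3, n_samples=100, rng=None):
    """One sketch step of a random-projection-ensemble EDA.

    fittest: (n_points x d) array of the selected fittest search points.
    Returns n_samples candidate points in the original d-dimensional space.
    """
    rng = np.random.default_rng(rng)
    n, d = fittest.shape
    back = np.zeros((n_samples, d))
    for _ in range(n_proj):
        # Random Gaussian projection to k dimensions (scaled so that
        # R.T @ R approximates the identity in expectation).
        R = rng.standard_normal((k, d)) / np.sqrt(k)
        Y = fittest @ R.T                                  # projected points
        mu = Y.mean(axis=0)
        cov = np.cov(Y.T) + 1e-6 * np.eye(k)               # regularised fit
        Z = rng.multivariate_normal(mu, cov, size=n_samples)
        back += Z @ R                                      # map back to R^d
    return back / n_proj                                   # ensemble average
```

Fitting a full-covariance Gaussian is cheap in k dimensions even when d is large, which is the point of dividing the model-building across random subspaces.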
Sequential activity profiling: Latent Dirichlet allocation of Markov chains
To provide a parsimonious generative representation of the sequential activity of a number of individuals within a population there is a necessary tradeoff between the definition of individual specific and global representations. A linear-time algorithm is proposed that defines a distributed predictive model for finite state symbolic sequences which represent the traces of the activity of a number of individuals within a group. The algorithm is based on a straightforward generalization of latent Dirichlet allocation to time-invariant Markov chains of arbitrary order. The modelling assumption made is that the possibly heterogeneous behavior of individuals may be represented by a relatively small number of simple and common behavioral traits which may interleave randomly according to an individual-specific distribution. The results of an empirical study on three different application domains indicate that this modelling approach provides an efficient low-complexity and intuitively interpretable representation scheme which is reflected by improved prediction performance over comparable models. © 2005 Springer Science + Business Media, Inc
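The generative assumption described above can be sketched directly: each user draws trait proportions from a Dirichlet, and at every step a behavioral trait is drawn from those proportions and the next symbol is emitted by that trait's transition matrix (first-order here for brevity; names are illustrative):

```python
import numpy as np

def sample_user_trace(length, trans_mats, alpha, rng=None):
    """Sample one user's symbolic trace from an LDA-of-Markov-chains model.

    trans_mats: (K x S x S) array, one row-stochastic transition matrix per
    behavioral trait. alpha: Dirichlet parameters over the K traits.
    """
    rng = np.random.default_rng(rng)
    trans_mats = np.asarray(trans_mats)
    K, S, _ = trans_mats.shape
    theta = rng.dirichlet(alpha)            # user-specific trait proportions
    trace = [rng.integers(S)]               # uniform initial state (assumption)
    for _ in range(length - 1):
        z = rng.choice(K, p=theta)          # pick a trait for this transition
        trace.append(rng.choice(S, p=trans_mats[z][trace[-1]]))
    return trace, theta
```

Inference would recover the shared transition matrices and the per-user theta from observed traces; the sketch only shows the forward (generative) direction.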
A dynamic probabilistic model to visualise topic evolution in text streams
We propose a novel probabilistic method, based on latent variable models, for unsupervised topographic visualisation of dynamically evolving, coherent textual information. This can be seen as a complementary tool for topic detection and tracking applications. It is achieved by exploiting the available a priori domain knowledge that there are relatively homogeneous temporal segments in the data stream. Differently from topographic techniques previously applied to static text collections, in the proposed model the topography is an outcome of the temporal coherence of the data stream. Simulation results on toy-data settings and an actual application to Internet chat-line discussion analysis are presented by way of demonstration.