On class visualisation for high dimensional data: Exploring scientific datasets
Parametric Embedding (PE) has recently been proposed as a general-purpose
algorithm for class visualisation. It takes class posteriors produced by a
mixture-based clustering algorithm and projects them in 2D for visualisation.
However, although this fully modularised combination of objectives (clustering
and projection) is attractive for its conceptual simplicity, we show that, in
the case of high-dimensional data, a better combination of these objectives can
be achieved by integrating both into a single consistent probabilistic model. In
this way, the projection step fulfils a regularising role, guarding against the
curse of dimensionality. As a result, the
tradeoff between clustering and visualisation turns out to enhance the
predictive abilities of the overall model. We present results on both synthetic
data and two real-world high-dimensional data sets: observed spectra of
early-type galaxies and gene expression arrays.

Comment: to appear in Lecture Notes in Artificial Intelligence vol. 4265, the
(refereed) proceedings of the Ninth International Conference on Discovery
Science (DS-2006), October 2006, Barcelona, Spain. 12 pages, 8 figures.
Initialized and guided EM-clustering of sparse binary data with application to text-based documents
We investigate an alternative way of combining classification and clustering techniques for sparse binary data, in order to reduce the number of training samples required. Initializing EM from the available labels also reduces the algorithm's known dependency on initialization, which is more pronounced in the case of sparse data. In addition, the two-valued Poisson class model is proposed in this paper as a sparse variant of the usual Binomial assumption. Our method can be seen as a fusion of generalized logistic regression and parametric mixture modelling. Comparative simulation results on binary-coded subsets of the 20 Newsgroups text corpus and binary handwritten digits data demonstrate the potential usefulness of the suggested method. © 2000 IEEE
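A minimal sketch of the label-based initialization idea, assuming a standard Bernoulli mixture in place of the paper's two-valued Poisson class model (the function name and smoothing constant are illustrative). Responsibilities are seeded from the available labels instead of at random, which is the "initialized and guided" part:

```python
import numpy as np

def em_bernoulli_mixture(X, labels, n_iter=50, eps=1e-3):
    """EM for a Bernoulli mixture on binary data X (n_samples x n_features),
    with responsibilities initialized from available class labels
    (one mixture component per class)."""
    n, d = X.shape
    classes = np.unique(labels)
    K = len(classes)

    # Initialize responsibilities from the labels (hard assignment).
    R = np.zeros((n, K))
    for k, c in enumerate(classes):
        R[labels == c, k] = 1.0

    for _ in range(n_iter):
        # M-step: mixing weights and smoothed Bernoulli parameters.
        Nk = R.sum(axis=0) + eps
        pi = Nk / Nk.sum()
        theta = (R.T @ X + eps) / (Nk[:, None] + 2 * eps)

        # E-step: posterior responsibilities via per-component log-likelihoods.
        log_lik = (X @ np.log(theta).T
                   + (1 - X) @ np.log(1 - theta).T
                   + np.log(pi))
        log_lik -= log_lik.max(axis=1, keepdims=True)
        R = np.exp(log_lik)
        R /= R.sum(axis=1, keepdims=True)

    return pi, theta, R
```

In the semi-supervised setting of the abstract, only the labelled subset would seed R and the unlabelled points would start from uniform responsibilities; here all points are labelled purely to keep the sketch short.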
On an Equivalence between PLSI and LDA
Latent Dirichlet Allocation (LDA) is a fully generative approach to language modelling which overcomes the inconsistent generative semantics of Probabilistic Latent Semantic Indexing (PLSI). This paper shows that PLSI is a maximum a posteriori estimated LDA model under a uniform Dirichlet prior; the perceived shortcomings of PLSI can therefore be resolved and elucidated within the LDA framework.
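The claimed equivalence can be sketched as follows (notation is illustrative, writing the per-document topic proportions as theta):

```latex
% PLSI maximum-likelihood objective over word counts n(d,w):
\mathcal{L}_{\mathrm{PLSI}}
  = \sum_{d,w} n(d,w)\,\log \sum_{k} P(w \mid k)\, P(k \mid d)
% LDA MAP objective for \theta_{dk} = P(k \mid d) under \theta_d \sim \mathrm{Dir}(\alpha):
\mathcal{L}_{\mathrm{MAP}}
  = \mathcal{L}_{\mathrm{PLSI}} + \sum_{d,k} (\alpha - 1)\,\log \theta_{dk}
% With a uniform Dirichlet prior (\alpha = 1) the penalty term vanishes,
% so MAP-estimated LDA coincides with maximum-likelihood PLSI.
```

The uniform prior makes the Dirichlet density constant, which is why the two estimation problems share the same optimum.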
Simplicial mixtures of Markov chains: Distributed modelling of dynamic user profiles
To provide a compact generative representation of the sequential activity of a number of individuals within a group, there is a tradeoff between the definition of individual-specific and global models. This paper proposes a linear-time distributed model for finite-state symbolic sequences representing traces of individual user activity, based on the assumption that heterogeneous user behavior may be 'explained' by a relatively small number of common, structurally simple behavioral patterns which may interleave randomly in user-specific proportions. The results of an empirical study on three different sources of user traces indicate that this modelling approach provides an efficient representation scheme, reflected by improved prediction performance as well as low-complexity and intuitively interpretable representations.
Towards Large-Scale Continuous EDA: A Random Matrix Theory Perspective
Estimation of distribution algorithms (EDA) are a major branch of evolutionary algorithms (EA) with some unique advantages in principle. They are able to take advantage of correlation structure to drive the search more efficiently, and they are able to provide insights about the structure of the search space. However, model building in high dimensions is extremely challenging, and as a result existing EDAs may lose their strengths in large-scale problems. Large-scale continuous global optimisation is key to many modern-day real-world problems. Scaling up EAs to large-scale problems has become one of the biggest challenges of the field. This paper pins down some fundamental roots of the problem and makes a start at developing a new and generic framework to yield effective and efficient EDA-type algorithms for large-scale continuous global optimisation problems. Our concept is to introduce an ensemble of random projections to low dimensions of the set of fittest search points as a basis for developing a new and generic divide-and-conquer methodology. Our ideas are rooted in the theory of random projections developed in theoretical computer science, and in developing and analysing our framework we exploit some recent results in non-asymptotic random matrix theory.
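One plausible reading of the random-projection ensemble idea can be sketched as a single model-building/sampling step: project the fittest points to a few dimensions with several random Gaussian matrices, fit a full-covariance Gaussian in each subspace, sample from it, map the samples back, and average over the ensemble. This is an assumed scheme for illustration only, not the paper's algorithm; all names and constants are hypothetical:

```python
import numpy as np

def rp_eda_step(fittest, n_proj=10, k=3, n_samples=100, rng=None):
    """One sketch step of a random-projection-ensemble EDA.

    fittest: (n_points x d) array of the selected fittest search points.
    Returns n_samples candidate points in the original d-dimensional space.
    """
    rng = np.random.default_rng(rng)
    n, d = fittest.shape
    back = np.zeros((n_samples, d))
    for _ in range(n_proj):
        # Random Gaussian projection to k dimensions (scaled so that
        # R.T @ R approximates the identity in expectation).
        R = rng.standard_normal((k, d)) / np.sqrt(k)
        Y = fittest @ R.T                                  # projected points
        mu = Y.mean(axis=0)
        cov = np.cov(Y.T) + 1e-6 * np.eye(k)               # regularised fit
        Z = rng.multivariate_normal(mu, cov, size=n_samples)
        back += Z @ R                                      # map back to R^d
    return back / n_proj                                   # ensemble average
```

Fitting a full-covariance Gaussian is cheap in k dimensions even when d is large, which is the point of dividing the model-building across random subspaces.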
Sequential activity profiling: Latent Dirichlet allocation of Markov chains
To provide a parsimonious generative representation of the sequential activity of a number of individuals within a population there is a necessary tradeoff between the definition of individual specific and global representations. A linear-time algorithm is proposed that defines a distributed predictive model for finite state symbolic sequences which represent the traces of the activity of a number of individuals within a group. The algorithm is based on a straightforward generalization of latent Dirichlet allocation to time-invariant Markov chains of arbitrary order. The modelling assumption made is that the possibly heterogeneous behavior of individuals may be represented by a relatively small number of simple and common behavioral traits which may interleave randomly according to an individual-specific distribution. The results of an empirical study on three different application domains indicate that this modelling approach provides an efficient low-complexity and intuitively interpretable representation scheme which is reflected by improved prediction performance over comparable models. © 2005 Springer Science + Business Media, Inc
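The generative assumption described above can be sketched directly: each user draws trait proportions from a Dirichlet, and at every step a behavioral trait is drawn from those proportions and the next symbol is emitted by that trait's transition matrix (first-order here for brevity; names are illustrative):

```python
import numpy as np

def sample_user_trace(length, trans_mats, alpha, rng=None):
    """Sample one user's symbolic trace from an LDA-of-Markov-chains model.

    trans_mats: (K x S x S) array, one row-stochastic transition matrix per
    behavioral trait. alpha: Dirichlet parameters over the K traits.
    """
    rng = np.random.default_rng(rng)
    trans_mats = np.asarray(trans_mats)
    K, S, _ = trans_mats.shape
    theta = rng.dirichlet(alpha)            # user-specific trait proportions
    trace = [rng.integers(S)]               # uniform initial state (assumption)
    for _ in range(length - 1):
        z = rng.choice(K, p=theta)          # pick a trait for this transition
        trace.append(rng.choice(S, p=trans_mats[z][trace[-1]]))
    return trace, theta
```

Inference would recover the shared transition matrices and the per-user theta from observed traces; the sketch only shows the forward (generative) direction.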
A dynamic probabilistic model to visualise topic evolution in text streams
We propose a novel probabilistic method, based on latent variable models, for unsupervised topographic visualisation of dynamically evolving, coherent textual information. This can be seen as a complementary tool for topic detection and tracking applications. It is achieved by exploiting the available a priori domain knowledge that there are relatively homogeneous temporal segments in the data stream. Differently from topographic techniques previously applied to static text collections, in the proposed model the topography is an outcome of the temporal coherence of the data stream. Simulation results on toy-data settings and an actual application to Internet chat-line discussion analysis are presented by way of demonstration.