5 research outputs found
SOM-Based Class Discovery Exploring the ICA-Reduced Features of Microarray Expression Profiles
Gene expression datasets are large and complex, having many variables and unknown
internal structure. We apply independent component analysis (ICA) to derive a
less redundant representation of the expression data. The decomposition produces
components with minimal statistical dependence and reveals biologically relevant
information. Subsequently, we apply cluster analysis to the transformed data (an
important and popular tool for obtaining an initial understanding of the
data, usually employed for class discovery). The proposed self-organizing map
(SOM)-based clustering algorithm automatically determines the number of ‘natural’
subgroups of the data, being aided at this task by the available prior knowledge of the
functional categories of genes. An entropy criterion allows each gene to be assigned
to multiple classes, which is closer to the biological reality. These features,
however, do not come at the cost of the algorithm's simplicity: the map grows
on a simple grid structure and the learning algorithm remains identical to
Kohonen's.
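The two-stage pipeline described above can be sketched in a few dozen lines of numpy. This is a minimal illustration only: it uses deflation-based FastICA with a tanh nonlinearity and a fixed-size Kohonen map, whereas the paper's algorithm grows the map and uses an entropy criterion for multi-class gene assignment; all function names and parameter values below are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def fastica(X, n_components, iters=200):
    """Deflation-based FastICA with a tanh nonlinearity (illustrative)."""
    Xc = X - X.mean(axis=0)                        # center
    d, E = np.linalg.eigh(np.cov(Xc.T))            # eigendecompose covariance
    Z = Xc @ (E / np.sqrt(d)) @ E.T                # whiten
    W = np.zeros((n_components, X.shape[1]))
    for i in range(n_components):
        w = rng.normal(size=X.shape[1])
        w /= np.linalg.norm(w)
        for _ in range(iters):
            g = np.tanh(Z @ w)                     # fixed-point update
            w_new = (Z * g[:, None]).mean(0) - (1 - g**2).mean() * w
            w_new -= W[:i].T @ (W[:i] @ w_new)     # deflate vs. earlier components
            w = w_new / np.linalg.norm(w_new)
        W[i] = w
    return Z @ W.T                                 # independent components

def som_cluster(S, grid=(3, 3), epochs=30, lr=0.5, sigma=1.0):
    """Train a fixed-grid Kohonen SOM; return (codebook, cluster labels)."""
    rows, cols = grid
    coords = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
    W = rng.normal(scale=0.1, size=(rows * cols, S.shape[1]))
    for t in range(epochs):
        a = lr * (1 - t / epochs)                  # decaying learning rate
        for idx in rng.permutation(len(S)):
            x = S[idx]
            bmu = np.argmin(((W - x) ** 2).sum(axis=1))   # best-matching unit
            h = np.exp(-((coords - coords[bmu]) ** 2).sum(axis=1)
                       / (2 * sigma**2))           # grid neighborhood
            W += a * h[:, None] * (x - W)          # Kohonen update
    labels = np.array([np.argmin(((W - x) ** 2).sum(axis=1)) for x in S])
    return W, labels

# Toy "expression matrix": 60 samples x 8 genes, two latent groups.
X = np.vstack([rng.normal(0, 1, (30, 8)), rng.normal(3, 1, (30, 8))])
S = fastica(X, n_components=3)
codebook, labels = som_cluster(S)
```

Each sample is assigned to its best-matching unit, so the map units act as cluster prototypes in the ICA-reduced space.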
Joint Entropy Maximization in Kernel-Based Topographic Maps
A new learning algorithm for kernel-based topographic map formation
is introduced. The kernel parameters are adjusted individually so as to
maximize the joint entropy of the kernel outputs. This is done by maximizing
the differential entropies of the individual kernel outputs, given
that the map’s output redundancy, due to the kernel overlap, needs to be
minimized. The latter is achieved by minimizing the mutual information
between the kernel outputs. As a kernel, the (radial) incomplete gamma
distribution is taken since, for a gaussian input density, the differential
entropy of the kernel output will be maximal. Since the theoretically optimal
joint entropy performance can be derived for the case of nonoverlapping
gaussian mixture densities, a new clustering algorithm is suggested
that uses this optimum as its “null” distribution. Finally, it is shown that
the learning algorithm is similar to one that performs stochastic gradient
descent on the Kullback-Leibler divergence for a heteroskedastic gaussian
mixture density model.
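The general mechanism, a topographic map of Gaussian kernels whose radii are adapted individually, can be illustrated with a simplified sketch. Note that the variance update below is a generic heteroskedastic stand-in (each kernel tracks the local variance of the inputs it wins), not the paper's incomplete-gamma kernel or its entropy-based learning rule; all names and constants are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def kernel_map(X, grid=(4, 4), epochs=40, lr=0.3, sigma_n=1.0, eta=0.05):
    """Topographic map of Gaussian kernels with individually adapted radii.

    Centers follow a Kohonen-style neighborhood update; each kernel's
    variance tracks the local variance of the inputs it wins (a simple
    heteroskedastic stand-in for entropy-based radius adaptation)."""
    rows, cols = grid
    dim = X.shape[1]
    coords = np.array([(i, j) for i in range(rows) for j in range(cols)], float)
    W = rng.normal(scale=0.5, size=(rows * cols, dim))   # kernel centers
    s2 = np.ones(rows * cols)                            # per-kernel variances
    for t in range(epochs):
        a = lr * (1 - t / epochs)
        for idx in rng.permutation(len(X)):
            x = X[idx]
            d2 = ((W - x) ** 2).sum(axis=1)
            bmu = np.argmin(d2 / s2)                     # scale-aware winner
            h = np.exp(-((coords - coords[bmu]) ** 2).sum(axis=1)
                       / (2 * sigma_n**2))               # grid neighborhood
            W += a * h[:, None] * (x - W)                # center update
            s2 += eta * h * (d2 / dim - s2)              # local variance estimate
    return W, s2

# Two input blobs with different spreads: kernels near the wide blob
# should end up with larger radii than those near the tight one.
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(4, 1.5, (100, 2))])
centers, variances = kernel_map(X)
```

Because `eta * h` stays well below 1, the variance update is a convex combination and each kernel's variance remains strictly positive.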
Bayesian networks for classification, clustering, and high-dimensional data visualisation
This thesis presents new developments for a particular class of Bayesian networks in which the number of parent nodes each node can have is limited. This restriction yields structures of low complexity (number of edges), enabling the formulation of optimal learning algorithms for Bayesian networks from data. The new developments focus on three topics: classification, clustering, and high-dimensional data visualisation (topographic map formation).

For classification purposes, a new learning algorithm for Bayesian networks is introduced which generates simple Bayesian network classifiers. This approach creates a completely new class of networks, a space previously limited mostly to two well-known models: the naive Bayesian (NB) classifier and the Tree Augmented Naive Bayes (TAN) classifier. The proposed learning algorithm enhances the NB model with a Bayesian monitoring system, so the complexity of the resulting network is determined by the input data, yielding structures that model the data distribution more realistically and thereby improve classification performance.

Research on Bayesian networks for clustering has not been as popular as for classification tasks. A new unsupervised learning algorithm is introduced for three types of Bayesian network classifiers, enabling them to carry out clustering tasks. The resulting models can perform cluster assignments probabilistically, using the posterior probability that a data point belongs to a given cluster. A key characteristic of the proposed clustering models, which traditional clustering techniques lack, is the ability to show the probabilistic dependencies amongst the variables for each cluster, enabling a better understanding of each cluster.

The final part of this thesis introduces one of the first developments enabling Bayesian networks to perform topographic mapping.
A new unsupervised learning algorithm for the NB model is presented which enables the projection of high-dimensional data onto a two-dimensional space for visualisation purposes. The Bayesian network formalism allows the learning algorithm to generate a density model of the input data and provides a cost function for monitoring convergence during training. Other mapping techniques lack these features, a limitation overcome in this research.
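The probabilistic cluster assignment described above can be illustrated with EM for a naive-Bayes mixture, in which features are conditionally independent given the cluster (here, diagonal Gaussians). This is a generic sketch, not the thesis's learning algorithm, and it omits the structure learning and the TAN/monitored-NB variants; all identifiers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def nb_cluster(X, k=2, iters=50, var_floor=1e-6):
    """EM for a naive-Bayes mixture: features conditionally independent
    given the cluster, modelled as diagonal Gaussians. Returns the
    posterior probability of each point belonging to each cluster."""
    n, d = X.shape
    pi = np.full(k, 1.0 / k)                        # cluster priors
    mu = X[rng.choice(n, k, replace=False)]         # init means from data
    var = np.ones((k, d))
    for _ in range(iters):
        # E-step: log p(x, c) under each cluster's diagonal Gaussian
        log_p = (np.log(pi)
                 - 0.5 * np.log(2 * np.pi * var).sum(axis=1)
                 - 0.5 * (((X[:, None, :] - mu) ** 2) / var).sum(axis=2))
        log_p -= log_p.max(axis=1, keepdims=True)   # numerical stabilisation
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)     # posteriors p(c | x)
        # M-step: update priors, means, per-feature variances
        nk = resp.sum(axis=0) + 1e-12
        pi = nk / n
        mu = (resp.T @ X) / nk[:, None]
        var = (resp.T @ (X ** 2)) / nk[:, None] - mu ** 2
        var = np.maximum(var, var_floor)            # floor to keep valid
    return resp, mu

# Two well-separated blobs: posteriors should be close to one-hot.
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
post, means = nb_cluster(X)
labels = post.argmax(axis=1)
```

The returned `post` matrix gives soft, probabilistic assignments, matching the thesis's point that each data point belongs to clusters with a posterior probability rather than a hard label.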