Semi-supervised model-based clustering with controlled clusters leakage
In this paper, we focus on finding clusters in partially categorized data
sets. We propose a semi-supervised version of the Gaussian mixture model, called
C3L, which retrieves natural subgroups of given categories. In contrast to
other semi-supervised models, C3L is parametrized by a user-defined leakage
level, which controls the maximal inconsistency between the initial
categorization and the resulting clustering. Our method can be implemented as a
module in practical expert systems to detect clusters that combine expert
knowledge with the true distribution of the data. Moreover, it can be used to
improve the results of less flexible clustering techniques, such as projection
pursuit clustering. The paper presents an extensive theoretical analysis of the
model and a fast algorithm for its optimization. Experimental results show that
C3L finds a high-quality clustering model, which can be applied to discover
meaningful groups in partially classified data.
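
To make the idea of inconsistency between an initial categorization and a clustering concrete, here is a minimal sketch of one simple proxy: the fraction of categorized points that fall in a cluster dominated by a different category. This is an illustrative assumption, not the leakage definition used by C3L (the abstract does not give one); the function name `leakage` and the majority-vote rule are hypothetical.

```python
from collections import Counter, defaultdict


def leakage(categories, clusters):
    """Fraction of categorized points outside their cluster's majority category.

    categories: initial labels, with None for uncategorized points.
    clusters:   cluster ids, same length as categories.
    NOTE: an assumed proxy for "inconsistency", not the C3L definition.
    """
    # Count how often each category occurs inside each cluster,
    # ignoring points that have no initial category.
    counts = defaultdict(Counter)
    for cat, clu in zip(categories, clusters):
        if cat is not None:
            counts[clu][cat] += 1

    labeled = sum(sum(c.values()) for c in counts.values())
    if labeled == 0:
        return 0.0

    # Within each cluster, every point not in the majority category "leaks".
    leaked = sum(sum(c.values()) - max(c.values()) for c in counts.values())
    return leaked / labeled


categories = ["a", "a", "a", "b", "b", None]
clusters = [0, 0, 1, 1, 1, 1]
score = leakage(categories, clusters)
```

A user-defined threshold on a quantity like this would then bound how far the clustering may drift from the expert's categories.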
Self-Organizing Time Map: An Abstraction of Temporal Multivariate Patterns
This paper adopts and adapts Kohonen's standard Self-Organizing Map (SOM) for
exploratory temporal structure analysis. The Self-Organizing Time Map (SOTM)
applies SOM-type learning to one-dimensional arrays for individual time
units, preserves the orientation with short-term memory and arranges the arrays
in ascending order of time. The two-dimensional representation of the SOTM
thus attempts twofold topology preservation, where the horizontal direction
preserves time topology and the vertical direction data topology. This enables
discovering the occurrence and exploring the properties of temporal structural
changes in data. For representing qualities and properties of SOTMs, we adapt
measures and visualizations from the standard SOM paradigm, as well as
introduce a measure of temporal structural changes. The functioning of the
SOTM, and its visualizations and quality and property measures, are illustrated
on artificial toy data. The usefulness of the SOTM in a real-world setting is
shown on poverty, welfare and development indicators.
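
The procedure described in this abstract can be sketched roughly as follows: train a one-dimensional SOM per time unit, initialize each one from the codebook of the previous time unit (the "short-term memory"), and collect the arrays in ascending time order. This is a minimal sketch under stated assumptions, not the authors' implementation: the online update rule, the Gaussian neighbourhood, the linear learning-rate decay, the sampled initialization of the first array, and the names `train_1d_som` and `train_sotm` are all illustrative choices.

```python
import math
import random


def train_1d_som(data, codebook, epochs=20, lr=0.5, radius=1.0):
    """Online SOM training on a one-dimensional array of units.

    Returns a new codebook; the input codebook is not modified.
    """
    units = [list(u) for u in codebook]
    for epoch in range(epochs):
        # Assumed schedule: learning rate decays linearly towards zero.
        alpha = lr * (1.0 - epoch / epochs)
        for x in data:
            # Best-matching unit: smallest squared Euclidean distance to x.
            bmu = min(
                range(len(units)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(units[i], x)),
            )
            for i, u in enumerate(units):
                # Gaussian neighbourhood over the index distance on the array.
                h = math.exp(-((i - bmu) ** 2) / (2.0 * radius ** 2))
                for d in range(len(u)):
                    u[d] += alpha * h * (x[d] - u[d])
    return units


def train_sotm(data_by_time, n_units, seed=0):
    """Train one 1-D SOM per time unit; returns the arrays in time order."""
    rng = random.Random(seed)
    columns = []
    prev = None
    for data in data_by_time:  # data_by_time must be in ascending time order
        if prev is None:
            # Assumption: seed the first array with sorted data samples.
            init = sorted(rng.sample(data, n_units))
        else:
            # Short-term memory: start from the previous time unit's array.
            init = prev
        prev = train_1d_som(data, init)
        columns.append(prev)
    return columns


# Toy data: two time units whose values shift upwards over time.
data_by_time = [
    [[0.0], [0.1], [1.0], [1.1]],
    [[2.0], [2.1], [3.0], [3.1]],
]
sotm = train_sotm(data_by_time, n_units=2)
```

Reading `sotm` column by column gives the time axis, while the position within each column gives the data-topology axis.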
Visualizing probabilistic models: Intensive Principal Component Analysis
Unsupervised learning makes manifest the underlying structure of data without
curated training and specific problem definitions. However, the inference of
relationships between data points is frustrated by the 'curse of
dimensionality' in high dimensions. Inspired by replica theory from statistical
mechanics, we consider replicas of the system to tune the dimensionality and
take the limit as the number of replicas goes to zero. The result is the
intensive embedding, which is not only isometric (preserving local distances)
but allows global structure to be more transparently visualized. We develop the
Intensive Principal Component Analysis (InPCA) and demonstrate clear
improvements in visualizations of the Ising model of magnetic spins, a neural
network, and the dark energy cold dark matter (ΛCDM) model as applied
to the Cosmic Microwave Background.
Comment: 6 pages, 5 figures
Information Preserving Component Analysis: Data Projections for Flow Cytometry Analysis
Flow cytometry is often used to characterize the malignant cells in leukemia
and lymphoma patients, traced to the level of the individual cell. Typically,
flow cytometric data analysis is performed through a series of 2-dimensional
projections onto the axes of the data set. Through the years, clinicians have
determined combinations of different fluorescent markers which generate
relatively known expression patterns for specific subtypes of leukemia and
lymphoma -- cancers of the hematopoietic system. By only viewing a series of
2-dimensional projections, the high-dimensional nature of the data is rarely
exploited. In this paper we present a means of determining a low-dimensional
projection which maintains the high-dimensional relationships (i.e.
information) between differing oncological data sets. By using machine learning
techniques, we allow clinicians to visualize data in a low dimension defined by
a linear combination of all of the available markers, rather than just 2 at a
time. This provides an aid in diagnosing similar forms of cancer, as well as a
means for variable selection in exploratory flow cytometric research. We refer
to our method as Information Preserving Component Analysis (IPCA).
Comment: 26 pages
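
The contrast drawn in this abstract, between viewing two markers at a time and viewing a linear combination of all markers, can be illustrated with a small sketch. It only shows what the two kinds of projection look like; it does not implement IPCA, which learns its projection matrix from the data. The marker values and the hand-picked matrix below are made up for illustration, and the helper names `project` and `axis_pair` are hypothetical.

```python
def project(points, matrix):
    """Apply a (k x d) projection matrix to each d-dimensional point."""
    return [
        [sum(w * x for w, x in zip(row, p)) for row in matrix]
        for p in points
    ]


def axis_pair(i, j, d):
    """Projection matrix that simply selects markers i and j out of d."""
    return [
        [1.0 if c == i else 0.0 for c in range(d)],
        [1.0 if c == j else 0.0 for c in range(d)],
    ]


# Made-up intensities for four fluorescent markers on two cells.
cells = [
    [0.2, 1.5, 3.0, 0.7],
    [0.4, 1.1, 2.5, 0.9],
]

# Conventional view: two markers at a time (here markers 0 and 2).
selected = project(cells, axis_pair(0, 2, 4))

# Linear view: each output axis mixes all four markers.
# The rows are orthonormal but chosen by hand, not learned.
mixing = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, -0.5, 0.5, -0.5],
]
combined = project(cells, mixing)
```

In the axis-aligned case, markers 1 and 3 have no influence on the plot, whereas every marker contributes to both coordinates of `combined`.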