5,071 research outputs found

    Semi-supervised model-based clustering with controlled clusters leakage

    Full text link
    In this paper, we focus on finding clusters in partially categorized data sets. We propose a semi-supervised version of Gaussian mixture model, called C3L, which retrieves natural subgroups of given categories. In contrast to other semi-supervised models, C3L is parametrized by user-defined leakage level, which controls maximal inconsistency between initial categorization and resulting clustering. Our method can be implemented as a module in practical expert systems to detect clusters, which combine expert knowledge with true distribution of data. Moreover, it can be used for improving the results of less flexible clustering techniques, such as projection pursuit clustering. The paper presents extensive theoretical analysis of the model and fast algorithm for its efficient optimization. Experimental results show that C3L finds high quality clustering model, which can be applied in discovering meaningful groups in partially classified data

    Self-Organizing Time Map: An Abstraction of Temporal Multivariate Patterns

    Full text link
    This paper adopts and adapts Kohonen's standard Self-Organizing Map (SOM) for exploratory temporal structure analysis. The Self-Organizing Time Map (SOTM) implements SOM-type learning to one-dimensional arrays for individual time units, preserves the orientation with short-term memory and arranges the arrays in an ascending order of time. The two-dimensional representation of the SOTM attempts thus twofold topology preservation, where the horizontal direction preserves time topology and the vertical direction data topology. This enables discovering the occurrence and exploring the properties of temporal structural changes in data. For representing qualities and properties of SOTMs, we adapt measures and visualizations from the standard SOM paradigm, as well as introduce a measure of temporal structural changes. The functioning of the SOTM, and its visualizations and quality and property measures, are illustrated on artificial toy data. The usefulness of the SOTM in a real-world setting is shown on poverty, welfare and development indicators

    Visualizing probabilistic models: Intensive Principal Component Analysis

    Full text link
    Unsupervised learning makes manifest the underlying structure of data without curated training and specific problem definitions. However, the inference of relationships between data points is frustrated by the `curse of dimensionality' in high-dimensions. Inspired by replica theory from statistical mechanics, we consider replicas of the system to tune the dimensionality and take the limit as the number of replicas goes to zero. The result is the intensive embedding, which is not only isometric (preserving local distances) but allows global structure to be more transparently visualized. We develop the Intensive Principal Component Analysis (InPCA) and demonstrate clear improvements in visualizations of the Ising model of magnetic spins, a neural network, and the dark energy cold dark matter ({\Lambda}CDM) model as applied to the Cosmic Microwave Background.Comment: 6 pages, 5 figure

    Information Preserving Component Analysis: Data Projections for Flow Cytometry Analysis

    Full text link
    Flow cytometry is often used to characterize the malignant cells in leukemia and lymphoma patients, traced to the level of the individual cell. Typically, flow cytometric data analysis is performed through a series of 2-dimensional projections onto the axes of the data set. Through the years, clinicians have determined combinations of different fluorescent markers which generate relatively known expression patterns for specific subtypes of leukemia and lymphoma -- cancers of the hematopoietic system. By only viewing a series of 2-dimensional projections, the high-dimensional nature of the data is rarely exploited. In this paper we present a means of determining a low-dimensional projection which maintains the high-dimensional relationships (i.e. information) between differing oncological data sets. By using machine learning techniques, we allow clinicians to visualize data in a low dimension defined by a linear combination of all of the available markers, rather than just 2 at a time. This provides an aid in diagnosing similar forms of cancer, as well as a means for variable selection in exploratory flow cytometric research. We refer to our method as Information Preserving Component Analysis (IPCA).Comment: 26 page
    corecore