19,752 research outputs found

    Trustworthiness and metrics in visualizing similarity of gene expression

    BACKGROUND: Conventionally, the first step in analyzing the large and high-dimensional data sets measured by microarrays is visual exploration. Dendrograms of hierarchical clustering, self-organizing maps (SOMs), and multidimensional scaling have been used to visualize similarity relationships of data samples. We address two central properties of the methods: (i) Trustworthiness: if two samples are visualized to be similar, are they really similar? (ii) The metric: the measure of similarity determines the result, and we propose using a new learning metrics principle to derive a metric from interrelationships among data sets. RESULTS: The trustworthiness of hierarchical clustering, multidimensional scaling, and the self-organizing map was compared in visualizing similarity relationships among gene expression profiles. The self-organizing map was the best, except that hierarchical clustering was the most trustworthy for the most similar profiles. Trustworthiness can be further increased by treating separately those genes for which the visualization is least trustworthy. We then proceed to improve the metric. The distance measure between the expression profiles is adjusted to measure differences relevant to functional classes of the genes. The genes for which the new metric is the most different from the usual correlation metric are listed and visualized with one of the visualization methods, the self-organizing map, computed in the new metric. CONCLUSIONS: The conjecture from the methodological results is that the self-organizing map can be recommended to complement the usual hierarchical clustering for visualizing and exploring gene expression data. Discarding the least trustworthy samples and improving the metric still improves it.

    Methods of Hierarchical Clustering

    We survey agglomerative hierarchical clustering algorithms and discuss efficient implementations that are available in R and other software environments. We look at hierarchical self-organizing maps and mixture models. We review grid-based clustering, focusing on hierarchical density-based approaches. Finally, we describe a recently developed, very efficient (linear time) hierarchical clustering algorithm, which can also be viewed as a hierarchical grid-based algorithm. Comment: 21 pages, 2 figures, 1 table, 69 references
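
    As a rough illustration of the agglomerative idea surveyed in this entry, the sketch below merges the two closest clusters repeatedly until a target count remains. It is not taken from the paper: the function name `single_linkage`, the choice of single linkage with Euclidean distance, and the toy points are all assumptions made for the example, and this naive version is far slower than the efficient implementations the survey discusses.

```python
from itertools import combinations
from math import dist


def single_linkage(points, num_clusters):
    """Naive agglomerative clustering (hypothetical helper, single linkage).

    Starts with one cluster per point and repeatedly merges the two
    clusters whose closest members are nearest to each other.
    Returns the final clusters (lists of point indices) and the merge log.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > num_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Single linkage: cluster distance is the minimum pairwise distance.
            d = min(dist(points[i], points[j])
                    for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        # a < b always holds, so deleting index b leaves index a valid.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges


# Toy data (assumed for illustration): two tight pairs and one outlier.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters, merges = single_linkage(points, num_clusters=2)
```

    The merge log records each merge with its linkage distance, which is the information a dendrogram is drawn from.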

    What governs star formation in galaxies? A modern statistical approach

    Understanding the process of star formation is one of the key steps in understanding the formation and evolution of galaxies. In this thesis, I address the empirical star formation laws, and study the properties of galaxies that can affect the star formation rate. The Andromeda galaxy (M31) is the nearest large spiral galaxy, and therefore high-resolution images of this galaxy are available. These images provide data from various regions with different physical properties. Star formation rate and gas mass surface densities of M31 have been measured using three different methods, and have been used to compare different star formation laws over the whole galaxy and in spatially-resolved regions. Using hierarchical Bayesian regression analysis, I conclude that there is a correlation between surface density of star formation and the stellar mass surface density. A weak correlation between star formation rate, stellar mass and metallicity is also found. To study the effect of other properties of a galaxy on the star formation rate, I utilize an unsupervised data mining method (specifically the self-organizing map) on measurements of both nearby and high-redshift galaxies. Both observed data and derived quantities (e.g. star formation rate, stellar mass) of star-forming regions in M31 and the nearby spiral galaxy M101 are used as inputs to the self-organizing map. Clustering the M31 regions in the feature space reveals some (anti)-correlations between the properties of the galaxy, which are not apparent when considering data from all regions in the galaxy. The self-organizing map can be used to predict star formation rates for spatially-resolved regions in galaxies using other properties of those regions. I also apply the self-organizing map method to spectral energy distributions of high-redshift galaxies. Template spectra made from galaxies with known morphological type are used to train self-organizing maps. The trained maps are used to classify a sample of galaxy spectral energy distributions derived from fitting models to photometry data of 142 high-redshift galaxies. The grouped properties of the classified galaxies are found to be more tightly correlated in mean values of age, specific star formation rate, stellar mass, and far-UV extinction than in previous studies.

    Self-Organizing Time Map: An Abstraction of Temporal Multivariate Patterns

    This paper adopts and adapts Kohonen's standard Self-Organizing Map (SOM) for exploratory temporal structure analysis. The Self-Organizing Time Map (SOTM) applies SOM-type learning to one-dimensional arrays for individual time units, preserves the orientation with short-term memory and arranges the arrays in an ascending order of time. The two-dimensional representation of the SOTM thus attempts twofold topology preservation, where the horizontal direction preserves time topology and the vertical direction data topology. This enables discovering the occurrence and exploring the properties of temporal structural changes in data. For representing qualities and properties of SOTMs, we adapt measures and visualizations from the standard SOM paradigm, as well as introduce a measure of temporal structural changes. The functioning of the SOTM, and its visualizations and quality and property measures, are illustrated on artificial toy data. The usefulness of the SOTM in a real-world setting is shown on poverty, welfare and development indicators.

    Clustering Methods for Electricity Consumers: An Empirical Study in Hvaler-Norway

    The development of the Smart Grid in Norway in particular, and in Europe and the US in general, will shortly lead to the availability of massive amounts of fine-grained spatio-temporal consumption data from domestic households. This enables the application of data mining techniques to traditional problems in power systems. Clustering customers into appropriate groups is extremely useful for operators or retailers to address each group differently through dedicated tariffs or customer-tailored services. Currently, the task is done based on demographic data collected through questionnaires, which is error-prone. In this paper, we used three different clustering techniques (together with their variants) to automatically segment electricity consumers based on their consumption patterns. We also proposed a way to extract consumption patterns for each consumer. The grouping results were assessed using four common internal validity indexes. We found that the combination of Self Organizing Map (SOM) and k-means algorithms produces the most insightful and useful grouping. We also discovered that grouping quality cannot be measured effectively by automatic indicators, which goes against common suggestions in the literature. Comment: 12 pages, 3 figures

    Batch kernel SOM and related Laplacian methods for social network analysis

    Large graphs are natural mathematical models for describing the structure of the data in a wide variety of fields, such as web mining, social networks, information retrieval, and biological networks. For all these applications, automatic tools are required to get a synthetic view of the graph and to reach a good understanding of the underlying problem. In particular, discovering groups of tightly connected vertices and understanding the relations between those groups is very important in practice. This paper shows how a kernel version of the batch Self Organizing Map can be used to achieve these goals via kernels derived from the Laplacian matrix of the graph, especially when it is used in conjunction with more classical methods based on the spectral analysis of the graph. The proposed method is used to explore the structure of a medieval social network modeled through a weighted graph that has been directly built from a large corpus of agrarian contracts.

    Applied Sensor Fault Detection, Identification and Data Reconstruction

    Sensor fault detection and identification (SFD/I) has attracted considerable attention in military applications, especially when safety- or mission-critical issues are of paramount importance. Here, two readily implementable approaches for SFD/I are proposed through hierarchical clustering and self-organizing map neural networks. The proposed methodologies are capable of detecting sensor faults from a large group of sensors measuring different physical quantities and achieve SFD/I in a single stage. Furthermore, it is possible to reconstruct the measurements expected from the faulted sensor and thereby facilitate improved unit availability. The efficacy of the proposed approaches is demonstrated through the use of measurements from experimental trials on a gas turbine. Ultimately, the underlying principles are readily transferable to other complex industrial and military systems.