22 research outputs found

    Unsupervised discovery of microbial population structure within metagenomes using nucleotide base composition

    Get PDF
    An approach to infer the unknown microbial population structure within a metagenome is to cluster nucleotide sequences based on common patterns in base composition, otherwise referred to as binning. When functional roles are assigned to the identified populations, a deeper understanding of microbial communities can be attained, more so than gene-centric approaches that explore overall functionality. In this study, we propose an unsupervised, model-based binning method with two clustering tiers, which uses a novel transformation of the oligonucleotide frequency-derived error gradient and GC content to generate coarse groups at the first tier of clustering; and tetranucleotide frequency to refine these groups at the secondary clustering tier. The proposed method has a demonstrated improvement over PhyloPythia, S-GSOM, TACOA and TaxSOM on all three benchmarks that were used for evaluation in this study. The proposed method is then applied to a pyrosequenced metagenomic library of mud volcano sediment sampled in southwestern Taiwan, with the inferred population structure validated against complementary sequencing of 16S ribosomal RNA marker genes. Finally, the proposed method was further validated against four publicly available metagenomes, including a highly complex Antarctic whale-fall bone sample, which was previously assumed to be too complex for binning prior to functional analysis

    Enhanced clustering analysis pipeline for performance analysis of parallel applications

    Get PDF
    Clustering analysis is widely used to stratify data in the same cluster when they are similar according to the specific metrics. We can use the cluster analysis to group the CPU burst of a parallel application, and the regions on each process in-between communication calls or calls to the parallel runtime. The resulting clusters obtained are the different computational trends or phases that appear in the application. These clusters are useful to understand the behavior of the computation part of the application and focus the analyses on those that present performance issues. Although density-based clustering algorithms are a powerful and efficient tool to summarize this type of information, their traditional user-guided clustering methodology has many shortcomings and deficiencies in dealing with the complexity of data, the diversity of data structures, high-dimensionality of data, and the dramatic increase in the amount of data. Consequently, the majority of DBSCAN-like algorithms have weaknesses to handle high-dimensionality and/or Multi-density data, and they are sensitive to their hyper-parameter configuration. Furthermore, extracting insight from the obtained clusters is an intuitive and manual task. To mitigate these weaknesses, we have proposed a new unified approach to replace the user-guided clustering with an automated clustering analysis pipeline, called Enhanced Cluster Identification and Interpretation (ECII) pipeline. To build the pipeline, we propose novel techniques including Robust Independent Feature Selection, Feature Space Curvature Map, Organization Component Analysis, and hyper-parameters tuning to feature selection, density homogenization, cluster interpretation, and model selection which are the main components of our machine learning pipeline. This thesis contributes four new techniques to the Machine Learning field with a particular use case in Performance Analytics field. The first contribution is a novel unsupervised approach for feature selection on noisy data, called Robust Independent Feature Selection (RIFS). Specifically, we choose a feature subset that contains most of the underlying information, using the same criteria as the Independent component analysis. Simultaneously, the noise is separated as an independent component. The second contribution of the thesis is a parametric multilinear transformation method to homogenize cluster densities while preserving the topological structure of the dataset, called Feature Space Curvature Map (FSCM). We present a new Gravitational Self-organizing Map to model the feature space curvature by plugging the concepts of gravity and fabric of space into the Self-organizing Map algorithm to mathematically describe the density structure of the data. To homogenize the cluster density, we introduce a novel mapping mechanism to project the data from the non-Euclidean curved space to a new Euclidean flat space. The third contribution is a novel topological-based method to study potentially complex high-dimensional categorized data by quantifying their shapes and extracting fine-grain insights from them to interpret the clustering result. We introduce our Organization Component Analysis (OCA) method for the automatic arbitrary cluster-shape study without an assumption about the data distribution. Finally, to tune the DBSCAN hyper-parameters, we propose a new tuning mechanism by combining techniques from machine learning and optimization domains, and we embed it in the ECII pipeline. Using this cluster analysis pipeline with the CPU burst data of a parallel application, we provide the developer/analyst with a high-quality SPMD computation structure detection with the added value that reflects the fine grain of the computation regions.El análisis de conglomerados se usa ampliamente para estratificar datos en el mismo conglomerado cuando son similares según las métricas específicas. Nosotros puede usar el análisis de clúster para agrupar la ráfaga de CPU de una aplicación paralela y las regiones en cada proceso intermedio llamadas de comunicación o llamadas al tiempo de ejecución paralelo. Los clusters resultantes obtenidos son las diferentes tendencias computacionales o fases que aparecen en la solicitud. Estos clusters son útiles para entender el comportamiento de la parte de computación del aplicación y centrar los análisis en aquellos que presenten problemas de rendimiento. Aunque los algoritmos de agrupamiento basados en la densidad son una herramienta poderosa y eficiente para resumir este tipo de información, su La metodología tradicional de agrupación en clústeres guiada por el usuario tiene muchas deficiencias y deficiencias al tratar con la complejidad de los datos, la diversidad de estructuras de datos, la alta dimensionalidad de los datos y el aumento dramático en la cantidad de datos. En consecuencia, el La mayoría de los algoritmos similares a DBSCAN tienen debilidades para manejar datos de alta dimensionalidad y/o densidad múltiple, y son sensibles a su configuración de hiperparámetros. Además, extraer información de los clústeres obtenidos es una forma intuitiva y tarea manual Para mitigar estas debilidades, hemos propuesto un nuevo enfoque unificado para reemplazar el agrupamiento guiado por el usuario con un canalización de análisis de agrupamiento automatizado, llamada canalización de identificación e interpretación de clúster mejorada (ECII). para construir el tubería, proponemos técnicas novedosas que incluyen la selección robusta de características independientes, el mapa de curvatura del espacio de características, Análisis de componentes de la organización y ajuste de hiperparámetros para la selección de características, homogeneización de densidad, agrupación interpretación y selección de modelos, que son los componentes principales de nuestra canalización de aprendizaje automático. Esta tesis aporta cuatro nuevas técnicas al campo de Machine Learning con un caso de uso particular en el campo de Performance Analytics. La primera contribución es un enfoque novedoso no supervisado para la selección de características en datos ruidosos, llamado Robust Independent Feature. Selección (RIFS).Específicamente, elegimos un subconjunto de funciones que contiene la mayor parte de la información subyacente, utilizando el mismo criterios como el análisis de componentes independientes. Simultáneamente, el ruido se separa como un componente independiente. La segunda contribución de la tesis es un método de transformación multilineal paramétrica para homogeneizar densidades de clústeres mientras preservando la estructura topológica del conjunto de datos, llamado Mapa de Curvatura del Espacio de Características (FSCM). Presentamos un nuevo Gravitacional Mapa autoorganizado para modelar la curvatura del espacio característico conectando los conceptos de gravedad y estructura del espacio en el Algoritmo de mapa autoorganizado para describir matemáticamente la estructura de densidad de los datos. Para homogeneizar la densidad del racimo, introducimos un mecanismo de mapeo novedoso para proyectar los datos del espacio curvo no euclidiano a un nuevo plano euclidiano espacio. La tercera contribución es un nuevo método basado en topología para estudiar datos categorizados de alta dimensión potencialmente complejos mediante cuantificando sus formas y extrayendo información detallada de ellas para interpretar el resultado de la agrupación. presentamos nuestro Método de análisis de componentes de organización (OCA) para el estudio automático de forma arbitraria de conglomerados sin una suposición sobre el distribución de datos.Postprint (published version

    Use of Self Organized Maps for Feature Extraction of Hyperspectral Data

    Get PDF
    In this paper, the problem of analyzing hyperspectral data is presented. The complexity of multi-dimensional data leads to the need for computer assisted data compression and labeling of important features. A brief overview of Self-Organizing Maps and their variants is given and then two possible methods of data analysis are examined. These methods are incorporated into a program derived from som_toolbox2. In this program, ASD data (data collected by an Analytical Spectral Device sensor) is read into a variable, relevant bands for discrimination between classes are extracted, and several different methods of analyzing the results are employed. A GUI was developed for easy implementation of these three stages

    k-Means

    Get PDF

    Analysis of large-scale molecular biological data using self-organizing maps

    Get PDF
    Modern high-throughput technologies such as microarrays, next generation sequencing and mass spectrometry provide huge amounts of data per measurement and challenge traditional analyses. New strategies of data processing, visualization and functional analysis are inevitable. This thesis presents an approach which applies a machine learning technique known as self organizing maps (SOMs). SOMs enable the parallel sample- and feature-centered view of molecular phenotypes combined with strong visualization and second-level analysis capabilities. We developed a comprehensive analysis and visualization pipeline based on SOMs. The unsupervised SOM mapping projects the initially high number of features, such as gene expression profiles, to meta-feature clusters of similar and hence potentially co-regulated single features. This reduction of dimension is attained by the re-weighting of primary information and does not entail a loss of primary information in contrast to simple filtering approaches. The meta-data provided by the SOM algorithm is visualized in terms of intuitive mosaic portraits. Sample-specific and common properties shared between samples emerge as a handful of localized spots in the portraits collecting groups of co-regulated and co-expressed meta-features. This characteristic color patterns reflect the data landscape of each sample and promote immediate identification of (meta-)features of interest. It will be demonstrated that SOM portraits transform large and heterogeneous sets of molecular biological data into an atlas of sample-specific texture maps which can be directly compared in terms of similarities and dissimilarities. Spot-clusters of correlated meta-features can be extracted from the SOM portraits in a subsequent step of aggregation. This spot-clustering effectively enables reduction of the dimensionality of the data in two subsequent steps towards a handful of signature modules in an unsupervised fashion. Furthermore we demonstrate that analysis techniques provide enhanced resolution if applied to the meta-features. The improved discrimination power of meta-features in downstream analyses such as hierarchical clustering, independent component analysis or pairwise correlation analysis is ascribed to essentially two facts: Firstly, the set of meta-features better represents the diversity of patterns and modes inherent in the data and secondly, it also possesses the better signal-to-noise characteristics as a comparable collection of single features. Additionally to the pattern-driven feature selection in the SOM portraits, we apply statistical measures to detect significantly differential features between sample classes. Implementation of scoring measurements supplements the basal SOM algorithm. Further, two variants of functional enrichment analyses are introduced which link sample specific patterns of the meta-feature landscape with biological knowledge and support functional interpretation of the data based on the ‘guilt by association’ principle. Finally, case studies selected from different ‘OMIC’ realms are presented in this thesis. In particular, molecular phenotype data derived from expression microarrays (mRNA, miRNA), sequencing (DNA methylation, histone modification patterns) or mass spectrometry (proteome), and also genotype data (SNP-microarrays) is analyzed. It is shown that the SOM analysis pipeline implies strong application capabilities and covers a broad range of potential purposes ranging from time series and treatment-vs.-control experiments to discrimination of samples according to genotypic, phenotypic or taxonomic classifications

    Computer aided identification of biological specimens using self-organizing maps

    Get PDF
    For scientific or socio-economic reasons it is often necessary or desirable that biological material be identified. Given that there are an estimated 10 million living organisms on Earth, the identification of biological material can be problematic. Consequently the services of taxonomist specialists are often required. However, if such expertise is not readily available it is necessary to attempt an identification using an alternative method. Some of these alternative methods are unsatisfactory or can lead to a wrong identification. One of the most common problems encountered when identifying specimens is that important diagnostic features are often not easily observed, or may even be completely absent. A number of techniques can be used to try to overcome this problem, one of which, the Self Organizing Map (or SOM), is a particularly appealing technique because of its ability to handle missing data. This thesis explores the use of SOMs as a technique for the identification of indigenous trees of the Acacia species in KwaZulu-Natal, South Africa. The ability of the SOM technique to perform exploratory data analysis through data clustering is utilized and assessed, as is its usefulness for visualizing the results of the analysis of numerical, multivariate botanical data sets. The SOM’s ability to investigate, discover and interpret relationships within these data sets is examined, and the technique’s ability to identify tree species successfully is tested. These data sets are also tested using the C5 and CN2 classification techniques. Results from both these techniques are compared with the results obtained by using a SOM commercial package. These results indicate that the application of the SOM to the problem of biological identification could provide the start of the long-awaited breakthrough in computerized identification that biologists have eagerly been seeking.Dissertation (MSc)--University of Pretoria, 2011.Computer Scienceunrestricte
    corecore