3,190 research outputs found

    Recovering complete and draft population genomes from metagenome datasets.

    Get PDF
    Assembly of metagenomic sequence data into microbial genomes is of fundamental value to improving our understanding of microbial ecology and metabolism by elucidating the functional potential of hard-to-culture microorganisms. Here, we provide a synthesis of available methods to bin metagenomic contigs into species-level groups and highlight how genetic diversity, sequencing depth, and coverage influence binning success. Despite the computational cost on application to deeply sequenced complex metagenomes (e.g., soil), covarying patterns of contig coverage across multiple datasets significantly improves the binning process. We also discuss and compare current genome validation methods and reveal how these methods tackle the problem of chimeric genome bins i.e., sequences from multiple species. Finally, we explore how population genome assembly can be used to uncover biogeographic trends and to characterize the effect of in situ functional constraints on the genome-wide evolution

    Analysis of large-scale molecular biological data using self-organizing maps

    Get PDF
    Modern high-throughput technologies such as microarrays, next generation sequencing and mass spectrometry provide huge amounts of data per measurement and challenge traditional analyses. New strategies of data processing, visualization and functional analysis are inevitable. This thesis presents an approach which applies a machine learning technique known as self organizing maps (SOMs). SOMs enable the parallel sample- and feature-centered view of molecular phenotypes combined with strong visualization and second-level analysis capabilities. We developed a comprehensive analysis and visualization pipeline based on SOMs. The unsupervised SOM mapping projects the initially high number of features, such as gene expression profiles, to meta-feature clusters of similar and hence potentially co-regulated single features. This reduction of dimension is attained by the re-weighting of primary information and does not entail a loss of primary information in contrast to simple filtering approaches. The meta-data provided by the SOM algorithm is visualized in terms of intuitive mosaic portraits. Sample-specific and common properties shared between samples emerge as a handful of localized spots in the portraits collecting groups of co-regulated and co-expressed meta-features. This characteristic color patterns reflect the data landscape of each sample and promote immediate identification of (meta-)features of interest. It will be demonstrated that SOM portraits transform large and heterogeneous sets of molecular biological data into an atlas of sample-specific texture maps which can be directly compared in terms of similarities and dissimilarities. Spot-clusters of correlated meta-features can be extracted from the SOM portraits in a subsequent step of aggregation. This spot-clustering effectively enables reduction of the dimensionality of the data in two subsequent steps towards a handful of signature modules in an unsupervised fashion. Furthermore we demonstrate that analysis techniques provide enhanced resolution if applied to the meta-features. The improved discrimination power of meta-features in downstream analyses such as hierarchical clustering, independent component analysis or pairwise correlation analysis is ascribed to essentially two facts: Firstly, the set of meta-features better represents the diversity of patterns and modes inherent in the data and secondly, it also possesses the better signal-to-noise characteristics as a comparable collection of single features. Additionally to the pattern-driven feature selection in the SOM portraits, we apply statistical measures to detect significantly differential features between sample classes. Implementation of scoring measurements supplements the basal SOM algorithm. Further, two variants of functional enrichment analyses are introduced which link sample specific patterns of the meta-feature landscape with biological knowledge and support functional interpretation of the data based on the ‘guilt by association’ principle. Finally, case studies selected from different ‘OMIC’ realms are presented in this thesis. In particular, molecular phenotype data derived from expression microarrays (mRNA, miRNA), sequencing (DNA methylation, histone modification patterns) or mass spectrometry (proteome), and also genotype data (SNP-microarrays) is analyzed. It is shown that the SOM analysis pipeline implies strong application capabilities and covers a broad range of potential purposes ranging from time series and treatment-vs.-control experiments to discrimination of samples according to genotypic, phenotypic or taxonomic classifications

    Trustworthiness and metrics in visualizing similarity of gene expression

    Get PDF
    BACKGROUND: Conventionally, the first step in analyzing the large and high-dimensional data sets measured by microarrays is visual exploration. Dendrograms of hierarchical clustering, self-organizing maps (SOMs), and multidimensional scaling have been used to visualize similarity relationships of data samples. We address two central properties of the methods: (i) Are the visualizations trustworthy, i.e., if two samples are visualized to be similar, are they really similar? (ii) The metric. The measure of similarity determines the result; we propose using a new learning metrics principle to derive a metric from interrelationships among data sets. RESULTS: The trustworthiness of hierarchical clustering, multidimensional scaling, and the self-organizing map were compared in visualizing similarity relationships among gene expression profiles. The self-organizing map was the best except that hierarchical clustering was the most trustworthy for the most similar profiles. Trustworthiness can be further increased by treating separately those genes for which the visualization is least trustworthy. We then proceed to improve the metric. The distance measure between the expression profiles is adjusted to measure differences relevant to functional classes of the genes. The genes for which the new metric is the most different from the usual correlation metric are listed and visualized with one of the visualization methods, the self-organizing map, computed in the new metric. CONCLUSIONS: The conjecture from the methodological results is that the self-organizing map can be recommended to complement the usual hierarchical clustering for visualizing and exploring gene expression data. Discarding the least trustworthy samples and improving the metric still improves it

    Revealing Microbial Responses to Environmental Dynamics: Developing Methods for Analysis and Visualization of Complex Sequence Datasets.

    Get PDF
    abstract: The greatest barrier to understanding how life interacts with its environment is the complexity in which biology operates. In this work, I present experimental designs, analysis methods, and visualization techniques to overcome the challenges of deciphering complex biological datasets. First, I examine an iron limitation transcriptome of Synechocystis sp. PCC 6803 using a new methodology. Until now, iron limitation in experiments of Synechocystis sp. PCC 6803 gene expression has been achieved through media chelation. Notably, chelation also reduces the bioavailability of other metals, whereas naturally occurring low iron settings likely result from a lack of iron influx and not as a result of chelation. The overall metabolic trends of previous studies are well-characterized but within those trends is significant variability in single gene expression responses. I compare previous transcriptomics analyses with our protocol that limits the addition of bioavailable iron to growth media to identify consistent gene expression signals resulting from iron limitation. Second, I describe a novel method of improving the reliability of centroid-linkage clustering results. The size and complexity of modern sequencing datasets often prohibit constructing distance matrices, which prevents the use of many common clustering algorithms. Centroid-linkage circumvents the need for a distance matrix, but has the adverse effect of producing input-order dependent results. In this chapter, I describe a method of cluster edge counting across iterated centroid-linkage results and reconstructing aggregate clusters from a ranked edge list without a distance matrix and input-order dependence. Finally, I introduce dendritic heat maps, a new figure type that visualizes heat map responses through expanding and contracting sequence clustering specificities. Heat maps are useful for comparing data across a range of possible states. However, data binning is sensitive to clustering cutoffs which are often arbitrarily introduced by researchers and can substantially change the heat map response of any single data point. With an understanding of how the architectural elements of dendrograms and heat maps affect data visualization, I have integrated their salient features to create a figure type aimed at viewing multiple levels of clustering cutoffs, allowing researchers to better understand the effects of environment on metabolism or phylogenetic lineages.Dissertation/ThesisChapter 2 Excel file of transcriptome responsesChapter 2 Perl scriptsChapter 3 Cluster Aggregation Perl scriptChapter 4 Example of the top-down clustering method used to construct dendritic heat mapsChapter 4Perl scripts and dendritic heat map imagesChapter 4 Perl scripts and dendritic heat map imagesDoctoral Dissertation Geological Sciences 201

    The non-coding RNA landscape of human hematopoiesis and leukemia

    Get PDF
    © The Author(s) 2017. Non-coding RNAs have emerged as crucial regulators of gene expression and cell fate decisions. However, their expression patterns and regulatory functions during normal and malignant human hematopoiesis are incompletely understood. Here we present a comprehensive resource defining the non-coding RNA landscape of the human hematopoietic system. Based on highly specific non-coding RNA expression portraits per blood cell population, we identify unique fingerprint non-coding RNAs-such as LINC00173 in granulocytes-and assign these to critical regulatory circuits involved in blood homeostasis. Following the incorporation of acute myeloid leukemia samples into the landscape, we further uncover prognostically relevant non-coding RNA stem cell signatures shared between acute myeloid leukemia blasts and healthy hematopoietic stem cells. Our findings highlight the importance of the non-coding transcriptome in the formation and maintenance of the human blood hierarchy

    Analysis of gene expression data using Expressionist 3.1 and GeneSpring 4.2

    Get PDF
    The purpose of this study was to determine the differences in the gene expression analysis methods of two data mining tools, ExpressionisticTM 3.1 and GeneSpringTM 4.2 with focus on basic statistical analysis and clustering algorithms. The data for this analysis was derived from the hybridization of Rattus norvegicus RNA to the Affymetrix RG34A GeneChip. This analysis was derived from experiments designed to identify changes in gene expression patterns that were induced in vivo by an experimental treatment. The tools were found to be comparable with respect to the list of statistically significant genes that were up-regulated by more than two fold. Approximately 78% of this gene list was present in both tools. ExpressionistTm 3.1 was capable of representing the different linkage methods of hierarchical clustering as average, complete and single, whereas in GeneSpringTM 4.2, the user could manipulate the separation ratio and minimum distance of the hierarchical tree
    • …
    corecore