
    Development of Biclustering Techniques for Gene Expression Data Modeling and Mining

    Next-generation sequencing technologies can generate large-scale biological data with higher resolution, better accuracy, and lower technical variation than their array-based counterparts. RNA sequencing (RNA-Seq) can generate genome-scale gene expression data in biological samples at a given moment, facilitating a better understanding of cell functions at genetic and cellular levels. The abundance of gene expression datasets provides an opportunity to identify genes with similar expression patterns across multiple conditions, i.e., co-expression gene modules (CEMs). Genome-scale identification of CEMs can be modeled and solved by biclustering, a two-dimensional data mining technique that clusters the rows and columns of a gene expression matrix simultaneously. Compared with traditional clustering, which targets global patterns, biclustering can predict local patterns. This feature makes biclustering very useful when applied to big gene expression data, since genes that participate in a cellular process are active only in specific conditions and are thus usually co-expressed under a subset of all conditions. The combination of biclustering and large-scale gene expression data holds promising potential for condition-specific functional pathway/network analysis. However, existing biclustering tools do not perform satisfactorily on high-resolution RNA-Seq data, mainly because they lack (i) consideration of the high sparsity of RNA-Seq data, especially scRNA-Seq data, and (ii) an understanding of the transcriptional regulation signals underlying the observed gene expression values. QUBIC2, a novel biclustering algorithm, is designed for the analysis of large-scale bulk RNA-Seq and single-cell RNA-Seq (scRNA-Seq) data. 
Critical novelties of the algorithm include (i) a truncated model to handle the unreliable quantification of genes with low or moderate expression; (ii) a Gaussian mixture distribution and an information-divergency objective function to capture shared transcriptional regulation signals among a set of genes; (iii) a Dual strategy to expand the core biclusters, aiming to save dropouts from the background; and (iv) a statistical framework to evaluate the significance of all identified biclusters. Method validation on comprehensive data sets suggests that QUBIC2 had superior performance in functional module detection and cell type classification. Applications to temporal and spatial data demonstrated that QUBIC2 could derive meaningful biological information from scRNA-Seq data. Also presented in this dissertation is QUBICR, an R package characterized by an average 82% improvement in efficiency over the C source code of QUBIC. It provides a comprehensive set of functions to facilitate biclustering-based biological studies, including discretization of expression data, query-based biclustering, bicluster expansion, bicluster comparison, heatmap visualization of any identified biclusters, and co-expression network elucidation. Finally, a systematic summary is provided of the primary applications of biclustering to biological data and of more advanced applications to biomedical data. It will help researchers analyze their big data effectively and generate valuable biological knowledge and novel insights with higher efficiency.
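
The contrast between local and global patterns drawn in the abstract above can be illustrated with a small sketch. It does not reproduce QUBIC2 or its objective function; it assumes a different, classic biclustering score, the mean squared residue of Cheng and Church, and a made-up toy matrix. The function name and the data are hypothetical.

```python
from statistics import mean


def mean_squared_residue(matrix, rows, cols):
    """Cheng-Church mean squared residue of the submatrix rows x cols.

    A score of 0 means the selected genes follow a perfectly additive
    shared pattern across the selected conditions.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    row_means = [mean(r) for r in sub]
    col_means = [mean(c) for c in zip(*sub)]
    overall = mean(row_means)  # equals the submatrix mean (rows have equal length)
    total = 0.0
    for i, row in enumerate(sub):
        for j, value in enumerate(row):
            residue = value - row_means[i] - col_means[j] + overall
            total += residue ** 2
    return total / (len(rows) * len(cols))


# Hypothetical toy data: genes 0-2 share a pattern under conditions 0-2 only.
expression = [
    [1.0, 2.0, 3.0, 9.0, 0.5],
    [2.0, 3.0, 4.0, 0.1, 7.0],
    [3.0, 4.0, 5.0, 4.0, 2.0],
    [8.0, 0.2, 6.0, 1.0, 3.0],
]

local_score = mean_squared_residue(expression, [0, 1, 2], [0, 1, 2])
global_score = mean_squared_residue(expression, [0, 1, 2, 3], [0, 1, 2, 3, 4])
print(local_score, global_score)
```

In this toy case the planted submatrix is far more coherent than the full matrix, which is the kind of condition-specific signal that clustering over all conditions would tend to miss.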

    Novel microbial guilds implicated in N2O reduction

    N2O is a long-recognized greenhouse gas (GHG) that contributes to global warming and ozone depletion. Terrestrial ecosystems are a major source of N2O due to imbalanced N2O production and consumption. Soil pH is a chief modulating factor controlling net N2O emissions, and N2O consumption has been considered negligible under acidic conditions (pH < 6). In this dissertation, we obtained solids-free cultures reducing N2O at pH 4.5. Furthermore, a co-culture (designated culture EV) comprising two interacting bacterial populations was acquired via consecutive transfers in mineral salt medium. Integrated phenotypic, metagenomic and metabolomic analysis indicated that the Serratia population excreted certain amino acids to support the growth of the Desulfosporosinus population. Characterization of co-culture EV demonstrated that organisms synthesizing functional NosZ exist in acidic soils. To further close the knowledge gap on low pH N2O reduction, we conducted enrichment experiments on two contrasting soils, one natural and one agricultural. With varying combinations of carbon sources and H2, N2O reduction activity was observed in a total of six solids-free cultures at pH 4.5. Comparative growth experiments documented that N2O was essential for the growth of the N2O-reducing organisms. Reconstruction of the communities suggested that the cultures were highly enriched following consecutive transfers. Surprisingly, the N2O-reducing organisms recovered from the cultures were all identified as belonging to the clade II lineage or to a novel clade lineage. Together, these results suggest that clade II and the putative novel clade N2O-reducing organisms are closely associated with low pH N2O reduction. The low pH N2O reduction observed in this research was inconsistent with field observations, where abiotic-biotic interactions occur frequently. 
To simulate potential abiotic-biotic interactions in low pH N2O reduction, we tested the impact of geochemical factors on N2O reduction by culture EV. N2O reduction by co-culture EV was unexpectedly impeded in the presence of trace amounts of nitrogen oxyanions (as low as 0.01 mM for nitrate and 0.001 mM for nitrite), which may explain the weak N2O reduction activity under field conditions. These observations indicate that N2O reduction in acidic soils may be limited by interacting abiotic factors rather than by pH itself.

    Modular architecture in biological networks

    Thesis (Ph. D.)--Harvard-MIT Division of Health Sciences and Technology, 2007. Includes bibliographical references (p. 201-207). In the past decade, biology has been revolutionized by an explosion in the availability of data. Translating this new wealth of information into meaningful biological insights and clinical breakthroughs will require a complete overhaul both in the questions being asked, and the methodologies used to answer them. One of the largest challenges in organizing and understanding the data coming from genome sequencing, microarray experiments, and other high-throughput measurements, will be the ability to find large-scale structure in biological systems. Ideally, this would lead to a simplified representation, wherein the thousands of genes in an organism can be viewed as a much smaller number of dynamic modules working in concert to accomplish cellular functions. Toward demonstrating the importance of higher-level, modular structure in biological systems, we have performed the following analyses: 1. Using computational techniques and pre-existing protein-protein interaction (PPI) data, we have developed general tools to find and validate modular structure. We have applied these approaches to the PPI networks of yeast, fly, worm, and human. 2. Utilizing a modular scaffold, we have generated predictions that attempt to explain existing system-wide experiments as well as predict the function of otherwise uncharacterized proteins. 3. Following the example of comparative genomics, we have aligned biological networks at the modular level to elucidate principles of how modules evolve. We show that conserved modular structure can further aid in functional annotation across the proteome. In addition to the detection and use of modular structure for computational analyses, experimental techniques must be adapted to support top-down strategies, and the targeting of entire modules with combinations of small-molecules. 
With this in mind, we have designed experimental strategies to find sets of small-molecules capable of perturbing functional modules through a variety of distinct, but related, mechanisms. As a first test, we have looked for classes of small-molecules targeting growth signaling through the phosphatidyl-inositol-3-kinase (PI3K) pathway. This provides a platform for developing new screening techniques in the setting of biology relevant to diabetes and cancer. In combination, these investigations provide an extensible computational approach to finding and utilizing modular structure in biological networks, and experimental approaches to bring them toward clinical endpoints. By Gopal Ramachandran. Ph.D.

    Graph-based modeling and evolutionary analysis of microbial metabolism

    Microbial organisms are responsible for most of the metabolic innovations on Earth. Understanding microbial metabolism helps shed light on questions that are central to biology, biomedicine, energy and the environment. Graph-based modeling is a powerful tool that has been used extensively for elucidating the organising principles of microbial metabolism and the underlying evolutionary forces that act upon it. Nevertheless, various graph-theoretic representations and techniques have been applied to metabolic networks, rendering the modeling aspect ad hoc and leading to conflicting conclusions that depend on the representation chosen. The contribution of this dissertation is two-fold. In the first half, I revisit the modeling aspect of metabolic networks, and present novel techniques for their representation and analysis. In particular, I explore the limitations of standard graph representations, and the utility of a more appropriate model, hypergraphs, for capturing metabolic network properties. Further, I address the task of metabolic pathway inference and the necessity to account for chemical symmetries and alternative tracings in this crucial task. In the second part of the dissertation, I focus on two evolutionary questions. First, I investigate the evolutionary underpinnings of the formation of communities in metabolic networks, a phenomenon that has been reported in the literature and implicated in an organism's adaptation to its environment. I find that the metabolome size better explains the observed community structures. Second, I correlate evolution at the genome level with emergent properties at the metabolic network level. In particular, I quantify the various evolutionary events (e.g., gene duplication, loss, transfer, fusion, and fission) in a group of proteobacteria, and analyze their role in shaping the metabolic networks and determining the organismal fitness. 
As metabolism gains an increasingly prominent role in biomedical, energy, and environmental research, understanding how to model this process and how it came about during evolution becomes more crucial. My dissertation provides important insights in both directions.
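
One limitation of standard graph representations mentioned in the abstract above can be made concrete with a short sketch. It is not taken from the dissertation. It assumes a minimal hypergraph encoding in which each reaction is a hyperedge from a set of substrates to a set of products, and the metabolite names A to D and both function names are hypothetical.

```python
def producible(seeds, reactions):
    """Hypergraph view: a reaction fires only when ALL of its substrates are available."""
    available = set(seeds)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions:
            if substrates <= available and not products <= available:
                available |= products
                changed = True
    return available


def graph_reachable(seeds, reactions):
    """Ordinary-graph view: one directed edge per (substrate, product) pair."""
    edges = {(s, p) for substrates, products in reactions
             for s in substrates for p in products}
    reached = set(seeds)
    frontier = list(seeds)
    while frontier:
        node = frontier.pop()
        for source, target in edges:
            if source == node and target not in reached:
                reached.add(target)
                frontier.append(target)
    return reached


# Hypothetical two-step pathway.
reactions = [
    (frozenset({"A", "B"}), frozenset({"C"})),  # A + B -> C
    (frozenset({"C"}), frozenset({"D"})),       # C -> D
]

hyper_from_a = producible({"A"}, reactions)
graph_from_a = graph_reachable({"A"}, reactions)
hyper_from_ab = producible({"A", "B"}, reactions)
print(sorted(hyper_from_a), sorted(graph_from_a), sorted(hyper_from_ab))
```

Starting from A alone, the pairwise graph reports C and D as reachable even though the first reaction also needs B, while the hyperedge encoding keeps that joint requirement.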

    Vertical integration, global and modular analysis of molecular interaction networks of Escherichia coli

    Phenotypical characteristics of cells often arise from interactions between genes, proteins and metabolites. For a complete understanding of cellular processes and their regulation, it is necessary to vertically integrate the molecular networks into an interactome and understand its global structure. In this thesis, an integrated molecular network (IMN) of Escherichia coli was reconstructed which comprises metabolic reactions, metabolite-protein interactions (MPI) and transcriptional regulation data. Three fundamental aspects of cellular processes were studied: (i) feedback regulation of gene expression, (ii) network motifs and (iii) global organization. Intriguingly, this work found that feedback regulation of gene expression in E. coli is mediated by MPIs, and 69 such feedback loops (FBLs) were identified. Motif studies identified the FBL as a significant pattern and detected 12 other three-node motifs comprising five composite motifs. Connectivity analysis discovered the existence of a bow-tie architecture, and motif analysis in the bow-tie components revealed that 77% of them interconnect to form the giant strong component, which is the backbone of the bow-tie. Further in this work, cluster and modular analyses were performed on the integrated molecular network of E. coli constructed from a diverse collection of datasets involving metabolic reactions, metabolite-protein interactions and transcriptional regulation. Modularity was used as the parameter of an appropriate, fast and robust method for clustering such a heterogeneous molecular circuitry of interactions. This work revealed that clustering this complex network significantly grouped together genes of known similar function in well-defined, physiologically related modules. Identification of network motifs and correlating them with the modules of highly connected nodes may define their potential functional role. 
To this end, twelve highly significant three-node network motifs, among which four are composite network motifs comprising multiple types of interactions, were detected and analyzed. Distribution analysis of these motifs within and between the various functional modules supported the view that these motifs represent basic patterns of regulation and organization of genes into modules. This thesis illustrates the potential of data integration of molecular networks to detect the feedback interactions in regulatory networks, and of its global analysis for better understanding cellular processes and their regulation. Moreover, this work also presents a basic framework for detecting functional modules and their interaction with various motifs in an integrated E. coli system.
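
The abstract above names modularity as the clustering parameter without giving the exact formulation. As a minimal sketch, assuming the standard Newman-Girvan modularity for an undirected, unweighted graph (which may differ from what the thesis used on its heterogeneous network), the score of a candidate partition can be computed as follows. The node labels and the function name are hypothetical.

```python
def modularity(edges, communities):
    """Newman-Girvan modularity Q = sum_c [ L_c / m - (d_c / (2m))^2 ].

    L_c is the number of edges inside community c, d_c the summed degree
    of its nodes, and m the total number of edges.
    """
    m = len(edges)
    community_of = {node: idx
                    for idx, members in enumerate(communities)
                    for node in members}
    internal = [0] * len(communities)
    degree_sum = [0] * len(communities)
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in range(len(communities)))


# Hypothetical toy network: two triangles joined by a single bridge edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

q_two_modules = modularity(edges, [{"a", "b", "c"}, {"d", "e", "f"}])
q_one_module = modularity(edges, [{"a", "b", "c", "d", "e", "f"}])
print(q_two_modules, q_one_module)
```

Splitting the toy network at the bridge scores 5/14, while lumping every node into one module scores 0, so a modularity-maximizing method would prefer the two-module partition.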

    Data integration for the analysis of uncharacterized proteins in Mycobacterium tuberculosis

    Includes abstract. Includes bibliographical references (leaves 126-150). Mycobacterium tuberculosis is a bacterial pathogen that causes tuberculosis, a leading cause of human death worldwide from infectious diseases, especially in Africa. Despite enormous advances achieved in recent years in controlling the disease, tuberculosis remains a public health challenge. The contribution of existing drugs is of immense value, but the deadly synergy of the disease with Human Immunodeficiency Virus (HIV) or Acquired Immunodeficiency Syndrome (AIDS) and the emergence of drug resistant strains are threatening to compromise gains in tuberculosis control. In fact, the development of active tuberculosis is the outcome of the delicate balance between bacterial virulence and host resistance, which constitute two distinct and independent components. Significant progress has been made in understanding the evolution of the bacterial pathogen and its interaction with the host. The end point of these efforts is the identification of virulence factors and drug targets within the bacterium in order to develop new drugs and vaccines for the eradication of the disease.

    New insights on black rot of crucifers: disclosing novel virulence genes by in vivo host/pathogen transcriptomics and functional genetics

    Doctoral thesis, Biology (Microbiology), Universidade de Lisboa, Faculdade de Ciências, 2018. Xanthomonas campestris belongs to the gamma subdivision of Proteobacteria and is the type species of a genus comprising 28 species of plant pathogenic bacteria, affecting 124 species of monocotyledonous and 268 species of dicotyledonous plants. Identified for the first time in 1895 in the United States of America, this species has undergone several taxonomic reclassifications, and currently comprises pathovars X. campestris pv. campestris (Xcc), X. campestris pv. raphani (Xcr) and X. campestris pv. incanae (Xci), causing distinct diseases in vegetables, ornamentals and spontaneous plants belonging to the Brassicaceae family, as well as some vegetable crops belonging to the Solanaceae family. Black rot disease, caused by Xcc, is the most important bacterial disease of Brassicaceae, affecting crops and weeds worldwide. After penetration through hydathodes on leaf margins, bacterial multiplication causes typical V-shaped lesions on leaf margins, followed by darkening of the veins that evolves to necrosis of the affected tissue, ultimately leading to plant death. Short-distance dispersal through plant-to-plant contact, human manipulation, wind, insects, aerosols and irrigation water or rain, associated with long-distance dispersal through infected seeds and plantlets following commercial routes across the globe, is responsible for the worldwide distribution of this disease. Studies on the interaction of Xcc with B. oleracea have identified several virulence genes; however, no resistance genes have been successfully cloned in Brassicaceae crops, and there is still a lack of effective control measures for black rot disease, which continues to cause severe economic losses worldwide. 
Despite the growing body of work on this subject, molecular mechanisms of host-pathogen interaction have been mostly inferred using in vitro approaches, resulting in a knowledge gap concerning the in vivo behavior of this pathogen during its interaction with plant hosts. In Portugal, an important centre of Brassicaceae domestication, Xcc has long been identified and Portuguese Xcc strains have already been described as a unique sub-population of this pathogen. In this context, with the goal of bringing new insights into the molecular mechanisms of host/pathogen interaction, a set of 33 X. campestris strains collected in the country was characterized in terms of pathogenicity, virulence, population structure and phylogenetic diversity. Furthermore, the in planta transcriptomes of the Xcc strains representing the extremes of the virulence spectrum during the infection process on two selected hosts were profiled, for a total of four pathosystems. A high level of phenotypic diversity, supported by phylogenetic data, was found among X. campestris isolates, allowing the identification of the closely related pathovars Xcr and Xci for the first time in Portugal. Moreover, among Xcc isolates, the presence of races 4, 6 and 7 was recorded, and two novel races of this pathovar, race 10 and race 11, were also described. In contrast, the partial virulence profiles determined by the presence of known virulence genes were highly conserved among the set of X. campestris strains. The integration of Portuguese strains in a global dataset comprising a total of 75 X. campestris strains provided a snapshot of the worldwide X. campestris phylogenetic diversity and population structure, correlating the existing pathovars with three distinct genetic lineages. The identification of an intermediate link between Xcc and Xcr suggests that these pathovars are more closely related to each other than to Xci. 
This gradient of genetic relatedness seems to be associated with the host range of each pathovar. While Xci appears to be pathogenic only on ornamental Brassicaceae, Xcc and Xcr have a partially overlapping host range: Xcc affects mostly Brassicaceous hosts, whereas Xcr, while affecting the same hosts, is also pathogenic on Solanaceous hosts. These findings suggest that instead of causing a host shift, the genetic divergence between Xcc and Xcr conferred on the latter strains the ability to explore additional hosts, resulting in a broader host range. Although the population displayed a major clonal structure, the presence of recombinational events that may have driven the ecological specialization of X. campestris and distinct host ranges was highlighted. Portuguese X. campestris strains provided a significant input of genetic diversity, confirming this region as an important diversification reservoir, most likely taking place through host-pathogen co-evolution. Virulence assessment of Portuguese Xcc strains, based on the percentage of infected leaf area after inoculation, highlighted that virulence was not homogeneous within races and that higher pathogenicity was not necessarily correlated with higher virulence. It was then possible to select the extremes of the virulence spectrum: CPBF213, the lowest virulence strain (L-vir), and CPBF278, the highest virulence strain (H-vir). The in vivo transcriptome profiling of those contrastingly virulent strains infecting two cultivars of B. oleracea, using RNA-Seq, established that Xcc undergoes transcriptional reprogramming in a host-independent manner, suggesting that virulence is an intrinsic feature of the pathogen. A total of 154 differentially expressed genes (DEGs) were identified between the two strains. 
The most represented functional category of DEGs was, as expected, ‘pathogenicity and adaptation’, representing 16% of the DEGs, although the complex adaptation of pathogen cells to the host environment was highlighted by the presence of DEGs from various functional categories. Among DEGs, Type III effector coding genes xopE2 and xopD were induced in the L-vir strain, while xopAC, xopX and xopR were induced in the H-vir strain. In addition to Type III effectors, genes encoding proteins involved in signal transduction systems, transport, detoxification mechanisms and other virulence-associated processes were found to be differentially expressed between the two strains, highlighting their role in virulence regulation. Overall, low virulence appears to be the combined result of impaired sensory mechanisms, reduced detoxification of reactive oxygen species, decreased motility and higher production of pathogen-associated molecular patterns (PAMPs), accompanied by an overexpression of avirulence proteins and a repression of virulence proteins targeting the hosts’ PAMP-triggered immune responses. In contrast, the highly virulent strain proved to be better equipped to escape initial plant defenses, whether by avoiding detection or by counteracting those responses. Moreover, upon detection, highly virulent pathogens show a decreased expression of avirulence proteins, making them less recognizable by the hosts’ effector-triggered immunity mechanisms, and thus able to continue multiplying and cause more severe disease symptoms. Through the study of differential infections in planta, the highly innovative strategy used in this work contributed to the disclosure of novel virulence-related genes in the X. campestris pv. campestris - B. oleracea pathosystem, which will be crucial to further detail the virulence regulation network and to develop new tools for the control of black rot disease.

    Frequent Pattern Finding in Integrated Biological Networks

    Biomedical research is undergoing a revolution with the advance of high-throughput technologies. A major challenge in the post-genomic era is to understand how genes, proteins and small molecules are organized into signaling pathways and regulatory networks. To simplify the analysis of large, complex molecular networks, strategies are sought to break them down into small yet relatively independent network modules, e.g. pathways and protein complexes. Motivated by the search for the evolutionary origins of network modules, a novel strategy has been developed to uncover duplicated pathways and protein complexes. This search was first formulated as a computational problem of finding frequent patterns in integrated graphs. The whole framework was then implemented as the software package BLUNT, which includes a parallelized version. To evaluate the biological significance of the work, several large datasets were chosen, with each dataset targeting a different biological question. An application of BLUNT to the yeast protein-protein interaction network is described. A large number of frequent patterns were discovered and predicted to be duplicated pathways. To explore how these pathways may have diverged since duplication, the differential regulation of duplicated pathways was studied at the transcriptional level, both in terms of time and location. As demonstrated, this algorithm can be used as a new data mining tool for large-scale biological data in general. It also provides a novel strategy to study the evolution of pathways and protein complexes in a systematic way. Understanding how pathways and protein complexes evolve will greatly benefit the fundamentals of biomedical research.