178 research outputs found

    Molecular Assemblies, Genes and Genomics Integrated Efficiently (MAGGIE)

    Final report on MAGGIE. We set ambitious goals to model the functions of individual organisms and their community from molecular to systems scale. These scientific goals are driving the development of sophisticated algorithms to analyze large amounts of experimental measurements made using high throughput technologies to explain and predict how the environment influences biological function at multiple scales and how the microbial systems in turn modify the environment. By experimentally evaluating predictions made using these models we will test the degree to which our quantitative multiscale understanding will help to rationally steer individual microbes and their communities towards specific tasks. To this end we have made substantial progress towards understanding evolution of gene families, transcriptional structures, detailed structures of keystone molecular assemblies (proteins and complexes), protein interactions, biological networks, microbial interactions, and community structure. Using comparative analysis we have tracked the evolutionary history of gene functions to understand how novel functions evolve. One level up, we have used proteomics data, high-resolution genome tiling microarrays, and 5' RNA sequencing to revise genome annotations, discover new genes including ncRNAs, and map dynamically changing operon structures of five model organisms: Desulfovibrio vulgaris Hildenborough, Pyrococcus furiosus, Sulfolobus solfataricus, Methanococcus maripaludis and Halobacterium salinarum NRC-1. We have developed machine learning algorithms to accurately identify protein interactions at a near-zero false positive rate from noisy data generated using tagless complex purification, TAP purification, and analysis of membrane complexes.
Combining other genome-scale datasets produced by ENIGMA (in particular, microarray data) and available from the literature, we have been able to achieve a true positive rate as high as 65% at almost zero false positives when applied to the manually curated training set. Applying this method to the data representing around a quarter of the fraction space for water soluble proteins in D. vulgaris, we obtained 854 reliable pairwise interactions. Further, we have developed algorithms to analyze and assign significance to protein interaction data from bait pull-down experiments and integrate these data with other systems biology data through associative biclustering in a parallel computing environment. We will 'fill-in' missing information in these interaction data using a 'Transitive Closure' algorithm and subsequently use 'Between Commonality Decomposition' algorithm to discover complexes within these large graphs of protein interactions. To characterize the metabolic activities of proteins and their complexes we are developing algorithms to deconvolute pure mass spectra, estimate chemical formula for m/z values, and fit isotopic fine structure to metabolomics data. We have discovered that, in comparison to isotopic pattern fitting methods, restricting the chemical formula by these two dimensions actually facilitates unique solutions for chemical formula generators. To understand how microbial functions are regulated we have developed complementary algorithms for reconstructing gene regulatory networks (GRNs). Whereas the network inference algorithms cMonkey and Inferelator enable de novo reconstruction of predictive models for GRNs from diverse systems biology data, the RegPrecise and RegPredict framework uses evolutionary comparisons of genomes from closely related organisms to reconstruct conserved regulons. We have integrated the two complementary algorithms to rapidly generate comprehensive models for gene regulation of understudied organisms.
Our preliminary analyses of these reconstructed GRNs have revealed novel regulatory mechanisms and cis-regulatory motifs, as well as others that are conserved across species. Finally, we are supporting scientific efforts in ENIGMA with data management solutions and by integrating all of the algorithms, software and data into a Knowledgebase. For instance, we have developed the RegPrecise database (http://regprecise.lbl.gov), which represents manually curated sets of regulons, laying the basis for automatic annotation of regulatory interactions in closely related species. We are also in the midst of scaling up MicrobesOnline to handle the growing volume of sequence and functional genomics data. Over the last year our efforts have been focused on providing support for additional genomic and functional genomic data types. Similarly, we have developed several visualization tools to help with the exploration of complex systems biology datasets. A case in point is the Gaggle Genome Browser (GGB), which was enhanced with visualizations for plotting peptide detections and protein-DNA binding alongside transcriptome structure, plus the ability to interactively filter by signal intensity or p-value.
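The report names a 'Transitive Closure' step for filling in missing interactions but does not describe how it is implemented. The following is only a minimal sketch of the general idea, assuming interactions are undirected and unweighted; the function name `transitive_closure` and the protein labels are hypothetical and do not come from the MAGGIE software.

```python
from itertools import combinations


def transitive_closure(pairs):
    """Return every pair of proteins linked by a path of observed interactions."""
    # Build an undirected adjacency map from the observed pairs.
    neighbors = {}
    for a, b in pairs:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    closed = set()
    seen = set()
    for start in neighbors:
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        # In an undirected graph the closure links every pair in a component.
        closed.update(frozenset(p) for p in combinations(sorted(component), 2))
    return closed


# Hypothetical toy data: A-B and B-C were observed, so A-C is inferred.
observed = [("A", "B"), ("B", "C"), ("D", "E")]
closure = transitive_closure(observed)
inferred = closure - {frozenset(p) for p in observed}
```

A plain closure like this links every protein in a connected component, so a practical pipeline would probably need to restrict or score the inferred pairs before searching for complexes.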

    Integrated biclustering of heterogeneous genome-wide datasets for the inference of global regulatory networks

    BACKGROUND: The learning of global genetic regulatory networks from expression data is a severely under-constrained problem that is aided by reducing the dimensionality of the search space by means of clustering genes into putatively co-regulated groups, as opposed to those that are simply co-expressed. Because genes may be co-regulated only across a subset of all observed experimental conditions, biclustering (clustering of genes and conditions) is more appropriate than standard clustering. Co-regulated genes are also often functionally (physically, spatially, genetically, and/or evolutionarily) associated, and such a priori known or pre-computed associations can provide support for appropriately grouping genes. One important association is the presence of one or more common cis-regulatory motifs. In organisms where these motifs are not known, their de novo detection, integrated into the clustering algorithm, can help to guide the process towards more biologically parsimonious solutions. RESULTS: We have developed an algorithm, cMonkey, that detects putative co-regulated gene groupings by integrating the biclustering of gene expression data and various functional associations with the de novo detection of sequence motifs. CONCLUSION: We have applied this procedure to the archaeon Halobacterium NRC-1, as part of our efforts to decipher its regulatory network. In addition, we used cMonkey on public data for three organisms in the other two domains of life: Helicobacter pylori, Saccharomyces cerevisiae, and Escherichia coli. The biclusters detected by cMonkey both recapitulated known biology and enabled novel predictions (some for Halobacterium were subsequently confirmed in the laboratory). For example, it identified the bacteriorhodopsin regulon, assigned additional genes to this regulon with apparently unrelated function, and detected its known promoter motif.
We have performed a thorough comparison of cMonkey results against other clustering methods, and find that cMonkey biclusters are more parsimonious with all available evidence for co-regulation.

    The Gaggle: An open-source software system for integrating bioinformatics software and data sources

    BACKGROUND: Systems biologists work with many kinds of data, from many different sources, using a variety of software tools. Each of these tools typically excels at one type of analysis, such as the analysis of microarrays, metabolic networks, or predicted protein structure. A crucial challenge is to combine the capabilities of these (and other forthcoming) data resources and tools to create a data exploration and analysis environment that does justice to the variety and complexity of systems biology data sets. A solution to this problem should recognize that data types, formats and software in this high throughput age of biology are constantly changing. RESULTS: In this paper we describe the Gaggle, a simple, open-source Java software environment that helps to solve the problem of software and database integration. Guided by the classic software engineering strategy of separation of concerns and a policy of semantic flexibility, it integrates existing popular programs and web resources into a user-friendly, easily-extended environment. We demonstrate that four simple data types (names, matrices, networks, and associative arrays) are sufficient to bring together diverse databases and software. We highlight some capabilities of the Gaggle with an exploration of Helicobacter pylori pathogenesis genes, in which we identify a putative ricin-like protein, a discovery made possible by simultaneous data exploration using a wide range of publicly available data and a variety of popular bioinformatics software tools. CONCLUSION: We have integrated diverse databases (for example, KEGG, BioCyc, String) and software (Cytoscape, DataMatrixViewer, R statistical environment, and TIGR Microarray Expression Viewer).
Through this loose coupling of diverse software and databases the Gaggle enables simultaneous exploration of experimental data (mRNA and protein abundance, protein-protein and protein-DNA interactions), functional associations (operon, chromosomal proximity, phylogenetic pattern), metabolic pathways (KEGG) and Pubmed abstracts (STRING web resource), creating an exploratory environment useful to 'web browser and spreadsheet biologists', to statistically savvy computational biologists, and those in between. The Gaggle uses Java RMI and Java Web Start technologies and can be found at
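The abstract states that four simple data types (names, matrices, networks, and associative arrays) are enough to connect diverse tools. The sketch below is one possible way to picture those types and how a selection might pass from a network view to a matrix view. It is written in Python for brevity, although the Gaggle itself is Java; all class, field, and function names are assumptions made for illustration and are not the Gaggle API.

```python
from dataclasses import dataclass


@dataclass
class NameList:
    species: str
    names: list  # gene or protein identifiers


@dataclass
class Matrix:
    species: str
    row_names: list
    column_names: list
    values: list  # one list of numbers per row


@dataclass
class Network:
    species: str
    edges: list  # (source, interaction_type, target) tuples


@dataclass
class AssociativeArray:
    species: str
    attributes: dict  # free-form key/value annotations


def names_in_network(net):
    """Reduce a network to the plain list of names it mentions."""
    nodes = sorted({n for source, _, target in net.edges for n in (source, target)})
    return NameList(net.species, nodes)


def rows_for(matrix, name_list):
    """Keep only the matrix rows whose names appear in the name list."""
    wanted = set(name_list.names)
    keep = [i for i, name in enumerate(matrix.row_names) if name in wanted]
    return Matrix(
        matrix.species,
        [matrix.row_names[i] for i in keep],
        matrix.column_names,
        [matrix.values[i] for i in keep],
    )


# Hypothetical toy data: gene names and values are placeholders.
net = Network(
    "Helicobacter pylori",
    [("geneA", "protein-protein", "geneB"), ("geneB", "protein-DNA", "geneC")],
)
expression = Matrix(
    "Helicobacter pylori",
    ["geneA", "geneD", "geneC"],
    ["cond1", "cond2"],
    [[1.0, 2.0], [0.5, 0.1], [3.0, 4.0]],
)
selection = names_in_network(net)
subset = rows_for(expression, selection)
```

Because every exchange reduces to one of a few generic shapes, a tool only needs to understand those shapes rather than every other tool's file format, which is the loose coupling the abstract describes.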

    Integrating Oceanographic Research Into High School Curricula: Achieving Broader Impacts Through Systems Education Experiences Modules

    We describe a framework for incorporating cross-disciplinary oceanographic research into high school curriculum modules and discuss how this framework could be adopted broadly by ocean scientists to build cohesive broader impacts programs nested within individual oceanographic research programs.

    Bringing critical systems thinking to high school students through ocean acidification research

    Ocean acidification is a complex phenomenon with complex consequences. Understanding this complexity and the impact of ocean acidification requires systems thinking and collaboration, both in research and in education. Scientific advancement will help us better understand the problem and devise more effective solutions, but executing these solutions will require widespread public participation to mitigate this local and global problem. We have translated current systems-level ocean acidification research into a 5-week high school curriculum module. We will present this curriculum, which is easily implemented in schools and has resulted in a high level of engagement and learning. Thus far 13 different schools and over 1200 students have field tested this work; we have seen dramatic increases in students’ abilities to use inquiry and to challenge their mental models. The lessons are hands-on, interdisciplinary, standards-based, and specifically focus on systems thinking, which has been shown to enable behavioral change. In this curriculum, students take on the roles of scientists and delegates as they investigate the consequences of the changing carbon cycle on the chemistry and biology of the oceans. Students critically assess different pieces of information (news articles and real-time data from the Salish Sea and beyond). They combine their findings into a network diagram that interconnects key players of this system. Students align themselves with stakeholders and design collaborative, cohesive experiments to test hypotheses and network properties. They explore the consequences of increased CO2 levels on the pH of water, the integrity of seashells, and the lifecycle of diatoms. The module also connects to other pertinent lessons being developed locally and globally. In the culminating activity, students act as delegates and reconvene to discuss the systems consequences of ocean acidification.
They make recommendations for further research, policy-making, and lifestyle changes on both a local and global scale.

    The Firegoose: two-way integration of diverse data from different bioinformatics web resources with desktop applications

    BACKGROUND: Information resources on the World Wide Web play an indispensable role in modern biology. But integrating data from multiple sources is often encumbered by the need to reformat data files, convert between naming systems, or perform ongoing maintenance of local copies of public databases. Opportunities for new ways of combining and re-using data are arising as a result of the increasing use of web protocols to transmit structured data. RESULTS: The Firegoose, an extension to the Mozilla Firefox web browser, enables data transfer between web sites and desktop tools. As a component of the Gaggle integration framework, Firegoose can also exchange data with Cytoscape, the R statistical package, Multiexperiment Viewer (MeV), and several other popular desktop software tools. Firegoose adds the capability to easily use local data to query KEGG, EMBL STRING, DAVID, and other widely-used bioinformatics web sites. Query results from these web sites can be transferred to desktop tools for further analysis with a few clicks. Firegoose acquires data from the web by screen scraping, microformats, embedded XML, or web services. We define a microformat, which allows structured information compatible with the Gaggle to be embedded in HTML documents. We demonstrate the capabilities of this software by performing an analysis of the genes activated in the microbe Halobacterium salinarum NRC-1 in response to anaerobic environments. Starting with microarray data, we explore functions of differentially expressed genes by combining data from several public web resources and construct an integrated view of the cellular processes involved. CONCLUSION: The Firegoose incorporates Mozilla Firefox into the Gaggle environment and enables interactive sharing of data between diverse web resources and desktop software tools without maintaining local copies.
Additional web sites can be incorporated easily into the framework using the scripting platform of the Firefox browser. Performing data integration in the browser allows the excellent search and navigation capabilities of the browser to be used in combination with powerful desktop tools.

    Comprehensive de novo structure prediction in a systems-biology context for the archaea Halobacterium sp. NRC-1

    BACKGROUND: Large fractions of all fully sequenced genomes code for proteins of unknown function. Annotating these proteins of unknown function remains a critical bottleneck for systems biology and is crucial to understanding the biological relevance of genome-wide changes in mRNA and protein expression, protein-protein and protein-DNA interactions. The work reported here demonstrates that de novo structure prediction is now a viable option for providing general function information for many proteins of unknown function. RESULTS: We have used Rosetta de novo structure prediction to predict three-dimensional structures for 1,185 proteins and protein domains (<150 residues in length) found in Halobacterium NRC-1, a widely studied halophilic archaeon. Predicted structures were searched against the Protein Data Bank to identify fold similarities and extrapolate putative functions. They were analyzed in the context of a predicted association network composed of several sources of functional associations such as: predicted protein interactions, predicted operons, phylogenetic profile similarity and domain fusion. To illustrate this approach, we highlight three cases where our combined procedure has provided novel insights into our understanding of chemotaxis, possible prophage remnants in Halobacterium NRC-1 and archaeal transcriptional regulators. CONCLUSIONS: Simultaneous analysis of the association network, coordinated mRNA level changes in microarray experiments and genome-wide structure prediction has allowed us to glean significant biological insights into the roles of several Halobacterium NRC-1 proteins of previously unknown function, and significantly reduce the number of proteins encoded in the genome of this haloarchaeon for which no annotation is available.

    Integration and visualization of systems biology data in context of the genome

    BACKGROUND: High-density tiling arrays and new sequencing technologies are generating rapidly increasing volumes of transcriptome and protein-DNA interaction data. Visualization and exploration of this data is critical to understanding the regulatory logic encoded in the genome by which the cell dynamically affects its physiology and interacts with its environment. RESULTS: The Gaggle Genome Browser is a cross-platform desktop program for interactively visualizing high-throughput data in the context of the genome. Important features include dynamic panning and zooming, keyword search and open interoperability through the Gaggle framework. Users may bookmark locations on the genome with descriptive annotations and share these bookmarks with other users. The program handles large sets of user-generated data using an in-process database and leverages the facilities of SQL and the R environment for importing and manipulating data. A key aspect of the Gaggle Genome Browser is interoperability. By connecting to the Gaggle framework, the genome browser joins a suite of interconnected bioinformatics tools for analysis and visualization with connectivity to major public repositories of sequences, interactions and pathways. To this flexible environment for exploring and combining data, the Gaggle Genome Browser adds the ability to visualize diverse types of data in relation to its coordinates on the genome. CONCLUSIONS: Genomic coordinates function as a common key by which disparate biological data types can be related to one another. In the Gaggle Genome Browser, heterogeneous data are joined by their location on the genome to create information-rich visualizations yielding insight into genome organization, transcription and its regulation and, ultimately, a better understanding of the mechanisms that enable the cell to dynamically respond to its environment.
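The idea that genomic coordinates act as a common key can be illustrated with a small overlap join in an in-process SQL database. This is a sketch under stated assumptions only: it uses Python's built-in `sqlite3` module as a stand-in, and the table names, column names, and feature values are hypothetical rather than taken from the Gaggle Genome Browser.

```python
import sqlite3

# An in-memory database stands in for the browser's in-process store.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE genes (name TEXT, sequence TEXT, start_pos INTEGER, end_pos INTEGER);
    CREATE TABLE peaks (name TEXT, sequence TEXT, start_pos INTEGER, end_pos INTEGER);
    """
)

# Hypothetical features with 1-based, inclusive coordinates.
conn.executemany(
    "INSERT INTO genes VALUES (?, ?, ?, ?)",
    [
        ("geneA", "chromosome", 100, 500),
        ("geneB", "chromosome", 800, 1200),
        ("geneC", "plasmid", 100, 500),
    ],
)
conn.executemany(
    "INSERT INTO peaks VALUES (?, ?, ?, ?)",
    [
        ("peak1", "chromosome", 450, 550),
        ("peak2", "chromosome", 600, 700),
        ("peak3", "plasmid", 50, 120),
    ],
)

# Two features are related when they lie on the same sequence
# and their intervals overlap.
overlapping = conn.execute(
    """
    SELECT g.name, p.name
    FROM genes AS g
    JOIN peaks AS p
      ON g.sequence = p.sequence
     AND g.start_pos <= p.end_pos
     AND p.start_pos <= g.end_pos
    ORDER BY g.name, p.name
    """
).fetchall()
conn.close()
```

Here `overlapping` pairs each gene with the binding peaks that fall on it, without the two datasets sharing any identifier other than position.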

    Promoter analysis by saturation mutagenesis

    Gene expression and regulation are mediated by DNA sequences, in most instances directly upstream of the coding sequences, which recruit transcription factors, regulators, and an RNA polymerase in a spatially defined fashion. Few nucleotides within a promoter make contact with the bound proteins. The minimal set of nucleotides that can recruit a protein factor is called a cis-acting element. This article addresses a powerful mutagenesis strategy that can be employed to define cis-acting elements at a molecular level. Technical details including primer design, saturation mutagenesis, construction of promoter libraries, phenotypic analysis, data analysis, and interpretation are discussed.
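To give a sense of the variant space a saturation mutagenesis library has to cover, the sketch below enumerates every single-base substitution of a short sequence. It is an illustration only and not the protocol from the article: the function name `single_base_variants` is hypothetical, the input is a toy six-base sequence, and real libraries are typically built with designed primers rather than by listing sequences.

```python
def single_base_variants(promoter):
    """List every single-nucleotide substitution of a DNA sequence."""
    bases = "ACGT"
    promoter = promoter.upper()
    variants = []
    for pos, ref in enumerate(promoter):
        for alt in bases:
            if alt == ref:
                continue
            # Label uses 1-based positions, e.g. T1A = T at position 1 changed to A.
            label = f"{ref}{pos + 1}{alt}"
            variants.append((label, promoter[:pos] + alt + promoter[pos + 1:]))
    return variants


# Toy input: a six-base sequence yields 3 substitutions per position.
library = single_base_variants("TATAAT")
```

A sequence of length n gives 3n single-substitution variants, so the library grows linearly with promoter length as long as only one position is changed at a time.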