10 research outputs found

    Classification of Chemical Compounds to Support Complex Queries in a Pathway Database

    Data quality in biological databases has become a topic of great discussion. To provide high-quality data and to deal with the vast amount of biochemical data, annotators and curators need to be supported by software that carries out part of their work in a (semi-)automatic manner. The detection of errors and inconsistencies requires the knowledge of domain experts, so in most cases it is done manually, making it very expensive and time-consuming. This paper presents two tools to partially support the curation of data on biochemical pathways. The first tool enables the automatic classification of chemical compounds based on their respective SMILES strings. Such classification allows the querying and visualization of biochemical reactions at different levels of abstraction, according to the level of detail at which the reaction participants are described. Chemical compounds can be classified in a flexible manner based on different criteria. The second tool supports the process of data curation by facilitating the detection of compounds that are identified as different but are actually the same. This is also used to identify similar reactions and, in turn, pathways.

    Quantitative utilization of prior biological knowledge in the Bayesian network modeling of gene expression data

    Background: Bayesian Network (BN) modeling is a powerful approach to reconstructing genetic regulatory networks from gene expression data. However, expression data by itself suffers from high noise and lack of power. Incorporating prior biological knowledge can improve the performance. As each type of prior knowledge on its own may be incomplete or limited by quality issues, integrating multiple sources of prior knowledge to utilize their consensus is desirable. Results: We introduce a new method to incorporate the quantitative information from multiple sources of prior knowledge. It first uses a Naïve Bayesian classifier to assess the likelihood of functional linkage between gene pairs based on prior knowledge. In this study we included co-citation in PubMed and semantic similarity in Gene Ontology annotation. A candidate network edge reservoir is then created in which the copy number of each edge is proportional to the estimated likelihood of linkage between the two corresponding genes. During network simulation, a Markov Chain Monte Carlo sampling algorithm draws from this reservoir at each iteration to generate new candidate networks. We evaluated the new algorithm using both simulated and real gene expression data, including data from a yeast cell cycle study and a mouse pancreas development/growth study. Incorporating prior knowledge led to a ~2-fold increase in the number of known transcription regulations recovered, without a significant change in the false positive rate. In contrast, without the prior knowledge BN modeling is not always better than a random selection, demonstrating the necessity in network modeling of supplementing the gene expression data with additional information. Conclusion: Our new development provides a statistical means to utilize the quantitative information in prior biological knowledge in the BN modeling of gene expression data, which significantly improves the performance.
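The edge reservoir described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the gene names, the likelihood values, the `scale` parameter, and both function names are hypothetical, and the sketch covers only the reservoir construction and the proposal draw, not the full MCMC scoring of candidate networks.

```python
import random


def build_edge_reservoir(linkage_likelihood, scale=10):
    """Build a list in which each edge is repeated in proportion to its likelihood.

    linkage_likelihood maps (gene, gene) pairs to an estimated likelihood of
    functional linkage; scale (an assumed tuning knob) converts a likelihood
    into an integer copy number.
    """
    reservoir = []
    for edge, likelihood in linkage_likelihood.items():
        copies = round(likelihood * scale)
        reservoir.extend([edge] * copies)
    return reservoir


def propose_edge(reservoir, rng):
    """Draw one candidate edge; edges with more copies are proposed more often."""
    return rng.choice(reservoir)


# Hypothetical likelihoods, e.g. as produced by a classifier over prior knowledge.
likelihoods = {
    ("geneA", "geneB"): 0.6,
    ("geneA", "geneC"): 0.3,
    ("geneB", "geneC"): 0.1,
}

reservoir = build_edge_reservoir(likelihoods, scale=10)
rng = random.Random(0)
draws = [propose_edge(reservoir, rng) for _ in range(1000)]
```

Sampling uniformly from the reservoir is equivalent to sampling edges with probability proportional to their copy number, which is how the prior knowledge biases the proposals without changing the sampler itself.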

    Computational methods for integrated analysis of omics and pathway data

    One of the key tenets of bioinformatics is to find ways to enable the interoperability of heterogeneous data sources and improve the integration of various biological data. High-throughput experimental methods continue to improve and become more easily accessible. This allows researchers to measure not just their specific gene or protein of interest, but the entirety of the biological machinery inside the cell. These measurements are collectively referred to as omics and include genomics, transcriptomics, proteomics, metabolomics, and fluxomics. Omics data is highly interrelated at the systems level, as each type of molecule (DNA, RNA, protein, etc.) can interact with and have an impact on the other types. These interactions may be direct, such as the central dogma of biology that information flows from DNA to RNA to protein. They may also be indirect, such as the regulation of gene expression or metabolic feedback loops. Regardless, it is becoming apparent that multiple levels of omics data must be analyzed and understood simultaneously if we are to advance our understanding of systems-level biology. Much of our current biological knowledge is stored in public databases, most of which specialize in a particular type of omics or a specific organism. Despite efforts to improve consistency between databases, there are many challenges which can impede efforts to meaningfully compare or combine these resources. At a basic level, differences in naming and internal database ID assignments prevent simple mapping between objects in these databases. More fundamental, though, is the lack of a standardized way to define equivalency between two functionally identical biological entities. One benefit of improving database interoperability is that targeted high-quality data from one database can be used to improve another database. Comparison between MaizeCyc and CornCyc identified many manually curated GO annotations present in MaizeCyc but not in CornCyc.
CycTools facilitates the transfer of high-quality annotation data from one database to another by automatically mapping equivalent objects in both databases. This Java-based tool has a graphical user interface which guides users through the transfer process. A case study which uses two independent Zea mays pathway databases, CornCyc and MaizeCyc, illustrates the challenges of comparing the content of even closely related resources. This example highlights the downstream implications that the choice of initial computational enzymatic function assignment pipelines and subsequent manual curation had on the overall scope and quality of the content of each database. We compare the prediction accuracy of the protein EC assignments for 177 maize enzymes between these resources and find that while MaizeCyc covers a broader scope of enzyme predictions, CornCyc predictions are more accurate. The advantage of high-quality, integrated data resources must be realized through analysis methods which can account for multiple data types simultaneously. Due to the difficulty in obtaining systems-wide metabolic flux measurements, researchers have made several efforts to integrate transcriptional regulatory data with metabolic models in order to improve the accuracy of metabolic flux predictions. Transcriptional regulation involves the binding of transcription factors (i.e., proteins) to binding sites on the DNA in order to positively or negatively influence expression of the targeted gene. This has an indirect, downstream impact on the organism's metabolism, as metabolic reactions depend on gene-derived enzymes to catalyze them. A novel method is proposed which seeks to integrate transcriptional regulation and metabolic reaction data into a single model in order to investigate the interactions between metabolism and regulation.
In contrast to existing methods which seek to use transcriptional regulation networks to limit the solution space of the constraint-based metabolic model, we seek to define a transcriptional regulatory space which can be associated with the metabolic distribution of interest. This allows us to make inferences about how changes in the regulatory network could lead to improved metabolic flux.

    Critical assessment of human metabolic pathway databases: a stepping stone for future integration

    Background: Multiple pathway databases are available that describe the human metabolic network and have proven their usefulness in many applications, ranging from the analysis and interpretation of high-throughput data to their use as a reference repository. However, so far the various human metabolic networks described by these databases have not been systematically compared and contrasted, nor has the extent to which they differ been quantified. For a researcher using these databases for particular analyses of human metabolism, it is crucial to know the extent of the differences in content and their underlying causes. Moreover, the outcomes of such a comparison are important for ongoing integration efforts. Results: We compared the genes, EC numbers and reactions of five frequently used human metabolic pathway databases. The overlap is surprisingly low, especially on reaction level, where the databases agree on 3% of the 6968 reactions they have combined. Even for the well-established tricarboxylic acid cycle the databases agree on only 5 out of the 30 reactions in total. We identified the main causes for the lack of overlap. Importantly, the databases are partly complementary. Other explanations include the number of steps a conversion is described in and the number of possible alternative substrates listed. Missing metabolite identifiers and ambiguous names for metabolites also affect the comparison. Conclusions: Our results show that each of the five networks compared provides us with a valuable piece of the puzzle of the complete reconstruction of the human metabolic network. To enable integration of the networks, next to a need for standardizing the metabolite names and identifiers, the conceptual differences between the databases should be resolved.
Considerable manual intervention is required to reach the ultimate goal of a unified and biologically accurate model for studying the systems biology of human metabolism. Our comparison provides a stepping stone for such an endeavor.
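The kind of overlap statistic reported in this abstract (reactions on which all databases agree, as a share of the reactions they contain combined) can be sketched in a few lines of Python. This is an illustration only: the database names and reaction identifiers are invented, the function name is hypothetical, and the sketch assumes reactions have already been mapped to shared identifiers, which the abstract notes is itself a major difficulty.

```python
def consensus_fraction(databases):
    """Return (shared, combined, fraction) for a dict of name -> set of reaction IDs.

    shared   = number of reactions present in every database
    combined = number of distinct reactions across all databases
    """
    reaction_sets = list(databases.values())
    combined = set().union(*reaction_sets)
    shared = set.intersection(*reaction_sets)
    return len(shared), len(combined), len(shared) / len(combined)


# Toy data with made-up identifiers; real databases use differing ID schemes.
toy_databases = {
    "db_a": {"R1", "R2", "R3"},
    "db_b": {"R2", "R3", "R4"},
    "db_c": {"R3", "R5"},
}

shared, combined, fraction = consensus_fraction(toy_databases)
print(f"{shared} of {combined} reactions shared ({fraction:.0%})")
```

In the toy data only one reaction appears in all three sets, so the consensus is 1 of 5 distinct reactions.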

    Genome-scale metabolic network reconstruction of Polaromonas sp. strain JS666: analysis of cDCE degradation rates and design of experiments for bioremediation improvement

    Release of chloroethene compounds into the environment often results in groundwater contamination, which puts people at risk of exposure by drinking contaminated water. Accumulation of cDCE (cis-1,2-dichloroethene) in subsurface environments is a common environmental problem due to stagnation and partial degradation of other precursor chloroethene species. Polaromonas sp. strain JS666 apparently requires no exotic growth factors to be used as a bioaugmentation agent for aerobic cDCE degradation. Although it is the only suitable microorganism found capable of this, further studies are needed to improve the intrinsic bioremediation rates and to fully comprehend the metabolic processes involved. To this end, a metabolic model, iJS666, was reconstructed from genome annotation and available bibliographic data. FVA (Flux Variability Analysis) and FBA (Flux Balance Analysis) techniques were used to satisfactorily validate the predictive capabilities of the iJS666 model. The iJS666 model was able to predict biomass growth for different previously tested conditions, allowed the design of key experiments that should be done for further model improvement and also produced viable predictions for the use of biostimulant metabolites in cDCE biodegradation.

    Creation of the pathway/genome database for Chromobacterium violaceum (CvioCyc) and analysis of the information generated by the Pathway Tools software

    Dissertation (master's) - Universidade Federal de Santa Catarina, Centro Tecnológico. Programa de Pós-Graduação em Engenharia Química. Microorganisms whose genomes have been completely sequenced, such as Chromobacterium violaceum, generally have genome annotations produced by a team that analyzes the organism's physiology and correlates it with the data produced by the sequencing project. These data are generally stored in public databases and often need to be verified, i.e., experimentally validated. However, recent advances in bioinformatics and computational biology allow certain physiological conditions to be verified computationally, such as the presence or absence of a particular metabolic pathway. Several research groups have developed techniques to predict metabolic pathways from genome annotation, producing integrated pathway/genome databases. The main objective of this work was to build a database for C. violaceum (CvioCyc), a Gram-negative bacterium sequenced by the Brazilian National Genome Project Consortium that has great biotechnological and biomedical potential. C. violaceum produces a violet pigment known as violacein, to which anti-tumoral and anti-Trypanosoma cruzi properties, among others, are attributed; moreover, this bacterium is able to synthesize biopolymers of great commercial interest. The pathway/genome database CvioCyc (cviocyc.intelab.ufsc.br), built with the Pathway Tools software suite, proved to be an important tool for analyzing the C. violaceum genome. Of the 233 metabolic pathways generated automatically by the software, 61 were analyzed: 17 were removed (27.86%) and 44 were retained (72.13%).
In addition to the inclusion of the violacein biosynthesis pathway starting from the amino acid tryptophan, these 61 pathways were directly curated using the results of Pathway Tools, literature data, and several other bioinformatics tools available on the web, such as BLAST tools and enzyme databases (KEGG, ENZYME and BRENDA). Thirty-nine ORFs (open reading frames) related to the analyzed pathways were modified in CvioCyc. That represents approximately 24.3% error in a sample of 160 examined ORFs. This result is within the range of errors commonly reported in the literature, which varies from 8% to 25%. Several of the errors are among the annotation errors most frequently found in the literature, such as false-positive and false-negative ORFs, typographical errors, and lack of standardization in enzyme and gene names. The analysis of the genes involved in violacein biosynthesis suggests that ORF CV3270 may be part of the vio operon, in agreement with the Pathway Tools prediction. It is therefore likely that there is one more gene in the vioABCD operon. There is a distance of only 12 base pairs between this ORF and the vioD gene, and a stem-loop (hairpin) structure is observed downstream of the ORF, which suggests that transcription terminates after ORF CV3270.

    Identification of bacterial pathogenic gene classes subject to diversifying selection

    Philosophiae Doctor - PhD (Biotechnology). Availability of genome sequences for numerous bacterial species comprising different bacterial strains allows elucidation of species- and strain-specific adaptations that facilitate their survival in widely fluctuating micro-environments and enhance their pathogenic potential. Different bacterial species use different strategies in their pathogenesis, and the pathogenic potential of a bacterial species is dependent on its genomic complement of virulence factors. A bacterial virulence factor, within the context of this study, is defined as any endogenous protein product encoded by a gene that aids in the adhesion, invasion, colonization, persistence and pathogenesis of a bacterium within a host. Anecdotal evidence suggests that bacterial virulence genes are undergoing diversifying evolution to counteract the rapid adaptability of the host's immune defences. Genome sequences of pathogenic bacterial species and strains provide unique opportunities to study the action of diversifying selection operating on different classes of bacterial genes. South Afric