    metaSHARK: software for automated metabolic network prediction from DNA sequence and its application to the genomes of Plasmodium falciparum and Eimeria tenella

    The metabolic SearcH And Reconstruction Kit (metaSHARK) is a new fully automated software package for the detection of enzyme-encoding genes within unannotated genome data and their visualization in the context of the surrounding metabolic network. The gene detection package (SHARKhunt) runs on a Linux system and requires only a set of raw DNA sequences (genomic, expressed sequence tag and/or genome survey sequence) as input. Its output may be uploaded to our web-based visualization tool (SHARKview) for exploring and comparing data from different organisms. We first demonstrate the utility of the software by comparing its results for the raw Plasmodium falciparum genome with the manual annotations available at the PlasmoDB and PlasmoCyc websites. We then apply SHARKhunt to the unannotated genome sequences of the coccidian parasite Eimeria tenella and observe that, at an E-value cut-off of 10^-20, our software makes 142 additional assertions of enzymatic function compared with a recent annotation package working with translated open reading frame sequences. The ability of the software to cope with low levels of sequence coverage is investigated by analyzing assemblies of the E. tenella genome at estimated coverages from 0.5x to 7.5x. Lastly, as an example of how metaSHARK can be used to evaluate the genomic evidence for specific metabolic pathways, we present a study of coenzyme A biosynthesis in P. falciparum and E. tenella.

    Representing and analysing molecular and cellular function in the computer

    Determining the biological function of a myriad of genes, and understanding how they interact to yield a living cell, is the major challenge of the post genome-sequencing era. The complexity of biological systems is such that this cannot be envisaged without the help of powerful computer systems capable of representing and analysing the intricate networks of physical and functional interactions between the different cellular components. In this review we try to provide the reader with an appreciation of where we stand in this regard. We discuss some of the inherent problems in describing the different facets of biological function, give an overview of how information on function is currently represented in the major biological databases, and describe different systems for organising and categorising the functions of gene products. In the second part, we present a new general data model, currently under development, which describes information on molecular function and cellular processes in a rigorous manner. The model is capable of representing a large variety of biochemical processes, including metabolic pathways, regulation of gene expression and signal transduction. It also incorporates taxonomies for categorising molecular entities, interactions and processes, and it offers means of viewing the information at different levels of resolution and of dealing with incomplete knowledge. The data model has been implemented in the database on protein function and cellular processes 'aMAZE' (http://www.ebi.ac.uk/research/pfbp/), which presently covers metabolic pathways and their regulation. Several tools for querying, displaying, and performing analyses on such pathways are briefly described in order to illustrate the practical applications enabled by the model.

    AGMIAL: implementing an annotation strategy for prokaryote genomes as a distributed system

    We have implemented a genome annotation system for prokaryotes called AGMIAL. Our approach embodies a number of key principles. First, expert manual annotators are seen as a critical component of the overall system; user interfaces were cyclically refined to satisfy their needs. Second, the overall process should be orchestrated in terms of a global annotation strategy; this facilitates coordination between a team of annotators and automatic data analysis. Third, the annotation strategy should allow progressive and incremental annotation, from a time when only a few draft contigs are available to when a final finished assembly is produced. The overall architecture employed is modular and extensible, being based on the W3 standard Web services framework. Specialized modules interact with two independent core modules that are used to annotate, respectively, genomic and protein sequences. AGMIAL is currently being used by several INRA laboratories to analyze genomes of bacteria relevant to the food-processing industry, and is distributed under an open source license.

    GENDB : a second generation genome annotation system

    Meyer F. GENDB : a second generation genome annotation system. Bielefeld (Germany): Bielefeld University; 2001. The advent of new high throughput technologies opens the road towards a new era of genome analysis. Data from high throughput sequencers, chip based RNA expression analysis and proteome analysis systems create the need for software systems to support new kinds of analysis and data. At the same time, the focus of molecular research has shifted from the analysis of single genes to the analysis of whole genomes, and multiple high throughput sources of data are routinely used. Yet there is a shortage of software systems that help store, integrate and analyse the wealth of information now available. We describe the development of a new genome annotation system (GENDB) based on a relational database system and object oriented technology that helps with the analysis of this data. GENDB significantly reduces the storage and compute overhead of existing systems, while offering more flexibility. The ability to integrate new kinds of data and new methods of analysis is one of the primary design targets for GENDB. The GENDB system has been successfully used in a number of genome projects.

    XenDB: Full length cDNA prediction and cross species mapping in Xenopus laevis

    BACKGROUND: Research using the model system Xenopus laevis has provided critical insights into the mechanisms of early vertebrate development and cell biology. Large scale sequencing efforts have provided an increasingly important resource for researchers. To take full advantage of the available sequence, we have analyzed 350,468 Xenopus laevis Expressed Sequence Tags (ESTs) both to identify full length protein encoding sequences and to develop a unique database system to support comparative approaches between X. laevis and other model systems. DESCRIPTION: Using a suffix array based clustering approach, we have identified 25,971 clusters and 40,877 singleton sequences. Generation of a consensus sequence for each cluster resulted in 31,353 tentative contig and 4,801 singleton sequences. Using both BLASTX and FASTY comparison to five model organisms and the NR protein database, more than 15,000 sequences are predicted to encode full length proteins and these have been matched to publicly available IMAGE clones when available. Each sequence has been compared to the KOG database and ~67% of the sequences have been assigned a putative functional category. Based on sequence homology to mouse and human, putative GO annotations have been determined. CONCLUSION: The results of the analysis have been stored in a publicly available database, XenDB. A unique capability of the database is the ability to batch upload cross species queries to identify potential Xenopus homologues and their associated full length clones. Examples are provided including mapping of microarray results and application of 'in silico' analysis. The ability to quickly translate the results of various species into 'Xenopus-centric' information should greatly enhance comparative embryological approaches. Supplementary material can be found at

    EasyCluster: a fast and efficient gene-oriented clustering tool for large-scale transcriptome data

    BACKGROUND: ESTs and full-length cDNAs represent an invaluable source of evidence for inferring reliable gene structures and discovering potential alternative splicing events. In newly sequenced genomes, these tasks may not be practicable owing to the lack of appropriate training sets. However, when expression data are available, they can be used to build EST clusters related to specific genomic transcribed loci. Common strategies recently employed to this end are based on sequence similarity between transcripts and can lead, in specific conditions, to inconsistent and erroneous clustering. In order to improve the cluster building and facilitate all downstream annotation analyses, we developed a simple genome-based methodology to generate gene-oriented clusters of ESTs when a genomic sequence and a pool of related expressed sequences are provided. Our procedure has been implemented in the software EasyCluster and takes into account the spliced nature of ESTs after an ad hoc genomic mapping. METHODS: EasyCluster uses the well-known GMAP program in order to perform a very quick EST-to-genome mapping in addition to the detection of reliable splice sites. Given a genomic sequence and a pool of ESTs/FL-cDNAs, EasyCluster starts building genomic and EST local databases and runs GMAP. Subsequently, it parses results creating an initial collection of pseudo-clusters by grouping ESTs according to the overlap of their genomic coordinates on the same strand. In the final step, EasyCluster refines the clustering by again running GMAP on each pseudo-cluster and groups together ESTs sharing at least one splice site. RESULTS: The higher accuracy of EasyCluster with respect to other clustering tools has been verified by means of a manually curated benchmark of human EST clusters. Additional datasets including the Unigene cluster Hs.122986 and ESTs related to the human HOXA gene family have also been used to demonstrate the better clustering capability of EasyCluster over current genome-based web service tools such as ASmodeler and BIPASS. EasyCluster has also been used to provide a first compilation of gene-oriented clusters in the Ricinus communis oilseed plant, for which no Unigene clusters are yet available, as well as an evaluation of the alternative splicing in this plant species.
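
    The two grouping steps described in the EasyCluster abstract (overlap of genomic coordinates on the same strand, then at least one shared splice site) can be illustrated with a minimal Python sketch. It is not the EasyCluster implementation: the `Alignment` record, the function names, and the union-find refinement are all hypothetical, the GMAP runs are assumed to have already produced one alignment per EST, and ESTs without splice sites are simply left as singletons because the abstract does not say how they are handled.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for one EST-to-genome alignment (not GMAP's format)."""
    est_id: str
    strand: str              # "+" or "-"
    start: int               # genomic start, inclusive
    end: int                 # genomic end, inclusive
    splice_sites: frozenset  # genomic coordinates of detected splice sites


def pseudo_clusters(alignments):
    """Step 1: group alignments whose genomic spans overlap on the same strand."""
    by_strand = defaultdict(list)
    for aln in alignments:
        by_strand[aln.strand].append(aln)

    clusters = []
    for alns in by_strand.values():
        alns.sort(key=lambda a: a.start)
        current, current_end = [alns[0]], alns[0].end
        for aln in alns[1:]:
            if aln.start <= current_end:      # overlaps the running span
                current.append(aln)
                current_end = max(current_end, aln.end)
            else:                             # gap: close the cluster
                clusters.append(current)
                current, current_end = [aln], aln.end
        clusters.append(current)
    return clusters


def refine_by_splice_site(cluster):
    """Step 2: within a pseudo-cluster, link ESTs sharing at least one splice site."""
    parent = {a.est_id: a.est_id for a in cluster}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]     # path halving
            x = parent[x]
        return x

    first_seen = {}                           # splice site -> first EST seen with it
    for a in cluster:
        for site in a.splice_sites:
            if site in first_seen:
                parent[find(a.est_id)] = find(first_seen[site])
            else:
                first_seen[site] = a.est_id

    groups = defaultdict(set)
    for a in cluster:
        groups[find(a.est_id)].add(a.est_id)
    return list(groups.values())


def cluster_ests(alignments):
    """Run both steps and return a list of sets of EST identifiers."""
    result = []
    for pc in pseudo_clusters(alignments):
        result.extend(refine_by_splice_site(pc))
    return result
```

    In this sketch, two ESTs that overlap on the same strand but share no splice site end up in different final clusters, which mirrors the refinement step the abstract describes.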

    K2/Kleisli and GUS: Experiments in Integrated Access to Genomic Data Sources

    The integration of heterogeneous data sources and software systems is a major issue in the biomedical community and several approaches have been explored: linking databases, on-the-fly integration through views, and integration through warehousing. In this paper we report on our experiences with two systems that were developed at the University of Pennsylvania: an integration system called K2, which has primarily been used to provide views over multiple external data sources and software systems; and a data warehouse called GUS, which downloads, cleans, integrates and annotates data from multiple external data sources. Although the view and warehouse approaches each have their advantages, there is no clear winner. Therefore, users must consider how the data is to be used, what the performance guarantees must be, and how much programmer time and expertise is available to choose the best strategy for a particular application.