27 research outputs found

    mGene.web: a web service for accurate computational gene finding

    We describe mGene.web, a web service for the genome-wide prediction of protein coding genes from eukaryotic DNA sequences. It offers pre-trained models for the recognition of gene structures including untranslated regions in an increasing number of organisms. With mGene.web, users have the additional possibility to train the system with their own data for other organisms at the push of a button, a functionality that will greatly accelerate the annotation of newly sequenced genomes. The system is built in a highly modular way, such that individual components of the framework, like the promoter prediction tool or the splice site predictor, can be used autonomously. The underlying gene finding system mGene is based on discriminative machine learning techniques and its high accuracy has been demonstrated in an international competition on nematode genomes. mGene.web is available at http://www.mgene.org/web; it is free of charge and can be used for eukaryotic genomes of small to moderate size (several hundred Mbp).

    A standard variation file format for human genome sequences

    Here we describe the Genome Variation Format (GVF) and the 10Gen dataset. GVF, an extension of Generic Feature Format version 3 (GFF3), is a simple tab-delimited format for DNA variant files, which uses Sequence Ontology to describe genome variation data. The 10Gen dataset, ten human genomes in GVF format, is freely available for community analysis from the Sequence Ontology website and from an Amazon elastic block storage (EBS) snapshot for use in Amazon's EC2 cloud computing environment
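
    Because GVF is described above as a tab-delimited extension of GFF3, a record can be read with the standard nine-column GFF3 layout. The following is a minimal sketch under that assumption, not the reference parser: the `Feature` type, the `parse_feature_line` helper and the example record (including its `example_source` value and attribute values) are hypothetical and chosen only for illustration.

```python
from typing import Dict, NamedTuple


class Feature(NamedTuple):
    """One GFF3-style record (the nine standard GFF3 columns)."""
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: str
    strand: str
    phase: str
    attributes: Dict[str, str]


def parse_feature_line(line: str) -> Feature:
    """Split one tab-delimited record into its nine columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    # Column 9 holds semicolon-separated key=value pairs.
    attrs: Dict[str, str] = {}
    for pair in cols[8].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attrs[key.strip()] = value.strip()
    return Feature(cols[0], cols[1], cols[2], int(cols[3]), int(cols[4]),
                   cols[5], cols[6], cols[7], attrs)


# Hypothetical record, made up for illustration only.
example = ("chr1\texample_source\tSNV\t1000\t1000\t.\t+\t.\t"
           "ID=var1;Reference_seq=A;Variant_seq=G")
feature = parse_feature_line(example)
print(feature.type, feature.start, feature.attributes["Variant_seq"])
```

    Keeping the attributes as a plain dictionary means format-specific keys do not need to be known in advance; real files also contain comment and pragma lines beginning with `#`, which this sketch does not handle.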

    WormBase 2007

    WormBase (www.wormbase.org) is the major publicly available database of information about Caenorhabditis elegans, an important system for basic biological and biomedical research. Derived from the initial ACeDB database of C. elegans genetic and sequence information, WormBase now includes genomic, anatomical and functional information about C. elegans, other Caenorhabditis species and other nematodes. As such, it is a crucial resource not only for C. elegans biologists but also for the larger biomedical and bioinformatics communities. Coverage of core areas of C. elegans biology will allow the biomedical community to make full use of the results of intensive molecular genetic analysis and functional genomic studies of this organism. Improved search and display tools, wider cross-species comparisons and extended ontologies are some of the features that will help scientists extend their research and take advantage of other nematode species genome sequences.

    Improving the annotation of the Heterorhabditis bacteriophora genome

    Background: Genome assembly and annotation remain exacting tasks. As the tools available for these tasks improve, it is useful to return to data produced with earlier techniques to assess their credibility and correctness. The entomopathogenic nematode Heterorhabditis bacteriophora is widely used to control insect pests in horticulture. The genome sequence for this species was reported to encode an unusually high proportion of unique proteins and a paucity of secreted proteins compared to other related nematodes. Findings: We revisited the H. bacteriophora genome assembly and gene predictions to determine whether these unusual characteristics were biological or methodological in origin. We mapped an independent resequencing dataset to the genome and used the blobtools pipeline to identify potential contaminants. While present (0.2% of the genome span, 0.4% of predicted proteins), assembly contamination was not significant. Conclusions: Re-prediction of the gene set using BRAKER1 and published transcriptome data generated a predicted proteome that was very different from the published one. The new gene set had a much reduced complement of unique proteins, better completeness values that were in line with other related species’ genomes, and an increased number of proteins predicted to be secreted. It is thus likely that methodological issues drove the apparent uniqueness of the initial H. bacteriophora genome annotation and that similar contamination and misannotation issues affect other published genome assemblies

    Methods and strategies for gene structure curation in WormBase

    The Caenorhabditis elegans genome sequence was published over a decade ago; this was the first published genome of a multi-cellular organism and now the WormBase project has had a decade of experience in curating this genome's sequence and gene structures. In one of its roles as a central repository for nematode biology, WormBase continues to refine the gene structure annotations using sequence similarity and other computational methods, as well as information from the literature- and community-submitted annotations. We describe the various methods of gene structure curation that have been tried by WormBase and the problems associated with each of them. We also describe the current strategy for gene structure curation, and introduce the WormBase ‘curation tool’, which integrates different data sources in order to identify new and correct gene structures

    Doctor of Philosophy

    Whole genome sequencing projects have expanded our understanding of evolution, organism development, and human disease. Now advances in second-generation technologies are making whole genome sequencing routine even for small laboratories. However, advances in annotation technology have not kept pace with genome sequencing, and annotation has become the major bottleneck for many genome projects (especially those with limited bioinformatics expertise). At the same time, challenges associated with genomics research extend beyond merely annotating genomes, as annotations must be subjected to diverse downstream analyses, the complexities of which can confound smaller research groups. Additionally, with improvements in genome assembly and the wide availability of next generation transcriptome data (mRNA-seq), researchers have the opportunity to re-annotate previously published genomes, which creates new difficulties for data integration and management that are not well addressed by existing tools. In response to the challenges facing second-generation genome projects, I have developed the annotation pipeline MAKER2 together with accessory software for downstream analysis and data management. The MAKER2 annotation pipeline finds repeats within a genome, aligns ESTs and cDNAs, identifies sites of protein homology, and produces database-ready gene annotations in association with supporting evidence. However, MAKER2 can go beyond structural annotation to identify and integrate functional annotations. MAKER2 also provides researchers with the capability to re-annotate legacy genome datasets and to incorporate mRNA-seq. Additionally, MAKER2 supports distributed parallelization on computer clusters, thus providing a scalable solution for datasets of any size. Annotations produced by MAKER2 can be directly loaded into many popular downstream annotation analysis and management tools from the Generic Model Organism Database Project. By using MAKER2 with these tools, research groups can quickly build genome annotations, perform analyses, and distribute their data to the wider scientific community. Here I describe the internal architecture of MAKER2, and document its computational capabilities. I also describe my work to annotate and analyze eight emerging model organism genomes in collaboration with their associated genome projects. Thus, in the course of my thesis work, I have addressed a specific need within the scientific community for easy-to-use annotation and analysis tools while also expanding our understanding of evolution and biology.

    WebGMAP: a web service for mapping and aligning cDNA sequences to genomes

    Miami University; Ohio Plant Biotechnology Consortium; National Natural Science Fund of China [60774033]; Natural Science Foundation of Fujian Province in China [B0710031]; Specialized Research Fund for the Doctoral Program of Higher Education [2007038400
    The genomes of thousands of organisms are being sequenced, often with accompanying sequences of cDNAs or ESTs. One of the great challenges in bioinformatics is to make these genomic sequences and genome annotations accessible in a user-friendly manner to general biologists to address interesting biological questions. We have created an open-access web service called WebGMAP (http://www.bioinfolab.org/software/webgmap) that seamlessly integrates cDNA-genome alignment tools, such as GMAP, with easy-to-use data visualization and mining tools. This web service is intended to facilitate community efforts in improving genome annotation, determining accurate gene structures and their variations, and exploring important biological processes such as alternative splicing and alternative polyadenylation. For routine sequence analysis, WebGMAP provides a web-based sequence viewer with many useful functions, including nucleotide positioning, six-frame translations, sequence reverse complementation, and imperfect motif detection and alignment. WebGMAP also provides users with the ability to sort, filter and search for individual cDNA sequences and cDNA-genome alignments. Our EST-Genome-Browser can display annotated gene structures and cDNA-genome alignments at scales from 100 to 50 000 nt. With its ability to highlight base differences between query cDNAs and the genome, our EST-Genome-Browser allows biologists to discover potential point or insertion-deletion variations from cDNA-genome alignments.

    OryzaPG-DB: Rice Proteome Database based on Shotgun Proteogenomics

    Background: Proteogenomics aims to utilize experimental proteome information for refinement of genome annotation. Since mass spectrometry-based shotgun proteomics approaches provide large-scale peptide sequencing data with high throughput, a data repository for shotgun proteogenomics would represent a valuable source of gene expression evidence at the translational level for genome re-annotation. Description: Here, we present OryzaPG-DB, a rice proteome database based on shotgun proteogenomics, which incorporates the genomic features of experimental shotgun proteomics data. This version of the database was created from the results of 27 nanoLC-MS/MS runs on a hybrid ion trap-orbitrap mass spectrometer, which offers high accuracy for analyzing tryptic digests from undifferentiated cultured rice cells. Peptides were identified by searching the product ion spectra against the protein, cDNA, transcript and genome databases from Michigan State University, and were mapped to the rice genome. Approximately 3200 genes were covered by these peptides and 40 of them contained novel genomic features. Users can search, download or navigate the database per chromosome, gene, protein, cDNA or transcript and download the updated annotations in standard GFF3 format, with visualization in PNG format. In addition, the database scheme of OryzaPG was designed to be generic and can be reused to host similar proteogenomic information for other species. OryzaPG is the first proteogenomics-based database of the rice proteome, providing peptide-based expression profiles, together with the corresponding genomic origin, including the annotation of novelty for each peptide. Conclusions: The OryzaPG database was constructed and is freely available at http://oryzapg.iab.keio.ac.jp/.

    Reconstructing the phylogenetic relationships of nematodes using draft genomes and transcriptomes

    Nematoda is a very diverse animal phylum. Within Nematoda, species display a multitude of lifestyles and different reproductive strategies, and parasitism has arisen independently several times. Furthermore, morphological conservation and a high rate of homoplasy have impeded the resolution of nematode systematics. To address these issues, single gene (usually the nuclear ribosomal small subunit gene) and mitochondrial gene phylogenies have been used, but the information contained within the sequence of these genes is not enough to resolve the topological relationships between clades that emerged during rapid cladogenesis. Next generation sequencing data have been shown to produce high quality genomic and transcriptomic assemblies at low cost; as a result, more and more nematode species are being sequenced. Sequences were gathered or generated for 53 nematode species from ESTs, gene predictions from full genome assemblies and transcripts from RNA-Seq experiments. These sequences were screened for orthologous gene clusters, which were concatenated into a supermatrix with thousands of amino acid sites. The analysis of the supermatrix with maximum likelihood and Bayesian inference methods sheds light on the early splitting clades of the phylogenetic tree of nematodes and the derived clades III, IV and V. Furthermore, the phylogenetic relationships within the parasitic family Onchocercidae were resolved, unveiling the evolutionary history of these important taxa. Finally, data produced in this work will be useful for subsequent evolutionary studies of the phylum Nematoda.
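
    The concatenation step mentioned in this abstract can be sketched generically. The example below is an illustration of supermatrix construction in general, not the authors' pipeline: the `build_supermatrix` function, the gap-padding convention for taxa missing from a cluster, and the toy sequences are all assumptions made for this sketch.

```python
from typing import Dict, List


def build_supermatrix(alignments: List[Dict[str, str]],
                      gap: str = "-") -> Dict[str, str]:
    """Concatenate per-gene alignments (taxon -> aligned sequence).

    Taxa absent from a given alignment are padded with gap characters
    so that every row of the resulting matrix has the same length.
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    parts: Dict[str, List[str]] = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, gap * width))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy alignments for two hypothetical orthologous clusters.
clusters = [
    {"sp_a": "MK-L", "sp_b": "MKAL"},
    {"sp_a": "GG", "sp_c": "GA"},
]
matrix = build_supermatrix(clusters)
for taxon, row in matrix.items():
    print(taxon, row)
```

    Padding missing taxa with gaps keeps all rows aligned column by column, which is what downstream maximum likelihood or Bayesian inference tools generally expect from a concatenated matrix.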