11 research outputs found

    Association algorithm to mine the rules that govern enzyme definition and to classify protein sequences

    BACKGROUND: The number of sequences compiled in many genome projects is growing exponentially, but most of them have not been characterized experimentally. An automatic annotation scheme is urgently needed to narrow the gap between the number of new sequences produced and the number with reliable functional annotation. This work proposes rules for automatically classifying fungus genes. The approach involves elucidating the enzyme classification rules that are hidden in the UniProt protein knowledgebase and then applying them for classification. The association algorithm Apriori is used to mine the relationship between the enzyme class and significant InterPro entries. The candidate rules are evaluated for their classificatory capacity. RESULTS: Five datasets were collected from Swiss-Prot to establish the annotation rules. These were treated as the training sets. The TrEMBL entries were treated as the testing set. A correct enzyme classification rate of 70% was obtained for the prokaryote datasets and a rate of about 80% was obtained for the eukaryote datasets. The fungus training dataset, which lacks an enzyme class description, was also used to evaluate the fungus candidate rules. A total of 88 out of 5085 test entries were matched with the fungus rule set. These were otherwise poorly annotated using their functional descriptions. CONCLUSION: The method presented here is feasible for classifying enzyme classes based on enzyme domain rules. The rules may also be employed by protein annotators in manual annotation or implemented in an automatic annotation workflow.
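
The rule-mining step described in this abstract can be illustrated with a minimal Apriori sketch. This is not the authors' implementation: the domain labels (`IPR_A`, `IPR_B`, `IPR_C`), the class labels (`EC:1`, `EC:2`), the thresholds, and the function names are all hypothetical, chosen only to show how frequent itemsets yield "domains -> enzyme class" rules with a support and a confidence.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset (level-wise search)."""
    n = len(transactions)
    # Level 1 candidates: every single item seen in the data.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Keep the candidates whose support reaches the threshold.
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t) / n
            if support >= min_support:
                level[c] = support
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates, then prune any
        # candidate that has an infrequent k-subset (downward closure).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

def class_rules(frequent, min_confidence, class_prefix="EC:"):
    """Build rules {domains} -> class from itemsets holding exactly one class label."""
    rules = []
    for itemset, support in frequent.items():
        classes = [i for i in itemset if i.startswith(class_prefix)]
        domains = itemset - set(classes)
        if len(classes) != 1 or not domains:
            continue
        # confidence = support(domains and class) / support(domains)
        confidence = support / frequent[domains]
        if confidence >= min_confidence:
            rules.append((tuple(sorted(domains)), classes[0], support, confidence))
    return sorted(rules)

# Hypothetical training entries: each protein is a set of domain labels plus a class label.
transactions = [
    frozenset({"IPR_A", "IPR_B", "EC:1"}),
    frozenset({"IPR_A", "IPR_B", "EC:1"}),
    frozenset({"IPR_A", "EC:2"}),
    frozenset({"IPR_C", "EC:2"}),
]
frequent = apriori(transactions, min_support=0.5)
rules = class_rules(frequent, min_confidence=0.9)
for domains, enzyme_class, support, confidence in rules:
    print(domains, "->", enzyme_class, support, confidence)
```

In this toy data, `IPR_A` alone does not give a rule for `EC:1` because it also occurs in an `EC:2` entry (confidence 2/3), whereas `IPR_B` always co-occurs with `EC:1`.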

    Gene Ontology annotation quality analysis in model eukaryotes

    Functional analysis using the Gene Ontology (GO) is crucial for array analysis, but it is often difficult for researchers to assess the amount and quality of GO annotations associated with different sets of gene products. In many cases the source of the GO annotations and the date they were last updated are not apparent, further complicating a researcher's ability to assess the quality of the GO data provided. Moreover, GO biocurators need to ensure that GO quality is maintained and is optimal for the functional processes most relevant to their research community. We report the GO Annotation Quality (GAQ) score, a quantitative measure of GO quality that includes the breadth of GO annotation, the level of detail of annotation, and the type of evidence used to make the annotation. As a case study, we apply the GAQ scoring method to a set of diverse eukaryotes and demonstrate how the GAQ score can be used to track changes in GO annotations over time and to assess the quality of GO annotations available for specific biological processes. The GAQ score also allows researchers to quantitatively assess the functional data available for their experimental systems (arrays or databases).

    Conserved developmental transcriptomes in evolutionarily divergent species

    Transcriptional profiling of Dictyostelium development reveals significant conservation of transcriptional profiles between evolutionarily divergent species

    Automatic, context-specific generation of Gene Ontology slims

    Background: The use of ontologies to control vocabulary and structure annotation has added value to genome-scale data, and contributed to the capture and re-use of knowledge across research domains. Gene Ontology (GO) is widely used to capture detailed expert knowledge in genomic-scale datasets and as a consequence has grown to contain many terms, making it unwieldy for many applications. To increase its ease of manipulation and efficiency of use, subsets called GO slims are often created by collapsing terms upward into more general, high-level terms relevant to a particular context. Creation of a GO slim currently requires manipulation and editing of GO by an expert (or community) familiar with both the ontology and the biological context. Decisions about which terms to include are necessarily subjective, and the creation process itself and subsequent curation are time-consuming and largely manual
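
The "collapsing terms upward" step that a GO slim relies on can be sketched as a walk over parent links. The sketch below is only an illustration under stated assumptions: the term names, the `parents` map, the chosen slim set, and the function `map_to_slim` are hypothetical, and it does not reproduce the automatic slim-generation method that the paper proposes.

```python
from collections import deque

def map_to_slim(term, parents, slim):
    """Climb parent links from `term`, collecting slim terms and never climbing past one."""
    found, seen = set(), {term}
    queue = deque([term])
    while queue:
        current = queue.popleft()
        if current in slim:
            found.add(current)  # a slim term absorbs everything below it
            continue
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return found

# Hypothetical mini-ontology: child term -> list of parent terms (a DAG).
parents = {
    "fatty acid synthesis": ["lipid metabolism"],
    "lipid metabolism": ["metabolism"],
    "metabolism": ["biological process"],
    "organ development": ["development"],
    "development": ["biological process"],
    "lipid storage in organ": ["lipid metabolism", "organ development"],
}
slim = {"metabolism", "development"}  # hypothetical high-level slim terms

# Hypothetical detailed annotations, collapsed to the slim.
annotations = {
    "geneA": ["fatty acid synthesis"],
    "geneB": ["lipid storage in organ"],
}
slimmed = {
    gene: sorted(set().union(*(map_to_slim(t, parents, slim) for t in terms)))
    for gene, terms in annotations.items()
}
print(slimmed)
```

A term with several parents, such as the hypothetical `lipid storage in organ`, can map to more than one slim term, which is one reason the choice of slim terms affects downstream counts.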

    Sequencing, Mapping, and Analysis of 27,455 Maize Full-Length cDNAs

    Full-length cDNA (FLcDNA) sequencing establishes the precise primary structure of individual gene transcripts. From two libraries representing 27 B73 tissues and abiotic stress treatments, 27,455 high-quality FLcDNAs were sequenced. The average transcript length was 1.44 kb including 218 bases and 321 bases of 5′ and 3′ UTR, respectively, with 8.6% of the FLcDNAs encoding predicted proteins of fewer than 100 amino acids. Approximately 94% of the FLcDNAs were stringently mapped to the maize genome. Although nearly two-thirds of this genome is composed of transposable elements (TEs), only 5.6% of the FLcDNAs contained TE sequences in coding or UTR regions. Approximately 7.2% of the FLcDNAs are putative transcription factors, suggesting that rare transcripts are well-enriched in our FLcDNA set. Protein similarity searching identified 1,737 maize transcripts not present in rice, sorghum, Arabidopsis, or poplar annotated genes. A strict FLcDNA assembly generated 24,467 non-redundant sequences, of which 88% have non-maize protein matches. The FLcDNAs were also assembled with 41,759 FLcDNAs in GenBank from other projects, where semi-strict parameters were used to identify 13,368 potentially unique non-redundant sequences from this project. The libraries, ESTs, and FLcDNA sequences produced from this project are publicly available. The annotated EST and FLcDNA assemblies are available through the maize FLcDNA web resource (www.maizecdna.org)

    Genome-wide identification of new Wnt/β-catenin target genes in the human genome using CART method

    Background: The importance of in silico predictions for understanding cellular processes is now widely accepted, and a variety of algorithms useful for studying different biological features have been designed. In particular, the prediction of cis regulatory modules in non-coding human genome regions represents a major challenge for understanding gene regulation in several diseases. Recently, studies of the Wnt signaling pathway revealed a connection with neurodegenerative diseases such as Alzheimer's. In this article, we construct a classification tool that uses the transcription factor binding site motifs composition of some gene promoters to identify new Wnt/β-catenin pathway target genes potentially involved in brain diseases. Results: In this study, we propose 89 new Wnt/β-catenin pathway target genes predicted in silico by using a method based on multiple Classification and Regression Tree (CART) analysis. We used as decision variables the presence of transcription factor binding site motifs in the upstream region of each gene. This prediction was validated by RT-qPCR in a sample of 9 genes. As expected, LEF1, a member of the T-cell factor/lymphoid enhancer-binding factor family (TCF/LEF1), was relevant for the classification algorithm and, remarkably, other factors related directly or indirectly to the inflammatory response and amyloidogenic processes also appeared to be relevant for the classification. Among the 89 new Wnt/β-catenin pathway targets, we found a group expressed in brain tissue that could be involved in diverse responses to neurodegenerative diseases, like Alzheimer's disease (AD). These genes represent new candidates to protect cells against amyloid β toxicity, in agreement with the proposed neuroprotective role of the Wnt signaling pathway. Conclusions: Our multiple CART strategy proved to be an effective tool to identify new Wnt/β-catenin pathway targets based on the study of their regulatory regions in the human genome. In particular, several of these genes represent a new group of transcriptional dependent targets of the canonical Wnt pathway. The functions of these genes indicate that they are involved in pathophysiology related to Alzheimer's disease or other brain disorders.

    Representing Ontogeny Through Ontology: A Developmental Biologist’s Guide to The Gene Ontology

    Developmental biology, like many other areas of biology, has undergone a dramatic shift in the perspective from which developmental processes are viewed. Instead of focusing on the actions of a handful of genes or functional RNAs, we now consider the interactions of large functional gene networks and study how these complex systems orchestrate the unfolding of an organism, from gametes to adult. Developmental biologists are beginning to realize that understanding ontogeny on this scale requires the utilization of computational methods to capture, store and represent the knowledge we have about the underlying processes. Here we review the use of the Gene Ontology (GO) to study developmental biology. We describe the organization and structure of the GO and illustrate some of the ways we use it to capture the current understanding of many common developmental processes. We also discuss ways in which gene product annotations using the GO have been used to ask and answer developmental questions in a variety of model developmental systems. We provide suggestions as to how the GO might be used in more powerful ways to address questions about development. Our goal is to provide developmental biologists with enough background about the GO that they can begin to think about how they might use the ontology efficiently and in the most powerful ways possible

    BIOINFORMATICS TOOL DEVELOPMENT AND SEQUENCE ANALYSIS OF ROSACEAE FAMILY EXPRESSED SEQUENCE TAGS

    BACKGROUND: An international community of researchers has generated a significant number of Expressed Sequence Tags (ESTs) for the Rosaceae, an economically important plant family that includes most temperate fruits such as apple, cherry, peach, and strawberry as well as other commercially valuable members. ESTs are fragments of expressed genes that can be used for gene discovery, for developing markers for mapping, and for cultivar improvement via marker-assisted selection. Efficient dissemination and integration of these data is best facilitated through a centralized and curated database with associated sequence analysis tools. DESCRIPTION: The Genome Database for Rosaceae (GDR) was initiated to provide a curated and integrated web-based relational database for this family. I developed a key component of GDR to assemble and annotate the publicly available ESTs from the four main genera of the family (Prunus, Malus, Fragaria, Rosa). I created both genus-level and family-level unigenes using the software CAP3 after extensive filtering, trimming and assembly. Further analysis includes marker mining for single nucleotide polymorphisms (SNPs) and simple sequence repeats (SSRs) with putative primer identification, and oligo identification for potential microarray development. Functional genomics efforts are supported with sequence similarity searching against major protein and nucleotide databases, gene product ontology assignment, and protein motif identification. I deployed the entire project on the GDR with all data available for browsing, searching, and downloading. CONCLUSIONS: The GDR and its associated EST unigene project are meeting a major need for timely annotation and curation of sequence data for the Rosaceae community. The results of my analysis highlight major genes and pathways of interest including ripening, disease resistance, and transcription factors. The easily accessible pool of annotated coding sequences should further both functional and structural genomics characterization in Rosaceae. The unigene elucidates the levels of sequence similarity shared across different plant species and the implications for resource sharing across the family. GDR can be accessed at http://www.rosaceae.org/

    Clonal genome evolution of the marbled crayfish, Procambarus virginalis

    Marbled crayfish (Procambarus virginalis) are the only freshwater crayfish known to reproduce by cloning (apomictic parthenogenesis). Notably, among genetically identical offspring raised in the same environment, distinct phenotypic differences can be observed. These unique characteristics establish the marbled crayfish as a particularly interesting laboratory model. Additionally, parthenogenetic reproduction enables the marbled crayfish to rapidly spread and form stable populations, which poses a serious threat in many freshwater habitats. A further understanding of this organism requires the accessibility of its 3.5 Gbp large genome sequence. This doctoral thesis provides the first de novo genome assembly of the marbled crayfish. Multiple shotgun and long jumping distance libraries were generated from one individual female, with a single base coverage of over 100x. Sequencing data was used for a first genome assembly with a length weighted median scaffold size (N50) of over 40 kbp. The estimated genome wide heterozygosity rate of 0.53% is substantially higher compared to other arthropod genomes. Transcriptome data enabled the refinement of genetic structures. Eventually, a total of 87.8% complete and 7.4% fragmented single-copy arthropod orthologs were identified using the benchmarking software BUSCO. Single nucleotide variations were analyzed to verify clonality in geographically isolated populations. Results indicate an evolution from a single origin. Moreover, detailed insights into genotype distributions support the theory of asexual speciation by autopolyploidization. Comparison of three Procambarus species indicates detectable genetic separation between marbled crayfish and the closest relative Procambarus fallax. Automatic annotation of 21,000 genes using the annotation pipeline MAKER provides a detailed overview of genetic features. For example, a cellulase gene was identified which potentially plays a key role in omnivorousness. 
Genomic data and several online services are provided by a central web resource. This thesis provides detailed genetic insights into the unknown but very versatile order of decapod crustaceans. Considered economically and ecologically relevant keystone species, a representative genome sequence provides an important resource for future research
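
The "length weighted median scaffold size (N50)" quoted in this abstract has a short, standard definition that can be sketched directly. The scaffold lengths below are made-up numbers for illustration, not data from the thesis, and the function name `n50` is an assumption.

```python
def n50(lengths):
    """Largest length L such that scaffolds of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until half of the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")

# Hypothetical scaffold lengths (in kbp) for a toy assembly.
toy_lengths = [80, 70, 50, 40, 30, 20]
print(n50(toy_lengths))
```

For the toy lengths the total is 290, and the two longest scaffolds (80 + 70 = 150) already cover at least half of it, so the N50 is 70. Unlike a plain median, this statistic weights each scaffold by its length, so many short scaffolds lower it only slightly.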