11 research outputs found

    CGKB: an annotation knowledge base for cowpea (Vigna unguiculata L.) methylation filtered genomic genespace sequences

    Background: Cowpea [Vigna unguiculata (L.) Walp.] is one of the most important food and forage legumes in the semi-arid tropics because of its ability to tolerate drought and grow on poor soils. It is cultivated mostly by poor farmers in developing countries, with 80% of production taking place in the dry savannah of tropical West and Central Africa. Cowpea is largely an underexploited crop with relatively little genomic information available for use in applied plant breeding. The goal of the Cowpea Genomics Initiative (CGI), funded by the Kirkhouse Trust, a UK-based charitable organization, is to leverage modern molecular genetic tools for gene discovery and cowpea improvement. One aspect of the initiative is sequencing the gene-rich region of the cowpea genome (termed the genespace), recovered using methylation filtration technology, and annotating and analyzing the resulting sequence data.
    Description: CGKB, Cowpea Genespace/Genomics Knowledge Base, is an annotation knowledge base developed under the CGI. The database is based on information derived from 298,848 cowpea genespace sequences (GSS) isolated by methylation filtering of genomic DNA. The CGKB consists of three knowledge bases: a GSS annotation and comparative genomics knowledge base, a GSS enzyme and metabolic pathway knowledge base, and a GSS simple sequence repeats (SSRs) knowledge base for molecular marker discovery. A homology-based approach was applied for annotation of the GSS, mainly using BLASTX against four public FASTA-formatted protein databases (NCBI GenBank Proteins, UniProtKB-Swiss-Prot, UniProtKB-PIR (Protein Information Resource), and UniProtKB-TrEMBL). Comparative genome analysis was done by BLASTX searches of the cowpea GSS against four plant proteomes from Arabidopsis thaliana, Oryza sativa, Medicago truncatula, and Populus trichocarpa. The possible exons and introns on each cowpea GSS were predicted using the HMM-based Genscan gene prediction program, and the potential domains on annotated GSS were analyzed using the HMMER package against the Pfam database. The annotated GSS were also assigned Gene Ontology annotation terms and integrated with 228 curated plant metabolic pathways from the Arabidopsis Information Resource (TAIR) knowledge base. The UniProtKB-Swiss-Prot ENZYME database was used to assign putative enzymatic function to each GSS. Each GSS was also analyzed with the Tandem Repeat Finder (TRF) program in order to identify potential SSRs for molecular marker discovery. The raw sequence data, processed annotation, and SSR results were stored in relational tables designed in key-value pair fashion using a PostgreSQL relational database management system. The biological knowledge derived from the sequence data and processed results is represented as views or materialized views in the relational database management system. All materialized views are indexed for quick data access and retrieval. Data processing and analysis pipelines were implemented using the Perl programming language. The web interface was implemented in JavaScript and Perl CGI running on an Apache web server. The CPU-intensive data processing and analysis pipelines were run on a computer cluster of more than 30 dual-processor Apple XServes. A job management system called Vela was created as a robust way to submit large numbers of jobs to the Portable Batch System (PBS).
    Conclusion: CGKB is an integrated and annotated resource for cowpea GSS with features of homology-based and HMM-based annotations, enzyme and pathway annotations, GO term annotation, toolkits, and a large number of other facilities to perform complex queries. The cowpea GSS, chloroplast sequences, mitochondrial sequences, retroelements, and SSR sequences are available as FASTA-formatted files and downloadable at CGKB. This database and web interface are publicly accessible at http://cowpeagenomics.med.virginia.edu/CGKB/.

    TOBFAC: the database of tobacco transcription factors

    Background: Regulation of gene expression at the level of transcription is a major control point in many biological processes. Transcription factors (TFs) can activate and/or repress the transcriptional rate of target genes, and vascular plant genomes devote approximately 7% of their coding capacity to TFs. Global analysis of TFs has only been performed for three complete higher plant genomes: Arabidopsis (Arabidopsis thaliana), poplar (Populus trichocarpa) and rice (Oryza sativa). Presently, no large-scale analysis of TFs has been made from a member of the Solanaceae, one of the most important families of vascular plants. To fill this void, we have analysed tobacco (Nicotiana tabacum) TFs using a dataset of 1,159,022 gene-space sequence reads (GSRs) obtained by methylation filtering of the tobacco genome. An analytical pipeline was developed to isolate TF sequences from the GSR data set. This involved multiple (typically 10–15) independent searches with different versions of the TF family-defining domain(s) (normally the DNA-binding domain) followed by assembly into contigs and verification. Our analysis revealed that tobacco contains a minimum of 2,513 TFs representing all of the 64 well-characterised plant TF families. The number of TFs in tobacco is higher than previously reported for Arabidopsis and rice.
    Results: TOBFAC: the database of tobacco transcription factors, is an integrative database that provides a portal to sequence and phylogeny data for the identified TFs, together with a large quantity of other data concerning TFs in tobacco. The database contains an individual page dedicated to each of the 64 TF families. Each page contains background information, domain architecture via Pfam links, a list of all sequences and an assessment of the minimum number of TFs in this family in tobacco. Downloadable phylogenetic trees of the major families are provided along with detailed information on the bioinformatic pipeline that was used to find all family members. TOBFAC also contains EST data, a list of published tobacco TFs and a list of papers concerning tobacco TFs. The sequences and annotation data are stored in relational tables using a PostgreSQL relational database management system. The data processing and analysis pipelines used the Perl programming language. The web interface was implemented in JavaScript and Perl CGI running on an Apache web server. The computationally intensive data processing and analysis pipelines were run on an Apple XServe cluster with more than 20 nodes.
    Conclusion: TOBFAC is an expandable knowledgebase of tobacco TFs with data currently available for over 2,513 TFs from 64 gene families. TOBFAC integrates available sequence information, phylogenetic analysis, and EST data with published reports on tobacco TF function. The database provides a major resource for the study of gene expression in tobacco and the Solanaceae and helps to fill a current gap in studies of TF families across the plant kingdom. TOBFAC is publicly accessible at http://compsysbio.achs.virginia.edu/tobfac/.

    Sequencing and analysis of the gene-rich space of cowpea

    Background: Cowpea, Vigna unguiculata (L.) Walp., is one of the most important food and forage legumes in the semi-arid tropics because of its drought tolerance and ability to grow on poor quality soils. Approximately 80% of cowpea production takes place in the dry savannahs of tropical West and Central Africa, mostly by poor subsistence farmers. Despite its economic and social importance in the developing world, cowpea remains to a large extent an underexploited crop. Among the major goals of cowpea breeding and improvement programs is the stacking of desirable agronomic traits, such as disease and pest resistance and response to abiotic stresses. Implementation of marker-assisted selection and breeding programs is severely limited by a paucity of trait-linked markers and a general lack of information on gene structure and organization. With a nuclear genome size estimated at ~620 Mb, the cowpea genome is an ideal target for reduced representation sequencing.
    Results: We report here the sequencing and analysis of the gene-rich, hypomethylated portion of the cowpea genome selectively cloned by methylation filtration (MF) technology. Over 250,000 gene-space sequence reads (GSRs) with an average length of 610 bp were generated, yielding ~160 Mb of sequence information. The GSRs were assembled, annotated by BLAST homology searches of four public protein annotation databases and four plant proteomes (A. thaliana, M. truncatula, O. sativa, and P. trichocarpa), and analyzed using various domain and gene modeling tools. A total of 41,260 GSR assemblies and singletons were annotated, of which 19,786 have unique GenBank accession numbers. Within the GSR dataset, 29% of the sequences were annotated using the Arabidopsis Gene Ontology (GO), with the largest categories of assigned function being catalytic activity and metabolic processes, groups that include the majority of cellular enzymes and components of amino acid, carbohydrate and lipid metabolism. A total of 5,888 GSRs had homology to genes encoding transcription factors (TFs) and transcription associated factors (TAFs), representing about 5% of the total annotated sequences in the dataset. Sixty-two of the 64 well-characterized plant TF gene families are represented in the cowpea GSRs, and these families are of similar size and phylogenetic organization to those characterized in other plants. The cowpea GSRs also provide a rich source of genes involved in photoperiodic control, symbiosis, and defense-related responses. Comparisons to available databases revealed that about 74% of cowpea ESTs and 70% of all legume ESTs were represented in the GSR dataset. As approximately 12% of all GSRs contain an identifiable simple-sequence repeat, the dataset is a powerful resource for the design of microsatellite markers.
    Conclusion: The public availability of extensive genomic data for cowpea, a non-model legume with significant importance in the developing world, represents a significant step forward in legume research. Not only does the gene space sequence enable the detailed analysis of gene structure, gene family organization and phylogenetic relationships within cowpea, but it also facilitates the characterization of syntenic relationships with other cultivated and model legumes, and will contribute to determining patterns of chromosomal evolution in the Leguminosae. The micro- and macrosyntenic relationships detected between cowpea and other cultivated and model legumes should simplify the identification of informative markers for marker-assisted trait selection and map-based gene isolation necessary for cowpea improvement.

    Tobacco Transcription Factors: Novel Insights into Transcriptional Regulation in the Solanaceae

    No full text
    Tobacco (Nicotiana tabacum) is a member of the Solanaceae, one of the agronomically most important groups of flowering plants. We have performed an in silico analysis of 1.15 million gene-space sequence reads from the tobacco nuclear genome and report the detailed analysis of more than 2,500 tobacco transcription factors (TFs). The tobacco genome contains at least one member of each of the 64 well-characterized TF families identified in sequenced vascular plant genomes, indicating that evolution of the Solanaceae was not associated with the gain or loss of TF families. However, we found notable differences between tobacco and non-Solanaceae species in TF family size and evidence for both tobacco- and Solanaceae-specific subfamily expansions. Compared with TF families from sequenced plant genomes, tobacco has a higher proportion of ERF/AP2, C2H2 zinc finger, homeodomain, GRF, TCP, zinc finger homeodomain, BES, and STERILE APETALA (SAP) genes and novel subfamilies of BES, C2H2 zinc finger, SAP, and NAC genes. Members of the novel NAC subfamily, termed TNACS, appear restricted to the Solanaceae: they are absent from currently sequenced plant genomes but present in tomato (Solanum lycopersicum), pepper (Capsicum annuum), and potato (Solanum tuberosum). They constitute approximately 25% of NAC genes in tobacco. Based on our phylogenetic studies, we predict that many of the more than 50 tobacco group IX ERF genes are involved in jasmonate responses. Consistent with this, over two-thirds of group IX ERF genes tested showed increased mRNA levels following jasmonate treatment. Our data are a major resource for the Solanaceae and fill a void in studies of TF families across the plant kingdom.

    Pipeline for the identification of tobacco TF sequences from the 1,159,022 GSRs

    No full text
    Copyright information: Taken from "TOBFAC: the database of tobacco transcription factors", http://www.biomedcentral.com/1471-2105/9/53. BMC Bioinformatics 2008;9:53. Published online 25 Jan 2008. PMCID: PMC2246155.

    Snapshot of the page showing the results of BLAST searches with the tobacco TF contigs against 46,546 tobacco EST sequences from the ESTobacco project 18

    No full text
    Copyright information: Taken from "TOBFAC: the database of tobacco transcription factors", http://www.biomedcentral.com/1471-2105/9/53. BMC Bioinformatics 2008;9:53. Published online 25 Jan 2008. PMCID: PMC2246155.
    A total of 557 of the transcription factor contigs are supported by EST sequences.