    Bioinformatics of the sugarcane EST project

    The Sugarcane EST project (SUCEST) produced 291,904 expressed sequence tags (ESTs) in a consortium that involved 74 sequencing and data mining laboratories. We, the Bioinformatics Laboratory (LBI), created a web site for this project that served as a "meeting point" for receiving, processing, analyzing, and providing services to help explore the sequence data. In this paper we describe the information pathway that we implemented to support this project, including the data, services, and programs provided, and give a brief explanation of the clustering procedure, which resulted in 43,141 clusters. Funding: Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP); Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq).

    The libraries that made SUCEST

    Large-scale sequencing of sugarcane expressed sequence tags (ESTs) was carried out as a first step in depicting the genome of this important tropical crop. Twenty-six unidirectional cDNA libraries were constructed from a variety of tissues sampled from thirteen different sugarcane cultivars. A total of 291,689 cDNA clones were sequenced in their 5′ and 3′ end regions. After trimming low-quality sequences and removing vector and ribosomal RNA sequences, 237,954 ESTs potentially derived from protein-encoding messenger RNA (mRNA) remained. The average insert size across all libraries was estimated to be 1,250 bp, with insert length varying from 500 to 5,000 bp. Clustering the 237,954 sugarcane ESTs resulted in 43,141 clusters, of which 38% had no matches with existing sequences in the public databases. Around 53% of the clusters were formed by ESTs expressed in at least two libraries, while 47% were formed by ESTs expressed in only one library. A global analysis of the ESTs indicated that around 33% contain cDNA clones with full-length inserts.
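
    The library-sharing figures above (about 53% of clusters containing ESTs from at least two libraries, 47% from a single library) are the kind of statistic that can be tallied from a cluster membership table. The following is a minimal Python sketch, assuming a hypothetical list of (cluster ID, library name) pairs, one per EST. The function name, cluster IDs, and library names are invented for illustration; they are not SUCEST data, and this is not the authors' pipeline.

```python
from collections import defaultdict


def library_sharing(est_assignments):
    """Return (multi_library_fraction, single_library_fraction) over clusters.

    est_assignments: iterable of (cluster_id, library_name) pairs, one per EST.
    """
    # Collect the distinct libraries that contribute ESTs to each cluster.
    libraries_per_cluster = defaultdict(set)
    for cluster_id, library in est_assignments:
        libraries_per_cluster[cluster_id].add(library)

    total = len(libraries_per_cluster)
    if total == 0:
        raise ValueError("no clusters given")

    # A cluster counts as shared when its ESTs come from two or more libraries.
    multi = sum(1 for libs in libraries_per_cluster.values() if len(libs) >= 2)
    return multi / total, (total - multi) / total


# Toy input (hypothetical): four clusters, two of them spanning two libraries.
toy = [
    ("c1", "leaf"), ("c1", "root"),
    ("c2", "leaf"), ("c2", "leaf"),
    ("c3", "stem"),
    ("c4", "root"), ("c4", "flower"),
]
multi_fraction, single_fraction = library_sharing(toy)
```

    Sets are used per cluster so that several ESTs from the same library count once, which matches the "expressed in at least two libraries" wording.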

    Genomic synteny between sorghum and sugarcane inferred from a BAC pool sequencing

    Advisor: Paulo Arruda. Thesis (doctorate), Universidade Estadual de Campinas, Instituto de Biologia.
    Abstract: The genomic sequencing of plants has accelerated in recent years, mainly due to advances in next-generation sequencing technologies capable of generating a high volume of data at ever lower cost. However, the sequencing and assembly of plant genomes remains a major challenge because of the high complexity of these genomes, most of which have a high degree of ploidy and a large proportion of repetitive sequences. Sequencing libraries produced from plant genomic DNA cloned into BAC (bacterial artificial chromosome) vectors can be an effective strategy for complex genomes, because it breaks the assembly task into smaller problems. Typical BAC libraries contain DNA fragments of 100 to 200 kilobases, which together cover the cloned genome several times. Even with the new sequencing technologies, the cost of sequencing BAC libraries is still high, because in most cases each BAC is sequenced individually from its isolated DNA. An alternative is to sequence pools containing hundreds of randomly sampled BACs, which would decrease the cost in proportion to the number of BACs pooled. In this work we developed a model for sequencing and assembling BAC pools from a library prepared from a commercial sugarcane variety. A pool of 178 BACs from sugarcane variety SP80-3280 was sequenced using Illumina HiSeq 2000 and PacBio technologies and assembled using different sets of software. Because the pool was a sample of randomly selected BACs, it was possible to assemble 2,451 scaffolds corresponding to 88.2% of the estimated total size of the BACs in the pool. The completeness of the assembly was verified in several ways, including analysis of the number of BACs assembled with the expected size, comparison with sugarcane BACs deposited in NCBI, and the collinearity and gene order between sugarcane scaffolds and sorghum chromosomes. Scaffolds larger than 2 kb were aligned to the sorghum genome; in general, the alignments showed a uniform distribution over the 10 sorghum chromosomes, indicating the randomness of the sampling. From the syntenic analysis between sugarcane scaffolds and sorghum chromosomes, we found that the monoploid sugarcane genome seems to be more contracted than the sorghum genome. Overall, the study showed that it is possible to sequence BAC pools from highly complex plant genomes with high ploidy levels, such as the sugarcane genome.
    Degree: Doctorate. Area: Bioinformatics. Title: Doctor in Genetics and Molecular Biology.

    Generation, annotation, analysis and database integration of 16,500 white spruce EST clusters

    BACKGROUND: The sequencing and analysis of ESTs is for now the only practical approach for large-scale gene discovery and annotation in conifers because their very large genomes are unlikely to be sequenced in the near future. Our objective was to produce extensive collections of ESTs and cDNA clones to support manufacture of cDNA microarrays and gene discovery in white spruce (Picea glauca [Moench] Voss). RESULTS: We produced 16 cDNA libraries from different tissues and a variety of treatments, and partially sequenced 50,000 cDNA clones. High-quality 3' and 5' reads were assembled into 16,578 consensus sequences, 45% of which represented full-length inserts. Consensus sequences derived from 5' and 3' reads of the same cDNA clone were linked to define 14,471 transcripts. A large proportion (84%) of the spruce sequences matched a pine sequence, but only 68% of the spruce transcripts had homologs in Arabidopsis or rice. Nearly all the sequences that matched the Populus trichocarpa genome (the only sequenced tree genome) also matched rice or Arabidopsis genomes. We used several sequence similarity search approaches for assignment of putative functions, including BLAST searches against general and specialized databases (transcription factors, cell wall related proteins), Gene Ontology term assignment, and Hidden Markov Model searches against Pfam protein families and domains. In total, 70% of the spruce transcripts displayed matches to proteins of known or unknown function in the UniRef100 database (blastx e-value < 1e-10). We identified multigenic families that appeared larger in spruce than in the Arabidopsis or rice genomes. Detailed analysis of translationally controlled tumour proteins and S-adenosylmethionine synthetase families confirmed a twofold size difference. Sequences and annotations were organized in a dedicated database, SpruceDB. Several search tools were developed to mine the data based either on their occurrence in the cDNA libraries or on functional annotations. CONCLUSION: This report illustrates specific approaches for large-scale gene discovery and annotation in an organism that is very distantly related to any of the fully sequenced genomes. The ArboreaSet sequences and cDNA clones represent a valuable resource for investigations ranging from plant comparative genomics to applied conifer genetics.
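
    The annotation threshold quoted above (blastx e-value < 1e-10) can be applied to tabular BLAST output in a few lines. The sketch below assumes the standard 12-column tabular layout (the default for `-outfmt 6` in BLAST+), where the query ID is column 1 and the e-value is column 11. The abstract does not say which BLAST version or output format was used, so the format, the function name, and the sample rows are assumptions for illustration only.

```python
def annotated_queries(tabular_lines, max_evalue=1e-10):
    """Return the set of query IDs with at least one hit below max_evalue.

    tabular_lines: iterable of tab-separated BLAST rows in the 12-column
    layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart,
    qend, sstart, send, evalue, bitscore).
    """
    matched = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        # Strict inequality, mirroring "e-value < 1e-10" in the text.
        if float(fields[10]) < max_evalue:
            matched.add(fields[0])
    return matched


# Invented sample rows: only t1 has a hit strictly below the threshold.
sample = [
    "t1\tP1\t85.0\t200\t30\t0\t1\t600\t1\t200\t1e-50\t180",
    "t2\tP2\t40.0\t50\t30\t0\t1\t150\t1\t50\t0.001\t35",
    "t3\tP3\t60.0\t120\t48\t0\t1\t360\t1\t120\t1e-10\t70",
]
matched = annotated_queries(sample)
fraction_annotated = len(matched) / 3  # 3 query transcripts in the toy set
```

    Dividing the number of matched queries by the total number of transcripts gives the kind of percentage reported in the abstract (70% for the spruce set).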

    An EST-based analysis identifies new genes and reveals distinctive gene expression features of Coffea arabica and Coffea canephora

    Background: Coffee is one of the world’s most important crops; it is consumed worldwide and plays a significant role in the economy of producing countries. Coffea arabica and C. canephora are responsible for 70 and 30% of commercial production, respectively. C. arabica is an allotetraploid from a recent hybridization of the diploid species C. canephora and C. eugenioides. C. arabica has lower genetic diversity and yields a higher quality beverage than C. canephora. Research initiatives have been launched to produce genomic and transcriptomic data about Coffea spp. as a strategy to improve breeding efficiency. Results: Assembling the expressed sequence tags (ESTs) of C. arabica and C. canephora produced by the Brazilian Coffee Genome Project and the Nestlé-Cornell Consortium revealed 32,007 clusters of C. arabica and 16,665 clusters of C. canephora. We detected different GC3 profiles between these species that are related to their genome structure and mating system. BLAST analysis revealed similarities between coffee and grape (Vitis vinifera) genes. Using Ka/Ks analysis, we identified coffee genes under purifying and positive selection. Protein domain and gene ontology analyses suggested differences between Coffea spp. data, mainly in relation to complex sugar synthases and nucleotide binding proteins. OrthoMCL was used to identify specific and prevalent coffee protein families when compared to five other plant species. Among the interesting families annotated are new cystatins, glycine-rich proteins and RALF-like peptides. Hierarchical clustering was used to independently group C. arabica and C. canephora expression clusters according to expression data extracted from EST libraries, resulting in the identification of differentially expressed genes. Based on these results, we emphasize gene annotation and discuss plant defenses, abiotic stress and cup quality-related functional categories. Conclusion: We present the first comprehensive genome-wide transcript profile study of C. arabica and C. canephora, which can be freely accessed by the scientific community at http://www.lge.ibi.unicamp.br/coffea. Our data reveal the presence of species-specific/prevalent genes in coffee that may help to explain particular characteristics of these two crops. The identification of differentially expressed transcripts offers a starting point for the correlation between gene expression profiles and Coffea spp. developmental traits, providing valuable insights for coffee breeding and biotechnology, especially concerning sugar metabolism and stress tolerance.
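
    Two of the measures named above have simple definitions that a short sketch can make concrete: GC3 is the fraction of G or C bases at third codon positions, and a Ka/Ks ratio below 1 is conventionally read as purifying selection while a ratio above 1 is read as positive selection. The Python below is a minimal illustration under those textbook definitions. The function names and the example sequence are hypothetical, and the abstract does not state which tools or cutoffs the authors used.

```python
def gc3(cds):
    """GC fraction at third codon positions of an in-frame coding sequence."""
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    thirds = [seq[i] for i in range(2, usable, 3)]
    counted = [base for base in thirds if base in "ACGT"]  # skip N and gaps
    if not counted:
        raise ValueError("no unambiguous third-position bases")
    return sum(base in "GC" for base in counted) / len(counted)


def selection_class(ka, ks):
    """Classify a gene pair by its Ka/Ks ratio (conventional reading)."""
    if ks <= 0:
        raise ValueError("Ks must be positive to form a ratio")
    ratio = ka / ks
    if ratio > 1:
        return "positive"
    if ratio < 1:
        return "purifying"
    return "neutral"


# Hypothetical sequence: codons ATG GCC AAA TTC have third bases G, C, A, C.
example_gc3 = gc3("ATGGCCAAATTC")
example_class = selection_class(0.02, 0.2)
```

    In practice Ka and Ks come from a codon-aware alignment and a substitution model, so the classifier here only covers the final interpretation step.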