    PSR: Polymorphic SSR retrieval

    BACKGROUND: With the advent of high-throughput sequencing technologies, large-scale identification of microsatellites became affordable and was especially directed to non-model species. By contrast, few efforts have been published toward the automatic identification of polymorphic microsatellites by exploiting sequence redundancy. Few tools for genotyping microsatellite repeats have been implemented so far that are able to manage large amounts of sequence data and handle the SAM/BAM file format. Most of them have been developed for and tested on human or model organisms with high quality reference genomes. RESULTS: In this note we describe polymorphic SSR retrieval (PSR), a read counter and simple sequence repeat (SSR) length polymorphism detection tool. It is written in Perl and was developed to identify length polymorphisms in perfect microsatellites exploiting next generation sequencing (NGS) data. PSR has been developed bearing in mind plant non-model species for which de novo transcriptome assembly is generally the first sequence resource available to be used for SSR-mining. PSR is divided into two modules: the read-counting module (PSR_read_retrieval) identifies all the reads that cover the full length of perfect microsatellites; the comparative module (PSR_poly_finder) detects both heterozygous and homozygous alleles at each microsatellite locus across all genotypes under investigation. Two threshold values to call a length polymorphism and reduce the number of false positives can be defined by the user: the minimum number of reads overlapping the repetitive stretch and the minimum read depth. The first parameter determines if the microsatellite-containing sequence must be processed or not, while the second one is decisive for the identification of minor alleles. PSR was tested on two different case studies. 
The first study aims at the identification of polymorphic SSRs in a set of de novo assembled transcripts defined by RNA-sequencing of two different plant genotypes. The second research activity aims to investigate sequence variations within a collection of newly sequenced chloroplast genomes. In both cases PSR results are in agreement with those obtained by capillary gel separation. CONCLUSION: PSR has been specifically developed from the need to automate the gene-based and genome-wide identification of polymorphic microsatellites from NGS data. It overcomes the limits related to the existing and time-consuming efforts based on tools developed in the pre-NGS era
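
The two thresholds described in the abstract above can be illustrated with a minimal Python sketch. This is not PSR's code (PSR is written in Perl), and the function names, parameter names and default values here are hypothetical. It assumes that the repeat length observed in each read fully spanning a locus has already been extracted, for example by a read-counting step.

```python
from collections import Counter

def call_alleles(repeat_lengths, min_reads=10, min_depth=3):
    """Call allele lengths at one locus for one genotype.

    repeat_lengths: repeat length seen in each read spanning the locus.
    min_reads: hypothetical name for the minimum number of reads
        overlapping the repeat; below it the locus is not processed.
    min_depth: hypothetical name for the minimum read depth an allele
        needs to be kept, which decides whether minor alleles are called.
    """
    if len(repeat_lengths) < min_reads:
        return None  # too few spanning reads: skip this locus
    counts = Counter(repeat_lengths)
    return sorted(length for length, depth in counts.items() if depth >= min_depth)

def is_polymorphic(calls_by_genotype):
    """A locus is polymorphic if more than one allele length is called overall."""
    alleles = set()
    for called in calls_by_genotype.values():
        if called:  # ignore skipped loci (None) and empty calls
            alleles.update(called)
    return len(alleles) > 1

calls = {
    "genotype_A": call_alleles([12] * 9 + [14]),      # minor allele below min_depth
    "genotype_B": call_alleles([12] * 5 + [14] * 6),  # heterozygous call
    "genotype_C": call_alleles([12] * 4),             # skipped: too few reads
}
print(calls, is_polymorphic(calls))
```

In this sketch a genotype with two surviving allele lengths is read as heterozygous and one with a single length as homozygous. Raising `min_depth` removes low-support alleles, at the cost of missing true minor alleles.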

    NGS-based genotyping, high-throughput phenotyping and genome-wide association studies laid the foundations for next-generation breeding in horticultural crops

    Demographic trends and changes to climate require a more efficient use of plant genetic resources in breeding programs. Indeed, the release of high-yielding varieties has resulted in crop genetic erosion and loss of diversity. This has produced an increased susceptibility to severe stresses and a reduction of several food quality parameters. Next generation sequencing (NGS) technologies are being increasingly used to explore “gene space” and to provide high-resolution profiling of nucleotide variation within germplasm collections. On the other hand, advances in high-throughput phenotyping are bridging the genotype-to-phenotype gap in crop selection. The combination of allelic and phenotypic data points via genome-wide association studies is facilitating the discovery of genetic loci that are associated with key agronomic traits. In this review, we provide a brief overview of the latest NGS-based and phenotyping technologies and of their role in unlocking the genetic potential of vegetable crops; then, we discuss the paradigm shift that is underway in horticultural crop breeding

    Gene models from ESTs (GeneModelEST): an application on the Solanum lycopersicum genome

    Background: The structure annotation of a genome is based either on ab initio methodologies or on similarity searches versus molecules that have been already annotated. Ab initio gene predictions in a genome are based on a priori knowledge of species-specific features of genes. The training of ab initio gene finders is based on the definition of a data-set of gene models. To accomplish this task the common approach is to align species-specific full length cDNA and EST sequences along the genomic sequences in order to define exon/intron structure of mRNA coding genes. Results: GeneModelEST is the software here proposed for defining a data-set of candidate gene models using exclusively evidence derived from cDNA/EST sequences. GeneModelEST requires the genome coordinates of the spliced-alignments of ESTs and of contigs (tentative consensus sequences) generated by an EST clustering/assembling procedure to be formatted in a General Feature Format (GFF) standard file. Moreover, the alignments of the contigs versus a protein database are required as an NCBI BLAST formatted report file. The GeneModelEST analysis aims to i) evaluate each exon as defined from contig spliced alignments onto the genome sequence; ii) classify the contigs according to quality levels in order to select candidate gene models; iii) assign to the candidate gene models preliminary functional annotations. We discuss the application of the proposed methodology to build a data-set of gene models of Solanum lycopersicum, whose genome sequencing is an ongoing effort by the International Tomato Genome Sequencing Consortium. Conclusion: The contig classification procedure used by GeneModelEST supports the detection of candidate gene models, the identification of potential alternative transcripts and it is useful to filter out ambiguous information. 
An automated procedure, such as the one proposed here, is fundamental to support large scale analysis in order to provide species-specific gene models, that could be useful as a training data-set for ab initio gene finders and/or as a reference gene list for a human curated annotation

    ParPEST: a pipeline for EST data analysis based on parallel computing

    BACKGROUND: Expressed Sequence Tags (ESTs) are short and error-prone DNA sequences generated from the 5' and 3' ends of randomly selected cDNA clones. They provide an important resource for comparative and functional genomic studies and, moreover, represent a reliable source of information for the annotation of genomic sequences. Because of the advances in biotechnologies, ESTs are produced daily in the form of large datasets. Therefore, suitable and efficient bioinformatic approaches are necessary to organize data related information content for further investigations. RESULTS: We implemented ParPEST (Parallel Processing of ESTs), a pipeline based on parallel computing for EST analysis. The results are organized in a suitable data warehouse to provide a starting point to mine expressed sequence datasets. The collected information is useful for investigations on data quality and on data information content, enriched also by a preliminary functional annotation. CONCLUSION: The pipeline presented here has been developed to perform an exhaustive and reliable analysis on EST data and to provide a curated set of information based on a relational database. Moreover, it is designed to reduce execution time of the specific steps required for a complete analysis using distributed processes and parallelized software. It is conceived to run on hardware with low requirements, to meet the increasing demand typical of the data used, and to scale at affordable costs

    TomatEST database: in silico exploitation of EST data to explore expression patterns in tomato species

    TomatEST is a secondary database integrating expressed sequence tag (EST)/cDNA sequence information from different libraries of multiple tomato species. Redundant EST collections from each species are organized into clusters (gene indices). A cluster consists of one or multiple contigs. Multiple contigs in a cluster represent alternatively transcribed forms of a gene. The set of stand-alone EST sequences (singletons) and contigs, representing all the computationally defined ‘Transcript Indices’, are annotated according to similarity versus protein and RNA family databases. Sequence function description is integrated with the Gene Ontologies and the Enzyme Commission identifiers for a standard classification of gene products and for the mapping of the expressed sequences onto metabolic pathways. Information on the origin of the ESTs, on their structural features, on clusters and contigs, as well as on functional annotations is accessible via a user-friendly web interface. Specific facilities in the database allow Transcript Indices from a query to be automatically classified in Enzyme classes and in metabolic pathways. The ‘on the fly’ mapping onto the metabolic maps is integrated in the analytical tools. The TomatEST database website is freely available at

    SolEST database: a "one-stop shop" approach to the study of Solanaceae transcriptomes

    Background: Since no genome sequences of solanaceous plants have yet been completed, expressed sequence tag (EST) collections represent a reliable tool for broad sampling of Solanaceae transcriptomes, an attractive route for understanding Solanaceae genome functionality and a powerful reference for the structural annotation of emerging Solanaceae genome sequences. Description: We describe the SolEST database http://biosrv.cab.unina.it/solestdb which integrates different EST datasets from both cultivated and wild Solanaceae species and from two species of the genus Coffea. Background as well as processed data contained in the database, extensively linked to external related resources, represent an invaluable source of information for these plant families. Two novel features differentiate SolEST from other resources: i) the option of accessing and then visualizing Solanaceae EST/TC alignments along the emerging tomato and potato genome sequences; ii) the opportunity to compare different Solanaceae assemblies generated by diverse research groups in the attempt to address a common complaint in the SOL community. Conclusion: Different databases have been established worldwide for collecting Solanaceae ESTs and are related in concept, content and utility to the one presented herein. However, the SolEST database has several distinguishing features that make it appealing for the research community and facilitates a "one-stop shop" for the study of Solanaceae transcriptomes.

    An EST database from saffron stigmas

    BACKGROUND: Saffron (Crocus sativus L., Iridaceae) flowers have been used as a spice and medicinal plant ever since the Greek-Minoan civilization. The edible part - the stigmas - are commonly considered the most expensive spice in the world and are the site of a peculiar secondary metabolism, responsible for the characteristic color and flavor of saffron. RESULTS: We produced 6,603 high quality Expressed Sequence Tags (ESTs) from a saffron stigma cDNA library. This collection is accessible and searchable through the Saffron Genes database http://www.saffrongenes.org. The ESTs have been grouped into 1,893 Clusters, each corresponding to a different expressed gene, and annotated. The complete set of raw EST sequences, as well as of their electropherograms, are maintained in the database, allowing users to investigate sequence qualities and EST structural features (vector contamination, repeat regions). The saffron stigma transcriptome contains a series of interesting sequences (putative sex determination genes, lipid and carotenoid metabolism enzymes, transcription factors). CONCLUSION: The Saffron Genes database represents the first reference collection for the genomics of Iridaceae, for the molecular biology of stigma biogenesis, as well as for the metabolic pathways underlying saffron secondary metabolism