
    Ion Torrent and Illumina, two complementary RNA-seq platforms for constructing the holm oak (Quercus ilex) transcriptome

    Transcriptome analysis is widely used in plant biology research to explore gene expression across a large variety of biological contexts, such as those related to environmental stress and plant-pathogen interaction. Currently, next-generation sequencing platforms are used to obtain the large amounts of raw data needed to build the transcriptome of any plant. Here, we compare the Illumina and Ion Torrent sequencing platforms for the construction and analysis of the holm oak (Quercus ilex) transcriptome. Genomic analysis of this forest tree species is a major challenge considering its recalcitrant character and the absence of previous molecular studies. In this study, Quercus ilex raw sequencing reads were obtained from Illumina and Ion Torrent and assembled by three different algorithms: MIRA, RAY and TRINITY. A hybrid transcriptome combining both sequencing technologies was also obtained. The RAY-hybrid assembly generated the most complete transcriptome (1,116 complete sequences, of which 1,085 were single copy) with an E90N50 of 1,122 bp. The MIRA-Illumina and TRINITY-Ion Torrent assemblies annotated the highest number of total transcripts (62,628 and 74,058, respectively). MIRA-Ion Torrent showed the highest number of shared sequences (84.8%) with the oak transcriptome. All the assembled transcripts from the hybrid transcriptome were annotated with gene ontology, grouping them in terms of biological processes, molecular functions and cellular components. In addition, an in silico proteomic analysis was carried out using the translated assemblies as databases. Those from Ion Torrent showed more proteins compared to the Illumina and hybrid assemblies. This newly generated transcriptome represents a valuable tool to conduct differential gene expression studies in response to biotic and abiotic stresses and to assist and validate the ongoing Q. ilex whole genome sequencing
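The E90N50 reported in the abstract above is a contiguity statistic. As a minimal sketch, assuming the common definition (N50 computed only over the most highly expressed transcripts that together account for 90% of total expression), it could be computed as follows. The function names and the toy lengths and expression values are hypothetical and are not taken from the study.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("no sequence lengths given")
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


def exn50(transcripts, fraction=0.9):
    """N50 over the top-expressed transcripts that account for `fraction` of total expression.

    transcripts: iterable of (length_bp, expression) pairs (hypothetical input layout).
    """
    ranked = sorted(transcripts, key=lambda t: t[1], reverse=True)
    target = fraction * sum(expr for _, expr in ranked)
    kept, running = [], 0.0
    for length, expr in ranked:
        if running >= target:
            break  # enough expression has already been covered
        kept.append(length)
        running += expr
    return n50(kept)


# Toy data, invented for illustration: (length in bp, expression value)
toy = [(1000, 50.0), (2000, 30.0), (500, 15.0), (300, 5.0)]
print(n50([length for length, _ in toy]))  # plain N50 over all transcripts
print(exn50(toy, fraction=0.9))            # E90N50 over the top-expressed subset
```

Restricting the statistic to well-expressed transcripts keeps many short, lowly expressed contigs from dominating the result, which is the usual motivation for reporting E90N50 alongside plain N50.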

    Integración de la Bioinformática en la investigación molecular en especies forestales: el caso de la encina (Quercus ilex)

    The term Bioinformatics, first coined by Paulien Hogeweg and Ben Hesper in 1970 to describe ‘the study of informatic processes in biotic systems’, can be defined as “research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioural or health data, including those to acquire, represent, describe, store, analyze, or visualize such data” or “the development and application of data-analytical and theoretical methods, mathematical modelling, and computational simulation techniques to the study of biological, behavioural, and social systems”. The first definition deals with biological information management, and the second with computational biology. The general objective and methodology of the current Thesis, “Integration of Bioinformatics to molecular research in forest species: the case of Holm oak (Quercus ilex)”, focus on the first definition. Bioinformatic tools (algorithms, programs, databases and repositories) have been used to construct the transcriptome, proteome and metabolome of Holm oak and to integrate them in order to define the metabolism and responses to drought in this species. Since the end of the last century, biological research has moved from a reductionist to a holistic paradigm, which has been possible thanks to great technological advances, especially in molecular biology. Thus, the appearance of platforms based on Next Generation Sequencing (NGS), for transcriptomics, and Mass Spectrometry (MS), for proteomics and metabolomics, has made it possible to obtain hundreds to thousands of data points in a single experiment, which cannot be managed and analysed without informatics tools. The use of high-throughput techniques and their combination with classic approaches is what defines “Systems Biology”. 
It not only analyses thousands and thousands of molecular entities of an individual, but also integrates them and creates predictive models. This is quite feasible with model organisms (e.g. Arabidopsis), but it is a real challenge for orphan and recalcitrant experimental systems such as Q. ilex. The study of this species is justified by its environmental and economic importance in Spain and because it faces a problem of increasing tree mortality associated with the decline syndrome, a situation that could worsen under a climate change scenario. Biotechnology can help to solve this problem through breeding programs based on marker-assisted selection of elite genotypes that are more tolerant and resistant to biotic and abiotic stresses and more resilient to climate change. As a continuation of the work carried out since 2004 by the research group “Agroforestry and Plant Biochemistry, Proteomics, and Systems Biology”, mostly based on classic biochemistry, physiology and proteomics, and considering that the Holm oak genome has not yet been sequenced and that no DNA or protein sequences are available in public databases, the first objective of the Thesis was the construction of the first reference transcriptome for this species. The work is presented in chapter 3 and has been published in Frontiers in Molecular Biosciences. For that purpose, the mRNA extracted from homogenized tissue from acorn embryo, leaves, and roots was sequenced using an Illumina HiSeq 2500 platform. Three different assemblers were employed: TRINITY, RAY, and MIRA. The assemblies obtained were aligned against the most accurate and phylogenetically closest transcriptome currently available, that of Quercus robur and Quercus petraea. MIRA generated more and longer contigs than RAY and TRINITY (MIRA>RAY>TRINITY). The MIRA assembly was therefore used to continue with the annotation of the Q. 
ilex transcriptome, which yielded 31973 sequences annotated by Blast2GO using Swiss-Prot as the reference database. As a continuation of the previous work, and as a second objective, a new sequencing platform, Ion Torrent, was evaluated for the construction and analysis of the Q. ilex transcriptome. The results are presented in chapter 4 and have already been published in PLoS ONE. Raw sequence reads obtained from Illumina and Ion Torrent were assembled by three different programs: MIRA, RAY and TRINITY. A hybrid transcriptome combining reads from both sequencing technologies was also assembled using RAY. The hybrid assembly generated the most complete transcriptome. The MIRA assembly of Ion Torrent reads showed the highest number of shared sequences (84.8%) with the oak transcriptome. In addition, an in silico proteomic analysis was carried out using the translated assemblies as databases. Those from Ion Torrent showed more proteins compared to the Illumina and hybrid assemblies. All the assembled transcripts from the hybrid transcriptome were annotated and grouped according to the corresponding biological processes, molecular functions and cellular components (Gene Ontology). This newly generated transcriptome represents a valuable tool to conduct differential gene expression studies in response to biotic and abiotic stresses and to assist and validate the ongoing Q. ilex whole genome sequencing. Using the above-mentioned plant sample, the transcriptomic (NGS-Illumina), proteomic (shotgun LC-MS/MS, Orbitrap), and metabolomic (GC-MS) profiles were analysed. Results are presented in chapter 5 and have already been published in Frontiers in Plant Science. The annotated Q. ilex transcriptome was compared against the complete in silico proteomes of Arabidopsis thaliana (UP000006548), Oryza sativa subsp. 
japonica (UP000059680), Populus trichocarpa (UP000006729), and Eucalyptus grandis (UP000030711) in order to identify the unique and shared sequences. Also, the EC numbers of each proteome were compared to obtain a complete picture of the differences in metabolic pathway coverage among the proteomes of the previously mentioned species. The descriptive analysis and the visualization of data on a gene-by-gene basis on schematic diagrams (maps) of the biological processes described in Mapman resulted in the identification of around 62629 transcripts, 2380 protein species, and 62 metabolites. Data were compared with those reported for model plant species whose genomes have been sequenced and well annotated, including Arabidopsis, japonica rice, poplar, and eucalyptus. Integrating this large amount of data with bioinformatics tools allowed the Holm oak metabolic network to be partially reconstructed. Of the 127 metabolic pathways reported in the KEGG pathway database, 123 can be visualized using the described methodology. They include carbohydrate and energy metabolism, amino acid metabolism, lipid metabolism, nucleotide metabolism, and biosynthesis of secondary metabolites. The TCA cycle was the most represented pathway, with 5 out of 10 metabolites, 6 out of 8 protein enzymes, and 8 out of 8 enzyme transcripts. On the other hand, gaps (missing pathways) included the metabolism of terpenoids and polyketides and lipid metabolism. The multi-omics resource generated in this work will set the basis for ongoing and future studies, bringing the Holm oak closer to model species. As a final objective of the current Thesis, an integrated transcriptomics and proteomics analysis of the response to drought in Q. ilex seedlings has been carried out. Seedlings were subjected to drought conditions by water withholding, and leaf tissue was sampled at two time points, 20 and 25 days. 
RNA and proteins were extracted and analysed by RNA-seq (Illumina) and proteomics (LC-MS/MS, Orbitrap). Data are presented in chapter 6, which corresponds to a manuscript to be submitted for publication. Gene products were identified and quantified at transcript and protein levels, establishing correlations between the abundance of each transcript and that of the corresponding protein. Gene ontology (GO) analysis was performed to classify identified transcripts and proteins in terms of biological process, molecular function and cellular component. A multivariate analysis of the total and variable datasets at transcript and protein levels was performed with mixOmics. To acquire an integrated visualization of Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway maps, total transcript and protein datasets, specifying the variable transcripts and proteins, were analysed by Paintomics 3 (v0.4.5), considering Arabidopsis thaliana as a model reference. Pathways with p-value […] RAY>Trinity). Therefore, the sequences assembled with MIRA were used to continue with the annotation of the Q. ilex transcriptome, which resulted in 31973 annotated sequences obtained by Blast2GO using Swiss-Prot as the reference database. As a continuation of the work described in chapter 4, and as a second objective, a new sequencing platform, Ion Torrent, was evaluated for the construction and analysis of the Q. ilex transcriptome. The results obtained have been published in PLoS ONE. As in the previous chapter, the reads obtained from Illumina and Ion Torrent were assembled using three different programs, MIRA, RAY and TRINITY. The MIRA assembly of Illumina reads and the TRINITY assembly of Ion Torrent reads generated the highest number of annotated transcripts (62628 and 74058, respectively). The MIRA assembly of Ion Torrent reads generated the highest number of sequences shared with the oak transcriptome (84.8%). 
RAY generated the best results in terms of the number and length of contigs, with an E90N50 value of 1122 bp. All transcripts of the new reference transcriptome were annotated and grouped in terms of Gene Ontology ("Biological Process", "Cellular Component" and "Molecular Function"). This transcriptome was translated in silico, yielding a protein database that will be used in proteomics experiments for the identification of gene products. The use of this database notably increased the number of protein species identified and the confidence parameters of the identification. From the databases generated and the multi-omics data obtained from a holm oak sample consisting of a pool of extracts from different tissues (embryo, leaf and root), different metabolic pathways were reconstructed as they occur in Q. ilex. The results are presented in chapter 5 and have been published in Frontiers in Plant Science. RNA, proteins and metabolites were extracted independently from the same sample, and the omics profile was established by NGS-Illumina (RNA), shotgun LC-MS/MS, Orbitrap (proteins) and GC-MS (metabolites). A total of 62629 transcripts, 2380 protein species and 62 metabolites were identified. Gene products corresponding to enzymes were identified by comparison with reference genomes including Arabidopsis thaliana (UP000006548), Oryza sativa subsp. japonica (UP000059680), Populus trichocarpa (UP000006729), and Eucalyptus grandis (UP000030711). Of the 127 metabolic pathways described in KEGG, 123 were visualized using Mapman, among them those of energy, carbohydrate, amino acid, lipid, nucleotide and secondary metabolism. The tricarboxylic acid (TCA) cycle was the best-represented pathway, with 5 out of 10 metabolites, 6 out of 8 enzyme proteins and 8 out of 8 transcripts. 
On the other hand, some pathways were not observed or were very poorly represented, such as those of lipid, terpenoid and polyketide metabolism. As a final objective of this thesis, an integrated transcriptomic and proteomic analysis of the response to drought in Q. ilex seedlings was carried out. The results are presented in chapter 6, corresponding to a manuscript that will be submitted for publication. Q. ilex seedlings were grown in pots with perlite and subjected to drought conditions by withholding water for 30 days. Leaf samples were taken at two time points, when leaf fluorescence had decreased by 30% and 50% (20 and 25 days). After RNA and protein extraction, the samples were analysed by RNA-seq (Illumina) and shotgun proteomics (LC-MS/MS, Orbitrap). The RNA-seq analysis generated 47868 transcripts corresponding to 21000 unigenes, with 3588 qualitative or quantitative differences between irrigated and non-irrigated seedlings (1149 up-regulated and 2439 down-regulated). Shotgun proteomics identified 4008 proteoforms, products of 2767 different genes; of these, 640 showed qualitative or quantitative differences in abundance between treatments (353 more and 287 less abundant under drought conditions). The variable gene products were classified in terms of Gene Ontology (biological process, molecular function and cellular component) and, in the case of enzymes, into KEGG metabolic pathways. The variable dataset was subjected to multivariate statistical analysis, PCA and sPLS. Finally, GeneMANIA was used to build interaction networks. There were important changes in the gene expression pattern, with the stress response and chloroplast groups being the most affected. Regarding metabolic pathways, changes were detected in protein synthesis, photosynthesis, carbohydrates, amino acids and phenolics. 
There were transient changes (observed at a single time point) and permanent changes (common to both time points), detected at the transcript and/or protein level. The number of variable gene products detected by both platforms was minimal; they included RPS2, 4CL2, PSB28 and RIN4. From the set of variable transcripts and proteins, two interaction networks were built: the first included the up-regulated genes CLPB2, CLPB3, HSP70, HSP17.4, FtsH6, AT1G23740, SMT1 and UGP3 and the down-regulated genes ABA2, RPS1, ADK and RPL4, and the second included the up-regulated genes CLPB2, CLPB3, HSP70, HSP17.4, FtsH6, AT1G23740, AP1, INVE, AT4G2740, CAD4, FEN1 and HIPP27 and the down-regulated gene ABA2. Genes that were up-regulated at both time points and detected at both the transcript and protein level are proposed as marker genes of drought response and tolerance in holm oak. Only a number of genes meet these criteria, among them possible proteins of response…

    Informatics for RNA sequencing: A web resource for analysis on the cloud

    Massively parallel RNA sequencing (RNA-seq) has rapidly become the assay of choice for interrogating RNA transcript abundance and diversity. This article provides a detailed introduction to fundamental RNA-seq molecular biology and informatics concepts. We make available open-access RNA-seq tutorials that cover cloud computing, tool installation, relevant file formats, reference genomes, transcriptome annotations, quality-control strategies, expression, differential expression, and alternative splicing analysis methods. These tutorials and additional training resources are accompanied by complete analysis pipelines and test datasets made available without encumbrance at www.rnaseq.wiki

    Are we there yet? : reliably estimating the completeness of plant genome sequences

    Genome sequencing is becoming cheaper and faster thanks to the introduction of next-generation sequencing techniques. Dozens of new plant genome sequences have been released in recent years, ranging from small to gigantic repeat-rich or polyploid genomes. Most genome projects have a dual purpose: delivering a contiguous, complete genome assembly and creating a full catalog of correctly predicted genes. Frequently, the completeness of a species' gene catalog is measured using a set of marker genes that are expected to be present. This expectation can be defined along an evolutionary gradient, ranging from highly conserved genes to species-specific genes. Large-scale population resequencing studies have revealed that gene space is fairly variable even between closely related individuals, which limits the definition of the expected gene space, and, consequently, the accuracy of estimates used to assess genome and gene space completeness. We argue that, based on the desired applications of a genome sequencing project, different completeness scores for the genome assembly and/or gene space should be determined. Using examples from several dicot and monocot genomes, we outline some pitfalls and recommendations regarding methods to estimate completeness during different steps of genome assembly and annotation
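The marker-gene approach described in the abstract above can be illustrated with a small sketch. It assumes a deliberately simplified scoring scheme in which each expected marker gene is classified only by how many copies were found. The function name, category labels and gene identifiers are hypothetical, and real completeness tools use more detailed criteria.

```python
def marker_completeness(expected_markers, copies_found):
    """Classify expected marker genes by copy number and return fractions.

    expected_markers: iterable of marker gene IDs expected to be present.
    copies_found: dict mapping marker ID -> number of copies detected
                  (markers absent from the dict count as missing).
    """
    expected = set(expected_markers)
    if not expected:
        raise ValueError("expected marker set is empty")
    single = sum(1 for m in expected if copies_found.get(m, 0) == 1)
    duplicated = sum(1 for m in expected if copies_found.get(m, 0) > 1)
    missing = len(expected) - single - duplicated
    total = len(expected)
    return {
        "single_copy": single / total,
        "duplicated": duplicated / total,
        "missing": missing / total,
    }


# Invented example: four expected markers, one duplicated, one not found
scores = marker_completeness(["g1", "g2", "g3", "g4"], {"g1": 1, "g2": 2, "g3": 1})
print(scores)
```

Because the expected set is itself a modelling choice, the same assembly can score differently under a highly conserved marker set than under a lineage-specific one, which is the caveat the abstract raises.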

    RNA‐seq: Applications and Best Practices

    RNA‐sequencing (RNA‐seq) is the state‐of‐the‐art technique for transcriptome analysis that takes advantage of high‐throughput next‐generation sequencing. Although it is a powerful approach, RNA‐seq poses major challenges at every step, with numerous caveats. Many experimental options are currently available, and a thorough understanding of each step is critical to make the right decisions and avoid inconclusive results. A complete workflow consists of: (1) experimental design; (2) sample and library preparation; (3) sequencing; and (4) data analysis. RNA‐seq enables a wide range of applications such as the discovery of novel genes, gene/transcript quantification, and differential expression and functional analysis. This chapter covers the main aspects from sample preparation to downstream data analysis. It discusses how to obtain high‐quality samples, the number of replicates, library preparation, sequencing platforms and coverage, focusing on recommended best practices based on the specialized literature. Basic techniques and well‐known algorithms are presented and discussed, guiding both beginners and experienced users in the implementation of reliable experiments

    A comprehensive comparison of RNA-Seq-based transcriptome analysis from reads to differential gene expression and cross-comparison with microarrays: a case study in Saccharomyces cerevisiae

    RNA-seq has recently become an attractive method of choice in the study of transcriptomes, promising several advantages compared with microarrays. In this study, we sought to assess the contribution of the different analytical steps involved in the analysis of RNA-seq data generated with the Illumina platform, and to perform a cross-platform comparison based on the results obtained with Affymetrix microarrays. As a case study for our work, we used the Saccharomyces cerevisiae strain CEN.PK 113-7D, grown under two different conditions (batch and chemostat). Here, we assess the influence of genetic variation on the estimation of gene expression level using three different aligners for read-mapping (Gsnap, Stampy and TopHat) on the S288c genome, and the capabilities of five different statistical methods to detect differential gene expression (baySeq, Cuffdiff, DESeq, edgeR and NOISeq), and we explore the consistency between RNA-seq analysis using a reference genome and a de novo assembly approach. High reproducibility among biological replicates (correlation >= 0.99) and high consistency between the two platforms for analysis of gene expression levels (correlation >= 0.91) are reported. The results of differential gene expression identification derived from the different statistical methods, as well as their integrated analysis results based on gene ontology annotation, are in good agreement. Overall, our study provides a useful and comprehensive comparison between the two platforms (RNA-seq and microarrays) for gene expression analysis and addresses the contribution of the different steps involved in the analysis of RNA-seq data
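The reproducibility figures in the abstract above are correlations between expression profiles. As a minimal sketch, assuming Pearson correlation on log2-transformed values (the abstract does not state which correlation measure or transformation was used), a replicate-to-replicate comparison could look like this. The expression values are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def log2_plus_one(values):
    """log2(x + 1) transform, a common choice for expression values (assumed here)."""
    return [math.log2(v + 1) for v in values]


# Invented per-gene expression values for two biological replicates
replicate_a = [10.0, 250.0, 0.0, 1200.0, 35.0]
replicate_b = [12.0, 240.0, 1.0, 1100.0, 40.0]
r = pearson(log2_plus_one(replicate_a), log2_plus_one(replicate_b))
print(f"replicate correlation: {r:.3f}")
```

The log transform keeps a handful of very highly expressed genes from dominating the coefficient; a rank-based measure such as Spearman correlation would be a reasonable alternative.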

    A survey of best practices for RNA-seq data analysis.

    RNA-sequencing (RNA-seq) has a wide variety of applications, but no single analysis pipeline can be used in all cases. We review all of the major steps in RNA-seq data analysis, including experimental design, quality control, read alignment, quantification of gene and transcript levels, visualization, differential gene expression, alternative splicing, functional analysis, gene fusion detection and eQTL mapping. We highlight the challenges associated with each step. We discuss the analysis of small RNAs and the integration of RNA-seq with other functional genomics techniques. Finally, we discuss the outlook for novel technologies that are changing the state of the art in transcriptomics. This is the final published version. It first appeared at http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8

    Micro and Macro-Evolutionary Studies in Non-Model Species: a Transcriptomic Perspective in Teleosts

    246 p. Genomic resources and bioinformatic tools are very scarce for non-model species. RNA-Seq can be an effective tool to generate such genomic resources in order to reveal the genetic variation and functions needed in micro- and macro-evolutionary studies. The first section of this Thesis addresses the presence or absence of the Eastern and Western phylogroups in two cultured populations of Tinca tinca in Central Europe. Our study supports the hypothesis that the individuals analysed are a genomic mosaic of both phylogroups, and that the differences between the two breeds are due to the initial phylogroup composition at the time of their founding. The second section of this Thesis describes EXFI, a method that uses state-of-the-art algorithms to split transcripts into exons. The third section presents an optimized sampling method for transcript characterization. I applied the multi-tissue strategy to sample, sequence and assemble the most exhaustive transcriptome of the European sardine. In the last section, I studied the evolutionary relationships within the Clupeiformes. I compiled a large set of transcriptomes from twelve species, built their phylogenetic tree, and discovered groups of genes under positive selection. The main conclusion is that evolution has shaped the molecular machinery of the Clupeiformes towards improved lipid storage and transport