Integration of Bioinformatics to molecular research in forest species: the case of Holm oak (Quercus ilex)

Abstract

The term Bioinformatics, coined by Paulien Hogeweg and Ben Hesper in 1970 to describe ‘the study of informatic processes in biotic systems’, can be defined as “research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioural or health data, including those to acquire, represent, describe, store, analyze, or visualize such data” or as “the development and application of data-analytical and theoretical methods, mathematical modelling, and computational simulation techniques to the study of biological, behavioural, and social systems”. The first definition deals with biological information management, and the second with computational biology. The general objective and methodology of the current Thesis, “Integration of Bioinformatics to molecular research in forest species: the case of Holm oak (Quercus ilex)”, focus on the first definition. Bioinformatic tools (algorithms, programs, databases and repositories) have been used to construct the transcriptome, proteome and metabolome of Holm oak and to integrate them in order to define the metabolism and the responses to drought in this species. Since the end of the last century, biological research has moved from a reductionist to a holistic paradigm, which has been possible thanks to great technological advances, especially in molecular biology. Thus, the appearance of platforms based on Next Generation Sequencing (NGS), for transcriptomics, and Mass Spectrometry (MS), for proteomics and metabolomics, has made it possible to obtain from hundreds to thousands of data points in a single experiment, whose management and analysis are impossible without informatics tools. The use of high-throughput techniques and their combination with classic approaches is what defines “Systems Biology”.
It does not only analyse thousands of molecular entities of an individual, but also integrates them and builds predictive models. This is quite feasible with model organisms (e.g. Arabidopsis), but it is a real challenge for orphan and recalcitrant experimental systems such as Q. ilex. The study of this species is justified by its environmental and economic importance in Spain and because it faces increasing tree mortality associated with the decline syndrome, a situation that can worsen in a climate change scenario. Biotechnology can contribute to solving this problem through breeding programs based on marker-assisted selection of elite genotypes that are more tolerant and resistant to biotic and abiotic stresses and more resilient to climate change. The research group “Agroforestry and Plant Biochemistry, Proteomics, and Systems Biology” has worked on this species since 2004, mostly using classic biochemistry, physiology and proteomics. As a continuation of that work, and considering that the Holm oak genome has not yet been sequenced and that no DNA or protein sequences are available in public databases, the first objective of the Thesis was the construction of the first reference transcriptome for this species. The work is presented in chapter 3 and has been published in Frontiers in Molecular Biosciences. For that purpose, the mRNA extracted from homogenized tissue from acorn embryo, leaves, and roots was sequenced using an Illumina HiSeq 2500 platform. Three different assemblers were employed: TRINITY, RAY, and MIRA. The assemblies obtained were aligned against the most accurate and phylogenetically closest transcriptome currently available, that of Quercus robur and Quercus petraea. MIRA generated more and longer contigs than RAY and TRINITY (MIRA>RAY>TRINITY). So, the MIRA assembly was used to continue with the annotation of the Q. 
ilex transcriptome, which yielded 31973 annotated sequences with Blast2GO using Swiss-Prot as the reference database. As a continuation of the previous work, and as a second objective, a new sequencing platform, Ion Torrent, was evaluated for the construction and analysis of the Q. ilex transcriptome. The results are presented in chapter 4 and have already been published in PLoS ONE. Raw sequence reads obtained from Illumina and Ion Torrent were assembled with three different programs: MIRA, RAY and TRINITY. A hybrid transcriptome combining reads from both sequencing technologies was also assembled using RAY. The hybrid assembly generated the most complete transcriptome. The MIRA assembly of Ion Torrent reads showed the highest proportion of sequences shared with the oak transcriptome (84.8%). In addition, an in silico proteomic analysis was carried out using the translated assemblies as databases. Those from Ion Torrent yielded more proteins than the Illumina and hybrid assemblies. All the assembled transcripts from the hybrid transcriptome were annotated and grouped according to the corresponding biological processes, molecular functions and cellular components (Gene Ontology). This newly generated transcriptome is a valuable tool for differential gene expression studies in response to biotic and abiotic stresses and for assisting and validating the ongoing Q. ilex whole genome sequencing. Using the above-mentioned plant sample, the transcriptomic (NGS-Illumina), proteomic (shotgun LC-MS/MS, Orbitrap), and metabolomic (GC-MS) profiles were analysed. Results are presented in chapter 5 and have already been published in Frontiers in Plant Science. The annotated Q. ilex transcriptome was compared against the complete in silico proteomes of Arabidopsis thaliana (UP000006548), Oryza sativa subsp. 
japonica (UP000059680), Populus trichocarpa (UP000006729), and Eucalyptus grandis (UP000030711) in order to identify the unique and shared sequences. The EC numbers of each proteome were also compared to obtain a complete picture of the differences in metabolic pathway coverage among the proteomes of these species. The descriptive analysis, together with the gene-by-gene visualization of data on schematic diagrams (maps) of the biological processes described in MapMan, resulted in the identification of 62629 transcripts, 2380 protein species, and 62 metabolites. Data were compared with those reported for model plant species whose genomes have been sequenced and well annotated, including Arabidopsis, japonica rice, poplar, and eucalyptus. The integration of this large amount of data using bioinformatics tools allowed the Holm oak metabolic network to be partially reconstructed. Of the 127 metabolic pathways reported in the KEGG pathway database, 123 could be visualized with the described methodology. They included carbohydrate and energy metabolism, amino acid metabolism, lipid metabolism, nucleotide metabolism, and biosynthesis of secondary metabolites. The TCA cycle was the best-represented pathway, with 5 out of 10 metabolites, 6 out of 8 protein enzymes, and 8 out of 8 enzyme transcripts. On the other hand, the gaps (missing pathways) included the metabolism of terpenoids and polyketides and lipid metabolism. The multi-omics resource generated in this work will set the basis for ongoing and future studies, bringing the Holm oak closer to model species. As a final objective of the current Thesis, an integrated transcriptomics and proteomics analysis of the response to drought in Q. ilex seedlings has been carried out. Seedlings were subjected to drought by withholding water, and leaf tissue was sampled at two time points, 20 and 25 days.
RNA and proteins were extracted and analysed by RNA-seq (Illumina) and shotgun proteomics (LC-MS/MS, Orbitrap). Data are presented in chapter 6, which corresponds to a manuscript to be submitted for publication. Gene products were identified and quantified at the transcript and protein levels, and correlations were established between the abundance of each transcript and that of the corresponding protein. Gene Ontology (GO) analysis was performed to classify the identified transcripts and proteins in terms of biological process, molecular function and cellular component. A multivariate analysis of the total and variable datasets at the transcript and protein levels was performed with mixOmics. To obtain an integrated visualization of Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway maps, the total transcript and protein datasets, with the variable transcripts and proteins flagged, were analysed with PaintOmics 3 (v0.4.5), using Arabidopsis thaliana as the model reference. Pathways with p-value [...]

[...] RAY>Trinity). Therefore, the sequences assembled with MIRA were used to continue with the annotation of the Q. ilex transcriptome, which resulted in 31973 annotated sequences obtained with Blast2GO using Swiss-Prot as the reference database. As a continuation of the work described in chapter 4, and as a second objective, a new sequencing platform, Ion Torrent, was evaluated for the construction and analysis of the Q. ilex transcriptome. The results have been published in PLoS ONE. As in the previous chapter, the reads obtained from Illumina and Ion Torrent were assembled using three different programs: MIRA, RAY and TRINITY. The MIRA assembly of Illumina reads and the TRINITY assembly of Ion Torrent reads generated the highest numbers of annotated transcripts (62628 and 74058, respectively). The MIRA assembly of Ion Torrent reads generated the highest proportion of sequences shared with the oak transcriptome (84.8%).
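Comparisons between assemblers in terms of contig number and length are commonly summarized with the N50 statistic. The following is a minimal sketch of that calculation, not the procedure used in the thesis; the function names `n50` and `summarize` and the toy contig lengths are hypothetical and serve only as illustration.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    contain at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


def summarize(assemblies):
    """Map each assembly name to (number of contigs, N50)."""
    return {name: (len(lengths), n50(lengths))
            for name, lengths in assemblies.items()}


# Toy contig lengths in bp; hypothetical values, not data from the thesis.
toy_assemblies = {
    "assembler_A": [2, 2, 2, 3, 3, 4, 8, 8],
    "assembler_B": [500, 400, 300, 200, 100],
}
summary = summarize(toy_assemblies)
```

In practice the lengths would be read from each assembler's FASTA output, and the same summary would be computed for every assembly being compared.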
RAY gave the best results in terms of number and length of contigs, with E90N50 values of 1122 bp. All transcripts of the new reference transcriptome were annotated and grouped by Gene Ontology terms ("Biological Process", "Cellular Component" and "Molecular Function"). This transcriptome was translated in silico, yielding a protein database that will be used in proteomics experiments for the identification of gene products. The use of this database notably increased the number of protein species identified and the confidence parameters of the identification. From the databases generated and the multi-omics data obtained from a Holm oak sample consisting of a pool of extracts from different tissues (embryo, leaf and root), different metabolic pathways were reconstructed as they occur in Q. ilex. The results are presented in chapter 5 and have been published in Frontiers in Plant Science. RNA, proteins and metabolites were extracted independently from the same sample, and the omics profile was established by NGS-Illumina (RNA), shotgun LC-MS/MS, Orbitrap (proteins) and GC-MS (metabolites). In total, 62629 transcripts, 2380 protein species and 62 metabolites were identified. Gene products corresponding to enzymes were identified by comparison with reference genomes, including Arabidopsis thaliana (UP000006548), Oryza sativa subsp. japonica (UP000059680), Populus trichocarpa (UP000006729), and Eucalyptus grandis (UP000030711). Of the 127 metabolic pathways described in KEGG, 123 were visualized using MapMan, among them those of energy, carbohydrate, amino acid, lipid, nucleotide and secondary metabolism. The tricarboxylic acid (TCA) cycle was the best-represented pathway, with 5 of 10 metabolites, 6 of 8 enzyme proteins and 8 of 8 transcripts.
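The coverage figures reported for the TCA cycle and for the KEGG pathways can be expressed as percentages per omics layer. The sketch below uses only the counts stated in the text; the helper name `coverage` and the dictionary layout are assumptions made for illustration.

```python
def coverage(detected, expected):
    """Percentage of expected pathway members that were detected."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return 100.0 * detected / expected


# (detected, expected) per omics layer for the TCA cycle, as reported in the text.
tca_counts = {
    "metabolites": (5, 10),
    "protein enzymes": (6, 8),
    "enzyme transcripts": (8, 8),
}
tca_coverage = {layer: coverage(found, total)
                for layer, (found, total) in tca_counts.items()}

# 123 of the 127 KEGG metabolic pathways could be visualized.
kegg_coverage = coverage(123, 127)

for layer, pct in tca_coverage.items():
    print(f"TCA {layer}: {pct:.1f}%")
print(f"KEGG pathways visualized: {kegg_coverage:.1f}%")
```

Expressed this way, the transcript layer covers the TCA cycle completely, while the metabolite layer is the limiting one.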
On the other hand, some pathways were not observed or were poorly represented, such as those of lipid, terpenoid and polyketide metabolism. As the final objective of this Thesis, an integrated transcriptomic and proteomic analysis of the response to drought in Q. ilex seedlings was carried out. The results are presented in chapter 6, which corresponds to a manuscript to be submitted for publication. Q. ilex seedlings were grown in pots with perlite and subjected to drought by withholding irrigation for 30 days. Leaves were sampled at two time points, when leaf fluorescence had decreased by 30% and 50% (20 and 25 days). After RNA and protein extraction, the samples were analysed by RNA-Seq (Illumina) and shotgun proteomics (LC-MS/MS, Orbitrap). The RNA-seq analysis generated 47868 transcripts corresponding to 21000 unigenes, with 3588 qualitative or quantitative differences between irrigated and non-irrigated seedlings (1149 up-regulated and 2439 down-regulated). Shotgun proteomics identified 4008 proteoforms, products of 2767 different genes; of these, 640 showed qualitative or quantitative differences in abundance between treatments (353 more and 287 less abundant under drought). The variable gene products were classified in terms of Gene Ontology (biological process, molecular function and cellular component) and, in the case of enzymes, into KEGG metabolic pathways. The variable dataset was subjected to multivariate statistical analysis (PCA and sPLS). Finally, GeneMANIA was used to build interaction networks. There were important changes in the gene expression pattern, with the stress response and chloroplast groups being the most affected. Regarding metabolic pathways, changes were detected in protein synthesis, photosynthesis, carbohydrates, amino acids and phenolics.
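The split of variable gene products into more and less abundant groups can be illustrated with a small tally. This is a sketch under stated assumptions: the use of log2 fold changes, the threshold of 1.0, the function name `classify_changes` and the toy gene names are all hypothetical and are not taken from the thesis. Only the reported counts in `reported` come from the text.

```python
from collections import Counter


def classify_changes(log2_fold_changes, threshold=1.0):
    """Count gene products as up, down or unchanged.

    The log2 fold-change criterion and its threshold are assumptions.
    """
    counts = Counter({"up": 0, "down": 0, "unchanged": 0})
    for value in log2_fold_changes.values():
        if value >= threshold:
            counts["up"] += 1
        elif value <= -threshold:
            counts["down"] += 1
        else:
            counts["unchanged"] += 1
    return counts


# Toy drought vs. control log2 fold changes; hypothetical values.
toy_changes = {"gene_a": 2.3, "gene_b": -1.7, "gene_c": 0.2, "gene_d": -3.0}
toy_counts = classify_changes(toy_changes)

# Counts reported in the text, checked for internal consistency.
reported = {
    "transcripts": {"up": 1149, "down": 2439, "total": 3588},
    "proteins": {"up": 353, "down": 287, "total": 640},
}
consistent = {level: c["up"] + c["down"] == c["total"]
              for level, c in reported.items()}
```

The consistency check confirms that the reported up and down counts add up to the stated totals at both the transcript and the protein level.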
There were transient changes (observed at a single time point) and permanent changes (common to both time points), detected at the transcript and/or protein level. The number of variable gene products detected by both platforms was minimal; they included RPS2, 4CL2, PSB28 and RIN4. From the set of variable transcripts and proteins, two interaction networks were built. The first included the up-regulated genes CLPB2, CLPB3, HSP70, HSP17.4, FtsH6, AT1G23740, SMT1 and UGP3 and the down-regulated genes ABA2, RPS1, ADK and RPL4. The second included the up-regulated genes CLPB2, CLPB3, HSP70, HSP17.4, FtsH6, AT1G23740, AP1, INVE, AT4G2740, CAD4, FEN1 and HIPP27 and the down-regulated gene ABA2. Genes that were up-regulated at both time points and detected at both the transcript and protein levels are proposed as marker genes of drought response and tolerance in Holm oak. Only a number of genes meet these criteria, including possible response proteins [...]
