unknown

Bioinformatics assisted breeding, from QTL to candidate genes

Abstract

Over the last decade, the amount of data generated by a single run of a NGS sequencer outperforms days of work done with Sanger sequencing. Metabolomics, proteomics and transcriptomics technologies have also involved producing more and more information at an ever faster rate. In addition, the number of databases available to biologists and breeders is increasing every year. The challenge for them becomes two-fold, namely: to cope with the increased amount of data produced by these new technologies and to cope with the distribution of the information across the Web. An example of a study with a lot of ~omics data is described in Chapter 2, where more than 600 peaks have been measured using liquid chromatography mass-spectrometry (LCMS) in peel and flesh of a segregating F1apple population. In total, 669 mQTL were identified in this study. The amount of mQTL identified is vast and almost overwhelming. Extracting meaningful information from such an experiment requires appropriate data filtering and data visualization techniques. The visualization of the distribution of the mQTL on the genetic map led to the discovery of QTL hotspots on linkage group: 1, 8, 13 and 16. The mQTL hotspot on linkage group 16 was further investigated and mainly contained compounds involved in the phenylpropanoid pathway. The apple genome sequence and its annotation were used to gain insight in genes potentially regulating this QTL hotspot. This led to the identification of the structural gene leucoanthocyanidin reductase (LAR1) as well as seven genes encoding transcription factors as putative candidates regulating the phenylpropanoid pathway, and thus candidates for the biosynthesis of health beneficial compounds. However, this study also indicated bottlenecks in the availability of biologist-friendly tools to visualize large-scale QTL mapping results and smart ways to mine genes underlying QTL intervals. In this thesis, we provide bioinformatics solutions to allow exploration of regions of interest on the genome more efficiently. In Chapter 3, we describe MQ2, a tool to visualize results of large-scale QTL mapping experiments. It allows biologists and breeders to use their favorite QTL mapping tool such as MapQTL or R/qtl and visualize the distribution of these QTL among the genetic map used in the analysis with MQ2. MQ2provides the distribution of the QTL over the markers of the genetic map for a few hundreds traits. MQ2is accessible online via its web interface but can also be used locally via its command line interface. In Chapter 4, we describe Marker2sequence (M2S), a tool to filter out genes of interest from all the genes underlying a QTL. M2S returns the list of genes for a specific genome interval and provides a search function to filter out genes related to the provided keyword(s) by their annotation. Genome annotations often contain cross-references to resources such as the Gene Ontology (GO), or proteins of the UniProt database. Via these annotations, additional information can be gathered about each gene. By integrating information from different resources and offering a way to mine the list of genes present in a QTL interval, M2S provides a way to reduce a list of hundreds of genes to possibly tens or less of genes potentially related to the trait of interest. Using semantic web technologies M2S integrates multiple resources and has the flexibility to extend this integration to more resources as they become available to these technologies. Besides the importance of efficient bioinformatics tools to analyze and visualize data, the work in Chapter 2also revealed the importance of regulatory elements controlling key genes of pathways. The limitation of M2S is that it only considers genes within the interval. In genome annotations, transcription factors are not linked to the trait (keyword) and to the gene it controls, and these relationships will therefore not be considered. By integrating information about the gene regulatory network of the organism into Marker2sequence, it should be able to integrate in its list of genes, genes outside of the QTL interval but regulated by elements present within the QTL interval. In tomato, the genome annotation already lists a number of transcription factors, however, it does not provide any information about their target. In Chapter 5, we describe how we combined transcriptomics information with six genotypes from an Introgression Line (IL) population to find genes differentially expressed while being in a similar genomic background (i.e.: outside of any introgression segments) as the reference genotype (with no introgression). These genes may be differentially expressed as a result of a regulatory element present in an introgression. The promoter regions of these genes have been analyzed for DNA motifs, and putative transcription factor binding sites have been found. The approaches taken in M2S (Chaper 4) are focused on a specific region of the genome, namely the QTL interval. In Chapter 6, we generalized this approach to develop Annotex. Annotex provides a simple way to browse the cross-references existing between biological databases (ChEBI, Rhea, UniProt, GO) and genome annotations. The main concept of Annotex being, that from any type of data present in the databases, one can navigate the cross-references to retrieve the desired type of information. This thesis has resulted in the production of three tools that biologists and breeders can use to speed up their research and build new hypothesis on. This thesis also revealed the state of bioinformatics with regards to data integration. It also reveals the need for integration into annotations (for example, genome annotations, protein annotations, and pathway annotations) of more ontologies than just the Gene Ontology (GO) currently used. Multiple platforms are arising to build these new ontologies but the process of integrating them into existing resources remains to be done. It also confirms the state of the data in plants where multiples resources may contain overlapping. Finally, this thesis also shows what can be achieved when the data is made inter-operable which should be an incentive to the community to work together and build inter-operable, non-overlapping resources, creating a bioinformatics Web for plant research.</p

    Similar works