Comparative analysis of plant genomes through data integration
When we started our research in 2008, several online resources for genomics existed, each with a different focus. TAIR (The Arabidopsis Information Resource) focuses on the plant model species Arabidopsis thaliana and had, at that time, little or no support for evolutionary or comparative genomics. Ensembl provided some basic tools and functions as a data warehouse, but it would only start incorporating plant genomes in 2010. At that time, however, no online resource provided the data content and tools for plant comparative and evolutionary genomics that we required. As such, the plant community was missing an essential component for bringing its research to the same level as the biomedicine-oriented research communities. We started to work on PLAZA to provide such a data resource, one that could be accessed by the plant community and that also contained the data content needed to support our research group’s focus on evolutionary genomics.
The platform for comparative and evolutionary genomics, which we named PLAZA, was developed from scratch (i.e. not based on an existing database schema, such as that of Ensembl). The next step was to gather the data for all species, parse it into a common format and upload it into the database. We developed a processing pipeline, based on sequence similarity measurements, to group genes into gene families and subfamilies. Functional annotation was gathered both from the original data providers and through InterPro scans, combined with InterPro2GO. This primary data was then ready to be used in every subsequent analysis. Building such a database was sufficient for research within our bioinformatics group, but the goal was to provide a comprehensive resource for all plant biologists with an interest in comparative and evolutionary genomics. The next step was therefore to design and create a user-friendly, visually appealing web interface connected to our database. While the most detailed information is commonly presented in data tables, graphics, images and charts are often used to visualize trends and general statistics, and also appear in specific tools. Design and development of these tools and visualizations is thus one of the core elements of my PhD. The PLAZA platform was designed as a gene-centric data resource, which is easily navigated when a biologist wants to study a relatively small number of genes. However, using the default PLAZA website to retrieve information for dozens of genes quickly becomes very tedious. Therefore, a ’gene set’-centric extra layer was developed in which user-defined gene sets can be analyzed quickly. This extra layer, called the PLAZA workbench, functions on top of the normal PLAZA website, which means that only gene sets from species present in the PLAZA database can be analyzed directly.
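The gene family step mentioned above can be illustrated with a minimal sketch. It assumes that the similarity search has already produced scored gene pairs, and it groups genes by single-linkage clustering with a union-find structure. This is an illustrative stand-in, not the PLAZA pipeline itself; the function name, the gene identifiers and the score threshold are all hypothetical.

```python
from collections import defaultdict


def cluster_gene_families(similarities, min_score):
    """Single-linkage grouping (an assumption, not the PLAZA method).

    similarities: iterable of (gene_a, gene_b, score) tuples.
    Genes connected by any chain of hits with score >= min_score
    end up in the same family; all other genes stay as singletons.
    """
    parent = {}

    def find(gene):
        # Register unseen genes as their own root, then walk up to the root.
        parent.setdefault(gene, gene)
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]  # path halving
            gene = parent[gene]
        return gene

    for gene_a, gene_b, score in similarities:
        root_a, root_b = find(gene_a), find(gene_b)
        if score >= min_score and root_a != root_b:
            parent[root_b] = root_a  # merge the two families

    families = defaultdict(set)
    for gene in parent:
        families[find(gene)].add(gene)
    return sorted(sorted(members) for members in families.values())


# Hypothetical similarity hits between made-up gene identifiers.
hits = [("g1", "g2", 300), ("g2", "g3", 120), ("g4", "g5", 40), ("g5", "g6", 500)]
print(cluster_gene_families(hits, min_score=100))
```

Single linkage is the simplest choice here and tends to merge families through a single strong hit; a production pipeline would typically use a more robust graph clustering method.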
The PLAZA resource for comparative and evolutionary genomics was a major success, but it still had several issues. We tried to solve at least two of these problems at the same time by creating a new platform. The first issue was the build procedure of PLAZA: adding a single species, or updating the structural annotation of an existing one, requires the complete re-computation of the database content. The second issue was the restrictiveness of the PLAZA workbench: through a mapping procedure, gene sets could be entered for species not present in the PLAZA database, but for species without a phylogenetically close relative this approach did not always yield satisfactory results. Furthermore, the research in question might focus precisely on the difference between a species present in PLAZA and a close relative not present in PLAZA (e.g. to study adaptation to a different ecological niche). In such a case, the mapping procedure is in itself useless. With the advent of NGS transcriptome data sets for a growing number of species, it was clear that a new challenge had presented itself. We designed and developed a new platform, named TRAPID, which can automatically process entire transcriptome data sets using a reference database. The goal was to complete the processing quickly, with results containing both gene family oriented data (such as multiple sequence alignments and phylogenetic trees) and functional characterization of the transcripts. Major efforts went into designing the processing pipeline so that it would be reliable, fast and accurate.
Unsupervised and semi-supervised training methods for eukaryotic gene prediction
This thesis describes new gene finding methods for eukaryotic gene prediction. The current methods for deriving model parameters for gene prediction algorithms are based on curated or experimentally validated sets of genes or gene elements. Compiling these training sets often requires time and additional expert effort, especially for species that are in the initial stages of genome sequencing. Unsupervised training allows model parameters to be determined from anonymous genomic sequence. The practical applicability of unsupervised training is critical given the ever-growing rate of eukaryotic genome sequencing.
Three distinct training procedures are developed for diverse groups of eukaryotic species. GeneMark-ES is developed for species with strong donor and acceptor site signals, such as Arabidopsis thaliana, Caenorhabditis elegans and Drosophila melanogaster. The second version of the algorithm, GeneMark-ES-2, introduces an enhanced intron model to better describe the gene structure of fungal species, which possess relatively weak donor and acceptor splice sites and a well-conserved branch point signal. GeneMark-LE, a semi-supervised training approach, is designed for eukaryotic species with a small number of introns.
The results indicate that the developed unsupervised training methods perform well as compared to other training methods and as estimated from the set of genes supported by EST-to-genome alignments.
Analysis of novel genomes reveals interesting biological findings and shows that the current set of annotated fungal genomes contains several candidates for under-annotated and over-annotated species.
Ph.D.
Committee Chair: Mark Borodovsky; Committee Member: Jung H. Choi; Committee Member: King Jordan; Committee Member: Leonid Bunimovich; Committee Member: Yury Chernof
Suppression and triggering of Arabidopsis immunity by Albugo species
Albugo species are obligate biotrophic phytopathogens. Like other biotrophs, they are
anticipated to secrete effectors that can suppress or trigger plant defenses; the nature of
Albugo effectors is currently unknown.
Sequencing of A. laibachii isolate Nc14 (AlNc14) reveals 13,032 genes encoded in
a ~37 Mb genome. We analyze the effector complement of AlNc14 and find known
effector classes but also classes unique to A. laibachii. Experiments reveal that CHXCs are
a novel class of effectors that suppress host defense.
We functionally characterize two predicted AlNc14 effectors in detail: CHXC1, a potential
core effector conserved in other oomycete species, and SSP6, a fast-evolving effector
specific to A. laibachii.
CHXC1 encodes a nuclear-localized HECT E3 ligase homolog that suppresses host
defenses in a Cys651-dependent manner.
We find seven variants of SSP6 that are under diversifying selection. Two highly expressed
variants, SSP6-2c and SSP6-A, are plasma membrane localized when expressed in planta.
Interestingly, SSP6-2c, but not SSP6-A, is able to enhance growth of P. infestans race blue
13 and suppress flg22-dependent ROS production. In Arabidopsis cells we find SSP6-2c
localizes around AlNc14 haustoria. We propose that AlNc14 secretes the effectors SSP6-2c
and CHXC1 into the plant cell to suppress defense and promote infection.
Current methods to screen for virulence of effector candidates predominantly rely on
measuring growth of bacterial pathogens. Quantitative assessment of resistance and
susceptibility to eukaryotic pathogens is more difficult. We develop a semi-automated
high-throughput system for assaying growth of Hyaloperonospora arabidopsidis (Hpa).
We investigate the genetic basis of resistance to Albugo in Arabidopsis. We find that
resistance to AlNc14 is linked to RAC1 and RAC3 in Ksk-1. In contrast, resistance to A.
candida Nc2 (AcNc2) is linked to WRR4 in Col-0, Col-5 and Ksk-1. A second dominant
locus, WRR5a/b in Col-5, also confers resistance to AcNc2. Thus, different R-genes and
presumably different effectors govern resistance to AlNc14 and AcNc2.
Network-Scale Engineering: Systems Approaches to Synthetic Biology
The field of Synthetic Biology seeks to develop engineering principles for biological systems. Modular biological parts are repurposed and recombined to develop new synthetic biological devices with novel functions. The proper functioning of these devices depends on the cellular context provided by the host organism and on the interaction of these devices with host systems. The field of Systems Biology seeks to measure and model the properties of biological phenomena at the network scale. We present the application of systems biology approaches to synthetic biology, with particular emphasis on understanding and remodeling metabolic networks. Chapter 2 demonstrates the use of a Flux Balance Analysis model of the Saccharomyces cerevisiae metabolic network to identify and construct strains of S. cerevisiae that produce increased amounts of formic acid. Chapter 3 describes the development of synthetic metabolic pathways in Escherichia coli for the production of hydrogen, and a directed evolution strategy for hydrogenase enzyme improvement. Chapter 4 introduces the use of metabolomic profiling to investigate the role of circadian regulation in the metabolic network of the photoautotrophic cyanobacterium Synechococcus elongatus PCC 7942. Together, this work demonstrates the utility of network-scale approaches to understanding biological systems, and presents novel strategies for engineering metabolism.
Differential evolution of non-coding DNA across eukaryotes and its close relationship with complex multicellularity on Earth
Here, I elaborate on the hypothesis that complex multicellularity (CM, sensu Knoll) is a major evolutionary transition (sensu Szathmary), which has convergently evolved a few times in Eukarya only: within red and brown algae, plants, animals, and fungi. Paradoxically, CM seems to correlate with the expansion of non-coding DNA (ncDNA) in the genome rather than with genome size or the total number of genes. Thus, I investigated the correlation between genome and organismal complexities across 461 eukaryotes under a phylogenetically controlled framework. To that end, I introduce the first formal definitions and criteria to distinguish ‘unicellularity’, ‘simple’ (SM) and ‘complex’ multicellularity. Rather than using the limited available estimates of unique cell types, the 461 species were classified according to our criteria by reviewing their life cycle and body plan development from the literature. Then, I investigated the evolutionary association between genome size and 35 genome-wide features (introns and exons from protein-coding genes, repeats and intergenic regions) describing the coding and ncDNA complexities of the 461 genomes. To that end, I developed ‘GenomeContent’, a program that systematically retrieves massive multidimensional datasets from gene annotations and calculates over 100 genome-wide statistics. R scripts coupled to parallel computing were created to calculate >260,000 phylogenetically controlled pairwise correlations. As previously reported, both repetitive and non-repetitive DNA are found to scale strongly and positively with genome size across most eukaryotic lineages. In contrast to previous studies, I demonstrate that changes in the length and repeat composition of introns are only weakly or moderately associated with changes in genome size at the global phylogenetic scale, while changes in intron abundance (within and across genes) are either not or only very weakly associated with changes in genome size.
Our evolutionary correlations are robust to different phylogenetic regression methods, uncertainties in the tree of eukaryotes, variations in genome size estimates, and randomly reduced datasets. Then, I investigated the correlation between the 35 genome-wide features and the cellular complexity of the 461 eukaryotes with phylogenetic Principal Component Analyses. Our results endorse a genetic distinction between SM and CM in Archaeplastida and Metazoa, but not so clearly in Fungi. Remarkably, complex multicellular organisms and their closest ancestral relatives are characterized by high intron-richness, regardless of genome size. Finally, I argue why and how a vast expansion of non-coding RNA (ncRNA) regulators rather than of novel protein regulators can promote the emergence of CM in Eukarya. As a proof of concept, I co-developed a novel ‘ceRNA-motif pipeline’ for the prediction of “competing endogenous” ncRNAs (ceRNAs) that regulate microRNAs in plants. We identified three candidate ceRNA motifs: MIM166, MIM171 and MIM159/319, which were found to be conserved across land plants and to be potentially involved in diverse developmental processes and stress responses. Collectively, the findings of this dissertation support our hypothesis that CM on Earth is a major evolutionary transition promoted by the expansion of two major ncDNA classes, introns and regulatory ncRNAs, which might have boosted the irreversible commitment of cell types in certain lineages by canalizing the timing and kinetics of the eukaryotic transcriptome.
Cover page
Abstract
Acknowledgements
Index
1. The structure of this thesis
1.1. Structure of this PhD dissertation
1.2. Publications of this PhD dissertation
1.3. Computational infrastructure and resources
1.4. Disclosure of financial support and information use
1.5. Acknowledgements
1.6. Author contributions and use of impersonal and personal pronouns
2. Biological background
2.1. The complexity of the eukaryotic genome
2.2. The problem of counting and defining “genes” in eukaryotes
2.3. The “function” concept for genes and “dark matter”
2.4. Increases of organismal complexity on Earth through multicellularity
2.5. Multicellularity is a “fitness transition” in individuality
2.6. The complexity of cell differentiation in multicellularity
3. Technical background
3.1. The Phylogenetic Comparative Method (PCM)
3.2. RNA secondary structure prediction
3.3. Some standards for genome and gene annotation
4. What is in a eukaryotic genome? GenomeContent provides a good answer
4.1. Background
4.2. Motivation: an interoperable tool for data retrieval of gene annotations
4.3. Methods
4.4. Results
4.5. Discussion
5. The evolutionary correlation between genome size and ncDNA
5.1. Background
5.2. Motivation: estimating the relationship between genome size and ncDNA
5.3. Methods
5.4. Results
5.5. Discussion
6. The relationship between non-coding DNA and Complex Multicellularity
6.1. Background
6.2. Motivation: How to define and measure complex multicellularity across eukaryotes?
6.3. Methods
6.4. Results
6.5. Discussion
7. The ceRNA motif pipeline: regulation of microRNAs by target mimics
7.1. Background
7.2. A revisited protocol for the computational analysis of Target Mimics
7.3. Motivation: a novel pipeline for ceRNA motif discovery
7.4. Methods
7.5. Results
7.6. Discussion
8. Conclusions and outlook
8.1. Contributions and lessons for the bioinformatics of large-scale comparative analyses
8.2. Intron features are evolutionarily decoupled among themselves and from genome size throughout Eukarya
8.3. “Complex multicellularity” is a major evolutionary transition
8.4. Role of RNA throughout the evolution of life and complex multicellularity on Earth
9. Supplementary Data
Bibliography
Curriculum Scientiae
Selbständigkeitserklärung (declaration of authorship)
Development of a Software Package for the Quantitative Analysis of Proteomic Mass Spectrometry Datasets Labelled with Nitrogen-15
Elemental metabolic labelling using 15N stable isotopes is a technique used in peptide-centric proteomics that allows samples to be mixed before preparation and analysis (minimising technical variance) without introducing sample ambiguity to the results. Labelling with 15N induces a mass shift in labelled peptides that, when analysed by mass spectrometry (MS), allows the signal associated with differently labelled samples to be differentiated.
When compared to similar labelling techniques such as Stable Isotope Labelling by Amino acids in Cell culture (SILAC), 15N poses unique challenges for analysis because the level of label incorporation affects not only the relative intensity of signals in MS analysis, but also how that signal is distributed. A computational signal extraction algorithm is not easily generalised to all peptides, especially if there are differences in the level of incorporation. Analysis of 15N data has been neglected by the general pace of software development in proteomic MS. Furthermore, the current 15N analysis options have relatively complex installation procedures and are limited to a command-line interface.
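The point that incorporation changes how the signal is distributed, not just its intensity, can be made concrete with a small sketch. It assumes that each nitrogen position in a peptide is labelled independently with probability equal to the incorporation level (a binomial model) and ignores the natural isotopes of the other elements. This is a simplification for illustration only and is not the algorithm used by any published 15N tool; all function names are hypothetical.

```python
from math import comb

# Mass difference between 15N and 14N in daltons.
MASS_SHIFT_PER_N = 0.997035

# Nitrogen atoms per amino acid residue (backbone plus side chain).
NITROGENS_PER_RESIDUE = {
    "G": 1, "A": 1, "S": 1, "P": 1, "V": 1, "T": 1, "C": 1, "L": 1,
    "I": 1, "N": 2, "D": 1, "Q": 2, "K": 2, "E": 1, "M": 1, "H": 3,
    "F": 1, "R": 4, "Y": 1, "W": 2,
}


def count_nitrogens(peptide):
    """Total number of nitrogen atoms in an unmodified peptide."""
    return sum(NITROGENS_PER_RESIDUE[residue] for residue in peptide)


def mean_mz_shift(peptide, charge, incorporation=1.0):
    """Average m/z shift of the labelled peptide at a given charge state."""
    return count_nitrogens(peptide) * incorporation * MASS_SHIFT_PER_N / charge


def label_distribution(peptide, incorporation):
    """Probability of carrying k 15N atoms, for k = 0..N (binomial assumption)."""
    n = count_nitrogens(peptide)
    return [
        comb(n, k) * incorporation ** k * (1 - incorporation) ** (n - k)
        for k in range(n + 1)
    ]


peptide = "PEPTIDER"  # arbitrary example sequence
print(count_nitrogens(peptide))
print(round(mean_mz_shift(peptide, charge=2), 4))
for k, p in enumerate(label_distribution(peptide, incorporation=0.95)):
    print(k, round(p, 4))
```

Under this model, incomplete incorporation spreads the labelled signal over several peaks below the fully labelled mass, and the spread depends on the nitrogen count, which is why a single extraction window does not generalise across peptides.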
I describe the development of a cross-platform 15N quantification software package (HeavyMetL) which runs inside a web browser, requiring no installation procedure and providing a graphical interface for both the analysis of data and visual interrogation of results (in addition to a more typical text-format table output). The optimisation (using experimental data) of a core part of the algorithm, which determines the level of 15N incorporation, is described in detail. Finally, the performance of HeavyMetL is benchmarked against published 15N labelled data from Arabidopsis seedlings quantified by a previously published algorithm, showing that HeavyMetL produces quantification of equivalent or better quality.