143 research outputs found

    Screening synteny blocks in pairwise genome comparisons through integer programming

    Background: It is difficult to accurately interpret chromosomal correspondences such as true orthology and paralogy due to significant divergence of genomes from a common ancestor. Analyses are particularly problematic among lineages that have repeatedly experienced whole genome duplication (WGD) events. To compare multiple "subgenomes" derived from genome duplications, we need to relax the traditional requirements of "one-to-one" syntenic matchings of genomic regions in order to reflect "one-to-many" or more generally "many-to-many" matchings. However, this relaxation may result in the identification of synteny blocks that are derived from ancient shared WGDs that are not of interest. For many downstream analyses, we need to eliminate weak, low scoring alignments from pairwise genome comparisons. Our goal is to objectively select a subset of synteny blocks whose total scores are maximized while respecting the duplication history of the genomes in comparison. We call this "quota-based" screening of synteny blocks in order to appropriately fill a quota of syntenic relationships within one genome or between two genomes having WGD events.
    Results: We have formulated the synteny block screening as an optimization problem known as "Binary Integer Programming" (BIP), which is solved using existing linear programming solvers. The computer program QUOTA-ALIGN performs this task by creating a clear objective function that maximizes the compatible set of synteny blocks under given constraints on overlaps and depths (corresponding to the duplication history in respective genomes). Such a procedure is useful for any pairwise synteny alignments, but is most useful in lineages affected by multiple WGDs, such as plant or fish lineages. For example, there should be a 1:2 ploidy relationship between genomes A and B if genome B had an independent WGD subsequent to the divergence of the two genomes. We show through simulations and real examples using plant genomes in the rosid superorder that the quota-based screening can eliminate ambiguous synteny blocks and focus on specific genomic evolutionary events, like the divergence of lineages (in cross-species comparisons) and the most recent WGD (in self comparisons).
    Conclusions: The QUOTA-ALIGN algorithm screens a set of synteny blocks to retain only those compatible with a user-specified ploidy relationship between two genomes. These blocks, in turn, may be used for additional downstream analyses such as identifying true orthologous regions in interspecific comparisons. There are two major contributions of QUOTA-ALIGN: 1) reducing the block screening task to a BIP problem, which is novel; 2) providing an efficient software pipeline starting from all-against-all BLAST to the screened synteny blocks with dot plot visualizations. Python code and full documentation are publicly available at http://github.com/tanghaibao/quota-alignment. The QUOTA-ALIGN program is also integrated as a major component in SynMap (http://genomevolution.com/CoGe/SynMap.pl), offering easier access to thousands of genomes for non-programmers.
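The quota-based screening described in this abstract can be illustrated with a minimal Python sketch. It is not the QUOTA-ALIGN code: the block data, the names `BLOCKS`, `max_depth` and `screen_blocks`, and the interval representation are all hypothetical, and the binary choice (keep or drop each block) is solved by exhaustive search instead of a linear programming solver, which is only feasible for a handful of blocks. The sketch assumes that a 1:2 relationship between genomes A and B can be read as "at most 2 retained blocks may overlap at any point on genome A, and at most 1 on genome B".

```python
from itertools import product

# Hypothetical toy blocks: (name, score, interval on genome A, interval on genome B).
# Intervals are half-open (start, end) coordinates.
BLOCKS = [
    ("b1", 90, (0, 100), (0, 100)),
    ("b2", 80, (0, 100), (200, 300)),
    ("b3", 30, (0, 100), (400, 500)),
    ("b4", 20, (150, 250), (0, 100)),
]


def max_depth(intervals):
    """Return the largest number of intervals covering a common point."""
    events = []
    for start, end in intervals:
        events.append((start, 1))   # interval opens
        events.append((end, -1))    # interval closes
    # Tuples sort closings (-1) before openings (+1) at equal coordinates,
    # so intervals that merely touch do not count as overlapping.
    events.sort()
    depth = best = 0
    for _, delta in events:
        depth += delta
        best = max(best, depth)
    return best


def screen_blocks(blocks, depth_a, depth_b):
    """Pick the subset of blocks with maximal total score such that the
    overlap depth stays within depth_a on genome A and depth_b on genome B.

    Each block gets a binary variable (1 = keep, 0 = drop); all 2**n
    assignments are enumerated, so this is a toy stand-in for a BIP solver.
    """
    best_score, best_names = 0, []
    for choice in product((0, 1), repeat=len(blocks)):
        kept = [b for b, x in zip(blocks, choice) if x]
        if max_depth([b[2] for b in kept]) > depth_a:
            continue
        if max_depth([b[3] for b in kept]) > depth_b:
            continue
        score = sum(b[1] for b in kept)
        if score > best_score:
            best_score, best_names = score, [b[0] for b in kept]
    return best_score, best_names


if __name__ == "__main__":
    # Depth 2 allowed on genome A, depth 1 on genome B.
    print(screen_blocks(BLOCKS, depth_a=2, depth_b=1))
    # Strict one-to-one screening.
    print(screen_blocks(BLOCKS, depth_a=1, depth_b=1))
```

In this toy data, relaxing the depth on genome A lets the two high-scoring blocks over the same A region coexist, while the one-to-one setting forces a choice between them. A realistic instance would hand the same objective and constraints to an integer programming solver.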

    Positional orthology: putting genomic evolutionary relationships into context

    Orthology is a powerful refinement of homology that allows us to describe more precisely the evolution of genomes and understand the function of the genes they contain. However, because orthology is not concerned with genomic position, it is limited in its ability to describe genes that are likely to have equivalent roles in different genomes. Because of this limitation, the concept of ‘positional orthology’ has emerged, which describes the relation between orthologous genes that retain their ancestral genomic positions. In this review, we formally define this concept, for which we introduce the shorter term ‘toporthology’, with respect to the evolutionary events experienced by a gene’s ancestors. Through a discussion of recent studies on the role of genomic context in gene evolution, we show that the distinction between orthology and toporthology is biologically significant. We then review a number of orthology prediction methods that take genomic context into account and thus that may be used to infer the important relation of toporthology

    Distinct expression and methylation patterns for genes with different fates following a single whole-genome duplication in flowering plants

    For most sequenced flowering plants, multiple whole-genome duplications (WGDs) are found. Genes duplicated by a WGD often have different fates: they can quickly disappear again, be retained for long(er) periods, or subsequently undergo small-scale duplications. However, how different expression, epigenetic regulation, and functional constraints are associated with these different gene fates following a WGD still requires further investigation, because successive WGDs in angiosperms complicate the gene trajectories. In this study, we investigate lotus (Nelumbo nucifera), an angiosperm with a single WGD during the K–Pg boundary. Based on improved intraspecific-synteny identification by a chromosome-level assembly, transcriptome, and bisulfite sequencing, we explore not only the fundamental distinctions in genomic features, expression, and methylation patterns of genes with different fates after a WGD but also the factors that shape post-WGD expression divergence and expression bias between duplicates. We found that after a WGD genes that returned to single copies show the highest levels and breadth of expression, gene body methylation, and intron numbers, whereas the long-retained duplicates exhibit the highest degrees of protein–protein interactions and protein lengths and the lowest methylation in gene flanking regions. For those long-retained duplicate pairs, the degree of expression divergence correlates with their sequence divergence, degree in protein–protein interactions, and expression level, whereas their biases in expression level reflecting subgenome dominance are associated with the bias of subgenome fractionation. Overall, our study on the paleopolyploid nature of lotus highlights the impact of different functional constraints on gene fate and duplicate divergence following a single WGD in plants.

    Synteny-based phylogenomic networks for comparative genomics

    For comparative genomics, relative gene order, or synteny, holds key information for assessing genomic innovations such as gene duplications, gene loss, or transpositions. While the number of reference genomes is growing exponentially, a major challenge is how to detect, represent, and visualize synteny relations of any genes of interest effectively across a large number of genomes. In this thesis, I present six chapters centering on a network approach for large-scale phylogenomic synteny analysis, and discuss how such a network approach can enhance our understanding of the evolutionary history of genes and genomes across broad phylogenetic groups and divergence times. In Chapter 1, I stress that synteny information is becoming more important in this genomic age of rapidly developing DNA sequencing technologies. It provides another layer of data besides sequences alone, and could potentially be better used to improve phylogeny. I also summarize currently available tools and give an example of popular websites for synteny detection. In Chapter 2, I propose an outline for performing synteny network analysis, which is based on three primary steps: pairwise whole genome comparisons, syntenic block detection and data fusion, and network visualization. Then, by comparison with a previous synteny result that used traditional parallel coordinate plots, I show that the network approach can present a much clearer, stronger, and more systematic graph, with integrated synteny information from 101 broadly distributed species. In Chapter 3, we analyzed synteny networks of the entire MADS-box transcription factor gene family from fifty-one completed plant genomes. We applied a k-clique percolation method to cluster the synteny network. We found lineage-specific clusters that derive from transposition events for the regulators of floral development (APETALA3 and PI) and flowering-time (FLC) in the Brassicales and for the regulators of root-development (AGL17) in Poales. 
We also visualized large differences in synteny properties between Type I MADS-box genes and Type II MADS-box genes. We identified two large gene clusters that jointly encompass many key phenotypic regulatory Type II MADS-box gene clades (SEP1, SQUA, TM8, SEP3, FLC, AGL6 and TM3). This allows for a better understanding of how evolution has acted on a key regulatory gene family in the plant kingdom. In Chapter 4, we performed synteny network analysis of the LEA gene families, which include eight different subfamilies (LEA_1 to LEA_6, SMP, and DHN) and have a relatively chaotic classification. Synteny clusters provide a better picture of genomic innovations and functional diversification. For example, recurrent tandem duplications contributed to LEA_2 family expansion, whereas synteny and protein sequence were highly conserved during the evolution of LEA_5. In Chapter 5, instead of analyzing a particular gene family, I scale up the analysis to all the genes from all available genomes across kingdoms over significant evolutionary timescales. We used available genomes of 87 mammals and 107 flowering plants. We first compared synteny percentage with popular genome metrics such as BUSCO and N50, which revealed genomic architecture conservation and variation across kingdoms. We characterized and compared the properties of the whole network, using degree distribution and clustering results. Through phylogenomic profiling of the size, degree and composition of all clusters, we identified many genomic innovations (i.e., duplications, gene transpositions, gene loss) at the individual gene level in the tested mammal and angiosperm genomes. In Chapter 6, I summarize the merits of taking a network-based approach for synteny comparisons, and discuss current clustering methods for synteny data. I also mention several weaknesses that could be addressed in the future.
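The k-clique percolation step named in this abstract can be sketched on a toy synteny network. The following Python is an illustrative sketch only, not the thesis's implementation: the function name `clique_percolation`, the node labels and the edge list are hypothetical, and cliques are found by brute force, which is workable only for very small graphs. Nodes stand for genes and edges for pairwise syntenic relationships; a community is the union of k-cliques that can be reached from one another through cliques sharing k-1 nodes.

```python
from itertools import combinations


def clique_percolation(edges, k):
    """Cluster an undirected graph by k-clique percolation.

    Two k-cliques are adjacent if they share k-1 nodes; each community is
    the union of nodes over one connected group of adjacent k-cliques.
    """
    # Build an adjacency map from the edge list.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    # Enumerate all k-cliques by brute force (toy-scale only).
    cliques = [
        frozenset(c)
        for c in combinations(sorted(adj), k)
        if all(b in adj[a] for a, b in combinations(c, 2))
    ]

    # Union-find over cliques that share k-1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) >= k - 1:
            parent[find(i)] = find(j)

    # Merge the nodes of each group of percolating cliques.
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical syntenic gene pairs: two triangles sharing an edge,
# plus a third triangle attached by a single edge.
EDGES = [
    ("A1", "B1"), ("A1", "C1"), ("B1", "C1"),
    ("B1", "D1"), ("C1", "D1"),
    ("D1", "E1"),
    ("E1", "F1"), ("E1", "G1"), ("F1", "G1"),
]

if __name__ == "__main__":
    print(clique_percolation(EDGES, k=3))
```

With k = 3, the two triangles that share an edge merge into one cluster, while the triangle linked only through the single edge D1-E1 stays separate. This shows how the method distinguishes densely connected syntenic groups from loosely attached ones.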

    Deeply conserved synteny resolves early events in vertebrate evolution

    Although it is widely believed that early vertebrate evolution was shaped by ancient whole-genome duplications, the number, timing and mechanism of these events remain elusive. Here, we infer the history of vertebrates through genomic comparisons with a new chromosome-scale sequence of the invertebrate chordate amphioxus. We show how the karyotypes of amphioxus and diverse vertebrates are derived from 17 ancestral chordate linkage groups (and 19 ancestral bilaterian groups) by fusion, rearrangement and duplication. We resolve two distinct ancient duplications based on patterns of chromosomal conserved synteny. All extant vertebrates share the first duplication, which occurred in the mid/late Cambrian by autotetraploidization (that is, direct genome doubling). In contrast, the second duplication is found only in jawed vertebrates and occurred in the mid-late Ordovician by allotetraploidization (that is, genome duplication following interspecific hybridization) from two now-extinct progenitors. This complex genomic history parallels the diversification of vertebrate lineages in the fossil record

    The Marvelous World of tRNAs: From Accurate Mapping to Chemical Modifications

    Since the discovery of transfer RNAs (tRNAs) as decoders of the genetic code, life science has been transformed. In particular, once the importance of tRNAs in protein synthesis had been established, researchers recognized that the functionality of tRNAs in cellular regulation extends beyond this paradigm. A strong impetus for these discoveries came from advances in large-scale RNA sequencing (RNA-seq) and increasingly sophisticated algorithms. Sequencing tRNAs is challenging both experimentally and in terms of the subsequent computational analysis. In RNA-seq data analysis, mapping tRNA reads to a reference genome is an error-prone task. This is particularly true because chemical modifications introduce systematic reverse transcription errors while at the same time the genomic loci are only approximately identical due to the post-transcriptional maturation of tRNAs. Additionally, their multi-copy nature complicates the precise assignment of reads to their true genomic origin. In the course of the thesis, a computational workflow was established to enable accurate mapping of tRNA reads. The developed method removes most of the mapping artifacts introduced by simpler mapping schemes, as demonstrated using both simulated and human RNA-seq data. Subsequently, the resulting mapping profiles can be used for reliable identification of specific chemical tRNA modifications with a false discovery rate of only 2%. For that purpose, computational analysis methods were developed that facilitate the sensitive detection and even classification of most tRNA modifications based on their mapping profiles. This comprised untreated RNA-seq data from various species as well as treated data from Bacillus subtilis designed to display modifications as a specific read-out in the mapping profile. The discussion focuses on sources of artifacts that complicate the profiling of tRNA modifications and strategies to overcome them. 
Exemplary studies on the modification patterns of different human tissues and the developmental stages of Dictyostelium discoideum were carried out. These suggested regulatory functions of tRNA modifications in development and during cell differentiation. The main experimental difficulties of tRNA sequencing are caused by extensive, stable secondary structures and the presence of chemical modifications. Current RNA-seq methods do not sample the entire tRNA pool, lose short tRNA fragments, or lack specificity for tRNAs. Within this thesis, the benchmarking and improvement of LOTTE-seq, a method for specific selection of tRNAs for high-throughput sequencing, showed that the method solves the experimental challenges and avoids the disadvantages of previous tRNA-seq protocols. Applying the accurate tRNA mapping strategy to LOTTE-seq and other tRNA-specific RNA-seq methods demonstrated that the content of mature tRNAs is highest in LOTTE-seq data, ranging from 90% in Spinacia oleracea to 100% in D. discoideum. Additionally, the thesis addressed the fact that tRNAs are multi-copy genes that undergo concerted evolution, which keeps sequences of paralogous genes effectively identical. Therefore, it is impossible to distinguish orthologs from paralogs by sequence similarity alone. Synteny, the maintenance of relative genomic positions, is helpful to disambiguate evolutionary relationships in this situation. During this thesis, a workflow was developed for synteny-based orthology identification of tRNA genes. The workflow is based on the use of pre-computed genome-wide multiple sequence alignment blocks as anchors to establish syntenic conservation of sequence intervals. Syntenic clusters of concertedly evolving genes of different tRNA families are then subdivided and processed by cograph editing to recover their duplication histories. 
A useful outcome of this study is that it highlights the technical problems and difficulties associated with an accurate analysis of the evolution of multi-copy genes. To showcase the method, the evolution of tRNAs in primates and fruit flies was reconstructed. In the last decade, a number of reports have described novel aspects of tRNAs in terms of the diversity of their genes. For example, nuclear-encoded mitochondrial-derived tRNAs (nm-tRNAs) have been reported whose presence provokes intriguing questions about their functionality. Within this thesis, an annotation strategy was developed that led to the identification of 335 and 43 novel nm-tRNAs in human and mouse, respectively. Interestingly, downstream analyses showed that the localization of several nm-tRNAs in introns and the over-representation of conserved RNA-binding sites of proteins involved in splicing suggest a potential regulatory function of intronic nm-tRNAs in splicing.

    A novel approach to infer orthologs and produce gene annotations at scale

    Owing to advances in DNA sequencing, the number of available genomes has increased rapidly in recent decades. The thousands of genomes already available today enable detailed comparative analyses, which are essential for answering relevant questions. These concern the association of genotype and phenotype, the study of the particular features of complex proteins, and the further development of medical applications. To answer all these questions, it is necessary to annotate protein-coding genes in newly sequenced genomes and to determine their homology relationships. Existing methods of genome analysis, however, are not designed for the volume of data generated today. The central challenge in comparative genomics is therefore not the number of available genomes, but the development of new methods for high-throughput data analysis. To address these problems, I propose a new paradigm for genome annotation and homology inference that is based on the alignment of whole genomes. Whereas the methods currently used for gene annotation and homology determination rely exclusively on coding sequences, including the surrounding neutrally evolving genomic context could yield better and more complete annotations. The use of genome alignments allows the proposed methodology to be scaled to thousands of genomes. In this thesis, I present TOGA (Tool to infer Orthologs from Genome Alignments), a bioinformatic method that implements this concept and combines homology classification and gene annotation in a single pipeline. TOGA uses machine learning to distinguish orthologs from paralogs based on the alignment of intronic and intergenic regions. 
The benchmarking results show that TOGA outperforms conventional approaches within Placentalia. TOGA classifies homology relationships with high precision and reliably identifies inactivated genes as such. Earlier versions of TOGA were applied in several studies and used in two publications. In addition, TOGA was successfully used to annotate 500 mammalian genomes, the most comprehensive such dataset to date. These results show that TOGA has the potential to become an established method for gene annotation and to complement currently used techniques.

    Mathematical models for evolution of genome structure

    The structure of a genome can be characterized by its gene content. Evolution of genome structure in closely related species can be studied by examining their synteny, or conserved gene order and content. A variety of evolutionary rearrangements such as polyploidy, inversions, transpositions, translocations, gene duplication and gene loss degrade synteny over time. In this dissertation, I approach in multiple ways the problem of understanding synteny in genomes and how far back its evolutionary history can be traced. First, I present a probabilistic model of two rearrangements, gene loss and transposition (gain), and apply it to the problem of estimating the relative contribution of these rearrangements within a set of syntenic genome segments. This model can be used to predict gene content in syntenic regions of unsequenced genomes. Next, I use optimization methods to recover syntenic segments between genomes based on reconstructions of their parent ancestry. I examine how these reconstructions can be used as input to programs that identify syntenic regions in genomes to reveal more synteny than was previously detected. I use simulations that incorporate each of the evolutionary rearrangements described above to evaluate the models presented in this dissertation. Finally, I apply these models to genomic data from yeast and flowering plants, two eukaryotic systems that are known to have experienced polyploidy. This application is of particular relevance in flowering plants, in which many economically and scientifically important polyploid species have incompletely sequenced genomes.