842 research outputs found

    Simultaneous Reconstruction of Duplication Episodes and Gene-Species Mappings

    We present a novel problem, called MetaEC, which aims to infer gene-species assignments in a collection of gene trees with missing labels by minimizing the size of duplication episode clustering (EC). This problem is particularly relevant in metagenomics, where incomplete data often poses a challenge in the accurate reconstruction of gene histories. To solve MetaEC, we propose a polynomial time dynamic programming (DP) formulation that verifies the existence of a set of duplication episodes from a predefined set of episode candidates. We then demonstrate how to use DP to design an algorithm that solves MetaEC. Although the algorithm is exponential in the worst case, we introduce a heuristic modification of the algorithm that provides a solution with the knowledge that it is exact. To evaluate our method, we perform two computational experiments on simulated and empirical data containing whole genome duplication events, showing that our algorithm is able to accurately infer the corresponding events

    The multiple gene duplication problem revisited

    Motivation: Deciphering the location of gene duplications and multiple gene duplication episodes on the Tree of Life is fundamental to understanding the way gene families and genomes evolve. The multiple gene duplication problem provides a framework for placing gene duplication events onto nodes of a given species tree, and detecting episodes of multiple gene duplication. One version of the multiple gene duplication problem was defined by Guigó et al. in 1996. Several heuristic solutions have since been proposed for this problem, but no exact algorithms were known

    Assumption 0 analysis: comparative phylogenetic studies in the age of complexity

    Darwin's panoramic view of biology encompassed two metaphors: the phylogenetic tree, pointing to relatively linear (and divergent) complexity, and the tangled bank, pointing to reticulated (and convergent) complexity. The emergence of phylogenetic systematics half a century ago made it possible to investigate linear complexity in biology. Assumption 0, first proposed in 1986, is not needed for cases of simple evolutionary patterns, but must be invoked when there are complex evolutionary patterns whose hallmark is reticulated relationships. A corollary of Assumption 0, the duplication convention, was proposed in 1990, permitting standard phylogenetic systematic ontology to be used in discovering reticulated evolutionary histories. In 2004, a new algorithm, phylogenetic analysis for comparing trees (PACT), was developed specifically for use in analyses invoking Assumption 0. PACT can help discern complex evolutionary explanations for historical biogeographical, coevolutionary, phylogenetic, and tokogenetic processes

    Inferring angiosperm phylogeny from EST data with widespread gene duplication

    BACKGROUND: Most studies inferring species phylogenies use sequences from single copy genes or sets of orthologs culled from gene families. For taxa such as plants, with very high levels of gene duplication in their nuclear genomes, this has limited the exploitation of nuclear sequences for phylogenetic studies, such as those available in large EST libraries. One rarely used method of inference, gene tree parsimony, can infer species trees from gene families undergoing duplication and loss, but its performance has not been evaluated at a phylogenomic scale for EST data in plants. RESULTS: A gene tree parsimony analysis based on EST data was undertaken for six angiosperm model species and Pinus, an outgroup. Although a large fraction of the tentative consensus sequences obtained from the TIGR database of ESTs was assembled into homologous clusters too small to be phylogenetically informative, some 557 clusters contained promising levels of information. Based on maximum likelihood estimates of the gene trees obtained from these clusters, gene tree parsimony correctly inferred the accepted species tree with strong statistical support. A slight variant of this species tree was obtained when maximum parsimony was used to infer the individual gene trees instead. CONCLUSION: Despite the complexity of the EST data and the relatively small fraction eventually used in inferring a species tree, the gene tree parsimony method performed well in the face of very high apparent rates of duplication

    An ILP solution for the gene duplication problem

    BACKGROUND: The gene duplication (GD) problem seeks a species tree that implies the fewest gene duplication events across a given collection of gene trees. Solving this problem makes it possible to use large gene families with complex histories of duplication and loss to infer phylogenetic trees. However, the GD problem is NP-hard, and therefore, most analyses use heuristics that lack any performance guarantee. RESULTS: We describe the first integer linear programming (ILP) formulation to solve instances of the gene duplication problem exactly. With simulations, we demonstrate that the ILP solution can solve problem instances with up to 14 taxa. Furthermore, we apply the new ILP solution to solve the gene duplication problem for the seed plant phylogeny using a 12-taxon, 6,084-gene data set. The unique, optimal solution, which places Gnetales sister to the conifers, represents a new, large-scale genomic perspective on one of the most puzzling questions in plant systematics. CONCLUSIONS: Although the GD problem is NP-hard, our novel ILP solution for it can solve instances with data sets consisting of as many as 14 taxa and 1,000 genes in a few hours. These are the largest instances that have been solved optimally to date. Thus, this work can provide large-scale genomic perspectives on phylogenetic questions that previously could only be addressed by heuristic estimates.

    Inventing an arsenal: adaptive evolution and neofunctionalization of snake venom phospholipase A(2) genes

    BACKGROUND: Gene duplication followed by functional divergence has long been hypothesized to be the main source of molecular novelty. Convincing examples of neofunctionalization, however, remain rare. Snake venom phospholipase A(2) genes are members of large multigene families with many diverse functions; thus they are excellent models to study the emergence of novel functions after gene duplications. RESULTS: Here, I show that positive Darwinian selection and neofunctionalization are common in snake venom phospholipase A(2) genes. The pattern of gene duplication and positive selection indicates that adaptive molecular evolution occurs immediately after duplication events as novel functions emerge and continues as gene families diversify and are refined. Surprisingly, adaptive evolution of group-I phospholipases in elapids is also associated with speciation events, suggesting adaptation of the phospholipase arsenal to novel prey species after niche shifts. Mapping the location of sites under positive selection onto the crystal structure of phospholipase A(2) showed that regions evolving under diversifying selection are located on the molecular surface and are likely protein-protein interaction sites essential for toxin functions. CONCLUSION: These data show that increases in genomic complexity (through gene duplications) can lead to phenotypic complexity (venom composition) and that positive Darwinian selection is a common evolutionary force in snake venoms. Finally, regions identified under selection on the surface of phospholipase A(2) enzymes are potential candidate sites for structure-based antivenin design

    Algorithms: simultaneous error-correction and rooting for gene tree reconciliation and the gene duplication problem

    BACKGROUND: Evolutionary methods are increasingly challenged by the wealth of fast growing resources of genomic sequence information. Evolutionary events, like gene duplication, loss, and deep coalescence, account more than ever for incongruence between gene trees and the actual species tree. Gene tree reconciliation addresses this fundamental problem by invoking the minimum number of gene duplications and losses that reconcile a rooted gene tree with a rooted species tree. However, the reconciliation process is highly sensitive to topological error or wrong rooting of the gene tree, a condition that is not met by most gene trees in practice. Thus, despite the promises of gene tree reconciliation, its applicability in practice is severely limited. RESULTS: We introduce the problem of reconciling unrooted and erroneous gene trees by simultaneously rooting and error-correcting them, and describe an efficient algorithm for this problem. Moreover, we introduce an error-corrected version of the gene duplication problem, a standard application of gene tree reconciliation. We introduce an effective heuristic for our error-corrected version of the gene duplication problem, given that the original version of this problem is NP-hard. Our experimental results suggest that our error-correcting approaches for unrooted input trees can significantly improve on the accuracy of gene tree reconciliation, and the species tree inference under the gene duplication problem. Furthermore, the efficiency of our algorithm for error-correcting reconciliation is capable of handling truly large-scale phylogenetic studies. CONCLUSIONS: Our presented error-correction approach is a crucial step towards making gene tree reconciliation more robust, and thus to improve on the accuracy of applications that fundamentally rely on gene tree reconciliation, like the inference of gene-duplication supertrees.
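
As a rough illustration of what reconciling a rooted gene tree with a rooted species tree involves, the sketch below counts duplications under the standard least-common-ancestor (LCA) mapping. It is not the error-correcting algorithm of the abstract above. The nested-tuple tree encoding and the names leaf_set, species_clusters, lca and count_duplications are assumptions made for this example only; gene-tree leaves are assumed to be labeled directly with the species they were sampled from.

```python
# Minimal sketch: duplication count under the LCA mapping.
# Assumed encoding (hypothetical, for illustration): a leaf is a species
# name (str); an internal node is a 2-tuple (left_subtree, right_subtree).

def leaf_set(tree):
    """Return the set of species labels below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return leaf_set(tree[0]) | leaf_set(tree[1])

def species_clusters(species_tree):
    """Collect the leaf set (cluster) of every node in the species tree."""
    clusters = []
    def walk(node):
        clusters.append(leaf_set(node))
        if not isinstance(node, str):
            walk(node[0])
            walk(node[1])
    walk(species_tree)
    return clusters

def lca(clusters, species):
    """The smallest cluster containing `species` identifies their LCA node."""
    return min((c for c in clusters if species <= c), key=len)

def count_duplications(gene_tree, species_tree):
    """Count gene-tree nodes mapped to the same species node as a child."""
    clusters = species_clusters(species_tree)
    dups = 0
    def walk(node):
        nonlocal dups
        if isinstance(node, str):
            return lca(clusters, frozenset([node]))
        left = walk(node[0])
        right = walk(node[1])
        here = lca(clusters, left | right)
        # A node is a duplication if it maps to the same place as a child.
        if here == left or here == right:
            dups += 1
        return here
    walk(gene_tree)
    return dups

species = (("a", "b"), "c")
print(count_duplications((("a", "b"), "c"), species))  # congruent gene tree
print(count_duplications((("a", "c"), "b"), species))  # incongruent gene tree
```

A gene tree congruent with the species tree needs no duplications, while the incongruent topology forces one at the root. The cluster scan is quadratic and was chosen for readability; a species label absent from the species tree raises a ValueError in this sketch.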

    Vertebrate phylogenomics and gene family evolution

    This thesis is about two topics: the evolution of gene families by the birth-death process of gene duplication and gene loss, and phylogenetic inference. It is a central theme that these two processes are intimately associated - the phylogenies of gene families (of any gene) are shaped by the processes of gene duplication and gene loss, as much as by the processes of speciation and extinction occurring among the species the gene is evolving in. This has two results. Firstly, that we need to know, or assume, something about the processes of gene duplication and loss to correctly understand the pattern of speciation, or cladogenesis, in a group of organisms. Secondly, that we need to know, or assume, something about this pattern if we are to fully appreciate the effect of gene duplication and loss on a gene family phylogeny. The main part of this thesis investigates the use of reconciled tree methods in unravelling species phylogeny and the evolution of gene families. Part of this investigation involves placing reconciled tree methods (and the use of these methods to infer species phylogeny, known as gene tree parsimony), in the context of some related methods: supertree methods and "simultaneous analysis" of combined data. Two empirical studies complete this part of the thesis - one attempting to infer the higher-level phylogeny of vertebrates using gene tree parsimony, and another focusing on a lower taxonomic level, on primate phylogeny. This chapter attempts an integrated study of gene duplication and species phylogeny, which uses information about gene duplication to help date evolutionary events. Despite the close relationship between gene duplication and speciation on phylogenies, it is possible to study gene duplication independently.
If we restrict ourselves to genes sampled from a single genome, gene family trees represent gene duplications and gene losses occurring during the history of a single species, so the complication of speciation and extinction is eliminated. By realising that the processes of gene duplication and loss in these trees are analogous to the processes of speciation and extinction in species phylogenies, we can harness a toolkit of methods developed for more traditional phylogenies to study these molecular processes. Two such methods are models of cladistic tree shape and birth-death models, which allow the first estimates of the rate of gene loss

    Introgression and repeated co-option facilitated the recurrent emergence of C4 photosynthesis among close relatives.

    The origins of novel traits are often studied using species trees and modeling phenotypes as different states of the same character, an approach that cannot always distinguish multiple origins from fewer origins followed by reversals. We address this issue by studying the origin of C4 photosynthesis, an adaptation to warm and dry conditions, in the grass Alloteropsis. We dissect the C4 trait into its components, and show two independent origins of the C4 phenotype via different anatomical modifications, and the use of distinct sets of genes. Further, inference of enzyme adaptation suggests that one of the two groups encompasses two transitions to a full C4 state from a common ancestor with an intermediate phenotype that had some C4 anatomical and biochemical components. Molecular dating of C4 genes confirms the introgression of two key C4 components between species, while the inheritance of all others matches the species tree. The number of origins consequently varies among C4 components, a scenario that could not have been inferred from analyses of the species tree alone. Our results highlight the power of studying individual components of complex traits to reconstruct trajectories toward novel adaptations