1,686 research outputs found

    R programming in phylogenetics and evolution

    This dissertation addresses the application of the statistical computing language R in the study of evolution and diversification of plants. The topics included range from the worldwide historical biogeography of the cucurbit family and the phylogenetic composition of the Mediterranean Oxalis flora in central Chile to the interplay between population genetics and climatic niche evolution in four Hordeum clades in the Americas. In these studies, I drew on existing methods in R and on Java and C programs that could be easily integrated with R. Whenever necessary, I created additional software, available in four new R packages. R's features, e.g., intersystem interfaces, extensibility, reproducibility and advanced graphical capability, proved well suited for evolutionary and phylogenetic research. My coauthors and I addressed the history of Cucurbitaceae, one of the most economically important families of plants, using a multi-gene phylogeny for 114 of the 115 genera and 25 per cent of the 960 species. Worldwide sampling was achieved by using specimens from 30 herbaria. Results reveal an Asian origin of Cucurbitaceae in the Late Cretaceous, followed by the repeated spread of lineages into the African, American and Australian continents via transoceanic long-distance dispersal (LDD). North American cucurbits stem from at least seven range expansions of Central and South American lineages; Madagascar was colonized 13 times, always from Africa; Australia was reached 12 times, apparently always from Southeast Asia. Overall, Cucurbitaceae underwent at least 43 successful LDD events over the past 60 Myr, which would translate into an average of about seven LDDs every 10 Myr. These and similar findings from other angiosperms stress the need for an increased tapping of museum collections to achieve extensive geographical sampling in plant phylogenetics.
The second study focused on the interplay of population demography with the evolution of ecological niches during or after speciation in Hordeum. While large populations maintain a high level of standing genetic diversity, gene flow and recombination buffer against fast alterations in ecological adaptation. Small populations harbor lower allele diversity but can more easily shift to new niches if they initially survive under changed conditions. Thus, large populations should be more conservative regarding niche changes than small populations. My coauthors and I used environmental niche modeling together with phylogenetic, phylogeographic and population genetic analyses to infer the correlation of population demography with changes in ecological niche dimensions in 12 diploid Hordeum species from the New World, forming four monophyletic groups. Our analyses found both shifts and conservatism in certain niche dimensions within and among clades. Speciation due to vicariance resulted in three species with no pronounced climate niche differences, while species originating due to long-distance dispersals or otherwise encountering genetic bottlenecks mostly revealed climate niche shifts. Niche convergence among clades indicates a niche-filling pattern during the last 2 Myr in South American Hordeum. We provide evidence that species that did not encounter population reductions mainly show ecoclimatic niche conservatism, while major niche shifts have occurred in species that have undergone population bottlenecks. Our analyses allow the conclusion that population demography influences adaptation and niche shifts or conservatism in South American Hordeum species. Finally, I studied the phylogenetic composition of the Oxalis flora of the Mediterranean zone of Chile by asking whether in such a species-rich clade xerophytic adaptations arose in parallel, at different times, or simultaneously.
Answering this type of question has been a major concern of evolutionary biology over the past few years, with a growing consensus that lineages tend to be conservative in their vegetative traits and niche requirements. Combined nuclear and chloroplast DNA sequences for 112 species of Oxalidales (4900 aligned nucleotides) were used for a fossil-calibrated phylogeny that includes 43 of the 54 species of Chilean Oxalis. Species distribution models (SDMs) incorporating precipitation, temperature, and fog were combined with the phylogeny to reconstruct ancestral habitat preferences, relying on likelihood and Bayesian techniques. Since uneven collecting can reduce the power of SDMs, we compared 3 strategies to correct for collecting effort. Unexpectedly, the Oxalis flora of Chile consists of 7 distant lineages that originated at different times prior to the last Andean uplift pulse; some had features preadapting them to seasonally arid or xeric conditions. Models that incorporated fog and a 'collecting activity surface' performed best and identified the Mediterranean zone as a hotspot of Oxalis species as well as lineage diversity because it harbors a mix of ancient and young groups, including insufficiently arid-adapted species. There is no evidence of rapid adaptive radiation.

    On the Limits and Practice of Automatically Designing Self-Stabilization

    A protocol is said to be self-stabilizing when the distributed system executing it is guaranteed to recover from any fault that does not cause permanent damage. Designing such protocols is hard since they must recover from all possible states; we therefore investigate how feasible it is to synthesize them automatically. We show that synthesizing stabilization on a fixed topology is NP-complete in the number of system states. When a solution is found, we further show that verifying its correctness on a general topology (with any number of processes) is undecidable, even for very simple unidirectional rings. Despite these negative results, we develop an algorithm to synthesize a self-stabilizing protocol given its desired topology, legitimate states, and behavior. By analogy to shadow puppetry, where a puppeteer may design a complex puppet to cast a desired shadow, a protocol may need to be designed in a complex way that does not even resemble its specification. Our shadow/puppet synthesis algorithm addresses this concern and, using a complete backtracking search, has automatically designed 4 new self-stabilizing protocols with minimal process space requirements: 2-state maximal matching on bidirectional rings, 5-state token passing on unidirectional rings, 3-state token passing on bidirectional chains, and 4-state orientation on daisy chains.
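
    To make the notion of recovery "from all possible states" concrete, the following is a minimal sketch that checks self-stabilization by brute force on one fixed, small ring. It uses Dijkstra's classic K-state token ring as a stand-in, not any of the protocols synthesized in the dissertation, and it is a verifier for a single topology size rather than the shadow/puppet synthesis algorithm. The function names and the choice of a central daemon (one privileged process moves at a time, chosen arbitrarily) are assumptions made for illustration.

```python
from itertools import product


def privileged(state):
    """Processes that hold a token (may move) in Dijkstra's K-state ring."""
    n = len(state)
    procs = [0] if state[0] == state[-1] else []
    procs += [i for i in range(1, n) if state[i] != state[i - 1]]
    return procs


def step(state, i, k):
    """State after process i moves: process 0 increments, others copy."""
    nxt = list(state)
    nxt[i] = (state[0] + 1) % k if i == 0 else state[i - 1]
    return tuple(nxt)


def is_legitimate(state):
    """Legitimate states are those with exactly one token."""
    return len(privileged(state)) == 1


def stabilizes(n, k):
    """Exhaustively check closure and convergence for n processes, k values."""
    states = list(product(range(k), repeat=n))

    # Closure: every move from a legitimate state stays legitimate.
    for s in states:
        if is_legitimate(s):
            if any(not is_legitimate(step(s, i, k)) for i in privileged(s)):
                return False

    # Convergence: no deadlock and no cycle among illegitimate states.
    WHITE, GREY, BLACK = 0, 1, 2
    color = {s: WHITE for s in states if not is_legitimate(s)}

    def on_cycle(s):
        # Recursive depth-first search; fine for the tiny rings used here.
        color[s] = GREY
        for i in privileged(s):
            t = step(s, i, k)
            if t not in color:  # successor is legitimate, path ends
                continue
            if color[t] == GREY or (color[t] == WHITE and on_cycle(t)):
                return True
        color[s] = BLACK
        return False

    for s in color:
        if not privileged(s):
            return False  # deadlock outside the legitimate states
        if color[s] == WHITE and on_cycle(s):
            return False
    return True


if __name__ == "__main__":
    print(stabilizes(3, 3), stabilizes(4, 2))
```

    The state space enumerated here has k^n elements, so this kind of check is only practical for very small rings, which is consistent with the abstract's point that the general problem is hard.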

    Exploring Data Augmentation Algorithm to Improve Genomic Prediction of Top-Ranking Cultivars

    Genomic selection (GS) is a groundbreaking statistical machine learning method for advancing plant and animal breeding. Nonetheless, its practical implementation remains challenging due to numerous factors affecting its predictive performance. This research explores the potential of data augmentation to enhance prediction accuracy across entire datasets and specifically within the top 20% of the testing set. Our findings indicate that, overall, the data augmentation method (method A), when compared to the conventional model (method C) and assessed using Mean Arctangent Absolute Percentage Error (MAAPE) and normalized root mean square error (NRMSE), did not improve the prediction accuracy for the unobserved cultivars. However, significant improvements in prediction accuracy (evidenced by reduced prediction error) were observed when data augmentation was applied exclusively to the top 20% of the testing set. Specifically, reductions in MAAPE_20 and NRMSE_20 by 52.86% and 41.05%, respectively, were noted across various datasets. Further investigation is needed to refine data augmentation techniques for effective use in genomic prediction.
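
    As an illustration of the two error metrics named above, here is a minimal sketch of how MAAPE and NRMSE could be computed on a full testing set and on its top 20%. The abstract does not give the formulas, so several choices are assumptions: MAAPE is taken as the mean of arctan(|(observed - predicted) / observed|), NRMSE is normalized by the mean of the observed values, and "top 20%" is read as the 20% of test lines with the highest observed values. The function names and the toy numbers are hypothetical.

```python
import math


def maape(observed, predicted):
    """Mean arctangent absolute percentage error (assumed definition)."""
    total = 0.0
    for o, p in zip(observed, predicted):
        if o == 0:
            # arctan of an infinite ratio is pi/2; a perfect hit counts as 0.
            total += 0.0 if p == 0 else math.pi / 2
        else:
            total += math.atan(abs((o - p) / o))
    return total / len(observed)


def nrmse(observed, predicted):
    """Root mean square error divided by the mean observed value (assumed)."""
    n = len(observed)
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n
    return math.sqrt(mse) / (sum(observed) / n)


def top_fraction(observed, predicted, fraction=0.2):
    """Keep the lines with the highest observed values (assumed ranking rule)."""
    k = max(1, math.ceil(fraction * len(observed)))
    order = sorted(range(len(observed)), key=lambda i: observed[i], reverse=True)
    keep = order[:k]
    return [observed[i] for i in keep], [predicted[i] for i in keep]


if __name__ == "__main__":
    # Hypothetical toy data, not values from the study.
    obs = [10.0, 8.0, 6.0, 4.0, 2.0]
    pred = [9.0, 8.0, 7.0, 4.0, 1.0]
    top_obs, top_pred = top_fraction(obs, pred)
    print("full set:", maape(obs, pred), nrmse(obs, pred))
    print("top 20% :", maape(top_obs, top_pred), nrmse(top_obs, top_pred))
```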

    Use of Protein Crosslinking and Tandem Mass Spectrometry to Study the PsbO, PsbP and PsbQ Extrinsic Proteins of Higher Plant Photosystem II

    Photosystem II (PSII) is a light-driven, water-plastoquinone oxidoreductase present in all oxygenic photosynthetic organisms. The oxygen evolution process is catalyzed by the Mn4CaO5 cluster and an ensemble of intrinsic and extrinsic proteins which are associated with the photosystem. This metal cluster is stabilized and protected from exogenous reductants by the extrinsic proteins, PsbO, PsbP and PsbQ in higher plants, which are present on the lumenal face of PSII. No crystal structure for the higher plant PSII is currently available; consequently, the binding locations of these extrinsic proteins in PSII remain elusive. We have used the chemical crosslinkers bis(sulfosuccinimidyl) suberate (BS3) and 1-ethyl-3-(3-dimethylaminopropyl) carbodiimide (EDC) to crosslink the extrinsic proteins in their bound state to PSII, followed by identification of the crosslinked products by tandem mass spectrometry. BS3 crosslinking identified the interacting domain of PsbP with PsbQ involving the PsbP residues 93Y, 96K and 97T (located in the 17-residue loop 3A, 89G-105S), which are in close proximity (<11.4 Å) to the N-terminal 1E residue of PsbQ. We also found that PsbP assumes a compact structure, based on nine independent crosslinked residues between the N- and C-termini of PsbP. This suggests that the N-terminus of PsbP, 1A-11K (which is not resolved in the current crystal structures), is closely associated with the C-terminal domain 170K-186A. Additionally, interacting domains of two PsbQ copies from different PSII monomers were identified. The residue pairs 98K-133Y and 101K-133Y of PsbQ were crosslinked. These residues are >30 Å apart when mapped onto the PsbQ crystal structure. Since BS3 can only crosslink residues which are within 11.4 Å, these crosslinks are hypothesized to be inter-molecular crosslinks of PsbQ. Furthermore, EDC crosslinking provided structural information pertaining to the organization of the N-terminus, which is absent in cyanobacterial PsbO.
In this study, twenty-four crosslinked residues located in the N-terminal, loop and β-barrel regions of PsbO were identified. The models incorporating crosslinking data suggest several differences between cyanobacterial and higher plant PsbO. The results on extrinsic proteins provide significant new information concerning the association of the extrinsic proteins with PSII and are valuable for proposing overall models of higher plant PSII.

    Ancestral sequence reconstruction as an accessible tool for the engineering of biocatalyst stability

    Synthetic biology is the engineering of life to imbue it with non-natural functionality. As such, synthetic biology has considerable commercial potential, where synthetic metabolic pathways are utilised to convert low value substrates into high value products. High temperature biocatalysis offers several system-level benefits to synthetic biology, including increased dilution of substrate, increased reaction rates and decreased contamination risk. However, the tools currently available for the engineering of thermostable proteins are either expensive, unreliable, or poorly understood, meaning their adoption into synthetic biology workflows is treacherous. This thesis focuses on the development of an accessible tool for the engineering of protein thermostability, based on the evolutionary biology tool ancestral sequence reconstruction (ASR). ASR allows researchers to walk back in time along the branches of a phylogeny and predict the most likely representation of a protein family’s ancestral state. It also has simple input requirements, and its output proteins are often observed to be thermostable, making ASR tractable for protein engineering. Chapter 2 explores the applicability of multiple ASR methods to the engineering of a carboxylic acid reductase (CAR) biocatalyst. Despite the family emerging only 500 million years ago, ancestors presented considerable improvements in thermostability over their modern counterparts. We proceed to thoroughly characterise the ancestral enzymes for their inclusion into the CAR biocatalytic toolbox. Chapter 3 explores why ASR-derived proteins may be thermostable despite a mesophilic history. An in silico toolbox for tracking models of protein stability over simulated evolutionary time at the sequence, protein and population level is built. We provide considerable evidence that the sequence alignments of simulated protein families that evolved at marginal stability are saturated with stabilising residues.
ASR therefore derives sequences from a dataset biased toward stabilisation. Importantly, while ASR is accessible, it still has a steep learning curve because it requires phylogenetic expertise. In chapter 4, we utilise the evolutionary model produced in chapter 3 to develop a highly simplified and accessible ASR protocol. This protocol was then applied to engineer CAR enzymes that displayed dramatic increases in thermostability compared to both modern CARs and the thermostable AncCARs presented in chapter 2.

    The use of vicinal-risk minimization for training decision trees

    We propose the use of Vapnik's vicinal risk minimization (VRM) for training decision trees to approximately maximize decision margins. We implement VRM by propagating uncertainties in the input attributes into the labeling decisions. In this way, we perform a global regularization over the decision tree structure. During a training phase, a decision tree is constructed to minimize the total probability of misclassifying the labeled training examples, a process which approximately maximizes the margins of the resulting classifier. We perform the necessary minimization using an appropriate meta-heuristic (genetic programming) and present results over a range of synthetic and benchmark real datasets. We demonstrate the statistical superiority of VRM training over conventional empirical risk minimization (ERM) and the well-known C4.5 algorithm on these datasets. We also conclude that there is no statistical difference between trees trained by ERM and those trained using C4.5. Training with VRM is shown to be more stable and repeatable than training with ERM.
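
    The difference between ERM and VRM described above can be sketched on the smallest possible tree, a single-threshold stump on one attribute. This is a simplified illustration under stated assumptions, not the authors' implementation: the input uncertainty is assumed to be Gaussian with a fixed standard deviation `sigma`, the search is a plain scan over a hypothetical list of candidate thresholds rather than genetic programming, and the toy data are invented.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def empirical_risk(threshold, xs, labels):
    """Fraction of training points the stump (predict 1 if x > threshold) gets wrong."""
    wrong = sum((x > threshold) != (y == 1) for x, y in zip(xs, labels))
    return wrong / len(xs)


def vicinal_risk(threshold, xs, labels, sigma):
    """Mean probability of misclassifying each point once Gaussian noise is added."""
    total = 0.0
    for x, y in zip(xs, labels):
        # Probability that the perturbed input lands on the "predict 1" side.
        p_right = 1.0 - normal_cdf((threshold - x) / sigma)
        total += (1.0 - p_right) if y == 1 else p_right
    return total / len(xs)


# Hypothetical, linearly separable toy data with a gap between 2 and 6.
xs = [1.0, 2.0, 6.0, 7.0]
labels = [0, 0, 1, 1]
candidates = [2.1, 3.0, 4.0, 5.0, 5.9]

# ERM cannot distinguish between thresholds anywhere inside the gap.
erm_ties = [t for t in candidates if empirical_risk(t, xs, labels) == 0.0]

# VRM prefers the threshold that keeps perturbed points on the correct side.
vrm_best = min(candidates, key=lambda t: vicinal_risk(t, xs, labels, sigma=1.0))

if __name__ == "__main__":
    print("ERM zero-error thresholds:", erm_ties)
    print("VRM choice:", vrm_best)
```

    In this toy case every candidate has zero empirical risk, while the vicinal risk is lowest at the middle of the gap, which is the margin-maximizing behaviour the abstract attributes to VRM.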

    Distributed Knowledge Discovery in Large Scale Peer-to-Peer Networks

    Explosive growth in the availability of various kinds of data in distributed locations has resulted in an unprecedented opportunity to develop distributed knowledge discovery (DKD) techniques. DKD embraces the growing trend of merging computation with communication by performing distributed data analysis and modeling with minimal communication of data. Most of the current state-of-the-art DKD systems suffer from a lack of scalability, robustness and adaptability due to their dependence on a centralized model for building the knowledge discovery model. Peer-to-peer networks offer a more scalable and fault-tolerant computing platform for building distributed knowledge discovery models than client-server based platforms. Algorithms and communication protocols have been developed for file search and discovery services in peer-to-peer networks. The file search algorithms are concerned with identification of a peer and discovery of a file on that specified peer, so most of the current peer-to-peer networks for file search act as directory services. The problem of distributed knowledge discovery differs from file search, however, and new issues and challenges have to be addressed. The algorithms and communication protocols for knowledge discovery deal with implementing algorithms by which every peer in the network discovers the correct knowledge discovery model, as if it were given the combined database. Therefore, algorithms and communication protocols for DKD mainly deal with distributed computing. The distributed computations are entirely asynchronous, impose very little communication overhead, transparently tolerate network topology changes and peer failures, and quickly adjust to changes in the data as they occur. Another important aspect of the distributed computations in a peer-to-peer network is that most of the communication between peer nodes is local, i.e., the knowledge discovery model is learned at each peer using information gathered from a very small neighborhood, whose size is independent of the size of the peer-to-peer network. The peer-to-peer constraints on data and/or computing are the hard ones, so the challenge is to show that it is still possible to extract useful information from the distributed data effectively and dependably. The implementation of a distributed algorithm in an asynchronous and decentralized environment is the hardest challenge. DKD in a peer-to-peer network raises issues related to impracticality of global communications and global synchronization, on-the-fly data updates, lack of control, accuracy of computation, the need to share resources with other applications, and frequent failure and recovery of resources. We propose a methodology based on novel distributed algorithms and communication protocols to perform DKD in a peer-to-peer network. We investigate the performance of our algorithms and communication protocols by means of analysis and simulations.
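
    The idea that each peer can reach a global result using only local exchanges can be illustrated with pairwise gossip averaging, a well-known technique that is used here as a stand-in and is not claimed to be one of the algorithms proposed in this work. In the sketch below, each peer holds one number and repeatedly averages it with a single neighbor; the ring topology, the function name `gossip_average`, and the sequential simulation of asynchronous exchanges are all assumptions made for illustration.

```python
import random


def gossip_average(values, neighbors, rounds, seed=0):
    """Simulate pairwise gossip: one random neighboring pair averages per round."""
    rng = random.Random(seed)
    vals = list(values)
    # Each undirected link is listed once.
    edges = [(i, j) for i, nbrs in neighbors.items() for j in nbrs if i < j]
    for _ in range(rounds):
        i, j = rng.choice(edges)
        # Only the two endpoints communicate; no global coordination is needed.
        vals[i] = vals[j] = (vals[i] + vals[j]) / 2.0
    return vals


# Hypothetical network: 8 peers on a ring, each knowing only its two neighbors.
n = 8
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
local_values = [float(v) for v in range(n)]  # global mean is 3.5

estimates = gossip_average(local_values, ring, rounds=2000)

if __name__ == "__main__":
    print(estimates)
```

    Each exchange preserves the sum of all values, so on a connected network every peer's estimate drifts toward the global mean even though no peer ever sees more than one neighbor's value at a time.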

    Meta-learning computational intelligence architectures

    In computational intelligence, the term 'memetic algorithm' has come to be associated with the algorithmic pairing of a global search method with a local search method. In a sociological context, a 'meme' has been loosely defined as a unit of cultural information, the social analog of genes for individuals. Both of these definitions are inadequate, as 'memetic algorithm' is too specific, and ultimately a misnomer, as much as a 'meme' is defined too generally to be of scientific use. In this dissertation the notion of memes and meta-learning is extended from a computational viewpoint and the purpose, definitions, design guidelines and architecture for effective meta-learning are explored. The background and structure of meta-learning architectures is discussed, incorporating viewpoints from psychology, sociology, computational intelligence, and engineering. The benefits and limitations of meme-based learning are demonstrated through two experimental case studies -- Meta-Learning Genetic Programming and Meta-Learning Traveling Salesman Problem Optimization. Additionally, the development and properties of several new algorithms are detailed, inspired by the previous case-studies. With applications ranging from cognitive science to machine learning, meta-learning has the potential to provide much-needed stimulation to the field of computational intelligence by providing a framework for higher order learning. --Abstract, page iii