66 research outputs found

    Studying Evolutionary Change: Transdisciplinary Advances in Understanding and Measuring Evolution

    Evolutionary processes can be found in almost any historical, i.e. evolving, system that imperfectly copies from the past. Well-studied examples originate not only in evolutionary biology but also in historical linguistics. Yet an approach that binds together studies of such evolving systems is still elusive. This thesis is an attempt to narrow this gap to some extent. An evolving system can be described using characters that identify its changing features. While the problem of a proper choice of characters is beyond the scope of this thesis and remains in the hands of experts, we concern ourselves with theoretical as well as data-driven approaches. Given a well-chosen set of characters describing a system of entities such as homologous genes, i.e. genes of shared origin in different species, we can build a phylogenetic tree. Consider the special case of gene clusters containing paralogous genes, i.e. genes of shared origin within a species, usually located close together, such as the well-known HOX cluster. These clusters are formed by stepwise duplication of their members, often involving unequal crossing over that forms hybrid genes. Gene conversion and possibly other mechanisms of concerted evolution further obfuscate phylogenetic relationships. Hence, it is very difficult or even impossible to disentangle the detailed history of gene duplications in gene clusters. Expansion of gene clusters by unequal crossing over, as proposed by Walter Gehring, leads to distinctive patterns of genetic distances. We show that this special class of distances still allows phylogenetic information to be extracted from the data. Disregarding genome rearrangements, we find that the shortest Hamiltonian path then coincides with the ordering of paralogous genes in a cluster. This observation can be used to detect ancient genomic rearrangements of gene clusters and to distinguish gene clusters whose evolution was dominated by unequal crossing over within genes from those that expanded through other mechanisms. While the evolution of DNA or protein sequences is well studied and can be formally described, we find that this does not hold for other systems such as language evolution, owing to a lack of detectable mechanisms that drive the evolutionary processes in those fields. Hence, it is hard to quantify distances between entities, e.g. languages, and therefore between the characters describing them. Starting out with distortions of distances, we first see that poor choices of the distance measure can lead to incorrect phylogenies. Given that phylogenetic inference requires additive metrics, we can infer the correct phylogeny from a distance matrix D if there is a monotonic, subadditive function ζ such that ζ^−1(D) is additive. We compute the metric-preserving transformation ζ as the solution of an optimization problem. This result shows that the problem of phylogeny reconstruction is well defined even if a detailed mechanistic model of the evolutionary process is missing.
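    The additivity requirement can be made concrete with the classical four-point condition: a distance matrix is additive, i.e. realisable as path lengths in an edge-weighted tree, exactly when for every quadruple of taxa the two largest of the three pairwise sums coincide. A minimal, illustrative Python sketch (the thesis obtains ζ by solving an optimization problem, not by the exhaustive test shown here; the matrix is a made-up toy example):

        from itertools import combinations

        def is_additive(D, tol=1e-9):
            """Four-point condition: D is additive (tree-realisable) iff, for
            every quadruple {i, j, k, l}, the two largest of the three sums
            D[i][j]+D[k][l], D[i][k]+D[j][l], D[i][l]+D[j][k] are equal."""
            n = len(D)
            for i, j, k, l in combinations(range(n), 4):
                s = sorted((D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]))
                if s[2] - s[1] > tol:
                    return False
            return True

        # A candidate transformation zeta would be accepted once the
        # back-transformed matrix zeta^-1(D) passes this test.
        D = [[0, 3, 7, 8],
             [3, 0, 6, 7],
             [7, 6, 0, 5],
             [8, 7, 5, 0]]
        print(is_additive(D))  # True: 7+7 and 8+6 are the two largest sums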
    The lack of such a model does not, however, hinder studies of language evolution using automated tools. As the number of large digital corpora has grown, so have the possibilities for studying them automatically. The obvious parallels between historical linguistics and phylogenetics have led many studies to adapt bioinformatics tools to linguistic ends. Here, we use jAlign, an alignment algorithm that takes the adjacency of letters into account, to calculate bigram alignments. Its performance is tested in different cognate recognition tasks. A major obstacle for pairwise alignments is the systematic errors they make, such as underestimating the number of gaps and misplacing them. Applying multiple sequence alignments instead of a pairwise algorithm implicitly includes more evolutionary information and can thus overcome the problem of correct gap placement. Multiple alignments can be seen as a generalization of the string-to-string edit problem to more than two strings. With the steady increase in computational power, exact dynamic programming solutions have become feasible in practice also for 3- and 4-way alignments. For the pairwise (2-way) case, there is a clear distinction between local and global alignments. As more sequences are considered, this distinction, which can in fact be made independently for both ends of each sequence, gives rise to a rich set of partially local alignment problems. So far these have remained largely unexplored. We therefore introduce a general formal framework that gives rise to a classification of partially local alignment problems. It leads to a generic scheme that guides the principled design of exact dynamic programming solutions for particular partially local alignment problems.
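    As an illustration of how these end-locality choices enter the recursions, the following Python sketch handles the pairwise case, where each of the four sequence ends can independently be declared free (its overhang is unpenalised); the scoring values and example sequences are placeholders, and the thesis treats the general multi-way setting:

        def partially_local_score(a, b, match=1, mismatch=-1, gap=-2,
                                  free_a=(False, False), free_b=(False, False)):
            """Pairwise alignment score where each end of each sequence is
            independently global (overhang penalised) or local (overhang free).
            free_a = (start, end) flags for sequence a; likewise free_b.
            (A fully local alignment would also maximise over all interior cells.)"""
            n, m = len(a), len(b)
            S = [[0] * (m + 1) for _ in range(n + 1)]
            for i in range(1, n + 1):  # unaligned prefix of a: free or penalised
                S[i][0] = 0 if free_a[0] else i * gap
            for j in range(1, m + 1):  # unaligned prefix of b: free or penalised
                S[0][j] = 0 if free_b[0] else j * gap
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    sub = match if a[i - 1] == b[j - 1] else mismatch
                    S[i][j] = max(S[i - 1][j - 1] + sub,
                                  S[i - 1][j] + gap,
                                  S[i][j - 1] + gap)
            best = S[n][m]
            if free_a[1]:  # unaligned suffix of a is free
                best = max(best, *(S[i][m] for i in range(n + 1)))
            if free_b[1]:  # unaligned suffix of b is free
                best = max(best, *(S[n][j] for j in range(m + 1)))
            return best

        # Overlap-style case: free start of a, free end of b.
        print(partially_local_score("CGTACGT", "TACGTAA",
                                    free_a=(True, False), free_b=(False, True)))  # 5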

    Cheminformatics Tools to Explore the Chemical Space of Peptides and Natural Products

    Cheminformatics facilitates the analysis, storage, and collection of large quantities of chemical data, such as molecular structures and molecules' properties and biological activities, and it has revolutionized medicinal chemistry for small molecules. However, its application to larger molecules is still underrepresented. This thesis attempts to fill this gap and extend the cheminformatics approach towards large molecules and peptides. The thesis is divided into two parts. The first part presents the implementation and application of two new molecular descriptors: the macromolecule extended atom-pair fingerprint (MXFP) and the MinHashed atom-pair fingerprint of radius 2 (MAP4). MXFP is an atom-pair fingerprint suitable for large molecules, and here it is used to explore the chemical space of non-Lipinski molecules within the widely used PubChem and ChEMBL databases. MAP4 is a MinHashed hybrid of substructure and atom-pair fingerprints suitable for encoding both small and large molecules. MAP4 is first benchmarked against commonly used atom-pair and substructure fingerprints, and then it is used to investigate the chemical space of microbial and plant natural products with the aid of machine learning and chemical space mapping. The second part of the thesis focuses on peptides and is introduced by a review chapter on approaches to discovering novel peptide structures and on describing the known peptide chemical space. Then, a genetic algorithm that uses MXFP in its fitness function is described and challenged to generate peptide analogs of peptidic or non-peptidic queries. Finally, supervised and unsupervised machine learning is used to generate novel antimicrobial and non-hemolytic peptide sequences.
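    To illustrate the MinHashing step that gives MAP4 its name: each of k hash functions is simulated by salting a single hash, the signature keeps the minimum value per salt, and the fraction of agreeing signature slots estimates the Jaccard similarity of the underlying feature sets. A generic Python sketch (the feature strings are made-up stand-ins, not the actual MAP4 encoding):

        import hashlib

        def minhash_signature(features, n_hashes=64):
            """MinHash signature of a set of feature strings: one salted
            hash per slot, keeping the minimum over all features."""
            return [min(int(hashlib.sha1(f"{k}|{f}".encode()).hexdigest(), 16)
                        for f in features)
                    for k in range(n_hashes)]

        def jaccard_estimate(sig_a, sig_b):
            """Fraction of agreeing slots estimates the Jaccard similarity."""
            return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

        # Made-up atom-pair-style features (atom type, topological distance, atom type):
        mol_a = {"C|1|N", "C|2|O", "N|3|O", "C|4|C"}
        mol_b = {"C|1|N", "C|2|O", "N|3|O", "C|5|S"}
        sim = jaccard_estimate(minhash_signature(mol_a), minhash_signature(mol_b))
        print(round(sim, 2))  # close to the true Jaccard similarity 3/5 = 0.6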

    Development of Computer-aided Concepts for the Optimization of Single-Molecules and their Integration for High-Throughput Screenings

    In the field of synthetic biology, highly interdisciplinary approaches to the design and modelling of functional molecules using computer-assisted methods have become established in recent decades. These methods are mainly used where experimental approaches reach their limits, since computer models can, for example, elucidate the temporal behaviour of nucleic acid polymers or proteins through single-molecule simulations and illustrate the functional relationships of amino acid residues or nucleotides to one another. The knowledge gained from computer modelling can continuously feed back into the further experimental process (screening) as well as into the shape or function of the molecule under consideration (rational design). Such human-guided optimization of biomolecules is often necessary because the substrates of interest for biocatalysts and enzymes are usually synthetic ("man-made" materials such as PET), so evolution has had no time to provide efficient biocatalysts. With regard to the computer-aided design of single molecules, two fundamental paradigms dominate the field of synthetic biology: on the one hand, probabilistic experimental methods (e.g., evolutionary design processes such as directed evolution) combined with High-Throughput Screening (HTS); on the other hand, rational, computer-aided single-molecule design. For both topics, computer models and concepts were developed, evaluated, and published. The first contribution in this thesis describes a computer-aided design approach for the Fusarium solani cutinase (FsC). The loss of enzyme activity during longer incubation with PET was investigated in molecular detail. For this purpose, Molecular Dynamics (MD) simulations of the spatial structure of FsC together with ethylene glycol, a water-soluble degradation product of the synthetic substrate PET, were computed. The existing model was extended by combining it with reduced models. This simulation study identified certain areas of FsC that interact very strongly with ethylene glycol and thus have a significant influence on the flexibility and structure of the enzyme. The subsequent original publication establishes a new method for the selection of high-throughput assays for use in protein chemistry. The selection is made via a meta-optimization of the assays to be analyzed: control reactions are carried out for each assay, the distance between the control distributions is evaluated using classical statistical methods such as the Kolmogorov-Smirnov test, and a performance score is then assigned to each assay. These control experiments are performed before the actual screening, and the assay with the highest performance is used for further screening. Applying this generic method yields high success rates, which we were able to demonstrate experimentally using lipases and esterases as examples.
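    The assay meta-optimization can be pictured with a short sketch, assuming each candidate assay comes with positive- and negative-control measurements (assay names and readouts below are hypothetical; scipy's two-sample Kolmogorov-Smirnov test supplies the distance between the control distributions):

        from scipy.stats import ks_2samp

        def rank_assays(control_data):
            """Score each assay by the KS statistic between its positive and
            negative control distributions; rank highest (best separation) first."""
            scores = {name: ks_2samp(pos, neg).statistic
                      for name, (pos, neg) in control_data.items()}
            return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

        # Hypothetical readouts for two candidate assays:
        control_data = {
            "assay_pNP":  ([0.9, 1.1, 1.0, 1.2], [0.20, 0.30, 0.25, 0.10]),
            "assay_FRET": ([0.6, 0.8, 0.5, 0.7], [0.40, 0.60, 0.50, 0.45]),
        }
        print(rank_assays(control_data))  # the better-separating assay comes first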
    In the area of green chemistry, the processes mentioned above can help to find enzymes for the degradation of synthetic materials more quickly, or to modify naturally occurring enzymes so that they can efficiently convert synthetic substrates after successful optimization. In doing so, the experimental effort (consumption of materials) is kept to a minimum during the practical implementation. Especially for large-scale screenings, a prior consideration or restriction of the possible sequence space can contribute significantly to maximizing the success rate of screenings and minimizing the total time they require. In addition to classical methods such as MD simulations in combination with reduced models, new graph-based methods for the representation and analysis of MD simulations were developed. For this purpose, simulations were converted into distance-dependent dynamic graphs. Based on this reduced representation, efficient algorithms for analysis were developed and tested. In particular, network motifs were investigated to determine whether this type of semantics is more suitable than spatial coordinates for describing molecular structures and interactions within MD simulations. The concept was evaluated for MD simulations of various molecules, such as water, synthetic pores, proteins, peptides, and RNA structures, and it was shown that this novel form of semantics is an excellent way to describe (bio)molecular structures and their dynamics. Furthermore, an algorithm (StreAM-Tg) was developed for the creation of motif-based Markov models, especially for the analysis of single-molecule simulations of nucleic acids; it is used for the design of RNAs. The insights obtained from the analysis with StreAM-Tg (Markov models) can provide useful recommendations for the (re)design of functional RNA. In this context, a new method was developed to quantify the environment (i.e., water; the solvent context) and its influence on biomolecules in MD simulations. For this purpose, three-vertex motifs were used to describe the structure of the individual water molecules. This method describes the structure and dynamics of water accurately: for example, we were able to reproduce the thermodynamic entropy of water in the liquid and vapor phases along the vapor-liquid equilibrium curve from the triple point to the critical point. Another major field covered in this thesis is the development of new computer-aided approaches to HTS for the design of functional RNA. For the production of functional RNA (e.g., aptamers and riboswitches), an experimental, round-based HTS such as SELEX is typically used. By combining Next Generation Sequencing (NGS) with the SELEX process, this design process can be studied at the nucleotide and secondary-structure levels for the first time. The special feature of small RNA molecules compared to proteins is that a minimum-free-energy secondary structure (topology) can be determined directly from the nucleotide sequence with a high degree of certainty. Using the combination of M. Zuker's algorithm, NGS, and the SELEX method, it was possible to quantify the structural diversity of individual RNA molecules while taking the genetic context into account. This combination of methods made it possible to pinpoint the round in which the first ciprofloxacin riboswitch emerged. In this example, only a simple structural comparison (Levenshtein distance) was used to quantify the diversity of each round. To improve on this, RNA structures were modeled as directed graphs, which were then compared using a probabilistic subgraph isomorphism. Finally, the NGS dataset (ciprofloxacin riboswitch) was modeled as a dynamic graph and analyzed for the occurrence of defined seven-vertex motifs.
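    The conversion of coordinate frames into a distance-dependent dynamic graph, as used above, can be pictured with a minimal sketch (the cutoff value and random toy frames are placeholders; the thesis pipeline is considerably more elaborate):

        import numpy as np

        def frame_to_edges(coords, cutoff=5.0):
            """One frame (an N x 3 coordinate array) -> the set of undirected
            edges joining vertices closer than `cutoff`."""
            d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
            n = len(coords)
            return {(i, j) for i in range(n) for j in range(i + 1, n)
                    if d[i, j] <= cutoff}

        # The dynamic graph is then simply the sequence of edge sets over frames:
        trajectory = [np.random.rand(10, 3) * 10 for _ in range(3)]  # toy frames
        dynamic_graph = [frame_to_edges(f, cutoff=4.0) for f in trajectory]
        print([len(edges) for edges in dynamic_graph])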
    With this, motif-based semantics were integrated into high-throughput screening for RNA molecules for the first time. The identified motifs could be assigned to secondary-structure elements that had been identified experimentally in the ciprofloxacin aptamer R10k6. Finally, all the algorithms presented were integrated into an R library, published, and made available to scientists worldwide.
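    The simple structural comparison mentioned above reduces, in its most basic form, to the edit distance between predicted dot-bracket strings; a minimal sketch (the two structures are made up):

        def levenshtein(a, b):
            """Classic dynamic-programming edit distance, applied here to
            dot-bracket encodings of RNA secondary structures."""
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                cur = [i]
                for j, cb in enumerate(b, 1):
                    cur.append(min(prev[j] + 1,                 # deletion
                                   cur[j - 1] + 1,              # insertion
                                   prev[j - 1] + (ca != cb)))   # substitution
                prev = cur
            return prev[-1]

        # Hypothetical minimum-free-energy structures from two SELEX rounds:
        print(levenshtein("((((...))))..", "((((.....))))"))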

    LIPIcs, Volume 251, ITCS 2023, Complete Volume

    LIPIcs, Volume 251, ITCS 2023, Complete Volume

    Sequence Comparison in Historical Linguistics

    B

    Building Blocks for Mapping Services

    Mapping services are ubiquitous on the Internet and enjoy a considerable user base. It is often overlooked, however, that providing such a service on a global scale to virtually millions of users has been the playground of an oligopoly: only a select few service providers are able to do so. Unfortunately, the literature on these solutions is more than scarce. This thesis adds a number of building blocks to the literature that explain how to design and implement several of their features.

    Sublinear Computation Paradigm

    This open access book gives an overview of cutting-edge work on a new paradigm called the “sublinear computation paradigm,” which was proposed in the large multiyear academic research project “Foundations of Innovative Algorithms for Big Data,” run in Japan from October 2014 to March 2020. To handle the unprecedented explosion of big data sets in research, industry, and other areas of society, there is an urgent need for novel methods and approaches for big data analysis. To meet this need, innovative changes in algorithm theory for big data are being pursued. For example, polynomial-time algorithms have thus far been regarded as “fast,” but if a quadratic-time algorithm is applied to a petabyte-scale or larger big data set, problems are encountered in terms of computational resources or running time. To deal with this critical computational and algorithmic bottleneck, linear-, sublinear-, and constant-time algorithms are required. The sublinear computation paradigm is proposed here in order to support innovation in the big data era. A foundation of innovative algorithms has been created by developing computational procedures, data structures, and modelling techniques for big data. The project was organized into three teams focusing on sublinear algorithms, sublinear data structures, and sublinear modelling. The work has provided high-level academic research results of strong computational and algorithmic interest, which are presented in this book. The book consists of five parts: Part I, a single chapter on the concept of the sublinear computation paradigm; Parts II, III, and IV, which review results on sublinear algorithms, sublinear data structures, and sublinear modelling, respectively; and Part V, which presents application results. The information presented here will inspire researchers who work in the field of modern algorithms.
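    As a toy illustration of the paradigm (not an example taken from the book): many sublinear algorithms replace a full pass over the data with random sampling, trading a controlled approximation error for a running time that is independent of the input size:

        import random

        def estimate_fraction(data, predicate, n_samples=1000, seed=42):
            """Estimate the fraction of items satisfying `predicate` from a
            fixed number of random probes -- constant time in len(data),
            with standard error on the order of 1/sqrt(n_samples)."""
            rng = random.Random(seed)
            hits = sum(predicate(data[rng.randrange(len(data))])
                       for _ in range(n_samples))
            return hits / n_samples

        data = list(range(10_000_000))  # "big" toy data set
        print(estimate_fraction(data, lambda x: x % 3 == 0))  # close to 1/3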