Studying Evolutionary Change: Transdisciplinary Advances in Understanding and Measuring Evolution
Evolutionary processes can be found in almost any historical, i.e. evolving, system that copies from the past with errors. Well-studied examples originate not only in evolutionary biology but also in historical linguistics. Yet an approach that would bind together studies of such evolving systems is still elusive. This thesis is an attempt to narrow this gap to some extent.
An evolving system can be described using characters that identify its changing features. While the problem of a proper choice of characters is beyond the scope of this thesis and remains in the hands of domain experts, we concern ourselves with some theoretical as well as data-driven approaches.
Having a well-chosen set of characters describing a system of different entities such as homologous genes, i.e. genes of the same origin in different species, we can build a phylogenetic tree. Consider the special case of gene clusters containing paralogous genes, i.e. genes of the same origin within a species, usually located close together, such as the well-known HOX cluster. These are formed by stepwise duplication of their members, often involving unequal crossing over that forms hybrid genes. Gene conversion and possibly other mechanisms of concerted evolution further obfuscate phylogenetic relationships. Hence, it is very difficult or even impossible to disentangle the detailed history of gene duplications in gene clusters. Expansion of gene clusters by unequal crossing over, as proposed by Walter Gehring, leads to distinctive patterns of genetic distances. We show that this special class of distances still helps in extracting phylogenetic information from the data.
Disregarding genome rearrangements, we find that the shortest Hamiltonian path then coincides with the ordering of paralogous genes in a cluster. This observation can be used to detect ancient genomic rearrangements of gene clusters and to distinguish gene clusters whose evolution was dominated by unequal crossing over within genes from those that expanded through other mechanisms.
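The observation above can be sketched with a brute-force search over gene orders, using a hypothetical toy distance matrix (not the thesis' actual method for large clusters):

```python
from itertools import permutations

def shortest_hamiltonian_path(D):
    # Brute-force search over all gene orders; feasible only for small
    # clusters, but enough to illustrate the observation.
    n = len(D)
    best_order, best_len = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(D[perm[i]][perm[i + 1]] for i in range(n - 1))
        if total < best_len:
            best_order, best_len = perm, total
    return best_order, best_len

# Hypothetical distance matrix for a four-gene cluster: distances grow
# with positional separation, as expected under stepwise duplication
# without rearrangement.
D = [[0, 1, 2, 3],
     [1, 0, 1, 2],
     [2, 1, 0, 1],
     [3, 2, 1, 0]]
order, length = shortest_hamiltonian_path(D)
print(order, length)  # the shortest path visits the genes in cluster order
```

A rearranged cluster would yield a shortest path that deviates from the genomic order, which is exactly the signal used for detection.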
While the evolution of DNA or protein sequences is well studied and can be formally described, the same does not hold for other systems such as language evolution. This is due to a lack of detectable mechanisms that drive the evolutionary processes in these fields. Hence, it is hard to quantify distances between entities, e.g. languages, and therefore between the characters describing them. Starting out with distortions of distances, we first see that poor choices of the distance measure can lead to incorrect phylogenies. Given that phylogenetic inference requires additive metrics, we can infer the correct phylogeny from a distance matrix D if there is a monotonic, subadditive function ζ such that ζ^−1(D) is additive. We compute the metric-preserving transformation ζ as the solution of an optimization problem. This result shows that the problem of phylogeny reconstruction is well defined even if a detailed mechanistic model of the evolutionary process is missing.
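The additivity requirement can be made concrete via the four-point condition; a sketch with an invented 4-taxon tree metric and ζ(x) = √x as the distortion:

```python
import math
from itertools import combinations

def is_additive(D, tol=1e-9):
    # Four-point condition: on an additive (tree) metric, for every
    # quadruple the two largest of the three pairwise sums coincide.
    n = len(D)
    for i, j, k, l in combinations(range(n), 4):
        s = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
        if s[2] - s[1] > tol:
            return False
    return True

# Additive distances on a 4-taxon tree ((a,b),(c,d)) with unequal edges.
T = [[0, 3, 5, 6],
     [3, 0, 6, 7],
     [5, 6, 0, 7],
     [6, 7, 7, 0]]
# Observed distances distorted by a monotonic, subadditive zeta(x) = sqrt(x).
D = [[math.sqrt(x) for x in row] for row in T]

print(is_additive(D))                                    # False: distorted
print(is_additive([[x * x for x in row] for row in D]))  # True: zeta^-1 restores additivity
```

The thesis finds ζ by optimization; here the distortion is known in advance purely for illustration.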
Yet, this does not hinder studies of language evolution using automated tools. As the amount of large digital corpora available has increased, so have the possibilities for studying them automatically. The obvious parallels between historical linguistics and phylogenetics have led to many studies adapting bioinformatics tools to linguistic needs. Here, we use jAlign to calculate bigram alignments, i.e. alignments computed by an algorithm that takes the adjacency of letters into account. Its performance is tested in different cognate recognition tasks.
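jAlign itself is not reproduced here; as an illustration of why letter adjacency matters for cognate recognition, a common bigram-based baseline (Dice similarity) can be sketched:

```python
def bigrams(word):
    return [word[i:i + 2] for i in range(len(word) - 1)]

def dice(a, b):
    # Dice coefficient on shared letter bigrams: a crude measure of how
    # much adjacency structure two word forms have in common.
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    shared = sum(min(ba.count(g), bb.count(g)) for g in set(ba))
    return 2 * shared / (len(ba) + len(bb))

# English 'night' vs. German 'nacht': only the bigram 'ht' is shared.
print(dice("night", "nacht"))  # 0.25
```

Alignment-based methods such as jAlign refine this idea by scoring where the shared bigrams occur, not merely how many there are.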
One major obstacle when using pairwise alignments is the systematic errors they make, such as the underestimation and misplacement of gaps. Applying multiple sequence alignments instead of a pairwise algorithm implicitly includes more evolutionary information and can thus overcome the problem of correct gap placement. Multiple alignments can be seen as a generalization of the string-to-string edit problem to more than two strings. With the steady increase in computational power, exact dynamic programming solutions have become feasible in practice also for 3- and 4-way alignments. For the pairwise (2-way) case, there is a clear distinction between local and global alignments. As more sequences are considered, this distinction, which can in fact be made independently for both ends of each sequence, gives rise to a rich set of partially local alignment problems. So far these have remained largely unexplored. Thus, a general formal framework that gives rise to a classification of partially local alignment problems is introduced. It leads to a generic scheme that guides the principled design of exact dynamic programming solutions for particular partially local alignment problems.
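A minimal sketch of the 2-way case with the four sequence ends made independently free or penalized (illustrative scoring parameters, not the framework's full classification):

```python
def align(a, b, free_start_a=False, free_start_b=False,
          free_end_a=False, free_end_b=False,
          match=1, mismatch=-1, gap=-2):
    # Each of the four sequence ends can independently be 'free'
    # (unpenalized), giving the 2^4 end configurations that underlie
    # partially local alignment problems.
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = 0 if free_start_a else i * gap  # prefix of a may be skipped
    for j in range(1, m + 1):
        F[0][j] = 0 if free_start_b else j * gap  # prefix of b may be skipped
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    best = F[n][m]
    if free_end_a:  # suffix of a may stay unaligned
        best = max(best, max(F[i][m] for i in range(n + 1)))
    if free_end_b:  # suffix of b may stay unaligned
        best = max(best, max(F[n][j] for j in range(m + 1)))
    return best

print(align("ACGTACGT", "GTAC"))                                      # fully global
print(align("ACGTACGT", "GTAC", free_start_a=True, free_end_a=True))  # b embedded in a
```

For k sequences the same end choices multiply to 4^k configurations, which is what makes a principled classification necessary.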
Cheminformatics Tools to Explore the Chemical Space of Peptides and Natural Products
Cheminformatics facilitates the collection, storage, and analysis of large quantities of chemical data, such as molecular structures, properties, and biological activities, and it has revolutionized medicinal chemistry for small molecules. However, its application to larger molecules is still underrepresented. This thesis attempts to fill this gap and extend the cheminformatics approach towards large molecules and peptides.
This thesis is divided into two parts. The first part presents the implementation and application of two new molecular descriptors: the macromolecule extended atom pair fingerprint (MXFP) and the MinHashed atom pair fingerprint of radius 2 (MAP4). MXFP is an atom pair fingerprint suitable for large molecules, and here it is used to explore the chemical space of non-Lipinski molecules within the widely used PubChem and ChEMBL databases. MAP4 is a MinHashed hybrid of substructure and atom pair fingerprints suitable for encoding small and large molecules. MAP4 is first benchmarked against commonly used atom pair and substructure fingerprints, and then it is used to investigate the chemical space of microbial and plant natural products with the aid of machine learning and chemical space mapping.
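The MinHashing step behind MAP4 can be illustrated in isolation, using invented atom-pair-like feature strings (the real fingerprint hashes circular substructures and atom-pair shingles extracted from molecules):

```python
import hashlib

def minhash(features, n_perm=64):
    # For each of n_perm seeded hash functions, keep the minimal hash
    # value over the feature set; this fixed-length signature preserves
    # set similarity.
    sig = []
    for seed in range(n_perm):
        sig.append(min(int(hashlib.md5(f"{seed}:{f}".encode()).hexdigest(), 16)
                       for f in features))
    return sig

def jaccard_estimate(sig_a, sig_b):
    # The fraction of matching signature slots estimates the Jaccard
    # similarity of the underlying feature sets.
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Invented atom-pair-like feature strings for two molecules.
a = {"C-C:1", "C-O:2", "C-N:3", "O-N:1"}
b = {"C-C:1", "C-O:2", "C-N:3", "C-S:2"}
est = jaccard_estimate(minhash(a), minhash(b))
print(est)  # close to the true Jaccard similarity 3/5
```

MinHashing makes the signature length independent of molecule size, which is what allows one fingerprint to cover both small molecules and peptides.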
The second part of the thesis focuses on peptides and is introduced by a review chapter on approaches for discovering novel peptide structures and for describing the known peptide chemical space. Then, a genetic algorithm that uses MXFP in its fitness function is described and challenged to generate peptide analogs of peptidic or non-peptidic queries. Finally, supervised and unsupervised machine learning is used to generate novel antimicrobial and non-hemolytic peptide sequences.
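The shape of such a fingerprint-driven genetic algorithm can be sketched with a toy fitness function (a hypothetical position-match score and an invented query sequence stand in for the MXFP-based chemical-space distance):

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"

def fitness(seq, query):
    # Toy fitness: fraction of matching positions. The thesis scores
    # candidates by MXFP distance to the query in chemical space instead.
    return sum(x == y for x, y in zip(seq, query)) / len(query)

def evolve(query, pop_size=50, generations=100, seed=0):
    rng = random.Random(seed)
    pop = ["".join(rng.choice(AA) for _ in query) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda s: fitness(s, query), reverse=True)
        parents = pop[: pop_size // 2]          # elitist selection
        children = []
        while len(children) < pop_size - len(parents):
            p1, p2 = rng.sample(parents, 2)
            cut = rng.randrange(1, len(query))  # one-point crossover
            child = list(p1[:cut] + p2[cut:])
            child[rng.randrange(len(child))] = rng.choice(AA)  # point mutation
            children.append("".join(child))
        pop = parents + children
    return max(pop, key=lambda s: fitness(s, query))

query = "GIGKFLHSAK"  # invented 10-mer query peptide
best = evolve(query)
print(best, fitness(best, query))
```

Swapping the toy fitness for a fingerprint distance is what lets the same loop optimize toward non-peptidic queries as well.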
Development of Computer-aided Concepts for the Optimization of Single-Molecules and their Integration for High-Throughput Screenings
In the field of synthetic biology, highly interdisciplinary approaches to the
design and modelling of functional molecules using computer-assisted methods
have become established in recent decades. These computer-assisted methods are
mainly used where experimental approaches reach their limits: computer models
can, for example, elucidate the temporal behaviour of nucleic acid polymers or
proteins through single-molecule simulations and illustrate the functional
relationships of amino acid residues or nucleotides to one another. The
knowledge gained from computer modelling can be fed back continuously into the
further experimental process (screening) and into the shape or function
(rational design) of the molecule under consideration. Such human-guided
optimization of biomolecules is often necessary, since the substrates of
interest for biocatalysts and enzymes are usually synthetic (``man-made
materials'', such as PET), and evolution has had no time to provide efficient
biocatalysts.
With regard to the computer-aided design of single molecules, two fundamental
paradigms dominate the field of synthetic biology: on the one hand,
probabilistic experimental methods (e.g., evolutionary design processes such as
directed evolution) combined with High-Throughput Screening (HTS); on the
other, rational, computer-aided single-molecule design methods.
For both topics, computer models/concepts were developed, evaluated and
published.
The first contribution in this thesis describes a computer-aided design
approach for the Fusarium solani cutinase (FsC). The enzyme's loss of activity
during longer incubation periods with PET was investigated in molecular detail.
For this purpose, Molecular Dynamics (MD) simulations of the spatial structure
of FsC together with a water-soluble degradation product of the synthetic
substrate PET (ethylene glycol) were computed. The existing model was extended
by combining it with Reduced Models. This simulation study identified certain
areas of FsC that interact particularly strongly with PET (ethylene glycol) and
thus have a significant influence on the flexibility and structure of the
enzyme.
The subsequent original publication establishes a new method for the selection
of High-Throughput assays for use in protein chemistry. The selection is
made via a meta-optimization of the assays to be analyzed. For this purpose,
control reactions are carried out for the respective assay. The distance
between the control distributions is evaluated using classical statistical
methods such as the Kolmogorov-Smirnov test. A performance is then assigned to
each assay. The
described control experiments are performed before the actual experiment
(screening), and the assay with the highest performance is used for further
screening. By applying this generic method, high success rates can be achieved.
We were able to demonstrate this experimentally using
lipases and esterases as an example.
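A minimal sketch of this meta-optimization, assuming invented control measurements and a stdlib-only KS statistic:

```python
def ks_statistic(x, y):
    # Two-sample Kolmogorov-Smirnov statistic: maximum gap between the
    # empirical CDFs of the two control distributions.
    def ecdf(data, t):
        return sum(v <= t for v in data) / len(data)
    pts = sorted(set(x) | set(y))
    return max(abs(ecdf(x, t) - ecdf(y, t)) for t in pts)

# Hypothetical control measurements for two candidate assays:
# positive controls (active enzyme) vs. negative controls (no enzyme).
assays = {
    "assay_A": ([0.9, 1.0, 1.1, 1.2], [0.1, 0.2, 0.2, 0.3]),  # well separated
    "assay_B": ([0.5, 0.6, 0.7, 0.8], [0.4, 0.5, 0.6, 0.7]),  # overlapping
}
performance = {name: ks_statistic(pos, neg) for name, (pos, neg) in assays.items()}
best = max(performance, key=performance.get)
print(best, performance)  # assay_A separates its controls more cleanly
```

The assay with the largest control separation is the one carried forward into the actual screening, exactly as described above.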
In the area of green chemistry, the above-mentioned processes can help to find
enzymes for the degradation of synthetic materials more quickly, or to modify
naturally occurring enzymes so that they can efficiently convert synthetic
substrates after successful optimization. For this
purpose, the experimental effort (consumption of materials) is kept to a minimum
during the practical implementation. Especially for large-scale screenings, a
prior consideration or restriction of the possible sequence-space can contribute significantly to
maximizing the success rate of screenings and minimizing the total
time they require.
In addition to classical methods such as MD simulations in combination with
reduced models, new graph-based methods for the presentation and analysis of MD
simulations have been developed. For this purpose, simulations were converted
into distance-dependent dynamic graphs. Based on this reduced representation,
efficient algorithms for analysis were developed and tested. In particular,
network motifs were investigated to determine whether this type of
semantics is more suitable for describing molecular structures and interactions
within MD simulations than spatial coordinates. This concept was evaluated for
various MD simulations of molecules, such as water, synthetic pores, proteins,
peptides and RNA structures. It has been shown that this novel form of semantics
is an excellent way to describe (bio)molecular structures and their dynamics.
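A minimal sketch of the conversion from coordinates to a distance graph, together with a three-vertex motif (triangle) count, on invented toy frames:

```python
import math
from itertools import combinations

def frame_to_graph(coords, cutoff=1.5):
    # Turn one MD frame (atom index -> 3D position) into an undirected
    # distance graph: an edge whenever two atoms lie within the cutoff.
    edges = set()
    for (i, p), (j, q) in combinations(coords.items(), 2):
        if math.dist(p, q) <= cutoff:
            edges.add((min(i, j), max(i, j)))
    return edges

def count_triangles(edges):
    # Count three-vertex motifs (triangles); counting k > j ensures each
    # triangle is counted exactly once.
    adj = {}
    for i, j in edges:
        adj.setdefault(i, set()).add(j)
        adj.setdefault(j, set()).add(i)
    return sum(1 for i, j in edges for k in adj[i] & adj[j] if k > j)

# Two invented 'frames' of a four-atom trajectory: a triangle of close
# atoms that breaks apart in the second frame.
frame1 = {0: (0, 0, 0), 1: (1, 0, 0), 2: (0.5, 0.8, 0), 3: (5, 5, 5)}
frame2 = {0: (0, 0, 0), 1: (1, 0, 0), 2: (3, 0, 0), 3: (5, 5, 5)}
for f in (frame1, frame2):
    e = frame_to_graph(f)
    print(len(e), count_triangles(e))
```

Tracking such motif counts over frames is the reduced, dynamic-graph representation on which the analyses above operate; the thesis' algorithms of course scale beyond this brute-force sketch.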
Furthermore, an algorithm (StreAM-Tg) has been developed for the creation of
motif-based Markov models, especially for the analysis of single molecule
simulations of nucleic acids. This algorithm is used for the design of RNAs. The
insights obtained from the analysis with StreAM-Tg (Markov models) can
provide useful design recommendations for the (re)design of functional RNA.
In this context, a new method was developed to quantify the environment (i.e.
water; solvent context) and its influence on biomolecules in MD simulations. For
this purpose, three-vertex motifs were used to describe the structure of the
individual water molecules. This new method offers many advantages. With this
method, the structure and dynamics of water can be accurately described. For
example, we were able to reproduce the thermodynamic entropy of water in the
liquid and vapor phase along the vapor-liquid equilibrium curve from the
triple point to the critical point.
Another major field covered in this thesis is the development of new
computer-aided approaches for HTS for the design of
functional RNA. For the production of functional RNA (e.g., aptamers and riboswitches), an experimental,
round-based HTS (like SELEX) is typically used. By using
Next Generation Sequencing (NGS) in combination with the SELEX process,
this design process can be studied at the nucleotide and secondary structure
levels for the first time. A special feature of small RNA molecules compared
to proteins is that the secondary structure (topology) with minimum free
energy can be determined directly from the nucleotide sequence with a high
degree of certainty.
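Zuker's algorithm itself requires full thermodynamic energy tables; the principle of computing structure directly from sequence can be sketched with the simpler Nussinov base-pair maximization (a stand-in, not the method used here):

```python
def nussinov(seq, min_loop=3):
    # Dynamic program maximizing the number of base pairs; Zuker's MFE
    # folding refines this recursion with stacking and loop energies.
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    M = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            best = M[i][j - 1]  # j stays unpaired
            for k in range(i, j - min_loop):  # hairpin loops >= min_loop
                if (seq[k], seq[j]) in pairs:
                    left = M[i][k - 1] if k > i else 0
                    best = max(best, left + M[k + 1][j - 1] + 1)
            M[i][j] = best
    return M[0][n - 1]

# An invented hairpin-forming sequence: three G-C pairs close the stem.
print(nussinov("GGGAAAUCCC"))  # 3
```

Because the DP is quadratic in table size, such predictions remain cheap even for the millions of sequences an NGS-coupled SELEX round produces.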
Using the combination of M. Zuker's algorithm, NGS and the SELEX method, it was
possible to quantify the structural diversity of individual RNA molecules under
consideration of the genetic context. This combination of methods allowed the
prediction of rounds in which the first ciprofloxacin-riboswitch emerged.
In this example, only a simple structural comparison (Levenshtein distance)
was used to quantify the diversity of each round.
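The Levenshtein baseline can be sketched on dot-bracket structure strings (hypothetical structures from two selection rounds):

```python
def levenshtein(a, b):
    # Standard edit distance with a rolling row; applied to dot-bracket
    # strings it gives the simple per-round structural diversity measure.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # (mis)match
        prev = cur
    return prev[-1]

# Two invented hairpins differing in stem length.
print(levenshtein("((((...))))", "(((.....)))"))  # 2
```

Its weakness is visible even here: the edit distance sees only characters, not the nested pairing they encode, which motivates the graph-based comparison that follows.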
To improve this, a new representation of the RNA structure as a directed graph
was modeled, which was then compared with a probabilistic subgraph isomorphism.
Finally, the NGS dataset (ciprofloxacin-riboswitch) was modeled as a dynamic
graph and analyzed for the occurrence of defined seven-vertex motifs. For this
purpose, motif-based semantics were integrated into HTS
for RNA molecules for the first time. The identified motifs could be assigned to
secondary structural elements that were identified experimentally in the
ciprofloxacin aptamer R10k6.
Finally, all the algorithms presented were integrated into an R library,
published, and made available to scientists all over the world.
LIPIcs, Volume 251, ITCS 2023, Complete Volume
Building Blocks for Mapping Services
Mapping services are ubiquitous on the Internet and enjoy a considerable user base. But it is often overlooked that providing such a service on a global scale to millions of users has been the playground of an oligopoly: only a select few service providers are able to do so. Unfortunately, the literature on these solutions is more than scarce. This thesis adds a number of building blocks to the literature that explain how to design and implement such features.
Sublinear Computation Paradigm
This open access book gives an overview of cutting-edge work on a new paradigm called the “sublinear computation paradigm,” which was proposed in the large multiyear academic research project “Foundations of Innovative Algorithms for Big Data.” That project ran from October 2014 to March 2020, in Japan. To handle the unprecedented explosion of big data sets in research, industry, and other areas of society, there is an urgent need to develop novel methods and approaches for big data analysis. To meet this need, innovative changes in algorithm theory for big data are being pursued. For example, polynomial-time algorithms have thus far been regarded as “fast,” but if a quadratic-time algorithm is applied to a petabyte-scale or larger big data set, problems are encountered in terms of computational resources or running time. To deal with this critical computational and algorithmic bottleneck, linear, sublinear, and constant time algorithms are required. The sublinear computation paradigm is proposed here in order to support innovation in the big data era. A foundation of innovative algorithms has been created by developing computational procedures, data structures, and modelling techniques for big data. The project is organized into three teams that focus on sublinear algorithms, sublinear data structures, and sublinear modelling. The work has provided high-level academic research results of strong computational and algorithmic interest, which are presented in this book. The book consists of five parts: Part I, which consists of a single chapter on the concept of the sublinear computation paradigm; Parts II, III, and IV review results on sublinear algorithms, sublinear data structures, and sublinear modelling, respectively; Part V presents application results. The information presented here will inspire the researchers who work in the field of modern algorithms
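The flavor of the paradigm can be conveyed by a toy constant-sample estimator, assuming an in-memory stand-in for a big data set: the work done is independent of the data size, rather than linear or quadratic in it:

```python
import random

def approx_mean(data, n_samples=1000, seed=1):
    # Estimate the mean from a fixed number of uniform random samples;
    # the cost depends on n_samples, not on len(data).
    rng = random.Random(seed)
    total = sum(data[rng.randrange(len(data))] for _ in range(n_samples))
    return total / n_samples

data = list(range(1_000_000))  # stand-in for a big data set
print(approx_mean(data))       # close to the true mean 499999.5
```

Sublinear algorithms in the book trade exactness for such bounded, size-independent resource usage, with provable error guarantees.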