2,416 research outputs found

    Computational Molecular Biology

    Computational Biology is a fairly new subject that arose in response to the computational problems posed by the analysis and processing of biomolecular sequence and structure data. The field was initiated in the late 1960s and early 1970s, largely by pioneers working in the life sciences. Physicists and mathematicians entered the field in the 1970s and 1980s, while Computer Science became involved with the new biological problems in the late 1980s. Computational problems have gained further importance in molecular biology through the various genome projects, which produce enormous amounts of data. For this bibliography we focus on those areas of computational molecular biology that involve discrete algorithms or discrete optimization. We thus neglect several other areas of computational molecular biology, such as most of the literature on the protein folding problem, databases for molecular and genetic data, and genetic mapping algorithms. Due to the availability of review papers and a bibliography this bibliography

    Complexity, parallel computation and statistical physics

    The intuition that a long history is required for the emergence of complexity in natural systems is formalized using the notion of depth. The depth of a system is defined in terms of the number of parallel computational steps needed to simulate it. Depth provides an objective, irreducible measure of history applicable to systems of the kind studied in statistical physics. It is argued that physical complexity cannot occur in the absence of substantial depth and that depth is a useful proxy for physical complexity. The ideas are illustrated for a variety of systems in statistical physics. Comment: 21 pages, 7 figures

    Inferring phylogenetic trees under the general Markov model via a minimum spanning tree backbone

    Phylogenetic trees are models of the evolutionary relationships among species, with species typically placed at the leaves of trees. We address the following problems regarding the calculation of phylogenetic trees. (1) Leaf-labeled phylogenetic trees may not be appropriate models of evolutionary relationships among rapidly evolving pathogens, which may contain ancestor-descendant pairs. (2) The widely used models of gene evolution unrealistically assume that the base composition of DNA sequences does not evolve. Regarding problem (1), we present a method for inferring generally labeled phylogenetic trees that allow sampled species to be placed at non-leaf nodes of the tree. Regarding problem (2), we present a structural expectation maximization method (SEM-GM) for inferring leaf-labeled phylogenetic trees under the general Markov model (GM), which is the most complex model of DNA substitution that allows the evolution of base composition. To improve the scalability of SEM-GM, we present a minimum spanning tree (MST) framework called MST-backbone. MST-backbone scales linearly with the number of leaves. However, the unrealistic location of the root as inferred on empirical data suggests that the GM model may be overtrained. MST-backbone was inspired by the topological relationship between MSTs and phylogenetic trees that was introduced by Choi et al. (2011). We discovered that the topological relationship does not necessarily hold if there is no unique MST. We propose so-called vertex-order based MSTs (VMSTs) that guarantee a topological relationship with phylogenetic trees.
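    The abstract above notes that the MST need not be unique. As a rough illustration only, the sketch below builds a minimum spanning tree from a pairwise distance matrix with Kruskal's algorithm and breaks ties between equal-weight edges by vertex index, so the result is deterministic. The tie-breaking rule, the function name `mst_edges`, and the toy distances are assumptions made for this example; this is not the VMST construction defined in the thesis.

```python
from itertools import combinations

def mst_edges(dist):
    """Kruskal's algorithm on a symmetric distance matrix (list of lists).

    Ties between equal-weight edges are broken by the (u, v) vertex indices,
    so the returned tree is deterministic even when the MST is not unique.
    This tie-break is an illustrative assumption, not the thesis's VMST rule.
    """
    n = len(dist)
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Sort by weight first, then by vertex order.
    candidates = sorted((dist[u][v], u, v) for u, v in combinations(range(n), 2))
    tree = []
    for w, u, v in candidates:
        ru, rv = find(u), find(v)
        if ru != rv:            # edge joins two components: keep it
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

# Toy distances among four sequences; three edges share weight 1,
# so several different MSTs exist for this input.
toy = [
    [0, 1, 1, 2],
    [1, 0, 1, 2],
    [1, 1, 0, 2],
    [2, 2, 2, 0],
]
backbone = mst_edges(toy)  # [(0, 1, 1), (0, 2, 1), (0, 3, 2)]
```

    Sorting on the full `(weight, u, v)` tuple is what makes the output reproducible; sorting on weight alone would leave the choice among equal-weight edges to input order.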

    Geometric Algorithms and Data Structures for Simulating Diffusion Limited Reactions

    Radiation therapy is one of the most effective means of treating cancers. An important calculation in radiation therapy is the estimation of the dose distribution in the treated patient, which is key to determining the treatment outcome and potential side effects of the therapy. Biological dose, the level of biological damage (e.g., cell killing ratio, DNA damage) inflicted by the radiation, is the best measure of treatment quality, but it is very difficult to calculate. Therefore, most clinics today plan radiation therapy using physical dose, the energy deposited by incident radiation per unit body mass, which can be calculated accurately using kinetic Monte Carlo simulations. Studies have found that physical dose correlates with biological dose, but the relationship is very complex and not yet well understood. Generally speaking, the calculation of biological dose involves four steps: (1) the calculation of the physical dose distribution, (2) the generation of radiochemicals based on the physical dose distribution, (3) the simulation of interactions between radiochemicals and bio-matter in the body, and (4) the estimation of biological damage based on the distribution of radiochemicals. This dissertation focuses on the development of a more efficient and effective simulation algorithm to speed up step (3). The main contribution of this research is an efficient and effective kinetic Monte Carlo (KMC) algorithm for simulating diffusion-limited chemical reactions in the context of radiation therapy. The central problem studied is: given n particles distributed among a small number of particle species, all allowed to diffuse and chemically react according to a small number of chemical reaction equations, predict the radiochemical yield over time.
The algorithm presented uses a sparse grid structure, with one grid per species per radiochemical reactant, to group particles so that nearest-neighbor search is efficient; particles are stored only once, yet are represented in grids of all appropriate reaction radii. A kinetic data structure serves as the time-stepping mechanism, providing spatially local updates to the simulation at a frequency that captures all events and so retains accuracy. One serial and three parallel versions of the algorithm have been developed. The parallel versions implement the kinetic data structure using both a standard priority queue and a treap data structure in order to investigate the algorithm's scalability. The treap provides a way for each thread of execution to do more work in a particular region of space. A comparison with a spatial discretization variant of the algorithm is also provided.
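    To make the grid-based neighbor search concrete, here is a minimal sketch of the general technique: particles are hashed into cubic cells whose side equals the reaction radius, so any reaction partner must lie in the same cell or one of the 26 adjacent cells. The function names, the single shared grid, and the toy coordinates are assumptions for illustration; the dissertation's per-species grids, kinetic data structure, and parallel variants are not reproduced here.

```python
from collections import defaultdict
from itertools import product
import math

def build_grid(points, cell):
    """Hash each particle index into the integer cell that contains it."""
    grid = defaultdict(list)
    for i, p in enumerate(points):
        grid[tuple(math.floor(c / cell) for c in p)].append(i)
    return grid

def pairs_within(points, radius):
    """Return sorted index pairs (i, j), i < j, closer than `radius`.

    With the cell size equal to the radius, a partner of any particle lies
    in the particle's own cell or in one of the 26 adjacent cells.
    """
    grid = build_grid(points, radius)
    found = set()
    for cell_key, members in grid.items():
        for offset in product((-1, 0, 1), repeat=3):
            neighbour = tuple(k + o for k, o in zip(cell_key, offset))
            # .get() avoids inserting empty cells while iterating the grid.
            for j in grid.get(neighbour, ()):
                for i in members:
                    if i < j and math.dist(points[i], points[j]) < radius:
                        found.add((i, j))
    return sorted(found)

# Toy particle positions (hypothetical values, arbitrary units).
particles = [
    (0.0, 0.0, 0.0),
    (0.5, 0.0, 0.0),
    (3.0, 3.0, 3.0),
    (0.9, 0.9, 0.0),
    (-0.2, 0.0, 0.0),
]
close_pairs = pairs_within(particles, radius=1.0)
```

    Only occupied cells are stored, which keeps memory proportional to the number of particles rather than to the simulated volume.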

    An exact mathematical programming approach to multiple RNA sequence-structure alignment

    One of the main tasks in computational biology is the computation of alignments of genomic sequences to reveal their commonalities. In the case of DNA or protein sequences, sequence information alone is usually sufficient to compute reliable alignments. RNA molecules, however, build spatial conformations (the secondary structure) that are more conserved than the actual sequence. Hence, computing reliable alignments of RNA molecules has to take the secondary structure into account. We present a novel framework for the computation of exact multiple sequence-structure alignments: we give a graph-theoretic representation of the sequence-structure alignment problem and phrase it as an integer linear program. We identify a class of constraints that make the problem easier to solve and relax the original integer linear program in a Lagrangian manner. Experiments on a recently published benchmark show that our algorithm performs comparably to more costly dynamic programming algorithms, and outperforms all other approaches in terms of solution quality as the number of input sequences increases.

    From RNA folding to inverse folding: a computational study: Folding and design of RNA molecules

    Since the discovery of the structure of DNA in the early 1950s and its double-chained complement of information hinting at its means of replication, biologists have recognized the strong connection between molecular structure and function. In the past two decades, there has been a surge of research on an ever-growing class of RNA molecules that are non-coding but whose various folded structures allow a diverse array of vital functions. Beyond the well-known splicing and modification of ribosomal RNA, non-coding RNAs (ncRNAs) are now known to be intimately involved in possibly every stage of DNA transcription and protein translation, as well as in RNA signalling and gene regulation processes. Despite the rapid development and declining cost of modern molecular methods, they typically can only describe ncRNAs' structural conformations in vitro, which differ from their in vivo counterparts. Moreover, it is estimated that only a tiny fraction of known ncRNAs has been documented experimentally, often at a high cost. There is thus a growing realization that computational methods must play a central role in the analysis of ncRNAs. Not only do computational approaches hold the promise of rapidly characterizing many ncRNAs yet to be described, but there is also the hope that by understanding the rules that determine their structure, we will gain better insight into their function and design. Many studies have revealed that ncRNA functions are performed by high-level structures that often depend on their low-level structures, such as the secondary structure. This thesis studies the computational folding mechanism and inverse folding of ncRNAs at the secondary level. We describe the development of two bioinformatic tools that have the potential to improve our understanding of RNA secondary structure.
These tools are as follows: (1) RAFFT, for efficient prediction of pseudoknot-free RNA folding pathways using the fast Fourier transform (FFT); (2) aRNAque, an evolutionary algorithm inspired by Lévy flights for RNA inverse folding with or without pseudoknots (a secondary structure motif that often poses difficulties for bio-computational detection). The first tool, RAFFT, implements a novel heuristic to predict RNA secondary structure formation pathways that has two components: (i) a folding algorithm and (ii) a kinetic ansatz. When considering the best prediction in the ensemble of 50 secondary structures predicted by RAFFT, its performance matches that of recent deep-learning-based structure prediction methods. RAFFT also acts as a folding kinetic ansatz, which we tested on two RNAs: the CFSE and a classic bi-stable sequence. In both test cases, fewer structures were required to reproduce the full kinetics, whereas known methods (such as Treekin) required a sample of 20,000 structures or more. The second tool, aRNAque, implements an evolutionary algorithm (EA) inspired by the Lévy flight, which allows both local and global search and supports pseudoknotted target structures. The number of point mutations at every step of aRNAque's EA is drawn from a Zipf distribution. Therefore, our proposed method increases the diversity of designed RNA sequences and reduces the average number of evaluations of the evolutionary algorithm. The overall performance showed improved empirical results compared to existing tools in intensive benchmarks on both pseudoknotted and pseudoknot-free datasets. In conclusion, we highlight some promising extensions of the versatile RAFFT method to RNA-RNA interaction studies. We also provide an outlook on both tools' implications for studying evolutionary dynamics.
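    The Zipf-distributed mutation step mentioned in the abstract can be sketched as follows. The sketch draws the number of point mutations k from a Zipf law truncated at the sequence length, so most steps change one or two bases (local search) while occasional steps change many (global jumps). The exponent value, the truncation, the uniform choice of positions and replacement bases, and the function names are all assumptions for illustration; this is not aRNAque's implementation.

```python
import random

NUCLEOTIDES = "ACGU"

def zipf_weights(max_k, exponent):
    """Unnormalised truncated Zipf weights for k = 1..max_k: k ** -exponent."""
    return [k ** -exponent for k in range(1, max_k + 1)]

def mutate(sequence, exponent=1.5, rng=random):
    """Apply k point mutations, with k drawn from a truncated Zipf law.

    `exponent=1.5` is an assumed illustrative value, not a tuned parameter.
    Returns the mutated sequence and the number of mutations applied.
    """
    n = len(sequence)
    k = rng.choices(range(1, n + 1), weights=zipf_weights(n, exponent))[0]
    positions = rng.sample(range(n), k)      # k distinct positions
    mutated = list(sequence)
    for pos in positions:
        # Always switch to a different base so each mutation is a real change.
        alternatives = [b for b in NUCLEOTIDES if b != mutated[pos]]
        mutated[pos] = rng.choice(alternatives)
    return "".join(mutated), k

# Usage with a seeded generator for reproducibility.
parent_seq = "GGGAAACCC"
child_seq, n_mutations = mutate(parent_seq, rng=random.Random(1))
```

    Because the positions are distinct and every replacement differs from the original base, the Hamming distance between parent and child equals the sampled k.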

    Algorithmic design of RNA polyhedra (RNA polyhedrojen algoritminen suunnittelu)

    The field of bottom-up nanotechnology has been the subject of much research in recent years. Most of that research has focused on creating nano-scale shapes and structures using multiple strands. DNA origamis and various tile-based schemes are perhaps the most famous examples. No such robust design schemes exist for the design of single-stranded RNA structures, however, despite their potential to offer a cheap and sound approach to nanomanufacturing. In this thesis, we study the problem of designing single-stranded RNA polyhedral wireframes, i.e., RNA strands that fold into the wireframe of a given polyhedron. We introduce a kissing-loop based design scheme, which routes an RNA strand around a spanning tree of a polyhedron, and we show how to do the routing on arbitrary polyhedra while avoiding knots. We also introduce a design tool, Sterna, which is based on these principles. It allows the user to convert a 3D model of a polyhedron into RNA secondary and tertiary structures, which can be further developed into a primary structure with the additional scripts we have provided. Finally, we design three RNA polyhedra, which are synthesized and imaged in a project related to this master's thesis. The resulting images lend credence to the soundness of Sterna and the underlying design process.
One of the central research areas of bottom-up nanotechnology in recent years has been DNA nanotechnology, i.e., the production of nano-scale objects and structures from biopolymers. So-called DNA origamis and DNA tilings are the best-known examples of this approach. So far there has been no comparable general method for producing nanostructures from single-stranded RNA polymers, even though these would in principle offer an inexpensive and scalable basis for nanomanufacturing. In this master's thesis we study the folding of wireframe models of 3D polyhedra from single-stranded RNA polymers. We develop an automated design process that produces a sequence of RNA bases that folds into the shape of the polyhedron given as input. Our method is based on routing the RNA strand around a spanning tree of the polyhedron and closing the structure with so-called kissing loops. We also show how the spanning tree of an arbitrary polyhedron can be routed without producing topological knots that would prevent the corresponding RNA polymer from folding. With the Sterna design program that we implemented, the user can produce, from any 3D polyhedron model, secondary and tertiary structure descriptions of an RNA sequence that folds into its shape. We also provide a program with which these can be further completed with base information into the RNA primary structure needed for biosynthesis. As usage examples we design three RNA polyhedra, which have been synthesized and imaged in companion projects of this thesis. The results obtained support the correctness of our design method and of the Sterna tool based on it.
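    The idea of routing a single strand around a spanning tree can be illustrated with a small graph sketch. The code below builds a depth-first spanning tree of a polyhedron's vertex graph and returns the closed walk that traverses every tree edge twice, once in each direction. The function name and the tetrahedron input are assumptions for this example; kissing loops for the remaining edges, knot avoidance, and everything else Sterna does are outside its scope.

```python
def spanning_tree_walk(adjacency, root):
    """Return a closed walk that follows a DFS spanning tree of the graph.

    Each tree edge is traversed exactly twice, once in each direction,
    which is the shape of a single strand routed around the tree.
    """
    visited = {root}
    walk = [root]

    def visit(node):
        for nxt in adjacency[node]:
            if nxt not in visited:
                visited.add(nxt)
                walk.append(nxt)    # go out along the tree edge
                visit(nxt)
                walk.append(node)   # come back along the same edge

    visit(root)
    return walk

# Vertex adjacency of a tetrahedron (assumed toy input).
tetrahedron = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
route = spanning_tree_walk(tetrahedron, 0)  # [0, 1, 2, 3, 2, 1, 0]
```

    For a graph with n vertices the walk has 2(n - 1) steps, since a spanning tree has n - 1 edges and each is used in both directions.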

    Representing and extracting knowledge from single cell data

    Single-cell analysis is currently one of the highest-resolution techniques for studying biology. The large, complex datasets that have been generated have spurred numerous developments in computational biology, in particular the use of advanced statistics and machine learning. This review attempts to explain the deeper theoretical concepts that underpin current state-of-the-art analysis methods. Single-cell analysis is covered from cell, through instruments, to current and upcoming models. A minimum of mathematics and statistics has been used, but the reader is assumed to have either basic knowledge of single-cell analysis workflows or a solid knowledge of statistics. The aim of this review is to spread concepts that are not yet in common use, especially from topology and generative processes, and to show how new statistical models can be developed to capture more of biology. This opens epistemological questions regarding our ontology and models, and some pointers will be given as to how natural language processing (NLP) may help overcome our cognitive limitations in understanding single-cell data.

    Operations research: from computational biology to sensor network

    In this dissertation we discuss the deployment of combinatorial optimization methods for modeling and solving real-life problems, with a particular emphasis on two biological problems arising from a common scenario: the reconstruction of the three-dimensional shape of a biological molecule from Nuclear Magnetic Resonance (NMR) data. The first topic is the 3D assignment pathway problem (APP) for an RNA molecule. We prove that APP is NP-hard, and show a formulation of it based on edge-colored graphs. Taking into account that interactions between consecutive nuclei in the NMR spectrum differ according to the type of residue along the RNA chain, each color in the graph represents a type of interaction. Thus, we can represent the sequence of interactions as the problem of finding a longest (Hamiltonian) path whose edges follow a given order of colors (i.e., the orderly colored longest path). We introduce three alternative IP formulations of APP, obtained with a max flow problem on a directed graph with packing constraints over the partitions, and compare them with one another. Since the last two models work on cyclic graphs, for them we propose an algorithm based on the solution of their relaxation combined with the separation of cycle inequalities in a Branch & Cut scheme. The second topic is the discretizable distance geometry problem (DDGP), which is a formulation on a discrete search space of the well-known distance geometry problem (DGP). The DGP consists in seeking the embedding in space of an undirected graph, given a set of Euclidean distances between certain pairs of vertices. DGP has two important applications: (i) finding the three-dimensional conformation of a molecule from a subset of interatomic distances, called the Molecular Distance Geometry Problem, and (ii) the Sensor Network Localization Problem.
We describe a Branch & Prune (BP) algorithm tailored for this problem, and two versions of it solving the DDGP in both protein modeling and sensor network localization frameworks. BP is an exact and exhaustive combinatorial algorithm that examines all the valid embeddings of a given weighted graph G=(V,E,d), under the hypothesis that a given order exists on V. By comparing the two versions of BP to well-known algorithms we are able to show the efficiency of BP in both contexts, provided that the order imposed on V is maintained.
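    The branch-and-prune idea can be sketched in a simplified planar setting. The code below assumes that every vertex i >= 2 has known distances to its two predecessors in the order, so its position is one of at most two circle intersections (the branching step), and that any further known distance to an earlier vertex is used to discard inconsistent candidates (the pruning step). The 2D restriction, the fixed placement of the first two vertices, the tolerance, and all names are assumptions for illustration; this is not the dissertation's BP implementation.

```python
import math

def circle_intersections(c0, r0, c1, r1):
    """Intersection points of two circles in the plane (0, 1 or 2 points)."""
    dx, dy = c1[0] - c0[0], c1[1] - c0[1]
    d = math.hypot(dx, dy)
    if d == 0 or d > r0 + r1 or d < abs(r0 - r1):
        return []
    a = (r0 * r0 - r1 * r1 + d * d) / (2 * d)   # distance from c0 to the chord
    h = math.sqrt(max(r0 * r0 - a * a, 0.0))    # half-length of the chord
    mx, my = c0[0] + a * dx / d, c0[1] + a * dy / d
    pts = [(mx - h * dy / d, my + h * dx / d), (mx + h * dy / d, my - h * dx / d)]
    return pts if h > 0 else pts[:1]

def branch_and_prune(n, dist, tol=1e-6):
    """Enumerate planar embeddings of vertices 0..n-1 consistent with `dist`.

    `dist` maps pairs (i, j), i < j, to distances. Assumes every vertex
    i >= 2 has known distances to i-1 and i-2 (the discretisation order).
    """
    solutions = []

    def extend(coords):
        i = len(coords)
        if i == n:
            solutions.append(list(coords))
            return
        # Branch: at most two candidate positions for vertex i.
        for cand in circle_intersections(coords[i - 2], dist[(i - 2, i)],
                                         coords[i - 1], dist[(i - 1, i)]):
            # Prune: check every other known distance to an earlier vertex.
            ok = all(abs(math.dist(cand, coords[j]) - dist[(j, i)]) <= tol
                     for j in range(i - 2) if (j, i) in dist)
            if ok:
                extend(coords + [cand])

    # Fix the first two vertices to remove translations and rotations.
    extend([(0.0, 0.0), (dist[(0, 1)], 0.0)])
    return solutions

# Unit square 0-1-2-3; the distance (0, 3) is the extra pruning distance.
s2 = math.sqrt(2)
square = {(0, 1): 1.0, (0, 2): s2, (1, 2): 1.0,
          (1, 3): s2, (2, 3): 1.0, (0, 3): 1.0}
embeddings = branch_and_prune(4, square)
```

    On this toy input the search tree has four leaves, and the pruning distance (0, 3) discards two of them, leaving the square and its mirror image.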