    Binary Coding, mRNA Information and Protein Structure

    We describe a new binary algorithm for the prediction of α and β protein folding types from RNA, DNA and amino acid sequences. The method enables quick, simple and accurate prediction of α and β protein folds on a personal computer by means of a few binary patterns of coded amino acid and nucleotide physicochemical properties. The algorithm was tested with the machine learning SMO (sequential minimal optimization) classifier for support vector machines and with classification trees, on a dataset of 140 dissimilar protein folds. Depending on the method of testing, the overall classification accuracy was 91.43% – 100% and the tenfold cross-validation result of the procedure was 83.57% – >90%. Genetic code randomization analysis based on 100,000 different codes tested for protein fold prediction quality indicated that: a) there is a very low chance, p = 2.7 × 10^(-4), that a better code than the natural one specified by the binary coding algorithm is randomly produced; b) dipeptides represent basic protein units with respect to the natural genetic code's definition of secondary protein structure

    Functional nucleic acids as substrate for information processing

    Information processing applications driven by the self-assembly and conformation dynamics of nucleic acids are possible. These underlying paradigms (self-assembly and conformation dynamics) are essential for natural information processors, as illustrated by proteins. A key advantage of utilising nucleic acids as information processors is the availability of computational tools to support the design process. This provides us with a platform to develop an integrated environment in which an orchestration of molecular building blocks can be realised. Strict arbitrary control over the design of these computational nucleic acids is not feasible. The microphysical behaviour of these molecular materials must be taken into consideration during the design phase. This thesis investigates to what extent the construction of molecular building blocks for a particular purpose is possible with the support of a software environment. In this work we developed a computational protocol that functions on a multi-molecular level, which enables us to directly incorporate the dynamic characteristics of nucleic acid molecules. To allow the implementation of this computational protocol, we developed a designer that is able to solve the nucleic acid inverse prediction problem, not only at the level of multi-stable states, but also including the interactions among molecules that occur in each meta-stable state. The realisation of our computational protocol is evaluated by generating computational nucleic acid units that resemble synthetic RNA devices that have been successfully implemented in the laboratory. Furthermore, we demonstrated the feasibility of the protocol to design various types of computational units. The accuracy and diversity of the generated candidates are significantly better than the best candidates produced by conventional designers. With the computational protocol, the design of nucleic acid information processors using a network of interconnecting nucleic acids is now feasible

    On the Genetic Origin of Complementary Protein Coding

    The relations of protein coding and hydropathy are investigated considering the principles of the molecular recognition theory and Grafstein's hypothesis of the stereochemical origin of the genetic code. It is shown that the coding of RNA and DNA requires 14 distinct groups of codon-anticodon pairs, which define all possible complementary amino acids. The molecular recognition theory is redefined considering the codon-anticodon relations of mRNAs, DNAs, tRNAs and Siemion's mutation ring of the genetic code. A model of DNA, RNA and protein coding (and decoding) based on two fundamental properties of DNA/RNA, denoted as complementary and stationary principles, is presented. Stationary DNA/RNA coding defines the nucleotide relationship of the same (self) DNA/RNA strand and complementary coding defines nucleotide distribution related to other (non-self) strand. Combinations of 2 digits, denoting primary and secondary characteristics of each nucleotide, specify codon positions according to the group subdivision (discrimination) principle. The process of coding is related to the hypercube node codon representations and dynamics of their binary tree locations. The relations between binary tree locations and Cantor set representations of different codon points are discussed in the context of quadratic mappings, Feigenbaum dynamics and signal analysis. Combinations of hypercube nodes and different binary tree positions define the words, sentences and syntax of DNA, RNA and protein language. Possible applications of this method may be related to network analysis and the design, gene, protein and drug modelling
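
The idea of describing each nucleotide by two binary digits, so that a codon becomes a node of a 6-dimensional hypercube, and of pairing a codon with its complementary codon, can be illustrated with a minimal sketch. The bit assignment below (first bit: purine vs. pyrimidine, second bit: strong vs. weak hydrogen bonding) is an assumption chosen for illustration and is not necessarily the coding used in the paper; the function names are hypothetical. The translation table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# ASSUMED 2-bit code per nucleotide (illustrative only):
# first bit  = 1 for purines (A, G), 0 for pyrimidines (C, T)
# second bit = 1 for strong pairing (G, C), 0 for weak pairing (A, T)
NUCLEOTIDE_BITS = {"A": "10", "G": "11", "C": "01", "T": "00"}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(codon: str) -> str:
    """Return the complementary codon read in the 5'->3' direction."""
    return codon.translate(COMPLEMENT)[::-1]


def codon_bits(codon: str) -> str:
    """Encode a codon as a 6-bit string, i.e. a node of a 6-cube."""
    return "".join(NUCLEOTIDE_BITS[base] for base in codon)


def hamming(a: str, b: str) -> int:
    """Number of hypercube edges separating two equally long bit strings."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    codon = "ATG"
    partner = reverse_complement(codon)
    print(codon, CODON_TABLE[codon], codon_bits(codon))
    print(partner, CODON_TABLE[partner], codon_bits(partner))
    print("hypercube distance:", hamming(codon_bits(codon), codon_bits(partner)))
```

Under this assumed code, every one of the 64 codons maps to a distinct 6-bit node, so distances between codons and their complementary partners can be compared on the hypercube.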

    Unitary and Symmetric Structure in Deep Neural Networks

    Recurrent neural networks (RNNs) have been successfully used on a wide range of sequential data problems. A well-known difficulty in using RNNs is the vanishing or exploding gradient problem. Recently, there have been several different RNN architectures that try to mitigate this issue by maintaining an orthogonal or unitary recurrent weight matrix. One such architecture is the scaled Cayley orthogonal recurrent neural network (scoRNN), which parameterizes the orthogonal recurrent weight matrix through a scaled Cayley transform. This parametrization contains a diagonal scaling matrix consisting of positive or negative one entries that cannot be optimized by gradient descent. Thus the scaling matrix is fixed before training, and a hyperparameter is introduced to tune the matrix for each particular task. In the first part of this thesis, we develop a unitary RNN architecture based on a complex scaled Cayley transform. Unlike the real orthogonal case, the transformation uses a diagonal scaling matrix consisting of entries on the complex unit circle, which can be optimized using gradient descent and no longer requires the tuning of a hyperparameter. We compare the performance of the scaled Cayley unitary recurrent neural network (scuRNN) with scoRNN and other unitary RNN architectures. Convolutional neural networks (CNNs) are a class of deep neural networks, most commonly applied to analyzing visual imagery. Nowadays, deep neural networks also play an important role in understanding biological problems such as modeling RNA sequences and protein sequences. The second part of the thesis explores deep learning approaches involving recurrent and convolutional networks to directly infer an RNA secondary structure or protein contact map, which has a symmetric feature matrix as output. We develop a CNN architecture with a suitable symmetric parameterization of the convolutional kernel that naturally produces symmetric feature matrices. We apply this architecture to the inference tasks for RNA secondary structure and protein contact maps. We compare our symmetrized CNN architecture with the usual convolutional network architecture and show that these approaches can improve prediction results while using an equal or smaller number of model parameters
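
For orientation, the scaled Cayley parametrization mentioned in this abstract is commonly written as below. This is a sketch from the general scoRNN/scuRNN literature, not a formula quoted from the abstract, and the symbols $W$, $A$, $D$ and $\theta_j$ are notation assumed here.

```latex
% Real (orthogonal) case: A is skew-symmetric, D is diagonal with entries +1 or -1.
W = (I + A)^{-1}(I - A)\,D, \qquad A^{\top} = -A, \qquad D = \operatorname{diag}(\pm 1)

% Orthogonality check, using (I - A)^{\top} = I + A and the fact that
% (I + A) and (I - A) commute:
W^{\top} W = D\,(I + A)(I - A)^{-1}(I + A)^{-1}(I - A)\,D = D\,D = I

% Complex (unitary) case: A is skew-Hermitian and the diagonal entries
% lie on the unit circle, so the angles \theta_j are continuous parameters.
W = (I + A)^{-1}(I - A)\,D, \qquad A^{*} = -A, \qquad D = \operatorname{diag}\!\left(e^{i\theta_1}, \dots, e^{i\theta_n}\right)
```

The discrete entries $\pm 1$ in the real case have no gradient, which is why the abstract describes them as fixed before training, whereas the angles $\theta_j$ in the complex case can be updated by gradient descent.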

    Kernel methods in genomics and computational biology

    Support vector machines and kernel methods are increasingly popular in genomics and computational biology, due to their good performance in real-world applications and strong modularity that makes them suitable to a wide range of problems, from the classification of tumors to the automatic annotation of proteins. Their ability to work in high dimension, to process non-vectorial data, and the natural framework they provide to integrate heterogeneous data are particularly relevant to various problems arising in computational biology. In this chapter we survey some of the most prominent applications published so far, highlighting the particular developments in kernel methods triggered by problems in biology, and mention a few promising research directions likely to expand in the future

    Computational Methods for the Analysis of Genomic Data and Biological Processes

    In recent decades, new technologies have made remarkable progress in helping to understand biological systems. Rapid advances in genomic profiling techniques such as microarrays or high-performance sequencing have brought new opportunities and challenges in the fields of computational biology and bioinformatics. Such genetic sequencing techniques allow large amounts of data to be produced, whose analysis and cross-integration could provide a complete view of organisms. As a result, it is necessary to develop new techniques and algorithms that carry out an analysis of these data with reliability and efficiency. This Special Issue collected the latest advances in the field of computational methods for the analysis of gene expression data, and, in particular, the modeling of biological processes. Here we present eleven works selected to be published in this Special Issue due to their interest, quality, and originality

    Towards Visualization of Discrete Optimization Problems and Search Algorithms

    Discrete optimization deals with the identification of combinations or permutations of elements that are optimal with regard to a specific, quantitative criterion. Applications arise from problems in economics, manufacturing, engineering, mathematics and computer science. Among them are machine learning, the scheduling of production processes, and the layout of integrated circuits. Typically, discrete optimization problems are NP-hard. Thus, the investigation of efficient, heuristic search algorithms is highly relevant for finding good solutions to medium- and large-sized problem instances. The development of such algorithms is complicated, because the properties of problem instances are often hard to identify due to the size and complexity of the instances. Likewise, the analysis and evaluation of given algorithms is challenging, because the search behavior of an algorithm is hard to characterize, especially in the case of emergent behavior as investigated in swarm intelligence research. Visualization aims to exploit human vision for data processing. The visual brain possesses tremendous capabilities to analyse optical stimuli from the optic nerves, perceive shapes and patterns, assign meaning to them and thus facilitate an intuitive understanding of what is seen. In particular, this can be used to generate hypotheses about complex data by representing them in a well-designed depiction and making it accessible to the visual system of the viewer. So far, visualization has rarely been used to support discrete optimization research.
This thesis is meant as a starting point to allow for an increased application of visualization throughout the process of developing discrete search heuristics. For this, we discuss the central questions that arise from the development of heuristics as well as the resulting requirements on visualization systems. Possible directions of research for visualization are described that yield a specific benefit for optimization research. Based on this, three visualization systems and one analysis method are presented. These address three important tasks of algorithm designers. First, a system for the fine-grained comparison of algorithms is introduced. Based on the intermediate results of algorithm runs on a given problem instance, the search process is visualized. The focus is on the progress of the solution quality over time, while allowing the algorithm expert to augment the depiction with additional domain knowledge and classification of individual solutions. Second, a system for the analysis of search landscapes is presented. Based on paths and distances in the landscape, a map of the problem instance is drawn that makes structural properties intuitively graspable. The second part of this thesis focuses on the topological analysis of search landscapes, based on barriers. A visualization system is presented that shows a topologically equivalent height profile of the search landscape. Further, the system makes it possible to observe the search process of an algorithm directly within the search landscape. This is of particular interest when researching swarm intelligence algorithms. The computation of the topological structure requires a complete enumeration of all solutions, which is not possible in the general case due to the size of the search landscapes. In order to enable an application to larger problem instances, we introduce a method to approximate the topological structure.
The method allows for an incremental refinement of the topological approximation that can be controlled using a heuristic. Thus, the domain expert can introduce her knowledge and also hypotheses about the problem instance into the analysis so that an approximation of good quality is achieved with reasonable computational effort

    Development and Application of Comparative Gene Co-expression Network Methods in Brachypodium distachyon

    Gene discovery and characterization is a long and labor-intensive process. Gene co-expression network analysis is a long-standing, powerful approach that can strongly enrich signals within gene expression datasets to predict genes critical for many cellular functions. Leveraging this approach with a large number of transcriptome datasets does not yield a concomitant increase in network granularity. Independently generated datasets that describe gene expression in various tissues, developmental stages, times of day, and environments can carry conflicting co-expression signals. The gene expression responses of the model C3 grass Brachypodium distachyon to abiotic stress are characterized by a co-expression-based analysis, identifying 22 modules of genes, annotated with putative DNA regulatory elements and functional terms. A great deal of co-expression elasticity is found among the genes characterized therein. An algorithm, dGCNA, designed to determine statistically significant changes in gene-gene co-expression relationships, is presented. The algorithm is demonstrated on the very well-characterized circadian system of Arabidopsis thaliana, and identifies potential strong signals of molecular interactions between a specific transcription factor and putative target gene loci. Lastly, this network comparison approach based on edge-wise similarities is applied to many pairwise comparisons of independent microarray datasets, to demonstrate the utility of fine-grained network comparison rather than amassing as large a dataset as possible. This approach identifies a set of 182 gene loci which are differentially expressed under drought stress, change their co-expression strongly under loss of thermocycles or high-salinity stress, and are associated with cell-cycle and DNA replication functions. This set of genes provides excellent candidates for the generation of rhythmic growth under thermocycles in Brachypodium distachyon