
    Processing hidden Markov models using recurrent neural networks for biological applications

    Philosophiae Doctor - PhD. In this thesis, we present a novel hybrid architecture that combines two of the most popular sequence recognition models, Recurrent Neural Networks (RNNs) and Hidden Markov Models (HMMs). Although sequence recognition problems can in principle be modelled with well-trained HMMs, HMMs have not provided a reasonable solution to complicated recognition problems. In contrast, RNNs are known to be exceptionally good at complex sequence recognition problems. Methods for applying HMMs to RNNs have been developed by other researchers in the past. However, to the best of our knowledge, no algorithm for processing HMMs through learning has been given. Taking advantage of the structural similarities between the architectural dynamics of RNNs and HMMs, in this work we analyze the combination of these two systems into a hybrid architecture. The main objective of this study is to improve sequence recognition/classification performance by applying a hybrid neural/symbolic approach. In particular, trained HMMs are used as the initial symbolic domain theory and are directly encoded into an appropriate RNN architecture, meaning that the prior knowledge is processed through the training of RNNs. The proposed algorithm is then implemented on sample test beds and other real-time biological applications.
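
One way to see the structural similarity between HMMs and RNNs mentioned in the abstract above is the standard HMM forward recursion, which updates a state vector once per observation, much as a recurrent layer updates its hidden state. The following is a minimal Python sketch of that recursion only; it is not the encoding proposed in the thesis, and the function name `forward_likelihood` and the toy two-state model are assumptions made for illustration.

```python
def forward_likelihood(initial, transition, emission, observations):
    """Likelihood of an observation sequence under a discrete HMM (forward algorithm).

    initial[i]       : probability of starting in state i
    transition[i][j] : probability of moving from state i to state j
    emission[i][k]   : probability that state i emits symbol k
    observations     : list of symbol indices
    """
    n_states = len(initial)
    # alpha plays the role of a hidden state vector that is updated at every step.
    alpha = [initial[i] * emission[i][observations[0]] for i in range(n_states)]
    for symbol in observations[1:]:
        alpha = [
            sum(alpha[i] * transition[i][j] for i in range(n_states)) * emission[j][symbol]
            for j in range(n_states)
        ]
    return sum(alpha)

# Hypothetical two-state, two-symbol model (values chosen only for the example).
initial = [0.6, 0.4]
transition = [[0.7, 0.3], [0.4, 0.6]]
emission = [[0.9, 0.1], [0.2, 0.8]]
likelihood = forward_likelihood(initial, transition, emission, [0, 1])
```

Each step is a linear map of the previous state vector followed by an input-dependent elementwise scaling, which is the recurrent pattern that makes a mapping onto an RNN layer plausible.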

    Development of computational methods for predicting structural characteristics of helical membrane proteins

    Helical membrane proteins (HMPs) play a crucial role in diverse cellular processes. Given the difficulty of determining their structures by experimental techniques, it is desirable to develop computational methods for predicting their structural characteristics. In addition, computational analysis can provide insights into their structure and function that experimental work cannot provide. This thesis summarizes years of such computational work, comprising four published papers (Papers I to IV). Paper I attempts to model low-resolution tertiary structures of HMPs with a modest number of transmembrane (TM) helices from packing constraints and sequence conservation patterns. Paper II presents a fundamental investigation of the degree of correlation between the exposure patterns of TM helices to the membrane and properties such as their hydrophobicities and conservation patterns. Paper III, building on Paper II, presents an optimal way of deriving propensity scales for the 20 amino acids to preferentially interact with the membrane, as reflected in known HMP structures. It reveals the surprising fact that the architectural principle of HMPs is best captured by the partial specific volumes of the amino acids. Paper IV describes the development of TMX (TransMembrane eXposure), a novel computational method for predicting the lipid accessibility of TM residues of HMPs, which significantly outperforms other existing methods. A web interface for TMX is available at http://service.bioinformatik.uni-saarland.de/tmx.

    Understanding the Structural and Functional Importance of Early Folding Residues in Protein Structures

    Proteins adopt three-dimensional structures, which serve as a starting point for understanding protein function and evolutionary ancestry. It is unclear how proteins fold in vivo and how this process can be recreated in silico in order to predict protein structure from sequence. Contact maps describe whether two residues are in spatial proximity, and structures can be derived from this simplified representation. Coevolution or supervised machine learning techniques can compute contact maps from sequence; however, these approaches only predict sparse subsets of the actual contact map. It is shown that the composition of these subsets substantially influences the achievable reconstruction quality because most information in a contact map is redundant. Previously, no strategy had been proposed that identifies unique contacts for which no redundant backup exists. The StructureDistiller algorithm quantifies the structural relevance of individual contacts and identifies crucial contacts in protein structures. It is demonstrated that using this information improves the reconstruction performance on a sparse subset of a contact map by 0.4 Å, which constitutes a substantial performance gain. The set of the most relevant contacts in a map is also more resilient to false positive contacts: up to 6% of false positives are compensated before reconstruction quality matches a naive selection of contacts without any false positive contacts. This information is invaluable for the training of new structure prediction methods and provides insights into how the robustness and information content of contact maps can be improved. In the literature, the relevance of two types of residues for in vivo folding has been described. Early folding residues initiate the folding process, whereas highly stable residues prevent spontaneous unfolding events. The structural relevance score proposed in this thesis is employed to characterize both types of residues. 
Early folding residues form pivotal secondary structure elements, but their structural relevance is average. In contrast, highly stable residues exhibit significantly increased structural relevance. This implies that residues crucial for the folding process are not relevant for structural integrity and vice versa. The position of early folding residues is preserved over the course of evolution, as demonstrated for two ancient regions shared by all aminoacyl-tRNA synthetases. One arrangement of folding initiation sites resembles an ancient and widely distributed structural packing motif and captures how reverberations of the earliest periods of life can still be observed in contemporary protein structures.
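
To make the contact-map representation from the abstract above concrete, here is a minimal Python sketch that derives a contact map from residue coordinates. It assumes that two residues count as "in contact" when their representative atoms lie within a distance cutoff; the 8 Å cutoff, the function name `contact_map`, and the toy coordinates are illustrative assumptions, not details from the thesis, and this is not the StructureDistiller algorithm.

```python
import math

def contact_map(coords, cutoff=8.0):
    """Return a symmetric boolean matrix that is True where two residues are in contact.

    coords : list of (x, y, z) tuples, one representative atom per residue
    cutoff : distance threshold in angstroms (8.0 is an assumed value)
    """
    n = len(coords)
    cmap = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # math.dist computes the Euclidean distance between two points.
            if math.dist(coords[i], coords[j]) <= cutoff:
                cmap[i][j] = cmap[j][i] = True
    return cmap

# Toy example: four hypothetical residues on a line, 5 angstroms apart.
toy = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
cmap = contact_map(toy)
```

In this toy chain only neighbouring residues are in contact, so the map is sparse; real maps contain many mutually redundant contacts, which is the property the abstract exploits.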

    Front Matter - Soft Computing for Data Mining Applications

    Efficient tools and algorithms for knowledge discovery in large data sets have been devised in recent years. These methods exploit the capability of computers to search huge amounts of data in a fast and effective manner. However, the data to be analyzed is imprecise and afflicted with uncertainty. In the case of heterogeneous data sources such as text, audio and video, the data might moreover be ambiguous and partly conflicting. Besides, patterns and relationships of interest are usually vague and approximate. Thus, to make the information mining process more robust, or, say, more human-like, methods for searching and learning require tolerance towards imprecision, uncertainty and exceptions. They therefore need approximate reasoning capabilities and must be capable of handling partial truth. Properties of this kind are typical of soft computing. Soft computing techniques like Genetic

    Bioinformatics

    This book is divided into research areas relevant to bioinformatics, such as biological networks, next generation sequencing, high performance computing, molecular modeling, structural bioinformatics and intelligent data analysis. Each section introduces the basic concepts and then explains their application to problems of great relevance, so both novice and expert readers can benefit from the information and research presented here.

    Intrinsically Disordered Proteins and Chronic Diseases

    This book collects a series of articles that were published as part of a Special Issue of Biomolecules. It is dedicated to exploring the role of intrinsically disordered proteins (IDPs) in various chronic diseases. The main goal of the articles is to describe recent progress in elucidating the mechanisms by which IDPs cause various human diseases, such as cancer, cardiovascular disease, amyloidosis, neurodegenerative diseases, diabetes, and genetic diseases, to name a few. Contributed by leading investigators in the field, this compendium serves as a valuable resource for researchers and clinicians as well as postdoctoral fellows and graduate students.

    Quantitative structure fate relationships for multimedia environmental analysis

    Key physicochemical properties for a wide spectrum of chemical pollutants are unknown. This thesis analyses the prospect of assessing the environmental distribution of chemicals directly from supervised learning algorithms using molecular descriptors, rather than from multimedia environmental models (MEMs) using several physicochemical properties estimated from QSARs. Dimensionless compartmental mass ratios of 468 validation chemicals were compared, in logarithmic units, between: a) SimpleBox 3, a Level III MEM, propagating random property values within statistical distributions of widely recommended QSARs; and b) Support Vector Regressions (SVRs), acting as Quantitative Structure-Fate Relationships (QSFRs), linking mass ratios to molecular weight and constituent counts (atoms, bonds, functional groups and rings) for training chemicals. The best predictions were obtained for test and validation chemicals correctly located within the domain of applicability of the QSFRs, as evidenced by low MAE and high q2 values (in air, MAE≤0.54 and q2≥0.92; in water, MAE≤0.27 and q2≥0.92).
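
The MAE and q2 figures quoted in the abstract above can be illustrated with a short Python sketch. The abstract does not define q2, so this assumes the common form, one minus the ratio of the squared prediction errors to the total sum of squares about the mean of the observed values. The function names and the toy values (standing in for log-unit mass ratios) are hypothetical and are not data from the thesis.

```python
def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def q2(observed, predicted):
    """Predictive squared correlation coefficient (assumed definition).

    q2 = 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
    """
    mean_obs = sum(observed) / len(observed)
    press = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    total = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - press / total

# Toy values only, chosen so the arithmetic is easy to follow.
observed = [-2.0, -1.0, 0.0, 1.0, 2.0]
predicted = [-1.5, -1.0, 0.5, 1.0, 1.5]
mae_value = mean_absolute_error(observed, predicted)
q2_value = q2(observed, predicted)
```

A q2 close to 1 means the predictions explain most of the variance in the observed values, while MAE reports the typical error in the same logarithmic units as the mass ratios.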

    EU US Roadmap Nanoinformatics 2030

    The Nanoinformatics Roadmap 2030 is a compilation of state-of-the-art commentaries from multiple interconnecting scientific fields, combined with issues involving nanomaterial (NM) risk assessment and governance. In bringing these issues together into a coherent set of milestones, the authors address three recognised challenges facing nanoinformatics: (1) limited data sets; (2) limited data access; and (3) regulatory requirements for validating and accepting computational models. It is also recognised that data generation will progress unevenly and in an unstructured way if it is not captured within a nanoinformatics framework based on harmonised, interconnected databases and standards. The implicit coordination efforts within such a framework ensure early use of the data for regulatory purposes, e.g., for the read-across method of filling data gaps.

    Design and data analysis of kinome microarrays

    Catalyzed by protein kinases, phosphorylation is the most important post-translational modification in eukaryotes and is involved in the regulation of almost all cellular processes. Investigating phosphorylation events and how they change in response to different biological conditions is integral to understanding cellular signaling processes in general, as well as to defining the role of phosphorylation in health and disease. A recently-developed technology for studying phosphorylation events is the kinome microarray, which consists of several hundred "spots" arranged in a grid-like pattern on a glass slide. Each spot contains many peptides of a particular amino acid sequence chemically fixed to the slide, with different spots containing peptides with different sequences. Each peptide is a subsequence of a full protein, containing an amino acid residue that is known or suspected to undergo phosphorylation in vivo, as well as several surrounding residues. When a kinome microarray is exposed to cell lysate, the protein kinases in the lysate catalyze the phosphorylation of the peptides on the array. By measuring the degree to which the peptides comprising each spot are phosphorylated, insight can be gained into the upregulation or downregulation of signaling pathways in response to different biological treatments or conditions. There are two main computational challenges associated with kinome microarrays. The first is array design, which involves selecting the peptides to be included on a given array. The level of difficulty of this task depends largely on the number of phosphorylation sites that have been experimentally identified in the proteome of the organism being studied. For instance, thousands of phosphorylation sites are known for human and mouse, allowing considerable freedom to select peptides that are relevant to the problem being examined. In contrast, few sites are known for, say, honeybee and soybean. 
For such organisms, it is useful to expand the set of possible peptides by using computational techniques to predict probable phosphorylation sites. In this thesis, existing techniques for the computational prediction of phosphorylation sites are reviewed. In addition, two novel methods are described for predicting phosphorylation events in organisms with few known sites, with each method using a fundamentally different approach. The first technique, called PHOSFER, uses a random forest-based machine-learning strategy, while the second, called DAPPLE, takes advantage of sequence homology between known sites and the proteome of interest. Both methods are shown to allow quicker or more accurate predictions in organisms with few known sites than comparable previous techniques. Therefore, the use of kinome microarrays is no longer limited to the study of organisms having many known phosphorylation sites; rather, this technology can potentially be applied to any organism having a sequenced genome. It is shown that PHOSFER and DAPPLE are suitable for identifying phosphorylation sites in a wide variety of organisms, including cow, honeybee, and soybean. The second computational challenge is data analysis, which involves the normalization, clustering, statistical analysis, and visualization of data resulting from the arrays. While software designed for the analysis of DNA microarrays has also been used for kinome arrays, differences between the two technologies prompted the development of PIIKA, a software package specifically designed for the analysis of kinome microarray data. By comparing with methods used for DNA microarrays, it is shown that PIIKA improves the ability to identify biological pathways that are differentially regulated in a treatment condition compared to a control condition. Also described is an updated version, PIIKA 2, which contains improvements and new features in the areas of clustering, statistical analysis, and data visualization. 
Given the previous absence of dedicated tools for analyzing kinome microarray data, as well as their wealth of features, PIIKA and PIIKA 2 represent an important step in maximizing the scientific value of this technology. In addition to the above techniques, this thesis presents three studies involving biological applications of kinome microarray analysis. The first study demonstrates the existence of "kinotypes" - species- or individual-specific kinome profiles - which has implications for personalized medicine and for the use of model organisms in the study of human disease. The second study uses kinome analysis to characterize how the calf immune system responds to infection by the bacterium Mycobacterium avium subsp. paratuberculosis. Finally, the third study uses kinome arrays to study parasitism of honeybees by the mite Varroa destructor, which is thought to be a major cause of colony collapse disorder. In order to make the methods described above readily available, a website called the SAskatchewan PHosphorylation Internet REsource (SAPHIRE) has been developed. Located at the URL http://saphire.usask.ca, SAPHIRE allows researchers to easily make use of PHOSFER, DAPPLE, and PIIKA 2. These resources facilitate both the design and data analysis of kinome microarrays, making them an even more effective technique for studying cellular signaling

    Sequence Determinants of the Individual and Collective Behaviour of Intrinsically Disordered Proteins

    Intrinsically disordered proteins and protein regions (IDPs) represent around thirty percent of the eukaryotic proteome. IDPs do not fold into a fixed three-dimensional structure, but instead exist in an ensemble of inter-converting states. Despite being disordered, IDPs are decidedly not random; well-defined - albeit transient - local and long-range interactions give rise to an ensemble with distinct statistical biases over many length-scales. Among a variety of cellular roles, IDPs drive and modulate the formation of phase separated intracellular condensates, non-stoichiometric assemblies of protein and nucleic acid that serve many functions. In this work, we have explored how the amino acid sequence of IDPs determines their conformational behaviour, and how sequence and single chain behaviour influence their collective behaviour in the context of phase separation. In part I, in a series of studies, we used simulation, theory, and statistical analysis coupled with a wide range of experimental approaches to uncover novel rules that further explore how primary sequence and local structure influence the global and local behaviour of disordered proteins, with direct implications for protein function and evolution. We found that amino acid sidechains counteract the intrinsic collapse of the peptide backbone, priming the backbone for interaction and providing a fully reconciliatory explanation for the mechanism of action associated with the denaturants urea and GdmCl. We discovered that proline can engender a conformational buffering effect in IDPs to counteract standard electrostatic effects, and that the patterning of those proline residues can be a crucial determinant of the conformational ensemble. We developed a series of tools for analysing primary sequences on a proteome-wide scale and used them to discover that different organisms can have substantially different average sequence properties. 
Finally, we determined that for the normally folded protein NTL9, the unfolded state under folding conditions is relatively expanded but has well-defined native and non-native structural preferences. In part II, we identified a novel mode of phase separation in biology, and explored how this could be tuned through sequence design. We discovered that phase separated liquids can be many orders of magnitude more dilute than simple mean-field theories would predict, and developed an analytic framework to explain and understand this phenomenon. Finally, we designed, developed and implemented a novel lattice-based simulation engine (PIMMS) to provide sequence-specific insight into the determinants of conformational behaviour and phase separation. PIMMS allows us to accurately and rapidly generate sequence-specific conformational ensembles and run simulations of hundreds of polymers, with the goal of systematically elucidating the link between primary sequence and phase separation.