34 research outputs found

    Predicting Flavonoid UGT Regioselectivity with Graphical Residue Models and Machine Learning.

    Machine learning is applied to a challenging and biologically significant protein classification problem: the prediction of flavonoid UGT acceptor regioselectivity from primary protein sequence. Novel indices characterizing graphical models of protein residues are introduced. The indices are compared with existing amino acid indices and found to cluster residues appropriately. A variety of models employing the indices are then investigated by examining their performance when analyzed using nearest neighbor, support vector machine, and Bayesian neural network classifiers. Improvements over nearest neighbor classifications relying on standard alignment similarity scores are reported
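    The nearest neighbor baseline mentioned in the abstract can be illustrated with a minimal sketch. It assumes a generic pairwise similarity function; `difflib.SequenceMatcher` is used here only as a stand-in and is not a biological alignment score, and the sequences and class labels are made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT a real alignment score (assumption)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def nearest_neighbor_label(query: str, labeled: list[tuple[str, str]]) -> str:
    """Return the label of the training sequence most similar to the query."""
    _, best_label = max(labeled, key=lambda item: similarity(query, item[0]))
    return best_label


# Hypothetical toy sequences and labels, for illustration only.
training = [
    ("MGSHHAVVLPF", "A"),
    ("MGSHHAVALPF", "A"),
    ("MKTQLEWWRDN", "B"),
    ("MKTQLEWYRDN", "B"),
]
predicted = nearest_neighbor_label("MGSHHAVVLPY", training)
print(predicted)
```

    Swapping `similarity` for a real alignment score or for a distance over residue indices changes only that one function, which is roughly the kind of comparison the abstract describes.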

    Transcriptional Regulation of Cell-type Specific Expression in the Arabidopsis Root

    Characterizing transcription factor interactions with their corresponding binding sites is crucial for understanding how gene expression is regulated by DNA sequence. A more comprehensive understanding of this process could benefit synthetic promoter design and the creation of genetically modified organisms. Herein, the promoters of genes exhibiting cell-type specific expression within a single layer of the Arabidopsis root are analyzed to identify cis-regulatory motifs implicated in cell-type specific expression. De novo motif prediction identifies multiple motif candidates over-represented in the promoter sequences of co-expressed genes specific for epidermal, cortex, and endodermal expression. Several endodermal-specific putative motifs are further analyzed for positional biases and tested in planta. A priori mapping of known cis-regulatory motifs catalogued in publicly available databases is also performed. Results show that cell types contain different statistically significant enrichment patterns of both predicted and known cis-regulatory motifs. These results will help future research in designing cell-type specific synthetic promoters

    Discriminative Learning for Probabilistic Sequence Analysis

    Gene and genon concept: coding versus regulation: A conceptual and information-theoretic analysis of genetic storage and expression in the light of modern molecular biology

    We analyse here the definition of the gene in order to distinguish, on the basis of modern insight in molecular biology, what the gene is coding for, namely a specific polypeptide, and how its expression is realized and controlled. Before the coding role of the DNA was discovered, a gene was identified with a specific phenotypic trait, from Mendel through Morgan up to Benzer. Subsequently, however, molecular biologists ventured to define a gene at the level of the DNA sequence in terms of coding. As is becoming ever more evident, the relations between information stored at DNA level and functional products are very intricate, and the regulatory aspects are as important and essential as the information coding for products. This approach led, thus, to a conceptual hybrid that confused coding, regulation and functional aspects. In this essay, we develop a definition of the gene that once again starts from the functional aspect. A cellular function can be represented by a polypeptide or an RNA. In the case of the polypeptide, its biochemical identity is determined by the mRNA prior to translation, and that is where we locate the gene. The steps from specific, but possibly separated sequence fragments at DNA level to that final mRNA then can be analysed in terms of regulation. For that purpose, we coin the new term “genon”. In that manner, we can clearly separate product and regulative information while keeping the fundamental relation between coding and function without the need to introduce a conceptual hybrid. In mRNA, the program regulating the expression of a gene is superimposed onto and added to the coding sequence in cis - we call it the genon. The complementary external control of a given mRNA by trans-acting factors is incorporated in its transgenon. A consequence of this definition is that, in eukaryotes, the gene is, in most cases, not yet present at DNA level. 
Rather, it is assembled by RNA processing, including differential splicing, from various pieces, as steered by the genon. It emerges finally as an uninterrupted nucleic acid sequence at mRNA level just prior to translation, in faithful correspondence with the amino acid sequence to be produced as a polypeptide. After translation, the genon has fulfilled its role and expires. The distinction between the protein coding information as materialised in the final polypeptide and the processing information represented by the genon allows us to set up a new information theoretic scheme. The standard sequence information determined by the genetic code expresses the relation between coding sequence and product. Backward analysis asks from which coding region in the DNA a given polypeptide originates. The (more interesting) forward analysis asks in how many polypeptides of how many different types a given DNA segment is expressed. This concerns the control of the expression process for which we have introduced the genon concept. Thus, the information theoretic analysis can capture the complementary aspects of coding and regulation, of gene and genon

    A computational intelligence analysis of G protein-coupled receptor sequences for pharmacoproteomic applications

    Arguably, drug research has contributed more to the progress of medicine during the past decades than any other scientific factor. One of the main areas of drug research is the analysis of proteins. Pharmacology is becoming increasingly dependent on advances in genomics and proteomics. This dependency brings the challenge of finding robust methods to analyze the complex data they generate. Such a challenge invites us to go one step further than traditional statistics and resort to approaches under the conceptual umbrella of artificial intelligence, including machine learning (ML), statistical pattern recognition and soft computing methods. Sound statistical principles are essential to trust the evidence base built through the use of such approaches. Statistical ML methods are thus at the core of the current thesis. More than 50% of drugs currently available target only four key protein families, of which almost 30% correspond to the G Protein-Coupled Receptor (GPCR) superfamily. This superfamily regulates the function of most cells in living organisms and is at the centre of the investigations reported in the current thesis. Not much is known about the 3D structure of these proteins. Fortunately, plenty of information regarding their amino acid sequences is readily available. The automatic grouping and classification of GPCRs into families, and of these into subtypes, based on sequence analysis may significantly contribute to ascertaining the pharmaceutically relevant properties of this protein superfamily. There is no biologically relevant manner of representing the symbolic sequences describing proteins using real-valued vectors. This does not preclude the possibility of analyzing them using principled methods. These may come, amongst others, from the field of statistical ML. In particular, kernel methods can be used for this purpose. Moreover, the visualization of high-dimensional protein sequence data can be a key exploratory tool for finding meaningful information that might be obscured by their intrinsic complexity. That is why the objective of the research described in this thesis is twofold: first, the design of adequate visualization-oriented artificial intelligence-based methods for the analysis of GPCR sequential data, and second, the application of the developed methods to relevant pharmacoproteomic problems such as GPCR subtyping and alignment-free protein analysis
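    As a hedged illustration of how a kernel method can compare symbolic protein sequences without first mapping them to hand-chosen real-valued vectors, the sketch below computes a k-mer spectrum kernel. The abstract does not say which kernels the thesis uses, so the choice of kernel, the value of `k`, and the function names are assumptions made for this example.

```python
import math
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping substring of length k in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a: str, b: str, k: int = 2) -> float:
    """Inner product of the two k-mer count vectors (alignment-free)."""
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    return float(sum(ca[m] * cb[m] for m in ca if m in cb))


def normalized_kernel(a: str, b: str, k: int = 2) -> float:
    """Cosine-normalized kernel value, so that K(x, x) == 1 when defined."""
    denom = math.sqrt(spectrum_kernel(a, a, k) * spectrum_kernel(b, b, k))
    return spectrum_kernel(a, b, k) / denom if denom else 0.0


# Hypothetical toy sequences, for illustration only.
print(spectrum_kernel("MKMK", "KMKM"))
print(normalized_kernel("MKMK", "MKMK"))
```

    A matrix of such kernel values between all pairs of sequences can be passed to any kernel-based classifier or visualization method, which is the general pattern the abstract alludes to.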

    Evolutionary patterns of non-coding RNAs

    A plethora of new functions of non-coding RNAs have been discovered in the past few years. In fact, RNA is emerging as the central player in cellular regulation, taking on active roles in multiple regulatory layers from transcription, RNA maturation, and RNA modification to translational regulation. Nevertheless, very little is known about the evolution of this 'Modern RNA World' and its components. In this contribution we attempt to provide at least a cursory overview of the diversity of non-coding RNAs and functional RNA motifs in non-translated regions of regular messenger RNAs (mRNAs) with an emphasis on evolutionary questions. This survey is complemented by an in-depth analysis of examples from different classes of RNAs focusing mostly on their evolution in the vertebrate lineage. We present a survey of Y RNA genes in vertebrates, studies of the molecular evolution of the U7 snRNA, the snoRNAs E1/U17, E2, and E3, the Y RNA family, the let-7 microRNA family, and the mRNA-like evf-1 gene. We furthermore discuss the statistical distribution of microRNAs in metazoans, which suggests an explosive increase in the microRNA repertoire in vertebrates. The analysis of the transcription of non-coding RNAs (ncRNAs) suggests that small RNAs in general are genetically mobile in the sense that their association with a host gene (e.g. when transcribed from introns of an mRNA) can change on evolutionary time scales. The let-7 family demonstrates that even the mode of transcription (as intron or as exon) can change among paralogous ncRNAs

    Computational epigenetics : bioinformatic methods for epigenome prediction, DNA methylation mapping and cancer epigenetics

    Epigenetic research aims to understand heritable gene regulation that is not directly encoded in the DNA sequence. Epigenetic mechanisms such as DNA methylation and histone modifications modulate the packaging of the DNA in the nucleus and thereby influence gene expression. Patterns of epigenetic information are faithfully propagated over multiple cell divisions, which makes epigenetic gene regulation a key mechanism for cellular differentiation and cell fate decisions. In addition, incomplete erasure of epigenetic information can lead to complex patterns of non-Mendelian inheritance. Stochastic and environment-induced epigenetic defects are known to play a major role in cancer and ageing, and they may also contribute to mental disorders and autoimmune diseases. Recent technical advances — such as the development of the ChIP-on-chip and ChIP-seq protocols for genome-wide mapping of epigenetic information — have started to convert epigenetic research into a high-throughput endeavor, to which bioinformatics is expected to make significant contributions. This thesis describes computational work at the intersection of epigenetics and genome research, aiming to address the bioinformatic challenges posed by the human epigenome. While its methods are carried over and adapted from bioinformatics and related fields (including data mining, machine learning, statistics, algorithms, optimization, software engineering and databases), its overarching goal is to contribute to epigenetic research, both directly through analyzing and modeling of epigenetic information, and indirectly through the development of practically useful methods and software toolkits. This thesis is broadly structured into four parts. The first part gives a brief introduction into epigenetic regulation and inheritance, and reviews the emerging field of computational epigenetics. The second part addresses the question of genome-epigenome interactions using machine learning methods. 
It is shown that accurate predictions of DNA methylation and other epigenetic modifications can be derived from the genomic DNA sequence. Based on this finding, the EpiGRAPH web service for epigenome analysis and prediction is described, and methods for refined annotation of CpG islands in the human genome are proposed. The third part is dedicated to large-scale analysis of DNA methylation, which is the best-known epigenetic phenomenon. The BiQ Analyzer software toolkit is presented, together with a bioinformatic analysis of the "National Methylome Project for Chromosome 21" dataset, for which BiQ Analyzer played an enabling role. This part concludes with statistical modeling of DNA methylation variation and an analysis of its implications for DNA methylation mapping in a large number of human individuals. The fourth part describes two pilot projects applying the bioinformatic concepts of this thesis to cancer epigenetics. First, genome-scale datasets are probed for evidence of a link between DNA methylation and Polycomb binding, which is believed to play a role in epigenetic deregulation of cancer cells. Second, a biomarker that tests for cancer-specific DNA methylation is optimized and validated for use in clinical settings. Arguably the most interesting result of this thesis is the unexpectedly high correlation between genome and epigenome that was found by several methods and based on multiple epigenome datasets. This finding suggests that the role of the genome for epigenetic regulation has been underappreciated, and it underlines the importance of integrated analysis of genome and epigenome. With the EpiGRAPH web service for (epi-) genome analysis and prediction, a research tool is provided to facilitate further investigation of this striking interaction.
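    For context on what a sequence-based CpG island annotation computes, the sketch below applies the classic criteria commonly attributed to Gardiner-Garden and Frommer: length of at least 200 bp, GC content above 50%, and an observed/expected CpG ratio above 0.6. This is a baseline illustration only; it is not the refined method proposed in the thesis and does not reflect EpiGRAPH's implementation. The function names are hypothetical.

```python
def cpg_stats(seq: str) -> tuple[float, float]:
    """Return (GC content, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_content = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (c * g) / n.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq: str, min_len: int = 200,
                          min_gc: float = 0.5,
                          min_obs_exp: float = 0.6) -> bool:
    """Apply the classic length, GC-content and obs/exp thresholds."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content > min_gc and obs_exp > min_obs_exp


# Synthetic toy sequences, for illustration only.
print(looks_like_cpg_island("CG" * 100))
print(looks_like_cpg_island("AT" * 100))
```

    In practice such thresholds are evaluated in a sliding window along a chromosome; the fixed thresholds are exactly what a refined, data-driven annotation would aim to improve on.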

    Advances in Artificial Intelligence: Models, Optimization, and Machine Learning

    The present book contains all the articles accepted and published in the Special Issue “Advances in Artificial Intelligence: Models, Optimization, and Machine Learning” of the MDPI Mathematics journal, which covers a wide range of topics connected to the theory and applications of artificial intelligence and its subfields. These topics include, among others, deep learning and classic machine learning algorithms, neural modelling, architectures and learning algorithms, biologically inspired optimization algorithms, algorithms for autonomous driving, probabilistic models and Bayesian reasoning, intelligent agents and multiagent systems. We hope that the scientific results presented in this book will serve as valuable sources of documentation and inspiration for anyone willing to pursue research in artificial intelligence, machine learning and their widespread applications