
    Derivation of Context-free Stochastic L-Grammar Rules for Promoter Sequence Modeling Using Support Vector Machine

    Formal grammars can be used to describe complex repeatable structures such as DNA sequences. In this paper, we describe the structural composition of DNA sequences using a context-free stochastic L-grammar. L-grammars are a special class of parallel grammars that can model the growth of living organisms, e.g. plant development, and the morphology of a variety of organisms. We believe that parallel grammars can also be used to model genetic mechanisms and sequences such as promoters. Promoters are short regulatory DNA sequences located upstream of a gene. Detection of promoters in DNA sequences is important for successful gene prediction. Promoters can be recognized by certain patterns that are conserved within a species, but there are many exceptions, which makes promoter recognition a complex problem. We replace the problem of promoter recognition with the induction of context-free stochastic L-grammar rules, which are later used for the structural analysis of promoter sequences. L-grammar rules are derived automatically from Drosophila and vertebrate promoter datasets using a genetic programming technique, and their fitness is evaluated using a Support Vector Machine (SVM) classifier. The artificial promoter sequences generated using the derived L-grammar rules are analyzed and compared with natural promoter sequences.
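The parallel-rewriting mechanism the abstract relies on can be illustrated with a toy stochastic L-system in Python; the rules and probabilities below are invented for illustration and are not derived from any promoter dataset:

```python
import random

# Hypothetical stochastic L-grammar: each nonterminal maps to a list of
# (probability, successor) pairs; terminals are the DNA alphabet A/C/G/T.
RULES = {
    "S": [(0.6, "A S T"), (0.4, "P")],       # illustrative rules, not derived
    "P": [(0.5, "T A T A"), (0.5, "G C P")], # from any real promoter dataset
}

def choose(successors, rng):
    """Pick a successor according to the rule probabilities."""
    r, acc = rng.random(), 0.0
    for p, succ in successors:
        acc += p
        if r < acc:
            return succ
    return successors[-1][1]

def derive(axiom="S", max_steps=20, seed=0):
    """Rewrite all nonterminals in parallel (the defining L-system property)
    until only terminals remain or the step budget is exhausted."""
    rng = random.Random(seed)
    symbols = axiom.split()
    for _ in range(max_steps):
        if all(s not in RULES for s in symbols):
            break
        nxt = []
        for s in symbols:
            nxt.extend(choose(RULES[s], rng).split() if s in RULES else [s])
        symbols = nxt
    # Drop any nonterminals left over when the step budget runs out.
    return "".join(s for s in symbols if s not in RULES)

print(derive())
```

In the paper's setting the rule set and probabilities would be evolved by genetic programming, with an SVM scoring how well the generated strings resemble real promoters.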

    Without magic bullets: the biological basis for public health interventions against protein folding disorders

    Protein folding disorders of aging, like Alzheimer's and Parkinson's diseases, currently present intractable medical challenges. 'Small molecule' interventions - drug treatments - often have, at best, palliative impact, failing to alter disease course. The design of individual- or population-level interventions will likely require a deeper understanding of protein folding and its regulation than is currently provided by contemporary 'physics' or culture-bound medical magic-bullet models. Here, a topological rate distortion analysis is applied to the problem of protein folding and regulation, similar in spirit to Tlusty's (2010a) elegant exploration of the genetic code. The formalism produces large-scale, quasi-equilibrium 'resilience' states representing normal and pathological protein folding regulation under a cellular-level cognitive paradigm similar to that proposed by Atlan and Cohen (1998) for the immune system. Generalization to long times produces diffusion models of protein folding disorders in which epigenetic or life-history factors determine the rate of onset of regulatory failure - in essence, a premature aging driven by familiar synergisms between disjunctions of resource allocation and need in the context of socially or physiologically toxic exposures and chronic powerlessness at individual and group scales. An HPA-axis model is applied to recently observed differences in Alzheimer's onset rates in White and African American subpopulations as a function of an index of distress-proneness.

    Learning the Regulatory Code of Gene Expression

    Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states, as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations, and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then on specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology.
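The "initiating protein-DNA interactions" stage these models learn can be pictured as convolutional motif scanning. Below is a minimal sketch of a DeepBind-style first layer (one-hot encoding, a single filter, max-pooling over positions), using a hypothetical TATA-like filter rather than weights learned from data:

```python
import numpy as np

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    m = np.zeros((len(seq), 4))
    for i, b in enumerate(seq):
        m[i, BASES.index(b)] = 1.0
    return m

def conv_max(x, filt):
    """Cross-correlate a (L, 4) one-hot input with a (w, 4) motif filter and
    max-pool over positions -- the first-layer operation that lets sequence
    models learn motifs directly from nucleotides."""
    w = filt.shape[0]
    scores = [np.sum(x[i:i + w] * filt) for i in range(x.shape[0] - w + 1)]
    return max(scores)

# Illustrative filter resembling a TATA-box detector (hypothetical weights):
# with a one-hot filter, the score is simply the number of matching positions.
tata = one_hot("TATA")
print(conv_max(one_hot("GGCTATAGGC"), tata))  # perfect 4-base match scores 4.0
```

In a trained network the filter weights are real-valued and learned by backpropagation, and many filters are applied in parallel; the scanning arithmetic is the same.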

    Mapping the proteome with data-driven methods: A cycle of measurement, modeling, hypothesis generation, and engineering

    The living cell exhibits emergence of complex behavior, and modeling it requires a systemic, integrative approach if we are to thoroughly understand and harness it. The work in this thesis has had the narrower aim of quantitatively characterizing and mapping the proteome using data-driven methods, as proteins perform most functional and structural roles within the cell. Covered are the different parts of the cycle: improving quantification methods, deriving protein features from primary structure, predicting the protein content solely from sequence data, and, finally, developing theoretical protein engineering tools, leading back to experiment.

    High-throughput mass spectrometry platforms provide detailed snapshots of a cell's protein content, which can be mined towards understanding how the phenotype arises from genotype and the interplay between the various properties of the constituent proteins. However, these large and dense data present an increased analysis challenge, and current methods capture only a small fraction of the signal. The first part of my work has involved tackling these issues with the implementation of a GPU-accelerated and distributed signal decomposition pipeline, making factorization of large proteomics scans feasible and efficient. The pipeline yields individual analyte signals spanning the majority of the acquired signal, enabling high-precision quantification and further analytical tasks.

    Having such detailed snapshots of the proteome enables a multitude of undertakings. One application has been to use a deep neural network model to learn the amino acid sequence determinants of temperature adaptation, in the form of reusable deep model features. More generally, systemic quantities may be predicted from the information encoded in sequence by evolutionary pressure. Two studies taking inspiration from natural language processing have sought to learn the grammars behind the languages of expression, in one case predicting mRNA levels from DNA sequence, and in the other protein abundance from amino acid sequence. These two models helped build a quantitative understanding of the central dogma and, in combination, yielded an improved predictor of protein amount. Finally, a mathematical framework relying on the embedded space of a deep model has been constructed to assist guided mutation of proteins towards optimizing their abundance.
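The signal-decomposition step described above can be sketched, under strong simplifying assumptions, as non-negative matrix factorization of a synthetic scan matrix. This toy Lee-Seung implementation is a stand-in for, not a reproduction of, the thesis's GPU-accelerated pipeline:

```python
import numpy as np

def nmf(V, k, iters=200, seed=0):
    """Factor a non-negative signal matrix V (e.g. m/z bins x scan times)
    into W (spectral profiles) and H (elution profiles) using Lee-Seung
    multiplicative updates."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + 1e-3
    H = rng.random((k, n)) + 1e-3
    for _ in range(iters):
        # Multiplicative updates preserve non-negativity by construction.
        H *= (W.T @ V) / (W.T @ W @ H + 1e-9)
        W *= (V @ H.T) / (W @ H @ H.T + 1e-9)
    return W, H

# Two synthetic "analytes" with overlapping profiles; V is exactly rank 2.
rng = np.random.default_rng(1)
true_W = rng.random((30, 2))
true_H = rng.random((2, 40))
V = true_W @ true_H

W, H = nmf(V, k=2)
err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(err)  # small relative reconstruction error
```

Real proteomics scans are vastly larger and noisier, which is what motivates the distributed, GPU-accelerated treatment in the thesis; the factorization idea, however, is the same: recover individual analyte signals as low-rank components of the acquired data.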

    Gene Expression and its Discontents: Developmental disorders as dysfunctions of epigenetic cognition

    Systems biology presently suffers the same mereological and sufficiency fallacies that haunt neural network models of high-order cognition. Shifting perspective from the massively parallel space of gene-matrix interactions to the grammar/syntax of the time series of expressed phenotypes, using a cognitive paradigm, permits the import of techniques from statistical physics via the homology between information source uncertainty and free energy density. This produces a broad spectrum of possible statistical models of development and its pathologies, in which epigenetic regulation and the effects of the embedding environment are analogous to a tunable enzyme catalyst. A cognitive paradigm naturally incorporates memory, leading directly to models of epigenetic inheritance, as affected by environmental exposures in the largest sense. Understanding gene expression, development, and their dysfunctions will require data-analysis tools considerably more sophisticated than the present crop of simplistic models abducted from neural network studies or stochastic chemical reaction theory.

    Sequence-Based Classification Using Discriminatory Motif Feature Selection

    Most existing methods for sequence-based classification use exhaustive feature generation, employing, for example, all k-mer patterns. The motivation behind such (enumerative) approaches is to minimize the potential for overlooking important features. However, there are shortcomings to this strategy. First, practical constraints limit the scope of exhaustive feature generation to patterns of length up to k, such that potentially important, longer (greater than k) predictors are not considered. Second, features so generated exhibit strong dependencies, which can complicate understanding of derived classification rules. Third, and most importantly, numerous irrelevant features are created. These concerns can compromise prediction and interpretation. While remedies have been proposed, they tend to be problem-specific and not broadly applicable. Here, we develop a generally applicable methodology, and an attendant software pipeline, that is predicated on discriminatory motif finding. In addition to the traditional training and validation partitions, our framework entails a third level of data partitioning: a discovery partition. A discriminatory motif finder is used on sequences and associated class labels in the discovery partition to yield a (small) set of features. These features are then used as inputs to a classifier in the training partition. Finally, performance assessment occurs on the validation partition. Important attributes of our approach are its modularity (any discriminatory motif finder and any classifier can be deployed) and its universality (all data, including sequences that are unaligned and/or of unequal length, can be accommodated). We illustrate our approach on two nucleosome occupancy datasets and a protein solubility dataset, previously analyzed using enumerative feature generation. Our method achieves excellent performance results, with and without optimization of classifier tuning parameters. A Python pipeline implementing the approach is available at http://www.epibiostat.ucsf.edu/biostat/sen/dmfs/
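The three-level partitioning scheme can be sketched end to end. The "motif finder" and threshold "classifier" below are deliberately minimal stand-ins; the framework itself is modular, so any discriminatory motif finder and any classifier could be substituted:

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Count all k-mers in a sequence (missing k-mers count as zero)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def find_motif(seqs, labels, k=3):
    """Toy discriminatory motif finder: pick the k-mer whose total count
    differs most between the two classes (run on the DISCOVERY partition)."""
    pos, neg = Counter(), Counter()
    for s, y in zip(seqs, labels):
        (pos if y else neg).update(kmer_counts(s, k))
    return max(set(pos) | set(neg), key=lambda m: abs(pos[m] - neg[m]))

def fit_threshold(seqs, labels, motif):
    """Toy classifier fit: choose the motif-count threshold maximizing
    accuracy (run on the TRAINING partition)."""
    counts = [kmer_counts(s)[motif] for s in seqs]
    best_t, best_acc = 0, -1.0
    for t in range(max(counts) + 2):
        acc = sum((c >= t) == y for c, y in zip(counts, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def accuracy(seqs, labels, motif, t):
    """Performance assessment (run on the VALIDATION partition)."""
    return sum((kmer_counts(s)[motif] >= t) == y
               for s, y in zip(seqs, labels)) / len(labels)

# Synthetic data: positives carry the planted motif "TAT".
pos = ["TATGTAT", "CTATACC", "GGTATCC", "ACTATGG"]
neg = ["GCAGTCC", "CGGACTG", "GCCGAGC", "CCGGAGC"]
disc_s, disc_y = pos[:2] + neg[:2], [1, 1, 0, 0]
train_s, train_y = [pos[2], neg[2]], [1, 0]
val_s, val_y = [pos[3], neg[3]], [1, 0]

motif = find_motif(disc_s, disc_y)            # discovery partition
t = fit_threshold(train_s, train_y, motif)    # training partition
print(motif, accuracy(val_s, val_y, motif, t))  # validation partition
```

The key design point this illustrates is that the features (motifs) are chosen on data disjoint from both the training and validation partitions, so the feature-selection step cannot leak label information into the final accuracy estimate.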

    Unraveling the transcriptional Cis-regulatory code

    It is now widely accepted that eukaryotic complexity is not dictated by the number of protein-coding genes in the genome, but is rather achieved through the combinatorics of gene expression programs. Distinct aspects of the expression pattern of a gene are mediated by discrete regulatory sequences, known as cis-regulatory elements. The work described in this thesis was aimed at developing computational and statistical methods to guide the search for and characterization of novel cis-regulatory elements.

    Selected abstracts of “Bioinformatics: from Algorithms to Applications 2020” conference

    The document contains only the abstract of the presentation. UCR::Vicerrectoría de Investigación::Unidades de Investigación::Ciencias de la Salud::Centro de Investigación en Enfermedades Tropicales (CIET); UCR::Vicerrectoría de Docencia::Salud::Facultad de Microbiología

    A Systems Biology Approach to Transcription Factor Binding Site Prediction

    The elucidation of mammalian transcriptional regulatory networks holds great promise for both basic and translational research and remains one of the greatest challenges in systems biology. Recent reverse-engineering methods deduce regulatory interactions from large-scale mRNA expression profiles and cross-species conserved regulatory regions in DNA. Technical challenges faced by these methods include distinguishing between direct and indirect interactions, associating transcription regulators with predicted transcription factor binding sites (TFBSs), identifying non-linearly conserved binding sites across species, and providing realistic accuracy estimates. We address these challenges by closely integrating proven methods for regulatory network reverse engineering from mRNA expression data, linearly and non-linearly conserved regulatory region discovery, and TFBS evaluation and discovery. Using an extensive test set of high-likelihood interactions, which we collected in order to provide realistic prediction-accuracy estimates, we show that a careful integration of these methods leads to significant improvements in prediction accuracy. To verify our methods, we biochemically validated TFBS predictions made for both transcription factors (TFs) and co-factors; we validated binding site predictions made using a known E2F1 DNA-binding motif on E2F1 predicted promoter targets, known E2F1 and JUND motifs on JUND predicted promoter targets, and a de novo discovered motif for BCL6 on BCL6 predicted promoter targets. Finally, to demonstrate accuracy of prediction using an external dataset, we showed that sites matching predicted motifs for ZNF263 are significantly enriched in recent ZNF263 ChIP-seq data. Using an integrative framework, we were able to address technical challenges faced by state-of-the-art network reverse-engineering methods, leading to significant improvement in direct-interaction detection and TFBS-discovery accuracy. We estimated the accuracy of our framework on a human B-cell-specific test set, which may help guide future methodological development.
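Motif-based TFBS scanning of the kind validated here is conventionally done with a log-odds position weight matrix (PWM). The sketch below builds a PWM from invented aligned binding sites (not the real E2F1, JUND, or BCL6 motifs) and scans a sequence for the best-scoring hit:

```python
import math

BASES = "ACGT"

def pwm_from_sites(sites, pseudo=0.5):
    """Build a log-odds PWM from aligned binding sites against a uniform
    background, with pseudocounts to avoid log(0)."""
    n = len(sites)
    pwm = []
    for i in range(len(sites[0])):
        col = [s[i] for s in sites]
        pwm.append({b: math.log(((col.count(b) + pseudo) /
                                 (n + 4 * pseudo)) / 0.25)
                    for b in BASES})
    return pwm

def best_hit(seq, pwm):
    """Slide the PWM over the sequence; return (top score, position)."""
    w = len(pwm)
    scored = [(sum(pwm[i][seq[p + i]] for i in range(w)), p)
              for p in range(len(seq) - w + 1)]
    return max(scored)

# Hypothetical aligned binding sites for illustration only.
sites = ["GCGCGA", "GCGGGA", "GCGCCA", "GCGGGA"]
pwm = pwm_from_sites(sites)
score, hit_pos = best_hit("TTATGCGCGATT", pwm)
print(score, hit_pos)  # the embedded site GCGCGA starts at position 4
```

In a genome-scale analysis like the ChIP-seq enrichment test described above, one would scan every candidate promoter, keep hits above a score threshold, and compare hit frequencies between predicted targets and background sequences.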