
    Human Promoter Prediction Using DNA Numerical Representation

    With the emergence of genomic signal processing, numerical representation techniques for the DNA alphabet set {A, G, C, T} play a key role in applying digital signal processing and machine learning techniques to the processing and analysis of DNA sequences. The choice of the numerical representation of a DNA sequence affects how well the biological properties can be reflected in the numerical domain for the detection and identification of the characteristics of special regions of interest within the DNA sequence. This dissertation presents a comprehensive study of various DNA numerical and graphical representation methods and their applications in processing and analyzing long DNA sequences. Discussions of the relative merits and demerits of the various methods, experimental results, and possible future developments are also included. Another research focus is promoter prediction in human (Homo sapiens) DNA sequences with a neural-network-based multi-classifier system using DNA numerical representation methods. In spite of the recent development of several computational methods for human promoter prediction, there is a need for performance improvement. In particular, the high false positive rate of the feature-based approaches decreases the prediction reliability and leads to erroneous results in gene annotation. To improve the prediction accuracy and reliability, DigiPromPred, a numerical-representation-based promoter prediction system, is proposed to characterize DNA alphabets in different regions of a DNA sequence. The DigiPromPred system is found to predict promoters with a sensitivity of 90.8% while reducing the false prediction rate for non-promoter sequences, with a specificity of 90.4%. The comparative study with state-of-the-art promoter prediction systems for human chromosome 22 shows that the proposed system maintains a good balance between prediction accuracy and reliability.
To reduce architectural and computational complexity relative to the existing system, a simple feed-forward neural network classifier known as SDigiPromPred is proposed. The SDigiPromPred system is found to predict promoters with sensitivities of 87%, 87%, and 99% while reducing the false prediction rate for non-promoter sequences, with specificities of 92%, 94%, and 99% for human, Drosophila, and Arabidopsis sequences, respectively, and it offers reconfigurable capability compared to the existing system.
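
As a rough illustration of the two ideas in this abstract, the sketch below encodes a DNA string numerically and computes sensitivity and specificity from predicted labels. The binary indicator (one-hot) encoding is only an assumed example of a numerical representation; the abstract does not say which representations DigiPromPred uses, and the function names, sequence, and labels here are hypothetical.

```python
# Illustrative sketch only. The binary indicator encoding is an assumed
# example, not necessarily the representation used in the dissertation.
ALPHABET = "ACGT"


def indicator_encode(seq):
    """Map a DNA string to four 0/1 indicator sequences, one per nucleotide."""
    seq = seq.upper()
    return {base: [1 if s == base else 0 for s in seq] for base in ALPHABET}


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for 0/1 labels, where 1 = promoter."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # promoters found
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # promoters missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-promoters rejected
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    # Made-up sequence and labels, for demonstration only.
    print(indicator_encode("GATTACA")["A"])
    print(sensitivity_specificity([1, 1, 1, 0, 0, 0, 0, 1],
                                  [1, 0, 1, 0, 1, 0, 0, 1]))
```

Sensitivity is the fraction of true promoters that are predicted as promoters, and specificity is the fraction of non-promoters that are correctly rejected, which is how the percentages quoted above are conventionally defined.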

    Identifying promoter features of co-regulated genes with similar network motifs

    Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 2008, Philadelphia, PA, USA. 3–5 November 2008. Background: A large amount of computational and experimental work has been devoted to uncovering network motifs in gene regulatory networks. The leading hypothesis is that evolutionary processes independently selected recurrent architectural relationships among regulators and target genes (motifs) to produce characteristic expression patterns of their members. However, even with the same architecture, the genes may still be differentially expressed. Therefore, to define fully the expression of a group of genes, the strength of the connections in a network motif must be specified, and the cis-promoter features that participate in the regulation must be determined. Results: We have developed a model-based approach to analyze proteobacterial genomes for promoter features that is specifically designed to account for the variability in sequence, location and topology intrinsic to differential gene expression. We provide methods for annotating regulatory regions by detecting their subjacent cis-features. This includes identifying binding sites for a transcriptional regulator, distinguishing between activation and repression sites, direct and reverse orientation, and among sequences that weakly reflect a particular pattern; binding sites for the RNA polymerase, characterizing different classes, and locations relative to the transcription factor binding sites; the presence of riboswitches in the 5'UTR, and for other transcription factors. We applied our approach to characterize network motifs controlled by the PhoP/PhoQ regulatory system of Escherichia coli and Salmonella enterica serovar Typhimurium.
We identified key features that enable the PhoP protein to control its target genes, and distinct features may produce different expression patterns even within the same network motif. Conclusion: Global transcriptional regulators control multiple promoters by a variety of network motifs. This is clearly the case for the regulatory protein PhoP. In this work, we studied this regulatory protein and demonstrated that understanding gene expression requires identifying not only a set of connexions or a network motif, but also the cis-acting elements participating in each of these connexions. This research was supported in part by the Spanish Ministry of Science and Technology under project TIN2006-12879 and by Consejería de Innovación, Investigación y Ciencia de la Junta de Andalucía under project TIC02788.

    Learning the Regulatory Code of Gene Expression

    Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states, and mRNA and protein levels. Deep neural networks automatically learn informative sequence representations, and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology.

    Dissecting the transcriptional regulatory network of embryonic stem cells

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Biology, 2008. Includes bibliographical references. The process by which a single fertilized egg develops into a human being with over 200 cell types, each with a distinct gene expression pattern controlling its cellular state, is poorly understood. An understanding of the transcriptional regulatory networks that establish and maintain gene expression programs in mammalian cells is fundamental to understanding development and should provide the foundation for improved diagnosis and treatment of disease. Although it is not yet feasible to map the entirety of these networks in vertebrate cells, recent work in embryonic stem (ES) cells has demonstrated that core features of the network can be discovered by focusing on key transcriptional regulators and their target genes. Here, I describe important insights that have emerged from such studies and highlight how similar approaches can be used to discover the core networks of other vertebrate cell types. Knowledge of the regulatory networks controlling gene expression programs and cell states can guide efforts to reprogram cell states and holds great promise for both disease therapeutics and regenerative medicine. By Megan F. Cole. Ph.D.

    Knowledge discovery in biological databases : a neural network approach

    Knowledge discovery in databases, also known as data mining, aims to find significant information in a set of data. The knowledge to be mined from the dataset may refer to patterns, association rules, classification and clustering rules, and so forth. In this dissertation, we present a neural network approach to finding knowledge in biological databases. Specifically, we propose new methods to process biological sequences in two case studies: the classification of protein sequences and the prediction of E. coli promoters in DNA sequences. Our proposed methods, based on neural network architectures, combine techniques ranging from Bayesian inference, coding theory, feature selection, and dimensionality reduction to dynamic programming and machine learning algorithms. Empirical studies show that the proposed methods outperform previously published methods and have excellent performance on the latest dataset. We have implemented the proposed algorithms in an infrastructure, called Genome Mining, developed for biosequence classification and recognition.