14 research outputs found

    Identification and analysis of patterns in DNA sequences, the genetic code and transcriptional gene regulation

    The present cumulative work consists of six articles linked by the topic “Identification and Analysis of Patterns in DNA Sequences, the Genetic Code and Transcriptional Gene Regulation”. We applied a binary coding to find patterns within nucleotide sequences efficiently. In the first and second parts of my work, a single bit is used to encode all four nucleotides. The three possible one-bit codings are keto (G, U) versus amino (A, C) bases, strong (G, C) versus weak (A, U) bases, and purines (G, A) versus pyrimidines (C, U). We found that the purine-pyrimidine coding reveals the clearest patterns. Applying this coding, we found a new representation of the genetic code, published under the titles “A New Classification Scheme of the Genetic Code” in “Journal of Molecular Biology” and “A Purine-Pyrimidine Classification Scheme of the Genetic Code” in “BIOForum Europe”. This representation reduces the common table of the genetic code from 64 to 32 fields while maintaining the same information content. All known patterns of the genetic code, and even new ones, can easily be recognized in this scheme. Furthermore, the new representation allows speculation about the origin and evolution of the translation machinery and the genetic code. We thus found a possible explanation for the contemporary codon-amino acid assignment and broad support for an early doublet code. These explanations were published in “Journal of Bioinformatics and Computational Biology” under the title “The New Classification Scheme of the Genetic Code, its Early Evolution, and tRNA Usage”. Expecting to find these purine-pyrimidine patterns at the DNA level as well, we examined DNA binding sites for the occurrence of binary patterns. A comprehensive statistical survey of the largest class of restriction enzymes (type II) showed a very distinctive purine-pyrimidine pattern. Moreover, we observed a higher G+C content in the protein binding sequences. For both observations we provide and discuss several explanations, published under the title “Common Patterns in Type II Restriction Enzyme Binding Sites” in “Nucleic Acids Research”. The identified patterns may help to explain how a protein finds its binding site. In the last part of my work, two submitted articles on the analysis of Boolean functions are presented. Boolean functions are used to describe and analyze complex dynamic processes, and they make it easier to find binary patterns within biochemical interaction networks. It is well known that not all Boolean functions are necessary to describe biologically relevant gene interaction networks. In the article entitled “Boolean Networks with Biologically Relevant Rules Show Ordered Behavior”, submitted to “BioSystems”, we showed that the class of required Boolean functions can be strongly restricted. Furthermore, we calculated the exact number of hierarchically canalizing functions, which are known to be biologically relevant. In our work “The Decomposition Tree for Analysis of Boolean Functions”, submitted to “Journal of Complexity”, we introduced an efficient data structure for the classification and analysis of Boolean functions. This permits the recognition of biologically relevant Boolean functions in polynomial time.
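
The three one-bit codings named in the abstract can be illustrated with a minimal Python sketch. The names `CODINGS` and `encode` are hypothetical, and the choice of which class maps to 1 is an arbitrary assumption made for illustration; the abstract only states which bases are grouped together.

```python
# One-bit codings of the four RNA nucleotides, as grouped in the abstract.
# Assumption: the first-named class of each pair (keto, strong, purine) is 1.
CODINGS = {
    "keto_amino": {"G": 1, "U": 1, "A": 0, "C": 0},
    "strong_weak": {"G": 1, "C": 1, "A": 0, "U": 0},
    "purine_pyrimidine": {"G": 1, "A": 1, "C": 0, "U": 0},
}


def encode(sequence, coding="purine_pyrimidine"):
    """Return the sequence as a string of 0/1 characters under one coding."""
    table = CODINGS[coding]
    # Treat DNA input like RNA by mapping T to U before the lookup.
    rna = sequence.upper().replace("T", "U")
    return "".join(str(table[base]) for base in rna)


if __name__ == "__main__":
    for name in CODINGS:
        print(name, encode("GAUC", name))
```

Any base outside G, A, U, C, T raises a `KeyError`, which is a deliberate choice in this sketch so that ambiguous symbols are not silently encoded.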

    BioBayesNet: a web server for feature extraction and Bayesian network modeling of biological sequence data

    BioBayesNet is a new web application for modeling and classifying biological data with Bayesian networks. To learn a Bayesian network, the user can upload either a set of annotated FASTA sequences or a set of pre-computed feature vectors. For FASTA input, the server can generate a wide range of sequence and structural features from the sequences, and these features are then used to learn the network. An automatic feature selection procedure assists in selecting discriminative features, providing a (locally) optimal feature set. The output includes several quality measures for the overall network and for individual features, as well as a graphical representation of the network structure that allows the user to explore dependencies between features. Finally, the learned Bayesian network, or another uploaded network, can be used to classify new data. BioBayesNet facilitates the use of Bayesian networks in biological sequence analysis and is flexible enough to support modeling and classification applications in various scientific fields. The BioBayesNet server is available at http://biwww3.informatik.uni-freiburg.de:8080/BioBayesNet/

    DiProDB: a database for dinucleotide properties

    DiProDB (http://diprodb.fli-leibniz.de) is a database of conformational and thermodynamic dinucleotide properties. It includes datasets for both DNA and RNA, and for both single and double strands. The data have been shown to be important for understanding different aspects of nucleic acid structure and function, and they can also be used for encoding nucleic acid sequences. The database is intended to facilitate further applications of dinucleotide properties. Several property datasets are highly correlated; the database therefore comes with a correlation analysis facility. Authors who have determined new sets of dinucleotide property values are invited to submit these data to DiProDB.

    TassDB: a database of alternative tandem splice sites

    Subtle alternative splice events at tandem splice sites are frequent in eukaryotes and substantially increase the complexity of transcriptomes and proteomes. We have developed a relational database, TassDB (TAndem Splice Site DataBase), which stores extensive data about alternative splice events at GYNGYN donors and NAGNAG acceptors. These splice events are subtle because they mostly result in the insertion or deletion of a single amino acid, or in the substitution of one amino acid by two others. Currently, TassDB contains 114 554 tandem splice sites from eight species, 5209 of which have EST/mRNA evidence for alternative splicing. In addition, human SNPs that affect NAGNAG acceptors are annotated. The database provides a user-friendly interface to search for specific genes or for genes containing tandem splice sites with specific features, as well as the possibility to download large datasets. This database should facilitate further experimental studies and large-scale bioinformatics analyses of tandem splice sites. The database is available at
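
The GYNGYN and NAGNAG motifs named above are written in IUPAC notation (N = any nucleotide, Y = C or T). The following Python sketch shows one way to scan a DNA string for such hexamers. It is only a motif scan under stated assumptions: the function and constant names are hypothetical, it is not how TassDB is built, and a motif hit alone does not establish a real splice site.

```python
import re

# IUPAC codes expanded by hand: N -> [ACGT], Y -> [CT].
# The lookahead (?=...) lets overlapping hexamers be reported as well.
NAGNAG_ACCEPTOR = re.compile(r"(?=([ACGT]AG[ACGT]AG))")
GYNGYN_DONOR = re.compile(r"(?=(G[CT][ACGT]G[CT][ACGT]))")


def find_tandem_motifs(sequence, pattern):
    """Return (0-based start, matched hexamer) pairs for every motif hit."""
    seq = sequence.upper()
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq)]


if __name__ == "__main__":
    print(find_tandem_motifs("TTCAGCAGGT", NAGNAG_ACCEPTOR))
    print(find_tandem_motifs("AAGTAGTGCC", GYNGYN_DONOR))
```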

    Accurate prediction of NAGNAG alternative splicing

    Alternative splicing (AS) involving NAGNAG tandem acceptors is an evolutionarily widespread class of AS. Recent predictions of alternative acceptor usage reported better results for acceptors separated by larger distances than for NAGNAGs. To improve the latter, we used Bayesian networks (BN) together with extensive experimental validation of the predictions. Using carefully constructed training and test datasets, a balanced sensitivity and specificity of ≥92% was achieved. A BN trained on the combined dataset was then used to make predictions, and 81% (38/47) of the experimentally tested predictions were verified. Applying a BN learned on human data to six other genomes, we show that while the performance for the vertebrate genomes matches that achieved on human data, there is a slight drop for Drosophila and worm. Lastly, using the prediction accuracy according to experimental validation, we estimate the number of as yet undiscovered alternative NAGNAGs. State-of-the-art classifiers can produce highly accurate predictions of AS at NAGNAGs, indicating that we have identified the major features of the ‘NAGNAG-splicing code’ within the splice site and its immediate neighborhood. Our results suggest that the mechanism behind NAGNAG AS is simple, stochastic, and conserved among vertebrates and beyond.
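
The quality measures quoted above follow from simple ratios, sketched below in Python. Only the 38/47 validation count comes from the abstract; the function names and the confusion-matrix counts in the usage lines are hypothetical placeholders, not the study's data.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of truly alternative sites that the classifier recovers."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of truly constitutive sites that the classifier rejects."""
    return true_neg / (true_neg + false_pos)


def validation_rate(verified, tested):
    """Fraction of experimentally tested predictions that were confirmed."""
    return verified / tested


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate a balanced 92% operating point.
    print(f"sensitivity = {sensitivity(92, 8):.2f}")
    print(f"specificity = {specificity(92, 8):.2f}")
    # Count reported in the abstract: 38 of 47 tested predictions verified.
    print(f"validation rate = {validation_rate(38, 47):.1%}")
```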

    Integrative inference of gene-regulatory networks in Escherichia coli using information theoretic concepts and sequence analysis

    Background: Although Escherichia coli is one of the best-studied model organisms, a comprehensive understanding of its gene regulation has not yet been achieved. Many approaches exist to reconstruct regulatory interaction networks from gene expression experiments. Mutual-information-based approaches are most useful for large-scale network inference. Results: We used a three-step approach in which we combined gene regulatory network inference based on directed information (DTI) with sequence analysis. DTI values were calculated on a set of gene expression profiles from 19 time course experiments extracted from the Many Microbes Microarray Database. Focusing on influences between pairs of genes in which one partner encodes a transcription factor (TF), we derived a network containing 878 TF-gene interactions, of which 166 are known according to RegulonDB. In the second step, we selected a subset of 109 interactions that could be confirmed by the presence of a phylogenetically conserved binding site of the respective regulator. This step increased the fraction of known interactions from 19% to 60%. In the last step, we checked the 44 of the 109 interactions not yet included in RegulonDB for functional relationships between the regulator and the target and thus obtained ten TF-target gene interactions. Five of them concern the regulator LexA and have already been reported in the literature. The remaining five describe regulation by Fis (with two novel targets), PdhR, PhoP, and KdgR. To validate our approach, one of them, the regulation of lipoate synthase (LipA) by the pyruvate-sensing pyruvate dehydrogenase repressor (PdhR), was experimentally checked and confirmed. Conclusions: We predicted a set of five novel TF-target gene interactions in E. coli. One of them, the regulation of lipA by the transcriptional regulator PdhR, was validated experimentally. Furthermore, we developed DTInfer, a new R package for the inference of gene-regulatory networks from microarrays using directed information.
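
As background to the mutual-information-based approaches mentioned in the abstract, the sketch below gives a plug-in estimate of ordinary mutual information between two discretized expression profiles. It is deliberately not the directed information (DTI) measure used in the study and does not reflect the DTInfer package; the function name and the assumption that profiles are already discretized are illustrative choices.

```python
from collections import Counter
from math import log2


def mutual_information(x, y):
    """Plug-in estimate of I(X;Y) in bits from two equal-length discrete profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same length")
    n = len(x)
    count_x, count_y = Counter(x), Counter(y)
    count_xy = Counter(zip(x, y))
    # p(a,b) * log2( p(a,b) / (p(a) p(b)) ), written with raw counts.
    return sum(
        (c / n) * log2(c * n / (count_x[a] * count_y[b]))
        for (a, b), c in count_xy.items()
    )


if __name__ == "__main__":
    regulator = [0, 0, 1, 1]   # hypothetical discretized expression levels
    target = [0, 0, 1, 1]      # perfectly coupled to the regulator
    unrelated = [0, 1, 0, 1]   # statistically independent of the regulator
    print(mutual_information(regulator, target))
    print(mutual_information(regulator, unrelated))
```

Mutual information is symmetric, so unlike directed information it cannot by itself suggest which gene of a pair is the regulator.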

    DiProGB: the dinucleotide properties genome browser

    Motivation: DiProGB is an easy-to-use new genome browser that encodes the primary nucleotide sequence by thermodynamic and geometric dinucleotide properties. The nucleotide sequence is thus converted into a sequence graph. This visualization, supported by different graph manipulation options, facilitates genome analyses, because the human brain can process visual information better than textual information. DiProGB can also identify genomic regions where certain physical properties are more conserved than the nucleotide sequence itself. Most of the DiProGB tools can be applied to both the primary nucleotide sequence and the sequence graph. They include motif and repeat searches as well as statistical analyses. DiProGB adds a new dimension to common genome analysis approaches by taking into account the physical properties of DNA and RNA.
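
The conversion of a nucleotide sequence into a numeric profile, as described above, can be sketched as a sliding dinucleotide lookup. This is a minimal illustration, not DiProGB's implementation: the function name is hypothetical, and `PLACEHOLDER_PROPERTY` holds made-up numbers rather than values from DiProDB.

```python
from itertools import product

# Hypothetical property table: one arbitrary number per dinucleotide (AA=0.0 ... TT=15.0).
# Real analyses would load measured values, e.g. a dataset obtained from DiProDB.
PLACEHOLDER_PROPERTY = {
    a + b: float(i) for i, (a, b) in enumerate(product("ACGT", repeat=2))
}


def dinucleotide_profile(sequence, property_table):
    """Map each overlapping dinucleotide to its property value.

    A sequence of length n yields n - 1 values, which can then be plotted
    against position to obtain a sequence graph.
    """
    seq = sequence.upper()
    return [property_table[seq[i:i + 2]] for i in range(len(seq) - 1)]


if __name__ == "__main__":
    print(dinucleotide_profile("ACGT", PLACEHOLDER_PROPERTY))
```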