44 research outputs found

    Development of a database of transcription regulation motifs in bacteria

    Objectives. The volume of data generated by modern high-throughput sequencing methods is such that it is analyzed mainly in automatic mode. In particular, newly sequenced genomes can be used only after their functional elements have been annotated, which is usually done by automatic pipelines. Such annotation pipelines identify genes well, but none of them annotates regulatory elements, without which it is impossible to understand when and how genes can be expressed. Information on bacterial regulatory elements is collected in several specialized databases (RegulonDB, CollecTF, Prodoric2, etc.); however, only part of this information can be used to annotate regulatory elements, and only for a very limited range of bacteria. Previously, we proposed a clear formal criterion for applying regulatory information to any bacterial genome: the CR tag, a sequence of amino acid residues of a transcriptional regulator that specifically contact the nitrogenous bases of a regulatory element in genomic DNA. The mathematical model of a regulatory element (motif) associated with a CR tag can be correctly applied to annotate similar elements in any genome encoding a transcriptional regulator with an identical CR tag. The accumulation of motifs associated with CR tags raised the question of storing them in an ordered way for convenient later use in annotating genomic sequences. Since none of the well-known databases uses the concept of CR tags, a new database had to be developed. Thus, the goal of this work is to create a database with information about bacterial transcription factors and the DNA sequences they recognize, suitable for annotating regulatory sequences in bacterial genomes.
    Methods. Infological modeling of the subject area was carried out using the IDEF1X methodology. The database was developed using the Microsoft SQL Server DBMS. A cross-platform application for importing data into the database was written in C++ using Qt.
    Results. Based on the study of the subject area, a relational data model was developed and implemented in the Microsoft SQL Server DBMS. It allows holistic storage of information about accumulated transcription regulation motifs in bacteria, including information about the publications confirming their correctness. To automate the entry of accumulated data, a cross-platform application was developed for importing structured data on transcription factors.
    Conclusion. The main distinguishing feature of the developed database is its use of the CR-tag concept. Records of mathematical models of regulatory elements (motifs) in the database are associated with a CR tag and can therefore be correctly used to annotate similar elements in any genome encoding a transcriptional regulator with an identical CR tag. The developed database will provide structured and holistic data storage, as well as quick search when used in a pipeline for automatic annotation of regulatory elements in bacterial genomic sequences.
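
The abstract above describes a relational model in which motifs are tied to a CR tag and to supporting publications, but it does not give the actual schema. The following is only a minimal sketch of that idea, using Python's built-in sqlite3 in place of Microsoft SQL Server. All table names, column names, and the sample row (including the tag `RSTKE`) are hypothetical placeholders, not taken from the paper.

```python
import sqlite3

# Hypothetical, simplified schema: the paper's IDEF1X model and SQL Server
# tables are not given in the abstract, so every name here is an assumption.
SCHEMA = """
CREATE TABLE cr_tag (
    id INTEGER PRIMARY KEY,
    residues TEXT NOT NULL UNIQUE   -- DNA-contacting residues of a regulator
);
CREATE TABLE motif (
    id INTEGER PRIMARY KEY,
    cr_tag_id INTEGER NOT NULL REFERENCES cr_tag(id),
    name TEXT NOT NULL,
    consensus TEXT NOT NULL
);
CREATE TABLE publication (
    id INTEGER PRIMARY KEY,
    motif_id INTEGER NOT NULL REFERENCES motif(id),
    citation TEXT NOT NULL
);
"""


def build_demo_db():
    """Create an in-memory database holding one placeholder record."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Placeholder data for illustration only (not real biological records).
    conn.execute("INSERT INTO cr_tag (id, residues) VALUES (1, 'RSTKE')")
    conn.execute(
        "INSERT INTO motif (id, cr_tag_id, name, consensus) "
        "VALUES (1, 1, 'demo_motif', 'TGTGANNNNNNTCACA')"
    )
    conn.execute(
        "INSERT INTO publication (id, motif_id, citation) "
        "VALUES (1, 1, 'placeholder citation')"
    )
    conn.commit()
    return conn


def motifs_for_cr_tag(conn, residues):
    """Return (name, consensus) for every motif linked to the given CR tag."""
    return conn.execute(
        "SELECT m.name, m.consensus FROM motif AS m "
        "JOIN cr_tag AS t ON t.id = m.cr_tag_id "
        "WHERE t.residues = ? ORDER BY m.name",
        (residues,),
    ).fetchall()
```

In an annotation pipeline of the kind the abstract mentions, a lookup such as `motifs_for_cr_tag(conn, tag)` would be the step that decides which stored motifs may be applied to a genome whose regulator carries an identical CR tag.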

    Inferred regulons are consistent with regulator binding sequences in E. coli

    The transcriptional regulatory network (TRN) of E. coli consists of thousands of interactions between regulators and DNA sequences. Regulons are typically either determined from resource-intensive experimental measurement of functional binding sites or inferred from analysis of high-throughput gene expression datasets. Recently, independent component analysis (ICA) of RNA-seq compendia has been shown to be a powerful method for inferring bacterial regulons. However, it remains unclear to what extent regulons predicted by ICA structure have a biochemical basis in promoter sequences. Here, we address this question by developing machine learning models that predict inferred regulon structures in E. coli based on promoter sequence features. Models were constructed successfully (cross-validation AUROC ≥ 0.8) for 85% (40/47) of ICA-inferred E. coli regulons. We found that: 1) the presence of a high-scoring regulator motif in the promoter region was sufficient to specify regulatory activity in 40% (19/47) of the regulons; 2) additional features, such as DNA shape and extended motifs that can account for regulator multimeric binding, helped to specify regulon structure for the remaining 60% of regulons (28/47); 3) investigating regulons where initial machine learning models failed revealed new regulator-specific sequence features that improved model accuracy. Finally, we found that strong regulatory binding sequences underlie both the genes shared between ICA-inferred and experimental regulons and the genes in the E. coli core pan-regulon of Fur. This work demonstrates that the structure of ICA-inferred regulons can largely be understood through the strength of regulator binding sites in promoter regions, reinforcing the utility of top-down inference for regulon discovery.
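
The "high-scoring regulator motif" feature mentioned above can be illustrated with a position weight matrix (PWM) scan. This is a generic sketch and not the authors' pipeline: the pseudocount, the uniform background frequency, and the function names are assumptions, and the promoter is assumed to contain only the letters A, C, G, and T.

```python
import math

BASES = "ACGT"


def log_odds_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM from equal-length aligned binding sites."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        matrix.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in BASES
        })
    return matrix


def best_hit(matrix, promoter):
    """Return (score, offset) of the best-scoring window, or None if too short."""
    width = len(matrix)
    best = None
    for start in range(len(promoter) - width + 1):
        window = promoter[start:start + width]
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if best is None or score > best[0]:
            best = (score, start)
    return best
```

The best score per promoter is the kind of single numeric feature a classifier could take as input; the abstract indicates that for many regulons further features (DNA shape, extended motifs) were needed as well.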

    rSeqTU—A Machine-Learning Based R Package for Prediction of Bacterial Transcription Units

    A transcription unit (TU) is composed of one or more adjacent genes on the same strand that are co-transcribed, mostly in prokaryotes. Accurate identification of TUs is a crucial first step to delineate the transcriptional regulatory networks and elucidate the dynamic regulatory mechanisms encoded in various prokaryotic genomes. Many genomic features (for example, intergenic distance) and transcriptomic features (including continuous and stable RNA-seq read-count signals) have been collected from a large amount of experimental data and integrated into classification techniques to computationally predict genome-wide TUs. Although some tools and web servers can predict TUs from bacterial RNA-seq data and genome sequences, there is a need for an improved machine learning prediction approach and a more comprehensive pipeline that handles QC, TU prediction, and TU visualization. To enable users to perform TU identification efficiently on their local computers or high-performance clusters, and to provide more accurate predictions, we developed an R package named rSeqTU. rSeqTU uses a random forest algorithm to select essential features describing TUs and then uses a support vector machine (SVM) to build TU prediction models. rSeqTU (available at https://s18692001.github.io/rSeqTU/) has six computational functionalities: read quality control, read mapping, training set generation, random forest-based feature selection, TU prediction, and TU visualization.

    Homology-based reconstruction of regulatory networks for bacterial and archaeal genomes

    Gene regulation is a key process for all microorganisms, as it allows them to adapt to different environmental stimuli. However, despite the relevance of gene expression control, information about genome regulation is available for only a handful of organisms. In this work, we inferred the gene regulatory networks (GRNs) of bacterial and archaeal genomes by comparison with six organisms with well-known regulatory interactions. The references we used are Escherichia coli K-12 MG1655, Bacillus subtilis 168, Mycobacterium tuberculosis, Pseudomonas aeruginosa PAO1, Salmonella enterica subsp. enterica serovar Typhimurium LT2, and Staphylococcus aureus N315. The inferences were made in two steps. First, the six model organisms were contrasted in an all-vs-all comparison of known interactions based on Transcription Factor (TF)-Target Gene (TG) orthology relationships and Transcription Unit (TU) assignments. Second, we used a guilt-by-association approach to infer the GRNs for 12,230 bacterial and 649 archaeal genomes based on the TF-TG orthology relationships of the six bacterial models determined in the first step. Finally, we discuss examples that show the most relevant results obtained from these inferences. A web server with all the predicted GRNs is available at https://regulatorynetworks.unam.mx/ or http://132.247.46.6/
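
The core of orthology-based transfer described above can be sketched as follows: a known TF-TG interaction is carried over to a target genome only when both the regulator and the target gene have orthologs there. This is a simplified illustration under stated assumptions. It ignores the Transcription Unit assignments the abstract mentions, and the function name and gene identifiers are hypothetical placeholders.

```python
def transfer_interactions(reference_edges, orthologs):
    """Transfer TF -> target edges to a genome when both genes have orthologs.

    reference_edges: iterable of (tf, target) pairs known in a model organism.
    orthologs: dict mapping a reference gene to its ortholog in the target genome.
    """
    inferred = set()
    for tf, target in reference_edges:
        if tf in orthologs and target in orthologs:
            inferred.add((orthologs[tf], orthologs[target]))
    return inferred


# Placeholder identifiers, not real genes.
reference_edges = [("tfA", "g1"), ("tfA", "g2"), ("tfB", "g3")]
orthologs = {"tfA": "X_tfA", "g1": "X_g1", "g3": "X_g3"}
inferred = transfer_interactions(reference_edges, orthologs)
```

Here only the edge from `tfA` to `g1` survives, because `g2` and `tfB` have no ortholog in the hypothetical target genome.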

    Noise propagation in Escherichia coli's regulatory network

    The ability to regulate gene expression allows bacteria to grow under diverse conditions, often involving large regulatory networks. As gene expression is an inherently stochastic process, accurate regulation will only be achieved if the molecules involved in the process adapt perfectly to the different conditions and show low noise themselves. In Escherichia coli, it has been reported that high-noise promoters are characterized by a large number of regulatory binding sites in their sequences and that noise propagation from the regulators to their targets explains the elevated noise levels. This suggests that regulation and noise are intimately coupled. However, little is known about this association or even about how noise levels vary in response to changes in the environment. The work presented in this thesis aims to elucidate to what extent noise and gene regulation are coupled. We have quantified the variation in genome-wide transcriptional noise across 8 diverse growth conditions in Escherichia coli using flow cytometry and high-throughput microscopy. In summary, we find a growth-rate-dependent lower bound on noise, mainly exhibited by constitutive promoters. Individual regulated promoters show complex behaviours in terms of changes in mean and noise across conditions, and condition-dependent expression noise shaped by noise propagation from transcription factors. Using a simple linear model, we identify a set of TFs that contribute to condition-specific and condition-independent noise propagation. The overall correlation structure of genome-wide expression properties reveals that genes are organized along two principal axes, the first sorting genes by their mean expression and evolutionary rate, and the second by their expression noise, number of regulatory inputs, and expression plasticity. Overall, the results of the thesis show clear evidence that noise and regulation are intimately linked due to noise propagation from regulators to their targets, and that this association has evolved independently of a promoter's expression level or the evolutionary rate of its coding region.

    Genome-wide gene expression noise in Escherichia coli is condition-dependent and determined by propagation of noise through the regulatory network

    Although it is well appreciated that gene expression is inherently noisy and that transcriptional noise is encoded in a promoter's sequence, little is known about the extent to which noise levels of individual promoters vary across growth conditions. Using flow cytometry, we here quantify transcriptional noise in Escherichia coli genome-wide across 8 growth conditions and find that noise levels systematically decrease with growth rate, with a condition-dependent lower bound on noise. Whereas constitutive promoters consistently exhibit low noise in all conditions, regulated promoters are both more noisy on average and more variable in noise across conditions. Moreover, individual promoters show highly distinct variation in noise across conditions. We show that a simple model of noise propagation from regulators to their targets can explain a significant fraction of the variation in relative noise levels and identifies TFs that most contribute to both condition-specific and condition-independent noise propagation. In addition, analysis of the genome-wide correlation structure of various gene properties shows that gene regulation, expression noise, and noise plasticity are all positively correlated genome-wide and vary independently of variations in absolute expression, codon bias, and evolutionary rate. Together, our results show that while absolute expression noise tends to decrease with growth rate, relative noise levels of genes are highly condition-dependent and determined by the propagation of noise through the gene regulatory network

    Discovery of a DNA-binding Consensus and Potential Genomic Regulatory Binding Sites for the Thermus thermophilus HB8 Transcriptional Regulator TTHA1359

    Transcription factors (TFs) are proteins that modulate the initiation of gene transcription, the first step in gene expression. Knowledge of the DNA-binding specificities and target genes of many TFs, including those of well-studied model organisms such as Escherichia coli and Thermus thermophilus, remains incomplete or missing, which leaves gaps in the understanding of the regulatory networks and systems biology of many organisms. Cyclic-AMP receptor protein (CRP) regulators and fumarate and nitrate reduction regulator (FNR) proteins make up the CRP/FNR superfamily of TFs, a diverse subgroup of bacterial TFs that regulate various gene expression programs. In the present work, a reverse-genetic approach based on the combinatorial selection technique Restriction Endonuclease Protection, Selection, and Amplification (REPSA) was applied to study TTHA1359, one of the four CRP/FNR superfamily TFs in the model organism T. thermophilus HB8. A TTHA1359-binding consensus, 5’-A(T/A)TGT(G/A)A(N6)T(C/T)ACA(A/T)T-3’, was identified by using REPSA to select DNA sequences that TTHA1359 preferentially binds, massively parallel sequencing to read the selected sequences, and bioinformatics to discover TTHA1359-binding motifs in the resulting sequence data. Binding of TTHA1359 to the identified consensus was characterized biophysically, and TTHA1359 was found to bind it with high affinity (KD of ~3.4 nM). Several potential regulatory binding sites for TTHA1359 were identified bioinformatically by mapping the TTHA1359-binding consensus to the T. thermophilus HB8 genome. These findings should not only add to the knowledge of the DNA-binding specificity of TTHA1359 and the genes it regulates but also provide insight into the applied reverse-genetic technique that should guide its future application to other TFs.
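
Mapping a degenerate consensus such as the one above onto a genome can be sketched by translating it into a regular expression. This is a minimal illustration, not the tool used in the study: the function names and the test sequence are assumptions, and only exact matches to the consensus are reported. Because this particular consensus is its own reverse complement, scanning one strand is enough for exact matches.

```python
import re

# Consensus as written in the abstract, without the 5'/3' end markers.
CONSENSUS = "A(T/A)TGT(G/A)A(N6)T(C/T)ACA(A/T)T"


def consensus_to_regex(consensus):
    """Translate '(X/Y)' alternatives and '(Nk)' spacers into regex syntax."""
    # (N6) -> any six bases
    pattern = re.sub(r"\(N(\d+)\)", r"[ACGT]{\1}", consensus)
    # (T/A) -> [TA]
    pattern = re.sub(r"\(([ACGT])/([ACGT])\)", r"[\1\2]", pattern)
    return pattern


def find_sites(sequence, consensus=CONSENSUS):
    """Return (offset, matched_site) for every match, overlapping ones included."""
    pattern = consensus_to_regex(consensus)
    # A lookahead wrapper lets finditer report overlapping matches.
    lookahead = re.compile("(?=(" + pattern + "))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(sequence.upper())]


# Made-up test sequence containing one exact match at offset 3.
demo_sequence = "GGG" + "ATTGTGA" + "CCCCCC" + "TCACAAT" + "GGG"
hits = find_sites(demo_sequence)
```

A real genome scan would usually also allow mismatches or use a scoring matrix, since functional sites often deviate from the consensus.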