771 research outputs found

    Spectral Sequence Motif Discovery

    Sequence discovery tools play a central role in several fields of computational biology. In the framework of transcription factor binding studies, motif-finding algorithms of increasingly high performance are required to process the large datasets produced by new high-throughput sequencing technologies. Most existing algorithms are computationally demanding and often cannot handle the size of new experimental data. We present a new motif discovery algorithm built on a recent machine learning technique referred to as the Method of Moments. Based on spectral decompositions, this method is robust under model misspecification and is not prone to locally optimal solutions. We obtain an algorithm that is extremely fast and designed for the analysis of big sequencing data. In a few minutes, we can process datasets of hundreds of thousands of sequences and extract motif profiles that match those computed by various state-of-the-art algorithms. Comment: 20 pages, 3 figures, 1 table

    Investigating Reciprocal Control of Adherence and Motility through the Lens of PapX, a Non-structural Fimbrial Repressor of Flagellar Synthesis.

    Most uncomplicated urinary tract infections (UTIs) are caused by uropathogenic Escherichia coli (UPEC). Both motility and adherence are integral to UTI pathogenesis, yet they represent opposing forces. Therefore it is logical to reciprocally regulate these functions. PapX, a non-structural protein encoded by the pheV- but not pheU-associated pap operon encoding the P fimbria adherence factor of E. coli CFT073, represses flagella-mediated motility and belongs to a highly conserved family of winged-helix transcription factors. Thus, when P fimbriae are synthesized for adherence, synthesis of flagella is repressed. The mechanism of this repression, however, is not understood. papX is found preferentially in more virulent UPEC isolates, being significantly more prevalent in pyelonephritis strains (53% of isolates) than in asymptomatic bacteriuria (32%) or fecal/commensal (12.5%) strains. To examine PapX structure-function, we generated papX linker-insertion and site-directed mutants, which identified two key residues for PapX function (Lys54 and Arg127) within domains predicted by modeling with I-TASSER software to be important for dimerization and DNA binding, respectively. SELEX in conjunction with high-throughput sequencing was utilized for the first time to determine the unique binding site for the bacterial transcription factor PapX in E. coli CFT073. It was necessary to write and implement novel software for the analysis of the results from this technique. The software, TFAST, is freely available (Appendix C) and has near-perfect agreement (k = 0.89) to a gold standard in peak-finding software, MACS. Analysis of TFAST indicates that it correctly stratifies data to generate meaningful results, and successfully identified a 29 bp binding site within the flhDC promoter (TTACGGTGAGTTATTTTAACTGTGCGCAA), centered 410 bp upstream of the flhD translational start site. 
PapX bound the flhD promoter in gel shift experiments, which was reversible with the 29 bp sequence, indicating that PapX binds directly to this site to repress transcription of flagellar genes. Microarray, qPCR and promoter fusions indicate that PapX is not transcriptionally regulated itself. Co-precipitation studies indicate that PapX likely requires at least one cofactor for its repressive activity, and OmpA was identified as a promising candidate. PhD, Microbiology and Immunology, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/111505/1/djreiss_1.pd
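
The abstract above reports agreement between TFAST and MACS as k = 0.89. As a rough illustration of how such an agreement statistic can be computed, the sketch below assumes that k denotes Cohen's kappa (the abstract does not say so) and that each tool's output has been reduced to a binary peak/no-peak call per genomic window. The function name and the toy calls are hypothetical and are not taken from TFAST or MACS.

```python
def cohens_kappa(calls_a, calls_b):
    """Cohen's kappa for two equal-length lists of binary (0/1) calls."""
    if len(calls_a) != len(calls_b) or not calls_a:
        raise ValueError("need two non-empty call lists of equal length")
    n = len(calls_a)
    # Observed agreement: fraction of windows where both tools agree.
    p_observed = sum(a == b for a, b in zip(calls_a, calls_b)) / n
    # Chance agreement, from each tool's marginal rate of calling a peak.
    rate_a = sum(calls_a) / n
    rate_b = sum(calls_b) / n
    p_chance = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_chance == 1.0:
        # Both tools give the same constant call; kappa is undefined,
        # so this sketch returns 1.0 by convention (an assumption).
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical per-window calls (1 = peak called, 0 = no peak).
tool_one = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
tool_two = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]
kappa = cohens_kappa(tool_one, tool_two)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when peaks are rare and two tools would agree on most windows simply by calling nothing.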

    A modified bacterial one-hybrid system yields improved quantitative models of transcription factor specificity

    We examine the use of high-throughput sequencing on binding sites recovered using a bacterial one-hybrid (B1H) system and find that improved models of transcription factor (TF) binding specificity can be obtained compared to standard methods of sequencing a small subset of the selected clones. We can obtain even more accurate binding models using a modified version of the B1H selection method with constrained variation (CV-B1H). However, achieving these improved models using CV-B1H data required the development of a new method of analysis, GRaMS (Growth Rate Modeling of Specificity), that estimates bacterial growth rates as a function of the quality of the recognition sequence. We benchmark these different methods of motif discovery using Zif268, a well-characterized C2H2 zinc-finger TF, on both a 28 bp randomized library for the standard B1H method and a 6 bp randomized library for the CV-B1H method, for which 45 different experimental conditions were tested: five time points and three different IPTG and 3-AT concentrations. We find that GRaMS analysis is robust to the different experimental parameters, whereas other analysis methods give widely varying results depending on the conditions of the experiment. Finally, we demonstrate that the CV-B1H assay can be performed in liquid media, which produces recognition models that are similar in quality to sequences recovered from selection on solid media

    Proteiini-DNA sitoutumisspesifisyyksien mallintaminen satunnaismetsällä (Modeling protein-DNA binding specificities with random forest)

    In this Master's thesis, protein-DNA binding specificities are modeled with random forest. Specific proteins called transcription factors are essential for the regulation of gene expression, since their binding to DNA can alter the probability of transcription initiation of target genes. Furthermore, transcription factors can bind DNA as dimers even though as individuals they would lack the required affinity for the binding site. Thus, models that predict binding sites of individual proteins and of protein dimers would be beneficial for deducing gene regulatory networks. In this thesis, HT-SELEX and CAP-SELEX data sets measured by Jolma et al. are used for modeling binding specificities. SELEX measurements yield large sets of DNA sequences that are known to contain a binding site. HT-SELEX measures binding sites of individual transcription factors, while CAP-SELEX measures binding sites of transcription factor dimers. Currently, position weight matrices (PWMs) are most often used for modeling protein-DNA binding specificities, even though they may be too simple and inflexible for accurate modeling. For instance, a neural network model, DeepBind, has been shown to outperform PWM modeling significantly. In this thesis, random forest, which is known to be well suited for high-dimensional and correlated data, is combined with PWMs to yield models of protein-DNA binding specificities. For binding sites of individual transcription factors, random forest performs almost as well as DeepBind and outperforms PWM modeling significantly. In addition, random forest predicts protein dimer binding sites significantly more accurately than position weight matrices. Furthermore, the difference between random forest and PWM modeling is greater for protein pairs than for individual proteins. In addition, DeepBind is not currently provided for transcription factor pairs. 
Thus, according to the results presented in this thesis, modeling protein-DNA binding specificities with random forest is beneficial in comparison to position weight matrices, especially for protein dimers.

    Protein-DNA Recognition Models for the Homeodomain and C2H2 Zinc Finger Transcription Factor Families

    Transcription factors (TFs) play a central role in the gene regulatory network of each cell. They can stimulate or inhibit transcription of their target genes by binding to short, degenerate DNA sequence motifs. The goal of this research is to build improved models of TF binding site recognition. This can facilitate the determination of regulatory networks and also allow for the prediction of binding site motifs based only on the TF protein sequence. Recent technological advances have rapidly expanded the amount of quantitative TF binding data available. Protein Binding Microarrays (PBMs) have recently been implemented in a format that allows all 10mers to be assayed in parallel. There is now PBM data available for hundreds of transcription factors. Another fairly recent technique for determining the binding preference of a TF is an in vivo bacterial one-hybrid (B1H) assay. In this approach a TF is expressed in E. coli, where it can be used to select strong binding sites from a library of randomized sites located upstream of a weak promoter driving expression of a selectable gene. When coupled with high-throughput sequencing and a newly developed analysis method, quantitative binding data can be obtained. In the last few years, the binding specificities of hundreds of TFs have been determined using B1H. The two largest eukaryotic transcription factor families are the zf-C2H2 and homeodomain TF families. Newly available PBM and B1H specificity models were used to develop recognition models for these two families, with the goal of being able to predict the binding specificity of a TF from its protein sequence. We developed a feature selection method based on adjusted mutual information that automatically recovers nearly all of the known key residues for the homeodomain and zf-C2H2 families. 
Using those features we find that, for both families, random forest (RF) and support vector machine (SVM) based recognition models outperform the nearest neighbor method, which has previously been considered the best method

    Transcription factor family-specific DNA shape readout revealed by quantitative specificity models

    Transcription factors (TFs) achieve DNA-binding specificity through contacts with functional groups of bases (base readout) and readout of structural properties of the double helix (shape readout). Currently, it remains unclear whether DNA shape readout is utilized by only a few selected TF families, or whether this mechanism is used extensively by most TF families. We resequenced data from previously published HT-SELEX experiments, the most extensive mammalian TF–DNA binding data available to date. Using these data, we demonstrated the contributions of DNA shape readout across diverse TF families and its importance in core motif-flanking regions. Statistical machine-learning models combined with feature-selection techniques helped to reveal the nucleotide position-dependent DNA shape readout in TF-binding sites and the TF family-specific position dependence. Based on these results, we proposed novel DNA shape logos to visualize the DNA shape preferences of TFs. Overall, this work suggests a way of obtaining mechanistic insights into TF–DNA binding without relying on experimentally solved all-atom structures

    UniPROBE: an online database of protein binding microarray data on protein–DNA interactions

    The UniPROBE (Universal PBM Resource for Oligonucleotide Binding Evaluation) database hosts data generated by universal protein binding microarray (PBM) technology on the in vitro DNA-binding specificities of proteins. This initial release of the UniPROBE database provides a centralized resource for accessing comprehensive PBM data on the preferences of proteins for all possible sequence variants (‘words’) of length k (‘k-mers’), as well as position weight matrix (PWM) and graphical sequence logo representations of the k-mer data. In total, the database hosts DNA-binding data for over 175 nonredundant proteins from a diverse collection of organisms, including the prokaryote Vibrio harveyi, the eukaryotic malarial parasite Plasmodium falciparum, the parasitic Apicomplexan Cryptosporidium parvum, the yeast Saccharomyces cerevisiae, the worm Caenorhabditis elegans, mouse and human. Current web tools include a text-based search, a function for assessing motif similarity between user-entered data and database PWMs, and a function for locating putative binding sites along user-entered nucleotide sequences. The UniPROBE database is available at http://thebrain.bwh.harvard.edu/uniprobe/
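
The abstract above mentions PWM representations and a tool for locating putative binding sites along a nucleotide sequence. The sketch below illustrates the general idea of PWM scanning only; it is not UniPROBE's implementation. The aligned sites, the pseudocount, the uniform background and all function names are assumptions made for the example.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM from equal-length aligned binding sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(width):
        column = [site[pos] for site in sites]
        # Smoothed base frequency, compared against a uniform background.
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in BASES
        })
    return pwm


def scan(sequence, pwm):
    """Score every forward-strand window; returns a list of (start, score).

    Only A, C, G and T are supported; other symbols raise KeyError.
    """
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits


def best_hit(sequence, pwm):
    """Return the (start, score) pair with the highest score."""
    return max(scan(sequence, pwm), key=lambda hit: hit[1])


# Hypothetical aligned sites and query sequence, for illustration only.
toy_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
toy_pwm = build_pwm(toy_sites)
toy_start, toy_score = best_hit("CCGTTGACTCAGGA", toy_pwm)
```

A practical scanner would also score the reverse complement and apply a score threshold rather than reporting only the single best window; both are omitted here to keep the sketch short.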

    Specificity Determination by paralogous winged helix-turn-helix transcription factors

    Transcription factors (TFs) localize to regulatory regions throughout the genome, where they exert physical or enzymatic control over the transcriptional machinery and regulate expression of target genes. Despite the substantial diversity of TFs found across all kingdoms of life, most belong to a relatively small number of structural families characterized by homologous DNA-binding domains (DBDs). In homologous DBDs, highly conserved DNA-contacting residues define a characteristic ‘recognition potential’, or the limited sequence space containing high-affinity binding sites. Specificity-determining residues (SDRs) alter DNA binding preferences to further delineate this sequence space between homologous TFs, enabling functional divergence through the recognition of distinct genomic binding sites. This thesis explores the divergent DNA-binding preferences among dimeric, winged helix-turn-helix (wHTH) TFs belonging to the OmpR sub-family. As the terminal effectors of orthogonal two-component signaling pathways in Escherichia coli, OmpR paralogs bind distinct genomic sequences and regulate the expression of largely non-overlapping gene networks. Using high-throughput SELEX, I discover multiple sources of variation in DNA-binding, including the spacing and orientation of monomer sites as well as a novel binding ‘mode’ with unique half-site preferences (but retaining dimeric architecture). Surprisingly, given the diversity of residues observed occupying positions in contact with DNA, there are only minor quantitative differences in sequence-specificity between OmpR paralogs. Combining phylogenetic, structural, and biological information, I then define a comprehensive set of putative SDRs, which, although distributed broadly across the protein:DNA interface, preferentially localize to the major groove of the DNA helix. 
Direct specificity profiling of SDR variants reveals that individual SDRs impact local base preferences as well as global structural properties of the protein:DNA complex. This study demonstrates clearly that OmpR family TFs possess multiple ‘axes of divergence’, including base recognition, dimeric architecture, and structural attributes of the protein:DNA complex. It also provides evidence for a common structural ‘code’ for DNA-binding by OmpR homologues, and demonstrates that surprisingly modest residue changes can enable recognition of highly divergent sequence motifs. Importantly, well-characterized genomic binding sites for many of the TFs in this study diverge substantially from the presented de novo models, and it is unclear how mutations may affect binding in more complex environments. Further analysis using native sequences is required to build combined models of cis- and trans-evolution of two-component regulatory networks