329 research outputs found

    Spectral Sequence Motif Discovery

    Sequence discovery tools play a central role in several fields of computational biology. In the framework of Transcription Factor binding studies, motif finding algorithms of increasingly high performance are required to process the big datasets produced by new high-throughput sequencing technologies. Most existing algorithms are computationally demanding and often cannot handle the large size of new experimental data. We present a new motif discovery algorithm that is built on a recent machine learning technique, referred to as Method of Moments. Based on spectral decompositions, this method is robust under model misspecification and is not prone to locally optimal solutions. We obtain an algorithm that is extremely fast and designed for the analysis of big sequencing data. In a few minutes, we can process datasets of hundreds of thousands of sequences and extract motif profiles that match those computed by various state-of-the-art algorithms. Comment: 20 pages, 3 figures, 1 table

    MODER2: First-order Markov Modeling and Discovery of Monomeric and Dimeric Binding Motifs

    Motivation: Position-specific probability matrices (PPMs, also called position-specific weight matrices) have been the dominating model for transcription factor (TF)-binding motifs in DNA. There is, however, increasing recent evidence of better performance of higher order models such as Markov models of order one, also called adjacent dinucleotide matrices (ADMs). ADMs can model dependencies between adjacent nucleotides, unlike PPMs. A modeling technique and software tool that would estimate such models simultaneously both for monomers and their dimers has been missing. Results: We present an ADM-based mixture model for monomeric and dimeric TF-binding motifs and an expectation maximization algorithm MODER2 for learning such models from training data and seeds. The model is a mixture that includes monomers and dimers, built from the monomers, with a description of the dimeric structure (spacing, orientation). The technique is modular, meaning that the co-operative effect of dimerization is made explicit by evaluating the difference between expected and observed models. The model is validated using HT-SELEX and generated datasets, and by comparing to some earlier PPM and ADM techniques. The ADM models explain data slightly better than PPM models for 314 tested TFs (or their DNA-binding domains) from four families (bHLH, bZIP, ETS and Homeodomain), the ADM mixture models by MODER2 being the best on average. Peer reviewed
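The difference between a PPM and an ADM described above can be illustrated with a minimal sketch. It is not MODER2 code: the function names, the dictionary layout of the matrices, and the toy two-position motif are all assumptions made for illustration. A PPM scores each position independently, while an ADM conditions each nucleotide on its left neighbour.

```python
import math

ALPHABET = "ACGT"

def ppm_log_prob(seq, ppm):
    """Log-probability of seq under a PPM (inhomogeneous order-0 Markov chain).

    ppm[i][c] is the probability of nucleotide c at position i.
    """
    return sum(math.log(ppm[i][c]) for i, c in enumerate(seq))

def adm_log_prob(seq, initial, transitions):
    """Log-probability of seq under an ADM (inhomogeneous order-1 Markov chain).

    initial[c] is P(x_0 = c); transitions[i][a][b] is P(x_{i+1} = b | x_i = a).
    """
    logp = math.log(initial[seq[0]])
    for i in range(len(seq) - 1):
        logp += math.log(transitions[i][seq[i]][seq[i + 1]])
    return logp

# Hypothetical toy motif of length 2: "A" prefers to be followed by "C",
# and "G" prefers to be followed by "T".
initial = {"A": 0.4, "C": 0.1, "G": 0.4, "T": 0.1}
uniform = {c: 0.25 for c in ALPHABET}
transitions = [{
    "A": {"A": 0.05, "C": 0.8, "G": 0.05, "T": 0.1},
    "C": dict(uniform),
    "G": {"A": 0.05, "C": 0.1, "G": 0.05, "T": 0.8},
    "T": dict(uniform),
}]

# The PPM with the same per-position marginals as the ADM above.
second_column = {
    b: sum(initial[a] * transitions[0][a][b] for a in ALPHABET)
    for b in ALPHABET
}
ppm = [initial, second_column]

for site in ("AC", "AT"):
    print(site, math.exp(ppm_log_prob(site, ppm)),
          math.exp(adm_log_prob(site, initial, transitions)))
```

In this toy case the PPM assigns "AC" and "AT" the same probability because the marginals at the second position are symmetric, whereas the ADM separates them through the adjacent-nucleotide dependency.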

    Study of protein-DNA interaction using new generation data

    Ph.D., Doctor of Philosophy

    A Generalized Biophysical Model of Transcription Factor Binding Specificity and Its Application on High-Throughput SELEX Data

    The interaction between transcription factors (TFs) and DNA plays an important role in gene expression regulation. In the past, experiments on protein–DNA interactions could only identify a handful of sequences that a TF binds with high affinities. In recent years, several high-throughput experimental techniques, such as high-throughput SELEX (HT-SELEX), protein-binding microarrays (PBMs) and ChIP-seq, have been developed to estimate the relative binding affinities of large numbers of DNA sequences both in vitro and in vivo. The large volume of data generated by these techniques proved to be a challenge and prompted the development of novel motif discovery algorithms. These algorithms are based on a range of TF binding models, including the widely used probabilistic model that represents binding motifs as position frequency matrices (PFMs). However, the probabilistic model has limitations and the PFMs extracted from some of the high-throughput experiments are known to be suboptimal. In this dissertation, we attempt to address these important questions and develop a generalized biophysical model and an expectation maximization (EM) algorithm for estimating position weight matrices (PWMs) and other parameters using HT-SELEX data. First, we discuss the inherent limitations of the popular probabilistic model and compare it with a biophysical model that assumes the nucleotides in a binding site contribute independently to its binding energy instead of binding probability. We use simulations to demonstrate that the biophysical model almost always provides better fits to the data and conclude that it should take the place of the probabilistic model in characterizing TF binding specificity. 
Then we describe a generalized biophysical model, which removes the assumption of known binding locations and is particularly suitable for modeling protein–DNA interactions in HT-SELEX experiments, and BEESEM, an EM algorithm capable of estimating the binding model and binding locations simultaneously. BEESEM can also calculate the confidence intervals of the estimated parameters in the binding model, a rare but useful feature among motif discovery algorithms. By comparing BEESEM with 5 other algorithms on HT-SELEX, PBM and ChIP-seq data, we demonstrate that BEESEM provides significantly better fits to in vitro data and is similar to the other methods (with one exception) on in vivo data under the criterion of the area under the receiver operating characteristic curve (AUROC). We also discuss the limitations of the AUROC criterion, which is purely rank-based and thus misses quantitative binding information. Finally, we investigate whether adding DNA shape features can significantly improve the accuracy of binding models. We evaluate the ability of the gradient boosting classifiers generated by DNAshapedTFBS, an algorithm that takes account of DNA shape features, to differentiate ChIP-seq peaks from random background sequences, and compare them with various matrix-based binding models. The results indicate that, compared with optimized PWMs, adding DNA shape features does not produce significantly better binding models and may increase the risk of overfitting on training datasets
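Two ideas from this abstract can be sketched briefly: an additive-energy binding model and the rank-based nature of AUROC. The code below is a minimal sketch, not BEESEM. The occupancy formula is the common textbook form with energies in units of kT and a chemical potential `mu`; the function names and the toy energy matrix are assumptions for illustration.

```python
import math

def binding_energy(site, energy_matrix):
    """Sum of independent per-position energy contributions (units of kT)."""
    return sum(energy_matrix[i][c] for i, c in enumerate(site))

def occupancy(site, energy_matrix, mu):
    """Binding probability from energy: 1 / (1 + exp(E - mu))."""
    return 1.0 / (1.0 + math.exp(binding_energy(site, energy_matrix) - mu))

def auroc(scores_pos, scores_neg):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Only the ordering of the scores matters.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical energy matrix: consensus "ACG" costs 0, each mismatch costs 2 kT.
consensus = "ACG"
energy_matrix = [
    {c: (0.0 if c == best else 2.0) for c in "ACGT"} for best in consensus
]

bound = ["ACG", "ACT"]
unbound = ["TTT", "AAG"]

# Scoring by negative energy or by occupancy gives the same AUROC, because
# occupancy is a monotone function of energy.
by_energy = auroc([-binding_energy(s, energy_matrix) for s in bound],
                  [-binding_energy(s, energy_matrix) for s in unbound])
by_occupancy = auroc([occupancy(s, energy_matrix, mu=0.0) for s in bound],
                     [occupancy(s, energy_matrix, mu=0.0) for s in unbound])
print(by_energy, by_occupancy)
```

The equality of the two AUROC values illustrates the limitation noted in the abstract: any monotone rescaling of the scores leaves AUROC unchanged, so it cannot distinguish a model with quantitatively correct affinities from one that only ranks sites correctly.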

    CSI-Tree: a regression tree approach for modeling binding properties of DNA-binding molecules based on cognate site identification (CSI) data

    The identification and characterization of binding sites of DNA-binding molecules, including transcription factors (TFs), is a critical problem at the interface of chemistry, biology and molecular medicine. The Cognate Site Identification (CSI) array is a high-throughput microarray platform for measuring comprehensive recognition profiles of DNA-binding molecules. This technique produces datasets that are useful not only for identifying binding sites of previously uncharacterized TFs but also for elucidating dependencies, both local and nonlocal, between the nucleotides at different positions of the recognition sites. We have developed a regression tree technique, CSI-Tree, for exploring the spectrum of binding sites of DNA-binding molecules. Our approach constructs regression trees utilizing the CSI data of unaligned sequences. The resulting model partitions the binding spectrum into homogeneous regions of position specific nucleotide effects. Each homogeneous partition is then summarized by a position weight matrix (PWM). Hence, the final outcome is a binding intensity rank-ordered collection of PWMs each of which spans a different region in the binding spectrum. Nodes of the regression tree depict the critical position/nucleotide combinations. We analyze the CSI data of the eukaryotic TF Nkx-2.5 and two engineered small molecule DNA ligands and obtain unique insights into their binding properties. The CSI tree for Nkx-2.5 reveals an interaction between two positions of the binding profile and elucidates how different nucleotide combinations at these two positions lead to different binding affinities. The CSI trees for the engineered DNA ligands exhibit a common preference for the dinucleotide AA in the first two positions, which is consistent with preference for a narrow and relatively flat minor groove. We carry out a reanalysis of these data with a mixture of PWMs approach. 
This approach is an advancement over the simple PWM model and accommodates position dependencies based on only sequence data. Our analysis indicates that the dependencies revealed by the CSI-Tree are challenging to discover without the actual binding intensities. Moreover, such a mixture model is highly sensitive to the number and length of the sequences analyzed. In contrast, CSI-Tree provides interpretable and concise summaries of the complete recognition profiles of DNA-binding molecules by utilizing binding affinities
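The core step of a regression tree over position/nucleotide predicates can be sketched as follows. This is not the CSI-Tree implementation: it assumes aligned, fixed-length sequences (CSI-Tree works on unaligned sequences), and the function names and toy data are hypothetical. It searches for the single split "position p holds nucleotide n" that minimizes the summed squared error of binding intensities in the two resulting groups.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(seqs, intensities):
    """Return (cost, position, nucleotide) of the best binary split.

    A split sends sequences with the given nucleotide at the given position
    to one child and all others to the other child. Returns None if no split
    leaves both children non-empty.
    """
    best = None
    for pos in range(len(seqs[0])):
        for nuc in "ACGT":
            inside = [y for s, y in zip(seqs, intensities) if s[pos] == nuc]
            outside = [y for s, y in zip(seqs, intensities) if s[pos] != nuc]
            if not inside or not outside:
                continue
            cost = sse(inside) + sse(outside)
            if best is None or cost < best[0]:
                best = (cost, pos, nuc)
    return best

# Hypothetical data: intensity is high whenever position 0 is "A".
seqs = ["AAC", "ACG", "AGT", "CAT", "GCA", "TGC"]
intensities = [9.0, 10.0, 11.0, 1.0, 2.0, 3.0]
print(best_split(seqs, intensities))
```

Applying `best_split` recursively to each child would grow a full tree, and an interaction between two positions would then show up as a split on one position nested under a split on another.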

    Development and Applications of Shape-Based DNA Motifs

    Transcriptional regulation is imperative for proper development of multicellular organisms, and dysregulation of this process can lead to genetic disease. Due to technical limitations, the full human regulome has not been assayed. Computational methods provide resources to fill the gaps in our understanding of these processes. Sequence-based representations of transcription factor DNA motifs have long been used for this purpose. We developed a model based on estimates of DNA shape known as Structural Motifs, extending the position weight matrix to accommodate multiple continuous shape parameters at each position. Using expectation maximization, Structural Motifs are discovered de novo from transcription factor binding data, and these motifs are specific to their cognate factors. When considered jointly with sequence motifs, Structural Motifs improve classification of transcription factor binding sites. Joint models also provide insight into the readout mechanisms utilized by transcription factors. DNA shape is an important component of the protein-DNA interaction to consider and improves the computational predictions of transcription factor binding, elevating our understanding of the regulatory landscape.

    Modeling and Learning the Binding Sites of Monomeric and Dimeric Transcription Factors (Monomeeristen ja dimeeristen säätelytekijöiden sitoutumispaikkojen mallinnus ja oppiminen)

    In this thesis we aim to learn models that can describe the sites in DNA that a transcription factor (TF) prefers to bind to. We concentrate on probabilistic models that give each DNA sequence, of fixed length, a probability of binding. The probability models used are inhomogeneous 0th and 1st order Markov chains, which are called in our terminology Position-specific Probability Matrix (PPM) and Adjacent Dinucleotide Model (ADM), respectively. We consider both the case where a single TF binds in isolation to DNA, and the case where two TFs bind to proximal locations in DNA, possibly having interactions between the two factors. We use two algorithmic approaches to this learning task. Both approaches utilize data that is assumed to contain an enriched number of binding sites of the TF(s) under investigation. The binding sites in the data then need to be located and used to learn the parameters of the binding model. Both methods also assume that the length of the binding sites is known beforehand. We first introduce a combinatorial approach where we count l-mers that are either binding sites, background noise, or belong partly to both of these categories. The most common l-mer in the data and its Hamming neighbours are declared as binding sites. Then an algorithm to align these binding sites in an unbiased manner is introduced. To avoid false binding sites, the fraction of signal in the data is estimated and used to subtract the counts that arise from the background. The second approach has the following additional benefits. First, the division into signal and background is done in a rigorous manner using a maximum likelihood method, thus avoiding the problems due to the ad hoc nature of the first approach. Second, use of a mixture model allows learning multiple models simultaneously. This mixture model is subsequently extended to include dimeric models as combinations of two binding models. We call this reduction of dimers to monomers modularity. 
This allows investigating the preference of each distance, even the negative distance in the overlapping case, and relative orientation between these two models. The most likely mixture model that explains the data is optimized using an EM algorithm. Since all the submodels belong to the same mixture model, their relative popularity can be directly compared. The mixture model gives an intuitive and unified view of the different binding modes of a single TF or a pair of TFs. Implementations of all introduced algorithms, SeedHam and MODER for learning PPM models and MODER2 for learning ADM models, are freely available from GitHub. In validation experiments ADM models were observed to be slightly but consistently better than PPM models in explaining binding-site data. In addition, learning modular mixture models confirmed many previously detected dimeric structures and gave new biological insights about different binding modes and their compact representations. [Translated from the Finnish abstract:] The functioning and reproduction of every life form is based on information stored in the DNA of its cells. The information in the genes contained in DNA is copied into RNA, which serves as the manufacturing instructions for proteins, the building blocks of cells and their machinery. Every human cell, except for the germ cells, contains the same DNA. Cells of different types nevertheless look completely different and function differently from one another. For example, a liver cell differs in shape and size from a nerve cell. This is explained by the fact that partly different genes are active in different cell types. Proteins are produced only from active genes. One way to influence gene activity is to regulate the copying of gene content into RNA. Certain proteins, the so-called transcription factors, can affect this copying by binding to a regulatory region associated with the gene. 
Thus, to understand gene regulation it is important to be able to explain the binding of transcription factors to DNA, and thereby also to locate the regions of DNA involved in gene regulation. This dissertation aims to learn models that describe the regions of DNA to which transcription factors bind and that estimate the strength of this binding. The work focuses on probabilistic models that assign a binding probability to every DNA sequence of fixed length. The probability models used are inhomogeneous Markov chains of order zero or one, called in this work position-specific probability matrices (PPM) and the adjacent dinucleotide model (ADM), respectively. The work examines both the monomeric case, in which a single transcription factor binds to DNA without other factors, and the dimeric case, in which two transcription factors bind to nearby regions. In the latter case the two factors may interact with each other. This study uses two different algorithmic approaches to learning binding models: a combinatorial method and a probability-based method. Both approaches use data assumed to contain many binding sites of the transcription factor under study. These binding sites must be located and used to learn the parameters of the binding model. Implementations of the algorithms presented in this work (SeedHam, MODER and MODER2) are freely available on GitHub. In validating the methods, the models they produced were found both to confirm earlier biological results and to offer new biological insights into binding models and their compact representations.
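The combinatorial approach in this abstract (take the most common l-mer as a seed, then collect its Hamming neighbours as binding sites) can be sketched roughly as below. This is not the SeedHam implementation: the function names are hypothetical, and the sketch leaves out reverse complements, the unbiased alignment step, and the background subtraction that the abstract describes.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def count_lmers(seqs, l):
    """Count every length-l substring across all sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - l + 1):
            counts[s[i:i + l]] += 1
    return counts

def seed_and_neighbours(seqs, l, radius=1):
    """Pick the most common l-mer as seed and collect l-mers within `radius`."""
    counts = count_lmers(seqs, l)
    seed, _ = counts.most_common(1)[0]
    sites = {k: n for k, n in counts.items() if hamming(k, seed) <= radius}
    return seed, sites

def count_matrix(sites, l):
    """Position-wise nucleotide counts over the collected sites."""
    matrix = [{c: 0 for c in "ACGT"} for _ in range(l)]
    for lmer, n in sites.items():
        for i, c in enumerate(lmer):
            matrix[i][c] += n
    return matrix

# Hypothetical toy reads containing variants of the site "ACGT".
reads = ["TTACGTAA", "GGACGTCC", "CCACGACC", "AACCTTGG"]
seed, sites = seed_and_neighbours(reads, l=4)
print(seed, sites)
print(count_matrix(sites, 4))
```

Normalizing each column of the count matrix would give a PPM estimate. In this sketch the counts still include background occurrences, which is the bias the thesis addresses by estimating and subtracting the background fraction.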

    Modular discovery of monomeric and dimeric transcription factor binding motifs for large data sets

    In some dimeric cases of transcription factor (TF) binding, the specificity of dimeric motifs has been observed to differ notably from what would be expected were the two factors to bind to DNA independently of each other. Current motif discovery methods are unable to learn monomeric and dimeric motifs in modular fashion such that deviations from the expected motif would become explicit and the noise from dimeric occurrences would not corrupt monomeric models. We propose a novel modeling technique and an expectation maximization algorithm, implemented as the software tool MODER, for discovering monomeric TF binding motifs and their dimeric combinations. Given training data and seeds for monomeric motifs, the algorithm learns in the same probabilistic framework a mixture model which represents monomeric motifs as standard position-specific probability matrices (PPMs), and dimeric motifs as pairs of monomeric PPMs, with associated orientation and spacing preferences. For dimers the model represents deviations from the pure modular model of two independent monomers, thus making co-operative binding effects explicit. MODER can analyze in reasonable time tens of Mbps of training data. We validated the tool on HT-SELEX and ChIP-seq data. Our findings include some TFs whose expected model has palindromic symmetry but the observed model is directional. Peer reviewed
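The modular idea in this abstract, an expected dimer built from two monomer PPMs and compared with what is observed, can be sketched as follows. This is an illustrative sketch, not MODER code. It covers only non-overlapping dimers (gap of zero or more), fills the gap with a uniform background as an assumption, and uses hypothetical function names.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement_ppm(ppm):
    """PPM of the same motif read on the opposite strand."""
    return [{COMPLEMENT[c]: p for c, p in col.items()} for col in reversed(ppm)]

def expected_dimer(ppm1, ppm2, gap, flip_second=False):
    """Expected dimer PPM if the two monomers bind independently.

    Columns are the first monomer, `gap` uniform spacer columns (an
    assumption), then the second monomer, optionally reverse-complemented.
    """
    if gap < 0:
        raise ValueError("overlapping dimers are not handled in this sketch")
    second = reverse_complement_ppm(ppm2) if flip_second else ppm2
    spacer = [{c: 0.25 for c in "ACGT"} for _ in range(gap)]
    return [dict(col) for col in ppm1] + spacer + [dict(col) for col in second]

def deviation(observed, expected):
    """Cell-wise difference between observed and expected dimer PPMs."""
    return [
        {c: obs[c] - exp[c] for c in "ACGT"}
        for obs, exp in zip(observed, expected)
    ]

# Hypothetical one-column monomers.
mono_a = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}]
mono_b = [{"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}]

expected = expected_dimer(mono_a, mono_b, gap=1, flip_second=True)
print(expected)
```

Non-zero entries in `deviation(observed, expected)` would point to positions where the dimer's specificity departs from independent binding, which is the kind of co-operative effect the abstract says the model makes explicit.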