20 research outputs found

    NEXT-Peak: A Normal-Exponential Two-Peak Model for Peak-Calling in ChIP-seq Data

    Background: Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) can locate transcription factor binding sites on a genomic scale. Although many models and programs are available to call peaks, none has dominated its competition in comparison studies. Results: We propose a rigorous statistical model, the normal-exponential two-peak (NEXT-peak) model, which parallels the physical processes generating the empirical data, and which can naturally incorporate mappability information. The model therefore estimates total strength of binding (even if some binding locations do not map uniquely into a reference genome, effectively censoring them); it also assigns an error to an estimated binding location. The comparison study with existing programs on real ChIP-seq datasets (STAT1, NRSF, and ZNF143) demonstrates that the NEXT-peak model performs well both in calling peaks and locating them. The model also provides a goodness-of-fit test, to screen out spurious peaks and to infer multiple binding events in a region. Conclusions: The NEXT-peak program calls peaks on any test dataset about as accurately as any other, but provides unusual accuracy in the estimated location of the peaks it calls. NEXT-peak is based on rigorous statistics, so its model also provides a principled foundation for a more elaborate statistical analysis of ChIP-seq data.

    Finding sequence motifs with Bayesian models incorporating positional information: an application to transcription factor binding sites

    Background: Biologically active sequence motifs often have positional preferences with respect to a genomic landmark. For example, many known transcription factor binding sites (TFBSs) occur within an interval [-300, 0] bases upstream of a transcription start site (TSS). Although some programs for identifying sequence motifs exploit positional information, most of them model it only implicitly and with ad hoc methods, making them unsuitable for general motif searches. Results: A-GLAM, a user-friendly computer program for identifying sequence motifs, now incorporates a Bayesian model systematically combining sequence and positional information. A-GLAM's predictions with and without positional information were compared on two human TFBS datasets, each containing sequences corresponding to the interval [-2000, 0] bases upstream of a known TSS. A rigorous statistical analysis showed that positional information significantly improved the prediction of sequence motifs, and an extensive cross-validation study showed that A-GLAM's model was robust against mild misspecification of its parameters. As expected, when sequences in the datasets were successively truncated to the intervals [-1000, 0], [-500, 0] and [-250, 0], positional information aided motif prediction less and less, but never hurt it significantly. Conclusion: Although sequence truncation is a viable strategy when searching for biologically active motifs with a positional preference, a probabilistic model (used reasonably) generally provides a superior and more robust strategy, particularly when the sequence motifs' positional preferences are not well characterized.

    Analysis of Biological Features Associated with Meiotic Recombination Hot and Cold Spots in Saccharomyces cerevisiae

    Meiotic recombination is not distributed uniformly throughout the genome. There are regions of high and low recombination rates called hot and cold spots, respectively. The recombination rate parallels the frequency of DNA double-strand breaks (DSBs) that initiate meiotic recombination. The aim is to identify biological features associated with DSB frequency. We constructed vectors representing various chromatin and sequence-based features for 1179 DSB hot spots and 1028 DSB cold spots. Using a feature selection approach, we have identified five features that distinguish hot from cold spots in Saccharomyces cerevisiae with high accuracy, namely the histone marks H3K4me3, H3K14ac, H3K36me3, and H3K79me3; and GC content. Previous studies have associated H3K4me3, H3K36me3, and GC content with areas of mitotic recombination. H3K14ac and H3K79me3 are novel predictions and thus represent good candidates for further experimental study. We also show nucleosome occupancy maps produced using next generation sequencing exhibit a bias at DSB hot spots and this bias is strong enough to obscure biologically relevant information. A computational approach using feature selection can productively be used to identify promising biological associations. H3K14ac and H3K79me3 are novel predictions of chromatin marks associated with meiotic DSBs. Next generation sequencing can exhibit a bias that is strong enough to lead to incorrect conclusions. Care must be taken when interpreting high throughput sequencing data where systematic biases have been documented

    Bayesian models and Markov chain Monte Carlo methods for protein motifs using secondary characteristics

    Statistical methods have been successfully used to analyze biological sequences. Identifying common local patterns, also called motifs, in multiple protein sequences plays an important role in establishing homology between proteins. Homology is easy to establish when sequences are similar (sharing an identity > 25%). However, for distantly related proteins, currently available methods often fail to align motifs. We develop new probability models that utilize secondary characteristics such as amino acid polarity and predicted secondary structures for profiling protein motifs. Bayesian models and Markov chain Monte Carlo methods are employed to estimate the model parameters and thereby to identify protein motifs in multiple sequences. The extra information brought by the secondary characteristics greatly increases the sensitivity of detecting common local patterns for a group of distantly related proteins.

    Protein Multiple Alignment Incorporating Primary and Secondary Structure Information

    Identifying common local segments, also called motifs, in multiple protein sequences plays an important role in establishing homology between proteins. Homology is easy to establish when sequences are similar (sharing an identity > 25%). However, for distant proteins, it is much more difficult to align motifs that are not similar in sequence but still share common structures or functions. This paper is a first attempt to align multiple protein sequences using both primary and secondary structure information. A new sequence model is proposed that assigns high probabilities not only to motifs that contain conserved amino acids but also to motifs that present common secondary structures. The proposed method is tested on the structural alignment database BAliBASE. We show that information brought by the predicted secondary structures greatly improves motif identification. A website of this program is available a

    Patent data analysis using functional count data model

    Technology is an important cause of social change, so many researchers have studied diverse methods for technology analysis. Patent analysis has been proposed in many studies for this purpose: technological keywords and codes are extracted from patent documents and analyzed using statistics and machine learning. One problem with the existing studies is that their patent analyses did not consider the time factor, although technology evolves over time. In this paper, we therefore propose a new technology analysis method that considers the time factor. We analyze patent data to understand the technological structure of a company, because patents contain most of the information about developed technology. Our method uses functional data analysis as a patent analysis that accounts for time. We select Apple technology for our case study. With Apple's patent data over time, we investigate the technological structure of Apple and its technological evolution through high-dimensional visualization using harmonic components generated by functional data analysis. In addition, by employing the count data regression models of Poisson, negative binomial, and hurdle Poisson, we examine the relationships among the highest-frequency keywords based on the visual outputs of the functional data analysis. The practical implication of this paper is that, by considering the time factor, the method can be applied more effectively than existing approaches to technology analysis. This research contributes to technology forecasting for understanding social change, and it can support a more efficient research and development plan to improve technological competitiveness. The originality of this research lies in considering the time factor in technology analysis based on patent data. We used functional data analysis to model trends of technology keywords over time. Using the results of this technology analysis, a company will be able to understand social change and thereby improve its technological competitiveness in the market.