84 research outputs found
Increasing stability with complexity in a system composed of unstable subsystems
Abstract: We examine stability of Hoffman's symmetric model of the immune system ẋ_i = S_i − x_i ∑_{j=1}^{n} K_{ji} x_j; x_i > 0; i = 1, 2, …, n; (1) where S_i > 0 and K_ij = K_ji ≥ 0. This paper gives necessary and sufficient conditions on {S_i} and {K_ij} for Eq. (1) to have a unique, stable, steady-state solution. Determining existence of a steady-state solution requires a theorem delimiting the range R of a function F: D ⊆ ℝ^n → R ⊆ ℝ^n, where D is a (possibly proper) subset of ℝ^n. This theorem may be new. If the off-diagonal elements {K_ij : i ≠ j} are non-zero with probability C, and 0 < S_min ≤ S_i ≤ ϱ S_min with ϱ a fixed integer, we let P(n, C) be the probability that Eq. (1) does not have a stable, steady-state solution. Let T(n) = (ϱ + 1)^2 ϱ (ln n)/n. (2) As n → ∞, C/T(n) → r > 1 implies P(n, C) → 0. If we set {K_ii = 0; i = 1, 2, …, n}, this result shows that accumulating more unstable subsystems increases the probability of stability of the system.
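The dynamics of Eq. (1) can be checked numerically. Below is a minimal sketch (entirely our own: the values of S, K, the step size, and the step count are arbitrary assumptions, not taken from the paper) that integrates the model with forward Euler and verifies that the trajectory settles at a positive steady state.

```python
import numpy as np

def simulate(S, K, x0, dt=1e-3, steps=200_000):
    """Forward-Euler integration of x_i' = S_i - x_i * sum_j K_ji x_j."""
    x = x0.copy()
    for _ in range(steps):
        x += dt * (S - x * (K @ x))   # K is symmetric, so K_ji = K_ij
    return x

# Assumed example system: n = 3, S_i > 0, K symmetric with K_ii = 0.
n = 3
S = np.array([1.0, 2.0, 1.5])
K = np.array([[0.0, 1.0, 0.5],
              [1.0, 0.0, 2.0],
              [0.5, 2.0, 0.0]])

x = simulate(S, K, np.ones(n))
residual = S - x * (K @ x)            # ~0 at a steady state
print(np.max(np.abs(residual)))
```

With this particular K and S the residual shrinks to numerical noise, i.e. the coupled system reaches a stable steady state even though each K_ii = 0.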
Finite-size corrections to Poisson approximations in general renewal-success processes
Abstract: Consider a renewal process, and let K ≥ 0 denote the random duration of a typical renewal cycle. Assume that on any renewal cycle, a rare event called "success" can occur. Such successes lend themselves naturally to approximation by Poisson point processes. If each success occurs after a random delay, however, Poisson convergence can be relatively slow, because each success corresponds to a time interval, not a point. If K is an arithmetic variable, a "finite-size correction" (FSC) is known to speed Poisson convergence by providing a second, subdominant term in the appropriate asymptotic expansion. This paper generalizes the FSC from arithmetic K to general K. Genomics applications require this generalization, because they have already heuristically applied the FSC to p-values involving absolutely continuous distributions. The FSC also sharpens certain results in queuing theory, insurance risk, traffic flow, and reliability theory.
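The leading-order Poisson approximation in this setup can be illustrated with a toy Monte Carlo (entirely our construction: exponential cycle lengths, the success probability p, and the horizon t are all assumptions). The number of successes observed by time t is approximately Poisson with mean p · t / E[K]:

```python
import random

random.seed(0)
p, t = 0.01, 1000.0   # assumed per-cycle success probability and time horizon

def successes_by(t, p):
    """Count rare 'successes' among renewal cycles completed before time t."""
    clock, count = 0.0, 0
    while clock < t:
        clock += random.expovariate(1.0)   # cycle length K ~ Exp(1), so E[K] = 1
        if random.random() < p:
            count += 1
    return count

trials = [successes_by(t, p) for _ in range(1000)]
mean = sum(trials) / len(trials)
print(mean)   # close to the Poisson rate p * t / E[K] = 10
```

The finite-size correction studied in the paper refines exactly this kind of leading-order rate with a subdominant term; the sketch only shows the baseline approximation being corrected.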
NEXT-Peak: A Normal-Exponential Two-Peak Model for Peak-Calling in ChIP-seq Data
Background: Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) can locate transcription factor binding sites on a genomic scale. Although many models and programs are available to call peaks, none has dominated its competition in comparison studies. Results: We propose a rigorous statistical model, the normal-exponential two-peak (NEXT-peak) model, which parallels the physical processes generating the empirical data and which can naturally incorporate mappability information. The model therefore estimates the total strength of binding (even if some binding locations do not map uniquely onto a reference genome, effectively censoring them); it also assigns an error to an estimated binding location. A comparison study with existing programs on real ChIP-seq datasets (STAT1, NRSF, and ZNF143) demonstrates that the NEXT-peak model performs well both in calling peaks and in locating them. The model also provides a goodness-of-fit test, to screen out spurious peaks and to infer multiple binding events in a region. Conclusions: The NEXT-peak program calls peaks on any test dataset about as accurately as any other, but provides unusual accuracy in the estimated location of the peaks it calls. NEXT-peak is based on rigorous statistics, so its model also provides a principled foundation for a more elaborate statistical analysis of ChIP-seq data.
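The "normal-exponential" ingredient can be pictured with a toy strand-symmetric simulation (our own parameters and estimator, not NEXT-peak's actual model or fitting procedure): read starts on each strand are the binding site plus an exponential fragment-length offset plus normal noise, and the site is recovered from the two opposing strand peaks.

```python
import random
import statistics

random.seed(1)
site, sigma, lam = 1000.0, 20.0, 1 / 80.0   # assumed site, noise SD, fragment rate

# Forward-strand reads start upstream of the site, reverse-strand reads
# downstream, each shifted by an exponential fragment offset plus Gaussian noise.
fwd = [site - random.expovariate(lam) + random.gauss(0, sigma) for _ in range(5000)]
rev = [site + random.expovariate(lam) + random.gauss(0, sigma) for _ in range(5000)]

# The exponential offsets are mirror images, so they cancel in the midpoint
# of the two strand means, which therefore recovers the binding site.
est = (statistics.fmean(fwd) + statistics.fmean(rev)) / 2
print(round(est))   # close to 1000
```

A likelihood-based fit of the two exponentially modified normal peaks (as the abstract describes) would additionally yield the standard error on the estimated location; the midpoint estimator here is only the crudest version of that idea.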
Finding sequence motifs with Bayesian models incorporating positional information: an application to transcription factor binding sites
Background: Biologically active sequence motifs often have positional preferences with respect to a genomic landmark. For example, many known transcription factor binding sites (TFBSs) occur within an interval [-300, 0] bases upstream of a transcription start site (TSS). Although some programs for identifying sequence motifs exploit positional information, most of them model it only implicitly and with ad hoc methods, making them unsuitable for general motif searches. Results: A-GLAM, a user-friendly computer program for identifying sequence motifs, now incorporates a Bayesian model systematically combining sequence and positional information. A-GLAM's predictions with and without positional information were compared on two human TFBS datasets, each containing sequences corresponding to the interval [-2000, 0] bases upstream of a known TSS. A rigorous statistical analysis showed that positional information significantly improved the prediction of sequence motifs, and an extensive cross-validation study showed that A-GLAM's model was robust against mild misspecification of its parameters. As expected, when sequences in the datasets were successively truncated to the intervals [-1000, 0], [-500, 0] and [-250, 0], positional information aided motif prediction less and less, but never hurt it significantly. Conclusion: Although sequence truncation is a viable strategy when searching for biologically active motifs with a positional preference, a probabilistic model (used reasonably) generally provides a superior and more robust strategy, particularly when the sequence motifs' positional preferences are not well characterized.
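Combining sequence and positional information in a Bayesian way amounts to adding a positional log-prior to a motif's sequence log-odds score. A minimal sketch follows; the Gaussian prior centred upstream of the TSS is our assumption for illustration, not A-GLAM's actual model.

```python
import math

def positional_log_prior(pos, mu=-150.0, sd=100.0):
    """Log-density of an assumed Gaussian prior on motif position.

    pos is relative to the TSS (negative = upstream)."""
    return -0.5 * ((pos - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))

def combined_score(seq_log_odds, pos):
    """Posterior-style score: sequence log-odds plus positional log-prior."""
    return seq_log_odds + positional_log_prior(pos)

# Two candidate sites with identical sequence scores: the one inside the
# preferred window [-300, 0] outranks the one far upstream.
print(combined_score(8.0, -150) > combined_score(8.0, -1200))  # True
```

This also shows why truncating sequences to [-300, 0] is only a blunt version of the same idea: truncation is a uniform prior with hard edges, whereas a smooth prior merely down-weights, rather than forbids, distant candidates.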
Scanning sequences after Gibbs sampling to find multiple occurrences of functional elements
BACKGROUND: Many DNA regulatory elements occur as multiple instances within a target promoter. Gibbs sampling programs for finding DNA regulatory elements de novo can be prohibitively slow in locating all instances of such an element in a sequence set. RESULTS: We describe an improvement to the A-GLAM computer program, which predicts regulatory elements within DNA sequences with Gibbs sampling. The improvement adds an optional "scanning step" after Gibbs sampling. Gibbs sampling produces a position-specific scoring matrix (PSSM). The new scanning step resembles an iterative PSI-BLAST search based on the PSSM. First, it assigns an "individual score" to each subsequence of appropriate length within the input sequences using the initial PSSM. Second, it computes an E-value from each individual score, to assess the agreement between the corresponding subsequence and the PSSM. Third, it permits subsequences with E-values falling below a threshold to contribute to the underlying PSSM, which is then updated using the Bayesian calculus. A-GLAM iterates its scanning step to convergence, at which point no new subsequences contribute to the PSSM. After convergence, A-GLAM reports predicted regulatory elements within each sequence in order of increasing E-values, so users have a statistical evaluation of the predicted elements in a convenient presentation. Thus, although the Gibbs sampling step in A-GLAM finds at most one regulatory element per input sequence, the scanning step can now rapidly locate further instances of the element in each sequence. CONCLUSION: Datasets from experiments determining the binding sites of transcription factors were used to evaluate the improvement to A-GLAM. Typically, the datasets included several sequences containing multiple instances of a regulatory motif. The improvements to A-GLAM permitted it to predict these multiple instances.
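The first part of the scanning step above — scoring every subsequence of motif length against a PSSM and ranking the hits — can be sketched as follows. The three-column PSSM and the sequence are toy inputs of ours, and the E-value computation and Bayesian PSSM update that follow in A-GLAM are not shown.

```python
import math

BG = {b: 0.25 for b in "ACGT"}   # uniform background (an assumption)

# Toy 3-column PSSM for the motif "TAT": log-odds of each base vs background,
# with match probability 0.85 and 0.05 for each of the other three bases.
PSSM = [{b: math.log((0.85 if b == m else 0.05) / BG[b]) for b in "ACGT"}
        for m in "TAT"]

def scan(seq, pssm):
    """Score every subsequence of motif length; return (score, position) hits,
    best first, mimicking the 'individual score' pass of a scanning step."""
    w = len(pssm)
    hits = [(sum(pssm[k][seq[i + k]] for k in range(w)), i)
            for i in range(len(seq) - w + 1)]
    return sorted(hits, reverse=True)

hits = scan("GGTATCCTATG", PSSM)
# Both occurrences of TAT (positions 2 and 7) top the list with equal scores,
# which is exactly how the scanning step finds multiple instances per sequence.
print(hits[0], hits[1])
```

A Gibbs sampler alone would have reported at most one of these two occurrences; ranking all windows by PSSM score is what exposes the second.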
False Discovery Rate Controlling Procedures with BLOSUM62 substitution matrix and their application to HIV Data
Identifying significant sites in sequence data and analogous data is of fundamental importance in many biological fields. Fisher's exact test is a popular technique; however, it is ill-suited to sparse count data because it makes overly conservative decisions. Since count data in HIV datasets are typically very sparse, it is crucial to incorporate additional information into statistical models to improve testing power. To develop new approaches that incorporate biological information into the false discovery rate controlling procedure, we propose two models: one based on an empirical Bayes model under independence of amino acids, and the other using pairwise associations of amino acids based on a Markov random field built on the BLOSUM62 substitution matrix. We apply the proposed methods to HIV data and identify significant sites by incorporating the BLOSUM62 matrix, while the traditional method based on Fisher's test does not discover any site. These newly developed methods have the potential to handle many biological problems in studies of vaccine and drug trials and phenotype studies.
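For context, the standard false-discovery-rate controlling baseline that such models extend is the Benjamini-Hochberg step-up procedure. A minimal sketch, with made-up p-values:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up: reject the hypotheses whose sorted
    p-values fall at or below the largest rank passing p <= q * rank / m."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    n_reject = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= q * rank / m:
            n_reject = rank          # largest rank satisfying the threshold
    return sorted(order[:n_reject])  # indices of rejected hypotheses

# Illustrative p-values (not from the HIV data): only the two smallest
# survive the step-up thresholds at q = 0.05.
print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.2, 0.8]))
# -> [0, 1]
```

The models in the abstract change how the per-site evidence (and hence the p-value analogue) is computed, not this control step itself.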
Threshold Average Precision (TAP-k): a measure of retrieval designed for bioinformatics
Motivation: Since database retrieval is a fundamental operation, the measurement of retrieval efficacy is critical to progress in bioinformatics. This article points out some issues with current methods of measuring retrieval efficacy and suggests some improvements. In particular, many studies have used the pooled receiver operating characteristic for n irrelevant records (ROCn) score, the area under the ROC curve (AUC) of a "pooled" ROC curve, truncated at n irrelevant records. Unfortunately, the pooled ROCn score does not faithfully reflect actual usage of retrieval algorithms. Additionally, a pooled ROCn score can be very sensitive to retrieval results from as few as a single query.
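For concreteness, the un-pooled ROCn score for a single ranked retrieval list can be sketched as below (the relevance labels are toy data; the pooling across queries that the article critiques is not shown). For each of the first n irrelevant records, count the relevant records ranked above it, then normalise by n times the total number of relevant records.

```python
def roc_n(ranked_relevance, n, total_relevant):
    """ROCn score of one ranked list: 1 = all relevant records precede
    the first n irrelevant ones, 0 = none do."""
    area, rel_seen, irrel_seen = 0, 0, 0
    for is_rel in ranked_relevance:
        if is_rel:
            rel_seen += 1
        else:
            irrel_seen += 1
            area += rel_seen           # relevant records above this irrelevant one
            if irrel_seen == n:
                break
    return area / (n * total_relevant)

# Perfect ranking: all 3 relevant records precede every irrelevant one.
print(roc_n([1, 1, 1, 0, 0], n=2, total_relevant=3))   # 1.0
# One irrelevant record outranks everything: the score drops to 0.5.
print(roc_n([0, 1, 1, 1, 0], n=2, total_relevant=3))   # 0.5
```

Truncation at n irrelevant records models a user who stops reading after n false hits; the article's point is that averaging such curves across queries ("pooling") can distort this per-query picture.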
- …