4,506 research outputs found

    Efficient algorithms for the discovery of gapped factors

    Get PDF
    Background: The discovery of surprisingly frequent patterns is of paramount interest in bioinformatics and computational biology. Among the patterns considered, those consisting of pairs of solid words that co-occur within a prescribed maximum distance-or gapped factors- emerge in a variety of contexts of DNA and protein sequence analysis. A few algorithms and tools have been developed in connection with specific formulations of the problem, however, none can handle comprehensively each of the multiple ways in which the distance between the two terms in a pair may be defined. Results: This paper presents efficient algorithms and tools for the extraction of all pairs of words up to an arbitrarily large length that co-occur surprisingly often in close proximity within a sequence. Whereas the number of such pairs in a sequence of n characters can be Θ(n 4), it is shown that an exhaustive discovery process can be carried out in O(n 2)orO(n 3), depending on the way distance is measured. This is made possible by a prudent combination of properties of pattern maximality and monotonicity of scores, which lead to reduce the number of word pairs to be weighed explicitly, while still producing also the scores attained by any of the pairs not explicitly considered. We applied our approach to the discovery of spaced dyads in DNA sequences. Conclusions: Experiments on biological datasets prove that the method is effective and much faster than exhaustive enumeration of candidate patterns. Software is available freely by academic users via the web interfac

    Palindromic Decompositions with Gaps and Errors

    Full text link
    Identifying palindromes in sequences has been an interesting line of research in combinatorics on words and also in computational biology, after the discovery of the relation of palindromes in the DNA sequence with the HIV virus. Efficient algorithms for the factorization of sequences into palindromes and maximal palindromes have been devised in recent years. We extend these studies by allowing gaps in decompositions and errors in palindromes, and also imposing a lower bound to the length of acceptable palindromes. We first present an algorithm for obtaining a palindromic decomposition of a string of length n with the minimal total gap length in time O(n log n * g) and space O(n g), where g is the number of allowed gaps in the decomposition. We then consider a decomposition of the string in maximal \delta-palindromes (i.e. palindromes with \delta errors under the edit or Hamming distance) and g allowed gaps. We present an algorithm to obtain such a decomposition with the minimal total gap length in time O(n (g + \delta)) and space O(n g).Comment: accepted to CSR 201

    AMD, an Automated Motif Discovery Tool Using Stepwise Refinement of Gapped Consensuses

    Get PDF
    Motif discovery is essential for deciphering regulatory codes from high throughput genomic data, such as those from ChIP-chip/seq experiments. However, there remains a lack of effective and efficient methods for the identification of long and gapped motifs in many relevant tools reported to date. We describe here an automated tool that allows for de novo discovery of transcription factor binding sites, regardless of whether the motifs are long or short, gapped or contiguous

    Finding motifs using DNA images derived from sparse representations

    Get PDF
    MOTIVATION: Motifs play a crucial role in computational biology, as they provide valuable information about the binding specificity of proteins. However, conventional motif discovery methods typically rely on simple combinatoric or probabilistic approaches, which can be biased by heuristics such as substring-masking for multiple motif discovery. In recent years, deep neural networks have become increasingly popular for motif discovery, as they are capable of capturing complex patterns in data. Nonetheless, inferring motifs from neural networks remains a challenging problem, both from a modeling and computational standpoint, despite the success of these networks in supervised learning tasks. RESULTS: We present a principled representation learning approach based on a hierarchical sparse representation for motif discovery. Our method effectively discovers gapped, long, or overlapping motifs that we show to commonly exist in next-generation sequencing datasets, in addition to the short and enriched primary binding sites. Our model is fully interpretable, fast, and capable of capturing motifs in a large number of DNA strings. A key concept emerged from our approach-enumerating at the image level-effectively overcomes the k-mers paradigm, enabling modest computational resources for capturing the long and varied but conserved patterns, in addition to capturing the primary binding sites. AVAILABILITY AND IMPLEMENTATION: Our method is available as a Julia package under the MIT license at https://github.com/kchu25/MOTIFs.jl, and the results on experimental data can be found at https://zenodo.org/record/7783033

    Critical Correlations for Short-Range Valence-Bond Wave Functions on the Square Lattice

    Full text link
    We investigate the arguably simplest SU(2)SU(2)-invariant wave functions capable of accounting for spin-liquid behavior, expressed in terms of nearest-neighbor valence-bond states on the square lattice and characterized by different topological invariants. While such wave-functions are known to exhibit short-range spin correlations, we perform Monte Carlo simulations and show that four-point correlations decay algebraically with an exponent 1.16(4)1.16(4). This is reminiscent of the {\it classical} dimer problem, albeit with a slower decay. Furthermore, these correlators are found to be spatially modulated according to a wave-vector related to the topological invariants. We conclude that a recently proposed spin Hamiltonian that stabilizes the here considered wave-function(s) as its (degenerate) ground-state(s) should exhibit gapped spin and gapless non-magnetic excitations.Comment: 4 pages, 5 figures. Updated versio

    Transcription Factor-DNA Binding Via Machine Learning Ensembles

    Full text link
    We present ensemble methods in a machine learning (ML) framework combining predictions from five known motif/binding site exploration algorithms. For a given TF the ensemble starts with position weight matrices (PWM's) for the motif, collected from the component algorithms. Using dimension reduction, we identify significant PWM-based subspaces for analysis. Within each subspace a machine classifier is built for identifying the TF's gene (promoter) targets (Problem 1). These PWM-based subspaces form an ML-based sequence analysis tool. Problem 2 (finding binding motifs) is solved by agglomerating k-mer (string) feature PWM-based subspaces that stand out in identifying gene targets. We approach Problem 3 (binding sites) with a novel machine learning approach that uses promoter string features and ML importance scores in a classification algorithm locating binding sites across the genome. For target gene identification this method improves performance (measured by the F1 score) by about 10 percentage points over the (a) motif scanning method and (b) the coexpression-based association method. Top motif outperformed 5 component algorithms as well as two other common algorithms (BEST and DEME). For identifying individual binding sites on a benchmark cross species database (Tompa et al., 2005) we match the best performer without much human intervention. It also improved the performance on mammalian TFs. The ensemble can integrate orthogonal information from different weak learners (potentially using entirely different types of features) into a machine learner that can perform consistently better for more TFs. The TF gene target identification component (problem 1 above) is useful in constructing a transcriptional regulatory network from known TF-target associations. The ensemble is easily extendable to include more tools as well as future PWM-based information.Comment: 33 page
    • 

    corecore