13 research outputs found

    A combinatorial optimization approach for diverse motif finding applications

    BACKGROUND: Discovering approximately repeated patterns, or motifs, in biological sequences is an important and widely studied problem in computational molecular biology. Most frequently, motif finding applications arise when identifying shared regulatory signals within DNA sequences or shared functional and structural elements within protein sequences. Due to the diversity of contexts in which motif finding is applied, several variations of the problem are commonly studied. RESULTS: We introduce a versatile combinatorial optimization framework for motif finding that couples graph pruning techniques with a novel integer linear programming formulation. Our approach is flexible and robust enough to model several variants of the motif finding problem, including those incorporating substitution matrices and phylogenetic distances. Additionally, we give an approach for determining the statistical significance of uncovered motifs. In testing on numerous DNA and protein datasets, we demonstrate that our approach typically identifies statistically significant motifs corresponding to either known motifs or other motifs of high conservation. Moreover, in most cases, our approach finds provably optimal solutions to the underlying optimization problem. CONCLUSION: Our results demonstrate that a combined graph-theoretic and mathematical programming approach can be the basis for effective and powerful techniques for diverse motif finding applications.

    A Haystack Heuristic for Autoimmune Disease Biomarker Discovery Using Next-Gen Immune Repertoire Sequencing Data.

    Large-scale DNA sequencing of immunological repertoires offers an opportunity for the discovery of novel biomarkers for autoimmune disease. Available bioinformatics techniques, however, are not adequately suited for elucidating possible biomarker candidates from within large immunosequencing datasets due to unsatisfactory scalability and sensitivity. Here, we present the Haystack Heuristic, an algorithm customized to computationally extract disease-associated motifs from next-generation-sequenced repertoires by contrasting disease and healthy subjects. This technique employs a local-search graph-theory approach to discover novel motifs in patient data. We apply the Haystack Heuristic to nine million B-cell receptor sequences obtained from nearly 100 individuals in order to elucidate a new motif that is significantly associated with multiple sclerosis. Our results demonstrate the effectiveness of the Haystack Heuristic in computing possible biomarker candidates from high-throughput sequencing data, and the approach could be generalized to other datasets.

    Identifying functional relationships within sets of co-expressed genes by combining upstream regulatory motif analysis and gene expression information

    Existing clustering approaches for microarray data do not adequately differentiate between subsets of co-expressed genes. We devised a novel approach that integrates expression and sequence data in order to generate functionally coherent and biologically meaningful subclusters of genes. Specifically, the approach clusters co-expressed genes on the basis of similar content and distributions of predicted statistically significant sequence motifs in their upstream regions.

    Development of Bioinformatic and Experimental Technologies for Identification of Prokaryotic Regulatory Networks

    DiversiTree: Computing Diverse Sets of Near-Optimal Solutions to Mixed-Integer Optimization Problems

    While most methods for solving mixed-integer optimization problems seek a single optimal solution, finding a diverse set of near-optimal solutions can often be more useful. State-of-the-art methods for generating diverse near-optimal solutions usually take a two-phase approach, first finding a set of near-optimal solutions and then finding a diverse subset. In contrast, we present a method of finding a set of diverse solutions by emphasizing diversity within the search for near-optimal solutions. Specifically, within a branch-and-bound framework, we investigate parameterized node selection rules that explicitly consider diversity. Our results indicate that our approach significantly increases diversity of the final solution set. When compared with existing methods for finding diverse near-optimal sets, our method has a run time similar to that of regular node selection methods and gives a diversity improvement of up to 140%. In contrast, popular node selection rules such as best-first search give an improvement of no more than 40%. Further, we find that our method is most effective when diversity is emphasized more in node selection when deeper in the tree and when the solution set has grown large enough. Comment: 30 pages, 11 figures, submitted to INFORMS Journal on Computing.
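
    As an illustration of the two-phase baseline that the abstract contrasts against (not the DiversiTree method itself), the second phase can be sketched as a greedy pick of a diverse subset from an already computed pool of near-optimal solutions. The function names, the use of Hamming distance as the diversity measure, and the farthest-point greedy rule are all assumptions made for this sketch; the abstract does not specify them.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length solutions differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_diverse_subset(solutions, k):
    """Hypothetical phase-two step: pick up to k solutions from a pool.

    Starts from the first solution, then repeatedly adds the candidate
    whose minimum Hamming distance to the already chosen ones is largest.
    """
    if not solutions or k <= 0:
        return []
    chosen = [solutions[0]]
    remaining = list(solutions[1:])
    while remaining and len(chosen) < k:
        best = max(remaining, key=lambda s: min(hamming(s, c) for c in chosen))
        chosen.append(best)
        remaining.remove(best)
    return chosen


def average_pairwise_distance(subset):
    """Simple diversity score: mean Hamming distance over all pairs."""
    pairs = list(combinations(subset, 2))
    if not pairs:
        return 0.0
    return sum(hamming(a, b) for a, b in pairs) / len(pairs)


# Toy pool of binary solutions, assumed to be near-optimal already.
pool = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 1, 1), (1, 1, 0, 0)]
subset = greedy_diverse_subset(pool, 3)
```

    In this toy pool the greedy rule skips the solution that differs from the starting point in only one position, which is the kind of near-duplicate a diversity-aware selection is meant to avoid.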

    Study of modularity in images

    Modularity is an important concept in biology, among other areas. In the study of script complexity, some theories state that the symbols of a writing system are built from smaller, common components that can be thought of as modules. We introduce a representation of black-and-white images as sequences of symbols, which allows us to apply sequence-based motif finding techniques to images. We present a modification of the algorithm in [27] that searches for motifs in a variable number of sequences, instead of a fixed number, while preserving the original paper's guarantees on the statistical significance of the motifs found. Finally, we apply the proposed algorithm to sequences describing images of Latin letters.
