
    Bases of motifs for generating repeated patterns with wild cards

    Motif inference represents one of the most important areas of research in computational biology, and one of its oldest. Despite this, the problem remains very much open in the sense that no existing definition is fully satisfying, either in formal terms or in relation to the biological questions that involve finding such motifs. Two main types of motifs have been considered in the literature: matrices (of letter frequency per position in the motif) and patterns. There is no conclusive evidence in favor of either, and recent work has attempted to integrate the two types into a single model. In this paper, we address the formal issue in relation to motifs as patterns. This is essential to a better understanding of motifs in general. In particular, we consider a promising idea that was recently proposed, which attempted to avoid the combinatorial explosion in the number of motifs by means of a generator set for the motifs. Instead of exhibiting a complete list of motifs satisfying some input constraints, what is produced is a basis of such motifs from which all the other ones can be generated. We study the computational cost of determining such a basis of repeated motifs with wild cards in a sequence. We give new upper and lower bounds on such a cost, introducing a notion of basis that is provably contained in (and thus smaller than) previously defined ones. Our basis can be computed in less time and space, and is still able to generate the same set of motifs. We also prove that the number of motifs in all bases defined so far grows exponentially with the quorum, that is, with the minimal number of times a motif must appear in a sequence, something unnoticed in previous work. We show that there is no hope of efficiently computing such bases unless the quorum is fixed.
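
To make the terms "wild card" and "quorum" in the abstract above concrete, here is a minimal sketch. It is not the basis construction studied in the paper; it only counts the occurrences of one given motif and checks them against a quorum. The function names and the use of `.` as the wild card symbol are assumptions made for illustration.

```python
def occurrences(motif: str, text: str, wildcard: str = ".") -> list[int]:
    """Return every start position at which `motif` matches `text`.

    A wild card position (assumed symbol: ".") matches any single letter;
    every other position must match exactly.
    """
    m = len(motif)
    return [
        i
        for i in range(len(text) - m + 1)
        if all(p == wildcard or p == c for p, c in zip(motif, text[i:i + m]))
    ]


def meets_quorum(motif: str, text: str, quorum: int) -> bool:
    """True if `motif` occurs at least `quorum` times in `text`."""
    return len(occurrences(motif, text)) >= quorum


if __name__ == "__main__":
    sequence = "ABCAXCAC"
    print(occurrences("A.C", sequence))      # start positions of A.C
    print(meets_quorum("A.C", sequence, 2))  # is the quorum of 2 reached?
```

Enumerating all motifs that meet a quorum this way would run into the combinatorial explosion the abstract mentions, which is the motivation for producing a basis instead of a full list.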

    Detection of subtle variations as consensus motifs

    We address the problem of detecting consensus motifs that occur with subtle variations across multiple sequences. These are usually functional domains in DNA sequences such as transcriptional binding factors or other regulatory sites. The problem in its generality has been considered difficult, and various benchmark data serve as the litmus test for different computational methods. We present a method centered around unsupervised combinatorial pattern discovery. The parameters are chosen using a careful statistical analysis of consensus motifs. This method works well on the benchmark data and is general enough to be extended to a scenario where the variation in the consensus motif includes indels (along with mutations). We also present some results on detection of transcription binding factors in human DNA sequences.

    Remote Homology Detection of Protein Sequences

    The classification of protein sequences using string kernels provides valuable insights for protein function prediction. Almost all string kernels are based on patterns that are not independent, and therefore the associated scores are obtained using a set of redundant features. In this talk we will discuss how a class of patterns, called Irredundant, is specifically designed to address this issue. Loosely speaking, the set of Irredundant patterns is the smallest class of independent patterns that can describe all patterns in a string. We present a classification method based on the statistics of these patterns, named Irredundant Class. Results on benchmark data show that Irredundant Class outperforms most of the string kernel methods previously proposed, and it achieves results as good as the current state-of-the-art methods with fewer patterns. Unfortunately, we show that the information carried by the irredundant patterns cannot be easily interpreted, so alternative notions are needed.

    A New Definition and Look at DNA Motif


    Discovery of Flexible Gap Patterns from Sequences

    The human genome contains abundant motifs bound by particular biomolecules. These motifs are involved in the complex regulatory mechanisms of gene expression. The dominant mechanism behind the intriguing gene expression patterns is known as combinatorial regulation, achieved by multiple cooperating biomolecules binding in a nearby genomic region to provide a specific regulatory behavior. To decipher the complicated combinatorial regulation mechanism at work in the cellular processes, there is a pressing need to identify co-binding motifs for these cooperating biomolecules in genomic sequences. The great flexibility of the interaction distance between nearby cooperating biomolecules leads to the presence of flexible gaps in between component motifs of a co-binding motif. Many existing motif discovery methods cannot handle co-binding motifs with flexible gaps. Existing co-binding motif discovery methods are ineffective in dealing with the following problems: (1) co-binding motifs may not appear in a large fraction of the input sequences, (2) the lengths of component motifs are unknown and (3) the maximum range of the flexible gap can be large. As a result, the probabilistic approach is easily trapped in a local optimum. Although the deterministic approach may resolve these problems by allowing a relaxed motif template, it encounters the challenges of exploring an enormous pattern space and handling a huge output. This thesis presents an effective and scalable method called DFGP, which stands for “Discovery of Flexible Gap Patterns”, for identifying co-binding motifs in massive datasets. DFGP follows the deterministic approach and uses flexible gap patterns to model co-binding motifs. A flexible gap pattern is composed of a number of boxes with a flexible gap in between consecutive boxes, where each box is a consensus pattern representing a component motif. 
To address the computational challenge and the need to effectively process the large output under a relaxed motif template, DFGP incorporates two redundancy reduction methods as well as an effective statistical significance measure for ranking patterns. The first reduction method is achieved by the proposed concept of representative patterns, which aims at reducing the large set of consensus patterns used as boxes in existing deterministic methods into a much smaller yet informative set. The second method is attained by the proposed concept of delegate occurrences, which aims at reducing the redundancy among occurrences of a flexible gap pattern. Extensive experimental results showed that (1) DFGP outperforms existing co-binding discovery methods significantly in terms of both the capability of identifying co-binding motifs and the runtime, (2) co-binding motifs found by DFGP in datasets reveal biological insights previously unknown, (3) the two redundancy reduction methods via the proposed concepts of representative patterns and delegate occurrences are indeed effective in significantly reducing the computational burden without sacrificing output quality, (4) the proposed statistical significance measures are robust and useful in ranking patterns and (5) DFGP allows a large maximum distance for the flexible gap between component motifs and is scalable to massive datasets.
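
The flexible gap pattern described above (boxes separated by gaps of variable length) can be illustrated with a small sketch. This is not DFGP: boxes are simplified to exact strings rather than consensus patterns, there is no redundancy reduction or significance ranking, and the function name and its parameters are hypothetical.

```python
import re


def find_flexible_gap_pattern(boxes, gaps, text):
    """Find occurrences of box1 <gap> box2 <gap> ... in `text`.

    boxes: list of literal strings (a simplification; the thesis uses
           consensus patterns as boxes).
    gaps:  list of (min_len, max_len) pairs, one per pair of consecutive boxes.
    Returns (start, matched_substring) for each start position, using the
    shortest admissible gaps at that position.
    """
    if len(gaps) != len(boxes) - 1:
        raise ValueError("need exactly one gap range between consecutive boxes")
    parts = [re.escape(boxes[0])]
    for (min_len, max_len), box in zip(gaps, boxes[1:]):
        parts.append(f".{{{min_len},{max_len}}}?")  # lazy: prefer short gaps
        parts.append(re.escape(box))
    # A lookahead lets occurrences that overlap each other all be reported.
    regex = re.compile("(?=(" + "".join(parts) + "))")
    return [(m.start(1), m.group(1)) for m in regex.finditer(text)]


if __name__ == "__main__":
    hits = find_flexible_gap_pattern(["TATA", "GC"], [(1, 3)], "TATAAGCTATACCGC")
    for start, matched in hits:
        print(start, matched)
```

Even in this toy form, widening the gap range or relaxing the boxes quickly multiplies the number of reported occurrences, which hints at why the abstract stresses redundancy reduction.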

    Analysis of DNA-binding Proteins in Yeast Saccharomyces Cerevisiae

    Gene expression is an elaborate and finely tuned process involving the regulated interactions of multiple proteins with promoter and enhancer elements. A variety of approaches are currently used to study these interactions in vivo, in vitro, and in silico. With the genome sequences of many organisms now readily available, a plethora of DNA functional elements have been predicted, but the process of identifying the proteins that bind to them in vivo remains a bottleneck. I developed two high-throughput assays to address this issue. The first is a modification of the yeast one-hybrid assay. The second is probing protein microarrays with DNA sequence elements. Using these methods, I identified two proteins, Sef1 and Yjl103c, that bind to the same DNA sequence element. Sef1 and Yjl103c are little-characterized members of the zinc cluster family of transcription factors of S. cerevisiae. Characterization of their mechanism of action, as well as identification of some of their target genes, leads to the conclusion that they play a pivotal role in the transcriptional regulation of the utilization of nonfermentable carbon sources by budding yeast.

    Algorithms for the analysis of molecular sequences


    The diversity and distribution of multihost viruses in bumblebees

    The bumblebees (genus Bombus) are an ecologically and economically important group in decline. Their decline is driven by many factors, but parasites are believed to play a role. This thesis examines the factors that influence the diversity and distribution of multihost viruses in bumblebees using molecular and modelling techniques. In Chapter 2, I performed viral discovery to isolate new multihost viruses in bumblebees. I investigated factors that explain prevalence differences between different host species using co-phylogenetic models. I found that related hosts are infected with similar viral assemblages, related viruses infect similar host assemblages and related hosts are on average infected with related viruses. Chapter 3 investigated the ecology of four of the novel viruses in greater detail. I applied a multivariate probit regression to investigate the abiotic factors that may drive infection. I found that precipitation may have a positive or negative effect depending on the virus. I also observed a strong non-random association between two of the viruses. The novel viruses have considerably more diversity than the previously known viruses. Chapter 4 investigated the effect of pesticides on viral and non-viral infection. I exposed Bombus terrestris colonies to field-realistic doses of the neonicotinoid pesticide clothianidin in the laboratory, to mimic the pulsed exposure of crop blooms. I found some evidence for a positive effect of uncertain size on the infection rate of pesticide-exposed colonies relative to non-exposed colonies, a potentially important result. Chapter 5 explored the evolution of avirulent multihost digital organisms across fluctuating fitness landscapes within a discrete sequence space. Consistent with theory, I found that evolution across a fluctuating discrete landscape leads to a faster rate of adaptation, greater diversity and greater specialism or generalism, depending on the correlation between the landscapes. 
A large range of factors are found to be important in the distribution of infection and diversity of viruses, and we find evidence for abiotic, biotic and anthropogenic factors all playing a role.