17 research outputs found

    rMotifGen: random motif generator for DNA and protein sequences

    BACKGROUND: Detection of short, subtle conserved motif regions within a set of related DNA or amino acid sequences can lead to discoveries about important regulatory domains such as transcription factor and DNA binding sites as well as conserved protein domains. In order to help assess motif detection algorithms on motifs with varying properties and levels of conservation, we have developed a computational tool, rMotifGen, with the sole purpose of generating a number of random DNA or protein sequences containing short sequence motifs. Each motif consensus can be user-defined, randomly generated, or created from a position-specific scoring matrix (PSSM). Insertions and mutations within these motifs are created according to user-defined parameters and substitution matrices. The resulting sequences can be helpful in mutational simulations and in testing the limits of motif detection algorithms. RESULTS: Two implementations of rMotifGen have been created, one providing a graphical user interface (GUI) for random motif construction, and the other serving as a command line interface. The second implementation has the added advantages of platform independence and being able to be called in a batch mode. rMotifGen was used to construct sample sets of sequences containing DNA motifs and amino acid motifs that were then tested against the Gibbs sampler and MEME packages. CONCLUSION: rMotifGen provides an efficient and convenient method for creating random DNA or amino acid sequences with a variable number of motifs, where the instance of each motif can be incorporated using a position-specific scoring matrix (PSSM) or by creating an instance mutated from its corresponding consensus using an evolutionary model based on substitution matrices. rMotifGen is freely available at: http://bioinformatics.louisville.edu/brg/rMotifGen/.
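
    The idea of planting mutated motif instances in random background sequences can be sketched in a few lines of Python. This is not rMotifGen's code or interface: the function names, parameters, and the uniform substitution model are assumptions made for illustration, and insertions, PSSM sampling, and substitution matrices are left out.

```python
import random

DNA = "ACGT"

def mutate(consensus, rate, rng):
    """Substitute each position with probability `rate` (uniform model, an assumption)."""
    out = []
    for base in consensus:
        if rng.random() < rate:
            # Pick one of the three other bases with equal probability.
            out.append(rng.choice([b for b in DNA if b != base]))
        else:
            out.append(base)
    return "".join(out)

def generate(consensus, n_seqs, seq_len, rate, seed=0):
    """Return (sequence, start, instance) tuples, one planted motif per sequence."""
    rng = random.Random(seed)
    records = []
    for _ in range(n_seqs):
        background = [rng.choice(DNA) for _ in range(seq_len)]
        instance = mutate(consensus, rate, rng)
        start = rng.randrange(seq_len - len(consensus) + 1)
        # Overwrite the background with the motif instance at a random offset.
        background[start:start + len(instance)] = instance
        records.append(("".join(background), start, instance))
    return records

# Hypothetical usage: five 40-bp sequences, each with a mutated copy of TATAAT.
for seq, start, instance in generate("TATAAT", n_seqs=5, seq_len=40, rate=0.2, seed=1):
    print(start, instance, seq)
```

    Recording the planted offset and instance alongside each sequence gives a ground truth against which a motif finder's predictions can be scored.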

    Three allele combinations associated with Multiple Sclerosis

    BACKGROUND: Multiple sclerosis (MS) is an immune-mediated disease of polygenic etiology. Dissection of its genetic background is a complex problem, because of the combinatorial possibilities of gene-gene interactions. As genotyping methods improve throughput, approaches that can explore multigene interactions appropriately should lead to improved understanding of MS. METHODS: 286 unrelated patients with definite MS and 362 unrelated healthy controls of Russian descent were genotyped at polymorphic loci (including SNPs, repeat polymorphisms, and an insertion/deletion) of the DRB1, TNF, LT, TGFβ1, CCR5 and CTLA4 genes and TNFa and TNFb microsatellites. Each allele carriership in patients and controls was compared by Fisher's exact test, and disease-associated combinations of alleles in the data set were sought using a Bayesian Markov chain Monte Carlo-based method recently developed by our group. RESULTS: We identified two previously unknown MS-associated tri-allelic combinations: -509TGFβ1*C, DRB1*18(3), CTLA4*G and -238TNF*B1,-308TNF*A2, CTLA4*G, which perfectly separate MS cases from controls, at least in the present sample. The previously described DRB1*15(2) allele, the microsatellite TNFa9 allele and the biallelic combination CCR5Δ32, DRB1*04 were also reidentified as MS-associated. CONCLUSION: These results represent an independent validation of MS association with DRB1*15(2) and TNFa9 in Russians and are the first to find the interplay of three loci in conferring susceptibility to MS. They demonstrate the efficacy of our approach for the identification of complex-disease-associated combinations of alleles
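
    The per-allele comparison of carriership between patients and controls can be illustrated with a small, self-contained implementation of the two-sided Fisher's exact test. This is a sketch under stated assumptions, not the authors' software: the function name is hypothetical, and the carrier counts in the usage lines are invented for illustration. Only the group sizes (286 patients, 362 controls) come from the abstract.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)  # number of tables with these margins

    def weight(x):
        # Unnormalised hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more probable than the observed one.
    extreme = sum(weight(x) for x in range(low, high + 1) if weight(x) <= observed)
    return extreme / total

# Hypothetical counts (NOT from the study): carriers / non-carriers of one allele.
patients = (120, 286 - 120)
controls = (90, 362 - 90)
print(fisher_exact_two_sided(*patients, *controls))
```

    Working with exact integer weights avoids floating-point ties when deciding which tables count as at least as extreme as the observed one.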

    Assessing computational tools for the discovery of transcription factor binding sites.

    The prediction of regulatory elements is a problem where computational methods offer great hope. Over the past few years, numerous tools have become available for this task. The purpose of the current assessment is twofold: to provide some guidance to users regarding the accuracy of currently available tools in various settings, and to provide a benchmark of data sets for assessing future tools.

    A fast weak motif-finding algorithm based on community detection in graphs

    BACKGROUND: Identification of transcription factor binding sites (also called ‘motif discovery’) in DNA sequences is a basic step in understanding genetic regulation. Although many successful programs have been developed, the problem is far from being solved on account of diversity in gene expression/regulation and the low specificity of binding sites. State-of-the-art algorithms have their own constraints (e.g., high time or space complexity for finding long motifs, low precision in identification of weak motifs, or the OOPS constraint: one occurrence of the motif instance per sequence) which limit their scope of application. RESULTS: In this paper, we present a novel and fast algorithm we call TFBSGroup. It is based on community detection from a graph and is used to discover long and weak (l,d) motifs under the ZOMOPS constraint (zero, one or multiple occurrence(s) of the motif instance(s) per sequence), where l is the length of a motif and d is the maximum number of mutations between a motif instance and the motif itself. Firstly, TFBSGroup transforms the (l, d) motif search in sequences to focus on the discovery of dense subgraphs within a graph. It identifies these subgraphs using a fast community detection method for obtaining coarse-grained candidate motifs. Next, it greedily refines these candidate motifs towards the true motif within their own communities. Empirical studies on synthetic (l, d) samples have shown that TFBSGroup is very efficient (e.g., it can find true (18, 6), (24, 8) motifs within 30 seconds). More importantly, the algorithm has succeeded in rapidly identifying motifs in a large data set of prokaryotic promoters generated from the Escherichia coli database RegulonDB. The algorithm has also accurately identified motifs in ChIP-seq data sets for 12 mouse transcription factors involved in ES cell pluripotency and self-renewal. 
    CONCLUSIONS: Our novel heuristic algorithm, TFBSGroup, is able to quickly identify nearly exact matches for long and weak (l, d) motifs in DNA sequences under the ZOMOPS constraint. It is also capable of finding motifs in real applications. The source code for TFBSGroup can be obtained from http://bioinformatics.bioengr.uic.edu/TFBSGroup/
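
    The (l, d) definition used above can be made concrete with a brute-force check: an l-mer counts as an instance of a motif if it differs from the motif in at most d positions. The sketch below is not TFBSGroup and does no community detection. It only scans sequences for instances of an already known motif, and the function names are hypothetical. Because it reports every hit, a sequence may contribute zero, one, or several instances, as the ZOMOPS setting allows.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))

def find_instances(sequences, motif, d):
    """Return (sequence index, offset, l-mer) for each l-mer within d mismatches of motif."""
    l = len(motif)
    hits = []
    for i, seq in enumerate(sequences):
        for j in range(len(seq) - l + 1):
            lmer = seq[j:j + l]
            if hamming(lmer, motif) <= d:
                hits.append((i, j, lmer))
    return hits

# Toy input: no instance, one instance, and two instances of a (4, 1) motif.
seqs = ["GGGGGGGG", "TTACGTTT", "ACGAAACCT"]
print(find_instances(seqs, "ACGT", d=1))
# -> [(1, 2, 'ACGT'), (2, 0, 'ACGA'), (2, 5, 'ACCT')]
```

    A scan like this is useful mainly for validating a motif finder's output on synthetic (l, d) samples, where the true motif is known in advance.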