    A graph-based motif detection algorithm models complex nucleotide dependencies in transcription factor binding sites

    Given a set of known binding sites for a specific transcription factor, it is possible to build a model of the transcription factor binding site, usually called a motif model, and use this model to search for other sites that bind the same transcription factor. Typically, this search is performed using a position-specific scoring matrix (PSSM), also known as a position weight matrix. In this paper we analyze a set of eukaryotic transcription factor binding sites and show that there is extensive clustering of similar k-mers in eukaryotic motifs, owing to both functional and evolutionary constraints. The apparent limitations of probabilistic models in representing complex nucleotide dependencies lead us to a graph-based representation of motifs. When deciding whether a candidate k-mer is part of a motif, we base our decision not on how well the k-mer conforms to a model of the motif as a whole, but on how similar it is to specific, known k-mers in the motif. We explain why we expect graph-based methods to perform well on motif data. Our MotifScan algorithm shows greatly improved performance over the prevalent PSSM-based method for the detection of eukaryotic motifs.
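The contrast drawn in this abstract, between scoring a k-mer against a whole-motif model and comparing it to specific known k-mers, can be illustrated with a minimal sketch. This is not the MotifScan algorithm, whose details are not given here. The toy sites, the pseudocount, the uniform background, and the function names `build_pssm`, `pssm_score`, and `nearest_site_distance` are all assumptions made for illustration.

```python
import math
from collections import Counter

def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM (one dict per position) from equal-length DNA sites."""
    total = len(sites) + 4 * pseudocount
    pssm = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        pssm.append({
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pssm

def pssm_score(pssm, kmer):
    """Sum per-position log-odds; every position is treated as independent."""
    return sum(column[base] for column, base in zip(pssm, kmer))

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def nearest_site_distance(sites, kmer):
    """Distance from a candidate k-mer to the closest known site."""
    return min(hamming(site, kmer) for site in sites)

# Hypothetical toy motif with two clusters of similar k-mers.
sites = ["AAAA", "AAAT", "CCCC", "CCCG"]
pssm = build_pssm(sites)

for kmer in ("AAAA", "ACAC"):
    print(kmer, round(pssm_score(pssm, kmer), 3), nearest_site_distance(sites, kmer))
```

In this toy case the chimeric k-mer `ACAC` receives the same PSSM score as the known site `AAAA`, because the matrix cannot represent the dependency between positions, while the nearest-known-site distance separates the two.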

    Genomics and Computational Molecular Biology

    this article permits us to mention only the most recent and major advances in techniques for gene identification. However, there are a number of other reviews and compendiums that cover this area in more depth [25, **26, 27, **28, 29, 30, 31, 32, 33]. In addition, Table 1 includes a list of Web pointers to most of the bioinformatics methods that are presented in this review.

    Computational methods for gene identification

    The first step in gene identification is the location of coding regions or open reading frames (ORFs). This task is simplified in bacteria due to the absence of splicing. Sequencing errors and translational frameshifting [34] can lead to partial protein sequences or interrupted open reading frames, but these are often resolved during the early steps of gene identification by sequence similarity with proteins from other organisms [35, 36, 37, 38, 39, 40]. In the absence of homologous sequences in other organisms, and especially with short bacterial genes, probabilistic gene models (hidden Markov models) can often identify biologically significant coding regions [41, 42].

    Pairwise sequence homology

    Given a database of potential open reading frames, a large number of methods can be used to define the biological function of the putative proteins. The most commonly applied methods search for sequence similarity of the translated open reading frames with a database of known protein sequences [43, 44, **45, 46]. The search for gene function is usually carried out at the protein level to eliminate the redundancy of the genetic code. In addition, the use of amino acid substitution matrices that describe the acceptable replacements permits the discovery of even distantly related protein homologies [47, 48]. One of the most sensitive methods for comparing two ..
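The first step described in this excerpt, locating open reading frames in an unspliced bacterial sequence, can be sketched as a scan of each reading frame for a start codon followed by an in-frame stop codon. This is a simplified illustration, not a method from the review: it assumes the standard stop codons, ATG as the only start codon, and the forward strand only. The function name `find_orfs` and the `min_codons` threshold are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) tuples for ATG...stop ORFs on the forward strand.

    `end` is exclusive and includes the stop codon; `min_codons` counts the
    codons before the stop, including the ATG.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # resume the search after this stop codon
    return orfs

# Hypothetical toy sequence: ATG AAA TTT TAG in frame 2.
example = "CCATGAAATTTTAGGG"
print(find_orfs(example))
```

A fuller version would also scan the reverse complement and allow alternative start codons; both are left out here to keep the sketch short.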

    Highly Specific Protein Sequence Motifs for Genome Analysis

    We present a novel method for discovering conserved sequence motifs from families of aligned protein sequences. The method has been implemented as a computer program called EMOTIF (http://motif.stanford.edu/emotif). Given an aligned set of protein sequences, EMOTIF generates a set of motifs with a wide range of specificities and sensitivities. EMOTIF can also generate motifs that describe possible subfamilies of a protein superfamily. A disjunction of such motifs can often represent the entire superfamily with high specificity and sensitivity. We have used EMOTIF to generate sets of motifs from all 7,000 protein alignments in the BLOCKS and PRINTS databases. The resulting database, called IDENTIFY (http://motif.stanford.edu/identify), contains over 50,000 motifs. For each alignment, the database contains several motifs whose probabilities of matching a false positive range from 10^-10 to 10^-5. Highly specific motifs are well suited for searching entire proteomes, while gen..
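The trade-off this abstract describes between specificity and sensitivity can be illustrated with a small sketch in which a motif is a list of allowed residue groups, one per position, and the chance of a random match is the product of the group frequencies. This is not the EMOTIF implementation. The motif representation, the uniform background of 1/20 per amino acid, and the names `matches` and `random_match_probability` are assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumption: uniform background frequencies; real proteomes are not uniform.
BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

def matches(motif, segment):
    """True if every residue falls in its position's allowed group (None = any)."""
    return len(motif) == len(segment) and all(
        group is None or residue in group
        for group, residue in zip(motif, segment)
    )

def random_match_probability(motif, background=BACKGROUND):
    """Probability that a random segment matches, assuming independent positions."""
    return math.prod(
        1.0 if group is None else sum(background[aa] for aa in group)
        for group in motif
    )

# Hypothetical motif: [ILV] at position 1, anything at position 2, G at position 3.
motif = [set("ILV"), None, set("G")]
print(matches(motif, "IAG"), matches(motif, "AAG"))
print(random_match_probability(motif))
```

Narrowing a group or replacing a wildcard with a fixed residue lowers the random-match probability, giving a more specific motif, but also risks excluding true family members, which lowers sensitivity.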

    Discovering Empirically Conserved Amino Acid Substitution Groups in Databases of Protein Families

    This paper introduces a method for identifying amino acid substitution groups that are conserved empirically in aligned positions from databases of protein families. Existing approaches view amino acid substitution as a pairwise phenomenon and characterize it using substitution matrices. In contrast, the method presented here identifies subsets of amino acids that are conserved empirically using a conditional distribution matrix, which contains entries for every combination of individual amino acids and subsets of amino acids. Each row in the conditional distribution matrix contains the distribution of amino acids in those aligned positions that contain a given subset of amino acids. The algorithm converts a database of protein families into a conditional distribution matrix and then examines each possible substitution group for evidence of conservation. A substitution group is empirically conserved when it has characteristics of compactness and isolation, meaning that am..
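One row of the conditional distribution matrix described in this abstract can be sketched as follows: select the aligned positions (columns) that contain a given subset of amino acids, pool the residues in those columns, and normalize the counts. This is a reading of the abstract, not the paper's algorithm. It assumes that a column "contains" a subset when every member of the subset appears in it; the toy columns and the function name `conditional_distribution` are hypothetical.

```python
from collections import Counter

def conditional_distribution(columns, subset):
    """Distribution of amino acids over the columns that contain all of `subset`."""
    subset = set(subset)
    counts = Counter()
    for column in columns:
        if subset <= set(column):  # assumption: every subset member must occur
            counts.update(column)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: n / total for aa, n in counts.items()}

# Hypothetical aligned positions, one string of residues per column.
columns = ["IIVL", "ILVV", "KKRR", "DDEE"]
print(conditional_distribution(columns, "IV"))
```

In this toy example the row for the subset {I, V} places all of its mass on I, V, and L and none on the charged residues, which is the kind of pattern the abstract's notions of compactness and isolation appear to refer to.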